### Introduction

This project evaluates the performance of the Google Cloud Vision API at image recognition, specifically at labeling the objects in a set of images. I tried the OpenAI API first, but I could not get it to work; searching for the problem turned up only other people reporting the same issue and no solution, so I decided to switch to Google Cloud Vision (the failed attempt is recorded at the end of this notebook). My dataset consists of 10 images, each depicting distinct objects or scenes, such as animals, vehicles, and landscapes. I wrote the 'ground truth' label for each picture based on human judgment. I hypothesize that the Vision API will accurately identify and classify the majority of the objects in the images, given its advanced machine learning models, but that its labels will not necessarily align perfectly with my 'ground truth' column, since an object can go by various names and there may be bias or confusion about which object is the main one. This analysis will provide insights into the API's capabilities and potential areas of improvement.


I have two ground-truth datasets: in one I labeled each image with a single word, and in the other I labeled each image with several different words. Below, I compare the Vision API results against each of them and analyze the accuracy. Ranging from a single descriptor to multiple descriptors is meant to capture the varied and complex nature of visual content.

To set up the Google Vision API virtual environment, I referenced:

https://www.youtube.com/watch?v=kZ3OL3AN_IA&list=PL3JVwFmb_BnRY8qaG5S1MhtlxTOgE-_k0&index=1

OpenAI Vision API Error:

https://github.com/OthersideAI/self-operating-computer/issues/26 


Some of the images were found on Google; others are my own photos. The dataset comprises diverse subjects, including animals, landscapes, and urban scenes, chosen to test the API's ability to accurately recognize and label a wide array of elements within images. The motivation behind using this dataset was to assess how well the API can handle the multifaceted aspects of real-world images, with a specific hypothesis that the API might underperform on images with ambiguous or overlapping categories (e.g., an image labeled both "park" and "grass"). By using a dataset with both single and multiple labels, I aimed to measure the API's precision not just in identifying primary objects but also in capturing the broader context and secondary elements present in the images, giving a more complete evaluation of its capabilities and limitations.

Image resources:

4: https://www.vanorohotel.com/wp-content/uploads/2021/07/drz-vanoro_6737.jpg

5: https://thumbs.dreamstime.com/b/couple-dating-hugging-love-park-urban-sunny-day-51723735.jpg

6: https://thumbs.dreamstime.com/b/couple-dating-hugging-love-park-urban-sunny-day-51723735.jpg

7: https://climate.nasa.gov/system/internal_resources/details/original/309_ImageWall5_768px-60.jpg 

8: https://images.pexels.com/photos/5847385/pexels-photo-5847385.jpeg?cs=srgb&dl=pexels-charles-parker-5847385.jpg&fm=jpg 

9: https://upload.wikimedia.org/wikipedia/commons/thumb/5/56/R160_E_enters_42nd_Street.jpg/300px-R160_E_enters_42nd_Street.jpg

10: https://upload.wikimedia.org/wikipedia/commons/3/35/Neckertal_20150527-6384.jpg 

In [65]:
# Install the required packages (uncomment to run)
# pip install google-cloud-vision pandas

In [66]:
import io
import os
import pandas as pd
from google.cloud import vision
from google_vision_ai import VisionAI, prepare_image_local

In [67]:
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'client_file_ai_vision_demo.json'
client = vision.ImageAnnotatorClient()

In [68]:
image_data = pd.read_csv('dataset.csv')
image_data

Unnamed: 0,filename,label
0,img1.jpg,cat
1,img2.jpg,sunflower
2,img3.jpg,lighthouse
3,img4.jpg,bedroom
4,img5.jpg,car
5,img6.jpg,couple
6,img7.jpg,earth
7,img8.jpg,street
8,img9.jpg,subway
9,img10.jpg,landscape


In [69]:
api_responses = []

# Iterate over each row in the DataFrame to call the API and store the response
for index, row in image_data.iterrows():
    image_path = f'./image/{row["filename"]}'
    image = prepare_image_local(image_path)
    va = VisionAI(client, image)
    label_detections = va.label_detection()
    api_responses.append((row['filename'], label_detections))

api_responses_df = pd.DataFrame(api_responses, columns=['Image', 'API_Labels'])

In [70]:
api_responses_df

Unnamed: 0,Image,API_Labels
0,img1.jpg,"[(Cat, 0.95), (Felidae, 0.86), (Small to mediu..."
1,img2.jpg,"[(Flower, 0.98), (Plant, 0.95), (Sky, 0.91), (..."
2,img3.jpg,"[(Sky, 0.97), (Lighthouse, 0.96), (Building, 0..."
3,img4.jpg,"[(Furniture, 0.96), (Building, 0.93), (Picture..."
4,img5.jpg,"[(Wheel, 0.98), (Tire, 0.97), (Vehicle, 0.96),..."
5,img6.jpg,"[(Smile, 0.98), (Flash photography, 0.88), (Pe..."
6,img7.jpg,"[(Atmosphere, 0.95), (World, 0.92), (Astronomi..."
7,img8.jpg,"[(Building, 0.96), (Sky, 0.96), (Skyscraper, 0..."
8,img9.jpg,"[(Train, 0.98), (Transport hub, 0.88), (Mode o..."
9,img10.jpg,"[(Cloud, 0.97), (Plant, 0.96), (Mountain, 0.96..."


In [71]:
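# Copy the label list; fall back to an empty list so later iteration is safe when no labels are returned
api_responses_df['API_Label'] = api_responses_df['API_Labels'].apply(lambda x: x if x else [])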

In [72]:
api_responses_df

Unnamed: 0,Image,API_Labels,API_Label
0,img1.jpg,"[(Cat, 0.95), (Felidae, 0.86), (Small to mediu...","[(Cat, 0.95), (Felidae, 0.86), (Small to mediu..."
1,img2.jpg,"[(Flower, 0.98), (Plant, 0.95), (Sky, 0.91), (...","[(Flower, 0.98), (Plant, 0.95), (Sky, 0.91), (..."
2,img3.jpg,"[(Sky, 0.97), (Lighthouse, 0.96), (Building, 0...","[(Sky, 0.97), (Lighthouse, 0.96), (Building, 0..."
3,img4.jpg,"[(Furniture, 0.96), (Building, 0.93), (Picture...","[(Furniture, 0.96), (Building, 0.93), (Picture..."
4,img5.jpg,"[(Wheel, 0.98), (Tire, 0.97), (Vehicle, 0.96),...","[(Wheel, 0.98), (Tire, 0.97), (Vehicle, 0.96),..."
5,img6.jpg,"[(Smile, 0.98), (Flash photography, 0.88), (Pe...","[(Smile, 0.98), (Flash photography, 0.88), (Pe..."
6,img7.jpg,"[(Atmosphere, 0.95), (World, 0.92), (Astronomi...","[(Atmosphere, 0.95), (World, 0.92), (Astronomi..."
7,img8.jpg,"[(Building, 0.96), (Sky, 0.96), (Skyscraper, 0...","[(Building, 0.96), (Sky, 0.96), (Skyscraper, 0..."
8,img9.jpg,"[(Train, 0.98), (Transport hub, 0.88), (Mode o...","[(Train, 0.98), (Transport hub, 0.88), (Mode o..."
9,img10.jpg,"[(Cloud, 0.97), (Plant, 0.96), (Mountain, 0.96...","[(Cloud, 0.97), (Plant, 0.96), (Mountain, 0.96..."


In [73]:
merged_df = pd.merge(image_data, api_responses_df, left_on='filename', right_on='Image')

In [74]:
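def label_comparison(row):
    # Ground truth may hold several comma-separated labels; normalize case and whitespace
    ground_truth_labels = [label.strip().lower() for label in row['label'].split(',')]
    # API_Label is a list of (description, score) tuples; keep the lowercased descriptions
    api_labels = [label.lower() for label, _ in row['API_Label']]
    # Count a match if any ground-truth label exactly equals any API label
    match = any(gt_label in api_labels for gt_label in ground_truth_labels)

    return 'Match' if match else 'Mismatch'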


#### Comparing to single ground truth labels 

In [75]:
merged_df['Comparison'] = merged_df.apply(label_comparison, axis=1)
accuracy = len(merged_df[merged_df['Comparison'] == 'Match']) / len(merged_df)
print(f'Accuracy: {accuracy:.2f}')
print(merged_df[['filename', 'label', 'API_Label', 'Comparison']])


Accuracy: 0.30
    filename       label                                          API_Label  \
0   img1.jpg         cat  [(Cat, 0.95), (Felidae, 0.86), (Small to mediu...   
1   img2.jpg   sunflower  [(Flower, 0.98), (Plant, 0.95), (Sky, 0.91), (...   
2   img3.jpg  lighthouse  [(Sky, 0.97), (Lighthouse, 0.96), (Building, 0...   
3   img4.jpg     bedroom  [(Furniture, 0.96), (Building, 0.93), (Picture...   
4   img5.jpg         car  [(Wheel, 0.98), (Tire, 0.97), (Vehicle, 0.96),...   
5   img6.jpg      couple  [(Smile, 0.98), (Flash photography, 0.88), (Pe...   
6   img7.jpg       earth  [(Atmosphere, 0.95), (World, 0.92), (Astronomi...   
7   img8.jpg      street  [(Building, 0.96), (Sky, 0.96), (Skyscraper, 0...   
8   img9.jpg      subway  [(Train, 0.98), (Transport hub, 0.88), (Mode o...   
9  img10.jpg   landscape  [(Cloud, 0.97), (Plant, 0.96), (Mountain, 0.96...   

  Comparison  
0      Match  
1   Mismatch  
2      Match  
3   Mismatch  
4      Match  
5   Mismatch  
6   Mismat

The single-descriptor match accuracy is 0.30 (3 of 10 images).
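The low score is partly a consequence of the matching rule, which requires a ground-truth word to equal an API label exactly. The sketch below illustrates this with only the labels visible in the truncated output for img2.jpg. `exact_match` mirrors the rule used in `label_comparison`, while `loose_match` is a hypothetical alternative (substring containment) that this notebook does not use.

```python
def exact_match(ground_truth, api_labels):
    """Rule used in this notebook: the ground-truth word must equal an API label."""
    api = [label.lower() for label, _ in api_labels]
    return ground_truth.strip().lower() in api


def loose_match(ground_truth, api_labels):
    """Hypothetical looser rule: either string contains the other."""
    gt = ground_truth.strip().lower()
    api = [label.lower() for label, _ in api_labels]
    return any(a in gt or gt in a for a in api)


# Only the labels visible in the truncated output for img2.jpg.
img2_labels = [("Flower", 0.98), ("Plant", 0.95), ("Sky", 0.91)]

print(exact_match("sunflower", img2_labels))  # False
print(loose_match("sunflower", img2_labels))  # True: "flower" is inside "sunflower"
```

A substring rule is more forgiving, but it can also produce spurious matches (for example, "car" is contained in "carpet"), so it would need its own sanity checks before being used for scoring.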

#### Now Comparing to multiple ground truth labels 

In [76]:
image_df = pd.read_csv('dataset1.csv')
image_df

Unnamed: 0,filename,label
0,img1.jpg,"cat,floor,cable"
1,img2.jpg,"sunflower,window,bridge,building,leaf"
2,img3.jpg,"lighthouse,house,eky,ground,hill,people"
3,img4.jpg,"room,bed,bedroom,couch,TV,desk"
4,img5.jpg,"car,vehicle,sportscar"
5,img6.jpg,"couple,smile,man,woman,coat"
6,img7.jpg,"earth,planet,globe"
7,img8.jpg,"street,building,crossing,traffic light"
8,img9.jpg,"subway,train,station,people"
9,img10.jpg,"landscape,tree,house,mountain,grassland"


In [77]:
image_df['label'] = image_df['label'].apply(lambda x: [label.strip().lower() for label in x.split(',')] if isinstance(x, str) else x)
image_df

Unnamed: 0,filename,label
0,img1.jpg,"[cat, floor, cable]"
1,img2.jpg,"[sunflower, window, bridge, building, leaf]"
2,img3.jpg,"[lighthouse, house, eky, ground, hill, people]"
3,img4.jpg,"[room, bed, bedroom, couch, tv, desk]"
4,img5.jpg,"[car, vehicle, sportscar]"
5,img6.jpg,"[couple, smile, man, woman, coat]"
6,img7.jpg,"[earth, planet, globe]"
7,img8.jpg,"[street, building, crossing, traffic light]"
8,img9.jpg,"[subway, train, station, people]"
9,img10.jpg,"[landscape, tree, house, mountain, grassland]"


In [115]:
merged_df1 = pd.merge(image_df, api_responses_df, left_on='filename', right_on='Image')

In [79]:
def label_comparison(row):
    ground_truth_labels = row['label']  # Already a list from previous preprocessing
    api_labels = [label.lower() for label, _ in row['API_Label']]
    
    match = any(gt_label.lower() in api_labels for gt_label in ground_truth_labels)

    return 'Match' if match else 'Mismatch'


In [81]:
merged_df1['Comparison'] = merged_df1.apply(label_comparison, axis=1)

# Calculate accuracy based on matches
accuracy = len(merged_df1[merged_df1['Comparison'] == 'Match']) / len(merged_df1)
print(f'Accuracy: {accuracy:.2f}')
print(merged_df1[['filename', 'label', 'API_Label', 'Comparison']])

Accuracy: 0.90
    filename                                           label  \
0   img1.jpg                             [cat, floor, cable]   
1   img2.jpg     [sunflower, window, bridge, building, leaf]   
2   img3.jpg  [lighthouse, house, eky, ground, hill, people]   
3   img4.jpg           [room, bed, bedroom, couch, tv, desk]   
4   img5.jpg                       [car, vehicle, sportscar]   
5   img6.jpg               [couple, smile, man, woman, coat]   
6   img7.jpg                          [earth, planet, globe]   
7   img8.jpg     [street, building, crossing, traffic light]   
8   img9.jpg                [subway, train, station, people]   
9  img10.jpg   [landscape, tree, house, mountain, grassland]   

                                           API_Label Comparison  
0  [(Cat, 0.95), (Felidae, 0.86), (Small to mediu...      Match  
1  [(Flower, 0.98), (Plant, 0.95), (Sky, 0.91), (...      Match  
2  [(Sky, 0.97), (Lighthouse, 0.96), (Building, 0...      Match  
3  [(Furniture, 

In [84]:
merged_df1

Unnamed: 0,filename,label,Image,API_Labels,API_Label,Comparison
0,img1.jpg,"[cat, floor, cable]",img1.jpg,"[(Cat, 0.95), (Felidae, 0.86), (Small to mediu...","[(Cat, 0.95), (Felidae, 0.86), (Small to mediu...",Match
1,img2.jpg,"[sunflower, window, bridge, building, leaf]",img2.jpg,"[(Flower, 0.98), (Plant, 0.95), (Sky, 0.91), (...","[(Flower, 0.98), (Plant, 0.95), (Sky, 0.91), (...",Match
2,img3.jpg,"[lighthouse, house, eky, ground, hill, people]",img3.jpg,"[(Sky, 0.97), (Lighthouse, 0.96), (Building, 0...","[(Sky, 0.97), (Lighthouse, 0.96), (Building, 0...",Match
3,img4.jpg,"[room, bed, bedroom, couch, tv, desk]",img4.jpg,"[(Furniture, 0.96), (Building, 0.93), (Picture...","[(Furniture, 0.96), (Building, 0.93), (Picture...",Mismatch
4,img5.jpg,"[car, vehicle, sportscar]",img5.jpg,"[(Wheel, 0.98), (Tire, 0.97), (Vehicle, 0.96),...","[(Wheel, 0.98), (Tire, 0.97), (Vehicle, 0.96),...",Match
5,img6.jpg,"[couple, smile, man, woman, coat]",img6.jpg,"[(Smile, 0.98), (Flash photography, 0.88), (Pe...","[(Smile, 0.98), (Flash photography, 0.88), (Pe...",Match
6,img7.jpg,"[earth, planet, globe]",img7.jpg,"[(Atmosphere, 0.95), (World, 0.92), (Astronomi...","[(Atmosphere, 0.95), (World, 0.92), (Astronomi...",Match
7,img8.jpg,"[street, building, crossing, traffic light]",img8.jpg,"[(Building, 0.96), (Sky, 0.96), (Skyscraper, 0...","[(Building, 0.96), (Sky, 0.96), (Skyscraper, 0...",Match
8,img9.jpg,"[subway, train, station, people]",img9.jpg,"[(Train, 0.98), (Transport hub, 0.88), (Mode o...","[(Train, 0.98), (Transport hub, 0.88), (Mode o...",Match
9,img10.jpg,"[landscape, tree, house, mountain, grassland]",img10.jpg,"[(Cloud, 0.97), (Plant, 0.96), (Mountain, 0.96...","[(Cloud, 0.97), (Plant, 0.96), (Mountain, 0.96...",Match


The multiple-descriptor match accuracy is 0.90 (9 of 10 images).

With a single label per image, the accuracy was 30%, which highlights how hard it is to match the API's varied predictions against one fixed descriptor. When multiple labels were accepted for each image, accuracy rose to 90%. Part of this gain is mechanical, since an image now counts as a match if any one of its several ground-truth labels appears in the API output, but it also reflects the API's strength in recognizing a wide range of elements within an image. The large gap between the two approaches shows how much the measured accuracy depends on the definition of a "correct" outcome. Allowing a range of acceptable labels gives a fairer picture of what image recognition APIs like Google Vision can do, especially in applications that require a nuanced understanding of visual data.
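The effect of accepting more labels can be sketched on a toy subset. The function name `any_match_accuracy` and the dictionary layout are assumptions made for illustration; the predictions contain only the label names visible in the truncated outputs for three of the images, so the numbers printed here describe this subset and not the full dataset.

```python
def any_match_accuracy(ground_truth, predictions):
    """Fraction of images where at least one acceptable label appears in the predictions."""
    hits = 0
    for filename, acceptable in ground_truth.items():
        predicted = {p.lower() for p in predictions.get(filename, [])}
        if any(label.lower() in predicted for label in acceptable):
            hits += 1
    return hits / len(ground_truth)


# Toy subset: only label names visible in the truncated outputs.
predictions = {
    "img1.jpg": ["Cat", "Felidae"],
    "img6.jpg": ["Smile", "Flash photography"],
    "img9.jpg": ["Train", "Transport hub"],
}
single = {"img1.jpg": ["cat"], "img6.jpg": ["couple"], "img9.jpg": ["subway"]}
multi = {
    "img1.jpg": ["cat", "floor", "cable"],
    "img6.jpg": ["couple", "smile", "man", "woman", "coat"],
    "img9.jpg": ["subway", "train", "station", "people"],
}

print(f"single: {any_match_accuracy(single, predictions):.2f}")  # single: 0.33
print(f"multi:  {any_match_accuracy(multi, predictions):.2f}")   # multi:  1.00
```

Because one hit among several acceptable labels is enough, this score can only stay the same or rise as labels are added to the ground truth.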

#### Evaluate by Other Metrics

Reference: https://www.linkedin.com/advice/3/what-best-practices-evaluating-performance-deep#:~:text=The%20first%20step%20in%20evaluating,%2Dscore%2C%20and%20confusion%20matrix.

In [124]:
import pandas as pd
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.preprocessing import MultiLabelBinarizer

In [126]:
def identify_matches(row):
    predicted_labels = set(label[0].lower() for label in row['API_Labels'])
    ground_truth_labels = set(label.lower() for label in row['label'])
    matches = ground_truth_labels.intersection(predicted_labels)
    return list(matches)

# Apply the function
merged_df1['Matches'] = merged_df1.apply(identify_matches, axis=1)


In [127]:
# Create the binarizer, then fit it on the ground truth and transform the matches
mlb_matches = MultiLabelBinarizer()
y_true_matches = mlb_matches.fit_transform(merged_df1['label'])  # Ground truth
y_pred_matches = mlb_matches.transform(merged_df1['Matches'])    # Predicted matches

# Calculate metrics based on the binary format of matches
precision = precision_score(y_true_matches, y_pred_matches, average='micro')
recall = recall_score(y_true_matches, y_pred_matches, average='micro')
f1 = f1_score(y_true_matches, y_pred_matches, average='micro')

print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F1 Score: {f1:.2f}')

Precision: 1.00
Recall: 0.27
F1 Score: 0.43


High Precision (1.00): Precision is 1.00 here by construction. The predicted set for each image is `Matches`, the intersection of the API labels with the ground-truth labels, so every "predicted" label is a true label and there can be no false positives. This value confirms that the matched labels are correct, but it does not measure how many of the API's other labels fall outside my ground truth. The 0.90 match accuracy above is the better support for my hypothesis at the beginning that in most cases the model can predict a correct label.

Lower Recall (0.27): The API output matched only about 27% of the ground-truth labels (roughly 12 of the 44 labels across the 10 images), so there are many false negatives. This is expected for images with many ground-truth labels, because the Vision API result rarely covers most of them. Adding more descriptors to the ground truth gives the model more chances to produce at least one match, which raises the simplified accuracy seen above, but it also adds more labels that can be missed, which raises the number of false negatives. This points to a limitation in how the model detects and interprets objects beyond the most prominent one in the image, and to differences between its vocabulary and human descriptions.

F1 Score (0.43): The F1 score is the harmonic mean of precision and recall. A score of 0.43 suggests that the overall effectiveness of the model is moderate: precision is high, but the low recall pulls the F1 score down.
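To make the three metrics concrete, here is a minimal pure-Python sketch of micro-averaged precision, recall, and F1 over label sets. The helper name `micro_scores` and the two-image toy data (taken from labels visible in the outputs for img1.jpg and img6.jpg) are assumptions for illustration. It scores the same ground truth twice: once against the matched labels only, as in the cell above, and once against the full API label set, which is one possible alternative.

```python
def micro_scores(true_sets, pred_sets):
    """Micro-averaged precision, recall, and F1 over per-image label sets."""
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))  # correct predictions
    fp = sum(len(p - t) for t, p in zip(true_sets, pred_sets))  # predicted but not true
    fn = sum(len(t - p) for t, p in zip(true_sets, pred_sets))  # true but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data: ground truth and visible API labels for two images.
truth = [{"cat", "floor", "cable"}, {"couple", "smile", "man", "woman", "coat"}]
api = [{"cat", "felidae"}, {"smile", "flash photography"}]
matches = [t & a for t, a in zip(truth, api)]  # same idea as the 'Matches' column

p, r, f = micro_scores(truth, matches)
print(f"{p:.2f} {r:.2f} {f:.2f}")  # 1.00 0.25 0.40

p, r, f = micro_scores(truth, api)
print(f"{p:.2f} {r:.2f} {f:.2f}")  # 0.50 0.25 0.33
```

Scoring against the matches always yields a precision of 1.00, while scoring against the full API output also counts labels such as "felidae" as false positives. Recall is the same in both cases because the set of correctly matched labels does not change.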

### Reflection

The results above demonstrate the API's strength in identifying diverse elements within images and also reveal its limitations in fully matching the human ability to discern and label multiple aspects of visual content. The substantial difference in accuracy between the single-label and multi-label approaches emphasizes the importance of flexible, context-aware definitions of "correct" outcomes in AI applications. It suggests that for applications requiring a nuanced understanding of visual data, such as content categorization, surveillance, or accessibility aids, a multi-label approach can significantly enhance the utility and measured accuracy of image recognition APIs.

In conclusion, this experiment supports the hypothesis that the API can correctly predict labels in many cases, and it also shows the challenges the API faces in comprehensively identifying all relevant labels. The performance of the Google Vision API, as shown by the match accuracy and the calculated metrics, affirms the critical role of context and label breadth in evaluating and improving image recognition technologies. This analysis argues for a nuanced application of AI in image recognition, one that takes the richness of visual content into account and so aligns technological capabilities more closely with human perception and understanding.

### Failed Attempt to Call the OpenAI Vision API

In [82]:
import os

from openai import OpenAI

# Read the key from the environment rather than hard-coding a secret in the notebook
api_key = os.environ['OPENAI_API_KEY']
client = OpenAI(api_key=api_key)

response = client.chat.completions.create(
  model="gpt-4-0314",  # note: a text-only snapshot; image input needs a vision-capable model
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What’s in this image?"},
        {
          "type": "image_url",
          "image_url": {
            "url": 'https://cdn.britannica.com/79/232779-050-6B0411D7/German-Shepherd-dog-Alsatian.jpg',
          },
        },
      ],
    }
  ],
  max_tokens=300,
)

print(response.choices[0])

ModuleNotFoundError: No module named 'openai'
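The `ModuleNotFoundError` above means the `openai` package was not installed in the environment the notebook kernel was running in, so the request was never sent. A minimal sketch of a check that could run before the import is shown here; the helper name `is_installed` is an assumption for illustration.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be imported in the current environment."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("openai"):
    print("The 'openai' package is missing; install it with: pip install openai")
```

In a notebook, running `%pip install openai` in a cell installs the package into the same environment as the running kernel, which avoids installing it into a different Python than the one the notebook uses.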