# 02_generate_categories

## Notebook 2/4

## Gabriel del Valle
## 10/08/24
## NYC DATA SCIENCE ACADEMY


## The purpose of this project is to create a simplified context to apply content recommendation techniques in an interactive Shiny app.


### Try out the full interactive app! Read the Home page for project details and instructions

https://gabrielxdelvalle.shinyapps.io/algo_gallery/

### Read the full project details on the blog post:

https://nycdatascience.com/blog/student-works/clustering-artworks-by-ai-quantified-visual-qualities-content-recommendation-app/

### For any questions or inquiries about this project please feel free to reach out on LinkedIn:

www.linkedin.com/in/gabriel-del-valle-147616152


## This second notebook applies OpenAI's CLIP image classification model to generate image scores for each artwork, and stores them in a dataframe along with the art's descriptive information (such as title, date, etc.) and an image link




## Notebook Structure 

### 1. Load CLIP model and processor

### 2. Generate image scores for a set of categories, using function clip_scores( )

#### clip_scores( ) takes as inputs a list of image paths (to the locally stored image) and a list of categories (type string) and returns a dataframe with each category as a column, each image as a row, and the image scores as values

- This could also be made to work with image links using the io library


### 3. For comparison purposes, generate multiple sets of image scores with different combinations of categories using generate_clips( )

#### generate_clips takes as input a list of image paths, a list of strong categories and a list of weak categories. First it will generate scores for just the strong categories and label the dataframe "benchmark". Then it will generate each combination of a single weak category + all strong categories, and store the data in a dataframe labeled based on the weak quality. 

- the datatype output by generate_clips is a dictionary of dataframes. This can be saved using a pickle file, or one can loop through each dict value and export the individual dataframes as csv

### 4. To add image scores directly to the dataframe containing art details and image paths, generate_clips_gallery( ) takes as inputs a dataframe and a list of categories, and returns a new dataframe with the image scores as new columns


In [1]:
import torch
from transformers import CLIPProcessor, CLIPModel
from transformers import pipeline
from PIL import Image, UnidentifiedImageError
import pandas as pd
import requests
from io import BytesIO



## Load the model and processor for CLIP:

- The processor’s job is to prepare the data (image and text inputs) into the format required by the model. This includes resizing, normalizing, and converting images into tensor form and tokenizing the text prompts.

- CLIPProcessor automates several pre-processing steps to ensure consistency with the model’s training, handling aspects like image size normalization and text tokenization.

- Without the processor, one would need to manually implement these pre-processing steps to match the CLIP model’s expectations. For example, if the image dimensions or pixel values are not standardized, it would likely impact the model's performance and accuracy.
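
To make the pre-processing steps above concrete, here is a minimal sketch of what a hand-written image pipeline might look like. It is an approximation for illustration, not the processor's actual code: `manual_clip_preprocess` is a hypothetical helper, the 224-pixel size and the mean/std values are the ones published for the OpenAI CLIP checkpoints, and the demo image is synthetic.

```python
import numpy as np
from PIL import Image

# Per-channel normalization constants published with the OpenAI CLIP checkpoints
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def manual_clip_preprocess(image, size=224):
    """Approximate the image half of CLIPProcessor by hand (illustrative only)."""
    image = image.convert("RGB")

    # Resize so the shorter side equals `size`, keeping the aspect ratio
    w, h = image.size
    scale = size / min(w, h)
    new_w, new_h = max(size, round(w * scale)), max(size, round(h * scale))
    image = image.resize((new_w, new_h), Image.BICUBIC)

    # Center-crop to a size x size square
    left, top = (new_w - size) // 2, (new_h - size) // 2
    image = image.crop((left, top, left + size, top + size))

    # Rescale to [0, 1], normalize each channel, and put channels first
    pixels = np.asarray(image, dtype=np.float32) / 255.0
    pixels = (pixels - CLIP_MEAN) / CLIP_STD
    return pixels.transpose(2, 0, 1)

# Synthetic stand-in for an artwork image
demo = Image.new("RGB", (640, 480), color=(128, 64, 32))
pixel_values = manual_clip_preprocess(demo)
print(pixel_values.shape)
```

Letting `CLIPProcessor` handle this avoids subtle mismatches (interpolation mode, crop strategy, constants) and also covers the text tokenization that this sketch leaves out.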

In [2]:
# Load CLIP model and processor for handling both image and text
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32", force_download=True)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32", force_download=True)


## Process images with CLIP:

### If you have a CUDA-compatible GPU, transferring data and computations to the GPU can significantly speed up processing time. 
### Without this step, the code will run on the CPU, which may be much slower, especially for large batches of images.




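### Logits are the raw image-text similarity scores, one per category, and each is computed independently of the other categories
### Applying softmax converts these values into probabilities that sum to 1 across the categories, so each score depends on which other categories are in the list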
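
A small worked example may help show why the softmax scores depend on the category list. The logit values below are made up for illustration and do not come from CLIP; the `softmax` helper is a plain-Python stand-in for `tensor.softmax(dim=-1)`.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for three categories of a single image
logits = [24.0, 22.0, 20.0]
probs = softmax(logits)

# Adding a fourth category leaves the first three logits unchanged,
# but their probabilities shrink because the total is now shared four ways
probs_with_extra = softmax(logits + [23.0])

print(probs)
print(probs_with_extra)
```

This is the reason the notebook compares several category combinations: the same image can receive a noticeably different score for a category depending on what it is competing against.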

In [3]:
# Generate CLIP scores for each image against a list of text categories
def clip_scores(images, categories):
    # Move the model to the GPU once, if one is available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

    all_scores = []

    for ex_image in images:
        image = Image.open(ex_image)

        # Process the image and candidate labels, then move the tensors to the same device
        inputs = processor(text=categories, images=image, return_tensors="pt", padding=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}

        # Get the logits from the model
        with torch.no_grad():
            logits_per_image = model(**inputs).logits_per_image  # This gives the similarity score between the image and each label

        # Apply softmax across the categories to get probabilities
        probs = logits_per_image.softmax(dim=-1)

        # Collect scores for the image
        all_scores.append(probs[0].tolist())

    # Return as a pandas DataFrame: one row per image, one column per category
    return pd.DataFrame(all_scores, columns=categories)

In [4]:
# Function to evaluate different combinations of strong and weak qualities by generating CLIP scores
def generate_clips(images, strong_qualities, weak_qualities):
    # Initialize a dictionary to store the results
    results = {}

    # Generate and store benchmark scores for strong qualities alone
    print("Generating CLIP scores with only strong qualities...")
    data_strong = clip_scores(images, strong_qualities)
    results["benchmark"] = data_strong
    
    # Now test adding one weak quality at a time
    for weak_quality in weak_qualities:
        print(f"\nGenerating CLIP scores for strong qualities + weak quality: {weak_quality}")
        
        # Add the current weak quality to the list of strong qualities
        categories_to_test = strong_qualities + [weak_quality]
        
        # Generate the CLIP scores for the updated set of categories
        data = clip_scores(images, categories_to_test)
        
        # Store the data with the weak quality as the key
        results[weak_quality] = data

    return results
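
As a quick illustration of what `generate_clips` iterates over, the sketch below lists the category sets it would score without running the model. `category_sets` is a hypothetical helper, and the duplicate check is an added assumption that is not in the notebook: if a weak quality also appears in the strong list, the concatenated list would contain the same prompt twice and the resulting DataFrame would have two columns with the same name.

```python
def category_sets(strong, weak):
    """Return {label: categories}, mirroring the runs generate_clips performs."""
    sets = {"benchmark": list(strong)}
    for quality in weak:
        if quality in strong:
            # Hypothetical guard (not in the notebook): skip qualities that
            # would duplicate a prompt already in the strong list
            continue
        sets[quality] = list(strong) + [quality]
    return sets

# Shortened example lists
strong = ["Landscape", "Abstract", "Many_Colors"]
weak = ["Religious", "Many_Colors", "Minimalist"]

plan = category_sets(strong, weak)
for label, categories in plan.items():
    print(label, categories)
```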

In [6]:
modern_gallery = pd.read_csv("modern_gallery.csv")
len(modern_gallery)

974

In [8]:
image_paths = "modern_gallery_images2/" + modern_gallery['Image_path'].astype(str)

In [51]:
strong_qualities = [
    'Highly_Detailed',
    'Human_Subject',
    'Landscape',
    'Abstract',
    'Ornamental_Pattern',
    'Animal_Subject',
    'Architecture_Subject',
    'Many_Colors',
]

In [19]:
weak_qualities = [
    'Color_Complexity',
    'Color_Contrast',
    'Religious',
    'Plant_Subject',
    'Flower_Subject',
    'Many_Colors',
    'Minimalist',
]

In [20]:
trial = generate_clips(image_paths, strong_qualities, weak_qualities)

Generating CLIP scores with only strong qualities...

Generating CLIP scores for strong qualities + weak quality: Color_Complexity

Generating CLIP scores for strong qualities + weak quality: Color_Contrast

Generating CLIP scores for strong qualities + weak quality: Religious

Generating CLIP scores for strong qualities + weak quality: Plant_Subject

Generating CLIP scores for strong qualities + weak quality: Flower_Subject

Generating CLIP scores for strong qualities + weak quality: Many_Colors

Generating CLIP scores for strong qualities + weak quality: Minimalist


In [10]:
trial

{'benchmark':      Impressionist  Highly_Detailed  Human_Subject  Landscape  Abstract  \
 0         0.039867         0.004297       0.025354   0.103506  0.810500   
 1         0.223959         0.021343       0.269724   0.438612  0.013003   
 2         0.025285         0.032247       0.033692   0.118063  0.634435   
 3         0.027713         0.094089       0.117573   0.447257  0.047408   
 4         0.011274         0.042059       0.166050   0.200608  0.358758   
 ..             ...              ...            ...        ...       ...   
 969       0.168745         0.070212       0.352094   0.167473  0.080242   
 970       0.244714         0.028867       0.043010   0.445560  0.134066   
 971       0.359046         0.030887       0.146380   0.142607  0.253359   
 972       0.137914         0.015053       0.079349   0.334707  0.066345   
 973       0.011893         0.135747       0.157344   0.088316  0.129734   
 
      Ornamental_Pattern  Highly_Detailed  Animal_Subject  
 0           

### Example saving dict of dataframes to pickle:

In [11]:
import pickle

# Export the entire 'results' dictionary as a pickle file
with open('clip_scores_results.pkl', 'wb') as f:
    pickle.dump(trial, f)


In [2]:
import pickle

# Load the results from the pickle file
with open('clip_scores_results.pkl', 'rb') as f:
    trial = pickle.load(f)

### Example saving each dataframe in the dict to csv

In [21]:
for quality, df in trial.items():
    df.to_csv(f"{quality}.csv", index=False)
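
Once the dictionary of dataframes is available, one possible way to compare the runs is to measure how far each strong category's scores move when a weak category is added. The sketch below is only one option and uses toy numbers, not real CLIP output; `mean_shift` is a hypothetical helper.

```python
import pandas as pd

def mean_shift(results, baseline="benchmark"):
    """Mean absolute change per shared category, relative to the baseline run."""
    base = results[baseline]
    rows = {}
    for label, df in results.items():
        if label == baseline:
            continue
        # Only compare the categories present in both runs
        shared = [c for c in base.columns if c in df.columns]
        rows[label] = (df[shared] - base[shared]).abs().mean()
    # One row per run, one column per shared category
    return pd.DataFrame(rows).T

# Toy stand-in for the dict returned by generate_clips
toy = {
    "benchmark": pd.DataFrame({"Landscape": [0.7, 0.2], "Abstract": [0.3, 0.8]}),
    "Minimalist": pd.DataFrame(
        {"Landscape": [0.6, 0.1], "Abstract": [0.2, 0.5], "Minimalist": [0.2, 0.4]}
    ),
}

shift = mean_shift(toy)
print(shift)
```

A weak category that causes large shifts is drawing probability away from the strong categories, which is worth knowing before it is added to the final set.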

In [22]:
trial['benchmark']

Unnamed: 0,Impressionist,Highly_Detailed,Human_Subject,Landscape,Abstract,Ornamental_Pattern,Animal_Subject,Architecture_Subject
0,0.035234,0.003798,0.022408,0.091478,0.716316,0.005974,0.004790,0.120002
1,0.209611,0.019976,0.252444,0.410513,0.012170,0.001469,0.009776,0.084040
2,0.015963,0.020358,0.021270,0.074536,0.400537,0.070094,0.008211,0.389030
3,0.029268,0.099368,0.124170,0.472352,0.050068,0.046655,0.134861,0.043257
4,0.003487,0.013007,0.051353,0.062040,0.110950,0.011327,0.044089,0.703746
...,...,...,...,...,...,...,...,...
969,0.155922,0.064877,0.325337,0.154746,0.074144,0.019239,0.064866,0.140869
970,0.244311,0.028819,0.042939,0.444826,0.133845,0.029999,0.044792,0.030468
971,0.353970,0.030451,0.144310,0.140590,0.249777,0.003942,0.032371,0.044589
972,0.135979,0.014842,0.078236,0.330012,0.065415,0.016067,0.330583,0.028867


In [9]:
modern_gallery

Unnamed: 0,Title,Artist,Nationality,Lifespan,Image_URL,Image_path,Birth,Death,Year
0,Construction,László Moholy-Nagy,Hungarian,1895 - 1946,https://raw.githubusercontent.com/Corriande/al...,Construction (1924).jpg,1895,1946.0,1924.0
1,Ohne Titel; aus; ‘Die 150 Blätter’ VII,Karl Wiener,Austrian,1901-1949,https://raw.githubusercontent.com/Corriande/al...,Ohne Titel; aus; ‘Die 150 Blätter’ VII (1940).jpg,1901,1949.0,1940.0
2,Zeilboten op een werfhelling,Reijer Stolk,Dutch,1896 - 1945,https://raw.githubusercontent.com/Corriande/al...,Zeilboten op een werfhelling (1906).jpg,1896,1945.0,1906.0
3,Why,Karl Wiener,Austrian,1901-1949,https://raw.githubusercontent.com/Corriande/al...,Why (1940).jpg,1901,1949.0,1940.0
4,Machinery,Charles Demuth,American,1883-1935,https://raw.githubusercontent.com/Corriande/al...,Machinery (1920).jpg,1883,1935.0,1920.0
...,...,...,...,...,...,...,...,...,...
969,Periphery,Mikuláš Galanda,Slovak,1895 – 1938,https://raw.githubusercontent.com/Corriande/al...,Periphery (1924).jpg,1895,1938.0,1924.0
970,Váza s kyticí a broskve (Vase of flowers and p...,Emil Filla,Czech,1882-1953,https://raw.githubusercontent.com/Corriande/al...,Váza s kyticí a broskve (Vase of flowers and p...,1882,1953.0,1932.0
971,Self-Portrait,Ernst Ludwig Kirchner,German,1880-1938,https://raw.githubusercontent.com/Corriande/al...,Self-Portrait (1928).jpg,1880,1938.0,1928.0
972,Klänge Pl.13,Wassily Kandinsky,Russian,1866 - 1944,https://raw.githubusercontent.com/Corriande/al...,Klänge Pl.13 (1913).jpg,1866,1944.0,1913.0


In [4]:
def generate_clips_gallery(qualities, df):
    """Score each row's image (downloaded from 'Image_URL') against `qualities`
    and return a copy of `df` with one new score column per quality."""
    new_df = df.copy()

    # Create new columns in the DataFrame for each quality
    for quality in qualities:
        new_df[quality] = float("nan")  # Initialize with NaN so the columns stay numeric

    # Move the model to the GPU if available so it matches the inputs' device
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

    # Loop through each row in the DataFrame
    for index, row in df.iterrows():
        image_url = row['Image_URL']
        try:
            response = requests.get(image_url)
            response.raise_for_status()  # Treat HTTP errors (e.g. 404) as failed downloads
            image = Image.open(BytesIO(response.content))
            image.load()  # Force full decoding so truncated or corrupted files raise here

            # Process the image and text prompts
            inputs = processor(text=qualities, images=image, return_tensors="pt", padding=True)
            inputs = {k: v.to(device) for k, v in inputs.items()}

            # Get the logits from the model and convert them to probabilities
            with torch.no_grad():
                logits_per_image = model(**inputs).logits_per_image
                probs = logits_per_image.softmax(dim=-1)

            # Store the results in the DataFrame
            for i, quality in enumerate(qualities):
                new_df.at[index, quality] = probs[0][i].item()

        except (requests.exceptions.RequestException, UnidentifiedImageError, OSError):
            # If the image can't be downloaded or decoded, leave the qualities as NaN
            print(f"Image at {image_url} is corrupted or couldn't be loaded.")

    return new_df
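
Because rows whose image could not be loaded keep missing values in the score columns, it can be useful to separate them before exporting. The following is a minimal sketch on toy data; `split_scored` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def split_scored(gallery, qualities):
    """Split a scored gallery into rows with complete scores and rows without."""
    missing = gallery[qualities].isna().any(axis=1)
    return gallery[~missing], gallery[missing]

# Toy stand-in for the output of generate_clips_gallery
toy_gallery = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Landscape": [0.7, None, 0.4],
    "Abstract": [0.3, None, 0.6],
})

scored, failed = split_scored(toy_gallery, ["Landscape", "Abstract"])
print(failed["Title"].tolist())  # ['B']
```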

In [17]:
color_contrast_qualities = [
    'Highly_Detailed',
    'Impressionist',
    'Human_Subject',
    'Landscape',
    'Abstract',
    'Ornamental_Pattern',
    'Animal_Subject',
    'Architecture_Subject',
    'Color_Contrast',
]

In [18]:
color_contrast_gallery = generate_clips_gallery(color_contrast_qualities, modern_gallery)

In [19]:
color_contrast_gallery.to_csv("color_contrast_gallery.csv", index=False)

In [43]:
algo_gallery = generate_clips_gallery(strong_qualities, modern_gallery)

In [49]:
algo_gallery.to_csv("algo_gallery.csv", index=False)

In [47]:
colors_gallery = generate_clips_gallery(strong_qualities, modern_gallery)

In [50]:
colors_gallery.to_csv("colors_gallery.csv", index=False)

In [52]:
no_impressionist = generate_clips_gallery(strong_qualities, modern_gallery)

In [None]:
no_impressionist.to_csv("no_impressionist.csv", index=False)