## Portfolio 1 - Building a simple image search algorithm

*By Sofie Mosegaard, 01-03-2024*

In this assignment I will create a simple image search algorithm using ```OpenCV```. The assignment will include the following:

1. Define a particular image that you want to work with
2. For that image
    - Extract the colour histogram
3. Extract colour histograms for all of the **other** images in the data
4. Compare the histogram of the chosen image to all of the other histograms
    - For this, use the ```cv2.compareHist()``` function with the ```cv2.HISTCMP_CHISQR``` metric
5. Find the five images which are most similar to the target image
    - Save a CSV file to the folder called ```out```, showing the five most similar images and the distance metric
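
To make the distance metric in step 4 concrete: OpenCV documents the chi-square comparison as d(H1, H2) = Σ (H1 − H2)² / H1, summed over all bins. The block below is a minimal NumPy sketch of that formula, not OpenCV's own implementation. The function name ```chi_square_distance``` and the toy histograms are hypothetical, and skipping bins where the reference histogram is empty is an assumption made here to avoid division by zero.

```python
import numpy as np

def chi_square_distance(h1, h2, eps=np.finfo(np.float64).eps):
    """Chi-square distance: sum((h1 - h2)^2 / h1) over bins where h1 is non-zero."""
    h1 = np.asarray(h1, dtype=np.float64).ravel()
    h2 = np.asarray(h2, dtype=np.float64).ravel()
    mask = np.abs(h1) > eps  # assumption: bins that are empty in h1 are skipped
    return float(np.sum((h1[mask] - h2[mask]) ** 2 / h1[mask]))

# Toy histograms (made-up values, not taken from the flower data)
target = np.array([0.0, 0.5, 1.0, 0.25])
other = np.array([0.2, 0.5, 0.5, 0.75])

print(chi_square_distance(target, target))  # identical histograms give 0.0
print(chi_square_distance(target, other))   # larger value = less similar
print(chi_square_distance(other, target))   # argument order matters
```

Because the denominator only uses the first histogram, the measure is not symmetric, so the target histogram should always be passed in the same position. A distance of 0 means the two histograms are identical.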


### Import libraries

In [None]:
# sudo apt-get update
# sudo apt-get install -y python3-opencv
# pip install opencv-python matplotlib

In [1]:
import os
import glob
import sys
import pandas as pd

sys.path.append("..")

# Image processing and numerical tools
import cv2 # openCV
import numpy as np

# class utils functions
from utils.imutils import jimshow as show
from utils.imutils import jimshow_channel as show_channel
import matplotlib.pyplot as plt

### Extract one color histogram for comparison

First, I will extract a single 3D colour histogram covering all three colour channels of one particular image (the target image). The histogram is then normalised with min-max scaling, so that all bin values lie between 0 and 1:


In [4]:
filepath_f1 = os.path.join("..",
                            "..",
                            "..",
                            "..",
                            "..",
                            "cds-vis-data",
                            "flowers", 
                            "image_0001.jpg")

image_f1 = cv2.imread(filepath_f1)

In [5]:
hist_f1 = cv2.calcHist([image_f1], [0,1,2], None, [255,255,255], [0,256, 0,256,0,256])
norm_hist_f1 = cv2.normalize(hist_f1, hist_f1, 0, 1.0, cv2.NORM_MINMAX)
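
As an illustration of what min-max normalisation does to the bin counts, here is a small NumPy sketch. It is not the code OpenCV runs internally. The helper name ```min_max_normalise``` and the example counts are hypothetical, and returning the lower bound for a constant input is an assumption made only for this sketch.

```python
import numpy as np

def min_max_normalise(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    values = np.asarray(values, dtype=np.float64)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # Assumption for this sketch: a constant input maps to new_min everywhere
        return np.full_like(values, new_min)
    return (values - lo) / (hi - lo) * (new_max - new_min) + new_min

# Made-up bin counts for a tiny four-bin histogram
counts = np.array([0, 10, 40, 20])
scaled = min_max_normalise(counts)
print(scaled)
```

Rescaling every histogram to the same range makes images of different sizes comparable, since raw bin counts grow with the number of pixels.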

### Extract colour histograms for all other images and compare


In [6]:
filepath = os.path.join("..",
                        "..",
                        "..",
                        "..",
                        "..",
                        "cds-vis-data",
                        "flowers")

In [18]:
# Define a function that updates the df with new dist values
def update_distance_df(filename, distance):
    distance_df.loc[len(distance_df.index)] = [filename, distance] # .loc is used to access rows and columns by label(s)

In [24]:
# Initialize a pandas dataframe with specified column names
distance_df = pd.DataFrame(columns=("Filename", "Distance"))

# Loop through all images in the sorted order
for file in sorted(os.listdir(filepath)):
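    if file.endswith(".jpg"):  # only read image files; the target image stays in the loop (distance 0.0), hence the 5 + 1 rows below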
   
        individual_filepath = os.path.join(filepath, file)
        image = cv2.imread(individual_filepath)
        image_name = file.split(".jpg")[0]

        # Extract color hist
        hist = cv2.calcHist([image], [0,1,2], None, [255,255,255], [0,256, 0,256,0,256])

        # Normalise hist
        norm_hist = cv2.normalize(hist, hist, 0, 1.0, cv2.NORM_MINMAX)

        # Compare the hist of flower image 1 and the current image 
        dist = round(cv2.compareHist(norm_hist_f1, norm_hist, cv2.HISTCMP_CHISQR), 3) # 3 decimals

        # First, the names and dist values of the first images are appended to distance_df.
        # Once the table is full, the dist of the current image is compared with the largest
        # dist in distance_df (i.e. the image least similar to the target image, image_f1).
        # If the current image has a smaller dist, it replaces that row in the df.

        if len(distance_df) < 6: 
            # If there are fewer than 6 rows (5 + 1 target) in the table --> append image_name and dist to the df
            update_distance_df(image_name, dist)
        else:
            # Find image with highest dist in df - so the one that is least similar to the target image
            max_dist_idx = distance_df['Distance'].idxmax() # .idxmax() returns the index with the maximum value
            max_dist = distance_df.loc[max_dist_idx, 'Distance']

            # If the dist of the current image is smaller than the highest dist in df, then update
            if dist < max_dist:
                distance_df.at[max_dist_idx, 'Filename'] = image_name # Update 'Filename' column at row idx with max dist
                distance_df.at[max_dist_idx, 'Distance'] = dist # Update 'Distance' column at row idx with max dist

# Save the table as a .csv file once, after the loop has finished
csv_outpath = os.path.join("..", "out", "output.csv")
distance_df.to_csv(csv_outpath)

distance_df    

| | Filename | Distance |
|---|---|---|
| 0 | image_0001 | 0.0 |
| 1 | image_0928 | 178.124 |
| 2 | image_0773 | 190.081 |
| 3 | image_0142 | 190.209 |
| 4 | image_0876 | 188.548 |
| 5 | image_1316 | 190.222 |
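
The "keep only the best few" bookkeeping done in the loop above can also be sketched with the standard library. The block below is an illustration rather than the method used in this notebook: the ```(filename, distance)``` pairs are made up, and ```k_most_similar``` is a hypothetical helper name.

```python
import heapq

# Hypothetical (filename, distance) pairs standing in for the real results
distances = [
    ("image_a", 12.5),
    ("image_b", 0.0),
    ("image_c", 7.25),
    ("image_d", 30.0),
    ("image_e", 3.5),
    ("image_f", 18.0),
    ("image_g", 9.0),
]

def k_most_similar(pairs, k):
    """Return the k pairs with the smallest distance, sorted from most to least similar."""
    return heapq.nsmallest(k, pairs, key=lambda pair: pair[1])

top3 = k_most_similar(distances, 3)
for name, dist in top3:
    print(name, dist)
```

Unlike the row replacement in the loop, this returns the results already sorted by distance, which makes the table easier to read.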


### Alternative method

Alternatively, one could append all calculated distances to ```distance_df``` and, at the end, extract the six rows with the smallest distance values using the ```nsmallest``` method:

In [None]:
distance_df = pd.DataFrame(columns=("Filename", "Distance"))

for file in sorted(os.listdir(filepath)):
    individual_filepath = os.path.join(filepath, file) 
    image = cv2.imread(individual_filepath) 
    filename = file.split(".jpg")[0]

    hist = cv2.calcHist([image], [0,1,2], None, [255,255,255], [0,256, 0,256,0,256])
    norm_hist = cv2.normalize(hist, hist, 0, 1.0, cv2.NORM_MINMAX) 
    
    # Compare the hist of flower image 1 and the current image 
    dist = round(cv2.compareHist(norm_hist_f1, norm_hist, cv2.HISTCMP_CHISQR), 3) # 3 decimals
    
    # Append the filename and distance value 
    row = [filename, dist]
    distance_df.loc[len(distance_df)] = row

# Extract 6 rows with the smallest distance value (5 + 1 target (= image_f1 ))
final_df = distance_df.nsmallest(6, ["Distance"])

csv_outpath = os.path.join("..", "out", "output.csv")
final_df.to_csv(csv_outpath)


In [41]:
final_df

| | Filename | Distance |
|---|---|---|
| 0 | image_0001 | 0.0 |
| 927 | image_0928 | 178.124 |
| 875 | image_0876 | 188.548 |
| 772 | image_0773 | 190.081 |
| 141 | image_0142 | 190.209 |
| 1315 | image_1316 | 190.222 |
