## Portfolio 1

### Building a simple image search algorithm

*By Sofie Mosegaard, 01-03-2024*

For this assignment, you'll be using ```OpenCV``` to design a simple image search algorithm.

For this exercise, you should write some code which does the following:

1. Define a particular image that you want to work with
2. For that image
    - Extract the colour histogram using ```OpenCV```
3. Extract colour histograms for all of the **other* images in the data
4. Compare the histogram of our chosen image to all of the other histograms 
      -   For this, use the ```cv2.compareHist()``` function with the ```cv2.HISTCMP_CHISQR``` metric
5. Find the five images which are most simlar to the target image
      -  Save a CSV file to the folder called ```out```, showing the five most similar images and the distance metric:


### Import libraries

In [None]:
# sudo apt-get update
# sudo apt-get install -y python3-opencv
# pip install opencv-python matplotlib

In [1]:
import os
import glob
import sys
import pandas as pd

sys.path.append("..")

# Image processing and numerical tools
import cv2 # openCV
import numpy as np

# class utils functions
from utils.imutils import jimshow as show
from utils.imutils import jimshow_channel as show_channel
import matplotlib.pyplot as plt

#### Define a particular image that you want to work with and extract a color histogram

In [4]:
# First, load in one particular flower image from the dataset "flowers"

filepath_f1 = os.path.join("..",
                            "..",
                            "..",
                            "..",
                            "cds-vis-data",
                            "flowers", 
                            "image_0001.jpg")

image_f1 = cv2.imread(filepath_f1)

In [5]:
# Create histogram of all color channels
hist_f1 = cv2.calcHist([image_f1], [0,1,2], None, [255,255,255], [0,256, 0,256,0,256])

# MinMax normalisation of the histogram
norm_hist_f1 = cv2.normalize(hist_f1, hist_f1, 0, 1.0, cv2.NORM_MINMAX)

#### Extract colour histograms for all of the other images and compare

In [6]:
filepath = os.path.join("..",
                        "..",
                        "..",
                        "..",
                        "cds-vis-data",
                        "flowers")

In [8]:
# Define a function to update the df with new distances
def update_distance_df(filename, distance):
    distance_df.loc[len(distance_df.index)] = [filename, distance] # .loc is used to access rows and columns by label(s)

In [42]:
# Initialize a pandas dataframe with specified column names
distance_df = pd.DataFrame(columns=("Filename", "Distance"))

# Loop through all images in the sorted order
for file in sorted(os.listdir(filepath)):
    if file != filepath_f1: # As I dont want to include flower image 1 once again
   
        individual_filepath = os.path.join(filepath, file) # Create individual filepath for each image

        image = cv2.imread(individual_filepath)
        
        image_name = file.split("/")[-1]

        # Extract color hist
        hist = cv2.calcHist([image], [0,1,2], None, [255,255,255], [0,256, 0,256,0,256])

        # Normalise hist
        norm_hist = cv2.normalize(hist, hist, 0, 1.0, cv2.NORM_MINMAX)

        # Compare the hist of flower image 1 and the current image 
        dist = round(cv2.compareHist(norm_hist_f1, norm_hist, cv2.HISTCMP_CHISQR), 3) # 3 decimals

        """
        First, I want to append the first 5 images and distance values to a table. When the table consitsts of
        five images, I want to compare the distance between the image with the biggest distance( = least similar)
        and the one of the "current image". If the "current image" has a smaller distance, I want to drop the
        old one with the biggest distance and append the current image:
        """

        if len(distance_df) < 7: 
            # If there is less than 5 rows in the table --> then, append image and dist to the df
            update_distance_df(image_name, dist)
        else:
            # Find image with the highest distance value in the df - so the one that are least similar
            # pd .idxmax() function returns the index with the maximum value
            max_dist_idx = distance_df['Distance'].idxmax()
            max_dist = distance_df.loc[max_dist_idx, 'Distance']

            # If the dist of the current image is smaller than the max distance in df, then...
            if dist < max_dist:
                # Drop the row with the max value
                distance_df = distance_df.drop(index = max_dist_idx)
                # Append the image name and distance value
                update_distance_df(image_name, dist)

    # Specify path to the output folder and name of the specific .csv file
    #csv_outpath = os.path.join("..", "out", "output.csv")

    #distance_df.to_csv(csv_outpath)  

distance_df
    

         Filename    Distance
0  image_0001.jpg       0.000
1  image_0002.jpg    6650.673
3  image_0004.jpg     196.455
4  image_1360.jpg  215264.638


In [21]:
len(distance_df)

973

In [40]:
# Initialize a pandas dataframe with specified column names
distance_df = pd.DataFrame(columns=("Filename", "Distance"))

# Loop through all images in the sorted order
for file in sorted(os.listdir(filepath)):

    # Create individual filepath for each image
    individual_filepath = os.path.join(filepath, file) 

    # Read image
    image = cv2.imread(individual_filepath) 

    # Get filename
    filename = file.split(".jpg")[0]

    # Extract color hist
    hist = cv2.calcHist([image], [0,1,2], None, [255,255,255], [0,256, 0,256,0,256])

    # Normalise hist
    norm_hist = cv2.normalize(hist, hist, 0, 1.0, cv2.NORM_MINMAX) 
    
    # Compare the hist of flower image 1 and the current image 
    dist = round(cv2.compareHist(norm_hist_f1, norm_hist, cv2.HISTCMP_CHISQR), 3) # 3 decimals
    
    # Append the filename and distance value 
    row = [filename, dist]
    distance_df.loc[len(distance_df)] = row

# Extract 6 rows with the smallest distance value (5 + 1 target (= image_f1 ))
final_df = distance_df.nsmallest(6, ["Distance"])

# Specify path to the output folder and name of the specific .csv file
csv_outpath = os.path.join("..", "out", "output.csv")

final_df.to_csv(csv_outpath)


In [41]:
final_df

Unnamed: 0,Filename,Distance
0,image_0001,0.0
927,image_0928,178.124
875,image_0876,188.548
772,image_0773,190.081
141,image_0142,190.209
1315,image_1316,190.222
