

# Building a simple image search algorithm

## TO DO 
- ```unzip``` the flowers file to access the data.

For this exercise, you should write some code which does the following:
- **Define the particular image** you want to work with
  - Extract **the colour histogram** of *that particular image* using ```OpenCV``` 
- Extract **colour histograms** for all of the **other images** in the data
- **Compare** the histogram of our chosen image to all of the other histograms 
  - For this, use the
    - ```cv2.compareHist()``` function with the
    - ```cv2.HISTCMP_CHISQR``` metric
- Find the **five images** which are **most similar** to the **target image**
  - Save a **CSV file** to the folder called ```out```
    - showing the **five most similar images** and the **distance metric**:

|Filename|Distance|
|---|---|
|target|0.0|
|filename1|---|
|filename2|---|

## Some notes and additional comments

- Your code should include functions that you have written wherever possible. Try to break your code down into smaller self-contained parts, rather than having it as one long set of instructions.
- Submit your code either as a Jupyter Notebook or as a ```.py``` script.
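
One possible way to split the ranking and saving steps into small functions is sketched below. The function names `top_k_closest` and `write_results`, and the example filenames and distances, are hypothetical and only meant to show the shape such helpers could take.

```python
import csv
import os


def top_k_closest(target_name, distances, k=5):
    """Return the target row followed by the k entries with the smallest distance."""
    ranked = sorted(distances, key=lambda pair: pair[1])
    return [(target_name, 0.0)] + ranked[:k]


def write_results(rows, out_path):
    """Write (filename, distance) rows to a CSV file with a header row."""
    out_dir = os.path.dirname(out_path)
    if out_dir:
        # Create the output folder if it does not exist yet
        os.makedirs(out_dir, exist_ok=True)
    with open(out_path, "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["Filename", "Distance"])
        writer.writerows(rows)


# Made-up distances, purely for illustration
example_distances = [("a.jpg", 0.9), ("b.jpg", 0.1), ("c.jpg", 0.5)]
rows = top_k_closest("target.jpg", example_distances, k=2)
print(rows)
```

Calling `write_results(rows, os.path.join("..", "out", "results.csv"))` would then save the table; the output filename here is only a placeholder.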

## Objective

This assignment is designed to test that you can:

1. Work with larger datasets of images
2. Extract structured information from image data using ```OpenCV```
3. Quantitatively compare images based on these features, performing *distant viewing*








# Doing the work


In [7]:
# Setup
# In a terminal, unzip the data first by running: unzip "flowers.zip"
import os
import cv2
import matplotlib.pyplot as plt
import numpy as np
import csv
import fnmatch
import sys
sys.path.append(os.path.join(".." ))
from utils.imutils import jimshow
from utils.imutils import jimshow_channel

In [49]:
# Set the main directory 'flower_dir' where the files are located
flower_dir = os.path.join("..", "data", "flowers")

# Set the target image
target_filename = os.path.join("..", "data", "flowers", "image_1305.jpg")
target_image = cv2.imread(target_filename)

# Calculate the 3D colour histogram of the target image (36 bins per channel)
target_hist = cv2.calcHist([target_image], [0, 1, 2], None, [36, 36, 36], [0, 256, 0, 256, 0, 256])

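
To get a feel for what a three-channel histogram with 36 bins per channel looks like, here is a NumPy-only sketch on a random stand-in image. It uses `np.histogramdd` rather than ```cv2.calcHist```, so it is an analogue of the call above, not the same code path; `colour_histogram` and `fake_image` are names invented for this example.

```python
import numpy as np


def colour_histogram(image, bins=36):
    """Count pixels per (channel 0, channel 1, channel 2) colour bin (hypothetical helper)."""
    # Flatten height x width x 3 into a list of 3-value pixels
    pixels = image.reshape(-1, 3).astype(np.float64)
    hist, _ = np.histogramdd(
        pixels, bins=(bins, bins, bins), range=((0, 256),) * 3
    )
    return hist


# Random 20 x 30 image standing in for a real photo
rng = np.random.default_rng(0)
fake_image = rng.integers(0, 256, size=(20, 30, 3), dtype=np.uint8)

hist = colour_histogram(fake_image)
print(hist.shape)  # (36, 36, 36)
print(hist.sum())  # 600.0, one count per pixel
```

Each pixel lands in exactly one of the 36 x 36 x 36 bins, so the counts add up to the number of pixels in the image.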

In [50]:

# Create a list to store the distances between the target histogram and each image histogram
distances = []

# Loop over every image in the flower_dir and calculate the histogram
for filename in os.listdir(flower_dir):
    # Check if the filename is the same as the target filename
    if filename == os.path.basename(target_filename):
        continue  # skip this image and continue with the next one
    if not fnmatch.fnmatch(filename, "*.jpg"):
        continue
    # Read the image
    image = cv2.imread(os.path.join(flower_dir, filename))
    # Calculate the histogram of the image
    hist = cv2.calcHist([image], [0, 1, 2], None, [36, 36, 36], [0, 256, 0, 256, 0, 256])
    
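    # Calculate the distance between the target histogram and the image histogram
    # using the Bhattacharyya distance, which measures how much two distributions
    # overlap: 0 means identical histograms, values near 1 mean little overlap.
    # Note: the assignment brief asks for cv2.HISTCMP_CHISQR; this cell uses
    # cv2.HISTCMP_BHATTACHARYYA instead.
    distance = cv2.compareHist(target_hist, hist, cv2.HISTCMP_BHATTACHARYYA)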
    
    # Add the filename and distance to the distances list
    distances.append((filename, distance))

# Sort the distances list by distance in ascending order
distances.sort(key=lambda x: x[1])

# Create a list to store the top 5 images in
top_5 = []

# Add the target image to the top 5 list with a distance of 0.0
top_5.append((os.path.basename(target_filename), 0.0))

# Loop over the 5 images with the smallest distance and add them to the top 5 list
for i in range(5):
    filename, distance = distances[i]
    top_5.append((filename, distance))

output_file = os.path.join("..", "out", "similiar_images_bins36.csv")
# Save the top 5 list to a CSV file
with open(output_file, "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["Filename", "Distance"])
    for item in top_5:
        writer.writerow(item)

It was mentioned in class that we should try lowering the bin count. However, for this image at least, 256 bins seemed to produce the best matches. Why that is, I don't know. It does recognise image 1322, which is exactly the same image as the target image (1305), as a match.
I've also tried it with 32 bins (the square root of the number of pictures; I don't know why that works).
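
One thing that may help when reasoning about the bin count is how quickly a three-channel histogram grows. The sketch below only does the arithmetic; `histogram_size` is a made-up helper, and the 4 bytes per bin assumes 32-bit float bins, which is my understanding of what ```cv2.calcHist``` returns.

```python
def histogram_size(bins_per_channel, channels=3, bytes_per_bin=4):
    """Return (number of bins, size in bytes) for a joint colour histogram."""
    n_bins = bins_per_channel ** channels
    return n_bins, n_bins * bytes_per_bin


for bins in (8, 36, 256):
    n_bins, n_bytes = histogram_size(bins)
    print(f"{bins:>3} bins/channel -> {n_bins:>10,} bins, {n_bytes / 2**20:.2f} MiB")
```

With 256 bins per channel every distinct colour gets its own bin (about 16.8 million of them), so two images only score as close when they share exact colour values, whereas coarser bins group nearby colours together.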