In [7]:
import numpy as np 
import matplotlib.pyplot as plt 
import pandas as pd 
import seaborn as sns
import re
import zipfile
import os
from PIL import Image
from collections import Counter
import imagehash
import tensorflow as tf
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions
from tensorflow.keras.preprocessing.image import img_to_array
from tensorflow.keras.models import Model
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity

# Working with Images Lab
## Information retrieval, preprocessing, and feature extraction

In this lab, you'll work with images of felines (cats), which have been classified according to their taxonomy. Each subfolder contains images of a particular species. The dataset is located [here](https://www.kaggle.com/datasets/datahmifitb/felis-taxonomy-image-classification) but it's also provided to you in the `data/` folder.

### Problem 1. Some exploration (1 point)
How many types of cats are there? How many images do we have of each? What is a typical image size? Are there any outliers in size?

1. Let's first see how many types of cats there are, based on the folders we have. We will use the `os` module to list the subdirectories of the data directory, collect the species names, and count them.

In [2]:
root_dir = 'data'
species_list = [species for species in os.listdir(root_dir) if os.path.isdir(os.path.join(root_dir, species))]
num_species = len(species_list)

In [3]:
species_list

['african-wildcat',
 'blackfoot-cat',
 'chinese-mountain-cat',
 'domestic-cat',
 'european-wildcat',
 'jungle-cat',
 'sand-cat']

In [4]:
num_species 

7

2. Now let's see the number of images we have per species.

In [5]:
images_per_species = {species: len(os.listdir(os.path.join(root_dir, species))) for species in species_list}

In [6]:
images_per_species 

{'african-wildcat': 91,
 'blackfoot-cat': 79,
 'chinese-mountain-cat': 42,
 'domestic-cat': 64,
 'european-wildcat': 85,
 'jungle-cat': 86,
 'sand-cat': 72}
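
The counts above can also be summarised to check how balanced the classes are. This is a minimal sketch using only the numbers printed above; the helper name `class_balance` is our own and not part of the lab.

```python
# Counts copied from the output above.
images_per_species = {
    'african-wildcat': 91,
    'blackfoot-cat': 79,
    'chinese-mountain-cat': 42,
    'domestic-cat': 64,
    'european-wildcat': 85,
    'jungle-cat': 86,
    'sand-cat': 72,
}

def class_balance(counts):
    """Return (total, {label: share of total}, largest/smallest class ratio)."""
    total = sum(counts.values())
    shares = {label: n / total for label, n in counts.items()}
    imbalance = max(counts.values()) / min(counts.values())
    return total, shares, imbalance

total, shares, imbalance = class_balance(images_per_species)
for label, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:22s} {share:6.1%}")
print(f"total={total}, largest/smallest={imbalance:.2f}")
```

The largest class has a bit more than twice as many images as the smallest one, which is worth keeping in mind when reading per-species results later.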

3. Now we know how many pictures there are for each species.
4. Now let's look at the typical image size using the PIL library. We will collect the dimensions of all pictures and calculate the mean, the median, and the most common image size. If the mean is well above the median, a minority of much larger images is pulling the average up.

In [7]:
# Initialize lists to store image sizes
widths = []
heights = []
data = []

# Traverse the directory and collect image sizes
for species in species_list:
    species_dir = os.path.join(root_dir, species)
    for img in os.listdir(species_dir):
        img_path = os.path.join(species_dir, img)
        try:
            with Image.open(img_path) as image:
                widths.append(image.width)
                heights.append(image.height)
                data.append((img_path, species, image.width, image.height))
        except Exception as e:
            print(f"Error processing image {img_path}: {e}")

# Create a DataFrame
df = pd.DataFrame(data, columns=['image_path', 'species', 'width', 'height'])

# Calculate the typical image size (mean and median)
mean_width = np.mean(widths)
mean_height = np.mean(heights)
median_width = np.median(widths)
median_height = np.median(heights)

print(f'Mean image size: {mean_width}x{mean_height}')
print(f'Median image size: {median_width}x{median_height}')

Mean image size: 406.55298651252406x310.9499036608863
Median image size: 275.0x194.0


In [8]:
most_common_size = Counter(zip(widths, heights)).most_common(1)[0]
print(f'Most common image size: {most_common_size[0]} with {most_common_size[1]} occurrences')

Most common image size: (275, 183) with 108 occurrences


In [9]:
df

Unnamed: 0,image_path,species,width,height
0,data\african-wildcat\af (1).jpg,african-wildcat,265,190
1,data\african-wildcat\af (10).jpg,african-wildcat,274,184
2,data\african-wildcat\af (11).jpg,african-wildcat,275,183
3,data\african-wildcat\af (12).jpg,african-wildcat,263,192
4,data\african-wildcat\af (13).jpg,african-wildcat,230,219
...,...,...,...,...
514,data\sand-cat\sd (70).jpg,sand-cat,309,163
515,data\sand-cat\sd (71).jpg,sand-cat,273,184
516,data\sand-cat\sd (72).jpg,sand-cat,225,225
517,data\sand-cat\sd (8).jpg,sand-cat,220,158


5. As can be seen, the mean (about 407x311) is much larger than both the median (275x194) and the most common size (275x183), which are close to each other. This means most images are around 275 pixels wide, while a smaller number of considerably larger images pull the mean up.
6. Now let's see which images we could count as outliers. We will use the interquartile range (IQR), the spread of the middle 50% of the data between the first quartile (Q1) and the third quartile (Q3), computed separately for width and height. Any image whose width or height falls below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is counted as an outlier. This rule is robust because the thresholds depend on the quartiles rather than on the extreme values themselves.
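
To make the rule concrete, here is a small self-contained sketch on made-up widths (not the lab data). It uses `statistics.quantiles` with `method='inclusive'`, which should interpolate the same way `np.percentile` does by default; the helper name `iqr_outliers` and the toy values are assumptions for illustration only.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule."""
    # 'inclusive' interpolates linearly between neighbouring data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return lower, upper, [v for v in values if v < lower or v > upper]

# Hypothetical widths: a tight cluster around 275 plus a few deviating values.
toy_widths = [183, 195, 260, 275, 275, 275, 275, 280, 300, 1600]

lower, upper, outliers = iqr_outliers(toy_widths)
print(f"mean={statistics.mean(toy_widths)}, median={statistics.median(toy_widths)}")
print(f"fences=({lower}, {upper}), outliers={outliers}")
```

Because the values cluster tightly around 275, the IQR is small, so moderately different values (183 and 195) are flagged together with the extreme one (1600).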

In [10]:
Q1_width = np.percentile(widths, 25)
Q3_width = np.percentile(widths, 75)
IQR_width = Q3_width - Q1_width

Q1_height = np.percentile(heights, 25)
Q3_height = np.percentile(heights, 75)
IQR_height = Q3_height - Q1_height

# Define outlier thresholds
lower_bound_width = Q1_width - 1.5 * IQR_width
upper_bound_width = Q3_width + 1.5 * IQR_width

lower_bound_height = Q1_height - 1.5 * IQR_height
upper_bound_height = Q3_height + 1.5 * IQR_height

# Identify outliers
outliers = df[(df['width'] < lower_bound_width) | (df['width'] > upper_bound_width) | 
              (df['height'] < lower_bound_height) | (df['height'] > upper_bound_height)]

print(f'Number of outliers: {len(outliers)}')
print(outliers[['image_path', 'width', 'height']])

Number of outliers: 157
                           image_path  width  height
5    data\african-wildcat\af (14).jpg    183     275
12   data\african-wildcat\af (20).jpg    195     258
17   data\african-wildcat\af (25).jpg    191     264
18   data\african-wildcat\af (26).jpg    195     258
26   data\african-wildcat\af (33).jpg    195     259
..                                ...    ...     ...
467         data\sand-cat\sd (28).jpg    177     284
472         data\sand-cat\sd (32).jpg    189     267
473         data\sand-cat\sd (33).jpg    195     258
476         data\sand-cat\sd (36).jpg    194     259
502          data\sand-cat\sd (6).jpg    194     259

[157 rows x 3 columns]


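7. As we can see, 157 images count as outliers under this rule. Most of the rows shown are portrait-oriented: taller than they are wide, with widths below 200 pixels. This is likely because most images share nearly the same size, which makes the IQR narrow, so even moderate deviations are flagged.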
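
The "tall and narrow" observation can be checked by classifying each image by aspect ratio. This is a minimal sketch; the function name `orientation` and the 5% tolerance for "square" are our own assumptions, and the sizes are a handful of rows copied from the tables above.

```python
from collections import Counter

def orientation(width, height, tolerance=0.05):
    """Classify an image as 'landscape', 'portrait' or 'square' by aspect ratio."""
    ratio = width / height
    if abs(ratio - 1.0) <= tolerance:
        return 'square'
    return 'landscape' if ratio > 1.0 else 'portrait'

# A few (width, height) pairs taken from the outputs above.
sizes = [(275, 183), (265, 190), (183, 275), (195, 258), (225, 225), (177, 284)]
orientation_counts = Counter(orientation(w, h) for w, h in sizes)
print(orientation_counts)
```

On the full data, the same function could be applied row by row to the `width` and `height` columns of `outliers` to count how many flagged images are portrait-oriented.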

### Problem 2. Duplicat(e)s (1 point)
Find a way to filter out (remove) identical images. I would recommend using file hashes, but there are many approaches. Keep in mind that during file saving, recompression, etc., a lot of artifacts can change the file content (bytes), but not visually.
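
For reference, the byte-level approach mentioned in the task can be sketched as follows. It only catches files that are identical byte for byte, which is why the solution in this notebook uses perceptual hashes instead. The helper names `file_digest` and `find_exact_duplicates` are hypothetical, and the demonstration uses throwaway files rather than the lab data.

```python
import hashlib
import os
import tempfile

def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def find_exact_duplicates(paths):
    """Map each digest shared by two or more files to the list of those files."""
    groups = {}
    for path in paths:
        groups.setdefault(file_digest(path), []).append(path)
    return {d: files for d, files in groups.items() if len(files) > 1}

# Tiny demonstration with temporary files.
with tempfile.TemporaryDirectory() as tmp:
    contents = {'a.bin': b'same bytes', 'b.bin': b'same bytes', 'c.bin': b'other bytes'}
    paths = []
    for name, payload in contents.items():
        path = os.path.join(tmp, name)
        with open(path, 'wb') as handle:
            handle.write(payload)
        paths.append(path)
    duplicate_groups = find_exact_duplicates(paths)
    duplicate_names = sorted(
        os.path.basename(p) for group in duplicate_groups.values() for p in group
    )
    print(duplicate_names)
```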

1. We will use Pillow to handle the images and the `imagehash` library (which builds on Pillow) for perceptual hashing. For each image we check whether its hash already exists in a dictionary that maps hashes to image paths. If it does, we mark the image as a duplicate and print both paths so we can verify; if not, we store the hash. The unique images go into a DataFrame for further processing or analysis.
2. Let's first install `imagehash`. Unlike a hash of the file bytes, a perceptual hash is computed from the image content, so it stays the same (or nearly the same) after recompression or re-saving.

In [31]:
pip install imagehash pillow

Collecting imagehash
  Downloading ImageHash-4.3.1-py2.py3-none-any.whl.metadata (8.0 kB)
Downloading ImageHash-4.3.1-py2.py3-none-any.whl (296 kB)
Installing collected packages: imagehash
Successfully installed imagehash-4.3.1
Note: you may need to restart the kernel to use updated packages.


3. Now let's assign the directory for the files.

In [11]:
root_dir = 'data'

4. Let's create a function to compute image hashes and identify duplicates. It takes the directory path as an argument, stores the hashes in a dictionary, and collects the unique images and the duplicates in two lists. It goes through all folders and their images, computes the perceptual hash of each image, checks whether that hash has been seen before, and stores duplicates and unique images separately.

In [12]:
def compute_image_hashes(directory):
    """
    Compute perceptual hashes for images in a directory and identify duplicates.

    Args:
        directory (str): Path to the root directory containing species folders with images.

    Returns:
        tuple: A tuple containing:
            - unique_images (list of tuples): List of tuples with unique image data (image_path, species, image_hash).
            - duplicates (list of str): List of file paths of duplicate images.
    """
    # Dictionary to store unique image hashes
    hashes = {}
    # Lists to store unique images and duplicates
    unique_images = []
    duplicates = []

    # Iterate through each species folder in the root directory
    for species in os.listdir(directory):
        species_dir = os.path.join(directory, species)
        # Check if the item is a directory
        if os.path.isdir(species_dir):
            # Iterate through each image file in the species directory
            for img in os.listdir(species_dir):
                img_path = os.path.join(species_dir, img)
                # Check if the item is a file
                if os.path.isfile(img_path):
                    try:
                        # Open the image file
                        with Image.open(img_path) as image:
                            # Compute perceptual hash
                            img_hash = imagehash.phash(image)
                            # Check if the hash already exists in the dictionary
                            if img_hash in hashes:
                                # Add to duplicates if hash exists
                                duplicates.append(img_path)
                                print(f"Duplicate found: {img_path} is a duplicate of {hashes[img_hash]}")
                            else:
                                # Add to unique images and update the hash dictionary
                                hashes[img_hash] = img_path
                                unique_images.append((img_path, species, img_hash))
                    except Exception as e:
                        print(f"Error processing image {img_path}: {e}")

    return unique_images, duplicates

# Compute hashes and identify duplicates
unique_images_data, duplicates = compute_image_hashes(root_dir)

# Create a DataFrame for unique images
df_unique = pd.DataFrame(unique_images_data, columns=['image_path', 'species', 'image_hash'])


Duplicate found: data\african-wildcat\af (32).jpg is a duplicate of data\african-wildcat\af (27).jpg
Duplicate found: data\african-wildcat\af (37).jpg is a duplicate of data\african-wildcat\af (11).jpg
Duplicate found: data\african-wildcat\af (61).jpg is a duplicate of data\african-wildcat\af (50).jpg
Duplicate found: data\african-wildcat\af (74).jpg is a duplicate of data\african-wildcat\af (16).jpg
Duplicate found: data\blackfoot-cat\bc (63).jpg is a duplicate of data\blackfoot-cat\bc (5).jpg
Duplicate found: data\chinese-mountain-cat\ch (20).jpg is a duplicate of data\chinese-mountain-cat\ch (10).jpg
Duplicate found: data\chinese-mountain-cat\ch (32).jpg is a duplicate of data\chinese-mountain-cat\ch (29).jpg
Duplicate found: data\chinese-mountain-cat\ch (39).jpg is a duplicate of data\chinese-mountain-cat\ch (25).jpg
Duplicate found: data\chinese-mountain-cat\ch (42).jpg is a duplicate of data\chinese-mountain-cat\ch (13).jpg
Duplicate found: data\chinese-mountain-cat\ch (9).jpg is

5. Let's see the DataFrame of unique pictures, with the duplicates left out.

In [13]:
df_unique

Unnamed: 0,image_path,species,image_hash
0,data\african-wildcat\af (1).jpg,african-wildcat,cbecb45b8c4af209
1,data\african-wildcat\af (10).jpg,african-wildcat,cefa3070c16c63c7
2,data\african-wildcat\af (11).jpg,african-wildcat,cba4349b0cca75ab
3,data\african-wildcat\af (12).jpg,african-wildcat,988665d06ef92bd8
4,data\african-wildcat\af (13).jpg,african-wildcat,ad71928d7c22e96c
...,...,...,...
460,data\sand-cat\sd (70).jpg,sand-cat,c7ac38912c5b397c
461,data\sand-cat\sd (71).jpg,sand-cat,8e456130fc57395e
462,data\sand-cat\sd (72).jpg,sand-cat,d385522d0e6be61b
463,data\sand-cat\sd (8).jpg,sand-cat,d403637cba1ca2dd


6. Let's see the number of duplicates.

In [14]:
len(duplicates)

54

7. 54 out of 519 images are flagged as duplicates, which is about 10% of the dataset.
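
The function above only treats images as duplicates when their perceptual hashes are exactly equal. A looser variant compares hashes by Hamming distance (the number of differing bits), so near-identical images are caught too. The sketch below works on the hex strings shown in the table; the helper names and the threshold of 4 bits are assumptions, not values from the lab. As far as we know, `imagehash` hash objects also support `hash_a - hash_b` to get this distance directly.

```python
def hamming_distance(hex_a, hex_b):
    """Number of differing bits between two equal-length hex-encoded hashes."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count('1')

def near_duplicates(hashes, max_distance=4):
    """Return pairs of keys whose hashes differ by at most max_distance bits."""
    items = list(hashes.items())
    pairs = []
    for i, (name_a, hash_a) in enumerate(items):
        for name_b, hash_b in items[i + 1:]:
            if hamming_distance(hash_a, hash_b) <= max_distance:
                pairs.append((name_a, name_b))
    return pairs

example_hashes = {
    'af (1).jpg': 'cbecb45b8c4af209',   # hash from the table above
    'af (10).jpg': 'cefa3070c16c63c7',  # hash from the table above
    'made-up.jpg': 'cbecb45b8c4af20b',  # hypothetical: first hash with one bit flipped
}
near_pairs = near_duplicates(example_hashes)
print(near_pairs)
```

The pairwise loop is quadratic in the number of images, which is fine for a few hundred files but would need a smarter index for large collections.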

### Problem 3. Loading a model (2 points)
Find a suitable, trained convolutional neural network classifier. I recommend `ResNet50` as it's small enough to run well on any machine and powerful enough to make reasonable predictions. Most ready-made classifiers have been trained for 1000 classes.

You'll need to install libraries and possibly tinker with configurations for this task. When you're done, display the total number of layers and the total number of parameters. For ResNet50, you should expect around 50 layers and 25M parameters.

1. Let's load the model ResNet50

In [15]:
model = ResNet50(weights='imagenet')

2. Now let's count the total number of layers and parameters.

In [16]:
total_layers = len(model.layers)
print(f"Total number of layers: {total_layers}")

# Count the total number of parameters
total_params = model.count_params()
print(f"Total number of parameters: {total_params}")

Total number of layers: 177
Total number of parameters: 25636712


3. We can see there are more than 25 million parameters, as expected, but 177 layers, which is not what we expect, since ResNet50 is known as a 50-layer model. The number reported by Keras (177) counts every layer object in the network, including the input, padding, batch normalization, activation, addition, and pooling layers, not just the convolutional and fully connected layers with weights that the name ResNet50 refers to. Using `model.summary()` we can see all of the layers, but the output would make this exercise pretty long to read.
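
A shorter alternative to `model.summary()` is to count layers by type. The sketch below is written so that it runs without TensorFlow: the stand-in classes are placeholders, and `count_layer_types` is a hypothetical helper name. With the real model, it would be called as `count_layer_types(model.layers)`.

```python
from collections import Counter

def count_layer_types(layers):
    """Count layers by class name, e.g. how many Conv2D or Activation layers."""
    return Counter(type(layer).__name__ for layer in layers)

# Stand-in classes so the sketch runs on its own; they are NOT the Keras classes.
class Conv2D:
    pass

class BatchNormalization:
    pass

class Activation:
    pass

fake_layers = [Conv2D(), BatchNormalization(), Activation(), Conv2D()]
layer_counts = count_layer_types(fake_layers)
print(layer_counts.most_common())
```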

### Problem 4. Prepare the images (1 point)
You'll need to prepare the images for passing to the model. To do so, they have to be resized to the same dimensions. Most available models have a specific requirement for sizes. You may need to do additional preprocessing, depending on the model requirements. These requirements should be easily available in the model documentation.

1. Let's first create a function to preprocess the images. The input size for ResNet50 is 224x224, so we will resize every image to this size and turn it into a NumPy array that represents the preprocessed image.

In [17]:
# Define the target size for the images
TARGET_SIZE = (224, 224)

# Function to preprocess a single image
def preprocess_image(image_path):
    """
    Preprocess a single image for ResNet50.
    
    Args:
    image_path (str): Path to the image file.
    
    Returns:
    np.array: Preprocessed image array.
    """
    image = Image.open(image_path).convert('RGB')  # Open the image with PIL and force 3 channels (handles grayscale, palette, and RGBA files).
    image = image.resize(TARGET_SIZE) # Resize the image to the dimensions expected by ResNet50 (224x224 pixels).
    image_array = img_to_array(image) #  Convert the PIL image object to a numpy array.
    image_array = np.expand_dims(image_array, axis=0) # Add an extra dimension to the array to represent the batch size.
    image_array = preprocess_input(image_array) # Apply the same preprocessing that was used during the training of ResNet50.
    return image_array
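
To our understanding, `preprocess_input` for ResNet50 reorders the channels from RGB to BGR and subtracts a per-channel ImageNet mean, without any scaling. The sketch below reproduces that arithmetic for a single pixel in plain Python. The mean values are the commonly quoted ones and should be treated as an assumption here, and `caffe_preprocess_pixel` is a hypothetical helper, not a Keras function.

```python
# ImageNet channel means in B, G, R order (assumed values, see the text above).
IMAGENET_BGR_MEANS = (103.939, 116.779, 123.68)

def caffe_preprocess_pixel(rgb):
    """Convert one (R, G, B) pixel to zero-centred (B, G, R) values."""
    r, g, b = rgb
    mean_b, mean_g, mean_r = IMAGENET_BGR_MEANS
    return (b - mean_b, g - mean_g, r - mean_r)

white = caffe_preprocess_pixel((255, 255, 255))
black = caffe_preprocess_pixel((0, 0, 0))
print(white)
print(black)
```

This is why the saved arrays contain negative values and are no longer valid images: they are meant to be fed to the network, not displayed.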

2. Now let's create a function that preprocesses the images in batches and saves the preprocessed images and their labels to disk. It creates an output directory for the `.npz` files, then prepares lists to collect images and labels, and a counter for batches. It loops through each subdirectory and each image file, preprocesses the images, and collects them in batches. Once a batch is complete, it is saved to disk and the lists are reset. Any remaining images are saved after the loop.

In [18]:
def preprocess_and_save_batches(root_directory, output_directory, batch_size=4):
    """
    Process images in batches and save the preprocessed arrays to disk.
    
    Args:
    root_directory (str): Path to the root directory containing subdirectories with images.
    output_directory (str): Path to the directory where preprocessed arrays will be saved.
    batch_size (int): Number of images to process in each batch.
    """
    if not os.path.exists(output_directory):
        os.makedirs(output_directory)
    
    all_images = []
    labels = []
    batch_counter = 0

    # Loop through each subdirectory and process images
    for subdir in os.listdir(root_directory):
        subdir_path = os.path.join(root_directory, subdir)
        if os.path.isdir(subdir_path):
            for filename in os.listdir(subdir_path):
                file_path = os.path.join(subdir_path, filename)
                if os.path.isfile(file_path) and filename.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp', '.gif')):
                    try:
                        image_array = preprocess_image(file_path)
                        all_images.append(image_array)
                        labels.append(subdir)  # Use the subdir name as the label
                        # Save batch if batch size is reached
                        if len(all_images) == batch_size:
                            batch_filename = f'batch_{batch_counter}.npz'
                            batch_filepath = os.path.join(output_directory, batch_filename)
                            np.savez(batch_filepath, images=np.vstack(all_images), labels=np.array(labels))
                            all_images = []
                            labels = []
                            batch_counter += 1
                    except Exception as e:
                        print(f"Error processing image {file_path}: {e}")
    
    # Save any remaining images that did not fill up a complete batch
    if all_images:
        batch_filename = f'batch_{batch_counter}.npz'
        batch_filepath = os.path.join(output_directory, batch_filename)
        np.savez(batch_filepath, images=np.vstack(all_images), labels=np.array(labels))

root_directory_path = 'data'
output_directory_path = 'data/batch'
preprocess_and_save_batches(root_directory_path, output_directory_path)

### Problem 5. Load the images efficiently (1 point)
Now that you've seen how to prepare the images for passing to the model... find a way to do it efficiently. Instead of loading the entire dataset in the RAM, read the images in batches (e.g. 4 images at a time). The goal is to read these, preprocess them, maybe save the preprocessed results in RAM.

If you've already done this in one of the previous problems, just skip this one. You'll get your point for it.

\* Even better, save the preprocessed image arrays (they will not be valid .jpg files) as separate files, so you can load them "lazily" in the following steps. This is a very common optimization to work with large datasets.

1. Now let's load the already preprocessed batches. We will create a generator function that loads the image batches "lazily", one file at a time, and then iterate over the batches to check that they were all processed.

In [19]:
def load_batches(output_directory):
    """
    Generator function to load preprocessed image batches lazily.
    
    Args:
    output_directory (str): Path to the directory where preprocessed arrays are saved.
    
    Yields:
    tuple: A tuple containing:
        - images (np.array): Array of preprocessed images.
        - labels (np.array): Array of labels corresponding to the images.
    """
    for batch_filename in os.listdir(output_directory):
        if batch_filename.lower().endswith('.npz'):
            batch_filepath = os.path.join(output_directory, batch_filename)
            with np.load(batch_filepath) as data:
                images = data['images']
                labels = data['labels']
                yield images, labels
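
One caveat: `os.listdir` returns entries in arbitrary order, and even a plain alphabetical sort would put `batch_10.npz` before `batch_2.npz`. If the batches need to come back in the order they were written, the file names can be sorted by their numeric index. This is a minimal sketch; `batch_index` is a hypothetical helper that assumes the `batch_N.npz` naming used above.

```python
import re

def batch_index(filename):
    """Extract the integer N from a name like 'batch_N.npz' (-1 if absent)."""
    match = re.search(r'batch_(\d+)\.npz$', filename)
    return int(match.group(1)) if match else -1

names = ['batch_10.npz', 'batch_2.npz', 'batch_0.npz', 'batch_100.npz', 'batch_1.npz']
ordered = sorted(names, key=batch_index)
print(ordered)
```

Inside `load_batches`, the loop could then iterate over `sorted(os.listdir(output_directory), key=batch_index)`.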

In [20]:
for images, labels in load_batches(output_directory_path):
    print(images.shape)  # Should print (batch_size, 224, 224, 3)
    print(labels)  # Should print the list of labels for the batch

(4, 224, 224, 3)
['african-wildcat' 'african-wildcat' 'african-wildcat' 'african-wildcat']
(4, 224, 224, 3)
['african-wildcat' 'african-wildcat' 'african-wildcat' 'african-wildcat']
(4, 224, 224, 3)
['african-wildcat' 'african-wildcat' 'african-wildcat' 'african-wildcat']
(4, 224, 224, 3)
['jungle-cat' 'jungle-cat' 'jungle-cat' 'jungle-cat']
(4, 224, 224, 3)
['jungle-cat' 'jungle-cat' 'jungle-cat' 'jungle-cat']
(4, 224, 224, 3)
['jungle-cat' 'jungle-cat' 'jungle-cat' 'jungle-cat']
(4, 224, 224, 3)
['jungle-cat' 'jungle-cat' 'jungle-cat' 'jungle-cat']
(4, 224, 224, 3)
['jungle-cat' 'jungle-cat' 'jungle-cat' 'jungle-cat']
(4, 224, 224, 3)
['jungle-cat' 'jungle-cat' 'jungle-cat' 'jungle-cat']
(4, 224, 224, 3)
['jungle-cat' 'jungle-cat' 'jungle-cat' 'jungle-cat']
(4, 224, 224, 3)
['jungle-cat' 'jungle-cat' 'jungle-cat' 'jungle-cat']
(4, 224, 224, 3)
['jungle-cat' 'jungle-cat' 'jungle-cat' 'jungle-cat']
(4, 224, 224, 3)
['jungle-cat' 'jungle-cat' 'jungle-cat' 'jungle-cat']
(4, 224, 224, 3)


2. As can be seen, each batch contains 4 images, all preprocessed to 224x224 pixels with 3 channels. Only one batch has 3 images: the ones left over after the last full batch. In total we have 129 four-image batches and 1 three-image batch (129 * 4 + 3 = 519).
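
The batch counts follow directly from integer division of the number of images by the batch size. A tiny sketch, with `batch_layout` as a hypothetical helper name:

```python
def batch_layout(num_items, batch_size):
    """Return (number of full batches, size of the final partial batch)."""
    return divmod(num_items, batch_size)

full_batches, remainder = batch_layout(519, 4)
print(f"{full_batches} full batches, last batch has {remainder} images")
```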

### Problem 6. Predictions (1 point)
Finally, you're ready to get into the meat of the problem. Obtain predictions from your model and evaluate them. This will likely involve manual work to decide how the returned classes relate to the original ones.

Create a [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) to evaluate the classification.

1. Let's load the model that is pre-trained on the imagenet dataset

In [57]:
model = ResNet50(weights='imagenet')

2. Now let's load the batches and have the model predict the ImageNet class of each image. Then we decode the predictions into human-readable class names.

In [21]:
predictions = []

In [22]:
for images, _ in load_batches(output_directory_path):
    preds = model.predict(images)
    decoded_preds = decode_predictions(preds, top=1)
    predictions.extend(decoded_preds)

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 6s/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 507ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 639ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 527ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 547ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 486ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 485ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 497ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 495ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 637ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 643ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 652ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 558ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1

3. Now that the predictions are made, let's look at the top prediction for the first 5 images and count how many distinct classes our pictures were assigned to.

In [23]:
num_images_to_display = 5

for i, pred in enumerate(predictions[:num_images_to_display]):
    print(f"Image {i+1}:")
    class_id, class_name, score = pred[0]  # Only the top prediction
    print(f" - {class_id}: {class_name} (score: {score:.4f})")
    print()

# Display a summary of the total number of images and categories
print(f"Displayed top {num_images_to_display} predictions out of {len(predictions)} total images.")
unique_categories = set()
for pred in predictions:
    for class_id, class_name, score in pred:
        unique_categories.add(class_name)
print(f"Total unique predicted categories: {len(unique_categories)}")

Image 1:
 - n02127052: lynx (score: 0.8168)

Image 2:
 - n02123045: tabby (score: 0.4581)

Image 3:
 - n02127052: lynx (score: 0.6821)

Image 4:
 - n02127052: lynx (score: 0.7579)

Image 5:
 - n02124075: Egyptian_cat (score: 0.3797)

Displayed top 5 predictions out of 519 total images.
Total unique predicted categories: 41


4. As we can see, the model did not predict our species, but at least these predictions are types of cats. ImageNet does not contain these exact classes, so the model cannot name them correctly.
5. There are 41 predicted categories in total; let's see what they are.

In [24]:
unique_categories = set()
for pred in predictions:
    for class_id, class_name, score in pred:
        unique_categories.add(class_name)

print("Unique predicted categories:")
for category in unique_categories:
    print(category)

Unique predicted categories:
file
great_grey_owl
wombat
coyote
shopping_basket
wood_rabbit
tabby
patas
cougar
brown_bear
wallaby
white_wolf
Arctic_fox
Siamese_cat
hyena
jaguar
Angora
ibex
window_screen
kit_fox
indri
mongoose
tiger
leopard
bighorn
gazelle
grey_fox
Egyptian_cat
lion
African_hunting_dog
dingo
snow_leopard
book_jacket
prairie_chicken
hare
doormat
tiger_cat
dhole
fox_squirrel
lynx
timber_wolf


6. As can be seen, our cats have been classified into many different classes. Many are other kinds of cats (not ours), but others are entirely different things, from dogs and owls to window screens.
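
The task also asks for a confusion matrix. Since the true species and the ImageNet classes are different label sets, one option is a table with true species as rows and predicted ImageNet classes as columns. The sketch below assumes the true labels were kept alongside the predictions (the prediction loop above discards them with `_`), and it uses hand-made labels purely for illustration, not real results. The helper names are hypothetical; `pandas.crosstab` would be another way to build the same table.

```python
from collections import Counter

def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true label, predicted label) pair occurs."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    return Counter(zip(true_labels, predicted_labels))

def format_confusion_table(counts):
    """Render the counts as a plain-text table; rows are true labels."""
    rows = sorted({true for true, _ in counts})
    cols = sorted({pred for _, pred in counts})
    width = max(len(label) for label in rows + cols) + 2
    lines = [''.ljust(width) + ''.join(col.ljust(width) for col in cols)]
    for row in rows:
        cells = ''.join(str(counts[(row, col)]).ljust(width) for col in cols)
        lines.append(row.ljust(width) + cells)
    return '\n'.join(lines)

# Hypothetical labels for illustration only.
true_labels = ['sand-cat', 'sand-cat', 'jungle-cat', 'jungle-cat', 'domestic-cat']
predicted_labels = ['lynx', 'tabby', 'lynx', 'lynx', 'tabby']

pair_counts = confusion_counts(true_labels, predicted_labels)
print(format_confusion_table(pair_counts))
```

With 41 predicted categories, grouping the rarely predicted classes into an "other" column would keep the table readable.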

### Problem 7. Grayscale (1 point)
Converting the images to grayscale should affect the classification negatively, as we lose some of the color information.

Find a way to preprocess the images to grayscale (using what you already have in Problem 4 and 5), pass them to the model, and compare the classification results to the previous ones.
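
One simple way to make that comparison, once both sets of predictions exist, is to measure how often the top-1 class stays the same and how the top-1 score changes on average. The sketch below assumes the two prediction lists are aligned image by image and stored under separate names (the notebook reuses the name `predictions`, so the color results would need to be kept in their own variable). `compare_top1` is a hypothetical helper, and the grayscale tuples are made up for illustration.

```python
def compare_top1(color_preds, gray_preds):
    """Return (agreement rate, mean score change) for aligned top-1 predictions.

    Each prediction is a (class_id, class_name, score) tuple.
    """
    if not color_preds or len(color_preds) != len(gray_preds):
        raise ValueError("need two non-empty, aligned prediction lists")
    pairs = list(zip(color_preds, gray_preds))
    agreement = sum(c[1] == g[1] for c, g in pairs) / len(pairs)
    mean_score_change = sum(g[2] - c[2] for c, g in pairs) / len(pairs)
    return agreement, mean_score_change

# The first two color predictions are from the Problem 6 output;
# the grayscale ones are hypothetical.
color_top1 = [('n02127052', 'lynx', 0.8168), ('n02123045', 'tabby', 0.4581)]
gray_top1 = [('n02127052', 'lynx', 0.6000), ('n02124075', 'Egyptian_cat', 0.3000)]

agreement, mean_score_change = compare_top1(color_top1, gray_top1)
print(f"agreement={agreement:.2f}, mean score change={mean_score_change:+.4f}")
```

Since `decode_predictions(preds, top=1)` returns a one-element list per image, the tuples would be taken as `pred[0]` from each entry.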

In [33]:
model = ResNet50(weights='imagenet')

In [38]:
# Define the target size for the images
TARGET_SIZE = (224, 224)

# Function to preprocess a single image to grayscale
def preprocess_image_grayscale(image_path):
    """
    Preprocess a single image to grayscale for ResNet50.
    
    Args:
    image_path (str): Path to the image file.
    
    Returns:
    np.array: Preprocessed grayscale image array.
    """
    image = Image.open(image_path).convert('L')  # Convert image to grayscale
    image = image.resize(TARGET_SIZE)  # Resize image to 224x224 pixels
    image_array = img_to_array(image)  # Convert image to numpy array
    image_array = np.stack((image_array,)*3, axis=-1).squeeze()  # Duplicate the grayscale channel to create a 3-channel image
    image_array = preprocess_input(image_array)  # Preprocess input as done for ResNet50
    return image_array
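
For intuition about what `convert('L')` does: Pillow documents the conversion as the ITU-R 601-2 luma transform, a weighted sum of the three color channels. The sketch below applies that weighting to one pixel in plain Python and then copies the grey value into three channels, as the function above does for whole images. The helper names are hypothetical, and Pillow itself works with integer pixel values, so its result is rounded.

```python
def luma(rgb):
    """Weighted sum of R, G, B using the ITU-R 601-2 weights."""
    r, g, b = rgb
    return (r * 299 + g * 587 + b * 114) / 1000

def replicate_gray(value):
    """Copy one grey value into three identical channels."""
    return (value, value, value)

orange = (255, 165, 0)
gray_value = luma(orange)
gray_pixel = replicate_gray(gray_value)
print(gray_value, gray_pixel)
```

Green contributes the most to the grey value and blue the least, so two colors that look very different can map to similar greys, which is part of the information the model loses.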

In [39]:
def preprocess_and_save_batches_grayscale(root_directory, output_directory, batch_size=4):
    """
    Process images in batches to grayscale and save the preprocessed arrays to disk.
    
    Args:
    root_directory (str): Path to the root directory containing subdirectories with images.
    output_directory (str): Path to the directory where preprocessed arrays will be saved.
    batch_size (int): Number of images to process in each batch.
    """
    if not os.path.exists(output_directory):
        os.makedirs(output_directory)
    
    all_images = []
    labels = []
    batch_counter = 0

    for subdir in os.listdir(root_directory):
        subdir_path = os.path.join(root_directory, subdir)
        if os.path.isdir(subdir_path):
            for filename in os.listdir(subdir_path):
                file_path = os.path.join(subdir_path, filename)
                if os.path.isfile(file_path) and filename.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp', '.gif')):
                    try:
                        image_array = preprocess_image_grayscale(file_path)
                        all_images.append(image_array)
                        labels.append(subdir)
                        if len(all_images) == batch_size:
                            batch_filename = f'batch_{batch_counter}.npz'
                            batch_filepath = os.path.join(output_directory, batch_filename)
                            np.savez(batch_filepath, images=np.array(all_images), labels=np.array(labels))
                            all_images = []
                            labels = []
                            batch_counter += 1
                    except Exception as e:
                        print(f"Error processing image {file_path}: {e}")
    
    if all_images:
        batch_filename = f'batch_{batch_counter}.npz'
        batch_filepath = os.path.join(output_directory, batch_filename)
        np.savez(batch_filepath, images=np.array(all_images), labels=np.array(labels))

# Example usage
root_directory_path = 'data'
output_directory_path_grayscale = 'data/batch_grayscale'
preprocess_and_save_batches_grayscale(root_directory_path, output_directory_path_grayscale)


In [41]:
# Path to the directory where grayscale batches are saved
output_directory_path_grayscale = 'data/batch_grayscale'

# Make predictions
predictions = []
for images, _ in load_batches(output_directory_path_grayscale):
    preds = model.predict(images)
    decoded_preds = decode_predictions(preds, top=1)
    predictions.extend(decoded_preds)

# Display some of the predictions
num_images_to_display = 5
for i, pred in enumerate(predictions[:num_images_to_display]):
    print(f"Image {i+1}:")
    class_id, class_name, score = pred[0]
    print(f" - {class_id}: {class_name} (score: {score:.4f})")
    print()

# Display a summary of the total number of images and categories
print(f"Displayed top {num_images_to_display} predictions out of {len(predictions)} total images.")
unique_categories = set()
for pred in predictions:
    for class_id, class_name, score in pred:
        unique_categories.add(class_name)
print(f"Total unique predicted categories: {len(unique_categories)}")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 478ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 630ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 616ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 488ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 587ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 596ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 552ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 567ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 1s/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 794ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 717ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 801ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 564ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ [output truncated]

### Problem 8. Deep image features (1 point)
Find a way to extract one-dimensional vectors (features) for each (non-grayscale) image, using your model. This is typically done by "short-circuiting" the model output to be an intermediate layer, while keeping the input the same. 

In case the outputs (also called feature maps) have different shapes, you can flatten them in different ways. Try to not create huge vectors; the goal is to have a relatively short sequence of numbers which describes each image.
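
As a rough illustration of the flattening options mentioned above, the sketch below uses a random array standing in for one feature map. The shape `(7, 7, 2048)` is what ResNet50's last convolutional block produces for a 224x224 input; the values themselves are made up. It compares plain flattening with global average pooling and global max pooling.

```python
import numpy as np

# Random stand-in for one feature map (height, width, channels); not real model output
rng = np.random.default_rng(0)
feature_map = rng.random((7, 7, 2048))

# Option 1: flatten everything, which keeps spatial detail but gives a huge vector
flat = feature_map.reshape(-1)

# Option 2: global average pooling, one number per channel
avg_pooled = feature_map.mean(axis=(0, 1))

# Option 3: global max pooling, also one number per channel
max_pooled = feature_map.max(axis=(0, 1))

print(flat.shape)        # (100352,)
print(avg_pooled.shape)  # (2048,)
print(max_pooled.shape)  # (2048,)
```

Pooling over the spatial axes gives a vector about 49 times shorter than flattening, which is why it is a common choice when a compact descriptor is the goal.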

You may find a tutorial like [this](https://towardsdatascience.com/exploring-feature-extraction-with-cnns-345125cefc9a) pretty useful but note your implementation will depend on what model (and framework) you've decided to use.

It's a good idea to save these as one or more files, so you'll spare yourself a ton of preprocessing.

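1. We will use the ResNet50 model with weights pretrained on ImageNet, drop its classification head, and apply global average pooling to the last feature maps. This gives one 2048-dimensional feature vector per image.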

In [3]:
base_model = ResNet50(weights='imagenet', include_top=False, pooling='avg')

# Create a new model that outputs features from the 'avg_pool' layer
model = Model(inputs=base_model.input, outputs=base_model.get_layer('avg_pool').output)

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5
[1m94765736/94765736[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m21s[0m 0us/step


2. Next, we resize each image to 224x224, the input size the model expects, and convert it to an array.

In [4]:
TARGET_SIZE = (224, 224)

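# Function to preprocess a single image
def preprocess_image(image_path):
    # Force 3-channel RGB so grayscale, palette, or RGBA files match the model input
    image = Image.open(image_path).convert('RGB')
    image = image.resize(TARGET_SIZE)
    image_array = img_to_array(image)
    image_array = np.expand_dims(image_array, axis=0)  # add the batch dimension
    image_array = preprocess_input(image_array)  # ResNet50-specific normalization
    return image_array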
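
For intuition about what `preprocess_input` does to the array, here is a NumPy sketch of the "caffe"-style preprocessing that, to our understanding, Keras applies for ResNet50: the channels are reordered from RGB to BGR and the per-channel ImageNet mean is subtracted, with no rescaling. The function name `caffe_style_preprocess` and the constant name are our own, and the gray test image is made up. The notebook itself should keep using the library function.

```python
import numpy as np

# Per-channel ImageNet means in BGR order (assumed to match Keras' 'caffe' mode)
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def caffe_style_preprocess(rgb_batch):
    """Hypothetical re-implementation for illustration only.

    rgb_batch: float array of shape (N, H, W, 3), RGB order, values in [0, 255].
    """
    bgr = rgb_batch[..., ::-1].astype(np.float32)  # RGB -> BGR
    return bgr - IMAGENET_BGR_MEAN                 # zero-center each channel

# A uniform mid-gray image as a stand-in for a real photo
batch = np.full((1, 224, 224, 3), 128.0, dtype=np.float32)
out = caffe_style_preprocess(batch)

print(out.shape)     # (1, 224, 224, 3)
print(out[0, 0, 0])  # one pixel after centering, in BGR order
```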

3. Then we extract the features (only from image files, skipping the `.npz` batches already saved in the data folder) and write them in batches to a new folder.

In [5]:
def extract_features(image_directory, output_directory, batch_size=4):
    if not os.path.exists(output_directory):
        os.makedirs(output_directory)
    
    all_features = []
    labels = []
    batch_counter = 0

    for subdir in os.listdir(image_directory):
        subdir_path = os.path.join(image_directory, subdir)
        if os.path.isdir(subdir_path):
            for filename in os.listdir(subdir_path):
                file_path = os.path.join(subdir_path, filename)
                if os.path.isfile(file_path) and filename.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp', '.gif')):
                    try:
                        image_array = preprocess_image(file_path)
                        features = model.predict(image_array)
                        all_features.append(features)
                        labels.append(subdir)
                        if len(all_features) == batch_size:
                            batch_filename = f'batch_{batch_counter}.npz'
                            batch_filepath = os.path.join(output_directory, batch_filename)
                            np.savez(batch_filepath, features=np.vstack(all_features), labels=np.array(labels))
                            all_features = []
                            labels = []
                            batch_counter += 1
                    except Exception as e:
                        print(f"Error processing image {file_path}: {e}")
    
    if all_features:
        batch_filename = f'batch_{batch_counter}.npz'
        batch_filepath = os.path.join(output_directory, batch_filename)
        np.savez(batch_filepath, features=np.vstack(all_features), labels=np.array(labels))

# Example usage
image_directory_path = 'data'
output_directory_path = 'data/feature_batches'
extract_features(image_directory_path, output_directory_path)

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 6s/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 195ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 188ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 194ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 201ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 210ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 185ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 220ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 209ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 208ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 200ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 189ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 214ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1

### Problem 9. Putting deep image features to use (1 point)
Try to find similar images, using a similarity metric on the features you got in the previous problem. Two good metrics are `mean squared error` and `cosine similarity`. How do they work? Can you spot images that look too similar? Can you explain why?

\* If we were to take Fourier features (in a similar manner, these should be a vector of about the same length), how do they compare to the deep features; i.e., which features are better to "catch" similar images?
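
One possible way to build the Fourier features mentioned above, sketched under our own assumptions: take the 2D FFT of a grayscale image, keep the log-magnitudes of a small block of low frequencies around the center of the shifted spectrum, and flatten that block. The function name `fourier_features`, the block size `k`, and the random test image are illustrative choices, not part of the lab.

```python
import numpy as np

def fourier_features(gray_image, k=32):
    """Hypothetical helper: low-frequency FFT magnitudes as a 1D vector of length k*k."""
    # Shift the zero frequency to the center of the spectrum
    spectrum = np.fft.fftshift(np.fft.fft2(gray_image))
    # Log scale tames the very large dynamic range of the magnitudes
    magnitude = np.log1p(np.abs(spectrum))
    center_y, center_x = magnitude.shape[0] // 2, magnitude.shape[1] // 2
    half = k // 2
    block = magnitude[center_y - half:center_y + half, center_x - half:center_x + half]
    return block.reshape(-1)

# Random stand-in for a 224x224 grayscale image
rng = np.random.default_rng(0)
image = rng.random((224, 224))
vec = fourier_features(image)

print(vec.shape)  # (1024,)
```

Vectors built this way can be passed to the same MSE and cosine comparisons as the deep features. Because only magnitudes are kept, they ignore circular shifts of the image, but they carry no notion of what object is in the picture.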

1. Let's load the feature npz files.

In [8]:
def load_features(output_directory):
    features = []
    labels = []
    for batch_filename in os.listdir(output_directory):
        if batch_filename.lower().endswith('.npz'):
            batch_filepath = os.path.join(output_directory, batch_filename)
            with np.load(batch_filepath) as data:
                features.append(data['features'])
                labels.extend(data['labels'])
    features = np.vstack(features)
    return features, labels

# Load the features
output_directory_path = 'data/feature_batches'
features, labels = load_features(output_directory_path)

print(f"Loaded features with shape: {features.shape}")
print(f"Labels: {labels[:10]}")  # Display first 10 labels

Loaded features with shape: (519, 2048)
Labels: ['african-wildcat', 'african-wildcat', 'african-wildcat', 'african-wildcat', 'african-wildcat', 'african-wildcat', 'african-wildcat', 'african-wildcat', 'african-wildcat', 'african-wildcat']
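
The batch files written earlier and read back here are plain `.npz` archives holding a `features` array and a `labels` array. A self-contained round trip, using a temporary directory and made-up arrays in place of the real features, looks roughly like this:

```python
import os
import tempfile
import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write two fake batches of 4 "feature vectors" each
    for batch_idx in range(2):
        feats = np.full((4, 2048), batch_idx, dtype=np.float32)
        labs = np.array([f'species-{batch_idx}'] * 4)
        np.savez(os.path.join(tmp_dir, f'batch_{batch_idx}.npz'),
                 features=feats, labels=labs)

    # Read them back; sorting makes the file order deterministic
    loaded_features, loaded_labels = [], []
    for name in sorted(os.listdir(tmp_dir)):
        with np.load(os.path.join(tmp_dir, name)) as data:
            loaded_features.append(data['features'])
            loaded_labels.extend(data['labels'].tolist())

features_demo = np.vstack(loaded_features)
print(features_demo.shape)  # (8, 2048)
print(loaded_labels[0], loaded_labels[-1])
```

`os.listdir` returns files in arbitrary order, so the row order of the stacked matrix can differ between runs or machines. This does not break anything as long as features and labels are always read from the same file together, as they are here.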


2. Now let's calculate the mean squared error (MSE) and compare the first image with the others. For two feature vectors, MSE is the average of the squared differences between their corresponding elements. Smaller values indicate higher similarity, larger values indicate less, and an MSE of 0 means the two vectors are identical.

In [12]:
def calculate_mse(vectors):
    num_images = vectors.shape[0]
    mse_matrix = np.zeros((num_images, num_images))
    for i in range(num_images):
        for j in range(num_images):
            mse_matrix[i, j] = np.mean((vectors[i] - vectors[j]) ** 2)
    return mse_matrix


mse_matrix = calculate_mse(features)
print(f"MSE between first image and others: {mse_matrix[0]}")

MSE between first image and others: [0.         0.37877962 0.07573757 0.38796917 0.22309396 0.39943331
 0.30580473 0.59959036 0.56260014 0.23024234 0.36158589 0.39420623
 0.37033033 0.33620864 0.54860342 0.57835162 0.6093694  0.50424129
 0.52705228 0.55588794 0.35978228 0.30754554 0.36636949 0.47303694
 0.48573199 0.4300279  0.50452161 0.28388995 0.39043629 0.61490542
 0.48971128 0.57558048 0.50684738 0.47189206 0.58065957 0.35639206
 0.66727912 0.41568285 0.44240475 0.48756316 0.62976539 0.60244602
 0.42003947 0.59128141 0.43566886 0.48390022 0.57347167 0.42933863
 0.57445586 0.47184175 0.5835216  0.36274266 0.42533714 0.25585485
 0.33016104 0.374672   0.48383433 0.58440197 0.7091161  0.64668798
 0.46402848 0.43040338 0.50751209 0.35953662 0.45913994 0.61401737
 0.34311393 0.48211497 0.33352411 0.47592479 0.39059421 0.41667575
 0.37069044 0.4488906  0.40772903 0.33125859 0.39303729 0.32425994
 0.48499852 0.33496648 0.46760896 0.4195644  0.40029937 0.45730662
 0.56774509 0.42447639 0.3
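
The double loop above is easy to read but does one Python-level iteration per pair of images. A vectorized alternative, sketched here on small random vectors rather than the real features, uses the identity ||a - b||² = ||a||² + ||b||² - 2 a·b. The name `calculate_mse_vectorized` is ours, and the loop version is repeated only so the two can be compared.

```python
import numpy as np

def calculate_mse_vectorized(vectors):
    """Pairwise MSE between all rows, computed without Python loops."""
    vectors = np.asarray(vectors, dtype=np.float64)
    sq_norms = np.sum(vectors ** 2, axis=1)
    # Squared Euclidean distance between every pair of rows
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (vectors @ vectors.T)
    # Rounding can produce tiny negative values; clip them to zero
    np.maximum(sq_dists, 0.0, out=sq_dists)
    return sq_dists / vectors.shape[1]

def calculate_mse_loop(vectors):
    """Reference implementation with the same logic as the notebook's double loop."""
    n = vectors.shape[0]
    result = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            result[i, j] = np.mean((vectors[i] - vectors[j]) ** 2)
    return result

# Small random stand-in for the feature matrix
rng = np.random.default_rng(0)
demo = rng.random((6, 16))
fast = calculate_mse_vectorized(demo)
slow = calculate_mse_loop(demo)

print(np.allclose(fast, slow))  # True
```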

3. For MSE, the larger the value, the less similar the images are. For clarity, let's compare the first picture with just the second and third. The MSE with the second picture is 0.379, which indicates a noticeable difference. The MSE with the third is 0.076, so the first and third images are much more similar to each other than the first and second.

4. Cosine similarity measures the cosine of the angle between two non-zero vectors. It is calculated as the dot product of the vectors divided by the product of their magnitudes. It ranges from -1 to 1: 1 means the vectors point in the same direction (they may still differ in length), 0 means they are orthogonal (no similarity), and -1 means they point in opposite directions. Our features are averages of ReLU outputs, so they are non-negative and the values here fall between 0 and 1.
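
To make the definition concrete, here is a small NumPy sketch with made-up vectors. It also shows the main practical difference from MSE: rescaling a vector leaves its cosine similarity unchanged but increases the MSE. The helper name `cosine_sim` is ours.

```python
import numpy as np

def cosine_sim(a, b):
    """Dot product divided by the product of the vector lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = 2.0 * a                      # same direction, twice the length
c = np.array([3.0, 0.0, -1.0])   # orthogonal to a (dot product is 0)

print(cosine_sim(a, b))          # very close to 1: same direction
print(cosine_sim(a, c))          # 0.0: orthogonal
print(np.mean((a - b) ** 2))     # MSE is clearly non-zero despite identical direction
```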

In [10]:
def calculate_cosine_similarity(feature_vectors):
    return cosine_similarity(feature_vectors)

cosine_sim_matrix = calculate_cosine_similarity(features)
print(f"Cosine similarity between first image and others: {cosine_sim_matrix[0]}")

Cosine similarity between first image and others: [1.         0.6143305  0.93541    0.68991345 0.78166956 0.6171157
 0.7498529  0.52535856 0.5231792  0.7751142  0.7174376  0.6535353
 0.62916476 0.66737723 0.47035486 0.5810168  0.56562316 0.6183199
 0.5992762  0.58026534 0.64165026 0.70869184 0.62582105 0.5951696
 0.5908753  0.5611137  0.54961604 0.7120267  0.62362903 0.51876044
 0.56814414 0.53951246 0.48970497 0.55752224 0.53652287 0.6356283
 0.5678461  0.6051136  0.5883862  0.59970117 0.5034972  0.51694435
 0.5906159  0.46608248 0.57741666 0.5782247  0.49412194 0.5842365
 0.4631915  0.5757779  0.54363394 0.7402617  0.6042743  0.7681972
 0.75123215 0.6035656  0.62449384 0.47191623 0.48299083 0.5521125
 0.51562595 0.5668598  0.5332289  0.6716937  0.65260637 0.60637426
 0.649773   0.53463644 0.6581549  0.59100807 0.58421504 0.6004367
 0.6518882  0.6043795  0.65605956 0.6718945  0.60351336 0.6874949
 0.5559423  0.67538327 0.5765457  0.5896606  0.61568856 0.5625969
 0.56360215 0.61154896 

5. Let's compare the first picture with the second and third. The similarity between the first and second is 0.61, which is moderately high but indicates some differences. Between the first and third it is 0.94, which is very high and clearly better than for the first and second. This agrees with the ranking we got from MSE.

In [14]:
def find_similar_images(similarity_matrix, labels, top_n=5, metric='cosine'):
    """
    Finds the most similar image pairs based on the given similarity matrix.

    Args:
    similarity_matrix (np.array): The similarity matrix.
    labels (list): List of labels for the images.
    top_n (int): Number of similar images to find for each image.
    metric (str): Metric used for similarity ('cosine' or 'mse').

    Returns:
    list: List of tuples containing similar image pairs and their similarity scores.
    """
    num_images = similarity_matrix.shape[0]
    similar_pairs = []

    for i in range(num_images):
        order = np.argsort(similarity_matrix[i])  # ascending scores
        if metric == 'cosine':
            order = order[::-1]  # highest similarity first
        elif metric != 'mse':
            raise ValueError("Metric must be 'cosine' or 'mse'")

        # Drop the image itself explicitly (for MSE it would otherwise rank first with score 0)
        similar_indices = order[order != i][:top_n]

        for j in similar_indices:
            similar_pairs.append((i, j, similarity_matrix[i, j]))

    # Sort pairs based on similarity score
    if metric == 'cosine':
        similar_pairs.sort(key=lambda x: -x[2])  # Descending order for cosine similarity
    elif metric == 'mse':
        similar_pairs.sort(key=lambda x: x[2])  # Ascending order for MSE

    return similar_pairs

# Example usage:

# NOTE: these are random placeholder matrices, not the real ones computed above.
# To rank the actual images, pass the real `cosine_sim_matrix` and `mse_matrix` instead.
demo_cosine_matrix = np.random.random((10, 10))
demo_mse_matrix = np.random.random((10, 10))
demo_labels = [f'Image {i}' for i in range(10)]  # Example labels

# Find similar images using cosine similarity
similar_pairs_cosine = find_similar_images(demo_cosine_matrix, demo_labels, metric='cosine')
print(f"Top 5 similar image pairs (Cosine Similarity): {similar_pairs_cosine[:5]}")

# Find similar images using MSE
similar_pairs_mse = find_similar_images(demo_mse_matrix, demo_labels, metric='mse')
print(f"Top 5 similar image pairs (MSE): {similar_pairs_mse[:5]}")

Top 5 similar image pairs (Cosine Similarity): [(6, 8, 0.8554852095716958), (1, 3, 0.8535020725827196), (5, 3, 0.8400757725011545), (0, 1, 0.832379748624364), (1, 2, 0.8157922306725823)]
Top 5 similar image pairs (MSE): [(4, 7, 0.00849165402986618), (7, 9, 0.012013392024179281), (5, 7, 0.02295646799599449), (3, 8, 0.02564538755445278), (2, 3, 0.032509797812538155)]
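
When the matrices come from real features they are symmetric, so each pair shows up twice, as `(i, j)` and as `(j, i)`. One way to rank each unordered pair only once is to look at the upper triangle of the matrix. The sketch below assumes a symmetric matrix and uses a tiny made-up feature set; `top_unique_pairs` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def top_unique_pairs(matrix, top_n=3, higher_is_better=True):
    """Rank each unordered pair (i < j) once, using the upper triangle only."""
    rows, cols = np.triu_indices(matrix.shape[0], k=1)  # k=1 skips the diagonal
    scores = matrix[rows, cols]
    order = np.argsort(scores)
    if higher_is_better:
        order = order[::-1]
    return [(int(rows[t]), int(cols[t]), float(scores[t])) for t in order[:top_n]]

# Four made-up 2D "feature vectors": rows 0 and 1 are close, rows 2 and 3 are close
demo = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])

unit = demo / np.linalg.norm(demo, axis=1, keepdims=True)
cos = unit @ unit.T
mse = ((demo[:, None, :] - demo[None, :, :]) ** 2).mean(axis=2)

print(top_unique_pairs(cos, top_n=2, higher_is_better=True))
print(top_unique_pairs(mse, top_n=2, higher_is_better=False))
```

Both metrics pick out the pairs (0, 1) and (2, 3) on this toy data, without self-matches or mirrored duplicates.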


6. The output above lists the highest-scoring pairs for each metric. Note that this run used random placeholder matrices, so the specific pairs are only illustrative; passing the real `cosine_sim_matrix` and `mse_matrix` gives the actual most similar images.
7. Many images come out as fairly similar to each other, which can have several causes.
8. For MSE, images that are visually similar have similar feature representations, which results in a low MSE. Images from similar categories or with similar textures, shapes, and colors have feature vectors that are close to each other.
9. Cosine similarity considers only the orientation (direction) of the vectors. If two images have feature vectors pointing in the same direction, their cosine similarity is high. Images that share similar patterns, textures, and overall structures have similarly oriented feature vectors.
10. In our case, all pictures are of cats, and some of them are from the same species. The cats may also have similar poses, or the backgrounds may be similar.
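
One way to check the "same category" explanation is to measure how often an image's nearest neighbour carries the same species label. The sketch below uses two synthetic clusters in place of the real features and labels; `nearest_neighbour_label_agreement` is a hypothetical helper, and it assumes every row is a non-zero vector.

```python
import numpy as np

def nearest_neighbour_label_agreement(feature_matrix, label_list):
    """Fraction of rows whose most cosine-similar other row has the same label."""
    unit = feature_matrix / np.linalg.norm(feature_matrix, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)      # never pick the image itself
    nearest = np.argmax(sims, axis=1)
    label_array = np.asarray(label_list)
    return float(np.mean(label_array == label_array[nearest]))

# Two well-separated synthetic clusters standing in for two species
rng = np.random.default_rng(0)
centre_a = np.zeros(8)
centre_a[0] = 1.0
centre_b = np.zeros(8)
centre_b[1] = 1.0
cluster_a = centre_a + 0.05 * rng.random((5, 8))
cluster_b = centre_b + 0.05 * rng.random((5, 8))

demo_features = np.vstack([cluster_a, cluster_b])
demo_labels = ['sand-cat'] * 5 + ['jungle-cat'] * 5

print(nearest_neighbour_label_agreement(demo_features, demo_labels))  # 1.0
```

On the real features, a value well above what random guessing would give would suggest that species membership explains much of the similarity, while a low value would point to other factors such as pose or background.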

### * Problem 10. Explore, predict, and evaluate further
You can do a ton of things here, at your desire. For example, how does masking different areas of the image affect classification - a method known as **saliency map** ([info](https://en.wikipedia.org/wiki/Saliency_map))? Can we detect objects? Can we significantly reduce the number of features (keeping the quality) that we get? Can we reliably train a model to predict our own classes? We'll look into these in detail in the future.