# Image Analysis - Topic 2 - Real Datasets - Part 01

## Topic Objectives

Use a real dataset to explore the label distribution, build an image montage, and compute the average image, the image variability and the contrast between two average images.

### Import Packages for Learning

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("white")

# Let's extract the image files we will be using
from zipfile import ZipFile

with ZipFile('Chess.zip', 'r') as chessZip:
   chessZip.extractall()

## Image Analysis - Real Datasets - Part 01

Real image datasets are different from the toy datasets we find in the ML libraries.

Real images, unlike toy datasets, can vary significantly in size. It's unrealistic to expect all images to conform to a standard size, such as 100 x 50 pixels.
A common approach is to arrange the downloaded images in a folder that contains one sub-folder per label, with the image files for that label stored inside the sub-folder.

Ultimately, we want to have three folders: Train, Validation and Test. However, your dataset may come with:

- One folder (with sub-folders as the labels)
- Two folders (like Train and Test)
- Three folders (Train, Validation and Test)

Let's imagine our dataset is called Animals_Image, and there are three labels: Dog, Cat and Parrot. The data could come in any of the three formats listed above.

In practice, each label folder holds a whole set of images, not just one or two.

We want our data to be in a three-directory structure. We will need to move files between directories and do that programmatically. We will cover how to do that in the next topic.

In this topic, we will use a dataset that comes with one directory only.

### Image Analysis

We will use the following workflow to start our image analysis study in this unit:

1 - Set data directory
2 - Delete non-image files
3 - Assess Labels Distribution
4 - Build an Image Montage
5 - Calculate Average Image and Image Variability
6 - Contrast Between 2 Labels
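
For step 2, one possible approach is sketched below. This is a minimal sketch, not the course's own implementation: the function name remove_non_image_files and the default list of accepted extensions are assumptions made for illustration.

```python
import os


def remove_non_image_files(dir_path, image_extensions=(".png", ".jpg", ".jpeg")):
    """Delete files under dir_path/<label>/ whose extension is not in image_extensions.

    Hypothetical helper: the name and the default extensions are assumptions.
    Returns the list of deleted file paths.
    """
    removed = []
    for label in sorted(os.listdir(dir_path)):
        label_dir = os.path.join(dir_path, label)
        if not os.path.isdir(label_dir):
            continue  # skip stray files at the top level
        for filename in os.listdir(label_dir):
            file_path = os.path.join(label_dir, filename)
            # Compare in lower case so that ".JPG" counts as an image
            is_image = filename.lower().endswith(image_extensions)
            if os.path.isfile(file_path) and not is_image:
                os.remove(file_path)
                removed.append(file_path)
    return removed
```

Once the data directory is set, calling remove_non_image_files(my_data_dir) would delete the non-image files in each label folder and return their paths, so you can review what was removed.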

### Set Data Directory

First, locate your data directory, which is the root path of your data. In this case, it is the folder named after the dataset: Chess

In [4]:
my_data_dir = 'Chess'
my_data_dir

'Chess'

The labels are taken from the folder names inside the Chess folder. We list them with os.listdir(), passing 'Chess' as the argument.

In [5]:
import os
labels = os.listdir(my_data_dir)
labels

['Bishop', 'King', 'Knight', 'Pawn', 'Queen', 'Rook']
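
Note that os.listdir() returns every entry in the folder, in arbitrary order, so hidden files (for example .DS_Store) would show up as labels. A minimal sketch that keeps only the sub-folders and sorts them is shown here; the helper name list_label_folders is an assumption for illustration.

```python
import os


def list_label_folders(dir_path):
    """Return the sorted names of the sub-folders in dir_path (hypothetical helper)."""
    return sorted(
        entry
        for entry in os.listdir(dir_path)
        if os.path.isdir(os.path.join(dir_path, entry))
    )
```

With this helper, labels = list_label_folders(my_data_dir) could replace the os.listdir() call when the dataset folder contains stray files.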

### Labels Distribution

We create custom code that:

- Stores in a DataFrame the name of the set (in this case, Chess), the label and its frequency
- Plots the DataFrame as a barplot showing the frequencies

In [6]:
rows = []
for folder in ["Chess"]:
    # Think of 'Chess' as a set folder.
    # Ideally we want a Train set, a Validation set and a Test set.
    for label in labels:
        rows.append(
            {
                "Set": folder,
                "Label": label,
                "Frequency": len(os.listdir(os.path.join(folder, label))),
            }
        )

# DataFrame.append() was removed in pandas 2.0, so we build the frame from a list of rows
df_freq = pd.DataFrame(rows)
df_freq

We plot the DataFrame using a barplot, where x is the Set (in this case, only Chess), y is the Frequency, and hue is the Label.

We notice that the frequencies differ across the labels, and some labels have far fewer images than others.

In [None]:
print("\n")
sns.set_style("whitegrid")
plt.figure(figsize=(8,5))
sns.barplot(data=df_freq, x='Set', y='Frequency', hue='Label')
plt.show()

### Image Montage

As in the previous topic, we want to build an image montage for each label to start understanding the dataset.

The difference is that the dataset is located in directories, not in an array. The key logic difference in this function is that it loops through the image names across different directories, then loads and plots the images in a Figure.

Check the pseudo-code in the docstring to understand the logic.

It is normal and okay if you do not understand all the code in the function below. The central point is to make sense of the pseudo-code and the function parameters.

In [None]:
import itertools
import random
from matplotlib.image import imread


def image_montage(dir_path, label_to_display, nrows, ncols, figsize=(15, 10)):
    """
    Display a montage of images from a specified directory.

    Parameters:
    - dir_path (str): The path to the directory containing the images.
    - label_to_display (str): The label of the images to display.
    - nrows (int): The number of rows in the montage.
    - ncols (int): The number of columns in the montage.
    - figsize (tuple, optional): The size of the figure. Defaults to (15, 10).

    Returns:
    - None

    Logic:
    - Check if the specified label exists in the folder.
    - Check if the montage space is greater than the number of images in the subset.
    - Create a list of axes indices based on nrows and ncols.
    - Create a figure and display the images.
    - Load and plot each image in the loop.

    """
    sns.set_style("white")

    labels = os.listdir(dir_path)

    # Check if the specified label exists in the folder
    if label_to_display in labels:

        # Check that the montage has no more spaces than there are images in the subset
        images_list = os.listdir(dir_path + "/" + label_to_display)
        if nrows * ncols <= len(images_list):
            img_idx = random.sample(images_list, nrows * ncols)
        else:
            print(
                f"Decrease nrows or ncols to create your montage. \n"
                f"There are {len(images_list)} images in your subset. "
                f"You requested a montage with {nrows * ncols} spaces."
            )
            return

        # Create a list of axes indices based on nrows and ncols
        list_rows = range(0, nrows)
        list_cols = range(0, ncols)
        plot_idx = list(itertools.product(list_rows, list_cols))

        # Create a figure and display the images
        # squeeze=False keeps axes two-dimensional even when nrows or ncols is 1
        fig, axes = plt.subplots(
            nrows=nrows, ncols=ncols, figsize=figsize, squeeze=False
        )
        for x in range(0, nrows * ncols):
            row, col = plot_idx[x]
            # Load the image; imread infers the format from the file extension
            img = imread(dir_path + "/" + label_to_display + "/" + img_idx[x])
            img_shape = img.shape
            axes[row, col].imshow(img)
            axes[row, col].set_title(
                f"Width {img_shape[1]}px x Height {img_shape[0]}px"
            )
            axes[row, col].set_xticks([])
            axes[row, col].set_yticks([])
        plt.tight_layout()
        plt.show()

    else:
        print("The label you selected doesn't exist.")
        print(f"The existing options are: {labels}")
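
The axes indices in image_montage() come from itertools.product, which pairs every row index with every column index. A small standalone illustration for a 2 x 3 grid:

```python
import itertools

nrows, ncols = 2, 3

# Every (row, col) position in the grid, in row-major order
plot_idx = list(itertools.product(range(nrows), range(ncols)))
print(plot_idx)
# [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```

Each tuple is then used to pick the subplot where the next image is drawn.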

We loop over the labels and create an image montage for each one.

Note that the images have different dimensions.

In [None]:
for label in labels:
    print(label)
    image_montage(
        dir_path=my_data_dir, label_to_display=label, nrows=2, ncols=3, figsize=(10, 15)
    )
    print("\n")

### Average Image and Image Variability per Label

To compute an average image, all images must be the same size. First, we need to determine the average image size so that we can load all images in a uniform array.

We loop over the images in my_data_dir, load each one and store its height and width in dim1 and dim2. Afterwards, we plot the image sizes in a scatterplot and mark the average width and height.

In [None]:
dim1, dim2 = [], []
for label in labels:
    for image_filename in os.listdir(my_data_dir + "/" + label):
        img = imread(my_data_dir + "/" + label + "/" + image_filename)
        img_shape = img.shape
        dim1.append(img_shape[0])  # image height
        dim2.append(img_shape[1])  # image width

sns.set_style("whitegrid")
fig, axes = plt.subplots()
sns.scatterplot(x=dim2, y=dim1, alpha=0.2)
axes.set_xlabel("Width (pixels)")
axes.set_ylabel("Height (pixels)")
dim1_mean = int(np.array(dim1).mean())
dim2_mean = int(np.array(dim2).mean())
axes.axvline(x=dim2_mean, color="r", linestyle="--")
axes.axhline(y=dim1_mean, color="r", linestyle="--")
plt.show()
print(f"Width average: {dim2_mean} \nHeight average: {dim1_mean}")

We need to load all images into a uniform array.

We create a custom function that loops over a directory, where the labels are found as sub-folders. For each label (sub-folder), we load the images, resize them to the average height and width we computed earlier, and store them in an array.

In the end, we have the X and y arrays, where X stores the image pixels and y stores the label of each image.

It is normal and okay if you don't initially understand all the code in the function below. The central point is to make sense of the function parameters.

In [None]:
sns.set_style("white")
from tensorflow.keras.preprocessing import image


def load_image_as_array(my_data_dir, new_size=(50, 50), images_amount=20):
    """
    Load images from a directory and convert them into a numpy array.

    Args:
        my_data_dir (str): The directory path where the images are located.
        new_size (tuple, optional): The desired size of the images. Defaults to (50, 50).
        images_amount (int, optional): The maximum number of images to load per label. Defaults to 20.

    Returns:
        tuple: A tuple containing the loaded images as a numpy array (X) and their corresponding labels (y).
    """

    X, y = np.array([], dtype="int"), np.array([], dtype="object")
    labels = os.listdir(my_data_dir)

    for label in labels:
        counter = 0
        for image_filename in os.listdir(my_data_dir + "/" + label):
            if counter < images_amount:

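                # Load the image and resize it to new_size (height, width)
                img = image.load_img(
                    my_data_dir + "/" + label + "/" + image_filename,
                    target_size=new_size,
                )
                # Convert to an array and scale pixel values to the 0-1 range if needed
                img_resized = image.img_to_array(img)
                if img_resized.max() > 1:
                    img_resized = img_resized / 255

                # Add the image to X, keeping the shape (n_images, height, width, channels)
                X = np.append(X, img_resized).reshape(
                    -1, new_size[0], new_size[1], img_resized.shape[2]
                )
                y = np.append(y, label)
                counter += 1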

    return X, y

The function parameters are:

- my_data_dir: the root directory of the dataset. Here we pass my_data_dir, the Chess folder we set earlier.
- new_size: the target image size. Here we pass the average image dimensions of this dataset, as (height, width).
- images_amount: the maximum number of images to load per label. Loading, resizing and storing image data has a significant computing cost, so use the same amount for every label and set a value that will not take long to load. Be mindful that loading a large number of images may cause memory issues, depending on your available memory.

It may take 2 or 3 minutes to load all images.

In [None]:
X, y = load_image_as_array(
    my_data_dir=my_data_dir, new_size=(dim1_mean, dim2_mean), images_amount=2
)
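
To get a feel for the memory cost before loading anything, we can estimate the size of the resulting array. The sketch below is a rough estimate under stated assumptions: the helper name estimate_array_megabytes is hypothetical, it assumes 3 colour channels and 8 bytes per value (float64), and the numbers in the example call are illustrative, not measured from the Chess dataset.

```python
def estimate_array_megabytes(n_images, height, width, channels=3, bytes_per_value=8):
    """Rough size in megabytes of an array of shape (n_images, height, width, channels).

    Hypothetical helper; channels=3 (RGB) and bytes_per_value=8 (float64) are assumptions.
    """
    total_bytes = n_images * height * width * channels * bytes_per_value
    return total_bytes / 1024**2


# Illustrative numbers only: 6 labels x 20 images, resized to 500 x 500 pixels
estimated_mb = estimate_array_megabytes(6 * 20, 500, 500)
print(f"{estimated_mb:.0f} MB")
```

The estimate grows linearly with images_amount and with the number of pixels per image, which is why a small images_amount is used in the call above.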

We will once again use the function image_avg_and_variability_data_as_array() to understand the average image and image variability for this dataset. This is the same function we used in the last topic, so you should be quite familiar with it by now.

In [None]:
def image_avg_and_variability_data_as_array(X, y, figsize=(12, 5)):
    """
    Calculate the average and variability of images for each label in the dataset and display them.

    Parameters:
    X (numpy.ndarray): The input array of images.
    y (numpy.ndarray): The labels corresponding to each image.
    figsize (tuple, optional): The size of the figure to display the images. Defaults to (12, 5).

    Returns:
    None

    The pseudo-code for the function is:
    * Loop through all labels
    * Subset an array for a given label
    * Calculate the average and standard deviation
    * Create a Figure displaying the average and variability image
    """

    sns.set_style("white")

    for label_to_display in np.unique(y):

        y = y.reshape(-1, 1, 1)
        boolean_mask = np.any(y == label_to_display, axis=1).reshape(-1)
        arr = X[boolean_mask]

        avg_img = np.mean(arr, axis=0)
        std_img = np.std(arr, axis=0)
        print(f"==== Label {label_to_display} ====")
        fig, axes = plt.subplots(nrows=1, ncols=2, figsize=figsize)
        axes[0].set_title(f"Average Image for label {label_to_display}")
        axes[0].imshow(avg_img, cmap="gray")
        axes[1].set_title(f"Image Variability for label {label_to_display}")
        axes[1].imshow(std_img, cmap="gray")
        plt.show()
        print("\n")

We pass X and y to the function to study the average image and image variability per label.

You will likely notice typical patterns or shapes for pieces like kings or knights; however, the images (average and variability) may be too blurred. This blurring happens because we loaded only a few images to demonstrate the use case. If you want, you can go back to load_image_as_array(), set a higher value for images_amount, and rerun image_avg_and_variability_data_as_array().

To help the interpretation, consider the following guide:

- Check for the patterns where the colour is darker or lighter.
- For Average Image, we notice the general patterns for a given label.
- For Image Variability, a lighter area indicates higher variability across images from the same label in that area.

In [None]:
image_avg_and_variability_data_as_array(X=X, y=y, figsize=(12,5))
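
To make the idea concrete, here is a minimal standalone sketch on a tiny synthetic stack of 2 x 2 single-channel images (the pixel values are made up for illustration). Taking the mean and the standard deviation along axis 0 gives one value per pixel position.

```python
import numpy as np

# Three synthetic 2x2 grayscale "images" (made-up values in the 0-1 range)
images = np.array(
    [
        [[0.0, 0.5], [1.0, 0.2]],
        [[0.0, 0.5], [0.0, 0.4]],
        [[0.0, 0.5], [0.5, 0.6]],
    ]
)

avg_img = images.mean(axis=0)  # per-pixel average across the three images
std_img = images.std(axis=0)  # per-pixel variability across the three images

print(avg_img)
print(std_img)
```

The top-row pixels are identical in every image, so their standard deviation is 0 and they would appear dark in the variability plot. The bottom-row pixels change from image to image, so their standard deviation is higher and they would appear lighter.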

Note: There will be datasets where the images in a given label have distinct shapes or patterns, and an average and variability study may not give the same amount of insight as we see in the MNIST dataset.

For example, your dataset may contain images of fish and birds from multiple species.

In that case, when you subset the fish and calculate an average image, the result will be a blend of patterns from multiple fish species, which may confuse a user unfamiliar with the context.

### Contrast between 2 Labels

We may reach a point in our project where we want to know the differences between two classes.

We will use the contrast_between_2_labels_data_as_array() function, which is designed to highlight the differences between two classes.

In [None]:
def subset_image_label(X, y, label_to_display):
    """
    Subsets the input data based on a specific label.

    Parameters:
    X (array-like): The input data array.
    y (array-like): The labels array.
    label_to_display: The label to subset the data on.

    Returns:
    array-like: The subset of the input data where the labels match the specified label_to_display.
    """
    y = y.reshape(-1, 1, 1)
    boolean_mask = np.any(y == label_to_display, axis=1).reshape(-1)
    df = X[boolean_mask]
    return df


def contrast_between_2_labels_data_as_array(X, y, label_1, label_2, figsize=(12, 5)):
    """
    Calculate the contrast between two labels in a dataset and plot the difference, average of label 1, and average of label 2.

    Parameters:
    - X (array-like): The input data array.
    - y (array-like): The target labels array.
    - label_1 (int or str): The first label to compare.
    - label_2 (int or str): The second label to compare.
    - figsize (tuple, optional): The size of the figure. Default is (12, 5).

    Returns:
    None
    """
    sns.set_style("white")

    if (label_1 not in np.unique(y)) or (label_2 not in np.unique(y)):
        print(f"Either label {label_1} or label {label_2} is not in {np.unique(y)} ")
        return

    # calculate the mean from label 1
    images_label = subset_image_label(X, y, label_1)
    label1_avg = np.mean(images_label, axis=0)

    # calculate the mean from label 2
    images_label = subset_image_label(X, y, label_2)
    label2_avg = np.mean(images_label, axis=0)

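    # calculate the absolute difference between the two averages, so that similar
    # areas are dark (near 0) and different areas are light, then plot the difference,
    # the average of label 1 and the average of label 2
    contrast_mean = np.abs(label1_avg - label2_avg)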
    fig, axes = plt.subplots(nrows=1, ncols=3, figsize=figsize)
    axes[0].imshow(contrast_mean, cmap="gray")
    axes[0].set_title(f"Difference Between Averages: {label_1} & {label_2}")
    axes[1].imshow(label1_avg, cmap="gray")
    axes[1].set_title(f"Average {label_1}")
    axes[2].imshow(label2_avg, cmap="gray")
    axes[2].set_title(f"Average {label_2}")
    plt.show()

Let's compare King and Knight.

For the interpretation, rely on the following guide:

- You are comparing label_1 to label_2.
- In the Difference Between Averages plot, the darker area shows where the average images are similar, and the lighter area shows where they are different.
- In this dataset, the contrast may provide little insight since there is a small number of images per label.

In [None]:
contrast_between_2_labels_data_as_array(
    X=X, y=y, label_1="King", label_2="Knight", figsize=(15, 20)
)
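
As a minimal standalone sketch of the idea, the example below uses two synthetic 2 x 2 average images with made-up values, not the Chess data. It takes the absolute difference, an assumption made here so that 0 always means "similar" regardless of which label is brighter.

```python
import numpy as np

# Synthetic average images for two labels (made-up values in the 0-1 range)
label1_avg = np.array([[0.2, 0.8], [0.5, 0.1]])
label2_avg = np.array([[0.2, 0.3], [0.5, 0.9]])

# Absolute difference: 0 where the averages match, larger where they differ
contrast = np.abs(label1_avg - label2_avg)
print(contrast)
```

The left column is 0 because both averages agree there, while the right column holds the larger values and would show up as the lighter area in the plot.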

The same note from the previous section applies here:

There will be datasets where the images in a given label will have distinct shapes or patterns, and a contrast-from-averages study may not provide the same amount of insights as we see in the chess dataset.