# Project for UDA Assessment 4
## Author: 02179784 (Mark Roberts)

<B>Project Description:</B><BR>
The intention of this project is to investigate how practical it is to use a "generic" model for image classification.  The idea is to take general JPEG image files and transform them into a "standard" format using the steps:
 - Converted image from colour to greyscale (pixels have a value in the range: 0-255).
 - Resize images to a standard size (i.e. 350,350).

and then train a simple classification model on these "standardized" images.

This analysis has been, like this notebook, split into the following sections:
 - Benchmark the classification method using the original, unaltered colour images.
 - Standardize the image data.
 - Building the model
 - Determining accuracy of the model
 - Checking how accuracy of the model varies with image size.

Details of the dataset, taken from Kaggle, can be found here: [Monkey Images](https://www.kaggle.com/datasets/slothkong/10-monkey-species).<BR><BR>
It should be noted that only 4 of the 10 original monkey image categories have been used.  This is due to the limitations that have been placed on the dataset size (100MB).

Note that the category of monkey is given by the directory in which the picture is located (e.g. /n0).  The mapping from directories to monkey type is given here:

| Label | Latin Name | Common Name | Train Images | Validation Images |
| --- | --- | --- | --- | --- |
| n0 | alouatta_palliata | mantled_howler | 105 | 26 |
| n1 | erythrocebus_patas | patas_monkey | 111 | 28 |
| n2 | cacajao_calvus | bald_uakari | 110 | 27 |
| n3 | macaca_fuscata | japanese_macaque | 122 | 30 |

Further details of the project can be found in the PDF associated with the assessment.

    
<B>General Information:</B><BR>
 - This script was run on a Macbook Pro. with 32 GB of RAM.
 - The total run time of the Jupyter notebook was (approx) 120 seconds.
 - The Github repository for all code and datasets can be obtained by cloning the repository:
[GitHub-Repo](git@github.com:Mark12481632/UDA_Assessment_4_02179784_Roberts.git)
 - <B>The environment running the note book must allow for the creation of directories!!</B>
    
https://www.kaggle.com/datasets/niteshfre/chessman-image-dataset

In [10]:
# Import the libraries we will use:

import numpy as np
import os
import matplotlib.pyplot as plt

from PIL import Image
from skimage import data, io, color
from IPython.display import display # to display images
from sklearn.metrics import plot_confusion_matrix
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix

In [77]:
# Common variables:

# Directories for datasets:
photo_root = "./monkeys"

source_training_photos = photo_root + "/source_dataset/training"
dest_training_photos = photo_root + "/derived_dataset/training"

source_validation_photos = photo_root + "/source_dataset/validation"
dest_validation_photos = photo_root + "/derived_dataset/validation"

# Restrict our analysis to following monkey categories:
monkey_groups = ["n0", "n1"]

In [84]:
# Functions:

def image_as_np_vector(file):
    """
    """
    img_data = io.imread(file)
    vec = np.reshape(img_data, -1)
    return(vec)


def create_bw_photo(infile, outfile, height=350, width=350):
    """
    This 
    """
    img = Image.open(infile)

    img = img.convert("L").resize((height, width))
    bw_image = np.array(img.getdata(), dtype = np.uint8).reshape(height, width)

    # Sanity check
    if img.size != bw_image.shape:
        print(f"ISSUE when sizing {in_dir + file}")

    io.imsave(outfile, bw_image)


def read_in_images(img_loc, group_list=monkey_groups):
    """
    """
    data_list = []
    label_list = []
    for group in group_list:
        group_files = os.listdir(img_loc + "/" + group)

        for file in group_files:
            img_vec = image_as_np_vector(img_loc + "/" + group + "/" + file)
            data_list.append(img_vec)
            label_list.append(group)

    data = np.concatenate(data_list, axis=0)#.reshape(len(label_list),-1)
    labels = np.array(label_list)

    print('A.', len(label_list), label_list)
    print('B.', len(data_list), data_list[1].shape)
    print('C.', len(data.tobytes()))

    return((data, labels))

## Step 1
In order to get a basline metric on how well we can categorize the monkey images let's first see how well the
model can work on the original colour images of the monkeys.

The simplest classifier, that didn't involke Deep Learning, was the SGDClassifier.  This has 

In [85]:
# Read in images and label them, ready for training and validation:

# Train data:
train_data, train_labels = read_in_images(source_training_photos)

# Shuffle data..
np.random.seed(1234) # Ensure reproducable...
shuffle_index = np.random.choice(len(label_list), size=len(label_list), replace=False)

train_data = train_data[shuffle_index]
train_labels = train_labels[shuffle_index]

print('1', len(train_data), train_data.shape, type(data_list[1]), data_list[1].shape)
print('2', len(label_list))
print('3', train_data[0:5, 0:5])

# Test data:
test_data, test_labels = read_in_images(source_validation_photos)

print(test_data[0:5, 0:5])
print(test_labels[0:5])

A. 216 ['n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n0', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n1', 'n

IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed

In [76]:
img_data = io.imread("./monkeys/source_dataset/training/n0/n0018.jpg")
img_data.shape

(367, 550, 3)

In [52]:
# Train using the colour images
sgd_clf = SGDClassifier(random_state=123, max_iter=25000, tol=1e-5)

# Now look at how well it works on test data
sgd_clf.fit(train_data, train_labels)

SGDClassifier(max_iter=25000, random_state=123, tol=1e-05)

In [53]:
predicted_labels = sgd_clf.predict(test_data)

print(len(predicted_labels))
print((predicted_labels==test_labels).sum())

111
63


## Step 2
In this step we will convert all coloured monkey photos:
 - into black & white photos
 - set to fixed size (150, 150)

This step took less than 20 seconds to complete.<BR><BR>
Kaggle provides two directories, one containing training monkey images and one containing test monkey images.

In [None]:

for group in monkey_groups:
    # Process "training" dataset:
    # Create B&W output directory
    os.makedirs(dest_training_photos + group, exist_ok=True)
        
    # What are the colour photo files
    group_files = os.listdir(source_training_photos + group)
    print(f"DEBUG: Processing Group (Training): {group}, with {len(group_files)} files")

    in_dir = source_training_photos + group + "/"
    out_dir = dest_training_photos + group + "/"

    for file in group_files:
        create_bw_photo(in_dir + file, out_dir + file)


    # Process "validation" dataset:
    # Create B&W output directory
    os.makedirs(dest_validation_photos + group, exist_ok=True)

    # What are the colour photo files        
    group_files = os.listdir(source_validation_photos + group)
    print(f"DEBUG: Processing Group (Validation): {group}, with {len(group_files)} files")

    in_dir = source_validation_photos + group + "/"
    out_dir = dest_validation_photos + group + "/"

    for file in group_files:
        create_bw_photo(in_dir + file, out_dir + file)

In [None]:
# Here we can see an example of a colour image of a monkey and the corresponding black & white image:
orig_img = io.imread(source_training_photos + "/n0/n0018.jpg")

# Show "standardized" version of same image.
new_img = io.imread(dest_training_photos + "/n0/n0018.jpg")

fig = plt.figure(figsize=(8, 5))
fig.suptitle('Corresponding Colour and Black & White Images of a Monkey')
ax_1 = fig.add_subplot(2,2,1)
io.imshow(orig_img)
ax_2 = fig.add_subplot(2,2,2)
io.imshow(new_img)


In [None]:
# Consolidate BW images into labelled dataset:

np.random.seed(1234) # Ensure reproducable...

def image_as_np_vector(file):
    """
    """
    img_data = io.imread(file)
    vec = np.reshape(img_data, -1)
    return(vec)
    
# Train data:
data_list = []
label_list = []
for group in monkey_groups:
    group_files = os.listdir(dest_training_photos + group)
    for file in group_files:
        img_vec = image_as_np_vector(dest_training_photos + group + "/" + file)
        data_list.append(img_vec.tolist())
        label_list.append(group[1:]) # Remove preceeding '/'

# Shuffle data..
shuffle_index = np.random.choice(len(label_list), size=len(label_list), replace=False)

train_data = np.concatenate(data_list, axis=0)[shuffle_index]
train_labels = np.array(label_list)[shuffle_index]

print(len(train_data), train_data.shape, type(data_list[1]))
print(train_data[0:5])
print(train_labels[0:5])

"""
# Test data:
data_list = []
label_list = []
for group in monkey_groups:
    group_files = os.listdir(dest_validation_photos + group)
    for file in group_files:
        img_vec = image_as_np_vector(dest_validation_photos + group + "/" + file)
        data_list.append(img_vec)
        label_list.append(group[1:]) # Remove preceeding '/'

# Shuffle data..
shuffle_index = np.random.choice(len(label_list), size=len(label_list), replace=False)

test_data = np.concatenate(data_list, axis=0)[shuffle_index]
test_labels = np.array(label_list)[shuffle_index]

print(test_data[0:5, 0:5])
print(test_labels[0:5])
"""

In [None]:
train_data.shape

## Step 2
In this step we need to create a model and train it.

In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix

sgd_clf = SGDClassifier(random_state=42, max_iter=1000, tol=1e-3)
sgd_clf.fit(train_data, train_labels)

In [None]:
predicted_labels = sgd_clf.predict(test_data)

print(len(predicted_labels))
print((predicted_labels==test_labels).sum())

In [None]:
all_labels = ['n0', 'n1', 'n2', 'n3']

conf_mat = confusion_matrix(test_labels, predicted_labels)
print(conf_mat)

