# Objectives



* To explore transfer learning, as covered in the Week 9 lecture, and apply it to pistachio image classification.
> **Remember**: It is your responsibility as a machine learning scientist to read the documentation for each library function used in the code, so that you thoroughly understand what it does, how it serves the purpose highlighted in the code comments, and which other parameters could be set.

# Section 1 - Load Pistachio Image dataset


1. <ins>Dataset information</ins>

This week, we will use the *Pistachio Image dataset*, which contains raw images and their labels.
>See the publication below for some info on the dataset:
  * OZKAN IA., KOKLU M. and SARACOGLU R. (2021). Classification of Pistachio Species Using Improved K-NN Classifier. Progress in Nutrition, Vol. 23, N. 2. https://doi.org/10.23751/pn.v23i2.9686.


2. <ins>Dataset download</ins>

You need to download the data before you can get started. Download from https://www.muratkoklu.com/datasets/. The link for the *Pistachio_Image_Dataset* is in the first table on the page.

3. <ins> Dataset upload</ins>

You should now have a *Pistachio_Image_Dataset.zip* file. This file can be uploaded to your Colab directory using the File menu in Google Colab. Once the upload is complete, you should be able to see the file in the listed contents of your Colab directory.

4. <ins> Dataset folder structure</ins>

The folder has three sets of data (and so, three subfolders):

* Pistachio_Image_Dataset - containing image data as jpg files split between two folders named with the classification label of interest.

* Pistachio_16_Features_Dataset - containing 16 hand-crafted (i.e. engineered) features extracted from the image data, represented as a feature vector and labels provided in different file formats including xls.
>Read about the 16 features here: https://www.researchgate.net/profile/Ridvan-Saracoglu/publication/353121533_Classification_of_Pistachio_Species_Using_Improved_k-NN_Classifier/links/60e8213930e8e50c01f0e73f/Classification-of-Pistachio-Species-Using-Improved-k-NN-Classifier.pdf

* Pistachio_28_Features_Dataset - containing 28 hand-crafted (i.e. engineered) features extracted from the image data, represented as a feature vector and labels provided in different file formats including xls. (It is not very clear how the 28 features were extracted, but I expect that they include the 16 features above.)

5. <ins> Loading the dataset</ins>

You can use the code below to unzip the folder (first code cell), load the hand-crafted features data with labels (second code cell), and pre-load the raw image data with labels (third code cell).
>Only the file paths of the images are loaded here ('pre-load'), as loading all the images into memory at once would use memory inefficiently.

>Note that the two data versions (hand-crafted features and image) are loaded such that they are matched for the same data instance. (At least there is an attempt to do so based on what is understood of the data arrangement.)

>You may want to browse through the files yourself to see what they contain.
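
The idea of keeping only file paths in memory and reading images when they are needed can be sketched as follows. This is a minimal illustration, not part of the lab code: the helper `iter_batches` and the file names are hypothetical, and the actual image reading step is left out.

```python
def iter_batches(paths, batch_size):
    """Yield successive slices of `paths`, each at most `batch_size` long."""
    for start in range(0, len(paths), batch_size):
        yield paths[start:start + batch_size]


# Hypothetical file names standing in for the pre-loaded image paths.
example_paths = [f"img_{i}.jpg" for i in range(5)]

# Only one small group of paths is handled at a time; in practice each
# group would be read from disk, processed, and then discarded.
batches = list(iter_batches(example_paths, batch_size=2))
for batch in batches:
    print(batch)
```

With this pattern, memory use depends on the batch size rather than on the total number of images.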


**Unzipping**

In [None]:
from zipfile import ZipFile

# specify the full path for the uploaded zipped dataset file
data_folder_full_path = "/content/Pistachio_Image_Dataset.zip"


# load the zip file and create a zip object
with ZipFile(data_folder_full_path, 'r') as datasetFolderObject:

    # and then extracting the contents to the main directory
    datasetFolderObject.extractall(path="/content")

# lists the content of your main Colab directory
!ls  /content

**Loading the *Pistachio_28_Features_Dataset* data**

In [None]:
import numpy
import pandas



# specify the full path for the unzipped xls file for the 28_features version
# of the dataset
feat_data_file_full_path = "/content/Pistachio_Image_Dataset/Pistachio_28_Features_Dataset/Pistachio_28_Features_Dataset.xls"

# load the data from the xls file
feat_28_data_pandas = pandas.read_excel(feat_data_file_full_path)
feat_28_data = feat_28_data_pandas.to_numpy()

# output the shape of the loaded data to the screen
print("\n The dataset has shape: "+str(feat_28_data.shape))


# get the features and the labels from the loaded data
# note that the label column is the last column of the file
feat_col = numpy.arange(0, feat_28_data.shape[1]-1)
label_col = feat_28_data.shape[1]-1

feats_28 = feat_28_data[:, feat_col]
labels = feat_28_data[:, label_col]

print("\n A peek at the 28-features dataset features: \n"+str(feats_28))
print("\n A peek at the 28-features dataset labels: \n"+str(labels))

# recode the nominal labels as numeric
numpy.place(labels, labels=='Kirmizi_Pistachio', '0')
numpy.place(labels, labels=='Siirt_Pistachio', '1')
print("\n The recoded labels: ", str(numpy.unique(labels)))

# convert from string type to integer type
labels = labels.astype(int)
print("\n A peek at the recoded 28-features dataset labels: \n"+str(labels))
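
The recoding step above (class names to integers) can also be written with an explicit dictionary, which keeps the name-to-number mapping visible in one place. The sketch below uses a small made-up label array rather than the real dataset; `label_map` and `toy_labels` are illustrative names only.

```python
import numpy

# Explicit mapping from class name to integer code (same coding as above).
label_map = {'Kirmizi_Pistachio': 0, 'Siirt_Pistachio': 1}

# A made-up label column standing in for the last column of the xls file.
toy_labels = numpy.array(
    ['Kirmizi_Pistachio', 'Siirt_Pistachio', 'Kirmizi_Pistachio'], dtype=object)

# Look up each name in the mapping; an unknown name raises a KeyError,
# which makes unexpected label values easy to spot.
encoded = numpy.array([label_map[name] for name in toy_labels], dtype=int)
print(encoded)
```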


**Pre-load the *Pistachio_Image_Dataset* data**

In [None]:
import os
from torchvision.io.image import read_image
from torchvision.transforms.functional import to_pil_image
import matplotlib.pyplot as plt

random_seed = 1

# specify the full path for the folders containing
# the image version of the dataset
image_source_dir_kirmizi = "/content/Pistachio_Image_Dataset/Pistachio_Image_Dataset/Kirmizi_Pistachio"
image_source_dir_siirt = "/content/Pistachio_Image_Dataset/Pistachio_Image_Dataset/Siirt_Pistachio"


# create a method for getting the path addresses for the (image) contents
# of the image source folders
# as we also want to relate the image instances to the data instances (rows)
# in the hand-crafted features version of the dataset, we also need to get
# the instance id in the file names of the images.
# (assuming that the row number corresponds to the matching instance id
# for the hand-crafted features version of the dataset)
#
# NOTE that some filenames have the instance id repeated in bracket
# for another instance, so we need to address that.
def get_filepaths(image_source_dir, num_id_bracketed, limit=0):
  image_files = []
  instance_ids = []
  count_instances = limit

  # Loop through all the files in the given folder
  # and get the full filepath for each file
  # and also get the corresponding instance id from its filename.
  #
  # We need to account for the ids repeated in brackets
  # (see NOTE above),
  # in order to get unique ids. For the sake of this lab,
  # we will assume that the instances with the bracketed ids
  # are earlier instances than instances without brackets
  # i.e. (1), (2) are earlier than 1, for example
  for filename in os.listdir(image_source_dir):
    image_files.append(os.path.join(image_source_dir, filename))

    instance = filename.split(' ')
    #print(instance)
    instance = instance[1].split('.')
    #print(instance)
    instance = instance[0]
    #print(instance)
    if instance[0]=='(':
      instance = instance[1:]
      instance = instance[:len(instance)-1]
      instance = int(instance)
    else:
      instance = int(instance)+num_id_bracketed
    #print(instance)
    instance_ids.append(instance-1+limit)

    count_instances += 1

  return image_files, instance_ids, count_instances


# call the above method to get the path address for each image in the folders
#
# Since the Kirmizi instances appear as the first rows in the hand-crafted features
# version of the dataset, we start the instance id ordering from the folder for this label.
#
# Remember that we want to be able to match instances in the hand-crafted features data
# with instances in the raw image data.
image_files_kirmizi, instance_ids_kirmizi, count_instances_kirmizi = get_filepaths(image_source_dir_kirmizi, num_id_bracketed=65)
# Since the Siirt instances appear as the last rows in the hand-crafted features
# version of the dataset, we continue the instance id ordering based on the count
# of instances in the Kirmizi folder
image_files_siirt, instance_ids_siirt, _ = get_filepaths(image_source_dir_siirt, num_id_bracketed=50, limit=count_instances_kirmizi)
# parameter values 50 and 65 above correspond to the number of image files
# with brackets in their instance ids
# so in essence, the non-bracketed ids are shifted by those numbers
# to make sure that ids are unique numbers (when brackets are removed)


# combine the lists of files for both image source folders
image_files = []
image_files.extend(image_files_kirmizi)
image_files.extend(image_files_siirt)


# combine the list of instance ids for both image source folders
# the ids are needed to match with the hand-crafted features data
image_instance_ids = []
image_instance_ids.extend(instance_ids_kirmizi)
image_instance_ids.extend(instance_ids_siirt)

print("\n A peek at the image files: \n"+str(image_files))
print("\n A peek at the instance ids: \n"+str(image_instance_ids))


# double-check that instance ids are indeed unique
# if true, the number of unique instance ids should match
# the number of instances in the hand-crafted features data
print('\n Check:')
print('Number of images preloaded:', len(image_instance_ids))
print('Number of unique instance ids for the images:', numpy.unique(numpy.array(image_instance_ids)).shape)
print('Number of labels loaded:', labels.shape)


# method to visualize sample images
def plot_image(img_file, img_name):

  print()
  img = read_image(img_file)
  print('Image shape:', img.size())
  pil_img = to_pil_image(img)
  plt.figure()
  plt.imshow(numpy.asarray(pil_img))
  plt.title(img_name)
  plt.show()



# for a peek, randomly select one example from the kirmizi class to visualize
# (we pick a position in the list of file paths rather than an instance id,
# since os.listdir does not return the files in instance-id order)
rng =  numpy.random.default_rng(random_seed)
kirmizi_pos = rng.choice(len(image_files_kirmizi))
# for a peek, randomly select one example from the siirt class to visualize
siirt_pos = rng.choice(len(image_files_siirt))


plot_image(image_files_kirmizi[kirmizi_pos], 'Kirmizi')
plot_image(image_files_siirt[siirt_pos], 'Siirt')
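
The filename-to-instance-id logic in `get_filepaths` can be sketched in isolation with a regular expression. This is a simplified illustration under stated assumptions: file names are assumed to look like `kirmizi (3).jpg` or `kirmizi 3.jpg` (a single-word prefix, a space, then an id with or without brackets), and `parse_instance_id` and its `offset` parameter are hypothetical names, not part of the lab code.

```python
import re

# Assumed file name shape: "<prefix> (<id>).<ext>" or "<prefix> <id>.<ext>".
_NAME_PATTERN = re.compile(r"^\S+ (?:\((\d+)\)|(\d+))\.\w+$")


def parse_instance_id(filename, num_id_bracketed, offset=0):
    """Return a zero-based instance id for one image file name.

    Bracketed ids are treated as the earlier instances; non-bracketed ids
    are shifted by `num_id_bracketed` so that all ids are unique.
    """
    match = _NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unexpected file name: {filename!r}")
    bracketed, plain = match.groups()
    if bracketed is not None:
        instance = int(bracketed)
    else:
        instance = int(plain) + num_id_bracketed
    return instance - 1 + offset


print(parse_instance_id("kirmizi (3).jpg", num_id_bracketed=65))
print(parse_instance_id("kirmizi 1.jpg", num_id_bracketed=65))
```

Raising an error for names that do not match the assumed shape makes stray files in the folder visible instead of silently producing a wrong id.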

# Section 2 - Split into training, validation, and test sets



In [None]:
from sklearn.model_selection import train_test_split


all_ids = numpy.arange(0, feats_28.shape[0])

random_seed = 1

# First randomly split the data into 70:30 to get the training set
# split to have similar distributions of the class labels.
train_set_ids, rem_set_ids = train_test_split(all_ids, test_size=0.3, train_size=0.7,
                                 random_state=random_seed, shuffle=True, stratify=labels)


# Then further split the remaining data 50:50 into validation and test sets
# split to have similar distributions of the class labels
# (stratifying on the labels of the remaining instances only).
val_set_ids, test_set_ids = train_test_split(rem_set_ids, test_size=0.5, train_size=0.5,
                                 random_state=random_seed, shuffle=True,
                                 stratify=labels[rem_set_ids])


# create a method for a histogram plot
# that shows the distribution of each label class
# (classes '0' and '1' in this case)
# for a given set of label instances
def plot_label_distr(labels, plot_title):
  print()
  plt.figure()
  the_bin_centres = numpy.unique(labels)
  plt.hist(labels, bins=the_bin_centres.shape[0], range=(the_bin_centres[0]-0.5, the_bin_centres[the_bin_centres.shape[0]-1]+0.5))
  plt.xticks(the_bin_centres)
  plt.title(plot_title)
  plt.show()



# check the distribution of the labels in the training, validation, and test sets
plot_label_distr(labels[train_set_ids], 'Class frequencies for training set')
plot_label_distr(labels[val_set_ids], 'Class frequencies for validation set')
plot_label_distr(labels[test_set_ids], 'Class frequencies for test set')
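
Besides the histograms, the class balance of a split can be checked numerically by comparing class proportions. The sketch below uses a small made-up label array and a hand-picked subset; `class_proportions`, `toy_labels`, and `toy_train_ids` are illustrative names, not part of the lab code.

```python
import numpy


def class_proportions(label_subset, num_classes=2):
    """Return the fraction of instances belonging to each class."""
    counts = numpy.bincount(label_subset, minlength=num_classes)
    return counts / counts.sum()


# Made-up labels: six instances of class 0 followed by four of class 1.
toy_labels = numpy.array([0] * 6 + [1] * 4)

# A hand-picked "training" subset with the same 60:40 class balance.
toy_train_ids = numpy.array([0, 1, 2, 6, 7])

print("All data:  ", class_proportions(toy_labels))
print("Train set: ", class_proportions(toy_labels[toy_train_ids]))
```

For a stratified split, the proportions of each subset should be close to those of the full dataset.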



# Section 3 - Scale (i.e. normalize) the hand-crafted features




In [None]:
from sklearn.preprocessing import StandardScaler

# create a function for standardizing the feature vectors
# (each feature gets mean=0 and standard deviation=1 on the rows used for fitting)
def scale_feats(feat_vec, fit_ids):
  # Fit the scaler on the given rows only (the training set), so that
  # no statistics from the validation and test sets leak into training
  scaler = StandardScaler()
  scaler.fit(feat_vec[fit_ids])
  # then apply the same transformation to all rows
  scaled_feat_vec = scaler.transform(feat_vec)
  print("\n A peek at the scaled dataset features: \n"+str(scaled_feat_vec))

  return scaled_feat_vec

# standardize the feature vectors, fitting on the training set from Section 2
scaled_feats_28 = scale_feats(feats_28, train_set_ids)
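
To see what standard scaling computes, the arithmetic can be reproduced with plain numpy: each feature has its mean subtracted and is divided by its standard deviation. The sketch below uses made-up numbers and, as one common convention, estimates the mean and standard deviation from the "training" rows only before applying them to a new row.

```python
import numpy

# Made-up training rows: two instances, two features on very different scales.
toy_train = numpy.array([[1.0, 10.0],
                         [3.0, 30.0]])

# Per-feature statistics estimated from the training rows
# (population standard deviation, which is also what StandardScaler uses).
feat_mean = toy_train.mean(axis=0)
feat_std = toy_train.std(axis=0)

# A new row is scaled with the training statistics, not its own.
toy_new = numpy.array([[2.0, 40.0]])
toy_new_scaled = (toy_new - feat_mean) / feat_std

print("mean:", feat_mean, "std:", feat_std)
print("scaled new row:", toy_new_scaled)
```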



# Section 4 - Extract new (transfer learning) features using a pre-trained model

We will use a pre-trained *VGG-16* model. VGG-16 is a convolutional neural network (CNN) with 16 weight layers (13 convolutional and 3 fully connected).
>Have a look at the basic details of the model here: https://pytorch.org/vision/stable/models/generated/torchvision.models.vgg16.html#torchvision.models.vgg16.

>You can read more about the VGG-16 model architecture here: https://arxiv.org/abs/1409.1556.

>From the documentation and publication, note:
  * the number of parameters (weights) that the model has
  * the dataset that it was trained on (You can read more about the dataset here: https://www.image-net.org/)
  * the different kinds of objects etc that it was trained to differentiate, i.e. its classes

In [None]:
from torchvision.io import read_image
from torchvision import datasets, models, transforms
import torch.optim as optim
from torch.optim import lr_scheduler

# build the VGG16 architecture (no pretrained weights are loaded here) and view it
model_vgg16 = models.vgg16()
print("\n 'View' the model architecture:\n")
print(model_vgg16)


# Extract features using the feature extraction layers of the pretrained model

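### Step 1: Initialize model with the IMAGENET1K_FEATURES weights
# (in this weight set only the feature extraction layers carry pretrained values,
# which is all we need here since we only use the .features part of the model)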
model_vgg16_featlayers = models.vgg16(weights="IMAGENET1K_FEATURES").features
model_vgg16_featlayers.eval()


### Step 2: Initialize the inference transforms
# This does some preprocessing behind the scenes, including:
# 1) Resizing the input to resize_size=[256];
# 2) Followed by a central cropping of crop_size=[224];
# 3) And finally the values are first rescaled to [0.0, 1.0]
# and then normalized using the mean and std stored with this weight set.
# For IMAGENET1K_FEATURES the documentation lists mean=[0.48235, 0.45882, 0.40784]
# and std=1/255 per channel, which differ from the usual ImageNet values
# (mean=[0.485, 0.456, 0.406] and std=[0.229, 0.224, 0.225]).
# The transform function expects either:
# a batched (B, C, H, W) or single (C, H, W) image, as torch.Tensor objects.
preprocess = models.VGG16_Weights.IMAGENET1K_FEATURES.transforms(antialias=True)



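# method to get features learnt in the pretrained VGG16
# extracting these features for our own list of images
def get_img_feats(img_path):

  # read the image file as a (C, H, W) tensor
  img = read_image(img_path)
  #print("Fetching image...")

  # Step 3: Apply the preprocessing transforms
  # and add a batch dimension: (C, H, W) -> (1, C, H, W)
  batch = preprocess(img).unsqueeze(0)

  # Step 4: Use the feature extraction layers of the pretrained model
  # to obtain the feature maps, then drop the batch dimension
  auto_feats = model_vgg16_featlayers(batch).squeeze(0)
  # detach from the autograd graph before converting to a numpy array
  auto_feats = auto_feats.detach().numpy()
  # average over the height and then the width of each feature map
  # (global average pooling), leaving one value per channel
  # (512 channels for VGG-16)
  auto_feats = numpy.mean(auto_feats, axis=1, keepdims=False)
  auto_feats = numpy.mean(auto_feats, axis=1, keepdims=False)

  return auto_feats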
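
The two successive `numpy.mean` calls in `get_img_feats` amount to averaging each feature map over its height and width. The sketch below shows this on a tiny made-up array of shape (channels, height, width); `toy_maps` is an illustrative stand-in for the VGG-16 output, not real model output.

```python
import numpy

# Made-up feature maps: 2 channels, each a 3x3 grid.
# Channel 0 holds the values 0..8 and channel 1 holds 9..17.
toy_maps = numpy.arange(2 * 3 * 3, dtype=float).reshape(2, 3, 3)

# Two-step version, as in get_img_feats: average over height, then over width.
two_step = numpy.mean(toy_maps, axis=1, keepdims=False)
two_step = numpy.mean(two_step, axis=1, keepdims=False)

# Equivalent one-step version: average over both spatial axes at once.
one_step = toy_maps.mean(axis=(1, 2))

print(two_step)
print(one_step)
```

Both versions return one value per channel, so the spatial layout of the feature maps is discarded and only a fixed-length feature vector remains.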




# Section 5 - Train an MLP using the new (transfer learning) features

* Train and evaluate an MLP for classifying pistachio images into the two classes based on the new features extracted using the VGG-16.
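
As one possible starting point, the sketch below trains and evaluates a small MLP with scikit-learn's `MLPClassifier`. It is not the expected solution: the data are synthetic stand-ins for the extracted features, and the hidden layer size, number of iterations, and the `toy_*` names are arbitrary assumptions for illustration.

```python
import numpy
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier

rng = numpy.random.default_rng(1)

# Synthetic stand-in for the extracted features: two classes, 8 features each.
# (In the lab, these would be the VGG-16 features and the pistachio labels.)
num_per_class, num_feats = 100, 8
toy_feats = numpy.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(num_per_class, num_feats)),
    rng.normal(loc=3.0, scale=1.0, size=(num_per_class, num_feats)),
])
toy_labels = numpy.array([0] * num_per_class + [1] * num_per_class)

# Hypothetical train/validation ids: every 4th instance is held out.
toy_ids = numpy.arange(toy_labels.shape[0])
toy_val_ids = toy_ids[::4]
toy_train_ids = numpy.setdiff1d(toy_ids, toy_val_ids)

# A small MLP with one hidden layer, trained on the training instances only.
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=1)
mlp.fit(toy_feats[toy_train_ids], toy_labels[toy_train_ids])

# Evaluate on the held-out validation instances.
val_predictions = mlp.predict(toy_feats[toy_val_ids])
val_accuracy = accuracy_score(toy_labels[toy_val_ids], val_predictions)
print("Validation accuracy on the toy data:", val_accuracy)
```

The validation set is the place to compare hyperparameter choices; the test set is best kept for a single final evaluation.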

# Section 6 - Train an MLP using the hand-crafted features

* Train and evaluate an MLP for classifying pistachio images into the two classes based on the original hand-crafted features that came with the dataset.

* How does performance compare with that in Section 5?