# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint



## Learning Objectives

At the end of the experiment, you will be able to:

* Understand how to derive eigenfaces using PCA
* Use the PCA features for classification

In [None]:
#@title Experiment Walkthrough Video

from IPython.display import HTML
HTML("""<video width="854" height="480" controls>
<source src="https://cdn.talentsprint.com/talentsprint1/archives/sc/aiml/aiml_2018_b7_hyd/experiment_details_backup/pca_with_eigen_faces.mp4" type="video/mp4">
</video>""")



## Dataset

### Description 

The dataset chosen for this experiment is a preprocessed excerpt of “Labeled Faces in the Wild” (LFW).

Labeled Faces in the Wild is a database of face photographs designed for studying the problem of unconstrained face recognition. The data set contains more than 13,000 images of faces collected from the web. Each face has been labeled with the name of the person pictured, and 1680 of the people pictured have two or more distinct photos in the data set. The only constraint on these faces is that they were detected by the Viola-Jones face detector.

## AI / ML Technique

### Eigenfaces

Eigenfaces is the name given to a set of eigenvectors when they are used in the computer vision problem of human face recognition. The approach of using eigenfaces for recognition was developed by Sirovich and Kirby (1987) and used by Matthew Turk and Alex Pentland in face classification. The eigenvectors are derived from the covariance matrix of the probability distribution over the high-dimensional vector space of face images. The eigenfaces themselves form a basis set of all images used to construct the covariance matrix. This produces dimension reduction by allowing the smaller set of basis images to represent the original training images. Classification can be achieved by comparing how faces are represented by the basis set.
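
The derivation described above can be sketched with plain NumPy. This is only a minimal illustration on random numbers standing in for face images; the array sizes and the names `faces`, `toy_eigenfaces`, and `coefficients` are assumptions made for the example and are not part of the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a face dataset: 20 "images" of 8 x 6 pixels,
# flattened to rows of 48 values (random numbers, not real faces).
n_images, height, width = 20, 8, 6
faces = rng.normal(size=(n_images, height * width))

# Center the data so the covariance describes variation around the mean face.
mean_face = faces.mean(axis=0)
centered_faces = faces - mean_face

# Covariance matrix over pixels: shape (48, 48).
cov = centered_faces.T @ centered_faces / (n_images - 1)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending
# order, so reorder to put the largest-variance directions first.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Keep the top k eigenvectors and view each one as an image.
k = 5
toy_eigenfaces = eigenvectors[:, :k].T.reshape(k, height, width)

# Each face is now described by k coefficients instead of 48 pixel values.
coefficients = centered_faces @ eigenvectors[:, :k]
print(toy_eigenfaces.shape, coefficients.shape)
```

In practice the decomposition is usually computed through an SVD rather than by forming the covariance matrix explicitly, which is cheaper when the number of pixels is large.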

### Importing Required Packages

In [None]:
from time import time
import matplotlib.pyplot as plt

# Importing Sklearn Packages
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_lfw_people
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score

### Download the data

Load the Labeled Faces in the Wild (LFW) people dataset. To learn more about it, refer to the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_lfw_people.html).

In [None]:
# Download the data (on first use) and load it through sklearn
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)

# Storing images arrays shapes (for plotting)
n_samples, h, w = lfw_people.images.shape

`lfw_people` provides several attributes. `lfw_people.images` is a 3-dimensional array of shape (n_samples, h, w), and `lfw_people.data` holds the same pixel values with each image flattened into a 1-dimensional row, giving a 2-dimensional array of shape (n_samples, h * w).
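
The relationship between the two layouts can be checked on a small array. The sketch below uses a made-up array (`toy_images`) in place of `lfw_people.images`, so it runs without downloading anything; the sizes are arbitrary.

```python
import numpy as np

# Hypothetical stand-in for lfw_people.images: 4 images of 5 x 3 pixels.
toy_images = np.arange(4 * 5 * 3, dtype=np.float32).reshape(4, 5, 3)
toy_n, toy_h, toy_w = toy_images.shape

# Flattening each image row by row gives the layout of a `data`-style array.
toy_data = toy_images.reshape(toy_n, toy_h * toy_w)

# Reshaping one row back to (h, w) recovers the original image.
restored = toy_data[2].reshape(toy_h, toy_w)
print(toy_data.shape, np.array_equal(restored, toy_images[2]))
```

This is why the height and width are stored separately: they are needed later to turn flattened rows back into images for plotting.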

### Assigning the `data` attribute of lfw_people to X

In [None]:
X = lfw_people.data

# X has shape (number of samples, number of features)
n_features = X.shape[1]
X.shape[0], X.shape[1]

In [None]:
# The label is the id of the person
y = lfw_people.target

# Loading the target names (Label names)
target_names = lfw_people.target_names

# Checking the number of classes
n_classes = target_names.shape[0]

print("Target names:", target_names)
print("\nTotal dataset size:")
print("n_samples: %d" % n_samples)
print("n_features: %d" % n_features)
print("n_classes: %d" % n_classes)

### Split into a training and testing set using sklearn's train_test_split function


In [None]:
X.shape, y.shape

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
X_train.shape, X_test.shape
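
The split above is random, so the results change from run to run. As an optional variation, `train_test_split` accepts `random_state` for reproducibility and `stratify` to keep class proportions equal in both parts. The sketch below shows this on made-up labels (`demo_X`, `demo_y`), not on the LFW data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0 and 20 of class 1.
demo_y = np.array([0] * 80 + [1] * 20)
demo_X = np.arange(100).reshape(100, 1)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
Xa, Xb, ya, yb = train_test_split(
    demo_X, demo_y, test_size=0.25, random_state=42, stratify=demo_y
)

# Class counts in the training part and in the test part.
print(np.bincount(ya), np.bincount(yb))
```

Stratification matters most when some classes have few samples, since a purely random split could leave a class underrepresented in the test set.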

## Performing PCA on the face images, otherwise known as Eigenfaces

### Compute the PCA

In PCA, setting the parameter `whiten=True` removes some information from the transformed signal (the relative variance scales of the components), but it can sometimes improve the predictive accuracy of downstream estimators by making their data respect some hard-wired assumptions.

Whitening rescales each projected component to unit variance, which can help scale-sensitive estimators such as linear classifiers trained with SGD.
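
The effect of whitening can be seen by comparing the variance of the projected components with and without it. This is a small sketch on synthetic data (`demo_X` is an assumption for illustration), not on the face images.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 10 features with very different scales.
demo_X = rng.normal(size=(200, 10)) * np.linspace(1.0, 20.0, 10)

plain = PCA(n_components=3).fit(demo_X)
white = PCA(n_components=3, whiten=True).fit(demo_X)

# Per-component sample variance of the projected training data.
var_plain = plain.transform(demo_X).var(axis=0, ddof=1)
var_white = white.transform(demo_X).var(axis=0, ddof=1)

print("without whitening:", np.round(var_plain, 2))
print("with whitening:   ", np.round(var_white, 2))
```

Without whitening, the variances equal `explained_variance_` and differ widely between components; with whitening, each component of the training projection has variance 1.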

In [None]:
# Compute a PCA (eigenfaces) on the face dataset (treated as unlabeled
# dataset): unsupervised feature extraction / dimensionality reduction

n_components = 150

print("Extracting the top %d eigenfaces from %d faces"
      % (n_components, X_train.shape[0]))

# Starting the timer
t0 = time()

# Fit PCA on the training set only, using the PCA class from sklearn
pca = PCA(n_components=n_components, whiten=True).fit(X_train)

# Printing the time taken to extract the features
print("done in %0.3fs" % (time() - t0))

# Storing the eigen faces and reshaping to 3 dimensions
# pca.components_ is the set of all eigenvectors of the covariance matrix (one eigenvector for each principal component)
eigenfaces = pca.components_.reshape((n_components, h, w))

In [None]:
pca.components_.shape

In [None]:
eigenfaces.shape
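
The value `n_components = 150` is a choice, and one common way to inspect such a choice is the cumulative explained variance ratio of a fitted PCA. The sketch below illustrates the idea on synthetic low-rank data; the data, the 95% threshold, and the names `demo_pca` and `k_95` are assumptions for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data with low-rank structure plus noise: 100 samples, 30 features.
latent = rng.normal(size=(100, 4))
mixing = rng.normal(size=(4, 30))
demo_X = latent @ mixing + 0.1 * rng.normal(size=(100, 30))

demo_pca = PCA(n_components=10).fit(demo_X)

# Cumulative fraction of the total variance captured by the first k components.
cumulative = np.cumsum(demo_pca.explained_variance_ratio_)

# Smallest number of components reaching 95% of the variance.
# (Valid here because the threshold is reached within the fitted components.)
k_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print(np.round(cumulative, 3), k_95)
```

The same attribute, `explained_variance_ratio_`, is available on the `pca` object fitted on the face images.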

### Projecting the input data on the eigenfaces orthonormal basis


In [None]:
t0 = time()
# Transforming the data
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
print("done in %0.3fs" % (time() - t0))

In [None]:
# Checking for the shape of the original and pca data
X_train.shape, X_train_pca.shape, X_test.shape, X_test_pca.shape
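
To make the projection step concrete, the sketch below reproduces what `transform` computes for a whitened PCA: center the data, project onto the components, and divide by the standard deviation of each component. It uses random data (`demo_X`) rather than the face images.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
demo_X = rng.normal(size=(60, 12))  # hypothetical data, not LFW

demo_pca = PCA(n_components=4, whiten=True).fit(demo_X)

# Manual projection: center, project onto the components, then rescale
# each coordinate by the standard deviation of its component (whitening).
demo_centered = demo_X - demo_pca.mean_
manual = demo_centered @ demo_pca.components_.T / np.sqrt(demo_pca.explained_variance_)
print(np.allclose(manual, demo_pca.transform(demo_X)))

# The rows of components_ form an orthonormal basis.
gram = demo_pca.components_ @ demo_pca.components_.T
print(np.allclose(gram, np.eye(4)))
```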

## Train a Linear Classifier

### Fitting the classifier to the training set

In [None]:
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier()

In [None]:
# Fit the data; the timer starts here so that only the fit is measured
t0 = time()
clf = clf.fit(X_train_pca, y_train)
print("done in %0.3fs" % (time() - t0))
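
As an alternative arrangement, the PCA step and the classifier can be chained with scikit-learn's `make_pipeline`, so that the projection learned from the training data is applied automatically at prediction time. This is a sketch on synthetic data from `make_classification`; the dataset sizes and the names `demo_model`, `Xa`, `Xb`, `ya`, and `yb` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical synthetic data standing in for the flattened face images.
demo_X, demo_y = make_classification(
    n_samples=300, n_features=40, n_informative=10, n_classes=3, random_state=0
)
Xa, Xb, ya, yb = train_test_split(demo_X, demo_y, test_size=0.25, random_state=0)

# PCA is fitted on the training part only; the same projection is reused
# when the pipeline predicts on new data.
demo_model = make_pipeline(
    PCA(n_components=15, whiten=True),
    SGDClassifier(random_state=0),
)
demo_model.fit(Xa, ya)
demo_score = demo_model.score(Xb, yb)
print("test accuracy: %0.3f" % demo_score)
```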

### Quantitative evaluation of the model quality on the test set

In [None]:
print("Predicting people's names on the test set")
t0 = time()
y_pred = clf.predict(X_test_pca)

print("done in %0.3fs" % (time() - t0))
print("accuracy is", accuracy_score(y_test, y_pred))

print("\nclassification report")
print(classification_report(y_test, y_pred, target_names=target_names))
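
`confusion_matrix` is imported at the top of this experiment and complements the classification report by showing which classes get mistaken for which. The sketch below uses made-up labels (`demo_true`, `demo_pred`) rather than results from this experiment.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for three classes (not results from this experiment).
demo_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
demo_pred = np.array([0, 0, 1, 1, 1, 2, 0, 2])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(demo_true, demo_pred, labels=[0, 1, 2])
print(cm)

# Per-class recall: correct predictions divided by the true count of each class.
recall = cm.diagonal() / cm.sum(axis=1)
print(np.round(recall, 2))
```

On the face data, the equivalent call would be `confusion_matrix(y_test, y_pred, labels=range(n_classes))`.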

### Qualitative evaluation of the predictions using matplotlib

The function `plot_gallery()` below takes images, titles, height, and width, and plots each image in a subplot.


In [None]:
def plot_gallery(images, titles, h, w, n_row=3, n_col=4):
    """Helper function to plot a gallery of portraits.

    Plots the first n_row * n_col images (12 with the default 3 rows and 4 columns).
    """
    plt.figure(figsize=(1.8 * n_col, 2.4 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap='gray')
        plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())

The `get_Title()` function below builds a title from the predicted and true target names, to be shown with each image.

In [None]:
# Getting the last name of each target name
def get_Title(y_pred, y_test, target_names, i):
    pred_name = target_names[y_pred[i]].split(' ')[-1]
    true_name = target_names[y_test[i]].split(' ')[-1]
    return 'predicted: %s\ntrue: %s' % (pred_name, true_name)

In [None]:
# Get the prediction and actual titles and store in a list
prediction_titles = []
for i in range(y_pred.shape[0]):
    title = get_Title(y_pred, y_test, target_names, i)
    prediction_titles.append(title)

# Plot the result of the prediction on a portion of the test set
plot_gallery(X_test, prediction_titles, h, w, n_row=5, n_col=2)

### Plotting the eigenfaces

In [None]:
# Plot the gallery of the most significant eigenfaces
eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)
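
Because the eigenfaces form a basis, an image can be approximately rebuilt from its PCA coefficients, and the approximation improves as more components are kept. The sketch below illustrates this with `inverse_transform` on synthetic data; `demo_X` and the chosen component counts are assumptions for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 80 samples, 20 features with decreasing scale.
demo_X = rng.normal(size=(80, 20)) * np.linspace(5.0, 0.5, 20)

errors = []
for n_kept in (2, 5, 10, 20):
    demo_pca = PCA(n_components=n_kept).fit(demo_X)
    # Project onto n_kept components, then map back to the original space.
    rebuilt = demo_pca.inverse_transform(demo_pca.transform(demo_X))
    errors.append(float(np.mean((demo_X - rebuilt) ** 2)))

# Mean squared reconstruction error for each number of components.
print(np.round(errors, 4))
```

With all 20 components kept, the reconstruction matches the input up to floating-point error, since nothing is discarded.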