## SVM and Ensemble methods

In this question, we use the "Labeled Faces in the Wild" (LFW) dataset, a public benchmark for face verification. The dataset size, number of features, number of classes, and target names are shown below.

In [None]:
from time import time
import logging
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import fetch_lfw_people
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import PCA
from sklearn.svm import SVC


print(__doc__)

# Display progress logs on stdout
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')


# #############################################################################
# Download the data, if not already on disk and load it as numpy arrays

lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)

# introspect the images arrays to find the shapes (for plotting)
n_samples, h, w = lfw_people.images.shape

# for machine learning we use the flattened pixel data directly (as relative
# pixel position info is ignored by this model)
X = lfw_people.data
n_features = X.shape[1]

# the label to predict is the id of the person
y = lfw_people.target
target_names = lfw_people.target_names
n_classes = target_names.shape[0]

print("Total dataset size:")
print("n_samples: %d" % n_samples)
print("n_features: %d" % n_features)
print("n_classes: %d" % n_classes)

Downloading LFW metadata: https://ndownloader.figshare.com/files/5976012
2021-04-14 19:44:34,054 Downloading LFW metadata: https://ndownloader.figshare.com/files/5976012


Automatically created module for IPython interactive environment


Downloading LFW metadata: https://ndownloader.figshare.com/files/5976009
2021-04-14 19:44:37,509 Downloading LFW metadata: https://ndownloader.figshare.com/files/5976009
Downloading LFW metadata: https://ndownloader.figshare.com/files/5976006
2021-04-14 19:44:40,085 Downloading LFW metadata: https://ndownloader.figshare.com/files/5976006
Downloading LFW data (~200MB): https://ndownloader.figshare.com/files/5976015
2021-04-14 19:44:44,997 Downloading LFW data (~200MB): https://ndownloader.figshare.com/files/5976015


Total dataset size:
n_samples: 1288
n_features: 1850
n_classes: 7


In [None]:
target_names

array(['Ariel Sharon', 'Colin Powell', 'Donald Rumsfeld', 'George W Bush',
       'Gerhard Schroeder', 'Hugo Chavez', 'Tony Blair'], dtype='<U17')

Here, we split the data into training and test sets, then apply PCA (eigenfaces) to the face dataset, treated as an unlabeled dataset, for unsupervised feature extraction / dimensionality reduction. The PCA is fitted on the training set only and then used to project both sets.

In [None]:
# #############################################################################
# Split into a training set and a test set (random 75/25 split, not stratified)

# split into a training and testing set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# #############################################################################
# Compute a PCA (eigenfaces) on the face dataset (treated as unlabeled
# dataset): unsupervised feature extraction / dimensionality reduction
n_components = 150

print("Extracting the top %d eigenfaces from %d faces"
      % (n_components, X_train.shape[0]))
t0 = time()
pca = PCA(n_components=n_components, svd_solver='randomized',
          whiten=True).fit(X_train)
print("done in %0.3fs" % (time() - t0))

eigenfaces = pca.components_.reshape((n_components, h, w))

print("Projecting the input data on the eigenfaces orthonormal basis")
t0 = time()
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
print("done in %0.3fs" % (time() - t0))

Extracting the top 150 eigenfaces from 966 faces
done in 0.449s
Projecting the input data on the eigenfaces orthonormal basis
done in 0.057s
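
To make the effect of `whiten=True` and the choice of `n_components` more concrete, here is a minimal sketch on synthetic data. The array `X_demo` and all sizes are assumptions chosen for illustration, not the LFW data. It checks how much variance the retained components capture and that each whitened component has roughly unit variance on the data the PCA was fitted on.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the face matrix (hypothetical data):
# 200 samples, 50 correlated features driven by 5 latent factors.
rng = np.random.RandomState(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 50))

pca_demo = PCA(n_components=5, svd_solver='randomized', whiten=True,
               random_state=0).fit(X_demo)
X_demo_pca = pca_demo.transform(X_demo)

# Fraction of the total variance captured by the retained components
explained = pca_demo.explained_variance_ratio_.sum()
print("explained variance ratio: %0.3f" % explained)

# With whiten=True, each projected component is rescaled to unit variance
component_std = X_demo_pca.std(axis=0, ddof=1)
print("per-component std:", component_std)
```

The same check on the notebook's own `pca` object (via `pca.explained_variance_ratio_`) would show how much of the face variance the 150 eigenfaces retain.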


# Q1. 
Try to find the best estimator for SVC. Your input should be: X_train_pca 

Hint: use grid search

* For kernel 'linear': 'C': [1e3, 5e3, 1e4, 5e4, 1e5]
* For kernel 'rbf': 'C': [1e3, 5e3, 1e4, 5e4, 1e5], 'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1]

In [None]:
# Train an SVM classification model
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
linear_grid = {
    'C': [1e3, 5e3, 1e4, 5e4, 1e5],
    'kernel': ['linear']
}

rbf_grid = {
    'C': [1e3, 5e3, 1e4, 5e4, 1e5],
    'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1],
    'kernel': ['rbf']
}
# Create a base model
svc = SVC()

# Instantiate the grid search model
linear_search = GridSearchCV(estimator = svc, param_grid = linear_grid, cv = 3)
rbf_search = GridSearchCV(estimator = svc, param_grid = rbf_grid, cv = 3)
linear_search.fit(X_train_pca,y_train)
rbf_search.fit(X_train_pca,y_train)
print("Linear Best Estimator:", linear_search.best_estimator_)
print("RBF Best Estimator:", rbf_search.best_estimator_)

Linear Best Estimator: SVC(C=1000.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
RBF Best Estimator: SVC(C=1000.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.005, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
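
The two searches above could also be run as a single search. `GridSearchCV` accepts a list of dictionaries as `param_grid`, and each dictionary is explored as its own sub-grid, so `gamma` is only varied for the RBF kernel. The sketch below uses synthetic data from `make_classification` and a much smaller grid than the one in the question. Both are assumptions made to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for (X_train_pca, y_train); hypothetical data
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=0)

# One sub-grid per kernel: 2 linear candidates + 4 rbf candidates
param_grid = [
    {'kernel': ['linear'], 'C': [1, 10]},
    {'kernel': ['rbf'], 'C': [1, 10], 'gamma': [0.01, 0.1]},
]
search = GridSearchCV(SVC(), param_grid=param_grid, cv=3)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validation accuracy: %0.3f" % search.best_score_)
```

With a single search, `best_estimator_` directly gives the best kernel and its parameters, chosen by cross-validation on the training data rather than by comparing test scores.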


# Q2.

Try to find the best estimator for Random Forest Classifier. Your input should be: X_train_pca


In [None]:
#  Train a Random Forest classification model
from sklearn.ensemble import RandomForestClassifier
rf_grid = {
    'n_estimators':[50,100,200,300,400]
}
rf = RandomForestClassifier()
rf_search = GridSearchCV(estimator = rf, param_grid = rf_grid, cv = 3)
rf_search.fit(X_train_pca,y_train)
print("Random Forest Best Estimator:", rf_search.best_estimator_)

Random Forest Best Estimator: RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=50,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
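
The search above varies only `n_estimators` and does not fix a seed, so the selected forest can change between runs. As a hedged sketch, the grid could be widened to another hyperparameter such as `max_features`, with `random_state` set for reproducibility. The data below is synthetic and the grid values are assumptions for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for (X_train_pca, y_train); hypothetical data
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=0)

rf_grid_demo = {
    'n_estimators': [20, 50],
    'max_features': ['sqrt', None],  # None means all features at each split
}
# A fixed random_state makes the fitted forests repeatable across runs
rf_demo = RandomForestClassifier(random_state=0)
rf_search_demo = GridSearchCV(estimator=rf_demo, param_grid=rf_grid_demo, cv=3)
rf_search_demo.fit(X_demo, y_demo)

print("Best parameters:", rf_search_demo.best_params_)
print("Best cross-validation accuracy: %0.3f" % rf_search_demo.best_score_)
```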


# Q3.

Evaluate the best SVM and ensemble models (clf.best_estimator_) on X_test_pca.

In [None]:
#  Quantitative evaluation of the model quality on the test set
print("Linear:",linear_search.best_estimator_.score(X_test_pca,y_test))
print("RBF:",rbf_search.best_estimator_.score(X_test_pca,y_test))
print("Random Forest:",rf_search.best_estimator_.score(X_test_pca,y_test))

Linear: 0.7763975155279503
RBF: 0.860248447204969
Random Forest: 0.5900621118012422
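
Accuracy alone hides per-class behaviour. `classification_report` and `confusion_matrix` are imported at the top of the notebook but never called. A minimal sketch of how they could be used is shown below on synthetic data. The data, the classifier settings, and the names `clf_demo`, `X_te`, `y_te` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the PCA-projected faces; hypothetical data
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)

clf_demo = SVC(kernel='rbf', C=10, gamma=0.01).fit(X_tr, y_tr)
y_pred = clf_demo.predict(X_te)

# Per-class precision, recall and F1
print(classification_report(y_te, y_pred))

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred, labels=[0, 1, 2])
print(cm)
```

In the notebook itself, the equivalent call would pass `y_test`, the predictions of a fitted `best_estimator_`, and `target_names=target_names` to label the rows with the people's names.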


# Q4.
Compare the above models, which one is the best classifier for this dataset? Why?

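The best classifier on this dataset was the SVM with the RBF kernel. It achieved the highest test accuracy (about 0.86), compared with about 0.78 for the linear SVM and about 0.59 for the random forest. The most likely reason is that the classes are not linearly separable in the PCA feature space, so the nonlinear decision boundary of the RBF kernel fits the data better than a linear one. The random forest was tuned only over n_estimators, which may partly explain its lower score.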
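
One way to see why a nonlinear kernel can beat a linear one is a toy problem where the classes are clearly not linearly separable. The sketch below uses `make_circles` (two concentric rings) rather than the face data, so it illustrates the general effect only and says nothing definitive about LFW. The parameter values are assumptions.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: no straight line separates the classes
X_demo, y_demo = make_circles(n_samples=400, noise=0.05, factor=0.5,
                              random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)

linear_acc = SVC(kernel='linear', C=1.0).fit(X_tr, y_tr).score(X_te, y_te)
rbf_acc = SVC(kernel='rbf', C=1.0, gamma=1.0).fit(X_tr, y_tr).score(X_te, y_te)

print("Linear accuracy: %0.3f" % linear_acc)
print("RBF accuracy: %0.3f" % rbf_acc)
```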