## Classification - Olivetti Faces
**Author: John Saja**

In this notebook, we explore two classification methods on the Olivetti Faces dataset: Linear SVC and Random Forest.
Additionally, this notebook presents TPOT, a third-party tool that was used to select Linear SVC as the first classification model.

## Import dependencies

In [15]:
import os
import sys

# Raw string so the backslashes in the Windows path are not treated as escapes
spark_path = r"C:\spark\spark-2.1.0-bin-hadoop2.7"

os.environ['SPARK_HOME'] = spark_path
os.environ['HADOOP_HOME'] = spark_path

sys.path.append(spark_path + "/bin")
sys.path.append(spark_path + "/python")
sys.path.append(spark_path + "/python/pyspark/")
sys.path.append(spark_path + "/python/lib")
sys.path.append(spark_path + "/python/lib/pyspark.zip")
sys.path.append(spark_path + "/python/lib/py4j-0.10.4-src.zip")

import types
import collections

import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql import SQLContext

# Reuse the running SparkContext if this cell is executed again; calling
# SparkContext() a second time raises
# "ValueError: Cannot run multiple SparkContexts at once".
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
spark = SparkSession(sc)

import pyspark.sql
from pyspark.sql.functions import col, avg
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import UserDefinedFunction
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import StandardScaler
from sklearn import tree
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
import matplotlib.patches as mpatches
import pickle
from sklearn import datasets
from pyspark.ml import Pipeline
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.linalg import VectorUDT
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.sql.types import *

## Import our Dataset - Olivetti Faces (sklearn)

In [16]:
olivettiFaces = datasets.fetch_olivetti_faces(data_home=None, shuffle=False, random_state=0, download_if_missing=True)


X = olivettiFaces.data
y = olivettiFaces.target

numRows, numCols = X.shape

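# Center faces: first subtract each pixel's mean over all images (global),
# then subtract each image's own mean (local)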
centered = X - X.mean(axis=0)
centered -= centered.mean(axis=1).reshape(numRows, -1)
X = centered

print("Dataset consists of %d faces" % numRows)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

Dataset consists of 400 faces
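
The two centering steps in the cell above can be checked on a small array. The following is a minimal sketch that uses a made-up 4 x 3 matrix (`X_toy` is an assumption for illustration, not part of the faces data). It suggests that after both steps, every column mean and every row mean ends up at approximately zero.

```python
import numpy as np

# Toy stand-in for the face matrix: 4 "images" (rows) of 3 "pixels" (columns).
X_toy = np.array([[0.2, 0.4, 0.9],
                  [0.1, 0.5, 0.6],
                  [0.3, 0.3, 0.3],
                  [0.8, 0.2, 0.7]])

# Global centering: subtract each pixel's mean over all images.
centered_toy = X_toy - X_toy.mean(axis=0)

# Local centering: subtract each image's own mean brightness.
centered_toy -= centered_toy.mean(axis=1).reshape(X_toy.shape[0], -1)

col_means = centered_toy.mean(axis=0)
row_means = centered_toy.mean(axis=1)
print(np.allclose(col_means, 0.0), np.allclose(row_means, 0.0))
```

The column means stay at zero after the second step because the row means of a column-centered matrix themselves average to zero.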


## Plot a sample of our centered data

In [17]:
image_shape = (64, 64)

n_row, n_col = 5, 10
n_components = n_row * n_col

def plot_gallery(title, images, n_col=n_col, n_row=n_row):
    plt.figure(figsize=(2. * n_col, 2.26 * n_row))
    plt.suptitle(title, size=32)
    for i, comp in enumerate(images):
        plt.subplot(n_row, n_col, i + 1)
        vmax = max(comp.max(), -comp.min())
        plt.imshow(comp.reshape(image_shape), cmap=plt.cm.gray,
                   interpolation='nearest',
                   vmin=-vmax, vmax=vmax)
        plt.xticks(())
        plt.yticks(())
    plt.subplots_adjust(0.01, 0.05, 0.99, 0.93, 0.04, 0.)
plot_gallery("First 5 centered Olivetti faces (10 examples each)", centered[:n_components])

## Classification method # 1 - Linear SVC (Support Vector Classification)

I chose this model based on the results of running TPOT on this dataset. TPOT is an automated machine learning tool that takes raw input data and handles feature selection, preprocessing, feature construction, model selection, and parameter optimization, as shown in the graphic below.


![title](https://raw.githubusercontent.com/rhiever/tpot/master/images/tpot-ml-pipeline.png)

Source: http://rhiever.github.io/tpot/

The Linear SVC model used later in this section comes directly from the output of the TPOT pipeline.


## Pipeline code to generate best performing model
from tpot import TPOTClassifier

tpot = TPOTClassifier(generations=1, population_size=10, verbosity=2, n_jobs=1)

tpot.fit(X_train, y_train)

print(tpot.score(X_test, y_test))

#Exports model code to python file

tpot.export('tpot_exported_pipeline.py')

**Linear SVC Scalability**

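The cited documentation, which covers the kernel-based `sklearn.svm.SVC`, states that the fit time complexity is more than quadratic in the number of samples, which makes it hard to scale beyond a few tens of thousands of samples. A kernel SVM will therefore scale poorly as it stands. `LinearSVC` itself is implemented on liblinear rather than libsvm and, per its own documentation, should scale better to large numbers of samples.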

In general, one way SVM classification scales to large datasets is through online algorithms. Although I am not aware of an online implementation in Python that outperforms the sklearn library, a close competitor is LaSVM, which is written in C but can be wrapped in Python. LaSVM uses online approximation and makes only a single pass through the training set, relying on Sequential Minimal Optimization (SMO) to solve the underlying quadratic programming problem. As a result, it spends less time training while achieving results similar to those of popular libraries such as sklearn.

Source: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

Source: http://leon.bottou.org/projects/lasvm
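
To make the idea of online training concrete, here is a minimal sketch that makes a single pass over streamed mini-batches. It does not use LaSVM. It swaps in sklearn's `SGDClassifier` with hinge loss, which trains a linear SVM by stochastic gradient descent and supports incremental updates through `partial_fit`. The two-blob data, the batch size, and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

# Synthetic two-class stream (an assumption for illustration, not the faces data):
# class 0 is centered at (-3, -3), class 1 at (3, 3).
n = 1000
y_stream = rng.integers(0, 2, size=n)
centers = np.where(y_stream[:, None] == 1, 3.0, -3.0)
X_stream = centers + rng.normal(size=(n, 2))

# Hinge loss with an L2 penalty makes SGDClassifier a linear SVM trained by SGD.
online_svm = SGDClassifier(loss="hinge", penalty="l2", random_state=0)

# Single pass over the first 800 samples in mini-batches.
# partial_fit needs the full list of classes on the first call.
classes = np.array([0, 1])
batch_size = 100
for start in range(0, 800, batch_size):
    stop = start + batch_size
    online_svm.partial_fit(X_stream[start:stop], y_stream[start:stop], classes=classes)

# Evaluate on the last 200 samples, which were never used for training.
holdout_acc = online_svm.score(X_stream[800:], y_stream[800:])
print("hold-out accuracy: %.3f" % holdout_acc)
```

Because each batch is seen once and then discarded, memory use depends on the batch size rather than on the total number of samples.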

The model below is the one generated by the TPOT run above.

In [18]:
from sklearn.svm import LinearSVC

svcModel = LinearSVC(C=10.0, dual=True, loss="squared_hinge", penalty="l2", tol=0.0001)

svcModel.fit(X_train, y_train)

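# Cross-validated accuracy: cross_val_score refits a copy of the model
# on folds of the data it is given, which here is the test split only
scores = cross_val_score(svcModel, X_test, y_test, scoring='accuracy')
print("5-fold CV accuracy of optimized linearSVC model: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

# Accuracy of the model fit on the training split, measured on the test split
pred = svcModel.predict(X_test)
acc = accuracy_score(y_test, pred) * 100
print("\nGeneralization score = ", acc, "%")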



5-fold CV accuracy of optimized linearSVC model: 0.68 (+/- 0.21)

Generalization score =  97.5 %
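
The two numbers above measure different things. `cross_val_score` refits a fresh copy of the model on each fold of the data it receives, so when it is given the 80-image test split, each copy trains on only a fraction of 80 images spread over 40 subjects. The generalization score instead uses the model fit on all 320 training images. A more conventional arrangement is to cross-validate on the training split and score once on the test split. The sketch below shows that arrangement on synthetic data from `make_classification`. The data, `C=1.0`, and all variable names are assumptions for illustration and are not the TPOT-selected settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in data (an assumption; the centered faces X, y could be used instead).
X_demo, y_demo = make_classification(n_samples=400, n_features=20, n_informative=10,
                                     n_classes=4, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=42)

demo_model = LinearSVC(C=1.0, max_iter=10000)

# Model-selection estimate: 5-fold CV on the training split only.
cv_scores = cross_val_score(demo_model, Xd_train, yd_train, cv=5, scoring="accuracy")

# Final estimate: fit once on the full training split, score once on the test split.
demo_model.fit(Xd_train, yd_train)
test_acc = demo_model.score(Xd_test, yd_test)

print("CV accuracy: %0.2f (+/- %0.2f)" % (cv_scores.mean(), cv_scores.std() * 2))
print("Test accuracy: %0.2f" % test_acc)
```

Passing `cv=5` explicitly keeps the fold count in line with the printed label, and `stratify=y_demo` keeps every class represented in both splits.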


## Classification method # 2 - Random Forest Classifier (PySpark)

**Random Forest Classifier Scalability**

The scalability of the current random forest implementation is not straightforward to characterize. One of the biggest bottlenecks is that the decision trees in a random forest are fully grown. That said, some implementations built on older versions of Spark were designed specifically to be scalable.

One presentation describes a scalable implementation of the random forest classifier as follows (each part is explained up to the point where the algorithm presents its enhancement):

**Building Trees**

1) Build multiple decision trees in the driver node

2) Every time a new tree is built, an executor node will collect certain partition statistics used to generate future splits

3) The executors then discretize the features (binning), which will be useful for aggregation back to the driver node later.

**Training Trees**

1) Set a certain threshold for child tree size

2) Nodes get trained breadth first

3) The driver node aggregates the partition statistics from the executors in order to generate efficient splits.

4) If the size of a child tree for a split does not exceed the aforementioned threshold, train it locally.

Source: https://spark-summit.org/2014/wp-content/uploads/2014/07/Sequoia-Forest-Random-Forest-of-Humongous-Trees-Sung-Chung.pdf
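
The binning and aggregation steps above can be sketched in plain NumPy. This is a simplified illustration of the idea only, not Spark's code or the presentation's implementation. It handles a single feature and a single node, computes the bin edges from the full data rather than a sample, and omits the local-training step. All function names are hypothetical.

```python
import numpy as np

def candidate_thresholds(values, n_bins):
    """Binning step: quantile-based split candidates for one feature."""
    inner_quantiles = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.unique(np.quantile(values, inner_quantiles))

def partition_histogram(x, y, edges, n_classes):
    """Executor side: label counts per (bin, class) for one partition."""
    bins = np.searchsorted(edges, x, side="right")  # bin k holds edges[k-1] <= x < edges[k]
    hist = np.zeros((len(edges) + 1, n_classes), dtype=np.int64)
    np.add.at(hist, (bins, y), 1)
    return hist

def gini(counts):
    """Gini impurity of a vector of class counts."""
    total = counts.sum()
    if total == 0:
        return 0.0
    p = counts / total
    return 1.0 - float(np.sum(p ** 2))

def best_split(hist, edges):
    """Driver side: pick the threshold with the lowest weighted Gini impurity."""
    total = hist.sum()
    best_threshold, best_score = None, np.inf
    for k, threshold in enumerate(edges):
        left = hist[: k + 1].sum(axis=0)   # samples with x < threshold
        right = hist[k + 1:].sum(axis=0)   # samples with x >= threshold
        score = (left.sum() * gini(left) + right.sum() * gini(right)) / total
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1000)
y = (x > 0.5).astype(np.int64)  # toy labels with one clean split at 0.5

edges = candidate_thresholds(x, n_bins=32)
partitions = np.array_split(np.arange(x.size), 4)  # pretend there are 4 executors
merged = sum(partition_histogram(x[idx], y[idx], edges, 2) for idx in partitions)

threshold, impurity = best_split(merged, edges)
print("chosen threshold: %.3f, weighted Gini: %.4f" % (threshold, impurity))
```

The design point is that the per-partition histograms are small and additive. The driver only has to sum them to evaluate every candidate split, so the raw rows never leave the executors.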


In [19]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

#Create vectors from the numpy array containing the Olivetti Faces data
vecs_train = []
vecs_test = []

for item in X_train:
    vecs_train.append(Vectors.dense(item))
    
for item in X_test:
    vecs_test.append(Vectors.dense(item))

#Create Dictionary for denseVector features and their corresponding labels
X_train_dict = dict(features= vecs_train, label=y_train)
X_test_dict = dict(features=vecs_test, label=y_test)

#Create Pandas DataFrame from the dictionary
X_train_pdf = pd.DataFrame(X_train_dict)
X_test_pdf = pd.DataFrame(X_test_dict)

#Create training/test PySpark DataFrames from the pandas DataFrames
X_train_df = sqlContext.createDataFrame(X_train_pdf)
X_test_df = sqlContext.createDataFrame(X_test_pdf)

#Typecast the bigint label column to smallint to speed up further computation.
X_train_df = X_train_df.withColumn("label", X_train_df["label"].cast("smallint"))
X_test_df = X_test_df.withColumn("label", X_test_df["label"].cast("smallint"))

# Index labels, adding metadata to the label column.
# Fit on the training set; the indexer only learns labels that appear there.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(X_train_df)

# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(X_train_df)
    
# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=25)

# Convert indexed labels back to original labels.
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",
                               labels=labelIndexer.labels)

# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf, labelConverter])

# Train model.  This also runs the indexers.
model = pipeline.fit(X_train_df)

# Make predictions.
predictions = model.transform(X_test_df)

# Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(10)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))
print("Generalized accuracy on test set:", accuracy)

rfModel = model.stages[2]
print(rfModel)  # summary only


+--------------+-----+--------------------+
|predictedLabel|label|            features|
+--------------+-----+--------------------+
|            20|   20|[-0.1617844998836...|
|            28|   28|[-0.0939981415867...|
|            20|    3|[0.04951822757720...|
|            21|   21|[-0.1100569367408...|
|             9|    9|[0.28998464345932...|
|             8|    8|[0.10590159893035...|
|            32|   32|[-0.1558575183153...|
|             9|    9|[-0.1934410631656...|
|            26|   26|[-0.0818718224763...|
|            16|   12|[-0.1234745830297...|
+--------------+-----+--------------------+
only showing top 10 rows

Test Error = 0.475
Generalized accuracy on test set: 0.525
RandomForestClassificationModel (uid=rfc_00550df6e046) with 25 trees


In [20]:
sc.stop()