# Dimensionality Reduction and Ensemble Methods

In this exercise we turn to principal component analysis (PCA) and use several ensemble methods to build a face recognition model.
For this we use the *Olivetti faces* dataset, which consists of 400 different images of 40 different people. Each image has $64 \times 64$ pixels.

In [2]:
from sklearn.datasets import fetch_olivetti_faces
faces, labels = fetch_olivetti_faces(return_X_y=True, shuffle=True, random_state=42)

To plot an image we can use `imshow`. The following code plots the first face of each of the 40 different people.

In [3]:
import numpy as np
import matplotlib.pyplot as plt

def _plot_face(face):
    if face.shape != (64, 64):
        face = face.reshape(64, 64)
    plt.imshow(face, cmap='gray')
    plt.axis('off')

    
def plot_faces(faces, cols=4):
    faces = np.array(faces)
    if len(faces.shape) == 1:
        faces = faces[None, :]
    m = faces.shape[0]
    
    rows = m // cols
    if m % cols != 0:
        rows += 1
    fig, axes = plt.subplots(rows, cols, figsize=(3*cols, 3*rows))
    if len(axes.shape) == 1:
        axes = axes[None, :]
    
    for i in range(rows):
        for j in range(cols):
            try:
                plt.sca(axes[i, j])  # set current axes
                face = faces[i * cols + j]  # get face
            except IndexError:
                plt.axis('off')
                continue
            _plot_face(face)
    return fig, axes

# plot all distinct persons
_, idx = np.unique(labels, return_index=True)
plot_faces(faces[idx], cols=5);

## a) Eigenfaces
- Plot the first 20 principal axes (eigenvectors of the covariance matrix).
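
The link between principal axes and the covariance matrix can be checked on a small example. The sketch below uses synthetic data (an assumption, chosen so that it runs without downloading the faces) and compares the rows of `components_` with the leading eigenvectors from `np.linalg.eigh`. The names `X_demo` and `pca_demo` are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for the face matrix: 200 samples, 6 correlated features
X_demo = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))

pca_demo = PCA(n_components=3).fit(X_demo)

# Eigendecomposition of the sample covariance matrix (features in columns)
cov = np.cov(X_demo, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1][:3]       # indices of the three largest
top_vecs = eigvecs[:, order].T              # one eigenvector per row

# Eigenvectors are only defined up to sign, so compare absolute cosines
cosines = np.abs(np.sum(top_vecs * pca_demo.components_, axis=1))
print(cosines)
print(np.allclose(eigvals[order], pca_demo.explained_variance_))
```

If the two agree, the cosines are close to 1 and the eigenvalues match `explained_variance_`, which is the quantity asked for in the next question.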

Code for the eigenface plot:

In [4]:
from sklearn.decomposition import PCA

n_components = 20
pca = PCA(n_components)
X2D = pca.fit_transform(faces)

# Each row of components_ is one principal axis; reshape it back into an image
eigenfaces = pca.components_.reshape((n_components, 64, 64))
print("The first 20 principal axes (eigenfaces):")
plot_faces(eigenfaces, cols=4)
plt.show()

- How large are the corresponding eigenvalues (variances) of these eigenfaces?

Code for the numeric eigenvalues:

In [5]:
print("The eigenvalues (variances) of the eigenfaces are:")
for i, variance in enumerate(pca.explained_variance_):
    print("Component {}: variance {}".format(i, variance))



- Plot the proportion of explained variance as a function of the principal components used. You can use the attribute `explained_variance_ratio_` for this.
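
To read off how many components are needed for a given share of the variance, the per-component ratios can be accumulated. The following is a minimal sketch with made-up ratios standing in for `pca.explained_variance_ratio_`. The values and the 85% target are assumptions for illustration only.

```python
import numpy as np

# Hypothetical per-component ratios, standing in for pca.explained_variance_ratio_
ratios = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative ratio reaches the target
target = 0.85
n_needed = int(np.searchsorted(cumulative, target) + 1)
print(cumulative)
print(n_needed)
```

scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components` and then selects the number of components needed to reach that share of the variance.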

Code for the proportion of variance explained by the principal components:

In [6]:
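evr = pca.explained_variance_ratio_
total_evr = round(evr.sum() * 100, 2)
print('Explained variance ratio per component:', evr)
print('\nTotal explained variance (%):', total_evr)
print("Explained variance ratio as a function of the number of components:")
components = np.arange(1, len(evr) + 1)
plt.plot(components, evr, '-o', label='per component')
plt.plot(components, np.cumsum(evr), '-o', label='cumulative')
plt.xlabel('Number of principal components')
plt.ylabel('Explained variance ratio')
plt.legend()
plt.show()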


## b) Inverse Transformation

Compute a PCA and plot the reconstruction of 5 faces based on $5, 10, 20, 50, 100, 200, 300$ and $400$ principal components. You can use the method `inverse_transform` for this.
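
Before looking at the faces, it can help to see the effect of `inverse_transform` in numbers. The sketch below uses synthetic data (an assumption, so that it runs without the dataset) and measures the mean squared reconstruction error for an increasing number of components. With as many components as features, the reconstruction should be exact up to rounding.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 10))   # synthetic stand-in for the faces

errors = []
for k in [2, 5, 10]:
    pca_k = PCA(n_components=k).fit(X_demo)
    # Project onto k components, then map back to the original feature space
    X_rec = pca_k.inverse_transform(pca_k.transform(X_demo))
    errors.append(np.mean((X_demo - X_rec) ** 2))
print(errors)
```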

Code for PCA evaluation with a list of different components and reconstruction Plot:

In [7]:
from sklearn.decomposition import PCA

X = faces
n = [5, 10, 20, 50, 100, 200, 300, 400]

def _plot_faces(face):
    # Reshape flat pixel vectors into 64x64 images, then always draw
    if face.shape != (64, 64):
        face = face.reshape(64, 64)
    plt.imshow(face, cmap='bone')
    plt.axis('off')

X_plot = []

for jj in n:
    # Fit a PCA with jj components and map the projection back to pixel space
    pca_n = PCA(n_components=jj)
    X2D = pca_n.fit_transform(X)
    X_recovered = pca_n.inverse_transform(X2D)
    for kk in range(5):
        X_plot.append(X_recovered[kk])

# One row per number of components, one column per face
fig, ax = plt.subplots(len(n), 5, figsize=(10, 10))
for jj in range(len(n)):
    for kk in range(5):
        plt.sca(ax[jj, kk])
        _plot_faces(X_plot[5 * jj + kk])

print("Reconstruction plot:")
plt.show()

## c) Feature Importance
- Compute the *Gini feature importance*. You can use `imshow` for the visualization.
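
The idea of an importance map can be tried on a toy problem first. In the sketch below, a synthetic 4x4 "image" has exactly one pixel that determines the label (an assumption made for illustration), and the impurity-based importances of a random forest are reshaped into image form, as one would do with the $64 \times 64$ faces.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 16))        # 16 "pixels" of a 4x4 toy image
y_demo = (X_demo[:, 5] > 0).astype(int)    # only pixel 5 carries the label

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# feature_importances_ holds the normalized impurity-based (Gini) importances
importance_map = forest.feature_importances_.reshape(4, 4)
print(np.unravel_index(importance_map.argmax(), importance_map.shape))
```

The map can then be shown with `plt.imshow(importance_map)`. The informative pixel is expected to stand out clearly from the noise pixels.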

In [8]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
X_train, X_test, y_train, y_test = train_test_split(faces, labels, test_size=0.2, stratify=labels, random_state=42)

Evaluation of the Gini Feature Importance with imshow

In [9]:
rnd_clf = RandomForestClassifier(n_estimators=500,n_jobs=-1)
rnd_clf.fit(X_train,y_train)

# cap: class membership probabilities
# cf: predicted class labels
cap= rnd_clf.predict_proba(X_train)
cf = rnd_clf.predict(X_train)
print(cap)
print(cf)

gini = rnd_clf.feature_importances_.reshape(64,64)
print("Plot Gini:")
plt.imshow(gini)
plt.show()
print("Plot class affiliation probability:")
plt.imshow(cap)
plt.show()

## d) Gesichtserkennung

Build a face recognition model. You may use any methods you like. Experiment with different models (`SVC`, `RandomForestClassifier`, ...).
- Also build a `VotingClassifier` based on different models.
- Also try a `PCA` as a preprocessing step. What does the `whiten` parameter of the PCA do?
- What is the accuracy score on the training and test sets? Can you reach a score of 1 on the test set? If not, which people are confused with each other? Plot the faces of these people.
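
As a hint for the `whiten` question: with `whiten=True`, each projected component is rescaled to unit variance. The sketch below illustrates this on synthetic data with very different feature scales (an assumption for illustration) by comparing the per-component variances with and without whitening.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Five independent features with very different standard deviations
X_demo = rng.normal(size=(500, 5)) * np.array([10.0, 5.0, 2.0, 1.0, 0.5])

Z_plain = PCA(n_components=3, whiten=False).fit_transform(X_demo)
Z_white = PCA(n_components=3, whiten=True).fit_transform(X_demo)

# Sample variance (ddof=1) of each projected component
print(Z_plain.var(axis=0, ddof=1))
print(Z_white.var(axis=0, ddof=1))
```

Whitening discards the relative scale of the components, which can help models that are sensitive to feature scale, such as an `SVC` with an RBF kernel.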


In [10]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR,SVC
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score
import timeit

Final version.

Model 1: Support Vector Classifier

In [11]:
pipe_svc = Pipeline([
    ("scale", StandardScaler()),
    ("SVC", SVC(probability = True))
])

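# Wider search grid (not passed to GridSearchCV below)
params = [{
    "SVC__C" : np.linspace(1e-6,0.0006, 5),
    "SVC__gamma" : np.linspace(0.14,0.19,10),
    "SVC__kernel" : ["poly","rbf","sigmoid"]
}]

# Single parameter combination that is actually passed to GridSearchCV
best_params = [{
    "SVC__C" : [1e-6],
    "SVC__gamma" : [0.14],
    "SVC__kernel" : ["rbf"]
}]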

SVC_mod = GridSearchCV(pipe_svc, best_params, n_jobs =-1,cv = 4)

launch = timeit.default_timer()
SVC_mod.fit(X_train,y_train)
land = timeit.default_timer()

M1_score = SVC_mod.best_score_
M1_params = SVC_mod.best_params_
print(M1_score,M1_params)
print("Model 1 took {:.2f} seconds".format(land-launch))

In [12]:
M1y_pred_train = SVC_mod.predict(X_train)
M1y_pred_test = SVC_mod.predict(X_test)

M1a_train = accuracy_score(y_train,M1y_pred_train)
M1a_test = accuracy_score(y_test,M1y_pred_test)

print("The Accuracy score on the training data of Model 1 is : {}".format(M1a_train))
print("The Accuracy score on the test data of Model 1 is : {}".format(M1a_test))

Model 2: RandomForestClassifier

In [13]:
pipe_rndfor = Pipeline([
    ("rand_forest", RandomForestClassifier(random_state=42))
])

rand_params = [{
    "rand_forest__n_estimators": [1500]
}]

rand_best_params = [{
    "rand_forest__n_estimators" : [1500]
}]

rand_mod = GridSearchCV(pipe_rndfor,rand_params, n_jobs=-1, cv=4)

go = timeit.default_timer()
rand_mod.fit(X_train,y_train)
fin = timeit.default_timer()

M2_score = rand_mod.best_score_
M2_params = rand_mod.best_params_

print(M2_score,M2_params)
print("Model 2 took {:.2f} seconds".format(fin-go))

In [14]:
M2y_pred_train = rand_mod.predict(X_train)
M2y_pred_test = rand_mod.predict(X_test)

M2a_train = accuracy_score(y_train,M2y_pred_train)
M2a_test = accuracy_score(y_test,M2y_pred_test)

print("The Accuracy score on the training data in Model 2 is : {}".format(M2a_train))
print("The Accuracy score on the test data in Model 2 is : {}".format(M2a_test))


In [15]:
votes = VotingClassifier(estimators=[("SVC",SVC_mod),("RFC",rand_mod)], voting="soft")
go = timeit.default_timer()
votes.fit(X_train,y_train)
fin = timeit.default_timer()
print("Model 3 took {:.2f} seconds".format(fin-go))
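
With `voting="soft"`, the ensemble averages the class probabilities of its members and predicts the class with the highest mean. The sketch below shows this rule in plain NumPy with made-up probabilities from two hypothetical classifiers. It is only an illustration of the rule (assuming equal weights), not a replacement for `VotingClassifier`.

```python
import numpy as np

# Hypothetical class probabilities for 3 samples and 3 classes
proba_a = np.array([[0.60, 0.30, 0.10],
                    [0.20, 0.50, 0.30],
                    [0.45, 0.35, 0.20]])
proba_b = np.array([[0.20, 0.70, 0.10],
                    [0.10, 0.60, 0.30],
                    [0.10, 0.30, 0.60]])

hard_a = proba_a.argmax(axis=1)   # individual decisions of classifier A
hard_b = proba_b.argmax(axis=1)   # individual decisions of classifier B

# Soft voting with equal weights: average the probabilities, then take the argmax
soft = ((proba_a + proba_b) / 2).argmax(axis=1)
print(hard_a, hard_b, soft)
```

Where the two classifiers disagree, the more confident one decides the outcome, which is why soft voting needs members that provide `predict_proba`.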

In [16]:
votesy_pred_train = votes.predict(X_train)
votesy_pred_test = votes.predict(X_test)

M3a_train = accuracy_score(y_train,votesy_pred_train)
M3a_test = accuracy_score(y_test,votesy_pred_test)

print("Voting Classifier:")
print("The Accuracy score on the training data in Model 3 is : {}".format(M3a_train))
print("The Accuracy score on the test data in Model 3 is : {}".format(M3a_test))

In [17]:
id_f = y_test != votesy_pred_test
f_pred = X_test[id_f]
f = f_pred.shape[0]

print("There are {} misclassified faces".format(f))
print("Correct labels: {}".format(y_test[id_f]))

plot_faces(f_pred,cols = 5)
plt.show()

# For each wrongly predicted label, find the first training image of that person
f_mis = []
for a in votesy_pred_test[id_f]:
    for idx_train, b in enumerate(y_train):
        if a == b:
            f_mis.append(idx_train)
            break

print("Predicted (wrong) labels: {}".format(votesy_pred_test[id_f]))

plot_faces(X_train[f_mis],cols = 5)
plt.show()
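
An alternative way to list which people are confused with each other is to read the off-diagonal entries of a confusion matrix. The sketch below uses small made-up label arrays (an assumption for illustration). With the notebook's variables, one would pass `y_test` and `votesy_pred_test` instead.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted person ids
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 2, 1, 1, 2, 2, 3, 1])

cm = confusion_matrix(y_true, y_pred)

# Zero out the diagonal so that only the misclassifications remain
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)

# (true person, predicted person) pairs with at least one confusion
confused = [(int(t), int(p)) for t, p in zip(*np.nonzero(off_diag))]
print(confused)
```

Note that the row and column indices refer to the sorted unique labels. They coincide with the person ids here only because the ids run from 0 upward without gaps.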


Preprocessing


In [18]:
prepca = PCA(20,whiten=True)
faces_projected = prepca.fit_transform(faces)

X_train, X_test, y_train, y_test = train_test_split(faces_projected, labels, test_size = 0.2, stratify=labels, random_state=42)

In [19]:
plot_faces(prepca.inverse_transform(X_train[f_mis]),cols=5)
plt.show()