![Facial Recognition](facialrecognition.jpg)

You are a member of an elite group of data scientists specialising in advanced facial recognition technology. The firm is dedicated to identifying and safeguarding prominent individuals from various spheres, ranging from entertainment and sports to politics and philanthropy. The team's mission is to deploy AI-driven solutions that can accurately distinguish between images of notable personalities and the general populace, enhancing the personal security of such high-profile individuals. Your focus is Arnold Schwarzenegger, a figure whose accomplishments span from bodybuilding champion to Hollywood icon, and from philanthropist to Governor of California.

### **The Data**
The `data/lfw_arnie_nonarnie.csv` dataset contains processed facial image data derived from the "Labeled Faces in the Wild" (LFW) dataset, focusing specifically on images of Arnold Schwarzenegger and other individuals not identified as him. This dataset has been prepared to aid in the development and evaluation of facial recognition models. There are 40 images of Arnold Schwarzenegger and 150 of other people.

| Column Name | Description |
|-------------|-------------|
| PC1, PC2, ... PCN | Principal components from PCA, capturing key image features. |
| Label | Binary indicator: `1` for Arnold Schwarzenegger, `0` for others. |

## Data Preparation
- Dataset: lfw_arnie_nonarnie.csv
- Predictors: All columns except Label
- Class Label: Label

## Data Splitting
- Method: Train-test split
- Test Size: 20%
- Random State: 21
- Stratify: By Label, so the training and test sets keep the same class proportions as the full dataset
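
To illustrate what stratification does with these class counts, here is a minimal sketch on synthetic labels. It assumes only the counts stated above (150 other, 40 Arnold); the placeholder feature column and the variable names are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the class counts stated above: 150 "other" (0) and 40 "Arnold" (1).
y = np.array([0] * 150 + [1] * 40)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features; only the labels matter here

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=21, stratify=y
)

train_pos, test_pos = int(y_train.sum()), int(y_test.sum())
print(f"train: {len(y_train)} rows, {train_pos} positive ({train_pos / len(y_train):.3f})")
print(f"test:  {len(y_test)} rows, {test_pos} positive ({test_pos / len(y_test):.3f})")
```

Stratifying does not rebalance the classes: both splits remain roughly 21% positive, so the test set holds only a handful of Arnold images and each one has a large effect on recall.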

In [84]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier 
import seaborn as sns
import matplotlib.pyplot as plt

# Read the CSV file 
df = pd.read_csv("data/lfw_arnie_nonarnie.csv")

# Separate the predictors from the class label
X = df.drop('Label', axis=1)
y = df['Label'] 

# Split the data into training and testing sets, stratifying on the label to preserve class proportions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21, stratify=y)

## Model Selection and Training
### Models Considered
- Random Forest Classifier
- K-Nearest Neighbors
- Logistic Regression

### Preprocessing
- StandardScaler: Used to standardize features
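
As a rough sketch of what standardization computes, the example below compares `StandardScaler` with the manual formula (subtract the column mean, divide by the column standard deviation). The toy matrix is an assumption made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix with two features on very different scales (illustrative values only).
X_toy = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)

# Equivalent manual computation: column mean and population standard deviation (ddof=0).
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_scaled)
print(np.allclose(X_scaled, X_manual))
```

Placing the scaler inside the pipeline means it is refit on the training portion of every cross-validation fold, so statistics from the validation fold do not leak into training.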

### Cross-Validation
- Method: K-Fold Cross-Validation
- Number of Splits: 15
- Shuffle: True
- Random State: 21
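
The following sketch shows how small the validation folds become with 15 splits. It assumes a training set of 152 rows with 32 positives (80% of the 190 images, stratified); `y_train_demo` is a synthetic stand-in, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import KFold

n_train = 152  # assumed training-set size: 80% of the 190 images
y_train_demo = np.array([0] * 120 + [1] * 32)  # assumed stratified class counts

kf = KFold(n_splits=15, shuffle=True, random_state=21)
fold_sizes = []
fold_positives = []
for _, val_idx in kf.split(np.zeros((n_train, 1))):
    fold_sizes.append(len(val_idx))
    fold_positives.append(int(y_train_demo[val_idx].sum()))

print("validation fold sizes:", fold_sizes)
print("positives per fold:   ", fold_positives)
```

Each validation fold holds only 10 or 11 rows, and plain `KFold` does not control how many positives land in each one. `StratifiedKFold` is one alternative that keeps the class ratio similar across folds, though it is not what this notebook uses.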

### Pipeline and hyperparameter tuning
- Steps: Scaling and Classification
- Grid Search
  - Estimator: Pipeline
  - Scoring: Accuracy
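
The sketch below shows, on synthetic data, how a list of parameter dictionaries lets one grid search swap the classifier step and tune each candidate's own hyperparameters. The data from `make_classification`, the reduced grids, and the 5-fold setting are assumptions chosen to keep the example fast; they are not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the LFW features).
X_demo, y_demo = make_classification(n_samples=120, n_features=10, random_state=21)

pipe = Pipeline([("scaler", StandardScaler()), ("classifier", LogisticRegression())])

# Each dict is searched separately. "classifier" replaces the step itself, and
# "classifier__<name>" sets a hyperparameter on whichever estimator fills that step.
param_grid = [
    {"classifier": [LogisticRegression(max_iter=1000)], "classifier__C": [0.1, 1.0]},
    {"classifier": [KNeighborsClassifier()], "classifier__n_neighbors": [3, 5]},
]

search = GridSearchCV(
    pipe,
    param_grid=param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=21),
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(len(search.cv_results_["params"]))  # number of candidate settings tried
print(type(search.best_estimator_.named_steps["classifier"]).__name__)
print(search.best_params_)
```

Because each dictionary is expanded on its own, hyperparameters that belong to one model are never applied to another.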



In [85]:
# Choosing models
rf = RandomForestClassifier()
logreg = LogisticRegression()
knn = KNeighborsClassifier()
scaler = StandardScaler()
# Model list
models = {'RandomForestClassifier' : rf, 
          'KNeighborsClassifier' : knn,
          'Logistic Regression' : logreg 
         }

kf = KFold(n_splits=15, shuffle=True, random_state=21)
steps = [('scaler', scaler), ('classifier', rf)] # Default classifier, can be changed during GridSearchCV
pipeline = Pipeline(steps)
params = [
    {
        'classifier' : [rf],
        'classifier__criterion' : ['gini', 'entropy'], # Hyperparameters for Random Forest
        'classifier__min_samples_leaf' : [1, 5, 10],
        'classifier__n_estimators' : [50, 100, 150],
        'classifier__max_depth': [None, 10, 20] ,
        'classifier__random_state' : [21],
        'classifier__n_jobs' : [-1]
    },
    {
        'classifier' : [knn],                       # Hyperparameters for KNN
        'classifier__leaf_size' : [10, 30, 50],
        'classifier__n_neighbors' : [1, 5, 10],
        'classifier__n_jobs' : [-1]
    },
    {
        'classifier' : [logreg],                   # Hyperparameters for Logistic Regression
        'classifier__C' : [0.1, 0.5, 1.0],
        'classifier__solver' : ['lbfgs', 'liblinear', 'saga'],
        'classifier__max_iter' : [50, 100, 150],
        'classifier__random_state' : [21],
        'classifier__n_jobs' : [-1]
    }
] 
# Run the grid search with cross-validation
grid_cv = GridSearchCV(estimator=pipeline, param_grid=params, cv=kf, n_jobs=-1, scoring='accuracy')
grid_cv.fit(X_train, y_train)
# Selecting the best model
best_model = grid_cv.best_estimator_
print(best_model)

## Model Evaluation
### Best Model
- Model: Logistic Regression
- Best Hyperparameters: `C=0.5`, `max_iter=50`, `solver='saga'`
- Test Set Evaluation: F1 score, recall, precision, and accuracy

Predicted Labels: Generated using the best model

In [87]:
# Record the best parameters and CV score, then predict on the test set and compute accuracy
best_model_info = grid_cv.best_params_
best_model_cv_score = grid_cv.best_score_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
best_model_name = "Logistic Regression"  # classifier selected by the grid search

Pipeline(steps=[('scaler', StandardScaler()),
                ('classifier',
                 LogisticRegression(C=0.5, max_iter=50, n_jobs=-1,
                                    random_state=21, solver='saga'))])


In [88]:
# Calculating precision, recall, and F1 score
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
score = accuracy
print(f"The best model is {best_model_name}", '\n',
     f"The Precision score is: {precision}", '\n', 
     f"The Accuracy score is: {accuracy}", '\n',
     f"The recall score is: {recall}", '\n',
     f"The F1 score is: {f1}", '\n',
     f"The best model parameter is:\n {best_model_info}", '\n', 
     f"The best cross validation score is: {best_model_cv_score}") 

The best model is Logistic Regression 
 The Precision score is: 1.0 
 The Accuracy score is: 0.8157894736842105 
 The recall score is: 0.125 
 The F1 score is: 0.2222222222222222 
 The best model parameter is:
 {'classifier': LogisticRegression(C=0.5, max_iter=50, n_jobs=-1, random_state=21,
                   solver='saga'), 'classifier__C': 0.5, 'classifier__max_iter': 50, 'classifier__n_jobs': -1, 'classifier__random_state': 21, 'classifier__solver': 'saga'} 
 The best cross validation score is: 0.8157575757575759
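
The reported test scores are consistent with a single confusion matrix, which the sketch below reconstructs. The counts are inferred rather than taken from the notebook: they assume a 38-image test set containing 8 Arnold images, which is what a stratified 20% split of 190 images would give.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score,
)

# Counts inferred from the reported metrics (an assumption, not notebook output):
# recall 0.125 -> 1 of 8 positives found; precision 1.0 -> no false positives.
tp, fn, fp, tn = 1, 7, 0, 30

y_true = np.array([1] * (tp + fn) + [0] * (fp + tn))
y_hat = np.array([1] * tp + [0] * fn + [1] * fp + [0] * tn)

print(confusion_matrix(y_true, y_hat))  # rows: true class, columns: predicted class
print("accuracy :", accuracy_score(y_true, y_hat))
print("precision:", precision_score(y_true, y_hat))
print("recall   :", recall_score(y_true, y_hat))
print("f1       :", f1_score(y_true, y_hat))

# Baseline for comparison: always predicting "not Arnold".
baseline_accuracy = (fp + tn) / len(y_true)
print("majority-class baseline accuracy:", baseline_accuracy)
```

Under these assumed counts, the model's accuracy is only one correct image above the majority-class baseline, and it misses 7 of the 8 Arnold images. With classes this imbalanced, accuracy alone says little about the model's usefulness, so recall and F1 deserve more weight when comparing candidates.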
