# Model Development

This notebook loads the feature-engineered, preprocessed, and scaled features for each person, together with their gender, emotional state, and unique identifier, and uses them to train several models. Hyperparameter tuning may follow, depending on the initial results of each model.

In [2]:
import pandas as pd

# Path to the model-ready CSV in the feature_engineering directory
model_ready_path = "../feature_engineering/model_ready.csv"
model_df = pd.read_csv(model_ready_path)

In [3]:
model_df.head()

Unnamed: 0,index,gender,person_id,neutral,smile,anger,left_light,EyeLengthRatio,EyeDistanceRatio,NoseRatio,LipSizeRatio,LipLengthRatio,EyeBrowLengthRatio,AggressiveRatio
0,0,1,m001,1,0,0,0,-0.95324,-1.874377,0.465741,0.941153,-0.57094,2.860458,-0.249554
1,1,1,m001,0,1,0,0,-0.091681,-1.670062,-0.539163,-0.156945,0.134949,1.786591,-0.078427
2,2,1,m001,0,0,1,0,0.522945,-1.397896,-0.119616,0.345342,-0.138899,2.419592,0.009484
3,3,1,m001,0,0,0,1,-1.428806,-0.907017,0.297125,-0.16686,-1.371568,2.683468,-0.556948
4,4,1,m002,1,0,0,0,-0.222307,0.665548,-1.172176,-0.262928,-1.070341,0.898074,0.366138


## Train Test Split

Because we have so little data, we have to decide carefully what counts as our training dataset. If a person is left out of training entirely, the model will not be able to identify them at all, so at the bare minimum at least one emotion from each person should be in the training dataset.
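
If a split were wanted later, one way to respect this constraint is to hold out a single image per person and train on the rest. The following is only a sketch on made-up toy data; the helper `holdout_one_per_person` and the frame `toy_df` are hypothetical names, and it assumes every person has at least two rows.

```python
import pandas as pd

# Toy stand-in for model_df: 3 people x 4 images each (values are made up).
toy_df = pd.DataFrame({
    "person_id": ["m001"] * 4 + ["m002"] * 4 + ["m003"] * 4,
    "NoseRatio": [0.1, 0.2, 0.3, 0.4, -0.5, -0.4, -0.3, -0.2, 1.0, 1.1, 1.2, 1.3],
})

def holdout_one_per_person(df, id_col="person_id", seed=0):
    """Return (train, test) where test holds exactly one random row per person.

    Assumes each person has at least two rows, so nobody disappears from train.
    """
    test = df.groupby(id_col).sample(n=1, random_state=seed)
    train = df.drop(index=test.index)
    return train, test

train_df, test_df = holdout_one_per_person(toy_df)
print(len(train_df), len(test_df))
```

Every person in the held-out set still appears in the training set, so the model is only ever asked about people it has seen.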

Because this project is face identification (which could potentially be deployed for authentication), I believe it is best to let the model overfit to all of the training data: people who should be authenticated are recognized, while new users should be rejected automatically. We want this model to work extremely well for our initial 136 persons, but not necessarily for anyone outside the original dataset. In addition, splitting this extremely small dataset into several portions would hurt overall performance, because too little data would remain to train the various models effectively.
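
One caveat is that a plain classifier always returns one of the enrolled labels, so rejecting new users needs an extra rule. A minimal sketch of one option is a distance threshold on the nearest neighbour. The function `predict_or_reject`, the toy data, and the threshold value are all assumptions made for illustration, not part of this notebook.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def predict_or_reject(knn, X_query, max_distance):
    """Return the nearest enrolled label, or None when the nearest neighbour is too far."""
    distances, _ = knn.kneighbors(X_query, n_neighbors=1)
    labels = knn.predict(X_query)
    return [str(label) if d[0] <= max_distance else None
            for label, d in zip(labels, distances)]

# Toy enrolled users in a 2-D feature space (values are made up).
X_enrolled = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
y_enrolled = np.array(["m001", "m001", "m002", "m002"])
knn = KNeighborsClassifier(n_neighbors=1).fit(X_enrolled, y_enrolled)

queries = np.array([[0.05, 0.0], [20.0, -20.0]])  # near m001, far from everyone
result = predict_or_reject(knn, queries, max_distance=1.0)  # threshold is hypothetical
print(result)
```

In practice the threshold would have to be chosen from data, for example from distances between images of the same person versus different people.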


In [5]:
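# TODO: Figure out ideal train_test_split (deciding against due to small dataset)
# Note: 'Unnamed: 0' (which appears to be a row counter left over from the CSV export)
# is not dropped here, so it stays in X with gender, the expression flags, and the ratios.
X = model_df.drop(['person_id', 'index'], axis=1)
y = model_df['person_id']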

## Initial Model Development

Parameters are chosen somewhat arbitrarily, based on previous experience of what usually performed best in other classification scenarios. They will be tuned if the initial models show promising accuracy. This is a multiclass classification problem with one label per person. Note that the score method reports mean accuracy and classification_report reports per-class precision, recall, and F1; cross-entropy is the loss that logistic regression and the MLP minimize during training, not the number printed below.
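
Accuracy and cross-entropy measure different things, which a small example can make concrete. The sketch below uses made-up probability tables, and the helper names `accuracy` and `cross_entropy` are hypothetical. Both sets of predictions pick the right class every time, yet the less confident one has a much higher cross-entropy.

```python
import numpy as np

def accuracy(y_true, proba):
    """Fraction of samples whose highest-probability class equals the true class."""
    return float(np.mean(np.argmax(proba, axis=1) == y_true))

def cross_entropy(y_true, proba, eps=1e-12):
    """Mean negative log-probability assigned to the true class."""
    p_true = proba[np.arange(len(y_true)), y_true]
    return float(-np.mean(np.log(np.clip(p_true, eps, 1.0))))

y_true = np.array([0, 1, 2])
confident = np.array([[0.90, 0.05, 0.05], [0.05, 0.90, 0.05], [0.05, 0.05, 0.90]])
hesitant = np.array([[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]])

acc_confident, acc_hesitant = accuracy(y_true, confident), accuracy(y_true, hesitant)
ce_confident, ce_hesitant = cross_entropy(y_true, confident), cross_entropy(y_true, hesitant)
print(acc_confident, acc_hesitant)  # identical accuracy
print(ce_confident, ce_hesitant)    # different cross-entropy
```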

In [16]:
from sklearn.metrics import classification_report

In [21]:
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier(n_neighbors = 1)
knn_model.fit(X, y)
knn_model.score(X, y)

knn_pred = knn_model.predict(X)
print(classification_report(y, knn_pred, target_names=knn_model.classes_))

              precision    recall  f1-score   support

        m001       1.00      1.00      1.00         4
        m002       1.00      1.00      1.00         4
        m003       1.00      1.00      1.00         4
        m004       1.00      1.00      1.00         4
        m005       1.00      1.00      1.00         4
        m006       1.00      1.00      1.00         4
        m007       1.00      1.00      1.00         4
        m008       1.00      1.00      1.00         4
        m009       1.00      1.00      1.00         4
        m010       1.00      1.00      1.00         4
        m011       1.00      1.00      1.00         4
        m012       1.00      1.00      1.00         4
        m013       1.00      1.00      1.00         4
        m014       1.00      1.00      1.00         4
        m015       1.00      1.00      1.00         4
        m016       1.00      1.00      1.00         4
        m017       1.00      1.00      1.00         4
        m018       1.00    
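
A perfect report is expected here almost by construction: with `n_neighbors=1` and scoring on the training rows, each row's nearest neighbour is itself. The sketch below illustrates this on pure noise (made-up data, not the notebook's features), where 1-NN still reaches a training accuracy of 1.0 while a hand-rolled leave-one-out check does much worse.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(40, 5))      # random features, no real signal
y_noise = np.repeat(np.arange(10), 4)   # 10 fake "people", 4 rows each

knn = KNeighborsClassifier(n_neighbors=1).fit(X_noise, y_noise)
train_acc = knn.score(X_noise, y_noise)  # each row is its own nearest neighbour

# Leave-one-out: refit without row i, then predict row i.
hits = 0
for i in range(len(X_noise)):
    keep = np.arange(len(X_noise)) != i
    model = KNeighborsClassifier(n_neighbors=1).fit(X_noise[keep], y_noise[keep])
    hits += int(model.predict(X_noise[i:i + 1])[0] == y_noise[i])
loo_acc = hits / len(X_noise)
print(train_acc, loo_acc)
```

This does not mean the KNN model above is bad, only that its training score alone cannot tell us how well it recognizes an unseen image of a known person.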

In [19]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X, y)
clf.score(X, y)

lr_pred = clf.predict(X)
print(classification_report(y, lr_pred, target_names=clf.classes_))

              precision    recall  f1-score   support

        m001       0.67      1.00      0.80         4
        m002       1.00      0.50      0.67         4
        m003       1.00      0.50      0.67         4
        m004       0.00      0.00      0.00         4
        m005       0.60      0.75      0.67         4
        m006       0.40      0.50      0.44         4
        m007       1.00      0.25      0.40         4
        m008       0.67      1.00      0.80         4
        m009       0.00      0.00      0.00         4
        m010       0.75      0.75      0.75         4
        m011       0.00      0.00      0.00         4
        m012       0.40      0.50      0.44         4
        m013       1.00      0.75      0.86         4
        m014       0.57      1.00      0.73         4
        m015       0.50      0.75      0.60         4
        m016       1.00      0.50      0.67         4
        m017       0.50      0.50      0.50         4
        m018       1.00    
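
Rows with a precision and recall of 0.00, such as m004, may correspond to people the model never predicted at all. In that case precision is undefined (zero divided by zero) and scikit-learn substitutes 0.0 and emits a warning. A small sketch on made-up labels shows how the `zero_division` argument makes that choice explicit:

```python
from sklearn.metrics import classification_report, precision_score

y_true = ["m001", "m001", "m002", "m002"]
y_pred = ["m001", "m001", "m001", "m001"]  # m002 is never predicted

# zero_division=0 reports 0.0 for the undefined precision without a warning.
per_class = precision_score(y_true, y_pred, average=None,
                            labels=["m001", "m002"], zero_division=0)
print(per_class)
print(classification_report(y_true, y_pred, zero_division=0))
```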



In [18]:
from sklearn.neural_network import MLPClassifier

m_clf = MLPClassifier(random_state=1, max_iter=300).fit(X, y)
m_clf.score(X, y)

m_pred = m_clf.predict(X)
print(classification_report(y, m_pred, target_names=m_clf.classes_))

              precision    recall  f1-score   support

        m001       1.00      1.00      1.00         4
        m002       1.00      1.00      1.00         4
        m003       1.00      1.00      1.00         4
        m004       0.80      1.00      0.89         4
        m005       1.00      1.00      1.00         4
        m006       1.00      1.00      1.00         4
        m007       1.00      0.75      0.86         4
        m008       1.00      1.00      1.00         4
        m009       1.00      1.00      1.00         4
        m010       1.00      1.00      1.00         4
        m011       1.00      1.00      1.00         4
        m012       1.00      1.00      1.00         4
        m013       1.00      1.00      1.00         4
        m014       1.00      1.00      1.00         4
        m015       0.80      1.00      0.89         4
        m016       1.00      1.00      1.00         4
        m017       1.00      1.00      1.00         4
        m018       1.00    



In [17]:
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X, y)
print(nb.score(X, y))

nb_pred = nb.predict(X)
print(classification_report(y, nb_pred, target_names=nb.classes_))

0.593320235756385
              precision    recall  f1-score   support

        m001       1.00      1.00      1.00         4
        m002       1.00      0.75      0.86         4
        m003       0.40      0.50      0.44         4
        m004       1.00      0.50      0.67         4
        m005       1.00      0.50      0.67         4
        m006       1.00      0.25      0.40         4
        m007       1.00      0.25      0.40         4
        m008       1.00      1.00      1.00         4
        m009       0.33      0.25      0.29         4
        m010       0.75      0.75      0.75         4
        m011       0.20      0.25      0.22         4
        m012       1.00      0.50      0.67         4
        m013       0.67      0.50      0.57         4
        m014       1.00      1.00      1.00         4
        m015       1.00      0.75      0.86         4
        m016       0.67      0.50      0.57         4
        m017       1.00      0.25      0.40         4
        m



In [22]:
from sklearn.svm import SVC

svm_clf = SVC()
svm_clf.fit(X, y)
print(svm_clf.score(X, y))

svm_pred = svm_clf.predict(X)
print(classification_report(y, svm_pred, target_names=svm_clf.classes_))

0.6090373280943026
              precision    recall  f1-score   support

        m001       1.00      1.00      1.00         4
        m002       0.80      1.00      0.89         4
        m003       0.60      0.75      0.67         4
        m004       1.00      0.25      0.40         4
        m005       1.00      0.50      0.67         4
        m006       0.50      1.00      0.67         4
        m007       0.75      0.75      0.75         4
        m008       0.80      1.00      0.89         4
        m009       0.00      0.00      0.00         4
        m010       0.50      0.75      0.60         4
        m011       0.00      0.00      0.00         4
        m012       0.50      0.75      0.60         4
        m013       1.00      0.50      0.67         4
        m014       1.00      0.75      0.86         4
        m015       0.50      0.75      0.60         4
        m016       0.75      0.75      0.75         4
        m017       1.00      0.75      0.86         4
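
The default SVC reaches only about 0.61 accuracy on its own training data, which makes it a natural candidate for the hyperparameter tuning mentioned earlier. The following is a sketch under stated assumptions, not the tuning actually used in this project: the data is a made-up toy set, and the parameter grid and fold count are hypothetical choices.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy data: 5 "people", 4 rows each, clustered around separate centres.
centres = rng.normal(scale=5.0, size=(5, 3))
X_toy = np.repeat(centres, 4, axis=0) + rng.normal(scale=0.3, size=(20, 3))
y_toy = np.repeat(np.arange(5), 4)

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1.0]}  # hypothetical grid
# With 4 rows per class, stratified CV can use at most 4 folds.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=cv)
search.fit(X_toy, y_toy)
print(search.best_params_, search.best_score_)
```

Cross-validated search does require holding rows out within each fold, so on the real data it would only work for people with several images.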
        

