# Model Development

This notebook loads the feature-engineered, preprocessed, and scaled features for each person, together with their gender, emotional state, and unique identifier, and uses them to train several models. Hyperparameter tuning may follow, depending on the initial results of each model.

In [1]:
import pandas as pd

model_ready_df = "../feature_engineering/model_ready.csv"
model_df = pd.read_csv(model_ready_df)

In [2]:
model_df.head()

Unnamed: 0,index,gender,person_id,neutral,smile,anger,left_light,EyeLengthRatio,EyeDistanceRatio,NoseRatio,LipSizeRatio,LipLengthRatio,EyeBrowLengthRatio,AggressiveRatio
0,0,1,0,1,0,0,0,-0.95324,-1.874377,0.465741,0.941153,-0.57094,2.860458,-0.249554
1,1,1,0,0,1,0,0,-0.091681,-1.670062,-0.539163,-0.156945,0.134949,1.786591,-0.078427
2,2,1,0,0,0,1,0,0.522945,-1.397896,-0.119616,0.345342,-0.138899,2.419592,0.009484
3,3,1,0,0,0,0,1,-1.428806,-0.907017,0.297125,-0.16686,-1.371568,2.683468,-0.556948
4,4,1,1,1,0,0,0,-0.222307,0.665548,-1.172176,-0.262928,-1.070341,0.898074,0.366138
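The `Unnamed: 0` column in the output above is what pandas produces when a CSV that was written with its row index is read back without `index_col`. The sketch below reproduces this on a small, made-up frame (`toy` and its values are illustrative assumptions, not the real `model_ready.csv`) and shows one way to avoid the extra column.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real data; the values are made up.
toy = pd.DataFrame({"person_id": [0, 0, 1], "NoseRatio": [0.47, -0.54, -1.17]})

# to_csv() writes the row index as an unnamed first column by default.
csv_text = toy.to_csv()

# Reading it back without index_col turns that column into 'Unnamed: 0'.
with_extra = pd.read_csv(io.StringIO(csv_text))

# Passing index_col=0 treats the first column as the index instead.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))     # ['Unnamed: 0', 'person_id', 'NoseRatio']
print(list(without_extra.columns))  # ['person_id', 'NoseRatio']
```

Whether the column should be dropped here depends on how `model_ready.csv` was written, which this notebook does not show.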


## Train Test Split

Because we have so little data, we need to decide carefully what goes into the training set. If a person is left out of training entirely, the model cannot identify them at all, so at a bare minimum the training set should contain at least one emotion row from each person.
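One way to satisfy that constraint is to hold out a single row per person and train on the rest. The sketch below is only an illustration: the helper `split_one_per_person` and the frame `toy_df` are hypothetical names, and the toy data assumes four rows per person.

```python
import pandas as pd


def split_one_per_person(df, id_col="person_id", seed=0):
    """Hold out one randomly chosen row per person; train on the rest."""
    # Only people with at least two rows can give one up for testing,
    # otherwise they would vanish from the training set.
    counts = df.groupby(id_col)[id_col].transform("size")
    eligible = df[counts >= 2]
    test_df = eligible.groupby(id_col).sample(n=1, random_state=seed)
    train_df = df.drop(index=test_df.index)
    return train_df, test_df


# Toy frame standing in for model_df: 3 people, 4 rows each (an assumption).
toy_df = pd.DataFrame({
    "person_id": [0] * 4 + [1] * 4 + [2] * 4,
    "EyeLengthRatio": [0.1 * i for i in range(12)],
})

train_df, test_df = split_one_per_person(toy_df)
print(len(train_df), len(test_df))  # 9 3
```

Every person in the test set is guaranteed to appear in the training set, which is the minimum requirement stated above. The helper relies on the frame having a unique index.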

In [4]:
model_df.head()

Unnamed: 0,index,gender,person_id,neutral,smile,anger,left_light,EyeLengthRatio,EyeDistanceRatio,NoseRatio,LipSizeRatio,LipLengthRatio,EyeBrowLengthRatio,AggressiveRatio
0,0,1,0,1,0,0,0,-0.95324,-1.874377,0.465741,0.941153,-0.57094,2.860458,-0.249554
1,1,1,0,0,1,0,0,-0.091681,-1.670062,-0.539163,-0.156945,0.134949,1.786591,-0.078427
2,2,1,0,0,0,1,0,0.522945,-1.397896,-0.119616,0.345342,-0.138899,2.419592,0.009484
3,3,1,0,0,0,0,1,-1.428806,-0.907017,0.297125,-0.16686,-1.371568,2.683468,-0.556948
4,4,1,1,1,0,0,0,-0.222307,0.665548,-1.172176,-0.262928,-1.070341,0.898074,0.366138


In [8]:
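# TODO: Figure out ideal train_test_split
# Features: all columns except the label and 'index' ('Unnamed: 0' stays in X).
X = model_df.drop(['person_id', 'index'], axis=1)
y = model_df['person_id']  # label: the person to identify
X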

Unnamed: 0,gender,neutral,smile,anger,left_light,EyeLengthRatio,EyeDistanceRatio,NoseRatio,LipSizeRatio,LipLengthRatio,EyeBrowLengthRatio,AggressiveRatio
0,1,1,0,0,0,-0.953240,-1.874377,0.465741,0.941153,-0.570940,2.860458,-0.249554
1,1,0,1,0,0,-0.091681,-1.670062,-0.539163,-0.156945,0.134949,1.786591,-0.078427
2,1,0,0,1,0,0.522945,-1.397896,-0.119616,0.345342,-0.138899,2.419592,0.009484
3,1,0,0,0,1,-1.428806,-0.907017,0.297125,-0.166860,-1.371568,2.683468,-0.556948
4,1,1,0,0,0,-0.222307,0.665548,-1.172176,-0.262928,-1.070341,0.898074,0.366138
...,...,...,...,...,...,...,...,...,...,...,...,...
504,0,0,0,0,1,-1.434533,2.006732,0.902047,-0.612087,-0.142235,1.326757,0.478381
505,0,0,0,1,0,1.031459,0.243960,1.121054,0.860891,-0.322065,0.817728,-0.690946
506,0,0,0,0,1,-1.169589,2.003535,-0.003857,-0.495557,-1.542027,2.102731,-1.438994
507,0,0,0,1,0,0.562756,-0.856631,1.190556,0.172493,0.617674,-0.912238,1.214776


## Initial Model Development

The initial parameters are not tuned; they are chosen from previous experience of what tended to perform well in other classification problems. They will be tuned if the initial models show promising accuracy. Each `score` call below reports mean accuracy on the same data the model was fitted on, so these numbers are training accuracies. Because this is a multinomial classification problem, cross-entropy loss is the training objective of `LogisticRegression` and `MLPClassifier`, but it is not the quantity that `score` returns.
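To make the distinction between accuracy and cross-entropy concrete, the sketch below computes both for one classifier. It uses synthetic data from `make_classification` (an assumption, standing in for `X` and `y`), and the names `X_toy`, `y_toy`, and `toy_clf` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss

# Synthetic 4-class problem; not the real face-feature data.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

toy_clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# .score() on a classifier is mean accuracy, identical to accuracy_score.
acc_from_score = toy_clf.score(X_toy, y_toy)
acc_from_metric = accuracy_score(y_toy, toy_clf.predict(X_toy))

# Cross-entropy needs predicted probabilities rather than hard labels.
cross_entropy = log_loss(y_toy, toy_clf.predict_proba(X_toy), labels=toy_clf.classes_)

print(f"accuracy={acc_from_score:.3f}  cross-entropy={cross_entropy:.3f}")
```

Accuracy is bounded between 0 and 1 and higher is better, while cross-entropy is non-negative and lower is better, so the two should not be compared directly.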

In [16]:
from sklearn.neighbors import KNeighborsClassifier

test_knn_model = KNeighborsClassifier(n_neighbors = 5)
test_knn_model.fit(X, y)
test_knn_model.score(X, y)

0.21414538310412573
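The head of the table suggests roughly four rows per person, so with `n_neighbors = 5` at least one neighbour would always belong to someone else, which may contribute to the low score. If tuning is pursued, a cross-validated search over `n_neighbors` is one option. The sketch below assumes synthetic data with exactly four rows per person; `X_toy`, `y_toy`, and the candidate values are illustrative choices, not results from this dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_people, rows_per_person, n_features = 20, 4, 7

# Each synthetic person is a cluster centre plus small per-row noise.
centers = rng.normal(scale=3.0, size=(n_people, n_features))
noise = rng.normal(scale=0.5, size=(n_people * rows_per_person, n_features))
X_toy = np.repeat(centers, rows_per_person, axis=0) + noise
y_toy = np.repeat(np.arange(n_people), rows_per_person)

# Four folds with four rows per person: each fold tests one row per person.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 2, 3, 5]},
    cv=cv,
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

Scoring on held-out folds matters here: evaluated on its own training data, `n_neighbors = 1` would trivially reach perfect accuracy.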

In [17]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X, y)
clf.score(X, y)

0.6168958742632613

In [26]:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

m_clf = MLPClassifier(random_state=1, max_iter=300).fit(X, y)
m_clf.score(X, y)



0.9882121807465619

In [27]:
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X, y)
nb.score(X, y)

0.593320235756385
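All four scores above are measured on the data the models were trained on, so they say little about generalisation. A possible next step is to compare training and held-out accuracy side by side. The sketch below does this on synthetic data with four rows per person (an assumption); `X_toy`, `y_toy`, `models`, and the chosen parameters are illustrative and not tuned for the real features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_people, rows_per_person, n_features = 20, 4, 7

# Synthetic stand-in for X and y: one noisy cluster per person.
centers = rng.normal(scale=3.0, size=(n_people, n_features))
noise = rng.normal(scale=0.5, size=(n_people * rows_per_person, n_features))
X_toy = np.repeat(centers, rows_per_person, axis=0) + noise
y_toy = np.repeat(np.arange(n_people), rows_per_person)

# Stratifying on the label keeps every person in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

models = {
    "knn": KNeighborsClassifier(n_neighbors=3),
    "logreg": LogisticRegression(max_iter=1000),
    "mlp": MLPClassifier(random_state=1, max_iter=300),
    "nb": GaussianNB(),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # Store (training accuracy, held-out accuracy) for each model.
    results[name] = (model.score(X_tr, y_tr), model.score(X_te, y_te))

for name, (train_acc, test_acc) in results.items():
    print(f"{name:7s} train={train_acc:.3f} test={test_acc:.3f}")
```

A large gap between the two columns, as a near-perfect training score might hide, would point to overfitting rather than a genuinely better model.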