# Model Development

This notebook loads the feature-engineered, preprocessed, and scaled features for each person, together with their gender, emotional state, and unique identifier, and uses them to train several models. Hyperparameter tuning may follow, depending on the initial results.

In [5]:
import pandas as pd

model_ready_df = "../feature_engineering/model_ready.csv"
model_df = pd.read_csv(model_ready_df)

In [6]:
model_df.head()

| Unnamed: 0 | index | gender | person_id | neutral | smile | anger | left_light | EyeLengthRatio | EyeDistanceRatio | NoseRatio | LipSizeRatio | LipLengthRatio | EyeBrowLengthRatio | AggressiveRatio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | m061 | 0 | 1 | 0 | 0 | -0.904499 | 0.321063 | -0.184364 | -0.415108 | 0.036263 | -0.734205 | -1.083244 |
| 1 | 1 | 1 | m061 | 1 | 0 | 0 | 0 | -0.689475 | 1.268312 | -0.771988 | 1.115821 | -1.054635 | -0.322105 | -0.30632 |
| 2 | 2 | 1 | m061 | 0 | 0 | 0 | 1 | 1.022126 | 3.352544 | -0.672417 | 1.034812 | -1.164627 | 0.849298 | -0.517595 |
| 3 | 3 | 1 | m061 | 0 | 0 | 1 | 0 | 2.560423 | 2.394853 | -0.756831 | 0.292277 | -0.47488 | 1.57225 | 0.852805 |
| 4 | 4 | 1 | m066 | 1 | 0 | 0 | 0 | -1.417437 | -0.020343 | -1.160558 | 4.45384 | -0.721856 | -1.571687 | -1.40342 |


## Train Test Split

Because we have so little data, we have to decide carefully what counts as our training dataset. If a person is left out of training entirely, the model will not be able to identify them at all, so at the bare minimum at least one emotion from each person should be in the training dataset.
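As an illustration of the constraint above, here is a minimal sketch of a split that holds out exactly one row per person, so that every person still appears in the training portion. The frame `toy_df`, its values, and the choice of one held-out row per person are assumptions made for the example; they are not taken from `model_ready.csv`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for model_df: 5 people, 4 images each (made-up values).
rng = np.random.default_rng(0)
people = [f"p{i:03d}" for i in range(5)]
toy_df = pd.DataFrame({
    "person_id": np.repeat(people, 4),
    "EyeLengthRatio": rng.normal(size=20),
    "NoseRatio": rng.normal(size=20),
})

# Hold out exactly one randomly chosen row per person for evaluation.
test_df = toy_df.groupby("person_id").sample(n=1, random_state=0)

# Everything that was not held out stays in the training portion.
train_df = toy_df.drop(test_df.index)

# Every held-out person is still represented in the training portion.
assert set(test_df["person_id"]) <= set(train_df["person_id"])
print(len(train_df), len(test_df))
```

With roughly four images per person, a split like this costs about a quarter of the data, which is the trade-off the next paragraph weighs.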

Because this project is about face identification (and could potentially be deployed in authentication scenarios), I believe it is best for the model to be overfit to all of the training data. That way we can ensure that those who should be authenticated are, while new users should be rejected automatically. We want this model to work extremely well for our initial 136 persons, but not necessarily for anyone outside the original dataset. In addition, splitting this extremely small dataset into many portions would harm overall performance, because we would run out of data to train the various models effectively.
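A classifier trained this way always predicts one of the enrolled persons, so rejecting new users needs an extra decision rule. One simple option, shown here only as a sketch, is to reject whenever the top predicted probability falls below a threshold. The helper `identify_or_reject`, the toy data, and the threshold value `0.9` are all hypothetical. Predicted probabilities can also be high for faces the model has never seen, so a rule like this would need to be validated on images of people outside the dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: three enrolled people, ten samples each (made-up values).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
X_toy = np.vstack([c + 0.3 * rng.normal(size=(10, 2)) for c in centers])
y_toy = np.repeat(["a", "b", "c"], 10)

toy_clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)


def identify_or_reject(model, features, threshold=0.9):
    """Return the predicted label, or None if the top probability is below threshold.

    Hypothetical helper: the threshold is an assumption and would need tuning.
    """
    proba = model.predict_proba(np.atleast_2d(features))[0]
    best = int(np.argmax(proba))
    if proba[best] >= threshold:
        return model.classes_[best]
    return None


# A sample at the centre of an enrolled person's cluster.
print(identify_or_reject(toy_clf, [5.0, 5.0], threshold=0.5))
```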

In [8]:
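# TODO: Figure out ideal train_test_split (deciding against due to small dataset)
# Features: every column except the label and the row-index column
X = model_df.drop(columns=['person_id', 'index'])
# Label: the person to identify
y = model_df['person_id']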

## Initial Model Development

Parameters are chosen informally, based on previous experience of what usually performed best in other classification scenarios. These parameters will be tuned if the initial models score well. Note that `score` reports mean accuracy, and because each model below is scored on the same `X` and `y` it was fit on, these are training accuracies. Cross-entropy is the loss that logistic regression and the MLP minimize during fitting, since this is a multiclass classification problem with one class per person; it is not the number that `score` returns.
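To make the difference between accuracy and cross-entropy concrete, the sketch below computes both for the same fitted classifier. The data is synthetic and the variable names (`X_toy`, `y_toy`, `toy_clf`) are placeholders for the example, not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss

# Synthetic three-class problem (made-up values).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = np.repeat(["a", "b", "c"], 20)
X_toy[y_toy == "b"] += 2.0
X_toy[y_toy == "c"] -= 2.0

toy_clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# score() is mean accuracy: the fraction of correctly predicted labels.
acc = toy_clf.score(X_toy, y_toy)

# Cross-entropy (log loss) is computed from predicted probabilities instead.
ce = log_loss(y_toy, toy_clf.predict_proba(X_toy), labels=toy_clf.classes_)

print(f"accuracy={acc:.3f}  cross-entropy={ce:.3f}")
```

Accuracy only counts whether the top prediction is right, while cross-entropy also penalizes correct predictions made with low confidence, so the two can rank models differently.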

In [9]:
from sklearn.neighbors import KNeighborsClassifier

# k=2 is an initial, untuned choice
test_knn_model = KNeighborsClassifier(n_neighbors=2)
test_knn_model.fit(X, y)
# Mean accuracy on the training data itself (no held-out set)
test_knn_model.score(X, y)

0.5265225933202358

In [10]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X, y)
clf.score(X, y)

0.6168958742632613

In [13]:
from sklearn.neural_network import MLPClassifier

m_clf = MLPClassifier(random_state=1, max_iter=300).fit(X, y)
m_clf.score(X, y)

0.9901768172888016

In [12]:
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X, y)
nb.score(X, y)

0.593320235756385
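If these initial results lead to the hyperparameter tuning mentioned earlier, one possible starting point is a small grid search. The sketch below tunes a k-nearest-neighbors classifier on synthetic data with four samples per class. The parameter grid, the two-fold stratified split, and the toy data are all assumptions chosen for illustration. With only a handful of images per person, the number of folds cannot exceed the size of the smallest class.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: 6 people, 4 samples each, 3 features (made-up values).
rng = np.random.default_rng(0)
n_people, per_person = 6, 4
centers = rng.normal(scale=5.0, size=(n_people, 3))
noise = rng.normal(size=(n_people * per_person, 3))
X_toy = np.repeat(centers, per_person, axis=0) + noise
y_toy = np.repeat(np.arange(n_people), per_person)

# Two stratified folds keep two samples of every person in each training fold.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 2, 3], "weights": ["uniform", "distance"]},
    cv=cv,
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

# best_score_ is the mean held-out accuracy of the best parameter combination.
print(search.best_params_, search.best_score_)
```

Unlike the training scores above, `best_score_` is measured on folds the model did not see while fitting, so it gives a less optimistic estimate.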