# Build Classification Model

In [33]:
import pandas as pd
cuisines_df = pd.read_csv("../data/cleaned_cuisines.csv")
cuisines_df.head()

,Unnamed: 0,cuisine,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,0,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,indian,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,3,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4,indian,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [34]:
cuisines_label_df = cuisines_df['cuisine']
cuisines_label_df.head()


0    indian
1    indian
2    indian
3    indian
4    indian
Name: cuisine, dtype: object

In [35]:
cuisines_feature_df = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)
cuisines_feature_df.head()

,almond,angelica,anise,anise_seed,apple,apple_brandy,apricot,armagnac,artemisia,artichoke,...,whiskey,white_bread,white_wine,whole_grain_wheat_flour,wine,wood,yam,yeast,yogurt,zucchini
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [36]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X = cuisines_feature_df
y = cuisines_label_df
# Hold out 30% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=355)

max_depth: The maximum depth of each tree in the random forest, that is, the longest allowed path from the root node to a leaf node.

In [37]:
for max_depth in range(1,30,1):
    RFC = RandomForestClassifier(max_depth=max_depth)
    RFC.fit(X_train,y_train)
    print(RFC.score(X_test,y_test))

0.5896580483736447
0.6722268557130943
0.6605504587155964
0.6914095079232694
0.6922435362802335
0.7155963302752294
0.7247706422018348
0.7339449541284404
0.7381150959132611
0.7489574645537949
0.7422852376980817
0.755629691409508
0.7581317764804003
0.7723102585487907
0.7689741451209341
0.7873227689741451
0.7856547122602169
0.7914929107589658
0.7998331943286072
0.7931609674728941
0.7981651376146789
0.8040033361134279
0.8206839032527106
0.7998331943286072
0.8190158465387823
0.8240200166805671
0.8248540450375312
0.8306922435362802
0.8298582151793161


As you can see, the score climbs steadily with depth, from about 0.59 at depth 1 to about 0.82 at depth 23, and then flattens out: beyond a depth of about 24, deeper trees change the score very little.
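The loop above does not fix the classifier's `random_state`, so part of the wobble between neighbouring scores may be run-to-run noise. The following is a minimal sketch of the same kind of sweep with a fixed seed. It assumes a synthetic dataset from `make_classification` as a stand-in for the cuisines data, and the names `X_demo`, `y_demo` and `depth_scores` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cuisines data (an assumption, used only to keep the sketch self-contained)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=355
)

depth_scores = {}
for depth in range(1, 11):
    # A fixed random_state means any score change comes from max_depth alone
    rfc = RandomForestClassifier(max_depth=depth, n_estimators=50, random_state=0)
    rfc.fit(X_tr, y_tr)
    depth_scores[depth] = rfc.score(X_te, y_te)
    # Sanity check: no tree in the forest is deeper than the requested limit
    assert max(tree.get_depth() for tree in rfc.estimators_) <= depth

for depth, score in depth_scores.items():
    print(f"max_depth={depth:2d}  test accuracy={score:.4f}")
```

Storing the scores in a dictionary keyed by depth also makes it easier to see which depth produced which score than reading a bare column of numbers.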

min_samples_split: The minimum number of observations a node must contain before a decision tree in the random forest is allowed to split it.

In [38]:
for min_samples_split in range(2,10):
    RFC = RandomForestClassifier(min_samples_split=min_samples_split)
    RFC.fit(X_train,y_train)
    print(RFC.score(X_test,y_test))

0.8290241868223519
0.8265221017514596
0.8315262718932444
0.8223519599666389
0.8298582151793161
0.8265221017514596
0.8223519599666389
0.8248540450375312


Evidently, min_samples_split makes little difference over this range: every score falls between about 0.82 and 0.83.
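When scores sit this close together, a single train/test split may not be enough to tell a real effect from noise. One possible check, sketched here on synthetic data (an assumption; `X_demo`, `y_demo` and `cv_summary` are illustrative names), is to compare the mean and spread of cross-validated scores using `cross_val_score`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the cuisines data (an assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=0
)

cv_summary = {}
for min_samples_split in (2, 5, 9):
    rfc = RandomForestClassifier(
        min_samples_split=min_samples_split, n_estimators=50, random_state=0
    )
    # Five-fold cross-validation gives five accuracy values per setting
    fold_scores = cross_val_score(rfc, X_demo, y_demo, cv=5)
    cv_summary[min_samples_split] = (fold_scores.mean(), fold_scores.std())

for value, (mean, std) in cv_summary.items():
    print(f"min_samples_split={value}: mean={mean:.4f}  std={std:.4f}")
```

If the differences between the means are smaller than the standard deviations, the parameter is probably not driving the score.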

min_samples_leaf: The minimum number of samples that must remain in each leaf node after a node is split.

In [39]:
for min_samples_leaf in range(1,20):
    RFC = RandomForestClassifier(min_samples_leaf=min_samples_leaf)
    RFC.fit(X_train,y_train)
    print(RFC.score(X_test,y_test))

0.8265221017514596
0.8148457047539617
0.7939949958298582
0.7898248540450375
0.7564637197664721
0.7489574645537949
0.7489574645537949
0.7397831526271893
0.7464553794829024
0.7272727272727273
0.7331109257714762
0.7272727272727273
0.7197664720600501
0.7239366138448707
0.7331109257714762
0.7281067556296914
0.7222685571309424
0.7247706422018348
0.7172643869891576


Clearly, the score falls as min_samples_leaf grows, from about 0.83 at 1 to about 0.72 at 19. The decline is not strictly monotonic, since a few values score slightly higher than the one before, but the overall trend is downward.
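One plausible reading of this decline is that larger leaves prevent the trees from fitting fine detail, so the model underfits. A rough way to probe that, sketched here on synthetic data (an assumption, as are the names `X_demo`, `y_demo` and `leaf_results`), is to compare training and test accuracy for a few leaf sizes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cuisines data (an assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=355
)

leaf_results = {}
for min_samples_leaf in (1, 5, 20):
    rfc = RandomForestClassifier(
        min_samples_leaf=min_samples_leaf, n_estimators=50, random_state=0
    )
    rfc.fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for each leaf size
    leaf_results[min_samples_leaf] = (rfc.score(X_tr, y_tr), rfc.score(X_te, y_te))

for value, (train_acc, test_acc) in leaf_results.items():
    print(f"min_samples_leaf={value:2d}  train={train_acc:.4f}  test={test_acc:.4f}")
```

If training accuracy drops along with test accuracy as the leaf size grows, that points to underfitting rather than to a fix for overfitting.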

max_leaf_nodes: This hyperparameter caps the number of leaf nodes in each tree, which restricts how far the tree can grow.

In [40]:
for max_leaf_nodes in range(2,30):
    RFC = RandomForestClassifier(max_leaf_nodes=max_leaf_nodes)
    RFC.fit(X_train,y_train)
    print(RFC.score(X_test,y_test))

0.6088407005838199
0.6380316930775647
0.6563803169307757
0.6572143452877398
0.6622185154295246
0.6889074228523769
0.6947456213511259
0.6939115929941618
0.6989157631359466
0.7105921601334445
0.7122602168473728
0.7214345287739783
0.7064220183486238
0.7147623019182652
0.7114261884904087
0.7206005004170142
0.7356130108423686
0.7139282735613011
0.725604670558799
0.7356130108423686
0.7347789824854045
0.7264386989157632
0.74395329441201
0.7456213511259383
0.7322768974145121
0.7356130108423686
0.7281067556296914
0.7506255212677231


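Raising max_leaf_nodes improves the score, from about 0.61 with 2 leaves to about 0.75 with 29, but the gains shrink: past a value of about 21 the score only fluctuates between about 0.73 and 0.75. Every value in this range still scores well below the roughly 0.83 reached with the default, unrestricted setting.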
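To see how tightly this parameter restricts the trees, the leaf count of each fitted tree can be inspected with `get_n_leaves()`. The sketch below uses synthetic data (an assumption; `X_demo`, `y_demo`, `largest_tree_by_cap` and `unrestricted_leaves` are illustrative names) and compares a few caps against a forest with no cap.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the cuisines data (an assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=0
)

largest_tree_by_cap = {}
for cap in (2, 8, 32):
    rfc = RandomForestClassifier(max_leaf_nodes=cap, n_estimators=25, random_state=0)
    rfc.fit(X_demo, y_demo)
    # Record the leaf count of the largest tree grown under this cap
    largest_tree_by_cap[cap] = max(tree.get_n_leaves() for tree in rfc.estimators_)

# With the default max_leaf_nodes=None the trees grow until other limits stop them
unrestricted = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)
unrestricted_leaves = max(tree.get_n_leaves() for tree in unrestricted.estimators_)

print("largest tree per cap:", largest_tree_by_cap)
print("largest unrestricted tree:", unrestricted_leaves)
```

A large gap between the capped and unrestricted leaf counts would be consistent with the capped forests scoring below the default.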


n_estimators: Number of trees in the forest.

In [41]:
for n_estimators in range(100,150):
    RFC = RandomForestClassifier(n_estimators=n_estimators)
    RFC.fit(X_train,y_train)
    print(RFC.score(X_test,y_test))

0.8240200166805671
0.8265221017514596
0.8248540450375312
0.823185988323603
0.8298582151793161
0.8290241868223519
0.8273561301084237
0.823185988323603
0.8281901584653878
0.8265221017514596
0.8281901584653878
0.8265221017514596
0.8290241868223519
0.8273561301084237
0.8256880733944955
0.8306922435362802
0.8265221017514596
0.8240200166805671
0.8306922435362802
0.8265221017514596
0.8198498748957465
0.8315262718932444
0.8315262718932444
0.8248540450375312
0.8290241868223519
0.8290241868223519
0.835696413678065
0.8290241868223519
0.8323603002502085
0.8248540450375312
0.8281901584653878
0.8273561301084237
0.8290241868223519
0.8315262718932444
0.8281901584653878
0.8298582151793161
0.8215179316096747
0.8265221017514596
0.8256880733944955
0.8348623853211009
0.8240200166805671
0.8248540450375312
0.8298582151793161
0.8298582151793161
0.823185988323603
0.8323603002502085
0.823185988323603
0.8281901584653878
0.8198498748957465
0.8265221017514596


Between 100 and 149 trees, the score stays between about 0.82 and 0.84 with no clear trend, so n_estimators has little effect in this range.
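Refitting a whole forest for every value of n_estimators repeats a lot of work. As an alternative sketch, scikit-learn's `warm_start=True` option keeps the trees already trained and only adds the new ones when `n_estimators` is increased. The data here is synthetic (an assumption), and `scores_by_size` is an illustrative name.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cuisines data (an assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=355
)

# warm_start=True reuses the existing trees on each subsequent call to fit
rfc = RandomForestClassifier(n_estimators=10, warm_start=True, random_state=0)

scores_by_size = {}
for n_trees in range(10, 60, 10):
    rfc.set_params(n_estimators=n_trees)
    rfc.fit(X_tr, y_tr)  # trains only the trees that are not there yet
    scores_by_size[n_trees] = rfc.score(X_te, y_te)

for n_trees, score in scores_by_size.items():
    print(f"n_estimators={n_trees:2d}  test accuracy={score:.4f}")
```

Because each larger forest contains the smaller one, the resulting curve shows the effect of adding trees rather than the effect of drawing a fresh random forest each time.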

max_features: The maximum number of features each tree in the random forest considers when looking for the best split.



In [42]:
RFC = RandomForestClassifier(max_features="log2")
RFC.fit(X_train,y_train)
print(RFC.score(X_test,y_test))

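# "auto" meant the same as "sqrt" for classifiers; it was deprecated in scikit-learn 1.1 and removed in 1.3
RFC = RandomForestClassifier(max_features="auto")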
RFC.fit(X_train,y_train)
print(RFC.score(X_test,y_test))

RFC = RandomForestClassifier(max_features="sqrt")
RFC.fit(X_train,y_train)
print(RFC.score(X_test,y_test))

0.8348623853211009
0.8290241868223519
0.8340283569641368
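The string settings translate into a concrete number of features per split: the integer part of the square root of the feature count for "sqrt", and of its base-2 logarithm for "log2". The sketch below reads the resolved value from a fitted tree's `max_features_` attribute. It assumes a synthetic dataset with 64 features, chosen only so the arithmetic is easy to follow, and `resolved` is an illustrative name.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 64 features (an assumption, picked for round numbers)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=64, n_informative=10, random_state=0
)

resolved = {}
for setting in ("sqrt", "log2", None):
    rfc = RandomForestClassifier(max_features=setting, n_estimators=5, random_state=0)
    rfc.fit(X_demo, y_demo)
    # Each fitted tree stores the number of features it actually considers per split
    resolved[setting] = rfc.estimators_[0].max_features_

print(resolved)  # {'sqrt': 8, 'log2': 6, None: 64}
```

With many more features than 64, the gap between "sqrt" and "log2" widens, which is one reason the two settings can give different scores.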


criterion: The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.



In [43]:
RFC = RandomForestClassifier(criterion="gini")
RFC.fit(X_train,y_train)
print(RFC.score(X_test,y_test))

RFC = RandomForestClassifier(criterion="entropy")
RFC.fit(X_train,y_train)
print(RFC.score(X_test,y_test))



0.8273561301084237
0.8273561301084237
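Both criteria measure how mixed the class labels in a node are, and both are zero for a node that contains a single class. As a small illustration in plain Python, the functions below compute Gini impurity and entropy for a list of labels. The function names and the example labels are hypothetical and not taken from the dataset.

```python
import math
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: the sum of p * log2(1 / p) over the classes."""
    total = len(labels)
    return sum(
        (count / total) * math.log2(total / count)
        for count in Counter(labels).values()
    )


# Illustrative nodes: one holding a single class, one split evenly between two
pure_node = ["indian"] * 4
mixed_node = ["indian", "indian", "other", "other"]

print(gini_impurity(pure_node), entropy(pure_node))    # 0.0 0.0
print(gini_impurity(mixed_node), entropy(mixed_node))  # 0.5 1.0
```

The two measures rank candidate splits in broadly similar ways, which may help explain why the two scores above are so close.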
