# Final Project
## Supervised Learning - AY 2022-2023 by Oleg Lastocichin



The goal of this project is to analyze a dataset describing the characteristics of some
Gormiti creatures. We want to build a model that predicts the land to which a Gormiti
belongs (Rockland, Iceland, Fireland, Windland).

The provided dataset includes the following information:
1. ID: unique identifier of a Gormiti creature.
2. Gormiti_Type: type of the Gormiti creature.
3. Nature: nature of the Gormiti creature.
4. Deadly: boolean indicating whether the Gormiti creature is very dangerous.
5. Growth_rate: growth rate of the Gormiti creature.
6. Strength: strength of the Gormiti creature.
7. Ray: type of ray that gave the Gormiti creature its superpowers.
8. Ability: specific ability of the Gormiti creature.
9. Attack: boolean indicating whether the Gormiti creature attacks or not.
10. Against_Fire: average score of the Gormiti creature in battles against Gormiti of type Fire.
11. Against_Meka: average score of the Gormiti creature in battles against Gormiti of type Meka.
12. Against_Lava: average score of the Gormiti creature in battles against Gormiti of type Lava.
13. Against_Wind: average score of the Gormiti creature in battles against Gormiti of type Wind.
14. Against_Rock: average score of the Gormiti creature in battles against Gormiti of type Rock.
15. Gormiti_Land: the land the Gormiti creature belongs to (target feature).

To build the desired predictive model, complete the following tasks and answer
the following questions.

In [37]:
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn import preprocessing as prepro
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report


import warnings


In [38]:
warnings.filterwarnings("ignore")


# Questions and Tasks


## 1. Load and explore the dataset. If needed, perform data engineering (handling missing values, encoding categorical values, ...).

In [39]:
df_train_raw = pd.read_csv('dataset/Gormiti_train.csv')
df_test_raw = pd.read_csv('dataset/Gormiti_test.csv')

In [40]:
df_train_raw.duplicated(keep=False).sum()

0

In [41]:
df_train_raw.isnull().mean()


Unnamed: 0      0.0
ID              0.0
Gormiti_Type    0.0
Nature          0.0
Deadly          0.0
Growth_rate     0.0
Attack          0.0
Strength        0.0
Ray             0.0
Ability         0.0
Against_Fire    0.0
Against_Meka    0.0
Against_Lava    0.0
Against_Wind    0.0
Against_Rock    0.0
Gormiti_Land    0.0
dtype: float64

In [42]:
df_test_raw.duplicated(keep=False).sum()

0

In [43]:
df_test_raw.isnull().mean()


Unnamed: 0      0.0
ID              0.0
Gormiti_Type    0.0
Nature          0.0
Deadly          0.0
Growth_rate     0.0
Attack          0.0
Strength        0.0
Ray             0.0
Ability         0.0
Against_Fire    0.0
Against_Meka    0.0
Against_Lava    0.0
Against_Wind    0.0
Against_Rock    0.0
Gormiti_Land    0.0
dtype: float64

In [44]:
df_train_raw

Unnamed: 0.1,Unnamed: 0,ID,Gormiti_Type,Nature,Deadly,Growth_rate,Attack,Strength,Ray,Ability,Against_Fire,Against_Meka,Against_Lava,Against_Wind,Against_Rock,Gormiti_Land
0,0,580a08ea-4db6-46ba-b5f1-21f8dfcfa2b0,Fire,Evil,Yes,Slow,Yes,59.0,Omega,stalagmites,-1.020907,-0.198581,-2.994943,-1.596865,-0.164760,Iceland
1,1,efdc2c72-834e-44d3-acf1-8b83a6bd0760,Fire,Evil,No,Medium,Yes,40.0,Beta,Dark bullets,-1.671520,-0.516204,0.600385,0.116257,0.550372,Windland
2,2,e3878403-c13b-4f6f-8e87-a1e72ea2e71c,Wind,Evil,Yes,Fast,Yes,52.0,Alpha,Dark bullets,1.374496,-0.498522,0.097014,2.162580,-1.017195,Iceland
3,3,1d5c102b-47a9-40fd-80b9-96ab8d7b2b7b,Wind,Evil,No,Medium,No,36.0,Alpha,Meka energy,-0.340917,-1.741547,-1.269091,-1.211137,-2.840029,Rockland
4,4,8093fb82-8692-4352-b213-e4d18074fc18,Rock,Good,No,Medium,Yes,58.0,Gamma,whirlwind,1.472178,-1.313746,-1.670818,-2.643774,0.991630,Fireland
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7995,7995,0c50046e-8397-4ac1-a528-f87654a912ef,Aoki,Good,No,Medium,No,41.0,Gamma,Burning energy,0.723469,-1.909741,0.671805,0.080248,-0.690939,Rockland
7996,7996,64ceeaa4-a419-4e46-b1cc-b8befa447eb3,Sea,Good,Yes,Medium,No,41.0,Gamma,whirlwind,2.070754,0.348883,-1.141936,-1.845884,1.022508,Fireland
7997,7997,e28f5bfe-c2a2-4bf8-b699-1f00b1f50bbb,Fearsome Darkens,Evil,No,Slow,Yes,55.0,Omega,whirlwind,1.733139,-0.468651,2.065541,-0.648536,-3.644766,Windland
7998,7998,9ddc67a9-d8b7-416e-8ed1-195b6ac9dc5e,Sea,Evil,Yes,Medium,No,50.0,Gamma,Magical light power,-0.951960,2.022711,-3.837023,3.000448,-2.363179,Iceland


In [45]:
df_test_raw

Unnamed: 0.1,Unnamed: 0,ID,Gormiti_Type,Nature,Deadly,Growth_rate,Attack,Strength,Ray,Ability,Against_Fire,Against_Meka,Against_Lava,Against_Wind,Against_Rock,Gormiti_Land
0,0,091a504d-521f-4ccf-b910-ec1041905d6e,Aoki,Evil,Yes,Fast,No,51.0,Alpha,stalagmites,0.421855,-2.681833,1.135186,1.907123,-0.603116,Iceland
1,1,6a5787a5-0af4-4166-bfd9-70a3bbe1957c,Fire,Evil,Yes,Slow,No,60.0,Alpha,Burning energy,1.295014,-1.418908,1.210324,0.372579,-0.446112,Rockland
2,2,ad51110e-563f-4122-a92c-1e0e529efeba,Fearsome Darkens,Good,No,Fast,No,42.0,Omega,stalagmites,-1.570687,0.828706,-3.671664,1.159636,-0.983343,Iceland
3,3,09fcccaa-3777-453f-902c-2cc25d9459d1,Aoki,Evil,No,Fast,No,59.0,Omega,Dark bullets,-1.357277,-1.131796,-0.950394,-0.948838,0.215572,Rockland
4,4,cd372fd7-ef45-4ec2-a7b5-7c001d68b75d,Meka,Good,No,Medium,Yes,52.0,Beta,whirlwind,-0.567937,-3.762130,-0.963888,-1.242648,-0.409619,Fireland
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,1995,73fd6f38-75b9-4d11-b8d2-d294ede1dc03,Rock,Good,No,Slow,No,41.0,Alpha,Meka energy,-1.262841,0.405344,0.740532,-1.772230,0.266576,Windland
1996,1996,8a5557ee-8c1a-4812-9458-6ae15987a93c,Wind,Good,No,Slow,No,40.0,Gamma,whirlwind,0.232081,-0.496368,-0.878425,0.069013,-1.591982,Iceland
1997,1997,9522532f-62a6-4484-9aed-64882651200d,Ice,Evil,No,Medium,No,39.0,Gamma,stalagmites,0.756009,-0.928642,0.859297,-1.602827,-0.015843,Rockland
1998,1998,4b0ec347-9ad5-461a-901e-5e8303b28736,Sea,Good,Yes,Slow,No,65.0,Alpha,Burning energy,0.199959,0.146354,2.058367,-0.312298,-2.770650,Windland


In [46]:
df_train = df_train_raw.copy(deep=True)
df_test = df_test_raw.copy(deep=True)


In [49]:
df_train.dtypes

Unnamed: 0        int64
ID               object
Gormiti_Type     object
Nature           object
Deadly           object
Growth_rate      object
Attack           object
Strength        float64
Ray              object
Ability          object
Against_Fire    float64
Against_Meka    float64
Against_Lava    float64
Against_Wind    float64
Against_Rock    float64
Gormiti_Land     object
dtype: object

In [50]:
data_types_dict = {'ID': 'category', 'Gormiti_Type': 'category', 'Nature': 'category', 'Growth_rate': 'category', 'Attack': 'category', 'Ray': 'category', 'Ability': 'category', 'Deadly': 'category', 'Gormiti_Land': 'category'}
df_train = df_train.astype(data_types_dict)
df_test = df_test.astype(data_types_dict)
df_train.dtypes

Unnamed: 0         int64
ID              category
Gormiti_Type    category
Nature          category
Deadly          category
Growth_rate     category
Attack          category
Strength         float64
Ray             category
Ability         category
Against_Fire     float64
Against_Meka     float64
Against_Lava     float64
Against_Wind     float64
Against_Rock     float64
Gormiti_Land    category
dtype: object

In [51]:
df_test.dtypes

Unnamed: 0         int64
ID              category
Gormiti_Type    category
Nature          category
Deadly          category
Growth_rate     category
Attack          category
Strength         float64
Ray             category
Ability         category
Against_Fire     float64
Against_Meka     float64
Against_Lava     float64
Against_Wind     float64
Against_Rock     float64
Gormiti_Land    category
dtype: object

In [52]:
df_train = df_train.drop(columns=['ID', 'Unnamed: 0'])
df_test = df_test.drop(columns=['ID', 'Unnamed: 0'])

In [53]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots


def hist_df(data, rows_max, cols_max):
    """Plot one histogram per column, filling the subplot grid column by column."""
    fig = make_subplots(rows=rows_max, cols=cols_max)
    for idx, name in enumerate(data.columns):
        # idx 0..rows_max-1 go to the first grid column, the next rows_max to the second, ...
        col, row = divmod(idx, rows_max)
        fig.add_trace(go.Histogram(x=data[name], name=name), row=row + 1, col=col + 1)

    fig.show()


hist_df(df_train, rows_max=3, cols_max=6)


In [54]:
hist_df(df_test, rows_max=3, cols_max=6)


## 2. Train a Softmax Regression model able to predict the Gormiti_Land class.


#### (a) Perform feature pre-processing if necessary. Discuss your choices and the performed actions.


In [55]:
df_train.shape

(8000, 14)

In [56]:
# Collect the numeric (non-categorical) columns; they are scaled later.
noncat = [c for c in df_train.columns if c not in data_types_dict]

# Outlier handling was explored but left disabled, so no rows are removed:
#   - trimming rows below the 5th or above the 95th percentile of a column
#   - dropping rows outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
df_train.shape


(8000, 14)
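The outlier treatment that the cell above leaves disabled relies on quantile bounds. As a minimal sketch of the 1.5 × IQR rule, assuming a toy Series instead of the Gormiti columns (the helper name `iqr_mask` is hypothetical and not part of the notebook):

```python
import pandas as pd


def iqr_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.between(q1 - k * iqr, q3 + k * iqr)


# Toy data: five ordinary values and one obvious outlier
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
mask = iqr_mask(toy)
kept = toy[mask].tolist()
print(kept)  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

Applied to a DataFrame, the masks of the numeric columns could be combined with `&` before filtering, so that rows are dropped once rather than column by column.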

In [57]:
df_transforming = df_train.copy(deep=True)

In [58]:
df_transforming.drop(columns=['Gormiti_Type','Nature','Deadly','Growth_rate','Attack','Ray','Ability','Gormiti_Land'], inplace=True)

In [59]:
df_transforming

Unnamed: 0,Strength,Against_Fire,Against_Meka,Against_Lava,Against_Wind,Against_Rock
0,59.0,-1.020907,-0.198581,-2.994943,-1.596865,-0.164760
1,40.0,-1.671520,-0.516204,0.600385,0.116257,0.550372
2,52.0,1.374496,-0.498522,0.097014,2.162580,-1.017195
3,36.0,-0.340917,-1.741547,-1.269091,-1.211137,-2.840029
4,58.0,1.472178,-1.313746,-1.670818,-2.643774,0.991630
...,...,...,...,...,...,...
7995,41.0,0.723469,-1.909741,0.671805,0.080248,-0.690939
7996,41.0,2.070754,0.348883,-1.141936,-1.845884,1.022508
7997,55.0,1.733139,-0.468651,2.065541,-0.648536,-3.644766
7998,50.0,-0.951960,2.022711,-3.837023,3.000448,-2.363179


In [60]:
hist_df(df_transforming, 2,3)

In [61]:
# RobustScaler centers each column on its median and scales it by its interquartile range
mas = prepro.RobustScaler()
df_train_scaled = mas.fit_transform(df_transforming.to_numpy())
df_train_scaled = pd.DataFrame(df_train_scaled, columns=df_transforming.columns)

In [62]:
hist_df(df_train_scaled, 2,3)

In [63]:
for i in noncat:
    df_train[i] = df_train_scaled[i]
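The cells above scale only the training frame. For the test features to live on the same scale, the scaler fitted on the training data would also have to transform the test data. A minimal sketch of that pattern, assuming two toy frames (`toy_train`, `toy_test`) rather than the real Gormiti files:

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

num_cols = ["Strength", "Against_Fire"]
toy_train = pd.DataFrame({"Strength": [40.0, 50.0, 60.0, 52.0],
                          "Against_Fire": [-1.0, 0.0, 1.0, 0.5]})
toy_test = pd.DataFrame({"Strength": [45.0, 65.0],
                         "Against_Fire": [0.2, -2.0]})

scaler = RobustScaler()
train_scaled = toy_train.copy()
test_scaled = toy_test.copy()
# Statistics (median, IQR) are learned from the training rows only ...
train_scaled[num_cols] = scaler.fit_transform(toy_train[num_cols])
# ... and reused unchanged on the test rows
test_scaled[num_cols] = scaler.transform(toy_test[num_cols])

print(scaler.center_)  # per-column medians learned from the training data
print(test_scaled)
```

Fitting a second scaler on the test set would instead leak test statistics into the evaluation and put the two sets on different scales.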

In [64]:
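# NOTE: the task asks to predict Gormiti_Land, but this cell uses Gormiti_Type as the
# target and keeps Gormiti_Land among the features; the scores below refer to this setup.
X_train = df_train.drop(columns=['Gormiti_Type'])
y_train = df_train['Gormiti_Type']
X_test = df_test.drop(columns=['Gormiti_Type'])
y_test = df_test['Gormiti_Type']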


In [65]:
# Binary columns -> 0/1. Each encoder is fitted on the training set and reused
# on the test set, so both sets share the same mapping.
for col in ['Attack', 'Deadly', 'Nature']:
    lb = prepro.LabelBinarizer()
    X_train[col] = lb.fit_transform(X_train[col]).ravel()
    X_test[col] = lb.transform(X_test[col]).ravel()

# Multi-valued columns -> integer codes. LabelEncoder sorts the classes
# alphabetically, so Growth_rate becomes Fast=0, Medium=1, Slow=2.
for col in ['Growth_rate', 'Ray', 'Ability', 'Gormiti_Land']:
    le = prepro.LabelEncoder()
    X_train[col] = le.fit_transform(X_train[col])
    X_test[col] = le.transform(X_test[col])
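Integer codes give nominal features such as Ray or Ability an arbitrary order, which a linear model like softmax regression reads as a magnitude. One-hot encoding is a common alternative. The following is a minimal sketch under stated assumptions: toy frames with three of the dataset's column names, and a `ColumnTransformer` that is not used anywhere in the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

toy_train = pd.DataFrame({
    "Ray": ["Alpha", "Beta", "Gamma", "Omega"],
    "Growth_rate": ["Slow", "Medium", "Fast", "Slow"],
    "Strength": [59.0, 40.0, 52.0, 36.0],
})
toy_test = pd.DataFrame({
    "Ray": ["Beta", "Alpha"],
    "Growth_rate": ["Fast", "Medium"],
    "Strength": [58.0, 41.0],
})

pre = ColumnTransformer([
    # one indicator column per category; unseen test categories become all zeros
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Ray", "Growth_rate"]),
    # numeric column scaled with statistics from the training rows
    ("num", RobustScaler(), ["Strength"]),
])
Xtr = pre.fit_transform(toy_train)   # fit on training data only
Xte = pre.transform(toy_test)        # reuse the fitted encoders and scaler
print(Xtr.shape, Xte.shape)  # (4, 8) (2, 8)
```

Tree-based models are less sensitive to this choice, since they only split on thresholds, so the integer codes are a more defensible shortcut for the later sections.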



#### (b) Train a regularized model by applying ℓ2 regularization (the default when you fit a multinomial LogisticRegression in sklearn): tune the hyperparameter C (possibly with grid search) to optimize the generalization performance of the model. What happens if you increase the value of C?
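In scikit-learn's `LogisticRegression`, `C` is the inverse of the regularization strength. Increasing `C` weakens the ℓ2 penalty: the coefficients are allowed to grow, the model fits the training data more closely, and the risk of overfitting rises. A minimal sketch of this effect, assuming synthetic data from `make_classification` rather than the Gormiti set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 4-class problem (an assumption, chosen to mirror the four lands)
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

norms = {}
for C in (0.001, 0.1, 10.0):
    clf = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    # Overall size of the weight matrix: grows as the penalty gets weaker
    norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C:<6} ||coef|| = {norms[C]:.3f}  train acc = {clf.score(X, y):.3f}")
```

The coefficient norm increases with `C`; whether the validation score improves as well is what the grid search below has to decide.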


In [66]:
softmax_reg = LogisticRegression(multi_class="multinomial",solver="lbfgs", C=10)

In [67]:
param_grid = {'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]}
print("Parameter grid:\n{}".format(param_grid))

Parameter grid:
{'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]}


In [68]:
softgrid = GridSearchCV(softmax_reg, param_grid, cv=5)

In [69]:
softgrid.fit(X_train, y_train)

In [70]:
scores = cross_val_score(softgrid, X_train, y_train)
print("Cross-validation scores: {}".format(scores))

Cross-validation scores: [0.146875 0.143125 0.129375 0.138125 0.1275  ]


In [71]:
print("Average cross-validation score: {:.2f}".format(scores.mean()))

Average cross-validation score: 0.14


#### (c) Evaluate the trained model on the provided test set. Verify that the trained model is not overfitting. Discuss the obtained results.


In [72]:
print("Test set score: {:.2f}".format(softgrid.score(X_test, y_test)))

Test set score: 0.11


In [73]:
print("Best parameters: {}".format(softgrid.best_params_))
print("Best cross-validation score: {:.2f}".format(softgrid.best_score_))

Best parameters: {'C': 0.001}
Best cross-validation score: 0.14


In [74]:
# convert to DataFrame
results = pd.DataFrame(softgrid.cv_results_)
display(results)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.039844,0.019526,0.002344,0.0004785913,0.0001,{'C': 0.0001},0.134375,0.138125,0.144375,0.131875,0.115625,0.132875,0.009598,7
1,0.067189,0.041449,0.002735,0.0003908396,0.001,{'C': 0.001},0.146875,0.149375,0.130625,0.138125,0.12625,0.13825,0.008948,1
2,0.107641,0.010807,0.00293,3.504023e-07,0.01,{'C': 0.01},0.143125,0.143125,0.129375,0.136875,0.125,0.1355,0.007293,2
3,0.13115,0.037505,0.002344,0.0004782019,0.1,{'C': 0.1},0.141875,0.136875,0.13,0.13375,0.1275,0.134,0.005071,6
4,0.21678,0.072434,0.002735,0.0009568903,1.0,{'C': 1},0.140625,0.1375,0.13,0.134375,0.130625,0.134625,0.004043,3
5,0.10401,0.020455,0.002148,0.0007311111,10.0,{'C': 10},0.14,0.1375,0.129375,0.135,0.13125,0.134625,0.003905,3
6,0.086526,0.005681,0.002149,0.000390482,100.0,{'C': 100},0.14,0.1375,0.129375,0.135,0.13125,0.134625,0.003905,3


## 3. Train a DecisionTree model able to predict the Gormiti_Land class.


#### (a) Perform feature pre-processing if necessary. Discuss your choices and the performed actions.


In [75]:
X_train = df_train.drop(columns=['Gormiti_Land'])
y_train= df_train['Gormiti_Land']
X_test = df_test.drop(columns=['Gormiti_Land'])
y_test= df_test['Gormiti_Land']

In [76]:
# Binary columns -> 0/1. Each encoder is fitted on the training set and reused
# on the test set, so both sets share the same mapping.
for col in ['Attack', 'Deadly', 'Nature']:
    lb = prepro.LabelBinarizer()
    X_train[col] = lb.fit_transform(X_train[col]).ravel()
    X_test[col] = lb.transform(X_test[col]).ravel()

# Multi-valued columns -> integer codes. LabelEncoder sorts the classes
# alphabetically, so Growth_rate becomes Fast=0, Medium=1, Slow=2.
for col in ['Growth_rate', 'Ray', 'Ability', 'Gormiti_Type']:
    le = prepro.LabelEncoder()
    X_train[col] = le.fit_transform(X_train[col])
    X_test[col] = le.transform(X_test[col])



#### (b) Search for good hyperparameter values for the DecisionTreeClassifier: make a choice on the hyperparameters you might tune and provide comments on your choice. Specify which hyperparameters might require a tuning procedure, and what effect the tuning procedure has on the final model.




In [77]:
params = {'max_depth': list(range(2,20)), 'criterion': ['gini','entropy'],'min_samples_split': list(range(1, 3)), 'min_samples_leaf': list(range(1, 3)), 'max_features': list(range(0,13))}
grid_search_cv = GridSearchCV(DecisionTreeClassifier(random_state=24), params, n_jobs=-1, verbose=1, cv=3)

grid_search_cv.fit(X_train, y_train)

Fitting 3 folds for each of 1872 candidates, totalling 5616 fits
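The candidate count reported above is the product of the grid sizes: 18 × 2 × 2 × 2 × 13 = 1872, and three folds give 5616 fits. The sketch below reproduces that arithmetic in plain Python. It also builds a hypothetical stricter grid (`valid`), on the assumption that recent scikit-learn releases require an integer `min_samples_split` of at least 2 and an integer `max_features` of at least 1; candidates outside those bounds cannot produce a usable score.

```python
from math import prod

params = {
    "max_depth": list(range(2, 20)),
    "criterion": ["gini", "entropy"],
    "min_samples_split": list(range(1, 3)),
    "min_samples_leaf": list(range(1, 3)),
    "max_features": list(range(0, 13)),
}

n_candidates = prod(len(values) for values in params.values())
print(n_candidates, n_candidates * 3)  # 1872 5616

# Hypothetical stricter grid: keep only values inside the documented bounds
valid = dict(params)
valid["min_samples_split"] = [v for v in params["min_samples_split"] if v >= 2]
valid["max_features"] = [v for v in params["max_features"] if v >= 1]
n_valid = prod(len(values) for values in valid.values())
print(n_valid)  # 864
```

Because warnings are silenced at the top of the notebook, failed fits would go unnoticed, so counting the valid combinations is a cheap sanity check before a long search.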


#### (c) Evaluate the trained model on the provided test set. Verify that the trained model is not overfitting. Discuss the obtained results.

In [78]:
grid_search_cv.best_params_

{'criterion': 'entropy',
 'max_depth': 10,
 'max_features': 10,
 'min_samples_leaf': 1,
 'min_samples_split': 1}

In [79]:

y_pred = grid_search_cv.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

    Fireland       0.58      0.82      0.68       500
     Iceland       0.56      0.53      0.55       501
    Rockland       0.77      0.45      0.57       499
    Windland       0.65      0.68      0.66       500

    accuracy                           0.62      2000
   macro avg       0.64      0.62      0.61      2000
weighted avg       0.64      0.62      0.61      2000
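The report above covers the test set only. Checking for overfitting also needs the training score, since a large gap between the two is the usual symptom. A minimal sketch of that comparison, assuming synthetic data and two illustrative depths instead of the tuned Gormiti tree:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data with 10% label noise (an assumption for illustration)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=4, n_clusters_per_class=1,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)

train_acc, gaps = {}, {}
for depth in (3, None):  # shallow tree vs. fully grown tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_acc[depth] = tree.score(X_tr, y_tr)
    gaps[depth] = train_acc[depth] - tree.score(X_te, y_te)
    print(f"max_depth={depth}: train={train_acc[depth]:.3f} gap={gaps[depth]:.3f}")
```

In the notebook the same check would amount to calling `grid_search_cv.score` on both `(X_train, y_train)` and `(X_test, y_test)` and comparing the two numbers.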



## 4. Train a Random Forest model able to predict the Gormiti_Land class.


#### (a) Perform feature pre-processing if necessary. Discuss your choices and the performed actions.


#### (b) Search for good hyperparameter values for the RandomForestClassifier: make a choice on the hyperparameters you might tune and provide comments on your choice. Specify which hyperparameters might require a tuning procedure (comment only on the hyperparameters related to the ensemble, since the ones related to the DecisionTree have been discussed above).


In [80]:
from sklearn.ensemble import RandomForestClassifier


In [81]:
params = {'n_estimators': [10, 30, 50, 100, 5], 'max_depth': list(range(2, 10)), 'max_features': list(range(5, 13))}
grid_search_cv = GridSearchCV(RandomForestClassifier(random_state=24), params, n_jobs=-1, verbose=1, cv=3)

grid_search_cv.fit(X_train, y_train)

Fitting 3 folds for each of 320 candidates, totalling 960 fits


In [82]:
grid_search_cv.best_params_

{'max_depth': 9, 'max_features': 7, 'n_estimators': 100}

In [83]:
y_pred = grid_search_cv.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

    Fireland       0.59      0.90      0.71       500
     Iceland       0.70      0.62      0.66       501
    Rockland       0.79      0.62      0.70       499
    Windland       0.73      0.60      0.66       500

    accuracy                           0.68      2000
   macro avg       0.71      0.68      0.68      2000
weighted avg       0.71      0.68      0.68      2000



#### (c) Which are the 2 most important features for the trained model?


In [84]:
importances = grid_search_cv.best_estimator_.feature_importances_
# Print each feature name next to its impurity-based importance
for name, importance in zip(X_train.columns, importances):
    print(name, ':', importance)

Gormiti_Type : 0.004907529590938853
Nature : 0.0010615838083310274
Deadly : 0.00133310654645236
Growth_rate : 0.0019480951088955614
Attack : 0.0010269081204201513
Strength : 0.008222775841967086
Ray : 0.002954823690179974
Ability : 0.004388733874482244
Against_Fire : 0.18355323629376563
Against_Meka : 0.12822987512371334
Against_Lava : 0.24091916516156522
Against_Wind : 0.20571600559534955
Against_Rock : 0.21573816124393902


The two most important features are Against_Lava (importance ≈ 0.241) and Against_Rock (≈ 0.216).
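The ranking can also be read off programmatically instead of by eye. The sketch below copies the importances printed above, rounded to four decimals, into a pandas Series; in the notebook the Series would be built from `feature_importances_` and `X_train.columns` directly.

```python
import pandas as pd

# Values copied from the output above, rounded to 4 decimals
importances = pd.Series({
    "Gormiti_Type": 0.0049, "Nature": 0.0011, "Deadly": 0.0013,
    "Growth_rate": 0.0019, "Attack": 0.0010, "Strength": 0.0082,
    "Ray": 0.0030, "Ability": 0.0044,
    "Against_Fire": 0.1836, "Against_Meka": 0.1282, "Against_Lava": 0.2409,
    "Against_Wind": 0.2057, "Against_Rock": 0.2157,
})

top2 = importances.nlargest(2).index.tolist()
print(top2)  # ['Against_Lava', 'Against_Rock']

# Share of total importance carried by the five battle-score features
against_share = importances[importances.index.str.startswith("Against_")].sum()
print(round(against_share, 4))  # 0.9741
```

The five `Against_*` scores account for about 97% of the total importance, so the categorical features contribute very little to this forest.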

In [85]:
grid_search_cv.best_params_

{'max_depth': 9, 'max_features': 7, 'n_estimators': 100}

#### (d) Provide an out-of-bag evaluation of the trained model.


In [86]:
from sklearn.ensemble import BaggingClassifier

# Bagging ensemble of 100 random forests, each fitted on a bootstrap sample of the rows
# and 7 randomly drawn features; oob_score=True scores every row with the members
# that did not see it during training.
bag_clf = BaggingClassifier(RandomForestClassifier(max_depth=9, random_state=24), max_features=7, n_estimators=100, oob_score=True, bootstrap=True, n_jobs=-1, random_state=24)

bag_clf.fit(X_train, y_train)

bag_clf.oob_score_

0.804375
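`RandomForestClassifier` supports `oob_score=True` itself, so the out-of-bag estimate can be read from a single forest without wrapping it in a `BaggingClassifier`. A minimal sketch, assuming synthetic data with 13 features and reusing the tuned values reported above (`max_depth=9`, `max_features=7`, `n_estimators=100`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 13 Gormiti features and 4 lands (an assumption)
X, y = make_classification(n_samples=800, n_features=13, n_informative=5,
                           n_classes=4, n_clusters_per_class=1, random_state=24)

forest = RandomForestClassifier(n_estimators=100, max_depth=9, max_features=7,
                                bootstrap=True, oob_score=True, random_state=24)
forest.fit(X, y)

# Accuracy measured on the rows each tree did not see in its bootstrap sample
print(f"OOB accuracy: {forest.oob_score_:.3f}")
# Per-row class probabilities averaged over the trees for which the row was out of bag
print(forest.oob_decision_function_.shape)  # (800, 4)
```

This evaluates the forest that was actually tuned, whereas a bagging ensemble of forests is a different and much larger model.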

In [87]:
from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(X_train)
evaltrain = accuracy_score(y_train, y_pred)
evaltrain

0.869

In [88]:
bag_clf.oob_decision_function_

array([[0.13811891, 0.33191062, 0.42595754, 0.10401293],
       [0.22850392, 0.11535501, 0.25479671, 0.40134436],
       [0.11197096, 0.56286459, 0.14762806, 0.17753639],
       ...,
       [0.09459774, 0.09440307, 0.13215938, 0.67883982],
       [0.06617094, 0.74850046, 0.11301034, 0.07231826],
       [0.30680298, 0.08510449, 0.19042336, 0.41766917]])

#### (e) Evaluate the trained model on the provided test set. Verify that the trained model is not overfitting. Discuss the obtained results.


In [89]:
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.713

## 5. Train an AdaBoost model able to predict the Gormiti_Land class.


#### (a) Perform feature pre-processing if necessary. Discuss your choices and the performed actions.


#### (b) Search for good hyperparameter values for the AdaBoostClassifier: make a choice on the hyperparameters you might tune and provide comments on your choice. Specify which hyperparameters might require a tuning procedure.


In [90]:
from sklearn.ensemble import AdaBoostClassifier

params = {'n_estimators': [10, 30, 50, 100, 5, 200], 'learning_rate': [0.0001, 0.001, 0.01, 0.1, 1, 5, 10]}

grid_search_cv = GridSearchCV(AdaBoostClassifier(DecisionTreeClassifier(max_depth=1, random_state=24), algorithm='SAMME.R'), params, n_jobs=-1, verbose=1, cv=3)

grid_search_cv.fit(X_train, y_train)


Fitting 3 folds for each of 42 candidates, totalling 126 fits


In [91]:
y_pred = grid_search_cv.predict(X_train)
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

    Fireland       0.76      0.74      0.75      1999
     Iceland       0.84      0.71      0.77      2004
    Rockland       0.74      0.71      0.72      1995
    Windland       0.65      0.79      0.71      2002

    accuracy                           0.74      8000
   macro avg       0.75      0.74      0.74      8000
weighted avg       0.75      0.74      0.74      8000



#### (c) Evaluate the trained model on the provided test set. Verify that the trained model is not overfitting. Discuss the obtained results.


In [92]:
y_pred = grid_search_cv.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

    Fireland       0.63      0.67      0.65       500
     Iceland       0.59      0.85      0.69       501
    Rockland       0.74      0.47      0.58       499
    Windland       0.71      0.60      0.65       500

    accuracy                           0.65      2000
   macro avg       0.67      0.65      0.64      2000
weighted avg       0.67      0.65      0.64      2000



## 6. Train a Soft Voting Classifier model able to predict the Gormiti_Land class.


#### (a) Combine the models trained above into an ensemble, using a soft voting classifier. Select the models which allow soft voting and comment on this choice. Optionally add other models (trained with different algorithms) to the ensemble.


In [93]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
# Assumption: svm_clf is an SVC; probability=True is needed so that it
# exposes predict_proba, which soft voting relies on.
svm_clf = SVC(probability=True)

# Soft voting averages the class probabilities predicted by the members
voting_clf = VotingClassifier(estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)], voting='soft')

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
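Soft voting needs every member to implement `predict_proba`; the ensemble then averages these probabilities and predicts the class with the highest mean. The sketch below illustrates that rule on synthetic data, with three members chosen only for illustration, and checks it against `VotingClassifier` by recomputing the vote by hand.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data (an assumption, not the Gormiti set)
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=2000)),
        ("dt", DecisionTreeClassifier(max_depth=5, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
).fit(X_tr, y_tr)

# Manual soft vote: average the members' probabilities, then take the argmax
mean_proba = np.mean([est.predict_proba(X_te) for est in voter.estimators_], axis=0)
manual = voter.classes_[mean_proba.argmax(axis=1)]
print("test accuracy:", voter.score(X_te, y_te))
```

All four model families trained in this notebook (softmax regression, decision tree, random forest, AdaBoost) expose `predict_proba`, so each of them qualifies as a member of a soft voting ensemble.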

#### (b) Evaluate the trained model on the provided test set. Discuss the obtained results.


#### (c) How much better does the voting classifier perform compared to the individual classifiers?


## 7. Train a Blender classifier able to predict the Gormiti_Land class.


#### (a) Exploit the predictions performed by a set of models: a Softmax Regression model, a Decision Tree model, a Random Forest model and an AdaBoost model. Use the best hyperparameters values identified above for all the exploited models.
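One way to build such a blender is scikit-learn's `StackingClassifier`, which trains a final estimator on cross-validated predictions of the base models. The following is a sketch under stated assumptions, not a solution for this dataset: the data are synthetic, the softmax, tree and forest settings reuse values reported earlier in the notebook, and the AdaBoost settings are placeholders because its best parameters are not shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 13 features, 4 classes (an assumption)
X, y = make_classification(n_samples=600, n_features=13, n_informative=5,
                           n_classes=4, n_clusters_per_class=1, random_state=24)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=24)

base_models = [
    ("softmax", LogisticRegression(C=0.001, max_iter=2000)),
    ("tree", DecisionTreeClassifier(criterion="entropy", max_depth=10,
                                    max_features=10, random_state=24)),
    ("forest", RandomForestClassifier(n_estimators=100, max_depth=9,
                                      max_features=7, random_state=24)),
    ("ada", AdaBoostClassifier(n_estimators=50, random_state=24)),  # placeholder settings
]

# The blender (final_estimator) learns from out-of-fold class probabilities
blender = StackingClassifier(estimators=base_models,
                             final_estimator=LogisticRegression(max_iter=2000),
                             stack_method="predict_proba", cv=5)
blender.fit(X_tr, y_tr)

meta_features = blender.transform(X_te)  # 4 models x 4 class probabilities
test_acc = blender.score(X_te, y_te)
print(meta_features.shape, round(test_acc, 3))
```

Using out-of-fold predictions keeps the blender from seeing base-model outputs on rows those models were trained on, which is the main design point of stacking.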


#### (b) Evaluate the trained stacking ensemble model on the provided test set. Discuss the obtained results.


## 8. Compare the performance of the previously trained classifiers and ensemble models by evaluating them on the provided test set.
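As a starting point for this comparison, the test accuracies already reported in the sections above can be collected and ranked. The numbers below are copied from the notebook outputs. The softmax score (0.11) is left out because it was obtained with Gormiti_Type as the target, and the voting and blender models have no reported results.

```python
# Test-set accuracies copied from the outputs above
reported = {
    "Decision Tree": 0.62,
    "Random Forest": 0.68,
    "Bagging of Random Forests": 0.713,
    "AdaBoost": 0.65,
}

# Rank the models from best to worst test accuracy
ranking = sorted(reported.items(), key=lambda item: item[1], reverse=True)
for name, acc in ranking:
    print(f"{name:<28}{acc:.3f}")

best_name = ranking[0][0]
```

On these figures the ensembles outperform the single decision tree, with the bagged forests ahead of the tuned random forest and AdaBoost.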