# Assignment for practical work 4. Basics of neural networks

Group:

*  Jannik Bucher
*  Dennis Imhof

### Using dataset: SkillCraft1 Master Table Dataset
[SkillCraft1 on UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/SkillCraft1+Master+Table+Dataset#)

#### Notes:

* Other than in assignment 2a we will now use the original response variable "LeagueIndex" for classification.

* Also, we will split the data into train and test set and perform cross-validation on the training data instead of splitting the training data into fixed train/validation sets.

## General Assignment

Before performing the practical work, you need download the data set accordingly to the option on your machine
1. Write a program that splits the original sample into a training set and a test set (training set, validation set, test set)
2. Build a model using Perceptron (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html) and MLPClassifier (http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html). On the basis of experiments, select values for learning rate, the regularization parameter, the optimization function.
3. Build learning curves for better explanation of your experiments.

## Options
Data sets are taken from the UCI Machine Learning Repository
https://archive.ics.uci.edu/ml/
The option is determined by the data set, which can be downloaded from the link above:
1. Condition Based Maintenance of Naval Propulsion Plants
2. UJIIndoorLoc
3. Insurance Company Benchmark (COIL 2000)
4. KDD Cup 1998 Data
5. [Forest Fires](https://www.kaggle.com/elikplim/predict-the-burned-area-of-forest-fires)
6. Concrete Compressive Strength
7. Concrete Slump Test
8. Communities and Crime
9. Parkinsons Telemonitoring
10. YearPredictionMSD
11. Relative location of CT slices on axial axis
12. Individual household electric power consumption
13. Energy efficiency
14. 3D Road Network (North Jutland, Denmark)
15. ISTANBUL STOCK EXCHANGE
16. Buzz in social media
17. Physicochemical Properties of Protein Tertiary Structure
18. Gas Sensor Array Drift Dataset at Different Concentrations
19. SkillCraft1 Master Table Dataset
20. SML2010
21. Bike Sharing Dataset
22. Combined Cycle Power Plant
23. BlogFeedback

In [219]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_predict, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectFromModel
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import Perceptron

In [220]:
df = pd.read_csv("data/SkillCraft1_Dataset.csv")

In [221]:
df = df.drop("GameID", axis=1)

In [222]:
response_label_names = ["Bronze", "Silver", "Gold", "Platinum", "Diamond", "Master", "GrandMaster", "Professional"]
# response_indices = range(1,9)
# response_labels = dict(zip(response_indices, response_label_names))
# response_labels

In [223]:
# df.LeagueIndex = df.LeagueIndex.apply(lambda x: response_labels[x])

In [224]:
# pd.LeagueIndex = pd.Series(pd.Categorical(df.LeagueIndex, categories=response_label_names, ordered=True), dtype="category")

In [225]:
df.LeagueIndex.describe()

count    3395.000000
mean        4.184094
std         1.517327
min         1.000000
25%         3.000000
50%         4.000000
75%         5.000000
max         8.000000
Name: LeagueIndex, dtype: float64

### EDA

In [226]:
df.head()

Unnamed: 0,LeagueIndex,Age,HoursPerWeek,TotalHours,APM,SelectByHotkeys,AssignToHotkeys,UniqueHotkeys,MinimapAttacks,MinimapRightClicks,NumberOfPACs,GapBetweenPACs,ActionLatency,ActionsInPAC,TotalMapExplored,WorkersMade,UniqueUnitsMade,ComplexUnitsMade,ComplexAbilitiesUsed
0,5,27,10,3000,143.718,0.003515,0.00022,7,0.00011,0.000392,0.004849,32.6677,40.8673,4.7508,28,0.001397,6,0.0,0.0
1,5,23,10,5000,129.2322,0.003304,0.000259,4,0.000294,0.000432,0.004307,32.9194,42.3454,4.8434,22,0.001194,5,0.0,0.000208
2,4,30,10,200,69.9612,0.001101,0.000336,4,0.000294,0.000461,0.002926,44.6475,75.3548,4.043,22,0.000745,6,0.0,0.000189
3,3,19,20,400,107.6016,0.001034,0.000213,1,5.3e-05,0.000543,0.003783,29.2203,53.7352,4.9155,19,0.000426,7,0.0,0.000384
4,3,32,10,500,122.8908,0.001136,0.000327,2,0.0,0.001329,0.002368,22.6885,62.0813,9.374,15,0.001174,4,0.0,1.9e-05


In [227]:
df.describe()

Unnamed: 0,LeagueIndex,APM,SelectByHotkeys,AssignToHotkeys,UniqueHotkeys,MinimapAttacks,MinimapRightClicks,NumberOfPACs,GapBetweenPACs,ActionLatency,ActionsInPAC,TotalMapExplored,WorkersMade,UniqueUnitsMade,ComplexUnitsMade,ComplexAbilitiesUsed
count,3395.0,3395.0,3395.0,3395.0,3395.0,3395.0,3395.0,3395.0,3395.0,3395.0,3395.0,3395.0,3395.0,3395.0,3395.0,3395.0
mean,4.184094,117.046947,0.004299,0.000374,4.364654,9.8e-05,0.000387,0.003463,40.361562,63.739403,5.272988,22.131664,0.001032,6.534021,5.9e-05,0.000142
std,1.517327,51.945291,0.005284,0.000225,2.360333,0.000166,0.000377,0.000992,17.15357,19.238869,1.494835,7.431719,0.000519,1.857697,0.000111,0.000265
min,1.0,22.0596,0.0,0.0,0.0,0.0,0.0,0.000679,6.6667,24.0936,2.0389,5.0,7.7e-05,2.0,0.0,0.0
25%,3.0,79.9002,0.001258,0.000204,3.0,0.0,0.00014,0.002754,28.95775,50.4466,4.27285,17.0,0.000683,5.0,0.0,0.0
50%,4.0,108.0102,0.0025,0.000353,4.0,4e-05,0.000281,0.003395,36.7235,60.9318,5.0955,22.0,0.000905,6.0,0.0,2e-05
75%,5.0,142.7904,0.005133,0.000499,6.0,0.000119,0.000514,0.004027,48.2905,73.6813,6.0336,27.0,0.001259,8.0,8.6e-05,0.000181
max,8.0,389.8314,0.043088,0.001752,10.0,0.003019,0.004041,0.007971,237.1429,176.3721,18.5581,58.0,0.005149,13.0,0.000902,0.003084


In [228]:
# No missing values in any of the columns
df.isnull().any()

LeagueIndex             False
Age                     False
HoursPerWeek            False
TotalHours              False
APM                     False
SelectByHotkeys         False
AssignToHotkeys         False
UniqueHotkeys           False
MinimapAttacks          False
MinimapRightClicks      False
NumberOfPACs            False
GapBetweenPACs          False
ActionLatency           False
ActionsInPAC            False
TotalMapExplored        False
WorkersMade             False
UniqueUnitsMade         False
ComplexUnitsMade        False
ComplexAbilitiesUsed    False
dtype: bool

In [229]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3395 entries, 0 to 3394
Data columns (total 19 columns):
LeagueIndex             3395 non-null int64
Age                     3395 non-null object
HoursPerWeek            3395 non-null object
TotalHours              3395 non-null object
APM                     3395 non-null float64
SelectByHotkeys         3395 non-null float64
AssignToHotkeys         3395 non-null float64
UniqueHotkeys           3395 non-null int64
MinimapAttacks          3395 non-null float64
MinimapRightClicks      3395 non-null float64
NumberOfPACs            3395 non-null float64
GapBetweenPACs          3395 non-null float64
ActionLatency           3395 non-null float64
ActionsInPAC            3395 non-null float64
TotalMapExplored        3395 non-null int64
WorkersMade             3395 non-null float64
UniqueUnitsMade         3395 non-null int64
ComplexUnitsMade        3395 non-null float64
ComplexAbilitiesUsed    3395 non-null float64
dtypes: float64(12), int64(4),

In [230]:
# Though there were no None values in the dataset, closer inspection reveals missing values marked with "?"
# Convert the object-variables to numeric and set the missing values to None
missing_features = ["Age", "HoursPerWeek", "TotalHours"]

for col in missing_features:
    df[col] = pd.to_numeric(df[col], errors="coerce")

In [231]:
df.isna().sum()

LeagueIndex              0
Age                     55
HoursPerWeek            56
TotalHours              57
APM                      0
SelectByHotkeys          0
AssignToHotkeys          0
UniqueHotkeys            0
MinimapAttacks           0
MinimapRightClicks       0
NumberOfPACs             0
GapBetweenPACs           0
ActionLatency            0
ActionsInPAC             0
TotalMapExplored         0
WorkersMade              0
UniqueUnitsMade          0
ComplexUnitsMade         0
ComplexAbilitiesUsed     0
dtype: int64

In [232]:
df.groupby("LeagueIndex").describe()

Unnamed: 0_level_0,Age,Age,Age,Age,Age,Age,Age,Age,HoursPerWeek,HoursPerWeek,...,ComplexUnitsMade,ComplexUnitsMade,ComplexAbilitiesUsed,ComplexAbilitiesUsed,ComplexAbilitiesUsed,ComplexAbilitiesUsed,ComplexAbilitiesUsed,ComplexAbilitiesUsed,ComplexAbilitiesUsed,ComplexAbilitiesUsed
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
LeagueIndex,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
1,167.0,22.724551,5.52286,16.0,19.0,21.0,26.0,40.0,167.0,13.125749,...,0.0,0.000318,167.0,4.2e-05,9.6e-05,0.0,0.0,0.0,4.5e-05,0.00063
2,347.0,22.15562,5.091531,16.0,18.0,21.0,25.0,43.0,347.0,13.29683,...,0.0,0.000494,347.0,7.6e-05,0.0002,0.0,0.0,0.0,5.7e-05,0.001763
3,553.0,22.050633,4.901305,16.0,18.0,21.0,24.0,41.0,553.0,13.949367,...,3.4e-05,0.00059,553.0,0.000117,0.000257,0.0,0.0,0.0,0.000129,0.002664
4,811.0,21.981504,4.141736,16.0,19.0,21.0,24.0,44.0,811.0,14.022195,...,9.9e-05,0.000786,811.0,0.000138,0.000245,0.0,0.0,2.9e-05,0.000176,0.002186
5,806.0,21.362283,3.662164,16.0,18.0,21.0,24.0,37.0,805.0,16.183851,...,0.000132,0.000902,806.0,0.000176,0.000282,0.0,0.0,6.6e-05,0.000229,0.003084
6,621.0,20.677939,3.030381,16.0,18.0,20.0,22.0,31.0,621.0,21.088567,...,0.000127,0.000781,621.0,0.000182,0.000293,0.0,0.0,5.3e-05,0.000269,0.002443
7,35.0,21.171429,2.864444,16.0,19.0,22.0,23.0,26.0,35.0,31.714286,...,0.000121,0.000386,35.0,0.000267,0.000588,0.0,0.0,4e-05,0.000235,0.002685
8,0.0,,,,,,,,0.0,,...,0.0,0.000457,55.0,0.000135,0.000246,0.0,0.0,0.0,0.000128,0.000959


### Splitting the dataset into train/test

Before doing any modeling

In [233]:
train_X, test_X, train_y, test_y = train_test_split(df.drop("LeagueIndex",axis=1), df["LeagueIndex"], train_size=0.8, random_state=42)

### Feature normalization

Here we scale all numerical variables and impute the missing values in "Age", "HoursPerWeek" and "TotalHours".

In [234]:
remaining_features = train_X.columns.drop(missing_features)

numeric_transformer = Pipeline([('scale', StandardScaler())])
imputer = Pipeline([('impute', SimpleImputer(strategy='median')), 
                     ('scale', numeric_transformer)])



preprocessor = ColumnTransformer(
    transformers=[
        ('imp', imputer, missing_features),
        ('num', numeric_transformer, remaining_features)])

In [235]:
train_X = pd.DataFrame(preprocessor.fit_transform(train_X), columns=train_X.columns)

In [236]:
train_X.describe()

Unnamed: 0,Age,HoursPerWeek,TotalHours,APM,SelectByHotkeys,AssignToHotkeys,UniqueHotkeys,MinimapAttacks,MinimapRightClicks,NumberOfPACs,GapBetweenPACs,ActionLatency,ActionsInPAC,TotalMapExplored,WorkersMade,UniqueUnitsMade,ComplexUnitsMade,ComplexAbilitiesUsed
count,2716.0,2716.0,2716.0,2716.0,2716.0,2716.0,2716.0,2716.0,2716.0,2716.0,2716.0,2716.0,2716.0,2716.0,2716.0,2716.0,2716.0,2716.0
mean,-3.479462e-16,-4.0550120000000004e-17,2.616137e-18,8.633251000000001e-17,6.278728e-17,-7.128973000000001e-17,-1.399633e-16,-4.709046e-17,-1.491198e-16,-2.7469440000000002e-17,1.968643e-16,6.998166e-16,8.698655e-17,-2.459169e-16,6.671149000000001e-17,1.138019e-16,-1.052995e-16,3.270171e-18
std,1.000184,1.000184,1.000184,1.000184,1.000184,1.000184,1.000184,1.000184,1.000184,1.000184,1.000184,1.000184,1.000184,1.000184,1.000184,1.000184,1.000184,1.000184
min,-1.363021,-1.31535,-0.05287988,-1.853926,-0.8307951,-1.677631,-1.86615,-0.5806462,-1.025473,-2.801613,-1.917108,-2.047678,-2.132266,-2.314673,-1.825463,-2.441983,-0.5367249,-0.5320214
25%,-0.6331381,-0.6502364,-0.03740366,-0.7232601,-0.5818109,-0.7536533,-0.5950089,-0.5806462,-0.6537659,-0.7191526,-0.6705181,-0.6927804,-0.665624,-0.7050841,-0.672164,-0.8337324,-0.5367249,-0.5320214
50%,-0.1465498,-0.3176797,-0.02698197,-0.1741529,-0.3448266,-0.09823826,-0.1712952,-0.3473055,-0.2739262,-0.06801996,-0.2138966,-0.1487731,-0.1135236,-0.03442205,-0.2398668,-0.2976488,-0.5367249,-0.4547275
75%,0.5833327,0.3474335,-0.01134942,0.5216098,0.1820376,0.5668545,0.6761323,0.1112131,0.3344726,0.5768667,0.4429944,0.5251209,0.5183354,0.63624,0.4334208,0.7745184,0.2393433,0.1565407
max,5.449216,12.65203,52.05544,5.306294,7.395625,6.14142,2.370987,16.97499,9.483518,4.487375,11.21705,5.845117,8.977585,4.794345,7.785008,3.454937,7.484828,10.84133


## Classifictation with the Perceptron

In [237]:
perc = Perceptron(random_state=42, n_jobs=1)
perc.fit(train_X, train_y)

Perceptron(alpha=0.0001, class_weight=None, early_stopping=False, eta0=1.0,
           fit_intercept=True, max_iter=1000, n_iter_no_change=5, n_jobs=1,
           penalty=None, random_state=42, shuffle=True, tol=0.001,
           validation_fraction=0.1, verbose=0, warm_start=False)

In [238]:
test_X = pd.DataFrame(preprocessor.fit_transform(test_X), columns=test_X.columns)
np.mean(test_y == perc.predict(test_X))

0.21354933726067746

In [251]:
params_perceptron = {'alpha' : 10**np.linspace(-20,-1,5)}

In [252]:
grid_perc = GridSearchCV(Perceptron(n_jobs=-1, random_state=42), param_grid=params_perceptron, n_jobs=-1, cv=3)
grid_perc.fit(train_X, train_y)

GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=Perceptron(alpha=0.0001, class_weight=None,
                                  early_stopping=False, eta0=1.0,
                                  fit_intercept=True, max_iter=1000,
                                  n_iter_no_change=5, n_jobs=-1, penalty=None,
                                  random_state=42, shuffle=True, tol=0.001,
                                  validation_fraction=0.1, verbose=0,
                                  warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'alpha': array([1.00000000e-20, 5.62341325e-16, 3.16227766e-11, 1.77827941e-06,
       1.00000000e-01])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [256]:
grid_perc.best_score_

0.29491899852724596

In [258]:
params_mlp = {'alpha' : 10**np.linspace(-5,-1,5)}

In [None]:
grid_mlp = GridSearchCV(MLPClassifier(random_state=42), param_grid=params_mlp, cv=3, n_jobs=-1, verbose=True)

In [None]:
grid_mlp.fit(train_X, train_y)