# DSBA 22/23 HSE & University of London

# Practical assignment 1. DL in classification.

## General info
Release date: 26.09.2022

Soft deadline: 10.10.2022 23:59 MSK

Hard deadline: 13.10.2022 23:59 MSK

In this task, you are to build a NN for a binary classification task. We suggest using Google Colab for access to GPU. Competition invite link: https://www.kaggle.com/t/1917e22edb71437ca24d790ab1d57695

## Evaluation and fines

Each section has a defined "value" (in brackets near the section). The maximum grade for the task is 10 points; any points above that can be assigned to your tests.

**Your notebook with the best solution must be reproducible and must be sent to the dropbox!** If the assessor cannot reproduce your results, you may be assigned a score of 0, so fix every source of randomness in your computations!

**You can only use neural networks / linear / nearest neighbors models for this task - tree-based models are forbidden!**

All the parts must be done independently.

After the hard deadline has passed, the homework is not accepted. If you send the homework after the soft deadline, you will be excluded from the competition among your classmates, and the homework will be scored only on the "Beating the baseline" part.

Feel free to ask both the teacher and your classmates questions, but __do not copy code or work on it together__. "Similar" solutions are considered plagiarism, and all the students involved (those who shared and those who copied) cannot get more than 0.01 points for the task. If you found a solution in an open source, you __must__ reference it in a special block at the end of your work (to rule out suspicion of plagiarism).


## Format of handing over

The tasks are sent to the dropbox: https://www.dropbox.com/request/Y6TJouxNbm3r0RgcBL35. Don't forget to attach your name, surname & your group.


## 1. Model training

**Important!** The public leaderboard (LB) contains only 33% of the test data. Your points will be measured with respect to the whole test set, so your position on the LB may change after the competition ends.

* test_accuracy > weak baseline (public LB): 3 points

* test_accuracy > medium baseline (public LB): + 3 points

* test_accuracy > strong baseline (public LB): + 2 points

* You are among 25% most successful students (private LB): + 2 points

* You are among top-3 most successful students (private LB): + 1 point

* You are among top-2 most successful students (private LB): + 1 point

* You are among top-1 most successful students (private LB): + 1 point
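
One possible reading of the list above, written out as arithmetic. It assumes that all items are additive, that the top-3, top-2 and top-1 bonuses stack, and that the total is capped at the 10-point maximum from "Evaluation and fines", with the surplus reported separately. None of these assumptions is confirmed by the task text, and the function name and arguments are hypothetical.

```python
def assignment_points(beats_weak, beats_medium, beats_strong, in_top_quarter, rank=None):
    """Return (grade, surplus) under the additive reading of the scoring list."""
    points = 0
    if beats_weak:
        points += 3
    if beats_medium:
        points += 3
    if beats_strong:
        points += 2
    if in_top_quarter:
        points += 2
    # Assumed: a student ranked 1st collects the top-3, top-2 and top-1 bonuses.
    for cutoff in (3, 2, 1):
        if rank is not None and rank <= cutoff:
            points += 1
    grade = min(points, 10)        # the grade for the task is capped at 10
    return grade, points - grade   # surplus points above the cap


print(assignment_points(True, True, True, True, rank=1))
print(assignment_points(True, False, False, False))
```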

In [4]:
!pip install torch

Collecting torch
  Downloading torch-1.12.1-cp39-cp39-win_amd64.whl (161.8 MB)
Installing collected packages: torch
Successfully installed torch-1.12.1



In [1]:
# Your code here ╰( ͡° ͜ʖ ͡° )つ──☆*:

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader

In [6]:
from sklearn import preprocessing
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
from tensorflow import keras
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.callbacks import EarlyStopping

# **Preprocessing data**

In [12]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
target_df = pd.read_csv('train_target.csv')
train_expected_target1 = pd.read_csv('train_expected_target_agent_1.csv')
train_expected_target2 = pd.read_csv('train_expected_target_agent_2.csv')

In [13]:
train_df.head()

Unnamed: 0,agent_1_feat_Possession%,agent_1_feat_Pass%,agent_1_feat_AerialsWon,agent_1_feat_Rating,agent_1_feat_XGrealiz,agent_1_feat_XGArealiz,agent_1_feat_PPDA,agent_1_feat_OPPDA,agent_1_feat_DC,agent_1_feat_ODC,...,agent_2_feattotal_xg_3,agent_2_feattotal_xg_2,agent_2_feattotal_xg_1,agent_2_feattotal_xg_mean_3,agent_2_feattotal_xg_mean,agent_2_featboth_scored_3,agent_2_featboth_scored_2,agent_2_featboth_scored_1,agent_2_featboth_scored_mean_3,agent_2_featboth_scored_mean
0,58.8,85.1,15.8,6.99,1.1437,0.928715,7.13,14.16,267.0,194.0,...,2.739439,2.739439,2.739439,2.739439,,0.473684,0.473684,0.473684,0.473684,
1,44.8,71.1,23.4,6.84,0.954159,0.97535,9.99,7.66,191.0,287.0,...,2.336756,2.336756,2.336756,2.336756,,0.578947,0.578947,0.578947,0.578947,
2,46.3,70.8,21.7,6.77,0.918434,1.118603,9.56,7.34,179.0,298.0,...,2.120322,2.120322,2.120322,2.120322,,0.368421,0.368421,0.368421,0.368421,
3,50.2,77.5,24.4,6.87,1.037613,0.956836,9.6,9.53,195.0,239.0,...,2.216415,2.216415,2.216415,2.216415,,0.210526,0.210526,0.210526,0.210526,
4,44.9,75.0,17.2,6.77,0.983691,0.948837,12.24,8.76,161.0,283.0,...,2.604025,2.604025,2.604025,2.604025,,0.421053,0.421053,0.421053,0.421053,


In [14]:
test_df.head()

Unnamed: 0,agent_1_feat_Possession%,agent_1_feat_Pass%,agent_1_feat_AerialsWon,agent_1_feat_Rating,agent_1_feat_XGrealiz,agent_1_feat_XGArealiz,agent_1_feat_PPDA,agent_1_feat_OPPDA,agent_1_feat_DC,agent_1_feat_ODC,...,agent_2_feattotal_xg_3,agent_2_feattotal_xg_2,agent_2_feattotal_xg_1,agent_2_feattotal_xg_mean_3,agent_2_feattotal_xg_mean,agent_2_featboth_scored_3,agent_2_featboth_scored_2,agent_2_featboth_scored_1,agent_2_featboth_scored_mean_3,agent_2_featboth_scored_mean
0,58.6,87.0,15.2,6.83,0.844742,1.165049,9.19,16.5,337.0,179.0,...,2.66187,1.893116,4.24136,2.932115,2.690442,1.0,0.0,1.0,0.666667,0.333333
1,50.7,81.3,14.2,6.65,0.743218,1.152593,10.31,13.63,311.0,208.0,...,3.550724,2.3737,4.19701,3.373811,3.075302,0.0,1.0,1.0,0.666667,0.625
2,47.3,81.4,17.7,6.73,0.954509,0.956938,14.21,11.82,207.0,270.0,...,2.693652,2.042668,0.966665,1.900995,3.007033,0.0,1.0,1.0,0.666667,0.555556
3,54.5,84.8,14.5,6.85,1.155612,1.049618,10.95,12.46,339.0,186.0,...,3.9381,1.466409,0.922046,2.108852,2.643923,1.0,0.0,0.0,0.333333,0.444444
4,51.3,81.8,16.4,6.81,1.199718,0.856327,11.27,11.52,193.0,293.0,...,3.358338,2.138405,1.872476,2.456406,3.113815,0.0,0.0,0.0,0.0,0.555556


In [15]:
train_df.shape

(2470, 234)

In [16]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2470 entries, 0 to 2469
Columns: 234 entries, agent_1_feat_Possession% to agent_2_featboth_scored_mean
dtypes: float64(212), int64(22)
memory usage: 4.4 MB


In [17]:
target_df.drop('id', axis = 1, inplace = True)

In [18]:
train_df = pd.concat([target_df, train_df], axis = 1)

In [19]:
train_df

Unnamed: 0,category,agent_1_feat_Possession%,agent_1_feat_Pass%,agent_1_feat_AerialsWon,agent_1_feat_Rating,agent_1_feat_XGrealiz,agent_1_feat_XGArealiz,agent_1_feat_PPDA,agent_1_feat_OPPDA,agent_1_feat_DC,...,agent_2_feattotal_xg_3,agent_2_feattotal_xg_2,agent_2_feattotal_xg_1,agent_2_feattotal_xg_mean_3,agent_2_feattotal_xg_mean,agent_2_featboth_scored_3,agent_2_featboth_scored_2,agent_2_featboth_scored_1,agent_2_featboth_scored_mean_3,agent_2_featboth_scored_mean
0,1,58.8,85.1,15.8,6.99,1.143700,0.928715,7.13,14.16,267.0,...,2.739439,2.739439,2.739439,2.739439,,0.473684,0.473684,0.473684,0.473684,
1,1,44.8,71.1,23.4,6.84,0.954159,0.975350,9.99,7.66,191.0,...,2.336756,2.336756,2.336756,2.336756,,0.578947,0.578947,0.578947,0.578947,
2,0,46.3,70.8,21.7,6.77,0.918434,1.118603,9.56,7.34,179.0,...,2.120322,2.120322,2.120322,2.120322,,0.368421,0.368421,0.368421,0.368421,
3,0,50.2,77.5,24.4,6.87,1.037613,0.956836,9.60,9.53,195.0,...,2.216415,2.216415,2.216415,2.216415,,0.210526,0.210526,0.210526,0.210526,
4,1,44.9,75.0,17.2,6.77,0.983691,0.948837,12.24,8.76,161.0,...,2.604025,2.604025,2.604025,2.604025,,0.421053,0.421053,0.421053,0.421053,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2465,1,41.6,76.0,17.1,6.62,1.046406,1.032989,18.00,8.27,138.0,...,6.412100,1.977760,3.684860,4.024907,3.872622,1.000000,0.000000,0.000000,0.333333,0.444444
2466,1,42.9,76.1,18.3,6.61,1.161802,1.066236,16.14,7.60,201.0,...,2.689307,1.743456,1.568175,2.000313,2.572016,0.000000,0.000000,0.000000,0.000000,0.444444
2467,0,41.0,72.2,19.1,6.51,1.000858,1.026472,15.99,7.99,164.0,...,1.875957,1.742962,3.871643,2.496854,2.555157,0.000000,0.000000,1.000000,0.333333,0.500000
2468,1,51.4,79.3,14.1,6.62,1.037986,1.161401,9.73,10.47,222.0,...,1.913804,2.113308,4.904164,2.977092,2.495116,1.000000,0.000000,0.000000,0.333333,0.222222


## Delete outliers

In [20]:
train_expected_target1 = train_expected_target1.rename(columns={"0": "train_expected_target1"})
train_expected_target2 = train_expected_target2.rename(columns={"0": "train_expected_target2"})
train_df = pd.concat([train_expected_target1, train_df], axis = 1)
train_df = pd.concat([train_expected_target2, train_df], axis = 1)
train_df.head()

Unnamed: 0,train_expected_target2,train_expected_target1,category,agent_1_feat_Possession%,agent_1_feat_Pass%,agent_1_feat_AerialsWon,agent_1_feat_Rating,agent_1_feat_XGrealiz,agent_1_feat_XGArealiz,agent_1_feat_PPDA,...,agent_2_feattotal_xg_3,agent_2_feattotal_xg_2,agent_2_feattotal_xg_1,agent_2_feattotal_xg_mean_3,agent_2_feattotal_xg_mean,agent_2_featboth_scored_3,agent_2_featboth_scored_2,agent_2_featboth_scored_1,agent_2_featboth_scored_mean_3,agent_2_featboth_scored_mean
0,0.278076,1.16635,1,58.8,85.1,15.8,6.99,1.1437,0.928715,7.13,...,2.739439,2.739439,2.739439,2.739439,,0.473684,0.473684,0.473684,0.473684,
1,0.613273,1.2783,1,44.8,71.1,23.4,6.84,0.954159,0.97535,9.99,...,2.336756,2.336756,2.336756,2.336756,,0.578947,0.578947,0.578947,0.578947,
2,1.11757,1.90067,0,46.3,70.8,21.7,6.77,0.918434,1.118603,9.56,...,2.120322,2.120322,2.120322,2.120322,,0.368421,0.368421,0.368421,0.368421,
3,0.909774,0.423368,0,50.2,77.5,24.4,6.87,1.037613,0.956836,9.6,...,2.216415,2.216415,2.216415,2.216415,,0.210526,0.210526,0.210526,0.210526,
4,0.991901,1.68343,1,44.9,75.0,17.2,6.77,0.983691,0.948837,12.24,...,2.604025,2.604025,2.604025,2.604025,,0.421053,0.421053,0.421053,0.421053,


In [21]:
print('Rows before deleting: ', train_df.shape[0])
train_df = train_df.drop(train_df[(train_df.train_expected_target2 > 0.8) & (train_df.train_expected_target1 > 0.8) & (train_df.category == 0)].index)
train_df.drop(['train_expected_target1', 'train_expected_target2'], axis = 1, inplace = True)
print('Rows after deleting: ', train_df.shape[0])

Rows before deleting:  2470
Rows after deleting:  2143
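
The filter above drops rows where both expected-target columns exceed 0.8 while the label is 0. A small sketch of the same boolean-mask pattern on a toy frame (the values are made up for illustration), using `~mask` selection instead of `drop(...index)`:

```python
import pandas as pd

# Toy stand-in for train_df; the numbers are invented for illustration only.
toy = pd.DataFrame({
    "train_expected_target1": [1.2, 0.9, 0.3, 1.5],
    "train_expected_target2": [0.9, 1.1, 1.4, 0.2],
    "category": [0, 1, 0, 0],
})

# A row is flagged only if all three conditions hold at once.
mask = (
    (toy["train_expected_target1"] > 0.8)
    & (toy["train_expected_target2"] > 0.8)
    & (toy["category"] == 0)
)

filtered = toy[~mask]  # keep every row that is not flagged
print(mask.sum(), "row(s) flagged;", len(filtered), "row(s) kept")
```

Inspecting `mask.sum()` before dropping makes it easy to see how many rows a threshold removes; in the notebook the rule removes 327 of 2470 rows.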


## Correlation between target and other features

In [22]:
train_df.corr(method = "pearson")["category"].sort_values(ascending = False)

category                     1.000000
agent_2_feat_ScoredAv        0.087947
agent_2_feat_XGrealiz        0.083946
agent_2_featscored_mean_3    0.074701
agent_2_feat_XgAv            0.072745
                               ...   
agent_2_feat_xga_2          -0.039018
agent_2_feat_scheme_12      -0.039148
agent_2_feat_missed_2       -0.054183
agent_1_feat_scheme_12      -0.057431
agent_2_feat_PPDA           -0.079134
Name: category, Length: 235, dtype: float64

## Handling missing values

### V1

In [None]:
# nan_columns = train_df.columns[train_df.isna().any()].tolist()
# print(train_df[nan_columns].isna().sum())
# train_df = train_df.apply(lambda x: x.fillna(x.mean()), axis=0)
# train_df.isna().sum().sort_values(ascending = False)

### V2

In [23]:
print('Rows before deleting: ', train_df.shape[0])
train_df = train_df.dropna()  
print('Rows after deleting: ', train_df.shape[0])

Rows before deleting:  2143
Rows after deleting:  2022
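
V2 removes 121 rows (2143 to 2022). Rows cannot be dropped from the competition test set, because every test row needs a prediction, so if the test file contains missing values they have to be imputed instead. A minimal sketch of the V1 idea under that assumption, with the means computed on the training frame only and reused for the test frame (the toy frames and variable names are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frames standing in for the training and test features.
train = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, np.nan]})
test = pd.DataFrame({"a": [np.nan, 5.0], "b": [np.nan, 40.0]})

train_means = train.mean()                 # per-column means, NaN values are skipped
train_filled = train.fillna(train_means)   # impute the training data
test_filled = test.fillna(train_means)     # reuse the training means for the test data

print(test_filled)
```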


## Split dataset into train and test

In [24]:
Y = train_df['category']
X = train_df.drop(['category'], axis=1)

In [40]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.25, random_state = 42)

## Scale data

In [41]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
test_df = scaler.transform(test_df)

  "X does not have valid feature names, but"

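The scaler is fitted on the training split only and then reused for the hold-out split and the competition test data. A sketch of the same pattern that writes the results to new variables and keeps the column names, so that re-running the cell never feeds an already transformed array back into the scaler. Feeding a plain array to a scaler fitted on a DataFrame is one common way to get a feature-name warning; whether that is what happened here is an assumption. The toy frames are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frames standing in for the training split and the held-out data.
train = pd.DataFrame({"f1": [1.0, 2.0, 3.0], "f2": [10.0, 20.0, 30.0]})
holdout = pd.DataFrame({"f1": [2.0], "f2": [40.0]})

scaler = StandardScaler().fit(train)  # statistics come from the training data only

# Wrap the results back into DataFrames so column names and index are preserved.
train_scaled = pd.DataFrame(
    scaler.transform(train), columns=train.columns, index=train.index
)
holdout_scaled = pd.DataFrame(
    scaler.transform(holdout[train.columns]),  # same columns, same order as in training
    columns=train.columns,
    index=holdout.index,
)
print(holdout_scaled)
```
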

# **Model**

In [None]:
batch_size = [32, 64, 128, 256]
epochs = [50, 75, 150]
optimizer = ['SGD', 'Adam']
learning_rate = [0.001, 0.0001]
param_opt = dict(batch_size=batch_size, epochs=epochs, learning_rate=learning_rate, optimizer = optimizer)

In [35]:
def create_model(batch_size, epochs, learning_rate, optimizer):
  # batch_size and epochs are accepted for a uniform call signature,
  # but they are not used when building the network.
  model = Sequential()
  model.add(Dense(128, activation="relu")) # Hidden Layer 1

  model.add(Dense(64, activation="relu")) # Hidden Layer 2
  model.add(Dropout(0.2))

  model.add(Dense(64, activation="relu")) # Hidden Layer 3
  model.add(Dropout(0.2))
  
  model.add(Dense(32, activation="relu")) # Hidden Layer 4
  model.add(Dropout(0.2))

  model.add(Dense(16, activation="relu")) # Hidden Layer 5
  model.add(Dropout(0.2))

  model.add(Dense(1, activation="sigmoid")) # Output Layer

  if optimizer == 'SGD':
    opt = keras.optimizers.SGD(learning_rate=learning_rate)
  elif optimizer == 'Adam':
    opt = keras.optimizers.Adam(learning_rate=learning_rate)
  else:
    # Fail early instead of hitting an undefined `opt` below
    raise ValueError('Unsupported optimizer: ' + str(optimizer))
  
  model.compile(optimizer=opt, loss = 'binary_crossentropy', metrics = ['accuracy'])

  return model

In [None]:
model_GridSearch = KerasClassifier(build_fn=create_model, verbose=0)
grid = GridSearchCV(estimator=model_GridSearch, param_grid=param_opt, n_jobs=1, cv=3, verbose = 0)
grid_result = grid.fit(X_train, y_train)
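
Before launching the search it helps to know how many models it trains. The grid defined above has 4 × 3 × 2 × 2 combinations, and each one is fitted once per fold. A short sketch that counts them with scikit-learn's `ParameterGrid` (the grid values are copied from the cell above):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook cell above.
param_opt = dict(
    batch_size=[32, 64, 128, 256],
    epochs=[50, 75, 150],
    learning_rate=[0.001, 0.0001],
    optimizer=["SGD", "Adam"],
)

n_candidates = len(ParameterGrid(param_opt))  # number of parameter combinations
cv_folds = 3                                  # matches cv=3 in GridSearchCV
print(n_candidates, "candidates,", n_candidates * cv_folds, "cross-validation fits")
```

With the default `refit=True`, `GridSearchCV` trains one more model on the full training data after the search, so the total is 145 training runs of up to 150 epochs each.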

In [None]:
print('Best parameters are: ')
print('batch_size: ' + str(grid_result.best_params_['batch_size']))
print('epochs: ' + str(grid_result.best_params_['epochs']))
print('optimizer: ' + str(grid_result.best_params_['optimizer']))
print('learning_rate: ' + str(grid_result.best_params_['learning_rate']))

Best parameters are: 
batch_size: 128
epochs: 75
optimizer: Adam
learning_rate: 0.0001


In [None]:
# batch_size = grid_result.best_params_['batch_size']
# epochs = grid_result.best_params_['epochs']
# learning_rate = grid_result.best_params_['learning_rate']
# optimizer = grid_result.best_params_['optimizer']

In [42]:
batch_size = 32
epochs = 40
learning_rate = 0.001
optimizer = 'Adam'

In [43]:
model = create_model(batch_size, epochs, learning_rate, optimizer)

In [44]:
model.fit(X_train, y_train, batch_size = batch_size, epochs = epochs)

Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


<keras.callbacks.History at 0x7fae521ba710>

In [45]:
validation_loss, validation_accuracy = model.evaluate(X_test, y_test, batch_size=batch_size)
print("Loss: "+ str(np.round(validation_loss, 3)))
print("Accuracy: "+ str(np.round(validation_accuracy, 3)))

Loss: 3.198
Accuracy: 0.547


In [46]:
y_pred = np.round(model.predict(X_test), 0)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.43      0.44      0.43       201
           1       0.63      0.62      0.62       305

    accuracy                           0.55       506
   macro avg       0.53      0.53      0.53       506
weighted avg       0.55      0.55      0.55       506
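
A useful reference point for these numbers is the majority-class baseline, computed here from the support column of the report above (201 samples of class 0, 305 of class 1):

```python
# Support counts taken from the classification report above.
support = {0: 201, 1: 305}

total = sum(support.values())
majority_class = max(support, key=support.get)   # the most frequent label
majority_accuracy = support[majority_class] / total

print(f"Always predicting {majority_class}: accuracy = {majority_accuracy:.3f}")
```

Always predicting class 1 gives an accuracy of about 0.603 on this hold-out split, which is higher than the 0.547 reported for the network.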



# Make a submission

In [None]:
Answer = np.round(model.predict(test_df), 0).ravel()  # flatten (n, 1) predictions to 1-D
sample_submission = pd.read_csv('data/sample_submission.csv')
sample_submission['category'] = Answer  # overwrite the placeholder labels in place
print(sample_submission.head())
sample_submission.to_csv('Answer.csv', index = False)
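
A quick sanity check before uploading can catch a malformed file. The sketch below assumes the submission has a `category` column holding 0/1 labels, one per test row; the helper name `check_submission`, the `id` column and the toy frame are hypothetical.

```python
import pandas as pd


def check_submission(submission: pd.DataFrame, n_test_rows: int) -> None:
    """Raise AssertionError if the submission frame looks malformed."""
    assert len(submission) == n_test_rows, "expected one prediction per test row"
    assert submission["category"].notna().all(), "found missing predictions"
    assert set(submission["category"].unique()) <= {0, 1}, "labels must be 0 or 1"


# Toy submission with three rows; float labels 0.0 / 1.0 also pass the check.
toy_submission = pd.DataFrame({"id": [0, 1, 2], "category": [1.0, 0.0, 1.0]})
check_submission(toy_submission, n_test_rows=3)
print("submission looks consistent")
```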