# DSBA 22/23 HSE & University of London

# Practical assignment 1. DL in classification.

## General info
Release data: 26.09.2022

Soft deadline: 10.10.2022 23:59 MSK

Hard deadline: 13.10.2021 23:59 MSK

In this task, you are to build a NN for a binary classification task. We suggest using Google Colab for access to GPU. Competition invite link: https://www.kaggle.com/t/1917e22edb71437ca24d790ab1d57695

## Evaluation and fines

Each section has a defined "value" (in brackets near the section). Maximum grade for the task - 10 points, other points can be assigned to your tests.

**Your notebook with the best solution must be reproducible should be sent to the dropbox!** If the assessor cannot reproduce your results, you may be assigned score = 0, so make all your computations fixed!

**You can only use neural networks / linear / nearest neighbors models for this task - tree-based models are forbidden!**

All the parts must be done independently.

After the hard deadline is passed, the hometask is not accepted. If you send the hometask after the soft deadline, you will be excluded from competition among your mates and the homework will only be scored by the "Beating the baseline" part.

Feel free to ask questions both the teacher and your mates, but __do not copy the code or do it together__. "Similar" solutions are considered a plagiarism and all the involved students (the ones who gave & the ones who did) cannot get more than 0.01 points for the task. If you found a solution in some open source, you __must__ reference it in a special block at the end of your work (to exclude the suspicions in plagiarism).


## Format of handing over

The tasks are sent to the dropbox: https://www.dropbox.com/request/Y6TJouxNbm3r0RgcBL35. Don't forget to attach your name, surname & your group.


## 1. Model training

**Important!** Public Leaderboard contains only 33% of the test data. Your points will be measured wrt to the whole test set, therefore your position on the LB after the end of the competition may change.

* test_accuracy > weak baseline (public LB): 3 points

* test_accuracy > medium baseline (public LB): + 3 points

* test_accuracy > strong baseline (public LB): + 2 points

* You are among 25% most successful students (private LB): + 2 point

* You are among top-3 most successful students (private LB): + 1 point

* You are among top-2 most successful students (private LB): + 1 point

* You are among top-1 most successful students (private LB): + 1 point

In [1]:
# Your code here ╰( ͡° ͜ʖ ͡° )つ──☆*:

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader

In [4]:
from sklearn import preprocessing
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split, StratifiedKFold
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

In [5]:
from tensorflow import keras
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils
from tensorflow.keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization
from keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import SGD

In [6]:
import mlxtend
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

# **Preprocessing data**

In [7]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
target_df = pd.read_csv('train_target.csv')
train_expected_target1 = pd.read_csv('train_expected_target_agent_1.csv')
train_expected_target2 = pd.read_csv('train_expected_target_agent_2.csv')
train_target_agent_1 = pd.read_csv('train_target_agent_1.csv')
train_target_agent_2 = pd.read_csv('train_target_agent_2.csv')

In [8]:
train_target_agent_1 = train_target_agent_1.rename(columns={"0": "expected_target1"})
train_target_agent_2 = train_target_agent_2.rename(columns={"0": "expected_target2"})

In [9]:
train_df = pd.concat([train_df, train_target_agent_1, train_target_agent_2], axis=1)

In [10]:
train_df.head()

Unnamed: 0,agent_1_feat_Possession%,agent_1_feat_Pass%,agent_1_feat_AerialsWon,agent_1_feat_Rating,agent_1_feat_XGrealiz,agent_1_feat_XGArealiz,agent_1_feat_PPDA,agent_1_feat_OPPDA,agent_1_feat_DC,agent_1_feat_ODC,...,agent_2_feattotal_xg_1,agent_2_feattotal_xg_mean_3,agent_2_feattotal_xg_mean,agent_2_featboth_scored_3,agent_2_featboth_scored_2,agent_2_featboth_scored_1,agent_2_featboth_scored_mean_3,agent_2_featboth_scored_mean,expected_target1,expected_target2
0,58.8,85.1,15.8,6.99,1.1437,0.928715,7.13,14.16,267.0,194.0,...,2.739439,2.739439,,0.473684,0.473684,0.473684,0.473684,,1,2
1,44.8,71.1,23.4,6.84,0.954159,0.97535,9.99,7.66,191.0,287.0,...,2.336756,2.336756,,0.578947,0.578947,0.578947,0.578947,,2,2
2,46.3,70.8,21.7,6.77,0.918434,1.118603,9.56,7.34,179.0,298.0,...,2.120322,2.120322,,0.368421,0.368421,0.368421,0.368421,,0,1
3,50.2,77.5,24.4,6.87,1.037613,0.956836,9.6,9.53,195.0,239.0,...,2.216415,2.216415,,0.210526,0.210526,0.210526,0.210526,,0,1
4,44.9,75.0,17.2,6.77,0.983691,0.948837,12.24,8.76,161.0,283.0,...,2.604025,2.604025,,0.421053,0.421053,0.421053,0.421053,,2,2


In [11]:
test_df.head()

Unnamed: 0,agent_1_feat_Possession%,agent_1_feat_Pass%,agent_1_feat_AerialsWon,agent_1_feat_Rating,agent_1_feat_XGrealiz,agent_1_feat_XGArealiz,agent_1_feat_PPDA,agent_1_feat_OPPDA,agent_1_feat_DC,agent_1_feat_ODC,...,agent_2_feattotal_xg_3,agent_2_feattotal_xg_2,agent_2_feattotal_xg_1,agent_2_feattotal_xg_mean_3,agent_2_feattotal_xg_mean,agent_2_featboth_scored_3,agent_2_featboth_scored_2,agent_2_featboth_scored_1,agent_2_featboth_scored_mean_3,agent_2_featboth_scored_mean
0,58.6,87.0,15.2,6.83,0.844742,1.165049,9.19,16.5,337.0,179.0,...,2.66187,1.893116,4.24136,2.932115,2.690442,1.0,0.0,1.0,0.666667,0.333333
1,50.7,81.3,14.2,6.65,0.743218,1.152593,10.31,13.63,311.0,208.0,...,3.550724,2.3737,4.19701,3.373811,3.075302,0.0,1.0,1.0,0.666667,0.625
2,47.3,81.4,17.7,6.73,0.954509,0.956938,14.21,11.82,207.0,270.0,...,2.693652,2.042668,0.966665,1.900995,3.007033,0.0,1.0,1.0,0.666667,0.555556
3,54.5,84.8,14.5,6.85,1.155612,1.049618,10.95,12.46,339.0,186.0,...,3.9381,1.466409,0.922046,2.108852,2.643923,1.0,0.0,0.0,0.333333,0.444444
4,51.3,81.8,16.4,6.81,1.199718,0.856327,11.27,11.52,193.0,293.0,...,3.358338,2.138405,1.872476,2.456406,3.113815,0.0,0.0,0.0,0.0,0.555556


In [12]:
train_df.shape

(2470, 236)

In [13]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2470 entries, 0 to 2469
Columns: 236 entries, agent_1_feat_Possession% to expected_target2
dtypes: float64(212), int64(24)
memory usage: 4.4 MB


In [14]:
target_df.drop('id', axis = 1, inplace = True)

In [15]:
train_df = pd.concat([target_df, train_df], axis = 1)

In [16]:
train_df

Unnamed: 0,category,agent_1_feat_Possession%,agent_1_feat_Pass%,agent_1_feat_AerialsWon,agent_1_feat_Rating,agent_1_feat_XGrealiz,agent_1_feat_XGArealiz,agent_1_feat_PPDA,agent_1_feat_OPPDA,agent_1_feat_DC,...,agent_2_feattotal_xg_1,agent_2_feattotal_xg_mean_3,agent_2_feattotal_xg_mean,agent_2_featboth_scored_3,agent_2_featboth_scored_2,agent_2_featboth_scored_1,agent_2_featboth_scored_mean_3,agent_2_featboth_scored_mean,expected_target1,expected_target2
0,1,58.8,85.1,15.8,6.99,1.143700,0.928715,7.13,14.16,267.0,...,2.739439,2.739439,,0.473684,0.473684,0.473684,0.473684,,1,2
1,1,44.8,71.1,23.4,6.84,0.954159,0.975350,9.99,7.66,191.0,...,2.336756,2.336756,,0.578947,0.578947,0.578947,0.578947,,2,2
2,0,46.3,70.8,21.7,6.77,0.918434,1.118603,9.56,7.34,179.0,...,2.120322,2.120322,,0.368421,0.368421,0.368421,0.368421,,0,1
3,0,50.2,77.5,24.4,6.87,1.037613,0.956836,9.60,9.53,195.0,...,2.216415,2.216415,,0.210526,0.210526,0.210526,0.210526,,0,1
4,1,44.9,75.0,17.2,6.77,0.983691,0.948837,12.24,8.76,161.0,...,2.604025,2.604025,,0.421053,0.421053,0.421053,0.421053,,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2465,1,41.6,76.0,17.1,6.62,1.046406,1.032989,18.00,8.27,138.0,...,3.684860,4.024907,3.872622,1.000000,0.000000,0.000000,0.333333,0.444444,1,2
2466,1,42.9,76.1,18.3,6.61,1.161802,1.066236,16.14,7.60,201.0,...,1.568175,2.000313,2.572016,0.000000,0.000000,0.000000,0.000000,0.444444,2,3
2467,0,41.0,72.2,19.1,6.51,1.000858,1.026472,15.99,7.99,164.0,...,3.871643,2.496854,2.555157,0.000000,0.000000,1.000000,0.333333,0.500000,0,5
2468,1,51.4,79.3,14.1,6.62,1.037986,1.161401,9.73,10.47,222.0,...,4.904164,2.977092,2.495116,1.000000,0.000000,0.000000,0.333333,0.222222,1,3


## Delete outliers

In [17]:
train_expected_target1 = train_expected_target1.rename(columns={"0": "train_expected_target1"})
train_expected_target2 = train_expected_target2.rename(columns={"0": "train_expected_target2"})
train_df = pd.concat([train_expected_target1, train_df], axis = 1)
train_df = pd.concat([train_expected_target2, train_df], axis = 1)
train_df.head()

Unnamed: 0,train_expected_target2,train_expected_target1,category,agent_1_feat_Possession%,agent_1_feat_Pass%,agent_1_feat_AerialsWon,agent_1_feat_Rating,agent_1_feat_XGrealiz,agent_1_feat_XGArealiz,agent_1_feat_PPDA,...,agent_2_feattotal_xg_1,agent_2_feattotal_xg_mean_3,agent_2_feattotal_xg_mean,agent_2_featboth_scored_3,agent_2_featboth_scored_2,agent_2_featboth_scored_1,agent_2_featboth_scored_mean_3,agent_2_featboth_scored_mean,expected_target1,expected_target2
0,0.278076,1.16635,1,58.8,85.1,15.8,6.99,1.1437,0.928715,7.13,...,2.739439,2.739439,,0.473684,0.473684,0.473684,0.473684,,1,2
1,0.613273,1.2783,1,44.8,71.1,23.4,6.84,0.954159,0.97535,9.99,...,2.336756,2.336756,,0.578947,0.578947,0.578947,0.578947,,2,2
2,1.11757,1.90067,0,46.3,70.8,21.7,6.77,0.918434,1.118603,9.56,...,2.120322,2.120322,,0.368421,0.368421,0.368421,0.368421,,0,1
3,0.909774,0.423368,0,50.2,77.5,24.4,6.87,1.037613,0.956836,9.6,...,2.216415,2.216415,,0.210526,0.210526,0.210526,0.210526,,0,1
4,0.991901,1.68343,1,44.9,75.0,17.2,6.77,0.983691,0.948837,12.24,...,2.604025,2.604025,,0.421053,0.421053,0.421053,0.421053,,2,2


In [18]:
print('Rows before deleting: ', train_df.shape[0])
train_df = train_df.drop(train_df[(train_df.train_expected_target2 > 1) &
                                  (train_df.train_expected_target1 > 1) &
                                  (train_df.category == 0)].index)
train_df.drop(['train_expected_target1', 'train_expected_target2'], axis = 1, inplace = True)
print('Rows after deleting: ', train_df.shape[0])

Rows before deleting:  2470
Rows after deleting:  2295


## Work with missing variables

In [19]:
print('Rows before deleting: ', train_df.shape[0])
train_df = train_df.dropna()  
print('Rows after deleting: ', train_df.shape[0])

Rows before deleting:  2295
Rows after deleting:  2163


# Corr

In [20]:
train_df

Unnamed: 0,category,agent_1_feat_Possession%,agent_1_feat_Pass%,agent_1_feat_AerialsWon,agent_1_feat_Rating,agent_1_feat_XGrealiz,agent_1_feat_XGArealiz,agent_1_feat_PPDA,agent_1_feat_OPPDA,agent_1_feat_DC,...,agent_2_feattotal_xg_1,agent_2_feattotal_xg_mean_3,agent_2_feattotal_xg_mean,agent_2_featboth_scored_3,agent_2_featboth_scored_2,agent_2_featboth_scored_1,agent_2_featboth_scored_mean_3,agent_2_featboth_scored_mean,expected_target1,expected_target2
20,0,44.0,70.3,25.1,6.79,0.711201,0.915529,10.74,9.43,218.0,...,1.608046,2.112304,1.608046,0.578947,0.578947,1.0,0.719298,1.000000,0,0
21,0,57.0,84.6,15.9,7.07,1.094698,0.938272,7.57,13.92,575.0,...,2.479335,2.214160,2.479335,0.526316,0.526316,1.0,0.684211,1.000000,0,1
22,1,48.1,76.9,17.7,6.74,0.994530,1.235052,9.77,8.24,175.0,...,1.712261,2.183093,1.712261,0.526316,0.526316,1.0,0.684211,1.000000,3,3
23,0,46.3,70.8,21.7,6.77,0.918434,1.118603,9.56,7.34,179.0,...,2.675331,2.627794,2.675331,0.421053,0.421053,1.0,0.614035,1.000000,1,0
24,0,50.7,82.1,14.4,6.86,1.124694,0.875939,11.79,10.66,156.0,...,1.331644,2.260683,1.331644,0.368421,0.368421,0.0,0.245614,0.000000,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2465,1,41.6,76.0,17.1,6.62,1.046406,1.032989,18.00,8.27,138.0,...,3.684860,4.024907,3.872622,1.000000,0.000000,0.0,0.333333,0.444444,1,2
2466,1,42.9,76.1,18.3,6.61,1.161802,1.066236,16.14,7.60,201.0,...,1.568175,2.000313,2.572016,0.000000,0.000000,0.0,0.000000,0.444444,2,3
2467,0,41.0,72.2,19.1,6.51,1.000858,1.026472,15.99,7.99,164.0,...,3.871643,2.496854,2.555157,0.000000,0.000000,1.0,0.333333,0.500000,0,5
2468,1,51.4,79.3,14.1,6.62,1.037986,1.161401,9.73,10.47,222.0,...,4.904164,2.977092,2.495116,1.000000,0.000000,0.0,0.333333,0.222222,1,3


In [21]:
corr_matrix1 = train_df.corr()
print(corr_matrix1["expected_target1"].sort_values(ascending=False))

expected_target1          1.000000
agent_1_feat_ScoredAv     0.385431
agent_1_feat_XgAv         0.360820
agent_1_feat_DC           0.341588
agent_1_feat_pl_median    0.328869
                            ...   
agent_1_feat_PPDA        -0.213783
agent_2_feat_Rating      -0.215384
agent_1_feat_MissedAv    -0.260551
agent_1_feat_XgaAv       -0.267009
agent_1_feat_ODC         -0.290223
Name: expected_target1, Length: 237, dtype: float64


In [22]:
corr_matrix2 = train_df.corr()
print(corr_matrix2["expected_target2"].sort_values(ascending=False))

expected_target2         1.000000
category                 0.471964
agent_2_feat_ScoredAv    0.335948
agent_2_feat_XgAv        0.324758
agent_2_feat_pl_mean     0.294398
                           ...   
agent_2_feat_MissedAv   -0.225085
agent_1_feat_pl_mean    -0.225340
agent_2_feat_XgaAv      -0.241378
agent_1_feat_Rating     -0.250496
agent_2_feat_ODC        -0.263211
Name: expected_target2, Length: 237, dtype: float64


## Split dataset on train and test

In [23]:
X = train_df.drop(['expected_target1','category','expected_target2'], axis=1)
Y_category = train_df['category']
Y_expected1 = train_df['expected_target1']
Y_expected2 = train_df['expected_target2']
X_train, X_test, y_train1, y_test1, y_train2, y_test2 = (X.iloc[0:int(len(X)*0.8)], 
                                    X.iloc[int(len(X)*0.8):len(X)], 
                                    Y_expected1.iloc[0:int(len(Y_expected1)*0.8)], 
                                    Y_expected1.iloc[int(len(Y_expected1)*0.8):len(Y_expected1)],
                                    Y_expected2.iloc[0:int(len(Y_expected2)*0.8)], 
                                    Y_expected2.iloc[int(len(Y_expected2)*0.8):len(Y_expected2)])

In [24]:
X_train.shape, X_test.shape, y_train1.shape, y_test1.shape, y_train2.shape, y_test2.shape

((1730, 234), (433, 234), (1730,), (433,), (1730,), (433,))

# **Model**

In [25]:
def create_model(batch_size, epochs, learning_rate, num_classes, input_shape):
    model = Sequential()
    model.add(Dense(128, activation="relu", input_shape=(input_shape, ))) # Hidden Layer 1
    model.add(BatchNormalization())
    
    model.add(Dense(128, activation="relu")) # Hidden Layer 2
    model.add(BatchNormalization())
    model.add(Dropout(0.2))
    
    model.add(Dense(64, activation="relu")) # Hidden Layer 3
    model.add(BatchNormalization())
    model.add(Dropout(0.2))
    
    model.add(Dense(64, activation="relu")) # Hidden Layer 5
    model.add(BatchNormalization())
    model.add(Dropout(0.2))
    
    model.add(Dense(32, activation="relu")) # Hidden Layer 7
    model.add(BatchNormalization())
    model.add(Dropout(0.2))
        
    model.add(Dense(num_classes, activation="softmax")) # Outout Layer
    
    sgd = SGD(learning_rate=learning_rate, momentum=0.9)

    model.compile(optimizer=sgd, loss = 'categorical_crossentropy', metrics = ['accuracy'])

    return model

In [26]:
def make_equal_dummies(y_train, y_test):
    dif = len(y_train.columns) - len(y_test.columns)
    if dif == 0:
        pass
    elif dif > 0:
        while dif != 0:
            y_test[int(y_test.columns[-1])+1] = 0
            dif -= 1
    else:
        while dif != 0:
            y_train[int(y_train.columns[-1])+1] = 0
            dif += 1
            
    print(y_train.shape, y_test.shape)
    return y_train, y_test

In [27]:
def make_graphs(history):
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    epochs = range(1, len(loss) + 1)
    plt.plot(epochs, loss, 'y', label='Training loss')
    plt.plot(epochs, val_loss, 'r', label='Validation loss')
    plt.title('Training and validation loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
    plt.show()


    acc = history.history['accuracy']
    val_acc = history.history['val_accuracy']
    plt.plot(epochs, acc, 'y', label='Training acc')
    plt.plot(epochs, val_acc, 'r', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.show()

In [28]:
def calc_train_acc(model, expected_target, Y_expected, X_test):
    tmp_real_value = train_df[expected_target][int(len(Y_expected)*0.8):len(Y_expected)]
    y_pred = model.predict(X_test)
    tmp_pred_value = []
    for i in y_pred:
        if i.argmax() != 0:
            tmp_pred_value.append(1)
        else:
            tmp_pred_value.append(0)
    
#     print(len(tmp_real_value), len(tmp_pred_value))
    num_matching = 0
    if len(tmp_real_value) == len(tmp_pred_value):
        for i, j in zip(tmp_real_value, tmp_pred_value):
            if i != 0:
                i = 1
            if i == j:
                num_matching += 1
        print("ACC: ", np.round((num_matching / len(tmp_real_value)), 3))
    else:
        print('Not equal siae of predictions')
        
    return tmp_pred_value


In [29]:
def scale_data(X_train, X_test, test_df):
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    test_df = scaler.transform(test_df)
    
    return X_train, X_test, test_df

In [30]:
def KNN_model(X_train, y_train, X_test, y_test):
    error_rates = []

    for i in np.arange(1, 100):
        new_model = KNeighborsClassifier(n_neighbors = i)
        new_model.fit(X_train, y_train)
        model1 = KNeighborsClassifier(n_neighbors = 5)
        model1.fit(X_train, y_train)
        new_predictions = new_model.predict(X_test)
        error_rates.append(np.mean(new_predictions != y_test))

    plt.plot(error_rates);
    
    return error_rates

In [31]:
def feature_selection(X_train, y_train, X_test):
    feature_selection_model = KNeighborsClassifier(n_neighbors = np.argmin(KNN_model(X_train, y_train, X_test, y_test)))
    
    sfs1 = SFS(feature_selection_model1,
          k_features=X_train.shape[1],
          forward=True, 
          floating=False, 
          scoring='accuracy', 
          verbose=1,
          cv=StratifiedKFold(n_splits=5),
          n_jobs=-1
          )

    sfs = sfs.fit(X_train, y_train)
    
    sfs_data = pd.DataFrame.from_dict(sfs.get_metric_dict()).T

    sfs_data['avg_score'] = pd.to_numeric(sfs_data['avg_score'])
    fearues_names = (sfs_data.loc[sfs_data['avg_score'].idxmax(), 'feature_names'])
    
    return fearues_names

In [32]:
def grid_search(X_train, y_train, expected_target):
    batch_size = [16, 32, 64, 128, 256]
    epochs = [50, 75, 125, 150]
    learning_rate = [0.001, 0.0001]
    num_classes = [train_df[expected_target].nunique()]
    input_shape = [X_train.shape[1]]
    param_opt = dict(batch_size=batch_size,
                     epochs=epochs,
                     learning_rate=learning_rate,
                     num_classes=num_classes,
                     input_shape=input_shape)


    model_GridSearch = KerasClassifier(build_fn=create_model, 
                                       verbose=0)
    grid = GridSearchCV(estimator=model_GridSearch, 
                        param_grid=param_opt, 
                        n_jobs=1, 
                        cv=5, 
                        verbose = 0)
    grid_result = grid.fit(X_train, y_train)
    
    print('Best parameters are: ')
    print('batch_size: ' + str(grid_result.best_params_['batch_size']))
    print('epochs: ' + str(grid_result.best_params_['epochs']))
    print('learning_rate: ' + str(grid_result.best_params_['learning_rate']))
    
    return grid_result

# Model for Agent1

## Feature selection 

In [33]:
fearues_names1 = feature_selection(X_train, y_train1, X_test, y_test1)

TypeError: feature_selection() takes 3 positional arguments but 4 were given

## Main model

In [None]:
new_X_train1 = X_train[list(fearues_names1)]
new_X_train1

In [None]:
new_X_test1 = X_test[list(fearues_names1)]
new_X_test1

In [None]:
new_test_df1 = test_df[list(fearues_names1)]
new_test_df1

### Scale data 

In [None]:
new_X_train1, new_X_test1, new_test_df1 = scale_data(new_X_train1, new_X_test1, new_test_df1)

In [None]:
y_train1 = pd.get_dummies(y_train1)
y_test1 = pd.get_dummies(y_test1)
y_train1, y_test1 = make_equal_dummies(y_train1, y_test1)

In [None]:
grid_results1 = grid_search(new_X_train1, y_train1, 'expected_target1')
batch_size1 = grid_results1.best_params_['batch_size']
epochs1 = grid_results1.best_params_['epochs']
learning_rate1 = grid_results1.best_params_['learning_rate']

In [None]:
# batch_size1 = 32
# epochs1 = 100
# learning_rate1 = 0.0001

In [None]:
num_classes = train_df['expected_target1'].nunique()
model1 = create_model(batch_size1, epochs1, learning_rate1, num_classes, new_X_train1.shape[1])
history1 = model1.fit(new_X_train1, y_train1, 
                      batch_size = batch_size1, 
                      epochs = epochs1,
                      validation_data = (new_X_test1, y_test1),
)

In [None]:
make_graphs(history1)

In [None]:
tmp_pred_value1 = calc_train_acc(model1, 'expected_target1', Y_expected1, new_X_test1)

# Model for agent2

## Feature selection 

In [None]:
fearues_names2 = feature_selection(X_train, y_train2, X_test, y_test2)

## Main model

In [None]:
new_X_train2 = X_train[list(fearues_names2)]
new_X_train2

In [None]:
new_X_test2 = X_test[list(fearues_names2)]
new_X_test2

In [None]:
new_test_df2 = test_df[list(fearues_names2)]
new_test_df2

### Scale data 

In [None]:
new_X_train2, new_X_test2, new_test_df2 = scale_data(new_X_train2, new_X_test2, new_test_df2)

In [None]:
y_train2 = pd.get_dummies(y_train2)
y_test2 = pd.get_dummies(y_test2)
y_train2, y_test2 = make_equal_dummies(y_train2, y_test2)

In [None]:
grid_results2 = grid_search(new_X_train2, y_train2, 'expected_target2')
batch_size2 = grid_results2.best_params_['batch_size']
epochs2 = grid_results2.best_params_['epochs']
learning_rate2 = grid_results2.best_params_['learning_rate']

In [None]:
# batch_size2 = 32
# epochs2 = 100
# learning_rate2 = 0.0001

In [None]:
num_classes = train_df['expected_target2'].nunique()
model2 = create_model(batch_size2, epochs2, learning_rate2, num_classes, new_X_train2.shape[1])
history2 = model2.fit(new_X_train2, y_train2, 
                      batch_size = batch_size2, 
                      epochs = epochs2,
                      validation_data = (new_X_test2, y_test2),
)

In [None]:
make_graphs(history2)

In [None]:
tmp_pred_value2 = calc_train_acc(model2, 'expected_target2', Y_expected2, new_X_test2)

# Test acc

In [None]:
test_pred = np.logical_and(tmp_pred_value1, tmp_pred_value2)
test_pred = [0 if x==False else x for x in test_pred]
test_pred = [1 if x==True else x for x in test_pred]

In [None]:
test_real_value = train_df['category'][int(len(Y_expected1)*0.8):len(Y_expected1)]
test_real_value

In [None]:
acc = 0
for elem1, elem2 in zip(test_pred, test_real_value):
    if elem1 == elem2:
        acc += 1
print("ACC: ", np.round((acc / len(test_real_value)), 3))

# Make a submission

In [None]:
fin_pred1 = model1.predict(new_test_df1)
tmp1 = []
for i in fin_pred1:
    if i.argmax() != 0:
        tmp1.append(1)
    else:
        tmp1.append(0)

print(tmp1)

In [None]:
fin_pred2 = model2.predict(new_test_df2)
tmp2 = []
for i in fin_pred2:
    if i.argmax() != 0:
        tmp2.append(1)
    else:
        tmp2.append(0)

print(tmp2)

In [None]:
Answer = np.logical_and(tmp1, tmp2)
Answer = [0 if x==False else x for x in Answer]
Answer = [1 if x==True else x for x in Answer]

In [None]:
sample_submission = pd.read_csv('sample_submission.csv')
sample_submission['tmp'] = Answer
sample_submission.drop(['category'], axis = 1, inplace= True)
sample_submission = sample_submission.rename(columns={"tmp": "category"})
sample_submission.to_csv('Answer.csv', index = False)

In [None]:
sample_submission

# Check 33% acc

In [None]:
answers = pd.read_csv('released_test.csv')

In [None]:
num = 0
for i, j in zip(sample_submission.iloc[0:len(answers)]['category'], answers['category']):
    if i == j:
        num +=1
print(num / len(answers))