# Classification Assesment Technical Report


### Before running code you must :

 * Make sure to have all data being used in the appropriate path(data/..)
 * Using the correct python version (>3) 
 * Make sure to have all the libraries being used installed on your machine.

### Workflow being implemented for the assesment :
        
  1. Import data in the enviroment
  2. Exploration analysis and summary statistics
  3. Fitting 3 models to assess begininng state
  4. Overfit/Underfit trade-off of best perfoming model
  5. Feature importance on the best perfoming model and reduce complexity (overfit)
  6. Tuning hyperparameters of best performing model
  7. Results of the best perfoming model

###### [1] Importing the dataset in python enviroment

We are also importing all the appropriate libraries that will be used in the following process.

In [1]:
# Important libraries for data frame manipulations.
import pandas as pd
import numpy as np
import sys as sys

# For fitting a decision tree.
from sklearn.tree import DecisionTreeRegressor

# For fitting logistic regression.

# Any others needed, feel free to append.


In [2]:
# Reading in the training and testing set already provided.
training_data = pd.read_csv("data/train.csv")
testing_data = pd.read_csv("data/test.csv")

###### [2] Exploration Analysis applied on the data

This step is fundamental for the construction of our candidate classifiers. We have to make sure that our two datasets have the same strucuture (covariates). However this step is undertaken with extreme caution, because the test data should not be viewed.

In [3]:
# Having info analysis on the training dataset.
training_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300000 entries, 0 to 299999
Data columns (total 32 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   id      300000 non-null  int64  
 1   cat0    300000 non-null  object 
 2   cat1    300000 non-null  object 
 3   cat2    300000 non-null  object 
 4   cat3    300000 non-null  object 
 5   cat4    300000 non-null  object 
 6   cat5    300000 non-null  object 
 7   cat6    300000 non-null  object 
 8   cat7    300000 non-null  object 
 9   cat8    300000 non-null  object 
 10  cat9    300000 non-null  object 
 11  cat10   300000 non-null  object 
 12  cat11   300000 non-null  object 
 13  cat12   300000 non-null  object 
 14  cat13   300000 non-null  object 
 15  cat14   300000 non-null  object 
 16  cat15   300000 non-null  object 
 17  cat16   300000 non-null  object 
 18  cat17   300000 non-null  object 
 19  cat18   300000 non-null  object 
 20  cont0   300000 non-null  float64
 21  cont1   30

In [4]:
# Having info analysis on the testing dataset
testing_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   id      200000 non-null  int64  
 1   cat0    200000 non-null  object 
 2   cat1    200000 non-null  object 
 3   cat2    200000 non-null  object 
 4   cat3    200000 non-null  object 
 5   cat4    200000 non-null  object 
 6   cat5    200000 non-null  object 
 7   cat6    200000 non-null  object 
 8   cat7    200000 non-null  object 
 9   cat8    200000 non-null  object 
 10  cat9    200000 non-null  object 
 11  cat10   200000 non-null  object 
 12  cat11   200000 non-null  object 
 13  cat12   200000 non-null  object 
 14  cat13   200000 non-null  object 
 15  cat14   200000 non-null  object 
 16  cat15   200000 non-null  object 
 17  cat16   200000 non-null  object 
 18  cat17   200000 non-null  object 
 19  cat18   200000 non-null  object 
 20  cont0   200000 non-null  float64
 21  cont1   20

From the above results, we confirm that the two datasets do not hold any null values and they are of the same design structure.

 * Both datasets contain 19 factor type covariates
 * Both datasets contain 11 float type covariates
 * The training dataset has 1 extra column of type int which is our response variable of training
 * Training set is constructed by 300K rows
 * Testing set is constructed by 200K rows
 
We now check the different levels of each factor variable. The reason behind this action is because in Python our testing and training set must have the same number of factor levels for the identical covariate. When we apply 1-hot-encoder, each level will act as a unique covariate. With this being said if one factor covariate has different number levels between the two sets, the design structure wont be appropriate.

**1-Hot-Encoder** is a method used in Python to overcome the different levels of the categorical (string) columns of the data. Because Python models only understand numerical values, we need to convert the string data to numerical without implementing a mathematical meaning to them. This technique creates a new column for **each** different level and filles in the cells with **binary values** (1 = true, 0 = false). 

In [5]:
# Using describe function to check the levels of the categories in training dataset.
training_data.describe(include=[object])
  

Unnamed: 0,cat0,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9,cat10,cat11,cat12,cat13,cat14,cat15,cat16,cat17,cat18
count,300000,300000,300000,300000,300000,300000,300000,300000,300000,300000,300000,300000,300000,300000,300000,300000,300000,300000,300000
unique,2,15,19,13,20,84,16,51,61,19,299,2,2,2,2,4,4,4,4
top,A,I,A,A,E,BI,A,AH,BM,A,DJ,A,A,A,A,B,D,D,B
freq,223525,90809,168694,187251,129385,238563,187896,45818,42380,201945,31584,258932,257139,292712,160166,203574,206906,247125,255482


In [6]:
# Using describe function to check the levels of categories in testing dataset.
testing_data.describe(include=[object])

Unnamed: 0,cat0,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9,cat10,cat11,cat12,cat13,cat14,cat15,cat16,cat17,cat18
count,200000,200000,200000,200000,200000,200000,200000,200000,200000,200000,200000,200000,200000,200000,200000,200000,200000,200000,200000
unique,2,15,19,13,20,84,16,51,61,19,295,2,2,2,2,4,4,4,4
top,A,I,A,A,E,BI,A,AH,BM,A,DJ,A,A,A,A,B,D,D,B
freq,149023,60152,112465,124506,86073,158916,125098,30593,28368,134223,21166,172586,171098,195016,106607,135542,137908,165066,170068


From the above results we can view that in the testing data column 'cat10' has 4 extra unique values. For this reason we will need to 1-hot-encode the categorical data for both datasets together and split them again to the beginning states.

Before we merge the two datasets and ecnode them, we need to isolate the response in the training dataset alone. Additionally we need to remove the index column from the data, because it should not be used in the training or predicting.

In [7]:
# Getting a copy of data frames to 1-hot-encode

## TRAINING DATA
train_encoder = training_data.copy()
# Need to drop response varibale from training and hold it seperate.
trainY_response = training_data['target'].copy()
train_encoder.drop('target', axis = 1, inplace = True)
# We also need to remove the index variables because they should not be considered in the model training
train_encoder.drop('id', axis = 1, inplace = True)

## TESTING DATA
test_encoder = testing_data.copy()
test_encoder.drop('id', axis = 1, inplace = True)

## MERGED DATA
# This will concatenate the test data rows below the train data
# The first 300K rows will be the train.
# The last 200K rows will be the testing.
merged_encoder = pd.concat([train_encoder, test_encoder])

In [8]:
# Using get_dummies() which is the same thing as 1-hot-encoder but ignores numerical values.
%time
merged_encoder = pd.get_dummies(merged_encoder)

CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 5.96 µs


In [9]:
# Checking if the one hot encoder worked.
merged_encoder.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 500000 entries, 0 to 199999
Columns: 642 entries, cont0 to cat18_D
dtypes: float64(11), uint8(631)
memory usage: 346.7 MB


From the above result we can understand that our data has become a large sparse matrix of values. This is one of the consequences when using a 1-Hot-Encoder technique. We now need to split the data again back to training and testing.

In [10]:
# First 300K rows is our training set.
trainingX_data = merged_encoder.iloc[:300000,:].copy()
trainingX_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 300000 entries, 0 to 299999
Columns: 642 entries, cont0 to cat18_D
dtypes: float64(11), uint8(631)
memory usage: 208.0 MB


In [11]:
# Last 200K rows is our testing set.
testingX_data = merged_encoder.iloc[300000:,:].copy()
testingX_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 200000 entries, 0 to 199999
Columns: 642 entries, cont0 to cat18_D
dtypes: float64(11), uint8(631)
memory usage: 138.7 MB


We can now explore correlation relations between the covariates with response. This will help us identify the covariates that mostly influence the response value. The covariates that mostly influence the response values are also most probably the most important covariates to consider in our training.

In [12]:
# Merge the trainingX_data with trainY_data to see the correlations
training_all = pd.concat([trainingX_data.copy(), trainY_response.copy()], axis=1)
%time
# Computing correlation relations. (takes about 5-6 minutes btw. Time below is worng)
training_corr_influnce = training_all.corr()

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 3.81 µs


In [14]:
# Printing the correlation with response in sorted way
training_corr_influnce["target"].sort_values(ascending=False)[:20]

target     1.000000
cat16_B    0.522759
cat15_D    0.467675
cat14_B    0.302301
cat18_D    0.299653
cat11_B    0.285503
cat0_A     0.268109
cat18_C    0.260021
cat17_C    0.237540
cont5      0.215184
cat2_Q     0.213173
cat13_B    0.205714
cat8_K     0.194427
cat1_L     0.190920
cont6      0.189832
cont8      0.183726
cont1      0.164655
cat7_AF    0.160744
cat9_A     0.156035
cat4_H     0.153590
Name: target, dtype: float64

**One-Hot-Encoder** is completed now and **Correlation analysis of response** has also been applied as an indication of what features will be valuable in our training.
A quick recap to our different data.frames we have to now.
   
 1. NON-One-Hot-Encoded-Data
 * training_data = Full training dataset to reference back to it if mistake occurs further on.
 * testing_data = Full testing dataset to reference back to it if mistake occures.
 2. One-Hot-Encoded-Data
 * merged_encoder = Both datasets merged and encoded
 * trainingX_data = Training dataset explanatory variables. (values we will use for training)
 * trainY_response = Training dataset response variable. (value used to training models in construnction)
 * testingX_data = Testing dataser explanatory variables (values we will use for predictions.)
 3. Extra infomration
 * training_corr_influnce = Holds correlation values between variables
 * training_all = Holds the 1-Hot-Encoded explanatory variables with the response, to calculate correlations.
 
More exploration analysis and graph analysis can be found in the external file EDA.ipynp.

######  [3] Fitting 3 models to assess begininng state

###### [4] Overfit/Underfit trade-off of best perfoming mode

###### [5] Feature importance on the best perfoming model and reduce complexity (overfit)

###### [6] Tuning hyperparameters of best performing model

###### [7] Results of the best perfoming model