# Classification Assessment Technical Report


### Before running the code you must:

 * Make sure all the data being used is in the appropriate path (data/..)
 * Use the correct Python version (3 or later)
 * Make sure all the libraries being used are installed on your machine.

### Workflow being implemented for the assessment:

  1. Import data into the environment
  2. Exploration analysis and summary statistics
  3. Feature importance analysis (using a decision tree)
  4. Fitting multiple models to assess the beginning state
  5. Tuning model hyperparameters to investigate the complexity trade-off
  6. Feature importance on the best performing model and reduce complexity (overfit)
  7. Tuning hyperparameters of the best performing model
  8. Results of the best performing model

###### [1] Importing the dataset into the Python environment

We are also importing all the appropriate libraries that will be used in the following process.

In [117]:
# Important libraries for data frame manipulations.
import pandas as pd
import numpy as np
import sys

# For fitting logistic regression.
from sklearn.linear_model import LogisticRegression

# For fitting a random forest.
from sklearn.ensemble import RandomForestClassifier

# For fitting a decision tree.
from sklearn.tree import DecisionTreeClassifier

# For fitting a bagging classifier.
from sklearn.ensemble import BaggingClassifier

# Importing accuracy_score to auto calculate our classifiers accuracy
from sklearn.metrics import accuracy_score

# Finding statistical metrics
import statistics

# To apply a gridsearch in hyperparameters for random forest.
from sklearn.model_selection import GridSearchCV


In [3]:
# Reading in the training and testing set already provided.
training_data = pd.read_csv("data/train.csv")
testing_data = pd.read_csv("data/test.csv")

###### [2] Exploration Analysis applied on the data

This step is fundamental for the construction of our candidate classifiers. We have to make sure that our two datasets have the same structure (covariates). However, this step is undertaken with extreme caution, because the test data itself should not be examined.

In [4]:
# Summary info for the training dataset.
training_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300000 entries, 0 to 299999
Data columns (total 32 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   id      300000 non-null  int64  
 1   cat0    300000 non-null  object 
 2   cat1    300000 non-null  object 
 3   cat2    300000 non-null  object 
 4   cat3    300000 non-null  object 
 5   cat4    300000 non-null  object 
 6   cat5    300000 non-null  object 
 7   cat6    300000 non-null  object 
 8   cat7    300000 non-null  object 
 9   cat8    300000 non-null  object 
 10  cat9    300000 non-null  object 
 11  cat10   300000 non-null  object 
 12  cat11   300000 non-null  object 
 13  cat12   300000 non-null  object 
 14  cat13   300000 non-null  object 
 15  cat14   300000 non-null  object 
 16  cat15   300000 non-null  object 
 17  cat16   300000 non-null  object 
 18  cat17   300000 non-null  object 
 19  cat18   300000 non-null  object 
 20  cont0   300000 non-null  float64
 21  cont1   30

In [5]:
# Summary info for the testing dataset.
testing_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   id      200000 non-null  int64  
 1   cat0    200000 non-null  object 
 2   cat1    200000 non-null  object 
 3   cat2    200000 non-null  object 
 4   cat3    200000 non-null  object 
 5   cat4    200000 non-null  object 
 6   cat5    200000 non-null  object 
 7   cat6    200000 non-null  object 
 8   cat7    200000 non-null  object 
 9   cat8    200000 non-null  object 
 10  cat9    200000 non-null  object 
 11  cat10   200000 non-null  object 
 12  cat11   200000 non-null  object 
 13  cat12   200000 non-null  object 
 14  cat13   200000 non-null  object 
 15  cat14   200000 non-null  object 
 16  cat15   200000 non-null  object 
 17  cat16   200000 non-null  object 
 18  cat17   200000 non-null  object 
 19  cat18   200000 non-null  object 
 20  cont0   200000 non-null  float64
 21  cont1   20

From the above results, we confirm that neither dataset holds any null values and that both share the same design structure.

 * Both datasets contain 19 factor-type covariates
 * Both datasets contain 11 float-type covariates
 * The training dataset has 1 extra column of type int, which is our response variable for training
 * The training set consists of 300K rows
 * The testing set consists of 200K rows
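
The structural checks listed above can also be done programmatically rather than by reading the `info()` output. The following is a minimal sketch on two toy frames; `toy_train` and `toy_test` are hypothetical stand-ins for `training_data` and `testing_data`, and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the training and testing frames (hypothetical values).
toy_train = pd.DataFrame(
    {"id": [0, 1], "cat0": ["A", "B"], "cont0": [0.1, 0.7], "target": [0, 1]}
)
toy_test = pd.DataFrame({"id": [2, 3], "cat0": ["B", "A"], "cont0": [0.4, 0.9]})

# Columns present in one frame but not in the other.
only_in_train = sorted(set(toy_train.columns) - set(toy_test.columns))
only_in_test = sorted(set(toy_test.columns) - set(toy_train.columns))

# Total number of missing values in each frame.
missing_train = int(toy_train.isna().sum().sum())
missing_test = int(toy_test.isna().sum().sum())

# Shared columns whose dtypes disagree between the two frames.
shared = [c for c in toy_train.columns if c in toy_test.columns]
dtype_mismatch = [c for c in shared if toy_train[c].dtype != toy_test[c].dtype]

print(only_in_train, only_in_test, missing_train, missing_test, dtype_mismatch)
```

If the two datasets share the same design, the only column unique to the training frame should be the response, and the dtype mismatch list should be empty.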
 
We now check the levels of each factor variable. The training and testing sets must end up with the same encoded columns: when we apply a 1-hot-encoder, each level becomes its own covariate, so if a factor covariate has a different number of levels in the two sets, the resulting design matrices will not match.

**1-Hot-Encoder** is a method for handling the levels of the categorical (string) columns of the data. Because scikit-learn models only accept numerical values, we need to convert the string data to numbers without imposing an ordering or magnitude on them. This technique creates a new column for **each** level and fills in the cells with **binary values** (1 = true, 0 = false).
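
As a small illustration of the encoding described above, the sketch below applies `pd.get_dummies()` to a toy frame. The column names `colour` and `size` and their values are hypothetical and do not come from the assessment data.

```python
import pandas as pd

# Toy frame with one string column and one numeric column (hypothetical data).
toy = pd.DataFrame({"colour": ["red", "green", "red"], "size": [1.5, 2.0, 0.5]})

# Each level of the string column becomes its own indicator column;
# the numeric column is passed through unchanged.
encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one indicator set per original categorical column, which is why the encoded matrix becomes wide and mostly zeros when a column has many levels.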

In [6]:
# Using describe function to check the levels of the categories in training dataset.
training_data.describe(include=[object])
  

Unnamed: 0,cat0,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9,cat10,cat11,cat12,cat13,cat14,cat15,cat16,cat17,cat18
count,300000,300000,300000,300000,300000,300000,300000,300000,300000,300000,300000,300000,300000,300000,300000,300000,300000,300000,300000
unique,2,15,19,13,20,84,16,51,61,19,299,2,2,2,2,4,4,4,4
top,A,I,A,A,E,BI,A,AH,BM,A,DJ,A,A,A,A,B,D,D,B
freq,223525,90809,168694,187251,129385,238563,187896,45818,42380,201945,31584,258932,257139,292712,160166,203574,206906,247125,255482


In [7]:
# Using describe function to check the levels of categories in testing dataset.
testing_data.describe(include=[object])

Unnamed: 0,cat0,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9,cat10,cat11,cat12,cat13,cat14,cat15,cat16,cat17,cat18
count,200000,200000,200000,200000,200000,200000,200000,200000,200000,200000,200000,200000,200000,200000,200000,200000,200000,200000,200000
unique,2,15,19,13,20,84,16,51,61,19,295,2,2,2,2,4,4,4,4
top,A,I,A,A,E,BI,A,AH,BM,A,DJ,A,A,A,A,B,D,D,B
freq,149023,60152,112465,124506,86073,158916,125098,30593,28368,134223,21166,172586,171098,195016,106607,135542,137908,165066,170068


From the above results we can see that column 'cat10' has 299 unique values in the training data but only 295 in the testing data, so the training data has 4 more levels. For this reason we will 1-hot-encode the categorical data for both datasets together and then split them back into their original states.
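
To see which levels differ, and why encoding the two sets together keeps their columns aligned, here is a minimal sketch on toy data. The frames and level names are hypothetical; only the column name `cat10` is borrowed from the report.

```python
import pandas as pd

# Toy frames: level "C" appears only in the training data (hypothetical values).
toy_train = pd.DataFrame({"cat10": ["A", "B", "C", "A"]})
toy_test = pd.DataFrame({"cat10": ["A", "B", "B"]})

# Levels that appear in one set but not the other.
levels_train = set(toy_train["cat10"].unique())
levels_test = set(toy_test["cat10"].unique())
only_train = sorted(levels_train - levels_test)
only_test = sorted(levels_test - levels_train)

# Encode both sets together, then split back by row position.
merged = pd.concat([toy_train, toy_test], ignore_index=True)
encoded = pd.get_dummies(merged)
n_train = len(toy_train)
enc_train = encoded.iloc[:n_train]
enc_test = encoded.iloc[n_train:]

print(only_train, only_test)
print(enc_train.columns.tolist() == enc_test.columns.tolist())
```

Because the encoding is done once on the merged frame, the testing part still gets a column for the level it never contains, filled with zeros.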

Before we merge the two datasets and encode them, we need to separate the response from the training dataset. Additionally, we need to remove the id column from the data, because it should not be used in training or prediction.

In [8]:
# Getting a copy of data frames to 1-hot-encode

## TRAINING DATA
train_encoder = training_data.copy()
# Need to drop the response variable from training and hold it separately.
trainY_response = training_data['target'].copy()
train_encoder.drop('target', axis = 1, inplace = True)
# We also need to remove the id column because it should not be considered in model training.
train_encoder.drop('id', axis = 1, inplace = True)

## TESTING DATA
test_encoder = testing_data.copy()
test_encoder.drop('id', axis = 1, inplace = True)

## MERGED DATA
# This will concatenate the test data rows below the train data
# The first 300K rows will be the train.
# The last 200K rows will be the testing.
merged_encoder = pd.concat([train_encoder, test_encoder])

In [9]:
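# pd.get_dummies() one-hot encodes the object columns and leaves numerical columns unchanged.
# Note: a bare `%time` line times an empty statement, so the timing printed below
# does not reflect the encoding. Use `%%time` as the first line of the cell to time the whole cell.
%time
merged_encoder = pd.get_dummies(merged_encoder)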

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.25 µs


In [10]:
# Checking if the one hot encoder worked.
merged_encoder.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 500000 entries, 0 to 199999
Columns: 642 entries, cont0 to cat18_D
dtypes: float64(11), uint8(631)
memory usage: 346.7 MB


From the above result we can see that our data has become a large, sparse matrix with 642 columns. This is one of the consequences of using a 1-hot-encoder. We now need to split the data back into training and testing sets.

In [11]:
# First 300K rows is our training set.
trainingX_data = merged_encoder.iloc[:300000,:].copy()
trainingX_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 300000 entries, 0 to 299999
Columns: 642 entries, cont0 to cat18_D
dtypes: float64(11), uint8(631)
memory usage: 208.0 MB


In [12]:
# Last 200K rows is our testing set.
testingX_data = merged_encoder.iloc[300000:,:].copy()
testingX_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 200000 entries, 0 to 199999
Columns: 642 entries, cont0 to cat18_D
dtypes: float64(11), uint8(631)
memory usage: 138.7 MB


We can now explore the correlation between each covariate and the response. This helps us identify the covariates most strongly associated with the response, which are likely to be the most important ones to consider in training.

In [13]:
# Merge trainingX_data with trainY_response to see the correlations
training_all = pd.concat([trainingX_data.copy(), trainY_response.copy()], axis=1)

In [14]:
%%time
# Computing the full correlation matrix (takes about 5-6 minutes; the time shown below is wrong).
training_corr_influnce = training_all.corr()

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 24.1 µs


In [15]:
# Printing the correlation with response in sorted way (20 most positive influential features).
training_corr_influnce["target"].sort_values(ascending=False)[:20]

target     1.000000
cat16_B    0.522759
cat15_D    0.467675
cat14_B    0.302301
cat18_D    0.299653
cat11_B    0.285503
cat0_A     0.268109
cat18_C    0.260021
cat17_C    0.237540
cont5      0.215184
cat2_Q     0.213173
cat13_B    0.205714
cat8_K     0.194427
cat1_L     0.190920
cont6      0.189832
cont8      0.183726
cont1      0.164655
cat7_AF    0.160744
cat9_A     0.156035
cat4_H     0.153590
Name: target, dtype: float64

In [16]:
# Printing the correlation with response in sorted way (20 most negative influential).
training_corr_influnce["target"].sort_values(ascending=True)[:20]

cat16_D    -0.505020
cat15_B    -0.435327
cat18_B    -0.409153
cat14_A    -0.302301
cat11_A    -0.285503
cat0_B     -0.268109
cat17_D    -0.267343
cat13_A    -0.205714
cat1_I     -0.198141
cat4_E     -0.165314
cat6_A     -0.149729
cont3      -0.148316
cat2_A     -0.141533
cat9_E     -0.126067
cat10_CR   -0.100503
cat1_A     -0.095467
cat2_C     -0.093678
cat12_B    -0.083143
cont4      -0.075585
cat7_E     -0.075031
Name: target, dtype: float64

From the above results we can see some correlations close to 0, which indicates those features have little linear association with the response. We will investigate this further with feature importance analysis.
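
Since only the correlations with the response are inspected, one possible shortcut is to compute just that column instead of the full correlation matrix. The sketch below uses `DataFrame.corrwith()` on synthetic data; the feature names `x_pos` and `x_noise` are hypothetical, and this is an alternative to the `corr()` call used above, not the method the report ran.

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature drives the target, the other is pure noise.
rng = np.random.default_rng(0)
X = pd.DataFrame({"x_pos": rng.normal(size=200), "x_noise": rng.normal(size=200)})
y = pd.Series(2.0 * X["x_pos"] + rng.normal(scale=0.1, size=200), name="target")

# Pearson correlation of every column of X with y, without building
# the full feature-by-feature correlation matrix.
corr_with_target = X.corrwith(y).sort_values(ascending=False)

# For comparison: the same value taken from the full correlation matrix.
full_matrix_value = pd.concat([X, y], axis=1).corr()["target"]["x_pos"]

print(corr_with_target)
```

On a frame with hundreds of encoded columns this computes one correlation per feature rather than one per pair of features, which is where most of the cost of the full matrix comes from.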

**One-Hot-Encoding** is now complete and **correlation analysis with the response** has been applied as an indication of which features will be valuable in training.
A quick recap of the data frames we have so far:
   
 1. Non-one-hot-encoded data
 * training_data = Full training dataset, to refer back to if a mistake occurs later on.
 * testing_data = Full testing dataset, to refer back to if a mistake occurs.
 2. One-hot-encoded data
 * merged_encoder = Both datasets merged and encoded.
 * trainingX_data = Training dataset explanatory variables (values we will use for training).
 * trainY_response = Training dataset response variable (values used to train the models).
 * testingX_data = Testing dataset explanatory variables (values we will use for predictions).
 3. Extra information
 * training_corr_influnce = Holds the correlation values between variables.
 * training_all = Holds the 1-hot-encoded explanatory variables with the response, used to calculate correlations.
 
More exploration analysis and graph analysis can be found in the external file EDA.ipynb.

###### [3] Feature Importance analysis (using a decision tree)

After the data wrangling and conversions applied above, the training set has become a very sparse matrix of 300K rows and 642 columns. This can cause issues in terms of time/computational complexity and can affect the overfit/underfit trade-off of our analysis. For this reason, we have chosen to apply feature importance analysis and continue only with features that contribute to predicting the response.

Feature importance code and idea was motivated/inspired by a [youtube](https://www.youtube.com/watch?v=NPdn3YPkg9w) tutorial.

In [30]:
%%time
# Fitting a single decision tree to get feature importance.
DT_feature_importance = DecisionTreeClassifier(criterion='entropy', random_state=5059)
DT_feature_importance.fit(trainingX_data, trainY_response)

CPU times: user 33.3 s, sys: 1.58 s, total: 34.9 s
Wall time: 36.1 s


DecisionTreeClassifier(criterion='entropy', random_state=5059)

In [52]:
# Finding the features that are important.
features = []
feature_score = []
# Appending all scores to lists to create a data frame.
for i, column in enumerate(trainingX_data):
    features.append(column)
    feature_score.append(DT_feature_importance.feature_importances_[i])
    
# Create a dataframe with these arrays.
feature_score_df = zip(features, feature_score)
feature_score_df = pd.DataFrame(feature_score_df, columns = ['Feature', 'Feature Score'])

# Sort the data frame according to feature score.
feature_score_df = feature_score_df.sort_values('Feature Score', ascending=False).reset_index()
# Removing all features that have 0 influence on the response.
feature_keeping = feature_score_df[feature_score_df['Feature Score'] > 0.0]
# Keeping the columns that contribute to the result.
feature_keeping = feature_keeping['Feature'].copy()
print("Number of features that have an impact on Y variables is {}".format(feature_keeping.shape))

Number of features that have an impact on Y variables is (434,)


From the above computations we can observe that over 200 features (642 - 434 = 208) did not contribute at all to the decision tree's prediction of the Y variable. For this reason we will use the above result to reduce the dimensions of the training set that will be used to construct the models.

**NOTE**:
Because we are changing the dimensions of the training data, we have to change the testing data in the same way. Our two datasets should follow the same structure so predictions can be obtained.

In [53]:
# Showing the initial dimension state.
print("The initial structure of the data was {}".format(trainingX_data.shape))
# Reducing the dimensions of the data.
trainingX_data = trainingX_data[feature_keeping]
testingX_data = testingX_data[feature_keeping]

# Showing the new dimension state of our data.
print("The new dimensions of our training data is {}".format(trainingX_data.shape))
print("The new dimensions of our testing data is {}".format(testingX_data.shape))


The initial structure of the data was (300000, 642)
The new dimensions of our training data is (300000, 434)
The new dimensions of our testing data is (200000, 434)
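
The importance-based selection above can be written more compactly with a `pd.Series` indexed by column name. The following is a sketch on synthetic data, not the report's own code: the columns `signal` and `constant` are hypothetical, chosen so that one feature is never used by the tree.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: "signal" determines the label, "constant" carries no information.
rng = np.random.default_rng(0)
X = pd.DataFrame({"signal": rng.normal(size=300), "constant": np.zeros(300)})
y = (X["signal"] > 0).astype(int)

# Fit a single tree and pair each importance score with its column name.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
importances = pd.Series(tree.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

# Keep only the columns the tree actually used (importance > 0).
kept = importances[importances > 0].index
X_reduced = X[kept]

print(importances)
print(X_reduced.shape)
```

The same `kept` index would then be applied to the testing frame so that both keep identical columns. Note that a zero importance only means this particular tree never split on the feature, which is a weaker statement than the feature being uninformative.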


######  [4] Fitting models to assess

We first create a function that converts the predicted array to the correct format, so we can obtain the performance metric (accuracy) results from Kaggle.

In [58]:
## Function
# Writes a CSV file in the correct format so we can obtain performance metrics from Kaggle.
# INPUT ARGUMENTS:
# predictions: the data frame that holds the predictions
# id_row: a data frame that holds the id column of the testing data
# model: string with the file name to save to.
# OUTPUT:
# no return value; it writes a CSV file on the local machine.
def convert_prediction_format(predictions, id_row, model):
    file_submit = pd.concat([id_row, predictions], axis=1)
    file_submit.to_csv(model, index = False, header=['id', 'target'])

*Logistic Regression*

In [54]:
# Constructing a logistic regression.
log_reg = LogisticRegression(solver="saga", max_iter= 1000, random_state=123)

In [56]:
%%time
# Fitting the logistic regression
log_reg.fit(trainingX_data, trainY_response)

CPU times: user 5min 47s, sys: 2.47 s, total: 5min 49s
Wall time: 5min 55s


LogisticRegression(max_iter=1000, random_state=123, solver='saga')

In [69]:
# Predicting values using the test X matrix
y_predicted = pd.DataFrame(log_reg.predict(testingX_data))
# Calling function to write the csv file
convert_prediction_format(y_predicted, testing_data[['id']], 'logistic.csv')

These initial predictions scored 76.09% accuracy on Kaggle. The low accuracy score could be a symptom of overfitting; this will have to be examined further.

In [72]:
# We will see how it performs with training data
train_predicted = log_reg.predict(trainingX_data)
# Calculating the accuracy
print("Accuracy score using the training data for prediction is ", 
      accuracy_score(trainY_response, train_predicted, normalize=True))

Accuracy score using the training data for prediction is  0.8447133333333333


From the above results this could be overfitting, since the training data is predicted with about 8 percentage points higher accuracy (84.47% vs 76.09%).
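
One way to examine a train/test gap locally, without submitting to Kaggle, is to hold out part of the labelled training data and compare accuracies computed in exactly the same way. The sketch below does this on synthetic data; the split size, seed, and data are assumptions for illustration, and the report itself did not use a hold-out split.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

# Hold out 20% of the labelled data as a local validation set.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)

# Accuracy on the data the model saw versus the data it did not see.
train_acc = accuracy_score(y_fit, model.predict(X_fit))
val_acc = accuracy_score(y_val, model.predict(X_val))
gap = train_acc - val_acc

print(train_acc, val_acc, gap)
```

A large positive `gap` on a hold-out set would support the overfitting explanation, while a small gap would suggest looking for another cause of the difference from the leaderboard score.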

*Random Forest*

In [89]:
# Instantiate model with 100 decision trees (wanted more, but computationally expensive in our case).
# oob_score must be True to compute the out-of-bag score.
Random_forest_model = RandomForestClassifier(n_estimators = 100, random_state = 42, oob_score=True)

In [90]:
%%time
Random_forest_model.fit(trainingX_data, trainY_response)

CPU times: user 3min 7s, sys: 9.69 s, total: 3min 16s
Wall time: 3min 23s


RandomForestClassifier(oob_score=True, random_state=42)

In [93]:
# Predicting values using the test X matrix
y_predicted = pd.DataFrame(Random_forest_model.predict(testingX_data))
# Calling function to write the csv file
convert_prediction_format(y_predicted, testing_data[['id']], 'random-forest.csv')

These initial predictions scored 76.11% accuracy. We will now look at the out-of-bag (OOB) score on the training data to see if we have a case of overfitting here.

In [92]:
# Calculating the accuracy on the training data using the OOB samples
print("Accuracy score using the training data for prediction is ", 
      Random_forest_model.oob_score_)

Accuracy score using the training data for prediction is  0.8459033333333333


The accuracy score on the testing data indicates we might be overfitting, since the OOB estimate from the training data is about 8 percentage points higher (84.59% vs 76.11%).
We now use a grid search over some hyperparameter values to try to optimize our random forest. A more complex model is not necessarily a better model. For this reason we test different values for the hyperparameters, since we might be overfitting.

**NOTE**
Because of the limited computing resources available to us, a full grid search is too expensive; for this reason we try only a small, arbitrary selection of hyperparameter values in the hope of improving accuracy.
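
When a full grid is too expensive, one common alternative is a randomized search, which samples a fixed number of combinations from the candidate values. The sketch below uses scikit-learn's `RandomizedSearchCV` on synthetic data; it is a different technique from the `GridSearchCV` call used in this report, and the candidate values, `n_iter`, and data are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic classification problem (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Candidate values to sample from (12 combinations in total).
param_distributions = {
    "max_depth": [3, 5, 10],
    "max_features": ["sqrt", "log2"],
    "n_estimators": [10, 20],
}

# Evaluate only n_iter=4 randomly chosen combinations with 3-fold CV,
# scored by accuracy to match the metric used elsewhere in the report.
search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,
    cv=3,
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

The cost is controlled directly by `n_iter` times the number of folds, so the budget can be fixed up front regardless of how many candidate values are listed.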

In [123]:
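# RandomForestRegressor is used below but was not imported in the first cell.
from sklearn.ensemble import RandomForestRegressor

# Create example hyperparameters to test.
# 'auto' is accepted by the scikit-learn version used here; recent releases no longer accept it.
parameter_examine = {
    'oob_score': [True],
    'max_depth': [5, 10],
    'max_features' : ['auto', 'sqrt'],
    'n_estimators': [100]
}
# Create a base model.
# Note: this is a regressor, so GridSearchCV ranks candidates by its default
# score (R^2) rather than accuracy; RandomForestClassifier() would be the
# consistent choice for this classification task.
base_forest = RandomForestRegressor()

# Instantiate the grid search model
grid_search = GridSearchCV(estimator = base_forest,
                           param_grid = parameter_examine, 
                           cv = 3,
                           n_jobs = -1, verbose = 2)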

In [124]:
%%time
# Fit the grid search model to see the best params.
grid_search.fit(trainingX_data, trainY_response)

Fitting 3 folds for each of 4 candidates, totalling 12 fits
CPU times: user 15min, sys: 12.1 s, total: 15min 13s
Wall time: 39min 39s


GridSearchCV(cv=3, estimator=RandomForestRegressor(), n_jobs=-1,
             param_grid={'max_depth': [5, 10], 'max_features': ['auto', 'sqrt'],
                         'n_estimators': [100], 'oob_score': [True]},
             verbose=2)

In [126]:
grid_search.best_params_

{'max_depth': 10,
 'max_features': 'auto',
 'n_estimators': 100,
 'oob_score': True}

We now fit a random forest with the best parameters found by the grid search.

In [128]:
# Fitting random forest with best cross-validation parameters
Random_forest_model = RandomForestClassifier(n_estimators = 100,
                                             random_state = 42,
                                             oob_score=True,
                                             max_depth = 10,
                                             max_features = 'auto')

In [129]:
%%time
Random_forest_model.fit(trainingX_data, trainY_response)

CPU times: user 1min 9s, sys: 8.08 s, total: 1min 17s
Wall time: 1min 35s


RandomForestClassifier(max_depth=10, oob_score=True, random_state=42)

In [130]:
# Predicting values using the test X matrix to check if accuracy has been improved.
y_predicted = pd.DataFrame(Random_forest_model.predict(testingX_data))
# Calling function to write the csv file
convert_prediction_format(y_predicted, testing_data[['id']], 'random-forest.csv')

Accuracy with new random forest parameters is: ....

In [131]:
# Calculating the accuracy on the training data using the OOB samples
print("Accuracy score using the training data for prediction is ", 
      Random_forest_model.oob_score_)

Accuracy score using the training data for prediction is  0.8402966666666667


*Bagging Classifier*

In [98]:
# Constructing a bagging classifier with decision tree
Bagging_model = BaggingClassifier(DecisionTreeClassifier(random_state=42), n_estimators=100,
    max_samples=100, bootstrap=True, random_state=42, oob_score=True)

In [100]:
%%time
# Fitting the bagging classifier.
Bagging_model.fit(trainingX_data, trainY_response)

CPU times: user 6min 11s, sys: 2min 17s, total: 8min 29s
Wall time: 8min 48s


BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=42),
                  max_samples=100, n_estimators=100, oob_score=True,
                  random_state=42)

In [101]:
# Predicting values using the test X matrix
y_predicted = pd.DataFrame(Bagging_model.predict(testingX_data))
# Calling function to write the csv file
convert_prediction_format(y_predicted, testing_data[['id']], 'bagging.csv')

In [102]:
# Calculating the accuracy on the training data using the OOB samples
print("Accuracy score using the training data for prediction is ", 
      Bagging_model.oob_score_)

Accuracy score using the training data for prediction is  0.8331966666666667


The accuracy score on the testing data again indicates we might be overfitting, with the training data scoring about 8 percentage points higher in terms of accuracy.

###### [5] Overfit/Underfit trade-off of best performing model

###### [6] Feature importance on the best performing model and reduce complexity (overfit)

###### [7] Tuning hyperparameters of best performing model

###### [8] Results of the best performing model