# Modeling and Classification

[Theory and Approach followed](#section1)
- [Overfitting and Underfitting](#section2)
- [Model evaluation criteria](#section3)

[First (Baseline) Model](#section4)
- [Import libraries](#section5)
- [Split data into train test](#section6)
- [Useful functions](#section7)
- [Run and evaluate](#section8)

[Dealing with class imbalance](#section9)
- [Over and Under sampling](#section10)
- [One vs Rest method after fixing class imbalance](#section11)
- [Using class weights](#section12)

[Other simple Models](#section13)

[Bagging Models](#section14)

[Boosting Models](#section14)

[Stacking Models](#section14)


<a id='section1'></a>
## Theory and Approach followed
Our goal is to develop a model that describes the relationship between the features in the data and the target class (product category). We now have a good understanding of the data, so we can apply machine learning algorithms and look for the best model. Technically, this is a supervised classification problem: the training data comes with the actual labels, and we need to fit a model to it and use that model to predict labels for new data.

We will start with a quick and dirty baseline model and improve on it iteratively. The following ideas are worth keeping in mind.

<a id='section2'></a>
### Overfitting and Underfitting
When the model fits the training data almost perfectly but does poorly on the test data, it is called overfitting. Such a model is not useful for predictions. At the other extreme, the model might fit the training data very poorly (underfitting). It then cannot fit the test data any better and gives inaccurate predictions. In general there is a trade-off between overfitting and underfitting, known as the **Bias-Variance trade-off**. A model that fits the training data (a subset of the population) very closely has low bias, that is, it mimics that data very well. But it may fail badly on the test data (another subset of the population); in other words, it has high variance across different subsets of the data. As we try to lower this variance, the bias increases.
    
In our case, the number of data points is very large (about 62 thousand) compared to the number of features (93). This helps in avoiding overfitting. We will always compare train and test accuracy to see how well the model fits. We should also design the train-test split of the data wisely to control the bias-variance trade-off.
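
To make the train versus test comparison concrete, here is a minimal sketch on made-up two-class data (not the product data). It uses a hand-written nearest-neighbour vote, where `make_data` and `knn_predict` are illustrative helpers written for this example. With `k=1` the model memorizes the training set, which is the overfitting pattern described above; a larger `k` gives up some training accuracy and narrows the gap.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n):
    """Two-class toy data: the label depends on x0 + x1, with 20% of labels flipped."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    flip = rng.random(n) < 0.2
    y[flip] = 1 - y[flip]
    return X, y

def knn_predict(X_ref, y_ref, X_new, k):
    """Majority vote among the k nearest reference points (Euclidean distance)."""
    dist = ((X_new[:, None, :] - X_ref[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    return (y_ref[nearest].mean(axis=1) > 0.5).astype(int)

X_tr, y_tr = make_data(1000)
X_te, y_te = make_data(1000)

results = {}
for k in (1, 25):
    train_acc = (knn_predict(X_tr, y_tr, X_tr, k) == y_tr).mean()
    test_acc = (knn_predict(X_tr, y_tr, X_te, k) == y_te).mean()
    results[k] = (train_acc, test_acc)
    print(f"k={k:>2}  train accuracy={train_acc:.3f}  test accuracy={test_acc:.3f}")
```

A large gap between the two accuracies is the signal to look for when comparing the models in this notebook.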
    
<a id='section3'></a>    
### Model Evaluation Criteria
How we evaluate the model is closely related to how the client is going to use it. In classification problems, one type of error (say, false negatives) might be more costly than another (false positives). In such a case the model should be more stringent about false negatives.
    
In our case we need to classify each product (described by 93 features) into the correct class, so both types of error carry equal weight. One important fact is that our training data has class imbalance: there is more data for some classes than for others. This is an important consideration when choosing the model evaluation criteria. We should try to handle the class imbalance and see whether that improves our models. The criteria we are going to use are:
- **Accuracy**: Important, but we can't rely on it alone. Because of the class imbalance, accuracy might be high even when predictions for the minority classes are wrong.

- **Confusion Matrix**: Tells us how many predictions we got wrong for each class, and how well the minority classes are doing compared to the majority classes.

- **Classification Report**: Measures precision, recall and F1 score (the harmonic mean of precision and recall) for each class. It also gives the average F1 score over all classes, which can serve as a single value to compare the models.

- **Log loss**: This is the evaluation criterion used by the client (Otto Group), so it is our main criterion to compare and improve the models. Each product has been labeled with one true category. For each product, we predict a set of probabilities (one for every category). The formula is then
  \begin{equation*}
        logloss=-\dfrac{1}{N} \sum_{i=1}^N \sum_{j=1}^M y_{ij}\log(p_{ij}),
  \end{equation*}
    where $N$ is the number of products in the test set, $M$ is the number of class labels, $\log$ is the natural logarithm, $y_{ij}$ is 1 if observation $i$ is in class $j$ and 0 otherwise, and $p_{ij}$ is the predicted probability that observation $i$ belongs to class $j$.
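
The log loss definition can be checked by hand with a few lines of NumPy. This is a minimal sketch on made-up probabilities, and `multiclass_log_loss` is an illustrative helper, not the client's scoring code. It assumes the true labels are given as integer indices starting at 0.

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true : integer class indices in 0..M-1, shape (N,)
    probs  : predicted probabilities, shape (N, M), rows summing to 1
    """
    probs = np.clip(probs, eps, 1 - eps)  # avoid log(0)
    # y_ij is 1 only for the true class, so the inner sum keeps a single term
    true_class_probs = probs[np.arange(len(y_true)), y_true]
    return -np.mean(np.log(true_class_probs))

y_true = np.array([0, 1, 2])
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.3, 0.5]])
example_loss = multiclass_log_loss(y_true, probs)

# reference point: predicting 1/9 for each of the nine categories
uniform_loss = multiclass_log_loss(np.array([0, 4, 8]), np.full((3, 9), 1 / 9))
print(example_loss, uniform_loss)
```

The uniform prediction gives a loss of ln(9), roughly 2.197, which is a useful reference: any model worth keeping should score well below it.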

<a id='section4'></a>
## Baseline Model

<a id='section5'></a>
### Import libraries

In [152]:
# Import all the necessary python libraries

# basic computing and handling tabular data
import numpy as np
import pandas as pd
from collections import Counter
from random import sample

# visualizations
import matplotlib.pyplot as plt
import seaborn as sns

#statistics
from scipy.stats import normaltest,skewtest,kurtosistest, spearmanr

# machine learning 
from sklearn.model_selection import train_test_split,GridSearchCV,StratifiedShuffleSplit,RandomizedSearchCV,cross_validate
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import log_loss,classification_report, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

# ensemble bagging boosting
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import xgboost as xgb
from xgboost.sklearn import XGBClassifier

# saving to disk
# sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone package
import joblib

# over sampling for imbalanced classes
from imblearn.over_sampling import RandomOverSampler,SMOTE, ADASYN
from imblearn.under_sampling import ClusterCentroids, RandomUnderSampler, NearMiss, EditedNearestNeighbours, RepeatedEditedNearestNeighbours,InstanceHardnessThreshold
from imblearn.combine import SMOTEENN, SMOTETomek

# calculate weights for imbalanced classes
from sklearn.utils import class_weight

In [75]:
# in order to reproduce the results we can use a fixed random state
# set global random seed
random_seed = 42
np.random.seed(random_seed)
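
One caveat on reproducibility: `np.random.seed` only seeds NumPy's global generator, while functions imported from Python's `random` module (such as `sample`) keep their own state. A minimal sketch that seeds both is shown here; `seed_everything` is a hypothetical helper name, not part of either library.

```python
import random
import numpy as np

random_seed = 42

def seed_everything(seed):
    """Seed both NumPy's global generator and Python's `random` module."""
    np.random.seed(seed)
    random.seed(seed)

seed_everything(random_seed)
first_draw = (random.sample(range(100), 5), np.random.choice(100, 5).tolist())

seed_everything(random_seed)
second_draw = (random.sample(range(100), 5), np.random.choice(100, 5).tolist())

print(first_draw == second_draw)
```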

In [10]:
# use normalized data for all models
# replacing the original features with PCA components does not seem like a good idea here
# for logistic regression it is better to use normalized data

df = pd.read_csv("../data/train_norm.csv")
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,84,85,86,87,88,89,90,91,92,target
0,0.402093,-0.210106,-0.307165,-0.279443,-0.161867,-0.119331,-0.188045,-0.293664,-0.291038,-0.243606,...,0.2461,-0.42087,-0.249802,-0.413584,-0.299712,-0.176699,-0.129516,-0.386938,-0.104963,1
1,-0.253508,-0.210106,-0.307165,-0.279443,-0.161867,-0.119331,-0.188045,0.149647,-0.291038,-0.243606,...,-0.280099,-0.42087,-0.249802,-0.413584,-0.299712,-0.176699,-0.129516,-0.386938,-0.104963,1
2,-0.253508,-0.210106,-0.307165,-0.279443,-0.161867,-0.119331,-0.188045,0.149647,-0.291038,-0.243606,...,-0.280099,-0.42087,-0.249802,-0.413584,-0.299712,-0.176699,-0.129516,-0.386938,-0.104963,1
3,0.402093,-0.210106,-0.307165,0.07924,13.50871,4.524667,4.665884,-0.293664,-0.291038,0.679472,...,-0.280099,-0.047949,1.019683,-0.413584,-0.299712,-0.176699,-0.129516,-0.386938,-0.104963,1
4,-0.253508,-0.210106,-0.307165,-0.279443,-0.161867,-0.119331,-0.188045,-0.293664,-0.291038,-0.243606,...,0.2461,-0.42087,-0.249802,-0.413584,-0.299712,0.040798,-0.129516,-0.386938,-0.104963,1


<a id='section6'></a>
### Split the data into train test

In [144]:
# the classes are imbalanced, so we use a stratified split:
# the split preserves the percentage of samples of each class
# note that the imbalance will still be there when we train the model using this split

# since we have a large amount of data, the test set can be just 10% of all data
# keeping 90% for training helps in avoiding overfitting
sss = StratifiedShuffleSplit(n_splits=1,test_size=0.1, random_state=42)
print(sss)

StratifiedShuffleSplit(n_splits=1, random_state=42, test_size=0.1,
            train_size=None)


In [145]:
# drop target column for features data X
X = df.drop("target",axis=1)
y = df.target

train_index = []
test_index = []

for tr, tes in sss.split(X,y):
    print("TRAIN:", tr, "TEST:", tes)
    train_index = tr
    test_index = tes

X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y.iloc[train_index], y.iloc[test_index]


TRAIN: [57972 30244  9427 ... 60232 28576 27516] TEST: [59081 21681 51999 ...  1777   269 53901]


In [146]:
print("Shapes of data sets")
print("X_train: ", X_train.shape, "y_train: ", y_train.shape)
print("X_test: ", X_test.shape,"y_test: ", y_test.shape)

Shapes of data sets
X_train:  (55690, 93) y_train:  (55690,)
X_test:  (6188, 93) y_test:  (6188,)
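
To see what "stratified" buys us, the following sketch runs the same splitter on made-up labels with a strong imbalance and compares the class shares before and after. The arrays `X_demo` and `y_demo` and the helper `class_shares` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# made-up imbalanced labels: 900 / 90 / 10 samples for three classes
y_demo = np.repeat([1, 2, 3], [900, 90, 10])
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

sss_demo = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
# with n_splits=1 the generator yields a single (train, test) pair
train_idx, test_idx = next(sss_demo.split(X_demo, y_demo))

def class_shares(labels):
    """Fraction of samples per class, in sorted label order."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

full_shares = class_shares(y_demo)
train_shares = class_shares(y_demo[train_idx])
test_shares = class_shares(y_demo[test_idx])
print("full: ", full_shares)
print("train:", train_shares)
print("test: ", test_shares)
```

The shares stay essentially the same in both parts, so even the rarest class is represented in the test set.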


<a id='section7'></a>
### Useful functions

In [118]:
def evaluate_model(fitted_model):
    """
    Evaluate the given model.

    Evaluate model using accuracy, confusion matrix, classification report and logloss for given model and test data

    Parameters
    ----------
    fitted_model : dict
        Dictionary with all the info related to the model 
    
    Returns
    -------
    dict
        Returns the fitted_model with added key values

    """
    
    model = fitted_model.get("model")
    train_x = fitted_model.get("train_x")
    train_y = fitted_model.get("train_y")
    
    test_x = fitted_model.get("test_x")
    test_y = fitted_model.get("test_y")
    
    train_score = model.score(train_x,train_y).round(5)
    test_score = model.score(test_x,test_y).round(5)
    log_loss_train = log_loss(train_y, model.predict_proba(train_x)).round(5)
    log_loss_test = log_loss(test_y, model.predict_proba(test_x)).round(5)
    conf_matrix = confusion_matrix(test_y, model.predict(test_x))
    classifi_report = classification_report(test_y, model.predict(test_x))
    
    print("Train score: ",train_score)
    print("Test score: ",test_score)
    print("Log loss train: ",log_loss_train)
    print("Log loss test: ",log_loss_test)
    print("\nConfusion Matrix: \n", conf_matrix)
    print("\nClassification Report: \n", classifi_report)
    
    # update the model with all the evaluations
    fitted_model.update({"train_score":train_score})
    fitted_model.update({"test_score":test_score})
    fitted_model.update({"log_loss_train":log_loss_train})
    fitted_model.update({"log_loss_test":log_loss_test})
    fitted_model.update({"confusion_matrix":conf_matrix})
    fitted_model.update({"classification_report":classifi_report})    
    
    return fitted_model

In [120]:
def fit_evaluate(model,main_params, hyper_params_grid, train_x, train_y, test_x, test_y, scoring,name="dummy"):
    """
    Fit the ML model

    Fit the model using the params and hyper params
    
    Parameters
    ----------
    model : machine learning model
    main_params: main parameters like learning rate etc. of the model
    hyper_params_grid: hyper parameters grid to be used by grid search to find the params that best fit the model
    scoring: scoring criterion to be used by grid search to find the best hyper params
    name: to save the model to the disk
    
    Returns
    -------
    dict
        Returns the fitted_model with all the information about model, train-test data, best hyper param etc.

    """
    
    # default to log loss only when no scoring criterion is passed
    if scoring is None:
        scoring = "neg_log_loss"
    
    # basic parameters
    model.set_params(**main_params)
    
    # fit hyper params if provided
    if(len(hyper_params_grid) >0):
        
        # grid search to find the best hyper param for given scoring
        gsv = GridSearchCV(estimator=model,cv=3,param_grid=hyper_params_grid,scoring=scoring)
        gsv.fit(train_x,train_y)
                
        best_params = gsv.best_params_
        main_params.update(best_params)        
        print("Best params", best_params)
        
        # choose the best estimator as the model
        model = gsv.best_estimator_
        
    else:
        # fit with fixed hyper params
        model.fit(train_x,train_y)
        
    print("\nAll params: \n", main_params)
    print("\nClassifier model: \n", model)
        
    # dict to save detailed info of the model    
    fitted_model = {}
    fitted_model.update({"model":model})
    fitted_model.update({"train_x":train_x})
    fitted_model.update({"train_y":train_y})
    fitted_model.update({"test_x":test_x})
    fitted_model.update({"test_y":test_y})
    
    fitted_model = evaluate_model(fitted_model)
    
    # save to the disk
    joblib.dump(fitted_model,filename="../models-repo/"+name+".sav")
    
    return fitted_model

<a id='section8'></a>
### Run and evaluate

In [122]:
# Base line model - Logistic Regression
# Try out different params and hyper params and save the results

# last two not done yet
main_params_list = {"l1_liblinear": {"cv":3,"random_state":42,"penalty":"l1","solver":"liblinear"},
                    "l1_saga": {"cv":3,"random_state":42,"penalty":"l1","solver":"saga"},
                    "l1_saga_multinomial": {"cv":3,"random_state":42,"penalty":"l1","solver":"saga","multi_class":"multinomial"},
                    "l2_lbfgs": {"cv":3,"random_state":42,"penalty":"l2","solver":"lbfgs"},
                    "l2_lbfgs_multinomial": {"cv":3,"random_state":42,"penalty":"l2","solver":"lbfgs","multi_class":"multinomial"},
                    #"l2_sag": {"cv":3,"random_state":42,"penalty":"l2","solver":"sag"},
                    #"l2_sag_multinomial": {"cv":3,"random_state":42,"penalty":"l2","solver":"sag","multi_class":"multinomial"}                    
                   }

hyper_params_grid = {"Cs":[1,10,100]}

lrcv = LogisticRegressionCV()

# temporarily commented out
# for name, main_params in main_params_list.items():
#     fit_evaluate(lrcv, main_params, hyper_params_grid, \
#                  X_train, y_train,X_test, y_test,"neg_log_loss", name)


In [None]:
# Base line model - Logistic Regression
# Try out different params and hyper params and save the results

# last two not done yet
main_params_list = {"l1_liblinear": {"penalty":"l1","solver":"liblinear"},
                    "l1_saga": {"penalty":"l1","solver":"saga"},
                    "l1_saga_multinomial": {"penalty":"l1","solver":"saga","multi_class":"multinomial"},
                    "l2_lbfgs": {"penalty":"l2","solver":"lbfgs","max_iter":200},
                    #"l2_sag": {"penalty":"l2","solver":"sag","max_iter":1000},
                    "l2_lbfgs_multinomial": {"penalty":"l2","solver":"lbfgs","multi_class":"multinomial","max_iter":1000},
                    #"l2_sag_multinomial": {"penalty":"l2","solver":"sag","multi_class":"multinomial","max_iter":4000}                    
                   }

hyper_params_grid = {"C":[10,100]}

lr = LogisticRegression()

# temporarily commented out
# for name, main_params in main_params_list.items():
#     fit_evaluate(lr, main_params, hyper_params_grid, \
#                  X_train, y_train,X_test, y_test,"neg_log_loss", name)

In [240]:
# Load the saved models
model_names = ['l1_liblinear','l1_saga','l1_saga_multinomial','l2_lbfgs','l2_lbfgs_multinomial']
models = [joblib.load("../models-repo/" + name + ".sav") for name in model_names]

In [241]:
# check the keys of the model objects
models[0].keys()

dict_keys(['model', 'train_x', 'train_y', 'test_x', 'test_y', 'train_score', 'test_score', 'log_loss_train', 'log_loss_test', 'confusion_matrix', 'classification_report'])

In [239]:
# extract important params from all params
def get_imp_params(params):    
    param_str = "C:"+str(params.get("Cs"))+", " \
            + params.get("solver") + ", " + params.get("penalty") + ", " + params.get("multi_class")
    return param_str

In [249]:
# print details and evaluations of models to compare
def print_models(models):
    for model in models:    
        print("Model: \n",model.get("model"))
        print("\nConfusion Matrix: \n",model.get("confusion_matrix"))
        print("\nClassification Report: \n",model.get("classification_report"))
        model.update({"params":get_imp_params(model.get("model").get_params())})
print_models(models)    

Model: 
 LogisticRegressionCV(Cs=10, class_weight=None, cv=3, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l1', random_state=42,
           refit=True, scoring=None, solver='liblinear', tol=0.0001,
           verbose=0)

Confusion Matrix: 
 [[  56   27    0    0    0   21    2   41   46]
 [   0 1440  138    5    6    6   10    5    2]
 [   0  566  217    2    0    1   10    3    1]
 [   0  169   29   44    4   19    4    0    0]
 [   0   13    2    0  259    0    0    0    0]
 [   3   34    1    3    1 1302   23   28   19]
 [   2   54   20    1    1   30  159   14    3]
 [  11   17    1    0    0   26    9  773    9]
 [   6   25    0    1    0   15    1   19  429]]

Classification Report: 
              precision    recall  f1-score   support

          1       0.72      0.29      0.41       193
          2       0.61      0.89      0.73      1612
          3       0.53      0.27      0.36       800
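
As a cross-check, the per-class figures in the report can be recomputed directly from the confusion matrix printed above. This is a small sketch in which the matrix is copied by hand from that output; rows are true classes and columns are predicted classes.

```python
import numpy as np

# confusion matrix of the first baseline model (rows: true class, columns: predicted class)
conf = np.array([
    [56,   27,   0,  0,   0,   21,   2,  41,  46],
    [0,  1440, 138,  5,   6,    6,  10,   5,   2],
    [0,   566, 217,  2,   0,    1,  10,   3,   1],
    [0,   169,  29, 44,   4,   19,   4,   0,   0],
    [0,    13,   2,  0, 259,    0,   0,   0,   0],
    [3,    34,   1,  3,   1, 1302,  23,  28,  19],
    [2,    54,  20,  1,   1,   30, 159,  14,   3],
    [11,   17,   1,  0,   0,   26,   9, 773,   9],
    [6,    25,   0,  1,   0,   15,   1,  19, 429],
])

true_pos = np.diag(conf)
precision = true_pos / conf.sum(axis=0)  # column sums: everything predicted as the class
recall = true_pos / conf.sum(axis=1)     # row sums: everything that truly is the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / conf.sum()

for label, (p, r, f) in enumerate(zip(precision, recall, f1), start=1):
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.5f}")
```

Class 1 shows why accuracy alone is not enough here: precision is reasonable, but recall is low because most class 1 products are assigned to other classes.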
  

In [254]:
def get_model_evaluations(models):
    
    # compare key metrics like log_loss_test etc.
    models_df = pd.DataFrame.from_dict(models)
    eval_df = pd.DataFrame(columns=["train_score","test_score","log_loss_train",\
                                    "log_loss_test","params(C,solver,penalty,multi_class)"])
    eval_df["log_loss_train"] = models_df.log_loss_train
    eval_df["log_loss_test"] = models_df.log_loss_test
    eval_df["train_score"] = models_df.train_score
    eval_df["test_score"] = models_df.test_score
    eval_df["params(C,solver,penalty,multi_class)"] = models_df.params
    
    return eval_df


print(get_model_evaluations(models))

   train_score  test_score  log_loss_train  log_loss_test  \
0      0.75737     0.75614         0.66104        0.66987   
1      0.75708     0.75549         0.66273        0.67156   
2      0.76719     0.76471         0.62487        0.63743   
3      0.75730     0.75501         0.66162        0.67034   
4      0.76786     0.76568         0.62094        0.63614   

  params(C,solver,penalty,multi_class)  
0             C:10, liblinear, l1, ovr  
1                  C:10, saga, l1, ovr  
2          C:10, saga, l1, multinomial  
3                 C:10, lbfgs, l2, ovr  
4        C:100, lbfgs, l2, multinomial  


<a id='section9'></a>
## Dealing with class imbalance

<a id='section10'></a>
### Over and Under sampling

In [245]:
def balance_classes(sm, train_x, train_y, name="dummy"):
    """
    Balance the classes using the given technique
    
    Parameters
    ----------
    sm : class balancing model
    name: to save the balanced data to the disk
        
    """
    
    print("original data shape(X,y): ", train_x.shape, train_y.shape)
    # fit_sample was renamed to fit_resample in newer imbalanced-learn releases
    X_os,y_os = sm.fit_resample(train_x,train_y)
    print("balanced data shape(X,y): ", X_os.shape, y_os.shape)
    
    # uncomment if you want to save the balanced data separately
    # joblib.dump([X_os,y_os],filename="" + name + ".data")
    
    return X_os,y_os


def balance_and_fit_lr(sm,train_x, train_y, sm_name="dummy",name="dummy"):
    """
    Fit the Logistic Regression model after balancing the classes
    
    Parameters
    ----------
    sm : class balancing model
    sm_name: to save the balanced data to the disk
    name: to save the model to the disk
    
    Returns
    -------
    dict
        The fitted_model returned by fit_evaluate

    """
    
    # first time
    #X_os, y_os = balance_classes(sm,train_x,train_y,name=sm_name)

    # all other times
    X_os, y_os = joblib.load("../models-repo/" + sm_name + ".data")
    
    # Fit Logistic model with balanced data
    lr = LogisticRegression()
    return fit_evaluate(lr, {"C":10},{},X_os,y_os,X_test,y_test,"neg_log_loss", name)


In [248]:
# Load the saved models
balanced_names = ['lr_ros','lr_adasyn']
balanced_models = [joblib.load("../models-repo/" + name + ".sav") for name in balanced_names]



In [250]:
print_models(balanced_models)

Model: 
 LogisticRegression(C=3, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Confusion Matrix: 
 [[ 139    2    2    2    2    3    7   14   22]
 [  23  912  328  292   14    3   32    3    5]
 [   2  203  373  175    3    1   40    0    3]
 [   0   38   28  181    5    7   10    0    0]
 [   1    3    2    0  268    0    0    0    0]
 [  57    7    4   22    2 1215   46   22   39]
 [  19    7   16   17    1    3  214    4    3]
 [  82    3    4    0    0   12   19  713   13]
 [  78    5    1    5    1   10    3   11  382]]

Classification Report: 
              precision    recall  f1-score   support

          1       0.35      0.72      0.47       193
          2       0.77      0.57      0.65      1612
          3       0.49      0.47      0.48       800
          4       0.26      0.67  

In [258]:
print(get_model_evaluations(balanced_models).drop("params(C,solver,penalty,multi_class)",axis=1))

   train_score  test_score  log_loss_train  log_loss_test
0      0.73603     0.71057         0.74830        0.79605
1      0.75970     0.75663         0.65856        0.67975


<a id='section11'></a>
### One vs Rest method after fixing class imbalance

Steps to follow:
1. Set the targets for binary classification
2. Balance the classes
3. Fit a model for each class vs all other classes
4. For new data, predict the class whose classifier gives the highest probability
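
The steps can be sketched end to end on synthetic data. This is only an illustration under assumptions: three made-up classes from `make_classification` stand in for the nine product categories, and simple random undersampling done with NumPy stands in for the imbalanced-learn sampler.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic imbalanced 3-class data standing in for the real features
X_demo, y_demo = make_classification(n_samples=600, n_features=10, n_informative=5,
                                     n_classes=3, weights=[0.7, 0.2, 0.1], random_state=0)
rng = np.random.default_rng(0)
classes = np.unique(y_demo)

binary_models = []
for cls in classes:
    # step 1: binary target, 1 for the current class and 0 for the rest
    y_bin = (y_demo == cls).astype(int)
    # step 2: balance by undersampling the larger of the two groups
    pos, neg = np.flatnonzero(y_bin == 1), np.flatnonzero(y_bin == 0)
    n_keep = min(len(pos), len(neg))
    keep = np.concatenate([rng.choice(pos, n_keep, replace=False),
                           rng.choice(neg, n_keep, replace=False)])
    # step 3: one binary classifier per class
    binary_models.append(LogisticRegression(max_iter=1000).fit(X_demo[keep], y_bin[keep]))

# step 4: collect P(class) from every classifier, rescale rows to sum to 1, take the argmax
scores = np.column_stack([m.predict_proba(X_demo)[:, 1] for m in binary_models])
probs = scores / scores.sum(axis=1, keepdims=True)
preds = classes[np.argmax(probs, axis=1)]
print(probs[:3], preds[:3])
```

Because each classifier is trained independently, the raw scores do not form a probability distribution, which is why the rows are rescaled before computing a log loss.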

In [259]:
X_train.shape

(55690, 93)

In [266]:
# verify the class imbalances
Counter(y_train)

Counter({1: 1736,
         2: 14510,
         3: 7204,
         4: 2422,
         5: 2465,
         6: 12721,
         7: 2555,
         8: 7618,
         9: 4459})

In [269]:
# make labels for one vs rest
def make_ovr_label(y, positive_class):
    """
    Convert multi classes into binary classes
    
    Parameters
    ----------
    y : class distribution in target variable
    positive_class: positive class; all the other classes are treated as the negative class
    
    Returns
    -------
    Series:
        Pandas series for target having binary classes

    """
    print("old y distribution: ", Counter(y))
    y_new = y.copy()
    y_new[y_new!=positive_class] = 0
    y_new[y_new==positive_class] = 1
    print("new y distribution: ", Counter(y_new))
    return y_new

In [270]:
# turn all train and test into binary classes
y_trains = [make_ovr_label(y_train, positive_class) for positive_class in np.arange(1,10)]
y_tests = [make_ovr_label(y_test, positive_class) for positive_class in np.arange(1,10)]

old y distribution:  Counter({2: 14510, 6: 12721, 8: 7618, 3: 7204, 9: 4459, 7: 2555, 5: 2465, 4: 2422, 1: 1736})
new y distribution:  Counter({0: 53954, 1: 1736})
old y distribution:  Counter({2: 14510, 6: 12721, 8: 7618, 3: 7204, 9: 4459, 7: 2555, 5: 2465, 4: 2422, 1: 1736})
new y distribution:  Counter({0: 41180, 1: 14510})
old y distribution:  Counter({2: 14510, 6: 12721, 8: 7618, 3: 7204, 9: 4459, 7: 2555, 5: 2465, 4: 2422, 1: 1736})
new y distribution:  Counter({0: 48486, 1: 7204})
old y distribution:  Counter({2: 14510, 6: 12721, 8: 7618, 3: 7204, 9: 4459, 7: 2555, 5: 2465, 4: 2422, 1: 1736})
new y distribution:  Counter({0: 53268, 1: 2422})
old y distribution:  Counter({2: 14510, 6: 12721, 8: 7618, 3: 7204, 9: 4459, 7: 2555, 5: 2465, 4: 2422, 1: 1736})
new y distribution:  Counter({0: 53225, 1: 2465})
old y distribution:  Counter({2: 14510, 6: 12721, 8: 7618, 3: 7204, 9: 4459, 7: 2555, 5: 2465, 4: 2422, 1: 1736})
new y distribution:  Counter({0: 42969, 1: 12721})
old y distribu

In [21]:
# Balance the binary classes
rus = RandomUnderSampler(random_state=42)
X_oss = []
y_oss =[]
for i in np.arange(0,9):
    # fit_sample was renamed to fit_resample in newer imbalanced-learn releases
    X_os, y_os = rus.fit_resample(X_train,y_trains[i])
    X_oss.append(X_os)
    y_oss.append(y_os)

In [272]:
lrs = [LogisticRegressionCV(cv=5) for i in np.arange(1,10)] 

In [23]:
# fit all the binary classifiers, each on its own balanced data
[lrs[i].fit(X_oss[i],y_oss[i]) for i in np.arange(0,9)]

[LogisticRegressionCV(Cs=10, class_weight=None, cv=5, dual=False,
            fit_intercept=True, intercept_scaling=1.0, max_iter=100,
            multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
            refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0),
 LogisticRegressionCV(Cs=10, class_weight=None, cv=5, dual=False,
            fit_intercept=True, intercept_scaling=1.0, max_iter=100,
            multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
            refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0),
 LogisticRegressionCV(Cs=10, class_weight=None, cv=5, dual=False,
            fit_intercept=True, intercept_scaling=1.0, max_iter=100,
            multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
            refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0),
 LogisticRegressionCV(Cs=10, class_weight=None, cv=5, dual=False,
            fit_intercept=True, intercept_scaling=1.0, max_iter=100,
    

In [None]:
# calculate predictions and probabilities using all the binary classifiers
# column i holds the positive-class probability from the classifier for class i+1
pos_probs = np.column_stack([lrs[i].predict_proba(X_test)[:, 1] for i in np.arange(0,9)])
y_preds = np.argmax(pos_probs, axis=1) + 1
# the classifiers are independent, so rescale each row to make the probabilities sum to 1
y_pred_probs = pos_probs / pos_probs.sum(axis=1, keepdims=True)

In [25]:
print("Log loss test: ",log_loss(y_test, y_pred_probs))
print("\nConfusion Matrix: \n", confusion_matrix(y_test, y_preds))
print("\nClassification Report: \n", classification_report(y_test, y_preds))

Log loss test:  0.949572140707

Confusion Matrix: 
 [[ 105    4    0    0    1    5   10   27   41]
 [  16 1120  247  171   21    2   24    5    6]
 [   2  330  277  133    4    1   46    3    4]
 [   0   76   32  140    4   10    7    0    0]
 [   1    4    3    0  266    0    0    0    0]
 [  33   12    4   14    1 1259   34   29   28]
 [  21   14   10   14    2   11  201   10    1]
 [  38    4    4    1    0   14   15  753   17]
 [  41    6    0    2    0   12    4   14  417]]

Classification Report: 
              precision    recall  f1-score   support

          1       0.41      0.54      0.47       193
          2       0.71      0.69      0.70      1612
          3       0.48      0.35      0.40       800
          4       0.29      0.52      0.38       269
          5       0.89      0.97      0.93       274
          6       0.96      0.89      0.92      1414
          7       0.59      0.71      0.64       284
          8       0.90      0.89      0.89       846
          9

<a id='section12'></a>
### Balance using class weights

In [273]:
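# 'balanced' weights are inversely proportional to the class frequencies
# classes and y are passed as keywords, as newer scikit-learn versions require
classes = np.unique(y_train)
cw = class_weight.compute_class_weight('balanced', classes=classes, y=y_train)

# map each class label to its weight
cw_dict = dict(zip(classes, cw))
cw_dict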

{1: 3.5643881208397339,
 2: 0.42644919212803428,
 3: 0.85893639336171268,
 4: 2.5548215432608496,
 5: 2.5102546765832772,
 6: 0.48642227637589636,
 7: 2.4218308327897371,
 8: 0.81225751874216034,
 9: 1.3877052652562856}
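
The `'balanced'` weights follow a simple rule, `n_samples / (n_classes * n_samples_in_class)`, so they can be reproduced by hand from the training-set class counts shown earlier. A minimal sketch, with the counts copied from the `Counter(y_train)` output:

```python
# training-set class counts, copied from the Counter(y_train) output
class_counts = {1: 1736, 2: 14510, 3: 7204, 4: 2422, 5: 2465,
                6: 12721, 7: 2555, 8: 7618, 9: 4459}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# 'balanced' heuristic: n_samples / (n_classes * n_samples_in_class)
balanced_weights = {label: n_samples / (n_classes * count)
                    for label, count in class_counts.items()}

for label, weight in balanced_weights.items():
    print(label, round(weight, 4))
```

Rare classes such as class 1 get a weight above 3, while the two largest classes get weights below 0.5, so errors on rare classes cost more during training.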

In [491]:
# weighted Logistic Regression
lr_weighted = LogisticRegression(class_weight=cw_dict)

In [492]:
lr_weighted.fit(X_train,y_train)

LogisticRegression(C=1.0,
          class_weight={1: 3.5643881208397339, 2: 0.42644919212803428, 3: 0.85893639336171268, 4: 2.5548215432608496, 5: 2.5102546765832772, 6: 0.48642227637589636, 7: 2.4218308327897371, 8: 0.81225751874216034, 9: 1.3877052652562856},
          dual=False, fit_intercept=True, intercept_scaling=1,
          max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2',
          random_state=None, solver='liblinear', tol=0.0001, verbose=0,
          warm_start=False)

In [495]:
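# evaluate_model expects the dict form used throughout this notebook, not a bare estimator
weighted_model = evaluate_model({"model":lr_weighted,"train_x":X_train,"train_y":y_train,
                                 "test_x":X_test,"test_y":y_test})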

Train score:  0.75419
Test score:  0.75323
Log loss train:  0.72042
Log loss test:  0.73284

Confusion Matrix: 
 [[ 121    4    0    1    1    5    7   22   32]
 [  19 1214  242   86   11    4   24    6    6]
 [   2  381  311   64    2    1   36    2    1]
 [   0   93   33  120    5    8    9    0    1]
 [   1    4    2    0  267    0    0    0    0]
 [  38   13    3    9    2 1260   34   27   28]
 [  21   15   16   10    1    7  205    7    2]
 [  45    5    4    0    0   15   14  754    9]
 [  49    6    0    2    0   13    4   13  409]]

Classification Report: 
              precision    recall  f1-score   support

          1       0.41      0.63      0.49       193
          2       0.70      0.75      0.73      1612
          3       0.51      0.39      0.44       800
          4       0.41      0.45      0.43       269
          5       0.92      0.97      0.95       274
          6       0.96      0.89      0.92      1414
          7       0.62      0.72      0.66       284
   

<a id='section13'></a>
## Other simple models

### K-NN

In [76]:
# we should do cross validation twice (nested): an inner loop to find the best param and an outer loop to evaluate the model
knn = KNeighborsClassifier(weights='distance')
selected= np.random.choice(df.index,1000)
X = df.drop("target",axis=1).iloc[selected]
y = df.target.iloc[selected]

X.shape, y.shape

((1000, 93), (1000,))

In [128]:
params = {"n_neighbors":[25,35,40,45]}
#gcv = RandomizedSearchCV(knn,cv=3,n_jobs=-1,param_distributions=params,scoring='neg_log_loss',verbose=True)
gcv = GridSearchCV(knn,cv=3,n_jobs=-1,param_grid=params,scoring='neg_log_loss',verbose=True)
gcv.fit(X,y)

Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=-1)]: Done  10 out of  12 | elapsed:    0.3s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    0.3s finished


GridSearchCV(cv=3, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='distance'),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'n_neighbors': [25, 35, 40, 45]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='neg_log_loss', verbose=True)

In [142]:
scores = cross_validate(estimator=gcv,X=X,y=y,cv=3,n_jobs=-1,scoring='neg_log_loss',verbose=True)

Fitting 3 folds for each of 4 candidates, totalling 12 fits
Fitting 3 folds for each of 4 candidates, totalling 12 fits
Fitting 3 folds for each of 4 candidates, totalling 12 fits
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    0.6s finished
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    0.6s finished
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    0.6s finished
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:    0.7s finished


In [143]:
scores["test_score"].mean()

-1.3968954746308562

In [137]:
scores2 = cross_validate(estimator=gcv.best_estimator_,X=X,y=y,cv=3,n_jobs=-1,scoring='neg_log_loss',verbose=True)

[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:    0.2s finished


In [138]:
scores2["test_score"].mean()

-1.3747380524337995

In [139]:
gcv.best_estimator_

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=40, p=2,
           weights='distance')

In [140]:
gcv.cv_results_

{'mean_fit_time': array([0.00733932, 0.00911347, 0.00971063, 0.00501164]),
 'mean_score_time': array([0.0470341 , 0.05398655, 0.06497502, 0.04385034]),
 'mean_test_score': array([-1.89810491, -1.50121155, -1.37520825, -1.37683696]),
 'mean_train_score': array([-8.10462808e-15, -8.10462808e-15, -8.10462808e-15, -8.10462808e-15]),
 'param_n_neighbors': masked_array(data=[25, 35, 40, 45],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'n_neighbors': 25},
  {'n_neighbors': 35},
  {'n_neighbors': 40},
  {'n_neighbors': 45}],
 'rank_test_score': array([4, 3, 1, 2], dtype=int32),
 'split0_test_score': array([-2.04272732, -1.65457106, -1.49839293, -1.53485505]),
 'split0_train_score': array([-8.10462808e-15, -8.10462808e-15, -8.10462808e-15, -8.10462808e-15]),
 'split1_test_score': array([-1.67279368, -1.46051687, -1.30070121, -1.33071135]),
 'split1_train_score': array([-8.10462808e-15, -8.10462808e-15, -8.10462808e-15, -8.104

In [141]:
gcv.cv_results_["mean_test_score"].mean()

-1.5378404185091514

In [147]:
knn = KNeighborsClassifier(weights='distance',n_neighbors=40,n_jobs=-1)
knn.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=-1, n_neighbors=40, p=2,
           weights='distance')

In [148]:
log_loss(y_pred=knn.predict_proba(X_test),y_true=y_test)

0.7910542653441273

In [149]:
scores = cross_validate(estimator=knn,X=df.drop("target",axis=1),y=df.target,cv=3,n_jobs=-1,scoring='neg_log_loss',verbose=True)

[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:  3.7min finished


In [150]:
scores

{'fit_time': array([3.79166007, 3.17578793, 5.36287308]),
 'score_time': array([70.39256716, 73.21098399, 69.21527386]),
 'test_score': array([-0.87746544, -0.85461572, -0.82274036]),
 'train_score': array([-8.10462808e-15, -8.10462808e-15, -8.10462808e-15])}
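
The training log loss of essentially zero is an artifact of `weights='distance'` rather than a sign of a perfect model: each training point is its own nearest neighbour at distance 0 and therefore takes all the weight. The sketch below imitates that behaviour in plain NumPy on random data. `distance_weighted_proba` is an illustrative helper that mirrors, to my understanding, how exact matches are treated; it is not scikit-learn's code.

```python
import numpy as np

def distance_weighted_proba(X_ref, y_ref, X_new, k, n_classes):
    """Class probabilities from the k nearest neighbours, weighted by 1/distance.

    A neighbour at distance 0 takes all the weight.
    """
    dist = np.sqrt(((X_new[:, None, :] - X_ref[None, :, :]) ** 2).sum(axis=2))
    nearest = np.argsort(dist, axis=1)[:, :k]
    proba = np.zeros((len(X_new), n_classes))
    for row, idx in enumerate(nearest):
        d = dist[row, idx]
        # exact matches get weight 1 and everything else 0; otherwise use 1/distance
        w = (d == 0).astype(float) if np.any(d == 0) else 1.0 / d
        np.add.at(proba[row], y_ref[idx], w)
    return proba / proba.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = rng.integers(0, 3, size=200)  # labels are pure noise

train_proba = distance_weighted_proba(X_demo, y_demo, X_demo, k=40, n_classes=3)
print(train_proba[:3])
```

Even with labels that are pure noise, every training point is predicted with probability 1 for its own label, so only the test scores say anything about the model.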

In [164]:
# undersample every class to 1929 rows, the size of the smallest class
target_indices = [sample(list(df[df.target==target].index),1929) for target in np.arange(1,10)]

# stack the per-class samples into one balanced frame
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
balanced_df = pd.concat([df.loc[indices] for indices in target_indices])

In [165]:
type(balanced_df)

pandas.core.frame.DataFrame

In [169]:
balanced_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,84,85,86,87,88,89,90,91,92,target
1046,-0.253508,-0.210106,-0.307165,-0.279443,-0.161867,-0.119331,-0.188045,-0.293664,-0.291038,0.679472,...,0.2461,1.070815,-0.249802,0.531842,-0.299712,-0.176699,-0.129516,-0.386938,-0.104963,1
1391,0.402093,-0.210106,-0.307165,-0.279443,4.394992,-0.119331,0.782741,1.03627,-0.291038,3.448709,...,-0.280099,1.070815,-0.249802,-0.413584,-0.299712,-0.176699,-0.129516,0.631001,-0.104963,1
556,-0.253508,-0.210106,-0.307165,-0.279443,-0.161867,-0.119331,0.782741,0.592959,-0.291038,-0.243606,...,-0.280099,-0.047949,0.38494,-0.413584,-0.299712,0.040798,-0.129516,0.631001,-0.104963,1
1160,-0.253508,-0.210106,-0.307165,-0.279443,-0.161867,-0.119331,-0.188045,1.03627,-0.291038,-0.243606,...,-0.280099,-0.42087,-0.249802,0.059129,-0.299712,-0.176699,-0.129516,-0.386938,-0.104963,1
1748,-0.253508,-0.210106,-0.307165,-0.279443,-0.161867,-0.119331,-0.188045,-0.293664,-0.291038,-0.243606,...,0.772299,-0.42087,-0.249802,-0.413584,-0.299712,-0.176699,-0.129516,-0.386938,-0.104963,1


In [200]:
X_train, X_test, y_train, y_test = train_test_split(balanced_df.drop("target",axis=1),balanced_df.target,test_size=0.3, random_state=42)

In [201]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((12152, 93), (5209, 93), (12152,), (5209,))

In [202]:
knn2 = KNeighborsClassifier(weights='distance',n_neighbors=40,n_jobs=-1)
knn2.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=-1, n_neighbors=40, p=2,
           weights='distance')

In [203]:
log_loss(y_pred=knn2.predict_proba(X_test),y_true=y_test)

0.9803243668926108

<a id='section14'></a>
## Please refer to the other notebooks for Bagging, Boosting and Stacking