# **IMI BIG DATA & AI CASE COMPETITION**

## *By: Hafsa, Cindy, Tahir & Albert*

Before we start training our models, it is best to understand which metrics we will use and which of them make sense for our business use case. One of the most common metrics in machine learning problems is accuracy: the fraction of predictions that match the ground truth. <br>

However, from the EDA we know that we have a class imbalance issue, so accuracy on its own is misleading. In this notebook, we look into other metrics that we can use and implement our own metric based on domain knowledge. We also develop some baseline models as a benchmark for what's to come.
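
To make the problem with accuracy concrete, here is a minimal sketch using made-up labels (95 negatives and 5 positives, not the competition data). A classifier that always predicts the majority class scores high on accuracy while missing every positive case.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 95 negatives and 5 positives (not the competition data)
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

majority_accuracy = accuracy_score(y_true, y_pred)  # 95 of 100 correct -> 0.95
minority_recall = recall_score(y_true, y_pred)      # 0 of 5 positives found -> 0.0

print(f"Accuracy: {majority_accuracy:.2f}")
print(f"Recall on the positive class: {minority_recall:.2f}")
```

The 95/5 split is an assumption chosen for illustration; the more skewed the classes, the more flattering accuracy becomes for a model that never predicts the minority class.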

<br> 


In [1]:
# Import relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import gc
import math

# Model Metrics & Data Pre-processing 
from scipy import stats
from sklearn import metrics
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve, classification_report, precision_recall_curve
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV, RandomizedSearchCV

#Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier

#import lightgbm and xgboost 
import lightgbm as lgb 
import xgboost as xgb 

# Imbalance dataset methods
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import ADASYN
from imblearn.combine import SMOTETomek
from imblearn.combine import SMOTEENN

# Miscellaneous
from collections import Counter

# Additional Libraries -- Automatic Exploratory Data Analysis
from pandas_profiling import ProfileReport
from IPython.display import display, HTML

# Remove warnings (so it doesn't take up space)
import warnings
warnings.filterwarnings('ignore')

# Set seed for reproducibility
np.random.seed(2022)



 <font size="4"> Recall that accuracy is a misleading metric for class-imbalanced problems. </font> 

 <font size="4"> Hence, we will be taking a more holistic approach, and looking at the following evaluation metrics:</font> <br>

- Accuracy
- Confusion Matrix
- Precision (P)
- Recall (R)
- F1 Score (F1)
- Area under the ROC curve, or simply AUC
- Log loss
- Sensitivity, True Positive Rate (How well the positive class was predicted) 
- Specificity, True Negative Rate (How well the negative class was predicted)
- G-Mean = sqrt(sensitivity * specificity), it combines both sensitivity and specificity
- Custom Loss Function
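
As a rough sketch of how sensitivity, specificity, and G-Mean relate to the confusion matrix, the snippet below computes all three for binary 0/1 labels. The helper name `sensitivity_specificity_gmean` and the example labels are hypothetical and used for illustration only.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def sensitivity_specificity_gmean(y_true, y_pred):
    """Return (sensitivity, specificity, G-mean) for binary 0/1 labels."""
    # labels=[0, 1] fixes the matrix to 2x2 so ravel() yields TN, FP, FN, TP
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity, specificity, np.sqrt(sensitivity * specificity)

# Hypothetical labels and predictions, for illustration only
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

sens, spec, gmean = sensitivity_specificity_gmean(y_true, y_pred)
print(f"Sensitivity={sens:.2f}, Specificity={spec:.2f}, G-Mean={gmean:.3f}")
```

Because G-Mean multiplies the two rates, a model that ignores either class is pushed toward zero, which is why it suits imbalanced problems better than accuracy.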

Most of the metrics mentioned here are already available in Python libraries such as scikit-learn. <br>However, we define the custom loss function below. 

## Metrics - Custom Loss Function 

## Define Base Models

We chose two simple baseline models (one linear and one non-linear) to help with the next few phases of the pipeline.

The two baseline models are:

- Logistic Regression
- Decision Trees
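
One way these two baselines could be compared is with stratified cross-validation, sketched below. This is an illustration under stated assumptions rather than the notebook's own procedure: the data is synthetic (generated with `make_classification`), and `max_iter=1000`, the five folds, and the ROC AUC scoring are choices made for the example. Only `max_depth=8` mirrors the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (about 90% negatives); not the competition data
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=2022)

baselines = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=0),
    "Decision Tree": DecisionTreeClassifier(max_depth=8, random_state=0),
}

# Stratified folds keep the class ratio roughly constant in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2022)

cv_auc = {}
for name, model in baselines.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    cv_auc[name] = scores.mean()
    print(f"{name}: mean ROC AUC = {cv_auc[name]:.3f}")
```

Stratification matters here because a plain random split of a small, skewed dataset can leave a fold with very few positive examples, which makes the per-fold metrics unstable.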

In [8]:
#Initialize empty lists for results
from sklearn.metrics import log_loss
model_name, train_acc, test_acc, logLoss = [], [], [], []
TN_lst, FN_lst, TP_lst, FP_lst, F1_SCORE, AUC, G_Mean = [], [], [], [], [], [], []

def run_base_models(data, cols_to_drop):
    
    # The train/test split below is seeded with random_state=2022 for reproducibility.

    # Drop the unused columns (e.g. the row ID and the Final IG label), then
    # separate the B_PLUS_FLAG target from the features.
    all_features = data.drop(columns=cols_to_drop)
    all_targets = all_features.pop("B_PLUS_FLAG")
    train_features, test_features, train_targets, test_targets = train_test_split(all_features, all_targets, test_size=0.2, random_state=2022)
    
    # Initialize all models in a list
    models = [LogisticRegression(),
              DecisionTreeClassifier(max_depth=8),
             ]
    # Define all the model names
    model_names = ["Logistic Regression",
                   "Decision Tree",
                  ]
    # Print the data size
    print("Training Data size: {}".format(train_features.shape))
    print("Total Number of class labels in Test Set:\n", test_targets.value_counts())

    # Loop over models instead of having separate cell per model
    for name, model in zip(model_names, models):
        # Fix the model's seed, train it, and record train/test accuracy
        model.set_params(random_state=0)
        print("Training Model :  {}".format(name))
        model.fit(train_features, train_targets)
        print("Done Training {}".format(name))
        test_score = model.score(test_features, test_targets) * 100
        train_score = model.score(train_features, train_targets) * 100

        # Predict hard labels (for the confusion matrix and F1) and
        # positive-class probabilities (for the threshold-free metrics)
        y_pred = model.predict(test_features)
        y_prob = model.predict_proba(test_features)[:, 1]
        precision, recall, thresholds = metrics.precision_recall_curve(test_targets, y_prob)
        pr_auc = metrics.auc(recall, precision)  # area under the precision-recall curve
        f1score = f1_score(test_targets, y_pred)
        # For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
        TN, FP, FN, TP = confusion_matrix(test_targets, y_pred).ravel()
        # Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)
        Sensitivity, Specifity = (TP / (TP + FN)), (TN / (FP + TN))
        Gmean = np.sqrt(Sensitivity * Specifity)
        logloss = log_loss(test_targets, y_prob)
        
        # Store results
        model_name.append(name)
        train_acc.append(train_score)
        test_acc.append(test_score)
        TN_lst.append(TN)
        FN_lst.append(FN)
        TP_lst.append(TP)
        FP_lst.append(FP)
        F1_SCORE.append(f1score)
        AUC.append(pr_auc)
        G_Mean.append(Gmean)
        logLoss.append(logloss)    
        
    return None

#drop_cols = ["ROW", "Final_IG", "Date"]
#run_base_models(df1, drop_cols)

results_dict = {"Model Name": model_name, "Train Accuracy": train_acc, "Test Accuracy": test_acc, "TP": TP_lst, "TN": TN_lst, "FN": FN_lst, "FP": FP_lst, "F1-Score": F1_SCORE, "AUC":AUC, "G-Mean": G_Mean, "Log-Loss": logLoss}
results_df = pd.DataFrame.from_dict(results_dict)
results_df

| | Model Name | Train Accuracy | Test Accuracy | TP | TN | FN | FP | F1-Score | AUC | G-Mean | Log-Loss |
|---|---|---|---|---|---|---|---|---|---|---|---|
