
#### 4. Gradient Boosting: (3 + 5 = 8 marks) 
In this question, we will explore the use of pre-processing methods and Gradient Boosting on the popular Lending Club dataset. \
You are provided with two files: **loan_train.csv** and **loan_test.csv**. The dataset is almost as provided by the original source, so you may have to make changes to make it suitable for applying ML algorithms. (If required, you can further split loan_train.csv to create a validation set for model selection.) \
Your efforts will be to pre-process the data appropriately, and then apply gradient boosting to classify whether a customer should be given a loan or not. \
* The target attribute is in the column ***loan_status***, which has the values "Fully Paid", to which you can assign +1, and "Charged Off", to which you can assign -1.
* The other records with loan status values "Current" (in both train and test) are not relevant to this problem.

Your tasks are as follows:

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
df_train = pd.read_csv("/content/gdrive/My Drive/Datasets/loan_train.csv")
df_test = pd.read_csv("/content/gdrive/My Drive/Datasets/loan_test.csv")

(a) Pre-process the data as needed to apply the classifier to the training data (you are free to use pandas or other relevant libraries). \
Note that test data should not be used for pre-processing in any way, but the same pre-processing steps can be used on test data. \
Some steps to consider:
* Check for missing values, and how you want to handle them (you can delete the records, or replace the missing value with mean/median of the attribute - this is a decision you must make. Please document your decisions/choices in the final submitted report.)
* Check whether you really need all the provided attributes, and choose the necessary attributes. (You can employ feature selection methods, if you are familiar; if not, you can eyeball.)
* Transform categorical data into binary features, and any other relevant columns to suitable datatypes
* Any other steps that help you perform better
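
As a hedged illustration of the "categorical data into binary features" step, the sketch below one-hot encodes a toy training frame with `pd.get_dummies` and then aligns a toy test frame to the training columns, so the test data never influences the encoding. The frames and their values are made up for illustration; the notebook itself uses integer codes rather than binary columns.

```python
import pandas as pd

# Toy frames (illustrative values only, not taken from the Lending Club files)
train = pd.DataFrame({'home_ownership': ['RENT', 'OWN', 'MORTGAGE'],
                      'loan_amnt': [5000, 12000, 8000]})
test = pd.DataFrame({'home_ownership': ['OWN', 'OTHER'],
                     'loan_amnt': [7000, 3000]})

# Learn the binary columns from the training data only
train_enc = pd.get_dummies(train, columns=['home_ownership'])

# Encode the test data, then align it to the training columns:
# categories unseen in training are dropped, missing ones are filled with 0
test_enc = pd.get_dummies(test, columns=['home_ownership'])
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(test_enc)
```

Aligning with `reindex` is one possible design choice; it silently drops categories that appear only in the test data, which may or may not be what you want.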

In the step below, we
* Recode the target attribute (loan_status) as instructed
* Clean and encode the following columns
 * term
 * grade
 * home_ownership
 * verification_status
 * emp_length
 * revol_util
 * int_rate
 * sub_grade
 * purpose
 * addr_state
* Drop rows with missing values for the feature 'revol_util' (29 of 24,301 rows, about 0.1%)
* Fill the missing values of 'emp_length' with 0
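
To make the string clean-up concrete, here is a minimal sketch on a toy frame, assuming the raw columns look like `'10+ years'` and `'13.49%'`. The sample values are illustrative, not rows from the dataset.

```python
import pandas as pd

# Toy rows mimicking the raw string formats (values are illustrative)
raw = pd.DataFrame({'emp_length': ['10+ years', '< 1 year', None, '3 years'],
                    'int_rate': ['13.49%', '7.90%', '10.65%', '15.27%']})

# Missing employment length becomes '0'; then strip everything except digits
emp = raw['emp_length'].fillna('0').str.replace(r'[^0-9]+', '', regex=True).astype(int)

# Remove the trailing percent sign and parse as float
rate = raw['int_rate'].str.rstrip('%').astype(float)

print(emp.tolist())   # [10, 1, 0, 3]
print(rate.tolist())  # [13.49, 7.9, 10.65, 15.27]
```

Note that under this rule `'< 1 year'` and `'1 year'` both become 1.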

In [None]:
def preprocess_loan_part1(df):
  # Keep only resolved loans; .copy() avoids SettingWithCopyWarning on the filtered slice
  df = df.loc[df['loan_status'].isin(['Fully Paid', 'Charged Off'])].copy()
  # Integer codes for the target and for several categorical columns
  replace_map = {'loan_status': {'Fully Paid': 1, 'Charged Off': -1},
                 'term': {' 36 months': 36, ' 60 months': 60},
                 'grade': {'A': 7, 'B': 6, 'C': 5, 'D': 4, 'E': 3, 'F': 2, 'G': 1},
                 'home_ownership': {'MORTGAGE': 6, 'RENT': 5, 'OWN': 4, 'OTHER': 3, 'NONE': 2, 'ANY': 1},
                 'verification_status': {'Source Verified': 2, 'Verified': 1, 'Not Verified': 0}}
  df = df.replace(replace_map)
  for col in ['loan_status', 'term', 'verification_status', 'home_ownership']:
    df[col] = df[col].astype('category')
  # Missing emp_length becomes 0; keep only the digits ('10+ years' -> 10, '< 1 year' -> 1)
  df['emp_length'] = df['emp_length'].fillna('0').str.replace('[^0-9]+', '', regex=True).astype(int)
  # Drop rows without revol_util, then parse the percentage strings as floats
  df.dropna(subset=['revol_util'], inplace=True)
  df['revol_util'] = df['revol_util'].str.rstrip('%').astype('float')
  df['int_rate'] = df['int_rate'].str.rstrip('%').astype('float')
  # Integer-encode sub_grade in reverse alphabetical order (G5 -> 0, ..., A1 -> highest code)
  sub_grade = sorted(df['sub_grade'].unique(), reverse=True)
  df['sub_grade'] = df['sub_grade'].map({e: x for x, e in enumerate(sub_grade)}).astype('category')
  # Integer-encode purpose and addr_state in order of first appearance.
  # Note: these codes depend on row order, so they can differ between train and test.
  for col in ['purpose', 'addr_state']:
    codes = {e: x for x, e in enumerate(df[col].unique())}
    df[col] = df[col].map(codes).astype('category')

  return df
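
The function above derives the integer codes for 'purpose' and 'addr_state' from the order in which values first appear in whichever frame it receives, so the same category can end up with different codes in the training and test data. One possible way to avoid this, sketched below under the assumption that a fixed code for unseen categories is acceptable, is to learn the mapping on the training data and reuse it. The helpers `fit_codes` and `apply_codes` are hypothetical names, not part of the notebook or of pandas.

```python
import pandas as pd

def fit_codes(series):
    """Learn a category -> integer mapping from training data (sorted for determinism)."""
    return {cat: code for code, cat in enumerate(sorted(series.dropna().unique()))}

def apply_codes(series, codes, unseen=-1):
    """Apply a learned mapping; categories not seen in training get the `unseen` code."""
    return series.map(codes).fillna(unseen).astype(int)

# Toy data (illustrative values only)
train_purpose = pd.Series(['car', 'wedding', 'car', 'house'])
test_purpose = pd.Series(['house', 'car', 'vacation'])

codes = fit_codes(train_purpose)
print(codes)                                      # {'car': 0, 'house': 1, 'wedding': 2}
print(apply_codes(test_purpose, codes).tolist())  # [1, 0, -1]
```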

In [None]:
df_train['loan_status'].value_counts()

Fully Paid     20827
Charged Off     3474
Current          698
Name: loan_status, dtype: int64

In [None]:
df_train = preprocess_loan_part1(df_train)

In [None]:
df_train['loan_status'].value_counts()

1     20810
-1     3462
Name: loan_status, dtype: int64

In [None]:
df_train.info(verbose=True, show_counts=True)  # `null_counts` was removed in pandas 2.0

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24272 entries, 0 to 24998
Data columns (total 111 columns):
 #   Column                          Non-Null Count  Dtype   
---  ------                          --------------  -----   
 0   id                              24272 non-null  int64   
 1   member_id                       24272 non-null  int64   
 2   loan_amnt                       24272 non-null  int64   
 3   funded_amnt                     24272 non-null  int64   
 4   funded_amnt_inv                 24272 non-null  float64 
 5   term                            24272 non-null  category
 6   int_rate                        24272 non-null  float64 
 7   installment                     24272 non-null  float64 
 8   grade                           24272 non-null  int64   
 9   sub_grade                       24272 non-null  category
 10  emp_title                       24266 non-null  object  
 11  emp_length                      24272 non-null  int64   
 12  home_ownership   

In the steps below, we
* Remove features that are not available before a loan is issued, along with identifiers and free-text fields. The goal of this project is to predict whether a loan will be paid off BEFORE deciding to lend, so such features cannot be used (based on domain knowledge).
* Remove features with more than 60% missing values
* Remove features that have a single constant value. Such a feature has zero variance, so it carries no information for the model.
* Remove duplicate rows

In [None]:
drop_irrelevant_feature_list = ['acc_now_delinq', 'acc_open_past_24mths', 'avg_cur_bal', 'bc_open_to_buy', 'bc_util', 'chargeoff_within_12_mths', 
                                'collection_recovery_fee', 'collections_12_mths_ex_med', 'delinq_2yrs', 'delinq_amnt', 'desc', 'earliest_cr_line',
                                'emp_title', 'funded_amnt', 'funded_amnt_inv', 'id', 'inq_last_6mths', 'issue_d', 'last_credit_pull_d', 'last_pymnt_amnt', 
                                'last_pymnt_d', 'member_id', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl', 'mths_since_recent_bc', 'mths_since_recent_inq', 
                                'num_accts_ever_120_pd', 'num_actv_bc_tl', 'num_actv_rev_tl', 'num_bc_sats', 'num_bc_tl', 'num_il_tl', 'num_op_rev_tl', 
                                'num_rev_accts', 'num_rev_tl_bal_gt_0', 'num_sats', 'num_tl_120dpd_2m', 'num_tl_30dpd', 'num_tl_90g_dpd_24m', 
                                'num_tl_op_past_12m', 'out_prncp', 'out_prncp_inv', 'pct_tl_nvr_dlq', 'percent_bc_gt_75', 'pymnt_plan', 'recoveries', 
                                'tax_liens', 'title', 'tot_coll_amt', 'tot_cur_bal', 'tot_hi_cred_lim', 'total_bal_ex_mort', 'total_bc_limit', 
                                'total_il_high_credit_limit', 'total_pymnt', 'total_pymnt_inv', 'total_rec_int', 'total_rec_late_fee', 'total_rec_prncp', 
                                'total_rev_hi_lim', 'url', 'zip_code']

In [None]:
df_train.drop(labels=drop_irrelevant_feature_list, axis=1, inplace=True)

In [None]:
missing_frac = df_train.isnull().mean()
drop_missing_feature_list = sorted(missing_frac[missing_frac > 0.60].index)

In [None]:
print(drop_missing_feature_list)

['all_util', 'annual_inc_joint', 'dti_joint', 'il_util', 'inq_fi', 'inq_last_12m', 'max_bal_bc', 'mo_sin_old_il_acct', 'mo_sin_old_rev_tl_op', 'mort_acc', 'mths_since_last_delinq', 'mths_since_last_major_derog', 'mths_since_last_record', 'mths_since_rcnt_il', 'mths_since_recent_bc_dlq', 'mths_since_recent_revol_delinq', 'next_pymnt_d', 'open_acc_6m', 'open_il_12m', 'open_il_24m', 'open_il_6m', 'open_rv_12m', 'open_rv_24m', 'total_bal_il', 'total_cu_tl', 'verification_status_joint']


In [None]:
df_train.drop(labels=drop_missing_feature_list, axis=1, inplace=True)

In [None]:
df_train.info(verbose=True, show_counts=True)  # `null_counts` was removed in pandas 2.0

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24272 entries, 0 to 24998
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   loan_amnt             24272 non-null  int64   
 1   term                  24272 non-null  category
 2   int_rate              24272 non-null  float64 
 3   installment           24272 non-null  float64 
 4   grade                 24272 non-null  int64   
 5   sub_grade             24272 non-null  category
 6   emp_length            24272 non-null  int64   
 7   home_ownership        24272 non-null  category
 8   annual_inc            24272 non-null  float64 
 9   verification_status   24272 non-null  category
 10  loan_status           24272 non-null  category
 11  purpose               24272 non-null  category
 12  addr_state            24272 non-null  category
 13  dti                   24272 non-null  float64 
 14  open_acc              24272 non-null  int64   
 15  pu

In [None]:
def find_constant_features(dataFrame):
    const_features = []
    for column in list(dataFrame.columns):
        if dataFrame[column].unique().size < 2:
            const_features.append(column)
    return const_features

In [None]:
drop_constant_feature_list = find_constant_features(df_train)

In [None]:
print(drop_constant_feature_list)

['initial_list_status', 'policy_code', 'application_type']


In [None]:
df_train.drop(labels=drop_constant_feature_list, axis=1, inplace=True)

In [None]:
df_train.drop_duplicates(inplace=True)

In [None]:
df_train.info(verbose=True, show_counts=True)  # `null_counts` was removed in pandas 2.0

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24272 entries, 0 to 24998
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   loan_amnt             24272 non-null  int64   
 1   term                  24272 non-null  category
 2   int_rate              24272 non-null  float64 
 3   installment           24272 non-null  float64 
 4   grade                 24272 non-null  int64   
 5   sub_grade             24272 non-null  category
 6   emp_length            24272 non-null  int64   
 7   home_ownership        24272 non-null  category
 8   annual_inc            24272 non-null  float64 
 9   verification_status   24272 non-null  category
 10  loan_status           24272 non-null  category
 11  purpose               24272 non-null  category
 12  addr_state            24272 non-null  category
 13  dti                   24272 non-null  float64 
 14  open_acc              24272 non-null  int64   
 15  pu

In [None]:
df_train[['pub_rec','pub_rec_bankruptcies']].corr()

Unnamed: 0,pub_rec,pub_rec_bankruptcies
pub_rec,1.0,0.842039
pub_rec_bankruptcies,0.842039,1.0


As shown above, "pub_rec" and "pub_rec_bankruptcies" are highly correlated (correlation 0.84), so we feed only one of them into the model and remove "pub_rec_bankruptcies".
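
The same check can be run over every pair of numeric columns at once. The sketch below is one way to do it, assuming a threshold of 0.8 (an illustrative choice, not a value taken from the notebook); `correlated_pairs` is a hypothetical helper and the toy frame is made up.

```python
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return (col_a, col_b, corr) for numeric column pairs with |corr| above threshold."""
    corr = df.select_dtypes('number').corr()
    cols = corr.columns
    pairs = []
    # Visit each unordered pair once (upper triangle of the correlation matrix)
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if abs(corr.iloc[i, j]) > threshold:
                pairs.append((cols[i], cols[j], float(corr.iloc[i, j])))
    return pairs

# Toy data: 'b' is almost a multiple of 'a', 'c' is unrelated
toy = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                    'b': [2, 4, 6, 8, 11],
                    'c': [5, 1, 4, 2, 3]})
print(correlated_pairs(toy))
```

For each reported pair, one of the two columns can then be dropped, as done for 'pub_rec_bankruptcies' here.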

In [None]:
drop_correlated_feature_list = ['pub_rec_bankruptcies']

In [None]:
df_train.drop(labels=drop_correlated_feature_list, axis=1, inplace=True)

In [None]:
df_train.info(verbose=True, show_counts=True)  # `null_counts` was removed in pandas 2.0

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24272 entries, 0 to 24998
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype   
---  ------               --------------  -----   
 0   loan_amnt            24272 non-null  int64   
 1   term                 24272 non-null  category
 2   int_rate             24272 non-null  float64 
 3   installment          24272 non-null  float64 
 4   grade                24272 non-null  int64   
 5   sub_grade            24272 non-null  category
 6   emp_length           24272 non-null  int64   
 7   home_ownership       24272 non-null  category
 8   annual_inc           24272 non-null  float64 
 9   verification_status  24272 non-null  category
 10  loan_status          24272 non-null  category
 11  purpose              24272 non-null  category
 12  addr_state           24272 non-null  category
 13  dti                  24272 non-null  float64 
 14  open_acc             24272 non-null  int64   
 15  pub_rec            

In [None]:
missing_frac2 = df_train.isnull().mean()

In [None]:
print(sorted(missing_frac2[missing_frac2 > 0].index))

[]


Now we have a clean dataset with 19 columns (18 features plus the target 'loan_status') and no missing values.

(b) Apply gradient boosting using the function *sklearn.ensemble.GradientBoostingClassifier* for training the model. You will need to import *sklearn*, *sklearn.ensemble*, and *numpy*. Your effort will be focused on predicting whether or not a loan is likely to default.
* Get the best test accuracy you can, and show what hyperparameters led to this
accuracy. Report the precision and recall for each of the models that you built.
* In particular, study the effect of increasing the number of trees in the classifier.
* Compare your final best performance (accuracy, precision, recall) against a simple decision tree built using information gain. (You can use sklearn's inbuilt decision tree function for this.)
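
One inexpensive way to study the effect of the number of trees, sketched here on synthetic data rather than the Lending Club files, is to fit a single model with the largest tree count of interest and use `staged_predict` to score it after each added tree. The dataset parameters and the hyperparameters below are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (not the Lending Club dataset)
X, y = make_classification(n_samples=600, n_features=10, weights=[0.15, 0.85],
                           random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25,
                                          random_state=42, stratify=y)

# Fit once with the largest number of trees of interest
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.05,
                                max_depth=3, random_state=42)
gb.fit(X_tr, y_tr)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees
val_acc = [accuracy_score(y_va, pred) for pred in gb.staged_predict(X_va)]

for n_trees in (1, 10, 50, 100):
    print(n_trees, round(val_acc[n_trees - 1], 3))
```

This avoids refitting a separate model for each value of `n_estimators`, at the cost of fixing the other hyperparameters.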

In [None]:
X_train_full = df_train.loc[:, df_train.columns != 'loan_status'].values
y_train_full = df_train.loc[:, df_train.columns == 'loan_status'].values

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size = 0.25, random_state = 42, stratify = y_train_full)

In [None]:
df_test = preprocess_loan_part1(df_test)
df_test.drop(labels=drop_irrelevant_feature_list, axis=1, inplace=True)
df_test.drop(labels=drop_missing_feature_list, axis=1, inplace=True)
df_test.drop(labels=drop_constant_feature_list, axis=1, inplace=True)
df_test.drop_duplicates(inplace=True)
df_test.drop(labels=drop_correlated_feature_list, axis=1, inplace=True)

In [None]:
X_test = df_test.loc[:, df_test.columns != 'loan_status'].values
y_test = df_test.loc[:, df_test.columns == 'loan_status'].values

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV

Let's first build a baseline model by searching jointly over 'learning_rate' and 'n_estimators'. We'll take the 'learning_rate' from this search.

In [None]:
param_test0 = {'learning_rate':[0.15,0.1,0.05,0.01,0.005,0.001], 'n_estimators':range(200,2100,100)}
gsearch0 = GridSearchCV(estimator =GradientBoostingClassifier(max_depth=4, min_samples_split=2, min_samples_leaf=1, subsample=1, max_features='sqrt', random_state=42), 
            param_grid = param_test0, scoring='accuracy',n_jobs=40, cv=5)
gsearch0.fit(X_train,y_train.ravel())
gsearch0.best_params_, gsearch0.best_score_

({'learning_rate': 0.05, 'n_estimators': 300}, 0.8576686153725428)

In [None]:
predictions = gsearch0.predict(X_val)
print("Model 0 : Classification Report of Validation Set")
print(classification_report(y_val, predictions))

Model 0 : Classification Report of Validation Set
              precision    recall  f1-score   support

          -1       0.36      0.02      0.04       866
           1       0.86      0.99      0.92      5202

    accuracy                           0.85      6068
   macro avg       0.61      0.51      0.48      6068
weighted avg       0.79      0.85      0.80      6068



Now let's tune the parameter 'n_estimators' (number of trees) setting the 'learning_rate' = 0.05

In [None]:
param_test1 = {'n_estimators':range(200,2100,100)}
gsearch1 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.05, max_depth=4, min_samples_split=2, min_samples_leaf=1, subsample=1, max_features='sqrt', random_state=42), 
                        param_grid = param_test1, scoring='accuracy', n_jobs=40, cv=5)
gsearch1.fit(X_train,y_train.ravel())
gsearch1.best_params_, gsearch1.best_score_

({'n_estimators': 300}, 0.8576686153725428)

In [None]:
predictions = gsearch1.predict(X_val)
print("Model 1 : Classification Report of Validation Set")
print(classification_report(y_val, predictions))

Model 1 : Classification Report of Validation Set
              precision    recall  f1-score   support

          -1       0.36      0.02      0.04       866
           1       0.86      0.99      0.92      5202

    accuracy                           0.85      6068
   macro avg       0.61      0.51      0.48      6068
weighted avg       0.79      0.85      0.80      6068



We found the best 'n_estimators' = 300. \
Now let's start Tuning tree-specific parameters.
1. max_depth and min_samples_split
2. min_samples_leaf
3. max_features

In [None]:
param_test2 = {'max_depth':range(1,8,2), 'min_samples_split':range(200,1001,200)}
gsearch2 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.05, n_estimators=300, min_samples_leaf=1, subsample=1, max_features='sqrt', random_state=42), 
                        param_grid = param_test2, scoring='accuracy', n_jobs=40, cv=5)
gsearch2.fit(X_train,y_train.ravel())
gsearch2.best_params_, gsearch2.best_score_

({'max_depth': 3, 'min_samples_split': 400}, 0.8576686455538418)

In [None]:
predictions = gsearch2.predict(X_val)
print("Model 2 : Classification Report of Validation Set")
print(classification_report(y_val, predictions))

Model 2 : Classification Report of Validation Set
              precision    recall  f1-score   support

          -1       0.33      0.01      0.02       866
           1       0.86      1.00      0.92      5202

    accuracy                           0.86      6068
   macro avg       0.60      0.50      0.47      6068
weighted avg       0.78      0.86      0.79      6068



We found 'max_depth' = 3, 'min_samples_split' = 400 as the best. \
Let's tune 'min_samples_leaf'.

In [None]:
param_test3 = {'min_samples_leaf':range(30,71,10)}
gsearch3 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.05, n_estimators=300, max_depth=3, min_samples_split=400, subsample=1, max_features='sqrt', random_state=42), 
                        param_grid = param_test3, scoring='accuracy', n_jobs=40, cv=5)
gsearch3.fit(X_train, y_train.ravel())
gsearch3.best_params_, gsearch3.best_score_

({'min_samples_leaf': 70}, 0.8581081456308042)

In [None]:
predictions = gsearch3.predict(X_val)
print("Model 3 : Classification Report of Validation Set")
print(classification_report(y_val, predictions))

Model 3 : Classification Report of Validation Set
              precision    recall  f1-score   support

          -1       0.43      0.02      0.04       866
           1       0.86      1.00      0.92      5202

    accuracy                           0.86      6068
   macro avg       0.65      0.51      0.48      6068
weighted avg       0.80      0.86      0.80      6068



Setting the best 'min_samples_leaf' = 70, let's tune 'max_features'.

In [None]:
param_test4 = {'max_features':range(7,20,2)}
gsearch4 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.05, n_estimators=300, max_depth=3, min_samples_split=400, min_samples_leaf=70, subsample=1, random_state=42),
                        param_grid = param_test4, scoring='accuracy', n_jobs=5, cv=5)
gsearch4.fit(X_train, y_train.ravel())
gsearch4.best_params_, gsearch4.best_score_

({'max_features': 15}, 0.8584926704715224)

In [None]:
predictions = gsearch4.predict(X_val)
print("Model 4 : Classification Report of Validation Set")
print(classification_report(y_val, predictions))

Model 4 : Classification Report of Validation Set
              precision    recall  f1-score   support

          -1       0.40      0.02      0.04       866
           1       0.86      0.99      0.92      5202

    accuracy                           0.86      6068
   macro avg       0.63      0.51      0.48      6068
weighted avg       0.79      0.86      0.80      6068



Finally we'll tune the parameter 'subsample'.

In [None]:
param_test5 = {'subsample':[0.6,0.7,0.75,0.8,0.85,0.9]}
gsearch5 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.05, n_estimators=300, max_depth=7, min_samples_split=400, min_samples_leaf=70, max_features=15, random_state=42),
                        param_grid = param_test5, scoring='accuracy', n_jobs=5, cv=5)
gsearch5.fit(X_train, y_train.ravel())
gsearch5.best_params_, gsearch5.best_score_

({'subsample': 0.75}, 0.8559109017870347)

In [None]:
predictions = gsearch5.predict(X_val)
print("Model 5 : Classification Report of Validation Set")
print(classification_report(y_val, predictions))

Model 5 : Classification Report of Validation Set
              precision    recall  f1-score   support

          -1       0.40      0.04      0.07       866
           1       0.86      0.99      0.92      5202

    accuracy                           0.85      6068
   macro avg       0.63      0.51      0.49      6068
weighted avg       0.79      0.85      0.80      6068



Now that the searches are done, let's build our final model. The hyper-parameters used in the code below are:
* learning_rate=0.05
* n_estimators=300
* max_depth=7 (the grid search above selected max_depth=3, but the 'subsample' search and the final model both use 7)
* min_samples_split=400
* min_samples_leaf=70
* max_features=15
* subsample=0.75

In [None]:
gb = GradientBoostingClassifier(learning_rate=0.05, n_estimators=300, max_depth=7, min_samples_split=400, min_samples_leaf=70, max_features=15, subsample=0.75, random_state=42)
gb.fit(X_train, y_train.ravel())

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.05, loss='deviance', max_depth=7,
                           max_features=15, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=70, min_samples_split=400,
                           min_weight_fraction_leaf=0.0, n_estimators=300,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=42, subsample=0.75, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [None]:
y_val_pred = gb.predict(X_val)
print("Final Model : Confusion Matrix of Validation Set")
print(confusion_matrix(y_val, y_val_pred))
print()
print("Final Model : Classification Report of Validation Set")
print(classification_report(y_val, y_val_pred))

Final Model : Confusion Matrix of Validation Set
[[  31  835]
 [  47 5155]]

Final Model : Classification Report of Validation Set
              precision    recall  f1-score   support

          -1       0.40      0.04      0.07       866
           1       0.86      0.99      0.92      5202

    accuracy                           0.85      6068
   macro avg       0.63      0.51      0.49      6068
weighted avg       0.79      0.85      0.80      6068



We'll now re-fit the final model on the complete training set (including the validation set) and predict on the test set.

In [None]:
gb = GradientBoostingClassifier(learning_rate=0.05, n_estimators=300, max_depth=7, min_samples_split=400, min_samples_leaf=70, max_features=15, subsample=0.75, random_state=42)
gb.fit(X_train_full, y_train_full.ravel())

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.05, loss='deviance', max_depth=7,
                           max_features=15, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=70, min_samples_split=400,
                           min_weight_fraction_leaf=0.0, n_estimators=300,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=42, subsample=0.75, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [None]:
y_pred = gb.predict(X_test)
print("Final Model : Confusion Matrix of Test Set")
print(confusion_matrix(y_test, y_pred))
print()
print("Final Model : Classification Report of Test Set")
print(classification_report(y_test, y_pred))

Final Model : Confusion Matrix of Test Set
[[   64  2085]
 [   89 12017]]

Final Model : Classification Report of Test Set
              precision    recall  f1-score   support

          -1       0.42      0.03      0.06      2149
           1       0.85      0.99      0.92     12106

    accuracy                           0.85     14255
   macro avg       0.64      0.51      0.49     14255
weighted avg       0.79      0.85      0.79     14255



In [None]:
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 42)
classifier.fit(X_train_full, y_train_full.ravel())  # ravel() passes a 1-D target, as in the gradient boosting fits
y_pred = classifier.predict(X_test)

In [None]:
print("Decision Tree Model : Confusion Matrix of Test Set")
print(confusion_matrix(y_test, y_pred))
print()
print("Decision Tree Model : Classification Report of Test Set")
print(classification_report(y_test, y_pred))

Decision Tree Model : Confusion Matrix of Test Set
[[  417  1732]
 [ 1776 10330]]

Decision Tree Model : Classification Report of Test Set
              precision    recall  f1-score   support

          -1       0.19      0.19      0.19      2149
           1       0.86      0.85      0.85     12106

    accuracy                           0.75     14255
   macro avg       0.52      0.52      0.52     14255
weighted avg       0.76      0.75      0.75     14255
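
To compare the two models directly, the figures in the reports above can be recomputed from the printed test-set confusion matrices. The sketch below does this with NumPy; `summarize` is a hypothetical helper, and it assumes rows are true labels and columns are predicted labels in the order (-1, +1), which is how `confusion_matrix` orders them here.

```python
import numpy as np

def summarize(cm):
    """Accuracy plus precision and recall of the -1 class from a 2x2 confusion matrix.

    Rows are true labels and columns are predicted labels, ordered (-1, +1).
    """
    cm = np.asarray(cm)
    accuracy = np.trace(cm) / cm.sum()
    precision_neg = cm[0, 0] / cm[:, 0].sum()
    recall_neg = cm[0, 0] / cm[0, :].sum()
    return accuracy, precision_neg, recall_neg

gb_cm = [[64, 2085], [89, 12017]]      # gradient boosting, test set
dt_cm = [[417, 1732], [1776, 10330]]   # decision tree, test set

for name, cm in [('GB', gb_cm), ('DT', dt_cm)]:
    acc, p, r = summarize(cm)
    print(f"{name}: accuracy={acc:.3f} precision(-1)={p:.3f} recall(-1)={r:.3f}")

# Accuracy of always predicting +1 (the majority class) on the same test set
print(f"majority baseline: {12106 / 14255:.3f}")
```

On these numbers the gradient boosting accuracy (about 0.847) sits just below the always-predict-+1 baseline (about 0.849), while the decision tree trades accuracy (about 0.754) for higher recall on the -1 class (about 0.19 versus 0.03). With classes this imbalanced, recall on the -1 class is arguably the more informative figure.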

