## XGBoost Classification

In this notebook, we will use XGBoost to build a collection of boosted trees, using the [Telco Customer Churn](https://www.kaggle.com/datasets/blastchar/telco-customer-churn) data from Kaggle. 

XGBoost creates decision trees based on the values provided, and since it uses sparse matrices, if we decide to replace the missing values with `0`, it will not take up a lot of memory. 

In [1]:
## first the imports
import numpy as np
import pandas as pd
import xgboost as xgb
## the preprocessing related packages
## splitting and for cross-validation
from sklearn.model_selection import train_test_split, GridSearchCV 
## scoring packages
from sklearn.metrics import accuracy_score, roc_auc_score, make_scorer 
## confusion matrix packages
## these are deprecated
# from sklearn.metrics import confusion_matrix, plot_confusion_matrix 

In [2]:
## moving on to loading the data
data = pd.read_csv('C://Users//12145//Documents//GitHub//Python//data/Telco-Customer-Churn.csv')
data.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [3]:
## checking the types and if there's any missing values
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [4]:
## we have many objects in our data
## and they all seem to be categorical
## so we can simply change their types and values
def data_cleaner(df, drop_cols = []):
    ## first dropping the passed columns
    if drop_cols:
        df = df.drop(drop_cols, axis=1)
    ## cleaning the column names
    df.columns = [x.replace(' ', '_').lower() for x in df.columns]
    astype_dict = {}
    for col in df.columns:
        if df[col].dtype == 'object' and len(df[col].value_counts()) < 3:
            df[col] = np.where((df[col].str.lower() == 'y') | (df[col].str.lower() == 'yes'), 1,0)#.astype('uint8')
            astype_dict[col] = 'uint8'
        elif df[col].dtype == 'object' and len(df[col].value_counts()) < 6:
            df[col] = df[col].replace(r'\s+', '_', regex=True).str.lower()
            df = pd.get_dummies(data=df, columns=[col])
        elif df[col].dtype == 'object' and len(df.loc[df[col].str.strip()=='']) > 1:
            df.loc[df[col].str.strip()=='', col] = 0
            try:
                df.astype({col:'float16'})
                astype_dict[col] = 'float16'
            except Exception as e:
                pass
        elif df[col].dtype == 'float64' and df[col].min() == df.astype({col:'float16'})[col].min():
            astype_dict[col] = 'float16'
        elif df[col].dtype == 'float64' and df[col].min() == df.astype({col:'float32'})[col].min():
            astype_dict[col] = 'float32'
        elif df[col].dtype == 'int64' and df[col].min() == df.astype({col:'int8'})[col].min():
            astype_dict[col] = 'int8'
        elif df[col].dtype == 'int64' and df[col].min() == df.astype({col:'int16'})[col].min():
            astype_dict[col] = 'int16'
        elif df[col].dtype == 'int64' and df[col].min() == df.astype({col:'int32'})[col].min():
            astype_dict[col] = 'int32'
    return df.astype(astype_dict)
cleaned_data = data_cleaner(data.copy(), drop_cols= ['customerID'])
cleaned_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 41 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   gender                                   7043 non-null   uint8  
 1   seniorcitizen                            7043 non-null   int8   
 2   partner                                  7043 non-null   uint8  
 3   dependents                               7043 non-null   uint8  
 4   tenure                                   7043 non-null   int8   
 5   phoneservice                             7043 non-null   uint8  
 6   paperlessbilling                         7043 non-null   uint8  
 7   monthlycharges                           7043 non-null   float16
 8   totalcharges                             7043 non-null   float16
 9   churn                                    7043 non-null   uint8  
 10  multiplelines_no                         7043 no

In [5]:
## one of the factors that can cause problems in our prediction
## is to have an unbalanced ratio for the labels
label_column = 'churn'
round(sum(cleaned_data[label_column])/cleaned_data.shape[0],2)*100

27.0

In [6]:
## the next step is to create our first model
## and we need to create our train and test data
## and since we noticed that we have unbalanced set
## we will have to use stratification for the split step
rseed = 10
X, y = cleaned_data.drop(label_column, axis=1), cleaned_data[label_column]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rseed, stratify=y)
## the objective is to create a binar/logistic model
## we've taken care of the missing values, so we can set that to None
classifier = xgb.XGBClassifier(objective='binary:logistic',
                               eval_metric='aucpr',
                               early_stopping_rounds=10,
                               missing=None)
classifier.fit(X_train, y_train, 
               verbose=True,
               eval_set=[(X_test, y_test)],
              )

[0]	validation_0-aucpr:0.62322
[1]	validation_0-aucpr:0.64581
[2]	validation_0-aucpr:0.64814
[3]	validation_0-aucpr:0.64660
[4]	validation_0-aucpr:0.64816
[5]	validation_0-aucpr:0.64087
[6]	validation_0-aucpr:0.63667
[7]	validation_0-aucpr:0.64316
[8]	validation_0-aucpr:0.64511
[9]	validation_0-aucpr:0.64295
[10]	validation_0-aucpr:0.64252
[11]	validation_0-aucpr:0.63822
[12]	validation_0-aucpr:0.63313
[13]	validation_0-aucpr:0.63595
[14]	validation_0-aucpr:0.63591


XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=10,
              enable_categorical=False, eval_metric='aucpr', feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=None, monotone_constraints=None,
              multi_strategy=None, n_estimators=None, n_jobs=None,
              num_parallel_tree=None, random_state=None, ...)

In [54]:
## drawing the confusion matrix
plot_confusion_matrix(classifier,X_test, y_test,
                     values_format='d')



XGBoostError: [17:13:06] C:\buildkite-agent\builds\buildkite-windows-cpu-autoscaling-group-i-0b3782d1791676daf-1\xgboost\xgboost-ci-windows\include\xgboost/json.h:630: Invalid type for: `missing`, expecting one of the: {``Number`, `Integer`}, got: `Null`

In [None]:
## XGBoost only allows int, float, and boolean
## so we need to take care of the data types
## we will have to use stratification for the split step
## to maintain the same % of the 1/0 labels in both sets

In [None]:
## we will be using the XGB Classifer for our model
## with the objective of binary:logistic
## and set missing to None
## and then for fitting the data
## have early stopping rounds to 10
## eval_metric to aucpr
## and pass our evaluation set at the same step
## one of the ways to imporve the performace
## is to increase the penalty of miss-classifying
## by scale_pos_weight
## which is usefule when the data isn't balanced
## in terms of the labels