## XGBoost Classification

In this notebook, we will use XGBoost to build a collection of boosted trees, using the [Telco Customer Churn](https://www.kaggle.com/datasets/blastchar/telco-customer-churn) data from Kaggle. 

XGBoost builds its decision trees from the feature values it is given. Because it can store the data as sparse matrices, replacing the missing values with `0` does not take up much extra memory. 

In [2]:
## first the imports
import numpy as np
import pandas as pd
import xgboost as xgb
## the preprocessing related packages
## splitting and for cross-validation
from sklearn.model_selection import train_test_split, GridSearchCV 
## scoring packages
from sklearn.metrics import accuracy_score, roc_auc_score, make_scorer 
## confusion matrix packages
## (plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it)
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay 

In [3]:
## moving on to loading the data
data = pd.read_csv('C://Users//12145//Documents//GitHub//Python//data/Telco-Customer-Churn.csv')
data.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [4]:
## checking the types and if there's any missing values
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
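
One caveat when reading `info()`: it only counts real nulls, so a value such as `' '` is still reported as non-null. The cleaning function later in this notebook looks for whitespace-only strings for that reason. Below is a minimal sketch of how such hidden blanks could be counted. The tiny frame is made up for illustration, only the column names come from the Telco data, and `blank_counts` is a hypothetical name.

```python
import pandas as pd

# Illustrative rows only; the column names mirror the Telco data.
df = pd.DataFrame({
    'TotalCharges': ['29.85', ' ', '108.15'],
    'Churn': ['No', 'Yes', 'No'],
})

# info() would report all of these as non-null, because ' ' is a valid string.
# Count the whitespace-only entries per column instead.
blank_counts = {
    col: int(df[col].astype(str).str.strip().eq('').sum())
    for col in df.columns
}
print(blank_counts)

# Coercing to numeric turns entries that cannot be parsed into NaN.
total_charges = pd.to_numeric(df['TotalCharges'], errors='coerce')
print(total_charges.isna().sum())
```

Whether the blanks should then become `0` or stay as `NaN` is a modelling choice; this notebook goes with `0`.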


In [18]:
## we have many object columns in our data
## and they all seem to be categorical
## so we can simply change their types and values
def data_cleaner(df):
    ## work on a copy so the original frame is left untouched
    df = df.copy()
    ## cleaning the column names
    df.columns = [x.replace(' ', '_').lower() for x in df.columns]
    astype_dict = {}
    for col in df.columns:
        if df[col].dtype == 'object' and len(df[col].value_counts()) < 3:
            ## binary columns: 'y'/'yes' -> 1, everything else -> 0
            ## (.str.lower() has to be called; comparing the bare method is always False)
            ## note: a two-valued column without y/yes (e.g. gender) ends up all 0
            df[col] = np.where(df[col].str.lower().isin(['y', 'yes']), 1, 0).astype('uint8')
        elif df[col].dtype == 'object' and len(df[col].value_counts()) < 6:
            ## low-cardinality columns: normalize the labels, then one-hot encode
            df[col] = df[col].replace(r'\s+', '_', regex=True).str.lower()
            df = pd.get_dummies(data=df, columns=[col])
        elif df[col].dtype == 'object' and len(df.loc[df[col].str.strip()=='']) > 1:
            ## numbers stored as text: blanks become 0, then test a float cast
            df.loc[df[col].str.strip()=='', col] = 0
            try:
                df.astype({col:'float16'})
                astype_dict[col] = 'float16'
            except Exception:
                ## leave the column as-is if it cannot be cast
                pass
        ## the remaining branches pick a smaller numeric dtype;
        ## they only compare the column minimum, so other values are not checked
        elif df[col].dtype == 'float64' and df[col].min() == df.astype({col:'float16'})[col].min():
            astype_dict[col] = 'float16'
        elif df[col].dtype == 'float64' and df[col].min() == df.astype({col:'float32'})[col].min():
            astype_dict[col] = 'float32'
        elif df[col].dtype == 'int64' and df[col].min() == df.astype({col:'int8'})[col].min():
            astype_dict[col] = 'int8'
        elif df[col].dtype == 'int64' and df[col].min() == df.astype({col:'int16'})[col].min():
            astype_dict[col] = 'int16'
        elif df[col].dtype == 'int64' and df[col].min() == df.astype({col:'int32'})[col].min():
            astype_dict[col] = 'int32'
    return df.astype(astype_dict)
cleaned_data = data_cleaner(data)
cleaned_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 42 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   customerid                               7043 non-null   object 
 1   gender                                   7043 non-null   uint8  
 2   seniorcitizen                            7043 non-null   int8   
 3   partner                                  7043 non-null   uint8  
 4   dependents                               7043 non-null   uint8  
 5   tenure                                   7043 non-null   int8   
 6   phoneservice                             7043 non-null   uint8  
 7   paperlessbilling                         7043 non-null   uint8  
 8   monthlycharges                           7043 non-null   float16
 9   totalcharges                             7043 non-null   float16
 10  churn                                    7043 no
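
The dtype selection in `data_cleaner` compares only the column minimum before and after the cast, so a column can pass the check even when other values lose precision. As a rough sketch of a stricter test, the hypothetical helper `fits_losslessly` below (not part of the notebook) casts every value to the candidate dtype and back, then compares the whole column. The sample values are made up for illustration.

```python
import pandas as pd

def fits_losslessly(series, dtype):
    """Return True if casting every value to `dtype` and back changes nothing."""
    round_trip = series.astype(dtype).astype(series.dtype)
    return bool((round_trip == series).all())

# Illustrative values only (no NaN, which would never compare equal).
charges = pd.Series([18.25, 29.85, 8684.8])

# The minimum-only check passes: 18.25 is exactly representable in float16.
min_check = bool(charges.min() == charges.astype('float16').min())
print(min_check)

# The full round-trip check fails: 29.85 and 8684.8 are rounded by float16.
print(fits_losslessly(charges, 'float16'))
print(fits_losslessly(charges, 'float64'))
```

If exact charge amounts matter for the model, a wider float type would be the safer choice, at the cost of some memory.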

In [None]:
## XGBoost only allows int, float, and boolean
## so we need to take care of the data types
## we will have to use stratification for the split step
## to maintain the same % of the 1/0 labels in both sets
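
The stratified split described above could look roughly like the sketch below. It runs on synthetic data so it is self-contained. In this notebook, `X` would presumably be `cleaned_data` without the `churn` column (and without `customerid`, which is still an object column), with `y` as `churn`; that mapping is an assumption, not something the notebook has shown yet.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 25% positive labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 250 + [0] * 750)

# stratify=y keeps the share of 1/0 labels the same in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f'positive share overall: {y.mean():.3f}')
print(f'positive share train:   {y_train.mean():.3f}')
print(f'positive share test:    {y_test.mean():.3f}')
```

Without `stratify`, the two shares can drift apart by chance, which matters more as the labels become more imbalanced.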

In [None]:
## we will be using the XGBClassifier for our model
## with the objective of binary:logistic
## and set missing to None
## and then, for fitting the data,
## set early stopping rounds to 10
## and eval_metric to aucpr
## and pass our evaluation set at the same step
## one of the ways to improve the performance
## is to increase the penalty for misclassifying the positive class
## via scale_pos_weight
## which is useful when the data isn't balanced
## in terms of the labels