<p style="text-align:center;">
<img src="https://noodle.digitalfutures.com/studentuploads/Data_Cygnets_logo.png" width="150" height="150" />
</p>

# Gradient Boosting 🌲🚀 on Swan Teleco
We tried gradient boosting on the Swan Teleco churn data. The results were worse than those of a logistic regression model.
### by Data Cygnets
🦢 Jamie M

🦢 Muqadas

🦢 Sennan

🦢 Maarja

## 1. Python Essentials 📁

**Note: some packages (`supertree`, `imbalanced-learn`) are not part of a typical Python installation and may need to be installed first**

In [None]:
#Basic Python packages
from time import time
import warnings
warnings.filterwarnings('ignore')


#Numerical information
import numpy as np

#General Data use
import pandas as pd
from collections import Counter

#Data Visualisation
import seaborn as sns
import matplotlib.pyplot as plt

##Predictive Modelling##
#Metrics
from sklearn import metrics
from sklearn.metrics import (confusion_matrix, accuracy_score)

#Model
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier

#HyperParameters
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

#Modelling Visuals
from sklearn import tree


#######External packages########
#Smarter Decision Tree Visualisation
#Please run this if the package is not currently installed in the environment:
#!pip install supertree -U
from supertree import SuperTree

#Class balancing with SMOTE (Synthetic Minority Over-sampling Technique)
#https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
#Please run this if the package is not currently installed in the environment:
#!pip install imbalanced-learn -U
import imblearn
from imblearn.pipeline import Pipeline as imbpipeline
from imblearn.over_sampling import SMOTE
#https://medium.com/data-science/the-right-way-of-using-smote-with-cross-validation-92a8d09d00c7#:~:text=We%20first%20split%20the%20data,cross%2Dvalidation%20and%20test%20scores.

In [None]:
df = pd.read_excel('1 - Project Data.xlsx')

## 2. Data Cleaning and Splitting 📊

### Outlier removal

In [None]:
#Customers with zero months of tenure are treated as outliers and removed
df_tenured = df[df['Tenure Months']!=0]

### Data Splitting

In [None]:
train, test = train_test_split(df_tenured, random_state = 60, stratify = df_tenured['Churn Label'], test_size = 0.2)
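
The effect of `stratify` can be checked on a small example. The sketch below uses a hypothetical toy frame (100 made-up customers, 25 of them churners), not the project data, and simply compares the churn rate in each part of the split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 customers, 25 of whom churned
toy = pd.DataFrame({
    "Tenure Months": range(1, 101),
    "Churn Label": ["Yes"] * 25 + ["No"] * 75,
})

# Same arguments as the split above, applied to the toy frame
toy_train, toy_test = train_test_split(
    toy, random_state=60, stratify=toy["Churn Label"], test_size=0.2
)

# Share of churners in each part; stratification keeps them close to 25%
train_rate = (toy_train["Churn Label"] == "Yes").mean()
test_rate = (toy_test["Churn Label"] == "Yes").mean()
print(f"train churn rate: {train_rate:.3f}, test churn rate: {test_rate:.3f}")
```

Running the same two `mean()` checks on `train` and `test` would confirm that both parts carry a similar share of churners.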

## 3. Feature Engineering🛠️

In [None]:
Y_N_cols = ['Dependents', 'Phone Service', 'Multiple Lines', 'Internet Service',
       'Online Security', 'Online Backup', 'Device Protection', 'Tech Support',
       'Streaming TV', 'Streaming Movies', 'Paperless Billing']
categorical_cols = ['City','Contract','Payment Method']
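def str_to_int_boolean(x):
    #Maps 'Yes' to 1 and every other value to 0.
    #Caution: a column whose positive values are not literally 'Yes' is encoded as all zeros
    if x=='Yes':
        return 1
    else:
        return 0
def service_count(row):
    #Counts the entries in a row that do not contain 'No', i.e. the services the customer has
    return (~row.str.contains('No')).sum()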


def feature_eng(df):
    #Removal of Columns
    #Dropping unnecessary columns, plus 'Churn Value', which duplicates 'Churn Label'
    drop_cols = ['CustomerID', 'Count', 'Country', 'State', 'Lat Long', 'Latitude',
                 'Longitude', 'Churn Reason', 'Gender', 'Partner', 'Senior Citizen',
                 'Churn Value']
    features = [col for col in df.columns if col not in drop_cols]

    df_copy = df[features].copy()
    df_copy.index = df.CustomerID
    ##Cleaning of Data##
    df_copy.replace('No internet service', 'No', inplace=True)
    df_copy.replace('No phone service', 'No', inplace=True)
    df_copy['Total Charges'] = df_copy['Total Charges'].replace(' ', 0)
    df_copy['Total Charges'] = df_copy['Total Charges'].astype(float)


    cols = ['Phone Service','Multiple Lines', 'Internet Service', 'Online Security',
       'Online Backup', 'Device Protection', 'Tech Support', 'Streaming TV',
       'Streaming Movies']

    #Counting the number of services
    df_copy['service_count'] = df_copy[cols].apply(service_count, axis=1)

    ##Changing from str to integer booleans##
    for i in Y_N_cols:
        df_copy[i] = df_copy[i].apply(str_to_int_boolean)

    #df_copy['Gender'] = [1 if val == 'Male' else 0 for val in df_copy['Gender']]

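    ##Changing categorical into numerical
    #Caution: factorize numbers values in order of first appearance, so calling this
    #function separately on train and test can give the same category different codes
    for i in categorical_cols:
        df_copy[i] = pd.factorize(df_copy[i])[0]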

    ##Space for further feature Engineering##




    y = df_copy['Churn Label'].copy()
    X = df_copy.drop(columns = ['Churn Label']).copy()


    return X,y

In [None]:
X_train_fe, y_train_fe = feature_eng(train)
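
Because `pd.factorize` assigns codes in order of first appearance, encoding the train and test sets in separate calls can give the same category different integers. One possible remedy, sketched below, is to learn the mapping on the training column and reuse it on the test column. The helper names `fit_category_codes` and `apply_category_codes` and the toy values are hypothetical and not part of the notebook.

```python
import pandas as pd

def fit_category_codes(train_col):
    """Learn a value -> integer code mapping from the training column only."""
    categories = pd.Index(train_col.unique())
    return {value: code for code, value in enumerate(categories)}

def apply_category_codes(col, mapping, unseen_code=-1):
    """Encode a column with a learned mapping; unseen values get unseen_code."""
    return col.map(mapping).fillna(unseen_code).astype(int)

# Hypothetical toy columns; "Three year" never appears in the training data
toy_train = pd.Series(["Month-to-month", "Two year", "One year", "Two year"])
toy_test = pd.Series(["Two year", "Month-to-month", "Three year"])

mapping = fit_category_codes(toy_train)
train_codes = apply_category_codes(toy_train, mapping)
test_codes = apply_category_codes(toy_test, mapping)

# For comparison: a separate factorize call on the test column
naive_test_codes = pd.factorize(toy_test)[0]

print(list(train_codes))       # [0, 1, 2, 1]
print(list(test_codes))        # [1, 0, -1]
print(list(naive_test_codes))  # [0, 1, 2]
```

With the shared mapping, "Two year" is code 1 in both sets, whereas the separate `factorize` call gives it code 0 in the test column.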

In [None]:
X_train_fe.head()

## 4. Model Building and Training🏗️

**SMOTE has improved the recall of the model by ~16%, sacrificing 1-2% of accuracy**

**Hyperparameter selections**
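- `criterion`: the function used to measure the quality of a split.
- `min_samples_split`: the minimum number of samples a node must hold before it can be split.
- `min_samples_leaf`: the minimum number of samples that must remain in each leaf (terminal node).
- `max_depth`: the maximum depth of each individual tree, which limits the number of nodes in the tree.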
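
To get a feel for the cost of the search, the candidate combinations can be counted before fitting anything. The sketch below copies the grid values used in this notebook and counts them with scikit-learn's `ParameterGrid`. The variable names `n_candidates`, `n_folds` and `n_fits` are illustrative only.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid searched in this notebook
param_grid = {
    "classifier__criterion": ["squared_error", "friedman_mse"],
    "classifier__min_samples_split": [20, 25, 30],
    "classifier__min_samples_leaf": [5, 10, 20],
    "classifier__max_depth": [2, 3, 4, 5],
}

n_candidates = len(ParameterGrid(param_grid))  # 2 * 3 * 3 * 4
n_folds = 5                                    # StratifiedKFold(n_splits=5)
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # 72 360
```

Each of the 360 fits also runs SMOTE on its training folds, and `GridSearchCV` adds one final refit on the full training set.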

In [None]:
pipeline = imbpipeline(steps = [['smote', SMOTE(random_state=60)],
                                        ['classifier', GradientBoostingClassifier(random_state=60,
                                                                          max_depth=4)]])


stratified_kfold = StratifiedKFold(n_splits=5,
                                       shuffle=True,
                                       random_state=60)


param_grid = {'classifier__criterion':['squared_error', 'friedman_mse'],
             'classifier__min_samples_split':[20,25,30],
             'classifier__min_samples_leaf':[5,10,20],
             'classifier__max_depth':[2,3,4,5]}
grid_search = GridSearchCV(estimator=pipeline,
                            param_grid=param_grid,
                            scoring='accuracy',
                            cv=stratified_kfold,
                            n_jobs=-1)


grid_search.fit(X_train_fe, y_train_fe)
cv_score = grid_search.best_score_
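
The search above selects hyperparameters by accuracy. If recall on churners is the priority, one option is to pass a recall scorer for the `'Yes'` class instead. The sketch below shows the idea on synthetic data from `make_classification`; the toy data, the reduced grid and the name `churn_recall` are assumptions for illustration, not the notebook's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the churn data, with string labels
X_toy, y_num = make_classification(
    n_samples=400, n_features=8, weights=[0.75, 0.25], random_state=60
)
y_toy = np.where(y_num == 1, "Yes", "No")

# Recall computed for the 'Yes' (churner) class only
churn_recall = make_scorer(recall_score, pos_label="Yes")

toy_search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=60, n_estimators=30),
    param_grid={"max_depth": [2, 3]},
    scoring=churn_recall,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=60),
)
toy_search.fit(X_toy, y_toy)

print(toy_search.best_params_, round(toy_search.best_score_, 3))
```

In the notebook's pipeline the same scorer could be passed as `scoring=churn_recall`, with the grid keys keeping their `classifier__` prefix.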


In [None]:
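#GridSearchCV refits the best pipeline on the full training set by default (refit=True),
#so best_estimator_ is already fitted and needs no further call to fit
dt = grid_search.best_estimator_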

## 5. Model Prediction using the best hyperparameters 🔮

In [None]:
X_test_fe, y_test_fe = feature_eng(test)

In [None]:
print(f'Score on training set: {dt.score(X_train_fe, y_train_fe)}')
print(f'Score on testing set: {dt.score(X_test_fe, y_test_fe)}')

In [None]:
train_results = X_train_fe.copy()
train_results['y_pred'] = dt.predict(X_train_fe)
train_results['y_real'] = y_train_fe
train_results['y_prob'] = dt.predict_proba(X_train_fe)[:,1]

In [None]:
test_results = X_test_fe.copy()
test_results['y_pred'] = dt.predict(X_test_fe)
test_results['y_real'] = y_test_fe
test_results['y_prob'] = dt.predict_proba(X_test_fe)[:,1]

### Results on the Training Set

- Accuracy: 86%
- Recall on the desired class (Yes): 0.82

In [None]:
print(metrics.confusion_matrix(train_results['y_real'], train_results['y_pred']))
print(metrics.classification_report(train_results['y_real'], train_results['y_pred']))

### Results on the Testing Set

- Accuracy: 79%
- Recall on the desired class (Yes): 0.65

With recall this low, the model misses about a third of the actual churners (35%). Recall on churners is the metric we most want to maximise.
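
One way to trade precision for recall without retraining is to lower the probability threshold at which a customer is labelled a churner (`predict` uses 0.5). The sketch below illustrates this on hypothetical labels and probabilities. The arrays and the helper `scores_at` are made up for the example and are not results from the model.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predicted churn probabilities
y_real = np.array(["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"])
y_prob = np.array([0.90, 0.60, 0.45, 0.35, 0.55, 0.40, 0.30, 0.20, 0.10, 0.05])

def scores_at(threshold):
    """Return (recall, precision) on the 'Yes' class at a given threshold."""
    y_pred = np.where(y_prob >= threshold, "Yes", "No")
    recall = recall_score(y_real, y_pred, pos_label="Yes")
    precision = precision_score(y_real, y_pred, pos_label="Yes")
    return recall, precision

for threshold in (0.5, 0.4, 0.3):
    recall, precision = scores_at(threshold)
    print(f"threshold={threshold}: recall={recall:.2f}, precision={precision:.2f}")
```

On this toy data recall rises from 0.50 to 0.75 to 1.00 as the threshold falls, while precision declines. The `y_prob` column already stored in `test_results` could be thresholded in the same way, ideally with the threshold chosen on validation data rather than on the test set.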

In [None]:
print(metrics.confusion_matrix(test_results['y_real'], test_results['y_pred']))
print(metrics.classification_report(test_results['y_real'], test_results['y_pred']))

### Observing feature importance

In [None]:
Feature_imp = pd.DataFrame(
    {'Column': X_train_fe.columns,
     'Importance': dt.named_steps['classifier'].feature_importances_
    }).sort_values('Importance',ascending=False)
Feature_imp.head()

## 6. Insights from Model 🤯

- The gradient boosting model scores well on the training set (86% accuracy, 0.82 recall on churners).
- Recall on churners drops sharply from training to test (0.82 to 0.65), and accuracy drops from 86% to 79%.
- This gap indicates that the model is overfitting.
- Contract type is by far the most important feature for separating churners from retained customers.

**Areas to explore in the future**

- Trying different selections of features.
- Finding a method to reduce the overfitting of the model, potentially through further hyperparameter tuning.
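
As a starting point for the overfitting item, `GradientBoostingClassifier` offers several regularisation controls beyond the ones searched here: a smaller `learning_rate`, row subsampling via `subsample`, and early stopping via `validation_fraction` and `n_iter_no_change`. The sketch below shows them on synthetic data. The data and the parameter values are assumptions, not tuned settings for Swan Teleco.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with some label noise
X_toy, y_toy = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.1, random_state=60
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.2, random_state=60
)

regularised = GradientBoostingClassifier(
    random_state=60,
    n_estimators=500,         # upper bound; early stopping may use fewer trees
    learning_rate=0.05,       # smaller steps per tree
    max_depth=2,              # shallow trees
    subsample=0.8,            # each tree sees 80% of the training rows
    validation_fraction=0.2,  # held out internally for early stopping
    n_iter_no_change=10,      # stop after 10 rounds without improvement
)
regularised.fit(X_tr, y_tr)

train_acc = regularised.score(X_tr, y_tr)
test_acc = regularised.score(X_te, y_te)
print(regularised.n_estimators_, round(train_acc, 3), round(test_acc, 3))
```

Comparing `train_acc` with `test_acc` for different settings gives a quick read on whether the train-test gap is narrowing.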