# Classification

## Objectives

*   Fit and evaluate a classification model to predict if a prospect will churn or not.


## Inputs

* outputs/datasets/collection/TelcoCustomerChurn.csv
* Instructions on which variables to use for data cleaning and feature engineering. They are found in their respective notebooks.

## Outputs

* Train set (features and target)
* Test set (features and target)
* Data cleaning and Feature Engineering pipeline
* Modeling pipeline
* Feature importance plot

## Additional Comments | Insights | Conclusions


---

# Install and Import packages

* You may need to restart the runtime after installing packages; check the cell output when installing a package

In [None]:
! pip install feature-engine==1.0.2
! pip install scikit-learn==0.24.2
! pip install imbalanced-learn==0.8.0
! pip install xgboost==1.2.1

# Code for restarting the runtime; this will restart the Colab session
# It is good practice to do this after installing a package in a Colab session
import os
os.kill(os.getpid(), 9)

---

# Setup GPU

* Go to Edit → Notebook Settings
* In the Hardware accelerator menu, select GPU
* Note: when you select an option (GPU, TPU or None), you switch among kernels/sessions

---
* How do I know if I am using the GPU?
  * Run the code below. If the output is a device name such as '/device:GPU:0' rather than an empty string, you are using the GPU in this session


In [None]:
import tensorflow as tf
tf.test.gpu_device_name()

# **Connection between: Colab Session and your GitHub Repo**

### Insert your **credentials**

* The variable's content will exist only while the session exists. Once this session terminates, the variable's content will be erased permanently.

In [None]:
from getpass import getpass
import os
from IPython.display import clear_output 

print("=== Insert your credentials === \nType in and hit Enter")
os.environ['UserName'] = getpass('GitHub User Name: ')
os.environ['UserEmail'] = getpass('GitHub User E-mail: ')
os.environ['RepoName'] = getpass('GitHub Repository Name: ')
os.environ['UserPwd'] = getpass('GitHub Account Token: ')
clear_output()
print("* Thanks for inserting your credentials!")
print(f"* You may now Clone your Repo to this Session, "
      f"then Connect this Session to your Repo.")

* **Credentials format disclaimer**: when opening Jupyter notebooks in Colab that are hosted on GitHub, please avoid special characters in your **password** (token), like ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { } | ~
  * Otherwise the git push command will not work properly, since the credentials are concatenated into the command as username:password@github.com/username/repo and special characters break that URL
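
If a token does contain reserved characters, one common workaround is to percent-encode it before embedding it in the remote URL. The sketch below uses the standard library's `urllib.parse.quote`; the user name, repository name and token are made-up placeholders, and this step is not part of the original notebook.

```python
from urllib.parse import quote

def build_remote_url(user, token, repo):
    """Build an HTTPS remote URL with the credentials percent-encoded."""
    # safe="" forces every reserved character (including "/" and "@") to be encoded
    safe_user = quote(user, safe="")
    safe_token = quote(token, safe="")
    return f"https://{safe_user}:{safe_token}@github.com/{user}/{repo}.git"

# Hypothetical values, for illustration only
url = build_remote_url("octo-user", "p@ss/word#1", "my-repo")
print(url)
```

After encoding, the only literal `@` left in the URL is the one separating the credentials from the host.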

---

### **Clone** your GitHub Repo to your current Colab session

* So you can have access to your project's files

In [None]:
! git clone https://github.com/{os.environ['UserName']}/{os.environ['RepoName']}.git
! rm -rf sample_data   # remove content/sample_data folder, since we don't need it for this project

import os
if os.path.isdir(os.environ['RepoName']):
  print("\n")
  %cd /content/{os.environ['RepoName']}
  print(f"\n\n* Current session directory is:{os.getcwd()}")
  print(f"* You may refresh the session folder to access {os.environ['RepoName']} folder.")
else:
  print(f"\n* The Repo {os.environ['UserName']}/{os.environ['RepoName']} was not cloned."
        f" Please check your Credentials: UserName and RepoName")

---

### **Connect** this Colab session to your GitHub Repo

* So if you need, you can push files generated in this session to your Repo.

In [None]:
! git config --global user.email {os.environ['UserEmail']}
! git config --global user.name {os.environ['UserName']}
! git remote rm origin
! git remote add origin https://{os.environ['UserName']}:{os.environ['UserPwd']}@github.com/{os.environ['UserName']}/{os.environ['RepoName']}.git

# The logic is: create a temporary file in the session and update the repo; then delete this file and update the repo again.
# If it works, it is a sign that the session is connected to the repo.
import uuid
file_name = "session_connection_test_" + str(uuid.uuid4()) # generates a unique file name
with open(f"{file_name}.txt", "w") as file: file.write("text")
print("=== Testing Session Connectivity to the Repo === \n")
! git add . ; ! git commit -m {file_name + "_added_file"} ; ! git push origin main 
print("\n\n")
os.remove(f"{file_name}.txt")
! git add . ; ! git commit -m {file_name + "_removed_file"}; ! git push origin main

# delete your Credentials (username and password)
os.environ['UserName'] = os.environ['UserPwd'] = os.environ['UserEmail'] = ""

* If the output above indicates there was a **failure in the authentication**, please insert your credentials again.

---

# Load Data For Modelling

In [None]:
import numpy as np
import pandas as pd
df = (pd.read_csv("outputs/datasets/collection/TelcoCustomerChurn.csv")
      .drop(labels=['tenure','customerID','TotalCharges'],axis=1)  
                    # target variable for regressor, remove from classifier  
                    # drop other variables we will not need for this project
  )

df.info()

We already know upfront that the **Train Set Target (Churn) is imbalanced**
  * We will apply the SMOTE technique to handle that. That was covered in Develop & Deploy an AI System - Target Imbalance
  * Therefore, we will produce 2 ML Pipelines:
    * One for Data Cleaning and Feature Engineering
    * Another for Feature Scaling, Feature Selection and Modeling
  * The pipelines will be used to train and test the model, and to predict on live data

---

# ML Pipeline with all available data: Sklearn

## ML pipeline for Data Cleaning and Feature Engineering

* Load Estimators for pipelines

In [None]:
from sklearn.pipeline import Pipeline

### Feature Engineering
from feature_engine.selection import SmartCorrelatedSelection
from feature_engine.encoding import OrdinalEncoder

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### Feat Selection
from sklearn.feature_selection import SelectFromModel

### ML algorithms 
from sklearn.linear_model import LogisticRegression 
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier 
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier


* Data Cleaning and Feature Engineering

In [None]:
def PipelineDataCleaningAndFeatureEngineering():
  pipeline_base = Pipeline(
      [
      ("OrdinalCategoricalEncoder",OrdinalEncoder(encoding_method='arbitrary', 
                                                  variables = [ 'gender', 'Partner', 'Dependents', 'PhoneService',
                                                               'MultipleLines', 'InternetService', 'OnlineSecurity',
                                                               'OnlineBackup','DeviceProtection', 'TechSupport', 
                                                               'StreamingTV', 'StreamingMovies','Contract', 
                                                               'PaperlessBilling', 'PaymentMethod'])
      ),
       
      ("SmartCorrelatedSelection",SmartCorrelatedSelection(variables=None, method="spearman",
                                                           threshold=0.6, selection_method="variance")
      ),
       
    ]
  )

  return pipeline_base

## ML Pipeline for Modelling and Hyperparameter Optimization

Pipeline Optimization
* Feature Scaling
* Feature Selection
* Model

In [None]:
def PipelineClfSMOTE(model):
  pipeline_base = Pipeline(
      [
       ("scaler",StandardScaler() ),
       ("feat_selection",SelectFromModel(model) ),
       ("model",model ),
    ]
  )

  return pipeline_base

Custom class for hyperparameter optimization and for searching for the best model

In [None]:
from sklearn.model_selection import GridSearchCV

class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")

            model=  PipelineClfSMOTE(self.models[key])
            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs, verbose=verbose, scoring=scoring, )
            gs.fit(X,y)
            self.grid_searches[key] = gs    

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                 'estimator': key,
                 'min_score': min(scores),
                 'max_score': max(scores),
                 'mean_score': np.mean(scores),
                 'std_score': np.std(scores),
            }
            return pd.Series({**params,**d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]        
                scores.append(r.reshape(len(params),1))

            all_scores = np.hstack(scores)
            for p, s in zip(params,all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)
        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]
        return df[columns], self.grid_searches
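
The reshaping inside `score_summary` can be hard to follow at first. `cv_results_` stores one array per fold (`split0_test_score`, `split1_test_score`, ...), each with one entry per parameter candidate, and the method regroups them so that every candidate gets its own row of fold scores. A pure-Python sketch of that regrouping, using made-up scores for 2 candidates and 3 folds:

```python
from statistics import mean, pstdev

# Made-up stand-in for GridSearchCV.cv_results_: one list per fold,
# each holding one score per parameter candidate
cv_results = {
    "params": [{"model__max_depth": 3}, {"model__max_depth": 10}],
    "split0_test_score": [0.80, 0.70],
    "split1_test_score": [0.82, 0.74],
    "split2_test_score": [0.84, 0.72],
}
n_folds = 3

per_fold = [cv_results[f"split{i}_test_score"] for i in range(n_folds)]
# zip(*per_fold) transposes fold-major scores into one tuple per candidate
per_candidate = list(zip(*per_fold))

summary = []
for params, scores in zip(cv_results["params"], per_candidate):
    summary.append({
        **params,
        "min_score": min(scores),
        "max_score": max(scores),
        "mean_score": mean(scores),
        "std_score": pstdev(scores),  # population std, like np.std's default
    })

# Best candidate first, mirroring sort_values(ascending=False)
summary.sort(key=lambda row: row["mean_score"], reverse=True)
for row in summary:
    print(row)
```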


## Split Train and Test Set

* Quick recap of our dataset

In [None]:
print(df.shape)
df.head(3)

* Split Train and Test Sets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df.drop(['Churn'],axis=1),
                                    df['Churn'],
                                    test_size = 0.2,
                                    random_state = 0,
                                    )

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

## SMOTE: deal with Target Imbalance

Fit DataCleaning And FeatureEngineering Pipeline
  * It is used to process train data, so SMOTE can be applied before training the model

In [None]:
pipeline_data_cleaning_feat_eng = PipelineDataCleaningAndFeatureEngineering()
X_train = pipeline_data_cleaning_feat_eng.fit_transform(X_train)
X_test = pipeline_data_cleaning_feat_eng.transform(X_test)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

Let's check what it looks like

In [None]:
X_train.head(3)

Check Train Set Target distribution

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
y_train.value_counts().plot(kind='bar',title='Train Set Target Distribution')
plt.show()
print("\n* Class proportion on Train Set\n", y_train.value_counts(normalize=True).to_frame().round(2))
print("\n* Class proportion on Test Set\n", y_test.value_counts(normalize=True).to_frame().round(2))

Use SMOTE to balance Train Set target

In [None]:
from imblearn.over_sampling import SMOTE
oversample = SMOTE(sampling_strategy='minority', random_state=0)
X_train, y_train = oversample.fit_resample(X_train, y_train)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
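
Conceptually, SMOTE creates each synthetic minority sample by picking a minority point, choosing one of its nearest minority neighbours, and placing a new point somewhere on the line segment between them. The sketch below illustrates only that interpolation step on made-up 2-D points with a single nearest neighbour; it is a simplification, not the `imblearn` implementation, and the helper names are hypothetical.

```python
import math
import random

def nearest_neighbour(point, others):
    """Return the point in `others` closest to `point` (Euclidean distance)."""
    return min(others, key=lambda other: math.dist(point, other))

def synthetic_sample(point, neighbour, rng):
    """Interpolate a new point at a random position between point and neighbour."""
    gap = rng.random()  # uniform in [0, 1)
    return tuple(p + gap * (n - p) for p, n in zip(point, neighbour))

rng = random.Random(0)
minority = [(1.0, 1.0), (2.0, 1.5), (8.0, 9.0)]  # made-up minority-class rows

base = minority[0]
neighbour = nearest_neighbour(base, [m for m in minority if m != base])
new_point = synthetic_sample(base, neighbour, rng)
print(neighbour, new_point)
```

Because the new point always lies between two existing minority rows, the synthetic rows stay inside the region already covered by the minority class.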

Check Train Set Target distribution after SMOTE

In [None]:
import matplotlib.pyplot as plt
y_train.value_counts().plot(kind='bar',title='Train Set Target Distribution')
plt.show()
print("\n* Class proportion on Train Set\n", y_train.value_counts(normalize=True).to_frame().round(2))
print("\n* Class proportion on Test Set\n",y_test.value_counts(normalize=True).to_frame().round(2))

## Grid Search CV - Sklearn

### Use standard hyper parameters to find most suitable model

Define models and parameters, for Quick Search

In [None]:
models_quick_search = {
    "XGBClassifier":XGBClassifier(random_state=0),
    "DecisionTreeClassifier":DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier":RandomForestClassifier(random_state=0),
    "GradientBoostingClassifier":GradientBoostingClassifier(random_state=0),
    "ExtraTreesClassifier":ExtraTreesClassifier(random_state=0),
    "AdaBoostClassifier":AdaBoostClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(random_state=0),
}

# empty dictionaries mean each model is fitted with its default hyperparameters
params_quick_search = {
    "XGBClassifier":{},
    "DecisionTreeClassifier":{},
    "RandomForestClassifier":{},
    "GradientBoostingClassifier":{},
    "ExtraTreesClassifier":{},
    "AdaBoostClassifier":{},
    "LogisticRegression":{},
}

Quick GridSearch CV

In [None]:
from sklearn.metrics import f1_score, make_scorer
search = HyperparameterOptimizationSearch(models=models_quick_search, params=params_quick_search)
search.fit(X_train, y_train,
           scoring =  make_scorer(f1_score, pos_label=1),
           n_jobs=-1, cv=5)
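
The searches in this notebook score every candidate with the F1 score of the positive class (`pos_label=1`, churn). F1 is the harmonic mean of precision and recall, so it only rewards a model that both finds churners and avoids false alarms. A small worked example on made-up labels, where `f1_for_label` is a hypothetical helper written for illustration:

```python
def f1_for_label(y_true, y_pred, pos_label=1):
    """F1 score for one class, from true positives, false positives and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    if tp == 0:
        return 0.0  # no true positives: treat the score as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 1 = churn, 0 = no churn
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
score = f1_for_label(y_true, y_pred)
print(round(score, 3))
```

Here there are 3 true positives, 1 false positive and 1 false negative, so precision and recall are both 0.75 and so is F1.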

Check results

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary 

Check the best model

In [None]:
best_model = grid_search_summary.iloc[0,0]
best_model

### Do extensive search on most suitable model to find best hyperparameter configuration

Define model and parameters, for Extensive Search

In [None]:
models_search = {
    "XGBClassifier":XGBClassifier(random_state=0),
}

# documentation to help on hyperparameter list: 
# https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn

# We will not conduct an extensive search, since the focus
# is on how to combine all knowledge in an applied project.
# In a workplace project, you may spend more time in this step
params_search = {
    "XGBClassifier":{
        'model__learning_rate': [1e-1,1e-2,1e-3], 
        'model__max_depth': [3,10],
    }
}

Extensive GridSearch CV

In [None]:
from sklearn.metrics import f1_score, make_scorer
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train,
           scoring =  make_scorer(f1_score, pos_label=1),
           n_jobs=-1, cv=5)

Check results

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary 

Check the best model

In [None]:
best_model = grid_search_summary.iloc[0,0]
best_model

Parameters for best model
* We are saving this content for later

In [None]:
best_parameters = grid_search_pipelines[best_model].best_params_
best_parameters

Define the best clf pipeline

In [None]:
pipeline_clf = grid_search_pipelines[best_model].best_estimator_
pipeline_clf

## Assess feature importance

* With the current model, we can assess feature importance with `.feature_importances_`

In [None]:
best_features = X_train.columns[pipeline_clf['feat_selection'].get_support()].to_list()

# create DataFrame to display feature importance
df_feature_importance = (pd.DataFrame(data={
    'Attribute': X_train.columns[pipeline_clf['feat_selection'].get_support()],
    'Importance': pipeline_clf['model'].feature_importances_})
.sort_values(by='Importance', ascending=False)
)

best_features = df_feature_importance['Attribute'].to_list() # re-assign best_features order

# Most important features statement and plot
print(f"* These are the {len(best_features)} most important features in descending order. "
      f"The model was trained on them: \n{df_feature_importance['Attribute'].to_list()}")

df_feature_importance.plot(kind='bar',x='Attribute',y='Importance')
plt.show()
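
`get_support()` returns a boolean mask with one entry per input column, while `feature_importances_` has one entry per *selected* column. That is why the mask is applied to the column names before they are paired with the importances. A pure-Python sketch of that alignment, with made-up column names, mask and importances:

```python
from itertools import compress

# Made-up stand-ins for X_train.columns, get_support() and feature_importances_
columns = ["feat_a", "feat_b", "feat_c", "feat_d"]
support_mask = [True, False, True, False]   # one flag per input column
importances = [0.3, 0.7]                    # one value per selected column

# Keep only the column names whose mask entry is True
selected = list(compress(columns, support_mask))

# Pair each selected column with its importance and sort in descending order
ranked = sorted(zip(selected, importances), key=lambda pair: pair[1], reverse=True)
best = [name for name, _ in ranked]
print(best)
```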

We will save the most important features to fit a new pipeline

In [None]:
best_features_with_all_variables = best_features
best_features_with_all_variables

## Evaluate Classifier on Train and Test Sets

 Custom Function

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

def PredictionEvaluation(X,y,pipeline,LabelsMap):

  prediction = pipeline.predict(X)
  Map = list() 
  for key, value in LabelsMap.items():
    Map.append( str(key) + ": " + value)

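  print('---  Confusion Matrix  ---')
  # y_true and y_pred are swapped on purpose, so that rows are predictions and columns are actual values
  print(pd.DataFrame(confusion_matrix(y_true=prediction, y_pred=y),
        columns=[ ["Actual " + sub for sub in Map] ], 
        index= [ ["Prediction " + sub for sub in Map ]]
        ))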
  print("\n")


  print('---  Classification Report  ---')
  print(classification_report(y, prediction),"\n")



def PerformanceTrainTestSet(X_train,y_train,X_test,y_test,pipeline,LabelsMap):
  print("#### Train Set #### \n")
  PredictionEvaluation(X_train,y_train,pipeline,LabelsMap)

  print("#### Test Set ####\n")
  PredictionEvaluation(X_test,y_test,pipeline,LabelsMap)
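
The printed table puts predictions on the rows and actual values on the columns. The pure-Python sketch below builds a table in that same orientation from made-up labels, which may help when reading the output; `confusion_table` is a hypothetical helper, not part of the notebook.

```python
def confusion_table(actual, predicted, labels):
    """Count pairs into table[predicted_label][actual_label] (rows = prediction)."""
    table = {p: {a: 0 for a in labels} for p in labels}
    for a, p in zip(actual, predicted):
        table[p][a] += 1
    return table

actual    = [1, 1, 1, 0, 0, 0, 0, 1]   # made-up ground truth
predicted = [1, 0, 1, 0, 1, 0, 0, 1]   # made-up model output
table = confusion_table(actual, predicted, labels=[0, 1])

print("             Actual 0  Actual 1")
for p in (0, 1):
    print(f"Prediction {p} {table[p][0]:8d} {table[p][1]:9d}")
```

In this orientation, the off-diagonal cell on the "Prediction 1" row counts false alarms, and the one on the "Prediction 0" row counts missed churners.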

 Evaluation

In [None]:
PerformanceTrainTestSet(X_train=X_train, y_train=y_train,
                        X_test=X_test, y_test=y_test,
                        pipeline=pipeline_clf,
                        LabelsMap= {0:"No Churn", 1:"Yes Churn"})

# Refit pipeline with best features

## New ML Pipeline

In theory, a pipeline fitted **using only the most important features** should give the same result as the one fitted with **all variables and feature selection**

* However, in this project we have an oversampling step, which balances the Train Set target using SMOTE()
* We should remember that the Train Set with all features is different from the Train Set with only the best features (since it has fewer variables)
* Therefore the Train Set after applying SMOTE() will be slightly different, which means the performance will be slightly different. We should expect that. What we should not expect is a big difference
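
One way to see why the resampled Train Set changes: SMOTE interpolates between a minority sample and its nearest minority neighbours, and which rows count as "nearest" depends on the columns used to measure distance. A small sketch with made-up points, where dropping one feature changes the nearest neighbour (`nearest_index` is a hypothetical helper):

```python
import math

def nearest_index(points, query_idx, feature_idx):
    """Index of the point closest to points[query_idx], using only the given features."""
    def project(point):
        return [point[i] for i in feature_idx]
    query = project(points[query_idx])
    candidates = [i for i in range(len(points)) if i != query_idx]
    return min(candidates, key=lambda i: math.dist(query, project(points[i])))

# Made-up minority rows with three features each
points = [
    (0.0, 0.0, 0.0),   # query row
    (1.0, 0.0, 5.0),   # close on features 0-1, far on feature 2
    (2.0, 2.0, 0.0),   # farther on features 0-1, identical on feature 2
]

all_features = nearest_index(points, 0, feature_idx=[0, 1, 2])
two_features = nearest_index(points, 0, feature_idx=[0, 1])
print(all_features, two_features)
```

With all three features the nearest neighbour is row 2; with only the first two it is row 1, so the synthetic samples would be drawn along a different segment.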

This new pipeline should consider only the set of most important features

In [None]:
best_features_with_all_variables

## Rewrite ML pipeline for Data Cleaning and Feature Engineering

New Pipeline for DataCleaning And FeatureEngineering

In [None]:
def PipelineDataCleaningAndFeatureEngineering():
  pipeline_base = Pipeline(
      [

      ("OrdinalCategoricalEncoder",OrdinalEncoder(encoding_method='arbitrary',
                                                  variables = [ 'InternetService', 'Contract']
                                                  )
      ),
       
    ]
  )

  return pipeline_base

## Rewrite ML Pipeline for Modelling

Function for Pipeline optimization

In [None]:
# Pipeline Optimization: Feature Scaling and Model
# There is no feature selection step
def PipelineClfSMOTE(model):
  pipeline_base = Pipeline(
      [
       ("scaler",StandardScaler()),
       # no feature selection here!!!
       ("model",model ),
    ]
  )

  return pipeline_base


## Split Train Test Set, considering only the best features

* Split Train and Test Sets

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test,y_train, y_test = train_test_split(
                                    df.drop(['Churn'],axis=1),
                                    df['Churn'],
                                    test_size = 0.2,
                                    random_state = 0,
                                    )

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

We filter only the most important variables

In [None]:
X_train = X_train.filter(best_features_with_all_variables)
X_test = X_test.filter(best_features_with_all_variables)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

In [None]:
X_train.head(3)

## SMOTE: deal with Target Imbalance

* Fit DataCleaning And FeatureEngineering Pipeline
  * It is used to process train data, so SMOTE can be applied before training the model

In [None]:
pipeline_data_cleaning_feat_eng = PipelineDataCleaningAndFeatureEngineering()
X_train = pipeline_data_cleaning_feat_eng.fit_transform(X_train)
X_test = pipeline_data_cleaning_feat_eng.transform(X_test)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

* Check Train Set Target distribution

In [None]:
import matplotlib.pyplot as plt
y_train.value_counts().plot(kind='bar', title='Train Set Target Distribution')
plt.show()
print("\n* Class proportion on Train Set\n", y_train.value_counts(normalize=True).to_frame().round(2))
print("\n* Class proportion on Test Set\n",y_test.value_counts(normalize=True).to_frame().round(2))

* Use SMOTE to balance Train Set target

In [None]:
from imblearn.over_sampling import SMOTE
oversample = SMOTE(sampling_strategy='minority', random_state=0)
X_train, y_train = oversample.fit_resample(X_train, y_train)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

* Check Train Set Target distribution after SMOTE

In [None]:
import matplotlib.pyplot as plt
y_train.value_counts().plot(kind='bar',title='Train Set Target Distribution')
plt.show()
print("\n* Class proportion on Train Set\n", y_train.value_counts(normalize=True).to_frame().round(2))
print("\n* Class proportion on Test Set\n",y_test.value_counts(normalize=True).to_frame().round(2))

## Grid Search CV: Sklearn

* Using the most suitable model from the last section and its best hyperparameter configuration

We are using the same model from the last GridCV search

In [None]:
models_search

And the best parameters from the last GridCV search 

In [None]:
best_parameters

The values need to be entered as lists, since GridSearchCV expects a list of candidate values per hyperparameter. The previous dictionary is not in this format, so here we type them in manually

In [None]:
params_search = {'XGBClassifier':  {
    'model__learning_rate': [0.01],   # the value should be in []
    'model__max_depth': [3]}, # the value should be in []
}
params_search
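
As a hedged alternative to typing the values by hand, the single-valued dictionary returned by `best_params_` can be wrapped into the list format with a dict comprehension. The values below are copied from the cell above as an example; in the notebook you would pass `best_parameters` instead. The `_example` names are introduced here only for illustration.

```python
# Example stand-in for `best_parameters` (the output of .best_params_)
best_parameters_example = {"model__learning_rate": 0.01, "model__max_depth": 3}

# GridSearchCV expects each hyperparameter mapped to a list of candidate values
params_search_example = {
    "XGBClassifier": {name: [value] for name, value in best_parameters_example.items()}
}
print(params_search_example)
```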

GridSearch CV

In [None]:
from sklearn.metrics import f1_score, make_scorer
quick_search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
quick_search.fit(X_train, y_train,
                 scoring =  make_scorer(f1_score, pos_label=1),
                 n_jobs=-1, cv=5)

Check results

In [None]:
grid_search_summary, grid_search_pipelines = quick_search.score_summary(sort_by='mean_score')
grid_search_summary 

Check the best model

In [None]:
best_model = grid_search_summary.iloc[0,0]
best_model

Parameters for best model

In [None]:
grid_search_pipelines[best_model].best_params_

Define the best clf pipeline

In [None]:
pipeline_clf = grid_search_pipelines[best_model].best_estimator_
pipeline_clf

## Assess feature importance

In [None]:
best_features = X_train.columns

# create DataFrame to display feature importance
df_feature_importance = (pd.DataFrame(data={
    'Attribute': X_train.columns,
    'Importance': pipeline_clf['model'].feature_importances_})
.sort_values(by='Importance', ascending=False)
)


# Most important features statement and plot
print(f"* These are the {len(best_features)} most important features in descending order. "
      f"The model was trained on them: \n{df_feature_importance['Attribute'].to_list()}")

df_feature_importance.plot(kind='bar',x='Attribute',y='Importance')
plt.show()

## Evaluate Classifier on Train and Test Sets

In [None]:
PerformanceTrainTestSet(X_train=X_train, y_train=y_train,
                        X_test=X_test, y_test=y_test,
                        pipeline=pipeline_clf,
                        LabelsMap= {0:"No Churn", 1:"Yes Churn"})

# Push files to Repo

We will generate the following files
* Train set
* Test set
* Data cleaning and Feature Engineering pipeline
* Modeling pipeline
* Feature importance plot

In [None]:
import joblib
import os

version = 'v1'
file_path = f'outputs/ml_pipeline/predict_churn/{version}'

try:
  os.makedirs(name=file_path)
except Exception as e:
  print(e)
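
If the cell is re-run, `os.makedirs` raises `FileExistsError` because the folder already exists, which the `try/except` above simply prints. A possible alternative is `exist_ok=True`, sketched here against a temporary directory so that it does not touch the project folder:

```python
import os
import tempfile

version = "v1"
with tempfile.TemporaryDirectory() as base_dir:
    demo_path = os.path.join(base_dir, "outputs", "ml_pipeline", "predict_churn", version)
    os.makedirs(demo_path, exist_ok=True)   # creates all intermediate folders
    os.makedirs(demo_path, exist_ok=True)   # second call is a no-op, no exception
    created = os.path.isdir(demo_path)
print(created)
```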

* Note that the Train set and Test set are not in the same format as when they were split
* That is because we needed to apply SMOTE to the Train Set

## Train Set

* Note that the variables **are already transformed** in X_train and the shape is 8266 rows after SMOTE was applied

In [None]:
print(X_train.shape)
X_train.head()

In [None]:
X_train.to_csv(f"{file_path}/X_train.csv", index=False)

In [None]:
y_train

In [None]:
y_train.to_csv(f"{file_path}/y_train.csv", index=False)

## Test Set

* Note that the variables are already transformed in X_test

In [None]:
print(X_test.shape)
X_test.head()

In [None]:
X_test.to_csv(f"{file_path}/X_test.csv", index=False)

In [None]:
y_test

In [None]:
y_test.to_csv(f"{file_path}/y_test.csv", index=False)

## ML Pipelines: Data Cleaning and Feat Eng pipeline and Modelling Pipeline

We will save 2 pipelines: 
* Both should be used in conjunction to predict Live Data
* To predict on Train Set, Test Set we use only pipeline_clf, since the data is already processed



Pipeline responsible for Data Cleaning and Feature Engineering


In [None]:
pipeline_data_cleaning_feat_eng

In [None]:
joblib.dump(value=pipeline_data_cleaning_feat_eng ,
            filename=f"{file_path}/clf_pipeline_data_cleaning_feat_eng.pkl")

* Pipeline responsible for Feature Scaling, and Model

In [None]:
pipeline_clf

In [None]:
joblib.dump(value=pipeline_clf ,
            filename=f"{file_path}/clf_pipeline_model.pkl")

## Feature Importance plot

In [None]:
df_feature_importance.plot(kind='bar',x='Attribute',y='Importance')
plt.show()

In [None]:
df_feature_importance.plot(kind='bar', x='Attribute', y='Importance')
plt.savefig(f'{file_path}/features_importance.png', bbox_inches='tight')

---

## **Push** generated/new files from this Session to GitHub repo

* Git status

In [None]:
! git status

* Git commit

In [None]:
CommitMsg = "added-files-predict-churn"
! git add .
! git commit -m {CommitMsg}

* Git Push

In [None]:
! git push origin main