# Regression

## Objectives

*   Fit and evaluate a regression model to predict tenure levels for a prospect that will likely churn


## Inputs

* outputs/datasets/collection/TelcoCustomerChurn.csv
* Instructions on which variables to use for data cleaning and feature engineering. They are found in their respective notebooks.

## Outputs

* Train set (features and target)
* Test set (features and target)
* ML pipeline to predict tenure
* labels map
* Feature Importance Plot

## Additional Comments | Insights | Conclusions


---

# Install and Import packages

* You may need to restart the runtime after installing packages; check the cell output when installing a package

In [None]:
! pip install feature-engine==1.0.2
! pip install scikit-learn==0.24.2
! pip install xgboost==1.2.1

# Code for restarting the runtime, that will restart colab session
# It is a good practice after you install a package in a colab session
import os
os.kill(os.getpid(), 9)

---

# Setup GPU

* Go to Edit → Notebook Settings
* In the Hardware accelerator menu, select GPU
* note: when you select an option, either GPU, TPU or None, you switch among kernels/sessions

---
* How to know if I am using the GPU?
  * run the code below; if the output is a non-empty string such as '/device:GPU:0', you are using a GPU in this session


In [None]:
import tensorflow as tf
tf.test.gpu_device_name()

# **Connection between: Colab Session and your GitHub Repo**

### Insert your **credentials**

* The variable's content will exist only while the session exists. Once this session terminates, the variable's content will be erased permanently.

In [None]:
from getpass import getpass
import os
from IPython.display import clear_output 

print("=== Insert your credentials === \nType in and hit Enter")
os.environ['UserName'] = getpass('GitHub User Name: ')
os.environ['UserEmail'] = getpass('GitHub User E-mail: ')
os.environ['RepoName'] = getpass('GitHub Repository Name: ')
os.environ['UserPwd'] = getpass('GitHub Account Token: ')
clear_output()
print("* Thanks for inserting your credentials!")
print(f"* You may now Clone your Repo to this Session, "
      f"then Connect this Session to your Repo.")

* **Credentials format disclaimer**: when opening Jupyter notebooks in Colab that are hosted on GitHub, avoid special characters in your **password** (token), such as @ ! " # $ % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ ` { } | ~
  * Otherwise the git push command will not work properly, because the credentials are concatenated into the remote URL: username:password@github.com/username/repo
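If a token does contain reserved characters, one possible workaround is to percent-encode it before it is placed in the remote URL. The following is a minimal sketch using only the standard library; `build_remote_url` is a hypothetical helper that is not part of this notebook, and the credentials shown are made up.

```python
from urllib.parse import quote


def build_remote_url(user: str, token: str, repo: str) -> str:
    """Return an HTTPS remote URL with percent-encoded credentials.

    Hypothetical helper, for illustration only.
    """
    # safe="" makes quote() encode every reserved character, including "/" and "@"
    user_enc = quote(user, safe="")
    token_enc = quote(token, safe="")
    return f"https://{user_enc}:{token_enc}@github.com/{user}/{repo}.git"


url = build_remote_url("octocat", "p@ss:w/rd", "demo-repo")
print(url)
```

Encoding keeps the `@` and `:` separators of the URL unambiguous, which is the failure mode the disclaimer describes.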

---

### **Clone** your GitHub Repo to your current Colab session

* So you can have access to your project's files

In [None]:
! git clone https://github.com/{os.environ['UserName']}/{os.environ['RepoName']}.git
! rm -rf sample_data   # remove content/sample_data folder, since we don't need it for this project

import os
if os.path.isdir(os.environ['RepoName']):
  print("\n")
  %cd /content/{os.environ['RepoName']}
  print(f"\n\n* Current session directory is:{os.getcwd()}")
  print(f"* You may refresh the session folder to access {os.environ['RepoName']} folder.")
else:
  print(f"\n* The Repo {os.environ['UserName']}/{os.environ['RepoName']} was not cloned."
        f" Please check your Credentials: UserName and RepoName")

---

### **Connect** this Colab session to your GitHub Repo

* So if you need, you can push files generated in this session to your Repo.

In [None]:
! git config --global user.email {os.environ['UserEmail']}
! git config --global user.name {os.environ['UserName']}
! git remote rm origin
! git remote add origin https://{os.environ['UserName']}:{os.environ['UserPwd']}@github.com/{os.environ['UserName']}/{os.environ['RepoName']}.git

# the logic is: create a temporary file in the session and push it to the repo. Then delete the file and update the repo
# If it works, it is a sign that the session is connected to the repo.
import uuid
file_name = "session_connection_test_" + str(uuid.uuid4()) # generates a unique file name
with open(f"{file_name}.txt", "w") as file: file.write("text")
print("=== Testing Session Connectivity to the Repo === \n")
! git add . ; ! git commit -m {file_name + "_added_file"} ; ! git push origin main 
print("\n\n")
os.remove(f"{file_name}.txt")
! git add . ; ! git commit -m {file_name + "_removed_file"}; ! git push origin main

# delete your Credentials (username and password)
os.environ['UserName'] = os.environ['UserPwd'] = os.environ['UserEmail'] = ""

* If the output above indicates there was a **failure in the authentication**, please insert your credentials again.

---

# Load Data for Modelling

In [None]:
import numpy as np
import pandas as pd
df = (pd.read_csv("outputs/datasets/collection/TelcoCustomerChurn.csv")
      .query("Churn == 1")  # subset churned customers
      .drop(labels=['customerID','TotalCharges','Churn'],axis=1)  
                    # variables we will not need for this project
                    # we will not need Churn, since it has only the value 1
  )

df.info()

# Regressor: Sklearn

* In the ML Pipeline, we are using the raw data, therefore we need to create the pipeline with data cleaning and feature engineering steps
  * This pipeline will be used on the Train Set, Test Set and Live Data

### ML pipeline for Data Cleaning and Feature Engineering

Load Estimators

In [None]:
from sklearn.pipeline import Pipeline

### Feature Engineering
from feature_engine.encoding import OrdinalEncoder
from feature_engine.selection import SmartCorrelatedSelection


### Feat Scaling
from sklearn.preprocessing import StandardScaler

### Feat Selection
from sklearn.feature_selection import SelectFromModel

### ML algorithms 
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.ensemble import GradientBoostingRegressor,RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge,RidgeCV
from sklearn.linear_model import BayesianRidge
from sklearn.linear_model import SGDRegressor
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import ExtraTreesRegressor

Pipeline for Data Cleaning and Feat Eng

In [None]:
def PipelineDataCleaningAndFeatureEngineering():
  pipeline_base = Pipeline(
      [

      ("OrdinalCategoricalEncoder",OrdinalEncoder(encoding_method='arbitrary', 
                                                  variables = [ 'gender', 'Partner', 'Dependents', 'PhoneService',
                                                               'MultipleLines', 'InternetService', 'OnlineSecurity',
                                                               'OnlineBackup','DeviceProtection', 'TechSupport', 
                                                               'StreamingTV', 'StreamingMovies','Contract', 
                                                               'PaperlessBilling', 'PaymentMethod']
                                                  )
      ),      
       

      ("SmartCorrelatedSelection",SmartCorrelatedSelection(variables=None, method="spearman",
                                                           threshold=0.6,selection_method="variance")
      ),
       
    ]
  )

  return pipeline_base
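To make the correlated-feature step more concrete, here is a simplified sketch of the idea behind `method="spearman"`, `threshold=0.6` and `selection_method="variance"`: within a highly correlated pair, keep the column with the larger variance. This is an illustration in plain pandas on a made-up frame, not feature_engine's actual implementation; `drop_correlated` is a hypothetical helper.

```python
import pandas as pd

# Toy numeric frame: "b" is a monotonic function of "a", "c" is unrelated
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 12],
    "c": [3, 1, 4, 1, 5, 2],
})


def drop_correlated(frame, threshold=0.6):
    """For each pair above the threshold, drop the lower-variance column."""
    # Spearman correlation is the Pearson correlation of the ranks
    corr = frame.rank().corr().abs()
    variances = frame.var()
    dropped = set()
    cols = list(frame.columns)
    for i, left in enumerate(cols):
        for right in cols[i + 1:]:
            if left in dropped or right in dropped:
                continue
            if corr.loc[left, right] > threshold:
                loser = left if variances[left] < variances[right] else right
                dropped.add(loser)
    return frame.drop(columns=sorted(dropped))


selected = drop_correlated(toy)
print(selected.columns.tolist())
```

Here "a" and "b" are perfectly rank-correlated, so only the higher-variance "b" survives alongside "c".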

## ML Pipeline for Modelling and Hyperparameter Optimization

* Pipeline Optimization: Add Feature Scaling, Feature Selection and Model

In [None]:
def PipelineOptimization(model):
  pipe = PipelineDataCleaningAndFeatureEngineering()
  pipe.steps.append(["scaler", StandardScaler()])
  pipe.steps.append(["feat_selection", SelectFromModel(model)])
  pipe.steps.append(["model", model])
  
  return pipe

* Custom Class for Hyperparameter Optimization

In [None]:
from sklearn.model_selection import GridSearchCV
class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model=  PipelineOptimization(self.models[key])

            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs, verbose=verbose, scoring=scoring)
            gs.fit(X,y)
            self.grid_searches[key] = gs    

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                 'estimator': key,
                 'min_score': min(scores),
                 'max_score': max(scores),
                 'mean_score': np.mean(scores),
                 'std_score': np.std(scores),
            }
            return pd.Series({**params,**d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]        
                scores.append(r.reshape(len(params),1))

            all_scores = np.hstack(scores)
            for p, s in zip(params,all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns], self.grid_searches
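The reshaping inside `score_summary` can be hard to follow at first read. The sketch below reproduces just that step on a hypothetical `cv_results_` fragment (two parameter candidates, three splits, made-up scores): each split contributes one column, and `np.hstack` produces one row of scores per candidate.

```python
import numpy as np

# Hypothetical cv_results_ fragment: 2 parameter candidates, 3 CV splits
cv_results = {
    "params": [{"model__max_depth": 3}, {"model__max_depth": 10}],
    "split0_test_score": np.array([0.50, 0.40]),
    "split1_test_score": np.array([0.60, 0.30]),
    "split2_test_score": np.array([0.70, 0.50]),
}
n_splits = 3
n_candidates = len(cv_results["params"])

# Each split becomes a column vector; stacking gives shape (candidates, splits)
columns = [
    cv_results[f"split{i}_test_score"].reshape(n_candidates, 1)
    for i in range(n_splits)
]
all_scores = np.hstack(columns)

for params, scores in zip(cv_results["params"], all_scores):
    print(params, "min:", scores.min(), "mean:", round(scores.mean(), 3), "max:", scores.max())
```

The min, mean, max and std columns of the summary table are then computed row by row from this array.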


## Split Train Test Set

Quick recap of our raw dataset

In [None]:
print(df.shape)
df.head(3)

Split Train and Test Set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(
                                    df.drop(['tenure'],axis=1),
                                    df['tenure'],
                                    test_size=0.2,
                                    random_state=0
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

## Target Distribution Analysis

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style("whitegrid")

def Target_DistributionAndStats(y_train,y_test):

  figure, ax = plt.subplots(nrows=1, ncols=2,figsize=(12,3))
  sns.histplot(x=y_train, kde=True,ax=ax[0]).set(title='y train')
  sns.histplot(x=y_test, kde=True,ax=ax[1]).set(title='y test')
  plt.show();

  print("\n* Train set - target descriptive stats: \n", y_train.describe().to_frame().round(3))
  print(f"\n* Train set skewness: {y_train.skew().round(3)}, and kurtosis: {y_train.kurt().round(3)} \n\n")

In [None]:
Target_DistributionAndStats(y_train, y_test)

Target Transformation

In [None]:
import sklearn.preprocessing
target_transformer = sklearn.preprocessing.PowerTransformer(method='box-cox',standardize=True)

y_train = target_transformer.fit_transform(y_train.to_frame())
y_train = pd.Series(y_train.reshape(-1), name='tenure')

y_test = target_transformer.transform(y_test.to_frame())
y_test = pd.Series(y_test.reshape(-1), name='tenure')

# target_transformer.inverse_transform(y_test.values.reshape(-1,1)) # test for inverse_transform
Target_DistributionAndStats(y_train,y_test)
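Because the target is transformed here, any model fitted afterwards predicts on the Box-Cox scale rather than in months. A minimal sketch of the round trip on made-up values, assuming strictly positive inputs (Box-Cox rejects zeros and negatives):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Made-up, strictly positive tenure values in months
tenure_months = np.array(
    [[1.0], [2.0], [3.0], [5.0], [8.0], [13.0], [21.0], [34.0], [55.0], [72.0]]
)

transformer = PowerTransformer(method="box-cox", standardize=True)
transformed = transformer.fit_transform(tenure_months)

# Values on the transformed scale map back to months with inverse_transform
recovered = transformer.inverse_transform(transformed)
print(np.round(recovered.ravel(), 2))
```

The same `inverse_transform` call can be applied to reshaped predictions when a result in months is needed.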

## Grid Search CV - Sklearn

### Use standard hyper parameters to find most suitable model

Define models for quick search

In [None]:
models_quick_search = {
    'LinearRegression': LinearRegression(),
    "Ridge": Ridge(random_state=0),
    "RidgeCV": RidgeCV(),
    "BayesianRidge": BayesianRidge(),
    "SGDRegressor": SGDRegressor(random_state=0),
    "ElasticNet": ElasticNet(random_state=0),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
    "ExtraTreesRegressor": ExtraTreesRegressor(random_state=0),
    "AdaBoostRegressor": AdaBoostRegressor(random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
    "XGBRegressor": XGBRegressor(random_state=0),
}

params_quick_search = {
    'LinearRegression': {},
    'Ridge': {},
    "RidgeCV": {},
    "BayesianRidge": {},
    "SGDRegressor":{},
    "ElasticNet": {},
    "DecisionTreeRegressor": {},
    "RandomForestRegressor": {},
    "ExtraTreesRegressor": {},
    "AdaBoostRegressor": {},
    "GradientBoostingRegressor": {},
    "XGBRegressor": {},
}


Do a hyperparameter optimization search

In [None]:
search = HyperparameterOptimizationSearch(models=models_quick_search, params=params_quick_search)
search.fit(X_train, y_train, scoring='r2', n_jobs=-1, cv=5)

Check results

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

Check best model

In [None]:
best_model = grid_search_summary.iloc[0,0]
best_model

### Do extensive search on most suitable model to find best hyperparameter configuration

Define model and parameters, for Extensive Search

In [None]:
models_search = {
    "GradientBoostingRegressor":GradientBoostingRegressor(random_state=0),
}

# documentation to help on hyperparameter list: 
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html

# We will not conduct an extensive search, since the focus
# is on how to combine all knowledge in an applied project.
# In a workplace project, you may spend more time in this step
params_search = {
    "GradientBoostingRegressor":{
        'model__n_estimators': [100,300],
        'model__learning_rate': [1e-1,1e-2,1e-3], 
        'model__max_depth': [3,10],
    }
}
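The `model__` prefix in the grid keys follows scikit-learn's nested-parameter convention: `<step name>__<parameter>` routes a value to the estimator stored under that step name. A minimal sketch with a toy two-step pipeline, where `Ridge` is only a stand-in estimator:

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scaler", StandardScaler()), ("model", Ridge())])

# "model__alpha" sets the alpha parameter of the step named "model"
pipe.set_params(model__alpha=10.0)
print(pipe.named_steps["model"].alpha)
```

This is why the grid keys must match the step name used in `PipelineOptimization()`.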

Extensive GridSearch CV

In [None]:
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train, scoring = 'r2', n_jobs=-1, cv=5)

Check results

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

Check the best model

In [None]:
best_model = grid_search_summary.iloc[0,0]
best_model

Parameters for best model

In [None]:
grid_search_pipelines[best_model].best_params_

Define the best regressor, based on search

In [None]:
best_regressor_pipeline = grid_search_pipelines[best_model].best_estimator_
best_regressor_pipeline

## Assess feature importance

In [None]:
# after data cleaning and feat engine, the feature space changes
columns_after_data_cleaning_feat_eng = (PipelineDataCleaningAndFeatureEngineering()
                                        .fit_transform(X_train)
                                        .columns)

best_features = columns_after_data_cleaning_feat_eng[best_regressor_pipeline['feat_selection'].get_support()].to_list()

# create DataFrame to display feature importance
df_feature_importance = (pd.DataFrame(data={
          'Attribute': columns_after_data_cleaning_feat_eng[best_regressor_pipeline['feat_selection'].get_support()],
          'Importance': best_regressor_pipeline['model'].feature_importances_})
  .sort_values(by='Importance', ascending=False)
  )

# Most important features statement and plot
print(f"* These are the {len(best_features)} most important features in descending order. "
      f"The model was trained on them: \n{df_feature_importance['Attribute'].to_list()}")

df_feature_importance.plot(kind='bar',x='Attribute',y='Importance')
plt.show()
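The boolean mask returned by `get_support()` is what links the fitted selector back to column names. A small sketch on synthetic data, assuming a tree-based estimator (for which `SelectFromModel` defaults to keeping features whose importance reaches the mean importance):

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0]  # synthetic target driven by the first column only

selector = SelectFromModel(DecisionTreeRegressor(max_depth=3, random_state=0))
selector.fit(X, y)

feature_names = np.array(["x0", "x1", "x2"])
mask = selector.get_support()  # boolean array, True for kept features
print(feature_names[mask].tolist())
```

Only the column that actually drives the target is kept, so the downstream model sees a single feature.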

## Evaluate Regressor on Train and Test Sets

Custom Functions

In [None]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error 
import numpy as np

def model_score_train_test_set(X_train, y_train, X_test, y_test,pipeline):

  print("Model Evaluation \n")
  print("* Train Set")
  PredictionEvaluation(X_train,y_train,pipeline)

  print("* Test Set")
  PredictionEvaluation(X_test,y_test,pipeline)



def PredictionEvaluation(X,y,pipeline):
  prediction = pipeline.predict(X)

  print('R2 Score:', r2_score(y, prediction).round(3))  
  print('Mean Absolute Error:', mean_absolute_error(y, prediction).round(3))  
  print('Mean Squared Error:', mean_squared_error(y, prediction).round(3))  
  print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y, prediction)).round(3))
  print("\n")

  

import plotly.graph_objects as go
from plotly.subplots import make_subplots

def PredictionVsActual_TrainTestSets(X_train, y_train, X_test, y_test,pipeline):
  pred_train = pipeline.predict(X_train)
  pred_test = pipeline.predict(X_test)
  Plot_Prediction_vs_Actual(y_train,pred_train,y_test, pred_test)



def Plot_Prediction_vs_Actual(TrainActual,TrainPred,TestActual,TestPred):

  fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,6))

  sns.scatterplot(x=TrainActual , y=TrainPred, alpha=0.2, ax=axes[0])
  sns.lineplot(x=TrainActual , y=TrainActual, color='red', ax=axes[0])
  axes[0].set_xlabel("Actual")
  axes[0].set_ylabel("Predictions")
  axes[0].set_title("Train Set")

  sns.scatterplot(x=TestActual , y=TestPred, alpha=0.2, ax=axes[1])
  sns.lineplot(x=TestActual , y=TestActual, color='red', ax=axes[1])
  axes[1].set_xlabel("Actual")
  axes[1].set_ylabel("Predictions")
  axes[1].set_title("Test Set")

  plt.show()

Evaluate Performance

In [None]:
model_score_train_test_set(X_train, y_train, X_test, y_test,best_regressor_pipeline)
PredictionVsActual_TrainTestSets(X_train, y_train, X_test, y_test,best_regressor_pipeline)
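Since the target was power-transformed earlier, the errors reported here are in transformed units rather than months. For reference, this is how the four metrics relate to each other, computed by hand with NumPy on made-up numbers:

```python
import numpy as np

actual = np.array([1.0, 2.0, 3.0, 4.0])
predicted = np.array([1.5, 2.0, 3.0, 3.5])

residuals = actual - predicted
mae = np.mean(np.abs(residuals))
mse = np.mean(residuals ** 2)
rmse = np.sqrt(mse)
# R2 = 1 - (residual sum of squares / total sum of squares)
r2 = 1 - np.sum(residuals ** 2) / np.sum((actual - actual.mean()) ** 2)

print(f"R2={r2:.3f} MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f}")
```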

# Regressor with PCA() in the pipeline: Sklearn

## Rewrite ML Pipeline for Modelling: add PCA step

* It will change only the PipelineOptimization() function

In [None]:
### PCA
from sklearn.decomposition import PCA

def PipelineOptimization(model):
  pipe = PipelineDataCleaningAndFeatureEngineering()
  pipe.steps.append(["PCA",PCA(n_components=3, random_state=0)])
  pipe.steps.append(["scaler", StandardScaler()])
  pipe.steps.append(["model", model])
  
  return pipe
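PCA is sensitive to feature scale, and the function above applies PCA before the scaler. The sketch below uses synthetic data only and makes no claim about this dataset; it shows how a single large-scale feature can dominate the explained variance when the inputs are not standardised first.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three independent synthetic features; the first has a much larger scale
X = rng.normal(size=(500, 3)) * np.array([1000.0, 1.0, 1.0])

ratio_raw = PCA(n_components=3, random_state=0).fit(X).explained_variance_ratio_

X_scaled = StandardScaler().fit_transform(X)
ratio_scaled = PCA(n_components=3, random_state=0).fit(X_scaled).explained_variance_ratio_

print("unscaled:", np.round(ratio_raw, 3))
print("scaled:  ", np.round(ratio_scaled, 3))
```

Comparing `explained_variance_ratio_` for both orderings on the real data may help decide whether the scaler belongs before the PCA step.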

## Grid Search CV – Sklearn

### Use standard hyper parameters to find most suitable model

In [None]:
models_quick_search = {
    'LinearRegression': LinearRegression(),
    "Ridge": Ridge(random_state=0),
    "RidgeCV": RidgeCV(),
    "BayesianRidge": BayesianRidge(),
    "SGDRegressor": SGDRegressor(random_state=0),
    "ElasticNet": ElasticNet(random_state=0),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
    "ExtraTreesRegressor": ExtraTreesRegressor(random_state=0),
    "AdaBoostRegressor": AdaBoostRegressor(random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
    "XGBRegressor": XGBRegressor(random_state=0),
}

params_quick_search = {
    'LinearRegression': {},
    'Ridge': {},
    "RidgeCV": {},
    "BayesianRidge": {},
    "SGDRegressor":{},
    "ElasticNet": {},
    "DecisionTreeRegressor": {},
    "RandomForestRegressor": {},
    "ExtraTreesRegressor": {},
    "AdaBoostRegressor": {},
    "GradientBoostingRegressor": {},
    "XGBRegressor": {},
}


Do a quick optimization search

In [None]:
quick_search = HyperparameterOptimizationSearch(models=models_quick_search, params=params_quick_search)
quick_search.fit(X_train, y_train, scoring='r2', n_jobs=-1, cv=5)

Check results

In [None]:
grid_search_summary, grid_search_pipelines = quick_search.score_summary(sort_by='mean_score')
grid_search_summary

Check best model

In [None]:
best_model = grid_search_summary.iloc[0,0]
best_model

### Do extensive search on most suitable model to find best hyperparameter configuration

Define model and parameters, for Extensive Search

In [None]:
models_search = {
    "GradientBoostingRegressor":GradientBoostingRegressor(random_state=0),
}

# documentation to help on hyperparameter list: 
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html

# We will not conduct an extensive search, since the focus
# is on how to combine all knowledge in an applied project.
# In a workplace project, you may spend more time in this step
params_search = {
    "GradientBoostingRegressor":{
        'model__n_estimators': [100,300],
        'model__learning_rate': [1e-1,1e-2,1e-3], 
        'model__max_depth': [3,10],
    }
}

Extensive GridSearch CV

In [None]:
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train, scoring = 'r2', n_jobs=-1, cv=5)

Check results

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

Check the best model

In [None]:
best_model = grid_search_summary.iloc[0,0]
best_model

Parameters for best model

In [None]:
grid_search_pipelines[best_model].best_params_

Define the best regressor, based on search

In [None]:
best_regressor_pipeline = grid_search_pipelines[best_model].best_estimator_
best_regressor_pipeline

## Evaluate Regressor on Train and Test Sets

In [None]:
model_score_train_test_set(X_train, y_train, X_test, y_test,best_regressor_pipeline)
PredictionVsActual_TrainTestSets(X_train, y_train, X_test, y_test,best_regressor_pipeline)

# TensorFlow

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.optimizers import Adam

In [None]:
def CreateTensorFlowModel():
  model = Sequential()

  model.add(Dense(19,activation='relu'))
  model.add(Dense(19,activation='relu'))
  model.add(Dense(19,activation='relu'))
  model.add(Dense(19,activation='relu'))
  model.add(Dense(1))

  model.compile(optimizer='adam',loss='mse')
  return model

In [None]:
# the network needs numerical inputs, so transform the raw features first
pipeline_before_model = PipelineDataCleaningAndFeatureEngineering()
X_train_tf = pipeline_before_model.fit_transform(X_train)
X_test_tf = pipeline_before_model.transform(X_test)

In [None]:
model = CreateTensorFlowModel()

In [None]:
model.fit(x=X_train_tf,y=y_train.values,
          validation_data=(X_test_tf,y_test.values),
          epochs=2)

In [None]:
losses = pd.DataFrame(model.history.history)

In [None]:
losses.plot()

# Convert Regression to Classification: Sklearn

### Convert target to bins and check if it is balanced

Convert the continuous numerical target to equal-frequency bins

In [None]:
from feature_engine.discretisation import EqualFrequencyDiscretiser
disc = EqualFrequencyDiscretiser(q=3, variables=['tenure'])

df_clf = disc.fit_transform(df)
df_clf['tenure'].value_counts()

Visualize target distribution and range levels

In [None]:
print(f"* The classes represent the following ranges: \n{disc.binner_dict_} \n")
sns.countplot(data=df_clf, x='tenure')
plt.show()
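Equal-frequency binning places the bin edges at quantiles so that each class holds roughly the same number of rows. A minimal sketch with `pandas.qcut` as a stand-in for the discretiser, on made-up tenure values:

```python
import pandas as pd

# Made-up tenure values in months
tenure = pd.Series([1, 2, 3, 5, 8, 13, 21, 34, 55], name="tenure")

# q=3 asks for three bins holding roughly the same number of rows
codes, edges = pd.qcut(tenure, q=3, labels=False, retbins=True)

print(codes.value_counts().sort_index().to_dict())
print(edges)
```

Balanced bins are the reason the class counts printed above come out nearly equal.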

## Rewrite ML Pipeline for Modelling

In [None]:
def PipelineOptimization(model):
  pipe = PipelineDataCleaningAndFeatureEngineering()
  pipe.steps.append(["scaler", StandardScaler()])
  pipe.steps.append(["feat_selection", SelectFromModel(model)])
  pipe.steps.append(["model", model])
  return pipe

## Load estimators for classification

In [None]:
from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier 
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier

## Split Train Test Sets

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test,y_train, y_test = train_test_split(
                                    df_clf.drop(['tenure'],axis=1),
                                    df_clf['tenure'],
                                    test_size=0.2,
                                    random_state=0
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

Check if target is balanced

In [None]:
y_train.value_counts(normalize=True).to_frame().round(2)

## Grid Search CV – Sklearn

### Use standard hyper parameters to find most suitable model

In [None]:
models_quick_search = {
    "XGBClassifier": XGBClassifier(random_state=0),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
    "ExtraTreesClassifier": ExtraTreesClassifier(random_state=0),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
}

params_quick_search = {
    "XGBClassifier":{},
    "DecisionTreeClassifier":{},
    "RandomForestClassifier":{},
    "GradientBoostingClassifier":{},
    "ExtraTreesClassifier":{},
    "AdaBoostClassifier":{},
}

GridSearch CV

In [None]:
from sklearn.metrics import make_scorer, recall_score
quick_search = HyperparameterOptimizationSearch(models=models_quick_search, params=params_quick_search)
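# labels=[0] restricts the score to recall for class 0;
# pos_label is ignored for a multiclass target when average != 'binary'
quick_search.fit(X_train, y_train,
                 scoring = make_scorer(recall_score, labels=[0], average='micro'),
                 n_jobs=-1,cv=5)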
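The averaging arguments passed to `recall_score` decide which classes the score reflects. A minimal sketch on made-up labels: micro-averaging over all classes equals accuracy for a single-label multiclass target, while restricting `labels` to `[0]` yields recall for class 0 alone.

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels for a three-class problem
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

micro_all = recall_score(y_true, y_pred, average="micro")
accuracy = accuracy_score(y_true, y_pred)
class0_only = recall_score(y_true, y_pred, labels=[0], average="micro")

print("micro recall, all classes:", round(micro_all, 3))
print("accuracy:", round(accuracy, 3))
print("recall, class 0 only:", class0_only)
```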

Check results

In [None]:
grid_search_summary, grid_search_pipelines = quick_search.score_summary(sort_by='mean_score')
grid_search_summary

Check best model

In [None]:
best_model = grid_search_summary.iloc[0,0]
best_model

### Do extensive search on most suitable model to find best hyperparameter configuration

Define models and parameters

In [None]:
models_search = {
    "GradientBoostingClassifier":GradientBoostingClassifier(random_state=0),
}

# documentation to help on hyperparameter list: 
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html

# We will not conduct an extensive search, since the focus
# is on how to combine all knowledge in an applied project.
# In a workplace project, you may spend more time in this step
params_search = {
    "GradientBoostingClassifier":{
        'model__n_estimators': [100,500],
        'model__learning_rate': [1e-1,1e-2,1e-3], 
        'model__max_depth': [3,10],
    }
}

Extensive GridSearch CV

In [None]:
from sklearn.metrics import make_scorer, recall_score
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
# labels=[0] restricts the score to recall for class 0
search.fit(X_train, y_train,
           scoring = make_scorer(recall_score, labels=[0], average='micro'),
           n_jobs=-1, cv=5)

Check results

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary


Check the best model

In [None]:
best_model = grid_search_summary.iloc[0,0]
best_model

Parameters for best model
* We are saving this content for later

In [None]:
best_parameters = grid_search_pipelines[best_model].best_params_
best_parameters

Define the best clf pipeline

In [None]:
pipeline_clf = grid_search_pipelines[best_model].best_estimator_
pipeline_clf

## Assess feature importance

We can assess feature importance for this model with `.feature_importances_`

In [None]:
# after data cleaning and feat engine, the feature space changes
columns_after_data_cleaning_feat_eng = (PipelineDataCleaningAndFeatureEngineering()
                                        .fit_transform(X_train)
                                        .columns)

# best_features = columns_after_data_cleaning_feat_eng
best_features = columns_after_data_cleaning_feat_eng[pipeline_clf['feat_selection'].get_support()].to_list()

# create DataFrame to display feature importance
df_feature_importance = (pd.DataFrame(data={
          'Attribute': columns_after_data_cleaning_feat_eng[pipeline_clf['feat_selection'].get_support()],
          'Importance': pipeline_clf['model'].feature_importances_})
  .sort_values(by='Importance', ascending=False)
  )

best_features = df_feature_importance['Attribute'].to_list() # reassign best features in order

# Most important features statement and plot
print(f"* These are the {len(best_features)} most important features in descending order. "
      f"The model was trained on them: \n{best_features}")

df_feature_importance.plot(kind='bar',x='Attribute',y='Importance')
plt.show()

In [None]:
best_features_clf = best_features
best_features_clf

## Evaluate Classifier on Train and Test Sets

Custom Function

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

def ClfPredictionEvaluation(X,y,pipeline,LabelsMap):

  prediction = pipeline.predict(X)

  Map = list() 
  for key, value in LabelsMap.items():
    Map.append( str(key) + ": " + value)

  print('---  Confusion Matrix  ---')
  print(pd.DataFrame(confusion_matrix(prediction,y),
        columns=[ ["Actual " + sub for sub in Map] ], 
        index = [ ["Prediction " + sub for sub in Map ]]
        # index=['Prediction 0', 'Prediction 1']
        ))
  print("\n")


  print('---  Classification Report  ---')
  print(classification_report(y, prediction),"\n")


def ClfPerformanceTrainTestSet(X_train,y_train,X_test,y_test,pipeline,LabelsMap):
  print("#### Train Set #### \n")
  ClfPredictionEvaluation(X_train,y_train,pipeline,LabelsMap)

  print("#### Test Set ####\n")
  ClfPredictionEvaluation(X_test,y_test,pipeline,LabelsMap)
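`ClfPredictionEvaluation` passes `prediction` as the first argument to `confusion_matrix`, so rows index predictions and columns index actual values, which matches the row and column labels it prints. A small sketch on made-up labels showing how argument order sets the orientation:

```python
from sklearn.metrics import confusion_matrix

actual = [0, 0, 0, 1]
predicted = [0, 1, 1, 1]

# The first argument indexes the rows, the second indexes the columns
rows_are_actual = confusion_matrix(actual, predicted)
rows_are_predicted = confusion_matrix(predicted, actual)

print(rows_are_actual)
print(rows_are_predicted)
```

The two matrices are transposes of each other, so the labels must follow whichever order is used.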

Creates a dictionary that relates the class and numerical interval

In [None]:
n_classes = len(disc.binner_dict_['tenure']) - 1
classes_ranges = disc.binner_dict_['tenure'][1:-1]

labels_map = {}
for n in range(0,n_classes):
  if n == 0:
    labels_map[n] = f"<{classes_ranges[0]}"
  elif n == n_classes-1:
    labels_map[n] = f"+{classes_ranges[-1]}"
  else:
    labels_map[n] = f"{classes_ranges[n-1]} to {classes_ranges[n]}"

labels_map
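As a worked example of the loop above, the sketch below wraps the same logic in a hypothetical helper, `build_labels_map`, and runs it on example bin edges of 4 and 20 months bounded by infinities. The edges are illustrative, not values read from the fitted discretiser.

```python
import math


def build_labels_map(bin_edges):
    """Map each class index to a readable range, given sorted bin edges."""
    inner = bin_edges[1:-1]          # drop the -inf and +inf sentinels
    n_classes = len(bin_edges) - 1
    labels = {}
    for n in range(n_classes):
        if n == 0:
            labels[n] = f"<{inner[0]}"
        elif n == n_classes - 1:
            labels[n] = f"+{inner[-1]}"
        else:
            labels[n] = f"{inner[n - 1]} to {inner[n]}"
    return labels


example_edges = [-math.inf, 4.0, 20.0, math.inf]
example_map = build_labels_map(example_edges)
print(example_map)
```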

In [None]:
ClfPerformanceTrainTestSet(X_train, y_train ,X_test, y_test,
                        pipeline_clf,
                        LabelsMap=labels_map )

# Which pipeline to choose?

We fitted multiple pipelines:
* Regression
* Regression with PCA
* Classifier

The regressor pipelines didn't reach the expected performance threshold of 0.7 for the R2 score on the train and test sets.

The classifier was tuned on recall for class 0, since we are interested in detecting prospects that may churn soon. 
* It has reasonable performance for class 0 (<4 months) and can separate the opposite classes: class 0 (<4 months) and class 2 (+20 months).
* Class 1 (4 to 20 months) has weak performance. We will accept this limitation and will handle a prediction of class 1 as if it were a prediction of class 0.

# Refit pipeline with best features

## Split Train Test Set, only with best features

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test,y_train, y_test = train_test_split(
                                    df_clf.drop(['tenure'],axis=1),
                                    df_clf['tenure'],
                                    test_size=0.2,
                                    random_state=0
                                    )

print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

Subset Best Features

In [None]:
X_train = X_train.filter(best_features_clf)

X_test = X_test.filter(best_features_clf)
print("* Train set:", X_train.shape, y_train.shape, "\n* Test set:",  X_test.shape, y_test.shape)

Check what the training data looks like now

In [None]:
X_train.head(3)

## Rewrite Data Cleaning and Feature Engineering Pipeline

In [None]:
def PipelineDataCleaningAndFeatureEngineering():
  pipeline_base = Pipeline(
      [
      ("OrdinalCategoricalEncoder",OrdinalEncoder(encoding_method='arbitrary', 
                                                  variables = ['MultipleLines', 'Contract', 
                                                               'PaymentMethod','Partner']
                                                  )
      ),
       
    ]
  )

  return pipeline_base

## Rewrite Modelling Pipeline (removes feat_selection)

In [None]:
def PipelineOptimization(model):
  pipe = PipelineDataCleaningAndFeatureEngineering()
  pipe.steps.append(["scaler", StandardScaler()])
  pipe.steps.append(["model", model])
  return pipe


## Grid Search CV – Sklearn

* using the most suitable model from the last section and its best hyperparameter configuration

We are using the same model from the last GridCV search

In [None]:
models_search

And the best parameters from the last GridCV search

In [None]:
best_parameters

You will need to type the values in manually, since each hyperparameter value has to be inside a list. The previous dictionary is not in this format.

In [None]:
params_search = {'GradientBoostingClassifier': {
    'model__learning_rate': [0.01],   # the value should be in []
    'model__max_depth': [3],          # the value should be in []
    'model__n_estimators': [500],     # the value should be in []
    }
}
params_search
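
As an alternative to typing the values by hand, the list wrapping can be done programmatically. The sketch below assumes that `best_parameters` is a flat dictionary mapping parameter names to single values; `to_grid_format` and the `_example` variables are hypothetical names used only for illustration.

```python
def to_grid_format(best_params):
    """Wrap every hyperparameter value in a list, as a grid search expects."""
    return {name: [value] for name, value in best_params.items()}


# Same values as in the cell above, written as single values
best_parameters_example = {
    'model__learning_rate': 0.01,
    'model__max_depth': 3,
    'model__n_estimators': 500,
}
params_search_example = {
    'GradientBoostingClassifier': to_grid_format(best_parameters_example)
}
print(params_search_example)
```

All three hyperparameters end up inside the inner dictionary keyed by the model name, which is the nesting the search expects.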

GridSearch CV

In [None]:
from sklearn.metrics import make_scorer, recall_score
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
# labels=[0] restricts recall to class 0 (<4 months);
# pos_label is ignored when average is not 'binary'
search.fit(X_train, y_train,
           scoring=make_scorer(recall_score, labels=[0], average='micro'),
           n_jobs=-1, cv=5)
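
The scorer above is built on recall, with class 0 as the class of interest. To make concrete what recall for a single class measures, here is a hand computation in plain Python. It is an illustration with made-up labels, not scikit-learn's implementation, and `recall_for_class` is a hypothetical helper. Note that micro-averaged recall taken over all classes is the same number as accuracy, which is a different quantity from the recall of class 0 alone.

```python
def recall_for_class(y_true, y_pred, label):
    """Share of samples whose true class is `label` that were predicted as `label`."""
    relevant = [pred for true, pred in zip(y_true, y_pred) if true == label]
    if not relevant:
        raise ValueError(f"class {label!r} does not occur in y_true")
    return sum(pred == label for pred in relevant) / len(relevant)


# Made-up labels, for illustration only
y_true_example = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred_example = [0, 0, 1, 0, 1, 2, 2, 0]

recall_class_0 = recall_for_class(y_true_example, y_pred_example, label=0)
accuracy = sum(t == p for t, p in zip(y_true_example, y_pred_example)) / len(y_true_example)
print(f"recall for class 0: {recall_class_0:.3f}, accuracy: {accuracy:.3f}")
```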


Check results

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

Check the best model

In [None]:
best_model = grid_search_summary.iloc[0,0]
best_model

Parameters for best model

In [None]:
grid_search_pipelines[best_model].best_params_

Define the best clf pipeline

In [None]:
pipeline_clf = grid_search_pipelines[best_model].best_estimator_
pipeline_clf

## Assess feature importance

In [None]:
# after data cleaning and feat engine, the feature space changes
columns_after_data_cleaning_feat_eng = (PipelineDataCleaningAndFeatureEngineering()
                                        .fit_transform(X_train)
                                        .columns)

best_features = columns_after_data_cleaning_feat_eng

# create DataFrame to display feature importance
df_feature_importance = (pd.DataFrame(data={
          'Attribute': columns_after_data_cleaning_feat_eng,
          'Importance': pipeline_clf['model'].feature_importances_})
  .sort_values(by='Importance', ascending=False)
  )

# Most important features statement and plot
print(f"* These are the {len(best_features)} most important features in descending order. "
      f"The model was trained on them: \n{df_feature_importance['Attribute'].to_list()}")

df_feature_importance.plot(kind='bar',x='Attribute',y='Importance')
plt.show()
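
The cell above relies on the feature names and the model's importances having the same length and order. A plain-Python sketch of that pairing, with an explicit length check, is shown below; `rank_features` is a hypothetical helper and the importance values are invented for illustration.

```python
def rank_features(feature_names, importances):
    """Pair each feature with its importance and sort in descending order."""
    if len(feature_names) != len(importances):
        raise ValueError("feature names and importances differ in length")
    return sorted(zip(feature_names, importances),
                  key=lambda pair: pair[1], reverse=True)


# Hypothetical importances, for illustration only
example_names = ['MultipleLines', 'Contract', 'PaymentMethod', 'Partner']
example_importances = [0.10, 0.55, 0.25, 0.10]
ranking = rank_features(example_names, example_importances)
for name, importance in ranking:
    print(f"{name}: {importance}")
```

A length mismatch would indicate that the columns were taken from a different feature space than the one the model was trained on.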

## Evaluate Classifier on Train and Test Sets

In [None]:
ClfPerformanceTrainTestSet(X_train, y_train ,X_test, y_test,
                        pipeline_clf,
                        LabelsMap=labels_map )

# Push files to the repo

We will generate the following files

* Train set
* Test set
* Modeling pipeline
* Labels map
* Feature importance plot

In [None]:
import joblib
import os

version = 'v1'
file_path = f'outputs/ml_pipeline/predict_tenure/{version}'

try:
  os.makedirs(name=file_path)
except Exception as e:
  print(e)
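
The cell above prints the exception when the folder already exists, for example when the notebook is re-run. One possible alternative is `os.makedirs` with `exist_ok=True`, sketched below inside a temporary directory so it does not touch the repo; `make_version_dir` is a hypothetical helper.

```python
import os
import tempfile


def make_version_dir(base_dir, version):
    """Create the versioned output folder and return its path."""
    path = os.path.join(base_dir, 'outputs', 'ml_pipeline', 'predict_tenure', version)
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return path


with tempfile.TemporaryDirectory() as tmp_dir:
    first_call = make_version_dir(tmp_dir, 'v1')
    second_call = make_version_dir(tmp_dir, 'v1')  # re-running is harmless
    folder_created = os.path.isdir(first_call)
    same_folder = first_call == second_call
```

Other errors, such as a permission problem, still raise instead of being printed and ignored.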

## Train Set : features and target

In [None]:
X_train.head()

In [None]:
X_train.to_csv(f"{file_path}/X_train.csv", index=False)

In [None]:
y_train

In [None]:
y_train.to_csv(f"{file_path}/y_train.csv", index=False)

## Test Set: features and target

In [None]:
X_test.head()

In [None]:
X_test.to_csv(f"{file_path}/X_test.csv", index=False)

In [None]:
y_test

In [None]:
y_test.to_csv(f"{file_path}/y_test.csv", index=False)

## Modelling pipeline (that includes Data Cleaning and Feat Eng pipeline)

ML pipeline for predicting tenure

In [None]:
pipeline_clf

In [None]:
joblib.dump(value=pipeline_clf ,
            filename=f"{file_path}/clf_pipeline.pkl")

## Dictionary mapping target levels to bins

Map for converting numerical variable to categorical variable

In [None]:
labels_map

In [None]:
joblib.dump(value=labels_map ,
            filename=f"{file_path}/labels_map.pkl")
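
At prediction time, the saved map can turn a predicted class back into a readable tenure range. The sketch below assumes the map is a dictionary from the integer class to a text label; the real content of `labels_map` is not shown in this section, so `labels_map_example` and `describe_prediction` are hypothetical, with ranges taken from the class descriptions earlier in this notebook.

```python
# Assumed structure of the labels map, for illustration only
labels_map_example = {0: '<4 months', 1: '4 to 20 months', 2: '+20 months'}


def describe_prediction(predicted_class, labels_map):
    """Translate a predicted class into its tenure range label."""
    if predicted_class not in labels_map:
        raise ValueError(f"unknown class: {predicted_class!r}")
    return labels_map[predicted_class]


print(describe_prediction(2, labels_map_example))  # +20 months
```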

## Feature importance plot

In [None]:
df_feature_importance.plot(kind='bar', x='Attribute', y='Importance')
plt.show()

In [None]:
df_feature_importance.plot(kind='bar',x='Attribute',y='Importance')
plt.savefig(f'{file_path}/features_importance.png', bbox_inches='tight')

## **Push** generated/new files from this Session to GitHub repo

Now you can push the files to the GitHub repo!!!

* Git status

In [None]:
! git status

* Git commit

In [None]:
CommitMsg = "added-files-predict-tenure"
! git add .
! git commit -m {CommitMsg}

* Git Push

In [None]:
! git push origin main

Good job! Save the notebook in your repo.

Then terminate the session (Runtime → Manage Sessions → Terminate)