# Gradient Boosting

In this notebook, we will use a **Gradient Boosting** model to predict customer churn using the already pre-processed `customer_churn_processed.csv` dataset.

We will also evaluate the model on accuracy, precision, and recall, and save the results to a file so they can be compared with other models later in this project phase.

In [14]:
# import dependencies
import time

import pandas as pd
from numpy import mean, std

# gradient boosting for classification in scikit-learn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import (
    RepeatedStratifiedKFold,
    cross_val_score,
    train_test_split,
)

# input file containing preprocessed data
input_csv = "../../data/customer_churn_processed.csv"
# output file to be saved containing model results
output_csv = "../model_results/[model_name]_results.csv"

## Data

Load and prepare data for training and testing the model.

In [12]:
# Load the data
data = pd.read_csv(input_csv)

# Split the data into X and y
X = data.drop('Churn', axis=1)
y = data['Churn']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=13)

# Example hyperparameters (not used below; the model is fit with scikit-learn's defaults)
params = {
    "n_estimators": 500,
    "max_depth": 4,
    "min_samples_split": 5,
    "learning_rate": 0.01,
}
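The `train_test_split` call above shuffles rows without regard to class, so the churn rate can differ slightly between the training and test sets. One option, sketched here on synthetic data, is to pass `stratify=y` so both sets keep the same class proportions. The arrays `X_demo` and `y_demo` are made-up stand-ins for the real features and the `Churn` column, not values from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for the `Churn` column (assumption):
# 200 positives and 800 negatives, i.e. a 20% positive rate.
rng = np.random.default_rng(13)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 200 + [0] * 800)

# stratify=y_demo keeps the class proportions equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=13, stratify=y_demo
)

# Positive rate in each split
print(y_tr.mean(), y_te.mean())
```

Whether stratification matters here depends on how imbalanced `Churn` is in the processed data, which this notebook does not show.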

## Fit the Model

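Here we initialize the gradient boosting classifier and estimate its accuracy with cross-validation before fitting it on the training split.

We define a **gradient boosting** model with default parameters. `GradientBoostingClassifier` itself has no `n_jobs` parameter, so we pass `n_jobs=-1` to `cross_val_score` instead, which runs the cross-validation folds on all available cores.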

In [15]:
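# Initialize the model with default parameters
model = GradientBoostingClassifier()

# Repeated stratified 10-fold cross-validation (3 repeats, 30 fits in total)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')

# Report the mean accuracy and its standard deviation across all folds
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))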

Accuracy: 0.998 (0.001)
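To make the cross-validation step more concrete, here is a minimal sketch on synthetic data. It uses smaller settings than the cell above (5 splits, 2 repeats, 20 trees) so it runs quickly, and every `demo_` name is an illustrative assumption. It shows that `cross_val_score` returns one score per fold per repeat, and that it fits clones of the estimator, so the original object is still unfitted afterwards. That is why the model is fit explicitly in the next section.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic binary classification data (stand-in for the churn dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

demo_model = GradientBoostingClassifier(n_estimators=20, random_state=1)
demo_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)

# One accuracy score per fold per repeat: 5 splits x 2 repeats
demo_scores = cross_val_score(demo_model, X_demo, y_demo, scoring="accuracy", cv=demo_cv)

print(len(demo_scores))
print("Accuracy: %.3f (%.3f)" % (mean(demo_scores), std(demo_scores)))

# cross_val_score works on clones, so demo_model itself has not been fitted
print(hasattr(demo_model, "estimators_"))
```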


## Model Training & Prediction

Let's now train the model on the training data and make predictions on the test data.

In [4]:
# Record the start time before training the model
start = time.time()

# Train the model
model.fit(X_train, y_train)

# Record the end time after the model has been trained
end = time.time()

# Record the training time
training_time = end - start

# Make predictions
y_pred = model.predict(X_test)

## Model Evaluation

Finally, we will evaluate the model for accuracy, precision and recall.

In [None]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

# Create DataFrame to store evaluation results and training time
evaluation_results = pd.DataFrame({
    'Model': ['Gradient Boosting'],
    'Accuracy': [accuracy],
    'Precision': [precision],
    'Recall': [recall],
    'Training Time': [training_time]
})

# Print the evaluation metrics
evaluation_results

As shown above, the **Gradient Boosting** model scores highly on all three metrics, which suggests that it performs well on this dataset.
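As a reminder of what the three metrics measure, the following sketch computes them by hand from a confusion matrix and checks them against scikit-learn. The labels `y_true_demo` and `y_pred_demo` are made up for illustration and are not taken from the churn data.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Toy labels: 1 = churned, 0 = stayed (illustrative values only)
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

manual_accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions
manual_precision = tp / (tp + fp)  # of predicted churners, how many churned
manual_recall = tp / (tp + fn)  # of actual churners, how many were caught

print(manual_accuracy, accuracy_score(y_true_demo, y_pred_demo))
print(manual_precision, precision_score(y_true_demo, y_pred_demo))
print(manual_recall, recall_score(y_true_demo, y_pred_demo))
```

On churn data, recall is often the metric to watch, since a missed churner (a false negative) is a customer the business never gets a chance to retain.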

## Save Results

We now save the results to a file so they can be compared with those of other models later in this project phase.

In [6]:
# save results to output_csv
evaluation_results.to_csv(output_csv, index=False)
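For the later comparison step, one possible approach is to read every per-model results file and stack them into a single table. The sketch below is self-contained: it writes two placeholder files into a temporary directory and then combines them. The model names `model_a` and `model_b`, the metric values, and the `*_results.csv` naming pattern are all assumptions made for illustration, not results from this project.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    results_dir = Path(tmp)

    # Write two placeholder results files with the same columns as above
    for name, score in [("model_a", 0.90), ("model_b", 0.95)]:
        pd.DataFrame({
            "Model": [name],
            "Accuracy": [score],
            "Precision": [score],
            "Recall": [score],
            "Training Time": [1.0],
        }).to_csv(results_dir / f"{name}_results.csv", index=False)

    # Read every results file and stack the rows into one comparison table
    frames = [pd.read_csv(path) for path in sorted(results_dir.glob("*_results.csv"))]
    comparison = (
        pd.concat(frames, ignore_index=True)
        .sort_values("Accuracy", ascending=False)
        .reset_index(drop=True)
    )

print(comparison)
```

Because each notebook writes the same column layout, the files can be concatenated without any renaming.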