# Lab 7: Lead generation and churn

# Lab goals:
In this project, we will develop and apply different methods for lead generation and churn prediction. These methodologies are widely used in business for use cases such as:
- Identifying **new customers** in the market
- Identifying customers in our internal Data Warehouse who are **more likely** to buy a new product
- Identifying unsatisfied customers who are therefore likely to be **churners**

During this project we will follow the end-to-end **Machine Learning process** to identify new customers to capture and improve sales in a **marketing use case**:
1. Data understanding and preparation: exploration of the dataset and feature engineering (missing values, outlier identification, categorical variables management)
2. Model Training: training the Logistic Regression model and analyzing its metrics (recall, precision, confusion matrix)
3. Creating a Business opportunity with Machine Learning: selection of the best model and identification of the most important features

# Practice Information:
**Due date:** during the lab session (6:30-9 pm)

**Submission procedure:** via Moodle.

**Name:** Luca Franceschi

**NIA:** 253885


# 0. Context:  

Let's assume that we are the Data Science team of a bank and we want to predict which customers are going to leave us (**churn**), so that we can anticipate it and take commercial action. To do this, the bank has been collecting tabular data on its customers, including which customers have left in the past, so we will use the power of **machine learning** to build a predictive model.

### Dataset:

Taking a closer look, we see that the dataset contains 14 columns (also known as features or variables). The first 13 columns are the independent variables, while the last column is the dependent variable, which contains a binary value of 1 or 0. Here, 1 means that the customer left the bank after 6 months, and 0 means that the customer did not.

It's important to mention that the data for the independent variables was collected 6 months before the data for the dependent variable, since the task is to develop a machine learning model that can predict whether a customer will leave the bank after 6 months, depending on the current feature values.

Let's discuss each column one by one:

1. **RowNumber** corresponds to the record (row) number and has no effect on the output. This column will be removed.
2. **CustomerId** contains random values and has no effect on a customer leaving the bank. This column will be removed.
3. **Surname** the surname of a customer has no impact on their decision to leave the bank. This column will be removed.
4. **CreditScore** can have an effect on customer churn, since a customer with a higher credit score is less likely to leave the bank.
5. **Geography** a customer's location can affect their decision to leave the bank. We'll keep this column.
6. **Gender** it's interesting to explore whether gender plays a role in a customer leaving the bank. We'll include this column, too.
7. **Age** this is certainly relevant, since older customers are less likely to leave their bank than younger ones.
8. **Tenure** refers to the number of years that the customer has been a client of the bank. Normally, older clients are more loyal and less likely to leave a bank.
9. **Balance** also a very good indicator of customer churn, as people with a higher balance in their accounts are less likely to leave the bank compared to those with lower balances.
10. **NumOfProducts** refers to the number of products that a customer has purchased through the bank.
11. **HasCrCard** denotes whether or not a customer has a credit card. This column is also relevant, since people with a credit card are less likely to leave the bank.
12. **IsActiveMember** active customers are less likely to leave the bank, so we'll keep this.
13. **EstimatedSalary** as with balance, people with lower salaries are more likely to leave the bank compared to those with higher salaries.
14. **Exited** whether or not the customer left the bank. This is what we have to predict.

In [None]:
import math
import pandas as pd
import seaborn as sns
import arviz as az
import os
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score,roc_auc_score, roc_curve
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay

# Set Pandas to show all the columns
pd.set_option('display.max_columns', None)

# **1. Data understanding and preparation**

Before diving into predictive models, it's crucial to have a good understanding of the data at hand. This phase will walk you through loading, exploring, and preparing the dataset for further analysis.

All this process is known as **Data Wrangling**. In particular, the whole data wrangling process implies:
- Define and apply a strategy for nulls and an encoding for categorical variables
- Analyze the distribution of the variables and the correlations between them
- Remove outliers
- etc.


### **1.1 Read the data**

Let's read the CSV file (separator ",") into a dataframe variable using `read_csv` from the Pandas library, and then look at the first 5 rows.

In [None]:
# Define the parameters to read the data
root_path = "data/"
dataset = "Churn_Modelling.csv"
sep = ","
encoding = "utf-8"

# Read the data
original_data = pd.read_csv(filepath_or_buffer=os.path.abspath(os.path.join(root_path, dataset)), sep = sep, encoding = encoding)

data = original_data.copy()

In [None]:
data.head()


### **1.2 Dataset Exploratory Data Analysis (EDA)**

[**EX1**] Let's identify the type of each variable (integer, float, char...) and the size of the dataset and the file. Which variables have the most nulls? And which have none?

**Question**

- How many records does the data have?

- How many nulls does each attribute have?

- Does each attribute have the right datatype?

Tip: [.info()](https://www.geeksforgeeks.org/python-pandas-dataframe-info/) is a function that reports the main characteristics of a dataframe.

**Solution**

The dataset has 10000 records and not a single null in any column. All datatypes seem to be correct; the text columns are reported as `object`, which is how pandas typically stores strings.

In [None]:
# CODE HERE
data.info()
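As a complement to `.info()`, the null count and datatype of each column can be collected in one small table. The sketch below uses a toy dataframe with made-up values (the `toy` and `summary` names are illustrative, not part of the lab), so it runs without the CSV file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the churn dataset (hypothetical values, not the real data)
toy = pd.DataFrame({
    "CreditScore": [619, 608, np.nan, 699],
    "Geography": ["France", "Spain", "France", None],
    "Exited": [1, 0, 1, 0],
})

# One row per column: its datatype and its number of nulls
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_nulls": toy.isnull().sum(),
})

print(f"records: {len(toy)}")
print(summary)
```

On the real data the same two calls would be applied to `data` instead of `toy`.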

[**EX2**] Explore the dataset using descriptive statistics to understand the distribution of numerical and categorical variables.

Tip: Use .describe() for descriptive statistics and seaborn or matplotlib for visualizations.

**Solution**

In [None]:
# CODE HERE
data.describe()

In [None]:
sns.pairplot(data=data[['CreditScore', 'Gender', 'Age', 'Tenure', 'Balance', 'EstimatedSalary', 'Exited']], hue='Exited', markers=["o", "s"])
plt.show()

**Check the relationship between the churn attribute and the numerical ones**

In [None]:
import pandas as pd
from collections import Counter
import seaborn as sns
import os
import numpy as np
from matplotlib import pyplot
import matplotlib.pyplot as plt

In [None]:
# Define the target column
target_variable = ["Exited"]

# CODE HERE

# Define the numerical variables # Remove 'RowNumber', 'CustomerId', 'Surname', 'Exited'
num_variables = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']

# Define the categorical ones # Remove 'RowNumber', 'CustomerId', 'Surname', 'Exited'
categorical_variables = ['Geography', 'Gender', 'HasCrCard', 'IsActiveMember']

#### Draw a box plot for numerical variables and distribution plot against each churn value

In [None]:
# Set the aesthetic style of the plots
sns.set_style("whitegrid")

# Determine the number of rows and columns for the subplot grid
num_vars = len(num_variables)
cols = 2  # Two plots for each variable: boxplot and distribution plot
rows = num_vars  # One row for each variable

# Create a figure
fig, axs = plt.subplots(rows, cols, figsize=(15, 5*rows))

# Box plots and distribution plots for numerical variables
for i, column in enumerate(num_variables):
    sns.boxplot(x='Exited', y=column, data=data, ax=axs[i, 0])
    axs[i, 0].set_title(f'Box plot of {column}')

    sns.histplot(data=data, x=column, hue='Exited', kde=True, element='step', ax=axs[i, 1])
    axs[i, 1].set_title(f'Distribution of {column} against churn value')

# Adjust layout
plt.tight_layout()
plt.show()

**Check the relationship between the churn attribute and the categorical ones**

[**EX3**] Draw a box plot for categorical variables and distribution plot against each churn value.

Tip: Use the same structure used for numerical variables

**Solution**

In [None]:
# CODE HERE (Tip: Similar to numeric features)

# Set the aesthetic style of the plots
sns.set_style("whitegrid")

# Determine the number of rows and columns for the subplot grid
num_vars = len(categorical_variables)
cols = 2  # Two plots for each variable: boxplot and distribution plot
rows = num_vars  # One row for each variable

# Create a figure
fig, axs = plt.subplots(rows, cols, figsize=(15, 5*rows))

# Box plots and distribution plots for categorical variables
for i, column in enumerate(categorical_variables):
    sns.boxplot(x='Exited', y=column, data=data, ax=axs[i, 0])
    axs[i, 0].set_title(f'Box plot of {column}')

    sns.histplot(data=data, x=column, hue='Exited', kde=True, element='step', ax=axs[i, 1])
    axs[i, 1].set_title(f'Distribution of {column} against churn value')

# Adjust layout
plt.tight_layout()
plt.show()
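Because box plots assume an ordered numeric axis, a table of churn rates per category is often easier to read for categorical variables. The following is only a sketch on a toy dataframe with invented values; the column names mirror the dataset, but nothing here comes from the real data.

```python
import pandas as pd

# Toy data (hypothetical): one categorical feature and the churn flag
toy = pd.DataFrame({
    "Geography": ["France", "France", "Spain", "Germany", "Germany", "Germany"],
    "Exited":    [0, 1, 0, 1, 1, 0],
})

# The mean of a 0/1 column is the share of churners within each category
churn_rate = toy.groupby("Geography")["Exited"].mean()

# Same information as a contingency table normalised by row
table = pd.crosstab(toy["Geography"], toy["Exited"], normalize="index")

print(churn_rate)
print(table)
```

The same `groupby` could be looped over `categorical_variables` to obtain one churn-rate table per feature.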

[**EX4**]

List 2 insights that you can extract from the plots of the numerical and categorical variables.

**Solution**

Higher estimated salaries seem to correlate with not exiting. Also, box plots are not informative for categorical variables.

In [None]:
sns.pairplot(data[num_variables + target_variable], hue="Exited", height=5.5,diag_kind="kde")
plt.show()

#### Correlation matrix:

In [None]:
corr_matrix = data.corr(numeric_only=True)
sns.set (rc = {'figure.figsize':(15, 10)})
sns.heatmap(corr_matrix, annot=True)
plt.show()

[**EX5**]

Do you see any variable that is strongly correlated with the churn value? Which one, and why?

**Solution**

Number of products and Age seem to be quite correlated with churn value. The former negatively and the latter positively.
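Reading the heatmap by eye can be complemented by extracting only the target column of the correlation matrix and sorting it. This is a minimal sketch on invented numbers, chosen so that `NumOfProducts` is perfectly anti-correlated with `Exited`; the real dataset will give different values.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Age":           [25, 30, 45, 50, 60, 35],
    "NumOfProducts": [2, 2, 1, 1, 1, 2],
    "Exited":        [0, 0, 1, 1, 1, 0],
})

# Pearson correlation of every numeric column with the target,
# ordered from the strongest to the weakest in absolute value
target_corr = toy.corr(numeric_only=True)["Exited"].drop("Exited")
target_corr = target_corr.sort_values(key=abs, ascending=False)

print(target_corr)
```

Keep in mind that Pearson correlation only captures linear relationships, so a low value does not prove that a feature is useless.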

### **1.3 Data wrangling**

Once the dataset has been explored, the next step is to clean it. This process, known as Data Wrangling, consists of removing or imputing null values, standardizing values, dropping fields that are not of interest for our application, and so on.

In this case we did not detect any null values.

#### Encode binary categorical data such as Gender:

In [None]:
data['Gender'] = np.where(data['Gender']=='Female', 1, 0)
data['Gender'].tail()

[**EX 6**]

Encode categorical variables (those that are not binary) with one-hot-encoding

Tip: [.get_dummies()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) Converts categorical variable into dummy/indicator variables.

- Which feature of our dataset is a candidate for performing a one-hot-encoding?

**Solution**

Geography: it is the only categorical variable with more than two categories. Gender is binary and has already been encoded above.

In [None]:
# CODE HERE
# Define the categorical variables to be encoded
multi_cols = ['Geography']

# Performs the one-hot-encoding
data = pd.get_dummies(data = data, columns = multi_cols)
data.iloc[:,-3:].head(3)

[**EX 7**]

In order to build up the model and increase its efficiency, scale numerical variables to mean = 0 and std = 1.

Tip: [StandardScaler()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) Standardize features by removing the mean and scaling to unit variance

- Which numerical variables are candidates for standardization?

**Solution**

CreditScore, Age, Tenure, Balance, NumOfProducts and EstimatedSalary are all candidates. In practice we only scale CreditScore, Balance and EstimatedSalary, since those are the ones with the largest magnitudes.

In [None]:
# Build up the standard scaler
std = StandardScaler()

# CODE HERE
to_scale_variables = ['CreditScore', 'Balance', 'EstimatedSalary']

# Transform the data
scaled = std.fit_transform(data[to_scale_variables])
# Keep the original index so that the index-based merge below aligns row by row
scaled = pd.DataFrame(scaled, columns=to_scale_variables, index=data.index)

# Join both dataframes
data.rename(columns = {column: column + "_original" for column in to_scale_variables}, inplace=True)
data = data.merge(scaled,left_index=True,right_index=True,how = "left")
data.drop(columns = [column for column in data.columns if "_original" in column], inplace = True)

In [None]:
data.head()
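The cell above fits the scaler on the whole dataset before any split. A common variant, sketched here on synthetic numbers, is to fit the scaler on the training rows only and reuse its statistics on the held-out rows, so that no information from the test data reaches the model. The array `X` is an invented stand-in for a column such as Balance.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one numerical feature (hypothetical values)
rng = np.random.default_rng(0)
X = rng.normal(loc=50_000, scale=10_000, size=(200, 1))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=1)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

print(X_train_scaled.mean(), X_train_scaled.std())
```

With this approach the test set is generally not exactly zero-mean and unit-variance, which is expected.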

# **2. Model Training: Building a churn prediction model**

Now we are ready to enter the training stage of the machine learning models. The usual procedure is to start with baseline models (e.g. SVM, Decision Trees, Naive Bayes, etc.) and later try to improve them by adjusting their hyperparameters or by building more complex architectures such as ensembles. In this case we will only improve the model by adjusting its hyperparameters.

A very important step when building a model is choosing the right model threshold (usually known as the model cutoff value). The threshold is the predicted probability above which an observation is assigned to the positive class, so moving it trades false positives against false negatives and lets us fit the model to our requirements. For example, imagine a model that detects sick people in order to speed up their medical treatment. Is it better to classify healthy people as sick, or sick people as healthy? Normally it is better to classify healthy people as sick, because they will be ruled out by the medical tests, whereas sick people classified as healthy will not be treated and may die. Therefore, the model cutoff plays a really important role when designing a model. Some libraries accept the cutoff as an input parameter, but most do not and use 0.5 by default. In that case, you have to train your model, predict probabilities, and convert them into classes according to your threshold.
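To make the effect of the cutoff concrete, here is a minimal sketch with hand-written labels and probabilities (all values are invented, and `classify` is a hypothetical helper). It converts probabilities into classes for several cutoffs and reports precision and recall for each.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Invented true labels and predicted churn probabilities
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
proba = np.array([0.10, 0.20, 0.35, 0.55, 0.45, 0.65, 0.90, 0.70, 0.30, 0.05])

def classify(proba, cutoff):
    """Assign class 1 when the predicted probability reaches the cutoff."""
    return np.where(proba >= cutoff, 1, 0)

results = {}
for cutoff in (0.3, 0.5, 0.7):
    y_pred = classify(proba, cutoff)
    results[cutoff] = (precision_score(y_true, y_pred), recall_score(y_true, y_pred))
    print(f"cutoff={cutoff}: precision={results[cutoff][0]:.2f}, recall={results[cutoff][1]:.2f}")
```

In this toy example recall falls as the cutoff rises, because fewer real churners reach the threshold.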

The function `train_model_and_performance_eval` trains the model and plots the most relevant metrics, which will help you choose the model hyperparameters and the cutoff.

In [None]:
def train_model_and_performance_eval(logit,train_x,test_x,train_y,test_y, cols, cutoff = 0.5, cf = 'coefficients'):
    logit.fit(train_x,train_y)

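    # Predicted class probabilities; column 1 is the probability of churn (class 1)
    probabilities = logit.predict_proba(test_x)

    # Apply the custom cutoff instead of the default 0.5 behind logit.predict()
    predictions = np.where(probabilities[:,1] >= cutoff, 1, 0)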

    # Calculate the coefficients dataframe depending on if they are real coefficients or features
    if   cf == "coefficients" :
        coefficients  = pd.DataFrame(logit.coef_.ravel())
    elif cf == "features" :
        coefficients  = pd.DataFrame(logit.feature_importances_)

    # Set the coefficients to be shown
    column_df = pd.DataFrame(cols)
    coef_sumry = (pd.merge(coefficients,column_df,left_index= True,
                              right_index= True, how = "left"))
    coef_sumry.columns = ["coefficients","features"]
    coef_sumry = coef_sumry.sort_values(by = "coefficients",ascending = False)

    print(f'{logit}\n\nClassification report:\n{classification_report(test_y,predictions)}')
    print(f'\nAccuracy Score: {accuracy_score(test_y,predictions)}')

    # Plot confusion matrix (see plot_confusion_matrix function)
    # cm = plot_confusion_matrix(logit, test_x, test_y, cmap=plt.cm.Reds)
    cm = confusion_matrix(test_y, predictions, labels=logit.classes_)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=logit.classes_)
    disp.plot()
    # disp.show()

    fig = disp.ax_.get_figure()
    fig.set_figwidth(5)
    fig.set_figheight(5)
    plt.grid(False)

    # Plot the probability density chart
    # Colour the densities by the real class (test_y), not by the thresholded predictions
    model_and_real_values = pd.DataFrame({"real_value": np.asarray(test_y), "predicted_value": probabilities[:, 1]})
    plt.figure(figsize=(15, 10))
    sns.kdeplot(data=model_and_real_values, x="predicted_value", fill=True, hue="real_value", label='Predicted Probability Density')
    plt.axvline(x=cutoff)
    plt.title('Probability Density Chart')

    # Plot feature importance bar
    # Prepare Data
    coef_sumry.reset_index(inplace=True)
    coef_sumry['colors'] = ['red' if x < 0 else 'green' for x in coef_sumry['coefficients']]
    coef_sumry

    # Draw plot
    plt.figure(figsize=(14,10), dpi= 80)
    plt.hlines(y=coef_sumry.index, xmin=0, xmax=coef_sumry.coefficients, color=coef_sumry.colors, alpha=0.4, linewidth=5)

    # Decorations
    plt.gca().set(ylabel='$Features$', xlabel='$Coefficients$')
    plt.yticks(coef_sumry.index, coef_sumry.features, fontsize=12)
    plt.title('Feature Importance', fontdict={'size':20})
    plt.grid(linestyle='--', alpha=0.5)
    plt.show()

    return probabilities

In [None]:
# Getting columns to be used when training
cols = [i for i in data.columns if i not in ('RowNumber', 'CustomerId', 'Surname', 'Exited')]

# Split train, validation and test data
train_val, test = train_test_split(data, test_size = .1 ,random_state = 1)
train, val = train_test_split(train_val, test_size = .3 ,random_state = 1)

train_x = train[cols]
train_y = train["Exited"]
val_x = val[cols]
val_y = val["Exited"].values.ravel()

# Change the hyperparameters to try to get the best model
logit  = LogisticRegression(C=1.0, max_iter=100, solver='liblinear')

# Train the model
predictions = train_model_and_performance_eval(logit, train_x, val_x, train_y, val_y, cols, cutoff = 0.6)

[**EX 8**]  [**Play with the hyperparameters of the functions above**]

Calculate the best model and define the process you have followed to achieve it. Some questions to answer are:

- Which variables did you use to train the model, and why?

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.exceptions import ConvergenceWarning
import warnings

param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga'],
    'max_iter': [100, 200, 500]
}

# https://stackoverflow.com/questions/66938102/hide-scikit-learn-convergencewarning-increase-the-number-of-iterations-max-it
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    
    # https://datascience.stackexchange.com/questions/30881/when-is-precision-more-important-over-recall
    grid = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring='recall')
    grid.fit(train_x, train_y)
    # we should take into account validation set too

print("Best params:", grid.best_params_)
print("Best recall:", grid.best_score_)

best_logit = grid.best_estimator_
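The score reported by the grid search is a cross-validated average over the training data. One way to take the validation set into account is to score the selected estimator on it as well. The sketch below does this on synthetic data from `make_classification` with a reduced grid; the data, the grid and the variable names are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the churn data (about 20% positives)
X, y = make_classification(n_samples=400, n_features=6, weights=[0.8, 0.2], random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Reduced grid: only the regularisation strength C is searched here
grid = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    {"C": [0.1, 1, 10]},
    cv=5,
    scoring="recall",
)
grid.fit(X_train, y_train)

# Recall of the selected model on data that the search never saw
val_recall = recall_score(y_val, grid.best_estimator_.predict(X_val))
print(grid.best_params_, grid.best_score_, val_recall)
```

A large gap between the cross-validated recall and the validation recall would suggest that the selected hyperparameters do not generalize well.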

In [None]:
# Change the hyperparameters to try to get the best model
logit  = LogisticRegression(C=10, max_iter=100, solver='liblinear', penalty='l1')

# Train the model
predictions = train_model_and_performance_eval(logit, train_x, val_x, train_y, val_y, cols, cutoff = 0.6)

#### Solution:

Since the dataset is relatively small, we can brute-force the search with a grid search to find the best hyperparameters. We want to maximize recall in order to minimize false negatives: in this use case it is worse to lose a client we predicted would stay (false negative) than to target a client who was never going to leave (false positive). With this approach we find that the best model among the ones tried is
```
LogisticRegression(C=10, max_iter=100, solver='liblinear', penalty='l1')
```

However, it might not be worth it since there are too many false positives, so we might need to take other metrics into account.

[**EX 9**]

- Do you think the density chart is balanced?

#### Solution:

The density chart does not seem to be balanced. Both the churn and the non-churn densities resemble a Poisson distribution.

[**EX 10**]  **Confusion Matrix**
- What do the confusion matrix values mean?
- In this case do you think it is worse to have false positives or false negatives?
- Do you think the model works well based on the confusion matrix?

#### Solution:

- The confusion matrix values are (clockwise from the top left): true negatives, false positives, true positives, false negatives. Rows are the real classes and columns are the predicted ones.
- It is worse to have false negatives.
- This model prioritizes avoiding false negatives, but perhaps too much: there are too many false positives.
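In scikit-learn, `confusion_matrix` puts the real classes on the rows and the predicted classes on the columns, so for labels `[0, 1]` the flattened matrix reads TN, FP, FN, TP. The sketch below checks this on a few invented labels.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented real and predicted labels
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0])

# Rows = real class, columns = predicted class
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)      # share of real churners that were caught
precision = tp / (tp + fp)   # share of flagged customers that really churned
print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}, recall={recall:.2f}, precision={precision:.2f}")
```

Here a false negative is a churner the model missed, and a false positive is a loyal customer flagged as a churner.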

**[EX 11]  [Play with the cutoff]**

- What happens if you move the cutoff to the right?
- and to the left?
- How many customers are at risk of churn?

#### Solution:

If we move the cutoff to the right (e.g. 0.9), we only flag a customer as a churner when the predicted churn probability is at least 90%, so we get fewer false positives but more false negatives. The opposite happens when we move it to the left. The number of customers "at risk" therefore depends on this parameter.
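The number of customers considered at risk can be counted directly from the predicted probabilities. A small sketch with invented probabilities (the `at_risk` dictionary is an illustrative name):

```python
import numpy as np

# Invented predicted churn probabilities for eight customers
proba = np.array([0.05, 0.15, 0.30, 0.45, 0.55, 0.62, 0.75, 0.91])

# Number of customers flagged as possible churners for each cutoff
at_risk = {cutoff: int((proba >= cutoff).sum()) for cutoff in (0.3, 0.6, 0.9)}
print(at_risk)  # {0.3: 6, 0.6: 3, 0.9: 1}
```

With the real model, `proba` would be the second column of `predict_proba` on the validation set.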

[**EX 12**]

Predict over the test dataset (using the original dataset values) which customers are possible churners or not.

- PossibleChurner: value 0/1
- PossibleChurner_Proba: probabilities in [0, 1]

**Solution**

In [None]:
original_test = original_data[original_data.CustomerId.isin(test.CustomerId.values)].sort_values(by = ["CustomerId"])
test = test.sort_values(by = ["CustomerId"])

original_test['PossibleChurner_Proba'] = logit.predict_proba(test[cols])[:,1]
original_test['PossibleChurner'] = np.where(original_test['PossibleChurner_Proba'] >= 0.6, 1, 0)

In [None]:
original_test

[**EX 13**]

Plot the probability density chart of the new predictions

In [None]:
# CODE HERE

# Plot the probability density chart
# The hue is the class predicted with the 0.6 cutoff, so the column is named accordingly
model_and_predicted_values = pd.DataFrame({"predicted_class": original_test['PossibleChurner'], "predicted_value": original_test['PossibleChurner_Proba']})
plt.figure(figsize=(15, 10))
sns.kdeplot(data=model_and_predicted_values, x="predicted_value", fill=True, hue="predicted_class", label='Predicted Probability Density')
plt.axvline(x=0.6)
plt.title('Probability Density Chart')
plt.show()

[**EX 14**]

- In the new prediction, how do you think the model is performing?
- How could we improve it?

**Solution**

I think that the model generalizes quite well. I do not know how I would improve it.
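Since the test split still contains the real `Exited` values, one way to back up a statement about generalization is to compute the same metrics on the test predictions and compare them with the validation ones. The sketch below shows the idea on synthetic data; the data, the model settings and the `metrics` dictionary are assumptions for illustration, and only the 0.6 cutoff is taken from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data
X, y = make_classification(n_samples=500, n_features=6, weights=[0.8, 0.2], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=1
)

model = LogisticRegression(solver="liblinear").fit(X_train, y_train)

# Probabilities of the positive class, thresholded with the notebook's 0.6 cutoff
test_proba = model.predict_proba(X_test)[:, 1]
test_pred = np.where(test_proba >= 0.6, 1, 0)

metrics = {
    "recall": recall_score(y_test, test_pred, zero_division=0),
    "precision": precision_score(y_test, test_pred, zero_division=0),
    "roc_auc": roc_auc_score(y_test, test_proba),  # independent of the cutoff
}
print(metrics)
```

If the test metrics are close to the validation ones, the claim that the model generalizes is supported; if recall drops noticeably, lowering the cutoff is one option to explore.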