# **PANDA - Final Project**

## Author: Pedro Malandrin Klesse

## Access to the Dataset

- https://www.kaggle.com/datasets/muhammadshahidazeem/customer-churn-dataset

## Attributes

- **CustomerID**: Customer Identification
- **Age**: Customer age
- **Gender**: Customer Gender (Male or Female)
- **Tenure**: Length of time the person has been a customer of the company
- **Usage Frequency**: How often the customer uses the company's services
- **Support Calls**: Number of support calls the customer has made
- **Payment Delay**: How long after the deadline the customer takes to pay the service bill
- **Subscription Type**: Type of service the customer subscribed to (Standard, Basic or Premium)
- **Contract Length**: Length of the contract (Monthly, Quarterly or Annual)
- **Total Spend**: Total amount the customer has spent on the company's services
- **Last Interaction**: Time of the customer's last interaction with the company
- **Churn**: Whether the customer churned (1) or not (0)


## PART 1: *Exploratory Data Analysis and Training Machine Learning Models - Customer Churn Dataset*

## Libraries and Loading Dataset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from IPython.display import Image
import jovian

In [None]:
dataframe = pd.read_csv('customer_churn_dataset-training-master.csv')

In [None]:
dataframe['Contract Length'].unique()

## Descriptive Statistics

In [None]:
dataframe.head(5)

## Dataset Length

In [None]:
print(f'Rows: {dataframe.shape[0]}')
print(f'Columns: {dataframe.shape[1]}')

## Types of Data and Non-null Values

In [None]:
dataframe.info()

## Verifying Null Values

In [None]:
dataframe.isna().sum()

## Removing Rows with Null Values

In [None]:
dataframe = dataframe.dropna()
dataframe.isna().sum()

## Statistical Summary of the Dataset

In [None]:
dataframe.describe()

## Class Distribution

In [None]:
# Share of each class; label 1 means the customer churned, 0 means they did not
churn_share = dataframe['Churn'].value_counts(normalize=True)
print(f"Churned: {churn_share.loc[1]}\nDid not churn: {churn_share.loc[0]}")

## Data Distribution Analysis

In [None]:
Image(filename='skew.png')

In [None]:
dataframe.drop(columns=['Gender','Subscription Type', 'Contract Length']).skew()
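
As a small aid for reading these values, the sketch below applies `Series.skew()` to two made-up samples (the numbers are illustrative and do not come from the dataset). A symmetric sample should give a skewness of about 0, while a sample with a long right tail should give a positive value.

```python
import pandas as pd

# Illustrative data only (not from the churn dataset)
symmetric = pd.Series([1, 2, 3, 4, 5])
right_tailed = pd.Series([1, 1, 1, 2, 2, 3, 10])

# skew() is ~0 for symmetric data and positive when the long tail is on the right
print(f'Symmetric sample skew: {symmetric.skew():.2f}')
print(f'Right-tailed sample skew: {right_tailed.skew():.2f}')
```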

## Outliers Analysis

In [None]:
# Compare the midpoint of the range with the mean of each numeric attribute
outlier_cols = ['Age', 'Tenure', 'Usage Frequency', 'Support Calls',
                'Payment Delay', 'Total Spend', 'Last Interaction']

for col in outlier_cols:
    print(f'==={col}===')
    print(f'Midpoint between max and min: {(dataframe[col].max() + dataframe[col].min()) / 2}')
    print(f'Mean of all values: {round(dataframe[col].mean(), 2)}')
    print('\n')

### Note: the mean of each attribute is close to the midpoint between its min and max values, which suggests there are no extreme outliers
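
A complementary check, sketched here under the assumption that the usual 1.5×IQR rule (the same rule box plot whiskers typically use) is acceptable for this data, is to count the values that fall outside the whiskers. The helper name `count_iqr_outliers` and the sample values are hypothetical and only serve as an illustration.

```python
import pandas as pd

def count_iqr_outliers(series: pd.Series) -> int:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((series < lower) | (series > upper)).sum())

# Illustrative data only: one extreme value among otherwise regular ones
sample = pd.Series([10, 12, 11, 13, 12, 11, 95])
print(count_iqr_outliers(sample))
```

In the notebook, the same helper could be applied to each numeric column, for example `dataframe['Age']`.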

In [None]:
Image(filename='boxplot.jpg')

In [None]:
# BoxPlot

subplot_titles = list(dataframe.columns.drop(['CustomerID','Churn','Last Interaction']))
subplot_titles.append('')
subplot_titles.append('Last Interaction')
subplot_titles.append('')

fig = make_subplots(rows=4, 
                    cols=3, 
                    specs=[[{'type':'xy'},{'type':'xy'},{'type':'xy'}],
                           [{'type':'xy'},{'type':'xy'},{'type':'xy'}],
                           [{'type':'xy'},{'type':'xy'},{'type':'xy'}],
                           [{'type':'xy'},{'type':'xy'},{'type':'xy'}]],
                    subplot_titles=subplot_titles)

fig.add_trace(
    go.Box(y=dataframe['Age']),
    row=1,col=1
)

fig.add_trace(
    go.Box(y=dataframe['Gender']),
    row=1,col=2
)

fig.add_trace(
    go.Box(y=dataframe['Tenure']),
    row=1,col=3
)

fig.add_trace(
    go.Box(y=dataframe['Usage Frequency']),
    row=2,col=1
)

fig.add_trace(
    go.Box(y=dataframe['Support Calls']),
    row=2,col=2
)

fig.add_trace(
    go.Box(y=dataframe['Payment Delay']),
    row=2,col=3
)

fig.add_trace(
    go.Box(y=dataframe['Subscription Type']),
    row=3,col=1
)

fig.add_trace(
    go.Box(y=dataframe['Contract Length']),
    row=3,col=2
)

fig.add_trace(
    go.Box(y=dataframe['Total Spend']),
    row=3,col=3
)

fig.add_trace(
    go.Box(y=dataframe['Last Interaction']),
    row=4,col=2
)

fig.update_layout(height=1000,
                  width=1000,
                  title_text='Distributions of the Attributes',
                  font_color='white',
                  paper_bgcolor='black',
                  showlegend=False)
fig.show()

## Correlation Matrix

In [None]:
dataframe_numeric = dataframe.select_dtypes(['float64'])

corr = dataframe_numeric.corr()
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)]=True

plt.figure(figsize=(16,9))
sns.heatmap(corr, mask=mask, cmap='mako', annot=True)
plt.show()

### Some conclusions

- CustomerID should be removed because it is only an identifier and has no a priori relation to the Churn attribute.
- Tenure and Usage Frequency can be removed before training a machine learning model because of their weak correlation with the Churn label.
- Support Calls and Payment Delay, in that order, have the strongest correlations with the Churn attribute.
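
One way to read such a ranking directly, instead of scanning the heatmap, is to sort the absolute correlations with the target. The sketch below uses a made-up frame (the values are illustrative, not taken from the dataset) to show the idea.

```python
import pandas as pd

# Illustrative data only (not from the churn dataset)
toy = pd.DataFrame({
    'Support Calls': [0, 1, 2, 7, 8, 9],
    'Tenure':        [5, 30, 12, 8, 25, 14],
    'Churn':         [0, 0, 0, 1, 1, 1],
})

# Absolute Pearson correlation of every feature with the target, strongest first
churn_corr = (toy.corr()['Churn']
                 .drop('Churn')
                 .abs()
                 .sort_values(ascending=False))
print(churn_corr)
```

Applied to `dataframe_numeric`, the same expression would list the attributes from most to least correlated with Churn.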

## Numeric Attributes Analysis

## Univariate Analysis

In [None]:
# Histogram

subplot_titles = ['Support Calls', 'Total Spend', 'Payment Delay', 'Age', 'Last Interaction']

fig = make_subplots(rows=2, 
                    cols=3, 
                    specs=[[{'type':'xy'},{'type':'xy'},{'type':'xy'}],
                           [{'type':'xy'},{'type':'xy'},{'type':'xy'}]],
                    subplot_titles=subplot_titles)

fig.add_trace(
    go.Histogram(x=dataframe['Support Calls']),
    row=1,col=1
)

fig.add_trace(
    go.Histogram(x=dataframe['Total Spend']),
    row=1,col=2
)

fig.add_trace(
    go.Histogram(x=dataframe['Payment Delay']),
    row=1,col=3
)

fig.add_trace(
    go.Histogram(x=dataframe['Age']),
    row=2,col=1
)

fig.add_trace(
    go.Histogram(x=dataframe['Last Interaction']),
    row=2,col=2
)

fig.update_layout(height=1000,
                  width=1000,
                  title_text='Univariate Analysis',
                  font_color='white',
                  paper_bgcolor='black',
                  showlegend=False)
fig.show()

### Observations

- 4 out of 5 distributions are asymmetric.
- The asymmetry shifts the mean.
- The asymmetry should not unbalance model training, because the skewed distributions are similar to one another.

## Bivariate Analysis

In [None]:
# Histogram

fig = plt.figure(figsize=(30,20))
fig.suptitle('Bivariate Analysis')
fig.subplots_adjust(hspace=0.6, wspace=0.8)
ax = fig.add_subplot(2,3,1) # number of rows | number of columns | subplot index
sns.histplot(x=dataframe['Support Calls'],hue=dataframe['Churn'])
ax.set_title('Support Calls')
ax = fig.add_subplot(2,3,2)
sns.histplot(x=dataframe['Total Spend'],hue=dataframe['Churn'])
ax.set_title('Total Spend')
ax = fig.add_subplot(2,3,3)
sns.histplot(x=dataframe['Payment Delay'],hue=dataframe['Churn'])
ax.set_title('Payment Delay')
ax = fig.add_subplot(2,3,4)
sns.histplot(x=dataframe['Age'],hue=dataframe['Churn'])
ax.set_title('Age')
ax = fig.add_subplot(2,3,5)
sns.histplot(x=dataframe['Last Interaction'],hue=dataframe['Churn'])
ax.set_title('Last Interaction')
plt.show()

### Observations

#### Support Calls

- The more support calls, the higher the probability that the customer churns.
- More support calls may reflect difficulties and dissatisfaction with the product, which leads to churn.

#### Total Spend

- The more the customer spends with the company, the lower the probability of churn.
- Note: there are two different types of customer, those who buy expensive services and those who buy many cheap services.


#### Payment Delay

- The shorter the customer's payment delay, the lower the probability of churn.
- Maybe this company gives benefits to those who pay their bills on time.

#### Age

- The older the customer, the higher the probability of churn.
- Maybe older people have less patience when they run into difficulties with the services.

#### Last Interaction

- The longer ago the last contact between the company and the client, the higher the probability of churn.
- Maybe the company should reach out to the clients it has not contacted for the longest time.
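
The observations above are read off the histograms by eye. A rough way to put numbers on them, sketched here with made-up data and hypothetical bin edges, is to bucket a feature and take the mean of the 0/1 Churn column per bucket, which gives the observed churn rate in each bucket.

```python
import pandas as pd

# Illustrative data only (not from the churn dataset)
toy = pd.DataFrame({
    'Support Calls': [0, 1, 2, 3, 6, 7, 8, 9],
    'Churn':         [0, 0, 0, 1, 1, 1, 1, 1],
})

# Bucket the feature, then average the 0/1 target per bucket:
# that average is the observed churn rate in the bucket
toy['Calls Bin'] = pd.cut(toy['Support Calls'], bins=[-1, 4, 10], labels=['0-4', '5-10'])
churn_rate = toy.groupby('Calls Bin', observed=True)['Churn'].mean()
print(churn_rate)
```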

## Categorical Attributes Analysis - Gender, Subscription Type and Contract Length

### Gender

In [None]:
dataframe['Gender'].value_counts()

In [None]:
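colors = sns.color_palette('pastel')[0:2]
gender_counts = dataframe['Gender'].value_counts()
plt.title('Gender')
# Take the labels from value_counts() so they always match the slice order
plt.pie(gender_counts, labels=gender_counts.index, colors=colors, autopct='%.0f%%')
plt.show()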

In [None]:
# Histogram

px.histogram(dataframe,x='Gender',color='Churn')

In [None]:
total_female = dataframe[dataframe['Gender']=='Female'].shape[0]
total_male = dataframe[dataframe['Gender']=='Male'].shape[0]

# Churn == 1 marks the customers who churned, for both genders
female_churn = dataframe[(dataframe['Churn']==1) & (dataframe['Gender']=='Female')].shape[0]
male_churn = dataframe[(dataframe['Churn']==1) & (dataframe['Gender']=='Male')].shape[0]

print(f'Women who churned out of all women (%): {(female_churn/total_female)*100}')
print(f'Men who churned out of all men (%): {(male_churn/total_male)*100}')

In [None]:
dataframe[(dataframe['Churn']==1)].groupby('Gender').count()

### Subscription Type

In [None]:
dataframe['Subscription Type'].value_counts()

In [None]:
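colors = sns.color_palette('pastel')[0:3]
subscription_counts = dataframe['Subscription Type'].value_counts()
plt.title('Subscription Type')
# Take the labels from value_counts() so they always match the slice order
plt.pie(subscription_counts, labels=subscription_counts.index, colors=colors, autopct='%.0f%%')
plt.show()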

In [None]:
# Histogram

px.histogram(dataframe,x='Subscription Type',color='Churn')

### Contract Length

In [None]:
dataframe['Contract Length'].value_counts()

In [None]:
# Pie Chart

colors = sns.color_palette('pastel')[0:3]
contract_counts = dataframe['Contract Length'].value_counts()
plt.title('Contract Length')
# Take the labels from value_counts() so they always match the slice order
plt.pie(contract_counts, labels=contract_counts.index, colors=colors, autopct='%.0f%%')
plt.show()

In [None]:
# Histogram

px.histogram(dataframe,x='Contract Length',color='Churn')

In [None]:
# Count the rows with a monthly contract, split by churn label
monthly_and_churn = dataframe[(dataframe['Contract Length']=='Monthly') & (dataframe['Churn']==1)].shape[0]
monthly_and_not_churn = dataframe[(dataframe['Contract Length']=='Monthly') & (dataframe['Churn']==0)].shape[0]
print(f'Total number of rows: {dataframe.shape[0]}')
print(f'Monthly Contract Length and Churn: {monthly_and_churn}')
print(f'Monthly Contract Length and Not Churn: {monthly_and_not_churn}')

### Some conclusions

- Proportionally, women churn more than men.
- The subscription type shows no apparent relation to churn: its values are similar for customers who churn and those who do not.
- Customers on a monthly contract always churn.
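
Proportions like these can also be tabulated in one step. The sketch below uses `pd.crosstab` with `normalize='index'` on a made-up frame (the rows are illustrative, not from the dataset), so that each row sums to 1 and the column for label 1 is the churn rate of that category.

```python
import pandas as pd

# Illustrative data only (not from the churn dataset)
toy = pd.DataFrame({
    'Contract Length': ['Monthly', 'Monthly', 'Annual', 'Annual', 'Quarterly', 'Quarterly'],
    'Churn':           [1, 1, 0, 1, 0, 0],
})

# normalize='index' turns each row into proportions, so column 1 is the churn rate
rates = pd.crosstab(toy['Contract Length'], toy['Churn'], normalize='index')
print(rates)
```

The same call should work for `Gender` and `Subscription Type` by swapping the first argument.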

## Conclusions about the Exploratory Data Analysis
- These columns are not needed to train the machine learning models: CustomerID, Tenure, Usage Frequency and maybe Subscription Type.
- Characteristics such as female gender, age above 60, more support calls, a long time since the last contact with the company, low spending on services and payment delays are associated with customer churn.
- Customers with a Monthly Contract Length always churn.
- The two classes (churn and not churn) are fairly well balanced, but it is worth considering strategies to compensate for the slight difference between them, to be as fair as possible.

## Part 2: Training Machine Learning Models to Predict Churn

## Libraries and Loading Dataset

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split


## Remove the Columns and Separate the Dataset (70/30)

In this case there is one dataset for training and another for testing, so I decided to split only the training dataset into train (70%) and validation (30%).

In [None]:
dataframe = pd.read_csv('customer_churn_dataset-training-master.csv')
dataframe = dataframe.dropna()

In [None]:
dataframe = dataframe.drop(columns=['CustomerID', 'Tenure', 'Usage Frequency', 'Subscription Type'])

In [None]:
dataframe.columns

In [None]:
X = dataframe.drop(columns=['Churn'])

In [None]:
X.columns

In [None]:
y = dataframe['Churn']

In [None]:
# Split 70% train and 30% validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
dataframe_test = pd.read_csv('customer_churn_dataset-testing-master.csv')

In [None]:
dataframe_test = dataframe_test.drop(columns=['CustomerID', 'Tenure', 'Usage Frequency', 'Subscription Type'])

In [None]:
X_test, y_test = dataframe_test.drop(columns=['Churn']), dataframe_test['Churn']

## Ways to Bypass the Classes Distribution Problem

In this case I decided not to implement a strategy to compensate for the different class distributions, because they are very close in value.
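
For reference, two lightweight options that could be tried if the imbalance mattered more are a stratified split and class weighting. The sketch below uses synthetic data with a 60/40 class ratio (an assumption made for illustration, not the ratio of this dataset) and is not part of the pipeline used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data only: 60 positives and 40 negatives
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([1] * 60 + [0] * 40)

# stratify=y_toy keeps the class ratio the same in both splits
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy)
print(y_tr.mean(), y_va.mean())

# class_weight='balanced' reweights samples inversely to class frequency
clf = DecisionTreeClassifier(max_depth=3, class_weight='balanced', random_state=42)
clf.fit(X_tr, y_tr)
```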

## Separating Numerical and Categorical Columns

In [None]:
X_train.info()

In [None]:
numeric_cols = ['Age','Support Calls','Payment Delay','Total Spend','Last Interaction']
categorical_cols = ['Gender', 'Contract Length']

## Feature Engineering

## Imputation

Because there was only one row with missing values, and it was missing all of its attributes, we simply removed it from the dataset, so no imputation is needed.

## Scale Numeric Features

In [None]:
# Fit on the training split only, so validation statistics do not leak into the scaler
scaler = StandardScaler().fit(X_train[numeric_cols])

In [None]:
X_train[numeric_cols] = scaler.transform(X_train[numeric_cols])
X_val[numeric_cols] = scaler.transform(X_val[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])

In [None]:
X_train.describe().loc[['min', 'max']]

In [None]:
X_val.describe().loc[['min', 'max']]

## Encoding Categorical Data

In [None]:
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore').fit(dataframe[categorical_cols])

In [None]:
encoded_cols = list(encoder.get_feature_names_out(categorical_cols))

In [None]:
X_train[encoded_cols] = encoder.transform(X_train[categorical_cols])
X_val[encoded_cols] = encoder.transform(X_val[categorical_cols])
X_test[encoded_cols] = encoder.transform(X_test[categorical_cols])


In [None]:
X_train

In [None]:
X_train = X_train[numeric_cols + encoded_cols]
X_val = X_val[numeric_cols + encoded_cols]
X_test = X_test[numeric_cols + encoded_cols]

## Creating Models

## Decision Tree - Training

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
model = DecisionTreeClassifier(max_depth=8, random_state=42)

In [None]:
len(y_train) == len(X_train)

In [None]:
%%time
model.fit(X_train, y_train)

## Decision Tree - Evaluation

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix

In [None]:
val_preds = model.predict(X_val)

In [None]:
accuracy_score(y_val, val_preds)

In [None]:
# Score for validation
model.score(X_val, y_val)

In [None]:
# Score for test
model.score(X_test, y_test)
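
Accuracy alone does not show which kind of mistake the model makes. The sketch below shows how `confusion_matrix` (imported above) can be read, using made-up labels rather than this notebook's predictions.

```python
from sklearn.metrics import confusion_matrix

# Illustrative labels only (not model output from this notebook)
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0, 1]

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f'TN={tn} FP={fp} FN={fn} TP={tp}')
```

In the notebook, the equivalent call would be `confusion_matrix(y_val, val_preds)`.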

## Decision Tree - Hyperparameter Tuning

In [None]:
def test_params(**params):
    model = DecisionTreeClassifier(random_state=42, **params).fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_val, y_val)

In [None]:
# Tuning max_depth

train_errors = []
val_errors = []

range_ = 20

for i in range(1, range_):
    result = test_params(max_depth=i)
    train_errors.append(1-result[0])
    val_errors.append(1-result[1])

plt.figure(figsize=(10,8))
plt.title('Max Depth Tuning')
plt.plot(range(1, range_),train_errors)
plt.plot(range(1, range_),val_errors) 
plt.legend(['Train Error','Val Error'])
plt.show()

The validation error starts to increase for max_depth > 10, so I chose this value as the best one to train my model.

In [None]:
# Tuning max_leaf_nodes

train_errors = []
val_errors = []

range_ = 100

for i in range(2, range_):
    result = test_params(max_leaf_nodes=i)
    train_errors.append(1-result[0])
    val_errors.append(1-result[1])
    
plt.figure(figsize=(10,8))
plt.title('Max Leaf Nodes')
plt.plot(range(2, range_),train_errors)
plt.plot(range(2, range_),val_errors) 
plt.legend(['Train Error','Val Error'])
plt.show()

This parameter has little effect on the validation error, so I picked the value 100 for it.
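
The two parameters were tuned one at a time above. As a hedged alternative, they could be searched jointly with cross-validation. The sketch below uses `GridSearchCV` on synthetic data, and the grid values are placeholders chosen for illustration, not tuned values for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data only (not the churn dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=42)

# Hypothetical small grid; the ranges are placeholders, not tuned values
param_grid = {'max_depth': [2, 4, 6], 'max_leaf_nodes': [5, 20, 50]}

# Every combination is scored with 3-fold cross-validation
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=3)
search.fit(X_toy, y_toy)
print(search.best_params_)
```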

In [None]:
# Final Decision Tree Prediction

model = DecisionTreeClassifier(max_depth=10, max_leaf_nodes=100, random_state=42)
model.fit(X_train, y_train)
print(f'Train: {model.score(X_train,y_train)}')
print(f'Validation: {model.score(X_val,y_val)}')
print(f'Test: {model.score(X_test,y_test)}')


### Conclusions:
- I tried including the other columns to see whether they change the prediction rate, but, as expected, nothing changed, because those columns carry no importance for predicting churn.
- I know what a hyperparameter tuning chart should look like, but the ones for the decision tree look this way, maybe because of a problem of insufficient categories.
- The prediction rate is low either because the model generalizes poorly or because the test dataset has different value ranges from the training one.
- I looked at other people's work on this dataset on Kaggle, and everyone else also gets low prediction rates on the test data (all values near 50%).
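
One way to check how much each column contributes to a fitted tree is its `feature_importances_` attribute. The sketch below uses synthetic data in which one column fully determines the label and the other is unrelated (the column names `signal` and `noise` are made up for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic data only: 'signal' determines the label, 'noise' is unrelated
rng = np.random.default_rng(42)
toy = pd.DataFrame({'signal': rng.normal(size=200), 'noise': rng.normal(size=200)})
labels = (toy['signal'] > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(toy, labels)

# Importances sum to 1; a column the tree never splits on gets 0
importances = pd.Series(tree.feature_importances_, index=toy.columns)
print(importances.sort_values(ascending=False))
```

On the notebook's model, `pd.Series(model.feature_importances_, index=X_train.columns)` would give the corresponding ranking.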

## Random Forest - Training

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
model_rf = RandomForestClassifier(n_jobs=-1, random_state=42)

In [None]:
model_rf.fit(X_train, y_train)

## Random Forest - Evaluation

In [None]:
model_rf.score(X_train, y_train)

## Random Forest - Hyperparameter Tuning

In [None]:
def test_params(**params):
    model = RandomForestClassifier(random_state=42, **params).fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_val, y_val)

In [None]:
# Tuning n_estimators

train_errors = []
val_errors = []

range_ = 8

for i in range(1, range_):
    result = test_params(n_estimators=i)
    train_errors.append(1-result[0])
    val_errors.append(1-result[1])

plt.figure(figsize=(10,8))
plt.title('Number of Estimators Tuning')
plt.plot(range(1, range_),train_errors)
plt.plot(range(1, range_),val_errors) 
plt.legend(['Train Error','Val Error'])
plt.show()

My PC doesn't have enough CPU/GPU power to tune this parameter by brute force, so I decided to pick the last value tested: 7.

In [None]:
# Tuning max_depth

train_errors = []
val_errors = []

range_ = 20

for i in range(1, range_):
    result = test_params(max_depth=i)
    train_errors.append(1-result[0])
    val_errors.append(1-result[1])

plt.figure(figsize=(10,8))
plt.title('Max Depth Tuning')
plt.plot(range(1, range_),train_errors, label='Train Error')
plt.plot(range(1, range_),val_errors, label='Val Error') 
plt.legend()  # needed for the label= arguments to be displayed
plt.show()

We can see that above a max_depth of 16 the validation error of the Random Forest begins to increase.

In [None]:
# Final Random Forest Prediction

model = RandomForestClassifier(n_estimators=7, max_depth=16, random_state=42)
model.fit(X_train, y_train)
print(f'Train: {model.score(X_train,y_train)}')
print(f'Validation: {model.score(X_val,y_val)}')
print(f'Test: {model.score(X_test,y_test)}')

### Conclusion:

- Random Forests are much harder to tune because of their computational cost.
- Like the Decision Tree, this model performed poorly on the test dataset.
- Maybe the values in the test dataset provided by the authors on Kaggle are too different from those in the training one.
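
A quick way to probe the last hypothesis is to put the class balance and the per-class feature means of both files side by side. The sketch below uses two made-up frames (not the actual Kaggle files) to show the comparison. Large gaps between the `train` and `test` columns would hint at a shift between the datasets.

```python
import pandas as pd

# Illustrative frames only (not the actual Kaggle files)
toy_train = pd.DataFrame({'Payment Delay': [2, 5, 20, 25, 28], 'Churn': [0, 0, 1, 1, 1]})
toy_test = pd.DataFrame({'Payment Delay': [3, 4, 6, 22, 27], 'Churn': [1, 0, 0, 0, 1]})

# Side-by-side class balance
balance = pd.DataFrame({
    'train': toy_train['Churn'].value_counts(normalize=True),
    'test': toy_test['Churn'].value_counts(normalize=True),
})
print(balance)

# Side-by-side feature means per class
means = pd.DataFrame({
    'train': toy_train.groupby('Churn')['Payment Delay'].mean(),
    'test': toy_test.groupby('Churn')['Payment Delay'].mean(),
})
print(means)
```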

## Compare the Model with a Dummy One

With a dummy model we can check whether a model is really good at predicting something. We test whether picking labels at random can match the model's predictions, which would show that the model is not actually useful.

In [None]:
import random

number_correct = 0
number_incorrect = 0

for pred in y_test:
    if random.randint(0, 1) == pred:
        number_correct += 1
    else:
        number_incorrect += 1

print(f'Dummy Model Accuracy: {number_correct/(number_correct+number_incorrect)}')

Compared to the best model (Decision Tree), the dummy model is slightly worse.
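
scikit-learn also ships a ready-made baseline, `DummyClassifier`, which could replace the manual loop. The sketch below uses made-up labels with a 60/40 split (an illustrative assumption, not the notebook's test set) and shows two strategies: always predicting the majority class, and guessing uniformly at random.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Illustrative labels only (not the notebook's test set)
y_toy = np.array([1] * 60 + [0] * 40)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by DummyClassifier

# 'most_frequent' always predicts the majority class
majority = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
print(f'Majority-class accuracy: {majority.score(X_toy, y_toy)}')

# 'uniform' guesses each class with equal probability, like random.randint(0, 1)
uniform = DummyClassifier(strategy='uniform', random_state=42).fit(X_toy, y_toy)
print(f'Uniform-guess accuracy: {uniform.score(X_toy, y_toy)}')
```

The majority-class baseline is often the stricter one to beat when the classes are not perfectly balanced.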

## Final Conclusion

- These prediction accuracies are a little above 50%.
- They are not ideal, but they can be useful when the company has no other tools to predict churn and needs to avoid it.
- The prediction values are low maybe for two reasons: 1. Insufficient attributes to support the prediction. 2. A test dataset whose values differ greatly from the training one (in its patterns and in how the data correlates with itself).
- Overall, these predictions could serve as a starting point for predicting churn for the company behind the dataset, because they are slightly better than a dummy model, but it would be necessary to revisit the datasets and their values, distributions, patterns, etc. to improve the data used to train the model and get better results.
- With more attributes or a different test dataset (one with the same patterns as the training one, for example where people who paid for monthly services always churn), we may get better results.
- No one on Kaggle had great results using classic machine learning models; only one person had a small increase, to 54% accuracy, using a deep neural network.