# __Milestone 1: Business Understanding__

## Problem Statement

Predict whether customers are likely to churn based on their past behaviour and demographics. 

## Data identification

To build a machine learning model that predicts customer churn, we need a combination of features that capture each customer's interactions with our service as well as customer demographic information. The features that we will use in our machine learning model are Gender, Age, Income, TotalPurchase, and NumOfPurchases.

## Hypothesis 

## Collect and clean the data

We have collected raw data based on the desired features and target attributes for our churn prediction model. This raw data has been stored in the train.csv file in our data folder. We will now import this data into a dataframe and start cleaning the data.

### Import

In [25]:
# Suppress warnings
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

import pandas as pd # data wrangling
import seaborn as sns # data visualization
import plotly.express as px
import matplotlib.pyplot as plt

# for cat features
from category_encoders import OneHotEncoder

from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.pipeline import make_pipeline

from skimpy import clean_columns

In [26]:
df = pd.read_csv('./data/train.csv') #reading the data from the csv file to our dataframe
df.head() #display the first few data entries as well as column headings

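|   | CustomerID | Gender | Age | Income | TotalPurchase | NumOfPurchases | Churn |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Female | 24 | 30000.0 | 1000 | 4.0 | Yes |
| 1 | 2 | Male | 28 | 35000.0 | 1200 | 5.0 | Yes |
| 2 | 3 | Female | 22 | 28000.0 | 800 | NaN | Yes |
| 3 | 4 | Male | 25 | 32000.0 | 900 | 4.0 | Yes |
| 4 | 5 | Female | 30 | 40000.0 | 1500 | 6.0 | Yes |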


We notice that our raw data has 6 features as well as a target feature called Churn. This data is not yet ready to be modelled and needs to be cleaned and prepared.

### Preprocessing data

__Removing irrelevant features__

As we do not need the customer ID to determine whether a customer will churn, it is not a relevant feature for machine learning modelling and can therefore be dropped.

In [27]:
# Removing the irrelevant feature
df.drop(
    columns='CustomerID',
    inplace=True
)

df.head() # inspecting the dataframe without the irrelevant feature

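|   | Gender | Age | Income | TotalPurchase | NumOfPurchases | Churn |
|---|---|---|---|---|---|---|
| 0 | Female | 24 | 30000.0 | 1000 | 4.0 | Yes |
| 1 | Male | 28 | 35000.0 | 1200 | 5.0 | Yes |
| 2 | Female | 22 | 28000.0 | 800 | NaN | Yes |
| 3 | Male | 25 | 32000.0 | 900 | 4.0 | Yes |
| 4 | Female | 30 | 40000.0 | 1500 | 6.0 | Yes |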


__Changing the target, Churn, to numeric values__

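We want to convert the target from string values ('Yes'/'No') to integer values (1/0) so that it can be used in numeric operations, such as the correlation analysis later in this notebook, and is handled consistently during modelling.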

In [28]:
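# Map the 'Yes' and 'No' values to 1 and 0. The result is assigned back to the
# column instead of using inplace=True on a column selection, which newer
# pandas versions no longer apply to the original dataframe.
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})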

df['Churn']

0      1
1      1
2      1
3      1
4      1
      ..
415    0
416    0
417    0
418    0
419    0
Name: Churn, Length: 420, dtype: int64

We have now converted the Churn datatype to int.

__Data profiling__

We will use the skimpy library to create a summary of the dataframe: the data type, missing values, and distribution of each column.

In [29]:
import skimpy as sk #importing the skimpy library

sk.skim(df) #create a summary of df information

Some key takeaways from this skimpy summary are that we now have 5 numeric features (including the target) and 1 categorical feature called Gender. We also notice that Income, NumOfPurchases, and Gender have some missing values that will need to be handled.
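The missing-value counts reported by skimpy can also be cross-checked with plain pandas. The sketch below uses a small made-up dataframe (the values are illustrative, not taken from train.csv) and lists only the columns that contain missing entries.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the customer data (illustrative values only)
toy = pd.DataFrame({
    "Gender": ["Female", "Male", None, "Male"],
    "Age": [24, 28, 22, 25],
    "Income": [30000.0, np.nan, 28000.0, 32000.0],
    "NumOfPurchases": [4.0, 5.0, np.nan, np.nan],
})

# Count missing entries per column and keep only the columns that have any
missing = toy.isnull().sum()
missing = missing[missing > 0]
print(missing)
```

Columns that do not appear in the result, such as Age in this toy example, need no imputation.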

__Handling missing values__

In [30]:
num_col = ['Income', 'NumOfPurchases']  # numeric features with missing values

# Replace the missing values in each numeric column with the column mean.
# mean() skips missing values by default, so no dropna() is needed.
for col in num_col:
    df[col] = df[col].fillna(df[col].mean())

# Replace the missing gender values with the most frequent gender (the mode)
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])

df.isnull().sum()


Gender            0
Age               0
Income            0
TotalPurchase     0
NumOfPurchases    0
Churn             0
dtype: int64

We now have no missing values in our dataframe.

__Checking the cardinality of categorical features__

In [31]:
df.select_dtypes('object').nunique()

Gender    2
dtype: int64

As customer gender is our only categorical feature and its cardinality is neither very low (a single value) nor very high (close to one value per customer), we do not have to handle any feature cardinality.
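For a dataset with more categorical columns, the same check can be automated. The following is a minimal sketch on made-up data: the `Country` and `Email` columns and the 90% threshold are assumptions chosen for illustration and are not part of our dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male"],
    "Country": ["ZA", "ZA", "ZA", "ZA"],                      # hypothetical constant column
    "Email": ["a@x.com", "b@x.com", "c@x.com", "d@x.com"],    # hypothetical, unique per row
})

# Number of distinct values in each non-numeric column
n_unique = toy.select_dtypes(exclude="number").nunique()

# Low cardinality: a single value carries no information for the model
low = n_unique[n_unique <= 1].index.tolist()

# High cardinality: nearly one value per row (assumed threshold of 90%)
high = n_unique[n_unique >= 0.9 * len(toy)].index.tolist()

print("Low cardinality:", low)
print("High cardinality:", high)
```

Columns flagged in either list are candidates for dropping before one-hot encoding, since they add no signal or create one column per row.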

__High collinearity__

We will now inspect the correlation between the features to detect any cases of high collinearity.

In [32]:
corr_df = df.select_dtypes('number').corr()
corr_df

|   | Age | Income | TotalPurchase | NumOfPurchases | Churn |
|---|---|---|---|---|---|
| Age | 1.0 | -0.686793 | -0.676677 | -0.511822 | -0.783308 |
| Income | -0.686793 | 1.0 | 0.987318 | 0.943952 | 0.840501 |
| TotalPurchase | -0.676677 | 0.987318 | 1.0 | 0.944223 | 0.832349 |
| NumOfPurchases | -0.511822 | 0.943952 | 0.944223 | 1.0 | 0.69472 |
| Churn | -0.783308 | 0.840501 | 0.832349 | 0.69472 | 1.0 |


In [33]:
fig = px.imshow(corr_df, color_continuous_scale='Spectral')
fig.update_layout(title='Heat Map: Correlation of Features', font=dict(size=12))
fig.show()

We notice that the highest collinearity is between TotalPurchase and Income (0.987), and that NumOfPurchases is also strongly correlated with both (about 0.94). As the customer's Income is important for churn predictions, we can consider removing the TotalPurchase feature to reduce the redundancy between the features.
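Reading the pairs off a heat map becomes harder as the number of features grows. As a sketch, the feature-to-feature correlations from the matrix above can be scanned programmatically; the 0.9 cut-off is an assumption chosen for illustration, not a fixed rule.

```python
from itertools import combinations

import pandas as pd

cols = ["Age", "Income", "TotalPurchase", "NumOfPurchases"]

# Feature-to-feature correlations copied from the matrix above (target omitted)
corr = pd.DataFrame(
    [
        [1.0, -0.686793, -0.676677, -0.511822],
        [-0.686793, 1.0, 0.987318, 0.943952],
        [-0.676677, 0.987318, 1.0, 0.944223],
        [-0.511822, 0.943952, 0.944223, 1.0],
    ],
    index=cols,
    columns=cols,
)

threshold = 0.9  # assumed cut-off for "high" collinearity

# Visit each unordered feature pair once and keep those above the threshold,
# strongest absolute correlation first
high_pairs = sorted(
    (
        (a, b, corr.loc[a, b])
        for a, b in combinations(cols, 2)
        if abs(corr.loc[a, b]) > threshold
    ),
    key=lambda pair: abs(pair[2]),
    reverse=True,
)

for a, b, value in high_pairs:
    print(f"{a} vs {b}: {value:.3f}")
```

With this threshold, all three pairs among Income, TotalPurchase, and NumOfPurchases are flagged, which suggests these features carry largely overlapping information.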

## Storing the prepared data

__Creating a prepare data function__

We will now combine our data preparation code into a single function which will return a dataframe of prepared data ready for modelling.

In [34]:
def prepare_data(path):
    """Read the raw data at `path` and return a cleaned dataframe ready for modelling."""
    prep_df = pd.read_csv(path)  # read the raw data from the path into a dataframe

    # Remove the irrelevant feature
    prep_df = prep_df.drop(columns='CustomerID')

    # Map the 'Yes' and 'No' target values to 1 and 0
    prep_df['Churn'] = prep_df['Churn'].map({'Yes': 1, 'No': 0})

    num_col = ['Income', 'NumOfPurchases']  # numeric features with missing values

    # Replace the missing values in each numeric column with the column mean
    for col in num_col:
        prep_df[col] = prep_df[col].fillna(prep_df[col].mean())

    # Replace the missing gender values with the most frequent gender (the mode)
    prep_df['Gender'] = prep_df['Gender'].fillna(prep_df['Gender'].mode()[0])

    # Convert the column names to snake_case (e.g. TotalPurchase -> total_purchase)
    return clean_columns(prep_df)

__Calling the prepare_data function__

In [35]:
prepared_df = prepare_data('./data/train.csv')
prepared_df.to_csv('./data/prepared_data.csv', index=False)  # index=False keeps the row index out of the file
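Whether the row index is written matters when the prepared file is read back later. By default `to_csv` writes the index as an unnamed first column, which `read_csv` then loads as a column called `Unnamed: 0`. The sketch below demonstrates this on a tiny made-up dataframe, writing to an in-memory string instead of a file.

```python
import io

import pandas as pd

toy = pd.DataFrame({"age": [24, 28], "churn": [1, 1]})  # illustrative values

# Default behaviour: the row index is written as an unnamed first column
cols_with_index = pd.read_csv(io.StringIO(toy.to_csv())).columns.tolist()

# index=False leaves the row index out of the file
cols_without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False))).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

An extra `Unnamed: 0` column would be treated as a numeric feature by a model, so it is worth checking the columns after reloading a saved file.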

# __Milestone 2: Machine Learning Model Implementation__

## Data exploration

We will now explore our prepared data to gain more insights into their meaning and behaviour.

### Univariate analysis

We will start our analysis by looking at the state and behaviour of our target, Churn.

In [37]:
# Prepare data to display: count the customers per class with readable labels
labels = (
    prepared_df['churn']
    .map({0: 'No', 1: 'Yes'})
    .value_counts()
)

# Create figure using Plotly
fig = px.bar(
    x=labels.index,
    y=labels.values,
    title='Class Imbalance',
    color=labels.index
)

# Add titles & Display figure
fig.update_layout(xaxis_title='Churn', yaxis_title='Number of Customers')
fig.show()

For business purposes, we want to focus on the customers who churn. The graph shows that a substantial number of customers have churned, and the business would like to reduce this number.

### Bivariate/Multi-variate analysis

__Numeric Features__

We will now visualise the relationships of the numeric features against our target to understand their behaviour and impact.

In [41]:
plot_cols = ['age','income','total_purchase','num_of_purchases']

# Plot each numeric feature against the target, with one box per churn class
for col in plot_cols:
    fig = px.box(data_frame=prepared_df, x=col, color='churn', title=f'BoxPlot for {col} Feature against the Target')
    fig.update_layout(xaxis_title=f'{col} Feature')
    fig.show()

Before we interpret these visuals, we first want to remove the outliers of the age feature, as they may influence the boxplot for the yes outcome of churn.

__Age boxplot without outliers__

In [46]:
# Keep only customers older than 24; lower ages are treated as outliers here
mask = prepared_df['age'] > 24
masked_df = prepared_df[mask]

fig = px.box(data_frame=masked_df, x='age', color='churn', title='BoxPlot for age Feature against the Target')
fig.update_layout(xaxis_title='Age Feature')
fig.show()
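The cell above uses a fixed age cut-off of 24. A common alternative, which is not the method used in this notebook, is the 1.5 × IQR rule that box plots conventionally use to place their whiskers. The sketch below applies it to a made-up series of ages; the values are illustrative only.

```python
import pandas as pd

# Illustrative ages, not taken from the dataset
ages = pd.Series([22, 24, 25, 27, 28, 30, 31, 33, 35, 36, 60], name="age")

# Interquartile range and the conventional 1.5 * IQR fences
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the values inside the fences
inliers = ages[ages.between(lower, upper)]
outliers = ages[~ages.between(lower, upper)]

print(f"Fences: [{lower}, {upper}]")
print("Outliers:", outliers.tolist())
```

A data-driven rule like this avoids hard-coding a threshold, but the result should still be checked against the boxplots before any rows are removed.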

After handling the outliers we concluded the following:

Younger customers between the ages of 25 and 36 are the most likely to churn.

Customers with a higher income are more likely to churn.

Customers with higher total purchase amounts are churning.

Customers with a higher number of purchases are also churning.

__Categorical features__

In [49]:
new_df = pd.DataFrame(
    prepared_df[['gender', 'churn']]
    .groupby(['churn'])
    .value_counts()
    .reset_index()
)

# Plot Category feature vs label
fig = px.bar(
    data_frame=new_df, 
    x='gender', 
    y='count', 
    facet_col='churn', 
    color=new_df['churn'].astype(str), # convert it to string to avoid continuous scale on legend
    title='Gender vs Target'
)

fig.update_layout(xaxis_title='gender', yaxis_title='Number of Customers')
fig.show()

When looking at the bar graph where the churn value is 1 (yes), we notice that more females are churning than males.
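Raw counts can be misleading when the two gender groups differ in size, so it can help to compare churn rates as well. The following is a minimal sketch on made-up data (the values are illustrative and not taken from the prepared dataset): because churn is encoded as 0/1, the mean per group is the share of customers who churned.

```python
import pandas as pd

# Illustrative data only
toy = pd.DataFrame({
    "gender": ["Female", "Female", "Female", "Male", "Male", "Female", "Male", "Female"],
    "churn": [1, 1, 0, 1, 0, 1, 0, 0],
})

# Mean of a 0/1 column per group = churn rate per group
churn_rate = toy.groupby("gender")["churn"].mean()
print(churn_rate)
```

Applying the same `groupby` to `prepared_df` would show whether the difference seen in the bar graph persists once group size is taken into account.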

In [52]:
from sklearn.model_selection import train_test_split

label = 'churn'
x = prepared_df.drop(columns=[label])  # feature matrix
y = prepared_df[label]  # target vector

# Hold out 20% of the rows for validation; random_state makes the split reproducible
x_Train, x_Val, y_Train, y_Val = train_test_split(x, y, test_size=0.2, random_state=42)

# Report the share of rows in each split
print(
    'Training dataset'
    f'\nx_Train: {x_Train.shape[0]/len(x)*100:.0f}% \ny_Train: {y_Train.shape[0]/len(x)*100:.0f}%'
    '\n\nValidation dataset'
    f'\nx_Val: {x_Val.shape[0]/len(x)*100:.0f}% \ny_Val: {y_Val.shape[0]/len(x)*100:.0f}%'
)

# Baseline: the accuracy obtained by always predicting the most frequent class
accuracy_Base = y_Train.value_counts(normalize=True).max()

print("Baseline Accuracy:", round(accuracy_Base, 2))

Training dataset     
x_Train: 80% 
y_Train: 80%     

Validation dataset     
x_Val: 20% 
y_Val: 20%
Baseline Accuracy: 0.52
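The baseline accuracy printed above is the share of the most frequent class in the training labels, which is the accuracy of a model that always predicts that class. As a sketch, scikit-learn's `DummyClassifier` expresses the same idea as an estimator; the labels below are made up so that the majority class covers 52% of the rows.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Illustrative labels: 52 churners and 48 non-churners
y_toy = np.array([1] * 52 + [0] * 48)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this strategy

# Always predict the most frequent class seen during fit
dummy = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline = dummy.score(X_toy, y_toy)

print("Baseline accuracy:", round(baseline, 2))
```

Any trained model should beat this number on the validation set to be considered useful.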


In [53]:
from sklearn.linear_model import LogisticRegression

# One-hot encode the categorical feature, then fit a logistic regression classifier
regression_Model = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    LogisticRegression(max_iter=2500)
)
regression_Model.fit(x_Train, y_Train)

# Display accuracy scores
lr_train_acc = regression_Model.score(x_Train, y_Train)
lr_val_acc = regression_Model.score(x_Val, y_Val)
print("Logistic Regression training accuracy:", lr_train_acc)
print("Logistic Regression validation accuracy:", lr_val_acc)




Logistic Regression training accuracy: 0.9970238095238095
Logistic Regression validation accuracy: 1.0
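With 420 rows, a 20% hold-out contains only 84 validation rows, so a single split can give an optimistic or pessimistic score by chance. One way to check how stable the accuracy is would be k-fold cross-validation. The sketch below is not part of the original workflow: it uses synthetic data from `make_classification` as a stand-in for the prepared dataset, and the `StandardScaler` step is an added assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared data (420 rows, 4 numeric features)
X, y = make_classification(
    n_samples=420,
    n_features=4,
    n_informative=3,
    n_redundant=1,
    random_state=42,
)

# Scaling is an assumption added here; it helps logistic regression converge
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2500))

# Five stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", scores.mean().round(3))
```

On the real data, the pipeline with the one-hot encoder could be passed to `cross_val_score` in the same way, with `x` and `y` in place of the synthetic arrays.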
