As an analyst for Model Fitness, I am using this analysis as a step towards developing a customer interaction survey. The goal is to identify which types of customers are more prone to churning. The data provided contains churn data for the current and preceding month as well as each customer's account information. Using machine learning, we will be able to distinguish the types of customers who stay from those who are more likely to churn.

In [1]:
#importing libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
import seaborn as sns


# Data Preprocessing

The first step is to check the quality of the data and verify whether it has any missing values or categorical values. Inspecting the mean values as well will allow me to understand the data better and form some initial insights.


In [2]:
gym_df= pd.read_csv('/datasets/gym_churn_us.csv')

FileNotFoundError: [Errno 2] No such file or directory: '/datasets/gym_churn_us.csv'

In [None]:
gym_df.info()

In [None]:
gym_df.describe()

In [None]:
gym_df.head()

There are no missing values and no categorical data types. The mean values suggest that there is an even split between men and women. With a mean value of 0.84 and a standard deviation of 0.36, it also seems that most of the customers live near the gym's location. A small mean value of 0.31 suggests that most of the clients did not start their membership as part of the 'Bring a friend' promotion. The average customer is 29 years old and has 4 months left on their contract.
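
The two checks described above (missing values and non-numeric columns) can also be made explicit rather than read off `info()`. This is a minimal sketch on a made-up stand-in frame; `toy_df` and its values are assumptions for illustration, not the real gym data.

```python
import pandas as pd

# Toy stand-in for gym_df (values are invented for illustration only)
toy_df = pd.DataFrame({
    'Age': [29, 31, 27, 30],
    'Contract_period': [1, 12, 6, 1],
    'Lifetime': [3, 7, 2, 0],
    'Churn': [0, 0, 1, 0],
})

# Number of missing values per column; all zeros means nothing to impute
missing_per_column = toy_df.isna().sum()

# Columns that are not numeric would need encoding before modeling
non_numeric_columns = toy_df.select_dtypes(exclude='number').columns.tolist()

print(missing_per_column)
print(non_numeric_columns)
```

On the real data the same two lines could be run against `gym_df` directly.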


# Initial Data Analysis

My next step is to split the data into two groups, those who churned and those who did not, and study the mean values for each group.


In [None]:
churn_df=gym_df.groupby('Churn')
churn_df.head()

In [None]:
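#using the full option name; the short 'max_rows' pattern can match several options in newer pandas versions
pd.set_option('display.max_rows', None)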
churn_df.describe().T


The mean values for customers who did NOT churn:

There is still an even gender split (0.51), and more of these customers live close to the gym, with a mean value of 0.87 and a standard deviation of 0.33. Those who did not churn have a higher mean value for belonging to partner companies (0.534). Not many came through the friend promotion, but the mean of 0.35 is still higher than in the churn group. This group also has a longer contract period (5.75 on average), more group visits (0.46), and accrued more additional charges (158). This group also went to more classes, with a mean value of 2.02.

The mean values for customers who DID churn:

There is an even gender split (0.51), and most live near the location (0.77). The mean value of 0.355 for being part of a partner company suggests that most are not. This group has a lower mean contract period at 1.72 with a standard deviation of 2.13. The mean value of group visits is also lower at 0.27. The mean additional charge for this group is 115 with a standard deviation of 77.7.
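
When only the means are of interest, a narrower table than `describe().T` can make the two groups easier to compare side by side. The sketch below uses a small invented frame (`toy_df` is an assumption, not the real data) to show the pattern.

```python
import pandas as pd

# Toy stand-in for gym_df (values are invented for illustration only)
toy_df = pd.DataFrame({
    'Contract_period': [12, 6, 1, 1, 12, 1],
    'Group_visits': [1, 1, 0, 0, 0, 1],
    'Churn': [0, 0, 1, 1, 0, 1],
})

# One row per churn group, one column per feature mean
group_means = toy_df.groupby('Churn').mean()
print(group_means)
```

Transposing the result with `.T` puts the two groups in adjacent columns, which reads well when there are many features.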

Let's plot histograms that display the feature distributions, with clients who churned and those who did not shown in different colors.

In [None]:
for column in gym_df.columns:
    sns.histplot(gym_df, x=column, bins=10, hue='Churn')
    plt.title(column)
    plt.show()

The values for the average additional charges total are skewed to the right, as is the lifetime feature. Average class frequency is slightly skewed, and there don't seem to be any major outliers.

Now I would like to check if there is any correlation between the features and the target (churn) or between the features themselves.

In [None]:
gym_corr=gym_df.corr()
plt.figure(figsize=(15,15))
sns.heatmap(gym_corr,annot=True,square=True)
plt.show()

There is a strong correlation between contract period and month to end of contract, and another strong correlation between avg class frequency total and avg class frequency current month. This makes sense, as the features in each pair measure nearly the same thing. In an attempt to optimize the model, I will drop month to end of contract and avg class frequency current month.

In [None]:
gym_df=gym_df.drop(['Month_to_end_contract','Avg_class_frequency_current_month'],axis=1)
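
Reading strongly correlated pairs off a large heatmap is easy to get wrong, so the pairs can also be listed programmatically. The following is a rough sketch on synthetic data; `toy_df`, the random values, and the 0.9 `threshold` are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
contract = rng.integers(1, 13, size=200).astype(float)

# Synthetic stand-in: the second column is nearly a copy of the first
toy_df = pd.DataFrame({
    'Contract_period': contract,
    'Month_to_end_contract': contract - rng.integers(0, 2, size=200),
    'Age': rng.normal(29, 3, size=200),
})

corr = toy_df.corr().abs()
threshold = 0.9  # assumed cutoff for "strongly correlated"

# Visit each unordered pair of columns once and keep those above the cutoff
high_pairs = [
    (a, b, round(corr.loc[a, b], 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]
print(high_pairs)
```

One column from each listed pair is then a candidate to drop, as was done above.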

# Model Building

Now that I am done pre-processing the data and checking the distributions, I will build a binary classification model for customers, where the target feature is whether the user leaves next month.

The first step is to divide the data into train and validation sets using train_test_split().

In [None]:
#creating the feature matrix and the target variable
X = gym_df.drop(['Churn'], axis=1)
y= gym_df['Churn']

In [None]:
#splitting the data
X_train,X_test,y_train,y_test= train_test_split(X,y, test_size=0.2, random_state=0, stratify=y)
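
Because the split uses `stratify=y`, the churn share should be about the same in both parts. A quick way to confirm this is sketched below on toy labels; `X_toy` and `y_toy` are assumptions made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 200 rows, 25% of them labeled as churned
y_toy = pd.Series([1] * 50 + [0] * 150, name='Churn')
X_toy = pd.DataFrame({'Age': np.arange(200)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# With stratification the class share matches across full, train and test sets
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```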

I will be working with and comparing two models: Logistic Regression and Random Forest. I will train the models, have them calculate predictions and then compare their metrics.

In [None]:
X_train

In [None]:
#logistic regression
lo_model= LogisticRegression(random_state=0, max_iter=1000)
lo_model.fit(X_train,y_train)
lo_pred= lo_model.predict(X_test)
lo_prob= lo_model.predict_proba(X_test)[:,1]

In [None]:
#metrics for logistic regression
print(f'Accuracy: {accuracy_score(y_test,lo_pred):.2f}')
print(f'Precision: {precision_score(y_test,lo_pred):.2f}')
print(f'Recall: {recall_score(y_test,lo_pred):.2f}')



In [None]:
#random forest
rf_model= RandomForestClassifier(random_state=0)
rf_model.fit(X_train,y_train)
rf_pred=rf_model.predict(X_test)
rf_prob= rf_model.predict_proba(X_test)[:,1]

In [None]:
#metrics for random forest
print('Accuracy: {:.2f}'.format(accuracy_score(y_test,rf_pred)))
print('Precision: {:.2f}'.format(precision_score(y_test,rf_pred)))
print('Recall: {:.2f}'.format(recall_score(y_test,rf_pred)))

Though the two models perform very similarly, Random Forest performs a bit better, as evidenced by its higher Accuracy, Precision, and Recall.
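
Comparing the models is easier when the metrics sit in one table instead of two separate printouts. The sketch below shows one possible way to do that; the data comes from `make_classification`, so the numbers it prints say nothing about the gym dataset, and names such as `metrics_table` are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for X and y
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

models = {
    'LogisticRegression': LogisticRegression(random_state=0, max_iter=1000),
    'RandomForest': RandomForestClassifier(random_state=0),
}

# Fit each model and collect the same three metrics for it
rows = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rows[name] = {
        'accuracy': accuracy_score(y_te, pred),
        'precision': precision_score(y_te, pred),
        'recall': recall_score(y_te, pred),
    }

# One row per model, one column per metric
metrics_table = pd.DataFrame(rows).T.round(2)
print(metrics_table)
```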

# User Clusters

Now I will split the data into clusters to ascertain which type of customer is more likely to churn. The first step is to standardize the data.

In [None]:
#Standardize the data using StandardScaler()
sc = StandardScaler()
X_sc=sc.fit_transform(X)
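
Standardizing matters here because both the linkage and KMeans steps are distance based, so a feature on a large scale would otherwise dominate. As a small illustration on invented numbers (`X_toy` is an assumption), the scaler's output matches subtracting each column's mean and dividing by its standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales
X_toy = np.array([[1.0, 100.0], [6.0, 150.0], [12.0, 200.0]])

X_toy_sc = StandardScaler().fit_transform(X_toy)

# Manual z-score using the population standard deviation (ddof=0)
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(X_toy_sc, manual))
```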


I will also be using the linkage() function to build a matrix of distances based on the standardized feature matrix (X_sc), and then I will visualize this using a dendrogram, which I will analyze to determine how many clusters I want.

In [None]:
link=linkage(X_sc, method='ward')


In [None]:
plt.figure(figsize=(10, 10))  
dendrogram(link, orientation='top')
plt.title('Clustering for Gym Churn')
plt.show()

Based on the dendrogram I will be using 5 clusters in my KMeans model.
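
As a side check on a cluster count read from the plot, the same linkage matrix can be cut into a fixed number of flat clusters with SciPy's `fcluster`. This is not what the notebook does (it uses KMeans for the final labels); it is only a sketch on synthetic blobs, and `X_toy`, `link_toy` and the choice of 3 clusters are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic data with three groups, standardized as in the notebook
X_toy, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)
X_toy_sc = StandardScaler().fit_transform(X_toy)

link_toy = linkage(X_toy_sc, method='ward')

# Cut the tree so that at most 3 flat clusters remain
flat_labels = fcluster(link_toy, t=3, criterion='maxclust')

# fcluster labels start at 1, so drop the empty zero bin
cluster_sizes = np.bincount(flat_labels)[1:]
print(cluster_sizes)
```

Very uneven cluster sizes at the chosen cut can be a hint to revisit the number of clusters.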


In [None]:
km=KMeans(n_clusters=5, random_state=0)
labels=km.fit_predict(X_sc)

Now that we have our clusters, I will look at the mean feature values for each.

In [None]:
gym_df['cluster']=labels

In [None]:
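#using the full option name; the short 'max_columns' pattern can match several options in newer pandas versions
pd.set_option('display.max_columns', None)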
gym_df.groupby('cluster').describe().T

There is a big difference in mean values for the 'partner' parameter: clusters 2 and 3 have low values of 0.24 and 0.25, while cluster 1 has a much higher value of 0.95 (meaning most of its members were affiliated with partner companies).

Cluster 1 has the highest mean value for lifetime membership at 4.4, with cluster 0 having the lowest at 3.06.

When it comes to average additional charges, cluster 1 has the highest mean value at 155 with cluster 0 having the lowest at 137.5.

Those in cluster 1 have a much longer contract period, with a mean value of 7.67, while the other clusters have mean values that range from 2 to 4. This also lends itself to a higher mean value for month to end of contract. Cluster 3 had the highest mean value for group visits and cluster 2 had the lowest.

Now I am going to plot the distribution of the numerical features of each cluster.

In [None]:
#Contract period
sns.boxplot(data=gym_df, x='cluster', y='Contract_period')
plt.title('Contract Period')
plt.show()

In [None]:
#Group visits
sns.boxplot(data=gym_df, x='cluster', y='Group_visits')
plt.title('Group Visits')
plt.show()

In [None]:
#Average Additional Charges Total
sns.boxplot(data=gym_df, x='cluster', y='Avg_additional_charges_total')
plt.title('Average Additional Charges Total')
plt.show()

In [None]:
#Lifetime
sns.boxplot(data=gym_df, x='cluster', y='Lifetime')
plt.title('Lifetime')
plt.show()

In [None]:
#Average Class Frequency Total
sns.boxplot(data=gym_df, x='cluster', y='Avg_class_frequency_total')
plt.title('Average Class Frequency Total')
plt.show()

In [None]:
#Age
sns.boxplot(data=gym_df, x='cluster', y='Age')
plt.title('Age')
plt.show()

For the most part, these distributions are similar to the ones above, where Lifetime and average additional charges are skewed to the right. The age feature tends to be normally distributed for all clusters.
For average additional charges total, cluster 0 has the most variance. Cluster 1 has the highest observation for average class frequency.

My final step will be calculating the churn rate for each cluster to see if they differ. This will also shed light on which clusters are prone to leaving.
I will be calculating the churn rate within the individual clusters and also for the dataframe as a whole.

In [None]:
#cluster 0 churn rate
gym_df.query('cluster ==0')['Churn'].mean()

In [None]:
#cluster 1 churn rate

gym_df.query('cluster ==1')['Churn'].mean()

In [None]:
#cluster 2 churn rate
gym_df.query('cluster ==2')['Churn'].mean()

In [None]:
#cluster 3 churn rate
gym_df.query('cluster ==3')['Churn'].mean()

In [None]:
#cluster 4 churn rate
gym_df.query('cluster ==4')['Churn'].mean()

Churn rate for clusters:

0- 40%

1- 12%

2- 38%

3- 20%

4- 27%

There's a great difference between clusters, with 40% of clients in cluster 0 churning versus only 12% in cluster 1. Those prone to leaving belong to clusters 0 and 2, while those more likely to be loyal belong to clusters 1 and 3.

<div class="alert alert-block alert-warning">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

We could do it in one line, but ok.
</div>
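
The per-cluster churn rates computed cell by cell above can also be obtained in a single `groupby`, since the mean of a 0/1 `Churn` column is the churn rate. The sketch below uses an invented frame (`toy_df` and its values are assumptions) to show the pattern, along with the overall rate.

```python
import pandas as pd

# Toy stand-in with a cluster label and a 0/1 churn flag per client
toy_df = pd.DataFrame({
    'cluster': [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
    'Churn':   [1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0],
})

# Churn rate per cluster, highest first
churn_by_cluster = (
    toy_df.groupby('cluster')['Churn'].mean().sort_values(ascending=False)
)

# Churn rate for the dataframe as a whole
overall_churn = toy_df['Churn'].mean()

print(churn_by_cluster)
print(overall_churn)
```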


# Conclusions and Recommendations

In [None]:
gym_df.groupby('cluster').describe().T

The goal of this report was to identify the type of customer that is more likely to churn and the type of customer that tends to stay loyal to Model Fitness. To do this, I first had to pre-process the data to determine if there were any missing values or categorical data types that would need to be converted. After this, I analyzed the mean values of all the features for the dataframe as a whole and then split into two groups (those who churned and those who did not). There were differences in the mean values of these groups, with loyal customers taking classes more frequently and spending more money on additional charges. The mean values also showed that the loyal customers had longer membership plans.

My next step was then training two models (Logistic Regression and Random Forest) and then comparing their metrics (Precision, Accuracy, and Recall) to see which model was better at predicting whether or not a customer would churn. Though the metrics were close, Random Forest Classifier performed better.

The next goal was to form clusters of clients to pinpoint the commonalities between customers and to find the type of customer that is more likely to churn. By plotting a dendrogram and using KMeans, I was left with 5 different clusters that had unique mean values across the features.

My conclusion from my analyses is that customers who belong to cluster 0 or 2 are more likely to churn, while those in clusters 1 and 3 are more prone to staying loyal.

My recommendation would be to focus on the following principles:

Additional charges: Gear more marketing resources towards promoting and advertising the secondary services the gym provides, such as the cafe, athletic goods and massages. I would also suggest running a promotion on the massage services, such as buy one massage and get the next one at a discounted rate. The data shows that those who spent more money on additional charges were less likely to churn.

Group visits: As customers with higher group visits are more likely to remain loyal, I would offer promotions that are geared towards groups. This could include a discounted or free item, such as merchandise, for clients who come in with groups of 3 or more.

Class attendance: In order to boost class attendance, I would recommend emailing a survey to the customer base to get their thoughts on what kind of classes they would like to see offered. Adding new classes that cater to the interests of the clients may encourage them to attend more classes. Customers with high class attendance are not as likely to churn.

For example, suggestions in action for cluster 0:


For clients in cluster 0 I would suggest offering them a discount: a buy one get one at 50% discount rate for massages and also give them a punch card for the cafe. Once they make their 10th purchase, they can get a free beverage. I would also send out surveys to those in cluster 0 asking them to fill in what kind of classes they would like to see offered. Once they complete that survey, they will be rewarded with a one-time free pass for a class.
