# **Sparkify Churn Prediction**
---


## **Problem Introduction**

Many online apps let us listen to our favorite songs, and most of them split their users into a free tier and a paid tier. Users from either tier can stop using the app for many reasons, so analyzing and predicting which users are likely to leave is valuable for running the business. It lets the company offer targeted discounts or promotions to those users before they leave. Predictive analysis of this kind is one of the places where data science contributes directly to business decisions.

Sparkify is a fictional online music app used in a Udacity project. For this project, I chose the medium-sized dataset, analyzed it, and built machine learning models to predict user churn. Churn prediction is a common and interesting task for a data scientist, and its results help the business decide where to improve.

## **Strategy to solve the problem**

First, we need to understand the data provided in the project. Next, we explore that data to answer questions about how churned users differ from the rest. Finally, we build models that predict whether a user will churn.

## **Metrics**

Several metrics can be used to evaluate a classification model:
- `Accuracy`
- `Precision`
- `Recall`
- `F1 Score`
- `ROC/AUC`

The last two metrics, `F1 Score` and `ROC/AUC`, are more informative than accuracy when the classes are imbalanced. I therefore use `F1 Score` in this project.
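
As a small illustration of how the F1 score combines precision and recall, here is a sketch in plain Python. The confusion-matrix counts are hypothetical and chosen only for the example; they are not results from this project.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for the positive (churn) class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 40 churners caught, 10 false alarms, 12 churners missed.
precision, recall, f1 = precision_recall_f1(tp=40, fp=10, fn=12)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, it stays low unless both precision and recall are reasonably high.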

## **Exploratory Data Analysis**

The data contains **286,500** records. Each row is one interaction of a user with Sparkify between `October 1st, 2018` and `December 3rd, 2018`. Each interaction has 18 attributes describing the customer (gender, name, id), the song (artist name), and the user's action (log in, next song, downgrade, cancel, and so on).

The data covers **225** users, of whom **52** stopped using Sparkify. That is approximately **23.11%** of Sparkify's customers canceling their accounts during this period, which would be a red flag for any business.
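
The arithmetic behind that percentage, and the reason accuracy alone is a weak metric here, can be checked in a few lines of plain Python. The "always predict not-churn" baseline below is an illustrative assumption, not a model from this project.

```python
total_users = 225
churned_users = 52

# Share of users who canceled their account.
churn_rate = churned_users / total_users
print(f"churn rate: {churn_rate:.2%}")

# A trivial baseline that labels every user as "not churn" is right
# for all non-churned users and wrong for all churned ones.
baseline_accuracy = (total_users - churned_users) / total_users
print(f"baseline accuracy: {baseline_accuracy:.2%}")

# The same baseline finds no churners at all, so its recall
# (and therefore its F1 score) on the churn class is zero.
baseline_churn_recall = 0 / churned_users
print(f"baseline churn recall: {baseline_churn_recall:.2%}")
```

A model can therefore reach roughly 77% accuracy without identifying a single churned user, which is why an imbalance-aware metric is preferable.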

Given that rate, it is worth analyzing how churned users differ from users who stayed:

<img src="output.png" style="background-color:white;"/>

The bar chart above compares cancellations between free and paid users. It shows that more paid users than free users decided to cancel their accounts.

Next, let's look at the breakdown by gender:

<img src="output_2.png" style="background-color:white;"/>

The second bar chart shows that male users appear to be more active on the app than female users. Cancellations follow the same pattern: the number of churned male users is nearly twice the number of churned female users.

Besides user attributes, we can also look at how cancellations are distributed across the days of the month:

<img src="output_3.png" style="background-color:white;"/>

This bar chart shows user interactions by day of the month. Cancellations are concentrated on a few days, notably the 2nd, 12th, 17th, and 20th.

## **Feature Engineering**


Since the data records users' actions on Sparkify, I decided to predict churn from each user's interaction pattern together with their tier and gender. The selected columns are:
- `churn` (label): 1 if the user canceled their account, otherwise 0
- `level`: paid or free
- `gender`: male or female
- `avg_daily_session`: average number of distinct sessions per day
- `avg_monthly_session`: average number of distinct sessions per month
- `avg_daily_item`: average number of items the user interacted with per day
- `avg_monthly_item`: average number of items the user interacted with per month

To compute these averages, I aggregated the data by `userId`. In the resulting table, `userId` is unique per row and `churn` is 1 if the user canceled their account. The `gender` and `level` columns are then encoded as numeric values. The raw dataset holds a lot of information, so I combined and aggregated columns to extract the useful parts: the average daily and monthly session counts, and the average daily and monthly number of items each user interacted with, based on the page events (Next Song, Cancel, Thumbs Up, and so on).
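
To make the aggregation idea concrete, here is a minimal sketch in plain Python rather than Spark. The event tuples, the `sessionId` field, and the exact definition of the averages (mean over the days on which a user was active) are assumptions made for illustration; the notebook may compute them differently.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical event log: one (userId, date, sessionId) tuple per interaction.
events = [
    ("u1", "2018-10-01", 1),
    ("u1", "2018-10-01", 1),
    ("u1", "2018-10-01", 2),
    ("u1", "2018-10-02", 3),
    ("u2", "2018-10-01", 7),
    ("u2", "2018-10-01", 7),
]


def daily_averages(events):
    """Return per-user average distinct sessions and interactions per active day."""
    sessions = defaultdict(set)  # (user, day) -> distinct session ids
    items = defaultdict(int)     # (user, day) -> number of interactions
    for user, day, session in events:
        sessions[(user, day)].add(session)
        items[(user, day)] += 1

    per_user = defaultdict(lambda: {"sessions": [], "items": []})
    for (user, day), session_ids in sessions.items():
        per_user[user]["sessions"].append(len(session_ids))
        per_user[user]["items"].append(items[(user, day)])

    return {
        user: {
            "avg_daily_session": mean(counts["sessions"]),
            "avg_daily_item": mean(counts["items"]),
        }
        for user, counts in per_user.items()
    }


features = daily_averages(events)
print(features)
```

The monthly averages follow the same pattern with the grouping key changed from the day to the month.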

A sample of the final data used for modelling is shown below:

|userId|gender|level| avg_daily_session|avg_monthly_session|   avg_daily_item|avg_monthly_item|churn|
|------|------|-----|------------------|-------------------|-----------------|----------------|-----|
|100010|     0|    0| 54.42857142857143|              190.5|6.714285714285714|            10.5|    0|
|200002|     1|    1| 67.71428571428571|              237.0|6.857142857142857|            11.0|    0|
|   125|     1|    0|              11.0|               11.0|              4.0|             4.0|    1|
|   124|     0|    1|            192.76|             2409.5|             8.64|            12.5|    0|
|    51|     1|    1|189.46153846153845|             2463.0|              8.0|            14.0|    1|
|     7|     1|    0|              25.0|              100.0|            4.375|             9.5|    0|
|    15|     1|    1|133.88235294117646|             1138.0|7.647058823529412|            12.5|    0|
|    54|     0|    1|143.16666666666666|             1718.0|              8.0|            15.0|    1|
|   155|     0|    1|           124.875|              999.0|             7.75|            14.0|    0|
|100014|     1|    1|              62.0|              155.0|              6.2|            10.0|    1|


I used the following Spark ML components:
- `VectorAssembler`: a feature transformer that assembles multiple columns into a single vector column.
- `Pipeline`: the required steps are added as stages to a pipeline, which is then fitted on the dataset to build a model.
- The classification models used for training.

## **Modelling**

We need to predict whether a specific user will churn (1) or not (0), which is a binary classification problem. We also want to find out which features affect the user's decision the most.

Three models were built:
- `Logistic Regression`: a natural baseline for binary classification
- `Random Forest Classification` and `Gradient Boosting Classifier`: tree ensembles that are widely used for classification problems because they tend to perform well

All three models are available in `PySpark MLlib`.

Cross-validation is also applied during training. It splits the training data into folds so that model selection does not depend on a single split, which helps reduce overfitting.

Since the data is imbalanced (the churn rate is only **23.11%**), I use the `F1 Score` to evaluate the models. The dataset is split into an **80%** training set and a **20%** test set, and cross-validation uses k=5 folds.
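
The setup described above could look roughly like the following PySpark sketch. The function name `fit_and_score`, the seed, and the exact column list are assumptions for illustration, and the notebook's actual code may differ. The Spark imports are deferred into the function so the snippet can be read without Spark installed; calling it requires `pyspark` and an active `SparkSession`.

```python
FEATURE_COLS = [
    "gender", "level",
    "avg_daily_session", "avg_monthly_session",
    "avg_daily_item", "avg_monthly_item",
]


def fit_and_score(df, seed=42):
    """Train logistic regression with 5-fold CV and return the test-set F1.

    `df` is assumed to be a Spark DataFrame holding FEATURE_COLS
    and a 0/1 `churn` label column.
    """
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # 80% training data, 20% held-out test data.
    train, test = df.randomSplit([0.8, 0.2], seed=seed)

    assembler = VectorAssembler(inputCols=FEATURE_COLS, outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="churn", maxIter=100)
    pipeline = Pipeline(stages=[assembler, lr])

    # Note: Spark's "f1" metric is the F1 score weighted across both classes.
    evaluator = MulticlassClassificationEvaluator(
        labelCol="churn", predictionCol="prediction", metricName="f1"
    )

    cv = CrossValidator(
        estimator=pipeline,
        estimatorParamMaps=ParamGridBuilder().build(),  # defaults only
        evaluator=evaluator,
        numFolds=5,
    )
    cv_model = cv.fit(train)
    return evaluator.evaluate(cv_model.transform(test))
```

Swapping `LogisticRegression` for `RandomForestClassifier` or `GBTClassifier` leaves the rest of the pipeline unchanged.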

## **Hyperparameter tuning**

Here are the default parameters used in the models:

- `Logistic Regression`: `maxIter` (maximum number of iterations) = 100
- `Random Forest`: `maxDepth` (maximum tree depth) = 5
- `Gradient Boosting`: `maxDepth` (maximum tree depth) = 5

The F1 scores obtained with these defaults are already fairly high and acceptable for our purpose. Still, it is worth trying different parameters for the **Gradient Boosting** model to see whether its score improves.
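
One way to try several tree depths for the Gradient Boosting model is a parameter grid combined with cross-validation. The sketch below is an assumption about how this could be done in PySpark, not the notebook's exact code; `tune_gbt_depth` and the candidate depths are hypothetical, and calling the function requires `pyspark` and an active `SparkSession`.

```python
FEATURE_COLS = [
    "gender", "level",
    "avg_daily_session", "avg_monthly_session",
    "avg_daily_item", "avg_monthly_item",
]


def tune_gbt_depth(train_df, depths=(3, 5), num_folds=5):
    """Cross-validate a GBT classifier for each maxDepth and return the scores.

    `train_df` is assumed to be a Spark DataFrame holding FEATURE_COLS
    and a 0/1 `churn` label column.
    """
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import GBTClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    assembler = VectorAssembler(inputCols=FEATURE_COLS, outputCol="features")
    gbt = GBTClassifier(featuresCol="features", labelCol="churn")

    # One parameter map per candidate depth.
    grid = ParamGridBuilder().addGrid(gbt.maxDepth, list(depths)).build()

    evaluator = MulticlassClassificationEvaluator(
        labelCol="churn", predictionCol="prediction", metricName="f1"
    )
    cv = CrossValidator(
        estimator=Pipeline(stages=[assembler, gbt]),
        estimatorParamMaps=grid,
        evaluator=evaluator,
        numFolds=num_folds,
    )
    cv_model = cv.fit(train_df)

    # avgMetrics holds one cross-validated score per parameter map, in grid order.
    return dict(zip(depths, cv_model.avgMetrics))
```

The returned dictionary maps each depth to its average cross-validated F1, which makes it easy to compare a shallower tree against the default.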

## **Results**

`Logistic Regression`: The F1 score is **0.90**


`Random Forest Classification`: The F1 score is **0.845**


`Gradient Boosting Classifier`: The F1 score is **0.82**

`Gradient Boosting Classifier` (with `maxDepth` = 3): The F1 score is **0.88**

The F1 scores of the models above are fairly good, ranging from **0.82** to **0.90**. The model with the highest F1 score is `Logistic Regression`, at about **0.90**. Note, however, that these results come from a subset of the data, not the full dataset. The features and model parameters should therefore be re-examined when the approach is applied to the full-sized dataset.

Finally, it is worth finding out which features have the most impact on a user's decision to cancel:

`userId` : 0.005301542479173895 

`gender` : 0.018047505349156952  

`level` : 0.22080577718792696 

`avg_daily_session` : 0.1797104088859551 

`avg_monthly_session` : 0.27404281333677766 

`avg_daily_item` : 0.30209195276100953 

Based on these values, `avg_daily_item` (average daily items interacted with), `avg_monthly_session` (average monthly sessions), and `level` (free/paid) are the three most influential features.
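
The ranking can be reproduced from the reported values with a short plain-Python snippet. The dictionary simply copies the importances listed above; nothing here is recomputed from the model.

```python
# Feature importances as reported above.
importances = {
    "userId": 0.005301542479173895,
    "gender": 0.018047505349156952,
    "level": 0.22080577718792696,
    "avg_daily_session": 0.1797104088859551,
    "avg_monthly_session": 0.27404281333677766,
    "avg_daily_item": 0.30209195276100953,
}

# Sort feature names from most to least important and keep the top three.
ranked = sorted(importances, key=importances.get, reverse=True)
top3 = ranked[:3]

for name in top3:
    print(f"{name}: {importances[name]:.3f}")
```

The reported importances sum to approximately 1, so each value can be read as that feature's share of the total.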

## **Conclusion**

In short, churn prediction is an interesting and useful tool for running a business. The models above were built on a subset of the Sparkify dataset. The final dataset used for training has 6 features, 4 of which were calculated from the provided data. The best model for churn prediction is `Logistic Regression`, with an F1 score of about **0.90**.

## **Improvements**

The prediction results could be improved by training on the full-sized dataset. A cloud service such as AWS EMR would be helpful for working with a dataset of that size.

### Reference:
- The code and further details for this article can be found in the notebook in the accompanying GitHub repository.