## Introduction
### Project scenario
This is a data science coding project that uses a real-world dataset to build a prediction or classification model. The work is done in Python in a Jupyter Notebook and focuses on data manipulation, feature engineering, and model evaluation.

### Summary
The goal is to predict whether subscriptions on a video streaming platform will churn. These predictions can help platform managers assess current and future plans for keeping existing viewers and gaining new ones.

### Solution
This project aimed to maximize the area under the ROC curve (ROC AUC), which measures the model's ability to distinguish between the positive and negative classes.
Through data cleaning, preprocessing, visualization, and modeling, the trained model reached a test ROC AUC of 75.07%. This score is better than that of 93% of the models trained by other competitors in the challenge, so the model can help detect churn more effectively.
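
As a hedged illustration of what ROC AUC measures, the sketch below computes it on a toy example as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The labels and scores are made up for illustration, and `pairwise_auc` is a hypothetical helper that is not part of the project.

```python
from itertools import product


def pairwise_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    # A correctly ordered pair counts 1, a tie counts 0.5, a wrong order counts 0
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


# Made-up labels and scores: 3 of the 4 (positive, negative) pairs are ordered correctly
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, y_score))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. For binary labels, `sklearn.metrics.roc_auc_score` (imported later in this notebook) returns the same quantity.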

### Approach
1. Understanding and loading the data (pandas library)
1. Preprocessing and encoding the data (pandas library)
1. Visualizing and analyzing the data (Tableau Public)
1. Data modeling and model selection (Scikit-Learn library)
1. Model training and evaluation (Scikit-Learn library)

## Importing packages

In [2]:
import joblib
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_validate
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.linear_model import RidgeClassifier, LogisticRegression
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier

## Understanding the Data

In [47]:
data_descriptions = pd.read_csv('/kaggle/input/predictive-analytics-for-customer-churn-dataset/data_descriptions.csv')
pd.set_option('display.max_colwidth', None)
data_descriptions

| Unnamed: 0 | Column_name | Column_type | Data_type | Description |
|---|---|---|---|---|
| 0 | AccountAge | Feature | integer | The age of the user's account in months. |
| 1 | MonthlyCharges | Feature | float | The amount charged to the user on a monthly basis. |
| 2 | TotalCharges | Feature | float | The total charges incurred by the user over the account's lifetime. |
| 3 | SubscriptionType | Feature | object | The type of subscription chosen by the user (Basic, Standard, or Premium). |
| 4 | PaymentMethod | Feature | string | The method of payment used by the user. |
| 5 | PaperlessBilling | Feature | string | Indicates whether the user has opted for paperless billing (Yes or No). |
| 6 | ContentType | Feature | string | The type of content preferred by the user (Movies, TV Shows, or Both). |
| 7 | MultiDeviceAccess | Feature | string | Indicates whether the user has access to the service on multiple devices (Yes or No). |
| 8 | DeviceRegistered | Feature | string | The type of device registered by the user (TV, Mobile, Tablet, or Computer). |
| 9 | ViewingHoursPerWeek | Feature | float | The number of hours the user spends watching content per week. |


## Loading the Data

In [3]:
train_df = pd.read_csv("/kaggle/input/predictive-analytics-for-customer-churn-dataset/train.csv")
print('train_df Shape:', train_df.shape)
test_df = pd.read_csv("/kaggle/input/predictive-analytics-for-customer-churn-dataset/test.csv")
print('test_df Shape:', test_df.shape)

train_df Shape: (243787, 21)

test_df Shape: (104480, 20)


## Explore, Clean, Validate, and Encode the Data

In [4]:
# Count null values per column and fully duplicated rows
null_count = train_df.isnull().sum()
duplicate_count = train_df.duplicated().sum()
# Both counts are zero for this dataset: no missing or duplicate values found

# =============================
# Observing unique values
unique_values = train_df.nunique()
print("Table variables and the number of their unique values:")
print(unique_values)
print("\nDataframe shape:", train_df.shape)

# Categorical feature encoding (one-hot encoding)
columns = ["SubscriptionType", "PaymentMethod", "PaperlessBilling", "ContentType", 
           "MultiDeviceAccess", "DeviceRegistered", "GenrePreference", "Gender", 
           "ParentalControl", "SubtitlesEnabled"]
train_df = pd.get_dummies(train_df, columns=columns)
test_df = pd.get_dummies(test_df, columns=columns)

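# Training features and target; CustomerID is an identifier, not a feature
xtrain = train_df.drop(columns=["CustomerID", "Churn"])
ytrain = train_df["Churn"]

# Test features (the test set has no Churn column)
xtest = test_df.drop(columns=["CustomerID"])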

Table variables and the number of their unique values:

AccountAge                     119

MonthlyCharges              243787

TotalCharges                243787

SubscriptionType                 3

PaymentMethod                    4

PaperlessBilling                 2

ContentType                      3

MultiDeviceAccess                2

DeviceRegistered                 4

ViewingHoursPerWeek         243787

AverageViewingDuration      243787

ContentDownloadsPerMonth        50

GenrePreference                  5

UserRating                  243787

SupportTicketsPerMonth          10

Gender                           2

WatchlistSize                   25

ParentalControl                  2

SubtitlesEnabled                 2

CustomerID                  243787

Churn                            2

dtype: int64



Dataframe shape: (243787, 21)


- No missing values or duplicates were found.
- Monthly charges, total charges, viewing hours per week, average viewing duration, and user rating each have 243,787 unique values, one per row.
- In other words, no two customers share the same value for any of these features, which is interesting!
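
The encoding cell above calls `pd.get_dummies` on the training and test frames separately, so the two encoded frames only end up with identical columns if every category appears in both files. Whether that holds for this dataset is not shown in the notebook. The following is a minimal sketch of a guard, using made-up toy frames (`train_toy`, `test_toy`) rather than the real data, that aligns the test columns to the training columns with `DataFrame.reindex`.

```python
import pandas as pd

# Toy frames (made up for illustration): "Tablet" never appears in the test data
train_toy = pd.DataFrame({"DeviceRegistered": ["TV", "Mobile", "Tablet"],
                          "AccountAge": [10, 20, 30]})
test_toy = pd.DataFrame({"DeviceRegistered": ["TV", "Mobile"],
                         "AccountAge": [5, 15]})

x_train_toy = pd.get_dummies(train_toy, columns=["DeviceRegistered"])
x_test_toy = pd.get_dummies(test_toy, columns=["DeviceRegistered"])

# Match the training column order, add missing dummy columns filled with 0,
# and drop any column the model never saw during training
x_test_toy = x_test_toy.reindex(columns=x_train_toy.columns, fill_value=0)

print(list(x_train_toy.columns) == list(x_test_toy.columns))  # True
```

With the real data, the same `reindex` call could be applied to `xtest` using `xtrain.columns` before calling `predict_proba`.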

## Data Modeling and Selection

#### Ridge Classifier

In [50]:
# Model initialization and cross validation
model = RidgeClassifier()
scores = cross_validate(model, xtrain, ytrain, scoring=["roc_auc"], n_jobs=-1)
print("Cross validation ROC AUC: mean={0}, std={1}".format(np.mean(scores['test_roc_auc']), np.std(scores['test_roc_auc'])))

Cross validation ROC AUC: mean=0.7493729539102307, std=0.004512814739921477
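
For readers unfamiliar with `cross_validate`, here is a small sketch of what the cell above returns. When `cv` is left unset, recent scikit-learn versions use 5-fold cross-validation (stratified for classifiers), and each requested metric appears under a `test_<metric>` key with one score per fold. The data below is synthetic (`make_classification`) and only stands in for `xtrain` and `ytrain`, so the printed scores are unrelated to the churn results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for xtrain / ytrain (not the churn data)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# RidgeClassifier has no predict_proba; the "roc_auc" scorer ranks samples
# using its decision_function values instead
scores = cross_validate(RidgeClassifier(), X, y, scoring=["roc_auc"])

fold_auc = scores["test_roc_auc"]
print("Number of folds:", len(fold_auc))
print("Mean ROC AUC: {:.4f}, std: {:.4f}".format(np.mean(fold_auc), np.std(fold_auc)))
```

The reported mean and standard deviation therefore summarize five held-out scores, not a single train/test split.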


#### Logistic Regression

In [51]:
model = LogisticRegression(n_jobs=-1, C=1/9)
scores = cross_validate(model, xtrain, ytrain, scoring=["roc_auc"], n_jobs=-1)
print("Cross validation ROC AUC: mean={0}, std={1}".format(np.mean(scores['test_roc_auc']), np.std(scores['test_roc_auc'])))

Cross validation ROC AUC: mean=0.7476478367086254, std=0.0034229465553378873
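
The notebook fits the models on unscaled features. Whether standardizing them would change the scores reported here was not tested. As a sketch only, the snippet below shows how scaling could be added without leaking information across folds: the scaler sits inside a pipeline, so it is refit on the training part of each fold. The data is synthetic and the inflated first column is an assumption made purely to mimic features on different scales.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for xtrain / ytrain (not the churn data)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X[:, 0] *= 1000.0  # assumption: mimic one column on a much larger scale

# The scaler is fit only on the training portion of each fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(C=1/9))
scaled_scores = cross_validate(pipe, X, y, scoring=["roc_auc"])

mean_auc = scaled_scores["test_roc_auc"].mean()
print("Mean ROC AUC with scaling: {:.4f}".format(mean_auc))
```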


#### Decision Tree

In [52]:
model = DecisionTreeClassifier()
scores = cross_validate(model, xtrain, ytrain, scoring=["roc_auc"], n_jobs=-1)
print("Cross validation ROC AUC: mean={0}, std={1}".format(np.mean(scores['test_roc_auc']), np.std(scores['test_roc_auc'])))

Cross validation ROC AUC: mean=0.5600960658534893, std=0.0016015450517860054


#### Random Forest

In [53]:
model = RandomForestClassifier(n_estimators=150, n_jobs=-1)
scores = cross_validate(model, xtrain, ytrain, scoring=["roc_auc"], n_jobs=-1)
print("Cross validation ROC AUC: mean={0}, std={1}".format(np.mean(scores['test_roc_auc']), np.std(scores['test_roc_auc'])))

Cross validation ROC AUC: mean=0.7316263978356097, std=0.004485483849882397


#### Tree Ensemble

In [54]:
# BaggingClassifier uses a decision tree as its default base estimator
model = BaggingClassifier(n_estimators=150, n_jobs=-1)
scores = cross_validate(model, xtrain, ytrain, scoring=["roc_auc"], n_jobs=-1)
print("Cross validation ROC AUC: mean={0}, std={1}".format(np.mean(scores['test_roc_auc']), np.std(scores['test_roc_auc'])))

Cross validation ROC AUC: mean=0.7218510738360562, std=0.005005303156959627


#### Neural Networks

In [55]:
model = MLPClassifier(hidden_layer_sizes=(128, 64, 32, 16, ), max_iter=400, early_stopping=True, validation_fraction=0.1)
scores = cross_validate(model, xtrain, ytrain, scoring=["roc_auc"], n_jobs=-1)
print("Cross validation ROC AUC: mean={0}, std={1}".format(np.mean(scores['test_roc_auc']), np.std(scores['test_roc_auc'])))

Cross validation ROC AUC: mean=0.7481568077721659, std=0.003505186411175766


In [58]:
model = MLPClassifier(hidden_layer_sizes=(128, 64, 32, 16, 8, 4, 2, ), max_iter=400, early_stopping=True, validation_fraction=0.1)
scores = cross_validate(model, xtrain, ytrain, scoring=["roc_auc"], n_jobs=-1)
print("Cross validation ROC AUC: mean={0}, std={1}".format(np.mean(scores['test_roc_auc']), np.std(scores['test_roc_auc'])))

Cross validation ROC AUC: mean=0.7020643832708185, std=0.09241016382648089


In [60]:
model = MLPClassifier(hidden_layer_sizes=(256, 256, 256, ), max_iter=400, early_stopping=True, validation_fraction=0.1)
scores = cross_validate(model, xtrain, ytrain, scoring=["roc_auc"], n_jobs=-1)
print("Cross validation ROC AUC: mean={0}, std={1}".format(np.mean(scores['test_roc_auc']), np.std(scores['test_roc_auc'])))

Cross validation ROC AUC: mean=0.7489541287761324, std=0.004457863078918097


In [65]:
model = MLPClassifier(hidden_layer_sizes=(8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, ), max_iter=400, early_stopping=True, validation_fraction=0.1)
scores = cross_validate(model, xtrain, ytrain, scoring=["roc_auc"], n_jobs=-1)
print("Cross validation ROC AUC: mean={0}, std={1}".format(np.mean(scores['test_roc_auc']), np.std(scores['test_roc_auc'])))

Cross validation ROC AUC: mean=0.749027514369645, std=0.004364201658154151
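
The cells in this section repeat the same three lines for every model. A hedged alternative is a small helper that cross-validates several candidates and collects the results in one table. `compare_models` is a hypothetical name, and the data and the three models below are placeholders on synthetic data, not a rerun of the experiments above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier


def compare_models(models, X, y):
    """Cross-validate each model and return mean/std ROC AUC, best first."""
    rows = []
    for name, model in models.items():
        auc = cross_validate(model, X, y, scoring=["roc_auc"])["test_roc_auc"]
        rows.append({"model": name, "mean_auc": np.mean(auc), "std_auc": np.std(auc)})
    table = pd.DataFrame(rows).sort_values("mean_auc", ascending=False)
    return table.reset_index(drop=True)


# Synthetic stand-in for xtrain / ytrain (not the churn data)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
models = {
    "ridge": RidgeClassifier(),
    "logistic": LogisticRegression(C=1/9),
    "tree": DecisionTreeClassifier(random_state=0),
}
results = compare_models(models, X, y)
print(results)
```

Sorting by the mean score makes it easier to see when several candidates fall within one standard deviation of each other, as the strongest models in this section do.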


## Training and Testing

In [5]:
# Final model: 13 hidden layers of 8 units each, the best of the networks tried
model = MLPClassifier(hidden_layer_sizes=(8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, ), max_iter=400, early_stopping=True, validation_fraction=0.1) # best
# model = MLPClassifier(hidden_layer_sizes=(256, 256, 256, ), max_iter=400, early_stopping=True, validation_fraction=0.1) # same performance as above
model.fit(xtrain, ytrain)
joblib.dump(model, 'NN.sav')  # persist the fitted model

# Test-set predictions: column 0 (probability of the first class) is overwritten
# with the customer IDs, leaving column 1 as the predicted probability of the second class
prediction_df = pd.DataFrame(model.predict_proba(xtest))
prediction_df[0] = test_df["CustomerID"]
prediction_df.rename(columns={0:"CustomerID", 1:"PredictionProbability"}, inplace=True)

# The metrics below are computed on the training data, not on held-out data
y_score = model.predict_proba(xtrain)[:, 1]
preds = model.predict(xtrain)
print("ROC AUC =", roc_auc_score(ytrain, y_score))
print("Accuracy =", accuracy_score(ytrain, preds))

ROC AUC = 0.7497206887409614

Accuracy = 0.8240636293157552
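
The prediction table in the cell above is built by overwriting column 0 of the `predict_proba` output with the customer IDs. A more direct construction, sketched below under stated assumptions, builds the two columns explicitly. The data is synthetic, the IDs are hypothetical, and a `LogisticRegression` stands in for the trained network so the example runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: first 150 rows for training, last 50 as "new" customers
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, y_train, X_new = X[:150], y[:150], X[150:]
ids = ["ID{:03d}".format(i) for i in range(len(X_new))]  # hypothetical customer IDs

clf = LogisticRegression().fit(X_train, y_train)

# classes_ is sorted, so column 1 of predict_proba is the probability of class 1
prediction_df = pd.DataFrame({
    "CustomerID": ids,
    "PredictionProbability": clf.predict_proba(X_new)[:, 1],
})
print(prediction_df.head())
```

This keeps the meaning of each column explicit and avoids relying on the integer column labels that `pd.DataFrame` assigns to a NumPy array.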


## Final Evaluation
The model's performance on the test set is evaluated automatically by Coursera.org. The resulting test score is:
- ROC AUC = 75.07%