## Welcome to the Data Science Coding Challenge!
Test your skills in a real-world coding challenge. Coding Challenges provide CS & DS Coding Competitions with Prizes and achievement badges!

CS & DS learners want to be challenged as a way to evaluate whether they’re job-ready. So why not create fun challenges and give winners something truly valuable, such as complimentary access to select Data Science courses or an achievement badge on their Coursera Skills Profile that highlights their performance to recruiters?

## Introduction
In this challenge, you'll get the opportunity to tackle one of the most industry-relevant machine learning problems with a unique dataset that will put your modeling skills to the test. Subscription services are used by companies across many industries, from fitness to video streaming to retail. One of the primary objectives of companies with subscription services is to decrease churn and ensure that users are retained as subscribers. To do this efficiently and systematically, many companies employ machine learning to predict which users are at the highest risk of churn, so that the right interventions can be deployed to the right audience.

In this challenge, we will be tackling the churn prediction problem on a unique and interesting group of subscribers on a video streaming service!

Imagine that you are a new data scientist at this video streaming company and you are tasked with building a model that can predict which existing subscribers will continue their subscriptions for another month. We have provided a dataset that is a sample of subscriptions that were initiated in 2021, all snapshotted at a particular date before the subscription was cancelled. Subscription cancellation can happen for a multitude of reasons, including:

- The customer completes all the content they were interested in and no longer needs the subscription.
- The customer finds themselves too busy and cancels their subscription until a later time.
- The customer determines that the streaming service is not the best fit for them, so they cancel and look for something better suited.

Regardless of the reason, this video streaming company has a vested interest in understanding how likely each individual customer is to churn, so that resources can be allocated appropriately to support customers. In this challenge, you will use your machine learning toolkit to do just that!

## Understanding the Datasets
### Train vs. Test
In this competition, you’ll gain access to two datasets that are samples of past subscriptions of a video streaming platform. They contain information about the customer, the customer's streaming preferences, and their activity in the subscription thus far. One dataset is titled train.csv and the other is titled test.csv.

train.csv contains 70% of the overall sample (243,787 subscriptions to be exact) and, importantly, reveals whether or not the subscription was continued into the next month (the “ground truth”).

The test.csv dataset contains the exact same information about the remaining segment of the overall sample (104,480 subscriptions to be exact), but does not disclose the “ground truth” for each subscription. It’s your job to predict this outcome!

Using the patterns you find in the train.csv data, predict whether the subscriptions in test.csv will be continued for another month, or not.

### Dataset descriptions
Both train.csv and test.csv contain one row for each unique subscription. Each subscription is represented by a single observation, identified by CustomerID, taken while the subscription was active.

In addition to this identifier column, the train.csv dataset also contains the target label for the task, a binary column Churn.

Besides that column, both datasets have an identical set of features that can be used to train your model to make predictions. Below you can see descriptions of each feature. Familiarize yourself with them so that you can harness them most effectively for this machine learning task!

In [None]:
import pandas as pd
data_descriptions = pd.read_csv('data_descriptions.csv')
pd.set_option('display.max_colwidth', None)
data_descriptions

In [None]:
# Importing required packages

# Data packages
import pandas as pd
import numpy as np

# Machine Learning / Classification packages
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier

# Visualization Packages
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# Importing other packages which will be used
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

## Load the Data
Let's start by loading train.csv into a dataframe train_df and test.csv into a dataframe test_df, and then display the shape of each dataframe.

In [None]:
train_df = pd.read_csv("train.csv")
print('train_df Shape:', train_df.shape)
train_df.head()

In [None]:
test_df = pd.read_csv("test.csv")
print('test_df Shape:', test_df.shape)
test_df.head()

## Exploring, Cleaning, Validating, and Visualizing the Data 
Here you can explore, clean, validate, and visualize the data however you see fit to help build or optimize your predictive model. Please note: the final autograding is based only on the accuracy of the predictions in prediction_df.

In [None]:
train_df.isnull().sum()

In [None]:
train_df.duplicated().sum()

In [None]:
train_df['Churn'].value_counts()

In [None]:
train_df['Churn'].value_counts().plot(kind='bar')

plt.title('Churn distribution using Training Data')
plt.xlabel('Churn')
plt.ylabel('Count')
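
Raw counts can hide how imbalanced the target is. As a minimal sketch, using a small made-up `Churn` column in place of `train_df['Churn']`, the class shares can be read off directly with `value_counts(normalize=True)`:

```python
import pandas as pd

# Toy stand-in for train_df['Churn']; the values are invented for illustration.
churn = pd.Series([0, 0, 0, 0, 1, 0, 1, 0, 0, 0], name="Churn")

# normalize=True returns the share of each class rather than the raw count.
class_share = churn.value_counts(normalize=True).sort_index()
print(class_share)

# For a 0/1 column, the positive-class rate is simply the mean.
churn_rate = churn.mean()
print(f"Churn rate: {churn_rate:.1%}")
```

If the positive class turns out to be rare, a ranking metric such as ROC AUC is usually more informative than plain accuracy when comparing models.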

## ML MODEL
### Using XGBoost classifier

In [None]:
# Install XGBoost if it is not already available on your system
!pip install xgboost

In [None]:
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder

In [None]:
X_train = train_df.drop(['CustomerID','Churn'], axis=1)
y_train = train_df['Churn']
X_test = test_df.drop(['CustomerID'], axis=1)

## Feature Engineering

In [None]:
# Usage Patterns
X_train['AverageViewingDurationPerWeek'] = X_train['AverageViewingDuration'] / (X_train['ViewingHoursPerWeek'] + 1e-8)
X_test['AverageViewingDurationPerWeek'] = X_test['AverageViewingDuration'] / (X_test['ViewingHoursPerWeek'] + 1e-8)

In [None]:
# Aggregated Features
X_train['AverageViewingDurationBySubscriptionType'] = X_train.groupby('SubscriptionType')['AverageViewingDuration'].transform('mean')
X_test['AverageViewingDurationBySubscriptionType'] = X_test.groupby('SubscriptionType')['AverageViewingDuration'].transform('mean')
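
The cell above computes the group means separately on each dataset, so the same subscription type can receive slightly different values in train and test. One alternative, sketched here on invented toy frames (not the competition data), is to learn the means on the training rows only and map them onto both frames. The fallback to the overall training mean for unseen groups is an assumption of this sketch.

```python
import pandas as pd

# Toy frames standing in for X_train and X_test; all values are invented.
train = pd.DataFrame({
    "SubscriptionType": ["Basic", "Basic", "Premium", "Premium"],
    "AverageViewingDuration": [10.0, 20.0, 40.0, 60.0],
})
test = pd.DataFrame({
    "SubscriptionType": ["Premium", "Basic", "Standard"],
    "AverageViewingDuration": [55.0, 12.0, 30.0],
})

# Learn the per-group means from the training rows only.
group_means = train.groupby("SubscriptionType")["AverageViewingDuration"].mean()
overall_mean = train["AverageViewingDuration"].mean()

# Apply the same lookup to both frames; groups unseen in training
# fall back to the overall training mean.
for frame in (train, test):
    frame["AverageViewingDurationBySubscriptionType"] = (
        frame["SubscriptionType"].map(group_means).fillna(overall_mean)
    )

print(test)
```

With this approach a given subscription type always maps to the same feature value, regardless of which file the row comes from.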

In [None]:
# Interaction Ratios
X_train['ContentDownloadsToViewingHoursRatio'] = X_train['ContentDownloadsPerMonth'] / (X_train['ViewingHoursPerWeek'] * 4 + 1e-8)
X_test['ContentDownloadsToViewingHoursRatio'] = X_test['ContentDownloadsPerMonth'] / (X_test['ViewingHoursPerWeek'] * 4 + 1e-8)

In [None]:
label_encoder = LabelEncoder()
categorical_columns = X_train.select_dtypes(include=['object', 'string']).columns
for col in categorical_columns:
    X_train[col] = label_encoder.fit_transform(X_train[col])
    X_test[col] = label_encoder.transform(X_test[col])
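
`LabelEncoder.transform` raises an error if the test data contains a category that was not present during fitting. A possible alternative, shown as a sketch on invented data, is scikit-learn's `OrdinalEncoder`, which encodes several columns at once and can assign a sentinel value to unseen categories. The `Region` column and the sentinel `-1` are hypothetical choices for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames; "Region" is a hypothetical column and all values are invented.
train = pd.DataFrame({
    "SubscriptionType": ["Basic", "Premium", "Standard"],
    "Region": ["West", "West", "East"],
})
test = pd.DataFrame({
    "SubscriptionType": ["Premium", "Family"],  # "Family" never appears in train
    "Region": ["East", "West"],
})

categorical_columns = ["SubscriptionType", "Region"]

# Unseen categories are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train[categorical_columns] = encoder.fit_transform(train[categorical_columns])
test[categorical_columns] = encoder.transform(test[categorical_columns])

print(test)
```

The encoder is fitted on the training frame only, so the integer codes mean the same thing in both frames.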

In [None]:
xgb_model = XGBClassifier(n_estimators=100, random_state=0)
xgb_model.fit(X_train, y_train)
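
Because test.csv has no labels, model quality can only be estimated on held-out rows of the training data. The sketch below illustrates the idea with the `train_test_split`, `DummyClassifier`, and `roc_auc_score` utilities imported earlier. It runs on synthetic data and uses scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier`; the choice of ROC AUC as the metric is also an assumption, since the notebook does not state how submissions are scored.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data standing in for X_train / y_train.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.8, 0.2], random_state=0)

# Hold out 20% of the labeled rows; stratify keeps the class ratio in both parts.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Baseline: predicts the training class prior for every row (constant score).
baseline = DummyClassifier(strategy="prior").fit(X_fit, y_fit)
baseline_auc = roc_auc_score(y_val, baseline.predict_proba(X_val)[:, 1])

# Stand-in gradient-boosted model, evaluated on the same held-out rows.
model = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X_fit, y_fit)
model_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

print(f"Baseline ROC AUC: {baseline_auc:.3f}")
print(f"Model ROC AUC:    {model_auc:.3f}")
```

A model is only worth submitting if it clearly beats the constant baseline on the held-out rows. Once a configuration is chosen, it can be refitted on the full training set before predicting on the test set.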

In [None]:
# Probability of the positive class (Churn = 1): second column of predict_proba
predicted_probability = xgb_model.predict_proba(X_test)[:, 1]
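
In scikit-learn style classifiers, the columns of `predict_proba` follow the order of the fitted `classes_` attribute, so the churn probability can be selected by class label rather than by position. The sketch below demonstrates this on a tiny invented dataset with a `DecisionTreeClassifier` as a stand-in; that `xgb_model` behaves the same way is an assumption based on XGBoost's scikit-learn compatible interface.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny invented dataset: one feature, binary label.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Find which column of predict_proba corresponds to the positive class (label 1).
positive_index = list(clf.classes_).index(1)
proba = clf.predict_proba(X)
positive_proba = proba[:, positive_index]

# For a binary problem the two columns sum to 1,
# so 1 - P(class 0) equals P(class 1).
assert np.allclose(positive_proba, 1 - proba[:, 0])
print(positive_proba)
```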

In [None]:
# Combine the customer IDs and predicted probabilities into a dataframe
prediction_df = pd.DataFrame({'CustomerID': test_df['CustomerID'].values,
                              'predicted_probability': predicted_probability})

In [None]:
# Should contain 104,480 rows and 2 columns: 'CustomerID' and 'predicted_probability'
print(prediction_df.shape)
prediction_df.head(10)

## Final Test

In [None]:
# Writing to csv for autograding purposes
prediction_df.to_csv("prediction_submission.csv", index=False)
submission = pd.read_csv("prediction_submission.csv")

assert isinstance(submission, pd.DataFrame), 'You should have a dataframe named prediction_df.'

In [None]:
assert submission.columns[0] == 'CustomerID', 'The first column name should be CustomerID.'
assert submission.columns[1] == 'predicted_probability', 'The second column name should be predicted_probability.'

In [None]:
assert submission.shape[0] == 104480, 'The dataframe prediction_df should have 104480 rows.'

In [None]:
assert submission.shape[1] == 2, 'The dataframe prediction_df should have 2 columns.'
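
The assertions above cover column names and shape. A few further sanity checks may catch problems they miss, such as duplicated IDs or values that are not valid probabilities. The helper below is a hypothetical sketch, not part of the autograder, and is demonstrated on invented data.

```python
import pandas as pd

def check_submission(df: pd.DataFrame, expected_rows: int) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if list(df.columns) != ["CustomerID", "predicted_probability"]:
        problems.append("unexpected columns")
        return problems  # the remaining checks need these columns
    if len(df) != expected_rows:
        problems.append("wrong row count")
    if df["CustomerID"].duplicated().any():
        problems.append("duplicate CustomerID values")
    if df["predicted_probability"].isna().any():
        problems.append("missing probabilities")
    if not df["predicted_probability"].between(0, 1).all():
        problems.append("probabilities outside [0, 1]")
    return problems

# Invented examples: one valid submission and one with two problems.
good = pd.DataFrame({"CustomerID": ["A1", "B2", "C3"],
                     "predicted_probability": [0.1, 0.5, 0.9]})
bad = pd.DataFrame({"CustomerID": ["A1", "A1", "C3"],
                    "predicted_probability": [0.1, 1.5, 0.9]})

print(check_submission(good, expected_rows=3))
print(check_submission(bad, expected_rows=3))
```

For the real file, the call would be `check_submission(submission, expected_rows=104480)`.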

## Work Submitted