## Churn Prediction:
To perform a predictive analysis on this dataset, we can follow these steps:
#### 1. Data Preprocessing:
* Handle missing values.
* Encode categorical variables.
* Normalize or scale numerical features.
#### 2. Feature Selection:
* Choose relevant features that influence the target variable (Churn).
#### 3. Model Selection:
* Choose appropriate machine learning algorithms for prediction (e.g., Logistic Regression, Decision Trees, Random Forest, etc.).
#### 4. Model Training and Evaluation:
* Train the model on the dataset.
* Evaluate the model using metrics such as accuracy, precision, recall, F1-score, etc.
#### 5. Hyperparameter Tuning:
* Optimize the model for better performance.
Let's start with the data preprocessing step. The example code below uses Python with pandas and scikit-learn:

### Importing Libraries

In [1]:
# Import required packages

# Data packages
import pandas as pd
import numpy as np

# Machine Learning / Classification packages
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier

# Visualization Packages
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

### Loading Data

In [33]:
# Load the training dataset
train_data = pd.read_csv('train.csv')

# Load the test dataset
test_data = pd.read_csv('test.csv')

In [34]:
train_data.head()

Unnamed: 0,AccountAge,MonthlyCharges,TotalCharges,SubscriptionType,PaymentMethod,PaperlessBilling,ContentType,MultiDeviceAccess,DeviceRegistered,ViewingHoursPerWeek,...,ContentDownloadsPerMonth,GenrePreference,UserRating,SupportTicketsPerMonth,Gender,WatchlistSize,ParentalControl,SubtitlesEnabled,CustomerID,Churn
0,20,11.055215,221.104302,Premium,Mailed check,No,Both,No,Mobile,36.758104,...,10,Sci-Fi,2.176498,4,Male,3,No,No,CB6SXPNVZA,0
1,57,5.175208,294.986882,Basic,Credit card,Yes,Movies,No,Tablet,32.450568,...,18,Action,3.478632,8,Male,23,No,Yes,S7R2G87O09,0
2,73,12.106657,883.785952,Basic,Mailed check,Yes,Movies,No,Computer,7.39516,...,23,Fantasy,4.238824,6,Male,1,Yes,Yes,EASDC20BDT,0
3,32,7.263743,232.439774,Basic,Electronic check,No,TV Shows,No,Tablet,27.960389,...,30,Drama,4.276013,2,Male,24,Yes,Yes,NPF69NT69N,0
4,57,16.953078,966.325422,Premium,Electronic check,Yes,TV Shows,No,TV,20.083397,...,20,Comedy,3.61617,4,Female,0,No,No,4LGYPK7VOL,0


In [35]:
test_data.head()

Unnamed: 0,AccountAge,MonthlyCharges,TotalCharges,SubscriptionType,PaymentMethod,PaperlessBilling,ContentType,MultiDeviceAccess,DeviceRegistered,ViewingHoursPerWeek,AverageViewingDuration,ContentDownloadsPerMonth,GenrePreference,UserRating,SupportTicketsPerMonth,Gender,WatchlistSize,ParentalControl,SubtitlesEnabled,CustomerID
0,38,17.869374,679.036195,Premium,Mailed check,No,TV Shows,No,TV,29.126308,122.274031,42,Comedy,3.522724,2,Male,23,No,No,O1W6BHP6RM
1,77,9.912854,763.289768,Basic,Electronic check,Yes,TV Shows,No,TV,36.873729,57.093319,43,Action,2.021545,2,Female,22,Yes,No,LFR4X92X8H
2,5,15.019011,75.095057,Standard,Bank transfer,No,TV Shows,Yes,Computer,7.601729,140.414001,14,Sci-Fi,4.806126,2,Female,22,No,Yes,QM5GBIYODA
3,88,15.357406,1351.451692,Standard,Electronic check,No,Both,Yes,Tablet,35.58643,177.002419,14,Comedy,4.9439,0,Female,23,Yes,Yes,D9RXTK2K9F
4,91,12.406033,1128.949004,Standard,Credit card,Yes,TV Shows,Yes,Tablet,23.503651,70.308376,6,Drama,2.84688,6,Female,0,No,No,ENTCCHR1LR
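
The preprocessing steps listed at the top include handling missing values, which the cells in this notebook do not show. A minimal sketch of how that check and a simple fill could look is given here. The `toy` frame and its values are made up for illustration, and the median/mode fill is an assumed strategy rather than one taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data; the values are invented for illustration.
toy = pd.DataFrame({
    "AccountAge": [20, 57, np.nan, 32],
    "MonthlyCharges": [11.05, np.nan, 12.10, 7.26],
    "SubscriptionType": ["Premium", "Basic", None, "Basic"],
})

# Count missing values per column.
missing_counts = toy.isna().sum()
print(missing_counts)

# Split columns into numeric and non-numeric groups.
numeric_cols = toy.select_dtypes(include="number").columns
categorical_cols = toy.columns.difference(numeric_cols)

# Fill numeric gaps with the column median and categorical gaps with the mode.
filled = toy.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].median())
for col in categorical_cols:
    filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
```

If `isna().sum()` reports zero for every column of the real data, no fill is needed and this step can be skipped.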


### Encoding Categorical Variables
Convert all categorical columns to numerical using one-hot encoding.

In [36]:
# List of categorical columns
categorical_columns = ['SubscriptionType', 'PaymentMethod', 'PaperlessBilling', 'ContentType',
                       'MultiDeviceAccess', 'DeviceRegistered', 'GenrePreference', 'Gender',
                       'ParentalControl', 'SubtitlesEnabled']

# Apply one-hot encoding to the categorical columns
train_data = pd.get_dummies(train_data, columns=categorical_columns)
test_data = pd.get_dummies(test_data, columns=categorical_columns)

# Ensure the columns in the train and test sets match after encoding
train_columns = set(train_data.columns)
test_columns = set(test_data.columns)

# Find the columns present in one set but not the other
missing_in_train = test_columns - train_columns
missing_in_test = train_columns - test_columns

# Add missing columns with default value 0
for col in missing_in_train:
    train_data[col] = 0
for col in missing_in_test:
    test_data[col] = 0

# Ensure columns are in the same order
train_data = train_data.sort_index(axis=1)
test_data = test_data.sort_index(axis=1)
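
An alternative way to align the encoded columns is `DataFrame.reindex`, sketched below on toy data. Unlike the set-difference loop above, this variant treats the training feature columns as the reference, so a category that appears only in the test set is dropped rather than added to the training set. The frame names (`toy_train`, `toy_test`) are hypothetical.

```python
import pandas as pd

# Toy frames; "Standard" appears only in the test rows, "Premium" only in train.
toy_train = pd.DataFrame({"SubscriptionType": ["Basic", "Premium"], "Churn": [0, 1]})
toy_test = pd.DataFrame({"SubscriptionType": ["Basic", "Standard"]})

train_enc = pd.get_dummies(toy_train, columns=["SubscriptionType"])
test_enc = pd.get_dummies(toy_test, columns=["SubscriptionType"])

# Use the training feature columns (everything except the target) as the reference.
feature_cols = train_enc.columns.drop("Churn")

# reindex adds columns missing from the test set (filled with 0), drops
# columns the model never saw, and matches the training column order.
test_aligned = test_enc.reindex(columns=feature_cols, fill_value=0)

print(test_aligned)
```

Because the target column is excluded from `feature_cols`, this approach never creates a placeholder `Churn` column in the test set.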


In [37]:
train_data.head()

Unnamed: 0,AccountAge,AverageViewingDuration,Churn,ContentDownloadsPerMonth,ContentType_Both,ContentType_Movies,ContentType_TV Shows,CustomerID,DeviceRegistered_Computer,DeviceRegistered_Mobile,...,SubscriptionType_Basic,SubscriptionType_Premium,SubscriptionType_Standard,SubtitlesEnabled_No,SubtitlesEnabled_Yes,SupportTicketsPerMonth,TotalCharges,UserRating,ViewingHoursPerWeek,WatchlistSize
0,20,63.531377,0,10,True,False,False,CB6SXPNVZA,False,True,...,False,True,False,True,False,4,221.104302,2.176498,36.758104,3
1,57,25.725595,0,18,False,True,False,S7R2G87O09,False,False,...,True,False,False,False,True,8,294.986882,3.478632,32.450568,23
2,73,57.364061,0,23,False,True,False,EASDC20BDT,True,False,...,True,False,False,False,True,6,883.785952,4.238824,7.39516,1
3,32,131.537507,0,30,False,False,True,NPF69NT69N,False,False,...,True,False,False,False,True,2,232.439774,4.276013,27.960389,24
4,57,45.356653,0,20,False,False,True,4LGYPK7VOL,False,False,...,False,True,False,True,False,4,966.325422,3.61617,20.083397,0


In [38]:
test_data.head()

Unnamed: 0,AccountAge,AverageViewingDuration,Churn,ContentDownloadsPerMonth,ContentType_Both,ContentType_Movies,ContentType_TV Shows,CustomerID,DeviceRegistered_Computer,DeviceRegistered_Mobile,...,SubscriptionType_Basic,SubscriptionType_Premium,SubscriptionType_Standard,SubtitlesEnabled_No,SubtitlesEnabled_Yes,SupportTicketsPerMonth,TotalCharges,UserRating,ViewingHoursPerWeek,WatchlistSize
0,38,122.274031,0,42,False,False,True,O1W6BHP6RM,False,False,...,False,True,False,True,False,2,679.036195,3.522724,29.126308,23
1,77,57.093319,0,43,False,False,True,LFR4X92X8H,False,False,...,True,False,False,True,False,2,763.289768,2.021545,36.873729,22
2,5,140.414001,0,14,False,False,True,QM5GBIYODA,True,False,...,False,False,True,False,True,2,75.095057,4.806126,7.601729,22
3,88,177.002419,0,14,True,False,False,D9RXTK2K9F,False,False,...,False,False,True,False,True,0,1351.451692,4.9439,35.58643,23
4,91,70.308376,0,6,False,False,True,ENTCCHR1LR,False,False,...,False,False,True,True,False,6,1128.949004,2.84688,23.503651,0


In [39]:
len(test_data)

104480

### Remove the CustomerID Column

In [40]:
# Extract CustomerID for the test set before dropping the column
customer_ids = test_data['CustomerID']

# Remove the CustomerID column from both datasets
train_data = train_data.drop(columns=['CustomerID'])
test_data = test_data.drop(columns=['CustomerID'])

# The column-alignment step added a placeholder Churn column (all zeros) to the
# test data; drop it so the test set contains features only
if 'Churn' in test_data.columns:
    test_data = test_data.drop(columns=['Churn'])


### Separate Features and Target in the Training Set

In [41]:
# Separate features and target variable in the training set
X_train = train_data.drop(columns=['Churn'])
y_train = train_data['Churn']

# Features for the test set (test set doesn't have Churn column)
X_test = test_data.copy()


### Model Selection:
Choose appropriate machine learning algorithms for prediction (e.g., Logistic Regression, Decision Trees, Random Forest, etc.).

Let's proceed by training a RandomForestClassifier on the training data and using it to predict on the test data. Here’s the code to do that:

In [42]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

In [43]:
# Initialize the RandomForestClassifier
model = RandomForestClassifier(random_state=42)

### Model Training and Evaluation:
* Train the model on the dataset.
* Evaluate the model using metrics such as accuracy, precision, recall, F1-score, etc.


In [44]:
# Train the model on the training data
model.fit(X_train, y_train)
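
The Feature Selection step from the outline can be informed by the fitted forest itself, since scikit-learn's `RandomForestClassifier` exposes a `feature_importances_` attribute after fitting. The sketch below uses synthetic data and made-up feature names in place of `X_train`; it shows one possible way to rank features, not a step taken from this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for X_train / y_train with hypothetical feature names.
X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Impurity-based importances sum to 1; sort so the strongest features come first.
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

With the real data, `feature_names` would be `X_train.columns`. Impurity-based importances can favor high-cardinality features, so they are best read as a rough ranking.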

### Evaluate the Model

In [45]:
# Make predictions on the training data to evaluate the model
y_train_pred = model.predict(X_train)

In [46]:
# Evaluate the model on the training data
accuracy_train = accuracy_score(y_train, y_train_pred)
precision_train = precision_score(y_train, y_train_pred)
recall_train = recall_score(y_train, y_train_pred)
f1_train = f1_score(y_train, y_train_pred)
conf_matrix_train = confusion_matrix(y_train, y_train_pred)

print(f'Training Accuracy: {accuracy_train}')
print(f'Training Precision: {precision_train}')
print(f'Training Recall: {recall_train}')
print(f'Training F1 Score: {f1_train}')
print(f'Training Confusion Matrix:\n{conf_matrix_train}')

Training Accuracy: 0.9999876941756534
Training Precision: 1.0
Training Recall: 0.9999320990448599
Training F1 Score: 0.9999660483697559
Training Confusion Matrix:
[[199605      0]
 [     3  44179]]
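
These scores are computed on the same rows the model was fitted on, so near-perfect values mostly show that the forest has memorized the training set; they say little about performance on unseen customers. A common remedy is to hold out part of the labelled data for validation. The sketch below assumes synthetic data with roughly the same class balance as the confusion matrix above (about 18% churners) and uses `train_test_split` and `roc_auc_score`, both imported at the top of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for X_train / y_train (roughly 18% positives).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.82], random_state=42)

# Hold out 20% of the labelled rows, preserving the class ratio in both parts.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X_fit, y_fit)

# Compare the score on rows the model has seen with the score on unseen rows.
auc_fit = roc_auc_score(y_fit, clf.predict_proba(X_fit)[:, 1])
auc_val = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
print(f"AUC on fitted rows:   {auc_fit:.3f}")
print(f"AUC on held-out rows: {auc_val:.3f}")
```

The gap between the two numbers gives a rough sense of how much the model overfits. With the real data, `X` and `y` would be `X_train` and `y_train`.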


### Making Predictions on the Test Data
Because the test set has no Churn labels, the model cannot be scored on it; we only generate predictions and save them for later use.

In [47]:
# Make probability predictions on the test data
y_test_prob = model.predict_proba(X_test)[:, 1]

### Create prediction_df
Combine the test-set customer IDs with the predicted probabilities into prediction_df, with columns CustomerID and predicted_probability:

In [50]:
# Create the prediction_df with CustomerID and predicted_probability
prediction_df = pd.DataFrame({
    'CustomerID': customer_ids,
    'predicted_probability': y_test_prob
})

# Sanity check: one row per test customer and exactly two columns
assert prediction_df.shape == (len(customer_ids), 2)



### Save the predictions to a CSV file

In [52]:
prediction_df.to_csv('test_predictions.csv', index=False)

### Hyperparameter Tuning:
* Optimize the model for better performance.

To improve the model's performance, you can perform hyperparameter tuning using GridSearchCV:
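
A minimal sketch of such a search follows, run on synthetic data so that it finishes quickly. The parameter grid, the `roc_auc` scoring choice, and the 3-fold setting are illustrative assumptions, not values taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hypothetical grid: 2 x 2 x 2 = 8 candidate settings.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [4, None],
    "min_samples_leaf": [1, 5],
}

# Each candidate is scored by cross-validated ROC AUC on 3 folds.
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(f"Best cross-validated AUC: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` is a forest refitted on all supplied rows with the best settings, and it can be used for `predict_proba` in the same way as `model` above. On a dataset of this notebook's size, a full grid can be slow, so a small grid or a subsample is a reasonable starting point.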