# Customer Churn Prediction

## Objective

The objective is to develop a machine learning model that predicts customer churn from historical customer data. You will follow a typical machine learning project pipeline, from data preprocessing to model deployment.

### Data Preprocessing

In [13]:
import pandas as pd
from scipy import stats
from sklearn.model_selection import train_test_split

In [3]:
df = pd.read_csv("customer_churn_large_dataset.csv")
df.head()

| | CustomerID | Name | Age | Gender | Location | Subscription_Length_Months | Monthly_Bill | Total_Usage_GB | Churn |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Customer_1 | 63 | Male | Los Angeles | 17 | 73.36 | 236 | 0 |
| 1 | 2 | Customer_2 | 62 | Female | New York | 1 | 48.76 | 172 | 0 |
| 2 | 3 | Customer_3 | 24 | Female | Los Angeles | 5 | 85.47 | 460 | 0 |
| 3 | 4 | Customer_4 | 36 | Female | Miami | 3 | 97.94 | 297 | 1 |
| 4 | 5 | Customer_5 | 46 | Female | Miami | 19 | 58.14 | 266 | 0 |


In [5]:
df.dtypes

CustomerID                      int64
Name                           object
Age                             int64
Gender                         object
Location                       object
Subscription_Length_Months      int64
Monthly_Bill                  float64
Total_Usage_GB                  int64
Churn                           int64
dtype: object

In [6]:
df.describe()

| | CustomerID | Age | Subscription_Length_Months | Monthly_Bill | Total_Usage_GB | Churn |
|---|---|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 50000.5 | 44.02702 | 12.4901 | 65.053197 | 274.39365 | 0.49779 |
| std | 28867.657797 | 15.280283 | 6.926461 | 20.230696 | 130.463063 | 0.499998 |
| min | 1.0 | 18.0 | 1.0 | 30.0 | 50.0 | 0.0 |
| 25% | 25000.75 | 31.0 | 6.0 | 47.54 | 161.0 | 0.0 |
| 50% | 50000.5 | 44.0 | 12.0 | 65.01 | 274.0 | 0.0 |
| 75% | 75000.25 | 57.0 | 19.0 | 82.64 | 387.0 | 1.0 |
| max | 100000.0 | 70.0 | 24.0 | 100.0 | 500.0 | 1.0 |


In [7]:
df.isnull().sum()

CustomerID                    0
Name                          0
Age                           0
Gender                        0
Location                      0
Subscription_Length_Months    0
Monthly_Bill                  0
Total_Usage_GB                0
Churn                         0
dtype: int64

Every column reports zero nulls, so the dataset has no missing values and no imputation is needed.

In [10]:
# Handling outliers: keep rows within 3 standard deviations of the column mean.
# Churn is the binary target, so it is excluded from outlier filtering.
# Judging by the describe() output, the min and max of each column lie within
# 3 standard deviations, so this step is expected to remove no rows.
for col in ['Subscription_Length_Months', 'Monthly_Bill', 'Total_Usage_GB']:
    # Recompute z-scores on the current (already filtered) frame
    z = stats.zscore(df[col])
    df = df[(z < 3) & (z > -3)]
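
The z-score filter above keeps a row only when its value lies within three standard deviations of the column mean. The following is a minimal sketch of the same idea applied to several columns in one pass. The helper name `drop_zscore_outliers` and the toy data are illustrative assumptions, not part of the notebook or the dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats


def drop_zscore_outliers(frame, columns, threshold=3.0):
    """Keep rows whose |z-score| is below `threshold` in every listed column.

    Hypothetical helper for illustration only.
    """
    # Compute all z-scores on the unfiltered frame so that every column
    # is judged against the same set of rows.
    z = np.abs(stats.zscore(frame[columns].to_numpy(dtype=float), axis=0))
    return frame[(z < threshold).all(axis=1)]


# Toy data (made up): 20 ordinary bills and one extreme value.
toy = pd.DataFrame({
    "Monthly_Bill": [50.0 + i for i in range(20)] + [1000.0],
    "Total_Usage_GB": [100 + 10 * i for i in range(21)],
})
cleaned = drop_zscore_outliers(toy, ["Monthly_Bill", "Total_Usage_GB"])
print(len(toy), "->", len(cleaned))
```

Unlike the cell above, this variant computes every z-score before dropping anything, so the result does not depend on the order in which columns are processed.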


In [11]:
df = pd.get_dummies(df, columns=['Gender', 'Location'], drop_first=True)
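
With `drop_first=True`, `pd.get_dummies` replaces each listed categorical column with one indicator column per category except the first in sorted order, which becomes the implicit baseline. A small sketch on made-up rows (the toy frame is an assumption for illustration) shows which columns result:

```python
import pandas as pd

# Toy rows (made up) using the same column names as the dataset.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Location": ["Miami", "New York", "Los Angeles"],
    "Churn": [0, 1, 0],
})

# One-hot encode both categorical columns, dropping the first category of each.
encoded = pd.get_dummies(toy, columns=["Gender", "Location"], drop_first=True)
print(list(encoded.columns))
```

In this toy frame the dropped baselines are `Female` and `Los Angeles`, so a row with all indicator columns set to false or zero represents a female customer in Los Angeles.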

In [12]:
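# Drop the target and the identifier columns: CustomerID and Name are unique
# per customer, and Name is a string column that most estimators cannot consume.
X = df.drop(columns=['CustomerID', 'Name', 'Churn'])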
y = df['Churn']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
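
A quick sanity check after splitting is to compare the partition sizes and the churn rate in each part. The sketch below uses synthetic data rather than the real dataset, and it adds `stratify=`, which the notebook's split does not use; treat it as an optional variation that keeps the class ratio nearly identical in both partitions. The names `X_demo`, `y_demo`, `X_tr`, and `X_te` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): one feature and a roughly balanced target.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({"Monthly_Bill": rng.uniform(30, 100, size=1000)})
y_demo = pd.Series(rng.integers(0, 2, size=1000), name="Churn")

# Same 80/20 split as the notebook, plus stratification on the target.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train rows:", len(X_tr), "test rows:", len(X_te))
print("train churn rate:", round(y_tr.mean(), 3))
print("test churn rate:", round(y_te.mean(), 3))
```

Because the real target is close to balanced (mean churn of about 0.498), an unstratified split is likely to produce similar rates anyway; stratification mainly removes the remaining sampling noise.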