<h2>Telco Customer Churn Prediction - Model Training</h2>

<h3>After performing EDA on the Telco dataset, a predictive model was created 
to classify whether a customer will churn.</h3>

In [65]:
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [66]:
data = pd.read_csv(r"D:\Documents\ML\Project\data_after_EDA.csv")
data.head()

Unnamed: 0,Tenure Months,Online Security,Online Backup,Tech Support,Contract,Monthly Charges,Churn Label,Churn Value
0,2,Yes,Yes,No,Month-to-month,53.85,Yes,1
1,2,No,No,No,Month-to-month,70.7,Yes,1
2,8,No,No,No,Month-to-month,99.65,Yes,1
3,28,No,No,Yes,Month-to-month,104.8,Yes,1
4,49,No,Yes,No,Month-to-month,103.7,Yes,1


In [67]:
data = data.drop(labels=["Churn Label"], axis=1)

In [68]:
data.dtypes

Tenure Months        int64
Online Security     object
Online Backup       object
Tech Support        object
Contract            object
Monthly Charges    float64
Churn Value          int64
dtype: object

Splitting the data into features (X) and labels (y)

In [69]:
X = data.drop(labels="Churn Value", axis=1)
X.head()

Unnamed: 0,Tenure Months,Online Security,Online Backup,Tech Support,Contract,Monthly Charges
0,2,Yes,Yes,No,Month-to-month,53.85
1,2,No,No,No,Month-to-month,70.7
2,8,No,No,No,Month-to-month,99.65
3,28,No,No,Yes,Month-to-month,104.8
4,49,No,Yes,No,Month-to-month,103.7


In [70]:
y = data["Churn Value"]
y.head()

0    1
1    1
2    1
3    1
4    1
Name: Churn Value, dtype: int64

Converting the categorical columns into numerical columns with one-hot encoding (OneHotEncoder)

In [71]:
# Names of all columns with dtype "object" (the categorical features)
cat_cols = X.select_dtypes(include=["object"]).columns
cat_cols

Index(['Online Security', 'Online Backup', 'Tech Support', 'Contract'], dtype='object')

In [72]:
# sparse_output=False returns a dense array (scikit-learn >= 1.2; older versions use sparse=False)
encoder = OneHotEncoder(sparse_output=False)

In [73]:
encoded_arr = encoder.fit_transform(X[cat_cols])
encoded_arr

array([[0., 0., 1., ..., 1., 0., 0.],
       [1., 0., 0., ..., 1., 0., 0.],
       [1., 0., 0., ..., 1., 0., 0.],
       ...,
       [1., 0., 0., ..., 0., 1., 0.],
       [0., 0., 1., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 0., 1.]])

In [74]:
encoded_df = pd.DataFrame(
    encoded_arr,
    columns=encoder.get_feature_names_out(cat_cols),
    index=X.index
)

In [75]:
encoded_df.head()

Unnamed: 0,Online Security_No,Online Security_No internet service,Online Security_Yes,Online Backup_No,Online Backup_No internet service,Online Backup_Yes,Tech Support_No,Tech Support_No internet service,Tech Support_Yes,Contract_Month-to-month,Contract_One year,Contract_Two year
0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
1,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
2,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
3,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0


In [76]:
X.head()

Unnamed: 0,Tenure Months,Online Security,Online Backup,Tech Support,Contract,Monthly Charges
0,2,Yes,Yes,No,Month-to-month,53.85
1,2,No,No,No,Month-to-month,70.7
2,8,No,No,No,Month-to-month,99.65
3,28,No,No,Yes,Month-to-month,104.8
4,49,No,Yes,No,Month-to-month,103.7


In [77]:
X = X.drop(labels=cat_cols, axis=1)

In [78]:
X = X.join(encoded_df)


In [79]:
X.head()

Unnamed: 0,Tenure Months,Monthly Charges,Online Security_No,Online Security_No internet service,Online Security_Yes,Online Backup_No,Online Backup_No internet service,Online Backup_Yes,Tech Support_No,Tech Support_No internet service,Tech Support_Yes,Contract_Month-to-month,Contract_One year,Contract_Two year
0,2,53.85,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
1,2,70.7,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
2,8,99.65,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
3,28,104.8,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
4,49,103.7,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
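The manual encode, drop, and join steps above can also be bundled with the model, so that the same encoding is applied automatically at prediction time. The following is a minimal sketch of that alternative, not the approach used in this notebook. The frame `toy_X`, the labels `toy_y`, and the step names `"onehot"`, `"preprocess"`, and `"forest"` are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up frame that mimics a few of the notebook's columns (not real data).
toy_X = pd.DataFrame({
    "Tenure Months": [2, 8, 28, 49, 60, 5],
    "Contract": ["Month-to-month", "Month-to-month", "One year",
                 "Two year", "Two year", "Month-to-month"],
    "Monthly Charges": [53.85, 99.65, 104.8, 103.7, 20.5, 70.7],
})
toy_y = pd.Series([1, 1, 0, 0, 0, 1])

# One-hot encode the categorical column, pass numeric columns through unchanged.
# handle_unknown="ignore" encodes an unseen category as all zeros instead of raising.
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["Contract"])],
    remainder="passthrough",
)

pipe = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipe.fit(toy_X, toy_y)

# Raw (unencoded) rows can be passed straight to predict.
new_customer = pd.DataFrame({
    "Tenure Months": [1],
    "Contract": ["Month-to-month"],
    "Monthly Charges": [80.0],
})
pred = pipe.predict(new_customer)
print(pred)
```

With this design, the fitted encoder travels with the model, so there is no separate encoder object to keep in sync when the model is saved and reloaded.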


In [80]:
y.head()

0    1
1    1
2    1
3    1
4    1
Name: Churn Value, dtype: int64

Next, the data is split into training and test sets (80% / 20%)

In [81]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2
)

In [82]:
X_train.shape

(5634, 14)

In [83]:
X_test.shape

(1409, 14)
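The split above is random on every run, so the scores that follow will vary slightly between runs. A possible refinement, sketched here on synthetic data rather than the Telco data, is to fix `random_state` for reproducibility and pass `stratify` so that both sets keep the same share of churners. The arrays `toy_X` and `toy_y` and the 25% positive rate are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 25% positives (made-up numbers).
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(1000, 3))
toy_y = np.array([1] * 250 + [0] * 750)

# stratify keeps the class ratio the same in both sets;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)

# Share of positives in each set; both should be close to 0.25.
print(y_tr.mean(), y_te.mean())
```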

Now the model can be created and trained

In [84]:
model = RandomForestClassifier()

In [85]:
model.fit(X_train, y_train)

In [86]:
y_pred = model.predict(X_test)

Now evaluate the model's accuracy on the test set

In [None]:
from sklearn.metrics import accuracy_score

score = accuracy_score(y_test, y_pred) * 100
score

78.85024840312278

The model reaches about 78.9% accuracy on the test set; tuning the hyperparameters may improve this
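Accuracy alone can hide how well the churners themselves are detected, especially if they are the minority class (an assumption here, since the class balance is not shown in this notebook). The sketch below uses hand-written labels, not the model's real predictions, to show how a confusion matrix, precision, and recall add to the picture.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hand-written toy labels: 3 churners (1) and 7 non-churners (0).
toy_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# Rows are true classes, columns are predicted classes, in label order [0, 1].
cm = confusion_matrix(toy_true, toy_pred)
acc = accuracy_score(toy_true, toy_pred)
prec = precision_score(toy_true, toy_pred)
rec = recall_score(toy_true, toy_pred)

print(cm)
# Accuracy is 0.7, yet only 1 of the 3 churners is caught (recall = 1/3).
print(acc, prec, rec)
```

On the real data, the same calls would take `y_test` and `y_pred` in place of the toy lists.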

In [97]:
from sklearn.model_selection import RandomizedSearchCV

param_grid = {
    "n_estimators": [50, 100, 200, 300],        # number of trees
    "max_depth": [5, 10, 20, None],             # tree depth
    "min_samples_split": [2, 5, 10],            # min samples to split
    "min_samples_leaf": [1, 2, 4],              # min samples in a leaf
    "max_features": ["sqrt", "log2", None],     # features considered at each split
}

model = RandomForestClassifier()
rnd_model = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_grid,
    n_iter=50,   # number of random parameter combinations to try
    cv=5,        # 5-fold cross-validation for each combination
    verbose=2,
)

rnd_model.fit(X_train, y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV] END max_depth=20, max_features=None, min_samples_leaf=2, min_samples_split=10, n_estimators=200; total time=   1.1s
[CV] END max_depth=20, max_features=None, min_samples_leaf=2, min_samples_split=10, n_estimators=200; total time=   1.2s
[CV] END max_depth=20, max_features=None, min_samples_leaf=2, min_samples_split=10, n_estimators=200; total time=   1.1s
[CV] END max_depth=20, max_features=None, min_samples_leaf=2, min_samples_split=10, n_estimators=200; total time=   1.1s
[CV] END max_depth=20, max_features=None, min_samples_leaf=2, min_samples_split=10, n_estimators=200; total time=   1.0s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=300; total time=   0.7s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=300; total time=   0.6s
[CV] END max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimato

In [98]:
rnd_model.best_params_

{'n_estimators': 200,
 'min_samples_split': 5,
 'min_samples_leaf': 4,
 'max_features': 'sqrt',
 'max_depth': 10}
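Besides `best_params_`, a fitted search object also exposes the mean cross-validated score of the winning combination (`best_score_`) and the model refitted on the full training set (`best_estimator_`). The sketch below shows these attributes on a small synthetic problem. The data, the reduced grid `toy_grid`, and the small `n_iter` are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (not the Telco data).
toy_X, toy_y = make_classification(n_samples=200, n_features=6, random_state=0)

# A deliberately small grid so the search finishes quickly.
toy_grid = {"n_estimators": [10, 20], "max_depth": [3, None]}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=toy_grid,
    n_iter=3,
    cv=3,
    random_state=0,  # makes the sampled combinations repeatable
)
search.fit(toy_X, toy_y)

print(search.best_params_)     # winning combination
print(search.best_score_)      # mean cross-validated accuracy of that combination
print(search.best_estimator_)  # forest refitted on all of toy_X with best_params_
```

Because `best_score_` is averaged over the folds, it is a useful cross-check against a score measured on a single test split.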

In [99]:
rnd_y_pred = rnd_model.predict(X_test)
rnd_score = accuracy_score(y_test, rnd_y_pred) * 100
rnd_score

80.55358410220013

Test accuracy has now increased to about 80.6% <br/>
The tuned model can now be saved to disk

In [100]:
import pickle

# Use a context manager so the file is closed (and fully written) after dumping
with open("Random_Forest_Telco_Churn_Model_1.pk1", "wb") as f:
    pickle.dump(rnd_model, f)

The saved model can be loaded later to make predictions without retraining
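Loading works the same way in reverse. The sketch below round-trips a small stand-in model through a temporary file to show the pattern; `toy_model`, `toy_path`, and the synthetic data are made up for the example. Two general cautions apply to pickled models: only unpickle files from a trusted source, and load them with the same scikit-learn version that was used to save them.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model trained on synthetic data (not the Telco model).
toy_X, toy_y = make_classification(n_samples=100, n_features=5, random_state=0)
toy_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(toy_X, toy_y)

# Save to a temporary location so the example leaves the working directory untouched.
toy_path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
with open(toy_path, "wb") as f:
    pickle.dump(toy_model, f)

# Later (for example in another script): load and predict without retraining.
with open(toy_path, "rb") as f:
    restored = pickle.load(f)

# The restored model should give the same predictions as the original.
same = bool((restored.predict(toy_X) == toy_model.predict(toy_X)).all())
print(same)
```

For the model saved in this notebook, the input passed to `predict` must have the same one-hot encoded columns, in the same order, as `X_train`.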