The task is pretty easy to understand. In brief, we have 21 columns: Churn is Y (the dependent variable), customerID is not needed, and the remaining 19 are X (the independent variables). We try three classification methods: random forest, logistic regression and gradient boosting. All the models have almost the same accuracy, with gradient boosting slightly ahead before any tuning. After fine-tuning the random forest using GridSearchCV, its accuracy increases to 80.24%. Since the hyperparameter values to search have to be chosen manually, I tried many possible combinations. The parameters with the highest accuracy, about 80 percent, are as follows:
param_grid = {
    'n_estimators': [100],
    'max_features': ['sqrt'],
    'max_depth': [10],
    'min_samples_split': [2],
    'min_samples_leaf': [2]
}
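After fine-tuning the model, we can see that the random forest's accuracy on the test set increases from 79.0 to 80.2 percent, which puts it ahead of gradient boosting (79.5 percent) and logistic regression (78.5 percent). That's why we choose random forest as the final model for training and prediction.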
I have saved the model to a .pkl file using joblib.
If the user wants to predict output using this model, simply load it with joblib.load and call predict on data preprocessed the same way as the training data.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, accuracy_score, roc_auc_score

df = pd.read_csv('CHURN_DATASET.csv')

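# Drop rows where TotalCharges is blank, then convert the column to float.
# .copy() avoids pandas' SettingWithCopyWarning on the assignment below.
df = df[df['TotalCharges'] != ' '].copy()
df['TotalCharges'] = df['TotalCharges'].astype(float)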

label_encoders = {}
for column in df.select_dtypes(include=['object']).columns:
    if column != 'customerID':
        le = LabelEncoder()
        df[column] = le.fit_transform(df[column])
        label_encoders[column] = le

X = df.drop(['customerID', 'Churn'], axis=1)
y = df['Churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


In [2]:
logreg = LogisticRegression(random_state=42)
logreg.fit(X_train, y_train)
y_pred_logreg = logreg.predict(X_test)
print("Logistic Regression:")
print(classification_report(y_test, y_pred_logreg))
print("Accuracy:", accuracy_score(y_test, y_pred_logreg))
print("ROC AUC:", roc_auc_score(y_test, y_pred_logreg))

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("\nRandom Forest:")
print(classification_report(y_test, y_pred_rf))
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
print("ROC AUC:", roc_auc_score(y_test, y_pred_rf))

gb = GradientBoostingClassifier(random_state=42)
gb.fit(X_train, y_train)
y_pred_gb = gb.predict(X_test)
print("\nGradient Boosting:")
print(classification_report(y_test, y_pred_gb))
print("Accuracy:", accuracy_score(y_test, y_pred_gb))
print("ROC AUC:", roc_auc_score(y_test, y_pred_gb))

Logistic Regression:
              precision    recall  f1-score   support

           0       0.83      0.89      0.86      1033
           1       0.62      0.49      0.55       374

    accuracy                           0.79      1407
   macro avg       0.73      0.69      0.70      1407
weighted avg       0.77      0.79      0.78      1407

Accuracy: 0.7853589196872779
ROC AUC: 0.6926311402850324

Random Forest:
              precision    recall  f1-score   support

           0       0.83      0.90      0.86      1033
           1       0.64      0.48      0.55       374

    accuracy                           0.79      1407
   macro avg       0.73      0.69      0.71      1407
weighted avg       0.78      0.79      0.78      1407

Accuracy: 0.7903340440653873
ROC AUC: 0.6917549735726376

Gradient Boosting:
              precision    recall  f1-score   support

           0       0.83      0.91      0.87      1033
           1       0.65      0.49      0.56       374

    accurac
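
A note on the ROC AUC numbers printed above: they are computed from the hard 0/1 predictions, and in that case the score reduces to the mean of the two per-class recalls (balanced accuracy), which is why it sits near 0.69. Below is a minimal sketch of the more usual approach, scoring the predicted probabilities instead. It runs on synthetic data from make_classification as a stand-in for the churn dataset, and the names X_demo, y_demo and clf are only for illustration, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data (assumption: not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.73, 0.27], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
hard_labels = clf.predict(X_te)

# AUC from hard labels: a single operating point, equal to balanced accuracy.
auc_from_labels = roc_auc_score(y_te, hard_labels)
balanced_acc = balanced_accuracy_score(y_te, hard_labels)

# AUC from the predicted probability of class 1: uses the full ranking.
auc_from_proba = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print("AUC from labels:       ", auc_from_labels)
print("Balanced accuracy:     ", balanced_acc)
print("AUC from probabilities:", auc_from_proba)
```

In the notebook itself, the equivalent change would be to pass something like logreg.predict_proba(X_test)[:, 1] to roc_auc_score, which would of course change the printed values.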

In [46]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100],
    'max_features': ['sqrt'],
    'max_depth': [10],
    'min_samples_split': [2],
    'min_samples_leaf': [2]
}

grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2, scoring='roc_auc')
grid_search.fit(X_train, y_train)

best_rf = grid_search.best_estimator_

Fitting 3 folds for each of 1 candidates, totalling 3 fits
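
The grid above holds only the final values, so GridSearchCV fits a single candidate. For reference, here is a rough sketch of what a search over several combinations could look like. The grid values in demo_grid are hypothetical (they are not the combinations I actually tried), and the data is synthetic so the snippet runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the churn data (assumption: not the real dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# Hypothetical search space: 2 x 2 x 2 = 8 candidate combinations.
demo_grid = {
    'n_estimators': [50, 100],
    'max_depth': [5, 10],
    'min_samples_leaf': [1, 2],
}

demo_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=demo_grid,
    cv=3,               # 3-fold cross-validation, as in the cell above
    scoring='roc_auc',  # same selection metric as in the cell above
)
demo_search.fit(X_demo, y_demo)

print("Candidates tried:", len(demo_search.cv_results_['params']))
print("Best parameters:", demo_search.best_params_)
print("Best mean CV ROC AUC:", demo_search.best_score_)
```

Note that with scoring='roc_auc' the search picks the combination with the best cross-validated ROC AUC, which is not necessarily the one with the best test accuracy.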


In [47]:
logreg = LogisticRegression(random_state=42, max_iter=1000)
logreg.fit(X_train, y_train)
y_pred_logreg = logreg.predict(X_test)
logreg_accuracy = accuracy_score(y_test, y_pred_logreg)
logreg_roc_auc = roc_auc_score(y_test, y_pred_logreg)

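# GridSearchCV (refit=True by default) has already refit best_estimator_ on X_train,
# so this call simply repeats that fit.
best_rf.fit(X_train, y_train)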
y_pred_rf = best_rf.predict(X_test)
rf_accuracy = accuracy_score(y_test, y_pred_rf)
rf_roc_auc = roc_auc_score(y_test, y_pred_rf)

gb = GradientBoostingClassifier(random_state=42)
gb.fit(X_train, y_train)
y_pred_gb = gb.predict(X_test)
gb_accuracy = accuracy_score(y_test, y_pred_gb)
gb_roc_auc = roc_auc_score(y_test, y_pred_gb)

print("Logistic Regression accuracy is : ",logreg_accuracy)
print("Random forest accuracy after finetuning is: ",rf_accuracy)
print("Gradient boosting accuracy is: ",gb_accuracy)


Logistic Regression accuracy is :  0.7853589196872779
Random forest accuracy after finetuning is:  0.8024164889836531
Gradient boosting accuracy is:  0.7953091684434968
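
The three accuracies above differ by less than two percentage points on a single 1407-row test split, so the ranking may depend on the split. One way to check this is to compare cross-validated scores. The sketch below is only an illustration under assumptions: it uses synthetic data rather than the churn dataset, and the names candidates and cv_summary are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the churn data (assumption: not the real dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)

# The same three model families, with the tuned random forest settings.
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(
        n_estimators=100, max_depth=10, min_samples_leaf=2, random_state=42
    ),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

cv_summary = {}
for name, model in candidates.items():
    # 5-fold cross-validated accuracy: mean and spread across folds.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the spread across folds is larger than the gap between the models, the choice between them is not well supported by accuracy alone.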


In [48]:
import joblib
joblib.dump(best_rf, 'RandomForest.pkl')

['RandomForest.pkl']
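
One thing to keep in mind when loading the model: it was trained on scaled, label-encoded features, but RandomForest.pkl contains only the classifier. New data has to go through the same encoders and scaler before calling predict. A possible way to handle this, sketched here on synthetic data, is to save the preprocessing objects together with the model. The file name churn_bundle.pkl and the dictionary layout are assumptions for illustration, not what the notebook does.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data (assumption: not the real dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

# Fit the scaler and the model the same way as in the notebook: scale, then train.
demo_scaler = StandardScaler().fit(X_demo)
demo_model = RandomForestClassifier(n_estimators=20, random_state=42)
demo_model.fit(demo_scaler.transform(X_demo), y_demo)

# Save model and scaler together (hypothetical file name, written to a temp folder).
bundle_path = os.path.join(tempfile.mkdtemp(), 'churn_bundle.pkl')
joblib.dump({'model': demo_model, 'scaler': demo_scaler}, bundle_path)

# Later: load the bundle and apply the same scaling before predicting.
loaded = joblib.load(bundle_path)
new_rows = X_demo[:3]  # pretend these are new, unscaled customer rows
predictions = loaded['model'].predict(loaded['scaler'].transform(new_rows))
print(predictions)
```

For the real notebook, the label_encoders dictionary from the first cell would need to be saved in the same way, since raw categorical columns must be encoded with the fitted encoders before scaling.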