
# Customer Churn Prediction by Naima Tanveer


Develop a model to predict customer churn for a subscription-based service or business. Use historical customer data, including features like usage behavior and customer demographics, and try algorithms like Logistic Regression, Random Forests, or Gradient Boosting to predict churn.

# Importing Libraries

In [20]:
import pandas as pd
import numpy as np
from transformers import AutoModel, BertTokenizerFast
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression


In [21]:
# Load the dataset
data = pd.read_csv("/kaggle/input/bank-customer-churn-prediction/Churn_Modelling.csv")
data.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [22]:
# Load the pretrained BERT-base model (loaded here but not used in the cells below; only the tokenizer is applied)
bert = AutoModel.from_pretrained('bert-base-uncased')

# Load the BERT tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

In [23]:
# Encode 'Geography' and 'Gender' as BERT tokenizer vocabulary IDs (token IDs, not embeddings)
data['Geography_encoded'] = data['Geography'].apply(lambda x: tokenizer.encode(x, add_special_tokens=False))
data['Gender_encoded'] = data['Gender'].apply(lambda x: tokenizer.encode(x, add_special_tokens=False))


In the above code, the 'Geography' and 'Gender' columns in the 'data' DataFrame are converted to numbers with the BERT tokenizer. Note that 'tokenizer.encode' returns token IDs, which are integer indices into the tokenizer's vocabulary, not BERT embeddings: the 'bert' model itself is never called. With 'add_special_tokens=False', the [CLS] and [SEP] markers are omitted, so each value maps to a short list of IDs (for example, 'France' becomes [2605]). These IDs act as arbitrary category labels that the models below receive as numeric features.
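Because vocabulary IDs are arbitrary integers, a model may read an ordering or distance into them that has no meaning. One-hot encoding is a common alternative for low-cardinality categorical columns. The following is a minimal sketch on a made-up three-row frame, not the notebook's method; the names `toy` and `encoded` and the row values are assumptions for illustration.

```python
import pandas as pd

# Toy frame for illustration only; the values are made up and not the real dataset
toy = pd.DataFrame({
    "CreditScore": [619, 608, 502],
    "Geography": ["France", "Spain", "France"],
    "Gender": ["Female", "Female", "Male"],
})

# One indicator column per category; dtype=int yields 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=["Geography", "Gender"], dtype=int)
print(encoded)
```

Each category gets its own 0/1 column, so no category is treated as numerically larger or closer to another.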

In [24]:
data.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_encoded,Gender_encoded
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1,[2605],[2931]
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0,[3577],[2931]
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1,[2605],[2931]
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0,[2605],[2931]
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0,[3577],[2931]


In [25]:
# Split the data into features and target
X = data[['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary',
          'Geography_encoded', 'Gender_encoded']]
y = data['Exited']

In [26]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

The above code splits the data into training and testing sets, with 80% of the data used for training (X_train and y_train) and 20% for testing (X_test and y_test). Training on one portion and evaluating on the other shows how well a model generalizes to unseen customers; a large gap between training and test performance is a sign of overfitting. The "test_size" parameter specifies the proportion of data allocated for testing, and "random_state" fixes the shuffle so the split is reproducible.
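When the target classes are imbalanced, as churn labels usually are, passing `stratify` to `train_test_split` keeps the class proportions the same in both splits. The sketch below uses synthetic stand-in data (the arrays `X_toy` and `y_toy` are assumptions, not the notebook's variables) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 20% positive class (illustrative values)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

# stratify=y_toy preserves the 80/20 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("positive rate, train:", y_tr.mean())
print("positive rate, test: ", y_te.mean())
```

In the notebook's own split this would correspond to adding `stratify=y` to the existing call.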

In [27]:
# Keep the first token ID from each encoded list so the columns hold plain integers
X_train = X_train.copy()  # copy to avoid pandas SettingWithCopyWarning on a slice
X_test = X_test.copy()
for col in ['Geography_encoded', 'Gender_encoded']:
    X_train[col] = X_train[col].apply(lambda x: x[0])
    X_test[col] = X_test[col].apply(lambda x: x[0])

# Standardize all features, including the token-ID columns (fit on the training set only, then apply to the test set)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


In [29]:
# Logistic Regression
logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)

# Random Forest
random_forest = RandomForestClassifier()
random_forest.fit(X_train, y_train)

# Gradient Boosting
gradient_boosting = GradientBoostingClassifier()
gradient_boosting.fit(X_train, y_train)


In [31]:
# Logistic Regression Evaluation
logistic_regression_predictions = logistic_regression.predict(X_test)
print("Logistic Regression Classification Report:")
print(classification_report(y_test, logistic_regression_predictions))

# Random Forest Evaluation
random_forest_predictions = random_forest.predict(X_test)
print("Random Forest Classification Report:")
print(classification_report(y_test, random_forest_predictions))

# Gradient Boosting Evaluation
gradient_boosting_predictions = gradient_boosting.predict(X_test)
print("Gradient Boosting Classification Report:")
print(classification_report(y_test, gradient_boosting_predictions))


Logistic Regression Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.97      0.89      1607
           1       0.60      0.17      0.27       393

    accuracy                           0.81      2000
   macro avg       0.71      0.57      0.58      2000
weighted avg       0.78      0.81      0.77      2000

Random Forest Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.97      0.92      1607
           1       0.76      0.46      0.57       393

    accuracy                           0.87      2000
   macro avg       0.82      0.71      0.75      2000
weighted avg       0.86      0.87      0.85      2000

Gradient Boosting Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.96      0.92      1607
           1       0.75      0.47      0.58       393

    accuracy                           0.87      2000
   macr

Conclusion: The Gradient Boosting model performs similarly to the Random Forest model. Both reach 0.87 accuracy and handle class 0 (customers who stayed) well, with precision of 0.88 and recall of 0.96 to 0.97. For class 1 (customers who churned), precision is reasonable (0.75 to 0.76) but recall is low (0.46 to 0.47), so more than half of the customers who actually churned are missed.

In summary, the Random Forest and Gradient Boosting models outperform the Logistic Regression model, most clearly on the churn class, where recall rises from 0.17 to about 0.46 and the F1-score from 0.27 to about 0.57. Since only 393 of the 2,000 test customers churned, accuracy alone overstates performance, and class 1 recall and F1-score are the more informative figures. The choice between Random Forest and Gradient Boosting would depend on specific application requirements and the importance of different evaluation metrics.
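To put the reported accuracies in context, the class supports from the reports above give the accuracy of a trivial model that always predicts class 0, and the class 1 F1-scores can be rechecked from precision and recall. This is a small arithmetic sketch; the helper `f1` and the dictionary names are introduced here for illustration, and because the inputs are already rounded to two decimals, a recomputed F1 can differ from the printed report by 0.01.

```python
# Figures copied from the classification reports above
support = {0: 1607, 1: 393}
churn_scores = {  # (precision, recall) for class 1
    "Logistic Regression": (0.60, 0.17),
    "Random Forest": (0.76, 0.46),
    "Gradient Boosting": (0.75, 0.47),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Accuracy of always predicting the majority class
baseline_accuracy = max(support.values()) / sum(support.values())
print(f"Always-predict-0 accuracy: {baseline_accuracy:.4f}")

for name, (p, r) in churn_scores.items():
    print(f"{name}: class-1 F1 ~ {f1(p, r):.2f}")
```

The majority-class baseline is about 0.80, so Logistic Regression's 0.81 accuracy adds very little over predicting that nobody churns.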
