<h2 style="color:red">Description</h2>

The bank customer churn dataset is a commonly used dataset for predicting customer churn in the banking industry. It contains information on bank customers who either left the bank or continue to be a customer. The dataset includes the following attributes:

1. `id`: A unique identifier for a record
2. `CustomerId`: A unique identifier for each customer
3. `Surname`: The customer's surname or last name
4. `CreditScore`: A numerical value representing the customer's credit score
5. `Geography`: The country where the customer resides (France, Spain or Germany)
6. `Gender`: The customer's gender (Male or Female)
7. `Age`: The customer's age
8. `Tenure`: The number of years the customer has been with the bank
9. `Balance`: The customer's account balance
10. `NumOfProducts`: The number of bank products the customer uses (e.g., savings account, credit card)
11. `HasCrCard`: Whether the customer has a credit card (1=yes, 0=no)
12. `IsActiveMember`: Whether the customer is an active member (1=yes, 0=no)
13. `EstimatedSalary`: The estimated salary of the customer
14. `Exited`: Whether the customer has churned (1=yes, 0=no)

<h2 style="color:red">Task</h2>

Predict whether a customer continues with their account or closes it (i.e., churns).

<h2 style="color:red">Evaluation Metric</h2>

Submissions are evaluated on `area under the ROC curve` between the predicted probability and the observed target.
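
As a quick illustration of the metric, the sketch below computes the ROC AUC for four made-up customers. The labels and probabilities are illustrative values, not taken from the competition data.

```python
from sklearn.metrics import roc_auc_score

# Toy labels (1 = churned) and predicted churn probabilities (illustrative values only)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC equals the share of (churner, non-churner) pairs in which
# the churner receives the higher predicted probability
auc = roc_auc_score(y_true, y_score)
print(f"ROC AUC: {auc:.2f}")
```

Three of the four churner/non-churner pairs are ranked correctly here, so the score is 0.75. Because the metric depends only on the ranking, it should be computed from probabilities rather than from hard 0/1 labels.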

In [None]:
# import required libraries
import os
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_rows", None)

from imblearn.over_sampling import SMOTE
# from sklearn.feature_selection import RFE

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, roc_curve

# to suppress the warnings
from warnings import filterwarnings
filterwarnings('ignore')

In [None]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# import the train & test dataset & create respective dataframes
churn_train_df = pd.read_csv("/kaggle/input/playground-series-s4e1/train.csv")
churn_test_df = pd.read_csv("/kaggle/input/playground-series-s4e1/test.csv")

In [None]:
# Top 5 records of train dataset
churn_train_df.head()

In [None]:
# Top 5 records of test dataset
churn_test_df.head()

In [None]:
# Shape of training dataset
churn_train_df.shape

In [None]:
# Shape of test dataset
churn_test_df.shape

In [None]:
# Training data info
churn_train_df.info()

In [None]:
# Test data info
churn_test_df.info()

<h2 style="color:red">Data Cleaning</h2>

<h3 style="color:red">1. Handle Missing Values</h3>

In [None]:
# Check missing values in train data
churn_train_df.isnull().sum()

<h4 style="color:red">We can see there are no missing values in the training data. Hence missing value treatment is not required.</h4>

In [None]:
# Check missing values in test data
churn_test_df.isnull().sum()

<h3 style="color:red">2. Handle Duplicate Values</h3>

In [None]:
# Check duplicate values in training data
churn_train_df[churn_train_df.duplicated()]

<h4 style="color:red">There are no duplicate values in the training dataset. Hence duplicate value treatment is not required.</h4>
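
The check above displays fully duplicated rows. A minimal sketch on a made-up frame shows how the count of duplicates could be obtained directly, and how duplicates on a single key column differ from fully duplicated rows. The toy values are assumptions for illustration.

```python
import pandas as pd

# Toy frame: row 2 repeats row 1 exactly, row 3 repeats only the CustomerId of row 0
toy = pd.DataFrame({
    "CustomerId": [1, 2, 2, 1],
    "Age": [30, 41, 41, 35],
})

# Rows that are identical to an earlier row in every column
n_full_duplicates = int(toy.duplicated().sum())

# Rows whose CustomerId already appeared, regardless of the other columns
n_id_duplicates = int(toy.duplicated(subset=["CustomerId"]).sum())

print(n_full_duplicates, n_id_duplicates)
```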

In [None]:
# Check duplicate values in test data
churn_test_df[churn_test_df.duplicated()]

<h2 style="color:red">Data Exploration</h2>

In [None]:
# Check the count of unique values in target variable of training data
churn_train_df['Exited'].value_counts()

In [None]:
sns.countplot(x='Exited',data=churn_train_df,palette='hls')
plt.show()

In [None]:
# Calculate the percentage of churned & non-churned customers
count_non_churn = len(churn_train_df[churn_train_df['Exited']==0])
count_churn = len(churn_train_df[churn_train_df['Exited']==1])

print("Percentage of non-churn customers: ", (count_non_churn/(count_non_churn+count_churn))*100)
print("Percentage of churned customers: ", (count_churn/(count_non_churn+count_churn))*100)

<h4 style="color:red">The dataset is imbalanced. The ratio of non-churn to churned customers is roughly 78:21.</h4>
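
The same class percentages can be obtained in one step with `value_counts(normalize=True)`. The sketch below uses a made-up five-row frame, so the numbers are illustrative only.

```python
import pandas as pd

# Toy target column: four non-churn customers and one churned customer
toy = pd.DataFrame({"Exited": [0, 0, 0, 0, 1]})

# normalize=True returns class shares instead of raw counts
class_share = toy["Exited"].value_counts(normalize=True) * 100
print(class_share.round(1))
```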

In [None]:
numerical_features = churn_train_df.select_dtypes(include=["int64","float64"]).columns.tolist()

# Create a dataframe with only numerical features
numerical_churn_df = pd.DataFrame(churn_train_df, columns=numerical_features)

In [None]:
numerical_churn_df.head()

In [None]:
# Group the average values of the features based on the churn & non-churn customers
numerical_churn_df.groupby('Exited').mean()

<h4 style="color:red">Observations:</h4>

- The average credit score of churned customers is lower than that of the non-churn customers.
- The average age of churned customers is higher than that of the non-churn customers.
- The average tenure of churned customers is lower than that of the non-churn customers.
- The average account balance of the churned customers is higher than that of the non-churn customers.
- The average number of bank products the churned customers use is lower than that of the non-churn customers.
- The share of customers holding a credit card is lower among churned customers than among non-churn customers.
- On average, 30% of the churned customers & 55% of the non-churn customers are active members. That means inactive customers tend to churn more.
- The average estimated salary of the churned customers is more than that of the non-churn customers.

In [None]:
# List categorical features
categorical_features = churn_train_df.select_dtypes(include=['object']).columns.tolist()
categorical_features

In [None]:
# drop 'id','CustomerId','Surname','Gender' columns from the main dataframe & create another dataframe for further analysis
df1 = churn_train_df.drop(columns=['id','CustomerId','Surname','Gender'], axis=1)
df1.head()

In [None]:
# Group the average values of the features based on 'Geography'
df1.groupby('Geography').mean()

<h4 style='color:red'>Observation:</h4>

- On average, the churn rate is highest in Germany.
- The average credit score of the customers is highest in Spain & lowest in France.
- The average age of customers in Germany is higher than in France & Spain.
- The average customer account balance in Germany is much higher than in France & Spain.
- The average estimated salary of customers in Germany is higher than in France & Spain.
- The share of customers who are active members is lowest in Germany.

In [None]:
# drop 'id','CustomerId','Surname','Geography' columns from the main dataframe & create another dataframe for further analysis
df2 = churn_train_df.drop(columns=['id','CustomerId','Surname','Geography'], axis=1)
df2.head()

In [None]:
# Group the average values of the features based on 'Gender'
df2.groupby('Gender').mean()

<h4 style='color:red'>Observations:</h4>

- The average credit score of male customers is slightly higher than that of female customers.
- The average age of male customers is lower than that of female customers.
- The average tenure of male customers is higher than that of female customers.
- The average account balance of male customers is lower than that of female customers.
- The average number of bank products used by male customers is higher than that of female customers.
- On average, a larger share of male customers than female customers are active members.
- The average estimated salary of male customers is lower than that of female customers.
- Female customers churned more than male customers.

In [None]:
# drop 'id','CustomerId','Surname' columns from the main dataframe & create another dataframe for further analysis
df3 = churn_train_df.drop(columns=['id','CustomerId','Surname'], axis=1)
df3.head()

In [None]:
# Group the average values of the features based on 'Geography' & 'Gender'
df3.groupby(['Geography','Gender']).mean()

<h4 style='color:red'>Observations:</h4>

- In all 3 countries, female customers churned more than male customers.
- In all 3 countries, the average estimated salary of female customers is higher than that of male customers.
- In all 3 countries, a larger share of male customers than female customers are active members.
- In all 3 countries, on average, a larger share of male customers than female customers have a credit card.
- In all 3 countries, on average, male customers use more bank products than female customers.
- In all 3 countries, the average age of female customers is higher than that of male customers.
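
To read the churn rate by country and gender at a glance, the grouped means can also be arranged as a two-way table. This is a minimal sketch on made-up rows; the values are assumptions and do not reproduce the figures above.

```python
import pandas as pd

# Toy data: one female and one male customer per country (illustrative values only)
toy = pd.DataFrame({
    "Geography": ["France", "France", "Germany", "Germany", "Spain", "Spain"],
    "Gender": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "Exited": [1, 0, 1, 1, 0, 0],
})

# The mean of the 0/1 target is the churn rate within each Geography x Gender cell
churn_table = toy.pivot_table(index="Geography", columns="Gender",
                              values="Exited", aggfunc="mean")
print(churn_table)
```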

<h2 style='color:red'>Data Visualization</h2>

In [None]:
pd.crosstab(df3.Geography,df3.Exited).plot(kind='bar')
plt.title('Customer churn based on geography')
plt.xlabel('Geography')
plt.ylabel('Customer Churn')
plt.show()

In [None]:
pd.crosstab(df3.Gender,df3.Exited).plot(kind='bar')
plt.title('Customer churn based on gender')
plt.xlabel('Gender')
plt.ylabel('Customer Churn')
plt.show()

<h4 style='color:red'>From the plot we can see that male customers churn less than female customers. Thus 'Gender' can be a good predictor of churn.</h4>

In [None]:
pd.crosstab(df3.NumOfProducts,df3.Exited).plot(kind='bar')
plt.xlabel('Number of bank products used')
plt.ylabel('Customer Churn')
plt.show()

<h4 style='color:red'>From the plot we can see that customers who use 2 bank products are less likely to churn, while customers who use more than 2 products are more likely to churn. Thus 'NumOfProducts' can be a good predictor of churn.</h4>
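
The bar chart shows raw counts, which can hide the churn rate in small groups. A minimal sketch, on made-up rows, of how the rate per product count could be computed with a row-normalized crosstab:

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "NumOfProducts": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    "Exited":        [1, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

# normalize="index" turns each row of counts into shares that sum to 1;
# column 1 is then the churn rate for that number of products
churn_rate = pd.crosstab(toy["NumOfProducts"], toy["Exited"], normalize="index")[1]
print(churn_rate)
```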

In [None]:
pd.crosstab(df3.HasCrCard,df3.Exited).plot(kind='bar')
plt.xlabel('Has Credit Card?')
plt.ylabel('Customer Churn')
plt.show()

In [None]:
pd.crosstab(df3.IsActiveMember,df3.Exited).plot(kind='bar')
plt.xlabel('Active Member?')
plt.ylabel('Customer Churn')
plt.show()

<h4 style='color:red'>From the plot we can see that the churn rate is lower among active customers than among inactive customers. Thus 'IsActiveMember' can be a good predictor of churn.</h4>

In [None]:
plt.figure(figsize=(8,6))
sns.histplot(df3,x='Tenure',hue='Exited', multiple='stack')
plt.show()

In [None]:
plt.figure(figsize=(10,12))
sns.histplot(df3,x='CreditScore',hue='Exited', multiple='stack')
plt.show()

In [None]:
plt.figure(figsize=(8,6))
sns.histplot(df3,x='Age',hue='Exited', multiple='stack')
plt.show()

<h4 style='color:red'>From the plot we can see that the churn rate increases once customers are older than about 45. Thus 'Age' can be a good predictor of churn.</h4>
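
One way to check the age effect numerically is to bin the ages and compare the churn rate per bin. The bin edges, labels and data below are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "Age":    [25, 30, 38, 44, 47, 52, 58, 63],
    "Exited": [0,  0,  0,  1,  0,  1,  1,  1],
})

# Hypothetical age bands; pd.cut uses right-closed intervals, e.g. (35, 45]
bins = [18, 35, 45, 55, 100]
labels = ["18-35", "36-45", "46-55", "56+"]
toy["AgeBand"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# The mean of the 0/1 target within each band is that band's churn rate
rate_by_band = toy.groupby("AgeBand", observed=True)["Exited"].mean()
print(rate_by_band)
```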

In [None]:
plt.figure(figsize=(8,6))
sns.histplot(df3,x='Balance',hue='Exited', multiple='stack')
plt.show()

In [None]:
plt.figure(figsize=(8,6))
sns.histplot(df3,x='EstimatedSalary',hue='Exited', multiple='stack')
plt.show()

In [None]:
df3.CreditScore.hist()
plt.title('Histogram of Credit Score')
plt.xlabel('CreditScore')
plt.ylabel('Frequency')
plt.show()

<h4 style='color:red'>Most of the customers of the bank in the dataset have a credit score in the range of 650-700.</h4>

In [None]:
df3.Age.hist()
plt.title('Histogram of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

<h4 style='color:red'>Most of the customers of the bank in this dataset are in the age range of 32.5-40.</h4>

<h3 style="color:red">Drop features that are not helpful for model prediction</h3>

In [None]:
# Separate features and labels of training data
X = churn_train_df.drop("Exited", axis=1)
y = churn_train_df["Exited"]

In [None]:
# define a function to drop a list of columns from the train dataframe
def drop_column(df, col_list):
    for col in col_list:
        df.drop(col, axis=1, inplace=True)
        print(f"{col} has been dropped from the dataframe")

In [None]:
# List columns to be dropped
cols = ['id','CustomerId','Surname']

In [None]:
# Calling drop_column()
drop_column(X, cols)

<h2 style='color:red'>Split data into train & validation set</h2>

In [None]:
# Split the training data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# drop 'id','CustomerId','Surname' columns from test data & create X_test
X_test = churn_test_df.drop(['id','CustomerId','Surname'], axis=1)

In [None]:
# keeping values of id column of test dataset for later use (for file submission)
id_test = churn_test_df['id']

<h2 style="color:red">Data Preprocessing</h2>

<h3 style="color:blue">Encode Categorical Variables into numerical format</h3>

In [None]:
# List the categorical features using the data frame 'X'
cat_features = X.select_dtypes(include=['object']).columns.tolist()
cat_features

In [None]:
# define a function to one-hot encode a categorical feature & return the encoded columns as 0/1 integers
def encode_categorical(df, feature_name):
    # One-hot encoding creates one column per value of the original categorical feature.
    # pandas returns these columns as booleans by default, so dtype=int is used
    # to get 0/1 integers directly instead of converting True/False afterwards.
    df_encoded = pd.get_dummies(df, columns=[feature_name],
                                prefix="Encoded_"+feature_name[0:3], dtype=int)
    print(f"Encoded dataframe for {feature_name} has been created.")

    # list encoded columns
    encoded_cols = df_encoded.filter(like='Encoded_').columns.tolist()
    print(f"Encoded columns are: {encoded_cols}")

    return df_encoded

In [None]:
# Call 'encode_categorical' function to encode "Geography" & 'Gender' features of training dataframe
X_train_encoded = encode_categorical(X_train, 'Geography')
X_train_encoded = encode_categorical(X_train_encoded, 'Gender')
X_train_encoded.head()

In [None]:
# Call 'encode_categorical' function to encode "Geography" & 'Gender' features of validation dataframe
X_val_encoded = encode_categorical(X_val, 'Geography')
X_val_encoded = encode_categorical(X_val_encoded, 'Gender')
X_val_encoded.head()

In [None]:
# Call 'encode_categorical' function to encode "Geography" & 'Gender' features of test dataframe
X_test_encoded = encode_categorical(X_test, 'Geography')
X_test_encoded = encode_categorical(X_test_encoded, 'Gender')
X_test_encoded.head()
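
Encoding each split separately with `pd.get_dummies` yields identical columns only if every split contains every category. That holds for three countries and two genders in data of this size, but it is not guaranteed in general. A minimal sketch, on made-up frames, of how the encoded test columns could be aligned to the training columns:

```python
import pandas as pd

# Toy splits: the test split happens to contain only one of the three countries
train = pd.DataFrame({"Geography": ["France", "Spain", "Germany"], "Age": [30, 40, 50]})
test = pd.DataFrame({"Geography": ["France", "France"], "Age": [35, 45]})

train_enc = pd.get_dummies(train, columns=["Geography"], prefix="Encoded_Geo", dtype=int)
test_enc = pd.get_dummies(test, columns=["Geography"], prefix="Encoded_Geo", dtype=int)

# Reorder the test columns to match training and fill unseen categories with 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc)
```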

<h3 style='color:blue'>Feature Scaling</h3>

In [None]:
# Use StandardScaler to scale numerical features

# Identify numerical features
num_features = ['CreditScore','Age','Tenure','Balance','NumOfProducts','EstimatedSalary']

# Standardize numerical features of 'X_train_encoded', 'X_val_encoded' & 'X_test_encoded'
scaler = StandardScaler()

X_train_encoded[num_features] = scaler.fit_transform(X_train_encoded[num_features])
X_val_encoded[num_features] = scaler.transform(X_val_encoded[num_features])
X_test_encoded[num_features] = scaler.transform(X_test_encoded[num_features])
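
The scaler is fitted on the training split only and then reused for the other splits, so no information from validation or test rows leaks into the scaling. A small sketch with made-up numbers shows the effect; the variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature splits (illustrative values only)
X_tr = np.array([[10.0], [20.0], [30.0]])
X_va = np.array([[40.0]])

# Learn mean and standard deviation from the training split only
toy_scaler = StandardScaler().fit(X_tr)

X_tr_scaled = toy_scaler.transform(X_tr)
# The validation row is scaled with the training statistics,
# so it is not forced to have mean 0 itself
X_va_scaled = toy_scaler.transform(X_va)

print(X_tr_scaled.ravel(), X_va_scaled.ravel())
```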

In [None]:
X_train_encoded.head()

In [None]:
X_val_encoded.head()

<h3 style='color:blue'>Handle data imbalance with SMOTE</h3>

In [None]:
# oversampling minority class in train data using Synthetic Minority Oversampling Technique (SMOTE)
smote_samp = SMOTE(sampling_strategy='minority',random_state=42)

In [None]:
columns = X_train_encoded.columns

# Fit SMOTE on the training data & generate the resampled (balanced) training set
X_train_resampled,y_train_resampled = smote_samp.fit_resample(X_train_encoded, y_train)

In [None]:
X_train_resampled.head()

In [None]:
y_train_resampled_df = pd.DataFrame(y_train_resampled,columns=['Exited'])

In [None]:
print("length of oversampled data is ",len(X_train_resampled))
print("Number of non churn records in oversampled data",len(y_train_resampled_df[y_train_resampled_df['Exited']==0]))
print("Number of churn records in oversampled data",len(y_train_resampled_df[y_train_resampled_df['Exited']==1]))
print("Proportion of non churn records in oversampled data is ",len(y_train_resampled_df[y_train_resampled_df['Exited']==0])/len(X_train_resampled))
print("Proportion of churn records in oversampled data is ",len(y_train_resampled_df[y_train_resampled_df['Exited']==1])/len(X_train_resampled))
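
To give an intuition for what SMOTE does, the sketch below creates synthetic minority rows by interpolating between a minority row and its nearest minority neighbour. This is a simplified illustration and not the imblearn implementation: the real algorithm picks at random among several nearest neighbours. The function name `smote_like` and the data are hypothetical.

```python
import numpy as np

def smote_like(X_min, n_new, rng):
    """Create n_new synthetic rows by moving a randomly chosen minority row
    part of the way towards its nearest minority neighbour (1-NN simplification)."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        dists[i] = np.inf                      # exclude the row itself
        j = int(np.argmin(dists))              # nearest other minority row
        gap = rng.random()                     # position along the segment, in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

rng = np.random.default_rng(42)
X_minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0]])
X_new = smote_like(X_minority, n_new=4, rng=rng)
print(X_new)
```

Every synthetic row lies on a line segment between two existing minority rows, which is why the resampled data stays inside the region already covered by the minority class.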

<h2 style='color:red'>Base Decision Tree Classifier Model</h2>

In [None]:
# Create DecisionTreeClassifier model instance
base_model = DecisionTreeClassifier(criterion='entropy',
                                    max_depth=5,
                                    min_samples_leaf=4,
                                    min_samples_split=2, 
                                    random_state=42)

<h2 style='color:red'>AdaBoost Classifier Model</h2>

In [None]:
adaboost_model = AdaBoostClassifier(base_model, random_state=42)

In [None]:
# Fit the AdaBoost model (with the decision tree as its base estimator)
adaboost_model.fit(X_train_resampled,y_train_resampled)

<h2 style='color:red'>Evaluate the baseline AdaBoost model</h2>

In [None]:
# predict on validation dataset
y_val_pred = adaboost_model.predict(X_val_encoded)
y_val_pred_proba = adaboost_model.predict_proba(X_val_encoded)[:,1]

In [None]:
print('Accuracy of AdaBoost classifier on validation set: {:.2f}'.format(adaboost_model.score(X_val_encoded, y_val)))

In [None]:
# Confusion Matrix
conf_matrix = confusion_matrix(y_val, y_val_pred)
print("Confusion Matrix:\n", conf_matrix)

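The result tells us that we have 24308+4052=28360 correct predictions (the diagonal entries of the matrix). The validation set has 33007 rows, so the remaining 33007-28360=4647 predictions, the two off-diagonal entries, are incorrect.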
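
The counts of correct and incorrect predictions come from the diagonal and off-diagonal entries of the confusion matrix. A minimal sketch on made-up labels shows how the four entries map to accuracy, precision and recall:

```python
from sklearn.metrics import confusion_matrix

# Toy labels and predictions (illustrative values only)
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# For binary labels scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tn + fp + fn + tp)   # diagonal / total
precision = tp / (tp + fp)                   # predicted churners that really churned
recall = tp / (tp + fn)                      # actual churners that were found

print(tn, fp, fn, tp, accuracy, precision, recall)
```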

In [None]:
# Precision, recall, F-measure, support - Classification Report
class_report = classification_report(y_val, y_val_pred)
print("Classification Report:\n",class_report)

In [None]:
# ROC AUC Score
roc_auc = roc_auc_score(y_val, y_val_pred_proba)
print("ROC AUC Score:", roc_auc)

In [None]:
# Plot ROC Curve
fpr, tpr, _ = roc_curve(y_val, y_val_pred_proba)
plt.figure(figsize=(8,8))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()

<h2 style='color:red'>Hyperparameter Tuning</h2>

In [None]:
# Hyperparameter tuning using GridSearchCV
param_grid = {'n_estimators': [50, 100, 200],
              'learning_rate': [0.01, 0.1, 0.2]}

grid_search = GridSearchCV(adaboost_model, param_grid, cv=5, scoring='roc_auc')
grid_search.fit(X_train_resampled, y_train_resampled)

In [None]:
# Best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

In [None]:
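# Fit the model with the best hyperparameters
# Note: no base estimator is passed here, so AdaBoost uses its default
# depth-1 decision tree rather than the 'base_model' used during the grid search
final_adabc_model = AdaBoostClassifier(**best_params, random_state=42)
final_adabc_model.fit(X_train_resampled, y_train_resampled)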
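
An alternative to rebuilding the classifier from `best_params_` is to reuse the refitted `best_estimator_` of the grid search, which keeps the base tree that was actually tuned. The sketch below shows this on synthetic data; the data, the small grid and the variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the resampled training data
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=42)

# Shallow tree as base estimator (passed positionally, as in the notebook)
small_tree = DecisionTreeClassifier(max_depth=2, random_state=42)

search = GridSearchCV(
    AdaBoostClassifier(small_tree, random_state=42),
    param_grid={"n_estimators": [10, 20], "learning_rate": [0.1, 0.5]},
    cv=3,
    scoring="roc_auc",
)
search.fit(X_toy, y_toy)

# With the default refit=True, best_estimator_ is already retrained on all the
# data passed to fit(), using the best parameters and the same base estimator
tuned_model = search.best_estimator_
print(search.best_params_)
```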

<h2 style='color:red'>Evaluate the best-parameter model</h2>

In [None]:
# predict on validation dataset
y_val_pred_new = final_adabc_model.predict(X_val_encoded)
y_val_pred_new_proba = final_adabc_model.predict_proba(X_val_encoded)[:,1]

In [None]:
print('Accuracy of tuned AdaBoost classifier on validation set: {:.2f}'.format(final_adabc_model.score(X_val_encoded, y_val)))

In [None]:
# Confusion Matrix
conf_matrix_new = confusion_matrix(y_val, y_val_pred_new)
print("Confusion Matrix:\n",conf_matrix_new)

The result tells us that we have 20664+5521=26185 correct predictions and 1434+5388=6822 incorrect predictions.

In [None]:
# Precision, recall, F-measure, support - Classification Report
class_report_new = classification_report(y_val, y_val_pred_new)
print("Classification Report:\n", class_report_new)

In [None]:
# ROC AUC Score
roc_auc_new = roc_auc_score(y_val, y_val_pred_new_proba)
print("ROC AUC Score:", roc_auc_new)

In [None]:
# Plot ROC Curve
fpr, tpr, _ = roc_curve(y_val, y_val_pred_new_proba)
plt.figure(figsize=(8,8))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc_new)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()

<h4 style='color:red'>There is no significant improvement in the model performance after hyperparameter tuning (the confusion matrices show 26185 correct predictions after tuning versus 28360 before). Nevertheless, we will use this newly created model for prediction.</h4>

<h2 style='color:red'>Prediction on Test Data</h2>

In [None]:
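# predict churn probabilities on test dataset (the evaluation metric, ROC AUC, is computed on predicted probabilities)
y_pred = final_adabc_model.predict_proba(X_test_encoded)[:,1]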

In [None]:
# Combine Predictions with IDs for the expected output
output_df = pd.DataFrame({'id': id_test, 'Exited': y_pred})
output_df.head()

In [None]:
# Shape of output file
output_df.shape

In [None]:
# Save the predictions to a CSV file
output_df.to_csv('adabc_predictions_1.csv', index=False)
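
Before uploading, it can help to sanity-check the submission frame. The sketch below writes a made-up submission to an in-memory buffer instead of a file; the ids and probabilities are hypothetical.

```python
import io
import pandas as pd

# Hypothetical ids and predicted churn probabilities
ids = pd.Series([165034, 165035, 165036], name="id")
proba = [0.12, 0.83, 0.40]

submission = pd.DataFrame({"id": ids, "Exited": proba})

# Probabilities should lie in [0, 1] and there should be one row per id
values_in_range = bool(submission["Exited"].between(0, 1).all())
ids_unique = bool(submission["id"].is_unique)

# Write to an in-memory buffer to inspect the CSV layout without touching disk
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines[0])
```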