# Project: Predicting Customer Churn Using Scikit-learn MLP Classifier
This project analyzes customer transaction data to predict customer churn using a Scikit-learn MLP Classifier. The workflow includes data preprocessing, feature encoding, building a model pipeline, and evaluating model performance.

The model is Scikit-learn's `MLPClassifier`. An MLP, or **Multi-Layer Perceptron**, is a **feedforward neural network** with one or more hidden layers that apply nonlinear activation functions such as **ReLU**. In this project, the pipeline trains the network to predict customer churn, a typical supervised learning task on tabular data.
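To make the feedforward idea concrete, here is a minimal NumPy sketch of a forward pass through a network with two ReLU hidden layers of 128 and 64 units (the sizes used later in this notebook) and a single sigmoid output for the churn probability. The weights are random, the input width of 8 is an arbitrary choice, and `mlp_forward` is a hypothetical helper for illustration; it is not how `MLPClassifier` is implemented internally.

```python
import numpy as np


def relu(z):
    # ReLU keeps positive values and zeroes out negatives.
    return np.maximum(z, 0.0)


def sigmoid(z):
    # Squashes a real number into the (0, 1) range.
    return 1.0 / (1.0 + np.exp(-z))


def mlp_forward(X, weights, biases):
    """Forward pass: ReLU on hidden layers, sigmoid on the single output unit."""
    activation = X
    for W, b in zip(weights[:-1], biases[:-1]):
        activation = relu(activation @ W + b)
    logits = activation @ weights[-1] + biases[-1]
    return sigmoid(logits).ravel()


rng = np.random.default_rng(0)
layer_sizes = [8, 128, 64, 1]  # input width 8 is assumed for the demo
weights = [
    rng.normal(scale=0.1, size=(n_in, n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

X = rng.normal(size=(5, 8))  # five made-up customers
probabilities = mlp_forward(X, weights, biases)
print(probabilities.shape)
```

Training adjusts `weights` and `biases` so that these probabilities match the observed labels; the sketch only shows how an input flows through the layers.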

## Steps:

1. Load the Customer_Transactions.csv dataset from Kaggle.
2. Inspect the dataset: check its shape, column types, missing values (none are present), and summary statistics.
3. Remove leak-prone or unwanted columns.
4. Standardize numeric features using scaling.
5. Encode categorical features using OneHotEncoder.
6. Create an MLP Classifier for churn prediction.
7. Build a Scikit-learn pipeline combining preprocessing and model.
8. Split the dataset into train and test sets.
9. Fit the model to the training data.
10. Evaluate the model and calculate accuracy on the test set.

## Files:

- customer_churn.ipynb
- Customer_Transactions.csv (Dataset from Kaggle - https://www.kaggle.com/datasets/fares279/customers-transactions/data)
- requirements.txt



In [35]:
# Import packages and libraries

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

In [37]:
# Load the dataset

df = pd.read_csv("Customer_Transactions.csv")
df.head()

|   | customer_id | age | gender | country | annual_income | spending_score | num_purchases | avg_purchase_value | membership_years | website_visits_per_month | cart_abandon_rate | churned | feedback_text | last_purchase_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 37 | Male | Germany | 85886 | 14 | 18 | 41.2 | 6 | 20 | 0.95 | 0 | Very satisfied with my purchase. | 2025-06-22 |
| 1 | 2 | 40 | Male | India | 41041 | 4 | 10 | 31.73 | 4 | 29 | 0.21 | 0 | Good quality and value for money. | 2025-10-17 |
| 2 | 3 | 69 | Female | Australia | 143869 | 59 | 39 | 65.96 | 12 | 26 | 0.08 | 0 | Excellent customer service. | 2025-07-01 |
| 3 | 4 | 30 | Male | UK | 87261 | 45 | 34 | 51.87 | 12 | 7 | 0.61 | 0 | Good quality and value for money. | 2025-08-17 |
| 4 | 5 | 69 | Female | UK | 110678 | 40 | 38 | 59.64 | 13 | 16 | 0.49 | 0 | Excellent customer service. | 2025-06-21 |


In [38]:
df.shape

(10000, 14)

In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   customer_id               10000 non-null  int64  
 1   age                       10000 non-null  int64  
 2   gender                    10000 non-null  object 
 3   country                   10000 non-null  object 
 4   annual_income             10000 non-null  int64  
 5   spending_score            10000 non-null  int64  
 6   num_purchases             10000 non-null  int64  
 7   avg_purchase_value        10000 non-null  float64
 8   membership_years          10000 non-null  int64  
 9   website_visits_per_month  10000 non-null  int64  
 10  cart_abandon_rate         10000 non-null  float64
 11  churned                   10000 non-null  int64  
 12  feedback_text             10000 non-null  object 
 13  last_purchase_date        10000 non-null  object 
dtypes: float64(2), int64(8), object(4)

In [41]:
# Check null values

df.isnull().sum()

customer_id                 0
age                         0
gender                      0
country                     0
annual_income               0
spending_score              0
num_purchases               0
avg_purchase_value          0
membership_years            0
website_visits_per_month    0
cart_abandon_rate           0
churned                     0
feedback_text               0
last_purchase_date          0
dtype: int64

In [42]:
# Summary Statistics

df.describe()

|   | customer_id | age | annual_income | spending_score | num_purchases | avg_purchase_value | membership_years | website_visits_per_month | cart_abandon_rate | churned |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 5000.5 | 44.0458 | 86067.6761 | 50.9242 | 22.576 | 47.44748 | 6.3405 | 15.5781 | 0.501216 | 0.109 |
| std | 2886.89568 | 15.404669 | 38986.787991 | 28.753395 | 10.163639 | 11.205902 | 4.680657 | 8.655322 | 0.286836 | 0.311655 |
| min | 1.0 | 18.0 | 20028.0 | 1.0 | 1.0 | 16.75 | 0.0 | 1.0 | 0.0 | 0.0 |
| 25% | 2500.75 | 31.0 | 55345.5 | 26.0 | 14.0 | 39.5 | 2.0 | 8.0 | 0.25 | 0.0 |
| 50% | 5000.5 | 44.0 | 78339.5 | 51.0 | 22.0 | 46.99 | 6.0 | 16.0 | 0.51 | 0.0 |
| 75% | 7500.25 | 57.0 | 115570.5 | 75.0 | 31.0 | 55.08 | 10.0 | 23.0 | 0.75 | 0.0 |
| max | 10000.0 | 70.0 | 179960.0 | 100.0 | 49.0 | 83.27 | 15.0 | 30.0 | 1.0 | 1.0 |


In [43]:
# Drop the identifier column (customer_id) and the columns treated as potential leak sources

df = df.drop(columns=['customer_id', 'last_purchase_date', 'feedback_text'])

In [44]:
# Numeric features, to be standardized with StandardScaler

numeric_features = [
    'age', 'annual_income', 'spending_score', 'num_purchases',
    'avg_purchase_value', 'membership_years',
    'website_visits_per_month', 'cart_abandon_rate'
]

# Categorical features, to be encoded with OneHotEncoder

categorical_features = ['gender', 'country']

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)
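To see what this kind of preprocessor produces, the following sketch fits a reduced version on a three-row toy frame and then transforms a row whose country was not seen during fitting. The toy values, the names `toy_preprocessor`, `train`, and `unseen`, and the use of `sparse_threshold=0` (to get a dense array that is easy to print) are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with made-up values, reusing two column names from the dataset.
train = pd.DataFrame({
    "age": [20, 40, 60],
    "country": ["UK", "India", "UK"],
})
unseen = pd.DataFrame({"age": [40], "country": ["Brazil"]})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

train_matrix = toy_preprocessor.fit_transform(train)
unseen_matrix = toy_preprocessor.transform(unseen)

print(train_matrix.shape)  # one scaled column plus one column per country
print(unseen_matrix)       # the unseen country encodes as all zeros
```

With `handle_unknown='ignore'`, a category that did not appear during fitting does not raise an error at prediction time; its one-hot columns are simply all zero.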



In [46]:
# Create an MLP Classifier for churn prediction.

mlp = MLPClassifier(
    hidden_layer_sizes=(128, 64),
    activation="relu",
    solver="adam",
    learning_rate_init=0.0005,
    max_iter=500,
    early_stopping=True,
    n_iter_no_change=20,
    validation_fraction=0.1,
    random_state=42
)
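With `early_stopping=True`, scikit-learn holds out `validation_fraction` of the training data and stops training once the validation score has not improved for `n_iter_no_change` consecutive epochs. The sketch below shows how the outcome can be inspected after fitting. It uses synthetic data from `make_classification` and a smaller network, both assumptions chosen to keep the example fast; the numbers it prints say nothing about the churn dataset.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data, not the Kaggle dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)

demo_mlp = MLPClassifier(
    hidden_layer_sizes=(16,),
    early_stopping=True,
    validation_fraction=0.1,
    n_iter_no_change=20,
    max_iter=500,
    random_state=42,
)
demo_mlp.fit(X_demo, y_demo)

# Attributes populated by fitting with early stopping enabled.
print("epochs run:", demo_mlp.n_iter_)
print("best validation accuracy:", demo_mlp.best_validation_score_)
```

If `n_iter_` is well below `max_iter`, training ended because the validation score plateaued rather than because the iteration limit was reached.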

In [47]:
# Build a Scikit-learn pipeline combining preprocessing and model.

model = Pipeline([
    ("preprocessor", preprocessor),
    ("mlp", mlp)
])

In [48]:
# Separate the features (X) from the target column (y)

X = df.drop("churned", axis=1)
y = df["churned"]

In [49]:
# Split the dataset into train and test sets.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
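The split above is not stratified, so the share of churned customers can differ slightly between the train and test sets. One option, not used in this notebook, is to pass `stratify=y` to `train_test_split`. The sketch below illustrates the effect on synthetic labels with the same 10.9% positive rate that the summary statistics report; `X_demo` and `y_demo` are placeholders, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 1,090 churned out of 10,000 (10.9%).
y_demo = np.array([1] * 1090 + [0] * 8910)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both subsets keep the overall churn rate.
print(y_tr.mean(), y_te.mean())
```

Applying this to the notebook would change which rows land in the test set, so the reported accuracy would need to be recomputed.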

In [50]:
# Fit the model to the training data, then predict on the held-out test set.

model.fit(X_train, y_train)
pred = model.predict(X_test)

In [52]:
# Evaluate the model and calculate accuracy on the test set.

print("MLP Accuracy: ", accuracy_score(y_test, pred))

MLP Accuracy:  0.8885


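## Results:
MLP Accuracy (without text features): 88.85%

This accuracy was obtained using only numeric and categorical features, without the potentially leaking text data. It should be read against the class balance: the mean of `churned` in the summary statistics is 0.109, so only about 10.9% of customers churned, and a model that always predicts "no churn" would score roughly 89% accuracy on the full dataset. The reported accuracy is therefore close to that majority-class baseline, and accuracy alone does not show how well the model identifies churners. Metrics for the churn class, such as recall, precision, or a confusion matrix, would be needed to judge that.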
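As a rough check on how to read the accuracy figure, the sketch below scores a predictor that always outputs "no churn" on synthetic labels with a 10.9% churn rate (the mean of `churned` in the summary statistics). The labels are made up for illustration and are not the notebook's test set.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Synthetic labels: 109 churned out of 1,000 (10.9%).
y_true = np.array([1] * 109 + [0] * 891)
always_no_churn = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, always_no_churn)
baseline_recall = recall_score(y_true, always_no_churn)
baseline_confusion = confusion_matrix(y_true, always_no_churn)

print(baseline_accuracy)   # share of customers who did not churn
print(baseline_recall)     # share of churners that were caught
print(baseline_confusion)  # rows: true class, columns: predicted class
```

The same three calls can be run on `y_test` and `pred` from the notebook to see how many actual churners the trained model identifies.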

## Future Scope:

- The model can be enhanced by safely incorporating text data through sentiment analysis or embeddings to capture customer feedback patterns.
- Further improvements may come from gradient-boosted tree models such as XGBoost or LightGBM, and from hyperparameter tuning.
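For the hyperparameter tuning mentioned above, one possible starting point is `GridSearchCV` over a pipeline, where the `<step name>__<parameter>` syntax addresses a parameter of a named step. The sketch below is an assumption-laden illustration: it uses synthetic data, a reduced pipeline, a small arbitrary grid, and `average_precision` as the scoring metric because the classes are imbalanced. None of these choices come from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with roughly 10% positives, not the Kaggle dataset.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=42
)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("mlp", MLPClassifier(max_iter=300, random_state=42)),
])

# Keys follow the "<step name>__<parameter>" convention.
param_grid = {
    "mlp__hidden_layer_sizes": [(16,), (32, 16)],
    "mlp__alpha": [1e-4, 1e-2],
}

search = GridSearchCV(demo_pipeline, param_grid, scoring="average_precision", cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
```

Because the notebook's pipeline also names its model step `"mlp"`, the same key pattern would apply there; the grid values themselves would need to be chosen for the real data.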