In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
import pickle

In [3]:
data = pd.read_csv("Churn_Modelling.csv")
data.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [5]:
## Preprocess the data
### 1. Drop irrelevant features
data = data.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)
data.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


The remaining columns include two categorical variables, `Gender` and `Geography`. Both must be converted to numeric form before they can be fed into a neural network.
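One way to confirm which columns still need encoding is to ask pandas for the non-numeric ones. The snippet below is a minimal sketch on a toy frame (the name `sample` is hypothetical, and the values are copied from the preview above); on the real notebook the same call would be made on `data`.

```python
import pandas as pd

# Toy stand-in for a few columns of the churn data (values taken from the preview)
sample = pd.DataFrame({
    "CreditScore": [619, 608, 502],
    "Geography": ["France", "Spain", "France"],
    "Gender": ["Female", "Female", "Female"],
    "Balance": [0.0, 83807.86, 159660.8],
})

# Any column that is not numeric still needs to be encoded
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)

# The number of distinct values per column helps decide which encoder fits
cardinality = sample[categorical_cols].nunique().to_dict()
print(cardinality)
```

A column with two distinct values is a candidate for a single 0/1 column, while a nominal column with three or more values usually calls for one-hot encoding.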

### ✳️ Label Encoding for Gender

Gender = ["Male", "Female"]
Label Encoded: Male → 1, Female → 0

✅ Why Label Encoding is suitable for Gender:

Only two categories, "Male" and "Female", so this is a binary feature.

With just two values, a 0/1 encoding cannot impose a misleading order: it only marks which of the two categories applies.

It needs a single column, whereas one-hot encoding would produce two redundant columns (each is simply 1 minus the other).

Most ML models, including neural networks, handle binary 0/1 features well.

In [8]:
le_gender = LabelEncoder()
data['Gender'] = le_gender.fit_transform(data['Gender'])
data.head()


Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,0,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,0,41,1,83807.86,1,0,1,112542.58,0
2,502,France,0,42,8,159660.8,3,1,0,113931.57,1
3,699,France,0,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,0,43,2,125510.82,1,1,1,79084.1,0
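Because the preview only shows rows where `Gender` is 0, it can be worth checking which label received which code. `LabelEncoder` stores its classes in sorted order in `classes_`, and each class's position is its integer code. The sketch below uses a small hypothetical list in place of `data['Gender']`.

```python
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for data['Gender']; the real column comes from Churn_Modelling.csv
genders = ["Female", "Male", "Female", "Male"]

le = LabelEncoder()
codes = le.fit_transform(genders)

# classes_ is sorted, so the index of each class is its integer code
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the codes
restored = le.inverse_transform(codes)
print(list(restored))
```

In the notebook itself, the same check would read `le_gender.classes_`, and keeping the fitted encoder around allows new data to be transformed with the identical mapping.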


### ✳️ One-Hot Encoding (OHE) for Geography

Geography = ["France", "Spain", "Germany"]
One-Hot Encoded:
France  → [1, 0, 0]  
Spain   → [0, 1, 0]  
Germany → [0, 0, 1]

✅ Why One-Hot Encoding is used for Geography:

More than two categories, and no natural ordering (nominal data).

If we used Label Encoding here (`LabelEncoder` would assign France → 0, Germany → 1, Spain → 2), the model might mistakenly assume that Spain > Germany > France, or that Germany lies "between" the other two, which introduces a false ordinal relationship.

One-hot encoding avoids this by giving each category its own 0/1 column, so no category is numerically larger than, or closer to, any other.
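A minimal sketch of this step with scikit-learn's `OneHotEncoder` follows. It runs on a toy frame rather than the real `data`, and the names `sample`, `ohe`, and `geo_df` are hypothetical. Note that the encoder orders its output columns by sorted category (France, Germany, Spain), which may differ from the order used in an illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for data[['Geography']]
sample = pd.DataFrame({"Geography": ["France", "Spain", "Germany", "France"]})

ohe = OneHotEncoder()
# The default output is a sparse matrix; convert it to a dense array for display
encoded = ohe.fit_transform(sample[["Geography"]]).toarray()

# categories_[0] holds the sorted categories of the first (only) input column
columns = [f"Geography_{c}" for c in ohe.categories_[0]]
geo_df = pd.DataFrame(encoded, columns=columns, index=sample.index)
print(geo_df)
```

Each row contains exactly one 1, in the column of its country. In the notebook, a frame like `geo_df` would typically be concatenated with the other features after dropping the original `Geography` column, and the fitted encoder would be saved so that inference uses the same column layout.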
