
<h2 style="text-align: center; color: blue;">Banking_churn</h2>

<p style="text-align: center;">
    <img src="Employee-walking-away.jpg" alt="Employee walking away" style="width: 50%; height: auto;">
</p>


## <span style="color:red">Bank Customer Churn Dataset Overview</span>

### 1. Dataset Summary
- **Number of Rows:** 10,000
- **Number of Columns:** 14
- **Target Variable:** `Exited` (1 = Customer left bank, 0 = Customer stayed)
- **Goal:** Predict customer churn based on demographic and financial features.

---

### 2. Features Overview

<table>
<tr><th style="color:blue">Column</th><th style="color:green">Type</th><th style="color:purple">Description</th></tr>
<tr><td>RowNumber</td><td>Integer</td><td>Row index (not useful for prediction)</td></tr>
<tr><td>CustomerId</td><td>Integer</td><td>Unique ID of the customer</td></tr>
<tr><td>Surname</td><td>Object</td><td>Customer last name (not predictive)</td></tr>
<tr><td>CreditScore</td><td>Integer</td><td>Customer credit score</td></tr>
<tr><td>Geography</td><td>Category</td><td>Country of residence (France, Spain, Germany)</td></tr>
<tr><td>Gender</td><td>Category</td><td>Male or Female</td></tr>
<tr><td>Age</td><td>Integer</td><td>Customer age in years</td></tr>
<tr><td>Tenure</td><td>Integer</td><td>Years the customer has been with the bank</td></tr>
<tr><td>Balance</td><td>Float</td><td>Customer account balance</td></tr>
<tr><td>NumOfProducts</td><td>Integer</td><td>Number of bank products used</td></tr>
<tr><td>HasCrCard</td><td>Integer</td><td>1 = Has credit card, 0 = No</td></tr>
<tr><td>IsActiveMember</td><td>Integer</td><td>1 = Active customer, 0 = Inactive</td></tr>
<tr><td>EstimatedSalary</td><td>Float</td><td>Estimated annual salary</td></tr>
<tr><td>Exited</td><td>Integer</td><td>Target: 1 = Customer left, 0 = Still customer</td></tr>
</table>

---

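### 3. Key Observations
- **Gender distribution:** Slightly more Male (5,457) than Female (4,543) customers.
- **Geography:** All customers are from France, Spain, or Germany; France is the largest group (5,014 of 10,000).
- **Age factor:** Older customers are more likely to churn.
- **Churn rate:** 20.37% of customers (2,037 of 10,000) exited the bank.
- **Activity:** Inactive customers are more likely to churn.
- **Balance & Salary:** Wide ranges (`Balance` from 0 to 250,898.09; `EstimatedSalary` from 11.58 to 199,992.48). At least a quarter of customers have a zero balance, since the 25th percentile of `Balance` is 0.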
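
Observations such as the age and activity effects can be checked by comparing churn rates across groups. The sketch below is a minimal illustration on made-up rows (the `toy` frame and the age-band edges are assumptions, not values taken from `Churn_Modelling.csv`); the same pattern should apply to the real `df`.

```python
import pandas as pd

# Toy data for illustration only; these are not rows from Churn_Modelling.csv.
toy = pd.DataFrame({
    "Age": [25, 31, 38, 44, 52, 61, 47, 29],
    "IsActiveMember": [1, 1, 0, 0, 0, 1, 0, 1],
    "Exited": [0, 0, 0, 1, 1, 0, 1, 0],
})

# The mean of a 0/1 target within a group is that group's churn rate.
churn_by_activity = toy.groupby("IsActiveMember")["Exited"].mean()

# Bin ages into bands (edges chosen arbitrarily here) to compare rates by age.
age_band = pd.cut(toy["Age"], bins=[17, 35, 50, 100], labels=["18-35", "36-50", "51+"])
churn_by_age = toy.groupby(age_band, observed=False)["Exited"].mean()

print(churn_by_activity)
print(churn_by_age)
```

On the real data, replacing `toy` with `df` would give the per-group rates that back up (or contradict) each observation.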

---

### 4. Notes
- Columns `RowNumber`, `CustomerId`, `Surname` are identifiers with no predictive value → should be dropped.
- `Geography` & `Gender` require **encoding** (One-Hot / Label Encoding).
- Dataset is suitable for a **binary classification task** predicting `Exited`.
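
As a rough sketch of the encoding step mentioned in the notes, `pd.get_dummies` can one-hot encode both categorical columns. The rows below are invented for illustration, and `drop_first=True` is one possible choice (it avoids a redundant column per feature), not something the notebook prescribes.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Age": [42, 41, 39, 43],
})

# One-hot encode the two categorical columns; drop_first removes the first
# (alphabetical) level of each so the remaining columns are not redundant.
encoded = pd.get_dummies(
    toy, columns=["Geography", "Gender"], drop_first=True, dtype=int
)

print(encoded.columns.tolist())
```

One-hot encoding is usually the safer option for `Geography`, because label encoding would impose an arbitrary order on the three countries.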


# <span style="color:red">Data understanding </span>


## <span style="color:yellow">Load the dataset </span>

In [22]:
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px

In [23]:
df = pd.read_csv("Churn_Modelling.csv")

In [24]:
print("Shape of dataset:", df.shape)  

Shape of dataset: (10000, 14)


In [25]:
df.head(5)

,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [26]:
print("Columns:\n", df.columns.tolist())


Columns:
 ['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography', 'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary', 'Exited']


In [27]:
print(df['Exited'].unique())


[1 0]



## <span style="color:yellow">Check the data types </span>


In [28]:
print("Data Types Info:")
df.info()

Data Types Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB
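
The `Non-Null Count` column above shows 10,000 non-null values in every column, so there are no missing values. A more direct check, sketched here on a small invented frame, counts missing values per column and fully duplicated rows; the duplicate check is an extra step that this notebook does not perform.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value and one duplicated row, for illustration.
toy = pd.DataFrame({
    "CreditScore": [619, 608, np.nan, 608],
    "Geography": ["France", "Spain", "France", "Spain"],
    "Exited": [1, 0, 1, 0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Number of rows that repeat an earlier row exactly.
duplicate_rows = int(toy.duplicated().sum())

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
```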



## <span style="color:yellow"> Correct the data types </span>

##### Convert `object` columns to `category` to save memory
##### Drop the columns that are not useful: [RowNumber, CustomerId, Surname]

In [29]:
df.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1, inplace=True)
df['Geography'] = df['Geography'].astype('category')
df['Gender'] = df['Gender'].astype('category')



In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   CreditScore      10000 non-null  int64   
 1   Geography        10000 non-null  category
 2   Gender           10000 non-null  category
 3   Age              10000 non-null  int64   
 4   Tenure           10000 non-null  int64   
 5   Balance          10000 non-null  float64 
 6   NumOfProducts    10000 non-null  int64   
 7   HasCrCard        10000 non-null  int64   
 8   IsActiveMember   10000 non-null  int64   
 9   EstimatedSalary  10000 non-null  float64 
 10  Exited           10000 non-null  int64   
dtypes: category(2), float64(2), int64(7)
memory usage: 723.0 KB
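
The reported memory usage falls from 1.1+ MB to 723.0 KB partly because three columns were dropped and partly because of the `category` conversion. The sketch below isolates the second effect on a synthetic 10,000-value column (the values are made up; only the three country names come from the dataset).

```python
import pandas as pd

# Synthetic column with the same three country names, 10,000 values in total.
countries = pd.Series(["France", "Spain", "Germany", "France"] * 2500, name="Geography")

as_object = countries.astype("object")
as_category = countries.astype("category")

# deep=True counts the Python string objects, not just the 8-byte pointers.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

A `category` column stores each distinct string once and keeps a small integer code per row, which is why it pays off for low-cardinality columns such as `Geography` and `Gender`.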


In [31]:
df.head()

,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [32]:
print("Exited:")
print(df['Exited'].value_counts())
print(f"Exited: {(df['Exited'].mean() * 100):.2f}%")

Exited:
Exited
0    7963
1    2037
Name: count, dtype: int64
Exited: 20.37%
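
Because the classes are imbalanced (7,963 vs 2,037), plain accuracy can mislead: a model that always predicts "stayed" is already right most of the time. The arithmetic below uses the counts printed above; treating this majority-class rate as a baseline is a common convention, not a step taken in this notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 7963, 1: 2037}, name="count")

# Share of each class; the largest share is the accuracy obtained by
# always predicting the majority class.
shares = counts / counts.sum()
majority_baseline = shares.max()

print(shares.round(4).to_dict())
print(f"Majority-class baseline accuracy: {majority_baseline:.2%}")
```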


In [35]:
numeric_columns = df.select_dtypes(include=[np.number]).columns.tolist()
print("Numeric Columns:", numeric_columns)
for col in numeric_columns:
    if col != 'Exited':
        scale_range = df[col].max() - df[col].min()
        print(f"{col}: Range = {scale_range}")

Numeric Columns: ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary', 'Exited']
CreditScore: Range = 500
Age: Range = 74
Tenure: Range = 10
Balance: Range = 250898.09
NumOfProducts: Range = 3
HasCrCard: Range = 1
IsActiveMember: Range = 1
EstimatedSalary: Range = 199980.90000000002
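
The ranges above differ by several orders of magnitude (1 for the binary flags, about 250,898 for `Balance`), which matters for distance-based and gradient-based models. One common remedy is min-max scaling, sketched here on toy values; the notebook does not say which scaler, if any, is used later, so this is only an illustration.

```python
import pandas as pd

# Toy values spanning the observed CreditScore and Balance ranges.
toy = pd.DataFrame({
    "CreditScore": [350, 600, 850],
    "Balance": [0.0, 125449.045, 250898.09],
})

# Min-max scaling: subtract each column's minimum and divide by its range,
# which maps every column onto the interval [0, 1].
scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled)
```

In a modelling pipeline the minimum and range should be computed on the training split only and then reused for the test split.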


## <span style="color:yellow">Descriptive statistics </span>



In [36]:
df.describe()

,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,650.5288,38.9218,5.0128,76485.889288,1.5302,0.7055,0.5151,100090.239881,0.2037
std,96.653299,10.487806,2.892174,62397.405202,0.581654,0.45584,0.499797,57510.492818,0.402769
min,350.0,18.0,0.0,0.0,1.0,0.0,0.0,11.58,0.0
25%,584.0,32.0,3.0,0.0,1.0,0.0,0.0,51002.11,0.0
50%,652.0,37.0,5.0,97198.54,1.0,1.0,1.0,100193.915,0.0
75%,718.0,44.0,7.0,127644.24,2.0,1.0,1.0,149388.2475,0.0
max,850.0,92.0,10.0,250898.09,4.0,1.0,1.0,199992.48,1.0
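
The quartiles in this table are enough to apply the 1.5 × IQR rule of thumb for outliers without touching the raw rows. The sketch below copies the relevant numbers from the output above; the helper name `iqr_fences` and the multiplier 1.5 are illustrative choices, and other outlier definitions may give different answers.

```python
# Min, quartiles and max copied from the df.describe() output above.
summary = {
    "Age": {"min": 18.0, "q1": 32.0, "q3": 44.0, "max": 92.0},
    "Balance": {"min": 0.0, "q1": 0.0, "q3": 127644.24, "max": 250898.09},
    "EstimatedSalary": {"min": 11.58, "q1": 51002.11, "q3": 149388.2475, "max": 199992.48},
}


def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# A column has values beyond a fence if its min or max falls outside it.
beyond_fences = {}
for name, s in summary.items():
    low, high = iqr_fences(s["q1"], s["q3"])
    beyond_fences[name] = s["min"] < low or s["max"] > high
    print(f"{name}: fences = ({low:.2f}, {high:.2f}), beyond fences = {beyond_fences[name]}")
```

Under this rule, `Age` has values above its upper fence (62), while the minimum and maximum of `Balance` and `EstimatedSalary` both stay inside their fences.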


In [37]:
df.describe(include=['category','object'])

,Geography,Gender
count,10000,10000
unique,3,2
top,France,Male
freq,5014,5457


In [38]:
df.to_csv("pre_cleaned_data.csv", index=False)
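
CSV files do not store dtypes, so the `category` conversion made earlier is lost when `pre_cleaned_data.csv` is read back. A minimal sketch of restoring it on load, using an in-memory buffer and invented rows in place of the real file:

```python
import io

import pandas as pd

# Toy frame with a categorical column, standing in for the cleaned data.
toy = pd.DataFrame({
    "Geography": pd.Categorical(["France", "Spain", "Germany"]),
    "Exited": [1, 0, 1],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# A plain read does not bring the category dtype back.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Passing dtype restores it at load time.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"Geography": "category"})

print(plain["Geography"].dtype, restored["Geography"].dtype)
```

For the real file, the same `dtype` mapping can list both `Geography` and `Gender`.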