# Banking Churn Predictor
---

### Introduction
---
This project analyzes customer churn at a bank. It aims to identify the patterns and factors that contribute to churn, so that banks can take proactive measures to retain customers and improve customer satisfaction.

---
## 1.) Import Required Packages

#### Importing the NumPy, pandas, warnings, and os libraries.

In [1]:
import numpy as np
import pandas as pd
import warnings
import os
warnings.filterwarnings('ignore')

---
## 2.) Data Loading
- The data consists of 14 columns and 10,000 rows.

#### Import the CSV Data as Pandas DataFrame

In [2]:
df = pd.read_csv('../data/raw_data/customer_churn_data.csv')

#### Show Top 5 Records

In [3]:
df.head()

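| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |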


#### Shape of the dataset

In [4]:
df.shape

(10000, 14)
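
Beyond the shape, it can be useful to confirm that the loaded frame has the columns the rest of the notebook relies on. The following is a minimal sketch, not part of the original notebook: `check_schema` and `EXPECTED_COLUMNS` are hypothetical names, and the CSV is a two-row in-memory sample instead of the real file.

```python
import io

import pandas as pd

# Column names as listed in the dataset description (assumed to be the full schema).
EXPECTED_COLUMNS = [
    "RowNumber", "CustomerId", "Surname", "CreditScore", "Geography",
    "Gender", "Age", "Tenure", "Balance", "NumOfProducts",
    "HasCrCard", "IsActiveMember", "EstimatedSalary", "Exited",
]


def check_schema(frame, expected_columns):
    """Return (missing, unexpected) column names relative to the expected schema."""
    missing = [c for c in expected_columns if c not in frame.columns]
    unexpected = [c for c in frame.columns if c not in expected_columns]
    return missing, unexpected


# Two-row in-memory stand-in for the raw CSV file.
csv_text = (
    "RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,"
    "Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited\n"
    "1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1\n"
    "2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0\n"
)
sample = pd.read_csv(io.StringIO(csv_text))

missing, unexpected = check_schema(sample, EXPECTED_COLUMNS)
print("missing:", missing, "unexpected:", unexpected)
```

An unexpected column such as `Unnamed: 0` usually means the CSV was written with its index included.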

### 2.1 Dataset Description

1. **RowNumber**: A running row index that is unique for each record. It does not contribute directly to the analysis.

2. **CustomerId**: A unique identifier for each customer. It can be used to track and differentiate individual customers within the dataset.

3. **Surname**: It provides information about the family name of each customer.

4. **CreditScore**: It is a numerical value that assesses the creditworthiness of an individual based on their credit history and financial behavior.

5. **Geography**: It provides information about the customers' geographic distribution, allowing for analysis based on regional or national factors.

6. **Gender**: It categorizes customers as either male or female, enabling gender-based analysis if relevant to the churn prediction.

7. **Age**: It represents the customer's age in years and can be used to analyze age-related patterns and behaviors.

8. **Tenure**: It typically represents the number of years or months the customer has been associated with the bank.

9. **Balance**: It reflects the amount of money in the customer's bank account at a specific point in time.

10. **NumOfProducts**: The number of bank products the customer uses. These can include offerings such as savings accounts, loans, and credit cards.

11. **HasCrCard**: It is a binary variable with a value of 1 if the customer possesses a credit card and 0 otherwise.

12. **IsActiveMember**: It is a binary variable indicating whether the customer is an active member (1) or not (0) within the bank.

13. **EstimatedSalary**: It provides an approximation of the customer's income level, which can be relevant for analyzing churn behavior.

14. **Exited**: It indicates whether a customer has churned (1) or not (0) from the bank. It is the variable we aim to predict using the other features.
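
Based on the descriptions above, a later modelling step would probably separate the target from the features and set the identifier columns aside. The sketch below is one possible way to do that on a toy frame. Treating `RowNumber`, `CustomerId`, and `Surname` as identifiers with no predictive value is an assumption, not a result from this notebook.

```python
import pandas as pd

# Toy frame with a subset of the dataset's columns (values are illustrative).
sample = pd.DataFrame({
    "RowNumber": [1, 2, 3],
    "CustomerId": [15634602, 15647311, 15619304],
    "Surname": ["Hargrave", "Hill", "Onio"],
    "CreditScore": [619, 608, 502],
    "Exited": [1, 0, 1],
})

# Assumption: these columns only identify a record and carry no predictive signal.
identifier_cols = ["RowNumber", "CustomerId", "Surname"]

# Features are everything except the identifiers and the target.
features = sample.drop(columns=identifier_cols + ["Exited"])
target = sample["Exited"]

print(features.columns.tolist())
print(target.tolist())
```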

---
## 3.) Data Checks to perform

- Check missing values
- Check duplicates
- Check data types
- Check the number of unique values in each column

### 3.1 Check Missing values

In [5]:
df.isna().sum()

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

#### There are no missing values in the data set
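
Note that `isna()` only counts true nulls. Blank or whitespace-only strings are not flagged. The sketch below, on a toy frame with hypothetical values, shows one way to treat such strings as missing before counting. It is an optional extra check, not something the original notebook runs.

```python
import numpy as np
import pandas as pd

# Toy frame: one whitespace-only string and one real NaN (values are illustrative).
sample = pd.DataFrame({
    "Geography": ["France", " ", "Spain"],
    "Balance": [0.0, 83807.86, np.nan],
})

# Plain null count: the whitespace-only string is not detected.
plain_missing = sample.isna().sum()

# Replace empty or whitespace-only strings with NaN, then count again.
blank_as_missing = sample.replace(r"^\s*$", np.nan, regex=True).isna().sum()

print(plain_missing)
print(blank_as_missing)
```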

### 3.2 Check Duplicates

In [6]:
df.duplicated().sum()

0

#### There are no duplicate rows in the data set
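
`df.duplicated()` compares entire rows, and because `RowNumber` differs on every record, two rows can never match in full. A stricter check looks for repeats in the customer key only. The following is a minimal sketch on a toy frame. Using `CustomerId` as the key is an assumption based on the dataset description.

```python
import pandas as pd

# Toy frame: the same customer appears twice under different row numbers.
sample = pd.DataFrame({
    "RowNumber": [1, 2, 3],
    "CustomerId": [100, 200, 100],
    "CreditScore": [619, 608, 619],
})

# Full-row comparison finds nothing because RowNumber always differs.
full_row_dupes = int(sample.duplicated().sum())

# Restricting the comparison to the key column exposes the repeated customer.
customer_dupes = int(sample.duplicated(subset=["CustomerId"]).sum())

print(full_row_dupes, customer_dupes)
```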

### 3.3 Check data types

In [7]:
# Check non-null counts and dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB
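
The `object` columns with few distinct values and the 0/1 flags could be given more specific dtypes. The sketch below shows one possible conversion on a toy frame. The choice of `category` and `bool` is an assumption for illustration, and the notebook itself keeps the original dtypes.

```python
import pandas as pd

# Toy frame with a subset of the dataset's columns (values are illustrative).
sample = pd.DataFrame({
    "Geography": ["France", "Spain", "France"],
    "Gender": ["Female", "Female", "Male"],
    "HasCrCard": [1, 0, 1],
})

# Cast text labels to category and the 0/1 flag to bool.
typed = sample.astype({
    "Geography": "category",
    "Gender": "category",
    "HasCrCard": "bool",
})

print(typed.dtypes)
```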


### 3.4 Checking the number of unique values of each column

In [8]:
df.nunique()

RowNumber          10000
CustomerId         10000
Surname             2932
CreditScore          460
Geography              3
Gender                 2
Age                   70
Tenure                11
Balance             6382
NumOfProducts          4
HasCrCard              2
IsActiveMember         2
EstimatedSalary     9999
Exited                 2
dtype: int64
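
Unique-value counts like these are often used to separate categorical-looking columns from identifier-like or continuous ones. The sketch below does this on a toy frame. The `max_categories` threshold is a hypothetical value chosen for the toy data and would need tuning for the real dataset.

```python
import pandas as pd

# Toy frame with a subset of the dataset's columns (values are illustrative).
sample = pd.DataFrame({
    "CustomerId": [1, 2, 3, 4, 5, 6],
    "Geography": ["France", "Spain", "France", "Spain", "France", "France"],
    "Balance": [0.0, 83807.86, 159660.8, 0.0, 125510.82, 0.0],
    "Exited": [1, 0, 1, 0, 0, 0],
})

counts = sample.nunique()

# Hypothetical cut-off: columns with at most this many distinct values
# are treated as categorical candidates.
max_categories = 3

low_cardinality = counts[counts <= max_categories].index.tolist()
high_cardinality = counts[counts > max_categories].index.tolist()

print("categorical candidates:", low_cardinality)
print("continuous or identifier-like:", high_cardinality)
```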

---
## 4.) Saving the Processed data

In [9]:
save_location = '../data/processed_data/customer_churn_data.csv'
os.makedirs(os.path.dirname(save_location), exist_ok=True)  # Ensure the directory exists
df.to_csv(save_location, index=False)
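
After saving, a quick round-trip read can confirm that the file was written as intended. The sketch below follows the same save steps on a toy frame, writing to a temporary directory instead of the project's `../data/processed_data/` folder, and then compares the reloaded data with the original.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the processed data (values are illustrative).
sample = pd.DataFrame({
    "CustomerId": [15634602, 15647311],
    "Balance": [0.0, 83807.86],
    "Exited": [1, 0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    save_location = os.path.join(tmp_dir, "processed_data", "customer_churn_data.csv")
    os.makedirs(os.path.dirname(save_location), exist_ok=True)  # Ensure the directory exists
    sample.to_csv(save_location, index=False)

    # Read the file back while the temporary directory still exists.
    reloaded = pd.read_csv(save_location)

# Raises an AssertionError if columns, dtypes, or values differ.
pd.testing.assert_frame_equal(reloaded, sample)
print(reloaded.shape)
```

Because the file is written with `index=False`, the reloaded frame has no extra `Unnamed: 0` column.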