# Domain Selection:
Our team has chosen to focus on the Banking Industry, specifically analyzing customer churn.
This decision aligns with our interests in understanding customer behavior and retention strategies within a highly competitive sector. 
Customer churn is a critical area for banks, as retaining existing clients is often more cost-effective than acquiring new ones. By analyzing patterns and identifying factors that influence customers to leave, we can provide insights that help banks improve customer loyalty and minimize churn rates.
Given the significance of customer retention for banks, this topic will allow us to explore predictive analytics and gain hands-on experience with models to forecast churn. 
This is not only relevant in the banking industry but also applicable across other sectors where customer loyalty is key. We believe this focus will provide valuable insights and practical applications for our future work in data analytics.

# Dataset Selection:
link: https://www.kaggle.com/datasets/shubh0799/churn-modelling

In [1]:
import pandas as pd

In [3]:
df = pd.read_csv('Churn_Modelling.csv')

In [5]:
df

,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.00,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.80,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.00,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,15606229,Obijiaku,771,France,Male,39,5,0.00,2,1,0,96270.64,0
9996,9997,15569892,Johnstone,516,France,Male,35,10,57369.61,1,1,1,101699.77,0
9997,9998,15584532,Liu,709,France,Female,36,7,0.00,1,0,1,42085.58,1
9998,9999,15682355,Sabbatini,772,Germany,Male,42,3,75075.31,2,1,0,92888.52,1


In [7]:
# Use RowNumber as the index so it is not treated as a feature
df.set_index('RowNumber', inplace=True)

In [9]:
df.dtypes

CustomerId           int64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure               int64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object

In [11]:
df.describe()

,CustomerId,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,15690940.0,650.5288,38.9218,5.0128,76485.889288,1.5302,0.7055,0.5151,100090.239881,0.2037
std,71936.19,96.653299,10.487806,2.892174,62397.405202,0.581654,0.45584,0.499797,57510.492818,0.402769
min,15565700.0,350.0,18.0,0.0,0.0,1.0,0.0,0.0,11.58,0.0
25%,15628530.0,584.0,32.0,3.0,0.0,1.0,0.0,0.0,51002.11,0.0
50%,15690740.0,652.0,37.0,5.0,97198.54,1.0,1.0,1.0,100193.915,0.0
75%,15753230.0,718.0,44.0,7.0,127644.24,2.0,1.0,1.0,149388.2475,0.0
max,15815690.0,850.0,92.0,10.0,250898.09,4.0,1.0,1.0,199992.48,1.0
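
One detail in the summary statistics is worth checking directly: the 25th percentile of Balance is 0.0, so at least a quarter of the customers hold a zero balance. The sketch below shows one way to measure that share. It uses only the Balance values of the five rows displayed in the earlier `df` output as stand-in data, and the names `sample`, `zero_mask`, and `zero_share` are illustrative rather than part of the original notebook.

```python
import pandas as pd

# Stand-in data: Balance values of the first five rows shown in the df output.
# On the real data, use the full DataFrame `df` instead of `sample`.
sample = pd.DataFrame({"Balance": [0.00, 83807.86, 159660.80, 0.00, 125510.82]})

# Boolean mask marking customers whose balance is exactly zero
zero_mask = sample["Balance"] == 0

# The mean of a boolean mask is the fraction of True values
zero_share = zero_mask.mean()

print(f"{zero_mask.sum()} of {len(sample)} rows have a zero balance ({zero_share:.0%})")
```

Run against the full dataset, the same two lines would show how large the zero-balance group is, which matters when Balance is later used as a feature.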


# Outcome Variable: Exited (0 = Stayed, 1 = Left)
**Description**:\
**0 represents** a customer who has stayed with the bank.\
**1 represents** a customer who has exited or left the bank.\
**Justification**: We chose Exited as the outcome variable because it directly records customer churn, and predicting whether a customer will stay with or leave the bank is the primary goal of this analysis.
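
Because Exited is coded 0/1, its mean equals the churn rate. The `describe()` output reports a mean of 0.2037, so roughly 20% of the 10,000 customers left and the two classes are imbalanced. A minimal sketch of how to check the class balance is shown below. It uses only the Exited values of the five rows displayed in the earlier `df` output as stand-in data, and the names `sample`, `class_counts`, and `churn_rate` are illustrative, not part of the original notebook.

```python
import pandas as pd

# Stand-in data: Exited values of the first five rows shown in the df output.
# On the real data, use the full DataFrame `df` instead of `sample`.
sample = pd.DataFrame({"Exited": [1, 0, 1, 0, 0]})

# Number of customers who stayed (0) and who left (1)
class_counts = sample["Exited"].value_counts().sort_index()

# For a 0/1 column, the mean is the share of 1s, i.e. the churn rate
churn_rate = sample["Exited"].mean()

print(class_counts)
print(f"Churn rate: {churn_rate:.1%}")
```

Knowing the class balance up front helps when choosing evaluation metrics later, since plain accuracy can look good on an imbalanced target even when few churners are identified.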

# Feature Selection:


In [15]:
summary = []

# Loop through each column to capture relevant information
for column in df.columns:
    feature_info = {}
    feature_info['Feature'] = column
    feature_info['Data_Type'] = df[column].dtype
    
    # Determine Min and Max only for numeric columns
    if pd.api.types.is_numeric_dtype(df[column]):
        feature_info['Min'] = df[column].min()
        feature_info['Max'] = df[column].max()
        feature_info['Range'] = f"{feature_info['Min']} - {feature_info['Max']}"
    else:
        feature_info['Min'] = ''
        feature_info['Max'] = ''
        feature_info['Range'] = "(Non-numeric)"
    
    # Justifications for each feature
    if column == 'CustomerId':
        feature_info['Justification'] = f"'{column}' is unique for each customer and does not contribute to the prediction of churn, so it can be excluded from the analysis."
    elif column == 'Surname':
        feature_info['Justification'] = f"'{column}' does not influence customer churn and can be excluded."
    elif column == 'CreditScore':
        feature_info['Justification'] = f"'{column}' may affect financial stability, potentially influencing churn."
    elif column == 'Geography':
        feature_info['Justification'] = f"'{column}' might reflect regional satisfaction variations affecting churn."
    elif column == 'Gender':
        feature_info['Justification'] = f"'{column}' differences might influence behavior and satisfaction, impacting churn."
    elif column == 'Age':
        feature_info['Justification'] = f"'{column}' may correlate with loyalty; younger customers may be more likely to churn."
    elif column == 'Tenure':
        feature_info['Justification'] = f"'{column}' indicates loyalty; shorter tenure may increase churn likelihood."
    elif column == 'Balance':
        feature_info['Justification'] = f"'{column}' Customers with zero balance might be less engaged and more likely to churn."
    elif column == 'NumOfProducts':
        feature_info['Justification'] = f"'{column}' More products indicate higher engagement, potentially reducing churn."
    elif column == 'HasCrCard':
        feature_info['Justification'] = f"'{column}' Credit card holders may be more engaged, indicating lower churn likelihood."
    elif column == 'IsActiveMember':
        feature_info['Justification'] = f"'{column}' flags active members, who are more engaged and often show lower churn rates."
    elif column == 'EstimatedSalary':
        feature_info['Justification'] = f"'{column}' may correlate with financial stability, influencing churn risk."
    elif column == 'Exited':
        feature_info['Justification'] = f"'{column}' target variable, indicating if the customer has churned (1) or not (0)."
    
    # Append the feature_info to summary list
    summary.append(feature_info)

# Convert summary to a DataFrame for display
summary_df = pd.DataFrame(summary)

# Display the summary DataFrame
summary_df

,Feature,Data_Type,Min,Max,Range,Justification
0,CustomerId,int64,15565701.0,15815690.0,15565701 - 15815690,'CustomerId' is unique for each customer and d...
1,Surname,object,,,(Non-numeric),'Surname' does not influence customer churn an...
2,CreditScore,int64,350.0,850.0,350 - 850,"'CreditScore' may affect financial stability, ..."
3,Geography,object,,,(Non-numeric),'Geography' might reflect regional satisfactio...
4,Gender,object,,,(Non-numeric),'Gender' differences might influence behavior ...
5,Age,int64,18.0,92.0,18 - 92,'Age' may correlate with loyalty; younger cust...
6,Tenure,int64,0.0,10.0,0 - 10,'Tenure' indicates loyalty; shorter tenure may...
7,Balance,float64,0.0,250898.09,0.0 - 250898.09,'Balance' Customers with zero balance might be...
8,NumOfProducts,int64,1.0,4.0,1 - 4,'NumOfProducts' More products indicate higher ...
9,HasCrCard,int64,0.0,1.0,0 - 1,'HasCrCard' Credit card holders may be more en...
