# Binary Classification with Bank Churn Dataset
## This is a Kaggle playground competition for improving machine learning and data science skills. The competition page can be found [here](https://www.kaggle.com/competitions/playground-series-s4e1/overview).


## Importing Libraries

In [6]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

## Data Description and EDA

* **CustomerId:** A unique identifier for each customer
* **Surname:** The customer's surname or last name
* **CreditScore:** A numerical value representing the customer's credit score
* **Geography:** The country where the customer resides (France, Spain or Germany)
* **Gender:** The customer's gender (Male or Female)
* **Age:** The customer's age
* **Tenure:** The number of years the customer has been with the bank
* **Balance:** The customer's account balance
* **NumOfProducts:** The number of bank products the customer uses (e.g., savings account, credit card)
* **HasCrCard:** Whether the customer has a credit card (1 = yes, 0 = no)
* **IsActiveMember:** Whether the customer is an active member (1 = yes, 0 = no)
* **EstimatedSalary:** The estimated salary of the customer
* **Exited:** Whether the customer has churned (1 = yes, 0 = no)

In [10]:
train_data = pd.read_csv("../data/bank_churn_competition/train.csv")
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165034 entries, 0 to 165033
Data columns (total 14 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   id               165034 non-null  int64  
 1   CustomerId       165034 non-null  int64  
 2   Surname          165034 non-null  object 
 3   CreditScore      165034 non-null  int64  
 4   Geography        165034 non-null  object 
 5   Gender           165034 non-null  object 
 6   Age              165034 non-null  float64
 7   Tenure           165034 non-null  int64  
 8   Balance          165034 non-null  float64
 9   NumOfProducts    165034 non-null  int64  
 10  HasCrCard        165034 non-null  float64
 11  IsActiveMember   165034 non-null  float64
 12  EstimatedSalary  165034 non-null  float64
 13  Exited           165034 non-null  int64  
dtypes: float64(5), int64(6), object(3)
memory usage: 17.6+ MB
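
The `info()` output shows no missing values in any of the 14 columns. As a further sanity check, the sketch below compares the categorical and binary columns against the values listed in the data description. It runs on a small made-up `sample` frame so that it is self-contained; the `expected_values` mapping and the `unexpected_values` helper are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Hypothetical rows standing in for train_data (values invented for illustration).
sample = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Male", "Female", "Female", "Male"],
    "HasCrCard": [1.0, 0.0, 1.0, 1.0],
    "IsActiveMember": [0.0, 1.0, 1.0, 0.0],
    "Exited": [0, 1, 0, 0],
})

# Allowed values per column, taken from the data description.
expected_values = {
    "Geography": {"France", "Spain", "Germany"},
    "Gender": {"Male", "Female"},
    "HasCrCard": {0, 1},
    "IsActiveMember": {0, 1},
    "Exited": {0, 1},
}


def unexpected_values(frame, expected):
    """Return {column: values not listed in `expected`}; empty dict if all match."""
    problems = {}
    for column, allowed in expected.items():
        extra = set(frame[column].unique()) - allowed
        if extra:
            problems[column] = extra
    return problems


print(unexpected_values(sample, expected_values))
```

To check the real data, pass `train_data` instead of `sample`; an empty dictionary means every column contains only the documented values.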


In [29]:
# Every column except the row id and the target is treated as a feature.
feature_columns = [column for column in train_data.columns if column not in ("id", "Exited")]
target_column = ["Exited"]
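
The feature list still contains the text columns `Geography` and `Gender`, which need to be encoded as numbers before most models can use them. The notebook's own encoding step is not shown in this section, so the sketch below is only one possible approach: it swaps in `pd.get_dummies` instead of the imported `OneHotEncoder`, and it runs on a made-up `sample` frame with invented values.

```python
import pandas as pd

# Hypothetical rows standing in for train_data (values invented for illustration).
sample = pd.DataFrame({
    "id": [0, 1, 2],
    "Geography": ["France", "Spain", "Germany"],
    "Gender": ["Male", "Female", "Female"],
    "Age": [33.0, 41.0, 29.0],
    "Exited": [0, 1, 0],
})

feature_columns = [column for column in sample.columns if column not in ("id", "Exited")]
target_column = ["Exited"]

# Separate the feature matrix from the target vector.
X = sample[feature_columns]
y = sample[target_column[0]]

# One-hot encode the text columns; numeric columns pass through unchanged.
X_encoded = pd.get_dummies(X, columns=["Geography", "Gender"])
print(X_encoded.columns.tolist())
```

Each category becomes its own indicator column (for example `Geography_France`), so no artificial ordering is imposed on the countries.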

In [30]:
# Print basic summary statistics for every numeric feature.
for column in feature_columns:
    data = train_data[column]
    if data.dtype == "float64" or data.dtype == "int64":
        print(f"{column}:\nMax: {data.max():.2f}\n"
              f"Min: {data.min():.2f}\n"
              f"Mean: {data.mean():.3f}"
              f"\nMedian {data.median():.3f}")
        print("----------\n")

CustomerId:
Max: 15815690.00
Min: 15565701.00
Mean: 15692005.019
Median 15690169.000
----------

CreditScore:
Max: 850.00
Min: 350.00
Mean: 656.454
Median 659.000
----------

Age:
Max: 92.00
Min: 18.00
Mean: 38.126
Median 37.000
----------

Tenure:
Max: 10.00
Min: 0.00
Mean: 5.020
Median 5.000
----------

Balance:
Max: 250898.09
Min: 0.00
Mean: 55478.087
Median 0.000
----------

NumOfProducts:
Max: 4.00
Min: 1.00
Mean: 1.554
Median 2.000
----------

HasCrCard:
Max: 1.00
Min: 0.00
Mean: 0.754
Median 1.000
----------

IsActiveMember:
Max: 1.00
Min: 0.00
Mean: 0.498
Median 0.000
----------

EstimatedSalary:
Max: 199992.48
Min: 11.58
Mean: 112574.823
Median 117948.000
----------
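
The same four statistics can be gathered into a single table, which is easier to scan than the per-column printout. The sketch below uses `select_dtypes` and `agg` on a small made-up `sample` frame (values invented for illustration); `numeric_summary` is a hypothetical helper name. Since the median `Balance` in the output above is 0, meaning at least half of the customers hold a zero balance, the share of zero balances is computed as well.

```python
import pandas as pd

# Hypothetical rows standing in for train_data (values invented for illustration).
sample = pd.DataFrame({
    "CreditScore": [600, 700, 650, 800],
    "Geography": ["France", "Spain", "Germany", "France"],
    "Balance": [0.0, 0.0, 120000.0, 0.0],
})


def numeric_summary(frame):
    """Max, min, mean and median of every numeric column, one row per column."""
    numeric = frame.select_dtypes(include="number")
    return numeric.agg(["max", "min", "mean", "median"]).T


summary = numeric_summary(sample)
print(summary.round(3))

# Fraction of rows whose balance is exactly zero.
zero_balance_share = (sample["Balance"] == 0).mean()
print(f"Share of zero balances: {zero_balance_share:.2%}")
```

Non-numeric columns such as `Geography` are skipped automatically, which replaces the explicit dtype check in the loop.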


In [4]:
print(f"Target values: {train_data['Exited'].unique()}")

Target values: [0 1]
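
`unique()` confirms that the target is binary but says nothing about how the two classes are distributed. A minimal sketch of a class-balance check with `value_counts(normalize=True)` follows, using made-up values rather than the competition data:

```python
import pandas as pd

# Hypothetical target column standing in for train_data["Exited"].
exited = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="Exited")

# Fraction of rows in each class, sorted by class label.
class_share = exited.value_counts(normalize=True).sort_index()
print(class_share)
```

If the classes turn out to be imbalanced, passing `stratify=y` to `train_test_split` keeps the class ratio the same in the training and validation sets.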
