# **Project 8: Building Bank Customers Predictor**

***NumPy*** : *A powerful library for numerical computing in Python, providing support for arrays, matrices, and a variety of mathematical operations.*
  
***Seaborn*** : *A statistical data visualization library based on Matplotlib, offering a high-level interface for creating attractive and informative statistical graphics.*

***Matplotlib*** : *A comprehensive plotting library for Python, enabling the creation of static, animated, and interactive visualizations in various formats.*

***Pandas*** : *A versatile data manipulation and analysis library for Python, offering data structures and functions for efficiently handling structured data, such as data frames.*

**"*%matplotlib inline*"** : *A magic command used in Jupyter notebooks to display Matplotlib plots directly within the notebook interface, so that visualizations sit alongside code and text.*

***Warnings*** : *A standard-library Python module for handling warnings raised by the interpreter or by libraries. It lets developers control whether warnings are displayed, logged, or ignored during program execution.*

# **Data Loading**

**Importing the necessary modules for data loading, data cleaning, and data visualization.**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
%matplotlib inline

**The line warnings.filterwarnings("ignore") suppresses all warnings raised from this point on, so they are no longer displayed during program execution.**

In [None]:
warnings.filterwarnings("ignore")
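**A global "ignore" filter also hides warnings that may matter. As a minimal sketch of a narrower alternative, the standard-library context manager warnings.catch_warnings can silence one category inside a block only. The function "noisy" is a hypothetical helper written for this illustration, and record=True is used here only to make the effect observable.**

```python
import warnings


def noisy():
    # Hypothetical helper that emits a warning, for illustration only.
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42


# Ignore DeprecationWarning inside this block only; the previous
# filters are restored automatically when the block exits.
with warnings.catch_warnings(record=True) as caught_inside:
    warnings.simplefilter("ignore", category=DeprecationWarning)
    result = noisy()

# With an "always" filter, the same call records its warning.
with warnings.catch_warnings(record=True) as caught_outside:
    warnings.simplefilter("always")
    noisy()

print(len(caught_inside), len(caught_outside))
```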

**Reads the CSV file "Bank customers.csv" into a Pandas DataFrame called "df" and displays the first 5 rows of the DataFrame.**

In [None]:
df = pd.read_csv("/content/Bank customers.csv")
df.head(5)

Unnamed: 0,CLIENTNUM,Attrition_Flag,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,...,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio
0,768805383,Existing Customer,45,M,3,High School,Married,$60K - $80K,Blue,39,...,1,3,12691.0,777,11914.0,1.335,1144,42,1.625,0.061
1,818770008,Existing Customer,49,F,5,Graduate,Single,Less than $40K,Blue,44,...,1,2,8256.0,864,7392.0,1.541,1291,33,3.714,0.105
2,713982108,Existing Customer,51,M,3,Graduate,Married,$80K - $120K,Blue,36,...,1,0,3418.0,0,3418.0,2.594,1887,20,2.333,0.0
3,769911858,Existing Customer,40,F,4,High School,Unknown,Less than $40K,Blue,34,...,4,1,3313.0,2517,796.0,1.405,1171,20,2.333,0.76
4,709106358,Existing Customer,40,M,3,Uneducated,Married,$60K - $80K,Blue,21,...,1,0,4716.0,0,4716.0,2.175,816,28,2.5,0.0


# **Data Cleaning**

**The first command removes the "CLIENTNUM" column from the DataFrame "df"; the second command displays the first five rows of "df".**

In [None]:
df = df.drop(columns=["CLIENTNUM"])
df.head(5)

Unnamed: 0,Attrition_Flag,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio
0,Existing Customer,45,M,3,High School,Married,$60K - $80K,Blue,39,5,1,3,12691.0,777,11914.0,1.335,1144,42,1.625,0.061
1,Existing Customer,49,F,5,Graduate,Single,Less than $40K,Blue,44,6,1,2,8256.0,864,7392.0,1.541,1291,33,3.714,0.105
2,Existing Customer,51,M,3,Graduate,Married,$80K - $120K,Blue,36,4,1,0,3418.0,0,3418.0,2.594,1887,20,2.333,0.0
3,Existing Customer,40,F,4,High School,Unknown,Less than $40K,Blue,34,3,4,1,3313.0,2517,796.0,1.405,1171,20,2.333,0.76
4,Existing Customer,40,M,3,Uneducated,Married,$60K - $80K,Blue,21,5,1,0,4716.0,0,4716.0,2.175,816,28,2.5,0.0


**Displays a summary of the DataFrame "df", including the data types and non-null counts for each column.**

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Attrition_Flag            10127 non-null  object 
 1   Customer_Age              10127 non-null  int64  
 2   Gender                    10127 non-null  object 
 3   Dependent_count           10127 non-null  int64  
 4   Education_Level           10127 non-null  object 
 5   Marital_Status            10127 non-null  object 
 6   Income_Category           10127 non-null  object 
 7   Card_Category             10127 non-null  object 
 8   Months_on_book            10127 non-null  int64  
 9   Total_Relationship_Count  10127 non-null  int64  
 10  Months_Inactive_12_mon    10127 non-null  int64  
 11  Contacts_Count_12_mon     10127 non-null  int64  
 12  Credit_Limit              10127 non-null  float64
 13  Total_Revolving_Bal       10127 non-null  int64  
 14  Avg_Op

**Calculates and displays the total number of missing values for each column in the DataFrame "df".**

In [None]:
df.isnull().sum()

Unnamed: 0,0
Attrition_Flag,0
Customer_Age,0
Gender,0
Dependent_count,0
Education_Level,0
Marital_Status,0
Income_Category,0
Card_Category,0
Months_on_book,0
Total_Relationship_Count,0


**Counts and displays the occurrences of each unique value in the "Attrition_Flag" column of the DataFrame "df".**

In [None]:
df["Attrition_Flag"].value_counts()

Unnamed: 0_level_0,count
Attrition_Flag,Unnamed: 1_level_1
Existing Customer,8500
Attrited Customer,1627


**The first command converts the categorical "Attrition_Flag" column in the DataFrame "df" into one-hot encoded columns; the second command displays the first five rows of "df".**

In [None]:
df = pd.get_dummies(df, columns=["Attrition_Flag"])
df.head(5)

Unnamed: 0,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,...,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio,Attrition_Flag_Attrited Customer,Attrition_Flag_Existing Customer
0,45,M,3,High School,Married,$60K - $80K,Blue,39,5,1,...,12691.0,777,11914.0,1.335,1144,42,1.625,0.061,False,True
1,49,F,5,Graduate,Single,Less than $40K,Blue,44,6,1,...,8256.0,864,7392.0,1.541,1291,33,3.714,0.105,False,True
2,51,M,3,Graduate,Married,$80K - $120K,Blue,36,4,1,...,3418.0,0,3418.0,2.594,1887,20,2.333,0.0,False,True
3,40,F,4,High School,Unknown,Less than $40K,Blue,34,3,4,...,3313.0,2517,796.0,1.405,1171,20,2.333,0.76,False,True
4,40,M,3,Uneducated,Married,$60K - $80K,Blue,21,5,1,...,4716.0,0,4716.0,2.175,816,28,2.5,0.0,False,True


**The first command renames the specified columns in the DataFrame "df"; the second command displays the first five rows of "df".**

In [None]:
df.rename(columns={"Attrition_Flag_Attrited Customer": "Attrited Customer", "Attrition_Flag_Existing Customer": "Existing Customer"}, inplace=True)
df.head(5)

Unnamed: 0,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,...,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio,Attrited Customer,Existing Customer
0,45,M,3,High School,Married,$60K - $80K,Blue,39,5,1,...,12691.0,777,11914.0,1.335,1144,42,1.625,0.061,False,True
1,49,F,5,Graduate,Single,Less than $40K,Blue,44,6,1,...,8256.0,864,7392.0,1.541,1291,33,3.714,0.105,False,True
2,51,M,3,Graduate,Married,$80K - $120K,Blue,36,4,1,...,3418.0,0,3418.0,2.594,1887,20,2.333,0.0,False,True
3,40,F,4,High School,Unknown,Less than $40K,Blue,34,3,4,...,3313.0,2517,796.0,1.405,1171,20,2.333,0.76,False,True
4,40,M,3,Uneducated,Married,$60K - $80K,Blue,21,5,1,...,4716.0,0,4716.0,2.175,816,28,2.5,0.0,False,True


**The first two commands convert the "Attrited Customer" and "Existing Customer" columns in the DataFrame "df" from booleans to integers (1/0); the third command displays the first three rows of "df".**

In [None]:
df["Attrited Customer"] = df["Attrited Customer"].astype(int)
df["Existing Customer"] = df["Existing Customer"].astype(int)
df.head(3)

Unnamed: 0,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,...,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio,Attrited Customer,Existing Customer
0,45,M,3,High School,Married,$60K - $80K,Blue,39,5,1,...,12691.0,777,11914.0,1.335,1144,42,1.625,0.061,0,1
1,49,F,5,Graduate,Single,Less than $40K,Blue,44,6,1,...,8256.0,864,7392.0,1.541,1291,33,3.714,0.105,0,1
2,51,M,3,Graduate,Married,$80K - $120K,Blue,36,4,1,...,3418.0,0,3418.0,2.594,1887,20,2.333,0.0,0,1
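**The encode, rename, and convert steps above can also be done in one call. The sketch below is not the notebook's own method: it runs on a small made-up frame ("toy") and relies on the prefix, prefix_sep, and dtype parameters of pd.get_dummies to produce integer columns named after the category values.**

```python
import pandas as pd

# Toy frame with made-up rows, standing in for the real data.
toy = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer", "Attrited Customer", "Existing Customer"],
    "Customer_Age": [45, 49, 51],
})

# An empty prefix and separator keep the bare category names as column
# names; dtype=int yields 1/0 instead of True/False.
encoded = pd.get_dummies(
    toy,
    columns=["Attrition_Flag"],
    prefix="",
    prefix_sep="",
    dtype=int,
)

print(encoded)
```

**An empty prefix is only safe when category names are unique across all encoded columns. In this dataset several columns contain the value "Unknown", which is why the notebook renames those columns individually.**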


**Counts and displays the occurrences of each unique value in the "Gender" column of the DataFrame "df".**

In [None]:
df["Gender"].value_counts()

Unnamed: 0_level_0,count
Gender,Unnamed: 1_level_1
F,5358
M,4769


**The first command converts the categorical "Gender" column in the DataFrame "df" into one-hot encoded columns; the second command displays the first five rows of "df".**

In [None]:
df = pd.get_dummies(df, columns=["Gender"])
df.head(5)

Unnamed: 0,Customer_Age,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,...,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio,Attrited Customer,Existing Customer,Gender_F,Gender_M
0,45,3,High School,Married,$60K - $80K,Blue,39,5,1,3,...,11914.0,1.335,1144,42,1.625,0.061,0,1,False,True
1,49,5,Graduate,Single,Less than $40K,Blue,44,6,1,2,...,7392.0,1.541,1291,33,3.714,0.105,0,1,True,False
2,51,3,Graduate,Married,$80K - $120K,Blue,36,4,1,0,...,3418.0,2.594,1887,20,2.333,0.0,0,1,False,True
3,40,4,High School,Unknown,Less than $40K,Blue,34,3,4,1,...,796.0,1.405,1171,20,2.333,0.76,0,1,True,False
4,40,3,Uneducated,Married,$60K - $80K,Blue,21,5,1,0,...,4716.0,2.175,816,28,2.5,0.0,0,1,False,True


**The first command renames the specified columns ("Gender_F" to "Female" and "Gender_M" to "Male") in the DataFrame "df"; the second command displays the first five rows of "df".**

In [None]:
df.rename(columns={"Gender_F": "Female", "Gender_M": "Male"}, inplace=True)
df.head(5)

Unnamed: 0,Customer_Age,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,...,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio,Attrited Customer,Existing Customer,Female,Male
0,45,3,High School,Married,$60K - $80K,Blue,39,5,1,3,...,11914.0,1.335,1144,42,1.625,0.061,0,1,False,True
1,49,5,Graduate,Single,Less than $40K,Blue,44,6,1,2,...,7392.0,1.541,1291,33,3.714,0.105,0,1,True,False
2,51,3,Graduate,Married,$80K - $120K,Blue,36,4,1,0,...,3418.0,2.594,1887,20,2.333,0.0,0,1,False,True
3,40,4,High School,Unknown,Less than $40K,Blue,34,3,4,1,...,796.0,1.405,1171,20,2.333,0.76,0,1,True,False
4,40,3,Uneducated,Married,$60K - $80K,Blue,21,5,1,0,...,4716.0,2.175,816,28,2.5,0.0,0,1,False,True


**The first two commands convert the "Male" and "Female" columns in the DataFrame "df" to integers; the third command displays the first three rows of "df".**

In [None]:
df["Male"] = df["Male"].astype(int)
df["Female"] = df["Female"].astype(int)
df.head(3)

Unnamed: 0,Customer_Age,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,...,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio,Attrited Customer,Existing Customer,Female,Male
0,45,3,High School,Married,$60K - $80K,Blue,39,5,1,3,...,11914.0,1.335,1144,42,1.625,0.061,0,1,0,1
1,49,5,Graduate,Single,Less than $40K,Blue,44,6,1,2,...,7392.0,1.541,1291,33,3.714,0.105,0,1,1,0
2,51,3,Graduate,Married,$80K - $120K,Blue,36,4,1,0,...,3418.0,2.594,1887,20,2.333,0.0,0,1,0,1


**Counts and displays the occurrences of each unique value in the "Education_Level" column of the DataFrame "df".**

In [None]:
df["Education_Level"].value_counts()

Unnamed: 0_level_0,count
Education_Level,Unnamed: 1_level_1
Graduate,3128
High School,2013
Unknown,1519
Uneducated,1487
College,1013
Post-Graduate,516
Doctorate,451


**The first command converts the categorical "Education_Level" column in the DataFrame "df" into one-hot encoded columns; the second command displays the first five rows of "df".**

In [None]:
df = pd.get_dummies(df, columns=["Education_Level"])
df.head(5)

Unnamed: 0,Customer_Age,Dependent_count,Marital_Status,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,...,Existing Customer,Female,Male,Education_Level_College,Education_Level_Doctorate,Education_Level_Graduate,Education_Level_High School,Education_Level_Post-Graduate,Education_Level_Uneducated,Education_Level_Unknown
0,45,3,Married,$60K - $80K,Blue,39,5,1,3,12691.0,...,1,0,1,False,False,False,True,False,False,False
1,49,5,Single,Less than $40K,Blue,44,6,1,2,8256.0,...,1,1,0,False,False,True,False,False,False,False
2,51,3,Married,$80K - $120K,Blue,36,4,1,0,3418.0,...,1,0,1,False,False,True,False,False,False,False
3,40,4,Unknown,Less than $40K,Blue,34,3,4,1,3313.0,...,1,1,0,False,False,False,True,False,False,False
4,40,3,Married,$60K - $80K,Blue,21,5,1,0,4716.0,...,1,0,1,False,False,False,False,False,True,False


**The first command renames the specified columns related to education levels in the DataFrame "df"; the second command displays the first five rows of "df".**

In [None]:
df.rename(columns={"Education_Level_College": "College",
                   "Education_Level_Doctorate": "Doctorate",
                   "Education_Level_Graduate": "Graduate",
                   "Education_Level_High School": "High School",
                   "Education_Level_Post-Graduate": "Post-Graduate",
                   "Education_Level_Uneducated": "Uneducated",
                   "Education_Level_Unknown": "Education-Unknown"}, inplace=True)
df.head(5)

Unnamed: 0,Customer_Age,Dependent_count,Marital_Status,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,...,Existing Customer,Female,Male,College,Doctorate,Graduate,High School,Post-Graduate,Uneducated,Education-Unknown
0,45,3,Married,$60K - $80K,Blue,39,5,1,3,12691.0,...,1,0,1,False,False,False,True,False,False,False
1,49,5,Single,Less than $40K,Blue,44,6,1,2,8256.0,...,1,1,0,False,False,True,False,False,False,False
2,51,3,Married,$80K - $120K,Blue,36,4,1,0,3418.0,...,1,0,1,False,False,True,False,False,False,False
3,40,4,Unknown,Less than $40K,Blue,34,3,4,1,3313.0,...,1,1,0,False,False,False,True,False,False,False
4,40,3,Married,$60K - $80K,Blue,21,5,1,0,4716.0,...,1,0,1,False,False,False,False,False,True,False


**These commands convert each one-hot education-level column from boolean values (True/False) to integers (1/0) for further analysis or modeling; the last command displays the first three rows of "df".**

In [None]:
df["College"] = df["College"].astype(int)
df["Doctorate"] = df["Doctorate"].astype(int)
df["Graduate"] = df["Graduate"].astype(int)
df["High School"] = df["High School"].astype(int)
df["Post-Graduate"] = df["Post-Graduate"].astype(int)
df["Uneducated"] = df["Uneducated"].astype(int)
df["Education-Unknown"] = df["Education-Unknown"].astype(int)
df.head(3)

Unnamed: 0,Customer_Age,Dependent_count,Marital_Status,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,...,Existing Customer,Female,Male,College,Doctorate,Graduate,High School,Post-Graduate,Uneducated,Education-Unknown
0,45,3,Married,$60K - $80K,Blue,39,5,1,3,12691.0,...,1,0,1,0,0,0,1,0,0,0
1,49,5,Single,Less than $40K,Blue,44,6,1,2,8256.0,...,1,1,0,0,0,1,0,0,0,0
2,51,3,Married,$80K - $120K,Blue,36,4,1,0,3418.0,...,1,0,1,0,0,1,0,0,0,0


**Lists the column names of "df" at this point: the original customer and demographic columns plus the encoded and renamed columns created in the previous steps.**

In [None]:
df.columns

Index(['Customer_Age', 'Dependent_count', 'Marital_Status', 'Income_Category',
       'Card_Category', 'Months_on_book', 'Total_Relationship_Count',
       'Months_Inactive_12_mon', 'Contacts_Count_12_mon', 'Credit_Limit',
       'Total_Revolving_Bal', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1',
       'Total_Trans_Amt', 'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1',
       'Avg_Utilization_Ratio', 'Attrited Customer', 'Existing Customer',
       'Female', 'Male', 'College', 'Doctorate', 'Graduate', 'High School',
       'Post-Graduate', 'Uneducated', 'Education-Unknown'],
      dtype='object')

**This command counts and displays the occurrences of each unique value in the "Marital_Status" column of the DataFrame df.**

In [None]:
df["Marital_Status"].value_counts()

Unnamed: 0_level_0,count
Marital_Status,Unnamed: 1_level_1
Married,4687
Single,3943
Unknown,749
Divorced,748


**The first command converts the categorical "Marital_Status" column in the DataFrame "df" into one-hot encoded columns; the second command displays the first five rows of "df".**

In [None]:
df = pd.get_dummies(df, columns=["Marital_Status"])
df.head(5)

Unnamed: 0,Customer_Age,Dependent_count,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,...,Doctorate,Graduate,High School,Post-Graduate,Uneducated,Education-Unknown,Marital_Status_Divorced,Marital_Status_Married,Marital_Status_Single,Marital_Status_Unknown
0,45,3,$60K - $80K,Blue,39,5,1,3,12691.0,777,...,0,0,1,0,0,0,False,True,False,False
1,49,5,Less than $40K,Blue,44,6,1,2,8256.0,864,...,0,1,0,0,0,0,False,False,True,False
2,51,3,$80K - $120K,Blue,36,4,1,0,3418.0,0,...,0,1,0,0,0,0,False,True,False,False
3,40,4,Less than $40K,Blue,34,3,4,1,3313.0,2517,...,0,0,1,0,0,0,False,False,False,True
4,40,3,$60K - $80K,Blue,21,5,1,0,4716.0,0,...,0,0,0,0,1,0,False,True,False,False


**The first command renames the specified columns related to marital status in the DataFrame "df"; the second command displays the first five rows of "df".**

In [None]:
df.rename(columns={"Marital_Status_Divorced": "Divorced",
                   "Marital_Status_Married": "Married",
                   "Marital_Status_Single": "Single",
                   "Marital_Status_Unknown": "Marital-Unknown"}, inplace=True)
df.head(5)

Unnamed: 0,Customer_Age,Dependent_count,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,...,Doctorate,Graduate,High School,Post-Graduate,Uneducated,Education-Unknown,Divorced,Married,Single,Marital-Unknown
0,45,3,$60K - $80K,Blue,39,5,1,3,12691.0,777,...,0,0,1,0,0,0,False,True,False,False
1,49,5,Less than $40K,Blue,44,6,1,2,8256.0,864,...,0,1,0,0,0,0,False,False,True,False
2,51,3,$80K - $120K,Blue,36,4,1,0,3418.0,0,...,0,1,0,0,0,0,False,True,False,False
3,40,4,Less than $40K,Blue,34,3,4,1,3313.0,2517,...,0,0,1,0,0,0,False,False,False,True
4,40,3,$60K - $80K,Blue,21,5,1,0,4716.0,0,...,0,0,0,0,1,0,False,True,False,False


**These commands convert each one-hot marital-status column from boolean values (True/False) to integers (1/0) for further analysis or modeling; the last command displays the first five rows of "df".**

In [None]:
df["Divorced"] = df["Divorced"].astype(int)
df["Married"] = df["Married"].astype(int)
df["Single"] = df["Single"].astype(int)
df["Marital-Unknown"] = df["Marital-Unknown"].astype(int)
df.head(5)

Unnamed: 0,Customer_Age,Dependent_count,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,...,Doctorate,Graduate,High School,Post-Graduate,Uneducated,Education-Unknown,Divorced,Married,Single,Marital-Unknown
0,45,3,$60K - $80K,Blue,39,5,1,3,12691.0,777,...,0,0,1,0,0,0,0,1,0,0
1,49,5,Less than $40K,Blue,44,6,1,2,8256.0,864,...,0,1,0,0,0,0,0,0,1,0
2,51,3,$80K - $120K,Blue,36,4,1,0,3418.0,0,...,0,1,0,0,0,0,0,1,0,0
3,40,4,Less than $40K,Blue,34,3,4,1,3313.0,2517,...,0,0,1,0,0,0,0,0,0,1
4,40,3,$60K - $80K,Blue,21,5,1,0,4716.0,0,...,0,0,0,0,1,0,0,1,0,0


**This command counts and displays the occurrences of each unique value in the "Income_Category" column of the DataFrame "df".**

In [None]:
df["Income_Category"].value_counts()

Unnamed: 0_level_0,count
Income_Category,Unnamed: 1_level_1
Less than $40K,3561
$40K - $60K,1790
$80K - $120K,1535
$60K - $80K,1402
Unknown,1112
$120K +,727


**The first command converts the categorical "Income_Category" column in the DataFrame "df" into one-hot encoded columns; the second command displays the first five rows of "df".**

In [None]:
df = pd.get_dummies(df, columns=["Income_Category"])
df.head(5)

Unnamed: 0,Customer_Age,Dependent_count,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,...,Divorced,Married,Single,Marital-Unknown,Income_Category_$120K +,Income_Category_$40K - $60K,Income_Category_$60K - $80K,Income_Category_$80K - $120K,Income_Category_Less than $40K,Income_Category_Unknown
0,45,3,Blue,39,5,1,3,12691.0,777,11914.0,...,0,1,0,0,False,False,True,False,False,False
1,49,5,Blue,44,6,1,2,8256.0,864,7392.0,...,0,0,1,0,False,False,False,False,True,False
2,51,3,Blue,36,4,1,0,3418.0,0,3418.0,...,0,1,0,0,False,False,False,True,False,False
3,40,4,Blue,34,3,4,1,3313.0,2517,796.0,...,0,0,0,1,False,False,False,False,True,False
4,40,3,Blue,21,5,1,0,4716.0,0,4716.0,...,0,1,0,0,False,False,True,False,False,False


**These commands rename the income category columns to shorter and more readable names for ease of use and understanding in further data analysis.**

In [None]:
df.rename(columns={"Income_Category_$120K +": "$120K +",
                   "Income_Category_$40K - $60K": "$40K - $60K",
                   "Income_Category_$60K - $80K": "$60K - $80K",
                   "Income_Category_$80K - $120K": "$80K - $120K",
                   "Income_Category_Less than $40K": "Less than $40K",
                   "Income_Category_Unknown": "Income-Unknown"}, inplace=True)

df.head(5)

Unnamed: 0,Customer_Age,Dependent_count,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,...,Divorced,Married,Single,Marital-Unknown,$120K +,$40K - $60K,$60K - $80K,$80K - $120K,Less than $40K,Income-Unknown
0,45,3,Blue,39,5,1,3,12691.0,777,11914.0,...,0,1,0,0,False,False,True,False,False,False
1,49,5,Blue,44,6,1,2,8256.0,864,7392.0,...,0,0,1,0,False,False,False,False,True,False
2,51,3,Blue,36,4,1,0,3418.0,0,3418.0,...,0,1,0,0,False,False,False,True,False,False
3,40,4,Blue,34,3,4,1,3313.0,2517,796.0,...,0,0,0,1,False,False,False,False,True,False
4,40,3,Blue,21,5,1,0,4716.0,0,4716.0,...,0,1,0,0,False,False,True,False,False,False


**These commands convert each one-hot income-category column from boolean values (True/False) to integers (1/0) for further analysis or modeling; the last command displays the first five rows of "df".**

In [None]:
df["$120K +"] = df["$120K +"].astype(int)
df["$40K - $60K"] = df["$40K - $60K"].astype(int)
df["$60K - $80K"] = df["$60K - $80K"].astype(int)
df["$80K - $120K"] = df["$80K - $120K"].astype(int)
df["Less than $40K"] = df["Less than $40K"].astype(int)
df["Income-Unknown"] = df["Income-Unknown"].astype(int)
df.head(5)

Unnamed: 0,Customer_Age,Dependent_count,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,...,Divorced,Married,Single,Marital-Unknown,$120K +,$40K - $60K,$60K - $80K,$80K - $120K,Less than $40K,Income-Unknown
0,45,3,Blue,39,5,1,3,12691.0,777,11914.0,...,0,1,0,0,0,0,1,0,0,0
1,49,5,Blue,44,6,1,2,8256.0,864,7392.0,...,0,0,1,0,0,0,0,0,1,0
2,51,3,Blue,36,4,1,0,3418.0,0,3418.0,...,0,1,0,0,0,0,0,1,0,0
3,40,4,Blue,34,3,4,1,3313.0,2517,796.0,...,0,0,0,1,0,0,0,0,1,0
4,40,3,Blue,21,5,1,0,4716.0,0,4716.0,...,0,1,0,0,0,0,1,0,0,0


**These commands use LabelEncoder from sklearn.preprocessing to encode the "Card_Category" column of the DataFrame "df" as integer labels, then display the first five rows of "df".**

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df["Card_Category"] = le.fit_transform(df["Card_Category"])
df.head(5)

Unnamed: 0,Customer_Age,Dependent_count,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,...,Divorced,Married,Single,Marital-Unknown,$120K +,$40K - $60K,$60K - $80K,$80K - $120K,Less than $40K,Income-Unknown
0,45,3,0,39,5,1,3,12691.0,777,11914.0,...,0,1,0,0,0,0,1,0,0,0
1,49,5,0,44,6,1,2,8256.0,864,7392.0,...,0,0,1,0,0,0,0,0,1,0
2,51,3,0,36,4,1,0,3418.0,0,3418.0,...,0,1,0,0,0,0,0,1,0,0
3,40,4,0,34,3,4,1,3313.0,2517,796.0,...,0,0,0,1,0,0,0,0,1,0
4,40,3,0,21,5,1,0,4716.0,0,4716.0,...,0,1,0,0,0,0,1,0,0,0


**This command counts how many times each encoded value appears in the "Card_Category" column.**

In [None]:
df["Card_Category"].value_counts()

Unnamed: 0_level_0,count
Card_Category,Unnamed: 1_level_1
0,9436
3,555
1,116
2,20
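**The encoded values are hard to read without the label behind each one. LabelEncoder stores the original labels, in sorted order, in its classes_ attribute, and the position of each label is its code. The sketch below uses a made-up list of card names: only "Blue" appears in the outputs above, and the other names are assumptions for illustration.**

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical card labels; only "Blue" is confirmed by the outputs above.
cards = ["Blue", "Silver", "Blue", "Gold", "Platinum"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(cards)

# classes_ is sorted, and each label's position is its integer code.
mapping = {label: code for code, label in enumerate(le_demo.classes_)}
print(mapping)

# inverse_transform turns codes back into the original labels.
decoded = list(le_demo.inverse_transform([0, 3]))
print(decoded)
```

**On the real data, le.classes_ from the cell above gives the same lookup for the fitted encoder.**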


**This command displays all the columns present in the DataFrame after data cleaning.**

In [None]:
df.columns

Index(['Customer_Age', 'Dependent_count', 'Card_Category', 'Months_on_book',
       'Total_Relationship_Count', 'Months_Inactive_12_mon',
       'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
       'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
       'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio',
       'Attrited Customer', 'Existing Customer', 'Female', 'Male', 'College',
       'Doctorate', 'Graduate', 'High School', 'Post-Graduate', 'Uneducated',
       'Education-Unknown', 'Divorced', 'Married', 'Single', 'Marital-Unknown',
       '$120K +', '$40K - $60K', '$60K - $80K', '$80K - $120K',
       'Less than $40K', 'Income-Unknown'],
      dtype='object')

# **Model Building**

**X: Contains the features (customer demographics, account activity, attrition flags, education, marital status, income) used for predicting "Card_Category".**

**y: Target variable representing "Card_Category".**

In [None]:
X = df[['Customer_Age', 'Dependent_count', 'Months_on_book',
       'Total_Relationship_Count', 'Months_Inactive_12_mon',
       'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
       'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
       'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio',
       'Attrited Customer', 'Existing Customer', 'Female', 'Male', 'College',
       'Doctorate', 'Graduate', 'High School', 'Post-Graduate', 'Uneducated',
       'Education-Unknown', 'Divorced', 'Married', 'Single', 'Marital-Unknown',
       '$120K +', '$40K - $60K', '$60K - $80K', '$80K - $120K',
       'Less than $40K', 'Income-Unknown']]

y = df[["Card_Category"]]

**This cell uses MinMaxScaler from scikit-learn to rescale every feature, and the target, to the range 0 to 1 (the scaler's default). The set_output(transform="pandas") call makes the scaler return a pandas DataFrame instead of a NumPy array.**

In [None]:
from sklearn.preprocessing import MinMaxScaler

Scaler = MinMaxScaler().set_output(transform = "pandas")

X = Scaler.fit_transform(X)

Scaler_y  = MinMaxScaler().set_output(transform = "pandas")

y = Scaler_y.fit_transform(y)

**These imports from scikit-learn enable data splitting for training and testing models, along with evaluation metrics for regression tasks, specifically mean squared error.**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

**This command splits the dataset X and target y into training and testing sets, with 80% of the data allocated for training (X_train and y_train) and 20% for testing (X_test and y_test).**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
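**Without a fixed seed, train_test_split shuffles the rows differently on every run, so the error values reported later will vary between runs. The sketch below, on made-up arrays, shows that passing random_state makes the split repeatable. The seed value 42 is an arbitrary choice.**

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# The same random_state gives the same split every time.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
print(X_train_a.shape, X_test_a.shape)
```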

**1: Importing Linear Regression: from sklearn.linear_model import LinearRegression**

**2: Initializing Linear Regression: LR = LinearRegression()**

**3: Training the Model: LR.fit(X_train, y_train)**

**4: Making Predictions: y_pred = LR.predict(X_test)**

**5: Calculating Mean Squared Error: MSE = mean_squared_error(y_test, y_pred)**

**6: Printing Mean Squared Error: print("Linear Regression Mean Squared Error:", MSE)**

**7: "n_jobs=-1" asks scikit-learn to use all available CPU cores where the computation can be parallelized. LinearRegression fits ordinary least squares and applies no regularization.**

In [None]:
from sklearn.linear_model import LinearRegression

LR = LinearRegression(n_jobs = -1)

LR.fit(X_train, y_train)

y_pred = LR.predict(X_test)

MSE = mean_squared_error(y_test, y_pred)

print("Linear Regression Mean Squared Error:", MSE)

Linear Regression Mean Squared Error: 0.034463542286488125


**1: Import RandomForestRegressor: from sklearn.ensemble import RandomForestRegressor**

**2: Initialize RandomForestRegressor: RFR = RandomForestRegressor()**

**3: Train the regressor: RFR.fit(X_train, y_train)**

**4: Make predictions: y_pred = RFR.predict(X_test)**

**5: Calculate Mean Squared Error: MSE = mean_squared_error(y_test, y_pred)**

**6: Print Mean Squared Error: print("Random Forest Mean Squared Error:", MSE)**

**7: "max_depth=5, max_samples=5" limits each decision tree to a depth of 5 and, because max_samples is given as an integer, trains each tree on only 5 rows drawn from the training set. This keeps training fast but heavily constrains what each tree can learn.**

In [None]:
from sklearn.ensemble import RandomForestRegressor

RFR = RandomForestRegressor(max_depth=5, max_samples=5)

RFR.fit(X_train, y_train)

y_pred = RFR.predict(X_test)

MSE = mean_squared_error(y_test, y_pred)

print("Random Forest Mean Squared Error:", MSE)

Random Forest Mean Squared Error: 0.04616618953603159


**1: Import Decision Tree Regressor: from sklearn.tree import DecisionTreeRegressor**

**2: Initialize Decision Tree Regressor: DTR = DecisionTreeRegressor()**

**3: Train the regressor: DTR.fit(X_train, y_train)**

**4: Make predictions: y_pred = DTR.predict(X_test)**

**5: Calculate Mean Squared Error: MSE = mean_squared_error(y_test, y_pred)**

**6: Print Mean Squared Error: print("Decision Tree Mean Squared Error:", MSE)**

In [None]:
from sklearn.tree import DecisionTreeRegressor

DTR = DecisionTreeRegressor()

DTR.fit(X_train, y_train)

y_pred = DTR.predict(X_test)

MSE = mean_squared_error(y_test, y_pred)

print("Decision Tree Mean Squared Error:", MSE)

Decision Tree Mean Squared Error: 0.019085225403093126


# **Conclusion:**

**Based on the three regression models and their mean squared errors on the test set:**

 - **Linear Regression Mean Squared Error: 0.0345**
 - **Random Forest Mean Squared Error: 0.0462**
 - **Decision Tree Mean Squared Error: 0.0191**

**The decision tree algorithm exhibits the lowest mean squared error among the models evaluated. Therefore, it is selected as the optimal model for making predictions based on this metric.**
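**An MSE value is easier to judge next to a baseline. As a sketch on synthetic data (not the bank dataset), the example below compares a fitted model with scikit-learn's DummyRegressor, which always predicts the mean of the training target. The data, coefficients, and seeds are made up for illustration.**

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: a linear signal plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.05, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Baseline: always predict the mean of the training target.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
model = LinearRegression().fit(X_tr, y_tr)

baseline_mse = mean_squared_error(y_te, baseline.predict(X_te))
model_mse = mean_squared_error(y_te, model.predict(X_te))

print("Baseline MSE:", baseline_mse)
print("Model MSE:", model_mse)
```

**A model is only useful to the extent that its error falls clearly below the baseline's.**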

# **Save The Model**

**This code snippet imports the "pickle" module and saves the decision tree regressor ("DTR") into a file named "Finalized-Model.pickle" using binary write mode ("wb").**

In [None]:
import pickle

with open("Finalized-Model.pickle", "wb") as file:
    pickle.dump(DTR, file)
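**A saved model is loaded back with pickle.load in binary read mode ("rb"). The sketch below runs the full round trip on a small made-up regressor and a temporary file, so it does not touch "Finalized-Model.pickle". The file name "demo-model.pickle" and the toy data are assumptions for illustration.**

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up training data: one feature, a step from 0 to 1.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0.0, 0.0, 1.0, 1.0])
tree = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo-model.pickle")

    # Save the fitted model, then load it back from disk.
    with open(path, "wb") as file:
        pickle.dump(tree, file)
    with open(path, "rb") as file:
        restored = pickle.load(file)

# The restored model predicts exactly like the original.
new_rows = np.array([[0.5], [2.5]])
predictions = restored.predict(new_rows)
print(predictions)
```

**Because the notebook's model was trained on min-max scaled inputs, new data would need the same fitted "Scaler" applied before prediction, so the scalers are worth saving alongside the model. Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.**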