
## Introduction 
* Understanding customer behavior is central to strategic decision-making in business.
* Predictive modeling is one effective way to gain insight into purchasing patterns. This project uses a decision tree classifier to predict whether a customer will make a purchase, based on a dataset of demographic and behavioral attributes.
* The aim is to give businesses a predictive tool for tailoring marketing strategies and improving customer engagement.

## Project Overview 
* The project builds and applies a decision tree classifier to predict customer purchasing behavior. Using a dataset that combines demographic details with behavioral data, the aim is a model that accurately predicts how likely a customer is to purchase a product or service.
* The workflow covers data exploration, preprocessing, model training, and evaluation. The end goal is a practical tool for customer relationship management and targeted marketing.

### Problem Statement 
* Despite the growing availability of data, accurately predicting customer purchase behavior remains difficult. This project addresses that challenge by developing a decision tree classifier that learns patterns in demographic and behavioral data to forecast purchasing decisions.
* Key issues include identifying relevant features, mitigating potential biases in the dataset, and tuning the decision tree for both accuracy and interpretability. Addressing them should give businesses a useful tool for customer targeting and marketing optimization.
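The workflow described above (encode features, split, fit, evaluate) might look roughly like the following. This is a minimal sketch, assuming scikit-learn is installed. The `toy` DataFrame and its values are hand-made for illustration and are not taken from the bank dataset, and `max_depth=3` is an arbitrary choice that keeps the tree small enough to read.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hand-made stand-in data (illustrative values only, NOT real records).
toy = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, 60, 44],
    "housing": ["yes", "no", "yes", "no", "no", "yes", "no", "yes"],
    "duration": [80, 300, 120, 600, 450, 60, 700, 90],
    "y": ["no", "yes", "no", "yes", "yes", "no", "yes", "no"],
})

# One-hot encode categorical features; numeric columns pass through unchanged.
X = pd.get_dummies(toy.drop(columns="y"))
y = toy["y"]

# Hold out a stratified test split so both classes appear in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A shallow tree trades some accuracy for interpretability.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print("Test accuracy:", accuracy)
```

On data this small the accuracy figure is meaningless; the point is only the order of the steps.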

## Import Libraries 
* These are the libraries needed to load and explore the data.

In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

### Load Data

In [2]:
# Load the dataset into a DataFrame
bank_data = pd.read_csv(r'C:\Users\wanji\Desktop\ML Projects\bank+marketing\bank\bank-full.csv', sep=';')
bank_data

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,technician,married,tertiary,no,825,no,no,cellular,17,nov,977,3,-1,0,unknown,yes
45207,71,retired,divorced,primary,no,1729,no,no,cellular,17,nov,456,2,-1,0,unknown,yes
45208,72,retired,married,secondary,no,5715,no,no,cellular,17,nov,1127,5,184,3,success,yes
45209,57,blue-collar,married,secondary,no,668,no,no,telephone,17,nov,508,4,-1,0,unknown,no
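
The `sep=';'` argument matters because the file is semicolon-delimited rather than comma-delimited. As a small sketch of that behavior, the snippet below parses an in-memory sample built from the first three rows shown above. The `raw` string and the `sample` name are illustrative only, and the real file's quoting may differ.

```python
import io
import pandas as pd

# In-memory stand-in for the file, built from the first rows displayed above.
raw = """age;job;marital;education;default;balance;housing;loan;contact;day;month;duration;campaign;pdays;previous;poutcome;y
58;management;married;tertiary;no;2143;yes;no;unknown;5;may;261;1;-1;0;unknown;no
44;technician;single;secondary;no;29;yes;no;unknown;5;may;151;1;-1;0;unknown;no
33;entrepreneur;married;secondary;no;2;yes;yes;unknown;5;may;76;1;-1;0;unknown;no
"""

# With sep=';' each field becomes its own column.
sample = pd.read_csv(io.StringIO(raw), sep=";")

# Without it, pandas splits on commas and finds a single wide column.
wrong = pd.read_csv(io.StringIO(raw))

print(sample.shape)  # 3 rows, 17 columns
print(wrong.shape)   # 3 rows, 1 column
```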


## Data Understanding
This involves:
* Shape
* Columns, column names
* Column Datatypes
    


In [3]:
file_path = r'C:\Users\wanji\Desktop\ML Projects\bank+marketing\bank\bank-full.csv'
bank_data = pd.read_csv(file_path, sep=';')

# Function to determine the shape, column names, and row values
def analyze_dataset(dataset):
    shape = dataset.shape             # (number of rows, number of columns)
    column_names = dataset.columns    # Index of column labels
    # Note: values.tolist() returns one list per row, not one per column
    columns = dataset.values.tolist()
    return shape, column_names, columns


In [4]:
shape, column_names, columns = analyze_dataset(bank_data)
print("Shape:", shape)
print("Column Names:", column_names)
print("Columns:",columns)


Shape: (45211, 17)
Column Names: Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
       'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'y'],
      dtype='object')
Columns: [[58, 'management', 'married', 'tertiary', 'no', 2143, 'yes', 'no', 'unknown', 5, 'may', 261, 1, -1, 0, 'unknown', 'no'], [44, 'technician', 'single', 'secondary', 'no', 29, 'yes', 'no', 'unknown', 5, 'may', 151, 1, -1, 0, 'unknown', 'no'], [33, 'entrepreneur', 'married', 'secondary', 'no', 2, 'yes', 'yes', 'unknown', 5, 'may', 76, 1, -1, 0, 'unknown', 'no'], [47, 'blue-collar', 'married', 'unknown', 'no', 1506, 'yes', 'no', 'unknown', 5, 'may', 92, 1, -1, 0, 'unknown', 'no'], [33, 'unknown', 'single', 'unknown', 'no', 1, 'no', 'no', 'unknown', 5, 'may', 198, 1, -1, 0, 'unknown', 'no'], [35, 'management', 'married', 'tertiary', 'no', 231, 'yes', 'no', 'unknown', 5, 'may', 139, 1, -1, 0, 'unknown', 'no'], [28, 'management', 'single', 'te

In [6]:
# Column Datatype
file_path = r'C:\Users\wanji\Desktop\ML Projects\bank+marketing\bank\bank-full.csv'
bank_data = pd.read_csv(file_path, sep=';')

def column_data_type(dataset):
    column_type = dataset.dtypes
    return column_type

# Call the function with the DataFrame 'bank_data'
data_types = column_data_type(bank_data)
print("Data Types:", data_types)


Data Types: age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
y            object
dtype: object
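
The dtype listing above splits naturally into numeric (`int64`) and categorical (`object`) columns, which typically need different preprocessing before a decision tree can use them. A minimal sketch of separating the two groups with `select_dtypes` follows. The `sample` frame is a cut-down illustration using a few columns and values from the rows shown earlier, not the full dataset.

```python
import pandas as pd

# Cut-down illustration with a few columns from the rows displayed earlier.
sample = pd.DataFrame({
    "age": [58, 44, 33],
    "job": ["management", "technician", "entrepreneur"],
    "balance": [2143, 29, 2],
    "housing": ["yes", "yes", "yes"],
    "y": ["no", "no", "no"],
})

# Numeric columns can be fed to a tree as-is.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical and would need encoding.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)      # ['age', 'balance']
print(categorical_cols)  # ['job', 'housing', 'y']
```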


In [9]:
def get_dataset_statistics(dataset):
    statistics = dataset.describe(include='all')
    return statistics

# Call the function with the DataFrame 'bank_data'
get_statistics = get_dataset_statistics(bank_data)
get_statistics

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
count,45211.0,45211,45211,45211,45211,45211.0,45211,45211,45211,45211.0,45211,45211.0,45211.0,45211.0,45211.0,45211,45211
unique,,12,3,4,2,,2,2,3,,12,,,,,4,2
top,,blue-collar,married,secondary,no,,yes,no,cellular,,may,,,,,unknown,no
freq,,9732,27214,23202,44396,,25130,37967,29285,,13766,,,,,36959,39922
mean,40.93621,,,,,1362.272058,,,,15.806419,,258.16308,2.763841,40.197828,0.580323,,
std,10.618762,,,,,3044.765829,,,,8.322476,,257.527812,3.098021,100.128746,2.303441,,
min,18.0,,,,,-8019.0,,,,1.0,,0.0,1.0,-1.0,0.0,,
25%,33.0,,,,,72.0,,,,8.0,,103.0,1.0,-1.0,0.0,,
50%,39.0,,,,,448.0,,,,16.0,,180.0,2.0,-1.0,0.0,,
75%,48.0,,,,,1428.0,,,,21.0,,319.0,3.0,-1.0,0.0,,
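
One thing the summary implies is class imbalance in the target: `y` has two unique values, and the most frequent one, `no`, appears 39922 times out of 45211. The arithmetic can be sketched as follows, using only the `count` and `freq` figures from the table above (the variable names are illustrative).

```python
# Figures read from the describe() output above.
total_rows = 45211
majority_no = 39922  # freq of the top value 'no' in column y

# y has exactly two unique values, so the remainder is the other class.
minority_yes = total_rows - majority_no

share_no = majority_no / total_rows
share_yes = minority_yes / total_rows

print(minority_yes)         # 5289
print(round(share_no, 3))   # 0.883
print(round(share_yes, 3))  # 0.117
```

A model that always predicts `no` would therefore already score about 88.3% accuracy, so accuracy alone is a weak yardstick for the classifier.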


### Data Cleaning

In [19]:
# Function to count fully duplicated rows
def check_duplicates(dataset):
    duplicates = dataset.duplicated().sum()
    return duplicates
# Call the function
check_duplicates_result = check_duplicates(bank_data)
check_duplicates_result


0

In [17]:
# Check for missing values
def check_missing_values(dataset):
    missing_values = dataset.isnull().sum()
    return missing_values

# Call the function
missing_values_result = check_missing_values(bank_data)
missing_values_result

age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64

* The checks above find no duplicate rows and no null values, so exploratory data analysis (EDA) can proceed without deduplication or imputation.
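
A caveat worth checking: `isnull()` only detects true nulls, while the rows displayed earlier contain the literal string `unknown` in several categorical columns. A minimal sketch of counting those placeholders is shown below, using the five head rows from the load step. Treating `unknown` as missing is an assumption here, not something the dataset states.

```python
import pandas as pd

# Four categorical columns from the five head rows displayed earlier.
sample = pd.DataFrame({
    "job": ["management", "technician", "entrepreneur", "blue-collar", "unknown"],
    "education": ["tertiary", "secondary", "secondary", "unknown", "unknown"],
    "contact": ["unknown"] * 5,
    "poutcome": ["unknown"] * 5,
})

# isnull() reports nothing, because 'unknown' is an ordinary string.
null_counts = sample.isnull().sum()

# Counting the placeholder explicitly reveals the hidden gaps.
unknown_counts = (sample == "unknown").sum()

print(null_counts.to_dict())
print(unknown_counts.to_dict())
```

Running the same comparison on the full `bank_data` frame would show how widespread the placeholder is in each column.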

## EDA
* In-depth understanding of the dataset
