
[Data Report here](https://docs.google.com/document/d/1X5978wftiwQ_1Ef9aMhId56umDE_R8Ha-ipip15CuPI/edit?usp=sharing)

# Introduction

## Data Understanding
This data set, sourced from [Kaggle](https://www.kaggle.com/datasets/ruthgn/bank-marketing-data-set/data), contains records from a direct marketing campaign of a Portuguese banking institution. The campaign was carried out through phone calls. Often, more than one call has to be made to a single client before they either decline or agree to a term deposit subscription. The classification goal is to predict whether the client will subscribe (yes/no) to the term deposit (variable y).

This is a modified version of the classic bank marketing data set originally shared in the UCI Machine Learning Repository.

### Input variables:
#### Bank client data:
- age (numeric)
- job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
- marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
- education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
- default: has credit in default? (categorical: 'no','yes','unknown')
- housing: has housing loan? (categorical: 'no','yes','unknown')
- loan: has personal loan? (categorical: 'no','yes','unknown')

#### Related to the last contact of the current campaign:
- contact: contact communication type (categorical: 'cellular','telephone')
- month: last contact month of year (categorical: 'jan', 'feb', 'mar', …, 'nov', 'dec')
- day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')

#### Other attributes:
- campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
- pdays: number of days since the client was last contacted in a previous campaign (numeric; 999 means client was not previously contacted)
- previous: number of contacts performed before this campaign and for this client (numeric)
- poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
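
Because `pdays` uses 999 as a sentinel rather than a real day count, summary statistics on the raw column can be misleading. The following is a minimal sketch, on a small made-up frame rather than the real data, of one way to separate the sentinel from genuine values. The column names `previously_contacted` and `pdays_clean` are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are illustrative only.
toy = pd.DataFrame({"pdays": [999, 3, 999, 6], "previous": [0, 1, 0, 2]})

# Hypothetical flag: was the client reached in an earlier campaign?
toy["previously_contacted"] = toy["pdays"] != 999

# Keep genuine day counts and turn the 999 sentinel into NaN,
# so it is not read as "999 days since last contact".
toy["pdays_clean"] = toy["pdays"].where(toy["pdays"] != 999)

print(toy)
```

Keeping both the flag and the cleaned column is one possible design choice: the flag preserves the "never contacted" information, while the cleaned column supports averages over clients who were actually contacted.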

#### Social and economic context attributes:
- emp.var.rate: employment variation rate - quarterly indicator (numeric)
- cons.price.idx: consumer price index - monthly indicator (numeric)
- cons.conf.idx: consumer confidence index - monthly indicator (numeric)
- euribor3m: euribor 3 month rate - daily indicator (numeric)
- nr.employed: number of employees - quarterly indicator (numeric)

#### Output variable (desired target):

- y - has the client subscribed to a term deposit? (binary: 'yes','no')
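
Since the target `y` is stored as the strings 'yes' and 'no', it usually has to be converted to numbers before modelling. Here is a minimal sketch on made-up values, assuming a 0/1 encoding; the column name `subscribed` is hypothetical.

```python
import pandas as pd

# Toy target column; the values are illustrative, not taken from the data set.
toy = pd.DataFrame({"y": ["no", "no", "yes", "no"]})

# Map the two documented labels to integers (1 = subscribed).
toy["subscribed"] = toy["y"].map({"no": 0, "yes": 1})

# The mean of a 0/1 column is the share of positive cases,
# which gives a quick view of class balance.
subscription_rate = toy["subscribed"].mean()
print(subscription_rate)
```

Checking the share of 'yes' values early is useful because a strongly imbalanced target affects which evaluation metrics are informative.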




## Overview
Use this data set to test the performance of your classification models and to explore the best strategies to improve a banking institution's next direct marketing campaign.

Term deposits are cash investments held at a financial institution and are a major source of revenue for banks, which makes them important for financial institutions to market. Telemarketing remains a popular marketing technique because of the potential effectiveness of the human-to-human contact a telephone call provides, in contrast to the impersonal and robotic marketing messages often relayed through social and digital media. However, executing such a direct marketing effort usually requires a huge investment by the business, as large call centers need to be contracted to contact clients directly.

How can the banking institution have more effective direct marketing campaigns in the future? Analyze this data set and identify the patterns that will help us develop future strategies.

In [1]:
# Load and preview the data
import pandas as pd
df = pd.read_csv('bank-direct-marketing-campaigns.csv')
df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [3]:
#size of df
df.shape

(41188, 20)

In [10]:
#checking for duplicates
df.duplicated().sum()

np.int64(1784)

In [11]:
#dropping the duplicates
df = df.drop_duplicates()
df.duplicated().sum()

np.int64(0)
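
To make clear what the duplicate count above measures, here is a small sketch on made-up rows. By default `duplicated()` flags every repeat after the first occurrence, while `keep=False` flags all rows that take part in a duplicate group. The variable names are illustrative only.

```python
import pandas as pd

# Toy frame: rows 0, 1 and 3 are identical; the values are illustrative only.
toy = pd.DataFrame({
    "age": [30, 30, 41, 30],
    "job": ["admin.", "admin.", "services", "admin."],
})

# Repeats after the first occurrence (what df.duplicated().sum() counts).
n_repeats = int(toy.duplicated().sum())

# All rows that belong to a duplicate group, including the first occurrence.
n_involved = int(toy.duplicated(keep=False).sum())

# Drop the repeats and rebuild a clean 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_repeats, n_involved, len(deduped))
```

The listed columns contain no client identifier, so the data alone cannot settle whether identical rows are true duplicates or distinct clients who happen to share every attribute. Dropping them is a judgment call worth noting in the report.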

In [12]:
#checking the missing values
df.isna().sum()

Unnamed: 0,0
age,0
job,0
marital,0
education,0
default,0
housing,0
loan,0
contact,0
month,0
day_of_week,0
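
Note that `isna()` only detects true missing values. The data dictionary lists 'unknown' as a category label for several columns, and `isna()` will not count those. The sketch below, on a made-up frame, shows one way to count such placeholders per column and, optionally, to treat them as missing. Whether 'unknown' should be treated as missing or kept as its own category is an assumption to decide per column.

```python
import pandas as pd

# Toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "services"],
    "default": ["no", "unknown", "unknown"],
    "age": [56, 57, 37],
})

# Count cells equal to the 'unknown' placeholder in each column.
unknown_counts = toy.isin(["unknown"]).sum()
print(unknown_counts)

# Optional: mask the placeholder so that isna() picks it up.
masked = toy.mask(toy.isin(["unknown"]))
print(masked.isna().sum())
```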


In [20]:
# Check each column's value counts to spot placeholder values such as 'unknown'
for column in df.columns:
    print(f"Value counts for column '{column}':")
    print(df[column].value_counts())
    print()

Value counts for column 'age':
age
31    1825
32    1764
33    1741
35    1671
36    1670
      ... 
91       2
98       2
95       1
87       1
94       1
Name: count, Length: 78, dtype: int64

Value counts for column 'job':
job
admin.           9873
blue-collar      8835
technician       6404
services         3801
management       2820
retired          1683
entrepreneur     1405
self-employed    1386
housemaid        1028
unemployed        992
student           852
unknown           325
Name: count, dtype: int64

Value counts for column 'marital':
marital
married     23869
single      10997
divorced     4459
unknown        79
Name: count, dtype: int64

Value counts for column 'education':
education
university.degree      11561
high.school             9121
basic.9y                5785
professional.course     5018
basic.4y                3993
basic.6y                2222
unknown                 1686
illiterate                18
Name: count, dtype: int64

Value counts for column 'defaul