<a href="https://colab.research.google.com/github/Advanced-Data-Science-TU-Berlin/Data-Science-Training-Python-Part-2/blob/main/interactive_notebooks/bank_marketing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bank Marketing Dataset
![picture](https://www.socialtoaster.com/wp-content/uploads/2018/10/Finance-and-Retail-Banking-Blog-Post.jpg)

Today, organizations hiring data scientists express a keen interest in a job candidate's portfolio. Analyzing an organization's marketing data stands out as one of the most common applications of data science and machine learning. Including such an analysis in your portfolio can significantly enhance its value.

In essence, datasets containing marketing data serve two distinct business objectives:

**1) Predicting Marketing Campaign Results:**

Forecasting the outcomes of marketing campaigns for individual customers and gaining insights into the factors influencing these results.
This analysis aids in discovering strategies to enhance the efficiency of marketing campaigns.

**2) Segmenting Customers:**

Identifying customer segments by utilizing data from those who subscribed to a term deposit.
This segmentation helps in profiling customers more likely to acquire the product, facilitating the development of targeted and effective marketing campaigns.

## About this Dataset

<img src="https://datascientest.com/en/wp-content/uploads/sites/9/2023/08/power-bi-dashboard.webp">

This is the classic bank marketing dataset, originally uploaded to the UCI Machine Learning Repository. It describes a marketing campaign run by a financial institution. Your task is to analyze the data and identify strategies that could improve the bank's future marketing campaigns.

To read more, see the [Kaggle dataset page](https://www.kaggle.com/datasets/janiobachmann/bank-marketing-dataset/data).

Features related to bank clients:

| Features | Description |
| -- | -- |
| age | age of the customer |
| job | type of job |
| marital | marital status |
| education | education level ('primary', 'secondary', 'tertiary', 'unknown') |
| default | has credit in default? |
| housing | has housing loan? ('no','yes','unknown') |
| loan | has personal loan? ('no','yes','unknown') |
| balance | balance of the customer |

Features related to the last contact of the current campaign:

| Features | Description |
| -- | -- |
|contact | contact communication type ('cellular','telephone')|
|month | last contact month of year ('jan', 'feb', 'mar', ..., 'nov', 'dec')|
|day | last contact day of the week ('mon','tue','wed','thu','fri') |
| duration | last contact duration, in seconds (numeric) |

> Important note: `duration` strongly affects the output target (for example, if duration=0 then y=no)

Other Attributes:

| Features | Description |
| -- | -- |
| campaign| number of contacts performed during this campaign and for this client (numeric, includes last contact) |
| pdays| number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)|
| previous| number of contacts performed before this campaign and for this client (numeric)|
| poutcome| outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')|

# Your Assignment

<img src="https://t3.ftcdn.net/jpg/01/27/17/00/360_F_127170057_T8TeGnPWtYX24uTpSjeIT0500sUxi9M1.jpg">


**Part 1. Exploratory Data Analysis (EDA)**

Analyze and visualize key aspects of the data to gain insight into its structure, patterns, and potential relationships. EDA lays the foundation for the modeling steps that follow.

**Part 2. Classification**

Implement and evaluate classification models such as K-Nearest Neighbors (KNN) and Decision Tree (DT), which assign each record to a class. Understand the principles behind classification, assess model performance, and tune hyperparameters to improve predictions.

**Part 3. Clustering**

Group similar data points together with the KMeans clustering algorithm, and find a suitable value of K for your data using both the elbow method and silhouette analysis.



In [None]:
# We need this to get the data from Kaggle
!pip install opendatasets

In [None]:
# Let's begin by importing the essential Python libraries:
# pandas, numpy, matplotlib, seaborn, and opendatasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import opendatasets as od


In [None]:
# Retrieve your dataset from Kaggle.
url = "https://www.kaggle.com/datasets/janiobachmann/bank-marketing-dataset/data"
od.download(url)

# Part 0. Read The Data

<img src="https://user.oc-static.com/upload/2020/11/23/16061415667329_3C3.png">

In [8]:
# Let's load the data into a pandas DataFrame.
# If you need a refresher on how to do this, refer to the previous notebooks on text classification or decision trees.
# The data should be at /content/bank-marketing-dataset/bank.csv


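As a hedged starting point, loading the file could look like the sketch below. It assumes the CSV is comma-separated and sits at the path mentioned in the cell above; the helper name `load_bank_data` and the three sample rows are made up for illustration, so the snippet runs even before the download has finished.

```python
import io

import pandas as pd

# Path stated in the cell above; adjust it if the download landed elsewhere.
DATA_PATH = "/content/bank-marketing-dataset/bank.csv"


def load_bank_data(source):
    """Read the bank marketing CSV from a path or a file-like object."""
    return pd.read_csv(source)


# Tiny in-memory stand-in (made-up rows) so the sketch runs without the real file.
sample_csv = io.StringIO(
    "age,job,balance,duration,deposit\n"
    "59,admin.,2343,1042,yes\n"
    "41,technician,1270,1389,yes\n"
    "33,services,-50,120,no\n"
)
df_sample = load_bank_data(sample_csv)
print(df_sample.shape)
```

With the real file in place, `df = load_bank_data(DATA_PATH)` would replace the in-memory sample.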

# Part 1) Exploratory Data Analysis (EDA)
<img src="https://d2o2utebsixu4k.cloudfront.net/media/images/3a2daf59-b87d-453e-871c-070e4656267e.jpg">

Here are some hints for this section:
- Data Overview:
Display the first few rows of the dataset to understand its structure.
Check the data type of each column.

- Summary Statistics:
Summarize the main characteristics of the dataset with statistics such as mean, median, and standard deviation.

- Missing Values:
Identify missing values, if any.
Decide whether to drop them or fill them, for example with the column median.

- Data Visualization:
Create a pairplot (pairwise scatter plots) of your numerical features only: age, balance, duration.
Color the data points by the target value.

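The hints above can be sketched roughly as follows. This is one possible approach, not the expected solution: the DataFrame is synthetic (random values and a few injected gaps), the target column name `deposit` is an assumption, and `plot_numeric_pairs` is a hypothetical helper that wraps seaborn's `pairplot`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the bank data (values are made up for illustration).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 90, size=n),
    "balance": rng.normal(1500, 3000, size=n).round(),
    "duration": rng.integers(0, 2000, size=n),
    "job": rng.choice(["admin.", "technician", "services"], size=n),
    "deposit": rng.choice(["yes", "no"], size=n),
})
df.loc[::25, "balance"] = np.nan  # inject a few gaps to demonstrate handling

print(df.head())        # first rows
print(df.dtypes)        # data type of each column
print(df.describe())    # count, mean, std, min, quartiles (50% is the median), max
print(df.isna().sum())  # number of missing values per column

# Fill numeric gaps with the column median, one of the two options in the hints.
df["balance"] = df["balance"].fillna(df["balance"].median())


def plot_numeric_pairs(frame, target="deposit"):
    """Pairplot of the three numeric features, colored by the target."""
    import seaborn as sns  # imported lazily; only needed for plotting
    return sns.pairplot(frame, vars=["age", "balance", "duration"], hue=target)
```

Calling `plot_numeric_pairs(df)` in a notebook cell draws the grid of scatter plots, with one color per target class.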

```
Only if you have time and want to go the extra mile:

- Distribution Analysis:
Examine the distribution of numerical features.
Use histograms, box plots, or other visualizations to understand the data distribution.

- Categorical Variables:
Explore the distribution of categorical variables.
Use bar charts to visualize the frequency of each category.

- Correlation Analysis:
Investigate the relationships between numerical features.
Create a correlation matrix and visualize it using a heatmap.

- Outlier Detection:
Identify and handle potential outliers in the dataset.
Use visualizations like box plots to identify extreme values.

- Additional Visualization:
Create additional visualizations (scatter plots, pair plots, etc.) to gain insights into relationships between variables.

- Feature Engineering:
If necessary, create new features that might be relevant for classification.
Feature scaling or normalization may be required.

- Data Insights:
Summarize key findings and insights from the EDA process.
```
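For the optional correlation and outlier hints, a minimal sketch could look like this. The data is synthetic with one planted extreme value, and the 1.5 × IQR rule in the hypothetical helper `iqr_outlier_mask` is just one common convention for flagging outliers, chosen here as an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic numeric features (made-up values) with one planted extreme balance.
rng = np.random.default_rng(1)
num = pd.DataFrame({
    "age": rng.integers(18, 90, size=300).astype(float),
    "balance": rng.normal(1500, 1000, size=300),
    "duration": rng.exponential(250, size=300),
})
num.loc[0, "balance"] = 50_000

# Pearson correlation matrix of the numeric features.
corr = num.corr()


def iqr_outlier_mask(series, k=1.5):
    """Flag values further than k * IQR outside the first or third quartile."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


mask = iqr_outlier_mask(num["balance"])
print(corr.round(2))
print("balance outliers:", int(mask.sum()))
```

Passing `corr` to `sns.heatmap(corr, annot=True)` would render the matrix as a heatmap, and `num[~mask]` would keep only the rows that were not flagged.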




In [9]:
# Start by looking at the first few rows of the data and explore it as far as you can
# Use the hints that were mentioned above and get inspired by the previous notebooks we practiced in the class






# Part 2) Classification

The objective of classification in this dataset is to predict the target variable "Deposit." By employing machine learning classification algorithms, the goal is to build a model that can effectively differentiate between instances where a customer subscribed to a term deposit ("Deposit" equals yes) and instances where they did not subscribe ("Deposit" equals no). This predictive model can provide valuable insights into the factors influencing customers' decisions to subscribe to term deposits, enabling businesses to tailor their marketing strategies and allocate resources more efficiently. The classification task aims to create a robust model capable of making accurate predictions on whether a customer is likely to make a term deposit based on various features in the dataset.

## What is a Term Deposit?
A term deposit is a deposit offered by a bank or other financial institution at a fixed rate (often better than that of a regular deposit account), in which your money is returned at a specified maturity date. For more information about term deposits, see this Investopedia article: https://www.investopedia.com/terms/t/termdeposit.asp


<img src="https://static.wixstatic.com/media/a4a427_5a12d4912c124348b616a571caa9c817~mv2.png/v1/fill/w_445,h_223,al_c,lg_1,q_85/a4a427_5a12d4912c124348b616a571caa9c817~mv2.png">


## Preprocessing
- What is your target value?
- Which features do you want to use? The choice is up to you.
- Do you need to scale your data? Feel free to scale the numerical features.
- Do you need to encode the categorical features?
- Split your data into features and target values (X, y)
- Visualize your target value
- Is it balanced or imbalanced?
- Select an evaluation metric to measure model performance; if you are not sure, use accuracy.
- Split your data into train and test sets
- Let's scale our data
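One way to walk through these preprocessing steps is sketched below, under stated assumptions: the DataFrame is synthetic, the target column is assumed to be called `deposit` with values `yes`/`no`, and one-hot encoding via `pd.get_dummies` is only one of several valid encoding choices. The scaler is fitted on the training split only, so no information from the test set leaks into training.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the bank data (values are made up for illustration).
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(18, 90, size=n).astype(float),
    "balance": rng.normal(1500, 3000, size=n),
    "duration": rng.integers(0, 2000, size=n).astype(float),
    "job": rng.choice(["admin.", "technician", "services"], size=n),
    "housing": rng.choice(["yes", "no"], size=n),
    "deposit": rng.choice(["yes", "no"], size=n),
})

# Target as 0/1 and features as everything else.
y = (df["deposit"] == "yes").astype(int)
X = df.drop(columns="deposit")

# Class balance: proportions near 0.5 suggest accuracy is a reasonable metric.
print(y.value_counts(normalize=True))

# One-hot encode the categorical features.
X = pd.get_dummies(X, columns=["job", "housing"], drop_first=True)

# Stratified split keeps the class proportions similar in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

# Fit the scaler on the training data only, then apply it to both splits.
numeric_cols = ["age", "balance", "duration"]
scaler = StandardScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])
```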


## Decision Tree

- Try to use decision tree to classify your data
- Train your model on train data and test it on test data
- Evaluate your model performance both on train and test data
- If you have time, try hyperparameter tuning: use GridSearch to find the best values for "max_depth" and "criterion"
- Can you visualize the tree using Graphviz?
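The decision tree steps above might be sketched like this. The data comes from scikit-learn's `make_classification` as a stand-in for the preprocessed bank features, and the grid values for `max_depth` are arbitrary choices for illustration rather than recommended settings.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Synthetic classification data standing in for the preprocessed bank features.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Grid search over the two hyperparameters named in the hints (5-fold CV).
param_grid = {
    "max_depth": [2, 4, 6, 8, None],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)
best_tree = search.best_estimator_

# Compare train and test accuracy; a large gap hints at overfitting.
train_acc = metrics.accuracy_score(y_train, best_tree.predict(X_train))
test_acc = metrics.accuracy_score(y_test, best_tree.predict(X_test))
print(search.best_params_, round(train_acc, 3), round(test_acc, 3))

# DOT description of the fitted tree; rendering it requires Graphviz.
dot_source = export_graphviz(
    best_tree, out_file=None, filled=True, class_names=["no", "yes"]
)
```

If the `graphviz` Python package is installed, `graphviz.Source(dot_source)` displays the tree in a notebook cell.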

## KNN

- Try to use KNN to classify your data
- Start with k=5
- Remember that KNN works with numerical features, so select numerical features or encode the categorical ones as numbers before running KNN.
- Can you find the best K by looking at the train and test error plot?
- Train your model with the best K if it is not 5
- Evaluate your model performance both on train and test data
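A hedged sketch of the K sweep follows, again on synthetic data from `make_classification`. The range of odd K values is an arbitrary assumption. Picking K from the test error mirrors the hint above; a separate validation split or cross-validation would give a less optimistic estimate.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic classification data standing in for the preprocessed bank features.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# KNN is distance-based, so the features are scaled first (fit on train only).
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Record train and test error (1 - accuracy) for each candidate K.
ks = list(range(1, 31, 2))
train_err, test_err = [], []
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)
    train_err.append(1 - knn.score(X_train_s, y_train))
    test_err.append(1 - knn.score(X_test_s, y_test))

# Smallest test error wins; ties go to the smaller K.
best_err, best_k = min(zip(test_err, ks))
print("best k:", best_k, "test error:", round(best_err, 3))
```

Plotting `ks` against `train_err` and `test_err` with `plt.plot` gives the error curves mentioned in the hints.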


In [None]:
# Import some useful functions for this part
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

In [10]:
# Preprocess your data here, take a look at the preprocessing steps in the previous notebooks to be inspired

In [11]:
# Train your Decision Tree model and evaluate it

In [12]:
# Train your KNN model and evaluate it

# Part 3) Clustering

<img src="https://miro.medium.com/v2/resize:fit:906/1*Aczgwp5UkCIIO7nf9ItLiA.png">

The goal is to identify meaningful customer segments that share similar attributes, allowing businesses to gain insights into the diverse profiles within their customer base. By analyzing these segments, organizations can uncover patterns and trends related to customers who have shown an interest in term deposits.

## KMeans

- Select any features you want to work with, remember they should be numerical.
- Implement the KMeans clustering algorithm on the dataset.
- Explore different values of K to identify optimal clusters using the elbow method and silhouette analysis.
- Visualize the clustered data using scatter plots or other appropriate visualizations for two of your features.
- Analyze and interpret the characteristics of each cluster to gain insights into customer segments.
- Discuss potential business strategies or marketing approaches based on the identified clusters.
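The elbow method and silhouette analysis could be sketched as below. The data is generated with `make_blobs` purely for illustration, and the K range of 2 to 8 is an assumption; with the bank data you would pass your selected, scaled numerical features instead.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic 2-D data standing in for two scaled numerical customer features.
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=7)
X = StandardScaler().fit_transform(X)

inertias, silhouettes = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=7).fit(X)
    inertias[k] = km.inertia_                         # within-cluster sum of squares
    silhouettes[k] = silhouette_score(X, km.labels_)  # ranges from -1 to 1

# Elbow method: look for the K where the drop in inertia flattens out.
for k in inertias:
    print(k, round(inertias[k], 1), round(silhouettes[k], 3))

# Silhouette analysis: the K with the highest average score is a candidate.
best_k = max(silhouettes, key=silhouettes.get)
print("best k by silhouette:", best_k)
```

The two criteria do not always agree, so it is worth inspecting both before settling on K and then comparing the feature averages of each cluster to describe the customer segments.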