## LABORATORY 04: MACHINE LEARNING II - CLASSIFICATION PROBLEM

#### **Context**

Term deposits are a major source of income for a bank. A term deposit is a cash investment held at a financial institution: the money is invested at an agreed rate of interest for a fixed period of time, or term. Banks use several outreach channels to sell term deposits to their customers, such as email marketing, advertisements, telephone marketing, and digital marketing.

Telephone marketing campaigns remain one of the most effective ways to reach people. However, they require a large investment, because call centers are hired to run them. It is therefore crucial to identify beforehand the customers most likely to convert, so that they can be specifically targeted by phone.

The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict if the client will subscribe to a term deposit (variable y).

#### **Detailed Column Descriptions**
1 - **age** (numeric)

2 - **job**: type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student", "blue-collar","self-employed","retired","technician","services")

3 - **marital**: marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)

4 - **education** (categorical: "unknown","secondary","primary","tertiary")

5 - **default**: has credit in default? (binary: "yes","no")

6 - **balance**: average yearly balance, in euros (numeric)

7 - **housing**: has housing loan? (binary: "yes","no")

8 - **loan**: has personal loan? (binary: "yes","no")

**Related to the last contact of the current campaign:**

9 - **contact**: contact communication type (categorical: "unknown","telephone","cellular")

10 - **day**: last contact day of the month (numeric)

11 - **month**: last contact month of year (categorical: "jan", "feb", "mar", …, "nov", "dec")

12 - **duration**: last contact duration, in seconds (numeric)

**Other attributes:**

13 - **campaign**: number of contacts performed during this campaign and for this client (numeric, includes last contact)

14 - **pdays**: number of days since the client was last contacted in a previous campaign (numeric, -1 means the client was not previously contacted)

15 - **previous**: number of contacts performed before this campaign and for this client (numeric)

16 - **poutcome**: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

**Output variable (desired target):**

17 - **y**: has the client subscribed to a term deposit? (binary: "yes","no")
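Two of the descriptions above have direct consequences for how the columns are read in code: `y` is a "yes"/"no" string, and `pdays` uses -1 as a sentinel rather than a real day count. The following is a minimal sketch of one way to handle both, using a toy frame instead of the real data. The column names `y_bin` and `previously_contacted` are hypothetical, and treating "yes" as the positive class is an assumption.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({"pdays": [-1, 92, -1], "y": ["no", "yes", "no"]})

# Assumption: "yes" (client subscribed) is the positive class.
toy["y_bin"] = toy["y"].map({"no": 0, "yes": 1})

# -1 is a sentinel meaning "never contacted before", not a number of days,
# so it is exposed as a separate boolean flag (hypothetical column name).
toy["previously_contacted"] = toy["pdays"] != -1

print(toy)
```

Keeping the sentinel as its own flag avoids having a model interpret -1 as "contacted very recently".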

#### 1. Load the dataset

In [4]:
# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn 
import seaborn as sbn
import scripts

In [3]:
# load train and test set
dataset = pd.read_csv("train.csv", sep = ";", low_memory = False)
test_set = pd.read_csv("test.csv", sep = ";", low_memory = False)
dataset.head(5)

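| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 4 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |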


In [7]:
common_records = pd.merge(dataset, test_set, how='inner')
print(f"Amount of train records: {len(dataset)}")
print(f"Amount of test records: {len(test_set)}")
print(f"Amount of common records: {len(common_records)}")

Amount of train records: 45211
Amount of test records: 4521
Amount of common records: 4521


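The creators of this dataset built the test set by taking 10% of the training records, so every one of the 4521 test records also appears in the training set. Evaluating a model on records it was trained on would leak information and inflate its scores, so we remove from `dataset` every record that is also contained in `test_set`.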

In [10]:
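# test_set is appended twice so that every test row occurs at least twice;
# drop_duplicates(keep=False) then drops all copies of any duplicated row,
# which removes the test rows from the training data
dataset = pd.concat([dataset, test_set, test_set]).drop_duplicates(keep=False)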

common_records = pd.merge(dataset, test_set, how='inner')
print(f"Amount of train records: {len(dataset)}")
print(f"Amount of test records: {len(test_set)}")
print(f"Amount of common records: {len(common_records)}")


Amount of train records: 40690
Amount of test records: 4521
Amount of common records: 0
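The concatenate-and-drop-duplicates step can also be written as an explicit anti-join. The sketch below is one possible alternative, not the method used in this notebook: it left-merges the training frame with the test frame on all shared columns and keeps only the rows marked `left_only` by the `indicator` option of `DataFrame.merge`. The helper name `drop_rows_present_in` and the toy frames are hypothetical.

```python
import pandas as pd

def drop_rows_present_in(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Return the rows of `train` that have no exact match in `test`."""
    # With no `on` argument, merge joins on every column the frames share.
    # De-duplicating `test` first guarantees no train row is multiplied.
    marked = train.merge(test.drop_duplicates(), how="left", indicator=True)
    # `_merge` is "left_only" for rows found only in `train`.
    kept = marked.loc[marked["_merge"] == "left_only"]
    return kept.drop(columns="_merge").reset_index(drop=True)

# Toy frames standing in for train.csv / test.csv (illustrative values).
toy_train = pd.DataFrame({"age": [58, 44, 33, 33], "y": ["no", "no", "yes", "yes"]})
toy_test = pd.DataFrame({"age": [44], "y": ["no"]})

cleaned = drop_rows_present_in(toy_train, toy_test)
print(len(cleaned))
```

One difference in behavior is worth noting: `drop_duplicates(keep=False)` also discards rows that are duplicated within the training data itself, whereas the anti-join keeps them (the two identical `age == 33` rows survive in the toy example). In this notebook the counts (45211 - 4521 = 40690) show that both approaches would give the same result.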


In [12]:
# number of columns in each set (16 features plus the target y)
print("#Train-features = ", dataset.shape[1])
print("#Test-features = ", test_set.shape[1])

#Train-features =  17
#Test-features =  17


In [5]:
# definition of preprocessor
from scripts.preprocess import DataPreprocessing

dp = DataPreprocessing()
metadata, num_cols, cat_cols = dp.get_metadata(dataset)

print(f"Metadata ==> total: {len(metadata)} \n", metadata)
print("Numerical features: \n", num_cols)
print("Categorical features: \n", cat_cols)

ModuleNotFoundError: No module named 'scripts.preprocess'
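The import fails because the `scripts.preprocess` module is not available in this environment, so the numerical and categorical column lists are never printed. The source of `DataPreprocessing.get_metadata` is not shown, so the following is only a rough, hypothetical stand-in for the part that separates column types, based on pandas dtypes. The function name `split_columns_by_dtype` and the choice to exclude the target `y` are assumptions, and the sketch does not reproduce the `metadata` return value.

```python
import pandas as pd

def split_columns_by_dtype(df: pd.DataFrame, target: str = "y"):
    """Hypothetical helper: list numeric and non-numeric feature columns."""
    # Assumption: the target column is not treated as a feature.
    features = df.drop(columns=[target], errors="ignore")
    num_cols = features.select_dtypes(include="number").columns.tolist()
    cat_cols = features.select_dtypes(exclude="number").columns.tolist()
    return num_cols, cat_cols

# Toy frame with a few of the columns described above (illustrative values).
toy = pd.DataFrame({
    "age": [58, 44],
    "job": ["management", "technician"],
    "balance": [2143, 29],
    "y": ["no", "no"],
})

num_cols, cat_cols = split_columns_by_dtype(toy)
print("Numerical features:", num_cols)
print("Categorical features:", cat_cols)
```

Applied to the real `dataset`, a helper like this would be expected to place columns such as `age`, `balance`, and `duration` in the numerical list and string columns such as `job` and `marital` in the categorical one, provided pandas parsed them with those dtypes.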