# Kaggle Competition

## Dataset Description

### Files

1. **train.csv** - the training set
2. **test.csv** - the test set
3. **submission.csv** - a sample submission file in the correct format

**The following description is copied from [Kaggle](https://www.kaggle.com/t/507a848f9c1f48c28aff93b8d6a4fcc7).**

### Context

The dataset is related to direct marketing campaigns (phone calls) of a Portuguese banking institution.

### Columns

1. **age**: (numeric)
2. **job**: type of job (categorical: `'admin.'`, `'blue-collar'`, `'entrepreneur'`, `'housemaid'`, `'management'`, `'retired'`, `'self-employed'`, `'services'`, `'student'`, `'technician'`, `'unemployed'`, `'unknown'`)
3. **marital**: marital status (categorical: `'divorced'`, `'married'`, `'single'`, `'unknown'`; note: `'divorced'` includes divorced or widowed)
4. **education**: (categorical: `'basic.4y'`, `'basic.6y'`, `'basic.9y'`, `'high.school'`, `'illiterate'`, `'professional.course'`, `'university.degree'`, `'unknown'`)
5. **default**: has credit in default? (categorical: `'no'`, `'yes'`, `'unknown'`)
6. **housing**: has housing loan? (categorical: `'no'`, `'yes'`, `'unknown'`)
7. **loan**: has personal loan? (categorical: `'no'`, `'yes'`, `'unknown'`)

#### Related to the last contact of the current campaign:

8. **contact**: contact communication type (categorical: `'cellular'`, `'telephone'`)
9. **month**: last contact month of year (categorical: `'jan'`, `'feb'`, `'mar'`, …, `'nov'`, `'dec'`)
10. **day_of_week**: last contact day of the week (categorical: `'mon'`, `'tue'`, `'wed'`, `'thu'`, `'fri'`)
11. **duration**: last contact duration in seconds (numeric).

   **Important note**: This attribute strongly affects the target (e.g., if `duration=0`, then `subscribed='no'`). However, the duration is not known before a call is made, so it should be included only for benchmarking and discarded for realistic predictive modeling.
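
The note above can be applied defensively in code. The following is a minimal sketch, not part of the original notebook: the helper name `drop_leaky_columns` and the toy frame are hypothetical. The `info()` output later in this notebook lists no `duration` column, so the competition files appear to have it removed already; `errors="ignore"` makes the call a no-op in that case.

```python
import pandas as pd


def drop_leaky_columns(df: pd.DataFrame, leaky=("duration",)) -> pd.DataFrame:
    """Return a copy of df without columns that are unknown before a call is made.

    `leaky` is an illustrative default based on the dataset description.
    """
    # errors="ignore" keeps this safe when a listed column is already absent.
    return df.drop(columns=list(leaky), errors="ignore")


# Toy data for illustration only (not taken from train.csv).
toy = pd.DataFrame(
    {"age": [49, 69], "duration": [0, 210], "subscribed": ["no", "yes"]}
)
realistic = drop_leaky_columns(toy)
print(list(realistic.columns))
```

Returning a new frame rather than dropping in place leaves the original data available for the benchmark comparison the note mentions.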

#### Other attributes:

12. **campaign**: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13. **pdays**: number of days since the client was last contacted from a previous campaign (numeric; 999 means the client was not previously contacted)
14. **previous**: number of contacts performed before this campaign and for this client (numeric)
15. **poutcome**: outcome of the previous marketing campaign (categorical: `'failure'`, `'nonexistent'`, `'success'`)
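
Because `pdays` uses 999 as a sentinel for "not previously contacted", treating it as an ordinary number can mislead distance-based or linear models. One possible way to handle it, sketched here under assumptions (the helper `split_pdays` and the new column name `previously_contacted` are hypothetical, not from the original notebook), is to split the sentinel into a separate flag:

```python
import pandas as pd

PDAYS_NOT_CONTACTED = 999  # sentinel value documented in the column description


def split_pdays(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a contact flag and the sentinel replaced by NaN."""
    out = df.copy()
    is_sentinel = out["pdays"] == PDAYS_NOT_CONTACTED
    # 1 if the client was contacted in a previous campaign, else 0.
    out["previously_contacted"] = (~is_sentinel).astype(int)
    # mask() replaces the sentinel rows with NaN; the column becomes float.
    out["pdays"] = out["pdays"].mask(is_sentinel)
    return out


# Toy data for illustration only.
toy = pd.DataFrame({"pdays": [999, 3, 999, 0]})
result = split_pdays(toy)
print(result)
```

Whether to keep the NaN values, impute them, or drop `pdays` entirely depends on the model; tree-based models often cope with the raw sentinel as well.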

#### Social and economic context attributes:

16. **emp.var.rate**: employment variation rate - quarterly indicator (numeric)
17. **cons.price.idx**: consumer price index - monthly indicator (numeric)
18. **cons.conf.idx**: consumer confidence index - monthly indicator (numeric)
19. **euribor3m**: euribor 3-month rate - daily indicator (numeric)
20. **nr.employed**: number of employees - quarterly indicator (numeric)

#### Target Variable:

21. **subscribed**: has the client subscribed to a term deposit? (binary: `'yes'`, `'no'`)
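
Most classifiers and metrics expect a numeric target, so the `'yes'`/`'no'` labels usually need encoding. A minimal sketch follows, assuming the target contains only those two strings as described above; the helper name `encode_target` is hypothetical.

```python
import pandas as pd


def encode_target(series: pd.Series) -> pd.Series:
    """Map 'no' -> 0 and 'yes' -> 1, failing loudly on any other label."""
    mapping = {"no": 0, "yes": 1}
    encoded = series.map(mapping)
    # map() yields NaN for labels missing from the mapping; surface that early.
    if encoded.isna().any():
        unexpected = sorted(set(series[encoded.isna()].astype(str)))
        raise ValueError(f"Unexpected target labels: {unexpected}")
    return encoded.astype("int64")


# Toy labels for illustration only.
y = encode_target(pd.Series(["no", "yes", "no"]))
print(y.tolist())
```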

### Citation

[Moro et al., 2014] S. Moro, P. Cortez, and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. *Decision Support Systems*, Elsevier, 62:22-31, June 2014.

# Introduction

## Import libraries

In [1]:
import pandas as pd

## Load data

In [2]:
train_data = pd.read_csv('../datasets/su-24-classification-competition/train.csv')
test_data = pd.read_csv('../datasets/su-24-classification-competition/test.csv')

In [3]:
train_data.head()

| | id | age | job | marital | education | default | housing | loan | contact | month | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | subscribed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 49 | unemployed | married | basic.9y | no | yes | yes | telephone | jun | ... | 1 | 999 | 0 | nonexistent | 1.4 | 94.465 | -41.8 | 4.955 | 5228.1 | no |
| 1 | 1 | 69 | blue-collar | married | unknown | no | yes | no | cellular | aug | ... | 3 | 999 | 0 | nonexistent | -2.9 | 92.201 | -31.4 | 0.883 | 5076.2 | no |
| 2 | 2 | 25 | admin. | single | basic.9y | no | no | no | cellular | jul | ... | 1 | 999 | 0 | nonexistent | 1.4 | 93.918 | -42.7 | 4.96 | 5228.1 | no |
| 3 | 3 | 43 | services | married | basic.6y | unknown | no | no | cellular | jul | ... | 1 | 999 | 0 | nonexistent | 1.4 | 93.918 | -42.7 | 4.962 | 5228.1 | no |
| 4 | 4 | 27 | admin. | single | university.degree | no | no | no | cellular | mar | ... | 4 | 999 | 0 | nonexistent | -1.8 | 93.369 | -34.8 | 0.635 | 5008.7 | yes |


In [4]:
train_data.describe()

| | id | age | campaign | pdays | previous | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 |
| mean | 14999.5 | 39.949233 | 2.563767 | 962.522567 | 0.1739 | 0.081227 | 93.575133 | -40.499327 | 3.621459 | 5167.108613 |
| std | 8660.398374 | 10.405306 | 2.764596 | 186.800371 | 0.495715 | 1.572133 | 0.579706 | 4.631713 | 1.734181 | 72.230819 |
| min | 0.0 | 17.0 | 1.0 | 0.0 | 0.0 | -3.4 | 92.201 | -50.8 | 0.634 | 4963.6 |
| 25% | 7499.75 | 32.0 | 1.0 | 999.0 | 0.0 | -1.8 | 93.075 | -42.7 | 1.344 | 5099.1 |
| 50% | 14999.5 | 38.0 | 2.0 | 999.0 | 0.0 | 1.1 | 93.749 | -41.8 | 4.857 | 5191.0 |
| 75% | 22499.25 | 47.0 | 3.0 | 999.0 | 0.0 | 1.4 | 93.994 | -36.4 | 4.961 | 5228.1 |
| max | 29999.0 | 98.0 | 43.0 | 999.0 | 7.0 | 1.4 | 94.767 | -26.9 | 5.045 | 5228.1 |


In [5]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              30000 non-null  int64  
 1   age             30000 non-null  int64  
 2   job             30000 non-null  object 
 3   marital         30000 non-null  object 
 4   education       30000 non-null  object 
 5   default         30000 non-null  object 
 6   housing         30000 non-null  object 
 7   loan            30000 non-null  object 
 8   contact         30000 non-null  object 
 9   month           30000 non-null  object 
 10  day_of_week     30000 non-null  object 
 11  campaign        30000 non-null  int64  
 12  pdays           30000 non-null  int64  
 13  previous        30000 non-null  int64  
 14  poutcome        30000 non-null  object 
 15  emp.var.rate    30000 non-null  float64
 16  cons.price.idx  30000 non-null  float64
 17  cons.conf.idx   30000 non-null 

## EDA (Exploratory Data Analysis)

## Data Preprocessing

### Check missing values

In [6]:
train_data.isnull().sum().sum().item()

0
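
A zero count from `isnull()` does not mean every field is informative: per the column descriptions, several categorical columns (for example `job`, `education`, `default`) encode missing information as the string `'unknown'`. The following sketch counts that placeholder per column; the helper name `count_unknown` and the toy frame are hypothetical, and on the real data the function would be called as `count_unknown(train_data)`.

```python
import pandas as pd


def count_unknown(df: pd.DataFrame, token: str = "unknown") -> pd.Series:
    """Count occurrences of a placeholder string in each column.

    Only columns containing the token at least once are returned,
    sorted from most to fewest occurrences.
    """
    # isin() compares element-wise and is simply False for numeric columns.
    counts = df.isin([token]).sum()
    return counts[counts > 0].sort_values(ascending=False)


# Toy data for illustration only (not taken from train.csv).
toy = pd.DataFrame(
    {
        "age": [25, 43, 27],
        "job": ["admin.", "unknown", "services"],
        "default": ["unknown", "unknown", "no"],
    }
)
unknown_counts = count_unknown(toy)
print(unknown_counts)
```

Whether `'unknown'` should be kept as its own category or converted to a missing value and imputed is a modeling choice; this sketch only measures how common it is.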