In [2]:
# import the necessary dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mutual_info_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer

### Dataset

In this homework, we will use the Bank Marketing dataset. Download it from [here](https://archive.ics.uci.edu/static/public/222/bank+marketing.zip).

Or you can do it with `wget`:

```bash
wget https://archive.ics.uci.edu/static/public/222/bank+marketing.zip
```

We need the `bank/bank-full.csv` file from the downloaded zip file.  
In this dataset, the target for the classification task is the `y` variable: whether the client has subscribed to a term deposit.
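Pulling the CSV out of the archive can also be scripted. The sketch below is only an illustration: `read_member` is a hypothetical helper, and it assumes the file might sit inside a nested zip (the exact archive layout is not confirmed here), so it searches nested `.zip` members as well. The demo uses a tiny in-memory archive instead of the real download.

```python
import io
import zipfile


def read_member(zip_bytes, suffix):
    """Return the bytes of the first member whose name ends with `suffix`.

    Nested .zip members are searched recursively; returns None if nothing matches.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = archive.namelist()
        # first look for a direct match at this level
        for name in names:
            if name.endswith(suffix):
                return archive.read(name)
        # then descend into any nested archives
        for name in names:
            if name.endswith(".zip"):
                found = read_member(archive.read(name), suffix)
                if found is not None:
                    return found
    return None


def make_zip(members):
    """Build an in-memory zip from a {name: bytes} mapping (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, payload in members.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


# made-up stand-in for the real archive: a CSV inside a nested zip
csv_bytes = b"age;y\n58;no\n"
outer = make_zip({"bank.zip": make_zip({"bank-full.csv": csv_bytes})})

extracted = read_member(outer, "bank-full.csv")
print(extracted == csv_bytes)
```

For the real file, the bytes of the downloaded zip would be passed in and the result written to `data/bank-full.csv`, which is the path the next cell reads.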

In [3]:
# load the dataset
data = pd.read_csv("data/bank-full.csv", sep=";")

In [4]:
data.head()

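| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 4 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |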


### Features

For the rest of the homework, you'll need to use only these columns:

* `age`,
* `job`,
* `marital`,
* `education`,
* `balance`,
* `housing`,
* `contact`,
* `day`,
* `month`,
* `duration`,
* `campaign`,
* `pdays`,
* `previous`,
* `poutcome`,
* `y`

### Data preparation

* Select only the features from above.
* Check whether the features contain missing values.

In [5]:
# drop the two columns that are not in the feature list above
data = data.drop(columns=["loan", "default"])

In [6]:
data.isnull().sum()

age          0
job          0
marital      0
education    0
balance      0
housing      0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64

**There are no missing values**

### Question 1

What is the most frequent observation (mode) for the column `education`?

- `unknown`
- `primary`
- `secondary`
- `tertiary`


In [7]:
data.describe(include="all")["education"]

count         45211
unique            4
top       secondary
freq          23202
mean            NaN
std             NaN
min             NaN
25%             NaN
50%             NaN
75%             NaN
max             NaN
Name: education, dtype: object

In [8]:
data["education"].mode()

0    secondary
Name: education, dtype: object

**The answer to question 1 is _secondary_**
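The mode is simply the value with the highest count. As a pandas-independent sanity check, the same idea can be sketched with `collections.Counter`; the list below is made up for illustration and is not taken from the dataset.

```python
from collections import Counter

# hypothetical sample values, not the real `education` column
education = ["secondary", "tertiary", "secondary", "primary", "unknown", "secondary"]

counts = Counter(education)
# most_common(1) returns a list with the single (value, count) pair of the mode
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)
```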

### Question 2

Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your dataset. 
In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?

- `age` and `balance`
- `day` and `campaign`
- `day` and `pdays`
- `pdays` and `previous`

In [9]:
numerical_columns = data.select_dtypes(include=['int64', 'float64'])

In [10]:
numerical_columns.corr()

| | age | balance | day | duration | campaign | pdays | previous |
|---|---|---|---|---|---|---|---|
| age | 1.0 | 0.097783 | -0.00912 | -0.004648 | 0.00476 | -0.023758 | 0.001288 |
| balance | 0.097783 | 1.0 | 0.004503 | 0.02156 | -0.014578 | 0.003435 | 0.016674 |
| day | -0.00912 | 0.004503 | 1.0 | -0.030206 | 0.16249 | -0.093044 | -0.05171 |
| duration | -0.004648 | 0.02156 | -0.030206 | 1.0 | -0.08457 | -0.001565 | 0.001203 |
| campaign | 0.00476 | -0.014578 | 0.16249 | -0.08457 | 1.0 | -0.088628 | -0.032855 |
| pdays | -0.023758 | 0.003435 | -0.093044 | -0.001565 | -0.088628 | 1.0 | 0.45482 |
| previous | 0.001288 | 0.016674 | -0.05171 | 0.001203 | -0.032855 | 0.45482 | 1.0 |


**`pdays` and `previous` have the biggest correlation (0.45)**
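Reading the largest off-diagonal entry by eye works for a 7x7 matrix, but it can also be found programmatically. The sketch below copies the correlation values from the matrix above into plain lists and scans every unordered pair for the largest absolute value. It assumes "biggest correlation" means largest in absolute value.

```python
from itertools import combinations

# values copied from the correlation matrix above
columns = ["age", "balance", "day", "duration", "campaign", "pdays", "previous"]
corr = [
    [1.0, 0.097783, -0.00912, -0.004648, 0.00476, -0.023758, 0.001288],
    [0.097783, 1.0, 0.004503, 0.02156, -0.014578, 0.003435, 0.016674],
    [-0.00912, 0.004503, 1.0, -0.030206, 0.16249, -0.093044, -0.05171],
    [-0.004648, 0.02156, -0.030206, 1.0, -0.08457, -0.001565, 0.001203],
    [0.00476, -0.014578, 0.16249, -0.08457, 1.0, -0.088628, -0.032855],
    [-0.023758, 0.003435, -0.093044, -0.001565, -0.088628, 1.0, 0.45482],
    [0.001288, 0.016674, -0.05171, 0.001203, -0.032855, 0.45482, 1.0],
]

# combinations() yields each pair (i, j) with i < j, so the diagonal is skipped
i, j = max(
    combinations(range(len(columns)), 2),
    key=lambda pair: abs(corr[pair[0]][pair[1]]),
)
top_pair = (columns[i], columns[j], corr[i][j])
print(top_pair)
```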

### Target encoding

* Now we want to encode the `y` variable.
* Let's replace the values `yes`/`no` with `1`/`0`.

In [11]:
y = data["y"]

y = (y == "yes").astype(int)

In [12]:
X = data.drop("y", axis=1)

### Split the data

* Split your data into train/val/test sets with a 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`.
* Make sure that the target value `y` is not in your dataframe.


In [13]:
X_full, X_test, y_full, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train, X_val, y_train, y_val = train_test_split(X_full, y_full, test_size=0.25, random_state=42)

In [14]:
len(X_train), len(X_test), len(X_val)

(27126, 9043, 9042)
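The second split uses `test_size=0.25` because the validation set is carved out of the remaining 80% of the data, and 0.25 × 0.8 = 0.2. The arithmetic can be checked without scikit-learn. The sketch assumes a fractional `test_size` is converted to a row count by rounding up, which is consistent with the sizes printed above; the total of 45211 rows comes from the `describe()` output earlier.

```python
import math

n_total = 45211  # row count reported by describe() earlier

# assumption: a fractional test_size is turned into a row count by rounding up
n_test = math.ceil(0.2 * n_total)
n_full = n_total - n_test

# 25% of the remaining rows corresponds to 20% of the whole dataset
n_val = math.ceil(0.25 * n_full)
n_train = n_full - n_val

sizes = (n_train, n_test, n_val)
print(sizes)
print([round(n / n_total, 3) for n in sizes])
```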

### Question 3

* Calculate the mutual information score between `y` and other categorical variables in the dataset. Use the training set only.
* Round the scores to 2 decimals using `round(score, 2)`.

Which of these variables has the biggest mutual information score?
  
- `contact`
- `education`
- `housing`
- `poutcome`

In [15]:
round(mutual_info_score(y_train, X_train.contact), 2)

0.01

In [16]:
round(mutual_info_score(y_train, X_train.education), 2)

0.0

In [17]:
round(mutual_info_score(y_train, X_train.housing), 2)

0.01

In [18]:
round(mutual_info_score(y_train, X_train.poutcome), 2)

0.03

**poutcome has the highest mutual information score**
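For intuition about what the score measures, mutual information between two discrete variables is the sum over all value pairs of p(x, y) · log(p(x, y) / (p(x) · p(y))). A minimal from-scratch sketch follows, assuming natural logarithms (nats), which is my understanding of what `mutual_info_score` reports. `mutual_info` is a hypothetical helper and the inputs are toy data.

```python
import math
from collections import Counter


def mutual_info(xs, ys):
    """Mutual information (in nats) between two equal-length label sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    count_x = Counter(xs)         # marginal counts of x
    count_y = Counter(ys)         # marginal counts of y
    total = 0.0
    for (x, y), count in joint.items():
        # p(x, y) / (p(x) * p(y)) simplifies to count * n / (count_x * count_y)
        total += (count / n) * math.log(count * n / (count_x[x] * count_y[y]))
    return total


# toy data: one feature determines the target, the other is independent of it
target = [0, 0, 1, 1]
informative = ["a", "a", "b", "b"]
uninformative = ["a", "b", "a", "b"]

print(mutual_info(informative, target))
print(mutual_info(uninformative, target))
```

A feature that fully determines a balanced binary target scores log(2) ≈ 0.693, while an independent feature scores 0. Scores of 0.01 to 0.03 therefore indicate weak dependence on their own.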

### Question 4

* Now let's train a logistic regression.
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

- 0.6
- 0.7
- 0.8
- 0.9

In [19]:
X_train_dict = X_train.to_dict(orient="records")
X_val_dict = X_val.to_dict(orient="records")

dv = DictVectorizer(sparse=False)
X_train_trans = dv.fit_transform(X_train_dict)
X_val_trans = dv.transform(X_val_dict)
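To make the encoding step less of a black box, here is a pure-Python sketch of what one-hot encoding a list of dict records amounts to. It mimics `DictVectorizer` as I understand it: string values become `name=value` indicator columns, numeric values pass through, and categories unseen during fitting are ignored. The function names are hypothetical and the records are made up.

```python
def fit_feature_names(records):
    """Collect the sorted column names implied by the training records."""
    names = set()
    for record in records:
        for key, value in record.items():
            # strings get one indicator column per distinct value
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    return sorted(names)


def transform(records, feature_names):
    """Encode records as rows of floats using the fitted column names."""
    index = {name: position for position, name in enumerate(feature_names)}
    rows = []
    for record in records:
        row = [0.0] * len(feature_names)
        for key, value in record.items():
            if isinstance(value, str):
                name, cell = f"{key}={value}", 1.0
            else:
                name, cell = key, float(value)
            # columns not seen during fitting are silently skipped
            if name in index:
                row[index[name]] = cell
        rows.append(row)
    return rows


train = [{"age": 58, "marital": "married"}, {"age": 44, "marital": "single"}]
val = [{"age": 33, "marital": "divorced"}]  # "divorced" was never seen in train

feature_names = fit_feature_names(train)
print(feature_names)
print(transform(train, feature_names))
print(transform(val, feature_names))
```

This is why the vectorizer is fitted on the training records only and merely applied to the validation records: the column layout must come from the training data.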

In [20]:
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train_trans, y_train)

In [21]:
y_pred = model.predict(X_val_trans)

In [22]:
global_y = (y_val == y_pred).mean()

In [23]:
global_y

0.9011280690112807

**The answer for question 4 is 0.9**

### Question 5 

* Let's find the least useful feature using the *feature elimination* technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature. 

Which of the following features has the smallest difference?

- `age`
- `balance`
- `marital`
- `previous`

> **Note**: The difference doesn't have to be positive.

In [35]:
def feature_elimination(X_train, X_val, y_train, y_val, column, global_y):
    """Retrain without `column` and print the change in validation accuracy."""
    model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)

    # drop the feature under test from both splits
    X_train = X_train.drop(columns=column)
    X_val = X_val.drop(columns=column)

    X_train_dict = X_train.to_dict(orient="records")
    X_val_dict = X_val.to_dict(orient="records")

    # re-fit the vectorizer so the dropped feature leaves no columns behind
    dv = DictVectorizer(sparse=False)
    X_train_trans = dv.fit_transform(X_train_dict)
    X_val_trans = dv.transform(X_val_dict)

    model.fit(X_train_trans, y_train)

    y_pred = model.predict(X_val_trans)

    subset_accuracy = (y_val == y_pred).mean()
    # the baseline is rounded to 2 decimals, so every difference carries the same constant offset
    diff = round(global_y, 2) - subset_accuracy

    print(f"difference after dropping {column}: {diff}")

In [36]:
for i in X_train.columns.tolist():
    feature_elimination(X_train , X_val,y_train, y_val, i , global_y)

difference after dropping age: -0.001128069011280708
difference after dropping job: -0.001128069011280708
difference after dropping marital: -0.0003539040035389629
difference after dropping education: -0.000796284007962833
difference after dropping balance: -0.001128069011280708
difference after dropping housing: -0.0005750940057509535
difference after dropping contact: -0.000796284007962833
difference after dropping day: -0.000796284007962833
difference after dropping month: 0.0003096660030966758
difference after dropping duration: 0.00971024109710239
difference after dropping campaign: 0.0001990710019906805
difference after dropping pdays: -0.0009068790090687173
difference after dropping previous: -0.0006856890068568378
difference after dropping poutcome: 0.0060606060606061


**The answer to question 5 is _marital_: among the listed options it has the smallest absolute difference**
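Picking the answer from the printed differences can be made explicit. The sketch below copies the four relevant values from the output above and selects the one closest to zero. It assumes "smallest difference" is read as smallest in absolute value, in line with the note that the difference does not have to be positive.

```python
# differences copied from the output above, restricted to the four answer options
diffs = {
    "age": -0.001128069011280708,
    "balance": -0.001128069011280708,
    "marital": -0.0003539040035389629,
    "previous": -0.0006856890068568378,
}

# the feature whose removal changes accuracy the least, in absolute terms
least_useful = min(diffs, key=lambda feature: abs(diffs[feature]))
print(least_useful)
```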

### Question 6

* Now let's train a regularized logistic regression.
* Let's try the following values of the parameter `C`: `[0.01, 0.1, 1, 10, 100]`.
* Train models using all the features as in Q4.
* Calculate the accuracy on the validation dataset and round it to 3 decimal digits.

Which of these `C` leads to the best accuracy on the validation set?

- 0.01
- 0.1
- 1
- 10
- 100

> **Note**: If there are multiple options, select the smallest `C`.

In [40]:
# vectorize once: the encoded features do not depend on C
X_train_dict = X_train.to_dict(orient="records")
X_val_dict = X_val.to_dict(orient="records")

dv = DictVectorizer(sparse=False)
X_train_trans = dv.fit_transform(X_train_dict)
X_val_trans = dv.transform(X_val_dict)

for c in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(solver='liblinear', C=c, max_iter=1000, random_state=42)
    model.fit(X_train_trans, y_train)

    y_pred = model.predict(X_val_trans)

    accuracy = round((y_val == y_pred).mean(), 3)
    print(f"{c}: accuracy of c = {accuracy}")

0.01: accuracy of c = 0.898
0.1: accuracy of c = 0.901
1: accuracy of c = 0.901
10: accuracy of c = 0.902
100: accuracy of c = 0.901


**The answer to question 6 is 10**
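The selection rule (highest accuracy, ties broken by the smallest `C`) can be written as a single `min` over a compound key. The accuracies below are copied from the output above; the tie-break does not come into play here because `C=10` is the unique maximum.

```python
# rounded validation accuracies copied from the output above
accuracies = {0.01: 0.898, 0.1: 0.901, 1: 0.901, 10: 0.902, 100: 0.901}

# sort key: highest accuracy first (negated), then smallest C as the tie-breaker
best_c = min(accuracies, key=lambda c: (-accuracies[c], c))
print(best_c, accuracies[best_c])
```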