## Homework

> Note: sometimes your answer doesn't match one of the options exactly. 
> That's fine. 
> Select the option that's closest to your solution.

### Dataset

In this homework, we will use the Bank Marketing dataset. Download it from [here](https://archive.ics.uci.edu/static/public/222/bank+marketing.zip).

Or you can do it with `wget`:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
#wget https://archive.ics.uci.edu/static/public/222/bank+marketing.zip

Use the `bank/bank-full.csv` file from the downloaded zip archive, and pass a semicolon as the separator to the `read_csv` function.

In this dataset, the target for the classification task is the `y` variable: whether the client has subscribed to a term deposit.
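To see why the separator matters, here is a minimal sketch on a tiny in-memory sample. The three-column text below is made up for illustration and is not the real file.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated sample that mimics the file layout.
raw = "age;job;y\n58;management;no\n44;technician;yes\n"

# With sep=';' each field becomes its own column.
semi = pd.read_csv(io.StringIO(raw), sep=';')

# With the default comma separator the whole line lands in one column.
default = pd.read_csv(io.StringIO(raw))

print(semi.shape, default.shape)
```

If the frame comes back with a single column, the separator is the first thing to check.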


In [3]:
df = pd.read_csv('/Users/nathaly/MLZoomcamp/03-classification/bank+marketing/bank/bank-full.csv', sep=';')


In [4]:
df.head()

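|   | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 4 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |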




### Features

For the rest of the homework, you'll need to use only these columns:

* `age`,
* `job`,
* `marital`,
* `education`,
* `balance`,
* `housing`,
* `contact`,
* `day`,
* `month`,
* `duration`,
* `campaign`,
* `pdays`,
* `previous`,
* `poutcome`,
* `y`

### Data preparation

* Select only the features from above.
* Check whether the features contain missing values.


In [5]:
df=df[['age',
'job',
'marital',
'education',
'balance',
'housing',
'contact',
'day',
'month',
'duration',
'campaign',
'pdays',
'previous',
'poutcome',
'y']]

In [6]:
df.head()

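|   | age | job | marital | education | balance | housing | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | 2143 | yes | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 1 | 44 | technician | single | secondary | 29 | yes | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 2 | 33 | entrepreneur | married | secondary | 2 | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 3 | 47 | blue-collar | married | unknown | 1506 | yes | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 4 | 33 | unknown | single | unknown | 1 | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |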
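The data preparation list also asks for a missing-value check. A minimal sketch of one way to do it, shown on a small hypothetical frame (`toy` is an illustration, not the homework data):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column.
toy = pd.DataFrame({
    'age': [58, 44, np.nan],
    'job': ['management', None, 'technician'],
})

# isnull() flags NaN/None cells; sum() counts them per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

On the homework data the same call would be `df.isnull().sum()`. Note that values such as `unknown` in the preview above are ordinary strings, so this check would not count them as missing.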




### Question 1

What is the most frequent observation (mode) for the column `education`?

- `unknown`
- `primary`
- `secondary`
- `tertiary`

In [7]:
df.education.mode()

0    secondary
Name: education, dtype: object

In [8]:
df.education.value_counts()

education
secondary    23202
tertiary     13301
primary       6851
unknown       1857
Name: count, dtype: int64

### Question 2

Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your dataset. 
In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?

- `age` and `balance`
- `day` and `campaign`
- `day` and `pdays`
- `pdays` and `previous`

In [9]:
corr=df.select_dtypes('int64').corr()
corr

|   | age | balance | day | duration | campaign | pdays | previous |
|---|---|---|---|---|---|---|---|
| age | 1.0 | 0.097783 | -0.00912 | -0.004648 | 0.00476 | -0.023758 | 0.001288 |
| balance | 0.097783 | 1.0 | 0.004503 | 0.02156 | -0.014578 | 0.003435 | 0.016674 |
| day | -0.00912 | 0.004503 | 1.0 | -0.030206 | 0.16249 | -0.093044 | -0.05171 |
| duration | -0.004648 | 0.02156 | -0.030206 | 1.0 | -0.08457 | -0.001565 | 0.001203 |
| campaign | 0.00476 | -0.014578 | 0.16249 | -0.08457 | 1.0 | -0.088628 | -0.032855 |
| pdays | -0.023758 | 0.003435 | -0.093044 | -0.001565 | -0.088628 | 1.0 | 0.45482 |
| previous | 0.001288 | 0.016674 | -0.05171 | 0.001203 | -0.032855 | 0.45482 | 1.0 |


In [10]:
corr.loc['age','balance']

np.float64(0.09778273937134807)

In [11]:
corr.loc['day','campaign']

np.float64(0.1624902163261922)

In [12]:
corr.loc['day','pdays'].round(6)

np.float64(-0.093044)

In [13]:
corr.loc['pdays','previous']

np.float64(0.4548196354805043)

`pdays` and `previous` have the highest correlation (about 0.45).
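Instead of looking up each pair by hand, the strongest pair can be pulled out of the matrix directly. The sketch below is one possible approach on synthetic data. The helper name `top_correlated_pair` and the columns `a`, `b`, `c` are made up for illustration, and it ranks pairs by absolute correlation, which is an assumption about what "biggest" means.

```python
import numpy as np
import pandas as pd


def top_correlated_pair(frame):
    """Return the pair of columns with the largest absolute correlation."""
    corr = frame.corr().abs()
    # Keep only the cells above the diagonal so each pair appears once
    # and the self-correlations (always 1.0) are ignored.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs.idxmax(), pairs.max()


# Synthetic data: 'b' is 'a' plus a little noise, 'c' is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.1, size=200),
    'c': rng.normal(size=200),
})

pair, value = top_correlated_pair(toy)
print(pair, round(value, 3))
```

On the homework data the call would be `top_correlated_pair(df.select_dtypes('int64'))`.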

### Target encoding

* Now we want to encode the `y` variable.
* Let's replace the values `yes`/`no` with `1`/`0`.


In [14]:
# Encode the target as integers (1 for 'yes', 0 for 'no') rather than the strings '1'/'0'.
y = (df.y == 'yes').astype(int)


### Split the data

* Split your data into train/val/test sets with a 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`.
* Make sure that the target value `y` is not in your dataframe.


In [15]:
from sklearn.model_selection import train_test_split

In [16]:
df.head()

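|   | age | job | marital | education | balance | housing | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | 2143 | yes | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 1 | 44 | technician | single | secondary | 29 | yes | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 2 | 33 | entrepreneur | married | secondary | 2 | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 3 | 47 | blue-collar | married | unknown | 1506 | yes | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 4 | 33 | unknown | single | unknown | 1 | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |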


In [17]:
X=df.drop(columns='y')

In [18]:
X.head()

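|   | age | job | marital | education | balance | housing | contact | day | month | duration | campaign | pdays | previous | poutcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | 2143 | yes | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown |
| 1 | 44 | technician | single | secondary | 29 | yes | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown |
| 2 | 33 | entrepreneur | married | secondary | 2 | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown |
| 3 | 47 | blue-collar | married | unknown | 1506 | yes | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown |
| 4 | 33 | unknown | single | unknown | 1 | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown |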


In [19]:
X_full_train, X_test, y_full_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [20]:
X_train, X_val, y_train, y_val = train_test_split(X_full_train,y_full_train, test_size = 0.25, random_state=42 ) 

In [21]:
X_train.shape

(27126, 14)

In [22]:
X_val.shape

(9042, 14)

In [23]:
X_test.shape

(9043, 14)
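The second split uses `test_size=0.25` because 25% of the remaining 80% is 20% of the full data. A small sketch on a hypothetical array of 1000 row indices (not the homework data) illustrates the arithmetic:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the dataset: 1000 row indices.
rows = np.arange(1000)

# First split: hold out 20% for the test set.
full_train, test = train_test_split(rows, test_size=0.2, random_state=42)

# Second split: 25% of the remaining 80% gives a 20% validation set.
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

sizes = (len(train), len(val), len(test))
print(sizes)  # (600, 200, 200)
```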

In [24]:
X_train.columns

Index(['age', 'job', 'marital', 'education', 'balance', 'housing', 'contact',
       'day', 'month', 'duration', 'campaign', 'pdays', 'previous',
       'poutcome'],
      dtype='object')


### Question 3

* Calculate the mutual information score between `y` and other categorical variables in the dataset. Use the training set only.
* Round the scores to 2 decimals using `round(score, 2)`.

Which of these variables has the biggest mutual information score?
  
- `contact`
- `education`
- `housing`
- `poutcome`

In [25]:
from sklearn.metrics import mutual_info_score

In [26]:
categorical = list(X_train.select_dtypes('object').columns)

In [27]:
categorical

['job', 'marital', 'education', 'housing', 'contact', 'month', 'poutcome']

In [28]:
scores=[]
for c in categorical:
    score = mutual_info_score(y_train, X_train[c])
    scores.append(score)

In [29]:
list(zip(categorical, scores))

[('job', np.float64(0.007316082778474635)),
 ('marital', np.float64(0.0020495925927810216)),
 ('education', np.float64(0.0026967549991295282)),
 ('housing', np.float64(0.010343105891750026)),
 ('contact', np.float64(0.013356062198247219)),
 ('month', np.float64(0.02509003344365025)),
 ('poutcome', np.float64(0.029532821290436224))]

`poutcome` has the highest mutual information score (about 0.03).
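The rounding and ranking asked for in the question can also be done in one step. This is a minimal sketch on toy data; the column names `informative` and `noise` and the target values are invented for illustration.

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

# Toy categorical features: 'informative' determines the target, 'noise' does not.
toy = pd.DataFrame({
    'informative': ['a', 'a', 'b', 'b', 'a', 'a', 'b', 'b'],
    'noise':       ['x', 'y', 'x', 'y', 'x', 'y', 'x', 'y'],
})
target = pd.Series([1, 1, 0, 0, 1, 1, 0, 0])

# Score every column against the target, round to 2 decimals, sort descending.
mi = toy.apply(lambda col: round(mutual_info_score(target, col), 2))
mi = mi.sort_values(ascending=False)
print(mi)
```

On the homework data the equivalent call would be `X_train[categorical].apply(lambda col: round(mutual_info_score(y_train, col), 2))`.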

### Question 4

* Now let's train a logistic regression.
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

- 0.6
- 0.7
- 0.8
- 0.9

In [30]:
from sklearn.feature_extraction import DictVectorizer

In [31]:
train_dicts = X_train.to_dict(orient = 'records')

In [32]:
dv = DictVectorizer(sparse = False)

In [33]:
dv.fit(train_dicts)

In [34]:
dv.get_feature_names_out()

array(['age', 'balance', 'campaign', 'contact=cellular',
       'contact=telephone', 'contact=unknown', 'day', 'duration',
       'education=primary', 'education=secondary', 'education=tertiary',
       'education=unknown', 'housing=no', 'housing=yes', 'job=admin.',
       'job=blue-collar', 'job=entrepreneur', 'job=housemaid',
       'job=management', 'job=retired', 'job=self-employed',
       'job=services', 'job=student', 'job=technician', 'job=unemployed',
       'job=unknown', 'marital=divorced', 'marital=married',
       'marital=single', 'month=apr', 'month=aug', 'month=dec',
       'month=feb', 'month=jan', 'month=jul', 'month=jun', 'month=mar',
       'month=may', 'month=nov', 'month=oct', 'month=sep', 'pdays',
       'poutcome=failure', 'poutcome=other', 'poutcome=success',
       'poutcome=unknown', 'previous'], dtype=object)

In [35]:
X_train_t = dv.transform(train_dicts)

In [36]:
val_dicts = X_val.to_dict(orient='records')

In [37]:
X_val_t = dv.transform(val_dicts)

In [38]:
from sklearn.linear_model import LogisticRegression

In [39]:
model = LogisticRegression(solver='liblinear', C=1.0,max_iter=1000,random_state=42)

In [40]:
model.fit(X_train_t, y_train)

In [41]:
y_pred = model.predict(X_val_t)

In [42]:
accuracy = round((y_pred == y_val).mean(),2)
accuracy

np.float64(0.9)

In [43]:
accuracy_final = (y_pred == y_val).mean()

In [44]:
accuracy_final

np.float64(0.9007962840079629)
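The accuracy above is the mean of an element-wise comparison between predictions and labels. A short sketch with made-up labels suggests this should agree with Scikit-Learn's `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predictions: 4 of 5 match.
y_true = np.array([1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 1, 1, 0])

manual = (y_true == y_hat).mean()         # fraction of matching positions
library = accuracy_score(y_true, y_hat)   # same quantity via Scikit-Learn
print(manual, library)
```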

### Question 5 

* Let's find the least useful feature using the *feature elimination* technique.
* Train a model using the same features and parameters as in Q4 (without rounding).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature. 

Which of the following features has the smallest difference?

- `age`
- `balance`
- `marital`
- `previous`

> **Note**: The difference doesn't have to be positive.

In [45]:
columns = list(X_train.columns)
columns

['age',
 'job',
 'marital',
 'education',
 'balance',
 'housing',
 'contact',
 'day',
 'month',
 'duration',
 'campaign',
 'pdays',
 'previous',
 'poutcome']

In [46]:
from sklearn.metrics import accuracy_score


In [47]:
scores = pd.DataFrame(columns=['eliminated_feature', 'accuracy', 'difference'])
for c in columns:
    # Drop the feature from both the training and the validation set.
    train_dicts = X_train.drop(columns=c).to_dict(orient='records')
    val_dicts = X_val.drop(columns=c).to_dict(orient='records')
    # Refit the vectorizer so the encoding matches the reduced feature set.
    dv = DictVectorizer(sparse=False)
    X_train_final = dv.fit_transform(train_dicts)
    X_val_final = dv.transform(val_dicts)
    model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    model.fit(X_train_final, y_train)
    y_pred = model.predict(X_val_final)
    accuracy = accuracy_score(y_val, y_pred)
    # Original (Q4, unrounded) accuracy minus the accuracy without this feature.
    difference = accuracy_final - accuracy
    scores.loc[len(scores)] = [c, accuracy, difference]



In [48]:
scores

|   | eliminated_feature | accuracy | difference |
|---|---|---|---|
| 0 | age | 0.901017 | -0.000221 |
| 1 | job | 0.900796 | 0.0 |
| 2 | marital | 0.900575 | 0.000221 |
| 3 | education | 0.900907 | -0.000111 |
| 4 | balance | 0.901017 | -0.000221 |
| 5 | housing | 0.901349 | -0.000553 |
| 6 | contact | 0.900464 | 0.000332 |
| 7 | day | 0.901239 | -0.000442 |
| 8 | month | 0.899801 | 0.000995 |
| 9 | duration | 0.889073 | 0.011723 |


In [49]:
scores[scores.difference.abs() == scores.difference.abs().min()]

|   | eliminated_feature | accuracy | difference |
|---|---|---|---|
| 1 | job | 0.900796 | 0.0 |
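The question only offers `age`, `balance`, `marital`, and `previous`, so it may help to restrict the comparison to those candidates. The sketch below is one way to do that. It assumes "smallest difference" means smallest absolute difference, and the helper name `smallest_difference` and the toy values are hypothetical.

```python
import pandas as pd


def smallest_difference(scores, candidates):
    """Return the candidate feature whose removal changes accuracy the least."""
    subset = scores[scores['eliminated_feature'].isin(candidates)]
    # Assumption: compare by absolute value, since the difference can be negative.
    best_row = subset['difference'].abs().idxmin()
    return subset.loc[best_row, 'eliminated_feature']


# Toy results table with made-up feature names and differences.
toy_scores = pd.DataFrame({
    'eliminated_feature': ['f1', 'f2', 'f3'],
    'difference': [0.0, -0.002, 0.0005],
})

result = smallest_difference(toy_scores, ['f2', 'f3'])
print(result)  # f3
```

With the notebook's table, the call would be `smallest_difference(scores, ['age', 'balance', 'marital', 'previous'])`.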


### Question 6

* Now let's train a regularized logistic regression.
* Let's try the following values of the parameter `C`: `[0.01, 0.1, 1, 10, 100]`.
* Train models using all the features as in Q4.
* Calculate the accuracy on the validation dataset and round it to 3 decimal digits.

Which of these `C` leads to the best accuracy on the validation set?

- 0.01
- 0.1
- 1
- 10
- 100

> **Note**: If there are multiple options, select the smallest `C`.


In [50]:
C = [0.01, 0.1, 1, 10, 100]

In [51]:
regularized = pd.DataFrame(columns=['C', 'accuracy'])

# The encoded matrices do not depend on C, so build them once.
train_dicts = X_train.to_dict(orient='records')
val_dicts = X_val.to_dict(orient='records')
dv = DictVectorizer(sparse=False)
X_train_final = dv.fit_transform(train_dicts)
X_val_final = dv.transform(val_dicts)

for c_value in C:
    model = LogisticRegression(solver='liblinear', C=c_value, max_iter=1000, random_state=42)
    model.fit(X_train_final, y_train)
    y_pred = model.predict(X_val_final)
    accuracy = round(accuracy_score(y_val, y_pred), 3)
    regularized.loc[len(regularized)] = [c_value, accuracy]

In [52]:
regularized

|   | C | accuracy |
|---|---|---|
| 0 | 0.01 | 0.898 |
| 1 | 0.1 | 0.901 |
| 2 | 1.0 | 0.901 |
| 3 | 10.0 | 0.901 |
| 4 | 100.0 | 0.901 |


In [53]:
regularized[regularized.accuracy==regularized.accuracy.max()]

|   | C | accuracy |
|---|---|---|
| 1 | 0.1 | 0.901 |
| 2 | 1.0 | 0.901 |
| 3 | 10.0 | 0.901 |
| 4 | 100.0 | 0.901 |
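Several values of `C` tie at the best accuracy, and the note asks for the smallest one. A short sketch of the tie-break, using the accuracies from the table above (the frame name `results` is only for illustration):

```python
import pandas as pd

# Accuracies copied from the results table above.
results = pd.DataFrame({
    'C': [0.01, 0.1, 1.0, 10.0, 100.0],
    'accuracy': [0.898, 0.901, 0.901, 0.901, 0.901],
})

# Sort by accuracy (best first), then by C (smallest first) to break ties.
best = results.sort_values(['accuracy', 'C'], ascending=[False, True]).iloc[0]
best_C = best['C']
print(best_C)  # 0.1
```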


## Submit the results

* Submit your results here: https://courses.datatalks.club/ml-zoomcamp-2024/homework/hw03
* If your answer doesn't match the options exactly, select the closest one.