# ML Zoomcamp 2024 - Classification 

This is part of the [ML Zoomcamp](https://github.com/DataTalksClub/machine-learning-zoomcamp/tree/master), organized by [DataTalks.Club](https://datatalks.club/). 
In this session, we learned about classification. 

The dataset we used was `bank-full.csv` from the [bank marketing](https://archive.ics.uci.edu/static/public/222/bank+marketing.zip) dataset provided by [Moro et al., 2011](http://hdl.handle.net/1822/14838)<sup>1</sup>.
<br>In this dataset, the target for the classification task is the `y` variable: whether the client has subscribed to a term deposit.

<sup>1</sup>S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. 
  In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.

# 1. Data Preparation 

* Read the data with pandas.
* Look at the data.
* Select the columns (based on the course instructions).
* Encode the target variable as an integer (1 for `yes`, 0 for `no`).

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import mutual_info_score
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

In [3]:
df = pd.read_csv('../bank/bank-full.csv', sep=";")
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


In [3]:
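# .copy() makes an independent frame, so reassigning `y` later does not trigger a SettingWithCopyWarning.
df = df[['age', 'job', 'marital', 'education', 'balance', 'housing', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'y']].copy()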

In [4]:
df.dtypes

age           int64
job          object
marital      object
education    object
balance       int64
housing      object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
y            object
dtype: object

In [5]:
df.describe()

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous
count,45211.0,45211.0,45211.0,45211.0,45211.0,45211.0,45211.0
mean,40.93621,1362.272058,15.806419,258.16308,2.763841,40.197828,0.580323
std,10.618762,3044.765829,8.322476,257.527812,3.098021,100.128746,2.303441
min,18.0,-8019.0,1.0,0.0,1.0,-1.0,0.0
25%,33.0,72.0,8.0,103.0,1.0,-1.0,0.0
50%,39.0,448.0,16.0,180.0,2.0,-1.0,0.0
75%,48.0,1428.0,21.0,319.0,3.0,-1.0,0.0
max,95.0,102127.0,31.0,4918.0,63.0,871.0,275.0


In [6]:
df.y = (df.y == 'yes').astype(int)
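
The comparison `df.y == 'yes'` yields a boolean Series, and `.astype(int)` maps `True`/`False` to 1/0. A minimal sketch on a toy Series (the values below are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy target column; the labels mirror the 'yes'/'no' values of `y`.
y_raw = pd.Series(['no', 'yes', 'no', 'yes', 'no'])

# Boolean mask -> integers: 'yes' becomes 1, everything else becomes 0.
y_encoded = (y_raw == 'yes').astype(int)

print(y_encoded.tolist())  # [0, 1, 0, 1, 0]
print(y_encoded.mean())    # share of positive labels: 0.4
```

With a 0/1 target, the mean of the column is the share of positive cases, which is convenient during EDA.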

In [7]:
df.corr(numeric_only=True)

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,y
age,1.0,0.097783,-0.00912,-0.004648,0.00476,-0.023758,0.001288,0.025155
balance,0.097783,1.0,0.004503,0.02156,-0.014578,0.003435,0.016674,0.052838
day,-0.00912,0.004503,1.0,-0.030206,0.16249,-0.093044,-0.05171,-0.028348
duration,-0.004648,0.02156,-0.030206,1.0,-0.08457,-0.001565,0.001203,0.394521
campaign,0.00476,-0.014578,0.16249,-0.08457,1.0,-0.088628,-0.032855,-0.073172
pdays,-0.023758,0.003435,-0.093044,-0.001565,-0.088628,1.0,0.45482,0.103621
previous,0.001288,0.016674,-0.05171,0.001203,-0.032855,0.45482,1.0,0.093236
y,0.025155,0.052838,-0.028348,0.394521,-0.073172,0.103621,0.093236,1.0


# 2. Setting Up the Validation Framework
* Perform the train/validation/test split using Scikit-Learn.

In [8]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

In [9]:
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)

In [10]:
len(df_train), len(df_val), len(df_test)

(27126, 9042, 9043)
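
The second split uses `test_size=0.25` because it is applied to the remaining 80% of the data: 25% of 80% is 20% of the total, which gives a 60/20/20 split. The arithmetic can be sketched as follows. `split_sizes` is a hypothetical helper, and the rounding rule (test part rounded up, train part takes the rest) is an assumption that happens to reproduce the sizes printed above.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test part is rounded up and the train part takes the rest.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_total = 45211  # row count from df.describe()

n_full_train, n_test = split_sizes(n_total, 0.2)
n_train, n_val = split_sizes(n_full_train, 0.25)

print(n_train, n_val, n_test)  # 27126 9042 9043

# Shares of the whole dataset, rounded to three decimals.
print([round(n / n_total, 3) for n in (n_train, n_val, n_test)])  # [0.6, 0.2, 0.2]
```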

In [11]:
y_train = df_train.y.values
y_val = df_val.y.values
y_test = df_test.y.values

In [12]:
del df_train['y']
del df_val['y']
del df_test['y']

# 3. EDA
* Check missing values.
* Look at the target variable.
* Look at numerical and categorical variables.

In [13]:
df_full_train.isnull().sum()

age          0
job          0
marital      0
education    0
balance      0
housing      0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64

In [14]:
df_full_train.y.value_counts(normalize=True)

y
0    0.883931
1    0.116069
Name: proportion, dtype: float64

About 11.6% of the customers in the full training set (`df_full_train`) have subscribed to a term deposit, so the classes are imbalanced.
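
Because the negative class dominates, accuracy needs a reference point: a classifier that always predicts "no subscription" is already right most of the time. A small sketch of that baseline, using the proportions printed above (they come from `df_full_train`, so the validation split may differ slightly):

```python
# Class proportions copied from the value_counts output above.
share_negative = 0.883931
share_positive = 0.116069

# A "model" that always predicts the majority class (0) is right
# exactly as often as that class occurs.
baseline_accuracy = max(share_negative, share_positive)

print(f'Majority-class baseline accuracy: {baseline_accuracy:.1%}')  # 88.4%
```

The validation accuracy reported in section 6 is best read against this baseline rather than against 50%.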

In [15]:
numerical = ['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']

In [16]:
categorical = ['job', 'marital', 'education', 'housing', 'contact','month', 'poutcome']

In [17]:
df_full_train[categorical].nunique()

job          12
marital       3
education     4
housing       2
contact       3
month        12
poutcome      4
dtype: int64

# 4. Feature Importance 
* Calculate the mutual information between `y` and other categorical variables.
* Calculate correlation with numerical variables.

In [18]:
def mutual_information(series):
    return mutual_info_score(series, df_full_train.y)

In [19]:
df_full_train[categorical].apply(mutual_information).round(2).sort_values(ascending=False)

poutcome     0.03
month        0.02
job          0.01
housing      0.01
contact      0.01
marital      0.00
education    0.00
dtype: float64
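
Mutual information measures how much knowing a feature's value reduces uncertainty about the target: 0 means the two are independent, and larger values mean a stronger dependence. The sketch below computes it by hand on toy data. `mutual_information_nats` is a hypothetical helper written for illustration; it uses the natural logarithm, which is also the unit `mutual_info_score` reports.

```python
import math
from collections import Counter


def mutual_information_nats(xs, ys):
    """Mutual information of two discrete sequences, in nats."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)

    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) / (p(x) * p(y)) simplifies to count * n / (count_x * count_y).
        mi += p_xy * math.log(count * n / (count_x[x] * count_y[y]))
    return mi


# Toy data: the feature fully determines the target.
dependent = mutual_information_nats(['a', 'a', 'b', 'b'], [1, 1, 0, 0])

# Toy data: the feature tells us nothing about the target.
independent = mutual_information_nats(['a', 'a', 'b', 'b'], [1, 0, 1, 0])

print(round(dependent, 4))    # 0.6931, i.e. ln(2)
print(round(independent, 4))  # 0.0
```

On this scale, the scores in the table above are all small, so the ranking is more informative than the absolute values.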

In [20]:
df_full_train[numerical].corrwith(df_full_train.y).round(3).sort_values(ascending=False)

duration    0.393
pdays       0.106
previous    0.092
balance     0.053
age         0.027
day        -0.026
campaign   -0.073
dtype: float64
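
`corrwith` computes the Pearson correlation between each numerical column and the 0/1 target. As a rough illustration of what that number is, here is the formula written out with NumPy on made-up values (`pearson` is a hypothetical helper, and the toy `duration` and `subscribed` values are not from the dataset):

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    # Covariance divided by the product of the standard deviations.
    return (x_centered @ y_centered) / np.sqrt(
        (x_centered @ x_centered) * (y_centered @ y_centered)
    )


# Toy data: longer calls tend to go with a subscription.
duration = [60, 120, 300, 600, 900]
subscribed = [0, 0, 0, 1, 1]

r = pearson(duration, subscribed)
print(round(r, 3))  # positive: the two move together
```

A positive value means the target tends to be 1 when the feature is large, and a negative value (as for `campaign`) means the opposite.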

# 5. One-hot Encoding

In [21]:
dv = DictVectorizer(sparse=False)

train_dicts = df_train[categorical + numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dicts)

val_dicts = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dicts)

In [22]:
train_dicts[0]

{'job': 'technician',
 'marital': 'single',
 'education': 'tertiary',
 'housing': 'yes',
 'contact': 'cellular',
 'month': 'aug',
 'poutcome': 'unknown',
 'age': 32,
 'balance': 1100,
 'day': 11,
 'duration': 67,
 'campaign': 1,
 'pdays': -1,
 'previous': 0}
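
`DictVectorizer` turns each string-valued key into one binary column per observed value (`name=value`) and passes numeric values through unchanged. A minimal sketch on two made-up records (the names `rows` and `dv_demo` are only for this example):

```python
from sklearn.feature_extraction import DictVectorizer

# Two toy records with one categorical and one numerical field.
rows = [
    {'contact': 'cellular', 'age': 32},
    {'contact': 'unknown', 'age': 40},
]

dv_demo = DictVectorizer(sparse=False)
X_demo = dv_demo.fit_transform(rows)

print(dv_demo.feature_names_)  # ['age', 'contact=cellular', 'contact=unknown']
print(X_demo.tolist())         # [[32.0, 1.0, 0.0], [40.0, 0.0, 1.0]]

# A category that was not seen during fit gets no column:
# all 'contact=...' columns are 0 for this record.
X_unseen = dv_demo.transform([{'contact': 'telephone', 'age': 50}])
print(X_unseen.tolist())       # [[50.0, 0.0, 0.0]]
```

This is why the vectorizer is fitted on the training records and only applied (`transform`) to the validation records: the column layout has to stay the same across splits.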

# 6. Training Logistic Regression with Scikit-Learn
* Train a model with Scikit-Learn.
* Apply it to the validation dataset.
* Calculate the accuracy on the validation dataset.

In [23]:
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

In [24]:
model.coef_[0].round(3)

array([ 1.000e-03,  0.000e+00, -7.900e-02,  2.580e-01,  6.400e-02,
       -1.328e+00,  1.000e-02,  4.000e-03, -4.480e-01, -2.520e-01,
       -6.800e-02, -2.380e-01, -1.530e-01, -8.540e-01,  8.700e-02,
       -2.260e-01, -2.640e-01, -3.470e-01, -9.000e-02,  2.630e-01,
       -2.940e-01, -1.280e-01,  2.950e-01, -1.480e-01,  4.000e-02,
       -1.940e-01, -3.470e-01, -4.830e-01, -1.770e-01, -1.000e-03,
       -7.290e-01,  4.190e-01, -3.290e-01, -1.220e+00, -1.044e+00,
        3.030e-01,  1.493e+00, -5.080e-01, -9.710e-01,  7.770e-01,
        8.030e-01, -1.000e-03, -8.240e-01, -6.360e-01,  1.485e+00,
       -1.032e+00,  9.000e-03])

In [25]:
model.intercept_[0]

-1.006449986188843
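
The coefficients and the intercept are all the model needs to score a customer: it adds the intercept to the weighted sum of the feature values and passes the result through the sigmoid function. The sketch below shows the mechanics with two weights only. The numbers are rounded values taken from this notebook, but the feature subset is an assumption made for brevity, so the result is not the model's actual prediction.

```python
import math


def sigmoid(z):
    """Map a raw score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, intercept):
    """Logistic regression score: sigmoid(intercept + sum(w_i * x_i))."""
    z = intercept + sum(weights[name] * value for name, value in features.items())
    return sigmoid(z)


# Illustrative subset of weights (rounded); the real model has 47 of them.
weights = {'duration': 0.004, 'housing=yes': -0.854}
intercept = -1.006

# A made-up customer: 192-second call, has a housing loan.
features = {'duration': 192, 'housing=yes': 1}

p = predict_probability(features, weights, intercept)
print(round(p, 3))
```

A negative intercept pulls the default probability below 0.5, which fits a dataset where most customers do not subscribe.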

In [26]:
y_pred = model.predict_proba(X_val)[:, 1]

In [27]:
subscribe_prediction = (y_pred >= 0.5)

Calculating the accuracy.

In [28]:
def accuracy(model, X_val, y_val):
    y_pred = model.predict_proba(X_val)[:, 1]
    subscribe_prediction = (y_pred >= 0.5)

    return (y_val == subscribe_prediction).mean()

In [29]:
original_accuracy = accuracy(model, X_val, y_val)
original_accuracy

0.9013492590134926

In [30]:
df_pred = pd.DataFrame()
df_pred['probability'] = y_pred
df_pred['prediction'] = subscribe_prediction.astype(int)
df_pred['actual'] = y_val
df_pred['correct'] = df_pred.prediction == df_pred.actual
df_pred.head()

Unnamed: 0,probability,prediction,actual,correct
0,0.012663,0,0,True
1,0.009701,0,0,True
2,0.153056,0,1,False
3,0.230069,0,0,True
4,0.445617,0,1,False
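
To make the accuracy calculation concrete, the five rows shown above can be scored by hand: apply the 0.5 threshold, compare with the actual labels, and average the matches. This covers only the first five validation rows, so the result differs from the overall validation accuracy.

```python
import numpy as np

# Values copied from the first five rows of df_pred above.
probability = np.array([0.012663, 0.009701, 0.153056, 0.230069, 0.445617])
actual = np.array([0, 0, 1, 0, 1])

# Same decision rule as in the accuracy() function: predict 1 if p >= 0.5.
prediction = (probability >= 0.5).astype(int)
correct = prediction == actual

print(prediction.tolist())  # [0, 0, 0, 0, 0]
print(correct.mean())       # 0.6
```

Both mistakes in this sample are subscribers whose predicted probability stayed below the threshold.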


# 7. Model Interpretation
* Look at the coefficients.
* Train a smaller model with fewer features.

In [31]:
dict(zip(dv.feature_names_, model.coef_[0].round(3)))

{'age': 0.001,
 'balance': 0.0,
 'campaign': -0.079,
 'contact=cellular': 0.258,
 'contact=telephone': 0.064,
 'contact=unknown': -1.328,
 'day': 0.01,
 'duration': 0.004,
 'education=primary': -0.448,
 'education=secondary': -0.252,
 'education=tertiary': -0.068,
 'education=unknown': -0.238,
 'housing=no': -0.153,
 'housing=yes': -0.854,
 'job=admin.': 0.087,
 'job=blue-collar': -0.226,
 'job=entrepreneur': -0.264,
 'job=housemaid': -0.347,
 'job=management': -0.09,
 'job=retired': 0.263,
 'job=self-employed': -0.294,
 'job=services': -0.128,
 'job=student': 0.295,
 'job=technician': -0.148,
 'job=unemployed': 0.04,
 'job=unknown': -0.194,
 'marital=divorced': -0.347,
 'marital=married': -0.483,
 'marital=single': -0.177,
 'month=apr': -0.001,
 'month=aug': -0.729,
 'month=dec': 0.419,
 'month=feb': -0.329,
 'month=jan': -1.22,
 'month=jul': -1.044,
 'month=jun': 0.303,
 'month=mar': 1.493,
 'month=may': -0.508,
 'month=nov': -0.971,
 'month=oct': 0.777,
 'month=sep': 0.803,
 'pdays'

In [32]:
def create_train_val_dataset(df_train, df_val, column, categorical, numerical):
    df_train, df_val = df_train.drop(columns=column), df_val.drop(columns=column)

    num_copy = numerical.copy()
    num_copy.remove(column)

    dv = DictVectorizer(sparse=False)

    train_dicts = df_train[categorical + num_copy].to_dict(orient='records')
    X_train = dv.fit_transform(train_dicts)
    
    val_dicts = df_val[categorical + num_copy].to_dict(orient='records')
    X_val = dv.transform(val_dicts)

    return X_train, X_val

## Train without the `age` feature

In [33]:
X_train_no_age, X_val_no_age = create_train_val_dataset(df_train, df_val, 'age', categorical, numerical)

In [34]:
model_no_age = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model_no_age.fit(X_train_no_age, y_train)

In [35]:
accuracy_no_age = accuracy(model_no_age, X_val_no_age, y_val)
accuracy_no_age

0.9006856890068569

In [36]:
accuracy_no_age - original_accuracy

-0.0006635700066357497

## Train without the `balance` feature

In [37]:
X_train_no_balance, X_val_no_balance = create_train_val_dataset(df_train, df_val, 'balance', categorical, numerical)

In [38]:
model_no_balance = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model_no_balance.fit(X_train_no_balance, y_train)

In [39]:
accuracy_no_balance = accuracy(model_no_balance, X_val_no_balance, y_val)
accuracy_no_balance

0.9013492590134926

In [40]:
accuracy_no_balance - original_accuracy

0.0

## Train without the `previous` feature

In [41]:
X_train_no_previous, X_val_no_previous = create_train_val_dataset(df_train, df_val, 'previous', categorical, numerical)

In [42]:
model_no_previous = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model_no_previous.fit(X_train_no_previous, y_train)

In [43]:
accuracy_no_previous = accuracy(model_no_previous, X_val_no_previous, y_val)
accuracy_no_previous

0.9011280690112807

In [44]:
accuracy_no_previous - original_accuracy

-0.00022119000221187957

## Train without the `marital` feature

In [45]:
df_train_no_marital, df_val_no_marital = df_train.drop(columns='marital'), df_val.drop(columns='marital')

# `marital` is categorical, so it is removed from the categorical list here.
cat_copy = categorical.copy()
cat_copy.remove('marital')

# Use a separate vectorizer so the original `dv` (fitted on all features) is not overwritten.
dv_no_marital = DictVectorizer(sparse=False)

train_dicts_no_marital = df_train_no_marital[cat_copy + numerical].to_dict(orient='records')
X_train_no_marital = dv_no_marital.fit_transform(train_dicts_no_marital)

val_dicts_no_marital = df_val_no_marital[cat_copy + numerical].to_dict(orient='records')
X_val_no_marital = dv_no_marital.transform(val_dicts_no_marital)

In [46]:
model_no_marital = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model_no_marital.fit(X_train_no_marital, y_train)

In [47]:
accuracy_no_marital = accuracy(model_no_marital, X_val_no_marital, y_val)
accuracy_no_marital

0.9011280690112807

In [48]:
accuracy_no_marital - original_accuracy

-0.00022119000221187957
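
The four experiments above repeat the same steps, so they can also be written as a loop that drops one feature at a time, whether it is numerical or categorical. The following is a sketch under stated assumptions: `validation_accuracy` is a hypothetical helper, and the data is a synthetic stand-in (not the bank dataset) so that the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression


def validation_accuracy(df_train, df_val, y_train, y_val, features):
    """Fit on `features` only and return the validation accuracy."""
    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(df_train[features].to_dict(orient='records'))
    X_val = dv.transform(df_val[features].to_dict(orient='records'))

    model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    return ((model.predict_proba(X_val)[:, 1] >= 0.5) == y_val).mean()


# Synthetic stand-in data, only to make the sketch runnable.
rng = np.random.default_rng(42)
n = 400
df_demo = pd.DataFrame({
    'duration': rng.integers(0, 1000, size=n),
    'age': rng.integers(18, 90, size=n),
    'marital': rng.choice(['single', 'married', 'divorced'], size=n),
})
y_demo = (df_demo['duration'] > 500).astype(int).values

df_tr, df_va = df_demo.iloc[:300], df_demo.iloc[300:]
y_tr, y_va = y_demo[:300], y_demo[300:]

features = ['duration', 'age', 'marital']
baseline = validation_accuracy(df_tr, df_va, y_tr, y_va, features)

# Difference = accuracy without the feature minus accuracy with all features.
differences = {}
for dropped in features:
    kept = [f for f in features if f != dropped]
    differences[dropped] = validation_accuracy(df_tr, df_va, y_tr, y_va, kept) - baseline

# Smallest absolute difference first: those features matter least.
for name, diff in sorted(differences.items(), key=lambda item: abs(item[1])):
    print(f'{name}: {diff:+.4f}')
```

Sorting by absolute difference puts the least useful features at the top, which is the same reading as comparing the four differences computed above.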

# 8. Train a Regularized Logistic Regression 
* Experiment with different `C` values: `[0.01, 0.1, 1, 10, 100]`

#### C = 0.01

In [49]:
model_001 = LogisticRegression(solver='liblinear', C=0.01, max_iter=1000, random_state=42)
model_001.fit(X_train, y_train)

In [50]:
accuracy_001 = accuracy(model_001, X_val, y_val)
print(f'Accuracy using C = 0.01: {round(accuracy_001*100,3)}%')

Accuracy using C = 0.01: 89.781%


#### C = 0.1

In [51]:
model_01 = LogisticRegression(solver='liblinear', C=0.1, max_iter=1000, random_state=42)
model_01.fit(X_train, y_train)

In [52]:
accuracy_01 = accuracy(model_01, X_val, y_val)
print(f'Accuracy using C = 0.1: {round(accuracy_01*100,3)}%')

Accuracy using C = 0.1: 90.091%


#### C = 1

In [53]:
model_1 = LogisticRegression(solver='liblinear', C=1, max_iter=1000, random_state=42)
model_1.fit(X_train, y_train)

In [54]:
accuracy_1 = accuracy(model_1, X_val, y_val)
print(f'Accuracy using C = 1: {round(accuracy_1*100,3)}%')

Accuracy using C = 1: 90.135%


#### C = 10

In [55]:
model_10 = LogisticRegression(solver='liblinear', C=10, max_iter=1000, random_state=42)
model_10.fit(X_train, y_train)

In [56]:
accuracy_10 = accuracy(model_10, X_val, y_val)
print(f'Accuracy using C = 10: {round(accuracy_10*100,3)}%')

Accuracy using C = 10: 90.069%


#### C = 100

In [57]:
model_100 = LogisticRegression(solver='liblinear', C=100, max_iter=1000, random_state=42)
model_100.fit(X_train, y_train)

In [58]:
accuracy_100 = accuracy(model_100, X_val, y_val)
print(f'Accuracy using C = 100: {round(accuracy_100*100,3)}%')

Accuracy using C = 100: 90.069%
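
In scikit-learn, `C` is the inverse of the regularization strength: a smaller `C` penalizes large coefficients more heavily. The five experiments above can be condensed into one loop. The sketch below runs on synthetic data from `make_classification` (an assumption made so the block is self-contained), so the accuracies it prints are not comparable to the ones above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the bank dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=42)

results = {}
for C in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(solver='liblinear', C=C, max_iter=1000, random_state=42)
    model.fit(X_tr, y_tr)

    # Same decision rule as in the notebook: predict 1 if p >= 0.5.
    val_accuracy = ((model.predict_proba(X_va)[:, 1] >= 0.5) == y_va).mean()

    # The coefficient norm shows how strongly the weights are shrunk.
    results[C] = (val_accuracy, np.linalg.norm(model.coef_))

for C, (val_accuracy, coef_norm) in results.items():
    print(f'C = {C}: accuracy = {val_accuracy:.3f}, coefficient norm = {coef_norm:.3f}')
```

Printing the coefficient norm next to the accuracy makes the effect of `C` visible even when the accuracies are close to each other, as they are in the results above.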


# 9. Using the Model
* Predict and check the accuracy on the test dataset.
* Use the model to make a prediction for a single customer.

In [59]:
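# Fit the vectorizer on the training records only, then reuse it for the test set,
# so the columns match the ones the model was trained on.
dv = DictVectorizer(sparse=False)
dv.fit(train_dicts)

dicts_test = df_test[categorical + numerical].to_dict(orient='records')
X_test = dv.transform(dicts_test)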

In [60]:
y_pred_test = model_1.predict_proba(X_test)[:, 1]

In [61]:
subscribe_prediction_test = (y_pred_test >= 0.5)

In [62]:
print(f'Test accuracy: {round((subscribe_prediction_test == y_test).mean() * 100, 3)}%')

Test accuracy: 89.915%


In [63]:
customer = dicts_test[0]
customer

{'job': 'blue-collar',
 'marital': 'married',
 'education': 'secondary',
 'housing': 'yes',
 'contact': 'unknown',
 'month': 'may',
 'poutcome': 'unknown',
 'age': 40,
 'balance': 580,
 'day': 16,
 'duration': 192,
 'campaign': 1,
 'pdays': -1,
 'previous': 0}

In [64]:
X_small = dv.transform([customer])

In [65]:
model_1.predict_proba(X_small)[0,1]

0.00832354456711765

In [66]:
y_test[0]

0
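
Scoring a new customer always needs the vectorizer and the model together, in that order. One way to keep them in sync is to wrap both in a scikit-learn pipeline, sketched here on a handful of made-up records. This is an alternative arrangement, not what the notebook does; `train_rows`, `train_labels`, and `new_customer` are illustrative names.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training records and labels, only to make the sketch runnable.
train_rows = [
    {'contact': 'cellular', 'duration': 600},
    {'contact': 'cellular', 'duration': 450},
    {'contact': 'unknown', 'duration': 80},
    {'contact': 'unknown', 'duration': 120},
]
train_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the model in one call.
pipeline = make_pipeline(
    DictVectorizer(sparse=False),
    LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42),
)
pipeline.fit(train_rows, train_labels)

# A single customer is passed as a list with one dictionary.
new_customer = {'contact': 'unknown', 'duration': 192}
probability = pipeline.predict_proba([new_customer])[0, 1]

print(f'Probability of subscribing: {probability:.3f}')
```

With a pipeline there is a single object to save and load, and the vectorizer cannot accidentally be refitted on new data.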

# 10. Summary

In this module, we learned about classification. 

Two measures can be used to check feature importance:
1. Mutual information for categorical features.
2. Correlation for numerical features.

One-hot encoding was used to handle the categorical features.

We experimented with removing individual features before training. Removing `balance` left the validation accuracy unchanged compared with the model trained on all features, the smallest difference among the four features tested. This insight can be used for feature reduction.
Several `C` values were tried for training the model, and `C=1` gave the best validation accuracy (90.135%).  

We have also learned how to use the model for prediction. 