# ML Zoomcamp 2024 - Deployment

This is part of [ML Zoomcamp](https://github.com/DataTalksClub/machine-learning-zoomcamp/tree/master), organized by [DataTalks.Club](https://datatalks.club/). 
After training the model in the previous [session](<ml-zoomcamp-2024/04-Evaluation/evaluation.ipynb>), we want to use it outside the notebook, in a web service.

The dataset is `bank-full.csv` from the [bank marketing](https://archive.ics.uci.edu/static/public/222/bank+marketing.zip) dataset provided by [Moro et al., 2011](http://hdl.handle.net/1822/14838)<sup>1</sup>.
<br>The target for the classification task is the `y` variable: whether the client subscribed to a term deposit.

<sup>1</sup>S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. 
  In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.

# 1. Data preparation 

* Read the data with pandas.
* Look at the data.
* Select the columns (based on the course instructions).
* Convert the target variable `y` to an integer (1 for `yes`, 0 for `no`).

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import pickle

from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import roc_auc_score
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

from tqdm.notebook import tqdm

%matplotlib inline

In [3]:
df = pd.read_csv('../bank/bank-full.csv', sep=";")
df.head()

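| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 4 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |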


In [4]:
df = df[['age', 'job', 'marital', 'education', 'balance', 'housing', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'y']]

In [5]:
df.dtypes

age           int64
job          object
marital      object
education    object
balance       int64
housing      object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
y            object
dtype: object

In [6]:
df.describe()

| | age | balance | day | duration | campaign | pdays | previous |
|---|---|---|---|---|---|---|---|
| count | 45211.0 | 45211.0 | 45211.0 | 45211.0 | 45211.0 | 45211.0 | 45211.0 |
| mean | 40.93621 | 1362.272058 | 15.806419 | 258.16308 | 2.763841 | 40.197828 | 0.580323 |
| std | 10.618762 | 3044.765829 | 8.322476 | 257.527812 | 3.098021 | 100.128746 | 2.303441 |
| min | 18.0 | -8019.0 | 1.0 | 0.0 | 1.0 | -1.0 | 0.0 |
| 25% | 33.0 | 72.0 | 8.0 | 103.0 | 1.0 | -1.0 | 0.0 |
| 50% | 39.0 | 448.0 | 16.0 | 180.0 | 2.0 | -1.0 | 0.0 |
| 75% | 48.0 | 1428.0 | 21.0 | 319.0 | 3.0 | -1.0 | 0.0 |
| max | 95.0 | 102127.0 | 31.0 | 4918.0 | 63.0 | 871.0 | 275.0 |


In [7]:
df.y = (df.y == 'yes').astype(int)

In [8]:
df.corr(numeric_only=True)

| | age | balance | day | duration | campaign | pdays | previous | y |
|---|---|---|---|---|---|---|---|---|
| age | 1.0 | 0.097783 | -0.00912 | -0.004648 | 0.00476 | -0.023758 | 0.001288 | 0.025155 |
| balance | 0.097783 | 1.0 | 0.004503 | 0.02156 | -0.014578 | 0.003435 | 0.016674 | 0.052838 |
| day | -0.00912 | 0.004503 | 1.0 | -0.030206 | 0.16249 | -0.093044 | -0.05171 | -0.028348 |
| duration | -0.004648 | 0.02156 | -0.030206 | 1.0 | -0.08457 | -0.001565 | 0.001203 | 0.394521 |
| campaign | 0.00476 | -0.014578 | 0.16249 | -0.08457 | 1.0 | -0.088628 | -0.032855 | -0.073172 |
| pdays | -0.023758 | 0.003435 | -0.093044 | -0.001565 | -0.088628 | 1.0 | 0.45482 | 0.103621 |
| previous | 0.001288 | 0.016674 | -0.05171 | 0.001203 | -0.032855 | 0.45482 | 1.0 | 0.093236 |
| y | 0.025155 | 0.052838 | -0.028348 | 0.394521 | -0.073172 | 0.103621 | 0.093236 | 1.0 |


# 2. Dataset splitting
* Split the dataset into a full training set (80%) and a test set (20%).

In [9]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
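
As a rough check of what `test_size=0.2` means for this dataset, the sketch below computes the expected partition sizes for the 45,211 rows reported by `df.describe()`. It assumes the test share is rounded up to a whole row and the training set receives the rest; that is an assumption about `train_test_split`, not something shown in this notebook, and `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test share is rounded up to a whole row and the
    training set receives the remaining rows.
    """
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

n_train, n_test = split_sizes(45211, 0.2)
print(n_train, n_test)
```

Comparing these numbers with `len(df_full_train)` and `len(df_test)` is a quick way to confirm that the split behaved as expected.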

# 3. Selecting features and target variable

In [10]:
numerical = ['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']

In [11]:
categorical = ['job', 'marital', 'education', 'housing', 'contact','month', 'poutcome']

In [12]:
y_full_train = df_full_train.y.values
y_test = df_test.y.values

# 4. Training the model 

In [13]:
def train(df_train, y_train, C=1.0):
    dicts = df_train[categorical + numerical].to_dict(orient='records')

    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(dicts)

    model = LogisticRegression(solver='liblinear', C=C, max_iter=1000)
    model.fit(X_train, y_train)

    return dv, model

In [14]:
def predict(df, dv, model):
    dicts = df[categorical + numerical].to_dict(orient='records')

    X = dv.transform(dicts)
    y_pred = model.predict_proba(X)[:, 1]

    return y_pred

In [15]:
C = 1.0
n_splits = 5

In [17]:
kfold = KFold(n_splits=n_splits, shuffle=True, random_state=1)

scores = [] 

for train_idx, val_idx in tqdm(kfold.split(df_full_train), total=n_splits):
    df_train = df_full_train.iloc[train_idx]
    df_val = df_full_train.iloc[val_idx]

    y_train = df_train.y.values
    y_val = df_val.y.values

    dv, model = train(df_train, y_train, C)
    y_pred = predict(df_val, dv, model)

    auc = roc_auc_score(y_val, y_pred)
    scores.append(auc)

print(f'C: {C} | AUC mean: {np.mean(scores).round(3)} | AUC std: {np.std(scores).round(3)}')

C: 1.0 | AUC mean: 0.906 | AUC std: 0.006
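
To illustrate what the cross-validation loop iterates over, here is a minimal pure-Python sketch of a shuffled k-fold split. It is not scikit-learn's implementation: `kfold_indices` is a hypothetical helper, and its shuffle uses Python's `random` module, so the indices will differ from those produced by `KFold(shuffle=True, random_state=1)`. The point is only that every row lands in exactly one validation fold.

```python
import random

def kfold_indices(n_samples, n_splits, seed=1):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # When n_samples is not divisible by n_splits, the first
    # n_samples % n_splits folds receive one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(kfold_indices(n_samples=12, n_splits=5))
for train_idx, val_idx in folds:
    print(len(train_idx), len(val_idx))
```

Because a fresh `DictVectorizer` and model are trained inside each fold, the validation rows never influence the encoder or the coefficients they are scored against.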


# 5. Testing the model

In [19]:
dv, model = train(df_full_train, y_full_train, C=1.0)
y_pred = predict(df_test, dv, model)

auc = roc_auc_score(y_test, y_pred)
auc.round(3)

0.906
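
One way to read this number: ROC AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by brute force over all positive/negative pairs. `roc_auc_pairwise` is a hypothetical helper written for illustration; it is quadratic in the number of rows and is not how `roc_auc_score` is implemented. The labels and scores are toy values, not data from this notebook.

```python
def roc_auc_pairwise(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError('AUC needs at least one positive and one negative example')
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]
auc_demo = roc_auc_pairwise(y_true_demo, y_score_demo)
print(auc_demo)
```

Under this reading, a test AUC of 0.906 means the model ranks a subscriber above a non-subscriber in roughly 9 out of 10 such pairs.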

# 6. Save the model

In [21]:
output_file = f'model_C={C}.bin'
output_file

'model_C=1.0.bin'

In [22]:
f_out = open(output_file, 'wb')
pickle.dump((dv, model), f_out)
f_out.close()

To make sure that the file is always closed, even if an error occurs while writing, it is better to use a `with` statement:

In [23]:
with open(output_file, 'wb') as f_out:
    pickle.dump((dv, model), f_out)
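
The small experiment below shows why the `with` form is safer. It uses placeholder strings instead of the real `dv` and `model`, writes to a temporary directory, and raises a simulated error inside the block; all names in it are illustrative assumptions.

```python
import os
import pickle
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.bin')

    try:
        with open(path, 'wb') as f_out:
            # Placeholder objects standing in for (dv, model).
            pickle.dump(('dv', 'model'), f_out)
            raise RuntimeError('simulated failure after writing')
    except RuntimeError:
        pass

    # The file was closed on the way out of the `with` block,
    # even though the block ended with an exception.
    file_closed = f_out.closed

    # The data written before the error was flushed and can be read back.
    with open(path, 'rb') as f_in:
        restored = pickle.load(f_in)

print(file_closed, restored)
```

With the manual `open` / `close` version, the same exception would skip `f_out.close()` unless it were wrapped in `try` / `finally`.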

# 7. Load the model

In [2]:
model_file = 'model_C=1.0.bin'

In [3]:
with open(model_file, 'rb') as f_in:
    dv, model = pickle.load(f_in)

In [4]:
dv, model 

(DictVectorizer(sparse=False),
 LogisticRegression(max_iter=1000, solver='liblinear'))

In [9]:
customer = {
    'job': 'blue-collar',
    'marital': 'married',
    'education': 'secondary',
    'housing': 'yes',
    'contact': 'unknown',
    'month': 'may',
    'poutcome': 'unknown',
    'age': 40,
    'balance': 580,
    'day': 16,
    'duration': 365,
    'campaign': 1,
    'pdays': -1,
    'previous': 0
}

In [10]:
X = dv.transform([customer])
X

array([[ 40., 580.,   1.,   0.,   0.,   1.,  16., 365.,   0.,   1.,   0.,
          0.,   0.,   1.,   0.,   1.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   1.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   1.,   0.,   0.,   0.,  -1.,   0.,   0.,
          0.,   1.,   0.]])
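
To make the layout of this array easier to read, here is a simplified pure-Python imitation of what `DictVectorizer` does: numeric values keep their own column, each string value becomes a `key=value` indicator column, and columns are sorted by name. `fit_feature_names` and `transform` are hypothetical helpers, and the records are a small made-up subset of the features, so this is a sketch of the idea rather than the library's code.

```python
def fit_feature_names(records):
    """Collect column names: 'key=value' for strings, 'key' for numbers."""
    names = set()
    for record in records:
        for key, value in record.items():
            names.add(f'{key}={value}' if isinstance(value, str) else key)
    return sorted(names)

def transform(record, feature_names):
    """Encode one record as a dense list of floats in feature_names order."""
    index = {name: i for i, name in enumerate(feature_names)}
    row = [0.0] * len(feature_names)
    for key, value in record.items():
        if isinstance(value, str):
            name, cell = f'{key}={value}', 1.0
        else:
            name, cell = key, float(value)
        # Features not seen during fitting are ignored.
        if name in index:
            row[index[name]] = cell
    return row

train_records = [
    {'age': 58, 'contact': 'unknown', 'housing': 'yes'},
    {'age': 33, 'contact': 'cellular', 'housing': 'no'},
]
feature_names = fit_feature_names(train_records)
encoded = transform({'age': 40, 'contact': 'unknown', 'housing': 'yes'}, feature_names)
print(feature_names)
print(encoded)
```

In the notebook itself, `dv.get_feature_names_out()` lists the real column names in the same order as the values in `X`.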

In [11]:
model.predict_proba(X)[0, 1]

0.0166128321351466
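
For a binary logistic regression, this probability is the sigmoid of a weighted sum of the encoded features plus an intercept. The sketch below reproduces that calculation with made-up numbers: `weights`, `bias` and `x` are hypothetical values chosen for easy arithmetic, not the fitted coefficients, and the 0.5 `threshold` is an assumed decision rule that the notebook does not specify.

```python
import math

def sigmoid(z):
    """Map a real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_one(x, weights, bias):
    """Probability of the positive class for a single feature vector."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

def will_subscribe(probability, threshold=0.5):
    """Turn a probability into a yes/no decision (threshold is an assumption)."""
    return probability >= threshold

# Hypothetical coefficients and features, not taken from the trained model.
weights = [0.5, -1.0]
bias = 0.0
x = [2.0, 1.0]
p = predict_proba_one(x, weights, bias)  # score z = 0.5*2.0 - 1.0*1.0 = 0.0

print(p, will_subscribe(p))
print(will_subscribe(0.0166128321351466))  # the customer scored above
```

With a 0.5 threshold, the customer above, at roughly 1.7%, would be classified as unlikely to subscribe.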

# 9. Summary

