In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

#### Question 1

- Install Pipenv
- What's the version of pipenv you installed?
- Use --version to find out

version 2025.0.4

#### Question 2

- Use Pipenv to install Scikit-Learn version 1.5.2
- What's the first hash for scikit-learn you get in Pipfile.lock?

"sha256:03b6158efa3faaf1feea3faa884c840ebd61b6484167c711548fce208ea09445"
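Since Pipfile.lock is a JSON file, the first hash can also be read programmatically instead of by eye. A minimal sketch, assuming the package is listed under the "default" section of the lock file; `first_hash` is a hypothetical helper, not part of Pipenv.

```python
import json


def first_hash(lock_path, package, section="default"):
    """Return the first hash recorded for `package` in a Pipfile.lock."""
    with open(lock_path) as f_in:
        lock = json.load(f_in)
    # Each locked package carries a list of hashes; take the first one.
    return lock[section][package]["hashes"][0]


# Example call (assumes Pipfile.lock exists in the working directory):
# first_hash("Pipfile.lock", "scikit-learn")
```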

#### Models

We've prepared a dictionary vectorizer and a model, and saved both with pickle.
Note: You don't need to train the model. The training code is just for your reference.

Download them:

- DictVectorizer
- LogisticRegression

#### Question 3

Let's use these models!

- Write a script for loading these models with pickle
- Score this client: {"job": "management", "duration": 400, "poutcome": "success"}

What's the probability that this client will get a subscription?

In [2]:
import pickle

To save the model:

In [4]:
full_df = pd.read_csv('../bank-full.csv', sep=';')

In [5]:
df = full_df[[
    'age',
    'job',
    'marital',
    'education',
    'balance',
    'housing',
    'contact',
    'day',
    'month',
    'duration',
    'campaign',
    'pdays',
    'previous',
    'poutcome',
    'y'
]]

In [6]:
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# raised when writing into a column-sliced frame.
df = df.assign(y=(df['y'] == 'yes').astype(int))
df['y'].value_counts()

y
0    39922
1     5289
Name: count, dtype: int64

In [7]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)
len(df_train), len(df_val), len(df_test)

(27126, 9042, 9043)
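These sizes follow from how the two splits compose: 20% is held out for test first, then 25% of the remaining 80% (20% of the whole) is held out for validation. A sketch of the arithmetic, assuming the held-out part is rounded up and the remainder goes to training; `split_sizes` is a hypothetical helper that only mimics the size computation and does not shuffle any data.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size (held-out part rounded up)."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_total = 39922 + 5289  # class counts from value_counts() above
n_full_train, n_test = split_sizes(n_total, 0.2)
n_train, n_val = split_sizes(n_full_train, 0.25)
print(n_train, n_val, n_test)  # 27126 9042 9043
```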

In [8]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
y_train = df_train.y.values
y_val = df_val.y.values
y_test = df_test.y.values
del df_train['y']
del df_val['y']
del df_test['y']

In [9]:
def train(df_train, y_train, C=1.0):
    dicts = df_train.to_dict(orient='records')

    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(dicts)

    model = LogisticRegression(solver='liblinear', C=C, max_iter=1000)
    model.fit(X_train, y_train)

    return dv, model

In [10]:
def predict(df, dv, model):
    dicts = df.to_dict(orient='records')

    X = dv.transform(dicts)
    y_pred = model.predict_proba(X)[:, 1]

    return y_pred

In [11]:
dv, model = train(df_train, y_train)

In [12]:
y_pred = predict(df_val, dv, model)

In [13]:
roc_auc_score(y_val, y_pred)

0.8999653998756322

In [20]:
C = 1.0
output_file = f'model_C={C}.bin'

# Save the vectorizer and the model together as one tuple.
with open(output_file, 'wb') as f_out:
    pickle.dump((dv, model), f_out)

To load the model:

In [21]:
model_file = 'model_C=1.0.bin'

with open(model_file, 'rb') as f_in:
    dv, model = pickle.load(f_in)

In [22]:
client = {"job": "management", "duration": 400, "poutcome": "success"}

In [23]:
X = dv.transform([client])
X

array([[  0.,   0.,   0.,   0.,   0.,   0.,   0., 400.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   1.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          1.,   0.,   0.]])

In [24]:
model.predict_proba(X)[0, 1]

0.8992766099364298

#### Question 4

Now let's serve this model as a web service.

- Install Flask and gunicorn (or waitress, if you're on Windows)
- Write Flask code for serving the model
- Now score this client using requests:
    - url = "YOUR_URL"
    - client = {"job": "student", "duration": 280, "poutcome": "failure"}
    - requests.post(url, json=client).json()

What's the probability that this client will get a subscription?
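The Flask service itself is not shown in this notebook. The following is a minimal sketch of what it could look like, not the script actually used here. It assumes Flask is installed and that the pickle file holds a `(dv, model)` tuple as saved earlier. The helper names, the 0.5 threshold, the file name in the comments, and the port are all assumptions. The response keys mirror the output printed in the cells below.

```python
import pickle

from flask import Flask, jsonify, request


def load_artifacts(path):
    """Load the (DictVectorizer, model) tuple saved with pickle."""
    with open(path, "rb") as f_in:
        return pickle.load(f_in)


def create_app(dv, model, threshold=0.5):
    """Build a Flask app exposing POST /predict (threshold is an assumption)."""
    app = Flask("subscription")

    @app.route("/predict", methods=["POST"])
    def predict():
        client = request.get_json()
        X = dv.transform([client])
        # Probability of the positive class for the single submitted client.
        proba = float(model.predict_proba(X)[0][1])
        return jsonify({
            "churn": bool(proba >= threshold),
            "churn_probability": proba,
        })

    return app


# To serve it locally (file name and port are assumptions):
# app = create_app(*load_artifacts("model_C=1.0.bin"))
# app.run(debug=True, host="0.0.0.0", port=9696)
```

If the module were saved as `predict.py` with a module-level `app`, it could be served with something like `gunicorn --bind 0.0.0.0:9696 predict:app`, which matches the URL used in the requests below.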

In [1]:
import requests

In [2]:
url = 'http://localhost:9696/predict'

In [3]:
client = {"job": "student", "duration": 280, "poutcome": "failure"}

In [7]:
response = requests.post(url, json=client).json()
response

{'churn': True, 'churn_probability': 0.5226867244545101}

In [5]:
if response['churn']:
    print("sending promo email to %s" % ('xyz-123'))

sending promo email to xyz-123


#### Docker

Install Docker. We will use it for the next two questions.

For these questions, we prepared a base image: svizor/zoomcamp-model:3.11.5-slim. You'll need to use it (see Question 5 for an example).

This image is based on python:3.11.5-slim and has a logistic regression model (a different one) as well as a dictionary vectorizer inside.

This is what the Dockerfile for this image looks like:

- FROM python:3.11.5-slim
- WORKDIR /app
- COPY ["model2.bin", "dv.bin", "./"]

We already built it and then pushed it to svizor/zoomcamp-model:3.11.5-slim.
Note: You don't need to build this Docker image; it's just for your reference.

#### Question 5

Download the base image svizor/zoomcamp-model:3.11.5-slim. You can get it with the docker pull command.

So what's the size of this base image?

You can get this information when running docker images - it'll be in the "SIZE" column.

#### Dockerfile

Now create your own Dockerfile based on the image we prepared.

It should start like this:

FROM svizor/zoomcamp-model:3.11.5-slim
-  add your stuff here

Now complete it:

- Install all the dependencies from the Pipenv file
- Copy your Flask script
- Run it with Gunicorn

After that, you can build your docker image.

197 MB

#### Question 6

Let's run your docker container!

After running it, score this client once again:

- url = "YOUR_URL"
- client = {"job": "management", "duration": 400, "poutcome": "success"}
- requests.post(url, json=client).json()

What's the probability that this client will get a subscription now?

In [9]:
url = 'http://localhost:9696/predict'

In [10]:
client = {"job": "management", "duration": 400, "poutcome": "success"}

In [11]:
response = requests.post(url, json=client).json()
response

{'churn': True, 'churn_probability': 0.8992766099364298}

In [12]:
if response['churn']:
    print("sending promo email to %s" % ('xyz-123'))

sending promo email to xyz-123
