# Week 5 - Deploying Machine Learning Models

## 5.2 Saving and loading the model

Here we bring together the components of the churn model built in the previous weeks.

In [23]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

In [24]:
data = 'https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv'

!wget $data -O data.csv

--2023-10-15 10:20:36--  https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 977501 (955K) [text/plain]
Saving to: ‘data.csv’


2023-10-15 10:20:37 (7.57 MB/s) - ‘data.csv’ saved [977501/977501]



In [25]:
df = pd.read_csv('data.csv')

df.columns = df.columns.str.lower().str.replace(' ', '_')

categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)

for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(' ', '_')

df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce')
df.totalcharges = df.totalcharges.fillna(0)

df.churn = (df.churn == 'yes').astype(int)

In [26]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)

In [27]:
numerical = ['tenure', 'monthlycharges', 'totalcharges']

categorical = [
    'gender',
    'seniorcitizen',
    'partner',
    'dependents',
    'phoneservice',
    'multiplelines',
    'internetservice',
    'onlinesecurity',
    'onlinebackup',
    'deviceprotection',
    'techsupport',
    'streamingtv',
    'streamingmovies',
    'contract',
    'paperlessbilling',
    'paymentmethod',
]

In [28]:
def train(df_train, y_train, C=1.0):
    dicts = df_train[categorical + numerical].to_dict(orient='records')

    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(dicts)

    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)
    
    return dv, model

In [29]:
def predict(df, dv, model):
    dicts = df[categorical + numerical].to_dict(orient='records')

    X = dv.transform(dicts)
    y_pred = model.predict_proba(X)[:, 1]

    return y_pred

In [30]:
C = 1.0
n_splits = 5

In [31]:
kfold = KFold(n_splits=n_splits, shuffle=True, random_state=1)

scores = []

for train_idx, val_idx in kfold.split(df_full_train):
    df_train = df_full_train.iloc[train_idx]
    df_val = df_full_train.iloc[val_idx]

    y_train = df_train.churn.values
    y_val = df_val.churn.values
    y_test = df_test.churn.values

    dv, model = train(df_train, y_train, C=C)
    y_pred = predict(df_val, dv, model)

    auc = roc_auc_score(y_val, y_pred)
    scores.append(auc)

print('C=%s %.3f +- %.3f' % (C, np.mean(scores), np.std(scores)))

C=1.0 0.841 +- 0.008


In [32]:
dv, model = train(df_full_train, df_full_train.churn.values, C=1.0)
y_pred = predict(df_test, dv, model)

y_test = df_test.churn.values  # set explicitly here instead of relying on the CV loop above
auc = roc_auc_score(y_test, y_pred)
auc

0.8572386167896259

Now we can save the model in a format that can be loaded by a web service.

In [33]:
import pickle

In [37]:
# take the vectorizer and the model and write them to a file
output_file = f'model_C={C}.bin'
f_out = open(output_file, 'wb')  # 'wb' opens the file for writing in binary mode
pickle.dump((dv, model), f_out)  # save both objects to the file
f_out.close()  # easy to forget; the `with` form in the next cell closes the file automatically

In [39]:
with open(output_file, 'wb') as f_out:
    pickle.dump((dv, model), f_out)

Load the model

In [40]:
import pickle

In [43]:
model_file = 'model_C=1.0.bin'

In [44]:
with open(model_file, 'rb') as f_in:
    dv, model = pickle.load(f_in)
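
The save and load cells above form a round trip. The sketch below shows the same pattern using only the standard library, with plain dictionaries standing in for the real `dv` and `model` objects. The helper names `save_artifacts` and `load_artifacts` are hypothetical, not part of the notebook.

```python
import os
import pickle
import tempfile


def save_artifacts(path, *artifacts):
    """Pickle all artifacts as one tuple, closing the file automatically."""
    with open(path, 'wb') as f_out:
        pickle.dump(artifacts, f_out)


def load_artifacts(path):
    """Return the tuple of artifacts stored at `path`."""
    with open(path, 'rb') as f_in:
        return pickle.load(f_in)


# Stand-ins for the fitted DictVectorizer and LogisticRegression (assumption:
# any picklable objects behave the same way for this round trip).
fake_dv = {'feature_names': ['tenure', 'monthlycharges']}
fake_model = {'coef': [0.1, -0.2], 'C': 1.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model_C=1.0.bin')
    save_artifacts(path, fake_dv, fake_model)
    restored_dv, restored_model = load_artifacts(path)

print(restored_dv == fake_dv, restored_model == fake_model)
```

Two things to keep in mind with real models: the process that loads the file needs scikit-learn installed (ideally the same version used for training), and pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.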

Using an example customer to test the model

In [46]:
customer = {
    'gender': 'female',
    'seniorcitizen': 0,
    'partner': 'yes',
    'dependents': 'no',
    'phoneservice': 'no',
    'multiplelines': 'no_phone_service',
    'internetservice': 'dsl',
    'onlinesecurity': 'no',
    'onlinebackup': 'yes',
    'deviceprotection': 'no',
    'techsupport': 'no',
    'streamingtv': 'no',
    'streamingmovies': 'no',
    'contract': 'month-to-month',
    'paperlessbilling': 'yes',
    'paymentmethod': 'electronic_check',
    'tenure': 1,
    'monthlycharges': 29.85,
    'totalcharges': 29.85
}

In [48]:
#turn customer into feature matrix
x = dv.transform([customer])

In [50]:
#get probability that this customer will churn
model.predict_proba(x)[0,1]

0.6363584152721875
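
`DictVectorizer.transform` silently ignores keys it did not see during fitting and encodes absent keys as zeros, so a misspelled field in the customer dictionary does not raise an error; it just changes the prediction. A small, hedged sketch of a check that could run before `dv.transform` follows. The helper name `check_customer` is hypothetical, and the feature list is copied from the `categorical` and `numerical` lists above.

```python
EXPECTED_FEATURES = [
    'gender', 'seniorcitizen', 'partner', 'dependents', 'phoneservice',
    'multiplelines', 'internetservice', 'onlinesecurity', 'onlinebackup',
    'deviceprotection', 'techsupport', 'streamingtv', 'streamingmovies',
    'contract', 'paperlessbilling', 'paymentmethod',
    'tenure', 'monthlycharges', 'totalcharges',
]


def check_customer(customer, expected=EXPECTED_FEATURES):
    """Return (missing, unexpected) feature names for a customer dict."""
    missing = sorted(set(expected) - set(customer))
    unexpected = sorted(set(customer) - set(expected))
    return missing, unexpected


# Example: every feature is present except 'tenure', which is misspelled.
example = {name: 0 for name in EXPECTED_FEATURES if name != 'tenure'}
example['tenur'] = 1

print(check_customer(example))
```

An empty pair of lists means the dictionary has exactly the fields the model was trained on.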

Now we will convert this code into the `train.py` and `predict.py` files in this directory for use with a web service.

## 5.4 Serving the churn model with Flask

Here `predict.py` is updated so that the model is served by a Flask application.
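
The actual `predict.py` is not reproduced in these notes, so the following is only a minimal sketch of what such a Flask service might look like. The `/predict` route, the response field names, the 0.5 threshold, and the `create_app` factory are all assumptions made for illustration.

```python
from flask import Flask, jsonify, request


def create_app(dv, model):
    """Build a Flask app around an already-loaded vectorizer and model."""
    app = Flask('churn')

    @app.route('/predict', methods=['POST'])
    def predict():
        # The request body is expected to be one customer as a JSON object.
        customer = request.get_json()

        X = dv.transform([customer])
        # [0][1] picks the positive-class probability of the single row;
        # float() converts a NumPy scalar into a JSON-serializable value.
        y_pred = float(model.predict_proba(X)[0][1])

        result = {
            'churn_probability': y_pred,
            'churn': bool(y_pred >= 0.5),  # assumed decision threshold
        }
        return jsonify(result)

    return app
```

In a real `predict.py`, the pickled `(dv, model)` pair would be loaded at module level and passed in as `app = create_app(dv, model)`, so that a WSGI server can find the object as `predict:app`. For local development, `app.run(debug=True, host='0.0.0.0', port=9696)` under an `if __name__ == '__main__':` guard starts Flask's built-in server.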

Run the model
```bash
python predict.py
```

Run a prediction example
```bash
python predict-test.py
```
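
The contents of `predict-test.py` are not shown here. As a rough idea of what a client for the service could look like, the sketch below builds and sends a JSON POST request using only the standard library (a swapped-in choice; the original script may well use a different HTTP library). The URL path `/predict` and the helper names are assumptions.

```python
import json
import urllib.request

# Assumed address: the port matches the one used in these notes,
# the '/predict' path is hypothetical.
URL = 'http://localhost:9696/predict'


def build_request(customer, url=URL):
    """Create a POST request carrying the customer as a JSON body."""
    body = json.dumps(customer).encode('utf-8')
    return urllib.request.Request(
        url,
        data=body,
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


def send(customer, url=URL, timeout=5):
    """Send the customer to the service and return the decoded JSON reply."""
    with urllib.request.urlopen(build_request(customer, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode('utf-8'))


# Building the request does not need a running service; sending it does.
req = build_request({'tenure': 1, 'monthlycharges': 29.85})
print(req.get_method(), req.full_url)
```

Calling `send(customer)` only works while the service is running on that port.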

Notice the warning (not an error) that Flask prints when the service is started this way:  
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.  
Gunicorn is one such production WSGI server:
  
```bash
pip install gunicorn
```  
```bash
gunicorn --bind 0.0.0.0:9696 predict:app
```
Note that you need to stop the earlier `python predict.py` process first, because it is still holding port 9696.

Now that the web service is running, we need a virtual environment for it so that its dependencies are isolated and it can be run the same way on other machines.

## 5.5 Python virtual environment: Pipenv

Instead of running `pip install ...` for every package, which installs them into the site-packages of whichever Python interpreter is currently active, we want to use virtual environments.

`venv` is the virtual environment module built into Python  
`conda` is the environment manager built into Conda  
`pipenv` is an officially recommended Python package that manages a virtual environment for you

```bash
pip install pipenv
```

```bash
pipenv install numpy
pipenv install flask
```
Repeat this for every package the service needs. Each install is recorded in the `Pipfile`, and the exact versions are pinned in `Pipfile.lock`.

Now we want to be able to get into this environment we created  
```bash
pipenv shell
```
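This activates the environment in a new shell. The output includes the path to the environment's `bin` directory, which you can explore with `ls` to see what the installed packages placed there.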
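
To confirm that Python is really running from the virtual environment and not from the system installation, a short check like the one below can help. It is a sketch that relies on `sys.prefix` differing from `sys.base_prefix` inside environments created by `venv` or recent versions of `virtualenv`; the `real_prefix` fallback covers older `virtualenv` releases. The function name `in_virtualenv` is hypothetical.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a virtual environment."""
    # Older virtualenv versions set sys.real_prefix instead of base_prefix.
    if hasattr(sys, 'real_prefix'):
        return True
    return sys.prefix != sys.base_prefix


# sys.executable shows which interpreter is in use; inside `pipenv shell`
# it should point into the environment's bin directory.
print(sys.executable)
print(in_virtualenv())
```

Running it once inside `pipenv shell` and once outside makes the difference visible.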

Now that we are inside the environment, we can run the gunicorn command from above so that it uses the packages installed there:

```bash
gunicorn --bind 0.0.0.0:9696 predict:app
```

A shorthand for running a single command inside the pipenv environment without opening a shell first:
```bash
pipenv run gunicorn --bind 0.0.0.0:9696 predict:app
```

## 5.6 Environment Management: Docker

Official Docker images for Python can be found on Docker Hub: https://hub.docker.com/_/python/

We are going to use the `python:3.8.12-slim` image as an example:

```bash
docker run -it --rm --entrypoint=bash python:3.8.12-slim
```
Note:  
`-it` gives the container an interactive terminal  
`--rm` removes the container from the system when it exits  
`--entrypoint=bash` starts a bash shell instead of the image's default Python interpreter

We are going to use the Python image as the base image and write a Dockerfile that adds our own dependencies and code on top of it.

Once the Dockerfile lists the other dependencies, we build the image from the directory that contains it:  
```bash
docker build -t zoomcamp-test .
```

Now, instead of running the base image as above, we run the new image by name:  
```bash
docker run -it --rm --entrypoint=bash zoomcamp-test
```

Now we can start the service inside the container:
```bash
pipenv run gunicorn --bind 0.0.0.0:9696 predict:app
```

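We cannot reach this port from the host yet because it has not been published. `EXPOSE` in the Dockerfile documents the port the service listens on; the mapping from the host machine to the container port is then done with `-p` when the container is run. Add the following to the Dockerfile:

```
EXPOSE 9696

ENTRYPOINT ["gunicorn", "--bind=0.0.0.0:9696", "predict:app"]
```  
Note that each argument must be its own quoted string, unlike the bash command we ran manually above. For that reason the flag and its value are joined as `--bind=0.0.0.0:9696`; writing `"--bind 0.0.0.0:9696"` would pass the flag and the address to gunicorn as one argument containing a space.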

After rebuilding the image, we no longer need to specify the entrypoint when running it, but we do need to map a host port to the container port with `-p`:

```bash
docker run -it --rm -p 9696:9696 zoomcamp-test
```
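
A quick way to verify from the host that the port mapping works is to try opening a TCP connection to the published port. The sketch below uses only the standard library; the helper name `port_is_open` is hypothetical, and it only tests that something accepts connections on the port, not that the model answers correctly.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# True while the container is running with -p 9696:9696, False otherwise.
print(port_is_open('localhost', 9696))
```

If this prints `False` while the container is running, the `-p` flag is the first thing to check.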