## 4. Model Deployment

## 4.1 Three ways of deploying a model

Recap of what we have done so far: 
1. We have designed the ML model
2. Trained the model
  - Experiment tracking
  - Productionizing the model (ML pipeline)
  
And in this chapter, we select the model from the last step and **deploy** it. There are different ways of deploying a model depending on how frequently the prediction result is required:

1. **(Offline) Batch deployment**: 
  - Used when predictions are needed only at regular intervals and we can wait for them (e.g., 1 hour, 1 day, or 1 week).
  - The model is not up and running all the time; it runs on a schedule (hourly, daily, monthly)
  - Overview:
    - We have a database (with all our data)
    - A scoring job (which stores the model) pulls some data and provides it to the model
    - This data can be from the past hour, from yesterday, etc, depending on the regularity.
    - Some predictions are made
    - These are stored in another database
  - An example with our Taxi project: 
    - The user has an app to call the taxi
    - There are also competing apps (e.g., Uber)
    - Our marketing team may call the model to study churn (how many users leave us for the competitors)
    - Marketing doesn't need to run the model all the time, but maybe just daily or weekly
    
2. **Online**
  - For cases when we need predictions immediately.
  - The model is up and running all the time. It's always available

Within the **online** mode, there are two options:

1. **Web service**. 
  - You have a web service that contains the model. 
  - We get predictions from the model through HTTP requests
  - 1-to-1 relationship (client (the backend) -- server (the web service)). The connection is kept alive while the server processes the request and sends the response back
  - Our project of predicting the duration of a taxi ride is a good case to use the web service:
    - The user has an app which talks to the backend of the web service
    - Info about the user is sent to the backend (time of day, where the user is, ...)
    - This information is then used to run the model
    - The prediction is sent back to the backend, which passes it to the user
    - The user needs the prediction immediately to decide whether to take a taxi
2. **Streaming**. 
  - In a streaming setting we have producers and consumers
    - Producers: they generate events and push them to an event stream
    - Consumers: they read from the stream and react to these events
  - The key difference with respect to a web service is the lack of an explicit connection between producers and consumers. Producers don't care which or how many consumers there are
  - 1-to-N or many-to-many relationship (one/many producers to many consumers)
  - Applied to our taxi project:
    - We can have one producer (backend) that the user interacts with.
    - The producer then sends the event containing user info to a stream
    - Multiple consumers (not to be confused with users) read from this stream
      - Consumer 1 could be user tip prediction (model to predict user tips). It would then send a notification to the user regarding the tip
      - Consumer 2 may have a more accurate model to predict ride duration
      - ...

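The decoupling between producers and consumers can be illustrated with a minimal in-process sketch. It is not a real streaming system (in practice the stream would be an external service, and consumers would read at their own pace); the ```EventStream``` class, the consumer functions and the event fields are hypothetical names chosen for illustration, and the "models" are placeholder formulas.

```python
from typing import Callable

Event = dict[str, float]


class EventStream:
    """Toy stand-in for an event stream: fans each event out to all subscribers."""

    def __init__(self) -> None:
        self._consumers: list[Callable[[Event], None]] = []

    def subscribe(self, consumer: Callable[[Event], None]) -> None:
        self._consumers.append(consumer)

    def publish(self, event: Event) -> None:
        # The producer does not know which or how many consumers exist.
        for consumer in self._consumers:
            consumer(event)


notifications: list[str] = []


def tip_consumer(event: Event) -> None:
    # Placeholder "model": suggest a tip of 10% of the fare.
    notifications.append(f"suggested tip: {event['fare'] * 0.10:.2f}")


def duration_consumer(event: Event) -> None:
    # Placeholder "model": assume 3 minutes per unit of distance.
    notifications.append(f"estimated duration: {event['trip_distance'] * 3:.1f} min")


stream = EventStream()
stream.subscribe(tip_consumer)
stream.subscribe(duration_consumer)

# The backend (producer) pushes a single event; both consumers react to it.
stream.publish({'fare': 20.0, 'trip_distance': 4.0})
```

Adding a third consumer only requires another ```subscribe``` call; the producer code does not change, which is the point of the streaming setup.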
## 4.2 Web-services: Deploying models with Flask and Docker

- In this section we will use the pickle object saved in Week 1 and deploy it as a web application. 
- In the next section we will see how to connect our model registry from MLflow to the web application.

We'll start with an [introduction to Flask](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/05-deployment/03-flask-intro.md) from the ml-zoomcamp course. Some definitions first:

- A **web service** allows different applications to communicate and exchange data over the internet using standardized protocols such as HTTP. It acts as a mediator, so applications can share data regardless of the programming language they are written in.
- **Flask** is a Python web framework: it provides tools and libraries for building a web application.

We can convert a Python function into a web service by using:

```python
from flask import Flask


# give an identity to your web service
app = Flask('ping-pong')

# Define a route ('/ping') for the web service. The decorated function
# will be executed when that route is accessed.
@app.route('/ping', methods=['GET'])
def ping():
    return 'PONG'

# Start the Flask development server when the file is run as a script
if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=9696)
```

The ```ping()``` function is now converted into a web service using Flask. When you access the specified route (http://localhost:9696/ping), it will return the string 'PONG' as the response.

In Flask, the ```@app.route``` decorator allows you to specify different HTTP methods for a particular route:
- GET retrieves data from the server. It is the default method for a route.
- POST sends data to the server to create or update a resource (e.g., when logging in we submit (post) our username and password to the web service). The request does not specify where the server stores the data.
- PUT is similar to POST, but the request URL identifies the exact resource to create or replace.
- DELETE asks the server to delete a resource.

The ```app.run()``` method starts the Flask development server; its arguments customize the server's behavior:
- By default, it runs on the local machine only (```127.0.0.1``` or ```localhost```). Passing ```host='0.0.0.0'``` makes it listen on all network interfaces.
- By default, it runs on port ```5000```.
- Debug mode is disabled by default. Passing ```debug=True``` enables it, which provides helpful debugging information and automatically reloads the server when code changes are detected.

After a brief introduction to Flask, we can now describe the steps to deploy a model as a web-service:

1. Save the trained model 
2. Create a virtual environment
3. Create a script for predicting
4. Put the script into a Flask app
5. Package the app to Docker

The different files are located in ```notes-deployment-04-files/web-service```.

### 4.2.1 Save trained model

We had saved the trained model using ```pickle``` (see Week 1). This resulted in the binary file ```lin_reg.bin```, which has been copied to ```notes-deployment-04-files/web-service```.

### 4.2.2 Create a virtual environment

We want to use the model developed in Week 1 of the course. For consistency, we need to reproduce the Python environment used to train and test the model. To list the packages and package versions of the current Python environment (shell command):

$ pip freeze

aiosqlite==0.19.0
alembic==1.11.1
anyio==3.6.2
appdirs==1.4.4
apprise==1.4.0
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
arrow==1.2.3
asgi-lifespan==2.1.0
asttokens==2.2.1
asyncpg==0.27.0
attrs==23.1.0
backcall==0.2.0
beautifulsoup4==4.12.2
black==23.3.0
bleach==6.0.0
blinker==1.6.2
boto3==1.26.139
botocore==1.29.139
cachetools==5.3.1
certifi==2023.5.7
cffi==1.15.1
charset-normalizer==3.1.0
click==8.1.3
cloudpickle==2.2.1
cmaes==0.9.1
colorama==0.4.6
colorlog==6.7.0
comm==0.1.3
contourpy==1.0.7
coolname==2.2.0
cramjam==2.6.2
croniter==1.3.15
cryptography==41.0.1
cycler==0.11.0
databricks-cli==0.17.7
dateparser==1.1.8
debugpy==1.6.7
decorator==5.1.1
defusedxml==0.7.1
docker==6.1.2
docker-pycreds==0.4.0
entrypoints==0.4
executing==1.2.0
fastapi==0.96.0
fastjsonschema==2.16.3
fastparquet==2023.4.0
Flask==2.3.2
fonttools==4.39.4
fqdn==1.5.1
fsspec==2023.5.0
future==0.18.3
gitdb==4.0.10
GitPython==3.1.31
google-auth==2.19.1
greenle... (output truncated)

We're mostly interested in getting the scikit-learn version:

```scikit-learn==1.2.2```

Now that we have the package version we need, we use ```pipenv``` to create an environment with it in ```notes-deployment-04-files/web-service``` (from the 'base' ```conda``` environment):

```
$ pip install pipenv
$ pipenv install scikit-learn==1.2.2 flask --python=3.10.9
$ pipenv shell
$ exit
```

Notes: 
- ```pip install``` only in case ```pipenv``` is not installed
- ```pipenv``` uses the current directory as the root of the environment
- ```pipenv shell``` activates the environment
- ```exit``` exits the environment

This creates two files in our directory: ```Pipfile``` and ```Pipfile.lock```.
- ```Pipfile``` stores the packages (and version constraints) that we asked for, such as scikit-learn and Flask.
- ```Pipfile.lock``` stores the exact versions of the whole dependency tree, so that, for example, updating NumPy for scikit-learn does not break Flask in the process.

### 4.2.3 Create script for predicting

We create a simple script (```predict.py```) that loads the saved model, preprocesses the input data and generates a prediction. The following paragraphs refer to this file.

In Week 1 we pickled two objects into ```lin_reg.bin```:
1. ```DictVectorizer```
2. Linear Regressor

They will be loaded with:

```python
import pickle

# lin_reg.bin holds the (DictVectorizer, model) tuple saved in Week 1
with open('lin_reg.bin', 'rb') as f_in:
    (dv, model) = pickle.load(f_in)
```

We then extract and preprocess the features as we did in Week 1 by using the function

```python
def prepare_features(ride: dict[str,float]) -> dict[str,float|str]:
    ...
```

We then run the prediction (the function returns only the first prediction):
```python
def predict(features: dict[str,float|str]) -> float:
    ...
```
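A minimal sketch of what these two functions might look like. The feature names (```PU_DO```, ```trip_distance```) and the two stub classes are assumptions made so the example runs on its own; in ```predict.py``` the objects ```dv``` and ```model``` come from ```lin_reg.bin``` instead of the stubs.

```python
class StubVectorizer:
    """Stand-in for the pickled DictVectorizer (assumption, for illustration)."""

    def transform(self, features: dict) -> list[list[float]]:
        return [[float(features['trip_distance'])]]


class StubModel:
    """Stand-in for the pickled linear regressor (assumption, for illustration)."""

    def predict(self, X: list[list[float]]) -> list[float]:
        return [10.0 + 2.0 * row[0] for row in X]


dv, model = StubVectorizer(), StubModel()


def prepare_features(ride: dict) -> dict:
    # Combine pickup and dropoff zones into one categorical feature
    # and keep the trip distance as a numerical feature.
    return {
        'PU_DO': f"{ride['PULocationID']}_{ride['DOLocationID']}",
        'trip_distance': ride['trip_distance'],
    }


def predict(features: dict) -> float:
    X = dv.transform(features)
    preds = model.predict(X)
    return float(preds[0])  # only the first prediction is returned


# Hypothetical ride, for illustration only.
ride = {'PULocationID': 10, 'DOLocationID': 50, 'trip_distance': 40}
duration = predict(prepare_features(ride))
```

Converting the result with ```float()``` keeps it a plain Python number, which is convenient when the value is later placed in a JSON response.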

### 4.2.4 Put the script into a Flask app

Now that we have the ```predict.py``` ready, we can convert it into a Flask app:

```python
from flask import Flask, request, jsonify

app = Flask('duration-prediction')


@app.route('/predict', methods=['POST'])
def predict_endpoint():  # takes no arguments; Flask provides the request object
    ride = request.get_json()  # parse the JSON body sent by the client
    features = prepare_features(ride)
    pred = predict(features)

    result = {'duration': pred}

    return jsonify(result)  # serialize the dictionary into a JSON response
```

To run the Flask application on localhost we add to the previous file:

```python
if __name__ == "__main__":
    app.run(debug=True, host='0.0.0.0', port=9696)
```
Now if we run ```predict.py```, a Flask application will run on ```localhost``` on port ```9696```.

#### Request Predictions from the Flask app

To request a prediction from the server, we create another file, ```test.py```. This file posts its ride information to the server and prints out the response (i.e., the predicted duration):

```$ python test.py```

Output:

```{'duration': 25.82088225071811}```

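A client along these lines could be sketched as follows. This is not the course's ```test.py```: it uses only the standard library (```urllib```) so that it runs without extra packages, the ride fields are hypothetical, and the URL assumes the server from ```predict.py``` is listening on port ```9696```.

```python
import json
import urllib.request

URL = 'http://localhost:9696/predict'


def build_request(ride: dict, url: str = URL) -> urllib.request.Request:
    # The ride is sent as a JSON body in a POST request.
    return urllib.request.Request(
        url,
        data=json.dumps(ride).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


def request_prediction(ride: dict, url: str = URL) -> dict:
    # Sends the request; this requires the Flask app to be running.
    with urllib.request.urlopen(build_request(ride, url)) as response:
        return json.loads(response.read())


# Hypothetical ride; the field names are assumptions for illustration.
ride = {'PULocationID': 10, 'DOLocationID': 50, 'trip_distance': 40}
req = build_request(ride)
```

With the server running, ```print(request_prediction(ride))``` would print the JSON response as a dictionary.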
#### Use a production WSGI server

The current Flask setup is a development environment. We receive the following warning:

```WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.```

To deploy the model into production, we use ```gunicorn```:

```
$ pipenv install gunicorn
$ gunicorn --bind=0.0.0.0:9696 predict:app
```

Here ```predict``` refers to the module ```predict.py``` in the current directory, and ```app``` is the Flask app defined in that file (see above). This runs the same application, but served by ```gunicorn``` instead of the Flask development server. We can run ```test.py``` again and confirm that we get the same result.

We ran the ```test.py``` script from the ```mlops-zoomcamp``` conda environment (the development environment), where the ```requests``` library is installed. The development environment needs ```requests``` for testing, but the production environment does not.

To run ```test.py``` from the ```pipenv``` environment (the production environment), we can install ```requests``` as a dev-only dependency so that it is not installed during deployment:

```
$ pipenv install --dev requests
```
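Returning to the ```predict:app``` argument passed to ```gunicorn``` earlier in this section: ```app``` has to be a WSGI callable, which is the interface a WSGI server uses to talk to the application (a Flask object implements it). The sketch below shows that interface with the standard library only; the toy ```app``` and the ```call_app``` helper are illustrative and are not part of ```predict.py```.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    # A WSGI application receives the request environment and a callback
    # for the status line and headers, and returns the body as bytes chunks.
    body = b'PONG'
    start_response('200 OK', [
        ('Content-Type', 'text/plain'),
        ('Content-Length', str(len(body))),
    ])
    return [body]


def call_app(wsgi_app, path):
    # Play the role of the WSGI server for a single fake GET request.
    environ = {}
    setup_testing_defaults(environ)
    environ['PATH_INFO'] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured['status'] = status
        captured['headers'] = headers

    body = b''.join(wsgi_app(environ, start_response))
    return captured['status'], body


status, body = call_app(app, '/ping')
```

A production server such as ```gunicorn``` does the same thing at scale: it imports the module, looks up the callable, and invokes it for every incoming request.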

### 4.2.5 Package the app to Docker

Now we want to deploy our predictor into a Docker container. We create the following ```Dockerfile``` in ```notes-deployment-04-files/web-service```:


```dockerfile
# Use the base image of Python version 3.10.9 with a slim distribution
# (smaller footprint compared to the regular distribution)
FROM python:3.10.9-slim
# Update pip to the latest version
RUN pip install -U pip
# Install pipenv package manager
RUN pip install pipenv
# Set the working directory inside the container to /app
WORKDIR /app
# Copy the Pipfile and Pipfile.lock from the local directory to the container's
# /app directory
COPY [ "Pipfile", "Pipfile.lock", "./" ]
# Install the project dependencies. They are installed into the system Python
# (without a virtual environment) because Docker already provides the isolation.
# --deploy makes the build fail if Pipfile.lock is out of date with Pipfile, so
# the exact versions recorded in Pipfile.lock are installed, which keeps
# environments consistent and reproducible.
RUN pipenv install --system --deploy
# Copy the predict.py and lin_reg.bin files from the local directory to the 
# container's /app directory
COPY [ "predict.py", "lin_reg.bin", "./" ]
# Document that the container listens on port 9696 (it is published with -p at run time)
EXPOSE 9696
# Set the entrypoint command to run gunicorn with the specified bind address and 
# port, and the "predict:app" application
ENTRYPOINT [ "gunicorn", "--bind=0.0.0.0:9696", "predict:app" ]
```

We then build the Docker Image with:

```
$ docker build -t ride-duration-prediction-service:v1 .
```
where:
- The ```-t``` flag is used to specify a tag for the Docker image.
- The dot ```.``` at the end specifies that the current directory should be used as the build context.

And run the container with:

```
$ docker run -it --rm -p 9696:9696 ride-duration-prediction-service:v1
```
where:
- ```-it``` enables interactive mode in the container, allowing you to interact with the container's terminal.
- ```--rm``` indicates that the container should be automatically removed when it exits.
- ```-p``` is used to publish and map ports between the container and the host. In this case, it maps port ```9696``` of the host to port ```9696``` of the container.

This will deploy the app on ```localhost```. We can run the ```test.py``` script again to confirm the result. Now the request is handled by the app running inside the Docker container instead of a ```gunicorn``` or Flask process started directly on the host.

## 4.3 Web-services: Getting the models from the model registry (MLflow)

Up to this point, we have packaged the model into a Docker image, making it deployable on any Docker-compatible computing platform. However, the model was loaded directly from a local file, which goes against what we learned in previous sessions: candidate models should be stored in a model registry (MLflow). Therefore, in this section we explore how to retrieve the model from the model registry for serving.

### 4.3.1 Setup MLflow

We launch the MLflow server locally by running the following command in a terminal from ```/notes-deployment-04-files/web-service-mlflow```:

```
(mlops-zoomcamp) $ mlflow server --backend-store-uri sqlite:///backend.db --default-artifact-root ./artifacts_local
```

The artifacts will be saved in ```./artifacts_local```, while the runs and metadata will be stored in the SQLite database (the file ```backend.db```). With a database backend like this we can use the model registry.

Now we can open the MLflow UI at http://127.0.0.1:5000/


### 4.3.2 Train a model

We train a Random Forest regressor and track and save the model in MLflow (```random-forest.py```). You can check the experiment details and the logged model artifact in the MLflow UI:

![title](images/mlflow1.png)


We can then extract the run ID: 

```python
RUN_ID = '068bda10a3ed4b73a771df771161f60a'
```
and save it in ```web-service-mlflow/predict.py```

### 4.3.3 Inference script to fetch model from MLflow

We copy the ```Pipfile``` and ```Pipfile.lock``` from the previous section (4.2) to our current folder (```web-service-mlflow```). Then we use ```pipenv``` to create an environment in this folder. It will install the packages listed in the ```Pipfile``` and ```Pipfile.lock``` files already present. We also want to install MLflow:

```
$ pipenv install mlflow
```
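With MLflow installed, the model-loading part of ```web-service-mlflow/predict.py``` could be sketched as follows. This is a sketch under assumptions, not the file itself: it assumes the tracking server from 4.3.1 is reachable at ```http://127.0.0.1:5000``` and that the model was logged under the artifact path ```model```; ```build_model_uri``` and ```load_model``` are hypothetical helper names.

```python
RUN_ID = '068bda10a3ed4b73a771df771161f60a'
MLFLOW_TRACKING_URI = 'http://127.0.0.1:5000'


def build_model_uri(run_id: str, artifact_path: str = 'model') -> str:
    # 'runs:/<run_id>/<artifact_path>' is MLflow's URI scheme for run artifacts.
    # The artifact path 'model' is an assumption about how the model was logged.
    return f'runs:/{run_id}/{artifact_path}'


def load_model(run_id: str = RUN_ID):
    # Imported lazily so the URI helper can be used without MLflow installed.
    import mlflow

    mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
    return mlflow.pyfunc.load_model(build_model_uri(run_id))
```

Keeping ```RUN_ID``` in one place also makes it easy to return it next to the prediction, so the client can tell which model produced the result.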

Now if we run ```web-service-mlflow/predict.py```, a Flask application will run on ```localhost``` on port ```9696```. To request a prediction from the server, we run ```web-service-mlflow/test.py```. This file posts its ride information to the server and prints out the response (i.e., the predicted duration and the run ID):

```$ python test.py```

Output:

```{'duration': 45.50965007660852, 'model_version': '068bda10a3ed4b73a771df771161f60a'}```

## 4.4 (Optional) Streaming: Deploying models with Kinesis and Lambda

See these [Notes](https://sagarthacker.com/posts/mlops/aws-deployment-lambda-kinesis.html)

## 4.5 Batch deployment: use of scoring script

Even though **batch deployment** is not the ideal way to deploy our "taxi ride duration" model (a web service would be), we can reframe the problem: given the *actual* duration and the *predicted* duration, we can see how often our drivers deviate from the *ideal* predicted duration.

### 4.5.1 Preparing a scoring script

We start from the ```web-service-mlflow/random-forest.ipynb``` notebook from the previous section 4.3:
- We copy it and rename it to ```batch/score.ipynb```.
- We add the first lines from ```web-service-mlflow/predict.py``` (related to MLflow and loading the model created in 4.3)
- To check that the model is loaded correctly, we need to start MLflow in the directory ```web-service-mlflow/```, where we saved all artifacts.

```
(mlops-zoomcamp) $ mlflow server --backend-store-uri sqlite:///backend.db --default-artifact-root ./artifacts_local
```
- Now we only apply the model (we do not train it), so we just keep ```df```, ```dicts``` and the prediction of the loaded model, ```y_pred```
- We save the predictions in a DataFrame. To give a unique ID to every row we can use the Python library ```uuid```
- We add some meta information to this DataFrame ('PULocationID','DOLocationID','actual_duration',...)
- We can also parametrize the input/output files

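The result-building step can be sketched without third-party packages. The notes store the result in a DataFrame; plain dictionaries are used here so the sketch is self-contained (wrapping ```rows``` in ```pd.DataFrame(rows)``` would give the tabular form). The column names ```ride_id```, ```predicted_duration``` and ```diff```, the input field ```duration```, and the helper names are assumptions.

```python
import uuid


def generate_uuids(n: int) -> list[str]:
    # uuid4 produces random IDs, so every row gets its own identifier.
    return [str(uuid.uuid4()) for _ in range(n)]


def build_result_rows(rides: list[dict], y_pred: list[float]) -> list[dict]:
    ride_ids = generate_uuids(len(rides))
    rows = []
    for ride_id, ride, pred in zip(ride_ids, rides, y_pred):
        rows.append({
            'ride_id': ride_id,
            'PULocationID': ride['PULocationID'],
            'DOLocationID': ride['DOLocationID'],
            'actual_duration': ride['duration'],
            'predicted_duration': pred,
            # Positive values mean the ride took longer than predicted.
            'diff': ride['duration'] - pred,
        })
    return rows


# Hypothetical rides and predictions, for illustration only.
rides = [
    {'PULocationID': 10, 'DOLocationID': 50, 'duration': 20.0},
    {'PULocationID': 7, 'DOLocationID': 7, 'duration': 5.0},
]
rows = build_result_rows(rides, y_pred=[18.5, 6.0])
```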
### 4.5.2 Running the scoring script

To convert the notebook into a Python script, execute:

```
$ jupyter nbconvert --to script score.ipynb
```

which will create ```score.py``` in the current directory (```batch/```). You can now create a function 

```python
def run():
    ...
```
to parametrize the script. We use the ```sys``` module to read the parameters from the command line:
- ```taxi_type``` is the 1st parameter
- ```year``` is the 2nd parameter
- ```month``` is the 3rd parameter
- ```run_id``` is the 4th parameter

Therefore, by running:
```
$ python score.py green 2021 2 068bda10a3ed4b73a771df771161f60a
```
we get the output file ```green_2021-02.parquet``` stored in ```batch/output/```.

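A sketch of how ```run()``` might read its parameters, matching the order used in the command above (taxi type, year, month, run ID). The output path pattern follows the file name shown above; the helper ```parse_args``` and the ```ScoreArgs``` container are hypothetical names, and the actual scoring call is left out.

```python
import sys
from typing import NamedTuple


class ScoreArgs(NamedTuple):
    taxi_type: str
    year: int
    month: int
    run_id: str
    output_file: str


def parse_args(argv: list[str]) -> ScoreArgs:
    # argv[0] is the script name; the four parameters follow it.
    taxi_type = argv[1]
    year = int(argv[2])
    month = int(argv[3])
    run_id = argv[4]
    output_file = f'output/{taxi_type}_{year:04d}-{month:02d}.parquet'
    return ScoreArgs(taxi_type, year, month, run_id, output_file)


def run() -> None:
    args = parse_args(sys.argv)
    print(f'scoring {args.taxi_type} {args.year}-{args.month:02d} '
          f'with run {args.run_id} -> {args.output_file}')


example = parse_args(['score.py', 'green', '2021', '2',
                      '068bda10a3ed4b73a771df771161f60a'])
```

Separating the parsing from ```run()``` keeps the command-line handling easy to check without touching the model or the data.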
## 4.6 Batch deployment: Scheduling scoring jobs with Prefect

### 4.6.1 Adjusting previous ```score.py``` file
We start by copying the previous file ```batch/score.py``` to a new folder as ```batch-prefect/score.py```. Some changes have already been included in this new file:
- Added function ```save_results()``` -> deals with all operations in ```df_result```
- Added Prefect ```@task``` to the function ```apply_model```
- ```print()``` has been replaced by Prefect's ```get_run_logger()```
- Added function ```get_paths()``` -> deals with input/output paths
- Added function ```ride_duration_prediction()``` decorated with Prefect's ```@flow``` -> repackages some of the functionality of ```run()```

### 4.6.2 Run new ```score.py``` with Prefect support

We are ready to run the new file!

- To check that the model is loaded correctly, we need to start MLflow in the directory ```web-service-mlflow/```, where we saved all artifacts.

```
(mlops-zoomcamp) $ mlflow server --backend-store-uri sqlite:///backend.db --default-artifact-root ./artifacts_local
```
We can now run again:
```
$ python score.py green 2021 2 068bda10a3ed4b73a771df771161f60a
```
This produces the output file ```green_2021-01.parquet``` in ```batch-prefect/output/```. We can also see the logs in Prefect:

![title](images/prefect1.png)

### 4.6.3 Create Prefect Project & deployment

In order to proceed with the deployment, we will use Prefect Projects to manage and maintain our different deployments. By using Projects we add another layer of abstraction (group different deployments together). To initialize a Project within ```batch-prefect/```:

```(mlops-zoomcamp) $ prefect project init```

which creates several ```.yaml``` files in the directory to set the project up.

1. We can now create a work pool in the Prefect UI. We name it ```local-work```

2. We get the desired flow (```score.py:ride_duration_prediction```) deployed by running:

```
(mlops-zoomcamp) $ prefect deploy score.py:ride_duration_prediction -n my-first-deployment -p local-work
```

We can now see the new deployment in the Prefect UI:

![title](images/prefect2.png)

### 4.6.4 Running deployment

1. To execute flow runs from the deployment set up in 4.6.3, start a worker that pulls work from the ```local-work``` work pool

```
(mlops-zoomcamp) $ prefect worker start --pool local-work
```

2. Now we can go to the deployment and launch a quick run. A window will pop up asking for the input parameters:

![title](images/prefect3.png)

**Important**: Make sure that the data you are using is uploaded to your GitHub repo; otherwise the run will not find the files and will raise an error.