# CPSC 330 Lecture 24

Outline:

- 👋
- **Turn on recording**
- Announcements + survey (15 min)
- Model deployment (30 min)
- Instructor/TA evaluations + Break (10 min)
- Review / conclusion (25 min)

## Learning objectives

- Describe the goals and challenges of model deployment.

In [1]:
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split, cross_validate

## Announcements + survey (15 min)

- Last lecture today!
- Learning objectives should now be posted for all lectures.
- hw7 grades posted
- Recordings on README now link to YouTube, which has some parts cut out. See Canvas for whole videos.
- Extremely short survey on CPSC 330 vs. 340: https://ubc.ca1.qualtrics.com/jfe/form/SV_2ayfs2EcNUJdYKV
- We will take time later for the formal course evaluations.

In [None]:
REMINDER TO RESUME RECORDING

#### Final exam

- Two parts: written part on Canvas (40%), coding part on GitHub (60%)
- Both parts will be available from Dec 7 at 6:00pm to Dec 9 at 6:00pm

Canvas part:

- No time limit within the 48 hours. Poll: https://piazza.com/class/kb2e6nwu3uj23?cid=647
- Similar length to the midterm: approximately 20 short-answer questions
- Open-book
- Answers must be in your own words
- Questions will appear in random order
- Once you submit an answer you cannot go back
- I am guessing the written part will take you 1-2 hours (similar to midterm), but you can take as long as you want.
- More emphasis on 2nd half of the course.

Coding part:

- No time limit within the 48 hours
- Submission process exactly the same as assignments (push to GitHub and Canvas)
- Make sure you're using the course environment (conda or pip)
- Open-book (course notes, internet posts other than you asking a question, anything except another human)
- Can use code from class with attribution
- Cannot use code from elsewhere
- Similar format to [last year's final exam](https://github.com/UBC-CS/cpsc330/blob/master/exams/2019W2/final_exam.ipynb) (except Q7)
- I am guessing the coding part will take you 2-3 hours (similar to last year's final), but you can take as long as you want.
- Will cover the whole course but more emphasis on the 1st half.

Joint rules:

- No communication 
- No public Piazza posts
- Questions via private Piazza posts
  - I will check several times a day (probably between 7am and 8pm)
- There will be an integrity pledge

I will write this all up in a Piazza post soon.

## Model deployment (30 min)

#### Attribution

This material adapted from the [model deployment tutorial](https://github.com/TomasBeuzen/machine-learning-tutorials/blob/master/ml-deploy-model/deploy-with-flask.ipynb) by [Tomas Beuzen](https://www.tomasbeuzen.com/).

#### What is deployment?

- After we train a model, we want to use it!
- The user likely does not want to install your Python stack or train your model.
- You don't necessarily want to share your dataset.
- So we need to do two things:

1. Save/store your model for later use.
2. Make the saved model conveniently accessible.

We will use [Joblib](https://joblib.readthedocs.io/) for (1) and [Flask](https://flask.palletsprojects.com/) & [Heroku](https://www.heroku.com/) for (2).

#### Requirements (I already did these)

- Heroku account. Register [here](https://www.heroku.com/).
- Heroku CLI. Download [here](https://devcenter.heroku.com/categories/command-line).

Additional Python packages to install (not in the course environment):

```
pip install Flask
pip install Flask-WTF
pip install joblib
```

#### Preparing the model we wish to deploy

We'll train a regression model on the classic abalone dataset hosted [here](https://archive.ics.uci.edu/ml/datasets/abalone). The target is the number of rings, which serves as a proxy for the age of the abalone, and we'll predict it from four physical measurements. I've renamed `abalone.data` to `abalone.csv` after downloading.

In [2]:
abalone_df = pd.read_csv('data/abalone.csv',
                       names = ['sex', 'length', 'diameter', 'height',
                                'whole_weight', 'shucked_weight', 'viscera_weight',
                                'shell_weight', 'rings'])

For simplicity, only use 4 features:

In [3]:
features = ['length', 'diameter', 'height', 'whole_weight']

X = abalone_df[features]
y = abalone_df['rings']

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

In [5]:
X_train.head()

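|      | length | diameter | height | whole_weight |
|------|--------|----------|--------|--------------|
| 3906 | 0.245  | 0.18     | 0.065  | 0.0635       |
| 2562 | 0.44   | 0.325    | 0.1    | 0.4165       |
| 2197 | 0.405  | 0.305    | 0.105  | 0.3625       |
| 1405 | 0.655  | 0.535    | 0.205  | 1.6445       |
| 1903 | 0.575  | 0.445    | 0.145  | 0.876        |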


In [6]:
y_train.head()

3906     4
2562     6
2197    10
1405    13
1903    10
Name: rings, dtype: int64

Build and score model:

In [7]:
model = RandomForestRegressor(n_estimators=10, random_state=123)
model.fit(X_train, y_train);

In [8]:
model.score(X_train, y_train)

0.8716166306386149

In [9]:
model.score(X_test, y_test)

0.27082133598283975

- We'll re-fit the model on the full dataset to get it ready for deployment. 
  - This is probably a good idea, because more training data generally helps.
  - It's also a little scary, because we can't test this new model.
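
One way to make the re-fit a little less scary is to cross-validate the same model configuration on the full dataset first; the fold scores give a rough estimate of how the final model might generalize. A minimal sketch, assuming synthetic data from `make_regression` stands in for the abalone features (so the numbers it prints say nothing about the abalone model itself):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (assumption: 4 numeric features, like the lecture's X).
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0,
                                 random_state=123)

# Same hyperparameters as the model we intend to deploy.
demo_model = RandomForestRegressor(n_estimators=10, random_state=123)

# 5-fold CV; scikit-learn reports MAE negated so that higher is always better.
cv_results = cross_validate(demo_model, X_demo, y_demo, cv=5,
                            scoring="neg_mean_absolute_error")
fold_mae = -cv_results["test_score"]
print("MAE per fold:", fold_mae.round(2))
print("Mean MAE:", fold_mae.mean().round(2))

# Only after inspecting the CV scores do we fit on everything.
demo_model.fit(X_demo, y_demo)
```

The spread across folds also hints at how much the estimate itself can be trusted.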

In [10]:
model = RandomForestRegressor(n_estimators=10, random_state=123)
model.fit(X, y);

Save trained model using `joblib`. This will be loaded up when we start our "app": 

In [14]:
with open('web_api/abalone_predictor.joblib', 'wb') as f:
    joblib.dump(model, f)
with open('web_application/abalone_predictor.joblib', 'wb') as f:
    joblib.dump(model, f)
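
On the serving side, the saved file is read back with `joblib.load`. A minimal round-trip sketch, assuming a temporary directory and a tiny synthetic dataset in place of the real paths and data (`demo_predictor.joblib` is a made-up file name):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Tiny synthetic dataset (assumption: stands in for the abalone features/target).
rng = np.random.RandomState(123)
X_demo = rng.rand(50, 4)
y_demo = X_demo.sum(axis=1)

demo_model = RandomForestRegressor(n_estimators=10, random_state=123)
demo_model.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_predictor.joblib")  # hypothetical file name
    joblib.dump(demo_model, path)      # save the fitted model
    loaded_model = joblib.load(path)   # load it, as the app would at start-up

# The reloaded model should reproduce the original predictions exactly.
same = np.array_equal(demo_model.predict(X_demo), loaded_model.predict(X_demo))
print(same)
```

Note that loading a joblib (pickle-based) file can execute arbitrary code, so only load model files from sources you trust.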

We'll define a function that accepts input data as a dictionary and returns a prediction:

In [11]:
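def return_prediction(model, input_json):
    """Return the model's prediction for a single example given as a dict."""
    # Order the values to match the feature order used in training.
    input_data = [[input_json[k] for k in features]]
    prediction = model.predict(input_data)[0]

    return prediction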

In [12]:
example_input_json = {
    'length': 0.41,
    'diameter': 0.33,
    'height': 0.10,
    'whole_weight': 0.36
}

In [13]:
return_prediction(model, example_input_json)

7.6

This function appears in the `app.py` that we'll be using shortly.

#### (optional) Setting up a directory structure and environment

- We need a specific directory structure to help us easily deploy our machine learning model. 
- This is already set up in this repo.

```shell
lectures
├── web_api
│   ├── abalone_predictor.joblib  # this is the machine learning model we have built locally
│   ├── app.py  # the file that defines our flask API
│   ├── Procfile  # required by Heroku to help start flask app
│   └── requirements.txt  # file containing required packages
│
└── web_application
    ├── abalone_predictor.joblib  # this is the machine learning model we have built locally
    ├── app.py  # the file that defines our flask web application
    ├── Procfile  # required by Heroku to help start flask app
    ├── requirements.txt  # file containing required packages
    ├── templates  # this subdirectory contains HTML templates to help us build the web application
    │   ├── home.html  # html template to be used in web application
    │   └── prediction.html  # html template to be used in web application
    └── static  # this subdirectory contains CSS style sheets
        └── style.css  # css style sheet to be used in web application
```

#### Model deployment

We have two options for deploying our abalone prediction model. We can:

1. Develop a RESTful web API that accepts HTTP requests in the form of input data and returns a prediction.
2. Build a web application with an HTML user interface that interacts directly with our API.

We'll explore both options below.

#### Building and deploying a web API

|      | on localhost (my laptop) | on server (the interwebs) |
|------|--------------------------|--------------------------|
| API  |    you are here          |                          |
| app  |                          |                          |

- I have a separate Python file called `app.py` that handles this part.
- We can open it up here in Jupyter Lab and take a look.
- If we run `python app.py` we'll bring it to life.
  - If you get an error, you may need to install those extra packages and make sure you have the environment loaded.
- We won't go into details here. If you want to learn more about Flask, see:
  - [Flask tutorial video series by Corey Schafer](https://www.youtube.com/playlist?list=PL-osiE80TeTs4UjLw5MM6OjgkjFeUxCYH)
  - [Flask docs](https://flask.palletsprojects.com/en/1.1.x/)
  - [Flask tutorial by Miguel Grinberg](https://blog.miguelgrinberg.com/post/the-flask-mega-tutorial-part-i-hello-world)
- If you don't know what an API is, that's OK.
  - For our purposes, it's something that exists at a particular address, that can accept information and return information.
  - Sort of like a function but not Python-specific and potentially accessible by anyone on the internet.
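
For orientation, here is a minimal sketch of what a Flask prediction API can look like. It is not the actual `app.py` from the repo: `StubModel` is a hypothetical placeholder for the model loaded with joblib, and the route bodies are assumptions. Only the `/` and `/predict` addresses and the four feature names come from this lecture.

```python
from flask import Flask, request

FEATURES = ["length", "diameter", "height", "whole_weight"]


class StubModel:
    """Hypothetical stand-in for the trained model (just sums the features)."""

    def predict(self, rows):
        return [sum(row) for row in rows]


stub_model = StubModel()  # in a real app: joblib.load("abalone_predictor.joblib")
app = Flask(__name__)


@app.route("/")
def index():
    # Lets us check in a browser that the API is alive.
    return "API is running"


@app.route("/predict", methods=["POST"])
def predict():
    input_json = request.get_json()  # parse the JSON body of the request
    row = [[input_json[k] for k in FEATURES]]
    return str(stub_model.predict(row)[0])


# Exercise the API in-process with Flask's test client (no server needed).
client = app.test_client()
response = client.post("/predict", json={"length": 0.41, "diameter": 0.33,
                                         "height": 0.10, "whole_weight": 0.36})
print(response.status_code, response.get_data(as_text=True))
```

To serve it for real, one would typically call `app.run()` under an `if __name__ == "__main__":` guard, which is what makes `python app.py` start a local server.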

#### Testing that the Flask API is alive

- Let's test this out in a terminal here in Jupyter Lab. 
- Ok, now let's go to the URL `http://127.0.0.1:5000/`. 

#### Sending a request to the API

In [16]:
!curl -d '{"length":0.41,"diameter":0.33,"height":0.10,"whole_weight":0.36}' \
      -H "Content-Type: application/json" \
      -X POST http://localhost:5000/predict

7.6


(Or, we can open up another Terminal tab and do it there)
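
The same request can also be sent from Python. A sketch using only the standard library; `build_predict_request` is a hypothetical helper name, and the final `urlopen` call is commented out because it only works while the Flask API is running:

```python
import json
import urllib.request


def build_predict_request(url, measurements):
    """Build a POST request carrying the measurements as a JSON body."""
    body = json.dumps(measurements).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_predict_request(
    "http://localhost:5000/predict",
    {"length": 0.41, "diameter": 0.33, "height": 0.10, "whole_weight": 0.36},
)
print(req.get_method(), req.full_url)

# With the API running, this sends the request and prints the prediction:
# with urllib.request.urlopen(req) as response:
#     print(response.read().decode("utf-8"))
```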

#### Deploying the API on a server

|      | on localhost (my laptop) | on server (the interwebs) |
|------|--------------------------|--------------------------|
| API  |                          |      you are here   |
| app  |                          |                          |

- Okay, so we have a working API running on localhost, but we don't want to host this service on my laptop!
- We now want to deploy it on a "real" server so others can send it requests. 
- We will use Heroku to deploy our app, but you could also use other services such as AWS.


#### Heroku set-up (I already did these):

1. Go to [Heroku](https://dashboard.heroku.com/), log-in, and click "Create new app".
2. Choose a unique name for your app.
3. Create app.

<img src="img/flask_images/fl_6.png" width="600">

- We will be using the Heroku CLI to deploy our model. 
- We'll open up another terminal.

```
heroku login
cd my-project/
git init
heroku git:remote -a my-abalone-predictor
git add .
git commit -am "Initial commit"
git push heroku master
```

(Note that for more complex applications, you may choose to package everything in a Docker container and deploy that to Heroku.)

#### Using the API on Heroku

In [19]:
!curl -d '{"length":0.41,"diameter":0.35,"height":0.12,"whole_weight":0.36}' \
      -H "Content-Type: application/json" \
      -X POST https://my-abalone-predictor.herokuapp.com/predict 

9.6


- This means that anyone on the internet can now send requests to the API.
- In fact, you all have your laptops - give it a try!
- You can also do the `curl` from a terminal:

```
curl -d '{"length":0.41,"diameter":0.33,"height":0.10,"whole_weight":0.36}' \
     -H "Content-Type: application/json" \
     -X POST https://my-abalone-predictor.herokuapp.com/predict
```

![](img/mike_highfive.png)

That's it for the API approach. Next:

#### Building and deploying a web application

|      | on localhost (my laptop) | on server (the interwebs) |
|------|--------------------------|--------------------------|
| API  |                          |                          |
| app  |   you are here           |                          |

- Flask can create entire web applications.
- We only need to refactor our code a little bit and link it up with some HTML and CSS to create our web application.
- We will use Flask to create an HTML form, accept data submitted to the form, and return a prediction using the submitted data.
- Again, I won't go into too much detail here, but we can open up `web_application/` and take a quick look.

#### Testing the web application

- Let's terminate our API Flask app, navigate to `../web_application`, and run `python app.py` again.
- Now let's go back to `http://127.0.0.1:5000/`.

We can try it again on localhost.

#### Deploying the web application

|      | on localhost (my laptop) | on server (the interwebs) |
|------|--------------------------|--------------------------|
| API  |                          |                          |
| app  |                        |          you are here       |

- I already logged in to Heroku and created the app.
- Now the same commands:

```
heroku login
cd my-project/
git init
heroku git:remote -a my-abalone-web-app
git add .
git commit -am "Initial commit"
git push heroku master
```

- Let's try it out: https://my-abalone-web-app.herokuapp.com/
- You can try it too!

#### Discussion

- There are many ways to deploy a model; a RESTful API is very common and convenient. 
- As you can see, a simple deployment is fairly straightforward.
- However, there may be other considerations such as:
  - Privacy/security
  - Scaling
  - Error handling
  - Real-time / speed
  - Low-resource environments (e.g. edge computing)
  - etc.
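
To make one of these concrete, here is a sketch of basic error handling for the input dictionary. The helper name `validate_input` and its specific checks are assumptions for illustration, not part of the deployed `app.py`:

```python
FEATURES = ["length", "diameter", "height", "whole_weight"]


def validate_input(input_json, features=FEATURES):
    """Return the feature values as a list of floats, or raise ValueError."""
    if not isinstance(input_json, dict):
        raise ValueError("Expected a JSON object mapping feature names to numbers.")

    missing = [k for k in features if k not in input_json]
    if missing:
        raise ValueError(f"Missing features: {missing}")

    row = []
    for k in features:
        value = input_json[k]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"Feature {k!r} must be a number, got {value!r}")
        if value < 0:
            raise ValueError(f"Feature {k!r} must be non-negative, got {value}")
        row.append(float(value))
    return row


good = {"length": 0.41, "diameter": 0.33, "height": 0.10, "whole_weight": 0.36}
print(validate_input(good))

try:
    validate_input({"length": 0.41})
except ValueError as err:
    print("Rejected:", err)
```

In an API, such a `ValueError` would usually be turned into an HTTP 400 response rather than letting the request crash with a server error.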

## Break (10 min)

- We'll take a longer break today.
- Consider taking this time to fill out the instructor/TA evaluations if you haven't already.
- You may have seen [my post about these evaluations](https://www.reddit.com/r/UBC/comments/k18qj7/teaching_evaluations_the_good_the_bad_and_the_ugly/) on r/ubc.

Evaluation link: https://canvas.ubc.ca/courses/53561/external_tools/4732

## Course review / conclusion (25 min)


#### Learning objectives

Here are the course learning outcomes I came up with when proposing this new course:

1. Identify problems that may be addressed with machine learning.
2. Select the appropriate machine learning tool for a problem.
3. Transform data of various types into usable features.
4. Apply standard tools implementing supervised and unsupervised learning techniques.
5. Describe core differences between training, validation, and testing regimes.
6. Effectively communicate the results of a machine learning pipeline.
7. Be realistic about the limitations of individual approaches and machine learning as a whole. 
8. Create reproducible workflows and pipelines.

- How did we do? 
- Hopefully OK, except we skipped the last point (that will likely be its own new course).
- I would also add:

9. Identify and avoid scenarios in which training and testing data are accidentally mixed (the "Golden Rule").
10. Employ good habits for applying ML, such as starting an analysis with a baseline estimator.

because I think they are important enough to make it to the course-level list.

#### What did we cover?

I see the course roughly like this (not in order):

Part 1: Supervised learning on tabular data

- Overfitting, train/validation/test/deployment, cross-validation
- Feature preprocessing, pipelines, imputation, OHE, etc.
- The Golden Rule, various ways to accidentally violate it
- Classification metrics: confusion matrix, precision/recall, ROC, AUC
- Regression metrics: MSE, MAPE
- Regression: transforming the targets
- Feature importances, feature selection
- Hyperparameter optimization

Part 2: Other data types (non-tabular)

- Computer vision with deep learning
- Language data, text preprocessing
- Ratings data
- Time series
- Right-censored data / survival analysis

Also: Other stuff

- Ensembles
- Outlier detection
- Clustering
- A bunch of models: 
  - baselines
  - linear models (ridge, lasso, huber, logistic regression, SGD)
  - tree-based models (random forest, gradient boosted trees)
  - KNN classifier/regressor
  - pre-trained deep learning models
- Communicating your results (including visualizations)
- ML skepticism
- Ethics for ML

#### Some key takeaways

Some useful guidelines:

- Do train-test split right away and only once
- Don't look at the test set until the end
- Don't call `fit` on test/validation data
- Use pipelines
- Use baselines

Recipe to approach a supervised learning problem with tabular data:

1. Have a long conversation with the stakeholder(s) who will be using your pipeline.
2. Have a long conversation with the person(s) who collected the data.
3. Think about the ethical implications - are you sure you want to do this project? If so, should ethics guide your approach?
4. Random train-test split with fixed random seed; do not touch the test data until Step 16.
5. Exploratory data analysis, outlier detection.
6. Choose a scoring metric -> higher values should make you & your stakeholders happier.
7. Fit a baseline model, e.g. `DummyClassifier` or `DummyRegressor`.
8. Create a preprocessing pipeline. May involve feature engineering. (This is usually a time-consuming step!)
9. Try a linear model, e.g. `LogisticRegression` or `Ridge`; tune hyperparameters with CV.
10. Try other sensible model(s), e.g. LightGBM; tune hyperparameters with CV.
11. For each model, look at sub-scores from the folds of cross-validation to get a sense of "error bars" on the scores.
12. Pick a model that you like. Best CV score is a reasonable metric, though you may choose to favour simpler models.
13. Look at feature importances.
14. (optional) Perform some more diagnostics like confusion matrix for classification, or "predicted vs. true" scatterplots for regression.
15. (optional) Try to calibrate the uncertainty/confidence output by your model.
16. Test set evaluation.
17. Question everything again: validity of results, bias/fairness of trained model, etc.
18. Discuss your results with stakeholders.
19. (optional) Retrain on all your data.
20. Deployment & integration.
21. Profit?

PS: the order of steps is approximate, and some steps may need to be repeated during prototyping, experimentation, and as needed over time.
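
A compressed sketch of steps 4, 7, 8, 9, 11, and 16 for a regression problem. The data is synthetic (`make_regression`), and the choice of `StandardScaler` plus `Ridge` with a small `alpha` grid is only an example, not a recommendation for any particular dataset:

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: replaces a real tabular dataset).
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=20.0,
                                 random_state=123)

# Step 4: split once, with a fixed seed; the test set stays untouched until the end.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=123)

# Step 7: baseline. The default scorer for regressors is R^2 (higher is better).
baseline_scores = cross_val_score(DummyRegressor(), X_tr, y_tr, cv=5)

# Steps 8-9: preprocessing + linear model in one pipeline, tuned with CV.
pipe = make_pipeline(StandardScaler(), Ridge())
search = GridSearchCV(pipe, {"ridge__alpha": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_tr, y_tr)

# Step 11: per-fold scores of the best setting give rough "error bars".
ridge_scores = cross_val_score(search.best_estimator_, X_tr, y_tr, cv=5)
print("baseline R^2: %.2f +/- %.2f" % (baseline_scores.mean(), baseline_scores.std()))
print("ridge    R^2: %.2f +/- %.2f" % (ridge_scores.mean(), ridge_scores.std()))

# Step 16: a single test-set evaluation at the very end.
test_score = search.score(X_te, y_te)
print("test R^2: %.2f" % test_score)
```

Because the scaler lives inside the pipeline, it is re-fit on each training fold only, which is how pipelines help respect the Golden Rule.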

#### What would I do differently?

- Find a dataset with multi-class classification for an early part of the course.
- Reorder the material a bit:
  - Move "feature importances for computer vision" into the computer vision lecture (not ethics).
  - Introduce random forests and feature importances a bit earlier.
  - Move the outlier lecture much earlier.
- Allocate 2 lectures to time series data.

I'm sure you have other suggestions - feel free to drop me an email, submit my contact form anonymously at mikegelbart.com, or drop them in the course evaluations.

#### 330 vs. 340

- I am hoping lots of people will take both courses.
- There is some overlap but not a crazy amount (I hope).
- If you want to learn how these methods work under the hood, CPSC 340 will give you a lot of that, such as:
  - Implementing `Ridge.fit()` from scratch
  - Mathematically speaking, what is `C` in `LogisticRegression`?
  - How fast do these algorithms run in terms of the number of rows and columns of your dataset? 
  - Etc.
- There are also a bunch of other methods covered. 

#### Unsolicited advice

- I sometimes end my courses with "unsolicited life advice".
- Today I will just say let's take care of ourselves and each other and try to get through this.
- If you're interested, the advice from CPSC 340 a couple years ago [is on YouTube](https://www.youtube.com/watch?v=_7zYxpzrKmQ&list=PLWmXHcz_53Q02ZLeAxigki1JZFfCO6M-b&index=34&t=0s).

## Conclusion & farewell

That's all, folks. You made it!

<table style="float:left"><tr>
<td><img src="img/mike_hanginthere.png"/></td>
<td><img src="img/mike_believeinyou.png"/></td>
</tr></table>


I believe in you! Thank you for believing in me and entrusting me with this part of your education.