## Lighthouse Labs
### Deployment of ML Models

Instructor: Alibek Kruglikov

### How is Data Science related to the Web?

Web pages are intended for humans. However, there’s a lot of valuable data embedded in web pages:
* course listings
* bank records
* blogs

### What if we wanted to collect this data for analysis?

We would need a program that acts like a web browser but collects web document data rather than displaying it.

This is called `web scraping`. A popular tool for it is Scrapy, a free and open-source web-crawling framework written in Python.

A Web Scraper...
* acts like a web browser (i.e., sends HTTP GET requests to a web server)
* at the same time, it allows you to process the data that comes back.

Some other libraries that are useful when scraping, if you are interested:

Beautiful Soup
* Python library that can parse HTML (super useful)
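
As a minimal sketch of what parsing HTML with Beautiful Soup looks like, the example below works on a small hand-written HTML string, so no network access is needed. The HTML content, the `course` class name, and the helper `extract_courses` are illustrative assumptions, not taken from a real site.

```python
from bs4 import BeautifulSoup

# A tiny hand-written HTML document (illustrative only).
HTML = """
<html><body>
  <h1>Course listings</h1>
  <ul>
    <li class="course"><a href="/ds101">Intro to Data Science</a></li>
    <li class="course"><a href="/ml201">Machine Learning</a></li>
  </ul>
</body></html>
"""


def extract_courses(html):
    """Return (title, link) pairs for every <li class="course"> element."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (item.a.get_text(strip=True), item.a["href"])
        for item in soup.find_all("li", class_="course")
    ]


for title, link in extract_courses(HTML):
    print(title, link, sep=" | ")
```

The same `find_all` pattern is what the Wikipedia example further down relies on, only applied to a much larger page.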

### Disadvantages of Web Scraping

- Scraping processes are hard to understand.

- Extracted data needs extensive cleaning (this is where `Beautiful Soup` helps, by parsing the raw HTML).

- In certain cases, scraping can take a long time and a lot of energy to complete, for example when many pages have to be fetched one request at a time.

- New data extraction applications take a lot of time to set up in the beginning.

- Web scraping services are slower than API calls.

- If the developer of a website decides to introduce changes in the code, the scraping service might stop working.

In [None]:
import requests
from bs4 import BeautifulSoup

In [None]:
URL = "https://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_government"

In [None]:
res = requests.get(URL).text
print(res)

In [None]:
# name the parser explicitly to avoid a warning and get consistent results
soup = BeautifulSoup(res, 'html.parser')

In [None]:
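for table_row in soup.find('table', class_='wikitable').find_all('tr')[1:]:
    data = table_row.find_all(['th', 'td'])
    try:
        country = data[0].a.text
        title = data[1].a.text
    except (IndexError, AttributeError):
        # skip rows that lack the expected cells or links
        continue

    # the name normally follows the title link; fall back to an empty string
    sibling = data[1].a.find_next_sibling()
    name = sibling.text if sibling is not None else ''

    print(country, title, name, sep=' | ')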

### What is an API?

**A**pplication  
**P**rogramming  
**I**nterface  
  
**RE**presentational  
**S**tate  
**T**ransfer

**J**ava**S**cript **O**bject **N**otation (JSON)


Textual format for structured data  
* `[a, b, c]` for arrays  
* `{"x": m, "y": n, "z": o}` for objects

JSON
* textual description of Python (originally JavaScript) objects
* arrays map to Python lists, objects map to dictionaries

```
{
"library": [
           {"title": "For Whom the Bell Tolls", "author": "Ernest Hemingway"},
           {"title": "Trump: The Art of the Deal", "author": "Good Question"}
           ]
}
```
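
To make the mapping between JSON text and Python objects concrete, here is a small sketch using the standard-library `json` module. The variable names are illustrative; the text is a shortened version of the library example.

```python
import json

# JSON text: keys and strings must use double quotes.
text = '{"library": [{"title": "For Whom the Bell Tolls", "author": "Ernest Hemingway"}]}'

# Parse the text into Python objects (dict and list).
data = json.loads(text)
first_book = data["library"][0]
print(type(data).__name__, type(data["library"]).__name__)
print(first_book["title"], "by", first_book["author"])

# Serialize the Python objects back into JSON text.
round_trip = json.dumps(data, indent=2)
print(round_trip)
```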

### The Anatomy Of A Request

It’s important to know that a request is made up of four things:

1. The endpoint (the URL)

2. The method (verb: GET, PUT, POST, etc.)

3. The headers (metadata about the request, such as content type or authentication)

4. The data (or body)

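1. The endpoint (or route) is the URL you send the request to. It follows this structure:

root-endpoint/?

The root endpoint is the starting point of the API you are requesting from, and `?` stands for the path of the resource. For example, the root endpoint of GitHub's API is:

https://api.github.com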

2. The method is the type of request you send to the server. You can choose from the types below:

a. GET - get a resource from the server

b. POST - create a new resource on the server

c. PUT/PATCH - update a resource on the server

d. DELETE - delete a resource on the server
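
The four parts of a request can be inspected without sending anything over the network. The sketch below uses `requests.Request(...).prepare()` to build a request and print its endpoint, method, headers, and body. The URL path, header value, and payload are illustrative assumptions, and the request is never actually sent.

```python
import json

import requests


def build_request():
    """Build (but do not send) a POST request and return the prepared object."""
    request = requests.Request(
        method="POST",                                      # 2. the method
        url="https://api.github.com/user/repos",            # 1. the endpoint (illustrative path)
        headers={"Accept": "application/vnd.github+json"},  # 3. the headers
        json={"name": "demo-repo"},                         # 4. the data (body), hypothetical payload
    )
    return request.prepare()


prepared = build_request()
print(prepared.method)
print(prepared.url)
print(dict(prepared.headers))
print(json.loads(prepared.body))
```

Passing `json=` makes `requests` serialize the body and set the `Content-Type: application/json` header automatically.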

## Flask

Flask is a micro web framework written in Python. It can create a REST API that allows you to send data and receive a prediction as a response.

Now that you are going to be a Data Scientist, you cannot always rely on having your models in a Jupyter Notebook.

Jupyter Notebooks are awesome for EDA. However, when you need an application that has a predictive model, you will need to deploy your model elsewhere.

You can try to get the best model possible in a notebook or a script. Once you have decided that you have the best model, you must hand it over in a way that lets the client run it easily in their infrastructure.

For this purpose you need a tool that fits into their infrastructure, preferably in a language that you’re familiar with. This is where Flask comes in.

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostRegressor
from sklearn.pipeline import Pipeline

In [None]:
# Load and split data
data = fetch_california_housing()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=93)

In [None]:
feature_names = data.feature_names
feature_names

In [None]:
# Create and fit a pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=4)),
    ('model', AdaBoostRegressor(random_state=93))
])

pipeline.fit(X_train, y_train)

In [None]:
# Evaluate model (not going to be great, obviously)
print(f'R-squared (train): {pipeline.score(X_train, y_train)}')
print(f'R-squared (test): {pipeline.score(X_test, y_test)}')

In [None]:
pd.DataFrame(X, columns=feature_names).describe()

In [None]:
houses_to_predict = pd.DataFrame(
    {
        'MedInc': [1.5, 3.87, 5.5, 8.0, 15.0],
        'HouseAge': [1.0, 18.0, 29.0, 40.0, 52.0],
        'AveRooms': [2.0, 4.5, 5.4, 7.0, 12.0],
        'AveBedrms': [0.5, 1.0, 1.1, 1.5, 2.5],
        'Population': [100, 800, 1425, 2500, 10000],
        'AveOccup': [1.0, 2.5, 3.0, 4.0, 10.0],
        'Latitude': [32.6, 34.0, 35.6, 37.5, 41.9],
        'Longitude': [-124.3, -121.5, -119.6, -118.0, -114.5]
    })

In [None]:
houses_to_predict

In [None]:
# refit on the full dataset before predicting and saving the final model
pipeline.fit(X, y)

In [None]:
pipeline.predict(houses_to_predict.values)

In [None]:
# saving the model
import pickle

# the context manager closes the file even if pickling fails
with open('pipeline.pkl', 'wb') as f:
    pickle.dump(pipeline, f)
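
The notebook does not show the server side, so here is a minimal sketch of a Flask app that could serve the saved pipeline. It assumes the model was pickled to `pipeline.pkl`, that clients send the output of `DataFrame.to_dict()` as JSON, and that the response uses a `prediction` key. The names `create_app`, `load_model`, and `FEATURE_NAMES` are hypothetical helpers introduced for illustration.

```python
import pickle

import pandas as pd
from flask import Flask, jsonify, request

# Column order the pipeline was trained on (California housing features).
FEATURE_NAMES = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
                 "Population", "AveOccup", "Latitude", "Longitude"]


def load_model(path="pipeline.pkl"):
    """Load the pickled pipeline (assumed file name)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def create_app(model):
    """Build a Flask app exposing POST /predict for the given model."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json(silent=True)
        if not payload:
            return jsonify({"error": "expected a JSON body"}), 400
        try:
            # Rebuild the DataFrame and enforce the training column order.
            features = pd.DataFrame(payload)[FEATURE_NAMES]
        except (KeyError, ValueError) as exc:
            return jsonify({"error": str(exc)}), 400
        predictions = model.predict(features.values)
        return jsonify({"prediction": [float(p) for p in predictions]})

    return app
```

One way to run it, assuming the code is saved as a file such as `app.py`: add `app = create_app(load_model())` at the bottom and call `app.run(port=5000)`, which matches the `http://127.0.0.1:5000/predict` address used by the client request.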

In [None]:
import requests

json_data = houses_to_predict.to_dict()

# POST the houses to the /predict endpoint of a locally running Flask server
r = requests.post(url='http://127.0.0.1:5000/predict', json=json_data)
r.json()['prediction']

### Advantages of Flask
- Easy to understand development: Beginner friendly.
- It is very flexible and easy: Comes with a template engine too!
- Testing: Unit testing is possible.

### Disadvantages of Flask
- Because it is so easy to get started, it is also easy to write low-quality code and end up with a "bad web application".
- Scalability: The built-in development server is not meant for production. Handling many concurrent requests requires a production WSGI server in front of the app.
- Modules: Relying on many third-party extensions can introduce security risks and extra cost.
- Community support is more limited than for frameworks such as Django (Streamlit is another alternative).

## Streamlit

[Streamlit](https://streamlit.io/) is an open-source Python library that makes it incredibly easy to create custom web apps.

It allows data scientists to turn scripts into shareable web applications in just a few lines of code, without needing extensive web development knowledge.

Think of it as a way to quickly prototype and deploy interactive dashboards or model inference interfaces directly from your Python scripts.

### How Streamlit Works
Streamlit works by rerunning your script from top to bottom every time a user interacts with the app (e.g., changes a slider, clicks a button). Caching decorators let you store data and expensive computations between reruns, which keeps the app responsive.

### Advantages of Streamlit
- Speed of Development: Build interactive data apps with minimal code, often in hours instead of days.
- Simplicity: Designed specifically for data scientists, it uses familiar Python syntax and concepts.
- Interactivity: Easily add widgets like sliders, text inputs, checkboxes, and buttons to create dynamic applications.
- Focus on Data: Allows you to focus on the data and model logic rather than intricate web development details.
- Deployment: Streamlit apps can be easily deployed to various platforms, including Streamlit Cloud.

### Disadvantages of Streamlit
- Limited Customization: While great for quick apps, it offers less control over the UI/UX compared to full-fledged web frameworks like Flask or Django.
- Not for Complex Web Apps: Not designed for building multi-page, highly interactive, or complex web applications with intricate routing and user management.
- Performance for Large Apps: For very large or computationally intensive applications, the "rerun on every interaction" model might become a bottleneck if not optimized carefully.
- Community and Ecosystem: While growing rapidly, its ecosystem and community support are still smaller than more established web frameworks.

More advanced deployments can be done in the cloud (AWS, Azure, GCP).

Cloud computing benefits:
- Cost reduction
- Quick Deployment
- Flexibility
- Scalability
- Security
- Backups