In [1]:
from sklearn.linear_model import Lasso
from sklearn.datasets import load_boston
import pickle
import requests
import pandas as pd
import numpy as np
import seaborn as sns
import subprocess

# Documentation of ML-API

**Author:** Simon Staehli

## Idea

In our third-semester project `Immobilienpreisrechner` we built a machine learning model to predict housing prices as accurately as possible. One task of that project was to implement an API for the model. That API was not well structured, and it could not pre-process incoming data in the same way as the data we had used to build and train the model, because we had no clear data pipeline. In this small project for the course Web Data Collection I wanted to fix this issue and lay the groundwork for a possible roll-out of the model.

## Description of the Service

This service gives me a framework for deploying a machine learning model in the future and rolling it out to a customer or a colleague. It should be simple and easy to adapt to a new model and to new pre-processing steps.
The API provides the relevant operations for customers, such as prediction and scoring, as well as maintainer operations for updating the model parameters and deleting a model.

## CRUD Operations

The API supports the CRUD operations (Create, Read, Update, Delete). Below I describe each endpoint and the CRUD operation it corresponds to. Note that the Create operation is hard to implement in a general way, because the pre-processing of the data and the model methods vary a lot between models.

- `/predict`: Read
- `/create_model`: Create
- `/score`: Read
- `/model_params`: Read
- `/update_model`: Update
- `/delete_model`: Delete
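
As a compact reference, the mapping between endpoints, HTTP methods and CRUD operations can be written down as a small lookup table. The HTTP methods are taken from the requests used later in this notebook; the names `ENDPOINTS` and `endpoints_for` are my own illustration and not part of the API.

```python
# Endpoint -> (HTTP method used in this notebook, CRUD operation).
ENDPOINTS = {
    "/create_model": ("PUT", "Create"),
    "/predict": ("POST", "Read"),
    "/score": ("POST", "Read"),
    "/model_params": ("GET", "Read"),
    "/update_model": ("PUT", "Update"),
    "/delete_model": ("DELETE", "Delete"),
}


def endpoints_for(operation):
    """Return all endpoints classified under the given CRUD operation."""
    return sorted(path for path, (_, crud) in ENDPOINTS.items() if crud == operation)


print(endpoints_for("Read"))
```

Note that `/predict` and `/score` are sent with POST because they carry data in the request body, even though they only read from the model.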

## Choice of API

I chose a RESTful API over GraphQL because each request performs exactly one task and returns exactly the data that was requested, for instance the predictions for a given dataset X. I do not have resources that contain more data than the user asks for, such as a user profile that would be returned in full by a single request. In such a case GraphQL would be more appropriate, because it gives the requester more flexibility and allows more fine-grained requests [(How to GraphQL, 2017)](https://www.howtographql.com/basics/1-graphql-is-the-better-rest/): there is no data overhead, and the client receives only the data it asked for. In my application with the model there is no over-fetching, and therefore a RESTful API is more appropriate. The implemented API only allows clients to read and manipulate certain parameters or data of the model.

## Getting Started

I would like to introduce the API with a small working example: a simple linear regression model (Lasso) trained on the Boston Housing dataset, which is built into scikit-learn [(Pedregosa et al., 2011)](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html).


### Load Sample Data

Now we load the sample data described above into memory [(Pedregosa et al., 2011)](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html). Note that `load_boston` was removed in scikit-learn 1.2, so this cell requires an older version.

In [2]:
X, y = load_boston(return_X_y=True)

### Train a Simple Model on it

After loading the data we can train a model on it. The model we use is a linear regression model from scikit-learn [(Pedregosa et al., 2011)](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html); the code below uses `Lasso`, a linear regression with L1 regularization.
We can later compare the predictions of this locally trained model with the API outputs.

In [3]:
model = Lasso()
model.fit(X, y)
y_pred = model.predict(X)
y_pred[:10]

array([30.99753918, 25.77681736, 29.98601449, 29.51799813, 28.03248999,
       27.51646227, 24.45293184, 19.85336661, 11.03931099, 20.52605621])

In [4]:
print('R2 Score: ', model.score(X, y))

R2 Score:  0.6825842212709925


## URL of the Endpoint

I deployed the API on `PythonAnywhere.com` to get more out of this challenge. The site offers free hosting of web servers with Python pre-installed, which is suitable for small web applications. Storage capacity on the server is strictly limited, as is the usage time for requests. For more information about the service, see https://www.pythonanywhere.com/pricing/
Some functions are restricted, and if the provided resources are exceeded, the API will no longer be reachable.

In [6]:
# The URL of our API 
base_url = 'http://simonstaehli.pythonanywhere.com'
requests.get(base_url + '/check')

<Response [200]>
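
Most endpoints below take the model id as the last path segment, for example `/predict/2`. A small helper can keep the URL construction in one place. This is only a sketch; the function name `endpoint_url` is my own and not part of the API.

```python
BASE_URL = "http://simonstaehli.pythonanywhere.com"


def endpoint_url(endpoint, model_id=None, base_url=BASE_URL):
    """Build the full URL for an endpoint, optionally scoped to a model id."""
    url = f"{base_url.rstrip('/')}/{endpoint.strip('/')}"
    if model_id is not None:
        url = f"{url}/{model_id}"
    return url


print(endpoint_url("predict", 2))
print(endpoint_url("/check"))
```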

### Create new Model Endpoint

Endpoint: `/create_model`

Creates a new model with an appropriate id in the `src` folder and returns that id for further use.
Because it is not possible to upload a Python object to the API, it is currently only possible to adapt an existing model. For example, if only Lasso regression models are stored in the `src` folder, you can only adapt them and create new Lasso regression models.

In [7]:
new_model_params = {'alpha': .5}

In [8]:
# Send request
response = requests.put(url=base_url+'/create_model', json=new_model_params)
print('HTTP-Statuscode:', response.status_code)

HTTP-Statuscode: 200


In [9]:
response.json()

{'model_id': 2}
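
To make the create step more concrete, the following toy registry mimics the idea of adapting an existing model and handing back a new id. It is a purely hypothetical, in-memory stand-in: the class `ModelRegistry`, its methods, and the way ids are assigned are my assumptions for illustration and do not describe how the API stores models in the `src` folder.

```python
from copy import deepcopy


class ModelRegistry:
    """Toy in-memory stand-in for stored models (hypothetical)."""

    def __init__(self):
        self._models = {}

    def create(self, params, template_id=None):
        """Copy an existing parameter set, override it, and return the new id."""
        base = deepcopy(self._models[template_id]) if template_id is not None else {}
        base.update(params)
        new_id = max(self._models, default=0) + 1  # assumed id scheme
        self._models[new_id] = base
        return new_id

    def get(self, model_id):
        return self._models[model_id]

    def delete(self, model_id):
        del self._models[model_id]


registry = ModelRegistry()
first_id = registry.create({"alpha": 1.0, "max_iter": 1000})
second_id = registry.create({"alpha": 0.5}, template_id=first_id)
print(second_id, registry.get(second_id))
```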

### Prediction Endpoint

Endpoint: `/predict`

The next endpoint is the prediction endpoint. It accepts data sent to the API with the POST method.
We first convert our NumPy array to a DataFrame, which makes it easier to convert the data to JSON, the format the API expects.

In [10]:
# Create dataframe to convert it to json
data = pd.DataFrame(X)
data.head(2)

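|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |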


In [11]:
# Convert to JSON
data = data.to_json(orient='records')
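
To see what the API receives, it helps to look at the shape that `orient='records'` produces: a JSON list with one object per row, keyed by column label. The sketch below rebuilds that shape with the standard library only, using the first three columns of the two rows shown above; the helper name `to_records_json` is illustrative.

```python
import json


def to_records_json(rows):
    """Serialize rows as a JSON list of objects keyed by column position."""
    records = [{str(i): value for i, value in enumerate(row)} for row in rows]
    return json.dumps(records, separators=(",", ":"))


rows = [[0.00632, 18.0, 2.31], [0.02731, 0.0, 7.07]]
payload = to_records_json(rows)
print(payload)
```

Each row becomes one object, so the receiving side can rebuild a table from the list without knowing the number of columns in advance.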

In [12]:
# Send request
response = requests.post(url=base_url+'/predict/2', json=data)
print('HTTP-Statuscode:', response.status_code)

HTTP-Statuscode: 200


In [13]:
print('Predictions : \n', response.json()[:5])

Predictions : 
 [30.997539182131007, 25.776817360601793, 29.986014494727087, 29.517998131638052, 28.032489994804962]
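
Since the local model and the API should agree, the comparison can be automated. A minimal sketch using the first five values printed in this notebook; `predictions_match` is a hypothetical helper and the tolerance is an arbitrary choice.

```python
import math


def predictions_match(local, remote, abs_tol=1e-6):
    """Return True if both sequences have equal length and agree element-wise."""
    return len(local) == len(remote) and all(
        math.isclose(a, b, abs_tol=abs_tol) for a, b in zip(local, remote)
    )


# First five local predictions (rounded by NumPy's display) and API predictions.
local_pred = [30.99753918, 25.77681736, 29.98601449, 29.51799813, 28.03248999]
api_pred = [30.997539182131007, 25.776817360601793, 29.986014494727087,
            29.517998131638052, 28.032489994804962]

print(predictions_match(local_pred, api_pred))
```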


### Score Endpoint

Endpoint: `/score`

Next we test the scoring endpoint of the API. It returns the score the model achieves on the uploaded data. We therefore have to send both X and y encoded as JSON.

In [14]:
y_ = pd.DataFrame(y)
y_ = y_.to_json(orient='records')

In [15]:
data_ = dict(X=data, y=y_)

In [16]:
response = requests.post(url=base_url+'/score/2', json=data_)
print('HTTP-Statuscode:', response.status_code)

HTTP-Statuscode: 200


In [17]:
print('R2-Score: ',response.json())

R2-Score:  0.6825842212709925
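
For regression models in scikit-learn, `score` returns the coefficient of determination, R² = 1 - SS_res / SS_tot. The following sketch computes it in plain Python on made-up numbers to show what the returned value measures; it is not the code running behind the API.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up targets and predictions, only for illustration.
example_r2 = r2_score([3, -0.5, 2, 7], [2.5, 0.0, 2, 8])
print(round(example_r2, 4))
```

A value of 1 means perfect predictions, while a model that always predicts the mean of `y_true` scores 0.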


### Model Coefficients Endpoint

Endpoint: `/model_params`

This endpoint is reached with a GET request and returns the model parameters. Which parameters are returned depends on the model used, so this endpoint has to be adapted when the model type changes.

In [18]:
response = requests.get(url=base_url + '/model_params/2')
print('HTTP-Statuscode:', response.status_code)

HTTP-Statuscode: 200


In [19]:
model_coef = response.json()
model_coef

{'alpha': 0.5,
 'coef_': [-0.06343729004514066,
  0.04916466550764739,
  -0.0,
  0.0,
  -0.0,
  0.9498106999845143,
  0.020909514944737546,
  -0.6687900023707882,
  0.26420643097453383,
  -0.01521158979163473,
  -0.7229663585199505,
  0.00824703348549421,
  -0.7611145367697878],
 'copy_X': True,
 'dual_gap_': 4.238246337692544,
 'fit_intercept': True,
 'intercept_': 41.05693374499337,
 'l1_ratio': 1.0,
 'max_iter': 1000,
 'n_features_in_': 13,
 'n_iter_': 42,
 'normalize': False,
 'positive': False,
 'precompute': False,
 'random_state': None,
 'selection': 'cyclic',
 'tol': 0.0001,
 'warm_start': False}
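
For a linear model such as Lasso, the returned `coef_` and `intercept_` are enough to reproduce a prediction by hand: the prediction is the intercept plus the dot product of the coefficients and the feature row. The sketch below does this for the first row of the data shown earlier, with the values copied from the response above; the variable names are my own.

```python
# Values copied from the /model_params response above.
coef = [-0.06343729004514066, 0.04916466550764739, -0.0, 0.0, -0.0,
        0.9498106999845143, 0.020909514944737546, -0.6687900023707882,
        0.26420643097453383, -0.01521158979163473, -0.7229663585199505,
        0.00824703348549421, -0.7611145367697878]
intercept = 41.05693374499337

# First row of the feature data, as displayed by data.head(2).
first_row = [0.00632, 18.0, 2.31, 0.0, 0.538, 6.575, 65.2, 4.09,
             1.0, 296.0, 15.3, 396.9, 4.98]

# Linear prediction: intercept + sum of coefficient * feature value.
manual_prediction = intercept + sum(c * x for c, x in zip(coef, first_row))

# Lasso's L1 penalty sets some coefficients exactly to zero.
zeroed_features = [i for i, c in enumerate(coef) if c == 0]

print(manual_prediction)
print(zeroed_features)
```

The result should agree with the first value returned by the prediction endpoint, and the zeroed entries show which features this model ignores.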

### Update Model Endpoint

Endpoint: `/update_model`

This endpoint is reached with a PUT request, which gives the model new parameters. Like the endpoint above, it has to be adapted to the model type, because the parameters depend strongly on the model. Updating the parameters makes sense when the model has to be retrained on new data, for instance time series data.

In [20]:
# Update model parameters
response = requests.put(url=base_url+'/update_model/2', json={'params': np.random.uniform(size=len(model.coef_)).tolist()})
print('HTTP-Statuscode:', response.status_code)

HTTP-Statuscode: 200


In [21]:
# Print response
response.json()

'Parameters updated successfully.'
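
The update request above sends one value per coefficient. Before sending such a request, it can be worth checking on the client side that the list has the expected length, since `n_features_in_` is part of the `/model_params` response. A minimal sketch; `build_update_payload` is a hypothetical helper, and whether the API performs a similar check on its side is not documented here.

```python
def build_update_payload(new_coefs, n_features):
    """Validate the coefficient list and wrap it in the request body format."""
    if len(new_coefs) != n_features:
        raise ValueError(
            f"expected {n_features} coefficients, got {len(new_coefs)}"
        )
    # The update request in this notebook uses the key 'params'.
    return {"params": [float(c) for c in new_coefs]}


update_payload = build_update_payload([0.1] * 13, n_features=13)
print(len(update_payload["params"]))
```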

In [22]:
# Check if model parameters have changed by requesting the model parameters
response = requests.get(url=base_url + '/model_params/2')

In [23]:
model_coef = response.json()

print('New Model Parameters: \n', model_coef['coef_'])
print('Old Model Parameters: \n', model.coef_)

New Model Parameters: 
 [0.11923388695687542, 0.5329274985274414, 0.5246065724220114, 0.7421874219936702, 0.27830659792190404, 0.513148923257028, 0.29685918122830035, 0.31087894249996495, 0.08379887223575799, 0.6748163080443688, 0.414134469268018, 0.14815079700638278, 0.04610357620974992]
Old Model Parameters: 
 [-0.06343729  0.04916467 -0.          0.         -0.          0.9498107
  0.02090951 -0.66879     0.26420643 -0.01521159 -0.72296636  0.00824703
 -0.76111454]


Compared with the old parameters, we can see that the parameters have now changed.

### Delete Model Endpoint

Endpoint: `/delete_model`

Because we overwrote the model parameters with random values, the model is no longer usable, so we delete it. For this purpose I created a delete endpoint.

In [24]:
response = requests.delete(url=base_url+ '/delete_model/2')
print('HTTP-Statuscode:', response.status_code)

HTTP-Statuscode: 200


In [25]:
print(response.json())

Model Deleted Successfully.


## Sources

Pedregosa, F., G. Varoquaux, and A. Gramfort. "Scikit-learn: Machine Learning in Python". Sklearn, 2011. https://scikit-learn.org/stable/index.html.

Prisma. HowToGraphQL (Fundamentals) - GraphQL is the better REST (2/4), 2017. https://www.youtube.com/watch?v=T571423fC68.

Staehli, Simon, and Firat Saritas. "Immobilienpreisrechner". GitLab, January 10, 2021. https://gitlab.fhnw.ch/simon.staehli/immobilienpreisrechner.

