# Demo

This notebook has two main parts:

- Part 1 - API for an SKL (scikit-learn) pipeline
- Part 2 - Make a Docker image

# Part 1 - API for an SKL pipeline

In this section we shall:
- Import libraries and create the model
- Export pipeline & Test load
- Test prediction for API verification
- Building the flask app file
- Test hitting the running API


## Import libraries and create the model

In this demo, rather than just putting a model behind an API, we shall be putting a whole pipeline behind one. In other words, we shall send raw data to the API, and any pre-processing and feature engineering will be done there as well.

In [1]:
import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder,StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression

import pickle

import requests
import json

The data we shall be using is a sample dataset downloaded from the internet. It isn't meant to be used for anything other than demonstration purposes.

We shall be building a simple linear regression, where the feature we are trying to predict is "charges"

In [2]:
df = pd.read_csv("../data/datasets_13720_18513_insurance.csv")    #read in the csv to a pandas dataframe

In [3]:
df.head()  #a quick look at the head of the data

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [4]:
df.dtypes    #check the data types of the columns read

age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object

In [5]:
df.describe(include='all')     #quick descriptive stats of the data

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
count,1338.0,1338,1338.0,1338.0,1338,1338,1338.0
unique,,2,,,2,4,
top,,male,,,no,southeast,
freq,,676,,,1064,364,
mean,39.207025,,30.663397,1.094918,,,13270.422265
std,14.04996,,6.098187,1.205493,,,12110.011237
min,18.0,,15.96,0.0,,,1121.8739
25%,27.0,,26.29625,0.0,,,4740.28715
50%,39.0,,30.4,1.0,,,9382.033
75%,51.0,,34.69375,2.0,,,16639.912515


The first thing we shall do is sort the columns into numerical and categorical features, as we will apply different pre-processing steps to each group.

- numerical - We shall just apply a standard scaler.
- categorical - We shall one-hot-encode the different groups, dropping one of the two columns for binary features (drop='if_binary').

Obviously, in a real-world situation we might have to be more careful about which transformations we apply, potentially using different ones for each column. But for the purposes of this demo/exercise it will suffice. We care more about the process than the quality of the model.

In [10]:
numeric_features=[]                         #init an empty list to hold the numerical features
catigorical_features=[]                     #init an empty list to hold the categorical features
for col in df.columns[:-1]:                 #for each of our input features
    if df[col].dtype == 'O':                #if its an 'object' datatype
        catigorical_features.append(col)    #add it to our categorical features list
    else:                                   #if its not an object
        numeric_features.append(col)        #add it to numerical features

In [11]:
numeric_features

['age', 'bmi', 'children']

In [12]:
catigorical_features

['sex', 'smoker', 'region']
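
The same split can be written without an explicit loop. The sketch below is one possible alternative, not the notebook's own method: it uses pandas' `select_dtypes` on a small hand-made frame whose two rows are copied from `df.head()` above. The names `sample`, `features`, `numeric` and `categorical` are illustrative only.

```python
import pandas as pd

# Two rows copied from the head of the insurance data shown earlier.
sample = pd.DataFrame({
    "age": [19, 18],
    "sex": ["female", "male"],
    "bmi": [27.9, 33.77],
    "children": [0, 1],
    "smoker": ["yes", "no"],
    "region": ["southwest", "southeast"],
    "charges": [16884.924, 1725.5523],
})

features = sample.iloc[:, :-1]  # every column except the target

# Numeric columns get scaled; everything else is treated as categorical.
numeric = list(features.select_dtypes(include="number").columns)
categorical = list(features.select_dtypes(exclude="number").columns)

print(numeric)
print(categorical)
```

One difference from the loop: the loop treats anything that is not an 'object' column as numeric, whereas this version treats anything that is not a number as categorical. For this dataset the two give the same lists.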

In [13]:
numerical_transformer = Pipeline(steps=[('scaler', StandardScaler())])                   #set the numerical transformer
catigorical_transformer = Pipeline(steps=[('onehot', OneHotEncoder(drop='if_binary'))])  #set the categorical transformer

In [14]:
#create the column transformer
colum_transformer = ColumnTransformer(transformers=[
        ('num', numerical_transformer, numeric_features),
        ('cat', catigorical_transformer, catigorical_features)])

The above column transformer will then apply the transformations to the respective columns.

Below, we combine this in a pipeline with a linear regression model. Any data that is received will have the column transformations applied first, and the resulting transformed data will then be passed to the model for whatever method was called (fit or predict).

In [15]:
#create the final pipline
reg = Pipeline(steps=[('columnTransform', colum_transformer),
                      ('regression', LinearRegression())])

Finally, we fit the model to our data, remembering to separate out the last column as that's our target feature.

In [16]:
reg.fit(df.iloc[:,:-1],df.iloc[:,-1])            #fit the model to the data

Pipeline(steps=[('columnTransform',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  ['age', 'bmi', 'children']),
                                                 ('cat',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(drop='if_binary'))]),
                                                  ['sex', 'smoker',
                                                   'region'])])),
                ('regression', LinearRegression())])

In [17]:
reg.named_steps['regression'].coef_             #show the coefficients to check it has been trained

array([ 3607.47273619,  2067.69196584,   572.99820995,  -131.3143594 ,
       23848.53454191,   587.00923503,   234.0453356 ,  -448.01281436,
        -373.04175627])
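
There are nine coefficients because the three scaled numeric columns are followed by the one-hot columns. With `drop='if_binary'`, a two-category feature such as `sex` or `smoker` collapses to a single column, while `region` keeps all four. A minimal sketch on made-up data (the array `toy` is purely illustrative) shows the column count for one binary and one four-category feature:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data: column 0 has two categories, column 1 has four.
toy = np.array(
    [
        ["male", "southwest"],
        ["female", "southeast"],
        ["male", "northwest"],
        ["female", "northeast"],
    ],
    dtype=object,
)

encoder = OneHotEncoder(drop="if_binary")
encoded = encoder.fit_transform(toy).toarray()  # densify the sparse result

# The binary column contributes 1 output column, the other contributes 4.
n_columns = encoded.shape[1]
print(n_columns)
```

Applied to the insurance data this gives 3 (numeric) + 1 (sex) + 1 (smoker) + 4 (region) = 9 inputs to the regression.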

In [18]:
dataCols = list(df.columns[:-1])     #save the name of the input features to a var for later

## Export pipeline & Test load

To create an API and a Docker image that use a model, the model has to be saved in a file format that can be read back correctly. The simplest way is to save it as a pickle file.

Note that this notebook is in a folder called notebooks; we are saving the files in a folder called data, which is on the same level as notebooks.

In [19]:
#this is saving
with open('../data/pipline.pickle', 'wb') as f:        #the context manager closes the file once it is written
    pickle.dump(reg, f)
with open('../data/columnNames.pickle', 'wb') as f:
    pickle.dump(dataCols, f)

In [20]:
#this is loading
with open('../data/pipline.pickle', 'rb') as f:
    reg2 = pickle.load(f)
with open('../data/columnNames.pickle', 'rb') as f:
    dataCols = pickle.load(f)


To check that nothing was corrupted in the round trip, we loaded the model into a different variable. Its coefficients should be the same as the ones displayed above.

In [21]:
reg2.named_steps['regression'].coef_

array([ 3607.47273619,  2067.69196584,   572.99820995,  -131.3143594 ,
       23848.53454191,   587.00923503,   234.0453356 ,  -448.01281436,
        -373.04175627])
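
Comparing the printed arrays by eye works, but the check can also be made programmatic. The sketch below is a self-contained illustration on made-up data rather than the insurance pipeline: it pickles a tiny `LinearRegression` to bytes and back (equivalent to the dump/load to file above) and compares both coefficients and predictions.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data, made up for illustration: y = 2*x0 + 3*x1 + 1.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = 2 * X[:, 0] + 3 * X[:, 1] + 1

model = LinearRegression().fit(X, y)

# Serialise to bytes and back, which mirrors dumping to and loading from a file.
restored = pickle.loads(pickle.dumps(model))

same_coefficients = np.array_equal(model.coef_, restored.coef_)
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(same_coefficients, same_predictions)
```

In the notebook the equivalent check would compare `reg` and `reg2` in the same way.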

## Test prediction for API verification

Later, once the API is running, we shall make calls to it and get a prediction back. To verify that the answer we get back is correct, we create the test example here and get the prediction for it now.

In [22]:
df.head()     #view the data to remind us of what its meant to look like

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [23]:
#create the test example
test = pd.DataFrame([[28,'male',30,0,'no','northwest']],columns=dataCols)   #a plain list keeps the numbers numeric (np.array would turn every value into a string)
test

Unnamed: 0,age,sex,bmi,children,smoker,region
0,28,male,30,0,no,northwest


In [24]:
list(reg2.predict(test)) #give the test example to the model to predict on

[4944.96464438397]

So now, whenever we hit our API with the test example, we expect it to return the above value.

## Building the flask app file

Now we need to build the .py file that will handle the API calls.

Two things to note:
- In a Jupyter notebook, to write the contents of a cell to a .py file, put '%%writefile filename.py' at the top of the cell.
- When the .py file is run it is launched from the level that contains the notebooks and data folders, so any imports or file paths need to be written relative to that level.

In [55]:
%%writefile ../app.py                            
from flask import Flask, request, jsonify        #imports the core functionality for running the API
import traceback                                 #formats any errors raised inside the route function (OPTIONAL)
import pickle as p                               #needed for loading files, i.e. the pipeline we wish to serve
import pandas as pd                              #needed for passing the data to our pipeline
import json                                      #standard JSON library (Flask parses the request body itself, so this import is optional)

app = Flask(__name__)                            #This initialises the app. Do not change the __name__

@app.route('/predict', methods=['POST'])         #This sets the route that triggers the following function and the HTTP method it accepts
def predict():
    if reg:                                                     #if the pipeline exists
        try:                                                    #try to
            json_ = request.json                                #extract the JSON data from the request
            query = pd.DataFrame(json_)                         #load it into a pandas DataFrame
            prediction = reg.predict(query)                     #give the data to our pipeline to predict on
            return jsonify({'prediction': list(prediction)})    #return the resulting prediction in JSON format
        except Exception:                                       #if it failed to predict
            return jsonify({'trace': traceback.format_exc()})   #return the traceback in JSON format
    else:                                                       #if the pipeline does not exist
        print('Train the model first')
        return 'No model here to use'                           #return the information


if __name__ == '__main__':                           #this block only runs when the file is launched directly, e.g. "python app.py"
    reg = p.load(open('data/pipline.pickle', 'rb'))  #load in our pipeline, otherwise there is nothing to predict with
    print('model loaded')
    app.run(host='0.0.0.0',port=5000)                #run the app, setting the host and the port number

Overwriting ../app.py


You can add more routes and fancier functions, such as pulling in data from an external database, as long as they are all written within the same cell that gets written to the .py file. In Flask each URL rule maps to exactly one function, although a single function can serve several routes by stacking @app.route decorators.
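
As a rough illustration of adding routes, the sketch below defines a second function with two stacked routes and exercises everything with Flask's built-in test client, so no server has to be running. `StubModel`, the `/health` and `/ping` routes and the variable names are hypothetical and not part of app.py; the stub simply stands in for the pickled pipeline.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


class StubModel:
    """Hypothetical stand-in for the pickled pipeline; always predicts 0.0."""

    def predict(self, rows):
        return [0.0 for _ in rows]


model = StubModel()


@app.route("/predict", methods=["POST"])
def predict():
    rows = request.get_json()  # a list of records, one per row
    return jsonify({"prediction": model.predict(rows)})


# Two URL rules stacked on a single view function.
@app.route("/health", methods=["GET"])
@app.route("/ping", methods=["GET"])
def health():
    return jsonify({"status": "ok"})


# The test client sends requests to the app in-process; no server is needed.
client = app.test_client()
prediction_response = client.post("/predict", json=[{"age": 28, "sex": "male"}])
health_response = client.get("/health")
ping_response = client.get("/ping")

print(prediction_response.get_json())
print(health_response.get_json(), ping_response.get_json())
```

The same test-client approach could be pointed at the real app to check the `/predict` route before starting the service.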

Make sure you note the port that you set, as this will be needed when you create the Dockerfile.

In the CLI, navigate to where the above app.py file is and then start the service by executing "python app.py".

## Test hitting the running API

Now that we have the service running we can test it out! When the service starts it will print the URL that it is running on, which should match the host and port we entered. Now let's make a request to it with our test example; hopefully it gives us the prediction we are expecting.
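
The request body sent in the next cell is a JSON list with one object per row, keyed by column name. That is the shape `pd.DataFrame(json_)` in app.py turns back into a one-row frame. A small sketch of how the payload is put together, using only the standard library (the column names are those held in `dataCols`; `data_cols`, `record`, `payload` and `decoded` are illustrative names):

```python
import json

data_cols = ["age", "sex", "bmi", "children", "smoker", "region"]
values = [28, "male", 30.0, 0, "no", "northwest"]

# Pair each column name with its value to form one record (one row).
record = dict(zip(data_cols, values))

# The API expects a list of records, even when there is only one row.
payload = json.dumps([record])
print(payload)

# Decoding shows what the Flask app receives from request.json.
decoded = json.loads(payload)
```

To score several rows in one call, more records would be appended to the list before it is serialised.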

In [66]:
url = 'http://0.0.0.0:5000/predict'
j_data = json.dumps([dict(zip(dataCols,[28,'male',30.0,0,'no','northwest']))])
headers = {'content-type': 'application/json', 'Accept-Charset': 'UTF-8'}

r = requests.post(url, data=j_data, headers=headers)
print(r, r.text)

<Response [200]> {"prediction":[4944.96464438397]}



BOOM! There we are: we are running our app locally and we have successfully made an API call to it! Make sure to go back to the CLI and stop the running file using ctrl-c.

Now to package it all up so we can share it with anyone and/or deploy it on the internet somewhere.

# Part 2 - Make a docker image

There are 5 main steps to making and sharing the Docker image:
 - Making the main .py file that does all the functionality - which we have already made
 - Making the requirements.txt file
 - Writing the Dockerfile
 - Building the Docker image from the Dockerfile
 - Pushing to Docker Hub

## Make the requirements.txt file

The requirements.txt file is a simple list of all the libraries that you would need to pip install to be able to run the main .py file AND anything that it imports. In other words, because we are using a scikit-learn model/pipeline, scikit-learn needs to be included too.

You can either use a text editor to write the file or run the cell below to make it.

In [59]:
%%writefile ../requirements.txt
flask
pandas
scikit-learn

Overwriting ../requirements.txt


The contents of the requirements.txt file can be checked by running the cell below. This is wise to do in case you have included any libraries that come with Python as standard. If you have, the install will report an error for that entry.

In [61]:
pip install -r ../requirements.txt

Note: you may need to restart the kernel to use updated packages.
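
A quick way to spot standard-library names before running the install is sketched below. It assumes Python 3.10 or later, where `sys.stdlib_module_names` is available; on older versions the helper falls back to flagging nothing. The function name `stdlib_entries` and the example list are hypothetical.

```python
import sys


def stdlib_entries(requirement_names):
    """Return the names that are standard-library modules (Python 3.10+)."""
    # sys.stdlib_module_names only exists from Python 3.10 onwards.
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    return [name for name in requirement_names if name in stdlib]


# "json" and "pickle" ship with Python, so they do not belong in requirements.txt.
flagged = stdlib_entries(["flask", "pandas", "json", "pickle"])
print(flagged)
```

Anything the helper returns can simply be deleted from requirements.txt.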


## Writing a Docker image (Dockerfile)

In essence this is quite straightforward to do; it's just how you structure the files within the container that you need to be wary of.

The cell below writes a Dockerfile for us; everything after the first line is written to the file.

In [65]:
%%writefile ../Dockerfile
FROM python:3.8.3
WORKDIR code/
COPY requirements.txt .
RUN pip3 -q install pip --upgrade
RUN pip install -r requirements.txt
COPY app.py .
COPY data/pipline.pickle data/
EXPOSE 5000
CMD ["python", "app.py"]

Overwriting ../Dockerfile


The steps go as follows:
 - start from a base image
 - set up a working directory
 - copy across our requirements.txt and then pip install it to make sure we have all the libraries that we need
 - copy across the files and data that we need. Note that the pipeline file is put into a subfolder called data, mimicking our local file structure. Recall that our .py file loads the pipeline from a folder called data, so we need to replicate this for it to work
 - EXPOSE the port that we wrote into our .py file
 - run the command "python app.py", which is what launches our service

And that's it! Once the above cell is run the Dockerfile is written; now it just remains to be built.
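
Because the build fails if a COPY source is missing from the build context, it can help to check the file layout first. The helper below is a hypothetical pre-build check, not part of Docker or of this project: it only understands plain "COPY <src> <dest>" lines like the ones in the Dockerfile above, and it is demonstrated on a temporary folder that deliberately lacks the pickle file.

```python
import tempfile
from pathlib import Path


def missing_copy_sources(dockerfile_text, context_dir):
    """Return COPY sources named in the Dockerfile that are absent from context_dir."""
    missing = []
    for line in dockerfile_text.splitlines():
        parts = line.split()
        # Only plain "COPY <src>... <dest>" lines are handled; flags such as --from are skipped.
        if len(parts) >= 3 and parts[0].upper() == "COPY" and not parts[1].startswith("--"):
            for source in parts[1:-1]:
                if not (Path(context_dir) / source).exists():
                    missing.append(source)
    return missing


dockerfile_text = """FROM python:3.8.3
COPY requirements.txt .
COPY app.py .
COPY data/pipline.pickle data/
"""

with tempfile.TemporaryDirectory() as context_dir:
    # Create two of the three files the Dockerfile copies; the pickle is left out on purpose.
    (Path(context_dir) / "requirements.txt").write_text("flask\n")
    (Path(context_dir) / "app.py").write_text("")
    result = missing_copy_sources(dockerfile_text, context_dir)

print(result)
```

Run against the real project folder, an empty list would suggest every COPY source is where the Dockerfile expects it.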

## Building the Docker image

There are two ways we can do this:
 - Put your work on GitHub and link it to Docker Hub. You can then set Docker Hub to watch a repo and a branch within it; whenever the branch is pushed to, it triggers the build process and automatically builds the Docker image for you from the Dockerfile in that branch. (Personally, this is my favourite as it minimises CLI time.)
 - Or you can build it all locally using CLI commands and then, optionally, push it to Docker Hub for safe keeping.

In the CLI, navigate to the folder that contains our Dockerfile and run the following command to build it:

If all has been done correctly you'll get a message saying it has been successfully built, with an image ID and what it has been tagged as.

If you run the cell below you can see a list of all the Docker images on your machine:

Then, to run the image, use the template below to build the run command, inputting the required values. If you have been using the settings in this notebook you can use the command in the second cell.

Now that it's running we can test it in exactly the same way as we did earlier and, if it has all been done correctly, we should get the answer that we are expecting.

In [69]:
url = 'http://0.0.0.0:5000/predict'
j_data = json.dumps([dict(zip(dataCols,[28,'male',30.0,0,'no','northwest']))])
headers = {'content-type': 'application/json', 'Accept-Charset': 'UTF-8'}

r = requests.post(url, data=j_data, headers=headers)
print(r, r.text)

<Response [200]> {"prediction":[4944.96464438397]}



WOOOOOO!!!!! That's it! You are done! (You can stop the running container with ctrl-c.)

Our pipeline is now built and packaged into a Docker image! You can share that image with anyone and they can run it straight on their machine without worrying about any setup or installs. They will have everything they need and can get straight to using it. If this were being hosted and deployed on the internet somewhere, you would only need to supply the Docker image we built.

The last, semi-optional step would be to push this image to Docker Hub.

NOTE: if you want to call the pipeline with a curl command from the CLI, you just run the following:

## Push to Docker Hub

If you are pushing to Docker Hub, make sure you create a free account and, within that account, make a repo (private or public) to store this image in.

First you need to tag the image appropriately. You'll need the image ID for this, which you can get by running "docker images"; then build the following command and run it:

Once you've tagged it, you push it to the repo by building and running the following command:

Easy peasy lemon squeezy. If anyone wanted to pull the image they would then just run:

# Finished

That's it, we are done: we have successfully built an SKL pipeline, built an API for it, put it into a Docker image and pushed it to Docker Hub to share with anyone.

If you do ever deploy it on the internet and want to make calls to it then you only need to change "http://0.0.0.0:5000/" to whatever the website is.
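
One way to keep that change in a single place is to build the endpoint from a base URL. This is only a sketch: `predict_url` is an illustrative helper and `https://example.com/` is a placeholder host, not a real deployment.

```python
from urllib.parse import urljoin


def predict_url(base_url):
    """Build the /predict endpoint URL from the service's base URL."""
    return urljoin(base_url, "predict")


local_url = predict_url("http://0.0.0.0:5000/")
hosted_url = predict_url("https://example.com/")  # placeholder host

print(local_url)
print(hosted_url)
```

The request code from earlier would then stay the same, with only the base URL swapped.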

Special thanks to Junaid Butt (he doesn't want any questions please) for making me realise the power of Docker images. He also has a nice reference for Docker commands: https://paper.dropbox.com/doc/Docker-Commands--A8dM2emiEqwCDDCD2PnVHc5XAg-IL47J9mwFMg67Lmn0vKaC

Dr J. Strudwick