# Chapter 9 - Embedding an ML model into a web application

For web applications it is not feasible to learn from the training data every single time a prediction has to be made. Therefore, we can save the current state of the model as a pickle object using Python's built-in pickle module.

For this purpose we rebuild the out-of-core logistic regression (LR) model from chapter 8.

In [7]:
#First a tokenizer function that cleans unprocessed text data from the movie_data.csv file.
import numpy as np
import re
from nltk.corpus import stopwords
stop = stopwords.words('english')

def tokenizer(text):
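    #Strip HTML markup (raw strings avoid invalid-escape warnings for \) and \W)
    text = re.sub(r'<[^>]*>', '', text)
    #Collect emoticons such as :-) or ;( before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    #Replace non-word characters with spaces and append the emoticons (without the '-' nose)
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', ''))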
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

#define a generator that reads in and returns one document at a time
def stream_docs(path):
    with open(path, 'r', encoding = 'utf-8') as csv:
        next(csv) #skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label
            
#Define a function get_minibatch that takes a document stream from stream_docs
# and returns a particular number of documents (specified by the size parameter)
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    
    return docs, y            

#CountVectorizer requires holding the complete vocabulary in memory and can thus not be used for out-of-core learning.
#The same holds for TfidfVectorizer; therefore we use HashingVectorizer.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
vect = HashingVectorizer(decode_error = 'ignore',
                        n_features = 2**21,
                        preprocessor = None,
                        tokenizer = tokenizer)
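clf = SGDClassifier(loss = 'log', random_state = 1, max_iter = 1)
# Logistic regression classifier (loss = 'log') trained with stochastic gradient descent (SGD).
# Note: scikit-learn 1.1 renamed this loss to 'log_loss'; the old name 'log' was removed in 1.3.
# SGD updates the weights from one document (or one mini-batch) at a time.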
doc_stream = stream_docs(path='movie_data.csv')            
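
To make the tokenizer's behavior concrete, the following minimal sketch runs it on a single made-up review. The small `STOP` set is a hypothetical stand-in for the NLTK stopword list, so the example runs without NLTK; with the full list, more words would be filtered out.

```python
import re

# Hypothetical stand-in for stopwords.words('english')
STOP = {'this', 'is', 'a', 'the'}

def tokenizer(text, stop=STOP):
    text = re.sub(r'<[^>]*>', '', text)  # strip HTML markup
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    # Replace non-word characters with spaces, then append the emoticons
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return [w for w in text.split() if w not in stop]

tokens = tokenizer('<b>This</b> movie is great :-)')
print(tokens)  # ['movie', 'great', ':)']
```

The HTML tags and stop words are gone, and the emoticon survives as its own token with the "nose" removed.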

In [8]:
#Implement the out-of-core model
import pyprind
pbar = pyprind.ProgBar(45)
classes = np.array([0,1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size = 1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes = classes)
    pbar.update()

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:53
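
The two streaming helpers used in the loop above can be tried on a toy file. This is a small sketch with three made-up rows (the file name toy.csv and its contents are assumptions, not the real movie_data.csv). It shows that the slicing relies on every row ending in a comma, a one-digit label and a newline, and that a batch which cannot be filled completely is returned as (None, None).

```python
import os
import tempfile

def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip header
        for line in csv:
            # line[:-3] drops ',<label>\n'; line[-2] is the label digit
            text, label = line[:-3], int(line[-2])
            yield text, label

def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None  # an incomplete batch is discarded
    return docs, y

rows = ['review,sentiment\n', '"Loved it",1\n', '"Hated it",0\n', '"Fine, I guess",1\n']
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.csv')
    with open(path, 'w', encoding='utf-8') as f:
        f.writelines(rows)
    stream = stream_docs(path)
    first = get_minibatch(stream, size=2)
    second = get_minibatch(stream, size=2)  # only one row is left

print(first)   # (['"Loved it"', '"Hated it"'], [1, 0])
print(second)  # (None, None)
```

Note that the surrounding quotation marks stay part of the text, and that the third row is lost because the second batch could not be filled.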


In [9]:
#Evaluate performance of the model using the next 5000 documents
X_test, y_test = get_minibatch(doc_stream, size = 5000)
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % clf.score(X_test, y_test))

Accuracy: 0.867


In [10]:
#Update model with those 5000 documents
clf = clf.partial_fit(X_test, y_test)

### Save current state of ML-model with pickle

In [13]:
import pickle
import os
dest = os.path.join('movieclassifier', 'pkl_objects')
#Create the folder if it does not exist.
if not os.path.exists(dest):
    os.makedirs(dest)
#Save the stopwords, so the NLTK library does not have to be installed on the server
with open(os.path.join(dest, 'stopwords.pkl'), 'wb') as f: #wb = write in binary mode
    pickle.dump(stop, f, protocol = 4)
#Save the model; the with statement closes the file once the dump is complete
with open(os.path.join(dest, 'classifier.pkl'), 'wb') as f:
    pickle.dump(clf, f, protocol = 4)

The model and the stopwords are now saved to a newly created folder. Another option is to use the joblib library, which may be more efficient.
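
As a quick sanity check of the save-and-load round trip, here is a minimal sketch that uses only the standard library. The `state` dictionary is a hypothetical stand-in for the trained classifier, and the file is written to a temporary folder rather than to movieclassifier/pkl_objects.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the trained clf object
state = {'weights': [0.25, -1.5], 'classes': [0, 1]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'classifier.pkl')
    with open(path, 'wb') as f:   # wb = write in binary mode
        pickle.dump(state, f, protocol=4)
    with open(path, 'rb') as f:   # rb = read in binary mode
        restored = pickle.load(f)

print(restored == state)  # True
```

The restored object is an equal copy, not the same object. Because unpickling can execute arbitrary code, only load pickle files that come from a trusted source.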

The HashingVectorizer we created in chapter 8 does not have to be fitted, and therefore it is not necessary to store it via pickle. Instead, we are going to create a new .py file (vectorizer.py) that contains the HashingVectorizer code.
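
Why no fitting is needed can be illustrated with a bare-bones version of the hashing trick. This is a sketch of the idea only, not HashingVectorizer's implementation: the function name `hashed_counts` is hypothetical, and MD5 is swapped in as a deterministic hash for illustration (scikit-learn uses a different hash function).

```python
import hashlib

def hashed_counts(tokens, n_features=16):
    """Map each token to a column index with a hash; no vocabulary is stored."""
    counts = {}
    for token in tokens:
        digest = hashlib.md5(token.encode('utf-8')).digest()
        index = int.from_bytes(digest[:8], 'big') % n_features
        counts[index] = counts.get(index, 0) + 1
    return counts

# Two independent calls give the same columns, with no state shared between them
a = hashed_counts(['love', 'movie', 'love'])
b = hashed_counts(['love', 'movie', 'love'])
print(a == b)
```

Because the column index depends only on the token itself, a vectorizer created fresh on the server maps words to the same columns as the one used during training, provided it is configured with the same parameters.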

## Loading the vectorizer, preprocessing documents and making predictions

In [3]:
%cd c:\Users\rikkr\Documents\R\Python Machine Learning\movieclassifier

import pickle
import re
import os
from vectorizer import vect
clf = pickle.load(open(os.path.join('pkl_objects', 'classifier.pkl'), 'rb'))

c:\Users\rikkr\Documents\R\Python Machine Learning\movieclassifier


In [5]:
import numpy as np
label = {0: 'negative', 1: 'positive'}

example = ['I love this movie']
X = vect.transform(example)
print('Prediction: %s\nProbability: %.2f%%' % (label[clf.predict(X)[0]], np.max(clf.predict_proba(X)) * 100))

Prediction: positive
Probability: 82.52%
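
The print statement above combines two pieces: predict_proba returns one row per document with one probability per class, and the largest entry of that row belongs to the predicted class. The sketch below reproduces the formatting with plain lists. The helper name `format_prediction` and the probability values are hypothetical, chosen only to mirror the output shown above.

```python
label = {0: 'negative', 1: 'positive'}

def format_prediction(proba_row):
    # Index of the largest probability = predicted class label
    best = max(range(len(proba_row)), key=proba_row.__getitem__)
    return 'Prediction: %s\nProbability: %.2f%%' % (label[best], proba_row[best] * 100)

# Hypothetical probabilities for the classes [negative, positive]
message = format_prediction([0.1748, 0.8252])
print(message)
```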


## Setting up an SQLite database for data storage

The database collects optional feedback about predictions from users of the web application. We can then use this feedback to update our classification model. SQLite is open source and does not require a separate server to operate. Python already ships with an integrated API for SQLite: the sqlite3 module.

In [18]:
import sqlite3
import os

if os.path.exists('reviews.sqlite'):
    os.remove('reviews.sqlite')
conn = sqlite3.connect('reviews.sqlite') #Connect to the SQLite database; this creates a new database file
c = conn.cursor() #Create a cursor (allows us to traverse the database records using SQL syntax)
c.execute('CREATE TABLE review_db' ' (review TEXT, sentiment INTEGER, data TEXT)') #Create table with 3 columns

#Store 2 examples (with class labels 1 and 0)
example1 = 'I love this movie'
c.execute("INSERT INTO review_db"\
          "(review, sentiment, data) VALUES"\
          "(?,?, DATETIME('now'))", (example1, 1)) 

example2 = 'I disliked this movie'
c.execute("INSERT INTO review_db"\
          " (review, sentiment, data) VALUES"\
          " (?,?, DATETIME('now'))", (example2, 0))

conn.commit() #Save changes
conn.close() #Close connection

In [20]:
conn = sqlite3.connect('reviews.sqlite') #Connect to the database
c = conn.cursor() #Create a cursor to traverse the records in the database
#Select all records in review_db between 2017-01-01 and the current date/time
c.execute("SELECT * FROM review_db WHERE data"\
        " BETWEEN '2017-01-01 00:00:00' AND DATETIME('now')")

results = c.fetchall()
conn.close()
print(results)

[('I love this movie', 1, '2019-08-11 06:51:28'), ('I disliked this movie', 0, '2019-08-11 06:51:28')]
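
The same insert-and-query pattern can be tried without touching the file system. This is a minimal sketch under two assumptions not taken from the cells above: it uses an in-memory database (':memory:') and inserts both rows with executemany. The ? placeholders pass the values separately from the SQL text, which is what keeps user-supplied reviews from being interpreted as SQL.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # throwaway database, nothing is written to disk
c = conn.cursor()
c.execute('CREATE TABLE review_db (review TEXT, sentiment INTEGER, data TEXT)')

feedback = [('I love this movie', 1), ('I disliked this movie', 0)]
c.executemany(
    "INSERT INTO review_db (review, sentiment, data) VALUES (?, ?, DATETIME('now'))",
    feedback)
conn.commit()

# Count how many reviews were stored per class label
c.execute('SELECT sentiment, COUNT(*) FROM review_db GROUP BY sentiment ORDER BY sentiment')
counts = c.fetchall()
conn.close()
print(counts)  # [(0, 1), (1, 1)]
```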


Alternatively, a graphical interface for working with SQLite databases is available as a Firefox plugin at https://addons.mozilla.org/en-US/firefox/addon/sqlite-manager/

## Developing a web application with Flask

Flask is known as a micro-framework, which means that its core is lean and simple but can easily be extended with other libraries (an alternative to Flask is Django).

## The first Flask web-application

We are going to build a simple web application with a form field that lets us enter a name. After a name is entered, the application renders it on a new page.

For a Flask application, the directory tree is as follows:

    1st_flask_app_1/
        app.py
        templates/
            first_app.html

Code in the app.py file:

    from flask import Flask, render_template

    app = Flask(__name__)

    @app.route('/')
    def index():
        return render_template('first_app.html')

    if __name__ == '__main__':
        app.run()

The @app.route() decorator specifies which URL should trigger the function defined directly below it.
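
The name form described earlier in this section could be wired up roughly as follows. This is a sketch under stated assumptions, not the textbook's code: the /hello route, the inline FORM and GREETING templates and the use of render_template_string (instead of HTML files in templates/) are all hypothetical choices that keep the example in a single file. Flask's test client is used so the routes can be exercised without starting a server.

```python
from flask import Flask, render_template_string, request

app = Flask(__name__)

# Hypothetical inline templates; a real app would keep these in templates/
FORM = '<form method="post" action="/hello"><input name="name"><input type="submit"></form>'
GREETING = 'Hello {{ name }}!'

@app.route('/')  # the root URL shows the form
def index():
    return render_template_string(FORM)

@app.route('/hello', methods=['POST'])  # the form posts the name to this URL
def hello():
    name = request.form.get('name', '')
    return render_template_string(GREETING, name=name)

# Exercise the routes in-process with the test client
with app.test_client() as client:
    response = client.post('/hello', data={'name': 'Ada'})
    body = response.get_data(as_text=True)

print(body)  # Hello Ada!
```

To serve the app locally instead, the test-client block would be replaced by the usual `if __name__ == '__main__': app.run()`.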

See the project folders on the local drive and the textbook for more information on web applications with Flask.

## Deploying the app to a public server

Using PythonAnywhere we can deploy a single web application free of charge.

With the movieclassifier app, the model is updated via partial_fit every time a user provides feedback. However, because the code does not pickle the updated model, all progress is lost if the server crashes. One remedy would be to save the model as a pickle object after each feedback loop; however, this is not computationally efficient and may corrupt the pkl file if multiple users try to update it simultaneously.
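
If the model were nevertheless pickled on the server, one common way to reduce the corruption risk is to write to a temporary file and then rename it over the old one. The sketch below illustrates that pattern. The helper name `save_atomically` is hypothetical, and the approach only guards against half-written files; it does not make frequent saving any cheaper, nor does it stop one user's update from overwriting another's.

```python
import os
import pickle
import tempfile

def save_atomically(obj, path):
    """Pickle obj to a temp file in the same folder, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'wb') as f:
            pickle.dump(obj, f, protocol=4)
        # On POSIX systems the rename is atomic: readers see the old or the new file
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # clean up the partial file
        raise

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'classifier.pkl')
    save_atomically({'version': 1}, target)  # hypothetical stand-ins for clf
    save_atomically({'version': 2}, target)
    with open(target, 'rb') as f:
        loaded = pickle.load(f)
    leftovers = os.listdir(tmp)

print(loaded, leftovers)  # {'version': 2} ['classifier.pkl']
```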

Another (better) option is to update the predictive model with the feedback data stored in the SQLite database. For example, we can download the responses from the table to our local computer, update the clf object, and upload the new pickle file to the web app.
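
The update step described above might look roughly like this. It is a sketch only: the function name `update_model`, its parameters and the batch size are assumptions. It expects a `model` with a partial_fit method and a `vect` with a transform method, such as the clf and vect objects from this chapter, and reads the feedback in batches so that the whole table never has to fit in memory.

```python
import sqlite3

def update_model(db_path, model, vect, batch_size=10000):
    """Feed all stored reviews to model.partial_fit, batch_size rows at a time."""
    conn = sqlite3.connect(db_path)
    try:
        c = conn.cursor()
        c.execute('SELECT review, sentiment FROM review_db')
        results = c.fetchmany(batch_size)
        while results:
            reviews = [row[0] for row in results]
            labels = [int(row[1]) for row in results]
            X = vect.transform(reviews)
            model.partial_fit(X, labels, classes=[0, 1])
            results = c.fetchmany(batch_size)
    finally:
        conn.close()  # close the connection even if an update fails
    return model
```

With the objects from this chapter it would be called as `clf = update_model('reviews.sqlite', clf, vect)`, after which the updated clf can be pickled again and uploaded.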

This update retrains the model on the entries in the SQL table. In practice, however, the user feedback should be validated before it is used to update the model. An open question remains: does training on the same examples every time lead to overfitting?