This file contains the explanation for Task 1, the .ipynb implementation of model.py, and tests that the model pipeline works as intended.

# Task 1: Research different model serving option(s) and explain what would be the right choice for your case

Here we are deploying a machine learning model using Flask. By deploying, we mean integrating the machine learning model we have created into an existing production environment, where it takes an input (in our case, a comment) and returns an output (the predicted labels for that comment).

When deploying a model, there are a number of production-related issues to consider, such as how the model copes with high traffic and how different versions of the ML model are stored and managed. There are many ways to serve a model, but not all of them solve these common problems; the 'model as code' approach, for example, does not address them. Because these problems are so common, there are general-purpose platforms for serving and deploying ML models.

There are two serving types that these serving options fall into:
- Model as code: This is the most common way to deploy a model. A trained model is saved in a binary format and then wrapped in a microservice, for example one built with Flask.
- Model as data: The model is saved in a standardized format that any programming language can use, so it does not need to be wrapped in a microservice. This approach can solve some of the production-related issues above.

The advantage of the 'model as code' approach is that it simplifies the deployment process and can provide the tools to perform canary releases and A/B testing. Its drawback is that as the number of models grows, the number of microservices multiplies, which increases the number of failure points and makes the system difficult to manage. With 'model as data', the model is called directly, so we do not need to build monitoring or error handling around it ourselves.
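
As a rough illustration of the 'model as code' idea, the sketch below saves model parameters in a binary format and wraps the prediction logic in a small web application. It uses only the Python standard library (a plain WSGI callable, the same interface a Flask application exposes) so that it runs without extra packages. The keyword list, `predict`, `app` and `call_app` are hypothetical stand-ins, not the project's model or routes.

```python
import io
import json
import pickle
from wsgiref.util import setup_testing_defaults

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Step 1 (training side): save the "trained" parameters in a binary format.
# The keyword list is a hypothetical stand-in, not the project's model.
model_bytes = pickle.dumps({"flagged_words": ["idiot", "stupid"]})

# Step 2 (serving side): the service loads the binary once at start-up.
model = pickle.loads(model_bytes)


def predict(comment):
    """Return one 0/1 value per label; this toy model only ever sets 'toxic'."""
    words = comment.lower().split()
    toxic = int(any(word in model["flagged_words"] for word in words))
    return [toxic, 0, 0, 0, 0, 0]


def app(environ, start_response):
    """WSGI application: read a JSON body {"comment": ...}, reply with labels."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(length) or b"{}")
    result = dict(zip(LABELS, predict(payload.get("comment", ""))))
    body = json.dumps(result).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]


def call_app(comment):
    """Call the application in-process, without opening a network socket."""
    raw = json.dumps({"comment": comment}).encode("utf-8")
    environ = {}
    setup_testing_defaults(environ)
    environ.update({"REQUEST_METHOD": "POST",
                    "CONTENT_LENGTH": str(len(raw)),
                    "wsgi.input": io.BytesIO(raw)})
    statuses = []
    chunks = app(environ, lambda status, headers: statuses.append(status))
    return statuses[0], json.loads(b"".join(chunks))


status, labels = call_app("you are an idiot")
print(status, labels)
```

Every additional model served this way needs its own wrapper of this kind, which is the multiplication of microservices described above.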

As we will only have a single model, we do not need to worry about multiplying microservices. Because the 'model as code' approach simplifies the deployment process and gives us the tools to test our service, it is well suited to this project. I will compare TensorFlow Serving, which uses the 'model as data' approach, with Flask, which uses the 'model as code' approach, to decide which option and which serving type best suit this project. I will also compare a third option, Django.

Following the discussion above, our group decided to research three options for deploying and serving our model: Flask, Django, and TensorFlow Serving. We have a lot of experience with TensorFlow, and TensorFlow Serving is the most suitable option for updating models when needed, handles batch requests well, and can use a GPU. However, TensorFlow Serving does not support scikit-learn (sklearn) models, which makes it unviable for serving our model. Django is more powerful and scalable than Flask and is good for larger applications. Overall, when choosing between Flask and Django, we concluded that Flask was best suited to our model: it is easy to use and minimalistic with no restrictions, and it is better suited to simple websites. This fits our case, as our web application is quite simple: all it does is take in a comment and predict its labels.

# Setting Up and Creating the Model Pipeline
First, we prepare the data in the same way as in the individual experiments.

In [5]:
import joblib
import nltk
import numpy as np
import pandas as pd
import pickle
import sklearn
import tensorflow as tf
import matplotlib.pyplot as plt

from tensorflow import keras

In [6]:
# File paths

# Multinomial Naive Bayes model file path
MODEL_DIR = "multi_mnb_model.joblib"

# Balanced datasets
BALANCED_TRAIN_DATASET = "../balanced_dataset.pickle"

# Preprocessed balanced data
PREPROCESSED_BAL_TRAIN_DATASET = "../preprocessed_train.pickle"

In [7]:
# Function to load pickle file
# Params:
    # Str - @file_path: File path of pickle file
# Output:
    # Saved object in original file type (list/dataframe)
def load_pickle(file_path):
    # Use a context manager so the file handle is always closed
    with open(file_path, "rb") as pickle_file:
        return pickle.load(pickle_file)
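
As a quick check of the loading helper, the following sketch writes a small object to a temporary pickle file and reads it back. The sample list and the file name are hypothetical and only stand in for the project's preprocessed data.

```python
import os
import pickle
import tempfile


def load_pickle(file_path):
    """Load and return the object stored in a pickle file."""
    with open(file_path, "rb") as handle:
        return pickle.load(handle)


# Hypothetical sample data standing in for the preprocessed comments.
sample = [["sad", "one", "two"], ["hello", "world"]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.pickle")
    with open(path, "wb") as handle:
        pickle.dump(sample, handle)
    restored = load_pickle(path)

print(restored == sample)  # prints True
```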

In [8]:
# Get preprocessed train dataset
bal_train_dataset = load_pickle(PREPROCESSED_BAL_TRAIN_DATASET)

# Get train_y
bal_train_y = pd.read_pickle(BALANCED_TRAIN_DATASET)
bal_train_y = bal_train_y.drop(columns="comment_text")

In [9]:
# Imports for model pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

In [10]:
# Pre-processing imports
from functools import lru_cache
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

In [19]:
# Dummy function for TfidfVectorizer tokenizer
def fake_function(comments):
    return comments

# Pre-processing functions


# Function to clean comments in dataset
# Params: 
#   Pandas dataframe - @dataset: Data to be cleaned
# Output: 
#   List    - @comment_list: Cleaned comments (2D List)
def clean_data(dataset):

    # Remove punctuation (keep only letters and whitespace)
    regex_str = r"[^a-zA-Z\s]"
    dataset['comment_text'] = dataset['comment_text'].replace(regex=regex_str, value="")

    # Remove extra whitespaces
    regex_space = r"\s+"
    dataset['comment_text'] = dataset['comment_text'].replace(regex=regex_space, value=" ")

    # Strip whitespaces
    dataset['comment_text'] = dataset['comment_text'].str.strip()

    # Lowercase
    dataset['comment_text'] = dataset['comment_text'].str.lower()

    # Convert comment_text column into a list
    comment_list = dataset['comment_text'].tolist()

    return comment_list

# Function to map a token's NLTK POS tag to the matching WordNet POS constant
# Params: 
#   Str - @word: Token
# Output:
#   Str - WordNet POS constant (defaults to noun)
def nltk_get_wordnet_pos(word):
    
    tag = nltk.pos_tag([word])[0][1][0].upper()

    # Convert NLTK to wordnet POS notations

    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN) # Default to noun if not found

# Function to use NLTK lemmatizer
# Params: 2D List - Tokenized comments with stopwords removed
# Returns: 2D List - lemmatized tokens
def nltk_lemmatize(comment_stop):

    nltk.download('averaged_perceptron_tagger')
    comment_lemma = []
    lemmatizer = WordNetLemmatizer()
    lemmatizer_cache = lru_cache(maxsize=50000)(lemmatizer.lemmatize)

    for comment in comment_stop:
        # Lemmatize every token using its POS tag; results are cached per (word, pos)
        comment_lemma.append([lemmatizer_cache(word, pos=nltk_get_wordnet_pos(word)) for word in comment])

    return comment_lemma

# Function to remove NLTK stopwords
# Params: 
#   2D List - @comment_token:   cleaned & tokenized comments
# Output:
#   2D List - @comment_stop: cleaned tokens with stopwords removed
def nltk_stopwords(comment_token):
    # Stopwords in English only
    STOP_WORDS = set(stopwords.words('english'))

    # Remove stopwords
    comment_stop = []

    for comment in comment_token:
        
        temp_word = []

        for word in comment:
            
            if word not in STOP_WORDS:
                temp_word.append(word)

        comment_stop.append(temp_word)

    return comment_stop

# Function to tokenize comments using NLTK Word Tokenize
# Params: 
#   List - @text: cleaned comments (one string per comment)
# Output: 
#   2D List - tokenized comments
def nltk_tokenize(text):
    return [word_tokenize(comment) for comment in text]

# Function for all pre-processing functions without saving as pickle file
# Params:
#   Pandas dataframe or List - @dataset: Raw dataframe to pre-process, or an already pre-processed list (returned unchanged)
# Output:
#   List - @comments_list: Preprocessed tokens (2D List)
def preprocess_data_without_pickle(dataset):

    # Prevent re-running on already preprocessed data
    if isinstance(dataset, pd.DataFrame): #if dataframe, data isn't preprocessed

        comments_list = clean_data(dataset)
        
        # NLTK Tokenize
        comments_list = nltk_tokenize(comments_list)

        # Remove NLTK stopwords
        comments_list = nltk_stopwords(comments_list)

        # NLTK Lemmatization
        comments_list = nltk_lemmatize(comments_list)

        return comments_list
    
    else:
        return dataset
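
To make the order of the pre-processing steps easier to follow, here is a minimal standard-library sketch of the same chain applied to a single string: clean, tokenize, remove stopwords, then normalize each token. The tiny stopword set and the suffix-stripping `crude_stem` are hypothetical stand-ins for the NLTK stopword list and the WordNet lemmatizer, so the tokens it produces will differ from the real pipeline.

```python
import re
from functools import lru_cache

# Hypothetical, deliberately tiny stand-ins for the NLTK resources.
STOP_WORDS = {"a", "an", "the", "is", "are", "you"}
SUFFIXES = ("ing", "ed", "s")


def clean(text):
    """Keep letters and whitespace, collapse whitespace runs, strip, lowercase."""
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()


@lru_cache(maxsize=50000)
def crude_stem(word):
    """Stand-in for lemmatization: strip one common suffix (not WordNet)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Clean, tokenize on whitespace, drop stopwords, then normalize tokens."""
    tokens = clean(text).split()
    return [crude_stem(token) for token in tokens if token not in STOP_WORDS]


print(preprocess("  You are   WALKING the dogs, again!! "))
# prints ['walk', 'dog', 'again']
```

As in the notebook's lemmatizer, `lru_cache` means a word that appears many times across comments is only normalized once.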

Next, we create the pipeline with TfidfVectorizer and our chosen classifier, Multinomial Naive Bayes.

In [20]:
# Create the pipeline with TfidfVectorizer and Multinomial Naive Bayes
# Pass in dummy function into TfidfVectorizer's tokenizer
# Pass in our custom preprocess function into TfidfVectorizer's preprocessor
# Create Multinomial Naive Bayes MultiOutputClassifier model
pipe = Pipeline([ 
    ('tfidf', TfidfVectorizer(
        analyzer='word', 
        tokenizer=fake_function, 
        preprocessor=preprocess_data_without_pickle, 
        token_pattern=None,
        min_df=5, 
        norm='l2', 
        smooth_idf=True, 
        sublinear_tf=True)), 
    ('multi_mnb', MultiOutputClassifier(MultinomialNB(), n_jobs=-1))
    ])

# Fit the pipeline
pipe.fit(bal_train_dataset, bal_train_y)

Pipeline(steps=[('tfidf',
                 TfidfVectorizer(min_df=5,
                                 preprocessor=<function preprocess_data_without_pickle at 0x00000227B66CC550>,
                                 sublinear_tf=True, token_pattern=None,
                                 tokenizer=<function fake_function at 0x00000227AFA2AE50>)),
                ('multi_mnb',
                 MultiOutputClassifier(estimator=MultinomialNB(), n_jobs=-1))])
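
The vectorizer settings used above (`min_df`, `sublinear_tf=True`, `smooth_idf=True`, `norm='l2'`) can be illustrated with a small standard-library sketch of the weighting they select. This is a simplified re-implementation for illustration only, not scikit-learn's code; `tfidf_rows` and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_rows(docs, min_df=1):
    """TF-IDF with sublinear tf, smoothed idf and L2 normalisation.

    docs is a list of token lists. Terms that appear in fewer than
    min_df documents are dropped from the vocabulary.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(term for term, df in doc_freq.items() if df >= min_df)
    # smooth_idf=True: idf = ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for doc in docs:
        counts = Counter(t for t in doc if t in idf)
        # sublinear_tf=True: tf = 1 + ln(count)
        weights = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}
        # norm='l2': scale the row to unit Euclidean length
        length = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / length for t, w in weights.items()} if length else {})
    return vocab, rows


docs = [["bad", "bad", "comment"], ["good", "comment"], ["bad", "word"]]
vocab, rows = tfidf_rows(docs, min_df=2)
print(vocab)  # prints ['bad', 'comment']
print(rows[0])
```

With `min_df=2`, the terms "good" and "word" are dropped because each appears in only one document, in the same way that `min_df=5` removes rare tokens from the real vocabulary.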

In [21]:
# Save the pipeline
joblib.dump(pipe, 'multi_mnb_model_test.joblib', compress=1)

['multi_mnb_model_test.joblib']

# Testing the Pipeline
The following code tests the model pipeline against the functions created for the web application to ensure the model works as intended, such as ensuring the comment is in a form that can be used by the model (dataframe instead of string). We are also ensuring that the pipeline returns predictions in a format that is expected (e.g. binary outputs instead of probabilities).

In [22]:
# Load the model
pipe = joblib.load('multi_mnb_model_test.joblib')

In [13]:
# List of columns for dataframes (temp and global)
cols = ['comment_text','toxic','severe_toxic','obscene','threat','insult','identity_hate']
labels = ['toxic','severe_toxic','obscene','threat','insult','identity_hate']

df = pd.DataFrame(columns=cols)
print(df.columns)

Index(['comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult',
       'identity_hate'],
      dtype='object')


In [14]:
# Convert to df to feed into pipeline
# Params:
#   String - @comment: Input from form in web app
# Output:
#   Dataframe - @temp_df: Temporary dataframe of a single comment to be preprocessed by pipeline
def convert_for_pred(comment):

    # Single row: the comment plus a placeholder 0 for every label
    new_row = {'comment_text': comment}

    for label in labels:
        new_row[label] = 0

    # DataFrame.append was removed in pandas 2.0, so build the dataframe directly
    temp_df = pd.DataFrame([new_row], columns=cols)

    return temp_df

In [25]:
# Sample test comment
test = "sad fuck one two a b"

comment = convert_for_pred(test)
print(comment)

# Predict
prediction = pipe.predict(comment['comment_text']).tolist()
print(prediction)

           comment_text toxic severe_toxic obscene threat insult identity_hate
0  sad fuck one two a b     0            0       0      0      0             0
[[1, 1, 1, 0, 1, 0]]
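
One way the web application could guard against an unexpected prediction format is sketched here: a small helper that checks for a single row of binary values and maps it to label names. The helper name `labels_from_prediction` is hypothetical and is not part of the project files.

```python
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def labels_from_prediction(prediction, labels=LABELS):
    """Validate a single-comment prediction and map it to label names.

    Expects a nested list such as [[1, 1, 1, 0, 1, 0]]: one row, one
    binary value per label. Raises ValueError for any other shape or value.
    """
    if len(prediction) != 1:
        raise ValueError("expected exactly one row of predictions")
    row = prediction[0]
    if len(row) != len(labels):
        raise ValueError("expected one value per label")
    if any(value not in (0, 1) for value in row):
        raise ValueError("expected binary outputs, not probabilities")
    return dict(zip(labels, row))


result = labels_from_prediction([[1, 1, 1, 0, 1, 0]])
print([name for name, flag in result.items() if flag])
# prints ['toxic', 'severe_toxic', 'obscene', 'insult']
```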


In [26]:
# Append the comment and its predicted labels to the global dataframe
# Store the original comment string (test), not the temporary dataframe
new_row = {'comment_text': test}

for i in range(len(labels)):
    new_row[labels[i]] = prediction[0][i]

# Append the row to the dataframe (DataFrame.append was removed in pandas 2.0)
df.loc[len(df)] = [new_row[col] for col in cols]
print(df)

           comment_text toxic severe_toxic obscene threat insult identity_hate
0  sad fuck one two a b     1            1       1      0      1             0


# Task 4: Discuss the performance of the service you implemented, and justify the good and bad points

Our service takes in a comment and predicts which labels apply to it. All comments are then saved to a database and displayed along with the date they were made and the labels assigned to them. The service does all of this quickly and without errors, and it handles user input well. The fact that it returns label predictions for comments shows that the deployment and serving of the model were successful, which we attribute to Flask being implemented well.

The issue, however, lies with the model and its predictions: it assigns toxic labels to comments that are not toxic, such as "hello". The label most commonly assigned to these non-toxic messages is the toxic label, meaning our model has poor predictive performance on such comments. This is caused by the model itself not assigning the proper label, which can have a multitude of reasons, such as not employing the best data balancing strategy or not preprocessing the training data well enough.
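
The misclassification of harmless comments could be quantified rather than described, for example with a per-label false positive rate over a set of comments known to be non-toxic. The sketch below assumes such a set exists; the helper `false_positive_rates` and the four prediction rows are hypothetical illustrations, not measured results from our model.

```python
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def false_positive_rates(predictions, labels=LABELS):
    """Share of known non-toxic comments that received each label.

    predictions holds one row of 0/1 values per comment, and every comment
    is assumed to be non-toxic, so any 1 counts as a false positive.
    """
    if not predictions:
        raise ValueError("need at least one prediction row")
    total = len(predictions)
    return {
        label: sum(row[i] for row in predictions) / total
        for i, label in enumerate(labels)
    }


# Hypothetical predictions for four harmless comments (not measured results).
benign_predictions = [
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
rates = false_positive_rates(benign_predictions)
print(rates["toxic"], rates["insult"])  # prints 0.5 0.25
```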

# Task 6: Solution Deployment
As outlined above, we have opted to implement the code required to build and deploy the model using Python code, across files such as **model.py** and **prediction.py**, called by **\__init\__.py**. For this reason, the only unfulfilled requirement is preparing the environment before deployment. 

We have opted to use MLflow for this purpose, as it requires the least effort to implement, and features great compatibility with Anaconda environments, which we have already been using in our project development process.

The file **group7_textClassification/MLproject** specifies the details of the deployment project. It points to **grp7_env.yml** to identify the project dependencies to retrieve, and creates a conda environment for the project. After the project has been run once, its environment only needs to be rebuilt when the .yml file is updated. MLflow then runs the **\__init\__.py** file, which in turn calls all relevant .py files in order to build and deploy the model at **http://localhost:5000**.

The MLproject may be run using the Anaconda CLI after installing MLflow, by navigating to the **group7_textClassification** folder and running the command **'mlflow run .'**.
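
If the same command needs to be launched from a Python script instead of typed into the CLI, a small wrapper along the following lines could be used. This is only a sketch: the helper names `build_mlflow_command` and `run_project` are hypothetical, and the wrapper simply shells out to the `mlflow run .` command described above when the `mlflow` executable is found on the PATH.

```python
import shutil
import subprocess


def build_mlflow_command(project_dir="."):
    """Return the argument list for running an MLflow project directory."""
    return ["mlflow", "run", project_dir]


def run_project(project_dir="."):
    """Run the project if the mlflow CLI is on PATH; return its exit code or None."""
    if shutil.which("mlflow") is None:
        print("mlflow CLI not found; activate the environment where MLflow is installed")
        return None
    return subprocess.run(build_mlflow_command(project_dir), check=False).returncode


print(" ".join(build_mlflow_command()))  # prints mlflow run .
```

Calling `run_project()` from inside the **group7_textClassification** folder would then be equivalent to running the command by hand.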