<p align="center"><img width="50%" src="https://aimodelsharecontent.s3.amazonaws.com/aimodshare_banner.jpg" /></p>


---




<p align="center"><h1 align="center">IMDB Movie Review Text Classification Tutorial</h1> <h3 align="center">(Prepare to deploy model and preprocessor to REST API/Web Dashboard in four easy steps...)</h3></p>
<p align="center"><img width="80%" src="https://aimodelsharecontent.s3.amazonaws.com/ModelandPreprocessorObjectPreparation.jpeg" /></p>


---



## **(1) Preprocessor Function & Setup**

### **Obtaining the IMDb Movie Review Dataset**

In [1]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

--2020-09-18 04:04:53--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2020-09-18 04:05:04 (8.13 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



In [2]:
!tar -zxf aclImdb_v1.tar.gz

### **Load Files Manually**

In [3]:
! pip install pyprind

Collecting pyprind
  Downloading https://files.pythonhosted.org/packages/1e/30/e76fb0c45da8aef49ea8d2a90d4e7a6877b45894c25f12fb961f009a891e/PyPrind-2.11.2-py3-none-any.whl
Installing collected packages: pyprind
Successfully installed pyprind-2.11.2


In [4]:
import os

import pandas as pd
import pyprind  # Alternatively: conda install -c conda-forge pyprind

# Change `basepath` to the directory of the
# unzipped movie dataset.

basepath = 'aclImdb'

labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000)
rows = []  # Collect rows in a list; DataFrame.append was removed in pandas 2.0.
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])
            pbar.update()
df = pd.DataFrame(rows, columns=['review', 'sentiment'])

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:01:42


In [5]:
df

                                                  review  sentiment
0      Surprising, witty, funny and totally engaging,...          1
1      solid documentary about edgey kids who first s...          1
2      This movie is an amazing comedy.. the script i...          1
3      CONTEXT is everything when one goes to rate a ...          1
4      More of a near miss than a flop, MR. IMPERIUM ...          1
...                                                  ...        ...
49995  2005 gave us the very decent "gore porn" flick...          0
49996  I work at a Blockbuster store and every week w...          0
49997  Whoever wrote the screenplay for this movie ob...          0
49998  First up this film, according to the slick sai...          0


In [6]:
df['sentiment'].value_counts()

1    25000
0    25000
Name: sentiment, dtype: int64

In [7]:
print(df.head())

                                              review  sentiment
0  Surprising, witty, funny and totally engaging,...          1
1  solid documentary about edgey kids who first s...          1
2  This movie is an amazing comedy.. the script i...          1
3  CONTEXT is everything when one goes to rate a ...          1
4  More of a near miss than a flop, MR. IMPERIUM ...          1


### **Write a Preprocessor Function**

In [8]:
def preprocessor(data):
    # Only transform here: `vect` is fitted once on the training set in step (2),
    # so new data is always mapped onto the same vocabulary instead of refitting it.
    preprocessed_data = vect.transform(data)
    return preprocessed_data

## **(2) Build an `sklearn` Model to Predict Positive/Negative Reviews**

In [9]:
from sklearn.model_selection import train_test_split

X = df.review
y = df.sentiment

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1987) # 10% of data reserved for testing.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

# Drop English stop words, ignore words that appear in fewer than 1% of reviews,
# keep at most the 1000 most frequent remaining words, and record presence/absence (binary=True) rather than counts.
vect = CountVectorizer(stop_words="english", min_df=0.01, max_features=1000, binary=True)

vect.fit(X_train) # Fit the vectorizer on the training set only so as to prevent data leakage to the test set.

preprocessor(X_train)

<45000x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 2096642 stored elements in Compressed Sparse Row format>

In [11]:
preprocessor(X_train).shape

(45000, 1000)
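
The comments in this step stress fitting the vectorizer on the training set only. The sketch below illustrates why, using two tiny made-up corpora (`train_docs`, `test_docs`, and `toy_vect` are illustrative names, not part of the tutorial): calling `transform` keeps the training vocabulary, while refitting on the test documents produces a different set of columns that a trained model could not interpret.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpora standing in for the training and test reviews (illustrative only).
train_docs = ["great movie great cast", "terrible plot", "great plot"]
test_docs = ["awful movie", "great acting"]

toy_vect = CountVectorizer(binary=True)
train_matrix = toy_vect.fit_transform(train_docs)  # Learns the vocabulary from the training docs.
test_matrix = toy_vect.transform(test_docs)        # Reuses that vocabulary unchanged.

vocab = sorted(toy_vect.vocabulary_)
print(vocab)
print(train_matrix.shape, test_matrix.shape)  # Same number of columns in both.

# Refitting on the test docs instead yields a different vocabulary,
# so the columns would no longer line up with what a model was trained on.
refit_vocab = sorted(CountVectorizer(binary=True).fit(test_docs).vocabulary_)
print(refit_vocab)
```

Words that never appeared in the training documents (here "awful" and "acting") are simply dropped by `transform`, which is the intended behavior for unseen data.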

In [12]:
# Naive Bayes...
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()

nb.fit(preprocessor(X_train), y_train)

nb.score(preprocessor(X_train), y_train) # Accuracy on the training set.

0.8374

In [13]:
X_test_vect = vect.transform(X_test) # Vectorize test set. Only transform, no refitting to avoid data leakage.

y_pred = nb.predict(X_test_vect)

y_pred

array([1, 1, 0, ..., 1, 0, 1])

In [14]:
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

cm = confusion_matrix(y_test, y_pred)
list1 = ["true 0", "true 1"]
list2 = ["pred 0", "pred 1"]

print("Accuracy: {:.2f}%".format(accuracy_score(y_test, y_pred) * 100))
print("\nF1 Score: {:.2f}".format(f1_score(y_test, y_pred))) # F1 score = harmonic mean of precision and recall; best value at 1, worst at 0.
print("\nConfusion Matrix:\n", pd.DataFrame(cm, list1,list2))

Accuracy: 84.48%

F1 Score: 0.85

Confusion Matrix:
         pred 0  pred 1
true 0    2097     398
true 1     378    2127
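
As a sanity check, the reported accuracy and F1 score can be recomputed by hand from the four counts in the confusion matrix above. This is only a worked-arithmetic sketch; the variable names are chosen here for illustration, and class 1 (positive review) is treated as the positive class, which is the default for `f1_score` on binary labels.

```python
# Counts copied from the confusion matrix above.
tn, fp = 2097, 398   # true 0 row: predicted 0, predicted 1
fn, tp = 378, 2127   # true 1 row: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # Harmonic mean of precision and recall.

print("Accuracy: {:.2f}%".format(accuracy * 100))
print("Precision: {:.4f}".format(precision))
print("Recall: {:.4f}".format(recall))
print("F1 Score: {:.2f}".format(f1))
```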


## **(3) Save Preprocessor**

In [None]:
# ! pip3 install aimodelshare

In [15]:
def export_preprocessor(preprocessor_function, filepath):
    import dill
    with open(filepath, "wb") as f:
        dill.dump(preprocessor_function, f)

# import aimodelshare as ai # Once we can deploy this, we use it in lieu of the below.
# ai.export_preprocessor(preprocessor, "preprocessor.pkl")

export_preprocessor(preprocessor, "preprocessor.pkl")
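
Before moving on, it can be worth checking that a pickled preprocessor loads and behaves the same way. The sketch below does a round trip with `dill` on a toy function (`toy_vect`, `toy_preprocessor`, and the temporary file name are hypothetical stand-ins, not part of the tutorial). It only checks behavior within the same Python session and says nothing about loading the file in a fresh process.

```python
import os
import tempfile

import dill
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for `vect` and `preprocessor` (illustrative only).
toy_vect = CountVectorizer(binary=True).fit(["good film", "bad film"])

def toy_preprocessor(data):
    return toy_vect.transform(data)

# Save the function to a temporary file, mirroring export_preprocessor.
path = os.path.join(tempfile.mkdtemp(), "toy_preprocessor.pkl")
with open(path, "wb") as f:
    dill.dump(toy_preprocessor, f)

# Load it back and compare outputs on a sample input.
with open(path, "rb") as f:
    restored = dill.load(f)

sample = ["good film"]
same_output = bool((restored(sample).toarray() == toy_preprocessor(sample).toarray()).all())
print(same_output)
```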

## **(4) Save Model to ONNX File Format**

In [None]:
! pip3 install skl2onnx

In [21]:
# Convert into ONNX format...

from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

initial_type = [('float_input', FloatTensorType([None, 1000]))]
onx = convert_sklearn(nb, initial_types=initial_type)

# Save model to local .onnx file...
with open("my_model.onnx", "wb") as f:
    f.write(onx.SerializeToString())