# MLOps and Cloud Native AI/ML: Data and Machine Learning Operationalization

### Author : Oumaima Chqaf
### Professor : Fahd Kalloubi

In this notebook we work with the **IMDB Dataset of 50K Movie Reviews** (*).
We start by preprocessing the data, then train 5 machine learning models and track their performance, versions and parameters.

(*) : The dataset is available at this link : https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

In [8]:
# All the libraries used in this notebook

import re
import string
import pickle
import warnings

import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Embedding,Bidirectional,LSTM,Dense,Dropout,BatchNormalization

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Note: these standalone keras imports rebind Tokenizer, pad_sequences,
# Sequential, Dense, LSTM and Embedding imported from tensorflow.keras above
from keras.utils import to_categorical
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense , LSTM , Embedding
from keras.models import Sequential
from keras.callbacks import EarlyStopping

warnings.filterwarnings('ignore')

### Preprocessing our dataset

We downloaded the dataset and put it in the same folder as this notebook.

In [2]:
df = pd.read_csv('./IMDBDataset.csv')

In [3]:
df.head()

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


In [4]:
sentences = df['review']
target= df['sentiment']

In [5]:
target

0        positive
1        positive
2        positive
3        negative
4        positive
           ...   
49995    positive
49996    negative
49997    negative
49998    negative
49999    negative
Name: sentiment, Length: 50000, dtype: object

### Data Cleaning

In [9]:
# stopwords
total_stopwords = set(stopwords.words('english'))

# keep negation words (no, not, don't, ...) by removing them from the stop word list
# note: 'no' in word is a substring test, so it also matches words such as "now"
negative_stop_words = set(word for word in total_stopwords 
                          if "n't" in word or 'no' in word)

final_stopwords = total_stopwords - negative_stop_words

# also treat "one" as a stop word
final_stopwords.add("one")
print(final_stopwords)

{'after', 'so', 'above', 'own', 'yours', 'any', 'which', 'having', 'she', 'down', 'weren', 'her', 'their', 'what', 've', 'below', "it's", 'mustn', 'won', 'm', 'themselves', 'to', 'both', 'itself', 'than', 'where', 's', 'on', 'myself', 'o', 'it', 'off', 'wouldn', 'being', 'some', 'most', 'or', 'doesn', 'are', 'more', 'me', 'his', 'one', 't', 'we', 'haven', 'through', 'should', 'll', 'needn', 'does', 'from', 'its', 'the', 'under', 'just', 'them', 'shan', 'few', 'aren', 'here', 'will', 'him', 'with', 'there', 'hasn', 'because', 'is', 'as', 'yourselves', 'shouldn', 'how', 'yourself', 'against', 'until', 'of', 'all', 'ma', 'hers', 'do', 'out', 'at', 'only', 'such', 'but', 'very', 'my', 'hadn', 'can', 'while', 'i', 'these', "that'll", "she's", 'then', 'did', 'your', "you're", 'by', 'up', 'further', 'this', 'mightn', 'why', 'didn', 'y', 'isn', 'doing', 'same', 'was', 'our', 'were', 'those', 'himself', 'has', "you've", 'have', 'other', 'again', 'into', 'theirs', 'each', 'an', 'between', 'herse
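
Because the filter uses substring matching, `'no' in word` selects every stop word that contains the letters "no", not only negations. The sketch below illustrates this on a small hand-picked sample; `sample_stopwords` is an assumption made for illustration and is not the full NLTK list.

```python
# Hypothetical sample; the real list comes from nltk's stopwords.words('english').
sample_stopwords = {"no", "not", "nor", "now", "don't", "isn't", "the", "and", "very"}

# Same rule as in the notebook: substring match on "n't" or "no".
negative_like = {w for w in sample_stopwords if "n't" in w or "no" in w}

# Words that would still be removed from the reviews.
remaining = sample_stopwords - negative_like

print(sorted(negative_like))
print(sorted(remaining))
```

In this sample, "now" is kept in the reviews along with the real negations, which is probably harmless but worth knowing when reading the cleaned text.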

Remove unwanted words from reviews

In [10]:
# stemming object
stemmer = PorterStemmer()

# precompiled patterns and translation tables used by the preprocessor
HTMLTAGS = re.compile(r'<.*?>')                            # HTML tags such as <br />
table = str.maketrans(dict.fromkeys(string.punctuation))  # deletes punctuation
remove_digits = str.maketrans('', '', string.digits)      # deletes digits
MULTIPLE_WHITESPACE = re.compile(r"\s+")                   # runs of whitespace

In [11]:
def preprocessor(review):
    # remove html tags
    review = HTMLTAGS.sub(r'', review)

    # remove punctuation
    review = review.translate(table)
    
    # remove digits
    review = review.translate(remove_digits)
    
    # lower case all letters
    review = review.lower()
    
    # replace multiple white spaces with single space
    review = MULTIPLE_WHITESPACE.sub(" ", review).strip()
    
    # remove stop words
    review = [word for word in review.split()
              if word not in final_stopwords]
    
    # stemming
    review = ' '.join([stemmer.stem(word) for word in review])
    
    return review

In [15]:
print("Before preprocessing : ")
df.review.iloc[6]

Before preprocessing : 


"I sure would like to see a resurrection of a up dated Seahunt series with the tech they have today it would bring back the kid excitement in me.I grew up on black and white TV and Seahunt with Gunsmoke were my hero's every week.You have my vote for a comeback of a new sea hunt.We need a change of pace in TV and this would work for a world of under water adventure.Oh by the way thank you for an outlet like this to view many viewpoints about TV and the many movies.So any ole way I believe I've got what I wanna say.Would be nice to read some more plus points about sea hunt.If my rhymes would be 10 lines would you let me submit,or leave me out to be in doubt and have me to quit,If this is so then I must go so lets do it."

In [19]:
# apply preprocessing function

df.review = df.review.apply(preprocessor) 
print("After preprocessing : ")
df.review.iloc[6]

After preprocessing : 


'sure would like see resurrect date seahunt seri tech today would bring back kid excit mei grew black white tv seahunt gunsmok hero everi weekyou vote comeback new sea huntw need chang pace tv would work world water adventureoh way thank outlet like view mani viewpoint tv mani moviesso ole way believ ive got wanna saywould nice read plu point sea huntif rhyme would line would let submitor leav doubt quitif must go let'
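
The output above contains merged tokens such as `mei` and `weekyou`. They appear because punctuation is deleted outright, so words separated only by a period ("me.I") are glued together. The sketch below compares that behaviour with one possible alternative, replacing punctuation with a space. The alternative is an assumption for illustration, not the approach used in this notebook, and `sample` is a shortened stand-in for the review.

```python
import re
import string

sample = "excitement in me.I grew up on TV.You have my vote"

# Notebook behaviour: punctuation characters are deleted outright.
delete_table = str.maketrans(dict.fromkeys(string.punctuation))
deleted = sample.translate(delete_table)

# Alternative (assumption): map each punctuation character to a space,
# then collapse repeated whitespace.
space_table = str.maketrans({ch: " " for ch in string.punctuation})
spaced = re.sub(r"\s+", " ", sample.translate(space_table)).strip()

print(deleted)
print(spaced)
```

The trade-off is that contractions are split as well: with the space-based table, "I've" becomes "I ve".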

### Train Test Split

Train set : 80% of data
Test set : 20% of data

In [20]:
X = df.review
y = df.sentiment

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1, stratify=y)

In [21]:
X_train.shape, X_test.shape

((40000,), (10000,))
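
To show what `stratify=y` is meant to achieve, here is a standard-library sketch of a stratified split. It is not scikit-learn's implementation; `stratified_split` and the toy labels are hypothetical names and data used only for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # take the same fraction of every class for the test set
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy, balanced labels standing in for the sentiment column.
labels = ["positive"] * 50 + ["negative"] * 50
train_idx, test_idx = stratified_split(labels, test_size=0.20, seed=1)

print(len(train_idx), len(test_idx))
print(Counter(labels[i] for i in test_idx))
```

Each class contributes the same fraction to the test set, so the positive/negative balance is the same in both parts.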

### Vectorization

Bag of Words Vectorizer

In [22]:
bow_vectorizer = CountVectorizer(max_features=10000)
bow_vectorizer.fit(X_train)

# transform
bow_X_train = bow_vectorizer.transform(X_train)
bow_X_test = bow_vectorizer.transform(X_test)
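
Conceptually, the fit and transform steps above build a vocabulary from the training reviews and then count how often each vocabulary term occurs in a review. The following is a simplified standard-library sketch of that idea, assuming whitespace tokenization and a toy corpus; it does not reproduce `CountVectorizer`'s tokenizer or its sparse output.

```python
from collections import Counter

# Toy training documents (already preprocessed and stemmed).
train_docs = ["good movi good act", "bad movi bad", "good stori"]
max_features = 3  # the notebook uses 10000

# "fit": count terms over the training documents and keep the most frequent ones.
term_counts = Counter(word for doc in train_docs for word in doc.split())
top_terms = [term for term, _ in term_counts.most_common(max_features)]
vocabulary = {term: col for col, term in enumerate(sorted(top_terms))}

def transform(doc):
    """Map a document to a count vector; words outside the vocabulary are ignored."""
    row = [0] * len(vocabulary)
    for word in doc.split():
        if word in vocabulary:
            row[vocabulary[word]] += 1
    return row

print(vocabulary)
print(transform("good good bad film"))
```

The vocabulary is learned from the training set only, which is why the notebook calls `fit` on `X_train` and merely `transform` on `X_test`.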

### Machine Learning Model : Logistic Regression

In [26]:
# Train a model, then report its accuracy on the train and test sets
def train_and_eval(model, trainX, trainY, testX, testY):

    # training
    _ = model.fit(trainX, trainY)

    # predictions
    y_preds_train = model.predict(trainX)
    y_preds_test = model.predict(testX)

    # evaluation (uses the labels passed as arguments, not the global y_train / y_test)
    print()
    print(model)
    print(f"Train accuracy score : {accuracy_score(trainY, y_preds_train)}")
    print(f"Test accuracy score : {accuracy_score(testY, y_preds_test)}")
    print('\n',40*'-')

In [27]:
# Hyperparameters
C = [0.001, 0.01, 0.1, 1, 10]

for c in C: 
    # Define model
    log_model = LogisticRegression(C=c, max_iter=500, random_state=1)
    
    # Train and evaluate model
    train_and_eval(model=log_model,
                   trainX=bow_X_train,
                   trainY=y_train,
                   testX=bow_X_test,
                   testY=y_test)


LogisticRegression(C=0.001, max_iter=500, random_state=1)
Train accuracy score : 0.868725
Test accuracy score : 0.8637

 ----------------------------------------

LogisticRegression(C=0.01, max_iter=500, random_state=1)
Train accuracy score : 0.905575
Test accuracy score : 0.8851

 ----------------------------------------

LogisticRegression(C=0.1, max_iter=500, random_state=1)
Train accuracy score : 0.940925
Test accuracy score : 0.8867

 ----------------------------------------

LogisticRegression(C=1, max_iter=500, random_state=1)
Train accuracy score : 0.969375
Test accuracy score : 0.8723

 ----------------------------------------

LogisticRegression(C=10, max_iter=500, random_state=1)
Train accuracy score : 0.991325
Test accuracy score : 0.8519

 ----------------------------------------


Best model : Logistic Regression (C=0.1) with Bag of Words, which reaches the highest test accuracy (0.8867).
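
The choice can be checked directly from the scores printed above. The sketch below copies those accuracies into a dictionary and picks the value of C with the highest test accuracy; the `results` structure is an assumption for illustration, not part of the original notebook.

```python
# Accuracies copied from the run above: C -> (train accuracy, test accuracy).
results = {
    0.001: (0.868725, 0.8637),
    0.01: (0.905575, 0.8851),
    0.1: (0.940925, 0.8867),
    1: (0.969375, 0.8723),
    10: (0.991325, 0.8519),
}

# Select the C with the best test accuracy.
best_c = max(results, key=lambda c: results[c][1])

# The train/test gap is a rough indicator of overfitting.
gaps = {c: train_acc - test_acc for c, (train_acc, test_acc) in results.items()}

for c, (train_acc, test_acc) in results.items():
    print(f"C={c:<6} train={train_acc:.4f} test={test_acc:.4f} gap={gaps[c]:.4f}")
print("best C:", best_c)
```

In scikit-learn, C is the inverse of the regularization strength, so larger values regularize less. That matches the pattern in these numbers: train accuracy keeps rising with C while test accuracy peaks at C=0.1 and then drops.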

In [32]:
bmodel = LogisticRegression(C=0.1, max_iter=500, random_state=1)
bmodel.fit(bow_X_train, y_train)

In [33]:
# predictions
y_preds_train = bmodel.predict(bow_X_train)
y_preds_test = bmodel.predict(bow_X_test)

In [44]:
print(f"Train accuracy score : {accuracy_score(y_train, y_preds_train)}")
print(f"Test accuracy score : {accuracy_score(y_test, y_preds_test)}")

Train accuracy score : 0.940925
Test accuracy score : 0.8867


In [45]:
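def plot_cm(y_true, y_pred):
    # class names, in the sorted order used by confusion_matrix
    class_names = np.unique(y_true)

    plt.figure(figsize=(6,6))
    
    # normalize='true' scales each row (true class) so that it sums to 1
    cm = confusion_matrix(y_true, y_pred, labels=class_names, normalize='true')
    
    sns.heatmap(
        cm, annot=True, cmap='Blues', cbar=False, fmt='.2f',
        xticklabels=class_names, yticklabels=class_names)
    
    plt.show()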
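
With `normalize='true'`, each row of the confusion matrix is divided by the number of samples of that true class, so the diagonal shows per-class recall. A small standard-library sketch of that normalization, using made-up labels and a hypothetical helper `normalized_confusion`:

```python
def normalized_confusion(y_true, y_pred, classes):
    """Row-normalized confusion matrix: rows are true classes, columns are predictions."""
    index = {label: i for i, label in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        counts[index[t]][index[p]] += 1
    # assumes every class occurs at least once in y_true (no empty rows)
    return [[c / sum(row) for c in row] for row in counts]

# Made-up labels for illustration only.
y_true = ["negative", "negative", "negative", "negative", "positive", "positive"]
y_pred = ["negative", "negative", "negative", "positive", "positive", "negative"]

cm = normalized_confusion(y_true, y_pred, ["negative", "positive"])
print(cm)
```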

Let's save our model and transformer

In [46]:
with open("transformer.pkl", "wb") as f:
    pickle.dump(bow_vectorizer, f)
    
with open("model.pkl", "wb") as f:
    pickle.dump(bmodel, f)
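
The saved files can be read back later with `pickle.load`. The sketch below shows the round trip on a stand-in dictionary written to a temporary directory; `artifact` and the file name `artifact.pkl` are assumptions for illustration and do not touch the files created above.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the vectorizer or the model.
artifact = {"vocabulary": {"bad": 0, "good": 1}, "C": 0.1}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "artifact.pkl")

    # save
    with open(path, "wb") as f:
        pickle.dump(artifact, f)

    # load
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == artifact)
```

Two cautions apply when doing this with real models: only unpickle files from a trusted source, and load scikit-learn objects with the same library version that saved them.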

In [53]:
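labels = ['Negative', 'Positive']
def get_sentiment(review):
    # preprocessing
    x = preprocessor(review)
    # vectorization: transform expects a list of documents and returns one row per document
    x = bow_vectorizer.transform([x])
    # prediction: an array with one label per row (no reshape needed)
    y = bmodel.predict(x)
    return y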

In [54]:
# positive review
review = "This chips packet is very tasty. I highly recommend this!"
print(f"This is a {get_sentiment(review)} review!")

This is a ['positive'] review!
