

# **MIS710 Lab 10 Week 10 Deployment Solution**
Author: Associate Professor Lemai Nguyen

Objectives:
1. To learn text analytics and NLP basics.
2. To apply these skills to the well-known IMDb movie review dataset compiled by Stanford researcher Andrew Maas and colleagues.
3. To learn basic MLOps: saving your model, then loading and using it later.
4. Optional: To apply the basic NLP skills to another review dataset.


Note: This notebook, MIS710_Lab10_NLP_Deployment, shows how to load the saved model and vectorizer and apply them to make predictions on new data.


# **1. Import libraries and functions**

In [42]:
# import libraries
import pandas as pd #for data manipulation and analysis
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

In [43]:
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix

# **2. Case One: IMDb**

**Sentiment analysis**

**Context**
IMDb stands for the Internet Movie Database, which is an online database of information related to films, television programs, and video games. It contains a vast collection of data on various aspects of the entertainment industry, including cast and crew information, production details, plot summaries, and user ratings and reviews.

**Content**
The IMDb dataset has been widely used in sentiment analysis research. The dataset contains 50,000 movie reviews. Each review is labeled as either "positive" or "negative" based on the overall sentiment expressed in the review.

The dataset consists of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided.

**Inspiration**
To train and test a sentiment analysis model

**Further information**:
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

http://ai.stanford.edu/~amaas/data/sentiment/


## **3. ML Operationalisation**

### **3.2 Load and use the model**

In [44]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [45]:
import pickle

In [46]:
#modify the path below to point to your pickled model
#or use my models from here https://github.com/VanLan0/MIS710-ML/tree/main/Pickled
path_model = '/content/drive/MyDrive/Colab Notebooks/MIS710 2023 T2/Week 10/IMDB_lr_clf.pkl'

In [47]:
#modify the path below to point to your pickled vectorizer
path_vectorizer = '/content/drive/MyDrive/Colab Notebooks/MIS710 2023 T2/Week 10/tfidf_vectorizer.pkl'

In [48]:
#Load the vectorizer and model
with open(path_vectorizer, 'rb') as f:
    loaded_vectorizer = pickle.load(f)

with open(path_model, 'rb') as f:
    loaded_lr = pickle.load(f)
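
The files loaded above must have been written earlier with a matching `pickle.dump` call. The training notebook is not shown here, so the following is only a minimal sketch of the save/load round trip. It uses a plain dictionary as a hypothetical stand-in for a fitted model or vectorizer, and a temporary file instead of Google Drive.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model or vectorizer (any picklable object works)
fitted_object = {"name": "toy_model", "coefficients": [0.4, -1.2, 0.7]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")

    # Save: open the file in binary write mode ('wb')
    with open(path, "wb") as f:
        pickle.dump(fitted_object, f)

    # Load: open the file in binary read mode ('rb'), as in the cell above
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == fitted_object)  # True
```

Only unpickle files from sources you trust, because loading a pickle can execute arbitrary code. For scikit-learn objects it is also safest to load them with the same library version that was used to save them.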


In [49]:
#check the size of the vocabulary
print("Size of vocabulary:", len(loaded_vectorizer.vocabulary_))

Size of vocabulary: 2776800


We load new reviews from the 'production line'.



In [50]:
url='https://raw.githubusercontent.com/VanLan0/MIS710-ML/main/Datasets/DrCha_reviews.csv'
new_reviews = pd.read_csv(url)
print(new_reviews)

                                               review sentiment
0   I just watched the first episode and it’s 10/1...  positive
1   this is going to be absolutely amazing I can't...  positive
2   Accidentally watched the first episode, it was...  positive
3                          This actress is hilarious!  positive
4                             Looks very entertaining  positive
5                         annoying unnecesarry  scene  negative
6                           I love this drama so much  positive
7   Too convoluted and over dramatic. the writers ...  negative
8   Really good to see Km Byung Chul in a more com...  positive
9                              disappointed, too long  negative
10  I'm addicted to this movie, but If she doesn't...  positive
11      I love it. Can't wait to see the next episode  positive
12  The movie really disappoint me, his mother wil...  negative
13  Loved the first 2 episodes and excited for the...  positive


In practice, we should create a data pipeline to automate data pre-processing. For now, let's repeat the pre-processing steps manually.

In [51]:
#import the Python module re to work with regular expressions
import re

In [52]:
# Define function to clean text
def clean_text(text):
  # Remove HTML tags
  text = re.sub(r'<.*?>', '', text)
  # Remove punctuation and special characters
  text = re.sub(r'[^\w\s]', '', text)
  # Remove extra whitespace
  text = re.sub(r'\s+', ' ', text).strip()
  return text

In [53]:
def lowercasing(text):
  # Convert to lowercase
  text = text.lower()
  return text

In [54]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords

In [55]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [56]:
# define stopwords without negation words
stop_words = set(stopwords.words('english'))
negation_words = {'no', 'not', 'nor', 'neither', 'none', 'never'}
filtered_words = [word for word in stop_words if word not in negation_words]

In [57]:
#define a function to perform tokenization, stopword removal and lemmatization
def tokenize_lemmatize(text):
  #tokenization
  tokens = nltk.word_tokenize(text.lower())

  #initialize the lemmatizer (no stemmer is used in this function)
  lemmatizer = WordNetLemmatizer()

  #lemmatize each token, dropping stopwords; note that the second condition also drops negation words
  lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in filtered_words and token.lower() not in negation_words]
  return ' '.join(lemmatized_tokens)
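
The filter in `tokenize_lemmatize` has two conditions, and their combined effect is easy to miss: `filtered_words` excludes the negation words, but the second condition removes them anyway. The sketch below illustrates this with a tiny hypothetical stopword list and `str.split` in place of the NLTK resources, so it is an approximation of the behaviour rather than the lab's exact pipeline.

```python
# Hypothetical miniature stopword list standing in for stopwords.words('english')
stop_words = {"i", "did", "this", "the", "not", "no"}
negation_words = {"no", "not", "nor", "neither", "none", "never"}
filtered_words = [word for word in stop_words if word not in negation_words]

# str.split is a simplified stand-in for nltk.word_tokenize
tokens = "i did not like this movie".split()

# Same filter condition as in tokenize_lemmatize: negation words are dropped too
as_in_lab = [t for t in tokens if t not in filtered_words and t not in negation_words]

# Variant with only the first condition: negation words are kept
keeps_negation = [t for t in tokens if t not in filtered_words]

print(as_in_lab)       # ['like', 'movie']
print(keeps_negation)  # ['not', 'like', 'movie']
```

Whichever variant is used, it should match the preprocessing applied when the vectorizer and model were trained. Otherwise the new text will not line up with the vocabulary the model learned.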

In [58]:
# Write your code to apply the clean_text function to the 'review' column
new_reviews['review']=new_reviews['review'].apply(clean_text)

In [59]:
# Write your code to apply the lowercasing function to the 'review' column

new_reviews['review']=new_reviews['review'].apply(lowercasing)

In [60]:
# Write your code to tokenize and lemmatize the 'review' column
processed_text = new_reviews['review'].apply(tokenize_lemmatize)


In [61]:
processed_text

0                    watched first episode 1010 im love
1     going absolutely amazing cant wait synopsis funny
2     accidentally watched first episode good couldn...
3                                     actress hilarious
4                                     look entertaining
5                            annoying unnecesarry scene
6                                       love drama much
7     convoluted dramatic writer could made lot stro...
8         really good see km byung chul comedic setting
9                                     disappointed long
10    im addicted movie doesnt divorce husband watch...
11                      love cant wait see next episode
12                  movie really disappoint mother pain
13              loved first 2 episode excited remainder
Name: review, dtype: object

In [62]:
#Use loaded vectorizer to vectorize the new processed text
X_new_tfidf = loaded_vectorizer.transform(processed_text)
y_test_new=new_reviews['sentiment']
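
Note that the cell above calls `transform`, not `fit_transform`: the vocabulary learned at training time stays fixed, and words that are not in it are ignored. The toy sketch below illustrates that idea with a hand-written word-to-column mapping and raw counts. It is a simplified stand-in, not the TF-IDF weighting that the loaded vectorizer applies, and the vocabulary shown is hypothetical.

```python
# Toy stand-in for a fitted vectorizer: a fixed word -> column index mapping
# (hypothetical; the real mapping lives in loaded_vectorizer.vocabulary_)
vocabulary = {"love": 0, "drama": 1, "disappointed": 2, "long": 3}

def transform(texts, vocabulary):
    """Count known words per text; words outside the vocabulary are ignored."""
    rows = []
    for text in texts:
        row = [0] * len(vocabulary)
        for token in text.split():
            col = vocabulary.get(token)
            if col is not None:
                row[col] += 1
        rows.append(row)
    return rows

X = transform(["love drama much", "disappointed long"], vocabulary)
print(X)  # [[1, 1, 0, 0], [0, 0, 1, 1]]
```

The word "much" is not in the toy vocabulary, so it contributes nothing. Refitting on the new reviews would produce a different set of columns, which would no longer match the features the saved model expects.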

In [63]:
# Predict the sentiment of the new reviews
y_pred = loaded_lr.predict(X_new_tfidf)

In [64]:
#join unseen y_test with predicted value into a data frame
inspection=pd.DataFrame({'Actual':new_reviews['sentiment'], 'Predicted':y_pred})

#join X_test with the new dataframe
inspection=pd.concat([new_reviews['review'],inspection], axis=1)

inspection

                                              review    Actual Predicted
0  i just watched the first episode and its 1010 ...  positive  positive
1  this is going to be absolutely amazing i cant ...  positive  positive
2  accidentally watched the first episode it was ...  positive  positive
3                          this actress is hilarious  positive  positive
4                            looks very entertaining  positive  positive
5                         annoying unnecesarry scene  negative  negative
6                          i love this drama so much  positive  positive
7  too convoluted and over dramatic the writers c...  negative  negative
8  really good to see km byung chul in a more com...  positive  positive
9                              disappointed too long  negative  negative


In [65]:
#print confusion matrix and evaluation report
y_test=new_reviews['sentiment']
cm=confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))

[[3 1]
 [1 9]]
              precision    recall  f1-score   support

    negative       0.75      0.75      0.75         4
    positive       0.90      0.90      0.90        10

    accuracy                           0.86        14
   macro avg       0.82      0.82      0.82        14
weighted avg       0.86      0.86      0.86        14
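
The figures in the report can be reproduced by hand from the confusion matrix. The sketch below assumes scikit-learn's convention that rows are actual classes and columns are predicted classes, in sorted label order (`negative`, then `positive`).

```python
# Confusion matrix from the output above: rows = actual, columns = predicted
cm = [[3, 1],
      [1, 9]]
labels = ["negative", "positive"]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

for i, label in enumerate(labels):
    true_pos = cm[i][i]
    predicted_as_label = sum(cm[r][i] for r in range(len(cm)))  # column sum
    actual_label = sum(cm[i])                                   # row sum
    precision = true_pos / predicted_as_label
    recall = true_pos / actual_label
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")

print(f"accuracy={accuracy:.2f}")
```

With only 14 reviews, and just 4 of them negative, a single misclassification shifts these scores considerably, so they should be read as a sanity check rather than a reliable performance estimate.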



You can test it on a new data set, but make sure you do the same preprocessing first.
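
One way to keep the preprocessing identical across data sets is to bundle the steps into a single function. The sketch below is one possible approach, not part of the original lab: `make_preprocessor` is a hypothetical helper name, and only the two regex-based steps are included so that the example runs without NLTK.

```python
import re
from functools import reduce

def clean_text(text):
    text = re.sub(r'<.*?>', '', text)         # remove HTML tags
    text = re.sub(r'[^\w\s]', '', text)       # remove punctuation and special characters
    return re.sub(r'\s+', ' ', text).strip()  # collapse extra whitespace

def lowercasing(text):
    return text.lower()

def make_preprocessor(*steps):
    """Return a function that applies each step, in order, to one string."""
    def preprocess(text):
        return reduce(lambda acc, step: step(acc), steps, text)
    return preprocess

preprocess = make_preprocessor(clean_text, lowercasing)
print(preprocess("I just watched it and it's 10/10!"))  # i just watched it and its 1010
```

In the notebook, `tokenize_lemmatize` could be passed as a third step, and the combined function applied with `new_reviews['review'].apply(preprocess)`. The example also shows why "10/10" appears as "1010" in the processed reviews: the punctuation step removes the slash.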

# **Congratulations**
Well done, you have loaded and used your first pre-trained model!


