

# **MIS710 Lab 10 Week 10 Deployment Solution**
Author: Associate Professor Lemai Nguyen

Objectives:
1. To learn text analytics and NLP basics
2. To apply these basic skills to the well-known IMDb (Internet Movie Database) review dataset compiled by Stanford researcher Andrew Maas and colleagues.
3. To learn basic MLOps: saving your model and loading and using it later.
4. Optional: To apply the basic NLP skills on another review dataset.


Note: This notebook (MIS710_Lab10_NLP_Deployment) shows how to load the saved model and vectorizer and apply them to make predictions on new data.


# **1. Import libraries and functions**

In [26]:
# import libraries
import pandas as pd #for data manipulation and analysis
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

In [27]:
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix

# **2. Case One: IMDb**

**Sentiment analysis**

**Context**
IMDb stands for the Internet Movie Database, which is an online database of information related to films, television programs, and video games. It contains a vast collection of data on various aspects of the entertainment industry, including cast and crew information, production details, plot summaries, and user ratings and reviews.

**Content**
The IMDb dataset has been widely used in sentiment analysis research. The dataset contains 50,000 movie reviews. Each review is labeled as either "positive" or "negative" based on the overall sentiment expressed in the review.

The dataset consists of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided.

**Inspiration**
To train and test a sentiment analysis model

**Further information**:
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

http://ai.stanford.edu/~amaas/data/sentiment/


## **2.1 Load the vectoriser and model**

In [28]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [29]:
import pickle

In [30]:
#modify the code below to point to your pickled model
#or use my models from here https://github.com/VanLan0/MIS710-ML/tree/main/Pickled
path_model = '/content/drive/MyDrive/Colab Notebooks/MIS710 2024 T1/Week 10/IMDB_lr_clf.pkl'

In [31]:
#modify the code below to point to your pickled vectorizer
path_vectorizer = '/content/drive/MyDrive/Colab Notebooks/MIS710 2024 T1/Week 10/tfidf_vectorizer.pkl'

In [32]:
#Load the vectorizer and model
with open(path_vectorizer, 'rb') as f:
    loaded_vectorizer = pickle.load(f)

with open(path_model, 'rb') as f:
    loaded_lr = pickle.load(f)


https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
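
The page linked above cautions that unpickling can run arbitrary code, so a pickle file should only be loaded when it comes from a trusted source. One simple safeguard is to record a checksum when the file is saved and verify it before loading. The sketch below is a minimal illustration and not part of the lab: `save_with_checksum` and `load_if_unchanged` are hypothetical helper names, and a plain dictionary stands in for the fitted model or vectorizer.

```python
import hashlib
import os
import pickle
import tempfile


def sha256_of(path):
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            digest.update(chunk)
    return digest.hexdigest()


def save_with_checksum(obj, path):
    """Pickle `obj` to `path` and return the checksum to store alongside it."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f)
    return sha256_of(path)


def load_if_unchanged(path, expected_checksum):
    """Unpickle `path` only if its checksum matches the recorded one."""
    if sha256_of(path) != expected_checksum:
        raise ValueError('Checksum mismatch: refusing to unpickle ' + path)
    with open(path, 'rb') as f:
        return pickle.load(f)


# Stand-in object (assumption); in the lab this would be the fitted model or vectorizer
stand_in_model = {'name': 'IMDB_lr_clf', 'classes': ['negative', 'positive']}

demo_path = os.path.join(tempfile.mkdtemp(), 'stand_in_model.pkl')
checksum = save_with_checksum(stand_in_model, demo_path)
restored = load_if_unchanged(demo_path, checksum)
print(restored == stand_in_model)
```

A checksum only detects that a file has changed since it was saved; it does not make a pickle from an unknown source safe to load.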


In [33]:
#check the size of the vocabulary
print("Size of vocabulary:", len(loaded_vectorizer.vocabulary_))

Size of vocabulary: 2778160


## **2.2 Load data and process the unstructured data**

**Now load new reviews from the 'production line'. For you to test the model, we provide the labels. In reality, you don't have the labels for the real production line data.**



In [34]:
url='https://raw.githubusercontent.com/VanLan0/MIS710-ML/main/Datasets/DrCha_reviews2024.csv'
new_reviews = pd.read_csv(url)
print(new_reviews)

                                                review sentiment
0    This show is a breath of fresh air, really enj...  positive
1    The acting is top-notch and the storyline has ...  positive
2    Absolutely loved the first episode, it set a g...  positive
3    The creativity and originality in this show ar...  positive
4    The acting is top-notch and the storyline has ...  positive
..                                                 ...       ...
99   Outstanding performance by the lead, truly a s...  positive
100  Absolutely loved the first episode, it set a g...  positive
101  Unfortunately, it didn't live up to the hype, ...  negative
102  Outstanding performance by the lead, truly a s...  positive
103  Unfortunately, it didn't live up to the hype, ...  negative

[104 rows x 2 columns]


In practice, we should create a data pipeline to automate data preprocessing. For now, let's repeat the preprocessing steps manually.
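
As a minimal sketch of such a pipeline, the individual steps can be chained into one function so that exactly the same sequence is applied at training and at deployment time. The names below are assumptions for illustration: `compose_steps` is a hypothetical helper, and the two step functions are simplified stand-ins rather than the lab's own cleaning functions.

```python
import re
from functools import reduce


def compose_steps(*steps):
    """Return a function that applies each step to the text, in order."""
    def run(text):
        return reduce(lambda current, step: step(current), steps, text)
    return run


# Simplified stand-in steps (assumptions), each taking and returning a string
def collapse_whitespace(text):
    return re.sub(r'\s+', ' ', text).strip()


def to_lowercase(text):
    return text.lower()


preprocess = compose_steps(collapse_whitespace, to_lowercase)
print(preprocess('  Great   show,\nLOVED it!  '))
```

Because the composed function takes one string and returns one string, it can be passed to the `apply` method of a pandas column in a single call.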

In [35]:
#import the Python module re to work with regular expressions
import re

In [36]:
# Define function to clean text
def clean_text(text):
  # Remove HTML tags
  text = re.sub(r'<.*?>', '', text)
  # Remove punctuation and special characters
  text = re.sub(r'[^\w\s]', '', text)
  # Remove extra whitespace
  text = re.sub(r'\s+', ' ', text).strip()
  return text
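
To see what the three substitutions do in practice, the snippet below runs them one by one on a made-up review (the sample string is an assumption, not taken from the dataset). It also shows a side effect worth knowing: removing punctuation merges contractions and hyphenated words, which is why tokens such as `didnt` and `topnotch` appear in the processed text.

```python
import re

# Made-up sample review (assumption)
sample = "<br />I didn't expect much, but the acting is top-notch!!   "

no_tags = re.sub(r'<.*?>', '', sample)            # drop HTML tags
no_punct = re.sub(r'[^\w\s]', '', no_tags)        # drop punctuation and symbols
cleaned = re.sub(r'\s+', ' ', no_punct).strip()   # collapse runs of whitespace

print(cleaned)
```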

In [37]:
def lowercasing(text):
  # Convert to lowercase
  text = text.lower()
  return text

In [38]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords

In [39]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [40]:
# define stopwords without negation words
stop_words = set(stopwords.words('english'))
negation_words = {'no', 'not', 'nor', 'neither', 'none', 'never'}
filtered_words = [word for word in stop_words if word not in negation_words]

In [41]:
#define a function to perform tokenization and lemmatization (no stemming is applied)
def tokenize_lemmatize(text):
  #tokenization
  tokens = nltk.word_tokenize(text.lower())

  #initialize the lemmatizer
  lemmatizer = WordNetLemmatizer()

  #drop stopwords, then lemmatize the remaining tokens
  #note: the second condition also drops the negation words themselves;
  #keep this logic identical to the preprocessing used when the vectorizer was fitted
  lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in filtered_words and token.lower() not in negation_words]
  return ' '.join(lemmatized_tokens)
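
Because the filter in `tokenize_lemmatize` checks two conditions, it is worth confirming which tokens actually survive. The sketch below reproduces only the filtering logic, using a tiny hand-written stopword set (an assumption standing in for NLTK's English list, so that it runs without downloads) and plain `split()` in place of `word_tokenize`. It compares the condition as written with a variant that keeps negation words.

```python
# Tiny stand-in for NLTK's English stopword list (assumption)
stop_words = {'the', 'is', 'a', 'it', 'not', 'no'}
negation_words = {'no', 'not', 'nor', 'neither', 'none', 'never'}
filtered_words = [word for word in stop_words if word not in negation_words]

tokens = 'the plot is not a surprise'.split()

# Condition as written in tokenize_lemmatize: negation words are dropped as well
as_written = [t for t in tokens if t not in filtered_words and t not in negation_words]

# Variant (for comparison only): negation words are kept
keeps_negation = [t for t in tokens if t not in filtered_words]

print(as_written)
print(keeps_negation)
```

Whichever variant is used, it has to match the preprocessing applied when the vectorizer was fitted; otherwise the features seen at prediction time will differ from those the model was trained on.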

In [42]:
##write code to apply the clean_text function to the 'review' column
new_reviews['review']=new_reviews['review'].apply(clean_text)

In [43]:
##write code to apply the lowercasing function to the 'review' column

new_reviews['review']=new_reviews['review'].apply(lowercasing)

In [44]:
##write code to tokenize and lemmatize the 'review' column
processed_text = new_reviews['review'].apply(tokenize_lemmatize)


In [45]:
processed_text

Unnamed: 0,review
0,show breath fresh air really enjoyed unique humor
1,acting topnotch storyline hooked already
2,absolutely loved first episode set great tone ...
3,creativity originality show amazing
4,acting topnotch storyline hooked already
...,...
99,outstanding performance lead truly standout
100,absolutely loved first episode set great tone ...
101,unfortunately didnt live hype felt quite disap...
102,outstanding performance lead truly standout


In [46]:
##write code to use loaded vectorizer to vectorize the new processed text
X_new_tfidf = loaded_vectorizer.transform(processed_text)


## **2.3 Use the model to make predictions**

In [47]:
##write code to make predictions on new data in the 'production line'
y_pred = loaded_lr.predict(X_new_tfidf)

## **2.4 Monitor the model performance**

Let's assume we have people (experts) who label the data so that we can evaluate the results.
In this case, we used ChatGPT to label the data. Do you see any problems?

In [48]:
##write code to get the actual labels from the 'sentiment' column
y_test_new=new_reviews['sentiment']

In [49]:
##write code to join unseen y_test 'sentiment' with predicted value into a data frame
inspection=pd.DataFrame({'Actual':y_test_new, 'Predicted':y_pred})

##write code to join X_test with the new dataframe
inspection=pd.concat([new_reviews['review'],inspection], axis=1)

inspection

Unnamed: 0,review,Actual,Predicted
0,this show is a breath of fresh air really enjo...,positive,positive
1,the acting is topnotch and the storyline has m...,positive,positive
2,absolutely loved the first episode it set a gr...,positive,positive
3,the creativity and originality in this show ar...,positive,positive
4,the acting is topnotch and the storyline has m...,positive,positive
...,...,...,...
99,outstanding performance by the lead truly a st...,positive,positive
100,absolutely loved the first episode it set a gr...,positive,positive
101,unfortunately it didnt live up to the hype fel...,negative,negative
102,outstanding performance by the lead truly a st...,positive,positive


In [50]:
##write code to print confusion matrix and evaluation report
y_test=new_reviews['sentiment']
cm=confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))

[[23  6]
 [ 1 74]]
              precision    recall  f1-score   support

    negative       0.96      0.79      0.87        29
    positive       0.93      0.99      0.95        75

    accuracy                           0.93       104
   macro avg       0.94      0.89      0.91       104
weighted avg       0.93      0.93      0.93       104
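
The figures in the report can be reproduced by hand from the confusion matrix. In scikit-learn's layout, rows are actual classes and columns are predicted classes, with labels in sorted order (negative, then positive). The short calculation below is a worked check that uses only the four counts printed above; the variable names are illustrative.

```python
# Confusion matrix from the output above: rows = actual, columns = predicted
cm = [[23, 6],   # actual negative: 23 predicted negative, 6 predicted positive
      [1, 74]]   # actual positive: 1 predicted negative, 74 predicted positive

labels = ['negative', 'positive']
total = sum(sum(row) for row in cm)

for i, label in enumerate(labels):
    true_pos = cm[i][i]                        # correctly predicted as this class
    predicted_as_label = cm[0][i] + cm[1][i]   # column sum
    actual_label = sum(cm[i])                  # row sum (support)
    precision = true_pos / predicted_as_label
    recall = true_pos / actual_label
    f1 = 2 * precision * recall / (precision + recall)
    print(f'{label}: precision={precision:.2f} recall={recall:.2f} '
          f'f1={f1:.2f} support={actual_label}')

accuracy = (cm[0][0] + cm[1][1]) / total
print(f'accuracy={accuracy:.2f}')
```

Recall for the negative class (23 of 29) is the weakest figure: six negative reviews were predicted as positive.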



You can test the model on another dataset, but make sure you apply the same preprocessing first.
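
One way to make sure the same preprocessing is always applied is to wrap the whole sequence in a single helper. The sketch below is a hypothetical `predict_sentiment` function: it assumes a `preprocess` callable and objects that expose `transform` and `predict`, as the loaded vectorizer and model do. The two tiny stand-in classes are included only so that the example runs on its own.

```python
def predict_sentiment(texts, preprocess, vectorizer, model):
    """Preprocess raw texts, vectorize them, and return the model's predictions."""
    processed = [preprocess(text) for text in texts]
    features = vectorizer.transform(processed)
    return list(model.predict(features))


# Stand-ins (assumptions) so the sketch is runnable without the pickled files
class KeywordVectorizer:
    def transform(self, texts):
        # one feature per text: 1 if the word 'loved' is present, else 0
        return [[1 if 'loved' in text.split() else 0] for text in texts]


class ThresholdModel:
    def predict(self, features):
        return ['positive' if row[0] == 1 else 'negative' for row in features]


predictions = predict_sentiment(
    ['LOVED it', 'Felt disappointing'],
    preprocess=str.lower,
    vectorizer=KeywordVectorizer(),
    model=ThresholdModel(),
)
print(predictions)
```

In this notebook the call would take `loaded_vectorizer` and `loaded_lr` as the last two arguments, with `preprocess` being a function that chains `clean_text`, `lowercasing`, and `tokenize_lemmatize`.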

# **Congratulations**
Well done, you have loaded and used your first pre-trained model!


