
# **MIS710 Week 11 Deployment Solution - Self Study**
Author: Associate Professor Lemai Nguyen

Objectives:
1. To practice text analytics and NLP basics
2. To learn basic MLOps: loading and using a pickled model.
3. To load and use a pickled data pipeline.

**Self-study**


In [65]:
# import libraries
import pandas as pd #for data manipulation and analysis
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

In [66]:
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix

# **Case One: ChatGPT Tweets**

**Sentiment analysis**


**Context**

ChatGPT reached 100 million users just two months after its launch. There is a huge debate on general trends and concerns surrounding AI and language models, and a diverse range of opinions and viewpoints is emerging.

**Content**

The lab dataset is a cut-down version of a Kaggle dataset of 100,000 English-language tweets containing the word "chatgpt", posted between 2023-03-18 and 2023-03-21.

The lab dataset has two columns: the processed tweet (`processed_tweet`) and its sentiment label (`sentiment_label`).

**Inspiration**

Analysing public sentiment through datasets like Tweets can provide valuable insights into the opinions and attitudes towards ChatGPT. It's not uncommon for opinions to be divided or for individuals to have mixed feelings about a new technology or innovation.

**Data source**:

https://www.kaggle.com/datasets/sanlian/tweets-about-chatgpt-march-2023


# **1. Use the pickled model and vectorizer**

### **1.1 Load the vectorizer and model**

In [67]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [68]:
import pickle

In [69]:
# Define the paths. Modify them to point to where you stored the pickled files
path_vectorizer = '/content/drive/MyDrive/Colab Notebooks/MIS710 2023 T2/Week 11/ChatGPT_tfidf_vectorizer.pickle'
path_model = '/content/drive/MyDrive/Colab Notebooks/MIS710 2023 T2/Week 11/ChatGPT_lr_clf.pickle'

In [70]:
# Modify the code below to point to your vocabulary
# Load the pickled vectorizer learned from the training data
with open(path_vectorizer, 'rb') as file:
    loaded_vectorizer = pickle.load(file)

In [71]:
#load the model
with open(path_model, 'rb') as f:
  loaded_model = pickle.load(f)


In [72]:
#check the size of the vocabulary
print("Size of vocabulary:", len(loaded_vectorizer.vocabulary_))

Size of vocabulary: 421877
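To make the load step less of a black box, here is a minimal sketch of a pickle round trip on a tiny vectorizer. The corpus `toy_tweets` and all variable names are made up for illustration; the real vectorizer was fitted on the training data, which is not shown in this notebook.

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the training tweets (hypothetical data)
toy_tweets = ["chatgpt is amazing", "chatgpt is not reliable", "i love this tool"]

# Fit a vectorizer, then serialise and restore it in memory
toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(toy_tweets)
restored_vectorizer = pickle.loads(pickle.dumps(toy_vectorizer))

# The restored object carries the same learned vocabulary
same_vocab = restored_vectorizer.vocabulary_ == toy_vectorizer.vocabulary_

# It also produces the same features for unseen text
new_text = ["chatgpt is a reliable tool"]
same_features = (
    restored_vectorizer.transform(new_text).toarray()
    == toy_vectorizer.transform(new_text).toarray()
).all()

print(same_vocab, same_features)
```

The point of the sketch is that unpickling restores the fitted state (vocabulary and IDF weights), so new text must only be passed to `transform`, never to `fit` or `fit_transform`. Pickled scikit-learn objects are generally best loaded with the same library version that created them.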


### **1.2 Load new dataset and pre-process it**

We load new tweets from the 'production line'.



In [73]:
url='/content/drive/MyDrive/Colab Notebooks/MIS710/unseen-chatgpt-tweets.csv'
new_tweets = pd.read_csv(url)
print(new_tweets)

                                         processed_tweet sentiment_label
0      life is too short to waste time worrying about...        positive
1      don‚Äôt have the spaces link just yet but soon...         neutral
2      actually the idea of mixing chatgpt with neura...        negative
3      gotta think gigs like chatterbox williams has ...        negative
4      the ceo behind the company that created believ...        negative
...                                                  ...             ...
18253  sorry my phones on da blink n my dyslexic mess...        negative
18254  if someone were to tell a future version of ch...         neutral
18255  chatgpt can talk to your aws infrastructure fo...         neutral
18256  chatgpt is not yet a human, but some humans ha...         neutral
18257  one hopes that it knows about nonbinary logic....         neutral

[18258 rows x 2 columns]


In practice, we should create a data pipeline to automate data pre-processing. For now, let's repeat the pre-processing steps manually.
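As a rough idea of what such a pipeline could look like, the sketch below chains a cleaning function, a TF-IDF vectorizer and a logistic regression classifier into one scikit-learn `Pipeline`. It is an assumption-laden illustration, not the pipeline used in this unit: `simple_preprocess`, the toy training data and `sketch_pipeline` are hypothetical names invented for the example.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def simple_preprocess(text):
    # Hypothetical, simplified stand-in for the cleaning steps in this notebook
    text = re.sub(r'<.*?>', '', text)      # remove HTML tags
    text = re.sub(r'[^\w\s]', '', text)    # remove punctuation
    return re.sub(r'\s+', ' ', text).strip().lower()

# Toy training data (made up for illustration only)
train_texts = ["I love ChatGPT!", "ChatGPT is great",
               "I hate ChatGPT...", "ChatGPT is awful"]
train_labels = ["positive", "positive", "negative", "negative"]

# One object that cleans, vectorizes and classifies raw text
sketch_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(preprocessor=simple_preprocess)),
    ('clf', LogisticRegression()),
])
sketch_pipeline.fit(train_texts, train_labels)

# Raw, uncleaned text can be passed straight to predict
sketch_predictions = sketch_pipeline.predict(["<b>ChatGPT</b> is   great!!!"])
print(sketch_predictions)
```

With this design the same pre-processing is applied at training and prediction time, which removes the risk of the manual steps drifting apart.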

In [74]:
# Drop rows with missing data in the 'processed_tweet' column
new_tweets.dropna(subset=['processed_tweet'], inplace=True)

### **1.3 Pre-process Text**

In [75]:
#import the Python module re to work with regular expressions
import re

In [76]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords

In [77]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [78]:
# Define function to clean text
def clean_text(text):
  # Remove HTML tags
  text = re.sub(r'<.*?>', '', text)
  # Remove punctuation and special characters
  text = re.sub(r'[^\w\s]', '', text)
  # Remove extra whitespace
  text = re.sub(r'\s+', ' ', text).strip()
  return text

In [79]:
def lowercasing(text):
  # Convert to lowercase
  text = text.lower()
  return text
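A quick way to see what these two steps do is to run them on a single string. The sketch below repeats `clean_text` so that it is self-contained; the `sample` string is made up for illustration.

```python
import re

def clean_text(text):
    text = re.sub(r'<.*?>', '', text)         # remove HTML tags
    text = re.sub(r'[^\w\s]', '', text)       # remove punctuation and special characters
    return re.sub(r'\s+', ' ', text).strip()  # collapse extra whitespace

sample = "<b>ChatGPT</b> isn't   bad!!! #AI"
cleaned = clean_text(sample).lower()
print(cleaned)  # chatgpt isnt bad ai
```

Note that the apostrophe is removed as punctuation, so "isn't" becomes "isnt" rather than "is not".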

### **1.4 Linguistically pre-process Text**

In [80]:
# define stopwords without negation words
stop_words = set(stopwords.words('english'))
negation_words = {'no', 'not', 'nor', 'neither', 'none', 'never'}
filtered_words = [word for word in stop_words if word not in negation_words]

In [81]:
#define a function to perform tokenization and lemmatization
def tokenize_lemmatize(text):
  #tokenization
  tokens = nltk.word_tokenize(text.lower())

  #initialize a lemmatizer
  lemmatizer = WordNetLemmatizer()

  #perform lemmatization, skipping stop words
  #note: the second condition also drops negation words, even though filtered_words excludes them
  lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in filtered_words and token.lower() not in negation_words]
  return ' '.join(lemmatized_tokens)
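The filter condition in `tokenize_lemmatize` is worth a closer look. The sketch below isolates it using a hypothetical mini stop-word list (`mini_stop_words`) in place of NLTK's English list, so it runs without any downloads.

```python
# Hypothetical mini stop-word list standing in for NLTK's English list
mini_stop_words = {'is', 'a', 'but', 'not', 'some', 'have'}
negation_words = {'no', 'not', 'nor', 'neither', 'none', 'never'}
filtered_words = {w for w in mini_stop_words if w not in negation_words}

tokens = "chatgpt is not yet a human".split()

# Condition as written in tokenize_lemmatize: negation words are dropped too
as_written = [t for t in tokens
              if t not in filtered_words and t not in negation_words]

# Variant that checks only filtered_words: negation words survive
keeping_negations = [t for t in tokens if t not in filtered_words]

print(as_written)         # ['chatgpt', 'yet', 'human']
print(keeping_negations)  # ['chatgpt', 'not', 'yet', 'human']
```

Whichever variant is used, the text fed to the loaded vectorizer should be pre-processed the same way as the training text was; switching to the second variant only at prediction time would create a mismatch with the pickled model.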

In [82]:
# Apply the clean_text function to the 'processed_tweet' column
new_tweets['processed_tweet']=new_tweets['processed_tweet'].apply(clean_text)

In [83]:
# Apply the lowercasing function to the 'processed_tweet' column

new_tweets['processed_tweet']=new_tweets['processed_tweet'].apply(lowercasing)

In [84]:
# Tokenize and lemmatize the 'processed_tweet' column
processed_text = new_tweets['processed_tweet'].apply(tokenize_lemmatize)


In [85]:
processed_text

0        life short waste time worrying others think fo...
1        donäôt space link yet soon also official littl...
2        actually idea mixing chatgpt neuralink somethi...
3        got ta think gig like chatterbox williams firs...
4        ceo behind company created belief come real da...
                               ...                        
18253    sorry phone da blink n dyslexic message hopele...
18254    someone tell future version chat gpt spin webs...
18255            chatgpt talk aws infrastructure footprint
18256    chatgpt yet human human already become languag...
18257    one hope know nonbinary logic first conversati...
Name: processed_tweet, Length: 18257, dtype: object

In [86]:
X=processed_text
y=new_tweets.sentiment_label

### **1.5 Use the loaded vectorizer and model**

In [87]:
X_vec=loaded_vectorizer.transform(X)

In [88]:
# Predict sentiment labels for the unseen tweets
y_pred = loaded_model.predict(X_vec)

### **1.6 Evaluate performance**
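In production, new data normally arrives without labels, so the results cannot be evaluated in this way. Here the unseen dataset includes sentiment labels, so we can compare them with the predictions.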

In [89]:
#combine the actual labels of the unseen tweets with the predicted labels in a data frame
inspection=pd.DataFrame({'Actual':new_tweets['sentiment_label'], 'Predicted':y_pred})

#add the cleaned tweet text alongside the actual and predicted labels
inspection=pd.concat([new_tweets['processed_tweet'],inspection], axis=1)

inspection

                                         processed_tweet    Actual  Predicted
0      life is too short to waste time worrying about...  positive   positive
1      donäôt have the spaces link just yet but soon ...   neutral    neutral
2      actually the idea of mixing chatgpt with neura...  negative   positive
3      gotta think gigs like chatterbox williams has ...  negative   negative
4      the ceo behind the company that created believ...  negative   negative
...                                                  ...       ...        ...
18253  sorry my phones on da blink n my dyslexic mess...  negative    neutral
18254  if someone were to tell a future version of ch...   neutral    neutral
18255  chatgpt can talk to your aws infrastructure fo...   neutral    neutral
18256  chatgpt is not yet a human but some humans hav...   neutral    neutral


In [90]:
#print confusion matrix and evaluation report
cm=confusion_matrix(new_tweets['sentiment_label'], y_pred)
print(cm)
print(classification_report(new_tweets['sentiment_label'], y_pred))

[[2284 1439  334]
 [ 830 7759  840]
 [ 274 1222 3275]]
              precision    recall  f1-score   support

    negative       0.67      0.56      0.61      4057
     neutral       0.74      0.82      0.78      9429
    positive       0.74      0.69      0.71      4771

    accuracy                           0.73     18257
   macro avg       0.72      0.69      0.70     18257
weighted avg       0.73      0.73      0.73     18257
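The figures in the report can be checked by hand from the confusion matrix, where rows are actual labels and columns are predicted labels, both in alphabetical order. A small sketch using only the numbers printed above (variable names are illustrative):

```python
labels = ['negative', 'neutral', 'positive']

# Confusion matrix copied from the output above: rows = actual, columns = predicted
cm_values = [[2284, 1439,  334],
             [ 830, 7759,  840],
             [ 274, 1222, 3275]]

total = sum(sum(row) for row in cm_values)
accuracy_from_cm = sum(cm_values[i][i] for i in range(3)) / total

per_class = {}
for i, label in enumerate(labels):
    tp = cm_values[i][i]                               # correctly predicted as this class
    recall = tp / sum(cm_values[i])                    # share of actual class found (row)
    precision = tp / sum(row[i] for row in cm_values)  # share of predictions correct (column)
    f1 = 2 * precision * recall / (precision + recall)
    per_class[label] = {'precision': precision, 'recall': recall, 'f1': f1}
    print(f"{label:>8}  precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")

print(f"accuracy={accuracy_from_cm:.2f}")
```

For example, 2284 of the 4057 actual negative tweets were predicted as negative, giving a recall of about 0.56; most of the missed ones (1439) were predicted as neutral.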



## **Congratulations**
Well done, you have loaded and used another pre-trained model!

# **2. Use the pickled pipeline**

## **2.1 Define the same environment**

If a pickled pipeline refers to custom functions such as `preprocess_text`, pickle stores only a reference to them by name, not their code. The functions and the objects they rely on (for example `stop_words` and `lemmatizer`) must therefore be defined before the pipeline is loaded.

In [91]:
# Import necessary libraries
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Download required NLTK resources (one-time setup)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Set up the stop-word list and lemmatizer used by preprocess_text
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [92]:
# Define function to clean text
def clean_text(text):
  # Remove HTML tags
  text = re.sub(r'<.*?>', '', text)
  # Remove punctuation and special characters
  text = re.sub(r'[^\w\s]', '', text)
  # Remove extra whitespace
  text = re.sub(r'\s+', ' ', text).strip()
  return text

def preprocess_text(text):
    prep_txt = clean_text(text)
    tokens = word_tokenize(prep_txt.lower())
    filtered_tokens = [lemmatizer.lemmatize(token) for token in tokens if token.isalnum() and token not in stop_words]
    processed_text = ' '.join(filtered_tokens)
    return processed_text


## **2.2 Load the pickled pipeline**

In [93]:
import pickle

In [94]:
# Define the path
path_pipeline= '/content/drive/MyDrive/Colab Notebooks/MIS710 2023 T2/Week 11/ChatGPT_pipeline.pickle'

In [95]:
with open(path_pipeline, 'rb') as f:
    loaded_pipeline = pickle.load(f)

## **2.3 Load the dataset**

In [96]:
url='https://raw.githubusercontent.com/VanLan0/MIS710-ML/main/Datasets/unseen-chatgpt-tweets.csv'
new_tweets = pd.read_csv(url)
print(new_tweets)

                                         processed_tweet sentiment_label
0      life is too short to waste time worrying about...        positive
1      don‚Äôt have the spaces link just yet but soon...         neutral
2      actually the idea of mixing chatgpt with neura...        negative
3      gotta think gigs like chatterbox williams has ...        negative
4      the ceo behind the company that created believ...        negative
...                                                  ...             ...
18253  sorry my phones on da blink n my dyslexic mess...        negative
18254  if someone were to tell a future version of ch...         neutral
18255  chatgpt can talk to your aws infrastructure fo...         neutral
18256  chatgpt is not yet a human, but some humans ha...         neutral
18257  one hopes that it knows about nonbinary logic....         neutral

[18258 rows x 2 columns]


In [97]:
# Drop rows with missing data
new_tweets.dropna(subset=['processed_tweet'], inplace=True)

## **2.4 Make predictions**

In [98]:
predictions = loaded_pipeline.predict(new_tweets.processed_tweet)

## **2.5 Evaluate performance**

In [99]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix

In [100]:
accuracy = accuracy_score(new_tweets.sentiment_label, predictions)
print("Accuracy:", accuracy)

Accuracy: 0.7320479815961001


In [101]:
#print confusion matrix and evaluation report
cm=confusion_matrix(new_tweets['sentiment_label'], predictions)
print(cm)
print(classification_report(new_tweets['sentiment_label'], predictions))

[[2222 1556  279]
 [ 712 8068  649]
 [ 258 1438 3075]]
              precision    recall  f1-score   support

    negative       0.70      0.55      0.61      4057
     neutral       0.73      0.86      0.79      9429
    positive       0.77      0.64      0.70      4771

    accuracy                           0.73     18257
   macro avg       0.73      0.68      0.70     18257
weighted avg       0.73      0.73      0.73     18257
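The two approaches in this notebook can be compared directly from their printed confusion matrices. The sketch below is a plain-Python illustration; `summarise`, `cm_separate` and `cm_pipeline` are hypothetical names, and the numbers are copied from the outputs in sections 1.6 and 2.5.

```python
def summarise(cm_rows):
    """Return (accuracy, macro-averaged recall) for a square confusion matrix."""
    n = len(cm_rows)
    total = sum(sum(row) for row in cm_rows)
    accuracy = sum(cm_rows[i][i] for i in range(n)) / total
    macro_recall = sum(cm_rows[i][i] / sum(cm_rows[i]) for i in range(n)) / n
    return accuracy, macro_recall

# Section 1: separately pickled vectorizer and model
cm_separate = [[2284, 1439, 334], [830, 7759, 840], [274, 1222, 3275]]
# Section 2: pickled pipeline
cm_pipeline = [[2222, 1556, 279], [712, 8068, 649], [258, 1438, 3075]]

acc_separate, recall_separate = summarise(cm_separate)
acc_pipeline, recall_pipeline = summarise(cm_pipeline)

print(f"separate: accuracy={acc_separate:.4f}  macro recall={recall_separate:.2f}")
print(f"pipeline: accuracy={acc_pipeline:.4f}  macro recall={recall_pipeline:.2f}")
```

Overall accuracy is almost identical, but the pipeline predicts neutral more often, which raises neutral recall and lowers recall for the negative and positive classes.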



## **Congratulations**
Well done, you have loaded and used a pickled pipeline!


