<a href="https://colab.research.google.com/github/PrincetonUniversity/intro_machine_learning/blob/main/day5/computer_vision_hackathon/day5_nlp_movie_reviews_notebook2_SOLUTION_and_llm_comparison.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Machine Learning  
**Natural Language Processing Hackathon: Hackathon Solution  
Wintersession 2023  
Tuesday, January 24, 2023**

The material here is based on Chapter 8 of *Machine Learning with PyTorch and Scikit-Learn* by Sebastian Raschka, Yuxi (Hayden) Liu, Vahid Mirjalili and Dmytro Dzhulgakov. The book is available via the PU library.

In this notebook we are going to work with a dataset of 50,000 movie reviews from the Internet Movie Database (IMDb) and build a predictor that can distinguish between positive and negative reviews.

In [1]:
import re
import textwrap
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

# Download Data and Make Dataframe

Download the data:

In [2]:
!wget https://tigress-web.princeton.edu/~jdh4/movie_data.csv

--2023-01-24 17:31:21--  https://tigress-web.princeton.edu/~jdh4/movie_data.csv
Resolving tigress-web.princeton.edu (tigress-web.princeton.edu)... 128.112.173.29
Connecting to tigress-web.princeton.edu (tigress-web.princeton.edu)|128.112.173.29|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 65862309 (63M) [text/csv]
Saving to: ‘movie_data.csv’


2023-01-24 17:31:22 (68.0 MB/s) - ‘movie_data.csv’ saved [65862309/65862309]



Read in the CSV file and print the first 5 rows of the Pandas dataframe:

In [3]:
df = pd.read_csv('movie_data.csv', encoding='utf-8')
df.head(5)

Unnamed: 0,review,sentiment
0,"This is one of the most calming, relaxing, and...",1
1,I saw this movie on Mystery Science Theater 30...,0
2,"Produced by International Playhouse Pictures, ...",0
3,Just saw Baby Blue Marine again after 30 years...,1
4,I gave this movie a rating of 1 because it is ...,0


In [4]:
df["raw-review"] = df["review"]

In [5]:
def remove_html_tags(text):
    """Remove HTML tags from a string."""
    # Non-greedy match so that each tag is removed separately;
    # re is already imported at the top of the notebook.
    return re.sub(r'<.*?>', '', text)

In [6]:
remove_html_tags('What is <b>this</b>, said the toad? Where is <p class="new">the time</a> probe?')

'What is this, said the toad? Where is the time probe?'

In [7]:
df["raw-review"] = df["raw-review"].apply(remove_html_tags)

Change the value of idx to vary the amount of train and test data. The default value is 25000, which gives a 50/50 split of the 50,000 reviews.

# Preprocessing and Train-Test Split

In [8]:
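def preprocessor(text):
    # Remove HTML tags.
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D before punctuation is stripped.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, collapse non-word characters to spaces, then append the emoticons.
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', ''))
    return text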

In [9]:
df['review'] = df['review'].apply(preprocessor)
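
To make the effect of `preprocessor` concrete, here is a small sketch that applies the same three steps to a made-up string (the sample text is an assumption for illustration, not a row from the dataset). HTML markup is dropped, emoticons are collected first and appended at the end without their "-" nose, and everything else is lowercased with runs of non-word characters collapsed to a single space.

```python
import re

def preprocessor(text):
    # Strip HTML tags such as <br /> or </a>.
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons before the punctuation is removed.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, collapse non-word runs to one space, then append the
    # emoticons (with the "-" nose removed) to the end of the string.
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text

sample = "</a>This :) is :( a test :-)!"  # hypothetical input
cleaned = preprocessor(sample)
print(cleaned)  # this is a test :) :( :)
```

Note that the emoticons end up at the end of the string, so their original position in the review is lost; for a bag-of-words model this does not matter.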

In [10]:
idx = 25000
X_train = df.loc[:idx - 1, 'review'].values
y_train = df.loc[:idx - 1, 'sentiment'].values
X_test  = df.loc[idx:, 'review'].values
y_test  = df.loc[idx:, 'sentiment'].values

In [11]:
def tokenizer(text):
    return text.split()

In [12]:
porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [13]:
nltk.download('stopwords')
stop = stopwords.words("english")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# Preprocessing and Training Pipeline

In [14]:
tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None)
param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop],
               'vect__tokenizer': [tokenizer],
               'vect__use_idf': [True],
               'vect__norm': [None],
               'clf__penalty': ['l2'],
               'clf__C': [1.0]}]

lr_tfidf = Pipeline([('vect', tfidf), ('clf', LogisticRegression(solver='liblinear'))])
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid, scoring='accuracy', cv=5, verbose=1, n_jobs=-1)
gs_lr_tfidf.fit(X_train, y_train)

print(gs_lr_tfidf.best_params_)
print(gs_lr_tfidf.best_score_)

clf = gs_lr_tfidf.best_estimator_
print('Accuracy (test):', clf.score(X_test, y_test))

Fitting 5 folds for each of 1 candidates, totalling 5 fits
{'clf__C': 1.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__norm': None, 'vect__stop_words': ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', '
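
The grid above uses `use_idf=True` and `norm=None`, so each feature is a raw term count scaled by an inverse document frequency, with no length normalization. The sketch below reproduces that weighting in plain Python on a toy corpus. It assumes scikit-learn's documented default of smoothed idf, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, and it skips stop-word removal; the corpus and the helper names `idf` and `tfidf` are made up for illustration.

```python
import math
from collections import Counter

docs = ["good movie", "bad movie", "good good fun"]  # toy corpus (assumption)
tokenized = [d.split() for d in docs]                 # same idea as tokenizer()
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency (assumed scikit-learn default).
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(tokens):
    # Raw term count times idf, without normalization (norm=None).
    counts = Counter(tokens)
    return {term: count * idf(term) for term, count in counts.items()}

weights = tfidf(tokenized[2])
print(weights)
```

In the third document, "fun" occurs once but appears in only one document, so its idf is higher than that of "good", which appears in two documents; "good" still gets the larger weight here because it occurs twice.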

In [15]:
param_grid = [{'vect__ngram_range': [(1, 3)],
               'vect__stop_words': [None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l2'],
               'clf__C': [1.0, 10.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer],
               'vect__use_idf': [True, False],
               'vect__norm': [None],
               'clf__penalty': ['l2'],
               'clf__C': [1.0, 10.0]}]
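
This larger grid is defined but not fitted in the cells shown here. Before passing it to `GridSearchCV`, it can help to estimate the cost: a list of dictionaries is treated as separate sub-grids, and every combination within each sub-grid is tried. The following sketch counts the candidates with the standard library only. The placeholder strings stand in for the tokenizer functions and the stop-word list, and `count_candidates` is a hypothetical helper, not part of scikit-learn.

```python
from itertools import product

# Placeholder strings replace the tokenizer functions and stop-word list;
# only the number of options per parameter matters for the count.
param_grid_sketch = [
    {'vect__ngram_range': [(1, 3)],
     'vect__stop_words': [None],
     'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
     'clf__penalty': ['l2'],
     'clf__C': [1.0, 10.0]},
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': ['stop', None],
     'vect__tokenizer': ['tokenizer'],
     'vect__use_idf': [True, False],
     'vect__norm': [None],
     'clf__penalty': ['l2'],
     'clf__C': [1.0, 10.0]},
]

def count_candidates(grid):
    """Count parameter combinations across a list of sub-grids."""
    return sum(len(list(product(*sub.values()))) for sub in grid)

n_candidates = count_candidates(param_grid_sketch)
n_fits = n_candidates * 5  # assuming cv=5, as in the earlier search
print(n_candidates, n_fits)  # 12 60
```

With 12 candidates and 5-fold cross-validation the search would need 60 fits, compared with 5 for the single-candidate grid used earlier, and the candidates with trigrams and Porter stemming are likely to be the slowest.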

# Pretrained Large Language Model

For an introduction to transformers see the Colab notebook: https://tinyurl.com/hugfacetutorial

For an introduction to transformers on the Princeton Research Computing clusters see this repo by David Turner of PNI: [GitHub](https://github.com/davidt0x/hf_tutorial). In particular, see slides.pptx

In [16]:
%%capture
%pip install transformers[sentencepiece]

In [17]:
from transformers import pipeline

sentiment_pipeline = pipeline('text-classification', model="distilbert-base-uncased-finetuned-sst-2-english")


In [18]:
review = df.loc[0]['raw-review']
print(review)

This is one of the most calming, relaxing, and beautifully made animation films I've ever seen. With beautiful music throughout the movie, the sounds and music can make you feel like you're in the movie! This movie is not just great for kids, but adults too. It teaches you lessons, such as never forget who you are, you can do whatever you stick your mind to, and to brave and daring. This movie can make you cry at times too, which is always a nice touch in movies. This movie is funny, sad, cute, and keeps you on the edge of your seat! Some movies really give you a fuzzy feeling after you see them, and the movie "Spirit" is definitely one of them! With my vote of 9/10 stars for animation, music, and a wonderful idea for a movie, it gave me a whole lot of Spirit!


In [19]:
sentiment_pipeline(review)[0]['label']

'POSITIVE'

In [20]:
df["truncated-review"] = df['raw-review'].apply(lambda x: x if len(x.split()) < 300 else ' '.join(x.split()[:300]))
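
The cell above keeps at most the first 300 whitespace-separated words of each review, presumably so that the input stays below the 512-token limit of DistilBERT. Words and tokens are not the same thing, so 300 words is a heuristic margin rather than a guarantee. The same logic written as a named helper, shown here with a small limit so the effect is visible (`truncate_words` is a hypothetical name, not used elsewhere in the notebook):

```python
def truncate_words(text, max_words=300):
    """Keep at most the first max_words whitespace-separated words."""
    words = text.split()
    if len(words) < max_words:
        return text  # short reviews are returned unchanged
    return ' '.join(words[:max_words])

short = truncate_words("one two", max_words=3)
long = truncate_words("one two three four five", max_words=3)
print(short)  # one two
print(long)   # one two three
```

With the helper in place, the column could be built with `df['raw-review'].apply(truncate_words)`.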

In [21]:
df_sub = df[:250].copy()

In [22]:
df_sub.head()

Unnamed: 0,review,sentiment,raw-review,truncated-review
0,this is one of the most calming relaxing and b...,1,"This is one of the most calming, relaxing, and...","This is one of the most calming, relaxing, and..."
1,i saw this movie on mystery science theater 30...,0,I saw this movie on Mystery Science Theater 30...,I saw this movie on Mystery Science Theater 30...
2,produced by international playhouse pictures i...,0,"Produced by International Playhouse Pictures, ...","Produced by International Playhouse Pictures, ..."
3,just saw baby blue marine again after 30 years...,1,Just saw Baby Blue Marine again after 30 years...,Just saw Baby Blue Marine again after 30 years...
4,i gave this movie a rating of 1 because it is ...,0,I gave this movie a rating of 1 because it is ...,I gave this movie a rating of 1 because it is ...


In [23]:
df_sub["pretrained-distillbert-pred"] = df_sub['truncated-review'].apply(lambda x: sentiment_pipeline(x)[0]['label'])

In [24]:
df_sub["pretrained-distillbert-pred"].value_counts()

NEGATIVE    128
POSITIVE    122
Name: pretrained-distillbert-pred, dtype: int64

In [25]:
df_sub["pretrained-distillbert-pred"] = df_sub["pretrained-distillbert-pred"].apply(lambda x: 0 if x == 'NEGATIVE' else 1)

In [26]:
distillbert_accuracy = df_sub[df_sub["pretrained-distillbert-pred"] == df_sub["sentiment"]].shape[0] / df_sub.shape[0]
print(f'{100 * distillbert_accuracy}%')

88.0%
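
The accuracy above is simply the fraction of rows where the predicted label equals the true sentiment. A minimal sketch of the same calculation in plain Python, using made-up predictions and labels rather than the notebook's data (`label_to_int` and `accuracy` are illustrative helper names):

```python
def label_to_int(label):
    # Mirror the notebook's mapping: 'NEGATIVE' -> 0, anything else -> 1.
    return 0 if label == 'NEGATIVE' else 1

def accuracy(predicted, actual):
    """Fraction of positions where the prediction matches the true label."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

pred_labels = ['POSITIVE', 'NEGATIVE', 'NEGATIVE', 'POSITIVE']  # made-up predictions
true_labels = [1, 0, 1, 1]                                      # made-up ground truth
acc = accuracy([label_to_int(label) for label in pred_labels], true_labels)
print(f'{100 * acc}%')  # 75.0%
```

In pandas, comparing the two columns and taking the mean of the resulting boolean Series gives the same number in one step.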


Without any training on our part, the pretrained LLM reaches 88% accuracy on these 250 reviews, which is almost the same accuracy as our ML model.

Exercise: Use the LLM to summarize one of the reviews.

In [27]:
summarization_pipeline = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")


In [28]:
review = df.loc[6]["raw-review"]
review

"I absolutely, positively loved the movie. I just saw it and can't wait for it to come out on DVD. It is a beautifully, well-drawn masterpiece. I am always amazed with the intricately drawn work of Ghibli studios. Others have commented on Sosuke calling Risa by her first name. He never calls his Father by his first name unless he is speaking about him to someone else. I didn't get the impression that Risa was his mother. It was never even mentioned or implied by anyone. It is quite obvious that she is his step-mother. That is why he makes her promise to come home and why he gets so upset when he finds her empty car. His mother must have died when he was an infant because he mentions being nursed by Risa. This coupled with his father being out to sea a lot is why he has abandonment issues. Everyone also talks about how mature he is. This usually occurs when a child loses a parent."

In [29]:
outputs = summarization_pipeline(review, max_length=80, clean_up_tokenization_spaces=True)
wrapper = textwrap.TextWrapper(width=80, break_long_words=False, break_on_hyphens=False)
print(wrapper.fill(outputs[0]['summary_text']))

 I didn't get the impression that Risa was his mother. It is quite obvious that
she is his step-mother. His mother must have died when he was an infant because
he mentions being nursed by Risa. This coupled with his father being out to sea
a lot is why he has abandonment issues.
