<a href="https://colab.research.google.com/github/PrincetonUniversity/intro_machine_learning/blob/main/day5/natural_language_processing_hackathon/day5_nlp_movie_reviews_notebook2_SOLUTION_and_llm_comparison.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Machine Learning  
**Natural Language Processing Hackathon: Hackathon Solution  
Wintersession 2023  
Tuesday, January 24, 2023**

The material here is based on Chapter 8 of *Machine Learning with PyTorch and Scikit-Learn* by Sebastian Raschka, Yuxi (Hayden) Liu, Vahid Mirjalili and Dmytro Dzhulgakov. The book is available via the Princeton University Library.

In this notebook we are going to work with a dataset of 50,000 movie reviews from the Internet Movie Database (IMDb) and build a predictor that can distinguish between positive and negative reviews.

In [None]:
import re
import textwrap
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

# Download Data and Make Dataframe

Download the data:

In [None]:
!wget https://tigress-web.princeton.edu/~jdh4/movie_data.csv

Read in the CSV file and print the first 5 rows of the Pandas dataframe:

In [None]:
df = pd.read_csv('movie_data.csv', encoding='utf-8')
df.head(5)

In [None]:
df["raw-review"] = df["review"]

In [None]:
def remove_html_tags(text):
    """Remove HTML tags from a string."""
    # re is already imported at the top of the notebook.
    # The non-greedy pattern removes each tag separately.
    return re.sub(r'<.*?>', '', text)

In [None]:
remove_html_tags('What is <b>this</b>, said the toad? Where is <p class="new">the time</a> probe?')

In [None]:
df["raw-review"] = df["raw-review"].apply(remove_html_tags)
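
The pattern `<.*?>` used in remove_html_tags depends on a non-greedy quantifier. As a small sketch (the sample string is made up for illustration), compare it with the greedy pattern `<.*>`, which matches from the first `<` to the last `>` and so also deletes the text between the tags:

```python
import re

sample = 'What is <b>this</b>, said the toad?'

# Non-greedy: each tag is matched and removed on its own
non_greedy = re.sub(r'<.*?>', '', sample)

# Greedy: a single match runs from the first '<' to the last '>'
greedy = re.sub(r'<.*>', '', sample)

print(non_greedy)  # What is this, said the toad?
print(greedy)      # What is , said the toad?
```
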

Change the value of idx to vary the amount of training and test data. The default value is 25000, which gives a 50/50 split.

# Preprocessing and Train-Test Split

In [None]:
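def preprocessor(text):
    # Strip HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, replace non-word characters with spaces, append emoticons without the "nose"
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', ''))
    return text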

In [None]:
df['review'] = df['review'].apply(preprocessor)
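
To see what the preprocessor does to a single review, the sketch below applies the same three steps to a made-up string (the sample text is an assumption, not taken from the dataset). The function is repeated here only so that the cell runs on its own:

```python
import re

def preprocessor(text):
    # Strip HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, collapse non-word characters to spaces, append the emoticons
    return re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')

cleaned = preprocessor('</a>This :) is :( a test :-)!')
print(cleaned)  # this is a test :) :( :)
```

The emoticons are moved to the end of the string and `:-)` is normalized to `:)`, so they survive as tokens even though all other punctuation is dropped.
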

In [None]:
idx = 25000
X_train = df.loc[:idx - 1, 'review'].values
y_train = df.loc[:idx - 1, 'sentiment'].values
X_test  = df.loc[idx:, 'review'].values
y_test  = df.loc[idx:, 'sentiment'].values
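
The split above uses idx - 1 because .loc slices by label and includes the end label. A minimal sketch on a hypothetical 10-row dataframe (the toy data and the value idx = 7 are assumptions chosen for illustration) shows how changing idx changes the split:

```python
import pandas as pd

# Hypothetical toy frame standing in for the 50,000-row movie review dataframe
toy = pd.DataFrame({'review': [f'review {i}' for i in range(10)],
                    'sentiment': [i % 2 for i in range(10)]})

idx = 7  # 70/30 split instead of 50/50

# .loc includes the end label, so rows 0..idx-1 go to the training set
X_train = toy.loc[:idx - 1, 'review'].values
X_test = toy.loc[idx:, 'review'].values

print(len(X_train), len(X_test))  # 7 3
```
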

In [None]:
def tokenizer(text):
    return text.split()

In [None]:
porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [None]:
nltk.download('stopwords')
stop = stopwords.words("english")

# Preprocessing and Training Pipeline

In [None]:
tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None)
param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop],
               'vect__tokenizer': [tokenizer],
               'vect__use_idf': [True],
               'vect__norm': [None],
               'clf__penalty': ['l2'],
               'clf__C': [1.0]}]

lr_tfidf = Pipeline([('vect', tfidf), ('clf', LogisticRegression(solver='liblinear'))])
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid, scoring='accuracy', cv=5, verbose=1, n_jobs=-1)
gs_lr_tfidf.fit(X_train, y_train)

print(gs_lr_tfidf.best_params_)
print(gs_lr_tfidf.best_score_)

clf = gs_lr_tfidf.best_estimator_
print('Accuracy (test):', clf.score(X_test, y_test))

Pipelines can be expensive to evaluate. In the above, the param_grid contains only one set of parameters. For a more extensive search, use the param_grid below:

In [None]:
param_grid = [{'vect__ngram_range': [(1, 3)],
               'vect__stop_words': [None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l2'],
               'clf__C': [1.0, 10.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer],
               'vect__use_idf': [True, False],
               'vect__norm': [None],
               'clf__penalty': ['l2'],
               'clf__C': [1.0, 10.0]}]
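
To get a feel for the cost of the larger search, the sketch below counts the parameter combinations in this param_grid. Strings stand in for the tokenizer functions and the stop-word list (placeholders, since only the number of options matters here), and cv = 5 is taken from the GridSearchCV call above:

```python
from math import prod

# Placeholders replace the tokenizer functions and the stop-word list
param_grid = [{'vect__ngram_range': [(1, 3)],
               'vect__stop_words': [None],
               'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
               'clf__penalty': ['l2'],
               'clf__C': [1.0, 10.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': ['stop', None],
               'vect__tokenizer': ['tokenizer'],
               'vect__use_idf': [True, False],
               'vect__norm': [None],
               'clf__penalty': ['l2'],
               'clf__C': [1.0, 10.0]}]

cv = 5

# Each dictionary contributes the product of its option counts
n_candidates = sum(prod(len(values) for values in grid.values()) for grid in param_grid)
n_fits = n_candidates * cv

print(n_candidates, n_fits)  # 12 60
```

The single-combination grid used earlier needs only 5 fits, so the larger grid is roughly 12 times more work, before counting the final refit of the best estimator on the full training set.
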

# Pretrained Large Language Model

For an introduction to transformers see the Colab notebook: https://tinyurl.com/hugfacetutorial

For an introduction to transformers on the Princeton Research Computing clusters, see this repo by David Turner of PNI: [GitHub](https://github.com/davidt0x/hf_tutorial). In particular, see slides.pptx.

In [None]:
%%capture
%pip install transformers[sentencepiece]

In [None]:
from transformers import pipeline

sentiment_pipeline = pipeline('text-classification', model="distilbert-base-uncased-finetuned-sst-2-english")

In [None]:
review = df.loc[0]['raw-review']
print(review)

In [None]:
sentiment_pipeline(review)[0]['label']

In [None]:
df["truncated-review"] = df['raw-review'].apply(lambda x: x if len(x.split()) < 300 else ' '.join(x.split()[:300]))
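
The lambda above keeps at most the first 300 whitespace-separated words of each review. A plausible reason (an assumption, the notebook does not state it) is that DistilBERT accepts at most 512 tokens per input. The same logic written as a named function, here called truncate_words for illustration:

```python
def truncate_words(text, max_words=300):
    """Return text unchanged if it is short, else its first max_words words."""
    words = text.split()
    if len(words) < max_words:
        return text
    return ' '.join(words[:max_words])

long_review = ' '.join(['word'] * 500)   # made-up 500-word review
short_review = 'A short review.'

print(len(truncate_words(long_review).split()))  # 300
print(truncate_words(short_review))              # A short review.
```

Note that this limits the number of words, not the number of tokens, so it does not strictly guarantee that the tokenized input fits within the model limit.
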

In [None]:
df_sub = df[:250].copy()

In [None]:
df_sub.head()

In [None]:
df_sub["pretrained-distillbert-pred"] = df_sub['truncated-review'].apply(lambda x: sentiment_pipeline(x)[0]['label'])

In [None]:
df_sub["pretrained-distillbert-pred"].value_counts()

In [None]:
df_sub["pretrained-distillbert-pred"] = df_sub["pretrained-distillbert-pred"].apply(lambda x: 0 if x == 'NEGATIVE' else 1)

In [None]:
# Fraction of the 250 reviews whose predicted label matches the true sentiment
distillbert_accuracy = (df_sub["pretrained-distillbert-pred"] == df_sub["sentiment"]).mean()
print(f'{100 * distillbert_accuracy:.1f}%')

On this 250-review subset, the pretrained LLM reaches almost the same accuracy as our ML model, without any training on our part.
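
The label mapping and accuracy calculation used above can be sketched on a few made-up predictions (the labels and sentiments below are illustrative, not results from the model):

```python
# Made-up pipeline outputs and true sentiments (1 = positive, 0 = negative)
predicted_labels = ['POSITIVE', 'NEGATIVE', 'POSITIVE', 'POSITIVE']
true_sentiment = [1, 0, 0, 1]

# Same mapping as above: 'NEGATIVE' becomes 0, everything else becomes 1
predicted = [0 if label == 'NEGATIVE' else 1 for label in predicted_labels]

# Accuracy is the fraction of predictions that match the true sentiment
accuracy = sum(p == t for p, t in zip(predicted, true_sentiment)) / len(true_sentiment)
print(f'{100 * accuracy:.1f}%')  # 75.0%
```
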

Exercise: Use the LLM to summarize one of the reviews.

In [None]:
summarization_pipeline = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

In [None]:
review = df.loc[6]["raw-review"]
review

In [None]:
outputs = summarization_pipeline(review, max_length=80, clean_up_tokenization_spaces=True)
wrapper = textwrap.TextWrapper(width=80, break_long_words=False, break_on_hyphens=False)
print(wrapper.fill(outputs[0]['summary_text']))