<a href="https://colab.research.google.com/github/PrincetonUniversity/intro_machine_learning/blob/main/day5/natural_language_processing_hackathon/day5_nlp_movie_reviews_notebook2_hackathon_HINTS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Machine Learning  
**Natural Language Processing Hackathon: Notebook 2 HINTS   
Wintersession  
Tuesday, January 24, 2023**

The material here is based on Chapter 8 of 
Machine Learning with PyTorch and Scikit-Learn by Sebastian Raschka, Yuxi (Hayden) Liu, Vahid Mirjalili and Dmytro Dzhulgakov. The book is available via the PU library.

In this notebook we are going to work with a dataset of 50,000 movie reviews from the Internet Movie Database (IMDb) and build a predictor that can distinguish between positive and negative reviews.

In [None]:
import re
import textwrap
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

Download the data set:

In [None]:
!wget https://tigress-web.princeton.edu/~jdh4/movie_data.csv

Read in the CSV file and print the first 5 rows of the Pandas dataframe:

In [None]:
df = pd.read_csv('movie_data.csv', encoding='utf-8')
df.head(5)

Let's look at the total number of rows and the data types:

In [None]:
df.info()

Let's check for class imbalance:

In [None]:
df["sentiment"].value_counts()

The classes are balanced, so no resampling or class weighting is needed. Next, let's print some reviews to get a sense of the content.

In [None]:
def print_reviews_and_sentiment(d, start_index=42, num=3, width=80):
    wrapper = textwrap.TextWrapper(width=width, break_long_words=False, break_on_hyphens=False)
    for i in range(start_index, start_index + num):
        print(wrapper.fill(str(d.loc[i]["review"])))
        print('------------')
        print(f'Sentiment: {d.loc[i]["sentiment"]}\n')

In [None]:
print_reviews_and_sentiment(df, start_index=42, num=2)

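Change the value of idx, which is set in the train-test split cell below, to vary the amount of training and test data. The default value of 25000 gives a 50/50 split. First, define a function that cleans the text of each review: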

In [None]:
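def preprocessor(text):
    # Remove HTML markup such as <br />
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D before punctuation is stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, replace runs of non-word characters with a space,
    # then append the emoticons with the nose character '-' removed
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', ''))
    return text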

The first regex, <[^>]*>, in the preceding code cell removes the HTML markup from the movie reviews. Many programmers advise against using regex to parse HTML, but we only want to strip the markup and never use it again, so this regex is sufficient for this particular dataset. If you prefer a dedicated tool for removing HTML markup from text, take a look at Python’s HTML parser module, which is described at https://docs.python.org/3/library/html.parser.html.

After the HTML markup is removed, a slightly more complex regex finds emoticons, which are stored temporarily in the list emoticons. Next, the text is converted to lowercase and the regex [\W]+ replaces every run of non-word characters with a single space. Finally, the stored emoticons are appended to the end of the processed string, with the nose character (the - in :-)) removed for consistency.
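
For readers who want to try the parser-based alternative mentioned above, here is a minimal sketch built on the standard library's html.parser module. The TagStripper class, the strip_tags helper, and the sample string are hypothetical names and data introduced for illustration; they are not part of the notebook or the book.

```python
import re
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: collects the text between tags and drops the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text that is not part of a tag
        self.chunks.append(data)

    def get_text(self):
        return ''.join(self.chunks)


def strip_tags(html):
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()  # flush any text still buffered by the parser
    return stripper.get_text()


# Made-up sample text, not a review from the dataset
sample = 'Great film.<br /><br />Would watch <b>again</b>!'
parser_result = strip_tags(sample)
regex_result = re.sub(r'<[^>]*>', '', sample)
print(parser_result)
print(parser_result == regex_result)
```

On simple markup like this, both approaches should give the same text. The parser mainly pays off on messier input, at the cost of a little more code.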

In [None]:
df['review'] = df['review'].apply(preprocessor)

In [None]:
print_reviews_and_sentiment(df, start_index=42, num=2)
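
To see each step of preprocessor on a small, controlled input, the sketch below repeats the function (so the block runs on its own) and applies it to a made-up string. The sample text is an assumption chosen for illustration, not a review from the dataset.

```python
import re


def preprocessor(text):
    # Remove HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons before punctuation is stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, collapse non-word runs to a space, append emoticons without '-'
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text


# Made-up sample containing a stray tag and three emoticons
sample = '</a>This :) is :( a test :-)!'
cleaned = preprocessor(sample)
print(repr(cleaned))  # 'this is a test :) :( :)'
```

Note that the emoticons are joined directly onto the cleaned text, so they are separated from the last word only when the text ends in a non-word character, as the sample does.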

Create a train-test split:

In [None]:
idx = 25000
X_train = df.loc[:idx - 1, 'review'].values
y_train = df.loc[:idx - 1, 'sentiment'].values
X_test  = df.loc[idx:, 'review'].values
y_test  = df.loc[idx:, 'sentiment'].values
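
Because df.loc slices by label and includes both endpoints, df.loc[:idx - 1] selects rows 0 through idx - 1, which is exactly idx rows with the default integer index. The short sketch below works out the resulting split sizes for a few values of idx. The split_sizes helper is a hypothetical name introduced here, and 50000 is the row count stated at the top of this notebook.

```python
def split_sizes(n_rows, idx):
    """Return (n_train, n_test, train_fraction) for a split at row idx."""
    if not 0 < idx < n_rows:
        raise ValueError('idx must fall strictly between 0 and n_rows')
    n_train = idx            # rows 0 .. idx - 1
    n_test = n_rows - idx    # rows idx .. n_rows - 1
    return n_train, n_test, n_train / n_rows


# Try a few candidate values of idx for a 50,000-row dataset
for idx in (25000, 40000, 45000):
    n_train, n_test, frac = split_sizes(50000, idx)
    print(f'idx={idx}: {n_train} train, {n_test} test ({frac:.0%} train)')
```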

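Let's try using the word counts as the features to get started. With use_idf=False and norm=None, TfidfVectorizer returns raw term counts rather than tf-idf weights: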

In [None]:
tfidf = TfidfVectorizer(use_idf=False, norm=None, smooth_idf=False)
word_counts = tfidf.fit_transform(X_train)

In [None]:
type(word_counts)

In [None]:
word_counts.shape

In [None]:
list(tfidf.vocabulary_.items())[:10]

In [None]:
print(df.loc[1]["review"])

In [None]:
print(word_counts[1,:])

In [None]:
tfidf.vocabulary_["window"]
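
The cells above inspect a sparse matrix of word counts and a vocabulary that maps each word to a column index. As a rough illustration of that idea, here is a plain-Python sketch that builds a vocabulary and count vectors for two made-up sentences. It is not scikit-learn's implementation: the helper names are hypothetical, and the token pattern is modeled on the vectorizer's default of keeping words with two or more characters.

```python
import re
from collections import Counter

# Words of two or more word characters, modeled on the vectorizer's default
TOKEN = re.compile(r'\b\w\w+\b')


def tokenize(doc):
    return TOKEN.findall(doc.lower())


def fit_vocabulary(docs):
    # Map each distinct term to a column index, in alphabetical order
    terms = sorted({term for doc in docs for term in tokenize(doc)})
    return {term: col for col, term in enumerate(terms)}


def count_vector(doc, vocabulary):
    # One row of the count matrix; terms outside the vocabulary are skipped
    row = [0] * len(vocabulary)
    for term, n in Counter(tokenize(doc)).items():
        if term in vocabulary:
            row[vocabulary[term]] = n
    return row


# Made-up documents for illustration
docs = ['the window was open', 'open the window the window']
vocab = fit_vocabulary(docs)
rows = [count_vector(doc, vocab) for doc in docs]

print(vocab)            # column index of each term
print(vocab['window'])  # analogous to tfidf.vocabulary_["window"]
print(rows)
```

Each row has one entry per vocabulary term, which is why the real count matrix for thousands of reviews is stored in sparse form: most entries are zero.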

In [None]:
clf = LogisticRegression(C=1.0, solver='liblinear')
clf = clf.fit(word_counts, y_train)

The accuracy on the test set is:

In [None]:
clf.score(tfidf.transform(X_test), y_test)

Notice that the .transform() method was applied to the test set while .fit_transform() was applied to the training set, so the vocabulary is learned from the training data only. In this notebook we only worked with unnormalized word counts. We did nothing with stop words, stemming, inverse document frequency weighting, n-grams, etc. The full solution in the next notebook uses a Pipeline to try out various combinations of these choices and find the best one.
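
The difference between .fit_transform() and .transform() can be checked on a toy corpus. The sketch below uses the same vectorizer settings as this notebook; the three short documents are invented for illustration and the block assumes scikit-learn is installed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents invented for illustration; they are not taken from the IMDb data
train_docs = ['good movie', 'bad movie']
test_docs = ['good film']

# Same settings as in this notebook: raw term counts, no idf weighting, no normalization
vectorizer = TfidfVectorizer(use_idf=False, norm=None, smooth_idf=False)

# fit_transform learns the vocabulary from the training documents and counts their words
train_counts = vectorizer.fit_transform(train_docs)

# transform reuses that vocabulary; 'film' was never seen in training, so it is ignored
test_counts = vectorizer.transform(test_docs)

print(sorted(vectorizer.vocabulary_.items(), key=lambda kv: kv[1]))
print(train_counts.toarray())
print(test_counts.toarray())
```

Calling .fit_transform() on the test set instead would build a different vocabulary, and the columns would no longer line up with the features the classifier was trained on.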