<a href="https://colab.research.google.com/github/PrincetonUniversity/intro_machine_learning/blob/main/day5/natural_language_processing_hackathon/day5_nlp_movie_reviews_notebook1_bag_of_words.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Machine Learning  
**Natural Language Processing Hackathon: Notebook 1  
Wintersession  
Tuesday, January 24, 2023**

The material here is based on Chapter 8 of 
Machine Learning with PyTorch and Scikit-Learn by Sebastian Raschka, Yuxi (Hayden) Liu, Vahid Mirjalili and Dmytro Dzhulgakov. The book is available via the PU library.

In [None]:
import re
import pandas as pd
import numpy as np
import pprint
import nltk
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# How to process natural language using a computer?

Our focus for this project is sentiment analysis, also called opinion mining. That is, for a given document, is its sentiment or tone positive or negative? Consider these examples:

"best movie ever"  
"we found this movie to be very entertaining"  
"this movie was the worst movie ever"  

In order to use computers to do natural language processing we need to convert the text to numbers. What simple approaches can one think of to do this?

# Bag of Words

One approach is to count the number of times that each word appears in each document and associate these counts with the class label. This approach is called bag of words. Let's look at an example.

In [None]:
df = pd.DataFrame({"review":["best movie ever",
                             "we found this movie to be very entertaining",
                             "this movie was the worst movie ever"],
                   "sentiment":[1, 1, 0]})
df

We'll use a tool called a CountVectorizer to perform the counting. See the documentation for the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

In [None]:
count = CountVectorizer(stop_words=None)
bag = count.fit_transform(df["review"])

The dataframe below shows the term frequencies for each review:

In [None]:
numbers = pd.DataFrame(bag.toarray())
numbers.columns = sorted(count.vocabulary_.keys())
numbers
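
To make the counting concrete, here is a minimal sketch that builds the same kind of table with plain Python. It assumes that splitting on whitespace is good enough for these three short reviews; it is not how CountVectorizer is implemented, and the names `reviews`, `tokenized`, `vocabulary` and `counts` are chosen only for this illustration.

```python
from collections import Counter

reviews = [
    "best movie ever",
    "we found this movie to be very entertaining",
    "this movie was the worst movie ever",
]

# Assumption: whitespace splitting is enough for these lowercase,
# punctuation-free reviews. CountVectorizer's default tokenization is
# more involved (for example, it lowercases the text and drops
# one-character tokens).
tokenized = [review.split() for review in reviews]

# The vocabulary is the sorted set of all distinct words.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One row per review, one column per vocabulary word.
counts = []
for tokens in tokenized:
    word_counts = Counter(tokens)
    counts.append([word_counts[word] for word in vocabulary])

print(vocabulary)
for row in counts:
    print(row)
```

Each printed row is the bag-of-words vector of one review. The word "movie" gets a count of 2 in the third row because it occurs twice in that review.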

We now have features that can be used for training a machine learning model! Let's add a few more pieces.

# Term Frequency-Inverse Document Frequency

Some words appear in many of the reviews (or documents in general) while others only appear rarely. Let's come up with a scheme for up-weighting the rare words and down-weighting the common words. Our hypothesis is that the rare words have more importance.

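One solution is to multiply the term frequency of a word in a document by a factor that grows with the log of the ratio of the total number of documents to the number of documents containing that word:

tf(w, r) = count of word w in review r  
N = total number of reviews  
n(w) = number of reviews containing word w  

idf(w) = log((N + 1) / (n(w) + 1)) + 1  
tf-idf(w, r) = tf(w, r) × idf(w)

Here log is the natural logarithm. This is the smoothed form that scikit-learn's TfidfTransformer uses with smooth_idf=True: adding 1 to the numerator and denominator avoids division by zero, and the trailing + 1 keeps words that occur in every review from being zeroed out entirely.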

The log of the ratio is used to prevent very rare words from getting excess weight. Let's try it out and see if the results make sense.

In [None]:
tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
tbl = tfidf.fit_transform(bag).toarray()
numbers = pd.DataFrame(tbl)
numbers.columns = sorted(count.vocabulary_.keys())
numbers.round(decimals=2)

In the first row above, "best" has the largest value. This makes sense because it appears only in that review. The word "movie" appears in all three reviews, so it has the smallest value in that row. In the third row, however, "movie" has the largest value despite being a common word. This is because it appears twice in that review, so its term frequency is 2, which outweighs its low inverse document frequency.
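The first-row values can be reproduced by hand. The sketch below assumes scikit-learn's documented smoothed idf, log((1 + N) / (1 + n(w))) + 1 with the natural logarithm, followed by L2 normalization of the row. The dictionaries `n`, `tf`, `idf`, `raw` and `tfidf_row` are names introduced only for this illustration.

```python
import math

N = 3  # total number of reviews

# Number of reviews that contain each word of the first review.
n = {"best": 1, "movie": 3, "ever": 2}

# Term frequencies in the first review, "best movie ever".
tf = {"best": 1, "movie": 1, "ever": 1}

# Smoothed inverse document frequency (assumed to match smooth_idf=True).
idf = {w: math.log((1 + N) / (1 + n[w])) + 1 for w in n}

# Unnormalized tf-idf, then division by the Euclidean length of the row.
# Words absent from the review have tf = 0 and do not affect the length.
raw = {w: tf[w] * idf[w] for w in tf}
length = math.sqrt(sum(v * v for v in raw.values()))
tfidf_row = {w: raw[w] / length for w in raw}

for w, v in tfidf_row.items():
    print(f"{w}: {v:.2f}")
```

The ordering best > ever > movie follows directly from how many reviews contain each word: the fewer reviews, the larger the idf.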

Each row of the table above has been normalized to unit length (L2 norm). Let's check that the norm of each row is 1:

In [None]:
print([np.linalg.norm(tbl[i]) for i in [0, 1, 2]])

Note that using use_idf=False, norm=None and smooth_idf=False simply gives the word counts:

In [None]:
tfidf = TfidfTransformer(use_idf=False, norm=None, smooth_idf=False)
print(tfidf.fit_transform(bag).toarray())

# Stemming

Words like "running" and "run" are closely related because they derive from the same stem. We can reduce the size of the vocabulary by applying stemming, which maps each word to its stem.

In [None]:
porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]
tokenizer_porter('runners like running and thus they run')

For comparison, here is a trivial tokenizer that only splits on whitespace and does not perform stemming:

In [None]:
def tokenizer(text):
    return text.split()
tokenizer('runners like running and thus they run')
# Output: ['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

# Text Cleaning

In [None]:
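def preprocessor(text):
    # Remove HTML markup.
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D before stripping punctuation.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, replace runs of non-word characters with a space,
    # then append the emoticons with the "nose" (-) removed.
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', ''))
    return text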

In [None]:
preprocessor("</a>This :) is :( a test :-)!")

The first regex, `<[^>]*>`, removes the HTML markup from the movie reviews. Although many programmers advise against using regex to parse HTML, this regex should be sufficient to clean this particular dataset, since we only want to remove the markup and do not plan to use it further. If you prefer a more sophisticated tool for removing HTML markup from text, take a look at Python’s HTML parser module, which is described at https://docs.python.org/3/library/html.parser.html.

After removing the HTML markup, we use a slightly more complex regex to find emoticons, which we temporarily store in the list emoticons. Next, we convert the text to lowercase and replace each run of non-word characters with a single space via the regex `[\W]+`. Finally, we append the stored emoticons to the end of the processed string, removing the "nose" character (the - in :-)) for consistency.
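To see what each regex contributes, the sketch below runs the same three steps one at a time on the test string and keeps every intermediate result. The variable names (`raw`, `no_html`, `words_only`, `cleaned`) are introduced only for this illustration.

```python
import re

raw = "</a>This :) is :( a test :-)!"

# Step 1: strip anything that looks like an HTML tag.
no_html = re.sub(r"<[^>]*>", "", raw)

# Step 2: collect emoticons. The pattern is "eyes" (: ; =), an optional
# "nose" (-), and a "mouth" ( ) ( D P ).
emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)", no_html)

# Step 3: lowercase and replace each run of non-word characters with a
# single space. This also removes the emoticons from the running text.
words_only = re.sub(r"[\W]+", " ", no_html.lower())

# Step 4: append the emoticons at the end, without the nose character.
cleaned = words_only + " ".join(emoticons).replace("-", "")

print(repr(no_html))
print(emoticons)
print(repr(words_only))
print(repr(cleaned))
```

Note that the emoticons must be collected before step 3, because the non-word regex would otherwise discard them.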

# Stop-words

The most common words that may not contribute much information are called stop words. We may consider removing these when pre-processing the text.

In [None]:
nltk.download('stopwords')

In [None]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot') if w not in stop]
# Output: ['runner', 'like', 'run', 'run', 'lot']

In [None]:
', '.join(sorted(stop))

# n-grams

We can make tokens out of multiple words. This allows us to capture features like "very bad" or "very good".

In [None]:
count = CountVectorizer(stop_words=None, ngram_range=(1, 2))
bag = count.fit_transform(df["review"])

In [None]:
import pprint
pprint.pprint(count.vocabulary_)
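
As a rough illustration of what ngram_range=(1, 2) means, the sketch below generates the unigrams and bigrams of a token list in plain Python. The helper `word_ngrams` is hypothetical and written only for this example; it is not part of scikit-learn, and it assumes the text has already been split into tokens.

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all n-grams of the token list for min_n <= n <= max_n."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n consecutive tokens across the list.
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


print(word_ngrams("best movie ever".split()))
print(word_ngrams("this movie was very bad".split()))
```

With bigrams included, a phrase such as "very bad" becomes its own feature, at the cost of a much larger vocabulary.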