# 1. [20 pts] In this step, we will develop a movie review vocabulary to be used as a baseline for processing movie reviews (such as newly posted) in our NLP pipeline. 

# First, discuss what could be the tools, approaches to find out most important keywords that give information about a review. For example, do you think such keywords should occur more in movie reviews compared to any other review of other items or events, etc.?

To build a baseline movie review vocabulary, I would first pre-process all of the reviews with NLTK: remove the stop words and apply a stemmer. Next, I would find the words that occur most frequently in each distinct group of reviews. Finally, I would treat words that are frequent in both good and bad reviews as additional stop words and remove them, since they carry little information for classifying the reviews.
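
The steps described above can be sketched with the standard library alone. This is a minimal illustration, not the pipeline used later in this notebook: the toy reviews, the tiny `STOP_WORDS` set, and the helper names `tokenize_simple` and `term_counts` are all hypothetical, and stemming (for example with NLTK's `PorterStemmer`) is left out to keep the sketch dependency-free.

```python
import re
from collections import Counter

# Hypothetical toy data; the real input would be the labelled reviews.
negative_reviews = ["The plot was bad and the acting was bad.",
                    "A bad movie with a weak plot."]
positive_reviews = ["A great movie with a great cast.",
                    "I love the story and the cast of this movie."]

# Tiny illustrative stop word list (an assumption, not NLTK's list).
STOP_WORDS = {"the", "a", "and", "was", "with", "i", "of", "this"}

def tokenize_simple(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def term_counts(reviews):
    """Total term counts over a list of reviews."""
    counts = Counter()
    for review in reviews:
        counts.update(tokenize_simple(review))
    return counts

neg_counts = term_counts(negative_reviews)
pos_counts = term_counts(positive_reviews)

# Terms that occur in both groups carry little class information; drop them.
shared = set(neg_counts) & set(pos_counts)
neg_only = {w: c for w, c in neg_counts.items() if w not in shared}
pos_only = {w: c for w, c in pos_counts.items() if w not in shared}

print(neg_only)
print(pos_only)
```

In this toy example `movie` appears in both groups and is filtered out, which mirrors the idea of removing words that are common to good and bad reviews.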

# 2. [40 pts] Use the scikit-learn CountVectorizer to compute word frequencies. Find out the 30 most occurring words in reviews without dividing the set into {sentiment: 0, 1} grouping. This library function can use a tokenizer (you can pass your own tokenizer), and then use the complete reviews dataset to generate counts.


Note that CountVectorizer generates a sparse matrix and we need to sum up column
(terms) elements (each row is a document) for a particular term.
Optional: You can also convert it to a regular matrix if your computing platform has enough
memory and does not complain: X = X_cvec.todense()
Use the following for fast column sum, which will probably be faster than the regular form:
row_counts = sum(X_cvec[:,]) / N
counts = np.squeeze(np.asarray(row_counts.todense()))

In [1]:
%%time
import csv
from sklearn.feature_extraction.text import CountVectorizer

CPU times: total: 312 ms
Wall time: 2.06 s


In [2]:
%%time
import csv

# Read the reviews - note the encoding
reviews_sentiment = []         # review text only
reviews_sentiment_w_sent = []  # full rows: [review text, sentiment]
with open('../movie_data.csv', 'r', encoding="utf-8") as f:
    reader = csv.reader(f, delimiter=',', quotechar='"')
    next(reader)  # skip header
    for line in reader:
        reviews_sentiment.append(line[0])
        reviews_sentiment_w_sent.append(line)

CPU times: total: 328 ms
Wall time: 1.05 s


In [16]:
def tokenize(text):
    terms = word_tokenize(text)
    # all lower case
    terms = [w.lower() for w in terms]
    # filter stop words
    terms = [w for w in terms if w not in Stop_words and not w.isdigit()]
    # remove contractions, best way might be having a list 
    terms = [w for w in terms if not re.search(r'^\W\w+$', w)]
    return terms

In [22]:
%%time
import pandas as pd

cvec = CountVectorizer(stop_words=Stop_words)
X_cvec = cvec.fit_transform(reviews_sentiment)

# Dense copy of the document-term matrix, then total count per term (column sums)
count_array = X_cvec.toarray()
count_array = count_array.sum(axis=0)
count_array = count_array.reshape(1, -1)
df = pd.DataFrame(data=count_array, columns=cvec.get_feature_names_out())

# One row per term, sorted by total count
df = df.T
df = df.sort_values(0, ascending=False)
df.head(30)

CPU times: total: 1min 22s
Wall time: 3min 55s


term,count
br,201951
movie,87971
film,79705
one,53603
like,40172
good,29753
time,25110
even,24871
would,24602
story,23119
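
The top term `br` is probably not a real word. It most likely comes from HTML `<br />` line-break tags left in the raw review text, which a word-level tokenizer reduces to `br`. Assuming that is the cause, a minimal sketch of stripping such tags before vectorizing could look like this (the helper name `strip_html_breaks` and the sample text are hypothetical):

```python
import re

def strip_html_breaks(text):
    """Replace <br>, <br/> and <br /> tags (any case) with a single space."""
    return re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)

sample = "Great film.<br /><br />Loved the cast."  # hypothetical review text
print(strip_html_breaks(sample))
```

The cleaned text could then be passed to the vectorizer in place of the raw review, or `br` could simply be added to the stop word list.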


# What is X_cvec and what does it contain?

X_cvec is a sparse document-term matrix: each row is a review (document), each column corresponds to one vocabulary term, and each entry is the number of times that term occurs in that review. Summing a column gives the total count of a term across all reviews. We can use this for further analysis in questions 3 and 4.

# 3. [20 pts] Use the following list to show if these keywords from a movie related web page actually match with the ones in our dataset (i.e., compared by frequencies or ranks):
['script', 'soundtrack', 'actor', 'film', 'producer', 'director', 'special',
'effect', 'score', 'cameraman', 'editor', 'blooper', 'box', 'office', 'cast',
'choreographer', 'cinema', 'movie', 'theater', 'costumer', 'critic', 'dubbing',
'extra', 'flashback', 'flash', 'forward', 'grip', 'hairstylist', 'lighting',
'negative', 'outtake', 'premiere', 'sequel', 'puppeteer', 'reel', 'scene',
'set', 'stunt', 'man', 'subtitle', 'synopsis', 'studio', 'squib', 'sound',
'effect', 'voice', 'writer', 'zoom']
(Source: https://www.vocabulary.com/lists/277003)

In [4]:
%%time
list_keywords=['script', 'soundtrack', 'actor', 'film', 'producer', 'director', 'special', 'effect', 'score', 'cameraman', 'editor', 'blooper', 'box', 'office', 'cast', 'choreographer', 'cinema', 'movie', 'theater', 'costumer', 'critic', 'dubbing', 'extra', 'flashback', 'flash', 'forward', 'grip', 'hairstylist', 'lighting', 'negative', 'outtake', 'premiere', 'sequel', 'puppeteer', 'reel', 'scene', 'set', 'stunt', 'man', 'subtitle', 'synopsis', 'studio', 'squib', 'sound', 'effect', 'voice', 'writer', 'zoom']
import re
from string import punctuation

from nltk import word_tokenize
from nltk.corpus import stopwords

# Combination of stop words and punctuation
Stop_words = stopwords.words('english') + list(punctuation)
def tokenize(text):
    terms = word_tokenize(text)
    # all lower case
    terms = [w.lower() for w in terms]
    # filter stop words
    terms = [w for w in terms if w not in Stop_words and not w.isdigit()]
    # remove contractions, best way might be having a list 
    terms = [w for w in terms if not re.search(r'^\W\w+$', w)]
    return terms

CPU times: total: 156 ms
Wall time: 2.39 s


In [5]:
%%time
Vocabulary = set()
for review in reviews_sentiment:
    terms = tokenize(review)
    Vocabulary.update(terms)  # add multiple terms at once

CPU times: total: 1min 41s
Wall time: 1min 53s


In [10]:
%%time
# Re-structure Vocabulary to a list so we can use indices to efficiently represent terms
# Term_index will be used later in TfidfVectorizer
Vocabulary = list(Vocabulary)
Term_index = {w: idx for idx, w in enumerate(Vocabulary)}

N = len(reviews_sentiment)

print(f'Review count N= {N}')
print(f'Vocabulary size= {len(Vocabulary)}')

Review count N= 50000
Vocabulary size= 156921
CPU times: total: 78.1 ms
Wall time: 94.6 ms


In [11]:
%%time
from collections import defaultdict
from math import log

Term_idf = defaultdict(int)
for review in reviews_sentiment:
    terms = set(tokenize(review))  # Count the document once
    for term in terms:
        Term_idf[term] += 1

for term in Vocabulary:
    # Do not forget to convert the count to float, i.e. "1.0" below
    Term_idf[term] = log(N / (1.0 + Term_idf[term]))

CPU times: total: 1min 43s
Wall time: 1min 51s
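
The cell above computes a smoothed inverse document frequency, idf(t) = log(N / (1 + df(t))), where df(t) is the number of reviews that contain term t. The following sketch applies the same formula to a hypothetical four-document toy corpus to show how the value behaves: terms found in many documents get a low value, rare terms get a high one.

```python
from collections import defaultdict
from math import log

# Hypothetical, already tokenized toy documents
docs = [["movie", "good"],
        ["movie", "bad"],
        ["movie", "good", "cast"],
        ["film"]]
N = len(docs)

# Document frequency: count each term at most once per document
doc_freq = defaultdict(int)
for terms in docs:
    for term in set(terms):
        doc_freq[term] += 1

# Same smoothed formula as in the cell above
idf = {term: log(N / (1.0 + df)) for term, df in doc_freq.items()}

for term in sorted(idf, key=idf.get):
    print(f"{idf[term]:.3f} {term}")
```

Note that with this smoothing a term that occurs in every document would receive a slightly negative value, since N / (1 + N) is below 1.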


In [12]:
%%time
list_keywords=['script', 'soundtrack', 'actor', 'film', 'producer', 'director', 'special', 'effect', 'score', 'cameraman', 'editor', 'blooper', 'box', 'office', 'cast', 'choreographer', 'cinema', 'movie', 'theater', 'costumer', 'critic', 'dubbing', 'extra', 'flashback', 'flash', 'forward', 'grip', 'hairstylist', 'lighting', 'negative', 'outtake', 'premiere', 'sequel', 'puppeteer', 'reel', 'scene', 'set', 'stunt', 'man', 'subtitle', 'synopsis', 'studio', 'squib', 'sound', 'effect', 'voice', 'writer', 'zoom']
for t in list_keywords:
    print(f'{Term_idf[t]:.3f} {t}')  
mn_key = min(Term_idf, key=Term_idf.get)
mx_key = max(Term_idf, key=Term_idf.get)
print(f"In the movie review corpus:\n\
    min idf= {Term_idf[mn_key]:.4f} '{mn_key}'\n\
    max idf= {Term_idf[mx_key]:.4f} '{mx_key}'")

2.339 script
3.545 soundtrack
2.596 actor
0.604 film
4.256 producer
2.049 director
2.660 special
3.785 effect
3.294 score
6.235 cameraman
5.413 editor
8.422 blooper
3.837 box
4.058 office
2.100 cast
7.013 choreographer
3.102 cinema
0.506 movie
3.642 theater
8.740 costumer
5.395 critic
5.369 dubbing
4.443 extra
5.036 flashback
5.440 flash
3.786 forward
6.128 grip
9.721 hairstylist
4.360 lighting
4.387 negative
8.517 outtake
5.796 premiere
3.720 sequel
8.623 puppeteer
5.850 reel
1.886 scene
2.559 set
5.417 stunt
1.891 man
7.419 subtitle
5.552 synopsis
4.128 studio
9.028 squib
3.103 sound
3.785 effect
3.454 voice
3.564 writer
6.928 zoom
In the movie review corpus:
    min idf= 0.5059 'movie'
    max idf= 10.1266 'perjurer'
CPU times: total: 46.9 ms
Wall time: 64.3 ms
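
Since the question also mentions comparing by frequencies or ranks, a complementary sketch is to look up each keyword's rank in the frequency table. The counts below are only the ten rows shown in the question 2 output, and the short `keywords` list is an illustrative subset; in the notebook the lookup would run over the full table and the full keyword list.

```python
# Counts copied from the ten rows shown in the question 2 output
top_counts = {"br": 201951, "movie": 87971, "film": 79705, "one": 53603,
              "like": 40172, "good": 29753, "time": 25110, "even": 24871,
              "would": 24602, "story": 23119}

# Rank 1 is the most frequent term
ranked = sorted(top_counts, key=top_counts.get, reverse=True)
rank = {term: position + 1 for position, term in enumerate(ranked)}

keywords = ["script", "film", "movie", "scene"]  # illustrative subset
for kw in keywords:
    if kw in rank:
        print(f"{kw}: rank {rank[kw]}, count {top_counts[kw]}")
    else:
        print(f"{kw}: not among the rows shown")
```

Among the rows shown, only `movie` and `film` appear, which agrees with their having the lowest idf values in the list above.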


# 4. [20 pts] Group the reviews into two groups by {sentiment 0, 1} and list the most frequent 30 terms in these two groups. Do you suggest any word that can make a separation between sentiment 0 and 1?

In [26]:
sentiment_0 = []
sentiment_1 = []
for review in reviews_sentiment_w_sent:
    if review[1] == "0":
        sentiment_0.append(review[0])
    elif review[1] == "1":
        sentiment_1.append(review[0])

In [57]:
%%time
import pandas as pd

cvec = CountVectorizer(stop_words=Stop_words)
X_cvec = cvec.fit_transform(sentiment_0)

# Dense copy of the document-term matrix, then total count per term (column sums)
count_array = X_cvec.toarray()
count_array = count_array.sum(axis=0)
count_array = count_array.reshape(1, -1)
df = pd.DataFrame(data=count_array, columns=cvec.get_feature_names_out())

# One row per term, sorted by total count
df = df.T
df = df.sort_values(0, ascending=False)
df.head(30)

CPU times: total: 22.4 s
Wall time: 35 s


term,count
br,103997
movie,50117
film,37595
one,26283
like,22458
even,15254
good,14728
bad,14726
would,14007
time,12358


In [58]:
%%time
import pandas as pd

cvec = CountVectorizer(stop_words=Stop_words)
X_cvec = cvec.fit_transform(sentiment_1)

# Dense copy of the document-term matrix, then total count per term (column sums)
count_array = X_cvec.toarray()
count_array = count_array.sum(axis=0)
count_array = count_array.reshape(1, -1)
df_2 = pd.DataFrame(data=count_array, columns=cvec.get_feature_names_out())

# One row per term, sorted by total count
df_2 = df_2.T
df_2 = df_2.sort_values(0, ascending=False)
df_2.head(30)

CPU times: total: 33.4 s
Wall time: 45.1 s


term,count
br,97954
film,42110
movie,37854
one,27320
like,17714
good,15025
great,12964
story,12934
time,12752
well,12729


In [59]:
l=df.head(30)[0].keys()
l_2=df_2.head(30)[0].keys()

In [60]:
for entry in l:
    if(entry not in l_2):
        print(entry)

bad
make
could
plot
acting
watch
character


In [61]:
for entry in l_2:
    if(entry not in l):
        print(entry)

great
love
best
life
many
films
two


Some words that may work for distinguishing sentiment 1 from sentiment 0 are great, love, best, life, many, films, and two. Some words that may work for distinguishing sentiment 0 from sentiment 1 are bad, make, could, plot, acting, watch, and character. This is unsurprising in my opinion, as several of these words, such as great, love, best, and bad, are commonly used to describe positive or negative things.
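
The comparison of the two top lists can be sketched compactly as follows. This sketch uses only the ten terms shown in each output table above, so its result is a subset-based illustration and differs from the top-30 comparison; the variable names are hypothetical.

```python
# Terms copied from the ten rows shown in each output table
sentiment_0_top = ["br", "movie", "film", "one", "like",
                   "even", "good", "bad", "would", "time"]
sentiment_1_top = ["br", "film", "movie", "one", "like",
                   "good", "great", "story", "time", "well"]

set_0, set_1 = set(sentiment_0_top), set(sentiment_1_top)

# Keep the original frequency order while filtering out shared terms
only_in_0 = [w for w in sentiment_0_top if w not in set_1]
only_in_1 = [w for w in sentiment_1_top if w not in set_0]

print("only in sentiment 0:", only_in_0)
print("only in sentiment 1:", only_in_1)
```

Even on these ten rows, `bad` shows up only for sentiment 0 and `great` only for sentiment 1, which is consistent with the conclusion above.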