In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import numpy as np
import json
%matplotlib inline
import matplotlib.pyplot as plt



We'll be using a dataset of movie scripts (this is the same dataset you'll be using in A5!)

In [2]:
try:  # Colab-only upload helper; outside Colab the JSON file must already be local
    from google.colab import files
    uploaded = files.upload()
except ImportError:
    uploaded = None

In [5]:
with open("movie_scripts_data.json") as f:
    data = json.loads(f.readlines()[0])
print("Loaded {} movie transcripts".format(len(data)))
print("Each movie transcript is a dictionary with the following keys...")
print(data[0].keys())

# Here, we will assign an index for each movie_id. This index will help us access data in numpy matrices.
movie_id_to_index = {movie_id:index for index, movie_id in enumerate([d['movie_id'] for d in data])}

# We will also need a dictionary mapping movie names to movie ids
movie_name_to_id = {name:mid for name, mid in zip([d['movie_name'] for d in data],
                                                     [d['movie_id'] for d in data])}
movie_id_to_name = {v:k for k,v in movie_name_to_id.items()}

# and because it might be useful...
movie_name_to_index = {name:movie_id_to_index[movie_name_to_id[name]] for name in [d['movie_name'] for d in data]}
movie_index_to_name = {v:k for k,v in movie_name_to_index.items()}

print("The index of \"{}\" is {}".format(data[7]['movie_name'], movie_id_to_index[data[7]['movie_id']]))

Loaded 617 movie transcripts
Each movie transcript is a dictionary with the following keys...
dict_keys(['movie_name', 'movie_id', 'categories', 'script'])
The index of "spare me" is 7


We can see that each movie is assigned an "index" (from 0 to 616). These will correspond to the rows of a document-by-term matrix.

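Conveniently, the scikit-learn package, which we imported at the top of this notebook, includes a `CountVectorizer` class that builds a document-by-term matrix efficiently. Among its settings is `binary`, which controls whether the matrix elements are term frequencies (`binary = False`) or boolean indicators of whether a term occurs in a document at all (`binary = True`). We will use the latter. The other settings used here prune the vocabulary: `max_df = 0.8` drops terms that appear in more than 80% of the documents, `min_df = 10` drops terms that appear in fewer than 10 documents, and `max_features = 1000` keeps only the 1000 most frequent of the remaining terms.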
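As a small illustration of the `binary` setting, here is a minimal sketch on a made-up two-document corpus (`toy_docs`, `toy_vec`, `tf_matrix` and `bin_matrix` are hypothetical names, not part of the movie scripts data). It uses the default tokenizer and no stop-word filtering, so the vocabulary differs from the one built in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the movie scripts dataset.
toy_docs = ["the cat sat on the cat mat", "the dog sat"]

# binary=False (the default) stores how many times each term occurs.
tf_matrix = CountVectorizer(binary=False).fit_transform(toy_docs).toarray()

# binary=True stores 1 if the term occurs in the document at all, else 0.
toy_vec = CountVectorizer(binary=True)
bin_matrix = toy_vec.fit_transform(toy_docs).toarray()

# Columns follow the alphabetically sorted vocabulary.
print(sorted(toy_vec.vocabulary_))
print(tf_matrix)   # "cat" and "the" have count 2 in the first document
print(bin_matrix)  # the same entries become 1
```

Rows are documents and columns are terms in both matrices; only the values differ.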

In [6]:
count_vec = CountVectorizer(stop_words = "english", max_df = 0.8, min_df = 10, max_features=1000, binary = True)

doc_by_vocab = count_vec.fit_transform([x['script'] for x in data])


In [7]:
doc_by_vocab 

<617x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 213542 stored elements in Compressed Sparse Row format>

Observe that the dimensions are 617 x 1000. The first dimension (rows) equals the number of movie transcripts in our data, and the second (columns) equals `max_features`. This tells us that the format of the matrix is document-by-term: each row is a document and each column is a term.

In [8]:
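# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs >= 1.0.
# .tolist() gives a plain Python list, which the later features.index(...) call relies on.
features = count_vec.get_feature_names_out().tolist()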
print(features)

['able', 'absolutely', 'accept', 'accident', 'account', 'act', 'acting', 'action', 'actually', 'address', 'admit', 'advice', 'afford', 'afraid', 'afternoon', 'age', 'agent', 'ago', 'agree', 'ah', 'ahead', 'ain', 'air', 'alive', 'alright', 'amazing', 'america', 'american', 'angry', 'animal', 'answer', 'answers', 'anybody', 'anymore', 'apart', 'apartment', 'apologize', 'appreciate', 'area', 'aren', 'arm', 'arms', 'army', 'arrest', 'art', 'asked', 'asking', 'asleep', 'ass', 'asshole', 'attack', 'attention', 'awful', 'baby', 'bag', 'ball', 'bank', 'bar', 'bastard', 'bathroom', 'beat', 'beautiful', 'beauty', 'bed', 'beer', 'beg', 'begin', 'beginning', 'belong', 'best', 'bet', 'bigger', 'biggest', 'birthday', 'bit', 'bitch', 'black', 'blame', 'blew', 'blind', 'blood', 'blow', 'blue', 'board', 'boat', 'body', 'book', 'books', 'born', 'boss', 'bother', 'bought', 'bout', 'box', 'boy', 'boys', 'brain', 'brains', 'break', 'breakfast', 'breath', 'bring', 'bringing', 'broke', 'broken', 'brother', '



In [9]:
doc_by_vocab = doc_by_vocab.toarray()


Right now we have a document-by-term matrix. For deriving a co-occurrence matrix, it is more natural to start from a term-by-document matrix (such that each row is a term), which we can easily obtain by taking the transpose.

In [10]:
term_document_matrix = doc_by_vocab.T

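Then, the co-occurrence matrix can be derived by taking the dot product between every pair of term vectors, which can be done efficiently with a single matrix multiplication: the term-by-document matrix times the document-by-term matrix. Because the entries are binary, entry (a, b) of the result counts the documents in which terms a and b both appear, and the diagonal holds each term's document frequency.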

In [11]:
cooccurence_matrix = np.dot(term_document_matrix, doc_by_vocab)
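To make the matrix product concrete, here is a minimal sketch on a made-up binary matrix with 4 documents and 3 terms (`toy_doc_by_term`, `toy_cooc` and `pair_count` are hypothetical names used only for illustration).

```python
import numpy as np

# Hypothetical binary document-by-term matrix: 4 documents (rows), 3 terms (columns).
toy_doc_by_term = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
])

# (terms x docs) @ (docs x terms) -> (terms x terms)
toy_cooc = toy_doc_by_term.T @ toy_doc_by_term

# Entry [0, 1] should equal the number of documents containing both term 0 and term 1.
pair_count = int(np.sum(toy_doc_by_term[:, 0] & toy_doc_by_term[:, 1]))

print(toy_cooc)
print(pair_count, toy_cooc[0, 1])
```

The result is symmetric, and its diagonal equals the column sums of the document-by-term matrix, that is, each term's document frequency.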

Let's try using this to compute similarity between words!

In [12]:
def find_most_similar_words(word, sim_matrix = cooccurence_matrix, topk=100):
    if word not in features:
        print(word, 'is OOV.')
        return None 
    idx = features.index(word)
    sorted_words = np.argsort(sim_matrix[idx])[::-1]
    print('Most similar {} words to "{}" are:'.format(topk, word))
    for i in range(topk):
        j = sorted_words[i]
        print(features[j], sim_matrix[idx, j])

In [13]:
find_most_similar_words("class", sim_matrix = cooccurence_matrix , topk = 10)

Most similar 10 words to "class" are:
class 171
best 155
trying 154
care 149
guess 148
thank 148
fine 148
went 148
seen 147
matter 147


In [15]:
find_most_similar_words("guess", sim_matrix = cooccurence_matrix , topk = 5)

Most similar 5 words to "guess" are:
guess 474
okay 399
trying 395
care 395
hey 391


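Let's efficiently compute PMI (pointwise mutual information), which for *ranking* purposes can be simplified to n_ab / (n_a * n_b). Here n_ab is the number of documents that contain both term a and term b, and n_a and n_b are the numbers of documents that contain each term on its own. Our co-occurrence matrix already contains the numerator (co-occurrence counts of all term pairs). Both terms in the denominator can be obtained from the document frequency (DF) vector, which we can easily compute from our term-document matrix.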
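A short derivation sketch shows why this simplification is safe for ranking. It assumes that probabilities are estimated as document-level relative frequencies over N documents; N is a symbol introduced here for illustration and does not appear in the code.

```latex
\begin{aligned}
\mathrm{PMI}(a, b)
  &= \log \frac{p(a, b)}{p(a)\,p(b)}
   = \log \frac{n_{ab}/N}{(n_a/N)\,(n_b/N)} \\
  &= \log N + \log \frac{n_{ab}}{n_a\, n_b}
\end{aligned}
```

Since log N is the same for every pair of terms and the logarithm is monotonic, sorting by n_ab / (n_a * n_b) gives the same order as sorting by PMI.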

In [16]:
# Row sums of the binary term-by-document matrix: number of documents containing each term
df = np.sum(term_document_matrix, axis=1)

Dividing each row by the DF vector (which is easy to express in numpy) does the division by n_b...

In [17]:
PMI_part = cooccurence_matrix / df

...then we just need to divide again over each column (which is also easy to express in numpy, as long as we turn the DF vector into a column-shaped vector) to do the division by n_a.

In [18]:
PMI = PMI_part/df.reshape(df.shape[0],1)
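The two divisions can be checked against a single division by the outer product of the DF vector with itself. This is a minimal sketch on made-up numbers (`toy_cooc`, `toy_df`, `step1`, `step2` and `one_shot` are hypothetical names), not the notebook's actual matrices.

```python
import numpy as np

# Hypothetical 3-term co-occurrence matrix; its diagonal holds the document frequencies.
toy_cooc = np.array([[3, 2, 2],
                     [2, 3, 1],
                     [2, 1, 2]])
toy_df = np.array([3, 3, 2])

# Shape (3,) broadcasts along the last axis: entry [i, j] is divided by toy_df[j].
step1 = toy_cooc / toy_df

# Shape (3, 1) broadcasts along the first axis: entry [i, j] is divided by toy_df[i].
step2 = step1 / toy_df.reshape(-1, 1)

# Equivalent single step: divide by the matrix whose [i, j] entry is toy_df[i] * toy_df[j].
one_shot = toy_cooc / np.outer(toy_df, toy_df)

print(np.allclose(step2, one_shot))
```

On the diagonal the ratio reduces to 1 / n_a, which is consistent with the score that "class" gets against itself in the notebook output: 1 / 171 is approximately 0.00585.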

Let's see how the similar words changed!

In [19]:
find_most_similar_words("class", sim_matrix = PMI , topk = 5)

Most similar 5 words to "class" are:
class 0.005847953216374269
weekend 0.002789849240839101
biggest 0.0026990553306342783
rules 0.002549107812265707
board 0.002519849759901108


In [None]:
x = np.array([[1,2,3], [4,5,6], [7,8,9]])
v = np.array([1, -1, -2])

In [None]:
x

In [None]:
x/v

In [None]:
v.reshape(3,1)

In [None]:
x/v.reshape(3,1)
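The cells above exercise the same broadcasting rule on a tiny matrix. As a sketch of what they should evaluate to, the values below were worked out by hand from the definitions of `x` and `v`; `by_column` and `by_row` are hypothetical names.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
v = np.array([1, -1, -2])

# v has shape (3,): column j of x is divided by v[j].
by_column = x / v

# v.reshape(3, 1) has shape (3, 1): row i of x is divided by v[i].
by_row = x / v.reshape(3, 1)

print(by_column)
print(by_row)
```

Here `by_column` has rows `[1, -2, -1.5]`, `[4, -5, -3]` and `[7, -8, -4.5]`, while `by_row` has rows `[1, 2, 3]`, `[-4, -5, -6]` and `[-3.5, -4, -4.5]`.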