# CSC2611 Exercise: Meaning construction from text

### Step 1. Import NLTK in Python: http://www.nltk.org/. Download the Brown Corpus http://www.nltk.org/book/ch02.html for analyses below.

In [1]:
import nltk
from nltk.corpus import brown

nltk.download('brown')

# read in lowercased words
brown_lower_words = [x.lower() for x in brown.words()]

[nltk_data] Downloading package brown to C:\Users\Jungeun (June)
[nltk_data]     Lim\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


### Step 2-1. Extract the 5000 most common English words (denoted by W) based on unigram frequencies in the Brown corpus.

In [2]:
from collections import Counter
import re

K=5000

def create_vocab(data):
    """Extract the K most common words based on unigram frequencies in the given corpus.
    Return the list of the K words.
    """
    
    vocab=Counter()
    for i, word in enumerate(data):
        # considering multiple words connected with punctuations such as
        # "face-saving", "a.k.a.", or "you're" as well
        if re.match('^[a-z]', word):
            vocab[word]+=1
        
    W=[word for word,v in vocab.most_common(K)]

    return W

### Report the 5 most and least common words you have found in the 5000 words.

In [3]:
W = create_vocab(brown_lower_words)

print("The 5 most common words:", W[:5])
print("The 5 least common words:", W[-5:])

The 5 most common words: ['the', 'of', 'and', 'to', 'a']
The 5 least common words: ['packed', 'lacked', 'condemned', 'documents', 'corporate']


### Step 2-2. Update W by adding n words where n is the set of words in Table 1 of RG65 that were not included in the top 5000 words from the Brown corpus. Denote the total number of words in W as |W|.

In [4]:
rg65_fname="data/RG65_table1.txt"

more_words=set([w for w in re.split(r'\s|\n', open(rg65_fname).read()) if w.isalpha()])

for w in more_words:
    if not w in W:
        W.append(w)
        
print("After adding the words from RG65, the total number of words in W is now:", len(W))

After adding the words from RG65, the total number of words in W is now: 5030


### Step 3. Construct a word-context vector model (denoted by M1) by collecting bigram counts for words in W. The output should be a |W|*|W| matrix (consider using sparse matrices for better efficiency), where each row is a word in W, and each column is a context in W that precedes row words in sentences.

In [5]:
import pandas as pd
import numpy as np

def make_word_context_vector(data, W):
    M1 = pd.DataFrame(0, columns=W, index=W)
    
    for i, word in enumerate(data):
        if type(word) != str or word not in W or i == 0:
            continue
        context = data[i-1]
        if context in W:
            M1[context][word]+=1
        
    return M1

In [6]:
%%time

# takes 2-3 mins

M1 = make_word_context_vector(brown_lower_words, W)
M1 = M1.astype(pd.SparseDtype("int", 0))

#save the matrix for future use
#M1.to_pickle("./M1.pkl")

Wall time: 2min 48s


In [7]:
#the saved matrix as a .pkl file may be loaded instead of building it from scratch
#M1 = pd.read_pickle("./M1.pkl")

### Step 4. Compute positive pointwise mutual information on M1. Denote this model as M1+.

In [8]:
def ppmi(M1):
    """Given a 2d pandas dataframe filled with raw word co-occurrence counts,
    Return a 2d pandas dataframe filled with positive pointwise mutual information score for each entry.
    Formula: log( p(x|y)/p(x) ), y being the column(context-word), x being the row(word).
    PMI is zero if X and Y are independent. 
    PPMI considers negative PMI as zero.
    """
    
    # Get numpy array from pandas dataframe for faster computation
    arr = M1.to_numpy()

    # p(x|y)
    col_totals = arr.sum(axis=0).astype(int)
    prob_rows_given_col = arr / col_totals

    # p(x)
    row_totals = arr.sum(axis=1).astype(int)
    prob_rows = row_totals / sum(row_totals)

    # PMI: log( p(x|y) / p(x) )
    ratio = (prob_rows_given_col.T / prob_rows).T
    ratio[ratio==0] = 0.00001
    pmi = np.log(ratio)
    pmi[pmi < 0] = 0

    return pd.DataFrame(data=pmi, index=M1.index, columns=M1.columns).fillna(0)

In [9]:
M1_plus = ppmi(M1)
M1_plus = M1_plus.astype(pd.SparseDtype("int", 0))

#save the matrix for future use
#M1_plus.to_pickle("./M1_plus.pkl")

  


In [10]:
#the saved matrix as a .pkl file may be loaded instead of building it from scratch
#M1_plus = pd.read_pickle("./M1_plus.pkl")

### Step 5. Construct a latent semantic model (denoted by M2) by applying principal components analysis to M1+. The output should return 3 matrices, with different truncated dimenions at 10 (or a |W|*10 matrix, denoted by M2_10), 100 (M2_100), and 300 (M2_300).

In [11]:
from sklearn.decomposition import PCA 

def pca(df, n):
    arr = PCA(n_components=n).fit_transform(M1_plus)
    return pd.DataFrame(arr, columns=np.arange(n), index=M1_plus.index)

M2_10 = pca(M1_plus, 10)
M2_100 = pca(M1_plus, 100)
M2_300 = pca(M1_plus, 300)

### Step 6. Find all pairs of words in Table 1 of RG65 that are also available in W. Denote these pairs as P. Record the human-judged similarities of these word pairs from the table and denote similarity values as S.

In [12]:
rg_PS_dic = {} # this dictionary contains P as keys and S as values

with open(rg65_fname, 'r',) as rg:
    rg_list = [s.strip().split(' ') for s in rg.read().split('\n')]
    for i, v in enumerate(rg_list):
        if i % 2 == 0:
            rg_pair = tuple(v)
        else:
            rg_PS_dic[rg_pair] = float(v[0])

### Step 7. Perform the following calculations on each of these models M1, M1+, M2_10, M2_100, M2_300, separately: Calculate cosine similarity between each pair of words in P, based on the constructed word vectors. Record model-predicted similarities: S_M1, S_M2_10 , S_M2_100 , S_M2_300 .

In [13]:
from sklearn.metrics.pairwise import cosine_similarity

def cos_sim_two_words(df, w1, w2):
    return cosine_similarity(X=df.loc[w1].to_numpy().reshape(1, -1), Y=df.loc[w2].to_numpy().reshape(1, -1))

In [14]:
%%time

# Calculate cosine similarity between each pair of words
# Takes about 17 seconds

S_M1 = dict.fromkeys(rg_PS_dic.keys(),-1.0)
S_M1_plus = dict.fromkeys(rg_PS_dic.keys(),-1.0)
S_M2_10 = dict.fromkeys(rg_PS_dic.keys(),-1.0)
S_M2_100 = dict.fromkeys(rg_PS_dic.keys(),-1.0)
S_M2_300 = dict.fromkeys(rg_PS_dic.keys(),-1.0)

for pair in rg_PS_dic:
    S_M1[pair] = cos_sim_two_words(M1, pair[0], pair[1])[0][0]
    S_M1_plus[pair] = cos_sim_two_words(M1_plus, pair[0], pair[1])[0][0]
    S_M2_10[pair] = cos_sim_two_words(M2_10, pair[0], pair[1])[0][0]
    S_M2_100[pair] = cos_sim_two_words(M2_100, pair[0], pair[1])[0][0]
    S_M2_300[pair] = cos_sim_two_words(M2_300, pair[0], pair[1])[0][0]

Wall time: 20.2 s


### Step 8. Report Pearson correlation between S and each of the model-predicted similarities. 

In [15]:
from scipy.stats import pearsonr

corr_S_M1 = pearsonr(list(S_M1.values()), list(rg_PS_dic.values()))
corr_S_M1_plus = pearsonr(list(S_M1_plus.values()), list(rg_PS_dic.values()))
corr_S_M2_10 = pearsonr(list(S_M2_10.values()), list(rg_PS_dic.values()))
corr_S_M2_100 = pearsonr(list(S_M2_100.values()), list(rg_PS_dic.values()))
corr_S_M2_300 = pearsonr(list(S_M2_300.values()), list(rg_PS_dic.values()))

print(f"Pearson’s correlation coefficient between humans and M1:\t {corr_S_M1[0]:.2}", f"\tp-value: {corr_S_M1[1]:.2}")
print(f"Pearson’s correlation coefficient between humans and M1_plus:\t {corr_S_M1_plus[0]:.2}", f"\tp-value: {corr_S_M1_plus[1]:.2}")
print(f"Pearson’s correlation coefficient between humans and M2_10:\t {corr_S_M2_10[0]:.2}", f"\tp-value: {corr_S_M2_10[1]:.2}")
print(f"Pearson’s correlation coefficient between humans and M2_100:\t {corr_S_M2_100[0]:.2}", f"\tp-value: {corr_S_M2_100[1]:.2}")
print(f"Pearson’s correlation coefficient between humans and M2_300:\t {corr_S_M2_300[0]:.2}", f"\tp-value: {corr_S_M2_300[1]:.2}")

Pearson’s correlation coefficient between humans and M1:	 0.025 	p-value: 0.84
Pearson’s correlation coefficient between humans and M1_plus:	 0.29 	p-value: 0.018
Pearson’s correlation coefficient between humans and M2_10:	 0.21 	p-value: 0.09
Pearson’s correlation coefficient between humans and M2_100:	 0.33 	p-value: 0.0068
Pearson’s correlation coefficient between humans and M2_300:	 0.36 	p-value: 0.0037


# CSC2611 Lab: Word embedding and semantic change
## Part 1 Synchronic word embedding

### Step 1. Download the pre-trained word2vec embeddings from https://code.google.com/archive/p/word2vec/, specifically, the file "GoogleNews-vectors-negative300.bin.gz".

In [16]:
%%time

# load the pre-trained word2vec embeddings (the file is manually downloaded from the provided URL)
# takes 2-4 mins 

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin.gz', binary=True)

Wall time: 3min 12s


### Step 2 & 3. Using gensim, extract embeddings of words in Table 1 of RG65 that also appeared in the set W from the earlier exercise, i.e., the pairs of words should be identical in all analyses. Calculate cosine distance between each pair of word embeddings you have extracted, and report the Pearson correlation between word2vec-based and human similarities. Comment on this value in comparison to those from LSA and word-context vectors from analyses in the earlier exercise.

In [17]:
# Extract embeddings of words in Table 1 of RG65 that also appeared in the set W from the earlier exercise,
# and calculate cosine distance between each pair of the word embeddings

S_WB = dict.fromkeys(rg_PS_dic.keys(),-1.0)

for pair in rg_PS_dic:
    S_WB[pair] = cosine_similarity(X=model[pair[0]].reshape(1, -1), Y=model[pair[1]].reshape(1, -1))[0][0]

In [18]:
# Compute and report the Pearson correlation between word2vec-based and human similarities

corr_S_WB = pearsonr(list(S_WB.values()), list(rg_PS_dic.values()))

print(f"Pearson’s correlation coefficient between humans and word2vec:\t {corr_S_WB[0]:.2}", f"\tp-value: {corr_S_WB[1]:.2}")

Pearson’s correlation coefficient between humans and word2vec:	 0.77 	p-value: 5.1e-14


### Step 4. Perform the analogy test based on the provided data with the pre-trained word2vec embeddings. Report the accuracy on the semantic analogy test and the syntactic analogy test. Repeat the analysis with LSA vectors (300 dimensions) from the earlier exercise, and commment on the results in comparison to those from word2vec.

In [19]:
# Download the analogy test file

import requests

anal_file = requests.get('http://www.fit.vutbr.cz/~imikolov/rnnlm/word-test.v1.txt')

In [20]:
# Read the analogy test file

anal_data = {}     
for line in anal_file.text.split('\n')[1:]:
    if line.startswith(':'):
        anal_type = line[1:].strip()
        anal_data[anal_type] = []
    else:
        anal_data[anal_type].append(line.lower().strip().split())

anal_data['gram9-plural-verbs'].pop() # to remove a blank line at the end of the file

[]

In [21]:
# Filter the words that are not in the 5,000 most frequent words from the Brown corpus;
# only the analogy test cases in which all the 4 words are in the 5,000 most frequent words will be used

anal_data_final = {key: list() for key in anal_data.keys()}

for anal_type, test_case_list in anal_data.items():
    for test_case in test_case_list:
        if (len([word for word in test_case if word in W])) == 4:
            anal_data_final[anal_type].append(test_case)

In [22]:
# How many analogy test cases do I have now?

num_test_cases = 0

for k in anal_data_final.keys():
    num_test_cases += len(anal_data_final[k])
    print(f"{k}: {len(anal_data_final[k])} cases left")

num_sem_test_cases = len(anal_data_final['capital-common-countries']) + len(anal_data_final['capital-world']) + len(anal_data_final['city-in-state']) + len(anal_data_final['family'])
num_syn_test_cases = num_test_cases - num_sem_test_cases
print(f"\n{num_sem_test_cases} semantic cases to test")
print(f"{num_syn_test_cases} syntactic cases to test")
print(f"{num_test_cases} cases in total to test")

capital-common-countries: 20 cases left
capital-world: 6 cases left
currency: 0 cases left
city-in-state: 46 cases left
family: 90 cases left
gram1-adjective-to-adverb: 380 cases left
gram2-opposite: 20 cases left
gram3-comparative: 240 cases left
gram4-superlative: 42 cases left
gram5-present-participle: 272 cases left
gram6-nationality-adjective: 53 cases left
gram7-past-tense: 600 cases left
gram8-plural: 306 cases left
gram9-plural-verbs: 132 cases left

162 semantic cases to test
2045 syntactic cases to test
2207 cases in total to test


In [23]:
# since there is no test cases of currency, remove the key from the data
del anal_data_final['currency']

# Store the analogy cases to test into text files (in the same format as the initial data file)
# One file for semantic analogies, the other for syntatic analogies

sem_file = open("data/semantic_analogies.txt", 'w')
syn_file = open("data/syntactic_analogies.txt", 'w')

count = 0
for k, v in anal_data_final.items():
    category = ": " + k + "\n"
    count += 1
    if count < 6:
        FILE = sem_file
    else:
        FILE = syn_file
    FILE.write(category)
    for case in v:
        for word in case:  
            FILE.write(word + " ")
        FILE.write("\n")

sem_file.close()
syn_file.close()

In [24]:
%%time

# Word2vec word analogy test: uses 
# takes about 1-2 minutes

sem_result = model.evaluate_word_analogies("data/semantic_analogies.txt")
syn_result = model.evaluate_word_analogies("data/syntactic_analogies.txt")

Wall time: 8min 11s


In [25]:
%%time

# LSA vectors(300 dimensions) semantic analogy test
# takes about 2-3 minutes

M2_300_np = M2_300.to_numpy()

sem_correct_count = 0   # how many answers are correct?
sem_correct_list = []   # store the correct answers
sem_incorrect_list = [] # store the incorrect answers

keys_list = list(anal_data_final.keys())

for i in range(0,4):
    for question in anal_data_final[keys_list[i]]:
        w1 = question[0]
        w2 = question[1]
        w3 = question[2]
        w4 = question[3] # word to guess!
        
        v1 = M2_300.loc[w1].to_numpy()
        v2 = M2_300.loc[w2].to_numpy()
        v3 = M2_300.loc[w3].to_numpy()
        
        v4_expected = v3-v1+v2
            
        cos_sim_max = -1
        v4_loc = 0
            
        # find the closest word vector (answer) to the acquired vector in terms of cosine similarity
        for i in range(len(M2_300_np)):
            # see the similarity b/w each word and the expected v4
            cos_sim = cosine_similarity(X=M2_300_np[i].reshape(1, -1), Y=v4_expected.reshape(1, -1))
            
            # if the word is more similar to the expected v4
            # than any other previously seen words, record that word
            # * the words in the analogy question is ignored
            if cos_sim_max < cos_sim and M2_300.index[i] != w1 and M2_300.index[i] != w2 and M2_300.index[i] != w3:
                cos_sim_max = cos_sim
                v4_loc = i

        answer = M2_300.index[v4_loc]
        
        if answer == w4:
            sem_correct_count += 1
            sem_correct_list.append([question, answer])
        else:
            sem_incorrect_list.append([question, answer])

Wall time: 3min 6s


In [26]:
%%time

# LSA vectors(300 dimensions) syntactic analogy test
# takes about 40 minutes!
# I may want to find a more efficient way to find the closest vector to the given vector...

syn_correct_count = 0   # how many answers are correct?
syn_correct_list = []   # store the correct answers
syn_incorrect_list = [] # store the incorrect answers

for i in range(4,13):
    for question in anal_data_final[keys_list[i]]:
        w1 = question[0]
        w2 = question[1]
        w3 = question[2]
        w4 = question[3] # word to guess!
        
        v1 = M2_300.loc[w1].to_numpy()
        v2 = M2_300.loc[w2].to_numpy()
        v3 = M2_300.loc[w3].to_numpy()
        
        v4_expected = v3-v1+v2
            
        cos_sim_max = -1
        v4_loc = 0
            
        # find the closest word vector (answer) to the acquired vector in terms of cosine similarity
        for i in range(len(M2_300_np)):
            cos_sim = cosine_similarity(X=M2_300_np[i].reshape(1, -1), Y=v4_expected.reshape(1, -1))
            
            # if the word is more similar to the expected v4
            # than any other previously seen words, record that word
            # * the words in the analogy question is ignored
            if cos_sim_max < cos_sim and M2_300.index[i] != w1 and M2_300.index[i] != w2 and M2_300.index[i] != w3:
                cos_sim_max = cos_sim
                v4_loc = i

        answer = M2_300.index[v4_loc]
        
        if answer == w4:
            syn_correct_count += 1
            syn_correct_list.append([question, answer])
        else:
            syn_incorrect_list.append([question, answer])

Wall time: 41min 24s


In [27]:
# Report the result

print(f"Word2vec accuracy on the semantic analogy test: {sem_result[0]:.2}, {len(sem_result[1][-1]['correct'])} out of {num_sem_test_cases}")
print(f"Word2vec accuracy on the syntactic analogy test: {syn_result[0]:.2}, {len(syn_result[1][-1]['correct'])} out of {num_syn_test_cases}")
print(f"LSA(300d) accuracy on the semantic analogy test: {sem_correct_count/num_sem_test_cases:.2}, {sem_correct_count} out of {num_sem_test_cases}")
print(f"LSA(300d) accuracy on the syntactic analogy test: {syn_correct_count/num_syn_test_cases:.2}, {syn_correct_count} out of {num_syn_test_cases}")

Word2vec accuracy on the semantic analogy test: 0.46, 251 out of 162
Word2vec accuracy on the syntactic analogy test: 0.77, 1281 out of 2045
LSA(300d) accuracy on the semantic analogy test: 0.17, 28 out of 162
LSA(300d) accuracy on the syntactic analogy test: 0.053, 109 out of 2045
