Task 1: Language Models [30 marks]

1. Implement an N-gram language model (e.g., bigram or trigram) using the four provided text files (country.txt, pop.txt, rap.txt, and rock.txt).
2. Predict the next word of each n-gram of the words in Sentence 1 using your N-gram model. For example, what is the predicted next word given the two words “Tonight I” under the trigram model? You can answer this question in the following format.

In [1]:
import nltk
from nltk.util import ngrams
from collections import defaultdict, Counter

files = [
    "/Users/mengrui/Desktop/A3/country.txt",
    "/Users/mengrui/Desktop/A3/pop.txt",
    "/Users/mengrui/Desktop/A3/rap.txt",
    "/Users/mengrui/Desktop/A3/rock.txt",
]

def load_data(file_paths):
    text = ""
    for file_path in file_paths:
        with open(file_path, 'r') as file:
            text += file.read().lower() + " "  
    return text

text = load_data(files)
tokens = nltk.word_tokenize(text)

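# Order of the model: N = 3 gives a trigram model.
N = 3

# pad_left/pad_right add None padding at the start and end of the token stream.
ngrams_list = list(ngrams(tokens, N, pad_left=True, pad_right=True))
n_minus_1_grams_list = list(ngrams(tokens, N-1, pad_left=True, pad_right=True))

n_minus_1_grams_counts = Counter(n_minus_1_grams_list)
n_grams_counts = Counter(ngrams_list)

vocab_size = len(set(tokens))

# conditional_probs[context][word] = P(word | context) with add-one (Laplace) smoothing.
# Only n-grams observed in the corpus are stored.
conditional_probs = defaultdict(lambda: defaultdict(float))

for ngram in n_grams_counts:
    n_minus_1_gram = ngram[:-1]
    word = ngram[-1]
    conditional_probs[n_minus_1_gram][word] = (n_grams_counts[ngram] + 1) / (n_minus_1_grams_counts[n_minus_1_gram] + vocab_size)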

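def predict_next_word(context):
    # Keep only the last N-1 words as the conditioning context.
    context = tuple(context[-(N-1):])
    if context in conditional_probs:
        possible_words = conditional_probs[context]
        return max(possible_words, key=possible_words.get)
    else:
        # Unseen context: fall back to the last word of the most frequent n-gram overall.
        return max(n_grams_counts, key=n_grams_counts.get)[-1] if n_grams_counts else None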

sentence = "Tonight I will make the evening meal."  
sentence_tokens = nltk.word_tokenize(sentence.lower())

for i in range(len(sentence_tokens) - (N-1)):
    context = sentence_tokens[i:i + (N-1)]
    next_word = predict_next_word(context)
    print(f"Input: {tuple(context)} --- Output: {next_word}")


Input: ('tonight', 'i') --- Output: 'm
Input: ('i', 'will') --- Output: follow
Input: ('will', 'make') --- Output: it
Input: ('make', 'the') --- Output: world
Input: ('the', 'evening') --- Output: dump
Input: ('evening', 'meal') --- Output: >
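
The add-one (Laplace) estimate used above is P(w | context) = (count(context, w) + 1) / (count(context) + V), where V is the vocabulary size. The following is a minimal sketch of that estimate on a hypothetical toy corpus, so the arithmetic can be checked by hand. The names toy_tokens, smoothed_prob, and predict_next_word_toy are illustrative and not part of the original answer, and the sketch skips NLTK and padding for brevity.

```python
from collections import Counter

# Hypothetical toy corpus; the real model is trained on the four lyric files.
toy_tokens = "i will follow you i will follow him i will wait".split()

N = 3
# Count every trigram and every (N-1)-word context in the token list.
trigram_counts = Counter(zip(*(toy_tokens[i:] for i in range(N))))
context_counts = Counter(zip(*(toy_tokens[i:] for i in range(N - 1))))
vocab = sorted(set(toy_tokens))
vocab_size = len(vocab)

def smoothed_prob(context, word):
    """Add-one estimate of P(word | context); unseen trigrams still get non-zero mass."""
    context = tuple(context)
    return (trigram_counts[context + (word,)] + 1) / (context_counts[context] + vocab_size)

def predict_next_word_toy(context):
    """Return the vocabulary word with the highest smoothed probability."""
    return max(vocab, key=lambda w: smoothed_prob(context, w))

# ("i", "will") occurs 3 times and is followed by "follow" twice; V = 6.
print(smoothed_prob(("i", "will"), "follow"))  # (2 + 1) / (3 + 6)
print(predict_next_word_toy(["i", "will"]))
```

Because the same constant is added to every count, the most probable word for a seen context is the same as with raw counts. Smoothing only changes the result when probabilities of unseen continuations are needed.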


Task 2: Word Embeddings [70 marks]

1. Use a pretrained word2vec model to conduct some experiments. Please refer to the hints. Load the pretrained word2vec model with the following code.
     !pip -qq install gensim
     import gensim.downloader as api
     model = api.load('word2vec-google-news-300')
(a) Get the five most similar words to “speech”.

In [5]:
import gensim.downloader as api

model=api.load('word2vec-google-news-300')

similar_words = model.similar_by_word("speech", topn=5)
for word, similarity in similar_words:
    print(f"Word: {word}, Similarity: {similarity}")
 

Word: speeches, Similarity: 0.6758114099502563
Word: keynote_speech, Similarity: 0.6579364538192749
Word: speach, Similarity: 0.6468180418014526
Word: remarks, Similarity: 0.6410110592842102
Word: Speech, Similarity: 0.6331154704093933


(b) Confirm that “queen” = “king” − “male” + “female” (“queen” should be among the three most similar words to the right-hand side of the equation).

To check that "queen" is close to "king" - "male" + "female", we first build the vector model['king'] - model['male'] + model['female']. We then call similar_by_vector (a method of the KeyedVectors object returned by api.load) to find the three words most similar to that vector. "queen" appears in the output, in second place after "king" itself, so it is among the three most similar words and the relation is confirmed.

In [4]:
import gensim.downloader as api

model=api.load('word2vec-google-news-300')

result_vector = model['king'] - model['male'] + model['female']

similar_words = model.similar_by_vector(result_vector, topn=3)

for word, similarity in similar_words:
    print(f"Word: {word}, Similarity: {similarity}")

 

Word: king, Similarity: 0.8830681443214417
Word: queen, Similarity: 0.6669612526893616
Word: kings, Similarity: 0.6140398979187012
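
In the output above, "king" ranks first because similar_by_vector compares the result vector with every word in the vocabulary, including the query words. A common alternative is gensim's most_similar(positive=[...], negative=[...]), which leaves the input words out of the ranking. The sketch below illustrates that filtering idea with hypothetical 3-dimensional toy vectors. The values and the helper rank_by_similarity are made up for illustration and are not real word2vec data.

```python
import math

# Hypothetical toy vectors, chosen only for illustration (not real word2vec values).
toy = {
    "king":   [0.9, 0.4, 0.1],
    "queen":  [0.8, -0.6, 0.1],
    "kings":  [0.7, 0.6, 0.5],
    "male":   [0.1, 0.3, 0.0],
    "female": [0.1, -0.1, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rank_by_similarity(target, exclude=()):
    """Rank toy words by cosine similarity to target, skipping excluded words."""
    scores = [(w, cosine(target, vec)) for w, vec in toy.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# king - male + female, computed component by component.
target = [k - m + f for k, m, f in zip(toy["king"], toy["male"], toy["female"])]

ranked_all = rank_by_similarity(target)
ranked_filtered = rank_by_similarity(target, exclude=("king", "male", "female"))

print(ranked_all[0][0])       # top word when the query words are kept
print(ranked_filtered[0][0])  # top word when the query words are excluded
```

With these toy values the unfiltered ranking starts with "king" and the filtered ranking starts with "queen", which mirrors the pattern in the real output.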


2. Calculate the similarity of the two sentences using the word2vec model.

In [6]:
from gensim.models import KeyedVectors
import gensim.downloader as api

model = api.load('word2vec-google-news-300')

sentence1 = "Tonight, I will make the evening meal."
sentence2 = "I am going to make dinner tonight."

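# Whitespace tokenization: punctuation stays attached (e.g. "tonight," and "meal.").
words1 = sentence1.lower().split()
words2 = sentence2.lower().split()

# Similarity between the two word lists as computed by the pretrained model.
similarity_score = model.n_similarity(words1, words2)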
print(f"Sentence Similarity: {similarity_score}")


Sentence Similarity: 0.773637056350708
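
As far as I understand it, n_similarity averages the word vectors of each list and returns the cosine similarity of the two mean vectors. The sketch below reproduces that idea with hypothetical 2-dimensional toy embeddings. The values and the helper names mean_vector and sentence_similarity are assumptions for illustration, not gensim internals.

```python
import math

# Hypothetical 2-d toy embeddings (not real word2vec values).
toy_vectors = {
    "dinner": [0.9, 0.1],
    "meal": [0.8, 0.3],
    "tonight": [0.2, 0.9],
    "evening": [0.3, 0.8],
}

def mean_vector(words):
    """Average the vectors of the words found in the toy vocabulary."""
    known = [toy_vectors[w] for w in words if w in toy_vectors]
    if not known:
        raise ValueError("none of the words are in the toy vocabulary")
    return [sum(component) / len(known) for component in zip(*known)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def sentence_similarity(words1, words2):
    """Cosine similarity between the mean vectors of two word lists."""
    return cosine(mean_vector(words1), mean_vector(words2))

# The two word lists share no token, yet their mean vectors point the same way.
score = sentence_similarity(["evening", "meal"], ["dinner", "tonight"])
print(round(score, 3))
```

The two toy word lists have no word in common, but the score is close to 1 because the averaged vectors are nearly parallel.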


3. The function below is called Jaccard similarity. In a few sentences, explain how Jaccard similarity computes the similarity of two sentences. Then calculate the Jaccard similarity of Sentence 1 and Sentence 2.

First, we tokenize the two sentences and convert the words to lowercase, which gives two sets of words. Next, we compute the intersection, which contains the words that appear in both sets, and the union, which contains every word that appears in either set. Finally, the similarity is the size of the intersection divided by the size of the union.
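
For reference, this description corresponds to the standard Jaccard formula, where A and B are the two word sets. The worked numbers assume the whitespace tokenization used in the cell below, which leaves punctuation attached: A and B each contain 7 tokens and share only "i" and "make".

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
\qquad\Longrightarrow\qquad
J(A, B) = \frac{2}{7 + 7 - 2} = \frac{2}{12} \approx 0.167
```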

In [7]:
sentence1 = "Tonight, I will make the evening meal."
sentence2 = "I am going to make dinner tonight."

def jaccard_similarity(sentence1, sentence2):
    tokens1 = set(sentence1.lower().split())
    tokens2 = set(sentence2.lower().split())
    intersection = tokens1.intersection(tokens2)
    union = tokens1.union(tokens2)
    return len(intersection) / len(union)

similarity_score = jaccard_similarity(sentence1, sentence2)
print(f"Jaccard Similarity: {similarity_score}")


Jaccard Similarity: 0.16666666666666666
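
The score above depends on the tokenizer: split() keeps punctuation attached, so "tonight," and "tonight." count as different words. Below is a minimal sketch of a variant that strips surrounding punctuation before building the sets. The helper jaccard_similarity_stripped is hypothetical and not part of the original answer.

```python
import string

def jaccard_similarity_stripped(sentence1, sentence2):
    """Jaccard similarity after lowercasing and stripping surrounding punctuation."""
    def tokenize(sentence):
        stripped = (word.strip(string.punctuation) for word in sentence.lower().split())
        return {word for word in stripped if word}

    tokens1, tokens2 = tokenize(sentence1), tokenize(sentence2)
    return len(tokens1 & tokens2) / len(tokens1 | tokens2)

score = jaccard_similarity_stripped(
    "Tonight, I will make the evening meal.",
    "I am going to make dinner tonight.",
)
print(f"Jaccard Similarity (punctuation stripped): {score}")
```

With this tokenization the sentences share three words (tonight, i, make) out of eleven distinct words, so the score is 3/11 ≈ 0.27 instead of 2/12 ≈ 0.17.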


4. Why do the similarity scores of word2vec and Jaccard similarity differ a lot?

In my opinion, the scores differ mainly because the two methods rely on different underlying mechanisms. Word2Vec captures semantic similarity based on the contexts in which words appear in a large corpus, and it represents each word as a dense vector, so it can recognize synonyms and semantically related words. For example, "dinner" and "evening meal" share no tokens, yet they mean nearly the same thing, which the embeddings can reflect. In contrast, Jaccard similarity is a simple metric that relies solely on the presence or absence of words in the sets derived from the sentences. It does not consider word meanings or relationships, which is why it gives about 0.17 here while word2vec gives about 0.77.