<a href="https://colab.research.google.com/github/ItWasAllYellow/NLP_2025/blob/main/NLP_Assignment_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 1

In this assignment, you will explore word vectors.

- Submission: A report in ``pdf``, your completed notebook file in ``ipynb``, and training data in ``txt``
    - The assignment will be evaluated mainly on the report, so please include every detail you want to present in your report, including figures.
    - Report: Free format. You can copy and paste part of your code for some problems.
      - The report has to be written in English
    - ipynb: Save your notebook (with output of each cell if possible) as ipynb and submit it
- Evaluation criteria
    - How interesting and original the presented examples are
    - How well you explain why your examples succeed or fail, considering how Word2Vec is trained
    - Any description that appears to have been written with an LLM without understanding the content may be penalized. You may use an LLM for translation, but you have to state this in your report.

## 0. Setup
- Check that the ``gensim`` library is installed
  - If not, you can install it using ``!pip install gensim``
- List the downloadable vectors from ``gensim``


In [1]:
!pip install gensim



In [2]:
import gensim
import numpy as np
import pprint as pp

In [3]:
import gensim.downloader
list(gensim.downloader.info()['models'].keys())

['fasttext-wiki-news-subwords-300',
 'conceptnet-numberbatch-17-06-300',
 'word2vec-ruscorpora-300',
 'word2vec-google-news-300',
 'glove-wiki-gigaword-50',
 'glove-wiki-gigaword-100',
 'glove-wiki-gigaword-200',
 'glove-wiki-gigaword-300',
 'glove-twitter-25',
 'glove-twitter-50',
 'glove-twitter-100',
 'glove-twitter-200',
 '__testing_word2vec-matrix-synopsis']

- Among the model codes above, select one model of your choice from the ``glove-wiki-gigaword`` or ``glove-twitter`` families
    - These are GloVe vectors rather than Word2Vec vectors, but gensim loads them as the same ``KeyedVectors`` type, so they are used in the same way throughout this notebook
    - The number at the end is the dimension of each word vector
        - e.g. ``glove-twitter-200`` was trained on a Twitter dataset and embeds each word into a 200-dim vector
        - e.g. ``glove-wiki-gigaword-300`` was trained on Wikipedia and Gigaword text and embeds each word into a 300-dim vector
- Download the selected model and load it as ``model``

In [4]:
your_model_code = 'glove-wiki-gigaword-300' # select one of the model codes above
model = gensim.downloader.load(your_model_code) # download and load the model. It can take some time



In [5]:
# test the model output
model['cat']

array([-0.29353  ,  0.33247  , -0.047372 , -0.12247  ,  0.071956 ,
       -0.23408  , -0.06238  , -0.0037192, -0.39462  , -0.69411  ,
        0.36731  , -0.12141  , -0.044485 , -0.15268  ,  0.34864  ,
        0.22926  ,  0.54361  ,  0.25215  ,  0.097972 , -0.087305 ,
        0.87058  , -0.12211  , -0.079825 ,  0.28712  , -0.68563  ,
       -0.27265  ,  0.22056  , -0.75752  ,  0.56293  ,  0.091377 ,
       -0.71004  , -0.3142   , -0.56826  , -0.26684  , -0.60102  ,
        0.26959  , -0.17992  ,  0.10701  , -0.57858  ,  0.38161  ,
       -0.67127  ,  0.10927  ,  0.079426 ,  0.022372 , -0.081147 ,
        0.011182 ,  0.67089  , -0.19094  , -0.33676  , -0.48471  ,
       -0.35406  , -0.15209  ,  0.44503  ,  0.46385  ,  0.38409  ,
        0.045081 , -0.59079  ,  0.21763  ,  0.38576  , -0.44567  ,
        0.009332 ,  0.442    ,  0.097062 ,  0.38005  , -0.11881  ,
       -0.42718  , -0.31005  , -0.025058 ,  0.12689  , -0.13468  ,
        0.11976  ,  0.76253  ,  0.2524   , -0.26934  ,  0.0686

## Problem 1. Simple Mathematics with Word2Vec
- In this problem, you have to complete the given functions ``word_analogy_with_vector`` and ``get_cosine_similarity``
  - To get exactly the same result as ``model.most_similar()``, you have to normalize each vector before doing arithmetic.
  - Use the L2 norm (the square root of the sum of squares of the vector's elements)
  - The result will naturally also include the positive query words themselves.
- In your report, **please include your code for these functions**
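
The effect of the normalization step can be sketched on toy data. The three-dimensional vectors below are made up for illustration and do not come from any gensim model, and ``l2_normalize`` and ``cosine`` are hypothetical helper names, not gensim functions. The sketch only shows that the analogy vector built from unit-length vectors points in a different direction than the one built from raw vectors when the word vectors have very different lengths.

```python
import numpy as np

def l2_normalize(v):
    """Divide a vector by its L2 norm (sqrt of the sum of squares)."""
    return v / np.sqrt(np.sum(v ** 2))

def cosine(u, v):
    """Cosine similarity: dot product of the two L2-normalized vectors."""
    return float(np.dot(l2_normalize(u), l2_normalize(v)))

# Made-up toy vectors with very different lengths (assumption, not model data)
toy = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "king":  np.array([4.0, 4.0, 0.0]),
    "woman": np.array([0.0, 0.0, 2.0]),
}

# Analogy vector without normalization: the long "king" vector dominates
raw = toy["woman"] + (toy["king"] - toy["man"])

# Analogy vector with normalization: every word contributes a unit-length vector
unit = {w: l2_normalize(v) for w, v in toy.items()}
normalized = unit["woman"] + (unit["king"] - unit["man"])

print("raw:       ", raw)
print("normalized:", normalized)
print("cosine(raw, normalized):", cosine(raw, normalized))
```

Because the two query vectors point in different directions, the ranking of nearest words can differ as well, which is why the normalization matters when comparing against ``model.most_similar()``.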


In [12]:
def word_analogy_with_vector(model, x_1, x_2, y_1):
  '''
  This function takes a gensim word-vector model and outputs a vector for finding the y_2 that satisfies x_1 → x_2 == y_1 → y_2
  e.g. x_1 (man) → x_2 (king) == y_1 (woman) → y_2(?)

  inputs
  model (gensim.models.keyedvectors.KeyedVectors): Word2Vec model in KeyedVectors in gensim library
  x_1, x_2, y_1 (str): Words in the model's vocabulary.

  output (np.ndarray): A vector in np.ndarray, which can be used to find proper y_2 for given (model, x_1, x_2, y_1)

  CAUTION: You have to normalize (divide vector by its length) the vector before doing arithmetic.
  '''

  # Write your code from here

  v_x1 = model[x_1] / (sum(model[x_1] ** 2) ** 0.5)
  v_x2 = model[x_2] / (sum(model[x_2] ** 2) ** 0.5)
  v_y1 = model[y_1] / (sum(model[y_1] ** 2) ** 0.5)

  v_y2 = v_y1 + (v_x2 - v_x1)

  return v_y2

# test whether the function works well
result_vector = word_analogy_with_vector(model, 'man', 'king', 'woman')
print('result vector is ', result_vector)
assert isinstance(result_vector, np.ndarray), "Output of the function has to be np.ndarray"
model.most_similar(result_vector)

result vector is  [-0.03121225 -0.04885079 -0.00052049  0.08669862  0.09005838 -0.06614241
  0.03420759  0.00142278  0.01974262 -0.13622826 -0.07791571 -0.09958889
  0.06360311  0.08859422  0.03489114  0.01636234  0.06570574  0.05126741
 -0.03242207 -0.04330669 -0.15668158  0.0368449  -0.02985314  0.01033623
  0.03430319 -0.07671046 -0.04376435 -0.04739117  0.06318487  0.01455305
 -0.06891646 -0.05026019 -0.06613863  0.03281747 -0.14246     0.0012542
 -0.0561171   0.00502324  0.06503114  0.00781858  0.02678905 -0.0506545
 -0.01988168  0.05223122  0.02118688 -0.03840835  0.05407127  0.04449621
  0.00837921 -0.10452366 -0.03278344  0.01043135  0.12034565 -0.10535439
 -0.06399182 -0.02953902  0.05116515  0.00461546  0.0933569   0.01697323
  0.00224307 -0.04162461  0.01235028  0.07910962  0.04070729 -0.13898228
  0.02305285 -0.00028938  0.01022689  0.08450352  0.004185   -0.073237
 -0.03146702  0.01343366  0.05988747  0.0379662   0.07760164 -0.10404901
  0.03800888  0.00652905  0.05621087 

[('king', 0.7572609186172485),
 ('queen', 0.6713276505470276),
 ('princess', 0.5432624220848083),
 ('throne', 0.5386104583740234),
 ('monarch', 0.5347574353218079),
 ('daughter', 0.498025119304657),
 ('mother', 0.49564430117607117),
 ('elizabeth', 0.4832652509212494),
 ('kingdom', 0.47747087478637695),
 ('prince', 0.4668239951133728)]

In [15]:
def get_cosine_similarity(model, x, y):
  '''
  This function returns the cosine similarity of x and y

  inputs
  model (gensim.models.keyedvectors.KeyedVectors): Word2Vec model in KeyedVectors in gensim library
  x, y (str): Words in the model's vocabulary.

  output
  similarity (float): cosine similarity between x's vector and y's vector
  '''
  # Write your code from here

  sim = np.dot(model[x], model[y]) / ((sum(model[x] ** 2) ** 0.5) * (sum(model[y] ** 2) ** 0.5))

  return sim

# test the output with your own choice
word_a = 'chocolate'
word_b = 'vanilla'

similarity = get_cosine_similarity(model, word_a, word_b)
print(similarity)
assert -1 <= similarity <= 1, "Similarity has to be between -1 and 1"

print('gensim library result:', model.similarity(word_a, word_b))

0.6099975687524788
gensim library result: 0.6099975


## Problem 2. Find Most Similar Words
- One of the simplest and most typical use cases of Word2Vec is finding words based on similarity.
- You can list the most similar words for a given query word by using ``model.most_similar(your_word)``
    - Usually, every word in the model's vocabulary is in lowercase
- **In your report**, present more than **5** interesting examples and explain **why they were interesting to you**
    - Try to explain why those words are regarded as similar in Word2Vec, considering how it was trained
- Caution: The model was trained on a multilingual dataset. This means that the neighbors of a word can follow its meaning in a non-English language.
    - e.g. "die" and "war" are used more frequently in German (where they mean "the" and "was") than in English.
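
Conceptually, ``most_similar`` ranks every other word in the vocabulary by cosine similarity to the query word. The sketch below reproduces that idea with NumPy on a made-up five-word vocabulary; the vectors and the helper name ``toy_most_similar`` are assumptions for illustration and are not part of gensim.

```python
import numpy as np

# Made-up toy vocabulary and vectors (assumption: not from a real model)
vocab = ["cat", "dog", "kitten", "car", "truck"]
vectors = np.array([
    [0.90, 0.10, 0.00],
    [0.80, 0.30, 0.10],
    [0.95, 0.05, 0.00],
    [0.00, 0.20, 0.90],
    [0.10, 0.10, 0.95],
])

def toy_most_similar(word, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    # Normalize each row to unit length so a dot product equals cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(word)]
    order = np.argsort(-sims)  # indices sorted from most to least similar
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != word][:topn]

print(toy_most_similar("cat"))
```

In a trained model, two words end up with a high cosine similarity when they tend to appear in similar contexts, which is a useful starting point for explaining your own examples.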

In [16]:
target_word = 'coldplay' # Enter your word string here
# check the word is in the vocabulary of the model
assert model.has_index_for(target_word), f"The selected word, {target_word}, is not included in the model's vocabulary"
model.most_similar(target_word)

[('radiohead', 0.6285437345504761),
 ('u2', 0.5711713433265686),
 ('aerosmith', 0.4990842938423157),
 ('beyonce', 0.49628856778144836),
 ('shakira', 0.48919782042503357),
 ('frontman', 0.4847787022590637),
 ('björk', 0.47576695680618286),
 ('minogue', 0.46712526679039),
 ('rihanna', 0.4642452597618103),
 ('springsteen', 0.45410770177841187)]

## Problem 3. Word Analogy
- Another interesting thing you can do with Word2Vec is word analogy
- Word analogy is done by adding and subtracting word vectors
- In the cell below, you can run an example like this
    - ``analogy(model, 'man', 'king', 'woman')`` represents the question "man is to king as woman is to what?"
- **Caution**: Do not confuse the relation between the input words.
    - Some wrong examples:
      - ``analogy(model, 'student', 'school', 'employee')``  Student: School -> Employee: Teacher?
        - This is wrong because the relation between Student and School is not the same as the relation between Employee and Teacher.
      - ``analogy(model, 'android', 'electricity', 'blood')``
        - This is wrong because it calculates ``electricity - android + blood`` instead of ``android - electricity + blood``
- Try with your own choice.
- **In your report**, present at least **5** interesting examples of your choice
    - You can include failure cases
    - Describe what you expected and why the result was interesting to you
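
To see why the argument order matters, the analogy arithmetic can be sketched on toy data. The two-dimensional vectors below are made up (one axis loosely standing for "royalty", the other for "gender"), and ``toy_analogy`` is a hypothetical helper that mimics the ``x2 - x1 + y1`` arithmetic and, like ``most_similar``, leaves the input words out of the ranking.

```python
import numpy as np

# Made-up 2-D toy vectors (assumption: axis 0 ~ "royalty", axis 1 ~ "gender")
toy = {
    "man":   np.array([0.1, 1.0]),
    "woman": np.array([0.1, -1.0]),
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "boy":   np.array([-0.5, 1.0]),
    "girl":  np.array([-0.5, -1.0]),
}
unit = {w: v / np.linalg.norm(v) for w, v in toy.items()}

def toy_analogy(x1, x2, y1, topn=2):
    """Rank words by cosine similarity to x2 - x1 + y1, skipping the inputs."""
    query = unit[x2] - unit[x1] + unit[y1]
    query = query / np.linalg.norm(query)
    scores = {w: float(np.dot(query, v)) for w, v in unit.items()
              if w not in (x1, x2, y1)}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:topn]

print(toy_analogy("man", "king", "woman"))  # man : king == woman : ?
print(toy_analogy("king", "man", "woman"))  # x1 and x2 swapped: a different question
```

Swapping ``x1`` and ``x2`` flips the sign of the difference vector, so the two calls ask different questions and return different top words even on this tiny vocabulary.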

In [20]:
def analogy(model, x1, x2, y1):
  pp.pprint(model.most_similar([x2, y1], negative=[x1]))

# Try with your own word choice
analogy(model, 'student', 'freshman', 'player') # y1 + (x2 - x1)

[('rookie', 0.5814343690872192),
 ('quarterback', 0.5601049065589905),
 ('players', 0.5584985613822937),
 ('starters', 0.5566496253013611),
 ('lineman', 0.5547109246253967),
 ('sophomore', 0.5498921275138855),
 ('scorer', 0.5400611758232117),
 ('defensive', 0.5371955633163452),
 ('redshirt', 0.5318215489387512),
 ('standout', 0.5229174494743347)]


## Problem 4. Visualize Word Vectors
- Select a list of words of your interest
    - **At least 30 words**
    - ``word_list`` is a list of strings
    - every element in ``word_list`` has to be included in the model's vocabulary
- Visualize the vectors of words using dimensionality reduction (in this case, PCA)
- In your report, describe how words are located in 2D space
    - How are the words clustered?
    - Do you think the words are properly located based on their semantic meanings?
    - Are there any surprising or unexpected examples?
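
As a rough picture of what the PCA step does, the sketch below projects made-up 10-dimensional vectors onto their first two principal components using only NumPy. It is a stand-in for illustration, not the plotting code of this notebook: the data is random, the variable names are assumptions, and the projection should agree with a library PCA only up to the sign of each axis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up clusters of 10-dimensional "word vectors" (assumption: toy data)
cluster_a = rng.normal(loc=0.0, scale=0.1, size=(5, 10))
cluster_b = rng.normal(loc=1.0, scale=0.1, size=(5, 10))
X = np.vstack([cluster_a, cluster_b])

# PCA by hand: center the data, then project onto the top-2 right singular vectors
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
twodim = X_centered @ Vt[:2].T

# Fraction of the total variance captured by each principal component
explained = (S ** 2) / np.sum(S ** 2)

print("2-D coordinates shape:", twodim.shape)
print("variance explained by the first two components:", explained[:2].sum())
```

When you interpret your own plot, keep in mind that two components usually keep only part of the variance of 300-dimensional vectors, so words that look close in 2D are not necessarily close in the original space.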

In [21]:
# Run this cell to define the plotting function
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import plotly.express as px

def display_pca_scatterplot(model, words=None, sample=0):
  if len(words) < 30:
    print("WARNING: For your report, please select at least 30 word samples for the visualization")
    print(f"Current length of input word list: {len(words)}")
  word_vectors = np.array([model[w] for w in words])

  twodim = PCA().fit_transform(word_vectors)[:,:2]

  # plt.figure(figsize=(12,12))
  # plt.scatter(twodim[:,0], twodim[:,1], edgecolors='k', c='r')
  # for word, (x,y) in zip(words, twodim):
  #     plt.text(x+0.05, y+0.05, word, fontsize=15)
  fig = px.scatter(twodim, x=0, y=1, text=words)
  fig.update_traces(textposition='top center')
  fig.show()



In [49]:
# Select word list of your own interests
word_list = [
    'guitar', 'base', 'drum', 'keyboard', 'vocal',
    'bjj', 'judo', 'wrestling', 'boxing', 'taekwondo',
    'coldplay', 'oasis', 'radiohead', 'keane', 'maroon5',
    'football', 'soccer', 'baseball', 'basketball', 'messi',
    'sogang', 'snu', 'yonsei', 'ewha', 'hongik'
]

display_pca_scatterplot(model, word_list)

Current length of input word list: 25


## Problem 5. Train New Word2Vec
- Word2Vec models can be trained on different corpora (texts)
- Train your own model on a text of your choice
- In your report, present at least **5** interesting examples whose results differ depending on the dataset
    - You can compare word analogy examples, similarities, or visualizations
    - You don't have to repeat all the analysis again. Select some examples that you think are interesting
- Explain how the dataset selection leads to the differences in the results
- You can refer to the [official documentation](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec) of the Word2Vec model

In [50]:
# You don't have to change this cell
import string
from gensim.models import Word2Vec

def remove_punctuation(x):
  return x.translate(str.maketrans('', '', string.punctuation))
def make_tokenized_corpus(corpus):
  out= [ [y.lower() for y in remove_punctuation(sentence).split(' ') if y] for sentence in corpus]
  return [x for x in out if x!=[]]

In [None]:
your_text_fn = '' # Enter your text file name here

with open(your_text_fn, 'r') as f:
  strings = f.readlines()

'''
This line is for the case when the text file is not properly formatted.
It was used to ignore linebreaks and join the sentences into one string, since the text example included linebreak following printed book lines.

strings = "".join(strings).replace('\n', ' ').replace('Mr.', 'mr').replace('Mrs.', 'mrs').split('. ')
'''
# ``strings`` has to be a list of strings, one sentence per element;
# make_tokenized_corpus() then turns it into a list of lists of tokens

print("Checking the first 5 sentences in the text file")
for i, sentence in enumerate(strings[:5]):
  print(f"Sentence {i+1}: {sentence}")
corpus = make_tokenized_corpus(strings)

- gensim Word2Vec arguments
  - ``sentences``: list of lists of strings (one inner list of tokens per sentence)
  - ``vector_size``: dimension of the word vectors
  - ``epochs``: number of epochs to train the Word2Vec model
  - ``window``: maximum distance between the current and predicted word within a sentence
  - ``min_count``: ignore all words with total frequency lower than this
  - ``sg``: training algorithm: 1 for skip-gram; otherwise CBOW
  - ``negative``: if > 0, negative sampling will be used; the integer specifies how many "noise words" should be drawn (usually between 5 and 20)
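
To make ``window`` and ``min_count`` more concrete, the sketch below lists the (center, context) word pairs that a skip-gram style model would learn from on a tiny made-up corpus. ``skipgram_pairs`` is a hypothetical helper written for this illustration; it is a simplification and leaves out details of the real training procedure such as random window shrinking, subsampling of frequent words, and negative sampling.

```python
from collections import Counter

def skipgram_pairs(sentences, window=2, min_count=1):
    """Return (center, context) pairs after dropping rare words."""
    counts = Counter(w for s in sentences for w in s)
    pairs = []
    for sentence in sentences:
        # Words below min_count are removed before the window is applied
        kept = [w for w in sentence if counts[w] >= min_count]
        for i, center in enumerate(kept):
            lo = max(0, i - window)
            hi = min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, kept[j]))
    return pairs

# Made-up toy corpus, already tokenized (assumption for illustration)
toy_corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]

print(skipgram_pairs(toy_corpus, window=1, min_count=1))
print(skipgram_pairs(toy_corpus, window=1, min_count=2))
```

With ``min_count=2`` the words "cat" and "dog" disappear entirely, so a small corpus combined with a high ``min_count`` can leave very few training pairs; this is worth checking when a word you want to query is missing from your trained model.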

In [None]:
model = Word2Vec(sentences=corpus, vector_size=200, window=5, min_count=2, epochs=50, sg=1)
model = model.wv # To match the previous code, we use wv (KeyedVectors) of the Word2Vec class
# Try the functions above with the newly trained model