# Word Semantics and Embeddings



## Exercise 1

Download [word vectors](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing) pretrained on the Google News dataset (approx. 100 billion words). The file contains 300-dimensional vectors for 3 million words and phrases. Print the word vector for the term “United States”. Note that “United States” is represented as “United_States” in the file.

In [None]:

!wget http://download.tensorflow.org/data/word2vec/googlenews-vectors-negative300.bin.gz

--2024-03-15 16:10:04--  http://download.tensorflow.org/data/word2vec/googlenews-vectors-negative300.bin.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 142.251.8.207, 173.194.174.207, 108.177.125.207, ...
Connecting to download.tensorflow.org (download.tensorflow.org)|142.251.8.207|:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
2024-03-15 16:10:04 ERROR 403: Forbidden.



In [None]:
!pip install gensim



In [None]:
from gensim.models import KeyedVectors
from google.colab import drive
drive.mount('/content/drive/')
model = KeyedVectors.load_word2vec_format("/content/drive/MyDrive/GoogleNews-vectors-negative300.bin.gz", binary=True)

# Print out the word vector of the term "United States"
print(model["United_States"])

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
[-3.61328125e-02 -4.83398438e-02  2.35351562e-01  1.74804688e-01
 -1.46484375e-01 -7.42187500e-02 -1.01562500e-01 -7.71484375e-02
  1.09375000e-01 -5.71289062e-02 -1.48437500e-01 -6.00585938e-02
  1.74804688e-01 -7.71484375e-02  2.58789062e-02 -7.66601562e-02
 -3.80859375e-02  1.35742188e-01  3.75976562e-02 -4.19921875e-02
 -3.56445312e-02  5.34667969e-02  3.68118286e-04 -1.66992188e-01
 -1.17187500e-01  1.41601562e-01 -1.69921875e-01 -6.49414062e-02
 -1.66992188e-01  1.00585938e-01  1.15722656e-01 -2.18750000e-01
 -9.86328125e-02 -2.56347656e-02  1.23046875e-01 -3.54003906e-02
 -1.58203125e-01 -1.60156250e-01  2.94189453e-02  8.15429688e-02
  6.88476562e-02  1.87500000e-01  6.49414062e-02  1.15234375e-01
 -2.27050781e-02  3.32031250e-01 -3.27148438e-02  1.77734375e-01
 -2.08007812e-01  4.54101562e-02 -1.23901367e-02  1.19628906e-01
  7.44628906e-03 -9.0332

## Exercise 2

Compute the cosine similarity between “United States” and “U.S.”

In [None]:
# Reuses `model` loaded in Exercise 1
import numpy as np

# Get the word vectors for "United States" and "U.S."
vector_us = model["United_States"]
vector_us_abbr = model["U.S."]

# Compute the cosine similarity
cos_sim = np.dot(vector_us, vector_us_abbr) / (np.linalg.norm(vector_us) * np.linalg.norm(vector_us_abbr))

print("Cosine similarity: {:.4f}".format(cos_sim))

Cosine similarity: 0.7311
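
The formula used above is the dot product of the two vectors divided by the product of their lengths. As a minimal sketch that runs without the pretrained model, the same computation can be written in plain Python. The vectors below are made up for illustration and are not taken from the Google News file; the function name `cosine_similarity` is likewise just a local helper.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy vectors (hypothetical values, not from the model)
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # parallel vectors
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal vectors
```

Parallel vectors score close to 1 and orthogonal vectors score 0, which is why the value is read as a measure of direction rather than magnitude.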


## Exercise 3
Find the 10 words with the highest cosine similarity to “United States” and print their similarity scores.

In [None]:
# Find the top-10 words that have the highest cosine similarity with "United States"
top_10 = model.most_similar("United_States", topn=10)

# Print out the similarity score
for word, similarity in top_10:
    print("{}: {:.4f}".format(word, similarity))

Unites_States: 0.7877
Untied_States: 0.7541
United_Sates: 0.7401
U.S.: 0.7311
theUnited_States: 0.6404
America: 0.6178
UnitedStates: 0.6167
Europe: 0.6133
countries: 0.6045
Canada: 0.6019
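
Conceptually, a nearest-neighbour query like the one above ranks every other vocabulary entry by its cosine similarity to the query word. The following brute-force sketch shows that idea on a toy vocabulary. The four vectors and the helper names `cosine` and `top_n_similar` are hypothetical and only serve the illustration; this is not how gensim is implemented internally.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_n_similar(embeddings, query, topn=10):
    """Rank all words except `query` by cosine similarity to it, highest first."""
    query_vec = embeddings[query]
    scores = [
        (word, cosine(query_vec, vec))
        for word, vec in embeddings.items()
        if word != query  # the query word itself is left out
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]

# Toy 3-dimensional embeddings (hypothetical values, not from the model)
toy_embeddings = {
    "United_States": [0.9, 0.1, 0.0],
    "U.S.": [0.8, 0.2, 0.1],
    "Canada": [0.6, 0.5, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

for word, score in top_n_similar(toy_embeddings, "United_States", topn=3):
    print("{}: {:.4f}".format(word, score))
```

On the real model, the same ranking explains why misspellings such as "Unites_States" appear at the top: they occur in nearly the same contexts as the correctly spelled phrase.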


## Exercise 4

Subtract the vector of “Madrid” from the vector of “Spain” and then add the vector of “Athens”. Find the 10 words most similar to the resulting vector.

In [None]:
# Reuses `model` loaded in Exercise 1

# Get the word vectors for "Madrid", "Spain", and "Athens"
vector_madrid = model["Madrid"]
vector_spain = model["Spain"]
vector_athens = model["Athens"]

# Subtract the vector of "Madrid" from the vector of "Spain" and then add the vector of "Athens"
output_vector = vector_spain - vector_madrid + vector_athens

# Find the top-10 most similar words with the output vector
top_10 = model.most_similar(positive=[output_vector], topn=10)

# Print out the similarity score
for word, similarity in top_10:
    print("{}: {:.4f}".format(word, similarity))

Athens: 0.7528
Greece: 0.6685
Aristeidis_Grigoriadis: 0.5496
Ioannis_Drymonakos: 0.5361
Greeks: 0.5352
Ioannis_Christou: 0.5330
Hrysopiyi_Devetzi: 0.5088
Iraklion: 0.5059
Greek: 0.5041
Athens_Greece: 0.5034
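
In the output above, the input word "Athens" ranks first, because the query is a raw vector and every vocabulary entry, inputs included, is a candidate. A common variant of the analogy task leaves the three input words out of the ranking. The sketch below contrasts both behaviours on toy vectors; all values and the helper names `cosine` and `analogy` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(embeddings, a, b, c, exclude_inputs=True):
    """Return the (word, similarity) pair closest to vec(b) - vec(a) + vec(c)."""
    target = [y - x + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    skip = {a, b, c} if exclude_inputs else set()
    candidates = [
        (word, cosine(target, vec))
        for word, vec in embeddings.items()
        if word not in skip
    ]
    return max(candidates, key=lambda pair: pair[1])

# Toy 3-dimensional embeddings (hypothetical values, not from the model)
toy_embeddings = {
    "Madrid": [1.0, 0.0, 0.0],
    "Spain": [1.0, 0.0, 0.3],
    "Athens": [0.0, 1.0, 0.0],
    "Greece": [0.3, 1.0, 0.8],
    "Paris": [0.7, 0.7, 0.0],
}

print(analogy(toy_embeddings, "Madrid", "Spain", "Athens", exclude_inputs=False))
print(analogy(toy_embeddings, "Madrid", "Spain", "Athens", exclude_inputs=True))
```

With gensim, passing words instead of a vector, as in `model.most_similar(positive=["Spain", "Athens"], negative=["Madrid"])`, should have a similar effect, since input words are then left out of the results. The scores can differ slightly from the manual computation because the vectors are normalized before they are combined.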


## Exercise 5
Download the [word analogy evaluation dataset](http://download.tensorflow.org/data/questions-words.txt). For each row, compute the vector vec(word in second column) - vec(word in first column) + vec(word in third column) and find the word most similar to it. Append that word and its similarity to the row of the downloaded file.

In [None]:

import requests

# Download the word analogy evaluation dataset
url = "http://download.tensorflow.org/data/questions-words.txt"
# Only the first 1000 lines are processed here
data = requests.get(url).text.split("\n")[:1000]

correct = 0
total = 0
# Open the output file once, in write mode, so re-running the cell does not append duplicate rows
with open('output_file.txt', 'w') as output_file:
    # Process each row in the dataset
    for line in data:
        words = line.strip().split()
        if len(words) == 4:
            word1, word2, word3, word4 = words
            if all(word in model for word in [word1, word2, word3]):
                # Compute analogy vector
                vector = model[word2] - model[word1] + model[word3]
                # Find most similar word (a raw query vector does not exclude the input words)
                most_similar_word, similarity = model.similar_by_vector(vector, topn=1)[0]
                # Append results to the row
                line += f"\t{most_similar_word}\t{similarity}\n"
                # Check if the most similar word is the same as the word in the 4th column
                if most_similar_word == word4:
                    correct += 1
                # Increment the total counter
                total += 1
            else:
                line += "\tNA\tNA\n"  # If any of the words are not in the vocabulary
        else:
            line += "\tNA\tNA\n"  # If the row is not a valid analogy question

        # Write the updated row to the output file
        output_file.write(line)
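
The dataset mixes section header lines, which start with `:`, with four-word analogy rows, which is why the loop checks for exactly four tokens. As a hedged sketch, the parsing step could also be separated into a small generator that keeps track of the current section. The name `parse_analogy_rows` and the inline sample are illustrative assumptions; the sample only mimics the layout of the file.

```python
def parse_analogy_rows(lines):
    """Yield (section, [w1, w2, w3, w4]) for each four-word row.

    Section headers (lines starting with ':') and blank lines are skipped.
    """
    section = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            section = line[1:].strip()
            continue
        words = line.split()
        if len(words) == 4:
            yield section, words

# Small inline sample in the same layout as the dataset (illustrative only)
sample = """: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand

: family
boy girl brother sister
""".split("\n")

for section, words in parse_analogy_rows(sample):
    print(section, words)
```

Keeping the section name alongside each row would also make it possible to report results per category later, if that is of interest.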

## Exercise 6 (Bonus points)

From the output of Exercise 5, compute the accuracy score, i.e. the percentage of rows in which the most similar word returned by your code matches the word in the fourth column.



In [None]:
# `correct` and `total` were already counted inside the Exercise 5 loop.
# Compute the accuracy score
accuracy = correct / total * 100
print(f"Accuracy score: {accuracy}%")

Accuracy score: 30.160320641282567%
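
The score above relies on counters kept during the Exercise 5 loop. Since the task asks for the accuracy "from the output" of Exercise 5, here is a sketch of an alternative that recomputes it from the written rows. It assumes each row has the form `w1 w2 w3 w4<TAB>prediction<TAB>similarity`, as produced above; the function name `accuracy_from_rows` and the sample rows (including their similarity values) are made up for illustration.

```python
def accuracy_from_rows(lines):
    """Return the percentage of evaluated rows whose prediction equals the 4th word.

    Rows marked NA and rows that are not four-word analogies are ignored.
    """
    correct = 0
    total = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue
        question, predicted, _similarity = fields
        words = question.split()
        if len(words) != 4 or predicted == "NA":
            continue
        total += 1
        if predicted == words[3]:
            correct += 1
    return 100.0 * correct / total if total else 0.0

# Hypothetical rows in the Exercise 5 output layout
sample_rows = [
    ": capital-common-countries\tNA\tNA\n",
    "Athens Greece Baghdad Iraq\tIraq\t0.75\n",
    "Athens Greece Bangkok Thailand\tBangkok\t0.80\n",
    "boy girl brother sister\tsister\t0.90\n",
]

print("Accuracy score: {:.2f}%".format(accuracy_from_rows(sample_rows)))
```

To apply it to the real file, the rows could be read with `open('output_file.txt')` and passed to the function directly, since iterating over a file yields its lines.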
