This Assignment will evaluate your skills in web scraping and text representation using
Word2vec. Include in your submission the code used to generate answers as a Jupyter Notebook
or Python program file as well as any files generated. Make sure it is clear what code answers
each question.
This Assignment is meant to be completed individually. You may discuss the questions at a high
level with other students but the final work submitted must be your own. Please reference any
external resources you use to complete this Assignment using ACM referencing format.

## Exercise 1

For this exercise, you will be scraping information from Books to Scrape. Write a function to
extract the book titles, price, stock availability, and rating of all the books on the first 20 pages
of the site. Write the results into a table that collects the information as separate columns.

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "http://books.toscrape.com/catalogue/category/books_1/index.html"

# create an empty list to store the book information
book_info = []

# loop through the first 20 pages of the website
for i in range(1, 21):
    page_url = f"http://books.toscrape.com/catalogue/category/books_1/page-{i}.html"
    response = requests.get(page_url)
    soup = BeautifulSoup(response.text, "html.parser")
    books = soup.find_all("article", class_="product_pod")

    # loop through each book on the page and extract the relevant information
    for book in books:
        title = book.h3.a["title"]
        price = book.select(".price_color")[0].get_text()[1:]
        availability = book.select(".availability")[0].get_text().strip()
        rating = book.select("p.star-rating")[0].get("class")[1]
        book_info.append((title, price, availability, rating))

# create a table from the extracted information using Pandas
df = pd.DataFrame(book_info, columns=["Title", "Price", "Availability", "Rating"])
display(df)


Unnamed: 0,Title,Price,Availability,Rating
0,A Light in the Attic,£51.77,In stock,Three
1,Tipping the Velvet,£53.74,In stock,One
2,Soumission,£50.10,In stock,One
3,Sharp Objects,£47.82,In stock,Four
4,Sapiens: A Brief History of Humankind,£54.23,In stock,Five
...,...,...,...,...
395,Take Me Home Tonight (Rock Star Romance #3),£53.98,In stock,Three
396,Sleeping Giants (Themis Files #1),£48.74,In stock,One
397,"Setting the World on Fire: The Brief, Astonish...",£21.15,In stock,Two
398,Playing with Fire,£13.71,In stock,Three


## Exercise 2

In [1]:
!mkdir ex2

1. Read the reviews in IMDB_Dataset.csv. Convert the table of reviews into a bag-of-words
(BOW) matrix using scikit-learn or some other method.

In [2]:
import csv

file = open("/content/ex2/IMDB_Dataset.csv", "r")
data = list(csv.reader(file, delimiter=","))


In [3]:
[row[0] for row in data]

['review',
 "One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is d

In [4]:
# let's do a little preprocessing by lowercasing and getting rid of punctuation
x = [row[0] for row in data]
processed_docs = [doc.lower().replace(".","") for doc in x]

#look at the documents list
print("Our corpus: ", processed_docs)

Our corpus:  ['review', "one of the other reviewers has mentioned that after watching just 1 oz episode you'll be hooked they are right, as this is exactly what happened with me<br /><br />the first thing that struck me about oz was its brutality and unflinching scenes of violence, which set in right from the word go trust me, this is not a show for the faint hearted or timid this show pulls no punches with regards to drugs, sex or violence its is hardcore, in the classic use of the word<br /><br />it is called oz as that is the nickname given to the oswald maximum security state penitentary it focuses mainly on emerald city, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda em city is home to manyaryans, muslims, gangstas, latinos, christians, italians, irish and moreso scuffles, death stares, dodgy dealings and shady agreements are never far away<br /><br />i would say the main appeal of the show is due 

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the count vectorizer object
count_vect = CountVectorizer()

# Build a BOW representation
bow_rep = count_vect.fit_transform(processed_docs)

# Print out the vocabulary
print("Vocabulary", count_vect.vocabulary_, "\n")

# print("Vocabulary index for cat", count_vect.vocabulary_.get("cat"), "\n")

#see the BOW rep for documents
for ind in range(0,len(processed_docs)):
    print("BoW representation for Document {}: ".format(ind), bow_rep[ind].toarray())

Vocabulary {'review': 476, 'one': 397, 'of': 390, 'the': 581, 'other': 405, 'reviewers': 477, 'has': 253, 'mentioned': 357, 'that': 580, 'after': 14, 'watching': 637, 'just': 297, 'oz': 410, 'episode': 171, 'you': 674, 'll': 326, 'be': 49, 'hooked': 265, 'they': 589, 'are': 32, 'right': 479, 'as': 36, 'this': 593, 'is': 289, 'exactly': 175, 'what': 646, 'happened': 247, 'with': 656, 'me': 352, 'br': 64, 'first': 199, 'thing': 590, 'struck': 553, 'about': 6, 'was': 632, 'its': 292, 'brutality': 70, 'and': 26, 'unflinching': 616, 'scenes': 493, 'violence': 626, 'which': 649, 'set': 511, 'in': 279, 'from': 210, 'word': 662, 'go': 225, 'trust': 609, 'not': 387, 'show': 519, 'for': 204, 'faint': 185, 'hearted': 257, 'or': 400, 'timid': 599, 'pulls': 453, 'no': 385, 'punches': 454, 'regards': 468, 'to': 600, 'drugs': 158, 'sex': 513, 'hardcore': 251, 'classic': 93, 'use': 619, 'it': 290, 'called': 74, 'nickname': 384, 'given': 221, 'oswald': 404, 'maximum': 350, 'security': 501, 'state': 547

In [7]:
import pandas as pd
df_bow = pd.DataFrame(bow_rep.toarray(),columns=count_vect.get_feature_names_out())

display(df_bow)

Unnamed: 0,10,15,1990,25,70,950,about,accustomed,acting,action,...,wouldn,wrenching,writing,written,years,york,you,young,your,zombie
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,1,1,0,0,...,1,0,0,0,0,0,3,0,1,0
2,0,0,0,0,0,0,1,0,0,0,...,0,0,0,1,0,0,1,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,2,0,0,2
5,0,0,0,0,0,0,2,0,1,1,...,0,0,0,0,0,1,0,0,0,0
6,0,1,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
7,1,0,0,0,0,0,2,0,0,0,...,0,0,0,0,0,0,2,0,0,0
8,0,0,1,0,1,0,0,0,0,0,...,1,0,1,0,1,0,0,0,0,0
9,0,0,0,0,0,1,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0


2. Do the same thing for each bigram of the text.

In [8]:
# bigram vectorization example with count vectorizer bigrams
count_vect = CountVectorizer(ngram_range=(2, 2))
# Build a BOW representation for the corpus
bow_rep = count_vect.fit_transform(processed_docs)

# Vocabulary mapping
print("Vocabulary: ", count_vect.vocabulary_)

Vocabulary:  {'one of': 772, 'of the': 744, 'the other': 1090, 'other reviewers': 797, 'reviewers has': 891, 'has mentioned': 424, 'mentioned that': 656, 'that after': 1031, 'after watching': 23, 'watching just': 1270, 'just oz': 567, 'oz episode': 809, 'episode you': 298, 'you ll': 1369, 'll be': 607, 'be hooked': 126, 'hooked they': 454, 'they are': 1142, 'are right': 90, 'right as': 893, 'as this': 105, 'this is': 1153, 'is exactly': 506, 'exactly what': 306, 'what happened': 1295, 'happened with': 415, 'with me': 1325, 'me br': 646, 'br br': 156, 'br the': 161, 'the first': 1070, 'first thing': 345, 'thing that': 1147, 'that struck': 1041, 'struck me': 996, 'me about': 644, 'about oz': 9, 'oz was': 810, 'was its': 1254, 'its brutality': 552, 'brutality and': 170, 'and unflinching': 75, 'unflinching scenes': 1224, 'scenes of': 913, 'of violence': 751, 'violence which': 1247, 'which set': 1308, 'set in': 939, 'in right': 481, 'right from': 894, 'from the': 368, 'the word': 1118, 'wor

In [9]:
import pandas as pd
df_bg = pd.DataFrame(bow_rep.toarray(),columns=count_vect.get_feature_names_out())

display(df_bg)

Unnamed: 0,10 just,10 lines,15 or,1990 the,25 years,70 when,950 films,about human,about one,about oz,...,you ll,you may,you must,you re,you will,young or,young woman,your darker,zombie br,zombie in
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,1,...,1,1,0,0,0,0,0,1,0,0
2,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,1,1,0,0,0,0,1,1
5,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


3. Calculate the TF-IDF of these documents using scikit-learn.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
bow_rep_tfidf = tfidf.fit_transform(processed_docs) #note, you can use n-grams for TF-IDF as well

#All words in the vocabulary.
print("All words in the vocabulary",tfidf.get_feature_names_out())
print("\n")
#IDF for all words in the vocabulary
print("IDF for all words in the vocabulary",tfidf.idf_)
print("\n")


#TFIDF representation for all documents in our corpus 
print("TFIDF representation for all documents in our corpus\n",bow_rep_tfidf.toarray()) 
print("\n")

df_tfidf = pd.DataFrame(bow_rep_tfidf.toarray(),columns=tfidf.get_feature_names_out())

display(df_tfidf)

All words in the vocabulary ['10' '15' '1990' '25' '70' '950' 'about' 'accustomed' 'acting' 'action'
 'actors' 'addiction' 'adrian' 'adventureoh' 'after' 'agenda' 'agreements'
 'air' 'aired' 'alive' 'all' 'allen' 'almost' 'also' 'amazing' 'an' 'and'
 'another' 'anxiously' 'any' 'anymore' 'appeal' 'are' 'arguing' 'around'
 'arthur' 'as' 'at' 'audiences' 'average' 'await' 'awakening' 'away'
 'awful' 'back' 'bad' 'band' 'basically' 'bbc' 'be' 'become' 'been'
 'being' 'believable' 'believe' 'best' 'bette' 'big' 'bit' 'bitches'
 'black' 'boogeyman' 'boring' 'boy' 'br' 'bread' 'brilliance' 'brilliant'
 'bring' 'brings' 'brutality' 'buscemi' 'but' 'by' 'called' 'camp' 'can'
 'career' 'carol' 'case' 'cast' 'cause' 'cells' 'change' 'characters'
 'charm' 'cheap' 'children' 'chose' 'chosen' 'christians' 'city' 'class'
 'classic' 'closet' 'come' 'comeback' 'comedies' 'comedy' 'comes'
 'comfortable' 'comforting' 'comments' 'complete' 'concerning'
 'conditioned' 'connect' 'connected' 'contact' 'cont

Unnamed: 0,10,15,1990,25,70,950,about,accustomed,acting,action,...,wouldn,wrenching,writing,written,years,york,you,young,your,zombie
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.029947,0.049378,0.0,0.0,...,0.042207,0.0,0.0,0.0,0.0,0.0,0.08984,0.0,0.049378,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.04456,0.0,0.0,0.0,...,0.0,0.0,0.0,0.073474,0.0,0.0,0.04456,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.062851,0.0,0.0,0.071467,0.0,0.0
4,0.072624,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.103057,0.0,0.0,0.169927
5,0.0,0.0,0.0,0.0,0.0,0.0,0.066109,0.0,0.046587,0.054502,...,0.0,0.0,0.0,0.0,0.0,0.054502,0.0,0.0,0.0,0.0
6,0.0,0.10307,0.0,0.10307,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.07748,0.0,0.0,0.0,0.0,0.0
7,0.070431,0.0,0.0,0.0,0.0,0.0,0.099946,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.099946,0.0,0.0,0.0
8,0.0,0.0,0.069707,0.0,0.069707,0.0,0.0,0.0,0.0,0.0,...,0.059583,0.0,0.069707,0.0,0.0524,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.09654,0.05855,0.0,0.082519,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


4. Describe 2 reasons why one hot encoding is rarely used for text representation. Are
there other vector space model representations that also share the issues you described?
Name them.

According to the lectures where we discussed the drawbacks of one hot encoding, there are several drawbacks for using one hot encoding. The first one is that the **relationships between tokens lost** because  One hot encoding does not capture any information about the relationships between words. It represents each word as a unique entity, ignoring the similarity or dissimilarity between words. For instance, in the exercise that we had, one-hot encoding, the words "kill" and "killer" are represented as completely different entities even though they have semantic similarities. This can result in a loss of semantic information, making it harder for machine learning models to capture the meaning of the text and make accurate predictions. The second reson can be **sparcity** because one hot encoding results in a high-dimensional sparse matrix, which means that most of the entries in the matrix are zero. This can result in a large amount of memory usage and slow down the computation. 

For the second part, another vector space model representation that has the same issue is **bag of words** where the size of the vector still increases with the size of the vocabulary so it has the sparsity issue. 
There is another limtation for using one hot encoding and bag of words which is OOV (Out of Vocabulary) which refers to the words that are not present in the vocabulary used for encoding the text. One hot encoding requires a fixed vocabulary of unique words to be defined beforehand, and any word that is not present in this vocabulary will be treated as an out of vocabulary word or unknown word. This problem can also be seen in Bag of words.

https://www.quora.com/What-are-the-advantages-and-disadvantages-of-TF-IDF

 **TF-IDF** can help identify important words in a document, but it also does not capture the semantic meaning of the words. For example, the words "car" and "automobile" may have similar importance in a document, but TF-IDF does not capture the fact that they have the same or similar meanings. So overal, One hot encoding, bag of words, and TF-IDF all have the problem with semantic similarities. 

## Exercise 3

1. Assume you have to make your own training set to pass as input to Word2Vec instead of
just a list of lists of text. Write a windowing function that can take text as input and make 1
training set in the CBOW format as seen in slide 5 of the Word2Vec slides. Make the window
size a parameter of your function that can be changed. Test your code with file nlptext.txt.
Note that you do not have to train a Word2Vec model with this dataset.*italicized text*

In [None]:
# import numpy as np

# def create_cbow_training_set(file_path, window_size):
#     # Load the text file
#     with open(file_path, 'r') as f:
#         text = f.read()
    
#     # Split the text into a list of words
#     words = text.split()
    
#     # Create a list of unique words and a dictionary to map each word to an index
#     vocab = list(set(words))
#     word_to_idx = {word: i for i, word in enumerate(vocab)}
    
#     # Initialize the input and output arrays for the training set
#     input_data = []
#     output_data = []
    
#     # Loop over all words in the text
#     for i, word in enumerate(words):
#         # Skip the first `window_size` and last `window_size` words
#         if i < window_size or i >= len(words) - window_size:
#             continue
        
#         # Get the context words for the current word
#         context_words = words[i - window_size:i] + words[i + 1:i + window_size + 1]
        
#         # Convert the context words to indices
#         context_idxs = [word_to_idx[w] for w in context_words]
        
#         # Add the context words to the input data
#         input_data.append(context_idxs)
        
#         # Add the current word to the output data
#         output_data.append(word_to_idx[word])
    
#     input_data = np.array(input_data)
#     output_data = np.array(output_data)
    
#     return input_data, output_data


In [13]:
def create_training_set(text, window_size=5):
    # Split the text into individual words
    words = text.split()

    # Create a list to hold the training data
    training_set = []

    # Loop over each word in the text
    for i, word in enumerate(words):
        # Define the context window
        start_index = max(0, i - window_size)
        end_index = min(len(words) - 1, i + window_size)
        context = words[start_index:i] + words[i+1:end_index+1]

        # Add the current word and its context to the training set
        training_set.append((word, context))

    return training_set


The text parameter is the input text to create a training set for, and the window_size parameter is the size of the context window to use for each word.
For each word, the function defines a context window consisting of the window_size words to the left and right of the current word (unless the start or end of the text is reached). The function then adds a tuple consisting of the current word and its context to the training set.

In [11]:
!mkdir ex3

In [14]:
# Read in the contents of the file
with open('/content/ex3/nlptext.txt', 'r') as f:
    text = f.read()

# Generate the training set
training_set = create_training_set(text, window_size=2)

# Print the first five examples in the training set
print(training_set[:5])



[('Most', ['natural', 'language']), ('natural', ['Most', 'language', 'processing']), ('language', ['Most', 'natural', 'processing', 'systems']), ('processing', ['natural', 'language', 'systems', 'were']), ('systems', ['language', 'processing', 'were', 'based'])]


This output shows that the training set is generated as expected, with each example consisting of a target word and its context words. Note that this function does not actually train a Word2Vec model; it just generates the training set in the format expected by Word2Vec.

2. Suppose you are training a Word2Vec model with the text in part A. Given that you set
your hyperparameters for Word2Vec as size=50, window=3, and min_count=5, what will the
dimensions of the embedding matrix be and why (i.e. what do the dimensions of the embedding
matrix represent)? What will the dimensions be for the context matrix? Assume you do not do
any preprocessing of the text other than simple word tokenization with NLTK.

If we train a Word2Vec model on the given text with hyperparameters size=50, window=3, and min_count=5, the dimensions of the embedding matrix will be V x 50, where V is the size of the vocabulary in the corpus. 
The embedding matrix contains the learned representations for each word in the vocabulary. Each row of the matrix corresponds to the embedding vector for a particular word in the vocabulary. Since we set the embedding dimension to be 50, each embedding vector will have 50 dimensions. The number of rows in the embedding matrix is equal to the size of the vocabulary, which is the number of unique words in the corpus that have a frequency greater than or equal to min_count.

The context matrix, on the other hand, is not directly output by the Word2Vec model. Instead, the context matrix is used to learn the embedding matrix during the training process. The context matrix is an intermediate representation that stores the one-hot encoded context vectors for each training example. The context matrix has dimensions N x V, where N is the total number of training examples and V is the size of the vocabulary.

During training, the model uses the context matrix to update the weights in the embedding matrix to maximize the likelihood of predicting the context words given a target word. The output of the Word2Vec model is the learned embedding matrix, not the context matrix.

It's worth noting that the actual size of the embedding matrix and vocabulary will depend on the specific text being used and the frequency of the words in the corpus. However, given the hyperparameters specified and assuming no preprocessing other than simple tokenization, the embedding matrix will have dimensions V x 50.

3. What is the purpose of negative sampling when training Word2Vec?

https://analyticsindiamag.com/how-to-use-negative-sampling-with-word2vec-model/#:~:text=When%20the%20size%20of%20training,to%20get%20modified%20during%20training.

The purpose of negative sampling in Word2Vec is to improve the efficiency and training time of the model, while still maintaining the quality of the learned word embeddings.

In the original Word2Vec algorithm, the training objective is to maximize the probability of predicting the surrounding context words given a target word (CBOW model) or the target word given its surrounding context words (skip-gram model). To achieve this, the model needs to compute the probability of selecting each word in the vocabulary as a context word (or target word) for a given target word (or context word), which involves computing the softmax function over the entire vocabulary.

However, computing the softmax function can be computationally expensive and slow, especially when dealing with large vocabularies. To address this issue, negative sampling was introduced in the Word2Vec algorithm.

Negative sampling involves randomly selecting a small number of "negative" examples (i.e., words that are not in the current context of the target word) and optimizing the model to predict these negative examples with low probability. Specifically, instead of computing the softmax over the entire vocabulary, the model only needs to compute the probability of selecting a small number of negative examples and the true context word. By doing this, the model can be trained much faster and more efficiently.

In practice, negative sampling involves randomly selecting a small number (e.g., 5-20) of negative examples for each positive example during training. The number of negative examples is usually much smaller than the size of the entire vocabulary, which can be on the order of millions of words. The actual number of negative examples to use is a hyperparameter that can be tuned to balance training time and model performance.

4. Why does CBOW train faster than Skip-gram? Hint: Think about how each model is
updated.

https://stats.stackexchange.com/questions/321304/why-does-the-skipgram-model-takes-more-time-to-train-compared-to-cbow

CBOW generally trains faster than Skip-gram because CBOW updates the word embeddings for all the words in a given context in one go, while Skip-gram updates the word embeddings for each context word individually.

In the CBOW model, the input to the neural network is a bag of context words, and the output is the predicted target word. The weights connecting the input layer to the hidden layer represent the embeddings for each of the context words, and these weights are updated based on the error in predicting the target word. Since all the context words share the same weights, updating the weights for each context word is equivalent to updating the weights for all the context words at once. This can make the training process more efficient.

In contrast, the Skip-gram model predicts the context words given a target word. To do this, the model needs to update the weights for each context word independently, since each context word has its own embedding in the model. This means that for each training example, the model needs to update the weights for each context word, which can be slower and more computationally expensive than updating the weights for all the context words at once in CBOW.

However, it's worth noting that the efficiency of training depends on the specific implementation and the size of the dataset being used. In some cases, Skip-gram may be faster than CBOW, especially when the training data is very large or the context window size is very small.

## Exercise 4

SpaCy uses a modified version of Word2Vec to get token representations called FastText.
FastText is essentially the same as Word2Vec, except that instead of operating on tokens of
entire words, it operates on sets of characters.


1. Import fasttext from Gensim and get the representation of at least two out of vocab words
when you train a model with corpus = [["horse", "pulled", "cart"], ["dog", "say", "woof"]] and
the most similar words to it. Will any word work with this fasttext model? Why or why not?

In [None]:
from gensim.models.fasttext import FastText

# Define the corpus
corpus = [["horse", "pulled", "cart"], ["dog", "say", "woof"]]

# Train the FastText model
model = FastText(corpus, size=100, window=5, min_count=1, workers=4, sg=1)

# Get the representations and most similar words for two out-of-vocabulary words
word1 = "push"
word2 = "cat"
vector1 = model.wv[word1]
vector2 = model.wv[word2]
similars1 = model.wv.most_similar(word1)
similars2 = model.wv.most_similar(word2)

# Print the results
print("Representation of", word1, ":", vector1)
print("Most similar words to", word1, ":", similars1)
print("Representation of", word2, ":", vector2)
print("Most similar words to", word2, ":", similars2)




Representation of push : [ 1.5395569e-03  7.5077750e-03  2.1713087e-03 -4.9668085e-03
 -4.0774015e-03  6.5845117e-04  9.2415642e-03 -6.3100881e-03
  1.9796756e-04 -3.1242380e-03  5.3945067e-03  6.0573299e-03
 -1.5088018e-03 -5.9175501e-03 -8.6582107e-03 -6.0270345e-03
 -4.5519830e-03  1.9757783e-03  7.4616699e-03 -7.4353814e-03
  9.1637745e-03  3.6634170e-03  4.8392848e-03  9.6576158e-03
 -1.6779846e-03  6.3166930e-04  3.5842590e-03  2.5750286e-04
 -4.0177531e-03 -7.8923069e-03 -4.3013571e-03  5.3770570e-03
  2.8156540e-03  6.0266140e-03  3.4241064e-04 -5.3579994e-03
  2.5592386e-03 -3.9198902e-03 -9.4094248e-03  8.0666235e-03
 -1.5926086e-03 -1.4773844e-03  4.8250346e-03  8.9691682e-03
 -8.5496893e-03 -6.5459781e-03 -3.5528501e-03 -5.0315149e-03
 -9.0100373e-05  6.4891921e-03  7.0794481e-03  1.9165989e-03
 -5.1965252e-03 -6.2701465e-03  4.8678103e-03 -5.2557187e-03
  7.8981783e-04  4.9856156e-03 -5.4350179e-03 -6.5099765e-03
 -9.2052864e-03 -7.0486232e-03  7.0287781e-03 -7.9072816e-03

 I first define the corpus as a list of two lists of words. I then train the FastText model using the FastText() function from Gensim, with a vector size of 100, a window size of 5, and a minimum word count of 1. I also set the sg parameter to 1 to use skip-gram instead of CBOW.

To answer the second part:

In [None]:
from gensim.models.fasttext import FastText

# Define the corpus
corpus = [["horse", "pulled", "cart"], ["dog", "say", "woof"]]

# Train the FastText model
model = FastText(corpus, size=100, window=5, min_count=1, workers=4, sg=1)

# Get the representations and most similar words for two out-of-vocabulary words
word1 = "giraffe"
word2 = "cat"
vector1 = model.wv[word1]
vector2 = model.wv[word2]
similars1 = model.wv.most_similar(word1)
similars2 = model.wv.most_similar(word2)

# Print the results
print("Representation of", word1, ":", vector1)
print("Most similar words to", word1, ":", similars1)
print("Representation of", word2, ":", vector2)
print("Most similar words to", word2, ":", similars2)




KeyError: ignored

As can be seen, for the word giraffe i faced an error. The word that Im trying to query is not present in the FastText model vocabulary. This can happen if the word is rare or if the model was trained on a very limited vocabulary. It means that the model is not able to learn a good representation for this word based on the character n-grams in the training corpus.

2. Use pretrained Word2Vec and FastText models word2vec-google-news-300 and fasttextwiki-
news-subwords-300, respectively. Load these models and come up with at least four
examples to compare syntactic (2 examples) and semantic (2 examples, hint: find out how to
make analogies with Gensim) representations between the two models. Compare and contrast
the results.

Syntactic Comparison: Verb Tense and plural Tense

In [None]:
import gensim.downloader as api

# Load the models
w2v_model = api.load('word2vec-google-news-300')




In [None]:
import gensim.downloader as api
ft_model = api.load('fasttext-wiki-news-subwords-300')

In [None]:
# Compare the past and present tense forms of "run" in each model
w2v_past = w2v_model['ran']
w2v_present = w2v_model['run']
ft_past = ft_model['ran']
ft_present = ft_model['run']

# Calculate the cosine similarity between the past and present tense forms of "run"
w2v_sim = w2v_model.similarity('ran', 'run')
ft_sim = ft_model.similarity('ran', 'run')

# Print the results
print("Syntactic Comparison: Verb Tense")
print("Cosine similarity between 'ran' and 'run' in Word2Vec:", w2v_sim)
print("Cosine similarity between 'ran' and 'run' in FastText:", ft_sim)


Syntactic Comparison: Verb Tense
Cosine similarity between 'ran' and 'run' in Word2Vec: 0.47649786
Cosine similarity between 'ran' and 'run' in FastText: 0.7697737


So as can be seen the similarity between "run" and "ran" has a higer score inFastText.

In [None]:


# Compare the singular and plural forms of "cat" in each model
w2v_sing = w2v_model['cat']
w2v_plur = w2v_model['cats']
ft_sing = ft_model['cat']
ft_plur = ft_model['cats']

# Calculate the cosine similarity between the singular and plural forms of "cat"
w2v_sim = w2v_model.similarity('cat', 'cats')
ft_sim = ft_model.similarity('cat', 'cats')

# Print the results
print("Syntactic Comparison: Plurals")
print("Cosine similarity between 'cat' and 'cats' in Word2Vec:", w2v_sim)
print("Cosine similarity between 'cat' and 'cats' in FastText:", ft_sim)


Syntactic Comparison: Plurals
Cosine similarity between 'cat' and 'cats' in Word2Vec: 0.8099379
Cosine similarity between 'cat' and 'cats' in FastText: 0.8368597


As can be seen "cat" and "cats" are similar but the similarity has a higher score in Fasttext again.

Semantic Comparison: Country-Capital Relationship/ Synonyms and Antonyms

In [None]:
# Calculate the most similar words to the vector difference between "France" and "Paris" in each model
w2v_similar = w2v_model.most_similar(positive=['France', 'capital'], negative=['Paris'])
# ft_similar = ft_model.most_similar(positive=['France', 'capital'], negative=['Paris'])

# Print the results
print("Semantic Comparison: Country-Capital Relationship")
print("Most similar words to 'France - Paris + capital' in Word2Vec:", w2v_similar[:5])
# print("Most similar words to 'France - Paris + capital' in FastText:", ft_similar[:5])


Semantic Comparison: Country-Capital Relationship
Most similar words to 'France - Paris + capital' in Word2Vec: [('captial', 0.42486903071403503), ('undistributed_profits', 0.4229094982147217), ('invest_ment', 0.4195176959037781), ('worth_##mln_rub', 0.40540462732315063), ('Vietnam_reunifying', 0.3988904058933258)]


In [None]:
# Calculate the most similar words to the vector difference between "France" and "Paris" in each model
# w2v_similar = w2v_model.most_similar(positive=['France', 'capital'], negative=['Paris'])
ft_similar = ft_model.most_similar(positive=['France', 'capital'], negative=['Paris'])

# Print the results
print("Semantic Comparison: Country-Capital Relationship")
# print("Most similar words to 'France - Paris + capital' in Word2Vec:", w2v_similar[:5])
print("Most similar words to 'France - Paris + capital' in FastText:", ft_similar[:5])

Semantic Comparison: Country-Capital Relationship
Most similar words to 'France - Paris + capital' in FastText: [('non-capital', 0.5866502523422241), ('investment', 0.5786392092704773), ('capital-', 0.5744650363922119), ('capital-labour', 0.5675524473190308), ('capital-rich', 0.5655425786972046)]


In the mentioned examples, I tried to find the word that is most similar to 'Paris' in the same way that 'France' is similar to 'Capital'.

As another example:

In [None]:
w2v_result = w2v_model.most_similar(positive=['Berlin', 'France'], negative=['Germany'], topn=1)
print(w2v_result)

# Syntactic representation using FastText
ft_result = ft_model.most_similar(positive=['Berlin', 'France'], negative=['Germany'], topn=1)
print(ft_result)

To find the word that is most similar to 'Paris' in the same way that 'Berlin' is similar to 'Germany' we can see that both models perform pretty good but word2vec is better here. So overal the choice of model will depend on the specific task and data being used.

In [None]:
# Compare the similarity between the synonyms "happy" and "joyful" and the antonyms "happy" and "sad" in each model
w2v_sim1 = w2v_model.similarity('happy', 'joyful')
w2v_sim2 = w2v_model.similarity('happy', 'sad')
ft_sim1 = ft_model.similarity('happy', 'joyful')
ft_sim2 = ft_model.similarity('happy', 'sad')

# Print the results
print("Semantic Comparison: Synonyms and Antonyms")
print("Cosine similarity between 'happy' and 'joyful' in Word2Vec:", w2v_sim1)
print("Cosine similarity between 'happy' and 'joyful' in FastText:", ft_sim1)
print("Cosine similarity between 'happy' and 'sad' in Word2Vec:", w2v_sim2)
print("Cosine similarity between 'happy' and 'sad' in FastText:", ft_sim1)


Semantic Comparison: Synonyms and Antonyms
Cosine similarity between 'happy' and 'joyful' in Word2Vec: 0.42381963
Cosine similarity between 'happy' and 'joyful' in FastText: 0.71287423
Cosine similarity between 'happy' and 'sad' in Word2Vec: 0.5354614
Cosine similarity between 'happy' and 'sad' in FastText: 0.71287423


The semantic similarity between words like "happy" and "joyful" is more in fasttext. The non-similarity between "sad" and "happy" is more in FastText. It seems that if two words are similar or very non-similar, fastText assign a higher score in comparision to word2vec.

3. What do you think the potential benefits are of using FastText over Word2Vec and vice
versa? 

https://medium.com/swlh/a-quick-overview-of-the-main-difference-between-word2vec-and-fasttext-b9d3f6e274e9

https://cesconi.com/what-is-the-main-difference-between-word2vec-and-fasttext-57bdaf3a69ef


One of the primary benefits of FastText is its ability to capture subword information, which makes it particularly useful in handling out-of-vocabulary words, especially in languages with complex morphologies or character-level differences. By breaking words down into their constituent subwords, FastText can capture relationships between words that share similar prefixes or suffixes, even if they are not exact matches. This is particularly useful for tasks such as text classification, where subword information can help the model generalize better to unseen words. In contrast, Word2Vec is more suited to tasks that require a larger vocabulary of exact word matches, such as word analogy tasks.

Another benefit of FastText is that it can capture polysemy better than Word2Vec, meaning it can distinguish between different senses of a word based on the contexts in which it appears. This is because FastText's subword information can help disambiguate the different meanings of a word, whereas Word2Vec may conflate them if they are represented by the same vector. For example, the word "bank" can refer to a financial institution or the side of a river, and FastText can better capture these distinct meanings.

On the other hand, Word2Vec has been around longer and has been widely adopted and extensively studied in the NLP community. It has been shown to be effective in a variety of applications, including sentiment analysis, text classification, and machine translation. It is also computationally more efficient than FastText, especially for larger vocabularies, due to its use of a hierarchical softmax.