# Word2Vec

## 1.1 Word2Vec Implementation
Task:
- Train CBOW and Skip-gram models using gensim
- Embedding dimensions: 100, 300, 500
- Use a subset of Wikipedia data and preprocess it
- Evaluate on the WordSim353 dataset: compute Spearman's correlation coefficient between the embedding similarities and the WordSim353 human similarity ratings
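
The evaluation step in the list above can be sketched as follows. This is a minimal, dependency-free illustration rather than the notebook's actual evaluation code: `evaluate_pairs` and its `similarity` callable are hypothetical names, and the word pairs would come from the WordSim353 file. In practice `scipy.stats.spearmanr` or gensim's `KeyedVectors.evaluate_word_pairs` would normally be used instead of the hand-written rank correlation.

```python
from typing import Callable, Iterable, List, Optional, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def evaluate_pairs(
    pairs: Iterable[Tuple[str, str, float]],
    similarity: Callable[[str, str], Optional[float]],
) -> Tuple[float, int]:
    """Correlate model similarities with human scores.

    `pairs` holds (word1, word2, human_score) rows. `similarity` is a
    hypothetical callable that returns None for out-of-vocabulary pairs,
    which are skipped. Returns (rho, number of pairs actually used).
    """
    human, predicted = [], []
    for w1, w2, score in pairs:
        sim = similarity(w1, w2)
        if sim is None:
            continue
        human.append(score)
        predicted.append(sim)
    return spearman(human, predicted), len(human)
```

With a trained gensim model, `similarity` could wrap `model.wv.similarity(w1, w2)` and return `None` when either word is missing from `model.wv`. Reporting the number of pairs used alongside the coefficient makes scores comparable across models with different vocabularies.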

My dataset is taken from https://www.kaggle.com/datasets/jjinho/wikipedia-20230701?resource=download&select=b.parquet
(articles starting with ab).

### Preprocess the dataset

In [None]:
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


df = pd.read_parquet(r'C:\Users\matte\Tsinghua\SecondSemester\NLP\Assignments\NLP_assignment_1\b.parquet\b.parquet')

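# Eliminate stopwords and lowercase the text.
# A separate name keeps the imported `stopwords` module from being shadowed.
stop_words = set(stopwords.words('english'))
# Note: `df` becomes a Series of space-joined token strings, one per article.
df = df['text'].apply(lambda x: ' '.join([word.lower() for word in word_tokenize(x) if word.lower() not in stop_words]))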

print(df.head())
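
gensim's `Word2Vec` consumes an iterable of token lists rather than joined strings, so the preprocessed text has to end up in that shape. One possible way to get there is sketched below. It is only an illustration: the regex sentence splitter, the regex tokenizer, the tiny `TOY_STOPWORDS` set, and the name `to_token_lists` are all assumptions standing in for the NLTK tokenizers and stopword list used in the notebook.

```python
import re
from typing import List, Set

# Tiny illustrative stopword set (assumption); the notebook uses NLTK's English list.
TOY_STOPWORDS = {"the", "a", "an", "of", "is", "in", "and", "to"}

# Split after sentence-ending punctuation followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Keep runs of lowercase letters and digits as tokens; punctuation is dropped.
TOKEN = re.compile(r"[a-z0-9]+")


def to_token_lists(text: str, stop_words: Set[str] = TOY_STOPWORDS) -> List[List[str]]:
    """Split text into sentences, then each sentence into lowercase tokens."""
    token_lists = []
    for sentence in SENTENCE_END.split(text):
        tokens = [t for t in TOKEN.findall(sentence.lower()) if t not in stop_words]
        if tokens:  # skip sentences that are empty after filtering
            token_lists.append(tokens)
    return token_lists


example = to_token_lists("The cat sat in the hat. Is a dog an animal? Yes!")
print(example)
```

Splitting into sentences before tokenizing keeps context windows from spanning sentence boundaries, at the cost of depending on how well the sentence splitter handles abbreviations and similar cases.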

In [None]:
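from gensim.models import Word2Vec
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
import pandas as pd

# nltk resources
#nltk.download('punkt')
#nltk.download('punkt_tab')
#nltk.download('stopwords')

# Word2Vec expects an iterable of token lists.
# Assumption: each preprocessed article in `df` (a Series of space-joined
# tokens from the previous cell) is treated as one "sentence".
# gensim's optimized routines truncate texts longer than 10,000 tokens,
# so splitting articles into sentences first may be preferable.
tokenized_sentences = [text.split() for text in df]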


# ------------------------
# 2. Training CBOW Model with gensim
# ------------------------
# In gensim's Word2Vec, the CBOW architecture is used by default (sg=0).
# The negative parameter specifies the number of negative samples.
model = Word2Vec(
    sentences=tokenized_sentences,
    vector_size=100,   # Dimensionality of the word vectors
    window=5,          # Context window size
    min_count=1,       # Minimum frequency to consider a word in the vocabulary
    sg=0,              # Use CBOW (sg=1 would switch to Skip-gram)
    negative=5,        # Number of negative samples (for negative sampling)
    epochs=1           # Number of training epochs
)
# Save the model to a file
model.save("cbow_model.model")
# ------------------------
# 3. Inspecting the Learned Embeddings
# ------------------------
# For example, get the vector for a specific word:
word = 'word'
if word in model.wv:
    vector = model.wv[word]
    print(f"Vector for '{word}':")
    print(vector)

    # Optionally, find similar words (only valid for in-vocabulary words,
    # since most_similar raises KeyError otherwise)
    similar_words = model.wv.most_similar(word, topn=5)
    print(f"\nWords similar to '{word}':")
    for similar_word, similarity in similar_words:
        print(f"{similar_word}: {similarity:.4f}")
else:
    print(f"The word '{word}' is not in the vocabulary.")


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\matte\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\matte\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Vector for 'word':
[-0.49217764 -1.6475337  -2.6037877   1.9936352   0.2410293  -2.884838
 -0.54240084 -0.2847878   3.8169143  -0.35356095 -0.32403955 -2.3672535
  0.49424002 -0.82202256 -1.1617094   1.6686698   0.68227816 -2.3130293
  3.1425831  -1.6599718  -1.1164575   1.8766254  -0.51606256  0.14188436
  1.6705462   0.26702815 -4.636656    0.8631962   1.031978   -2.7034738
  1.1326134   2.6448078  -1.5893145   1.6827046   0.27565143 -1.0085858
 -0.56139183  1.0290754  -0.96094507  0.6398163   2.3282611  -2.0165942
 -0.8735068   0.7642665  -1.5062762  -2.4455447  -3.587279    2.0453951
  2.885207    1.4384016  -2.9046617  -0.13190281  1.784654    2.3659742
  0.86444473  2.5305827   2.6790931   1.4854783   2.0200462   2.273612
  0.5642112  -2.3650575   2.8583832   2.1108289  -0.8617417   0.93796104
  2.931189   -0.691035    0.8769942  -0.586949   -3.638812    2.7810962
  1.6093385  -1.549856   -0.6651917  -0.4263288   1.369236    0.18044865
  1.9094195   1.4904673  -2.4179108  -0.6886
