### Word embeddings

Word2vec is a family of algorithms that map words to vectors. It has two model architectures, both based on shallow neural networks:
- CBOW (Continuous Bag of Words)
- Skip-gram
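
The two architectures differ in how training examples are built from a context window: CBOW predicts the center word from its surrounding words, while Skip-gram predicts each surrounding word from the center word. In gensim, the `sg` parameter of `Word2Vec` selects between them (`sg=0` for CBOW, `sg=1` for Skip-gram). The sketch below only builds the training pairs for a toy sentence and trains no network; the function names and the window size of 1 are illustrative assumptions, not part of gensim.

```python
def cbow_pairs(tokens, window=1):
    """Return (context_words, center_word) examples, as CBOW would see them."""
    pairs = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, center))
    return pairs


def skipgram_pairs(tokens, window=1):
    """Return (center_word, context_word) examples, as Skip-gram would see them."""
    pairs = []
    for context, center in cbow_pairs(tokens, window):
        pairs.extend((center, word) for word in context)
    return pairs


sentence = ["the", "cat", "sat"]  # hypothetical toy sentence
print(cbow_pairs(sentence))
print(skipgram_pairs(sentence))
```

CBOW produces one example per position, while Skip-gram produces one example per (center, neighbor) combination, so it yields more training pairs from the same text.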

For this lesson we will use the [gensim module](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec). It is recommended to use virtual environments to avoid version conflicts between numpy and gensim.

In [None]:
# Run first in the terminal: python -m venv myenv

!pip install gensim
!pip install kagglehub

In [None]:

import gensim

We can load a pretrained model (trained with the Google News dataset) in the “word2vec C format” with the following command:

**model = gensim.models.KeyedVectors.load_word2vec_format(modelPath, binary=True)**

You need to set the `modelPath` variable to the location of the Google News vectors file. You can download the model from [Kaggle](https://www.kaggle.com/datasets/leadbest/googlenewsvectorsnegative300).

In [None]:
import kagglehub
import os

# download the dataset to the current working directory
os.environ['KAGGLEHUB_CACHE'] = os.getcwd()

# Download latest version
modelPath = kagglehub.dataset_download("leadbest/googlenewsvectorsnegative300")

print("Path to dataset files:", modelPath)

Path to dataset files: c:\Users\mihaela.petrevlad\folder\datasets\leadbest\googlenewsvectorsnegative300\versions\2


In [None]:
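# dataset_download returns a directory; the file name inside it is assumed to be the dataset's .bin file
binPath = os.path.join(modelPath, "GoogleNews-vectors-negative300.bin")
model = gensim.models.KeyedVectors.load_word2vec_format(binPath, binary=True)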

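If the download takes too long, you can instead load a smaller pretrained model through gensim's downloader API, for example glove-twitter-25 (these are GloVe vectors rather than Word2vec, but they are returned as the same kind of KeyedVectors object):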

In [None]:
import gensim.downloader  # the submodule is not loaded by "import gensim" alone
model = gensim.downloader.load('glove-twitter-25')

We can easily find the similarity between two words:

In [None]:
print(model.similarity("cat","dog"))
print(model.similarity("cat","car"))
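
The value returned by `similarity` is the cosine similarity of the two word vectors: their dot product divided by the product of their lengths. Here is a minimal sketch of that formula using made-up 3-dimensional vectors (the numbers are hypothetical and not taken from any model; real models use far more dimensions):

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, chosen only to show the two extremes
toy_vectors = {
    "cat": [1.0, 2.0, 0.0],
    "dog": [2.0, 4.0, 0.0],  # same direction as "cat", so similarity is close to 1
    "car": [0.0, 0.0, 3.0],  # orthogonal to "cat", so similarity is 0
}
print(cosine_similarity(toy_vectors["cat"], toy_vectors["dog"]))
print(cosine_similarity(toy_vectors["cat"], toy_vectors["car"]))
```

Because only the direction of the vectors matters, scaling a vector does not change its similarity to any other vector.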

Or the words most similar to a given word:

In [None]:
model.most_similar('cat', topn=3)
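
Conceptually, this query ranks every other word in the vocabulary by cosine similarity to the query word and keeps the best `topn`. The brute-force sketch below illustrates the idea on a hypothetical four-word vocabulary; `toy_most_similar` and the vectors are assumptions made for illustration, and gensim's own implementation is vectorized rather than a Python loop.

```python
import heapq
import math


def toy_most_similar(query, vectors, topn=3):
    """Rank all words except `query` by cosine similarity to it (brute force)."""
    def unit(v):
        length = math.sqrt(sum(x * x for x in v))
        return [x / length for x in v]

    q = unit(vectors[query])
    # For unit-length vectors the dot product equals the cosine similarity
    scores = (
        (word, sum(a * b for a, b in zip(q, unit(vec))))
        for word, vec in vectors.items()
        if word != query
    )
    return heapq.nlargest(topn, scores, key=lambda pair: pair[1])


# Hypothetical toy vectors
toy_vectors = {
    "cat": [1.0, 1.0, 0.0],
    "kitten": [0.9, 1.0, 0.1],
    "dog": [1.0, 0.5, 0.0],
    "car": [0.0, 0.1, 1.0],
}
print(toy_most_similar("cat", toy_vectors, topn=2))
```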

In order to obtain the vector of a word, you can use the get_vector(word) method:

In [None]:
model.get_vector("cat")

The vocabulary is stored in the model's `key_to_index` property, a dictionary that maps each word to its index. You can therefore check whether a certain word appears in the vocabulary:

In [None]:
"word" in model.key_to_index

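The membership test is an exact string match, so capitalization matters: a word may be present only in one particular casing. A minimal sketch of a fallback lookup that tries a few case variants is shown here; `find_in_vocab` is a hypothetical helper and `toy_vocab` is a stand-in for `model.key_to_index`, not actual model content.

```python
def find_in_vocab(word, vocab):
    """Return the first case variant of `word` present in `vocab`, or None."""
    for candidate in (word, word.lower(), word.capitalize(), word.upper()):
        if candidate in vocab:
            return candidate
    return None


# Stand-in for model.key_to_index (hypothetical entries)
toy_vocab = {"brain": 0, "Paris": 1, "NASA": 2}

print(find_in_vocab("Brain", toy_vocab))
print(find_in_vocab("paris", toy_vocab))
print(find_in_vocab("nasa", toy_vocab))
print(find_in_vocab("xyzzy", toy_vocab))
```
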
### Exercises (0.1 each)

Use a pretrained Word2vec model (Google News). Choose a short English text (about 400-500 words), for example a Wikipedia article or a book excerpt. The text must also contain proper nouns. Solve the following tasks:

1. Print the number of words in the model's vocabulary.
2. Print all the words in the text that do not appear in the model's vocabulary.
3. Which are the two most distant words in the text, and which are the closest? Print the distance too.
4. Print the clusters of words in the text that are most similar to each other, based on their vectors in the model (you can use sklearn's KMeans).
5. Using NER (Named Entity Recognition), find the named entities in the text. Print the top 5 most similar words to each of them, in both uppercase and lowercase.

In [None]:
text = """
  The brain is an organ that serves as the center of the nervous system in all vertebrate and most invertebrate animals. It consists of nervous tissue and is typically located in the head (cephalization), usually near organs for special senses such as vision, hearing, and olfaction. Being the most specialized organ, it is responsible for receiving information from the sensory nervous system, processing those information (thought, cognition, and intelligence) and the coordination of motor control (muscle activity and endocrine system).

While invertebrate brains arise from paired segmental ganglia (each of which is only responsible for the respective body segment) of the ventral nerve cord, vertebrate brains develop axially from the midline dorsal nerve cord as a vesicular enlargement at the rostral end of the neural tube, with centralized control over all body segments. All vertebrate brains can be embryonically divided into three parts: the forebrain (prosencephalon, subdivided into telencephalon and diencephalon), midbrain (mesencephalon) and hindbrain (rhombencephalon, subdivided into metencephalon and myelencephalon). The spinal cord, which directly interacts with somatic functions below the head, can be considered a caudal extension of the myelencephalon enclosed inside the vertebral column. Together, the brain and spinal cord constitute the central nervous system in all vertebrates.

In humans, the cerebral cortex contains approximately 14-16 billion neurons,[1] and the estimated number of neurons in the cerebellum is 55–70 billion.[2] Each neuron is connected by synapses to several thousand other neurons, typically communicating with one another via cytoplasmic processes known as dendrites and axons. Axons are usually myelinated and carry trains of rapid micro-electric signal pulses called action potentials to target specific recipient cells in other areas of the brain or distant parts of the body. The prefrontal cortex, which controls executive functions, is particularly well developed in humans.
"""

# Each of the following cells solves one exercise, in order

In [None]:
print(len(model.key_to_index))

In [None]:
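import string
from nltk.tokenize import word_tokenize  # needs NLTK's Punkt tokenizer data (see nltk.download)

text = word_tokenize(text)  # from here on, `text` is a list of tokens
words_not_found = set()
for word in text:
  # Ignore tokens made up only of punctuation characters
  is_punctuation = all(ch in string.punctuation for ch in word)
  if word not in model.key_to_index and not is_punctuation:
    words_not_found.add(word)
print(words_not_found)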

In [None]:
import itertools
import numpy as np

# Keep each in-vocabulary word once, so a word is never paired with a duplicate of itself
filtered_words = sorted({w for w in text if w in model.key_to_index})
if len(filtered_words) < 2:
  print("Not enough in-vocabulary words to compare.")

# Track the extreme pairs seen so far
max_distance = -np.inf
min_distance = np.inf
most_distant_pair = (None, None)
closest_pair = (None, None)

# Cosine distance (1 - cosine similarity) for every unordered pair of distinct words
for w1, w2 in itertools.combinations(filtered_words, 2):
  dist = 1.0 - model.similarity(w1, w2)
  if dist > max_distance:
    max_distance = dist
    most_distant_pair = (w1, w2)
  if dist < min_distance:
    min_distance = dist
    closest_pair = (w1, w2)

print(f"Most distant pair: {most_distant_pair} (distance = {max_distance:.4f})")
print(f"Closest pair: {closest_pair} (distance = {min_distance:.4f})")

In [None]:
from sklearn.cluster import KMeans

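# KMeans needs at least as many samples as clusters
if len(filtered_words) < 3:
    raise ValueError("Fewer than 3 tokens appear in the Word2Vec vocabulary.")

# One row per word, one column per embedding dimension
vectors = np.stack([model[w] for w in filtered_words])

# n_init is set explicitly so the result does not depend on the sklearn version's default
kmeans = KMeans(n_clusters=3, n_init=10, random_state=13)
labels = kmeans.fit_predict(vectors)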

clusters = {}
for word, lbl in zip(filtered_words, labels):
    clusters.setdefault(lbl, []).append(word)

print("\nClusters (label → words):")
for lbl, word_list in clusters.items():
    print(f"  Cluster {lbl}: {word_list}")

In [None]:
#! python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load("en_core_web_sm")

text = """
  The brain is an organ that serves as the center of the nervous system in all vertebrate and most invertebrate animals. It consists of nervous tissue and is typically located in the head (cephalization), usually near organs for special senses such as vision, hearing, and olfaction. Being the most specialized organ, it is responsible for receiving information from the sensory nervous system, processing those information (thought, cognition, and intelligence) and the coordination of motor control (muscle activity and endocrine system).

While invertebrate brains arise from paired segmental ganglia (each of which is only responsible for the respective body segment) of the ventral nerve cord, vertebrate brains develop axially from the midline dorsal nerve cord as a vesicular enlargement at the rostral end of the neural tube, with centralized control over all body segments. All vertebrate brains can be embryonically divided into three parts: the forebrain (prosencephalon, subdivided into telencephalon and diencephalon), midbrain (mesencephalon) and hindbrain (rhombencephalon, subdivided into metencephalon and myelencephalon). The spinal cord, which directly interacts with somatic functions below the head, can be considered a caudal extension of the myelencephalon enclosed inside the vertebral column. Together, the brain and spinal cord constitute the central nervous system in all vertebrates.

In humans, the cerebral cortex contains approximately 14-16 billion neurons,[1] and the estimated number of neurons in the cerebellum is 55–70 billion.[2] Each neuron is connected by synapses to several thousand other neurons, typically communicating with one another via cytoplasmic processes known as dendrites and axons. Axons are usually myelinated and carry trains of rapid micro-electric signal pulses called action potentials to target specific recipient cells in other areas of the brain or distant parts of the body. The prefrontal cortex, which controls executive functions, is particularly well developed in humans.
"""
doc = nlp(text)
entities = {ent.text for ent in doc.ents}
if not entities:
  print("No named entities found in the text.")

for ent_text in entities:
  # Multi-word phrases in the Google News model are joined with underscores
  token = ent_text.replace(" ", "_")
  print(f"\nEntity: '{ent_text}' -> looked up as '{token}'")
  if token not in model.key_to_index:
    print(f"  '{token}' not in Word2Vec vocabulary. Skipping.")
    continue
  similar = model.most_similar(token, topn=5)
  print(f"  Top 5 most similar words to '{ent_text}':")
  for i, (sim_word, sim_score) in enumerate(similar, start=1):
    print(f"  {i}. {sim_word.upper()} / {sim_word.lower()} (score: {sim_score:.2f})")