<a href="https://colab.research.google.com/github/R3beAM/Laboratorio-2-Modelos-de-lenguaje/blob/main/Lab2_ModelosLenguaje.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<br>

#### Activity 2: Exploring Word Embeddings with GloVe and Numpy
<br>

- Objective:
    - To understand the concept of word embeddings and their significance in Natural Language Processing.
    - To learn how to manipulate and visualize high-dimensional data using dimensionality reduction techniques like PCA and t-SNE.
    - To gain hands-on experience in implementing word similarity and analogies using GloVe embeddings and Numpy.
    
<br>

- Instructions:
    - Download the GloVe pre-trained vectors from the link provided in Canvas, which points to the official public project:
    Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.
    https://nlp.stanford.edu/data/glove.6B.zip

    - Create a dictionary of the embeddings so that you can carry out fast lookups. Save that dictionary, e.g. as a serialized file, for faster loading in future uses.
    
    - PCA and t-SNE Visualization: After loading the GloVe embeddings, use Numpy and Sklearn to perform PCA and t-SNE to reduce the dimensionality of the embeddings and visualize them in a 2D or 3D space.

    - Word Similarity: Implement a function that takes a word as input and returns the 'n' most similar words based on their embeddings. You should use Numpy to implement this function; using libraries that already implement it (e.g. Gensim) will result in zero points.

    - Word Analogies: Implement a function to solve analogies between words. For example, "man is to king as woman is to ____". You should use Numpy to implement this function; using libraries that already implement it (e.g. Gensim) will result in zero points.

    - Submission: This activity is to be submitted in teams of 3 or 4. Only one person should submit the final work, with the full names of all team members included in a markdown cell at the beginning of the notebook.
    
<br>

- Evaluation Criteria:

    - Code Quality (40%): Your code should be well-organized, clearly commented, and easy to follow. Also use markdown cells for clarity.

    - Functionality (60%): All functions should work as intended, without errors.
        - Visualization of PCA and t-SNE (10% each, for a total of 20%)
        - Similarity function (20%)
        - Analogy function (20%)



#### Import libraries

In [7]:
# Import libraries
import torch
import torch.nn.functional as F
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np
from numpy.linalg import norm
import pickle
from google.colab import drive
drive.mount('/content/drive')
PATH = '/content/drive/MyDrive/glove.6B.50d.txt'
plt.style.use('ggplot')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


#### Load file

In [21]:
# PATH = '/media/pepe/DataUbuntu/Databases/glove_embeddings/glove.6B.200d.txt'
PATH = '/content/drive/MyDrive/glove.6B.50d.txt'
emb_dim = 50

In [23]:
# Create dictionary with embeddings
def create_emb_dictionary(path="/content/drive/MyDrive/glove.6B.50d.txt"):
    """Read a GloVe text file and return a dict mapping each word to a torch tensor."""
    embeddings = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            # Each line has the form: word v1 v2 ... vN
            parts = line.strip().split()
            if not parts:
                continue  # skip blank lines
            word, *vector = parts
            values = np.asarray(vector, dtype=np.float32)
            embeddings[word] = torch.from_numpy(values)
    return embeddings

In [25]:
# create dictionary
embeddings_dict = create_emb_dictionary("/content/drive/MyDrive/glove.6B.50d.txt")

In [27]:
# Serialize
with open('embeddings_dict_50D.pkl', 'wb') as f:
    pickle.dump(embeddings_dict, f)

# Deserialize
# with open('embeddings_dict_50D.pkl', 'rb') as f:
#     embeddings_dict = pickle.load(f)
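
The serialize and deserialize steps above can be combined so the text file is parsed only when no cached pickle exists yet. The following is a minimal sketch, not part of the original notebook: the helper name `load_or_build` and its parameters `cache_path` and `build_fn` are hypothetical.

```python
import os
import pickle


def load_or_build(cache_path, build_fn):
    """Return the object cached at cache_path, building and caching it if missing.

    Hypothetical helper: build_fn is any zero-argument callable that
    produces the object to cache (for example, the embeddings dictionary).
    """
    if os.path.exists(cache_path):
        # Fast path: reuse the serialized object from a previous run
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    # Slow path: build the object once, then store it for next time
    obj = build_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(obj, f)
    return obj
```

In this notebook it could be called as `embeddings_dict = load_or_build('embeddings_dict_50D.pkl', lambda: create_emb_dictionary(PATH))`, assuming `create_emb_dictionary` and `PATH` are defined as in the cells above.
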

#### See some embeddings

In [5]:
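# Show the first n_words lines of the embeddings file
def show_n_first_words(path, n_words):
    with open(path, 'r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            if i >= n_words:
                break
            tokens = line.split()
            # Print the raw tokens and the number of vector components
            print(tokens, len(tokens[1:]))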

In [6]:
show_n_first_words(PATH, 5)

FileNotFoundError: [Errno 2] No such file or directory: '/media/pepe/DataUbuntu/Databases/glove_embeddings/glove.6B.50d.txt'

### Plot some embeddings

In [28]:

# PCA and t-SNE visualization

def plot_embeddings(emb_path, words2show, emb_dim, embeddings_dict, func=PCA, n_components=2, random_state=42):
    """Reduce the dimensionality of selected embeddings and plot them.

    Parameters
    ----------
    emb_path : str
        Path to the embeddings file (kept for backwards compatibility).
    words2show : Iterable[str]
        Collection of tokens to visualise.
    emb_dim : int
        Dimensionality of the embeddings (unused but kept for clarity).
    embeddings_dict : Dict[str, torch.Tensor]
        Dictionary mapping tokens to their embedding vectors.
    func : Callable
        A dimensionality reduction class from sklearn (e.g. PCA or TSNE).
    n_components : int, optional
        Number of dimensions to project to. Defaults to 2.
    random_state : int, optional
        Random seed used by stochastic reducers such as t-SNE.
    """
    import inspect

    vectors = []
    labels = []
    for word in words2show:
        vector = embeddings_dict.get(word)
        if vector is None:
            print(f"Word '{word}' not found in the embeddings dictionary.")
            continue
        # ensure numpy array
        vectors.append(vector.cpu().numpy())
        labels.append(word)

    if not vectors:
        raise ValueError("No valid words were provided to plot.")

    matrix = np.vstack(vectors)

    reducer_kwargs = {"n_components": n_components}
    if func is TSNE:
        # perplexity must be strictly less than the number of samples
        perplexity = min(30, max(1, len(labels) - 1))
        reducer_kwargs.update({
            "random_state": random_state,
            "init": "pca",
            "learning_rate": "auto",
            "perplexity": perplexity
        })
    else:
        params = inspect.signature(func.__init__).parameters
        if "random_state" in params:
            reducer_kwargs["random_state"] = random_state

    # Validate the target dimensionality before running the (possibly slow) reduction
    if n_components not in (2, 3):
        raise ValueError("Only 2D or 3D plots are supported.")

    reducer = func(**reducer_kwargs)
    reduced = reducer.fit_transform(matrix)

    if n_components == 2:
        fig, ax = plt.subplots(figsize=(12, 8))
        ax.scatter(reduced[:, 0], reduced[:, 1], alpha=0.7)
        for label, x, y in zip(labels, reduced[:, 0], reduced[:, 1]):
            ax.annotate(label, (x, y))
        ax.set_xlabel("Component 1")
        ax.set_ylabel("Component 2")
    else:
        from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 - imported for side effects
        fig = plt.figure(figsize=(12, 8))
        ax = fig.add_subplot(111, projection="3d")
        ax.scatter(reduced[:, 0], reduced[:, 1], reduced[:, 2], alpha=0.7)
        for label, x, y, z in zip(labels, reduced[:, 0], reduced[:, 1], reduced[:, 2]):
            ax.text(x, y, z, label)
        ax.set_xlabel("Component 1")
        ax.set_ylabel("Component 2")
        ax.set_zlabel("Component 3")

    ax.set_title(f"{func.__name__} projection of GloVe embeddings")
    plt.show()

    return {"labels": labels, "embeddings": reduced}

In [None]:
words = ['burger', 'tortilla', 'bread', 'pizza', 'beef', 'steak', 'fries', 'chips',
         'argentina', 'mexico', 'spain', 'usa', 'france', 'italy', 'greece', 'china',
         'water', 'beer', 'tequila', 'wine', 'whisky', 'brandy', 'vodka', 'coffee', 'tea',
         'apple', 'banana', 'orange', 'lemon', 'grapefruit', 'grape', 'strawberry', 'raspberry',
         'school', 'work', 'university', 'highschool']


In [None]:
# PCA dimensionality reduction for visualization
plot_embeddings(PATH, words, emb_dim, embeddings_dict, PCA)
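
To see what the sklearn `PCA` call does, the projection can be reproduced with NumPy alone: center the data, take the singular value decomposition, and project onto the leading right singular vectors. This is a sketch under stated assumptions rather than the notebook's method. The name `pca_numpy` is hypothetical, the demo matrix is random data standing in for real embeddings, and the sign of each component may differ from sklearn's output.

```python
import numpy as np


def pca_numpy(matrix, n_components=2):
    """Project the rows of matrix onto their first n_components principal axes."""
    X = np.asarray(matrix, dtype=np.float64)
    # PCA operates on mean-centered data
    X_centered = X - X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:n_components].T


# Random stand-in for 20 embeddings of dimension 50 (illustrative only)
rng = np.random.default_rng(0)
demo = rng.normal(size=(20, 50))
projected = pca_numpy(demo, n_components=2)
print(projected.shape)
```
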

In [None]:
# t-SNE dimensionality reduction for visualization
embeddings = plot_embeddings(PATH, words, emb_dim, embeddings_dict, TSNE)

### Let us compute analogies

In [None]:
# analogy
def analogy(word1, word2, word3, embeddings_dict):
    pass
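
One common way to fill in the stub above is vector arithmetic followed by a cosine-similarity search: compute `word2 - word1 + word3` and return the closest word that is not one of the three inputs. The sketch below uses only NumPy. The name `analogy_sketch` and the toy vectors are hypothetical, and `np.asarray` is assumed to accept the CPU torch tensors stored in `embeddings_dict`.

```python
import numpy as np


def analogy_sketch(word1, word2, word3, embeddings_dict):
    """Solve 'word1 is to word2 as word3 is to ?' with cosine similarity."""
    for w in (word1, word2, word3):
        if w not in embeddings_dict:
            raise KeyError(f"'{w}' is not in the vocabulary")
    vocab = list(embeddings_dict)
    # Stack every vector into a (vocabulary size, dimension) matrix
    matrix = np.stack([np.asarray(embeddings_dict[w], dtype=np.float32) for w in vocab])
    a, b, c = (np.asarray(embeddings_dict[w], dtype=np.float32)
               for w in (word1, word2, word3))
    target = b - a + c
    # Cosine similarity between the target and every word in the vocabulary
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(target)
    sims = matrix @ target / np.maximum(norms, 1e-12)
    # Exclude the three input words from the candidates
    for w in (word1, word2, word3):
        sims[vocab.index(w)] = -np.inf
    return vocab[int(np.argmax(sims))]


# Toy 3-dimensional vectors (illustrative only, not real GloVe values)
toy_embeddings = {
    "man": np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "king": np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "child": np.array([0.0, 0.5, 0.5]),
}
print(analogy_sketch("man", "king", "woman", toy_embeddings))  # prints "queen"
```

With the real dictionary, stacking the full vocabulary on every call is slow; building the matrix once and reusing it is a reasonable refinement.
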

In [None]:
analogy('man', 'king', 'woman', embeddings_dict)

In [None]:
# most similar
def find_most_similar(word, embeddings_dict, top_n=10):
    pass
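
A possible NumPy-only way to fill in this stub is to rank every word by cosine similarity to the query and return the best `top_n` as `(word, score)` pairs, which matches how the result is indexed with `w[0]` in the loop further down. The name `find_most_similar_sketch` and the toy vectors are hypothetical, and `np.asarray` is assumed to accept the CPU torch tensors stored in `embeddings_dict`.

```python
import numpy as np


def find_most_similar_sketch(word, embeddings_dict, top_n=10):
    """Return the top_n (word, cosine similarity) pairs closest to word."""
    if word not in embeddings_dict:
        raise KeyError(f"'{word}' is not in the vocabulary")
    vocab = list(embeddings_dict)
    # Stack every vector into a (vocabulary size, dimension) matrix
    matrix = np.stack([np.asarray(embeddings_dict[w], dtype=np.float32) for w in vocab])
    query = np.asarray(embeddings_dict[word], dtype=np.float32)
    # Cosine similarity between the query and every word in the vocabulary
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    sims = matrix @ query / np.maximum(norms, 1e-12)
    # The query word itself is not a useful answer
    sims[vocab.index(word)] = -np.inf
    k = min(top_n, len(vocab) - 1)
    best = np.argsort(-sims)[:k]
    return [(vocab[i], float(sims[i])) for i in best]


# Toy 2-dimensional vectors (illustrative only, not real GloVe values)
toy_vectors = {
    "cat": np.array([1.0, 0.0]),
    "kitten": np.array([0.9, 0.1]),
    "dog": np.array([0.6, 0.8]),
    "car": np.array([0.0, 1.0]),
}
toy_result = find_most_similar_sketch("cat", toy_vectors, top_n=2)
print([w for w, _ in toy_result])  # prints ['kitten', 'dog']
```
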

In [None]:
most_similar = find_most_similar('mexico', embeddings_dict)

In [None]:
for i, w in enumerate(most_similar, 1):
    print(f'{i} ---> {w[0]}')