<a href="https://colab.research.google.com/github/Jayden-Nyamiaka/Machine-Learning-and-Data-Mining/blob/main/nyamiaka_jayden_set5_prob3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/emiletimothy/Caltech-CS155-2023/blob/main/set5/set5_prob3.ipynb)


## Set 5
## 3. Word2Vec **Principles**

#### Preparation

In [None]:
# download the helper function
!wget -O P3CHelpers.py https://raw.githubusercontent.com/emiletimothy/Caltech-CS155-2023/main/set5/P3CHelpers.py

--2023-02-23 05:00:32--  https://raw.githubusercontent.com/emiletimothy/Caltech-CS155-2023/main/set5/P3CHelpers.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4939 (4.8K) [text/plain]
Saving to: ‘P3CHelpers.py’


2023-02-23 05:00:32 (40.5 MB/s) - ‘P3CHelpers.py’ saved [4939/4939]



In [None]:
# download the dataset
!wget -O dr_seuss.txt https://raw.githubusercontent.com/emiletimothy/Caltech-CS155-2023/main/set5/data/dr_seuss.txt

--2023-02-23 05:01:09--  https://raw.githubusercontent.com/emiletimothy/Caltech-CS155-2023/main/set5/data/dr_seuss.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8810 (8.6K) [text/plain]
Saving to: ‘dr_seuss.txt’


2023-02-23 05:01:10 (53.8 MB/s) - ‘dr_seuss.txt’ saved [8810/8810]



In [None]:
import numpy as np
from P3CHelpers import *
import torch
import torch.nn as nn
import torch.optim as optim

#### Problem D:
Fill in the `generate_traindata` and `find_most_similar_pairs` functions.

In [None]:
"""
Returns one-hot-encoded feature representation of the specified word given
a dictionary mapping words to their one-hot-encoded index.

Arguments:
    word_to_index: Dictionary mapping words to their corresponding index
                    in a one-hot-encoded representation of our corpus.

    word:          Word whose feature representation we wish to compute.

Returns:
    feature_representation:     Feature representation of the passed-in word.
"""
def get_word_repr(word_to_index, word):
    unique_words = word_to_index.keys()
    # Return a vector that's zero everywhere besides the index corresponding to <word>
    feature_representation = np.zeros(len(unique_words))
    feature_representation[word_to_index[word]] = 1
    return feature_representation    


"""
Generates training data for Skipgram model.

Arguments:
    word_list:     Sequential list of words (strings).
    word_to_index: Dictionary mapping words to their corresponding index
                    in a one-hot-encoded representation of our corpus.

    window_size:   Size of Skipgram window. Defaults to 2 
                    (use the default value when running your code).

Returns:
    (trainX, trainY):     A pair of matrices (trainX, trainY) containing training 
                          points (one-hot-encoded vectors) and their corresponding output_word
                          (also one-hot-encoded vectors)

"""
def generate_traindata(word_list, word_to_index, window_size=2):
    trainX = []
    trainY = []

    N = len(word_list)
    for i in range(N):
        word_i = get_word_repr(word_to_index, word_list[i])
        # Context positions: up to window_size words on each side of position i
        lower_s, upper_s = max(i - window_size, 0), min(i + 1 + window_size, N)
        for j in range(lower_s, upper_s):
            if i == j:
                continue
            word_j = get_word_repr(word_to_index, word_list[j])
            trainX.append(word_i)
            trainY.append(word_j)

    return np.array(trainX), np.array(trainY)
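
To make the windowing concrete, the sketch below lists the (center, context) word pairs that a skip-gram window produces for a short, made-up word list. The helper `skipgram_pairs` and the toy words are illustrative assumptions, not part of the assignment code; it mirrors the index logic of `generate_traindata` but works on raw words instead of one-hot vectors so the pairs are easy to read.

```python
def skipgram_pairs(words, window_size=2):
    """Return (center, context) pairs for every context word within
    window_size positions of each center word (hypothetical helper)."""
    pairs = []
    n = len(words)
    for i, center in enumerate(words):
        lo = max(i - window_size, 0)
        hi = min(i + window_size + 1, n)
        for j in range(lo, hi):
            if j != i:  # a word is never its own context
                pairs.append((center, words[j]))
    return pairs


toy_words = ["one", "fish", "two", "fish"]
pairs = skipgram_pairs(toy_words, window_size=1)
print(pairs)
# [('one', 'fish'), ('fish', 'one'), ('fish', 'two'),
#  ('two', 'fish'), ('two', 'fish'), ('fish', 'two')]
```

Each pair becomes one training point: the center word is the input and the context word is the label. Words near the start or end of the list simply get fewer pairs because the window is clipped at the boundaries.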

In [None]:
"""
Find the most similar pairs from the word embeddings computed from
a body of text

Arguments:
    filename:           Text file to read and train embeddings from
    num_latent_factors: The number of latent factors / the size of the embedding
"""
def find_most_similar_pairs(filename, num_latent_factors):
    # Load in a list of words from the specified file; remove non-alphanumeric characters
    # and make all chars lowercase.
    sample_text = load_word_list(filename)

    # Create word dictionary
    word_to_index = generate_onehot_dict(sample_text)
    print("Textfile contains %s unique words"%len(word_to_index))
    # Create training data
    trainX, trainY = generate_traindata(sample_text, word_to_index)
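    # Pair each one-hot input with the class index of its context word;
    # nn.CrossEntropyLoss takes integer class labels as targets
    train_data = torch.utils.data.TensorDataset(
        torch.from_numpy(trainX).float(),
        torch.from_numpy(trainY.argmax(axis=1)).long()
    )

    ## Creates and trains model in Pytorch

    # vocab_size = number of unique words in our text file. Will be useful 
    # when adding layers to your neural network
    vocab_size = len(word_to_index)

    epochs = 30
    learning_rate = 5e-3
    batch = 64

    # No Softmax layer here: nn.CrossEntropyLoss applies log-softmax to the
    # raw outputs itself, so the model should return unnormalized logits
    model = nn.Sequential(
        nn.Linear(vocab_size, num_latent_factors),
        nn.Linear(num_latent_factors, vocab_size)
    )

    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    loss_fn = nn.CrossEntropyLoss()
    # Iterating over a TensorDataset yields (input, target) mini-batches of size `batch`
    train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch, shuffle=True)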

    model.train()

    for epoch in range(epochs):
        for batch_idx, (data, target) in enumerate(train_loader):
            # Erase accumulated gradients
            optimizer.zero_grad()

            # Forward pass
            output = model(data)

            # Calculate loss
            loss = loss_fn(output, target)

            # Backward pass
            loss.backward()

            # Weight update
            optimizer.step()

        # Tracks loss each epoch (loss of the last mini-batch in the epoch)
        print('Train Epoch: %d  Loss: %.4f' % (epoch + 1,  loss.item()))

    model.eval()

    # Prints model parameters
    print()
    for param_name, param in model.named_parameters():
        print('Parameter "' + param_name + '" has shape', param.shape)

    ## Extracts weights to use as word embeddings

    # '1.weight' is the weight matrix of the second Linear layer; it has one
    # row per vocabulary word and one column per latent factor
    weights = model.get_parameter('1.weight').detach().numpy()
    
    # Find and print most similar pairs
    print()
    similar_pairs = most_similar_pairs(weights, word_to_index)
    for pair in similar_pairs[:30]:
        print(pair)


### Problem E-H:
Run your model on `dr_seuss.txt` and answer questions E through H.

In [None]:
find_most_similar_pairs('dr_seuss.txt', 10)

Textfile contains 308 unique words
Train Epoch: 1  Loss: 5.7301
Train Epoch: 2  Loss: 5.7301
Train Epoch: 3  Loss: 5.7301
Train Epoch: 4  Loss: 5.7300
Train Epoch: 5  Loss: 5.7300
Train Epoch: 6  Loss: 5.7300
Train Epoch: 7  Loss: 5.7299
Train Epoch: 8  Loss: 5.7299
Train Epoch: 9  Loss: 5.7299
Train Epoch: 10  Loss: 5.7298
Train Epoch: 11  Loss: 5.7298
Train Epoch: 12  Loss: 5.7297
Train Epoch: 13  Loss: 5.7297
Train Epoch: 14  Loss: 5.7296
Train Epoch: 15  Loss: 5.7295
Train Epoch: 16  Loss: 5.7295
Train Epoch: 17  Loss: 5.7294
Train Epoch: 18  Loss: 5.7293
Train Epoch: 19  Loss: 5.7292
Train Epoch: 20  Loss: 5.7291
Train Epoch: 21  Loss: 5.7290
Train Epoch: 22  Loss: 5.7289
Train Epoch: 23  Loss: 5.7287
Train Epoch: 24  Loss: 5.7286
Train Epoch: 25  Loss: 5.7284
Train Epoch: 26  Loss: 5.7283
Train Epoch: 27  Loss: 5.7281
Train Epoch: 28  Loss: 5.7279
Train Epoch: 29  Loss: 5.7277
Train Epoch: 30  Loss: 5.7274

Parameter "0.weight" has shape torch.Size([10, 308])
Parameter "0.bias" has shape torch.Size([10])
Parameter