Homework 4: Neural Language Models (& 🎃 SpOoKy 👻 authors 🧟 data) - Task 2
----

### Names
----
Names: Jason Cheung, Robert Levin

Task 2: Training your own word embeddings (15 points)
--------------------------------

For this task, you'll use the `gensim` package to train your own embeddings for both words and characters. These will eventually act as inputs to your neural language model.

In [None]:
# here are several dependencies to install
# !python --version
# !python -m pip install --upgrade pip setuptools wheel
# !pip install nltk
# !pip install gensim
# !pip install torch torchvision torchinfo

In [3]:
# import your libraries here

# Remember to restart your kernel if you change the contents of this file!
import neurallm_utils as nutils

# for word embeddings
# if not installed, run the following command:
# !pip install gensim
from gensim.models import Word2Vec

import torch
import torch.nn as nn

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/jasoncheung/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/jasoncheung/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [None]:
# If running on google colab, you'll need to mount your drive to access data files

# from google.colab import drive
# drive.mount('/content/drive')

In [4]:
# constants you may find helpful. Edit as you would like.

# The dimensions of word embedding. 
# This variable will be used throughout the program
# DO NOT WRITE "50" WHEN YOU ARE REFERRING TO THE EMBEDDING SIZE
EMBEDDINGS_SIZE = 50

EMBEDDING_SAVE_FILE_WORD = f"spooky_embedding_word_{EMBEDDINGS_SIZE}.model" # The file to save your word embeddings to
EMBEDDING_SAVE_FILE_CHAR = f"spooky_embedding_char_{EMBEDDINGS_SIZE}.model" # The file to save your char embeddings to
TRAIN_FILE = 'spooky_author_train.csv' # The file to train your language model on


Train embeddings on provided dataset
---

In [None]:
# your code here
# use the provided utility functions to read in the data


data = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
			['this', 'is', 'the', 'second', 'sentence'],
			['yet', 'another', 'sentence'],
			['one', 'more', 'sentence'],
			['and', 'the', 'final', 'sentence']]


# read the spooky data in both by character and by word using the read_file_spooky function in the 
# provided utils
data_by_word = nutils.read_file_spooky(TRAIN_FILE, 1)
data_by_char = nutils.read_file_spooky(TRAIN_FILE, 1, by_character=True)

# print out the first two sentences in each format
# make sure we can read the output easily without scrolling to the side too much
print(data_by_word[:2])
print(data_by_char[:2])


[['<s>', 'this', 'process', ',', 'however', ',', 'afforded', 'me', 'no', 'means', 'of', 'ascertaining', 'the', 'dimensions', 'of', 'my', 'dungeon', ';', 'as', 'i', 'might', 'make', 'its', 'circuit', ',', 'and', 'return', 'to', 'the', 'point', 'whence', 'i', 'set', 'out', ',', 'without', 'being', 'aware', 'of', 'the', 'fact', ';', 'so', 'perfectly', 'uniform', 'seemed', 'the', 'wall', '.', '</s>'], ['<s>', 'it', 'never', 'once', 'occurred', 'to', 'me', 'that', 'the', 'fumbling', 'might', 'be', 'a', 'mere', 'mistake', '.', '</s>']]
[['<s>', 't', 'h', 'i', 's', '_', 'p', 'r', 'o', 'c', 'e', 's', 's', ',', '_', 'h', 'o', 'w', 'e', 'v', 'e', 'r', ',', '_', 'a', 'f', 'f', 'o', 'r', 'd', 'e', 'd', '_', 'm', 'e', '_', 'n', 'o', '_', 'm', 'e', 'a', 'n', 's', '_', 'o', 'f', '_', 'a', 's', 'c', 'e', 'r', 't', 'a', 'i', 'n', 'i', 'n', 'g', '_', 't', 'h', 'e', '_', 'd', 'i', 'm', 'e', 'n', 's', 'i', 'o', 'n', 's', '_', 'o', 'f', '_', 'm', 'y', '_', 'd', 'u', 'n', 'g', 'e', 'o', 'n', ';', '_', 'a', 

8. What character represents spaces when we tokenize by character? __YOUR ANSWER HERE__
9. Read the word2vec documentation. What do the following parameters signify?
    - embeddings_size: __YOUR ANSWER HERE__
    - window: __YOUR ANSWER HERE__
    - min_count: __YOUR ANSWER HERE__
    - sg: __YOUR ANSWER HERE__

In [11]:
# 10 points
# create your word embeddings
# use the skip gram algorithm and a window size of 5
# min_count should be 1
# takes ~3.3 sec on Felix's computer for character embeddings using skip-gram with window size 5
# takes ~3.3 sec on Felix's computer for word embeddings using skip-gram with window size 5

import numpy as np


def train_word2vec(data: list[list[str]], embeddings_size: int,
                    window: int = 5, min_count: int = 1, sg: int = 1) -> Word2Vec:
    """
    Create new word embeddings based on our data.

    Params:
        data: The corpus
        embeddings_size: The dimensions in each embedding

    Returns:
        A gensim Word2Vec model
        https://radimrehurek.com/gensim/models/word2vec.html
    """

    return Word2Vec(sentences=data, window=5, min_count=1)


# After you are happy with this function, copy + paste it into the bottom of 
# your neurallm_utils.py file
# You'll need it for the next task!
def create_embedder(raw_embeddings: Word2Vec) -> torch.nn.Embedding:
    """
    Create a PyTorch embedding layer based on our data.

    We will *first* train a Word2Vec model on our data.
    Then, we'll use these weights to create a PyTorch embedding layer.
        `nn.Embedding.from_pretrained(weights)`


    PyTorch docs: https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html#torch.nn.Embedding.from_pretrained
    Gensim Word2Vec docs: https://radimrehurek.com/gensim/models/word2vec.html

    Pay particular attention to the *types* of the weights and the types required by PyTorch.

    Params:
        data: The corpus
        embeddings_size: The dimensions in each embedding

    Returns:
        A PyTorch embedding layer
    """

    # Hint:
    # For later tasks, we'll need two mappings: One from token to index, and one from index to tokens.
    # It might be a good idea to store these as properties of your embedder.
    # e.g. `embedder.token_to_index = ...`
    
    word_vectors = raw_embeddings.wv
    vocab = word_vectors.index_to_key
    token_to_index = {word: idx for idx, word in enumerate(vocab)}
    index_to_token = {idx: word for word, idx in token_to_index.items()}
    weights = torch.FloatTensor(np.array([word_vectors[word] for word in vocab]))

    embedder = nn.Embedding.from_pretrained(weights)
    embedder.token_to_index = token_to_index
    embedder.index_to_token = index_to_token

    return embedder

In [12]:

# Create and save both sets (word and character based) of Word2Vec embeddings. 
# Use the provided utility functions in nutils.
# These will be (re)loaded in the next notebook.

embeddings_by_word = train_word2vec(data_by_word, EMBEDDINGS_SIZE)
embeddings_by_char = train_word2vec(data_by_char, EMBEDDINGS_SIZE)
nutils.save_word2vec(embeddings_by_word, EMBEDDING_SAVE_FILE_WORD)
nutils.save_word2vec(embeddings_by_char, EMBEDDING_SAVE_FILE_CHAR)


In [13]:
# load them in again to make sure that this works and is still fast

embeddings_by_word = nutils.load_word2vec(EMBEDDING_SAVE_FILE_WORD)
embeddings_by_char = nutils.load_word2vec(EMBEDDING_SAVE_FILE_CHAR)
print(embeddings_by_word)
print(embeddings_by_char)

Word2Vec<vocab=25374, vector_size=100, alpha=0.025>
Word2Vec<vocab=60, vector_size=100, alpha=0.025>


In [14]:
# now create the embedders

embedder_by_word = create_embedder(embeddings_by_word)
embedder_by_char = create_embedder(embeddings_by_char)


In [19]:
# take a look at your saved token to index and index to token mappings in your embedders to make sure they make sense
# AND that they are both dictionaries mapping from int to str or vice versa!
# don't leave a ton of output in your notebook when you turn it in, but you need to understand this,
# and it's an easy place to make a mistake that's hard to debug later.
# do leave whatever code you use here, comment it out if it produces a lot of output

def pretty_print(d: dict, n: int = None):
    if n is None:
        n = len(d)

    print("----------------------")
    for k, v in list(d.items())[:n]:
        print(f"{k}: {v}")
    print("----------------------")

pretty_print(embedder_by_word.token_to_index, n=10)
pretty_print(embedder_by_word.index_to_token, n=10)
pretty_print(embedder_by_char.token_to_index, n=10)
pretty_print(embedder_by_char.index_to_token, n=10)

----------------------
,: 0
the: 1
of: 2
<s>: 3
</s>: 4
.: 5
and: 6
to: 7
i: 8
a: 9
----------------------
----------------------
0: ,
1: the
2: of
3: <s>
4: </s>
5: .
6: and
7: to
8: i
9: a
----------------------
----------------------
_: 0
e: 1
t: 2
a: 3
o: 4
n: 5
i: 6
s: 7
h: 8
r: 9
----------------------
----------------------
0: _
1: e
2: t
3: a
4: o
5: n
6: i
7: s
8: h
9: r
----------------------


In [20]:
# 4 points
# print out the vocabulary size for your embeddings for both your word
# embeddings and your character embeddings
# label which is which when you print them out

print("Word Embeddings Vocabulary Size:", len(embedder_by_word.token_to_index))
print("Character Embeddings Vocabulary Size:", len(embedder_by_char.token_to_index))


Word Embeddings Vocabulary Size: 25374
Character Embeddings Vocabulary Size: 60
