# JM Chapter 7

1. Write out the computations of the remaining three input sets in the second paragraph of 7.2.1 in the reading. Make this a comment (or a markdown cell, etc. in a notebook) at the top of your script.

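[0, 1]: <br>
$ x_1 = 0; x_2 = 1 $ <br>
$ h_1 = \mathrm{ReLU}(x_1 + x_2) = \mathrm{ReLU}(0 + 1) = 1 $ <br>
$ h_2 = \mathrm{ReLU}(x_1 + x_2 - 1) = \mathrm{ReLU}(0 + 1 - 1) = 0 $ <br>
$ y_1 = \mathrm{ReLU}(h_1 - 2 h_2) = \mathrm{ReLU}(1 - 2 \cdot 0) = 1 $ <br>

[1, 0]: <br>
$ x_1 = 1; x_2 = 0 $ <br>
$ h_1 = \mathrm{ReLU}(x_1 + x_2) = \mathrm{ReLU}(1 + 0) = 1 $ <br>
$ h_2 = \mathrm{ReLU}(x_1 + x_2 - 1) = \mathrm{ReLU}(1 + 0 - 1) = 0 $ <br>
$ y_1 = \mathrm{ReLU}(h_1 - 2 h_2) = \mathrm{ReLU}(1 - 2 \cdot 0) = 1 $ <br>

[1, 1]: <br>
$ x_1 = 1; x_2 = 1 $ <br>
$ h_1 = \mathrm{ReLU}(x_1 + x_2) = \mathrm{ReLU}(1 + 1) = 2 $ <br>
$ h_2 = \mathrm{ReLU}(x_1 + x_2 - 1) = \mathrm{ReLU}(1 + 1 - 1) = 1 $ <br>
$ y_1 = \mathrm{ReLU}(h_1 - 2 h_2) = \mathrm{ReLU}(2 - 2 \cdot 1) = 0 $ <br>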
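
The hand computations above can be checked with a few lines of Python. This is only a sketch: the function names `relu` and `xor_network` are made up for illustration, and it assumes every unit uses a ReLU activation, as in the reading's XOR figure. ReLU only changes a value for the input [0, 0], where $x_1 + x_2 - 1 = -1$ is clipped to 0.

```python
def relu(z):
    """Rectified linear unit: max(z, 0)."""
    return max(z, 0)

def xor_network(x1, x2):
    """Forward pass of the two-layer XOR network with the weights used above."""
    h1 = relu(x1 + x2)        # weights (1, 1), bias 0
    h2 = relu(x1 + x2 - 1)    # weights (1, 1), bias -1
    y1 = relu(h1 - 2 * h2)    # weights (1, -2), bias 0
    return h1, h2, y1

# Print (h1, h2, y1) for all four input pairs.
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), xor_network(x1, x2))
```

The last element of each printed tuple is the network output, which should match XOR of the two inputs.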


2. Read both of the following tutorials, which implement the language model introduced in 7.5 using two different deep learning frameworks: PyTorch and TensorFlow. Implement at least one of them in your homework. Do your best to understand what the code is doing.

In [1]:
import nltk
import csv
from nltk.corpus import brown
from nltk.corpus import wordnet

nltk.download("brown")
nltk.download("wordnet")

len(brown.paras())

[nltk_data] Downloading package brown to /home/alenning/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package wordnet to /home/alenning/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


15667

In [2]:
num_train = 12000
UNK_symbol = "<UNK>"
vocab = set([UNK_symbol])

# create brown corpus again with all words
# no preprocessing, only lowercase
brown_corpus_train = []
for idx,paragraph in enumerate(brown.paras()):
    if idx == num_train:
        break
    words = []
    for sentence in paragraph:
        for word in sentence:
            words.append(word.lower())
    brown_corpus_train.append(words)

# create term frequency of the words
words_term_frequency_train = {}
for doc in brown_corpus_train:
    for word in doc:
        # this will calculate term frequency
        # since we are taking all words now
        words_term_frequency_train[word] = words_term_frequency_train.get(word,0) + 1

# create vocabulary
for doc in brown_corpus_train:
    for word in doc:
        if words_term_frequency_train.get(word,0) >= 5:
            vocab.add(word)

print(len(vocab))

12681


In [3]:
import numpy as np
# create required lists
x_train = []
y_train = []
x_dev = []
y_dev = []

# create word to id mappings
# sort the vocabulary so each word gets the same id on every run
word_to_id_mappings = {}
for idx,word in enumerate(sorted(vocab)):
    word_to_id_mappings[word] = idx

# function to get id for a given word
# return <UNK> id if not found
def get_id_of_word(word):
    unknown_word_id = word_to_id_mappings[UNK_symbol]
    return word_to_id_mappings.get(word,unknown_word_id)

# creating training and dev set
for idx,paragraph in enumerate(brown.paras()):
    for sentence in paragraph:
        for i,word in enumerate(sentence):
            if i+2 >= len(sentence):
                # sentence boundary reached
                # ignoring sentence less than 3 words
                break
            # convert word to id
            x_extract = [get_id_of_word(word.lower()),get_id_of_word(sentence[i+1].lower())]
            y_extract = [get_id_of_word(sentence[i+2].lower())]
            if idx < num_train:
                x_train.append(x_extract)
                y_train.append(y_extract)
            else:
                x_dev.append(x_extract)
                y_dev.append(y_extract)

# making numpy arrays
x_train = np.array(x_train)
y_train = np.array(y_train)
x_dev = np.array(x_dev)
y_dev = np.array(y_dev)  
  
print(x_train.shape)
print(y_train.shape)
print(x_dev.shape)
print(y_dev.shape)

(872823, 2)
(872823, 1)
(174016, 2)
(174016, 1)
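
To make the extraction loop concrete, here is a small sketch of how one sentence turns into (context, target) pairs. The toy vocabulary, the sentence, and the helper names `to_id` and `trigram_examples` are hypothetical. The logic mirrors the cell above: two consecutive words form the input, the third is the label, and out-of-vocabulary words map to the `<UNK>` id.

```python
UNK = "<UNK>"
toy_vocab = [UNK, "the", "dog", "barked", "at"]   # hypothetical vocabulary
toy_word_to_id = {w: i for i, w in enumerate(toy_vocab)}

def to_id(word):
    """Map a word to its id, falling back to the <UNK> id."""
    return toy_word_to_id.get(word.lower(), toy_word_to_id[UNK])

def trigram_examples(sentence):
    """Return (contexts, targets): each pair of adjacent words predicts the next one."""
    contexts, targets = [], []
    # Sentences shorter than 3 words produce no examples.
    for i in range(len(sentence) - 2):
        contexts.append([to_id(sentence[i]), to_id(sentence[i + 1])])
        targets.append([to_id(sentence[i + 2])])
    return contexts, targets

x, y = trigram_examples(["The", "dog", "barked", "at", "strangers"])
print(x)  # [[1, 2], [2, 3], [3, 4]]
print(y)  # [[3], [4], [0]]
```

A five-word sentence yields three examples, and the unseen word "strangers" becomes id 0 (`<UNK>`).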


In [4]:
import torch
import multiprocessing
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
import time

# Trigram Neural Network Model
class TrigramNNmodel(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size, h):
        super(TrigramNNmodel, self).__init__()
        self.context_size = context_size
        self.embedding_dim = embedding_dim
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, h)
        self.linear2 = nn.Linear(h, vocab_size, bias = False)

    def forward(self, inputs):
        # compute x': concatenation of x1 and x2 embeddings
        embeds = self.embeddings(inputs).view((-1,self.context_size * self.embedding_dim))
        # compute h: tanh(W_1.x' + b)
        out = torch.tanh(self.linear1(embeds))
        # compute W_2.h
        out = self.linear2(out)
        # compute y: log_softmax(W_2.h)
        log_probs = F.log_softmax(out, dim=1)
        # return log probabilities
        # BATCH_SIZE x len(vocab)
        return log_probs
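
The model returns log-probabilities, which the training cell later pairs with `nn.NLLLoss`. As a rough illustration of what those two pieces compute for a single example, here is a plain-Python sketch. The functions `log_softmax` and `nll` and the scores are hypothetical stand-ins, not the PyTorch implementations.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax: subtract the max before exponentiating."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

def nll(log_probs, target):
    """Negative log-likelihood: minus the log-probability of the target index."""
    return -log_probs[target]

scores = [2.0, 0.5, -1.0]   # hypothetical output scores for a 3-word vocabulary
log_probs = log_softmax(scores)

# Exponentiating the log-probabilities recovers a distribution that sums to 1.
print(sum(math.exp(lp) for lp in log_probs))
# The loss is small when the target has the highest score, larger otherwise.
print(nll(log_probs, target=0), nll(log_probs, target=2))
```

Over a batch, the loss would be the average of these per-example values, assuming the default mean reduction.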

In [5]:
# create parameters
gpu = 0 
# word vectors size
EMBEDDING_DIM = 200
CONTEXT_SIZE = 2
BATCH_SIZE = 256
# hidden units
H = 100
torch.manual_seed(13013)

# check if gpu is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)
available_workers = multiprocessing.cpu_count()

print("--- Creating training and dev dataloaders with {} batch size ---".format(BATCH_SIZE))
train_set = np.concatenate((x_train, y_train), axis=1)
dev_set = np.concatenate((x_dev, y_dev), axis=1)
train_loader = DataLoader(train_set, batch_size = BATCH_SIZE, num_workers = available_workers)
dev_loader = DataLoader(dev_set, batch_size = BATCH_SIZE, num_workers = available_workers)

cpu
--- Creating training and dev dataloaders with 256 batch size ---


In [9]:
# helper function to get accuracy from log probabilities
def get_accuracy_from_log_probs(log_probs, labels):
    # argmax over log-probabilities picks the same label as argmax over probabilities
    predicted_label = torch.argmax(log_probs, dim=1)
    acc = (predicted_label == labels).float().mean()
    return acc

# helper function to evaluate model on dev data
def evaluate(model, criterion, dataloader, gpu):
    model.eval()

    mean_acc, mean_loss = 0, 0
    count = 0

    with torch.no_grad():
        dev_st = time.time()
        for it, data_tensor in enumerate(dataloader):
            context_tensor = data_tensor[:,0:2]
            target_tensor = data_tensor[:,2]
            # no device transfer here: the model runs on the CPU in this notebook
            log_probs = model(context_tensor)
            mean_loss += criterion(log_probs, target_tensor).item()
            mean_acc += get_accuracy_from_log_probs(log_probs, target_tensor)
            count += 1
            if it % 500 == 0: 
                print("Dev Iteration {} complete. Mean Loss: {}; Mean Acc:{}; Time taken (s): {}".format(it, mean_loss / count, mean_acc / count, (time.time()-dev_st)))
                dev_st = time.time()

    return mean_acc / count, mean_loss / count


In [10]:
# Using negative log-likelihood loss
loss_function = nn.NLLLoss()

# create model
model = TrigramNNmodel(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE, H)

# # load it to gpu
# model.cuda(gpu)

# using ADAM optimizer
optimizer = optim.Adam(model.parameters(), lr = 2e-3)


# ------------------------- TRAIN & SAVE MODEL ------------------------
best_acc = 0
best_model_path = None
for epoch in range(5):
    # evaluate() switches the model to eval mode, so switch back before training
    model.train()
    st = time.time()
    print("\n--- Training model Epoch: {} ---".format(epoch+1))
    for it, data_tensor in enumerate(train_loader):       
        context_tensor = data_tensor[:,0:2]
        target_tensor = data_tensor[:,2]

        # no device transfer here: the model runs on the CPU in this notebook

        # zero out the gradients from the old instance
        model.zero_grad()

        # get log probabilities over next words
        log_probs = model(context_tensor)

        # calculate current accuracy
        acc = get_accuracy_from_log_probs(log_probs, target_tensor)

        # compute loss function
        loss = loss_function(log_probs, target_tensor)

        # backward pass and update gradient
        loss.backward()
        optimizer.step()

        if it % 500 == 0: 
            print("Training Iteration {} of epoch {} complete. Loss: {}; Acc:{}; Time taken (s): {}".format(it, epoch, loss.item(), acc, (time.time()-st)))
            st = time.time()

    print("\n--- Evaluating model on dev data ---")
    dev_acc, dev_loss = evaluate(model, loss_function, dev_loader, gpu)
    print("Epoch {} complete! Development Accuracy: {}; Development Loss: {}".format(epoch, dev_acc, dev_loss))
    if dev_acc > best_acc:
        print("Best development accuracy improved from {} to {}, saving model...".format(best_acc, dev_acc))
        best_acc = dev_acc
        # set best model path
        best_model_path = 'best_model_{}.dat'.format(epoch)
        # saving best model
        torch.save(model.state_dict(), best_model_path)



--- Training model Epoch: 1 ---
Training Iteration 0 of epoch 0 complete. Loss: 9.500727653503418; Acc:0.0; Time taken (s): 2.103275775909424
Training Iteration 500 of epoch 0 complete. Loss: 6.266800403594971; Acc:0.1484375; Time taken (s): 36.90906620025635
Training Iteration 1000 of epoch 0 complete. Loss: 6.116858959197998; Acc:0.14453125; Time taken (s): 34.95036220550537
Training Iteration 1500 of epoch 0 complete. Loss: 6.026060104370117; Acc:0.1328125; Time taken (s): 36.72751474380493
Training Iteration 2000 of epoch 0 complete. Loss: 5.957391738891602; Acc:0.10546875; Time taken (s): 37.28599524497986
Training Iteration 2500 of epoch 0 complete. Loss: 6.228488922119141; Acc:0.1484375; Time taken (s): 35.7860791683197
Training Iteration 3000 of epoch 0 complete. Loss: 5.779256820678711; Acc:0.19921875; Time taken (s): 37.785094022750854

--- Evaluating model on dev data ---
Dev Iteration 0 complete. Mean Loss: 4.968530178070068; Mean Acc:0.19140625; Time taken (s): 1.66219449
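
One way to read the logged losses is to compare them with a uniform baseline and to convert them to perplexity. The sketch below assumes the loss is the mean natural-log NLL per predicted word; the helper name `perplexity` is made up here. The iteration-0 loss of about 9.50 sits close to $\ln(12681)$, which is presumably because freshly initialized weights give nearly uniform predictions.

```python
import math

vocab_size = 12681                    # vocabulary size printed earlier in the notebook
uniform_loss = math.log(vocab_size)   # NLL when every word gets probability 1/V

def perplexity(mean_nll):
    """Perplexity is the exponential of the mean per-word NLL (natural log)."""
    return math.exp(mean_nll)

print(round(uniform_loss, 2))         # 9.45

# Losses copied from the training log: iteration 0 and iteration 3000 of the first epoch.
for loss in (9.500727653503418, 5.779256820678711):
    print(loss, perplexity(loss))
```

A perplexity well below the vocabulary size means the model is doing better than guessing uniformly.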

In [11]:
# ---------------------- Loading Best Model -------------------
best_model = TrigramNNmodel(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE, H)
best_model.load_state_dict(torch.load(best_model_path))
# best_model.cuda(gpu)

cos = nn.CosineSimilarity(dim=0)

lm_similarities = {}

# word pairs to calculate similarity
words = {('computer','keyboard'),('cat','dog'),('dog','car'),('keyboard','cat')}

# ----------- Calculate LM similarities using cosine similarity ----------
for word_pairs in words:
    w1 = word_pairs[0]
    w2 = word_pairs[1]
    words_tensor = torch.LongTensor([get_id_of_word(w1),get_id_of_word(w2)])
    # words_tensor = words_tensor.cuda(gpu)
    # get word embeddings from the best model
    words_embeds = best_model.embeddings(words_tensor)
    # calculate cosine similarity between word vectors
    sim = cos(words_embeds[0],words_embeds[1])
    lm_similarities[word_pairs] = sim.item()

print(lm_similarities)


{('computer', 'keyboard'): -0.08803991973400116, ('keyboard', 'cat'): 0.12031812220811844, ('dog', 'car'): 0.06517498195171356, ('cat', 'dog'): 0.1584472954273224}
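
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it ranges from -1 to 1. Here is a minimal plain-Python sketch; the function name and the example vectors are hypothetical, and unlike `nn.CosineSimilarity` it has no small epsilon guarding against zero-length vectors.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))    # same direction: close to 1
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal: 0.0
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction: close to -1
```

Read this way, the values in the output above are all fairly close to 0, meaning the learned embeddings for these word pairs point in nearly unrelated directions.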


3. Write at least one paragraph about what you learned and one paragraph about what questions you have. (also as comment/markdown cell)

I actually learned how neural networks work. Before, I had heard of many of these elements of a learning problem, but had no idea what any of them meant or how to make use of them. I learned what an activation function is, as well as how backpropagation works. It was really eye-opening to see that backpropagation is just an application of the chain rule. It was also extremely useful to walk through the neural net at the beginning of this homework assignment. It really helps to see a neural net as a set of nested functions, each one a linear combination passed through an activation function.
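
To make the chain-rule point concrete, here is a small sketch for a single tanh unit with a squared-error loss. Everything in it (the functions `forward` and `backward`, the numbers) is made up for illustration. It computes the gradients by multiplying local derivatives and then checks them against a finite-difference estimate.

```python
import math

def forward(w, b, x, y):
    """One tanh unit with squared-error loss: returns z, h, and the loss."""
    z = w * x + b
    h = math.tanh(z)
    loss = (h - y) ** 2
    return z, h, loss

def backward(w, b, x, y):
    """Gradients of the loss with respect to w and b via the chain rule."""
    _, h, _ = forward(w, b, x, y)
    dloss_dh = 2 * (h - y)         # derivative of (h - y)^2 with respect to h
    dh_dz = 1 - h ** 2             # derivative of tanh(z) with respect to z
    dloss_dz = dloss_dh * dh_dz    # chain rule
    return dloss_dz * x, dloss_dz  # dz/dw = x, dz/db = 1

w, b, x, y = 0.5, -0.1, 2.0, 1.0   # hypothetical values
grad_w, grad_b = backward(w, b, x, y)

# Central finite differences as an independent check.
eps = 1e-6
numeric_w = (forward(w + eps, b, x, y)[2] - forward(w - eps, b, x, y)[2]) / (2 * eps)
numeric_b = (forward(w, b + eps, x, y)[2] - forward(w, b - eps, x, y)[2]) / (2 * eps)

print(grad_w, numeric_w)
print(grad_b, numeric_b)
```

The two numbers on each line should agree to several decimal places.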

My biggest questions are about hyperparameters and initial weights. I have no idea how to choose or optimize hyperparameters. Along those lines, I am curious how to pick the initial weights, although I guess that with a loss function and backpropagation it doesn't really matter where you start. My last question is how to deploy models, though this is more of a Machine Learning Operations question: how do you deploy or load pre-trained models so that you don't have to spend an eternity training these neural networks? Honestly, it all makes a lot of sense; I'm just curious how someone thought of it. I guess I'm also curious how to tell whether a model is overfitting or underfitting.