# Project 3 (Part 1): Language identification with Neural Networks
## CS4740/5740 Fall 2021

Names: Lusca Robinson, Kyrus Mama

Netids: nar73, krm74

### Project Submission Due: Friday Nov 5 11:59pm
Please submit a **pdf file** of this notebook on **Gradescope**, and your **ipynb** on **CMS**. For instructions on generating pdf and ipynb files, please refer to project 1 or project 2 instructions.



## Introduction
In this two-part project we will consider **neural networks**. In part 1 you will study a Feedforward Neural Network (FFNN) for performing language identification (i.e., determining whether a piece of text is Shakespearean English or modern English).

You will be given an implementation for a FFNN and be asked to debug it in a specific way. You also will be required to submit a description of any additional libraries used (other than those that we import for you), how your group divided up the work, and your feedback regarding the assignment (the latter with **Part 2** of project 3). The google doc template for the write up is [here](https://docs.google.com/document/d/16scmrRva5WczB4a4QDYhFJBXaaAJd8oesEEAdi6ehbY/edit).

## Advice 🚀
As always, the report is important! The report is where you get to show that you understand not only what you are doing but also why and how you are doing it. So be clear, organized and concise; avoid vagueness and excess verbiage.

## Dataset
You are given access to a set of parallel sentences. One sentence is written in modern English (the "source") and another is in Shakespearean English (the "target"). For this project, given modern English you will need to translate it into Shakespearean English. This is usually called (Neural) Machine Translation. We'll simply refer to it as NMT or Neural Machine Translation in the project.

We will minimally preprocess the source/target sentences and handle tokenization in what we release. For this assignment, we do not anticipate any further preprocessing to be done by you. Should you choose to do so, it would be interesting to hear about in the report (along with whether or not it helped performance), but it is not a required aspect of the assignment.

In [None]:
from google.colab import drive
import os
drive.mount('/content/drive', force_remount=True)

#path = os.path.join(os.getcwd(), "drive", "MyDrive", "CS 4740", "Project 3", "Dataset")
train_path = os.path.join(os.getcwd(), "drive", "MyDrive", "CS 4740", "Project 3", "Dataset", "train.txt") # replace based on your Google drive organization
val_path = os.path.join(os.getcwd(), "drive", "MyDrive", "CS 4740", "Project 3", "Dataset", "val.txt") # replace based on your Google drive organization

Mounted at /content/drive


# Part 1: Feedforward Neural Network

In this section, there are two main coding components relevant to **Part 1**.

1. `Data loader`\
As the name suggests, this section loads the data from the dataset files and handles other preprocessing and setup. You will **not** need to change this file and should **not** change this file throughout the assignment.

2. `ffnn`\
This contains the model and code that uses the model for **Part 1**.

In the `ffnn` section, you will find a Feedforward Neural Net serving as the underlying model for performing language detection.



## Part 1: Tips

We do not assume you have **any** experience working with neural networks and/or debugging them. You may discover debugging neural network code is quite different from debugging in general software engineering and from debugging in other areas of CS such as algorithms and systems.

We suggest you systematically step through the code and simultaneously (perhaps by physically drawing out the vectors, matrices, and tensors involved) describe what the computations _mean_. What you are looking for is where the code differs from what you would expect a neural net to do.
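
One way to "draw out" the tensors is to follow their shapes on paper or in a few lines of code. The sketch below is only an illustration: `trace_shapes` is a hypothetical helper (not part of the released code), and the sizes used are made-up values rather than the real vocabulary size.

```python
from typing import List, Tuple


def trace_shapes(input_shape: Tuple[int, int],
                 layers: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Follow a (batch, features) shape through a stack of linear layers.

    Each layer is given as (in_features, out_features). A ValueError is raised
    at the first layer whose in_features does not match the incoming width.
    """
    shapes = [input_shape]
    batch, width = input_shape
    for position, (in_features, out_features) in enumerate(layers):
        if in_features != width:
            raise ValueError(
                f"layer {position} expects width {in_features}, got {width}"
            )
        width = out_features
        shapes.append((batch, width))
    return shapes


# Assumed sizes for illustration: 1000 vocabulary words, hidden size 512, 2 languages.
shapes = trace_shapes((1, 1000), [(1000, 512), (512, 2)])
print(shapes)  # [(1, 1000), (1, 512), (1, 2)]
```

If a layer's input width does not match what the previous layer produces, the helper raises an error at that layer, which is the kind of mismatch that stepping through the code by hand is meant to surface.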

## Part 1: Rules

For **Part 1**, you will not be able to ask any questions on EdStem and we will be unable to provide any meaningful advice in office hours. Unfortunately, this is the nature of debugging: it is unlikely anyone can give you specific advice for most problems you encounter, and we have already provided general tips in the preceding section. If you absolutely must ask a question or you believe there is some kind of issue with the assignment for this part, please submit a private EdStem post and we will respond swiftly.

As a reminder, **communication about the assignment _between_ distinct groups is not permitted and is a violation of the Academic Integrity policy**. For this assignment, we will be _extremely_ stringent about this, given that debugging is entirely pointless if someone else in a different group tells you where the error is.

## Import libraries and connect to Google Drive

In [None]:
import json
import math
import os
from pathlib import Path
import random
import time
from tqdm.notebook import tqdm, trange
from typing import Dict, List, Set, Tuple

import numpy as np
import nltk
import torch
import torch.nn as nn
from torch.nn import init
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, SubsetRandomSampler

In [None]:
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Data loader

In [None]:
language_to_idx = {
    "Modern": 0,
    "Shakespeare": 1,
}
idx_to_language = {v: k for k, v in language_to_idx.items()}
UNK = "<UNK>"

In [None]:
def fetch_data(train_data_path, val_data_path):
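    """fetch_data reads the training and validation files and returns their
    tokenized contents.

    Each non-empty line is expected to hold a text and its label separated by
    a tab.

    :param train_data_path: Path to the training data file
    :type train_data_path: str
    :param val_data_path: Path to the validation data file
    :type val_data_path: str
    :return: Training, validation pair where each is a list of (document, label) pairs;
        labels are returned as the strings read from the file
    :rtype: Tuple[
        List[Tuple[List[str], str]],
        List[Tuple[List[str], str]],
    ]
    """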
    with open(train_data_path) as training_f:
        training = training_f.read().split("\n")
    with open(val_data_path) as valid_f:
        validation = valid_f.read().split("\n")
	
    # If needed, you can shrink the training and validation data to speed things up
    # by setting k < 10000, but this isn't always safe to do.
    # k = #fill in
    # random.shuffle(training)  # shuffles in place and returns None
    # random.shuffle(validation)
    # training, validation = training[:k], validation[:(k // 10)]

    tra = []
    val = []
    for elt in training:
        if elt == '':
            continue
        txt, language = elt.split("\t")
        tra.append((nltk.word_tokenize(txt), language))
    for elt in validation:
        if elt == '':
            continue
        txt, language = elt.split("\t")
        val.append((nltk.word_tokenize(txt), language))

    return tra, val

In [None]:
def make_vocab(data):
    """make_vocab creates a set of vocab words that the model knows

    :param data: The list of (document, label) pairs used to make the vocabulary
    :type data: List[Tuple[List[str], str]]
    :returns: A set of strings corresponding to the vocabulary
    :rtype: Set[str]
    """
    vocab = set()
    for document, _ in data:
        for word in document:
            vocab.add(word)
    return vocab 


def make_indices(vocab):
	"""make_indices creates a 1-1 mapping of word and indices for a vocab.

	:param vocab: The strings corresponding to the vocabulary in train data.
	:type vocab: Set[str]
	:returns: A tuple containing the vocab, word2index, and index2word.
		vocab is a set of strings in the vocabulary including <UNK>.
		word2index is a dictionary mapping tokens to its index (0, ..., V-1)
		index2word is a dictionary inverting the mapping of word2index
	:rtype: Tuple[
		Set[str],
		Dict[str, int],
		Dict[int, str],
	]
	"""
	vocab_list = sorted(vocab)
	vocab_list.append(UNK)
	word2index = {}
	index2word = {}
	for index, word in enumerate(vocab_list):
		word2index[word] = index 
		index2word[index] = word 
	vocab.add(UNK)
	return vocab, word2index, index2word 


def convert_to_vector_representation(data, word2index, test=False):
	"""convert_to_vector_representation converts the list of strings into a vector

	:param data: The dataset to be converted into a vectorized format
	:type data: Union[
		List[Tuple[List[str], int]],
		List[str],
	]
	:param word2index: A mapping of word to index
	:type word2index: Dict[str, int]
	:returns: A list of vector representations of the input or pairs of vector
		representations with expected output
	:rtype: List[Tuple[torch.Tensor, int]] or List[torch.Tensor]

	"""
	if test:
		vectorized_data = []
		for document in data:
			vector = torch.zeros(len(word2index)) 
			for word in document:
				index = word2index.get(word, word2index[UNK])
				vector[index] += 1
			vectorized_data.append(vector)
	else:
		vectorized_data = []
		for document, y in data:
			vector = torch.zeros(len(word2index)) 
			for word in document:
				index = word2index.get(word, word2index[UNK])
				vector[index] += 1
			vectorized_data.append((vector, y))
	return vectorized_data
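
The vectors built by `convert_to_vector_representation` are bag-of-words counts: one slot per vocabulary entry, with unknown words counted in the `<UNK>` slot. The sketch below reproduces that idea with plain Python lists on a made-up three-word vocabulary; `count_vector`, `toy_vocab`, and `toy_word2index` are hypothetical names used only for illustration.

```python
from typing import Dict, List

UNK = "<UNK>"  # same placeholder token as in the data loader


def count_vector(tokens: List[str], word2index: Dict[str, int]) -> List[int]:
    """Count how often each vocabulary entry occurs in a tokenized document."""
    vector = [0] * len(word2index)
    for token in tokens:
        # Words outside the vocabulary fall back to the <UNK> slot.
        index = word2index.get(token, word2index[UNK])
        vector[index] += 1
    return vector


# Toy vocabulary, sorted and with <UNK> appended last, as make_indices does.
toy_vocab = sorted({"art", "thou", "you"}) + [UNK]
toy_word2index = {word: i for i, word in enumerate(toy_vocab)}

toy_vector = count_vector(["thou", "art", "thou", "wherefore"], toy_word2index)
print(toy_vector)  # [1, 2, 0, 1]
```

Here "thou" appears twice, "art" once, "you" never, and the out-of-vocabulary "wherefore" lands in the final `<UNK>` position.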

In [None]:
class LanguageDataset(Dataset):
    """LanguageDataset is a torch dataset to interact with the Language data.

    :param data: The vectorized dataset with input and expected output values
    :type data: List[Tuple[List[torch.Tensor], int]]
    """
    def __init__(self, data):
        self.data = data
        self.len = len(data)
    
    def __len__(self):
        """__len__ returns the number of samples in the dataset.

        :returns: number of samples in dataset
        :rtype: int
        """
        return self.len
    
    def __getitem__(self, index):
        """__getitem__ returns the tensor, output pair for a given index

        :param index: index within dataset to return
        :type index: int
        :returns: A tuple (x, y) where x is model input and y is our label
        :rtype: Tuple[torch.Tensor, int]
        """
        return convert_to_vector_representation(self.data[index:index+1], word2index)[0]

def get_data_loaders(train, val, batch_size=16):
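    """get_data_loaders builds the training and validation DataLoaders.

    :param train: Training (document, label) pairs
    :param val: Validation (document, label) pairs
    :param batch_size: Number of samples per batch
    :returns: A (train_loader, val_loader) pair
    """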
    # First we create the dataset given our train and validation lists
    dataset = LanguageDataset(train + val)

    # Then, we create a list of indices for all samples in the dataset
    train_indices = [i for i in range(len(train))]
    val_indices = [i for i in range(len(train), len(train) + len(val))]

    # Now we define samplers and loaders for train and val
    train_sampler = SubsetRandomSampler(train_indices)
    train_loader = DataLoader(dataset, batch_size=batch_size, sampler=train_sampler)
    
    val_sampler = SubsetRandomSampler(val_indices)
    val_loader = DataLoader(dataset, batch_size=batch_size, sampler=val_sampler)

    return train_loader, val_loader
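
Because `get_data_loaders` stores the training and validation examples in one concatenated dataset, the two samplers are kept apart purely by their index ranges. A minimal sketch of that bookkeeping, using a hypothetical helper `split_indices` and made-up sizes:

```python
from typing import List, Tuple


def split_indices(n_train: int, n_val: int) -> Tuple[List[int], List[int]]:
    """Return index lists for a dataset laid out as train followed by val."""
    train_indices = list(range(n_train))
    val_indices = list(range(n_train, n_train + n_val))
    return train_indices, val_indices


# Assumed sizes for illustration: 5 training and 2 validation examples.
train_idx, val_idx = split_indices(5, 2)
print(train_idx)  # [0, 1, 2, 3, 4]
print(val_idx)    # [5, 6]
```

The two ranges never overlap, so a sampler restricted to one list cannot draw an example that belongs to the other split.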

In [None]:
train, val = fetch_data(train_path, val_path)

In [None]:
vocab = make_vocab(train)
vocab, word2index, index2word = make_indices(vocab)

In [None]:
train_loader, val_loader = get_data_loaders(train, val, batch_size=1)

In [None]:
# Note: Colab has 12 hour limits on GPUs, also potential inactivity may kill the notebook. Save often!

## 1.1 FFNN Implementation

### 1.1 Task
Assume that an omniscient oracle has told you there are **4 fundamental errors** in the **FFNN** implementation. They may be anywhere in the section below (everything below until the start of section 1.2) unless otherwise indicated. Your objective is to _find_ and _fix_ each of these errors and to include in the report a description of the original error along with the fix. To help your efforts, the oracle has provided you with additional information about the properties of the errors as follows:

* _Correctness_ \
Each error causes the code to be strictly incorrect. There is absolutely no ambiguity that the errant code (or missing code) is incorrect. This means errors are not due to the code being inefficient (in run-time or in memory).

* _Localized_ \
Each error can be judged to be erroneous by strictly looking at the code (along with your knowledge of machine learning as taught through this course). The errors therefore are not due to the model being uncompetitive in terms of performance with state-of-the-art performance for this task nor are they due to the amount of data being insufficient for this task in general.

* _General_ \
Each error is general in nature. They will not be triggered only by the model receiving a pathological input, i.e. they will not be something that is triggered specifically when inputting some phrase such as "roman fool" or word such as "sword".

* _Fundamental_ \
Each error is a fundamental failure in terms of doing what is intended. This means that errors do not hinge on nuanced understanding of specific PyTorch functionality. This also means they will not exploit properties of the dataset in a subtle way that could only be realized by someone who has comprehensively studied the data.

The bottom line: the errors should be fairly obvious. The oracle further reminds you that performance/accuracy of the (resulting) model should not be how you ensure you have debugged successfully. For example, if you correct some, but not all, of the errors, the remaining errors may mask the impact of your fixes. Further, performance is not guaranteed to improve by fixing any particular error. Consider the case where the training set is also employed as the test set; performance will be very high but there is something very wrong. And fixing the problem will reduce performance.

In fixing each error, the oracle provides some further insight about the fixes:

* _Minimal_ \
A reasonable fix for each error can be achieved by changing fewer than 5 lines of code. We do not require you to make fixes of 4 or fewer lines, but it should be a cause for concern if your fixes are far more elaborate.

* _Ill-posed_ \
While the errors are unambiguous, the method for fixing them is under-specified: You are free to implement any reasonable fix and all such fixes will equally receive full credit.
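
The warning above about accuracy as a debugging signal can be made concrete with a toy example. The sketch below is not the FFNN: it is a hypothetical lookup-table "classifier" that simply memorizes the sentences it has seen, with invented sentences and labels (0 for modern, 1 for Shakespearean). It scores perfectly when evaluated on its own training data and poorly on held-out data.

```python
from typing import Dict, List, Tuple

Example = Tuple[str, int]


def fit_memorizer(examples: List[Example]) -> Dict[str, int]:
    """'Train' by storing every (text, label) pair verbatim."""
    return dict(examples)


def accuracy(table: Dict[str, int], examples: List[Example], default: int = 0) -> float:
    """Fraction of examples whose stored (or default) label matches the gold label."""
    correct = sum(1 for text, label in examples if table.get(text, default) == label)
    return correct / len(examples)


# Invented data for illustration only.
seen = [("thou art kind", 1), ("you are kind", 0), ("hast thou eaten", 1)]
unseen = [("dost thou jest", 1), ("where art thou", 1)]

table = fit_memorizer(seen)
print(accuracy(table, seen))    # 1.0
print(accuracy(table, unseen))  # 0.0
```

Evaluating on the data used for training reports a perfect score even though the model has learned nothing general, so a high number alone says little about whether the code is correct.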

In [None]:
# Lambda to switch to GPU if available
get_device = lambda : "cuda:0" if torch.cuda.is_available() else "cpu"

In [None]:
unk = '<UNK>'

# Consult the PyTorch documentation for information on the functions used below:
# https://pytorch.org/docs/stable/torch.html

class FFNN(nn.Module):
	def __init__(self, input_dim, h, output_dim):
		super(FFNN, self).__init__()
		self.h = h
		self.W1 = nn.Linear(input_dim, h)
		self.activation = nn.ReLU() # The rectified linear unit; one valid choice of activation function
		# ERROR 1 on line below: error in size of dimensions, should be h x output dim but is h x h
		self.W2 = nn.Linear(h, output_dim) 
		# The two lines below are not a source of error
		self.softmax = nn.LogSoftmax(dim=1) # The softmax function that converts vectors into probability distributions; computes log probabilities for computational benefits
		self.loss = nn.NLLLoss() # The cross-entropy/negative log likelihood loss taught in class

	def compute_Loss(self, predicted_vector, gold_label):
		return self.loss(predicted_vector, gold_label)

	def forward(self, input_vector):
		# The z_i are just there to record intermediary computations for your clarity
		z1 = self.W1(input_vector)
		# ERROR 2 on line below: Fixed the error of no activation function on the first layer of the nn
		z2 = self.activation(z1) 
		z3 = self.W2(z2)
		predicted_vector = self.softmax(z3)
		return predicted_vector
	
	def load_model(self, save_path):
		self.load_state_dict(torch.load(save_path))
	
	def save_model(self, save_path):
		torch.save(self.state_dict(), save_path)


def train_epoch(model, train_loader, optimizer):
	model.train()
	total = 0
	loss = 0
	correct = 0
	for (input_batch, expected_out) in tqdm(train_loader, leave=False, desc="Training Batches"):
		expected_out = torch.tensor([int(i) for i in expected_out])
		output = model(input_batch.to(get_device()))
		total += output.size()[0]
		_, predicted = torch.max(output, 1)
		correct += (expected_out == predicted.to("cpu")).cpu().numpy().sum()

		# ERROR 3 on line below: We need to zero out the gradients before backpropagating so that the gradients for EACH batch are accurate
		optimizer.zero_grad()  
		loss = model.compute_Loss(output, expected_out.to(get_device()))
		loss.backward()
		optimizer.step()
	# TODO: You'll want to print accuracy here
	acc = correct/total
	print("Train accuracy: {:.4f}.\tLoss: {:.4f}".format(acc, loss.item()))
 
	return


def evaluation(model, val_loader, optimizer):
	model.eval()
	loss = 0
	correct = 0
	total = 0
	for (input_batch, expected_out) in tqdm(val_loader, leave=False, desc="Validation Batches"):
		expected_out = torch.tensor([int(i) for i in expected_out])
		output = model(input_batch.to(get_device()))
		total += output.size()[0]
		_, predicted = torch.max(output, 1)
		correct += (expected_out.to("cpu") == predicted.to("cpu")).cpu().numpy().sum()

		loss += model.compute_Loss(output, expected_out.to(get_device()))
	loss /= len(val_loader)
	# TODO: You'll want to print some validation metrics here
	acc = correct/total
	print("Validation accuracy: {:.4f}.\tLoss: {:.4f}".format(acc, loss.item()))
 
	pass

def train_and_evaluate(number_of_epochs, model, train_loader, val_loader):
	optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
	for epoch in trange(number_of_epochs, desc="Epochs"):
		# ERROR 4 on line below: We need to train on the training set, not the validation set
		train_epoch(model, train_loader, optimizer) 
		with torch.no_grad():
			evaluation(model, val_loader, optimizer)
	return
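
The reason `optimizer.zero_grad()` matters in `train_epoch` is that PyTorch adds each new gradient to whatever is already stored in `.grad`. The sketch below shows this on a tiny stand-alone layer; the layer, input, and learning rate are assumptions chosen for illustration and are unrelated to the FFNN's actual sizes.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
layer = nn.Linear(3, 1)  # tiny illustrative layer, not the FFNN
optimizer = optim.SGD(layer.parameters(), lr=0.1)
x = torch.ones(1, 3)

# First backward pass: the gradient of the output w.r.t. the weights is x itself.
layer(x).sum().backward()
first = layer.weight.grad.clone()

# Second backward pass without zero_grad(): the new gradient is added on top.
layer(x).sum().backward()
accumulated = layer.weight.grad.clone()

# Clearing the stored gradients first gives the gradient of the current batch alone.
optimizer.zero_grad()
layer(x).sum().backward()
fresh = layer.weight.grad.clone()

print(first)        # gradient from one pass
print(accumulated)  # twice as large: two passes summed
print(fresh)        # back to the single-pass gradient
```

Without the reset, every update would be driven by the sum of all gradients computed so far rather than by the current batch.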

In [None]:
h = 512
model = FFNN(len(vocab), h, len(language_to_idx)).to(get_device())
train_and_evaluate(2, model, train_loader, val_loader)
model.save_model("ffnn_fixed.pth") # Save our model!
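
When reading the loss values printed by the run above, it can help to recompute the `LogSoftmax` plus `NLLLoss` combination by hand for one example. The sketch below does this in plain Python for a pair of made-up scores; `log_softmax` and `nll` are hypothetical helper names, not functions from the released code.

```python
import math
from typing import List


def log_softmax(scores: List[float]) -> List[float]:
    """Log of the softmax probabilities, shifted by the max for numerical stability."""
    shift = max(scores)
    log_total = shift + math.log(sum(math.exp(s - shift) for s in scores))
    return [s - log_total for s in scores]


def nll(log_probs: List[float], gold: int) -> float:
    """Negative log likelihood of the gold class."""
    return -log_probs[gold]


# Made-up output scores for the two classes (index 0 = Modern, 1 = Shakespeare).
example_scores = [2.0, 0.0]
example_log_probs = log_softmax(example_scores)
example_probs = [math.exp(lp) for lp in example_log_probs]
example_loss = nll(example_log_probs, gold=0)

print(example_probs)  # two probabilities that sum to 1
print(example_loss)   # roughly 0.127
```

The loss is small here because the gold class already has the larger score; choosing `gold=1` for the same scores would give a much larger loss.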

In [None]:
# Example of how to load
loaded_model = FFNN(len(vocab), h, len(language_to_idx))
loaded_model.load_model("ffnn_fixed.pth")

## 1.2 Part 1 Report
Please include a description of the error, a description of your fix, and a python comment in the code indicating where the fix is for each of the 4 errors. The python comment will look something along the lines of "# Error 1 fix on line below". As a convenience, the template for the google doc write up is [here](https://docs.google.com/document/d/16scmrRva5WczB4a4QDYhFJBXaaAJd8oesEEAdi6ehbY/edit).

### Error 1:
Your answer here.

### Error 2:
Your answer here.

### Error 3:
Your answer here.

### Error 4:
Your answer here.

# Part 2: Miscellaneous
List any additional libraries you used and sources you referenced and cited (labelled with the section in which you referred to them). Include a description of how your group split up the work. Include brief feedback on this assignment.

**Each section must be clearly labelled, complete, and the corresponding pages should be correctly assigned to the corresponding Gradescope rubric item.** If you follow these steps for each of the 4 components requested, you are guaranteed full credit for this section. Otherwise, you will receive no credit for this section.