# **Assignment 1 - Language Models**
#### **Due: September 27 (Tuesday), 2022**

## **Notes**

### **Introduction**

Welcome to CSE 527A. The goal for the first assignment is to make sure you are familiar with all the tools you need to complete the programming assignments for the course. 

Each assignment contains two parts: a written and coding portion. The coding portion for each homework assignment will be delivered through a Colaboratory notebook such as this one. Please use as many as code and markdown cells to run and explain all the steps you took in order to answer each question.

### **Comments/Documentation**

Please follow PEP 8 style guidelines (https://peps.python.org/pep-0008/) for commenting your code. Furthermore, please remember to manually save your work once in a while. If you are connected to a hosted runtime that if for whatever reason it disconnects you will have to rerun all connected code cells.

### **Getting Started**

In order to compile code efficiently please pay attention to if you are using a hardware accelerator or not. If you are directly calling libraries like Tensorflow, Keras, or Pytorch, it is advised to switch to a GPU.

To access a GPU, go to `Edit->Notebook settings` and in the `Hardware accelerator` dropdown choose `GPU`. 
As soon as you run a code cell, you will be connected to a cloud instance with a GPU.
Try running the code cell below to check that a GPU is connected (select the cell then either click the play button at the top left or press `Ctrl+Enter` or `Shift+Enter`).

The free version of Google Colab will provide the necessary hardware for this course. Please keep in mind the RAM and Disk Space that you are allocated and that you are not given an infinite active runtime.

If your local machine has a GPU that you find outperforms the cloud GPU then you can follow the necessary documentation to use a GPU with your environment.

### **Lost GPU/TPU Access on Colab**

If you are not allocated a GPU or cannot connect to a GPU (limits are reached for Collab), Kaggle also provides free access to GPUs and TPUs. Please transfer your work to a Kaggle runtime instance by downloading your file on Colab as a '.ipynb' file and importing the file into Kaggle.

### **Submission Instructions**

We will use Gradescope for assignment submission. You can upload files individually or as part of a zip file, but if using a zip file be sure you are zipping the files directly and not a folder that contains them. Please note if designated output is cleared, you will receive a 0.

To download this notebook, go to `File->Download .ipynb`.  Please rename the file to match the name in our file list. 

When submitting your ipython notebooks, make sure everything runs correctly if the cells are executed in order starting from a fresh session.  Note that just because a cell runs in your current session doesn't mean it doesn't rely on code that you have already changed or deleted.  If the code doesn't take too long to run, we recommend re-running everything with `Runtime->Restart and run all...`.

When you upload your submission to the Gradescope assignment, you should get immediate feedback that confirms your submission was processed correctly. Note that Gradesope will allow you to submit multiple times before the deadline, and we will use the latest submission for grading.

## **Setup**

In [1]:
from google.colab import drive # one option to load datasets
from google.colab import files
drive.mount('/content/gdrive')
!nvidia-smi -L # check if using GPU

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



## **Problem 1**

## **1.1**

Write a program to compute unsmoothed unigrams, bigrams, and trigrams (you may not import nltk).

In [2]:
## your code here


def ngram(n,text): #generic N gram
  text = text.split()
  ngrams = {}
  # generate ngram by sliding window
  i,j = 0,n
  while j <= len(text):
    if n==1:
      ngrams[text[i]] = ngrams.get(text[i],0)+1
    else:
      ngrams[tuple(text[i:j])] = ngrams.get(tuple(text[i:j]),0)+1
    i+=1

    j+=1
  return ngrams

def unigram(text):
  return ngram(1,text)

def bigram(text):
  return ngram(2,text)

def trigram(text):
  return ngram(3,text)


## test the above code
print('unigrams=',unigram('asdf.g hjk hohoh'))
#print('unigrams=',unigram('a'))
print('bigrams=',bigram(('Valkyria Chronicles III Senjō no Valkyria 3   Chronicles  Japanese  戦場のヴァルキュリア3  lit  Valkyria of the')))
#print('bigrams=',bigram(('a')))
print('trigrams=',trigram(('asdf.g hjk hohoh')))


unigrams= {'asdf.g': 1, 'hjk': 1, 'hohoh': 1}
bigrams= {('Valkyria', 'Chronicles'): 1, ('Chronicles', 'III'): 1, ('III', 'Senjō'): 1, ('Senjō', 'no'): 1, ('no', 'Valkyria'): 1, ('Valkyria', '3'): 1, ('3', 'Chronicles'): 1, ('Chronicles', 'Japanese'): 1, ('Japanese', '戦場のヴァルキュリア3'): 1, ('戦場のヴァルキュリア3', 'lit'): 1, ('lit', 'Valkyria'): 1, ('Valkyria', 'of'): 1, ('of', 'the'): 1}
trigrams= {('asdf.g', 'hjk', 'hohoh'): 1}


## **1.2**

Train your model on the Wikitext-2-v1 training corpus (https://huggingface.co/datasets/wikitext). Explain the differences between your most common unigrams, bigrams, trigrams (pad beginning and end of your sentences and please remove puncutation and unknown tokens from corpus).


In [3]:
## your code here

#install datasets if necessary
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [4]:

from datasets import load_dataset
dataset = load_dataset('wikitext', 'wikitext-2-v1', split =['train','validation','test'])



  0%|          | 0/3 [00:00<?, ?it/s]

In [5]:
import string



def remove_unk_punctuation(data):
  '''
  remove <unk> and punctuations
  '''

  prepared_data = ""
  for token in  data:
    
    current_string = token['text'].strip().replace("<unk>", "").translate(token['text'].maketrans('', '', string.punctuation))
    if current_string!="":
      # if not empty string, concate
      prepared_data+=current_string

  return prepared_data

train = remove_unk_punctuation(dataset[0])
validation = remove_unk_punctuation(dataset[1])
test = remove_unk_punctuation(dataset[2])



In [6]:
import collections
def sorted_ngrams(n,text):
  ngrams = ngram(n,text)
  ngrams_sorted = dict(sorted(ngrams.items(), key=lambda item: item[1],reverse = True))
  top_ngrams = []
  for key in ngrams_sorted.keys():
    top_ngrams.append(key)
  return  top_ngrams

frequency_sorted_unigram = sorted_ngrams(1,train)
frequency_sorted_bigram = sorted_ngrams(2,train)
frequency_sorted_trigram = sorted_ngrams(3,train)




In [7]:
print(frequency_sorted_unigram[:10])
print(frequency_sorted_bigram[:10])
print(frequency_sorted_trigram[:10])

['the', 'of', 'and', 'in', 'to', 'a', 'was', 'The', 's', 'that']
[('of', 'the'), ('in', 'the'), ('to', 'the'), ('and', 'the'), ('on', 'the'), ('for', 'the'), ('from', 'the'), ('at', 'the'), ('by', 'the'), ('as', 'a')]
[('one', 'of', 'the'), ('the', 'United', 'States'), ('as', 'well', 'as'), ('part', 'of', 'the'), ('the', 'end', 'of'), ('end', 'of', 'the'), ('in', 'the', 'United'), ('a', 'number', 'of'), ('at', 'the', 'time'), ('known', 'as', 'the')]


## **1.3**

Calculate the perplexity for each n-gram (unigram, bigram, and trigram) on all splits (traning, validataion, test sets of Wikitext-2). And, without writing code, discuss what might happen to the perplexity if you continue to increase the number of words in your n-gram (4-gram, 5-gram, etc.)?

In [8]:
## your code here
def ngram_probability_calculation(ngram_dict):
  dict_values = ngram_dict.values()
  total = sum(dict_values)
  ngrams_probability = {}
  for key in ngram_dict.keys():
    ngrams_probability[key]=ngram_dict[key]/total
  return ngrams_probability



## test the above code
print('unigrams probability=',ngram_probability_calculation(unigram('asdf.g hjk hohoh uuuu')))
print('bigrams probability=',ngram_probability_calculation(bigram('asdf.g hjk hohoh uuuu')))

#print('bigrams=',ngram_probability_calculation(bigram(('Valkyria Chronicles III Senjō no Valkyria 3   Chronicles  Japanese  戦場のヴァルキュリア3  lit  Valkyria of the'))))


#print('trigrams=',ngram_probability_calculation(trigram(('asdf.g hjk hohoh'))))

unigrams probability= {'asdf.g': 0.25, 'hjk': 0.25, 'hohoh': 0.25, 'uuuu': 0.25}
bigrams probability= {('asdf.g', 'hjk'): 0.3333333333333333, ('hjk', 'hohoh'): 0.3333333333333333, ('hohoh', 'uuuu'): 0.3333333333333333}


In [9]:
#train_probability for uni,bi,trigram
unigram_train_probability = ngram_probability_calculation(ngram(1,train))
bigram_train_probability = ngram_probability_calculation(ngram(2,train))
trigram_train_probability = ngram_probability_calculation(ngram(3,train))

unigram_train_count = ngram(1,train)
bigram_train_count = ngram(2,train)
trigram_train_count = ngram(3,train)

unigram_test_count = ngram(1,test)
bigram_test_count = ngram(2,test)
trigram_test_count = ngram(3,test)

unigram_validation_count = ngram(1,validation)
bigram_validation_count = ngram(2,validation)
trigram_validation_count = ngram(3,validation)

In [10]:
#sanity check the probability
'''
print(sum(unigram_train_probability.values()))
print(sum(bigram_train_probability.values()))
print(sum(trigram_train_probability.values()))
'''

'\nprint(sum(unigram_train_probability.values()))\nprint(sum(bigram_train_probability.values()))\nprint(sum(trigram_train_probability.values()))\n'

In [11]:
import sys
import math

def text_probability(text,n,ngram_trained_model,n_minus_one_trained_model=None):
  text = text.split()
  ngrams_list = []

  i,j = 0,n
  while j <= len(text):
    if n==1:
      ngrams_list.append(text[i])
    else:
      ngrams_list.append(tuple(text[i:j]))
    i+=1
    j+=1
  

  log_probability = 0
  probability = 1
  if n==1:
    #ngram_trained_model = unigram_train_count
    total_unigram = sum(ngram_trained_model.values())
    for token in ngrams_list:
    #if token in trained_model:
      log_probability+=math.log(ngram_trained_model[token]/total_unigram,2)
      probability*=ngram_trained_model[token]/total_unigram


  elif n==2:
    #ngram_trained_model = bigram_train_count
    #n_minus_one_trained_model = unigram_train_count
    for token in ngrams_list:
      n_minus_one_token = token[0]
      log_probability+=math.log(ngram_trained_model[token]/n_minus_one_trained_model[n_minus_one_token],2)
      probability*=ngram_trained_model[token]/n_minus_one_trained_model[n_minus_one_token]
      


  elif n==3:
    #ngram_trained_model = trigram_train_count
    #n_minus_one_trained_model = bigram_train_count
    for token in ngrams_list:
      n_minus_one_token = token[:n-1]
      log_probability+=math.log(ngram_trained_model[token]/n_minus_one_trained_model[n_minus_one_token],2)
      probability*=ngram_trained_model[token]/n_minus_one_trained_model[n_minus_one_token]
  else:
    print("not implemented")
    sys.exit(1)

  

  
  #perplexity
  perplexity = math.pow(2,-log_probability/len(ngrams_list))

  
  


  



  return perplexity #, probability, math.pow(2,log_probability), log_probability
    




## test the above code
#print('unigrams=',text_probability('asdf.g hjk hohoh',1))
#print('unigrams=',text_probability('asdf.g hjk hohoh',2))
#print('unirams=',text_probability(('Valkyria Chronicles III Senjō no Valkyria 3   Chronicles  Japanese  戦場のヴァルキュリア3  lit  Valkyria of the'),1))
#print('bigrams=',text_probability(('Valkyria Chronicles III Senjō no Valkyria 3   Chronicles  Japanese  戦場のヴァルキュリア3  lit  Valkyria of the'),2))

#print(text_probability(train,1))
print(text_probability(train,1,unigram_train_count))
print(text_probability(train,2,bigram_train_count,unigram_train_count))
print(text_probability(train,3,trigram_train_count,bigram_train_count))



1941.083712678701
97.34964861873617
5.429987233533894


## **1.4**

Enable Laplace smoothing and Add-K smoothing (0.1,0.05,0.01) to your code. Discuss the changes in perplexity values between n-grams as you try different smoothing methods/values.

In [12]:
## your code here


def prepare_for_smoothing(train_ngram_count,test_ngram_count,validation_ngram_count): 
  result=train_ngram_count.copy()
  for token in test_ngram_count:
    if token not in train_ngram_count:
      result[token]=0
  for token in validation_ngram_count:
    if token not in train_ngram_count:
      result[token]=0
  return result

def smoothing(train_count, k=0):
  result=train_count.copy()
  for token in train_count:
    result[token]+=k
  return result

## test the above code
#ngram(1,text)
toy_unigram_train=ngram(2,'asdf.g hjk hohoh')
toy_unigram_test=ngram(2,'asdf.g gg hjk fh hohoh')
toy_unigram_valid=ngram(2,'asdf.g hjk hohoh gg asdfg')
print(toy_unigram_train)
toy_unigram_train_forsmoothing=prepare_for_smoothing(toy_unigram_train,toy_unigram_test,toy_unigram_valid)
print(toy_unigram_train)
print(toy_unigram_train_forsmoothing)

toy_unigram_train_k_0=smoothing(toy_unigram_train_forsmoothing)
toy_unigram_train_k_1=smoothing(toy_unigram_train_forsmoothing,1)
toy_unigram_train_k_2=smoothing(toy_unigram_train_forsmoothing,2)
print(toy_unigram_train)
print(toy_unigram_train_forsmoothing)
print(toy_unigram_train_k_0)
print(toy_unigram_train_k_1)
print(toy_unigram_train_k_2)






{('asdf.g', 'hjk'): 1, ('hjk', 'hohoh'): 1}
{('asdf.g', 'hjk'): 1, ('hjk', 'hohoh'): 1}
{('asdf.g', 'hjk'): 1, ('hjk', 'hohoh'): 1, ('asdf.g', 'gg'): 0, ('gg', 'hjk'): 0, ('hjk', 'fh'): 0, ('fh', 'hohoh'): 0, ('hohoh', 'gg'): 0, ('gg', 'asdfg'): 0}
{('asdf.g', 'hjk'): 1, ('hjk', 'hohoh'): 1}
{('asdf.g', 'hjk'): 1, ('hjk', 'hohoh'): 1, ('asdf.g', 'gg'): 0, ('gg', 'hjk'): 0, ('hjk', 'fh'): 0, ('fh', 'hohoh'): 0, ('hohoh', 'gg'): 0, ('gg', 'asdfg'): 0}
{('asdf.g', 'hjk'): 1, ('hjk', 'hohoh'): 1, ('asdf.g', 'gg'): 0, ('gg', 'hjk'): 0, ('hjk', 'fh'): 0, ('fh', 'hohoh'): 0, ('hohoh', 'gg'): 0, ('gg', 'asdfg'): 0}
{('asdf.g', 'hjk'): 2, ('hjk', 'hohoh'): 2, ('asdf.g', 'gg'): 1, ('gg', 'hjk'): 1, ('hjk', 'fh'): 1, ('fh', 'hohoh'): 1, ('hohoh', 'gg'): 1, ('gg', 'asdfg'): 1}
{('asdf.g', 'hjk'): 3, ('hjk', 'hohoh'): 3, ('asdf.g', 'gg'): 2, ('gg', 'hjk'): 2, ('hjk', 'fh'): 2, ('fh', 'hohoh'): 2, ('hohoh', 'gg'): 2, ('gg', 'asdfg'): 2}


In [13]:

unigram_train_count_smoothing= prepare_for_smoothing(unigram_train_count,unigram_test_count,unigram_validation_count)
bigram_train_count_smoothing = prepare_for_smoothing(bigram_train_count,bigram_test_count,bigram_validation_count)
trigram_train_count_smoothing = prepare_for_smoothing(trigram_train_count,trigram_test_count,trigram_validation_count)

In [14]:
unigram_train_k_0=smoothing(unigram_train_count_smoothing)
bigram_train_k_0=smoothing(bigram_train_count_smoothing)
trigram_train_k_0=smoothing(trigram_train_count_smoothing)

#Laplace smoothing
unigram_train_k_laplace=smoothing(unigram_train_count_smoothing,1)
bigram_train_k_laplace=smoothing(bigram_train_count_smoothing,1)
trigram_train_k_laplace=smoothing(trigram_train_count_smoothing,1)

#k(0.1,0.05,0.01) smoothing 
unigram_train_k_1=smoothing(unigram_train_count_smoothing,0.1)
bigram_train_k_1=smoothing(bigram_train_count_smoothing,0.1)
trigram_train_k_1=smoothing(trigram_train_count_smoothing,0.1)

unigram_train_k_05=smoothing(unigram_train_count_smoothing,0.05)
bigram_train_k_05=smoothing(bigram_train_count_smoothing,0.05)
trigram_train_k_05=smoothing(trigram_train_count_smoothing,0.05)

unigram_train_k_01=smoothing(unigram_train_count_smoothing,0.01)
bigram_train_k_01=smoothing(bigram_train_count_smoothing,0.01)
trigram_train_k_01=smoothing(trigram_train_count_smoothing,0.01)

In [15]:
print(text_probability(train,1,unigram_train_k_0))
print(text_probability(train,2,bigram_train_k_0,unigram_train_k_0))
print(text_probability(train,3,trigram_train_k_0,bigram_train_k_0))



1941.083712678701
97.34964861873617
5.429987233533894


In [16]:
print("......lapalce smooting performance.......")

print('..train..')
print("unigram perplexity=",text_probability(train,1,unigram_train_k_laplace))
print("bigram perplexity=",text_probability(train,2,bigram_train_k_laplace,unigram_train_k_laplace))
print("trigram perplexity=",text_probability(train,3,trigram_train_k_laplace,bigram_train_k_laplace))
print('..validation..')
print("unigram perplexity=",text_probability(validation,1,unigram_train_k_laplace))
print("bigram perplexity=",text_probability(validation,2,bigram_train_k_laplace,unigram_train_k_laplace))
print("trigram perplexity=",text_probability(validation,3,trigram_train_k_laplace,bigram_train_k_laplace))
print('..test..')
print("unigram perplexity=",text_probability(test,1,unigram_train_k_laplace))
print("bigram perplexity=",text_probability(test,2,bigram_train_k_laplace,unigram_train_k_laplace))
print("trigram perplexity=",text_probability(test,3,trigram_train_k_laplace,bigram_train_k_laplace))

......lapalce smooting performance.......
..train..
unigram perplexity= 1943.6194148230973
bigram perplexity= 73.77607260148193
trigram perplexity= 4.176797835405882
..validation..
unigram perplexity= 1730.0201050331605
bigram perplexity= 110.30615567874555
trigram perplexity= 5.682695471903138
..test..
unigram perplexity= 1688.5167095041265
bigram perplexity= 111.17559751168041
trigram perplexity= 5.792542173852982


In [17]:
print("......k=0.1 smooting performance.......")

print('..train..')
print("unigram perplexity=",text_probability(train,1,unigram_train_k_1))
print("bigram perplexity=",text_probability(train,2,bigram_train_k_1,unigram_train_k_1))
print("trigram perplexity=",text_probability(train,3,trigram_train_k_1,bigram_train_k_1))
print('..validation..')
print("unigram perplexity=",text_probability(validation,1,unigram_train_k_1))
print("bigram perplexity=",text_probability(validation,2,bigram_train_k_1,unigram_train_k_1))
print("trigram perplexity=",text_probability(validation,3,trigram_train_k_1,bigram_train_k_1))
print('..test..')
print("unigram perplexity=",text_probability(test,1,unigram_train_k_1))
print("bigram perplexity=",text_probability(test,2,bigram_train_k_1,unigram_train_k_1))
print("trigram perplexity=",text_probability(test,3,trigram_train_k_1,bigram_train_k_1))

......k=0.1 smooting performance.......
..train..
unigram perplexity= 1941.1284835418628
bigram perplexity= 93.8621401417523
trigram perplexity= 5.233655962524139
..validation..
unigram perplexity= 1726.269501963434
bigram perplexity= 251.70034887595685
trigram perplexity= 14.707950555613726
..test..
unigram perplexity= 1684.858102481261
bigram perplexity= 251.96117101111764
trigram perplexity= 15.096859354326547


In [18]:
print("......k=0.05 smooting performance.......")

print('..train..')
print("unigram perplexity=",text_probability(train,1,unigram_train_k_05))
print("bigram perplexity=",text_probability(train,2,bigram_train_k_05,unigram_train_k_05))
print("trigram perplexity=",text_probability(train,3,trigram_train_k_05,bigram_train_k_05))
print('..validation..')
print("unigram perplexity=",text_probability(validation,1,unigram_train_k_05))
print("bigram perplexity=",text_probability(validation,2,bigram_train_k_05,unigram_train_k_05))
print("trigram perplexity=",text_probability(validation,3,trigram_train_k_05,bigram_train_k_05))
print('..test..')
print("unigram perplexity=",text_probability(test,1,unigram_train_k_05))
print("bigram perplexity=",text_probability(test,2,bigram_train_k_05,unigram_train_k_05))
print("trigram perplexity=",text_probability(test,3,trigram_train_k_05,bigram_train_k_05))

......k=0.05 smooting performance.......
..train..
unigram perplexity= 1941.0990998568923
bigram perplexity= 95.55443878775249
trigram perplexity= 5.3284420937647194
..validation..
unigram perplexity= 1726.4822060155464
bigram perplexity= 315.98910471936813
trigram perplexity= 19.734544987407137
..test..
unigram perplexity= 1685.1386273239207
bigram perplexity= 315.73356490457326
trigram perplexity= 20.295591321493074


In [19]:
print("......k=0.01 smooting performance.......")

print('..train..')
print("unigram perplexity=",text_probability(train,1,unigram_train_k_01))
print("bigram perplexity=",text_probability(train,2,bigram_train_k_01,unigram_train_k_01))
print("trigram perplexity=",text_probability(train,3,trigram_train_k_01,bigram_train_k_01))
print('..validation..')
print("unigram perplexity=",text_probability(validation,1,unigram_train_k_01))
print("bigram perplexity=",text_probability(validation,2,bigram_train_k_01,unigram_train_k_01))
print("trigram perplexity=",text_probability(validation,3,trigram_train_k_01,bigram_train_k_01))
print('..test..')
print("unigram perplexity=",text_probability(test,1,unigram_train_k_01))
print("bigram perplexity=",text_probability(test,2,bigram_train_k_01,unigram_train_k_01))
print("trigram perplexity=",text_probability(test,3,trigram_train_k_01,bigram_train_k_01))

......k=0.01 smooting performance.......
..train..
unigram perplexity= 1941.085645879415
bigram perplexity= 96.98183754326558
trigram perplexity= 5.409098635628027
..validation..
unigram perplexity= 1727.3035384219418
bigram perplexity= 530.6329428185244
trigram perplexity= 39.18001355921677
..test..
unigram perplexity= 1686.1392615421596
bigram perplexity= 527.9992658576441
trigram perplexity= 40.471719803450185


In [20]:
sys.exit(1)

SystemExit: ignored

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)


## **Problem 2**



(Eisenstein Ch. 6) Using the Pytorch library, train an LSTM language model from the same Wikitext training corpus you used in problem 1. After each epoch of training, compute its perplexity on the Wikitext validation corpus. Stop training when the perplexity stops improving.

1. Fully describe your model architecture, hyperparameters, and experimental procedure.
2. After each epoch of training, compute your LM’s perplexity on the development data. Plot the development perplexity against # of epochs. Additionally, compute and report the perplexity on test
data.
3. Compare experimental results such as perplexity and training time between your n-gram and neural models (include smoothed and unsmooth n-grams). Provide graphs that demonstrate your
results.


In [None]:
## your code here
import numpy as np

#build vocabulary

train_list = train.split()
test_list = test.split()
validation_list = validation.split()

vocabulary = set(train_list + test_list + validation_list)

print(len(vocabulary))


In [None]:
#vectorize the word

word2idx = {w:i for i, w in enumerate(vocabulary)}

#test
for i,word in enumerate(word2idx):
    print('  {:4s}: {:4d},'.format(word, word2idx[word]))
    if i == 5:
      break
    


In [None]:
#vectorize the text
def vectorize_text(text):
  return [word2idx[word] for word in text]


vectorized_train_list = vectorize_text(train_list)
vectorized_test_list = vectorize_text(test_list)
vectorized_validation_list = vectorize_text(validation_list)
print(train_list[0:10])
print(vectorized_train_list[0:10])
print(word2idx['Valkyria'],word2idx['戦場のヴァルキュリア3'])

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(0)

#creating training samples
def create_training_samples(sequence_length, vectorized_text,batch_length):
  starting_index = np.random.choice(len(vectorized_text)-1-sequence_length, batch_length)
  x = np.array([vectorized_text[i:i+sequence_length] for i in starting_index])
  y = np.array([vectorized_text[i+1:i+sequence_length+1] for i in starting_index])
  return torch.from_numpy(x),torch.from_numpy(y)


#test create_training_samples#
toy_x, toy_y = create_training_samples(5,vectorized_train_list[0:10],2)

print(toy_x)
print(toy_y)


In [None]:
from torch.utils.data import Dataset, DataLoader
class Wikitext:
  def __init__(self,vectorized_text,sequence_length):
    self.vectorized_text = vectorized_text
    self.sequence_length = sequence_length
  def __len__(self):
    return len(self.vectorized_text) - self.sequence_length
  def __getitem__(self,idx):
    x = np.array(self.vectorized_text[idx:idx+self.sequence_length])
    y = np.array(self.vectorized_text[idx+1:idx+1+self.sequence_length])
    return (torch.from_numpy(x),torch.from_numpy(y))
    


''' test the Wikitesxt dataset 
wiki_datatest = Wikitext(vectorized_train_list[0:10],5)
toy_train_loader = DataLoader(dataset=wiki_datatest,batch_size=100)

toy_iter = iter(toy_train_loader)
print(vectorized_train_list[0:10])
xy = toy_iter.next()
x, y = xy
print(x, y)
'''
sequence_length = 10 #64
batch_length = 50 #16

wiki_datatest_train = Wikitext(vectorized_train_list,sequence_length)
wiki_datatest_validation = Wikitext(vectorized_validation_list,sequence_length)
wiki_datatest_test = Wikitext(vectorized_test_list,sequence_length)
train_loader = DataLoader(dataset=wiki_datatest_train, batch_size=batch_length)
validation_loader = DataLoader(dataset=wiki_datatest_validation, batch_size=batch_length)
test_loader = DataLoader(dataset=wiki_datatest_test, batch_size=batch_length)





In [None]:
########################################################################

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(0)
class LSTMModel(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, rnn_layers):
        super(LSTMModel, self).__init__()
        self.rnn_layers= rnn_layers
        self.hidden_size=hidden_dim
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim,rnn_layers,batch_first=True)
        self.last_layer = nn.Linear(hidden_dim, vocab_size)
        
    def forward(self, text,previous_state):
      embeds = self.word_embeddings(text)

      
      
      
      lstm_out,new_state = self.lstm(embeds,previous_state)

      

      lstm_out = lstm_out.reshape(lstm_out.size(0)*lstm_out.size(1), lstm_out.size(2)) #######
      

      last_layer = self.last_layer(lstm_out)   ######################

      return last_layer, new_state
    def setup_first_layer(self,sequence_length):
      return (torch.zeros(self.rnn_layers, sequence_length, self.hidden_size),
                torch.zeros(self.rnn_layers, sequence_length, self.hidden_size))


embedding_dim, hidden_dim, vocab_size, rnn_layers = 512, 512, len(vocabulary), 1

lstm_model = LSTMModel(embedding_dim, hidden_dim, vocab_size,rnn_layers)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(lstm_model.parameters(), lr=0.01)

      

In [None]:

x, y = create_training_samples(sequence_length,vectorized_train_list, batch_length)
print(x.shape)
print(y.shape)
prev_state = lstm_model.setup_first_layer(batch_length)
print(prev_state[0].shape)

pred,_ = lstm_model(x,prev_state)
print("Input shape:      ", x.shape, " # (batch_size, sequence_length)")
print("Prediction shape: ", pred.shape, "# (batch_size, sequence_length, vocab_size)")
print(len(prev_state))

In [None]:
yshaped = y.reshape(-1)
loss = criterion(pred,yshaped)

print(yshaped.shape)
print(loss)


In [None]:
from torch.nn.utils import clip_grad_norm_

toy_wiki_datatest = Wikitext(vectorized_train_list[0:10],sequence_length)
toy_train_loader = DataLoader(dataset=toy_wiki_datatest,batch_size=batch_length)


num_of_epochs = 2


for epoch in range(num_of_epochs):
  lstm_model.train()
  current_epoch_train_loss = 0
  batch_number = 1
  for x,y in train_loader:
      #print(x,y)
      prev_states = lstm_model.setup_first_layer(batch_length)
      optimizer.zero_grad()
      pred,new_state = lstm_model(x,prev_states)

      new_state[0].detach() ###############################
      new_state[1].detach() ###############################

      yshaped = y.reshape(-1)
      loss = criterion(pred,yshaped)
      current_epoch_train_loss+=loss.item()
      loss.backward()
      clip_grad_norm_(lstm_model.parameters(), 0.5)
      optimizer.step()
      print('epoch=',epoch,' loss=',loss.item(),' batch_number=',batch_number," total=",len(train_loader))
      batch_number+=1


  print("train loss=",current_epoch_train_loss/ len(train_loader))


  lstm_model.eval()
  current_epoch_valid_loss = 0
  for x,y in validation_loader:
    pred,new_state = lstm_model(x,prev_states)
    yshaped = y.reshape(-1)
    loss = criterion(pred,yshaped)
    current_epoch_valid_loss+=loss.item()
    print("validation loss=",loss.item())
  
  ###########################################################
  print(f'Epoch {epoch+1} \t\t Training Loss: {current_epoch_train_loss / len(train_loader)} \t\t Validation Loss: {current_epoch_valid_loss / len(validation_loader)}')
    





In [None]:

#train the model
from tqdm import tqdm

def train(x,y):
  prev_states = lstm_model.setup_first_layer(batch_length)

  optimizer.zero_grad()
  pred,new_state = lstm_model(x,prev_states)
  new_state[0].detach() ###############################
  new_state[1].detach() ###############################
  yshaped = y.reshape(-1)
  loss = criterion(pred,yshaped)
  loss.backward()
  optimizer.step()
  return loss


loss_record = []

num_training_iterations=3000
for i in tqdm(range(num_training_iterations)):
   x, y = create_training_samples(sequence_length,vectorized_train_list, batch_length)
   loss = train(x, y)
   loss_record.append(loss)
   if i%500 == 0:
     print("i=",i,' loss=',loss.item()," perplexity=",pow(2,loss.item()))
