# Advanced NLP HW0

Before starting the task please read thoroughly these chapters of Speech and Language Processing by Daniel Jurafsky & James H. Martin:

•	N-gram language models: https://web.stanford.edu/~jurafsky/slp3/3.pdf

•	Neural language models: https://web.stanford.edu/~jurafsky/slp3/7.pdf 

In this task you will be asked to implement the models described there.

Build a text generator based on n-gram language model and neural language model.
1.	Find a corpus (e.g. http://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt ), but you are free to use anything else of your interest
2.	Preprocess it if necessary (we suggest using nltk for that)
3.	Build an n-gram model
4.	Try out different values of n, calculate perplexity on a held-out set
5.	Build a simple neural network model for text generation (start from a feed-forward net for example). We suggest using tensorflow + keras for this task

Criteria:
1.	Data is split into train / validation / test, motivation for the split method is given
2.	N-gram model is implemented
a.	Unknown words are handled
b.	Add-k Smoothing is implemented
3.	Neural network for text generation is implemented
4.	Perplexity is calculated for both models
5.	Examples of texts generated with different models are present and compared
6.	Optional: Try both character-based and word-based approaches.

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import re
import urllib.request as urllib2 #downloading data from url

from collections import defaultdict
import random

import nltk
from nltk.lm.preprocessing import padded_everygram_pipeline, padded_everygrams
from nltk.lm import MLE, Vocabulary, KneserNeyInterpolated, WittenBellInterpolated, Laplace, Lidstone

from sklearn.preprocessing import MinMaxScaler, StandardScaler, MaxAbsScaler
from sklearn.model_selection import train_test_split

In [2]:
data = list(urllib2.urlopen('https://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt'))

In [3]:
def preproc(data):
    data = [line.strip().decode("utf-8")  for line in data]
    pat = re.compile(r'((\b\w*)|(\b\w*\s?\b\w*)):$')
    data = [i.lower() for i in data if i]
    p = []
    speech = ''
    for line in data:
        if not pat.findall(line):
            if not speech:
                speech = line
            else:
                speech += ' ' + line

        else:
            p.append(speech)
            speech = ''
    p = [string for string in p if len(string) != 0]
    
    return p

In [4]:
data = preproc(data)

In [5]:
appos = {
"aren't" : "are not",
"can't" : "cannot",
"couldn't" : "could not",
"didn't" : "did not",
"doesn't" : "does not",
"don't" : "do not",
"hadn't" : "had not",
"hasn't" : "has not",
"haven't" : "have not",
"he'd" : "he would",
"he'll" : "he will",
"he's" : "he is",
"i'd" : "I would",
"i'd" : "I had",
"i'll" : "I will",
"i'm" : "I am",
"im" :"I am",
"isn't" : "is not",
"its": "it is",
"it's" : "it is",
"it'll":"it will",
"i've" : "I have",
"let's" : "let us",
"mightn't" : "might not",
"mustn't" : "must not",
"shan't" : "shall not",
"she'd" : "she would",
"she'll" : "she will",
"she's" : "she is",
"shouldn't" : "should not",
"that's" : "that is",
"there's" : "there is",
"they'd" : "they would",
"they'll" : "they will",
"they're" : "they are",
"they've" : "they have",
"we'd" : "we would",
"we're" : "we are",
"weren't" : "were not",
"we've" : "we have",
"what'll" : "what will",
"what're" : "what are",
"what's" : "what is",
"what've" : "what have",
"where's" : "where is",
"who'd" : "who would",
"who'll" : "who will",
"who're" : "who are",
"who's" : "who is",
"who've" : "who have",
"won't" : "will not",
"wouldn't" : "would not",
"you'd" : "you would",
"you'll" : "you will",
"you're" : "you are",
"you've" : "you have",
"'re": " are",
"wasn't": "was not",
"we'll":" will",
"won't":"will not",
"didn't": "did not",
"'t'": ' it', 
"'em": "them", "o'": "of", "'ll": " will", "ne'er":"never"
}
for i, j in appos.items():
    for k in range(len(data)):
        data[k] = data[k].replace(i, j)   

In [6]:
tokenized = list(map(nltk.word_tokenize, data))

In [9]:
tokenized[0:4]

[['before',
  'we',
  'proceed',
  'any',
  'further',
  ',',
  'hear',
  'me',
  'speak',
  '.'],
 ['speak', ',', 'speak', '.'],
 ['you',
  'are',
  'all',
  'resolved',
  'rather',
  'to',
  'die',
  'than',
  'to',
  'famish',
  '?'],
 ['resolved', '.', 'resolved', '.']]

## Models

Base class for the model.

In [10]:
class BaseLM:
    
    def __init__(self, n, vocab = None):
    
        """Language model constructor
        n -- n-gram size
        vocab -- optional fixed vocabulary for the model
        """
        self.n = n
        self.vocab = vocab
        self.corpus = []
        self.dic = defaultdict(lambda: defaultdict(lambda: 0))
        
        def generate_corpus():
            
            for speech in self.vocab:

                ngram = nltk.ngrams([word for word in speech], self.n+1, pad_right=True, pad_left=True)
                self.corpus.append(list(ngram))

            
            for ngram in [item for sublist in self.corpus for item in sublist]:
                self.dic[(ngram[:-1])][ngram[-1]] += 1

            for key in self.dic.keys():
                total = float(sum(self.dic[key].values()))
                for value in self.dic[key]:
                    self.dic[(key)][value] /= total
                

        generate_corpus()
    

    def prob(self, word, context=None):
        """This method returns probability of a word with given context: P(w_t | w_{t - 1}...w_{t - n + 1})

        For example:
        >>> lm.prob('hello', context=('world',))
        0.99988
        """
        
        if word in self.dic[tuple(context.split(' '))].keys():
            print(self.dic[tuple(context.split(' '))][word])
        else:
            print('There is no such sequence in corpus!')
        
    def generate_text(self, text_length):
        """This method generates random text of length 

        For example
        >>> lm.generate_text(2)
        hello world

        """
        text = list(list(self.dic.keys())[random.randint(0, len(self.dic))])
        endpoint = 0

        while len(text)<=text_length:
            prob = 0

            for word in self.dic[tuple(text[(self.n*(-1)):])].keys():
                prob += self.dic[tuple(text[(self.n*(-1)):])][word]

                if prob >= np.random.randn():
                    text.append(word)
                    break
        print(' '.join([w for w in text if w]))
    

    def update(self, sequence_of_tokens):
        """This method learns probabiities based on given sequence of tokents
    
        sequence_of_tokens -- iterable of tokens

        For example
        >>> lm.update(['hello', 'world'])
        """
        raise NotImplementedError
    
    def perplexity(self, sequence_of_tokens):
        """This method returns perplexity for a given sequence of tokens
    
        sequence_of_tokens -- iterable of tokens
        """
        raise NotImplementedError  

In [11]:
blm = BaseLM(3, tokenized)

In [13]:
blm.corpus[0:3]

[[(None, None, None, 'before'),
  (None, None, 'before', 'we'),
  (None, 'before', 'we', 'proceed'),
  ('before', 'we', 'proceed', 'any'),
  ('we', 'proceed', 'any', 'further'),
  ('proceed', 'any', 'further', ','),
  ('any', 'further', ',', 'hear'),
  ('further', ',', 'hear', 'me'),
  (',', 'hear', 'me', 'speak'),
  ('hear', 'me', 'speak', '.'),
  ('me', 'speak', '.', None),
  ('speak', '.', None, None),
  ('.', None, None, None)],
 [(None, None, None, 'speak'),
  (None, None, 'speak', ','),
  (None, 'speak', ',', 'speak'),
  ('speak', ',', 'speak', '.'),
  (',', 'speak', '.', None),
  ('speak', '.', None, None),
  ('.', None, None, None)],
 [(None, None, None, 'you'),
  (None, None, 'you', 'are'),
  (None, 'you', 'are', 'all'),
  ('you', 'are', 'all', 'resolved'),
  ('are', 'all', 'resolved', 'rather'),
  ('all', 'resolved', 'rather', 'to'),
  ('resolved', 'rather', 'to', 'die'),
  ('rather', 'to', 'die', 'than'),
  ('to', 'die', 'than', 'to'),
  ('die', 'than', 'to', 'famish'),
  ('

In [14]:
blm.generate_text(500)

be low . before we proceed any further , examine your conscience : and dying so , death is my son-in-law , death is to hI amself and knew no other kin . before you can say , god shorten harry 's happy life one day ! we know't , we know't . before we make election , give me leave to go . before thee stands this fair hesperides , with golden fruit , but dangerous to be aged in any kind of art . heaven and yourself had part in this fair maid ; before thy coming lewis was henry 's friend . o , me alone ! make you a sword of me ? before thy coming lewis was henry 's friend . o , me ! you are like to sir vincentio . his name and orderly proceed to swear hI am in a holiday humour and like enough to consent . what would you have , i know a way , out of thy long-experienced tI ame , but hearts for the event . before we proceed any further , hear me speak . speak , count , 't is true that you have lately told us ; the volsces are in arms , and swore he would never have loved the moor , the chafe

In [15]:
blm.prob('any', 'before we proceed')

1.0
