# Advanced NLP HW0

Before starting the task please read thoroughly these chapters of Speech and Language Processing by Daniel Jurafsky & James H. Martin:

•	N-gram language models: https://web.stanford.edu/~jurafsky/slp3/3.pdf

•	Neural language models: https://web.stanford.edu/~jurafsky/slp3/7.pdf 

In this task you will be asked to implement the models described there.

Build a text generator based on n-gram language model and neural language model.
1.	Find a corpus (e.g. http://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt ), but you are free to use anything else of your interest
2.	Preprocess it if necessary (we suggest using nltk for that)
3.	Build an n-gram model
4.	Try out different values of n, calculate perplexity on a held-out set
5.	Build a simple neural network model for text generation (start from a feed-forward net for example). We suggest using tensorflow + keras for this task

Criteria:
1.	Data is split into train / validation / test, motivation for the split method is given
2.	N-gram model is implemented
a.	Unknown words are handled
b.	Add-k Smoothing is implemented
3.	Neural network for text generation is implemented
4.	Perplexity is calculated for both models
5.	Examples of texts generated with different models are present and compared
6.	Optional: Try both character-based and word-based approaches.

In [69]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import re
import urllib.request as urllib2 #downloading data from url

from collections import defaultdict
import random

import nltk
from nltk.lm.preprocessing import padded_everygram_pipeline, padded_everygrams
from nltk.lm import MLE, Vocabulary, KneserNeyInterpolated, WittenBellInterpolated, Laplace, Lidstone

from sklearn.preprocessing import MinMaxScaler, StandardScaler, MaxAbsScaler
from sklearn.model_selection import train_test_split

In [70]:
data = list(urllib2.urlopen('https://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt'))

In [71]:
def preproc(data):
    data = [line.strip().decode("utf-8")  for line in data]
    pat = re.compile(r'((\b\w*)|(\b\w*\s?\b\w*)):$')
    data = [i.lower() for i in data if i]
    p = []
    speech = ''
    for line in data:
        if not pat.findall(line):
            if not speech:
                speech = line
            else:
                speech += ' ' + line

        else:
            p.append(speech)
            speech = ''
    p = [string for string in p if len(string) != 0]
    
    return p

In [72]:
data = preproc(data)

In [73]:
data

['before we proceed any further, hear me speak.',
 'speak, speak.',
 'you are all resolved rather to die than to famish?',
 'resolved. resolved.',
 'first, you know caius marcius is chief enemy to the people.',
 "we know't, we know't.",
 "let us kill him, and we'll have corn at our own price. is't a verdict?",
 "no more talking on't; let it be done: away, away!",
 'one word, good citizens.',
 'we are accounted poor citizens, the patricians good. what authority surfeits on would relieve us: if they would yield us but the superfluity, while it were wholesome, we might guess they relieved us humanely; but they think we are too dear: the leanness that afflicts us, the object of our misery, is as an inventory to particularise their abundance; our sufferance is a gain to them let us revenge this with our pikes, ere we become rakes: for the gods know i speak this in hunger for bread, not in thirst for revenge.',
 'would you proceed especially against caius marcius?',
 "against him first: he's

In [74]:
appos = {
"aren't" : "are not",
"can't" : "cannot",
"couldn't" : "could not",
"didn't" : "did not",
"doesn't" : "does not",
"don't" : "do not",
"hadn't" : "had not",
"hasn't" : "has not",
"haven't" : "have not",
"he'd" : "he would",
"he'll" : "he will",
"he's" : "he is",
"i'd" : "I would",
"i'd" : "I had",
"i'll" : "I will",
"i'm" : "I am",
"im" :"I am",
"isn't" : "is not",
"its": "it is",
"it's" : "it is",
"it'll":"it will",
"i've" : "I have",
"let's" : "let us",
"mightn't" : "might not",
"mustn't" : "must not",
"shan't" : "shall not",
"she'd" : "she would",
"she'll" : "she will",
"she's" : "she is",
"shouldn't" : "should not",
"that's" : "that is",
"there's" : "there is",
"they'd" : "they would",
"they'll" : "they will",
"they're" : "they are",
"they've" : "they have",
"we'd" : "we would",
"we're" : "we are",
"weren't" : "were not",
"we've" : "we have",
"what'll" : "what will",
"what're" : "what are",
"what's" : "what is",
"what've" : "what have",
"where's" : "where is",
"who'd" : "who would",
"who'll" : "who will",
"who're" : "who are",
"who's" : "who is",
"who've" : "who have",
"won't" : "will not",
"wouldn't" : "would not",
"you'd" : "you would",
"you'll" : "you will",
"you're" : "you are",
"you've" : "you have",
"'re": " are",
"wasn't": "was not",
"we'll":" will",
"won't":"will not",
"didn't": "did not",
"'t'": ' it'
}
for i, j in appos.items():
    for k in range(len(data)):
        data[k] = data[k].replace(i, j)   

In [75]:
tokenized = list(map(nltk.word_tokenize, data))

In [76]:
tokenized

[['before',
  'we',
  'proceed',
  'any',
  'further',
  ',',
  'hear',
  'me',
  'speak',
  '.'],
 ['speak', ',', 'speak', '.'],
 ['you',
  'are',
  'all',
  'resolved',
  'rather',
  'to',
  'die',
  'than',
  'to',
  'famish',
  '?'],
 ['resolved', '.', 'resolved', '.'],
 ['first',
  ',',
  'you',
  'know',
  'caius',
  'marcius',
  'is',
  'chief',
  'enemy',
  'to',
  'the',
  'people',
  '.'],
 ['we', "know't", ',', 'we', "know't", '.'],
 ['let',
  'us',
  'kill',
  'hI',
  'am',
  ',',
  'and',
  'will',
  'have',
  'corn',
  'at',
  'our',
  'own',
  'price',
  '.',
  "is't",
  'a',
  'verdict',
  '?'],
 ['no',
  'more',
  'talking',
  'o',
  "n't",
  ';',
  'let',
  'it',
  'be',
  'done',
  ':',
  'away',
  ',',
  'away',
  '!'],
 ['one', 'word', ',', 'good', 'citizens', '.'],
 ['we',
  'are',
  'accounted',
  'poor',
  'citizens',
  ',',
  'the',
  'patricians',
  'good',
  '.',
  'what',
  'authority',
  'surfeit',
  'is',
  'on',
  'would',
  'relieve',
  'us',
  ':',
  'i

## Models

Base class for the model.

In [77]:
class BaseLM:
    
    def __init__(self, n, vocab = None):
    
        """Language model constructor
        n -- n-gram size
        vocab -- optional fixed vocabulary for the model
        """
        self.n = n
        self.vocab = vocab
        self.corpus = []
        self.dic = defaultdict(lambda: defaultdict(lambda: 0))
        
        def generate_corpus():
            
            for speech in self.vocab:

                ngram = nltk.ngrams([word for word in speech], self.n+1, pad_right=True, pad_left=True)
                self.corpus.append(list(ngram))

            
            for ngram in [item for sublist in self.corpus for item in sublist]:
                self.dic[(ngram[:-1])][ngram[-1]] += 1

            for key in self.dic.keys():
                total = float(sum(self.dic[key].values()))
                for value in self.dic[key]:
                    self.dic[(key)][value] /= total
                

        generate_corpus()
    

    def prob(self, word, context=None):
        """This method returns probability of a word with given context: P(w_t | w_{t - 1}...w_{t - n + 1})

        For example:
        >>> lm.prob('hello', context=('world',))
        0.99988
        """
        
        if word in self.dic[tuple(context.split(' '))].keys():
            print(self.dic[tuple(context.split(' '))][word])
        else:
            print('There is no such sequence in corpus!')
        
    def generate_text(self, text_length):
        """This method generates random text of length 

        For example
        >>> lm.generate_text(2)
        hello world

        """
        text = list(list(self.dic.keys())[random.randint(0, len(self.dic))])
        endpoint = 0

        while len(text)<=text_length:
            prob = 0

            for word in self.dic[tuple(text[(self.n*(-1)):])].keys():
                prob += self.dic[tuple(text[(self.n*(-1)):])][word]

                if prob >= np.random.randn():
                    text.append(word)
                    break
        print(' '.join([w for w in text if w]))
    

    def update(self, sequence_of_tokens):
        """This method learns probabiities based on given sequence of tokents
    
        sequence_of_tokens -- iterable of tokens

        For example
        >>> lm.update(['hello', 'world'])
        """
        raise NotImplementedError
    
    def perplexity(self, sequence_of_tokens):
        """This method returns perplexity for a given sequence of tokens
    
        sequence_of_tokens -- iterable of tokens
        """
        raise NotImplementedError  

In [78]:
blm = BaseLM(3, tokenized)

In [79]:
blm.corpus

[[(None, None, None, 'before'),
  (None, None, 'before', 'we'),
  (None, 'before', 'we', 'proceed'),
  ('before', 'we', 'proceed', 'any'),
  ('we', 'proceed', 'any', 'further'),
  ('proceed', 'any', 'further', ','),
  ('any', 'further', ',', 'hear'),
  ('further', ',', 'hear', 'me'),
  (',', 'hear', 'me', 'speak'),
  ('hear', 'me', 'speak', '.'),
  ('me', 'speak', '.', None),
  ('speak', '.', None, None),
  ('.', None, None, None)],
 [(None, None, None, 'speak'),
  (None, None, 'speak', ','),
  (None, 'speak', ',', 'speak'),
  ('speak', ',', 'speak', '.'),
  (',', 'speak', '.', None),
  ('speak', '.', None, None),
  ('.', None, None, None)],
 [(None, None, None, 'you'),
  (None, None, 'you', 'are'),
  (None, 'you', 'are', 'all'),
  ('you', 'are', 'all', 'resolved'),
  ('are', 'all', 'resolved', 'rather'),
  ('all', 'resolved', 'rather', 'to'),
  ('resolved', 'rather', 'to', 'die'),
  ('rather', 'to', 'die', 'than'),
  ('to', 'die', 'than', 'to'),
  ('die', 'than', 'to', 'famish'),
  ('

In [80]:
blm.generate_text(100)

master , pindarus , in his beard , bid sorrow wag , cry 'hem ! ' when he should groan , patch grief with proverbs , make misfortune drunk with candle-wasters ; bring hI am to the rock , the oak not to be their words : they told me i should be , which pitifully disaster the cheeks . before thy coming lewis was henry 's friend . before we proceed any further , hear me speak . speak , cousin ; but 't is thought speak it


In [27]:
blm.prob('mutable', ': for the')

0.045454545454545456


In [45]:
blm.prob('home', 'i want to go')

There is no such sequence in corpus!


There is no such sequence in corpus!


In [41]:
X_train, X_test = train_test_split(tokenized, test_size=0.1, random_state=42)

In [42]:
train_data, train_padded_sents = padded_everygram_pipeline(4, X_train)
test_data, test_padded_sents = padded_everygram_pipeline(4, X_test)

In [32]:
model = Laplace(4)

model.fit(train_data, train_padded_sents)

In [33]:
model.score('be', 'shall you'.split())

0.00024164317358034635

In [34]:
for i, test in enumerate(test_data):
    print("PP({0}):{1}".format(X_test[i], model.perplexity(test)))

PP(['come', ',', 'sir', '.']):35.00027642110774
PP(['it', 'will', 'be', 'found', 'so', ',', 'master', 'page', '.', 'master', 'doctor', 'caius', ',', 'i', 'am', 'come', 'to', 'fetch', 'you', 'home', '.', 'i', 'am', 'sworn', 'of', 'the', 'peace', ':', 'you', 'have', 'showed', 'yourself', 'a', 'wise', 'physician', ',', 'and', 'sir', 'hugh', 'hath', 'shown', 'himself', 'a', 'wise', 'and', 'patient', 'churchman', '.', 'you', 'must', 'go', 'with', 'me', ',', 'master', 'doctor', '.']):1974.7940513585127
PP(['the', 'sooner', ',', 'sweet', ',', 'for', 'you', '.']):197.9087632645272
PP(['come', ',', 'for', 'the', 'third', ',', 'laertes', ':', 'you', 'but', 'dally', ';', 'i', 'pray', 'you', ',', 'pass', 'with', 'your', 'best', 'violence', ';', 'i', 'am', 'afeard', 'you', 'make', 'a', 'wanton', 'of', 'me', '.']):1177.40526926497
PP(['therefore', 'he', 'will', 'be', ',', 'timon', ':', 'his', 'honesty', 'rewards', 'him', 'in', 'itself', ';', 'it', 'must', 'not', 'bear', 'my', 'daughter', '.']):1293.

KeyboardInterrupt: 

In [45]:
words = [word for sent in tokenized for word in sent]
words.extend(["<s>", "</s>"])

padded_sents = Vocabulary(words)

train_data = [nltk.ngrams(t, 4, pad_right=True, pad_left=True, left_pad_symbol="<s>", right_pad_symbol="</s>") for t in tokenized]    ### Does not!

In [50]:
model = Laplace(4)

model.fit(train_data, padded_sents)

In [51]:
model.vocab.lookup(('shall', 'i', 'walmart', 'there','to', 'find', 'you', 'down'))

('shall', 'i', '<UNK>', 'there', 'to', 'find', 'you', 'down')

In [52]:
test_sentences = ['shall you be my', 'or die for']
tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) 
                for sent in test_sentences]

test_data, _ = padded_everygram_pipeline(4, tokenized_text)

for i, test in enumerate(test_data):
    print("PP({0}):{1}".format(test_sentences[i], model.perplexity(test)))

PP(shall you be my):23098.590750415962
PP(or die for):23592.00837402103


In [53]:
tokenized_text

[['shall', 'you', 'be', 'my'], ['or', 'die', 'for']]

In [54]:
model.score('be', 'shall you'.split())

3.85549600956163e-05

ValueError: Can't choose from empty population

In [57]:
n = 2
train_data = [nltk.bigrams(t,  pad_right=True, pad_left=True, left_pad_symbol="<s>", right_pad_symbol="</s>") for t in tokenized]
words = [word for sent in tokenized for word in sent]
words.extend(["<s>", "</s>"])
padded_vocab = Vocabulary(words)
model = MLE(n)
model.fit(train_data, padded_vocab)

In [65]:
test_sentences = ['an apple', 'an ant']
tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) for sent in test_sentences]

In [67]:
test_data = [nltk.bigrams(t,  pad_right=True, pad_left=True, left_pad_symbol="<s>", right_pad_symbol="</s>") for t in tokenized_text]
for test in test_data:
    print ("MLE Estimates:", [((ngram[-1], ngram[:-1]),model.score(ngram[-1], ngram[:-1])) for ngram in test])

MLE Estimates: [(('an', ('<s>',)), 0.00278296403283193), (('apple', ('an',)), 0.0023923444976076554), (('</s>', ('apple',)), 0.0)]
MLE Estimates: [(('an', ('<s>',)), 0.00278296403283193), (('ant', ('an',)), 0.0005980861244019139), (('</s>', ('ant',)), 0.0)]


In [68]:
train_sentences = ['an apple a day keeps doctors away', 'an orange on the tabe']
tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) for sent in train_sentences]

n = 5
train_data = [nltk.bigrams(t,  pad_right=True, pad_left=True, left_pad_symbol="<s>", right_pad_symbol="</s>") for t in tokenized_text]
words = [word for sent in tokenized_text for word in sent]
words.extend(["<s>", "</s>"])
padded_vocab = Vocabulary(words)
model = Laplace(n)
model.fit(train_data, padded_vocab)

test_sentences = ['an apple', 'a big black ant', 'on the table']
tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) for sent in test_sentences]

test_data = [nltk.bigrams(t,  pad_right=True, pad_left=True, left_pad_symbol="<s>", right_pad_symbol="</s>") for t in tokenized_text]
for test in test_data:
    print ("MLE Estimates:", [((ngram[-1], ngram[:-1]),model.score(ngram[-1], ngram[:-1])) for ngram in test])

test_data = [nltk.bigrams(t,  pad_right=True, pad_left=True, left_pad_symbol="<s>", right_pad_symbol="</s>") for t in tokenized_text]
for i, test in enumerate(test_data):
  print("PP({0}):{1}".format(test_sentences[i], model.perplexity(test)))

MLE Estimates: [(('an', ('<s>',)), 0.17647058823529413), (('apple', ('an',)), 0.11764705882352941), (('</s>', ('apple',)), 0.0625)]
MLE Estimates: [(('a', ('<s>',)), 0.058823529411764705), (('big', ('a',)), 0.0625), (('black', ('big',)), 0.06666666666666667), (('ant', ('black',)), 0.06666666666666667), (('</s>', ('ant',)), 0.06666666666666667)]
MLE Estimates: [(('on', ('<s>',)), 0.058823529411764705), (('the', ('on',)), 0.125), (('table', ('the',)), 0.0625), (('</s>', ('table',)), 0.06666666666666667)]
PP(an apple):9.168300902386457
PP(a big black ant):15.580038848434484
PP(on the table):13.44118434700527
