# N-GRAM - Probabilistic Language Modeling

- Let us build our first language model ...
- **Important**: for all of the tasks below please make use of the provided methods given by the `nltk` Python module
- This is a step-by-step coding example to get from raw text information to the final language model
- Have a look at the NLTK-LM documentation: https://www.nltk.org/api/nltk.lm.html

## Load and Extract the Data

- Use the provided text document, which contains a series of tweets from Donald Trump (`Donald-Trump-Tweets.csv`)
- Use the `Pandas` Python library to read the CSV file as a `pandas dataframe` object
- Select the column which contains the text information (`Tweet_Text`), leading to a table with 2 columns -- index and text information
- Also check for invalid entries and filter them out in advance (`notnull` function of the `pandas dataframe` object)
- Visualize the resulting filtered `Pandas` table
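
The column selection and the `notnull` filter can be tried on a tiny in-memory CSV first. This is a minimal sketch: the three sample rows are made up for illustration, and only the column name `Tweet_Text` is taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for Donald-Trump-Tweets.csv
sample_csv = io.StringIO(
    "Date,Tweet_Text,Retweets\n"
    "16-11-11,Busy day planned in New York.,28654\n"
    "16-11-11,,100\n"
    "15-07-16,Thank you!,431\n"
)
frame = pd.read_csv(sample_csv)

# Keep only the text column (double brackets return a DataFrame, not a Series)
text_only = frame[["Tweet_Text"]]

# Drop the rows whose text is missing (NaN)
clean = text_only[text_only["Tweet_Text"].notnull()]

print(clean)
tweets = clean["Tweet_Text"].to_list()
```

The row with the empty text field disappears, and the remaining table keeps its original index values next to the single text column.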

In [1]:
import pandas

trump_data = pandas.read_csv('Donald-Trump-Tweets.csv')
trump_data

Unnamed: 0,Date,Time,Tweet_Text,Type,Media_Type,Hashtags,Tweet_Id,Tweet_Url,twt_favourites_IS_THIS_LIKE_QUESTION_MARK,Retweets,Unnamed: 10,Unnamed: 11
0,16-11-11,15:26:37,Today we express our deepest gratitude to all ...,text,photo,ThankAVet,7.970000e+17,https://twitter.com/realDonaldTrump/status/797...,127213,41112,,
1,16-11-11,13:33:35,Busy day planned in New York. Will soon be mak...,text,,,7.970000e+17,https://twitter.com/realDonaldTrump/status/797...,141527,28654,,
2,16-11-11,11:14:20,Love the fact that the small groups of protest...,text,,,7.970000e+17,https://twitter.com/realDonaldTrump/status/797...,183729,50039,,
3,16-11-11,2:19:44,Just had a very open and successful presidenti...,text,,,7.970000e+17,https://twitter.com/realDonaldTrump/status/796...,214001,67010,,
4,16-11-11,2:10:46,A fantastic day in D.C. Met with President Oba...,text,,,7.970000e+17,https://twitter.com/realDonaldTrump/status/796...,178499,36688,,
...,...,...,...,...,...,...,...,...,...,...,...,...
7370,15-07-16,13:10:00,I loved firing goofball atheist Penn @pennjill...,text,,,6.220000e+17,https://twitter.com/realDonaldTrump/status/621...,953,431,,
7371,15-07-16,10:18:31,I hear @pennjillette show on Broadway is terri...,text,,,6.220000e+17,https://twitter.com/realDonaldTrump/status/621...,1175,1086,,
7372,15-07-16,10:10:17,Irrelevant clown @KarlRove sweats and shakes n...,text,,,6.220000e+17,https://twitter.com/realDonaldTrump/status/621...,1494,930,,
7373,15-07-16,9:44:07,"""@HoustonWelder: Donald Trump is one of the se...",text,,,6.220000e+17,https://twitter.com/realDonaldTrump/status/621...,1800,1738,,


In [2]:
# Keep only the rows with a valid (non-null) tweet text and display the filtered table
clean_trump_tweets = trump_data[trump_data['Tweet_Text'].notnull()]
clean_trump_tweets



In [3]:
# Build the tweet list from the filtered table so that invalid entries stay excluded
trump_tweets = clean_trump_tweets['Tweet_Text'].to_list()


## Text Preprocessing
- Clean and prepare (e.g. regex, nltk functionalities, etc.) the entire text information in order to guarantee a robust sentence tokenization (e.g. how to handle hyperlinks, hashtags, punctuation marks, non-alphanumeric patterns, Twitter references via @, upper/lower case, ...?)
- Perform the preprocessing in the same way for all three data partitions to ensure comparability
- Split the entire text information into a `training, validation, and unseen test dataset` (take the `first 6,375 lines for training`, the following `500 for validation`, and the remaining `500 for testing`)
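
One possible way to approach the cleaning step is a single function that is applied unchanged to every partition. The sketch below is a simplified illustration, not the full pipeline of this notebook: the placeholder tokens `hrefl`, `twhash`, and `usacc` follow the naming used in this notebook, while the function name `normalize_tweet`, the example tweet, and the simple punctuation rule are assumptions.

```python
import re


def normalize_tweet(text):
    """Lowercase a tweet and replace links, hashtags, and mentions with placeholder tokens."""
    text = text.lower()
    text = re.sub(r"http\S+", " hrefl ", text)   # hyperlinks
    text = re.sub(r"#\S+", " twhash ", text)     # hashtags
    text = re.sub(r"@\S+", " usacc ", text)      # user accounts
    text = re.sub(r"([.,:;!?])", r" \1 ", text)  # punctuation marks become separate tokens
    return re.sub(r"\s+", " ", text).strip()     # collapse repeated whitespace


example = "Thank you @FoxNews! Watch: https://t.co/abc #MAGA"
print(normalize_tweet(example))
```

For the example tweet this prints `thank you usacc watch : hrefl twhash`. Note that `@\S+` also swallows punctuation glued to the account name (the `!` here), which may or may not be what you want for sentence splitting.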

In [4]:
import re
 
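# Last character of a token followed by a punctuation mark and whitespace
sent_patt = re.compile(r'(?<!\.|\!|\?|\:|\;|\s)\w[.,:;!?]\s+')
# Runs of two or more symbols, e.g. "!!!" or "..."
multi_sym = re.compile(r'[!.,?=-]{2,}')
# Clock-like patterns such as 10:30, 10:30pm, or 7pm
# (note: [pm] and [am] are character classes, so lone letters as in "5m" match as well)
time = re.compile(r'([0-2][0-9]:[0-5][0-9]([pm]|[am])*)|([0-2]*[0-9]*:*[0-5][0-9]([pm]|[am])+)|([0-9][0-9]*([pm]|[am])+)')
# Slash-separated dates such as 11/8/16
date = re.compile(r'([0-3]*[0-9]\/[0-9]*\/[0-9]+)')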

for idx in range(len(trump_tweets)):
    tweet = trump_tweets[idx]
    
    tweet = " "+tweet.lower()+" "
    
    # Normalize line breaks, quotes, separators, and common abbreviations (the order matters)
    replacements = [
        ("\n", " "), ('"', ""), ("“", ""), ("”", ""), ("|", " "), ("`", " "),
        ("'", " "), (":_", " "), ("_", " "), (" rt ", " retweet "),
        (" mrs. ", " mrs "), (" ms. ", " ms "), (" mr. ", " mr "),
        (" dr. ", " dr "), (" prof. ", " prof "), (" dr.-ing. ", " dr.-ing "),
    ]
    for old, new in replacements:
        tweet = tweet.replace(old, new)
    tweet = re.sub(r'http\S+', ' hrefl ', tweet) #links
    tweet = re.sub(r'#\S+', ' twhash ', tweet) #hashtag
    tweet = re.sub(r'@\S+', ' usacc ', tweet) #useraccount
    
    all_sym = multi_sym.finditer(tweet)
    all_time = time.finditer(tweet)
    all_date = date.finditer(tweet)
    
    for m in all_sym:
        tweet = tweet.replace(m.group(), ' '+m.group()[0]+' ', 1)
    for m in all_time:
        tweet = tweet.replace(m.group(), ' tiform ', 1)
    for m in all_date:
        tweet = tweet.replace(m.group(), ' dtform ', 1)
    tweet = " "+tweet+" "
    all_pts = sent_patt.finditer(tweet)
    for m in all_pts:
        tweet = tweet.replace(m.group(), m.group()[0]+' '+m.group()[1]+' ', 1)

    tweet = re.sub(r"\s\s+", " ", tweet).lower()
    trump_tweets[idx] = tweet.strip()

In [5]:
train = trump_tweets[0:6375]
val = trump_tweets[6375:6875]
test = trump_tweets[6875:]

print("Train Size:", len(train))
print("Val Size:", len(val))
print("Test Size:", len(test))

Train Size: 6375
Val Size: 500
Test Size: 500


## Sentence Tokenization and Padding
- Join all the tweet text information and perform sentence tokenization
- Report the number of sentences
- Use the list of detected sentences and either consistently remove all punctuation marks, or leave them in the corpus and treat them as individual words
- Add the required sentence start (`<s>`) and sentence end (`</s>`) symbols to each sentence in the list (`N-Gram order = 3`)
- Convert the list of sentences (list of strings) into a nested list, i.e. a list of sentences in which each sentence is represented as a list of words (word vector $\vec{w}$)
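
To see what the padding does for `N=3`, here is a pure-Python sketch of both steps. The helper names `split_sentences` and `pad_both_ends_sketch` are hypothetical stand-ins written for this illustration; the second one mimics the effect of `pad_both_ends` for order `n`, which adds `n-1` symbols on each side.

```python
def pad_both_ends_sketch(words, n):
    """Add n-1 start symbols and n-1 end symbols around a list of words."""
    return ["<s>"] * (n - 1) + list(words) + ["</s>"] * (n - 1)


def split_sentences(tweet):
    """Split a preprocessed tweet at stand-alone '.', '!' and '?' tokens."""
    sentences, current = [], []
    for token in tweet.split():
        if token in {".", "!", "?"}:
            if current:               # skip empty sentences
                sentences.append(current)
            current = []
        else:
            current.append(token)
    if current:                       # sentence without a final punctuation mark
        sentences.append(current)
    return sentences


tweet = "make america great again ! thank you ."
padded = [pad_both_ends_sketch(sentence, 3) for sentence in split_sentences(tweet)]
for sentence in padded:
    print(sentence)
```

Each sentence ends up with two `<s>` and two `</s>` symbols, which is why the padded training data contains twice as many `<s>` tokens as sentences.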

In [6]:
from nltk import sent_tokenize
from nltk.lm.preprocessing import pad_both_ends 

def sentenize(partition_list):
    # Split every tweet of the partition into sentences at " .", " !" and " ?" and count the words
    num_words_partition = 0
    sentences_partition = []
    for partition_tweet in partition_list:
        tweet_proc = partition_tweet.replace(" !", " .").replace(" ?", " .")
        sent_part = tweet_proc.split(" .")
        sent_part = list(filter(None, sent_part))
        for sent in sent_part:
            words = sent.strip().split(" ")
            sentences_partition.append(words)
            num_words_partition += len(words)
            
    return sentences_partition, num_words_partition
            
#Training
sentences_train, num_words_train = sentenize(train)

print("Words in Training=", num_words_train)
print("Number Sentences in Training=", len(sentences_train))
print("Average Number Words per Sentence in Training=", num_words_train/len(sentences_train))
print()

#Validation
sentences_val, num_words_val = sentenize(val)

print("Words in Validation=", num_words_val)
print("Number Sentences in Validation=", len(sentences_val))
print("Average Number Words per Sentence in Validation=", num_words_val/len(sentences_val))
print()

#Test
sentences_test, num_words_test = sentenize(test)

print("Words in Test=", num_words_test)
print("Number Sentences in Test=", len(sentences_test))
print("Average Number Words per Sentence in Test=", num_words_test/len(sentences_test))

Words in Training= 113740
Number Sentences in Training= 12623
Average Number Words per Sentence in Training= 9.010536322585756

Words in Validation= 8831
Number Sentences in Validation= 954
Average Number Words per Sentence in Validation= 9.256813417190775

Words in Test= 8687
Number Sentences in Test= 918
Average Number Words per Sentence in Test= 9.462962962962964


In [7]:
def pad_sentence(sentences_dpart):
    padded_dpart_sentences = []
    for dpart_sentence in sentences_dpart:
        padded_sentence = list(pad_both_ends(dpart_sentence, 3))
        padded_dpart_sentences.append(padded_sentence)
    return padded_dpart_sentences
    
padded_train_tweets = pad_sentence(sentences_train)
padded_val_tweets = pad_sentence(sentences_val)
padded_test_tweets = pad_sentence(sentences_test)

## Generate Vocabulary

- Use the entire padded sentence information (nested list) to build one single list object with all the words (use the `flatten` function from `nltk.lm.preprocessing`)
- Create the vocabulary for our language model; each word which does not show up more than `N times (N=1, N=2, ...)` within the entire corpus (what happens if `N > 1`?) should not be part of the vocabulary and should be mapped to the `<UNK>` category/tag (use the `unk_cutoff` option)
- Use the `Vocabulary` object of `nltk.lm` to realize the requirements
- Report the size of your entire vocabulary (`|V|`) as well as the number of unique elements, together with the top-N most frequent elements (without considering sentence start `<s>` and sentence end `</s>`) - What can you observe regarding the type of words which are the most frequent?
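
The effect of the cutoff can be reproduced with a plain `Counter`. This sketch assumes the rule documented for `nltk.lm.Vocabulary`, namely that tokens with a count greater than or equal to `unk_cutoff` stay in the vocabulary; the helper names `build_vocabulary` and `lookup` and the toy sentence are made up for illustration.

```python
from collections import Counter


def build_vocabulary(tokens, unk_cutoff):
    """Return the set of words seen at least unk_cutoff times, plus the <UNK> label."""
    counts = Counter(tokens)
    vocab = {word for word, count in counts.items() if count >= unk_cutoff}
    vocab.add("<UNK>")
    return vocab


def lookup(tokens, vocab):
    """Map every out-of-vocabulary token to <UNK>."""
    return [token if token in vocab else "<UNK>" for token in tokens]


tokens = "make america great again and make america safe again".split()
vocab_1 = build_vocabulary(tokens, unk_cutoff=1)  # keeps every observed word
vocab_2 = build_vocabulary(tokens, unk_cutoff=2)  # words seen only once become <UNK>

print(len(vocab_1), len(vocab_2))
print(lookup(tokens, vocab_2))
```

With a cutoff of 1 nothing from the training data is mapped to `<UNK>`, so the model never sees that token during training. A cutoff of 2 or more shrinks the vocabulary and gives `<UNK>` real counts, which helps with unseen words at evaluation time.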

In [8]:
from nltk.lm import Vocabulary
from nltk.lm.preprocessing import flatten

train_words = list(flatten(padded_train_tweets))

In [9]:
train_vocabulary = Vocabulary(train_words, unk_cutoff=1)
print("Number of unique words in the Vocabulary=", len(train_vocabulary))
# Sum of all token counts, i.e. the size of the training corpus including the padding symbols
number_words_vocabulary = sum(train_vocabulary.counts.values())
print("Total number of tokens counted by the Vocabulary=", number_words_vocabulary)

Number of unique words in the Vocabulary= 8548
Total number of tokens counted by the Vocabulary= 164232


In [10]:
most_common = train_vocabulary.counts.most_common(n=134)
print(most_common)

[('<s>', 25246), ('</s>', 25246), ('usacc', 6198), ('the', 3343), (',', 3123), ('hrefl', 2830), ('twhash', 2820), ('to', 2136), ('in', 1685), ('a', 1665), ('and', 1619), ('is', 1527), ('you', 1525), ('i', 1518), ('of', 1352), ('on', 1247), ('for', 1186), ('will', 1170), ('be', 946), ('great', 915), ('trump', 911), ('thank', 827), (':', 769), ('-', 671), ('at', 650), ('that', 642), ('we', 606), ('it', 582), ('with', 580), ('are', 550), (';', 544), ('hillary', 519), ('&amp', 517), ('me', 512), ('have', 484), ('he', 482), ('my', 465), ('so', 448), ('just', 444), ('not', 439), ('all', 438), ('this', 428), ('was', 424), ('america', 417), ('by', 377), ('people', 375), ('new', 363), ('has', 360), ('our', 348), ('out', 339), ('from', 311), ('your', 305), ('make', 302), ('clinton', 296), ('very', 294), ('tiform', 294), ('they', 288), ('poll', 284), ('no', 283), ('again', 282), ('about', 277), ('his', 277), ('.', 277), ('retweet', 272), ('now', 271), ('who', 270), ('get', 268), ('as', 263), ('do

## Compute N-Grams

- Use the padded sentence information and compute, for each sentence represented as word vector $\vec{w}$ (list of words), all the N-Gram patterns using the `everygrams` function from `nltk` (the total N-Gram information should include `Unigram, Bigram, and Trigram` - `N=3`)
- Analyze the output (`generator object`) for each sentence and convert it to a list object
- Append each sentence-wise list with all its N-Gram information to another list, representing your final training data
- Compute the counts for all the Unigrams, Bigrams, and Trigrams (use the `Counter` class from `collections` together with the list-converted output of `everygrams`)
- Visualize the `top-N` most frequent Unigrams, Bigrams, and Trigrams (use a `matplotlib` barplot)
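
The counting and plotting steps could look roughly like the sketch below. `everygrams_sketch` is a pure-Python stand-in written for this illustration (it is not the `nltk` function), and `top_ngrams` and `plot_top_ngrams` are hypothetical helper names. `matplotlib` is imported inside the plotting function, so the counting part runs without it.

```python
from collections import Counter


def everygrams_sketch(words, min_len, max_len):
    """Yield all n-grams with min_len..max_len words as tuples."""
    for n in range(min_len, max_len + 1):
        for start in range(len(words) - n + 1):
            yield tuple(words[start:start + n])


def top_ngrams(sentences, order, top_n):
    """Count the n-grams of one order over all sentences and return the top_n most frequent."""
    counts = Counter()
    for sentence in sentences:
        counts.update(everygrams_sketch(sentence, order, order))
    return counts.most_common(top_n)


def plot_top_ngrams(pairs, title):
    """Draw a horizontal bar plot of (ngram, count) pairs, most frequent on top."""
    import matplotlib.pyplot as plt  # imported lazily, only needed for plotting

    labels = [" ".join(gram) for gram, _ in pairs]
    values = [count for _, count in pairs]
    plt.barh(labels[::-1], values[::-1])
    plt.title(title)
    plt.xlabel("count")
    plt.tight_layout()
    plt.show()


sentences = [
    ["<s>", "<s>", "make", "america", "great", "again", "</s>", "</s>"],
    ["<s>", "<s>", "make", "america", "safe", "again", "</s>", "</s>"],
]
print(top_ngrams(sentences, order=2, top_n=3))
```

Calling `plot_top_ngrams(top_ngrams(sentences, 2, 10), "Top bigrams")` would then produce the bar plot for one N-Gram order; repeat it for orders 1 and 3.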

In [11]:
from collections import Counter
from nltk import everygrams

def compute_n_grams(padded_datapart_sentences, minN, maxN):
    # Accumulate the n-gram counts over all sentences of one data partition
    ngram_counts_dpart = Counter()
    everygram_padded_dpart_tweets = []
    for padded_dpart_tweets in padded_datapart_sentences:
        everygram_dpart = list(everygrams(padded_dpart_tweets, min_len=minN, max_len=maxN))
        everygram_padded_dpart_tweets.append(everygram_dpart)
        ngram_counts_dpart.update(everygram_dpart)

    return ngram_counts_dpart, everygram_padded_dpart_tweets

ngram_counts_train, everygram_padded_train_tweets = compute_n_grams(padded_train_tweets, 1, 3)
ngram_counts_val, everygram_padded_val_tweets = compute_n_grams(padded_val_tweets, 1, 3)
ngram_counts_test, everygram_padded_test_tweets = compute_n_grams(padded_test_tweets, 1, 3)

## Build N-Gram Language Model

- Create an N-Gram language model using the `MLE` (Maximum-Likelihood-Estimator) class from `nltk.lm`
- Train the model by calling the `fit` function, which takes two arguments here - the (nested) sentence-related list with all N-Grams per sentence (see result of Section `Compute N-Grams`), and the vocabulary (see result of Section `Generate Vocabulary`)
- Analyze your trained model using different functionalities, e.g. `counts`, `score`, `logscore`, `vocab`, together with different word sequences, also including words which are not in the vocabulary, in order to make sure that the model outputs are valid
- Manually evaluate/verify the probabilities for `Unigram`, `Bigram`, and `Trigram` (in case of the `MLE` model) for the following expression: `"make america great"` - Do they match the `model.counts` and `model.score` values? What happens in the special case of the bigram probability for `<s> <s>` and `</s> </s>`? Are those counts and score values the same? Explain what you are observing.
- There are also more sophisticated language models, such as `KneserNeyInterpolated`, `Laplace`, `Lidstone`, `WittenBellInterpolated`, `AbsoluteDiscountingInterpolated` (see https://www.nltk.org/api/nltk.lm.html)
- Also train a `KneserNeyInterpolated` language model. What do you observe when computing probabilities, compared to the `MLE` version (key word: smoothing)?
- **Homework:** also have a look at the other LM alternatives (mentioned before) and compare the results! Also have a look at the different `smoothing` options provided by the `nltk.lm` module
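
The manual check boils down to relative frequencies. The sketch below uses the counts reported by `model.counts` in this notebook; the padding part is derived from the 12623 training sentences padded for `N=3`, and it assumes that the conditional probability is normalized by the number of times the context is followed by any word. The add-one line at the end is a simplified illustration of smoothing, not the Kneser-Ney estimate.

```python
# Counts reported by model.counts in this notebook
count_make = 302
count_america = 417
count_make_america = 214         # bigram "make america"
count_america_great = 179        # bigram "america great"
count_make_america_great = 175   # trigram "make america great"
total_tokens = 164232            # all training tokens, padding included

# MLE estimates are relative frequencies
p_america = count_america / total_tokens
p_america_given_make = count_make_america / count_make
p_great_given_america = count_america_great / count_america
p_great_given_make_america = count_make_america_great / count_make_america
print(p_america, p_america_given_make, p_great_given_america, p_great_given_make_america)

# Padding asymmetry (derived from 12623 sentences with two <s> and two </s> each)
num_sentences = 12623
count_start_start = num_sentences        # one ("<s>", "<s>") bigram per sentence
followers_of_start = 2 * num_sentences   # every "<s>" is followed by some token
count_end_end = num_sentences            # one ("</s>", "</s>") bigram per sentence
followers_of_end = num_sentences         # the last "</s>" of a sentence has no successor

p_start_given_start = count_start_start / followers_of_start
p_end_given_end = count_end_end / followers_of_end
print(p_start_given_start, p_end_given_end)

# Add-one (Laplace) estimate for a word never seen after "make america",
# assuming the vocabulary size of 8549 items reported after fitting
vocab_size = 8549
p_unseen_mle = 0 / count_make_america
p_unseen_laplace = (0 + 1) / (count_make_america + vocab_size)
print(p_unseen_mle, p_unseen_laplace)
```

Both padding bigrams occur equally often, but their probabilities differ (0.5 versus 1.0) because `<s>` has twice as many continuations as `</s>`. The last two values show the core idea of smoothing: the unseen event gets a small non-zero probability instead of zero.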

In [12]:
from nltk.lm import MLE, KneserNeyInterpolated, Laplace, Lidstone, AbsoluteDiscountingInterpolated, WittenBellInterpolated
model = MLE(3) #other LM options are possible here

In [13]:
model.fit(everygram_padded_train_tweets, train_vocabulary)

In [14]:
print("Size of unique words is the vocabulary=", model.vocab)
print("Total number of N-Grams=", model.counts)
print("--- Check if the sentence is within the vocabulary, otherwise replace by <UNK> ---")
print(model.vocab.lookup('nlp is in america also a common thing god bless america'.split()))
print()
print("Number of occurrences of the word \"make\" within the training dataset=", model.counts["make"])
print("Number of occurrences of the word \"america\" within the training dataset=", model.counts["america"])
print("Number of occurrences of the word \"great\" within the training dataset=", model.counts["great"])
print()
print("Number of occurrences of the word \"<s>\" within the training dataset=", model.counts["<s>"])
print("Number of occurrences of the word \"</s>\" within the training dataset=", model.counts["</s>"])
print()
print("Number of occurrences of the word \"make america\" within the training dataset=", model.counts[['make']]['america'])
print("Number of occurrences of the word \"america great\" within the training dataset=", model.counts[['america']]['great'])
print("Number of occurrences of the word \"make america great\" within the training dataset=", model.counts[['make', 'america']]['great'])
print()
print("Number of occurrences of the word \"<s> <s>\" within the training dataset=", model.counts[['<s>']]['<s>'])
print("Number of occurrences of the word \"</s> </s>\" within the training dataset=", model.counts[['</s>']]['</s>'])
print()
print("Total number of tokens in the training dataset (including padding)=", number_words_vocabulary)
print()
print("Unigram probability P(make)=", model.score("make"))
print("Unigram probability P(america)=", model.score("america"))
print("Unigram probability P(great)=", model.score("great"))
print()
print("Conditional bigram probability P(america | make)=", model.score('america', ['make']))
print("Conditional bigram probability P(great | america)=", model.score('great', ['america']))
print()
print("Conditional trigram probability P(great | make america)=", model.score('great', 'make america'.split()))
print()
print("Conditional bigram probability P(<s> | <s>)=", model.score('<s>', ('<s>',)))
print("Conditional bigram probability P(</s> | </s>)=", model.score('</s>', ('</s>',)))
print()

Size of unique words is the vocabulary= <Vocabulary with cutoff=1 unk_label='<UNK>' and 8549 items>
Total number of N-Grams= <NgramCounter with 3 ngram orders and 454827 ngrams>
--- Check if the sentence is within the vocabulary, otherwise replace by <UNK> ---
('<UNK>', 'is', 'in', 'america', 'also', 'a', 'common', 'thing', 'god', 'bless', 'america')

Number of occurrences of the word "make" within the training dataset= 302
Number of occurrences of the word "america" within the training dataset= 417
Number of occurrences of the word "great" within the training dataset= 915

Number of occurrences of the word "<s>" within the training dataset= 25246
Number of occurrences of the word "</s>" within the training dataset= 25246

Number of occurrences of the word "make america" within the training dataset= 214
Number of occurrences of the word "america great" within the training dataset= 179
Number of occurrences of the word "make america great" within the training dataset= 175

Number of occ

## Evaluation of the Language Model 

- First, apply the trained language model and compute the `perplexity` for the `unigram "america"`. Check whether the perplexity really is the inverse of the probability which has been calculated before via `model.score`. Moreover, take the sentence `<s> <s> make america great again </s> </s>` and convert all its bigrams into a list `[('<s>', '<s>'), ('<s>', 'make'), ...]`, which is then used as input for the `perplexity` computation.
- Second, use the trained language model together with your `validation set` and compute the `perplexity` for each sentence in the validation set using the `everygram` output (compute the total perplexity and the average across all sentences)
- If the performance is not very promising and/or you are observing a lot of `inf` values (division by zero in the perplexity equation), have a look at your pipeline and try to further optimize the text preprocessing and parametric setup: reduce the complexity of the vocabulary (e.g. increase `unk_cutoff`), use categorical approaches (e.g. `NER`), change the size of `N`, use a LM with an integrated smoothing concept (e.g. `Laplace`, etc.), and look for any other strange behaviors
- Once the performance on the validation corpus is satisfying, verify your model also on the final and unseen `test set` (via `perplexity` and same approach) - Large deviations?
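
It helps to recompute the perplexity by hand. The sketch below uses the definition "2 to the power of the negative average log2 probability" and feeds it the per-bigram perplexities that this notebook prints for `<s> <s> make america great again </s> </s>`; the function name `perplexity` is a local helper, not the model method.

```python
import math


def perplexity(probabilities):
    """Perplexity of a sequence: 2 ** (-average log2 probability)."""
    if any(p == 0.0 for p in probabilities):
        return math.inf  # a single zero probability makes the whole sequence infinite
    avg_log2 = sum(math.log2(p) for p in probabilities) / len(probabilities)
    return 2.0 ** (-avg_log2)


# Per-bigram perplexities printed in this notebook; each one equals 1 / probability
single_perplexities = [2.0, 345.83561643835645, 1.411214953271028,
                       2.3296089385474863, 4.420289855072464,
                       1.194915254237288, 1.0]
probabilities = [1.0 / value for value in single_perplexities]

joint = perplexity(probabilities)  # geometric mean of the 1/p values
arithmetic = sum(single_perplexities) / len(single_perplexities)
print(joint, arithmetic)
```

The joint value (about 3.83) is the geometric mean of the single perplexities, while the arithmetic average (about 51.17) is dominated by the one rare bigram `('<s>', 'make')`. The two numbers therefore answer different questions and should not be compared directly.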

In [15]:
unigram = [("america",)]
bigrams  = [('<s>', '<s>'), ('<s>', 'make'), ('make', 'america'), ('america', 'great'), ('great', 'again'), ('again', '</s>'), ('</s>', '</s>')]

print(model.perplexity(unigram))
print(model.perplexity(bigrams))
print()

bigram = 0
for bi in bigrams:
    print("Perplexity Bigram=",model.perplexity([bi]))
    bigram += model.perplexity([bi])
print()
print("Averaged Bigram-Perplexity:", bigram/len(bigrams))

393.84172661870537
3.826439252653616

Perplexity Bigram= 2.0
Perplexity Bigram= 345.83561643835645
Perplexity Bigram= 1.411214953271028
Perplexity Bigram= 2.3296089385474863
Perplexity Bigram= 4.420289855072464
Perplexity Bigram= 1.194915254237288
Perplexity Bigram= 1.0

Averaged Bigram-Perplexity: 51.17023506278353


In [16]:
import math

def get_perplexity(everygram_datadist):
    # Accumulators per n-gram order (1 = unigram, 2 = bigram, 3 = trigram)
    total = {1: 0, 2: 0, 3: 0}    # all n-grams seen
    valid = {1: 0, 2: 0, 3: 0}    # n-grams with a finite perplexity
    summed = {1: 0, 2: 0, 3: 0}   # sum of the finite perplexities

    for grams in everygram_datadist:
        for gram in grams:
            order = len(gram)
            if order not in total:
                continue
            perplex = model.perplexity([gram])
            total[order] += 1
            # Zero-probability n-grams yield an infinite perplexity and are skipped in the sum
            if not math.isinf(perplex):
                summed[order] += perplex
                valid[order] += 1

    return (summed[1], summed[2], summed[3],
            valid[1], valid[2], valid[3],
            total[1], total[2], total[3])

v_unigram, v_bigram, v_trigram, v_val_uni, v_val_bi, v_val_tri, v_total_uni, v_total_bi, v_total_tri = get_perplexity(everygram_padded_val_tweets)
t_unigram, t_bigram, t_trigram, t_val_uni, t_val_bi, t_val_tri, t_total_uni, t_total_bi, t_total_tri = get_perplexity(everygram_padded_test_tweets)

print("--------------------VALIDATION SET--------------------------")
print("Number of known/non-zero Unigrams/Probabilities=", v_val_uni)
print("Number of known/non-zero Bigrams/Probabilities=", v_val_bi)
print("Number of known/non-zero Trigrams/Probabilities=", v_val_tri)
print()
print("Perplexity Unigram-Level (Total-Uni="+str(v_total_uni)+", Valid-Uni="+str(v_val_uni)+")=", v_unigram/v_val_uni)
print("Perplexity Bigram-Level (Total-Bi="+str(v_total_bi)+", Valid-Bi="+str(v_val_bi)+")=", v_bigram/v_val_bi)
print("Perplexity Trigram-Level (Total-Tri="+str(v_total_tri)+", Valid-Tri="+str(v_val_tri)+")=", v_trigram/v_val_tri)
print()
print()

print("--------------------TEST SET--------------------------")
print("Number of known/non-zero Unigrams/Probabilities=", t_val_uni)
print("Number of known/non-zero Bigrams/Probabilities=", t_val_bi)
print("Number of known/non-zero Trigrams/Probabilities=", t_val_tri)
print()
print("Perplexity Unigram-Level (Total-Uni="+str(t_total_uni)+", Valid-Uni="+str(t_val_uni)+")=", t_unigram/t_val_uni)
print("Perplexity Bigram-Level (Total-Bi="+str(t_total_bi)+", Valid-Bi="+str(t_val_bi)+")=", t_bigram/t_val_bi)
print("Perplexity Trigram-Level (Total-Tri="+str(t_total_tri)+", Valid-Tri="+str(t_val_tri)+")=", t_trigram/t_val_tri)
print()
print()

--------------------VALIDATION SET--------------------------
Number of known/non-zero Unigrams/Probabilities= 12184
Number of known/non-zero Bigrams/Probabilities= 8536
Number of known/non-zero Trigrams/Probabilities= 4624

Perplexity Unigram-Level (Total-Uni=12647, Valid-Uni=12184)= 6413.056553918828
Perplexity Bigram-Level (Total-Bi=11693, Valid-Bi=8536)= 264.1471346939859
Perplexity Trigram-Level (Total-Tri=10739, Valid-Tri=4624)= 143.12515433141965


--------------------TEST SET--------------------------
Number of known/non-zero Unigrams/Probabilities= 11862
Number of known/non-zero Bigrams/Probabilities= 8073
Number of known/non-zero Trigrams/Probabilities= 4249

Perplexity Unigram-Level (Total-Uni=12359, Valid-Uni=11862)= 6289.107093400637
Perplexity Bigram-Level (Total-Bi=11441, Valid-Bi=8073)= 298.82240277636356
Perplexity Trigram-Level (Total-Tri=10523, Valid-Tri=4249)= 172.91214269608065




## Artificial Text Generation
- Use the `generate` functionality together with the `MLE` language model to artificially produce Trump tweets with a user-defined number of words
- You might face a lot of sentence start `<s>` and sentence end `</s>` tokens. Optimize the generated output by ignoring and eliminating these tokens, and produce sentences of at least 15 words during generation (without `<s>` and `</s>`)
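
To get a feeling for what happens during generation, here is a toy bigram sampler in pure Python. It is a simplified illustration (bigrams instead of trigrams, three made-up sentences) and not the `nltk` implementation; the helper names `train_bigram_counts` and `generate_words` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def train_bigram_counts(sentences):
    """Count, for every word, which words follow it."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        for prev, nxt in zip(sentence, sentence[1:]):
            followers[prev][nxt] += 1
    return followers


def generate_words(followers, min_words, seed):
    """Sample from the bigram counts until min_words non-padding words are collected."""
    rng = random.Random(seed)
    words, current = [], "<s>"
    while len(words) < min_words:
        options = followers.get(current)
        if not options:          # dead end such as "</s>": start a new sentence
            current = "<s>"
            continue
        # Draw the next word proportionally to its count after the current word
        current = rng.choices(list(options), weights=list(options.values()))[0]
        if current not in ("<s>", "</s>"):
            words.append(current)
    return words


corpus = [
    ["<s>", "make", "america", "great", "again", "</s>"],
    ["<s>", "make", "america", "safe", "again", "</s>"],
    ["<s>", "thank", "you", "america", "</s>"],
]
sample = generate_words(train_bigram_counts(corpus), min_words=15, seed=4)
print(" ".join(sample))
```

Instead of generating a long sequence and filtering afterwards, this variant skips the padding tokens while sampling, so the result always has exactly the requested number of words.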

In [17]:
generated_text_example_1 = model.generate(3, random_seed=4)
generated_text_example_2 = model.generate(10, random_seed=23)
generated_text_example_3 = model.generate(15, random_seed=154)

print(generated_text_example_1)
print(generated_text_example_2)
print(generated_text_example_3)

['<s>', '<s>', 'make']
['usacc', 'were', 'ready', 'to', 'make', 'america', 'great', 'again', '</s>', '</s>']
['<s>', 'usacc', 'steam', ':', 'as', 'a', 'paragon', 'of', 'virtue', 'just', 'shows', 'how', 'weak', 'and', 'open-and']


In [18]:
import random 

generated_result = []
while True:
    rand_seed = random.randint(1, 1000)
    init_gen = model.generate(40, random_seed=rand_seed)
    # Drop the sentence start/end padding tokens
    generated_result = [token for token in init_gen if token not in ('<s>', '</s>')]
    # Accept only outputs with at least 15 remaining words
    if len(generated_result) >= 15:
        break

In [19]:
print(generated_result)

['with', 'the', 'fantastic', 'ratings', 'last', 'weekend', ',', 'usacc', 'of', 'course', 'there', 'is', 'nobody', 'more', 'against', 'obamacare', 'than', 'me']


## Save and Store your final Model
- Store the trained and final language model by pickling the output using the `dill` Python module
- Call the `dump` function to write the model to a specific output path
- Call the `load` method to load your model, based on a given input path (verify the model after loading!)
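
The save/load/verify cycle can be rehearsed with the standard `pickle` module, which offers the same `dump` and `load` calls as `dill`. In this sketch the "model" is only a hypothetical stand-in (a dictionary with two N-Gram counts taken from this notebook), and the file is written to a temporary directory.

```python
import os
import pickle
import tempfile
from collections import Counter

# Stand-in for a trained model: plain n-gram counts
toy_model = {
    "order": 3,
    "counts": Counter({("make", "america"): 214, ("america", "great"): 179}),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-ngram-lm.pkl")
    with open(path, "wb") as fout:
        pickle.dump(toy_model, fout)
    with open(path, "rb") as fin:
        toy_loaded = pickle.load(fin)

# Verify the loaded object (not the original one)
print(toy_loaded == toy_model)
print(toy_loaded["counts"][("make", "america")])
```

The important detail is that the checks run on the loaded object; printing the original model after loading would not prove that the file is intact.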

In [20]:
import dill as pickle

with open('My-NGram-LM.pkl', 'wb') as fout:
    pickle.dump(model, fout)

In [21]:
with open('My-NGram-LM.pkl', 'rb') as fin:
    model_loaded = pickle.load(fin)

In [22]:
# Verify the loaded model, not the one that is still in memory
print(model_loaded.vocab)
print(model_loaded.counts)

<Vocabulary with cutoff=1 unk_label='<UNK>' and 8549 items>
<NgramCounter with 3 ngram orders and 454827 ngrams>
