# N-GRAM - Probabilistic Language Modeling

- Let us build our first language model ...
- **Important**: for all of the tasks below please make use of the provided methods given by the `nltk` Python module
- This is a step-by-step coding example to get from raw text information to the final language model
- Have look at the NLTK-LM documentation: https://www.nltk.org/api/nltk.lm.html

## Load and Extract the Data

- Use the provided text document including a series of different Tweets from Donald Trupm (`Donald-Trump-Tweets.csv`)
- Use the `Pandas` Python library to read the CSV-file as a `pandas dataframe` object
- Filter the column which contains the text information (`Tweet_Text`), leading to a Table with 2 columns -- index and text information
- Verify also for invalid entries and filter them in advance (`notnull`function of the `pandas dataframe` object)
- Visualize the resulting and filtered `Pandas` Table

In [1]:
import nltk
import pandas as pd
import matplotlib.pyplot as plt

def read_data(file):
    tweet_dataframe = pd.read_csv(file)
    tweet_table = tweet_dataframe["Tweet_Text"].dropna()
    return tweet_table

tweet_table = read_data("Donald-Trump-Tweets.csv")
print(tweet_table)

0       Today we express our deepest gratitude to all ...
1       Busy day planned in New York. Will soon be mak...
2       Love the fact that the small groups of protest...
3       Just had a very open and successful presidenti...
4       A fantastic day in D.C. Met with President Oba...
                              ...                        
7370    I loved firing goofball atheist Penn @pennjill...
7371    I hear @pennjillette show on Broadway is terri...
7372    Irrelevant clown @KarlRove sweats and shakes n...
7373    "@HoustonWelder: Donald Trump is one of the se...
7374    RT @marklevinshow: Trump: Rove is a clown and ...
Name: Tweet_Text, Length: 7375, dtype: object


## Text Preprocessing
- Clean and prepare (e.g. regex, nltk functionalities, etc.) the entire text information, in order to gurantee a robust sentence tokenization (e.g. how to handle hyperlinks, hashtags, punctuation marks, non-alphanumeric patterns, twitter references via @, upper/lower case, ...?)
- Perform the entire preprocessing in the same way for all three data partitions to ensure comparability
- Split the entire text information into a `training, validation, and unseen test dataset` (take the `first 6,375 lines for training`, the following `500 for validation`, and the remaining `500 for testing`)

In [2]:
import re

trash_xps = [r'http\S+', r'@[\w_]+', r'#[\w_]+']

def split_data(tweet_table):
    return tweet_table[0:6375], tweet_table[6375:6875], tweet_table[6875:7376]

def rem_trash(patterns, sents):
    cleaned_sents = sents
    for pattern in patterns:
        cleaned_sents = [re.sub(pattern, '', sent) for sent in cleaned_sents]
    return cleaned_sents

training, validation, testing = split_data(tweet_table)

training_cleaned = rem_trash(trash_xps, training)
validation_cleaned = rem_trash(trash_xps, validation)
testing_cleaned = rem_trash(trash_xps, testing)

## Sentence Tokenization and Padding
- Joint all the tweet text information and perform sentence tokenization
- Report the number of sentences
- Use the list of detected and individual sentences and either consistently remove any type of punctuation marks, or leave them in the original corpus and treat them as individual words  
- Integrate the required sentence start (`<s>`) and sentence end (`</s>`) for each sentence in the list (`N-Gram order = 3`)
- Convert the list of sentences (list of strings) into a nested list, describing a list of sentences, while each sentence is represented as list of words (word vector $\vec{w}$)

In [3]:
import string
from nltk.lm.preprocessing import pad_both_ends

def preprocess_data(text):
    return nltk.sent_tokenize(text)
    #return [nltk.sent_tokenize(tweet_table[i]) for i in range(len(tweet_table))]

def rem_punctuation(sents_list):
    translator = str.maketrans('', '', string.punctuation)
    return [sent.translate(translator) for sent in sents_list]

def sents_to_nested_list(sents_list):
    return [nltk.word_tokenize(sent) for sent in sents_list]

def pad_sents(sent_list):
    return [list(pad_both_ends(sent, n=3)) for sent in sent_list]

training_text = ' '.join(training_cleaned)
validation_text = ' '.join(validation_cleaned)
testing_text = ' '.join(testing_cleaned)

training_sents = preprocess_data(training_text)
validation_sents = preprocess_data(validation_text)
testing_sents = preprocess_data(testing_text)

print(training_sents)

number_sents = len(training_sents) + len(validation_sents) + len(testing_sents)

training_sents_no_punc = rem_punctuation(training_sents)
validation_sents_no_punc = rem_punctuation(validation_sents)
testing_sents_no_punc = rem_punctuation(testing_sents)

training_nested = sents_to_nested_list(training_sents_no_punc)
validation_nested = sents_to_nested_list(validation_sents_no_punc)
testing_nested = sents_to_nested_list(testing_sents_no_punc)

training_padded = pad_sents(training_nested)
validation_padded = pad_sents(validation_nested)
testing_padded = pad_sents(testing_nested)

['Today we express our deepest gratitude to all those who have served in our armed forces.', 'Busy day planned in New York.', 'Will soon be making some very important decisions on the people who will be running our government!', 'Love the fact that the small groups of protesters last night have passion for our great country.', 'We will all come together and be proud!', 'Just had a very open and successful presidential election.', 'Now professional protesters, incited by the media, are protesting.', 'Very unfair!', 'A fantastic day in D.C. Met with President Obama for first time.', 'Really good meeting, great chemistry.', 'Melania liked Mrs. O a lot!', 'Happy 241st birthday to the U.S. Marine Corps!', 'Thank you for your service!!', 'Such a beautiful and important evening!', 'The forgotten man and woman will never be forgotten again.', 'We will all come together as never before Watching the returns at 9:45pm.', 'RT : Such a surreal moment to vote for my father for President of the Unite

## Generate Vocabulary

- Use the entire padded sentence information (nested list), in order to provide one single list object with all the words (use the `flatten` function from `nltk.lm.preprocessing`)
- Create the vocabulary for our language model, however each word, which does not show up more than `N times (N=1, N=2, ...)` (what happens if `N > 1`?) within the entire corpus, should not be part of the vocabulary and mapped to the `<UNK>` category/tag (use the `unk_cutoff` option)
- Use the `Vocabulary` object of `nltk.lm` to realize the requirements
- Report the size of your entire vocabulary (`|V|`) as well as the number of unique elements, togehter with the top-N most-frequent elements (without considering sentence start `<s>` and `</s>`) - What can you observe regarding the type of words which are the most frequent? 

In [4]:
from nltk.lm.preprocessing import flatten
from nltk.lm import Vocabulary

training_words = list(flatten(training_padded))
validation_words = list(flatten(validation_padded))
testing_words = list(flatten(testing_padded))

vocab = Vocabulary(training_words, unk_cutoff = 5)

## Compute N-Grams

- Use the padded sentence information and compute for each sentence, represented as word vector $\vec{w}$ (list of words), all the N-Gram patterns, using the `everygrams` function from `nltk` (total N-Gram information should include `Unigram, Bigram, and Trigram` - `N=3` )
- Analyze the output (`generator object`) for each sentence and convert it to a list object
- Appending all the sentence-wise list output with all the N-Gram information to another list, representing your final training data
- Compute the counts for all the Unigrams, Bigrams, and Trigrams (use the `Counter` module from `collections` together with the list-converted output of `everygrams`)
- Visualize the `top-N` most-frequent Unigrams, Bigrams, and Trigrams (use `matplotlib` barplot)

In [5]:
from nltk.util import everygrams
from collections import Counter
from nltk.lm.preprocessing import flatten
import matplotlib.pyplot as plt

training_ngrams = [list(everygrams(sent, max_len=7)) for sent in training_padded]
validation_ngrams = [list(everygrams(sent, max_len=7)) for sent in validation_padded]
training_flat = flatten(training_ngrams)
counter = Counter(training_flat).most_common(20)

## Build N-Gram Language Model

- Create a N-Gram language model using the `MLE` (Maximum-Likelihood-Estimator) module from `nltk.lm`
- Train the model calling the `fit` function, which requires two mandatory arguments - the (nested) sentence-related list with all N-Grams per sentence (see result of Section `Compute N-Grams`), in addition to the vocabulary (see result of Section `Generate Vocabulary`)
- Analyze your trained model, using different functionalities, e.g. `counts`, `scores`, `logscore`, `vocab`, together with different words-sequences, also including words which are not in the vocabulary, in order to make sure that the model outputs are valid
- Manually evaluate/verify probabilities for `Unigram`, `Bigram`, and `Trigram` (in case of the `MLE` model) for the following expression: `"make america great"` - Does it match with the `model.counts` and `model.score` values? What happens in the special case of the bigram probability for `<s> <s>` and `</s> </s>` ? Are those counts and score values the same - explain what you are observing?
- There exist also more sophisticated language models, such as `KneserNeyInterpolated`, `Laplace`, `Lidstone`, `WittenBellInterpolated`, `AbsoluteDiscountingInterpolated` (see https://www.nltk.org/api/nltk.lm.html)
- Train also a `KneserNeyInterpolated` language model. What do you encounter when computing probabilities, compared to the `MLE` version (key word: smoothing)
- **Homework:** also have a look at the other LM alternatives (mentioned before), as well as compare results! Have also a look at the different `smoothing` options provided by the `nltk.lm` module

In [6]:
from nltk.lm import MLE
from nltk.lm.models import KneserNeyInterpolated
from nltk.lm.models import Laplace

mle_model = MLE(3)
mle_model.fit(training_ngrams, vocab)
print(mle_model.counts[['make']]['america'])
print(mle_model.score("america", ["make"]))
print(mle_model.logscore('america'))
print(mle_model.vocab['hallo'])

print()

kneser_model = KneserNeyInterpolated(3)
kneser_model.fit(training_ngrams, vocab)
print(kneser_model.counts[['make']]['america'])
print(kneser_model.score("america", ["make"]))
print(kneser_model.logscore('america'))
print(kneser_model.vocab['hallo'])

laplace_model = Laplace(1)
laplace_model.fit(training_ngrams, vocab)

0
0.03669724770642202
-3.490336276569173
0

0
0.04241733510300652
-4.624546368121777
0


## Evaluation of the Language Model 

- First apply the trained language model and compute the `perplexity` for the `unigram "america"`. Compare if the perplexity is really the inverse probability which has been calculated before via `model.score`. Moreover take the sentence `<s> <s> make america great again </s> </s>` and convert all the bigrams within a list `[('<s>', '<s>'), ('<s>', 'make'), ...]`, which is then used as input for the `perplexity` computation. 
- Second, use the trained language model together with your `validation set` and compute the `perplexity` for each sentence in the validation set using the `everygram` output (compute total perplexity and averaged across all sentences)
- In case the performance is not very promising and/or you are observing a lot of `inf` values (division by zero in the perplexity equation) try to have a look at your pipeline and try to further optimize the text preprocessing and parametric setup (reduce the complexity of the vocabualry, e.g. increase `unk_cutoff`, use categorical approaches (e.g. `NER`), change the size of `N`, use a LM with an integrated smoothing concept (e.g. `Laplace`, etc.), besides looking for any other strange behaviors
- Once the performance on the validation corpus is satisfying, verify your model also on the final and unseen `test set` (via `perplexity` and same approach) - Large deviations?

In [7]:
import nltk
import math

sent = ['<s>', '<s>', 'make', 'america', 'great', 'again', '</s>', '</s>']

print(mle_model.perplexity("america"))
print(mle_model.perplexity(nltk.bigrams(sent)))

avg_perp = 0
num_inf = 0

for sent in validation_ngrams:
    perp = mle_model.perplexity(sent)
    print(perp)
    if not math.isinf(perp):
        avg_perp += perp
    else:
        num_inf += 1

print(f'total infs: {num_inf}')
print(f'total: {avg_perp}')
print(f'avg: {avg_perp / len(validation_ngrams)}')


163.3597706917847
28.319879004769216
inf
inf
inf
inf
inf
inf
inf
inf
inf
inf
inf
inf
inf
inf
inf
inf
inf
inf
inf
inf
inf
10.17425678559116
inf
inf
inf
6.896356606571488
inf
3.9254423504393605
inf
inf
inf
inf
inf
inf
inf
inf
inf
inf
inf
inf
inf
8.474623217538873
inf
inf
inf
inf
6.233395205118561
inf
inf
inf
inf
inf
inf
7.932506628418597
inf
inf
inf
inf
5.2291568858363355
inf
inf
inf
inf
inf
inf
inf
inf
9.052103072983165
5.006856028250872
inf
inf
inf
4.206831216148622
inf
inf
inf
inf
inf
5.006856028250872
inf
inf
5.702866287979767
inf
inf
5.079912932589483
inf
5.006856028250872
inf
inf
inf
6.791149294796717
5.417858300844248
inf
5.006856028250872
inf
4.206831216148622
5.006856028250872
6.348006465763581
inf
inf
inf
5.2291568858363355
inf
inf
inf
6.470464796946197
inf
inf
4.206831216148622
4.206831216148622
inf
inf
6.785407762827079
inf
inf
inf
7.12098242531365
inf
inf
inf
inf
6.470464796946197
inf
inf
inf
13.734351074330437
inf
inf
inf
inf
inf
5.006856028250872
inf
inf
inf
inf
inf
inf
in

## Artificial Text Generation
- Use the `generate` functionality together with the `MLE` language model alternative to artificially produce Trump-Tweets via a user-defined number of words   
- You might face a lot of sentence start `<s>` and sentence end `</s>` tokens. Optimize the generated output, by ignoring and eliminating these tokens and produce sentences of at least 15 words during generation (without `<s>` and `</s>`)

In [8]:
def gen_tweet(num_words, tweet = []):
    new_tweet = laplace_model.generate(num_words)
    filtered = [word for word in new_tweet if word != '<s>' and word != '</s>']
    return filtered
    #tweet = tweet + filtered

    #if len(tweet) < num_words:
        #gen_tweet(num_words, tweet)
    #else:
        #print(tweet)
        #return tweet

for i in range(20):
    print(gen_tweet(20))

['<UNK>', 'do', '<UNK>', 'owe', 'you', 'an', 'apology']
[]
[]
['<UNK>', '<UNK>', 'to', 'Go', 'to', '<UNK>', 'for', '<UNK>', 'Trump', '<UNK>', 'Breitbart', 'lets', 'go', '<UNK>', 'he', 'has', 'my', 'vote', 'Go', 'Trump']
['was', 'he', 'fired', 'FOX', 'he', 'would', 'have', 'voted', 'for', 'Clinton', 'over', 'McCain']
['It', 'wont', 'work']
['True', 'can', 'you', 'think', 'of', 'anyone', 'who', 'wants', 'to', 'watch']
['Just', 'victory', 'victory', 'and', 'more', 'victory', 'as', 'you', '<UNK>', 'the', 'truth', 'that', 'is', 'in', 'your', 'heart']
['will', 'win', 'big']
['Ted', 'Cruz', 'makes', '<UNK>', 'poll', 'via', '<UNK>', 'Great', 'family']
['a', 'friend', 'and', '<UNK>', 'built', 'for', '<UNK>', '<UNK>', 'dont', 'people', 'get', 'it']
['genius']
['Thank', 'you']
[]
['<UNK>', 'EST']
[]
['to', 'run', 'for']
['of', '<UNK>']
['<UNK>', 'Saudi', 'Arabia', 'was', '<UNK>', 'against', 'the', 'Iran', 'nuclear', 'deal']
['mess']


## Save and Store your final Model
- Store the trained and finale language model by pickling the output using the `dill` Python module
- Call the `dump` function to write the model to a specific output path
- Call the `load` method to load your model, based on a given input path (verfiy the model after loading!)