# N-GRAM - Probabilistic Language Modeling

- Let us build our first language model ...
- **Important**: for all tasks below, please use the methods provided by the `nltk` Python module
- This is a step-by-step coding example that leads from raw text to the final language model
- Have a look at the NLTK-LM documentation: https://www.nltk.org/api/nltk.lm.html

## Load and Extract the Data

- Use the provided text document, which contains a series of tweets from Donald Trump (`Donald-Trump-Tweets.csv`)
- Use the `Pandas` Python library to read the CSV file as a `pandas dataframe` object
- Filter the column which contains the text information (`Tweet_Text`), leading to a table with 2 columns: index and text information
- Also check for invalid entries and filter them out in advance (`notnull` function of the `pandas dataframe` object)
- Visualize the resulting filtered `Pandas` table

In [144]:
import pandas as pd
data = pd.read_csv('Donald-Trump-Tweets.csv').filter(['Tweet_Text']).dropna().drop_duplicates()


In [145]:
display(data)

| | Tweet_Text |
|---|---|
| 0 | Today we express our deepest gratitude to all ... |
| 1 | Busy day planned in New York. Will soon be mak... |
| 2 | Love the fact that the small groups of protest... |
| 3 | Just had a very open and successful presidenti... |
| 4 | A fantastic day in D.C. Met with President Oba... |
| ... | ... |
| 7370 | I loved firing goofball atheist Penn @pennjill... |
| 7371 | I hear @pennjillette show on Broadway is terri... |
| 7372 | Irrelevant clown @KarlRove sweats and shakes n... |
| 7373 | "@HoustonWelder: Donald Trump is one of the se... |


## Text Preprocessing
- Clean and prepare the entire text information (e.g. with regex, nltk functionalities, etc.) in order to guarantee a robust sentence tokenization (e.g. how to handle hyperlinks, hashtags, punctuation marks, non-alphanumeric patterns, Twitter references via @, upper/lower case, ...?)
- Perform the preprocessing in the same way for all three data partitions to ensure comparability
- Split the entire text information into a training, validation, and unseen test dataset (take the first `6,375` lines for training, the following `500` for validation, and the remaining `500` for testing)
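
The split from the last task could be sketched with plain list slicing. The snippet below is only an illustration: `tweets` is a hypothetical stand-in list, not the real data, and the sizes are the ones given in the task description.

```python
# Hypothetical stand-in for the list of preprocessed tweets (one string per tweet)
tweets = [f"tweet {i}" for i in range(7375)]

N_TRAIN, N_VAL = 6375, 500  # sizes taken from the task description

train_tweets = tweets[:N_TRAIN]                # first 6,375 tweets
val_tweets = tweets[N_TRAIN:N_TRAIN + N_VAL]   # following 500 tweets
test_tweets = tweets[N_TRAIN + N_VAL:]         # remaining tweets

print(len(train_tweets), len(val_tweets), len(test_tweets))
```

Slicing by position keeps the original order of the tweets, so the three partitions never overlap.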

In [146]:
import re
# Each row becomes a one-element list, e.g. ['tweet text']
text = data.values.tolist()

In [147]:
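for i in range(len(text)):
    # str() turns the one-element row into the string "['...']"; the brackets are removed later
    text[i] = re.sub(r'([#][\w_-]+)', 'Hashtag', str(text[i]))  # hashtags
    text[i] = re.sub(r'([@][\w_-]+)', 'Mention', str(text[i]))  # @-mentions
    text[i] = re.sub(r'(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w\.-]*)*\/?\S', 'Link', str(text[i]))  # URLs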

In [148]:
text

["['Today we express our deepest gratitude to all those who have served in our armed forces. Hashtag Link]",
 "['Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!']",
 "['Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud!']",
 "['Just had a very open and successful presidential election. Now professional protesters, incited by the media, are protesting. Very unfair!']",
 "['A fantastic day in D.C. Met with President Obama for first time. Really good meeting, great chemistry. Melania liked Mrs. O a lot!']",
 "['Happy 241st birthday to the U.S. Marine Corps! Thank you for your service!! Link]",
 "['Such a beautiful and important evening! The forgotten man and woman will never be forgotten again. We will all come together as never before']",
 "['Watching the returns at 9:45pm.\\nHashtag Hashtag Link]",
 "['RT Mention: Such a 

## Sentence Tokenization and Padding
- Join all the tweet text information and perform sentence tokenization
- Report the number of sentences
- Use the list of detected sentences and either consistently remove all punctuation marks, or leave them in the corpus and treat them as individual words
- Add the required sentence start (`<s>`) and sentence end (`</s>`) symbols to each sentence in the list (`N-Gram order = 3`)
- Convert the list of sentences (list of strings) into a nested list, i.e. a list of sentences in which each sentence is represented as a list of words (word vector $\vec{w}$)

In [149]:
allTweets = " ".join(text)
# Strip the leading "['" and trailing "']" left over from str() on each one-element row
allTweets = re.sub("(\[\\'\.|\[\\'\"|\[\\')", '', allTweets)
allTweets = re.sub("(\\'])", '', allTweets)

In [150]:
import nltk
nltk.data.path.append('../NLTK_Data')

sentences = nltk.tokenize.sent_tokenize(allTweets.lower())
print('Number of sentences in tweets: {}'.format(len(sentences)))

Number of sentences in tweets: 10283


In [151]:
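# Remove punctuation, the HTML entity &amp; and literal "\n" sequences
# (raw string, so that \\+n matches one or more backslashes followed by "n")
for i in range(len(sentences)):
    sentences[i] = re.sub(r'(!|\.|,|-|\&amp;|\?|\]|\\+n|:|;|")', '', str(sentences[i]))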

In [152]:
#Add BoS and EoS tags
from nltk.lm.preprocessing import pad_both_ends
sentence_word = []
for i in range(len(sentences)):
    words = nltk.tokenize.word_tokenize(sentences[i])
    padded = list(pad_both_ends(words, n =3))
    sentence_word.append([" ".join(padded), padded])

In [153]:
sentence_word[0][0]

'<s> <s> today we express our deepest gratitude to all those who have served in our armed forces </s> </s>'

## Generate Vocabulary

- Use the entire padded sentence information (nested list) to provide one single list object with all the words (use the `flatten` function from `nltk.lm.preprocessing`)
- Create the vocabulary for the language model. Each word which does not show up at least `N` times (`N=1, N=2, ...`; what happens if `N > 1`?) within the entire corpus should not be part of the vocabulary and is mapped to the `<UNK>` category/tag (use the `unk_cutoff` option)
- Use the `Vocabulary` object of `nltk.lm` to realize the requirements
- Report the size of your entire vocabulary (`|V|`) as well as the number of unique elements, together with the top-N most frequent elements (without considering sentence start `<s>` and sentence end `</s>`). What can you observe regarding the type of words which are the most frequent?

In [154]:
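from nltk.lm.preprocessing import flatten
from nltk.lm import Vocabulary
test = []
for element in sentence_word:
    # element is [joined sentence string, padded token list]; both are appended
    for word in element:
        test.append(word)

# Flattening iterates the joined strings character by character,
# so single characters are counted next to the word tokens
flat_list = [item for sublist in test for item in sublist]
vocab = Vocabulary(flat_list, unk_cutoff=2)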

In [155]:
len(vocab)

4437

In [156]:
vocab.counts

Counter({' ': 154377,
         's': 72194,
         'e': 60639,
         'n': 55335,
         't': 53600,
         'i': 49081,
         'a': 48694,
         'o': 47407,
         '<': 41132,
         '>': 41132,
         'r': 30383,
         'h': 27421,
         'l': 26134,
         'm': 21494,
         '/': 20980,
         '<s>': 20566,
         '</s>': 20566,
         'd': 17210,
         'g': 15038,
         'u': 14383,
         'c': 12826,
         'w': 11559,
         'y': 11201,
         'p': 11161,
         'b': 8505,
         'k': 8438,
         'f': 7992,
         'mention': 7761,
         'v': 5608,
         'the': 3843,
         'link': 2588,
         'to': 2437,
         'hashtag': 2407,
         'in': 1884,
         'and': 1835,
         'j': 1791,
         'is': 1768,
         'you': 1723,
         'of': 1546,
         '0': 1409,
         'on': 1404,
         'for': 1370,
         '\\': 1329,
         'will': 1286,
         'trump': 1122,
         'be': 1063,
         'gre
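
The counts above contain single characters because the joined sentence strings are flattened together with the token lists. A minimal sketch of a word-level vocabulary, assuming a toy corpus (`padded_sentences` is a hypothetical stand-in for the padded token lists), could look like this:

```python
from nltk.lm import Vocabulary
from nltk.lm.preprocessing import flatten

# Toy stand-in for the padded token lists (one list of words per sentence)
padded_sentences = [
    ['<s>', '<s>', 'make', 'america', 'great', '</s>', '</s>'],
    ['<s>', '<s>', 'make', 'america', 'safe', '</s>', '</s>'],
]

# Flatten only the token lists, so every item is a whole word
tokens = list(flatten(padded_sentences))
vocab = Vocabulary(tokens, unk_cutoff=2)  # words seen fewer than 2 times map to <UNK>

print(len(vocab))             # vocabulary size, including the <UNK> label
print(vocab.lookup('great'))  # seen once, below the cutoff
print(vocab.lookup('make'))   # seen twice, kept

# Most frequent words without the padding symbols
padding = {'<s>', '</s>'}
top_words = [(w, c) for w, c in vocab.counts.most_common() if w not in padding][:2]
print(top_words)
```

With `unk_cutoff=2`, the words `great` and `safe` fall below the cutoff and are looked up as `<UNK>`, which shrinks the vocabulary.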

## Compute N-Grams

- Use the padded sentence information and compute, for each sentence represented as word vector $\vec{w}$ (list of words), all the N-Gram patterns using the `everygrams` function from `nltk` (the total N-Gram information should include `Unigram, Bigram, and Trigram`, i.e. `N=3`)
- Analyze the output (`generator object`) for each sentence and convert it to a list object
- Append the sentence-wise list output with all the N-Gram information to another list, representing your final training data
- Compute the counts for all the Unigrams, Bigrams, and Trigrams (use the `Counter` class from `collections` together with the list-converted output of `everygrams`)
- Visualize the `top-N` most frequent Unigrams, Bigrams, and Trigrams (use a `matplotlib` bar plot)

In [157]:
from nltk import ngrams
from nltk import everygrams
trigram = []
unigram = []
bigram = []
for sentence in sentence_word:
    # sentence[1] is the padded token list of the current sentence
    trigram.append(list(everygrams(sentence[1], max_len=3)))  # n-grams of order 1 to 3
    bigram.append(list(everygrams(sentence[1], max_len=2)))   # n-grams of order 1 to 2
    unigram.append(list(everygrams(sentence[1], max_len=1)))  # unigrams only
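
The counting step from the task list is sketched below. This is only an illustration on a hypothetical toy corpus (`padded_sentences`); it groups the `everygrams` output by N-Gram order with one `Counter` per order.

```python
from collections import Counter
from nltk.util import everygrams

# Toy stand-in for the padded token lists
padded_sentences = [
    ['<s>', '<s>', 'make', 'america', 'great', '</s>', '</s>'],
    ['<s>', '<s>', 'make', 'america', 'safe', '</s>', '</s>'],
]

# All uni-, bi- and trigrams per sentence (generator converted to a list)
ngrams_per_sentence = [list(everygrams(sent, max_len=3)) for sent in padded_sentences]

# One Counter per N-Gram order
counts = {1: Counter(), 2: Counter(), 3: Counter()}
for sentence_ngrams in ngrams_per_sentence:
    for gram in sentence_ngrams:
        counts[len(gram)][gram] += 1

for order in (1, 2, 3):
    print(order, counts[order].most_common(3))
```

The labels and values returned by `most_common` can be passed directly to a `matplotlib` bar plot.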

## Build N-Gram Language Model

- Create an N-Gram language model using the `MLE` (Maximum-Likelihood-Estimator) class from `nltk.lm`
- Train the model by calling the `fit` function, which requires two arguments: the (nested) sentence-related list with all N-Grams per sentence (see result of Section `Compute N-Grams`) and the vocabulary (see result of Section `Generate Vocabulary`)
- Analyze your trained model using different functionalities, e.g. `counts`, `score`, `logscore`, `vocab`, together with different word sequences, also including words which are not in the vocabulary, in order to make sure that the model outputs are valid
- Manually evaluate/verify the probabilities for `Unigram`, `Bigram`, and `Trigram` (in case of the `MLE` model) for the following expression: `"make america great"` - Do they match the `model.counts` and `model.score` values? What happens in the special case of the bigram probability for `<s> <s>` and `</s> </s>`? Are those counts and score values the same? Explain what you are observing.
- There are also more sophisticated language models, such as `KneserNeyInterpolated`, `Laplace`, `Lidstone`, `WittenBellInterpolated`, `AbsoluteDiscountingInterpolated` (see https://www.nltk.org/api/nltk.lm.html)
- Also train a `KneserNeyInterpolated` language model. What do you observe when computing probabilities, compared to the `MLE` version (keyword: smoothing)?
- **Homework:** also have a look at the other LM alternatives (mentioned before) and compare the results! Also have a look at the different `smoothing` options provided by the `nltk.lm` module
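
The manual verification could be sketched as follows. The snippet assumes a hypothetical toy corpus instead of the tweets and uses `padded_everygram_pipeline` from `nltk.lm.preprocessing` as a shortcut for padding and N-Gram extraction; the numbers therefore only illustrate the mechanics.

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Hypothetical toy corpus: tokenized, lower-cased sentences without padding
corpus = [
    ['make', 'america', 'great', 'again'],
    ['make', 'america', 'safe', 'again'],
    ['america', 'first'],
]

# Pads every sentence for order 3 and yields the everygrams plus the flat vocabulary text
train_ngrams, vocab_text = padded_everygram_pipeline(3, corpus)
lm = MLE(3)
lm.fit(train_ngrams, vocab_text)

# Unigram: count("america") / number of all unigram tokens
p_uni = lm.counts['america'] / lm.counts.unigrams.N()
# Bigram: count("make america") / number of bigrams that start with "make"
p_bi = lm.counts[['make']]['america'] / lm.counts[['make']].N()
# Trigram: count("make america great") / number of trigrams that start with "make america"
p_tri = lm.counts[['make', 'america']]['great'] / lm.counts[['make', 'america']].N()

print(p_uni, lm.score('america'))
print(p_bi, lm.score('america', ['make']))
print(p_tri, lm.score('great', ['make', 'america']))

# Padding symbols: same bigram count, different scores
print(lm.score('<s>', ['<s>']), lm.score('</s>', ['</s>']))
# A word outside the vocabulary is mapped to <UNK>
print(lm.score('obama'))
```

In this toy corpus `<s> <s>` and `</s> </s>` both occur three times, but the scores differ: `<s>` is also followed by ordinary words, while `</s>` is only ever followed by another `</s>`, because the final `</s>` of a sentence never starts a bigram.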

## Evaluation of the Language Model 

- First, apply the trained language model and compute the `perplexity` for the `unigram "america"`. Check whether the perplexity is really the inverse of the probability which has been calculated before via `model.score`. Moreover, take the sentence `<s> <s> make america great again </s> </s>` and convert all of its bigrams into a list `[('<s>', '<s>'), ('<s>', 'make'), ...]`, which is then used as input for the `perplexity` computation.
- Second, use the trained language model together with your `validation set` and compute the `perplexity` for each sentence in the validation set using the `everygrams` output (compute the total perplexity and the average across all sentences)
- In case the performance is not very promising and/or you are observing a lot of `inf` values (division by zero in the perplexity equation), have a look at your pipeline and try to further optimize the text preprocessing and parametric setup: reduce the complexity of the vocabulary (e.g. increase `unk_cutoff`), use categorical approaches (e.g. `NER`), change the size of `N`, use a LM with an integrated smoothing concept (e.g. `Laplace`), and look for any other strange behavior
- Once the performance on the validation corpus is satisfactory, verify your model also on the final and unseen `test set` (via `perplexity` and the same approach) - Large deviations?
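
A minimal sketch of the perplexity checks, again on a hypothetical toy corpus, is shown below. The helper `train` is an illustrative name, and `Laplace` is used here only as one example of a smoothed model.

```python
import math
from nltk.lm import MLE, Laplace
from nltk.lm.preprocessing import pad_both_ends, padded_everygram_pipeline
from nltk.util import bigrams

# Hypothetical toy corpus: tokenized, lower-cased sentences without padding
corpus = [
    ['make', 'america', 'great', 'again'],
    ['make', 'america', 'safe', 'again'],
    ['america', 'first'],
]

def train(model_cls, order=3):
    """Fit a fresh model; the pipeline generators can only be consumed once."""
    train_ngrams, vocab_text = padded_everygram_pipeline(order, corpus)
    model = model_cls(order)
    model.fit(train_ngrams, vocab_text)
    return model

mle, laplace = train(MLE), train(Laplace)

# Perplexity of a single unigram compared with its inverse probability
pp_unigram = mle.perplexity([('america',)])
print(pp_unigram, 1 / mle.score('america'))

# Bigrams of a held-out sentence that contains the unseen bigram ("first", "again")
held_out = ['make', 'america', 'first', 'again']
test_bigrams = list(bigrams(pad_both_ends(held_out, n=3)))

pp_mle = mle.perplexity(test_bigrams)          # unseen bigram has probability 0
pp_laplace = laplace.perplexity(test_bigrams)  # add-one smoothing avoids zero probabilities
print(pp_mle, pp_laplace)
```

A single unseen N-Gram is enough to push the `MLE` perplexity to `inf`, which is why a smoothed model or a higher `unk_cutoff` helps on the validation set.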

## Artificial Text Generation
- Use the `generate` functionality together with the `MLE` language model to artificially produce Trump tweets with a user-defined number of words
- You might face a lot of sentence start `<s>` and sentence end `</s>` tokens. Optimize the generated output by ignoring and eliminating these tokens, and produce sentences of at least 15 words during generation (without `<s>` and `</s>`)
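
One possible way to drop the padding tokens is to sample word by word and restart from the sentence start whenever `</s>` is produced. The sketch below assumes a hypothetical toy corpus, and `generate_words` is an illustrative helper, not part of `nltk`.

```python
import random
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Hypothetical toy corpus: tokenized, lower-cased sentences without padding
corpus = [
    ['make', 'america', 'great', 'again'],
    ['make', 'america', 'safe', 'again'],
    ['america', 'first'],
]
train_ngrams, vocab_text = padded_everygram_pipeline(3, corpus)
lm = MLE(3)
lm.fit(train_ngrams, vocab_text)

def generate_words(model, min_words=15, seed=42):
    """Sample word by word and keep only real words (hypothetical helper)."""
    rng = random.Random(seed)
    start = ['<s>', '<s>']
    context, words = list(start), []
    while len(words) < min_words:
        word = model.generate(1, text_seed=context, random_seed=rng)
        if word == '</s>':
            context = list(start)  # sentence finished: restart from the sentence start
            continue
        if word != '<s>':
            words.append(word)
        context = (context + [word])[-2:]  # keep the last two tokens as trigram context
    return words

generated = generate_words(lm)
print(' '.join(generated))
```

Resetting the context after `</s>` matters: otherwise the model keeps backing off to the context `</s>`, which is only ever followed by another `</s>`.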

## Save and Store your final Model
- Store the trained, final language model by pickling it with the `dill` Python module
- Call the `dump` function to write the model to a specific output path
- Call the `load` function to load your model from a given input path (verify the model after loading!)
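
The save and load round trip could be sketched as follows. The toy corpus and the output path inside a temporary directory are assumptions made for illustration only.

```python
import os
import tempfile

import dill
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Hypothetical toy corpus: tokenized, lower-cased sentences without padding
corpus = [
    ['make', 'america', 'great', 'again'],
    ['make', 'america', 'safe', 'again'],
]
train_ngrams, vocab_text = padded_everygram_pipeline(3, corpus)
lm = MLE(3)
lm.fit(train_ngrams, vocab_text)

# Hypothetical output path inside a temporary directory
model_path = os.path.join(tempfile.mkdtemp(), 'language_model.pkl')

with open(model_path, 'wb') as fout:
    dill.dump(lm, fout)           # write the trained model to disk

with open(model_path, 'rb') as fin:
    lm_loaded = dill.load(fin)    # read it back

# Verify: the reloaded model should return the same scores as the original
same_score = lm_loaded.score('great', ['make', 'america']) == lm.score('great', ['make', 'america'])
print(same_score)
```

Comparing a few scores before and after loading is a quick check that counts and vocabulary survived the round trip.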