# Tweet like Trump
Given the Trump Twitter Archive (~290 tweets attributed to the former
US president, available among the class materials):
- train two language models (one bigram and one trigram) on this set of texts;
- use the two models to generate tweets.


### Instructions
The generic n-gram model can be found under the `src` folder:
- `base.py`: basic implementation;
- `log.py`: uses log probabilities instead of the plain relative-frequency probabilities;
- `smooth.py`: adds Laplace smoothing to handle unseen n-grams.

In [1]:

# Import necessary libraries and modules
import random
import nltk
import csv
from collections import defaultdict
import math 
import pandas as pd
from pandas import json_normalize  # the pandas.io.json import path is deprecated
import numpy as np
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
import re

%load_ext autoreload
%autoreload 2




# Base ngram

### Defining the n-gram model blueprint
The code below is the training method of the n-gram language model. It works in three steps:

**Preprocessing**: The method takes a corpus, i.e. a list of sentences. Each sentence is tokenized by `preprocessSentence`, which also performs any necessary cleaning, such as removing punctuation. The tokens are then padded with n-1 start markers (`<s>`) and one end marker (`</s>`) to delimit the sentence, and appended to a single flat token list.

**Generating n-grams**: The method slides a window of n tokens over the flat list, iterating `i` over `range(len(processed_corpus) - n + 1)`. At each position, the previous n-1 words form the key (`prev_words`) and the n-th word is the value (`current_word`); the frequency count of `current_word` in the context of `prev_words` is incremented in the `ngrams` dictionary. Word order is preserved, which is what language modeling requires. Because the list is flat, windows also span sentence boundaries, so a context such as `(</s>,)` followed by `<s>` is counted as well.

**Probability estimation**: After counting, the method estimates the probability of each word given the previous n-1 words. For each context in the `ngrams` dictionary it sums the counts of all next words (`total_count`) and divides each word's frequency by that total. The result is stored in the `ngrams` dictionary under the `'probability'` key.
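In formula form, this is the usual maximum-likelihood (relative frequency) estimate. The notation below is an assumption introduced here for illustration, with $C(\cdot)$ denoting a count over the padded training corpus; it does not appear in `src/base.py`:

```latex
P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
  = \frac{C(w_{i-n+1}, \dots, w_{i-1}, w_i)}
         {\sum_{w'} C(w_{i-n+1}, \dots, w_{i-1}, w')}
```

For example, if a context occurs four times in the corpus and is followed by the word $w$ in three of them, the estimate is $3/4 = 0.75$.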

In [2]:
def train(self,corpus):
        # Preprocess the corpus: tokenize each sentence and pad it with
        # n-1 start markers (<s>) and one end marker (</s>) that delimit it
        processed_corpus = []
        for sentence in corpus:
            
            tokenized_sentence = self.preprocessSentence(sentence)
            # add padding
            processed_sentence = ['<s>'] * (self.n - 1) + tokenized_sentence + ['</s>']
            processed_corpus.extend(processed_sentence)

        # Generate n-grams from the corpus
        for i in range(len(processed_corpus) - self.n + 1):
            prev_words = tuple(processed_corpus[i:i+self.n-1])
            current_word = processed_corpus[i+self.n-1]

            # Count this continuation; its probability is computed in the loop below
            self.ngrams[prev_words][current_word]['frequency'] += 1

        # Normalize counts to estimate probabilities using MLE from self.ngrams
        for prev_words, next_words in self.ngrams.items():
            total_count = sum(next_words[word]['frequency'] for word in next_words)
            for word in next_words:
                self.ngrams[prev_words][word]['probability'] = next_words[word]['frequency'] / total_count

Let's inspect the n-gram dictionary on this simple example (taken from the slides).

In [5]:
from src.base import NGramLanguageModel 

# Create a bigram (n=2) and a trigram (n=3) language model
bi_model = NGramLanguageModel(2)
tri_model = NGramLanguageModel(3)

# Train both models on a corpus (a list of sentences)
corpus = [
    'Peter Piper picked a peck of pickled pepper. ',
    "Where's the pickled pepper that Peter Piper picked?",
]

bi_model.train(corpus)
bi_model.printDataframe()

tri_model.train(corpus)
tri_model.printDataframe()

['Peter Piper picked a peck of pickled pepper. ', "Where's the pickled pepper that Peter Piper picked?"]
P(Peter|('<s>',)) = 0.5
P(Piper|('Peter',)) = 1.0
P(picked|('Piper',)) = 1.0
P(</s>|('picked',)) = 0.5
Probability of the sentence: 0.25
Sentence Probability: 0.25
Generated Text: ['<s>', 'Peter', 'Piper', 'picked', 'a', 'peck', 'of', 'pickled', 'pepper', '</s>']




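| | prev_words | frequency | Peter.frequency | Peter.probability | Where.frequency | Where.probability | Piper.frequency | Piper.probability | picked.frequency | picked.probability | ... | pepper.frequency | pepper.probability | that.frequency | that.probability | `<s>.frequency` | `<s>.probability` | s.frequency | s.probability | the.frequency | the.probability |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `(<s>,)` | 2 | 1.0 | 0.5 | 1.0 | 0.5 | | | | | ... | | | | | | | | | | |
| 1 | `(Peter,)` | 1 | | | | | 2.0 | 1.0 | | | ... | | | | | | | | | | |
| 2 | `(Piper,)` | 1 | | | | | | | 2.0 | 1.0 | ... | | | | | | | | | | |
| 3 | `(picked,)` | 2 | | | | | | | | | ... | | | | | | | | | | |
| 4 | `(a,)` | 1 | | | | | | | | | ... | | | | | | | | | | |
| 5 | `(peck,)` | 1 | | | | | | | | | ... | | | | | | | | | | |
| 6 | `(of,)` | 1 | | | | | | | | | ... | | | | | | | | | | |
| 7 | `(pickled,)` | 1 | | | | | | | | | ... | 2.0 | 1.0 | | | | | | | | |
| 8 | `(pepper,)` | 2 | | | | | | | | | ... | | | 1.0 | 0.5 | | | | | | |
| 9 | `(</s>,)` | 1 | | | | | | | | | ... | | | | | 1.0 | 1.0 | | | | |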




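| | prev_words | frequency | Peter.frequency | Peter.probability | Where.frequency | Where.probability | Piper.frequency | Piper.probability | picked.frequency | picked.probability | ... | pepper.frequency | pepper.probability | that.frequency | that.probability | `<s>.frequency` | `<s>.probability` | s.frequency | s.probability | the.frequency | the.probability |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `(<s>, <s>)` | 2 | 1.0 | 0.5 | 1.0 | 0.5 | | | | | ... | | | | | | | | | | |
| 1 | `(<s>, Peter)` | 1 | | | | | 1.0 | 1.0 | | | ... | | | | | | | | | | |
| 2 | `(Peter, Piper)` | 1 | | | | | | | 2.0 | 1.0 | ... | | | | | | | | | | |
| 3 | `(Piper, picked)` | 2 | | | | | | | | | ... | | | | | | | | | | |
| 4 | `(picked, a)` | 1 | | | | | | | | | ... | | | | | | | | | | |
| 5 | `(a, peck)` | 1 | | | | | | | | | ... | | | | | | | | | | |
| 6 | `(peck, of)` | 1 | | | | | | | | | ... | | | | | | | | | | |
| 7 | `(of, pickled)` | 1 | | | | | | | | | ... | 1.0 | 1.0 | | | | | | | | |
| 8 | `(pickled, pepper)` | 2 | | | | | | | | | ... | | | 1.0 | 0.5 | | | | | | |
| 9 | `(pepper, </s>)` | 1 | | | | | | | | | ... | | | | | 1.0 | 1.0 | | | | |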
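The rows printed above can be reproduced without the `src` package. The following is a minimal standalone sketch, not the code in `src/base.py`: `tokenize` and `train_ngrams` are hypothetical names, and the regex tokenizer is an assumption chosen so that `Where's` splits into `Where` and `s`, as in the output.

```python
import re
from collections import Counter, defaultdict


def tokenize(sentence):
    # Assumption: keep only runs of word characters, dropping punctuation
    return re.findall(r"\w+", sentence)


def train_ngrams(corpus, n):
    """Return {context: {word: probability}} estimated by relative frequency."""
    # Pad every sentence and append it to one flat token stream
    stream = []
    for sentence in corpus:
        stream.extend(["<s>"] * (n - 1) + tokenize(sentence) + ["</s>"])

    # Count each word after its (n-1)-word context
    counts = defaultdict(Counter)
    for i in range(len(stream) - n + 1):
        context = tuple(stream[i:i + n - 1])
        counts[context][stream[i + n - 1]] += 1

    # Normalize the counts of each context into probabilities
    return {
        context: {w: c / sum(nxt.values()) for w, c in nxt.items()}
        for context, nxt in counts.items()
    }


corpus = [
    "Peter Piper picked a peck of pickled pepper. ",
    "Where's the pickled pepper that Peter Piper picked?",
]
bigram = train_ngrams(corpus, 2)
trigram = train_ngrams(corpus, 3)
print(bigram[("<s>",)])
print(trigram[("pickled", "pepper")])
```

Under these assumptions, the `("<s>",)` context splits its probability mass evenly between `Peter` and `Where`, which matches the first row of the bigram table.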


## Evaluation of sentence_probability 
`self.ngrams[prev_words]` is keyed by the (n-1)-word prefix of an n-gram, while `self.ngrams[prev_words][next_word]` holds the frequency and probability of each possible continuation of that prefix.
`NGramLanguageModel` is the basic implementation of the n-gram language model, which does not use log probabilities.



In [6]:
def sentence_probability(self, sentence,debug=False):
        # Preprocess the input sentence
        tokenized_sentence = self.preprocessSentence(sentence)
        processed_sentence = ['<s>'] * (self.n - 1) + tokenized_sentence + ['</s>']

        # Start from 1.0, the neutral element of multiplication
        probability = 1.0
        
        
        # Iterate over the sentence to compute the probability
        for i in range(len(processed_sentence) - self.n + 1):
            prev_words = tuple(processed_sentence[i:i+self.n-1])
            current_word = processed_sentence[i+self.n-1]

            # Check if the n-gram exists in the language model
            if prev_words in self.ngrams and current_word in self.ngrams[prev_words]:
                # Multiply the probability by the conditional probability of the current word given the previous words
                prob = self.ngrams[prev_words][current_word]['probability']
                probability *= prob
                if debug:
                    print("P({}|{}) = {}".format(current_word, prev_words, prob))
            else:
                # Unseen n-gram: multiply by a tiny constant (1e-10) instead of 0.0
                probability *= 1e-10
                if debug:
                    print("P({}|{}) = {}".format(current_word, prev_words, 'N/A'))

        if debug:
            print("Probability of the sentence: {}".format(probability))
        return probability

**Let us compute P(Peter Piper picked)**

For the sake of this example, I removed the punctuation in the preprocessing step.

In [7]:
# Same corpus as above; both models are already trained on it
corpus = [
    'Peter Piper picked a peck of pickled pepper. ',
    "Where's the pickled pepper that Peter Piper picked?",
]

sentence = "Peter Piper picked"

print("Bigram model")
bi_probability = bi_model.sentence_probability(sentence,debug=True)
print('\n')
print("Trigram model")
tri_probability = tri_model.sentence_probability(sentence,debug=True)


Bigram model
P(Peter|('<s>',)) = 0.5
P(Piper|('Peter',)) = 1.0
P(picked|('Piper',)) = 1.0
P(</s>|('picked',)) = 0.5
Probability of the sentence: 0.25


Trigram model
P(Peter|('<s>', '<s>')) = 0.5
P(Piper|('<s>', 'Peter')) = 1.0
P(picked|('Peter', 'Piper')) = 1.0
P(</s>|('Piper', 'picked')) = 0.5
Probability of the sentence: 0.25
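The bigram result is simply the product of the four factors printed above:

```latex
P(\text{Peter Piper picked})
  = P(\text{Peter} \mid \langle s \rangle)
    \cdot P(\text{Piper} \mid \text{Peter})
    \cdot P(\text{picked} \mid \text{Piper})
    \cdot P(\langle /s \rangle \mid \text{picked})
  = 0.5 \cdot 1.0 \cdot 1.0 \cdot 0.5 = 0.25
```

The two factors of 0.5 come from the corpus: `<s>` is followed by `Peter` in one sentence and by `Where` in the other, and `picked` is followed once by `a` and once by `</s>`. The trigram model prints the same four values, so its product is 0.25 as well.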


Note that how punctuation should be treated in an n-gram language model depends on the use case. If punctuation is kept in the training data, the model treats each mark as part of the n-gram context: n-grams are formed across punctuation, and the model learns patterns that involve it. This is useful when the model should capture punctuation-related information, such as sentence boundaries or typical punctuation usage.
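To make the two options concrete, here is a small sketch of both tokenization choices. The function names are hypothetical and the regular expressions are assumptions; the actual `preprocessSentence` is not shown in this notebook and may behave differently.

```python
import re


def tokens_without_punctuation(sentence):
    # Keep only runs of word characters; punctuation is discarded
    return re.findall(r"\w+", sentence)


def tokens_with_punctuation(sentence):
    # Keep word runs and emit every punctuation mark as its own token
    return re.findall(r"\w+|[^\w\s]", sentence)


text = "Where's the pickled pepper that Peter Piper picked?"
print(tokens_without_punctuation(text))
print(tokens_with_punctuation(text))
```

With the second variant, `'` and `?` become tokens of their own, so they can appear both as contexts and as predicted words.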

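Let's now test the generative capabilities of the model. The `generate` method takes:
- `seed`: a list of words that starts the text and from which the following n-grams are chosen;
- `max_length`: the maximum number of words in the generated text, seed included;
- `top_k`: the number of most probable continuations to choose from at each step.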




In [8]:
def generate(self, seed=None, max_length=10,top_k=5):
        if seed is None:
            seed = ['<s>'] * (self.n - 1)
        
        prev_words = tuple(seed)[-(self.n - 1):]
        sentence = list(seed)
        
        while len(sentence) < max_length:
            # Use .get so that an unseen context does not add an empty entry to the model
            possible_next_words = self.ngrams.get(tuple(prev_words))
            if not possible_next_words:
                break
            
            # Select the top-k most probable words
            top_words = sorted(possible_next_words.keys(),
                           key=lambda word: possible_next_words[word]['probability'],
                           reverse=True)[:top_k]

            # Pick uniformly at random among the top-k candidates
            next_word = random.choice(top_words)
            # Append the selected word to the sentence
            sentence.append(next_word)
            
            # update the previous words for the next iteration
            # remove the first word and add the selected word at the end
            prev_words = prev_words[1:] + (next_word,)
                
        
        return sentence
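The method above picks uniformly among the top-k candidates, so the most likely word and the k-th most likely word are chosen equally often. One possible alternative, shown here only as a sketch, is to weight the draw by the stored probabilities. `sample_next_word` is a hypothetical helper, not part of `src/base.py`, and the toy `next_words` dictionary only mimics the shape of one entry of `self.ngrams`.

```python
import random


def sample_next_word(next_words, top_k=5, rng=random):
    """Pick one word among the top_k candidates, weighted by probability."""
    # next_words maps word -> {"probability": float}
    candidates = sorted(
        next_words,
        key=lambda w: next_words[w]["probability"],
        reverse=True,
    )[:top_k]
    weights = [next_words[w]["probability"] for w in candidates]
    # random.choices draws with replacement according to the given weights
    return rng.choices(candidates, weights=weights, k=1)[0]


# Toy continuation table, not taken from the tweets
next_words = {
    "the": {"probability": 0.6},
    "a": {"probability": 0.3},
    "of": {"probability": 0.1},
}
print(sample_next_word(next_words, top_k=2))
```

With `top_k=1` both strategies reduce to always taking the most probable word.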

In [24]:
print("Bigram model")

seed = None
generated_text = bi_model.generate(max_length=20)
print('Generated Text:', (" ".join(generated_text)))
# Generate new text using the model
seed = ['a', 'peck']
generated_text = bi_model.generate(seed=seed,max_length=7)
print('Generated Text:', (" ".join(generated_text)))
# Generate new text using the model with top-k = 1, so that the most probable word is always selected
seed = ['a', 'peck']
generated_text = bi_model.generate(seed=seed,max_length=3,top_k=1)
print('Generated Text:', (" ".join(generated_text)))
# Generate new text using the model with top-k = 1, so that the most probable word is always selected
seed = ['peck','a']
generated_text = bi_model.generate(seed=seed,max_length=3,top_k=1)
print('Generated Text:', (" ".join(generated_text)))


print("Trigram model")
seed = None
generated_text = tri_model.generate(max_length=20)
print('Generated Text:', (" ".join(generated_text)))
# Generate new text using the model
seed = ['a', 'peck']
generated_text = tri_model.generate(seed=seed,max_length=7)
print('Generated Text:', (" ".join(generated_text)))
# Generate new text using the model with top-k = 1, so that the most probable word is always selected
seed = ['a', 'peck']
generated_text = tri_model.generate(seed=seed,max_length=3,top_k=1)
print('Generated Text:', (" ".join(generated_text)))



Bigram model
Generated Text: <s> Peter Piper picked a peck of pickled pepper </s>
Generated Text: a peck of pickled pepper that Peter
Generated Text: a peck of
Generated Text: peck a
Trigram model
Generated Text: <s> <s> Peter Piper picked a peck of pickled pepper that
Generated Text: a peck of pickled pepper </s> <s>
Generated Text: a peck of


# Log NGram
The only difference from the base model is that it works with log probabilities: instead of multiplying probabilities it adds their logarithms, which avoids numerical underflow on long sentences.
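A short sketch of why this matters. The numbers are a toy example, not values from the tweets: multiplying many small probabilities underflows to zero in floating point, while the sum of their logarithms stays well within range.

```python
import math

# Toy sequence of 400 conditional probabilities of 0.1 each
probs = [0.1] * 400

# Direct product: becomes too small for a float and collapses to 0.0
product = 1.0
for p in probs:
    product *= p

# Log space: the same quantity as a sum of natural logarithms
log_sum = sum(math.log(p) for p in probs)

print(product)
print(log_sum)
```

Two sentences can still be compared through their log probabilities, since the logarithm preserves ordering.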

# Smooth NGram
- Sentence preprocessing: tokens that are not among the known tokens are converted to `<unk>`:
  `tokens = [token if token in self.ngrams.keys() else '<unk>' for token in tokens]`
- Laplace smoothing is applied in the training algorithm:


        # Add the <unk> context to the model without initializing any continuation
        self.ngrams[('<unk>',)] = {}
        # Apply Laplace smoothing and normalize counts to estimate probabilities
        # Used as vocabulary size: number of stored contexts, padding included
        vocabulary_size = len(self.ngrams.keys())
        for prev_words, next_words in self.ngrams.items():
            total_count = sum(next_words[word]['frequency'] for word in next_words)
            for word in next_words:
                # Seen continuation: add one to its count
                smoothed_count = next_words[word]['frequency'] + 1
                next_words[word]['probability'] = smoothed_count / (total_count + vocabulary_size)
                next_words[word]['frequency'] = smoothed_count

            # Unseen continuations get a pseudo-count of 1
            remaining_words = set(self.word_tokens) - set(next_words.keys())
            for word in remaining_words:
                next_words[word] = {
                    'frequency': 1,
                    'probability': 1 / (total_count + vocabulary_size),
                }
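The same add-one rule can be shown for a single context in isolation. This is a sketch under stated assumptions, not the code in `src/smooth.py`: `laplace_probabilities` is a hypothetical helper, and the vocabulary is passed in as an explicit word list, whereas the fragment above takes `vocabulary_size` from `len(self.ngrams.keys())`.

```python
from collections import Counter


def laplace_probabilities(next_word_counts, vocabulary):
    """Add-one estimate of P(word | context) for every word in the vocabulary."""
    total = sum(next_word_counts.values())
    v = len(vocabulary)
    # Every word gets one extra count, so unseen words end up with 1 / (total + v)
    return {w: (next_word_counts.get(w, 0) + 1) / (total + v) for w in vocabulary}


# Toy vocabulary and the counts observed after one context
vocabulary = ["a", "</s>", "peck", "of"]
counts_after_picked = Counter({"a": 1, "</s>": 1})

probs = laplace_probabilities(counts_after_picked, vocabulary)
print(probs)
print(sum(probs.values()))
```

When the denominator uses the number of distinct words, the smoothed probabilities of a context sum to 1, while unseen words receive a small non-zero share.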



# Test on the Twitter dataset

In [25]:
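def read_csv(filename):
    """Return the tweet texts (second CSV column), skipping tweets that start with '@'."""
    dataset = []
    with open(filename, 'r', encoding='utf-8', newline='') as file:
        # Create a CSV reader object
        reader = csv.reader(file)
        for row in reader:
            # Ignore rows that have no second column
            if len(row) > 1 and not row[1].startswith("@"):
                dataset.append(row[1])
    return dataset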

file_path = 'data/tweets.csv'
df = read_csv(file_path)




In [26]:
# Split the dataset into train (first 80%) and test (last 20%) sets
train_size = int(len(df) * 0.8)
train_set = df[:train_size]
test_set = df[train_size:]


In [27]:

from src.log import NGramLanguageModelLogProbs 
from src.smooth import NGramLanguageModelSmoothing 

models_names = ['NGramLanguageModel','NGramLanguageModelLogProbs','NGramLanguageModelSmoothing']

bi_models = {"NGramLanguageModel":NGramLanguageModel(2),"NGramLanguageModelLogProbs":NGramLanguageModelLogProbs(2),"NGramLanguageModelSmoothing":NGramLanguageModelSmoothing(2)}
tri_models = {"NGramLanguageModel":NGramLanguageModel(3),"NGramLanguageModelLogProbs":NGramLanguageModelLogProbs(3),"NGramLanguageModelSmoothing":NGramLanguageModelSmoothing(3)}

for bi_model in bi_models.values():
    bi_model.train(train_set) 

for tri_model in tri_models.values():
    tri_model.train(train_set) 

['Peter Piper picked a peck of pickled pepper. ', "Where's the pickled pepper that Peter Piper picked?"]
Sentence Probability: -2.0794415416798357
Sentence Probability: -69.77069997038132
Input 1 Sentence Perplexity: 1.2968395546510096
Input 2 Sentence Perplexity: 37606030.93086393
Generated Text: ['<s>', 'Where', 's', 'the', 'pickled', 'pepper', 'that', 'Peter', 'Piper', 'picked', '</s>']
['Peter Piper picked a peck of pickled pepper. ', "Where's the pickled pepper that Peter Piper picked?"]
Sentence Probability: -23.294367526066186
Sentence Probability: -161.18095650958318
Input 1 Sentence Perplexity: 13.306638665166679
Input 2 Sentence Perplexity: 464158883361.2762
Generated Text: ['<s>', '<s>', 'Piper']


In [28]:
# Evaluate the LogProbs and Smoothing models on the test set and collect per-sentence perplexities
bi_models_perplexities =  {'NGramLanguageModelLogProbs':[],'NGramLanguageModelSmoothing':[]}
tri_models_perplexities = {'NGramLanguageModelLogProbs':[],'NGramLanguageModelSmoothing':[]}

for model in models_names[1:]:
    
    for sentence in test_set:
        bi_models_perplexities[model].append(bi_models[model].perplexity(sentence))
        tri_models_perplexities[model].append(tri_models[model].perplexity(sentence))

    
    

In [29]:


# For bigrams and trigrams separately, find which model (LogProbs or Smoothing) has the lower perplexity on each test sentence
bi_models_perplexities_df = pd.DataFrame.from_dict(bi_models_perplexities)
tri_models_perplexities_df = pd.DataFrame.from_dict(tri_models_perplexities)

bi_models_perplexities_df['better_model (lower ppl)'] = bi_models_perplexities_df.idxmin(axis=1)
tri_models_perplexities_df['better_model(lower ppl)'] = tri_models_perplexities_df.idxmin(axis=1)

print("BI-GRAM Models: counting the number of times the Smoothing model has a lower perplexity score than LogProbs\n")
print(bi_models_perplexities_df['better_model (lower ppl)'].value_counts())
print('\n')
print("TRI-GRAM Models: counting the number of times the Smoothing model has a lower perplexity score than LogProbs\n")
print(tri_models_perplexities_df['better_model(lower ppl)'].value_counts())

BI-GRAM Models: counting the number of times the Smoothing model has a lower perplexity score than LogProbs

NGramLanguageModelSmoothing    33
NGramLanguageModelLogProbs      2
Name: better_model (lower ppl), dtype: int64


TRI-GRAM Models: counting the number of times the Smoothing model has a lower perplexity score than LogProbs

NGramLanguageModelSmoothing    33
NGramLanguageModelLogProbs      2
Name: better_model(lower ppl), dtype: int64


In general, the smoothing model achieved better perplexity: it had the lower score on 33 of the 35 test sentences, for both the bigram and the trigram setting. Since lower perplexity is better, this suggests that the smoothing model handles unseen data better.
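The `perplexity` method itself is not shown in this notebook. A common definition, assumed here for illustration only, is the exponential of the negative average log probability per token; the function below is a hypothetical standalone version of it.

```python
import math


def perplexity(log_prob, n_tokens):
    """exp of the negative average natural-log probability per token."""
    return math.exp(-log_prob / n_tokens)


# Toy check: 8 tokens whose probabilities multiply to 1/8
print(perplexity(math.log(0.125), 8))
```

Under this definition, a sentence that the model predicts with certainty has perplexity 1, and the value grows as the sentence becomes less likely under the model.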

Let's now see what the two smoothing models (bigram and trigram) can generate.

In [2]:
print(bi_models['NGramLanguageModelSmoothing'].generate())

print(''.join(tri_models['NGramLanguageModelSmoothing'].generate()))



['<s>', '<s>', 'I', 'have','panelists']
ignored by winners.  -- @ CoachJoeGibbs


In [37]:

print((bi_models['NGramLanguageModelSmoothing'].generate()))
print(''.join(tri_models['NGramLanguageModelSmoothing'].generate()))


['<s>', '<s>', 'I', 'have', 'not', 'panelists']
Obama is laughing at Karl Rove losers-true ! I never said anything bad about

In [40]:
print(''.join(bi_models['NGramLanguageModelSmoothing'].generate(['Judges'])))

Judges Taxes Regulations Healthcare the Military Vets ( Choice ! )


The last cell generates text from an example seed (`['Judges']`).