# Baseline

In [2]:
import os

path = 'Trump Rally Speeches/'
files = os.listdir(path)
files = [path + file for file in files]
 
dates = []
locations = []
years = []
days = []
months = []
speeches_text = []
 
month_ab = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep','Oct', 'Nov', 'Dec']

for file in files:
    for month in month_ab:
        if month in file:
            # Location is everything between the folder and the month abbreviation
            locations.append(file[file.find('/')+1:file.find(month)])
            # Date is the rest of the file name, without the extension
            date = file[file.find(month):file.find('.txt')]
            dates.append(date)
            months.append(date[:3])
            days.append(str(date[3]))  # keeps only the first character after the month
            years.append(date[-4:])
            break
        
for file in files:
    with open(file, 'r') as f:
        speeches_text.append(f.read())     
        
import pandas as pd
 
df = pd.DataFrame({'Speech':files, 'Date':dates, 'Location':locations, 'Year':years, 'Month':months, 'Day':days, 'Speech_Text':speeches_text})

In [3]:
from preprocessing import preprocessing_pipline

preprocessing = preprocessing_pipline(df['Speech_Text'])
df['Speech_Text_prepro'] = preprocessing.preprocess_light()

thank thank thank vice president pence hes good guy weve done great job together merry christmas mic
Thank you. Thank you. Thank you to Vice President Pence. He's a good guy. We've done a great job tog
thank thank thank vice president pence hes good guy weve done great job together merry christmas mic


For the baseline, we want something simple that can serve as a benchmark for the other models. We will use an N-gram model that operates on the tokenized text.

Explanation of N-grams:

- "N" is the number of words in each sequence the model looks at. With N = 3, the model works on sequences of 3 consecutive words.
- The model is based on the frequency of word sequences. For example, with N = 2 we count how often each pair of consecutive words occurs.
- From these counts we can estimate the probability of the next word given the N-1 previous words.
- To predict, we choose the word with the highest probability.

This leads to a poor understanding of context and long-term dependencies. However, that simplicity is what we are looking for in a baseline model.
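
As an illustration of the counting idea above, here is a minimal sketch for N = 2. The toy sentence and the helper `next_word_probabilities` are made up for this example and are not part of the notebook's model.

```python
from collections import Counter, defaultdict

# Toy corpus (hypothetical, not taken from the speeches)
tokens = "we will win we will fight we must win".split()

# For N = 2, count how often each word follows each one-word context
followers = defaultdict(Counter)
for prev_word, next_word in zip(tokens, tokens[1:]):
    followers[prev_word][next_word] += 1

def next_word_probabilities(context):
    """Relative frequency of each word seen after `context`."""
    counts = followers[context]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

probs = next_word_probabilities("we")
best = max(probs, key=probs.get)  # word with the highest estimated probability
print(probs)
print(best)
```

Here "we" is followed by "will" twice and by "must" once, so "will" gets the highest estimated probability.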


![alt text](https://i.stack.imgur.com/8ARA1.png)

We don't have x and y as in a classification problem. We have a sequence of words and want to predict the next word, so we use a sliding window to create the features and the labels. This gives us a kind of y for each x, so we need to split the data into train and test sets to get a realistic evaluation and see how well the model generalizes.
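
The sliding window can be sketched as follows. The function name `sliding_window_pairs` and the example sentence are assumptions made for illustration only.

```python
def sliding_window_pairs(tokens, n):
    """Return (context, target) pairs: n-1 context words and the word that follows."""
    return [
        (tuple(tokens[i:i + n - 1]), tokens[i + n - 1])
        for i in range(len(tokens) - n + 1)
    ]

# Toy sentence (hypothetical)
tokens = "we are going to win".split()
pairs = sliding_window_pairs(tokens, n=3)
for context, target in pairs:
    print(context, "->", target)
```

Each context tuple plays the role of x and the word that follows it plays the role of y.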

We cannot really use accuracy as a metric because we are not predicting a class. Instead we use perplexity: the lower the perplexity, the better the model.
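
To give an intuition for the metric, here is a small sketch that computes perplexity from the probabilities a model assigned to each word of a test sequence. The function `perplexity` and the probability values are illustrative assumptions, not results from the notebook.

```python
from math import log2

def perplexity(probabilities):
    """Perplexity of a sequence, given the probability assigned to each of its words."""
    avg_log2 = sum(log2(p) for p in probabilities) / len(probabilities)
    return 2 ** (-avg_log2)

# A model that hesitates evenly between 4 words at every step
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
# A model that assigns high probability to the words that actually occur
confident = perplexity([0.9, 0.8, 0.9, 0.7])
print(uniform, confident)
```

A perplexity of 4 can be read as "the model is as uncertain as if it were choosing uniformly among 4 words", which is why lower values are better.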

In [93]:
from sklearn.model_selection import train_test_split

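# Flatten all speeches into a single list of words
text_corpus = [word for speech in df['Speech_Text'].str.split() for word in speech]
# Note: train_test_split shuffles by default, so word order is not preserved in
# these splits; pass shuffle=False to keep the words in their original sequence
train_corpus, test_corpus = train_test_split(text_corpus, test_size=0.2, random_state=42)

# Same split on the preprocessed text
text_corpus_prepro = [word for speech in df['Speech_Text_prepro'].str.split() for word in speech]
train_corpus_prepro, test_corpus_prepro = train_test_split(text_corpus_prepro, test_size=0.2, random_state=42)
# Show a few preprocessed words
print(train_corpus_prepro[:10])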

['go', 'going', 'alabama', 'privilege', 'say', 'joe', 'understand', 'soo', 'special', 'every']
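
The sample printed above is not in speech order, because `train_test_split` shuffles its input by default. Since an N-gram model relies on consecutive words, one option is a contiguous split. Below is a minimal sketch with a hypothetical helper `sequential_split` and a toy token list; passing `shuffle=False` to `train_test_split` should give an equivalent order-preserving split.

```python
def sequential_split(tokens, test_size=0.2):
    """Split a token list into contiguous train and test parts, keeping word order."""
    cut = len(tokens) - int(len(tokens) * test_size)
    return tokens[:cut], tokens[cut:]

# Toy token list (hypothetical)
tokens = "thank you thank you to vice president pence he is".split()
train_tokens, test_tokens = sequential_split(tokens, test_size=0.2)
print(train_tokens)
print(test_tokens)
```

With this kind of split, the N-grams built from each part are sequences that actually occur in the text.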


We create a class `my_model_NGram` that is used to train and test the model.
It lets us:
- Build the model: construct a dictionary of N-grams, where each key is a sequence of N-1 words and the value is the list of words observed after it (a word that follows more often appears more times in the list).
- Evaluate the model: compute its perplexity on the test set.
  We calculate the likelihood of the test set under the trained model, then the perplexity with the formula 2^(-1/N * log2(likelihood)), where N here is the number of N-grams in the test set.
- Generate text from the start of a sentence: the trained model predicts the next word, we append it to the sentence, and so on, up to a chosen number of words. I have also created a version that picks the next word at random. It is not completely random because we still use the frequency of..., but it creates interesting results :)
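
For illustration, the two generation strategies described above can be sketched with `collections.Counter`. This is a simplified stand-in under stated assumptions, not the class used in this notebook: the names `build_counts` and `generate`, the `sample` flag, and the toy sentence are all hypothetical.

```python
import random
from collections import Counter, defaultdict

def build_counts(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def generate(counts, seed_text, n, max_length=10, sample=False, rng=None):
    """Extend seed_text word by word, greedily or by frequency-weighted sampling."""
    rng = rng or random.Random(0)
    output = seed_text.split()
    for _ in range(max_length):
        context = tuple(output[-(n - 1):])
        if context not in counts:
            break  # unseen context: stop generating
        candidates = counts[context]
        if sample:
            # Draw a follower with probability proportional to its count
            words, weights = zip(*candidates.items())
            next_word = rng.choices(words, weights=weights, k=1)[0]
        else:
            # Pick the most frequent follower
            next_word = candidates.most_common(1)[0][0]
        output.append(next_word)
    return " ".join(output)

# Toy corpus (hypothetical)
tokens = "we will win and we will win big and we will fight".split()
counts = build_counts(tokens, n=3)
greedy = generate(counts, "we will", n=3, max_length=3)
sampled = generate(counts, "we will", n=3, max_length=5, sample=True,
                   rng=random.Random(1))
print(greedy)
print(sampled)
```

Storing counts rather than a flat list makes the most frequent follower easy to look up, while weighted sampling keeps the frequency-driven randomness of the random variant.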



In [78]:
import random
from nltk import ngrams
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from math import log


class my_model_NGram:
    def __init__(self, n, text_corpus):
        self.n = n
        self.text_corpus = text_corpus
        self.ngrams = list(ngrams(text_corpus, n))
        self.model = self.build_model()

    def build_model(self):
        model = {}
        for ngram in self.ngrams:
            prefix = tuple(ngram[:-1])
            target = ngram[-1]

            if prefix in model:
                model[prefix].append(target)
            else:
                model[prefix] = [target]
        return model

    def generate_text_with_random(self, seed_text, max_length=100):
        output_text = seed_text.split()
        prefix = tuple(output_text[-(self.n - 1):])

        for _ in range(max_length):
            if prefix not in self.model:
                break
            next_word = random.choice(self.model[prefix])
            output_text.append(next_word)
            prefix = prefix[1:] + (next_word,)

        return ' '.join(output_text)
    
    def generate_text(self, seed_text, max_length=100):
        output_text = seed_text.split()
        prefix = tuple(output_text[-(self.n - 1):])

        for _ in range(max_length):
            if prefix not in self.model:
                break
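            # Takes the first follower stored for this context, which is not
            # necessarily the most frequent one
            next_word = self.model[prefix][0]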
            output_text.append(next_word)
            prefix = prefix[1:] + (next_word,)

        return ' '.join(output_text)

    def calculate_perplexity(self, test_corpus):
        test_ngrams = list(ngrams(test_corpus, self.n))
        log_prob_sum = 0
        num_ngrams = len(test_ngrams)

        for ngram in test_ngrams:
            context = tuple(ngram[:-1])
            word = ngram[-1]

            if context in self.model:
                # Words observed after this context (with repetitions)
                followers = self.model[context]
                # Add-one style smoothing; the denominator uses the corpus length
                # rather than the vocabulary size, so the probabilities generally
                # do not sum to 1
                denominator = len(followers) + len(self.text_corpus)
                word_probability = (followers.count(word) + 1) / denominator
            else:
                # Unseen context: fall back to a small constant probability
                word_probability = 1 / len(self.text_corpus)

            log_prob_sum += log(word_probability)

        # Note: log() is the natural logarithm while the base below is 2, so the
        # bases do not match; use log2 above (or exp here) for a standard perplexity
        perplexity = 2 ** (-log_prob_sum / num_ngrams)
        return perplexity

Generated Text: I want to and It's over wanted of have all, millions job, see again. wage with toward the just take they killer. AIDS
Perplexity: 6204.833172882722


In [66]:
# Try different values of N
for n in range(2, 6):
    ngram_model = my_model_NGram(n=n, text_corpus=train_corpus)
    perplexity = ngram_model.calculate_perplexity(test_corpus)
    print("N =", n, "Perplexity:", perplexity)
 

N = 2 Perplexity: 3380.9841212515016
N = 3 Perplexity: 6099.203647607046
N = 4 Perplexity: 6204.833172882722
N = 5 Perplexity: 6205.798200535082


In [88]:
# with preprocessing
for n in range(2, 6):
    ngram_model = my_model_NGram(n=n, text_corpus=train_corpus_prepro)
    perplexity = ngram_model.calculate_perplexity(test_corpus_prepro)
    print("N =", n, "Perplexity:", perplexity)

N = 2 Perplexity: 3146.029000302403
N = 3 Perplexity: 3886.032907464139
N = 4 Perplexity: 3889.3496831546913
N = 5 Perplexity: 3889.3496505377457


Without preprocessing, the best perplexity is about 3381, obtained with N = 2.
With preprocessing, the best perplexity is about 3146, also with N = 2, and the values are lower for every other N as well.

In [82]:
# Try different seeds, without random choice
seeds = ["I want to", "I will do"]
ngram_model = my_model_NGram(n=3, text_corpus=train_corpus)
for seed in seeds:
    generated_text = ngram_model.generate_text(seed, max_length=20)
    print("Seed text:", seed)
    print("Generated Text:", generated_text)
     


Seed text: I want to
Generated Text: I want to lying people supporters to give they'd crowd. all did at But share today, party President tonight, we because of The
Seed text: I will do
Generated Text: I will do with there all loves them, is a paying their It's we very the that. an up. job, all true. me


In [84]:
# with random
ngram_model = my_model_NGram(n=3, text_corpus=train_corpus)
for seed in seeds:
    generated_text = ngram_model.generate_text_with_random(seed, max_length=20)
    print("Seed text:", seed)
    print("Generated Text:", generated_text)

Seed text: I want to
Generated Text: I want to that's us Ohio, can called a kept again. And and war. 90. wonderful actually renegotiated just this said, She were
Seed text: I will do
Generated Text: I will do on be for the we've deals adding announced the you you And be we He's he They're study will who
