# Setup

In [1]:
# general packages
import pandas as pd
import numpy as np
from tqdm.auto import tqdm

In [2]:
# package for splitting the file into sentences
import re

In [3]:
# packages for randomely selecting words from list
import random

# ---------------------------------------------------------------------------------------------------------------

# Part 1

Part 1 involves implementing an unsmoothed bigram language model in python.
This model was to be trained on a corpus of 4 sentences and then used to calculate the probability of specified test sentences.

These probabilities were calculated using the following formula:

    P(a|b) = number of occurances of 'a b' in corpus/number of occurances of 'b' in corpus

### Define train & test corpus'

In [4]:
part_1_sentence_corpus = ["<s> a b </s>",
                          "<s> b b </s>",
                          "<s> b a </s>",
                          "<s> a a </s>",
                         ]

In [5]:
part_1_test_strings = ["<s> b </s>",
                       "<s> a </s>",
                       "<s> a b </s>",
                       "<s> a a </s>",
                       "<s> a b a </s>",
                      ]

### Define a function to create a dictionary of all the sequences of 'n'  words in the corpus and their counts

As the probabilities rely heavily on the number of occurances of words or word sequences in the corpus, the best way to optimise this was to count these occurances beforehand and store them in a dictionary.
This dictionary could then be indexed to find the number of occurances of each word or word sequence.
This greatly sped up the process as it meant that these counts didn't have to be counted each time.

The function for this is defined below:

In [6]:
def get_dict_of_counts(corpus, words_to_count):
    
    '''
    Count the occurances of each word in the corpus
    
    corpus: list of strings - this list contains all of the sentences in the corpus
    words_to_count: integer - this is the number of joined together words to look for in the corpus
                            - This allows us to count the number of occurances of 'n' words in sequence
    
    Return: dictionary of sting to integer {str: int} - a word mapped to the number of times the word occurs in the corpus
    '''
    
    word_counts = {}
    # iterate through all the sentences in the corpus
    for sentence in corpus:
               
        # split the sentece into words
        tokens = sentence.split(" ")

        # iterate through these words
        for i in range(len(tokens) - (words_to_count - 1)):
            
            # we may want to count the occurances of each 1 word, or we may want to count the occuances of each word pair
            words = " ".join(tokens[i: i+words_to_count])
                
            # add a count to the number of occurances of this word
            if words in word_counts:
                word_counts[words] += 1
            else:
                word_counts[words] = 1
            
    return word_counts

### Define an n-gram language model

For the implementation of the model, I decided to take a general approach. While this task was to implement a bigram model, I decided to implement an ngram model with a given 'n' parameter that I could specify. This meant that if I wanted to run a bigram model, I could just specify this parameter as *2* and if I wanted to use other ngram sizes, I could also do this. I did this as I knew this would give my code an extra level of flexibility and that this could be done while also not effect the performance of the model or the output.

As Part 2 uses the same model on a different corpus except with the small change of adding smoothing, I have implemented this model with a smoothing option. This will enable me to set this parameter as 'False' for Part 1 and then 'True' for Part 2 when the smoothing is needed.

In [7]:
def get_prob_word_a_given_b(a, b, corpus_size, indiv_w_counts, multi_w_counts):

    '''
    Calculate the probability of a word given a string of the previous 'n' words that came before it
    
    a: string - The word that we want to get the probability of occuring after string 'b' occurs
    b: string - The string of words that occur before word 'a'
              - This can be an empty string (meaning we get the unigram probability) or it can be many space delimited words
    indiv_w_counts: dictionary {str: int} - all the words in the corpus mapped to a count of the # of occurances
    multi_w_counts: dictionary {str: int} - all the 'n' word sequences in the corpus mapped to a count of the # of occurances
    
    Return: integer - the probability that 'a' occurs after the string 'b' occurs
    '''
    
    # count the number of times both 'a' and 'b' occured together and the number of times 'b' occured regardless of 'a'
    if b == "":
        # b == '' implies there is no given, so we have a unigram which involves a count of all words in the corpus
        num_combined_occurances = indiv_w_counts[a] if (a in indiv_w_counts) else 0
        num_b_occurances = corpus_size
        
    else:
        # there is a given, so we have an n-gram
        combined_str = " ".join([b, a])
        num_combined_occurances = multi_w_counts[combined_str] if (combined_str in multi_w_counts) else 0
        num_b_occurances = indiv_w_counts[b]
        
    # return the probability of word 'a' occurring given 'b' occured
    return num_combined_occurances/num_b_occurances

In [8]:
def get_prob_w_or_wo_smoothing(word, given_words, corpus_size, smoothing, indiv_w_counts, multi_w_counts):
    
    '''
    Calculate the probability of a word given a string of the previous 'n' words that came before it
    This can be calculated using a smoothing technique or not using smoothing
    
    word: string - The word that we want to get the probability of occuring after string 'given_words' occurs
    given_words: string - The string of words that occur before word 'word'
                        - This can be an empty string (meaning we get the unigram probability) or many space delimited words
    smoothing: boolean - This parameter specifies whether we want to do smoothing or not
    indiv_w_counts: dictionary {str: int} - all the words in the corpus mapped to a count of the # of occurances
    multi_w_counts: dictionary {str: int} - all the 'n' word sequences in the corpus mapped to a count of the # of occurances
    
    Return: integer - the probability that 'word' occurs after the string 'given_words' occurs
    
    '''
    
    # smooth the probabilities if specified
    if smoothing:
        ngram_prob = get_prob_word_a_given_b(word, given_words, corpus_size, indiv_w_counts, multi_w_counts)
        unigram_prob = get_prob_word_a_given_b(word, "", corpus_size, indiv_w_counts, multi_w_counts)

        return (0.5 * ngram_prob) + (0.5 * unigram_prob) 

    else:
        return get_prob_word_a_given_b(word, given_words, corpus_size, indiv_w_counts, multi_w_counts)

In [9]:
def get_n_gram_probs_in_sentence(sentence, n, corpus_size, smoothing, indiv_w_counts, multi_w_counts):
    
    '''
    Take a sentence and calculate the probabilities of all the ngram's that occur in it
    For a bigram model, if sentence='he went shopping', we calculate prob of [he], [he, went], [went, shopping]
    
    sentence: string - A string of words in a sentence
    n: integer - This specifies what size 'n' our model is (# words in the ngram) - bigram -> n=2, trigram -> n=3, etc..
    corpus_size: integer - This is the number of tokens (words) in the corpus
    smoothing: boolean - This parameter specifies whether we want to do smoothing or not
    indiv_w_counts: dictionary {str: int} - all the words in the corpus mapped to a count of the # of occurances
    multi_w_counts: dictionary {str: int} - all the 'n' word sequences in the corpus mapped to a count of the # of occurances
    
    returns: list of integers - this list contains the probabilites of each ngram contained in the sentence
    '''
    
    # tokanise the sentences
    tokens_list = sentence.split(" ")
        
    n_gram_probs = []
    for i in range(len(tokens_list)):
        
        # get the number of tokens in the n_gram model
        tokens = [tokens_list[i]] if i < (n-1) else tokens_list[i-(n-1):i+1]
        
        word = tokens[-1]
        given_words = " ".join(tokens[:-1])
    
        # get the probability of this word given the previous words
        prob = get_prob_w_or_wo_smoothing(word, given_words, corpus_size, smoothing, indiv_w_counts, multi_w_counts)
        
        # add this probability to a list of the probabilites of each n-gram token set
        n_gram_probs.append(prob)
        
    return n_gram_probs

In [10]:
def get_prob_of_test_sentences(n_size, test_sentences, corpus, smoothing):

    '''
    1. Take in a list of sentences and a corpus
    2. Train an ngram model on the corpus
    3. Then calculate the probabilites of each sentence using this trained model
    
    n_size: integer - This specifies the size 'n' our model is (# words in the ngram) - bigram -> n=2, trigram -> n=3, etc..
    test_sentences: list of strings - This contains a list of all the sentences that we want to get the probability of
    corpus: list of strings - This contains a list of all the sentences in the corpus
    smoothing: boolean - This parameter specifies whether we want to do smoothing or not
    
    Return: dictionary {str: int} - return a dictionary containing the sentence and then the probability of that sentence 
                                    using the model trained on the given corpus
    '''
    
    # turn the text in the corpus to lowercases to make it uniform
    lower_corpus = [s.lower() for s in corpus]
    
    print("1.) Counting the word occurances in the corpus")
    # create a dictionary of all the sequences words in the corpus and their counts
    indiv_w_counts = get_dict_of_counts(lower_corpus, 1)
    corpus_size = np.sum([v for k,v in indiv_w_counts.items()])
    
    # create a dictionary of all the sequences of 'n' words in the corpus and their counts
    multi_w_counts = get_dict_of_counts(lower_corpus, n_size)
        
    print("2.) Calculating the probabilites of the sentences")
    # iterate through the sentences in the test set
    sentence_to_prob = {}
    for sentence in test_sentences:
        
        # turn the sentence to lowercase to make it uniform
        sentence = sentence.lower()
        
        n_gram_probs = get_n_gram_probs_in_sentence(sentence, n_size, corpus_size, smoothing, indiv_w_counts, multi_w_counts)
        
        # multiply the n-gram probabilites together to get the probability of the sentence - store this in a dictionary
        sentence_to_prob[sentence] = np.prod(n_gram_probs)
        
    return sentence_to_prob

### Run an unsmoothed bigram model on the data

This is when I run the above defined blocks of code. The corpus and the test sentences are fed in and I get back a dictionary mapping each sentence to its calculated probability. I can then iterate through this dictionary to print these probabilities.

Again, as my functions define an ngram model, I must speify that I want a bigram model. This is done by setting *'n=2'*.

In [11]:
# tell the algorithm we want to use a bigram model
n = 2

In [12]:
part_1_prob_dict = get_prob_of_test_sentences(n, part_1_test_strings, part_1_sentence_corpus, smoothing=False)

1.) Counting the word occurances in the corpus
2.) Calculating the probabilites of the sentences


In [13]:
for sentence, prob in part_1_prob_dict.items():
    print("P('{}') = {}".format(sentence, prob))

P('<s> b </s>') = 0.0625
P('<s> a </s>') = 0.0625
P('<s> a b </s>') = 0.015625
P('<s> a a </s>') = 0.015625
P('<s> a b a </s>') = 0.00390625


# ---------------------------------------------------------------------------------------------------------------

# Part 2

Part 2 involved running the same bigram model except this time adding smoothing into the model.
As I had already included this as a parameter in the model created for Part 1, I didn't have to do anything to change the model but instead could just run these same functions, this time specifying *'smoothing=True'*.

This calculates the probabilites as follows: (where the counts are done on the whole corpus)
    
    P(a|b) = (0.5 * count('a b')/count('b')) + (0.5 * count('a')/count(all words))

### Read in the training corpus

This part involved using a different corpus to Part 1.
This corpus was shared with me as a '.txt' file via google drive so I downloaded this file to my local disk so that I could read it into this notebook.

In [14]:
# read in the data as one big string
wiki_oscars_contents = open('data/wikiOscars.txt', 'r').read()

wiki_oscars_contents

"The Academy Awards , commonly known as The Oscars , is an annual American awards ceremony honoring achievements in the film industry . Winners are awarded the statuette , officially the Academy Award of Merit , that is much better known by its nickname Oscar . The awards , first presented in 1929 at the Hollywood Roosevelt Hotel , are overseen by the Academy of Motion Picture Arts and Sciences (AMPAS) .\nThe awards ceremony was first televised in 1953 and is now seen live in more than 200 countries . The Oscars is also the oldest entertainment awards ceremony ; its equivalents , the Emmy Awards for television , the Tony Awards for theatre , and the Grammy Awards for music and recording , are modeled after the Academy Awards .\nThe 86th Academy Awards were held on March 2 , 2014 , at the Dolby Theatre in Los Angeles .\nThe first Academy Awards were presented on May 16 , 1929 , at a private dinner at the Hollywood Roosevelt Hotel with an audience of about 270 people . The post Academy A

### Define a function to process the data

Once read in, the data had to be processed into a list of sentences so that the model could be trained on these sentences.
This was done by:

    1. Splitting up the text into sentences on the full stops (.) and the newline characters (\n),
    2. Cleaning & tokanising the data,
    3. Adding in the '\<s>' and '\</s>' characters to specify the start and end of each sentence.
    
Step *2* & *3* were done using the below function.
This function is long and involves a number of different steps that are not neccessary for this part as it is also used in Part 3. Due to the sheer size of the data used in Part 3, there was a lot of processing needed, however, this function still performs well and is still a good processing function for this part.

In [15]:
def process_corpus(corpus_sentences):
    
    '''
    Take in a list of sentences, clean and tokanise them, add the sentence boundary characters to them, then output them
    
    corpus_sentences: list of strings - This contains a list of all the sentences in the corpus
    
    Returns: list of strings - This contains a list of all the now cleaned and processed sentences in the corpus
    '''
    
    replace_dict = {# errors
                    'â€“': '-',
                    "’": "'",
                    "‘": "'",
                    '“': '"',
                    '”': '"',
                    '…': '...',
                    # acronyms
                    'u.s.': 'united states',
                    'u.n.': 'united nations',
                    'l.a.': 'los angeles',
                    'n.y.': 'new york',
                    'w.h.': 'white house',
                    's.c.': 'supreme court',
                    # just remove '.'
                    'a.m.': 'am',
                    'p.m.': 'pm',
                    'mr.': 'mr',
                    'ms.': 'ms',
                    'mrs.': 'mrs',
                    # shortened words
                    'jr.': 'junior',
                    'st.': 'street',
                    'no.': 'number',
                    'dr.': 'doctor',
                    'dept.': 'department',
                    'pres.': 'president',
                    'sec.': 'secretary',
                    'sen.': 'senetor',
                    'rep.': 'representative',
                    # shortened months
                    'jan.': 'january',
                    'feb.': 'feruary',
                    'mar.': 'march',
                    'arp.': 'april',
                    'jun.': 'june',
                    'jul.': 'july',
                    'aug.': 'august',
                    'sep.': 'september',
                    'oct.': 'october',
                    'nov.': 'november',
                    'dec.': 'december',
                   }
    
    regex_replace_dict = {"(?:^|(?<=\s))\'": "' ", # match an apostrophy occuring at the beginning of the string or after a whitespace char in a positive lookbehind
                          "\'(?:$|(?=\s))": " '",  # match an apostrophy occuring before a whitespace char or at the end of the string in a positive lookahead
                         }
        
    chars_to_put_spaces_around = [',', '.', '?', '!', ':', ';', '"', '(', ')']
    dont_add_list = [' ', '']
        
    processed_corpus = []
    for sentence in tqdm(corpus_sentences):
        
        # step 1 - remove sentences that are empty
        if sentence == "":
            continue
            
        # step 2 - turn the sentence to lowercase
        sentence = sentence.lower()
        
        # step 3 - clean up the data
        for k, v in replace_dict.items():
            sentence = sentence.replace(k, v)

        # step 4 - ensure spaces are appropriately placed
        new_s = ''
        for i in range(len(sentence)):

            two_prev_c = sentence[i-2] if i > 1 else ''
            prev_c = sentence[i-1] if i > 0 else ''
            c = sentence[i]
            next_c = sentence[i+1] if i < len(sentence)-1 else ''
            
            if c in chars_to_put_spaces_around and not (prev_c.isdigit() and next_c.isdigit()):
                
                # pattern if there are multiple full stops all in a line - '...'
                multi_prev_full_stop = (c == '.' and prev_c == '.')
                multi_post_full_stop = (c == '.' and next_c == '.')
                
                # pattern for peoples middle initials
                initial_befre_full_stop = (c == '.' and prev_c.isalpha() and two_prev_c == ' ')
                
                add_before_bool = False if (multi_prev_full_stop or initial_befre_full_stop) else prev_c not in dont_add_list
                add_after_bool = False if multi_post_full_stop else next_c not in dont_add_list
                
                new_s += (add_before_bool * ' ') + c + (add_after_bool * ' ')

            else:
                new_s += c

        # step 5 - put spaces around the quotation marks
        for k, v in regex_replace_dict.items():
            new_s = re.sub(k, v, new_s)
        
        # step 6 - add the sntence boundaries
        final_s = '<s> ' + new_s.strip() + ' </s>'
        
        # step 7 - again remove multiple spaces occuring together
        final_s = re.sub(' +', ' ', final_s)
            
        # step 8 - add this processed sentence to a list of the other processed sentences
        processed_corpus.append(final_s)
        
    return processed_corpus

### Process the data

As stated above, the first step is to split up the data into sentences. This was done by using the full stops (.) and the newline characters (\n). This approach seemed to be a good way to split up the data in this case. Despite some limitations in this approach if the data was different or contained more errors, this worked perfectly on the wikiOscars data.

In [16]:
# split this string into sentences - on the newline characters & full stops followed by spaces
wiki_oscars_sentences = re.split(r'(?<=\.) |\n', wiki_oscars_contents)

wiki_oscars_sentences

['The Academy Awards , commonly known as The Oscars , is an annual American awards ceremony honoring achievements in the film industry .',
 'Winners are awarded the statuette , officially the Academy Award of Merit , that is much better known by its nickname Oscar .',
 'The awards , first presented in 1929 at the Hollywood Roosevelt Hotel , are overseen by the Academy of Motion Picture Arts and Sciences (AMPAS) .',
 'The awards ceremony was first televised in 1953 and is now seen live in more than 200 countries .',
 'The Oscars is also the oldest entertainment awards ceremony ; its equivalents , the Emmy Awards for television , the Tony Awards for theatre , and the Grammy Awards for music and recording , are modeled after the Academy Awards .',
 'The 86th Academy Awards were held on March 2 , 2014 , at the Dolby Theatre in Los Angeles .',
 'The first Academy Awards were presented on May 16 , 1929 , at a private dinner at the Hollywood Roosevelt Hotel with an audience of about 270 peopl

In [17]:
# process and clean these sentences using the above defined function
wiki_oscars_processed_corpus = process_corpus(wiki_oscars_sentences)

wiki_oscars_processed_corpus

HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=23.0), HTML(value='')))




['<s> the academy awards , commonly known as the oscars , is an annual american awards ceremony honoring achievements in the film industry . </s>',
 '<s> winners are awarded the statuette , officially the academy award of merit , that is much better known by its nickname oscar . </s>',
 '<s> the awards , first presented in 1929 at the hollywood roosevelt hotel , are overseen by the academy of motion picture arts and sciences ( ampas ) . </s>',
 '<s> the awards ceremony was first televised in 1953 and is now seen live in more than 200 countries . </s>',
 '<s> the oscars is also the oldest entertainment awards ceremony ; its equivalents , the emmy awards for television , the tony awards for theatre , and the grammy awards for music and recording , are modeled after the academy awards . </s>',
 '<s> the 86th academy awards were held on march 2 , 2014 , at the dolby theatre in los angeles . </s>',
 '<s> the first academy awards were presented on may 16 , 1929 , at a private dinner at the h

### Define the test strings

The below are the test sentences specified during this task.

In [18]:
part_2_test_string = ["<s> The first Oscar was presented in 1929 . </s>",
                      "<s> The first Oscar was presented in 1929 </s>",
                      "<s> The first Oscar was awarded in 1929 . </s>",
                      "<s> The Oscar was first awarded in 1929 . </s>",
                      "<s> The first best Picture Arts and the . </s>",
                      "<s> The party was at the Mayfair Hotel . </s>",
                      "<s> The party was changed to the Mayfair . </s>",
                      "<s> The party was at the Emil Jannings . </s>",
                      "<s> The party was at the Hollywood Hotel . </s>",
                     ]

### Get the probabilites of each test sentence using a smoothed bigram model

Once the data was defined and processed, the next step then was to train the model on the corpus and calculate the probabilites of the test sentences. I could use the model specified in Part 1 and this time set the smoothing parameter to 'True'.

As my functions define an ngram model, I must again specify that I want a bigram model. This is done by setting *'n=2'*.

In [19]:
# tell the algorithm we want to use a bigram model
n = 2

In [20]:
part_2_prob_dict = get_prob_of_test_sentences(n, part_2_test_string, wiki_oscars_processed_corpus, smoothing=True)

1.) Counting the word occurances in the corpus
2.) Calculating the probabilites of the sentences


In [21]:
for sentence, prob in part_2_prob_dict.items():
    print("P('{}') = {}".format(sentence, prob))

P('<s> the first oscar was presented in 1929 . </s>') = 6.475086340530641e-16
P('<s> the first oscar was presented in 1929 </s>') = 1.2501505281087507e-15
P('<s> the first oscar was awarded in 1929 . </s>') = 5.65686170155563e-17
P('<s> the oscar was first awarded in 1929 . </s>') = 9.21051636752571e-17
P('<s> the first best picture arts and the . </s>') = 1.1052838174421534e-09
P('<s> the party was at the mayfair hotel . </s>') = 4.167185270744562e-12
P('<s> the party was changed to the mayfair . </s>') = 1.7458354121306087e-14
P('<s> the party was at the emil jannings . </s>') = 3.605351069985228e-14
P('<s> the party was at the hollywood hotel . </s>') = 4.0589466922836645e-14


# ---------------------------------------------------------------------------------------------------------------

# Part 3

Part 3 involved a bit more complexity.
In this part, I had to source a dataset and then using this dataset along with my bigram model, I had to generate sentences.

### Finding a dataset

When it came to choosing a dataset, I actually ended up choosing 2 different datasets.

Early on in this assignment, I chose the first dataset. I obtained this from the website Totoeba at https://tatoeba.org/eng/downloads.
Tatoeba is a large database of sentences and translations often used by people looking to learn a new language. Hence, the dataset I had downloaded contained a list of basic english sentences.
This folder initially downlaoded as a '.bz2' file but I was able to convet it to a '.zip' file using https://cloudconvert.com/bz2-to-zip, and then extract the file from this zip file.
This dataset offered me access to a dataset of over 1.4 million sentences and nearly 80,000 words, however due to the basic nature of these sentences I didn't find the generations very interesting. Despite this, it did prove a useful first step in getting the generation process up and running.

After this, I wanted to choose a dataset that was more interesting in terms of the generation. This is where I found a dataset of both real and fake news. I was able to download this from https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset. Theis download contained two files, with each containing a table of information on either fake or real news articles. This table contained an article title, the article itself, the date the article was posted and the topic of the artic.
In my case, I was only interested in the title column, which contained over 23,000 article titles in the fake news CSV. This was perfect for my use case as it gave 23,000 easily accessible fake news sentences that I could use to train my generation model and generate more fake news article titles.

While the README.md only addresses this fake news dataset, I also generate sentences using the original basic english sentence dataset and the real news dataset. These can be found at the bottom of this notebook as extra to this assignment. They demonstrate the generation ability of this code across multiple datasets.

### Read in the corpus

As described in the README.md file, I decided to proceed with a fake news article dataset. This dataset contained a list of over 23,000 article titles which I could covert to sentences so that when I generated new sentences, I was actually generating more fake news article titles.

I created a function to get some statistics on these sentences once I read them in.

In [22]:
def stats_on_corpus(corpus):
    
    '''
    Calculate a few statistics on the corpus
    
    corpus: list of strings - This contains a list of all the sentences in the corpus
    
    Returns: None
    '''
    
    print("Statistics on the corpus:")
    print(" - There are '{:,}' sentences in this corpus\n".format(len(corpus)))

    print(" - The average length of these sentences is '{:,}'\n".format(round(sum(1 for sentence in corpus for _ in sentence.split(" "))/len(corpus), 3)))

    word_counts = get_dict_of_counts(corpus, 1)
    print(" - There are {:,} words in this dataset\n".format(len(word_counts)))

    print("The following are the first 100 of those sentences:\n")
    for sentence in corpus[:100]:
        print(sentence)

In [23]:
fake_news_df = pd.read_csv("data/fake_news.csv")
fake_news_corpus = [s.strip() for s in fake_news_df["title"]]

stats_on_corpus(fake_news_corpus)

Statistics on the corpus:
 - There are '23,481' sentences in this corpus

 - The average length of these sentences is '14.732'

 - There are 37,429 words in this dataset

The following are the first 100 of those sentences:

Donald Trump Sends Out Embarrassing New Year’s Eve Message; This is Disturbing
Drunk Bragging Trump Staffer Started Russian Collusion Investigation
Sheriff David Clarke Becomes An Internet Joke For Threatening To Poke People ‘In The Eye’
Trump Is So Obsessed He Even Has Obama’s Name Coded Into His Website (IMAGES)
Pope Francis Just Called Out Donald Trump During His Christmas Speech
Racist Alabama Cops Brutalize Black Boy While He Is In Handcuffs (GRAPHIC IMAGES)
Fresh Off The Golf Course, Trump Lashes Out At FBI Deputy Director And James Comey
Trump Said Some INSANELY Racist Stuff Inside The Oval Office, And Witnesses Back It Up
Former CIA Director Slams Trump Over UN Bullying, Openly Suggests He’s Acting Like A Dictator (TWEET)
WATCH: Brand-New Pro-Trump Ad Featur

### Process the corpus

Once these sentences were read in, I could process them using the defined processing function above. While this function was initially defined to process just the data from part 2, I expanded it so that it was able to process this corpus as well.
Hence, the above processing function is quite long and does a good job with this dataset.

In [24]:
fake_news_processed_corpus = process_corpus(fake_news_corpus)

fake_news_processed_corpus

HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=23481.0), HTML(value='')))




["<s> donald trump sends out embarrassing new year's eve message ; this is disturbing </s>",
 '<s> drunk bragging trump staffer started russian collusion investigation </s>',
 "<s> sheriff david clarke becomes an internet joke for threatening to poke people ' in the eye ' </s>",
 "<s> trump is so obsessed he even has obama's name coded into his website ( images ) </s>",
 '<s> pope francis just called out donald trump during his christmas speech </s>',
 '<s> racist alabama cops brutalize black boy while he is in handcuffs ( graphic images ) </s>',
 '<s> fresh off the golf course , trump lashes out at fbi deputy director and james comey </s>',
 '<s> trump said some insanely racist stuff inside the oval office , and witnesses back it up </s>',
 "<s> former cia director slams trump over un bullying , openly suggests he's acting like a dictator ( tweet ) </s>",
 '<s> watch : brand-new pro-trump ad features so much a** kissing it will make you sick </s>',
 "<s> papa john's founder retires , 

### Define the function to get the list of the words that follow each word

This function enabled me to cut down on the amount of iterations I needed to do when generating the next word in the sentence. As this function enables me to know, from the corpus, the words that follows every word, I was able to discount some words from the iteration as I knew they did not come after my given word so therefore had no chance of being the next word in the generated sentence.

In [25]:
def get_list_of_words_that_follow_each_word(two_word_sequence_counts):
    
    '''
    Take in a dictionary, iterate through its keys, recording all the second words that come after the first in the sequence
    
    two_word_sequence_counts: dictionary {str: int} - a sequence of two words mapped to the # times they occur in the corpus
    
    Returns: dictionary {str: [str, str, ...]} - each first word being mapped to a list of all the second words that follow it
    '''
    
    following_words_dict = {}
    for word_pair in tqdm(two_word_sequence_counts.keys()):

        word_1, word_2 = word_pair.split(" ")
        
        if word_1 in following_words_dict:
            following_words_dict[word_1].append(word_2)

        else:
            following_words_dict[word_1] = [word_2]
        
    return following_words_dict

### Define the functions to generate sentences from the corpus

These are the functions used to generate the sentences. The process for doing this is:

    1. Start with a <s> sentence boundary
    2. Iterate through every word that follows this word and calculate the probability that each word follows this word
    3. Select one word from this list based on the probability that it occurs next
    4. Add this selected word to the sentence
    5. Repeat step 2, 3 & 4 untill we have added a <\s> sentence boundary to our sentence
    
This process generates the sentence. I then take this sentence and clean it up to make it look more realistic when it is output. Both the raw and cleaned generated sentences are stored in a list.

I continue generating these sentences until I have reached my targeted amount of sentences. I in turn return the raw and cleaned list of sentences.

In [26]:
def tidy_output(created_sentence):
    
    '''
    Take in a generated sentence and tidy it so that it looks more like a real sentence when it is output
    
    created_sentence: string - A generated sentence
    
    Returns: string - a cleaned version of the input generated sentence, more suitable for outputting
    '''
    
    tokanised_sentence = created_sentence.split(" ")[1:-1] # don't include the '<s>' & '</s>' words
    
    quotes_list = ['"', "'"]
    
    # step 1 - capitalise 'i'
    for i in range(len(tokanised_sentence)):
        word = tokanised_sentence[i]
        if word[0] == 'i' and (len(word) == 1 or word[1] == "'"):
            tokanised_sentence[i].capitalize()
         
    # step 2 - get rid of the unnecessary spacing
    num_quotes = 0
    final_sentence = ""
    for i in range(len(tokanised_sentence)):
        word = tokanised_sentence[i]
        last_char = final_sentence[-1] if final_sentence != '' else ''
        
        if (word in ['.', ',', '?', '!', ';', ':', ')']) or (word in quotes_list and num_quotes % 2 == 1):
            # don't add a space before adding these words
            final_sentence += word
            
        elif (last_char in ['(']) or (last_char in quotes_list and num_quotes % 2 == 1):
            # don't add a space before adding our new word
            final_sentence += word
            
        else:
            # add a space before adding this new word
            final_sentence += " " + word
            
        # count the number of quotes already in this sentence
        if word in quotes_list:
            num_quotes += 1
           
    # step 3 - capitalise the first character in the sentence
    i = 0
    while not final_sentence[i].isalpha() and i < len(final_sentence):
        i = i + 1

    if i < len(final_sentence):
        final_sentence = final_sentence[:i] + final_sentence[i].capitalize() + final_sentence[i+1:]
    
    return final_sentence.strip()

In [27]:
def generate_sentences(num_sentences, corpus, min_sentence_length, smoothing):
    
    '''
    Take in a corpus, train a bigram model on it, generate sentences using the word probabilities of this bigram model
    
    num_sentences: integer - The number of sentences we want to generate from our corpus
    corpus: list of strings - This contains a list of all the sentences in the corpus
    min_sentence_length: integer - The min number of words our generated sentences can have (excluding sentence boundaries)
    smoothing: boolean - This parameter specifies whether we want to do smoothing or not
    
    Returns: tuple (list of strings, list of strings) - This contains: a list of all the raw generated sentences in the corpus
                                                                     : a list of the tidied version of these generated sentence
    '''
    
    # turn the text in the corpus to lowercases to make it uniform
    lower_corpus = [s.lower() for s in corpus]
    
    print("Counting the word occurances in the corpus")
    # create a dictionary of all the sequences words in the corpus and their counts
    individual_word_counts_dict = get_dict_of_counts(lower_corpus, 1)
    print("     There are {} words in this corpus\n".format(len(individual_word_counts_dict)))
    corpus_size = np.sum([v for k,v in individual_word_counts_dict.items()])
    
    # create a dictionary of all the sequences of 'n' words in the corpus and their counts
    word_pair_counts_dict = get_dict_of_counts(lower_corpus, 2)
    print("     There are {} word pairs in this corpus\n".format(len(word_pair_counts_dict)))
    
    # create a dictionary mapping each word to a list of all the words that follow that word
    following_words_dict = get_list_of_words_that_follow_each_word(word_pair_counts_dict)
    
    all_raw_sentences = []
    all_cleaned_sentences = []
    print("Generating the sentences")
    for i in tqdm(range(num_sentences)):
        last_word = "<s>"
        
        created_sentence = last_word
        i = 0
        while last_word != "</s>":
            
            get_dict_of_probs = {}
            
            # get the list of all potential next words
            for next_word in following_words_dict[last_word]:
                    
                # do not allow the sentence to finish if there are less than a certain number of words in the sentence
                if next_word == '</s>' and i < (min_sentence_length + 2): # +2 to allow for <s> & <\s>
                    continue
                    
                elif next_word != '</s>':
                    # do not allow a word to be chosen if the only word that follows that word is an end character and the sentence will be too short
                    cant_use = False
                    j = i
                    w = next_word
                    while len(following_words_dict[w]) == 1 and j < (min_sentence_length + 2): # +2 to allow for <s> & <\s>
                        w = following_words_dict[w][0]
                        if w == '</s>':
                            cant_use = True
                            break
                        j += 1

                    if cant_use:
                        continue
                
                # get the probability of this word being the next word
                get_dict_of_probs[next_word] = get_prob_w_or_wo_smoothing(next_word, last_word, corpus_size, smoothing, individual_word_counts_dict, word_pair_counts_dict)
                

            try:
                # generate the next words for the sentence
                chosen_next_word = random.choices(population=list(get_dict_of_probs.keys()), weights=list(get_dict_of_probs.values()), k=1)[0]
            
            except:
                print(following_words_dict[last_word])
                print(following_words_dict[next_word])
                print(get_dict_of_probs)
            
            # add this word to the sentence
            created_sentence += " " + chosen_next_word
                
            last_word = chosen_next_word
            i += 1
            
        # Fix the capitalisation in the sentence and the commas, quotations and full stops
        final_sentence = tidy_output(created_sentence)
            
        # store this sentence in a list
        all_raw_sentences.append(created_sentence)
        all_cleaned_sentences.append(final_sentence)
        
    return all_raw_sentences, all_cleaned_sentences

### Generate the sentences

Once I know that the data was defined and processed, and the functions have been defined, the next step is to generate the sentences using the data and my defined functions.
For generating the sentences, I use part of the model defined in Part 1 to calculate the probabilities of each word following another word.

Unlike the previous two parts, I did not define an ngram model here but instead explicitely defined a bigram model.

I can specify the number of sentences I generate, the minimum sentence length and if I want to use smoothing or not. This will all have an effect on the generated output sentences.

In [28]:
num_sentences = 100
min_sentence_length = 4
smoothing = True

In [29]:
raw_gen_fake_news, clean_gen_fake_news = generate_sentences(num_sentences, fake_news_processed_corpus, min_sentence_length, smoothing)

Counting the word occurances in the corpus
     There are 17721 words in this corpus

     There are 145703 word pairs in this corpus



HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=145703.0), HTML(value='')))


Generating the sentences


HBox(children=(HTML(value=''), FloatProgress(value=0.0), HTML(value='')))




### Output the generated sentences to standard output and to a text file

What is returned by the generation function is two lists, one containing the raw generated sentences and the other containing the cleaned versions of these generated sentences.
I output both versions of these to a text file to store them and also print both to standard output to show them in this notebook.

I have defined a function to help this process. You can see the generated sentences being output as this function is run below.

I will discuss the outputs of this sentence generation in more detail in the README.md.

In [30]:
def write_sentences_to_file(gen_sentences, filename):
    
    '''
    Iterate through the generated sentences, output them to a file, also print them to standard output
    
    gen_sentences: list of strings - This contains a list of all the generated sentences
    filename: string - the name of the file to output the sentences to
    
    Returns: None
    '''
    
    with open(filename, 'w') as f:
        for sentence in gen_sentences:
            f.write(sentence)
            print(sentence)
            print()

In [31]:
write_sentences_to_file(raw_gen_fake_news, 'outputs/raw_generated_fake_news_sentences.txt')

<s> putin [video] hillary pretended to force his after rally cancelled </s>

<s> cnn is a comprehensive and his children </s>

<s> russia ( video ) patriots let them overpriced chachkies </s>

<s> joe arpaio for ' if nfl players threaten to hide illegal alien protected ? [video] </s>

<s> holy moly ! [video] : was more to turkey you'll be discharged for next to defend hillary's criticism of brilliant video ) </s>

<s> trump throws hillary to " fee of illinois : to get dirt on twitter after protecting obama fights break it does segment </s>

<s> ohio governor just how only cares so much trump wins support grows </s>

<s> dear democrats in her of the forgotten what he can't make your stomach during 2016 election because he said next was attacked repeatedly by maria bartiromo ... does mental illness ' for him with muslim wins , tonight [video] </s>

<s> trump , of marijuana to arm our borders illegally with muslim ban with vegas massacre ... asked about reduction in the answer the woman .

In [32]:
write_sentences_to_file(clean_gen_fake_news, 'outputs/clean_generated_fake_news_sentences.txt')

Putin [video] hillary pretended to force his after rally cancelled

Cnn is a comprehensive and his children

Russia (video) patriots let them overpriced chachkies

Joe arpaio for 'if nfl players threaten to hide illegal alien protected? [video]

Holy moly! [video]: was more to turkey you'll be discharged for next to defend hillary's criticism of brilliant video)

Trump throws hillary to "fee of illinois: to get dirt on twitter after protecting obama fights break it does segment

Ohio governor just how only cares so much trump wins support grows

Dear democrats in her of the forgotten what he can't make your stomach during 2016 election because he said next was attacked repeatedly by maria bartiromo ... does mental illness 'for him with muslim wins, tonight [video]

Trump, of marijuana to arm our borders illegally with muslim ban with vegas massacre ... asked about reduction in the answer the woman ... ever (details)

Ted cruz on live sketch ended obama's a rooftop ... hillary clinton l

# ---------------------------------------------------------------------------------------------------------------

# Generate real news sentences

The folder containing the fake news CSV that I downloaded, also contained an identically formatted real news file. As I had this file already downloaded, I figured that I would run my code on this dataset as well to attempt to generate some real news article titles.

The parameters for smoothing, the minimum length of the sentences and the number of sentences generated are the same as those for the fake news article title generation.

In [33]:
real_news_df = pd.read_csv("data/real_news.csv")
real_news_corpus = [s.strip() for s in real_news_df["title"]]

stats_on_corpus(real_news_corpus)

Statistics on the corpus:
 - There are '21,417' sentences in this corpus

 - The average length of these sentences is '9.956'

 - There are 22,867 words in this dataset

The following are the first 100 of those sentences:

As U.S. budget fight looms, Republicans flip their fiscal script
U.S. military to accept transgender recruits on Monday: Pentagon
Senior U.S. Republican senator: 'Let Mr. Mueller do his job'
FBI Russia probe helped by Australian diplomat tip-off: NYT
Trump wants Postal Service to charge 'much more' for Amazon shipments
White House, Congress prepare for talks on spending, immigration
Trump says Russia probe will be fair, but timeline unclear: NYT
Factbox: Trump on Twitter (Dec 29) - Approval rating, Amazon
Trump on Twitter (Dec 28) - Global Warming
Alabama official to certify Senator-elect Jones today despite challenge: CNN
Jones certified U.S. Senate winner despite Moore challenge
New York governor questions the constitutionality of federal tax overhaul
Factbox: Trum

In [34]:
real_news_processed_corpus = process_corpus(real_news_corpus)

real_news_processed_corpus

HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=21417.0), HTML(value='')))




['<s> as united states budget fight looms , republicans flip their fiscal script </s>',
 '<s> united states military to accept transgender recruits on monday : pentagon </s>',
 "<s> senior united states republican senator : ' let mr mueller do his job ' </s>",
 '<s> fbi russia probe helped by australian diplomat tip-off : nyt </s>',
 "<s> trump wants postal service to charge ' much more ' for amazon shipments </s>",
 '<s> white house , congress prepare for talks on spending , immigration </s>',
 '<s> trump says russia probe will be fair , but timeline unclear : nyt </s>',
 '<s> factbox : trump on twitter ( dec 29 ) - approval rating , amazon </s>',
 '<s> trump on twitter ( dec 28 ) - global warming </s>',
 '<s> alabama official to certify senator-elect jones today despite challenge : cnn </s>',
 '<s> jones certified united states senate winner despite moore challenge </s>',
 '<s> new york governor questions the constitutionality of federal tax overhaul </s>',
 '<s> factbox : trump on t

In [35]:
raw_gen_real_news, clean_gen_real_news = generate_sentences(num_sentences, real_news_processed_corpus, min_sentence_length, smoothing)

Counting the word occurances in the corpus
     There are 15360 words in this corpus

     There are 113367 word pairs in this corpus



HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=113367.0), HTML(value='')))


Generating the sentences


HBox(children=(HTML(value=''), FloatProgress(value=0.0), HTML(value='')))




In [36]:
write_sentences_to_file(raw_gen_real_news, 'outputs/raw_generated_real_news_sentences.txt')

<s> united states lawsuit alleging misconduct inquiry after armed forces </s>

<s> factbox : united nations , tillerson nomination of ethnic cleansing ' leap to october 2 to enact contentious democratic party says </s>

<s> governor says meeting with ' to roll back to interview </s>

<s> trump method not just in ' deadline </s>

<s> republican united on chemical law in investment climate goals despite leaving nothing concrete evidence : united states senate candidate platforms used in early voting fraud </s>

<s> trump's biggest sex assault to mexican leftist lopez obrador leads trump likely dooms reform </s>

<s> organizers of ' after four people in europe to ' </s>

<s> democrat defends eu ' call trump ' strong result </s>

<s> former military grants , politics , respect the fire tear gas at china's xi says war ' turning business roots , rebels in iowa supreme court on inauguration day events after sanctions </s>

<s> after asia trip after united states attorney general escalates </s

In [37]:
write_sentences_to_file(clean_gen_real_news, 'outputs/clean_generated_real_news_sentences.txt')

United states lawsuit alleging misconduct inquiry after armed forces

Factbox: united nations, tillerson nomination of ethnic cleansing 'leap to october 2 to enact contentious democratic party says

Governor says meeting with 'to roll back to interview

Trump method not just in 'deadline

Republican united on chemical law in investment climate goals despite leaving nothing concrete evidence: united states senate candidate platforms used in early voting fraud

Trump's biggest sex assault to mexican leftist lopez obrador leads trump likely dooms reform

Organizers of 'after four people in europe to'

Democrat defends eu 'call trump' strong result

Former military grants, politics, respect the fire tear gas at china's xi says war 'turning business roots, rebels in iowa supreme court on inauguration day events after sanctions

After asia trip after united states attorney general escalates

Trump a smooth obamacare subsidies to california university staff exits, trump tweets raises lots of 

# ---------------------------------------------------------------------------------------------------------------

# Generate basic English sentences

The original dataset I chose contained very basic engllish sentences. While I moved away from this dataset to the fake news dataset, I still had this dataset downloaded and available so I decided to run the sentence generation on this too. This dataset is much bigger, containing 1.4 million sentences, and hence took longer to process.

The parameters for smoothing, the minimum length of the sentences and the number of sentences generated are the same as those for the fake news article title generation.

In [38]:
basic_sentences_df = pd.read_csv("data/eng_sentences.tsv", sep="\t", header=None).rename(columns = {0:"id", 1:"language", 2:"sentence"})
basic_eng_corpus = list(basic_sentences_df["sentence"])

stats_on_corpus(basic_eng_corpus)

Statistics on the corpus:
 - There are '1,428,266' sentences in this corpus

 - The average length of these sentences is '7.653'

 - There are 156,709 words in this dataset

The following are the first 100 of those sentences:

Let's try something.
I have to go to sleep.
Today is June 18th and it is Muiriel's birthday!
Muiriel is 20 now.
The password is "Muiriel".
I will be back soon.
I'm at a loss for words.
This is never going to end.
I just don't know what to say.
That was an evil bunny.
I was in the mountains.
Is it a recent picture?
I don't know if I have the time.
Education in this world disappoints me.
You're in better shape than I am.
You are in my way.
This will cost €30.
I make €100 a day.
I may give up soon and just nap instead.
It's because you don't want to be alone.
That won't happen.
Sometimes he can be a strange guy.
I'll do my best not to disturb your studying.
I can only wonder if this is the same for everyone else.
I suppose it's different when you think about it over

In [39]:
basic_eng_processed_corpus = process_corpus(basic_eng_corpus)

basic_eng_processed_corpus

HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=1428266.0), HTML(value='')))




["<s> let's try something . </s>",
 '<s> i have to go to sleep . </s>',
 "<s> today is june 18th and it is muiriel's birthday ! </s>",
 '<s> muiriel is 20 now . </s>',
 '<s> the password is " muiriel " . </s>',
 '<s> i will be back soon . </s>',
 "<s> i'm at a loss for words . </s>",
 '<s> this is never going to end . </s>',
 "<s> i just don't know what to say . </s>",
 '<s> that was an evil bunny . </s>',
 '<s> i was in the mountains . </s>',
 '<s> is it a recent picture ? </s>',
 "<s> i don't know if i have the time . </s>",
 '<s> education in this world disappoints me . </s>',
 "<s> you're in better shape than i am . </s>",
 '<s> you are in my way . </s>',
 '<s> this will cost €30 . </s>',
 '<s> i make €100 a day . </s>',
 '<s> i may give up soon and just nap instead . </s>',
 "<s> it's because you don't want to be alone . </s>",
 "<s> that won't happen . </s>",
 '<s> sometimes he can be a strange guy . </s>',
 "<s> i'll do my best not to disturb your studying . </s>",
 '<s> i can o

In [40]:
raw_gen_basic_eng, clean_gen_basic_eng = generate_sentences(num_sentences, basic_eng_processed_corpus, min_sentence_length, smoothing)

Counting the word occurances in the corpus
     There are 74912 words in this corpus

     There are 1004325 word pairs in this corpus



HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=1004325.0), HTML(value='')))


Generating the sentences


HBox(children=(HTML(value=''), FloatProgress(value=0.0), HTML(value='')))




In [41]:
write_sentences_to_file(raw_gen_basic_eng, 'outputs/raw_generated_basic_eng_sentences.txt')

<s> really like that he shouldn't ask </s>

<s> it's not enough to be grateful to to pop music . </s>

<s> what everyone of well he said that , to buy . </s>

<s> the chances tom was selling weed . </s>

<s> in private ones . to her . </s>

<s> mary think tom to miss me with everybody that a tom stayed in </s>

<s> i want to them did you not i had spent his government income was half , . in all put . that's not he ? . </s>

<s> doing " did to tom to , if that he was in the libyan crisis affords me tom who stole the lord . on the same that tom will happen ? on the wait here because tom fast you know tom has told out for . </s>

<s> i happy " " what he immigrated with you that </s>

<s> am a couple of her up . is tom won't tom and to murder and mary punished ? home . </s>

<s> she the president of her we're afraid that was cold didn't help with me not going to , as you go with tom be ? . . </s>

<s> the different . . very . </s>

<s> i on charcoal and that's about good for you . </s>

<s

In [42]:
write_sentences_to_file(clean_gen_basic_eng, 'outputs/clean_generated_basic_eng_sentences.txt')

Really like that he shouldn't ask

It's not enough to be grateful to to pop music.

What everyone of well he said that, to buy.

The chances tom was selling weed.

In private ones. to her.

Mary think tom to miss me with everybody that a tom stayed in

I want to them did you not i had spent his government income was half,. in all put. that's not he?.

Doing "did to tom to, if that he was in the libyan crisis affords me tom who stole the lord. on the same that tom will happen? on the wait here because tom fast you know tom has told out for.

I happy "" what he immigrated with you that

Am a couple of her up. is tom won't tom and to murder and mary punished? home.

She the president of her we're afraid that was cold didn't help with me not going to, as you go with tom be?..

The different.. very.

I on charcoal and that's about good for you.

Don't think tom assumed that sure whether i'll mail it.

Have they tom forever tiroes, be i wonder i thought he saw her one here a tree.

. The of 