In this kernel, I build a simple quote generator from the "South Park dialogue" corpus. Quote generators are often used just for fun, for instance in Twitter parody accounts, and they can indeed produce some funny results.

# Imports

In [None]:
import pandas as pd
import numpy as np
import nltk
from math import log
from collections import defaultdict
import random

# Load the Data

In [None]:
quotes = pd.read_csv('../input/All-seasons.csv')

# Get the Characters to Quote

The dataset contains many characters, but most of them don't have enough lines to produce anything funny, so we will stick to those with more than 1000 lines.

In [None]:
quotes_by_character = quotes.groupby('Character')
quotes_by_character.count()[quotes_by_character.count().Line > 1000]

We have six entries. I will not work on all of them in this kernel; I will just pick one for demonstration. Let's go with Kyle, who has the third most entries.

In [None]:
kyle_quotes = quotes[quotes.Character == "Kyle"].Line

In [None]:
kyle_quotes.head()

## Preprocessing

It's often a good idea to put the words into lower case before starting any word counting task in NLP. Let's do that. Also, let's strip the new line characters!

In [None]:
# Vectorized string methods make the argument to rstrip explicit
kyle_quotes_lower = kyle_quotes.str.lower().str.rstrip('\n')

The next step is to tokenize the entries. Tokenizing means turning each sentence into a list of its constituent words. This is not as easy as it may seem at first: for example, in the head() display of Kyle's quotes above, you can see that the question marks are written right after the preceding words, so the task is more demanding than merely splitting the string on spaces. We will use NLTK's built-in tokenizer to do the job:

In [None]:
kyle_tokens = kyle_quotes_lower.apply(nltk.word_tokenize)

In [None]:
kyle_tokens.head()

As you can see, it is not perfect (for example, it splits "can't" into two tokens, "ca" and "n't"), but it's good enough for our purposes.
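To see why splitting on spaces falls short, here is a small sketch on a made-up line. The regular expression is only a hypothetical stand-in that I wrote for illustration; it is not what NLTK does internally, and NLTK handles many more cases.

```python
import re

# Made-up example line, not taken from the dataset.
line = "dude, what's going on? i can't see!"

# Naive approach: split on whitespace only, so punctuation stays glued to words.
naive_tokens = line.split()

# Hypothetical stand-in for a real tokenizer: words (with an optional
# apostrophe part) and single punctuation marks become separate tokens.
regex_tokens = re.findall(r"[a-z]+(?:'[a-z]+)?|[^\sa-z]", line)

print(naive_tokens)
print(regex_tokens)
```

The naive split keeps tokens such as "dude," and "on?", which would then be counted as different words from "dude" and "on".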

The last thing I want to do is to flatten this Pandas series into a single Python list of tokens, to make life easier for us and NLTK:

In [None]:
kyle_tokens_list =  [ word for inner_list in list(kyle_tokens) for word in inner_list]

### A Random Question: How Articulate is Kyle?

I want to see his lexical diversity: does he use a lot of different words?

In [None]:
kyle_lexical_diversity = len(set(kyle_tokens_list)) / len(kyle_tokens_list)
print(kyle_lexical_diversity)

The higher the number, the more different words he uses. But of course the number depends on the length of the corpus: a corpus of 5 million words would yield a different range than this one.
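The length effect is easy to see on a tiny example. This is a minimal sketch on a toy text that I made up; the helper name `lexical_diversity` is mine and simply repeats the ratio computed above.

```python
def lexical_diversity(tokens):
    # Unique tokens divided by total tokens (the same ratio as above).
    return len(set(tokens)) / len(tokens)

# Made-up toy text, only to illustrate the effect of length.
toy_tokens = ("oh my god they killed kenny you bastards "
              "oh my god they did it again you guys").split()

short_sample = toy_tokens[:8]   # the first 8 tokens, all different
full_sample = toy_tokens        # all 17 tokens, with some repeats

print(lexical_diversity(short_sample))
print(lexical_diversity(full_sample))
```

The longer sample scores lower simply because common words start to repeat, which is why comparing characters with very different word counts needs some care.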

### How long are his sentences, on average?

In [None]:
len(kyle_tokens_list)/len(kyle_tokens)

On average, each of his lines is about 12 tokens long (punctuation marks count as tokens here).

### How Does He Compare to the Other Characters?

I will get the data for those who have more than 100 lines, but will display only the first five rows.

In [None]:
top_characters = quotes_by_character.count()[quotes_by_character.count().Line > 100].index

In [None]:
#This function redoes all the computation explained above to compute the lexical diversity and the average line length
def get_character_params(data, character):
    character_quotes = data[data.Character == character].Line
    #Vectorized string methods make the argument to rstrip explicit
    character_quotes_lower = character_quotes.str.lower().str.rstrip('\n')
    character_tokens = character_quotes_lower.apply(nltk.word_tokenize)
    #Flatten the per-line token lists into one list of tokens
    character_tokens_list = [word for inner_list in character_tokens for word in inner_list]
    number_of_unique_words = len(set(character_tokens_list))
    character_lexical_diversity = number_of_unique_words / len(character_tokens_list)
    character_avg_sentence_length = len(character_tokens_list) / len(character_tokens)

    #Order: number of lines, total word count, average length, unique words, lexical diversity
    return [len(character_tokens), len(character_tokens_list), character_avg_sentence_length, number_of_unique_words, character_lexical_diversity]

In [None]:
columns = ['Name', 'Number of Lines', 'Total Word Count', "Average Sentence Length", 'Unique Words', "Lexical Diversity"]

#Collect one row per character, then build the DataFrame in one go
#(DataFrame.append was removed in pandas 2.0)
rows = []
for speaker in top_characters:
    character_params = get_character_params(quotes, speaker)
    #The parameters come back in the same order as the columns after 'Name'
    rows.append(dict(zip(columns, [speaker] + character_params)))

character_quotes_parameters_df = pd.DataFrame(rows, columns=columns)

character_quotes_parameters_df.head()

The numbers seem to be related in some way, so let's visualize them!

### Visualising the Result

### Lexical Diversity vs Total Word Count

In [None]:
import matplotlib.pyplot  as plt

plt.plot(character_quotes_parameters_df['Total Word Count'], character_quotes_parameters_df['Lexical Diversity'],'ro')

plt.xlabel('Total Word Count', fontsize=16)
plt.ylabel('Lexical Diversity', fontsize=16)

plt.show()

Some kind of decay, maybe? Let's take the log of both variables; if the points then fall on a straight line, the relationship is a power law:

In [None]:
import matplotlib.pyplot  as plt
import math
plt.plot(np.log(character_quotes_parameters_df['Total Word Count'].astype(float)), np.log(character_quotes_parameters_df['Lexical Diversity'].astype(float)),'ro')

plt.xlabel('Ln Total Word Count', fontsize=16)
plt.ylabel('Ln Lexical Diversity', fontsize=16)

plt.show()

### Unique Word Count vs Total Word Count

In [None]:
import matplotlib.pyplot  as plt

plt.plot(character_quotes_parameters_df['Total Word Count'], character_quotes_parameters_df['Unique Words'],'ro')

plt.xlabel('Total Word Count', fontsize=16)
plt.ylabel('Unique Word Count', fontsize=16)

plt.show()

In [None]:
import matplotlib.pyplot  as plt

plt.plot(np.log(character_quotes_parameters_df['Total Word Count'].astype(float)), np.log(character_quotes_parameters_df['Unique Words'].astype(float)),'ro')

plt.xlabel('Ln Total Word Count', fontsize=16)
plt.ylabel('Ln Unique Word Count', fontsize=16)

plt.show()

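Note: The log function in numpy is the natural log (ln), not the log to base 10 or 2. I think it is amazing how often this number shows up when describing growth in real life. Since both axes are on a log scale here, a roughly straight line points to a power-law relationship between the unique word count and the total word count (this kind of vocabulary growth is known as Heaps' law), rather than an exponential one.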
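A straight line on log-log axes means the data follow a power law, y = c · x^k, where the slope of the line is the exponent k. Here is a rough sketch of how one could estimate that exponent with a least-squares fit. The numbers are synthetic (generated from an exact power law with values I chose), not taken from the South Park data, and `fit_line` is a helper I wrote for this example.

```python
import math

def fit_line(xs, ys):
    # Ordinary least squares for y = slope * x + intercept.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic data following unique = 3 * total ** 0.6 exactly (not from the dataset).
total_words = [100, 1000, 10000, 100000]
unique_words = [3 * n ** 0.6 for n in total_words]

# Fit a line to the logs: the slope estimates the exponent,
# and exp(intercept) estimates the constant factor.
slope, intercept = fit_line([math.log(n) for n in total_words],
                            [math.log(u) for u in unique_words])
print(round(slope, 3), round(math.exp(intercept), 3))
```

On this synthetic input the fit should recover an exponent close to 0.6 and a factor close to 3. On real data the points scatter around the line, so the estimate is only approximate.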

## Word Frequencies

In [None]:
kyle_word_freq = nltk.FreqDist(kyle_tokens_list)

In [None]:
#Negative natural log of each word's relative frequency: the rarer the word, the larger the value
kyle_word_log_probability = [ {word: -log(float(count)/ len(kyle_tokens_list))} for word, count in kyle_word_freq.items() ]

In [None]:
kyle_word_log_probability[:10]
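The values computed above are negative log probabilities: a frequent word gets a small value and a rare word a large one. Here is a minimal sketch of the same idea on a made-up token list, using `collections.Counter` in place of `nltk.FreqDist` so that it runs without NLTK.

```python
from collections import Counter
from math import log

# Toy token list, invented for illustration.
toy_tokens = ["dude", "i", "do", "n't", "know", "dude"]
counts = Counter(toy_tokens)
total = len(toy_tokens)

# Negative natural log of each word's relative frequency.
neg_log_prob = {word: -log(count / total) for word, count in counts.items()}

# Print the most frequent (lowest value) words first.
for word, value in sorted(neg_log_prob.items(), key=lambda item: item[1]):
    print(word, round(value, 3))
```

Storing everything in one dictionary, rather than a list of one-entry dictionaries, also makes it easy to look up a single word.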

## N-Grams

N-grams are sequences of N **successive** words that appear in the text. N can be any number starting from one: a 1-gram is also called a unigram, a 2-gram a bigram, a 3-gram a trigram, etc. For example, when we split the following sentence into bigrams:

"The quick brown fox jumps over the lazy dog"

the bigrams that we have are:<br>
 - The quick
 - quick brown
 - brown fox
 - fox jumps ..etc
 
The trigrams would be:
 - The quick brown
 - quick brown fox
 - brown fox jumps ..etc

A unigram is just each individual word, same as the word list we created above.

So why would we want to use that? If we create a list of bigrams together with the probabilities of the following word, it can be used to mimic a person's style. For example, if we create the bigrams of the sentence above and look at the word "the" (ignoring case), it appears twice: once followed by "quick" and once followed by "lazy". So if we want to create a sentence similar to this one (which is totally boring, I know), then whenever we encounter the word "the", we know that it is followed 50% of the time by "quick" and 50% of the time by "lazy". When the corpus is much larger, we have many more possibilities, and interesting patterns of speech may be found.
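The fox example can be sketched in a few lines of plain Python. The helper `ngrams` is something I wrote for illustration, and the sentence is lower-cased up front so that "The" and "the" count as the same word.

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of n tokens over the list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Lower-cased so that "The" and "the" are treated as the same word.
tokens = "the quick brown fox jumps over the lazy dog".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

# Count which words follow "the" and turn the counts into probabilities.
followers = Counter(second for first, second in bigrams if first == "the")
total = sum(followers.values())
next_word_probs = {word: count / total for word, count in followers.items()}

print(bigrams[:3])
print(trigrams[:2])
print(next_word_probs)
```

The last line shows the 50/50 split between "quick" and "lazy" described above.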

Although NLTK offers a function to get bigrams, we want longer n-grams as well to generate nicer sentences, so let's write our own n-gram builder function:

In [None]:
#Build n-gram builder
def build_ngram(text, n):
    tokenized_text = nltk.word_tokenize(text)#Try it with lower case, i.e. text.lower()

    #sanity check on the number of tokens (not characters); generally speaking we want the text to be much much
    #longer than n to get interesting/funny results
    if len(tokenized_text) < n:
        print("Text length is less than n")
        return {}

    #A plain dict is enough here, since new entries are created explicitly below
    ngram = {}

    #Loop over every position that still has n - 1 words after it
    for index in range(len(tokenized_text) - n + 1):
        #Get current word from the corpus
        current_word = tokenized_text[index]

        #Get the next n - 1 words, so that we can push them into the current word's entry in the ngram dictionary
        ngram_tail = " ".join(tokenized_text[index + 1 : index + n])

        #The general structure of an entry is as follows: the beginning of the ngram is the key. Its content is a dictionary
        #that contains the total number of grams that start with this first word, plus another dictionary of all the grams
        #and their counts. To save a little space, only the tail is stored in that last dictionary. That way, we can
        #easily compute the probability, since everything needed is already stored inside

        #If this is a new entry, create a new one
        if current_word not in ngram:
            ngram[current_word] = {'total_grams_start': 1, 'grams': {ngram_tail: 1}}
        else:
            #increase the total count of grams starting with this word
            ngram[current_word]['total_grams_start'] += 1
            #create the sub-entry for a new ngram tail, or increment the count of an existing one
            grams = ngram[current_word]['grams']
            grams[ngram_tail] = grams.get(ngram_tail, 0) + 1

    return ngram

# Let the Fun Begin: Quote Generation

In [None]:
def pick_next_gram(grammed_input, current_word):
    #Pick one gram tail for current_word at random, weighted by how often it followed that word.
    #Returns None if the word never starts a gram (e.g. it only appears at the very end of the corpus).
    entry = grammed_input.get(current_word)
    if entry is None:
        return None

    #We want some randomness in picking the next word, not just the most probable one
    random_num = random.random()

    #cumulative probability
    cum_prob = 0
    for potential_next_word, count in entry['grams'].items():
        #Add the count of this gram tail divided by how many times the seed word appeared
        cum_prob += float(count) / entry['total_grams_start']
        #The first tail whose cumulative probability passes the random number is the gram to use
        if cum_prob > random_num:
            return potential_next_word
    #Floating-point rounding can leave cum_prob slightly below 1, so fall back to the last tail
    return potential_next_word


def Generate_quote(grammed_input, gram_size, start_word, quote_length):
    output_str = start_word

    #This is like the seed based on which we will pick the next word.
    current_word = start_word.lower()

    #Iterate quote_length // gram_size + 1 times to build the body of the quote
    for i in range(quote_length // gram_size + 1):
        next_gram = pick_next_gram(grammed_input, current_word)
        if next_gram is None:
            return output_str
        output_str += " " + next_gram
        current_word = next_gram.split()[-1]

    #Finish with an end of sentence. For now, a sentence ends with a full stop, no question/exclamation marks.
    #The code will continue to generate text until we encounter a gram that contains a full stop.
    while output_str[-1] != '.':
        next_gram = pick_next_gram(grammed_input, current_word)
        if next_gram is None:
            break
        if '.' in next_gram:
            #Cut the gram at the full stop and end the quote there
            output_str += " " + next_gram.split('.')[0] + "."
        else:
            output_str += " " + next_gram
            current_word = next_gram.split()[-1]

    return output_str
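The random pick inside the generator is a weighted draw: each gram tail is chosen with a probability proportional to its count. Here is a standalone sketch of that step on hypothetical counts (the words and numbers are invented), next to `random.choices` from the standard library, which performs the same kind of draw in one call.

```python
import random
from collections import Counter

def sample_tail(grams, rng):
    # Walk the cumulative distribution until it passes a uniform random number.
    total = sum(grams.values())
    threshold = rng.random()
    cumulative = 0.0
    for tail, count in grams.items():
        cumulative += count / total
        if cumulative > threshold:
            return tail
    return tail  # guard against floating-point rounding

# Hypothetical counts of the words that follow some seed word.
toy_grams = {"know": 6, "guess": 3, "swear": 1}

# A seeded generator keeps the run repeatable.
rng = random.Random(0)
draws = Counter(sample_tail(toy_grams, rng) for _ in range(10000))
print(draws.most_common())

# The standard library offers the same weighted draw directly.
print(rng.choices(list(toy_grams), weights=list(toy_grams.values()), k=5))
```

Over many draws, the shares should settle near 60%, 30% and 10%, matching the counts.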

So now, let's generate some sentences based on different grams. Let's start with the bigram:

### Bigram

In [None]:
kyle_bigram = build_ngram(' '.join(kyle_tokens_list), 2)

In [None]:
Generate_quote(kyle_bigram, 2, 'i', 12)

Some bigram sentences:

 - "i know what 's illegal for being our technology really old world 's office is it ! friends were no ! aaaaah ! i 'm pretty dead ."<br>
 - "i 'll get him your farts in warm water ? ten ! he even be okay , he went really wrong ."<br>
 - 'i mean , but now ! stan ? cartman ? dude okay , we would feel our hands beneath a way ! look alike .'<br>
 - 'i saw this stupid and look , and- yeah .'<br>
 - "i 'm gon na go on ? ! nonono , we need to die ! you ! please , i 'm finally taught me ."<br>
 - 'i saw that it pissed you sleep , there ! no jews ! aahhh .'

Sounds funny, but it doesn't make a lot of sense. Let's try the trigram and see how this would improve:

### Trigrams

In [None]:
kyle_trigram = build_ngram(' '.join(kyle_tokens_list), 3)

In [None]:
Generate_quote(kyle_trigram, 3, 'i', 12)

 -  "i guess it 's dying on our pig has used somalia ."<br>
 - "i 've been able to see it 's supposed to care for it is n't the same laws and you know how i said i do n't know , do n't think i say he got all right ? ! oh this is the coolest guy in there ."<br>
 - "i would have sex with water ! mom , dreidel , chef ! now ! '' `` giant douche ."<br>
 - "i 'm kyle broflovski . nothing 's in a long time ago ."
 - "i 'm sorry dude . i 'm getting a big performance in denver tomorrow."<br>

Note the last one: it sounds like a normal statement.

### Quadgrams

In [None]:
kyle_quadgram = build_ngram(' '.join(kyle_tokens_list), 4)

In [None]:
Generate_quote(kyle_quadgram, 4, 'i', 12)

 - 'i had the same thing ; that ms. ellen was such über pwnage oh yeah .'<br>
 - "i gut dragged for christ 's sake of all humanity than peace in its head ! see ? i have this strange ."<br>
 - "i thought ... why ? like what 's so important part of our language ! aw , come on ! hang on the left side , stan ? change our name ."<br>
 - "i swear to god , i hate crime ? but i i did n't eat so much better about what you 're in detention ! twenty dollars ."<br>
 - "i do n't care . so it either . kenny ! kenny , molestered , bad guy after all right ! yeah ."

### Pentgrams

In [None]:
kyle_pentgram = build_ngram(' '.join(kyle_tokens_list), 5)

In [None]:
Generate_quote(kyle_pentgram, 5, 'i', 12)

 - "i am never sucking your wiener in his mouth , and i 've got the doll ! they got it working on our school project the movies on cartman ! goddammit cartman , check it out ."<br>
 - "i ca n't stop feeling this might be the fuck is wrong with a pig ."<br>
 - "i do n't believe it , that was n't gon na work now go fund yourself ."
 - "i guess . look , it 's true . i wan na be cool ."<br>
 - "i am not , not scary , fatass ! so that 's why bad ? because they doing thiiis ? ! and what the hell is wrong with them ! hello , everybodyyy ."

A richer corpus can create nicer sentences. Here is an example generated offline with the same code from Shakespeare's works, using a quadgram:

"I ' faith , yet needful 't is gone , Achilles . Then call them to the alehouse with us . Come , kiss ; And on the place , it concluded , No , sir , thou art overthrown by noble Brutus ."

# Conclusion

There is something I want to point out before wrapping up this kernel:<br>

The higher the number of grams you use:<br>
  - The longer it takes to generate the probabilities, so beware of that. Building a pentgram on Shakespeare's works is going to take a while to finish<br>
  - The more the generated sentences become exact copies of the original text. So using a 12-gram to generate a sentence of length 12 can give back an original sentence, unchanged (the only variation comes from the random number generated to pick the next word).
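The second point can be read straight off the generator's main loop: it makes `quote_length // gram_size + 1` random picks, and each pick appends a tail of `gram_size - 1` consecutive words copied from the corpus. Here is a small sketch of that arithmetic; the helper name is mine, and it ignores the extra picks the function makes to reach a full stop.

```python
def picks_and_chunk_size(quote_length, gram_size):
    # Mirrors the main loop of Generate_quote: the number of random picks,
    # and how many consecutive corpus words each pick copies verbatim.
    random_picks = quote_length // gram_size + 1
    words_per_pick = gram_size - 1
    return random_picks, words_per_pick

for n in (2, 3, 4, 5, 12):
    picks, chunk = picks_and_chunk_size(12, n)
    print(f"{n}-gram: {picks} random picks, {chunk} copied word(s) per pick")
```

With a bigram, every word is a fresh random choice. With a 12-gram, only two choices are made and each one pastes in eleven words exactly as Kyle said them.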
