# Fill in the blanks
- The cat sat on ____ mat
- Switch ____ the fan
- I would like some spaghetti ____ marinara.
- I am terribly ____!

## How are we doing this?

Given a sequence of words w_1, ..., w_i, can we predict the next word?

In [1]:
def tokenise(sentence):
    # Naive tokeniser: splits on single spaces only, so repeated spaces yield empty-string tokens
    return sentence.split(" ")
def tokenise_corpus(corpus):
    return [tokenise(sentence) for sentence in corpus]

In [2]:
# Let us build a small corpus.

corpus = ["the dog barks", "a cat purrs", "a dog barks loudly", "a cat sleeps", "the dog bites"]

# Preprocessing - Let us tokenise
corpus = tokenise_corpus(corpus)
print(corpus)


[['the', 'dog', 'barks'], ['a', 'cat', 'purrs'], ['a', 'dog', 'barks', 'loudly'], ['a', 'cat', 'sleeps'], ['the', 'dog', 'bites']]


## Let us get a sense of our data. 
- We need the number of unique tokens that occur and their frequencies.

In [3]:
# Let us get a sense of our data. 

def token_counts(corpus):
    tokens = {}
    for sentence in corpus:
        for word in sentence:
            if word not in tokens:
                tokens[word] = 0
            tokens[word] += 1
    return tokens

from ipy_table import make_table  # renders a list of rows as a table in the notebook
tokens = token_counts(corpus)
sorted_tokens = sorted(tokens.items(), key = lambda x: x[1], reverse=True)
make_table(sorted_tokens)

token,count
a,3
dog,3
cat,2
the,2
barks,2
sleeps,1
purrs,1
bites,1
loudly,1
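
The same counts can also be obtained with the standard library's `collections.Counter`. This is a minimal sketch rather than the notebook's own method; `toy_corpus` and `toy_counts` are names chosen here for illustration.

```python
from collections import Counter

# Assumption: the five toy sentences from above, rebuilt here so the block runs on its own.
toy_corpus = [s.split(" ") for s in
              ["the dog barks", "a cat purrs", "a dog barks loudly",
               "a cat sleeps", "the dog bites"]]

# Flatten the sentences into one stream of tokens and count each token.
toy_counts = Counter(word for sentence in toy_corpus for word in sentence)

# most_common() returns (token, count) pairs sorted by descending count.
for token, count in toy_counts.most_common():
    print(token, count)

print(len(toy_counts), "unique tokens,", sum(toy_counts.values()), "tokens in total")
```

`Counter` is a `dict` subclass, so `toy_counts` can be used wherever the result of `token_counts` is used.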


In [4]:
# Plot our findings
from plotly.offline import init_notebook_mode, iplot

init_notebook_mode(connected=True)
words, counts = zip(*sorted_tokens)  # unzip (token, count) pairs into two tuples
iplot([{"x": list(words), "y": list(counts)}])

## Let us try a real-world example
- The Adventures of Sherlock Holmes, from Gutenberg.org

In [5]:
import requests
import re

by_new_lines = re.compile("\n+")
def get_corpus(url):
    return requests.get(url).text

corpus = get_corpus("http://www.gutenberg.org/cache/epub/1661/pg1661.txt") # Get it from the site

# Simple "sentence" tokenisation: treat every run of newlines as a sentence boundary.
corpus = by_new_lines.split(corpus)
print(len(corpus))

13053


In [6]:
# Tokenise 
corpus = tokenise_corpus(corpus)
tokens = token_counts(corpus)
sorted_tokens = sorted(tokens.items(), key = lambda x: x[1], reverse=True)
make_table(sorted_tokens[:100])

token,count
the,4832
,2667
and,2559
of,2504
to,2496
I,2345
a,2299
in,1558
that,1470
was,1243
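
The second row of the table is an empty token. A likely cause, assuming the downloaded text contains indented lines or runs of spaces, is that `split(" ")` splits on every single space and so emits empty strings. A small sketch with a made-up line (the string below is hypothetical, not quoted from the corpus):

```python
# Hypothetical line with leading and doubled spaces.
line = "  To Sherlock Holmes  she is always"

on_single_space = line.split(" ")  # splitting on one space keeps empty strings
on_whitespace = line.split()       # splitting on whitespace runs drops them

print(on_single_space)
print(on_whitespace)
```

Calling `split()` with no argument is one way to avoid the empty token, at the cost of changing the counts shown above.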


In [7]:
iplot([{"x" : list(zip(*sorted_tokens))[0], "y": list(zip(*sorted_tokens))[1]}])

## Recap
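- The distribution of word frequencies follows a __Zipf__ curve: a few words are very frequent and the vast majority are rare.
- Single tokens are called __Unigrams__; so far we have counted how often each one occurs.

## What do we not have?
- Any information about how words relate to each other, for example which words tend to follow which.
- Here we introduce __Bigrams__: pairs of words that occur next to each other.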
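
To make the definition concrete, here is a minimal sketch that counts bigrams on the five-sentence toy corpus. It uses `collections.Counter` and `zip`; the names `toy_corpus` and `toy_bigrams` are assumptions made for this example.

```python
from collections import Counter

# Assumption: the five toy sentences from earlier, tokenised with a plain split.
toy_corpus = [s.split(" ") for s in
              ["the dog barks", "a cat purrs", "a dog barks loudly",
               "a cat sleeps", "the dog bites"]]

# zip(sentence, sentence[1:]) pairs every word with the word that follows it.
toy_bigrams = Counter(
    pair
    for sentence in toy_corpus
    for pair in zip(sentence, sentence[1:])
)

for pair, count in toy_bigrams.most_common():
    print(pair, count)
```

Pairs are formed within a sentence only, so the last word of one sentence is never paired with the first word of the next.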

In [8]:
def get_bigrams(corpus):
    bigrams = {}
    for sentence in corpus:
        for index, word in enumerate(sentence):
            if index > 0:
                pair = (sentence[index - 1], word)
                if pair not in bigrams:
                    bigrams[pair] = 0
                bigrams[pair] += 1
    return bigrams


# Note: try this with the smaller corpus first.

bigrams = get_bigrams(corpus)
sorted_bigrams = sorted(bigrams.items(), key = lambda x: x[1], reverse=True)

sorted_bigrams = sorted_bigrams[:1000] # Keep only the top 1000 bigrams to limit memory use
#print(sorted_bigrams) 
#print(make_table(sorted_bigrams))
iplot([{"x" : ['_'.join(x) for x in list(zip(*sorted_bigrams))[0]], "y": list(zip(*sorted_bigrams))[1]}])

In [9]:
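# Let us look at the words that follow a given word

def bgof(x, y):
    # Bigram counts restricted to pairs whose first word is y, keyed by the second word
    return {k[1] : v for k, v in x.items() if k[0] == y}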

sorted_the = sorted(bgof(bigrams, 'the').items(), key = lambda x : x[1], reverse=True)
iplot([{"x" : list(zip(*sorted_the))[0], "y": list(zip(*sorted_the))[1]}])

sorted_good = sorted(bgof(bigrams, 'good').items(), key = lambda x : x[1], reverse=True)
iplot([{"x" : list(zip(*sorted_good))[0], "y": list(zip(*sorted_good))[1]}])

#make_table(list(bgof(bigrams, 'good').items()))

print(bgof(bigrams, 'good')['man'])
print(bgof(bigrams, 'the')['man'])



1
24


## Comparison
- 'good man' occurs once, while 'the man' occurs 24 times.
- But given 'the', how likely are we to predict 'man', compared with given 'good'?
- The single occurrence of 'man' after 'good' carries more weight than any one occurrence of 'man' after 'the', because far fewer words follow 'good' than follow 'the'.
- To capture this, we need conditional probabilities rather than joint counts or joint probabilities.
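
The conditional probability described above can be written as follows, where count(w_1, w_2) is the number of times w_2 directly follows w_1. The notation is introduced here for illustration and does not appear in the original notebook.

$$
P(w_2 \mid w_1) = \frac{\mathrm{count}(w_1, w_2)}{\sum_{w} \mathrm{count}(w_1, w)}
$$

For 'man' given 'good', the numerator is the single 'good man' bigram and the denominator is the total count of bigrams that start with 'good'.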

In [10]:
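def cond(bigrams, key):
    # Counts of every word seen immediately after `key`
    joint = {k[1] : v for k, v in bigrams.items() if k[0] == key}
    sum_count = sum(joint.values())
    # Normalise the counts so the values form P(next word | key)
    return {k : v / float(sum_count) for k, v in joint.items() }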

the_cond = cond(bigrams, 'the')
good_cond = cond(bigrams, 'good')


sorted_the = sorted(the_cond.items(), key = lambda x : x[1], reverse=True)
iplot([{"x" : list(zip(*sorted_the))[0], "y": list(zip(*sorted_the))[1]}])

sorted_good = sorted(good_cond.items(), key = lambda x : x[1], reverse=True)
iplot([{"x" : list(zip(*sorted_good))[0], "y": list(zip(*sorted_good))[1]}])

print(good_cond['man'])
print(the_cond['man'])


0.01282051282051282
0.004966887417218543
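
As a check on the arithmetic, the two printed values are consistent with 'good' being followed by 78 tokens in total and 'the' by 4832. These denominators are not printed anywhere in the notebook; they are inferred here from the counts (1 and 24) and the probabilities above, so treat them as an assumption.

```python
# Assumption: the denominators are inferred from the printed outputs, not read from the corpus.
count_good_man, followers_of_good = 1, 78
count_the_man, followers_of_the = 24, 4832

p_man_given_good = count_good_man / followers_of_good
p_man_given_the = count_the_man / followers_of_the

print(f"{p_man_given_good:.4f}")  # 0.0128
print(f"{p_man_given_the:.4f}")   # 0.0050
print(p_man_given_good > p_man_given_the)  # True
```

So although 'the man' is 24 times more frequent as a pair, 'man' is the more likely continuation after 'good' than after 'the'.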


In [11]:
# Generating Text
import random

def generate(unigrams, bigrams, length=5, first_word=None):
    words = []
    if first_word is None:
        # No seed given: start from a uniformly random unigram
        first_word = random.choice(list(unigrams))
    words.append(first_word)
    for i in range(length - 1):
        prev = words[i]
        prev_dict = cond(bigrams, prev)
        if not prev_dict:
            break  # prev never starts a bigram, so there is nothing to predict
        # Greedy choice: always take the most probable next word
        next_word = max(prev_dict.items(), key=lambda x: x[1])
        words.append(next_word[0])
    return words

generate(tokens, bigrams, length=100, first_word="the")


['the',
 'other',
 'side',
 'of',
 'the',
 'other',
 'side',
 'of',
 'the',
 'other',
 'side',
 'of',
 'the',
 'other',
 'side',
 'of',
 'the',
 'other',
 'side',
 'of',
 'the',
 'other',
 'side',
 'of',
 'the',
 'other',
 'side',
 'of',
 'the',
 'other',
 'side',
 'of',
 'the',
 'other',
 'side',
 'of',
 'the',
 'other',
 'side',
 'of',
 'the',
 'other',
 'side',
 'of',
 'the',
 'other',
 'side',
 'of',
 'the',
 'other',
 'side',
 'of',
 'the',
 'other',
 'side',
 'of',
 'the',
 'other',
 'side',
 'of',
 'the',
 'other',
 'side',
 'of',
 'the',
 'other',
 'side',
 'of',
 'the',
 'other',
 'side',
 'of',
 'the',
 'other',
 'side',
 'of',
 'the',
 'other',
 'side',
 'of',
 'the',
 'other',
 'side',
 'of',
 'the',
 'other',
 'side',
 'of',
 'the',
 'other',
 'side',
 'of',
 'the',
 'other',
 'side',
 'of',
 'the',
 'other',
 'side',
 'of']
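
The output above cycles through 'the other side of' because `generate` always picks the single most probable follower, so the sequence repeats as soon as a word recurs. One common alternative, not used in the original notebook, is to sample the next word in proportion to its bigram count. A minimal sketch, in which `sample_next`, `generate_sampled` and the toy counts are illustrative names and data:

```python
import random

def sample_next(bigrams, prev, rng=random):
    """Sample a follower of `prev` in proportion to its bigram count (hypothetical helper)."""
    followers = {k[1]: v for k, v in bigrams.items() if k[0] == prev}
    if not followers:
        return None  # prev never starts a bigram
    words = list(followers)
    weights = [followers[w] for w in words]
    # random.choices draws using the given relative weights.
    return rng.choices(words, weights=weights, k=1)[0]

def generate_sampled(bigrams, first_word, length=5, rng=random):
    """Build a sequence by repeatedly sampling the next word (hypothetical helper)."""
    words = [first_word]
    while len(words) < length:
        nxt = sample_next(bigrams, words[-1], rng)
        if nxt is None:
            break  # dead end: stop early instead of failing
        words.append(nxt)
    return words

# Bigram counts of the five-sentence toy corpus from the start of the notebook.
toy_bigrams = {
    ("the", "dog"): 2, ("dog", "barks"): 2, ("a", "cat"): 2,
    ("cat", "purrs"): 1, ("a", "dog"): 1, ("barks", "loudly"): 1,
    ("cat", "sleeps"): 1, ("dog", "bites"): 1,
}

rng = random.Random(0)  # fixed seed only so that the sketch is repeatable
print(generate_sampled(toy_bigrams, "a", length=4, rng=rng))
```

Sampling trades the determinism of the greedy version for variety: frequent followers are still chosen most often, but rarer ones get a chance, which makes fixed cycles less likely.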