# Learning Objectives

In this lab we are going to:

- Play around with text corpora
- Learn some statistics tricks in Python and NLTK
- Learn about language modelling
- Learn about n-grams
- Use Naive Bayes as a language model
- Learn about data sparsity and smoothing techniques


# Accessing A Text Corpus

Open a Python session and obtain the <a href="https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/brown.zip">Brown corpus</a> using NLTK.

In [0]:
import nltk

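# Download the Brown corpus (only needed once):
nltk.download('brown')
# Calling nltk.download() without arguments instead opens an interactive
# downloader; in a text console the session looks roughly like this:
# NLTK Downloader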
# ---------------------------------------------------------------------------
#     d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
# ---------------------------------------------------------------------------
# Downloader> d
#
# Download which package (l=list; x=cancel)?
#   Identifier> brown
#     Downloading package brown to /home/jmccrae/nltk_data...
#       Unzipping corpora/brown.zip.

# The following should now work:
from nltk.corpus import brown

# read a list of the words in the Brown Corpus
list_words = brown.words()

print(list_words[0:20])

We can access the corpus as a list of words, or a list of sentences (where each sentence is itself just a list of words). 


In [0]:
brown.sents()

The Brown corpus consists of different categories. We can list the available categories as follows:

In [0]:
brown.categories()

We can access the text of a certain category as follows:

In [0]:
brown.words(categories='fiction')

**Exercise 1:**

What is the frequency of the word 'world' (ignoring case) in the news category of the Brown corpus?

In [0]:
def count_freq(category, given_word):
    count = 0
    for word in brown.words(categories=category):
        word = word.lower()
        if word == given_word:
            count += 1
    return count

print(count_freq('news', 'world'))

# Frequency of Words

We can easily get the frequency distribution of the words in a corpus as follows:

In [0]:
from nltk.probability import FreqDist

# the frequency of each vocabulary item in the news category
news_text = brown.words(categories='news')
fd = FreqDist(news_text)

# total number of samples (tokens)
print(fd.N())

# how many unique words does this text have
print(fd.B())

# get a list of the top 10 words sorted by frequency
print(fd.most_common(10))

# sanity check on the whole corpus: the number of samples equals the number of tokens
fdtest = FreqDist(list_words)
print(len(list_words))
print(fdtest.N())

**Exercise 2:**
In the Brown corpus, in which of the news, government and editorial categories does the word 'world' (ignoring case) have the highest total frequency?
* news
* government
* editorial

In [0]:
print(count_freq('news', 'world'))
print(count_freq('government', 'world'))
print(count_freq('editorial', 'world'))

# Probabilities

**Exercise 3:**
Calculate the probabilities (relative frequencies) of all words in the __news__ category of the Brown corpus only.
What are the probabilities of the words 'jury' and 'government'?

In [0]:
def prob(category, given_word):
    count = 0
    total = 0

    for word in brown.words(categories=category):
        word = word.lower()
        if word == given_word:
            count += 1
        total += 1
            
    return float(count)/total

print(prob('news','jury'))
print(prob('news','government'))
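
The `prob` function above rescans the whole category for every word it is asked about. A minimal sketch of a one-pass alternative is shown below: it counts every lowercased token once and then looks probabilities up. The helper name `relative_frequencies` and the toy token list are illustrative assumptions; the list stands in for `brown.words(categories='news')`.

```python
from collections import Counter

def relative_frequencies(words):
    """Map each lowercased word to count / total, computed in one pass."""
    counts = Counter(word.lower() for word in words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Hypothetical toy tokens; in the lab you would pass brown.words(categories='news').
toy_tokens = ['The', 'jury', 'said', 'the', 'government', 'praised', 'the', 'jury']
toy_probs = relative_frequencies(toy_tokens)

print(toy_probs['jury'])               # 2 of the 8 tokens
print(toy_probs.get('election', 0.0))  # unseen words get probability 0.0
```

Because the result is a plain dictionary, looking up many words costs one corpus pass in total rather than one pass per word.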

# N-Grams

Probabilistic language models (a.k.a. n-gram LMs) are designed to estimate the joint probability distribution of a sequence of words. Based on the Markov assumption, predicting a word sequence is broken up into predicting one word at a time, with each word conditioned only on the few words that precede it.
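
Written out, the decomposition described above looks as follows. The notation ($w_1, \dots, w_m$ for the word sequence and $n$ for the model order) is chosen here for illustration. The chain rule gives the exact factorisation, and the Markov assumption truncates each history to the previous $n-1$ words:

$$P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})$$

For a bigram model ($n = 2$) each factor is simply $P(w_i \mid w_{i-1})$.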

We can extract unigrams and bigrams from a corpus as follows.
In this example, we generate unigrams and bigrams from the novel Emma by Jane Austen, taken from the Gutenberg corpus.

In [0]:
# explore the corpus (run nltk.download('gutenberg') first if it is not installed yet)
nltk.corpus.gutenberg.fileids()


In [0]:
# get the text of the novel Emma by Jane Austen 
emma_words = nltk.corpus.gutenberg.words('austen-emma.txt')
emma = " ".join(emma_words) 
emma[:500]  # first 500 characters


In [0]:
from nltk.tokenize import word_tokenize

tokens = word_tokenize(emma)
tokens[:20]  # first 20 tokens

In [0]:
from nltk.util import ngrams

#unigrams
print (list(ngrams(word_tokenize(emma), 1))[:10])


In [0]:
# bigrams (show only the first 10)
print(list(ngrams(word_tokenize(emma), 2))[:10])

# or simply
print(list(nltk.bigrams(emma_words))[:10])

In [0]:
from nltk.probability import ConditionalFreqDist

#Make a conditional frequency distribution of all the bigrams in the novel Emma by Jane Austen from The Gutenberg Corpus
bigrams = nltk.bigrams(emma_words)

cfd = ConditionalFreqDist(bigrams)

# frequency distribution of the words that follow 'fully'
cfd['fully']


In [0]:
#same with 'good' but sort by freq
cfd['good'].most_common(20) 

**Exercise 4:**
Write a function to find the most common phrases (trigrams) in the __fiction__ category of the Brown corpus.

In [0]:
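from nltk.corpus import brown

def most_common_trigrams(category, n=20):
    """Return the n most frequent trigrams in the given Brown category."""
    words = brown.words(categories=category)
    # Note the difference: FreqDist counts whole trigrams, whereas
    # ConditionalFreqDist keeps a separate count table for each condition.
    freq = nltk.FreqDist(nltk.trigrams(words))
    return freq.most_common(n)

most_common_trigrams('fiction')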


# Probabilistic modeling


## Naïve Bayes as a Language Model
Based on the probabilities of words in only the news and fiction categories of the Brown corpus, classify the phrase 'mysterious murder case' into one of these categories.

You should implement a Naive Bayes classifier using the probabilities of each word:

$P(fiction|mysterious\ murder\ case) \propto P(mysterious|fiction) \times P(murder|fiction) \times P(case|fiction) \times P(fiction)$
where $P(news) = 0.5$ and $P(fiction) = 0.5$

**Exercise 5:**
Write a general-purpose Naive Bayes classifier, such as the following:

In [0]:
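def calculate_probability(phrase, category):
    # P(category) times the product of P(word | category) over the phrase
    p = 1.0
    for word in phrase.split():
        word = word.lower()
        p *= prob(category, word)
    # multiply by the class prior P(category) = 0.5 once, after the loop
    return p * 0.5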

def naive_bayes(phrase):
    news_prob = calculate_probability(phrase, 'news')
    fiction_prob = calculate_probability(phrase, 'fiction')
    if news_prob > fiction_prob:
        return 0 #news
    else:
        return 1 #fiction

print(naive_bayes("mysterious murder case"))
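
Two practical issues with the classifier above are that `prob` rescans the corpus for every word and that multiplying many small probabilities can underflow. The sketch below is one possible variant, not the lab's reference solution: it counts each category once and compares sums of log-probabilities instead of products. The toy word lists and the names `train`, `log_score` and `classify` are hypothetical; in the lab the lists would come from `brown.words(categories=...)`.

```python
import math
from collections import Counter

def train(corpora):
    """corpora maps a category name to its token list; count each category once."""
    return {cat: Counter(w.lower() for w in words) for cat, words in corpora.items()}

def log_score(phrase, counts, prior):
    """log P(category) + sum of log P(word | category); -inf if a word is unseen."""
    total = sum(counts.values())
    score = math.log(prior)
    for word in phrase.lower().split():
        if counts[word] == 0:
            return float('-inf')  # unsmoothed: an unseen word zeroes the product
        score += math.log(counts[word] / total)
    return score

def classify(phrase, model, priors):
    """Return the category with the highest log-score."""
    return max(model, key=lambda cat: log_score(phrase, model[cat], priors[cat]))

# Hypothetical toy data standing in for the Brown news and fiction categories.
toy = {
    'news': 'the court heard the murder case and the jury closed the case'.split(),
    'fiction': 'a mysterious stranger solved the mysterious murder case'.split(),
}
model = train(toy)
priors = {'news': 0.5, 'fiction': 0.5}
print(classify('mysterious murder case', model, priors))
```

Note that a single unseen word gives a category a score of negative infinity, so if every category is ruled out this way the result falls to whichever category is listed first.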

## Smoothing

A simple n-gram model assigns zero probability to every word combination that was not encountered in the training corpus, so it will most likely assign zero probability to most out-of-sample test cases. This problem is known as data sparsity, and the traditional solution is to use smoothing techniques.

### Example: bigram model

(pen and paper exercises)

Given Corpus:

$JOHN\ READ\ MOBY\ DICK$
<br>
$MARY\ READ\ A\ DIFFERENT\ BOOK$
<br>
$SHE\ READ\ A\ BOOK\ BY\ CHER$


**Exercise 6:**
Calculate the probability of the sentence "JOHN READ A BOOK".
![Solution Exercise 6](https://preview.ibb.co/igWFpK/exercises6.png)

**Exercise 7:**
What is the $p(CHER\ READ\ A\ BOOK)$?

![Solution Exercise 7](https://preview.ibb.co/gcHYbz/exercises7.png)
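
The pen-and-paper calculations can be checked with a short script. This is a sketch under one stated assumption: each sentence is padded with the boundary markers `<s>` and `</s>` (a common convention that the corpus listing above does not show), and probabilities are maximum-likelihood estimates $c(w_{i-1} w_i) / c(w_{i-1})$. If the linked solutions use a different convention, the numbers will differ. The function names are illustrative.

```python
from collections import Counter

corpus = [
    'JOHN READ MOBY DICK',
    'MARY READ A DIFFERENT BOOK',
    'SHE READ A BOOK BY CHER',
]

def pad(sentence):
    """Assumption: wrap each sentence in <s> ... </s> boundary markers."""
    return ['<s>'] + sentence.split() + ['</s>']

bigram_counts = Counter()   # c(prev word)
history_counts = Counter()  # c(prev), counted as a bigram history
for sentence in corpus:
    tokens = pad(sentence)
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[(prev, word)] += 1
        history_counts[prev] += 1

def mle_sentence_prob(sentence):
    """Product of c(prev word) / c(prev) over the padded sentence."""
    p = 1.0
    tokens = pad(sentence)
    for prev, word in zip(tokens, tokens[1:]):
        if history_counts[prev] == 0:
            return 0.0  # the history itself was never seen
        p *= bigram_counts[(prev, word)] / history_counts[prev]
    return p

print(mle_sentence_prob('JOHN READ A BOOK'))  # 1/3 * 1/1 * 2/3 * 1/2 * 1/2
print(mle_sentence_prob('CHER READ A BOOK'))  # 0.0: the bigram CHER READ never occurs
```

The second sentence is perfectly plausible, yet one unseen bigram is enough to drive its probability to zero.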

### Add-one smoothing

$p(w_i|w_{i-1}) = \frac{1 + c(w_{i-1} w_i)}{\sum_{w_i} [1 + c(w_{i-1} w_i)]}$

**Exercise 8:**
Recalculate $p(JOHN\ READ\ A\ BOOK)$ and $p(CHER\ READ\ A\ BOOK)$ using add-one smoothing.

![Solution Exercise 8](https://preview.ibb.co/bYhc3e/exercises8.png)
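
The add-one formula above can also be sketched in code. Since $\sum_{w_i} [1 + c(w_{i-1} w_i)] = |V| + c(w_{i-1})$, each factor reduces to $(1 + c(w_{i-1} w_i)) / (|V| + c(w_{i-1}))$. Two assumptions are made here and may not match the linked solution: sentences are padded with `<s>` and `</s>`, and $V$ contains every token that can be predicted, i.e. the 11 distinct corpus words plus `</s>`. The function names are illustrative.

```python
from collections import Counter

corpus = [
    'JOHN READ MOBY DICK',
    'MARY READ A DIFFERENT BOOK',
    'SHE READ A BOOK BY CHER',
]

def pad(sentence):
    """Assumption: wrap each sentence in <s> ... </s> boundary markers."""
    return ['<s>'] + sentence.split() + ['</s>']

bigram_counts = Counter()
history_counts = Counter()
vocab = set()
for sentence in corpus:
    tokens = pad(sentence)
    vocab.update(tokens[1:])  # every predictable token, including </s>
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[(prev, word)] += 1
        history_counts[prev] += 1

def add_one_sentence_prob(sentence, vocab_size=len(vocab)):
    """Product of (1 + c(prev word)) / (|V| + c(prev)) over the padded sentence."""
    p = 1.0
    tokens = pad(sentence)
    for prev, word in zip(tokens, tokens[1:]):
        p *= (1 + bigram_counts[(prev, word)]) / (vocab_size + history_counts[prev])
    return p

print(add_one_sentence_prob('JOHN READ A BOOK'))
print(add_one_sentence_prob('CHER READ A BOOK'))  # small, but no longer zero
```

Passing `vocab_size=11` gives the variant in which only the 11 corpus words are counted, which yields slightly different numbers.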

### Other Smoothing methods
- Additive smoothing
- Good-Turing estimate
- Jelinek-Mercer smoothing (interpolation)
- Katz smoothing (backoff)
- Witten-Bell smoothing
- Absolute discounting
- Kneser-Ney smoothing