# Assignment 2 - Elementary Probability and Information Theory 
# Boise State University NLP - Dr. Kennington

### Instructions and Hints:

* This notebook loads some data into a `pandas` dataframe, then does a small amount of preprocessing. Make sure your data can load by stepping through all of the cells up until question 1. 
* Most of the questions require you to write some code. In many cases, you will write some kind of probability function like we did in class using the data. 
* Some of the questions only require you to write answers, so be sure to change the cell type to markdown or raw text.
* Don't worry about normalizing the text this time (e.g., lowercase, etc.). Just focus on probabilities. 
* Most questions can be answered in a single cell, but you can make as many additional cells as you need. 
* Follow the instructions on the corresponding assignment Trello card for submitting your assignment. 

In [1]:
import pandas as pd 

data = pd.read_csv('pnp-train.txt',delimiter='\t',encoding='latin-1', # utf8 encoding didn't work for this
                  names=['type','name']) # supply the column names for the dataframe

# this next line creates a new column with the lower-cased first word
data['first_word'] = data['name'].map(lambda x: x.lower().split()[0])

In [2]:
data[:10]

Unnamed: 0,type,name,first_word
0,drug,Dilotab,dilotab
1,movie,Beastie Boys: Live in Glasgow,beastie
2,person,Michelle Ford-Eriksson,michelle
3,place,Ramsbury,ramsbury
4,place,Market Bosworth,market
5,drug,Cyanide Antidote Package,cyanide
6,person,Bill Johnson,bill
7,place,Ettalong,ettalong
8,movie,The Suicide Club,the
9,place,Pézenas,pézenas


In [3]:
data.describe()

Unnamed: 0,type,name,first_word
count,21001,21001,21001
unique,5,20992,13703
top,movie,Ash,the
freq,6262,2,635


## 1. Write a probability function/distribution $P(T)$ over the types. 

Hints:

* The Counter library might be useful: `from collections import Counter`
* Write a function `def P(T='')` that returns the probability of the specific value for T
* You can access the types from the dataframe by calling `data['type']`
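
The hints above can be sketched on a toy list before touching the real dataframe. This is only an illustration: `toy_types`, `type_counts`, and `P_toy` are hypothetical names, and the list stands in for `data['type']`.

```python
from collections import Counter

# Hypothetical toy data standing in for data['type'].
toy_types = ['movie', 'movie', 'drug', 'person']

# Count every type once, up front, instead of inside the function.
type_counts = Counter(toy_types)
total = sum(type_counts.values())

def P_toy(T=''):
    """Relative frequency of type T; a Counter returns 0 for unseen types."""
    return type_counts[T] / total

print(P_toy(T='movie'))
```

Because a `Counter` returns 0 for missing keys, a type that never occurs gets probability 0 rather than raising a `KeyError`.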

In [215]:
from collections import Counter

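# Count the types once, outside the function, so that `counts` exists
# before P is first called and is not rebuilt on every call.
counts = Counter(data['type'])

def P(T=''):
    # relative frequency of type T among all names
    return counts[T] / len(data['type'])
counts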

Counter({'drug': 5030,
         'movie': 6262,
         'person': 3836,
         'place': 3389,
         'company': 2484})

## 2. What is `P(T='movie')` ?

In [8]:
P(T='movie')

0.29817627732012764

## 3. Show that your probability distribution sums to one.

In [15]:
import numpy as np

round(np.sum([P(T=x) for x in set(data['type'])]), 4)

1.0

## 4. Write a joint distribution using the type and the first word of the name

Hints:

* The function is $P2(T,W_1)$
* You will need to count up types AND the first words, for example: ('person','bill')
* Using the [itertools.product](https://docs.python.org/2/library/itertools.html#itertools.product) function was useful for me here
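
As a rough sketch of these hints, the joint counts can be collected in a single `Counter` keyed by `(type, first_word)` tuples, and `itertools.product` can enumerate every type/word combination for the sum-to-one check. The data and the names `toy_pairs`, `joint_counts`, and `P2_toy` are made up for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical (type, first_word) pairs standing in for the two dataframe columns.
toy_pairs = [('movie', 'the'), ('movie', 'the'),
             ('person', 'bill'), ('company', 'the')]

joint_counts = Counter(toy_pairs)   # counts each (type, word) pair
n = len(toy_pairs)

def P2_toy(T='', W1=''):
    """Joint relative frequency of type T together with first word W1."""
    return joint_counts[(T, W1)] / n

types = {t for t, _ in toy_pairs}
words = {w for _, w in toy_pairs}

# Sum over the full cross product, including pairs that never occur (probability 0).
total = sum(P2_toy(T=t, W1=w) for t, w in product(types, words))
print(total)
```

Counting once and looking pairs up afterwards avoids rescanning the data for every `(T, W1)` combination, which matters when the cross product is large.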

In [124]:
def P2(T='', W1=''):
    # fraction of all rows whose type is T and whose first word is W1
    count = data[['type', 'first_word']]
    return len(count.loc[(count['type'] == T) & (count['first_word'] == W1)]) / len(count)

## 5. What is P2(T='person', W1='bill')? What about P2(T='movie',W1='the')?

In [113]:
P2(T='person', W1='bill')

0.00047616780153326033

In [114]:
P2(T='movie', W1='the')

0.02747488214846912

## 6. Show that your probability distribution P(T,W1) sums to one.

In [244]:
types = Counter(data['type'])
words = Counter(data['first_word'])
retVal = 0
for x in types:
    for y in words:
        retVal = retVal + P2(T=x,W1=y)
print(round(retVal,4))

1.0


## 7. Make a new function Q(T) from marginalizing over P(T,W1) and make sure that Q(T) sums to one.

Hints:

* Your Q function will call P(T,W1)
* Your check for the sum to one should be the same answer as Question 3, only it calls Q instead of P.
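
Marginalizing means summing the joint probability over every value of the variable being removed: $Q(T) = \sum_{w} P(T, W_1 = w)$. A minimal sketch on made-up data follows; all `*_toy` names are hypothetical. Since the marginal is built from many floating-point additions, it is compared to the direct estimate with `math.isclose` rather than `==`.

```python
import math
from collections import Counter

# Hypothetical (type, first_word) pairs.
toy_pairs = [('movie', 'the'), ('movie', 'alien'), ('movie', 'the'),
             ('company', 'the'), ('person', 'bill'), ('person', 'ann')]
n = len(toy_pairs)

joint = Counter(toy_pairs)
type_counts = Counter(t for t, _ in toy_pairs)
words = {w for _, w in toy_pairs}

def P_toy(T=''):
    return type_counts[T] / n

def P2_toy(T='', W1=''):
    return joint[(T, W1)] / n

def Q_toy(T=''):
    # marginalize the joint distribution over all first words
    return sum(P2_toy(T=T, W1=w) for w in words)

# Q should agree with P for every type, up to floating-point error.
agree = all(math.isclose(Q_toy(T=t), P_toy(T=t)) for t in type_counts)
print(agree)
```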

In [208]:
def Q(T=''):
    words = Counter(data['first_word'])
    retVal = 0
    for x in words:
        retVal = retVal + P2(T,W1=x)
    return retVal

In [209]:
Q('movie')

0.29817627732011875

In [245]:
round(np.sum([Q(T=x) for x in set(data['type'])]), 4)

1.0

## 8. What is the KL Divergence of your Q function and your P function for Question 1?

* Even if you know the answer, you still need to write code that computes it.

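Because Q is obtained by marginalizing P(T,W1) over W1, it describes the same distribution as P, so the KL divergence should be 0.0 (up to floating-point error). The sum has to run over all five types, not just two of them.

In [None]:
import math

# KL(P || Q) = sum over every type t of P(t) * log(P(t) / Q(t))
sum(P(T=t) * math.log(P(T=t) / Q(T=t)) for t in set(data['type']))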
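
For reference, a generic version of the computation might look like the sketch below. The helper name `kl_divergence` and the toy distributions are assumptions for illustration, and the function assumes `q` assigns non-zero probability to every outcome that `p` does.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two dicts mapping outcome -> probability.

    Terms with p(x) == 0 contribute nothing and are skipped.
    Assumes q[x] > 0 wherever p[x] > 0.
    """
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

# Hypothetical distributions over two outcomes.
p = {'a': 0.5, 'b': 0.5}
q = {'a': 0.25, 'b': 0.75}

print(kl_divergence(p, p))  # identical distributions: every log term is log(1) = 0
print(kl_divergence(p, q))  # positive, because the distributions differ
```

Note that KL divergence is not symmetric: `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ.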

## 9. Convert from P(T,W1) to P(W1|T) 

Hints:

* Just write a comment cell, no code this time. 
* Note that $P(T,W1) = P(W1,T)$

Since $P(T,W_1) = P(W_1,T)$ and, by the multiplication rule, $P(W_1,T) = P(W_1|T)\,P(T)$, dividing both sides by $P(T)$ gives

$$P(W_1|T) = \frac{P(W_1,T)}{P(T)}$$

(try to use markdown math formatting, answer in this cell)

## 10. Write a function `Pwt` (that calls the functions you already have) to compute $P(W_1|T)$.

* This will be something like the multiplication rule, but you may need to change something

In [206]:
def Pwt(W1='',T=''):
    return P2(T=T,W1=W1)/P(T=T)

## 11. What is P(W1='the'|T='movie')?

In [207]:
Pwt(W1='the',T='movie')

0.09214308527626956

## 12. Use Bayes' rule to convert from P(W1|T) to P(T|W1). Write a function Ptw to reflect this. 

Hints:

* Call your other functions.
* You may need to write a function for P(W1) and you may need a new counter for `data['first_word']`

In [221]:
def Pw(W1=''):
    words = Counter(data['first_word'])
    return words[W1] / len(data['first_word'])


def Ptw(T='',W1=''):
    return (Pwt(W1=W1,T=T)*P(T=T))/Pw(W1=W1)
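
One way to sanity-check a Bayes' rule implementation is to compare it with the posterior counted directly from the data. The sketch below does this on made-up pairs; every name in it is hypothetical and it is not the notebook's own `Ptw`.

```python
import math
from collections import Counter

# Hypothetical (type, first_word) pairs.
toy_pairs = [('movie', 'the'), ('movie', 'the'), ('movie', 'alien'),
             ('company', 'the'), ('person', 'bill')]
n = len(toy_pairs)

joint = Counter(toy_pairs)
type_counts = Counter(t for t, _ in toy_pairs)
word_counts = Counter(w for _, w in toy_pairs)

def p_w_given_t(w, t):
    return joint[(t, w)] / type_counts[t]

def p_t_given_w_bayes(t, w):
    # Bayes' rule: P(t | w) = P(w | t) * P(t) / P(w)
    return p_w_given_t(w, t) * (type_counts[t] / n) / (word_counts[w] / n)

def p_t_given_w_direct(t, w):
    # Same quantity counted directly: share of names starting with w that have type t
    return joint[(t, w)] / word_counts[w]

# The most likely type for a word is the one with the highest posterior.
best_type = max(type_counts, key=lambda t: p_t_given_w_bayes(t, 'the'))
print(best_type)
```

Both routes should agree up to floating-point error, and the posteriors over all types for a fixed word should sum to one.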

## 13 
### What is P(T='movie'|W1='the')? 
### What about P(T='person'|W1='the')?
### What about P(T='drug'|W1='the')?
### What about P(T='place'|W1='the')?
### What about P(T='company'|W1='the')?

In [222]:
Ptw(T='movie',W1='the')

0.9086614173228347

In [223]:
Ptw(T='person',W1='the')

0.0

In [224]:
Ptw(T='drug',W1='the')

0.0

In [225]:
Ptw(T='place',W1='the')

0.0015748031496062992

In [226]:
Ptw(T='company',W1='the')

0.08976377952755905

## 14. Given this, if the word 'the' is found in a name, what is the most likely type?

In [240]:
max(set(data['type']), key=lambda t: Ptw(T=t, W1='the'))

'movie'

## 15. Is Ptw(T='movie'|W1='the') the same as Pwt(W1='the'|T='movie')? Why or why not?

In [241]:
Ptw(T='movie',W1='the')

0.9086614173228347

In [242]:
Pwt(W1='the', T='movie')

0.09214308527626956

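They are not the same, because they condition on different events. Pwt(W1='the'|T='movie') asks: of all the movies, what fraction start with 'the'? Ptw(T='movie'|W1='the') asks: of all the names that start with 'the', what fraction are movies? Both share the same joint count in the numerator but divide by different totals (the 6262 movies versus the 635 names whose first word is 'the'), so the values differ.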
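
The two values can be reproduced by hand from counts already shown in this notebook. In the sketch below, 6262 and 635 come from the outputs above, while the joint count 577 is derived from P2(T='movie', W1='the') = 0.027474... multiplied by the 21001 rows; the variable names are illustrative.

```python
import math

# Counts taken from the notebook outputs above.
n_movie = 6262          # names of type 'movie'
n_the = 635             # names whose first word is 'the'
n_movie_and_the = 577   # derived: 0.02747488214846912 * 21001 rows

# Same numerator, different denominators.
p_the_given_movie = n_movie_and_the / n_movie   # P(W1='the' | T='movie')
p_movie_given_the = n_movie_and_the / n_the     # P(T='movie' | W1='the')

print(p_the_given_movie, p_movie_given_the)
```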

## 16. Do you think modeling Ptw(T|W1) would be better with a continuous function like a Gaussian? Why or why not?

- Answer in a markdown cell


No, I don't think modeling Ptw(T|W1) would be better with a continuous function like a Gaussian. Both T and W1 are discrete, categorical variables (five types and a finite vocabulary of first words) with no natural ordering or distance between their values, whereas a Gaussian is a density over a continuous, real-valued variable. The data set is also finite and likely biased rather than normally distributed, so a categorical distribution estimated from counts is a better fit than a continuous function.