# Assignment 2 - Elementary Probability and Information Theory 
# Boise State University NLP - Dr. Kennington

### Instructions and Hints:

* This notebook loads some data into a `pandas` dataframe, then does a small amount of preprocessing. Make sure your data can load by stepping through all of the cells up until question 1. 
* Most of the questions require you to write some code. In many cases, you will write some kind of probability function like we did in class using the data. 
* Some of the questions only require written answers, so be sure to change the cell type to markdown or raw text.
* Don't worry about normalizing the text this time (e.g., lowercasing). Just focus on probabilities. 
* Most questions can be answered in a single cell, but you can make as many additional cells as you need. 
* Follow the instructions on the corresponding assignment Trello card for submitting your assignment. 

In [1]:
import pandas as pd 

data = pd.read_csv('pnp-train.txt',delimiter='\t',encoding='latin-1', # utf8 encoding didn't work for this
                  names=['type','name']) # supply the column names for the dataframe

# this next line creates a new column with the lower-cased first word
data['first_word'] = data['name'].map(lambda x: x.lower().split()[0])

In [2]:
data[:10]

| | type | name | first_word |
|---|---|---|---|
| 0 | drug | Dilotab | dilotab |
| 1 | movie | Beastie Boys: Live in Glasgow | beastie |
| 2 | person | Michelle Ford-Eriksson | michelle |
| 3 | place | Ramsbury | ramsbury |
| 4 | place | Market Bosworth | market |
| 5 | drug | Cyanide Antidote Package | cyanide |
| 6 | person | Bill Johnson | bill |
| 7 | place | Ettalong | ettalong |
| 8 | movie | The Suicide Club | the |
| 9 | place | Pézenas | pézenas |


In [3]:
data.describe()

| | type | name | first_word |
|---|---|---|---|
| count | 21001 | 21001 | 21001 |
| unique | 5 | 20992 | 13703 |
| top | movie | Valentine | the |
| freq | 6262 | 2 | 635 |


In [4]:
from collections import Counter as ctr

In [5]:
data_ctr=ctr(data.type)
print(data_ctr)

Counter({'movie': 6262, 'drug': 5030, 'person': 3836, 'place': 3389, 'company': 2484})


## 1. Write a probability function/distribution $P(T)$ over the types. 

Hints:

* The Counter library might be useful: `from collections import Counter`
* Write a function `def P(T='')` that returns the probability of the specific value for T
* You can access the types from the dataframe by calling `data['type']`

In [41]:
from collections import Counter as ctr
data_ctr=ctr(data.type)

def P(T=''):
    return data_ctr[T]/len(data.type)


In [42]:
P(T='movie'), P(T='drug'), P(T='person'), P(T='place'), P(T='company')

(0.29817627732012764,
 0.23951240417122993,
 0.18265796866815867,
 0.16137326793962192,
 0.11828008190086187)

## 2. What is `P(T='movie')` ?

In [43]:
# P is already defined in Question 1, so it only needs to be called here
P(T='movie')

0.29817627732012764

## 3. Show that your probability distribution sums to one.

In [44]:
P(T='movie') + P(T='drug') + P(T='person') + P(T='place') + P(T='company')

1.0
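
As a self-contained check of Questions 1 to 3, the counts printed by the `Counter` earlier in the notebook can be turned into a distribution without loading the data file. This is only a minimal sketch: the names `type_counts` and `p_type` are illustrative and not part of the assignment, and the sum is compared with a tolerance because floating-point addition need not give exactly 1.

```python
from collections import Counter
from math import isclose

# Type counts copied from the Counter output printed earlier in the notebook.
type_counts = Counter({'movie': 6262, 'drug': 5030, 'person': 3836,
                       'place': 3389, 'company': 2484})
total = sum(type_counts.values())  # number of rows in the dataframe

def p_type(t):
    """Relative frequency of type t; a type that never occurs gets 0."""
    return type_counts[t] / total

# Compare floating-point sums with a tolerance rather than with ==.
assert isclose(sum(p_type(t) for t in type_counts), 1.0)
print(p_type('movie'))
```

Because a `Counter` returns 0 for missing keys, `p_type` is defined for any string, which is the same behaviour the notebook's `P` relies on.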

## 4. Write a joint distribution using the type and the first word of the name

Hints:

* The function is $P2(T,W_1)$
* You will need to count up types AND the first words, for example: `('person', 'bill')`
* Using the [itertools.product](https://docs.python.org/2/library/itertools.html#itertools.product) function was useful for me here

In [107]:
zs = zip(data.type,data.first_word)

In [108]:
#list(zs)

In [109]:
data_ctr2= ctr(zip(data.type,data.first_word))
#data_ctr2

In [110]:
def P2(T='', W1=''):
    return data_ctr2[(T, W1)] / sum(data_ctr2.values())


## 5. What is P2(T='person', W1='bill')? What about P2(T='movie',W1='the')?

In [111]:
P2(T='person', W1='bill')  

0.00047616780153326033

In [56]:
P2(T='movie', W1='the')

0.02747488214846912

## 6. Show that your probability distribution P(T,W1) sums to one.

In [57]:
import numpy as np
(np.sum([P2(T=t,W1=w1) for (t,w1) in set(data_ctr2.keys())]))


1.0
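
The joint distribution can also be illustrated on a tiny sample. The sketch below uses only the ten rows displayed by `data[:10]`, so its probabilities differ from those of the full dataset; `pairs`, `pair_counts`, and `p_joint` are illustrative names.

```python
from collections import Counter
from math import isclose

# (type, first_word) pairs for the ten rows displayed by data[:10].
pairs = [('drug', 'dilotab'), ('movie', 'beastie'), ('person', 'michelle'),
         ('place', 'ramsbury'), ('place', 'market'), ('drug', 'cyanide'),
         ('person', 'bill'), ('place', 'ettalong'), ('movie', 'the'),
         ('place', 'pézenas')]

pair_counts = Counter(pairs)
n_pairs = len(pairs)

def p_joint(t, w):
    """Joint relative frequency of (type, first word) in the sample."""
    return pair_counts[(t, w)] / n_pairs

# Summing over every observed pair covers all of the probability mass,
# because unobserved pairs have probability 0.
assert isclose(sum(p_joint(t, w) for (t, w) in pair_counts), 1.0)
print(p_joint('person', 'bill'))
```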

## 7. Make a new function Q(T) from marginalizing over P(T,W1) and make sure that Q(T) sums to one.

Hints:

* Your Q function will call P(T,W1)
* Your check for the sum to one should be the same answer as Question 3, only it calls Q instead of P.

In [58]:
def Q(T=''):
    # marginalize over the first words that occur with type T
    sub_data = data[data.type == T]
    return sum(P2(T=T, W1=word) for word in set(sub_data.first_word))

In [59]:
#sub_data

In [61]:
round(np.sum([Q(T=t) for t in set(data.type)]))



1

In [63]:
Q(T='movie')

0.2981762773201257
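
Marginalization can also be done directly on the joint table, without filtering the dataframe. The sketch below again uses only the ten rows shown by `data[:10]` and checks that summing the joint over first words recovers the directly counted type frequencies; `joint`, `q_type`, and `direct` are illustrative names.

```python
from collections import Counter
from math import isclose

# (type, first_word) pairs for the ten rows displayed by data[:10].
pairs = [('drug', 'dilotab'), ('movie', 'beastie'), ('person', 'michelle'),
         ('place', 'ramsbury'), ('place', 'market'), ('drug', 'cyanide'),
         ('person', 'bill'), ('place', 'ettalong'), ('movie', 'the'),
         ('place', 'pézenas')]

# Joint distribution stored as {(type, word): probability}.
joint = {pair: count / len(pairs) for pair, count in Counter(pairs).items()}

def q_type(t):
    """Marginal of the type: sum the joint over every first word."""
    return sum(p for (t2, _), p in joint.items() if t2 == t)

# The marginal should agree with counting the types directly.
direct = Counter(t for t, _ in pairs)
for t in direct:
    assert isclose(q_type(t), direct[t] / len(pairs))
print(q_type('place'))
```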

## 8. What is the KL Divergence of your Q function and your P function for Question 1?

* Even if you know the answer, you still need to write code that computes it.

In [64]:
Pb = [P(T='movie'), P(T='drug'), P(T='person'), P(T='place'), P(T='company')]

In [65]:
Qb = [Q(T='movie'), Q(T='drug'), Q(T='person'), Q(T='place'), Q(T='company')]

In [104]:
from math import log2
# calculate the KL divergence D(Q || P) = sum_i Q[i] * log2(Q[i] / P[i])
def kl_divergence(Q, P):
    return sum(Q[i] * log2(Q[i] / P[i]) for i in range(len(Q)))

In [105]:
kl_divergence(Qb, Pb)

3.820933829374213e-14
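
For reference, the KL divergence $D(Q \| P) = \sum_i Q(i) \log_2 \frac{Q(i)}{P(i)}$ is non-negative and equals zero when the two distributions are identical, so a value on the order of $10^{-14}$ here reflects floating-point rounding rather than a real difference between Q and P. The sketch below uses two made-up distributions (not taken from the data) to show both cases.

```python
from math import log2, isclose

def kl_divergence(q, p):
    """D(q || p) in bits for two aligned lists of probabilities."""
    # Terms with q_i == 0 contribute 0 by convention, so they are skipped.
    return sum(qi * log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Hypothetical distributions, chosen only for illustration.
same = kl_divergence([0.5, 0.5], [0.5, 0.5])    # identical distributions
diff = kl_divergence([0.5, 0.5], [0.25, 0.75])  # different distributions

print(same, diff)
```

Note that the divergence is not symmetric: swapping the arguments generally gives a different value.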

## 9. Convert from P(T,W1) to P(W1|T) 

Hints:

* Just write a comment cell, no code this time. 
* Note that $P(T,W1) = P(W1,T)$

\begin{align*}
P(T, W_1) &= P(W_1, T) = P(T)\,P(W_1 \mid T)\\
          &\implies P(W_1 \mid T) = \frac{P(W_1, T)}{P(T)}
\end{align*}

## 10. Write a function `Pwt` (that calls the functions you already have) to compute $P(W_1|T)$.

* This will be something like the multiplication rule, but you may need to change something

In [68]:
def Pwt(W1='', T=''):
    # P(W1|T) = P(T,W1) / P(T), reusing the functions from Questions 4 and 1
    return P2(T=T, W1=W1) / P(T)

## 11. What is P(W1='the'|T='movie')?

In [69]:
Pwt(W1='the',T='movie')

0.09214308527626956
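
This value can be cross-checked by hand from two numbers already printed in the notebook, the joint probability from Question 5 and the type probability from Question 2. The variable names below are illustrative.

```python
from math import isclose

# Values copied from the outputs printed earlier in the notebook.
p_movie_the = 0.02747488214846912  # P2(T='movie', W1='the')
p_movie = 0.29817627732012764      # P(T='movie')

# Conditional probability: P(W1='the' | T='movie') = P(T, W1) / P(T)
p_the_given_movie = p_movie_the / p_movie

assert isclose(p_the_given_movie, 0.09214308527626956, rel_tol=1e-9)
print(p_the_given_movie)
```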

## 12. Use Bayes' rule to convert from P(W1|T) to P(T|W1). Write a function Ptw to reflect this. 

Hints:

* Call your other functions.
* You may need to write a function for P(W1) and you may need a new counter for `data['first_word']`

In [112]:
data_ctr1=ctr(data.first_word)
#print(data_ctr1)

def Pa(W1=''):
    return data_ctr1[W1]/len(data.first_word)

In [113]:
def Ptw(T='', W1=''):
    # Bayes' rule: P(T|W1) = P(W1|T) * P(T) / P(W1)
    return (Pwt(W1, T) * P(T)) / Pa(W1)

## 13. 
### What is P(T='movie'|W1='the')? 
### What about P(T='person'|W1='the')?
### What about P(T='drug'|W1='the')?
### What about P(T='place'|W1='the')?
### What about P(T='company'|W1='the')?

In [95]:
Ptw(T='movie',W1='the')

0.9086614173228347

In [96]:
Ptw(T='person',W1='the')

0.0

In [97]:
Ptw(T='drug',W1='the')

0.0

In [98]:
Ptw(T='place',W1='the')

0.0015748031496062992

In [99]:
Ptw(T='company',W1='the')

0.08976377952755905
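
The Bayes' rule computation can be reproduced from numbers already shown in the notebook. The sketch below assumes P(W1='the') = 635/21001, taken from the `data.describe()` output (the top `first_word` is 'the' with frequency 635 out of 21001 rows); the variable names are illustrative.

```python
from math import isclose

# Values copied from outputs printed earlier in the notebook.
p_the_given_movie = 0.09214308527626956  # Pwt(W1='the', T='movie')
p_movie = 0.29817627732012764            # P(T='movie')
p_the = 635 / 21001                      # from data.describe(): freq / count

# Bayes' rule: P(T|W1) = P(W1|T) * P(T) / P(W1)
p_movie_given_the = p_the_given_movie * p_movie / p_the
assert isclose(p_movie_given_the, 0.9086614173228347, rel_tol=1e-9)

# The five posteriors printed for Question 13 (movie, person, drug, place,
# company) should themselves form a distribution over the types.
posteriors = [0.9086614173228347, 0.0, 0.0,
              0.0015748031496062992, 0.08976377952755905]
assert isclose(sum(posteriors), 1.0)
print(p_movie_given_the)
```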

## 14. Given this, if the word 'the' is found in a name, what is the most likely type?

In [77]:
# use new names here so the earlier P and data_ctr2 are not overwritten
name_ctr = ctr(data.name)

def Pn(N=''):
    # probability of a full name
    return name_ctr[N] / len(data.name)

data_ctr3 = ctr(zip(data.type, data.name))

def Ptn(T='', N=''):
    # P(T|N) = P(T,N) / P(N)
    return (data_ctr3[(T, N)] / sum(data_ctr3.values())) / Pn(N)
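
One way to read the answer off the posteriors printed for Question 13 is to take the type with the highest P(T|W1='the'). The dictionary below simply copies those printed values, and `posterior_given_the` is an illustrative name. Note the assumption: these posteriors condition on 'the' being the first word of the name, not on it appearing anywhere in the name.

```python
# Posteriors P(T | W1='the') copied from the Question 13 outputs.
posterior_given_the = {
    'movie': 0.9086614173228347,
    'person': 0.0,
    'drug': 0.0,
    'place': 0.0015748031496062992,
    'company': 0.08976377952755905,
}

# The most likely type is the key with the largest posterior probability.
most_likely = max(posterior_given_the, key=posterior_given_the.get)
print(most_likely)  # movie
```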

## 15. Is Ptw(T='movie'|W1='the') the same as Pwt(W1='the'|T='movie')? Why or why not?

In [100]:
Ptw(T='movie',W1='the')

0.9086614173228347

In [101]:
Pwt(W1='the', T='movie')

0.09214308527626956

They are not the same. Ptw(T='movie', W1='the') is the probability that the type is movie given that the first word is 'the' (about 0.909), whereas Pwt(W1='the', T='movie') is the probability that the first word is 'the' given that the type is movie (about 0.092). The two are related by Bayes' rule but condition on different events, so in general they differ.

## 16. Do you think modeling Ptw(T|W1) would be better with a continuous function like a Gaussian? Why or why not?

- Answer in a markdown cell


In this problem we are dealing with discrete (categorical) variables whose relative frequencies we can simply count, so I don't think a continuous function like a Gaussian would model Ptw(T|W1) better.