## COG403: Problem 2 of Problem Set 1: Betas and Homophones

### All 3 problems for Problem Set 1 Due 4 October 2018

Imagine you are a child learning English. You are building up your vocabulary, but you are struggling with homophones (words that mean different things but sound the same). In particular, you are working on learning the difference between *for* and *four*, both of which are phonetically [fɔɹ]. (This is IPA. For more information, see https://en.wikipedia.org/wiki/International_Phonetic_Alphabet).

You assume that there is a parameter, $\theta$, that controls how often [fɔɹ] conveys each meaning (*for* and *four*). This assumption can be formalized as the graphical model shown below. Each $X_i$ represents a sentence that uses a word that sounds like [fɔɹ]. $X_i$ takes on value 1 if [fɔɹ] in sentence $i$ means *for* and 0 if it means *four*. The assumption underlying this graphical model is that there is some unobserved value of $\theta$ and that for each sentence, a biased coin is flipped to determine whether a word that sounds like [fɔɹ] will mean *for* (with probability $\theta$) or mean *four* (with probability $1- \theta$ ). You would like to learn the value of $\theta$ by observing sentences that contain words that are pronounced [fɔɹ].

We will model learning [fɔɹ] based on corpus data.

![betas graphical model](https://notebooks.azure.com/juliawatson/libraries/q2-betas-and-homophones/raw/grapical_model_betas.png)

### (a) 

Write a function to load in the Brent Ratner child-directed speech corpus and return a dictionary mapping each word type to its frequency in the corpus. This corpus is stored in `data/brent-ratner-corpus.txt` in the library for this notebook.

In [14]:
from collections import Counter

def compute_word_counts(path_to_file):
    """
    path_to_file: string -- the path to the corpus file
    
    Return a dictionary mapping each unique type in the corpus at path_to_file
    to the number of times it occurs in the corpus. Make sure to convert words
    to all lowercase to get unique types.
    """
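    counter = Counter()
    # Use a context manager so the file is closed once it has been read.
    with open(path_to_file, "r") as file:
        for line in file:
            for word in line.split():
                # Lowercase so that "For" and "for" count as the same type.
                counter[word.lower()] += 1
    return counter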
            

corpus_counts = compute_word_counts('data/brent-ratner-corpus.txt')

### (b)

The estimate of $\theta$ that maximizes your posterior distribution is called the MAP estimate (MAP stands for maximum a posteriori), which we refer to as $\hat{\theta}$. 

You can model your posterior distribution as $Beta(a_1 + k_1, a_2 + k_2)$, where $a_1$ and $a_2$ are from your prior distribution, and $k_1$ is the number of observed sentences with [fɔɹ] that meant *for*, and $k_2$ is the number of observed sentences with [fɔɹ] that meant *four*. 

When you model the posterior distribution this way, you can compute $\hat{\theta}$ using the formula $\frac{a_1+k_1-1}{(a_1+k_1-1)+(a_2+k_2-1)}$.

Use this formula to compute $\hat{\theta}$ after observing the child-directed utterances below (taken from the Brent Ratner Corpus$^1$ from CHILDES$^2$):
1. This is food **for** the dragon.
2. One block, two blocks, three blocks, **four** blocks.
3. Thank you **for** the phone.
4. What do you want to get **for** your birthday?

Assume an initial prior distribution of $Beta(1, 1)$. You may do this by hand or write Python code. If you choose to do it by hand, be sure to show your work.

Note: for more information on the Beta distribution see the tutorial from week 3:
https://betastutorial-juliawatson.notebooks.azure.com/j/notebooks/betas.ipynb

Assume the prior distribution $Beta(1, 1)$, i.e. $a_1 = 1$ and $a_2 = 1$.<br/>
The four child-directed utterances contain 3 instances where *for* is the intended meaning and 1 instance where *four* is the intended meaning.<br/>
Then $k_1 = 3$ and $k_2 = 1$.<br/>
$\hat{\theta} = \frac{1+3-1}{(1+3-1)+(1+1-1)} = \frac{3}{4} = 0.75$
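
The hand calculation can be double-checked with a few lines of Python. This is only a minimal sketch: the helper name `map_estimate` is an assumption introduced here for illustration and is not part of the assignment's starter code.

```python
def map_estimate(a1, a2, k1, k2):
    """Return the mode of the posterior Beta(a1 + k1, a2 + k2).

    Hypothetical helper that applies the MAP formula from part b. The
    denominator is zero when a1 + a2 + k1 + k2 == 2 (for example Beta(1, 1)
    with no data), so this sketch assumes at least one observation or an
    informative prior.
    """
    numerator = a1 + k1 - 1
    denominator = (a1 + k1 - 1) + (a2 + k2 - 1)
    return numerator / denominator


# Beta(1, 1) prior, 3 sentences meaning "for", 1 sentence meaning "four".
theta_hat_me = map_estimate(1, 1, 3, 1)
print(theta_hat_me)  # 0.75
```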

### (c)

You have two friends, Jack and Jill, who are also trying to learn the meaning of [fɔɹ]. Their learning biases are different from yours. Jack has a prior distribution of $Beta(10,10)$ and Jill has a prior distribution of $Beta(100,100)$. After observing the utterances from part b, what are the parameters of their posterior distributions? What are their MAP estimates of $\theta$? You may do this by hand or write Python code. If you choose to do it by hand, be sure to show your work.

The parameters of their posterior distributions are given by $(a_1 + k_1, a_2 + k_2)$,
<br/>
where $a_1$ and $a_2$ are the parameters of the prior distribution and $k_1$ and $k_2$ are the counts from the utterances observed in part b. In this case, $k_1 = 3$ and $k_2 = 1$.<br/>
For Jack, the posterior is $Beta(10 + 3, 10 + 1) = Beta(13, 11)$.<br/>
For Jill, the posterior is $Beta(100 + 3, 100 + 1) = Beta(103, 101)$.

Their MAP estimates of $\theta$, i.e. $\hat{\theta}$, are given by $\frac{a_1+k_1-1}{(a_1+k_1-1)+(a_2+k_2-1)}$.
<br/>
For Jack, it is $\hat{\theta} = \frac{10+3-1}{(10+3-1)+(10+1-1)} = \frac{12}{22} \approx 0.54545454545$ <br/>
For Jill, it is $\hat{\theta} = \frac{100+3-1}{(100+3-1)+(100+1-1)} = \frac{102}{202} \approx 0.50495049505$ <br/>
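
The same update can be sketched in Python for both friends. The function name `posterior_and_map` is hypothetical (it does not appear in the assignment); it simply adds the observed counts to the prior parameters and then takes the mode of the resulting Beta distribution.

```python
def posterior_and_map(a1, a2, k1, k2):
    """Return ((posterior a1, posterior a2), MAP estimate) for a Beta prior.

    Hypothetical helper: the posterior is Beta(a1 + k1, a2 + k2) and the MAP
    estimate is its mode, (a - 1) / (a + b - 2) for posterior parameters a, b.
    """
    post_a1, post_a2 = a1 + k1, a2 + k2
    theta_hat = (post_a1 - 1) / (post_a1 + post_a2 - 2)
    return (post_a1, post_a2), theta_hat


# k1 = 3 uses of "for" and k2 = 1 use of "four", as in part b.
results = {}
for name, (a1, a2) in {"Jack": (10, 10), "Jill": (100, 100)}.items():
    results[name] = posterior_and_map(a1, a2, 3, 1)
    print(name, results[name])
```

Both estimates stay close to 0.5 because four observations are small compared with the pseudo-counts carried by the priors.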


### (d)

Write a function that uses the word frequencies computed in part a above to compute the probability of a word given a list of its homophones. This probability will serve as the "true" $\theta$ value -- the value that the child is seeking to learn from the sample of data they're exposed to.

In [15]:
def compute_theta(corpus_counts, word1, homophone_list):
    """
    corpus_counts: dict of str->int, mapping words to their frequencies
    word1: str
    homophone_list: list of words that sound the same as word1. word1 must be in homophone_list.
    
    Return the probability of word1 given that a word from homophone_list occurred.
    """
    # Count of the target word.
    prob_num = corpus_counts[word1]
    # Total count over all homophones (word1 included).
    prob_denom = sum(corpus_counts[word] for word in homophone_list)
    return float(prob_num) / prob_denom


true_theta = compute_theta(corpus_counts, "for", ["for", "four"])

### (e)

Compute the squared error of each of the MAP estimates (for you, Jack, and Jill) based on the true $\theta$ computed in part d. Who had the lowest squared error: you, Jack, or Jill? You may do this by hand or write Python code. If you choose to do it by hand, be sure to show your work.

Let the squared error of a MAP estimate be
$$E = \delta^{2} = (\theta - \hat{\theta})^{2}$$
where $\theta = 0.9074074074074074$ is the true value computed in part d.

For me, using the MAP estimate from part b, the squared error is
$$ E_{Me} = \delta^{2}_{Me} = (0.9074074074074074 - 0.75)^{2} \approx 0.0247770919$$
For Jack, using the MAP estimate from part c, the squared error is
$$ E_{Jack} = \delta^{2}_{Jack} = (0.9074074074074074 - 0.54545454545)^{2} \approx 0.13100987427$$
For Jill, it is
$$ E_{Jill} = \delta^{2}_{Jill} = (0.9074074074074074 - 0.50495049505)^{2} \approx 0.1619715663$$

Therefore, I have the lowest squared error.
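
As a cross-check on the arithmetic, the three squared errors can be recomputed in Python. This is a sketch that hard-codes the true $\theta$ quoted above and the MAP estimates from parts b and c; the variable names are assumptions chosen for illustration.

```python
# True theta as quoted in the calculation above (from part d).
true_theta = 0.9074074074074074

# MAP estimates from parts b and c, written as exact fractions.
map_estimates = {"Me": 3 / 4, "Jack": 12 / 22, "Jill": 102 / 202}

# Squared error of each estimate against the true theta.
squared_errors = {
    name: (true_theta - estimate) ** 2
    for name, estimate in map_estimates.items()
}

# Learner with the smallest squared error.
best = min(squared_errors, key=squared_errors.get)
print(squared_errors)
print(best)
```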

### (f)

Write a function `generate_corpus` that creates a random corpus that you and your friends might encounter based on the theta value computed in part d. For any integer $n$, your command should return a 1-by-n vector consisting of ones (uses of [fɔɹ] that mean *for*) and zeros (uses of [fɔɹ] that mean *four*), where 1 occurs approximately $\theta$ fraction of the time and 0 occurs approximately (1 - $\theta$) fraction of the time. (Hint: using `numpy.random.rand(n)` will give you a vector of $n$ random numbers uniformly sampled from $[0,1)$. How can you use this to generate a list of ones and zeros where 1 occurs $\theta$ fraction of the time?)

In [16]:
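from numpy import random

def generate_corpus(true_theta, n):
    # Each uniform draw in [0, 1) falls below true_theta with probability true_theta,
    # so the comparison yields 1 (*for*) about true_theta of the time and 0 (*four*) otherwise.
    return (random.rand(n) < true_theta).astype(int)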
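
One way to sanity-check this approach is to draw a large corpus and confirm that the fraction of ones lands near $\theta$. The sketch below is a seeded variant rather than the function above: the name `generate_corpus_seeded`, the `seed` parameter, and the test value 0.9 are all assumptions made so that the check is reproducible.

```python
import numpy as np


def generate_corpus_seeded(theta, n, seed=0):
    """Seeded variant of generate_corpus (hypothetical helper for testing)."""
    rng = np.random.default_rng(seed)
    # A uniform draw in [0, 1) is below theta with probability theta.
    return (rng.random(n) < theta).astype(int)


sample = generate_corpus_seeded(0.9, 100_000)
fraction_of_ones = sample.mean()
print(sample[:10], round(float(fraction_of_ones), 3))
```

With 100,000 draws the observed fraction should sit very close to 0.9; with only a handful of draws it can be far off, which is what makes the small-sample experiment in part h interesting.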

### (g)

Write a function `learn` to simulate a learner. Your function should take in a number $n$, the parameters $a_1$ and $a_2$ of the Beta prior distribution, and the true $\theta$ value. It should first generate a random corpus of length $n$ (using `generate_corpus` from f), then use this corpus together with the prior to find the parameters of the posterior distribution, and finally use those parameters to compute the MAP estimate $\hat{\theta}$ and the squared error of this estimate. Your function should return the MAP estimate as well as the squared error.

Test out this function by calling it using:
 * `a_1 = a_2 = 1`
 * `true_theta` = the true theta value for *for* (computed in part d)
 * `n = 100`

In [17]:
def learn(a1, a2, n, true_theta):
    """
    a1: int -- parameter for prior Beta distribution
    a2: int -- parameter for prior Beta distribution
    n: int -- number of samples to use to update Beta distribution
    true_theta: float -- the theta value we want to model. We use it to generate a corpus.

    Return MAP theta value and squared error for Beta(a1, a2) after seeing n examples
    of a word that sounds like "for" used to mean *for* and *four*. The examples are
    generated randomly, with "for" meaning *for* true_theta fraction of the time and
    meaning *four* (1 - true_theta) fraction of the time.
    """
    corpus = generate_corpus(true_theta, n)
    # Count the sentences meaning *for* (1) and *four* (0).
    k1 = int(corpus.sum())
    k2 = n - k1
    # MAP estimate: the mode of the posterior Beta(a1 + k1, a2 + k2).
    theta_num = a1 + k1 - 1
    theta_denom = (a1 + k1 - 1) + (a2 + k2 - 1)
    theta = float(theta_num) / theta_denom
    squared_error = (true_theta - theta) ** 2
    return theta, squared_error


learn(1, 1, 100, true_theta)

(0.94, 0.0010622770919067159)

### (h)

Run an experiment to see which initial beta distribution produces the best results across multiple corpora: Write a function `evaluate_learners` that runs your simulation `learn` 1,000 times for each of five corpus sizes ($n$=1, 2, 3, 4, and 5) and for each of the three learners (you, Jack, and Jill). For each corpus size and each learner, compute the average value of $\hat{\theta}$ and the average squared error across the 1,000 trials, and print a summary. (To clarify: your script should run `learn` a total of 15,000 times.)

In your print statements, make sure to round any numbers to four decimal places, so that they are easily interpretable.

In [18]:
def evaluate_learners(word1, word2):
    """
    Run trials for You, Jack, and Jill for different numbers of samples (1-5). Print a report of the
    average theta and average MSE for each set of trials.

    Based on the directions in the handout, assume that:
      - You have an initial distribution of Beta(1, 1)
      - Jack has an initial distribution of Beta(10, 10)
      - Jill has an initial distribution of Beta(100, 100)

    """
    homophone_list = [word1, word2]
    local_true_theta = compute_theta(corpus_counts, word1, homophone_list)
    # Prior Beta parameters (a1, a2) for each learner; "Josh" is "you" in the handout.
    priors = [("Josh", 1, 1), ("Jack", 10, 10), ("Jill", 100, 100)]
    num_trials = 1000
    for learner, a1, a2 in priors:
        print("For " + learner + ":")
        for size in range(1, 6):
            theta_total = 0.0
            mse_total = 0.0
            for run in range(num_trials):
                theta, squared_error = learn(a1, a2, size, local_true_theta)
                theta_total += theta
                mse_total += squared_error
            theta_average = theta_total / num_trials
            mse_average = mse_total / num_trials
            print("with size " + str(size) +
                  ":" + "\nAverage theta is " + str(round(theta_average, 4)) +
                  "\nAverage MSE is " + str(round(mse_average, 4)))

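The 1,000 trials per setting could also be simulated without an inner Python loop. The sketch below is an alternative under stated assumptions, not the required solution: it relies on the fact that the number of ones in a random corpus of length $n$ follows a Binomial($n$, $\theta$) distribution, and the name `simulate_learner`, its `trials` and `seed` parameters, and the example value 0.9 are all hypothetical.

```python
import numpy as np


def simulate_learner(a1, a2, n, theta, trials=1000, seed=0):
    """Return (mean MAP estimate, mean squared error) over `trials` corpora.

    Hypothetical vectorised alternative: k1 ~ Binomial(n, theta) is the number
    of ones in a random corpus of length n, so each trial is a single draw.
    """
    rng = np.random.default_rng(seed)
    k1 = rng.binomial(n, theta, size=trials)
    # With k2 = n - k1, the MAP denominator simplifies to a1 + a2 + n - 2.
    theta_hat = (a1 + k1 - 1) / (a1 + a2 + n - 2)
    return theta_hat.mean(), ((theta - theta_hat) ** 2).mean()


mean_theta, mean_sq_err = simulate_learner(10, 10, 5, 0.9)
print(round(float(mean_theta), 4), round(float(mean_sq_err), 4))
```
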
### (i)

Run `evaluate_learners` on *for* and *four*. For each of Me, Jack, and Jill, explain the impact of the number of samples on the mean theta and the mean squared error.

In [19]:
evaluate_learners("for", "four")

For Josh:
with size 1:
Average theta is 0.921
Average MSE is 0.0729
with size 2:
Average theta is 0.898
Average MSE is 0.0452
with size 3:
Average theta is 0.909
Average MSE is 0.0281
with size 4:
Average theta is 0.9135
Average MSE is 0.0196
with size 5:
Average theta is 0.9034
Average MSE is 0.018
For Jack:
with size 1:
Average theta is 0.5211
Average MSE is 0.1495
with size 2:
Average theta is 0.5411
Average MSE is 0.1346
with size 3:
Average theta is 0.5577
Average MSE is 0.1229
with size 4:
Average theta is 0.5745
Average MSE is 0.1115
with size 5:
Average theta is 0.5889
Average MSE is 0.1022
For Jill:
with size 1:
Average theta is 0.5021
Average MSE is 0.1642
with size 2:
Average theta is 0.5041
Average MSE is 0.1626
with size 3:
Average theta is 0.5061
Average MSE is 0.1611
with size 4:
Average theta is 0.508
Average MSE is 0.1596
with size 5:
Average theta is 0.51
Average MSE is 0.158


For "Josh", the mean theta is already close to true_theta (about 0.91) at every sample size. Because he starts with no prior exposure, his $Beta(1, 1)$ prior makes the MAP estimate equal to the observed proportion $k_1 / n$, which is right on average but noisy for small samples. The mean squared error therefore starts fairly high (0.0729 at $n = 1$) and decreases the most rapidly of the three learners (to 0.018 at $n = 5$).

For "Jack", as the sample size increases, the mean theta approaches true_theta more slowly than Josh's (from 0.5211 at $n = 1$ to 0.5889 at $n = 5$). This is because Jack's $Beta(10, 10)$ prior acts like some earlier exposure to these two homophones, so their MAP estimate will change, but not as quickly as if they had no exposure. Their mean squared error therefore starts higher than Josh's (0.1495) and falls only modestly (to 0.1022), because five samples are not enough to outweigh the prior.

For "Jill", as the sample size increases, the mean theta approaches true_theta so slowly that the impact is almost non-existent (0.5021 at $n = 1$ to 0.51 at $n = 5$). Jill's $Beta(100, 100)$ prior acts like a lot of earlier exposure to these two homophones, so their MAP estimate would change noticeably only if the sample size grew well beyond 5. The sample size therefore has a very small impact on the mean squared error (0.1642 to 0.158), and it would take a much larger sample consistent with true_theta to reduce the error dramatically.
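
The mean theta trends can be cross-checked without simulation. Since the number of *for* sentences $k_1$ has expected value $n\theta$, the expected MAP estimate is $\frac{a_1 - 1 + n\theta}{a_1 + a_2 - 2 + n}$. The sketch below evaluates that expression; the function name `expected_map` is an assumption introduced here, and the true $\theta$ is the value used in part e.

```python
def expected_map(a1, a2, n, theta):
    """Expected MAP estimate when k1 ~ Binomial(n, theta), so E[k1] = n * theta.

    Hypothetical helper; requires a1 + a2 + n > 2 so the denominator is nonzero.
    """
    return (a1 - 1 + n * theta) / (a1 + a2 - 2 + n)


true_theta = 0.9074074074074074  # value used in part e
priors = {"Me": (1, 1), "Jack": (10, 10), "Jill": (100, 100)}
for name, (a1, a2) in priors.items():
    # Expected mean theta for corpus sizes 1 through 5.
    means = [round(expected_map(a1, a2, n, true_theta), 4) for n in range(1, 6)]
    print(name, means)
```

For the $Beta(1, 1)$ prior the expression reduces to $\theta$ for every $n$, while for the other two priors it moves from 0.5 toward $\theta$ at a rate set by the size of $a_1 + a_2$.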

### (j)

Run `evaluate_learners` on the homophones *too* and *two*, which are both phonetically [tu]. Which learner does best (you, Jack, or Jill)? Is this different from the results for the homophone pair (*for*, *four*)? Explain why or why not.


In [20]:
evaluate_learners("too", "two")

For Josh:
with size 1:
Average theta is 0.407
Average MSE is 0.2414
with size 2:
Average theta is 0.401
Average MSE is 0.1207
with size 3:
Average theta is 0.4157
Average MSE is 0.0772
with size 4:
Average theta is 0.4125
Average MSE is 0.058
with size 5:
Average theta is 0.4126
Average MSE is 0.047
For Jack:
with size 1:
Average theta is 0.4938
Average MSE is 0.0087
with size 2:
Average theta is 0.4901
Average MSE is 0.0086
with size 3:
Average theta is 0.4861
Average MSE is 0.0084
with size 4:
Average theta is 0.4825
Average MSE is 0.0081
with size 5:
Average theta is 0.4803
Average MSE is 0.0082
For Jill:
with size 1:
Average theta is 0.4995
Average MSE is 0.0091
with size 2:
Average theta is 0.499
Average MSE is 0.009
with size 3:
Average theta is 0.4983
Average MSE is 0.0089
with size 4:
Average theta is 0.4982
Average MSE is 0.0089
with size 5:
Average theta is 0.4979
Average MSE is 0.0088


The learner that does best is arguably Jack. Jack's mean squared error is smaller than Josh's and Jill's at every sample size (0.0087 to 0.0082, against 0.0091 to 0.0088 for Jill and 0.2414 to 0.047 for Josh).

This differs from the result for (*for*, *four*), where Josh did best. The reason is that two criteria affect each learner's mean squared error:

1. closeness of true_theta to the estimate favored by the learner's prior (0.5 for Jack and Jill)
2. elasticity of learning (how quickly the MAP estimate responds to new data)

For Josh, elasticity of learning is high, but the MAP estimate varies widely from one small sample to the next, so its distance from true_theta changes dramatically and inconsistently. His mean squared error is therefore the largest of all the learners.

For Jill, the MAP estimate is steady and consistent, but the inelasticity of learning caused by large prior exposure keeps Jill's MAP estimate from approaching true_theta.

For Jack, the MAP estimates are relatively consistent without sacrificing elasticity of learning. This combination allows their MAP estimate to approach true_theta faster than Jill's, and more consistently than Josh's.

These results are so dramatically different from those for [fɔɹ] because the true_theta for [fɔɹ] (about 0.91) is much further from the 0.5 favored by Jack's and Jill's priors than the true_theta for [tu] is (Josh's average theta of about 0.41 suggests it is close to 0.4). A distant true_theta rewards elastic learning over consistency.
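
The trade-off described here can be made concrete by splitting the mean squared error into squared bias plus variance. The sketch below assumes that the number of ones in a corpus is Binomial($n$, $\theta$); the function name `analytic_mse` is hypothetical, and the value 0.4 for [tu] is an assumption read off the simulation output rather than a value computed from the corpus.

```python
def analytic_mse(a1, a2, n, theta):
    """Return (squared bias, variance, MSE) of the MAP estimate.

    Assumes k1 ~ Binomial(n, theta). The estimate (a1 + k1 - 1) / d, with
    d = a1 + a2 + n - 2, then has mean (a1 - 1 + n * theta) / d and
    variance n * theta * (1 - theta) / d**2.
    """
    d = a1 + a2 + n - 2
    bias = (a1 - 1 + n * theta) / d - theta
    variance = n * theta * (1 - theta) / d ** 2
    return bias ** 2, variance, bias ** 2 + variance


priors = {"Josh": (1, 1), "Jack": (10, 10), "Jill": (100, 100)}
winners = {}
# 0.9074... is the [fɔɹ] value from part e; 0.4 for [tu] is an assumed value.
for label, theta in [("for/four", 0.9074074074074074), ("too/two", 0.4)]:
    mse = {name: analytic_mse(a1, a2, 5, theta)[2] for name, (a1, a2) in priors.items()}
    winners[label] = min(mse, key=mse.get)
    print(label, winners[label], {k: round(v, 4) for k, v in mse.items()})
```

Under these assumptions Josh's error is all variance, Jill's is almost all bias, and Jack sits in between, which is why the winner depends on how far true_theta is from 0.5.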

### (k)

The human language data we give our model is very limited (just counts). What additional information do you think children use to learn the difference between homophones like *for* and *four*?

One additional aid may be explicit instruction that differentiates the homophones.

Rather than learning passively, where a child merely counts a word's occurrences in a specific context, a child who is explicitly given the definitions can map features of the context to a particular homophone. For example, teaching a child that "four" is a number used in contexts where numbers are involved, and that "for" is a preposition used in contexts where a relation between nouns is involved, will allow further distinction between the two words' meanings.



### Citations

$^1$ Brent, M.R. and T.A. Cartwright. 1996. Distributional regularity and phonotactic constraints are useful for segmentation. Cognition 61: 93-125.

$^2$ MacWhinney, B. and C. Snow. 1985. The child language data exchange system. Journal of Child Language 12: 271-296.