# CS 555: Homework 3
### Eric Stevens
### November 6, 2018

In [79]:
# If you want to use ngrams.py, you should use Python 2.
# Otherwise, you need to modify ngrams.py yourself in order to use it in Python 3.
from __future__ import print_function
from __future__ import unicode_literals
from string import punctuation
import re
import numpy as np
from ngrams import ngrams
from collections import defaultdict
from bitweight import BitWeight, BitWeightRangeError

In [80]:

small_corpus = ['Why dont we start here',
                  'Why dont we end there',
                  'Let us start with a few other examples',
                  'We never start with an example with so few tokens',
                  'Tokens can be words that we start with in example docs']


In [81]:
# TOKENIZE - converts lists of sentences, like in the small_corpus
# to list of list of tokens. All of the other functions will require
# their parameters in the form of the output of the tokenize function.
def tokenize(corpus):
    tokens = [sentence.split(' ') for sentence in corpus]
    return tokens
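One detail of `tokenize` is worth checking: `split(' ')` splits on every single space, so a run of two spaces produces an empty-string token. The following is a minimal sketch of the difference, using a made-up sentence rather than a line from the corpus.

```python
# Made-up sentence containing a double space.
sentence = "first quarter  causing some"

# split(' ') treats every space as a separator, so the double space yields ''.
tokens_single = sentence.split(' ')

# split() with no argument collapses runs of whitespace.
tokens_any = sentence.split()

print(tokens_single)
print(tokens_any)
```

If the filtered text contains double spaces, this behavior may explain empty-string tokens showing up in the resulting models.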



# HW3: Language Modeling
For this part of the assignment, you will implement two simple count-based n-gram language models: one based on maximum-likelihood estimation, and another based on Witten-Bell smoothing. The data you will be using is a subset of the Penn Treebank's tagged Wall Street Journal articles on which we have done some initial processing. There are two versions of the data for this assignment:

##### wsj.pos.gz
##### wsj-normalized.pos.gz
The difference is that, in the second (normalized) version of the data, we have collapsed some entries from certain tag categories (e.g. CDs, NNPs, etc.) into type-tokens to help reduce sparsity. Take a look at the data and see for yourself. Consider: what would be the benefits and drawbacks to this method of sparsity reduction? Note that, for this part of the assignment, the tags are un-necessary, so you'll want to work with the un-normalized version of the corpus.

### Task 1: produce a tag-free corpus

For this task, you have two jobs. 
* First, you need to write a function to filter out all tags. 
* Second, make sure your code works for both wsj.pos.gz and wsj-normalized.pos.gz

#### What to turn in
* your code
* some samples to show me that your code works as it should be

### POS Filter for 'wsj.pos' and 'wsj-normalized.pos'

In [82]:
# FILE_TO_LIST - turns the wsj files into the form of the 'small_corpus'
# to prepare it to be the input parameter of the 'tokenize()' function.
def file_to_list(filename):
    with open(filename, 'r') as content_file:
        content = content_file.read()
        no_tags = re.sub('(<[A-Z$]{2,4}>)|(/[A-Z$]{2,4})|(\.\s+[a-z])|(/[,$])|[\\,/\'`]', '', content)
        return re.split('\.\.|[\n]',no_tags)
    
# NOTE: This function assumes that 'wsj.pos' and 'wsj-normalized.pos' have been unzipped.
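To illustrate the idea of tag removal in isolation, here is a minimal sketch on a single line. It assumes a `word/TAG` layout, which may not match the real files exactly; the sample line, the simplified pattern, and the helper name `strip_tags` are all hypothetical and are not the regular expression used in `file_to_list`.

```python
import re

# Hypothetical sample in an assumed word/TAG layout; the real files may differ.
sample = "Revenue/NN rose/VBD 6.4/CD %/NN ./."

# Simplified pattern: a slash plus a tag that ends at whitespace or end of line.
TAG = re.compile(r"/[^\s/]+(?=\s|$)")

def strip_tags(line):
    """Remove the trailing /TAG from every whitespace-separated token."""
    return TAG.sub("", line)

print(strip_tags(sample))
```

Unlike the pattern in `file_to_list`, this sketch leaves punctuation tokens in place, so whether to keep them becomes a separate decision.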

### Demonstrate Filtering On 'wsj.pos'

In [83]:
# First 10 values of 'wsj.pos'
wsj_filtered = file_to_list('wsj.pos')
wsj_filtered[:10]

[u'Digital Equipment Corp. reported a 32 % decline in net income on a modest revenue gain in its fiscal first quarter  causing some analysts to predict weaker results ahead than they had expected ',
 u'',
 u'Although the second-largest computer maker had prepared Wall Street for a poor quarter  analysts said they were troubled by signs of flat U.S. orders and a slowdown in the rate of gain in foreign orders ',
 u' The Maynard  Mass.  company is in a transition in which it is trying to reduce its reliance on mid-range machines and establish a presence in workstations and mainframes ',
 u'',
 u'Net for the quarter ended Sept. 30 fell to $ 150.8 million  or $ 1.20 a share  from $ 223 million  or $ 1.71 a share  a year ago ',
 u' Revenue rose 6.4 % to $ 3.13 billion from $ 2.94 billion ',
 u'',
 u'Digital said a shift in its product mix toward low-end products and strong growth in workstation sales yielded lower gross margins ',
 u' A spokesman also said margins for the company s service b

### Demonstrate Filtering On 'wsj-normalized.pos'

In [84]:
# First 10 values of 'wsj-normalized.pos'
wsj_normalized_filtered = file_to_list('wsj-normalized.pos')
wsj_normalized_filtered[:10]

[u'   reported a  % decline in net income on a modest revenue gain in its fiscal first quarter  causing some analysts to predict weaker results ahead than they had expected ',
 u'',
 u'Although the second-largest computer maker had prepared   for a poor quarter  analysts said they were troubled by signs of flat  orders and a slowdown in the rate of gain in foreign orders ',
 u' The     company is in a transition in which it is trying to reduce its reliance on mid-range machines and establish a presence in workstations and mainframes ',
 u'',
 u'Net for the quarter ended   fell to $    or $  a share  from $    or $  a share  a year ago ',
 u' Revenue rose  % to $   from $   ',
 u'',
 u'Digital said a shift in its product mix toward low-end products and strong growth in workstation sales yielded lower gross margins ',
 u' A spokesman also said margins for the company s service business narrowed somewhat because of heavy investments made in that sector ']

<font color="red">Self assessment:</font>

This section could have been done better. I was working under the assumption that, with 10 MB of data, paying attention to things like the upper/lower case of the letters would not be important. I also did not remove all punctuation. I tried many ways of splitting the data. The assignment description does not explicitly state how any of these things should be done, only that the POS tags should be removed. So, although I think I could have done things better, I do believe that I followed the assignment instructions and should not be marked down.

### Maximum Likelihood
Now, start by producing code to compute maximum-likelihood estimate probabilities. Your code should be configurable with respect to the n-gram order- i.e., you should be able to set it to compute bigram, trigram, 4-gram, etc. probabilities. Refer to J&M and the lecture slides for definitions as needed. If you would like to write your own n-gram tokenization code, feel free to do so, but you may also use the ngrams.py utility class which contains a routine to take a list of tokens and produce a stream of n-grams with appropriate padding for the start and end of sentences.
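As a rough picture of what an n-gram stream with padding looks like, the sketch below yields `(prefix, word)` pairs for one sentence. It is not the `ngrams.py` implementation: the helper name `padded_ngrams` and the pad symbols `<S>` and `</S>` are assumptions made for illustration.

```python
def padded_ngrams(tokens, order):
    """Yield (prefix_tuple, word) pairs for one sentence.

    Hypothetical helper: pads with order-1 '<S>' symbols and one '</S>'.
    """
    padded = ['<S>'] * (order - 1) + list(tokens) + ['</S>']
    for i in range(order - 1, len(padded)):
        # The prefix is the order-1 tokens immediately before position i.
        yield tuple(padded[i - order + 1:i]), padded[i]

pairs = list(padded_ngrams(['Why', 'dont', 'we'], 3))
for prefix, word in pairs:
    print(prefix, word)
```

With `order=1` the prefix is the empty tuple `()`, which is a convenient key for unigram counts.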

#### Tip: 
* Start with a very small "toy" corpus of just a couple of sentences for debugging. 

* As discussed in class, I strongly recommend using nested defaultdicts as the foundational data structure for your language model, where the "outer" key is the prefix, and the value retrieved by that prefix is a second defaultdict containing possible suffixes for that prefix, each of which is an "inner" key. E.g., p("TRUTHS" | "HOLD THESE") would be retrieved by first looking up "HOLD THESE" and then from the resulting dictionary, looking up "TRUTHS": prob = trigrams[("HOLD","THESE")]["TRUTHS"] . Note that this arrangement makes it very easy to e.g. find out the number of times a given history occurs, the total probability mass assigned to all of a history's continuations, etc., all of which will be extremely helpful in the next part of the assignment.

* Use tuples to represent prefixes. E.g., instead of the string "HOLD THESE", use the tuple ("HOLD", "THESE"). Note that, in Python, lists are mutable, and therefore may not be used as keys in dictionaries- but tuples are immutable, and so make excellent keys.

* Don't forget about numerical underflow issues! You'll want to represent probabilities as negative base-2 log probabilities, and modify your arithmetic accordingly. I recommend experimenting with [the bitweight Python library](https://github.com/stevenbedrick/bitweight) (see its unit tests for example usage).
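The tips above can be sketched together in a few lines: nested defaultdicts keyed by prefix tuples, with probabilities stored as negative base-2 logs. This sketch uses plain floats and `math.log` rather than the bitweight library, and the observations are made up for illustration.

```python
import math
from collections import defaultdict

# Toy trigram observations as (prefix_tuple, word) pairs; made up for illustration.
observations = [
    (("HOLD", "THESE"), "TRUTHS"),
    (("HOLD", "THESE"), "TRUTHS"),
    (("HOLD", "THESE"), "BOOKS"),
    (("THESE", "TRUTHS"), "TO"),
]

# Outer key: prefix tuple. Inner key: continuation word. Value: count.
counts = defaultdict(lambda: defaultdict(int))
for prefix, word in observations:
    counts[prefix][word] += 1

def neg_log2_prob(prefix, word):
    """Negative base-2 log of the MLE probability (a cost in bits)."""
    history_total = sum(counts[prefix].values())
    return -math.log(float(counts[prefix][word]) / history_total, 2)

# Multiplying probabilities becomes adding costs in log space.
cost = (neg_log2_prob(("HOLD", "THESE"), "TRUTHS")
        + neg_log2_prob(("THESE", "TRUTHS"), "TO"))
print(cost)
```

Note that `neg_log2_prob` raises an error for an unseen continuation, because the log of 0 is undefined; handling that case is what smoothing is for.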

#### What to turn in:
* your code 
* use your code to create a simple language model for small_corpus named small_lm and show me that your output is correct (this is a small corpus, so you could manually calculate the probability).
* use your code to create language model for wsj.pos.gz named as wsj_lm

### Simple Counting, Maximum Likelihood and Utility Functions

In [85]:
# TOKENIZE - converts lists of sentences, like in the small_corpus
# to list of list of tokens. All of the other functions will require
# their parameters in the form of the output of the tokenize function.
def tokenize(corpus):
    tokens = [sentence.split(' ') for sentence in corpus]
    return tokens


# COUNT_BUILDER - generates count models where 'corpus' is in 
# the form output by 'tokenize()'. Order is the 'n' in n-gram.
def count_builder(corpus, order):
    
    #ngram
    ng = ngrams(corpus, order)

    # describe model datatype
    model = defaultdict(lambda: defaultdict(lambda: 0))
    
    # loop to build embedded defaultdict    
    for gram in ng:
        # the inner defaultdict supplies 0 for unseen suffixes
        model[gram[0]][gram[1]] += 1

    # Count Model
    return model


# MAX_LIKELIHOOD - converts a count model into its MLE form.
# 'count_model' is in the form output by 'count_builder()'
def max_likelihood(count_model):
    
    # Container for MLE model with BitWeight probabilities
    # Returns 0 for unseen values.
    model = defaultdict(lambda: defaultdict(lambda: BitWeight(0)))
    
    # for prefixes in model...
    for prefix, suffix_dict in count_model.iteritems():
        w_minus = BitWeight(0) # used to count total tokens
        
        # for words with hist prefix ...
        for suffix, count in suffix_dict.iteritems():
            w_minus += BitWeight(count) # add to total number of tokens
        
        # again, for words with hist prefix ...
        for suffix, count in suffix_dict.iteritems():
            model[prefix][suffix] = BitWeight.__itruediv__(BitWeight(count),w_minus) # set output probabilities
    
    # MLE Probabilities
    return model


# MODEL_PRINTER: Utility to print models with 'BitWeight' values
def model_printer(model):
    for prefix, suffix_dict in model.iteritems():
        for suffix, value in suffix_dict.iteritems():
            print(prefix, " : ", suffix, " : ", value.real())


### Demonstrate MLE Model Build on 'small_corpus'

In [86]:
# Create a bigram MLE language model for 'small_corpus' and show result
small_tokens = tokenize(small_corpus)
small_count = count_builder(small_tokens, 2)
small_lm = max_likelihood(small_count)
model_printer(small_lm)
print('\n\n')


(u'examples',)  :  </S_0>  :  1.0
(u'few',)  :  tokens  :  0.5
(u'few',)  :  other  :  0.5
(u'in',)  :  example  :  1.0
(u'We',)  :  never  :  1.0
(u'Why',)  :  dont  :  1.0
(u'end',)  :  there  :  1.0
(u'start',)  :  with  :  0.75
(u'start',)  :  here  :  0.25
(u'other',)  :  examples  :  1.0
(u'here',)  :  </S_0>  :  1.0
(u'words',)  :  that  :  1.0
(u'an',)  :  example  :  1.0
(u'we',)  :  start  :  0.666666666667
(u'we',)  :  end  :  0.333333333333
(u'dont',)  :  we  :  1.0
(u'there',)  :  </S_0>  :  1.0
('<S_0>',)  :  Tokens  :  0.2
('<S_0>',)  :  We  :  0.2
('<S_0>',)  :  Let  :  0.2
('<S_0>',)  :  Why  :  0.4
(u'so',)  :  few  :  1.0
(u'us',)  :  start  :  1.0
(u'a',)  :  few  :  1.0
(u'example',)  :  docs  :  0.5
(u'example',)  :  with  :  0.5
(u'docs',)  :  </S_0>  :  1.0
(u'Tokens',)  :  can  :  1.0
(u'never',)  :  start  :  1.0
(u'Let',)  :  us  :  1.0
(u'can',)  :  be  :  1.0
(u'be',)  :  words  :  1.0
(u'with',)  :  a  :  0.25
(u'with',)  :  an  :  0.25
(u'with',)  :  so  
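A couple of these values can be checked by hand. The sketch below recounts adjacent word pairs with plain Python (no `ngrams.py`, no BitWeight, and no sentence padding) and recomputes two of the probabilities; the helper name `mle` is hypothetical.

```python
from collections import Counter

small_corpus = ['Why dont we start here',
                'Why dont we end there',
                'Let us start with a few other examples',
                'We never start with an example with so few tokens',
                'Tokens can be words that we start with in example docs']

# Count adjacent word pairs and how often each word occurs as a history.
pair_counts = Counter()
history_counts = Counter()
for sentence in small_corpus:
    words = sentence.split(' ')
    for prev, word in zip(words, words[1:]):
        pair_counts[(prev, word)] += 1
        history_counts[prev] += 1

def mle(prev, word):
    """Maximum-likelihood estimate of P(word | prev)."""
    return float(pair_counts[(prev, word)]) / history_counts[prev]

print(mle('start', 'with'))  # 'start' is followed by 'with' in 3 of its 4 occurrences
print(mle('we', 'start'))    # lowercase 'we' is followed by 'start' in 2 of its 3 occurrences
```

These agree with the 0.75 and 0.666666666667 entries in the printed model.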

### Demonstrate MLE Model Build on 'wsj.pos'

In [87]:
# Create trigram MLE language model for 'wsj.pos' and show subset
wsj_tokens = tokenize(wsj_filtered)
wsj_count = count_builder(wsj_tokens, 3)
wsj_lm = max_likelihood(wsj_count)

# grab 10 keys
top_keys = wsj_lm.keys()[:10]

# subset of language model
sub_wsj = defaultdict(lambda: defaultdict(lambda: BitWeight(0)))
for key in top_keys: sub_wsj[key] = wsj_lm[key]
                      
# print subsample of wsj_model
model_printer(sub_wsj)


(u'have', u'made')  :    :  0.0434782608696
(u'have', u'made')  :  some  :  0.0434782608696
(u'have', u'made')  :  it  :  0.130434782609
(u'have', u'made')  :  trading  :  0.0217391304348
(u'have', u'made')  :  use  :  0.0434782608696
(u'have', u'made')  :  for  :  0.0217391304348
(u'have', u'made')  :  no  :  0.0652173913043
(u'have', u'made')  :  leveraged  :  0.0217391304348
(u'have', u'made')  :  their  :  0.0434782608696
(u'have', u'made')  :  much  :  0.0217391304348
(u'have', u'made')  :  China  :  0.0217391304348
(u'have', u'made')  :  health  :  0.0217391304348
(u'have', u'made')  :  available  :  0.0217391304348
(u'have', u'made')  :  them  :  0.0217391304348
(u'have', u'made')  :  his  :  0.0217391304348
(u'have', u'made')  :  big  :  0.0217391304348
(u'have', u'made')  :  metrics  :  0.0217391304348
(u'have', u'made')  :  nearly  :  0.0217391304348
(u'have', u'made')  :  excellent  :  0.0217391304348
(u'have', u'made')  :  such  :  0.0217391304348
(u'have', u'made')  :  him

<font color="red">Self assessment:</font>

### Smoothing

Once you’ve got an unsmoothed model working, move on to implementing Witten-Bell smoothing. Refer to the slides and J&M for details on how that ought to work.
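For reference, the interpolated form of Witten-Bell smoothing is usually written as the recursion below. The notation is an assumption rather than the slides' own: $h'$ is the history $h$ with its earliest word dropped, $c(h)$ is the number of times $h$ was observed, and $N_{1+}(h\,\bullet)$ is the number of distinct word types seen after $h$.

```latex
\begin{align}
P_{wb}(w \mid h) &= \lambda_h \, P_{mle}(w \mid h) + (1 - \lambda_h) \, P_{wb}(w \mid h') \\
\lambda_h &= \frac{c(h)}{c(h) + N_{1+}(h\,\bullet)}
\end{align}
```

The recursion bottoms out at the unigram level, where the lower-order distribution is commonly taken to be uniform over the vocabulary.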

#### Tip: 
* You can modify an already-populated defaultdict to change its default value (for example, to store a default backoff value for a particular history) by changing the object’s default_factory attribute. Consult the documentation for examples of how this works.
* As defined, W-B smoothing is highly recursive; you may find it more efficient to re-frame the algorithm in iterative terms.
* As in the previous section, start small.
* [This may offer you some help on how to implement Witten-Bell smoothing](http://www.ee.columbia.edu/~stanchen/e6884/labs/lab3/x207.html)
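The `default_factory` tip above can be tried out on a tiny dictionary. This is a minimal sketch; the probabilities and the `backoff_value` are made-up numbers, not values from any model in this notebook.

```python
from collections import defaultdict

# Inner dictionary for one history; unseen words default to 0.0 at first.
continuations = defaultdict(float)
continuations['with'] = 0.75
continuations['here'] = 0.25

# Hypothetical backoff value for this history; in practice it would come
# from the smoothing step.
backoff_value = 0.01

# Reassigning default_factory changes what future missing keys return.
continuations.default_factory = lambda: backoff_value

unseen = continuations['there']
print(unseen)
```

One thing to keep in mind is that indexing a missing key also inserts it, so the dictionary grows on lookup; `continuations.get('there', backoff_value)` avoids that if it matters.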


#### What to turn in:
* your code 
* use your code to create a simple smoothed language model based on small_lm and show me that your output is correct (this is a small corpus, so you could manually calculate the probability).
* use your code to create a smoothed language model based on wsj_lm

### Witten-Bell Ngram Model Builder Functions

In [118]:
# build list of counts from order of grams
def count_list_builder(corpus, order):
    
    # holds count models of each order
    count_list = []
    
    # tokenize the corpus
    tokens = tokenize(corpus)
    
    # for each order, add count model to count list
    for n in range(order):
        count_list.append(count_builder(tokens, n+1))
        
    # return the count list for use in the calc wb function
    return count_list
    
# take input list of counts and calculate wb
def calculate_wb(prefix, suffix, count_list):
    
    # unigram calculations
    ch = BitWeight(sum(count_list[0][()].values()))
    N_one = BitWeight(len(count_list[0][()].keys()))
    lam = BitWeight.__truediv__(ch,(ch+N_one))
    one_min_lam = BitWeight.__truediv__(N_one,(ch+N_one))
    
    # unigram maximum likelihood
    Pmle = BitWeight.__truediv__(BitWeight(count_list[0][()][suffix]), ch)
    
    # unigram witten bell probability
    pb = (lam * Pmle) + BitWeight.__truediv__(BitWeight(1),(ch+N_one))

    
    # if order is greater than 1 get values from other 
    for x in range(1,len(prefix)):
        
        ch = BitWeight(sum(count_list[x][prefix[-x:]].values()))
        N_one = BitWeight(len(count_list[x][prefix[-x:]].keys()))

        lam = BitWeight.__truediv__(ch,(ch+N_one))
        one_min_lam = BitWeight.__truediv__(N_one,(ch+N_one))
        
        Pmle = BitWeight.__truediv__(BitWeight(count_list[x][prefix[-x:]][suffix]), ch)
        
        pb += (lam*Pmle) + (one_min_lam*pb)
    
    return pb

def wb_model_builder(corpus, order):
    wb_model = defaultdict(lambda: defaultdict(lambda: BitWeight(0)))
    counts = count_list_builder(corpus, order)
    for prefix, suffix_dict in counts[len(counts)-1].iteritems():
        for suffix, value in suffix_dict.iteritems():
            wb_model[prefix][suffix] = calculate_wb(prefix,suffix,counts)
    return wb_model

In [114]:
calculate_wb((),"limited",x)

<bitweight.BitWeight at 0x125186720>

### Demonstrate WB Model Build on 'small_corpus'

In [119]:
# Create a trigram Witten-Bell language model for 'small_corpus' and show result
small_wb_lm = wb_model_builder(small_corpus, 3)
model_printer(small_wb_lm)

(u'us', u'start')  :  with  :  0.60101010101
(u'few', u'other')  :  examples  :  0.545454545455
(u'start', u'with')  :  a  :  0.170454545455
(u'start', u'with')  :  an  :  0.170454545455
(u'start', u'with')  :  in  :  0.170454545455
('<S_1>', u'We')  :  never  :  0.544117647059
('<S_0>', '<S_1>')  :  Tokens  :  0.0597014925373
('<S_0>', '<S_1>')  :  We  :  0.0597014925373
('<S_0>', '<S_1>')  :  Let  :  0.0597014925373
('<S_0>', '<S_1>')  :  Why  :  0.089552238806
(u'start', u'here')  :  </S_1>  :  0.0223880597015
(u'Let', u'us')  :  start  :  0.610294117647
(u'that', u'we')  :  start  :  0.502941176471
(u'example', u'with')  :  so  :  0.169117647059
(u'example', u'docs')  :  </S_1>  :  0.0220588235294
(u'in', u'example')  :  docs  :  0.294117647059
(u'we', u'end')  :  there  :  0.544117647059
(u'a', u'few')  :  other  :  0.294117647059
(u'with', u'an')  :  example  :  0.566176470588
(u'with', u'so')  :  few  :  0.566176470588
(u'words', u'that')  :  we  :  0.588235294118
(u'end', u'the

### Demonstrate WB Model Build on 'wsj.pos'

In [123]:
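# Create a bigram Witten-Bell language model for the first 3000 lines of 'wsj.pos' and show subset
# ('tl' is the output of file_to_list('wsj.pos'), assigned in the cross-validation cells below)
wsj_wb_lm = wb_model_builder(tl[:3000], 2)

# grab 50 keys
top_keys = wsj_wb_lm.keys()[:50]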

# subset of language model
sub_wsj_wb = defaultdict(lambda: defaultdict(lambda: BitWeight(0)))
for key in top_keys: sub_wsj_wb[key] = wsj_wb_lm[key]
                      
# print subsample of wsj_wb_model
model_printer(sub_wsj_wb)

(u'10.1',)  :  %  :  0.00391575299097
(u'resolve',)  :  problems  :  0.000316592795014
(u'frequent',)  :  flier  :  3.33255573699e-05
(u'frequent',)  :  junkets  :  3.33255573699e-05
(u'Unless',)  :  the  :  0.0363075897692
(u'Signal',)  :    :  0.12240477222
(u'Signal',)  :  stock-quote  :  3.33255573699e-05
(u'grueling',)  :  period  :  0.000316592795014
(u'two-income',)  :  couple  :  0.00026660445896
(u'two-income',)  :  family  :  0.00019995334422
(u'Scania',)  :  truck  :  0.000149965008165
(u'fraudulent',)  :  telemarketing  :  6.66511147399e-05
(u'transportation',)  :    :  0.12240477222
(u'transportation',)  :  logistics  :  4.99883360549e-05
(u'transportation',)  :  1230.80  :  3.33255573699e-05
(u'transportation',)  :  deregulation  :  8.33138934249e-05
(u'transportation',)  :  system  :  0.000349918352384
(u'transportation',)  :  etc  :  3.33255573699e-05
(u'transportation',)  :  rates  :  0.000899790048989
(u'transportation',)  :  at  :  0.00401572966308
(u'40-a-share',)  

<font color="red">Self assessment:</font>



This section contained many mistakes, and I have still not fully resolved them. My solution didn't take into account the fact that the `ngrams()` function returns its padding in a less than ideal form. When using the function, the return is in the form `<S_0><S_1><S_2>...</S_2></S_1></S_0>`. This results in every single sentence having multiple unseen grams. If we reverse the ordering of the padding to `<S_2><S_1><S_0>...</S_0></S_1></S_2>`, then we drastically reduce this problem and only have a single unseen word per sentence, `(</S_1>)|<S_2>` in this case. This can be done by modifying the ngram code:

In [3]:
    left_pad = ['<S_{}>'.format((order-1)-i) for i in xrange(order - 1)] 
    right_pad = ['</S_{}>'.format((order-1)-i) for i in xrange(order - 2, -1, -1)]
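To see what these two comprehensions produce, the sketch below evaluates them on their own for a trigram model. It swaps Python 2's `xrange` for `range` so that it runs under either version; the surrounding `ngrams.py` code is not reproduced here.

```python
order = 3  # trigram

# Same comprehensions as above, with range in place of Python 2's xrange.
left_pad = ['<S_{}>'.format((order - 1) - i) for i in range(order - 1)]
right_pad = ['</S_{}>'.format((order - 1) - i) for i in range(order - 2, -1, -1)]

print(left_pad)   # ['<S_2>', '<S_1>']
print(right_pad)  # ['</S_1>', '</S_2>']
```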

The next major mistake, which caused the perplexity to jump between the unigram and bigram models, was in line 33:

In [None]:
 # if order is greater than 1 get values from other 
    for x in range(1,len(prefix)):

This line caused early termination of the $P_{wb}$ calculation. As a result, what we thought was a trigram calculation was actually a bigram calculation, and what we thought was the 4-gram was actually the trigram. The bigram perplexity was higher than the unigram perplexity because each bigram was essentially being scored as if it were a unigram: the number of calculations was bigger, but they were all being evaluated as unigram probabilities. This bug was resolved by changing the loop bound so that the intended order is reached:

In [None]:
 # if order is greater than 1 get values from other 
    for x in range(1,len(prefix)+1):
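The effect of the loop bound can be seen without any counts at all. The sketch below lists which history slices `prefix[-x:]` the loop visits under each bound; `visited_histories` is a hypothetical helper written only for this illustration.

```python
def visited_histories(prefix, inclusive):
    """Return the history slices the loop visits for a given prefix."""
    stop = len(prefix) + 1 if inclusive else len(prefix)
    return [prefix[-x:] for x in range(1, stop)]

# Original bound: the full two-word history is never reached.
print(visited_histories(('we', 'start'), inclusive=False))  # [('start',)]

# Corrected bound: both the one-word and two-word histories are used.
print(visited_histories(('we', 'start'), inclusive=True))   # [('start',), ('we', 'start')]

# With a one-word prefix the original loop body never runs at all.
print(visited_histories(('start',), inclusive=False))       # []
```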

Next, I went back, read how the algorithm works, and realized that there is supposed to be a cutoff when a history has a count of 0. In my submission I allowed the calculation to continue up through the full order of the smoothing. To make it return the previous $P_{wb}$ as soon as a 0 history count is found, I changed the following lines of code:

In [None]:
    for x in range(1,len(prefix)):

        ch = BitWeight(sum(count_list[x][prefix[-x:]].values()))
        N_one = BitWeight(len(count_list[x][prefix[-x:]].keys()))

        lam = BitWeight.__truediv__(ch,(ch+N_one))
        one_min_lam = BitWeight.__truediv__(N_one,(ch+N_one))
        
        Pmle = BitWeight.__truediv__(BitWeight(count_list[x][prefix[-x:]][suffix]), ch)
        
        pb += (lam*Pmle) + (one_min_lam*pb)

The change is the addition of a return statement on the occurrence of a 0 history count:

In [None]:
    for x in range(1,len(prefix)):
        
        if(len(count_list[x][prefix[-x:]].keys()) == 0):
            return pb
        
        ch = BitWeight(sum(count_list[x][prefix[-x:]].values()))
        N_one = BitWeight(len(count_list[x][prefix[-x:]].keys()))

        lam = BitWeight.__truediv__(ch,(ch+N_one))
        one_min_lam = BitWeight.__truediv__(N_one,(ch+N_one))
        
        Pmle = BitWeight.__truediv__(BitWeight(count_list[x][prefix[-x:]][suffix]), ch)
        
        pb += (lam*Pmle) + (one_min_lam*pb)

After making the above modifications we get results that are more consistent with what would be expected:

<img src="perplex0.png">

From this you can see that the model begins to overfit at the 4-gram. This could be the result of bad filtering in the step above or of other bugs. In fact, while making these fixes I noticed that some of the probabilities computed above came out greater than 1. The line I believed was causing this, and which I found to be mathematically incorrect, was:

In [None]:
  pb += (lam*Pmle) + (one_min_lam*pb)

The `+=` adds the new interpolated value on top of the lower-order estimate, a sum that I think is not needed. By changing this line to the following, I was able to eliminate the probabilities over 1:

In [None]:
  pb = (lam*Pmle) + (one_min_lam*pb)

Unfortunately, while I do believe that this line is more mathematically correct, it resulted in much worse performance on the testing data:

<img src='perplex1.png'>

As we can see, the model now overfits at the bigram. While I believe the previous graph is more plausible, I believe that the function that produced this graph is more mathematically correct. Both graphs may be the result of a bug somewhere in the logic for the $P_{wb}$ calculation. I have unfortunately run out of time to work through these bugs, though I am very curious where they are.
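A small numeric check supports the reasoning about `+=` versus `=`. The sketch below uses plain floats and a made-up toy corpus instead of BitWeight and the WSJ data, and the function names are hypothetical. It sums the smoothed bigram probabilities over the vocabulary for one history under each update rule.

```python
from collections import Counter, defaultdict

# Toy corpus, made up for illustration.
sentences = [s.split() for s in ["a b a c", "a b b"]]

unigrams = Counter(w for sent in sentences for w in sent)
bigrams = defaultdict(Counter)
for sent in sentences:
    for prev, word in zip(sent, sent[1:]):
        bigrams[prev][word] += 1

def wb_unigram(word):
    total = float(sum(unigrams.values()))
    types = len(unigrams)
    lam = total / (total + types)
    # Same base case as calculate_wb: lam * P_mle + 1 / (c + N).
    return lam * unigrams[word] / total + 1.0 / (total + types)

def wb_bigram(prev, word, accumulate):
    lower = wb_unigram(word)
    followers = bigrams[prev]
    total = float(sum(followers.values()))
    if total == 0:
        return lower  # unseen history: fall back to the lower order
    types = len(followers)
    lam = total / (total + types)
    mixed = lam * followers[word] / total + (1 - lam) * lower
    # accumulate=True mimics `pb += ...`; False mimics `pb = ...`.
    return lower + mixed if accumulate else mixed

vocab = sorted(unigrams)
total_assign = sum(wb_bigram('a', w, accumulate=False) for w in vocab)
total_accumulate = sum(wb_bigram('a', w, accumulate=True) for w in vocab)
print(total_assign, total_accumulate)
```

On this toy data the `=` form sums to about 1 over the vocabulary while the `+=` form sums to about 2, so only the former is a proper distribution. This check says nothing about why the perplexity curves behave as they do.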

### Evaluation via Perplexity
Explore the effects of n-gram order using perplexity. Perform ten-fold cross-validation on the WSJ corpus. On each iteration, this will give you a different 90/10 training/test split; train a smoothed language model on the 9 training sections, and compute the average per-token perplexity of the tenth section. The slides from the language modeling lecture give the equation for perplexity computation (as does J&M chapter 4); you'll need to modify the equation a bit, since we're using log-probabilities. 

Now, try this for unigram, bigram, trigram, and 4-gram models. 
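In log space, the product of probabilities in the perplexity formula becomes a sum of logs, giving $2^{-\frac{1}{N}\sum_i \log_2 p_i}$. A minimal sketch with plain floats (not BitWeight values) and made-up probabilities:

```python
import math

def perplexity(probabilities):
    """Per-token perplexity: 2 ** (-(1/N) * sum(log2 p))."""
    total_log2 = sum(math.log(p, 2) for p in probabilities)
    return 2 ** (-total_log2 / len(probabilities))

# A model that assigns 1/4 to every token is as uncertain as a 4-way choice.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Summing logs avoids the underflow that multiplying many small probabilities would cause on a long test set.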

#### What to turn in
* your cross-validation function. You are not supposed to use any cross-validation function from any module. You should implement it yourself.
* your perplexity function
* cross-validation result for unigram, bigram, trigram, and 4-gram models on wsj.pos.gz
* cross-validation result for unigram, bigram, trigram, and 4-gram models on wsj-normalized.pos.gz.
* Answer following 2 questions: 
    * How does perplexity change as the model order size increases?
    * How does perplexity change as the data changed?
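The splitting step of k-fold cross-validation can be sketched independently of any scoring function. The generator below is a hypothetical helper, not the function used later in this notebook. It cuts the data into contiguous folds and, as one possible design choice, lets the last fold absorb any remainder so that every item is tested exactly once.

```python
def k_fold_splits(data, k):
    """Yield (training, test) lists for each of k contiguous folds.

    Hypothetical helper; the last fold absorbs any remainder.
    """
    fold_size = len(data) // k
    for fold in range(k):
        start = fold * fold_size
        stop = len(data) if fold == k - 1 else start + fold_size
        # Training data is everything outside the test slice.
        yield data[:start] + data[stop:], data[start:stop]

splits = list(k_fold_splits(list(range(10)), 3))
for training, test in splits:
    print(test, training)
```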

### Functions for N Fold Cross Validation and Perplexity Calculation

In [178]:

#N_FOLD: generalized n_fold cross validation
#
#   # INPUTS
#   data: the dataset as a list
#   n: number of folds
#   scoring_function: function used to evaluate
#   param_list: list of parameters for scoring_function
#
#   # OUTPUTS
#   score_list: a list of the scores for the different test runs
def n_fold(data, n, scoring_function, param_list):
    print(n, " FOLD CROSS VALIDATION")
    # get length of corpus and use it to get test set size
    size_of_data = len(data)
    size_of_test_set = int(size_of_data/n)
    
    # score list holds onto results
    score_list = []
    
    ###  n folds  ##
    for x in range(n):
        print("-------  FOLD" , x, " ------------")
        # build test set and training set
        test_set = data[x*size_of_test_set:(x+1)*size_of_test_set]
        training_set = data[:x*size_of_test_set]+data[(x+1)*size_of_test_set:]        
        
        # evaluate
        score = scoring_function(test_set, training_set, *param_list)
        print("Score: ",score)
        score_list.append(score)
    

    print("\nMean Score: ",sum(score_list)/float(len(score_list)))
    
    return score_list

def n_fold_on_training(data, n, scoring_function, param_list):
    print(n," FOLDING ON TRAINING")
    # get length of corpus and use it to get test set size
    size_of_data = len(data)
    size_of_test_set = int(size_of_data/n)
    
    # score list holds onto results
    score_list = []
    
    print("Starting")
    
    ###  n folds  ##
    for x in range(0,n):
        print("-------  FOLD" , x, " ------------")
        # build test set and training set
        test_set = data[(x)*size_of_test_set:(x+1)*size_of_test_set]
        training_set = data[:x*size_of_test_set]+data[(x+1)*size_of_test_set:]        
        
        # evaluate
        score = scoring_function(test_set, training_set, *param_list)
        print("Score: ",score)
        score_list.append(score)
            
    print("\nMean Score: ",sum(score_list)/float(len(score_list)))  

    return score_list




In [179]:
def perplex(test_set, training_set, order):
    
    # holds the cascading product
    PP = BitWeight(1)
    
    # count the number of grams
    N = 0.0
    
    # build model
    count_list = count_list_builder(training_set, order)
    
    # tokenize and then get the grams
    toks = tokenize(test_set)
    grams = ngrams(toks, order)
    
    # for each gram, multiply its probability into the cascading product
    for lines in grams:
        N += 1.0
        PP *= calculate_wb(lines[0], lines[1], count_list)
    
    # return the decimal form
    return 2**((1/N)*PP.log())

### Witten-Bell Perplexity 10 Fold Cross Validation for 'wsj.pos'
##### Unigram, Bigram, Trigram, and 4-gram

In [185]:
# 10 Fold Cross Validation for 'wsj.pos' showing results
# Witten-Bell Unigram
tl = file_to_list('wsj.pos')
_ = n_fold(tl[:10000],10, perplex, [1])

10  FOLD CROSS VALIDATION
-------  FOLD 0  ------------
Score:  965.033203404
-------  FOLD 1  ------------
Score:  1030.53236006
-------  FOLD 2  ------------
Score:  992.010260191
-------  FOLD 3  ------------
Score:  1013.64542823
-------  FOLD 4  ------------
Score:  992.170878986
-------  FOLD 5  ------------
Score:  983.736779422
-------  FOLD 6  ------------
Score:  961.503379617
-------  FOLD 7  ------------
Score:  965.316053679
-------  FOLD 8  ------------
Score:  1029.93735751
-------  FOLD 9  ------------
Score:  1051.86799403

Mean Score:  998.575369513


In [187]:
# Witten-Bell Bigram
_ = n_fold(tl[:10000],10, perplex, [2])

10  FOLD CROSS VALIDATION
-------  FOLD 0  ------------
Score:  1280.46972396
-------  FOLD 1  ------------
Score:  1357.63879836
-------  FOLD 2  ------------
Score:  1323.3727721
-------  FOLD 3  ------------
Score:  1337.25791609
-------  FOLD 4  ------------
Score:  1307.25999127
-------  FOLD 5  ------------
Score:  1304.9579425
-------  FOLD 6  ------------
Score:  1270.29549874
-------  FOLD 7  ------------
Score:  1266.71459653
-------  FOLD 8  ------------
Score:  1368.70247816
-------  FOLD 9  ------------
Score:  1378.21174042

Mean Score:  1319.48814581


In [188]:
# Witten-Bell Trigram
_ = n_fold(tl[:10000],10, perplex, [3])

10  FOLD CROSS VALIDATION
-------  FOLD 0  ------------
Score:  492.227562192
-------  FOLD 1  ------------
Score:  515.103302105
-------  FOLD 2  ------------
Score:  487.353028234
-------  FOLD 3  ------------
Score:  489.967960501
-------  FOLD 4  ------------
Score:  433.990069721
-------  FOLD 5  ------------
Score:  472.804617976
-------  FOLD 6  ------------
Score:  494.371105643
-------  FOLD 7  ------------
Score:  463.332879768
-------  FOLD 8  ------------
Score:  491.542129482
-------  FOLD 9  ------------
Score:  540.704746056

Mean Score:  488.139740168


In [189]:
# Witten-Bell 4-gram
_ = n_fold(tl[:10000],10, perplex, [4])

10  FOLD CROSS VALIDATION
-------  FOLD 0  ------------
Score:  317.427139679
-------  FOLD 1  ------------
Score:  329.899679553
-------  FOLD 2  ------------
Score:  312.119505282
-------  FOLD 3  ------------
Score:  315.100664104
-------  FOLD 4  ------------
Score:  267.38591489
-------  FOLD 5  ------------
Score:  300.399039953
-------  FOLD 6  ------------
Score:  320.904056414
-------  FOLD 7  ------------
Score:  299.338542142
-------  FOLD 8  ------------
Score:  309.736566037
-------  FOLD 9  ------------
Score:  346.929954181

Mean Score:  311.924106223


### Witten-Bell Perplexity 10 Fold Cross Validation for 'wsj-normalized.pos'
##### Unigram, Bigram, Trigram, and 4-gram

In [190]:
# 10 Fold Cross Validation for 'wsj-normalized.pos' showing results
# Witten-Bell Unigram
tln = file_to_list('wsj-normalized.pos')
_ = n_fold(tln[:10000],10, perplex, [1])

10  FOLD CROSS VALIDATION
-------  FOLD 0  ------------
Score:  385.314744887
-------  FOLD 1  ------------
Score:  362.588721062
-------  FOLD 2  ------------
Score:  345.103155197
-------  FOLD 3  ------------
Score:  325.047961849
-------  FOLD 4  ------------
Score:  292.448955251
-------  FOLD 5  ------------
Score:  301.972188781
-------  FOLD 6  ------------
Score:  334.213259241
-------  FOLD 7  ------------
Score:  372.325382191
-------  FOLD 8  ------------
Score:  333.449559403
-------  FOLD 9  ------------
Score:  287.906906725

Mean Score:  334.037083459


In [191]:
# Witten-Bell Bigram
_ = n_fold(tln[:10000],10, perplex, [2])

10  FOLD CROSS VALIDATION
-------  FOLD 0  ------------
Score:  536.668123777
-------  FOLD 1  ------------
Score:  504.52841606
-------  FOLD 2  ------------
Score:  487.908528265
-------  FOLD 3  ------------
Score:  455.026915521
-------  FOLD 4  ------------
Score:  410.815966191
-------  FOLD 5  ------------
Score:  426.592805702
-------  FOLD 6  ------------
Score:  466.590636597
-------  FOLD 7  ------------
Score:  512.722089735
-------  FOLD 8  ------------
Score:  470.998593417
-------  FOLD 9  ------------
Score:  403.667070549

Mean Score:  467.551914581


In [192]:
# Witten-Bell Trigram
_ = n_fold(tln[:10000],10, perplex, [3])

10  FOLD CROSS VALIDATION
-------  FOLD 0  ------------
Score:  243.518151366
-------  FOLD 1  ------------
Score:  233.307136954
-------  FOLD 2  ------------
Score:  218.783410549
-------  FOLD 3  ------------
Score:  208.789728743
-------  FOLD 4  ------------
Score:  180.233674194
-------  FOLD 5  ------------
Score:  193.155898579
-------  FOLD 6  ------------
Score:  223.024881755
-------  FOLD 7  ------------
Score:  229.468134118
-------  FOLD 8  ------------
Score:  213.999522079
-------  FOLD 9  ------------
Score:  196.872233605

Mean Score:  214.115277194


In [193]:
# Witten-Bell 4-gram
_ = n_fold(tln[:10000],10, perplex, [4])

10  FOLD CROSS VALIDATION
-------  FOLD 0  ------------
Score:  165.363927334
-------  FOLD 1  ------------
Score:  159.277126192
-------  FOLD 2  ------------
Score:  149.274982091
-------  FOLD 3  ------------
Score:  143.877264043
-------  FOLD 4  ------------
Score:  121.034998493
-------  FOLD 5  ------------
Score:  132.26827235
-------  FOLD 6  ------------
Score:  154.762422334
-------  FOLD 7  ------------
Score:  156.724796221
-------  FOLD 8  ------------
Score:  145.243576228
-------  FOLD 9  ------------
Score:  137.176892952

Mean Score:  146.500425824


### Questions

##### 1. How does perplexity change as the model order size increases?
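In both 'wsj.pos' and 'wsj-normalized.pos', mean perplexity rises from the unigram to the bigram model (998.6 to 1319.5, and 334.0 to 467.6) and then falls for the trigram (488.1 and 214.1) and 4-gram (311.9 and 146.5) models. Apart from the unigram-to-bigram jump, which the self-assessment above traces to a bug in the loop bound of `calculate_wb`, perplexity decreases as the order of the model increases. This indicates that the higher-order models we see here are more accurate than the lower-order ones.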

##### 2. How does perplexity change as the data changed?
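The results show that the model was more accurate in predicting the normalized file: at every order, mean perplexity on 'wsj-normalized.pos' is less than half that on 'wsj.pos' (for example, 146.5 versus 311.9 for the 4-gram model).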

<font color="red">Self assessment:</font>

Although my perplexity results are not completely correct, I believe the code in this section is. I believe the inconsistencies between my output and the provided solution are a result of the $P_{wb}$ calculation and not of the code in this section.