# Give your Notebook a title
Say what it is about in general

In [1]:
# imports

# you can use this for averaging etc if you want
import numpy

# import whatever you want from your own code, for instance:
from corpusreader import CorpusReader
from generate import generate_sentence
from model import BigramModel

## Build the model
We begin by building our model from the training set. We use this model throughout the notebook.

In [2]:
# set path to training set
path = "./train"
reader = CorpusReader(path)

In [3]:
# build the model
my_model = BigramModel(reader.sents())

Adding boundaries: 100%|█████████████████████████████████| 11909/11909 [00:00<00:00, 1294879.23it/s]
Making and counting Unigrams: 100%|███████████████████████| 11909/11909 [00:00<00:00, 445413.15it/s]
Making and counting Bigrams: 100%|█████████████████████████| 11909/11909 [00:00<00:00, 69566.07it/s]


## Experimenting with smoothing constants

Smoothing constants are used both for generating samples and for computing the perplexity of text. To investigate the effect of the smoothing constant on each of these, we define functions that let us try a range of smoothing constants.
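As a reminder of what the constant does, here is a minimal sketch of add-k smoothing for bigram probabilities. It is not the `BigramModel` implementation: the function `add_k_probability` and the toy counts are hypothetical, and it assumes the standard estimate `(count(prev, word) + k) / (count(prev) + k * V)`, where `V` is the vocabulary size.

```python
from collections import Counter

def add_k_probability(bigram_counts: Counter, unigram_counts: Counter,
                      prev: str, word: str, k: float) -> float:
    """Add-k estimate of P(word | prev); k = 0 gives raw relative frequencies."""
    vocab_size = len(unigram_counts)
    numerator = bigram_counts[(prev, word)] + k
    denominator = unigram_counts[prev] + k * vocab_size
    # With k = 0 and an unseen context the denominator is zero, so return 0.0
    return numerator / denominator if denominator > 0 else 0.0

# Toy corpus (hypothetical data, for illustration only)
sentences = [["<s>", "a", "b", "</s>"], ["<s>", "a", "a", "</s>"]]
unigram_counts = Counter(w for s in sentences for w in s)
bigram_counts = Counter(pair for s in sentences for pair in zip(s, s[1:]))

# Seen bigram: the estimate shrinks slightly once k > 0
print(add_k_probability(bigram_counts, unigram_counts, "a", "b", 0.0))
print(add_k_probability(bigram_counts, unigram_counts, "a", "b", 1.0))
# Unseen bigram: zero without smoothing, nonzero with k > 0
print(add_k_probability(bigram_counts, unigram_counts, "<s>", "</s>", 0.0))
print(add_k_probability(bigram_counts, unigram_counts, "<s>", "</s>", 1.0))
```

With k = 0.0 an unseen bigram has probability zero, which matters for both generation and perplexity.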

### Generating a sample with a particular smoothing constant

The function `generate_sample` returns a sample corpus generated by the model using the given smoothing constant, and `compute_corpus_perplexity` calculates the average perplexity of a sample, using a possibly different constant.

In [4]:
def generate_sample(model: BigramModel, sample_size: int, smoothing_constant: float) -> list:
    """
    Generate a sample of sample_size sentences from model, using smoothing_constant for add-k smoothing
    (k = 1.0 is Laplace smoothing, k = 0.0 uses raw probabilities)
    @param model: BigramModel
    @param sample_size: int, number of sentences to generate
    @param smoothing_constant: float
    @return: list of lists of strings
    """
    print("\ngenerating with constant", smoothing_constant)
    # do not name the result `list`, which would shadow the built-in type
    sample = []
    for _ in range(sample_size):
        sample.append(generate_sentence(model, smoothing_constant))
    return sample



In [5]:
from statistics import mean
def compute_corpus_perplexity(model: BigramModel, sample: list, smoothing_constant: float) -> float:
    """
    returns the average perplexity of the given sample with the model, using the smoothing constant for add-k smoothing
    @param model: BigramModel
    @param sample: list
    @param smoothing_constant: float
    @return: float
    """
    perplexes = []
    for sent in sample:
        perplexes.append(model.perplexity(sent, smoothing_constant))
    return mean(perplexes)


To try this out, we generate a sample of one sentence using raw probabilities (no smoothing) and test it with Laplace smoothing (k = 1.0).

In [6]:
n = 1
raw_sample = generate_sample(my_model, n, 0.0)


generating with constant 0.0


Choosing successor for: <s>: 100%|████████████████████████████████| 824/824 [00:13<00:00, 61.73it/s]


ValueError: Total of weights must be greater than zero
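The `ValueError` above has the same message that Python's `random.choices` raises (in Python 3.9 or later) when every weight is zero. Assuming `generate_sentence` draws successors with `random.choices` (an assumption, since its source is not shown here), a constant of 0.0 would give all-zero weights for any word with no observed successor. The sketch below reproduces the error and shows one possible guard; `choose_successor` is a hypothetical helper, not part of `generate`.

```python
import random

def choose_successor(candidates: list, weights: list, fallback: str = "</s>") -> str:
    """Pick one candidate by weight, ending the sentence if no weight is positive."""
    if sum(weights) <= 0:
        # Nothing can follow this word under the current constant
        return fallback
    return random.choices(candidates, weights=weights, k=1)[0]

# Reproduce the error: all weights are zero
try:
    random.choices(["a", "b"], weights=[0.0, 0.0], k=1)
except ValueError as err:
    print("random.choices raised:", err)

print(choose_successor(["a", "b"], [0.0, 0.0]))  # guard returns the fallback
print(choose_successor(["a", "b"], [0.0, 2.0]))  # only "b" has positive weight
```

Returning the end-of-sentence token is only one option; re-drawing the previous word or using a small positive constant would be alternatives.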

We can see an example below.

In [37]:
def token_list2text(sent: list) -> str:
    """Join a list of tokens into a single space-separated string."""
    return " ".join(sent)

In [38]:
print("1 example:\n")
for s in raw_sample[:1]:
    print(token_list2text(s))

1 example:

<s> ix </s>


To calculate the perplexity of this sample, we choose a smoothing constant; here we use 1.0.

In [39]:
laplace_constant = 1.0
print(compute_corpus_perplexity(my_model, raw_sample, laplace_constant))

341.15541553091344


In [40]:
import nltk
tokens = nltk.word_tokenize("letters written respectively in st michael timidly to control your philosophies and prolonged cheering as short well-brushed hair clustered as marriage or destroyed nations who blackmails him considering that floating poetry which assembles in snakes does teething hurt whom indeed exceptionally lively eye roamed again blunderingly this filial devotion said startlingly becoming quite indistinctly indeed well-featured fellow agile like cross-examining a quill pen or motorist among tropic birds whirred and fingering his coat-of-arms nobody eats him lifeless conveniences and brun cleared landscape lay littered floor as")
print(my_model.perplexity(tokens, smoothing_constant=1.0))

inf
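A perplexity of `inf` with k = 1.0 is surprising, since add-one smoothing gives every bigram a nonzero probability. One possible cause, offered as an assumption because `BigramModel.perplexity` is not shown here, is floating-point underflow: the product of many small probabilities rounds to 0.0 for a long sentence, and inverting it gives infinity. The sketch below contrasts a product-based computation with one that sums log probabilities. Both functions are hypothetical and take a plain list of per-token probabilities.

```python
import math

def perplexity_product(probs: list) -> float:
    """Perplexity via the raw product of probabilities; underflows on long inputs."""
    product = math.prod(probs)
    if product == 0.0:
        return math.inf
    return product ** (-1 / len(probs))

def perplexity_log(probs: list) -> float:
    """Perplexity via the mean log probability; stable for long inputs."""
    if any(p <= 0.0 for p in probs):
        return math.inf  # a zero-probability token makes perplexity infinite
    mean_log = sum(math.log(p) for p in probs) / len(probs)
    return math.exp(-mean_log)

# 100 tokens, each with probability 1e-4: the product is 1e-400, below float range
probs = [1e-4] * 100
print(perplexity_product(probs))  # inf, because the product underflows to 0.0
print(perplexity_log(probs))      # close to 10000
```

If the model does multiply raw probabilities, switching to a sum of logs would be one way to get finite values for long sentences.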


To see how the perplexity of this sample varies when we change the smoothing constant used to calculate it, we test with 11 constants from 0.0 to 1.0 in steps of 0.1.

In [29]:
for i in range(0, 11):  # start at 0 to include the unsmoothed case
    c = i/10  # 0.0, 0.1, 0.2, ..., 1.0
    print("\nTest with constant", c)
    print(compute_corpus_perplexity(my_model, raw_sample, c))


Test with constant 0.0
21.800418088134233

Test with constant 0.1
inf

Test with constant 0.2
inf

Test with constant 0.3
inf

Test with constant 0.4
inf

Test with constant 0.5
inf

Test with constant 0.6
inf

Test with constant 0.7
inf

Test with constant 0.8
inf

Test with constant 0.9
inf

Test with constant 1.0
inf


### Analysis

The following pattern holds / No pattern is discernible here. Explain how so...

## Varying the smoothing constant used for generation

Above we generated the sample with raw probabilities. Now we see what happens when we vary the constant used to generate the sample.

In [None]:
for i in range(0,11):
    ...
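One possible shape for this loop is sketched below. To keep it runnable on its own, the generator and the scorer are passed in as callables and replaced by stand-ins; `sweep_generation_constants`, `fake_generate`, and `fake_score` are hypothetical names, and the numbers they produce say nothing about the real model.

```python
from typing import Callable, Dict, List

def sweep_generation_constants(
    generate: Callable[[float], List[List[str]]],
    score: Callable[[List[List[str]]], float],
    constants: List[float],
) -> Dict[float, float]:
    """Generate one sample per constant and score each sample with the same scorer."""
    results = {}
    for c in constants:
        sample = generate(c)
        results[c] = score(sample)
    return results

constants = [i / 10 for i in range(0, 11)]  # 0.0, 0.1, ..., 1.0

# Stand-ins so the sketch runs on its own (hypothetical, not the real model)
def fake_generate(c: float) -> List[List[str]]:
    return [["<s>", "w", "</s>"]] * 2

def fake_score(sample: List[List[str]]) -> float:
    return float(sum(len(s) for s in sample))

results = sweep_generation_constants(fake_generate, fake_score, constants)
for c, value in results.items():
    print(c, value)
```

In this notebook the callables could be `lambda c: generate_sample(my_model, n, c)` and `lambda s: compute_corpus_perplexity(my_model, s, laplace_constant)`, which keeps the perplexity constant fixed while the generation constant varies.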

### Analysis

The following pattern holds / No pattern is discernible here. Explain how so...

Argue whether you should try all smoothing constant combinations (for the generation and for the perplexity calculation)
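To gauge the cost of trying every combination, here is a small sketch that enumerates the full grid of (generation constant, perplexity constant) pairs with `itertools.product`. It only counts the pairs and does not touch the model.

```python
from itertools import product

constants = [i / 10 for i in range(0, 11)]  # 0.0, 0.1, ..., 1.0

# Every (generation constant, perplexity constant) pair
grid = list(product(constants, repeat=2))
print(len(grid))  # 121
print(grid[:3])
```

If each generated sample is reused across all perplexity constants, the full grid needs 11 samples and 121 perplexity computations.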

## Discussion

Any thoughts about why your results came out the way they did?

## Conclusion

Summary, concerns, questions...