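# Smoothing constants in a bigram language model
This notebook builds a bigram model from a training corpus and examines how the smoothing constant affects both the sentences the model generates and the perplexity it assigns to them.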

In [2]:
# imports

# available for averaging and other numerical work
import numpy

# project modules
from corpusreader import CorpusReader
from generate import generate_sentence  # explicit import instead of `import *`
from model import BigramModel

## Build the model
We begin by building our model from the training set. This model is used throughout the notebook.

In [3]:
# set path to training set
path = "./train"
reader = CorpusReader(path)

In [4]:
# build the model
my_model = BigramModel(reader.sents())

Adding boundaries: 100%|██████████████████████████████████████████████| 11909/11909 [00:00<?, ?it/s]
Making and counting Unigrams: 100%|███████████████████████| 11909/11909 [00:00<00:00, 762525.06it/s]
Making and counting Bigrams: 100%|█████████████████████████| 11909/11909 [00:00<00:00, 86595.86it/s]


## Experimenting with smoothing constants

Smoothing constants are used both for generating samples and for computing the perplexity of text. To investigate their effect on each, we define helper functions that let us try a range of smoothing constants.

### Generating a sample with a particular smoothing constant

The function `generate_sample` returns a sample corpus generated by the model with a given smoothing constant, and `compute_corpus_perplexity` calculates the perplexity of such a sample, using a possibly different constant.

In [5]:
def generate_sample(model: BigramModel, sample_size: int, smoothing_constant: float) -> list:
    """
    Generate sample_size sentences from model, using smoothing_constant for add-k smoothing
    (k = 1.0 is Laplace smoothing, k = 0.0 uses raw probabilities)
    @param model: BigramModel
    @param sample_size: int, number of sentences to generate
    @param smoothing_constant: float
    @return: list of lists of strings
    """
    print("\ngenerating with constant", smoothing_constant)
    # build the result directly; naming it `list` would shadow the built-in type
    return [generate_sentence(model, smoothing_constant) for _ in range(sample_size)]



In [6]:
from statistics import mean
def compute_corpus_perplexity(model: BigramModel, sample: list, smoothing_constant: float) -> float:
    """
    Return the mean of the per-sentence perplexities of the given sample under the model, using the smoothing constant for add-k smoothing
    @param model: BigramModel
    @param sample: list
    @param smoothing_constant: float
    @return: float
    """
    perplexes = []
    for sent in sample:
        perplexes.append(model.perplexity(sent, smoothing_constant))
    return mean(perplexes)


To try this out, we generate a sample of 5 sentences using raw probabilities (no smoothing) and evaluate it with Laplace smoothing (k = 1.0).

In [7]:
n = 5
raw_sample = generate_sample(my_model, n, 0.0)


generating with constant 0.0


Choosing successor for: <s>: 100%|████████████████████████████████| 824/824 [00:12<00:00, 68.44it/s]
Choosing successor for: marco: 100%|█████████████████████████████████| 1/1 [00:00<00:00, 180.81it/s]
Choosing successor for: has: 100%|████████████████████████████████| 132/132 [00:01<00:00, 71.13it/s]
Choosing successor for: soaked: 100%|█████████████████████████████████| 4/4 [00:00<00:00, 66.42it/s]
Choosing successor for: through: 100%|██████████████████████████████| 57/57 [00:00<00:00, 75.22it/s]
Choosing successor for: dissipation: 100%|████████████████████████████| 1/1 [00:00<00:00, 44.95it/s]
Choosing successor for: rather: 100%|█████████████████████████████| 166/166 [00:02<00:00, 71.97it/s]
Choosing successor for: flashy: 100%|█████████████████████████████████| 1/1 [00:00<00:00, 60.52it/s]
Choosing successor for: hotel: 100%|████████████████████████████████| 14/14 [00:00<00:00, 66.06it/s]
Choosing successor for: saint: 100%|██████████████████████████████████| 6/6 [00:00<00:00, 9

The generated sentences are printed below.

In [8]:
def token_list2text(sent: list) -> str:
    """Join tokens into one string, each token preceded by a space."""
    return ''.join(' ' + token for token in sent)

In [9]:
print("5 examples:\n")
for s in raw_sample[:5]:
    print(token_list2text(s))

5 examples:

 <s> marco has soaked through dissipation rather flashy hotel saint louis xi the shutters together seemed alike remote watering-place scheme succeeds except dress designed specially orthodox either delete this bank went sailing west virginia washington west virginia wisconsin and cross-questioned the locality of bated bewilderment exactly confirm his grip and aching light china-blue eyes brightened into uncontrollable fits of witnesses was speaking sparely and weedy his hostel </s>
 <s> p. hirsch affair should fall soft sand-hills philip wishing he lingered much riper experience does teething hurt yourselves have woken up chiswick bank </s>
 <s> now dr oman dark-gloved hand extended once indispensable and lingering lowland scotch intonation if proudly as wedding-bells </s>
 <s> encountering the jumps being approved by impact </s>
 <s> victory swayed thrice slid down ludgate hill folk lore </s>


To calculate the perplexity of this sample, we need to choose a smoothing constant. Here we use 1.0 (Laplace smoothing).

In [10]:
laplace_constant = 1.0
print(compute_corpus_perplexity(my_model, raw_sample, laplace_constant))

3990.898614582776


To see how the perplexity of this sample varies with the smoothing constant used to calculate it, we test eleven constants from 0.0 to 1.0 in steps of 0.1.

In [11]:
for i in range(0, 11):  # start at 0 so the unsmoothed case (k = 0.0) is included
    c = i/10  # 0.0, 0.1, 0.2, ..., 1.0
    print("\nTest with constant", c)
    print(compute_corpus_perplexity(my_model, raw_sample, c))


Test with constant 0.0
20.37758027761653

Test with constant 0.1
969.6628390447838

Test with constant 0.2
1568.4226756664095

Test with constant 0.3
2048.664634409533

Test with constant 0.4
2450.5227996876633

Test with constant 0.5
2794.6434386398914

Test with constant 0.6
3093.9973344590303

Test with constant 0.7
3357.530703048718

Test with constant 0.8
3591.7603283782614

Test with constant 0.9
3801.609661362965

Test with constant 1.0
3990.898614582776


### Analysis

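The following pattern holds: the higher the smoothing constant used when calculating the perplexity, the higher the perplexity becomes. A larger constant gives more probability mass to bigrams that never occurred in the training set, which lowers the probability of the bigrams that did occur. Because this sample was generated from raw probabilities, it contains only bigrams seen in training, so every increase in the constant lowers the probability of the sample and raises its perplexity. The jump is largest between 0.0 and 0.1 (from about 20 to about 970), and the increase then slows as the constant approaches 1.0.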
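
The effect can be reproduced on a toy example. The sketch below is not the notebook's `BigramModel`; it assumes the standard add-k estimate, (C(prev, word) + k) / (C(prev) + k * V), and a perplexity normalised by the number of bigrams in the sentence. The corpus and the names `add_k_prob`, `sentence_perplexity` and `toy_corpus` are hypothetical, and the real model may normalise differently.

```python
import math
from collections import Counter


def add_k_prob(bigram_counts, unigram_counts, vocab_size, prev, word, k):
    """Add-k estimate of P(word | prev): (C(prev, word) + k) / (C(prev) + k * V)."""
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * vocab_size)


def sentence_perplexity(sent, bigram_counts, unigram_counts, vocab_size, k):
    """Perplexity of one sentence: exp of the negative mean log-probability of its bigrams."""
    pairs = list(zip(sent, sent[1:]))
    log_prob = sum(
        math.log(add_k_prob(bigram_counts, unigram_counts, vocab_size, prev, word, k))
        for prev, word in pairs
    )
    return math.exp(-log_prob / len(pairs))


# Toy training corpus (hypothetical, not the notebook's training set).
toy_corpus = [
    ["<s>", "the", "cat", "sat", "</s>"],
    ["<s>", "the", "dog", "sat", "</s>"],
    ["<s>", "a", "cat", "ran", "</s>"],
]
unigrams = Counter(tok for sent in toy_corpus for tok in sent)
bigrams = Counter(pair for sent in toy_corpus for pair in zip(sent, sent[1:]))
V = len(unigrams)

# A sentence made only of bigrams seen in training, like a sample generated with k = 0.0.
seen_sentence = toy_corpus[0]
constants = (0.0, 0.5, 1.0)
toy_perplexities = [
    sentence_perplexity(seen_sentence, bigrams, unigrams, V, k) for k in constants
]
for k, pp in zip(constants, toy_perplexities):
    print(f"k={k}: perplexity {pp:.3f}")
```

In this toy corpus every bigram of the test sentence has a raw probability above 1/V, so each of its probabilities shrinks as k grows and the perplexity rises with k, which mirrors the trend in the table above.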

## Varying the smoothing constant used for generation

Above we generated the sample from raw probabilities. Now we see what happens when we vary the constant used to generate it. One sentence is generated for each integer constant from 0 to 10.

In [13]:
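sents = []
for k in range(0, 11):  # integer constants 0, 1, ..., 10
    sents.append(generate_sentence(my_model, k))

for sent in sents:
    print(token_list2text(sent))
    # perplexity() is called without a smoothing constant here, so the model's default is used
    print(f"Perplexity of the sentence: {my_model.perplexity(sent)}")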

Choosing successor for: <s>: 100%|████████████████████████████████| 824/824 [00:11<00:00, 69.14it/s]
Choosing successor for: footmen: 100%|████████████████████████████████| 2/2 [00:00<00:00, 91.16it/s]
Choosing successor for: came: 100%|█████████████████████████████████| 76/76 [00:01<00:00, 67.08it/s]
Choosing successor for: stepping: 100%|███████████████████████████████| 8/8 [00:00<00:00, 71.73it/s]
Choosing successor for: swiftly: 100%|████████████████████████████████| 9/9 [00:00<00:00, 71.74it/s]
Choosing successor for: would: 100%|██████████████████████████████| 141/141 [00:01<00:00, 72.89it/s]
Choosing successor for: horrify: 100%|████████████████████████████████| 1/1 [00:00<00:00, 62.56it/s]
Choosing successor for: <s>: 100%|████████████████████████████████| 824/824 [00:11<00:00, 70.14it/s]
Choosing successor for: maurice: 100%|████████████████████████████████| 1/1 [00:00<00:00, 50.96it/s]
Choosing successor for: brun: 100%|███████████████████████████████████| 9/9 [00:00<00:00, 7

 <s> footmen came stepping swiftly would horrify </s>
Perplexity of the sentence: 3660.418257236423
 <s> maurice brun were all right hand </s>
Perplexity of the sentence: 3660.418257236423
 <s> x </s>
Perplexity of the sentence: 3660.418257236423
 <s> then i came probably captured every trust in half with fiery crypts the same elemental terror announced that one took that this garden beds one word thy hand out for a repulsion which is revolting fancy left arm into sub-headings which might i got one poor fellow believes in his mind is arbitrary distinctions of salads </s>
Perplexity of the sentence: 3660.418257236423
 <s> in light of mr greenwood usher suddenly twisted river in the most morbid fears even discovered anything else will care not true asked a fire </s>
Perplexity of the sentence: 3660.418257236423
 <s> syme strolled across five reconciled let his feet or thirty feet and dilemmas with frivolity of dark avenue between yea there are are betrayed by disk you would give anything

### Analysis

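No pattern is discernible here. In the output shown above, every sentence is reported with the same perplexity (3660.418257236423), whatever constant was used to generate it.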
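
With only one sentence per constant, a single unusual sentence can dominate the comparison. One possible way to make it steadier is to generate several sentences per constant and average their perplexities under one fixed evaluation constant. The sketch below assumes this design; `sweep_generation_constants` and its parameters are hypothetical names, and the two callables stand in for the notebook's `generate_sentence` and `my_model.perplexity`.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence

Sentence = List[str]


def sweep_generation_constants(
    generate: Callable[[float], Sentence],
    perplexity: Callable[[Sentence], float],
    constants: Sequence[float],
    sentences_per_constant: int,
) -> Dict[float, float]:
    """Map each generation constant to the mean perplexity of the sentences generated with it."""
    results = {}
    for k in constants:
        sample = [generate(k) for _ in range(sentences_per_constant)]
        results[k] = mean(perplexity(sent) for sent in sample)
    return results


# Possible use inside this notebook (assumes my_model, generate_sentence and
# laplace_constant are defined as in the cells above):
#
# results = sweep_generation_constants(
#     generate=lambda k: generate_sentence(my_model, k),
#     perplexity=lambda sent: my_model.perplexity(sent, laplace_constant),
#     constants=[i / 10 for i in range(0, 11)],
#     sentences_per_constant=5,
# )
```

Passing the evaluation constant explicitly keeps the generation constant as the only thing that varies between rows of the result.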


## Discussion

Any thoughts about why your results came out the way they did?

## Conclusion

Summary, concerns, questions...