In [13]:
# Initialize Otter
import otter
grader = otter.Notebook("ps3.ipynb")

# STATS 507

## Problem 1: Counting Word Bigrams (6 points)
Let us write a function for counting word bigrams. That is, for each pair of words, say, cat and dog, we want to count how many times the word “cat” occurred immediately before the word “dog”. We will represent this bigram by a tuple, `('cat', 'dog')`.

For example, the poem fragment

```
Half a league, half a league, Half a league onward,
All in the valley of Death Rode the six hundred.
```

contains the bigrams `('half', 'a')` and `('a', 'league')` three times each, the bigram `('league', 'half')` twice, and the bigram `('in', 'the')` only once.

**Note:** For our purposes, we will ignore all spaces, newlines, punctuation and capitalization in our counting.
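
The counts quoted for the poem fragment can be checked by hand with a short sketch. It is only an illustration: it strips just the two punctuation marks that occur in this fragment, and the names `fragment`, `words`, and `bigram_counts` are chosen here and are not part of the assignment.

```python
from collections import Counter

fragment = """Half a league, half a league, Half a league onward,
All in the valley of Death Rode the six hundred."""

# Lowercase, drop the two punctuation marks used in this fragment,
# and split on any whitespace (including the newline).
words = fragment.lower().replace(",", "").replace(".", "").split()

# Pair each word with its successor and tally the pairs.
bigram_counts = Counter(zip(words, words[1:]))

print(bigram_counts[("half", "a")])       # 3
print(bigram_counts[("a", "league")])     # 3
print(bigram_counts[("league", "half")])  # 2
print(bigram_counts[("in", "the")])       # 1
```

Note that one of the `('league', 'half')` pairs comes from the comma-separated repetition and the other from the second "Half", which is why case must be ignored.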

### Part A (2 points)
Write a function `count_bigrams_in_file` that takes a filename `file` as its only argument. Your function should read from the given file, and return a dictionary whose keys are bigrams (given in the tuple form above), and values are the counts for those bigrams.

Again, your function should ignore punctuation, spaces, newlines and capitalization.

#### Hints

- You will find the Python function `str.strip()`, along with the string constants defined in the string documentation (https://docs.python.org/3/library/string.html), useful in removing punctuation.
- Be careful to check that your function handles newlines correctly. For example, in the poem above, one of the `('league', 'half')` bigrams spans a newline, but should be counted nonetheless.
- For this function to run at a reasonable speed, you should remove all punctuation from the file's text __before__ splitting it into words. For reference, our solution uses only one for loop over all the words in the given file's text.
- Be careful that your function does not accidentally count the empty string as a word (this is a common bug if you aren’t careful about splitting the input text). Solutions that merely delete “bad” keys from the dictionary at the end will not receive full credit, as all edge cases can be handled by correctly splitting the input.
- Please replace instances of `'--'` with a space `' '`, so that, e.g., `'Bob--David'` is parsed as `'Bob David'` (two separate words). All other punctuation should simply be removed, so `'Bob/David'` is parsed as `'BobDavid'`. This includes punctuation inside words, so `"they're"` becomes `"theyre"`.
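
The parsing rules in these hints can be illustrated on a few small strings. This is a minimal sketch, not the required interface: `normalize` is a hypothetical helper name, and it uses `str.translate` with `string.punctuation` rather than `str.strip()` so that punctuation inside words is removed as well.

```python
import string


def normalize(text):
    """Hypothetical helper: turn raw text into a list of cleaned words."""
    # A double hyphen separates two words, so it becomes a space first.
    text = text.replace("--", " ")
    # Every remaining punctuation character is deleted outright.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # split() with no argument splits on any run of whitespace
    # and never produces empty strings.
    return text.lower().split()


print(normalize("Bob--David"))          # ['bob', 'david']
print(normalize("Bob/David"))           # ['bobdavid']
print(normalize("they're\n\n  here."))  # ['theyre', 'here']
```

The order of the steps matters: if punctuation were deleted before `'--'` was replaced, `'Bob--David'` would collapse into the single word `'bobdavid'`.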

In [14]:
import string


def count_bigrams_in_file(file):
    with open(file, 'r') as f:
        f_ct = f.read()

    #replace -- with ' '
    f_ct = f_ct.replace('--', ' ')

    #delete punctuation
    f_ct = f_ct.translate(str.maketrans('','', string.punctuation))

    #all whitespace character to ' '
    f_ct = f_ct.translate(str.maketrans(string.whitespace, ' ' * len(string.whitespace)))

    #lower
    f_ct = f_ct.lower()

    words = f_ct.split()

    bigram = {}
    for i in range(len(words)-1):
        bigram[(words[i], words[i+1])] = bigram.get((words[i], words[i+1]), 0) + 1

    return bigram

In [15]:
grader.check("q1a")

### Part B (1 point)  
The file `WandP.txt`, which was included along with this Jupyter notebook in `ps3.zip`, is an `ASCII` copy of all of Tolstoi’s novel War and Peace. Run your function on this file, and pickle the resulting dictionary in a file called `mb.bigrams.pickle`.
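
As a quick illustration of pickling a dictionary with tuple keys, the sketch below writes a tiny stand-in dictionary to a temporary file and reads it back. The dictionary contents and the file name `example.pickle` are placeholders chosen for this example, not the real output for `WandP.txt`.

```python
import os
import pickle
import tempfile

# Tiny stand-in for a bigram dictionary (placeholder data).
example = {("half", "a"): 3, ("in", "the"): 1}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.pickle")  # hypothetical file name
    # Pickle files must be opened in binary mode for writing and reading.
    with open(path, "wb") as f:
        pickle.dump(example, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == example)  # True
```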

In [16]:
import pickle

out = count_bigrams_in_file('WandP.txt')
with open('mb.bigrams.pickle', 'wb') as f:
    pickle.dump(out, f)

In [17]:
grader.check("q1b")

### Part C (2 points)

We say that word `A` is _collocated_ with word `B` in a text if words `A` and `B` occur immediately one after another (in either order). That is, words `A` and `B` are collocated if and only if at least one of the bigrams `(A, B)` or `(B, A)` is present in the text.

Write a function `collocations` that takes a filename `file` as its only argument and returns a dictionary. Your function should read from the given file and return a dictionary whose keys are all the words appearing in the file, and in which the value for key `A` is a Python set containing all the words collocated with `A`.

Follow the same conventions as in the problems above: ignore case and strip away all spaces, newlines, and punctuation.

Again using the poem fragment above as an example, the string `'league'` should appear as a key, and should have as its value the set `{'a', 'half', 'onward'}`, while the string `'in'` should have the set `{'all', 'the'}` as its value.
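
The two sets quoted for the poem fragment can be reproduced with a short sketch. It assumes the fragment has already been cleaned in the simplest possible way (only commas and periods occur here), and the names `fragment`, `words`, and `neighbors` are chosen for this illustration only.

```python
fragment = """Half a league, half a league, Half a league onward,
All in the valley of Death Rode the six hundred."""

words = fragment.lower().replace(",", "").replace(".", "").split()

neighbors = {}
for left, right in zip(words, words[1:]):
    # Collocation is symmetric, so each adjacent pair is recorded
    # under both of its words.
    neighbors.setdefault(left, set()).add(right)
    neighbors.setdefault(right, set()).add(left)

print(neighbors["league"] == {"a", "half", "onward"})  # True
print(neighbors["in"] == {"all", "the"})               # True
```

Using sets means that repeated pairs such as "a league" contribute a neighbor only once.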

#### Hints

- For this function to run at a reasonable speed, you should remove all punctuation from the file's text __before__ splitting it into words. For reference, our solution uses only one for loop over all the words in the given file's text.

In [18]:
def collocations(file):
    with open(file, 'r') as f:
        f_ct = f.read()

    #replace -- with ' '
    f_ct = f_ct.replace('--', ' ')

    #delete punctuation
    f_ct = f_ct.translate(str.maketrans('','', string.punctuation))

    #all whitespace character to ' '
    f_ct = f_ct.translate(str.maketrans(string.whitespace, ' ' * len(string.whitespace)))

    #lower
    f_ct = f_ct.lower()

    words = f_ct.split()

    colloc = {}
    for i in range(len(words)-1):
        # Collocation is symmetric: record each adjacent pair
        # under both the left word and the right word.
        colloc.setdefault(words[i], set()).add(words[i+1])
        colloc.setdefault(words[i+1], set()).add(words[i])

    return colloc

In [19]:
grader.check("q1c")

### Part D (1 point)
Run your function on the file `WandP.txt` and pickle the resulting dictionary in a file called `mb.colloc.pickle`.

In [20]:
import pickle

out = collocations('WandP.txt')
with open('mb.colloc.pickle', 'wb') as f:
    pickle.dump(out, f)

In [21]:
grader.check("q1d")

##  Problem 2: List Comprehensions and Generator Expressions (5 points)
In this problem you’ll write a few simple list comprehensions and generator expressions.

### Part A (1 point)
Write a list comprehension where each element equals $3^{n} - 1$ for $n = 1, 2, 3, \ldots, 20$.

For ease of grading, please assign this list comprehension to a variable called `pow3minus1`.
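
As a small sanity check on the formula and on the `range` endpoints, the sketch below computes only the first five terms. The name `first_five` is chosen for this illustration and is not the graded variable.

```python
# range(1, 6) yields n = 1, 2, 3, 4, 5; the upper bound is exclusive.
first_five = [3**n - 1 for n in range(1, 6)]
print(first_five)  # [2, 8, 26, 80, 242]
```

Because the upper bound of `range` is exclusive, covering $n = 1$ through $n = 20$ requires `range(1, 21)`.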

In [22]:
"""Write the list comprehension here. Make sure to assign it
to a variable `pow3minus1`"""
pow3minus1 = [3**n-1 for n in range(1,21)]

In [23]:
grader.check("q2a")

### Part B (2 points)
Write a generator expression for the **pyramid numbers**. The $n$-th pyramid number $P_{n}$ (for $n = 1, 2, ...$) counts the number of spheres in a pyramid with an $n$-by-$n$ base (see https://en.wikipedia.org/wiki/Square_pyramidal_number), and is given by:
$$
P_{n} = \sum_{k=1}^{n}k^2 = \frac{n (n+1) (2n+1)}{6}
$$
For ease of grading, please assign this generator expression to a variable called `pyramid`.

**Hint:** You may find it useful to first define a generator for the positive integers.
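
One possible way to follow the hint, shown here only as a sketch, is to let `itertools.count` play the role of the positive-integer generator. The names `pyramid_sketch` and `first_six` are chosen for this illustration, and the closed form is compared against the defining sum for the first few terms.

```python
from itertools import count, islice

# count(1) yields 1, 2, 3, ... without end, so the generator
# expression below is infinite and must be consumed lazily.
pyramid_sketch = (n * (n + 1) * (2 * n + 1) // 6 for n in count(1))

# islice takes a finite prefix without exhausting the generator.
first_six = list(islice(pyramid_sketch, 6))
print(first_six)  # [1, 5, 14, 30, 55, 91]

# Compare against the defining sum of squares.
by_sum = [sum(k * k for k in range(1, n + 1)) for n in range(1, 7)]
print(first_six == by_sum)  # True
```

Integer division `//` is exact here because $n(n+1)(2n+1)$ is always divisible by 6, and it avoids the rounding that floating-point division can introduce for very large $n$.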

In [24]:
"""Your code for the pyramid generator goes here."""

def positive():
    a = 1
    while True:
        yield a
        a += 1

def Pn():
    for n in positive():
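        # n(n+1)(2n+1) is always divisible by 6, so integer division is exact
        # and avoids floating-point rounding for large n.
        yield n*(n+1)*(2*n+1)//6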

pyramid =  (pn for pn in Pn())

In [25]:
grader.check("q2b")

### Part C (2 points)
Write a generator expression that enumerates the _octahedral numbers_. The $n$-th octahedral number ($n = 1, 2, ...$) is given by

$$O_{n} = \frac{n(2n^{2}+1)}{3},$$

and counts the number of spheres in an octahedron with $n$ spheres to each edge (see https://en.wikipedia.org/wiki/Octahedral_number).

For ease of grading, please assign this generator expression to a variable called `octa`.

**Note:** You can solve this problem any way you see fit. However, a particularly clever solution uses the fact that the $n$-th octahedral number can be expressed as $O_{n} = P_{n} + P_{n-1}$ where $P_{n}$ denotes the $n$-th pyramidal number (which you implemented in the previous subproblem).
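
The identity in the note follows from the closed forms: $P_{n} + P_{n-1} = \frac{n\left[(n+1)(2n+1) + (n-1)(2n-1)\right]}{6} = \frac{n(4n^{2}+2)}{6} = \frac{n(2n^{2}+1)}{3}$, with the convention $P_{0} = 0$. The sketch below checks it numerically for the first fifty values; the helper names `pyramidal` and `octahedral` are chosen for this illustration only.

```python
def pyramidal(n):
    """Closed form for P_n; it also gives P_0 = 0."""
    return n * (n + 1) * (2 * n + 1) // 6


def octahedral(n):
    """Closed form for O_n."""
    return n * (2 * n * n + 1) // 3


# Check O_n = P_n + P_{n-1} for n = 1, ..., 50.
checks = [octahedral(n) == pyramidal(n) + pyramidal(n - 1) for n in range(1, 51)]
print(all(checks))  # True

print([octahedral(n) for n in range(1, 6)])  # [1, 6, 19, 44, 85]
```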

In [26]:
"""Code for `octa` generator here."""

#we define P_0=0
def Pn_1():
    yield 0
    yield from Pn()

octa = (i+j for i,j in zip(Pn_1(), Pn()))

In [27]:
grader.check("q2c")

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

Upload this .zip file to Gradescope for grading.

In [28]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)