
# Linguistic Data Science - Monday

In this worksheet we will look at author identification using simple statistics.

Our first step is to get some data. We will use texts from Project Gutenberg, which we will access using NLTK.

In [None]:
import nltk
nltk.download("gutenberg")
nltk.download("punkt") # to enable tokenization, we will need this later
from nltk.corpus import gutenberg

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


We can examine the files in this corpus by using the `fileids()` function.

In Python we can use square brackets `[]` to access an element of a list. Note that list indices start from `0`.
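
As a small illustration of zero-based indexing, here is a sketch using a made-up list (the name `authors` and its contents are purely illustrative and are not part of the corpus):

```python
# A made-up list, used only to illustrate indexing
authors = ["austen", "blake", "carroll", "milton"]

first = authors[0]   # index 0 is the first element
third = authors[2]   # index 2 is the third element
last = authors[-1]   # negative indices count from the end

print(first, third, last)
```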

__Question:__ What is the 5th document in the corpus?

In [None]:
doc = gutenberg.fileids()[0]
print(doc)

austen-emma.txt


Now we can access the content of the documents in several ways:

* `gutenberg.raw(doc)`
* `gutenberg.words(doc)`
* `gutenberg.sents(doc)`

__Question:__ Try each of these, then access the 5th element of each and compare the results.

In [None]:
gutenberg.sents(doc)

[['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']'], ['VOLUME', 'I'], ...]

We can build up a count of the words using `Counter`, a class from Python's built-in `collections` module.

In [None]:
# Import the class into the environment
from collections import Counter

# Create a new word frequency counter
word_frequencies = Counter()

# Iterate through each word in the document
for word in gutenberg.words(doc):
  # Note that in Python indentation indicates we are inside the loop
  # We increment the count by 1
  word_frequencies[word] = word_frequencies[word] + 1
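
For reference, a `Counter` can also be built directly from any iterable, and its `most_common` method lists the highest counts. A minimal sketch on a made-up token list (the names `tokens` and `toy_frequencies` are illustrative only):

```python
from collections import Counter

# A made-up list of tokens standing in for a document
tokens = ["the", "cat", "saw", "the", "dog", "the", "cat"]

# Passing an iterable counts every element in one go
toy_frequencies = Counter(tokens)

# Words that were never seen have a count of 0 rather than raising an error
print(toy_frequencies["the"])
print(toy_frequencies["bird"])

# The two most frequent words, with their counts
print(toy_frequencies.most_common(2))
```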

__Question:__ Can you change the code above to calculate the word frequencies over all words in all documents?

The string method `lower` can be used to convert a string to lowercase:

```python
"ExAmPle".lower() == "example"
```

__Question:__ Can you further modify your code to be case-insensitive?

In order to compare two frequency distributions we will use _cosine similarity_, which can be defined as follows:

$
\cos(x,y) = \frac{\sum_i x_i \times y_i}{\sqrt{\sum_i x_i^2 \sum_i y_i^2}}
$
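
As a small worked example with made-up counts (these vectors are illustrative and not taken from the corpus), suppose there are only three words, with counts $x = (2, 1, 0)$ in one text and $y = (1, 1, 1)$ in another, so that $x_i$ and $y_i$ are the counts of word $i$ in each text. Then

$
\cos(x,y) = \frac{2 \times 1 + 1 \times 1 + 0 \times 1}{\sqrt{(2^2 + 1^2 + 0^2)(1^2 + 1^2 + 1^2)}} = \frac{3}{\sqrt{15}} \approx 0.77
$

Because word counts are never negative, the value always lies between 0 (no words in common) and 1 (identical proportions).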

I give an implementation of this below (don't worry if you don't quite understand this!)

In [None]:
from math import sqrt

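def cosine(x, y):
  # x and y are Counters mapping each word to its frequency
  xy = 0  # dot product of x and y
  xx = 0  # squared length of x
  yy = 0  # squared length of y
  for w1, f1 in x.items():
    # A Counter returns 0 for words it has not seen
    xy += f1 * y[w1]
    xx += f1 * f1
  for _, f2 in y.items():
    yy += f2 * f2
  # Avoid dividing by zero when either distribution is empty
  if xx == 0 or yy == 0:
    return 0
  return xy / sqrt(xx * yy)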
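
To get a feel for the values this function returns, here is a quick sanity check on made-up counters. This is only a sketch: the counters `a`, `b` and `c` are illustrative, and `cosine` is restated in a more compact form so that the example runs on its own.

```python
from collections import Counter
from math import sqrt

def cosine(x, y):
  # Same computation as the implementation above, written with sum()
  xy = sum(f * y[w] for w, f in x.items())
  xx = sum(f * f for f in x.values())
  yy = sum(f * f for f in y.values())
  if xx == 0 or yy == 0:
    return 0
  return xy / sqrt(xx * yy)

a = Counter({"the": 2, "cat": 1})
b = Counter({"the": 4, "cat": 2})  # same proportions as a, just doubled
c = Counter({"dog": 3})            # shares no words with a

print(cosine(a, b))  # identical proportions give the maximum similarity
print(cosine(a, c))  # no shared words gives the minimum similarity
```

Note that doubling every count does not change the similarity, which is why documents of very different lengths can still be compared.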

We will now build two distributions: one for works written by Shakespeare and one for works not written by Shakespeare.

In [None]:
shakespeare = Counter()
not_shakespeare = Counter()

for file_id in gutenberg.fileids():
  for word in gutenberg.words(file_id):
    if file_id.startswith("shakespeare"):
      shakespeare[word] += 1
    else:
      not_shakespeare[word] += 1

__Question:__ Using the code above, compute the cosine similarity of each of the documents `shakespeare-caesar.txt` and `milton-paradise.txt` with both `shakespeare` and `not_shakespeare`.

What does this tell you?

In [None]:
caesar = Counter()
paradise = Counter()


for file_id in ["shakespeare-caesar.txt","milton-paradise.txt"]:
  for word in gutenberg.words(file_id):
    if file_id.startswith("shakespeare"):
      caesar[word] += 1
    else:
      paradise[word] += 1

print("Is Caesar            by Shakespeare:", cosine(caesar, shakespeare))
print("Is Caesar        not by Shakespeare:", cosine(caesar, not_shakespeare))
print("Is Paradise Lost     by Shakespeare:", cosine(paradise, shakespeare))
print("Is Paradise Lost not by Shakespeare:", cosine(paradise, not_shakespeare))

Is Caesar            by Shakespeare: 0.9902101010373634
Is Caesar        not by Shakespeare: 0.8785200083947318
Is Paradise Lost     by Shakespeare: 0.8646398762071504
Is Paradise Lost not by Shakespeare: 0.894796702901323


__Question:__ Finally, notice that the two texts used for `caesar` and `paradise` are also included in the frequencies for `shakespeare` and `not_shakespeare`. Can you modify the code above to leave these documents out? Does the method still work?
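
One possible approach is to skip the test files when building the reference counts. The sketch below uses a tiny made-up corpus rather than the Gutenberg data, so it only illustrates the idea; the dictionary `toy_corpus`, its file names, and the helper `build_reference` are all hypothetical.

```python
from collections import Counter

# Hypothetical stand-in for the corpus: file id -> list of words
toy_corpus = {
  "shakespeare-a.txt": ["thou", "art", "thou"],
  "shakespeare-b.txt": ["thou", "hast"],
  "milton-a.txt": ["of", "man", "of"],
  "austen-a.txt": ["of", "her"],
}

def build_reference(corpus, held_out):
  """Count words per class, skipping every file id in held_out."""
  shakespeare = Counter()
  not_shakespeare = Counter()
  for file_id, words in corpus.items():
    if file_id in held_out:
      continue  # test documents must not contribute to the reference counts
    if file_id.startswith("shakespeare"):
      shakespeare.update(words)
    else:
      not_shakespeare.update(words)
  return shakespeare, not_shakespeare

# The documents we want to test are kept out of the reference counts
held_out = {"shakespeare-b.txt", "milton-a.txt"}
shakespeare_ref, other_ref = build_reference(toy_corpus, held_out)
print(shakespeare_ref)
print(other_ref)
```

With the real data, the same `if file_id in held_out: continue` check could be placed inside the loop over `gutenberg.fileids()`.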

### Buzz groups
Discuss any further improvements to this methodology that you can think of.