# Week 3

## Reporting: Different Kinds of Frequencies

When reporting your findings from last week, you've mainly been using "absolute frequency" (AF). There are, however, many ways to report word frequencies in a corpus. As we go over these frequencies, consider the trade-offs and advantages of each.

### Absolute ("raw") frequency

Brezina defines AF as "a count of all tokens in the text or corpus that belong to a particular word type" [@Brezina2018 42]. He uses the example of the 6,041,234 occurrences of the token "the" in the British National Corpus (BNC). Since Greek inflects the definite article, we can't simply count the occurrences of a single token.

> Discuss: When should you use absolute frequency in reporting?

As usual, let's load up our text.

In [7]:
from MyCapytain.resources.texts.local.capitains.cts import CapitainsCtsText

with open("../tei/tlg0525.tlg001.perseus-eng2.xml") as f:
    text = CapitainsCtsText(urn="urn:cts:greekLit:tlg0525.tlg001.perseus-eng2", resource=f)

And let's count the occurrences of the definite article across the whole work.

In [8]:
from lxml import etree
from MyCapytain.common.constants import Mimetypes

urns = []
raw_xmls = []
unannotated_strings = []

for ref in text.getReffs(level=len(text.citation)):
    urn = f"{text.urn}:{ref}"
    node = text.getTextualNode(ref)
    raw_xml = node.export(Mimetypes.XML.TEI)
    tree = node.export(Mimetypes.PYTHON.ETREE)
    s = etree.tostring(tree, encoding="unicode", method="text")

    urns.append(urn)
    raw_xmls.append(raw_xml)
    unannotated_strings.append(s)

import pandas as pd

d = {
    "urn": pd.Series(urns, dtype="string"),
    "raw_xml": raw_xmls,
    "unannotated_strings": pd.Series(unannotated_strings, dtype="string")
}
pausanias_df = pd.DataFrame(d)

In [9]:
# this will take a while

import spacy

nlp = spacy.load("en_core_web_sm", disable=["ner"])

raw_texts = [t for t in pausanias_df['unannotated_strings']]
annotated_texts = nlp.pipe(raw_texts, batch_size=100)

pausanias_df['nlp_docs'] = list(annotated_texts)

In [None]:
definite_article = [t for t in pausanias_df['nlp_docs'].explode() if t.lemma_ == "the"]

len(definite_article)

Thus we have 26,932 occurrences of the definite article in Pausanias.

When might we want to use absolute frequency? As we've already started to see when working with Pausanias, absolute frequency can be useful for sorting tokens and lemmata in a single corpus. When comparing multiple corpora of different sizes, however, it does not provide a good point of comparison. Consider a relatively small corpus, like Aeschylus' seven extant plays, set against all of Pausanias: comparing absolute frequencies would be essentially meaningless. Or, to take a more ready-to-hand example, what would it mean to compare the frequencies in a single book of Pausanias to those in the entire work? For this kind of analysis, we need to use **relative frequency**.

### Relative ("normalized") frequency

Relative frequency [@Brezina2018 43] is easy to calculate: take the absolute frequency of the target word and divide it by the total number of words in the corpus. By convention, we then multiply it by a constant, the "basis for normalization". This converts the measurement from a proportion of the tokens in the corpus to a count per thousand tokens, or per million, or whatever you decide to make your constant.

To practice, let's calculate the number of definite articles per million words in Pausanias.


In [None]:
n_definite_article = len(definite_article)
n_tokens = len([t for t in pausanias_df['nlp_docs'].explode()])
basis = 1_000_000

rf_definite_article_in_pausanias = (n_definite_article / n_tokens) * basis

print(rf_definite_article_in_pausanias)

As you've probably surmised from looking at this calculation, the relative frequency "can ... be considered as ... the mean of the frequencies of the word in hypothetical samples of _x_ tokens from the corpus, where _x_ is the basis for normalization" [@Brezina2018 43].

In other words, if we were to divide Pausanias into equal 1-million-word chunks, counting the occurrences of the definite article in each chunk and then averaging them, we would arrive at the same number.
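The same intuition can be checked on a much smaller scale. The following is a minimal sketch on made-up data: the toy token list and the basis of 5 are assumptions chosen so that the corpus divides evenly into chunks, and nothing here is drawn from Pausanias.

```python
# Toy corpus of 20 tokens (hypothetical data for illustration only).
tokens = (
    "the cat saw the dog . the dog ran . "
    "a bird sang . the cat slept and the dog"
).split()

basis = 5  # basis for normalization: "occurrences per 5 tokens"
target = "the"

# Relative frequency: (absolute frequency / total tokens) * basis
relative_frequency = tokens.count(target) / len(tokens) * basis

# Split the corpus into equal chunks of `basis` tokens and count the target in each
chunks = [tokens[i:i + basis] for i in range(0, len(tokens), basis)]
chunk_counts = [chunk.count(target) for chunk in chunks]
mean_chunk_count = sum(chunk_counts) / len(chunk_counts)

print(relative_frequency, mean_chunk_count)
```

The two numbers agree here because the corpus length is an exact multiple of the basis; with a leftover partial chunk the simple mean of chunk counts would no longer match exactly.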

Keep this intuition in mind for later.

### Hapax legomena ("once-saids")

_Hapax legomena_ (singular _hapax legomenon_ or simply _hapax_) are words that occur only once in the corpus. We can get a sense of hapaxes in Pausanias by counting the occurrences of lemmata like we did last week.

In [None]:
from collections import Counter

counts = Counter([t.lemma_ for t in pausanias_df['nlp_docs'].explode() if t.is_alpha])
hapaxes = [h for (h, i) in counts.items() if i == 1]

hapaxes

> Discuss: Do you notice any potential issues in the list of hapaxes above? How should we report them?

### Zipf's Law

Using the `counts` that we calculated above, we can examine Zipf's law in the context of Pausanias. Zipf's law, to borrow Brezina's summary, "tells us that when we start with the most frequent item in the wordlist (regardless of the size of the corpus), the second most frequent item will have only half of the frequency of the first item. The third most common word will have one-third of the frequency of the first item; and so on" [@Brezina2018 44].
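To make the prediction concrete, here is a small sketch of the idealized relationship described in the quotation. The helper name `zipf_expected` and the top frequency of 6000 are hypothetical; real wordlists only approximate these values.

```python
def zipf_expected(top_frequency: float, rank: int) -> float:
    """Frequency predicted for a 1-based rank under the idealized Zipf's law."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    return top_frequency / rank


top_frequency = 6000  # hypothetical frequency of the most common item
expected = [zipf_expected(top_frequency, rank) for rank in (1, 2, 3, 4)]

print(expected)
```

Comparing predictions like these with observed counts is one way to judge how closely a wordlist follows the law.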

In [None]:
counts.most_common(3)

The ratios aren't exact, but they're pretty close! Zipf's law might be easier to see if we visualize it.

In [None]:
%pip install "altair[all]"

In [None]:
import altair as alt

# most_common() already returns the items sorted by descending frequency
top_words = counts.most_common(100)

# ranks conventionally start at 1, so the most frequent lemma has rank 1
zipfs_df = pd.DataFrame([{'lemma': h, 'frequency': i, 'rank': idx} for idx, (h, i) in enumerate(top_words, start=1)])

chart = alt.Chart(zipfs_df)

chart.mark_point().encode(x='rank', y='frequency')

As you can see, the rapid decrease in frequencies maps nicely onto the Zipf's law visualization in @Brezina2018 [45].

> In-class exercise: We can see the visualization well enough here, but how can we improve this chart? Use the [Altair documentation](https://altair-viz.github.io/index.html) to make the chart more readable. As a bonus, see if you can figure out how to show the lemma for each rank when you hover over the point.

## Dispersion

> Discuss: In your own words, describe the so-called "whelk problem" [@Brezina2018 46--47]. Who coined the phrase, and why?

> Generally, dispersion tells us about the distribution of words or phrases throughout the corpus. For example, the definite article _the_ is not only a highly frequent word, it also is fairly evenly distributed in text. This is because _the_ is a grammatical word and we usually cannot put sentences together without using it. Other words which are specific to a particular context (e.g. whelk, hashtag, corpus) will be less evenly distributed. [@Brezina2018 47]

### Range<sub>2</sub> (R)

Range<sub>2</sub> simply tells us in how many parts of a corpus a given word appears, regardless of the size of each part. It can also be expressed as a percentage, e.g., if a word appears in 8 out of 10 books of Pausanias, it would have an _R_ value of 8/10 or 80%.

> In-class exercise: Divide the quotation above into sentences and determine the R<sub>2</sub> dispersion of forms of the word "be". Bonus: do the same, but divide the quotation into five-word chunks (you should have one one-word chunk).

In [None]:
# note: typically you will want to use a lemmatizer, but as we are working with sample data, we can 
# just create a tuple of forms to work with here
forms_of_be = ("be", "am", "are", "is", "was", "were", "been") 

quotation = """
Generally, dispersion tells us about the distribution of words or phrases throughout the corpus. 
For example, the definite article the is not only a highly frequent word, it also is fairly evenly distributed in text. 
This is because the is a grammatical word and we usually cannot put sentences together without using it. 
Other words which are specific to a particular context (e.g. whelk, hashtag, corpus) will be less evenly distributed.
"""

# Your code below. Helpful built-in functions: str.split(), str.splitlines().

quotation_sents = [l.split() for l in quotation.splitlines() if l != ""]
# l.split() splits each line (a string) into a list of tokens
# `if l != ""` gets rid of the empty line at the top of the quotation

# R2 = number of parts containing a form of "be" / total number of parts
quotation_r_2 = sum(1 for l in quotation_sents if set(l).intersection(forms_of_be)) / len(quotation_sents)
# set(l) holds the unique tokens in sentence l
# .intersection(forms_of_be) keeps only the tokens that also appear in forms_of_be
# we count 1 for every sentence with a non-empty intersection and add up the 1s with sum()
# dividing by len(quotation_sents) turns that count into the range

print(quotation_r_2)

# Bonus (not finished in class): the same calculation over five-word chunks
quotation_tokens = quotation.split()
quotation_chunks = [quotation_tokens[i:i + 5] for i in range(0, len(quotation_tokens), 5)]
chunks_r_2 = sum(1 for c in quotation_chunks if set(c).intersection(forms_of_be)) / len(quotation_chunks)

print(chunks_r_2)



### Standard Deviation

The **standard deviation** (σ) attempts to answer how much variation around the mean occurs in the data. You've probably seen this measurement of dispersion outside of corpus linguistics, but here it can show, for example, variation in the relative frequency of a word in different parts of a corpus.

Standard deviation is expressed mathematically as `sqrt(sum of squared distances from the mean / total # of corpus parts)`.

> Discuss: Why do we square the differences from the mean if we're also going to take the square root of the ratio?

#### Sample standard deviation

Sample standard deviation differs from standard deviation (σ) only in the divisor, which is here `total # of corpus parts - 1`.
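As a sanity check on the two divisors, the sketch below computes both measures by hand and compares them with Python's standard-library `statistics.pstdev` (divides by the number of parts) and `statistics.stdev` (divides by the number of parts minus one). The list of relative frequencies is made up for illustration.

```python
import math
import statistics

# Hypothetical relative frequencies of a word in four corpus parts
rel_freqs = [2.0, 4.0, 4.0, 6.0]

mean = sum(rel_freqs) / len(rel_freqs)
squared_distances = sum((x - mean) ** 2 for x in rel_freqs)

# Standard deviation: divide by the number of parts
sigma = math.sqrt(squared_distances / len(rel_freqs))

# Sample standard deviation: divide by the number of parts minus one
sample_sd = math.sqrt(squared_distances / (len(rel_freqs) - 1))

print(sigma, statistics.pstdev(rel_freqs))
print(sample_sd, statistics.stdev(rel_freqs))
```

Because the sample version divides by a smaller number, it is always at least as large as σ for the same data.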

> In-class exercise: calculate the standard deviation and the sample standard deviation of the relative frequency of forms of "be" in the quotation from Brezina.

In [None]:
# Your code here

rel_freqs_be = []

for line in quotation_sents:
    n_be = len([t for t in line if t in forms_of_be])
    # relative frequency: occurrences / tokens in the sentence, normalized per 10 tokens
    rel_freqs_be.append((n_be / len(line)) * 10)

print(rel_freqs_be)

average_be_per_10_tokens = sum(rel_freqs_be) / len(rel_freqs_be)

print(average_be_per_10_tokens)

import math

def std_dev(samples):
    mean = sum(samples) / len(samples)

    # square each distance from the mean; the divisor belongs inside the square root
    return math.sqrt(sum((n - mean) ** 2 for n in samples) / len(samples))

def sample_std_dev(samples):
    mean = sum(samples) / len(samples)

    # same as std_dev, but dividing by (number of parts - 1)
    return math.sqrt(sum((n - mean) ** 2 for n in samples) / (len(samples) - 1))

std_dev_be = std_dev(rel_freqs_be)

sample_std_dev_be = sample_std_dev(rel_freqs_be)

print(f"σ = {std_dev_be} | SD = {sample_std_dev_be}")


### Coefficient of Variation (CV)

"The coefficient of variation (CV) describes the amount of variation relative to the mean relative frequency of a word or phrase in the corpus" [@Brezina2018 50].

We calculate the coefficient of variation by dividing the standard deviation by the mean: `CV = std. deviation / mean`.
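A minimal sketch of the formula on made-up data follows. It assumes the standard deviation meant here is σ (the version that divides by the number of parts), which is what `statistics.pstdev` computes; the function name and both frequency lists are hypothetical.

```python
import math
import statistics


def coefficient_of_variation(rel_freqs):
    """CV = standard deviation / mean of the per-part relative frequencies."""
    mean = statistics.fmean(rel_freqs)
    if mean == 0:
        raise ValueError("CV is undefined when the word never occurs")
    return statistics.pstdev(rel_freqs) / mean


# Hypothetical relative frequencies in four corpus parts
even = [5.0, 5.0, 5.0, 5.0]     # the word is spread evenly
uneven = [20.0, 0.0, 0.0, 0.0]  # the word occurs in one part only

cv_even = coefficient_of_variation(even)
cv_uneven = coefficient_of_variation(uneven)

print(cv_even, cv_uneven, math.sqrt(len(uneven) - 1))
```

In this toy case the most uneven distribution possible yields a CV equal to the square root of the number of parts minus one.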

> In-class exercise: What is the coefficient of variation for forms of the word "be" in the quotation from Brezina when dividing by sentence?


In [None]:
# Your code here

cv_brezina_quotation = "?"

### Juilland's _D_

> Juilland’s _D_ is a measure of dispersion that builds on the coefficient of variation. It is a number between 0 and 1, with 0 signifying extremely uneven distribution and 1 perfectly even distribution. [@Brezina2018 51]

You can think of Juilland's _D_ as the inverse of the coefficient of variation: "While CV tells us about the amount of variation in the corpus (larger CV means more variation in the frequencies), Juilland’s D tells us about homogeneity of the distribution (larger Juilland’s D means a more even distribution and less variation)" [@Brezina2018 51].

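Juilland's _D_ is calculated by `1 - (CV / sqrt(# corpus parts - 1))`. Subtracting from 1 is what flips the scale, so that a larger _D_ means a more even distribution.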
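The sketch below works through the calculation on made-up data, using the formulation `D = 1 - CV / sqrt(# corpus parts - 1)` so that the result falls between 0 and 1 as described in the quotation above. The function name and the frequency lists are hypothetical, and CV is computed with σ (`statistics.pstdev`).

```python
import math
import statistics


def juillands_d(rel_freqs):
    """Juilland's D from the per-part relative frequencies of a word."""
    n_parts = len(rel_freqs)
    if n_parts < 2:
        raise ValueError("at least two corpus parts are needed")
    mean = statistics.fmean(rel_freqs)
    if mean == 0:
        raise ValueError("D is undefined when the word never occurs")
    cv = statistics.pstdev(rel_freqs) / mean
    return 1 - cv / math.sqrt(n_parts - 1)


# Hypothetical relative frequencies in four corpus parts
d_even = juillands_d([5.0, 5.0, 5.0, 5.0])     # perfectly even
d_uneven = juillands_d([20.0, 0.0, 0.0, 0.0])  # all occurrences in one part

print(d_even, d_uneven)
```

The perfectly even list gives 1 and the one-part list gives a value at (or within floating-point error of) 0, matching the two ends of the scale.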

> In-class exercise: What is the Juilland's _D_ for forms of the word "be" in the quotation from Brezina (split into sentences)?

In [None]:
# Your code here

j_d_brezina_quotation = "?"

### Deviation of Proportions (DP)

**Deviation of Proportions** is similar to Juilland's _D_ insofar as it measures dispersion in a corpus; but it uses the reverse scale, where 0 indicates perfectly even dispersion and 1 indicates an extremely uneven distribution.

It is calculated by taking the `sum(| observed - expected proportions |) / 2`.

The `expected proportions` are calculated by dividing the size of each corpus part (in tokens) by the total size of the corpus.

In [None]:
brezina_sents = [l.strip().split(" ") for l in quotation.splitlines() if l != ""]
total_brezina_tokens = sum(len(s) for s in brezina_sents)
expected_proportions = [len(s) / total_brezina_tokens for s in brezina_sents] 


The `observed proportions` are then calculated by taking the absolute frequency of a token in each part divided by the absolute frequency of the token in the whole corpus.
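Before turning to the quotation, it may help to see the whole calculation on a toy corpus. In this sketch the function name, the part sizes, and the word counts are all made up for illustration.

```python
import math


def deviation_of_proportions(part_sizes, word_counts):
    """DP = sum(|observed - expected proportions|) / 2."""
    total_tokens = sum(part_sizes)
    total_hits = sum(word_counts)
    if total_hits == 0:
        raise ValueError("DP is undefined when the word never occurs")
    # expected: each part's share of all tokens in the corpus
    expected = [size / total_tokens for size in part_sizes]
    # observed: each part's share of the word's occurrences
    observed = [count / total_hits for count in word_counts]
    return sum(abs(o - e) for o, e in zip(observed, expected)) / 2


# Hypothetical corpus of three parts with 100, 200, and 700 tokens
part_sizes = [100, 200, 700]

dp_even = deviation_of_proportions(part_sizes, [1, 2, 7])     # in step with part size
dp_uneven = deviation_of_proportions(part_sizes, [10, 0, 0])  # only in the smallest part

print(dp_even, dp_uneven)
```

When the occurrences track the part sizes exactly, observed and expected proportions coincide and DP is 0; packing every occurrence into the smallest part pushes DP toward 1.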

Calculate the DP for the Brezina quotation below, using the `expected_proportions` provided above.

In [None]:
# Your code here

dp_brezina_quotation = "?"

## Average Reduced Frequency (ARF)

The key idea behind **Average Reduced Frequency** is that we can discard occurrences of a word that are close together to get a better picture of the word's significance to the corpus as a whole.

**ARF** is calculated as follows (in pseudo-code):

```
w = word
v = total corpus tokens / absolute frequency of w

ARF = 1/v * (min(distance_1, v) + min(distance_2, v) + ... min(distance_n, v))
```

If we think of the corpus as a circle rather than a line, we can imagine repeating the `min(distance, v)` procedure for every occurrence of `w`, wrapping around the text at the end.

ARF, in other words, is a "reduction" of the word's frequency that is based on the dispersion of its occurrences throughout the corpus. [@Brezina2018 53--57]
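The pseudo-code can be turned into a short Python sketch. This is one possible reading rather than a reference implementation: it assumes each distance is measured from the previous occurrence of the word, with the first occurrence wrapping around to the last. The function name and the toy token lists are hypothetical.

```python
def average_reduced_frequency(tokens, word):
    """ARF of `word`, treating the corpus as a circle of tokens."""
    positions = [i for i, t in enumerate(tokens) if t == word]
    frequency = len(positions)
    if frequency == 0:
        return 0.0
    n_tokens = len(tokens)
    v = n_tokens / frequency  # average spacing if the word were evenly spread
    # distance back to the previous occurrence; the first one wraps to the last
    distances = [positions[0] + n_tokens - positions[-1]]
    distances += [b - a for a, b in zip(positions, positions[1:])]
    return sum(min(d, v) for d in distances) / v


# Two toy corpora of 12 tokens, each containing "w" three times
spread = ["w", "x", "x", "x"] * 3       # occurrences evenly spaced
clumped = ["w", "w", "w"] + ["x"] * 9   # occurrences bunched together

arf_spread = average_reduced_frequency(spread, "w")
arf_clumped = average_reduced_frequency(clumped, "w")

print(arf_spread, arf_clumped)
```

Both corpora have the same absolute frequency for `"w"`, but only the evenly spaced one keeps its full count; the clumped occurrences are reduced.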

> Discuss: How can ARF be used to address the whelk problem mentioned above?

## Lexical Diversity

Lexical diversity helps us measure whether a corpus uses a wide or limited range of vocabulary.

### Type/Token Ratio (TTR)

One of the simplest ways to calculate lexical diversity is the **type/token ratio (TTR)**. This calculation is driven by the intuition that a corpus with a relatively high number of word forms (types) compared to total words (tokens) exhibits a wider range of expression than a corpus of the same size with a lower number of types.

```
TTR = no. types / no. tokens
```
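As a quick illustration of the formula, here is a sketch on a made-up sentence. It treats whitespace-separated word forms as both tokens and types, with no lemmatization; the function name and the sample are hypothetical.

```python
def type_token_ratio(tokens):
    """TTR = number of distinct types / number of tokens."""
    if not tokens:
        raise ValueError("TTR is undefined for an empty text")
    return len(set(tokens)) / len(tokens)


# Hypothetical sample: 8 tokens, 5 types (the, cat, and, dog, bird)
sample = "the cat and the dog and the bird".split()
sample_ttr = type_token_ratio(sample)

print(sample_ttr)
```

Passing lemmata instead of surface forms would count inflected variants of a word as a single type.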

> In-class exercise: Calculate the TTR for the Brezina quotation. You will need to use the spaCy lemmatizer.


In [None]:
%run -m spacy download en_core_web_sm

In [None]:
import spacy

eng_nlp = spacy.load("en_core_web_sm")
doc = eng_nlp(quotation)

brezina_ttr = "?"

> Discuss: What problems emerge from this simple TTR calculation?

### Standardized Type/Token Ratio (STTR)

As its name implies, **Standardized Type/Token Ratio (STTR)** divides the text into standardized chunks of, e.g., 1000 tokens, discarding the final chunk when it falls short of the standard size. It then calculates the TTR for each chunk and reports the mean of all chunks' TTRs.
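A minimal sketch of that procedure, assuming that "the last chunk" means the incomplete remainder left over after the full-size chunks. The function name, the tiny chunk size of 4, and the toy tokens are hypothetical.

```python
def sttr(tokens, chunk_size=1000):
    """Mean TTR over consecutive full-size chunks; any remainder is discarded."""
    n_chunks = len(tokens) // chunk_size
    if n_chunks == 0:
        raise ValueError("text is shorter than one chunk")
    ttrs = []
    for i in range(n_chunks):
        chunk = tokens[i * chunk_size:(i + 1) * chunk_size]
        ttrs.append(len(set(chunk)) / chunk_size)
    return sum(ttrs) / len(ttrs)


# Toy text of 11 tokens: two full chunks of 4, plus 3 leftover tokens
toy_tokens = "a b c d a a b b c c x".split()
toy_sttr = sttr(toy_tokens, chunk_size=4)

print(toy_sttr)
```

The first chunk has four types (TTR 1.0) and the second has two (TTR 0.5), so the mean is 0.75; the leftover tokens play no part.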

### Moving Average Type/Token Ratio (MATTR)

Similarly, **Moving Average Type/Token Ratio (MATTR)** calculates the mean of multiple TTRs as a _moving average_ (i.e., an overlapping window) of chunks through the corpus.
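The moving window can be sketched in a few lines. This version recomputes the set of types for every window, which keeps the code easy to follow but will be slow for large windows over long texts; the function name and the toy tokens are hypothetical.

```python
import math


def mattr(tokens, window=1000):
    """Mean TTR over every window of `window` consecutive tokens."""
    if len(tokens) < window:
        raise ValueError("text is shorter than the window")
    ttrs = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ttrs) / len(ttrs)


# Toy text of 5 tokens with a window of 3 gives three overlapping windows:
# [a, b, a], [b, a, b], [a, b, c]
toy_tokens = "a b a b c".split()
toy_mattr = mattr(toy_tokens, window=3)

print(toy_mattr)
```

Unlike STTR, no tokens are discarded: each step slides the window forward by one token, so every token contributes to at least one window.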

> Discuss: How does Brezina's "transformation of Zipf's law to express rank" [-@Brezina2018 60] work? How can we use it for Pausanias?


## Homework

1. Find a word that appears no more than 20 times in all of Pausanias, and calculate its R<sub>2</sub> for Pausanias' text when divided by book.
2. Find the standard deviation of the relative frequencies of forms of ποιέω in the books of Pausanias.
3. Calculate the deviation of proportions for a word of your choosing in the books of Pausanias.
4. Calculate the TTR for each book of Pausanias, and for Pausanias as a whole.
5. Calculate the MATTR of Pausanias using a sliding window of 5000 tokens.