# Week 3

## Reporting: Different Kinds of Frequencies

When reporting your findings from last week, you've mainly been using "absolute frequency" (AF). There are, however, many ways to report word frequencies in a corpus. As we go over these frequencies, consider the trade-offs and advantages of each.

### Absolute ("raw") frequency

Brezina defines AF as "a count of all tokens in the text or corpus that belong to a particular word type" [@Brezina2018 42]. He uses the example of the 6,041,234 occurrences of the token "the" in the British National Corpus (BNC). Since Greek inflects the definite article, we can't simply count the occurrences of a single token. Instead, we count every token whose lemma is ὁ, whatever its surface form.
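To make the difference concrete, here is a minimal sketch that compares counting by surface form with counting by lemma. The `tagged_tokens` list is a hypothetical, hand-tagged toy sample, not data from Pausanias, and the variable names are only for illustration.

```python
from collections import Counter

# The lemma under which every form of the definite article is grouped.
ARTICLE = "ὁ"

# Hypothetical hand-tagged (surface form, lemma) pairs; NOT taken from Pausanias.
tagged_tokens = [
    (ARTICLE, ARTICLE),
    ("ἀνήρ", "ἀνήρ"),
    ("τὸν", ARTICLE),
    ("ἵππον", "ἵππος"),
    ("τῆς", ARTICLE),
    ("πόλεως", "πόλις"),
    ("τοῦ", ARTICLE),
    ("θεοῦ", "θεός"),
]

# Absolute frequency keyed on the surface form vs. keyed on the lemma.
form_counts = Counter(form for form, _ in tagged_tokens)
lemma_counts = Counter(lemma for _, lemma in tagged_tokens)

article_by_form = form_counts[ARTICLE]
article_by_lemma = lemma_counts[ARTICLE]

print(f"by surface form: {article_by_form}, by lemma: {article_by_lemma}")
```

In this toy sample, counting the single form ὁ misses the three inflected articles that the lemma count picks up. The same reasoning is why we filter on the lemma when we count the article in the full text.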

> Discuss: When should you use absolute frequency in reporting?

As usual, let's load up our text.

In [1]:
from MyCapytain.resources.texts.local.capitains.cts import CapitainsCtsText

with open("../tei/tlg0525.tlg001.perseus-grc2.xml") as f:
    text = CapitainsCtsText(urn="urn:cts:greekLit:tlg0525.tlg001.perseus-grc2", resource=f)

And let's count the occurrences of the definite article across the whole work.

In [2]:
from lxml import etree
from MyCapytain.common.constants import Mimetypes

urns = []
raw_xmls = []
unannotated_strings = []

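# Walk every passage at the deepest level of the citation scheme
for ref in text.getReffs(level=len(text.citation)):
    urn = f"{text.urn}:{ref}"
    node = text.getTextualNode(ref)
    raw_xml = node.export(Mimetypes.XML.TEI)
    tree = node.export(Mimetypes.PYTHON.ETREE)
    # method="text" drops the TEI markup and keeps only the text content
    s = etree.tostring(tree, encoding="unicode", method="text")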

    urns.append(urn)
    raw_xmls.append(raw_xml)
    unannotated_strings.append(s)

import pandas as pd

d = {
    "urn": pd.Series(urns, dtype="string"),
    "raw_xml": raw_xmls,
    "unannotated_strings": pd.Series(unannotated_strings, dtype="string")
}
pausanias_df = pd.DataFrame(d)

In [3]:
# this will take a while

import spacy

nlp = spacy.load("grc_proiel_trf", disable=["ner"])

raw_texts = [t for t in pausanias_df['unannotated_strings']]
annotated_texts = nlp.pipe(raw_texts, batch_size=100)

pausanias_df['nlp_docs'] = list(annotated_texts)

In [4]:
definite_article = [t for t in pausanias_df['nlp_docs'].explode() if t.lemma_ == "ὁ"]

len(definite_article)

30698

Thus, counting every token that the model lemmatizes as ὁ, we have 30,698 occurrences of the definite article in Pausanias.

### Relative ("normalized") frequency

### Hapax legomena ("once-saids")

### Zipf's Law

## Dispersion

### Range_2 (R)

### Standard Deviation

#### Sample standard deviation

### Coefficient of Variation (CV)

### Juilland's _D_

### Deviation of Proportions (DP)

## Average Reduced Frequency (ARF)

## Lexical Diversity

### Type/Token Ratio (TTR)

### Standardized Type/Token Ratio (STTR)

### Moving Average Type/Token Ratio (MATTR)
