# Semantics and Discourse

(Brezina 2018: ch. 3, pp. 66–75)

## Collocations

Definitions (@Brezina2018, 67): 

- **Collocation**: a group of two or more words "that habitually co-occur in texts and corpora."
- **Collocation measures**: "statistical measures that calculate the strength of association between words based on different aspects of the co-occurrence relationship."
- **Node**: "word that we want to search for and analyse."
- **Collocates**: "words that co-occur with the node in a specifically defined **span** around the node, which we call the **collocation window**."
- **Observed frequency of collocation**: Number of times that a **collocate** appears with a **node**.

## The simple approach

> Discuss: Why might one avoid a basic ranked list of collocates?


## A more sophisticated approach

- **Expected frequency of collocation**

```python
expected_collocate_freq = (node_freq * collocate_freq * window_size) / n_tokens_in_corpus
```
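To make the arithmetic concrete, here is a minimal worked sketch of the formula above. The function name `toy_expected_frequency` and all of the counts are hypothetical, chosen only so the numbers are easy to follow; they do not come from Brezina or from any real corpus.

```python
def toy_expected_frequency(node_freq, collocate_freq, window_size, n_tokens_in_corpus):
    """Random co-occurrence baseline for a node/collocate pair."""
    return (node_freq * collocate_freq * window_size) / n_tokens_in_corpus


# Hypothetical counts: a node seen 50 times, a collocate seen 200 times,
# a window of 2 tokens (1L, 1R), and a corpus of 100,000 tokens.
toy_expected = toy_expected_frequency(
    node_freq=50, collocate_freq=200, window_size=2, n_tokens_in_corpus=100_000
)
print(toy_expected)
```

With these made-up numbers, we would expect the pair to co-occur far less than once by chance, so even a handful of observed co-occurrences would look noteworthy.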

> Discuss: Explain Brezina's example of "my" and "love" in Robert Burns' "A Red, Red Rose."

> Discuss: What problems does one encounter if one uses this approach blindly?


## Association Measures

Let's prepare to explore these collocation measures by loading up a dataframe of Pausanias.

In [None]:
# I've created a utils.py file for frequently reused functionality -- you can import from it like so
from utils import load_pausanias

pausanias_df = load_pausanias('eng') # passing 'eng' loads the English version

When calculating co-occurrences in Greek, it is generally insufficient to use the L and R windows that Brezina uses for English (@Brezina2018 67–70). Instead, we'll look for a dependency relationship between the **node** and its **collocates**. Below, you can see that we can access the dependencies of a token through its `children` property.

In [None]:
test_token = pausanias_df['nlp_docs'][0][1]

# we use a list comprehension to evaluate the generator at `test_token.children`
f"token: '{test_token}, {test_token.lemma_}', dependencies: {[(c, c.lemma_) for c in test_token.children]}"

Notice that we're also accessing the `lemma_` property here. Because Greek is heavily inflected, we'll tend to focus on collocations of lemmata, rather than types -- but you might find in your own work that it is interesting to look at type collocations instead. Just be sure to note which kind of "word" you're examining.
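The difference between counting types and counting lemmata can be sketched without any NLP pipeline. The `(form, lemma)` pairs below are a hand-built toy list, not output from the Pausanias DataFrame.

```python
from collections import Counter

# Hand-built (surface form, lemma) pairs for illustration only
toy_tokens = [
    ("ἄγαλμα", "ἄγαλμα"),
    ("ἀγάλματος", "ἄγαλμα"),
    ("ἀγάλματα", "ἄγαλμα"),
    ("ἄγαλμα", "ἄγαλμα"),
]

# Counting by type keeps each inflected form separate
type_counts = Counter(form for form, _ in toy_tokens)

# Counting by lemma groups all inflected forms under one headword
lemma_counts = Counter(lemma for _, lemma in toy_tokens)

print(type_counts)
print(lemma_counts)
```

The type counter spreads the four tokens over three entries, while the lemma counter collects them under a single entry, which is why lemma-based collocation counts tend to be less sparse for an inflected language.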

### Frequency of co-occurrence

The frequency of co-occurrence counts how often a **node** (`w1`) and a **collocate** (`w2`) appear together. Given a DataFrame like `pausanias_df`, we can calculate the frequency of co-occurrence in two different ways.

We can either count when the collocate is a dependency of the node, like so:

In [None]:
node = 'statue'
collocate = 'the'

def count_dependency_collocations(x, w1, w2):
    w2_is_child_of_w1 = len([t for t in x if t.lemma_ == w1 and w2 in [tt.lemma_ for tt in t.children]])

    return w2_is_child_of_w1

pausanias_df['agalma_megas_dependencies'] = pausanias_df['nlp_docs'].apply(count_dependency_collocations, args=(node, collocate))

observed_dep_freq_agalma_megas = pausanias_df[pausanias_df['agalma_megas_dependencies'] > 0].shape[0]

observed_dep_freq_agalma_megas

Or we can count when the collocate and node co-occur within a given window, as follows:

In [None]:
def count_ngram_collocations(x, w1, w2, l_size: int = 1, r_size: int = 1):
    lemmata = [t.lemma_ for t in x]

    indexes = [i for i, lemma in enumerate(lemmata) if lemma == w1]

    cooccurrences = 0

    for i in indexes:
        left = max(i - l_size, 0)
        right = min(i + r_size + 1, len(lemmata))

        window = lemmata[left:right]

        if w2 in window:
            cooccurrences += 1
            
    return cooccurrences

pausanias_df['agalma_megas_1l-1r'] = pausanias_df['nlp_docs'].apply(count_ngram_collocations, args=(node, collocate))

observed_1l_1r_freq_agalma_megas = pausanias_df[pausanias_df['agalma_megas_1l-1r'] > 0].shape[0]

observed_1l_1r_freq_agalma_megas


We can see above that a 1L, 1R window detects 17 collocations of ἄγαλμα and μέγας, 4 *more* than we detected as dependencies. Remember, these collocations aren't necessarily related grammatically anymore, but it's interesting to see how the count changes. Let's try with a larger window:

In [None]:
pausanias_df['agalma_megas_2l-2r'] = pausanias_df['nlp_docs'].apply(count_ngram_collocations, args=(node, collocate, 2, 2))

observed_2l_2r_freq_agalma_megas = pausanias_df[pausanias_df['agalma_megas_2l-2r'] > 0].shape[0]

observed_2l_2r_freq_agalma_megas

You can experiment in your own notebooks by adjusting the `l_size` and `r_size` args passed to the `count_ngram_collocations` function.
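If you want to see how the window size affects the count without loading the corpus, the following sketch mirrors the windowing logic of `count_ngram_collocations` but works on a plain list of strings. The helper name `count_in_window` and the toy sentence are assumptions made for this illustration.

```python
def count_in_window(lemmata, w1, w2, l_size=1, r_size=1):
    """Count occurrences of w1 that have w2 somewhere in the surrounding window."""
    hits = 0
    for i, lemma in enumerate(lemmata):
        if lemma != w1:
            continue
        # clamp the window to the boundaries of the list
        left = max(i - l_size, 0)
        right = min(i + r_size + 1, len(lemmata))
        if w2 in lemmata[left:right]:
            hits += 1
    return hits


# A made-up sentence, already reduced to lemmata
toy_lemmata = ["the", "great", "bronze", "statue", "of", "the", "god"]

for size in (1, 2, 3):
    print(size, count_in_window(toy_lemmata, "statue", "great", size, size))
```

Here "great" is two positions to the left of "statue", so it is missed by a 1L, 1R window and picked up once the window grows to 2L, 2R.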

### Evert's μ (Mu)

(Brezina never, as far as I can tell, defines this term, and confusingly refers to it with capital letters. I'm still trying to figure out why he does this.)

Stephanie<sup>*</sup> @Evert2005 [54] defines μ as follows:

> For μ > 1 we speak of positive association (where the components are more likely to occur together than if they were independent), and for μ < 1 we speak of negative association (where the components are less likely to occur together than if they were independent).

She adds the following in a note:

> The letter μ is intended to be reminiscent of _mutual information_, since the quantity log(μ) can be interpreted as point-wise mutual information. I have avoided using this term for μ, though, so as not to confuse information theory with population parameters.

In other words, if μ is greater than 1, the words co-occur more frequently than expected; if it is less than 1, they co-occur less frequently than expected.

We calculate μ by taking the ratio of the **observed** frequency (represented by O11 in a contingency table) and **expected** frequency (E11).
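As a quick illustration of the ratio, here is a minimal sketch with invented frequencies. The function name `mu_ratio` and the numbers are hypothetical.

```python
def mu_ratio(observed, expected):
    """Ratio of observed (O11) to expected (E11) frequency of collocation."""
    if expected <= 0:
        raise ValueError("expected frequency must be positive")
    return observed / expected


# Invented values: the same expected frequency with two different observed counts
print(mu_ratio(12, 4.0))  # observed more often than expected, so the ratio is above 1
print(mu_ratio(2, 4.0))   # observed less often than expected, so the ratio is below 1
```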


<sup>*</sup> Note that you might find references to Stephanie Evert's work (including the 2005 doctoral thesis cited above) under her former name, Stefan Evert.

#### Observed Frequency

As discussed above, we're calculating observed frequency of collocation by looking at dependency trees. This is a somewhat more complicated procedure than simply looking to the left and right of a word, and we'll need to account for it when we calculate the **random co-occurrence baseline**, the result of which is the **expected frequency of collocation**.

#### Expected Frequency 

Expected frequency is calculated by taking the frequency of the **node** in the entire corpus times the frequency of the **collocate**, divided by the number of tokens in the corpus.

Without any corrections, this method assumes that the tokens co-occur right next to each other (either before or after, but not both). To correct for the greater probability of the tokens co-occurring when we are not looking at immediate adjacency, we multiply the numerator in this equation by the **window size**.

_However_, the use of syntactic dependency trees obviates this correction: window size would mean something like the number of adjacent dependency trees to check for a collocation, which would be linguistically meaningless.

Window size is thus a correction for doing n-gram–based analyses of collocations.

In [None]:

from collections import Counter
import pandas as pd

def expected_frequency_of_collocation(df: pd.DataFrame, node: str, collocate: str, window_size: int = 1):
    """
    `node` and `collocate` should be the string representations
    of the associated lemmata
    """

    lemmata = [t.lemma_ for t in df['nlp_docs'].explode()]
    counter = Counter(lemmata)
    node_count = counter[node]
    collocate_count = counter[collocate]

    return (node_count * collocate_count * window_size) / len(lemmata)


expected_freq_agalma_megas = expected_frequency_of_collocation(pausanias_df, node, collocate)

mu_deps = observed_dep_freq_agalma_megas / expected_freq_agalma_megas
mu_1l_1r = observed_1l_1r_freq_agalma_megas / expected_freq_agalma_megas

print(f"μ for statue with dependency the: {mu_deps}\n\nμ for statue and the in a 1L, 1R window: {mu_1l_1r}")

### Mutual Information (MI)

Mutual information measures how much the appearance of one word in our collocate pair suggests the appearance of the other word. We can calculate it by taking the log<sub>2</sub> of `mu` (observed / expected frequency).
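Because MI is just the base-2 logarithm of μ, a μ of exactly 1 corresponds to an MI of 0, and each doubling of the observed frequency adds 1 to the score. The sketch below shows this with invented frequencies; `toy_mutual_information` is a hypothetical helper, not part of `utils.py`.

```python
import math


def toy_mutual_information(observed, expected):
    """log2(O11 / E11); undefined when the pair is never observed."""
    if observed <= 0 or expected <= 0:
        raise ValueError("observed and expected frequencies must be positive")
    return math.log2(observed / expected)


# Invented values: hold the expected frequency at 4 and vary the observed count
for observed in (2, 4, 8):
    print(observed, toy_mutual_information(observed, 4.0))
```

Note that the score cannot be computed for a pair that never co-occurs, since the logarithm of 0 is undefined.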

In [None]:
import math

mutual_information_agalma_megas_deps = math.log(mu_deps, 2)

mutual_information_agalma_megas_deps

In [None]:
mutual_information_agalma_megas_1l_1r = math.log(mu_1l_1r, 2)
mutual_information_agalma_megas_1l_1r

A value greater than 0 (that is, a μ greater than 1) indicates that the presence of the node -- in this demo, statue -- suggests the presence of the collocate -- here, the.

Does the same hold true in the other direction, that is, when statue is a dependent of the?

In [None]:
pausanias_df['megas_agalma_collocations'] = pausanias_df['nlp_docs'].apply(count_dependency_collocations, args=(collocate, node))

observed_freq_megas_agalma = pausanias_df[pausanias_df['megas_agalma_collocations'] > 0].shape[0]

## Note that the expected frequency does not change depending on which direction the dependency goes
mu = observed_freq_megas_agalma / expected_freq_agalma_megas

mutual_information_megas_agalma = math.log(mu, 2)

mutual_information_megas_agalma

Here we find that the appearance of μέγας slightly implies the non-appearance of ἄγαλμα among its dependencies. This makes sense: μέγας is an adjective, and we wouldn't expect it to govern a noun.

The other association measures covered by Brezina tend to be variations on this theme, using mostly the same inputs with adjustments to the weights to make the calculation more or less sensitive to the collocate pair's **exclusivity**.
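To show what "variations on this theme" can look like, here is a rough sketch of a few measures as they are commonly stated in the collocation literature. Treat the formulas as assumptions to check against Brezina's own table before relying on them; the function name and the input numbers are invented.

```python
import math


def toy_association_scores(o11, e11, r1, c1):
    """
    o11: observed frequency of the pair; e11: expected frequency;
    r1: frequency of the node; c1: frequency of the collocate.
    """
    return {
        "MI": math.log2(o11 / e11),
        # MI2 and MI3 raise O11 to a power, giving more weight to frequent pairs
        "MI2": math.log2(o11 ** 2 / e11),
        "MI3": math.log2(o11 ** 3 / e11),
        # T-score and Z-score compare the difference O11 - E11 to a scaling term
        "T-score": (o11 - e11) / math.sqrt(o11),
        "Z-score": (o11 - e11) / math.sqrt(e11),
        # log Dice uses only O11, R1, and C1, with no expected frequency
        "log Dice": 14 + math.log2(2 * o11 / (r1 + c1)),
    }


toy_scores = toy_association_scores(o11=8, e11=2, r1=16, c1=16)
for name, value in toy_scores.items():
    print(f"{name}: {value:.3f}")
```

Comparing the rankings these scores produce for the same set of pairs is one way to get a feel for how each measure trades off frequency against exclusivity.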

> Discuss: Define "exclusivity" in the context of collocations.


## Directionality and Dispersion

If we want to measure the **directionality** of the association, however, we need to use a calculation like **Delta P**, which reports two statistics: one for the predictability of the node with respect to the collocate, and one for the predictability of the collocate to the node.

### Delta P

Translated into English, Delta P looks for:

- The observed frequency of the collocate pair in the corpus (O11), divided by the frequency of the node in the corpus (R1)
  - minus the observed frequency of _the collocate **without** the node_ in the corpus (O21), divided by the tokens that are not the node in the corpus (R2)

AND

- The observed frequency of the collocate pair (O11), divided by the frequency of the *collocate* in the corpus (C1)
  - minus the observed frequency of _the node **without** the collocate_ (O12), divided by the tokens that are not the collocate (C2)

Notice that it does not take expected frequencies into account.
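The two statistics described above can be sketched as a single function over the cells of a contingency table. The function name `toy_delta_p` and the counts are hypothetical; O21 and O12 are derived from the marginal frequencies in the same way as in the description.

```python
def toy_delta_p(o11, r1, c1, n_tokens):
    """
    o11: co-occurrences of node and collocate
    r1: frequency of the node; c1: frequency of the collocate
    n_tokens: number of tokens in the corpus
    Returns both Delta P statistics as a tuple.
    """
    r2 = n_tokens - r1  # tokens that are not the node
    c2 = n_tokens - c1  # tokens that are not the collocate
    o21 = c1 - o11      # the collocate without the node
    o12 = r1 - o11      # the node without the collocate

    first = (o11 / r1) - (o21 / r2)   # O11 / R1 minus O21 / R2
    second = (o11 / c1) - (o12 / c2)  # O11 / C1 minus O12 / C2
    return first, second


# Invented counts: a rare node paired with a very frequent collocate
first, second = toy_delta_p(o11=30, r1=100, c1=1000, n_tokens=10_000)
print(first, second)
```

With these made-up counts the two values differ by roughly a factor of ten, which is the kind of asymmetry that a single symmetric score such as MI cannot show.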



In [None]:
node = "temple"
collocate = "a"

pausanias_df['o11_temple_a_ngrams'] = pausanias_df['nlp_docs'].apply(count_ngram_collocations, args=(node, collocate))

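# O11: total number of times the node has the collocate in its 1L, 1R window
o11 = pausanias_df['o11_temple_a_ngrams'].sum()

all_tokens = pausanias_df['nlp_docs'].explode()

r1 = len([t for t in all_tokens if t.lemma_ == node])  # R1: frequency of the node
r2 = len(all_tokens) - r1  # R2: tokens that are not the node
c1 = len([t for t in all_tokens if t.lemma_ == collocate])  # C1: frequency of the collocate
o21 = c1 - o11  # O21: the collocate without the node

# the first of the two Delta P statistics described above
(o11 / r1) - (o21 / r2)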




### Dispersion

Dispersion can be measured in pretty much the same way as in Section 2.4 of @Brezina2018:

1. Divide the corpus into chunks (size doesn't matter as long as you normalize by corpus length). 
2. Calculate the expected proportions: (# tokens in chunk / window size) / (# tokens in corpus / window size)
3. Calculate observed proportions of collocation: (# of cols. in chunk) / (# of cols. in corpus)

From here, you have all the information you need to calculate the Deviation of Proportions (DP).
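The steps above can be sketched as follows, assuming the usual definition of DP as half the sum of the absolute differences between observed and expected proportions. Note that the window size cancels out of the expected proportion, so the sketch uses chunk sizes directly. The function name and the chunk data are invented.

```python
def toy_deviation_of_proportions(chunk_sizes, chunk_collocation_counts):
    """
    chunk_sizes: number of tokens in each chunk
    chunk_collocation_counts: number of collocations found in each chunk
    """
    total_tokens = sum(chunk_sizes)
    total_collocations = sum(chunk_collocation_counts)
    if total_collocations == 0:
        raise ValueError("the collocation does not occur in the corpus")

    expected = [size / total_tokens for size in chunk_sizes]
    observed = [count / total_collocations for count in chunk_collocation_counts]

    # half the summed absolute differences between the two sets of proportions
    return sum(abs(o - e) for o, e in zip(observed, expected)) / 2


toy_chunk_sizes = [1000, 1000, 2000]
print(toy_deviation_of_proportions(toy_chunk_sizes, [5, 5, 10]))  # spread in line with chunk size
print(toy_deviation_of_proportions(toy_chunk_sizes, [20, 0, 0]))  # concentrated in one chunk
```

A value near 0 means the collocation is spread across chunks in proportion to their size; values closer to 1 mean it is concentrated in a small part of the corpus.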

You can also calculate Cohen's *d*:

- Let X be the mean of the frequencies of the collocation in each chunk
- Let Y be the mean of the frequencies in each chunk where either the node or collocate is absent
  - i.e., Y = the mean of the frequencies of the node or collocate minus the number of collocations in a given chunk
- Let S<sub>X</sub> be the standard deviation of the frequencies used in X
- Let S<sub>Y</sub> be the standard deviation of the frequencies used in Y


d = (X - Y) / sqrt((S<sub>X</sub><sup>2</sup> + S<sub>Y</sub><sup>2</sup>) / 2)
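A minimal sketch of this formula in Python, using the standard library. It assumes the sample standard deviation (`statistics.stdev`); if you prefer the population standard deviation, swap in `statistics.pstdev`. The function name and the per-chunk frequencies are invented.

```python
import math
import statistics


def toy_cohens_d(x_freqs, y_freqs):
    """
    x_freqs: per-chunk frequencies of the collocation
    y_freqs: per-chunk frequencies where the node or collocate appears without the other
    """
    mean_x = statistics.mean(x_freqs)
    mean_y = statistics.mean(y_freqs)
    s_x = statistics.stdev(x_freqs)
    s_y = statistics.stdev(y_freqs)

    # denominator: square root of the average of the two variances
    pooled = math.sqrt((s_x ** 2 + s_y ** 2) / 2)
    return (mean_x - mean_y) / pooled


# Invented per-chunk frequencies for three chunks
toy_d = toy_cohens_d([4, 6, 8], [1, 2, 3])
print(round(toy_d, 3))
```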


## Exercises

You may work in groups for these exercises. They are due at the beginning of next class. You can submit them as a link to a Colab notebook or GitHub CodeSpace.

0. Use the corpora that you assembled last week (Pausanias++):
1. Using programming techniques from the course so far, find other potential collocates for a word of your choice.
2. Calculate the μ and Mutual Information scores for at least 5 of these collocate pairs. How do your results change depending on your definition of a collocation? What might these changes mean? (Write your answers to these questions down.)
3. Calculate the Delta P for these same five pairs. Do any results stand out? Why? What might they tell us about your corpus?