# Semantics and Discourse

(Brezina 2018: ch. 3, pp. 66–75)

## Collocations

Definitions (@Brezina2018, 67): 

- **Collocation**: a group of two or more words "that habitually co-occur in texts and corpora."
- **Collocation measures**: "statistical measures that calculate the strength of association between words based on different aspects of the co-occurrence relationship."
- **Node**: "word that we want to search for and analyse."
- **Collocates**: "words that co-occur with the node in a specifically defined **span** around the node, which we call the **collocation window**."
- **Observed frequency of collocation**: Number of times that a **collocate** appears with a **node**.
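
To make these terms concrete, here is a minimal sketch that counts observed frequencies inside a symmetric collocation window. The function name `window_collocates`, its `span` parameter, and the toy sentence are illustrative assumptions rather than anything taken from Brezina.

```python
from collections import Counter

def window_collocates(tokens, node, span=2):
    """Count the words found within `span` tokens to the left or right of each occurrence of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left = tokens[max(0, i - span):i]      # up to `span` tokens before the node
        right = tokens[i + 1:i + 1 + span]     # up to `span` tokens after the node
        counts.update(left + right)
    return counts

# Toy input (a made-up sentence, not a real corpus)
tokens = "the old statue stands near the old temple".split()
observed = window_collocates(tokens, node="old", span=1)
print(observed)
```

Each value in `observed` is the observed frequency of one collocate with the node "old" in a window of one word on either side.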

## The simple approach

> Discuss: Why might one avoid a basic ranked list of collocates?


## A more sophisticated approach

- **Expected frequency of collocation**

```python
expected_collocate_freq = (node_freq * collocate_freq * window_size) / n_tokens_in_corpus
```
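
The formula above can be wrapped in a small function and tried with numbers. This is only a sketch: the counts are invented for illustration, and `window_size` is assumed here to mean the total number of positions in the window (left plus right).

```python
def expected_collocate_freq(node_freq, collocate_freq, window_size, n_tokens_in_corpus):
    """Expected co-occurrence frequency if node and collocate were distributed independently."""
    return (node_freq * collocate_freq * window_size) / n_tokens_in_corpus

# Hypothetical counts, not drawn from any real corpus
e11 = expected_collocate_freq(
    node_freq=50,
    collocate_freq=200,
    window_size=2,            # assumed: one position to the left, one to the right
    n_tokens_in_corpus=100_000,
)
print(e11)
```

With these made-up counts the expected frequency is well below 1, so even a handful of observed co-occurrences would exceed it.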

> Discuss: Explain Brezina's example of "my" and "love" in Robert Burns' "A Red, Red Rose."

> Discuss: What problems does one encounter if one uses this approach blindly?


## Association Measures

Let's prepare to explore these collocation measures by loading up a dataframe of Pausanias.

```python
# I've created a utils.py file for frequently reused functionality -- you can import from it like so
from utils import load_pausanias

pausanias_df = load_pausanias()
```

When calculating co-occurrences in Greek, it is generally insufficient to use the left (L) and right (R) windows that Brezina uses for English (@Brezina2018, 67–70), because Greek word order is flexible and related words are often not adjacent. Instead, we'll look for a dependency relationship between the **node** and its **collocates**. Below, you can see that we can access the dependents of a token through its `children` property.

```python
test_token = pausanias_df['nlp_docs'][0][1]

# we use a list comprehension to evaluate the generator at `test_token.children`
f"token: '{test_token}, {test_token.lemma_}', dependencies: {[(c, c.lemma_) for c in test_token.children]}"
```

Notice that we're also accessing the `lemma_` property here. Because Greek is heavily inflected, we'll tend to focus on collocations of lemmata, rather than types -- but you might find in your own work that it is interesting to look at type collocations instead. Just be sure to note which kind of "word" you're examining.

### Frequency of co-occurrence

The frequency of co-occurrence counts how many times a **node** (`w1`) and a **collocate** (`w2`) appear together. Here, "together" means that one is a direct dependent of the other. Given a dataframe like `pausanias_df`, we can calculate the frequency of co-occurrence using something like the following:

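```python
w1 = 'ἄγαλμα'
w2 = 'εἰμί'

def count_collocations(x, w1, w2):
    """Count tokens in doc `x` where one lemma has the other as a direct dependent."""
    # tokens with lemma w1 that have at least one child with lemma w2
    w2_is_child_of_w1 = len([t for t in x if t.lemma_ == w1 and w2 in [tt.lemma_ for tt in t.children]])
    # tokens with lemma w2 that have at least one child with lemma w1
    w1_is_child_of_w2 = len([t for t in x if t.lemma_ == w2 and w1 in [tt.lemma_ for tt in t.children]])

    return w2_is_child_of_w1 + w1_is_child_of_w2

# one count per row of the dataframe
pausanias_df['agalma_eimi_collocations'] = pausanias_df['nlp_docs'].apply(count_collocations, args=(w1, w2))

# show only the rows with at least one co-occurrence
pausanias_df[pausanias_df['agalma_eimi_collocations'] > 0]
```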
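
To get a single observed frequency (O11) for the whole text, the per-row counts can be summed. The sketch below shows the idea without loading a language model: `StubToken` is a hypothetical stand-in that mimics only the `lemma_` and `children` attributes used above, and the two toy "documents" are invented.

```python
from dataclasses import dataclass, field

@dataclass
class StubToken:
    """Hypothetical stand-in exposing only the two attributes the counting logic needs."""
    lemma_: str
    children: list = field(default_factory=list)

def count_collocations(doc, w1, w2):
    """Count tokens where one lemma has the other as a direct dependent, in either direction."""
    def heads_with_child(head, child):
        return sum(
            1 for t in doc
            if t.lemma_ == head and any(c.lemma_ == child for c in t.children)
        )
    return heads_with_child(w1, w2) + heads_with_child(w2, w1)

# Toy documents: in the first, εἰμί governs ἄγαλμα; the second contains neither word
copula = StubToken("εἰμί", children=[StubToken("ἄγαλμα")])
doc_a = [copula, *copula.children]
doc_b = [StubToken("ναός")]

per_doc = [count_collocations(d, "ἄγαλμα", "εἰμί") for d in (doc_a, doc_b)]
o11 = sum(per_doc)  # observed frequency across all documents
print(per_doc, o11)
```

With the real dataframe, the equivalent total would be the sum of the `agalma_eimi_collocations` column.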



### Mutual Uninformation (MU)

The ratio of the **observed** frequency (O11) to the **expected** frequency (E11).

If the ratio is greater than 1, the words co-appear more frequently than expected.
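
As a minimal sketch, the ratio can be computed directly. The function name `mu` and the example values are illustrative assumptions.

```python
def mu(o11, e11):
    """Ratio of observed to expected co-occurrence frequency."""
    if e11 <= 0:
        raise ValueError("expected frequency must be positive")
    return o11 / e11

# Hypothetical values: 5 observed co-occurrences against 0.25 expected
score = mu(o11=5, e11=0.25)
print(score)
```

A score above 1, as here, means the pair co-occurs more often than chance alone would predict.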

### Mutual Information (MI)

MI = log<sub>2</sub>(O11 / E11)
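
A minimal sketch of this calculation follows. The function name `mutual_information` and the example values are assumptions made for illustration. Because the logarithm of zero is undefined, the sketch rejects pairs that never co-occur.

```python
import math

def mutual_information(o11, e11):
    """Base-2 logarithm of the observed-to-expected frequency ratio."""
    if o11 <= 0 or e11 <= 0:
        raise ValueError("observed and expected frequencies must be positive")
    return math.log2(o11 / e11)

# Hypothetical values: 8 observed co-occurrences against 2 expected
mi = mutual_information(o11=8, e11=2)
print(mi)
```

Taking the logarithm puts the ratio on a scale where 0 means "exactly as often as expected" and each additional point means a doubling of the observed-to-expected ratio.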

#### MI2


#### MI3


### Log-likelihood (LL)


### Z-score<sub>1</sub>


### T-score


### Dice


### Log Dice


### Log ratio


### Minimum Sensitivity (MS)


### Delta P


### Cohen's *d*


## Directionality and Dispersion