# Intro to Quantitative Textual Analysis - Week 8: Correlations and Classifications

## Brezina 2018 ch. 5: Correlations

### Pearson's correlation (pp. 142–146)

Pearson's correlation (r) is expressed as follows:

```math
r = \frac{\text{covariance}}{SD_1 \times SD_2}
```

Covariance, in turn, is expressed as:

```math
\text{covariance} = \frac{\text{sum of multiplied distances from } \text{mean}_1 \text{ and } \text{mean}_2}{\text{total no. of cases} - 1}
```

For example, suppose that we have five documents, each represented as a pair $(N_a, N_n)$ giving its number of adjectives and its number of nouns:

In [6]:
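# Each tuple is one document: (number of adjectives, number of nouns)
docs = [(5, 10), (12, 15), (14, 25), (15, 26), (20, 30)]

def covariance(corpus: list[tuple[int, int]]) -> float:
    # Mean of each variable across the corpus
    mean_1 = sum(a for a, _ in corpus) / len(corpus)
    mean_2 = sum(b for _, b in corpus) / len(corpus)

    # Sum of multiplied distances from the two means, divided by n - 1
    return sum((mean_1 - a) * (mean_2 - b) for a, b in corpus) / (len(corpus) - 1)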

docs_covariance = covariance(docs)
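As a check on the arithmetic, the same covariance can be worked out by hand. The means are 13.2 adjectives and 21.2 nouns, and each distance below is written as value minus mean (the code subtracts the other way round, which flips both signs and leaves every product unchanged):

```math
\text{covariance} = \frac{(-8.2)(-11.2) + (-1.2)(-6.2) + (0.8)(3.8) + (1.8)(4.8) + (6.8)(8.8)}{5 - 1} = \frac{170.8}{4} = 42.7
```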

We can calculate the standard deviation for each variable (number of adjectives and number of nouns) using the **sample standard deviation** from Brezina 2018 p. 50:

```math
\text{standard deviation}_\text{sample} = \sqrt{\frac{\text{sum of squared distances from the mean}}{\text{total no. of cases} - 1}}
```

In [5]:
import math

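def sd_sample(arr: list[int]) -> float:
    mean = sum(arr) / len(arr)

    # Sum of squared distances from the mean, divided by n - 1
    return math.sqrt(sum((mean - x) ** 2 for x in arr) / (len(arr) - 1))

sd_1 = sd_sample([a for a, _ in docs])  # adjectives
sd_2 = sd_sample([b for _, b in docs])  # nouns

# Pearson's r: covariance divided by the product of the two standard deviations
docs_covariance / (sd_1 * sd_2)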

0.9384978052288936

In this case, Pearson's correlation indicates a _very_ strong positive correlation between the number of adjectives and the number of nouns in these (made-up) documents.
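The covariance and standard deviation steps can also be folded into one function and checked against Python's standard library. The sketch below is only a cross-check, not part of Brezina's presentation: the name `pearson_r` is made up here, and it relies on `statistics.stdev`, which computes the sample standard deviation (the version with the n - 1 denominator).

```python
import statistics

# Same made-up (adjectives, nouns) counts as in the notes above.
docs = [(5, 10), (12, 15), (14, 25), (15, 26), (20, 30)]


def pearson_r(pairs: list[tuple[float, float]]) -> float:
    """Pearson's r: sample covariance divided by the product of the sample SDs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    # Sum of multiplied distances from the two means, divided by n - 1
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (len(pairs) - 1)
    # statistics.stdev uses the n - 1 denominator, like sd_sample
    return cov / (statistics.stdev(xs) * statistics.stdev(ys))


r = pearson_r(docs)
print(round(r, 4))  # 0.9385
```

Getting the same value from an independent standard deviation routine is a quick way to catch slips such as dividing by n instead of n - 1.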

Pearson's correlation always falls between -1 and +1: negative numbers indicate a negative correlation, positive numbers indicate a positive correlation, and values near 0 indicate little or no linear relationship.
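To see the two ends of that range, here is a small sketch with invented counts in which the second variable is an exact linear function of the first. The lists `rising` and `falling` and the helper `pearson_r` are illustrative names, not anything from Brezina.

```python
import statistics


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson's r from two equal-length lists of values."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (statistics.stdev(xs) * statistics.stdev(ys))


adjectives = [1, 2, 3, 4, 5]
rising = [2 * a + 1 for a in adjectives]    # nouns go up as adjectives go up
falling = [20 - 3 * a for a in adjectives]  # nouns go down as adjectives go up

r_rising = pearson_r(adjectives, rising)
r_falling = pearson_r(adjectives, falling)

# Rounded because floating-point arithmetic can leave a tiny error
print(round(r_rising, 6), round(r_falling, 6))  # 1.0 -1.0
```

Real linguistic data will almost never sit exactly on a line, so values this extreme are a sign to double-check the input.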

### What to report with correlation measures

As you can probably guess, it's important to report a p-value or confidence interval with your correlation statistics in order to give your readers a sense of statistical significance (which, as it turns out, depends heavily on the number of observations: the same value of r is far more convincing when it comes from 500 documents than from five).
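One common way to get a confidence interval for r is Fisher's z-transformation. The sketch below uses that approximation; it is not necessarily the procedure Brezina describes, it assumes the two variables are roughly bivariate normal, and the function name `fisher_ci` is made up for these notes. It holds r fixed at the value computed for the five documents and varies only the number of observations.

```python
import math
from statistics import NormalDist


def fisher_ci(r: float, n: int, level: float = 0.95) -> tuple[float, float]:
    """Approximate confidence interval for Pearson's r via Fisher's z-transformation."""
    if n < 4:
        raise ValueError("the approximation needs at least 4 observations")
    z = math.atanh(r)          # Fisher's z for the observed r
    se = 1 / math.sqrt(n - 3)  # standard error of z
    crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    # Build the interval on the z scale, then transform back to the r scale
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


r = 0.9384978052288936  # the value computed for the five documents above

for n in (5, 50, 500):
    low, high = fisher_ci(r, n)
    print(f"n = {n:>3}: 95% CI = ({low:.3f}, {high:.3f})")
```

With only five documents the interval is very wide, and it narrows as n grows. That is why the number of observations belongs in the report alongside r.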

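Note that the functions we wrote don't care about the length of their input, as long as their respective type signatures are obeyed (a list of 2-tuples of `int`s for `covariance` and a list of `int`s for `sd_sample`). Both do need at least two observations, since they divide by the number of cases minus one. Nothing in the code will flag a correlation computed from very few documents, so it is up to us to report how many observations went into it.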

## Factor Analysis

Brezina 2018 (p. 164) defines **factor analysis** as "a complex mathematical procedure that reduces a large number of linguistic variables. This is done by considering correlations between variables...; those that correlate -- both positively and negatively -- are considered components of the same factor because they have a connection.... A **factor** is thus a group of related linguistic variables summarizing a more general tendency ... in the data."

See Brezina 2018 (p. 165, fig. 5.18) for an illustration of the usefulness of factor analysis.
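Since the definition above starts from "considering correlations between variables", the raw material for a factor analysis is a table of pairwise correlations. The sketch below builds only that table, not the factor extraction itself. The adjective and noun counts are the made-up ones from earlier in these notes, the `pronouns` column is invented purely for illustration, and `pearson_r` is a hypothetical helper.

```python
import statistics

# Made-up per-document counts; "pronouns" is invented for this illustration.
features = {
    "adjectives": [5, 12, 14, 15, 20],
    "nouns": [10, 15, 25, 26, 30],
    "pronouns": [22, 18, 11, 9, 4],
}


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson's r from two equal-length lists of values."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (statistics.stdev(xs) * statistics.stdev(ys))


names = list(features)

# Correlate every variable with every other variable (and with itself)
matrix = {
    a: {b: pearson_r(features[a], features[b]) for b in names}
    for a in names
}

for a in names:
    print(a.ljust(12), "  ".join(f"{matrix[a][b]:+.2f}" for b in names))
```

In this toy table, adjectives and nouns correlate positively with each other and negatively with pronouns. By the definition quoted above, all three would be candidates for the same factor, because both positive and negative correlations count as a connection. A real factor analysis would start from many more variables and documents.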