# Meeting 1

## Review Resources

- [Anaconda Crash Course](https://learning.anaconda.cloud/path/intro-to-python)
- [Learn Python the Hard Way](https://learncodethehardway.com/client/#/product/learn-python-the-hard-way-5e-2023/)

## Installing JupyterLab

You don't need to install anything locally, as we'll be working from [Colab](https://colab.research.google.com/#) notebooks -- like Google Docs, but for code.

## What is quantitative textual analysis?

> Unlike other sources of information such as mythology, philosophy or art, **science** relies on the systematic collection of empirical data and testing of theories and hypotheses. [@Brezina2018 2; emphasis original]

More tactfully (citing Popper 2005 [1935]):

> A scientific statement or theory [is] something that can in principle be falsified.... In other words, we can call a statement or theory scientific only if it can be tested empirically. [@Brezina2018: 2]

-   How do we qualify these statements?
-   Are there problems with viewing texts in this way?
-   Conversely, are there virtues in understanding textual data in this light?

## What is corpus linguistics?

> **Corpus linguistics** is a scientific method of language analysis. It requires the analyst to provide empirical evidence in the form of data drawn from language corpora in support of any statement made about language. [@Brezina2018: 2]

## Getting started

In this directory, you will find three text files containing the openings to Xenophon's *Apology*, Caesar's *De Bello Gallico*, and Jane Austen's *Pride and Prejudice*.

Let's start with *Pride and Prejudice*. Run the code block below to load the contents of the file into memory.

In [1]:
with open("austen-pride-and-prejudice.txt") as f:
    austen = f.readlines()

# Notice that `austen` is still available outside of the `with` block.
# Strip surrounding whitespace from each line and drop the blank lines (bare "\n").
print([line.strip() for line in austen if line.strip() != ""])

['It is a truth universally acknowledged, that a single man in possession of a good fortune must be in want of a wife.', 'However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered as the rightful property of some one or other of their daughters.']


## Exercises

1. Describe what each `token` in the above short code snippet is doing. (You might first need to decide what a `token` is.)
2. What data type is the variable `austen`?
3. How can we remove empty lines from `austen`?
4. How would you update the code to print the excerpt from Caesar instead?

In [2]:
# Write the code for reading the lines of Caesar below:
with open('phi0448.phi001.perseus-lat1/1.1.1-1.1.2.txt') as f:
    caesar = f.readlines()
print(caesar)

['Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur. Hi omnes lingua, institutis, legibus inter se differunt. Gallos ab Aquitanis Garumna flumen, a Belgis Matrona et Sequana dividit. ']


These exercises are just the first steps towards using corpus linguistics in your interpretive practice. The main textbook that we'll be using, @Brezina2018, doesn't do a great job of providing hands-on exercises and instead focuses on the statistics side of things. We're going to try to cover both the hands-on programming side and the statistical side in this course.

To that end, let's turn our attention to some terms and techniques that we'll need to cover.

- Corpus/sample: Collections of data. Most of the time, a "corpus" is meant to be large, like all Greek literature before 300 AD. But relatively small corpora -- like, say, all of 5th-century Athenian tragedy -- can prove useful as well.
- Dataset: Collection of findings within the data of a corpus.
- Variable: 
  - Linguistic variables: These are, generally, the things we want to measure.
  - Explanatory ("independent") variables: Descriptors for where we find linguistic variables (see [@Brezina2018 6--7]).
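To make the difference between a corpus and a dataset concrete, here is a minimal sketch. It assumes a toy corpus built from the two opening sentences printed above; the names `corpus`, `tokenize`, and `dataset`, and the naive regex tokenization, are illustrative assumptions rather than anything prescribed by @Brezina2018.

```python
import re
from collections import Counter

# Toy corpus (an assumption for illustration): one opening sentence per text.
corpus = {
    "austen": "It is a truth universally acknowledged, that a single man in "
              "possession of a good fortune must be in want of a wife.",
    "caesar": "Gallia est omnis divisa in partes tres, quarum unam incolunt "
              "Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, "
              "nostra Galli appellantur.",
}


def tokenize(text):
    """Naive tokenizer: lowercase the text and keep runs of letters only."""
    return re.findall(r"[^\W\d_]+", text.lower())


# The dataset: one row of findings per text in the corpus.
dataset = []
for label, text in corpus.items():
    tokens = tokenize(text)
    counts = Counter(tokens)
    dataset.append({
        "text": label,            # explanatory variable: which text we are in
        "tokens": len(tokens),    # how many word tokens the text contains
        "types": len(counts),     # how many distinct word forms it contains
        "freq_in": counts["in"],  # linguistic variable: occurrences of "in"
    })

for row in dataset:
    print(row)
```

A real project would swap in a tokenizer suited to the language at hand; the point here is only that the dataset is something we derive from the corpus, not the corpus itself.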


### Different kinds of variables

Variables come in three varieties: nominal, ordinal, and scale [@Brezina2018 7]:

- **Nominal** variables "represent different categories into which the cases in a dataset can be grouped; there is no order or hierarchy between the categories."
  - Ex. speaker's gender
- **Ordinal** variables, like nominal variables, can be used for grouping data, but they "can be ordered according to some inherent hierarchy."
  - Ex. speaker's foreign language proficiency
- **Scale** variables "[show] the quantity of a particular feature; ... [they] can be added, subtracted, multiplied, and divided, because they represent measurable quantities, not just rank orders."
  - Ex. relative frequency of first-person pronouns in a speaker's speech.
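The three kinds of variables allow different operations, which a short sketch can show. Everything below is hypothetical: the `speakers` records, their values, and the `PROFICIENCY_ORDER` list are invented for illustration and do not come from @Brezina2018.

```python
from collections import Counter

# Hypothetical records, one per speaker (all values are invented).
speakers = [
    {"gender": "f", "proficiency": "advanced", "pronouns": 12, "tokens": 400},
    {"gender": "m", "proficiency": "beginner", "pronouns": 3, "tokens": 250},
    {"gender": "f", "proficiency": "intermediate", "pronouns": 9, "tokens": 600},
]

# Nominal: categories have no order, so all we can do is group and count.
by_gender = Counter(s["gender"] for s in speakers)

# Ordinal: categories have an inherent order, which we have to spell out.
PROFICIENCY_ORDER = ["beginner", "intermediate", "advanced"]
ranked = sorted(speakers, key=lambda s: PROFICIENCY_ORDER.index(s["proficiency"]))

# Scale: measurable quantities, so arithmetic is meaningful.
# Here: first-person pronouns per 1,000 tokens, then the mean across speakers.
for s in speakers:
    s["rel_freq"] = 1000 * s["pronouns"] / s["tokens"]
mean_rel_freq = sum(s["rel_freq"] for s in speakers) / len(speakers)

print(by_gender)
print([s["proficiency"] for s in ranked])
print(mean_rel_freq)
```

Note that averaging `gender` or `proficiency` would make no sense, while averaging `rel_freq` does.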


## Measures of central tendency

- Frequency distributions and averages help us identify outliers in our data.
- What are the measures of central tendency with which you're familiar?
- How are they calculated?
- What is a normal distribution?
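As a starting point for the questions above, here is a small sketch using Python's built-in `statistics` module. The `lengths` list is hypothetical data (imagined sentence lengths in words) with one deliberate outlier, so that we can see how differently the mean and the median react to it.

```python
import statistics

# Hypothetical sentence lengths in words; 60 is a deliberate outlier.
lengths = [8, 12, 12, 15, 18, 60]

mean = statistics.mean(lengths)      # sum of the values divided by their count
median = statistics.median(lengths)  # middle value of the sorted data
mode = statistics.mode(lengths)      # most frequent value

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```

The single long sentence pulls the mean well above the median, which is one reason to look at more than one measure before describing a "typical" value.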


## Dispersion measures

Define the following:

- Range_1
- Interquartile range
- Standard deviation
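Once you have tried your own definitions, the sketch below shows one way to compute these measures with the `statistics` module. The `values` list is hypothetical, the range is taken here as the simple maximum-minus-minimum (check this against how @Brezina2018 defines Range_1), and quartiles can be computed by several conventions, so other tools may report a slightly different interquartile range.

```python
import statistics

# Hypothetical measurements (e.g. frequencies of a word in seven texts).
values = [2, 4, 4, 5, 7, 9, 11]

# Range: distance between the largest and the smallest value.
value_range = max(values) - min(values)

# Interquartile range: distance between the first and third quartiles.
# statistics.quantiles uses the "exclusive" method by default.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Standard deviation: sample version, dividing by n - 1.
# Use statistics.pstdev instead if the data are the whole population.
sd = statistics.stdev(values)

print("range:", value_range)
print("IQR:", iqr)
print("standard deviation:", sd)
```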

## Statistical tests

- How do we determine if a result is statistically significant?

## Biases

According to @Brezina2018 [17--18], with what biases should we be concerned?

What biases might be present in our tiny test corpus (`austen`) right now?

## Homework

1. @Brezina2018 1.7 Exercises (pp. 32--36); you can skip Exercise 1.
2. Practice loading the included Greek and Latin texts into a Python notebook. What issues do you encounter?


In [3]:
arr = [1, 2, 3, 4, 5]

# Arithmetic mean: the sum of the values divided by the number of values.
sum(arr) / len(arr)

3.0