# Intro to Quantitative Textual Analysis - Week 10

## Diachronic corpora: Change over time (Brezina 2018, ch. 7)

### Key terms

- longitudinal study
- diachronic corpora
- lockwords
- bootstrapping test

### Visualization techniques

- Line graph
- Candlestick plot
- Sparkline

## Warm-up: N-grams

By now, we're all familiar with some charts created through Google's [Ngram Viewer](https://books.google.com/ngrams/). But what do these charts actually show us, and what are their limitations?

Search for a few terms of your choosing in the Ngram Viewer. Try changing the time scale or zooming in and out.

With a partner or in small groups, discuss the following questions:

1. What do these n-grams show us?
2. What corpus are they using?
3. What are the limitations of this kind of charting?
4. What data are we missing for more sophisticated analyses?

## Colors over time

In the `./data` directory, you'll find `colours-data.csv`, a CSV dataset provided by Brezina. Each row gives a year, followed by the relative frequencies of several color terms for that year. We'll use these data to practice visualizing change over time.

In [1]:
# install dependencies
%pip install altair pandas


In [2]:
# import altair for visualization
import altair as alt

# import pandas for data-wrangling
import pandas as pd

In [None]:
# load the data
colors_df = pd.read_csv("data/colours-data.csv")
colors_df

| | Year | red | blue | green | yellow | orange | grey |
|---|---|---|---|---|---|---|---|
| 0 | 1600 | 38.13 | 0.15 | 33.34 | 6.28 | 6.28 | 1.05 |
| 1 | 1601 | 64.43 | 1.01 | 41.61 | 23.16 | 23.16 | 0.00 |
| 2 | 1602 | 45.30 | 1.20 | 21.79 | 7.38 | 7.38 | 0.34 |
| 3 | 1603 | 33.03 | 0.45 | 25.79 | 7.99 | 7.99 | 1.66 |
| 4 | 1604 | 24.60 | 0.96 | 21.33 | 3.46 | 3.46 | 2.69 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 1695 | 44.78 | 5.80 | 23.54 | 12.79 | 12.79 | 2.73 |
| 96 | 1696 | 37.30 | 5.27 | 31.43 | 15.35 | 15.35 | 3.20 |
| 97 | 1697 | 48.87 | 10.09 | 27.53 | 20.72 | 20.72 | 1.95 |
| 98 | 1698 | 57.86 | 5.20 | 28.54 | 13.52 | 13.52 | 2.47 |


In [8]:
%pip install vega_datasets



In [10]:
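# load the example stocks dataset from vega_datasets;
# it is already in long format (one row per symbol per date)
from vega_datasets import data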

source = data.stocks()
source


| | symbol | date | price |
|---|---|---|---|
| 0 | MSFT | 2000-01-01 | 39.81 |
| 1 | MSFT | 2000-02-01 | 36.35 |
| 2 | MSFT | 2000-03-01 | 43.22 |
| 3 | MSFT | 2000-04-01 | 28.37 |
| 4 | MSFT | 2000-05-01 | 25.45 |
| ... | ... | ... | ... |
| 555 | AAPL | 2009-11-01 | 199.91 |
| 556 | AAPL | 2009-12-01 | 210.73 |
| 557 | AAPL | 2010-01-01 | 192.06 |
| 558 | AAPL | 2010-02-01 | 204.62 |


In [48]:
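# reshape the table from wide to long format for easier charting:
# one row per (Year, color) pair instead of one column per color
# the first column is Year; every remaining column is a color
value_vars = colors_df.columns.to_list()[1:]
by_color = pd.melt(
    colors_df, id_vars=["Year"], value_vars=value_vars, value_name="relative frequency"
).rename(columns={"variable": "color"})

# cast Year to a string so that the temporal type ("Year:T") below
# reads each value as a calendar year rather than as a numeric timestamp
by_color['Year'] = by_color['Year'].astype(str)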

In [54]:
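# chart the data as one line per color, with a hollow point at each year
# scale(None) turns off the color scale, so each value in the "color" column
# ("red", "blue", ...) is used directly as the color of its line
alt.Chart(by_color).mark_line(point=alt.OverlayMarkDef(filled=False, fill="white")).encode(
    x="Year:T",
    y="relative frequency:Q",
    color=alt.Color("color").scale(None)
).properties(width=1000)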

### Your turn

> Discuss: What kinds of information can you glean from this chart? How can you make the chart more useful?

1. Using the [Altair](https://altair-viz.github.io/user_guide/data.html) docs, add a tooltip showing the year, color, and relative frequency when you hover over a point.
2. The visualization helps very little for grey and blue -- can you figure out how to "zoom in" and get a meaningful sense of their change over time?
3. Can you regroup the data by decade (e.g., `[1600, 1609], [1610, 1619], etc.`) and plot the results as a [box plot](https://altair-viz.github.io/user_guide/marks/boxplot.html)?
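
As a starting point for the third task, here is a minimal sketch of one possible way to assign each year to its decade with pandas. The `sample` frame holds only the handful of `Year` and `red` values visible in the table output above, and the names `sample`, `decade`, and `summary` are made up for illustration; the full exercise would apply the same idea to every color.

```python
import pandas as pd

# A few rows of Year and red copied from the table output above (not the full dataset)
sample = pd.DataFrame(
    {
        "Year": [1600, 1601, 1602, 1603, 1604, 1695, 1696, 1697, 1698],
        "red": [38.13, 64.43, 45.30, 33.03, 24.60, 44.78, 37.30, 48.87, 57.86],
    }
)

# Integer-divide by 10 to drop the final digit, then multiply back: 1604 -> 1600
sample["decade"] = sample["Year"] // 10 * 10

# Summarize each decade; a box plot visualizes a similar per-group summary
summary = sample.groupby("decade")["red"].agg(["min", "median", "max"])
print(summary)
```

Note that `by_color['Year']` was cast to a string earlier, so this arithmetic needs the integer years in `colors_df`. Once a decade column exists, it could go on the x-axis of a chart that uses `mark_boxplot()` instead of `mark_line()`.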

## Bootstrap test

In this section, we'll use the same dataset to explore "bootstrapping," a "process of multiple resampling" that "gives an insight into the amount of variation in the data and gives us the confidence to generalize from this sample" [@Brezina2018 231].
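
To make "multiple resampling" concrete, the sketch below draws many resamples with replacement from a small set of values and looks at how much the mean varies across them. It is a generic percentile bootstrap of the mean, not necessarily the exact bootstrapping test Brezina describes. The input is just the five `red` values for 1600-1604 shown in the table output earlier, and the function name `bootstrap_means`, the number of resamples, and the seed are arbitrary choices for illustration.

```python
import random
import statistics

# Relative frequencies of "red" for 1600-1604, copied from the table output above
red = [38.13, 64.43, 45.30, 33.03, 24.60]


def bootstrap_means(values, n_resamples=10_000, seed=42):
    """Return the mean of each resample drawn with replacement from `values`."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    ]


means = sorted(bootstrap_means(red))

# Percentile interval: cut off 2.5% of the resampled means at each end
lower = means[int(0.025 * len(means))]
upper = means[int(0.975 * len(means)) - 1]

print(f"observed mean: {statistics.fmean(red):.2f}")
print(f"95% bootstrap interval: [{lower:.2f}, {upper:.2f}]")
```

A wide interval signals that the mean depends heavily on which years happen to be in the sample, which is the kind of variation the bootstrap is meant to expose.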