<a href="https://colab.research.google.com/github/CANAL-amsterdam/Foundations-of-Cultural-and-Social-Data-Analysis/blob/main/03-vector-space-model/03_exercises.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
!pip install "numpy<2,>=1.13" "pandas~=1.1" "matplotlib<4,>=2.1" "scipy<2,>=0.18" "scikit-learn>=0.19" "mpl-axes-aligner<2,>=1.1"



In [4]:
!git clone https://github.com/CANAL-amsterdam/Foundations-of-Cultural-and-Social-Data-Analysis
%cd Foundations-of-Cultural-and-Social-Data-Analysis/03-vector-space-model
!ls

Cloning into 'Foundations-of-Cultural-and-Social-Data-Analysis'...
remote: Enumerating objects: 981, done.
remote: Counting objects: 100% (60/60), done.
remote: Compressing objects: 100% (59/59), done.
remote: Total 981 (delta 20), reused 7 (delta 1), pack-reused 921
Receiving objects: 100% (981/981), 194.56 MiB | 8.87 MiB/s, done.
Resolving deltas: 100% (99/99), done.
Updating files: 100% (956/956), done.
/content/Foundations-of-Cultural-and-Social-Data-Analysis/02-getting-data/Foundations-of-Cultural-and-Social-Data-Analysis/03-vector-space-model
03_chapter.ipynb  03_exercises.ipynb  data


## Exercises

In this chapter's exercises, we will employ the vector space model to explore a rich and
unique collection of '<span class="index">chain letters</span>', which were collected,
transcribed, and digitised by {cite:t}`vanarsdale:2019`. Here, we focus on one of the
largest chain letter categories: "luck chain letters". The recipients of these letters are
warned against sin, and the letters often contain prayers and emphasize good behavior according to Christian beliefs. The most characteristic and equally intriguing aspect of these chain letters is their explicit demand to be copied and redistributed to a number of successive recipients. A recipient who ignores this demand, and thus breaks the chain, is threatened with punishment and inevitable bad fortune.

The following code block loads the corpus into memory. Two lists are created, one for the contents of the letters and one for the years in which they were written. The letters are loaded in chronological order.

In [None]:
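import csv

# Parallel lists: letters[i] was written in years[i].
letters, years = [], []
# newline="" is what the csv module recommends for files passed to a reader.
with open("data/chain-letters.csv", newline="") as f:
    reader = csv.DictReader(f)  # maps each row to a dict keyed by the header
    for row in reader:
        letters.append(row["letter"])
        years.append(int(row["year"]))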

### Easy
1. Use the preprocessing functions from the section **Text Preprocessing** to create (i) a
   tokenized version of the corpus, and (ii) a list representing the vocabulary of the
   corpus. How many unique words (i.e., word types) are there?
2. Transform the tokenized letters into a document-term matrix, and convert the matrix
   into a two-dimensional NumPy array. How many word tokens are there in the corpus?
3. What is the average number of words per letter? (Hint: use NumPy's `sum()` and `mean()` to
   help you with the necessary arithmetic.)
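
The difference between word types and word tokens, and the arithmetic behind the average letter length, can be sketched on a tiny example. The block below is only an illustration under stated assumptions: `toy_corpus` is made-up data standing in for the tokenized letters, and the vocabulary and document-term matrix are built by hand rather than with the chapter's preprocessing functions.

```python
import collections

import numpy as np

# Hypothetical toy data standing in for the tokenized corpus.
toy_corpus = [
    ["copy", "this", "letter", "and", "send", "it", "on"],
    ["this", "letter", "brings", "luck"],
    ["do", "not", "break", "the", "chain"],
]

# Vocabulary: the sorted list of unique word types.
vocabulary = sorted({token for doc in toy_corpus for token in doc})
column = {word: i for i, word in enumerate(vocabulary)}

# Document-term matrix: one row per document, one column per word type.
dtm = np.zeros((len(toy_corpus), len(vocabulary)), dtype=int)
for row, doc in enumerate(toy_corpus):
    for token, count in collections.Counter(doc).items():
        dtm[row, column[token]] = count

n_types = len(vocabulary)             # number of distinct words
n_tokens = dtm.sum()                  # total number of word occurrences
mean_length = dtm.sum(axis=1).mean()  # row sums are the document lengths

print(n_types, n_tokens, mean_length)
```

Summing along `axis=1` yields one length per document, and summing the whole matrix yields the token count, which is why a single document-term matrix is enough to answer all three questions.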

### Moderate
1. The length of the chain letters has changed considerably over the years. Compute the
   average length of letters from before 1950, and compare that to the average length of
   letters after 1950. (Hint: convert the list of years into a NumPy array, and use
   boolean indexing to slice the document-term matrix.)
2. Make a scatter plot to visualize the change in letter length over time. Label the
   x-axis and the y-axis, and adjust the opacity of the data points for better
   visibility. Around what year do the letters suddenly become much longer?
3. Not only has the length of the letters changed, but so have their contents.
   Early letters in the corpus still have strong religious undertones, while newer
   examples put greater emphasis on superstitious beliefs. (The luck chain letter is
   generally believed to stem from the 'Himmelsbrief' (Letter from Heaven), which might
   explain these religious undertones.) {cite:t}`vanarsdale:2019` points to an interesting
   development of the postscript "It works!". The first attestation of this phrase dates
   from 1979, and within a few years all subsequent letters end with this statement.
   Extract and print the summed frequency of the words *Jesus* and *works* in letters
   written before 1950 and in letters written after 1950.
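
The hints above rely on boolean indexing, which can be sketched as follows. Everything in this block is a made-up illustration: the four-word `vocabulary`, the small `dtm`, and the `years` array are hypothetical toy data, and the lowercase spelling `"jesus"` assumes that the preprocessing step lowercases all tokens.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical toy data: one row per letter, one column per word.
vocabulary = ["chain", "jesus", "luck", "works"]
dtm = np.array([
    [1, 2, 0, 0],  # letter from 1906
    [0, 1, 1, 0],  # letter from 1930
    [2, 0, 3, 1],  # letter from 1979
    [1, 0, 2, 2],  # letter from 1990
])
years = np.array([1906, 1930, 1979, 1990])

# Boolean mask selecting the rows that belong to early letters.
early = years < 1950
lengths = dtm.sum(axis=1)

mean_early = lengths[early].mean()
# Note: ~early would also include a letter dated exactly 1950.
mean_late = lengths[~early].mean()

# Look up the column of a word, then sum it over the selected rows.
jesus, works = vocabulary.index("jesus"), vocabulary.index("works")
jesus_early, jesus_late = dtm[early, jesus].sum(), dtm[~early, jesus].sum()
works_early, works_late = dtm[early, works].sum(), dtm[~early, works].sum()

# Scatter plot of length over time; alpha controls the opacity of the points.
fig, ax = plt.subplots()
ax.scatter(years, lengths, alpha=0.5)
ax.set_xlabel("Year")
ax.set_ylabel("Number of word tokens")
```

The same mask can be reused for every question about the two periods, which keeps the row selection consistent across the length and word-frequency comparisons.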

### Challenging
1. Compute the cosine distance between the oldest and the most recent letter in the
   corpus. Subsequently, compute the distance between two of the oldest letters (any two
   letters from 1906 will do). Finally, compute the distance between the two most recent
   letters. Describe your results.
2. Use SciPy's `pdist()` function to compute the cosine distances between all letters in the
   corpus. Subsequently, transform the resulting condensed distance matrix into a regular
   square-form distance matrix. Compute the average distance between letters. Do the same
   for letters written before 1950, and compare their mean distance to letters written
   after 1950. Describe your results.
3. The function `pyplot.matshow()` in Matplotlib takes a matrix or an array as its argument
   and plots it as an image. Use this function to plot a square-form distance matrix for
   the entire letter collection. To enhance your visualization, add a color bar using the
   function `pyplot.colorbar()`, which provides a mapping between the colors and the cosine
   distances. Describe the resulting plot. How many clusters do you observe?