<a href="https://colab.research.google.com/github/CANAL-amsterdam/Foundations-of-Cultural-and-Social-Data-Analysis/blob/main/05-statistics-essentials/05_exercises.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install "numpy<2,>=1.13" "pandas~=1.1" "matplotlib<4,>=2.1" "scipy<2,>=0.18" "scikit-learn>=0.19" "mpl-axes-aligner<2,>=1.1"

In [None]:
!git clone https://github.com/CANAL-amsterdam/Foundations-of-Cultural-and-Social-Data-Analysis
%cd Foundations-of-Cultural-and-Social-Data-Analysis/05-statistics-essentials
!ls


## Exercises

The <span class="index">Tate galleries</span> consist of four art museums in the United Kingdom. The museums --
Tate Britain, Tate Modern in London, Tate Liverpool, and Tate St. Ives in Cornwall --
house the United Kingdom's national collection of British art, as well as an international
collection of modern and contemporary art. Tate has made available metadata for
approximately 70,000 of its artworks.  In the following set of exercises, we will explore
and describe this dataset using some of this chapter's summary statistics.

A CSV file with these metadata is stored in compressed form in the `data` folder as
`tate.csv.gz`. pandas decompresses the file while reading it, so we can load it with the following lines of code:

In [None]:
import pandas as pd

tate = pd.read_csv("data/tate.csv.gz")
# remove objects for which no suitable year information is given:
tate = tate[tate['year'].notnull()]
tate = tate[tate['year'].str.isdigit()]
tate['year'] = tate['year'].astype('int')
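
The effect of these filtering steps can be checked on a small, made-up table. The values below are hypothetical and not taken from `tate.csv.gz`, and the sketch assumes that the `year` column is read as strings:

```python
import pandas as pd

# Hypothetical year strings, including entries that should be dropped
toy = pd.DataFrame({"year": ["1850", "c.1900", None, "1987", "no date"]})

toy = toy[toy["year"].notnull()]      # drop missing values
toy = toy[toy["year"].str.isdigit()]  # keep strings that consist of digits only
toy["year"] = toy["year"].astype("int")

years = toy["year"].tolist()
```

Only the rows whose year is a plain digit string remain, and the column can then be treated as an integer.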

### Easy
1. The dataset provides information about the dimensions of most artworks in the
   collection (expressed in millimeters). Compute the mean and median width (column `width`), height (column
   `height`), and total size (i.e., the width times the height) of the artworks. Is the
   median a better guess than the mean for this sample of artworks?
2. Draw histograms for the width, height, and size of the artworks. Why would it make sense
   to take the logarithm of the data before plotting?
3. Compute the *range* of the width and height in the collection. Do you think the range
   is an appropriate measure of dispersion for these data? Explain why you think it is or
   isn't.
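
As a starting point for these exercises, the following minimal sketch computes the mean, median, and range on synthetic data. The table `toy`, the helper `summarize()`, and the log-normal parameters are assumptions made for illustration; the real `width` and `height` columns may be distributed differently.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the `width` and `height` columns (assumption: a
# right-skewed, log-normal sample; not the actual Tate data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "width": rng.lognormal(mean=6.0, sigma=1.0, size=5000),
    "height": rng.lognormal(mean=6.0, sigma=1.0, size=5000),
})
toy["size"] = toy["width"] * toy["height"]

def summarize(column):
    """Return the mean, median, and range (max - min) of a numeric column."""
    return pd.Series({
        "mean": column.mean(),
        "median": column.median(),
        "range": column.max() - column.min(),
    })

summary = toy.apply(summarize)                # raw scale
log_summary = np.log10(toy).apply(summarize)  # after a log transformation
```

In this skewed sample the mean is pulled above the median by a few very large values, whereas the two nearly coincide after the log transformation. Whether the Tate data behave the same way is for the exercise to establish.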

### Moderate
1. With the advent of postmodernism, the sizes of the artworks became more varied and
   extreme. Make a scatter plot of the artworks' size (Y axis) over time (X axis). Add a
   line to the scatter plot representing the mean size per year. What do you observe?
   (Hint: use the column `year`, convert the data to a logarithmic scale for better
   visibility, and reduce the opacity (e.g., `alpha=0.1`) of the dots in the scatter
   plot.)
2. To obtain a better understanding of the changes in size over time, create two box plots
   which summarize the distributions of the artwork sizes from before and
   after 1950. Explain the different components of the box plots. How do the two box plots
   relate to the scatter plot in the previous exercise?
3. In this exercise, we will create an alternative visualization of the changes in shapes
   of the artworks. The following code block implements the function `create_rectangle()`,
   with which we can draw rectangles given a specified width and height [^credits].

   ```python
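   import matplotlib.patches
   import matplotlib.pyplot as plt

   def create_rectangle(width, height):
       # rectangle centered on the origin, drawn as an outline only
       return matplotlib.patches.Rectangle(
           (-(width / 2), -(height / 2)), width, height,
           fill=False, alpha=0.1)
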
   fig, ax = plt.subplots(figsize=(6, 6))
   row = tate.sample(n=1).iloc[0]  # sample an artwork for plotting
   ax.add_patch(create_rectangle(row['width'], row['height']))
   ax.set(xlim=(-4000, 4000), ylim=(-4000, 4000))
   ```

   Sample 2,000 artworks from before 1950, and 2,000 artworks created after 1950. Use the
   code from above to plot the shapes of the artworks in each period in two separate
   subplots. Explain the results.
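
The numbers behind the first two exercises can be sketched without plotting. The example below uses synthetic data; the table `toy`, its parameters, and the choice to count 1950 itself as part of the later period are assumptions, not properties of the Tate data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the `year` column and the computed size (assumption)
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "year": rng.integers(1900, 2000, size=3000),
    "size": rng.lognormal(mean=12.0, sigma=1.5, size=3000),
})
toy["log_size"] = np.log10(toy["size"])

# Mean size per year: the values for the line drawn over the scatter plot
mean_per_year = toy.groupby("year")["size"].mean()

# Quartiles per period: the numbers a box plot draws as box and median line
period = (toy["year"] < 1950).map({True: "before 1950", False: "1950 or later"})
quartiles = toy.groupby(period)["log_size"].quantile([0.25, 0.5, 0.75]).unstack()
iqr = quartiles[0.75] - quartiles[0.25]  # height of each box
```

With a matplotlib axes object `ax`, the line could then be drawn with `ax.plot(mean_per_year.index, mean_per_year.values)` on top of `ax.scatter(toy["year"], toy["size"], alpha=0.1)`, followed by `ax.set_yscale("log")`. By default, the whiskers of a matplotlib box plot extend to the most extreme points within 1.5 times the interquartile range from the box.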

### Challenging

1. The `artist` column provides the name of the artist of each artwork in the
   collection. Certain artists occur more frequently than others, and in this exercise, we
   will investigate the diversity of the Tate collection in terms of its artists. First,
   compute the entropy of the artist frequencies in the entire collection. Then, compute
   and compare the entropy for artworks from before and after 1950. Describe and interpret
   your results.
2. For most of the artworks in the collection, the metadata provides information about
   what subjects are depicted. This information is stored in the column `subject`. Works
   of art can be assigned to one or more categories, such as "nature", "literature and
   fiction", and "work and occupations". In this exercise we investigate the associations
   and dependence between some of the categories. First calculate the mutual information
   between the categories "emotions" and "concepts and ideas". What does the relatively
   high mutual information score mean for these concepts? Next, compute the mutual
   information between "nature" and "abstraction". How should we interpret the information
   score between these categories? (Hint: to compute the mutual information between
   categories, it might be useful to first convert the data into a document-term matrix.)
3. In the blog post [The Dimensions of
   Art](https://web.archive.org/web/20190708205952/https://ifweassume.blogspot.com/2013/11/the-dimensions-of-art.html),
   which gave us the inspiration for these exercises, James Davenport makes three
   interesting claims about the dimensions of the artworks in the Tate collection. We
   quote the author in full:

   > 1. *On the whole, people prefer to make 4x3 artwork*: This may largely be driven by
   >    stock canvas sizes available from art suppliers.
   > 2. *There are more tall pieces than wide pieces*: I find this fascinating, and
   >    speculate it may be due to portraits and paintings.
   > 3. *People are using the Golden Ratio*: Despite any obvious basis for its use, there
   >    are clumps for both wide and tall pieces at the so-called "Golden Ratio",
   >    approximately 1:1.681 [...].
   
   Can you add quantitative support for these claims? Do you agree with James Davenport on
   all statements?
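
For the first two exercises, entropy and mutual information can be computed directly from relative frequencies. The sketch below is one possible approach on toy data: the helpers `entropy()` and `mutual_information()`, the toy subject lists, and the representation of subjects as lists of category names are assumptions, since the actual format of the `subject` column may differ.

```python
import numpy as np
import pandas as pd

def entropy(labels):
    """Shannon entropy (in bits) of the relative frequencies of `labels`."""
    p = pd.Series(labels).value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

def mutual_information(x, y):
    """Mutual information (in bits) between two discrete indicator columns."""
    joint = pd.crosstab(np.asarray(x), np.asarray(y), normalize=True).to_numpy()
    px = joint.sum(axis=1, keepdims=True)  # marginal distribution of x
    py = joint.sum(axis=0, keepdims=True)  # marginal distribution of y
    expected = px @ py                     # joint distribution under independence
    nonzero = joint > 0                    # treat 0 * log(0) as 0
    return float((joint[nonzero] * np.log2(joint[nonzero] / expected[nonzero])).sum())

# Hypothetical subject annotations, one list of categories per artwork
toy_subjects = pd.Series([
    ["nature"],
    ["nature", "emotions"],
    ["emotions", "concepts and ideas"],
    ["concepts and ideas", "emotions"],
    ["abstraction"],
    ["abstraction"],
])

# Binary document-term matrix: one row per artwork, one column per category
dtm = toy_subjects.str.join("|").str.get_dummies(sep="|")

mi = mutual_information(dtm["emotions"], dtm["concepts and ideas"])
```

For example, `entropy(["a", "b", "c", "d"])` is 2 bits, the maximum for four equally frequent labels. The mutual information is 0 for two independent columns and grows as knowing one category tells us more about the presence of the other.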

[^credits]: The idea for this exercise was taken from a blog post, [The Dimensions of Art](https://web.archive.org/web/20190708205952/https://ifweassume.blogspot.com/2013/11/the-dimensions-of-art.html), by James Davenport.