# Chapter 2 - Reflection exercises

## Retrieval practice

### Exercise 1 - retrieval practice: analysis of 1 variable

Use the procedure for retrieval practice from the similar exercises in Module 1 to study the techniques for analysis and visualization of a single variable.

For each measurement level, provide:

- The appropriate measures of central tendency and dispersion (names, definitions and formulas)
- The appropriate graph types


## Measures of central tendency and dispersion

### Exercise 2 - sample mean and sample variance of a frequency table

Consider the formulas for the sample mean $\overline{x}$, the sample variance $s^2$ and the standard deviation $s$. How should these formulas be adapted when the data are given as a frequency table? A frequency table lists how often each distinct value of a variable occurs in the sample. For a mean and variance to make sense, that variable has to be quantitative.

Apply your formula to the data in the table below:

| Pins $x$ | Frequency $f_x$ |
| :---:    | :---:           |
| 0 |  2 |
| 1 |  1 |
| 2 |  2 |
| 3 |  0 |
| 4 |  2 |
| 5 |  4 |
| 6 |  9 |
| 7 | 11 |
| 8 | 13 |
| 9 |  8 |
|10 |  8 |

*In a game of skittles, the number of pins knocked over with each throw is recorded. For each possible score $x$, the table gives the number of throws $f_x$ in which that score was obtained.*

**Results (for your convenience):** $n = 60$, mean = 7, variance ≈ 5.83, standard deviation ≈ 2.41

#### Formulas for computing these values from a frequency table
- Sample size: $$n = \sum_{x} f_x$$
- Mean: $$\overline{x} = \frac{\sum_{x} x \cdot f_x}{n}$$
- Variance: $$s^2 = \frac{\sum_{x} (x - \overline{x})^2 \cdot f_x}{n - 1}$$
- Standard deviation: $$s = \sqrt{s^2}$$

In [1]:
import numpy as np                                  # "Scientific computing"
import scipy.stats as stats                         # Statistical tests

import pandas as pd                                 # Data Frame
from pandas.api.types import CategoricalDtype

import matplotlib.pyplot as plt                     # Basic visualisation
from statsmodels.graphics.mosaicplot import mosaic  # Mosaic diagram
import seaborn as sns                               # Advanced visualisation

In [18]:
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])    # possible scores
fx = np.array([2, 1, 2, 0, 2, 4, 9, 11, 13, 8, 8])  # frequencies; NumPy arrays make the arithmetic element-wise
# Sample size and mean (the built-in sum() would work here as well)
n = np.sum(fx)
gemiddelde = np.sum(x * fx) / n
print(f"Mean: {gemiddelde}")
# Sample variance (divide by n - 1) and standard deviation
variantie = np.sum(fx * (x - gemiddelde)**2) / (n - 1)
print(f"Variance: {variantie}")
print(f"Standard deviation: {np.sqrt(variantie)}")

Mean: 7.0
Variance: 5.830508474576271
Standard deviation: 2.4146445855604237
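
One way to sanity-check the frequency-weighted formulas is to expand the table back into the 60 individual throws and let NumPy compute the statistics on the raw data. The sketch below assumes nothing beyond the table itself. The names `throws`, `mean_raw`, `var_raw` and `std_raw` are illustrative and not part of the original notebook.

```python
import numpy as np

# Frequency table from the exercise: scores and how often each occurred
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
fx = np.array([2, 1, 2, 0, 2, 4, 9, 11, 13, 8, 8])

# Expand the table into individual observations:
# each score x is repeated f_x times
throws = np.repeat(x, fx)

mean_raw = np.mean(throws)
var_raw = np.var(throws, ddof=1)   # ddof=1 divides by n - 1 (sample variance)
std_raw = np.std(throws, ddof=1)   # sample standard deviation

print(f"n = {len(throws)}")
print(f"mean = {mean_raw}, variance = {var_raw:.2f}, std = {std_raw:.2f}")
```

Both routes should agree, because summing $x \cdot f_x$ over the table is the same as summing every individual observation.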


### Exercise 3 - formula for sample variance

In the formula for the sample variance, the difference between the measurement values and the mean is squared. Why? Couldn't we devise a simpler formula that is an equally good measure of the dispersion of a dataset? Here are three proposals (the third one is the "real" formula):

1. $s_{1}^{2} = \frac{1}{n-1} \sum_{i=1}^{n} (\overline{x} - x_i)$
2. $s_{2}^{2} = \frac{1}{n-1} \sum_{i=1}^{n} \left| \overline{x} - x_i\right|$
3. $s_{3}^{2} = \frac{1}{n-1} \sum_{i=1}^{n} (\overline{x} - x_i)^{2}$

Apply each formula to the two data sets below. By comparing the results, you should be able to decide whether the formulas are suitable as a dispersion measure.

- $X = \left\{ 4,4,-4,-4 \right\}$
- $Y = \left\{ 7,1,-6,-2 \right\}$

In [None]:
x = np.array([4,4,-4,-4])
y = np.array([7,1,-6,-2])
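
A possible way to work through the comparison is to wrap the three proposals in a small helper and apply it to both data sets. This is only a sketch: the function name `proposals` is hypothetical, and the formulas are copied directly from the list in the exercise.

```python
import numpy as np

x = np.array([4, 4, -4, -4])
y = np.array([7, 1, -6, -2])

def proposals(data):
    """Return the three candidate dispersion measures from the exercise."""
    n = len(data)
    dev = np.mean(data) - data           # deviations (x-bar - x_i)
    s1 = np.sum(dev) / (n - 1)           # proposal 1: plain deviations
    s2 = np.sum(np.abs(dev)) / (n - 1)   # proposal 2: absolute deviations
    s3 = np.sum(dev**2) / (n - 1)        # proposal 3: squared deviations
    return s1, s2, s3

for name, data in [("X", x), ("Y", y)]:
    s1, s2, s3 = proposals(data)
    print(f"{name}: s1 = {s1:.3f}, s2 = {s2:.3f}, s3 = {s3:.3f}")
```

Both data sets have mean 0. The first proposal is therefore 0 for each, because positive and negative deviations cancel. The second gives 16/3 for both, so it cannot tell them apart. Only the third separates them (64/3 for $X$ versus 30 for $Y$), because squaring gives the large deviations in $Y$ more weight.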

### Exercise 4 - coefficient of variation

On your own, look up what the coefficient of variation for a sample is. How is it defined for a full population and what could you do with it?
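
As a starting point, the coefficient of variation is usually defined as the standard deviation divided by the mean: $s / \overline{x}$ for a sample and $\sigma / \mu$ for a population. The sketch below assumes that definition. The helper `coefficient_of_variation` is hypothetical and the height values are made up purely for illustration.

```python
import numpy as np

def coefficient_of_variation(data, ddof=1):
    """Hypothetical helper: standard deviation divided by the mean.

    ddof=1 gives the sample version (s / x-bar),
    ddof=0 the population version (sigma / mu).
    """
    data = np.asarray(data, dtype=float)
    return np.std(data, ddof=ddof) / np.mean(data)

# Made-up illustration data: the same heights in two different units
heights_cm = np.array([160.0, 170.0, 180.0])
heights_m = heights_cm / 100

cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)

print(f"s (cm) = {np.std(heights_cm, ddof=1):.2f}, CV = {cv_cm:.4f}")
print(f"s (m)  = {np.std(heights_m, ddof=1):.2f}, CV = {cv_m:.4f}")
```

The standard deviation changes with the unit of measurement, but the ratio does not. This makes the coefficient of variation useful for comparing the relative spread of variables measured on different scales. It is only meaningful for ratio-scale data with a mean well away from zero.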


## Visualisation techniques

### Exercise 5 - data visualisation hall of shame

Look for examples of bad graphs in news reports, articles, interest group publications, etc.

Why is the chosen graph "bad"? What mistakes are being made? What changes should be made to correct the graph? Who will find the most ridiculous example of a bad graph within the class group?