
<img src="https://image.ibb.co/gw4Gen/Index-GMIT.png" alt="Index-GMIT" border="0">

## Practical Assignment: ***Anscombe's quartet dataset***

* Assignment for Fundamentals of Data Analytics
* Start date: 12-10-2018. End date: 11-11-2018
________________________________________________________________

### Assignment outline and objectives
1. Explain the background to the dataset – who created it, when it was created, and
any speculation you can find regarding how it might have been created.
2. Plot the interesting aspects of the dataset.
3. Calculate the descriptive statistics of the variables in the dataset.
4. Explain why the dataset is interesting, referring to the plots and statistics above.
------------------------------------

### Background and overview of the dataset

#### "Graphs are essential to good statistical analysis"

Francis Anscombe was an English statistician who, in 1973, published a highly relevant and timeless paper called *Graphs in Statistical Analysis*. The paper was intended to counter the impression among statisticians that "numerical calculations are exact, but graphs are rough" [3].

Anscombe's quartet is a set of four small datasets where each produces almost identical summary statistics (mean, standard deviation, variance and correlations), which could lead most laypersons to infer that the datasets are very similar [1].

However, while the statistical properties prove to be near identical, visualizing (plotting) the data reveals instead that the datasets are in fact notably different.

Anscombe published this paper in order to demonstrate the importance of plotting data *before* analyzing it, and to show the effect outliers can have on statistical properties [2]. In his paper he makes the astute observation that very little attention is given to plots, and that most people hold assumptions such as:

* Numerical calculations are very precise while graphs are "rough".
* For any kind of statistical data there exists only one set of calculations which results in an accurate analysis.
* Performing calculations is something of a virtue while looking at the plotted data is "cheating" [2].

Anscombe asserts that computers should produce *both* calculations and plots and that both must be studied because each will contribute to understanding. Indeed, by looking at the qualities and features of Anscombe's dataset and by understanding his analyses, we can better appreciate his observational stance on the matter.

---------------------------------------

### Opening the dataset

We can open and format the Anscombe dataset in various ways, which display slightly different table formats. Below I demonstrate three methods:

In [1]:
# Opening the dataset via pandas

import pandas as pd
dataframe = pd.read_csv("anscombe.csv")
print(dataframe)

    Unnamed: 0  x1  x2  x3  x4     y1    y2     y3     y4
0            1  10  10  10   8   8.04  9.14   7.46   6.58
1            2   8   8   8   8   6.95  8.14   6.77   5.76
2            3  13  13  13   8   7.58  8.74  12.74   7.71
3            4   9   9   9   8   8.81  8.77   7.11   8.84
4            5  11  11  11   8   8.33  9.26   7.81   8.47
5            6  14  14  14   8   9.96  8.10   8.84   7.04
6            7   6   6   6   8   7.24  6.13   6.08   5.25
7            8   4   4   4  19   4.26  3.10   5.39  12.50
8            9  12  12  12   8  10.84  9.13   8.15   5.56
9           10   7   7   7   8   4.82  7.26   6.42   7.91
10          11   5   5   5   8   5.68  4.74   5.73   6.89


In [2]:
# Opening the dataset via the CSV reader

import csv

# The with-block closes the file automatically once reading is done.
with open("anscombe.csv", newline="") as f:
    for row in csv.reader(f):  # iterate over the rows of the dataset
        # Use the format method to align the nine fields of each row.
        # The > symbol right-aligns a field (< would left-align it).
        # The number after > sets the column width in characters.
        # A positional index may be placed before the colon (:) to pick an argument.
        print('{:>6} {:>6} {:>6} {:>6} {:>6} {:>6} {:>6} {:>6} {:>6}'.format(*row))

           x1     x2     x3     x4     y1     y2     y3     y4
     1     10     10     10      8   8.04   9.14   7.46   6.58
     2      8      8      8      8   6.95   8.14   6.77   5.76
     3     13     13     13      8   7.58   8.74  12.74   7.71
     4      9      9      9      8   8.81   8.77   7.11   8.84
     5     11     11     11      8   8.33   9.26   7.81   8.47
     6     14     14     14      8   9.96    8.1   8.84   7.04
     7      6      6      6      8   7.24   6.13   6.08   5.25
     8      4      4      4     19   4.26    3.1   5.39   12.5
     9     12     12     12      8  10.84   9.13   8.15   5.56
    10      7      7      7      8   4.82   7.26   6.42   7.91
    11      5      5      5      8   5.68   4.74   5.73   6.89
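
The format specifiers used in the cell above can be tried in isolation. The following is a minimal sketch on a single made-up row of three fields (the variable names `row`, `right`, `left` and `swapped` are illustrative only); it shows right alignment, left alignment and the positional index mentioned in the comments.

```python
# One illustrative row: the row counter, an x value and a y value.
row = ["1", "10", "8.04"]

# {:>6} right-aligns each field in a column six characters wide.
right = "{:>6} {:>6} {:>6}".format(*row)

# {:<6} left-aligns instead, so the padding goes on the right.
left = "{:<6} {:<6} {:<6}".format(*row)

# An index before the colon selects which argument fills the slot,
# so the fields can be printed in a different order.
swapped = "{2:>6} {1:>6} {0:>6}".format(*row)

# repr() makes the padding spaces visible in the printed strings.
print(repr(right))
print(repr(left))
print(repr(swapped))
```

Every formatted string here is 20 characters long: three fields of width six plus two separating spaces.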


In [3]:
# Opening the dataset via Seaborn

import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="ticks")

# Load the example dataset for Anscombe's quartet
df = sns.load_dataset("anscombe")

print(df)

   dataset     x      y
0        I  10.0   8.04
1        I   8.0   6.95
2        I  13.0   7.58
3        I   9.0   8.81
4        I  11.0   8.33
5        I  14.0   9.96
6        I   6.0   7.24
7        I   4.0   4.26
8        I  12.0  10.84
9        I   7.0   4.82
10       I   5.0   5.68
11      II  10.0   9.14
12      II   8.0   8.14
13      II  13.0   8.74
14      II   9.0   8.77
15      II  11.0   9.26
16      II  14.0   8.10
17      II   6.0   6.13
18      II   4.0   3.10
19      II  12.0   9.13
20      II   7.0   7.26
21      II   5.0   4.74
22     III  10.0   7.46
23     III   8.0   6.77
24     III  13.0  12.74
25     III   9.0   7.11
26     III  11.0   7.81
27     III  14.0   8.84
28     III   6.0   6.08
29     III   4.0   5.39
30     III  12.0   8.15
31     III   7.0   6.42
32     III   5.0   5.73
33      IV   8.0   6.58
34      IV   8.0   5.76
35      IV   8.0   7.71
36      IV   8.0   8.84
37      IV   8.0   8.47
38      IV   8.0   7.04
39      IV   8.0   5.25
40      IV  19.0  12.50
41      IV   8.0   5.56
42      IV   8.0   7.91
43      IV   8.0   6.89

With our dataset printed out we can now study it for any peculiarities or interesting features.

Rather quickly I noticed a very salient feature of the x variables, which is almost impossible to overlook. The x1, x2 and x3 columns are identical to one another, row for row, whereas x4 is very different: it holds the value 8 in every row except one, where it is 19.
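
This observation about the x columns can also be checked programmatically rather than by eye. The sketch below rebuilds only the x columns by hand from the printed table (so it does not depend on `anscombe.csv` being present); the names `x_values`, `same_x` and `x4_counts` are illustrative.

```python
import pandas as pd

# x columns copied from the printed table above.
shared_x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x_values = pd.DataFrame({
    "x1": shared_x,
    "x2": shared_x,
    "x3": shared_x,
    "x4": [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
})

# Element-wise comparison: True only if every row matches.
same_x = bool(
    (x_values["x1"] == x_values["x2"]).all()
    and (x_values["x2"] == x_values["x3"]).all()
)

# Count how often each distinct value occurs in x4.
x4_counts = x_values["x4"].value_counts().to_dict()

print(same_x)
print(x4_counts)
```

The count for x4 shows ten occurrences of 8 and a single 19, which is the one row that sets dataset IV apart.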

-----------------------
### Calculating the descriptive statistics of the variables in the dataset
-----------------------

In [4]:
# First we can get a glimpse of how many examples (rows) and how many attributes (columns) the Anscombe dataset 
# contains with the shape property:

print(dataframe.shape)

(11, 9)


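Thus, we can see from the output above that the Anscombe dataset comprises 11 rows and 9 columns. One of those nine columns is the unnamed row counter (`Unnamed: 0`), so there are eight data variables: x1 to x4 and y1 to y4.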
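
If the row counter should not be counted as a variable, one option is to tell pandas to use it as the index. The sketch below is an assumption about the file layout, inferred from the printed output (an unnamed first column followed by x1 to y4); it uses an in-memory copy of the first three rows instead of the real `anscombe.csv`, and the names `csv_text`, `default` and `indexed` are illustrative.

```python
import io

import pandas as pd

# First three rows of the table above, with the unnamed leading column.
csv_text = """,x1,x2,x3,x4,y1,y2,y3,y4
1,10,10,10,8,8.04,9.14,7.46,6.58
2,8,8,8,8,6.95,8.14,6.77,5.76
3,13,13,13,8,7.58,8.74,12.74,7.71
"""

# Default behaviour: the counter is read as a data column.
default = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default.shape, list(default.columns)[:2])
print(indexed.shape, list(indexed.columns)[:2])
```

With the counter moved into the index, `describe()` would no longer report statistics for it alongside the real variables.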

In [7]:
# Next we can take a look at a summary of each Anscombe attribute. This includes the count, mean, min and max values as 
# well as some percentiles.

import numpy as np
print(np.round(dataframe.describe(), decimals=2)) # rounds descriptive output to 2 decimals

       Unnamed: 0     x1     x2     x3     x4     y1     y2     y3     y4
count       11.00  11.00  11.00  11.00  11.00  11.00  11.00  11.00  11.00
mean         6.00   9.00   9.00   9.00   9.00   7.50   7.50   7.50   7.50
std          3.32   3.32   3.32   3.32   3.32   2.03   2.03   2.03   2.03
min          1.00   4.00   4.00   4.00   8.00   4.26   3.10   5.39   5.25
25%          3.50   6.50   6.50   6.50   8.00   6.32   6.70   6.25   6.17
50%          6.00   9.00   9.00   9.00   8.00   7.58   8.14   7.11   7.04
75%          8.50  11.50  11.50  11.50   8.00   8.57   8.95   7.98   8.19
max         11.00  14.00  14.00  14.00  19.00  10.84   9.26  12.74  12.50


From the descriptive summary statistics table above, we can make a number of key observations.

The mean of every x variable is identical at 9, and the mean of every y variable is identical at 7.5 (to two decimal places).
The standard deviation of every x variable is 3.32, and the standard deviation of every y variable is likewise the same across the four datasets at 2.03.

These results reveal almost identical statistical properties, which would ordinarily suggest that there is no meaningful difference between the four datasets.
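
The means and standard deviations can be reproduced without pandas as a cross-check. The sketch below uses Python's standard `statistics` module on values copied from the printed table; `stdev` is the sample standard deviation (dividing by n - 1), which is also what `describe()` reports. The names `columns` and `summary` are illustrative.

```python
from statistics import mean, stdev

# Values copied from the printed table above.
shared_x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
columns = {
    "x1": shared_x,
    "x2": shared_x,
    "x3": shared_x,
    "x4": [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
    "y1": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "y2": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "y3": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "y4": [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
}

# Mean and sample standard deviation for every column.
summary = {name: (mean(values), stdev(values)) for name, values in columns.items()}

for name, (m, s) in summary.items():
    print(f"{name}: mean={m:.2f} std={s:.2f}")
```

Before rounding, the y means and standard deviations differ slightly in the third decimal place, which is why the statistics are described as almost identical rather than exactly identical.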

Similarly, computing the correlation between each x variable and its paired y variable (x1 with y1, x2 with y2, and so on) shows that the correlation coefficients (r values) are near identical at roughly 0.816.
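
As a sketch of how these r values can be obtained, the code below computes the Pearson correlation coefficient directly from its definition (the sum of the products of deviations divided by the product of the root sums of squared deviations), using values copied from the printed table. The helper `pearson_r` and the names `quartet` and `r_values` are illustrative; in pandas the same figure could be obtained with `Series.corr`.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Root sums of squared deviations for each variable.
    sx = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return sxy / (sx * sy)


shared_x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (shared_x, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (shared_x, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (shared_x, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# One r value per dataset in the quartet.
r_values = {name: pearson_r(xs, ys) for name, (xs, ys) in quartet.items()}

for name, r in r_values.items():
    print(f"Dataset {name}: r = {r:.4f}")
```

The four coefficients agree to within about 0.001 of each other, even though the underlying point patterns are very different.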

## References


[1]. Matejka, J., Fitzmaurice, G. (2017). *Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing*. Autodesk Research, Toronto, Canada.

[2]. Anscombe, F.J. (1973). *Graphs in Statistical Analysis*. The American Statistician 27, 1, 17–21.

[3]. Wikipedia, multiple authors (2018, October). *Anscombe's quartet*. Retrieved from: https://en.wikipedia.org/wiki/Anscombe%27s_quartet