# An analysis of the Anscombe's Quartet Dataset

### 1 - Background to the dataset

#### Anscombe's quartet was constructed by the statistician Francis Anscombe in 1973 to demonstrate both the importance of graphing data before analyzing it and the effect of outliers on statistical properties. <sup>[1]</sup>

It comprises four datasets with nearly identical simple descriptive statistics, yet they appear very different when graphed. Each dataset consists of eleven (x,y) points. He described the article as being intended to counter the impression among statisticians that "numerical calculations are exact, but graphs are rough."<sup>[2]</sup>

Anscombe's datasets are synthetic rather than observed values, insofar as they were created by Anscombe. Exactly how he created them is unknown.



### Libraries

In [1]:
# let's have matplotlib render plots inline in the notebook
%matplotlib inline

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Data

In [3]:
# Load Anscombe's datasets from seaborn's example-data repository and show the values
df = sns.load_dataset("anscombe")
df

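|   | dataset | x    | y     |
|---|---------|------|-------|
| 0 | I       | 10.0 | 8.04  |
| 1 | I       | 8.0  | 6.95  |
| 2 | I       | 13.0 | 7.58  |
| 3 | I       | 9.0  | 8.81  |
| 4 | I       | 11.0 | 8.33  |
| 5 | I       | 14.0 | 9.96  |
| 6 | I       | 6.0  | 7.24  |
| 7 | I       | 4.0  | 4.26  |
| 8 | I       | 12.0 | 10.84 |
| 9 | I       | 7.0  | 4.82  |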


### 2 - Plot the interesting aspects of the dataset

In [4]:
# let's start by plotting each dataset as a simple scatterplot.
# df is in long format (columns: dataset, x, y), so there are no x1/y1 columns;
# group by the dataset label and plot each group's x against its y instead.
colors = {'I': 'blue', 'II': 'green', 'III': 'red', 'IV': 'cyan'}
for name, group in df.groupby('dataset'):
    group.plot(kind='scatter', x='x', y='y', color=colors[name],
               title=f'Dataset {name}')
plt.show()

In [None]:
sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=df,
           col_wrap=2, ci=None, palette="muted", 
           scatter_kws={"s": 50, "alpha": 1})
plt.show()

So each of the datasets, when plotted as a scatterplot, looks very different indeed!

### 3 - Calculate the descriptive statistics of the variables in the dataset

In [None]:
df.groupby("dataset").describe()

Having seen four quite different plots for the datasets, it's surprising to note how many of the summary statistics are nearly identical:

* The mean x value is 9 for each dataset
* The mean y value is 7.50 for each dataset
* The standard deviation for x is 3.32 in each dataset
* The standard deviation for y is 2.03 in each dataset
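
The figures in the list above can be re-derived without the seaborn download. The following is a minimal sketch, assuming the quartet's published values are typed in by hand; the names `quartet` and `summary` are illustrative and not part of the notebook. Note that `describe()` reports the sample standard deviation, so the NumPy call needs `ddof=1` to match it.

```python
import numpy as np

# Anscombe's quartet, hard-coded so the sketch runs without a network connection.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (x, y) in quartet.items():
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    summary[name] = {
        "mean_x": float(x.mean()),
        "mean_y": float(y.mean()),
        # ddof=1 gives the sample standard deviation, as pandas' describe() does;
        # NumPy's default (ddof=0) would give the smaller population value.
        "std_x": float(x.std(ddof=1)),
        "std_y": float(y.std(ddof=1)),
    }
    print(name, {k: round(v, 2) for k, v in summary[name].items()})
```

Rounded to two decimals, the four rows printed by this sketch should agree with each other, which is the point the list makes.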

Now let's look at the coefficient of correlation between the x and y values. Because df is in long format, with a dataset label plus x and y columns, I restate the x and y values of each dataset as separate NumPy arrays (x1, y1, and so on) that can be passed to the np.corrcoef(x, y) command:

In [None]:
# Anscombe's data as NumPy arrays
x1 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
x2 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
x3 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8])
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

In [None]:
# find the correlation between x and y for each dataset;
# np.corrcoef returns a 2x2 matrix whose off-diagonal entries are the coefficient r
corrx1=np.corrcoef(x1, y1)
corrx1

In [None]:
corrx2=np.corrcoef(x2, y2)
corrx2

In [None]:
corrx3=np.corrcoef(x3, y3)
corrx3

In [None]:
corrx4=np.corrcoef(x4, y4)
corrx4

It's clear that the correlation between x and y is approximately 0.816 for each dataset. This is a high correlation between the x and y values in each case. Ordinarily, this might suggest a linear relationship between an independent variable and a dependent variable, but in this case we know that the values were constructed by Anscombe in order to make a point.
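
The same coefficients can also be obtained from a long-format frame directly, without restating each dataset as separate arrays. This is a sketch under stated assumptions: the frame `df_long` is rebuilt by hand here to mirror the `dataset`, `x`, `y` layout of the seaborn frame, and the names `quartet`, `df_long` and `corr_by_dataset` are illustrative only.

```python
import pandas as pd

# Hand-typed copy of the quartet in the same long layout as the seaborn frame.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}
records = []
for name, (xs, ys) in quartet.items():
    records.extend({"dataset": name, "x": x, "y": y} for x, y in zip(xs, ys))
df_long = pd.DataFrame(records)

# Pearson correlation of x and y within each dataset, one value per group.
corr_by_dataset = pd.Series(
    {name: group["x"].corr(group["y"]) for name, group in df_long.groupby("dataset")}
)
print(corr_by_dataset.round(3))
```

Grouping keeps the calculation tied to the dataset labels, so there is no risk of pairing the wrong x and y arrays by hand.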

### 4 - Why the dataset is interesting, referring to the plots and statistics above

Perhaps the most interesting feature of the Anscombe dataset is that while it consists of four datasets that have nearly identical summary statistics (e.g., mean, standard deviation, and correlation), they produce very dissimilar data graphics (scatterplots).

On first examination of the datasets, what's noticeable is that the x values are the same in each of the first three, while the y values are seldom repeated. In the fourth dataset, all the x values are 8 except for one outlier at 19 while, once again, the y values are seldom repeated. In these respects, the four datasets seem unlikely to be the product of natural observations and more likely to be fabricated in some way.

On the scatter plots, what's evident is that while the pattern suggested by the plotted values is quite distinct for each dataset, the line of best fit is almost identical for all four plots (approximately y = 3.00 + 0.500x). This is so improbable as to suggest that their creator may have begun with the plots and then contrived datasets to fit them.
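
The claim about the line of best fit can be checked numerically. The sketch below assumes an ordinary least-squares fit of a straight line, computed with `np.polyfit`; the hard-coded `quartet` dictionary and the `fits` name are illustrative and not part of the notebook.

```python
import numpy as np

# Anscombe's quartet, hard-coded so the sketch is self-contained.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

fits = {}
for name, (x, y) in quartet.items():
    # A degree-1 polynomial fit returns the coefficients highest power first:
    # slope, then intercept.
    slope, intercept = np.polyfit(x, y, deg=1)
    fits[name] = (float(slope), float(intercept))
    print(f"{name}: y = {intercept:.2f} + {slope:.3f} x")
```

The four fitted lines differ only in the third or fourth decimal place, even though dataset II is clearly curved and dataset IV owes its slope entirely to the single point at x = 19.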





### REFERENCES

1. https://en.wikipedia.org/wiki/Anscombe%27s_quartet
2. https://en.wikipedia.org/wiki/Anscombe%27s_quartet
3. https://rstudio-pubs-static.s3.amazonaws.com/350044_cba17bbef9fc430d927a6ad99179ae2e.html
4. https://heapanalytics.com/blog/data-stories/anscombes-quartet-and-why-summary-statistics-dont-tell-the-whole-story
