## An Analysis of Anscombe's quartet dataset 

## 1. An explanation of the background of the dataset

Francis John "Frank" Anscombe (13 May 1918 – 17 October 2001) was an English statistician.

Born in Hove in England, Anscombe was educated at Trinity College at Cambridge University. After serving in the Second World War, he joined Rothamsted Experimental Station for two years before returning to Cambridge as a lecturer. 

He later became interested in statistical computing and stressed that "a computer should make both calculations and graphs". He illustrated the importance of graphing data with four datasets now known as Anscombe's quartet.

Anscombe's quartet comprises four datasets that have nearly identical simple descriptive statistics, yet appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers on statistical properties. He described the article as being intended to counter the impression among statisticians that "numerical calculations are exact, but graphs are rough."
Source: https://en.wikipedia.org/wiki/Anscombe%27s_quartet

Perhaps the most elegant demonstration of the dangers of summary statistics is Anscombe’s Quartet. It’s a group of four datasets that appear to be similar when using typical summary statistics, yet tell four different stories when graphed. 

### Speculation on how Anscombe created the dataset

Although the quartet is very popular and effective for illustrating the importance of visualization, it is not known how Anscombe came up with his datasets. He did not report how they were created, nor did he suggest any method for creating new ones.

### Chatterjee and Firat 2007

Chatterjee and Firat proposed a genetic-algorithm-based approach in which 1,000 random datasets were created with identical summary statistics, then combined and mutated with an objective function to maximize the “graphical dissimilarity” between the initial and final scatter plots. While the datasets produced were graphically dissimilar to the input datasets, they did not have any discernible structure in their composition.
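The starting point of such an approach, a pool of random datasets that all share the same summary statistics, can be illustrated with a short sketch. This is not Chatterjee and Firat's genetic algorithm: there is no combination, mutation, or dissimilarity objective here. It only shows one standard way of forcing random y values to match a target mean, standard deviation, and correlation with x. The function name `random_y_with_stats` and the target values (roughly those of the quartet) are assumptions made for illustration.

```python
import numpy as np

def random_y_with_stats(x, y_mean, y_std, r, seed=None):
    """Return random y values whose sample mean, sample standard deviation,
    and Pearson correlation with x match the given targets."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    zx = (x - x.mean()) / x.std(ddof=1)        # standardised x
    noise = rng.normal(size=x.size)
    noise -= noise.mean()                       # centre the noise
    noise -= (noise @ zx) / (zx @ zx) * zx      # make it uncorrelated with x
    noise /= noise.std(ddof=1)                  # unit sample standard deviation
    zy = r * zx + np.sqrt(1.0 - r**2) * noise   # standardised y with correlation r
    return y_mean + y_std * zy

x = np.arange(4, 15, dtype=float)  # eleven x values, 4 to 14
y_a = random_y_with_stats(x, y_mean=7.5, y_std=2.03, r=0.816, seed=1)
y_b = random_y_with_stats(x, y_mean=7.5, y_std=2.03, r=0.816, seed=2)

# Both datasets report the same statistics although the points differ.
for y in (y_a, y_b):
    print(round(y.mean(), 3), round(y.std(ddof=1), 3),
          round(np.corrcoef(x, y)[0, 1], 3))
```

Different seeds give different point clouds with the same three statistics, which is the kind of raw material a search procedure could then select from.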

Chatterjee, S. and Firat, A. (2007). Generating Data with Identical Statistics but Dissimilar Graphics. The American Statistician 61, 3, 248–254.

### Govindaraju and Haslett

Govindaraju and Haslett developed a method for regressing datasets towards their sample means while maintaining the same linear regression formula [7]. In 2009, the same authors extended their procedure to create “cloned” datasets [8]. In addition to maintaining the same linear regression as the seed dataset, their cloned datasets also maintained the same means (but not the same standard deviations).

Govindaraju, K. and Haslett, S.J. (2008). Illustration of regression towards the means. International Journal of Mathematical Education in Science and Technology 39, 4, 544–550.

While Chatterjee and Firat wanted to create datasets as graphically dissimilar as possible, Govindaraju and Haslett’s cloned datasets were designed to be visually similar, with a proposed application of confidentializing sensitive data for publication purposes.
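To make the idea of keeping the regression line and the means while changing the spread more concrete, here is a minimal sketch. It is not necessarily Govindaraju and Haslett's procedure; it is simply one transformation with those properties: every point is moved towards the point of means by the same factor. The function name `shrink_towards_means` and the factor of 0.5 are assumptions for illustration, and the data are the first eight observations of dataset I.

```python
import numpy as np

def shrink_towards_means(x, y, factor):
    """Move every point towards (mean of x, mean of y) by the same factor."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_new = x.mean() + factor * (x - x.mean())
    y_new = y.mean() + factor * (y - y.mean())
    return x_new, y_new

# First eight observations of dataset I.
x = [10, 8, 13, 9, 11, 14, 6, 4]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26]

x_new, y_new = shrink_towards_means(x, y, factor=0.5)

# np.polyfit with degree 1 returns (slope, intercept).
slope, intercept = np.polyfit(x, y, 1)
slope_new, intercept_new = np.polyfit(x_new, y_new, 1)

print("original line:", slope, intercept)
print("shrunk line:  ", slope_new, intercept_new)
print("std of y:     ", np.std(y, ddof=1), "->", np.std(y_new, ddof=1))
```

Because the covariance and the variance of x are both scaled by the square of the factor, the slope is unchanged, and since the means are unchanged the intercept is too. Only the standard deviations shrink.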

Datasets that are identical over a number of statistical properties, yet produce dissimilar graphs, are frequently used to illustrate the importance of graphical representations when exploring data.

The effectiveness of Anscombe’s Quartet is not simply due to having four different datasets that generate the same statistical properties; it is that four clearly different and identifiably distinct datasets produce the same statistical properties.

Sometimes, when analysing data, summary calculations alone can be misleading unless you understand the underlying data. It is important to visualise the data to see what you are dealing with. This can be achieved with graphs, in this case scatter plots.

Scatter plots are useful to perceive the broad features of data and look behind those features to see what is there. A good statistical analysis includes looking at the data from different points of view. To see why you need to visualise data we are going to look at Anscombe’s Quartet.
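As a minimal sketch of what such a view can look like, the following draws one scatter plot per dataset with its least-squares line. It assumes matplotlib is installed (it is not imported elsewhere in this notebook), and the function name `plot_quartet` is made up for illustration. The values are the published quartet values, typed in directly so the example does not depend on downloading the data.

```python
import matplotlib.pyplot as plt
import numpy as np

# Published values of Anscombe's quartet (datasets I-III share the same x).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
QUARTET = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def plot_quartet(data=QUARTET):
    """Draw a 2x2 grid of scatter plots, one per dataset, with fitted lines."""
    fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)
    line_x = np.array([3.0, 20.0])  # x range over which the fitted line is drawn
    for ax, (name, (x, y)) in zip(axes.flat, data.items()):
        ax.scatter(x, y)
        slope, intercept = np.polyfit(x, y, 1)  # least-squares straight line
        ax.plot(line_x, intercept + slope * line_x, color="red")
        ax.set_title(f"Dataset {name}")
    fig.tight_layout()
    return fig

fig = plot_quartet()
```

In a notebook the figure is displayed when the cell finishes; in a script, `plt.show()` or `fig.savefig(...)` would be needed. Sharing the axes across panels makes the near-identical fitted lines easy to compare against the very different point patterns.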

## 2. Plot the interesting aspects of the dataset

In [5]:
import pandas as pd # Data manipulation
from scipy import stats # linear regression
import numpy as np #Quick summary statistics

First of all, let us get the dataset. This is straightforward because it can be loaded directly from the seaborn library:

In [8]:
import seaborn as sns
quartet = sns.load_dataset("anscombe")
quartet

Unnamed: 0,dataset,x,y
0,I,10.0,8.04
1,I,8.0,6.95
2,I,13.0,7.58
3,I,9.0,8.81
4,I,11.0,8.33
5,I,14.0,9.96
6,I,6.0,7.24
7,I,4.0,4.26
8,I,12.0,10.84
9,I,7.0,4.82


First, we can see how many examples (rows) and how many attributes (columns) the Anscombe dataset contains with the shape attribute:

In [15]:
quartet.shape

(44, 3)

We can see from the output above that the Anscombe dataset comprises 44 rows and 3 columns: four datasets of eleven (x, y) points each, plus a column naming the dataset. Next we can take a look at a summary of each numeric attribute. This includes the count, mean, min and max values as well as some percentiles:

In [24]:
np.round(quartet.describe()) # Rounds descriptive output to whole numbers (0 decimals)

Unnamed: 0,x,y
count,44.0,44.0
mean,9.0,8.0
std,3.0,2.0
min,4.0,3.0
25%,7.0,6.0
50%,8.0,8.0
75%,11.0,9.0
max,19.0,13.0


In [20]:
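# numeric_only=True skips the text "dataset" column (required in pandas 2.0 and later)
np.round(quartet.corr(numeric_only=True), decimals=3)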

Unnamed: 0,x,y
x,1.0,0.816
y,0.816,1.0
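The figures above pool all 44 rows into a single sample, whereas the point of the quartet is that each of the four datasets on its own gives almost the same numbers. A minimal sketch of the per-dataset calculation follows. It uses `scipy.stats.linregress` for the regression line; the frame name `quartet_long` and the result column names are assumptions for illustration. The published quartet values are typed in so the example runs without downloading anything, and the frame has the same `dataset`, `x`, `y` layout as the seaborn data.

```python
import pandas as pd
from scipy import stats

# Published values of Anscombe's quartet (datasets I-III share the same x).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
values = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Long format: one row per point, with a column naming the dataset.
quartet_long = pd.concat(
    [pd.DataFrame({"dataset": name, "x": x, "y": y}) for name, (x, y) in values.items()],
    ignore_index=True,
)

rows = []
for name, group in quartet_long.groupby("dataset"):
    fit = stats.linregress(group["x"], group["y"])  # least-squares line y = a + b*x
    rows.append({
        "dataset": name,
        "mean_x": group["x"].mean(),
        "mean_y": group["y"].mean(),
        "var_x": group["x"].var(),   # sample variance (ddof=1)
        "var_y": group["y"].var(),
        "r": fit.rvalue,
        "slope": fit.slope,
        "intercept": fit.intercept,
    })

summary = pd.DataFrame(rows).set_index("dataset").round(3)
print(summary)
```

Each row of `summary` should show a mean of x of 9, a mean of y of about 7.5, a correlation of about 0.816, and a fitted line close to y = 3 + 0.5x.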


In [2]:
# Import pandas.
import pandas as pd
df = pd.read_csv("Anscombe2.csv")
df

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12
0,,Anscombe's Data,,,,,,,,,,,
1,,Observation,x1,y1,,x2,y2,,x3,y3,,x4,y4
2,,1,10,8.04,,10,9.14,,10,7.46,,8,6.58
3,,2,8,6.95,,8,8.14,,8,6.77,,8,5.76
4,,3,13,7.58,,13,8.74,,13,12.74,,8,7.71
5,,4,9,8.81,,9,8.77,,9,7.11,,8,8.84
6,,5,11,8.33,,11,9.26,,11,7.81,,8,8.47
7,,6,14,9.96,,14,8.1,,14,8.84,,8,7.04
8,,7,6,7.24,,6,6.13,,6,6.08,,8,5.25
9,,8,4,4.26,,4,3.1,,4,5.39,,19,12.5
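The `Unnamed` headers in the output above suggest that the file starts with an empty line and a title line before the real column labels, and that empty spacer columns separate the four datasets. A minimal sketch of one way to tidy this is shown here. The raw layout is an assumption reconstructed from the output, only the eight observations listed above are included, and the names `raw`, `wide` and `tidy` are illustrative.

```python
import io
import pandas as pd

# Assumed raw layout of "Anscombe2.csv", reconstructed from the output above:
# a line of empty fields, a title line, the label line, then the data rows.
raw = "\n".join([
    ",,,,,,,,,,,,",
    ",Anscombe's Data,,,,,,,,,,,",
    ",Observation,x1,y1,,x2,y2,,x3,y3,,x4,y4",
    ",1,10,8.04,,10,9.14,,10,7.46,,8,6.58",
    ",2,8,6.95,,8,8.14,,8,6.77,,8,5.76",
    ",3,13,7.58,,13,8.74,,13,12.74,,8,7.71",
    ",4,9,8.81,,9,8.77,,9,7.11,,8,8.84",
    ",5,11,8.33,,11,9.26,,11,7.81,,8,8.47",
    ",6,14,9.96,,14,8.1,,14,8.84,,8,7.04",
    ",7,6,7.24,,6,6.13,,6,6.08,,8,5.25",
    ",8,4,4.26,,4,3.1,,4,5.39,,19,12.5",
])

# Skip the two leading lines so the label line becomes the header,
# then drop the spacer columns, which contain no values at all.
wide = pd.read_csv(io.StringIO(raw), skiprows=2).dropna(axis="columns", how="all")

# Stack the four (x, y) column pairs into the long dataset/x/y layout.
parts = []
for i, name in enumerate(["I", "II", "III", "IV"], start=1):
    part = wide[[f"x{i}", f"y{i}"]].rename(columns={f"x{i}": "x", f"y{i}": "y"})
    part.insert(0, "dataset", name)
    parts.append(part)

tidy = pd.concat(parts, ignore_index=True)
print(tidy.head())
```

With the real file, the same `skiprows=2` call could be pointed at `"Anscombe2.csv"` instead of the in-memory string, provided the file really has the layout assumed here.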
