# The Anscombe's Quartet Database

The purpose of this Jupyter notebook is to analyse the Anscombe's quartet dataset. There are four tasks to be carried out in this notebook:

1. Explain the background to the dataset: who created it, when it was created, and any speculation you can find regarding how it might have been created.
2. Plot the interesting aspects of the dataset.
3. Calculate the descriptive statistics of the variables in the dataset.
4. Explain why the dataset is interesting, referring to the plots and statistics above.

## Background

The proverb “A picture is worth 1000 words” is one you have probably heard more than once. A picture can also be worth 1000 data points. [[Natalia](http://natalia.dbsdataprojects.com/2016/02/29/anscombes-quartet/)]

Anscombe's quartet was created by the statistician Francis John Anscombe and published in a 1973 paper in “The American Statistician”. The paper argued for the importance of actually graphing your data rather than depending on statistical analysis alone.

He created four sets of (x, y) data pairs with nearly identical mean of x, mean of y, variance of x and y, linear regression slope and intercept, correlation coefficient, and RMSE. In other words, the data sets seem to be about the same until they are graphed. [[Vernier](https://www.vernier.com/innovate/anscombes-quartet/)]
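The claim that the four sets share their summary statistics can be checked numerically. The following is a minimal sketch, assuming the published quartet values typed inline (so it runs without `anscombe.csv`) and using the sample variance (`ddof=1`); the names `quartet` and `summarize` are illustrative only.

```python
import numpy as np

# Anscombe's quartet, typed inline; sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the summary statistics that the quartet is built to share."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)  # least-squares straight line
    return {
        "mean_x": x.mean(),
        "mean_y": y.mean(),
        "var_x": x.var(ddof=1),  # sample variance (an assumption of this sketch)
        "var_y": y.var(ddof=1),
        "r": np.corrcoef(x, y)[0, 1],
        "slope": slope,
        "intercept": intercept,
    }


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, s in summaries.items():
    print(name, {k: round(float(v), 2) for k, v in s.items()})
```

Rounded to two decimals, the four rows agree on the means, the variance of x, the correlation, and the fitted line; the variance of y agrees to within about 0.01.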

Graphs may not be as precise as statistics, but they provide a unique view onto data that can make it much easier to discover interesting structures than numerical methods. Graphs also provide the context necessary to make better choices and to be more careful when fitting models. Anscombe’s Quartet is a case in point, showing that four datasets that have nearly identical statistical properties can indeed be very different. [[Rstudio](https://rstudio-pubs-static.s3.amazonaws.com/52381_36ec82827e4b476fb968d9143aec7c4f.html)]

<img src="https://upload.wikimedia.org/wikipedia/en/d/d5/Francis_Anscombe.jpeg">

## Plotting the Database

In [1]:
# Import libraries: pandas, numpy, seaborn, matplotlib and scipy.stats
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

In [2]:
# load the example dataset for Anscombe's quartet
df = pd.read_csv('anscombe.csv')
df

Unnamed: 0,x1,y1,x2,y2,x3,y3,x4,y4
0,10,8.04,10,9.14,10,7.46,8,6.58
1,8,6.95,8,8.14,8,6.77,8,5.76
2,13,7.58,13,8.74,13,12.74,8,7.71
3,9,8.81,9,8.77,9,7.11,8,8.84
4,11,8.33,11,9.26,11,7.81,8,8.47
5,14,9.96,14,8.1,14,8.84,8,7.04
6,6,7.24,6,6.13,6,6.08,8,5.25
7,4,4.26,4,3.1,4,5.39,19,12.5
8,12,10.84,12,9.13,12,8.15,8,5.56
9,7,4.82,7,7.26,7,6.42,8,7.91


In [3]:
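# Mean of each x column. A notebook cell displays only its last expression (x4); wrap each call in print() to see all four.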
np.mean(df["x1"]) 
np.mean(df["x2"]) 
np.mean(df["x3"]) 
np.mean(df["x4"])

9.0

In [4]:
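# Mean of each y column. A notebook cell displays only its last expression (y4); wrap each call in print() to see all four.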
np.mean(df["y1"]) 
np.mean(df["y2"]) 
np.mean(df["y3"]) 
np.mean(df["y4"])

7.50090909090909

In [5]:
# Print the standard deviation of x
print("Standard Deviation of X1:\t %f" % np.std(df["x1"]))
print("Standard Deviation of X2:\t %f" % np.std(df["x2"]))
print("Standard Deviation of X3:\t %f" % np.std(df["x3"]))
print("Standard Deviation of X4:\t %f" % np.std(df["x4"]))

Standard Deviation of X1:	 3.162278
Standard Deviation of X2:	 3.162278
Standard Deviation of X3:	 3.162278
Standard Deviation of X4:	 3.162278


In [6]:
# Print the standard deviation of y
print("Standard Deviation of Y1:\t %f" % np.std(df["y1"]))
print("Standard Deviation of Y2:\t %f" % np.std(df["y2"]))
print("Standard Deviation of Y3:\t %f" % np.std(df["y3"]))
print("Standard Deviation of Y4:\t %f" % np.std(df["y4"]))

Standard Deviation of Y1:	 1.937024
Standard Deviation of Y2:	 1.937109
Standard Deviation of Y3:	 1.935933
Standard Deviation of Y4:	 1.936081
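Note that `np.std` defaults to the population standard deviation (`ddof=0`, dividing by n), whereas pandas' `Series.std` and `describe()` default to the sample standard deviation (`ddof=1`, dividing by n - 1). A short sketch of the difference, assuming the eleven published x values of the first dataset typed inline:

```python
import numpy as np

# x values of dataset 1, typed inline for a self-contained example.
x1 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)

population_sd = np.std(x1)      # ddof=0: divide by n
sample_sd = np.std(x1, ddof=1)  # ddof=1: divide by n - 1

print("Population SD of X1:\t %f" % population_sd)
print("Sample SD of X1:\t %f" % sample_sd)
```

The population value is the square root of 10, matching the 3.162278 reported above; the sample value is the square root of 11, which is what a pandas summary of the same column would show.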


In [7]:
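# The wide table has no "dataset" column, so df.groupby("dataset") raises KeyError: 'dataset'.
# Summarise the eight x/y columns directly instead, dropping the saved index column.
df.drop(columns="Unnamed: 0").describe()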
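A grouped summary with `groupby("dataset")` needs a long-format table with a `dataset` column, whereas the table loaded from the CSV is wide (`x1`, `y1`, ..., `x4`, `y4`). The following is a minimal sketch of one way to reshape it, using only the first two rows typed inline; the names `wide`, `long_df` and `summary` are illustrative.

```python
import pandas as pd

# First two rows of the wide table, typed inline so the sketch runs without anscombe.csv.
wide = pd.DataFrame({
    "x1": [10, 8], "y1": [8.04, 6.95],
    "x2": [10, 8], "y2": [9.14, 8.14],
    "x3": [10, 8], "y3": [7.46, 6.77],
    "x4": [8, 8], "y4": [6.58, 5.76],
})

# Stack each (x_i, y_i) pair and label it with its dataset number.
long_df = pd.concat(
    [
        pd.DataFrame({"dataset": str(i), "x": wide[f"x{i}"], "y": wide[f"y{i}"]})
        for i in range(1, 5)
    ],
    ignore_index=True,
)

# With a "dataset" column in place, per-dataset statistics are one groupby away.
summary = long_df.groupby("dataset")[["x", "y"]].agg(["mean", "std"])
print(summary)
```

Applied to the full table, the same reshape would give the `dataset`, `x`, `y` layout that seaborn's bundled `anscombe` example also uses.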

In [None]:
# Show the results of a linear regression within each dataset
data = sns.load_dataset("anscombe")  # long format with columns: dataset, x, y
for data_set in data["dataset"].unique():
    # Keep the rows of one dataset without overwriting the wide df loaded earlier
    subset = data[data["dataset"] == data_set]
    slope, intercept, r_val, p_val, slope_std_error = stats.linregress(subset["x"], subset["y"])
    sns.lmplot(x="x", y="y", data=subset)
    # r_val is the correlation coefficient, so square it to report R^2
    plt.title("Data set {}: y={:.2f}x+{:.2f} (p: {:.2f}, R^2: {:.2f})".format(data_set, slope, intercept, p_val, r_val ** 2))
    plt.show()


Anscombe's quartet comprises four datasets that have nearly identical simple descriptive statistics, yet appear very different when graphed. Each dataset consists of eleven (x, y) points. (wiki)

## Linear Regression

Dataset 1 is the picture drawn by the mind's eye when a simple linear regression equation is reported. Yet, the same summary statistics apply to dataset 2, which shows a perfect curvilinear relation, and to dataset 3, which shows a perfect linear relation except for a single outlier.
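The curvilinear shape of dataset 2 can be illustrated by comparing a straight-line fit with a quadratic fit. This is a minimal sketch, assuming the published dataset 2 values typed inline and using `np.polyfit` as one convenient least-squares routine; it is not part of the original analysis.

```python
import numpy as np

# Dataset 2, typed inline for a self-contained example.
x2 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

linear = np.polyfit(x2, y2, 1)     # [slope, intercept]
quadratic = np.polyfit(x2, y2, 2)  # [a, b, c] for a*x^2 + b*x + c

linear_resid = y2 - np.polyval(linear, x2)
quadratic_resid = y2 - np.polyval(quadratic, x2)

print("max |residual|, straight line:", np.abs(linear_resid).max())
print("max |residual|, parabola:     ", np.abs(quadratic_resid).max())
```

The straight line leaves residuals of more than one unit of y, while the parabola passes within rounding error of every point, which is why the summary statistics alone are misleading here.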

The summary statistics also apply to dataset 4, which is the most troublesome. Datasets 2 and 3 clearly call the straight line relation into question. Dataset 4 does not. A straight line may be appropriate in the fourth case. However, the regression equation is determined entirely by the single observation at x=19. Paraphrasing Anscombe, we need to know the relation between y and x and the special contribution of the observation at x=19 to that relation. (jerry)
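The influence of single observations in datasets 3 and 4 can be made concrete by removing them and refitting. The sketch below assumes the published values typed inline, and treats the point with the largest y in dataset 3 as its outlier; the variable names are illustrative.

```python
import numpy as np

# Datasets 3 and 4, typed inline for a self-contained example.
x3 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

# Dataset 3: drop the outlier (the point with the largest y) and refit.
keep3 = y3 != y3.max()
slope3, intercept3 = np.polyfit(x3[keep3], y3[keep3], 1)
r3 = np.corrcoef(x3[keep3], y3[keep3])[0, 1]
print("dataset 3 without outlier: slope=%.3f intercept=%.3f r=%.4f" % (slope3, intercept3, r3))

# Dataset 4: drop the observation at x=19 and look at what is left.
keep4 = x4 != 19
remaining_x_variance = x4[keep4].var()
print("dataset 4 without x=19: variance of x =", remaining_x_variance)
```

Without its outlier, dataset 3 lies almost exactly on a line with a shallower slope than the reported 0.5. Without the x=19 observation, every remaining x in dataset 4 equals 8, so the variance of x is zero and no regression slope can be estimated at all.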

Anscombe helped computerize statistical analyses while seeking to avoid flawed interpretations of such data. In using computers to analyze statistical data, he drew on his expertise in the sampling of inspections for industrial quality control, the philosophical foundations of probability and the analysis of variance. (NY Times)



## References

1. https://www.vernier.com/innovate/anscombes-quartet/
2. http://www.jerrydallal.com/lhsp/anscombe
3. https://rstudio-pubs-static.s3.amazonaws.com/52381_36ec82827e4b476fb968d9143aec7c4f.html
4. http://natalia.dbsdataprojects.com/2016/02/29/anscombes-quartet/