# Anscombe's quartet

__Purpose:__ The purpose of this lecture is to introduce Anscombe's quartet, a classic illustration of the limitations of summary statistics.

__At the end of this lecture you will be able to:__ 
> 1. Understand the limitations of summary statistics.

### 1.1.1 Anscombe's Quartet:

__Overview:__
- The first stage of any statistical analysis should be graphing the data to visualize its properties (here, the properties of bivariate data)
- Relying solely on numerical statistics without graphing the data can lead to wrong conclusions; Anscombe's Quartet is the classic illustration of this problem
- __[Anscombe's Quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet):__ Anscombe's Quartet comprises 4 datasets, each consisting of bivariate data (X and Y) with 11 data points. The datasets have virtually identical numerical statistics but look drastically different when graphed
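
To make the definition concrete, the sketch below builds the quartet directly from the values published in Anscombe's 1973 paper, so it runs without any data file. The column names `x_1` ... `y_4` follow the ones used in the practice cells of this lecture; the variable names `quartet` and `x_123` are illustrative choices, not part of the lecture's dataset.

```python
import pandas as pd

# Values as published in Anscombe (1973); datasets 1-3 share the same x values.
x_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]

quartet = pd.DataFrame({
    "x_1": x_123,
    "y_1": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "x_2": x_123,
    "y_2": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "x_3": x_123,
    "y_3": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "x_4": [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
    "y_4": [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
})

# Each dataset contributes one x column and one y column with 11 rows.
print(quartet.shape)

# Column means and sample variances, rounded for easy comparison across datasets.
print(quartet.mean().round(2))
print(quartet.var().round(2))
```

Comparing the printed `x_*` rows with each other, and the `y_*` rows with each other, shows how little the four datasets differ numerically.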

__Helpful Points:__
1. The 4 datasets were first published by [Francis Anscombe](https://en.wikipedia.org/wiki/Frank_Anscombe) in 1973. It is not known exactly how he created them, although similar datasets have since been developed by others

__Practice:__ Examples of Anscombe's Quartet in Python

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
import math 
import random
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# read in data to analyze 
anc_quart = pd.read_csv("ancombes_quartet.csv")
anc_quart

### Example 1 (Mean of X and Y):

In [None]:
# mean of x
anc_quart.loc[:, ["x_1", "x_2", "x_3", "x_4"]].mean()

In [None]:
# mean of y
anc_quart.loc[:, ["y_1", "y_2", "y_3", "y_4"]].mean()

### Example 2 (Variance of X and Y):

In [None]:
# variance of x
anc_quart.loc[:, ["x_1", "x_2", "x_3", "x_4"]].var()

In [None]:
# variance of y
anc_quart.loc[:, ["y_1", "y_2", "y_3", "y_4"]].var()

### Example 3 (Correlation between X and Y):

In [None]:
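# correlation between x and y for each dataset
# (the off-diagonal entries of the 2x2 matrix are the correlation coefficient)
np.corrcoef(anc_quart.loc[:, ["x_1", "y_1"]].T)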

In [None]:
np.corrcoef(anc_quart.loc[:, ["x_2", "y_2"]].T)

In [None]:
np.corrcoef(anc_quart.loc[:, ["x_3", "y_3"]].T)

In [None]:
np.corrcoef(anc_quart.loc[:, ["x_4", "y_4"]].T)

If you looked at these numerical statistics alone and stopped here, would you conclude that the four datasets are essentially the same? Probably!
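
The comparison is easier to see when the statistics sit side by side. The following is a minimal sketch of a helper that collects them into one table; the function name `summarize_pairs`, its `n_sets` parameter, and the `toy` frame are hypothetical, and the helper assumes columns named `x_1`, `y_1`, `x_2`, `y_2`, and so on.

```python
import pandas as pd

def summarize_pairs(df, n_sets=4):
    """Return one row of summary statistics per (x_i, y_i) column pair."""
    rows = {}
    for i in range(1, n_sets + 1):
        x, y = df[f"x_{i}"], df[f"y_{i}"]
        rows[f"Dataset {i}"] = {
            "mean_x": x.mean(),
            "mean_y": y.mean(),
            "var_x": x.var(),      # sample variance (ddof=1), like DataFrame.var()
            "var_y": y.var(),
            "corr_xy": x.corr(y),  # Pearson correlation between x and y
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Tiny made-up frame (not the quartet), used only to show the shape of the output.
toy = pd.DataFrame({"x_1": [1, 2, 3], "y_1": [2, 4, 6],
                    "x_2": [1, 2, 3], "y_2": [6, 4, 2]})
print(summarize_pairs(toy, n_sets=2).round(3))
```

With the DataFrame loaded earlier, `summarize_pairs(anc_quart).round(2)` should produce one row per dataset, which makes the near-identical values easy to scan.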

### Example 4 (Plotting):

In [None]:
# 2x2 grid of scatter plots, one panel per dataset
plt.figure(figsize=[15,5])

plt.suptitle("Anscombe's Quartet",fontsize = 16)

plt.subplot(2,2,1)
plt.scatter(anc_quart.x_1,anc_quart.y_1)
plt.title('Dataset 1')
plt.subplot(2,2,2)
plt.scatter(anc_quart.x_2,anc_quart.y_2)
plt.title('Dataset 2')
plt.subplot(2,2,3)
plt.scatter(anc_quart.x_3,anc_quart.y_3)
plt.title('Dataset 3')
plt.subplot(2,2,4)
plt.scatter(anc_quart.x_4,anc_quart.y_4)
plt.title('Dataset 4')

# leave room at the top for the figure title so panel titles do not overlap it
plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()