# Anscombe Quartet Analysis

## Problem statement

The purpose of this assignment is to provide an analysis of Anscombe's quartet dataset. 

The analysis will take the form:

### Explain the background

This section gives a brief history of the dataset: its creator, when it was created, and the purpose of its creation.

Initial research shows that Anscombe's quartet is a set of four datasets that look very similar in terms of descriptive statistics but very different when represented graphically.

Anscombe's quartet was introduced by Francis J. Anscombe in his 1973 paper "Graphs in Statistical Analysis". The quartet is widely described across multiple sources as "four datasets that have nearly identical simple statistical properties, yet appear very different when graphed. Each dataset consists of eleven (x,y) points". The shared properties are the means and variances (and therefore standard deviations) of x and y, the correlation coefficient between x and y, and the fitted linear regression line (slope and intercept). The medians of y, by contrast, differ between the four datasets.

Light research indicates that in 1973 graphical methods of analysis were still in their infancy relative to numerical methods, and that there may have been scepticism about graphical analysis. There appears to be some consensus that, in Anscombe's view, traditional textbooks and methods of analysis gave rise to the notion that numerical calculations were "exact" and graphs were "rough".

It is likely that Anscombe created the quartet to dispel the long-held belief that simple numerical summaries could tell the whole story of a dataset, by showing how graphical techniques can reveal the hidden influence of outliers. This explains why he constructed four very different datasets, some with carefully chosen heavy outliers, so that the standard summary statistics (mean, standard deviation and regression line, among others) come out nearly the same.
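The influence of a single outlier can be made concrete with the third dataset. The sketch below is illustrative only: it uses the standard library, the values are Anscombe's published figures for dataset III typed in by hand rather than read from the CSV, and `fit_line` is a hypothetical helper name, not part of any library.

```python
from statistics import mean

# Dataset III of the quartet (Anscombe, 1973), typed in by hand.
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]


def fit_line(xs, ys):
    """Ordinary least squares fit of y on x; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


# Fit using all eleven points.
slope_all, intercept_all = fit_line(x3, y3)

# Drop the single outlying point (x=13, y=12.74) and refit.
kept = [(x, y) for x, y in zip(x3, y3) if (x, y) != (13, 12.74)]
slope_kept, intercept_kept = fit_line([x for x, _ in kept], [y for _, y in kept])

print(f"with outlier:    y = {intercept_all:.2f} + {slope_all:.3f}x")
print(f"without outlier: y = {intercept_kept:.2f} + {slope_kept:.3f}x")
```

With the outlier removed, the slope falls from about 0.50 to about 0.35, a change that the summary statistics alone would not reveal.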

### Plotting the dataset

This section will involve plotting the interesting aspects of the dataset.

First we find the current working directory, because the data file was downloaded from GitHub and stored locally.

In [31]:
pwd

'C:\\Users\\reidy\\Documents\\Python Fundamentals\\Anscombe-Quartet'

The current directory is also the location of the data file, so we can read the file directly from it into a DataFrame using the pandas library. Displaying the DataFrame `df` as the cell output lets us check that we have read the correct dataset.

In [87]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("Anscombe-Quartet.csv")
df

Unnamed: 0,d,dataset,x,y
0,0,I,10,8.04
1,1,I,8,6.95
2,2,I,13,7.58
3,3,I,9,8.81
4,4,I,11,8.33
5,5,I,14,9.96
6,6,I,6,7.24
7,7,I,4,4.26
8,8,I,12,10.84
9,9,I,7,4.82
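The `Unnamed: 0` column in the output above suggests that the CSV file contains a saved index column with no header name. The following is a minimal sketch under that assumption: it uses a three-row, in-memory stand-in for the file (not the real `Anscombe-Quartet.csv`) to show how the `index_col` parameter of `pd.read_csv` avoids the extra column.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for Anscombe-Quartet.csv. The blank first
# header field is an assumption about how the real file was saved.
sample_csv = io.StringIO(
    ",dataset,x,y\n"
    "0,I,10,8.04\n"
    "1,I,8,6.95\n"
    "2,I,13,7.58\n"
)

# Default read: the unnamed first column becomes a data column.
raw = pd.read_csv(sample_csv)
print(list(raw.columns))

# Rewind the buffer and treat the first column as the row index instead.
sample_csv.seek(0)
clean = pd.read_csv(sample_csv, index_col=0)
print(list(clean.columns))
```

If the real file has the same layout, passing `index_col=0` in the original `pd.read_csv("Anscombe-Quartet.csv")` call should leave only the `dataset`, `x` and `y` columns.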


Now that we are reading the correct data we can begin plotting. Matplotlib and Seaborn (which is built on Matplotlib) give us several ways to represent the data visually. We begin with a simple linear regression plot.

In [109]:
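import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

%matplotlib inline

# Reload the data so this cell does not depend on earlier kernel state.
df = pd.read_csv("Anscombe-Quartet.csv")

# Bind the result to ax, not df: regplot returns a Matplotlib Axes, and
# rebinding df would replace the DataFrame and break later cells with
# "TypeError: 'AxesSubplot' object is not subscriptable".
ax = sns.regplot(x="x", y="y", data=df)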

The plot above pools all four datasets into a single regression, so we need another approach that plots each dataset separately.

In [None]:
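import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

%matplotlib inline

# Reload the data so this cell does not depend on earlier kernel state.
df = pd.read_csv("Anscombe-Quartet.csv")

# Keep the grid in g rather than overwriting df. The size parameter was
# renamed to height in Seaborn 0.9.
g = sns.FacetGrid(df, hue="dataset", height=6).map(plt.scatter, "x", "y").add_legend()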

FacetGrid, used above, is a Seaborn class (not a separate package) that allows us to ...

In [108]:
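import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

%matplotlib inline

# Reload the data so this cell does not depend on earlier kernel state.
df = pd.read_csv("Anscombe-Quartet.csv")

# lmplot returns a FacetGrid; keep it in g rather than overwriting df.
g = sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=df)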

lmplot is a Seaborn function that combines regplot and FacetGrid. Passing `col="dataset"` draws each of the four datasets in its own panel with its own regression line, and `hue="dataset"` gives each one a distinct colour.

### Descriptive statistics

This section will focus on calculating descriptive statistics of the variables in the dataset.
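As a starting point, the summary statistics can be computed per dataset. The sketch below is one possible approach rather than the final analysis: it uses only the standard library, the values are Anscombe's published figures typed in by hand instead of being read from `Anscombe-Quartet.csv`, and `pearson_r`, `least_squares` and `summarise` are hypothetical helper names.

```python
from statistics import mean, median, variance

# Anscombe's published values (1973). Datasets I to III share the same x.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def least_squares(xs, ys):
    """Ordinary least squares fit of y on x; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def summarise(xs, ys):
    """Collect the summary statistics for one dataset."""
    slope, intercept = least_squares(xs, ys)
    return {
        "mean_x": mean(xs), "var_x": variance(xs),  # sample variance (n - 1)
        "mean_y": mean(ys), "var_y": variance(ys),
        "median_y": median(ys),
        "r": pearson_r(xs, ys),
        "slope": slope, "intercept": intercept,
    }


summary = {name: summarise(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, stats in summary.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

The means, variances, correlation and regression coefficients agree to about two decimal places across the four datasets, while the medians of y do not. With the data already in the pandas DataFrame, a `df.groupby("dataset")` aggregation would be the more natural route to the same numbers.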

### Discussion

This section will form the discussion for the analysis. 