# Anscombe's quartet
[Wikipedia entry](https://en.wikipedia.org/wiki/Anscombe%27s_quartet)
## From Wikipedia
*Anscombe's quartet comprises four data sets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data when analyzing it, and the effect of outliers and other influential observations on statistical properties. He described the article as being intended to counter the impression among statisticians that "numerical calculations are exact, but graphs are rough."[1]*


In a nutshell, the four datasets look the same when judged by their summary statistics, yet the underlying values differ and produce very different graphs. This demonstrates the value of graphing data.

## Getting the information
First import pandas, then load the CSV file with `pd.read_csv(filename)`, as in:

In [1]:
import pandas as pd
import plotly.express as px
df = pd.read_csv("anscombe.csv")

## Display and access information about the dataset
[pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html)

* `display(dataframe)` - Display a formatted table of the DataFrame.
* `.info()` - Print a concise summary of a DataFrame.
* `.describe()` - Generate descriptive statistics of DataFrame columns.

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44 entries, 0 to 43
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   dataset  44 non-null     object 
 1   x        44 non-null     float64
 2   y        44 non-null     float64
dtypes: float64(2), object(1)
memory usage: 1.2+ KB


In [11]:
df.describe()

| | x | y |
|---|---|---|
| count | 44.0 | 44.0 |
| mean | 9.0 | 7.500682 |
| std | 3.198837 | 1.958925 |
| min | 4.0 | 3.1 |
| 25% | 7.0 | 6.1175 |
| 50% | 8.0 | 7.52 |
| 75% | 11.0 | 8.7475 |
| max | 19.0 | 12.74 |


## Getting data from specific columns or rows of data
[pandas Tutorial](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html)

Select rows of the data with a conditional expression:
```python
df[conditional]
df[df["col"] == value]
```
Naming the DataFrame twice appears a bit redundant; however, the expression reads as: *"in DataFrame df, keep the rows where the given column has this value"*.
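To see why the DataFrame is named twice, it can help to look at the intermediate result. The sketch below is only an illustration: it uses a small hand-built DataFrame in the same column layout as `anscombe.csv` rather than the file itself, and the names `sample`, `mask`, and `subset` are hypothetical. The inner comparison produces one True/False value per row, and indexing with that result keeps only the rows marked True.

```python
import pandas as pd

# Small hand-built frame for illustration; same columns as anscombe.csv,
# but only four rows
sample = pd.DataFrame({
    "dataset": ["I", "I", "II", "II"],
    "x": [10.0, 8.0, 10.0, 8.0],
    "y": [8.04, 6.95, 9.14, 8.14],
})

# Step 1: the comparison yields one boolean per row
mask = sample["dataset"] == "I"
print(mask.tolist())

# Step 2: indexing with the boolean mask keeps the rows where it is True
subset = sample[mask]
print(subset)
```

Writing `sample[sample["dataset"] == "I"]` does both steps in one expression, which is the form used in this notebook.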

In [1]:
# Run the read_csv cell first; otherwise this fails with
# NameError: name 'df' is not defined
df[df["dataset"] == 'I']

In [14]:
df[df["dataset"] == 'I'].describe()

| | x | y |
|---|---|---|
| count | 11.0 | 11.0 |
| mean | 9.0 | 7.500909 |
| std | 3.316625 | 2.031568 |
| min | 4.0 | 4.26 |
| 25% | 6.5 | 6.315 |
| 50% | 9.0 | 7.58 |
| 75% | 11.5 | 8.57 |
| max | 14.0 | 10.84 |
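The same filter can be repeated for every dataset label instead of typing each one by hand. The following is a minimal sketch, assuming the `dataset` column holds the labels: it builds a short stand-in DataFrame (three points each from two datasets, not the full file) so that it runs on its own, and the name `summaries` is hypothetical.

```python
import pandas as pd

# Stand-in for pd.read_csv("anscombe.csv"): only three points from
# each of two datasets, so the statistics differ from the full file
df = pd.DataFrame({
    "dataset": ["I", "I", "I", "II", "II", "II"],
    "x": [10.0, 8.0, 13.0, 10.0, 8.0, 13.0],
    "y": [8.04, 6.95, 7.58, 9.14, 8.14, 8.74],
})

summaries = {}
# unique() returns each label once, in order of first appearance
for name in df["dataset"].unique():
    subset = df[df["dataset"] == name]   # rows belonging to this dataset
    summaries[name] = subset.describe()  # per-dataset statistics
    print(f"Dataset {name}")
    print(summaries[name])
```

With the real file loaded, the stand-in DataFrame is replaced by the `read_csv` call and the loop visits all four labels.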


## Assignment
The goal of this assignment is to understand why plotting data is so important. Anscombe's quartet is a great example of datasets that seem to be the same statistically but look vastly different once graphed.

To confirm this, perform the following steps:

**Title your notebook 04-Anscombe**

1. Use the Ex-Average-Calc to review how to calculate the average (mean) and standard deviation of ALL of the data (44 samples).
2. Compare your results to those achieved by using the `describe()` command.
3. Perform steps 1 and 2 for each dataset (I-IV), again comparing your results. Consider using Python's `for x in y:` loop.
4. Graph each dataset (I-IV) using `px.scatter()`.
5. Compare your results with those found on the Wikipedia page linked above.
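
When comparing a hand calculation with `describe()`, one detail is easy to miss: pandas reports the sample standard deviation, which divides by n - 1 rather than n. The sketch below shows both conventions using only the standard library. The y values are those of dataset I typed in by hand (an assumption of this sketch; check them against your own `anscombe.csv`), and it is not the Ex-Average-Calc notebook itself.

```python
import math
import statistics

# y values of dataset I, entered by hand for this sketch
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(y)
mean = sum(y) / n

# Sum of squared deviations from the mean, shared by both formulas
squared_dev = sum((v - mean) ** 2 for v in y)

# Sample standard deviation: divide by n - 1 (the pandas default)
sample_std = math.sqrt(squared_dev / (n - 1))

# Population standard deviation: divide by n, slightly smaller
population_std = math.sqrt(squared_dev / n)

print(round(mean, 6), round(sample_std, 6), round(population_std, 6))

# Cross-check against the standard library implementations
print(statistics.stdev(y), statistics.pstdev(y))
```

The mean and sample standard deviation should agree with the `describe()` output for dataset I; if your own result is slightly smaller, you have probably divided by n.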