# Anscombe's Quartet — Analysis & Visualisation (v2)

## What is Anscombe's Quartet?

Anscombe's Quartet (Anscombe, 1973) is a classic dataset of four groups (I–IV), each with 11 (x, y) pairs.  
All four groups share **nearly identical summary statistics** (means, standard deviations, and the correlation between x and y) yet look **completely different** when plotted, a powerful reminder to always visualise your data.

---

## Changes in v2 (iteration-1)
- Scatter plot now shows **all four groups in a single chart** (no subplots)
- **Regression lines removed**
- Plot rebuilt using **Altair** for interactive, colourful output
- Output files versioned (`_v2`)

## Log of steps
1. Import libraries
2. Load dataset from `anscombe_quartet.tsv`
3. Compute descriptive statistics per group
4. Build combined Altair scatter plot
5. Save chart as `anscombe_scatter_v2.png`
6. Observations & conclusions

## Step 1 — Import libraries

- **pandas** — data loading and statistics
- **altair** — declarative, grammar-of-graphics charting library built on Vega-Lite

In [1]:
import pandas as pd
import altair as alt

## Step 2 — Load the dataset

`pd.read_csv` with `sep='\t'` reads the tab-separated file.  
We confirm the shape and preview the data.

In [2]:
df = pd.read_csv('anscombe_quartet.tsv', sep='\t')

print(f'Shape: {df.shape[0]} rows × {df.shape[1]} columns')
print(f'Groups: {sorted(df["dataset"].unique())}')
df.head(12)

Shape: 44 rows × 3 columns
Groups: ['I', 'II', 'III', 'IV']


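|   | dataset | x | y |
|---|---------|---|---|
| 0 | I | 10 | 8.04 |
| 1 | I | 8 | 6.95 |
| 2 | I | 13 | 7.58 |
| 3 | I | 9 | 8.81 |
| 4 | I | 11 | 8.33 |
| 5 | I | 14 | 9.96 |
| 6 | I | 6 | 7.24 |
| 7 | I | 4 | 4.26 |
| 8 | I | 12 | 10.84 |
| 9 | I | 7 | 4.82 |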
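
For readers who do not have `anscombe_quartet.tsv`, the sketch below rebuilds a file in the layout the preview suggests: one row per observation with the columns `dataset`, `x`, and `y`. The values are the published Anscombe (1973) data. The helper names `quartet_rows` and `write_tsv` are illustrative, and the exact layout of the original file is an assumption.

```python
import csv
import io

# Published Anscombe (1973) values; groups I-III share the same x values.
X_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
GROUPS = {
    "I": (X_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def quartet_rows():
    """Yield (dataset, x, y) tuples in long format, one per observation."""
    for name, (xs, ys) in GROUPS.items():
        for x, y in zip(xs, ys):
            yield name, x, y


def write_tsv(handle):
    """Write a header row plus all observations as tab-separated text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["dataset", "x", "y"])  # assumed column names
    writer.writerows(quartet_rows())


# Write to an in-memory buffer so the sketch has no side effects on disk.
buffer = io.StringIO()
write_tsv(buffer)
tsv_lines = buffer.getvalue().splitlines()
print(tsv_lines[0])
print(tsv_lines[1])
print(len(tsv_lines) - 1, "data rows")
```

To create the file on disk, pass a handle opened with `open('anscombe_quartet.tsv', 'w', newline='')` to `write_tsv`; note that this overwrites any existing file of that name.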


## Step 3 — Descriptive statistics per group

`groupby('dataset')` splits the data by group; `.describe()` computes count, mean, std, min, quartiles, and max.  
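The means and standard deviations are nearly identical across groups, while the minima, quartiles, and maxima of y differ; the compact summary in cell 4 isolates the matching statistics.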

In [3]:
for group, gdf in df.groupby('dataset'):
    print(f'\n=== Group {group} ===')
    print(gdf[['x', 'y']].describe().round(4).to_string())


=== Group I ===
             x        y
count  11.0000  11.0000
mean    9.0000   7.5009
std     3.3166   2.0316
min     4.0000   4.2600
25%     6.5000   6.3150
50%     9.0000   7.5800
75%    11.5000   8.5700
max    14.0000  10.8400

=== Group II ===
             x        y
count  11.0000  11.0000
mean    9.0000   7.5009
std     3.3166   2.0317
min     4.0000   3.1000
25%     6.5000   6.6950
50%     9.0000   8.1400
75%    11.5000   8.9500
max    14.0000   9.2600

=== Group III ===
             x        y
count  11.0000  11.0000
mean    9.0000   7.5000
std     3.3166   2.0304
min     4.0000   5.3900
25%     6.5000   6.2500
50%     9.0000   7.1100
75%    11.5000   7.9800
max    14.0000  12.7400

=== Group IV ===
             x        y
count  11.0000  11.0000
mean    9.0000   7.5009
std     3.3166   2.0306
min     8.0000   5.2500
25%     8.0000   6.1700
50%     8.0000   7.0400
75%     8.0000   8.1900
max    19.0000  12.5000


In [4]:
summary = df.groupby('dataset')[['x', 'y']].agg(['mean', 'std']).round(4)
summary.columns = ['x_mean', 'x_std', 'y_mean', 'y_std']
print('Cross-group summary — means and standard deviations:')
print(summary.to_string())

Cross-group summary — means and standard deviations:
         x_mean   x_std  y_mean   y_std
dataset                                
I           9.0  3.3166  7.5009  2.0316
II          9.0  3.3166  7.5009  2.0317
III         9.0  3.3166  7.5000  2.0304
IV          9.0  3.3166  7.5009  2.0306
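
The summary above covers means and standard deviations only. The correlation between x and y, the third statistic the quartet is known for, can be checked in the same spirit. The sketch below is dependency-free and embeds the published Anscombe (1973) values so it runs without the TSV file; the helper `pearson_r` is illustrative and not part of the notebook. In pandas, `gdf['x'].corr(gdf['y'])` inside the `groupby` loop should give the same numbers.

```python
from math import sqrt

# Published Anscombe (1973) values, embedded so the sketch is self-contained.
X_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


correlations = {name: pearson_r(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, r in correlations.items():
    print(f"Group {name}: r = {r:.2f}")
```

All four coefficients come out at roughly 0.82, so the correlation is no better than the mean or standard deviation at telling the groups apart.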


## Step 4 — Combined Altair scatter plot (all groups, no regression lines)

**How this Altair chart works:**
- `alt.Chart(df)` creates a chart from the full DataFrame in one go
- `.mark_point(filled=True, size=100)` draws filled markers; the point mark is used because the circle mark ignores the `shape` encoding
- `.encode(...)` maps data columns to visual channels:
  - `x` → x-axis
  - `y` → y-axis
  - `color` → group (using the qualitative `set1` colour scheme)
  - `shape` → group (double-encodes group membership for accessibility)
  - `tooltip` → hover details
- `.properties(...)` sets the chart title and dimensions
- `.interactive()` enables pan & zoom in the notebook

In [5]:
chart = (
    alt.Chart(df)
    .mark_point(filled=True, size=100, opacity=0.85, stroke='white', strokeWidth=0.8)
    .encode(
        x=alt.X('x:Q', title='x', scale=alt.Scale(zero=False)),
        y=alt.Y('y:Q', title='y', scale=alt.Scale(zero=False)),
        color=alt.Color(
            'dataset:N',
            title='Group',
            scale=alt.Scale(scheme='set1')
        ),
        shape=alt.Shape('dataset:N', title='Group'),
        tooltip=[
            alt.Tooltip('dataset:N', title='Group'),
            alt.Tooltip('x:Q', title='x'),
            alt.Tooltip('y:Q', title='y', format='.2f'),
        ]
    )
    .properties(
        title="Anscombe's Quartet — All Groups",
        width=520,
        height=380
    )
    .interactive()
)

chart

## Step 5 — Save chart as PNG

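In recent Altair releases, `chart.save()` renders the Vega-Lite spec to a static PNG file through the `vl-convert-python` package, which needs to be installed. `scale_factor=2` doubles the pixel resolution; pan, zoom, and tooltips do not carry over to the static image.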

In [6]:
chart.save('anscombe_scatter_v2.png', scale_factor=2)
print('Chart saved as anscombe_scatter_v2.png')

Chart saved as anscombe_scatter_v2.png
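
The print statement above only confirms that `chart.save()` returned. A quick way to confirm that the file on disk really is a PNG, and to see its pixel size, is to read the fixed-layout file header. The sketch below uses only the standard library; `png_dimensions` is a hypothetical helper, not part of Altair.

```python
import struct
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def png_dimensions(path):
    """Return (width, height) in pixels, or None if the file is not a PNG."""
    with open(path, "rb") as handle:
        header = handle.read(24)
    # Layout: 8-byte signature, 4-byte chunk length, b"IHDR", then width and height.
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        return None
    width, height = struct.unpack(">II", header[16:24])
    return width, height


target = Path("anscombe_scatter_v2.png")
if target.exists():
    print(png_dimensions(target))
else:
    print(f"{target} not found; run the save step first")
```

With `scale_factor=2`, the reported size should be about twice the rendered chart size. That is larger than 2 × 520 by 2 × 380, because the `width` and `height` properties describe the plotting area and exclude the axes, title, and legend.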


## Step 6 — Observations & conclusions

Plotting all four groups together on a single chart highlights their structural differences at a glance:

| Group | Pattern |
|-------|---------|
| **I** | Classic linear scatter with moderate noise |
| **II** | Smooth curve that rises, levels off, and turns down: a quadratic rather than linear relationship |
| **III** | Near-perfect line except for one extreme outlier |
| **IV** | All x values identical (x = 8) except one high-leverage point (x = 19) |
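
Three of the patterns in the table can also be checked numerically. The sketch below is a rough, dependency-free illustration using the published Anscombe (1973) values for groups II, III, and IV. It applies a second-difference test for the quadratic shape, a largest-residual search for the outlier, and a distinct-value count for the leverage point. The helper `fit_line` and the variable names are illustrative and not part of the notebook.

```python
# Published Anscombe (1973) values for the groups examined here.
X_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by groups I-III
Y_II = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
Y_III = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
X_IV = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Group II: x runs over consecutive integers, so near-constant second
# differences in y indicate a quadratic curve.
ys_by_x = [y for _, y in sorted(zip(X_123, Y_II))]
first_diffs = [b - a for a, b in zip(ys_by_x, ys_by_x[1:])]
second_diffs = [round(b - a, 2) for a, b in zip(first_diffs, first_diffs[1:])]
print("Group II second differences:", second_diffs)  # all close to -0.25

# Group III: the outlier is the point farthest from the least-squares line.
slope, intercept = fit_line(X_123, Y_III)
outlier = max(zip(X_123, Y_III),
              key=lambda p: abs(p[1] - (intercept + slope * p[0])))
print("Group III largest residual at:", outlier)

# Group IV: only two distinct x values exist.
distinct_x_iv = sorted(set(X_IV))
print("Group IV distinct x values:", distinct_x_iv)
```

None of these checks replaces the plot; they only confirm what the scatter shows at a glance.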

### Key takeaway
> Despite sharing almost identical means, standard deviations, and correlation coefficients (r ≈ 0.82 in every group), the four groups are fundamentally different.  
> **Always visualise your data: summary statistics alone can mislead you.**