![Course header](../assets/img/header.png)

# 02 - Data Analysis Basics: Your EO Toolbox

**Runtime ≈ 2–3 hours** · pandas + NumPy + Matplotlib

In notebook 01 you learned core Python with standard-library tools.
Now we add the **three libraries** you will use every day in Earth observation (EO) analysis:

| Library | What it does |
|---------|--------------|
| **pandas** | Tables: load, filter, group, export |
| **NumPy** | Arrays: fast math on grids of numbers |
| **Matplotlib** | Plots: histograms, scatter, images |

---

## How this notebook works

Each section follows this pattern:

1. **Question**: a realistic EO question.
2. **Tool**: the pandas / NumPy / Matplotlib feature that answers it.
3. **Result**: run the code and see the answer.

Look for:
- **✅ Try it** exercises
- **🧠 Checkpoints** (combine concepts)
- **⚠️ Common mistakes** (avoid these)
- **💡 Tips**

## Table of contents

1. Setup & data loading
2. pandas essentials - load, select, filter, sort, group, export
3. Datetime (tactical)
4. NumPy essentials - arrays, shapes, dtypes, masking, raster thinking
5. Matplotlib essentials - histogram, scatter, imshow
6. 🧠 Mini capstone - build a processing shortlist
7. Wrap-up

---

## 1) Setup & data loading

We load two small datasets that represent a real EO workflow:

| File | Contents | Type |
|------|----------|------|
| `eo_scene_catalog.csv` | Scene metadata: 24 rows, 4 tiles, 2 platforms | table |
| `eo_ndvi_stack.npz` | NDVI values + cloud masks: 24 scenes × 32×32 px | array |

In [None]:
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

DATA_DIR = Path('..') / 'data'
OUT_DIR  = Path('..') / 'outputs'
OUT_DIR.mkdir(exist_ok=True)

print('pandas:', pd.__version__)
print('numpy: ', np.__version__)

In [None]:
df = pd.read_csv(DATA_DIR / 'eo_scene_catalog.csv')
print(f'Loaded {len(df)} rows, {len(df.columns)} columns')
df.head()

In [None]:
npz = np.load(DATA_DIR / 'eo_ndvi_stack.npz', allow_pickle=True)
ndvi       = npz['ndvi']
cloud_mask = npz['cloud']
scene_ids  = npz['scene_id']
datetimes  = npz['datetime']

print('ndvi shape:      ', ndvi.shape, ndvi.dtype)
print('cloud_mask shape:', cloud_mask.shape, cloud_mask.dtype)
print('scene_ids:       ', scene_ids[:3], '...')

---

## 2) pandas essentials

pandas gives you a **DataFrame**: a table with named columns.  
Think of it as a spreadsheet you control with code.

### 2.1) First look - shape, dtypes, describe

**Question:** What columns does this catalog have?  What are their types?

In [None]:
print('Shape:', df.shape)
print()
df.dtypes

In [None]:
df.describe()

In [None]:
df.info()

### Key summary functions

| Method | What it tells you |
|--------|-------------------|
| `df.shape` | (rows, columns) |
| `df.dtypes` | data type per column |
| `df.describe()` | count, mean, std, min, quartiles, max for numeric columns |
| `df.info()` | types + non-null counts per column |
| `df.head(n)` / `df.tail(n)` | first / last n rows |
| `df['col'].value_counts()` | frequency table |

In [None]:
# How many scenes per tile?
df['tile'].value_counts()

In [None]:
# How many scenes per platform?
df['platform'].value_counts()

### 2.2) Selecting columns

**Question:** What are the scene IDs and their cloud cover?

In [None]:
# One column → Series
df['cloud_cover'].head()

In [None]:
# Multiple columns → DataFrame
df[['scene_id', 'cloud_cover']].head()

### ⚠️ Common mistake - single vs. double brackets

- `df['cloud_cover']` → a **Series** (single column).
- `df[['cloud_cover']]` → a **DataFrame** (table with 1 column).
- `df['cloud_cover', 'tile']` → **KeyError!** (missing the inner `[]`).
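
The three cases above can be checked on a tiny table. This is a minimal sketch; the `toy` DataFrame and its values are made up for illustration and are not part of the course data.

```python
import pandas as pd

# Toy catalog; the values are made up for illustration
toy = pd.DataFrame({
    'scene_id': ['A', 'B', 'C'],
    'tile': ['T1', 'T1', 'T2'],
    'cloud_cover': [10.0, 55.0, 20.0],
})

single = toy['cloud_cover']      # Series
double = toy[['cloud_cover']]    # DataFrame with one column
print(type(single).__name__, single.shape)
print(type(double).__name__, double.shape)

# A tuple of names is treated as ONE column label, which does not exist
try:
    toy['cloud_cover', 'tile']
    key_error = False
except KeyError as err:
    key_error = True
    print('KeyError:', err)
```

The Series has a one-element shape, while the one-column DataFrame keeps both dimensions; that difference matters when you pass the result on to functions that expect a table.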

### 2.3) Filtering rows - boolean indexing

**Question:** Which scenes have cloud cover below 20%?

In [None]:
mask = df['cloud_cover'] < 20
print(f'{mask.sum()} scenes with cloud < 20%')
df[mask]

#### Multiple conditions

In pandas, use `&` (and), `|` (or), `~` (not).  
**Wrap each condition in parentheses.**

In [None]:
# Low cloud AND tile T32UQD
good = df[(df['cloud_cover'] < 25) & (df['tile'] == 'T32UQD')]
print(f'{len(good)} good scenes for T32UQD')
good

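### ⚠️ Common mistake - `and` vs `&`

In pandas you must use `&` / `|` / `~`, **not** `and` / `or` / `not`: the Python keywords try to reduce a whole Series to a single True/False and raise a `ValueError`.  
And don't forget the parentheses, because `&` binds more tightly than `<` or `==`: `(df['a'] > 5) & (df['b'] < 10)`.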
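
To see the difference in practice, here is a small sketch on a made-up table (the `toy` DataFrame is hypothetical, not the course catalog). It triggers the error on purpose and then applies the element-wise operators.

```python
import pandas as pd

toy = pd.DataFrame({
    'platform': ['S2A', 'S2B', 'S2B'],
    'cloud_cover': [12.0, 40.0, 8.0],
})

# `and` asks Python for a single True/False from a whole Series, which is ambiguous
try:
    toy[(toy['cloud_cover'] < 25) and (toy['platform'] == 'S2B')]
    raised = False
except ValueError as err:
    raised = True
    print('ValueError:', err)

# `&` compares element by element; `~` negates a mask
clear_s2b = toy[(toy['cloud_cover'] < 25) & (toy['platform'] == 'S2B')]
not_clear = toy[~(toy['cloud_cover'] < 25)]
print(len(clear_s2b), len(not_clear))
```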

### ✅ Try it - filter

Filter `df` to show scenes from platform `S2B` with cloud cover below 25%.
How many are there?

<details>
<summary>Show solution</summary>

```python
s2b_clear = df[(df['platform'] == 'S2B') & (df['cloud_cover'] < 25)]
print(f'{len(s2b_clear)} S2B scenes with cloud < 25%')
s2b_clear
```

</details>

In [None]:
# ✅ Filter to S2B with cloud < 25%


### 2.4) Sorting

**Question:** Which scene has the lowest cloud cover?

In [None]:
df.sort_values('cloud_cover').head(5)

In [None]:
# Descending
df.sort_values('cloud_cover', ascending=False).head(5)

### 2.5) GroupBy - per-tile statistics

**Question:** What is the average cloud cover per tile?  Which tile is cloudiest?

In [None]:
df.groupby('tile')['cloud_cover'].mean()

In [None]:
# Multiple aggregations
df.groupby('tile')['cloud_cover'].agg(['mean', 'min', 'max', 'count'])

In [None]:
# Group by two columns
df.groupby(['tile', 'platform'])['cloud_cover'].mean()
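
If you want readable column names in the result, pandas also supports named aggregation. The sketch below uses a made-up `toy` table; the output names `mean_cloud` and `n_scenes` are chosen for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({
    'tile': ['T1', 'T1', 'T2', 'T2'],
    'cloud_cover': [10.0, 30.0, 50.0, 70.0],
})

# Each keyword becomes an output column: name=(input column, aggregation)
summary = (
    toy.groupby('tile')
       .agg(mean_cloud=('cloud_cover', 'mean'),
            n_scenes=('cloud_cover', 'count'))
       .reset_index()
)
print(summary)
```

`reset_index()` turns the group key back into an ordinary column, which is usually more convenient for a later merge or export.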

### ✅ Try it - group by platform

Group by `platform` and compute the mean, min, and max cloud cover.

<details>
<summary>Show solution</summary>

```python
df.groupby('platform')['cloud_cover'].agg(['mean', 'min', 'max'])
```

</details>

In [None]:
# ✅ Group by platform, compute mean/min/max cloud cover


### 2.6) Adding columns - computed fields

**Question:** Which scenes are "good" (cloud < 25%)?

In [None]:
df['is_good'] = df['cloud_cover'] < 25
df['is_good'].value_counts()

In [None]:
# Number of good scenes per tile
df.groupby('tile')['is_good'].sum()

### 2.7) Export

**Question:** Save the good scenes to a CSV for further processing.

In [None]:
out_path = OUT_DIR / 'good_scenes.csv'
good_df = df[df['is_good']].copy()
good_df.to_csv(out_path, index=False)
print(f'Wrote {len(good_df)} rows to {out_path}')

### 2.8) Gotchas - copy, inplace, chained indexing

In [None]:
# Use .copy() when you want an independent subset
subset = df[df['tile'] == 'T32UQD'].copy()
subset['note'] = 'flagged'
# Original df is NOT changed:
print('note' in df.columns)  # False

### ⚠️ Common mistake - chained indexing

```python
# BAD: may silently do nothing
df[df['cloud_cover'] < 25]['quality'] = 'good'

# GOOD: use .loc[]
df.loc[df['cloud_cover'] < 25, 'quality'] = 'good'
```
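
A runnable version of the `.loc[]` pattern, as a minimal sketch on a made-up table (the `toy` DataFrame and the `'poor'` label are illustrative assumptions):

```python
import pandas as pd

toy = pd.DataFrame({'cloud_cover': [10.0, 55.0, 20.0]})

# .loc[row_mask, column] selects and assigns in a single step on the original frame
toy.loc[toy['cloud_cover'] < 25, 'quality'] = 'good'
toy.loc[toy['cloud_cover'] >= 25, 'quality'] = 'poor'
print(toy)
```

Because selection and assignment happen in one call, pandas never has to decide whether an intermediate result is a view or a copy.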

### 2.9) Merge - combining tables

**Question:** We want to add per-tile metadata (center lat/lon) to our scene catalog.

`pd.merge()` joins two DataFrames on a shared column, like a SQL JOIN.

In [None]:
tile_info = pd.DataFrame({
    'tile': ['T32UQD', 'T32UPD', 'T33UUP', 'T33UVP'],
    'center_lat': [49.79, 48.80, 50.30, 49.50],
    'center_lon': [9.95, 8.95, 11.95, 12.95],
})
tile_info

In [None]:
df_merged = pd.merge(df, tile_info, on='tile', how='left')
print(f'Shape after merge: {df_merged.shape}')
df_merged.head()

In [None]:
# Check for missing values after the merge
df_merged.isna().sum()
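
Counting missing values is one way to spot tiles without metadata. As a complementary sketch, `pd.merge` can also report this directly through its `indicator` and `validate` parameters. The `scenes` and `tiles` tables below are made up, and tile `T9` is deliberately left without a match.

```python
import pandas as pd

scenes = pd.DataFrame({'scene_id': ['A', 'B', 'C'],
                       'tile': ['T1', 'T2', 'T9']})   # T9 has no metadata
tiles = pd.DataFrame({'tile': ['T1', 'T2'],
                      'center_lat': [49.8, 48.8]})

merged = pd.merge(scenes, tiles, on='tile', how='left',
                  validate='many_to_one',   # each tile may appear only once in `tiles`
                  indicator=True)           # adds a `_merge` column
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['scene_id', 'tile']])
```

With `how='left'` every scene is kept, so the row count stays the same and unmatched scenes simply carry `NaN` in the new columns.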

### 🧠 Checkpoint - pandas

Combine what you have learned:

1. Filter `df` to scenes with cloud < 30 **and** tile `T33UVP`.
2. Sort by cloud cover ascending.
3. Select only `scene_id`, `datetime`, and `cloud_cover`.
4. Print the result.

<details>
<summary>Show solution</summary>

```python
result = (
    df[(df['cloud_cover'] < 30) & (df['tile'] == 'T33UVP')]
    .sort_values('cloud_cover')
    [['scene_id', 'datetime', 'cloud_cover']]
)
result
```

</details>

In [None]:
# 🧠 Filter, sort, select: combine all pandas skills


---

## 3) Datetime - tactical

**Question:** Which scenes fall in summer months (June–August)?

pandas can parse dates and let you filter by month.

In [None]:
df['datetime'] = pd.to_datetime(df['datetime'])
print(df['datetime'].dtype)

In [None]:
df['month'] = df['datetime'].dt.month
df['month'].value_counts().sort_index()

In [None]:
# Filter to summer (June–August)
summer = df[df['month'].isin([6, 7, 8])]
print(f'{len(summer)} summer scenes')
summer[['scene_id', 'tile', 'datetime', 'cloud_cover']].head()
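
Real catalogs sometimes contain malformed timestamps. The sketch below, on a made-up Series, shows one way to handle them: `errors='coerce'` turns unparseable entries into `NaT` instead of raising, and the month filter then treats them as non-matching.

```python
import pandas as pd

raw = pd.Series(['2023-06-15T10:20:00', '2023-07-01T10:30:00', 'not a date'])

parsed = pd.to_datetime(raw, errors='coerce')   # bad entries become NaT
print(parsed.isna().sum(), 'unparseable value(s)')

months = parsed.dt.month                        # missing where the date is NaT
is_summer = months.isin([6, 7, 8])
print(is_summer.tolist())
```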

### ✅ Try it - spring scenes

Filter `df` to March–May scenes and compute the mean cloud cover.

<details>
<summary>Show solution</summary>

```python
spring = df[df['month'].isin([3, 4, 5])]
print(f'{len(spring)} spring scenes, mean cloud = {spring["cloud_cover"].mean():.1f}%')
```

</details>

In [None]:
# ✅ Filter to spring (March–May), compute mean cloud cover


---

## 4) NumPy essentials

NumPy gives you **arrays**: fast grids of numbers.  
Most rasters you work with in Python (NDVI, reflectance, temperature) end up as NumPy arrays.

### 4.1) Arrays, shapes, dtypes

**Question:** What shape is our NDVI data?

In [None]:
print('shape:', ndvi.shape)
print('dtype:', ndvi.dtype)
print('ndim: ', ndvi.ndim)
print('size: ', ndvi.size, 'values')
print()
print('Interpretation: 24 scenes, each 32×32 pixels')

### Key array attributes

| Attribute | Meaning | Example |
|-----------|---------|--------|
| `.shape` | dimensions | `(24, 32, 32)` |
| `.dtype` | data type | `float32` |
| `.ndim` | number of axes | `3` |
| `.size` | total elements | `24576` |

### 4.2) Indexing and slicing

**Question:** What is the NDVI of pixel (10, 15) in the first scene?

In [None]:
# One pixel
print('NDVI at scene 0, row 10, col 15:', ndvi[0, 10, 15])

# One scene → 2D array
scene_0 = ndvi[0]  # shape (32, 32)
print('scene_0 shape:', scene_0.shape)

# First 3 scenes
first3 = ndvi[:3]  # shape (3, 32, 32)
print('first3 shape:', first3.shape)

### 4.3) Vectorised operations - no loops needed

**Question:** How do we rescale NDVI values from [0, 1] to [-1, 1]?

In [None]:
# Imagine our data is stored as 0–1 and we want -1 to 1
scaled = ndvi * 2 - 1
print('Original range:', ndvi.min(), '–', ndvi.max())
print('Scaled range:  ', scaled.min(), '–', scaled.max())

In [None]:
# Statistics per scene
scene_means = ndvi.mean(axis=(1, 2))  # mean over pixels
print('Mean NDVI per scene (first 5):', scene_means[:5])
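
The `axis` argument decides which dimensions are collapsed. A minimal sketch on a made-up 2-scene stack, with values chosen so the results are easy to verify by hand:

```python
import numpy as np

# 2 scenes of 2×3 pixels: the first is all zeros, the second all ones
stack = np.array([
    [[0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]],
    [[1.0, 1.0, 1.0],
     [1.0, 1.0, 1.0]],
])

per_scene = stack.mean(axis=(1, 2))   # collapse rows and cols: one value per scene
per_pixel = stack.mean(axis=0)        # collapse scenes: one value per pixel
print(per_scene.shape, per_scene)
print(per_pixel.shape)
```

Reducing over axes 1 and 2 leaves one number per scene, while reducing over axis 0 leaves a 2-D image, which is how a temporal mean composite is formed.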

### 4.4) Boolean masks - flagging bad pixels

**Question:** What fraction of pixels is cloudy in each scene?

In [None]:
cloud_frac = cloud_mask.mean(axis=(1, 2)) * 100
print('Cloud fraction per scene (first 6):')
for i in range(6):
    print(f'  {scene_ids[i]}: {cloud_frac[i]:.1f}%')

In [None]:
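# Mask cloudy pixels with NaN (astype returns a new array, so ndvi is untouched)
ndvi_masked = ndvi.astype('float32')
# Convert to bool first: an integer 0/1 mask would be read as positions, not as a mask
ndvi_masked[cloud_mask.astype(bool)] = np.nan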

# Cloud-free mean per scene
clean_means = np.nanmean(ndvi_masked, axis=(1, 2))
print('Cloud-free NDVI means (first 6):', clean_means[:6].round(3))
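
The same masking logic on a 2×2 example, small enough to check by hand. This is a sketch with made-up values; the mask is stored as `uint8` here on purpose, to show why the conversion to `bool` matters.

```python
import numpy as np

scene = np.array([[0.2, 0.8],
                  [0.6, 0.4]], dtype='float32')
cloud = np.array([[1, 0],
                  [0, 0]], dtype='uint8')      # mask stored as 0/1 integers

# Integer arrays index by position, so convert to bool before masking
masked = scene.copy()
masked[cloud.astype(bool)] = np.nan

plain_mean = np.mean(masked)       # NaN, because one pixel is NaN
clean_mean = np.nanmean(masked)    # mean of the three remaining pixels
print(plain_mean, clean_mean)
```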

### dtype and astype

EO data often comes in different dtypes (uint16, float32, float64).  
Use `.astype()` to convert, but be aware of precision and memory.

| dtype | Bytes | Use case |
|-------|-------|----------|
| `float32` | 4 | NDVI, reflectance |
| `float64` | 8 | high-precision math |
| `uint16` | 2 | raw satellite bands |
| `bool` | 1 | masks |
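
The byte sizes in the table translate directly into memory use. The sketch below assumes a stack of the same shape as the course data and a hypothetical raw band stored as `uint16` with a scale factor of 10000 (a common convention, but an assumption here).

```python
import numpy as np

stack32 = np.zeros((24, 32, 32), dtype='float32')
stack64 = stack32.astype('float64')
print(stack32.nbytes, stack64.nbytes)   # 4 vs 8 bytes per value

# Hypothetical raw band stored as uint16 with a scale factor of 10000
raw = np.array([0, 2500, 10000], dtype='uint16')
reflectance = raw.astype('float32') / 10000

# Casting floats back to an integer dtype truncates the fractional part
truncated = np.array([0.2, 0.9, 1.7]).astype('uint16')
print(reflectance, truncated)
```

Converting to a float dtype before dividing keeps the fractional values; converting back to an integer dtype silently drops them.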

In [None]:
print('ndvi dtype:      ', ndvi.dtype)
print('as float64:      ', ndvi.astype('float64').dtype)
print('cloud_mask dtype:', cloud_mask.dtype)

### ✅ Try it - NumPy stats

1. Compute the overall mean and standard deviation of the NDVI array.
2. Compute the mean NDVI of the **last** scene only.

<details>
<summary>Show solution</summary>

```python
print('Overall mean NDVI:', ndvi.mean())
print('Overall std NDVI: ', ndvi.std())
print('Last scene mean:  ', ndvi[-1].mean())
```

</details>

In [None]:
# ✅ Compute overall mean, std, and last-scene mean


### 4.5) Raster thinking - images as arrays

Satellite images are just 2-D (or 3-D) NumPy arrays.
Building intuition for **shape**, **spatial subsets**, and **composites**
is the key skill that connects tabular data analysis to real raster work.

**Question:** How do we move from a flat grid of numbers to something that looks like a map?

In [None]:
# Our NDVI stack: (scenes, y, x)
print('NDVI shape:', ndvi.shape)
print()
print('Interpretation:')
print(f'  Axis 0  →  {ndvi.shape[0]} scenes (time)')
print(f'  Axis 1  →  {ndvi.shape[1]} rows   (y / latitude)')
print(f'  Axis 2  →  {ndvi.shape[2]} cols   (x / longitude)')
print()

# Pick one scene → a 2-D "image"
scene = ndvi[0]
print('Single scene shape:', scene.shape, '→ a 2-D raster')

#### Spatial subsets (slicing a region)

Slicing a 2-D array with `[row_start:row_stop, col_start:col_stop]` extracts
a rectangular patch, exactly how you would crop a satellite image to an area of interest.
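
One detail worth knowing when cropping: basic slicing returns a *view* on the original memory, not a copy. A minimal sketch on a made-up 4×4 array:

```python
import numpy as np

scene = np.arange(16, dtype='float32').reshape(4, 4)

view = scene[1:3, 1:3]                 # basic slicing returns a view on the same memory
independent = scene[1:3, 1:3].copy()   # .copy() gives the patch its own memory

view[0, 0] = -1.0                      # also changes scene[1, 1]
independent[0, 1] = -2.0               # scene[1, 2] stays as it was
print(scene[1, 1], scene[1, 2])
print(np.shares_memory(scene, view), np.shares_memory(scene, independent))
```

If you plan to modify a cropped patch (for example to mask pixels), call `.copy()` first so the full scene is left untouched.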

In [None]:
# Extract a 16×16 patch from the centre of scene 0
scene = ndvi[0]
r, c = scene.shape[0] // 4, scene.shape[1] // 4  # top-left corner of patch
patch = scene[r:r+16, c:c+16]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Full scene
axes[0].imshow(scene, cmap='YlGn', vmin=0, vmax=1)
axes[0].add_patch(plt.Rectangle((c, r), 16, 16,
                                fill=False, edgecolor='red', linewidth=2))
axes[0].set_title('Full scene (32×32)')

# Extracted patch
axes[1].imshow(patch, cmap='YlGn', vmin=0, vmax=1)
axes[1].set_title(f'Subset [{r}:{r+16}, {c}:{c+16}]')

plt.tight_layout()
plt.show()
print(f'Patch shape: {patch.shape}  |  mean NDVI: {patch.mean():.3f}')

#### Stacking arrays - building composites

`np.stack()` combines several 2-D arrays into one 3-D array along a new axis.
In EO this is how you build multi-band composites or time stacks.

```python
np.stack([band_r, band_g, band_b], axis=-1)   # → (y, x, 3)
```
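
The position of the new axis is controlled by `axis`. A short sketch with three made-up constant layers shows the two layouts you will meet most often:

```python
import numpy as np

a = np.zeros((32, 32), dtype='float32')
b = np.ones((32, 32), dtype='float32')
c = np.full((32, 32), 0.5, dtype='float32')

time_stack = np.stack([a, b, c], axis=0)    # (3, 32, 32): new axis first
band_stack = np.stack([a, b, c], axis=-1)   # (32, 32, 3): new axis last
print(time_stack.shape, band_stack.shape)

# The same 2-D layer comes back out by indexing the new axis
same_layer = np.array_equal(time_stack[1], band_stack[..., 1])
print(same_layer)
```

The axis-first layout matches the `(scenes, y, x)` convention of the NDVI stack, while the axis-last layout is what `imshow` expects for an RGB image.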

> ⚠️ **Disclaimer:** The example below stacks three NDVI scenes as if they were RGB bands.
> This is **only** to demonstrate `np.stack`; it is **not** how real composites are built.
> Real RGB composites use actual reflectance bands (e.g., B4, B3, B2 for Sentinel-2).

In [None]:
# Stack three scenes as if they were R-G-B bands
# (purely illustrative: real composites use reflectance bands)
s0, s1, s2 = ndvi[0], ndvi[5], ndvi[10]
composite = np.stack([s0, s1, s2], axis=-1)  # shape (32, 32, 3)
print('Composite shape:', composite.shape)

# Clip to [0, 1] for display (imshow expects RGB floats in this range)
composite_norm = np.clip(composite, 0, 1)

fig, axes = plt.subplots(1, 4, figsize=(14, 3))
for ax, arr, title in zip(axes[:3],
                           [s0, s1, s2],
                           ['Scene 0 (→ R)', 'Scene 5 (→ G)', 'Scene 10 (→ B)']):
    ax.imshow(arr, cmap='YlGn', vmin=0, vmax=1)
    ax.set_title(title)
    ax.axis('off')

axes[3].imshow(composite_norm)
axes[3].set_title('np.stack → "RGB" composite')
axes[3].axis('off')
plt.tight_layout()
plt.show()

#### `np.where` - masked visualisation

`np.where(condition, x, y)` returns `x` where the condition is True, `y` elsewhere.
This is how you blank out cloudy pixels or highlight specific land-cover classes.
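
Both uses can be sketched on a 2×2 array. The values and the class thresholds (0.3 and 0.6) are illustrative assumptions, not recommended land-cover limits.

```python
import numpy as np

ndvi_toy = np.array([[0.1, 0.5],
                     [0.7, 0.2]])

# Keep values >= 0.3, replace the rest with NaN
vegetated = np.where(ndvi_toy >= 0.3, ndvi_toy, np.nan)

# Nested np.where builds a small class map: 0 = bare, 1 = sparse, 2 = dense
classes = np.where(ndvi_toy < 0.3, 0, np.where(ndvi_toy < 0.6, 1, 2))
print(vegetated)
print(classes)
```

`np.where` always returns a new array, so the input stays unchanged.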

In [None]:
# Blank out cloudy pixels with NaN, then display
scene_idx = 0
clean = np.where(cloud_mask[scene_idx], np.nan, ndvi[scene_idx])

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

axes[0].imshow(ndvi[scene_idx], cmap='YlGn', vmin=0, vmax=1)
axes[0].set_title('NDVI (raw)')

axes[1].imshow(cloud_mask[scene_idx], cmap='Greys_r')
axes[1].set_title('Cloud mask')

axes[2].imshow(clean, cmap='YlGn', vmin=0, vmax=1)
axes[2].set_title('NDVI (clouds → NaN)')

for ax in axes:
    ax.axis('off')
plt.tight_layout()
plt.show()

print(f'Pixels before masking: {np.isfinite(ndvi[scene_idx]).sum()}')
print(f'Pixels after masking:  {np.isfinite(clean).sum()}')

### ✅ Try it - raster operations

1. Extract a **10×10** patch from the bottom-right corner of scene 3.
2. Use `np.where` to set all pixels with NDVI < 0.3 to `np.nan` in scene 3, then compute the mean of the remaining values with `np.nanmean`.

<details>
<summary>Show solution</summary>

```python
# 1. Bottom-right 10×10 patch of scene 3
patch_br = ndvi[3, -10:, -10:]
print('Bottom-right patch shape:', patch_br.shape)
print('Patch mean NDVI:', patch_br.mean().round(3))

# 2. Mask low NDVI
high_ndvi = np.where(ndvi[3] >= 0.3, ndvi[3], np.nan)
print('Mean NDVI (≥ 0.3 only):', np.nanmean(high_ndvi).round(3))
```

</details>

In [None]:
# ✅ 1. Extract bottom-right 10×10 patch of scene 3

# ✅ 2. Mask NDVI < 0.3 with np.where, then np.nanmean


### 🧠 Checkpoint - NumPy

**Q1.** What does `ndvi.shape` tell you?

- (a) 24 bands, 32 rows, 32 columns
- (b) 24 scenes, each 32 × 32 pixels
- (c) 32 scenes, each 24 × 32 pixels

<details>
<summary>Show answer</summary>

**(b)**: Axis 0 is scenes (time); axes 1 and 2 are rows and columns (spatial).

</details>

**Q2.** Why use `np.nanmean()` instead of `np.mean()`?

<details>
<summary>Show answer</summary>

`np.mean()` propagates `NaN` values: if any pixel is `NaN`, the result is `NaN`.
`np.nanmean()` ignores `NaN`s and computes the mean of the remaining valid pixels.

</details>

**Q3.** You want the bottom-right 5×5 pixels of the 10th scene. Which slice?

- (a) `ndvi[10, -5:, -5:]`
- (b) `ndvi[9, -5:, -5:]`
- (c) `ndvi[-5:, -5:, 9]`

<details>
<summary>Show answer</summary>

**(b)**: The 10th scene has index 9 (0-based). Negative slicing `-5:` grabs the last 5 elements along that axis. Option (a) would be the 11th scene; option (c) has the axes in the wrong order.

</details>

---

## 5) Matplotlib essentials

Matplotlib makes plots.  We'll cover the three you use most in EO:
1. **Histogram**: distribution of values
2. **Scatter**: two variables against each other
3. **imshow**: display a 2D array as an image

### 5.1) Histogram

**Question:** What does the cloud cover distribution look like?

In [None]:
fig, ax = plt.subplots(figsize=(6, 3))
ax.hist(df['cloud_cover'], bins=10, edgecolor='white')
ax.set_xlabel('Cloud cover (%)')
ax.set_ylabel('Number of scenes')
ax.set_title('Cloud cover distribution')
plt.tight_layout()
plt.show()

### 5.2) Scatter plot

**Question:** Is there a relationship between cloud cover and mean NDVI?

In [None]:
scene_mean_ndvi = ndvi.mean(axis=(1, 2))

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(df['cloud_cover'], scene_mean_ndvi, c='seagreen', edgecolor='k', s=40)
ax.set_xlabel('Cloud cover (%)')
ax.set_ylabel('Mean NDVI')
ax.set_title('Cloud cover vs. NDVI')
plt.tight_layout()
plt.show()

### 5.3) imshow - displaying rasters

**Question:** What does the NDVI of the clearest scene look like?

In [None]:
# Find the clearest scene
best_idx = df['cloud_cover'].idxmin()
best_scene_id = df.loc[best_idx, 'scene_id']
best_cloud = df.loc[best_idx, 'cloud_cover']
print(f'Clearest scene: {best_scene_id} (cloud={best_cloud}%)')

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

im0 = axes[0].imshow(ndvi[best_idx], cmap='YlGn', vmin=0, vmax=1)
axes[0].set_title(f'NDVI - {best_scene_id}')
plt.colorbar(im0, ax=axes[0], shrink=0.8)

axes[1].imshow(cloud_mask[best_idx], cmap='gray_r')
axes[1].set_title('Cloud mask')

plt.tight_layout()
plt.show()
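
The cell above uses the DataFrame row label returned by `idxmin()` as a position in the NDVI stack. That works as long as the catalog keeps its default 0..n-1 index and its rows are in the same order as the stack, which appears to be the case for the course data. When that is not guaranteed, a more defensive sketch looks the scene up by ID. The toy `stack_ids`, `stack`, and `catalog` below are made up, with the rows deliberately shuffled.

```python
import numpy as np
import pandas as pd

# Toy stack and catalog whose rows are deliberately in a different order
stack_ids = np.array(['S1', 'S2', 'S3'])
stack = np.stack([np.full((2, 2), v) for v in (0.1, 0.2, 0.3)])
catalog = pd.DataFrame({'scene_id': ['S3', 'S1', 'S2'],
                        'cloud_cover': [5.0, 40.0, 20.0]})

best_id = catalog.loc[catalog['cloud_cover'].idxmin(), 'scene_id']

# Look up the position of that scene_id in the stack instead of reusing the row label
pos = int(np.flatnonzero(stack_ids == best_id)[0])
print(best_id, pos, stack[pos].mean())
```

Here the clearest scene sits in catalog row 0 but at stack position 2, so reusing the row label would have plotted the wrong image.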

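💡 **Tip: imshow & geographic data**

By default `imshow` puts row 0 at the **top** (origin = upper-left), which matches image convention and most north-up rasters, where row 0 is the northernmost row.
If row 0 of your array is the southernmost row instead (latitude increasing with the row index), pass `origin='lower'` so that south ends up at the bottom.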

To add real-world coordinates to the axes, use the `extent=` parameter:

```python
ax.imshow(data, extent=[lon_min, lon_max, lat_min, lat_max], origin='lower')
```

We will revisit this when working with xarray and rasterio in later notebooks.

### ✅ Try it - plot the cloudiest scene

Find the cloudiest scene (highest `cloud_cover`). Display its NDVI and cloud mask side-by-side using `imshow`.

<details>
<summary>Show solution</summary>

```python
worst_idx = df['cloud_cover'].idxmax()
worst_id = df.loc[worst_idx, 'scene_id']

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
im = axes[0].imshow(ndvi[worst_idx], cmap='YlGn', vmin=0, vmax=1)
axes[0].set_title(f'NDVI - {worst_id}')
plt.colorbar(im, ax=axes[0], shrink=0.8)

axes[1].imshow(cloud_mask[worst_idx], cmap='gray_r')
axes[1].set_title('Cloud mask')
plt.tight_layout()
plt.show()
```

</details>

In [None]:
# ✅ Find cloudiest scene, plot NDVI + cloud mask side-by-side


### Bonus - pandas `.plot()`

pandas has a quick `.plot()` method built on matplotlib.  
Great for fast exploration; use raw matplotlib when you need full control.

In [None]:
df.groupby('tile')['cloud_cover'].mean().plot.bar(
    ylabel='Mean cloud cover (%)',
    title='Mean cloud cover per tile',
    figsize=(6, 3),
)
plt.tight_layout()
plt.show()

---

## 6) 🧠 Mini capstone - build a processing shortlist

Combine everything you learned.  
Imagine you are building an NDVI time-series for a study area.

> ⚠️ This exercise uses `df_merged` from §2.9. If you skipped that section, run the merge cell first.

### Task

1. Define an AOI point: `lat = 49.79, lon = 9.95`.
2. Filter the catalog to scenes with `cloud_cover < 30`.
3. For each tile, keep only the **most recent** good scene.
4. Compute a simple distance from the AOI to each tile center.
5. Sort by distance and print the shortlist.
6. Export to CSV.
7. Plot the NDVI of the single best scene.

💡 *Hint: use `sort_values` + `drop_duplicates(subset='tile', keep='last')`.*
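
The pattern from the hint, sketched on a made-up four-row table (scene IDs and dates are illustrative): sort by time so the newest row of each tile comes last, then keep only that last row.

```python
import pandas as pd

toy = pd.DataFrame({
    'scene_id': ['A', 'B', 'C', 'D'],
    'tile': ['T1', 'T1', 'T2', 'T2'],
    'datetime': pd.to_datetime(['2023-05-01', '2023-08-01',
                                '2023-07-01', '2023-06-01']),
})

latest = (
    toy.sort_values('datetime')                      # oldest first
       .drop_duplicates(subset='tile', keep='last')  # keep the newest row per tile
)
print(latest.sort_values('tile'))
```

Parsing the column with `pd.to_datetime` first makes the sort chronological regardless of how the timestamps were written in the CSV.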

<details>
<summary>Show solution</summary>

```python
# --- Step 1: AOI ---
aoi_lat, aoi_lon = 49.79, 9.95

# --- Step 2: Filter ---
clear = df_merged[df_merged['cloud_cover'] < 30].copy()
print(f'{len(clear)} scenes with cloud < 30%')

# --- Step 3: Most recent per tile ---
# df_merged was built in §2.9, before the datetime conversion in §3, so parse here
clear['datetime'] = pd.to_datetime(clear['datetime'])
clear = clear.sort_values('datetime')
latest = clear.drop_duplicates(subset='tile', keep='last')
print(f'{len(latest)} tiles after dedup')

# --- Step 4: Distance from AOI ---
latest = latest.copy()
latest['dist_deg'] = np.sqrt(
    (latest['center_lat'] - aoi_lat)**2 +
    (latest['center_lon'] - aoi_lon)**2
)
latest = latest.sort_values('dist_deg')

# --- Step 5: Print shortlist ---
print('Processing shortlist (closest first):')
print(latest[['scene_id', 'tile', 'datetime', 'cloud_cover', 'dist_deg']].to_string(index=False))

# --- Step 6: Export ---
export_path = OUT_DIR / 'processing_shortlist.csv'
latest.to_csv(export_path, index=False)
print(f'Exported to {export_path}')

# --- Step 7: Plot best scene ---
best_row = latest.iloc[0]
best_global_idx = best_row.name  # original df index

fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(ndvi[best_global_idx], cmap='YlGn', vmin=0, vmax=1)
ax.set_title(f"Best scene: {best_row['scene_id']}\ncloud={best_row['cloud_cover']}%")
plt.colorbar(im, ax=ax, shrink=0.8, label='NDVI')
plt.tight_layout()
plt.show()
```

</details>
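
The solution measures distance in degrees, which is enough to rank four tiles but is not a real distance: one degree of longitude covers less ground than one degree of latitude at these latitudes. If you need kilometres, one common alternative is the haversine formula. The sketch below is not part of the original solution; `haversine_km` is a hypothetical helper, and it assumes a spherical Earth with a mean radius of 6371 km. The tile centres are the ones defined in §2.9.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km; inputs in degrees, scalars or arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

aoi_lat, aoi_lon = 49.79, 9.95
center_lat = np.array([49.79, 48.80, 50.30, 49.50])
center_lon = np.array([9.95, 8.95, 11.95, 12.95])

dist_km = haversine_km(aoi_lat, aoi_lon, center_lat, center_lon)
print(dist_km.round(1))
```

Because the function is written with NumPy operations only, it accepts whole columns such as `latest['center_lat']` and returns one distance per row.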

In [None]:
# 🧠 Build your processing shortlist (steps 1–7 above)
# Requires: df_merged from §2.9


---

## 7) Wrap-up

### What you learned

| Tool | Key skills |
|------|------------|
| **pandas** | `read_csv`, `[]`, boolean filter, `sort_values`, `groupby`, `merge`, `to_csv` |
| **Datetime** | `pd.to_datetime`, `.dt.month`, `.isin()` |
| **NumPy** | shape, dtype, slicing, vectorised math, boolean masks, `np.nanmean`, spatial subsets, `np.stack`, `np.where` |
| **Matplotlib** | `hist`, `scatter`, `imshow`, subplots, colorbar |

### Common-mistake summary

| Mistake | Fix |
|---------|-----|
| `df['a', 'b']` | `df[['a', 'b']]` (double brackets) |
| `and` in pandas filter | use `&`, wrap in `()` |
| Modifying a view | use `.copy()` |
| Chained indexing | use `.loc[]` |

### What's next

In the next notebooks you will use **xarray** to work with labelled, multi-dimensional data cubes,
**rasterio** to read real satellite GeoTIFFs, and **STAC** to search for data in the cloud.