# eBird Data Exploration

This notebook explores the eBird dataset for Spain to understand its structure and prepare for model training.

## 1. Loading the Dataset

We have two datasets:
- `sample.csv`: A small sample for quick exploration
- `ebird_spain_2020-2025.txt`: The full Spain dataset (~32M rows)

In [None]:
import pandas as pd

In [None]:
# Load the sample dataset (tab-separated despite the .csv extension)
sample_df = pd.read_csv("../data/raw/sample.csv", delimiter="\t")
print(f"Sample shape: {sample_df.shape}")
sample_df.head()

In [None]:
# Preview the full dataset structure (first 100 rows)
full_df = pd.read_csv("../data/raw/ebird_spain_2020-2025.txt", delimiter="\t", nrows=100)
print(f"Columns: {len(full_df.columns)}")
print(full_df.columns.tolist())

In [None]:
# Count data rows in the full dataset (subtract 1 for the header line)
with open("../data/raw/ebird_spain_2020-2025.txt", "r", encoding="utf-8") as f:
    row_count = sum(1 for _ in f) - 1
print(f"Total rows in full dataset: {row_count:,}")

## 2. Column Exploration

The eBird dataset has 53 columns. For our ML model, we need a subset focused on:
- **What**: Species identification (COMMON NAME, SCIENTIFIC NAME)
- **Where**: Location (LATITUDE, LONGITUDE, STATE)
- **When**: Time (OBSERVATION DATE, TIME OBSERVATIONS STARTED)
- **How**: Observation method (OBSERVATION TYPE, DURATION MINUTES, EFFORT DISTANCE KM)
- **Completeness**: Whether all species were reported (ALL SPECIES REPORTED)

In [None]:
# Columns we'll use for the model
cols_to_keep = [
    "COMMON NAME",
    "SCIENTIFIC NAME",
    "STATE",
    "LATITUDE",
    "LONGITUDE",
    "OBSERVATION DATE",
    "TIME OBSERVATIONS STARTED",
    "OBSERVATION TYPE",
    "DURATION MINUTES",
    "EFFORT DISTANCE KM",
    "ALL SPECIES REPORTED",
    "SAMPLING EVENT IDENTIFIER"
]

# Load a larger sample with just these columns
df = pd.read_csv(
    "../data/raw/ebird_spain_2020-2025.txt",
    delimiter="\t",
    usecols=cols_to_keep,
    nrows=1_000_000
)
print(f"Shape: {df.shape}")
df.head()

In [None]:
# What observation types exist?
print("Observation Types:")
print(df["OBSERVATION TYPE"].value_counts())

## 3. Data Quality

Check for missing values and data quality issues.

In [None]:
# Check for null values
print("Null counts per column:")
print(df.isnull().sum())

In [None]:
# Nulls in DURATION MINUTES and EFFORT DISTANCE KM are expected:
# - Stationary protocols don't have distance
# - Incidental observations don't have duration or distance

print("\nNull analysis by observation type:")
for obs_type in df["OBSERVATION TYPE"].unique():
    subset = df[df["OBSERVATION TYPE"] == obs_type]
    print(f"\n{obs_type}:")
    print(f"  Duration nulls: {subset['DURATION MINUTES'].isnull().sum()} / {len(subset)}")
    print(f"  Distance nulls: {subset['EFFORT DISTANCE KM'].isnull().sum()} / {len(subset)}")
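The same per-type breakdown can also be expressed as a single grouped aggregation, which gives null shares rather than raw counts. This is a minimal sketch on a made-up toy frame; `toy` and `null_share_by_type` are hypothetical names used only for illustration, and the values are not taken from the eBird data.

```python
import pandas as pd

# Toy frame standing in for `df`; values are invented for illustration only.
toy = pd.DataFrame({
    "OBSERVATION TYPE": ["Traveling", "Traveling", "Stationary", "Incidental"],
    "DURATION MINUTES": [30.0, 45.0, 20.0, None],
    "EFFORT DISTANCE KM": [1.2, 2.5, None, None],
})


def null_share_by_type(frame: pd.DataFrame) -> pd.DataFrame:
    """Fraction of null duration/distance values per observation type."""
    effort_cols = ["DURATION MINUTES", "EFFORT DISTANCE KM"]
    # isnull() yields booleans; their mean per group is the share of nulls.
    return frame[effort_cols].isnull().groupby(frame["OBSERVATION TYPE"]).mean()


null_share = null_share_by_type(toy)
print(null_share)
```

A share near 1.0 for a protocol suggests the field is structurally absent for that protocol rather than randomly missing.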

In [None]:
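# ALL SPECIES REPORTED marks complete checklists (1 = complete, 0 = incomplete).
# These are valuable for generating negative observations (species NOT seen).
# Each row is one species observation, so deduplicate by checklist before counting.
checklists = df.drop_duplicates("SAMPLING EVENT IDENTIFIER")
print("Complete checklists (ALL SPECIES REPORTED):")
print(checklists["ALL SPECIES REPORTED"].value_counts())
print(f"\n{checklists['ALL SPECIES REPORTED'].mean() * 100:.1f}% of checklists are complete")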
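To make the idea of negative observations concrete, here is a rough sketch of zero-filling for a single species: every complete checklist gets a row, flagged 1 if the species was reported and 0 otherwise. It assumes `ALL SPECIES REPORTED` is coded as 0/1; the `zero_fill` helper, the `detected` column, and the toy rows are hypothetical and not part of the notebook's pipeline.

```python
import pandas as pd

# Toy observations; identifiers and species rows are invented for illustration.
obs = pd.DataFrame({
    "SAMPLING EVENT IDENTIFIER": ["S1", "S1", "S2", "S3"],
    "COMMON NAME": [
        "European Robin", "Common Blackbird", "European Robin", "Common Blackbird",
    ],
    "ALL SPECIES REPORTED": [1, 1, 1, 0],
})


def zero_fill(frame: pd.DataFrame, species: str) -> pd.DataFrame:
    """One row per complete checklist with a 0/1 `detected` flag for `species`."""
    # Only complete checklists justify inferring an absence.
    complete = frame[frame["ALL SPECIES REPORTED"] == 1]
    checklists = complete[["SAMPLING EVENT IDENTIFIER"]].drop_duplicates()
    seen = complete.loc[complete["COMMON NAME"] == species, "SAMPLING EVENT IDENTIFIER"]
    checklists = checklists.assign(
        detected=checklists["SAMPLING EVENT IDENTIFIER"].isin(seen).astype(int)
    )
    return checklists.reset_index(drop=True)


blackbird = zero_fill(obs, "Common Blackbird")
print(blackbird)
```

Checklist `S3` is dropped because it is incomplete: a species missing from an incomplete checklist cannot be treated as absent.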

## 4. Scope Decision

### Region: Andalucía

We're limiting the scope to the Andalucía region for several reasons:
1. **API Rate Limits**: Open-Meteo's historical weather API has daily limits
2. **Manageable Scale**: Fewer coordinates to fetch weather for
3. **Rich Biodiversity**: Andalucía has diverse habitats and bird species

### Weather Data: 2025 Only

We originally planned to fetch weather for 5 years (2020-2025), but reduced this to 1 year (2025) due to:
1. **API Constraints**: 10k calls/day limit
2. **Faster Iteration**: Can train initial models sooner
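One way to apply this scope without loading all ~32M rows at once is to stream the file in chunks and keep only matching rows. The sketch below runs on a tiny in-memory stand-in for the tab-separated file; `load_scope` and its parameters are hypothetical names, and for the real file one would pass the path and a much larger `chunksize`.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the tab-separated file; rows are invented.
toy_tsv = io.StringIO(
    "COMMON NAME\tSTATE\tOBSERVATION DATE\n"
    "European Robin\tAndalucía\t2025-03-01\n"
    "European Robin\tMadrid\t2025-03-01\n"
    "Common Blackbird\tAndalucía\t2024-12-31\n"
    "Common Blackbird\tAndalucía\t2025-06-15\n"
)


def load_scope(source, state="Andalucía", year=2025, chunksize=2):
    """Stream `source` in chunks, keeping only rows for one state and year."""
    kept = []
    for chunk in pd.read_csv(source, delimiter="\t", chunksize=chunksize):
        dates = pd.to_datetime(chunk["OBSERVATION DATE"])
        mask = (chunk["STATE"] == state) & (dates.dt.year == year)
        kept.append(chunk[mask])
    # Only the filtered rows are ever held in memory together.
    return pd.concat(kept, ignore_index=True)


scoped = load_scope(toy_tsv)
print(scoped)
```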

In [None]:
# Check available regions (STATE column)
print("Observations by region:")
print(df["STATE"].value_counts())

In [None]:
# How many unique coordinates in Andalucía?
andalucia_df = df[df["STATE"] == "Andalucía"].copy()
andalucia_df["lat_rounded"] = andalucia_df["LATITUDE"].round(1)
andalucia_df["lon_rounded"] = andalucia_df["LONGITUDE"].round(1)

unique_coords = andalucia_df[["lat_rounded", "lon_rounded"]].drop_duplicates()
print(f"Unique coordinates in Andalucía (0.1 degree precision): {len(unique_coords)}")
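The unique coordinate count translates directly into an API budget. As a back-of-the-envelope sketch, assume each rounded coordinate costs a fixed number of calls and that the daily limit is 10,000 calls. The actual cost per request depends on how the provider accounts for date ranges and variables, so `calls_per_location` is left as a parameter. The `days_to_fetch` helper and the location counts below are hypothetical, not measured from the dataset.

```python
import math


def days_to_fetch(n_locations, calls_per_location=1, daily_limit=10_000):
    """Days needed to fetch all locations while staying under a daily call limit."""
    total_calls = n_locations * calls_per_location
    return math.ceil(total_calls / daily_limit)


# Hypothetical location counts for illustration only.
print(days_to_fetch(2_500))                         # 2,500 calls fit in a single day
print(days_to_fetch(25_000))                        # 25,000 calls need three days
print(days_to_fetch(2_500, calls_per_location=5))   # 12,500 calls need two days
```

Rounding coordinates more coarsely shrinks `n_locations`, trading spatial precision for fewer calls.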

In [None]:
# Date range in the loaded subset (first 1,000,000 rows only, so it may not span the full file)
print(f"Date range: {df['OBSERVATION DATE'].min()} to {df['OBSERVATION DATE'].max()}")