# 🔍 Exploratory Data Analysis – Overview

**Project:** Gold Pathfinder ML Project  
**Notebook:** `01_eda_overview.ipynb`  
**Milestone:** 3 – Data Analysis / Exploration

This notebook provides a **first look** at the prepared dataset:

- Load `gold_assays_final.csv`
- Inspect structure, columns, and dtypes
- Check basic statistics and missing values
- Summarize sample types and anomalies

It is the starting point for all deeper EDA and modeling work.

In [None]:
from pathlib import Path
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', 80)
pd.set_option('display.width', 140)
pd.set_option('display.float_format', lambda x: f'{x:0.4f}')

# Assumes the notebook's working directory is 3_data_exploration/, so '..' is the project root
PROJECT_ROOT = Path('..').resolve()
DATA_PROCESSED = PROJECT_ROOT / '1_datasets' / 'processed'
DATA_PROCESSED

## 1️⃣ Load Final Processed Dataset

We load the unified assay dataset created in Milestone 2:

```text
1_datasets/processed/gold_assays_final.csv
```

In [None]:
final_path = DATA_PROCESSED / 'gold_assays_final.csv'
final_path, final_path.exists()

In [None]:
df = pd.read_csv(final_path)
df.head()
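
If the path check above returns `False`, `pd.read_csv` fails with a less obvious traceback. A guarded loader along the following lines can make that failure easier to read. This is only a sketch: `load_assays` is a hypothetical helper, not part of the project, and the demo values are made up so that the example runs without the real CSV.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_assays(csv_path: Path) -> pd.DataFrame:
    """Load the assay CSV, failing early with a readable message (hypothetical helper)."""
    csv_path = Path(csv_path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"Expected dataset at {csv_path}; check PROJECT_ROOT and the Milestone 2 outputs."
        )
    frame = pd.read_csv(csv_path)
    if frame.empty:
        raise ValueError(f"{csv_path} was read but contains no rows.")
    return frame


# Demo on a throwaway file with made-up values (not project data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "gold_assays_final.csv"
    pd.DataFrame(
        {"sample_type": ["core", "rc"], "au_ppm": [0.12, 1.5]}
    ).to_csv(demo_path, index=False)
    demo_df = load_assays(demo_path)

print(demo_df.shape)
```

In the notebook itself, the call would presumably be `df = load_assays(final_path)`.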

## 2️⃣ Basic Shape & Structure

- Number of rows and columns  
- Column names and data types

In [None]:
df.shape

In [None]:
df.dtypes

In [None]:
df.columns.tolist()
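
The three cells above can also be folded into a single table, which is sometimes easier to scan for wide datasets. The sketch below assumes nothing about the real columns: `structure_overview` is a hypothetical helper and `toy` is a made-up frame used only to show the output shape.

```python
import numpy as np
import pandas as pd


def structure_overview(frame: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, non-null count, and number of distinct values."""
    return pd.DataFrame(
        {
            "dtype": frame.dtypes.astype(str),
            "n_non_null": frame.notna().sum(),
            "n_unique": frame.nunique(dropna=True),
        }
    )


# Made-up values for illustration only (not project data).
toy = pd.DataFrame(
    {
        "sample_type": ["core", "rc", "core", None],
        "au_ppm": [0.12, 1.50, np.nan, 0.03],
    }
)
overview = structure_overview(toy)
print(overview)
```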

## 3️⃣ Missing Values Overview

We check how many missing values each column contains.
This helps prioritize cleaning or imputation in later steps.

In [None]:
missing = df.isna().sum().sort_values(ascending=False)
missing[missing > 0]
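
Raw counts are hard to compare without the row count next to them, so a percentage column can help when prioritizing cleaning or imputation. The following is a minimal sketch with a hypothetical helper (`missing_summary`) and made-up values; the column names only mirror those mentioned in this notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns with at least one gap."""
    n_missing = frame.isna().sum()
    summary = pd.DataFrame(
        {
            "n_missing": n_missing,
            "pct_missing": 100 * n_missing / len(frame),
        }
    )
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("n_missing", ascending=False)


# Made-up values for illustration only (not project data).
toy = pd.DataFrame(
    {
        "sample_id": [1, 2, 3, 4],
        "au_ppm": [0.12, np.nan, 0.40, 0.03],
        "as_ppm": [np.nan, np.nan, 55.0, 12.0],
    }
)
toy_missing = missing_summary(toy)
print(toy_missing)
```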

## 4️⃣ Key Categorical Fields

We inspect value counts for important categorical fields such as:

- `sample_type` (core, rc, chip, trench, grab)
- `project_area`
- `anomaly_id`

In [None]:
categorical_cols = [c for c in ['sample_type', 'project_area', 'anomaly_id'] if c in df.columns]
for col in categorical_cols:
    print(f'\nValue counts for {col}:')
    print(df[col].value_counts(dropna=False))
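
Alongside the raw counts, the share of each category makes imbalance between sample types easier to spot. One possible way to show both is sketched below; `category_shares` is a hypothetical helper and the values are made up.

```python
import pandas as pd


def category_shares(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Counts and fractional shares per category, keeping missing values as a row."""
    out = frame[column].value_counts(dropna=False).to_frame(name="count")
    out["share"] = out["count"] / out["count"].sum()
    return out


# Made-up values for illustration only (not project data).
toy = pd.DataFrame({"sample_type": ["core", "core", "rc", None]})
shares = category_shares(toy, "sample_type")
print(shares)
```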

## 5️⃣ Numerical Summary

We compute descriptive statistics for key numerical columns such as:

- `au_ppm` (gold)
- pathfinder elements: `as_ppm`, `sb_ppm`, `bi_ppm`, `cu_ppm`, `zn_ppm`, etc.

In [None]:
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols].describe().T.sort_values('mean')
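
If `au_ppm` or the pathfinder elements turn out to be right-skewed, the default quartiles in `describe()` may hide the upper tail. A sketch that adds upper percentiles and a skewness column is shown below. The helper name (`numeric_profile`), the percentile choices, and the toy values are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def numeric_profile(frame: pd.DataFrame) -> pd.DataFrame:
    """describe() with upper-tail percentiles plus sample skewness per numeric column."""
    numeric = frame.select_dtypes(include=[np.number])
    profile = numeric.describe(percentiles=[0.5, 0.9, 0.99]).T
    profile["skew"] = numeric.skew()
    # Most skewed columns first, as candidates for a log-transformed view.
    return profile.sort_values("skew", ascending=False)


# Made-up values for illustration only (not project data).
toy = pd.DataFrame(
    {
        "au_ppm": [0.01, 0.02, 0.03, 0.04, 5.0],    # one high value: right-skewed
        "as_ppm": [10.0, 20.0, 30.0, 40.0, 50.0],   # symmetric
        "sample_type": ["core", "rc", "core", "chip", "grab"],
    }
)
profile = numeric_profile(toy)
print(profile)
```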

## 6️⃣ Save a Lightweight Profile (Optional)

Selected summary tables (e.g., missingness or numeric stats) can be saved
as CSV files in `3_data_exploration/outputs/` for quick reference.

In [None]:
OUTPUT_DIR = Path('.') / 'outputs'
OUTPUT_DIR.mkdir(exist_ok=True)

missing.to_csv(OUTPUT_DIR / 'missing_summary.csv', header=['n_missing'])
df[numeric_cols].describe().T.to_csv(OUTPUT_DIR / 'numeric_summary.csv')
OUTPUT_DIR

## ✅ Summary

This notebook confirms that:

- The final dataset can be loaded successfully.
- The row and column counts, column names, and dtypes have been inspected.
- We have an overview of missing values, key categorical breakdowns, and numeric summary statistics.

Next, we will move to **visual EDA**:

- Histograms and distributions of Au and pathfinder elements.
- Log-transformed views for skewed variables.
- First correlation checks.
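
As a rough preview of the last two items, a log-transformed view followed by a correlation matrix might look like the sketch below. It assumes non-negative concentrations (so `np.log1p` is defined everywhere) and uses made-up values; the column names only mirror those mentioned in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only (not project data).
toy = pd.DataFrame(
    {
        "au_ppm": [0.01, 0.05, 0.2, 1.0, 12.0],
        "as_ppm": [5.0, 20.0, 60.0, 300.0, 2500.0],
        "zn_ppm": [80.0, 40.0, 95.0, 60.0, 70.0],
    }
)

# log1p(x) = log(1 + x): compresses the upper tail and stays finite at x = 0.
log_view = np.log1p(toy)

# Pearson correlation computed on the log-transformed values.
log_corr = log_view.corr(method="pearson")
print(log_corr.round(2))
```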
