# 🧼 Data Cleaning Notebook – Milestone 2

**Project:** Gold Pathfinder ML Project  
**Notebook:** `01_data_cleaning.ipynb`  

This notebook documents how the **raw ALS assay files** are loaded, inspected, and cleaned
before being saved into the **`1_datasets/cleaned/`** folder.

It mirrors the logic implemented in the Python script:

```text
2_data_preparation/scripts/data_preparation.py
```

but presents it in a step-by-step, human-readable form for ELO2 evaluation.


In [None]:
from pathlib import Path
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 120)

# Assume this notebook lives in 2_data_preparation/
PROJECT_ROOT = Path('..').resolve()
RAW_DIR = PROJECT_ROOT / '1_datasets' / 'raw'
CLEANED_DIR = PROJECT_ROOT / '1_datasets' / 'cleaned'

CLEANED_DIR.mkdir(parents=True, exist_ok=True)
PROJECT_ROOT, RAW_DIR, CLEANED_DIR

## 1️⃣ Inspect Available Raw Files

We first inspect which raw CSV files are present in `1_datasets/raw/`.


In [None]:
list(RAW_DIR.glob('*.csv'))
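
Because `glob` returns paths in an arbitrary, filesystem-dependent order, a sorted summary can make this inspection step easier to compare between runs. The following is a minimal sketch: `summarize_csv_files` is a hypothetical helper that is not part of `data_preparation.py`, and the demo runs against a throwaway directory rather than the real `1_datasets/raw/` folder.

```python
from pathlib import Path
import tempfile


def summarize_csv_files(raw_dir: Path) -> list[tuple[str, int]]:
    """Return (filename, size in bytes) for each CSV in raw_dir, sorted by name."""
    return sorted((p.name, p.stat().st_size) for p in raw_dir.glob("*.csv"))


# Demo on a temporary directory (hypothetical contents, for illustration only).
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "An1_Core.csv").write_text("sample_id,au_ppm\nA1,<0.01\n")
    (tmp_dir / "notes.txt").write_text("not a CSV")
    summary = summarize_csv_files(tmp_dir)

print(summary)
```

In the notebook itself, the same helper could be called as `summarize_csv_files(RAW_DIR)`.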

## 2️⃣ Helper Functions for Cleaning

We define helper functions to standardize column names and parse values
reported below detection limits, like `"<0.01"`.


In [None]:
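def standardize_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with trimmed, underscore-separated, lowercase column names."""
    df = df.copy()
    df.columns = (
        df.columns
        .str.strip()                          # drop leading/trailing whitespace
        .str.replace(' ', '_', regex=False)   # inner spaces -> underscores
        .str.lower()
    )
    return df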


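def parse_detection_limit(value):
    """Parse one assay value and return (numeric_value, below_detection_flag).

    Values such as "<0.01" return the detection limit itself (0.01) with the
    flag set to True. Missing or unparseable values return NaN.
    """
    if pd.isna(value):
        return np.nan, False
    # Already numeric: nothing to parse.
    if isinstance(value, (int, float)):
        return float(value), False
    s = str(value).strip()
    if not s:
        return np.nan, False
    # Below-detection values are reported as "<limit".
    if s.startswith('<'):
        try:
            lod = float(s[1:])
            return lod, True
        except ValueError:
            # Starts with '<' but the limit is not a number: keep the flag, drop the value.
            return np.nan, True
    try:
        return float(s), False
    except ValueError:
        return np.nan, False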


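def convert_numeric_with_flag(df: pd.DataFrame, cols):
    """Convert cols to numeric and return (converted copy of df, row-level flag).

    The returned flag is True for a row if any of the listed columns held a
    below-detection value in that row. Columns missing from df are skipped.
    """
    df = df.copy()
    below_detection = pd.Series(False, index=df.index)
    for col in cols:
        if col not in df.columns:
            continue
        parsed_vals = []
        flags = []
        for v in df[col]:
            val, flag = parse_detection_limit(v)
            parsed_vals.append(val)
            flags.append(flag)
        df[col] = parsed_vals
        # Accumulate flags across columns with a logical OR.
        below_detection = below_detection | pd.Series(flags, index=df.index)
    return df, below_detection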
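
To make the `(value, flag)` contract concrete, here is a small column-wise sketch that handles the same kinds of input as `parse_detection_limit` on a toy Series. It is an illustrative alternative, not the code used in `data_preparation.py`: the name `parse_detection_limit_series` and the sample values are assumptions introduced for this example.

```python
import pandas as pd


def parse_detection_limit_series(values: pd.Series):
    """Column-wise sketch: return (numeric values, below-detection flags)."""
    # Strip whitespace from strings; leave numbers and missing values untouched.
    text = values.map(lambda v: v.strip() if isinstance(v, str) else v)
    # A value counts as below detection only if it is a string starting with '<'.
    below = text.map(lambda v: isinstance(v, str) and v.startswith("<")).astype(bool)
    # Drop the leading '<' so the detection limit itself can be parsed.
    stripped = text.map(
        lambda v: v[1:] if isinstance(v, str) and v.startswith("<") else v
    )
    # Anything that still cannot be parsed (for example 'n/a') becomes NaN.
    numeric = pd.to_numeric(stripped, errors="coerce")
    return numeric, below


# Toy input (hypothetical values, not taken from the ALS files).
raw = pd.Series(["<0.01", 0.5, None, "n/a", " 1.2 "], dtype=object)
numeric, below = parse_detection_limit_series(raw)
print(pd.DataFrame({"raw": raw, "numeric": numeric, "below_detection": below}))
```

Only the first entry is flagged as below detection, and its numeric value is the detection limit (0.01) rather than zero or NaN. The missing value and the unparseable string both become NaN with the flag left at False.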

## 3️⃣ Example: Clean One Raw File (Core Assays)

As an example, we demonstrate how to clean a core assay file
(e.g., `An1_Core.csv`). Adjust the filename if needed.


In [None]:
example_file = RAW_DIR / 'An1_Core.csv'  # adjust if different
example_file

In [None]:
core_df_raw = pd.read_csv(example_file)
core_df_raw.head()

### 3.1 Standardize Column Names

In [None]:
core_df = standardize_column_names(core_df_raw)
core_df.head()

### 3.2 Select and Rename Key Columns

Adjust the mappings below to match your cleaned schema.


In [None]:
col_map = {
    'field_id': 'sample_id',
    'sample_id': 'sample_id',
    'lab_id': 'lab_id',
    'x': 'easting',
    'y': 'northing',
    'elevation_from_m': 'elevation_from',
    'elevation_to_m': 'elevation_to',
    'au_ppm': 'au_ppm',
    'au': 'au_ppm',
    'as_ppm': 'as_ppm',
    'sb_ppm': 'sb_ppm',
    'bi_ppm': 'bi_ppm',
    'cu_ppm': 'cu_ppm',
    'zn_ppm': 'zn_ppm',
    'pb_ppm': 'pb_ppm',
    'ag_ppm': 'ag_ppm',
}

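clean_core = pd.DataFrame(index=core_df.index)
# Copy only the columns that exist in this file. Note: if a file contains two
# aliases of the same target (e.g. both 'au' and 'au_ppm'), the alias listed
# later in col_map overwrites the earlier one.
for raw_col, std_col in col_map.items():
    if raw_col in core_df.columns:
        clean_core[std_col] = core_df[raw_col]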

clean_core['sample_type'] = 'core'
clean_core['project_area'] = 'Shamkya'
clean_core['anomaly_id'] = 'An1'

clean_core.head()
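
Several keys in `col_map` point at the same target column (`field_id` and `sample_id`, `au` and `au_ppm`). If a raw file ever contains both aliases, one option is to coalesce them rather than let one overwrite the other. The sketch below assumes that the first alias should win and that later aliases only fill its gaps. `coalesce_aliases` is a hypothetical helper and the toy data is invented for illustration.

```python
import pandas as pd


def coalesce_aliases(df: pd.DataFrame, aliases: list[str], target: str) -> pd.Series:
    """Merge alias columns into one Series; earlier aliases take priority."""
    present = [c for c in aliases if c in df.columns]
    if not present:
        raise KeyError(f"none of {aliases} found in DataFrame")
    result = df[present[0]]
    for col in present[1:]:
        # Fill only the missing entries of the result from the next alias.
        result = result.combine_first(df[col])
    return result.rename(target)


# Toy frame with both aliases present (hypothetical values).
toy = pd.DataFrame({
    "field_id": ["F1", None, "F3"],
    "sample_id": [None, "S2", "S3"],
})
sample_id = coalesce_aliases(toy, ["field_id", "sample_id"], "sample_id")
print(sample_id.tolist())
```

Here the third row keeps `F3` from `field_id`, while the second row falls back to `S2` from `sample_id`.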

### 3.3 Convert Geochemical Columns to Numeric

In [None]:
numeric_cols = [
    'au_ppm', 'as_ppm', 'sb_ppm', 'bi_ppm',
    'cu_ppm', 'zn_ppm', 'pb_ppm', 'ag_ppm',
]

clean_core, bdl_flag = convert_numeric_with_flag(clean_core, numeric_cols)
clean_core['below_detection'] = bdl_flag
clean_core.describe(include='all')
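
Note that `below_detection` is a row-level flag: it is True when any of the geochemical columns in that row was reported below its detection limit. The toy example below illustrates the difference between that row-level flag and per-column flags. It is a sketch under stated assumptions: the `_bdl` suffix and the sample values are hypothetical, and the cleaned schema in this project stores only the single row-level column.

```python
import pandas as pd

# Toy raw values (hypothetical), still stored as strings.
toy = pd.DataFrame({
    "au_ppm": ["<0.01", "0.25", "1.10"],
    "as_ppm": ["12", "<5", "30"],
})

# Element-level flags: True where the raw string starts with '<'.
is_bdl = toy.apply(lambda col: col.str.startswith("<")).astype(bool)

# Row-level flag, matching the meaning of `below_detection` in this notebook.
row_flag = is_bdl.any(axis=1)

# Per-column flags record which element triggered the row-level flag.
per_column = is_bdl.add_suffix("_bdl")

print(per_column.assign(below_detection=row_flag))
```

The first two rows are both flagged at row level, but for different elements (gold in the first row, arsenic in the second), which the single row-level column cannot distinguish.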

### 3.4 Save Cleaned Core Dataset

We now save the cleaned core data to:

```text
1_datasets/cleaned/core_assays_clean.csv
```


In [None]:
core_out = CLEANED_DIR / 'core_assays_clean.csv'
clean_core.to_csv(core_out, index=False)
core_out
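
A quick way to gain confidence in the saved file is to read it back and compare it with the in-memory frame. The sketch below does this for a toy frame written to a temporary directory, so it does not touch `1_datasets/cleaned/`. The column values are invented for illustration.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Toy frame with the kinds of columns produced above (hypothetical values).
toy = pd.DataFrame({
    "sample_id": ["A1", "A2"],
    "au_ppm": [0.01, np.nan],
    "below_detection": [True, False],
})

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "core_assays_clean.csv"
    toy.to_csv(out, index=False)
    reloaded = pd.read_csv(out)

# Raises AssertionError if values or dtypes changed during the round trip.
pd.testing.assert_frame_equal(toy, reloaded)
print(reloaded.dtypes)
```

In the notebook, the equivalent check would compare `clean_core` with `pd.read_csv(core_out)`. Columns that contain only missing values or mixed types may not round-trip to the same dtype, so a failure here is a prompt to inspect the column rather than proof of a bug.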

## 4️⃣ Generalizing to Other Files

Repeat similar steps for:

- `An1_RC.csv`
- `An6_Chip.csv`
- `An7_Chip.csv`
- `An6-Trenchs_Result.csv`
- `An6-Grap.csv`
- `An7_Grap.csv`

In practice, we use the **Python script** in `2_data_preparation/scripts/`
to automate these steps. This notebook serves as documentation and
an educational walkthrough for Milestone 2.
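
When looping over these files, the anomaly identifier can be read from the filename prefix instead of being hard-coded as it was for `An1`. The sketch below shows one possible way to split the names listed above. The regular expression and the `parse_raw_filename` helper are assumptions made for illustration and may differ from what `data_preparation.py` actually does. The second part of each name is returned as-is, since mapping it to a `sample_type` label is a project decision.

```python
import re

# Assumed naming pattern: "An<number>", then '-' or '_', then a free-form suffix.
RAW_NAME_PATTERN = re.compile(r"^(An\d+)[-_](.+)\.csv$")


def parse_raw_filename(filename: str) -> tuple[str, str]:
    """Split a raw filename into (anomaly_id, raw_suffix)."""
    match = RAW_NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected raw filename: {filename!r}")
    return match.group(1), match.group(2)


raw_names = [
    "An1_RC.csv",
    "An6_Chip.csv",
    "An7_Chip.csv",
    "An6-Trenchs_Result.csv",
    "An6-Grap.csv",
    "An7_Grap.csv",
]
for name in raw_names:
    print(name, "->", parse_raw_filename(name))
```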
