# 📊 EDA – Distributions & Correlations

**Project:** Gold Pathfinder ML Project  
**Notebook:** `02_eda_distributions_correlations.ipynb`  
**Milestone:** 3 – Data Analysis / Exploration

This notebook focuses on:

- Visualizing distributions of Au and pathfinder elements
- Exploring log-transforms for skewed geochemical data
- Computing and plotting correlation matrices
- Identifying candidate pathfinder elements for gold


In [None]:
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', 80)
pd.set_option('display.width', 140)
pd.set_option('display.float_format', lambda x: f'{x:0.4f}')

plt.rcParams['figure.figsize'] = (8, 5)

# Paths
PROJECT_ROOT = Path('..').resolve()
DATA_PROCESSED = PROJECT_ROOT / '1_datasets' / 'processed'
final_path = DATA_PROCESSED / 'gold_assays_final.csv'
final_path, final_path.exists()

In [None]:
df = pd.read_csv(final_path)
df.head()

## 1️⃣ Select Gold and Candidate Pathfinder Elements

We focus on:

- `au_ppm` (gold)
- Candidate pathfinders: `as_ppm`, `sb_ppm`, `bi_ppm`, `cu_ppm`, `zn_ppm`, `pb_ppm`, `ag_ppm`

Columns are included only if they exist in the dataset.

In [None]:
candidates = ['au_ppm', 'as_ppm', 'sb_ppm', 'bi_ppm', 'cu_ppm', 'zn_ppm', 'pb_ppm', 'ag_ppm']
cols = [c for c in candidates if c in df.columns]
cols

In [None]:
sub = df[cols].copy()
sub.describe().T

## 2️⃣ Distributions (Linear Scale)

We start with simple histograms on the original scale.

In [None]:
for col in cols:
    plt.figure()
    sns.histplot(sub[col].dropna(), bins=40, kde=False)
    plt.title(f'Distribution of {col} (linear scale)')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.tight_layout()
    plt.show()

## 3️⃣ Log-Transformed Distributions

Geochemical data are often right-skewed. We apply a log10(x + ε) transform
(ε = 1e-6) to strictly positive values to visualize distributions more clearly;
zeros, negatives, and missing values are left as NaN.
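
As a rough numeric check on the right-skew claim, sample skewness can be compared before and after the transform. The sketch below uses synthetic lognormal values rather than the project's assays; the names `toy`, `skew_raw`, and `skew_log` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic, lognormal stand-in for an assay column (not project data).
rng = np.random.default_rng(0)
toy = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))

skew_raw = toy.skew()            # strongly positive for right-skewed data
skew_log = np.log10(toy).skew()  # close to zero once log-transformed

print(f"skew (linear): {skew_raw:.2f}")
print(f"skew (log10):  {skew_log:.2f}")
```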

In [None]:
eps = 1e-6
log_sub = {}
for col in cols:
    series = sub[col]
    valid = series > 0
    log_series = pd.Series(np.nan, index=series.index)
    log_series[valid] = np.log10(series[valid] + eps)
    log_sub[f'log10_{col}'] = log_series

log_df = pd.DataFrame(log_sub)
log_df.describe().T
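
The masking loop in the cell above can also be written as a single vectorized expression. The helper below is a hypothetical sketch (the name `log10_positive` is not part of the notebook) that assumes the same rule: only strictly positive values are transformed, and everything else becomes NaN.

```python
import numpy as np
import pandas as pd

def log10_positive(series: pd.Series, eps: float = 1e-6) -> pd.Series:
    """Return log10(x + eps) for x > 0 and NaN elsewhere."""
    # Series.where keeps values where the condition holds and inserts NaN
    # otherwise, so log10 never sees zeros or negatives.
    return np.log10(series.where(series > 0) + eps)

# Toy values covering zero, positive, negative, and missing inputs.
toy = pd.Series([0.0, 1.0, 10.0, -5.0, np.nan])
toy_log = log10_positive(toy)
print(toy_log)
```

Under that assumption, `pd.DataFrame({f'log10_{c}': log10_positive(sub[c]) for c in cols})` would build the same kind of table as `log_df`.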

In [None]:
for col in log_df.columns:
    plt.figure()
    sns.histplot(log_df[col].dropna(), bins=40, kde=False)
    plt.title(f'Distribution of {col} (log10 scale)')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.tight_layout()
    plt.show()

## 4️⃣ Correlation Matrix (Linear Scale)

We compute Pearson correlation coefficients between gold and candidate pathfinders.

In [None]:
corr = sub.corr(method='pearson')
corr

In [None]:
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', square=True, cbar_kws={'shrink': 0.8})
plt.title('Correlation Matrix – Linear Scale')
plt.tight_layout()
plt.show()
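
Pearson correlation on the linear scale can be dominated by a few extreme assays. As a possible complement (not something this notebook computes), a rank-based Spearman matrix could be compared against the Pearson one. The sketch uses synthetic data; the column names mirror the notebook, but the values are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
# Synthetic right-skewed columns sharing a common latent factor.
base = rng.normal(size=n)
toy = pd.DataFrame({
    'au_ppm': np.exp(base + rng.normal(scale=0.5, size=n)),
    'as_ppm': np.exp(base + rng.normal(scale=0.5, size=n)),
})

pearson = toy.corr(method='pearson')
spearman = toy.corr(method='spearman')
# Spearman depends only on ranks, so a monotonic log transform leaves it unchanged.
spearman_log = np.log10(toy).corr(method='spearman')

print(pearson.round(2))
print(spearman.round(2))
```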

## 5️⃣ Correlation with Gold Only

We extract a sorted view of which elements correlate most strongly with `au_ppm`.

In [None]:
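if 'au_ppm' in corr.columns:
    au_corr = corr['au_ppm'].sort_values(ascending=False)
    # A bare expression inside an if-block is not rendered by the notebook,
    # so print the result explicitly.
    print(au_corr)
else:
    au_corr = None
    print("Warning: 'au_ppm' not found in correlation matrix.")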
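
Sorting by signed value places strong negative correlations at the bottom of the list. If strength regardless of sign is what matters, one option is to drop gold's self-correlation and sort by absolute value. The sketch below uses a made-up correlation series; `toy_au_corr` and `ranked` are hypothetical names.

```python
import pandas as pd

# Made-up correlations with au_ppm, for illustration only.
toy_au_corr = pd.Series(
    {'au_ppm': 1.0, 'as_ppm': 0.6, 'zn_ppm': -0.8, 'cu_ppm': 0.1}
)

ranked = (
    toy_au_corr
    .drop('au_ppm')  # self-correlation is always 1 and carries no information
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(ranked)
```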

## 6️⃣ Save EDA Outputs

We save key numeric results (correlation with gold) to CSV
for later use in modeling and reporting.

In [None]:
OUTPUT_DIR = Path('.') / 'outputs'
OUTPUT_DIR.mkdir(exist_ok=True)

if au_corr is not None:
    au_corr.to_csv(OUTPUT_DIR / 'correlation_with_gold.csv', header=['pearson_corr'])
OUTPUT_DIR
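
For downstream use, the saved file can be read back with the element names as the index. This is a minimal round-trip sketch, assuming the same `header=['pearson_corr']` layout and writing to a temporary directory instead of `outputs/`; the values are made up.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up values standing in for the real au_corr series.
toy_au_corr = pd.Series({'au_ppm': 1.0, 'as_ppm': 0.6, 'zn_ppm': -0.8})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / 'correlation_with_gold.csv'
    toy_au_corr.to_csv(csv_path, header=['pearson_corr'])
    # index_col=0 restores the element names as the index.
    loaded = pd.read_csv(csv_path, index_col=0)['pearson_corr']

print(loaded)
```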

## ✅ Summary

In this notebook we:

- Inspected distributions of gold and key pathfinder elements.
- Applied log10 transforms to handle skewed geochemical data.
- Computed correlation matrices and identified elements most strongly
  associated with `au_ppm`.

These findings will guide pathfinder selection for modeling
in `4_data_analysis/` and visual storytelling in Milestone 4.
