# CMM Data Tutorial

This notebook demonstrates how to use the `cmm_data` package to access Critical Minerals Modeling datasets.

## Installation

```bash
# Basic installation
pip install -e /path/to/cmm_data

# With visualization support
pip install -e "/path/to/cmm_data[viz]"

# Full installation (includes geospatial)
pip install -e "/path/to/cmm_data[full]"
```

In [None]:
import cmm_data
import pandas as pd

print(f"CMM Data version: {cmm_data.__version__}")

## 1. Data Catalog

View all available datasets and their status.

In [None]:
# Get the full data catalog
catalog = cmm_data.get_data_catalog()
catalog[['dataset', 'name', 'source', 'format', 'available']]

In [None]:
# List all available commodity codes
commodities = cmm_data.list_commodities()
print(f"Total commodities: {len(commodities)}")
print(f"\nFirst 20: {commodities[:20]}")

In [None]:
# List DOE critical minerals
critical = cmm_data.list_critical_minerals()
print(f"Critical minerals ({len(critical)}):")
print(critical)

## 2. USGS Commodity Data

Access world production and U.S. salient statistics for 80+ mineral commodities.

In [None]:
from cmm_data import USGSCommodityLoader

loader = USGSCommodityLoader()

# Describe the dataset
loader.describe()

### 2.1 World Production Data

In [None]:
# Load lithium world production
lithium = loader.load_world_production("lithi")
lithium

In [None]:
# Get top 10 lithium producers
top_lithium = loader.get_top_producers("lithi", top_n=10)
top_lithium[['Country', 'Prod_t_est_2022', 'Reserves_t']]

In [None]:
# Load cobalt data
cobalt = loader.load_world_production("cobal")
top_cobalt = loader.get_top_producers("cobal", top_n=10)
top_cobalt[['Country', 'Prod_t_est_2022', 'Reserves_t']]

In [None]:
# Load rare earths data
ree = loader.load_world_production("raree")
top_ree = loader.get_top_producers("raree", top_n=10)
top_ree[['Country', 'Prod_t_est_2022', 'Reserves_t']]

### 2.2 U.S. Salient Statistics

In [None]:
# Load lithium salient statistics (time series data)
lithium_stats = loader.load_salient_statistics("lithi")
lithium_stats

In [None]:
# Compare multiple critical minerals
minerals_to_compare = ['lithi', 'cobal', 'nicke', 'graph']

comparison_data = []
for code in minerals_to_compare:
    try:
        df = loader.load_salient_statistics(code)
        # Assumes rows are ordered by year, so the last row is the most recent
        latest = df.iloc[-1]
        comparison_data.append({
            'Commodity': loader.get_commodity_name(code),
            'Code': code,
            'Year': latest.get('Year', 'N/A'),
            'US_Production': latest.get('USprod_t_clean', latest.get('USprod_t', 'N/A')),
            'Imports': latest.get('Imports_t_clean', latest.get('Imports_t', 'N/A')),
            'Net_Import_Reliance': latest.get('NIR_pct', 'N/A')
        })
    except Exception as e:
        print(f"Error loading {code}: {e}")

pd.DataFrame(comparison_data)
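
The loop above prefers a cleaned column (for example `USprod_t_clean`) and falls back to the raw one (`USprod_t`) through nested `.get` calls. As a rough sketch, that fallback can be made explicit with a small helper. `first_available` is a hypothetical name and not part of `cmm_data`, and the row below holds placeholder values rather than real USGS figures.

```python
from typing import Any, Mapping, Sequence


def first_available(row: Mapping[str, Any], keys: Sequence[str], default: Any = "N/A") -> Any:
    """Return the value for the first key in `keys` that is present and not None."""
    for key in keys:
        value = row.get(key)
        if value is not None:
            return value
    return default


# Illustrative row with placeholder values (not real USGS figures).
row = {"Year": 2022, "USprod_t": 10.0, "Imports_t_clean": 25.0, "Imports_t": "25"}

print(first_available(row, ["USprod_t_clean", "USprod_t"]))    # cleaned column missing, raw value used
print(first_available(row, ["Imports_t_clean", "Imports_t"]))  # cleaned column preferred
print(first_available(row, ["NIR_pct"]))                        # no candidate present, default used
```

With a pandas row, missing entries are usually `NaN` rather than `None`, so a real version would likely need a `pd.isna` check as well.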

### 2.3 Using Convenience Functions

In [None]:
# Quick loading with convenience functions
df_world = cmm_data.load_usgs_commodity("lithi", "world")
df_salient = cmm_data.load_usgs_commodity("lithi", "salient")

print("World production columns:", list(df_world.columns))
print("\nSalient statistics columns:", list(df_salient.columns))

## 3. USGS Ore Deposits Database

Access geochemical analyses from ore deposits worldwide.

In [None]:
from cmm_data import USGSOreDepositsLoader

ore_loader = USGSOreDepositsLoader()

# List available tables
print("Available tables:")
print(ore_loader.list_available())

In [None]:
# Load the data dictionary
data_dict = ore_loader.load_data_dictionary()
print(f"Total fields: {len(data_dict)}")
data_dict[['FIELD_NAME', 'FIELD_DESC', 'FIELD_UNIT']].head(20)

In [None]:
# Load geology data
geology = ore_loader.load_geology()
print(f"Total deposits: {len(geology)}")
geology.head()

In [None]:
# Get REE samples
ree_samples = ore_loader.get_ree_samples()
print(f"REE sample columns: {len(ree_samples.columns)}")
ree_samples.head()

In [None]:
# Get statistics for specific elements
for element in ['La', 'Ce', 'Nd', 'Li']:
    try:
        stats = ore_loader.get_element_statistics(element)
        print(f"\n{element}:")
        print(f"  Valid samples: {stats['valid_samples']}")
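        # Compare against None so that a legitimate value of 0.0 is not reported as N/A
        print(f"  Mean: {stats['mean']:.2f}" if stats['mean'] is not None else "  Mean: N/A")
        print(f"  Max: {stats['max']:.2f}" if stats['max'] is not None else "  Max: N/A")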
    except Exception as e:
        print(f"\n{element}: {e}")
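
How `get_element_statistics` computes its result is not shown in this notebook. The sketch below is only one plausible reading of the `valid_samples`, `mean`, and `max` keys used above, assuming that missing concentrations arrive as `None` or `NaN`. `summarize_element` and `fmt` are hypothetical helpers, and the input values are placeholders, not real measurements.

```python
import math
from statistics import mean
from typing import Any, Dict, Iterable, Optional


def summarize_element(values: Iterable[Optional[float]]) -> Dict[str, Any]:
    """Summarize concentrations, skipping None and NaN entries."""
    valid = [v for v in values if v is not None and not math.isnan(v)]
    if not valid:
        return {"valid_samples": 0, "mean": None, "max": None}
    return {"valid_samples": len(valid), "mean": mean(valid), "max": max(valid)}


def fmt(value: Optional[float]) -> str:
    """Format a statistic, keeping a genuine 0.0 distinct from a missing value."""
    return "N/A" if value is None else f"{value:.2f}"


# Placeholder concentrations, not real measurements.
stats = summarize_element([12.0, None, float("nan"), 0.0, 6.0])
print(f"Valid samples: {stats['valid_samples']}")
print(f"Mean: {fmt(stats['mean'])}")
print(f"Max: {fmt(stats['max'])}")
```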

## 4. Preprocessed Document Corpus

Access the unified corpus of critical minerals documents.

In [None]:
from cmm_data import PreprocessedCorpusLoader

corpus_loader = PreprocessedCorpusLoader()

# Get corpus statistics
try:
    stats = corpus_loader.get_corpus_stats()
    print(f"Total documents: {stats.get('total_documents', 'N/A')}")
    print(f"Columns: {stats.get('columns', [])}")
    if 'text_stats' in stats:
        print("\nText statistics:")
        for k, v in stats['text_stats'].items():
            print(f"  {k}: {v:,.0f}" if isinstance(v, (int, float)) else f"  {k}: {v}")
except Exception as e:
    print(f"Corpus not available: {e}")

In [None]:
# Search the corpus
try:
    results = corpus_loader.search("lithium extraction", limit=5)
    print(f"Found {len(results)} documents matching 'lithium extraction'")
    if not results.empty:
        display(results.head())
except Exception as e:
    print(f"Search not available: {e}")

## 5. OECD Supply Chain Data

Access export restrictions and IEA critical minerals data.

In [None]:
from cmm_data import OECDSupplyChainLoader

oecd_loader = OECDSupplyChainLoader()

# List available datasets
print("Available OECD datasets:")
print(oecd_loader.list_available())

In [None]:
# Get minerals coverage information
coverage = oecd_loader.get_minerals_coverage()
for dataset, info in coverage.items():
    print(f"\n{dataset}:")
    print(f"  Description: {info['description']}")
    if 'key_minerals' in info:
        print(f"  Key minerals: {', '.join(info['key_minerals'][:5])}...")

In [None]:
# List PDF reports
try:
    export_pdfs = oecd_loader.get_export_restrictions_reports()
    print("Export Restrictions Reports:")
    for pdf in export_pdfs:
        print(f"  - {pdf.name}")
except Exception as e:
    print(f"Error: {e}")

try:
    iea_pdfs = oecd_loader.get_iea_minerals_reports()
    print("\nIEA Critical Minerals Reports:")
    for pdf in iea_pdfs:
        print(f"  - {pdf.name}")
except Exception as e:
    print(f"Error: {e}")

In [None]:
# Get download URLs for manual data retrieval
urls = oecd_loader.get_download_urls()
print("Manual Download URLs:")
for name, url in urls.items():
    print(f"  {name}: {url}")

## 6. Geoscience Australia 3D Model

Access the chronostratigraphic model of Australia.

In [None]:
from cmm_data import GAChronostratigraphicLoader

ga_loader = GAChronostratigraphicLoader()

# Get model information
info = ga_loader.get_model_info()
print("GA 3D Chronostratigraphic Model:")
for k, v in info.items():
    print(f"  {k}: {v}")

In [None]:
# List available surfaces
surfaces = ga_loader.list_surfaces()
print("\nChronostratigraphic surfaces:")
for s in surfaces:
    print(f"  - {s}")

In [None]:
# Load a surface (XYZ format)
try:
    paleozoic = ga_loader.load("Paleozoic_Top", format="xyz")
    print(f"Paleozoic_Top surface: {len(paleozoic)} points")
    print(paleozoic.describe())
except Exception as e:
    print(f"Data not available: {e}")
    print("Download from: https://ecat.ga.gov.au/geonetwork/srv/eng/catalog.search#/metadata/149923")
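
If a downloaded surface needs to be inspected without the loader, an XYZ file can be read by hand. The sketch below assumes the common layout of one `x y z` record per line, separated by whitespace or commas; the actual GA files may differ (headers, extra columns), so treat this as a starting point. `parse_xyz` is a hypothetical helper and the coordinates are placeholders, not values from the GA model.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def parse_xyz(text: str) -> List[Point]:
    """Parse whitespace- or comma-separated x y z rows, skipping blanks and '#' comments."""
    points: List[Point] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.replace(",", " ").split()
        if len(parts) < 3:
            continue  # not a full x, y, z record
        try:
            x, y, z = (float(p) for p in parts[:3])
        except ValueError:
            continue  # header or other non-numeric row
        points.append((x, y, z))
    return points


# Placeholder coordinates, not taken from the GA model.
sample = """# x y z
100.0 200.0 -1500.0
110.0 200.0 -1520.5

120.0,200.0,-1490.25
"""
points = parse_xyz(sample)
z_values = [p[2] for p in points]
print(f"{len(points)} points, z from {min(z_values)} to {max(z_values)}")
```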

## 7. NETL REE and Coal Database

Access REE data from coal and coal-related resources.

In [None]:
from cmm_data import NETLREECoalLoader

netl_loader = NETLREECoalLoader()

# List available layers
try:
    layers = netl_loader.list_available()
    print("Available layers:")
    for layer in layers:
        print(f"  - {layer}")
except Exception as e:
    print(f"Error: {e}")

In [None]:
# Get REE statistics (requires geopandas)
try:
    ree_stats = netl_loader.get_ree_statistics()
    print("REE Statistics from Coal:")
    for elem, stats in ree_stats.items():
        print(f"  {elem}: mean={stats['mean']:.2f}, max={stats['max']:.2f} ppm")
except Exception as e:
    print(f"REE statistics not available: {e}")
    print('Install geopandas: pip install "cmm-data[geo]"')

## 8. Visualizations

Create charts and maps from CMM data.

**Note:** Requires `pip install "cmm-data[viz]"` (the quotes keep shells such as zsh from interpreting the brackets).

In [None]:
# Check if matplotlib is available
try:
    import matplotlib.pyplot as plt
    %matplotlib inline
    VIZ_AVAILABLE = True
    print("Visualization libraries available")
except ImportError:
    VIZ_AVAILABLE = False
    print('Install visualization support: pip install "cmm-data[viz]"')
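
The `%matplotlib inline` magic in the cell above is only valid inside IPython or Jupyter. If the same availability check is needed in a plain script, one possible approach is `importlib.util.find_spec`, which looks a module up without importing it. `has_module` and `GEO_AVAILABLE` are illustrative names, not part of `cmm_data`.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


VIZ_AVAILABLE = has_module("matplotlib")
GEO_AVAILABLE = has_module("geopandas")
print(f"matplotlib: {VIZ_AVAILABLE}, geopandas: {GEO_AVAILABLE}")
```

This only confirms that the module can be found; an installation that is present but broken would still fail at import time.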

In [None]:
if VIZ_AVAILABLE:
    from cmm_data.visualizations.commodity import plot_world_production
    
    # Plot lithium producers
    lithium_data = cmm_data.load_usgs_commodity("lithi", "world")
    fig = plot_world_production(lithium_data, "Lithium", top_n=8)
    plt.show()

In [None]:
if VIZ_AVAILABLE:
    from cmm_data.visualizations.commodity import plot_production_timeseries
    
    # Plot lithium time series
    lithium_salient = cmm_data.load_usgs_commodity("lithi", "salient")
    fig = plot_production_timeseries(lithium_salient, "Lithium")
    plt.show()

In [None]:
if VIZ_AVAILABLE:
    from cmm_data.visualizations.commodity import plot_import_reliance
    
    # Plot cobalt import reliance
    cobalt_salient = cmm_data.load_usgs_commodity("cobal", "salient")
    fig = plot_import_reliance(cobalt_salient, "Cobalt")
    plt.show()

In [None]:
if VIZ_AVAILABLE:
    from cmm_data.visualizations.timeseries import plot_critical_minerals_comparison
    
    # Compare critical minerals
    fig = plot_critical_minerals_comparison(metric="Imports_t", top_n=12)
    plt.show()

## 9. Cross-Dataset Search

Search across multiple datasets at once.

In [None]:
from cmm_data.catalog import search_all_datasets

# Search for "lithium" across all datasets
results = search_all_datasets("lithium")
print(f"Found {len(results)} results for 'lithium'")
results

In [None]:
# Search for "rare earth"
results = search_all_datasets("rare earth")
print(f"Found {len(results)} results for 'rare earth'")
results

## 10. Configuration

Customize the package configuration.

In [None]:
from cmm_data import get_config, configure

# View current configuration
config = get_config()
print(f"Data root: {config.data_root}")
print(f"Cache enabled: {config.cache_enabled}")
print(f"Cache TTL: {config.cache_ttl_seconds} seconds")

In [None]:
# Validate configuration (check what's available)
status = config.validate()
print("\nDataset availability:")
for dataset, available in status.items():
    status_icon = "[OK]" if available else "[--]"
    print(f"  {status_icon} {dataset}")
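
How `config.validate()` decides availability is not shown in this notebook. The sketch below illustrates one plausible approach, assuming each dataset lives in its own subdirectory under the data root. `check_datasets` and the folder names are made up for the example and are not the package's real layout.

```python
import tempfile
from pathlib import Path
from typing import Dict, Iterable


def check_datasets(data_root: Path, subdirs: Iterable[str]) -> Dict[str, bool]:
    """Map each expected subdirectory name to whether it exists under data_root."""
    return {name: (data_root / name).is_dir() for name in subdirs}


# Demonstrate against a throwaway directory; the folder names are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "usgs_commodity").mkdir()
    status = check_datasets(root, ["usgs_commodity", "oecd_supply_chain"])

for dataset, available in status.items():
    print(f"  {'[OK]' if available else '[--]'} {dataset}")
```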

In [None]:
# Example: Reconfigure with custom settings
# cmm_data.configure(
#     data_root="/custom/path/to/Globus_Sharing",
#     cache_enabled=True,
#     cache_ttl_seconds=7200  # 2 hours
# )
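
To make the `cache_ttl_seconds` setting concrete, here is a minimal time-to-live cache sketch. It is not the package's implementation, only an illustration of the general idea: a cached value is reused until it is older than the TTL, then reloaded. `TTLCache`, `get_or_load`, and `load_lithium` are hypothetical names.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny time-to-live cache: entries expire ttl_seconds after they are stored."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]  # still fresh
        value = loader()  # missing or expired: reload and restamp
        self._store[key] = (now, value)
        return value


# A fake clock makes expiry visible without sleeping.
fake_now = [0.0]
cache = TTLCache(ttl_seconds=7200, clock=lambda: fake_now[0])
calls = []


def load_lithium() -> str:
    calls.append(1)
    return "lithium table"


cache.get_or_load("lithi/world", load_lithium)  # first call runs the loader
cache.get_or_load("lithi/world", load_lithium)  # served from the cache
fake_now[0] = 7201.0                            # move past the 2-hour TTL
cache.get_or_load("lithi/world", load_lithium)  # expired, loader runs again
print(f"Loader ran {len(calls)} times")
```

Passing the clock in as a parameter is a design choice for testability; a real cache would normally just use the default `time.monotonic`.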

## Summary

The `cmm_data` package provides:

1. **Unified API** for accessing multiple critical minerals datasets
2. **7 specialized loaders** for different data sources
3. **Automatic data parsing** with handling for special codes (W, NA, ranges)
4. **Built-in caching** for improved performance
5. **Visualization functions** for quick data exploration
6. **Cross-dataset search** capabilities
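
As an illustration of the special-code handling mentioned in item 3, the sketch below converts raw table cells to floats. In USGS tables, `W` marks data withheld to avoid disclosing company proprietary information and `NA` means not available. How `cmm_data` actually treats these codes and ranges is not shown in this notebook, so the choices here (codes become `NaN`, a range becomes its midpoint) are assumptions, and `parse_value` is a hypothetical helper.

```python
import math
import re
from typing import Union

# Matches simple non-negative ranges such as "100-200" or "1.5 – 2.5".
_RANGE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*[-–]\s*(\d+(?:\.\d+)?)\s*$")


def parse_value(raw: Union[str, int, float, None]) -> float:
    """Convert a raw table cell to float, returning NaN for withheld or missing codes."""
    if raw is None:
        return math.nan
    if isinstance(raw, (int, float)):
        return float(raw)
    text = raw.strip().replace(",", "")  # drop thousands separators
    if text.upper() in {"", "W", "NA"}:
        return math.nan
    match = _RANGE.match(text)
    if match:
        low, high = float(match.group(1)), float(match.group(2))
        return (low + high) / 2  # assumption: represent a range by its midpoint
    try:
        return float(text)
    except ValueError:
        return math.nan  # unrecognized code


for cell in ["1,200", "W", "NA", "100-200", 42]:
    print(repr(cell), "->", parse_value(cell))
```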

### Quick Reference

```python
# Convenience functions
cmm_data.get_data_catalog()           # List all datasets
cmm_data.list_commodities()           # List commodity codes
cmm_data.list_critical_minerals()     # List DOE critical minerals
cmm_data.load_usgs_commodity(code, type)  # Quick data loading; type is "world" or "salient"

# Loaders
USGSCommodityLoader()      # USGS Mineral Commodity Summaries (MCS) 2023
USGSOreDepositsLoader()    # USGS ore geochemistry
OSTIDocumentsLoader()      # OSTI technical reports
PreprocessedCorpusLoader() # LLM training corpus
GAChronostratigraphicLoader()  # GA 3D model
NETLREECoalLoader()        # NETL REE/coal data
OECDSupplyChainLoader()    # OECD trade data
```