# Data Module: Basic XN_SAMPLE Cleaning

This notebook demonstrates basic usage of the `XNSampleProcessor` for cleaning and standardizing Sysmex XN_SAMPLE.csv files.

## Overview

The `XNSampleProcessor` provides a simple, pandas-like interface for processing raw Sysmex XN_SAMPLE data exported from decrypted .116 files. It handles:

- Removing duplicate rows and technical samples (QC, calibration)
- Encoding flags and indicators
- Detecting and handling clotted samples
- Managing multiple measurements per sample
- Converting data types and cleaning non-numeric values
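
As a rough illustration of what steps like these involve, here is a minimal pandas sketch on toy data. It is not the `XNSampleProcessor` implementation: the `QC-` prefix, the `----` placeholder, and the helper name `toy_clean` are assumptions made for this example only.

```python
import pandas as pd

# Toy stand-in for a raw export; the values and the "QC-" prefix are illustrative.
raw = pd.DataFrame(
    {
        "Sample No.": ["A001", "A001", "QC-01", "A002", "A003"],
        "WBC(10^3/uL)": ["5.1", "5.1", "7.0", "----", "6.4"],
    }
)


def toy_clean(df: pd.DataFrame, technical_prefix: str = "QC-") -> pd.DataFrame:
    """Hypothetical helper: dedupe, drop technical rows, coerce to numeric."""
    out = df.drop_duplicates()
    # Drop rows whose sample number marks them as technical samples
    out = out[~out["Sample No."].str.startswith(technical_prefix)].copy()
    # Non-numeric placeholders become NaN instead of raising an error
    out["WBC(10^3/uL)"] = pd.to_numeric(out["WBC(10^3/uL)"], errors="coerce")
    return out.reset_index(drop=True)


toy = toy_clean(raw)
print(toy)
```

The real processor covers more cases (flags, clotted samples, repeat measurements), so treat this only as a mental model of the simplest steps.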

## Setup

In [None]:
import sys
from pathlib import Path

# Add the project root (two levels above the working directory) to the path to import sysmexcbctools
project_root = Path.cwd().parent.parent
sys.path.insert(0, str(project_root))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sysmexcbctools.data import XNSampleProcessor

# For nice display
pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 1000)

print("Imports successful!")

## 1. Basic Usage

The simplest way to use the processor is to load and clean a single XN_SAMPLE.csv file.

### Specifying Data Paths

You have two options for specifying the path to your data:

**Option 1: Use YAML configuration** (recommended for managing multiple datasets)
- Edit `sysmexcbctools/transfer/config/data_paths.yaml` to include your data paths
- The code below will automatically load from the config
- This keeps paths centralized and makes notebooks portable

**Option 2: Manual path specification** (simple fallback)
- If the config file is not available or doesn't contain your dataset
- The code will fall back to a manual path that you can edit
- Just update the `data_path` variable with your file location

In [None]:
# Path to your XN_SAMPLE.csv file
# Option 1: Use the config loader (if you have the data_paths.yaml configured)
try:
    from sysmexcbctools.transfer.config.config_loader import ConfigLoader

    config_loader = ConfigLoader(
        config_file=str(
            project_root / "sysmexcbctools/transfer/config/data_paths.yaml"
        ),
        environment="production",
    )

    # Get the raw data directory for INTERVAL dataset 36
    dataset_dir = config_loader.get_dataset_path(category="raw", dataset="interval_36")
    data_path = dataset_dir / "XN_SAMPLE.csv"

    print(f"✓ Loaded path from config: {data_path}")

except (ImportError, FileNotFoundError, ValueError, KeyError) as e:
    print(f"⚠ Could not load from config: {e}")
    print("  Falling back to manual path specification...")

    # Option 2: Manually specify your data path
    # EDIT THIS PATH to point to your XN_SAMPLE.csv file:
    data_path = Path("/path/to/your/XN_SAMPLE.csv")

    print("\n  Please edit this cell and set data_path to your file location.")
    print(f"  Current (placeholder) path: {data_path}")

print(f"\nFile exists: {data_path.exists()}")
if data_path.exists():
    print(f"File size: {data_path.stat().st_size / 1024**2:.1f} MB")

In [None]:
# Create processor with default settings
processor = XNSampleProcessor(
    verbose=1,  # Show info-level logging
    log_to_file=True,  # Also save diagnostics and logs to file (optional)
)

# Process the file (returns cleaned DataFrame)
df_clean = processor.process_files(
    input_files=str(data_path),
    dataset_name="example",
    save_output=False,  # Don't save to disk (default)
)

print(f"\nCleaned dataframe shape: {df_clean.shape}")
print(f"Unique samples: {df_clean['Sample No.'].nunique()}")

## 2. Inspect Before/After Statistics

Let's compare the raw and cleaned data to see what was removed:

In [None]:
# Load raw data for comparison
df_raw = pd.read_csv(data_path, encoding="ISO-8859-1", low_memory=False)

print("=" * 60)
print("BEFORE CLEANING")
print("=" * 60)
print(f"Shape: {df_raw.shape}")
print(f"Columns: {df_raw.shape[1]}")
print(f"Rows: {df_raw.shape[0]}")
print(f"Unique samples: {df_raw['Sample No.'].nunique()}")
print("Sample number prefixes (first 3 characters, top 10):")
print(df_raw["Sample No."].astype(str).str[:3].value_counts().head(10))

print("\n" + "=" * 60)
print("AFTER CLEANING")
print("=" * 60)
print(f"Shape: {df_clean.shape}")
print(f"Columns: {df_clean.shape[1]}")
print(f"Rows: {df_clean.shape[0]}")
print(f"Unique samples: {df_clean['Sample No.'].nunique()}")
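
To put numbers on the difference, a small helper along the following lines may be useful. This is a sketch on toy frames: `removal_summary` is a hypothetical name, not part of `sysmexcbctools`, and it assumes only that both frames share a `Sample No.` column.

```python
import pandas as pd


def removal_summary(
    before: pd.DataFrame, after: pd.DataFrame, id_col: str = "Sample No."
) -> dict:
    """Hypothetical helper: count rows and sample IDs dropped by cleaning."""
    rows_removed = len(before) - len(after)
    # Sample IDs present in the raw data but absent after cleaning
    dropped_ids = set(before[id_col]) - set(after[id_col])
    return {
        "rows_removed": rows_removed,
        "pct_rows_removed": 100 * rows_removed / len(before) if len(before) else 0.0,
        "samples_dropped": sorted(dropped_ids),
    }


# Toy frames standing in for df_raw and df_clean
before = pd.DataFrame({"Sample No.": ["A001", "A001", "QC-01", "A002"]})
after = pd.DataFrame({"Sample No.": ["A001", "A002"]})

summary = removal_summary(before, after)
print(summary)
```

With the real data, the call would be `removal_summary(df_raw, df_clean)`.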

## 3. Explore Standard FBC Features

Let's look at the standard full blood count (FBC) parameters that are preserved:

In [None]:
# Standard FBC parameters
fbc_params = [
    "WBC(10^3/uL)",
    "RBC(10^6/uL)",
    "HGB(g/dL)",
    "HCT(%)",
    "PLT(10^3/uL)",
    "MCV(fL)",
    "MCH(pg)",
    "MCHC(g/dL)",
    "NEUT#(10^3/uL)",
    "LYMPH#(10^3/uL)",
    "MONO#(10^3/uL)",
    "EO#(10^3/uL)",
    "BASO#(10^3/uL)",
]

# Check which are available
available_fbc = [p for p in fbc_params if p in df_clean.columns]
print(f"Available FBC parameters ({len(available_fbc)}/{len(fbc_params)}):")
print(available_fbc)

# Display summary statistics
if available_fbc:
    print("\nSummary statistics for FBC parameters:")
    display(df_clean[available_fbc].describe())

In [None]:
# Visualize distributions of key FBC parameters
if len(available_fbc) >= 4:
    fig, axes = plt.subplots(2, 2, figsize=(12, 8))
    fig.suptitle("Distribution of Key FBC Parameters", fontsize=14, fontweight="bold")

    for ax, param in zip(axes.flat, available_fbc[:4]):
        data = df_clean[param].dropna()
        ax.hist(data, bins=50, alpha=0.7, edgecolor="black")
        ax.set_xlabel(param)
        ax.set_ylabel("Frequency")
        ax.set_title(
            f"{param}\n(n={len(data):,}, {len(data)/len(df_clean)*100:.1f}% non-null)"
        )
        ax.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()
else:
    print("Not enough FBC parameters available for visualization")

## 4. Review Quality Control Flags

The processor encodes various quality control flags as binary indicators. Let's examine them:

In [None]:
# Find flag columns
flag_cols = [
    col
    for col in df_clean.columns
    if col.startswith("IP ")
    or col.startswith("Error")
    or col.startswith("Positive")
    or col.endswith("Abnormal")
    or col.endswith("Suspect")
]

print(f"Found {len(flag_cols)} flag columns:")
print(flag_cols[:20])  # Show first 20

if flag_cols:
    # Count how many samples have each flag
    flag_counts = df_clean[flag_cols].sum().sort_values(ascending=False)

    print("\nMost common flags (top 10):")
    print(flag_counts.head(10))

    # Visualize most common flags
    plt.figure(figsize=(12, 6))
    flag_counts.head(15).plot(kind="barh")
    plt.xlabel("Number of samples")
    plt.title("Most Common Quality Control Flags")
    plt.tight_layout()
    plt.show()
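
Raw counts are easier to compare across datasets when expressed as a share of samples. The sketch below assumes the flags are 0/1 indicator columns, as described above. The column names `IP flag A` and `IP flag B` are made up for the example.

```python
import pandas as pd

# Toy 0/1 indicator columns; the names are illustrative only
flags = pd.DataFrame(
    {
        "IP flag A": [0, 1, 0, 0],
        "IP flag B": [1, 1, 0, 0],
    }
)

# Share of samples with each flag raised, as a percentage
flag_pct = flags.mean().mul(100).sort_values(ascending=False)

# Number of flags raised per sample, then the share with at least one flag
flags_per_sample = flags.sum(axis=1)
pct_any_flag = (flags_per_sample > 0).mean() * 100

print(flag_pct)
print(f"Samples with at least one flag: {pct_any_flag:.1f}%")
```

On the cleaned data, `flags` would be replaced by `df_clean[flag_cols]`.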

## 5. Missing Data Analysis

Understanding missingness patterns is important for downstream analysis:

In [None]:
# Visualize missingness for key FBC parameters
if available_fbc:
    missing_fbc = df_clean[available_fbc].isnull().sum() / len(df_clean) * 100

    plt.figure(figsize=(10, 6))
    missing_fbc.sort_values().plot(kind="barh", color="coral")
    plt.xlabel("Missing (%)")
    plt.title("Missingness in Standard FBC Parameters")
    plt.tight_layout()
    plt.show()
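
Per-column percentages do not show whether values go missing together. One possible way to look at joint patterns is sketched below on toy data. The frame `toy` and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "WBC(10^3/uL)": [5.1, np.nan, 6.4, np.nan],
        "HGB(g/dL)": [13.2, np.nan, 14.0, 12.5],
        "PLT(10^3/uL)": [250.0, 310.0, 198.0, 275.0],
    }
)

# One boolean row per sample: True where the value is missing
is_missing = toy.isna()

# Label each sample by the set of columns that are missing, then tally the labels
pattern = is_missing.apply(
    lambda row: ", ".join(row.index[row.to_numpy()]) or "none missing", axis=1
)
pattern_counts = pattern.value_counts()

# Pairwise co-missingness: fraction of samples where both columns are missing
m = is_missing.astype(int)
co_missing = (m.T @ m) / len(toy)

print(pattern_counts)
print(co_missing)
```

The diagonal of `co_missing` is each column's own missing fraction, so the off-diagonal entries can be read against it.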

## 6. Data Types and Memory Usage

The processor automatically converts columns to appropriate numeric types:

In [None]:
# Check data types
print("Data types summary:")
print(df_clean.dtypes.value_counts())

# Memory usage
memory_mb = df_clean.memory_usage(deep=True).sum() / 1024**2
print(f"\nMemory usage: {memory_mb:.2f} MB")

# Show non-numeric columns (if any)
non_numeric = df_clean.select_dtypes(include=["object"]).columns.tolist()
print(f"\nNon-numeric columns ({len(non_numeric)}):")
print(non_numeric)
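
If a column you expected to be numeric shows up in that list, it can help to see which entries block the conversion. The helper below is a hypothetical sketch using `pd.to_numeric`. The placeholder strings in the toy column are illustrative and may not match what your export contains.

```python
import pandas as pd


def unparseable_values(col: pd.Series) -> pd.Series:
    """Hypothetical helper: tally the entries that fail numeric conversion."""
    converted = pd.to_numeric(col, errors="coerce")
    # Failed conversions are NaN after coercion but were not missing before
    failed = converted.isna() & col.notna()
    return col[failed].value_counts()


# Toy column mixing numbers, a true missing value, and placeholder strings
toy_col = pd.Series(["5.1", "----", "6.4", None, "----", "++++"])
blockers = unparseable_values(toy_col)
print(blockers)
```

On the cleaned data this could be called as `unparseable_values(df_clean[name])` for each `name` in `non_numeric`.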

## 7. Diagnostic Files

The processor generates diagnostic files for samples that need manual review (e.g., multiple measurements with discrepancies):

In [None]:
# Check if any diagnostic files were created
diagnostic_files = processor.get_diagnostic_files()

if diagnostic_files:
    print("Diagnostic files generated:")
    for file_type, path in diagnostic_files.items():
        print(f"  {file_type}: {path}")
        if Path(path).exists():
            size_kb = Path(path).stat().st_size / 1024
            print(f"    Size: {size_kb:.1f} KB")
else:
    print("No diagnostic files were generated (all samples passed quality checks)")
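
To make "multiple measurements with discrepancies" concrete, the sketch below flags repeated samples whose values disagree by more than a tolerance. It is not the processor's actual rule: the toy data, the 0.5 g/dL tolerance, and the max-minus-min criterion are all assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Sample No.": ["A001", "A001", "A002", "A003", "A003"],
        "HGB(g/dL)": [13.2, 13.3, 14.0, 12.5, 10.1],
    }
)

TOLERANCE = 0.5  # hypothetical absolute tolerance in g/dL

# Per-sample number of measurements and spread (max minus min)
per_sample = toy.groupby("Sample No.")["HGB(g/dL)"].agg(
    n="count", spread=lambda s: s.max() - s.min()
)

# Repeated samples whose measurements disagree by more than the tolerance
needs_review = per_sample[(per_sample["n"] > 1) & (per_sample["spread"] > TOLERANCE)]
print(needs_review)
```

Here only `A003` is flagged, because its two measurements differ by more than the tolerance, while the two `A001` values agree closely.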

## 8. Saving Results

If you want to save the processed data:

In [None]:
# Option 1: Save manually
output_path = project_root / "examples" / "outputs" / "cleaned_example.csv"
output_path.parent.mkdir(parents=True, exist_ok=True)

print(f"Saving to: {output_path}")
# Uncomment to actually save:
# df_clean.to_csv(output_path, index=False)
# print(f"Saved! File size: {output_path.stat().st_size / 1024**2:.1f} MB")

print("(Uncomment code above to actually save the file)")

In [None]:
# Option 2: Use processor to save (with automatic timestamping)
processor2 = XNSampleProcessor(
    output_dir=str(project_root / "examples" / "outputs"), verbose=1
)

# This will automatically save with timestamp
# Uncomment to actually process and save:
# df_clean2 = processor2.process_files(
#     input_files=str(data_path),
#     dataset_name="example",
#     save_output=True  # This enables automatic saving
# )

print("(Uncomment code above to process and automatically save)")
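
If you save manually (Option 1) but still want timestamped filenames, something like the following would work. The naming pattern is an assumption chosen for this sketch and is not necessarily the scheme the processor uses. `timestamped_path` is a hypothetical helper.

```python
from datetime import datetime
from pathlib import Path


def timestamped_path(output_dir: Path, dataset_name: str, when: datetime) -> Path:
    """Hypothetical helper: build '<dataset>_cleaned_<YYYYMMDD_HHMMSS>.csv'."""
    stamp = when.strftime("%Y%m%d_%H%M%S")
    return output_dir / f"{dataset_name}_cleaned_{stamp}.csv"


# A fixed datetime keeps the example reproducible; use datetime.now() in practice
path = timestamped_path(Path("outputs"), "example", datetime(2024, 1, 31, 9, 5, 0))
print(path.name)
```

Passing the timestamp in as an argument, rather than calling `datetime.now()` inside the helper, keeps the function easy to test.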

## Summary

In this notebook, we demonstrated:

1. ✅ **Basic usage** - Loading and cleaning XN_SAMPLE.csv files with minimal code
2. ✅ **Before/after comparison** - Seeing how many rows and samples the cleaning step removes
3. ✅ **FBC parameters** - Exploring standard full blood count measurements
4. ✅ **Quality flags** - Reviewing encoded quality control indicators
5. ✅ **Missing data** - Analyzing missingness patterns
6. ✅ **Data types** - Checking automatic type conversion
7. ✅ **Diagnostic files** - Accessing samples flagged for manual review
8. ✅ **Saving results** - Multiple options for output