# Notebook 07 Microexpression Modeling Kickoff: CASME II + SMIC Fusion  
Last polished on: 2025-10-14

### Purpose
This notebook launches the first emotion modeling phase of the trauma-informed AI framework.  
Using the fused metadata from CASME II and SMIC, we will:

- Frame the difference between **micro** and **macro** expressions
- Engineer features based on **duration, modality, and action units**
- Train early classifiers to **predict emotion labels**
- Prepare for downstream Z3 symbolic verification (Notebook 08)
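
One way to frame the micro/macro split is a duration cutoff. The sketch below is illustrative only: it assumes `Duration` is stored in frames, that a per-row frame rate is available (the `FPS` column is hypothetical), and that 0.5 s is the cutoff, a commonly cited upper bound that this notebook does not itself specify.

```python
import pandas as pd

# Assumption: 0.5 s is a commonly cited upper bound for micro-expression
# duration; the notebook does not state which cutoff it uses.
MICRO_MAX_SECONDS = 0.5


def label_expression_type(duration_frames, fps, micro_max_seconds=MICRO_MAX_SECONDS):
    """Return 'micro' or 'macro' for a clip, given its length in frames and its frame rate."""
    duration_seconds = duration_frames / fps
    return "micro" if duration_seconds <= micro_max_seconds else "macro"


# Toy rows with made-up values (not taken from CASME II or SMIC)
toy = pd.DataFrame({
    "Duration": [60, 150, 30],  # assumed to be frame counts
    "FPS": [200, 200, 100],     # hypothetical per-row frame rate column
})
toy["ExpressionType"] = [
    label_expression_type(d, f) for d, f in zip(toy["Duration"], toy["FPS"])
]
print(toy)
```

If the fused metadata already stores durations in seconds, the division by frame rate would be dropped.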

---

### Input:
- `fused_microexpression_metadata.parquet` (from Notebook 06)

### Output:
- Classifier artifacts (joblib / pickle)
- Cleaned modeling data
- Visuals: confusion matrix, ROC, emotion distributions

---

### Reminder:
All saves must go to:
- `outputs/checks/` → for `.parquet`, `.csv`, `.joblib`
- `outputs/visuals/` → for plots and diagrams
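
The save rule above could be enforced with a small routing helper. This is a minimal sketch, not part of the notebook: `output_path` is a hypothetical function, and the set of plot extensions is an assumption.

```python
from pathlib import Path

# Extensions named in the reminder above
CHECK_SUFFIXES = {".parquet", ".csv", ".joblib"}
# Assumed plot formats; the notebook only says "plots and diagrams"
VISUAL_SUFFIXES = {".png", ".svg", ".pdf"}


def output_path(filename, root=Path("outputs")):
    """Return the folder-qualified path for a file, based on its extension."""
    suffix = Path(filename).suffix.lower()
    if suffix in CHECK_SUFFIXES:
        return root / "checks" / filename
    if suffix in VISUAL_SUFFIXES:
        return root / "visuals" / filename
    raise ValueError(f"No output folder configured for '{suffix}' files")


print(output_path("microexpression_features.parquet"))
print(output_path("confusion_matrix.png"))
```

Raising on unknown extensions, rather than picking a default folder, keeps stray files from landing in the wrong place unnoticed.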


In [None]:
# =============================================================================
# 7.0 Microexpression Modeling Kickoff 
# =============================================================================
# Purpose:
#   - Begin emotion modeling using fused CASME II + SMIC metadata
#   - Engineer emotion features, temporal windows, and AU tags
#   - Build early exploratory models (baseline classifiers, timelines, flags)
# =============================================================================

from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# --- Define root paths (project-level consistency) ----------------------------
ROOT = Path.cwd().parent  # From /notebooks/, go up to project root
DATA_DIR = ROOT / "data"
PROCESSED_DIR = DATA_DIR / "processed"
RAW_DIR = DATA_DIR / "raw"
CHECKS_DIR = ROOT / "outputs" / "checks"
VIS_DIR = ROOT / "outputs" / "visuals"

# --- Create output folders if missing -----------------------------------------
CHECKS_DIR.mkdir(parents=True, exist_ok=True)
VIS_DIR.mkdir(parents=True, exist_ok=True)

# --- Confirm notebook init ----------------------------------------------------
print("✅ Notebook 07 initialized successfully")
print(f"📂 Root:       {ROOT}")
print(f"📂 Checks:     {CHECKS_DIR}")
print(f"📂 Visuals:    {VIS_DIR}")


# =============================================================================
# 7.1 Load Fused Metadata
# -----------------------------------------------------------------------------
# Load the cleaned metadata that includes both CASME II and SMIC microexpression records.
# This will be the foundation for all modeling and AU-based augmentation.
# =============================================================================

FUSED_PATH = CHECKS_DIR / "fused_microexpression_metadata.parquet"

try:
    fusion_df = pd.read_parquet(FUSED_PATH)
    print(f"✅ Loaded fused metadata: {fusion_df.shape}")
except FileNotFoundError:
    print(f"❌ Fused metadata not found at: {FUSED_PATH}")
    fusion_df = None

# --- Preview structure and distribution ---------------------------------------
if fusion_df is not None:
    display(fusion_df.head(3))
    fusion_df.info()  # info() prints directly and returns None, so it needs no display()
    print("✅ Emotion distribution:")
    print(fusion_df["Emotion"].value_counts())



---
## 7.2 Engineer Microexpression Features

This step derives numeric features from the raw metadata so that the classifiers in 7.3 have usable inputs.

Goals:

- Convert Onset, Peak, Offset, and Duration to numeric
- Compute Latency (time between Onset and Peak)
- Compute Intensity Window (Peak to Offset)
- Count ActionUnits (AU count from string list like "4+L10" → 2)
- Normalize casing in Modality, handle missing values if needed
- Confirm feature distribution
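
The goals above can be tried out on a few toy rows before touching the real data. The sketch below uses made-up values and adds one check that is an assumption rather than part of the notebook: flagging rows where the timing is out of order (Onset ≤ Peak ≤ Offset violated).

```python
import pandas as pd

# Toy rows with made-up values; column names follow the fused metadata.
toy = pd.DataFrame({
    "Onset": ["10", "5", "20"],
    "Peak": ["25", "3", "n/a"],
    "Offset": ["40", "30", "45"],
    "ActionUnits": ["4+L10", "12", None],
})

# Non-numeric entries such as "n/a" become NaN
for col in ["Onset", "Peak", "Offset"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

toy["Latency"] = toy["Peak"] - toy["Onset"]
toy["Intensity"] = toy["Offset"] - toy["Peak"]


def count_aus(entry):
    """Count '+'-separated AU codes; missing or empty entries count as 0."""
    if pd.isna(entry):
        return 0
    return sum(1 for part in str(entry).split("+") if part.strip())


toy["AU_Count"] = toy["ActionUnits"].apply(count_aus)

# Rows whose timing is out of order show up as negative derived values
out_of_order = (toy["Latency"] < 0) | (toy["Intensity"] < 0)

print(toy)
print("Out-of-order rows:", int(out_of_order.sum()))
print("Missing values per derived column:")
print(toy[["Latency", "Intensity"]].isna().sum())
```

Rows with a missing Peak produce NaN for both derived features, which is why the modeling step in 7.3 uses imputation.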

In [None]:
# =============================================================================
# 7.2 Engineer Microexpression Features
# -----------------------------------------------------------------------------
# Convert timing columns to numeric and compute derived features:
#   - Latency = Peak - Onset
#   - Intensity = Offset - Peak
#   - AU_Count = number of Action Units (e.g., "4+L10" → 2)
# Also standardize modality and handle missing values.
# =============================================================================

# --- Convert columns to numeric ------------------------------------------------
cols_to_numeric = ["Onset", "Peak", "Offset", "Duration"]
for col in cols_to_numeric:
    fusion_df[col] = pd.to_numeric(fusion_df[col], errors="coerce")

# --- Derive latency and intensity ---------------------------------------------
fusion_df["Latency"] = fusion_df["Peak"] - fusion_df["Onset"]
fusion_df["Intensity"] = fusion_df["Offset"] - fusion_df["Peak"]

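# --- Count Action Units -------------------------------------------------------
# Handles values like "4+L10", "12", etc. Missing or empty entries count as 0.
def count_aus(entry):
    if pd.isna(entry):
        return 0
    # Skip empty fragments so "" or a trailing "+" does not inflate the count
    return sum(1 for part in str(entry).split("+") if part.strip())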

fusion_df["AU_Count"] = fusion_df["ActionUnits"].apply(count_aus)

# --- Normalize modality casing ------------------------------------------------
fusion_df["Modality"] = fusion_df["Modality"].str.upper()

# --- Check nulls and structure ------------------------------------------------
display(fusion_df[["Onset", "Peak", "Offset", "Latency", "Intensity", "AU_Count"]].describe())
print("✅ Feature engineering complete: ready for modeling!")


In [None]:
# =============================================================================
# 7.2.1 Save Feature-Engineered Microexpression Metadata
# -----------------------------------------------------------------------------
# Purpose:
#   - Save the updated DataFrame after computing Latency, Intensity, AU_Count
#   - Store it as .parquet for reuse in the 7.3 modeling pipeline
# =============================================================================

FEATURES_PATH = CHECKS_DIR / "microexpression_features.parquet"

fusion_df.to_parquet(FEATURES_PATH, index=False)
print(f"✅ Saved engineered features → {FEATURES_PATH.name}")


In [None]:
# 🕷️ Spider Check: confirm save worked -----------------------------------------
if FEATURES_PATH.exists():
    print("📂 Feature file contents:")
    display(pd.read_parquet(FEATURES_PATH).sample(3))
else:
    print("❌ Save failed: file not found!")


---
## 7.3 Microexpression Emotion Modeling Kickoff
Purpose:

- Build baseline classifiers to predict emotion labels using facial-action metadata features (Latency, Intensity, AU_Count).
- Evaluate model performance with accuracy, F1-score, and confusion matrix.
- Save predictions for integration with Z3 rule logic in Notebook 08.
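
As a rough illustration of the plan above, the sketch below fits one baseline (logistic regression) inside an imputation + scaling pipeline and reports accuracy, macro F1, and a confusion matrix. The data is synthetic and random, so the scores are meaningless; the emotion labels, the median imputation strategy, and `max_iter=1000` are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (random values, not real CASME II / SMIC records)
rng = np.random.default_rng(42)
n = 120
toy = pd.DataFrame({
    "Latency": rng.normal(20, 5, n),
    "Intensity": rng.normal(25, 6, n),
    "AU_Count": rng.integers(1, 4, n).astype(float),
    "Emotion": rng.choice(["happiness", "surprise", "disgust"], n),
})
toy.loc[::15, "Latency"] = np.nan  # simulate missing timing values

X = toy[["Latency", "Intensity", "AU_Count"]]
y = toy["Emotion"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

# Imputation and scaling are fitted on the training split only
model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

labels = sorted(y.unique())
acc = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average="macro", zero_division=0)
cm = confusion_matrix(y_test, y_pred, labels=labels)

print(f"Accuracy: {acc:.3f}, macro F1: {macro_f1:.3f}")
print(pd.DataFrame(cm, index=labels, columns=labels))
```

Swapping in `RandomForestClassifier` or `KNeighborsClassifier` only changes the last pipeline step. Note that `stratify=y` raises an error when any class has fewer than two samples, so rare emotion labels may need merging or removal first.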


In [None]:
# =============================================================================
# 7.3 Microexpression Emotion Modeling Kickoff
# -----------------------------------------------------------------------------
# Train baseline classifiers (LogReg, RF, KNN) on facial metadata.
# Use a pipeline with imputation + scaling.
# Evaluate with classification metrics and confusion matrices.
# Save Z3-ready predictions in a separate cell so that plot output finalizes first.
# =============================================================================

# --- Import Packages ---------------------------------------------------------
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (
    accuracy_score, f1_score, confusion_matrix,
    classification_report, ConfusionMatrixDisplay
)
import matplotlib.pyplot as plt
import seaborn as sns

# --- Load Features -----------------------------------------------------------
ROOT = Path.cwd().parent
FEATURE_PATH = ROOT / "outputs" / "checks" / "microexpression_features.parquet"
df = pd.read_parquet(FEATURE_PATH)

print(f"✅ Loaded features: {df.shape}")
display(df.head())

# --- Visualize Emotion Distribution ------------------------------------------
plt.figure(figsize=(8, 4))
sns.countplot(x="Emotion", data=df, order=df["Emotion"].value_counts().index)
plt.title("Emotion Class Distribution")
plt.xlabel("Emotion")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# --- Train/Test Split --------------------------------------------------------
X = df[["Latency", "Intensity", "AU_Count"]]
y = df["Emotion"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

print(f"📊 Train set: {X_train.shape}, Test set: {X_test.shape}")

