# Taxonomy Category Analysis

This notebook provides helper functions for analyzing taxonomy category prevalence across rating classes.

## Overview
- **Kruskal-Wallis statistical tests**: Test category prevalence differences across rating classes
- **Visualization helpers**: Box plots with jitter for category prevalence by rating class

## Prerequisites

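This notebook expects the book-level category proportions produced by `aggregate_taxonomy_by_book.py`:
- A DataFrame with one row per book and category, with columns `book_id`, `rating_class`, `main_category_id`, and `prop` (the proportion of the book's sentences assigned to that category)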
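
To make the expected schema concrete, here is a minimal sketch that builds a table of this shape from made-up sentence-level labels. It is not the logic of `aggregate_taxonomy_by_book.py`, which is not shown here. The `sentences` frame and its values are hypothetical, and filling absent categories with zeros is an assumption worth checking against the real aggregation.

```python
import pandas as pd

# Hypothetical sentence-level labels: one row per labelled sentence.
sentences = pd.DataFrame(
    {
        "book_id": ["b1", "b1", "b1", "b1", "b2", "b2", "b3", "b3", "b3"],
        "rating_class": ["low"] * 4 + ["high"] * 2 + ["mid"] * 3,
        "main_category_id": ["1.1", "1.1", "2.3", "4.4", "1.1", "2.3", "4.4", "4.4", "4.4"],
    }
)

# Count sentences per (book, category). fill_value=0 makes categories that
# never occur in a book show up as explicit zeros (an assumption).
counts = (
    sentences.groupby(["book_id", "main_category_id"])
    .size()
    .unstack(fill_value=0)
)

# Divide each row by the book's total sentence count to get proportions.
props = counts.div(counts.sum(axis=1), axis=0)

# Back to long format, then re-attach each book's rating class.
book_cat = (
    props.reset_index()
    .melt(id_vars="book_id", var_name="main_category_id", value_name="prop")
    .merge(sentences[["book_id", "rating_class"]].drop_duplicates(), on="book_id")
    [["book_id", "rating_class", "main_category_id", "prop"]]
)
print(book_cat)
```

Whether zero-proportion rows are present matters for the tests in this notebook: without them, each category is compared only across the books in which it occurs.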


## Setup: Imports and Configuration

In [None]:
import sys
from pathlib import Path

# Add the project root to sys.path (assumes the notebook runs two directories below it)
project_root = Path.cwd().parent.parent
sys.path.insert(0, str(project_root))

from typing import List
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import kruskal

print("✓ Imports loaded")


## Helper Functions

In [None]:
def kruskal_by_rating(book_cat: pd.DataFrame) -> pd.DataFrame:
    """
    For each taxonomy category, run a Kruskal–Wallis test
    over rating_class (e.g. low/mid/high) on book-level proportions.

    Parameters
    ----------
    book_cat:
        DataFrame with columns:
        - 'main_category_id'
        - 'rating_class'
        - 'prop'

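    Returns
    -------
    DataFrame with one row per category that has at least two rating
    classes with five or more books each. The p-values are not adjusted
    for multiple comparisons. Columns:
        - category_id (taken from 'main_category_id')
        - groups (list of rating classes tested)
        - n_books_per_group (list of sample sizes, same order as groups)
        - H_statistic
        - p_value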
    """
    results: List[dict] = []

    cats = sorted(book_cat["main_category_id"].dropna().unique())
    for cat in cats:
        sub = book_cat[book_cat["main_category_id"] == cat]

        groups = []
        labels = []
        ns = []

        # Drop missing rating labels first: sorting NaN alongside strings raises TypeError
        for rating in sorted(sub["rating_class"].dropna().unique()):
            vals = sub.loc[sub["rating_class"] == rating, "prop"].dropna()
            if len(vals) >= 5:  # avoid tiny groups
                groups.append(vals.to_numpy())
                labels.append(rating)
                ns.append(len(vals))

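        if len(groups) >= 2:
            try:
                stat, p = kruskal(*groups)
            except ValueError:
                # kruskal can raise when every value is identical
                # (e.g. all proportions are zero); record NaN instead of failing
                stat, p = float("nan"), float("nan")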
            results.append(
                {
                    "category_id": cat,
                    "groups": labels,
                    "n_books_per_group": ns,
                    "H_statistic": stat,
                    "p_value": p,
                }
            )

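    # Pass the columns explicitly so an empty result still has the documented schema
    return pd.DataFrame(
        results,
        columns=["category_id", "groups", "n_books_per_group", "H_statistic", "p_value"],
    )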

print("✓ Kruskal-Wallis helper function defined")
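
Because the helper above runs one test per category, its p-values are unadjusted. One common option, which is not part of the original notebook, is a Benjamini-Hochberg false discovery rate adjustment. The sketch below implements it with NumPy. The function name `benjamini_hochberg`, the `p_adjusted` column, and the example values are hypothetical, and the code assumes no p-value is NaN.

```python
import numpy as np
import pandas as pd


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumes the input contains no NaN values.
    """
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)  # indices that sort p ascending
    # Scale the i-th smallest p-value by n / i.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity: each adjusted value is the minimum of itself
    # and the adjusted values of all larger raw p-values.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


# Hypothetical results with the same p_value column that kruskal_by_rating returns.
results = pd.DataFrame(
    {
        "category_id": ["1.1", "2.3", "4.4", "5.2"],
        "p_value": [0.01, 0.04, 0.03, 0.50],
    }
)
results["p_adjusted"] = benjamini_hochberg(results["p_value"])
print(results.sort_values("p_value"))
```

With real output, drop or handle rows whose `p_value` is missing before adjusting, since the number of tests `n` should count only the tests actually run.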


In [None]:
def plot_category_prevalence(
    book_cat: pd.DataFrame,
    category_id: str,
    rating_order=("low", "mid", "high"),
):
    """
    Box + jitter plot for one taxonomy category across rating classes.

    Parameters
    ----------
    book_cat:
        DataFrame with columns:
        - 'main_category_id'
        - 'rating_class'
        - 'prop'
    category_id:
        Taxonomy category ID to plot (e.g., "4.4", "2.3").
    rating_order:
        Tuple of rating class labels in desired order.

    Returns
    -------
    fig, ax:
        Matplotlib figure and axes objects.
    """
    sub = book_cat[book_cat["main_category_id"] == category_id].copy()
    if sub.empty:
        # !r shows the value's type, which helps spot str vs. float ID mismatches
        raise ValueError(f"No rows for category_id={category_id!r}")

    fig, ax = plt.subplots(figsize=(6, 4))

    sns.boxplot(
        data=sub,
        x="rating_class",
        y="prop",
        order=rating_order,
        showfliers=False,  # the strip plot already draws every point, outliers included
        ax=ax,
    )
    sns.stripplot(
        data=sub,
        x="rating_class",
        y="prop",
        order=rating_order,
        ax=ax,
        color="black",  # keep points visible against the box fill
        alpha=0.4,
        jitter=0.2,
        dodge=False,
    )

    ax.set_title(f"Category {category_id}: prevalence by rating class")
    ax.set_ylabel("Proportion of sentences per book")
    ax.set_xlabel("Rating class")
    # Lay out this figure explicitly rather than whichever figure is current
    fig.tight_layout()
    return fig, ax

print("✓ Visualization helper function defined")
