# Tutorial: Getting Started with jpinfectpy

**Audience:** analysts and researchers who want a quick, reproducible look at Japanese infectious disease surveillance data.

**Prerequisites:** basic Python, familiarity with dataframes, and an installed copy of the package (`pip install -e .` from a source checkout, or `pip install jpinfectpy`).

**Learning goals:**
- Load a bundled dataset.
- Reshape between wide and long formats.
- Answer a small, practical question with Polars.

## Outline

1. Load a bundled dataset.
2. Reshape to long format.
3. Summarize a disease signal.
4. Practice: find top prefectures for the latest week.

In [None]:
# Setup
from __future__ import annotations

import polars as pl

from jpinfectpy.datasets import load_dataset
from jpinfectpy.transform import pivot

## Step 1 - Load a bundled dataset

The `bullet` dataset ships with the package, so loading it works offline, which makes it ideal for examples.

In [None]:
# Load the weekly bulletin dataset (wide format)
wide_df = load_dataset("bullet")
wide_df.head(5)

## Step 2 - Reshape to long format

`pivot` switches between wide and long layouts. Here it converts the wide bulletin table to long format, which is usually easier to filter and group.

In [None]:
long_df = pivot(wide_df, return_type="polars")
long_df.head(5)

## Step 3 - Summarize a disease signal

Compute total influenza cases by prefecture, summed over every week in the dataset, and keep the five highest.

In [None]:
influenza_totals = (
    long_df
    .filter(pl.col("disease") == "Influenza")
    .group_by("prefecture")
    .agg(pl.col("cases").sum().alias("cases_total"))
    .sort("cases_total", descending=True)
    .head(5)
)

influenza_totals

## Exercises

**Exercise:** Find the latest year/week in the dataset and return the top 5 prefectures for Influenza in that week.

In [None]:
# Exercise scaffold
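# 1) Find the latest year, then the latest week within that year
#    (max year and max week taken independently may not be a real pair)
latest_year = long_df.select(pl.col("year").max()).item()
latest_week = (
    long_df
    .filter(pl.col("year") == latest_year)
    .select(pl.col("week").max())
    .item()
)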

# 2) Filter Influenza rows for that week
# 3) Group by prefecture and return top 5
# TODO: implement

In [None]:
# Exercise answer
# Latest year first, then the latest week reported within that year
latest_year = long_df.select(pl.col("year").max()).item()
latest_week = (
    long_df
    .filter(pl.col("year") == latest_year)
    .select(pl.col("week").max())
    .item()
)

influenza_latest = (
    long_df
    .filter(
        (pl.col("disease") == "Influenza")
        & (pl.col("year") == latest_year)
        & (pl.col("week") == latest_week)
    )
    .group_by("prefecture")
    .agg(pl.col("cases").sum().alias("cases_total"))
    .sort("cases_total", descending=True)
    .head(5)
)

influenza_latest

## Pitfalls and extensions

- **Pitfall:** Unlike `load_dataset`, which reads the bundled copy, `read_bullet` downloads files from the web; use it only when you need the latest data and have network access.
- **Extension:** Join with population or demographics data to compute rates per 100,000.