# Streak Leaderboard (Active Streaks)

Compute the **active streak** for each user and create a leaderboard of the **top 10** users with the longest active streaks.

## Definitions

- **Streak length**: number of **consecutive days** on which the user has **at least one** lesson completion.
- **Current date (fixed for this drill)**: `2025-09-29`
- **Active streak requirement**: the user must have completed a lesson on `2025-09-28` (the day before the current date). The active streak is the run of consecutive days that ends on that date.
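
As a cross-check on these definitions, here is a minimal sketch that counts an active streak by walking backward from the day before the current date. This is a plain-Python alternative, not the vectorized method used in the notebook; the function name `active_streak_length` and the two sample users are hypothetical.

```python
from datetime import date, timedelta

def active_streak_length(days, current_date):
    """Count consecutive completion days ending on the day before current_date.

    `days` is any iterable of datetime.date values on which the user
    completed at least one lesson. Returns 0 if the streak is not active.
    """
    completed = set(days)  # duplicates on the same day collapse here
    cursor = current_date - timedelta(days=1)  # the streak must include this day
    length = 0
    while cursor in completed:
        length += 1
        cursor -= timedelta(days=1)
    return length

current = date(2025, 9, 29)
# Hypothetical users: one with a gap on 2025-09-25, one who missed 2025-09-28.
active_user = [date(2025, 9, 24), date(2025, 9, 26), date(2025, 9, 27), date(2025, 9, 28)]
lapsed_user = [date(2025, 9, 25), date(2025, 9, 26), date(2025, 9, 27)]

print(active_streak_length(active_user, current))  # 3
print(active_streak_length(lapsed_user, current))  # 0
```

A per-user loop like this is handy for spot-checking a few users against the leaderboard, but it is likely slower than the grouped approach on the full file.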

## Input

`LessonStreaks.csv` (~900,000 rows) containing lesson completions with:
- user identifier (e.g., `user_id`)
- completion timestamp/date (e.g., `completed_at`)
- lesson identifier (not needed for streak computation)

## Approach

### 1) Normalize to one row per user per day
- Parse completion timestamps into a normalized date column (`day`)
- Filter to `day <= 2025-09-28` because the active streak must end on that date
- Drop duplicates so multiple lessons on the same day count once
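
The three bullets above can be sketched on a tiny, made-up frame. The column names `user_id` and `completed_at` and the sample rows are assumptions for illustration; the real file may name its columns differently.

```python
import pandas as pd

# Hypothetical sample; real column names may differ.
sample = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "completed_at": [
        "2025-09-27 08:15:00",
        "2025-09-27 21:40:00",  # second lesson on the same day
        "2025-09-28 07:05:00",
        "2025-09-28 12:00:00",
        "2025-09-29 09:30:00",  # after the cutoff, so it is dropped
    ],
})
cutoff = pd.Timestamp("2025-09-28")

# Truncate each timestamp to midnight so it represents a calendar day.
user_days = sample.assign(day=pd.to_datetime(sample["completed_at"]).dt.normalize())
user_days = user_days[user_days["day"] <= cutoff]
# Keep one row per (user, day).
user_days = user_days.drop_duplicates(subset=["user_id", "day"])[["user_id", "day"]]
user_days = user_days.reset_index(drop=True)
print(user_days)
```

Five completions reduce to three user-day rows: two for user 1 and one for user 2.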

### 2) Identify consecutive-day runs
For each user, sort their distinct completion days and assign:
- `rn`: zero-based running index within each user (0, 1, 2, ...)
- `grp = day - rn` days, i.e. the date shifted back by `rn` days

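Within a run of consecutive dates, `day` and `rn` both advance by one per row, so their difference stays constant. Consecutive dates therefore share the same `grp`, any gap produces a new value, and each `(user, grp)` pair defines a streak “run”.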
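
To make the grouping key concrete, here is a small sketch for a single hypothetical user (`user_id` 7) with two runs separated by a gap. The sample dates are made up.

```python
import pandas as pd

# Hypothetical user with a two-day run, a gap, then a three-day run.
days = pd.DataFrame({
    "user_id": [7] * 5,
    "day": pd.to_datetime([
        "2025-09-20", "2025-09-21",
        "2025-09-26", "2025-09-27", "2025-09-28",
    ]),
})
days = days.sort_values(["user_id", "day"]).reset_index(drop=True)

# Running index per user, then shift each date back by that many days.
days["rn"] = days.groupby("user_id").cumcount()
days["grp"] = days["day"] - pd.to_timedelta(days["rn"], unit="D")
print(days)
```

The first two rows map to `2025-09-20` and the last three to `2025-09-24`. Note that `grp` is only a label: it equals the run's start date for a user's first run, but not in general for later runs.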

### 3) Aggregate runs and filter active streaks
Aggregate per `(user, grp)` to get:
- `run_start` = first day in the run
- `run_end` = last day in the run
- `streak_len` = number of days in the run

Then keep only runs where:
- `run_end == 2025-09-28`

These are the **active streaks**.
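
A minimal sketch of the aggregation and the active filter, using two hypothetical users whose rows are already sorted by user and day. Only user 7 has a run that ends on `2025-09-28`.

```python
import pandas as pd

# Hypothetical user-day rows, already de-duplicated and sorted.
labelled = pd.DataFrame({
    "user_id": [7, 7, 7, 7, 7, 8, 8],
    "day": pd.to_datetime([
        "2025-09-20", "2025-09-21", "2025-09-26", "2025-09-27", "2025-09-28",
        "2025-09-25", "2025-09-26",
    ]),
})
labelled["rn"] = labelled.groupby("user_id").cumcount()
labelled["grp"] = labelled["day"] - pd.to_timedelta(labelled["rn"], unit="D")

# One row per run: its first day, last day, and number of days.
runs = labelled.groupby(["user_id", "grp"], as_index=False).agg(
    run_start=("day", "min"),
    run_end=("day", "max"),
    streak_len=("day", "size"),
)

# Keep only runs that end on the required day.
active = runs[runs["run_end"] == pd.Timestamp("2025-09-28")]
print(active[["user_id", "run_start", "run_end", "streak_len"]])
```

Three runs are found in total, and only user 7's three-day run survives the filter. Because a user's days are distinct, at most one run per user can end on a given date, so each user appears at most once among the active streaks.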

### 4) Leaderboard + submission value
- Sort active streaks by `streak_len` descending
- Take the top 10 for the leaderboard
- The submission answer is the **third** largest `streak_len`, i.e. the value in the third row of the sorted list (tied lengths count as separate entries)
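
The sketch below shows the ranking step on made-up streak lengths and illustrates why ties matter when reading "third largest". The sample values and the `third_distinct` variable are hypothetical, included only for contrast with the third-row reading.

```python
import pandas as pd

# Hypothetical active streaks with tied lengths.
active = pd.DataFrame({
    "user_id": [11, 12, 13, 14, 15],
    "streak_len": [30, 45, 45, 12, 30],
})

# Longest first; break ties by user id so the order is deterministic.
board = (
    active.sort_values(["streak_len", "user_id"], ascending=[False, True])
          .head(10)
          .reset_index(drop=True)
)

# Third row of the sorted list (ties count as separate entries).
third = int(board["streak_len"].iloc[2]) if len(board) >= 3 else None

# For contrast: third distinct length (assumes at least three distinct values).
third_distinct = int(sorted(active["streak_len"].unique(), reverse=True)[2])

print(board)
print(third, third_distinct)  # 30 12
```

When the top lengths are all different, as in the printed leaderboard, both readings give the same value.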

## Outputs

- `leaderboard`: top 10 users with the longest active streaks
- `third_longest_active_streak`: length of the third longest active streak

In [1]:
import pandas as pd
from pathlib import Path

# Config (per prompt)
DATA_PATH = Path("LessonStreaks.csv")  # update if needed
CURRENT_DATE = pd.Timestamp("2025-09-29")
ACTIVE_DAY = CURRENT_DATE - pd.Timedelta(days=1)  # 2025-09-28

# Load
df = pd.read_csv(DATA_PATH)

# Infer columns (robust to common naming)
def infer_column(columns, candidates):
    cols = set(columns)
    for c in candidates:
        if c in cols:
            return c
    return None

user_col = infer_column(df.columns, ["user_id", "user", "userid", "userId", "UserId", "User"])
date_col = infer_column(df.columns, ["date", "completion_date", "completed_at", "completed_on", "timestamp", "event_date", "lesson_date"])

if user_col is None or date_col is None:
    raise ValueError(
        "Could not infer required columns. "
        f"Found columns: {list(df.columns)}. "
        "Expected something like user_id/user and date/completed_at."
    )

# Normalize to one row per user-day
work = df[[user_col, date_col]].copy()
# Unparseable timestamps become NaT (dropped below); normalize() truncates to midnight
work["day"] = pd.to_datetime(work[date_col], errors="coerce").dt.normalize()
work = work.dropna(subset=["day"])
work = work[work["day"] <= ACTIVE_DAY]
work = work.drop_duplicates(subset=[user_col, "day"])
work = work.sort_values([user_col, "day"]).reset_index(drop=True)

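# Compute consecutive-day runs: within a run, day minus its per-user
# running index is constant, so it serves as the run's grouping key
work["rn"] = work.groupby(user_col).cumcount()
work["grp"] = work["day"] - pd.to_timedelta(work["rn"], unit="D")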

runs = (
    work.groupby([user_col, "grp"], as_index=False)
        .agg(run_start=("day", "min"),
             run_end=("day", "max"),
             streak_len=("day", "size"))
)

# Active streaks must end on ACTIVE_DAY
active_runs = runs[runs["run_end"] == ACTIVE_DAY].copy()

# Leaderboard (top 10)
leaderboard = (
    active_runs.sort_values(["streak_len", user_col], ascending=[False, True])
              .head(10)
              .reset_index(drop=True)
)

print(leaderboard)

# Submit: third longest active streak length
streaks_sorted = active_runs["streak_len"].sort_values(ascending=False).reset_index(drop=True)
third_longest = int(streaks_sorted.iloc[2]) if len(streaks_sorted) >= 3 else None
print("third_longest_active_streak =", third_longest)


     user_id        grp  run_start    run_end  streak_len
0   84071240 2025-05-01 2025-05-01 2025-09-28         151
1   16814590 2025-05-13 2025-05-31 2025-09-28         121
2  215518363 2025-05-11 2025-06-01 2025-09-28         120
3  213142099 2025-05-28 2025-06-25 2025-09-28          96
4  215121049 2025-05-12 2025-07-03 2025-09-28          88
5  176891675 2025-05-03 2025-07-13 2025-09-28          78
6  227438241 2025-07-15 2025-07-15 2025-09-28          76
7  199107800 2025-06-13 2025-08-07 2025-09-28          53
8  210092431 2025-07-11 2025-08-09 2025-09-28          51
9   74627365 2025-08-04 2025-08-11 2025-09-28          49
third_longest_active_streak = 120
