## Objective
Create per-user features from Netflix viewing activity:
- Date of the first movie finished
- Name of the first movie finished
- Date of the last movie finished
- Name of the last movie finished
- Movies started (count)
- Movies finished (count)

## Inputs
- `users.csv`: one row per user (`id`)
- `activity.csv`: one row per viewing start, with columns:
  - `id` (activity row id, used as a tie-breaker)
  - `user_id`
  - `date` (date the viewing started)
  - `movie_name`
  - `finished` (1/0)

## Feature logic
1. Parse `activity.date` as a datetime.
2. `movies_started` = number of activity rows per user.
3. `movies_finished` = number of rows where `finished == 1` per user.
4. Filter to finished rows and sort by `(user_id, date, id)`, where `id` is the activity row id and breaks ties within the same date:
   - First finished movie = first row per user after sorting.
   - Last finished movie = last row per user after sorting.
5. Merge features back onto the `users` table.
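
The steps above can be sketched on a small toy dataset. The ids, dates, and movie titles below are invented for illustration and are not taken from `users.csv` or `activity.csv`. The sketch uses `drop_duplicates` to pick whole rows, which is one of several ways to take the first and last row per user.

```python
import pandas as pd

# Hypothetical toy data (made up for illustration only).
users = pd.DataFrame({"id": [10, 20, 30]})  # user 30 has no activity
activity = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "user_id": [10, 10, 10, 20, 20],
    "date": ["2024-01-05", "2024-01-05", "2024-01-02", "2024-02-01", "2024-02-03"],
    "movie_name": ["Alien", "Heat", "Jaws", "Up", "Big"],
    "finished": [1, 1, 1, 0, 1],
})

# Step 1: parse dates.
activity["date"] = pd.to_datetime(activity["date"])

# Steps 2-3: counts per user.
started = activity.groupby("user_id").size().rename("movies_started")
finished = activity[activity["finished"] == 1].sort_values(["user_id", "date", "id"])
n_finished = finished.groupby("user_id").size().rename("movies_finished")

# Step 4: first/last finished row per user after sorting.
first = finished.drop_duplicates("user_id", keep="first").set_index("user_id")
last = finished.drop_duplicates("user_id", keep="last").set_index("user_id")

# Step 5: merge counts back onto users; users without activity get 0.
features = users.merge(started, left_on="id", right_index=True, how="left")
features["movies_started"] = features["movies_started"].fillna(0).astype(int)

print(first[["date", "movie_name"]])
print(last[["date", "movie_name"]])
print(features)
```

User 10 finished two movies on the same date, so the activity `id` decides which one counts as the last.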

## Submission
Count users where `last_finished_movie == "Fight Club"`.
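
A minimal sketch of that count, assuming a per-user table with a `last_finished_movie` column. The rows below are hypothetical. Users with no finished movie have a missing value, which compares as `False` and therefore does not contribute to the count.

```python
import pandas as pd

# Hypothetical per-user table; None marks a user with no finished movie.
user_metrics = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "last_finished_movie": ["Fight Club", "Heat", None, "Fight Club"],
})

# Boolean mask: True only where the last finished movie matches exactly.
is_fight_club = user_metrics["last_finished_movie"] == "Fight Club"
count = int(is_fight_club.sum())
print(count)
```

The comparison is an exact, case-sensitive string match, so titles with different casing or stray whitespace would need normalizing first.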


In [1]:
import pandas as pd

USERS_PATH = "users.csv"
ACTIVITY_PATH = "activity.csv"

users = pd.read_csv(USERS_PATH)
activity = pd.read_csv(ACTIVITY_PATH)

# Expected columns:
# users: id
# activity: id, user_id, date, movie_name, finished
req_users = {"id"}
req_activity = {"id", "user_id", "date", "movie_name", "finished"}
if not req_users.issubset(users.columns):
    raise ValueError(f"users.csv missing columns: {sorted(req_users - set(users.columns))}")
if not req_activity.issubset(activity.columns):
    raise ValueError(f"activity.csv missing columns: {sorted(req_activity - set(activity.columns))}")

activity = activity.copy()
activity["date"] = pd.to_datetime(activity["date"], errors="raise")
activity["finished"] = activity["finished"].astype(int)

# Movies started / finished
movies_started = activity.groupby("user_id").size().rename("movies_started")
movies_finished = activity[activity["finished"] == 1].groupby("user_id").size().rename("movies_finished")

# First/last finished movie (tie-break by activity id within same date)
finished = activity[activity["finished"] == 1].sort_values(["user_id", "date", "id"])

# Use head(1)/tail(1) to take whole rows. groupby .first()/.last() pick the
# first/last non-null value per column and could mix values from different rows.
first_finished = (
    finished.groupby("user_id").head(1)[["user_id", "date", "movie_name"]]
    .rename(columns={"date": "first_finished_date", "movie_name": "first_finished_movie"})
)

last_finished = (
    finished.groupby("user_id").tail(1)[["user_id", "date", "movie_name"]]
    .rename(columns={"date": "last_finished_date", "movie_name": "last_finished_movie"})
)

# Combine features
features = (
    pd.DataFrame({"user_id": users["id"]})
    .merge(first_finished, on="user_id", how="left")
    .merge(last_finished, on="user_id", how="left")
    .merge(movies_started.reset_index(), on="user_id", how="left")
    .merge(movies_finished.reset_index(), on="user_id", how="left")
    .rename(columns={"user_id": "id"})
)

features["movies_started"] = features["movies_started"].fillna(0).astype(int)
features["movies_finished"] = features["movies_finished"].fillna(0).astype(int)

# Final table per user (users + engineered features)
user_metrics = users.merge(features, on="id", how="left")
print(user_metrics)

# Submission question:
# How many users have "Fight Club" as the last movie they finished?
fight_club_last = (user_metrics["last_finished_movie"] == "Fight Club").sum()
print("users_with_fight_club_as_last_finished_movie =", int(fight_club_last))


    id  created_at country_code first_finished_date  \
0    1  2023-05-26           CA          2023-09-12   
1    2  2023-06-15           CA          2023-06-22   
2    3  2023-07-18           MX          2023-11-10   
3    4  2023-07-27           CA          2023-07-27   
4    5  2023-09-01           US          2023-09-07   
5    6  2023-11-20           CA          2023-12-18   
6    7  2023-11-21           US          2024-08-28   
7    8  2024-01-12           US          2024-01-24   
8    9  2024-01-17           US          2024-02-25   
9   10  2024-02-13           MX          2024-05-01   
10  11  2024-03-25           US          2025-01-29   
11  12  2024-04-10           MX          2024-04-11   
12  13  2024-07-02           MX          2024-09-19   
13  14  2024-07-24           CA          2024-09-29   
14  15  2024-08-19           MX          2024-09-21   
15  16  2024-08-28           US          2024-10-12   
16  17  2024-09-01           US          2024-09-01   
17  18  20