# Day 1 — Build Dataset + EDA (Spotify Streaming History → Modeling Tables)

**Goal:** Create clean, enriched datasets from your personal Spotify streaming history (2012–2023) and generate a small set of story-ready EDA visuals.

**Outputs written by this notebook**
- `data/processed/listens_event_level.parquet` (one row per play)
- `data/processed/tracks_modeling_table.parquet` (one row per track; ready for Day 2)
- `data/processed/artists_modeling_table.parquet` (one row per artist; optional)
- `data/cache/audio_features.parquet` (Spotify track audio features + metadata cache)
- `data/cache/artist_genres.json` (Spotify artist genres cache)

**Expected runtime:** depends on how many unique tracks/artists you have and your Spotify API rate limits. Caching prevents repeated calls.

---

## Setup checklist
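1. Create a Spotify Developer app and set `SPOTIPY_CLIENT_ID`, `SPOTIPY_CLIENT_SECRET`, `SPOTIPY_REDIRECT_URI` as environment variables. The client setup cell reads the client ID and secret from `config.spotify`, so `config.py` must expose them there.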
2. Install deps: `pip install spotipy pandas numpy matplotlib plotly tqdm python-dotenv pyarrow`
3. (Optional) Add a CJK-capable font (Noto Sans CJK) to avoid tofu boxes in plots.


In [1]:
import logging
logging.getLogger("spotipy").setLevel(logging.WARNING)

In [2]:
# Core imports
from __future__ import annotations

import json
import os
from pathlib import Path
from typing import Iterable, Dict, Any, List, Optional

import numpy as np
import pandas as pd
from tqdm.auto import tqdm

import matplotlib as mpl
import matplotlib.pyplot as plt

# Optional: load .env in local dev (skipped only when python-dotenv is not installed)
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass

pd.set_option('display.max_columns', 200)
pd.set_option('display.width', 160)


In [3]:
# Repo root + config setup (for notebooks/ execution)
import sys
from pathlib import Path

HERE = Path().resolve()
ROOT_DIR = HERE
# Walk upward until we find config.py (repo root marker)
while not (ROOT_DIR / 'config.py').exists() and ROOT_DIR != ROOT_DIR.parent:
    ROOT_DIR = ROOT_DIR.parent

if not (ROOT_DIR / 'config.py').exists():
    raise RuntimeError(
        "config.py not found. Create it at the repo root (or copy config_template.py -> config.py)."
    )

# Ensure repo root is importable
if str(ROOT_DIR) not in sys.path:
    sys.path.insert(0, str(ROOT_DIR))

import config
print('Repo root:', ROOT_DIR)
print('Loaded config.py')


Repo root: C:\Users\maxma\Documents\Spotify Project
Loaded config.py


In [4]:
# Paths (rooted at repo root so running from notebooks/ works)
PROJECT_DIR = ROOT_DIR
DATA_DIR = PROJECT_DIR / 'data'
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'
CACHE_DIR = DATA_DIR / 'cache'
FIG_DIR = PROJECT_DIR / 'reports' / 'figures'

for d in [RAW_DIR, PROCESSED_DIR, CACHE_DIR, FIG_DIR]:
    d.mkdir(parents=True, exist_ok=True)

print('PROJECT_DIR:', PROJECT_DIR)
print('RAW_DIR:', RAW_DIR)

# Default location for Spotify JSONs
DEFAULT_INPUT_DIR = RAW_DIR / 'spotify'
# Fallback for sandbox runs (this repo uses /mnt/data)
if not DEFAULT_INPUT_DIR.exists():
    sandbox_dir = Path('/mnt/data')
    if sandbox_dir.exists():
        DEFAULT_INPUT_DIR = sandbox_dir

print('DEFAULT_INPUT_DIR:', DEFAULT_INPUT_DIR)


PROJECT_DIR: C:\Users\maxma\Documents\Spotify Project
RAW_DIR: C:\Users\maxma\Documents\Spotify Project\data\raw
DEFAULT_INPUT_DIR: C:\Users\maxma\Documents\Spotify Project\data\raw\spotify


In [5]:
# CJK-safe plotting helper

def set_cjk_font(preferred_fonts: Optional[list[str]] = None) -> Optional[str]:
    # Attempt to set a CJK-capable font. Returns chosen font name (or None).
    if preferred_fonts is None:
        preferred_fonts = [
            'Noto Sans CJK JP',
            'Noto Sans CJK KR',
            'Noto Sans CJK SC',
            'Noto Sans JP',
            'Arial Unicode MS',
        ]

    available = {f.name for f in mpl.font_manager.fontManager.ttflist}
    for name in preferred_fonts:
        if name in available:
            mpl.rcParams['font.family'] = name
            mpl.rcParams['axes.unicode_minus'] = False
            return name

    # Fallback: DejaVu is commonly available but may not cover CJK fully
    mpl.rcParams['font.family'] = 'DejaVu Sans'
    mpl.rcParams['axes.unicode_minus'] = False
    return None

chosen = set_cjk_font()
print('CJK font chosen:' , chosen or '(fallback DejaVu Sans; install Noto Sans CJK for better coverage)')


CJK font chosen: (fallback DejaVu Sans; install Noto Sans CJK for better coverage)
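
If the helper falls back to DejaVu Sans as in the output above, one option is to register a downloaded font file with matplotlib directly. The sketch below is only an illustration: `register_font_file` is a hypothetical helper that is not part of this notebook, and the font path in the usage comment is an assumed location.

```python
from pathlib import Path
from typing import Optional

import matplotlib as mpl
from matplotlib import font_manager


def register_font_file(font_path: Path) -> Optional[str]:
    """Register a font file with matplotlib and make it the default family.

    Returns the family name read from the file, or None if the file is missing.
    """
    if not font_path.is_file():
        return None
    # Make the file known to matplotlib's font manager for this session
    font_manager.fontManager.addfont(str(font_path))
    # Read the family name stored inside the font file
    name = font_manager.FontProperties(fname=str(font_path)).get_name()
    mpl.rcParams['font.family'] = name
    mpl.rcParams['axes.unicode_minus'] = False
    return name


# Hypothetical usage (path is an assumption, adjust to where you saved the font):
# register_font_file(Path('assets/fonts/NotoSansCJKjp-Regular.otf'))
missing = register_font_file(Path('no/such/font.otf'))
print('Registered font:', missing)
```

Registering by file path avoids a system-wide font install, which can be convenient on machines where you lack admin rights.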


## 1) Load + clean streaming history JSONs

We will:
- Read all `Streaming_History_Audio_*.json` files
- Keep only music tracks (exclude podcast episodes)
- Parse timestamps, build time features
- Keep raw playback facts: `ms_played`, `skipped`, start/end reasons
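
To make the steps above concrete, here is a minimal sketch on two made-up rows (one track, one podcast episode). The URI and values are invented for illustration; the column names follow the export format shown later in this notebook.

```python
import pandas as pd

# Two made-up export rows: a music track and a podcast episode (no track URI)
toy = pd.DataFrame([
    {"ts": "2020-01-05T23:30:00Z", "ms_played": 185000,
     "spotify_track_uri": "spotify:track:EXAMPLE"},
    {"ts": "2020-01-06T08:00:00Z", "ms_played": 900000,
     "spotify_track_uri": None},
])

# Parse timestamps as timezone-aware UTC values
toy["ts"] = pd.to_datetime(toy["ts"], utc=True, errors="coerce")

# Episodes have no track URI, so this filter keeps music only
music = toy[toy["spotify_track_uri"].notna()].copy()

# Time features are derived from the UTC timestamp, not local time
music["hour"] = music["ts"].dt.hour
music["dayofweek"] = music["ts"].dt.dayofweek  # Monday=0 ... Sunday=6
music["year_month"] = music["ts"].dt.strftime("%Y-%m")
music["seconds_played"] = music["ms_played"] / 1000.0

print(music[["ts", "hour", "dayofweek", "year_month", "seconds_played"]])
```

Because the features are UTC-based, a late-evening play in a western timezone can land on the next calendar day; convert `ts` to a local timezone first if that matters for your analysis.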


In [6]:
def load_streaming_history(input_dir: Path) -> pd.DataFrame:
    files = sorted(input_dir.glob('Streaming_History_Audio_*.json'))
    if not files:
        raise FileNotFoundError(f'No Streaming_History_Audio_*.json found in {input_dir}')

    dfs = []
    for fp in tqdm(files, desc='Reading streaming history JSON'):
        df = pd.read_json(fp)
        df['source_file'] = fp.name
        dfs.append(df)

    out = pd.concat(dfs, ignore_index=True)
    return out

raw_df = load_streaming_history(DEFAULT_INPUT_DIR)
raw_df.head(3)


Reading streaming history JSON:   0%|          | 0/9 [00:00<?, ?it/s]

Unnamed: 0,ts,username,platform,ms_played,conn_country,ip_addr_decrypted,user_agent_decrypted,master_metadata_track_name,master_metadata_album_artist_name,master_metadata_album_album_name,spotify_track_uri,episode_name,episode_show_name,spotify_episode_uri,reason_start,reason_end,shuffle,skipped,offline,offline_timestamp,incognito_mode,source_file
0,2012-08-31T17:21:11Z,1246157207,"iOS 5.1.1 (iPhone3,3)",21966,US,174.229.2.243,unknown,Kill Shit,Krizz Kaliko,Kickin' & Screamin',spotify:track:3eMfBkKz0ZuffMqIVHhNr1,,,,,,False,True,False,0,False,Streaming_History_Audio_2012-2014_0.json
1,2012-08-31T17:30:20Z,1246157207,"iOS 5.1.1 (iPhone3,3)",454489,US,174.229.2.243,unknown,Mayday,Krizz Kaliko,Kickin' & Screamin',spotify:track:44eZ0RG3gWBfiD5o9pvIV9,,,,,trackdone,False,False,False,0,False,Streaming_History_Audio_2012-2014_0.json
2,2012-08-31T17:31:17Z,1246157207,"iOS 5.1.1 (iPhone3,3)",59112,US,174.229.2.243,unknown,Dumb For You,Krizz Kaliko,Kickin' & Screamin',spotify:track:0uQWGMWQAtpISoXTEi5as6,,,,trackdone,backbtn,False,True,False,0,False,Streaming_History_Audio_2012-2014_0.json


In [7]:
def clean_streaming_history(df: pd.DataFrame) -> pd.DataFrame:
    # Keep only music tracks (exclude podcast episodes)
    df = df.copy()

    # Standardize timestamp
    df['ts'] = pd.to_datetime(df['ts'], utc=True, errors='coerce')

    # Spotify exports sometimes include episodes; keep rows with track URIs
    df = df[df['spotify_track_uri'].notna()].copy()

    # Keep key identifiers
    df.rename(columns={
        'master_metadata_track_name': 'track_name',
        'master_metadata_album_artist_name': 'artist_name',
        'master_metadata_album_album_name': 'album_name',
    }, inplace=True)

    # Basic sanity
    df['ms_played'] = pd.to_numeric(df['ms_played'], errors='coerce')
    df = df[df['ms_played'].notna()].copy()
    df = df[df['ms_played'] > 0].copy()

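    # Time features (ts is UTC, so hour/dayofweek are UTC-based)
    df['date'] = df['ts'].dt.date
    df['year'] = df['ts'].dt.year
    df['month'] = df['ts'].dt.month
    # strftime avoids the "will drop timezone information" warning that to_period() emits on tz-aware data
    df['year_month'] = df['ts'].dt.strftime('%Y-%m')
    df['dayofweek'] = df['ts'].dt.dayofweek
    df['hour'] = df['ts'].dt.hour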

    # Playback features
    df['seconds_played'] = df['ms_played'] / 1000.0

    # Ensure boolean-ish skipped field
    if 'skipped' in df.columns:
        df['skipped'] = df['skipped'].fillna(False).astype(bool)
    else:
        df['skipped'] = False

    # Minimal columns for event-level modeling
    keep = [
        'ts','date','year','month','year_month','dayofweek','hour',
        'track_name','artist_name','album_name','spotify_track_uri',
        'ms_played','seconds_played','skipped',
        'reason_start','reason_end','shuffle','platform','conn_country','source_file'
    ]
    keep = [c for c in keep if c in df.columns]
    df = df[keep].copy()

    return df

df = clean_streaming_history(raw_df)
print('rows:', len(df))
print('unique tracks:', df['spotify_track_uri'].nunique())
print('unique artists:', df['artist_name'].nunique())
df.head(3)




rows: 136946
unique tracks: 25516
unique artists: 6144


Unnamed: 0,ts,date,year,month,year_month,dayofweek,hour,track_name,artist_name,album_name,spotify_track_uri,ms_played,seconds_played,skipped,reason_start,reason_end,shuffle,platform,conn_country,source_file
0,2012-08-31 17:21:11+00:00,2012-08-31,2012,8,2012-08,4,17,Kill Shit,Krizz Kaliko,Kickin' & Screamin',spotify:track:3eMfBkKz0ZuffMqIVHhNr1,21966,21.966,True,,,False,"iOS 5.1.1 (iPhone3,3)",US,Streaming_History_Audio_2012-2014_0.json
1,2012-08-31 17:30:20+00:00,2012-08-31,2012,8,2012-08,4,17,Mayday,Krizz Kaliko,Kickin' & Screamin',spotify:track:44eZ0RG3gWBfiD5o9pvIV9,454489,454.489,False,,trackdone,False,"iOS 5.1.1 (iPhone3,3)",US,Streaming_History_Audio_2012-2014_0.json
2,2012-08-31 17:31:17+00:00,2012-08-31,2012,8,2012-08,4,17,Dumb For You,Krizz Kaliko,Kickin' & Screamin',spotify:track:0uQWGMWQAtpISoXTEi5as6,59112,59.112,True,trackdone,backbtn,False,"iOS 5.1.1 (iPhone3,3)",US,Streaming_History_Audio_2012-2014_0.json


## 2) Spotify API enrichment (cached)

We will add:
- Track metadata: `track_id`, `track_popularity`, `duration_ms`, `explicit`, `album_release_date`
- Audio features: danceability, energy, valence, tempo, etc. (skipped automatically if the `/audio-features` endpoint returns 403)

We cache results to `data/cache/audio_features.parquet`.
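
The cache-then-fetch idea can be sketched without any API access. In the example below, `chunked`, `fill_cache`, and `fake_fetch` are hypothetical names used only for illustration; `fake_fetch` stands in for a real batched API call.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def fill_cache(
    wanted: Iterable[str],
    cache: Dict[str, dict],
    fetch_batch: Callable[[List[str]], Dict[str, dict]],
    batch_size: int = 50,
) -> int:
    """Fetch only the keys missing from `cache`; return the number of batch calls."""
    # dict.fromkeys de-duplicates while preserving order
    missing = [k for k in dict.fromkeys(wanted) if k not in cache]
    calls = 0
    for batch in chunked(missing, batch_size):
        cache.update(fetch_batch(batch))
        calls += 1
    return calls


def fake_fetch(ids: List[str]) -> Dict[str, dict]:
    """Stand-in for a real API call: returns one record per id."""
    return {i: {"id": i} for i in ids}


cache = {"a": {"id": "a"}}
calls_first = fill_cache(["a", "b", "c", "b"], cache, fake_fetch, batch_size=2)
calls_second = fill_cache(["a", "b", "c"], cache, fake_fetch, batch_size=2)
print("calls on first run:", calls_first, "| calls on second run:", calls_second)
```

The second run makes no calls because every key is already cached, which is the behaviour the notebook relies on when it is re-executed.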


In [8]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

sp = spotipy.Spotify(
    auth_manager=SpotifyClientCredentials(
        client_id=config.spotify["client_id"],
        client_secret=config.spotify["client_secret"],
    )
)

# Quick ping (no user login required)
res = sp.search(q="test", limit=1)
print("Spotify API OK. Sample track:", res["tracks"]["items"][0]["name"])


Spotify API OK. Sample track: Test & Recognise (Flume Re-work)


In [9]:
from spotipy.exceptions import SpotifyException

AUDIO_CACHE_FP = CACHE_DIR / "audio_features.parquet"

def uri_to_id(uri: str) -> str:
    # spotify:track:<id>
    return uri.split(":")[-1]

AUDIO_FEATURE_KEYS = [
    "danceability","energy","key","loudness","mode","speechiness","acousticness",
    "instrumentalness","liveness","valence","tempo","time_signature"
]

def fetch_track_metadata_and_audio_features(track_uris, batch_size: int = 50) -> pd.DataFrame:
    track_uris = list(track_uris)
    track_ids = [uri_to_id(u) for u in track_uris]

    rows = []
    audio_features_available = True

    for i in tqdm(range(0, len(track_ids), batch_size), desc="Spotify tracks()"):
        batch_ids = track_ids[i:i + batch_size]
        batch_uris = track_uris[i:i + batch_size]

        # Track metadata (works reliably)
        tracks_resp = sp.tracks(batch_ids)
        tracks = tracks_resp.get("tracks", [])

        # Audio features (may be blocked)
        feats_by_id = {}
        if audio_features_available:
            try:
                feats = sp.audio_features(batch_ids)
                feats_by_id = {f["id"]: f for f in feats if f and f.get("id")}
            except SpotifyException as e:
                if getattr(e, "http_status", None) == 403:
                    audio_features_available = False
                    print(
                        "\n⚠️ Spotify /audio-features endpoint blocked (403). "
                        "Continuing WITHOUT audio features.\n"
                    )
                else:
                    raise

        for uri, t in zip(batch_uris, tracks):
            if not t:
                continue

            tid = t.get("id")
            album = t.get("album") or {}
            artists = t.get("artists") or [{}]

            row = {
                "spotify_track_uri": uri,
                "track_id": tid,
                "track_popularity": t.get("popularity"),
                "duration_ms": t.get("duration_ms"),
                "explicit": t.get("explicit"),
                "album_release_date": album.get("release_date"),
                "album_release_date_precision": album.get("release_date_precision"),
                "artist_id_primary": artists[0].get("id"),
                "artist_name_primary": artists[0].get("name"),
            }

            af = feats_by_id.get(tid)
            if af:
                for k in AUDIO_FEATURE_KEYS:
                    row[k] = af.get(k)

            rows.append(row)

    return pd.DataFrame(rows)


In [10]:
# Build / refresh audio + metadata cache

unique_uris = sorted(df["spotify_track_uri"].dropna().unique())

if AUDIO_CACHE_FP.exists():
    audio_cache = pd.read_parquet(AUDIO_CACHE_FP)
    cached_uris = set(audio_cache["spotify_track_uri"].unique())
    missing_uris = [u for u in unique_uris if u not in cached_uris]
    print(f"Audio cache found: {len(cached_uris)} cached, {len(missing_uris)} missing")
else:
    audio_cache = pd.DataFrame()
    missing_uris = unique_uris
    print("No audio cache found; fetching all tracks...")

if missing_uris:
    fetched = fetch_track_metadata_and_audio_features(missing_uris)
    print("Fetched rows:", len(fetched))

    audio_cache = pd.concat([audio_cache, fetched], ignore_index=True)
    audio_cache = audio_cache.drop_duplicates(
        subset=["spotify_track_uri"], keep="last"
    )
    audio_cache.to_parquet(AUDIO_CACHE_FP, index=False)
    print("Wrote:", AUDIO_CACHE_FP)

print("audio_cache rows:", len(audio_cache))
audio_cache.head(3)


Audio cache found: 25516 cached, 0 missing
audio_cache rows: 25516


Unnamed: 0,spotify_track_uri,track_id,track_popularity,duration_ms,explicit,album_release_date,album_release_date_precision,artist_id_primary,artist_name_primary
0,spotify:track:002AzLaJtX4Tyi7Yv0J49w,002AzLaJtX4Tyi7Yv0J49w,0,210378,False,2020-08-11,day,5a8EJtOEbUJDF4RX3mKK02,Woo
1,spotify:track:003FTlCpBTM4eSqYSWPv4H,003FTlCpBTM4eSqYSWPv4H,70,233266,False,2002-10-15,day,3vAaWhdBR38Q02ohXqaNHT,The All-American Rejects
2,spotify:track:003vvx7Niy0yvhvHt4a68B,003vvx7Niy0yvhvHt4a68B,91,222973,False,2004,year,0C0XlULifJtAgn6ZNCW2eu,The Killers


In [11]:
audio_cache[["spotify_track_uri", "duration_ms", "track_popularity"]].isna().mean()

spotify_track_uri    0.0
duration_ms          0.0
track_popularity     0.0
dtype: float64

In [12]:
# Merge enrichment into event-level df

df_enriched = df.merge(audio_cache, on='spotify_track_uri', how='left')

# Listen ratio (more defensible than a fixed 30-second rule)
# Note: duration_ms can be missing for some tracks; handle safely

df_enriched['listen_ratio'] = df_enriched['ms_played'] / df_enriched['duration_ms']
df_enriched['listen_ratio'] = df_enriched['listen_ratio'].replace([np.inf, -np.inf], np.nan)
# Clip extreme ratios (some exports can exceed duration slightly)
df_enriched['listen_ratio'] = df_enriched['listen_ratio'].clip(lower=0, upper=1.5)

# Buckets for narrative + modeling
bins = [-0.01, 0.2, 0.8, 1.5]
labels = ['skip_early', 'partial', 'complete']
df_enriched['listen_bucket'] = pd.cut(df_enriched['listen_ratio'], bins=bins, labels=labels)

print(df_enriched[['seconds_played','duration_ms','listen_ratio','listen_bucket']].head(10))


   seconds_played  duration_ms  listen_ratio listen_bucket
0          21.966       240106      0.091485    skip_early
1         454.489       261453      1.500000      complete
2          59.112       150693      0.392268       partial
3         261.453       261453      1.000000      complete
4          22.923       150693      0.152117    skip_early
5          35.387       261453      0.135347    skip_early
6         160.172       248440      0.644711       partial
7         209.413       209413      1.000000      complete
8          11.331       256653      0.044149    skip_early
9           2.368       203280      0.011649    skip_early
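
The bucket edges are easy to misread, so here is a small sketch of how the clip and the bins from the cell above treat boundary values. The ratios are made up; the bins and labels are the ones used in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up ratios chosen to sit on and around the bucket edges
ratios = pd.Series([0.05, 0.20, 0.21, 0.80, 1.00, 1.74, np.nan])

# Same clip as the notebook: anything above 1.5 is capped at 1.5
clipped = ratios.clip(lower=0, upper=1.5)

# pd.cut intervals are right-inclusive by default: (-0.01, 0.2], (0.2, 0.8], (0.8, 1.5]
buckets = pd.cut(
    clipped,
    bins=[-0.01, 0.2, 0.8, 1.5],
    labels=["skip_early", "partial", "complete"],
)

print(pd.DataFrame({"ratio": ratios, "clipped": clipped, "bucket": buckets}))
```

A ratio of exactly 0.2 counts as `skip_early` and exactly 0.8 as `partial`, while rows with a missing `duration_ms` (NaN ratio) get no bucket at all.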


In [13]:
print("df rows:", len(df))
# duration_ms comes from the API merge, so it lives on df_enriched, not df
print("duration_ms missing rate:", df_enriched["duration_ms"].isna().mean())


df rows: 136946


## 3) Artist genres (cached)

Spotify genres are usually at the **artist** level. We'll:
- collect primary artist IDs from track metadata
- call `artists()` in batches
- cache to `data/cache/artist_genres.json`
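
One design option for the JSON cache, sketched below, is to write to a temporary file and then swap it into place, so an interrupted run cannot leave a half-written cache behind. This is an assumption about what you might want, not what the notebook cell does; `load_json_cache` and `save_json_cache` are hypothetical helpers, and the artist IDs and genres are made up.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Dict, List


def load_json_cache(fp: Path) -> Dict[str, List[str]]:
    """Return the cached mapping, or an empty dict if no cache file exists yet."""
    if not fp.exists():
        return {}
    with open(fp, "r", encoding="utf-8") as f:
        return json.load(f)


def save_json_cache(fp: Path, data: Dict[str, List[str]]) -> None:
    """Write to a temp file in the same folder, then replace the target in one step."""
    fp.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=fp.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII genre names readable in the file
        json.dump(data, f, ensure_ascii=False, indent=2)
    os.replace(tmp_name, fp)


# Demo in a throwaway directory so nothing in the project is touched
with tempfile.TemporaryDirectory() as d:
    demo_fp = Path(d) / "artist_genres.json"
    first_load = load_json_cache(demo_fp)
    save_json_cache(demo_fp, {"artist_1": ["j-pop", "アニメ"], "artist_2": []})
    reloaded = load_json_cache(demo_fp)

print(first_load, reloaded)
```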


In [None]:
GENRE_CACHE_FP = CACHE_DIR / 'artist_genres.json'

def fetch_artist_genres(artist_ids: list[str], batch_size: int = 50) -> dict[str, list[str]]:
    out: dict[str, list[str]] = {}
    artist_ids = [a for a in artist_ids if isinstance(a, str) and a]

    for i in tqdm(range(0, len(artist_ids), batch_size), desc='Spotify artists()'):
        batch = artist_ids[i:i+batch_size]
        resp = sp.artists(batch)
        for a in resp.get('artists', []) or []:
            if not a:
                continue
            out[a.get('id')] = a.get('genres') or []

    return out

# Load cache if exists (context manager closes the file handle)
if GENRE_CACHE_FP.exists():
    with open(GENRE_CACHE_FP, 'r', encoding='utf-8') as f:
        artist_genres = json.load(f)
else:
    artist_genres = {}

artist_ids = sorted(set(df_enriched['artist_id_primary'].dropna().astype(str)))
missing_ids = [a for a in artist_ids if a not in artist_genres]
print('artists total:', len(artist_ids), 'missing:', len(missing_ids))

if missing_ids:
    new_map = fetch_artist_genres(missing_ids)
    artist_genres.update(new_map)
    with open(GENRE_CACHE_FP, 'w', encoding='utf-8') as f:
        json.dump(artist_genres, f, ensure_ascii=False, indent=2)
    print('Wrote:', GENRE_CACHE_FP)

# Add genres to enriched df

def primary_genre(genres: list[str]) -> str:
    if not genres:
        return 'unknown'
    return genres[0]

df_enriched['artist_genres'] = df_enriched['artist_id_primary'].map(artist_genres)
df_enriched['artist_genres'] = df_enriched['artist_genres'].apply(lambda x: x if isinstance(x, list) else [])
df_enriched['primary_genre'] = df_enriched['artist_genres'].apply(primary_genre)

df_enriched[['artist_name','artist_name_primary','primary_genre']].head(10)


## 4) Build modeling tables

### Event-level table
One row per play (keeps context like hour, platform, reason_end).

### Track-level table (Day 2 ready)
One row per track with aggregated behavior + content features.
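
As a rough illustration of the event-to-track rollup, the sketch below aggregates four made-up plays into one row per track. It uses pandas named aggregation, which is a different style from the dict-based `agg` in the cell that follows but yields flat column names directly.

```python
import pandas as pd

# Four made-up plays across two tracks
plays = pd.DataFrame({
    "spotify_track_uri": ["t1", "t1", "t1", "t2"],
    "ts": pd.to_datetime(
        ["2020-01-01", "2020-02-01", "2020-03-01", "2021-06-01"], utc=True
    ),
    "ms_played": [200_000, 10_000, 190_000, 50_000],
    "listen_ratio": [1.0, 0.05, 0.95, 0.25],
    "skipped": [False, True, False, True],
})

# Named aggregation: output_column=(input_column, function)
track_toy = (
    plays.groupby("spotify_track_uri")
    .agg(
        play_count=("ms_played", "count"),
        total_ms_played=("ms_played", "sum"),
        avg_listen_ratio=("listen_ratio", "mean"),
        skip_rate=("skipped", "mean"),  # mean of booleans = share of skipped plays
        first_played_ts=("ts", "min"),
        last_played_ts=("ts", "max"),
    )
    .reset_index()
)

print(track_toy)
```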


In [None]:
# Event-level exports
EVENT_FP = PROCESSED_DIR / 'listens_event_level.parquet'

# Keep an explicit ordered set of columns
event_cols = [
    'ts','date','year','month','year_month','dayofweek','hour',
    'spotify_track_uri','track_id','track_name','artist_name','artist_id_primary','artist_name_primary','album_name',
    'ms_played','seconds_played','duration_ms','listen_ratio','listen_bucket','skipped',
    'primary_genre','artist_genres',
    'danceability','energy','valence','tempo','acousticness','instrumentalness','liveness','speechiness','loudness','mode','key','time_signature',
    'track_popularity','album_release_date',
    'reason_start','reason_end','shuffle','platform','conn_country','source_file'
]
event_cols = [c for c in event_cols if c in df_enriched.columns]

listens_event = df_enriched[event_cols].copy()
listens_event.to_parquet(EVENT_FP, index=False)
print('Wrote:', EVENT_FP)
listens_event.head(3)


In [None]:
# Track-level modeling table
TRACK_FP = PROCESSED_DIR / 'tracks_modeling_table.parquet'

agg = {
    'ts': ['min','max'],
    'ms_played': ['count','sum','mean'],
    'listen_ratio': ['mean','median'],
    'skipped': ['mean'],
    'seconds_played': ['mean','median'],
    'duration_ms': ['first'],
}

track_df = df_enriched.groupby('spotify_track_uri').agg(agg)
track_df.columns = ['_'.join([c for c in col if c]) for col in track_df.columns.values]
track_df = track_df.reset_index()

# Rename for clarity
track_df = track_df.rename(columns={
    'ms_played_count': 'play_count',
    'ms_played_sum': 'total_ms_played',
    'ms_played_mean': 'avg_ms_played',
    'skipped_mean': 'skip_rate',
    'ts_min': 'first_played_ts',
    'ts_max': 'last_played_ts',
    'listen_ratio_mean': 'avg_listen_ratio',
    'listen_ratio_median': 'median_listen_ratio',
    'seconds_played_mean': 'avg_seconds_played',
    'seconds_played_median': 'median_seconds_played',
    'duration_ms_first': 'duration_ms'
})

# Join static metadata/audio features (use audio_cache to avoid duplication)
# duration_ms is omitted here: track_df already has it from the aggregation above,
# and merging it again would produce duration_ms_x / duration_ms_y columns.
meta_cols = [
    'spotify_track_uri','track_id','artist_id_primary','artist_name_primary',
    'track_popularity','explicit','album_release_date','album_release_date_precision',
    'danceability','energy','key','loudness','mode','speechiness','acousticness','instrumentalness','liveness','valence','tempo','time_signature'
]
meta_cols = [c for c in meta_cols if c in audio_cache.columns]
track_meta = audio_cache[meta_cols].drop_duplicates(subset=['spotify_track_uri'])

track_df = track_df.merge(track_meta, on='spotify_track_uri', how='left')

# Add a representative name/album from event df (most common)
name_df = (df_enriched.groupby('spotify_track_uri')
           .agg(track_name=('track_name', lambda s: s.mode().iloc[0] if not s.mode().empty else s.iloc[0]),
                artist_name=('artist_name', lambda s: s.mode().iloc[0] if not s.mode().empty else s.iloc[0]),
                album_name=('album_name', lambda s: s.mode().iloc[0] if not s.mode().empty else s.iloc[0]),
                primary_genre=('primary_genre', lambda s: s.mode().iloc[0] if not s.mode().empty else 'unknown'))
           .reset_index())

track_df = track_df.merge(name_df, on='spotify_track_uri', how='left')

# Simple recency (days since last play) for Day 2 ranking
# Timestamp.utcnow() is deprecated in recent pandas; now(tz='UTC') is the tz-aware equivalent
now = pd.Timestamp.now(tz='UTC')
track_df['days_since_last_play'] = (now - pd.to_datetime(track_df['last_played_ts'], utc=True)).dt.total_seconds() / (3600*24)

# Save
track_df.to_parquet(TRACK_FP, index=False)
print('Wrote:', TRACK_FP)
track_df.sort_values('play_count', ascending=False).head(10)


## 5) EDA: minimum set of story-ready visuals

We will generate four visuals that are strong “blog/dashboard starters”:
1. Listening volume over time (monthly)
2. Top artists by play count
3. Listen ratio distribution
4. Genre share over time (top genres)

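Plots render CJK artist and track names correctly only when `set_cjk_font()` found a CJK-capable font; with the DejaVu Sans fallback, those labels may show as tofu boxes.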
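
The genre-share visual rests on one computation: count plays per month and genre, then normalise each month to sum to 1. A minimal sketch on made-up data follows; it uses `pd.crosstab` for the counts rather than the `pivot_table` call in the plotting cell, and keeps only the top 2 genres so the toy output stays small.

```python
import pandas as pd

# Six made-up plays across two months
toy = pd.DataFrame({
    "year_month": ["2020-01"] * 4 + ["2020-02"] * 2,
    "primary_genre": ["rock", "rock", "pop", "jazz", "pop", "pop"],
})

# Keep the most-played genres and lump everything else into "other"
top = toy["primary_genre"].value_counts().head(2).index
toy["genre_for_plot"] = toy["primary_genre"].where(
    toy["primary_genre"].isin(top), other="other"
)

# Plays per month and genre; missing combinations become 0
counts = pd.crosstab(toy["year_month"], toy["genre_for_plot"])

# Divide each row by its total so every month sums to 1
share = counts.div(counts.sum(axis=1), axis=0)

print(share)
```

Normalising per month means the chart shows taste shifts rather than overall listening volume, which the first visual already covers.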


In [None]:
# 1) Listening volume over time
plt.figure(figsize=(12,4))
monthly = listens_event.groupby('year_month').size().sort_index()
monthly.plot()
plt.title('Listening volume over time (monthly plays)')
plt.xlabel('Year-Month')
plt.ylabel('Plays')
plt.xticks(rotation=45)
plt.tight_layout()
fp = FIG_DIR / '01_listening_volume_monthly.png'
plt.savefig(fp, dpi=200)
print('Saved:', fp)
plt.show()


In [None]:
# 2) Top artists by play count
plt.figure(figsize=(10,5))
top_artists = listens_event['artist_name'].value_counts().head(15)[::-1]
top_artists.plot(kind='barh')
plt.title('Top artists by play count')
plt.xlabel('Plays')
plt.tight_layout()
fp = FIG_DIR / '02_top_artists.png'
plt.savefig(fp, dpi=200)
print('Saved:', fp)
plt.show()


In [None]:
# 3) Listen ratio distribution
plt.figure(figsize=(10,4))
vals = listens_event['listen_ratio'].dropna()
plt.hist(vals, bins=50)
plt.title('Distribution of listen_ratio (ms_played / duration_ms)')
plt.xlabel('listen_ratio')
plt.ylabel('count')
plt.tight_layout()
fp = FIG_DIR / '03_listen_ratio_distribution.png'
plt.savefig(fp, dpi=200)
print('Saved:', fp)
plt.show()

print(listens_event['listen_bucket'].value_counts(dropna=False))


In [None]:
# 4) Genre share over time (top 8 genres by volume)

# Prepare top genres
counts = listens_event['primary_genre'].value_counts()
top_genres = counts.head(8).index.tolist()

d = listens_event.copy()
d['genre_for_plot'] = d['primary_genre'].where(d['primary_genre'].isin(top_genres), other='other')

pivot = (d.pivot_table(index='year_month', columns='genre_for_plot', values='spotify_track_uri', aggfunc='count')
         .fillna(0)
         .sort_index())

share = pivot.div(pivot.sum(axis=1), axis=0)

plt.figure(figsize=(12,5))
plt.stackplot(share.index, share.T.values, labels=share.columns)
plt.title('Genre share over time (top genres)')
plt.xlabel('Year-Month')
plt.ylabel('Share of plays')
plt.xticks(rotation=45)
plt.legend(loc='upper left', bbox_to_anchor=(1.02, 1.0))
plt.tight_layout()
fp = FIG_DIR / '04_genre_share_over_time.png'
plt.savefig(fp, dpi=200)
print('Saved:', fp)
plt.show()


## Day 1 ✅ Exit criteria

If you got this far, Day 1 is complete:
- You have **event-level** and **track-level** modeling tables written to disk
- You have **track metadata + genres cached** (plus audio features, if the `/audio-features` endpoint was available to your app)
- You generated 4 story-ready visuals

Next: Day 2 will build a recommender using the track table (content similarity / TensorFlow Recommenders, TFRS) and evaluate recommendation quality.
