
# Discogs Data Enrichment Notebook

This notebook demonstrates how to work with the Discogs data dump, available at [Discogs Data Dumps](https://discogs-data-dumps.s3.us-west-2.amazonaws.com/index.html). The dump consists of large gzip-compressed XML files (up to 10 GB each). It does not contain every field Discogs exposes: marketplace figures such as price, demand, and supply have to be fetched from the API.

**Workflow:**
- Download and unpack the relevant XML data files.
- Parse the XML to extract release records.
- Use the Discogs API to enrich each record with additional information, such as price, demand (users who want the release), and supply (copies offered for sale).
- Save the enriched results as a CSV file, formatted for easy use in Jupyter notebooks and collaborative workflows.

## Download and unpack the relevant XML data files.


In [0]:
spark.sql("CREATE VOLUME IF NOT EXISTS swi_audience_prd.discogs.raw")

In [0]:
%sh
mkdir -p /Volumes/swi_audience_prd/discogs/raw
ls -lh /Volumes/swi_audience_prd/discogs/raw


In [0]:
%sh
which aria2c || echo "aria2c not found (fine, use wget instead)"


In [0]:
%sh
URL="https://discogs-data-dumps.s3.us-west-2.amazonaws.com/data/2025/discogs_20251201_releases.xml.gz"
OUT="/Volumes/swi_audience_prd/discogs/raw/discogs_20251201_releases.xml.gz"

wget -c -O "$OUT" "$URL"
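
`wget -c` resumes an interrupted download, but it does not prove that the file on disk is complete. A minimal sketch of an integrity check follows, assuming it is acceptable to read the archive once from start to end; the helper name `gzip_is_complete` is hypothetical and not part of the original notebook.

```python
import gzip
import zlib


def gzip_is_complete(path: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the gzip file at `path` decompresses up to its end marker."""
    try:
        with gzip.open(path, "rb") as fh:
            # Read and discard the data chunk by chunk; nothing is kept in memory.
            while fh.read(chunk_size):
                pass
        return True
    except (EOFError, OSError, zlib.error):
        # EOFError: stream ended early; OSError covers bad headers and CRC mismatches.
        return False


# Possible use in the notebook (path taken from the download cell):
# gzip_is_complete("/Volumes/swi_audience_prd/discogs/raw/discogs_20251201_releases.xml.gz")
```

On the command line, `gzip -t` performs the same kind of test.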


In [0]:
spark.conf.set("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024)  # 64MB


In [0]:
%sh
SRC="/Volumes/swi_audience_prd/discogs/raw/discogs_20251201_releases.xml.gz"
DST="/Volumes/swi_audience_prd/discogs/raw/discogs_20251201_releases.xml"

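# Skip decompression if the unpacked file already exists.
# Unpack to a temporary name first, so an interrupted run cannot leave a
# truncated "$DST" that a later run would mistake for a complete file.
if [ ! -f "$DST" ]; then
  gunzip -c "$SRC" > "$DST.tmp" && mv "$DST.tmp" "$DST"
fi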

ls -lh "$SRC" "$DST"
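
Unpacking to disk is what the Spark reader below expects. For quick inspection without Spark, the archive can also be streamed directly. This is a minimal sketch using only the standard library; it assumes a root element named `releases` whose children are `<release id="...">` elements (the `id` attribute corresponds to the `_id` column used later). The function name `iter_release_ids` and the sample data are hypothetical.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def iter_release_ids(fileobj):
    """Yield the id attribute of every <release> element without loading the whole file."""
    context = ET.iterparse(fileobj, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "release":
            yield elem.get("id")
            root.clear()  # drop already processed children so memory stays flat


# Tiny in-memory stand-in for the real dump (assumed structure)
sample = (
    b'<releases>'
    b'<release id="1" status="Accepted"><title>A</title></release>'
    b'<release id="2" status="Accepted"><title>B</title></release>'
    b'</releases>'
)
with gzip.open(io.BytesIO(gzip.compress(sample)), "rb") as fh:
    ids = list(iter_release_ids(fh))
```

With the real file, `gzip.open(path, "rb")` on the `.xml.gz` path would replace the in-memory buffer.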

In [0]:
# More parallelism and fewer oversized partitions
spark.conf.set("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024)  # 64MB
spark.conf.set("spark.sql.shuffle.partitions", 2000)                   # cluster-dependent; err on the high side

# XML parsing is CPU-bound: AQE does not always help, but it can


## Parse the XML to extract release records.

In [0]:
path_xml = "/Volumes/swi_audience_prd/discogs/raw/discogs_20251201_releases.xml"

df_raw = (
    spark.read.format("xml")
    .option("rowTag", "release")
    # Optional, if schema inference causes problems:
    # .option("inferSchema", "false")
    # .option("samplingRatio", 0.01)
    .load(path_xml)
)

# Write straight to the bronze Delta table, without displaying the DataFrame
(df_raw
 .write
 .mode("overwrite")
 .format("delta")
 .saveAsTable("swi_audience_prd.discogs.releases_bronze_delta")
)


In [0]:
from pyspark.sql import functions as F

df_bronze = spark.table("swi_audience_prd.discogs.releases_bronze_delta")

# Keep releases that have at least one format entry named "Vinyl"
df_vinyl = df_bronze.filter(
    F.expr("exists(formats.format, x -> x._name = 'Vinyl')")
)
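
The `exists(...)` expression in the filter above is a higher-order function: it is true when at least one element of the `formats.format` array satisfies the lambda. A plain-Python sketch of the same logic follows, assuming each release is a dict shaped like the parsed XML row; the function name `is_vinyl` and the sample records are hypothetical.

```python
def is_vinyl(release: dict) -> bool:
    """Mirror of exists(formats.format, x -> x._name = 'Vinyl') for a single record."""
    formats = (release.get("formats") or {}).get("format") or []
    return any(fmt.get("_name") == "Vinyl" for fmt in formats)


# Hypothetical records: one release with two formats, one CD-only release
mixed = {"_id": 1, "formats": {"format": [{"_name": "CD"}, {"_name": "Vinyl"}]}}
cd_only = {"_id": 2, "formats": {"format": [{"_name": "CD"}]}}
vinyl_ids = [r["_id"] for r in (mixed, cd_only) if is_vinyl(r)]
```

A release without any format information is dropped in both versions: the Spark expression evaluates to null for it, and the sketch returns `False`.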


In [0]:
df_vinyl.count()

In [0]:
from pyspark.sql import functions as F

df_ids = (
    df_vinyl
    .select(F.col("_id").cast("long").alias("release_id"))
    .dropna()
    .dropDuplicates()
)

df_ids.write.mode("overwrite").format("delta").saveAsTable(
    "swi_audience_prd.discogs.vinyl_release_ids"
)


## Use the Discogs API to enrich each record 
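
Each release in this section costs two API calls (`/releases/{id}` and `/marketplace/stats/{id}`). Discogs documents a limit of 60 requests per minute for authenticated clients; treat that figure as an assumption and check it against the current API documentation. At that rate, 10,000 releases need about 20,000 requests, which is roughly five and a half hours. A minimal sketch of a pacing helper follows; the class name `Pacer` is hypothetical and not part of the original notebook.

```python
import time


class Pacer:
    """Space calls at least 60 / per_minute seconds apart."""

    def __init__(self, per_minute: int, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self.clock = clock      # injectable for testing
        self.sleep = sleep      # injectable for testing
        self.next_slot = None   # earliest time the next call may start

    def wait(self) -> float:
        """Block until the next slot is free and return the seconds slept."""
        now = self.clock()
        if self.next_slot is not None and now < self.next_slot:
            delay = self.next_slot - now
            self.sleep(delay)
            now = self.next_slot
        else:
            delay = 0.0
        self.next_slot = now + self.interval
        return delay


pacer = Pacer(per_minute=60)  # assumed limit, see the note above
```

Calling `pacer.wait()` immediately before every HTTP request paces individual requests rather than loop iterations, so the two calls per release are both counted.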

In [0]:
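import time
import requests

# Personal access token, generated at https://www.discogs.com/de/settings/developers.
# Read it from a secret scope instead of hardcoding it in the notebook;
# the scope and key names below are placeholders.
TOKEN = dbutils.secrets.get(scope="discogs", key="api_token")
HEADERS = {
    "Authorization": f"Discogs token={TOKEN}",
    "User-Agent": "swi-audience-prd-discogs/1.0"
}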

In [0]:
import time
import json
import requests
from pyspark.sql import functions as F
from pyspark.sql.types import (
    StructType, StructField,
    LongType, IntegerType, DoubleType, StringType, BooleanType, TimestampType
)
from datetime import datetime

number_of_releases_to_fetch = 10000  # start small, then increase
CURRENCY = "CHF"

# ------------------------------------------------------------
# Helpers
# ------------------------------------------------------------
def _to_json_str(x):
    if x is None:
        return None
    try:
        return json.dumps(x, ensure_ascii=False)
    except Exception:
        return str(x)

def _to_bool(x):
    if x is None:
        return None
    if isinstance(x, bool):
        return x
    if isinstance(x, str):
        return x.strip().lower() in ("true", "1", "yes", "y")
    if isinstance(x, (int, float)):
        return bool(x)
    return None

def _to_int(x):
    try:
        return int(x) if x is not None else None
    except Exception:
        return None

def _to_float(x):
    try:
        return float(x) if x is not None else None
    except Exception:
        return None

def _to_ts(x):
    if not x:
        return None
    try:
        s = str(x).replace("Z", "+00:00")
        return datetime.fromisoformat(s)
    except Exception:
        return None

# ------------------------------------------------------------
# HTTP helper with retry (now supports params)
# ------------------------------------------------------------
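def get_with_retry(url: str, params: dict = None, max_tries: int = 6, base_sleep: float = 1.2):
    last_err = None
    for attempt in range(1, max_tries + 1):
        try:
            r = requests.get(url, headers=HEADERS, params=params, timeout=30)
        except requests.RequestException as e:
            # Network-level failure (timeout, connection reset): back off and retry
            last_err = e
            time.sleep(base_sleep * attempt)
            continue

        # Rate limit: honour Retry-After when it is a number of seconds
        if r.status_code == 429:
            retry_after = r.headers.get("Retry-After")
            try:
                sleep_s = float(retry_after)
            except (TypeError, ValueError):
                sleep_s = base_sleep * attempt
            last_err = "HTTP 429"
            time.sleep(sleep_s)
            continue

        # Server errors are worth retrying
        if r.status_code >= 500:
            last_err = f"HTTP {r.status_code}"
            time.sleep(base_sleep * attempt)
            continue

        # Other client errors (e.g. 404 for a deleted release) fail immediately
        r.raise_for_status()
        return r

    raise RuntimeError(f"Failed after {max_tries} tries: {url} / last_err={last_err}")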

# ------------------------------------------------------------
# Main fetch: releases + marketplace stats (separate endpoint!)
# ------------------------------------------------------------
def fetch_release_full(release_id: int, curr_abbr: str = "CHF") -> dict:
    # A) Release endpoint
    url_rel = f"https://api.discogs.com/releases/{release_id}"
    j = get_with_retry(url_rel).json()

    community = j.get("community") or {}
    rating = (community.get("rating") or {})
    submitter = (community.get("submitter") or {})

    # B) Marketplace stats endpoint (reliable source for num_for_sale/lowest_price/blocked)
    url_m = f"https://api.discogs.com/marketplace/stats/{release_id}"
    jm = get_with_retry(url_m, params={"curr_abbr": curr_abbr}).json()

    # Marketplace parsing
    num_for_sale = _to_int(jm.get("num_for_sale"))
    blocked_from_sale = _to_bool(jm.get("blocked_from_sale"))

    lowest_value = None
    lowest_currency = None
    lowest = jm.get("lowest_price")
    if isinstance(lowest, dict):
        lowest_value = _to_float(lowest.get("value"))
        lowest_currency = lowest.get("currency")
    elif lowest is not None:
        lowest_value = _to_float(lowest)
        lowest_currency = curr_abbr

    row = {
        # identifiers / basic
        "release_id": _to_int(j.get("id", release_id)),
        "status": j.get("status"),
        "year": _to_int(j.get("year")),
        "data_quality": j.get("data_quality"),
        "resource_url": j.get("resource_url"),
        "uri": j.get("uri"),

        # main metadata
        "title": j.get("title"),
        "country": j.get("country"),
        "released": j.get("released"),
        "released_formatted": j.get("released_formatted"),
        "notes": j.get("notes"),
        "artists_sort": j.get("artists_sort"),

        # master
        "master_id": _to_int(j.get("master_id")),
        "master_url": j.get("master_url"),

        # marketplace (NOW from /marketplace/stats)
        "num_for_sale": num_for_sale,
        "lowest_price_value": lowest_value,
        "lowest_price_currency": lowest_currency,
        "blocked_from_sale": blocked_from_sale,

        # timestamps
        "date_added_ts": _to_ts(j.get("date_added")),
        "date_changed_ts": _to_ts(j.get("date_changed")),

        # community metrics
        "have": _to_int(community.get("have")),
        "want": _to_int(community.get("want")),
        "rating_count": _to_int(rating.get("count")),
        "rating_average": _to_float(rating.get("average")),
        "submitter_username": submitter.get("username"),
        "submitter_resource_url": submitter.get("resource_url"),

        # small “thumb” convenience
        "thumb": j.get("thumb"),

        # nested fields as JSON strings (robust)
        "artists_json": _to_json_str(j.get("artists")),
        "labels_json": _to_json_str(j.get("labels")),
        "companies_json": _to_json_str(j.get("companies")),
        "formats_json": _to_json_str(j.get("formats")),
        "identifiers_json": _to_json_str(j.get("identifiers")),
        "genres_json": _to_json_str(j.get("genres")),
        "styles_json": _to_json_str(j.get("styles")),
        "tracklist_json": _to_json_str(j.get("tracklist")),
        "videos_json": _to_json_str(j.get("videos")),
        "images_json": _to_json_str(j.get("images")),
        "extraartists_json": _to_json_str(j.get("extraartists")),
        "community_contributors_json": _to_json_str(community.get("contributors")),

        "error": None,
    }
    return row

# ------------------------------------------------------------
# Collect release IDs from df_vinyl (column _id)
# ------------------------------------------------------------
N = number_of_releases_to_fetch
ids = (
    df_vinyl
    .select(F.col("_id").cast("long").alias("release_id"))
    .dropna()
    .dropDuplicates()
    .limit(N)
    .toPandas()["release_id"]
    .astype(int)
    .tolist()
)

# ------------------------------------------------------------
# Fetch loop (throttle)
# ------------------------------------------------------------
rows = []
for i, rid in enumerate(ids, start=1):
    try:
        rows.append(fetch_release_full(rid, curr_abbr=CURRENCY))
    except Exception as e:
        rows.append({
            "release_id": int(rid),
            "status": None,
            "year": None,
            "data_quality": None,
            "resource_url": None,
            "uri": None,
            "title": None,
            "country": None,
            "released": None,
            "released_formatted": None,
            "notes": None,
            "artists_sort": None,
            "master_id": None,
            "master_url": None,
            "num_for_sale": None,
            "lowest_price_value": None,
            "lowest_price_currency": None,
            "blocked_from_sale": None,
            "date_added_ts": None,
            "date_changed_ts": None,
            "have": None,
            "want": None,
            "rating_count": None,
            "rating_average": None,
            "submitter_username": None,
            "submitter_resource_url": None,
            "thumb": None,
            "artists_json": None,
            "labels_json": None,
            "companies_json": None,
            "formats_json": None,
            "identifiers_json": None,
            "genres_json": None,
            "styles_json": None,
            "tracklist_json": None,
            "videos_json": None,
            "images_json": None,
            "extraartists_json": None,
            "community_contributors_json": None,
            "error": str(e),
        })

    time.sleep(1.2)  # conservative throttle against the rate limit
    if i % 200 == 0:
        print(f"{i}/{len(ids)} done")

# ------------------------------------------------------------
# Explicit schema
# ------------------------------------------------------------
schema = StructType([
    StructField("release_id", LongType(), False),
    StructField("status", StringType(), True),
    StructField("year", IntegerType(), True),
    StructField("data_quality", StringType(), True),
    StructField("resource_url", StringType(), True),
    StructField("uri", StringType(), True),

    StructField("title", StringType(), True),
    StructField("country", StringType(), True),
    StructField("released", StringType(), True),
    StructField("released_formatted", StringType(), True),
    StructField("notes", StringType(), True),
    StructField("artists_sort", StringType(), True),

    StructField("master_id", LongType(), True),
    StructField("master_url", StringType(), True),

    StructField("num_for_sale", IntegerType(), True),
    StructField("lowest_price_value", DoubleType(), True),
    StructField("lowest_price_currency", StringType(), True),
    StructField("blocked_from_sale", BooleanType(), True),

    StructField("date_added_ts", TimestampType(), True),
    StructField("date_changed_ts", TimestampType(), True),

    StructField("have", IntegerType(), True),
    StructField("want", IntegerType(), True),
    StructField("rating_count", IntegerType(), True),
    StructField("rating_average", DoubleType(), True),
    StructField("submitter_username", StringType(), True),
    StructField("submitter_resource_url", StringType(), True),

    StructField("thumb", StringType(), True),

    StructField("artists_json", StringType(), True),
    StructField("labels_json", StringType(), True),
    StructField("companies_json", StringType(), True),
    StructField("formats_json", StringType(), True),
    StructField("identifiers_json", StringType(), True),
    StructField("genres_json", StringType(), True),
    StructField("styles_json", StringType(), True),
    StructField("tracklist_json", StringType(), True),
    StructField("videos_json", StringType(), True),
    StructField("images_json", StringType(), True),
    StructField("extraartists_json", StringType(), True),
    StructField("community_contributors_json", StringType(), True),

    StructField("error", StringType(), True),
])

df_api = spark.createDataFrame(rows, schema=schema)

display(df_api.limit(10))
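
The error branch of the fetch loop spells out every column with `None` by hand, so it has to be kept in sync with the schema manually. A minimal sketch of a shorter alternative follows, assuming the list of column names is available before the loop runs. The helper name `error_row` is hypothetical, and `FIELD_NAMES` is shortened here for illustration.

```python
# Shortened, hypothetical column list; the notebook's schema has many more fields
FIELD_NAMES = ["release_id", "status", "title", "num_for_sale", "error"]


def error_row(release_id, err, field_names=FIELD_NAMES) -> dict:
    """Build a row in which every column is None except release_id and error."""
    row = dict.fromkeys(field_names)  # all columns start as None
    row["release_id"] = int(release_id)
    row["error"] = str(err)
    return row


row = error_row(42, ValueError("boom"))
```

In the notebook, the full list could come from `schema.fieldNames()`, provided the schema is defined above the loop instead of below it.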


In [0]:

# ------------------------------------------------------------
# Persist the API results
# ------------------------------------------------------------
(df_api.write
 .mode("append")
 .format("delta")
 .saveAsTable("swi_audience_prd.discogs.vinyl_release_api_stats"))
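
Because the table is written in `append` mode, rerunning the fetch cell with the same IDs adds the same releases again. A minimal sketch of a resume step follows: it filters out IDs that were already fetched before the loop starts. The function name `pending_ids` is hypothetical, and how the already fetched IDs are read back (for example from the `release_id` column of the stats table) is an assumption.

```python
def pending_ids(all_ids, done_ids):
    """Return the IDs still to fetch, in their original order and without duplicates."""
    done = set(done_ids)
    seen = set()
    pending = []
    for rid in all_ids:
        if rid not in done and rid not in seen:
            seen.add(rid)
            pending.append(rid)
    return pending


# Hypothetical example: releases 2 and 4 are already in the table
todo = pending_ids([3, 1, 2, 3, 4], done_ids=[2, 4])
```

Rows that were stored with a non-null `error` could be left out of `done_ids`, so that failed releases are retried on the next run.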


## Save the enriched results as a CSV file

In [0]:
df_api = spark.read.table("swi_audience_prd.discogs.vinyl_release_api_stats")

In [0]:
base_dir = "/Volumes/swi_audience_prd/discogs/gold"
tmp_dir  = f"{base_dir}/_tmp_vinyl"
final_file = f"{base_dir}/vinyl.csv"

# The target volume must exist before anything can be written to it
spark.sql("CREATE VOLUME IF NOT EXISTS swi_audience_prd.discogs.gold")

# 1) Write a single CSV part file into a temporary folder
(df_api
 .coalesce(1)
 .write
 .mode("overwrite")
 .option("header", True)
 .option("escape", '"')  # double embedded quotes (JSON columns) instead of backslash-escaping them
 .csv(tmp_dir)
)

# 2) Locate the generated part file
files = dbutils.fs.ls(tmp_dir)
part_file = [f.path for f in files if f.name.startswith("part-")][0]

# 3) Move/rename it to the target file
dbutils.fs.mv(part_file, final_file)

# 4) Clean up the temporary folder
dbutils.fs.rm(tmp_dir, recurse=True)

final_file
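
The columns ending in `_json` hold nested structures serialized as JSON strings (see `_to_json_str`), so anyone reading the export has to decode them again. A minimal sketch using only the standard library follows; the function name `decode_json_columns` and the sample values are hypothetical.

```python
import csv
import io
import json


def decode_json_columns(row: dict, suffix: str = "_json") -> dict:
    """Return a copy of a CSV row with every *_json column parsed; empty cells become None."""
    out = dict(row)
    for key, value in row.items():
        if key.endswith(suffix):
            out[key] = json.loads(value) if value else None
    return out


# Round trip through an in-memory CSV with quotes and a line break inside a field
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["release_id", "notes", "genres_json"])
writer.writeheader()
writer.writerow({"release_id": 1, "notes": 'line one\nsays "hi"',
                 "genres_json": json.dumps(["Rock", "Pop"])})
writer.writerow({"release_id": 2, "notes": "", "genres_json": ""})
buf.seek(0)
decoded = [decode_json_columns(r) for r in csv.DictReader(buf)]
```

The sample also shows why correct quoting matters for this export: `notes` can contain line breaks, and the JSON columns always contain double quotes.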



## Manual Work Note: Export and Upload CSV to GitHub

1. **Find the CSV file in Databricks:**
   - The CSV was saved to:  
     `/Volumes/swi_audience_prd/discogs/gold/`
   - [Download the CSV file from Databricks](https://adb-4119964566130471.11.azuredatabricks.net/explore/data/volumes/swi_audience_prd/discogs/gold?o=4119964566130471&volumePath=%2FVolumes%2Fswi_audience_prd%2Fdiscogs%2Fgold)  
     *(You may need to navigate to the "Data" tab in Databricks, browse to the path above, and use the UI to download the file.)*

2. **Log in to GitHub:**
   - Go to the data folder of the repository, whichever of these two paths exists:
     - https://github.com/Tao-Pi/CAS-Applied-Data-Science/tree/main/Hidden%20Grooves/ZZ%20-%20Data
     - https://github.com/Tao-Pi/CAS-Applied-Data-Science/tree/main/Hidden%20Grooves/data

3. **Upload the CSV file:**
   - Click "Add file" > "Upload files"
   - Drag and drop the downloaded CSV file, or use the file picker.
   - Commit the upload to the repository.