# 01 - Bronze Layer: Raw User Data Ingestion

**Medallion Architecture: BRONZE**
This notebook:
1. Parses MAL XML export (raw, unmodified)
2. Saves to Bronze table: `bronze_mal_raw`

In [0]:
import pandas as pd
from pyspark.sql import functions
from pyspark.sql import SparkSession
from datetime import datetime
import os

Configuration

In [0]:
spark = SparkSession.builder.appName("raw_to_bronze_extraction").getOrCreate()

Step 1: Parse the XML files

In [0]:
source_dir = "/Volumes/anime_project/source/anime_data"
# os.listdir returns entries in arbitrary order, so sort to make the choice of xml_files[0] repeatable
xml_files = sorted(f for f in os.listdir(source_dir) if f.endswith((".xml", ".xml.gz")))
if not xml_files:
    raise FileNotFoundError("No XML or XML.GZ files found in the source directory.")
xml_path = os.path.join(source_dir, xml_files[0])
anime_row_tag = "anime"
myinfo_row_tag = "myinfo"
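
When several exports sit in the source directory, it may be preferable to pick the newest one rather than whichever comes first in the listing. The following is a minimal sketch of that idea, assuming "newest" means the most recent modification time (the notebook does not state which file is wanted). `pick_latest_export` is a hypothetical helper, not part of the original notebook.

```python
import os


def pick_latest_export(source_dir, suffixes=(".xml", ".xml.gz")):
    """Return the path of the most recently modified export in source_dir.

    Hypothetical helper: the selection rule (newest mtime) is an assumption.
    """
    candidates = [
        os.path.join(source_dir, name)
        for name in os.listdir(source_dir)
        if name.endswith(suffixes)  # str.endswith accepts a tuple of suffixes
    ]
    if not candidates:
        raise FileNotFoundError(f"No XML or XML.GZ files found in {source_dir}")
    # Compare by (mtime, path) so files with equal timestamps still resolve deterministically
    return max(candidates, key=lambda path: (os.path.getmtime(path), path))
```

With this helper, `xml_path = pick_latest_export(source_dir)` could stand in for the listing and indexing above.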

In [0]:
# Read the anime entries: one row per <anime> element
df_anime = (
    spark.read
         .format("xml")
         .option("rowTag", anime_row_tag)
         .load(xml_path)
)

# Read the user info: one row per <myinfo> element
df_info = (
    spark.read
         .format("xml")
         .option("rowTag", myinfo_row_tag)
         .load(xml_path)
)
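
To illustrate what the `rowTag` option selects, here is a rough sketch using only the standard library: each element with the given tag becomes one record, and its child elements become the fields. The sample document, its `myanimelist` root tag, and all values are assumptions made up for illustration, and `rows_for_tag` is a hypothetical helper. Unlike the Spark reader, this sketch does not infer types, so every value stays a string.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed sample; a real export carries many more fields
SAMPLE = """<myanimelist>
  <myinfo><user_id>42</user_id><user_name>example_user</user_name></myinfo>
  <anime><series_animedb_id>1</series_animedb_id><my_score>9</my_score></anime>
  <anime><series_animedb_id>5</series_animedb_id><my_score>0</my_score></anime>
</myanimelist>"""


def rows_for_tag(xml_text, row_tag):
    """Return one dict per element named row_tag, keyed by child tag."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: child.text for child in element}
        for element in root.iter(row_tag)
    ]


anime_rows = rows_for_tag(SAMPLE, "anime")    # two records, one per <anime>
info_rows = rows_for_tag(SAMPLE, "myinfo")    # a single record with the user info
```

Reading the same file twice with different row tags, as the cell above does, yields two differently shaped datasets from one source document.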

In [0]:
df_anime.printSchema()

In [0]:
df_anime.schema

Because this data is enriched in a later step, only the essential columns of `df_anime` are kept here:
- series_animedb_id
- my_score

In [0]:
df_anime = df_anime.select("series_animedb_id", "my_score")

Extract the single-row user info

In [0]:
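info_row = df_info.select("user_id", "user_name").first()
# first() returns None on an empty DataFrame, so fail with a clear message
if info_row is None:
    raise ValueError(f"No <{myinfo_row_tag}> element found in {xml_path}")
user_id = info_row["user_id"]
user_name = info_row["user_name"]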

Add the user info columns to every anime record

In [0]:
df_bronze = (
    df_anime.withColumn("user_id", functions.lit(user_id))
            .withColumn("user_name", functions.lit(user_name))
)
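
The effect of the two literal columns can be pictured in plain Python: the same user values are appended to every anime record, so each Bronze row identifies its owner. This is only an illustrative sketch; `attach_user_info` is a hypothetical helper and the sample values are invented.

```python
def attach_user_info(records, user_id, user_name):
    """Return new records with the same user fields added to each one."""
    return [
        {**record, "user_id": user_id, "user_name": user_name}
        for record in records
    ]


# Hypothetical sample values, for illustration only
anime_records = [
    {"series_animedb_id": 1, "my_score": 9},
    {"series_animedb_id": 5, "my_score": 0},
]
bronze_records = attach_user_info(anime_records, user_id=42, user_name="example_user")
```

Each entry in `bronze_records` carries four fields, matching the four columns of `df_bronze`.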

In [0]:
df_bronze.count()

In [0]:
df_bronze.show()

Write to the Bronze Delta table

In [0]:
bronze_path = "/Volumes/anime_project/bronze/anime_data"

In [0]:
df_bronze.write.format("delta").mode("overwrite").save(bronze_path)