# Silver Layer - Clean and Structure the Data

🎯 **Goal**: Make the data clean, consistent, and usable for downstream tasks

🔧 **Tasks a data engineer performs**:
- Read from the bronze layer
- Flatten nested structures (e.g. explode arrays)
- Drop invalid or duplicate rows
- Convert data types to the correct formats (e.g., height as float)
- Rename columns for clarity or consistency
- Store result as a new dataset

### 🧹 This notebook: Silver Layer – Cleaned & Structured Pokémon Data

In this notebook, we:
- Read the raw Pokémon data from the bronze Delta table
- Extract selected fields from the nested `raw_json` column
- Explode arrays such as `types` into individual rows
- Enforce explicit data types through a schema (e.g. height as integer)
- Rename or restructure columns for clarity
- Save the cleaned data as separate Delta tables (`base`, `types`, `abilities`, `stats`, `moves`) under `../data/silver/pokemon`

This layer prepares the data for analysis and ML by enforcing structure and consistency.
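
The type-conversion task from the list at the top (height as float) can be sketched in plain Python. The sketch assumes the PokéAPI units, where `height` is given in decimetres and `weight` in hectograms. The helper names `to_metres` and `to_kilograms` are hypothetical and are not used elsewhere in this notebook, which keeps both fields as integers.

```python
def to_metres(height_dm: int) -> float:
    """Convert a height in decimetres to metres (assumed PokéAPI unit)."""
    return height_dm / 10


def to_kilograms(weight_hg: int) -> float:
    """Convert a weight in hectograms to kilograms (assumed PokéAPI unit)."""
    return weight_hg / 10


# Values for yanma as stored in this dataset: height 12, weight 380.
yanma_height_m = to_metres(12)
yanma_weight_kg = to_kilograms(380)
print(yanma_height_m, yanma_weight_kg)
```

In Spark, a comparable conversion could be written as a column expression such as `(col("height") / 10).alias("height_m")`; dividing with `/` produces a double column.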


# Step 1: Import Spark and set up the session
As we did in the previous notebook.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, from_json
from pyspark.sql.types import (
    ArrayType, BooleanType, IntegerType, StringType, StructField, StructType
)

spark = SparkSession.builder \
    .appName("DemoPipe - Silver Layer") \
    .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.3.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Confirm Spark is running
spark.version


'3.5.5'

# Step 2: Define paths to bronze and silver tables
As we did in the previous notebook.

In [2]:
bronze_path = "../data/bronze/pokemon"
silver_path = "../data/silver/pokemon"

# For Databricks:
# bronze_path = "dbfs:/tmp/bronze/pokemon"
# silver_path = "dbfs:/tmp/silver/pokemon"

# For Microsoft Fabric:
# bronze_path = "Tables/bronze_pokemon"
# silver_path = "Tables/silver_pokemon"

# Step 3: Load the raw bronze Delta table
- First, we use Spark's reader to load the Delta table at `bronze_path` into a Spark DataFrame.
- Then, we print the schema to get a general sense of what we are dealing with.
- Next, we show one row of the DataFrame. Do NOT load all rows into memory unless you are dealing with a small dataset.

In [3]:
bronze_df = spark.read.format("delta").load(bronze_path)
bronze_df.printSchema()
bronze_df.show(1, truncate=False)

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- raw_json: string (nullable = true)




+---+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

As you can see, every row has three columns: `id`, `name`, and `raw_json`, a very long JSON string that is hard to make sense of in this form.
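
One way to get a feel for such a string before writing a schema is to list its top-level keys with Python's standard `json` module. This is a minimal sketch: `sample_raw_json` is a hypothetical, heavily shortened stand-in for a real `raw_json` value, and `summarize_top_level` is an illustrative helper, not part of the pipeline.

```python
import json

# Hypothetical, heavily shortened stand-in for one raw_json value.
sample_raw_json = (
    '{"height": 7, "weight": 69, '
    '"types": [{"slot": 1, "type": {"name": "grass"}}]}'
)


def summarize_top_level(raw: str) -> dict:
    """Map each top-level key to the Python type name of its value."""
    record = json.loads(raw)
    return {key: type(value).__name__ for key, value in record.items()}


summary = summarize_top_level(sample_raw_json)
for key, type_name in summary.items():
    print(f"{key}: {type_name}")
```

In the notebook, a single real string could be pulled to the driver with `bronze_df.select("raw_json").first()[0]` and passed to the same helper.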


# Step 4: Define schema to parse JSON string
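To use the data, we need to parse the JSON string. We define an explicit schema that lists only the fields we want to keep; `from_json` ignores any JSON fields that the schema does not mention.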

In [4]:
parsed_schema = StructType([
    StructField("height", IntegerType(), True),
    StructField("weight", IntegerType(), True),
    StructField("base_experience", IntegerType(), True),
    StructField("types", ArrayType(
        StructType([
            StructField("slot", IntegerType(), True),
            StructField("type", StructType([
                StructField("name", StringType(), True),
                StructField("url", StringType(), True)
            ]), True)
        ])
    ), True),
    StructField("stats", ArrayType(
        StructType([
            StructField("base_stat", IntegerType(), True),
            StructField("effort", IntegerType(), True),
            StructField("stat", StructType([
                StructField("name", StringType(), True),
                StructField("url", StringType(), True)
            ]), True)
        ])
    ), True),
    StructField("abilities", ArrayType(
        StructType([
            StructField("is_hidden", BooleanType(), True),
            StructField("slot", IntegerType(), True),
            StructField("ability", StructType([
                StructField("name", StringType(), True),
                StructField("url", StringType(), True)
            ]), True)
        ])
    ), True),
    StructField("moves", ArrayType(
        StructType([
            StructField("move", StructType([
                StructField("name", StringType(), True),
                StructField("url", StringType(), True)
            ]), True)
        ])
    ), True)
])


# Step 5: Parse raw JSON string to struct

In [5]:
df_parsed = bronze_df.withColumn("parsed", from_json(col("raw_json"), parsed_schema))
df_parsed.printSchema()

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- raw_json: string (nullable = true)
 |-- parsed: struct (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- weight: integer (nullable = true)
 |    |-- base_experience: integer (nullable = true)
 |    |-- types: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- slot: integer (nullable = true)
 |    |    |    |-- type: struct (nullable = true)
 |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |    |-- url: string (nullable = true)
 |    |-- stats: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- base_stat: integer (nullable = true)
 |    |    |    |-- effort: integer (nullable = true)
 |    |    |    |-- stat: struct (nullable = true)
 |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |    |-- url: string (nullable = true)
 |    |-- abilities: array (nulla

# Step 6: Extract fields from struct


In [6]:
base_df = df_parsed.select(
    col("id"),
    col("name"),
    col("parsed.height").alias("height"),
    col("parsed.weight").alias("weight"),
    col("parsed.base_experience").alias("base_experience")
)

types_df = df_parsed.select("id", "name", explode("parsed.types").alias("type")) \
    .select("id", "name", col("type.type.name").alias("type_name"))

abilities_df = df_parsed.select("id", "name", explode("parsed.abilities").alias("ability")) \
    .select("id", "name", col("ability.ability.name").alias("ability_name"), col("ability.is_hidden"))

stats_df = df_parsed.select("id", "name", explode("parsed.stats").alias("stat")) \
    .select("id", "name", col("stat.stat.name").alias("stat_name"), col("stat.base_stat").alias("stat_value"))

moves_df = df_parsed.select("id", "name", explode("parsed.moves").alias("move")) \
    .select("id", "name", col("move.move.name").alias("move_name"))
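
What `explode` does to a single record can be illustrated in plain Python, without Spark. The sketch below mirrors the `types_df` logic for one record; `explode_types` and the `no_types` record are hypothetical and exist only for illustration. Note that Spark's `explode` likewise produces no output row when the array is null or empty.

```python
def explode_types(record: dict) -> list:
    """Return one (id, name, type_name) tuple per entry in record['types'].

    A missing, null, or empty 'types' list yields no rows at all.
    """
    entries = record.get("types") or []
    return [
        (record["id"], record["name"], entry["type"]["name"])
        for entry in entries
    ]


# zapdos has two types, so it becomes two rows.
zapdos = {
    "id": 145,
    "name": "zapdos",
    "types": [
        {"slot": 1, "type": {"name": "electric"}},
        {"slot": 2, "type": {"name": "flying"}},
    ],
}
# Hypothetical record with an empty array: it disappears from the result.
no_types = {"id": 0, "name": "placeholder", "types": []}

zapdos_rows = explode_types(zapdos)
empty_rows = explode_types(no_types)
print(zapdos_rows)
print(empty_rows)
```

If rows with empty or null arrays should be kept with a null value instead, `pyspark.sql.functions.explode_outer` is the variant to look at.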

In [7]:
base_df.printSchema()
types_df.printSchema()
abilities_df.printSchema()
stats_df.printSchema()
moves_df.printSchema()

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- height: integer (nullable = true)
 |-- weight: integer (nullable = true)
 |-- base_experience: integer (nullable = true)

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- type_name: string (nullable = true)

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- ability_name: string (nullable = true)
 |-- is_hidden: boolean (nullable = true)

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- stat_name: string (nullable = true)
 |-- stat_value: integer (nullable = true)

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- move_name: string (nullable = true)



In [8]:
stats_df.show(20, truncate=False)

                                                                                

+---+--------+---------------+----------+
|id |name    |stat_name      |stat_value|
+---+--------+---------------+----------+
|49 |venomoth|hp             |70        |
|49 |venomoth|attack         |65        |
|49 |venomoth|defense        |60        |
|49 |venomoth|special-attack |90        |
|49 |venomoth|special-defense|75        |
|49 |venomoth|speed          |90        |
|50 |diglett |hp             |10        |
|50 |diglett |attack         |55        |
|50 |diglett |defense        |25        |
|50 |diglett |special-attack |35        |
|50 |diglett |special-defense|45        |
|50 |diglett |speed          |95        |
|51 |dugtrio |hp             |35        |
|51 |dugtrio |attack         |100       |
|51 |dugtrio |defense        |50        |
|51 |dugtrio |special-attack |50        |
|51 |dugtrio |special-defense|70        |
|51 |dugtrio |speed          |120       |
|52 |meowth  |hp             |40        |
|52 |meowth  |attack         |45        |
+---+--------+---------------+----------+

# Step 7: Save exploded tables separately for accurate multi-valued representation
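
One reason to keep each exploded array in its own table is that flattening several arrays into a single table multiplies rows. The plain-Python sketch below illustrates the effect with the venomoth values shown in this notebook (3 abilities, 6 stats); it is an illustration of the row-count effect, not code from the pipeline.

```python
from itertools import product

# Values taken from the venomoth rows shown in this notebook.
abilities = ["shield-dust", "tinted-lens", "wonder-skin"]
stats = ["hp", "attack", "defense", "special-attack", "special-defense", "speed"]

# Flattening both arrays into one table pairs every ability with every stat.
combined_rows = list(product(abilities, stats))

print(len(abilities), len(stats), len(combined_rows))
```

With separate tables, venomoth needs 3 + 6 rows; in a single flattened table it would need 3 × 6, and every ability would be repeated once per stat.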

In [9]:
base_df.write.format("delta").mode("overwrite").save(f"{silver_path}/base")
types_df.write.format("delta").mode("overwrite").save(f"{silver_path}/types")
abilities_df.write.format("delta").mode("overwrite").save(f"{silver_path}/abilities")
stats_df.write.format("delta").mode("overwrite").save(f"{silver_path}/stats")
moves_df.write.format("delta").mode("overwrite").save(f"{silver_path}/moves")


# Optional: Read and verify

In [10]:
spark.read.format("delta").load(f"{silver_path}/base").show(5)
spark.read.format("delta").load(f"{silver_path}/types").show(5)
spark.read.format("delta").load(f"{silver_path}/abilities").show(5)
spark.read.format("delta").load(f"{silver_path}/stats").show(5)
spark.read.format("delta").load(f"{silver_path}/moves").show(5)

                                                                                

+---+--------+------+------+---------------+
| id|    name|height|weight|base_experience|
+---+--------+------+------+---------------+
|193|   yanma|    12|   380|             78|
|194|  wooper|     4|    85|             42|
|195|quagsire|    14|   750|            151|
|196|  espeon|     9|   265|            184|
|197| umbreon|    10|   270|            184|
+---+--------+------+------+---------------+
only showing top 5 rows

+---+-------+---------+
| id|   name|type_name|
+---+-------+---------+
|145| zapdos| electric|
|145| zapdos|   flying|
|146|moltres|     fire|
|146|moltres|   flying|
|147|dratini|   dragon|
+---+-------+---------+
only showing top 5 rows

+---+--------+------------+---------+
| id|    name|ability_name|is_hidden|
+---+--------+------------+---------+
| 49|venomoth| shield-dust|    false|
| 49|venomoth| tinted-lens|    false|
| 49|venomoth| wonder-skin|     true|
| 50| diglett|   sand-veil|    false|
| 50| diglett|  arena-trap|    false|
+---+--------+-----------