# Transformation Pipeline - Bronze [![259302-pipeline-management-aws-deployment-copy-icon.png](https://i.postimg.cc/3w6CMc6p/259302-pipeline-management-aws-deployment-copy-icon.png)](https://postimg.cc/zLCRK0qX)

##### Load Data

In [1]:
# Define file paths for the ingested data
placements_path = "Files/2025/01/tblPlacements.csv"
interviews_path = "Files/2025/01/tblInterviews.csv"
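
The two paths above share a `Files/<year>/<month>/<table>.csv` shape. Assuming that layout holds for other months (it is inferred from these two paths only), a small hypothetical helper such as `build_ingest_path` could avoid hard-coding the period:

```python
def build_ingest_path(year, month, table_name):
    """Build a path such as Files/2025/01/tblPlacements.csv.

    Assumes the Files/<yyyy>/<mm>/<table>.csv layout seen in this notebook.
    """
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    # Zero-pad the month so January becomes "01"
    return f"Files/{year:04d}/{month:02d}/{table_name}.csv"


placements_path = build_ingest_path(2025, 1, "tblPlacements")
interviews_path = build_ingest_path(2025, 1, "tblInterviews")
```

The helper reproduces the two literal paths used in this notebook, so the cells that follow work unchanged.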

In [2]:
# Load the data into DataFrames
placements_df = spark.read.format("csv").option("header", "true").load(placements_path)
interviews_df = spark.read.format("csv").option("header", "true").load(interviews_path)
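
Both reads use the same options, so they could be wrapped in one function. The sketch below is an illustration, not part of the original notebook. `read_raw_csv` is a hypothetical name, and `inferSchema` is set to `"false"` only to make Spark's default behavior explicit: every column is loaded as a string.

```python
def read_raw_csv(spark, path):
    """Load a headered CSV file, keeping every column as a string."""
    return (
        spark.read.format("csv")
        .option("header", "true")        # first row holds the column names
        .option("inferSchema", "false")  # Spark's default: no type inference
        .load(path)
    )
```

Usage would look like `placements_df = read_raw_csv(spark, placements_path)`, with `spark` being the session the notebook already provides.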

##### EDA

In [3]:
# Inspect the placement data
print("Placements Schema:")
placements_df.printSchema()
print("Sample Placements Data:")
placements_df.show(5)

Placements Schema:
root
 |-- PlacementId: string (nullable = true)
 |-- Candidate email: string (nullable = true)
 |-- Start Date: string (nullable = true)
 |-- Status: string (nullable = true)
 |-- Marketing Opt Out: string (nullable = true)

Sample Placements Data:
+-----------+-----------------+----------+--------+-----------------+
|PlacementId|  Candidate email|Start Date|  Status|Marketing Opt Out|
+-----------+-----------------+----------+--------+-----------------+
|          1|Francis@gmail.com| 4/11/2024|  Active|            FALSE|
|          2|Jessica@gmail.com| 5/10/2024|  Active|             TRUE|
|          3|Michael@gmail.com| 6/12/2024|  Active|            FALSE|
|          4|  Sarah@gmail.com|  7/1/2024|Inactive|             TRUE|
|          5| Thomas@gmail.com| 8/15/2024|  Active|            FALSE|
+-----------+-----------------+----------+--------+-----------------+
only showing top 5 rows



In [4]:
# Inspect interview data
print("Interviews Schema:")
interviews_df.printSchema()
print("Sample Interviews Data:")
interviews_df.show(5)

Interviews Schema:
root
 |-- InterviewId: string (nullable = true)
 |-- Candidate email: string (nullable = true)
 |-- Interview Date: string (nullable = true)

Sample Interviews Data:
+-----------+--------------------+--------------+
|InterviewId|     Candidate email|Interview Date|
+-----------+--------------------+--------------+
|          1|     Emily@gmail.com|      4/1/2024|
|          2|lisa.white@gmail.com|      4/5/2024|
|          3|eric.smith@gmail.com|     4/12/2024|
|          4|megan.jones@gmail...|     4/15/2024|
|          5|kevin.hill@gmail.com|     4/18/2024|
+-----------+--------------------+--------------+
only showing top 5 rows
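
Both schemas show every column as `string`. The sample rows suggest dates are written month first (for example `8/15/2024`) and that the opt-out flag is `TRUE` or `FALSE`. The plain-Python sketch below shows how those values could be interpreted. The helper names are hypothetical, and the month/day/year format is an assumption based only on the five sample rows. In Spark, the equivalent date conversion would typically use `to_date` with the pattern `M/d/yyyy`, and such casting is often deferred to a later layer so that Bronze stays close to the source.

```python
from datetime import datetime


def parse_us_date(text):
    """Parse a month/day/year string such as '8/15/2024' into a date."""
    return datetime.strptime(text.strip(), "%m/%d/%Y").date()


def parse_flag(text):
    """Map 'TRUE' or 'FALSE' (any case) to a bool and reject anything else."""
    normalized = text.strip().upper()
    if normalized not in ("TRUE", "FALSE"):
        raise ValueError(f"unexpected flag value: {text!r}")
    return normalized == "TRUE"
```

Raising on unexpected values, rather than returning `None`, surfaces malformed rows during exploration.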



##### Bronze Layer Transformations

In [5]:
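# Rename columns to replace spaces with underscores
# (Delta rejects column names containing spaces unless column mapping is enabled)
def rename_columns(df):
    # toDF renames every column in a single projection
    return df.toDF(*[column.replace(" ", "_") for column in df.columns])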

In [6]:
# Rename columns for placements and interviews DataFrames
placements_df = rename_columns(placements_df)
interviews_df = rename_columns(interviews_df)

In [7]:
# Inspect the updated schemas
print("Updated Placements Schema:")
placements_df.printSchema()

print("Updated Interviews Schema:")
interviews_df.printSchema()

Updated Placements Schema:
root
 |-- PlacementId: string (nullable = true)
 |-- Candidate_email: string (nullable = true)
 |-- Start_Date: string (nullable = true)
 |-- Status: string (nullable = true)
 |-- Marketing_Opt_Out: string (nullable = true)

Updated Interviews Schema:
root
 |-- InterviewId: string (nullable = true)
 |-- Candidate_email: string (nullable = true)
 |-- Interview_Date: string (nullable = true)
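
The rename step handles spaces only. Without column mapping, Delta also rejects the characters `,;{}()=`, tabs, and newlines in column names. The sketch below shows a slightly more defensive variant. `sanitize_column_name` and `sanitized_columns` are hypothetical helpers, and the collision check is an added assumption about what is desirable, not something the notebook requires.

```python
import re

# Characters Delta rejects in column names unless column mapping is enabled
_INVALID_CHARS = re.compile(r"[ ,;{}()\n\t=]")


def sanitize_column_name(name):
    """Replace each disallowed character with an underscore."""
    return _INVALID_CHARS.sub("_", name)


def sanitized_columns(columns):
    """Sanitize every name and fail fast if two names become identical."""
    renamed = [sanitize_column_name(c) for c in columns]
    if len(set(renamed)) != len(renamed):
        raise ValueError(f"column names collide after renaming: {renamed}")
    return renamed
```

With a DataFrame this could be applied as `df.toDF(*sanitized_columns(df.columns))`.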



##### Save Data to Bronze Layer

In [8]:
# Save the renamed DataFrames as Delta tables in the Lakehouse Bronze layer
placements_df.write.format("delta").mode("overwrite").save("Tables/Bronze/tblPlacements")
interviews_df.write.format("delta").mode("overwrite").save("Tables/Bronze/tblInterviews")

print("Tables loaded into Bronze Layer Successfully!")

Tables loaded into Bronze Layer Successfully!
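
The success message is printed unconditionally, so it does not confirm what was written. One lightweight check, sketched here under the assumption that a row-count comparison is sufficient, is to read each table back and compare it with the source DataFrame. `verify_row_count` is a hypothetical helper, not part of the original notebook.

```python
def verify_row_count(spark, source_df, table_path):
    """Re-read a Delta table and compare its row count with the source."""
    written_df = spark.read.format("delta").load(table_path)
    expected = source_df.count()
    actual = written_df.count()
    if expected != actual:
        raise ValueError(
            f"{table_path}: expected {expected} rows, found {actual}"
        )
    return actual
```

For example, `verify_row_count(spark, placements_df, "Tables/Bronze/tblPlacements")` would raise if the table on disk did not match the DataFrame that was saved.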
