# Part 1 — Data Ingestion (Bronze Layer)
**Goal:** Ingest raw Yelp review data into the Bronze layer of the Medallion Architecture.  
We will:
1. Download and inspect the dataset.
2. Store the raw data in our storage layer (here a local path, `/data/bronze/yelp`, standing in for cloud storage).
3. Verify ingestion with simple Spark RDD operations.
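
Step 2 can be sketched in plain Python as follows. This is a minimal illustration, not the exact procedure used to produce the files in this notebook: the helper name `land_raw_file` is hypothetical, and the `raw/<YYYY-MM-DD>/` layout is inferred from the paths used in the cells below.

```python
import shutil
from datetime import date
from pathlib import Path
from typing import Optional


def land_raw_file(src: str, bronze_root: str, ingest_date: Optional[date] = None) -> Path:
    """Copy a raw file, unmodified, into <bronze_root>/raw/<YYYY-MM-DD>/.

    Hypothetical helper: the directory layout is an assumption based on
    the paths that appear later in this notebook.
    """
    ingest_date = ingest_date or date.today()
    target_dir = Path(bronze_root) / "raw" / ingest_date.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(src).name
    # Byte-for-byte copy: the Bronze layer keeps the data as delivered.
    shutil.copy2(src, target)
    return target
```

Partitioning by ingestion date keeps each delivery separate, so a later re-download does not overwrite an earlier snapshot.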


## Setup Spark

In [None]:
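# findspark locates a local Spark installation and adds it to sys.path.
# It is optional when PySpark was installed with pip.
try:
    import findspark
    findspark.init()
except ImportError:
    pass

try:
    from pyspark.sql import SparkSession
    pyspark_available = True
except ImportError:
    print("PySpark not available. Install with: pip install pyspark")
    pyspark_available = False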

# Initialize SparkSession and SparkContext
if pyspark_available:
    spark = SparkSession.builder \
        .appName("Yelp_bronze_ingestion") \
        .master("local[*]") \
        .getOrCreate()
    sc = spark.sparkContext
    print(f"Spark version: {spark.version}")
else:
    print("Skipping Spark tasks: PySpark not available")

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/11/13 13:28:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/11/13 13:28:19 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


Spark version: 3.5.0


## Quick sanity check

In [2]:
!head -n 3 /data/bronze/yelp/raw/2025-11-13/yelp_academic_dataset_review.json

{"review_id":"KU_O5udG6zpxOg-VcAEodg","user_id":"mh_-eMZ6K5RLWhZyISBhwA","business_id":"XQfwVwDr-v0ZS3_CbbE5Xw","stars":3.0,"useful":0,"funny":0,"cool":0,"text":"If you decide to eat here, just be aware it is going to take about 2 hours from beginning to end. We have tried it multiple times, because I want to like it! I have been to it's other locations in NJ and never had a bad experience. \n\nThe food is good, but it takes a very long time to come out. The waitstaff is very young, but usually pleasant. We have just had too many experiences where we spent way too long waiting. We usually opt for another diner or restaurant on the weekends, in order to be done quicker.","date":"2018-07-07 22:09:11"}
{"review_id":"BiTunyQ73aT9WBnpR9DZGw","user_id":"OyoGAe7OKpv6SyGZT5g77Q","business_id":"7ATYjTIgM3jUlt4UM3IypQ","stars":5.0,"useful":1,"funny":0,"cool":1,"text":"I've taken a lot of spin classes over the years, and nothing compares to the classes at Body Cycle. From the nice, clean space an
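
The `head` check above only shows the raw text. A slightly stricter variant, sketched here under the assumption that the file is in JSON Lines form (one object per line, as the output suggests), parses the first few lines so that a malformed record fails loudly. The function name `peek_json_lines` is illustrative.

```python
import json
from itertools import islice


def peek_json_lines(path: str, n: int = 3) -> list:
    """Parse the first n lines of a JSON Lines file into Python objects.

    Raises json.JSONDecodeError if one of those lines is not valid JSON,
    which is the point of the check.
    """
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in islice(fh, n)]
```

In the notebook this could be called with the local file path from the cell above (without the `file://` scheme), for example to print `sorted(record.keys())` for each returned record.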

## Load raw data into Spark RDD

In [3]:
raw_path = "file:///data/bronze/yelp/raw/2025-11-13/yelp_academic_dataset_review.json"

if pyspark_available:
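    # textFile yields one string per line. With one JSON record per line,
    # the line count equals the review count unless the file contains
    # blank or malformed lines.
    raw_rdd = sc.textFile(raw_path)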
    print("Raw record count(approx): ", raw_rdd.count())
    print("Sample line: ", raw_rdd.first())

                                                                                

Raw record count(approx):  6990280
Sample line:  {"review_id":"KU_O5udG6zpxOg-VcAEodg","user_id":"mh_-eMZ6K5RLWhZyISBhwA","business_id":"XQfwVwDr-v0ZS3_CbbE5Xw","stars":3.0,"useful":0,"funny":0,"cool":0,"text":"If you decide to eat here, just be aware it is going to take about 2 hours from beginning to end. We have tried it multiple times, because I want to like it! I have been to it's other locations in NJ and never had a bad experience. \n\nThe food is good, but it takes a very long time to come out. The waitstaff is very young, but usually pleasant. We have just had too many experiences where we spent way too long waiting. We usually opt for another diner or restaurant on the weekends, in order to be done quicker.","date":"2018-07-07 22:09:11"}
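
The line count confirms that the file was read, but not that every line is a usable record. One way to extend the verification, sketched below, is to count the lines that fail to parse or that lack an expected field. The field list is taken from the sample record above; `is_valid_review` and `count_invalid` are hypothetical helper names, and the check assumes every review carries all nine fields.

```python
import json

# Field names taken from the sample record printed above (assumed to be
# present in every review).
EXPECTED_FIELDS = {
    "review_id", "user_id", "business_id",
    "stars", "useful", "funny", "cool", "text", "date",
}


def is_valid_review(line: str) -> bool:
    """Return True if the line is a JSON object containing all expected fields."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    return isinstance(record, dict) and EXPECTED_FIELDS.issubset(record)


def count_invalid(rdd) -> int:
    """Count the lines of an RDD of strings that fail is_valid_review."""
    return rdd.filter(lambda line: not is_valid_review(line)).count()
```

With the session above still running, `count_invalid(raw_rdd)` triggers a second full pass over the file. A result of zero would mean the raw line count can be read as the review count.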


## Shutdown

In [4]:
if pyspark_available:
    spark.stop()