## 1. Environment Setup (Linux)

### 1.1. Update Linux Packages

In [None]:
!sudo apt update

### 1.2. Install Python and Packages (pip & venv)

In [None]:
!sudo apt install -y python3 python3-pip python3-venv

### 1.3. Create a Virtual Environment

In [None]:
!python3 -m venv venv

### 1.4. Activate the Virtual Environment

In [None]:
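# Note: each `!` command runs in its own subshell, so this activation does not persist across cells; activate in a terminal before starting Jupyter, or select the venv as the notebook kernel.
!. venv/bin/activate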

### 1.5. Install the Required Python Packages

In [None]:
%pip install -r requirements.txt
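
Because shell commands in a notebook do not change the kernel's interpreter, it can help to confirm which Python the kernel is actually running. The following is a minimal sketch using only the standard library; the helper name `in_virtualenv` is made up for illustration and is not part of the original notebook.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```

If this reports `False`, packages installed with `%pip` go into the kernel's base environment rather than `venv`.
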

## 2. Data Ingestion

### 2.1. Download the Dataset

In [None]:
!wget --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)" \
"https://business.yelp.com/external-assets/files/Yelp-JSON.zip" \
-O Yelp-JSON.zip


### 2.2. Unzip the Downloaded File

In [None]:
# Unzip the downloaded file
!unzip -o Yelp-JSON.zip

In [None]:
# Create the Datasets directory (no error if it already exists)
!mkdir -p Datasets

In [None]:
# Extract the tar archive into the Datasets directory
!tar -xvf Yelp\ JSON/yelp_dataset.tar -C Datasets

In [None]:
# Remove the downloaded archive and leftover extraction folders
!rm -f Yelp-JSON.zip
!rm -rf __MACOSX/
!rm -rf Yelp\ JSON/
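
Before loading anything into Spark, it may be worth checking that extraction produced the files the later cells read. The sketch below assumes the five file names used in the `spark.read.json` calls in this notebook; the helper `missing_files` is hypothetical and only illustrates the check.

```python
from pathlib import Path

# File names taken from the spark.read.json calls later in this notebook.
EXPECTED_FILES = [
    "yelp_academic_dataset_business.json",
    "yelp_academic_dataset_checkin.json",
    "yelp_academic_dataset_review.json",
    "yelp_academic_dataset_tip.json",
    "yelp_academic_dataset_user.json",
]


def missing_files(directory, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `directory`."""
    base = Path(directory)
    return [name for name in expected if not (base / name).is_file()]


missing = missing_files("Datasets")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All expected dataset files are present.")
```
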

In [None]:
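# Optional: run the commands below in a terminal if Java 17 is not yet installed (the Spark cell points JAVA_HOME at it).
# sudo apt update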
# sudo apt install -y openjdk-17-jdk

# # Verify Java 17 is present
# ls -d /usr/lib/jvm/java-17-openjdk-amd64 || echo "Java 17 not found"

# # Force your shell to prefer Java 17 and unset JAVA_TOOL_OPTIONS
# echo 'export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64' >> ~/.bashrc
# echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc
# echo 'unset JAVA_TOOL_OPTIONS' >> ~/.bashrc
# source ~/.bashrc

# # Confirm
# which java
# java -version

In [1]:
# Import and initialize Spark
import os

# Pin Java 17 for Spark
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-17-openjdk-amd64"
os.environ["HADOOP_USER_NAME"] = "root"

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Yelp Data - Analysis") \
    .master("local[*]") \
    .config("spark.hadoop.security.authentication", "simple") \
    .getOrCreate()

print("Spark version:", spark.version)

25/12/12 20:14:24 WARN Utils: Your hostname, codespaces-4d50f8 resolves to a loopback address: 127.0.0.1; using 10.0.0.15 instead (on interface eth0)
25/12/12 20:14:24 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/12/12 20:14:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Spark version: 3.5.1


In [2]:
business_df = spark.read.json("Datasets/yelp_academic_dataset_business.json")
checkin_df = spark.read.json("Datasets/yelp_academic_dataset_checkin.json")
review_df = spark.read.json("Datasets/yelp_academic_dataset_review.json")
tip_df = spark.read.json("Datasets/yelp_academic_dataset_tip.json")
user_df = spark.read.json("Datasets/yelp_academic_dataset_user.json")


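By default `spark.read.json` expects one JSON object per line (JSON Lines). A quick way to confirm a file has that shape, and to see its field names without starting a Spark job, is to parse the first line with the standard library. This is a minimal sketch; `peek_json_lines` is a hypothetical helper, and the path is the one used in the read call above.

```python
import json
from itertools import islice
from pathlib import Path


def peek_json_lines(path, n=1):
    """Return the first n records of a JSON Lines file as Python dicts."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in islice(handle, n)]


# Path taken from the read call above; skipped when the file is not present.
sample_path = Path("Datasets/yelp_academic_dataset_business.json")
if sample_path.is_file():
    first_record = peek_json_lines(sample_path)[0]
    print(sorted(first_record.keys()))
```
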
## 3. Data Cleaning

This section applies the same reproducible cleaning steps to each dataset using Spark DataFrames. The goals are to:
- Remove exact duplicate rows
- Handle missing values with sensible defaults
- Normalize data types (dates, booleans, nested structs)
- Trim whitespace and standardize text where relevant
- Validate schemas and basic constraints
- Cache cleaned DataFrames for downstream use

## 4. Data Transformation

## 5. Data Querying