In [6]:
import pyspark
from pyspark.sql import SparkSession

# Set up the PySpark session.
# A SparkSession is the entry point for interacting with Spark.
spark = SparkSession.builder \
    .appName("Yelp Dataset Exploration") \
    .config("spark.executor.memory", "2g") \
    .config("spark.driver.memory", "2g") \
    .getOrCreate()

# File paths for each JSON file of the Yelp dataset
business_path = "C:/Users/perfect/OneDrive/Desktop/Yelp_Data/yelp_academic_dataset_business.json"
review_path = "C:/Users/perfect/OneDrive/Desktop/Yelp_Data/yelp_academic_dataset_review.json"
user_path = "C:/Users/perfect/OneDrive/Desktop/Yelp_Data/yelp_academic_dataset_user.json"
checkin_path = "C:/Users/perfect/OneDrive/Desktop/Yelp_Data/yelp_academic_dataset_checkin.json"
tip_path = "C:/Users/perfect/OneDrive/Desktop/Yelp_Data/yelp_academic_dataset_tip.json"

# Load each JSON file into a DataFrame with Spark's built-in JSON reader.
# The JSON reader infers the schema from the data by default. 'header' and
# 'inferSchema' are CSV reader options and have no effect on JSON sources,
# so they are not set here.
business_df = spark.read.json(business_path)
review_df = spark.read.json(review_path)
user_df = spark.read.json(user_path)
checkin_df = spark.read.json(checkin_path)
tip_df = spark.read.json(tip_path)
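
Spark's JSON reader expects, by default, one complete JSON object per line (the JSON Lines layout), which is how the Yelp files are laid out. The sketch below uses only Python's standard library and two made-up records (the field values are hypothetical, not taken from the dataset) to show what that layout looks like and how it is parsed line by line. `parse_json_lines` is a helper name I introduce for illustration.

```python
import json

# Two hypothetical records in JSON Lines form: one JSON object per line.
sample_lines = (
    '{"business_id": "b1", "name": "Example Cafe", "stars": 4.5}\n'
    '{"business_id": "b2", "name": "Example Books", "stars": 3.0}\n'
)


def parse_json_lines(text):
    """Parse newline-delimited JSON into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_json_lines(sample_lines)
for record in records:
    print(record["business_id"], record["name"], record["stars"])
```

A file that holds a single JSON array, or pretty-printed objects spanning several lines, would instead need the reader's `multiLine` option set to true.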


In [8]:
# Now let's check the schema of each dataset to better understand the data.
# The `printSchema` method prints the column names, types, and nullability.
print("Schema of Business Dataset:")
business_df.printSchema()

print("\nSchema of Review Dataset:")
review_df.printSchema()

print("\nSchema of User Dataset:")
user_df.printSchema()

print("\nSchema of Check-in Dataset:")
checkin_df.printSchema()

print("\nSchema of Tip Dataset:")
tip_df.printSchema()

Schema of Business Dataset:
root
 |-- address: string (nullable = true)
 |-- attributes: struct (nullable = true)
 |    |-- AcceptsInsurance: string (nullable = true)
 |    |-- AgesAllowed: string (nullable = true)
 |    |-- Alcohol: string (nullable = true)
 |    |-- Ambience: string (nullable = true)
 |    |-- BYOB: string (nullable = true)
 |    |-- BYOBCorkage: string (nullable = true)
 |    |-- BestNights: string (nullable = true)
 |    |-- BikeParking: string (nullable = true)
 |    |-- BusinessAcceptsBitcoin: string (nullable = true)
 |    |-- BusinessAcceptsCreditCards: string (nullable = true)
 |    |-- BusinessParking: string (nullable = true)
 |    |-- ByAppointmentOnly: string (nullable = true)
 |    |-- Caters: string (nullable = true)
 |    |-- CoatCheck: string (nullable = true)
 |    |-- Corkage: string (nullable = true)
 |    |-- DietaryRestrictions: string (nullable = true)
 |    |-- DogsAllowed: string (nullable = true)
 |    |-- DriveThru: string (nullable = true)
 

Great, now that we have looked at the schema, let's display sample data from each dataset.
To do that, we'll use `show` to display the first few rows of each DataFrame.
I could have passed truncate=False to see long values in full, but that would have broken the column alignment and taken up more space, so I have left the long values truncated.
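
The shortened cells in the output further down (values ending in `...`) come from `show`'s default limit of 20 characters per cell. The function below is a plain-Python sketch that mimics that rule as I understand it. `truncate_cell` is a hypothetical helper for illustration, not part of PySpark, and the long input string is made up.

```python
def truncate_cell(value, width=20):
    """Mimic how show() shortens long cell values (illustrative only)."""
    text = str(value)
    if width <= 0 or len(text) <= width:
        return text
    if width < 4:
        return text[:width]
    # Keep width - 3 characters and mark the cut with an ellipsis.
    return text[: width - 3] + "..."


short_value = truncate_cell("Santa Barbara")
long_value = truncate_cell("an example value longer than twenty characters")
print(short_value)
print(long_value)
```

In PySpark itself, `business_df.show(5, truncate=False)` prints full values, and passing `vertical=True` prints each row as a list of column and value pairs, which keeps long values readable without stretching the table.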

In [17]:

print("\nSample Data from Business Dataset:")
business_df.show(5)

print("\nSample Data from Review Dataset:")
review_df.show(5)

print("\nSample Data from User Dataset:")
user_df.show(5)

print("\nSample Data from Check-in Dataset:")
checkin_df.show(5)

print("\nSample Data from Tip Dataset:")
tip_df.show(5)




Sample Data from Business Dataset:
+--------------------+--------------------+--------------------+--------------------+-------------+--------------------+-------+----------+------------+--------------------+-----------+------------+-----+-----+
|             address|          attributes|         business_id|          categories|         city|               hours|is_open|  latitude|   longitude|                name|postal_code|review_count|stars|state|
+--------------------+--------------------+--------------------+--------------------+-------------+--------------------+-------+----------+------------+--------------------+-----------+------------+-----+-----+
|1616 Chapala St, ...|{NULL, NULL, NULL...|Pns2l4eNsfO8kk83d...|Doctors, Traditio...|Santa Barbara|                NULL|      0|34.4266787|-119.7111968|Abby Rappoport, L...|      93101|           7|  5.0|   CA|
|87 Grasso Plaza S...|{NULL, NULL, NULL...|mpf3x-BjTdTEA3yCZ...|Shipping Centers,...|       Affton|{8:0-18:30, 0:0-0...|

Key entities that I observed while exploring this dataset:
-  Business: Information about businesses, such as name, location, categories, etc.
-  Review: User reviews of businesses, including ratings and review text.
-  User: Information about users, such as name, review count, and friends.
-  Check-in: Check-in data for businesses, such as timestamps.
-  Tip: Additional short tips left by users for businesses.
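
To make the relationship between these entities concrete, here is a small plain-Python sketch that links reviews to businesses through a shared key. The records are made up, and the assumption that each review carries a `business_id` field matching the business table is mine. The review schema is not shown above, so check it with `review_df.printSchema()` before relying on it.

```python
from collections import defaultdict

# Hypothetical records; the review fields are assumptions for illustration.
businesses = [
    {"business_id": "b1", "name": "Example Cafe"},
    {"business_id": "b2", "name": "Example Books"},
]
reviews = [
    {"business_id": "b1", "stars": 5},
    {"business_id": "b1", "stars": 3},
    {"business_id": "b2", "stars": 4},
]

# Group review ratings by the business they refer to.
ratings_by_business = defaultdict(list)
for review in reviews:
    ratings_by_business[review["business_id"]].append(review["stars"])

# Attach the average rating to each business that has at least one review.
average_stars = {}
for business in businesses:
    ratings = ratings_by_business.get(business["business_id"], [])
    if ratings:
        average_stars[business["name"]] = sum(ratings) / len(ratings)

print(average_stars)
```

With the Spark DataFrames, the equivalent would be a `join` on the shared key followed by a `groupBy` aggregation.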

