
# Hands-on Project: Spark Pipeline in Databricks

This notebook walks you through building a Spark pipeline on Databricks using the Airbnb NYC listings dataset (`AB_NYC_2019`).

## Step 1: Initialize Spark Session
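First, we need a Spark session. A Databricks notebook already provides one as `spark`, so `getOrCreate()` returns that existing session instead of starting a new one.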


In [None]:

# Initialize Spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Airbnb NYC Data Analysis").getOrCreate()



## Step 2: Load the Dataset into Databricks
We will load the dataset into a Spark DataFrame.


In [None]:

# Load the CSV into a Spark DataFrame
# The file is named AB_NYC_2019; replace this placeholder with the file's full path in your workspace
file_path = "AB_NYC_2019.csv"
df = spark.read.csv(file_path, header=True, inferSchema=True)

# Show the first few rows and schema
# YOUR CODE HERE



## Step 3: Selecting and Renaming Important Columns
To make the dataset more manageable, we'll select a subset of important columns and rename them for clarity.


In [None]:

# Select specific columns and rename them for simplicity
selected_columns_df = df.select(
    df.id.alias("listing_id"),
    df.name.alias("listing_name"),
    df.host_id.alias("host_id"),
    df.neighbourhood_group.alias("borough"),  # borough name, needed for the Manhattan filter in Step 4
    df.neighbourhood.alias("neighborhood"),
    df.price.alias("price_per_night"),
    df.minimum_nights.alias("min_nights"),
    df.number_of_reviews.alias("reviews_count"),
    df.availability_365.alias("availability"),
    df.room_type.alias("room_type")
)

# Show the first few rows of the simplified DataFrame
selected_columns_df.show(5)



## Step 4: Data Exploration and Filtering
We'll filter the dataset to include listings in Manhattan with more than 5 reviews.


In [None]:

# Filter listings in Manhattan with more than 5 reviews
# Store the result in `filtered_df`, which Step 5 uses
# YOUR CODE HERE



## Step 5: Data Transformation
Next, we'll create a new column that categorizes listings into price tiers: Low, Medium, and High.


In [None]:

from pyspark.sql.functions import when

# Convert 'price_per_night' to a numeric type
filtered_df = filtered_df.withColumn("price_per_night", filtered_df["price_per_night"].cast("float"))

# Create price tiers: Low (<$100), Medium ($100-$200), High (>$200)
# Store the result in `transformed_df`, which the next line displays
# YOUR CODE HERE

transformed_df.show(5)



## Step 6: Aggregation and Analysis
Let's calculate the average price per neighborhood in Manhattan and the average number of reviews for each price tier.


In [None]:

# Convert 'reviews_count' to numeric type
# YOUR CODE HERE

# Average price per neighborhood in Manhattan
avg_price_neighbourhood = transformed_df.groupBy("neighborhood").avg("price_per_night")
avg_price_neighbourhood.show()

# Average number of reviews per price tier
# YOUR CODE HERE



## Final Thoughts
You can expand this project by exploring other datasets, adding new transformations, or experimenting with different types of aggregations.
