
# Explanation of Functions Used in the Airbnb Notebook

This notebook explains the key functions used in the Airbnb analysis, demonstrated on a small dataset of people. Each row records a person's salary, department, and years of experience.



## 1. Loading a CSV File into a Spark DataFrame

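The first step in any Spark analysis is to load the data. `spark.read.csv()` reads a CSV file into a Spark DataFrame: `header=True` treats the first row as column names, and `inferSchema=True` asks Spark to detect column types instead of reading every column as a string.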


In [None]:
# List the files under /FileStore (dbutils and display are Databricks-only helpers)
display(dbutils.fs.ls("/FileStore"))

In [None]:

from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("Salary Analysis").getOrCreate()

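# Load the CSV file into a Spark DataFrame.
# Spark resolves paths against DBFS, so use "dbfs:/..." here; the "/dbfs/..." form
# is the local-file mount and is meant for non-Spark file APIs.
file_path = "dbfs:/FileStore/tables/people_salary_data.csv"  # Adjust path if needed
df = spark.read.csv(file_path, header=True, inferSchema=True)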

# Show the first few rows of the DataFrame
df.show()



## 2. Filtering Data

The `filter()` function keeps only the rows that satisfy a condition (`where()` is an alias for it). For example, we can keep only the people whose salary is greater than $60,000.


In [None]:

# Filter rows where salary is greater than 60,000
high_salary_df = df.filter(df.salary > 60000)
high_salary_df.show()



## 3. Adding a New Column

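We can use the `withColumn()` function to add a new column. Here we add a `salary_tier` column that labels each person as Low (below 60,000), Medium (60,000 to 70,000 inclusive), or High (above 70,000), using chained `when()` calls with `otherwise()` as the fallback.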


In [None]:

from pyspark.sql.functions import when

# Add a new column 'salary_tier' based on the salary
df = df.withColumn("salary_tier", 
                   when(df.salary < 60000, "Low")
                   .when((df.salary >= 60000) & (df.salary <= 70000), "Medium")
                   .otherwise("High"))

df.show()



## 4. Aggregating Data

The `groupBy()` function groups rows by one or more columns. Calling an aggregation function such as `avg()` on the grouped data then computes one value per group, here the average salary in each department. The resulting column is named `avg(salary)`.


In [None]:

# Group data by 'department' and calculate the average salary
avg_salary_by_department = df.groupBy("department").avg("salary")
avg_salary_by_department.show()



## 5. Data Cleaning (Handling Null Values)

To clean the data, we can use functions like `fillna()` to replace `null` values with a default value. For example, if the `experience_years` column has missing values, we can fill them with 0. `fillna()` returns a new DataFrame and leaves the original unchanged.


In [None]:

# Fill missing values in 'experience_years' with 0
cleaned_df = df.fillna({'experience_years': 0})
cleaned_df.show()
