# Introduction to PySpark

PySpark is the Python API for Apache Spark, a distributed computing engine for large-scale data processing and analytics. This tutorial walks through the fundamentals of PySpark, from creating a session and a first DataFrame to aggregation, SQL, and data cleaning.

## What You'll Learn
1. Setting Up PySpark
2. Creating Your First DataFrame
3. Basic DataFrame Operations
4. Working with Real Data
5. Data Analysis and Aggregations
6. Using SQL with PySpark
7. Data Cleaning and Preprocessing
8. Best Practices and Tips

## 1. Setting Up PySpark

First, we'll import the necessary libraries and create our SparkSession, which is the entry point for PySpark functionality.

In [None]:
from pyspark.sql import SparkSession
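# Import only the functions used in this tutorial; a wildcard import would
# shadow Python built-ins such as sum, min, max, and round
from pyspark.sql.functions import avg, col, count, when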

# Create a SparkSession
spark = SparkSession.builder \
    .appName('PySpark Tutorial') \
    .getOrCreate()

print('SparkSession created successfully!')

## 2. Creating Your First DataFrame

Let's start by creating a simple DataFrame and exploring basic operations.

In [None]:
# Create a simple DataFrame
data = [
    (1, "John", 25),
    (2, "Alice", 30),
    (3, "Bob", 35)
]

columns = ["id", "name", "age"]
df = spark.createDataFrame(data, columns)

# Display the DataFrame schema and contents
print("DataFrame Schema:")
df.printSchema()
df.show()



## 3. Basic DataFrame Operations

Let's explore some common DataFrame operations that you'll use frequently.

In [None]:
# Select specific columns
df.select("name", "age").show()

# Filter data
df.filter(df.age > 28).show()

# Add a new column
df_with_category = df.withColumn(
    "age_category",
    when(df.age < 30, "Young")
    .when(df.age < 40, "Middle")
    .otherwise("Senior")
)

df_with_category.show()

## 4. Working with Real Data

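Let's load and analyze some real-world data using the World Happiness dataset. The code below expects the CSV file at `world_happiness_data/2023.csv`, relative to the notebook's working directory.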

In [None]:
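# Read the CSV file: header=True takes column names from the first row,
# and inferSchema=True lets Spark detect each column's type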
happiness_df = spark.read.csv('world_happiness_data/2023.csv', header=True, inferSchema=True)

# Display basic information about the dataset
print("Dataset Schema:")
happiness_df.printSchema()

print("\nFirst few rows:")
happiness_df.show(5)

## 5. Data Analysis and Aggregations

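Let's perform some common data analysis operations: summary statistics with `describe()` and per-region aggregates with `groupBy()`. The examples assume the dataset has columns named `Region` and `Happiness Score`.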

In [None]:
# Basic statistics
happiness_df.describe().show()

# Group by operations
happiness_df.groupBy("Region") \
    .agg(
        avg("Happiness Score").alias("avg_happiness"),
        count("*").alias("count")
    ) \
    .orderBy("avg_happiness", ascending=False) \
    .show()

## 6. Using SQL with PySpark

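PySpark also lets you query a DataFrame with SQL once it has been registered as a temporary view. Column names that contain spaces must be wrapped in backticks inside the query.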

In [None]:
# Create a temporary view
happiness_df.createOrReplaceTempView("happiness")

# Run SQL query
spark.sql("""
    SELECT 
        Region,
        AVG(`Happiness Score`) as avg_happiness,
        COUNT(*) as country_count
    FROM happiness
    GROUP BY Region
    ORDER BY avg_happiness DESC
""").show()

## 7. Data Cleaning and Preprocessing

Let's learn some common data cleaning operations.

In [None]:
# Drop rows that contain a null in any column
cleaned_df = happiness_df.dropna()

# Fill missing happiness scores with the column mean
mean_score = happiness_df.select(avg('Happiness Score')).first()[0]
filled_df = happiness_df.fillna({'Happiness Score': mean_score})

# Remove exact duplicate rows
unique_df = happiness_df.dropDuplicates()

# Print counts
print(f"Original count: {happiness_df.count()}")
print(f"After removing nulls: {cleaned_df.count()}")
print(f"After removing duplicates: {unique_df.count()}")

## 8. Best Practices and Tips

Here are some important best practices when working with PySpark:

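1. Call `spark.stop()` when you are finished to release the session's resources
2. Cache only DataFrames that are reused across several actions, and release them with `unpersist()` when they are no longer needed
3. Inspect how Spark will execute a query with `explain()`
4. Filter rows and select only the columns you need as early as possible to reduce the amount of data Spark has to move between stages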

In [None]:
# Cache the DataFrame (lazy: the data is stored when the first action runs)
df_cached = happiness_df.cache()

# Build a query on top of the cached DataFrame
complex_query = df_cached.groupBy("Region") \
    .agg(avg("Happiness Score").alias("avg_happiness")) \
    .filter(col("avg_happiness") > 5.0)

print("Execution Plan:")
complex_query.explain()

# Clean up: release the cached data, then stop the session
df_cached.unpersist()
spark.stop()