# Introduction to PySpark

PySpark is the Python API for Apache Spark, a distributed computing engine designed for large-scale data processing and analytics. This tutorial walks through the fundamentals of PySpark, from creating a session and a first DataFrame to querying, cleaning, and inspecting real data.

## What You'll Learn
1. Setting Up PySpark
2. Creating Your First DataFrame
3. Basic DataFrame Operations
4. Working with Real Data
5. Data Analysis and Aggregations
6. Using SQL with PySpark
7. Data Cleaning and Preprocessing
8. Best Practices and Tips

## 1. Setting Up PySpark

First, we'll import the necessary libraries and create our SparkSession, which is the entry point for PySpark functionality.

In [1]:
from pyspark.sql import SparkSession
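# Import only the functions used below; a wildcard import would shadow
# Python built-ins such as sum, min, and max.
from pyspark.sql.functions import avg, col, count, when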

# Create a SparkSession
spark = SparkSession.builder \
    .appName('PySpark Tutorial') \
    .getOrCreate()

print('SparkSession created successfully!')

SparkSession created successfully!


## 2. Creating Your First DataFrame

Let's start by creating a simple DataFrame and exploring basic operations.

In [2]:
# Create a simple DataFrame
data = [
    (1, "John", 25),
    (2, "Alice", 30),
    (3, "Bob", 35)
]

columns = ["id", "name", "age"]
df = spark.createDataFrame(data, columns)

# Print the schema that Spark inferred from the Python values
print("DataFrame Schema:")
df.printSchema()



DataFrame Schema:
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)



## 3. Basic DataFrame Operations

Let's explore some common DataFrame operations that you'll use frequently.

In [None]:
# Select specific columns
df.select("name", "age").show()

# Filter data
df.filter(df.age > 28).show()

# Add a new column
df_with_category = df.withColumn(
    "age_category",
    when(df.age < 30, "Young")
    .when(df.age < 40, "Middle")
    .otherwise("Senior")
)

df_with_category.show()

## 4. Working with Real Data

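Let's load and analyze some real-world data using the World Happiness dataset. In the call below, `header=True` takes the column names from the first row of the file, and `inferSchema=True` asks Spark to scan the data and guess each column's type instead of reading everything as strings.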

In [4]:
# Read CSV file
happiness_df = spark.read.csv('world_happiness_data/2023.csv', header=True, inferSchema=True)

# Display basic information about the dataset
print("Dataset Schema:")
happiness_df.printSchema()

print("\nFirst few rows:")
happiness_df.show(5)

Dataset Schema:
root
 |-- Country name: string (nullable = true)
 |-- Happiness Rank: integer (nullable = true)
 |-- Happiness score: double (nullable = true)
 |-- Upperwhisker: double (nullable = true)
 |-- Lowerwhisker: double (nullable = true)
 |-- Economy (GDP per Capita)\t: double (nullable = true)
 |-- Social support: double (nullable = true)
 |-- Healthy life expectancy: double (nullable = true)
 |-- Freedom to make life choices: double (nullable = true)
 |-- Generosity: double (nullable = true)
 |-- Perceptions of corruption: double (nullable = true)


First few rows:
+------------+--------------+---------------+------------+------------+--------------------------+--------------+-----------------------+----------------------------+----------+-------------------------+
|Country name|Happiness Rank|Happiness score|Upperwhisker|Lowerwhisker|Economy (GDP per Capita)\t|Social support|Healthy life expectancy|Freedom to make life choices|Generosity|Perceptions of corruption|
+--------

## 5. Data Analysis and Aggregations

Let's perform some common data analysis operations.

In [6]:
# Basic statistics
happiness_df.describe().show()

# Group by operations
happiness_df.groupBy("Country name") \
    .agg(
        avg("Happiness Score").alias("avg_happiness"),
        count("*").alias("count")
    ) \
    .orderBy("avg_happiness", ascending=False) \
    .show()

+-------+------------+-----------------+------------------+------------------+------------------+--------------------------+------------------+-----------------------+----------------------------+-------------------+-------------------------+
|summary|Country name|   Happiness Rank|   Happiness score|      Upperwhisker|      Lowerwhisker|Economy (GDP per Capita)\t|    Social support|Healthy life expectancy|Freedom to make life choices|         Generosity|Perceptions of corruption|
+-------+------------+-----------------+------------------+------------------+------------------+--------------------------+------------------+-----------------------+----------------------------+-------------------+-------------------------+
|  count|         137|              137|               137|               137|               137|                       137|               137|                    136|                         137|                137|                      137|
|   mean|        NULL|      

## 6. Using SQL with PySpark

PySpark also allows you to use SQL queries on DataFrames.

In [11]:
# Create a temporary view
happiness_df.createOrReplaceTempView("happiness")

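# Run SQL query. Column names that contain spaces need backticks.
spark.sql("""
    SELECT
        `Country name`,
        AVG(`Happiness score`) AS avg_happiness,
        COUNT(*) AS country_count
    FROM happiness
    GROUP BY `Country name`
    ORDER BY avg_happiness DESC
""").show()

Note that double quotes do not quote identifiers in Spark SQL's default mode: "Country name" is read as a string literal, so every row lands in one group. Writing the query with "Country name" in place of `Country name` returns a single row:

+------------+-----------------+-------------+
|Country name|    avg_happiness|country_count|
+------------+-----------------+-------------+
|Country name|5.539795620437957|          137|
+------------+-----------------+-------------+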



## 7. Data Cleaning and Preprocessing

Let's learn some common data cleaning operations.

In [12]:
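# Drop rows that contain any missing value
cleaned_df = happiness_df.dropna()

# Fill missing values with the column mean. The describe() output above shows
# that 'Healthy life expectancy' is the column with a missing entry.
mean_life_expectancy = happiness_df.select(avg('Healthy life expectancy')).first()[0]
filled_df = happiness_df.fillna({'Healthy life expectancy': mean_life_expectancy})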

# Remove duplicates
unique_df = happiness_df.dropDuplicates()

# Print counts
print(f"Original count: {happiness_df.count()}")
print(f"After removing nulls: {cleaned_df.count()}")
print(f"After removing duplicates: {unique_df.count()}")

Original count: 137
After removing nulls: 136
After removing duplicates: 137


## 8. Best Practices and Tips

Here are some important best practices when working with PySpark:

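1. Call `spark.stop()` when you are done so the session releases its resources
2. Cache only DataFrames that are reused across several actions, since cached data occupies memory
3. Inspect the execution plan of a query with `explain()` before running it on large data
4. Reduce the data early by filtering rows and selecting only the columns you need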

In [14]:
# Cache the DataFrame so that repeated queries can reuse it from memory
df_cached = happiness_df.cache()

# Build an aggregation query whose execution plan we want to inspect
complex_query = df_cached.groupBy("Country name") \
    .agg(avg("Happiness Score").alias("avg_happiness")) \
    .filter(col("avg_happiness") > 5.0)

print("Execution Plan:")
complex_query.explain()

# Clean up: stop the session and release its resources
spark.stop()

Execution Plan:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Filter (isnotnull(avg_happiness#1848) AND (avg_happiness#1848 > 5.0))
   +- HashAggregate(keys=[Country name#34], functions=[avg(Happiness Score#36)])
      +- Exchange hashpartitioning(Country name#34, 200), ENSURE_REQUIREMENTS, [plan_id=458]
         +- HashAggregate(keys=[Country name#34], functions=[partial_avg(Happiness Score#36)])
            +- InMemoryTableScan [Country name#34, Happiness score#36]
                  +- InMemoryRelation [Country name#34, Happiness Rank#35, Happiness score#36, Upperwhisker#37, Lowerwhisker#38, Economy (GDP per Capita)	#39, Social support#40, Healthy life expectancy#41, Freedom to make life choices#42, Generosity#43, Perceptions of corruption#44], StorageLevel(disk, memory, deserialized, 1 replicas)
                        +- FileScan csv [Country name#34,Happiness Rank#35,Happiness score#36,Upperwhisker#37,Lowerwhisker#38,Economy (GDP per Capita)	#39,Social support#40,Heal