# PySpark Basics and Operations: Beginner Session

Welcome to your first PySpark session! In this notebook, we'll learn what PySpark is, how to use it for real-world data analysis, and how it differs from standard Python.

## What is PySpark?

- **PySpark** is the Python API for Apache Spark.
- **Apache Spark** is a powerful open-source engine for big data processing and analytics.
- It allows you to process large datasets in parallel across multiple computers (distributed computing).
- PySpark is used for data engineering, transformation, machine learning, and more.

## How Does PySpark Differ from Python?

- **Distributed**: PySpark can process data on many machines at once, while normal Python usually runs on a single computer.
- **Lazy Evaluation**: Transformations such as `filter` or `select` only build a plan. Spark runs that plan when an action such as `show()`, `count()`, or a write asks for a result.
- **DataFrames**: PySpark uses DataFrames for table-like data, similar to pandas but designed for big data.
- **Syntax Differences**: Many operations look similar to pandas but are spelled differently. In PySpark you start from a `SparkSession` and work through `DataFrame` methods and column expressions.

**Example:**
- Python (pandas): `df['age'].mean()`
- PySpark: `df.agg({'age': 'mean'}).show()`

## Basic Syntax in PySpark

- Variables: Same as Python.
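- Data Types: Python's built-in types such as `int`, `float`, and `str`.
- If-Else and Loops: Syntax is the same as Python. You will use them less often on the data itself, because DataFrame operations run in parallel on the cluster while a Python loop runs row by row on the driver.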

In [0]:
# Variable declaration
a = 10  # integer
b = 3.14  # float
name = "Alice"  # string

# If-else
if a > 5:
    print("a is greater than 5")
else:
    print("a is 5 or less")

# Loop
for i in range(3):
    print(f"Loop iteration {i}")

## Working with Data in PySpark

Usually, you'll use DataFrames for data processing. Let's see how to load data.

### Loading Data Directly from Databricks Catalog (Database Table)

If your data is already in a Databricks database table, you can load it into a DataFrame with `spark.read.table()`.

In [0]:
# Replace with the fully qualified name of your own table
customer_table = "my_database.my_schema.customers"
customer_df = spark.read.table(customer_table)
customer_df.show()

## DataFrame Operations in PySpark

Let's explore some key operations: joins, aggregations, and sorting.

### 1. Joins in PySpark
- PySpark supports the common join types: inner, left, right, and full outer. Others, such as `left_semi`, `left_anti`, and `cross`, are also available.
- You typically join two DataFrames on a common column.

In [0]:
# Let's create another small DataFrame to demonstrate joins
from pyspark.sql import Row

city_data = [
    Row(location="New York", region="East", time="AM"),
    Row(location="Los Angeles", region="West", time="AM"),
    Row(location="Chicago", region="Midwest", time="AM"),
    Row(location="Houston", region="South", time="AM"),
    Row(location="Miami", region="South", time="AM"),  # Not present in customer_df
]
city_df = spark.createDataFrame(city_data)
city_df.show()

**Inner Join Example:**

In [0]:
inner_join = customer_df.join(city_df, on="location", how="inner")
inner_join.show()

**Left Join Example:**

In [0]:
left_join = customer_df.join(city_df, on="location", how="left")
left_join.show()

**Right Join Example:**

In [0]:
right_join = customer_df.join(city_df, on="location", how="right")
right_join.show()

**Full Outer Join Example:**

In [0]:
outer_join = customer_df.join(city_df, on="location", how="outer")
outer_join.show()

**Write the outer join result to an output table:**

In [0]:
outer_join.write.mode("overwrite").saveAsTable("my_database.my_schema.outer_join_op")

### 2. Aggregations in PySpark
- You can calculate totals, averages, counts, etc. using `groupBy` and aggregation functions.

In [0]:
from pyspark.sql import functions as F

# Total purchase amount by gender
new_df = customer_df.groupBy("gender").agg(F.sum("purchase_amount").alias("total_purchase"))

In [0]:
new_df.write.mode("overwrite").saveAsTable("my_database.my_schema.new_df")

In [0]:
# Average purchase amount
customer_df.agg(F.avg("purchase_amount").alias("avg_purchase")).show()

In [0]:
# Count by location
customer_df.groupBy("location").count().show()

In [0]:
# Minimum and Maximum age
customer_df.agg(F.min("age").alias("min_age"), F.max("age").alias("max_age")).show()

### 3. Sorting in PySpark
- Use `.orderBy()` to sort DataFrames. The order is ascending by default; call `.desc()` on a column for descending order.

In [0]:
# Sort by purchase amount descending
customer_df.orderBy(F.col("purchase_amount").desc()).show()

## Summary: Key Points
- PySpark lets you process big data efficiently using Python.
- Unlike standard Python, it can spread work across multiple machines and only runs a computation when a result is requested.
- Most real work is done via DataFrame APIs instead of Python loops for best performance.
- Loading data from Databricks catalog is easy with `spark.read.table()`.
- Common operations include joins (`join`), aggregations (`groupBy` and `agg`), and sorting (`orderBy`).
