# PySpark Fundamentals - Jupyter Notebook

This notebook explains PySpark fundamentals for beginners.

## 0. Getting Started

### 0.1 Acknowledgments

**Original Source:** Professor Dr. Ali Safari  
**Modified by:** Benjamin Gao (Enhanced structure and explanations for better learning experience)  
**Email:** g103200@gmail.com 

---


### 0.2 How to Use This Notebook

#### 📖 Folding Feature
- Click **▼** to collapse sections
- Click **▶** to expand content

#### 🎯 Learning Path
1. Getting Started → Setup environment
2. Introduction → PySpark basics
3. Creating DataFrames → Three creation methods
4. Basic Operations → Common operations
5. Class Activity → Hands-on practice

---

### 0.3 System Information

**Local Environment (This Notebook):**

| Component | Version |
|-----------|---------|
| System | MacBook Pro (M4 Pro), macOS 26 |
| Python | 3.14.0 |
| PySpark | 4.0.1 |
| Java | OpenJDK 17 |
| Environment | `~/.venvs/pyspark-latest` |

**💡 Alternative: Google Colab**

If you find local environment setup too complicated, consider using **Google Colab**:
- ✅ **Free** - No cost to use
- ✅ **Easy** - No installation needed, runs in browser
- ✅ **Powerful** - Free GPU/TPU access
- ✅ **Pre-configured** - PySpark ready to use with minimal setup

**To use Colab:** Visit [colab.research.google.com](https://colab.research.google.com)

---

### 0.4 Kernel Selection

**Steps:**
1. Click "Select Kernel" (top-right corner)
2. Choose **"Python (PySpark latest)"**
3. Run the setup cell below to configure JAVA_HOME

In [104]:
# ⚙️ IMPORTANT: Set JAVA_HOME for PySpark (Run this FIRST!)
import os

# Set JAVA_HOME to the OpenJDK 17 installed via Homebrew
java_home = "/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home"
os.environ["JAVA_HOME"] = java_home

print(f"✓ JAVA_HOME set to: {java_home}")
print("✓ Ready to run PySpark!")
print(f"✓ Current Python: {os.__file__}")

✓ JAVA_HOME set to: /opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home
✓ Ready to run PySpark!
✓ Current Python: /opt/homebrew/Cellar/python@3.14/3.14.0/Frameworks/Python.framework/Versions/3.14/lib/python3.14/os.py
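
The hard-coded path above is specific to a Homebrew install on Apple Silicon. As a rough sketch (the helper name `find_java_home` and the candidate list are illustrative assumptions, not part of the original notebook), you could probe a few likely locations and use the first one that actually contains `bin/java`:

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Sequence


def find_java_home(candidates: Sequence[str],
                   env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the first directory that contains bin/java, or None.

    Hypothetical helper: an existing JAVA_HOME is checked first,
    then each candidate path in order.
    """
    env = os.environ if env is None else env
    for home in [env.get("JAVA_HOME"), *candidates]:
        if home and (Path(home) / "bin" / "java").exists():
            return home
    return None


# Candidate locations are assumptions; adjust them for your machine
CANDIDATES = [
    "/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home",  # Homebrew, Apple Silicon
    "/usr/lib/jvm/java-17-openjdk-amd64",                              # common on Debian/Ubuntu
]

java_home = find_java_home(CANDIDATES)
if java_home:
    os.environ["JAVA_HOME"] = java_home
    print(f"JAVA_HOME set to: {java_home}")
else:
    print("No Java installation found in the candidate locations")
```

If nothing is found, install Java 17 first and add its location to the candidate list.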


In [105]:
# Setup: install and import
# !pip install pyspark
from pyspark.sql import SparkSession   # Entry point for the DataFrame and SQL APIs
from pyspark.sql.functions import *    # * = import everything (note: this shadows Python built-ins such as sum, min, max, and round)
from pyspark.sql.types import *        # Data types such as StructType, StringType, IntegerType
import pandas as pd                    # Used later to build a Pandas DataFrame
import os

In [106]:
# 🔍 Test: Verify Python environment
import sys
print(f"Python executable: {sys.executable}")
print(f"Python version: {sys.version}")

# Check if PySpark is available
try:
    import pyspark
    print(f"✅ PySpark is installed: {pyspark.__version__}")
except ImportError:
    print("❌ PySpark NOT found! Wrong kernel selected!")

Python executable: /Users/benjamingao/.venvs/pyspark-latest/bin/python
Python version: 3.14.0 (main, Oct  7 2025, 09:34:52) [Clang 17.0.0 (clang-1700.3.19.1)]
✅ PySpark is installed: 4.0.1


In [107]:
# 🔍 Deep Diagnostics: Check Java and Spark Environment
import os
import subprocess
import sys

print("=" * 60)
print("DIAGNOSTIC REPORT")
print("=" * 60)

# 1. Check Python
print(f"\n1️⃣ Python Environment:")
print(f"   Executable: {sys.executable}")
print(f"   Version: {sys.version.split()[0]}")

# 2. Check JAVA_HOME
print(f"\n2️⃣ JAVA_HOME Setting:")
java_home = os.environ.get('JAVA_HOME', 'NOT SET')
print(f"   JAVA_HOME: {java_home}")

# 3. Check Java executable
print(f"\n3️⃣ Java Executable Test:")
java_paths = [
    "/opt/homebrew/opt/openjdk@17/bin/java",
    "/usr/bin/java",
    "java"
]

java_works = False
for java_path in java_paths:
    try:
        result = subprocess.run([java_path, '-version'], 
                              capture_output=True, 
                              text=True, 
                              timeout=5)
        if result.returncode == 0:
            print(f"   ✅ Working Java found: {java_path}")
            print(f"   Version: {result.stderr.splitlines()[0]}")
            java_works = True
            break
    except Exception as e:
        print(f"   ❌ {java_path}: {str(e)[:50]}")

if not java_works:
    print("\n   ⚠️ WARNING: No working Java found!")
    print("   Solution: Install Java 17 with:")
    print("   brew install openjdk@17")

# 4. Check PySpark
print(f"\n4️⃣ PySpark Installation:")
try:
    import pyspark
    print(f"   ✅ PySpark version: {pyspark.__version__}")
    print(f"   Location: {pyspark.__file__}")
except ImportError as e:
    print(f"   ❌ PySpark not found: {e}")

# 5. Check if Spark can start
print(f"\n5️⃣ Spark Startup Test:")
print("   Attempting to start Spark...")

try:
    from pyspark.sql import SparkSession
    
    # Try to create with minimal config
    test_spark = SparkSession.builder \
        .master("local[1]") \
        .appName("DiagnosticTest") \
        .config("spark.ui.enabled", "false") \
        .config("spark.driver.host", "localhost") \
        .getOrCreate()
    
    print(f"   ✅ Spark started successfully!")
    print(f"   Version: {test_spark.version}")
    print(f"   Master: {test_spark.sparkContext.master}")
    
    # Clean up
    test_spark.stop()
    print("   ✅ Spark stopped successfully")
    
except Exception as e:
    print(f"   ❌ Spark failed to start:")
    print(f"   Error: {str(e)[:200]}")
    print("\n   🔧 Recommended Fix:")
    print("   1. Restart Jupyter Kernel")
    print("   2. Run: brew reinstall openjdk@17")
    print("   3. Re-run all cells from the beginning")

print("\n" + "=" * 60)
print("END OF DIAGNOSTIC REPORT")
print("=" * 60)

DIAGNOSTIC REPORT

1️⃣ Python Environment:
   Executable: /Users/benjamingao/.venvs/pyspark-latest/bin/python
   Version: 3.14.0

2️⃣ JAVA_HOME Setting:
   JAVA_HOME: /opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home

3️⃣ Java Executable Test:
   ✅ Working Java found: /opt/homebrew/opt/openjdk@17/bin/java
   Version: openjdk version "17.0.17" 2025-10-21
OpenJDK Runtime Environment Homebrew (build 17.0.17+0)
OpenJDK 64-Bit Server VM Homebrew (build 17.0.17+0, mixed mode, sharing)


4️⃣ PySpark Installation:
   ✅ PySpark version: 4.0.1
   Location: /Users/benjamingao/.venvs/pyspark-latest/lib/python3.14/site-packages/pyspark/__init__.py

5️⃣ Spark Startup Test:
   Attempting to start Spark...
   ✅ Spark started successfully!
   Version: 4.0.1
   Master: local[1]


25/10/25 22:10:14 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


   ✅ Spark stopped successfully

END OF DIAGNOSTIC REPORT


## 1. Introduction to PySpark

### 1.1 What is PySpark?

PySpark is the Python API for Apache Spark, a distributed computing engine for big data. It supports:
- Distributed data processing
- Fault tolerance
- In-memory computations
- Integration with many data sources

In [5]:
# 🔧 Troubleshooting: Stop any existing Spark sessions first
try:
    spark.stop()
    print("✓ Stopped old Spark session")
except NameError:  # 'spark' has not been defined yet
    print("No existing session to stop (this is fine)")

# Re-verify JAVA_HOME is set
import os
java_home = "/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home"
os.environ["JAVA_HOME"] = java_home
print(f"✓ JAVA_HOME: {os.environ['JAVA_HOME']}")

# Create SparkSession with additional configs to prevent connection issues
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySparkFundamentals") \
    .config("spark.sql.repl.eagerEval.enabled", True) \
    .config("spark.driver.host", "localhost") \
    .config("spark.driver.bindAddress", "127.0.0.1") \
    .config("spark.ui.enabled", "false") \
    .getOrCreate()

print(f"✅ Spark version: {spark.version}")
print(f"✅ Spark is running on: {spark.sparkContext.master}")

No existing session to stop (this is fine)
✓ JAVA_HOME: /opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home
✅ Spark version: 4.0.1
✅ Spark is running on: local[1]


### 1.2 Creating a SparkSession

#### 💡 Code Explanation

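**SparkSession** (the class name is case sensitive)

#### Why do you need a session?
A `SparkSession` is the entry point to Spark: it connects your Python code to the Spark engine and holds its configuration. Without one, you cannot create DataFrames or run queries.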

#### 🏦 Analogy: Bank Account

| Code Part | Bank Analogy | Explanation |
|-----------|--------------|-------------|
| `SparkSession` | Opening a bank account | Your identity in the Spark system |
| `.builder` | Walking into the bank | Starting the process |
| `.appName("PySparkFundamentals")` | Naming your account | Give your app a name (it appears in the Spark UI and logs) |
| `.config(...)` | Setting up special features | Configure how Spark behaves |
| `.getOrCreate()` | Get existing OR create new | Smart! Reuses if exists, creates if not |

#### 📋 Step-by-Step Breakdown

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder                                # Step 1: Start building a SparkSession
    .appName("PySparkFundamentals")                     # Step 2: Name your application
    .config("spark.sql.repl.eagerEval.enabled", True)   # Step 3: Add configuration (optional);
                                                        #         this one makes DataFrames auto-display in Jupyter
    .getOrCreate()                                      # Step 4: Create or get the existing session
)
```

#### 🎯 Key Points

- ✅ **Case Sensitive**: Must write `SparkSession` (not `sparksession`)
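- ✅ **One Session**: `getOrCreate()` returns the existing session if there is one, so you normally work with a single session
- ✅ **Required**: Without it, you can't create DataFrames or run Spark SQL
- ✅ **Like Login**: Think of it as "logging in" to Spark: it is your connection to the engine, not a user identity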

**This is called "Method Chaining"**
- Object.Method1().Method2().Method3()
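
To see why chaining works, here is a tiny stand-in written in plain Python. The `SessionBuilder` class is purely hypothetical and is not how Spark implements its builder; it only illustrates that each method returns the builder itself, so the next call can be attached directly.

```python
class SessionBuilder:
    """Hypothetical toy builder that mimics the chaining style of SparkSession.builder."""

    def __init__(self):
        self.options = {}

    def appName(self, name):
        self.options["app.name"] = name
        return self  # returning self is what makes chaining possible

    def config(self, key, value):
        self.options[key] = value
        return self

    def getOrCreate(self):
        # A real builder would return a session; this toy returns the collected options
        return dict(self.options)


settings = (
    SessionBuilder()
    .appName("PySparkFundamentals")
    .config("spark.ui.enabled", "false")
    .getOrCreate()
)
print(settings)
```

Wrapping the chain in parentheses, as done here, avoids the trailing backslashes used in the earlier cells.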

#### 🧪 Quick Practice

> Task: Create a SparkSession with the following requirements:
> - Application name: "StudentDataAnalysis"
> - No extra configuration needed
> - Store it in the variable `spark2`

In [6]:
spark2 = SparkSession.builder.appName("StudentDataAnalysis").getOrCreate()
print(f"Spark2 version: {spark2.version}")

Spark2 version: 4.0.1


25/10/25 13:22:11 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


### 1.3 Understanding `getOrCreate()`

#### 🎓 Why It's Called `getOrCreate()`

| Scenario | Behavior |
|----------|----------|
| **First Call** | **Create** - Creates a new SparkSession |
| **Subsequent Calls** | **Get** - Returns the existing SparkSession (the app name and other startup settings are not changed; only runtime SQL configurations take effect) |

#### ⚠️ Important Discovery

Compare `spark` and `spark2` after running the cells above, and you'll find:
- Both variables have the app name `"PySparkFundamentals"` (the first one created)
- `spark is spark2` returns `True` (they are the same object)

#### 📝 What If You Really Want Multiple Sessions?

**Method 1: Stop the old one, then create a new one** (within the same program)

```python
# Stop the old session
spark.stop()

# Create a new session
spark2 = SparkSession.builder.appName("NewApp").getOrCreate()
```

**Method 2: Run different programs** (recommended)

Different Python scripts can each have their own SparkSession.

#### 🎯 Spark's Design Philosophy

**One Application = One SparkSession = Multiple DataFrames**

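```python
from pyspark.sql import SparkSession

# ✅ Correct approach: One Session, multiple datasets
spark = SparkSession.builder.appName("DataAnalysis").getOrCreate()

# Small illustrative datasets (made up for this example)
student_data = [("Alice", 90), ("Bob", 85)]
sales_data = [("2024-10-24", 1200), ("2024-10-25", 800)]
product_data = [("Laptop", 1200), ("Mouse", 25)]

# Several DataFrames can live side by side in the same session
students_df = spark.createDataFrame(student_data, ["name", "score"])
sales_df = spark.createDataFrame(sales_data, ["date", "amount"])
products_df = spark.createDataFrame(product_data, ["product", "price"])

# Each DataFrame can be analyzed independently
students_df.show()
sales_df.show()
products_df.show()
```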

**You don't need multiple Sessions, you need multiple DataFrames!**

## 2. Creating DataFrames

### 2.1 From a List

In [7]:
data = [("Alice", 34, "Engineer"),
        ("Bob", 45, "Data Scientist"),
        ("Catherine", 29, "Developer"),
        ("David", 52, "Manager")]
columns = ["Name", "Age", "Occupation"]

df = spark.createDataFrame(data, columns)
df.show()

+---------+---+--------------+
|     Name|Age|    Occupation|
+---------+---+--------------+
|    Alice| 34|      Engineer|
|      Bob| 45|Data Scientist|
|Catherine| 29|     Developer|
|    David| 52|       Manager|
+---------+---+--------------+



                                                                                

#### 💡 Code Explanation: Creating DataFrame from a List

#### 🏦 Analogy: Opening a Savings Account with Customer Records

Imagine you work at a bank and need to digitize customer records:

| Code Part | Bank Analogy | What's Happening |
|-----------|--------------|------------------|
| `data = [(...), (...), ...]` | **Customer info cards** | Raw data: each tuple is like a customer card with details |
| `columns = ["Name", "Age", ...]` | **Form field labels** | Column headers: defining what each piece of data means |
| `spark.createDataFrame(data, columns)` | **Create digital database** | Convert paper records into a structured database table |
| `df` | **The customer database** | Your organized, searchable database |
| `df.show()` | **Print the database** | Display the records on screen |

#### 📋 Step-by-Step Breakdown

```python
# Step 1: Prepare raw data (like customer info cards)
data = [
    ("Alice", 34, "Engineer"),      # Card 1
    ("Bob", 45, "Data Scientist"),  # Card 2
    ("Catherine", 29, "Developer"), # Card 3
    ("David", 52, "Manager")        # Card 4
]
# ↑ This is a LIST of TUPLES (each tuple = one row/record)

# Step 2: Define column names (like form field labels)
columns = ["Name", "Age", "Occupation"]
# ↑ This is a LIST of STRINGS (column headers)

# Step 3: Create a DataFrame (digitize the records)
df = spark.createDataFrame(data, columns)
# ↑ spark: Your bank account (SparkSession)
#   .createDataFrame(): The "digitization machine"
#   data: What to digitize
#   columns: How to label each field

# Step 4: Display the database
df.show()
# ↑ Show the organized table on screen
```

### 2.2 From a Pandas DataFrame

#### 🐼 What is Pandas?

**Pandas** is a popular Python library for data analysis (like Excel for Python)

- **Name**: Derived from "panel data"; often also read as "Python Data Analysis"
- **Use Case**: Working with small-to-medium datasets (fits in your computer's memory)
- **Key Object**: `DataFrame` - a table with rows and columns (like an Excel spreadsheet)

#### 💡 Why Convert Pandas → PySpark?

You might have data in Pandas but want to:
- Scale up to larger datasets
- Use Spark's distributed processing
- Integrate with existing Spark pipelines

**Good News**: PySpark can easily convert Pandas DataFrames!

In [8]:
pandas_df = pd.DataFrame({
    "Product": ["Laptop", "Mouse", "Keyboard", "Monitor"],
    "Price": [1200, 25, 80, 300],
    "Quantity": [5, 20, 15, 8]
})

products_df = spark.createDataFrame(pandas_df)
products_df.show()

+--------+-----+--------+
| Product|Price|Quantity|
+--------+-----+--------+
|  Laptop| 1200|       5|
|   Mouse|   25|      20|
|Keyboard|   80|      15|
| Monitor|  300|       8|
+--------+-----+--------+



### 2.3 Reading from CSV

#### 📄 What is CSV?

**CSV = Comma-Separated Values**

- A simple text file format for storing tabular data
- Each line = one row
- Commas separate columns
- **One of the most common** formats for exchanging tabular data between systems

In [17]:
csv_data = """id,name,value
1,Alice,100
2,Bob,200
3,Charlie,150
4,Diana,300"""

with open("sample_data.csv", "w") as f:
    f.write(csv_data)

csv_df = spark.read.csv("sample_data.csv", header=True, inferSchema=True)
csv_df.show()

+---+-------+-----+
| id|   name|value|
+---+-------+-----+
|  1|  Alice|  100|
|  2|    Bob|  200|
|  3|Charlie|  150|
|  4|  Diana|  300|
+---+-------+-----+



#### 🔍 Detailed Code Walkthrough

Let me break down this code line by line:

##### **Step 1: Create CSV Data (Lines 1-5)**

```python
csv_data = """id,name,value
1,Alice,100
2,Bob,200
3,Charlie,150
4,Diana,300"""
```

**What's happening:**
- `csv_data` = a **variable** storing text
- `"""..."""` = **triple quotes** (allows multi-line strings)
- Content = CSV format data (comma-separated values)

**Structure:**
```
Line 1: id,name,value        ← Header row (column names)
Line 2: 1,Alice,100          ← Data row 1
Line 3: 2,Bob,200            ← Data row 2
Line 4: 3,Charlie,150        ← Data row 3
Line 5: 4,Diana,300          ← Data row 4
```

**Analogy:** Like writing customer records on a piece of paper

##### **Step 2: Create a Physical File (Lines 7-8)**

```python
with open("sample_data.csv", "w") as f:
    f.write(csv_data)
```

**Breaking it down:**

| Part | Meaning | Analogy |
|------|---------|---------|
| `with open(...)` | Context manager (auto-closes file) | "Use this file, then clean up automatically" |
| `"sample_data.csv"` | Filename to create | Name of the file on your computer |
| `"w"` | Write mode | "Create new file or overwrite existing one" |
| `as f:` | Give it a nickname `f` | Short name for the file object |
| `f.write(csv_data)` | Write the text to file | Copy the text into the file |

**What happens:**
1. Creates (or overwrites) a file named `sample_data.csv`
2. Writes the CSV text into it
3. Automatically closes the file when done

**Analogy:** Taking your paper records and putting them in a filing cabinet

**Result:** You now have a real CSV file on your computer:
```
📁 Your Computer
  └─ sample_data.csv  ← This file now exists!
```

##### **Step 3: Read CSV into PySpark (Line 10)**

```python
csv_df = spark.read.csv("sample_data.csv", header=True, inferSchema=True)
```

**Breaking it down:**

| Part | What It Does |
|------|--------------|
| `csv_df =` | Store the result in variable `csv_df` |
| `spark` | Your SparkSession (created earlier) |
| `.read` | Access the DataFrameReader |
| `.csv(...)` | Read a CSV file |
| `"sample_data.csv"` | The file to read |
| `header=True` | First row is column names |
| `inferSchema=True` | Auto-detect data types |

**Process:**
```
Step 1: spark.read
        ↓
        Access Spark's file reader

Step 2: .csv("sample_data.csv")
        ↓
        Read the CSV file

Step 3: header=True
        ↓
        Use first row as column names
        (id, name, value)

Step 4: inferSchema=True
        ↓
        Figure out data types automatically
        - id: integer
        - name: string
        - value: integer

Step 5: Return DataFrame → csv_df
```

**Analogy:** Scanning paper documents and creating a digital database

##### **Step 4: Display the DataFrame (Line 11)**

```python
csv_df.show()
```

**What it does:** Prints the DataFrame in a table format

**Output:**
```
+---+-------+-----+
| id|   name|value|
+---+-------+-----+
|  1|  Alice|  100|
|  2|    Bob|  200|
|  3|Charlie|  150|
|  4|  Diana|  300|
+---+-------+-----+
```

**Analogy:** Printing a report to see your database

##### 🎯 Complete Flow Diagram

```
┌─────────────────────────────────────┐
│ Step 1: Create Text Data           │
│ csv_data = """id,name,value..."""  │
└─────────────┬───────────────────────┘
              │
              ↓
┌─────────────────────────────────────┐
│ Step 2: Write to File               │
│ with open("sample_data.csv", "w"):  │
│     f.write(csv_data)               │
└─────────────┬───────────────────────┘
              │
              ↓
        📁 sample_data.csv
        (File on disk)
              │
              ↓
┌─────────────────────────────────────┐
│ Step 3: Read File into Spark        │
│ csv_df = spark.read.csv(...)        │
│ - header=True: Use first row        │
│ - inferSchema=True: Detect types    │
└─────────────┬───────────────────────┘
              │
              ↓
┌─────────────────────────────────────┐
│ csv_df (PySpark DataFrame)          │
│ +---+-------+-----+                 │
│ | id|   name|value|                 │
│ +---+-------+-----+                 │
│ |  1|  Alice|  100|                 │
│ +---+-------+-----+                 │
└─────────────┬───────────────────────┘
              │
              ↓
┌─────────────────────────────────────┐
│ Step 4: Display                     │
│ csv_df.show()                       │
└─────────────────────────────────────┘
```

##### 🤔 Why Two Steps (Create File + Read File)?

**Question:** Why not read data directly from `csv_data` string?

**Answer:** This example demonstrates the **real-world workflow**:
1. In reality, CSV files already exist (from other systems)
2. You just need to **read them** into Spark

**Real-world scenario:**
```python
# You DON'T create the file, it already exists!
# Just read it:
df = spark.read.csv("sales_data_2024.csv", header=True, inferSchema=True)
```

This code creates a file just for **demonstration purposes** so you can practice reading CSVs!
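
For completeness: if the data really does exist only as a Python string, one option is to parse it with the standard library and hand the rows to `createDataFrame`. This is a sketch of an alternative, not what the notebook does; the type conversions are written by hand because the `csv` module returns every field as text.

```python
import csv
import io

csv_data = """id,name,value
1,Alice,100
2,Bob,200
3,Charlie,150
4,Diana,300"""

reader = csv.reader(io.StringIO(csv_data))
header = next(reader)  # first line holds the column names

# Convert the numeric fields by hand; csv.reader yields strings only
rows = [(int(id_), name, int(value)) for id_, name, value in reader]

print(header)
print(rows)

# With an active SparkSession named `spark`, the rows could then be loaded with:
# df = spark.createDataFrame(rows, header)
```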

##### 💡 Key Takeaways

1. **CSV file** = text file with comma-separated values
2. **`with open()`** = Python's way to create/write files
3. **`spark.read.csv()`** = Spark's way to read CSV into DataFrame
4. **`header=True`** = Treat first row as column names
5. **`inferSchema=True`** = Auto-detect data types
6. **`.show()`** = Display the DataFrame

##### 📊 Common Big Data CSV Scenarios

**Scenario 1: Web Server Logs**
```csv
timestamp,ip_address,url,status_code
2024-10-24 10:30:00,192.168.1.1,/home,200
2024-10-24 10:31:15,192.168.1.2,/login,404
```

**Scenario 2: E-commerce Transactions**
```csv
order_id,customer_id,product,amount,date
12345,C001,Laptop,1200,2024-10-24
12346,C002,Mouse,25,2024-10-24
```

**Scenario 3: IoT Sensor Data**
```csv
sensor_id,temperature,humidity,timestamp
S001,22.5,65,2024-10-24 10:00:00
S002,23.1,62,2024-10-24 10:00:01
```

##### 🎓 CSV Best Practices

✅ **DO**:
- Always use `header=True` if your CSV has headers
- Use `inferSchema=True` for quick exploration; it needs an extra pass over the data, so consider an explicit schema for large files
- Clean data before loading (remove extra spaces)

❌ **DON'T**:
- Assume data is clean (always check for spaces, nulls)
- Split huge CSVs by hand before loading (Spark partitions large uncompressed files automatically when reading)
- Forget to handle special characters in data

##### ⚡ Advanced CSV Options

```python
# Handle spaces around values
df = spark.read.csv("data.csv", 
    header=True,
    inferSchema=True,
    ignoreLeadingWhiteSpace=True,   # Remove leading spaces
    ignoreTrailingWhiteSpace=True   # Remove trailing spaces
)

# Different delimiter (tab-separated)
df = spark.read.csv("data.tsv", sep="\t", header=True)

# Handle missing values
df = spark.read.csv("data.csv", header=True, nullValue="N/A")

# Custom quote character
df = spark.read.csv("data.csv", header=True, quote="'")
```

##### 🔄 Real-World CSV Loading

**In production, you'll read files from:**

```python
# Local file
df = spark.read.csv("file:///path/to/data.csv", header=True, inferSchema=True)

# HDFS (Hadoop Distributed File System)
df = spark.read.csv("hdfs://namenode:9000/data/sales.csv", header=True)

# AWS S3 (open-source Spark builds typically need the s3a:// scheme; s3:// is used on Amazon EMR)
df = spark.read.csv("s3://my-bucket/data/logs.csv", header=True)

# Azure Blob Storage
df = spark.read.csv("wasbs://container@account.blob.core.windows.net/data.csv", header=True)

# Google Cloud Storage
df = spark.read.csv("gs://my-bucket/data/users.csv", header=True)
```

##### 🎯 Key Options Explained

**`header=True`**
- Treats the first line as column names
- Without this, first line would be treated as data
- Example:
  ```
  WITH header=True:     WITHOUT header=True:
  +---+-------+-----+   +---+-------+-----+
  | id|   name|value|   |_c0|    _c1|  _c2|
  +---+-------+-----+   +---+-------+-----+
  |  1|  Alice|  100|   | id|   name|value|
  |  2|    Bob|  200|   |  1|  Alice|  100|
  +---+-------+-----+   +---+-------+-----+
  ```

**`inferSchema=True`**
- Automatically detects data types (int, string, double, etc.)
- Without this, everything is treated as string
- Example:
  ```
  WITH inferSchema=True:     WITHOUT inferSchema=True:
  id: integer               id: string
  name: string              name: string
  value: integer            value: string
  ```

##### 📋 Complete Code Breakdown

```python
# Step 1: Create CSV data as a string
csv_data = """id,name,value
1,Alice,100
2,Bob,200
3,Charlie,150
4,Diana,300"""
# ↑ Triple quotes allow multi-line strings
#   First line: column names (header)
#   Other lines: data rows

# Step 2: Write CSV data to a file (creates "sample_data.csv")
with open("sample_data.csv", "w") as f:
    f.write(csv_data)
# ↑ "w" = write mode (creates or overwrites file)
#   This creates a real CSV file on your computer

# Step 3: Read CSV file into PySpark DataFrame
csv_df = spark.read.csv("sample_data.csv", header=True, inferSchema=True)
#         ↑      ↑                         ↑            ↑
#         |      |                         |            Auto-detect data types
#         |      |                         First row is header
#         |      Read CSV file
#         Spark session

# Step 4: Display the DataFrame
csv_df.show()
```

##### 🏦 Analogy: Importing Bank Records from a File Cabinet

Imagine you have customer records stored in a text file (CSV) and want to digitize them:

| Code Part | Bank Analogy | What's Happening |
|-----------|--------------|------------------|
| `csv_data = """..."""` | **Text file content** | The raw CSV data as a multi-line string |
| `with open(...) as f:` | **Create a physical file** | Write the CSV text to a real file on disk |
| `spark.read.csv(...)` | **Import file into database** | Read CSV file into a PySpark DataFrame |
| `header=True` | **First row is column names** | Tells Spark the first line contains headers |
| `inferSchema=True` | **Auto-detect data types** | Spark figures out which columns are numbers, strings, etc. |

#### 💡 Code Explanation Summary

This section provides a complete breakdown of how to read CSV files in PySpark.

## 3. Basic DataFrame Operations

### 3.1 Show, Schema, Columns, Describe

In [9]:
df.show()              # Display the rows
df.printSchema()       # Column names and types
print(df.columns)      # Column names as a Python list
df.describe().show()   # Summary statistics for each column


+---------+---+--------------+
|     Name|Age|    Occupation|
+---------+---+--------------+
|    Alice| 34|      Engineer|
|      Bob| 45|Data Scientist|
|Catherine| 29|     Developer|
|    David| 52|       Manager|
+---------+---+--------------+

root
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)
 |-- Occupation: string (nullable = true)

['Name', 'Age', 'Occupation']
+-------+-----+------------------+--------------+
|summary| Name|               Age|    Occupation|
+-------+-----+------------------+--------------+
|  count|    4|                 4|             4|
|   mean| NULL|              40.0|          NULL|
| stddev| NULL|10.424330514074594|          NULL|
|    min|Alice|                29|Data Scientist|
|    max|David|                52|       Manager|
+-------+-----+------------------+--------------+


### 3.2 Selecting and Filtering

In [10]:
df.select("Name", "Occupation").show()
df.filter(df.Age > 30).show()
df.filter((df.Age > 30) & (df.Occupation == "Engineer")).show()
df.filter("Age > 30 AND Occupation = 'Engineer'").show()

+---------+--------------+
|     Name|    Occupation|
+---------+--------------+
|    Alice|      Engineer|
|      Bob|Data Scientist|
|Catherine|     Developer|
|    David|       Manager|
+---------+--------------+

+-----+---+--------------+
| Name|Age|    Occupation|
+-----+---+--------------+
|Alice| 34|      Engineer|
|  Bob| 45|Data Scientist|
|David| 52|       Manager|
+-----+---+--------------+

+-----+---+----------+
| Name|Age|Occupation|
+-----+---+----------+
|Alice| 34|  Engineer|
+-----+---+----------+

+-----+---+----------+
| Name|Age|Occupation|
+-----+---+----------+
|Alice| 34|  Engineer|
+-----+---+----------+



### 3.3 Adding / Modifying columns

In [11]:
df_with_bonus = df.withColumn("Bonus", df.Age * 10)
df_with_bonus.show()

df_renamed = df.withColumnRenamed("Occupation", "Job")
df_renamed.show()

df_renamed.drop("Job").show()

+---------+---+--------------+-----+
|     Name|Age|    Occupation|Bonus|
+---------+---+--------------+-----+
|    Alice| 34|      Engineer|  340|
|      Bob| 45|Data Scientist|  450|
|Catherine| 29|     Developer|  290|
|    David| 52|       Manager|  520|
+---------+---+--------------+-----+

+---------+---+--------------+
|     Name|Age|           Job|
+---------+---+--------------+
|    Alice| 34|      Engineer|
|      Bob| 45|Data Scientist|
|Catherine| 29|     Developer|
|    David| 52|       Manager|
+---------+---+--------------+

+---------+---+
|     Name|Age|
+---------+---+
|    Alice| 34|
|      Bob| 45|
|Catherine| 29|
|    David| 52|
+---------+---+



## 4. Class Activity

### 📊 Practice Dataset

In [None]:
data = [
    ("Alice", 34, "Engineer", 70000),
    ("Bob", 45, "Data Scientist", 120000),
    ("Catherine", 29, "Developer", 90000),
    ("David", 52, "Manager", 150000),
    ("Eva", 41, "Engineer", 80000),
    ("Frank", 36, "Developer", 95000),
    ("Grace", 28, "Intern", 40000)
]
columns = ["Name", "Age", "Occupation", "Salary"]

df = spark.createDataFrame(data, columns)
df.show()



+---------+---+--------------+------+
|     Name|Age|    Occupation|Salary|
+---------+---+--------------+------+
|    Alice| 34|      Engineer| 70000|
|      Bob| 45|Data Scientist|120000|
|Catherine| 29|     Developer| 90000|
|    David| 52|       Manager|150000|
|      Eva| 41|      Engineer| 80000|
|    Frank| 36|     Developer| 95000|
|    Grace| 28|        Intern| 40000|
+---------+---+--------------+------+



### 📝 Activity Tasks

In [None]:
# - Show the first 5 rows 
# - Print the schema and column names
# - Describe the dataset (count, mean, min, max)
df.show(5)
df.printSchema()
print(df.columns)
df.describe().show()

+---------+---+--------------+------+
|     Name|Age|    Occupation|Salary|
+---------+---+--------------+------+
|    Alice| 34|      Engineer| 70000|
|      Bob| 45|Data Scientist|120000|
|Catherine| 29|     Developer| 90000|
|    David| 52|       Manager|150000|
|      Eva| 41|      Engineer| 80000|
+---------+---+--------------+------+
only showing top 5 rows
root
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)
 |-- Occupation: string (nullable = true)
 |-- Salary: long (nullable = true)

['Name', 'Age', 'Occupation', 'Salary']
+-------+-----+------------------+--------------+-----------------+
|summary| Name|               Age|    Occupation|           Salary|
+-------+-----+------------------+--------------+-----------------+
|  count|    7|                 7|             7|                7|
|   mean| NULL|37.857142857142854|          NULL|92142.85714285714|
| stddev| NULL| 8.706866474772873|          NULL|35338.49917313302|
|    min|Alice|                28|Data Scientist|            40000|
|    max|Grace|                52|       Manager|           150000|
+-------+-----+------------------+--------------+-----------------+

In [None]:
# Filtering
# - Show only employees older than 35
# - Show names and salaries of Engineers only
# - Filter employees with salary > 90000 and age < 50
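df.filter(df.Age > 35).show()
df.filter(df.Occupation == "Engineer").select("Name", "Salary").show()
df.filter((df.Salary > 90000) & (df.Age < 50)).show()

+-----+---+--------------+------+
| Name|Age|    Occupation|Salary|
+-----+---+--------------+------+
|  Bob| 45|Data Scientist|120000|
|David| 52|       Manager|150000|
|  Eva| 41|      Engineer| 80000|
|Frank| 36|     Developer| 95000|
+-----+---+--------------+------+

+-----+------+
| Name|Salary|
+-----+------+
|Alice| 70000|
|  Eva| 80000|
+-----+------+

+-----+---+--------------+------+
| Name|Age|    Occupation|Salary|
+-----+---+--------------+------+
|  Bob| 45|Data Scientist|120000|
|Frank| 36|     Developer| 95000|
+-----+---+--------------+------+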



In [75]:
# Transformations
# - Add a new column Bonus equal to 10% of Salary
# - Rename the column Occupation to Job
# - Drop the Age column from a copy of the DataFrame
df_with_bonus = df.withColumn("Bonus", df.Salary * 0.1)
df_with_bonus.show()
df_renamed = df.withColumnRenamed("Occupation", "Job")
df_renamed.show()
df.show()
dfnew = df.drop("Age")
dfnew.show()


+---------+---+--------------+------+-------+
|     Name|Age|    Occupation|Salary|  Bonus|
+---------+---+--------------+------+-------+
|    Alice| 34|      Engineer| 70000| 7000.0|
|      Bob| 45|Data Scientist|120000|12000.0|
|Catherine| 29|     Developer| 90000| 9000.0|
|    David| 52|       Manager|150000|15000.0|
|      Eva| 41|      Engineer| 80000| 8000.0|
|    Frank| 36|     Developer| 95000| 9500.0|
|    Grace| 28|        Intern| 40000| 4000.0|
+---------+---+--------------+------+-------+

+---------+---+--------------+------+
|     Name|Age|           Job|Salary|
+---------+---+--------------+------+
|    Alice| 34|      Engineer| 70000|
|      Bob| 45|Data Scientist|120000|
|Catherine| 29|     Developer| 90000|
|    David| 52|       Manager|150000|
|      Eva| 41|      Engineer| 80000|
|    Frank| 36|     Developer| 95000|
|    Grace| 28|        Intern| 40000|
+---------+---+--------------+------+

+---------+---+--------------+------+
|     Name|Age|    Occupation|Salary

In [76]:
# Aggregations
# - Compute average salary per job (groupBy + avg)
# - Count the number of employees in each job

from pyspark.sql.functions import avg, count  # harmless if an earlier cell already imported them

df2 = df.groupBy("Occupation").agg(avg("Salary").alias("Avg_Salary"))
df2.show()
df2 = df.groupBy("Occupation").agg(count("*").alias("Employee_Count"))
df2.show()

+--------------+----------+
|    Occupation|Avg_Salary|
+--------------+----------+
|        Intern|   40000.0|
|     Developer|   92500.0|
|Data Scientist|  120000.0|
|      Engineer|   75000.0|
|       Manager|  150000.0|
+--------------+----------+

+--------------+--------------+
|    Occupation|Employee_Count|
+--------------+--------------+
|        Intern|             1|
|     Developer|             2|
|Data Scientist|             1|
|      Engineer|             2|
|       Manager|             1|
+--------------+--------------+



## 📚 PySpark Operations Summary

### 🔍 Basic Display Operations

#### Show Data
```python
df.show()           # Display the first 20 rows (default)
df.show(n)          # Display the first n rows
```

#### Inspect Structure
```python
df.printSchema()    # Print schema (column names and types)
df.columns          # Show column names as a list
df.describe().show() # Display summary statistics (count, mean, stddev, min, max)
```

**Note:** `df.describe()` returns a DataFrame, so evaluating it on its own shows only the object's representation (column names and types), not the statistics. Call `.show()` to print the table.

---

### 🔎 Filtering and Selecting

#### Filter Rows + Select Columns
```python
df.filter(condition).select(col1, col2).show()
```

**Filter:** Apply conditions to keep certain rows
- Supports logical operators: `&` (AND), `|` (OR), `~` (NOT)
- Example: `df.filter((df.Age > 30) & (df.Salary > 80000))`

**Select:** Choose which columns to display
- Example: `df.select("Name", "Salary")`

---

### ✏️ Column Transformations

#### Add or Modify Column
```python
df.withColumn("new_col", expression)
```
- If the column exists → it is replaced in the returned DataFrame
- If the column doesn't exist → it is added to the returned DataFrame
- Example: `df.withColumn("Bonus", df.Salary * 0.1)`

#### Rename Column
```python
df.withColumnRenamed("old_name", "new_name")
```
- Example: `df.withColumnRenamed("Occupation", "Job")`

#### Drop Column
```python
df.drop("column_name")
```
- Example: `df.drop("Age")`

**Important:** These operations return a new DataFrame and don't modify the original!

---

### 📊 Aggregation Operations

#### GroupBy + Aggregate
```python
df.groupBy("column").agg(aggregation_function)
```

**Common Aggregation Functions:**

| Function | Purpose | Example |
|----------|---------|---------|
| `avg("col")` | Average | `avg("Salary")` → Average salary |
| `sum("col")` | Sum | `sum("Salary")` → Total salary |
| `count("col")` | Count non-null values | `count("Salary")` → Number of non-null salaries |
| `count("*")` | Count all rows | `count("*")` → Total number of rows |
| `max("col")` | Maximum | `max("Salary")` → Highest salary |
| `min("col")` | Minimum | `min("Salary")` → Lowest salary |

#### Alias - Rename Result Column
```python
avg("Salary").alias("Avg_Salary")
```
- Makes result column names more readable
- Example: Instead of `avg(Salary)`, displays as `Avg_Salary`

**Complete Example:**
```python
df.groupBy("Occupation").agg(
    avg("Salary").alias("Avg_Salary"),
    count("*").alias("Employee_Count")
).show()
```

---

### 🎯 Key Concepts

1. **Immutability:** DataFrames are never modified in place; every transformation returns a new DataFrame
2. **Method Chaining:** Operations can be chained, e.g. `.filter().select().show()`
3. **Lazy Evaluation:** Transformations aren't executed until an action (like `.show()`) is called
4. **`.show()` Returns None:** Don't assign it to a variable!

## 🎯 Aggregation Practice Exercises

Use the employee dataset above to complete these tasks:

### Exercise 1: Basic Aggregations
1. Find the **maximum salary** across all employees
2. Find the **minimum age** across all employees
3. Calculate the **total salary** (sum) for all employees

### Exercise 2: GroupBy Aggregations
1. Find the **maximum salary** for each Occupation
2. Find the **minimum age** for each Occupation
3. Count how many employees are in each Occupation

### Exercise 3: Multiple Aggregations
1. For each Occupation, show:
   - Average salary
   - Maximum salary
   - Minimum salary
   - Employee count
   
   (All in one query, passing multiple aggregate functions to a single `.agg()` call)

### Exercise 4: Advanced Filtering + Aggregation
1. Find the average salary for employees **older than 30**
2. Count how many Engineers have salary **greater than 75000**
3. For employees **under 40**, calculate average salary by Occupation

---

**💡 Hints:**
- Use `df.agg()` for aggregations on the entire dataset
- Use `df.groupBy("col").agg()` for aggregations per group
- Remember to use `.alias()` to rename result columns
- You can filter first with `.filter()`, then aggregate
- For multiple aggregations, pass them separated by commas in `.agg()`

In [None]:
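# Exercise 1: Basic Aggregations
# Note: max, min, and sum below are the Spark functions from pyspark.sql.functions;
# importing them by name shadows Python's built-ins of the same name
from pyspark.sql.functions import max, min, sum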
# 1. Find the maximum salary across all employees
df.agg(max("Salary")).show()
# 2. Find the minimum age across all employees
df.agg(min("Age")).show()
# 3. Calculate the total salary (sum) for all employees
df.agg(sum("Salary")).show()


+-----------+
|max(Salary)|
+-----------+
|     150000|
+-----------+

+--------+
|min(Age)|
+--------+
|      28|
+--------+

+-----------+
|sum(Salary)|
+-----------+
|     645000|
+-----------+



In [87]:
# Exercise 2: GroupBy Aggregations
# 1. Find the maximum salary for each Occupation
df.groupBy("Occupation").agg(max("Salary")).show()

# 2. Find the minimum age for each Occupation
df.groupBy("Occupation").agg(min("Age")).show()

# 3. Count how many employees are in each Occupation
df.groupBy("Occupation").agg(count("*")).show()

+--------------+-----------+
|    Occupation|max(Salary)|
+--------------+-----------+
|        Intern|      40000|
|     Developer|      95000|
|Data Scientist|     120000|
|      Engineer|      80000|
|       Manager|     150000|
+--------------+-----------+

+--------------+--------+
|    Occupation|min(Age)|
+--------------+--------+
|        Intern|      28|
|     Developer|      29|
|Data Scientist|      45|
|      Engineer|      34|
|       Manager|      52|
+--------------+--------+

+--------------+--------+
|    Occupation|count(1)|
+--------------+--------+
|        Intern|       1|
|     Developer|       2|
|Data Scientist|       1|
|      Engineer|       2|
|       Manager|       1|
+--------------+--------+



In [None]:
# Exercise 3: Multiple Aggregations
# For each Occupation, show: Average salary, Maximum salary, Minimum salary, Employee count
# Hint: df.groupBy("col").agg(func1.alias("name1"), func2.alias("name2"), ...)

# Complete solution with all 4 aggregations
df.groupBy("Occupation").agg(
    avg("Salary").alias("Avg_Salary"),
    max("Salary").alias("Max_Salary"),
    min("Salary").alias("Min_Salary"),
    count("*").alias("Employee_Count")
).show()

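+--------------+----------+----------+----------+--------------+
|    Occupation|Avg_Salary|Max_Salary|Min_Salary|Employee_Count|
+--------------+----------+----------+----------+--------------+
|        Intern|   40000.0|     40000|     40000|             1|
|     Developer|   92500.0|     95000|     90000|             2|
|Data Scientist|  120000.0|    120000|    120000|             1|
|      Engineer|   75000.0|     80000|     70000|             2|
|       Manager|  150000.0|    150000|    150000|             1|
+--------------+----------+----------+----------+--------------+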



In [None]:
# Exercise 4: Advanced Filtering + Aggregation

# 1. Find the average salary for employees older than 30
df.filter(df.Age > 30).agg(
    avg("Salary").alias("Avg_Salary_Age_Over_30")
).show()

# 2. Count how many Engineers have salary greater than 75000
df.filter((df.Occupation == "Engineer") & (df.Salary > 75000)).agg(
    count("*").alias("Engineer_Count_Salary_Over_75k")
).show()

# 3. For employees under 40, calculate average salary by Occupation
df.filter(df.Age < 40).groupBy("Occupation").agg(
    avg("Salary").alias("Avg_Salary")
).show()

+----------------------+
|Avg_Salary_Age_Over_30|
+----------------------+
|              103000.0|
+----------------------+

+------------------------------+
|Engineer_Count_Salary_Over_75k|
+------------------------------+
|                             1|
+------------------------------+

+----------+----------+
|Occupation|Avg_Salary|
+----------+----------+
|    Intern|   40000.0|
| Developer|   92500.0|
|  Engineer|   70000.0|
+----------+----------+



---

## 📝 Exercise Solutions Explanation

### Exercise 3: Multiple Aggregations

**Result Interpretation:**
- Each row shows comprehensive statistics for one occupation
- `Avg_Salary`: Average salary in that occupation
- `Max_Salary`: Highest earner in that occupation
- `Min_Salary`: Lowest earner in that occupation
- `Employee_Count`: Total number of employees in that occupation

**Key Learning:** You can compute multiple aggregations in a single query by passing them as comma-separated arguments to `.agg()`

---

### Exercise 4: Advanced Filtering + Aggregation

#### Question 1: Average Salary (Age > 30)
**Result:** 103,000
- Filters out Grace (28) and Catherine (29)
- Averages: Alice (70k), Bob (120k), Frank (95k), Eva (80k), David (150k)
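
The arithmetic behind this result can be checked in plain Python, using the five salaries listed above (the variable names are illustrative):

```python
# Salaries of the five employees older than 30, as listed above
salaries_over_30 = {
    "Alice": 70000,
    "Bob": 120000,
    "Frank": 95000,
    "Eva": 80000,
    "David": 150000,
}

total = sum(salaries_over_30.values())
average = total / len(salaries_over_30)

print(total)    # 515000
print(average)  # 103000.0
```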

#### Question 2: Engineers with Salary > 75k
**Result:** 1 engineer
- Alice (70k) ❌ - Below threshold
- Eva (80k) ✅ - Above threshold
- **Important:** Must filter by BOTH Occupation AND Salary

#### Question 3: Average Salary by Occupation (Age < 40)
**Results:**
- Intern: 40,000 (only Grace, 28)
- Developer: 92,500 (Catherine 29, Frank 36)
- Engineer: 70,000 (only Alice, 34)
- Bob (45), Eva (41), and David (52) are excluded ✅

**Key Learning:** Filter first, then aggregate - this is method chaining in action!

## 🎉 Congratulations!

You've completed all the fundamental PySpark operations! 

### 📚 What You've Learned:

#### ✅ **1. SparkSession Management**
- Creating and configuring SparkSession
- Understanding `getOrCreate()` behavior
- Managing Spark lifecycle

#### ✅ **2. DataFrame Creation (3 Methods)**
- From Python lists
- From Pandas DataFrames
- Reading from CSV files

#### ✅ **3. Basic Operations**
- Display: `show()`, `printSchema()`, `columns`, `describe()`
- Filtering: Using conditions and logical operators
- Selecting: Choosing specific columns
- Transformations: Adding, renaming, dropping columns

#### ✅ **4. Aggregations**
- Basic aggregations: `avg()`, `sum()`, `count()`, `max()`, `min()`
- GroupBy operations
- Multiple aggregations in one query
- Combining filters with aggregations

#### ✅ **5. Key Concepts**
- DataFrame immutability
- Method chaining
- Lazy evaluation
- Proper use of `.alias()` for column naming

---

### 🎯 Next Steps:

1. **Advanced Transformations**
   - Window functions
   - User-defined functions (UDFs)
   - Join operations

2. **Performance Optimization**
   - Partitioning strategies
   - Caching and persistence
   - Broadcast variables

3. **Real-World Projects**
   - Log analysis
   - Data ETL pipelines
   - Machine learning with MLlib

---

### 📖 Additional Resources:

- **Official Documentation**: [PySpark API Docs](https://spark.apache.org/docs/latest/api/python/)
- **Spark SQL Guide**: [SQL Programming Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html)
- **Community**: [Stack Overflow PySpark Tag](https://stackoverflow.com/questions/tagged/pyspark)

---

**Happy Spark Learning! 🚀**

If you found this helpful, consider ⭐ starring the repository!