# Module 1 - Introduction & SparkSession

## Introduction

This module covers Apache Spark fundamentals and PySpark, the Python API for Apache Spark. We'll start with understanding what Apache Spark is, its architecture, and then dive into PySpark for practical implementation. This module focuses on DataFrames and Spark SQL - the primary APIs for data engineering.


## What is Apache Spark?

Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning workloads on single-node machines or clusters.

**Key Characteristics:**
- **Multi-language**: Supports Python, Scala, Java, R, and SQL
- **Flexible Deployment**: Works on single machines or distributed clusters
- **Unified Platform**: Handles batch processing, streaming, SQL, and machine learning
- **High Performance**: In-memory computing for faster processing


## Capabilities

Apache Spark provides a comprehensive set of capabilities:

- **ANSI SQL**: Full SQL support for querying structured data
- **Batch Processing API**: Process large volumes of data in batches
- **Stream Processing API**: Near-real-time data processing with Structured Streaming
- **Machine Learning API**: MLlib library for scalable machine learning algorithms


## Why Apache Spark?

Apache Spark has become the industry standard for big data processing due to several key advantages:

- **Abstraction**: High-level APIs that hide the complexity of distributed computing
- **Ease of Use**: Simple APIs similar to familiar tools (like Pandas for DataFrames)
- **Unified**: Single platform for batch, streaming, SQL, and ML workloads
- **Open Source**: Free, community-driven, and continuously improved
- **Ecosystem**: Rich ecosystem with integrations for various data sources and tools


## Apache Spark System Architecture

### Unified Architecture

Apache Spark follows a layered architecture where everything funnels into Spark Core:

```
┌──────────────────────────────────────────────────────────────┐
│                        SPARK CONNECT                          │
└──────────────────────────────────────────────────────────────┘
                              │
┌──────────────────────────────────────────────────────────────┐
│                    HIGH-LEVEL SPARK APIs                      │
│                                                              │
│  ┌──────────────┐  ┌──────────────────┐  ┌──────────────┐  │
│  │ SQL / DF API │  │ Structured        │  │ Pandas API  │  │
│  │              │  │ Streaming         │  │             │  │
│  └──────────────┘  └──────────────────┘  └──────────────┘  │
│                                                              │
│                    ┌──────────────┐                          │
│                    │   MLlib      │                          │
│                    └──────────────┘                          │
└──────────────────────────────────────────────────────────────┘
                              │
┌──────────────────────────────────────────────────────────────┐
│                   LANGUAGE BINDINGS                           │
│                                                              │
│      Python        Scala         Java           R             │
│                                                              │
└──────────────────────────────────────────────────────────────┘
                              │
┌──────────────────────────────────────────────────────────────┐
│                 SPARK CORE (RDD API)                          │
│                                                              │
│   - DAG Scheduler                                             │
│   - Task Scheduler                                            │
│   - Memory Management                                         │
│   - Fault Tolerance                                           │
└──────────────────────────────────────────────────────────────┘
                              │
┌──────────────────────────────────────────────────────────────┐
│                  RESOURCE MANAGER LAYER                       │
│                                                              │
│   Spark Standalone   |   YARN   |   Kubernetes                │
│                                                              │
└──────────────────────────────────────────────────────────────┘
                              │
┌──────────────────────────────────────────────────────────────┐
│                     COMPUTE CLUSTER                          │
│                                                              │
│     Hadoop        AWS EC2        Azure VM        GCP VM        │
│                                                              │
└──────────────────────────────────────────────────────────────┘
                              │
┌──────────────────────────────────────────────────────────────┐
│                       STORAGE LAYER                           │
│                                                              │
│      HDFS        |        S3        |     ADLS     |   GCS     │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```

### Key Architecture Principles

**Everything funnels into Spark Core**: All APIs (SQL, DataFrame, Streaming, MLlib) ultimately execute through Spark Core.

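**Language doesn't matter**: DataFrame and SQL code written in Python, Scala, Java, or R is translated into the same query plans, which execute on the JVM. The main exception is Python UDFs, which run in separate Python worker processes.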

**DataFrame API is the primary interface**: DataFrames and Spark SQL are the recommended APIs for data engineering tasks.

**Resource Managers don't compute**: They only allocate resources (CPU, memory) to Spark applications.

**Storage is passive**: Spark pulls data from storage; storage never pushes data to Spark.


## Spark Platform

### Understanding the Complete Spark Ecosystem

**Apache Spark is a framework and library**, but to build, test, deploy, and operate Spark applications in production, you need more than just Spark:

- **Resource Manager**: Allocates compute resources (YARN, Kubernetes, Spark Standalone)
- **Compute Cluster**: Physical or virtual machines that execute Spark jobs
- **Storage**: Data storage systems (HDFS, S3, ADLS, GCS)
- **Security & Governance**: Access control, data lineage, compliance
- **Monitoring & Operations**: Job monitoring, alerting, performance tuning

### Spark Platforms

Different vendors package Spark with these components to create complete platforms:

- **Databricks**: Unified analytics platform with managed Spark, Delta Lake, MLflow, and Unity Catalog
- **AWS EMR**: Amazon's managed Spark service on AWS infrastructure
- **Azure HDInsight**: Microsoft's managed Spark service on Azure
- **Cloudera Hadoop**: On-premise Hadoop distribution with Spark
- **Google Cloud Dataproc**: Google's managed Spark service on GCP

**Key Point**: Spark alone is not enough for production. You need a complete platform that combines Spark with resource management, compute infrastructure, storage, security, and governance.


## What is Databricks Cloud?

**Databricks** is a unified, open analytics platform for building, deploying, sharing, and maintaining enterprise-grade data, analytics, and AI solutions at scale.

### Key Features

- **Managed Open Source Integration**: Seamlessly integrates Apache Spark, MLflow, Delta Lake, Unity Catalog, and PostgreSQL
- **Unified Platform**: Combines data engineering, data science, and machine learning in one platform
- **Enterprise-Grade**: Built-in security, governance, and compliance features
- **Scalable**: Automatically scales compute resources based on workload
- **Collaborative**: Team collaboration features for data teams

### Databricks Components

- **Apache Spark**: Core processing engine
- **Delta Lake**: Open-source storage layer with ACID transactions
- **MLflow**: Machine learning lifecycle management
- **Unity Catalog**: Unified governance for data and AI assets
- **Databricks SQL**: Serverless SQL warehouse for analytics

**Note**: While Databricks is a popular Spark platform, Spark applications can run on any compatible platform (AWS EMR, Azure HDInsight, on-premise clusters, etc.).


---

## What is PySpark?

**PySpark** is the Python API for Apache Spark. It allows Python developers to leverage Spark's distributed computing capabilities using familiar Python syntax.

**Key Features:**
- **Distributed Computing**: Process data across multiple machines (cluster)
- **In-Memory Processing**: Fast processing by keeping data in memory
- **Fault Tolerance**: Automatically recovers from failures
- **Lazy Evaluation**: Operations are optimized before execution
- **DataFrame API**: Similar to Pandas, optimized for structured data processing

**Why PySpark for Data Engineering?**
- **Handles Big Data**: Process terabytes/petabytes of data
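- **Faster than Pandas on large data**: Once data outgrows a single machine's memory, Spark's distributed execution is typically much faster; for small datasets, Pandas is often quicker because it avoids Spark's scheduling overhead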
- **Scalable**: Can scale from single machine to thousands of machines
- **Multiple Data Sources**: Read from CSV, JSON, Parquet, Hive, databases, etc.
- **Industry Standard**: Used by major companies (Netflix, Uber, Airbnb, etc.)


## What You'll Learn in This Notebook

- What is Apache Spark and its architecture?
- Why Apache Spark for data engineering?
- Understanding Spark platforms and ecosystem
- Basic PySpark setup and installation
- Creating your first SparkSession
- Working with PySpark DataFrames
- Creating DataFrames from various file formats (CSV, JSON, Parquet)
- Understanding schema inference vs explicit schema


## PySpark vs Pandas

| Feature | Pandas | PySpark |
|---------|--------|---------|
| **Data Size** | Best for data that fits in memory (single machine) | Handles data larger than memory (distributed) |
| **Speed** | Fast for small/medium data | Faster for large data (distributed processing) |
| **Scalability** | Single machine | Multiple machines (cluster) |
| **Use Case** | Data analysis, ETL on small datasets | Big data processing, ETL on large datasets |
| **Syntax** | Python-like | Similar to Pandas (DataFrame API) |

**Rule of Thumb**: 
- Use **Pandas** when data fits in your machine's memory
- Use **PySpark** when data is too large or you need distributed processing


## Installing PySpark

PySpark can be installed using pip:

```bash
pip install pyspark
```

**Note**: For this course, we'll use PySpark in local mode (single machine). In production, Spark runs on clusters.


In [1]:
pip show pyspark

Name: pyspark
Version: 3.5.1
Summary: Apache Spark Python API
Home-page: https://github.com/apache/spark/tree/master/python
Author: Spark Developers
Author-email: dev@spark.apache.org
License: http://www.apache.org/licenses/LICENSE-2.0
Location: /opt/homebrew/lib/python3.11/site-packages
Requires: py4j
Required-by: 
Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install pyspark==3.5.1


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.1.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49m/opt/homebrew/opt/python@3.11/bin/python3.11 -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [1]:
# Import PySpark inside the try block so a missing installation
# is reported by the except branch instead of crashing the cell
try:
    import pyspark
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

    print(f"PySpark version: {pyspark.__version__}")
    print("PySpark imported successfully!")
except ImportError:
    print("PySpark not installed. Please run: pip install pyspark")


PySpark version: 3.5.1
PySpark imported successfully!


## Creating a SparkSession

**SparkSession** is the entry point to PySpark. It's similar to a database connection - you need it to work with Spark.

### What is SparkSession?

**SparkSession** acts as an entry point to the Spark Cluster. To run code on a Spark Cluster, a SparkSession must be created.

**Key Points:**
- **For Higher-Level APIs**: To work with DataFrames and Spark SQL, SparkSession is required
- **For RDD Level**: SparkContext is required (but SparkSession wraps SparkContext)
- **Unified Entry Point**: SparkSession acts as an umbrella that encapsulates and unifies different contexts like SparkContext, HiveContext, SQLContext
- **One Session Per Application**: Only one SparkSession object is typically created per application

### Using Builder Pattern

SparkSession uses the **Builder Pattern** for creation, which allows you to configure various settings in a fluent, readable way:

**Key Points:**
- Use `getOrCreate()` to reuse existing session or create new one
- Call `spark.stop()` when the application is done to release its resources (in notebooks, the session usually ends with the kernel, so this is rarely done by hand)
- The builder pattern allows chaining configuration options


In [2]:
# Create a SparkSession using Builder Pattern
# appName: Name of your application (appears in Spark UI)
# master: "local[*]" means use all available cores on local machine

spark = SparkSession.builder \
    .appName("PySpark Introduction") \
    .master("local[*]") \
    .getOrCreate()

# Check SparkSession
print(f"SparkSession created: {spark}")
print(f"Spark version: {spark.version}")
print(f"Spark context: {spark.sparkContext}")

print("\n" + "="*60)
print("Understanding Builder Pattern:")
print("="*60)
print("- .builder: Starts the builder")
print("- .appName(): Sets application name")
print("- .master(): Sets the master URL (local[*] for local mode)")
print("- .getOrCreate(): Creates or retrieves existing session")
print("\nNote: Only one SparkSession object is created per application.")


25/12/30 06:40:44 WARN Utils: Your hostname, N-MacBookPro-37.local resolves to a loopback address: 127.0.0.1; using 192.168.1.2 instead (on interface en0)
25/12/30 06:40:44 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/12/30 06:40:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


SparkSession created: <pyspark.sql.session.SparkSession object at 0x112801350>
Spark version: 3.5.1
Spark context: <SparkContext master=local[*] appName=PySpark Introduction>

Understanding Builder Pattern:
- .builder: Starts the builder
- .appName(): Sets application name
- .master(): Sets the master URL (local[*] for local mode)
- .getOrCreate(): Creates or retrieves existing session

Note: Only one SparkSession object is created per application.


The `.master()` setting tells Spark where the cluster is and how many resources to use.

The string `"local[*]"` gives Spark two instructions:

- `local`: Do not look for an external cluster manager (such as YARN or Kubernetes). Instead, run the driver and the executor together inside a single process on your own machine.
- `[*]`: How many worker threads (in practice, CPU cores) to use:
  - `local[1]`: Use only one core (serial processing).
  - `local[2]`: Use two cores.
  - `local[*]`: Use all available cores on your machine. This is the most common setting for development because it makes full use of your laptop.

How the master URL changes what the SparkSession does:

- If the master is `local`: There is no separate cluster manager; scheduling happens inside the same process on your laptop.
- If the master is `yarn` or `spark://`: The SparkSession sends a request over the network to a remote cluster manager to ask for workers on different servers.

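One way to check the number of cores your Spark application is using in `local[*]` mode is the Spark Web UI (`spark.sparkContext.defaultParallelism` reports the same number from code). The UI is only available while the application is running: in a notebook the session stays alive until you stop it, but in a standalone script you must keep the program running (e.g., with `time.sleep` or an input prompt) to view the UI.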

**Step-by-Step: Accessing the Core Count**

- Open the Web UI: While your Spark script is running, open your web browser and go to http://localhost:4040. (If port 4040 is taken, Spark automatically tries 4041, 4042, etc.)
- Navigate to the "Executors" tab: In the top navigation menu, click **Executors**. This is the resource dashboard for your Spark session.
- Find the driver row: In the table below the summary, look for the row labeled `driver`. In `local[*]` mode, the driver and the executor are the same process.
- Read the "Cores" column: The number displayed there is the total count of threads Spark has claimed from your machine.

## Understanding SparkContext

**SparkContext** is the entry point to Spark's low-level API. SparkSession wraps SparkContext and provides higher-level APIs.

- **SparkContext**: Low-level API (RDD operations)
- **SparkSession**: High-level API (DataFrame operations) - **We'll use this mostly**

For most data engineering tasks, you'll work with SparkSession and DataFrames.


### Creating Multiple Spark Sessions

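While you typically create one SparkSession per application, Spark lets you create additional sessions when needed by calling `spark.newSession()`. Calling `SparkSession.builder.getOrCreate()` a second time does not create a new session; it returns the one that already exists.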

**Important Points:**
- **Same SparkContext**: Multiple Spark Sessions created in the same application will share the same underlying SparkContext
- **Isolated Environments**: Each SparkSession can have its own isolated environment (different configurations, temporary views, etc.)
- **Use Cases**: Useful when you need different configurations for different parts of your application

**Example Use Case**: You might want one SparkSession for reading from one data source with specific configurations, and another for writing to a different data source with different settings.


In [5]:
# Example: Creating Multiple Spark Sessions
# Note: In practice, you usually only need one SparkSession

# First SparkSession (already created above)
print("First SparkSession:")
print(f"App Name: {spark.sparkContext.appName}")
print(f"SparkContext ID: {id(spark.sparkContext)}")

# Request a second SparkSession with a different app name.
# getOrCreate() finds the existing session and returns it, so the new appName is ignored.
spark2 = SparkSession.builder \
    .appName("PySpark Second Session") \
    .master("local[*]") \
    .getOrCreate()

print("\nSecond SparkSession:")
print(f"App Name: {spark2.sparkContext.appName}")
print(f"SparkContext ID: {id(spark2.sparkContext)}")

print("\n" + "="*60)
print("Key Observation:")
print("="*60)
print("Notice that both SparkSessions share the same SparkContext!")
print("This is because SparkContext is created once per JVM/application.")
print("\nEvery Spark Application has:")
print("- One Driver (Master) - where your code runs")
print("- Multiple Executors (Workers) - where tasks execute")


First SparkSession:
App Name: PySpark Introduction
SparkContext ID: 4530630736

Second SparkSession:
App Name: PySpark Introduction
SparkContext ID: 4530630736

Key Observation:
Notice that both SparkSessions share the same SparkContext!
This is because SparkContext is created once per JVM/application.

Every Spark Application has:
- One Driver (Master) - where your code runs
- Multiple Executors (Workers) - where tasks execute


25/12/28 21:25:41 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [6]:
# Access SparkContext from SparkSession
sc = spark.sparkContext

# Get some information about the Spark environment
print(f"Spark Context: {sc}")
print(f"Default parallelism: {sc.defaultParallelism}")
print(f"Master URL: {sc.master}")


Spark Context: <SparkContext master=local[*] appName=PySpark Introduction>
Default parallelism: 11
Master URL: local[*]


## Creating Your First DataFrame

Let's create a simple DataFrame to get started. PySpark DataFrames are similar to Pandas DataFrames but are distributed across a cluster.


In [7]:
# Create a simple DataFrame from a list of tuples
data = [
    ("Alice", 25, "New York"),
    ("Bob", 30, "London"),
    ("Charlie", 35, "Tokyo"),
    ("Diana", 28, "Paris")
]

# Define column names
columns = ["Name", "Age", "City"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Display the DataFrame
print("DataFrame created:")
df.show()


DataFrame created:


                                                                                

+-------+---+--------+
|   Name|Age|    City|
+-------+---+--------+
|  Alice| 25|New York|
|    Bob| 30|  London|
|Charlie| 35|   Tokyo|
|  Diana| 28|   Paris|
+-------+---+--------+



In [8]:
# Get DataFrame schema (structure)
print("DataFrame Schema:")
df.printSchema()


DataFrame Schema:
root
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)
 |-- City: string (nullable = true)



## Different Ways of Creating DataFrames

There are several ways to create DataFrames in PySpark. Understanding all these methods gives you flexibility in different scenarios. Let's explore each method:

### 1. Using `spark.read` (Reading from Files)

This is the most common method for reading data from files. The file-reading section later in this module covers CSV, JSON, and Parquet in more detail.

**When to Use**: When you have data stored in files (CSV, JSON, Parquet, etc.)

**Example**:
```python
df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)
df = spark.read.json("path/to/file.json")
df = spark.read.parquet("path/to/file.parquet")
```


In [9]:
# Example 1: Using spark.read to create DataFrame from CSV
# This is the most common way to read data from files

# Create a sample CSV file for demonstration
import os
os.makedirs("data", exist_ok=True)

sample_csv = """Name,Age,City
Alice,25,New York
Bob,30,London
Charlie,35,Tokyo"""

with open("data/sample_read.csv", "w") as f:
    f.write(sample_csv)

# Read using spark.read
df_read = spark.read.csv("data/sample_read.csv", header=True, inferSchema=True)

print("DataFrame created using spark.read:")
df_read.show()
df_read.printSchema()


DataFrame created using spark.read:
+-------+---+--------+
|   Name|Age|    City|
+-------+---+--------+
|  Alice| 25|New York|
|    Bob| 30|  London|
|Charlie| 35|   Tokyo|
+-------+---+--------+

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- City: string (nullable = true)



### 2. Using `spark.sql` (Executing SQL Queries)

You can create DataFrames by executing SQL queries. This is useful when you want to use SQL syntax or query existing tables/views.

**When to Use**: 
- When you prefer SQL syntax over DataFrame API
- When querying existing tables or views
- When working with complex queries that are easier to express in SQL

**Example**:
```python
df = spark.sql("SELECT * FROM employees WHERE age > 25")
```


In [10]:
# Example 2: Using spark.sql to create DataFrame from SQL query
# First, let's create a temporary view from existing DataFrame
df.createOrReplaceTempView("people")

# Now we can use spark.sql to query it
df_sql = spark.sql("SELECT Name, Age FROM people WHERE Age > 28")

print("DataFrame created using spark.sql:")
df_sql.show()

# You can also use spark.sql with inline data using VALUES
df_sql_inline = spark.sql("""
    SELECT * FROM VALUES 
    ('Alice', 25, 'New York'),
    ('Bob', 30, 'London'),
    ('Charlie', 35, 'Tokyo')
    AS t(Name, Age, City)
""")

print("\nDataFrame created using spark.sql with inline VALUES:")
df_sql_inline.show()


DataFrame created using spark.sql:
+-------+---+
|   Name|Age|
+-------+---+
|    Bob| 30|
|Charlie| 35|
+-------+---+


DataFrame created using spark.sql with inline VALUES:
+-------+---+--------+
|   Name|Age|    City|
+-------+---+--------+
|  Alice| 25|New York|
|    Bob| 30|  London|
|Charlie| 35|   Tokyo|
+-------+---+--------+



### 3. Using `spark.table` (Reading from Tables/Views)

You can create DataFrames by reading from existing tables or views in Spark's catalog.

**When to Use**: 
- When you have registered tables or views
- When working with Hive tables
- When you want to read from a table that was created earlier

**Example**:
```python
df = spark.table("employees")  # Reads from a table named "employees"
df = spark.table("database.employees")  # Reads from a table in a specific database
```


In [11]:
# Example 3: Using spark.table to create DataFrame from a table/view
# First, register the DataFrame as a temporary view
df.createOrReplaceTempView("people_table")

# Now read from the table using spark.table
df_table = spark.table("people_table")

print("DataFrame created using spark.table:")
df_table.show()

print("\nNote: spark.table() is equivalent to spark.sql('SELECT * FROM table_name')")
print("It's a convenient way to read from registered tables or views.")


DataFrame created using spark.table:
+-------+---+--------+
|   Name|Age|    City|
+-------+---+--------+
|  Alice| 25|New York|
|    Bob| 30|  London|
|Charlie| 35|   Tokyo|
|  Diana| 28|   Paris|
+-------+---+--------+


Note: spark.table() is equivalent to spark.sql('SELECT * FROM table_name')
It's a convenient way to read from registered tables or views.


### 4. Using `spark.range` (Creating Sequential Data)

`spark.range()` creates a DataFrame with a single column containing a sequence of numbers. This is useful for generating test data or creating sequences.

**Syntax**:
```python
spark.range(start, end, step, numPartitions)
```

**Parameters**:
- `start`: Starting value (inclusive, default: 0). When only one argument is passed, as in `spark.range(10)`, it is treated as `end`
- `end`: Ending value (exclusive)
- `step`: Step size (default: 1)
- `numPartitions`: Number of partitions (optional)

**Important**: `spark.range()` creates a DataFrame with **one column** named `id` containing the sequence of numbers.

**When to Use**: 
- Generating test data
- Creating sequences for joins or iterations
- Creating sample data for testing


In [12]:
# Example 4: Using spark.range to create DataFrame with sequential numbers
# Creates a DataFrame with one column 'id' containing numbers from 0 to 9
df_range = spark.range(10)

print("DataFrame created using spark.range(10):")
df_range.show()
df_range.printSchema()

# Range with start and end
df_range2 = spark.range(5, 15)

print("\nDataFrame created using spark.range(5, 15):")
df_range2.show()

# Range with step
df_range3 = spark.range(0, 20, 2)

print("\nDataFrame created using spark.range(0, 20, 2) - step of 2:")
df_range3.show()

print("\nKey Point: spark.range() always creates a DataFrame with ONE column named 'id'")


DataFrame created using spark.range(10):
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+

root
 |-- id: long (nullable = false)


DataFrame created using spark.range(5, 15):
+---+
| id|
+---+
|  5|
|  6|
|  7|
|  8|
|  9|
| 10|
| 11|
| 12|
| 13|
| 14|
+---+


DataFrame created using spark.range(0, 20, 2) - step of 2:
+---+
| id|
+---+
|  0|
|  2|
|  4|
|  6|
|  8|
| 10|
| 12|
| 14|
| 16|
| 18|
+---+


Key Point: spark.range() always creates a DataFrame with ONE column named 'id'


### 5. Using `spark.createDataFrame` (Creating from Local Data)

You can create DataFrames from local Python data structures like lists, tuples, or dictionaries. This is useful when you have data in memory.

**When to Use**: 
- When you have data in Python variables (lists, tuples, dictionaries)
- When creating test data
- When working with small datasets that fit in memory

**Three Ways to Use `createDataFrame`**:

#### Method 1: Simple List with Column Names
```python
data = [("Alice", 25), ("Bob", 30)]
df = spark.createDataFrame(data, ["Name", "Age"])
```

#### Method 2: List with Column Names using `.toDF()`
```python
data = [("Alice", 25), ("Bob", 30)]
df = spark.createDataFrame(data).toDF("Name", "Age")
```

#### Method 3: List with Explicit Schema
```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True)
])

data = [("Alice", 25), ("Bob", 30)]
df = spark.createDataFrame(data, schema)
```

**Key Differences**:
- **Method 1**: Column names provided as second argument
- **Method 2**: Uses `.toDF()` to specify column names (defaults to `_1`, `_2`, etc. if not specified)
- **Method 3**: Explicit schema definition (best for production, gives you control over data types)


In [13]:
# Example 5: Using spark.createDataFrame - Method 1: With column names
data = [("Alice", 25, "New York"), ("Bob", 30, "London")]

# Method 1: Provide column names as second argument
df_method1 = spark.createDataFrame(data, ["Name", "Age", "City"])

print("Method 1: spark.createDataFrame(data, column_names)")
df_method1.show()
df_method1.printSchema()

print("\n" + "="*60)

# Example 5b: Method 2: Using .toDF() to specify column names
# If you don't specify column names, defaults are _1, _2, _3, etc.
df_method2_default = spark.createDataFrame(data)
print("Method 2a: Without .toDF() - uses default column names:")
df_method2_default.show()

# With .toDF() to specify column names
df_method2 = spark.createDataFrame(data).toDF("Name", "Age", "City")
print("\nMethod 2b: With .toDF() to specify column names:")
df_method2.show()

print("\n" + "="*60)

# Example 5c: Method 3: With explicit schema (best for production)
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("City", StringType(), True)
])

df_method3 = spark.createDataFrame(data, schema)

print("Method 3: spark.createDataFrame(data, schema) - Explicit schema")
df_method3.show()
df_method3.printSchema()

print("\n" + "="*60)
print("Summary of createDataFrame methods:")
print("="*60)
print("1. df = spark.createDataFrame(list, column_names)")
print("   → Simple, quick way to specify column names")
print("\n2. df = spark.createDataFrame(list).toDF(column_names)")
print("   → Uses .toDF() to specify column names")
print("   → If .toDF() is not used, defaults to _1, _2, _3, etc.")
print("\n3. df = spark.createDataFrame(list, schema)")
print("   → Best for production - explicit control over data types")
print("   → Recommended when you need specific data types")


Method 1: spark.createDataFrame(data, column_names)
+-----+---+--------+
| Name|Age|    City|
+-----+---+--------+
|Alice| 25|New York|
|  Bob| 30|  London|
+-----+---+--------+

root
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)
 |-- City: string (nullable = true)


Method 2a: Without .toDF() - uses default column names:
+-----+---+--------+
|   _1| _2|      _3|
+-----+---+--------+
|Alice| 25|New York|
|  Bob| 30|  London|
+-----+---+--------+


Method 2b: With .toDF() to specify column names:
+-----+---+--------+
| Name|Age|    City|
+-----+---+--------+
|Alice| 25|New York|
|  Bob| 30|  London|
+-----+---+--------+


Method 3: spark.createDataFrame(data, schema) - Explicit schema
+-----+---+--------+
| Name|Age|    City|
+-----+---+--------+
|Alice| 25|New York|
|  Bob| 30|  London|
+-----+---+--------+

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- City: string (nullable = true)


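Summary of createDataFrame methods:
1. df = spark.createDataFrame(list, column_names)
   → Simple, quick way to specify column names

2. df = spark.createDataFrame(list).toDF(column_names)
   → Uses .toDF() to specify column names
   → If .toDF() is not used, defaults to _1, _2, _3, etc.

3. df = spark.createDataFrame(list, schema)
   → Best for production - explicit control over data types
   → Recommended when you need specific data types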

### Summary: Different Ways to Create DataFrames

| Method | Use Case | Example |
|--------|----------|---------|
| `spark.read` | Reading from files | `spark.read.csv("file.csv")` |
| `spark.sql` | Executing SQL queries | `spark.sql("SELECT * FROM table")` |
| `spark.table` | Reading from tables/views | `spark.table("employees")` |
| `spark.range` | Creating sequential numbers | `spark.range(10)` (creates one column) |
| `spark.createDataFrame` | Creating from local data | `spark.createDataFrame(data, schema)` |

**Key Takeaways**:
- **`spark.read`**: Most common for reading files (CSV, JSON, Parquet, etc.)
- **`spark.sql`**: Great for SQL-based queries and working with existing tables
- **`spark.table`**: Convenient way to read from registered tables/views
- **`spark.range`**: Useful for generating test data or sequences (creates one column DataFrame)
- **`spark.createDataFrame`**: Essential for creating DataFrames from in-memory Python data

**Best Practice**: Use explicit schema with `createDataFrame` for production code to ensure correct data types.


In [14]:
# Get basic information about the DataFrame
print(f"Number of rows: {df.count()}")
print(f"Number of columns: {len(df.columns)}")
print(f"Column names: {df.columns}")


Number of rows: 4
Number of columns: 3
Column names: ['Name', 'Age', 'City']


## Creating DataFrame with Schema

You can also create a DataFrame with an explicit schema definition, which is useful when you need control over the data types.


In [15]:
# Define schema explicitly
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Salary", DoubleType(), True)
])

# Create DataFrame with schema
data = [
    ("Alice", 25, 50000.0),
    ("Bob", 30, 60000.0),
    ("Charlie", 35, 70000.0)
]

df_with_schema = spark.createDataFrame(data, schema)
df_with_schema.show()
df_with_schema.printSchema()


+-------+---+-------+
|   Name|Age| Salary|
+-------+---+-------+
|  Alice| 25|50000.0|
|    Bob| 30|60000.0|
|Charlie| 35|70000.0|
+-------+---+-------+

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Salary: double (nullable = true)



## Creating DataFrame from Files

PySpark can read data from various file formats. Let's explore reading from CSV, JSON, and Parquet files.

### Standardized Reading Pattern

PySpark uses a consistent pattern for reading files that helps build a clear mental model:

```python
df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("inferschema", "true") \
    .load("<file-path>")
```

**Pattern Components:**
- `spark.read`: Entry point for reading data
- `.format()`: Specifies the file format (csv, json, parquet, etc.)
- `.option()`: Sets format-specific options (header, inferschema, delimiter, etc.)
- `.schema()`: Optional - defines explicit schema (better for production)
- `.load()`: Specifies the file or directory path

**Note**: Option names are case-insensitive, so `"inferschema"` and `"inferSchema"` behave the same. The Spark documentation spells the option `inferSchema`.

**Key Points:**
- **CSV**: Common format, can infer schema automatically or use explicit schema
- **JSON**: Structured format, schema is usually inferred from the data
- **Parquet**: Columnar format, very efficient for analytics workloads


### Reading from JSON File

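JSON files are commonly used for structured data. PySpark can automatically infer the schema from JSON files. By default the reader expects JSON Lines (one complete object per line). When the file is a single array spread over several lines, as in the output below, the lines holding only `[` and `]` cannot be parsed and show up as `_corrupt_record` rows unless you set `.option("multiLine", "true")`.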


In [17]:
# Read JSON file using standardized format
# PySpark automatically infers schema from JSON
json_df = spark.read \
    .format("json") \
    .load("data/employees.json")

print("DataFrame from JSON:")
json_df.show()

print("\nSchema (automatically inferred):")
json_df.printSchema()


DataFrame from JSON:
+---------------+----+-----------+-----------+--------------+------+
|_corrupt_record| age| department|employee_id|          name|salary|
+---------------+----+-----------+-----------+--------------+------+
|              [|NULL|       NULL|       NULL|          NULL|  NULL|
|           NULL|  28|Engineering|          1|      John Doe| 75000|
|           NULL|  32|  Marketing|          2|    Jane Smith| 65000|
|           NULL|  45|      Sales|          3|   Bob Johnson| 80000|
|           NULL|  29|Engineering|          4|Alice Williams| 72000|
|           NULL|  38|         HR|          5| Charlie Brown| 60000|
|           NULL|  35|  Marketing|          6|  Diana Prince| 68000|
|           NULL|  42|      Sales|          7|  Frank Miller| 85000|
|           NULL|  31|Engineering|          8|     Grace Lee| 74000|
|              ]|NULL|       NULL|       NULL|          NULL|  NULL|
+---------------+----+-----------+-----------+--------------+------+


Schema (aut

### Reading from CSV File

CSV files are the most common format for data exchange. PySpark can:
- **Infer schema automatically** (makes an extra pass over the data to determine types)
- **Use explicit schema** (better performance and type control)

**Note**: Schema inference requires reading the data twice (once to infer, once to process), so for production it's better to define the schema explicitly.


In [19]:
# Read CSV with schema inference using standardized format
# format("csv") specifies the file format
# option("header", "true") means first row contains column names
# option("inferschema", "true") tells Spark to automatically detect data types
# load() specifies the file path
csv_df_inferred = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("inferschema", "true") \
    .load("data/employees.csv")

print("DataFrame from CSV (with inferred schema):")
csv_df_inferred.show()

print("\nInferred Schema:")
csv_df_inferred.printSchema()

print("\nData types:")
# Look up each field by the column name taken from the header row
print(f"Name: {csv_df_inferred.schema['Name'].dataType}")
print(f"Age: {csv_df_inferred.schema['Age'].dataType}")
print(f"Salary: {csv_df_inferred.schema['Salary'].dataType}")


DataFrame from CSV (with inferred schema):
+-------+---+----------+------+
|   Name|Age|Department|Salary|
+-------+---+----------+------+
|  Alice| 25|     Sales| 50000|
|    Bob| 30|        IT| 60000|
|Charlie| 35|     Sales| 70000|
|  Diana| 28|        IT| 55000|
|    Eve| 32|     Sales| 65000|
|  Frank| 27|        HR| 52000|
|  Grace| 29|        IT| 58000|
|  Henry| 31|     Sales| 62000|
+-------+---+----------+------+


Inferred Schema:
root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Department: string (nullable = true)
 |-- Salary: integer (nullable = true)


Data types:
Name: StringType()
Age: IntegerType()
Salary: IntegerType()


In [20]:
print(csv_df_inferred.columns)

['Name', 'Age', 'Department', 'Salary']


In [21]:
# Read CSV with explicit schema using standardized format (better for production)
# This is faster and more reliable than schema inference
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# With header=true and an explicit schema, Spark skips the header row and applies
# the schema by position, so the fields must match the file's columns in number and order
csv_schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Department", StringType(), True),
    StructField("Salary", IntegerType(), True)
])

csv_df_explicit = spark.read \
    .format("csv") \
    .option("header", "true") \
    .schema(csv_schema) \
    .load("data/employees.csv")

print("DataFrame from CSV (with explicit schema):")
csv_df_explicit.show()

print("\nExplicit Schema:")
csv_df_explicit.printSchema()


DataFrame from CSV (with explicit schema):
+-------+---+----------+------+
|   Name|Age|Department|Salary|
+-------+---+----------+------+
|  Alice| 25|     Sales| 50000|
|    Bob| 30|        IT| 60000|
|Charlie| 35|     Sales| 70000|
|  Diana| 28|        IT| 55000|
|    Eve| 32|     Sales| 65000|
|  Frank| 27|        HR| 52000|
|  Grace| 29|        IT| 58000|
|  Henry| 31|     Sales| 62000|
+-------+---+----------+------+


Explicit Schema:
root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Department: string (nullable = true)
 |-- Salary: integer (nullable = true)


**Key Differences: Schema Inference vs Explicit Schema**

| Aspect | Schema Inference | Explicit Schema |
|--------|------------------|----------------|
| **Performance** | Slower (reads data twice) | Faster (single read) |
| **Reliability** | May infer wrong types | Types are exactly what you declare; values that do not fit become NULL |
| **Use Case** | Exploration, prototyping | Production, large datasets |
| **Code** | Less code | More code (define schema) |

**Best Practice**: Use explicit schema in production for better performance and reliability.


### Reading from Parquet File

Parquet is a columnar storage format optimized for analytics workloads. It's the preferred format for big data processing because:
- **Efficient**: Columnar format allows reading only needed columns
- **Compressed**: Automatic compression reduces storage
- **Schema embedded**: Schema is stored in the file metadata
- **Partitioned**: Can be split into multiple files (parts) for parallel processing

**Note**: Parquet files are often stored as directories with multiple part files, which is normal and allows parallel reading.


In [22]:
# First, let's create a Parquet file with multiple parts for demonstration
# This simulates how Parquet files are typically stored in production

# Create sample product data
product_data = [
    (1, "Product A", 100, "Electronics"),
    (2, "Product B", 200, "Clothing"),
    (3, "Product C", 150, "Electronics"),
    (4, "Product D", 300, "Home"),
    (5, "Product E", 250, "Clothing"),
    (6, "Product F", 180, "Electronics"),
    (7, "Product G", 220, "Home"),
    (8, "Product H", 120, "Clothing"),
    (9, "Product I", 350, "Electronics"),
    (10, "Product J", 280, "Home")
]

product_schema = StructType([
    StructField("product_id", IntegerType(), True),
    StructField("product_name", StringType(), True),
    StructField("price", IntegerType(), True),
    StructField("category", StringType(), True)
])

# Create DataFrame
products_df = spark.createDataFrame(product_data, product_schema)

# Write to Parquet as at most 2 part files: coalesce(2) merges partitions down to 2
# (it never increases the count; use repartition(2) to force exactly 2)
# This is how Parquet files are typically stored - as directories with multiple parts
products_df.coalesce(2).write.mode("overwrite").parquet("data/products.parquet")

print("Parquet file created with multiple parts in data/products.parquet/")
print("Parquet files are stored as directories with part files for parallel processing")


Parquet file created with multiple parts in data/products.parquet/
Parquet files are stored as directories with part files for parallel processing


In [23]:
# Read Parquet file using standardized format
# Parquet files preserve schema, so no need to specify it
parquet_df = spark.read \
    .format("parquet") \
    .load("data/products.parquet")

print("DataFrame from Parquet:")
parquet_df.show()

print("\nSchema (preserved from Parquet file):")
parquet_df.printSchema()

print(f"\nNumber of rows: {parquet_df.count()}")
print(f"Number of partitions: {parquet_df.rdd.getNumPartitions()}")


DataFrame from Parquet:
+----------+------------+-----+-----------+
|product_id|product_name|price|   category|
+----------+------------+-----+-----------+
|         5|   Product E|  250|   Clothing|
|         6|   Product F|  180|Electronics|
|         7|   Product G|  220|       Home|
|         8|   Product H|  120|   Clothing|
|         9|   Product I|  350|Electronics|
|        10|   Product J|  280|       Home|
|         1|   Product A|  100|Electronics|
|         2|   Product B|  200|   Clothing|
|         3|   Product C|  150|Electronics|
|         4|   Product D|  300|       Home|
+----------+------------+-----+-----------+


Schema (preserved from Parquet file):
root
 |-- product_id: integer (nullable = true)
 |-- product_name: string (nullable = true)
 |-- price: integer (nullable = true)
 |-- category: string (nullable = true)


Number of rows: 10
Number of partitions: 2


**Understanding Parquet Part Files**

When you write a DataFrame to Parquet, Spark creates a directory with multiple part files:
- Each part file contains a portion of the data
- This allows parallel reading and processing
- The number of parts depends on the number of partitions in your DataFrame
- This is normal and expected behavior - you read the directory, not individual files

**Example**: `data/products.parquet/` contains:
- `part-00000-*.parquet` (first partition)
- `part-00001-*.parquet` (second partition)
- `_SUCCESS` (indicates successful write)
- hidden `.*.crc` files (checksums written alongside each file on a local filesystem)


In [24]:
# Verify the Parquet directory structure
import os

parquet_path = "data/products.parquet"
if os.path.exists(parquet_path):
    print("Parquet directory contents:")
    for item in os.listdir(parquet_path):
        print(f"  - {item}")
else:
    print("Parquet directory not found")


Parquet directory contents:
  - .part-00001-3e6ca9b2-190b-4d41-95c8-84652e85a7aa-c000.snappy.parquet.crc
  - ._SUCCESS.crc
  - part-00001-3e6ca9b2-190b-4d41-95c8-84652e85a7aa-c000.snappy.parquet
  - .part-00000-3e6ca9b2-190b-4d41-95c8-84652e85a7aa-c000.snappy.parquet.crc
  - _SUCCESS
  - part-00000-3e6ca9b2-190b-4d41-95c8-84652e85a7aa-c000.snappy.parquet


## Important: Lazy Evaluation

**Key Concept**: Spark uses **lazy evaluation**. This means:

- **Transformations** (like `filter`, `select`, `groupBy`) are NOT executed immediately
- They are recorded as a plan
- **Actions** (like `show()`, `count()`, `collect()`) trigger the actual execution
- Spark optimizes the entire plan before executing

**Why Lazy Evaluation?**
- Allows Spark to optimize the entire query plan
- Can combine multiple operations efficiently
- Only executes what's needed


In [25]:
# Example of lazy evaluation
# This doesn't execute immediately - it just creates a plan
filtered_df = df.filter(df.Age > 28)

print("This is just a plan, not executed yet!")
print(f"Type: {type(filtered_df)}")

# Only when we call an ACTION (like show()), the execution happens
print("\nNow executing the plan:")
filtered_df.show()


This is just a plan, not executed yet!
Type: <class 'pyspark.sql.dataframe.DataFrame'>

Now executing the plan:
+-------+---+------+
|   Name|Age|  City|
+-------+---+------+
|    Bob| 30|London|
|Charlie| 35| Tokyo|
+-------+---+------+



## Summary

In this notebook, you learned:

1. **What is PySpark**: Python API for Apache Spark, a distributed computing framework
2. **Why PySpark**: Handles big data, scalable, faster for large datasets
3. **SparkSession**: Entry point to PySpark (like a database connection)
4. **DataFrames**: Distributed data structures similar to Pandas DataFrames
5. **Lazy Evaluation**: Operations are optimized before execution
6. **Creating DataFrames**: 
   - From lists (in-memory data)
   - From CSV files (with schema inference or explicit schema)
   - From JSON files (schema automatically inferred)
   - From Parquet files (schema preserved, efficient columnar format)
7. **Schema Management**: Understanding when to use schema inference vs explicit schema

**Key Takeaways**: 
- PySpark is designed for big data processing. It uses lazy evaluation to optimize operations and can scale from a single machine to thousands of machines.
- Different file formats have different characteristics: CSV is common but slower, JSON is good for structured data, Parquet is optimal for analytics workloads.
- Schema inference is convenient for exploration but explicit schema is better for production.

**Next Steps**: In Module 2, we'll learn about reading and writing data from various file formats.
