# PySpark 01 — Spark Basics

This notebook introduces the basics of working with Apache Spark using PySpark. Topics covered:

1. Creating a SparkSession
2. Constructing a DataFrame from Python data
3. Exploring DataFrame structure
4. Viewing data

# 1. Creating a SparkSession
The entry point to Spark is the SparkSession. It allows access to all Spark features from PySpark.

In PySpark, the **`SparkSession`** is part of the **Structured API** (which includes DataFrames and SQL-like operations). That API lives in the `pyspark.sql` module, which is why the import below comes from `pyspark.sql` even when you are not writing any SQL.


In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark Basics") \
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/05/05 19:55:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/05/05 19:55:29 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


# 2. Constructing a DataFrame from Python data
You can create a DataFrame directly from a list of tuples and specify column names.

In [22]:
# Sample Python data (list of tuples)
data = [
    ("Alice", 25),
    ("Bob", 32),
    ("Cathy", 19)
]

# Define column names
columns = ["name", "age"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Show the result
df.show()



+-----+---+
| name|age|
+-----+---+
|Alice| 25|
|  Bob| 32|
|Cathy| 19|
+-----+---+

Even for small tasks, Spark starts a whole execution engine under the hood, which includes:
* Starting the JVM (Java Virtual Machine)
* Initializing the SparkContext
* Creating a thread pool
* Allocating memory for executors
* Setting up logging and scanning the environment

This startup cost can easily take 10–20 seconds, so the first action in a fresh session is usually much slower than the ones that follow.

In [23]:
df2 = spark.createDataFrame([("Dan", 40)], ["name", "age"])
df2.show()

+----+---+
|name|age|
+----+---+
| Dan| 40|
+----+---+

# 3. Exploring DataFrame structure
After creating a DataFrame, you can inspect its schema, column names, data types, and sample rows using these methods:

In [24]:
# Show the schema (column names and data types)
df.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)



In [25]:
# Get list of column names
df.columns

['name', 'age']

In [26]:
# Get (column name, data type) pairs
df.dtypes

[('name', 'string'), ('age', 'bigint')]

# 4. Viewing data

In [27]:
# Display rows in a tabular format (show() prints up to 20 rows by default)
df.show()

+-----+---+
| name|age|
+-----+---+
|Alice| 25|
|  Bob| 32|
|Cathy| 19|
+-----+---+

In [28]:
# View first 3 rows as Row objects
df.head(3)

[Row(name='Alice', age=25), Row(name='Bob', age=32), Row(name='Cathy', age=19)]

In [29]:
# Stop the SparkSession and release its resources
spark.stop()