# Getting Started with Spark SQL

In this practice notebook, we'll learn the basics of working with Spark SQL. Follow the instructions in each section and fill in the code cells with your solutions.


## 1. SparkSession Initialization

The **SparkSession** is the entry point to all Spark functionality. It provides a unified interface for working with Spark SQL, DataFrames, and Datasets.

### Key Points:
- SparkSession replaces the older SparkContext + SQLContext pattern
- Use `SparkSession.builder` to create a session
- Configure application name and options during creation
- Built-in support for Hive features (HiveQL, Hive UDFs, Hive tables) once the session is built with `enableHiveSupport()`


In [5]:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("Spark SQL Getting Started")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)

print(f"Spark Version: {spark.version}")
print(f"Application Name: {spark.conf.get('spark.app.name')}")
print(f"Master: {spark.conf.get('spark.master')}")

Spark Version: 4.0.0
Application Name: Spark SQL Getting Started
Master: local[*]


## 2. Creating DataFrames from Python Data

DataFrames can be created from various Python data structures like lists, tuples, and dictionaries.


In [6]:
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
columns = ["name", "age"]

df_from_tuples = spark.createDataFrame(data, columns)
df_from_tuples.show()

+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
+-------+---+



In [7]:
data_dict = [
    {"name": "Alice", "age": 25, "city": "New York"},
    {"name": "Bob", "age": 30, "city": "San Francisco"},
    {"name": "Charlie", "age": 30, "city": "San Francisco"},
]

df_from_dict = spark.createDataFrame(data_dict)
df_from_dict.show()

+---+-------------+-------+
|age|         city|   name|
+---+-------------+-------+
| 25|     New York|  Alice|
| 30|San Francisco|    Bob|
| 30|San Francisco|Charlie|
+---+-------------+-------+



In [8]:
df_from_tuples.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)



In [9]:
df_from_dict.printSchema()

root
 |-- age: long (nullable = true)
 |-- city: string (nullable = true)
 |-- name: string (nullable = true)



In [10]:
df_from_dict.count()

3

In [11]:
df_from_dict.columns

['age', 'city', 'name']

## 3. Creating DataFrames from Files

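Spark can read data from various file formats, including JSON, CSV, and Parquet. Let's create sample files and read them back. The JSON file is written with one object per line (JSON Lines), which is the layout `spark.read.json` expects by default.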


In [12]:
import json
import pandas as pd
import os

os.makedirs("../data", exist_ok=True)

# Sample data
sample_data = [
    {"name": "Michael", "age": None},
    {"name": "Andy", "age": 30},
    {"name": "Justin", "age": 19},
]

with open("../data/people.json", "w") as f:
    for record in sample_data:
        f.write(json.dumps(record) + "\n")

pd.DataFrame(sample_data).to_csv("../data/people.csv", index=False)

In [13]:
df_json = spark.read.json("../data/people.json")
df_json.show()

+----+-------+
| age|   name|
+----+-------+
|NULL|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



In [14]:
df_csv = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv("../data/people.csv")
)
df_csv.show()

+-------+----+
|   name| age|
+-------+----+
|Michael|NULL|
|   Andy|30.0|
| Justin|19.0|
+-------+----+



In [15]:
df_json.printSchema()

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)



In [16]:
df_csv.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: double (nullable = true)
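
The `double` type for `age` comes from the CSV file rather than from Spark itself: pandas stores a numeric column containing `None` as `float64`, so the file holds `30.0` and `19.0`, and `inferSchema` reads those as doubles. One way to keep whole numbers, sketched here with pandas' nullable `Int64` dtype, is to convert the column before writing. The in-memory buffer stands in for `../data/people.csv` and is only there to keep the example self-contained.

```python
import io

import pandas as pd

sample_data = [
    {"name": "Michael", "age": None},
    {"name": "Andy", "age": 30},
    {"name": "Justin", "age": 19},
]

pdf = pd.DataFrame(sample_data)
float_dtype = str(pdf["age"].dtype)  # the missing value forces a float column

# Nullable integer dtype keeps whole numbers and leaves the gap empty.
pdf["age"] = pdf["age"].astype("Int64")

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
pdf.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

With the values written as `30` and `19`, schema inference on the CSV should no longer need a floating-point type for this column.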



## 4. Basic DataFrame Operations

Now let's explore fundamental DataFrame operations including selections, filtering, and transformations.


In [19]:
df = df_json
df.show()

+----+-------+
| age|   name|
+----+-------+
|NULL|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



In [21]:
df.select("name").show()

+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+



In [22]:
df.select(df["name"], df["age"] + 1).show()

+-------+---------+
|   name|(age + 1)|
+-------+---------+
|Michael|     NULL|
|   Andy|       31|
| Justin|       20|
+-------+---------+



In [25]:
df.filter(df["age"] > 21).show()

+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+



In [26]:
df.groupBy(df["age"]).count().show()

+----+-----+
| age|count|
+----+-----+
|  19|    1|
|NULL|    1|
|  30|    1|
+----+-----+



In [27]:
df_with_adult = df.withColumn("is_adult", df["age"] >= 18)
df_with_adult.show()

+----+-------+--------+
| age|   name|is_adult|
+----+-------+--------+
|NULL|Michael|    NULL|
|  30|   Andy|    true|
|  19| Justin|    true|
+----+-------+--------+



In [28]:
df_renamed = df.withColumnRenamed("name", "full_name")
df_renamed.show()

+----+---------+
| age|full_name|
+----+---------+
|NULL|  Michael|
|  30|     Andy|
|  19|   Justin|
+----+---------+



In [30]:
df.orderBy("age").show()

+----+-------+
| age|   name|
+----+-------+
|NULL|Michael|
|  19| Justin|
|  30|   Andy|
+----+-------+



In [32]:
df.orderBy(df["age"].desc()).show()

+----+-------+
| age|   name|
+----+-------+
|  30|   Andy|
|  19| Justin|
|NULL|Michael|
+----+-------+



## 5. Practice Exercises

Now it's your turn! Complete these exercises to practice what you've learned.

### Exercise 1: Create Your Own DataFrame
Create a DataFrame with information about your favorite books including: title, author, year_published, and rating.


In [33]:
# Exercise 1: Create your books DataFrame here
# TODO: Create a DataFrame with at least 5 books
# Include columns: title, author, year_published, rating (1-5)

books_data = [
    ("Tom & Jerry", "Tom Babu", 2013, 1),
    ("Jinthak", "Jinthak Babu", 2014, 2),
    ("Chintul", "Chintul Babu", 2010, 3),
    ("Kifalo", "Kifalo Babu", 2020, 4),
    ("Chivulalu", "Chivulalu Babu", 2005, 5),
]
columns = ["title", "author", "year_published", "rating"]

# Create DataFrame and show it
df_books = spark.createDataFrame(books_data, columns)
df_books.show()

+-----------+--------------+--------------+------+
|      title|        author|year_published|rating|
+-----------+--------------+--------------+------+
|Tom & Jerry|      Tom Babu|          2013|     1|
|    Jinthak|  Jinthak Babu|          2014|     2|
|    Chintul|  Chintul Babu|          2010|     3|
|     Kifalo|   Kifalo Babu|          2020|     4|
|  Chivulalu|Chivulalu Babu|          2005|     5|
+-----------+--------------+--------------+------+



### Exercise 2: DataFrame Operations
Using the books DataFrame you created, perform the following operations:


In [None]:
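from pyspark.sql.functions import lit

# Add the book's age as a column; lit() wraps the Python int as a Spark literal
current_year = 2024
df_books = df_books.withColumn(
    "age_of_book", lit(current_year) - df_books["year_published"]
)
df_books.show()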

In [None]:
from pyspark.sql.functions import lit

# Exercise 2: DataFrame Operations
# TODO: Complete the following operations

# 1. Select only title and rating columns
df_books.select(["title", "rating"]).show()

# 2. Filter books with rating >= 4
df_books.filter(df_books["rating"] >= 4).show()

# 3. Add a new column 'age_of_book' (current year - year_published)
# The result is only displayed, not assigned, so df_books itself is unchanged.
current_year = 2025
df_books.withColumn(
    "age_of_book", lit(current_year) - df_books["year_published"]
).show()

+-----------+------+
|      title|rating|
+-----------+------+
|Tom & Jerry|     1|
|    Jinthak|     2|
|    Chintul|     3|
|     Kifalo|     4|
|  Chivulalu|     5|
+-----------+------+

+---------+--------------+--------------+------+-----------+
|    title|        author|year_published|rating|age_of_book|
+---------+--------------+--------------+------+-----------+
|   Kifalo|   Kifalo Babu|          2020|     4|          4|
|Chivulalu|Chivulalu Babu|          2005|     5|         19|
+---------+--------------+--------------+------+-----------+

+-----------+--------------+--------------+------+-----------+
|      title|        author|year_published|rating|age_of_book|
+-----------+--------------+--------------+------+-----------+
|Tom & Jerry|      Tom Babu|          2013|     1|         12|
|    Jinthak|  Jinthak Babu|          2014|     2|         11|
|    Chintul|  Chintul Babu|          2010|     3|         15|
|     Kifalo|   Kifalo Babu|          2020|     4|          5|
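|  Chivulalu|Chivulalu Babu|          2005|     5|         20|
+-----------+--------------+--------------+------+-----------+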

In [45]:
# 4. Sort books by rating in descending order
df_books.orderBy(df_books["rating"].desc()).show()

# 5. Group by author and count the number of books

+-----------+--------------+--------------+------+-----------+
|      title|        author|year_published|rating|age_of_book|
+-----------+--------------+--------------+------+-----------+
|  Chivulalu|Chivulalu Babu|          2005|     5|         19|
|     Kifalo|   Kifalo Babu|          2020|     4|          4|
|    Chintul|  Chintul Babu|          2010|     3|         14|
|    Jinthak|  Jinthak Babu|          2014|     2|         10|
|Tom & Jerry|      Tom Babu|          2013|     1|         11|
+-----------+--------------+--------------+------+-----------+



### Exercise 3: File Operations
Create a CSV file with employee data and read it back into a DataFrame.


In [48]:
# Exercise 3: File Operations
# TODO: Complete the following

# 1. Create employee data as a list of dictionaries
# Include: employee_id, name, department, salary
employees = [
    {"employee_id": 1, "name": "Alice Smith", "department": "HR", "salary": 60000},
    {
        "employee_id": 2,
        "name": "Bob Johnson",
        "department": "Engineering",
        "salary": 95000,
    },
    {
        "employee_id": 3,
        "name": "Charlie Lee",
        "department": "Marketing",
        "salary": 70000,
    },
    {"employee_id": 4, "name": "Dana White", "department": "Finance", "salary": 80000},
    {"employee_id": 5, "name": "Evan Brown", "department": "Sales", "salary": 65000},
]

# 2. Convert to pandas DataFrame and save as CSV
# A separate name keeps the pandas frame distinct from the Spark DataFrames
employees_pdf = pd.DataFrame(employees)
employees_pdf.to_csv("../data/employees.csv", index=False)

In [51]:
# 3. Read the CSV file back using Spark
df = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv("../data/employees.csv")
)

# df = spark.read.csv("../data/employees.csv", header=True, sep=",")

df.show()

+-----------+-----------+-----------+------+
|employee_id|       name| department|salary|
+-----------+-----------+-----------+------+
|          1|Alice Smith|         HR| 60000|
|          2|Bob Johnson|Engineering| 95000|
|          3|Charlie Lee|  Marketing| 70000|
|          4| Dana White|    Finance| 80000|
|          5| Evan Brown|      Sales| 65000|
+-----------+-----------+-----------+------+



In [52]:
# 4. Display the DataFrame and its schema
df.show()

+-----------+-----------+-----------+------+
|employee_id|       name| department|salary|
+-----------+-----------+-----------+------+
|          1|Alice Smith|         HR| 60000|
|          2|Bob Johnson|Engineering| 95000|
|          3|Charlie Lee|  Marketing| 70000|
|          4| Dana White|    Finance| 80000|
|          5| Evan Brown|      Sales| 65000|
+-----------+-----------+-----------+------+



In [53]:
df.printSchema()

root
 |-- employee_id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: integer (nullable = true)

