# Spark SQL Getting Started - Practice Notebook

This notebook covers the fundamentals of Spark SQL based on the [official Spark SQL Getting Started Guide](https://spark.apache.org/docs/latest/sql-getting-started.html).

## Learning Objectives
- Understand SparkSession as the entry point to Spark functionality
- Create DataFrames from various sources (lists, files)
- Perform basic DataFrame operations
- Understand the difference between transformations and actions

## Sections
1. **SparkSession Initialization**
2. **Creating DataFrames from Python Data**
3. **Creating DataFrames from Files**
4. **Basic DataFrame Operations**
5. **Practice Exercises**

---


## 1. SparkSession Initialization

The **SparkSession** is the entry point to all Spark functionality. It provides a unified interface for working with Spark SQL, DataFrames, and Datasets.

### Key Points:
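- SparkSession (introduced in Spark 2.0) supersedes the older SQLContext and HiveContext entry points; the underlying SparkContext remains available as `spark.sparkContext`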
- Use `SparkSession.builder` to create a session
- Configure application name and options during creation
- Built-in support for Hive features (HiveQL, UDFs, Hive tables)


In [3]:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("Spark SQL getting started")
    # Placeholder option copied from the official guide; replace it with real settings as needed
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)

print(f"Spark Version: {spark.version}")

print(f"Application Name: {spark.conf.get('spark.app.name')}")
print(f"Master: {spark.conf.get('spark.master')}")

Spark Version: 4.0.0
Application Name: Spark SQL getting started
Master: local[*]


## 2. Creating DataFrames from Python Data

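DataFrames can be created from Python collections such as a list of tuples or a list of dictionaries. With tuples, column names are supplied separately; with dictionaries, Spark takes them from the keys, and in the output below they appear in alphabetical order rather than insertion order. Python integers are inferred as `long`.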


In [8]:
# Method 1: Create DataFrame from list of tuples
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
columns = ["name", "age"]

df_from_tuples = spark.createDataFrame(data, columns)
print("DataFrame from tuples:")
df_from_tuples.show()

# Method 2: Create DataFrame from list of dictionaries
data_dict = [
    {"name": "Alice", "age": 25, "city": "New York"},
    {"name": "Bob", "age": 30, "city": "San Francisco"},
    {"name": "Charlie", "age": 35, "city": "Chicago"},
]

df_from_dict = spark.createDataFrame(data_dict)
print("\nDataFrame from dictionaries:")
df_from_dict.show()

DataFrame from tuples:
+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
+-------+---+


DataFrame from dictionaries:
+---+-------------+-------+
|age|         city|   name|
+---+-------------+-------+
| 25|     New York|  Alice|
| 30|San Francisco|    Bob|
| 35|      Chicago|Charlie|
+---+-------------+-------+



In [10]:
# Print schema information
print("Schema of df_from_tuples:")
df_from_tuples.printSchema()

print("\nSchema of df_from_dict:")
df_from_dict.printSchema()

# Show DataFrame info
print(f"\nNumber of rows in df_from_dict: {df_from_dict.count()}")
print(f"Number of columns in df_from_dict: {len(df_from_dict.columns)}")
print(f"Column names: {df_from_dict.columns}")

Schema of df_from_tuples:
root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)


Schema of df_from_dict:
root
 |-- age: long (nullable = true)
 |-- city: string (nullable = true)
 |-- name: string (nullable = true)


Number of rows in df_from_dict: 3
Number of columns in df_from_dict: 3
Column names: ['age', 'city', 'name']


## 3. Creating DataFrames from Files

Spark can read data from file formats such as JSON, CSV, and Parquet. The cells below write a small JSON Lines file and a CSV file, then read both back.


In [12]:
import json
import os

import pandas as pd

os.makedirs("../data/raw", exist_ok=True)

sample_data = [
    {"name": "Bahadur", "age": None},
    {"name": "Jitesh", "age": 25},
    {"name": "Ramanand", "age": 32},
]

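# Write one JSON object per line (JSON Lines), the layout spark.read.json expects by default
with open("../data/raw/people.json", "w") as f:
    for record in sample_data:
        f.write(json.dumps(record) + "\n")

# Write the same records as CSV with a header row
pd.DataFrame(sample_data).to_csv("../data/raw/people.csv", index=False)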

print("Sample data files created successfully")

Sample data files created successfully


In [18]:
# Read JSON file
df_json = spark.read.json("../data/raw/people.json")
print("DataFrame from JSON:")
df_json.show()

# Read CSV file
df_csv = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv("../data/raw/people.csv")
)
print("\nDataFrame from CSV:")
df_csv.show()

# Compare schemas
print("\nJSON DataFrame schema:")
df_json.printSchema()

print("\nCSV DataFrame schema:")
df_csv.printSchema()

DataFrame from JSON:
+----+--------+
| age|    name|
+----+--------+
|NULL| Bahadur|
|  25|  Jitesh|
|  32|Ramanand|
+----+--------+


DataFrame from CSV:
+--------+----+
|    name| age|
+--------+----+
| Bahadur|NULL|
|  Jitesh|25.0|
|Ramanand|32.0|
+--------+----+


JSON DataFrame schema:
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)


CSV DataFrame schema:
root
 |-- name: string (nullable = true)
 |-- age: double (nullable = true)
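
The two schemas disagree on `age`: `long` from JSON, `double` from CSV. The likely cause is on the writing side. pandas represents the missing age as `NaN`, which turns the whole column into floats, so the file contains `25.0` and `32.0`. One way to keep whole numbers is to write the CSV with Python's built-in `csv` module, which emits `None` as an empty field. This is a sketch of that alternative; the helper name `write_people_csv` is hypothetical, and an in-memory buffer stands in for a real file.

```python
import csv
import io

sample_data = [
    {"name": "Bahadur", "age": None},
    {"name": "Jitesh", "age": 25},
    {"name": "Ramanand", "age": 32},
]


def write_people_csv(records, fileobj):
    """Write records as CSV with a header row; None becomes an empty field."""
    writer = csv.DictWriter(fileobj, fieldnames=["name", "age"])
    writer.writeheader()
    writer.writerows(records)


# Demonstrate with an in-memory buffer instead of touching the file system
buffer = io.StringIO()
write_people_csv(sample_data, buffer)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

To produce a file, pass `open(path, "w", newline="")` as `fileobj`. Reading such a file with `inferSchema` enabled should then yield an integer type for `age` instead of `double`.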



## 4. Basic DataFrame Operations

Now let's explore fundamental DataFrame operations: selecting columns, filtering rows, grouping, adding and renaming columns, and sorting.


In [24]:
# Use the JSON DataFrame for operations
df = df_json

# 1. Select specific columns
print("1. Select only 'name' column:")
df.select("name").show()

# 2. Select multiple columns with expressions
print("\n2. Select name and age + 1:")
df.select(df["name"], df["age"] + 1).show()

# 3. Filter rows
print("\n3. Filter people older than 21:")
df.filter(df["age"] > 21).show()

# 4. Group by and count
print("\n4. Count people by age:")
df.groupBy("age").count().show()

1. Select only 'name' column:
+--------+
|    name|
+--------+
| Bahadur|
|  Jitesh|
|Ramanand|
+--------+


2. Select name and age + 1:
+--------+---------+
|    name|(age + 1)|
+--------+---------+
| Bahadur|     NULL|
|  Jitesh|       26|
|Ramanand|       33|
+--------+---------+


3. Filter people older than 21:
+---+--------+
|age|    name|
+---+--------+
| 25|  Jitesh|
| 32|Ramanand|
+---+--------+


4. Count people by age:
+----+-----+
| age|count|
+----+-----+
|  32|    1|
|  25|    1|
|NULL|    1|
+----+-----+



In [30]:
# 5. Add new columns
print("5. Add a new column 'is_adult':")
df_with_adult = df.withColumn("is_adult", df["age"] >= 18)
df_with_adult.show()

# 6. Rename columns
print("\n6. Rename 'name' to 'full_name':")
df_renamed = df.withColumnRenamed("name", "full_name")
df_renamed.show()

# 7. Sort ascending (rows with a NULL age come first)
print("\n7. Sort by age (ascending):")
df.orderBy("age").show()

# 8. Sort descending (rows with a NULL age come last)
print("\n8. Sort by age (descending):")
df.orderBy(df["age"].desc()).show()

5. Add a new column 'is_adult':
+----+--------+--------+
| age|    name|is_adult|
+----+--------+--------+
|NULL| Bahadur|    NULL|
|  25|  Jitesh|    true|
|  32|Ramanand|    true|
+----+--------+--------+


6. Rename 'name' to 'full_name':
+----+---------+
| age|full_name|
+----+---------+
|NULL|  Bahadur|
|  25|   Jitesh|
|  32| Ramanand|
+----+---------+


7. Sort by age (ascending):
+----+--------+
| age|    name|
+----+--------+
|NULL| Bahadur|
|  25|  Jitesh|
|  32|Ramanand|
+----+--------+


8. Sort by age (descending):
+----+--------+
| age|    name|
+----+--------+
|  32|Ramanand|
|  25|  Jitesh|
|NULL| Bahadur|
+----+--------+



## 5. Practice Exercises

Now it's your turn! Complete these exercises to practice what you've learned.

### Exercise 1: Create Your Own DataFrame
Create a DataFrame with information about your favorite books including: title, author, year_published, and rating.


In [31]:
# Exercise 1: Create your books DataFrame here
# TODO: Create a DataFrame with at least 5 books
# Include columns: title, author, year_published, rating (1-5)

books_data = [
    {"title": "ramayan", "author": "valmiki", "year_published": "1200", "rating": 4},
    {"title": "mahabharat", "author": "badul", "year_published": "1300", "rating": 3},
    {"title": "abhil", "author": "chimbal", "year_published": "2000", "rating": 5},
    {"title": "fetul", "author": "samarin", "year_published": "2004", "rating": 2},
    {"title": "chidel", "author": "yakuzo", "year_published": "1996", "rating": 4},
]

# Create DataFrame and show it (column names come from the dictionary keys)
df_books = spark.createDataFrame(books_data)
df_books.show()

+-------+------+----------+--------------+
| author|rating|     title|year_published|
+-------+------+----------+--------------+
|valmiki|     4|   ramayan|          1200|
|  badul|     3|mahabharat|          1300|
|chimbal|     5|     abhil|          2000|
|samarin|     2|     fetul|          2004|
| yakuzo|     4|    chidel|          1996|
+-------+------+----------+--------------+



### Exercise 2: DataFrame Operations
Using the books DataFrame you created, perform the following operations:


In [41]:
# Exercise 2: DataFrame Operations
# TODO: Complete the following operations

# 1. Select only title and rating columns
df_books.select("title", "rating").show()

# 2. Filter books with rating >= 4
df_books.filter(df_books["rating"] >= 4).show()

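# 3. Add a new column 'age_of_book' (current year - year_published)
# year_published was stored as a string, so cast it to an integer before subtracting.
# The year 2025 is hard-coded so the output stays reproducible.
from pyspark.sql import functions as F

df_books.withColumn(
    "age_of_book", F.lit(2025) - df_books["year_published"].cast("int")
).show()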

# 4. Sort books by rating in descending order
df_books.orderBy(df_books["rating"].desc()).show()

# 5. Group by author and count the number of books
df_books.groupBy("author").count().show()

+----------+------+
|     title|rating|
+----------+------+
|   ramayan|     4|
|mahabharat|     3|
|     abhil|     5|
|     fetul|     2|
|    chidel|     4|
+----------+------+

+-------+------+-------+--------------+
| author|rating|  title|year_published|
+-------+------+-------+--------------+
|valmiki|     4|ramayan|          1200|
|chimbal|     5|  abhil|          2000|
| yakuzo|     4| chidel|          1996|
+-------+------+-------+--------------+

+-------+------+----------+--------------+-----------+
| author|rating|     title|year_published|age_of_book|
+-------+------+----------+--------------+-----------+
|valmiki|     4|   ramayan|          1200|        825|
|  badul|     3|mahabharat|          1300|        725|
|chimbal|     5|     abhil|          2000|         25|
|samarin|     2|     fetul|          2004|         21|
| yakuzo|     4|    chidel|          1996|         29|
+-------+------+----------+--------------+-----------+

+-------+------+----------+--------------+

### Exercise 3: File Operations
Create a CSV file with employee data and read it back into a DataFrame.


## Summary

In this notebook, you learned:

1. **SparkSession** - The entry point to Spark functionality
2. **Creating DataFrames** - From Python data structures and files
3. **Basic Operations** - Select, filter, group, sort, and transform data
4. **Schema Inspection** - Understanding DataFrame structure

## Next Steps

Continue to the next notebook: `02_dataframe_operations.ipynb` to dive deeper into DataFrame transformations and operations.

## References

- [Spark SQL Getting Started Guide](https://spark.apache.org/docs/latest/sql-getting-started.html)
- [PySpark SQL Module Documentation](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html)
