# 📝 What is PySpark?

- PySpark is the Python API for Apache Spark.
- It is used for distributed big data processing and analytics.
- It works with large datasets through DataFrames.
- Common DataFrame operations include:
   - Filtering rows
   - Grouping data
   - Joining tables
   - Aggregating data

In [1]:
!pip install pyspark



In [2]:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("Sri").getOrCreate()

# Check the Spark version
print("Apache Spark version:", spark.version)

Apache Spark version: 3.5.1


# Creating a DataFrame

In [3]:
# Sample data
data = [("Rahul", 21), ("Priya", 22), ("Sri", 21)]

# Define column names (Spark infers the column types from the data)
columns = ["Name", "Age"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

# Show the DataFrame
df.show()

+-----+---+
| Name|Age|
+-----+---+
|Rahul| 21|
|Priya| 22|
|  Sri| 21|
+-----+---+



# DataFrame Operations

In [4]:
# Select a column (df is the DataFrame created above)
df.select("Name").show()

# Filter rows
df.filter(df["Age"] > 21).show()

# Count rows
print("Total rows:", df.count())

# Group by a column
df.groupBy("Age").count().show()

+-----+
| Name|
+-----+
|Rahul|
|Priya|
|  Sri|
+-----+

+-----+---+
| Name|Age|
+-----+---+
|Priya| 22|
+-----+---+

Total rows: 3
+---+-----+
|Age|count|
+---+-----+
| 21|    2|
| 22|    1|
+---+-----+



# Reading CSV Data in Python

- csv_data → CSV content stored as a string.
- StringIO → Treats the string as a file so Python can read it like a CSV file.
- DictReader → Reads CSV rows as dictionaries with column names as keys.

In [1]:
import csv
import io

# Step 1: Create CSV data as a string
csv_data = """id,name,department,salary
1,Rahul Sharma,IT,55000
2,Priya Singh,HR,60000
3,Aman Kumar,Finance,48000
4,Sneha Reddy,Marketing,52000
5,Arjun Mehta,IT,75000
"""

# Step 2: Use StringIO to treat the string like a file
file_like = io.StringIO(csv_data)

# Step 3: Read the CSV using DictReader
reader = csv.DictReader(file_like)

# Step 4: Iterate through the rows and print them
for row in reader:
    print(f"{row['id']} - {row['name']} ({row['department']}) -> rs.{row['salary']}")

1 - Rahul Sharma (IT) -> rs.55000
2 - Priya Singh (HR) -> rs.60000
3 - Aman Kumar (Finance) -> rs.48000
4 - Sneha Reddy (Marketing) -> rs.52000
5 - Arjun Mehta (IT) -> rs.75000
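
One detail worth noting: `DictReader` returns every field as a string, so numeric columns need converting before any arithmetic. The sketch below totals the salary per department from the same CSV data; the `totals` name is just an illustrative choice.

```python
import csv
import io
from collections import defaultdict

csv_data = """id,name,department,salary
1,Rahul Sharma,IT,55000
2,Priya Singh,HR,60000
3,Aman Kumar,Finance,48000
4,Sneha Reddy,Marketing,52000
5,Arjun Mehta,IT,75000
"""

reader = csv.DictReader(io.StringIO(csv_data))

# Every value from DictReader is a str, so convert salary with int() before adding
totals = defaultdict(int)
for row in reader:
    totals[row["department"]] += int(row["salary"])

for department, total in sorted(totals.items()):
    print(f"{department}: rs.{total}")
```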


# Reading JSON Data in Python

- json_data → JSON content stored as a string.
- json.loads() → Converts JSON string into a Python list of dictionaries.
- Iteration → Prints each record in a readable format.

In [2]:
import json

# Step 1: Create JSON data as a string
json_data = '''
[
  { "id": 1, "name": "Rahul Sharma", "age": 21, "city": "Bangalore" },
  { "id": 2, "name": "Priya Singh", "age": 22, "city": "Delhi" },
  { "id": 3, "name": "Aman Kumar", "age": 20, "city": "Hyderabad" }
]
'''

# Step 2: Parse the JSON string into a Python list of dictionaries
students = json.loads(json_data)

# Step 3: Process the data
print("Student Records:")
for s in students:
    print(f"{s['id']} - {s['name']} ({s['city']}) -> Age {s['age']}")

Student Records:
1 - Rahul Sharma (Bangalore) -> Age 21
2 - Priya Singh (Delhi) -> Age 22
3 - Aman Kumar (Hyderabad) -> Age 20
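
`json.loads()` raises `json.JSONDecodeError` when the string is not valid JSON. A minimal sketch of handling that case follows; `parse_students` is a hypothetical helper, and returning an empty list on failure is an assumption about what the caller wants.

```python
import json


def parse_students(text):
    """Return the parsed data, or an empty list if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        # lineno, colno and msg describe where and why parsing failed
        print(f"Invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}")
        return []


good = parse_students('[{"id": 1, "name": "Rahul Sharma"}]')
bad = parse_students('[{"id": 1, "name": "Rahul Sharma",}]')  # trailing comma is not allowed

print(good)
print(bad)
```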


# Updating JSON Data in Python

In [3]:
import json

# Step 1: JSON data in memory
json_data = '''
[
  { "id": 1, "name": "Rahul Sharma", "age": 21, "city": "Bangalore" },
  { "id": 2, "name": "Priya Singh", "age": 22, "city": "Delhi" }
]
'''

# Step 2: Load the JSON into a Python list
students = json.loads(json_data)

# Step 3: Add a new student
new_student = {
    "id": 3,
    "name": "Aman Kumar",
    "age": 20,
    "city": "Hyderabad"
}

students.append(new_student)

# Step 4: Update an existing student
for s in students:
    if s["id"] == 1:
        s["city"] = "pune"

# Step 5: Convert back to a JSON string
updated_json = json.dumps(students, indent=2)

# Print the result
print("Updated JSON Data:\n", updated_json)


Updated JSON Data:
 [
  {
    "id": 1,
    "name": "Rahul Sharma",
    "age": 21,
    "city": "pune"
  },
  {
    "id": 2,
    "name": "Priya Singh",
    "age": 22,
    "city": "Delhi"
  },
  {
    "id": 3,
    "name": "Aman Kumar",
    "age": 20,
    "city": "Hyderabad"
  }
]
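
The updated records can also be written to a file and read back with `json.dump()` and `json.load()`. This is a sketch only: the `students.json` file name and the temporary directory are assumptions chosen so that nothing existing is overwritten.

```python
import json
import os
import tempfile

students = [
    {"id": 1, "name": "Rahul Sharma", "age": 21, "city": "pune"},
    {"id": 2, "name": "Priya Singh", "age": 22, "city": "Delhi"},
]

# Hypothetical output location: a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "students.json")

# json.dump writes straight to a file object (json.dumps returns a string)
with open(path, "w", encoding="utf-8") as f:
    json.dump(students, f, indent=2)

# json.load reads from a file object (json.loads parses a string)
with open(path, encoding="utf-8") as f:
    reloaded = json.load(f)

print(reloaded == students)
```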


In [6]:
!pip install pyspark

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("Sri").getOrCreate()



In [7]:
csv_data = """id,name,department,salary
1,Rahul Sharma,IT,55000
2,Priya Singh,HR,60000
3,Aman Kumar,Finance,48000
4,Sneha Reddy,Marketing,52000
5,Arjun Mehta,IT,75000
6,Divya Nair,Finance,67000
"""

# Write the CSV content to a local file so Spark can read it
with open("employees.csv", "w") as f:
    f.write(csv_data)

In [8]:
# header=True uses the first line as column names; inferSchema=True detects column types
df = spark.read.csv("employees.csv", header=True, inferSchema=True)
df.show()

+---+------------+----------+------+
| id|        name|department|salary|
+---+------------+----------+------+
|  1|Rahul Sharma|        IT| 55000|
|  2| Priya Singh|        HR| 60000|
|  3|  Aman Kumar|   Finance| 48000|
|  4| Sneha Reddy| Marketing| 52000|
|  5| Arjun Mehta|        IT| 75000|
|  6|  Divya Nair|   Finance| 67000|
+---+------------+----------+------+



# Transformations


---

# 📝 Key Points about Transformations

* **Lazy Execution**:
  Spark doesn’t run transformations right away. Instead, it builds a **logical plan** (a DAG – Directed Acyclic Graph).
  The computation only runs when an **action** (like `.show()` or `.count()`) is called.

* **Return Type**:
  A transformation always returns a **new DataFrame or RDD**. It does **not modify the existing one**.

* **Two Types of Transformations**:

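  1. **Narrow Transformations** → Each input partition contributes to only one output partition, so no data moves between partitions.
     (e.g., `map()` on RDDs, `filter()`, `select()`)
  2. **Wide Transformations** → An output partition may need rows from many input partitions, so data is shuffled across partitions.
     (e.g., `groupBy()`, `join()`, `orderBy()`)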

---

In [9]:
# Select the name and salary columns
df.select("name", "salary").show()

# Filter employees with salary > 60,000
df.filter(df["salary"] > 60000).show()

# Order by salary, descending
df.orderBy(df["salary"].desc()).show()

+------------+------+
|        name|salary|
+------------+------+
|Rahul Sharma| 55000|
| Priya Singh| 60000|
|  Aman Kumar| 48000|
| Sneha Reddy| 52000|
| Arjun Mehta| 75000|
|  Divya Nair| 67000|
+------------+------+

+---+-----------+----------+------+
| id|       name|department|salary|
+---+-----------+----------+------+
|  5|Arjun Mehta|        IT| 75000|
|  6| Divya Nair|   Finance| 67000|
+---+-----------+----------+------+

+---+------------+----------+------+
| id|        name|department|salary|
+---+------------+----------+------+
|  5| Arjun Mehta|        IT| 75000|
|  6|  Divya Nair|   Finance| 67000|
|  2| Priya Singh|        HR| 60000|
|  1|Rahul Sharma|        IT| 55000|
|  4| Sneha Reddy| Marketing| 52000|
|  3|  Aman Kumar|   Finance| 48000|
+---+------------+----------+------+
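
To make the narrow versus wide distinction more concrete, here is a plain-Python sketch. It is not Spark code: partitions are modeled as ordinary lists, and `target_partition` is a hypothetical stand-in for Spark's own partitioning logic.

```python
# Two hand-made "partitions" of (department, salary) records
partitions = [
    [("IT", 55000), ("HR", 60000), ("Finance", 48000)],
    [("Marketing", 52000), ("IT", 75000), ("Finance", 67000)],
]

# Narrow (like filter): each partition is processed independently
filtered = [[rec for rec in part if rec[1] > 50000] for part in partitions]


def target_partition(key, num_partitions):
    # Hypothetical partitioner for illustration; Spark uses its own hashing
    return sum(ord(ch) for ch in key) % num_partitions


# Wide (like groupBy): rows with the same key must end up together,
# so records move between partitions. This movement is the shuffle.
num_out = 2
shuffled = [[] for _ in range(num_out)]
for part in filtered:
    for dept, salary in part:
        shuffled[target_partition(dept, num_out)].append((dept, salary))

for i, part in enumerate(shuffled):
    print(i, part)
```

After the shuffle step, every department lives in exactly one output partition, which is what allows a per-group aggregate to be computed locally.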



# Aggregation
# 📝 What is Aggregation?

* Aggregation is an operation that **groups rows** and applies a **summary function** (such as sum, avg, count, min, or max) to each group.
* Used to answer questions like:

  * *“What is the average salary per department?”*
  * *“How many employees are in each department?”*
  * *“What is the highest salary in Finance?”*

In [10]:
# Average salary per department
df.groupBy("department").avg("salary").show()

# Maximum salary per department
df.groupBy("department").max("salary").show()

# Count employees per department
df.groupBy("department").count().show()

+----------+-----------+
|department|avg(salary)|
+----------+-----------+
|        HR|    60000.0|
|   Finance|    57500.0|
| Marketing|    52000.0|
|        IT|    65000.0|
+----------+-----------+

+----------+-----------+
|department|max(salary)|
+----------+-----------+
|        HR|      60000|
|   Finance|      67000|
| Marketing|      52000|
|        IT|      75000|
+----------+-----------+

+----------+-----+
|department|count|
+----------+-----+
|        HR|    1|
|   Finance|    2|
| Marketing|    1|
|        IT|    2|
+----------+-----+



In [11]:
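# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("employees")

# The same average-salary aggregation, written as a SQL query
spark.sql("SELECT department, AVG(salary) AS avg_salary FROM employees GROUP BY department").show()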

+----------+----------+
|department|avg_salary|
+----------+----------+
|        HR|   60000.0|
|   Finance|   57500.0|
| Marketing|   52000.0|
|        IT|   65000.0|
+----------+----------+

