# **Module 1: Setup & SparkSession Initialization**
**Tasks:**

Initialize Spark with:

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
.appName("BotCampus PySpark Practice") \
.master("local[*]") \
.getOrCreate()

Create a DataFrame from:

In [14]:
data = [
("Anjali", "Bangalore", 24),
("Ravi", "Hyderabad", 28),
("Kavya", "Delhi", 22),
("Meena", "Chennai", 25),
("Arjun", "Mumbai", 30)
]
columns = ["name", "city", "age"]

df = spark.createDataFrame(data, columns)

df.show()

+------+---------+---+
|  name|     city|age|
+------+---------+---+
|Anjali|Bangalore| 24|
|  Ravi|Hyderabad| 28|
| Kavya|    Delhi| 22|
| Meena|  Chennai| 25|
| Arjun|   Mumbai| 30|
+------+---------+---+



Show the schema and explain the data types.

In [5]:
df.printSchema()

root
 |-- name: string (nullable = true)
 |-- city: string (nullable = true)
 |-- age: long (nullable = true)



In [9]:
df.dtypes

[('name', 'string'), ('city', 'string'), ('age', 'bigint')]

Spark inferred these types from the Python values: `str` becomes `string` and `int` becomes `long` (a 64-bit integer, which `dtypes` reports under its SQL name `bigint`). Every column is marked nullable because no explicit schema was supplied.

Convert to an RDD and print the output of `.collect()` and `df.rdd.map()`.

In [13]:
rdd = df.rdd
print(rdd.collect())
print(rdd.map(lambda x: (x.name.upper(), x.age + 1)).collect())

[Row(name='Anjali', city='Bangalore', age=24), Row(name='Ravi', city='Hyderabad', age=28), Row(name='Kavya', city='Delhi', age=22), Row(name='Meena', city='Chennai', age=25), Row(name='Arjun', city='Mumbai', age=30)]
[('ANJALI', 25), ('RAVI', 29), ('KAVYA', 23), ('MEENA', 26), ('ARJUN', 31)]


# **Module 2: RDDs & Transformations**

**Scenario: You received free-text app feedback from users.**

In [15]:
sc = spark.sparkContext

feedback = sc.parallelize([
"Ravi from Bangalore loved the delivery",
"Meena from Hyderabad had a late order",
"Ajay from Pune liked the service",
"Anjali from Delhi faced UI issues",
"Rohit from Mumbai gave positive feedback"
])

**Tasks:**

Split each line into words (`flatMap`).

In [17]:
word = feedback.flatMap(lambda x: x.split(" "))
print(word.collect())

['Ravi', 'from', 'Bangalore', 'loved', 'the', 'delivery', 'Meena', 'from', 'Hyderabad', 'had', 'a', 'late', 'order', 'Ajay', 'from', 'Pune', 'liked', 'the', 'service', 'Anjali', 'from', 'Delhi', 'faced', 'UI', 'issues', 'Rohit', 'from', 'Mumbai', 'gave', 'positive', 'feedback']


Remove stop words (`from`, `the`, etc.).

In [18]:
value = {"from", "a", "the", "had", "gave"}

fil_word = word.filter(lambda x: x.lower() not in value)
count = fil_word.count()
print("Number of words after stop-word removal:", count)

Number of words after stop-word removal: 21


Count each word's frequency using `reduceByKey`.

In [23]:
word_dict = dict(word.map(lambda x: (x.lower(), 1)).reduceByKey(lambda a, b: a + b).collect())
print("Count of each word:", word_dict)

Count of each word: {'from': 5, 'loved': 1, 'liked': 1, 'service': 1, 'anjali': 1, 'faced': 1, 'issues': 1, 'rohit': 1, 'mumbai': 1, 'positive': 1, 'feedback': 1, 'ravi': 1, 'bangalore': 1, 'the': 2, 'delivery': 1, 'meena': 1, 'hyderabad': 1, 'had': 1, 'a': 1, 'late': 1, 'order': 1, 'ajay': 1, 'pune': 1, 'delhi': 1, 'ui': 1, 'gave': 1}
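
As a rough mental model, `reduceByKey` groups pairs by key and folds each group's values with the supplied function. The plain-Python sketch below mimics that on two of the feedback lines. `reduce_by_key` is a hypothetical helper written for illustration, not part of PySpark, and it ignores partitioning and shuffling entirely.

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    """Group (key, value) pairs by key, then fold each group's values with func."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}

lines = [
    "Ravi from Bangalore loved the delivery",
    "Ajay from Pune liked the service",
]

# Equivalent of flatMap(split) followed by map(lambda w: (w.lower(), 1))
pairs = [(w.lower(), 1) for line in lines for w in line.split(" ")]

counts = reduce_by_key(pairs, lambda a, b: a + b)
print(counts["from"], counts["the"], counts["ravi"])
```

In Spark the same fold runs per partition first and the partial results are then merged, which is why the function passed to `reduceByKey` should be associative and commutative.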


Find top 3 most frequent non-stop words.

In [25]:
from collections import Counter

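# Use the stop-word-filtered RDD so that "from" and "the" are excluded
common_words = fil_word.map(lambda x: x.lower()).countByValue()
common_words = dict(Counter(common_words).most_common(3))
print("The top 3 most common words are: ", common_words)

(Each of the 21 remaining words occurs exactly once in this sample, so the three words returned are an arbitrary pick among tied entries.)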
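
As a cross-check on the stop-word step, here is a plain-Python sketch (no Spark) that applies the same stop-word set to the five feedback lines. The names `feedback_lines`, `stop_words`, and `non_stop` are illustrative only. In this sample every non-stop word occurs once, so any "top 3" is a tie.

```python
from collections import Counter

feedback_lines = [
    "Ravi from Bangalore loved the delivery",
    "Meena from Hyderabad had a late order",
    "Ajay from Pune liked the service",
    "Anjali from Delhi faced UI issues",
    "Rohit from Mumbai gave positive feedback",
]
stop_words = {"from", "a", "the", "had", "gave"}

# Mirror of flatMap + lower-casing, then the stop-word filter
words = [w.lower() for line in feedback_lines for w in line.split(" ")]
non_stop = [w for w in words if w not in stop_words]

freq = Counter(non_stop)
print(len(non_stop))         # words left after filtering
print(freq.most_common(3))   # three entries, all with the same count here
```

With a larger corpus the ranking becomes meaningful; on a tiny sample like this one it mostly confirms that the filter was applied before counting.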


# **Module 3: DataFrames & Transformation (With Joins)**
**DataFrames:**

In [26]:
students = [
("Amit", "10-A", 89),
("Kavya", "10-B", 92),
("Anjali", "10-A", 78),
("Rohit", "10-B", 85),
("Sneha", "10-C", 80)
]
columns = ["name", "section", "marks"]

df_students = spark.createDataFrame(students, columns)

df_students.show()

+------+-------+-----+
|  name|section|marks|
+------+-------+-----+
|  Amit|   10-A|   89|
| Kavya|   10-B|   92|
|Anjali|   10-A|   78|
| Rohit|   10-B|   85|
| Sneha|   10-C|   80|
+------+-------+-----+



In [27]:
attendance = [
("Amit", 24),
("Kavya", 22),
("Anjali", 20),
("Rohit", 25),
("Sneha", 19)
]
columns2 = ["name", "days_present"]

df_attendance = spark.createDataFrame(attendance, columns2)

df_attendance.show()

+------+------------+
|  name|days_present|
+------+------------+
|  Amit|          24|
| Kavya|          22|
|Anjali|          20|
| Rohit|          25|
| Sneha|          19|
+------+------------+



Join both DataFrames on `name`.

In [30]:
from pyspark.sql.functions import col, when

df_joined = df_students.join(df_attendance, "name")
df_joined.show()

+------+-------+-----+------------+
|  name|section|marks|days_present|
+------+-------+-----+------------+
|  Amit|   10-A|   89|          24|
|Anjali|   10-A|   78|          20|
| Kavya|   10-B|   92|          22|
| Rohit|   10-B|   85|          25|
| Sneha|   10-C|   80|          19|
+------+-------+-----+------------+



Create a new column: `attendance_rate = days_present / 25`.

In [31]:
df_joined = df_joined.withColumn("attendance_rate", col("days_present") / 25)
df_joined.show()

+------+-------+-----+------------+---------------+
|  name|section|marks|days_present|attendance_rate|
+------+-------+-----+------------+---------------+
|  Amit|   10-A|   89|          24|           0.96|
|Anjali|   10-A|   78|          20|            0.8|
| Kavya|   10-B|   92|          22|           0.88|
| Rohit|   10-B|   85|          25|            1.0|
| Sneha|   10-C|   80|          19|           0.76|
+------+-------+-----+------------+---------------+



Grade students using `when`:
A: >90, B: 80–90, C: <80.

In [32]:
df_graded = df_joined.withColumn("grade",
    when(col("marks") > 90, "A")
    .when((col("marks") >= 80) & (col("marks") <= 90), "B")
    .otherwise("C")
)

df_graded.show()

+------+-------+-----+------------+---------------+-----+
|  name|section|marks|days_present|attendance_rate|grade|
+------+-------+-----+------------+---------------+-----+
|  Amit|   10-A|   89|          24|           0.96|    B|
|Anjali|   10-A|   78|          20|            0.8|    C|
| Kavya|   10-B|   92|          22|           0.88|    A|
| Rohit|   10-B|   85|          25|            1.0|    B|
| Sneha|   10-C|   80|          19|           0.76|    B|
+------+-------+-----+------------+---------------+-----+
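
The `when(...).when(...).otherwise(...)` chain behaves like an if/elif/else ladder in which the first matching branch wins. The sketch below restates the thresholds in plain Python to make the boundary cases (exactly 80 and exactly 90) explicit. The `grade` function and `marks_by_student` dictionary are illustrative names, not part of the notebook.

```python
def grade(marks):
    """Plain-Python mirror of the when/otherwise chain: first matching branch wins."""
    if marks > 90:
        return "A"
    elif 80 <= marks <= 90:
        return "B"
    else:
        return "C"

marks_by_student = {"Amit": 89, "Kavya": 92, "Anjali": 78, "Rohit": 85, "Sneha": 80}
grades = {name: grade(m) for name, m in marks_by_student.items()}
print(grades)
```

One difference worth keeping in mind: in Spark a null `marks` value fails both conditions and falls through to `otherwise`, so it would be graded "C", whereas this Python version would raise an error on `None`.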



Filter students with good grades but poor attendance (<80%).

In [33]:
df_filtered = df_graded.filter((col("grade").isin("A", "B")) & (col("attendance_rate") < 0.8))

df_filtered.show()

+-----+-------+-----+------------+---------------+-----+
| name|section|marks|days_present|attendance_rate|grade|
+-----+-------+-----+------------+---------------+-----+
|Sneha|   10-C|   80|          19|           0.76|    B|
+-----+-------+-----+------------+---------------+-----+



# **Module 4: Ingest CSV & JSON, Save to Parquet**

**Tasks:**

1. Ingest CSV:

In [34]:
data = """emp_id,name,dept,city,salary
101,Anil,IT,Bangalore,80000
102,Kiran,HR,Mumbai,65000
103,Deepa,Finance,Chennai,72000
"""

with open("employee.csv", "w") as f:
    f.write(data)

2. Ingest JSON:

In [36]:
json_data = [
{
"id": 201,
"name": "Nandini",
"contact": {
"email": "nandi@example.com",
"city": "Hyderabad"
},
"skills": ["Python", "Spark", "SQL"]
}
]

import json

with open("employee_nested.json", "w") as f:
    json.dump(json_data, f)  # write strict JSON (double quotes), not a Python repr
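
When serialising Python objects for a JSON reader, note that `str()` of a list of dicts produces a Python repr with single-quoted strings, which is not strict JSON. Spark's JSON reader tolerates single quotes by default (its `allowSingleQuotes` option), but strict parsers reject them. The sketch below shows the difference using only the standard library, with a shortened record made up for illustration.

```python
import json

record = [{"id": 201, "name": "Nandini", "skills": ["Python", "Spark"]}]

as_repr = str(record)          # Python repr: single-quoted strings
as_json = json.dumps(record)   # strict JSON: double-quoted strings

# A strict JSON parser rejects the repr form
try:
    json.loads(as_repr)
    repr_is_valid_json = True
except json.JSONDecodeError:
    repr_is_valid_json = False

print(repr_is_valid_json)
print(json.loads(as_json) == record)
```

Writing with `json.dump` or `json.dumps` keeps the file readable by Spark and by any other JSON tool.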

**Tasks:**

Read both formats into DataFrames.

In [38]:
df_csv = spark.read.option("header", True).csv("employee.csv")
df_csv.show()

+------+-----+-------+---------+------+
|emp_id| name|   dept|     city|salary|
+------+-----+-------+---------+------+
|   101| Anil|     IT|Bangalore| 80000|
|   102|Kiran|     HR|   Mumbai| 65000|
|   103|Deepa|Finance|  Chennai| 72000|
+------+-----+-------+---------+------+



In [39]:
df_json = spark.read.json("employee_nested.json")
df_json.show()

+--------------------+---+-------+--------------------+
|             contact| id|   name|              skills|
+--------------------+---+-------+--------------------+
|{Hyderabad, nandi...|201|Nandini|[Python, Spark, SQL]|
+--------------------+---+-------+--------------------+



Flatten nested JSON using `select`, `col`, `alias`, `explode`.

In [41]:
from pyspark.sql.functions import col, explode

df_flat = df_json.select(
    col("id"),
    col("name"),
    col("contact.city").alias("city"),
    explode(col("skills")).alias("skill")
)

df_flat.show()

+---+-------+---------+------+
| id|   name|     city| skill|
+---+-------+---------+------+
|201|Nandini|Hyderabad|Python|
|201|Nandini|Hyderabad| Spark|
|201|Nandini|Hyderabad|   SQL|
+---+-------+---------+------+
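
To make the effect of `explode` concrete: it emits one output row per array element and repeats the other selected columns on each row. The plain-Python sketch below reproduces the flattening for the single record used here. The `record` and `rows` names are illustrative.

```python
record = {
    "id": 201,
    "name": "Nandini",
    "contact": {"email": "nandi@example.com", "city": "Hyderabad"},
    "skills": ["Python", "Spark", "SQL"],
}

# One flat row per skill; nested contact.city is pulled up to a top-level field
rows = [
    {
        "id": record["id"],
        "name": record["name"],
        "city": record["contact"]["city"],
        "skill": skill,
    }
    for skill in record["skills"]
]

for row in rows:
    print(row)
```

As in this comprehension, a record with an empty `skills` array would produce no rows at all under `explode`; Spark's `explode_outer` is the variant that keeps such records with a null skill.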



Save both as Parquet files partitioned by city.

In [42]:
df_csv.write.mode("overwrite").partitionBy("city").parquet("output/csv_parquet")
df_flat.write.mode("overwrite").partitionBy("city").parquet("output/json_parquet")

# **Module 5: Spark SQL with Temp Views**
**Tasks:**

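Register the graded students DataFrame (`df_graded`, which carries the `grade` and `attendance_rate` columns used in the queries below) as `students_view`.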

In [44]:
df_graded.createOrReplaceTempView("students_view")

Write and run the following queries:

a) Average marks per section

In [45]:
spark.sql("SELECT section, AVG(marks) as avg_marks FROM students_view GROUP BY section").show()

+-------+---------+
|section|avg_marks|
+-------+---------+
|   10-C|     80.0|
|   10-A|     83.5|
|   10-B|     88.5|
+-------+---------+



b) Top scorer in each section

In [46]:
spark.sql("SELECT section, name, marks FROM (SELECT *, RANK() OVER (PARTITION BY section ORDER BY marks DESC) as rnk FROM students_view ) WHERE rnk = 1").show()

+-------+-----+-----+
|section| name|marks|
+-------+-----+-----+
|   10-A| Amit|   89|
|   10-B|Kavya|   92|
|   10-C|Sneha|   80|
+-------+-----+-----+
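
For readers new to window functions, the sketch below imitates `RANK() OVER (PARTITION BY section ORDER BY marks DESC)` in plain Python: rows are grouped by section, sorted by marks in descending order, and tied marks share a rank with a gap afterwards. `rank_within_sections` is a hypothetical helper written for this illustration.

```python
from itertools import groupby

students = [
    ("Amit", "10-A", 89),
    ("Kavya", "10-B", 92),
    ("Anjali", "10-A", 78),
    ("Rohit", "10-B", 85),
    ("Sneha", "10-C", 80),
]

def rank_within_sections(rows):
    """Return (section, name, marks, rank) tuples, ranking by marks within each section."""
    ranked = []
    by_section = sorted(rows, key=lambda r: r[1])
    for section, group in groupby(by_section, key=lambda r: r[1]):
        ordered = sorted(group, key=lambda r: r[2], reverse=True)
        rank, prev_marks = 0, None
        for position, (name, _, marks) in enumerate(ordered, start=1):
            if marks != prev_marks:
                rank = position  # like RANK(): ties share a rank, then a gap follows
                prev_marks = marks
            ranked.append((section, name, marks, rank))
    return ranked

top = [(s, n, m) for s, n, m, rnk in rank_within_sections(students) if rnk == 1]
print(top)
```

Because ties share rank 1, filtering on `rnk = 1` can return more than one student per section; `ROW_NUMBER()` would be the choice if exactly one row per section is required.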



c) Count of students in each grade category

In [47]:
spark.sql("SELECT grade, COUNT(*) as count FROM students_view GROUP BY grade").show()

+-----+-----+
|grade|count|
+-----+-----+
|    B|    3|
|    C|    1|
|    A|    1|
+-----+-----+



d) Students with marks above class average

In [48]:
spark.sql("SELECT name, marks FROM students_view WHERE marks > (SELECT AVG(marks) FROM students_view)").show()

+-----+-----+
| name|marks|
+-----+-----+
| Amit|   89|
|Kavya|   92|
|Rohit|   85|
+-----+-----+



e) Attendance-adjusted performance

In [49]:
spark.sql("SELECT name, section, marks, attendance_rate, marks * attendance_rate as adjusted_score FROM students_view").show()

+------+-------+-----+---------------+------------------+
|  name|section|marks|attendance_rate|    adjusted_score|
+------+-------+-----+---------------+------------------+
|  Amit|   10-A|   89|           0.96|             85.44|
|Anjali|   10-A|   78|            0.8|62.400000000000006|
| Kavya|   10-B|   92|           0.88|             80.96|
| Rohit|   10-B|   85|            1.0|              85.0|
| Sneha|   10-C|   80|           0.76|              60.8|
+------+-------+-----+---------------+------------------+
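
The value `62.400000000000006` for Anjali is ordinary binary floating-point behaviour rather than a Spark issue. The short Python sketch below reproduces the same product and shows that rounding to two decimals tidies it up. In the query itself, wrapping the expression in Spark SQL's `ROUND(..., 2)` would have a similar effect.

```python
marks = 78
attendance_rate = 20 / 25   # same value as the attendance_rate column for Anjali

raw = marks * attendance_rate
rounded = round(raw, 2)

print(raw)
print(rounded)
```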



# **Module 6: Partitioned Data & Incremental Loading**
**Step 1: Full Load**

In [55]:
df_students.write.mode("overwrite").partitionBy("section").parquet("output/students")

**Step 2: Incremental Load**

In [56]:
incremental = [("Tejas", "10-A", 91)]
df_inc = spark.createDataFrame(incremental, ["name", "section", "marks"])
df_inc.write.mode("append").partitionBy("section").parquet("output/students/")

**Tasks:**

List files in output/students/ using Python.

In [57]:
import os
print(os.listdir("output/students"))

['._SUCCESS.crc', 'section=10-A', 'section=10-B', '_SUCCESS', 'section=10-C']
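
The `section=10-A` style names are partition directories: the partition column's value is encoded in the directory name, while `_SUCCESS` and the `.crc` file are job markers. The sketch below extracts the partition values from a listing like the one above. `partition_values` is a hypothetical helper, and the listing is hard-coded so the example runs without the output directory.

```python
entries = ['._SUCCESS.crc', 'section=10-A', 'section=10-B', '_SUCCESS', 'section=10-C']

def partition_values(names, column):
    """Return the sorted values of `column` encoded as 'column=value' directory names."""
    prefix = column + "="
    return sorted(n[len(prefix):] for n in names if n.startswith(prefix))

sections = partition_values(entries, "section")
print(sections)
```

In the notebook, `os.listdir("output/students")` could be passed in place of the hard-coded `entries` list.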


Read only partition 10-A and list students.

In [58]:
df_10a = spark.read.parquet("output/students/section=10-A")
df_10a.show()

+------+-----+
|  name|marks|
+------+-----+
|Anjali|   78|
| Tejas|   91|
|  Amit|   89|
+------+-----+

Note that the `section` column is absent: when a partition directory is read directly, Spark does not recover the partition value from the path. Reading `output/students` and filtering on `section`, or setting the `basePath` option to `output/students`, keeps the column.



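Compare before/after counts for section `10-A`.

In [59]:
before = df_students.filter(col("section") == "10-A").count()  # rows in the original full load
after = df_10a.count()  # rows read back after the incremental append
print(f"Before: {before}, After: {after}")

Before: 2, After: 3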


# **Module 7: ETL Pipeline – End to End**
**Given Raw Data (CSV):**

In [60]:
datas = """emp_id,name,dept,salary,bonus
1,Arjun,IT,75000,5000
2,Kavya,HR,62000,
3,Sneha,Finance,68000,4000
4,Ramesh,Sales,58000
"""

**Tasks:**

Load CSV with inferred schema.

In [64]:
with open("employee_raw.csv", "w") as f:
    f.write(datas)

df_raw = spark.read.option("header", True).option("inferSchema", True).csv("employee_raw.csv")

df_raw.show()

+------+------+-------+------+-----+
|emp_id|  name|   dept|salary|bonus|
+------+------+-------+------+-----+
|     1| Arjun|     IT| 75000| 5000|
|     2| Kavya|     HR| 62000| NULL|
|     3| Sneha|Finance| 68000| 4000|
|     4|Ramesh|  Sales| 58000| NULL|
+------+------+-------+------+-----+



Fill null bonuses with `2000`.

In [65]:
df_clean = df_raw.fillna({"bonus": 2000})
df_clean.show()

+------+------+-------+------+-----+
|emp_id|  name|   dept|salary|bonus|
+------+------+-------+------+-----+
|     1| Arjun|     IT| 75000| 5000|
|     2| Kavya|     HR| 62000| 2000|
|     3| Sneha|Finance| 68000| 4000|
|     4|Ramesh|  Sales| 58000| 2000|
+------+------+-------+------+-----+



Create `total_ctc = salary + bonus`.

In [66]:
df_clean = df_clean.withColumn("total_ctc", col("salary") + col("bonus"))
df_clean.show()

+------+------+-------+------+-----+---------+
|emp_id|  name|   dept|salary|bonus|total_ctc|
+------+------+-------+------+-----+---------+
|     1| Arjun|     IT| 75000| 5000|    80000|
|     2| Kavya|     HR| 62000| 2000|    64000|
|     3| Sneha|Finance| 68000| 4000|    72000|
|     4|Ramesh|  Sales| 58000| 2000|    60000|
+------+------+-------+------+-----+---------+



Filter employees with `total_ctc > 65000`.

In [67]:
df_filtered = df_clean.filter(col("total_ctc") > 65000)
df_filtered.show()

+------+-----+-------+------+-----+---------+
|emp_id| name|   dept|salary|bonus|total_ctc|
+------+-----+-------+------+-----+---------+
|     1|Arjun|     IT| 75000| 5000|    80000|
|     3|Sneha|Finance| 68000| 4000|    72000|
+------+-----+-------+------+-----+---------+
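
The clean, derive, and filter steps above can be sanity-checked without Spark. The sketch below replays them on the same raw CSV text using Python's `csv` module. It treats both an empty bonus field (Kavya) and a missing trailing field (Ramesh) as null, which matches the NULLs Spark showed for this file. Variable names are illustrative.

```python
import csv
import io

raw = """emp_id,name,dept,salary,bonus
1,Arjun,IT,75000,5000
2,Kavya,HR,62000,
3,Sneha,Finance,68000,4000
4,Ramesh,Sales,58000
"""

rows = []
for rec in csv.DictReader(io.StringIO(raw)):
    # An empty field ('') and a missing trailing field (None) are both treated as null
    bonus = int(rec["bonus"]) if rec["bonus"] else 2000
    salary = int(rec["salary"])
    rows.append({
        "name": rec["name"],
        "dept": rec["dept"],
        "salary": salary,
        "bonus": bonus,
        "total_ctc": salary + bonus,
    })

# Same threshold as the DataFrame filter
high_ctc = [r for r in rows if r["total_ctc"] > 65000]
print([(r["name"], r["total_ctc"]) for r in high_ctc])
```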



Save the result in JSON format.

In [68]:
df_filtered.write.mode("overwrite").json("output/high_ctc_json")

Save the result in Parquet format, partitioned by department.

In [69]:
df_filtered.write.mode("overwrite").partitionBy("dept").parquet("output/high_ctc_parquet")