
# Installing PySpark in Google Colab

In [16]:
!sudo apt update
# Spark runs on the JVM, so install a headless JDK first
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Check this site for the latest download link: https://www.apache.org/dyn/closer.lua/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
!wget -q https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
!tar xf spark-3.2.1-bin-hadoop3.2.tgz
!pip install -q findspark
!pip install pyspark
!pip install py4j

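import os
import sys
# JAVA_HOME and SPARK_HOME are left commented out. With SPARK_HOME unset, findspark.init()
# should fall back to the pip-installed pyspark package instead of the tarball extracted above.
# os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
# os.environ["SPARK_HOME"] = "/content/spark-3.2.1-bin-hadoop3.2"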


import findspark
findspark.init()
findspark.find()

import pyspark

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as F
from pyspark.sql.window import Window

spark = (
    SparkSession.builder
    .appName("Our First Spark Example")
    .config("spark.ui.port", "4050")
    .getOrCreate()
)

spark

Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:3 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:7 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
48 packages can be upgraded. Run 'apt list --upgradable' to see them.
W: Skipping acquire of configured file 'main/source/Sources' as repository 

In [2]:
from google.colab import output
output.serve_kernel_port_as_window(4050, path ='/jobs/index.html')

Try `serve_kernel_port_as_iframe` instead.



# Reading Data

For this example, I am going to use a dataset from this [GitHub repo](https://github.com/afaqueahmad7117/spark-experiments.git).


In [3]:
# Clone the repo
!git clone https://github.com/afaqueahmad7117/spark-experiments.git

# Load datasets from the cloned repo
transactions_df = spark.read.parquet("spark-experiments/data/data_skew/transactions.parquet")
customers_df = spark.read.parquet("spark-experiments/data/data_skew/customers.parquet")

print("Transactions Dataset Schema:")
transactions_df.printSchema()
print("Customers Dataset Schema:")
customers_df.printSchema()

Cloning into 'spark-experiments'...
remote: Enumerating objects: 544, done.
remote: Counting objects: 100% (8/8), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 544 (delta 7), reused 0 (delta 0), pack-reused 536 (from 1)
Receiving objects: 100% (544/544), 702.60 MiB | 24.83 MiB/s, done.
Resolving deltas: 100% (112/112), done.
Updating files: 100% (351/351), done.
Transactions Dataset Schema:
root
 |-- cust_id: string (nullable = true)
 |-- start_date: string (nullable = true)
 |-- end_date: string (nullable = true)
 |-- txn_id: string (nullable = true)
 |-- date: string (nullable = true)
 |-- year: string (nullable = true)
 |-- month: string (nullable = true)
 |-- day: string (nullable = true)
 |-- expense_type: string (nullable = true)
 |-- amt: string (nullable = true)
 |-- city: string (nullable = true)

Customers Dataset Schema:
root
 |-- cust_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- gen

# 🚩 **Day 1 - 2025/04/14**

**1. Schema Validation & Type Conversion (Easy)**

Your ETL pipeline ingests raw data with all columns as strings. Convert the `amt` (transaction amount) to DoubleType and `age` to IntegerType. Validate by showing the schema post-conversion.

In [4]:
transactions_df.show(5)

+----------+----------+----------+---------------+----------+----+-----+---+-------------+------+-----------+
|   cust_id|start_date|  end_date|         txn_id|      date|year|month|day| expense_type|   amt|       city|
+----------+----------+----------+---------------+----------+----+-----+---+-------------+------+-----------+
|C0YDPQWPBJ|2010-07-01|2018-12-01|TZ5SMKZY9S03OQJ|2018-10-07|2018|   10|  7|Entertainment| 10.42|     boston|
|C0YDPQWPBJ|2010-07-01|2018-12-01|TYIAPPNU066CJ5R|2016-03-27|2016|    3| 27| Motor/Travel| 44.34|   portland|
|C0YDPQWPBJ|2010-07-01|2018-12-01|TETSXIK4BLXHJ6W|2011-04-11|2011|    4| 11|Entertainment|  3.18|    chicago|
|C0YDPQWPBJ|2010-07-01|2018-12-01|TQKL1QFJY3EM8LO|2018-02-22|2018|    2| 22|    Groceries|268.97|los_angeles|
|C0YDPQWPBJ|2010-07-01|2018-12-01|TYL6DFP09PPXMVB|2010-10-16|2010|   10| 16|Entertainment|  2.66|    chicago|
+----------+----------+----------+---------------+----------+----+-----+---+-------------+------+-----------+
only showi

In [6]:
transactions_df = transactions_df.withColumn("amt", F.col("amt").cast(DoubleType()))
transactions_df.printSchema()

transactions_df.show(3) # show() is an action, so this is where the lazy cast actually runs

root
 |-- cust_id: string (nullable = true)
 |-- start_date: string (nullable = true)
 |-- end_date: string (nullable = true)
 |-- txn_id: string (nullable = true)
 |-- date: string (nullable = true)
 |-- year: string (nullable = true)
 |-- month: string (nullable = true)
 |-- day: string (nullable = true)
 |-- expense_type: string (nullable = true)
 |-- amt: double (nullable = true)
 |-- city: string (nullable = true)

+----------+----------+----------+---------------+----------+----+-----+---+-------------+-----+--------+
|   cust_id|start_date|  end_date|         txn_id|      date|year|month|day| expense_type|  amt|    city|
+----------+----------+----------+---------------+----------+----+-----+---+-------------+-----+--------+
|C0YDPQWPBJ|2010-07-01|2018-12-01|TZ5SMKZY9S03OQJ|2018-10-07|2018|   10|  7|Entertainment|10.42|  boston|
|C0YDPQWPBJ|2010-07-01|2018-12-01|TYIAPPNU066CJ5R|2016-03-27|2016|    3| 27| Motor/Travel|44.34|portland|
|C0YDPQWPBJ|2010-07-01|2018-12-01|TETSXIK4BLXH

**2. Time-Based Aggregations (Medium)**

Scenario: The business wants monthly expense reports. Calculate total monthly expenses per customer, preserving the original schema's `year` and `month` columns. Handle potential nulls in `amt`.

```
# Expected output schema
# |-- cust_id: string
# |-- year: string
# |-- month: string
# |-- total_expense: double
```

In [7]:
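# month is a string, so cast it to int for the sort; otherwise "10" would sort before "2"
monthly_transactions_per_customer = transactions_df.groupBy("cust_id","year","month").agg(F.sum("amt").alias("total_expense"))\
                                                    .orderBy("year", F.col("month").cast("int"), "total_expense")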
monthly_transactions_per_customer.show(5)

+----------+----+-----+------------------+
|   cust_id|year|month|     total_expense|
+----------+----+-----+------------------+
|C42POJ8QKI|2010|    1|298.94000000000005|
|CC5E1YOY7N|2010|    1|302.53000000000003|
|CCQ557SM5V|2010|    1|332.96000000000004|
|CLESRZVUWU|2010|    1|334.14000000000004|
|COV6YRAYE9|2010|    1|362.21999999999997|
+----------+----+-----+------------------+
only showing top 5 rows



**3. Data Quality Check (Medium)**

Scenario: Your team needs to validate customer data. Find:

1. Customers with invalid ZIP codes (non-5-digit format)
2. Transactions with future dates (dates beyond today's date)

Return counts for both anomalies.

In [10]:
customers_df.select("cust_id", "zip")\
            .filter(F.length(F.col("zip")) != 5)\
            .show()

+-------+---+
|cust_id|zip|
+-------+---+
+-------+---+



*No customer records have a ZIP code whose length differs from 5. Note that this checks the length only, not that every character is a digit.*

**4. Customer-Transaction Enrichment (Hard)**

Scenario: Create a master dataset showing each transaction enriched with customer demographics. Optimize for:

1. Fast joins, using a broadcast join where appropriate
2. Handling null values in customer data
3. Preserving the original transaction order

```
# Target schema
# |-- txn_id: string
# |-- date: string
# |-- expense_type: string
# |-- amt: double
# |-- city: string
# |-- name: string
# |-- age: int
# |-- gender: string
```

In [14]:
transactions_df_for_join = transactions_df.select("txn_id", "cust_id", "date", "expense_type", "amt", F.col("city").alias("txn_city"))

customers_df_for_join = customers_df.select("cust_id", "name", "age", "gender")

master_df = transactions_df_for_join.join(F.broadcast(customers_df_for_join),
                              customers_df_for_join.cust_id == transactions_df_for_join.cust_id,
                              how='inner')

master_df.show(5)

+---------------+----------+----------+-------------+------+-----------+----------+--------+---+------+
|         txn_id|   cust_id|      date| expense_type|   amt|   txn_city|   cust_id|    name|age|gender|
+---------------+----------+----------+-------------+------+-----------+----------+--------+---+------+
|TZ5SMKZY9S03OQJ|C0YDPQWPBJ|2018-10-07|Entertainment| 10.42|     boston|C0YDPQWPBJ|Ada Lamb| 32|Female|
|TYIAPPNU066CJ5R|C0YDPQWPBJ|2016-03-27| Motor/Travel| 44.34|   portland|C0YDPQWPBJ|Ada Lamb| 32|Female|
|TETSXIK4BLXHJ6W|C0YDPQWPBJ|2011-04-11|Entertainment|  3.18|    chicago|C0YDPQWPBJ|Ada Lamb| 32|Female|
|TQKL1QFJY3EM8LO|C0YDPQWPBJ|2018-02-22|    Groceries|268.97|los_angeles|C0YDPQWPBJ|Ada Lamb| 32|Female|
|TYL6DFP09PPXMVB|C0YDPQWPBJ|2010-10-16|Entertainment|  2.66|    chicago|C0YDPQWPBJ|Ada Lamb| 32|Female|
+---------------+----------+----------+-------------+------+-----------+----------+--------+---+------+
only showing top 5 rows



**5. Window Function Analysis (Hard)**

Scenario: For customer retention analysis, calculate:

a) Days since previous transaction per customer

b) Rolling 30-day average spend per customer

Use appropriate window functions and handle partition boundaries.

In [15]:
# Days since each customer's most recent transaction
# Assumption: the latest transaction is compared against today's date

# Group by cust_id, take the latest transaction date, then diff it against today's date
# Note: F.date_diff needs PySpark 3.5+; F.datediff is the equivalent on older versions
transactions_df.groupBy("cust_id").agg(F.max("date").alias("last_transaction_date"))\
                .withColumn("days_since_last_txn", F.date_diff(F.current_date(), F.col("last_transaction_date")))\
                .show(5)

+----------+---------------------+-------------------+
|   cust_id|last_transaction_date|days_since_last_txn|
+----------+---------------------+-------------------+
|C007YEYTX9|           2020-09-27|               1662|
|C00B971T1J|           2020-12-27|               1571|
|C00WRSJF1Q|           2020-12-27|               1571|
|C01AZWQMF3|           2019-03-27|               2212|
|C01BKUFRHA|           2020-09-27|               1662|
+----------+---------------------+-------------------+
only showing top 5 rows



In [17]:
# Cast 'date' to a timestamp so it can be ordered numerically (as epoch seconds)
transactions_df = transactions_df.withColumn("date_ts", F.col("date").cast("timestamp"))

# Create a time-based window: last 30 days per customer
window_30_days = Window.partitionBy("cust_id")\
                       .orderBy(F.col("date_ts").cast("long"))\
                       .rangeBetween(-30 * 86400, 0)  # both bounds are inclusive: current row back to exactly 30 days earlier

# First aggregate daily total per customer
daily_total_df = transactions_df.groupBy("cust_id", "date_ts").agg(F.sum("amt").alias("daily_total"))

# Apply rolling average over the 30-day window
result_df = daily_total_df.withColumn("rolling_30_days_avg", F.avg("daily_total").over(window_30_days))

result_df.show(5)



+----------+-------------------+-----------+-------------------+
|   cust_id|            date_ts|daily_total|rolling_30_days_avg|
+----------+-------------------+-----------+-------------------+
|C007YEYTX9|2012-02-01 00:00:00|      74.62|              74.62|
|C007YEYTX9|2012-02-02 00:00:00|     293.11|            183.865|
|C007YEYTX9|2012-02-03 00:00:00|      146.7|  171.4766666666667|
|C007YEYTX9|2012-02-04 00:00:00|     3647.9|          1040.5825|
|C007YEYTX9|2012-02-05 00:00:00|     261.46|            884.758|
+----------+-------------------+-----------+-------------------+
only showing top 5 rows



# 🚩 **Day 2 - 2025/04/16**