# 📝 Problem 1: PySpark – Calculate Rolling 3-Day Average of Sales

### **Problem Statement**

You have a PySpark DataFrame containing daily sales data. Write a PySpark program to **calculate the rolling 3-day average of sales** for each date: the mean of that day's sales and the sales of up to two preceding days, with rows ordered by the date column.

### **Sample Input** (`daily_sales`)

| sale\_date | sales |
| ---------- | ----- |
| 2025-01-01 | 100   |
| 2025-01-02 | 200   |
| 2025-01-03 | 300   |
| 2025-01-04 | 400   |
| 2025-01-05 | 500   |

### **Expected Output**

| sale\_date | sales | rolling\_3\_day\_avg |
| ---------- | ----- | -------------------- |
| 2025-01-01 | 100   | 100.0                |
| 2025-01-02 | 200   | 150.0                |
| 2025-01-03 | 300   | 200.0                |
| 2025-01-04 | 400   | 300.0                |
| 2025-01-05 | 500   | 400.0                |
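
As a quick sanity check on the expected values, the arithmetic can be reproduced in plain Python. This is only a sketch: `rolling_avg_rows` is a hypothetical helper written for illustration, not part of PySpark, and it assumes one row per day with no gaps.

```python
def rolling_avg_rows(values, window=3):
    """Average each value with up to `window - 1` values before it."""
    averages = []
    for i in range(len(values)):
        # Slice covers the current value and at most two earlier ones.
        frame = values[max(0, i - window + 1): i + 1]
        averages.append(sum(frame) / len(frame))
    return averages


sales = [100, 200, 300, 400, 500]
print(rolling_avg_rows(sales))  # [100.0, 150.0, 200.0, 300.0, 400.0]
```

The first two results average fewer than three values because there are not yet two earlier days to include, which is why the first row equals its own sales figure.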

---

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F, Window as W
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

In [2]:
spark = SparkSession.builder.appName("DailyCodingProblem-26-08-2025").getOrCreate()

In [3]:
data = [
    ("2025-01-01", 100),
    ("2025-01-02", 200),
    ("2025-01-03", 300),
    ("2025-01-04", 400),
    ("2025-01-05", 500)
]

schema = StructType([
    StructField("sale_date", StringType(), True),
    StructField("sales", IntegerType(), True)
])

df = spark.createDataFrame(data, schema)

In [4]:
df.printSchema()

root
 |-- sale_date: string (nullable = true)
 |-- sales: integer (nullable = true)



In [25]:
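# Frame: the current row plus the two rows before it, ordered by sale_date.
# No partitionBy is used, so Spark moves all rows into a single partition.
w = W.orderBy('sale_date').rowsBetween(-2, 0)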

In [7]:
# Convert sale_date from string to date; ISO yyyy-MM-dd strings parse without an explicit format.
df = df.withColumn(
    "sale_date",
    F.to_date(F.col("sale_date"))
)

In [8]:
df.show()

+----------+-----+
| sale_date|sales|
+----------+-----+
|2025-01-01|  100|
|2025-01-02|  200|
|2025-01-03|  300|
|2025-01-04|  400|
|2025-01-05|  500|
+----------+-----+



In [26]:
# Average sales over the 3-row window `w` (current row and the two preceding rows).
df = df.withColumn(
    "rolling_3_day_avg",
    F.avg("sales").over(w)
)

df.show()





+----------+-----+-----------------+
| sale_date|sales|rolling_3_day_avg|
+----------+-----+-----------------+
|2025-01-01|  100|            100.0|
|2025-01-02|  200|            150.0|
|2025-01-03|  300|            200.0|
|2025-01-04|  400|            300.0|
|2025-01-05|  500|            400.0|
+----------+-----+-----------------+
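
The window in this notebook uses `rowsBetween(-2, 0)`, which counts rows rather than calendar days. That matches "3-day" only when every date is present. The plain-Python sketch below illustrates what a calendar-based window would compute if a day were missing. The helper `rolling_avg_calendar` and the gapped data are hypothetical and exist only for illustration.

```python
from datetime import date, timedelta


def rolling_avg_calendar(rows, days=3):
    """rows: list of (date, sales) pairs.

    For each date, average the sales dated within the trailing `days`
    calendar days, including the current date.
    """
    averages = []
    for current_date, _ in rows:
        start = current_date - timedelta(days=days - 1)
        frame = [s for d, s in rows if start <= d <= current_date]
        averages.append(sum(frame) / len(frame))
    return averages


# Hypothetical data: 2025-01-03 is missing.
gapped = [
    (date(2025, 1, 1), 100),
    (date(2025, 1, 2), 200),
    (date(2025, 1, 4), 400),
    (date(2025, 1, 5), 500),
]
print(rolling_avg_calendar(gapped))  # [100.0, 150.0, 300.0, 450.0]
```

With this gapped data, a row-based window would average 100, 200, and 400 for 2025-01-04, reaching back three calendar days, whereas the calendar-based version averages only 200 and 400. For the sample input in this problem, which has no gaps, both approaches give the same result.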

