<pre>
Problem 1: PySpark – Rolling 3-Day Average Sales
Problem Statement
You have a PySpark DataFrame containing daily sales data.
Write a PySpark program to calculate the rolling 3-day average sales for each date,
ordered by the date column.

Sample Input (daily_sales)
sales_date	sales
2025-01-01	100
2025-01-02	200
2025-01-03	300
2025-01-04	400
2025-01-05	500
Expected Output
sales_date	sales	rolling_3_day_avg
2025-01-01	100	100.0
2025-01-02	200	150.0
2025-01-03	300	200.0
2025-01-04	400	300.0
2025-01-05	500	400.0
</pre>

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

In [2]:
# Sample rows: (sales_date, sales)
data = [
    ("2025-01-01", 100),
    ("2025-01-02", 200),
    ("2025-01-03", 300),
    ("2025-01-04", 400),
    ("2025-01-05", 500),
]
schema = ['sales_date', 'sales']
spark = SparkSession.builder.appName('Daily-Day8').getOrCreate()
daily_sales = spark.createDataFrame(data, schema)

In [5]:
# Order rows by date and average the current row with the two preceding rows.
# No partitionBy here: partitioning by sales_date would put every row in its
# own window, so the "average" would just echo the sales value.
windowspec = Window.orderBy('sales_date').rowsBetween(-2, 0)

In [6]:
daily_sales = daily_sales.withColumn('rolling_3_day_avg', F.avg('sales').over(windowspec))
daily_sales.show()

+----------+-----+-----------------+
|sales_date|sales|rolling_3_day_avg|
+----------+-----+-----------------+
|2025-01-01|  100|            100.0|
|2025-01-02|  200|            150.0|
|2025-01-03|  300|            200.0|
|2025-01-04|  400|            300.0|
|2025-01-05|  500|            400.0|
+----------+-----+-----------------+
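
A 3-row rolling frame means "the current row plus up to two rows before it", so the first two dates average over fewer than three values. As a rough cross-check against the Expected Output in the problem statement, the same arithmetic can be sketched in plain Python with no Spark involved. The helper name `rolling_avg` and the `window` parameter are illustrative, not part of any PySpark API.

```python
# Plain-Python cross-check of the rolling 3-day average (hypothetical helper).
sales = [
    ("2025-01-01", 100),
    ("2025-01-02", 200),
    ("2025-01-03", 300),
    ("2025-01-04", 400),
    ("2025-01-05", 500),
]

def rolling_avg(values, window=3):
    """Average each value with up to `window - 1` values before it."""
    out = []
    for i in range(len(values)):
        # Frame = current row and at most (window - 1) preceding rows.
        frame = values[max(0, i - window + 1): i + 1]
        out.append(sum(frame) / len(frame))
    return out

# ISO-formatted date strings sort chronologically, so sorted() orders by date.
amounts = [amount for _, amount in sorted(sales)]
print(rolling_avg(amounts))  # [100.0, 150.0, 200.0, 300.0, 400.0]
```

The values match the Expected Output table: 100.0, 150.0, 200.0, 300.0, 400.0.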



<pre>
Problem 2: SQL – Find Customers with Increasing Purchase Amounts
Problem Statement
You have a SQL table purchases(customer_id, purchase_date, amount).
Write a query to find customers whose purchase amounts strictly increased
with each new purchase date.

Sample Input (purchases)
customer_id	purchase_date	amount
C1	2025-01-01	100
C1	2025-01-05	200
C1	2025-01-10	300
C2	2025-01-02	150
C2	2025-01-06	120
C3	2025-01-03	200
C3	2025-01-09	250
Expected Output
customer_id
C1
C3
</pre>

In [13]:
# Sample rows: (customer_id, purchase_date, amount)
data = [
    ("C1", "2025-01-01", 100),
    ("C1", "2025-01-05", 200),
    ("C1", "2025-01-10", 300),
    ("C2", "2025-01-02", 150),
    ("C2", "2025-01-06", 120),
    ("C3", "2025-01-03", 200),
    ("C3", "2025-01-09", 250),
]
schema = ["customer_id", "purchase_date", "amount"]
purchases = spark.createDataFrame(data, schema)
# Cast the string column to a real date so ORDER BY purchase_date sorts chronologically.
purchases = purchases.withColumn("purchase_date", F.col('purchase_date').cast('date'))
purchases.createOrReplaceTempView('purchases')

In [15]:
result = spark.sql('''
WITH diffs AS (
  SELECT
    customer_id,
    purchase_date,
    amount,
    LAG(amount) OVER (PARTITION BY customer_id ORDER BY purchase_date) AS prev_amount
  FROM purchases
),

valid_customers AS (
  SELECT
    customer_id,
    MIN(CASE WHEN prev_amount IS NULL OR amount > prev_amount THEN 1 ELSE 0 END) AS all_increasing
  FROM diffs
  GROUP BY customer_id
)

SELECT customer_id
FROM valid_customers
WHERE all_increasing = 1;

''')
result.show()

+-----------+
|customer_id|
+-----------+
|         C1|
|         C3|
+-----------+
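
The query's logic (compare each amount with the previous one per customer, then keep customers where every comparison passes) can also be sketched in plain Python as a sanity check on the sample data. This is only an illustration; the function name `strictly_increasing_customers` is hypothetical. Like the SQL above, it treats a customer with a single purchase as increasing.

```python
from collections import defaultdict

# Same sample rows as the purchases table: (customer_id, purchase_date, amount).
rows = [
    ("C1", "2025-01-01", 100),
    ("C1", "2025-01-05", 200),
    ("C1", "2025-01-10", 300),
    ("C2", "2025-01-02", 150),
    ("C2", "2025-01-06", 120),
    ("C3", "2025-01-03", 200),
    ("C3", "2025-01-09", 250),
]

def strictly_increasing_customers(rows):
    """Return customer ids whose amounts rise with every later purchase date."""
    by_customer = defaultdict(list)
    # Sort by (customer, date) so each customer's amounts are in date order.
    for customer_id, _, amount in sorted(rows, key=lambda r: (r[0], r[1])):
        by_customer[customer_id].append(amount)
    return sorted(
        customer_id
        for customer_id, amounts in by_customer.items()
        # Pair each amount with its successor; every pair must strictly increase.
        if all(prev < cur for prev, cur in zip(amounts, amounts[1:]))
    )

print(strictly_increasing_customers(rows))  # ['C1', 'C3']
```

C2 drops out because its amount falls from 150 to 120 on the second purchase.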

