Problem Statement
<pre>
You have a PySpark DataFrame containing product sales. Each product belongs to a category, and you need to find the top 2 products by sales amount within each category.


---
Sample Input (products)
category	product	sales
Electronics	Laptop	1200
Electronics	Phone	900
Electronics	Tablet	700
Clothing	Shirt	400
Clothing	Jeans	600
Clothing	Jacket	800
---
Expected Output
category	product	sales
Electronics	Laptop	1200
Electronics	Phone	900
Clothing	Jacket	800
Clothing	Jeans	600

</pre>

In [17]:
from pyspark.sql import SparkSession
from pyspark.sql import Window
from pyspark.sql import functions as F
from datetime import date

In [3]:
spark = SparkSession.builder.appName('Daily-Coding-Day2').getOrCreate()

In [5]:
data = [('Electronics','Laptop',1200),
('Electronics','Phone',900),
('Electronics','Tablet',700),
('Clothing','Shirt',400),
('Clothing','Jeans',600),
('Clothing','Jacket',800)]

schema = ['category','product','sales']
sales_df = spark.createDataFrame(data=data,schema=schema)
sales_df.printSchema()
sales_df.show()

root
 |-- category: string (nullable = true)
 |-- product: string (nullable = true)
 |-- sales: long (nullable = true)

+-----------+-------+-----+
|   category|product|sales|
+-----------+-------+-----+
|Electronics| Laptop| 1200|
|Electronics|  Phone|  900|
|Electronics| Tablet|  700|
|   Clothing|  Shirt|  400|
|   Clothing|  Jeans|  600|
|   Clothing| Jacket|  800|
+-----------+-------+-----+



In [14]:
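# Rank products within each category by sales, highest first. dense_rank() gives ties
# the same rank, so tied sales can return more than two rows (F.row_number() would not).
windowspec = Window.partitionBy('category').orderBy(F.col('sales').desc())
sales_df = sales_df.withColumn('Rank',F.dense_rank().over(windowspec))
sales_df.show()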

+-----------+-------+-----+----+
|   category|product|sales|Rank|
+-----------+-------+-----+----+
|   Clothing| Jacket|  800|   1|
|   Clothing|  Jeans|  600|   2|
|   Clothing|  Shirt|  400|   3|
|Electronics| Laptop| 1200|   1|
|Electronics|  Phone|  900|   2|
|Electronics| Tablet|  700|   3|
+-----------+-------+-----+----+



In [16]:
# Keep ranks 1 and 2, then drop the helper Rank column.
top2df = sales_df.filter(F.col('Rank') <= 2).select('category', 'product', 'sales')
top2df.show()

+-----------+-------+-----+
|   category|product|sales|
+-----------+-------+-----+
|   Clothing| Jacket|  800|
|   Clothing|  Jeans|  600|
|Electronics| Laptop| 1200|
|Electronics|  Phone|  900|
+-----------+-------+-----+
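
The sample data has no tied sales values, so the choice of ranking function does not change the result here. As a rough illustration of where it would matter, the plain-Python sketch below mimics the numbering that `dense_rank` and `row_number` produce within one category. The tied rows (`Tablet` at 900, plus a `Camera` row) are hypothetical and not part of the sample input, and `rank_partition` is an illustrative helper, not a Spark API.

```python
# Hypothetical partition: Phone and Tablet tie at 900 (not in the sample input).
rows = [
    ("Electronics", "Laptop", 1200),
    ("Electronics", "Phone", 900),
    ("Electronics", "Tablet", 900),
    ("Electronics", "Camera", 700),
]


def rank_partition(partition):
    """Return (product, sales, dense_rank, row_number) tuples, sales descending."""
    # sorted() is stable, so tied rows keep their input order.
    ordered = sorted(partition, key=lambda r: r[2], reverse=True)
    ranked = []
    dense = 0
    previous_sales = None
    for row_number, (_, product, sales) in enumerate(ordered, start=1):
        # dense_rank only advances when the sales value changes.
        if sales != previous_sales:
            dense += 1
            previous_sales = sales
        ranked.append((product, sales, dense, row_number))
    return ranked


ranked = rank_partition(rows)
top2_dense = [product for product, _, dense, _ in ranked if dense <= 2]
top2_row_number = [product for product, _, _, rn in ranked if rn <= 2]

print(top2_dense)       # ['Laptop', 'Phone', 'Tablet']
print(top2_row_number)  # ['Laptop', 'Phone']
```

With `dense_rank`, a tie for second place returns three products. `row_number` returns exactly two, but in Spark the order among tied rows is not guaranteed unless the window's `orderBy` includes a tie-breaking column such as `product`.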



<pre>
Problem Statement
You are given a SQL table transactions(user_id, txn_date, amount) where amount can be positive (credit) or negative (debit). Write a SQL query to calculate the running balance for each user ordered by txn_date.

Sample Input (transactions)
user_id	txn_date	amount
1	2025-01-01	500
1	2025-01-03	-200
1	2025-01-05	300
2	2025-01-02	1000
2	2025-01-04	-400
Expected Output
user_id	txn_date	amount	running_balance
1	2025-01-01	500	500
1	2025-01-03	-200	300
1	2025-01-05	300	600
2	2025-01-02	1000	1000
2	2025-01-04	-400	600
</pre>

In [18]:
from datetime import date  # needed for the date(...) literals below
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window


data = [
    (1, date(2025, 1, 1), 500),
    (1, date(2025, 1, 3), -200),
    (1, date(2025, 1, 5), 300),
    (2, date(2025, 1, 2), 1000),
    (2, date(2025, 1, 4), -400),
]
columns = ["user_id", "txn_date", "amount"]
transactions_df = spark.createDataFrame(data, schema=columns)
transactions_df.show()


+-------+----------+------+
|user_id|  txn_date|amount|
+-------+----------+------+
|      1|2025-01-01|   500|
|      1|2025-01-03|  -200|
|      1|2025-01-05|   300|
|      2|2025-01-02|  1000|
|      2|2025-01-04|  -400|
+-------+----------+------+



In [20]:
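# Running total per user in date order. The explicit ROWS frame stops rows that share
# a txn_date from being summed together, which the default RANGE frame would do.
windowspec = (Window.partitionBy('user_id').orderBy(F.col('txn_date'))
              .rowsBetween(Window.unboundedPreceding, Window.currentRow))
running = transactions_df.withColumn('running_balance',F.sum(F.col('amount')).over(windowspec))
running.show()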

+-------+----------+------+---------------+
|user_id|  txn_date|amount|running_balance|
+-------+----------+------+---------------+
|      1|2025-01-01|   500|            500|
|      1|2025-01-03|  -200|            300|
|      1|2025-01-05|   300|            600|
|      2|2025-01-02|  1000|           1000|
|      2|2025-01-04|  -400|            600|
+-------+----------+------+---------------+
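
The problem statement asks for a SQL query, while the cell above uses the DataFrame API. The same running balance can be written as a windowed `SUM` in SQL. The sketch below runs the query against an in-memory SQLite database from the Python standard library, used here only as a stand-in engine. It assumes the bundled SQLite is 3.25 or newer (the first version with window functions) and stores `txn_date` as ISO-8601 text so that string ordering matches date ordering.

```python
import sqlite3

# In-memory SQLite database as a stand-in SQL engine (assumption: SQLite >= 3.25).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (user_id INTEGER, txn_date TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        (1, "2025-01-01", 500),
        (1, "2025-01-03", -200),
        (1, "2025-01-05", 300),
        (2, "2025-01-02", 1000),
        (2, "2025-01-04", -400),
    ],
)

# Cumulative sum per user, ordered by transaction date.
query = """
SELECT user_id,
       txn_date,
       amount,
       SUM(amount) OVER (
           PARTITION BY user_id
           ORDER BY txn_date
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_balance
FROM transactions
ORDER BY user_id, txn_date
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
conn.close()
```

The window clause is standard SQL, so the same query text should also run in Spark after registering the DataFrame with `transactions_df.createOrReplaceTempView("transactions")` and passing the query to `spark.sql(...)`.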

