<pre>
Problem Statement

You have a PySpark DataFrame containing employee salaries across departments. Write a PySpark program to rank employees within each department based on salary in descending order.
Sample Input (employee_salaries)
emp_id 	dept 	salary
1 	HR 	60000
2 	HR 	75000
3 	HR 	50000
4 	IT 	90000
5 	IT 	85000
Expected Output
emp_id 	dept 	salary 	rank
2 	HR 	75000 	1
1 	HR 	60000 	2
3 	HR 	50000 	3
4 	IT 	90000 	1
5 	IT 	85000 	2
</pre>

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

In [3]:
spark = SparkSession.builder.appName("Daily-Day6").getOrCreate()



In [4]:
windowSpec = Window.partitionBy("dept").orderBy(F.desc("salary"))

In [5]:
data = [
    (1, "HR", 60000),
    (2, "HR", 75000),
    (3, "HR", 50000),
    (4, "IT", 90000),
    (5, "IT", 85000),
]
# Column names follow the problem statement (emp_id, dept, salary).
schema = ["emp_id", "dept", "salary"]
employee_salaries = spark.createDataFrame(data, schema)

In [6]:
# rank() gives tied salaries the same rank and leaves a gap after them.
salary_rank = employee_salaries.withColumn("rank", F.rank().over(windowSpec))
salary_rank.show()

+------+----+------+----+
|emp_id|dept|salary|rank|
+------+----+------+----+
|     2|  HR| 75000|   1|
|     1|  HR| 60000|   2|
|     3|  HR| 50000|   3|
|     4|  IT| 90000|   1|
|     5|  IT| 85000|   2|
+------+----+------+----+
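
The ranking above can be mirrored in plain Python as a quick sanity check on small data. This is only an illustrative sketch, not how Spark evaluates window functions: the helper name `rank_within_groups` and the extra tied row (employee 6) are assumptions added for the example. It also shows what happens on ties, which the sample data does not cover: tied salaries share a rank and the next rank is skipped, matching `rank()` rather than `dense_rank()`.

```python
from collections import defaultdict

def rank_within_groups(rows):
    """Competition ranking (1, 1, 3, ...) per dept, highest salary first."""
    by_dept = defaultdict(list)
    for emp_id, dept, salary in rows:
        by_dept[dept].append((emp_id, salary))

    ranked = []
    for dept in sorted(by_dept):
        # Stable sort: employees with equal salaries keep their input order.
        members = sorted(by_dept[dept], key=lambda m: m[1], reverse=True)
        prev_salary, prev_rank = None, 0
        for position, (emp_id, salary) in enumerate(members, start=1):
            # A tie keeps the previous rank; otherwise the rank is the row position.
            rank = prev_rank if salary == prev_salary else position
            ranked.append((emp_id, dept, salary, rank))
            prev_salary, prev_rank = salary, rank
    return ranked

rows = [
    (1, "HR", 60000), (2, "HR", 75000), (3, "HR", 50000),
    (4, "IT", 90000), (5, "IT", 85000),
    (6, "IT", 90000),  # hypothetical extra row, added only to show a tie
]
ranked = rank_within_groups(rows)
for row in ranked:
    print(row)
```

With the tied row included, employees 4 and 6 both get rank 1 and employee 5 gets rank 3. If gap-free ranks are wanted in Spark, `F.dense_rank()` is the usual alternative.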



<pre>Problem Statement

You have a table product_sales(product_id, sale_date) representing product sales dates. Write a SQL query to find products that were sold in every month of 2025.
Sample Input (product_sales)
product_id 	sale_date
P1 	2025-01-10
P1 	2025-02-15
P1 	2025-03-20
P2 	2025-01-05
P2 	2025-02-10
Expected Output
product_id
P1
</pre>

In [9]:
data = [
    ("P1", "2025-01-10"),
    ("P1", "2025-02-12"),
    ("P1", "2025-03-09"),
    ("P1", "2025-04-18"),
    ("P1", "2025-05-03"),
    ("P1", "2025-06-27"),
    ("P1", "2025-07-14"),
    ("P1", "2025-08-21"),
    ("P1", "2025-09-02"),
    ("P1", "2025-10-11"),
    ("P1", "2025-11-06"),
    ("P1", "2025-12-05"),
    ("P2", "2025-01-05"),
    ("P2", "2025-02-10"),
    ("P2", "2025-03-15"),
    ("P2", "2025-05-20"),
    ("P2", "2025-06-08"),
    ("P2", "2025-07-22"),
    ("P2", "2025-08-30"),
    ("P2", "2025-10-01"),
    ("P2", "2025-11-19"),
    ("P2", "2025-12-07"),
    ("P3", "2025-01-02"),
    ("P3", "2025-02-14"),
    ("P3", "2025-03-03"),
    ("P3", "2025-04-25"),
    ("P3", "2025-05-09"),
    ("P3", "2025-06-16"),
    ("P3", "2025-07-07"),
    ("P3", "2025-08-12"),
    ("P3", "2025-09-28"),
    ("P3", "2025-10-20"),
    ("P3", "2025-11-03"),
    ("P3", "2025-12-29"),
]

schema = ["product_id", "sale_date"]
product_sales = spark.createDataFrame(data, schema)
product_sales = product_sales.withColumn("sale_date", F.to_date(F.col("sale_date"), "yyyy-MM-dd"))
product_sales.createOrReplaceTempView("product_sales")

In [12]:
result = spark.sql("""
    SELECT product_id
    FROM product_sales
    WHERE YEAR(sale_date) = 2025
    GROUP BY product_id
    HAVING COUNT(DISTINCT MONTH(sale_date)) = 12
    ORDER BY product_id
""")
result.show()

+----------+
|product_id|
+----------+
|        P1|
|        P3|
+----------+
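
The `HAVING COUNT(DISTINCT MONTH(sale_date)) = 12` condition amounts to collecting the set of months seen per product and checking that it has 12 members. A plain-Python sketch of that logic, useful for cross-checking the query on small inputs, might look like the following. The function name `sold_every_month` and the product ids `A` and `B` are hypothetical and appear only in this example.

```python
from collections import defaultdict
from datetime import date

def sold_every_month(sales, year):
    """Return product ids that have at least one sale in all 12 months of `year`."""
    months_seen = defaultdict(set)
    for product_id, sale_date in sales:
        d = date.fromisoformat(sale_date)
        if d.year == year:  # same role as WHERE YEAR(sale_date) = 2025
            months_seen[product_id].add(d.month)
    # A set holds distinct months, so its size mirrors COUNT(DISTINCT MONTH(...)).
    return sorted(p for p, months in months_seen.items() if len(months) == 12)

# Hypothetical sample: A sells in every month, B misses December 2025.
sales = [("A", f"2025-{m:02d}-01") for m in range(1, 13)]
sales += [("B", f"2025-{m:02d}-15") for m in range(1, 12)]
sales += [("B", "2024-12-15")]  # a December sale in another year must not count
print(sold_every_month(sales, 2025))
```

The year filter matters here: without it, B's December 2024 sale would fill the missing month and B would be reported incorrectly.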

