<pre>
Problem Statement
You have a PySpark DataFrame with daily transaction amounts.
For each day, identify the transactions whose amount is greater than
the average amount for that day (outliers).

Sample Input (transactions)
txn_date	txn_id	amount
2025-01-01	T1	100
2025-01-01	T2	200
2025-01-01	T3	500
2025-01-02	T4	300
2025-01-02	T5	400
2025-01-02	T6	600
Expected Output
txn_date	txn_id	amount
2025-01-01	T3	500
2025-01-02	T6	600
</pre>

In [16]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import Window


In [2]:
spark = SparkSession.builder.appName('Daily-Day4').getOrCreate()

In [3]:
data = [('2025-01-01', 'T1', 100),
        ('2025-01-01', 'T2', 200),
        ('2025-01-01', 'T3', 500),
        ('2025-01-02', 'T4', 300),
        ('2025-01-02', 'T5', 400),
        ('2025-01-02', 'T6', 600)]

In [4]:
schema = ['txn_date', 'txn_id', 'amount']

In [5]:
transactions = spark.createDataFrame(data=data,schema=schema)

In [7]:
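# Partition by day only. Adding orderBy would switch the default frame to
# "unbounded preceding to current row", which is not needed for a whole-day average.
window = Window.partitionBy('txn_date')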

In [17]:
avg = transactions.withColumn('AveragePerDay', F.avg('amount').over(window))
avg.show()

+----------+------+------+-----------------+
|  txn_date|txn_id|amount|    AveragePerDay|
+----------+------+------+-----------------+
|2025-01-01|    T1|   100|266.6666666666667|
|2025-01-01|    T2|   200|266.6666666666667|
|2025-01-01|    T3|   500|266.6666666666667|
|2025-01-02|    T4|   300|433.3333333333333|
|2025-01-02|    T5|   400|433.3333333333333|
|2025-01-02|    T6|   600|433.3333333333333|
+----------+------+------+-----------------+



In [18]:
result = (avg
          .filter(F.col('amount') > F.col('AveragePerDay'))  # strictly above the daily average
          .select('txn_date', 'txn_id', 'amount'))
result.show()

+----------+------+------+
|  txn_date|txn_id|amount|
+----------+------+------+
|2025-01-01|    T3|   500|
|2025-01-02|    T6|   600|
+----------+------+------+
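
As a quick sanity check on the result above, the same rule can be reproduced without Spark. This is a minimal plain-Python sketch and not part of the original notebook; the names `rows`, `amounts_by_day`, `daily_avg` and `outliers` are illustrative only. It groups the amounts by day, computes each day's mean, and keeps the transactions strictly above it.

```python
from collections import defaultdict
from statistics import mean

# Same sample rows as the `transactions` DataFrame
rows = [
    ('2025-01-01', 'T1', 100),
    ('2025-01-01', 'T2', 200),
    ('2025-01-01', 'T3', 500),
    ('2025-01-02', 'T4', 300),
    ('2025-01-02', 'T5', 400),
    ('2025-01-02', 'T6', 600),
]

# Collect the amounts belonging to each day
amounts_by_day = defaultdict(list)
for txn_date, _txn_id, amount in rows:
    amounts_by_day[txn_date].append(amount)

# Average amount per day (mirrors the AveragePerDay column)
daily_avg = {day: mean(amounts) for day, amounts in amounts_by_day.items()}

# Keep only transactions strictly above their own day's average
outliers = [row for row in rows if row[2] > daily_avg[row[0]]]

print(outliers)
```

On the sample data this keeps T3 and T6, matching the Spark output: 500 exceeds the 2025-01-01 average of about 266.67, and 600 exceeds the 2025-01-02 average of about 433.33.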



<pre>
Problem Statement
You have a table employee_manager(emp_id, manager_id, change_date)
that records each employee's manager over time.
Write a SQL query to find employees who never changed their manager across all records.

Sample Input (employee_manager)
emp_id	manager_id	change_date
1	10	2025-01-01
1	10	2025-02-01
2	11	2025-01-01
2	12	2025-03-01
3	13	2025-01-05
Expected Output
emp_id
1
3
</pre>

In [25]:
data = [(1, 10, '2025-01-01'),
        (1, 10, '2025-02-01'),
        (2, 11, '2025-01-01'),
        (2, 12, '2025-03-01'),
        (3, 13, '2025-01-05')]
schema = ['emp_id', 'manager_id', 'change_date']

employee_manager = spark.createDataFrame(data=data,schema=schema)
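# Parse the string column into a proper date column
employee_manager = employee_manager.withColumn("change_date", F.to_date("change_date", "yyyy-MM-dd"))
# Register the view under the table name used in the problem statement
employee_manager.createOrReplaceTempView("employee_manager")

In [30]:
# An employee never changed manager if all of their rows share a single manager_id
result = spark.sql('''
    SELECT emp_id
    FROM employee_manager
    GROUP BY emp_id
    HAVING COUNT(DISTINCT manager_id) = 1''')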
result.show()

+------+
|emp_id|
+------+
|     1|
|     3|
+------+
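
The query itself is standard SQL, so it can also be checked outside Spark. The sketch below is an assumption-laden cross-check and not part of the original notebook: it loads the same sample rows into an in-memory SQLite database through Python's built-in `sqlite3` module. The names `conn`, `never_changed` and `never_changed_alt` are illustrative, and the `ORDER BY` is added only to make the row order deterministic. The second query is an alternative formulation (`MIN(manager_id) = MAX(manager_id)`) that gives the same answer on this data.

```python
import sqlite3

# In-memory database holding the same sample rows as the Spark view
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE employee_manager (emp_id INTEGER, manager_id INTEGER, change_date TEXT)'
)
conn.executemany(
    'INSERT INTO employee_manager VALUES (?, ?, ?)',
    [
        (1, 10, '2025-01-01'),
        (1, 10, '2025-02-01'),
        (2, 11, '2025-01-01'),
        (2, 12, '2025-03-01'),
        (3, 13, '2025-01-05'),
    ],
)

# Same logic as the Spark SQL query: exactly one distinct manager per employee
never_changed = [row[0] for row in conn.execute('''
    SELECT emp_id
    FROM employee_manager
    GROUP BY emp_id
    HAVING COUNT(DISTINCT manager_id) = 1
    ORDER BY emp_id
''')]

# Alternative formulation: the smallest and largest manager_id coincide
never_changed_alt = [row[0] for row in conn.execute('''
    SELECT emp_id
    FROM employee_manager
    GROUP BY emp_id
    HAVING MIN(manager_id) = MAX(manager_id)
    ORDER BY emp_id
''')]

conn.close()
print(never_changed, never_changed_alt)
```

Both queries return employees 1 and 3. Employee 2 is excluded because their rows contain two different managers (11 and 12).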

