<pre>
Problem 1: PySpark – First Purchase per Customer
Problem Statement

You have a PySpark DataFrame containing customer purchase data.
Each row represents a purchase.
Write a PySpark program to find the first purchase date for each customer and the amount spent on that date.
Sample Input (purchases)
customer_id 	purchase_date 	amount
101 	2025-01-03 	250
101 	2025-01-05 	300
102 	2025-01-01 	150
102 	2025-01-02 	200
103 	2025-01-04 	500
Expected Output
customer_id 	first_purchase_date 	amount
101 	2025-01-03 	250
102 	2025-01-01 	150
103 	2025-01-04 	500
</pre>



In [26]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import Window

In [11]:
spark = SparkSession.builder.appName('Daily-Coding-Day3').getOrCreate()

In [10]:
data = [
    [101, "2025-01-03", 250],
    [101, "2025-01-05", 300],
    [102, "2025-01-01", 150],
    [102, "2025-01-02", 200],
    [103, "2025-01-04", 500],
]

In [12]:
schema = ['customer_id', 'purchase_date', 'amount']

In [13]:
customer_df = spark.createDataFrame(data=data, schema=schema)

In [21]:
# purchase_date is loaded as a string; cast it to a date so ordering is chronological
customer_df = customer_df.withColumn('purchase_date', F.col('purchase_date').cast("date"))

In [24]:
customer_df.printSchema()
customer_df.show()

root
 |-- customer_id: long (nullable = true)
 |-- purchase_date: date (nullable = true)
 |-- amount: long (nullable = true)

+-----------+-------------+------+
|customer_id|purchase_date|amount|
+-----------+-------------+------+
|        101|   2025-01-03|   250|
|        101|   2025-01-05|   300|
|        102|   2025-01-01|   150|
|        102|   2025-01-02|   200|
|        103|   2025-01-04|   500|
+-----------+-------------+------+



In [33]:
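# Number each customer's purchases from earliest to latest
window_spec = Window.partitionBy("customer_id").orderBy("purchase_date")

# Keep only the earliest purchase (rn == 1) for each customer
result = (
    customer_df
    .withColumn("rn", F.row_number().over(window_spec))
    .filter(F.col("rn") == 1)
    .select(
        "customer_id",
        F.col("purchase_date").alias("first_purchase_date"),
        "amount",
    )
)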

In [34]:
result.show()

+-----------+-------------------+------+
|customer_id|first_purchase_date|amount|
+-----------+-------------------+------+
|        101|         2025-01-03|   250|
|        102|         2025-01-01|   150|
|        103|         2025-01-04|   500|
+-----------+-------------------+------+
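
As a cross-check that needs no Spark session, the same "earliest row per customer" logic can be sketched in plain Python. This is only an illustration: the function name `first_purchases` is hypothetical, and the sketch assumes ISO-formatted date strings (`yyyy-MM-dd`), which sort chronologically when compared as text.

```python
def first_purchases(rows):
    """Map each customer_id to (first_purchase_date, amount) of the earliest purchase."""
    earliest = {}
    for customer_id, purchase_date, amount in rows:
        current = earliest.get(customer_id)
        # Keep this row if the customer is new or the row has an earlier date
        if current is None or purchase_date < current[0]:
            earliest[customer_id] = (purchase_date, amount)
    return earliest


# Same sample rows as the purchases DataFrame
purchases = [
    (101, "2025-01-03", 250),
    (101, "2025-01-05", 300),
    (102, "2025-01-01", 150),
    (102, "2025-01-02", 200),
    (103, "2025-01-04", 500),
]

for customer_id, (first_date, amount) in sorted(first_purchases(purchases).items()):
    print(customer_id, first_date, amount)
```

If a customer has two purchases on the same earliest date, this sketch keeps the first one it encounters. The `row_number()` approach has the same caveat: with ties in `purchase_date`, which row receives `rn = 1` is not guaranteed unless a tie-breaking column is added to `orderBy`.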



<pre>
Problem 2: SQL – Detect Employees with Salary Changes
Problem Statement

You have a table employee_salaries(emp_id, effective_date, salary) containing
salary history for employees.
Write a SQL query to find employees whose salary changed at least once and
display the number of changes for each.
Sample Input (employee_salaries)
emp_id 	effective_date 	salary
1 	2025-01-01 	50000
1 	2025-02-01 	55000
1 	2025-03-01 	60000
2 	2025-01-15 	40000
2 	2025-03-01 	45000
3 	2025-01-10 	30000
Expected Output
emp_id 	salary_change_count
1 	2
2 	1
</pre>

In [35]:
data = [
    (1, "2025-01-01", 50000),
    (1, "2025-02-01", 55000),
    (1, "2025-03-01", 60000),
    (2, "2025-01-15", 40000),
    (2, "2025-03-01", 45000),
    (3, "2025-01-10", 30000),
]

columns = ["emp_id", "effective_date", "salary"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Cast effective_date to date type
df = df.withColumn("effective_date", F.to_date("effective_date", "yyyy-MM-dd"))

# Register as a temporary SQL table (view)
df.createOrReplaceTempView("employee_salaries")

In [36]:
result = spark.sql('''with salary_diff as (
    select
        emp_id,
        salary,
        lag(salary) over (partition by emp_id order by effective_date) as prev_salary
    from employee_salaries
)
select
    emp_id,
    count(*) as salary_change_count
from salary_diff
where prev_salary is not null and salary <> prev_salary
group by emp_id;''')

In [37]:
result.show()

+------+-------------------+
|emp_id|salary_change_count|
+------+-------------------+
|     1|                  2|
|     2|                  1|
+------+-------------------+
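
The `lag`-based comparison can also be checked without Spark. The plain-Python sketch below mirrors the query under stated assumptions: the function name `salary_change_counts` is hypothetical, dates are ISO-formatted strings so that sorting them as text is chronological, and each employee has at most one row per `effective_date`.

```python
from collections import defaultdict


def salary_change_counts(rows):
    """Count, per employee, how often the salary differs from the previous record."""
    history = defaultdict(list)
    for emp_id, effective_date, salary in rows:
        history[emp_id].append((effective_date, salary))

    counts = {}
    for emp_id, entries in history.items():
        entries.sort()  # order by effective_date, like the window's ORDER BY
        salaries = [salary for _, salary in entries]
        # Pair each salary with its predecessor, like lag(salary)
        changes = sum(1 for prev, cur in zip(salaries, salaries[1:]) if cur != prev)
        if changes > 0:  # employees with no change are left out, as in the query
            counts[emp_id] = changes
    return counts


# Same sample rows as the employee_salaries view
employee_salaries = [
    (1, "2025-01-01", 50000),
    (1, "2025-02-01", 55000),
    (1, "2025-03-01", 60000),
    (2, "2025-01-15", 40000),
    (2, "2025-03-01", 45000),
    (3, "2025-01-10", 30000),
]

print(salary_change_counts(employee_salaries))
```

Employee 3 has a single record, so there is no previous salary to compare against and the employee does not appear in the result, matching the SQL output.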

