# Scenario 4
In a large-scale data processing project, multiple PySpark notebooks need the same custom transformation functions, such as date formatting, null handling, and data validation. Instead of duplicating that code in every notebook, the reusable functions are collected in a **Python class**. This keeps the logic consistent, reduces maintenance effort, and improves code readability across the project.

## **PYTHON CLASS**

In [0]:
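# Import only the functions used below; a wildcard import would shadow Python built-ins such as sum and max.
from pyspark.sql.functions import col, desc, row_number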
from pyspark.sql.window import Window

In [0]:
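class DataValidation:
  """Reusable DataFrame cleaning helpers shared across notebooks."""

  def __init__(self, df):
    self.df = df

  def dedup(self, keyCol, cdcCol):
    # Rank rows within each key, newest cdcCol value first, and keep only rank 1.
    w = Window.partitionBy(keyCol).orderBy(desc(cdcCol))
    df = self.df.withColumn("dedup", row_number().over(w))
    return df.filter(col("dedup") == 1).drop("dedup")

  def removeNulls(self, nullCol):
    # Keep only the rows where nullCol has a value.
    return self.df.filter(col(nullCol).isNotNull())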

In [0]:
df = spark.createDataFrame([
  (1, 'a', '2020-01-01', 100),
  (2, 'b', '2020-01-02', 200),
  (3, 'c', '2020-01-03', 300),
  (4, 'd', '2020-01-04', 400),
  (4, 'd', '2020-01-05', 500)], 
['id', 'name', 'date', 'price'])

display(df)

id,name,date,price
1,a,2020-01-01,100
2,b,2020-01-02,200
3,c,2020-01-03,300
4,d,2020-01-04,400
4,d,2020-01-05,500


In [0]:
cls_obj = DataValidation(df)


In [0]:
df_dedup = cls_obj.dedup("id", "date")

display(df_dedup)

id,name,date,price
1,a,2020-01-01,100
2,b,2020-01-02,200
3,c,2020-01-03,300
4,d,2020-01-05,500
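
For `id` 4, the result keeps the `2020-01-05` row and drops the older one. To make the window logic concrete, the following is a minimal plain-Python sketch of the same keep-the-latest-row-per-key rule, with no Spark involved. The helper name `dedup_latest` is hypothetical and not part of `DataValidation`; it assumes the change-tracking values compare correctly with `>`, which holds for ISO-formatted date strings like the ones in the sample data.

```python
def dedup_latest(rows, key, cdc):
    """Keep, for each value of `key`, the row with the largest `cdc` value."""
    latest = {}
    for row in rows:
        k = row[key]
        # Replace the stored row only when this one is strictly newer.
        if k not in latest or row[cdc] > latest[k][cdc]:
            latest[k] = row
    return list(latest.values())


# Same sample data as the DataFrame above, expressed as dictionaries.
rows = [
    {"id": 1, "name": "a", "date": "2020-01-01", "price": 100},
    {"id": 2, "name": "b", "date": "2020-01-02", "price": 200},
    {"id": 3, "name": "c", "date": "2020-01-03", "price": 300},
    {"id": 4, "name": "d", "date": "2020-01-04", "price": 400},
    {"id": 4, "name": "d", "date": "2020-01-05", "price": 500},
]

for row in dedup_latest(rows, "id", "date"):
    print(row)
```

One difference is worth keeping in mind: when two rows share the same key and the same `cdcCol` value, this sketch keeps the first one it sees, whereas `row_number()` in Spark may pick either row, so a tie-breaking column in the `orderBy` is advisable if ties can occur.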
