### Scenario

In a large-scale data processing project, multiple PySpark notebooks need the same set of custom transformation functions, such as state formatting, null handling, and data validation. Instead of duplicating the code in every notebook, the functions are collected in a single Python class. This keeps behavior consistent, reduces maintenance effort, and improves readability across the project.

## **Python Class**

In [0]:
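from pyspark.sql.functions import col, desc, row_number
from pyspark.sql.window import Window


class DataValidation:
    """Reusable DataFrame clean-up helpers shared across notebooks."""

    def __init__(self, df):
        self.df = df

    def dedup(self, keyCol, cdcCol):
        """Keep only the most recent row per keyCol, ordered by cdcCol."""
        # Rank rows within each key, newest cdcCol value first.
        # Rows that tie on cdcCol are ranked in no guaranteed order.
        windowSpec = Window.partitionBy(keyCol).orderBy(desc(cdcCol))
        df = self.df.withColumn("dedup", row_number().over(windowSpec))
        # Rank 1 is the latest row; drop the helper column afterwards.
        return df.filter(col("dedup") == 1).drop("dedup")

    def removeNulls(self, nullCol):
        """Drop rows where nullCol is null."""
        return self.df.filter(col(nullCol).isNotNull())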

In [0]:
# id "1" appears twice, so dedup has a row to remove.
rows = [("1", "2020-01-01", 100), ("1", "2020-01-02", 260),
        ("2", "2020-01-02", 200), ("3", "2020-01-03", 300)]
df = spark.createDataFrame(rows, ["id", "date", "value"])
df.display()

id,date,value
1,2020-01-01,100
1,2020-01-02,260
2,2020-01-02,200
3,2020-01-03,300


In [0]:
cls_obj = DataValidation(df)

In [0]:
df_dedup = cls_obj.dedup("id","date")
df_dedup.display()

id,date,value
1,2020-01-02,260
2,2020-01-02,200
3,2020-01-03,300
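
As a cross-check on the output above, the rule that `dedup` applies (keep the row with the largest `date` for each `id`) can be sketched in plain Python, with no Spark session required. The helper name `latest_per_key` is hypothetical and not part of the `DataValidation` class. It only mirrors the logic on the same sample rows, and it assumes the CDC column sorts correctly as a string, which holds for ISO `YYYY-MM-DD` dates.

```python
def latest_per_key(rows, key_idx, cdc_idx):
    """Plain-Python illustration: keep the row with the largest cdc value per key.

    Hypothetical helper for illustration only, not part of DataValidation.
    """
    latest = {}
    for row in rows:
        key = row[key_idx]
        # Replace the stored row only when this one is strictly newer.
        if key not in latest or row[cdc_idx] > latest[key][cdc_idx]:
            latest[key] = row
    # Sort by key so the result order is deterministic.
    return [latest[k] for k in sorted(latest)]


# Same sample rows as the notebook's DataFrame: (id, date, value).
rows = [
    ("1", "2020-01-01", 100),
    ("1", "2020-01-02", 260),
    ("2", "2020-01-02", 200),
    ("3", "2020-01-03", 300),
]

deduped = latest_per_key(rows, key_idx=0, cdc_idx=1)
for row in deduped:
    print(row)
```

On ties this sketch keeps the first row it sees. Spark's `row_number()` gives no such guarantee when several rows share the same `date`, so adding a tie-breaking column to `orderBy` may be needed for reproducible results.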
