**Handling Duplicate Rows in PySpark DataFrames: Methods and Examples**

Handling duplicate rows in a DataFrame is a common operation in data processing. In PySpark, this can be done using several methods provided by the pyspark.sql.DataFrame API. The most commonly used methods for handling duplicates are `dropDuplicates()` and `distinct()`. Let's look at examples of how to use these methods.

**1. Removing Duplicate Rows Using dropDuplicates()**

The dropDuplicates() method removes duplicate rows based on all columns or a subset of columns.

**Example 1: Removing Duplicates Based on All Columns**

In [None]:
import findspark
findspark.init()  # call the function; a bare `findspark.init` does nothing
import getpass
from pyspark.sql import SparkSession

username = getpass.getuser()
spark = SparkSession. \
    builder. \
    config("spark.sql.catalogImplementation", "hive"). \
    config("spark.sql.warehouse.dir",f"/Users/{username}/Documents/data/warehouse"). \
    enableHiveSupport(). \
    master("local"). \
    getOrCreate()

In [2]:
# Sample data
data = [
    ("Alice", 30, "F"),
    ("Bob", 25, "M"),
    ("Alice", 30, "F"),  # Duplicate row
    ("Charlie", 35, "M")
]

# Create DataFrame
df = spark.createDataFrame(data, ["Name", "Age", "Gender"])

In [3]:
# Show original DataFrame
print("Original DataFrame:")
df.show()

Original DataFrame:
+-------+---+------+
|   Name|Age|Gender|
+-------+---+------+
|  Alice| 30|     F|
|    Bob| 25|     M|
|  Alice| 30|     F|
|Charlie| 35|     M|
+-------+---+------+



In [4]:
# Remove duplicate rows based on all columns
df_no_duplicates = df.dropDuplicates()

In [5]:
# Show DataFrame after removing duplicates
print("DataFrame after removing duplicates:")
df_no_duplicates.show()

DataFrame after removing duplicates:
+-------+---+------+
|   Name|Age|Gender|
+-------+---+------+
|  Alice| 30|     F|
|    Bob| 25|     M|
|Charlie| 35|     M|
+-------+---+------+



In this example, the row ("Alice", 30, "F") was duplicated, and the dropDuplicates() method removed one of them.

**Example 2: Removing Duplicates Based on Specific Columns**

You can also remove duplicates based on a subset of columns.

In [6]:
# Remove duplicates based on specific columns (e.g., "Name")
df_no_duplicates_name = df.dropDuplicates(["Name"])

# Show DataFrame after removing duplicates based on "Name" column
df_no_duplicates_name.show()

+-------+---+------+
|   Name|Age|Gender|
+-------+---+------+
|  Alice| 30|     F|
|    Bob| 25|     M|
|Charlie| 35|     M|
+-------+---+------+



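In this case, one row is kept for each distinct Name and the others are dropped. Spark does not guarantee which row of a duplicate group survives, so do not rely on it being the first occurrence. Here the two "Alice" rows are identical, so the choice makes no difference.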

**2. Using distinct() to Remove Duplicates**

The distinct() method removes duplicate rows based on all columns, similar to dropDuplicates() but without the option to specify a subset of columns.

In [7]:
df_distinct_rows = df.distinct()
df_distinct_rows.show()

+-------+---+------+
|   Name|Age|Gender|
+-------+---+------+
|  Alice| 30|     F|
|    Bob| 25|     M|
|Charlie| 35|     M|
+-------+---+------+



In this case, the result is the same as using dropDuplicates(), as distinct() considers all columns when removing duplicates.

**3. Handling Duplicates with Conditions**

Sometimes you need control over which row of a duplicate group is kept, or you want to aggregate the data while removing duplicates.

For example, to keep the most recent entry of each duplicate group based on a timestamp, you can use the groupBy() method followed by an aggregation.

In [8]:
# Sample data with a timestamp column
data_with_timestamp = [
    ("Alice", 30, "F", "2023-01-01"),
    ("Bob", 25, "M", "2023-01-02"),
    ("Alice", 30, "F", "2023-01-02"),  # Duplicate row with a different timestamp
    ("Charlie", 35, "M", "2023-01-01")
]

In [9]:
# Create DataFrame with timestamp column
df_timestamp = spark.createDataFrame(data_with_timestamp, ["Name", "Age", "Gender", "Timestamp"])

In [10]:
# Show original DataFrame
print("Original DataFrame:")
df_timestamp.show()

Original DataFrame:
+-------+---+------+----------+
|   Name|Age|Gender| Timestamp|
+-------+---+------+----------+
|  Alice| 30|     F|2023-01-01|
|    Bob| 25|     M|2023-01-02|
|  Alice| 30|     F|2023-01-02|
|Charlie| 35|     M|2023-01-01|
+-------+---+------+----------+



In [11]:
import pyspark.sql.functions as F

In [12]:
# Use groupBy to remove duplicates and keep the most recent Timestamp for each Name
df_grouped = df_timestamp.groupBy("Name", "Age", "Gender").agg(
    F.max("Timestamp").alias("LatestTimestamp")
)

In [13]:
# Show DataFrame after grouping and aggregation
print("DataFrame after removing duplicates with aggregation:")
df_grouped.show()

DataFrame after removing duplicates with aggregation:
+-------+---+------+---------------+
|   Name|Age|Gender|LatestTimestamp|
+-------+---+------+---------------+
|  Alice| 30|     F|     2023-01-02|
|    Bob| 25|     M|     2023-01-02|
|Charlie| 35|     M|     2023-01-01|
+-------+---+------+---------------+



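Here, groupBy() on Name, Age, and Gender combined with max() on Timestamp collapses each group to a single row carrying its most recent timestamp. This works because every column other than Timestamp is part of the grouping key. The timestamps are ISO-formatted strings, so max() compares them lexicographically, which matches chronological order.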

An alternative is the row_number() window function, which ranks the rows within each duplicate group and lets you keep an entire row rather than an aggregated value. Here is a complete example with sample data:

In [14]:
# Sample data
data = [
    ("Alice", 30, "F", "2023-01-01"),
    ("Bob", 25, "M", "2023-01-02"),
    ("Alice", 30, "F", "2023-01-02"),  # Duplicate row
    ("Charlie", 35, "M", "2023-01-01")
]

In [15]:
# Create DataFrame
df = spark.createDataFrame(data, ["Name", "Age", "Gender", "Timestamp"])

In [28]:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

# Define a window specification to partition by "Name" and "Age", and order by "Timestamp"
window_spec = Window.partitionBy("Name", "Age").orderBy("Timestamp")

# Add a row number column based on the window specification
df_with_row_num = df.withColumn("row_num", row_number().over(window_spec))

# Filter to keep only the first occurrence of each duplicate group
df_cleaned = df_with_row_num.filter(df_with_row_num.row_num == 1).drop("row_num")

# Show the cleaned DataFrame
df_cleaned.show()


+-------+---+------+----------+
|   Name|Age|Gender| Timestamp|
+-------+---+------+----------+
|  Alice| 30|     F|2023-01-01|
|    Bob| 25|     M|2023-01-02|
|Charlie| 35|     M|2023-01-01|
+-------+---+------+----------+



In [27]:
# Define window specification to partition by "Name" and "Age", and order by "Timestamp" in descending order
window_spec_desc = Window.partitionBy("Name", "Age").orderBy(df["Timestamp"].desc())

# Assign row numbers based on the descending order of Timestamp (latest first)
df_with_row_num_desc = df.withColumn("row_num", row_number().over(window_spec_desc))

# Filter to keep only the latest occurrence (row_num == 1)
df_cleaned_latest = df_with_row_num_desc.filter(df_with_row_num_desc.row_num == 1).drop("row_num")

# Show the cleaned DataFrame with the latest Timestamp for each person
df_cleaned_latest.show()


+-------+---+------+----------+
|   Name|Age|Gender| Timestamp|
+-------+---+------+----------+
|  Alice| 30|     F|2023-01-02|
|    Bob| 25|     M|2023-01-02|
|Charlie| 35|     M|2023-01-01|
+-------+---+------+----------+



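This approach is particularly useful when you want to keep exactly one row per group, chosen by an ordering column such as Timestamp: ascending order keeps the earliest row, and descending order keeps the latest. If several rows in a group share the same ordering value, row_number() picks one of them arbitrarily unless you add further columns to orderBy().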

**Conclusion:**

-   dropDuplicates(): Removes duplicates based on all columns or a specified subset of columns.
-   distinct(): Removes duplicates based on all columns (equivalent to dropDuplicates() without column specification).
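-   groupBy() + Aggregation: Allows you to remove duplicates and apply custom logic, such as keeping the most recent timestamp for each group.
-   row_number() + Window: Ranks rows within each duplicate group so you can keep one complete row, such as the earliest or the latest entry.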
