# PySpark Distinct to Drop Duplicate Rows
by Aishwarya Raut

PySpark's `distinct()` function removes rows that are duplicated across all columns of a DataFrame, while `dropDuplicates()` removes duplicates based on one or more selected columns (or on all columns when no columns are given).

In [2]:
# Import PySpark
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

# Create SparkSession
spark = SparkSession.builder.appName("PySpark").getOrCreate()

# Prepare data
data = [
    ("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100),
]

# Create DataFrame
columns = ["emp_name", "department", "salary"]
df = spark.createDataFrame(data, schema=columns)
df.printSchema()

root
 |-- emp_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: long (nullable = true)



In [4]:
# df.show()

In [9]:
# Read the CSV file
file_path = "C:\\Users\\pcc\\Desktop\\daily-website-visitors.csv"
df = spark.read.csv(file_path, header=True, inferSchema=True)

# Replace the dots in the column names with underscores, since Spark treats
# a dot in a column name as nested-field access unless it is backtick-quoted
df = df.withColumnsRenamed({"Day.Of.Week": "Day_Of_Week", "Page.Loads": "Page_Loads",
                            "Unique.Visits": "Unique_Visits", "First.Time.Visits": "First_Time_Visits",
                            "Returning.Visits": "Returning_Visits"})

df.show()

+---+---------+-----------+----------+----------+-------------+-----------------+----------------+
|Row|      Day|Day_Of_Week|      Date|Page_Loads|Unique_Visits|First_Time_Visits|Returning_Visits|
+---+---------+-----------+----------+----------+-------------+-----------------+----------------+
|  1|   Sunday|          1| 9/14/2014|      2146|         1582|             1430|             152|
|  2|   Monday|          2| 9/15/2014|      3621|         2528|             2297|             231|
|  3|  Tuesday|          3| 9/16/2014|      3698|         2630|             2352|             278|
|  4|Wednesday|          4| 9/17/2014|      3667|         2614|             2327|             287|
|  5| Thursday|          5| 9/18/2014|      3316|         2366|             2130|             236|
|  6|   Friday|          6| 9/19/2014|      2815|         1863|             1622|             241|
|  7| Saturday|          7| 9/20/2014|      1658|         1118|              985|             133|
|  8|   Su

In [11]:
df.count()

2167

# 1. Get Distinct Rows (By Comparing All Columns)

In [10]:
distinctDF = df.distinct()
print("Distinct count: " + str(distinctDF.count()))
distinctDF.show(truncate=False)

Distinct count: 2167
+----+---------+-----------+----------+----------+-------------+-----------------+----------------+
|Row |Day      |Day_Of_Week|Date      |Page_Loads|Unique_Visits|First_Time_Visits|Returning_Visits|
+----+---------+-----------+----------+----------+-------------+-----------------+----------------+
|294 |Saturday |7          |07-04-2015|1602      |1104         |900              |204             |
|721 |Saturday |7          |09-03-2016|1966      |1376         |1101             |275             |
|1180|Wednesday|4          |12-06-2017|4595      |3452         |2766             |686             |
|1333|Tuesday  |3          |05-08-2018|6702      |4577         |3700             |877             |
|1420|Friday   |6          |08-03-2018|3405      |2345         |1833             |512             |
|1550|Tuesday  |3          |12-11-2018|7659      |5267         |4330             |937             |
|1942|Tuesday  |3          |01-07-2020|3760      |2915         |2492            

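The `distinct()` function on a DataFrame returns a new DataFrame with the duplicate records removed. Here the count stays at 2167, the same as `df.count()`, because this dataset contains no fully duplicated rows.

Alternatively, use the `dropDuplicates()` function, which, when called without arguments, also returns a new DataFrame with the duplicate rows removed.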

In [12]:
df2=df.dropDuplicates()
print("Distinct count: "+str(df2.count()))
df2.show(truncate=False)

Distinct count: 2167
+----+---------+-----------+----------+----------+-------------+-----------------+----------------+
|Row |Day      |Day_Of_Week|Date      |Page_Loads|Unique_Visits|First_Time_Visits|Returning_Visits|
+----+---------+-----------+----------+----------+-------------+-----------------+----------------+
|294 |Saturday |7          |07-04-2015|1602      |1104         |900              |204             |
|721 |Saturday |7          |09-03-2016|1966      |1376         |1101             |275             |
|1180|Wednesday|4          |12-06-2017|4595      |3452         |2766             |686             |
|1333|Tuesday  |3          |05-08-2018|6702      |4577         |3700             |877             |
|1420|Friday   |6          |08-03-2018|3405      |2345         |1833             |512             |
|1550|Tuesday  |3          |12-11-2018|7659      |5267         |4330             |937             |
|1942|Tuesday  |3          |01-07-2020|3760      |2915         |2492           

# 2. PySpark Distinct of Selected Multiple Columns

In [13]:
dropDistinctDF = df.dropDuplicates(["Day", "Day_Of_Week"])
print("Distinct count of Day & Day_Of_Week : " + str(dropDistinctDF.count()))
dropDistinctDF.show(truncate=False)

Distinct count of Day & Day_Of_Week : 7
+---+---------+-----------+---------+----------+-------------+-----------------+----------------+
|Row|Day      |Day_Of_Week|Date     |Page_Loads|Unique_Visits|First_Time_Visits|Returning_Visits|
+---+---------+-----------+---------+----------+-------------+-----------------+----------------+
|6  |Friday   |6          |9/19/2014|2815      |1863         |1622             |241             |
|2  |Monday   |2          |9/15/2014|3621      |2528         |2297             |231             |
|7  |Saturday |7          |9/20/2014|1658      |1118         |985              |133             |
|1  |Sunday   |1          |9/14/2014|2146      |1582         |1430             |152             |
|5  |Thursday |5          |9/18/2014|3316      |2366         |2130             |236             |
|3  |Tuesday  |3          |9/16/2014|3698      |2630         |2352             |278             |
|4  |Wednesday|4          |9/17/2014|3667      |2614         |2327          