### Duplicate data can mislead analysis, introduce bias, and degrade machine learning model performance.

Identifying and handling duplicate values is a crucial step in data cleaning during Exploratory Data Analysis (EDA).


### Why Remove Duplicate Data?
Duplicate data can arise due to:

- Errors in data collection (e.g., importing the same data twice).
- Data entry mistakes (e.g., users entering the same information multiple times).
- Merging multiple datasets without proper deduplication.

Keeping duplicate data can lead to incorrect conclusions and distorted results, because repeated rows are counted more than once in totals, averages, and frequency counts.
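As a rough illustration of the first cause, the sketch below simulates importing the same batch twice. The `batch` DataFrame and its values are made up for this example and are not taken from AB_NYC_2019.csv.

```python
import pandas as pd

# Hypothetical batch of records (illustrative values only)
batch = pd.DataFrame({
    "id": [1, 2, 3],
    "price": [149, 225, 150],
})

# Simulate importing the same batch twice by stacking it on itself
combined = pd.concat([batch, batch], ignore_index=True)

# Every row of the second copy repeats a row from the first copy
n_duplicates = int(combined.duplicated().sum())
print(n_duplicates)  # 3

# The repeated rows double the total, distorting any aggregate built on it
print(batch["price"].sum(), combined["price"].sum())  # 524 1048
```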

In [14]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
print("Libraries are imported")

Libraries are imported


In [18]:
df=pd.read_csv("AB_NYC_2019.csv")
df

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.94190,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.10,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,2,0,,,2,9
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,Brooklyn,Bushwick,40.70184,-73.93317,Private room,40,4,0,,,2,36
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,10,0,,,1,27
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,55,1,0,,,6,2


In [26]:
df.shape



(48895, 16)

In [64]:
df[df.duplicated()]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365


In [57]:
df.duplicated().sum()

0

### 1. Identifying Duplicate Data
   
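We first need to identify duplicate rows in the dataset. Let's load the AB_NYC_2019.csv dataset and count its fully duplicated rows. df.duplicated() returns a Boolean Series with one entry per row, so summing it gives the number of rows flagged as duplicates.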

In [71]:
import pandas as pd

df = pd.read_csv("AB_NYC_2019.csv")

df.duplicated().sum()

0
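Since the real dataset yields 0 here, a small made-up DataFrame may make the mechanics easier to see. This is only a sketch; `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: row 2 repeats row 0 exactly
toy = pd.DataFrame({
    "host": ["John", "Laura", "John"],
    "price": [149, 80, 149],
})

# duplicated() returns a Boolean Series: True where a row repeats an earlier row
mask = toy.duplicated()
print(mask.tolist())  # [False, False, True]

# True counts as 1 when summed, so the sum is the number of flagged rows
print(int(mask.sum()))  # 1
```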

#### Viewing the Duplicate Rows


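To see which rows are duplicated, filter the DataFrame with the Boolean mask. Because the count above is 0, the result is empty and only the column headers are displayed:
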
In [78]:
df[df.duplicated()]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365


### 2. Handling Duplicate Values

There are different ways to deal with duplicate values, depending on the requirements of our analysis.

#### (i) Keeping the First Occurrence
By default (keep="first"), df.duplicated() marks every occurrence of a duplicate row except the first as True.


In [84]:
df[df.duplicated(keep="first")]


Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365


#### (ii) Keeping the Last Occurrence
To treat the last occurrence as the one to keep and mark all earlier occurrences as duplicates, pass keep="last":

In [89]:
df[df.duplicated(keep="last")]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365


#### (iii) Identifying All Duplicates
To find every row that has a duplicate, including the first and last occurrences, pass keep=False:

In [92]:
df[df.duplicated(keep=False)]

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
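Because AB_NYC_2019.csv contains no fully duplicated rows, all three outputs above are empty. The following sketch compares the three `keep` options on a small made-up DataFrame; `toy` and its contents are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data: rows 0, 2 and 3 are identical
toy = pd.DataFrame({
    "city": ["Chicago", "Houston", "Chicago", "Chicago"],
    "score": [9.2, 8.7, 9.2, 9.2],
})

# keep="first": the first occurrence is left unmarked, later copies are flagged
first = toy.duplicated(keep="first")
print(first.tolist())  # [False, False, True, True]

# keep="last": the last occurrence is left unmarked, earlier copies are flagged
last = toy.duplicated(keep="last")
print(last.tolist())   # [True, False, True, False]

# keep=False: every row that belongs to a duplicated group is flagged
every = toy.duplicated(keep=False)
print(every.tolist())  # [True, False, True, True]
```

The unique row (Houston) is never flagged; only the choice of which copy counts as the "original" changes between the options.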


### 3. Removing Duplicates
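Once identified, we can remove duplicate rows.

#### (i) Removing All Duplicate Rows
By default, drop_duplicates() keeps the first occurrence of each duplicated row and removes the later copies.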

In [96]:
df = df.drop_duplicates()
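A minimal sketch of how `drop_duplicates()` behaves, using a made-up DataFrame (`toy` is hypothetical). It also shows the `keep` parameter, which accepts the same values as in `duplicated()`, and resetting the index afterwards, which is optional.

```python
import pandas as pd

# Hypothetical toy data: row 2 repeats row 0 exactly
toy = pd.DataFrame({
    "city": ["Chicago", "Houston", "Chicago"],
    "score": [9.2, 8.7, 9.2],
})

# Default (keep="first"): row 0 survives, row 2 is dropped
deduped = toy.drop_duplicates()
print(deduped.index.tolist())       # [0, 1]

# keep="last": row 2 survives, row 0 is dropped
deduped_last = toy.drop_duplicates(keep="last")
print(deduped_last.index.tolist())  # [1, 2]

# keep=False: every copy of a duplicated row is dropped
no_repeats = toy.drop_duplicates(keep=False)
print(no_repeats["city"].tolist())  # ['Houston']

# The original index labels are preserved, so reset them if a clean 0..n-1 index is needed
clean = toy.drop_duplicates(keep="last").reset_index(drop=True)
print(clean.index.tolist())         # [0, 1]
```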

### 4. Handling Duplicates in a Specific Column
Sometimes, we need to check for duplicates in a specific column instead of the entire dataset.

Consider the following dataset:

| City | Ranking | Score |
|---|---|---|
| New York | 1 | 9.8 |
| Los Angeles | 2 | 9.5 |
| Chicago | 3 | 9.2 |
| Houston | 4 | 8.7 |
| Phoenix | 5 | 8.5 |
| Los Angeles | 6 | 8.2 |
| Chicago | 7 | 8.1 |
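One possible way to check this table for duplicates in the City column is the `subset` parameter of `duplicated()` and `drop_duplicates()`. The sketch below rebuilds the table by hand; the variable names (`cities`, `city_dupes`, `unique_cities`) are chosen for this example, and keeping the first occurrence is an assumption, not a requirement.

```python
import pandas as pd

# The example table above, rebuilt as a DataFrame
cities = pd.DataFrame({
    "City": ["New York", "Los Angeles", "Chicago", "Houston",
             "Phoenix", "Los Angeles", "Chicago"],
    "Ranking": [1, 2, 3, 4, 5, 6, 7],
    "Score": [9.8, 9.5, 9.2, 8.7, 8.5, 8.2, 8.1],
})

# No two rows match on every column, so a whole-row check finds nothing
full_row_dupes = int(cities.duplicated().sum())
print(full_row_dupes)  # 0

# Restricting the check to City flags the repeated city names
city_dupes = cities[cities.duplicated(subset=["City"])]
print(city_dupes["City"].tolist())   # ['Los Angeles', 'Chicago']
print(city_dupes.index.tolist())     # [5, 6]

# Keep only the first row for each city (assumption: the first entry is the one to keep)
unique_cities = cities.drop_duplicates(subset=["City"], keep="first")
print(unique_cities["City"].tolist())
```

Which occurrence to keep depends on the analysis; here the first row of each city also happens to be the one with the better ranking.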