# Task 4 - Remove Duplication
                                                                        

### Problem Statement :

- Identify and remove duplicate values in a dataset.

## Solution 

        
    
### Why should we remove duplicate values from a dataset ? 

  
Removing duplicate values from a dataset is essential for several reasons:

* **Data Accuracy :** Duplicate values can introduce inaccuracies in analysis by skewing statistical measures and metrics. For instance, mean, median, and standard deviation can be affected, leading to misinterpretation of data.

* **Computational Efficiency :** Duplicate values contribute unnecessary computational overhead when performing analyses or running machine learning algorithms. Removing duplicates can improve processing speed and efficiency.

* **Model Performance :** In machine learning, duplicate data points can mislead models, causing them to overfit to specific patterns or biases present in the duplicates. Removing duplicates helps in training models on diverse and representative data.

* **Data Consistency :** Duplicate values can lead to inconsistencies and errors in reporting. It is crucial for maintaining data integrity and ensuring that analyses and visualizations accurately reflect the underlying patterns in the dataset.

* **Storage Optimization :** In large datasets, removing duplicates can save storage space, particularly in situations where storage resources are limited or expensive.

* **Enhanced Data Understanding :** Working with a dataset free from duplicates simplifies data exploration and interpretation, providing a clearer picture of the unique patterns and trends within the data.
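
To make the data-accuracy point concrete, here is a minimal sketch using a made-up `orders` table (the column names and values are hypothetical, not part of this task's dataset). It shows how one record entered three times pulls the mean upward.

```python
import pandas as pd

# Hypothetical data: order 3 was accidentally recorded three times.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 3, 3],
    "amount": [100, 200, 500, 500, 500],
})

# Mean over all rows, duplicates included: (100 + 200 + 3 * 500) / 5
mean_with_duplicates = orders["amount"].mean()

# Mean after dropping exact duplicate rows: (100 + 200 + 500) / 3
mean_without_duplicates = orders.drop_duplicates()["amount"].mean()

print(mean_with_duplicates)     # 360.0
print(mean_without_duplicates)  # roughly 266.67
```

Whether the repeated rows are errors or genuine repeat events depends on the data source, so this kind of check is a prompt for investigation rather than proof that rows should be dropped.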



### Various Techniques used for Handling Duplicate Data  

Handling duplicate data involves strategies to either eliminate or manage the duplicate values. 

**1. Removing Duplicates :**
   * Dropping Duplicates : Use Pandas' drop_duplicates() method to remove entire rows that are duplicates.
        ``` df_no_duplicates = df.drop_duplicates() ```
        

**2. Aggregating Duplicates :**
   * Aggregation Functions : If duplicates need to be summarized, use aggregation functions (e.g., sum, mean) to combine duplicate values.
        ``` df_aggregated = df.groupby('column_name').agg({'numeric_column': 'sum', 'another_column': 'mean'}) ```
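
As a rough illustration of aggregation, the sketch below collapses repeated customers into one row each. The `sales` frame and its column names are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical data: customer "A" appears twice.
sales = pd.DataFrame({
    "customer": ["A", "B", "A", "C"],
    "amount": [10, 20, 30, 40],
    "rating": [4.0, 5.0, 2.0, 3.0],
})

# One row per customer: total the amounts and average the ratings.
sales_aggregated = (
    sales.groupby("customer", as_index=False)
         .agg({"amount": "sum", "rating": "mean"})
)

print(sales_aggregated)
```

Unlike `drop_duplicates()`, this keeps the information from every row, which may be preferable when repeated keys represent separate real events.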
        

**3. Keeping First or Last Occurrence :**
   * Keeping First Occurrence : Use keep='first' in drop_duplicates() to keep the first occurrence and remove subsequent duplicates.
   * Keeping Last Occurrence : Use keep='last' to keep the last occurrence.
        ``` df_first_occurrence = df.drop_duplicates(keep='first') ```
        ``` df_last_occurrence = df.drop_duplicates(keep='last') ```
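
The effect of the `keep` argument is easiest to see from which index labels survive. The sketch below uses a hypothetical `events` frame. It also shows `keep=False`, which is a third accepted value that drops every member of a duplicate group.

```python
import pandas as pd

# Hypothetical data: user "u1" logs in three times (rows 0, 2 and 4).
events = pd.DataFrame({
    "user": ["u1", "u2", "u1", "u3", "u1"],
    "action": ["login", "login", "login", "login", "login"],
})

kept_first = events.drop_duplicates(keep="first")  # keeps row 0 for u1
kept_last = events.drop_duplicates(keep="last")    # keeps row 4 for u1
kept_none = events.drop_duplicates(keep=False)     # drops all u1 rows

print(list(kept_first.index))  # [0, 1, 3]
print(list(kept_last.index))   # [1, 3, 4]
print(list(kept_none.index))   # [1, 3]
```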
        

**4. Marking Duplicates :**
   * Creating a Duplicate Flag : Add a new column indicating whether a row is a duplicate or not.
        ``` df['is_duplicate'] = df.duplicated() ```
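
A flag column lets duplicates be reviewed before anything is deleted. The sketch below, on a hypothetical `people` frame, contrasts the default flag (repeats only) with `keep=False` (every row in a duplicate group). The `subset` argument is passed in the second call so the newly added flag column is not itself compared.

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 are identical.
people = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Cid"],
    "city": ["Pune", "Delhi", "Pune", "Goa"],
})

# True only for repeats after the first occurrence (default keep='first').
people["is_duplicate"] = people.duplicated()

# True for every row that has at least one identical copy.
people["in_duplicate_group"] = people.duplicated(
    subset=["name", "city"], keep=False
)

print(people)
```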
        

**5. Handling Duplicates Based on Specific Columns :**
   * Subset of Columns : Check for duplicates based on a subset of columns.
        ``` df_no_duplicates_subset = df.drop_duplicates(subset=['column1', 'column2']) ```
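
Subset-based checks matter when rows describe the same entity but differ in a less important column. In the sketch below (hypothetical `contacts` data), a full-row comparison finds nothing, while comparing on `email` alone does.

```python
import pandas as pd

# Hypothetical data: same email twice, but the name is spelled differently.
contacts = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "a@x.com"],
    "name": ["Asha", "Bala", "Asha K."],
})

# Full-row comparison: no two rows match exactly, so nothing is removed.
full_row_dedup = contacts.drop_duplicates()

# Compare on email only: the first row for each email is kept.
by_email_dedup = contacts.drop_duplicates(subset=["email"])

print(len(full_row_dedup))             # 3
print(by_email_dedup["name"].tolist()) # ['Asha', 'Bala']
```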
 







### Importing Required Libraries  

In [1]:
import pandas as pd 
import numpy as np 

import warnings
warnings.filterwarnings('ignore')

### Understanding the Dataset 

* The Iris dataset consists of four independent features representing various measurements related to the morphology of iris flowers, namely sepal length, sepal width, petal length, and petal width. The dataset is structured as a classification problem, where the goal is to predict the class or species of an iris flower based on these four feature values.

In [2]:
# Loading the Iris DataSet 

# The dataset is a .csv file, so we load it with the Pandas function pd.read_csv()

df = pd.read_csv('Iris.csv')

In [3]:
# Displaying the first few rows of the Iris DataFrame 

df.head()

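|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|----|---------------|--------------|---------------|--------------|---------|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |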


In [4]:
# Getting the shape (rows,cols) of the DataFrame

df.shape

(150, 6)

In [5]:
# Getting the data type of each column

df.dtypes

Id                 int64
SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species           object
dtype: object

### Identifying the Duplicate Values  

In [6]:
# Identify duplicate rows
duplicate_rows = df.duplicated()

# Display duplicate rows
duplicate_df = df[duplicate_rows]
print("Duplicate Rows:")
print(duplicate_df)

Duplicate Rows:
Empty DataFrame
Columns: [Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, Species]
Index: []


* The output is an empty DataFrame, which means that no fully duplicated rows were found in the dataset.
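
One thing worth keeping in mind: `duplicated()` compares entire rows by default, and a column such as `Id` that is unique per row guarantees that no two rows match. The sketch below uses a small stand-in frame with illustrative values (not rows copied from `Iris.csv`) to show how excluding the identifier through `subset` can change the result.

```python
import pandas as pd

# Stand-in frame with Iris-style columns; the values are illustrative only.
sample = pd.DataFrame({
    "Id": [1, 2, 3],
    "SepalLengthCm": [5.8, 4.9, 5.8],
    "SepalWidthCm": [2.7, 3.0, 2.7],
    "Species": ["Iris-virginica", "Iris-setosa", "Iris-virginica"],
})

# Full-row check: the unique Id makes every row distinct.
full_row_count = int(sample.duplicated().sum())

# Check on every column except Id.
measurement_cols = [col for col in sample.columns if col != "Id"]
measurement_count = int(sample.duplicated(subset=measurement_cols).sum())

print(full_row_count)     # 0
print(measurement_count)  # 1
```

The same `subset` call could be tried on `df`. Whether repeated measurements should count as duplicates there is a judgment call, since two different flowers can legitimately share the same measurements.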

### Creating a DataFrame with Duplicate Values 


In [7]:
import pandas as pd

# Creating a DataFrame with duplicate values
data = {
    'Name': ['Ram', 'Sam', 'Cam', 'Ram', 'Zam', 'Cam'],
    'Age': [25, 30, 22, 25, 35, 22],
    'City': ['NY', 'Jaipur', 'Mumbai', 'NY', 'Kashmir', 'Mumbai']
}

dupli_df = pd.DataFrame(data)

# Displaying the DataFrame
dupli_df


|   | Name | Age | City |
|---|------|-----|------|
| 0 | Ram | 25 | NY |
| 1 | Sam | 30 | Jaipur |
| 2 | Cam | 22 | Mumbai |
| 3 | Ram | 25 | NY |
| 4 | Zam | 35 | Kashmir |
| 5 | Cam | 22 | Mumbai |


In [8]:
# Getting the shape (rows,cols) of the DataFrame

dupli_df.shape

(6, 3)

In [9]:
# Getting the data type of each column

dupli_df.dtypes

Name    object
Age      int64
City    object
dtype: object

In [10]:
# Identify duplicate rows
duplicate_rows_cn = dupli_df.duplicated()

# Display duplicate rows
duplicate_df = dupli_df[duplicate_rows_cn]
print("Duplicate Rows:")
print(duplicate_df)

Duplicate Rows:
  Name  Age    City
3  Ram   25      NY
5  Cam   22  Mumbai


* The above dummy dataset contains two duplicate rows (index 3 and 5), which repeat rows 0 and 2 respectively.
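
For larger frames it is often more practical to count duplicates than to print them. The following sketch rebuilds the same dummy data so it runs on its own. The names `n_duplicates` and `occurrences` are introduced here only for illustration.

```python
import pandas as pd

# Same dummy data as above, rebuilt so this snippet is self-contained.
dupli_df = pd.DataFrame({
    'Name': ['Ram', 'Sam', 'Cam', 'Ram', 'Zam', 'Cam'],
    'Age': [25, 30, 22, 25, 35, 22],
    'City': ['NY', 'Jaipur', 'Mumbai', 'NY', 'Kashmir', 'Mumbai'],
})

# duplicated() returns a boolean Series, so summing it counts the True values.
n_duplicates = int(dupli_df.duplicated().sum())

# Number of times each distinct (Name, Age, City) combination occurs.
occurrences = dupli_df.groupby(['Name', 'Age', 'City']).size()

print(n_duplicates)  # 2
print(occurrences)
```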

### Handling the Duplicate Values 

In [11]:
# Remove duplicate rows
no_dupli_df = dupli_df.drop_duplicates()

no_dupli_df

|   | Name | Age | City |
|---|------|-----|------|
| 0 | Ram | 25 | NY |
| 1 | Sam | 30 | Jaipur |
| 2 | Cam | 22 | Mumbai |
| 4 | Zam | 35 | Kashmir |
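
Note that the result keeps the original index labels (0, 1, 2, 4), so label 3 is missing. If a contiguous index is preferred, one option is the `ignore_index=True` argument of `drop_duplicates()`, which to my knowledge is available from pandas 1.0 onward. On older versions, chaining `.reset_index(drop=True)` should give the same result. A minimal sketch, with the dummy data rebuilt so it runs standalone:

```python
import pandas as pd

# Same dummy data as above, rebuilt so this snippet is self-contained.
dupli_df = pd.DataFrame({
    'Name': ['Ram', 'Sam', 'Cam', 'Ram', 'Zam', 'Cam'],
    'Age': [25, 30, 22, 25, 35, 22],
    'City': ['NY', 'Jaipur', 'Mumbai', 'NY', 'Kashmir', 'Mumbai'],
})

# Drop duplicate rows and renumber the remaining rows from 0.
no_dupli_reset = dupli_df.drop_duplicates(ignore_index=True)

print(list(no_dupli_reset.index))  # [0, 1, 2, 3]
```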


In [12]:
# Again checking the same modified DataFrame for duplicates 

# Identify duplicate rows
duplicate_rows_nd = no_dupli_df.duplicated()

# Display duplicate rows (filter with the mask computed on no_dupli_df, not the earlier one)
no_duplicate_df = no_dupli_df[duplicate_rows_nd]
print("Duplicate Rows:")
print(no_duplicate_df)

Duplicate Rows:
Empty DataFrame
Columns: [Name, Age, City]
Index: []


* Removing duplicate values is just one way to deal with repeated entries; there are other methods available as well.

Therefore, we have successfully identified and removed the duplicate rows from a dummy DataFrame.