<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Finding Duplicates Lab**


Estimated time needed: **30** minutes


## Introduction


Data wrangling is a critical step in preparing datasets for analysis, and handling duplicates plays a key role in ensuring data accuracy. In this lab, you will focus on identifying and removing duplicate entries from your dataset. 


## Objectives


In this lab, you will perform the following:


1. Identify duplicate rows in the dataset and analyze their characteristics.
2. Visualize the distribution of duplicates based on key attributes.
3. Remove duplicate values strategically based on specific criteria.
4. Outline the process of verifying and documenting duplicate removal.


## Hands-on Lab


Install the required libraries


In [1]:
!pip install pandas
!pip install matplotlib

Collecting pandas
  Downloading pandas-2.3.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (91 kB)
Collecting numpy>=1.26.0 (from pandas)
  Downloading numpy-2.3.1-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (62 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pandas-2.3.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.0 MB)
Downloading numpy-2.3.1-cp312-cp312-manylinux_2_28_x86_64.whl (16.6 MB)
Downloading tzdata-2025.2-py2.py3-none-any.whl (347 kB)
Installing collected packages: tzdata, numpy, pandas
Successfully installed numpy-2.3.1 pandas-2.3.1 tzdata-2025.2
Collecting matplotlib
  Downloading matplotlib-3.10.3-cp312-cp3

Import pandas module


In [2]:
import pandas as pd


Import matplotlib


In [3]:
import matplotlib.pyplot as plt


## **Load the dataset into a dataframe**


<h2>Read Data</h2>
<p>
We use the <code>pandas.read_csv()</code> function to read CSV files. In this version of the lab, which runs on JupyterLite, the dataset must first be downloaded into the environment using the code provided below.
</p>


In [4]:
# Load the dataset directly from the URL
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/VYPrOu0Vs3I0hKLLjiPGrA/survey-data-with-duplicate.csv"
df = pd.read_csv(file_path)

# Display the first few rows
print(df.head())

   ResponseId                      MainBranch                 Age  \
0           1  I am a developer by profession  Under 18 years old   
1           2  I am a developer by profession     35-44 years old   
2           3  I am a developer by profession     45-54 years old   
3           4           I am learning to code     18-24 years old   
4           5  I am a developer by profession     18-24 years old   

            Employment RemoteWork   Check  \
0  Employed, full-time     Remote  Apples   
1  Employed, full-time     Remote  Apples   
2  Employed, full-time     Remote  Apples   
3   Student, full-time        NaN  Apples   
4   Student, full-time        NaN  Apples   

                                    CodingActivities  \
0                                              Hobby   
1  Hobby;Contribute to open-source projects;Other...   
2  Hobby;Contribute to open-source projects;Other...   
3                                                NaN   
4                                 

Load the data into a pandas dataframe:



Note: If you are working on a local Jupyter environment, you can use the URL directly in the pandas.read_csv() function as shown below:



In [5]:
df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv")


## Identify and Analyze Duplicates


### Task 1: Identify Duplicate Rows
1. Count the number of duplicate rows in the dataset.
2. Display the first few duplicate rows to understand their structure.
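
As a minimal sketch of what `DataFrame.duplicated()` reports, the toy DataFrame below (hypothetical values, not taken from the survey data) shows how the `keep` argument changes which rows are flagged. Summing the boolean result gives the count asked for in step 1.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical; rows 1, 3 and 4 are identical.
toy = pd.DataFrame({
    "ResponseId": [1, 2, 1, 2, 2, 3],
    "Employment": ["Employed", "Student", "Employed", "Student", "Student", "Retired"],
})

# keep="first" (the default): every repeat after the first occurrence is flagged.
repeats = toy.duplicated()
num_repeats = int(repeats.sum())

# keep=False: every member of a duplicated group is flagged, including the first.
all_members = toy.duplicated(keep=False)
num_members = int(all_members.sum())

print(f"Repeated rows (keep='first'): {num_repeats}")
print(f"Rows belonging to a duplicated group (keep=False): {num_members}")
print(toy[all_members].sort_values("ResponseId"))
```

The default count answers "how many rows would be dropped", while `keep=False` is more useful for inspecting duplicates side by side.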


In [18]:
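## Write your code here
# Count fully duplicated rows (every occurrence after the first is counted)
num_duplicate_rows = df.duplicated().sum()
print(f"Number of duplicate rows: {num_duplicate_rows}")

if num_duplicate_rows > 0:
    print("--- First few duplicate rows (keeping all occurrences) ---")
    # df.duplicated(keep=False) marks all occurrences of a duplicate row as True
    duplicate_rows = df[df.duplicated(keep=False)]
    print(duplicate_rows.head())
else:
    print("No fully duplicated rows found.")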
        

### Task 2: Analyze Characteristics of Duplicates
1. Identify duplicate rows based on a subset of columns such as MainBranch, Employment, and RemoteWork.
2. Analyze the characteristics of these rows and determine which of the remaining columns frequently contain identical values across them.
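
One possible way to approach step 2 is sketched below on a hypothetical toy DataFrame (the values are illustrative, not taken from the survey file). It groups the subset-duplicates and counts the distinct values of every other column within each group; a count of 1 means the column is identical across that group. The `dropna=False` argument is an assumption made here so that rows with missing values in the key columns, such as `RemoteWork`, are not silently excluded.

```python
import pandas as pd

# Hypothetical toy data, not taken from the survey file.
toy = pd.DataFrame({
    "MainBranch": ["Dev", "Dev", "Dev", "Learner", "Learner"],
    "Employment": ["Full-time", "Full-time", "Full-time", "Student", "Student"],
    "RemoteWork": ["Remote", "Remote", "Hybrid", "Remote", "Remote"],
    "Country": ["India", "India", "Germany", "Brazil", "Brazil"],
    "Age": ["25-34", "35-44", "25-34", "18-24", "18-24"],
})
subset = ["MainBranch", "Employment", "RemoteWork"]

# Rows that share the same values in the subset columns (all members kept).
dupes = toy[toy.duplicated(subset=subset, keep=False)]

# For each duplicate group, count distinct values in every non-key column.
distinct_per_group = dupes.groupby(subset, dropna=False).nunique()

# Share of duplicate groups in which each column is identical.
identical_share = (distinct_per_group == 1).mean().sort_values(ascending=False)
print(identical_share)
```

Columns with a share close to 1.0 tend to repeat together with the key columns, which hints that the rows may be true duplicates rather than different respondents.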
   


In [23]:
## Write your code here
columns_for_duplicate_check = ['MainBranch', 'Employment', 'RemoteWork']

# Count the number of duplicate rows based on the specified subset of columns
# df.duplicated(subset=columns_for_duplicate_check).sum() counts rows where the subset is duplicated
num_duplicate_rows_subset = df.duplicated(subset=columns_for_duplicate_check).sum()
print(f"Number of duplicate rows based on '{', '.join(columns_for_duplicate_check)}': {num_duplicate_rows_subset}")
print("-------------------------------------\n")

# Display the first few duplicate rows based on the subset
if num_duplicate_rows_subset > 0:
    print(f"--- First few duplicate rows based on '{', '.join(columns_for_duplicate_check)}' (keeping all occurrences) ---")
    # df.duplicated(subset=columns_for_duplicate_check, keep=False) marks all occurrences as True
    duplicate_rows_subset = df[df.duplicated(subset=columns_for_duplicate_check, keep=False)]
    print(duplicate_rows_subset.head())
    print("-----------------------------------------------------------\n")

    # Analyze which columns frequently contain identical values within these duplicate rows
    print(f"--- Analysis of identical values in duplicate rows for '{', '.join(columns_for_duplicate_check)}' ---")
    for col in columns_for_duplicate_check:
        if col in duplicate_rows_subset.columns:
            print(f"\nFrequency of values in '{col}' within duplicate rows:")
            print(duplicate_rows_subset[col].value_counts())
            print("-" * 30)
        else:
            print(f"Column '{col}' not found in the duplicate rows DataFrame.")
    print("-----------------------------------------------------------\n")

else:
    print(f"No duplicate rows found based on '{', '.join(columns_for_duplicate_check)}' to display or analyze.")




Number of duplicate rows based on 'MainBranch, Employment, RemoteWork': 64876
-------------------------------------

--- First few duplicate rows based on 'MainBranch, Employment, RemoteWork' (keeping all occurrences) ---
   ResponseId                      MainBranch                 Age  \
0           1  I am a developer by profession  Under 18 years old   
1           2  I am a developer by profession     35-44 years old   
2           3  I am a developer by profession     45-54 years old   
3           4           I am learning to code     18-24 years old   
4           5  I am a developer by profession     18-24 years old   

            Employment RemoteWork   Check  \
0  Employed, full-time     Remote  Apples   
1  Employed, full-time     Remote  Apples   
2  Employed, full-time     Remote  Apples   
3   Student, full-time        NaN  Apples   
4   Student, full-time        NaN  Apples   

                                    CodingActivities  \
0                                   

### Task 3: Visualize Duplicates Distribution
1. Create visualizations to show the distribution of duplicates across different categories.
2. Use bar charts or pie charts to represent the distribution of duplicates by Country and Employment.
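
The two steps above can be sketched with plain pandas and matplotlib as follows. The data is a hypothetical toy DataFrame, and the choice of a bar chart for `Country` and a pie chart for `Employment` is only one reasonable option.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data, not taken from the survey file.
toy = pd.DataFrame({
    "ResponseId": [1, 2, 2, 3, 3, 3, 4],
    "Country": ["India", "Germany", "Germany", "Brazil", "Brazil", "Brazil", "Chile"],
    "Employment": ["Full-time", "Student", "Student", "Full-time",
                   "Full-time", "Full-time", "Part-time"],
})

# Keep only the redundant copies (every occurrence after the first).
redundant = toy[toy.duplicated()]

# Count redundant copies per category.
by_country = redundant["Country"].value_counts()
by_employment = redundant["Employment"].value_counts()

# Draw both distributions side by side.
fig, (ax_country, ax_employment) = plt.subplots(1, 2, figsize=(10, 4))
by_country.plot(kind="bar", ax=ax_country, title="Duplicates by Country")
by_employment.plot(kind="pie", ax=ax_employment,
                   title="Duplicates by Employment", autopct="%1.0f%%")
ax_country.set_ylabel("Number of duplicate rows")
ax_employment.set_ylabel("")
fig.tight_layout()
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()` afterwards.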


In [31]:
## Write your code here
import pandas as pd  # Ensure pandas is imported as pd
import matplotlib.pyplot as plt  # Import for plotting
import seaborn as sns  # Used below for set_style and barplot; run `pip install seaborn` if it is missing

# Label used in the status messages below; the data itself is read from the URL.
file_name = "survey_data.csv"

print(f"Attempting to read data from: {file_name}")

try:
    # Read the CSV file into a pandas DataFrame
    df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv")
 

    print(f"Successfully loaded '{file_name}' into a DataFrame.")
    print("\n--- First 5 rows of the DataFrame ---")
    print(df.head())
    print("-------------------------------------\n")

    # Define the subset of columns for duplicate check
    columns_for_duplicate_check = ['MainBranch', 'Employment', 'RemoteWork']

    # Count the number of duplicate rows based on the specified subset of columns
    # df.duplicated(subset=columns_for_duplicate_check).sum() counts rows where the subset is duplicated
    num_duplicate_rows_subset = df.duplicated(subset=columns_for_duplicate_check).sum()
    print(f"Number of duplicate rows based on '{', '.join(columns_for_duplicate_check)}': {num_duplicate_rows_subset}")
    print("-------------------------------------\n")

    # Display the first few duplicate rows based on the subset
    if num_duplicate_rows_subset > 0:
        print(f"--- First few duplicate rows based on '{', '.join(columns_for_duplicate_check)}' (keeping all occurrences) ---")
        # df.duplicated(subset=columns_for_duplicate_check, keep=False) marks all occurrences as True
        duplicate_rows_subset = df[df.duplicated(subset=columns_for_duplicate_check, keep=False)]
        print(duplicate_rows_subset.head())
        print("-----------------------------------------------------------\n")

        # Analyze which columns frequently contain identical values within these duplicate rows
        print(f"--- Analysis of identical values in duplicate rows for '{', '.join(columns_for_duplicate_check)}' ---")
        for col in columns_for_duplicate_check:
            if col in duplicate_rows_subset.columns:
                print(f"\nFrequency of values in '{col}' within duplicate rows:")
                print(duplicate_rows_subset[col].value_counts())
                print("-" * 30)
            else:
                print(f"Column '{col}' not found in the duplicate rows DataFrame.")
        print("-----------------------------------------------------------\n")

        # --- Create visualizations for distribution of duplicates ---
        print("--- Visualizing Distribution of Duplicates ---")

        # Set a style for better aesthetics
        sns.set_style("whitegrid")

        # Distribution by Country
        if 'Country' in duplicate_rows_subset.columns:
            plt.figure(figsize=(12, 6))
            country_counts = duplicate_rows_subset['Country'].value_counts().nlargest(10) # Top 10 countries
            sns.barplot(x=country_counts.index, y=country_counts.values, palette='viridis')
            plt.title('Top 10 Countries with Duplicate Entries (based on MainBranch, Employment, RemoteWork)')
            plt.xlabel('Country')
            plt.ylabel('Number of Duplicate Entries')
            plt.xticks(rotation=45, ha='right')
            plt.tight_layout()
            plt.show()
        else:
            print("Column 'Country' not found in the duplicate rows DataFrame for visualization.")

        # Distribution by Employment
        if 'Employment' in duplicate_rows_subset.columns:
            plt.figure(figsize=(10, 5))
            employment_counts = duplicate_rows_subset['Employment'].value_counts()
            sns.barplot(x=employment_counts.index, y=employment_counts.values, palette='magma')
            plt.title('Distribution of Duplicate Entries by Employment Type')
            plt.xlabel('Employment Type')
            plt.ylabel('Number of Duplicate Entries')
            plt.xticks(rotation=45, ha='right')
            plt.tight_layout()
            plt.show()
        else:
            print("Column 'Employment' not found in the duplicate rows DataFrame for visualization.")

        print("-----------------------------------------------------------\n")

    else:
        print(f"No duplicate rows found based on '{', '.join(columns_for_duplicate_check)}' to display or analyze.")


except FileNotFoundError:
    print(f"Error: The file '{file_name}' was not found. Please ensure it exists in the current directory or was successfully downloaded/created.")
except pd.errors.EmptyDataError: # Specific error for empty files
    print(f"Error: The file '{file_name}' is empty or contains no data to parse. Please ensure the file has content.")
except Exception as e:
    print(f"An unexpected error occurred while reading the CSV: {e}")

            

Attempting to read data from: survey_data.csv
Successfully loaded 'survey_data.csv' into a DataFrame.

--- First 5 rows of the DataFrame ---
   ResponseId                      MainBranch                 Age  \
0           1  I am a developer by profession  Under 18 years old   
1           2  I am a developer by profession     35-44 years old   
2           3  I am a developer by profession     45-54 years old   
3           4           I am learning to code     18-24 years old   
4           5  I am a developer by profession     18-24 years old   

            Employment RemoteWork   Check  \
0  Employed, full-time     Remote  Apples   
1  Employed, full-time     Remote  Apples   
2  Employed, full-time     Remote  Apples   
3   Student, full-time        NaN  Apples   
4   Student, full-time        NaN  Apples   

                                    CodingActivities  \
0                                              Hobby   
1  Hobby;Contribute to open-source projects;Other...   
2  Ho

### Task 4: Strategic Removal of Duplicates
1. Decide which columns are critical for defining uniqueness in the dataset.
2. Remove duplicates based on a subset of columns if complete row duplication is not a good criterion.
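
The choice of key columns strongly affects how many rows are removed. The sketch below uses a hypothetical toy DataFrame to compare two candidate keys; which key is appropriate for the survey data is a judgment the analyst has to make.

```python
import pandas as pd

# Hypothetical toy data, not taken from the survey file.
toy = pd.DataFrame({
    "ResponseId": [1, 2, 2, 3, 4],
    "Country": ["India", "Germany", "Germany", "India", "India"],
    "Employment": ["Full-time", "Student", "Student", "Full-time", "Full-time"],
})

# Option A: treat ResponseId as the uniqueness key.
by_id = toy.drop_duplicates(subset=["ResponseId"], keep="first")

# Option B: treat a coarse attribute combination as the key.
by_attributes = toy.drop_duplicates(subset=["Country", "Employment"], keep="first")

print(f"Original rows:             {len(toy)}")
print(f"After dedup on ResponseId: {len(by_id)}")
print(f"After dedup on attributes: {len(by_attributes)}")
```

Option B also removes respondents 3 and 4, who are distinct people that merely share a country and employment status. Because `drop_duplicates()` returns a new DataFrame unless `inplace=True` is passed, `toy` stays available for comparison.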


In [34]:
## Write your code here
import seaborn as sns  # Used below for set_style and barplot

# Label used in the status messages below; the data itself is read from the URL.
file_name = "survey_data.csv"

print(f"Attempting to read data from: {file_name}")

try:
    # Read the CSV file into a pandas DataFrame
    df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv")

    print(f"Successfully loaded '{file_name}' into a DataFrame.")
    print("\n--- First 5 rows of the DataFrame ---")
    print(df.head())
    print("-------------------------------------\n")

    # --- Duplicate Analysis (Full Row Duplicates) ---
    num_full_duplicate_rows = df.duplicated().sum()
    print(f"Number of full duplicate rows in the dataset: {num_full_duplicate_rows}")
    if num_full_duplicate_rows > 0:
        print("--- First few full duplicate rows (keeping all occurrences) ---")
        print(df[df.duplicated(keep=False)].head())
    else:
        print("No full duplicate rows found.")
    print("-------------------------------------\n")


    # --- Decide on Critical Columns for Uniqueness and Remove Duplicates ---
    # For a survey dataset, a unique respondent is ideally identified by an explicit ID such as 'ResponseId'.
    # As an alternative, this cell uses a combination of demographic and core response columns.
    # Note: this key is coarse and can also drop distinct respondents who happen to share all of these values.
    potential_critical_uniqueness_columns = ['MainBranch', 'Employment', 'RemoteWork', 'Country', 'EdLevel', 'Age']

    # Filter to only include columns that actually exist in the DataFrame
    critical_uniqueness_columns = [col for col in potential_critical_uniqueness_columns if col in df.columns]

    if not critical_uniqueness_columns:
        print("Warning: No critical uniqueness columns found in the dataset. Cannot remove duplicates based on subset.")
    else:
        print(f"\nDeciding critical columns for uniqueness: {', '.join(critical_uniqueness_columns)}")
        # Count duplicates based on these critical columns before removal
        num_duplicates_before_removal = df.duplicated(subset=critical_uniqueness_columns).sum()
        print(f"Number of duplicate entries based on these critical columns before removal: {num_duplicates_before_removal}")

        # Remove duplicates based on the identified critical columns
        # inplace=True modifies the DataFrame directly
        df.drop_duplicates(subset=critical_uniqueness_columns, inplace=True)
        print(f"Duplicates removed based on '{', '.join(critical_uniqueness_columns)}'.")
        print(f"New DataFrame shape after removing duplicates: {df.shape}")
        print("-------------------------------------\n")


    # --- Identify and Analyze Duplicates based on Specific Subset (MainBranch, Employment, RemoteWork) ---
    # This analysis is performed on the potentially de-duplicated 'df' now
    columns_for_analysis_subset = ['MainBranch', 'Employment', 'RemoteWork']
    
    # Check if these columns exist in the DataFrame before proceeding
    if all(col in df.columns for col in columns_for_analysis_subset):
        num_duplicate_rows_analysis_subset = df.duplicated(subset=columns_for_analysis_subset).sum()
        print(f"Number of duplicate rows based on '{', '.join(columns_for_analysis_subset)}' (after main de-duplication): {num_duplicate_rows_analysis_subset}")
        print("-------------------------------------\n")

        if num_duplicate_rows_analysis_subset > 0:
            print(f"--- First few duplicate rows based on '{', '.join(columns_for_analysis_subset)}' (keeping all occurrences) ---")
            duplicate_rows_analysis_subset = df[df.duplicated(subset=columns_for_analysis_subset, keep=False)]
            print(duplicate_rows_analysis_subset.head())
            print("-----------------------------------------------------------\n")

            print(f"--- Analysis of identical values in duplicate rows for '{', '.join(columns_for_analysis_subset)}' ---")
            for col in columns_for_analysis_subset:
                print(f"\nFrequency of values in '{col}' within duplicate rows:")
                print(duplicate_rows_analysis_subset[col].value_counts())
                print("-" * 30)
            print("-----------------------------------------------------------\n")
        else:
            print(f"No duplicate rows found based on '{', '.join(columns_for_analysis_subset)}' after main de-duplication to display or analyze.")
    else:
        print(f"One or more columns ({', '.join(columns_for_analysis_subset)}) not found in the DataFrame for analysis.")


    # --- Create visualizations to show the distribution of duplicates across different categories ---
    # These visualizations will now reflect the de-duplicated DataFrame 'df'
    print("--- Visualizing Distribution of Categories (after de-duplication) ---")

    # Set a style for better aesthetics
    sns.set_style("whitegrid")

    # Distribution by Country
    if 'Country' in df.columns: # Check if 'Country' column exists in the (potentially de-duplicated) df
        plt.figure(figsize=(12, 6))
        country_counts = df['Country'].value_counts().nlargest(10) # Top 10 countries
        sns.barplot(x=country_counts.index, y=country_counts.values, palette='viridis')
        plt.title('Top 10 Countries Distribution (after de-duplication)')
        plt.xlabel('Country')
        plt.ylabel('Number of Entries')
        plt.xticks(rotation=45, ha='right')
        plt.tight_layout()
        plt.show()
    else:
        print("Column 'Country' not found in the DataFrame for visualization.")

    # Distribution by Employment
    if 'Employment' in df.columns: # Check if 'Employment' column exists in the (potentially de-duplicated) df
        plt.figure(figsize=(10, 5))
        employment_counts = df['Employment'].value_counts()
        sns.barplot(x=employment_counts.index, y=employment_counts.values, palette='magma')
        plt.title('Employment Type Distribution (after de-duplication)')
        plt.xlabel('Employment Type')
        plt.ylabel('Number of Entries')
        plt.xticks(rotation=45, ha='right')
        plt.tight_layout()
        plt.show()
    else:
        print("Column 'Employment' not found in the DataFrame for visualization.")

    print("-----------------------------------------------------------\n")


except FileNotFoundError:
    print(f"Error: The file '{file_name}' was not found. Please ensure it exists in the current directory or was successfully downloaded/created.")
except pd.errors.EmptyDataError: # Specific error for empty files
    print(f"Error: The file '{file_name}' is empty or contains no data to parse. Please ensure the file has content.")
except Exception as e:
    print(f"An unexpected error occurred while reading the CSV: {e}")


Attempting to read data from: survey_data.csv
Successfully loaded 'survey_data.csv' into a DataFrame.

--- First 5 rows of the DataFrame ---
   ResponseId                      MainBranch                 Age  \
0           1  I am a developer by profession  Under 18 years old   
1           2  I am a developer by profession     35-44 years old   
2           3  I am a developer by profession     45-54 years old   
3           4           I am learning to code     18-24 years old   
4           5  I am a developer by profession     18-24 years old   

            Employment RemoteWork   Check  \
0  Employed, full-time     Remote  Apples   
1  Employed, full-time     Remote  Apples   
2  Employed, full-time     Remote  Apples   
3   Student, full-time        NaN  Apples   
4   Student, full-time        NaN  Apples   

                                    CodingActivities  \
0                                              Hobby   
1  Hobby;Contribute to open-source projects;Other...   
2  Ho

## Verify and Document Duplicate Removal Process


### Task 5: Documentation
1. Document the process of identifying and removing duplicates.


2. Explain the reasoning behind selecting specific columns for identifying and removing duplicates.
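
A minimal sketch of how the removal could be verified and recorded is shown below. The toy data, the `key_columns` choice, and the `removal_log` dictionary are all hypothetical and only illustrate the kind of checks and notes that belong in the documentation.

```python
import pandas as pd

# Hypothetical toy data, not taken from the survey file.
before = pd.DataFrame({
    "ResponseId": [1, 2, 2, 3, 3],
    "Country": ["India", "Germany", "Germany", "Brazil", "Brazil"],
})
key_columns = ["ResponseId"]  # assumed uniqueness key for this sketch

# Number of rows expected to be removed (repeats after the first occurrence).
expected_removed = int(before.duplicated(subset=key_columns).sum())
after = before.drop_duplicates(subset=key_columns, keep="first").reset_index(drop=True)

# Verification: no key repeats remain, and the row counts reconcile.
assert not after.duplicated(subset=key_columns).any()
assert len(before) - len(after) == expected_removed

# Documentation: a small record of what was done and why.
removal_log = {
    "key_columns": key_columns,
    "keep": "first",
    "rows_before": len(before),
    "rows_after": len(after),
    "rows_removed": expected_removed,
}
print(removal_log)
```

Recording the key columns and the `keep` policy alongside the row counts makes the cleaning step reproducible and easy to review.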


### Summary and Next Steps
**In this lab, you focused on identifying and analyzing duplicate rows within the dataset.**

- You employed various techniques to explore the nature of duplicates and applied strategic methods for their removal.
- For additional analysis, consider investigating the impact of duplicates on specific analyses and how their removal affects the results.
- This lab provides a structured approach to analyzing and handling duplicates in a dataset.


<!--
## Change Log
|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2024-11-05|1.3|Madhusudhan Moole|Updated lab|
|2024-10-28|1.2|Madhusudhan Moole|Updated lab|
|2024-09-24|1.1|Madhusudhan Moole|Updated lab|
|2024-09-23|1.0|Raghul Ramesh|Created lab|
-->


Copyright © IBM Corporation. All rights reserved.
