<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Removing Duplicates**


Estimated time needed: **30** minutes


## Introduction


In this lab, you will focus on data wrangling, an important step in preparing data for analysis. Data wrangling involves cleaning and organizing data to make it suitable for analysis. One key task in this process is removing duplicate entries: repeated records that can distort analysis and lead to inaccurate conclusions.


## Objectives


In this lab you will perform the following:


1. Identify duplicate rows in the dataset.
2. Use suitable techniques to remove duplicate rows and verify the removal.
3. Summarize how to handle missing values appropriately.
4. Use ConvertedCompYearly to normalize compensation data.
   


### Install the Required Libraries


In [1]:
!pip install pandas

Collecting pandas
  Downloading pandas-2.3.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (91 kB)
Collecting numpy>=1.26.0 (from pandas)
  Downloading numpy-2.3.2-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (62 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pandas-2.3.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.0 MB)
Downloading numpy-2.3.2-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (16.6 MB)
Downloading tzdata-2025.2-py2.py3-none-any.whl (347 kB)
Installing collected packages: tzdata, numpy, pandas
Successfully installed numpy-2.3.2 pandas-2.3.1 tzdata-2025.2


### Step 1: Import Required Libraries


In [2]:
import pandas as pd

### Step 2: Load the Dataset into a DataFrame



Load the dataset using pd.read_csv().


In [3]:
# Define the URL of the dataset
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"

# Load the dataset into a DataFrame
df = pd.read_csv(file_path)

# Display the first few rows to ensure it loaded correctly
print(df.head())


   ResponseId                      MainBranch                 Age  \
0           1  I am a developer by profession  Under 18 years old   
1           2  I am a developer by profession     35-44 years old   
2           3  I am a developer by profession     45-54 years old   
3           4           I am learning to code     18-24 years old   
4           5  I am a developer by profession     18-24 years old   

            Employment RemoteWork   Check  \
0  Employed, full-time     Remote  Apples   
1  Employed, full-time     Remote  Apples   
2  Employed, full-time     Remote  Apples   
3   Student, full-time        NaN  Apples   
4   Student, full-time        NaN  Apples   

                                    CodingActivities  \
0                                              Hobby   
1  Hobby;Contribute to open-source projects;Other...   
2  Hobby;Contribute to open-source projects;Other...   
3                                                NaN   
4                                 

**Note: If you are working in a local Jupyter environment, you can use the URL directly in the <code>pandas.read_csv()</code> function as shown below:**



df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv")
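Whichever way the file is loaded, a few quick checks help confirm that the result looks as expected before any cleaning starts. The sketch below is only an illustration: it reads a small in-memory CSV (a hypothetical three-row sample, not the survey file) so that it runs without network access, and the name `sample_df` is an assumption made for this example.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for survey-data.csv
sample_csv = io.StringIO(
    "ResponseId,MainBranch,RemoteWork\n"
    "1,I am a developer by profession,Remote\n"
    "2,I am a developer by profession,Remote\n"
    "3,I am learning to code,\n"
)

sample_df = pd.read_csv(sample_csv)

# Basic sanity checks after loading
print(sample_df.shape)         # (rows, columns)
print(sample_df.dtypes)        # inferred type of each column
print(sample_df.isna().sum())  # missing values per column
```

The same three calls can be run on `df` once the real dataset has been loaded.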


### Step 3: Identifying Duplicate Rows


**Task 1: Identify Duplicate Rows**
  1. Count the number of duplicate rows in the dataset.
  2. Display the first few duplicate rows to understand their structure.


In [6]:
## Write your code here

# --- Duplicate Analysis (Full Row Duplicates) ---
# Count the number of duplicate rows
num_full_duplicate_rows = df.duplicated().sum()
print(f"Number of full duplicate rows in the dataset: {num_full_duplicate_rows}")
print("-------------------------------------\n")


Number of full duplicate rows in the dataset: 0
-------------------------------------



In [9]:
if num_full_duplicate_rows > 0:
    print("--- First few full duplicate rows (keeping all occurrences) ---")
    # df.duplicated(keep=False) marks all occurrences of a duplicate row as True
    # This allows us to see both the original and its duplicates, providing context.
    duplicate_rows = df[df.duplicated(keep=False)]
    print(duplicate_rows.head())
    print("-----------------------------------------------------------\n")
else:
    print("No full duplicate rows found to display.")

No full duplicate rows found to display.
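A count of zero is what a full-row comparison would produce whenever every row carries its own identifier. Assuming `ResponseId` is unique per response (the first five rows suggest this, but it is not verified here), two respondents with identical answers would still not be flagged. The sketch below uses hypothetical toy data to show the difference between comparing all columns and comparing every column except the identifier.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 have the same answers but different IDs
toy = pd.DataFrame({
    "ResponseId": [1, 2, 3, 4],
    "MainBranch": ["Developer", "Learner", "Developer", "Developer"],
    "Employment": ["Full-time", "Student", "Full-time", "Part-time"],
})

# Full-row comparison: the unique ID makes every row distinct
full_dupes = toy.duplicated().sum()

# Compare on every column except the identifier
answer_cols = [c for c in toy.columns if c != "ResponseId"]
answer_dupes = toy.duplicated(subset=answer_cols).sum()

# keep=False flags every member of a duplicate group, not just the repeats
all_members = toy[toy.duplicated(subset=answer_cols, keep=False)]

print(full_dupes, answer_dupes)
print(all_members)
```

Whether rows that match on everything but the ID should count as duplicates is a judgment call that depends on the dataset; the `subset` argument simply makes that choice explicit.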


### Step 4: Removing Duplicate Rows


**Task 2: Remove Duplicates**
   1. Remove duplicate rows from the dataset using the drop_duplicates() function.
   2. Verify the removal by counting the number of duplicate rows after removal.


In [32]:
## Write your code here
# Remove full duplicate rows from the dataset
df.drop_duplicates(inplace=True)
print(f"All full duplicate rows removed. New DataFrame shape: {df.shape}")
print("-----------------------------------------------------------\n")


All full duplicate rows removed. New DataFrame shape: (65437, 114)
-----------------------------------------------------------



In [33]:
# Verify full duplicate removal
num_full_duplicate_rows_after_removal = df.duplicated().sum()
print(f"Number of full duplicate rows after initial removal: {num_full_duplicate_rows_after_removal}")
print("-------------------------------------\n")

Number of full duplicate rows after initial removal: 0
-------------------------------------
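Printing the shape alone does not show how many rows were dropped. One way to make the effect explicit is to compare row counts before and after the call. The following is a minimal sketch on hypothetical toy data; it uses the non-inplace form of `drop_duplicates()`, which returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical toy data with one exact repeat (rows 0 and 1)
toy = pd.DataFrame({
    "Age": ["18-24", "18-24", "25-34"],
    "Employment": ["Student", "Student", "Full-time"],
})

rows_before = len(toy)

# Returns a new DataFrame; keep="first" retains the first row of each group
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)

rows_removed = rows_before - len(deduped)
print(f"Removed {rows_removed} duplicate row(s)")

# Verification: no duplicates should remain
assert deduped.duplicated().sum() == 0
```

Resetting the index is optional; it only avoids gaps in the row labels after rows are dropped.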



### Step 5: Handling Missing Values


**Task 3: Identify and Handle Missing Values**
   1. Identify missing values for all columns in the dataset.
   2. Choose a column with significant missing values (e.g., EdLevel) and impute with the most frequent value.


In [43]:
## Write your code here
print("--- Identifying Missing Values Across All Columns ---")
missing_values = df.isnull().sum()
missing_percentage = (df.isnull().sum() / len(df)) * 100

missing_info = pd.DataFrame({
    'Missing Count': missing_values,
    'Missing Percentage': missing_percentage
})
# Filter to show only columns with missing values and sort by percentage
missing_info = missing_info[missing_info['Missing Count'] > 0].sort_values(
    by='Missing Percentage', ascending=False
)

if not missing_info.empty:
    print("Columns with Missing Values:")
    print(missing_info)
else:
    print("No missing values found in any column.")
print("---------------------------------------------------\n")


--- Identifying Missing Values Across All Columns ---
Columns with Missing Values:
                            Missing Count  Missing Percentage
AINextMuch less integrated          64289           98.245641
AINextLess integrated               63082           96.401119
AINextNo change                     52939           80.900714
AINextMuch more integrated          51999           79.464217
EmbeddedAdmired                     48704           74.428840
...                                   ...                 ...
YearsCode                            5568            8.508948
NEWSOSites                           5151            7.871693
LearnCode                            4949            7.563000
EdLevel                              4653            7.110656
AISelect                             4530            6.922689

[109 rows x 2 columns]
---------------------------------------------------
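The percentage column can also be computed in a single step, because the mean of a boolean mask equals the fraction of `True` values. The sketch below shows this shortcut on hypothetical toy data; it is an alternative formulation, not a replacement for the table above.

```python
import pandas as pd

# Hypothetical toy data with gaps in two columns
toy = pd.DataFrame({
    "EdLevel": ["Bachelor", None, "Master", None],
    "YearsCode": [5.0, 3.0, None, 10.0],
    "Age": ["18-24", "25-34", "25-34", "35-44"],
})

# The mean of a boolean mask is the fraction of True values
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)

# Keep only the columns that have at least one missing value
missing_pct = missing_pct[missing_pct > 0]
print(missing_pct)
```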



In [44]:
# --- Identify Missing Values ---
print("--- Identifying Missing Values Across All Columns ---")
missing_values = df.isnull().sum()
missing_percentage = (df.isnull().sum() / len(df)) * 100

missing_info = pd.DataFrame({
    'Missing Count': missing_values,
    'Missing Percentage': missing_percentage
})
# Filter to show only columns with missing values and sort by percentage
missing_info = missing_info[missing_info['Missing Count'] > 0].sort_values(
    by='Missing Percentage', ascending=False
)

if not missing_info.empty:
    print("Columns with Missing Values:")
    print(missing_info)
else:
    print("No missing values found in any column.")
print("---------------------------------------------------\n")

# --- Impute Missing Values in 'EdLevel' with Mode ---
column_to_impute = 'EdLevel'
if column_to_impute in df.columns:
    if df[column_to_impute].isnull().sum() > 0:
        print(f"--- Imputing Missing Values in '{column_to_impute}' ---")
        # Calculate the mode (most frequent value)
        # .mode()[0] is used because mode() can return multiple values if there's a tie
        mode_value = df[column_to_impute].mode()[0]
        print(f"Most frequent value (mode) for '{column_to_impute}': '{mode_value}'")

        # Impute missing values with the mode; assign the result back because
        # inplace=True on a column selection operates on a copy
        df[column_to_impute] = df[column_to_impute].fillna(mode_value)
        print(f"Missing values in '{column_to_impute}' imputed with '{mode_value}'.")

        # Verify imputation
        missing_after_imputation = df[column_to_impute].isnull().sum()
        print(f"Missing values in '{column_to_impute}' after imputation: {missing_after_imputation}")
        print("---------------------------------------------------\n")
    else:
        print(f"No missing values found in '{column_to_impute}'. No imputation performed.")
else:
    print(f"Column '{column_to_impute}' not found in the DataFrame. Cannot perform imputation.")

--- Identifying Missing Values Across All Columns ---
Columns with Missing Values:
                            Missing Count  Missing Percentage
AINextMuch less integrated          64289           98.245641
AINextLess integrated               63082           96.401119
AINextNo change                     52939           80.900714
AINextMuch more integrated          51999           79.464217
EmbeddedAdmired                     48704           74.428840
...                                   ...                 ...
YearsCode                            5568            8.508948
NEWSOSites                           5151            7.871693
LearnCode                            4949            7.563000
EdLevel                              4653            7.110656
AISelect                             4530            6.922689

[109 rows x 2 columns]
---------------------------------------------------

--- Imputing Missing Values in 'EdLevel' ---
Most frequent value (mode) for 'EdLevel': 'Bachelor

**Note: Calling <code>fillna(..., inplace=True)</code> on a column selection such as <code>df[col]</code> raises a FutureWarning in pandas 2.x. The selection behaves as a copy, so the in-place update will stop working in pandas 3.0. Use <code>df[col] = df[col].fillna(value)</code> or <code>df.fillna({col: value}, inplace=True)</code> instead.**
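The imputation logic can be wrapped in a small function that assigns the filled column back to the DataFrame. The sketch below uses hypothetical toy data, and `impute_with_mode` is an illustrative helper name, not a pandas function. Note that `mode()` returns its values sorted, so in the case of a tie the first entry is the smallest value, not necessarily the most meaningful one.

```python
import pandas as pd


def impute_with_mode(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of `frame` with NaNs in `column` replaced by its mode.

    Hypothetical helper for illustration; not part of pandas.
    """
    result = frame.copy()
    modes = result[column].mode(dropna=True)
    if modes.empty:
        # Column is entirely missing: nothing to impute from
        return result
    # Assign back instead of using inplace=True on a column selection
    result[column] = result[column].fillna(modes.iloc[0])
    return result


# Hypothetical toy data
toy = pd.DataFrame({"EdLevel": ["Bachelor", None, "Master", "Bachelor", None]})
filled = impute_with_mode(toy, "EdLevel")
print(filled["EdLevel"].tolist())
```

Working on a copy keeps the original values available, which makes it easier to compare distributions before and after imputation.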


### Step 6: Normalizing Compensation Data


**Task 4: Normalize Compensation Data Using ConvertedCompYearly**
   1. Use the ConvertedCompYearly column for compensation analysis as the normalized annual compensation is already provided.
   2. Check for missing values in ConvertedCompYearly and handle them if necessary.


In [58]:
## Write your code here
# --- Compensation Analysis using 'ConvertedCompYearly' ---
# Plotting libraries (assumes matplotlib and seaborn are installed,
# e.g. via: !pip install matplotlib seaborn)
import matplotlib.pyplot as plt
import seaborn as sns

compensation_column = 'ConvertedCompYearly'
if compensation_column in df.columns:
    print(f"--- Compensation Analysis using '{compensation_column}' ---")

    # Convert the column to numeric, coercing errors to NaN
    df[compensation_column] = pd.to_numeric(df[compensation_column], errors='coerce')

    # Drop rows with missing compensation for this analysis only;
    # df itself keeps every row. Alternatives: fill with median/mean.
    compensation_data = df.dropna(subset=[compensation_column])

    if not compensation_data.empty:
        # Display descriptive statistics
        print("\nDescriptive Statistics for Annual Compensation:")
        print(compensation_data[compensation_column].describe())

        # Visualize the distribution
        plt.figure(figsize=(12, 6))
        sns.histplot(compensation_data[compensation_column], bins=50, kde=True)
        plt.title(f'Distribution of Annual Compensation ({compensation_column})')
        plt.xlabel('Annual Compensation (USD)')
        plt.ylabel('Number of Respondents')
        plt.ticklabel_format(style='plain', axis='x')  # Prevent scientific notation on x-axis
        plt.tight_layout()
        plt.show()

        # Optional: Box plot for outliers
        plt.figure(figsize=(12, 2))
        sns.boxplot(x=compensation_data[compensation_column])
        plt.title(f'Box Plot of Annual Compensation ({compensation_column})')
        plt.xlabel('Annual Compensation (USD)')
        plt.ticklabel_format(style='plain', axis='x')
        plt.tight_layout()
        plt.show()
    else:
        print("No valid compensation values to analyze.")
else:
    print(f"Column '{compensation_column}' not found in the DataFrame.")



--- Compensation Analysis using 'ConvertedCompYearly' ---

Descriptive Statistics for Annual Compensation:
count    2.343500e+04
mean     8.615529e+04
std      1.867570e+05
min      1.000000e+00
25%      3.271200e+04
50%      6.500000e+04
75%      1.079715e+05
max      1.625660e+07
Name: ConvertedCompYearly, dtype: float64
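The statistics above report 23,435 non-missing compensation values out of 65,437 rows, so roughly 64% of the column is missing. Dropping those rows for the analysis is one option. If a complete column is needed, median imputation is a common alternative, although filling that many values will strongly shape the resulting distribution. The sketch below illustrates the idea on hypothetical toy values; `CompImputed` is an assumed column name chosen for this example.

```python
import pandas as pd

# Hypothetical toy values; the last entry is non-numeric on purpose
toy = pd.DataFrame({"ConvertedCompYearly": ["50000", "65000", None, "120000", "n/a"]})

# Coerce to numeric: unparseable entries become NaN
comp = pd.to_numeric(toy["ConvertedCompYearly"], errors="coerce")

# The median is less sensitive to extreme salaries than the mean
median_comp = comp.median()

# Store the imputed values in a new column so the raw data stays available
toy["CompImputed"] = comp.fillna(median_comp)  # hypothetical column name
print(median_comp)
print(toy["CompImputed"].tolist())
```

Keeping the imputed values in a separate column makes it possible to run the analysis both ways and compare the results.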

### Step 7: Summary and Next Steps


**In this lab, you focused on identifying and removing duplicate rows.**

- You handled missing values by imputing the most frequent value in a chosen column.

- You used ConvertedCompYearly for compensation normalization and handled missing values.

- For further analysis, consider exploring other columns or visualizing the cleaned dataset.


In [62]:
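## Write your code here
# Reload the raw dataset from the course URL. The bare filename "survey-data.csv"
# only works when the file exists in the current working directory.
# Note: this replaces the cleaned df with the unmodified data.
df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv")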

<!--
## Change Log

|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2024-11-05|1.2|Madhusudhan Moole|Updated lab|
|2024-09-24|1.1|Madhusudhan Moole|Updated lab|
|2024-09-23|1.0|Raghul Ramesh|Created lab|

-->


Copyright © IBM Corporation. All rights reserved.
