<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Removing Duplicates**


Estimated time needed: **45 to 60** minutes


## Introduction


In this lab, you will focus on data wrangling: cleaning and organizing data so that it is suitable for analysis. One key task in this process is removing duplicate entries, which can distort analysis and lead to inaccurate conclusions.


## Objectives


In this lab you will perform the following:


1. Identify duplicate rows in the dataset.
2. Use suitable techniques to remove duplicate rows and verify the removal.
3. Summarize how to handle missing values appropriately.
4. Use ConvertedCompYearly to normalize compensation data.
   


### Step 1: Import Required Libraries


In [59]:
import pandas as pd

### Step 2: Load the Dataset into a DataFrame


#### **Read Data**


If you are using JupyterLite, use the code below to download the dataset into your environment. If you are using a local environment, you can use the direct URL with <code>pd.read_csv()</code>.


**Load the data into a pandas dataframe:**


In [60]:
file_name = "./survey_data_with_duplicate.csv"

df = pd.read_csv(file_name)

**Note:** If you are working in a local Jupyter environment, you can pass the URL directly to the <code>pandas.read_csv()</code> function as shown below:

<code>df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv")</code>


### Step 3: Identifying Duplicate Rows


**Task 1: Identify Duplicate Rows**
  1. Count the number of duplicate rows in the dataset.
  2. Display the first few duplicate rows to understand their structure.


In [61]:
# Boolean mask: True for every row that repeats an earlier row
dupes = df.duplicated()
# Show the first few duplicate rows
df[dupes].head()

| Unnamed: 0 | ResponseId | MainBranch | Age | Employment | RemoteWork | Check | CodingActivities | EdLevel | LearnCode | LearnCodeOnline | ... | JobSatPoints_6 | JobSatPoints_7 | JobSatPoints_8 | JobSatPoints_9 | JobSatPoints_10 | JobSatPoints_11 | SurveyLength | SurveyEase | ConvertedCompYearly | JobSat |
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
| 65437 | 1 | I am a developer by profession | Under 18 years old | Employed, full-time | Remote | Apples | Hobby | Primary/elementary school | Books / Physical media | | ... | | | | | | | | | | |
| 65438 | 2 | I am a developer by profession | 35-44 years old | Employed, full-time | Remote | Apples | Hobby;Contribute to open-source projects;Other... | Bachelor’s degree (B.A., B.S., B.Eng., etc.) | Books / Physical media;Colleague;On the job tr... | Technical documentation;Blogs;Books;Written Tu... | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | | | |
| 65439 | 3 | I am a developer by profession | 45-54 years old | Employed, full-time | Remote | Apples | Hobby;Contribute to open-source projects;Other... | Master’s degree (M.A., M.S., M.Eng., MBA, etc.) | Books / Physical media;Colleague;On the job tr... | Technical documentation;Blogs;Books;Written Tu... | ... | | | | | | | Appropriate in length | Easy | | |
| 65440 | 4 | I am learning to code | 18-24 years old | Student, full-time | | Apples | | Some college/university study without earning ... | Other online resources (e.g., videos, blogs, f... | Stack Overflow;How-to videos;Interactive tutorial | ... | | | | | | | Too long | Easy | | |
| 65441 | 5 | I am a developer by profession | 18-24 years old | Student, full-time | | Apples | | Secondary school (e.g. American high school, G... | Other online resources (e.g., videos, blogs, f... | Technical documentation;Blogs;Written Tutorial... | ... | | | | | | | Too short | Easy | | |
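
Task 1 also asks for the number of duplicate rows. A minimal sketch of that count, using a small hypothetical DataFrame (`toy` and its values are made up for illustration and are not the survey data):

```python
import pandas as pd

# Hypothetical toy data: row 2 repeats row 0 exactly.
toy = pd.DataFrame({
    "ResponseId": [1, 2, 1, 3],
    "Age": ["18-24", "25-34", "18-24", "35-44"],
})

# duplicated() marks each repeat of an earlier row as True.
dupes = toy.duplicated()

# Summing a boolean Series counts the True values.
n_duplicates = int(dupes.sum())
print("Number of duplicate rows:", n_duplicates)

# keep=False marks every member of a duplicated group, including the first.
all_copies = toy[toy.duplicated(keep=False)]
print(all_copies)
```

On the lab's DataFrame, the same idea would be `df.duplicated().sum()`.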


### Step 4: Removing Duplicate Rows


**Task 2: Remove Duplicates**
   1. Remove duplicate rows from the dataset using the <code>drop_duplicates()</code> function.
   2. Verify the removal by counting the number of duplicate rows after removal.


In [62]:
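# Keep the first occurrence of each row and drop the repeats
df = df.drop_duplicates()
# Summing the boolean mask counts any duplicates that remain
print("Number of duplicates after removal:", df.duplicated().sum())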

Number of duplicates after removal: 0
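
By default, `drop_duplicates()` compares entire rows. If two responses should count as duplicates whenever a key column matches, the `subset` and `keep` parameters can express that. The sketch below uses hypothetical toy data, and treating `ResponseId` as the key is an assumption made for illustration, not something the lab requires:

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 share a ResponseId but differ in Check.
toy = pd.DataFrame({
    "ResponseId": [1, 2, 1, 3],
    "Age": ["18-24", "25-34", "18-24", "35-44"],
    "Check": ["Apples", "Apples", "Pears", "Apples"],
})

# Full-row comparison: nothing is dropped because no two rows match exactly.
full = toy.drop_duplicates()

# Key-based comparison: keep the first row for each ResponseId.
by_id = (
    toy.drop_duplicates(subset=["ResponseId"], keep="first")
       .reset_index(drop=True)
)

print(len(full), len(by_id))
```

Resetting the index afterwards is optional; it only makes the row labels contiguous again.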


### Step 5: Handling Missing Values


**Task 3: Identify and Handle Missing Values**
   1. Identify missing values for all columns in the dataset.
   2. Choose a column with significant missing values (e.g., EdLevel) and impute with the most frequent value.


In [63]:
print('Ratio of missing values: ')
missing_ratio = (df.isna().sum() / len(df)).sort_values(ascending=False)
missing_ratio

Ratio of missing values: 


AINextMuch less integrated    0.982456
AINextLess integrated         0.964011
AINextNo change               0.809007
AINextMuch more integrated    0.794642
EmbeddedAdmired               0.744288
                                ...   
MainBranch                    0.000000
Check                         0.000000
Employment                    0.000000
Age                           0.000000
ResponseId                    0.000000
Length: 114, dtype: float64
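
The ratio above divides the count of missing values by the number of rows. An equivalent shorthand is to take the mean of the boolean mask, since `True` counts as 1. A small sketch on hypothetical toy data (the column values are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing entries in both columns.
toy = pd.DataFrame({
    "EdLevel": ["Bachelor", None, "Master", None],
    "JobSat": [7.0, np.nan, 8.0, 9.0],
})

# Explicit form: missing count divided by row count.
ratio_explicit = toy.isna().sum() / len(toy)

# Shorthand: the mean of a boolean mask is the fraction of True values.
ratio_short = toy.isna().mean()

print(ratio_short.sort_values(ascending=False))
```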

In [64]:
# Get the cols where more than 20% values are missing
significant_missing_cols = missing_ratio[missing_ratio > 0.2].index 
print(f"Columns with more than 20% missing values: {len(significant_missing_cols)}") 

Columns with more than 20% missing values: 87


In [65]:
for col in significant_missing_cols:
    if df[col].dtype == 'object':           # For Categorical columns
        mode_value = df[col].mode()[0]      # Get the mode
        df[col] = df[col].fillna(mode_value)
    else:                                   # For Numerical columns
        mean_value = df[col].mean()         # Get the mean
        df[col] = df[col].fillna(mean_value)

print("Missing values imputed.")


Missing values imputed.
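
The task description also mentions imputing a single column such as EdLevel with its most frequent value. A minimal sketch of that narrower case, on hypothetical toy data (the values are invented, and the guard against an empty mode is an added precaution, not part of the lab):

```python
import pandas as pd

# Hypothetical toy column with two missing entries.
toy = pd.DataFrame({
    "EdLevel": ["Bachelor", None, "Master", "Bachelor", None],
})

# mode() ignores missing values by default and returns a Series,
# because several values can tie for most frequent.
modes = toy["EdLevel"].mode()

# Guard: a column with no observed values has an empty mode.
if not modes.empty:
    most_frequent = modes[0]
    toy["EdLevel"] = toy["EdLevel"].fillna(most_frequent)

print(toy["EdLevel"].value_counts())
```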


In [66]:
print('Updated ratio of missing values: ')
missing_ratio = (df.isna().sum() / len(df)).sort_values(ascending=False)
missing_ratio

Updated ratio of missing values: 


ToolsTechHaveWorkedWith          0.197977
OpSysProfessional use            0.190473
CodingActivities                 0.167657
RemoteWork                       0.162462
OfficeStackSyncHaveWorkedWith    0.151168
                                   ...   
MiscTechWantToWorkWith           0.000000
MiscTechHaveWorkedWith           0.000000
EmbeddedAdmired                  0.000000
EmbeddedWantToWorkWith           0.000000
JobSat                           0.000000
Length: 114, dtype: float64

### Step 6: Normalizing Compensation Data


**Task 4: Normalize Compensation Data Using ConvertedCompYearly**
   1. Use the ConvertedCompYearly column for compensation analysis as the normalized annual compensation is already provided.
   2. Check for missing values in ConvertedCompYearly and handle them if necessary.


In [75]:
print('Missing yearly compensation values:', df['ConvertedCompYearly'].isna().sum())

Missing yearly compensation values: 0
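
If the check had reported missing values, they would need handling before analysis. One possible approach, shown here as a sketch on hypothetical toy data, is to fill them with the median, which is less affected by a few very high salaries than the mean. Choosing the median is an assumption for illustration, not the lab's prescribed method:

```python
import numpy as np
import pandas as pd

# Hypothetical toy compensation values with one extreme salary.
toy = pd.DataFrame({
    "ConvertedCompYearly": [40000.0, 60000.0, np.nan, 1000000.0, np.nan],
})

n_missing = int(toy["ConvertedCompYearly"].isna().sum())
print("Missing yearly compensation values:", n_missing)

# Both statistics skip missing values by default.
median_comp = toy["ConvertedCompYearly"].median()
mean_comp = toy["ConvertedCompYearly"].mean()

# Fill only if something is actually missing.
if n_missing > 0:
    toy["ConvertedCompYearly"] = toy["ConvertedCompYearly"].fillna(median_comp)

print("Median:", median_comp, "Mean:", round(mean_comp, 2))
```

In this toy example the single large value pulls the mean far above the median, which is why the median can be the more representative fill value for skewed data.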


### Step 7: Summary and Next Steps


**In this lab, you focused on identifying and removing duplicate rows.**

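- You handled missing values by imputing the most frequent value (categorical columns) or the mean (numerical columns) in every column with more than 20% missing values.

- You used ConvertedCompYearly for compensation analysis and verified that it has no remaining missing values.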

- For further analysis, consider exploring other columns or visualizing the cleaned dataset.


<!--
## Change Log

|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2024-11-05|1.2|Madhusudhan Moole|Updated lab|
|2024-09-24|1.1|Madhusudhan Moole|Updated lab|
|2024-09-23|1.0|Raghul Ramesh|Created lab|

-->


<h3 align="center"> © IBM Corporation. All rights reserved. </h3>
