<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Impute Missing Values**


Estimated time needed: **30** minutes


In this lab, you will practice essential data wrangling techniques using the Stack Overflow survey dataset. The primary focus is on handling missing data and ensuring data quality. You will:

- **Load the Data:** Import the dataset into a DataFrame using the pandas library.

- **Clean the Data:** Identify and remove duplicate entries to maintain data integrity.

- **Handle Missing Values:** Detect missing values, impute them with appropriate strategies, and verify the imputation to create a complete and reliable dataset for analysis.

This lab equips you with the skills to effectively preprocess and clean real-world datasets, a crucial step in any data analysis project.


## Objectives


In this lab, you will perform the following:


-   Identify missing values in the dataset.

-   Apply techniques to impute missing values in the dataset.
  
-   Use suitable techniques to normalize data in the dataset.


-----


#### Install needed library


In [1]:
!pip install pandas



### Step 1: Import Required Libraries


In [2]:
import pandas as pd

### Step 2: Load the Dataset Into a DataFrame


#### **Read Data**
<p>
The code below reads the dataset from a URL into a pandas DataFrame:
</p>


In [3]:
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"
df = pd.read_csv(file_path)

# Display the first few rows to ensure it loaded correctly
print(df.head())

   ResponseId                      MainBranch                 Age  \
0           1  I am a developer by profession  Under 18 years old   
1           2  I am a developer by profession     35-44 years old   
2           3  I am a developer by profession     45-54 years old   
3           4           I am learning to code     18-24 years old   
4           5  I am a developer by profession     18-24 years old   

            Employment RemoteWork   Check  \
0  Employed, full-time     Remote  Apples   
1  Employed, full-time     Remote  Apples   
2  Employed, full-time     Remote  Apples   
3   Student, full-time        NaN  Apples   
4   Student, full-time        NaN  Apples   

                                    CodingActivities  \
0                                              Hobby   
1  Hobby;Contribute to open-source projects;Other...   
2  Hobby;Contribute to open-source projects;Other...   
3                                                NaN   
4                                 
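
When a CSV file is loaded, pandas converts empty fields and a set of default marker strings such as `NA` into `NaN`, which is what the missing-value checks later in this lab rely on. The sketch below illustrates this on a small in-memory CSV. The three rows are made up for illustration and are not taken from the survey file.

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for the survey file.
csv_text = """ResponseId,Employment,RemoteWork
1,"Employed, full-time",Remote
2,"Student, full-time",
3,"Employed, full-time",NA
"""

toy_df = pd.read_csv(io.StringIO(csv_text))

# Both the empty field and the literal "NA" are parsed as NaN by default.
print(toy_df)
print(toy_df["RemoteWork"].isna().sum())
```

If a dataset uses its own marker for missing answers, the `na_values` parameter of `read_csv` can list additional strings to treat as missing.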

### Step 3: Finding and Removing Duplicates
##### Task 1: Identify duplicate rows in the dataset.


In [4]:
## Write your code here
# Check for duplicate rows in the dataset
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows in the dataset: {duplicate_count}")

# Display the duplicate rows if any exist
if duplicate_count > 0:
    print("\nFirst few duplicate rows:")
    print(df[df.duplicated()].head())

    # Additional information: count rows that repeat on a subset of columns
    print("\nDuplicate rows by subset of columns:")
    print(df.duplicated(subset=['Age', 'RemoteWork', 'YearsCodePro']).sum())
else:
    print("No duplicate rows found in the dataset.")

Number of duplicate rows in the dataset: 0
No duplicate rows found in the dataset.


##### Task 2: Remove the duplicate rows from the dataframe.



In [5]:
## Write your code here
# Get the shape before removing duplicates
original_shape = df.shape
print(f"Original dataframe shape: {original_shape}")

# Remove duplicate rows and keep the first occurrence
df_no_duplicates = df.drop_duplicates(keep='first')

# Get the shape after removing duplicates
new_shape = df_no_duplicates.shape
print(f"Dataframe shape after removing duplicates: {new_shape}")
print(f"Number of duplicate rows removed: {original_shape[0] - new_shape[0]}")

# Use the cleaned dataframe for subsequent operations
df = df_no_duplicates

Original dataframe shape: (65437, 114)
Dataframe shape after removing duplicates: (65437, 114)
Number of duplicate rows removed: 0
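
The survey data contains no fully duplicated rows, so the calls above have nothing to remove. To see how `duplicated()` and `drop_duplicates()` behave when duplicates do exist, here is a minimal sketch on a made-up DataFrame. The values are hypothetical; only the column names are borrowed from the survey.

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 are identical; row 3 repeats only the Age of row 1.
toy_df = pd.DataFrame({
    "ResponseId": [1, 2, 1, 4],
    "Age": ["18-24 years old", "35-44 years old", "18-24 years old", "35-44 years old"],
})

# keep='first' (the default) flags only the later copies.
later_copies = toy_df.duplicated().sum()

# keep=False flags every member of a duplicate group, which helps when inspecting them.
all_copies = toy_df.duplicated(keep=False).sum()

# subset= compares only the listed columns.
age_duplicates = toy_df.duplicated(subset=["Age"]).sum()

# Drop the later copies and renumber the index.
deduplicated = toy_df.drop_duplicates(keep="first").reset_index(drop=True)

print(later_copies, all_copies, age_duplicates)
print(deduplicated)
```

Checking a subset of columns is stricter than checking whole rows, so it should be used only when those columns are meant to identify a respondent uniquely.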


### Step 4: Finding Missing Values
##### Task 3: Find the missing values for all columns.


In [6]:
## Write your code here
# Check for missing values in each column
print("Missing Values Count for Each Column:")
missing_values = df.isnull().sum()
print(missing_values)

# Calculate the percentage of missing values in each column
print("\nPercentage of Missing Values in Each Column:")
missing_percentage = (df.isnull().sum() / len(df) * 100).round(2)
print(missing_percentage)

# Display columns with missing values (more than 0)
print("\nColumns with Missing Values:")
columns_with_missing = missing_values[missing_values > 0]
print(columns_with_missing)

Missing Values Count for Each Column:
ResponseId                 0
MainBranch                 0
Age                        0
Employment                 0
RemoteWork             10631
                       ...  
JobSatPoints_11        35992
SurveyLength            9255
SurveyEase              9199
ConvertedCompYearly    42002
JobSat                 36311
Length: 114, dtype: int64

Percentage of Missing Values in Each Column:
ResponseId              0.00
MainBranch              0.00
Age                     0.00
Employment              0.00
RemoteWork             16.25
                       ...  
JobSatPoints_11        55.00
SurveyLength           14.14
SurveyEase             14.06
ConvertedCompYearly    64.19
JobSat                 55.49
Length: 114, dtype: float64

Columns with Missing Values:
RemoteWork             10631
CodingActivities       10971
EdLevel                 4653
LearnCode               4949
LearnCodeOnline        16200
                       ...  
JobSatPoints_11     
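
With 114 columns, the printed Series are truncated in the middle, so the columns with the most gaps can be hard to spot. One possible approach is to combine the counts and percentages into a single table and sort it. The sketch below uses a small made-up DataFrame; the names `missing_summary`, `missing_count`, and `missing_percent` are chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with the same kind of gaps as the survey columns.
toy_df = pd.DataFrame({
    "ResponseId": [1, 2, 3, 4],
    "RemoteWork": ["Remote", np.nan, "In-person", np.nan],
    "ConvertedCompYearly": [np.nan, np.nan, 65000.0, np.nan],
})

# isna().mean() gives the fraction of missing entries per column.
missing_summary = pd.DataFrame({
    "missing_count": toy_df.isna().sum(),
    "missing_percent": (toy_df.isna().mean() * 100).round(2),
})

# Keep only the columns with gaps and list the worst first.
missing_summary = (
    missing_summary[missing_summary["missing_count"] > 0]
    .sort_values("missing_count", ascending=False)
)
print(missing_summary)
```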

##### Task 4: Find out how many values are missing in the column RemoteWork.


In [7]:
## Write your code here
# Count missing values in the RemoteWork column
missing_remote_work = df['RemoteWork'].isnull().sum()
print(f"Number of missing values in RemoteWork column: {missing_remote_work}")

# Calculate the percentage of missing values in RemoteWork column
missing_percentage = (missing_remote_work / len(df) * 100).round(2)
print(f"Percentage of missing values in RemoteWork column: {missing_percentage}%")

# Display a few rows with missing RemoteWork values
print("\nSample of rows with missing RemoteWork values:")
print(df[df['RemoteWork'].isnull()].head())

Number of missing values in RemoteWork column: 10631
Percentage of missing values in RemoteWork column: 16.25%

Sample of rows with missing RemoteWork values:
    ResponseId                                         MainBranch  \
3            4                              I am learning to code   
4            5                     I am a developer by profession   
5            6                        I code primarily as a hobby   
7            8                              I am learning to code   
13          14  I used to be a developer by profession, but no...   

                   Age                                         Employment  \
3      18-24 years old                                 Student, full-time   
4      18-24 years old                                 Student, full-time   
5   Under 18 years old                                 Student, full-time   
7      18-24 years old  Student, full-time;Not employed, but looking f...   
13     35-44 years old             Not em
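
Most of the sample rows above belong to full-time students, which suggests that `RemoteWork` may be missing mainly for respondents the question does not apply to. One way to check this is to compute the share of missing values per `Employment` group. The sketch below uses made-up data, so it shows the technique only and says nothing about the actual rates in the survey.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the real survey has many more Employment categories.
toy_df = pd.DataFrame({
    "Employment": [
        "Employed, full-time", "Employed, full-time", "Employed, full-time",
        "Student, full-time", "Student, full-time",
    ],
    "RemoteWork": ["Remote", "In-person", np.nan, np.nan, np.nan],
})

# The mean of a boolean mask per group is the fraction of missing answers.
missing_rate_by_employment = (
    toy_df["RemoteWork"].isna()
    .groupby(toy_df["Employment"])
    .mean()
    .sort_values(ascending=False)
)
print(missing_rate_by_employment)
```

If the missing values turn out to be concentrated in one group, filling them with the overall majority value may misrepresent that group, which is worth keeping in mind for the imputation step.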

### Step 5: Imputing Missing Values
##### Task 5: Find the value counts for the column RemoteWork.


In [8]:
## Write your code here
# Get value counts for the RemoteWork column
remote_work_counts = df['RemoteWork'].value_counts(dropna=False)
print("Value counts for RemoteWork column:")
print(remote_work_counts)

# Show the distribution as percentages
remote_work_percent = df['RemoteWork'].value_counts(normalize=True, dropna=False) * 100
print("\nPercentage distribution of RemoteWork values:")
print(remote_work_percent.round(2))

Value counts for RemoteWork column:
RemoteWork
Hybrid (some remote, some in-person)    23015
Remote                                  20831
In-person                               10960
NaN                                     10631
Name: count, dtype: int64

Percentage distribution of RemoteWork values:
RemoteWork
Hybrid (some remote, some in-person)    35.17
Remote                                  31.83
In-person                               16.75
NaN                                     16.25
Name: proportion, dtype: float64


##### Task 6: Identify the most frequent (majority) value in the RemoteWork column.



In [9]:
## Write your code here
# Find the most frequent value in the RemoteWork column (excluding NaN)
most_frequent_remote_work = df['RemoteWork'].mode()[0]
print(f"Most frequent value in RemoteWork column: '{most_frequent_remote_work}'")

# Count occurrences of the most frequent value
count_most_frequent = df['RemoteWork'].value_counts()[most_frequent_remote_work]
print(f"Count of most frequent value: {count_most_frequent}")

# Calculate percentage of the most frequent value
percentage_most_frequent = (count_most_frequent / df['RemoteWork'].count() * 100).round(2)
print(f"Percentage of most frequent value: {percentage_most_frequent}%")

Most frequent value in RemoteWork column: 'Hybrid (some remote, some in-person)'
Count of most frequent value: 23015
Percentage of most frequent value: 41.99%
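
`Series.mode()` returns a Series rather than a single value, because several values can tie for most frequent; indexing with `[0]` then picks the first of the tied values in sorted order. It also ignores `NaN` by default. A small sketch with made-up answers illustrates both points.

```python
import numpy as np
import pandas as pd

# Hypothetical answers: "Hybrid" and "Remote" tie, and NaN is the most common entry overall.
answers = pd.Series(["Remote", "Hybrid", "Remote", "Hybrid", np.nan, np.nan, np.nan])

# NaN is dropped by default (dropna=True), so it is never reported as the mode.
modes = answers.mode()
print(list(modes))

# With a tie, [0] simply takes the first of the sorted tied values.
majority_value = modes[0]
print(majority_value)
```

In the survey data there is a single most frequent `RemoteWork` value, so the tie case does not arise, but checking `len(modes)` is a cheap safeguard before imputing.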


##### Task 7: Impute (replace) all the missing values in the column RemoteWork with the majority value.



In [10]:
## Write your code here
# Make a copy of the dataframe to preserve the original
df_imputed = df.copy()

# Identify the most frequent value in RemoteWork column
most_frequent_remote_work = df['RemoteWork'].mode()[0]
print(f"Imputing missing values with: '{most_frequent_remote_work}'")

# Count missing values before imputation
missing_before = df_imputed['RemoteWork'].isnull().sum()
print(f"Missing values before imputation: {missing_before}")

# Impute missing values with the most frequent value
# (assign the result back; calling fillna(..., inplace=True) on a selected
# column raises a FutureWarning and will stop working in pandas 3.0)
df_imputed['RemoteWork'] = df_imputed['RemoteWork'].fillna(most_frequent_remote_work)

# Count missing values after imputation
missing_after = df_imputed['RemoteWork'].isnull().sum()
print(f"Missing values after imputation: {missing_after}")

# Verify the imputation with value counts
print("\nValue counts after imputation:")
print(df_imputed['RemoteWork'].value_counts())

# Update the main dataframe
df = df_imputed

Imputing missing values with: 'Hybrid (some remote, some in-person)'
Missing values before imputation: 10631
Missing values after imputation: 0

Value counts after imputation:
RemoteWork
Hybrid (some remote, some in-person)    33646
Remote                                  20831
In-person                               10960
Name: count, dtype: int64


##### Task 8: Check for any compensation-related columns and describe their distribution.



In [11]:
## Write your code here
# Identify compensation-related columns
compensation_columns = [col for col in df.columns if 'salary' in col.lower() 
                         or 'compensation' in col.lower() 
                         or 'income' in col.lower() 
                         or 'pay' in col.lower()]

print(f"Compensation-related columns found: {compensation_columns}")

# If compensation columns are found, describe their distribution
if compensation_columns:
    print("\nSummary statistics for compensation-related columns:")
    print(df[compensation_columns].describe())
    
    # Check for missing values in compensation columns
    print("\nMissing values in compensation columns:")
    print(df[compensation_columns].isnull().sum())
    
    # Sample data from compensation columns
    print("\nSample data from compensation columns:")
    print(df[compensation_columns].head())
else:
    print("\nNo compensation-related columns found in the dataset.")
    
    # Show all column names to help identify potential compensation columns
    print("\nAll column names in the dataset:")
    for i, col in enumerate(df.columns):
        print(f"{i+1}. {col}")

Compensation-related columns found: []

No compensation-related columns found in the dataset.

All column names in the dataset:
1. ResponseId
2. MainBranch
3. Age
4. Employment
5. RemoteWork
6. Check
7. CodingActivities
8. EdLevel
9. LearnCode
10. LearnCodeOnline
11. TechDoc
12. YearsCode
13. YearsCodePro
14. DevType
15. OrgSize
16. PurchaseInfluence
17. BuyNewTool
18. BuildvsBuy
19. TechEndorse
20. Country
21. Currency
22. CompTotal
23. LanguageHaveWorkedWith
24. LanguageWantToWorkWith
25. LanguageAdmired
26. DatabaseHaveWorkedWith
27. DatabaseWantToWorkWith
28. DatabaseAdmired
29. PlatformHaveWorkedWith
30. PlatformWantToWorkWith
31. PlatformAdmired
32. WebframeHaveWorkedWith
33. WebframeWantToWorkWith
34. WebframeAdmired
35. EmbeddedHaveWorkedWith
36. EmbeddedWantToWorkWith
37. EmbeddedAdmired
38. MiscTechHaveWorkedWith
39. MiscTechWantToWorkWith
40. MiscTechAdmired
41. ToolsTechHaveWorkedWith
42. ToolsTechWantToWorkWith
43. ToolsTechAdmired
44. NEWCollabToolsHaveWorkedWith
45. NEWC
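
The keyword search above returns an empty list because the survey's compensation columns use the abbreviation `Comp`: `CompTotal` and `ConvertedCompYearly` both appear in the column listings in this lab, and none of the four keywords match them. A minimal sketch of a broader match follows, run on made-up values. The `keywords` tuple is an assumption, and a short substring such as `comp` could also match unrelated columns in other datasets, so the result should be reviewed by eye.

```python
import numpy as np
import pandas as pd

# Hypothetical values; only the column names are taken from the survey.
toy_df = pd.DataFrame({
    "ResponseId": [1, 2, 3, 4],
    "Currency": ["USD", "EUR", "USD", np.nan],
    "CompTotal": [90000.0, np.nan, 120000.0, np.nan],
    "ConvertedCompYearly": [90000.0, np.nan, 120000.0, np.nan],
})

keywords = ("comp", "salary", "income", "pay")  # assumed keyword list
compensation_columns = [
    col for col in toy_df.columns
    if any(keyword in col.lower() for keyword in keywords)
]
print(compensation_columns)

# describe() skips NaN, so its count row shows how many values are present.
summary = toy_df[compensation_columns].describe()
print(summary)
print(toy_df[compensation_columns].isna().sum())
```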

### Summary 


**In this lab, you focused on imputing missing values in the dataset.**

- Use the <code>pandas.read_csv()</code> function to load a dataset from a CSV file into a DataFrame.

- If the dataset cannot be loaded from its URL, download it and specify the correct local file path.



<!--
## Change Log
|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2024-11-05|1.3|Madhusudhan Moole|Updated lab|
|2024-10-29|1.2|Madhusudhan Moole|Updated lab|
|2024-09-27|1.1|Madhusudhan Moole|Updated lab|
|2024-09-26|1.0|Raghul Ramesh|Created lab|
-->


Copyright © IBM Corporation. All rights reserved.
