<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Impute Missing Values**


Estimated time needed: **30** minutes


In this lab, you will practice essential data wrangling techniques using the Stack Overflow survey dataset. The primary focus is on handling missing data and ensuring data quality. You will:

- **Load the Data:** Import the dataset into a DataFrame using the pandas library.

- **Clean the Data:** Identify and remove duplicate entries to maintain data integrity.

- **Handle Missing Values:** Detect missing values, impute them with appropriate strategies, and verify the imputation to create a complete and reliable dataset for analysis.

This lab equips you with the skills to effectively preprocess and clean real-world datasets, a crucial step in any data analysis project.


## Objectives


In this lab, you will perform the following:


-   Identify missing values in the dataset.

-   Apply techniques to impute missing values in the dataset.
  
-   Use suitable techniques to normalize data in the dataset.


-----


#### Install needed library


In [1]:
!pip install pandas



### Step 1: Import Required Libraries


In [2]:
import pandas as pd

### Step 2: Load the Dataset Into a Dataframe


#### **Read Data**
<p>
The code below reads the dataset from its URL directly into a pandas DataFrame:
</p>


In [3]:
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/VYPrOu0Vs3I0hKLLjiPGrA/survey-data-with-duplicate.csv"
df = pd.read_csv(file_path)

# Display the first few rows to ensure it loaded correctly
print(df.head())

   ResponseId                      MainBranch                 Age  \
0           1  I am a developer by profession  Under 18 years old   
1           2  I am a developer by profession     35-44 years old   
2           3  I am a developer by profession     45-54 years old   
3           4           I am learning to code     18-24 years old   
4           5  I am a developer by profession     18-24 years old   

            Employment RemoteWork   Check  \
0  Employed, full-time     Remote  Apples   
1  Employed, full-time     Remote  Apples   
2  Employed, full-time     Remote  Apples   
3   Student, full-time        NaN  Apples   
4   Student, full-time        NaN  Apples   

                                    CodingActivities  \
0                                              Hobby   
1  Hobby;Contribute to open-source projects;Other...   
2  Hobby;Contribute to open-source projects;Other...   
3                                                NaN   
4                                 

### Step 3: Finding and Removing Duplicates
##### Task 1: Identify duplicate rows in the dataset.


In [4]:
## Write your code here
# Makes a dataframe for repeated rows
df_duplicate_rows = df[df.duplicated()]

# Returns total count of duplicated rows and prints total
duplicate_count = len(df_duplicate_rows)
print('Number of duplicated rows from dataset:', duplicate_count)
df_duplicate_rows

Number of duplicated rows from dataset: 20


Unnamed: 0,ResponseId,MainBranch,Age,Employment,RemoteWork,Check,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,...,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,SurveyLength,SurveyEase,ConvertedCompYearly,JobSat
65437,1,I am a developer by profession,Under 18 years old,"Employed, full-time",Remote,Apples,Hobby,Primary/elementary school,Books / Physical media,,...,,,,,,,,,,
65438,2,I am a developer by profession,35-44 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,0.0,0.0,0.0,0.0,0.0,0.0,,,,
65439,3,I am a developer by profession,45-54 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,,,,,,,Appropriate in length,Easy,,
65440,4,I am learning to code,18-24 years old,"Student, full-time",,Apples,,Some college/university study without earning ...,"Other online resources (e.g., videos, blogs, f...",Stack Overflow;How-to videos;Interactive tutorial,...,,,,,,,Too long,Easy,,
65441,5,I am a developer by profession,18-24 years old,"Student, full-time",,Apples,,"Secondary school (e.g. American high school, G...","Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Written Tutorial...,...,,,,,,,Too short,Easy,,
65442,6,I code primarily as a hobby,Under 18 years old,"Student, full-time",,Apples,,Primary/elementary school,"School (i.e., University, College, etc);Online...",,...,,,,,,,Appropriate in length,Easy,,
65443,7,"I am not primarily a developer, but I write co...",35-44 years old,"Employed, full-time",Remote,Apples,I don’t code outside of work,"Professional degree (JD, MD, Ph.D, Ed.D, etc.)","Other online resources (e.g., videos, blogs, f...",Technical documentation;Stack Overflow;Written...,...,,,,,,,Too long,Neither easy nor difficult,,
65444,8,I am learning to code,18-24 years old,"Student, full-time;Not employed, but looking f...",,Apples,,"Secondary school (e.g. American high school, G...","Other online resources (e.g., videos, blogs, f...",Technical documentation;Video-based Online Cou...,...,,,,,,,Appropriate in length,Difficult,,
65445,9,I code primarily as a hobby,45-54 years old,"Employed, full-time",In-person,Apples,Hobby,"Professional degree (JD, MD, Ph.D, Ed.D, etc.)",Books / Physical media;Other online resources ...,Stack Overflow;Written-based Online Courses,...,,,,,,,Appropriate in length,Neither easy nor difficult,,
65446,10,I am a developer by profession,35-44 years old,"Independent contractor, freelancer, or self-em...",Remote,Apples,Bootstrapping a business,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",On the job training;Other online resources (e....,Technical documentation;Blogs;Written Tutorial...,...,,,,,,,Too long,Easy,,
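
By default, `duplicated()` flags only the second and later copies of a row, which is why the table above starts at index 65437. The sketch below uses a small toy DataFrame (hypothetical values, not survey data) to show how `keep=False` flags every copy, which can help when you want to inspect the originals next to their duplicates.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the survey)
toy = pd.DataFrame({
    "ResponseId": [1, 2, 3, 1, 2],
    "Age": ["18-24", "25-34", "35-44", "18-24", "25-34"],
})

# Default keep="first": only the later copies are flagged
later_copies = toy[toy.duplicated()]

# keep=False: every row that has an identical twin is flagged
all_copies = toy[toy.duplicated(keep=False)].sort_values("ResponseId")

print("Later copies only:", len(later_copies))
print("All copies:", len(all_copies))
print(all_copies)
```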


##### Task 2: Remove the duplicate rows from the dataframe.



In [5]:
## Write your code here
df_cleaned = df.drop_duplicates()
print('Before:', len(df))
print('After:', len(df_cleaned))
print('Difference:', len(df) - len(df_cleaned))

Before: 65457
After: 65437
Difference: 20


### Step 4: Finding Missing Values
##### Task 3: Find the missing values for all columns.


In [6]:
## Write your code here
# Finding null values in data
missing_values = df.isnull().sum()

# Filters list to all columns with missing values
missing_values = missing_values[missing_values > 0].sort_values(ascending=False)

# Prints summary
print(missing_values)

AINextMuch less integrated    64309
AINextLess integrated         63102
AINextNo change               52955
AINextMuch more integrated    52018
EmbeddedAdmired               48718
                              ...  
YearsCode                      5570
NEWSOSites                     5151
LearnCode                      4950
EdLevel                        4654
AISelect                       4531
Length: 109, dtype: int64
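
Raw counts are easier to judge next to the share of rows they represent. As a minimal sketch on a toy DataFrame (the values are hypothetical), `isna().mean()` gives the fraction of missing entries per column, and it can be combined with the counts in one summary table.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the survey)
toy = pd.DataFrame({
    "RemoteWork": ["Remote", None, "In-person", None],
    "EdLevel": ["Bachelor", "Master", None, "Bachelor"],
    "ResponseId": [1, 2, 3, 4],
})

# Count and fraction of missing values per column
missing_count = toy.isna().sum()
missing_share = toy.isna().mean()

# Keep only columns that actually have missing values, largest first
summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)

print(summary)
```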


##### Task 4: Find out how many rows are missing in the column RemoteWork.


In [7]:
## Write your code here
df_cleaned['RemoteWork'].isnull().value_counts()

RemoteWork
False    54806
True     10631
Name: count, dtype: int64

The output shows that 10631 rows are missing a value in the RemoteWork column.

### Step 5: Imputing Missing Values
##### Task 5: Find the value counts for the column RemoteWork.


In [8]:
## Write your code here
len(df_cleaned['RemoteWork'])

65437
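
Note that `len()` returns the total number of rows in the column, not a count per category. Per-category counts come from `Series.value_counts()`. The sketch below shows it on a toy Series (hypothetical values), including the `dropna=False` option that also counts the missing entries.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the survey)
remote = pd.Series(
    ["Remote", "Hybrid", None, "Hybrid", "In-person", None, "Hybrid"],
    name="RemoteWork",
)

# Count of each category; missing values are excluded by default
counts = remote.value_counts()

# Same counts, with missing values included as their own entry
counts_with_nan = remote.value_counts(dropna=False)

print(counts)
print(counts_with_nan)
```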

##### Task 6: Identify the most frequent (majority) value in the RemoteWork column.



In [9]:
## Write your code here
most_remotework = df_cleaned['RemoteWork'].value_counts().idxmax()
most_remotework

'Hybrid (some remote, some in-person)'
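
An alternative to `value_counts().idxmax()` is `Series.mode()`, which returns every value tied for most frequent instead of a single one. The following is a small sketch on a toy Series (hypothetical values) where two categories tie.

```python
import pandas as pd

# Toy data (hypothetical values) with a tie between "Hybrid" and "Remote"
remote = pd.Series(["Remote", "Hybrid", None, "Hybrid", "Remote", "In-person"])

# mode() ignores missing values by default and returns all tied values
modes = remote.mode()

# idxmax() on the counts returns a single label, even when there is a tie
top_by_counts = remote.value_counts().idxmax()

print("Modes:", modes.tolist())
print("Single top value:", top_by_counts)
```

If a tie matters for your analysis, checking `len(modes) > 1` before imputing makes the choice explicit.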

##### Task 7: Impute (replace) all the empty rows in the column RemoteWork with the majority value.



In [10]:
## Write your code here
# Replaces NaN values in column
df_cleaned.loc[:, 'RemoteWork'] = df_cleaned['RemoteWork'].fillna(most_remotework)

# Check values are replaced
df_cleaned['RemoteWork'].isnull().value_counts()

RemoteWork
False    65437
Name: count, dtype: int64
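
The same imputation can be wrapped in a small reusable function. This is only a sketch: `impute_with_mode` is a hypothetical helper name, and the toy DataFrame does not come from the survey. It works on a copy so the input DataFrame is left unchanged.

```python
import pandas as pd

def impute_with_mode(frame, column):
    """Return a copy of `frame` with missing values in `column`
    replaced by that column's most frequent value."""
    result = frame.copy()
    # mode() skips missing values; iloc[0] picks one value if there is a tie.
    # This raises IndexError if the column contains only missing values.
    most_frequent = result[column].mode().iloc[0]
    result[column] = result[column].fillna(most_frequent)
    return result

# Toy data (hypothetical values, not taken from the survey)
toy = pd.DataFrame({"RemoteWork": ["Hybrid", None, "Remote", "Hybrid", None]})
filled = impute_with_mode(toy, "RemoteWork")

print(filled["RemoteWork"].tolist())
print("Missing before:", toy["RemoteWork"].isna().sum())
print("Missing after:", filled["RemoteWork"].isna().sum())
```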

##### Task 8: Check for any compensation-related columns and describe their distribution.



In [11]:
## Write your code here
# The include argument only applies to DataFrames, so it is omitted here
df_cleaned['ConvertedCompYearly'].describe()

count    2.343500e+04
mean     8.615529e+04
std      1.867570e+05
min      1.000000e+00
25%      3.271200e+04
50%      6.500000e+04
75%      1.079715e+05
max      1.625660e+07
Name: ConvertedCompYearly, dtype: float64

Looking at the output of <code>.describe()</code>, we notice that the count is 23435 rather than 65437, the number of rows in the deduplicated dataset. About 64% of the values in this column are therefore missing, as noticed in previous labs. The mean of the available data is about 86155, well above the median of 65000, and the standard deviation (about 186757) is more than twice the mean. A mean this high could come from values clustered close together, but that is unlikely for a survey that covers such a wide range of education levels, geographical locations, and other factors. The more plausible explanation is a small number of highly inflated outliers pulling the mean up, while the values in the normal range are too low to offset them. The gap between the minimum (1) and the maximum (about 16.3 million) points the same way. A box plot would make the skew and the position of the outliers easier to see.
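
The box plot idea can also be checked numerically. As a rough sketch, the code below applies the common 1.5 × IQR rule (the same rule a box plot uses for its whiskers) to a toy Series. The values are hypothetical and only mimic the shape described above. They are not survey data.

```python
import pandas as pd

# Toy compensation values (hypothetical), with one extreme high value
comp = pd.Series(
    [1, 30_000, 45_000, 60_000, 65_000, 80_000, 110_000, 16_000_000],
    dtype="float64",
)

# Interquartile range and the usual 1.5 * IQR fences
q1 = comp.quantile(0.25)
q3 = comp.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers = comp[(comp < lower) | (comp > upper)]

print("Mean:", comp.mean())
print("Median:", comp.median())
print("Outliers:", outliers.tolist())
```

A mean far above the median, together with flagged values only on the high side, is the pattern you would expect from a right-skewed column.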

### Summary 


**In this lab, you focused on imputing missing values in the dataset.**

- Use the <code>pandas.read_csv()</code> function to load a dataset from a CSV file into a DataFrame.

- Download the dataset if it's not available online and specify the correct file path.



<!--
## Change Log
|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2024-11-05|1.3|Madhusudhan Moole|Updated lab|
|2024-10-29|1.2|Madhusudhan Moole|Updated lab|
|2024-09-27|1.1|Madhusudhan Moole|Updated lab|
|2024-09-26|1.0|Raghul Ramesh|Created lab|
-->


Copyright © IBM Corporation. All rights reserved.
