
**Impute Missing Values**

Estimated time needed:

In this lab, you will practice essential data wrangling techniques using the Stack Overflow survey dataset. The primary focus is on handling missing data and ensuring data quality. You will:

Load the Data: Import the dataset into a DataFrame using the pandas library.

Clean the Data: Identify and remove duplicate entries to maintain data integrity.

Handle Missing Values: Detect missing values, impute them with appropriate strategies, and verify the imputation to create a complete and reliable dataset for analysis.

This lab equips you with the skills to effectively preprocess and clean real-world datasets, a crucial step in any data analysis project.

Objectives
In this lab, you will perform the following:

Identify missing values in the dataset.

Apply techniques to impute missing values in the dataset.

Use suitable techniques to normalize data in the dataset.



### Step 1: Import Required Libraries

In [1]:
import pandas as pd

### Step 2: Load the Dataset Into a DataFrame


In [5]:
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"
df = pd.read_csv(file_path)

# Display the first few rows to ensure it loaded correctly
print(df.head())

   ResponseId                      MainBranch                 Age  \
0           1  I am a developer by profession  Under 18 years old   
1           2  I am a developer by profession     35-44 years old   
2           3  I am a developer by profession     45-54 years old   
3           4           I am learning to code     18-24 years old   
4           5  I am a developer by profession     18-24 years old   

            Employment RemoteWork   Check  \
0  Employed, full-time     Remote  Apples   
1  Employed, full-time     Remote  Apples   
2  Employed, full-time     Remote  Apples   
3   Student, full-time        NaN  Apples   
4   Student, full-time        NaN  Apples   

                                    CodingActivities  \
0                                              Hobby   
1  Hobby;Contribute to open-source projects;Other...   
2  Hobby;Contribute to open-source projects;Other...   
3                                                NaN   
4                                 

### Step 3: Finding and Removing Duplicates

Task 1: Identify duplicate rows in the dataset.

Task 2: Remove the duplicate rows from the DataFrame.

In [3]:
# Toy DataFrame for illustration only; it is kept separate from the survey data in df
sample_df = pd.DataFrame({
    'A': [1, 2, 3, 4, 1],
    'B': [5, 6, 7, 8, 5],
    'C': [9, 10, 11, 12, 9]
})

# Flag each row that repeats an earlier row (the first occurrence stays False)
duplicates = sample_df.duplicated()
print(duplicates)

0    False
1    False
2    False
3    False
4     True
dtype: bool
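
In the output above, only the second copy of the repeated row is flagged. As a minimal sketch on the same kind of toy data (the name `toy` is made up for this example), the `keep` and `subset` parameters of `duplicated()` change what counts as a duplicate:

```python
import pandas as pd

# Toy data for illustration only; this is not the survey dataset.
toy = pd.DataFrame({
    "A": [1, 2, 3, 4, 1],
    "B": [5, 6, 7, 8, 5],
    "C": [9, 10, 11, 12, 9],
})

# keep=False flags every member of a duplicate group, not just the repeats.
all_copies = toy[toy.duplicated(keep=False)]
print(all_copies)

# subset limits the comparison to the listed columns.
dup_on_a = toy.duplicated(subset=["A"]).sum()
print("Rows repeating an earlier value of A:", dup_on_a)
```

Viewing all copies side by side can help you decide whether rows are true duplicates before dropping them.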


In [6]:
# Task 1: Identify duplicate rows in the survey data
duplicates = df.duplicated()
print("Duplicate rows identified:")
print(duplicates)

# Task 2: Remove duplicate rows
# Note: the result is stored in df_no_duplicates; the steps below continue to use df
df_no_duplicates = df.drop_duplicates()

# Optional: save the deduplicated data as a new CSV file
df_no_duplicates.to_csv("survey_data_no_duplicates.csv", index=False)


Duplicate rows identified:
0        False
1        False
2        False
3        False
4        False
         ...  
65432    False
65433    False
65434    False
65435    False
65436    False
Length: 65437, dtype: bool
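
A truncated boolean Series does not show whether any row is actually `True`. One possible way to get a single number, and to confirm that the removal worked, is sketched below. The `demo` DataFrame is a made-up stand-in; with the survey data you would call the same methods on `df`.

```python
import pandas as pd

# Hypothetical stand-in for the survey DataFrame.
demo = pd.DataFrame({
    "ResponseId": [1, 2, 2, 3],
    "RemoteWork": ["Remote", "Hybrid", "Hybrid", "In-person"],
})

# Summing a boolean Series counts the True values.
n_duplicates = demo.duplicated().sum()

# Drop the repeats and rebuild a clean 0..n-1 index.
deduped = demo.drop_duplicates().reset_index(drop=True)

print("Duplicate rows:", n_duplicates)
print("Shape before:", demo.shape, "after:", deduped.shape)
print("Duplicates remaining:", deduped.duplicated().sum())
```

Comparing the shapes before and after is a quick check that exactly `n_duplicates` rows were removed.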


### Step 4: Finding Missing Values

Task 3: Find the missing values for all columns.

Task 4: Find out how many rows have a missing value in the RemoteWork column.

In [7]:
# Task 3: Count the missing values in every column
missing_values = df.isnull().sum()
print("Missing values per column:")
print(missing_values)

Missing values per column:
ResponseId                 0
MainBranch                 0
Age                        0
Employment                 0
RemoteWork             10631
                       ...  
JobSatPoints_11        35992
SurveyLength            9255
SurveyEase              9199
ConvertedCompYearly    42002
JobSat                 36311
Length: 114, dtype: int64
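
With 114 columns, the printed Series is truncated and raw counts are hard to compare. A minimal sketch, using a made-up `demo` DataFrame rather than the survey data, of a summary that keeps only the columns with gaps and adds the share of rows missing:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the survey DataFrame.
demo = pd.DataFrame({
    "ResponseId": [1, 2, 3, 4],
    "RemoteWork": ["Remote", np.nan, "In-person", np.nan],
    "JobSat": [8.0, np.nan, np.nan, np.nan],
})

# isnull().mean() gives the fraction of missing rows per column.
missing_summary = pd.DataFrame({
    "missing": demo.isnull().sum(),
    "percent": (demo.isnull().mean() * 100).round(1),
})

# Keep only columns with at least one gap, worst first.
missing_summary = (
    missing_summary[missing_summary["missing"] > 0]
    .sort_values("missing", ascending=False)
)
print(missing_summary)
```

The percentage column makes it easier to judge which columns are reasonable candidates for imputation.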


In [8]:
# Task 4: Count the rows with a missing value in the 'RemoteWork' column
missing_remote_work = df['RemoteWork'].isnull().sum()
print(f"Rows missing in 'RemoteWork' column: {missing_remote_work}")

Rows missing in 'RemoteWork' column: 10631


### Step 5: Imputing Missing Values

Task 5: Find the value counts for the column RemoteWork.

Task 6: Identify the most frequent (majority) value in the RemoteWork column.

Task 7: Impute (replace) all the missing values in the RemoteWork column with the majority value.

Task 8: Check for any compensation-related columns and describe their distribution.


In [9]:
# Task 5: Find value counts for the 'RemoteWork' column
remote_work_counts = df['RemoteWork'].value_counts()
print("Value counts for 'RemoteWork' column:")
print(remote_work_counts)

Value counts for 'RemoteWork' column:
RemoteWork
Hybrid (some remote, some in-person)    23015
Remote                                  20831
In-person                               10960
Name: count, dtype: int64
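
The three counts above add up to 54806, which is the 65437 rows minus the 10631 missing ones, because `value_counts()` leaves out missing values by default. A small sketch on a made-up Series shows how `dropna=False` includes them and how `normalize=True` returns shares instead of counts:

```python
import numpy as np
import pandas as pd

# Toy Series for illustration; the labels are shortened stand-ins.
remote = pd.Series(
    ["Hybrid", "Remote", "Hybrid", np.nan, "In-person", np.nan],
    name="RemoteWork",
)

# Default behaviour: NaN is excluded from the tally.
counts = remote.value_counts()

# dropna=False adds a NaN entry to the result.
counts_with_nan = remote.value_counts(dropna=False)

# normalize=True gives each category's share of the non-missing rows.
shares = remote.value_counts(normalize=True)

print(counts_with_nan)
print(shares)
```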


In [10]:
# Task 6: Identify the most frequent value in the 'RemoteWork' column
most_frequent_value = df['RemoteWork'].mode()[0]
print(f"The most frequent value in 'RemoteWork' column is: {most_frequent_value}")

The most frequent value in 'RemoteWork' column is: Hybrid (some remote, some in-person)


In [11]:
# Task 7: Impute missing values in 'RemoteWork' column with the majority value
df['RemoteWork'] = df['RemoteWork'].fillna(most_frequent_value)
print("Missing values in 'RemoteWork' column have been replaced.")

Missing values in 'RemoteWork' column have been replaced.
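
The cell above reports success without checking it. One way to verify an imputation, sketched here on a made-up `demo` DataFrame, is to confirm that no missing values remain and that the majority count grew by exactly the number of filled rows. For the survey data, that would mean the Hybrid count rising from 23015 to 23015 + 10631 = 33646.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the survey DataFrame.
demo = pd.DataFrame(
    {"RemoteWork": ["Hybrid", np.nan, "Remote", "Hybrid", np.nan]}
)

# Record the state before imputing.
missing_before = demo["RemoteWork"].isnull().sum()
counts_before = demo["RemoteWork"].value_counts()
majority = demo["RemoteWork"].mode()[0]

# Impute with the most frequent value.
demo["RemoteWork"] = demo["RemoteWork"].fillna(majority)

# Record the state after imputing.
missing_after = demo["RemoteWork"].isnull().sum()
counts_after = demo["RemoteWork"].value_counts()

print("Missing before:", missing_before, "after:", missing_after)
print("Majority count before:", counts_before[majority],
      "after:", counts_after[majority])
```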


In [12]:
# Task 8: Check for compensation-related columns
compensation_columns = [col for col in df.columns if 'salary' in col.lower() or 'compensation' in col.lower() or 'pay' in col.lower()]
print("Compensation-related columns:", compensation_columns)

# Describe their distribution (skipped when no column name matches)
if compensation_columns:
    compensation_description = df[compensation_columns].describe()
    print("Distribution of compensation-related columns:")
    print(compensation_description)

Compensation-related columns: []
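
The search returns an empty list because no column name contains "salary", "compensation", or "pay". The Step 4 output does list a column named ConvertedCompYearly, which uses the abbreviation "Comp". A minimal sketch, assuming that matching on "comp" is an acceptable heuristic and using made-up values in a `demo` DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in; the compensation values are invented.
demo = pd.DataFrame({
    "ResponseId": [1, 2, 3, 4, 5],
    "ConvertedCompYearly": [50000.0, np.nan, 72000.0, np.nan, 64000.0],
})

# Assumption: a case-insensitive match on "comp" finds compensation columns.
comp_columns = [col for col in demo.columns if "comp" in col.lower()]
print("Compensation-related columns:", comp_columns)

# describe() ignores missing values, so count shows the non-missing rows.
description = demo[comp_columns].describe()
print(description)
```

A short substring such as "comp" can also match unrelated column names, so it is worth reading the resulting list before describing or imputing those columns.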


**Summary**

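In this lab, you identified duplicate rows, counted the missing values in each column, and imputed the missing values in the RemoteWork column with its most frequent value.

Use the pandas.read_csv() function to load a dataset from a CSV file into a DataFrame.

If the dataset cannot be read from its URL, download it and pass the local file path to pandas.read_csv() instead.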
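
The loading advice above can be sketched as a small helper that tries one location and falls back to a local copy. The function name `load_csv` and the file names are made up for this example; the demonstration writes a tiny temporary CSV so that it runs without network access.

```python
import os
import tempfile

import pandas as pd


def load_csv(primary, fallback):
    """Read `primary`; if it cannot be opened, read `fallback` instead."""
    try:
        return pd.read_csv(primary)
    except OSError:
        # Missing files and failed downloads both raise OSError subclasses.
        return pd.read_csv(fallback)


# Demonstration with a temporary local copy (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    local_copy = os.path.join(tmp, "survey-data.csv")
    pd.DataFrame({"ResponseId": [1, 2]}).to_csv(local_copy, index=False)

    # The primary path does not exist, so the fallback is used.
    loaded = load_csv(os.path.join(tmp, "missing.csv"), local_copy)

print(loaded.shape)
```

In the lab, `primary` would be the dataset URL from Step 2 and `fallback` the path of the file you downloaded.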

