<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Data Wrangling Lab**


Estimated time needed: **45 to 60** minutes


In this lab, you will perform data wrangling: transforming and preparing raw data into a structured, usable format for analysis. The process uses a range of techniques to clean, normalize, and integrate data from different sources.


## Objectives


After completing this lab, you will be able to:


-   Apply techniques to identify and remove duplicate values from the given dataset.
  
-   Identify missing values in the dataset.

-   Use appropriate imputation strategies to handle missing values.

-   Use normalization techniques to prepare data for analysis.


## Load the dataset


#### Step 1: Install the necessary module.


In [None]:
!pip install pandas

#### Import the library


In [None]:
import pandas as pd

#### Step 2: Load the dataset into a dataframe.


You can load the Stack Overflow survey data with <code>df = pd.read_csv(file_name)</code>.

If you have a local copy of the dataset, replace the URL with its file path. Otherwise, the cell below reads the data from the provided URL:


In [None]:
# Load the Stack Overflow survey data from the provided URL
dataset_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"
df = pd.read_csv(dataset_url)
df

#### Step 3: Finding and Removing Duplicates


**3.1 Find Duplicate Values**

In this section, you will identify duplicate rows in the dataset.


In [None]:
## Write your code here
# Find duplicate rows in the DataFrame
duplicates = df[df.duplicated()]

# Display duplicate rows
print(duplicates)
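
Printing every duplicate row can be hard to read on a large dataset, so a count is often a useful first check. The sketch below uses a small toy DataFrame (hypothetical values, not taken from the survey; the `ResponseId` and `Country` column names are assumptions for illustration) to show how `duplicated()` behaves with its `keep` and `subset` parameters.

```python
import pandas as pd

# Toy data (hypothetical): rows 0 and 2 are identical
toy = pd.DataFrame({
    "ResponseId": [1, 2, 1, 3],
    "Country": ["India", "Germany", "India", "Germany"],
})

# By default, duplicated() marks each repeat after the first occurrence
n_duplicates = toy.duplicated().sum()

# keep=False marks all copies, so the first occurrence and its repeat appear together
all_copies = toy[toy.duplicated(keep=False)]

# subset limits the comparison to the listed columns
n_country_repeats = toy.duplicated(subset=["Country"]).sum()

print("Fully duplicated rows:", n_duplicates)
print("Rows involved in duplication:", len(all_copies))
print("Repeated Country values:", n_country_repeats)
```

The same calls can be applied to `df` once the survey data is loaded.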

**3.2 Remove Duplicate Values**

Remove all duplicate rows from the dataset.


In [None]:
## Write your code here
# Remove duplicate rows from the DataFrame
df = df.drop_duplicates()

# Display the updated DataFrame
print(df)
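
After removing duplicates, it can help to confirm that the operation did what you expected. The following is a minimal sketch on a toy DataFrame (hypothetical values and column names) that compares row counts before and after `drop_duplicates()` and resets the index so that row labels stay contiguous.

```python
import pandas as pd

# Toy data (hypothetical): rows 0 and 2 are identical
toy = pd.DataFrame({
    "ResponseId": [1, 2, 1, 3],
    "Country": ["India", "Germany", "India", "Germany"],
})

rows_before = len(toy)

# drop_duplicates() keeps the first occurrence of each row by default;
# reset_index(drop=True) renumbers the remaining rows from 0
deduped = toy.drop_duplicates().reset_index(drop=True)

rows_removed = rows_before - len(deduped)
remaining_duplicates = deduped.duplicated().sum()

print("Rows removed:", rows_removed)
print("Duplicates remaining:", remaining_duplicates)
```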

#### Step 4: Finding and Imputing Missing Values


**4.1 Finding Missing Values**

Check for missing values across all columns.


In [None]:
## Write your code here
# Count missing values in each column
missing_values = df.isnull().sum()

# Display the count of missing values per column
print(missing_values)
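
With many columns, raw counts are easier to interpret alongside percentages. The sketch below, on a toy DataFrame with hypothetical values (the `Country` column is an assumption for illustration), builds a small summary table that lists only the columns with missing values, sorted from most to least affected.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "RemoteWork": ["Remote", None, "Hybrid", None],
    "CompTotal": [50000.0, np.nan, 72000.0, 61000.0],
    "Country": ["India", "Germany", "India", "Germany"],
})

# isnull().sum() counts missing entries; isnull().mean() gives their fraction
missing_counts = toy.isnull().sum()
missing_pct = (toy.isnull().mean() * 100).round(1)

summary = pd.DataFrame({"missing": missing_counts, "percent": missing_pct})

# Keep only columns that have missing values, most affected first
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```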

**4.2 Handling Missing Values for a Specific Column**
                           
Check for missing values in the `RemoteWork` column.


In [None]:
df['RemoteWork']

In [None]:
## Write your code here
missing_count = df['RemoteWork'].isnull().sum()
print("Missing values in column:", missing_count)

**4.3 Impute Missing Values**

Impute missing values in `RemoteWork` with the most frequent value:


In [None]:
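## Write your code here
# Find the most frequent value, then fill the missing entries with it
most_frequent = df['RemoteWork'].value_counts().idxmax()
df['RemoteWork'] = df['RemoteWork'].fillna(most_frequent)
print("Imputed with:", most_frequent)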
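
The appropriate imputation strategy depends on the column type. As a minimal sketch on toy data (hypothetical values), the example below fills a categorical column with its most frequent value and a numeric column with its median. Using the median for numeric data is an assumption made here for illustration; it is less sensitive to extreme values than the mean.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "RemoteWork": ["Remote", "Hybrid", None, "Remote", None],
    "CompTotal": [50000.0, np.nan, 72000.0, 61000.0, 1000000.0],
})

# Categorical column: mode() returns a Series because ties are possible,
# so take its first entry
most_frequent = toy["RemoteWork"].mode()[0]
toy["RemoteWork"] = toy["RemoteWork"].fillna(most_frequent)

# Numeric column: fill with the median of the non-missing values
median_comp = toy["CompTotal"].median()
toy["CompTotal"] = toy["CompTotal"].fillna(median_comp)

print(toy)
```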

#### Step 5: Normalizing Data


The dataset records compensation in the `CompFreq` and `CompTotal` columns; the steps below use the `ConvertedCompYearly` column to create a standardized `NormalizedAnnualCompensation` column.




**5.1 Analyze the Annual Compensation Data**


Examine the `ConvertedCompYearly` column to understand the range of annual compensation reported by respondents.


In [None]:
## Write your code here
# Summary statistics (count, mean, min, max, quartiles) show the range
data_column = df['ConvertedCompYearly']
data_column.describe()

**5.2 Normalize the Annual Compensation for Comparative Analysis**


Using Min-Max Scaling or Z-score normalization, create a `NormalizedAnnualCompensation` column that standardizes the `ConvertedCompYearly` data. This makes it easier to compare compensation levels within a common range or around a mean.



- Option 1: Remove Rows with `NaN` in `ConvertedCompYearly`




In [None]:
## Write your code here
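
One possible approach to Option 1 is sketched below on toy data (hypothetical values). It drops the rows where `ConvertedCompYearly` is missing and then applies Min-Max scaling, (x - min) / (max - min), which maps the values onto the range 0 to 1. The variable names `toy` and `clean` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({"ConvertedCompYearly": [20000.0, np.nan, 60000.0, 100000.0]})

# Option 1: drop rows where the compensation is missing
clean = toy.dropna(subset=["ConvertedCompYearly"]).copy()

# Min-Max scaling: the smallest value maps to 0 and the largest to 1
col = clean["ConvertedCompYearly"]
clean["NormalizedAnnualCompensation"] = (col - col.min()) / (col.max() - col.min())

print(clean)
```

Note that Min-Max scaling is sensitive to outliers: a single very large salary compresses all other values toward 0.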

- Option 2: Fill `NaN` with a Placeholder Value


In [None]:
## Write your code here
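
One possible approach to Option 2 is sketched below on toy data (hypothetical values). It fills missing `ConvertedCompYearly` entries with the column median, which is an assumed choice of placeholder for illustration, and then applies Z-score normalization, (x - mean) / standard deviation, so that the result is centered on 0.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({"ConvertedCompYearly": [20000.0, np.nan, 60000.0, 100000.0]})

# Option 2: fill missing values with a placeholder (here, the median; an assumption)
placeholder = toy["ConvertedCompYearly"].median()
filled = toy["ConvertedCompYearly"].fillna(placeholder)

# Z-score normalization; pandas std() uses the sample standard deviation (ddof=1)
toy["NormalizedAnnualCompensation"] = (filled - filled.mean()) / filled.std()

print(toy)
```

Keep in mind that filling many rows with the same placeholder reduces the variance of the column, which in turn affects the resulting Z-scores.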

### Conclusion


In this lab, you successfully:

-   Identified and removed duplicate rows.

-   Found and imputed missing values.

-   Normalized compensation data to create an easily comparable metric.

You can extend these tasks to other columns in the dataset to deepen your data wrangling skills.


<!--## Change Log

| Date (YYYY-MM-DD) | Version | Changed By       | Change Description |
| ----------------- | ------- | ---------------- | ------------------ |
| 2024-11-05        | 3.0     | Madhusudan Moole | Updated Lab        |
| 2024-05-25        | 2.0     | Madhusudan Moole | ID reviewed        |
| 2023-09-18        | 1.0     | Raghul Ramesh    | Created lab        |

-->

## <h3 align="center"> © IBM Corporation. All rights reserved. </h3>


