<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Data Normalization Techniques**


Estimated time needed: **30** minutes


In this lab, you will focus on data normalization. This includes identifying compensation-related columns, applying normalization techniques, and visualizing the data distributions.


## Objectives


In this lab, you will perform the following:


- Identify duplicate rows and remove them.

- Check and handle missing values in key columns.

- Identify and normalize compensation-related columns.

- Visualize the effect of normalization techniques on data distributions.


-----


## Hands-on Lab


### Step 1: Install and Import Libraries


In [1]:
!pip install pandas



In [2]:
!pip install matplotlib



In [3]:
import pandas as pd
import matplotlib.pyplot as plt

### Step 2: Load the Dataset into a DataFrame


We use the <code>pandas.read_csv()</code> function to read CSV files. In this version of the lab, which runs on JupyterLite, the dataset must first be fetched into the browser session. The code below does this by passing the dataset URL directly to <code>pandas.read_csv()</code>:


In [4]:
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"

df = pd.read_csv(file_path)

# Display the first few rows to check if data is loaded correctly
print(df.head())


   ResponseId                      MainBranch                 Age  \
0           1  I am a developer by profession  Under 18 years old   
1           2  I am a developer by profession     35-44 years old   
2           3  I am a developer by profession     45-54 years old   
3           4           I am learning to code     18-24 years old   
4           5  I am a developer by profession     18-24 years old   

            Employment RemoteWork   Check  \
0  Employed, full-time     Remote  Apples   
1  Employed, full-time     Remote  Apples   
2  Employed, full-time     Remote  Apples   
3   Student, full-time        NaN  Apples   
4   Student, full-time        NaN  Apples   

                                    CodingActivities  \
0                                              Hobby   
1  Hobby;Contribute to open-source projects;Other...   
2  Hobby;Contribute to open-source projects;Other...   
3                                                NaN   
4                                 

In [5]:
#df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv")

### Section 1: Handling Duplicates
##### Task 1: Identify and remove duplicate rows.


In [7]:
## Write your code here
# Remove duplicate rows (keeping the first occurrence by default)
df_no_duplicates = df.drop_duplicates()

print("Original DataFrame:\n", df)
print("\nDataFrame with duplicates removed:\n", df_no_duplicates)

# Remove duplicates based on specific columns
# (ResponseId is the respondent identifier column shown in the df.head() output above)
df_no_duplicates_subset = df.drop_duplicates(subset=['ResponseId'])  # Check duplicates in ResponseId only

print("\nDataFrame with duplicates removed based on subset:\n", df_no_duplicates_subset)


# Keep the last occurrence of duplicates
df_no_duplicates_last = df.drop_duplicates(keep='last')

print("\nDataFrame with duplicates removed (keeping last):\n", df_no_duplicates_last)

# Remove duplicates and modify the original DataFrame (in-place)
df.drop_duplicates(inplace=True)

print("\nOriginal DataFrame modified in-place:\n", df)

Original DataFrame:
        ResponseId                      MainBranch                 Age  \
0               1  I am a developer by profession  Under 18 years old   
1               2  I am a developer by profession     35-44 years old   
2               3  I am a developer by profession     45-54 years old   
3               4           I am learning to code     18-24 years old   
4               5  I am a developer by profession     18-24 years old   
...           ...                             ...                 ...   
65432       65433  I am a developer by profession     18-24 years old   
65433       65434  I am a developer by profession     25-34 years old   
65434       65435  I am a developer by profession     25-34 years old   
65435       65436  I am a developer by profession     18-24 years old   
65436       65437     I code primarily as a hobby     18-24 years old   

                Employment                            RemoteWork   Check  \
0      Employed, full-time
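Task 1 asks you to identify duplicates as well as remove them. The sketch below shows one way to count and inspect duplicate rows before dropping them. It uses a small made-up frame (`toy`, with hypothetical values) rather than the survey data, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame standing in for the survey data
toy = pd.DataFrame({
    "ResponseId": [1, 2, 2, 3],
    "Age": ["18-24 years old", "25-34 years old", "25-34 years old", "35-44 years old"],
})

# duplicated() marks every repeat of an earlier, identical row as True
dup_mask = toy.duplicated()
num_duplicates = int(dup_mask.sum())
print("Number of duplicate rows:", num_duplicates)
print(toy[dup_mask])

# Drop the repeats and confirm that none remain
deduped = toy.drop_duplicates()
print("Rows before:", len(toy), "| rows after:", len(deduped))
```

Running the same two calls on the survey `df` (`df.duplicated().sum()` before and after `drop_duplicates()`) gives a quick check that the removal worked.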

### Section 2: Handling Missing Values
##### Task 2: Identify missing values in `CodingActivities`.


In [8]:
## Write your code here
import pandas as pd
import numpy as np  # Import numpy for NaN handling (if needed)

# Sample data (replace with your actual CodingActivities DataFrame)
data = {'RespondentID': [1, 2, 3, 4, 5],
        'CodingActivities': ['Coding', 'NaN', 'Coding', 'Not Coding', np.nan]}  # Include NaN
df = pd.DataFrame(data)

# 1. Identify Missing Values (Various Methods):

# a) Using isna() or isnull() (Most Common):
missing_values = df['CodingActivities'].isna()  # or df['CodingActivities'].isnull()
print("Missing Values (Boolean):\n", missing_values)

# b) Counting Missing Values:
num_missing = df['CodingActivities'].isna().sum()
print("\nNumber of Missing Values:", num_missing)

# c) Showing Rows with Missing Values:
rows_with_missing = df[df['CodingActivities'].isna()]
print("\nRows with Missing Values:\n", rows_with_missing)

# d) Using notna() to find valid values:
valid_values = df['CodingActivities'].notna()
print("\nValid Values (Boolean):\n", valid_values)

# 2. Handling Missing Values (Options):

# a) Filling with a specific value:
# (assign the result back; calling fillna(..., inplace=True) on df[col] is deprecated)
df['CodingActivities'] = df['CodingActivities'].fillna('Unknown')  # Fills missing values with 'Unknown'
print("\nDataFrame after filling missing values:\n", df)

# b) Filling with the mean/median/mode (if appropriate):
# (Not applicable in this categorical example; more relevant for numerical data)
# For numerical columns:  df['NumericalColumn'] = df['NumericalColumn'].fillna(df['NumericalColumn'].mean())

# c) Dropping rows with missing values:
# (no rows are dropped here, because step a) already filled the only real NaN)
df_no_missing = df.dropna(subset=['CodingActivities'])  # Removes rows where 'CodingActivities' is NaN
print("\nDataFrame after dropping rows with missing values:\n", df_no_missing)

# d) Replacing with numpy.nan (if you want to keep them as NaNs):
df['CodingActivities'] = df['CodingActivities'].replace({'NaN': np.nan})  # Replace string 'NaN' with actual numpy NaN
print("\nDataFrame after replacing string 'NaN' with numpy NaN:\n", df)

Missing Values (Boolean):
 0    False
1    False
2    False
3    False
4     True
Name: CodingActivities, dtype: bool

Number of Missing Values: 1

Rows with Missing Values:
    RespondentID CodingActivities
4             5              NaN

Valid Values (Boolean):
 0     True
1     True
2     True
3     True
4    False
Name: CodingActivities, dtype: bool

DataFrame after filling missing values:
    RespondentID CodingActivities
0             1           Coding
1             2              NaN
2             3           Coding
3             4       Not Coding
4             5          Unknown

DataFrame after dropping rows with missing values:
    RespondentID CodingActivities
0             1           Coding
1             2              NaN
2             3           Coding
3             4       Not Coding
4             5          Unknown

DataFrame after replacing string 'NaN' with numpy NaN:
    RespondentID CodingActivities
0             1           Coding
1             2             



##### Task 3: Impute missing values in CodingActivities with forward-fill.


In [10]:
## Write your code here
import pandas as pd
import numpy as np  # Needed for np.nan

# Sample data (replace with your actual CodingActivities DataFrame)
data = {'RespondentID': [1, 2, 3, 4, 5, 6, 7],
        'CodingActivities': ['Coding', 'NaN', 'Coding', 'Not Coding', np.nan, np.nan, 'Coding']}  # Include NaN
df = pd.DataFrame(data)

# 1. Replace string "NaN" with actual np.nan if needed
df['CodingActivities'] = df['CodingActivities'].replace({'NaN': np.nan})

# 2. Forward-fill missing values in 'CodingActivities'
# (each NaN takes the last valid value above it)
df['CodingActivities'] = df['CodingActivities'].ffill()  # Forward fill

print("DataFrame after forward-fill:\n", df)



# Example with ConvertedCompYearly (handling NaNs before normalizing)
data2 = {'ConvertedCompYearly': [100000, np.nan, 120000, np.nan, 150000, 200000, np.nan]}
df2 = pd.DataFrame(data2)

# Option 1: Drop rows with NaN in 'ConvertedCompYearly'
df2_dropped = df2.dropna(subset=['ConvertedCompYearly'])
# Normalize df2_dropped['ConvertedCompYearly'] (code for normalization would go here)
print("\nDataFrame after dropping NaNs:\n", df2_dropped)


# Option 2: Impute with median (or mean)
median_comp = df2['ConvertedCompYearly'].median()  # Calculate the median
df2['ConvertedCompYearly'] = df2['ConvertedCompYearly'].fillna(median_comp)  # Fill with median
# Now normalize df2['ConvertedCompYearly'] (code for normalization would go here)
print("\nDataFrame after imputing with median:\n", df2)

# Option 3: Forward fill ConvertedCompYearly
# (a no-op here: Option 2 already filled every NaN in df2, so the output is unchanged)
df2['ConvertedCompYearly'] = df2['ConvertedCompYearly'].ffill()
print("\nDataFrame after forward-fill:\n", df2)

DataFrame after forward-fill:
    RespondentID CodingActivities
0             1           Coding
1             2           Coding
2             3           Coding
3             4       Not Coding
4             5       Not Coding
5             6       Not Coding
6             7           Coding

DataFrame after dropping NaNs:
    ConvertedCompYearly
0             100000.0
2             120000.0
4             150000.0
5             200000.0

DataFrame after imputing with median:
    ConvertedCompYearly
0             100000.0
1             135000.0
2             120000.0
3             135000.0
4             150000.0
5             200000.0
6             135000.0

DataFrame after forward-fill:
    ConvertedCompYearly
0             100000.0
1             135000.0
2             120000.0
3             135000.0
4             150000.0
5             200000.0
6             135000.0



**Note**:  Before normalizing ConvertedCompYearly, ensure that any missing values (NaN) in this column are handled appropriately. You can choose to either drop the rows containing NaN or replace the missing values with a suitable statistic (e.g., median or mean).


### Section 3: Normalizing Compensation Data
##### Task 4: Identify compensation-related columns, such as ConvertedCompYearly.
Normalization is commonly applied to compensation data to bring values within a comparable range. Here, you’ll identify ConvertedCompYearly or similar columns, which contain compensation information. This column will be used in the subsequent tasks for normalization.


In [12]:
import pandas as pd

# Reload the survey data: df was overwritten with sample data in Task 2.
# file_path is the dataset URL defined in Step 2.
df = pd.read_csv(file_path)

# Check for compensation-related columns
compensation_columns = [col for col in df.columns if 'Comp' in col or 'Salary' in col]
print("Compensation-related columns:")
print(compensation_columns)


In [11]:
## Write your code here
import pandas as pd

# Reload the survey data: df was overwritten with sample data in Task 2.
# file_path is the dataset URL defined in Step 2.
df = pd.read_csv(file_path)

compensation_column = None  # Stays None if no suitable column is found

# 1. Check for 'ConvertedCompYearly' directly:
if 'ConvertedCompYearly' in df.columns:
    print("Found ConvertedCompYearly column.")
    compensation_column = 'ConvertedCompYearly'  # Store the column name
else:
    print("ConvertedCompYearly column not found. Checking for similar columns...")

    # 2. Look for similar column names (case-insensitive).
    # 'comp' also matches 'compensation', so no separate check is needed.
    similar_columns = [col for col in df.columns if 'comp' in col.lower() or 'salary' in col.lower()]

    if not similar_columns:
        print("No compensation-related columns found.")

    # 3a. If only one similar column is found, use it:
    elif len(similar_columns) == 1:
        compensation_column = similar_columns[0]
        print(f"Using {compensation_column} as the compensation column.")

    # 3b. If several are found, inspect them and prefer a yearly/annual column:
    else:
        print("Multiple compensation-related columns found.  Choose one or combine them.")

        # i) Print info about the columns to help you decide
        for col in similar_columns:
            print(f"\nColumn: {col}")
            print(df[col].describe())  # Get descriptive statistics
            print(df[col].head())      # Show a few values

        # ii) Select the column that seems to represent yearly compensation
        # (Replace with your actual logic)
        for col in similar_columns:
            if 'yearly' in col.lower() or 'annual' in col.lower():  # Look for yearly/annual in name
                compensation_column = col
                break
        if compensation_column:
            print(f"Selected {compensation_column} as the primary compensation column.")
        else:
            print("No suitable yearly compensation column found. Choose one manually from the listed columns.")

        # Example: If you have multiple columns (e.g., base salary, bonus),
        # you might create a new 'TotalCompensation' column:
        # df['TotalCompensation'] = df['BaseSalary'] + df['Bonus']


if compensation_column:  # If a compensation column was identified
    print(f"The column to be normalized is: {compensation_column}")
    # Now proceed with normalization (min-max scaling, standardization, etc.)
else:
    print("Cannot proceed with normalization without a compensation column.")


##### Task 5: Normalize ConvertedCompYearly using Min-Max Scaling.
Min-Max Scaling brings all values in a column to a 0-1 range, making it useful for comparing data across different scales. Here, you will apply Min-Max normalization to the ConvertedCompYearly column, creating a new column ConvertedCompYearly_MinMax with normalized values.
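As a quick illustration of the formula behind Min-Max scaling, `(x - min) / (max - min)`, here is a minimal sketch on four made-up salaries. The values and the variable names `comp` and `comp_minmax` are assumptions for illustration and are not taken from the survey.

```python
import pandas as pd

# Made-up salaries, not taken from the survey
comp = pd.Series([40000.0, 60000.0, 100000.0, 200000.0], name="ConvertedCompYearly")

# Min-Max scaling: subtract the minimum, then divide by the range
comp_minmax = (comp - comp.min()) / (comp.max() - comp.min())

print(comp_minmax.tolist())  # [0.0, 0.125, 0.375, 1.0]
```

The smallest value always maps to 0 and the largest to 1, so a single extreme salary compresses every other value toward 0.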


In [18]:
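import pandas as pd

# Reload the survey data: df was overwritten with sample data in Task 2.
# file_path is the dataset URL defined in Step 2.
df = pd.read_csv(file_path)

# Handle missing values before scaling (here: impute with the median)
comp = df['ConvertedCompYearly'].fillna(df['ConvertedCompYearly'].median())

# Min-Max scaling: (x - min) / (max - min) maps the values onto the 0-1 range
df['ConvertedCompYearly_MinMax'] = (comp - comp.min()) / (comp.max() - comp.min())

print(df['ConvertedCompYearly_MinMax'].describe())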


##### Task 6: Apply Z-score Normalization to `ConvertedCompYearly`.

Z-score normalization standardizes values by converting them to a distribution with a mean of 0 and a standard deviation of 1. This method is helpful for datasets with a Gaussian (normal) distribution. Here, you’ll calculate Z-scores for the ConvertedCompYearly column, saving the results in a new column ConvertedCompYearly_Zscore.
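The Z-score of a value is `(x - mean) / standard deviation`. The sketch below applies it to five made-up salaries (not survey data) and checks that the result has mean 0 and standard deviation 1. It assumes the population standard deviation (`ddof=0`), which is what scikit-learn's `StandardScaler` uses; pandas' `std()` defaults to the sample standard deviation (`ddof=1`) and gives slightly different numbers.

```python
import pandas as pd

# Made-up salaries, not taken from the survey
comp = pd.Series([40000.0, 60000.0, 80000.0, 100000.0, 120000.0], name="ConvertedCompYearly")

# Z-score: subtract the mean, then divide by the population standard deviation
comp_zscore = (comp - comp.mean()) / comp.std(ddof=0)

print(comp_zscore.round(3).tolist())
print("mean:", round(comp_zscore.mean(), 6), "| std:", round(comp_zscore.std(ddof=0), 6))
```

Unlike Min-Max scaling, the result is not confined to a fixed range: values far from the mean keep large positive or negative scores.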


In [20]:
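import pandas as pd

# Reload the survey data: df was overwritten with sample data in Task 2.
# file_path is the dataset URL defined in Step 2.
df = pd.read_csv(file_path)

# Handle missing values before standardizing (here: impute with the median)
comp = df['ConvertedCompYearly'].fillna(df['ConvertedCompYearly'].median())

# Z-score: (x - mean) / std. ddof=0 uses the population standard deviation,
# which matches scikit-learn's StandardScaler.
df['ConvertedCompYearly_Zscore'] = (comp - comp.mean()) / comp.std(ddof=0)

print(df['ConvertedCompYearly_Zscore'].describe())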


### Section 4: Visualization of Normalized Data
##### Task 7: Visualize the distribution of `ConvertedCompYearly`, `ConvertedCompYearly_MinMax`, and `ConvertedCompYearly_Zscore`

Visualization helps you understand how normalization changes the data distribution. In this task, create histograms for the original ConvertedCompYearly, as well as its normalized versions (ConvertedCompYearly_MinMax and ConvertedCompYearly_Zscore). This will help you compare how each normalization technique affects the data range and distribution.
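One thing to look for in the histograms: Min-Max scaling and Z-score normalization are both linear rescalings, so they change the x-axis but not the shape of the distribution. The sketch below checks this numerically on made-up salaries (not survey data) by binning each version with `numpy.histogram` and comparing the bin counts.

```python
import numpy as np
import pandas as pd

# Made-up salaries with one large outlier (not survey data)
comp = pd.Series([30000.0, 45000.0, 60000.0, 75000.0, 90000.0, 250000.0])

comp_minmax = (comp - comp.min()) / (comp.max() - comp.min())
comp_zscore = (comp - comp.mean()) / comp.std(ddof=0)

# Bin each version into the same number of equal-width bins
counts_original, _ = np.histogram(comp, bins=4)
counts_minmax, _ = np.histogram(comp_minmax, bins=4)
counts_zscore, _ = np.histogram(comp_zscore, bins=4)

print("Original:", counts_original.tolist())
print("Min-Max: ", counts_minmax.tolist())
print("Z-score: ", counts_zscore.tolist())
```

All three lines print the same counts, so the three histograms should show the same bar heights and differ only in the values along the x-axis.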


In [3]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# 1. Load the data (same dataset URL as in Step 2)
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"

try:
    df = pd.read_csv(file_path)  # Load the data
    file_loaded = True  # Set a flag to indicate successful loading
except Exception as e:
    print(f"An error occurred while reading the CSV: {e}")
    file_loaded = False  # Set the flag to false

if file_loaded:  # Proceed ONLY if the file was successfully loaded
    # 2. Identify the compensation column
    compensation_columns = [
        col for col in df.columns if 'comp' in col.lower() or 'salary' in col.lower()
    ]

    if 'ConvertedCompYearly' in df.columns:
        compensation_column = 'ConvertedCompYearly'
    elif compensation_columns:
        compensation_column = compensation_columns[0]  # Or your selection logic
    else:
        compensation_column = None

    if compensation_column:  # Proceed with normalization and visualization only if compensation column exists
        try:
            # 3. Handle missing values (assign back instead of chained inplace=True)
            df[compensation_column] = df[compensation_column].fillna(df[compensation_column].median())

            # 4. Min-Max Scaling (ravel() flattens the (n, 1) result into one column)
            scaler_minmax = MinMaxScaler()
            df[f'{compensation_column}_MinMax'] = scaler_minmax.fit_transform(df[[compensation_column]]).ravel()

            # 5. Z-score Standardization
            scaler_zscore = StandardScaler()
            df[f'{compensation_column}_Zscore'] = scaler_zscore.fit_transform(df[[compensation_column]]).ravel()

            # 6. Visualization: three histograms side by side
            plt.figure(figsize=(15, 5))

            plt.subplot(1, 3, 1)
            df[compensation_column].hist(bins=50)
            plt.title(f'Original {compensation_column} Distribution')

            plt.subplot(1, 3, 2)
            df[f'{compensation_column}_MinMax'].hist(bins=50)
            plt.title(f'{compensation_column} - Min-Max Scaled')

            plt.subplot(1, 3, 3)
            df[f'{compensation_column}_Zscore'].hist(bins=50)
            plt.title(f'{compensation_column} - Z-score Standardized')

            plt.tight_layout()
            plt.show()

        except Exception as e:
            print(f"An error occurred during processing: {e}")
    else:
        print("No suitable compensation column found. Cannot proceed.")

else:
    print("Data file could not be loaded. Exiting.")


### Summary


In this lab, you practiced essential data cleaning and normalization techniques, including:

- Identifying and handling duplicate rows.

- Checking for and imputing missing values.

- Applying Min-Max scaling and Z-score normalization to compensation data.

- Visualizing the impact of normalization on data distribution.


Copyright © IBM Corporation. All rights reserved.
