<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Data Normalization Techniques**


Estimated time needed: **30** minutes


In this lab, you will focus on data normalization. This includes identifying compensation-related columns, applying normalization techniques, and visualizing the data distributions.


## Objectives


In this lab, you will perform the following:


- Identify duplicate rows and remove them.

- Check and handle missing values in key columns.

- Identify and normalize compensation-related columns.

- Visualize the effect of normalization techniques on data distributions.


-----


## Hands-on Lab


### Step 1: Install and Import Libraries


In [1]:
!pip install pandas



In [2]:
!pip install matplotlib


In [3]:
import pandas as pd
import matplotlib.pyplot as plt

### Step 2: Load the Dataset into a DataFrame


The <code>pandas.read_csv()</code> function reads CSV files from a local path or a URL. In this version of the lab, which runs on JupyterLite, the dataset must be fetched from its hosted URL using the code provided below.


The cell below loads the dataset into a DataFrame and prints the first few rows:


In [4]:
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"

df = pd.read_csv(file_path)

# Display the first few rows to check if data is loaded correctly
print(df.head())


   ResponseId                      MainBranch                 Age  \
0           1  I am a developer by profession  Under 18 years old   
1           2  I am a developer by profession     35-44 years old   
2           3  I am a developer by profession     45-54 years old   
3           4           I am learning to code     18-24 years old   
4           5  I am a developer by profession     18-24 years old   

            Employment RemoteWork   Check  \
0  Employed, full-time     Remote  Apples   
1  Employed, full-time     Remote  Apples   
2  Employed, full-time     Remote  Apples   
3   Student, full-time        NaN  Apples   
4   Student, full-time        NaN  Apples   

                                    CodingActivities  \
0                                              Hobby   
1  Hobby;Contribute to open-source projects;Other...   
2  Hobby;Contribute to open-source projects;Other...   
3                                                NaN   
4                                 


### Section 1: Handling Duplicates
##### Task 1: Identify and remove duplicate rows.


In [5]:
## Write your code here
# Count rows that exactly repeat an earlier row
num_duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {num_duplicates}")

# Remove the duplicates, keeping the first occurrence of each row
df = df.drop_duplicates().reset_index(drop=True)

# Confirm that no duplicates remain
print(f"Number of duplicate rows after removal: {df.duplicated().sum()}")
print(f"Rows remaining: {len(df)}")

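Task 1 asks you to identify and remove duplicate rows. The following is a minimal sketch on a toy DataFrame, not on the survey data: the values are invented for illustration, and only `duplicated()` and `drop_duplicates()` carry over to the lab.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the survey, values are made up
toy = pd.DataFrame({
    "ResponseId": [1, 2, 2, 3],
    "MainBranch": ["dev", "student", "student", "dev"],
})

# duplicated() flags every row that repeats an earlier row
dup_mask = toy.duplicated()
n_duplicates = int(dup_mask.sum())

# drop_duplicates() keeps the first occurrence of each row by default
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"Duplicate rows found: {n_duplicates}")
print(deduped)
```

Applied to the lab data, the same two calls would run on `df` instead of `toy`.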


### Section 2: Handling Missing Values
##### Task 2: Identify missing values in `CodingActivities`.


In [6]:
## Write your code here
# Identify missing values in the 'CodingActivities' column
missing_coding_activities = df['CodingActivities'].isnull().sum()

# Display the number of missing values in the 'CodingActivities' column
print(f"Number of missing values in the 'CodingActivities' column: {missing_coding_activities}")


Number of missing values in the 'CodingActivities' column: 10971


##### Task 3: Impute missing values in CodingActivities with forward-fill.


In [8]:
## Write your code here
# Impute missing values in the 'CodingActivities' column using forward-fill
df['CodingActivities'] = df['CodingActivities'].ffill()

# Check if there are any missing values remaining in the 'CodingActivities' column
missing_after_imputation = df['CodingActivities'].isnull().sum()

# Display the result
print(f"Number of missing values in 'CodingActivities' after forward-fill: {missing_after_imputation}")


Number of missing values in 'CodingActivities' after forward-fill: 0
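Forward-fill copies the last observed value downward, so it cannot fill a gap at the very start of a column. The sketch below illustrates this on a small, made-up Series. The backward-fill at the end is one possible fallback, shown as an assumption rather than a step the lab requires.

```python
import pandas as pd

# Hypothetical values; the first entry is deliberately missing
s = pd.Series(
    [None, "Hobby", None, None, "School or academic work"],
    dtype="object",
)

# Forward-fill propagates the last seen value into later gaps
filled = s.ffill()
remaining = int(filled.isnull().sum())
print(f"Missing after forward-fill: {remaining}")

# Optional fallback: backward-fill takes care of a leading gap
fully_filled = filled.bfill()
print(f"Missing after backward-fill: {int(fully_filled.isnull().sum())}")
```

In the lab output above, the count after forward-fill is 0, which indicates that the first row of `CodingActivities` was not missing.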




**Note**:  Before normalizing ConvertedCompYearly, ensure that any missing values (NaN) in this column are handled appropriately. You can choose to either drop the rows containing NaN or replace the missing values with a suitable statistic (e.g., median or mean).
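The two options in the note can be sketched as follows. The numbers are invented for illustration; in the lab you would apply the same calls to `df['ConvertedCompYearly']`. The median is used here because it is less sensitive to very large salaries than the mean, which is a design choice rather than a requirement.

```python
import pandas as pd

# Hypothetical compensation values with two missing entries
comp = pd.Series(
    [50000.0, None, 72000.0, None, 1_000_000.0],
    name="ConvertedCompYearly",
)

# Option 1: drop the rows with missing compensation
dropped = comp.dropna()

# Option 2: replace missing values with the median of the observed values
median_value = comp.median()  # NaN entries are skipped by default
imputed = comp.fillna(median_value)

print(f"Rows kept after dropna: {len(dropped)}")
print(f"Median used for imputation: {median_value}")
print(imputed)
```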


### Section 3: Normalizing Compensation Data
##### Task 4: Identify compensation-related columns, such as ConvertedCompYearly.
Normalization is commonly applied to compensation data to bring values within a comparable range. Here, you’ll identify ConvertedCompYearly or similar columns, which contain compensation information. This column will be used in the subsequent tasks for normalization.


In [9]:
## Write your code here
# Display the column names to understand the structure
print("Columns in the dataset:")
print(df.columns)

# Check for compensation-related columns by searching for relevant keywords
compensation_keywords = ['comp', 'salary', 'income', 'convertedcomp', 'pay', 'bonus']
compensation_columns = [col for col in df.columns if any(keyword in col.lower() for keyword in compensation_keywords)]

# Display the identified compensation-related columns
print("\nIdentified compensation-related columns:")
print(compensation_columns)

Columns in the dataset:
Index(['ResponseId', 'MainBranch', 'Age', 'Employment', 'RemoteWork', 'Check',
       'CodingActivities', 'EdLevel', 'LearnCode', 'LearnCodeOnline',
       ...
       'JobSatPoints_6', 'JobSatPoints_7', 'JobSatPoints_8', 'JobSatPoints_9',
       'JobSatPoints_10', 'JobSatPoints_11', 'SurveyLength', 'SurveyEase',
       'ConvertedCompYearly', 'JobSat'],
      dtype='object', length=114)

Identified compensation-related columns:
['CompTotal', 'AIComplex', 'ConvertedCompYearly']
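The substring `comp` also matches `AIComplex`, which is not a compensation column. One way to avoid such false positives is to match keywords against whole CamelCase words instead of substrings. The sketch below is an illustration under that assumption: the helper `split_camel_case` and the keyword set are hypothetical, and the column list is a small sample of the names shown in the output above.

```python
import re

# A sample of column names from the dataset; the keyword set is an assumption
columns = ["ResponseId", "CompTotal", "AIComplex", "ConvertedCompYearly", "JobSat"]
keywords = {"comp", "compensation", "salary", "income", "pay", "bonus"}

def split_camel_case(name):
    """Split a CamelCase column name into lowercase word tokens."""
    tokens = re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+", name)
    return [token.lower() for token in tokens]

# Keep a column only if one of its word tokens is exactly a keyword
compensation_columns = [
    col for col in columns if keywords & set(split_camel_case(col))
]

print(compensation_columns)  # ['CompTotal', 'ConvertedCompYearly']
```

Here `AIComplex` splits into `ai` and `complex`, so it no longer matches `comp`.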


##### Task 5: Normalize ConvertedCompYearly using Min-Max Scaling.
Min-Max Scaling brings all values in a column to a 0-1 range, making it useful for comparing data across different scales. Here, you will apply Min-Max normalization to the ConvertedCompYearly column, creating a new column ConvertedCompYearly_MinMax with normalized values.


In [13]:
# Install scikit-learn (only if not already installed)
!pip install scikit-learn

# Importing the necessary library
from sklearn.preprocessing import MinMaxScaler

# Initialize the MinMaxScaler
scaler = MinMaxScaler()

# Check if 'ConvertedCompYearly' exists and is numeric
if 'ConvertedCompYearly' in df.columns and pd.api.types.is_numeric_dtype(df['ConvertedCompYearly']):
    # Scale into a new column so the original values stay available for later tasks.
    # MinMaxScaler ignores NaN when fitting and leaves NaN in place when transforming.
    df['ConvertedCompYearly_MinMax'] = scaler.fit_transform(df[['ConvertedCompYearly']]).ravel()
    
    # Display the scaled values
    print("\nMin-Max scaled 'ConvertedCompYearly' values:")
    print(df['ConvertedCompYearly_MinMax'].head())
else:
    print("The 'ConvertedCompYearly' column does not exist or is not numeric.")


Min-Max scaled 'ConvertedCompYearly' values:
0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
Name: ConvertedCompYearly_MinMax, dtype: float64
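Min-Max scaling computes `(x - min) / (max - min)` for each value. The worked example below applies that formula by hand to a few made-up salaries, so you can check the arithmetic. It also shows why the first rows above print as `NaN`: a missing input stays missing after scaling.

```python
import pandas as pd

# Hypothetical salaries, including one missing value
comp = pd.Series([20000.0, 50000.0, None, 120000.0])

# min() and max() skip NaN by default
x_min, x_max = comp.min(), comp.max()

# (x - min) / (max - min) maps the smallest value to 0 and the largest to 1
scaled = (comp - x_min) / (x_max - x_min)

print(scaled)
```

For 50000 the result is (50000 - 20000) / (120000 - 20000) = 0.3.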


##### Task 6: Apply Z-score Normalization to `ConvertedCompYearly`.

Z-score normalization standardizes values by subtracting the mean and dividing by the standard deviation, so the result has a mean of 0 and a standard deviation of 1. It does not change the shape of the distribution, and it is most informative when the data are roughly Gaussian (normal). Here, you’ll calculate Z-scores for the ConvertedCompYearly column, saving the results in a new column ConvertedCompYearly_Zscore.


In [14]:
## Write your code here
# Check if 'ConvertedCompYearly' exists and is numeric
if 'ConvertedCompYearly' in df.columns and pd.api.types.is_numeric_dtype(df['ConvertedCompYearly']):
    # Calculate the mean and standard deviation (both skip NaN by default)
    mean = df['ConvertedCompYearly'].mean()
    std_dev = df['ConvertedCompYearly'].std()
    
    # Apply Z-score normalization, storing the result in a new column
    df['ConvertedCompYearly_Zscore'] = (df['ConvertedCompYearly'] - mean) / std_dev
    
    # Display the normalized values
    print("\nZ-score normalized 'ConvertedCompYearly' values:")
    print(df['ConvertedCompYearly_Zscore'].head())
else:
    print("The 'ConvertedCompYearly' column does not exist or is not numeric.")



Z-score normalized 'ConvertedCompYearly' values:
0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
Name: ConvertedCompYearly_Zscore, dtype: float64
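As a quick sanity check on the formula `(x - mean) / std`, the sketch below standardizes a few made-up salaries and confirms that the result has mean 0 and standard deviation 1. One detail worth knowing: pandas `std()` uses the sample standard deviation (`ddof=1`) by default, while scikit-learn's `StandardScaler` divides by the population standard deviation (`ddof=0`), so the two can differ slightly.

```python
import pandas as pd

# Hypothetical salaries, including one missing value
comp = pd.Series([20000.0, 50000.0, None, 120000.0])

# Sample statistics; NaN is skipped by default
mean = comp.mean()
std_sample = comp.std()  # ddof=1

z = (comp - mean) / std_sample
print(z)
print(f"mean of z: {z.mean():.6f}, std of z: {z.std():.6f}")

# Population variant, matching the convention used by StandardScaler
z_pop = (comp - mean) / comp.std(ddof=0)
```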


### Section 4: Visualization of Normalized Data
##### Task 7: Visualize the distribution of `ConvertedCompYearly`, `ConvertedCompYearly_MinMax`, and `ConvertedCompYearly_Zscore`

Visualization helps you understand how normalization changes the data distribution. In this task, create histograms for the original ConvertedCompYearly, as well as its normalized versions (ConvertedCompYearly_MinMax and ConvertedCompYearly_Zscore). This will help you compare how each normalization technique affects the data range and distribution.


In [18]:


# Check if 'ConvertedCompYearly' exists and is numeric
if 'ConvertedCompYearly' in df.columns and pd.api.types.is_numeric_dtype(df['ConvertedCompYearly']):
    # Min-Max scaling (recomputed here so this cell runs on its own)
    min_value = df['ConvertedCompYearly'].min()
    max_value = df['ConvertedCompYearly'].max()
    df['ConvertedCompYearly_MinMax'] = (df['ConvertedCompYearly'] - min_value) / (max_value - min_value)
    
    # Z-score normalization (standardization)
    mean = df['ConvertedCompYearly'].mean()
    std_dev = df['ConvertedCompYearly'].std()
    df['ConvertedCompYearly_Zscore'] = (df['ConvertedCompYearly'] - mean) / std_dev
    
    # The three columns are on very different scales, so give each its own panel
    columns = [
        ('ConvertedCompYearly', 'Original', 'skyblue'),
        ('ConvertedCompYearly_MinMax', 'Min-Max', 'orange'),
        ('ConvertedCompYearly_Zscore', 'Z-score', 'green'),
    ]
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    
    for ax, (col, label, color) in zip(axes, columns):
        # Drop NaN explicitly so only observed values are binned
        ax.hist(df[col].dropna(), bins=30, color=color)
        ax.set_title(label)
        ax.set_xlabel(col)
        ax.set_ylabel('Frequency')
    
    # Add an overall title and show the plot
    fig.suptitle('Distribution of ConvertedCompYearly Before and After Normalization')
    fig.tight_layout()
    plt.show()
else:
    print("The 'ConvertedCompYearly' column does not exist or is not numeric.")


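When you compare the histograms, keep in mind that both Min-Max scaling and Z-score normalization are linear transformations: they shift and rescale the axis but leave the shape of the distribution unchanged. The sketch below checks this on made-up, right-skewed salaries by comparing the skewness of each version. The data are invented for illustration.

```python
import pandas as pd

# Hypothetical right-skewed salaries (one very large value)
comp = pd.Series([30000.0, 45000.0, 60000.0, 80000.0, 500000.0])

minmax = (comp - comp.min()) / (comp.max() - comp.min())
zscore = (comp - comp.mean()) / comp.std()

# Skewness measures asymmetry and is unaffected by shifting and positive rescaling
skews = {
    "original": comp.skew(),
    "minmax": minmax.skew(),
    "zscore": zscore.skew(),
}

for name, value in skews.items():
    print(f"{name}: skew = {value:.4f}")
```

All three values agree up to floating-point error, so a heavy right tail in the original data remains after either normalization.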

### Summary


In this lab, you practiced essential normalization techniques, including:

- Identifying and handling duplicate rows.

- Checking for and imputing missing values.

- Applying Min-Max scaling and Z-score normalization to compensation data.

- Visualizing the impact of normalization on data distribution.


Copyright © IBM Corporation. All rights reserved.
