# Term Project: Is AI taking our jobs or transforming them?

Lana Geissinger
Bellevue University
DSC540_T303 Data Preparation (2257-1)
Professor Catherine Williams
Milestone 3
July 13, 2025


## <u>Cleaning/Formatting Website Data</u>

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv
import os
import numpy as np


In [2]:
# Load environment variables
load_dotenv('../env_var.env')
declining_path = os.getenv('declining_path')
growing_path = os.getenv('growing_path')

# Preview data
if declining_path and growing_path:
    try:
        # Verify files exist
        if os.path.exists(declining_path) and os.path.exists(growing_path):
            # Read HTML tables into dataframes
            df_declining = pd.read_html(declining_path)[0]
            df_growing = pd.read_html(growing_path)[0]

            print("DataFrame for Declining Occupations:")
            print(df_declining.head(5))
            print(df_declining.info())
            print("\nDataFrame for Growing Occupations:")
            print(df_growing.head(5))
            print(df_growing.info())
        else:
            print("One or both HTML files do not exist at the specified paths")

    except Exception as e:
        print(f"An unexpected error occurred: {e}")
else:
    print("Error: One or both environment variables for file paths are not set or invalid.")


DataFrame for Declining Occupations:
               2023 National Employment Matrix title  \
0                             Total, all occupations   
1                        Word processors and typists   
2                               Roof bolters, mining   
3                                Telephone operators   
4  Switchboard operators, including answering ser...   

  2023 National Employment Matrix code Employment, 2023 Employment, 2033  \
0                              00-0000         167849.8         174589.0   
1                              43-9022             39.9             24.8   
2                              47-5043              2.0              1.4   
3                              43-2021              4.7              3.5   
4                              43-2011             44.9             33.6   

  Employment change, numeric, 2023-33 Employment change, percent, 2023-33  \
0                              6739.2                                 4.0   
1              

In [3]:
# Step 1: Clean column names
## Create function to remove spaces and convert to lower case for easier access
## Call function for both dataframes
## Verify the result

def clean_column_names(df):
    return df.rename(columns=lambda x: x.replace(' ', '_').lower())


df_declining = clean_column_names(df_declining)
df_growing = clean_column_names(df_growing)

print("DataFrame for Declining Occupations:")
print(df_declining.head(5))

print("DataFrame for Growing Occupations:")
print(df_growing.head(5))


DataFrame for Declining Occupations:
               2023_national_employment_matrix_title  \
0                             Total, all occupations   
1                        Word processors and typists   
2                               Roof bolters, mining   
3                                Telephone operators   
4  Switchboard operators, including answering ser...   

  2023_national_employment_matrix_code employment,_2023 employment,_2033  \
0                              00-0000         167849.8         174589.0   
1                              43-9022             39.9             24.8   
2                              47-5043              2.0              1.4   
3                              43-2021              4.7              3.5   
4                              43-2011             44.9             33.6   

  employment_change,_numeric,_2023-33 employment_change,_percent,_2023-33  \
0                              6739.2                                 4.0   
1              

In [4]:
# Step 2: Remove footnotes and convert numeric data types, removing any commas
## Remove any rows containing 'Footnote' in the title column
## Convert numeric columns to correct data types after removing commas
## Process both dataframes and verify the converted data types

def remove_footnotes_and_convert_types(df):
    title_col = '2023_national_employment_matrix_title'
    code_col = '2023_national_employment_matrix_code'

    # Drop footnote rows; .copy() avoids SettingWithCopyWarning on later assignments
    mask = ~df[title_col].str.contains('Footnote', na=False)
    df = df[mask].copy()

    for col in df.columns:
        # Leave the title and matrix code columns as text
        if col in (title_col, code_col):
            continue

        # Strip footnote markers such as '[1]' and thousands separators
        temp = (df[col].astype(str)
                .str.split('[').str[0]
                .str.replace(',', '', regex=False)
                .str.strip())

        # Convert to nullable floats; unparseable values become <NA>
        df[col] = pd.to_numeric(temp, errors='coerce').astype('Float64')

    return df




df_declining = remove_footnotes_and_convert_types(df_declining)
df_growing = remove_footnotes_and_convert_types(df_growing)

print("Verify data types for Declining Occupations:")
print(df_declining.dtypes)

print("Verify data types for Growing Occupations:")
print(df_growing.dtypes)


Verify data types for Declining Occupations:
2023_national_employment_matrix_title     object
2023_national_employment_matrix_code      object
employment,_2023                         Float64
employment,_2033                         Float64
employment_change,_numeric,_2023-33      Float64
employment_change,_percent,_2023-33      Float64
median_annual_wage,_dollars,_2024[1]     Float64
dtype: object
Verify data types for Growing Occupations:
2023_national_employment_matrix_title     object
2023_national_employment_matrix_code      object
employment,_2023                         Float64
employment,_2033                         Float64
employment_change,_numeric,_2023-33      Float64
employment_change,_percent,_2023-33      Float64
median_annual_wage,_dollars,_2024[1]     Float64
dtype: object
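
The cleaning rule in Step 2 can be illustrated on a few hand-made values. This is a minimal sketch, not the notebook's data: the sample strings and the helper name `to_nullable_float` are assumptions made up for illustration.

```python
import pandas as pd

def to_nullable_float(series):
    """Strip footnote markers like '[1]' and thousands separators, then convert."""
    cleaned = (
        series.astype(str)
        .str.split('[').str[0]               # keep only the text before the first '['
        .str.replace(',', '', regex=False)   # remove thousands separators
        .str.strip()
    )
    # Values that cannot be parsed become <NA> instead of raising an error
    return pd.to_numeric(cleaned, errors='coerce').astype('Float64')

# Hypothetical raw cell values, chosen to exercise each cleaning rule
sample = pd.Series(['47,850', '76640[1]', 'n/a', ' 39.9 '])
result = to_nullable_float(sample)
print(result)
```

Using `errors='coerce'` keeps the pipeline running on unexpected text, at the cost of silently producing missing values, so it is worth checking the non-null counts afterwards.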


In [5]:
# Step 3: Merge declining and growing datasets and create combined dataset

def create_combined_dataset(df_declining, df_growing):
    df_declining['growth_status'] = 'Declining'
    df_growing['growth_status'] = 'Growing'
    return pd.concat([df_declining, df_growing], ignore_index=True)

df_combined = create_combined_dataset(df_declining, df_growing)

print("Combined dataset shape:", df_combined.shape)
print("\nGrowth status distribution:")
print(df_combined['growth_status'].value_counts())


Combined dataset shape: (62, 8)

Growth status distribution:
growth_status
Declining    31
Growing      31
Name: count, dtype: int64
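
Step 3 adds the `growth_status` column directly to the two input data frames. A variant that leaves the inputs untouched could look like the sketch below; the toy frames and the function name `label_and_stack` are hypothetical and only stand in for the cleaned BLS tables.

```python
import pandas as pd

# Hypothetical stand-ins for the cleaned BLS tables (placeholder values)
declining = pd.DataFrame({'title': ['Occupation A', 'Occupation B'],
                          'employment_2023': [10.0, 20.0]})
growing = pd.DataFrame({'title': ['Occupation C'],
                        'employment_2023': [30.0]})

def label_and_stack(declining, growing):
    # assign() returns new frames, so the caller's inputs are not modified
    return pd.concat(
        [declining.assign(growth_status='Declining'),
         growing.assign(growth_status='Growing')],
        ignore_index=True,
    )

combined = label_and_stack(declining, growing)
print(combined)
print(combined['growth_status'].value_counts())
```

Because the inputs are not mutated, the cell can be re-run without side effects on `declining` or `growing`.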


In [6]:
# Step 4: Calculate derived metrics (annual change rate), clean occupation titles, and apply both to the combined dataset


def add_derived_metrics(df):
    # Find the employment change columns; the numeric change column comes first
    employment_change_columns = [col for col in df.columns if 'employment_change' in col.lower()]

    if employment_change_columns:
        employment_change_col = employment_change_columns[0]
        # Average annual change: numeric change spread evenly over the 10-year projection (2023-33)
        df['annual_change_rate'] = df[employment_change_col] / 10
    else:
        df['annual_change_rate'] = 0
        print("Warning: No employment change column found")

    # Convert code column to string before extracting first 2 characters
    df['occupation_category'] = df['2023_national_employment_matrix_code'].astype(str).str[:2]

    df['2023_national_employment_matrix_title'] = (
        df['2023_national_employment_matrix_title']
        .str.strip()
        .str.title()
    )

    return df


df_combined = add_derived_metrics(df_combined)

print("Combined Dataset of Fastest Growing and Declining Occupations:")
print(df_combined[['2023_national_employment_matrix_title',
                   'occupation_category','2023_national_employment_matrix_code',
                   'annual_change_rate']].head())

Combined Dataset of Fastest Growing and Declining Occupations:
               2023_national_employment_matrix_title occupation_category  \
0                             Total, All Occupations                  00   
1                        Word Processors And Typists                  43   
2                               Roof Bolters, Mining                  47   
3                                Telephone Operators                  43   
4  Switchboard Operators, Including Answering Ser...                  43   

  2023_national_employment_matrix_code  annual_change_rate  
0                              00-0000              673.92  
1                              43-9022               -1.52  
2                              47-5043               -0.06  
3                              43-2021               -0.12  
4                              43-2011               -1.13  


In [7]:
# Step 5: Display final dataset information
print("Combined Dataset of Fastest Growing and Declining Occupations:")
print(df_combined.info())

print("\nSummary Statistics for Numeric Columns:")
print(df_combined.describe())

# Save the number of records in each category
occupation_counts = df_combined['occupation_category'].value_counts()
print("\nOccupation Categories Distribution:")
print(occupation_counts)


Combined Dataset of Fastest Growing and Declining Occupations:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62 entries, 0 to 61
Data columns (total 10 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   2023_national_employment_matrix_title  62 non-null     object 
 1   2023_national_employment_matrix_code   62 non-null     object 
 2   employment,_2023                       62 non-null     Float64
 3   employment,_2033                       62 non-null     Float64
 4   employment_change,_numeric,_2023-33    62 non-null     Float64
 5   employment_change,_percent,_2023-33    62 non-null     Float64
 6   median_annual_wage,_dollars,_2024[1]   62 non-null     Float64
 7   growth_status                          62 non-null     object 
 8   annual_change_rate                     62 non-null     Float64
 9   occupation_category                    62 non-null     object 
dtypes: Float64(6)

In [8]:

df_combined = df_combined.copy()
# Remove special characters, convert to lowercase, replace spaces with underscores 
df_combined.columns = (df_combined.columns.str.replace(r'[^a-zA-Z0-9_\s]', '', regex=True)
                       .str.lower()
                       .str.replace(' ', '_'))


In [9]:
print(df_combined.columns)


Index(['2023_national_employment_matrix_title',
       '2023_national_employment_matrix_code', 'employment_2023',
       'employment_2033', 'employment_change_numeric_202333',
       'employment_change_percent_202333', 'median_annual_wage_dollars_20241',
       'growth_status', 'annual_change_rate', 'occupation_category'],
      dtype='object')
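
The regular expression used above deletes hyphens and brackets outright, which is why `2023-33` collapses to `202333` and `2024[1]` to `20241`. If readable year ranges matter, one alternative is to map every run of non-alphanumeric characters to a single underscore. The sketch below is an assumption, not what the notebook does, and the helper name `snake_case` is hypothetical; adopting it would change the column names used in later cells.

```python
import re

def snake_case(name):
    """Lowercase a label and collapse each run of non-alphanumerics into one underscore."""
    return re.sub(r'[^a-z0-9]+', '_', name.lower()).strip('_')

# Two column labels as they appear after Step 1 in this notebook
raw = ['employment_change,_numeric,_2023-33',
       'median_annual_wage,_dollars,_2024[1]']
cleaned = [snake_case(c) for c in raw]
print(cleaned)
# ['employment_change_numeric_2023_33', 'median_annual_wage_dollars_2024_1']
```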


In [10]:
df_combined = df_combined.rename(columns={'median_annual_wage_dollars_20241': 'median_annual_wage_dollars_2024'})


In [11]:
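# Remove the "Total, all occupations" summary row (code 00-0000).
# Filtering by code also catches a second summary row if the growing table carries one.
df_combined = df_combined[df_combined['2023_national_employment_matrix_code'] != '00-0000']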


In [12]:
# Step 6: Save the cleaned file to output folder for loading into SQL DB in Milestone 5

# Output file path (create the folder if it does not exist yet)
output_dir = os.path.join('..', 'output')
os.makedirs(output_dir, exist_ok=True)
output_file = os.path.join(output_dir, 'Growing_Declining.csv')

# Save as CSV
df_combined.to_csv(output_file, index=False)

# Verify the file was created
if os.path.exists(output_file):
    print(f"File successfully saved to: {output_file}")

else:
    print("Error: File was not created")

File successfully saved to: ..\output\Growing_Declining.csv


In [13]:
# Preview the output file
output_file = os.path.join('..', 'output', 'Growing_Declining.csv')
try:
    df_preview = pd.read_csv(output_file)
    print("\nGrowing and Declining Occupations:")
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', None)
    pd.set_option('display.max_colwidth', None)
    print(df_preview.head().to_string(index=False))
except FileNotFoundError:
    print(f"Error: File not found at {output_file}")
except Exception as e:
    print(f"An error occurred while reading the file: {e}")


Growing and Declining Occupations:
             2023_national_employment_matrix_title 2023_national_employment_matrix_code  employment_2023  employment_2033  employment_change_numeric_202333  employment_change_percent_202333  median_annual_wage_dollars_2024 growth_status  annual_change_rate  occupation_category
                       Word Processors And Typists                              43-9022             39.9             24.8                             -15.2                             -38.0                          47850.0     Declining               -1.52                   43
                              Roof Bolters, Mining                              47-5043              2.0              1.4                              -0.6                             -32.0                          76640.0     Declining               -0.06                   47
                               Telephone Operators                              43-2021              4.7              3.5         

### Ethical Implications of Data Wrangling U.S. Bureau of Labor Statistics (BLS) Website Data

While working with BLS HTML Tables "Fastest Growing Occupations" and "Fastest Declining Occupations", I performed the following cleaning and formatting steps:
<br>
#### **BLS Fastest Growing Occupations Table & BLS Fastest Declining Occupations Table: Data Cleaning and Formatting Steps**
- Read the HTML tables into separate data frames for declining and growing occupations.<br>
- Cleaned column names: replaced spaces with underscores and converted to lowercase for easier access.<br>
- Removed footnote rows and footnote markers.<br>
- Converted numeric columns to the correct data types using pd.to_numeric() and the pandas nullable Float64 type.<br>
- Merged the declining and growing datasets into one combined dataset.<br>
- Added a 'growth_status' column to identify the source table of each row.<br>
- Calculated a derived metric (the annual change rate), cleaned occupational titles, and standardized formatting in the combined dataset.<br>
- Verified the final combined dataset: data types, row counts, summary statistics, and occupational category distribution.<br>
- Left the '2023_national_employment_matrix_code' column unchanged as text; it will be merged with SOC data in Milestone 5.<br>
- Saved the cleaned file to the output folder for loading into the SQL DB in Milestone 5.<br>

#### **Ethical Implications:**
Like the SOC and NAICS datasets, these tables come from the BLS website, a public and trusted government source, so they are ethically safe to use for my research. During the wrangling process, there was a small risk that I made incorrect assumptions while merging the data. To mitigate this, the added 'growth_status' column records which original dataset each row came from. In addition, the "Verification of Final dataset" step was included to ensure data quality: data types were verified after conversion, and the calculated metrics were validated.
All changes to the original data were documented for future reference to avoid misinterpretation and to remain accountable.
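
One way to make the validation of the calculated metrics repeatable is a small consistency check. The sketch below is an assumption about how such a check might look, not the procedure used in this notebook; the function name `check_derived_metrics` and the tolerance are hypothetical. The first two rows reuse values shown in the preview above, and the third row is deliberately inconsistent.

```python
import pandas as pd

def check_derived_metrics(df, tol=0.15):
    """Return rows where derived columns disagree with their source columns."""
    # Published values are rounded to one decimal, so allow a small tolerance
    expected_change = df['employment_2033'] - df['employment_2023']
    change_ok = (expected_change - df['employment_change_numeric_202333']).abs() <= tol
    # annual_change_rate should equal the numeric change divided by 10 years
    expected_rate = df['employment_change_numeric_202333'] / 10
    rate_ok = (df['annual_change_rate'] - expected_rate).abs() <= 1e-9
    return df[~(change_ok & rate_ok)]

df = pd.DataFrame({
    'employment_2023': [39.9, 2.0, 10.0],
    'employment_2033': [24.8, 1.4, 12.0],
    'employment_change_numeric_202333': [-15.2, -0.6, 2.0],
    'annual_change_rate': [-1.52, -0.06, 0.5],  # last value is a hypothetical error
})

problems = check_derived_metrics(df)
print(problems)
```

An empty result means every row passed. Note that the toy frame uses plain floats; with nullable Float64 columns containing missing values, the comparisons would need explicit handling of `<NA>`.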

