# Term Project: Is AI taking our jobs or transforming them?

Lana Geissinger
Bellevue University
DSC540_T303 Data Preparation (2257-1)
Professor Catherine Williams
Milestone 2
June 29, 2025


## <u>Cleaning/Formatting Flat File Source</u>

In [1]:
import os
import pandas as pd
from dotenv import load_dotenv




Load and preview data files with SOC and NAICS codes

In [2]:
# Load environment variables
load_dotenv('../env_var.env')
NAICS_codes_path = os.getenv('NAICS_codes_path')
SOC_codes_path = os.getenv('SOC_codes_path')

# Preview data
if NAICS_codes_path and SOC_codes_path:
    try:
        df_NAICS = pd.read_csv(NAICS_codes_path, encoding='Windows-1252')
        df_SOC = pd.read_csv(SOC_codes_path, encoding='Windows-1252')
        print("DataFrame for NAICS Data:")
        print(df_NAICS.head(20))
        print(df_NAICS.info())
        print("DataFrame for SOC Data:")
        print(df_SOC.head(20))
        print(df_SOC.info())
    except FileNotFoundError as e:
        print(f"Error: {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
else:
    print("Error: One or both environment variables for file paths are not set or invalid.")


DataFrame for NAICS Data:
   Sector                                               Name  \
0     NaN                                                NaN   
1      11         Agriculture, Forestry, Fishing and Hunting   
2      21      Mining, Quarrying, and Oil and Gas Extraction   
3      22                                          Utilities   
4      23                                       Construction   
5   31-33                                      Manufacturing   
6      42                                    Wholesale Trade   
7   44-45                                       Retail Trade   
8   48-49                     Transportation and Warehousing   
9      51                                        Information   
10     52                              Finance and Insurance   
11     53                 Real Estate and Rental and Leasing   
12     54   Professional, Scientific, and Technical Services   
13     55            Management of Companies and Enterprises   
14     56  Adm

### Cleaning and Formatting SOC Data

In [3]:
# Step 1: Remove first 7 rows with metadata and whitespace
print(df_SOC.iloc[:7])
df_SOC = df_SOC.iloc[7:].copy()
df_SOC = df_SOC.apply(lambda x: x.str.strip() if x.dtype == "object" else x)



                     U.S. Bureau of Labor Statistics   Unnamed: 1  \
0  On behalf of the Office of Management and Budg...          NaN   
1                                                NaN          NaN   
2    November 2017 (for reference year January 2018)          NaN   
3  ***This is the final structure for the 2018 SO...          NaN   
4                                                NaN          NaN   
5                                                NaN          NaN   
6                                        Major Group  Minor Group   

    Unnamed: 2           Unnamed: 3 Unnamed: 4  
0          NaN                  NaN        NaN  
1          NaN                  NaN        NaN  
2          NaN                  NaN        NaN  
3          NaN                  NaN        NaN  
4          NaN                  NaN        NaN  
5          NaN                  NaN        NaN  
6  Broad Group  Detailed Occupation        NaN  


In [4]:
# Step 2: Rename columns
df_SOC = df_SOC.rename(columns={
    'U.S. Bureau of Labor Statistics': 'major_group',
    'Unnamed: 1': 'minor_group',
    'Unnamed: 2': 'broad_group',
    'Unnamed: 3': 'detailed_occupation',
    'Unnamed: 4': 'occupation_title'
})

# Display the SOC Structure after renaming
print("\nSOC Structure:")
print(df_SOC.head())



SOC Structure:
   major_group minor_group broad_group detailed_occupation  \
7      11-0000         NaN         NaN                 NaN   
8          NaN     11-1000         NaN                 NaN   
9          NaN         NaN     11-1010                 NaN   
10         NaN         NaN         NaN             11-1011   
11         NaN         NaN     11-1020                 NaN   

                   occupation_title  
7            Management Occupations  
8                    Top Executives  
9                  Chief Executives  
10                 Chief Executives  
11  General and Operations Managers  


In [5]:
# Step 4: Forward fill the hierarchy levels
df_SOC['major_group'] = df_SOC['major_group'].ffill()
df_SOC['minor_group'] = df_SOC['minor_group'].ffill()
df_SOC['broad_group'] = df_SOC['broad_group'].ffill()
df_SOC['detailed_occupation'] = df_SOC['detailed_occupation'].ffill()
df_SOC['occupation_title'] = df_SOC['occupation_title'].ffill()

# Display the SOC Structure after filling hierarchy levels
print("\nSOC Structure:")
print(df_SOC.head())


SOC Structure:
   major_group minor_group broad_group detailed_occupation  \
7      11-0000         NaN         NaN                 NaN   
8      11-0000     11-1000         NaN                 NaN   
9      11-0000     11-1000     11-1010                 NaN   
10     11-0000     11-1000     11-1010             11-1011   
11     11-0000     11-1000     11-1020             11-1011   

                   occupation_title  
7            Management Occupations  
8                    Top Executives  
9                  Chief Executives  
10                 Chief Executives  
11  General and Operations Managers  
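
Forward filling every column also carries the previous detailed code onto the next broad-group row (row 11 above pairs broad group 11-1020 with detailed occupation 11-1011). A minimal alternative sketch, assuming only the three parent levels should be filled and only rows with their own detailed code should be kept, is shown below. The `sample` frame is a hand-built illustration that mirrors the first SOC rows; it is not the notebook's method.

```python
import pandas as pd

# Hand-built sample mirroring the first rows of the SOC structure (illustrative only)
sample = pd.DataFrame({
    "major_group": ["11-0000", None, None, None, None, None],
    "minor_group": [None, "11-1000", None, None, None, None],
    "broad_group": [None, None, "11-1010", None, "11-1020", None],
    "detailed_occupation": [None, None, None, "11-1011", None, "11-1021"],
    "occupation_title": [
        "Management Occupations", "Top Executives", "Chief Executives",
        "Chief Executives", "General and Operations Managers",
        "General and Operations Managers",
    ],
})

# Fill only the parent levels; detailed codes and titles stay as recorded
parent_levels = ["major_group", "minor_group", "broad_group"]
filled = sample.copy()
filled[parent_levels] = filled[parent_levels].ffill()

# Keep only rows that carry their own detailed occupation code
detailed = filled.dropna(subset=["detailed_occupation"]).reset_index(drop=True)
print(detailed)
```

With this variant each remaining row pairs a detailed code with its own title and its true parent groups.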


In [6]:
# Check for null values
print("\nNull values:")
print(df_SOC.isnull().sum())


Null values:
major_group            0
minor_group            1
broad_group            2
detailed_occupation    3
occupation_title       0
dtype: int64


In [7]:
# Step 5.1: Remove rows where occupation_title is missing
df_SOC = df_SOC.dropna(subset=['occupation_title'])

# Reset index after removing rows
df_SOC = df_SOC.reset_index(drop=True)

# Display the result
print("SOC DF Shape:", df_SOC.shape)
print(df_SOC.head())


SOC DF Shape: (1447, 5)
  major_group minor_group broad_group detailed_occupation  \
0     11-0000         NaN         NaN                 NaN   
1     11-0000     11-1000         NaN                 NaN   
2     11-0000     11-1000     11-1010                 NaN   
3     11-0000     11-1000     11-1010             11-1011   
4     11-0000     11-1000     11-1020             11-1011   

                  occupation_title  
0           Management Occupations  
1                   Top Executives  
2                 Chief Executives  
3                 Chief Executives  
4  General and Operations Managers  


In [8]:
# Step 5.2: Remove rows where detailed_occupation is missing
df_SOC = df_SOC.dropna(subset=['detailed_occupation'])

# Reset index after removing rows
df_SOC = df_SOC.reset_index(drop=True)

# Display the result
print("SOC DF Shape:", df_SOC.shape)
print(df_SOC.head(10))


SOC DF Shape: (1444, 5)
  major_group minor_group broad_group detailed_occupation  \
0     11-0000     11-1000     11-1010             11-1011   
1     11-0000     11-1000     11-1020             11-1011   
2     11-0000     11-1000     11-1020             11-1021   
3     11-0000     11-1000     11-1030             11-1021   
4     11-0000     11-1000     11-1030             11-1031   
5     11-0000      Nov-00     11-1030             11-1031   
6     11-0000      Nov-00     11-2010             11-1031   
7     11-0000      Nov-00     11-2010              Nov-11   
8     11-0000      Nov-00     11-2020              Nov-11   
9     11-0000      Nov-00     11-2020              Nov-21   

                                    occupation_title  
0                                   Chief Executives  
1                    General and Operations Managers  
2                    General and Operations Managers  
3                                        Legislators  
4                           

In [9]:
# Step 6: Make sure all occupation codes look like XX-XXXX (not changed to dates like 'Nov-00')

# Create function to convert to standard format
def standardize_soc_code(code, major_group):

    if pd.isna(code):
        return code

    code = str(code).strip()

    # If it's in "Nov-XX" format, convert to the standard format XX-XXXX, where
    # Major Group: XX-0000 (first 2 digits significant, rest zeros)
    # Minor Group: XX-X000 (first 3 digits significant, rest zeros)
    # Broad Group: XX-XXX0 (first 5 digits significant, last digit zero)
    # Detailed Occupation: XX-XXXX (all digits significant)

    if 'Nov' in code or '-' not in code:
        prefix = str(major_group)[:2]
        numbers = ''.join(filter(str.isdigit, code))
        numbers = numbers.zfill(4)
        return f"{prefix}-{numbers}"

    parts = code.split('-')
    if len(parts) == 2:
        prefix = str(major_group)[:2]
        numbers = parts[1].zfill(4)
        return f"{prefix}-{numbers}"

    return code

# Apply the standardization to each column
df_SOC['minor_group'] = df_SOC.apply(
    lambda row: standardize_soc_code(row['minor_group'], row['major_group']), axis=1)

df_SOC['broad_group'] = df_SOC.apply(
    lambda row: standardize_soc_code(row['broad_group'], row['major_group']), axis=1)

df_SOC['detailed_occupation'] = df_SOC.apply(
    lambda row: standardize_soc_code(row['detailed_occupation'], row['major_group']), axis=1)

# Show result
print(df_SOC[['major_group', 'minor_group', 'broad_group', 'detailed_occupation']].head(20))



   major_group minor_group broad_group detailed_occupation
0      11-0000     11-1000     11-1010             11-1011
1      11-0000     11-1000     11-1020             11-1011
2      11-0000     11-1000     11-1020             11-1021
3      11-0000     11-1000     11-1030             11-1021
4      11-0000     11-1000     11-1030             11-1031
5      11-0000     11-0000     11-1030             11-1031
6      11-0000     11-0000     11-2010             11-1031
7      11-0000     11-0000     11-2010             11-0011
8      11-0000     11-0000     11-2020             11-0011
9      11-0000     11-0000     11-2020             11-0021
10     11-0000     11-0000     11-2020             11-0022
11     11-0000     11-0000     11-2030             11-0022
12     11-0000     11-0000     11-2030             11-0032
13     11-0000     11-0000     11-2030             11-0033
14     11-0000     11-0000     11-2030             11-0033
15     11-0000     11-0000     11-3010             11-00
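
Values such as 'Nov-00' and 'Nov-11' look like codes that a spreadsheet turned into dates, in which case zero-padding the remaining digits drops part of the original code. The sketch below shows one possible way to recover them, under two assumptions that should be checked against the official SOC list: the two digits after the month abbreviation are the last two digits of the code, and the `broad_group` value on the same row is intact. The helper name `recover_soc_code` is hypothetical.

```python
import re
import pandas as pd

# Assumption: codes such as "11-2011" were auto-converted to "Nov-11" by a spreadsheet,
# while the broad_group value on the same row stayed intact.
MANGLED_CODE = re.compile(r"^[A-Za-z]{3}-(\d{2})$")

def recover_soc_code(code, broad_group):
    """Rebuild a date-mangled code from the first five characters of its broad group."""
    if pd.isna(code) or pd.isna(broad_group):
        return code
    match = MANGLED_CODE.match(str(code).strip())
    if match is None:
        return code  # already in XX-XXXX form (or unrecognized): leave untouched
    return f"{str(broad_group)[:5]}{match.group(1)}"

# Sample rows copied from the preview printed before standardization
sample = pd.DataFrame({
    "minor_group": ["11-1000", "Nov-00", "Nov-00"],
    "broad_group": ["11-1030", "11-2010", "11-2020"],
    "detailed_occupation": ["11-1031", "Nov-11", "Nov-21"],
})
for col in ["minor_group", "detailed_occupation"]:
    sample[col] = [
        recover_soc_code(code, broad)
        for code, broad in zip(sample[col], sample["broad_group"])
    ]
print(sample)
```

Using the broad group as the anchor keeps the recovered code consistent with the hierarchy instead of guessing the missing digits.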

In [10]:
# Step 7: Save the cleaned file to output folder for loading into SQL DB in Milestone 5

# Define the output file path (create the folder if it does not exist yet)
output_dir = os.path.join('..', 'output')
os.makedirs(output_dir, exist_ok=True)
output_file = os.path.join(output_dir, 'SOC_DB.csv')

# Save as CSV
df_SOC.to_csv(output_file, index=False)

# Verify the file was created
if os.path.exists(output_file):
    print(f"File successfully saved to: {output_file}")

else:
    print("Error: File was not created")



File successfully saved to: ..\output\SOC_DB.csv


In [11]:
# Preview the output file
output_file = os.path.join('..', 'output', 'SOC_DB.csv')

try:

    df_preview = pd.read_csv(output_file)
    print("\nSOC Structure Final:")
    print(df_preview.head(15))

except FileNotFoundError:
    print(f"Error: File not found at {output_file}")
except Exception as e:
    print(f"An error occurred while reading the file: {e}")



SOC Structure Final:
   major_group minor_group broad_group detailed_occupation  \
0      11-0000     11-1000     11-1010             11-1011   
1      11-0000     11-1000     11-1020             11-1011   
2      11-0000     11-1000     11-1020             11-1021   
3      11-0000     11-1000     11-1030             11-1021   
4      11-0000     11-1000     11-1030             11-1031   
5      11-0000     11-0000     11-1030             11-1031   
6      11-0000     11-0000     11-2010             11-1031   
7      11-0000     11-0000     11-2010             11-0011   
8      11-0000     11-0000     11-2020             11-0011   
9      11-0000     11-0000     11-2020             11-0021   
10     11-0000     11-0000     11-2020             11-0022   
11     11-0000     11-0000     11-2030             11-0022   
12     11-0000     11-0000     11-2030             11-0032   
13     11-0000     11-0000     11-2030             11-0033   
14     11-0000     11-0000     11-2030          
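
A quick format check can confirm that every code in the cleaned file follows the XX-XXXX pattern. The sketch below is illustrative: the helper name `invalid_soc_codes` and the sample values are assumptions, and in the notebook the function would be applied to `df_preview` instead.

```python
import pandas as pd

# Two digits, a hyphen, then four digits
SOC_PATTERN = r"^\d{2}-\d{4}$"

def invalid_soc_codes(df, columns):
    """Return a dict mapping each column to the values that fail the XX-XXXX pattern."""
    problems = {}
    for col in columns:
        values = df[col].dropna().astype(str)
        bad = values[~values.str.match(SOC_PATTERN)]
        if not bad.empty:
            problems[col] = sorted(bad.unique())
    return problems

# Illustrative sample containing two date-mangled values
sample = pd.DataFrame({
    "minor_group": ["11-1000", "Nov-00"],
    "detailed_occupation": ["11-1011", "Nov-11"],
})
report = invalid_soc_codes(sample, ["minor_group", "detailed_occupation"])
print(report)
```

An empty dictionary would indicate that all checked columns match the expected format.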

### Cleaning and Formatting NAICS Data

In [12]:
# Step 1.1: Remove whitespace
df_NAICS = df_NAICS.apply(lambda x: x.str.strip() if x.dtype == "object" else x)

In [13]:
# Step 1.2: Remove rows where the "Sector" column is empty
df_NAICS = df_NAICS.loc[~df_NAICS['Sector'].isna()].copy()


In [14]:
# Step 2: Rename the unnamed sub-columns under "6-digit Industries"
# (the other columns already carry descriptive names from the file header)
df_NAICS = df_NAICS.rename(columns={
    'Unnamed: 6': '6-digit Industries - Same as 5-digit',
    'Unnamed: 7': '6-digit Industries - Total'
})


# Display the NAICS Structure after renaming
print("\nNAICS structure by industry:")
print(df_NAICS.head(30))


NAICS structure by industry:
   Sector                                               Name  \
1      11         Agriculture, Forestry, Fishing and Hunting   
2      21      Mining, Quarrying, and Oil and Gas Extraction   
3      22                                          Utilities   
4      23                                       Construction   
5   31-33                                      Manufacturing   
6      42                                    Wholesale Trade   
7   44-45                                       Retail Trade   
8   48-49                     Transportation and Warehousing   
9      51                                        Information   
10     52                              Finance and Insurance   
11     53                 Real Estate and Rental and Leasing   
12     54   Professional, Scientific, and Technical Services   
13     55            Management of Companies and Enterprises   
14     56  Administrative and Support and Waste Managemen...   
15     61 

In [15]:
# Step 3: Create function to expand ranges in the dataframe
def expand_ranges(df):
    expanded_rows = []

    for _, row in df.iterrows():
        name = str(row['Sector'])
        if '-' in name:
            try:
                start, end = map(int, name.split('-'))

                for num in range(start, end + 1):
                    new_row = row.copy()
                    new_row['Sector'] = str(num)
                    expanded_rows.append(new_row)
            except ValueError:
                expanded_rows.append(row)
        else:
            expanded_rows.append(row)

    # Create new dataframe with expanded rows
    return pd.DataFrame(expanded_rows)

# Apply the expansion to the NAICS dataframe
df_NAICS = expand_ranges(df_NAICS)

# Reset index
df_NAICS = df_NAICS.reset_index(drop=True)


print("\nNAICS structure by industry:")
print(df_NAICS)



NAICS structure by industry:
   Sector                                               Name  \
0      11         Agriculture, Forestry, Fishing and Hunting   
1      21      Mining, Quarrying, and Oil and Gas Extraction   
2      22                                          Utilities   
3      23                                       Construction   
4      31                                      Manufacturing   
5      32                                      Manufacturing   
6      33                                      Manufacturing   
7      42                                    Wholesale Trade   
8      44                                       Retail Trade   
9      45                                       Retail Trade   
10     48                     Transportation and Warehousing   
11     49                     Transportation and Warehousing   
12     51                                        Information   
13     52                              Finance and Insurance   
14     53 
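
The same range expansion can also be sketched without an explicit row loop by turning each 'Sector' value into a list and calling `DataFrame.explode`. This is an alternative illustration, not the notebook's implementation; the function name `expand_sector_ranges` and the `demo` frame are hypothetical.

```python
import pandas as pd

def expand_sector_ranges(df, column="Sector"):
    """Expand values like '31-33' into one row per sector code."""
    def to_list(value):
        text = str(value)
        if "-" in text:
            try:
                start, end = map(int, text.split("-"))
            except ValueError:
                return [text]  # not a numeric range: keep the value unchanged
            return [str(num) for num in range(start, end + 1)]
        return [text]

    out = df.copy()
    out[column] = out[column].map(to_list)
    # explode() repeats the other columns once per list element
    return out.explode(column).reset_index(drop=True)

demo = pd.DataFrame({
    "Sector": ["23", "31-33", "42"],
    "Name": ["Construction", "Manufacturing", "Wholesale Trade"],
})
expanded = expand_sector_ranges(demo)
print(expanded)
```

Both versions produce one row per sector code; the `explode` variant simply delegates the row duplication to pandas.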

In [16]:
# Step 4: Save the cleaned file to output folder for loading into SQL DB in Milestone 5

# Output file path (create the folder if it does not exist yet)
output_dir = os.path.join('..', 'output')
os.makedirs(output_dir, exist_ok=True)
output_file = os.path.join(output_dir, 'NAICS_DB.csv')

# Save as CSV
df_NAICS.to_csv(output_file, index=False)

# Verify the file was created
if os.path.exists(output_file):
    print(f"File successfully saved to: {output_file}")

else:
    print("Error: File was not created")

File successfully saved to: ..\output\NAICS_DB.csv


In [17]:
# Preview the output file
output_file = os.path.join('..', 'output', 'NAICS_DB.csv')
try:
    df_preview = pd.read_csv(output_file)
    print("\nNAICS Structure:")
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', None)
    pd.set_option('display.max_colwidth', None)
    print(df_preview.head().to_string(index=False))
except FileNotFoundError:
    print(f"Error: File not found at {output_file}")
except Exception as e:
    print(f"An error occurred while reading the file: {e}")





NAICS Structure:
 Sector                                          Name  Subsectors (3-digit)  Industry Groups (4-digit)  NAICS Industries (5-digit)  6-digit Industries  6-digit Industries - Same as 5-digit  6-digit Industries - Total
     11    Agriculture, Forestry, Fishing and Hunting                   5.0                       19.0                        42.0                  32                                    32                          64
     21 Mining, Quarrying, and Oil and Gas Extraction                   3.0                        5.0                        11.0                  14                                     7                          21
     22                                     Utilities                   1.0                        3.0                         6.0                  10                                     4                          14
     23                                  Construction                   3.0                       10.0            

During Milestone 5, the analysis approach was modified to focus on skills rather than occupations because the required occupational data was no longer available through the BLS API. This change allowed for a more granular analysis of how AI affects specific job skills rather than broad occupational categories.

The following code was added to process the O*NET skills data:


In [18]:
import pandas as pd

# Read the O*NET skills file and normalize the column names to snake_case
skills_ONET_df = pd.read_csv('../data/Skills.csv', encoding='utf-8')
skills_ONET_df.columns = (
    skills_ONET_df.columns
    .str.lower()
    .str.replace(' ', '_', regex=False)
    .str.replace('[^a-z0-9_]', '', regex=True)
)
skills_ONET_df.to_csv('../output/skills_upd.csv', index=False)
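
To make the effect of the column-name normalization concrete, the sketch below applies the same three string operations to a few sample headers. The header names in `example` are illustrative assumptions and may not match the actual Skills.csv columns; `normalize_columns` is a hypothetical helper.

```python
import pandas as pd

def normalize_columns(columns):
    """Lowercase, replace spaces with underscores, and drop any other symbols."""
    return (
        pd.Index(columns)
        .str.lower()
        .str.replace(" ", "_", regex=False)
        .str.replace(r"[^a-z0-9_]", "", regex=True)
    )

# Illustrative header names (assumed, not read from the file)
example = ["O*NET-SOC Code", "Element Name", "Data Value"]
normalized = list(normalize_columns(example))
print(normalized)
```

Note that symbols such as `*` and `-` are removed rather than replaced, so the first header collapses into a single word followed by `_code`.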

### Ethical Implications Of Data Wrangling SOC and NAICS Codes Data

While working with the SOC and NAICS datasets, I performed the following cleaning and formatting steps.
<br>
#### **SOC (Standard Occupational Classification) Data Cleaning and Formatting Steps:**
- Removed first 7 rows containing metadata<br>
- Stripped whitespace from all string columns<br>
- Renamed columns<br>
- Forward filled hierarchy levels for all group columns<br>
- Removed rows with missing occupation titles and missing detailed occupations<br>
- Standardized occupation codes<br>
- Saved cleaned data as 'SOC_DB.csv' in the output folder for loading into SQL DB in Milestone 5.<br>
<br>
#### **NAICS (North American Industry Classification System) Data Cleaning and Formatting Steps:**
- Stripped whitespace from all string columns<br>
- Removed rows where the 'Sector' column was empty<br>
- Renamed columns to remove sub-columns under "6-digit Industries"<br>
- Created function to expand ranges in 'Sector' column<br>
- Expanded ranges into individual rows<br>
- Saved cleaned data as 'NAICS_DB.csv' in the output folder for loading into SQL DB in Milestone 5.<br>
<br>
#### **Ethical Implications:**
These datasets are public and come from trusted government sources, so they are ethically safe to use for my research. However, the wrangling process carried a small risk that I made incorrect assumptions while forward-filling missing values or labeling split sectors. I documented all changes to the original data so that the cleaning steps can be reviewed later and the results are not misinterpreted.