# Term Project: Is AI taking our jobs or transforming them?

Lana Geissinger
Bellevue University
DSC540_T303 Data Preparation (2257-1)
Professor Catherine Williams
Milestone 4
July 27, 2025


## <u>Connecting to an API/Pulling in the Data and Cleaning/Formatting</u>

Milestone 4 connects to the O*NET Web Services API to gather detailed occupation data, in particular the skills and requirements for the jobs identified in Milestone 3 ("Growing/Declining occupations"). The goal is to understand how job requirements and skills are evolving in response to AI.<br>

In [28]:
import json
import pandas as pd
import sys
from thefuzz import fuzz
import re
from typing import Dict, List
import time
import os
import requests
from dotenv import load_dotenv
from datetime import datetime


## Connecting to O*NET Web Services API

### Connection Setup
The ONET Web Services implementation is based on the official O*NET Center sample code:
[O*NET Web Services Python Sample Code](https://github.com/onetcenter/web-services-samples/blob/master/python-2/OnetWebService.py)

- Add the scripts directory to the Python path for module access using `sys.path.append`
- Load environment variables for API authentication
- Test the connection to O*NET Web Services

### Configuration Notes
- **API Endpoint**: `https://services.onetcenter.org/ws/online/search`
- **Authentication**: Username/password from environment variables
- **Request Headers**:
  - User-Agent: Python/OnetWebService
  - Accept: application/json
  - Content-Type: application/json

### Connection Test Process
- Load credentials from environment file
- Verify credentials exist
- Test connection with sample search
- Check response status (200 for success)
- Handle errors and provide feedback


In [29]:
# Add the scripts directory (one level up from the notebook) to the Python path.
# os.path.dirname('OnetWebService.py') is an empty string, so the path is simply '../scripts'.
scripts_dir = os.path.abspath(os.path.join('..', 'scripts'))
sys.path.append(scripts_dir)


In [30]:
from OnetWebService import OnetWebService

def test_onet_connection():

    print("\n🔑 Loading environment variables...")

    load_dotenv('../env_var.env')

    username = os.getenv('ONET_API_USERNAME')
    password = os.getenv('ONET_API_PASSWORD')

    # Verify credentials exist
    if not username or not password:
        print("❌ Missing O*NET API credentials in environment variables")
        print("Please set ONET_API_USERNAME and ONET_API_PASSWORD in env_var.env")
        return False

    print("✅ Found credentials:")
    print(f"  Username: {'*'*len(username)}")
    print(f"  Password: {'*' * len(password)}")

    # Test connection
    print("\n🌐 Testing connection to O*NET Web Services...")

    base_url = "https://services.onetcenter.org/ws/online/search"  # Updated endpoint
    headers = {
        "User-Agent": "Python/OnetWebService",
        "Accept": "application/json",
        "Content-Type": "application/json"
    }

    # Test connection with simple search
    params = {
        "keyword": "software developer"
    }

    try:
        response = requests.get(
            base_url,
            auth=(username, password),
            headers=headers,
            params=params,
            timeout=30  # Avoid hanging indefinitely if the service does not respond
        )

        if response.status_code == 200:
            print("✅ Successfully connected to O*NET Web Services")
            return True
        else:
            print(f"❌ Connection failed with status code: {response.status_code}")
            print(f"Response text: {response.text}")
            return False

    except Exception as e:
        print(f"❌ Error during connection test: {e}")
        return False

if __name__ == "__main__":
    test_onet_connection()


🔑 Loading environment variables...
✅ Found credentials:
  Username: ********************
  Password: *******

🌐 Testing connection to O*NET Web Services...
✅ Successfully connected to O*NET Web Services


## SOC Code Formatting for Growing_Declining.csv

This process starts by loading and cleaning occupation data from the Growing_Declining.csv file, which contains employment trends, and then matches those job titles against the standardized classifications in the SOC_DB.csv file. <br>
Once the titles are aligned between the two sources, SOC codes are added to the employment data to ensure consistency for analysis. After that, the SOC codes are formatted to match the O*NET code format used by the API. <br>
The final output is a new file called Growing_Declining_SOC.csv, which includes the occupation titles, corresponding SOC codes, and the original employment matrix data.
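
Because the join is a left merge on normalized titles, any employment-matrix title without an exact counterpart in the SOC table ends up with a missing SOC code. The sketch below shows one way to list those titles with the `indicator` option of `merge`. The two small DataFrames are made-up stand-ins for the real CSV files; only the column names follow the notebook.

```python
import pandas as pd

# Illustrative stand-ins for Growing_Declining.csv and SOC_DB.csv (made-up rows)
df_trends = pd.DataFrame({
    "2023_national_employment_matrix_title": [
        "Data Entry Keyers", "Total, All Occupations"],
})
df_soc = pd.DataFrame({
    "detailed_occupation": ["43-9021"],
    "occupation_title": ["Data Entry Keyers"],
})

# Normalize titles the same way on both sides before joining
df_trends["title_clean"] = (
    df_trends["2023_national_employment_matrix_title"].str.lower().str.strip())
df_soc["title_clean"] = df_soc["occupation_title"].str.lower().str.strip()

# indicator=True adds a "_merge" column: "both" or "left_only" for a left join
merged = df_trends.merge(
    df_soc[["detailed_occupation", "title_clean"]],
    on="title_clean", how="left", indicator=True)

# Titles that found no SOC counterpart
unmatched = merged.loc[
    merged["_merge"] == "left_only",
    "2023_national_employment_matrix_title"].tolist()
print(unmatched)
```

Reviewing this list before saving makes it easier to tell aggregate rows (which have no SOC code by design) from titles that simply differ in spelling between the two sources.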



In [31]:
# Load datasets
df_growing_declining = pd.read_csv('../output/Growing_Declining.csv')
df_soc = pd.read_csv('../output/SOC_DB.csv', encoding='Windows-1252')

# Clean and normalize titles 
df_growing_declining['title_clean'] = df_growing_declining[
    '2023_national_employment_matrix_title'].str.lower().str.strip()
df_soc['title_clean'] = df_soc['occupation_title'].str.lower().str.strip()

# Merge datasets
df_merged = df_growing_declining.merge(
    df_soc[['detailed_occupation', 'occupation_title', 'title_clean']],
    on='title_clean',
    how='left'
)

# Cleanup and rename columns
df_merged = df_merged.rename(columns={
    'detailed_occupation': 'soc_code'
})
df_merged.drop(columns=['title_clean'], inplace=True)

# Save to CSV
df_merged.to_csv('../output/Growing_Declining_SOC.csv', index=False)


## SOC Code Formatting for O*NET API

Convert BLS SOC codes to O*NET format for API compatibility. The formatted codes enable proper API queries for occupation data and employment trend analysis.

### Format Changes

- Basic SOC code (11-9041) → O*NET format (11-9041.00)
- Add .00 suffix if no detail level
- Keep existing detail levels (.01, .02, etc.)
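
The rules above can also be applied to a whole column at once. The following is a minimal sketch that assumes pandas string methods and uses made-up sample codes. It is an alternative illustration of the same rules, not the function used in this notebook.

```python
import pandas as pd

# Made-up sample codes covering each rule, plus a missing and a malformed value
codes = pd.Series(["11-9041", "15-1252.01", "439022", None, "n/a"])

s = codes.str.strip()
# Six bare digits -> insert the hyphen (439022 -> 43-9022)
s = s.str.replace(r"^(\d{2})(\d{4})$", r"\1-\2", regex=True)
# Basic SOC code -> append the .00 detail suffix
s = s.str.replace(r"^(\d{2}-\d{4})$", r"\1.00", regex=True)
# Keep only values that now match the XX-XXXX.XX pattern; the rest become NaN
is_valid = s.str.fullmatch(r"\d{2}-\d{4}\.\d{2}", na=False)
onet_codes = s.where(is_valid)

print(onet_codes.tolist())
```

Codes that already carry a detail level pass through unchanged, and anything that matches none of the patterns is left missing so it can be reported separately.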




In [32]:
# Load the Growing_Declining dataset
df = pd.read_csv('../output/Growing_Declining_SOC.csv')


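def format_soc_code_for_onet(soc_code):
    """Return soc_code in O*NET format (XX-XXXX.XX), or None if it cannot be converted."""
    if pd.isna(soc_code):
        return None

    soc_code = str(soc_code).strip()

    # Already in O*NET format: keep the existing detail level
    if re.match(r'^\d{2}-\d{4}\.\d{2}$', soc_code):
        return soc_code

    # Basic SOC code: add the .00 suffix
    if re.match(r'^\d{2}-\d{4}$', soc_code):
        return f"{soc_code}.00"

    # Six bare digits: insert the hyphen and add the .00 suffix
    if re.match(r'^\d{6}$', soc_code):
        return f"{soc_code[:2]}-{soc_code[2:6]}.00"

    return None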


# Create a new column 'onet_soc_code' with O*NET formatted SOC codes
df['onet_soc_code'] = df['soc_code'].apply(format_soc_code_for_onet)


print("\nSOC code conversions:")
print(pd.concat([df['soc_code'], df['onet_soc_code']], axis=1).head(10))

# Count of successful conversions
total = len(df)
converted = df['onet_soc_code'].notna().sum()
print(f"\nConverted {converted} out of {total} SOC codes")

# Save updated dataset in output folder
output_file = '../output/Growing_Declining_ONET_SOC.csv'
df.to_csv(output_file, index=False)
print(f"\nSaved updated dataset to {output_file}")

# Display errors
failed = df[df['onet_soc_code'].isna()]
if not failed.empty:
    print("\nSOC codes that couldn't be converted:")
    print(failed[['soc_code', '2023_national_employment_matrix_title']])


SOC code conversions:
  soc_code onet_soc_code
0  43-9022    43-9022.00
1  47-5043    47-5043.00
2  43-2011    43-2011.00
3  43-2021    43-2021.00
4  43-1011    43-1011.00
5  43-2011    43-2011.00
6  43-9021    43-9021.00
7  51-4071    51-4071.00
8  47-5044    47-5044.00
9  51-4062    51-4062.00

Converted 85 out of 87 SOC codes

Saved updated dataset to ../output/Growing_Declining_ONET_SOC.csv

SOC codes that couldn't be converted:
   soc_code              2023_national_employment_matrix_title
43      NaN                             Total, All Occupations
73      NaN  Substance Abuse, Behavioral Disorder, And Ment...


## O*NET Occupation Search

Search the O*NET database for each occupation title, score the results with fuzzy matching, and return a dictionary that maps each input title to its matched occupations.

### Process
- Load occupation titles from dataset
- Query O*NET API with each title
- Score matches using fuzzy string comparison
- Filter and rank results
- Save matches to JSON for further analysis

### Output
Creates a JSON file containing:
- Timestamp and metadata
- Matched occupations with scores
- Success/failure statistics<br>

This file serves as an intermediate step in the data processing pipeline. It stores the results of fuzzy matching between the occupation titles and the O*NET database, along with metadata about the matching process. These matches will be used to retrieve occupational data.
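
The score, filter, and rank steps can be sketched without any API call. The example below uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio`, so the exact scores may differ from those produced by `thefuzz`. The helper names `similarity` and `rank_matches` and the candidate list are hypothetical and only serve the illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Case-insensitive similarity on a 0-100 scale (stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def rank_matches(title, candidates, min_score=80, max_results=3):
    """Score candidates against title, drop weak ones, return the best few."""
    scored = [(c, similarity(title, c)) for c in candidates]
    kept = [pair for pair in scored if pair[1] >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_results]


# Candidate titles as they might come back from a keyword search (illustrative)
candidates = [
    "Model Makers, Metal and Plastic",
    "Patternmakers, Metal and Plastic",
    "Telephone Operators",
]
best = rank_matches("Patternmakers, Metal And Plastic", candidates)
print(best)
```

Lowercasing both strings before comparison is what lets titles that differ only in capitalization, such as "And" versus "and", reach a perfect score.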

In [33]:
# Add the scripts directory (one level up from the notebook) to the Python path.
# os.path.dirname('OnetWebService.py') is an empty string, so the path is simply '../scripts'.
scripts_dir = os.path.abspath(os.path.join('..', 'scripts'))
sys.path.append(scripts_dir)


def fuzzy_search_occupations(onet_ws: OnetWebService, occupation_titles: List[str],
                             min_score: int = 80, max_results: int = 3) -> Dict[str, List[dict]]:
    """Search O*NET for each title and return the best fuzzy matches per title."""
    results = {}

    for title in occupation_titles:
        print(f"\nSearching for: {title}")

        try:
            search_response = onet_ws.call('online/search',
                                           keyword=title,
                                           end=10)

            if not search_response or 'occupation' not in search_response:
                print(f"❌ No results found for '{title}'")
                results[title] = []
                continue

            # Get all occupations from response
            occupations = search_response.get('occupation', [])
            if not isinstance(occupations, list):
                occupations = [occupations]

            # Score matches using fuzzy matching
            scored_matches = []
            for occ in occupations:
                if isinstance(occ, dict):
                    onet_title = occ.get('title', '')
                    score = fuzz.ratio(title.lower(), onet_title.lower())

                    if score >= min_score:
                        scored_matches.append({
                            'title': onet_title,
                            'code': occ.get('code', ''),
                            'match_score': score,
                            'href': occ.get('href', '')
                        })

            # Sort by score and take top matches
            scored_matches.sort(key=lambda x: x['match_score'], reverse=True)
            results[title] = scored_matches[:max_results]

            # Print matches
            if scored_matches:
                print("✅ Found matches:")
                for match in scored_matches[:max_results]:
                    print(f"  - {match['title']} (Score: {match['match_score']})")
            else:
                print("⚠️ No matches above minimum score threshold")

            # Pause between requests to respect API rate limits
            time.sleep(1)

        except Exception as e:
            print(f"❌ Error processing '{title}': {str(e)}")
            results[title] = []
            continue

    return results


In [34]:
# Load occupation titles from Growing_Declining dataset
try:
    df = pd.read_csv('../output/Growing_Declining.csv')
    occupation_titles = df['2023_national_employment_matrix_title'].unique().tolist()

    # Initialize O*NET Web Service
    onet = OnetWebService()

    print(f"Processing {len(occupation_titles)} unique occupation titles...")

    # Process all occupations and store results
    all_results = fuzzy_search_occupations(onet, occupation_titles)

    # Prepare output data with metadata
    output_data = {
        "metadata": {
            "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
            "total_occupations": len(occupation_titles),
            "min_match_score": 80,
            "max_results_per_occupation": 3
        },
        "matches": all_results
    }

    # Save all results to a single JSON file
    output_file = '../api_responses/onet_occupation_matches.json'
    with open(output_file, 'w') as f:
        json.dump(output_data, f, indent=2)

    print(f"\n✅ All matches saved to: {output_file}")

    # Print summary statistics
    matches_found = sum(len(matches) > 0 for matches in all_results.values())
    print(f"\nSummary:")
    print(f"- Total occupations processed: {len(occupation_titles)}")
    print(f"- Occupations with matches: {matches_found}")
    print(f"- Occupations without matches: {len(occupation_titles) - matches_found}")

except Exception as e:
    print(f"Error: {str(e)}")

Processing 61 unique occupation titles...

Searching for: Word Processors And Typists
✅ Found matches:
  - Word Processors and Typists (Score: 100)

Searching for: Roof Bolters, Mining
✅ Found matches:
  - Roof Bolters, Mining (Score: 100)

Searching for: Telephone Operators
✅ Found matches:
  - Telephone Operators (Score: 100)

Searching for: Switchboard Operators, Including Answering Service
✅ Found matches:
  - Switchboard Operators, Including Answering Service (Score: 100)

Searching for: Data Entry Keyers
✅ Found matches:
  - Data Entry Keyers (Score: 100)

Searching for: Foundry Mold And Coremakers
✅ Found matches:
  - Foundry Mold and Coremakers (Score: 100)

Searching for: Loading And Moving Machine Operators, Underground Mining
✅ Found matches:
  - Loading and Moving Machine Operators, Underground Mining (Score: 100)

Searching for: Patternmakers, Metal And Plastic
✅ Found matches:
  - Patternmakers, Metal and Plastic (Score: 100)
  - Model Makers, Metal and Plastic (Score: 83

In [35]:
type(all_results)

dict

## Retrieving Occupation Data from O*NET

This code retrieves detailed occupation information from the O*NET Web Services API.

### Data Collection Process
1. Loads environment variables for API authentication
2. Initializes O*NET Web Service connection
3. Reads the Growing_Declining_ONET_SOC.csv file containing occupation codes
4. For each occupation with a valid O*NET SOC code:
   - Retrieves basic occupation details
   - Gets additional occupation description
   - Stores the following information:
     - Original occupation title
     - O*NET code
     - Job description
     - Job family classification

### Progress Tracking
- Reports progress to confirm that the program is still running and not frozen, and to help monitor API call usage
- Shows completion status every 5 occupations
- Includes error handling for failed API calls

### Data Storage
- Saves collected data to JSON file with metadata
- Includes timestamp and total occupations processed
- Stores results in '../api_responses/occupation_details.json'
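
The collection, progress, and storage steps described above can be sketched with a stand-in for the API. In the example below, `collect_details` and `fake_fetch` are hypothetical names. `fake_fetch` imitates a call that fails for one code, and the failed codes are recorded alongside the successful records so they can be reviewed later.

```python
import json
from datetime import datetime


def collect_details(codes, fetch, report_every=5):
    """Call fetch(code) for each code; return (records, failed_codes)."""
    records, failed = [], []
    for i, code in enumerate(codes, start=1):
        try:
            data = fetch(code)
            records.append({"onet_code": code,
                            "description": data.get("description", "")})
        except Exception as exc:  # keep going, but remember what failed
            failed.append({"onet_code": code, "error": str(exc)})
        if i % report_every == 0:
            print(f"Progress: {i}/{len(codes)} codes attempted")
    return records, failed


# Hypothetical fetcher standing in for the real API call
def fake_fetch(code):
    if code == "11-0081.00":
        raise ValueError("422 Unprocessable Entity")
    return {"description": f"Description for {code}"}


records, failed = collect_details(["43-9022.00", "11-0081.00"], fake_fetch)
payload = {
    "occupations": records,
    "failed": failed,
    "metadata": {"total_occupations": len(records),
                 "timestamp": datetime.now().isoformat()},
}
print(json.dumps(payload, indent=2))
```

Keeping the failures in the same JSON payload is a design choice for this sketch. It makes the gap between codes attempted and codes retrieved visible in the saved file rather than only in the console output.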


In [36]:
# Load environment variables
load_dotenv('../env_var.env')
username = os.getenv('ONET_API_USERNAME')
password = os.getenv('ONET_API_PASSWORD')

# Initialize O*NET Web Service
onet = OnetWebService()

# Create a list to store occupation data
occupation_data = []

# Load the Growing_Declining_ONET_SOC.csv
df = pd.read_csv('../output/Growing_Declining_ONET_SOC.csv')

# Progress tracking
total = len(df[df['onet_soc_code'].notna()])
processed = 0

# Process each occupation
for _, row in df.iterrows():
    if pd.notna(row['onet_soc_code']):
        onet_code = row['onet_soc_code']
        occupation_title = row['2023_national_employment_matrix_title']

        try:
            occupation_details = onet.call(f'online/occupations/{onet_code}')

            if occupation_details:
                details = onet.call(f'online/occupations/{onet_code}/details')

                occupation_data.append({
                    'original_title': occupation_title,
                    'onet_code': onet_code,
                    'description': occupation_details.get('description', ''),
                    'job_family': occupation_details.get('job_family', {}).get('name', '')

                })


                processed += 1
                if processed % 5 == 0:
                    print(f"Progress: {processed}/{total} occupations processed")

        except Exception as e:
            # Include the exception so the cause of the failure is visible
            print(f"Error processing {onet_code}: {e}")

        time.sleep(1)

Progress: 5/85 occupations processed
Progress: 10/85 occupations processed
Progress: 15/85 occupations processed
Progress: 20/85 occupations processed
Progress: 25/85 occupations processed
Progress: 30/85 occupations processed
Progress: 35/85 occupations processed
Progress: 40/85 occupations processed
Progress: 45/85 occupations processed
Progress: 50/85 occupations processed
API request error: 422 Client Error: Unprocessable Entity for url: https://services.onetcenter.org/ws/online/occupations/11-0081.00
API request error: 422 Client Error: Unprocessable Entity for url: https://services.onetcenter.org/ws/online/occupations/11-0011.00
Progress: 55/85 occupations processed
Progress: 60/85 occupations processed
Progress: 65/85 occupations processed
Progress: 70/85 occupations processed
Progress: 75/85 occupations processed
API request error: 422 Client Error: Unprocessable Entity for url: https://services.onetcenter.org/ws/online/occupations/11-0013.00
API request error: 422 Client Error

In [37]:
# Write all occupation data in JSON file
output_file = '../api_responses/occupation_details.json'
json_data = {
    'occupations': occupation_data,
    'metadata': {
        'total_occupations': len(occupation_data),
        'timestamp': datetime.now().isoformat()
    }
}

# Save to JSON file
with open(output_file, 'w', encoding='utf-8') as f:
    json.dump(json_data, f, indent=2, ensure_ascii=False)

print(f"\nProcessing complete. Data saved to {output_file}")
print(f"Total occupations processed: {len(occupation_data)}")


Processing complete. Data saved to ../api_responses/occupation_details.json
Total occupations processed: 81


In [38]:
type(occupation_data)

list

In [39]:
print(occupation_data)

[{'original_title': 'Word Processors And Typists', 'onet_code': '43-9022.00', 'description': 'Use word processor, computer, or typewriter to type letters, reports, forms, or other material from rough draft, corrected copy, or voice recording. May perform other clerical duties as assigned.', 'job_family': ''}, {'original_title': 'Roof Bolters, Mining', 'onet_code': '47-5043.00', 'description': 'Operate machinery to install roof support bolts in underground mine.', 'job_family': ''}, {'original_title': 'Telephone Operators', 'onet_code': '43-2011.00', 'description': 'Operate telephone business systems equipment or switchboards to relay incoming, outgoing, and interoffice calls. May supply information to callers and record messages.', 'job_family': ''}, {'original_title': 'Telephone Operators', 'onet_code': '43-2021.00', 'description': 'Provide information by accessing alphabetical, geographical, or other directories. Assist customers with special billing requests, such as charges to a th

In [40]:
pd.json_normalize(occupation_data).head()

Unnamed: 0,original_title,onet_code,description,job_family
0,Word Processors And Typists,43-9022.00,"Use word processor, computer, or typewriter to...",
1,"Roof Bolters, Mining",47-5043.00,Operate machinery to install roof support bolt...,
2,Telephone Operators,43-2011.00,Operate telephone business systems equipment o...,
3,Telephone Operators,43-2021.00,"Provide information by accessing alphabetical,...",
4,"Switchboard Operators, Including Answering Ser...",43-1011.00,Directly supervise and coordinate the activiti...,


In [41]:
# Load the growing_declining_ONET_SOC file
df_growing_declining = pd.read_csv('../output/Growing_Declining_ONET_SOC.csv')

# Get the occupation data
with open('../api_responses/occupation_details.json', 'r', encoding='utf-8') as f:
    occupation_data = json.load(f)

# Convert occupation_data to DataFrame
df_occupations = pd.json_normalize(occupation_data['occupations'])

# Rename the column in df_growing_declining
df_growing_declining = df_growing_declining.rename(columns={'onet_soc_code': 'onet_code'})

# Add columns from growing_declining to occupation_data
for occ in occupation_data['occupations']:
    matches = df_growing_declining[df_growing_declining['onet_code'] == occ['onet_code']]
    if not matches.empty:
        record = matches.iloc[0]
        occ.update({
            '2023_national_employment_matrix_title': record['2023_national_employment_matrix_title'],
            'growth_status': record['growth_status'],
            'soc_code': record['soc_code']
        })
    else:
        occ.update({
            '2023_national_employment_matrix_title': None,
            'growth_status': None,
            'soc_code': None
        })

# Save the updated occupation_details.json
with open('../api_responses/occupation_details_upd.json', 'w', encoding='utf-8') as f:
    json.dump(occupation_data, f, indent=2, ensure_ascii=False)
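
The row-by-row lookup above could also be expressed as a single left merge. This is a minimal sketch on made-up rows, assuming the same column names as the notebook. Dropping duplicate codes first mirrors the loop's use of the first matching row. One difference: unmatched occupations receive NaN here rather than None.

```python
import pandas as pd

# Illustrative stand-ins for the API results and the trends table (made-up rows)
df_occ = pd.DataFrame({
    "onet_code": ["43-9022.00", "99-9999.00"],
    "description": ["Use word processor...", "No trend row for this code"],
})
df_trends = pd.DataFrame({
    "onet_code": ["43-9022.00", "43-9022.00"],
    "growth_status": ["Declining", "Declining"],
    "soc_code": ["43-9022", "43-9022"],
})

# Keep one trend row per code, mirroring the loop's use of the first match
first_per_code = df_trends.drop_duplicates(subset="onet_code", keep="first")

# Left merge keeps every occupation; codes without a trend row get NaN
enriched = df_occ.merge(first_per_code, on="onet_code", how="left")
print(enriched)
```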




In [42]:
pd.json_normalize(occupation_data['occupations']).head()

Unnamed: 0,original_title,onet_code,description,job_family,2023_national_employment_matrix_title,growth_status,soc_code
0,Word Processors And Typists,43-9022.00,"Use word processor, computer, or typewriter to...",,Word Processors And Typists,Declining,43-9022
1,"Roof Bolters, Mining",47-5043.00,Operate machinery to install roof support bolt...,,"Roof Bolters, Mining",Declining,47-5043
2,Telephone Operators,43-2011.00,Operate telephone business systems equipment o...,,Telephone Operators,Declining,43-2011
3,Telephone Operators,43-2021.00,"Provide information by accessing alphabetical,...",,Telephone Operators,Declining,43-2021
4,"Switchboard Operators, Including Answering Ser...",43-1011.00,Directly supervise and coordinate the activiti...,,"Switchboard Operators, Including Answering Ser...",Declining,43-1011


In [43]:
# Read the JSON file
with open('../api_responses/occupation_details_upd.json', 'r', encoding='utf-8') as f:
    occupation_data = json.load(f)

# Create DataFrame from the occupations list in the JSON
df_occupations = pd.json_normalize(occupation_data['occupations'])

# Display the first few rows and basic information about the DataFrame
print("DataFrame Preview:")
print(df_occupations.head())
print("\nDataFrame Info:")
print(df_occupations.info())


DataFrame Preview:
                                      original_title   onet_code  \
0                        Word Processors And Typists  43-9022.00   
1                               Roof Bolters, Mining  47-5043.00   
2                                Telephone Operators  43-2011.00   
3                                Telephone Operators  43-2021.00   
4  Switchboard Operators, Including Answering Ser...  43-1011.00   

                                         description job_family  \
0  Use word processor, computer, or typewriter to...              
1  Operate machinery to install roof support bolt...              
2  Operate telephone business systems equipment o...              
3  Provide information by accessing alphabetical,...              
4  Directly supervise and coordinate the activiti...              

               2023_national_employment_matrix_title growth_status soc_code  
0                        Word Processors And Typists     Declining  43-9022  
1            

### Transformation: Step 1 - Replace Headers

In [44]:
column_mapping = {
    'onet_code': 'onet_soc_code',
    'original_title': 'occupation_title',
    'description': 'occupation_description',
    '2023_national_employment_matrix_title': 'employment_matrix_title',
    'growth_status': 'employment_trend'
}
df_occupations = df_occupations.rename(columns=column_mapping)

In [45]:
df_occupations.head()

Unnamed: 0,occupation_title,onet_soc_code,occupation_description,job_family,employment_matrix_title,employment_trend,soc_code
0,Word Processors And Typists,43-9022.00,"Use word processor, computer, or typewriter to...",,Word Processors And Typists,Declining,43-9022
1,"Roof Bolters, Mining",47-5043.00,Operate machinery to install roof support bolt...,,"Roof Bolters, Mining",Declining,47-5043
2,Telephone Operators,43-2011.00,Operate telephone business systems equipment o...,,Telephone Operators,Declining,43-2011
3,Telephone Operators,43-2021.00,"Provide information by accessing alphabetical,...",,Telephone Operators,Declining,43-2021
4,"Switchboard Operators, Including Answering Ser...",43-1011.00,Directly supervise and coordinate the activiti...,,"Switchboard Operators, Including Answering Ser...",Declining,43-1011


### Transformation: Step 2 - Find and Remove Duplicates

In [46]:
print("Number of duplicate titles:",
      df_occupations['occupation_title'].duplicated().sum())
print(df_occupations[df_occupations['occupation_title'].duplicated(keep=False)]
      .sort_values('occupation_title')[['occupation_title', 'onet_soc_code']])


Number of duplicate titles: 24
                                     occupation_title onet_soc_code
60                                          Actuaries    15-1299.00
61                                          Actuaries    15-2011.00
25  Aircraft Structure, Surfaces, Rigging, And Sys...    51-1011.00
26  Aircraft Structure, Surfaces, Rigging, And Sys...    51-2011.00
55       Computer And Information Research Scientists    15-1221.00
54       Computer And Information Research Scientists    15-1212.00
50                                    Data Scientists    15-2051.00
49                                    Data Scientists    15-2041.00
12                Engine And Other Machine Assemblers    51-2023.00
13                Engine And Other Machine Assemblers    51-2031.00
33                                        File Clerks    43-4071.00
32                                        File Clerks    43-4061.00
63                                Financial Examiners    13-2061.00
62               

In [47]:
df_occupations_unique = df_occupations.drop_duplicates(
    subset=['occupation_title'],
    keep='first'
).reset_index(drop=True)
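
Since the duplicated titles carry different codes, `keep='first'` decides which code survives for each title. The sketch below, on made-up rows, shows one way to list the titles that map to more than one code and the codes that deduplication drops, so the choice can be reviewed rather than made silently.

```python
import pandas as pd

# Made-up rows: one title mapped to two codes, one title mapped to a single code
df = pd.DataFrame({
    "occupation_title": ["Telephone Operators", "Telephone Operators",
                         "Data Entry Keyers"],
    "onet_soc_code": ["43-2011.00", "43-2021.00", "43-9021.00"],
})

# Count distinct codes per title and keep the titles with more than one
codes_per_title = df.groupby("occupation_title")["onet_soc_code"].nunique()
ambiguous = codes_per_title[codes_per_title > 1].index.tolist()

# keep="first" retains the earliest row for each title and drops the rest
deduped = df.drop_duplicates(subset=["occupation_title"], keep="first")
dropped_codes = sorted(set(df["onet_soc_code"]) - set(deduped["onet_soc_code"]))
print(ambiguous, dropped_codes)
```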


In [48]:
df_occupations_unique.head()

Unnamed: 0,occupation_title,onet_soc_code,occupation_description,job_family,employment_matrix_title,employment_trend,soc_code
0,Word Processors And Typists,43-9022.00,"Use word processor, computer, or typewriter to...",,Word Processors And Typists,Declining,43-9022
1,"Roof Bolters, Mining",47-5043.00,Operate machinery to install roof support bolt...,,"Roof Bolters, Mining",Declining,47-5043
2,Telephone Operators,43-2011.00,Operate telephone business systems equipment o...,,Telephone Operators,Declining,43-2011
3,"Switchboard Operators, Including Answering Ser...",43-1011.00,Directly supervise and coordinate the activiti...,,"Switchboard Operators, Including Answering Ser...",Declining,43-1011
4,Data Entry Keyers,43-9021.00,"Operate data entry device, such as keyboard or...",,Data Entry Keyers,Declining,43-9021


### Transformation: Step 3 - Check for missing values, empty strings, and whitespace in each column

In [49]:
# Function to analyze empty and whitespace values in a column
def analyze_column(df, column):
    # Only string (object) columns are checked
    if df[column].dtype == object:
        # Exclude nulls so they are not miscounted as whitespace differences
        values = df[column].dropna().astype(str)
        empty_strings = (values == '').sum()
        whitespace_only = values.str.isspace().sum()
        leading_trailing = (values.str.strip() != values).sum()
        null_values = df[column].isna().sum()

        print(f"\nColumn: {column}")
        print(f"Empty strings (''): {empty_strings}")
        print(f"Whitespace-only strings: {whitespace_only}")
        print(f"Leading/trailing whitespace: {leading_trailing}")
        print(f"NULL values: {null_values}")

# Analyze each column in the dataframe
print("Analysis of empty strings and whitespace by column:")
for column in df_occupations_unique.columns:
    analyze_column(df_occupations_unique, column)

Analysis of empty strings and whitespace by column:

Column: occupation_title
Empty strings (''): 0
Whitespace-only strings: 0
Leading/trailing whitespace: 0
NULL values: 0

Column: onet_soc_code
Empty strings (''): 0
Whitespace-only strings: 0
Leading/trailing whitespace: 0
NULL values: 0

Column: occupation_description
Empty strings (''): 0
Whitespace-only strings: 0
Leading/trailing whitespace: 0
NULL values: 0

Column: job_family
Empty strings (''): 57
Whitespace-only strings: 0
Leading/trailing whitespace: 0
NULL values: 0

Column: employment_matrix_title
Empty strings (''): 0
Whitespace-only strings: 0
Leading/trailing whitespace: 0
NULL values: 0

Column: employment_trend
Empty strings (''): 0
Whitespace-only strings: 0
Leading/trailing whitespace: 0
NULL values: 0

Column: soc_code
Empty strings (''): 0
Whitespace-only strings: 0
Leading/trailing whitespace: 0
NULL values: 0


### Transformation: Step 4 - Remove column 'job_family' with missing values

In [50]:
df_occupations_unique = df_occupations_unique.drop(columns=['job_family'])

In [51]:
df_occupations_unique.head()

Unnamed: 0,occupation_title,onet_soc_code,occupation_description,employment_matrix_title,employment_trend,soc_code
0,Word Processors And Typists,43-9022.00,"Use word processor, computer, or typewriter to...",Word Processors And Typists,Declining,43-9022
1,"Roof Bolters, Mining",47-5043.00,Operate machinery to install roof support bolt...,"Roof Bolters, Mining",Declining,47-5043
2,Telephone Operators,43-2011.00,Operate telephone business systems equipment o...,Telephone Operators,Declining,43-2011
3,"Switchboard Operators, Including Answering Ser...",43-1011.00,Directly supervise and coordinate the activiti...,"Switchboard Operators, Including Answering Ser...",Declining,43-1011
4,Data Entry Keyers,43-9021.00,"Operate data entry device, such as keyboard or...",Data Entry Keyers,Declining,43-9021


### Transformation: Step 5 - Reorder columns

In [52]:
# Reorder columns
df_occupations_unique = df_occupations_unique[['occupation_title',
                                             'employment_trend',
                                             'onet_soc_code',
                                             'soc_code',
                                             'occupation_description']]


In [53]:
df_occupations_unique.head()

Unnamed: 0,occupation_title,employment_trend,onet_soc_code,soc_code,occupation_description
0,Word Processors And Typists,Declining,43-9022.00,43-9022,"Use word processor, computer, or typewriter to..."
1,"Roof Bolters, Mining",Declining,47-5043.00,47-5043,Operate machinery to install roof support bolt...
2,Telephone Operators,Declining,43-2011.00,43-2011,Operate telephone business systems equipment o...
3,"Switchboard Operators, Including Answering Ser...",Declining,43-1011.00,43-1011,Directly supervise and coordinate the activiti...
4,Data Entry Keyers,Declining,43-9021.00,43-9021,"Operate data entry device, such as keyboard or..."


In [54]:
# Save the cleaned file to output folder for loading into SQL DB in Milestone 5

# Output file path
output_dir = os.path.join('..', 'output')
output_file = os.path.join(output_dir, 'Occupation_Descriptions.csv')

# Save as CSV
df_occupations_unique.to_csv(output_file, index=False)

# Verify the file was created
if os.path.exists(output_file):
    print(f"File successfully saved to: {output_file}")

else:
    print("Error: File was not created")

File successfully saved to: ..\output\Occupation_Descriptions.csv


### Ethical Implications of Connecting to the API and Cleaning/Formatting O*NET Occupation Data

While connecting to the API and cleaning/formatting the O*NET occupation data, I performed the following steps:
<br>
#### **Initial Setup and Connection Testing:**
- Configured the O*NET Web Services connection using the official sample code
- Implemented secure credential management using environment variables
- Tested API connection
- Validated response status and error handling
- Added rate limiting to respect API constraints

#### **Data Processing Setup and SOC Code Formatting:**
- Loaded the Growing_Declining_SOC.csv file with occupational data
- Implemented a SOC code formatting function for O*NET compatibility
- Validated and saved formatted codes to Growing_Declining_ONET_SOC.csv

#### **ONET occupation data cleaning and formatting steps for the occupation_details.json file:**
- Implemented fuzzy matching search for occupation titles
- Queried the O*NET API for each occupation with a formatted SOC code, and scored and ranked the title matches
- Collected occupation data and saved results in occupation_details.json
- Normalized JSON data using pandas
- Added columns from the modified Growing_Declining.csv and saved results in occupation_details_upd.json<br>
*In the data loaded from occupation_details_upd.json:*
- Replaced headers
- Found and removed duplicates
- Checked for missing values, empty strings, and whitespace in all columns
- Removed column 'job_family', which had all values missing
- Reordered columns
- Saved the cleaned file to the output folder for loading into SQL DB in Milestone 5

#### **Ethical Implications:**
All data came from public sources like BLS and O*NET, keeping it legal and ethical for research use.<br>
Growth categories and metrics were clearly labeled to avoid confusion around workforce trends.<br>
Some bias may come from assumptions during fuzzy matching and cleaning steps.<br>
To help with that, I used multiple sources and documented the process.<br>
The project doesn't make firm predictions about AI's impact but stays transparent and notes its limits.<br>
