# Comparing unemployment datasets

The Alaskan Laborstats division publishes its unemployment data in two forms.

A complete, non-seasonally adjusted file is available at: https://live.laborstats.alaska.gov/labforce/csv/AKlaborforce.csv
The area-specific files are available through the website: https://live.laborstats.alaska.gov/data-pages/labor-force-home

At first glance, the complete file seemed to lack some areas, although it offers a complete and continuous time series.

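To check for missing areas, I compare the area names in the merged area-specific files with those in the complete file using my fuzzy matching function. It accepts a pair when one cleaned name contains the other, and otherwise requires a `token_sort_ratio` score of at least 95.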

Conclusion: the complete file does not lack any area-specific data and has a complete time series, while the individual files do not.

In [21]:
import os
import pandas as pd

In [22]:
custom_data_path = '../../data/custom_data/'
laborstats_path = '../../data/alaskan_laborstats'

In [23]:
custom_unem = pd.read_csv(os.path.join(custom_data_path, 'unemployment_by_area_merged.csv'))
compl_unem = pd.read_csv(os.path.join(laborstats_path, 'Alaska_NOT_SEASONALLY_adjusted.csv'))

In [24]:
custom_unem.columns

Index(['Unnamed: 0', 'Area Name', 'Area Type', 'Area Code', 'Year', 'month',
       'period', 'Preliminary if value is 1', 'Labor Force', 'Employment',
       'Unemployment', 'Unemployment Rate'],
      dtype='object')

In [25]:
compl_unem.columns

Index(['Area Name', 'Area Type', 'Area Code', 'Year', 'month', 'period',
       'Preliminary if value is 1', 'Labor Force', 'Employment',
       'Unemployment', 'Unemployment Rate'],
      dtype='object')
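
The claim that the complete file has a continuous time series can be checked directly. The following is a minimal sketch, not part of the original analysis: `find_missing_months` is a hypothetical helper, and it assumes that `Year` and `month` hold integers with months coded 1 to 12. The notebook does not show the actual values, so any other coding (month names, an extra code for annual averages) would need adapting.

```python
import pandas as pd


def find_missing_months(df, area_col="Area Name", year_col="Year", month_col="month"):
    """List every (area, year, month) missing between an area's first and last observation."""
    # Assumption: months are coded 1..12; rows with any other code are ignored
    monthly = df[df[month_col].between(1, 12)]
    # Map each (year, month) pair to a single running month number
    month_index = monthly[year_col].astype(int) * 12 + (monthly[month_col].astype(int) - 1)

    rows = []
    for area, months in month_index.groupby(monthly[area_col]):
        present = {int(m) for m in months}
        for m in range(min(present), max(present) + 1):
            if m not in present:
                rows.append({area_col: area, year_col: m // 12, month_col: m % 12 + 1})
    return pd.DataFrame(rows, columns=[area_col, year_col, month_col])


# Made-up sample data: area "A" skips January 2021, area "B" has no gap
demo = pd.DataFrame({
    "Area Name": ["A", "A", "A", "B", "B"],
    "Year": [2020, 2020, 2021, 2020, 2020],
    "month": [11, 12, 2, 1, 2],
})
gaps = find_missing_months(demo)
print(gaps)
```

If the assumptions hold, `find_missing_months(compl_unem)` should return an empty frame for a continuous series, and `find_missing_months(custom_unem)` would list the gaps in the merged area-specific files.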

## Area Name comparison

In [26]:
from fuzzywuzzy import process, fuzz

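def clean_location_name(name):
    '''
    Cleans location names by removing common suffixes that skew fuzzy matching.
    '''
    ignore_suffixes = ["census designated place", "city", "road", "street", "highway", "borough", "area", "island"]
    for suffix in ignore_suffixes:
        # Compare case-insensitively, then slice the suffix off the original string
        if name.lower().endswith(suffix):
            stripped = name[:-len(suffix)].strip()
            # Keep the name as-is if it consists of the suffix alone;
            # an empty string would pass the substring check against every name
            if stripped:
                name = stripped
    return name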

def match_names(trusted_df, trusted_col, other_df, other_col):
    '''
    Matches names between two dataframe columns, automatically rejecting low-confidence matches.
    
    Returns:
    - Matched DataFrame with original and final names
    - Unmatched names for manual review
    '''

    # Clean and strip names; sort them so the matching is deterministic across runs
    trusted_unique = sorted(set(trusted_df[trusted_col].astype(str).str.strip().str.lower().apply(clean_location_name)))
    other_unique = sorted(set(other_df[other_col].astype(str).str.strip().str.lower().apply(clean_location_name)))

    matched_other_to_trusted = {}
    unmatched_names = []

    for name in other_unique:
        # First pass: accept a trusted name that contains, or is contained in, this name
        for trusted_name in trusted_unique:
            if trusted_name in name or name in trusted_name:
                matched_other_to_trusted[name] = trusted_name
                break
        else:
            # Second pass: fall back to the best fuzzy match among the trusted names
            match, score = process.extractOne(name, trusted_unique, scorer=fuzz.token_sort_ratio)

            if score >= 95:  # Almost perfect match
                matched_other_to_trusted[name] = match
            else:
                unmatched_names.append(name)  # Log unmatched names

    matched_df = pd.DataFrame(list(matched_other_to_trusted.items()), columns=["Original", "Matched"])
    unmatched_df = pd.DataFrame(unmatched_names, columns=["Unmatched"])

    return matched_df, unmatched_df

In [27]:
matched_df, unmatched_df = match_names(compl_unem, 'Area Name', custom_unem, 'Area Name')
unmatched_df

(empty table with the single column `Unmatched`: no names were left unmatched)
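
The call above tests one direction only (names from the merged files against the complete file), and `match_names` accepts a pair as soon as one cleaned name contains the other. A stricter cross-check compares the two name sets exactly in both directions. The sketch below assumes that matching names differ only in case and whitespace; `compare_name_sets` is a hypothetical helper and the sample names are made up, not taken from the Laborstats files.

```python
def compare_name_sets(trusted_names, other_names):
    """Return (only_in_other, only_in_trusted) after light normalisation."""
    def normalize(value):
        # Lowercase and collapse whitespace; no suffix stripping and no fuzzy step
        return " ".join(str(value).lower().split())

    trusted = {normalize(n) for n in trusted_names}
    other = {normalize(n) for n in other_names}
    return sorted(other - trusted), sorted(trusted - other)


# Made-up sample values for illustration only
complete_names = ["Anchorage", "Juneau ", "Kenai Peninsula"]
merged_names = ["anchorage", "Juneau", "Kenai"]

only_in_merged, only_in_complete = compare_name_sets(complete_names, merged_names)
print(only_in_merged)    # ['kenai']
print(only_in_complete)  # ['kenai peninsula']
```

In this sample, the containment rule would have accepted "kenai" as a match for "kenai peninsula", while the exact comparison flags both for manual review. With the notebook's frames, the equivalent call would be `compare_name_sets(compl_unem['Area Name'], custom_unem['Area Name'])`.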


# Findings
Since every area name in the merged area-specific files has a match in the complete file, the complete file covers all of those areas, and I will integrate it into my final dataset.