<a href="https://colab.research.google.com/github/Andrew-TraverseMT/placekey-joins/blob/main/dept_labor_wage_compliance_join_doctors_clinicians.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Joining Department of Labor Wage and Hour Compliance data with National Downloadable Files from the Doctors and Clinicians Data section using Placekey
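This notebook demonstrates how to combine Department of Labor Wage and Hour Compliance data with the National Downloadable Files from the Doctors and Clinicians Data section, using Placekey for location-based joins. Understanding this relationship can help in identifying healthcare practices that have labor violations.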

### Installing Dependencies and Importing Libraries

In [1]:
!pip install placekey

Collecting placekey
  Downloading placekey-0.0.36-py3-none-any.whl.metadata (8.1 kB)
Collecting h3<5,>=4.2.1 (from placekey)
  Downloading h3-4.2.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting ratelimit (from placekey)
  Downloading ratelimit-2.2.1.tar.gz (5.3 kB)
  Preparing metadata (setup.py) ... done
Collecting backoff (from placekey)
  Downloading backoff-2.2.1-py3-none-any.whl.metadata (14 kB)
Collecting boto3 (from placekey)
  Downloading boto3-1.37.3-py3-none-any.whl.metadata (6.6 kB)
Collecting botocore<1.38.0,>=1.37.3 (from boto3->placekey)
  Downloading botocore-1.37.3-py3-none-any.whl.metadata (5.7 kB)
Collecting jmespath<2.0.0,>=0.7.1 (from boto3->placekey)
  Downloading jmespath-1.0.1-py3-none-any.whl.metadata (7.6 kB)
Collecting s3transfer<0.12.0,>=0.11.0 (from boto3->placekey)
  Downloading s3transfer-0.11.3-py3-none-any.whl.metadata (1.7 kB)
Downloading placekey-0.0.36-py3-none-any.whl (18 kB)
Downloading h3-4.2.1-c

In [2]:
import placekey as pk
import pandas as pd
import numpy as np
import geopandas as gpd
import folium
from datetime import datetime
import matplotlib.pyplot as plt

### Explore Available Free Datasets using Placekey_PY

In [3]:
datasets = pk.list_free_datasets()
print(datasets)

['anteriad-skinny-file', 'chicago-vacant-and-abandoned-buildings', 'ageon-skinny-file', 'preferred-communications-skinny-file', 'buildzoom-skinny-file', 'regrid-skinny-file', 'chipotle-locations', 'department-of-labor-wage-and-hour-compliance', 'verisk-skinny-file', 'boston-public-works-violations', 'environics-skinny-file', 'nyc-eviction-data', 'cap-locations-skinny-file', 'chicago-scofflaw-law-violation-data', 'l2-data-skinny-file', 'openaddresses', 'la-crime-2020-24', 'chicago-building-permits', 'national-address-database', 'boston-food-establishment-inspections', 'nyc-tax-liens-sale', 'national-provider-identifier', 'paycheck-protection-program-lender-locations', 'federal-real-property-data', 'philadelphia-affordable-housing-production', 'hifld-fire-department-data', 'national-downloadable-files-from-the-doctors-and-clinicians-data-section', 'foursquare-open-source-places', 'hospice-medicare-enrollments', 'skilled-nursing-facility-medicare-enrollments', 'la-county-active-businesses

In [4]:
dept_labor_dataset = [dataset for dataset in datasets if 'department-of-labor' in dataset]
print(dept_labor_dataset)

['department-of-labor-wage-and-hour-compliance']


In [5]:
doctors_clinicians_dataset = [dataset for dataset in datasets if 'doctors-and-clinicians' in dataset]
print(doctors_clinicians_dataset)

['national-downloadable-files-from-the-doctors-and-clinicians-data-section']


### Data Loading
We'll load two datasets:
- **Department of Labor Wage and Hour Compliance**: Records of labor law enforcement cases, including back wages owed, number of employees affected, and violation counts.
- **Doctors and Clinicians National Downloadable File**: Details on healthcare providers, including credentials, specialties, national provider ID (`npi`), and facility name.

In [6]:
# S3 URL location for department-of-labor-wage-and-hour-compliance
s3_location_0 = pk.return_free_datasets_location_by_name('department-of-labor-wage-and-hour-compliance', url=True)
print(s3_location_0)

# S3 URL location for national-downloadable-files-from-the-doctors-and-clinicians-data-section
s3_location_1 = pk.return_free_datasets_location_by_name('national-downloadable-files-from-the-doctors-and-clinicians-data-section', url=True)
print(s3_location_1)

https://placekey-free-datasets.s3.us-west-2.amazonaws.com/department-of-labor-wage-and-hour-compliance/csv/department-of-labor-wage-and-hour-compliance.csv
https://placekey-free-datasets.s3.us-west-2.amazonaws.com/national-downloadable-files-from-the-doctors-and-clinicians-data-section/csv/national-downloadable-files-from-the-doctors-and-clinicians-data-section.csv


In [None]:
# Read Department of Labor data to Pandas DataFrame and inspect it
dept_labor_df = pd.read_csv(s3_location_0, on_bad_lines='warn', low_memory=False) # handling inconsistent number of fields in some rows and mixed dtypes
dept_labor_df.head()

Skipping line 155108: expected 118 fields, saw 119
Skipping line 283686: expected 118 fields, saw 119
Skipping line 341625: expected 118 fields, saw 119

  dept_labor_df = pd.read_csv(s3_location_0, on_bad_lines='warn', low_memory=False) # handling inconsistent number of fields in some rows and mixed dtypes
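The "Skipping line" messages come from `on_bad_lines='warn'`, which asks pandas to drop rows whose field count does not match the header rather than raise an error. A minimal sketch of that behavior on a made-up CSV (the `employer` column and all values are illustrative assumptions, not taken from the dataset):

```python
import io
import warnings

import pandas as pd

# Hypothetical CSV: the second data row carries one extra field.
raw_csv = (
    "case_id,employer,bw_atp_amt\n"
    "1,Clinic A,100.0\n"
    "2,Clinic B,250.0,EXTRA\n"
    "3,Clinic C,75.5\n"
)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # keep the demo output quiet
    toy_df = pd.read_csv(io.StringIO(raw_csv), on_bad_lines="warn")

# The malformed row is skipped; the well-formed rows are kept.
kept_case_ids = toy_df["case_id"].tolist()
print(kept_case_ids)
```

Skipped rows are lost silently downstream, so it can be worth comparing the row count against the source file when only a handful of lines are affected.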


The Department of Labor dataset provides a rich collection of fields related to labor law enforcement, including quantitative metrics such as back wages owed, number of employees affected, and violation counts, as well as details tied to specific federal laws like the Fair Labor Standards Act (FLSA) and Family and Medical Leave Act (FMLA). By using Placekey, this dataset can be linked with other location-based datasets to enable diverse analyses. In this example, we will join the data to Doctors and Clinicians data to identify practices that have labor violations.

In [None]:
# Read Doctors and Clinicians data to Pandas DataFrame and inspect it
doctors_clinicians_df = pd.read_csv(s3_location_1, on_bad_lines='warn', low_memory=False)
doctors_clinicians_df.head()

The National Downloadable Files from the Doctors and Clinicians Data section contain a comprehensive set of information about healthcare providers, including identification information, professional credentials, specialties, national provider ID (`npi`), facility name, and a flag for telehealth providers. Using Placekey, we will join these datasets.

### Explore the data for insights about the Placekey join

For this example, we will explore the relationships between:

1.   Provider experience and back wages owed
2.   Violation frequency and provider specialty

We will also create a map showing the count of violations by facility, enabled by Placekey location fields.



In [None]:
# Calculate the number of cases in the Department of Labor data
case_count = len(dept_labor_df)
print(f"Number of cases in Department of Labor DataFrame: {case_count}")

# Calculate the number of unique address Placekeys in the Department of Labor data
address_placekeys_left = len(dept_labor_df['address_placekey'].unique())
print(f"Number of unique address Placekeys in Department of Labor DataFrame: {address_placekeys_left}")

In [None]:
# Count the number of unique providers in the Doctors and Clinicians data
unique_providers_count = len(doctors_clinicians_df['npi'].unique())
print(f"Number of unique providers in Doctors and Clinicians DataFrame: {unique_providers_count}")

# Count the number of unique address Placekeys in the Doctors and Clinicians data
address_placekeys_right = len(doctors_clinicians_df['address_placekey'].unique())
print(f"Number of unique address Placekeys in Doctors and Clinicians DataFrame: {address_placekeys_right}")
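One detail worth keeping in mind for the counts above: `len(series.unique())` treats a missing value as its own entry, while `series.nunique()` leaves it out by default. A small sketch with placeholder keys (the `pk-*` strings are stand-ins, not real Placekeys):

```python
import numpy as np
import pandas as pd

# Placeholder keys; np.nan stands in for a record without a Placekey.
placekeys = pd.Series(["pk-a", "pk-a", "pk-b", np.nan])

count_with_nan = len(placekeys.unique())   # NaN is counted as a distinct value
count_without_nan = placekeys.nunique()    # NaN is excluded by default

print(count_with_nan, count_without_nan)
```

If either dataset has rows without a Placekey, the two approaches differ by one, which matters only when the counts are compared closely.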

In [None]:
# Calculate statistics about provider experience by calculating the time since graduation in the grd_yr field
current_year = datetime.now().year
doctors_clinicians_df['provider_experience'] = current_year - doctors_clinicians_df['grd_yr']

# Set pandas display option to suppress scientific notation
pd.set_option('display.float_format', lambda x: '%.2f' % x)

# Describe the provider experience data
print(doctors_clinicians_df['provider_experience'].describe())

In [None]:
# Print the frequency of provider primary specialties
print(doctors_clinicians_df['pri_spec'].value_counts())

In [None]:
# Group the total back wages owed by address Placekey
back_wages_grouped = dept_labor_df.groupby('address_placekey')['bw_atp_amt'].sum().reset_index()
back_wages_grouped.head()

In [None]:
# Get the total violation count by address Placekey
violations_grouped = dept_labor_df.groupby('address_placekey')['case_violtn_cnt'].sum().reset_index()
violations_grouped.head()
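The two per-address totals can also be computed in one pass with named aggregation. This is only an alternative sketch on toy data (the rows and `pk-*` keys are made up); it also shows that `groupby` drops rows whose key is missing by default:

```python
import pandas as pd

# Toy cases; the last row has no Placekey and is dropped by groupby.
toy_cases = pd.DataFrame({
    "address_placekey": ["pk-a", "pk-a", "pk-b", None],
    "bw_atp_amt": [100.0, 50.0, 20.0, 999.0],
    "case_violtn_cnt": [3, 1, 2, 7],
})

per_address = (
    toy_cases.groupby("address_placekey")
    .agg(
        bw_atp_amt=("bw_atp_amt", "sum"),
        case_violtn_cnt=("case_violtn_cnt", "sum"),
    )
    .reset_index()
)
print(per_address)
```

Producing a single per-address table means the later join needs one merge instead of two.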

### Joining Data
We use Placekey for this join because:
- It provides a standardized way to match locations across different datasets.
- Helps in dealing with inconsistencies in address or location data.
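To make the second point concrete, here is a toy sketch in which the same location is written two different ways. The addresses, NPIs, and `pk-*` keys are invented for illustration, and the sketch assumes both spellings were already assigned the same Placekey:

```python
import pandas as pd

labor = pd.DataFrame({
    "street_address": ["123 N Main St Ste 4", "9 Elm Avenue"],
    "address_placekey": ["pk-a", "pk-b"],
    "case_violtn_cnt": [5, 2],
})
providers = pd.DataFrame({
    "street_address": ["123 NORTH MAIN STREET, SUITE 4", "77 Oak Rd"],
    "address_placekey": ["pk-a", "pk-c"],
    "npi": [1111111111, 2222222222],
})

# Joining on the raw address string finds nothing, because the text differs.
by_address = labor.merge(providers, on="street_address", how="inner")

# Joining on the shared key recovers the match.
by_placekey = labor.merge(providers, on="address_placekey", how="inner")

print(len(by_address), len(by_placekey))
```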

In [None]:
# Percent of Doctors and Clinicians records whose address Placekey also appears in the Department of Labor violations data
# (null Placekeys are dropped from the lookup so that missing values do not count as matches)
common_placekeys = doctors_clinicians_df[doctors_clinicians_df['address_placekey'].isin(dept_labor_df['address_placekey'].dropna())]
percent_common_placekeys = (len(common_placekeys) / len(doctors_clinicians_df)) * 100
print(f"Percent of Doctors and Clinicians records with address Placekey matches in the Department of Labor violations data: {percent_common_placekeys:.2f}%")
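A match rate can be measured per record or per distinct location, and the two can differ when many providers share an address. A sketch of both on placeholder keys (all values are made up):

```python
import pandas as pd

# Placeholder keys: left = violations data, right = provider records.
left_keys = pd.Series(["pk-a", "pk-b", None])
right_keys = pd.Series(["pk-a", "pk-a", "pk-c", "pk-d", None])

left_set = set(left_keys.dropna())
right_set = set(right_keys.dropna())

# Share of distinct provider locations that also appear on the left.
pct_unique = 100 * len(left_set & right_set) / len(right_set)

# Share of provider rows whose key appears on the left.
pct_rows = 100 * right_keys.isin(left_set).mean()

print(f"{pct_unique:.2f}% of locations, {pct_rows:.2f}% of rows")
```

Here one of three distinct locations matches, but two of five rows do, because two providers share the matching address.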

In [None]:
# Perform a left join so every provider row is kept, then plot provider experience versus labor violations
join_df = doctors_clinicians_df.merge(back_wages_grouped, on='address_placekey', how='left')
join_df = join_df.merge(violations_grouped, on='address_placekey', how='left')
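Because the right-hand tables hold one row per address, the merge should never add rows to the provider table. One way to check that assumption, sketched here on toy data, is pandas' `validate` and `indicator` options (the NPIs and keys below are invented):

```python
import pandas as pd

providers = pd.DataFrame({
    "npi": [1, 2, 3],
    "address_placekey": ["pk-a", "pk-a", "pk-b"],
})
wages = pd.DataFrame({
    "address_placekey": ["pk-a"],
    "bw_atp_amt": [150.0],
})

joined = providers.merge(
    wages,
    on="address_placekey",
    how="left",
    validate="many_to_one",  # raises if an address appears twice on the right
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

matched_rows = int((joined["_merge"] == "both").sum())
print(len(joined), matched_rows)
```

Rows marked `left_only` are providers at addresses with no violation record, and their `bw_atp_amt` is null rather than zero.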

In [None]:
join_df.head()

### Exploring Relationships with Graphs

#### Bar Chart: Provider Experience vs. Labor Violations

In [None]:
# Create bins for provider experience
bins = range(0, int(join_df['provider_experience'].max()) + 5, 5)
labels = [f"{i}-{i+4}" for i in bins[:-1]] # Create bin labels
join_df['experience_bins'] = pd.cut(join_df['provider_experience'], bins=bins, labels=labels, right=False)

# Calculate mean back wages owed for each experience bin
mean_back_wages = join_df.groupby('experience_bins')['bw_atp_amt'].mean().reset_index()

# Create the bar chart
plt.figure(figsize=(10, 6))
plt.bar(mean_back_wages['experience_bins'], mean_back_wages['bw_atp_amt'], color='skyblue')

plt.xlabel('Provider Experience (Years Since Graduation)')
plt.ylabel('Mean Total Back Wages Owed ($)')
plt.title('Mean Total Back Wages Owed vs. Provider Experience')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
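With `right=False`, each bin includes its left edge and excludes its right edge, so a value equal to the final edge falls outside every bin. The sketch below uses toy values and, as an assumption of this example rather than the cell above, extends the range by 6 so that the last edge always sits strictly above the maximum:

```python
import pandas as pd

experience = pd.Series([0, 4, 5, 9, 10])

# Edges at multiples of 5, with the final edge strictly above the maximum.
edges = list(range(0, int(experience.max()) + 6, 5))
labels = [f"{start}-{start + 4}" for start in edges[:-1]]

binned = pd.cut(experience, bins=edges, labels=labels, right=False)
binned_labels = binned.tolist()
print(binned_labels)
```

The value 10 lands in `10-14` here; with a final edge of exactly 10 it would be left unbinned.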

**Interpretation**:

This chart shows the mean total back wages owed by employers (e.g., healthcare facilities) for labor law violations, with providers grouped by years since graduation. The back wages are owed because of violations committed by the employers, not by the providers themselves. Providers whose address has no matching Department of Labor record are excluded, because the left join leaves their back wages null. The trend suggests that:

- Providers with 55-59 years of experience are associated with facilities that have the highest mean back wages owed, which may indicate more significant or more frequent labor law violations in these settings, or higher wages for more experienced providers.
- Less experienced providers (0-4 years) and the most experienced providers (60+ years) are linked to lower back wages owed, possibly reflecting differences in the types of facilities or employment arrangements they are involved with.

#### Bar Chart: Violation Frequency by Provider Specialty

In [None]:
# Group the data by 'pri_spec' and sum the 'case_violtn_cnt'
specialty_violations = join_df.groupby('pri_spec')['case_violtn_cnt'].sum().reset_index()

# Filter out specialties with no violations
specialty_violations = specialty_violations[specialty_violations['case_violtn_cnt'] > 0]

# Sort by violation count for better visualization
specialty_violations = specialty_violations.sort_values(by='case_violtn_cnt', ascending=False)

# Filter to include only case_violtn_cnt greater than 1000000
specialty_violations = specialty_violations[specialty_violations['case_violtn_cnt'] > 1e6]

# Create the bar chart
plt.figure(figsize=(12, 8)) # Adjust the figure size as needed
plt.bar(specialty_violations['pri_spec'], specialty_violations['case_violtn_cnt'], color='skyblue')

# Customize the plot
plt.xlabel('Provider Specialty')
plt.ylabel('Total Violation Count')
plt.title('Total Violation Count by Provider Specialty')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for readability
plt.tight_layout()
plt.show()

**Interpretation:**

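The chart shows that Physician Assistants and Nurse Practitioners are among the specialties with the highest total violation counts, which may reflect workforce size, role complexity, or practice settings. Because each address's violation total is attached to every provider row at that address, these sums grow with the number of providers per location and are best read as relative rather than absolute counts. Specialties with 1,000,000 or fewer total violations are not shown.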
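The join attaches an address-level violation total to every provider at that address, so summing by specialty counts the same total once per provider. A toy sketch of the effect, and of one possible alternative that counts each address once per specialty (all rows are invented):

```python
import pandas as pd

# Three providers at one address whose violation total is 4.
toy_join = pd.DataFrame({
    "npi": [1, 2, 3],
    "pri_spec": ["NURSE PRACTITIONER", "NURSE PRACTITIONER", "CARDIOLOGY"],
    "address_placekey": ["pk-a", "pk-a", "pk-a"],
    "case_violtn_cnt": [4, 4, 4],
})

# Row-level sum: the address total is repeated for each provider.
row_level = toy_join.groupby("pri_spec")["case_violtn_cnt"].sum()

# Address-level sum: keep one row per (specialty, address) before summing.
address_level = (
    toy_join.drop_duplicates(["pri_spec", "address_placekey"])
    .groupby("pri_spec")["case_violtn_cnt"]
    .sum()
)

print(row_level["NURSE PRACTITIONER"], address_level["NURSE PRACTITIONER"])
```

Which version is appropriate depends on the question: the row-level sum weights locations by staff size, while the address-level sum does not.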

### Preparing Data for Mapping

With the joined dataset, we can look at the geographic distribution of violations associated with each facility in the Doctors and Clinicians data.

In [None]:
# Convert to float where necessary
join_df['geocode_latitude'] = join_df['geocode_latitude'].astype(float)
join_df['geocode_longitude'] = join_df['geocode_longitude'].astype(float)
# Select relevant columns for mapping
map_data = join_df[['geocode_latitude', 'geocode_longitude', 'case_violtn_cnt', 'bw_atp_amt', 'facility_name', 'state']]
map_data = map_data.groupby(['facility_name', 'geocode_latitude', 'geocode_longitude']).agg({'case_violtn_cnt': 'sum', 'bw_atp_amt': 'sum', 'state': 'first'}).reset_index()

In [None]:
# Create a GeoDataFrame
gdf = gpd.GeoDataFrame(
    map_data,
    geometry=gpd.points_from_xy(map_data['geocode_longitude'], map_data['geocode_latitude'])
)
# Set the coordinate reference system (CRS) to WGS84, which is standard for latitude and longitude
gdf.crs = 'EPSG:4326'

### Mapping Violations by Placekey with Folium

In [None]:
# Define the latitude and longitude for central California
ca_lat = 36.7468
ca_lon = -119.7726

# Create a base map centered on central California
map_center = [ca_lat, ca_lon]
attr='© OpenStreetMap © CartoDB'
m = folium.Map(location=map_center, zoom_start=10, tiles='cartodbpositron', attr=attr)

# Filter the GeoDataFrame to show only facilities in California (state contains 'CA'); rows with a null state are excluded
gdf = gdf[gdf['state'].str.contains('CA', na=False)]

In [None]:
# Calculate min and max values for scaling (excluding zero and negative values)
values = gdf['case_violtn_cnt'].dropna()  # Drop null values
min_value = values[values > 0].min()  # Exclude zero and negative values
max_value = values.max()

# Create a legend
legend_html = """
     <div style="position: fixed;
     bottom: 50px; left: 50px; width: 150px; height: 150px;
     border:2px solid grey; z-index:9999; font-size:14px;
     background-color:white;
     ">&nbsp; Violations by Provider Facility <br>
     &nbsp; <i class="fa fa-circle fa-1x" style="color:green"></i>&nbsp; No Violations <br>
     &nbsp; <i class="fa fa-circle fa-1x" style="color:orange"></i>&nbsp; Violation(s) <br>
     &nbsp; Circle size represents cumulative violation count<br>
     </div>
     """
m.get_root().html.add_child(folium.Element(legend_html))

# Iterate through the GeoDataFrame and create a Circle for each point
for index, row in gdf.iterrows():
    # Determine the color and radius based on 'case_violtn_cnt'
    if pd.isnull(row['case_violtn_cnt']) or row['case_violtn_cnt'] <= 0:
        color = 'green'
        radius = 4  # Small radius for null or non-positive values
    else:
        color = 'orange'
        # Scale radius using a logarithmic scale
        radius = 1 + 9 * np.log10(row['case_violtn_cnt'] / min_value) / np.log10(max_value / min_value)  # Adjust the base and scaling factors as needed

    # Create a popup with HTML content
    popup_html = f"""
        <b>Facility Name:</b> {row['facility_name']}<br>
        <b>Total Violations:</b> {row['case_violtn_cnt']}<br>
        <b>Total Back Wages Owed:</b> {row['bw_atp_amt']}
    """
    popup = folium.Popup(popup_html, max_width=300)

    # CircleMarker takes its radius in pixels; folium.Circle would read the 1-10 value as meters
    folium.CircleMarker(
        location=[row['geometry'].y, row['geometry'].x],
        radius=radius,
        color=color,
        fill=True,
        fill_color=color,
        fill_opacity=0.7,
        popup=popup
    ).add_to(m)
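The radius formula in the loop maps the smallest positive count to 1 and the largest to 10 on a logarithmic scale. A standalone sketch of that mapping follows; the function name, the `r_min`/`r_max` parameters, and the fallback for a single distinct count are assumptions added for illustration:

```python
import math


def log_scaled_radius(count, min_count, max_count, r_min=1.0, r_max=10.0):
    """Map count in [min_count, max_count] to [r_min, r_max] on a log scale."""
    if max_count == min_count:
        # Only one distinct positive count: avoid dividing by log10(1) == 0.
        return (r_min + r_max) / 2
    fraction = math.log10(count / min_count) / math.log10(max_count / min_count)
    return r_min + (r_max - r_min) * fraction


smallest = log_scaled_radius(1, 1, 100)
middle = log_scaled_radius(10, 1, 100)
largest = log_scaled_radius(100, 1, 100)
single = log_scaled_radius(5, 5, 5)
print(smallest, middle, largest, single)
```

The guard matters after filtering to a small area, where every facility with violations may share the same count.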

In [None]:
# display the map
m

In [None]:
# Optionally, save the map as an html file to add it to a website
m.save('map.html')