### Data Cleaning + Analysis Notebook 3

***NOTE***: 

- Dataset acquired from Data.gov / U.S. Department of Health & Human Services

- This notebook handles missing values, aligns variable definitions across data sources, and performs an initial exploratory analysis to verify data consistency and identify early trends.

In [11]:
import pandas as pd

# Load data
visits = pd.read_csv("Data_Files_Uncleaned/US_Healthcare_Visits.csv")


In [12]:
# Keep only relevant columns
visits_clean = visits[[
    "YEAR",
    "AGE",
    "PANEL",
    "ESTIMATE"
]].copy()

# Rename for clarity
visits_clean = visits_clean.rename(columns={
    "YEAR": "Year",
    "AGE": "Age_Group",
    "PANEL": "Category",
    "ESTIMATE": "Visits_Thousands"
})


In [13]:
# Convert numeric columns
visits_clean["Year"] = visits_clean["Year"].astype(int)
visits_clean["Visits_Thousands"] = pd.to_numeric(visits_clean["Visits_Thousands"], errors="coerce")

# Drop missing values
visits_clean = visits_clean.dropna(subset=["Visits_Thousands"])

# Add visits in millions for readability
visits_clean["Visits_Millions"] = visits_clean["Visits_Thousands"] / 1000
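
Before dropping rows, it can help to know how many values were already missing and how many became `NaN` only because they were non-numeric. The following is a minimal sketch on toy data; the `"*"` marker and all values are illustrative assumptions, not taken from the actual file.

```python
import pandas as pd

# Toy stand-in for the raw columns (values are illustrative, not from the dataset)
raw = pd.DataFrame({
    "YEAR": [2000, 2001, 2002, 2003],
    "ESTIMATE": ["1014848", "*", None, "1114504"],
})

# Coerce to numeric: unparseable entries such as "*" become NaN
estimate = pd.to_numeric(raw["ESTIMATE"], errors="coerce")

# Separate values that were already missing from those lost during coercion
already_missing = int(raw["ESTIMATE"].isna().sum())
lost_in_coercion = int(estimate.isna().sum()) - already_missing
rows_kept = int(estimate.notna().sum())

print(f"Already missing: {already_missing}")
print(f"Non-numeric, coerced to NaN: {lost_in_coercion}")
print(f"Rows kept after dropna: {rows_kept}")
```

A large `lost_in_coercion` count would suggest the raw column uses placeholder symbols worth inspecting before they are silently dropped.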


In [18]:
# Preview cleaned dataset
visits_clean.head()


| | Year | Age_Group | Category | Visits_Thousands | Visits_Millions |
|---|---|---|---|---|---|
| 0 | 2000 | All ages | All places | 1014848.0 | 1014.848 |
| 1 | 2001 | All ages | All places | 1142420.0 | 1142.42 |
| 2 | 2002 | All ages | All places | 1157798.0 | 1157.798 |
| 3 | 2003 | All ages | All places | 1114504.0 | 1114.504 |
| 4 | 2004 | All ages | All places | 1106067.0 | 1106.067 |


In [15]:
# Save cleaned dataset
visits_clean.to_csv("us_healthcare_visits_cleaned.csv", index=False)

print("Cleaned dataset saved as 'us_healthcare_visits_cleaned.csv'")


Cleaned dataset saved as 'us_healthcare_visits_cleaned.csv'


***Data Analysis (validating data integrity)***: 

- The analysis below checks whether the cleaned data is internally consistent and whether it yields meaningful insights.
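
One possible integrity check is to compare the reported "All places" total with the sum of the individual categories for each year. The sketch below uses made-up numbers in the cleaned schema. The `"Physician offices"` label, the 1% tolerance, and the premise that the total should equal the sum of the categories are all assumptions to verify against the real data.

```python
import pandas as pd

# Toy data in the cleaned schema; numbers and the "Physician offices" label are made up
toy = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2001, 2001, 2001],
    "Age_Group": ["All ages"] * 6,
    "Category": ["All places", "Physician offices", "Hospital emergency departments"] * 2,
    "Visits_Millions": [100.0, 80.0, 20.0, 110.0, 85.0, 20.0],
})

is_total = toy["Category"] == "All places"

# Reported total per year
reported = toy[is_total].set_index("Year")["Visits_Millions"]

# Sum of the individual categories per year
summed = toy[~is_total].groupby("Year")["Visits_Millions"].sum()

# Relative gap between the two; a large gap flags years worth inspecting
check = pd.DataFrame({"reported": reported, "summed": summed})
check["rel_gap"] = (check["reported"] - check["summed"]).abs() / check["reported"]

tolerance = 0.01  # hypothetical threshold
suspect_years = check.index[check["rel_gap"] > tolerance].tolist()

print(check)
print("Years exceeding tolerance:", suspect_years)
```

If the categories in the real file are not mutually exclusive or not exhaustive, a gap is expected and this check would need adjusting.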

In [39]:
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

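# Load cleaned dataset
# Note: the file was saved above to the working directory; this path assumes
# it has since been moved into Data_Files_Cleaned/
visits = pd.read_csv("Data_Files_Cleaned/us_healthcare_visits_cleaned.csv")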


In [55]:
# --- Filter for Hospital Emergency Departments only ---
er_visits = visits[
    (visits["Category"] == "Hospital emergency departments") &
    (visits["Age_Group"] == "All ages")
]

# --- Keep only "All ages" data ---
visits_all = visits[visits["Age_Group"] == "All ages"]

# --- Filter out "All places" category ---
visits_filtered = visits_all[visits_all["Category"] != "All places"]

In [60]:
# --- Sort by Year ---
er_visits = er_visits.sort_values("Year")

# --- Create bar chart ---
fig = px.bar(
    er_visits,
    x="Year",
    y="Visits_Millions",
    title="Emergency Room Visits in the U.S. (2000–2018)",
    labels={"Visits_Millions": "ER Visits (Millions)", "Year": "Year"},
    color="Visits_Millions",
    color_continuous_scale="Reds"
)

# --- Format layout ---
fig.update_layout(
    title_x=0.5,
    plot_bgcolor="white",
    coloraxis_showscale=False,  # hide the color bar; showlegend does not affect a continuous color scale
    height=500
)

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=True, gridcolor="lightgray")

fig.show()

In [57]:
# --- Create stacked bar chart ---
fig = px.bar(
    visits_filtered,
    x="Year",
    y="Visits_Millions",
    color="Category",
    title="Distribution of U.S. Healthcare Visits by Category (2000–2018)",
    labels={"Visits_Millions": "Visits (Millions)", "Year": "Year"},
    color_discrete_sequence=px.colors.qualitative.Set2
)

# --- Format layout ---
fig.update_layout(
    barmode="stack",
    title_x=0.5,
    plot_bgcolor="white",
    legend_title_text="Category",
    height=550
)

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=True, gridcolor="lightgray")

fig.show()


***NOTE***: 

- Physician office visits dominate, accounting for the largest share of healthcare utilization.

- Emergency department visits are much lower but show a steady increase, suggesting growing reliance on ERs for care access.

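- Outpatient departments and All places (the aggregated total) move in parallel, reflecting national healthcare demand trends. Note that All places is excluded from the stacked chart above so that the total is not counted twice.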
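
The "steady increase" in emergency department visits noted above can be quantified rather than read off the chart. The following is a minimal sketch using year-over-year percent change; the series is toy data in the cleaned schema, so the values are illustrative only.

```python
import pandas as pd

# Toy ER series in the cleaned schema; values are illustrative only
er = pd.DataFrame({
    "Year": [2000, 2001, 2002, 2003],
    "Visits_Millions": [100.0, 104.0, 102.0, 110.0],
}).sort_values("Year")

# Year-over-year percent change (the first year has no prior value, so it is NaN)
er["YoY_Pct"] = er["Visits_Millions"].pct_change() * 100

# Count years with growth versus decline
n_up = int((er["YoY_Pct"] > 0).sum())
n_down = int((er["YoY_Pct"] < 0).sum())

# Overall change between the first and last year
first, last = er["Visits_Millions"].iloc[0], er["Visits_Millions"].iloc[-1]
total_change_pct = (last / first - 1) * 100

print(er)
print(f"Years up: {n_up}, years down: {n_down}, overall change: {total_change_pct:.1f}%")
```

Applied to `er_visits`, a mostly positive `YoY_Pct` column would support the "steady increase" reading, while frequent sign changes would suggest a noisier trend.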

The U.S. healthcare visits dataset is aggregated at the national level, likely derived from broad surveys or administrative sources that emphasize nationwide patterns rather than local detail. Because it lacks state or county identifiers, it’s not possible to examine regional differences or assess how local access to primary care may influence emergency room usage. 

Finding recent, granular county-level data to fill this gap was also challenging, because such localized datasets are often fragmented across agencies, subject to privacy restrictions, or not publicly released in real time.
