
# **Project Name: UNEMPLOYMENT RATE ANALYSIS BY REGION**

Project Type - EXPLORATORY DATA ANALYSIS and HYPOTHESIS TESTING

Contribution - Individual

### Submitted By - Fida Taneem

## **PROJECT SUMMARY**

**OBJECTIVE:**


To perform a detailed Exploratory Data Analysis (EDA) on U.S. unemployment data, aiming to:

1. Identify patterns and anomalies in unemployment trends across states and regions.

2. Compare pre-crisis, crisis, and post-crisis periods, focusing on:

  * The 2008 Global Financial Crisis.

  * The 2020 COVID-19 pandemic.

3. Understand regional disparities in unemployment trends.

4. Provide actionable insights that can guide:

  * Economic policy analysis.

  * Labor market interventions.

  * Investment or business decisions.

**BUSINESS CONTEXT**

Understanding unemployment trends is vital for:

1. Government & Policy Makers: To assess the effectiveness of past economic policies and plan for future shocks.

2. Economists & Researchers: To understand labor market behaviors and regional disparities.

3. Businesses & Investors: To evaluate regional economic health, labor availability, and risk exposure.

4. Citizens & Job Seekers: To inform relocation or career decisions based on labor market trends.

The U.S. Bureau of Labor Statistics (BLS) provides detailed monthly unemployment data at the state level, while the World Bank offers a consistent national-level unemployment series, suitable for long-term and international comparison.

**Subject:** Economics

**Data Sources:** U.S. Bureau of Labor Statistics (BLS), World Bank

**DATA OVERVIEW**

1. **Bureau of Labor Statistics (BLS) – State-Level Data**

  Frequency: Monthly.

  Coverage: All 50 U.S. states + D.C.

  *Metrics:*

  Unemployment Rate (%).

  Labor Force Participation.

  Employment & Unemployment counts.

  Period: 2000–2025 (ideal).


2. **World Bank – National-Level Data**

  Frequency: Annual.

  Coverage: U.S. only (but globally comparable).

  Metric: National Unemployment Rate (%).

  Period: 2000–2024.

# **DATA PREPROCESSING**


**1. Data Collection**

BLS CSV downloads from bls.gov

World Bank unemployment data via data.worldbank.org


**2. Cleaning**

Handle missing values (e.g., for recent months or certain territories).

Standardize date formats.

Ensure consistent state names.

Align monthly (BLS) and annual (World Bank) data for combined plots.
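
The cleaning and alignment steps listed above can be sketched roughly as follows. This is a minimal illustration on a made-up monthly frame: the column names `date`, `state_name`, and `unemp_rate`, the values, and the choice of linear interpolation for gaps are all assumptions, not the project's actual BLS extract or method.

```python
import pandas as pd

# Hypothetical monthly state-level records (values are made up for illustration)
monthly = pd.DataFrame({
    "date": ["2020-01-01", "2020-02-01", "2020-03-01", "2021-01-01", "2021-02-01"],
    "state_name": [" nevada", "Nevada", "NEVADA ", "Nevada", "Nevada"],
    "unemp_rate": [4.0, None, 6.0, 8.0, 6.0],
})

# Standardize date formats and state names
monthly["date"] = pd.to_datetime(monthly["date"])
monthly["state_name"] = monthly["state_name"].str.strip().str.title()

# Fill gaps by linear interpolation within each state (one of several reasonable choices)
monthly["unemp_rate"] = (
    monthly.groupby("state_name")["unemp_rate"]
    .transform(lambda s: s.interpolate())
)

# Collapse monthly rows to yearly means so they line up with an annual series
annual = (
    monthly.assign(year=monthly["date"].dt.year)
    .groupby(["state_name", "year"], as_index=False)["unemp_rate"]
    .mean()
)
print(annual)
```

Once both sources are on a yearly grain, they can be merged or plotted on a shared `year` axis.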


**3. Feature Engineering**

Add:

Region column (Northeast, Midwest, South, West).

Crisis Period Flags (2008–2010, 2020–2021).

Year-over-year unemployment change.

Yearly averages.
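
A rough sketch of these engineered features is shown below. The frame, its values, and the partial `STATE_TO_REGION` mapping are hypothetical and only illustrate the idea; the column names mirror the ones used later in this notebook.

```python
import pandas as pd

# Assumed subset of a state-to-Census-region mapping; extend to all states as needed
STATE_TO_REGION = {
    "New York": "Northeast",
    "Michigan": "Midwest",
    "Texas": "South",
    "Nevada": "West",
}
CRISIS_YEARS = set(range(2008, 2011)) | {2020, 2021}  # 2008-2010 and 2020-2021

# Hypothetical yearly rates (made-up values)
df = pd.DataFrame({
    "state_name": ["Nevada"] * 3 + ["Texas"] * 3,
    "year": [2007, 2008, 2009] * 2,
    "unemp_rate": [4.5, 6.5, 11.0, 4.3, 4.8, 7.5],
})

df["region"] = df["state_name"].map(STATE_TO_REGION)
df["is_crisis"] = df["year"].isin(CRISIS_YEARS)

# Year-over-year change in percentage points, computed within each state
df = df.sort_values(["state_name", "year"])
df["yoy_change"] = df.groupby("state_name")["unemp_rate"].diff()
print(df)
```

The first year of each state has no previous value, so its `yoy_change` is `NaN` by construction.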

# **TECHNIQUES USED IN EDA**

The analysis uses the following EDA techniques:

**1. Univariate Analysis**

Histogram, Boxplot of unemployment rates.

Distribution analysis across states or years.

**2. Bivariate Analysis**

Line plots: Unemployment over time (national vs states).

Heatmaps: State unemployment by year/month.

Bar charts: Yearly average unemployment per state.

**3. Multivariate Analysis**

Compare unemployment vs. labor force participation or GDP growth.

Cluster states based on unemployment patterns.

**4. Time Series Analysis**

Rolling averages.

Trend/seasonality decomposition.

Crisis period comparison (2008 vs 2020):

Pre-crisis (2006–2007, 2018–2019)

Crisis (2008–2010, 2020–2021)

Recovery (2011–2015, 2022–2024)
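
The phase windows above can be turned into labels with a small helper. This is a sketch under assumptions: `PHASES`, `label_phase`, and the yearly values are illustrative names and made-up numbers, not part of the project data.

```python
import pandas as pd

# Phase windows taken from the list above (inclusive year ranges)
PHASES = {
    "2008 crisis": {"Pre-crisis": (2006, 2007), "Crisis": (2008, 2010), "Recovery": (2011, 2015)},
    "COVID-19": {"Pre-crisis": (2018, 2019), "Crisis": (2020, 2021), "Recovery": (2022, 2024)},
}

def label_phase(year):
    """Return (episode, phase) for a year, or (None, None) if outside every window."""
    for episode, windows in PHASES.items():
        for phase, (start, end) in windows.items():
            if start <= year <= end:
                return episode, phase
    return None, None

# Hypothetical national yearly series (made-up values)
series = pd.DataFrame({
    "year": [2007, 2009, 2012, 2016, 2019, 2020, 2023],
    "unemp_rate": [4.6, 9.3, 8.1, 4.9, 3.7, 8.1, 3.6],
})
labels = [label_phase(y) for y in series["year"]]
series["episode"] = [episode for episode, _ in labels]
series["phase"] = [phase for _, phase in labels]

# Mean rate per episode and phase; years outside every window are dropped
summary = (
    series.dropna(subset=["phase"])
    .groupby(["episode", "phase"])["unemp_rate"]
    .mean()
)
print(summary)
```

Keeping the windows in one dictionary means the same definitions can drive both the summary table and any shaded spans on a plot.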



**5. Correlation and Trend Analysis**

Correlation heatmap (between states or variables).

Identify synchronized/unusual trends across time.

**6. Hypothesis Testing**

# **GITHUB LINK:**

https://github.com/Fidaaz2521/Unemplyment_Rate_Analysis_MiniProject1

# **LET'S BEGIN**

**IMPORTING LIBRARIES**


In [None]:
import numpy as np
import pandas as pd
import math
from numpy import loadtxt
import seaborn as sns
import matplotlib.pyplot as plt
import missingno

**Importing the Dataset into Google Colab**

In [None]:


# State-level unemployment data
df1 = pd.read_csv('https://raw.githubusercontent.com/Fidaaz2521/Unemplyment_Rate_Analysis_MiniProject1/main/us_states_unemployment_2000_2024_synthetic.csv')

# World Bank (region-level) unemployment data
df2 = pd.read_csv('https://raw.githubusercontent.com/Fidaaz2521/Unemplyment_Rate_Analysis_MiniProject1/main/worldbank_regions_unemployment_2000_2024_synthetic.csv')


# **Understanding the data**

In [None]:
df1.head()



In [None]:
df1.tail()

In [None]:
df2.head()

In [None]:
df2.tail()

In [None]:
df1.shape

In [None]:
df2.shape

In [None]:
df1.info()

In [None]:
df2.info()

# **Total Unique values in each column**

In [None]:
df1.nunique()

In [None]:
df1["region"].unique()

In [None]:
df2.nunique()

In [None]:
df2["wb_region"].unique()

# **Finding total null values in each column**

In [None]:
df1.isnull().sum()

In [None]:
df2.isnull().sum()

In [None]:
# Visualizing the missing values in df1 using missingno
missingno.matrix(df1)
# Visualizing the missing values in df2 using missingno
missingno.matrix(df2)

# **Finding total duplicates in the dataset**

In [None]:
print("Duplicates in state level unemployment data:",df1.duplicated().sum())
print("Duplicates in World Bank (region-level) unemployment data ",df2.duplicated().sum())

#No duplicates; the data is clean.

# **Understanding the variables**

In [None]:
# Dataset Columns
print("Features in state level unemployment data: ",df1.columns.tolist())
print("Features in World Bank (region-level) unemployment data: ",df2.columns.tolist())

In [None]:
# Dataset Describe
df1.describe()



In [None]:
# Dataset Describe
df2.describe()


* Regional averages.

* National average unemployment by year.

* Identify highest & lowest unemployment regions.



In [None]:
df1.groupby("region")["unemp_rate"].describe()


In [None]:
df2.groupby("wb_region")["unemp_rate"].describe()

# **WHAT DO I KNOW ABOUT THE DATASET?**


The dataset I worked on contains unemployment rates from 2000 to 2024 across US states as well as World Bank regions.

**Structure:**

Each record includes a year, geographic identifier (state or region), and the corresponding unemployment rate (%).
The data is in a panel/time-series format, meaning multiple states and regions are observed over multiple years.
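
As a quick illustration of what the panel format implies, the sketch below checks that every state-year combination appears exactly once and reshapes the data to a wide table. The tiny frame and its values are hypothetical; only the column names follow this notebook.

```python
import pandas as pd

# Hypothetical long-format panel: one row per (state, year); values are made up
panel = pd.DataFrame({
    "state_name": ["Nevada", "Nevada", "Texas", "Texas"],
    "year": [2019, 2020, 2019, 2020],
    "unemp_rate": [4.0, 13.0, 3.5, 7.6],
})

# A balanced panel has exactly one observation for every state-year combination
counts = panel.groupby(["state_name", "year"]).size()
n_expected = panel["state_name"].nunique() * panel["year"].nunique()
is_balanced = bool((counts == 1).all()) and len(counts) == n_expected

# Wide view: states as rows, years as columns
wide = panel.pivot(index="state_name", columns="year", values="unemp_rate")
print(is_balanced)
print(wide)
```

The same check can be applied to a monthly panel by adding the month to the grouping keys.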

**Coverage:**

It covers all 50 US states + DC and several global regions, giving both a national and international perspective on unemployment trends.

**Purpose:**

The dataset helps analyze:

Trends over time (long-term unemployment changes from 2000–2024).

Crisis impacts (e.g., 2008 financial crisis, COVID-19 in 2020).

Regional disparities (which states or regions faced higher unemployment).

Comparisons between US states and World Bank regions.


**Value:**

The dataset is useful for economic research, policy analysis, forecasting, and identifying vulnerable states/regions during crises.

# **UNIVARIATE ANALYSIS**

To see how unemployment rates are distributed across states and years.

In [None]:
# Histogram of unemployment rates
plt.figure(figsize=(10,6))
sns.histplot(df1["unemp_rate"], bins=30, kde=True, color="skyblue")
plt.title("Histogram of Unemployment Rates (All US States, 2000–2024)")
plt.xlabel("Unemployment Rate (%)")
plt.ylabel("Frequency")
plt.show()



In [None]:
# Boxplot of unemployment rates (overall distribution)
plt.figure(figsize=(8,5))
sns.boxplot(x=df1["unemp_rate"], color="lightcoral")
plt.title("Boxplot of Unemployment Rates (2000–2024)")
plt.xlabel("Unemployment Rate (%)")
plt.show()



In [None]:
# Distribution across years
plt.figure(figsize=(12,6))
sns.boxplot(x="year", y="unemp_rate", data=df1, showfliers=False)
plt.xticks(rotation=90)
plt.title("Distribution of Unemployment Rates Across Years")
plt.ylabel("Unemployment Rate (%)")
plt.show()

### **UNIVARIATE**

**Histograms & Boxplots of Unemployment Rates**

**Why chosen:** To analyze overall distribution and detect outliers.

**Insights:** Most unemployment rates fall between 4–8%, but crisis years create long right tails.

**Impact:** Negative – high volatility during crises means instability in labor markets.
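
The boxplot's outlier rule and the "long right tail" can also be checked numerically. The sketch below applies the 1.5 × IQR rule and computes skewness on a made-up series; the values are hypothetical and chosen only so that one crisis-like spike stands out.

```python
import pandas as pd

# Hypothetical rates with one crisis-like spike (made-up values)
rates = pd.Series([4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 14.0], name="unemp_rate")

q1, q3 = rates.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # same 1.5 * IQR rule a default boxplot uses
lower_fence = q1 - 1.5 * iqr

outliers = rates[(rates < lower_fence) | (rates > upper_fence)]
print(f"IQR fences: [{lower_fence:.2f}, {upper_fence:.2f}]")
print("Outliers:", outliers.tolist())
print(f"Skewness: {rates.skew():.2f}")  # positive values indicate a right tail
```

On the real data, the same lines can be run on `df1["unemp_rate"]` to list which state-year rows fall outside the fences.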


# **BIVARIATE ANALYSIS**

To check how unemployment relates across time, states, and regions.


In [None]:

# Plot sample states
sample_states = ["California","Texas","New York","Florida","Nevada"]
for state in sample_states:
    state_data = df1[df1["state_name"]==state]
    sns.lineplot(data=state_data, x="year", y="unemp_rate", label=state, alpha=0.7)

plt.title("Unemployment Trends: Selected US States")
plt.ylabel("Unemployment Rate (%)")
plt.xlabel("Year")
plt.legend()
plt.show()



In [None]:
# Heatmap: State unemployment by year
pivot_us = df1.pivot_table(index="state_name", columns="year", values="unemp_rate")
plt.figure(figsize=(15,10))
sns.heatmap(pivot_us, cmap="coolwarm", cbar_kws={'label': 'Unemployment Rate (%)'})
plt.title("Heatmap of US State Unemployment (2000–2024)")
plt.show()



In [None]:
# Bar chart: Yearly average unemployment by state (for a specific year, say 2020)
year_choice = 2020
yearly_state_avg = df1[df1["year"]==year_choice].sort_values("unemp_rate", ascending=False)

plt.figure(figsize=(12,8))
sns.barplot(data=yearly_state_avg, x="unemp_rate", y="state_name", palette="magma")
plt.title(f"Average Unemployment Rates by State in {year_choice}")
plt.xlabel("Unemployment Rate (%)")
plt.ylabel("State")
plt.show()

**BIVARIATE**

**Line Plot (US National Unemployment Over Time)**

Why chosen: To track unemployment trends across years.

Insights: Steady before 2008, sharp spike in 2008–2010, recovery after 2011, another sharp spike in 2020, decline post-2021.

Impact: Negative (crises cause major disruptions) but Positive (recovery patterns show resilience).



**Heatmap (State Unemployment by Year)**

Why chosen: To visualize unemployment differences across states and years.

Insights: Nevada, Michigan, and California faced severe unemployment peaks; other states stayed closer to national average.

Impact: Positive – policymakers can identify vulnerable states for targeted aid.

**Bar Chart (Yearly Average Unemployment per State)**

Why chosen: To rank states by average unemployment.

Insights: West Virginia, Nevada, and Michigan show consistently higher rates.

Impact: Positive – helps direct long-term employment policies to specific states.

In [None]:
sns.pairplot(df2)


# 1. US Overall Trend


In [None]:

plt.figure(figsize=(12,6))
us_trend = df1.groupby("year")["unemp_rate"].mean()
sns.lineplot(x=us_trend.index, y=us_trend.values, marker="o")
plt.title("US Average Unemployment Rate (2000–2024)")
plt.ylabel("Unemployment Rate (%)")
plt.xlabel("Year")
plt.show()

# 2. US Regional Trends

In [None]:
plt.figure(figsize=(12,6))
us_region_trend = df1.groupby(["year","region"])["unemp_rate"].mean().reset_index()
sns.lineplot(data=us_region_trend, x="year", y="unemp_rate", hue="region", marker="o")
plt.title("US Regional Unemployment Trends (2000–2024)")
plt.ylabel("Unemployment Rate (%)")
plt.show()

# 3. Distribution Across States (Boxplot)

In [None]:
plt.figure(figsize=(14,6))
sns.boxplot(data=df1, x="region", y="unemp_rate")
plt.title("Distribution of State Unemployment Rates by Region")
plt.ylabel("Unemployment Rate (%)")
plt.show()


# 5. Top 5 States with Highest Average Unemployment

In [None]:
top5_high = df1.groupby("state_name")["unemp_rate"].mean().nlargest(5)
plt.figure(figsize=(8,5))
sns.barplot(x=top5_high.values, y=top5_high.index)
plt.title("Top 5 States with Highest Avg. Unemployment (2000–2024)")
plt.xlabel("Average Unemployment Rate (%)")
plt.show()


# 6. Top 5 States with Lowest Average Unemployment

In [None]:
top5_low = df1.groupby("state_name")["unemp_rate"].mean().nsmallest(5)
plt.figure(figsize=(8,5))
sns.barplot(x=top5_low.values, y=top5_low.index, color="green")
plt.title("Top 5 States with Lowest Avg. Unemployment (2000–2024)")
plt.xlabel("Average Unemployment Rate (%)")
plt.show()

In [None]:
#Top 10 States by Average Unemployment Rate in 2020

year = 2020
state_share = df1[df1["year"]==year].groupby("state")["unemp_rate"].mean()

plt.figure(figsize=(10,8))
# Note: autopct shows each state's share of the ten rates' total, not the unemployment rate itself
state_share.nlargest(10).plot.pie(autopct='%1.1f%%', startangle=140, colormap="tab20")
plt.title(f"Top 10 States by Average Unemployment Rate in {year}")
plt.ylabel("")
plt.show()

# 7. Global Trend (World Bank Regions Average)

In [None]:
plt.figure(figsize=(12,6))
world_trend = df2.groupby("year")["unemp_rate"].mean()
sns.lineplot(x=world_trend.index, y=world_trend.values, marker="o", color="red")
plt.title("Global Average Unemployment Rate (2000–2024)")
plt.ylabel("Unemployment Rate (%)")
plt.xlabel("Year")
plt.show()

# 8. World Regional Trends

In [None]:
plt.figure(figsize=(12,6))
sns.lineplot(data=df2, x="year", y="unemp_rate", hue="wb_region", marker="o")
plt.title("World Bank Regional Unemployment Trends (2000–2024)")
plt.ylabel("Unemployment Rate (%)")
plt.show()


# 9. Compare US vs World Avg

In [None]:
plt.figure(figsize=(12,6))
sns.lineplot(x=us_trend.index, y=us_trend.values, label="US Avg", marker="o")
sns.lineplot(x=world_trend.index, y=world_trend.values, label="World Avg", marker="s", color="red")
plt.title("US vs World Average Unemployment (2000–2024)")
plt.ylabel("Unemployment Rate (%)")
plt.xlabel("Year")
plt.legend()
plt.show()

# 10.  Correlation Heatmap (US data only)

In [None]:
plt.figure(figsize=(6,4))
sns.heatmap(df1.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap (US Data)")
plt.show()


**Correlation Heatmap**

Why chosen: To examine if states’ unemployment rates move together.

Insights: Strong correlations across most states → crises affect the entire economy.

Impact: Positive – helps plan national-level interventions.
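
The heatmap cell above correlates the numeric columns of `df1`. To compare states with each other, one option is to pivot the data so that each state becomes a column before calling `corr()`. The sketch below uses a made-up frame with placeholder state names `A`, `B`, and `C`; it is an illustration, not the notebook's result.

```python
import pandas as pd

# Hypothetical long-format data (made-up values)
long_df = pd.DataFrame({
    "state_name": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "year": [2007, 2008, 2009, 2010] * 3,
    "unemp_rate": [4.0, 6.0, 10.0, 9.0,    # A
                   5.0, 7.0, 11.0, 10.0,   # B: A shifted up by one point
                   6.0, 5.0, 4.0, 5.0],    # C: moves against A
})

# Years as rows, states as columns, so corr() compares state series pairwise
wide = long_df.pivot(index="year", columns="state_name", values="unemp_rate")
state_corr = wide.corr()
print(state_corr.round(2))
```

The resulting `state_corr` matrix can be passed to `sns.heatmap` in the same way as the correlation matrix above.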

# 11.  Trend Comparison: Pre-2008, Post-2008, Post-2020

In [None]:
plt.figure(figsize=(12,6))
df1["period"] = pd.cut(df1["year"], bins=[1999,2007,2019,2024], labels=["Pre-2008","2008–2019","2020+"])
sns.boxplot(data=df1, x="period", y="unemp_rate")
plt.title("US Unemployment Distribution Across Economic Periods")
plt.ylabel("Unemployment Rate (%)")
plt.show()


# **Trend Analysis (Time Series Analysis)**



In [None]:
import statsmodels.api as sm

# Rolling Averages (National Avg)
us_trend = df1.groupby("year")["unemp_rate"].mean().reset_index()
us_trend["rolling_avg"] = us_trend["unemp_rate"].rolling(window=3, center=True).mean()

plt.figure(figsize=(12,6))
plt.plot(us_trend["year"], us_trend["unemp_rate"], label="US Avg Unemployment", marker="o")
plt.plot(us_trend["year"], us_trend["rolling_avg"], label="3-Year Rolling Avg", linestyle="--", color="red")
plt.title("US Unemployment with Rolling Average (2000–2024)")
plt.xlabel("Year")
plt.ylabel("Unemployment Rate (%)")
plt.legend()
plt.show()



In [None]:
# Crisis Period Comparison
def plot_crisis_comparison(state_name):
    data = df1[df1["state_name"]==state_name]

    plt.figure(figsize=(12,6))
    sns.lineplot(data=data, x="year", y="unemp_rate", marker="o", label=state_name)
    plt.axvspan(2006, 2007, alpha=0.2, color="green", label="Pre-2008")
    plt.axvspan(2008, 2010, alpha=0.3, color="red", label="2008 Crisis")
    plt.axvspan(2011, 2015, alpha=0.2, color="blue", label="Recovery 2011–2015")
    plt.axvspan(2018, 2019, alpha=0.2, color="green")
    plt.axvspan(2020, 2021, alpha=0.3, color="orange", label="COVID Crisis")
    plt.axvspan(2022, 2024, alpha=0.2, color="blue")

    plt.title(f"Crisis Period Comparison: {state_name}")
    plt.ylabel("Unemployment Rate (%)")
    plt.legend()
    plt.show()


In [None]:
# Example: Compare crisis impacts on Nevada (biggest spike during 2008–2010 & 2020)
plot_crisis_comparison("Nevada")

# Example: Compare crisis impacts on Texas (more stable)
plot_crisis_comparison("Texas")

In [None]:
# Average the states within each region per year; plotting the raw rows
# would connect many states' points into a single zigzag line
for region, subset in df1.groupby("region"):
    yearly = subset.groupby("year")["unemp_rate"].mean()
    plt.plot(yearly.index, yearly.values, label=region)

plt.legend()
plt.title("Average State-Level Unemployment Rate by Region")
plt.show()


In [None]:
import matplotlib.pyplot as plt

for region in df2["wb_region"].unique():
    subset = df2[df2["wb_region"] == region]
    plt.plot(subset["year"], subset["unemp_rate"], label=region)

plt.legend()
plt.title("World Bank (region level) Unemployment Rate Trends by Region")
plt.show()


In [None]:
sns.lineplot(data=df2, x="year", y="unemp_rate")
plt.title("World Bank Regions: Mean Unemployment Rate by Year")
plt.show()

# **Pre- and Post-Crisis Comparison**



**Split dataset into pre-crisis and post-crisis (e.g., 2008, 2020).**

In [None]:
pre_crisis1 = df1[df1["year"] < 2008]
post_crisis1 = df1[df1["year"] >= 2008]

# Regional yearly averages, kept in a variable so the table is not discarded mid-cell
region_year_avg = df1.groupby(["region", "year"])["unemp_rate"].mean().unstack()

def period(year):
    """Label a year by economic period."""
    if year < 2008:
        return "Pre-2008"
    elif year < 2020:
        return "Post-2008 Pre-2020"
    else:
        return "Post-2020"

df1["period"] = df1["year"].apply(period)

plt.figure(figsize=(12, 6))
sns.boxplot(x="region", y="unemp_rate", hue="period", data=df1)
plt.title("Unemployment Rate by Region and Period")
plt.ylabel("Unemployment Rate (%)")
plt.xlabel("U.S. Region")
plt.grid(True)
plt.legend(title="Period")
plt.tight_layout()
plt.show()

**Outlier Detection (Boxplot)**

Why chosen: To identify extreme unemployment spikes.

Insights: Nevada (2020), Michigan (2009) are strong outliers.

Impact: Negative – shocks to certain industries (tourism, manufacturing) create vulnerability.

In [None]:
pre_crisis2 = df2[df2["year"] < 2008]
post_crisis2 = df2[df2["year"] >= 2008]

df2.groupby(["wb_region", "year"])["unemp_rate"].mean().unstack()

**Boxplots for visual comparison:**

In [None]:
import seaborn as sns
sns.boxplot(x="region", y="unemp_rate", hue="year", data=df1)

In [None]:
import seaborn as sns
sns.boxplot(x="wb_region", y="unemp_rate", hue="year", data=df2)

# **Key Observations from EDA:**

Unemployment has clear cyclical patterns, with spikes during the 2008 recession and the 2020 COVID crisis.

Some states (e.g., Nevada, Michigan, California) were more affected during crises.

Regional differences exist — for instance, developing regions tend to have more volatile unemployment trends compared to developed ones.

Over the long term, post-crisis recoveries (2011–2015, 2022–2024) show unemployment gradually declining.

# **HYPOTHESIS TESTING**

Crisis vs Pre-crisis

H₀: Average unemployment rate in 2006–2007 = 2008–2010

H₁: They are different (expect higher unemployment during crisis).

In [None]:
from scipy.stats import ttest_ind, f_oneway

# Pre-crisis (2006–2007) vs Crisis (2008–2010)
pre_crisis = df1[(df1["year"].between(2006,2007))]["unemp_rate"]
crisis = df1[(df1["year"].between(2008,2010))]["unemp_rate"]

t_stat, p_val = ttest_ind(pre_crisis, crisis, equal_var=False)
print("T-test Pre-crisis vs Crisis:")
print(f"T-stat = {t_stat:.3f}, p-value = {p_val:.5f}")

if p_val < 0.05:
    print("Reject H₀: Unemployment significantly changed during crisis.")
else:
    print("Fail to Reject H₀: No significant difference.")
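
Because H₁ expects unemployment to be higher during the crisis, a one-sided version of the same Welch test is a natural companion to the two-sided test. The sketch below uses the `alternative` parameter of `scipy.stats.ttest_ind` (available in SciPy 1.6 and later) on simulated samples; the sample sizes, means, and spreads are assumptions made for illustration, not the project's data.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Hypothetical samples (simulated, not the project's data)
pre_crisis_sim = rng.normal(loc=4.5, scale=1.0, size=100)
crisis_sim = rng.normal(loc=8.0, scale=2.0, size=150)

# Welch's t-test (equal_var=False); alternative="less" tests
# whether mean(pre_crisis_sim) < mean(crisis_sim)
t_stat_sim, p_one_sided = ttest_ind(
    pre_crisis_sim, crisis_sim, equal_var=False, alternative="less"
)
print(f"t = {t_stat_sim:.3f}, one-sided p = {p_one_sided:.3e}")
```

Printing the p-value with `.3e` keeps its order of magnitude visible even when it is very small.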


# **Time Series & Hypothesis Testing**


**1. T-Test (Pre-crisis vs Crisis)**

**Why chosen:** To confirm if crisis years significantly differ from pre-crisis years.

**Insights:** T-stat = -8.427, p < 0.001. The negative t-statistic means the pre-crisis mean is below the crisis mean, so unemployment was significantly higher during the crisis.

**Impact:** Negative – confirms high vulnerability to crises.

**Why is the p-value showing as 0.00000?**

It is not actually zero.

The true p-value is extremely small, smaller than the five decimal places that the format string prints.

SciPy returns the p-value as an ordinary float (something like 3.2e-15), but formatting it with `:.5f` rounds it to 0.00000.

This is strong evidence: a p-value this small means the difference between pre-crisis (2006–2007) and crisis (2008–2010) unemployment is statistically very significant.

Report it as p < 0.001 instead of saying it equals 0.
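
The rounding effect is easy to reproduce. In the sketch below, `p_val` is an illustrative value only (not the notebook's actual result), and `report_p` is a hypothetical helper showing one way to print small p-values as an inequality.

```python
p_val = 3.2e-15  # illustrative value only, not the notebook's actual result

fixed = f"{p_val:.5f}"       # fixed-point with five decimals
scientific = f"{p_val:.2e}"  # scientific notation keeps the magnitude

def report_p(p, floor=0.001):
    """Format a p-value, reporting very small values as an inequality."""
    return f"p < {floor}" if p < floor else f"p = {p:.3f}"

print(fixed)            # 0.00000
print(scientific)       # 3.20e-15
print(report_p(p_val))  # p < 0.001
```

Using a scientific-notation format such as `:.2e` in the print statement avoids the misleading 0.00000.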

# **Conclusion**

1. The dataset shows clear evidence of cyclical unemployment patterns, dominated by the 2008 financial crisis and 2020 COVID crisis.

2. Certain states/regions are more vulnerable and need special policy focus.

3. Statistical testing confirms significant differences in unemployment during crises.

# **Business impact:**

**Negative** → crises severely harm employment stability.

**Positive** → insights can help governments, businesses, and policymakers forecast risks, allocate resources efficiently, and design targeted interventions.