**Sheth L.U.J. & Sir M.V. College Of Arts, Science & Commerce**

**Shobit Halse | T083**

**Practical No. 05**

**Aim:** ANOVA (Analysis of Variance)
* Perform one-way ANOVA to compare means across multiple groups.
* Conduct post-hoc tests to identify significant differences between group means.


### **ANOVA (F-TEST)**

**One-Way F-test (ANOVA):** We compare the mean unemployment rates of four regions, treating each region's yearly rates for 2010-2021 as its observations. The null hypothesis is that all regional means are equal; we reject it when p < 0.05.

In [7]:
import pandas as pd
import scipy.stats as stats
import numpy as np

# Load the dataset
df = pd.read_csv("Unemployment-Analysis.csv")

# Select specific regions for comparison
regions = ['Europe & Central Asia', 'Latin America & Caribbean', 'East Asia & Pacific', 'Middle East & North Africa']

# Filter data for these regions
region_data = df[df['Country Name'].isin(regions)]

# Use multiple years as observations (2010-2021)
years = [str(y) for y in range(2010, 2022)]

# Prepare data for ANOVA - each region gets unemployment rates across multiple years
d_data = {}
for region in regions:
    row = region_data[region_data['Country Name'] == region]
    values = row[years].values.flatten().astype(float)
    d_data[region] = values
    print(f"{region}: {values}")

# Perform One-way ANOVA
F, p = stats.f_oneway(*d_data.values())

print(f"\nF-statistic: {F:.4f}")
print(f"p-value: {p:.6f}")

if p < 0.05:
    print("Reject null hypothesis - There is a significant difference in unemployment rates across regions")
else:
    print("Fail to reject null hypothesis - No significant difference in unemployment rates across regions")

Europe & Central Asia: [9.02 8.76 8.96 9.2  8.96 8.56 8.08 7.48 6.9  6.67 7.18 7.12]
Latin America & Caribbean: [ 6.8   6.48  6.41  6.36  6.15  6.63  7.72  7.99  7.87  7.93 10.06  9.96]
East Asia & Pacific: [4.21 4.12 4.04 4.03 4.01 4.07 3.98 3.86 3.73 3.82 4.32 4.2 ]
Middle East & North Africa: [ 9.55 10.05 10.17  9.92 10.04 10.18 10.24 10.26  9.88  9.15 10.54 10.53]

F-statistic: 104.4641
p-value: 0.000000
Reject null hypothesis - There is a significant difference in unemployment rates across regions
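
To make the F-statistic less of a black box, the sketch below recomputes it by hand from the between-group and within-group sums of squares. It uses only the standard library and the regional values printed above; the helper name `one_way_f` is an illustrative choice, not part of SciPy. The result can be compared with the value reported by `stats.f_oneway`.

```python
from statistics import fmean

# Regional unemployment rates (2010-2021), copied from the printed output above.
groups = {
    "Europe & Central Asia": [9.02, 8.76, 8.96, 9.2, 8.96, 8.56, 8.08, 7.48, 6.9, 6.67, 7.18, 7.12],
    "Latin America & Caribbean": [6.8, 6.48, 6.41, 6.36, 6.15, 6.63, 7.72, 7.99, 7.87, 7.93, 10.06, 9.96],
    "East Asia & Pacific": [4.21, 4.12, 4.04, 4.03, 4.01, 4.07, 3.98, 3.86, 3.73, 3.82, 4.32, 4.2],
    "Middle East & North Africa": [9.55, 10.05, 10.17, 9.92, 10.04, 10.18, 10.24, 10.26, 9.88, 9.15, 10.54, 10.53],
}


def one_way_f(samples):
    """Return (F, df_between, df_within, ss_between, ss_within) for a list of samples."""
    all_values = [x for s in samples for x in s]
    grand_mean = fmean(all_values)
    # Between-group variation: how far each group mean sits from the grand mean.
    ss_between = sum(len(s) * (fmean(s) - grand_mean) ** 2 for s in samples)
    # Within-group variation: spread of observations around their own group mean.
    ss_within = sum((x - fmean(s)) ** 2 for s in samples for x in s)
    df_between = len(samples) - 1
    df_within = len(all_values) - len(samples)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within, ss_between, ss_within


f_stat, df_b, df_w, ss_b, ss_w = one_way_f(list(groups.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.4f}")
```

A large F means the group means differ by much more than the year-to-year scatter inside each group would explain.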


**Alternative One-Way ANOVA:** Compare mean unemployment rates across five individual countries, again using each country's yearly rates for 2010-2021 as observations.

In [8]:
# Select individual countries for comparison
countries = ['India', 'China', 'Germany', 'Brazil', 'Japan']

# Filter data for these countries
country_data = df[df['Country Name'].isin(countries)]

# Use multiple years as observations
years = [str(y) for y in range(2010, 2022)]

# Prepare data
d_data = {}
for country in countries:
    row = country_data[country_data['Country Name'] == country]
    values = row[years].values.flatten().astype(float)
    d_data[country] = values
    print(f"{country}: Mean = {np.mean(values):.2f}%")

# Perform One-way ANOVA
F, p = stats.f_oneway(*d_data.values())

print(f"\nF-statistic: {F:.4f}")
print(f"p-value: {p:.10f}")

if p < 0.05:
    print("Reject null hypothesis - Significant difference in unemployment rates across countries")
else:
    print("Fail to reject null hypothesis - No significant difference in unemployment rates across countries")

India: Mean = 5.67%
China: Mean = 4.59%
Germany: Mean = 4.56%
Brazil: Mean = 10.02%
Japan: Mean = 3.44%

F-statistic: 33.3157
p-value: 0.0000000000
Reject null hypothesis - Significant difference in unemployment rates across countries
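
A p-value printed as `0.0000000000` is not exactly zero; it is smaller than fixed-point formatting can show. The sketch below illustrates two common ways to report such values. The number `p_example` is hypothetical and chosen only for illustration, and `format_p` is an illustrative helper, not a library function.

```python
# Hypothetical p-value used only to illustrate formatting; not taken from the analysis.
p_example = 3.2e-19

fixed = f"{p_example:.6f}"       # fixed-point formatting rounds the value down to zero
scientific = f"{p_example:.3e}"  # scientific notation keeps the magnitude visible
print("fixed-point:", fixed)
print("scientific: ", scientific)


def format_p(p, floor=1e-4):
    """Report very small p-values as '< floor' instead of a row of zeros."""
    return f"< {floor:g}" if p < floor else f"{p:.4f}"


print("reported as:", format_p(p_example))
```

Either form avoids suggesting that the probability is literally zero.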


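### **Two-Way F-test**

We test the effect of two factors, **Region** and **Time Period** (Pre-COVID vs COVID-Era), on the unemployment rate. The model also includes their interaction, which checks whether the period effect differs from region to region.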

In [9]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Select countries for analysis
selected_countries = ['India', 'China', 'Germany', 'France', 'Brazil', 'Japan', 'Australia', 'Canada']

# Years to analyze
years = [str(y) for y in range(2015, 2022)]

# Filter and reshape data
country_df = df[df['Country Name'].isin(selected_countries)][['Country Name'] + years]

# Melt the dataframe to long format
melted_df = country_df.melt(id_vars=['Country Name'], 
                            value_vars=years,
                            var_name='Year', 
                            value_name='UnemploymentRate')

# Create a region category
region_map = {
    'India': 'Asia',
    'China': 'Asia',
    'Japan': 'Asia',
    'Germany': 'Europe',
    'France': 'Europe',
    'Brazil': 'Americas',
    'Australia': 'Oceania',
    'Canada': 'Americas'
}

melted_df['Region'] = melted_df['Country Name'].map(region_map)

# Create time period category (Pre-COVID vs COVID)
def get_period(year):
    if int(year) < 2020:
        return 'Pre-COVID'
    else:
        return 'COVID-Era'

melted_df['Period'] = melted_df['Year'].apply(get_period)

print("Data sample:")
print(melted_df.head(10))
print(f"\nTotal observations: {len(melted_df)}")

# Fit the OLS model - Region and Period as factors
model = ols('UnemploymentRate ~ C(Region) * C(Period)', melted_df).fit()

print(f"\nOverall model F({model.df_model:.0f}, {model.df_resid:.0f}) = {model.fvalue:.3f}, p = {model.f_pvalue:.4f}")

# Perform ANOVA on the model
res = sm.stats.anova_lm(model, typ=2)
print("\nTwo-Way ANOVA Results:")
print(res)

Data sample:
  Country Name  Year  UnemploymentRate    Region     Period
0    Australia  2015              6.05   Oceania  Pre-COVID
1       Brazil  2015              8.43  Americas  Pre-COVID
2       Canada  2015              6.91  Americas  Pre-COVID
3        China  2015              4.63      Asia  Pre-COVID
4      Germany  2015              4.62    Europe  Pre-COVID
5       France  2015             10.35    Europe  Pre-COVID
6        India  2015              5.43      Asia  Pre-COVID
7        Japan  2015              3.40      Asia  Pre-COVID
8    Australia  2016              5.71   Oceania  Pre-COVID
9       Brazil  2016             11.60  Americas  Pre-COVID

Total observations: 56

Overall model F(7, 48) = 6.720, p = 0.0000

Two-Way ANOVA Results:
                         sum_sq    df          F        PR(>F)
C(Region)            227.069617   3.0  14.415430  7.963250e-07
C(Period)              5.364529   1.0   1.021695  3.171856e-01
C(Region):C(Period)   14.562343   3.0   0.9244
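
Each F value in the table is the factor's mean square (`sum_sq / df`) divided by the residual mean square. The sketch below checks this using only the numbers shown above. Since the residual row is not visible here, the residual mean square is inferred from the Region row; that inference is an assumption of this sketch, not additional output.

```python
# sum_sq, df and F values copied from the two-way ANOVA table above.
ss_region, df_region, f_region = 227.069617, 3, 14.415430
ss_period, df_period, f_period_reported = 5.364529, 1, 1.021695
ss_inter, df_inter = 14.562343, 3

# Mean square = sum of squares / degrees of freedom.
ms_region = ss_region / df_region

# Assumption: recover the residual mean square from the Region row (F = MS_factor / MS_resid).
ms_resid = ms_region / f_region

# Recompute the remaining F values with the same residual mean square.
f_period = (ss_period / df_period) / ms_resid
f_inter = (ss_inter / df_inter) / ms_resid

print(f"Implied residual mean square: {ms_resid:.4f}")
print(f"F for Period (recomputed):     {f_period:.4f}  (reported: {f_period_reported})")
print(f"F for Region x Period:         {f_inter:.4f}")
```

The recomputed Period value agrees with the reported one, and the interaction value comes out at about 0.924, consistent with the digits visible in the table. Only Region has an F value far above 1.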

### **Post-hoc Test (Tukey HSD)**

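A significant ANOVA result only says that at least one group mean differs, not which ones. Region was significant in the two-way ANOVA (p ≈ 7.96e-07), so we run Tukey's HSD test to find the region pairs that differ while holding the family-wise error rate at 0.05. Period was not significant (p ≈ 0.317) and has only two levels, so its comparison is included for completeness.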

In [10]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Perform Tukey HSD test for Region
tukey_region = pairwise_tukeyhsd(endog=melted_df['UnemploymentRate'],
                                  groups=melted_df['Region'],
                                  alpha=0.05)

print("Tukey HSD Post-hoc Test Results (by Region):")
print(tukey_region)

# Perform Tukey HSD test for Period
tukey_period = pairwise_tukeyhsd(endog=melted_df['UnemploymentRate'],
                                  groups=melted_df['Period'],
                                  alpha=0.05)

print("\nTukey HSD Post-hoc Test Results (by Period):")
print(tukey_period)

Tukey HSD Post-hoc Test Results (by Region):
  Multiple Comparison of Means - Tukey HSD, FWER=0.05  
 group1   group2 meandiff p-adj   lower   upper  reject
-------------------------------------------------------
Americas    Asia  -5.1502    0.0 -7.2445  -3.056   True
Americas  Europe    -3.16 0.0033 -5.4541 -0.8659   True
Americas Oceania  -3.9393 0.0027  -6.749 -1.1296   True
    Asia  Europe   1.9902 0.0681  -0.104  4.0845  False
    Asia Oceania    1.211 0.6213 -1.4381    3.86  False
  Europe Oceania  -0.7793 0.8821  -3.589  2.0304  False
-------------------------------------------------------

Tukey HSD Post-hoc Test Results (by Period):
  Multiple Comparison of Means - Tukey HSD, FWER=0.05  
  group1    group2  meandiff p-adj  lower  upper reject
-------------------------------------------------------
COVID-Era Pre-COVID  -0.6851 0.447 -2.4782 1.108  False
-------------------------------------------------------
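
One way to read the Tukey table: a pair is flagged `reject = True` exactly when its confidence interval (`lower`, `upper`) excludes zero. The sketch below checks this against the Region rows printed above. The rows are copied by hand and the helper `excludes_zero` is illustrative, not part of statsmodels.

```python
# (group1, group2, meandiff, lower, upper, reject) copied from the Region table above.
rows = [
    ("Americas", "Asia", -5.1502, -7.2445, -3.056, True),
    ("Americas", "Europe", -3.16, -5.4541, -0.8659, True),
    ("Americas", "Oceania", -3.9393, -6.749, -1.1296, True),
    ("Asia", "Europe", 1.9902, -0.104, 4.0845, False),
    ("Asia", "Oceania", 1.211, -1.4381, 3.86, False),
    ("Europe", "Oceania", -0.7793, -3.589, 2.0304, False),
]


def excludes_zero(lower, upper):
    """True when the whole confidence interval lies on one side of zero."""
    return lower > 0 or upper < 0


significant_pairs = []
for g1, g2, meandiff, lower, upper, reject in rows:
    if excludes_zero(lower, upper):
        significant_pairs.append((g1, g2))
    # meandiff is reported as mean(group2) - mean(group1).
    print(f"{g1:>8} vs {g2:<8} diff={meandiff:+.2f}  CI excludes 0: {excludes_zero(lower, upper)}")

print("Pairs that differ:", significant_pairs)
```

In this table every significant pair involves the Americas, whose mean unemployment rate is higher than that of each other region. Asia, Europe and Oceania are not distinguishable from one another at the 0.05 level.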
