**Sheth L.U.J. & Sir M.V. College Of Arts, Science & Commerce**

**Shobit Halse | T083**

**Practical No. 04**

**Aim:** Hypothesis Testing
* Formulate null and alternative hypotheses for a given problem.
* Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi-square test).
* Interpret the results and draw conclusions based on the test outcomes.

### **One-sample t-test: is the mean 2021 unemployment rate equal to 7%?**

In [1]:
import pandas as pd
import scipy.stats as stats

# Load the dataset
df = pd.read_csv("Unemployment-Analysis.csv")

# Sample data: We will use the unemployment rate for year 2021
data = df['2021'].dropna()

print("First 10 Unemployment Rates (2021): ", data.head(10).values)

# Define the null hypothesis
H0 = "The average unemployment rate across countries in 2021 is 7%."

# Define the alternative hypothesis
H1 = "The average unemployment rate across countries in 2021 is not 7%."

# Calculate the test statistic
# We perform a 1-sample t-test against the population mean of 7
t_stat, p_value = stats.ttest_1samp(data, 7)

# Print the results
print("\nNull Hypothesis:", H0)
print("Alternative Hypothesis:", H1)
print("\nTest statistic:", t_stat)
print("p-value:", p_value)

# Conclusion
if p_value < 0.05:
    print("\nConclusion: Reject the null hypothesis.")
else:
    print("\nConclusion: Fail to reject the null hypothesis.")

First 10 Unemployment Rates (2021):  [ 8.11 13.28  6.84  8.53 11.82 11.63  3.36 10.9  20.9   5.11]

Null Hypothesis: The average unemployment rate across countries in 2021 is 7%.
Alternative Hypothesis: The average unemployment rate across countries in 2021 is not 7%.

Test statistic: 3.3951912524526238
p-value: 0.0008055222105643003

Conclusion: Reject the null hypothesis.
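The statistic reported by `stats.ttest_1samp` is the standardized distance between the sample mean and the hypothesized mean, t = (x̄ − μ0) / (s / √n). The sketch below computes it by hand with the standard library only. The helper name `one_sample_t` is hypothetical, and the input is just the ten 2021 rates printed above, so its result will differ from the full-dataset statistic.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for H0: mean == mu0."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    sample_sd = statistics.stdev(sample)       # sample std (n - 1 denominator)
    standard_error = sample_sd / math.sqrt(n)  # std of the sample mean
    return (sample_mean - mu0) / standard_error, n - 1

# Illustration only: the first ten 2021 rates printed above,
# not the full column that stats.ttest_1samp was run on.
first_ten = [8.11, 13.28, 6.84, 8.53, 11.82, 11.63, 3.36, 10.9, 20.9, 5.11]

t_value, dof = one_sample_t(first_ten, 7.0)
print("t =", round(t_value, 3), "with", dof, "degrees of freedom")
```

A positive t means the sample mean lies above the hypothesized 7%. The p-value then comes from the t distribution with n − 1 degrees of freedom, which is the part `scipy` handles.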


### **Two-sample (Independent) T-test**

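The independent samples t-test compares the means of two independent groups. Here we compare the **unemployment rate in 2019** with the **unemployment rate in 2020** to see whether there was a significant change (possibly due to the impact of COVID-19). The null hypothesis is that the two yearly means are equal. Note that the same countries appear in both years, so the independence assumption holds only approximately; the paired test in the next section accounts for this.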

In [2]:
from scipy.stats import ttest_ind
import numpy as np

# Select unemployment rates for two different years
unemployment_2019 = df['2019'].dropna()
unemployment_2020 = df['2020'].dropna()

print("Unemployment Rates 2019 (first 5):", unemployment_2019.head().values)
print("Unemployment Rates 2020 (first 5):", unemployment_2020.head().values)

mean_2019 = np.mean(unemployment_2019)
mean_2020 = np.mean(unemployment_2020)

print("\nMean unemployment rate 2019:", round(mean_2019, 2))
print("Mean unemployment rate 2020:", round(mean_2020, 2))

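# np.std defaults to the population standard deviation (ddof=0),
# so these values are slightly smaller than the sample std that describe() reports
std_2019 = np.std(unemployment_2019)
std_2020 = np.std(unemployment_2020)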

print("\nStd deviation 2019:", round(std_2019, 2))
print("Std deviation 2020:", round(std_2020, 2))

# Perform Independent t-test
ttest, pval = ttest_ind(unemployment_2019, unemployment_2020)
print("\np-value:", pval)

if pval < 0.05:
    print("\nConclusion: We reject null hypothesis - There is a significant difference between 2019 and 2020 unemployment rates.")
else:
    print("\nConclusion: We fail to reject null hypothesis - No significant difference between 2019 and 2020 unemployment rates.")

Unemployment Rates 2019 (first 5): [ 6.91 11.22  6.06  7.42 11.47]
Unemployment Rates 2020 (first 5): [ 7.56 11.71  6.77  8.33 13.33]

Mean unemployment rate 2019: 7.09
Mean unemployment rate 2020: 8.28

Std deviation 2019: 5.12
Std deviation 2020: 5.46

p-value: 0.015237345431532911

Conclusion: We reject null hypothesis - There is a significant difference between 2019 and 2020 unemployment rates.
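By default `ttest_ind` pools the two variances (`equal_var=True`); passing `equal_var=False` gives Welch's t-test, which does not assume equal variances. As a rough cross-check, the sketch below computes Welch's statistic from summary statistics alone. The helper `welch_t` is hypothetical, and the inputs are the means and sample standard deviations from the `describe()` output shown later in this notebook, assuming 235 countries in each year.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_of_mean1 = sd1 ** 2 / n1  # squared standard error of group 1
    var_of_mean2 = sd2 ** 2 / n2  # squared standard error of group 2
    t = (mean1 - mean2) / math.sqrt(var_of_mean1 + var_of_mean2)
    dof = (var_of_mean1 + var_of_mean2) ** 2 / (
        var_of_mean1 ** 2 / (n1 - 1) + var_of_mean2 ** 2 / (n2 - 1)
    )
    return t, dof

# Assumed inputs: 2019 and 2020 summary statistics from describe()
# (sample std, n = 235 per year).
t_welch, dof_welch = welch_t(7.087362, 5.129146, 235,
                             8.278809, 5.470319, 235)
print("Welch t:", round(t_welch, 3), "approx. dof:", round(dof_welch, 1))
```

The sign is negative because the 2019 mean is lower than the 2020 mean. With equal group sizes and similar variances, the pooled and Welch statistics are nearly the same.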


### **Paired sample t-test**

The paired sample t-test checks for differences between two related variables. Here we compare unemployment rates for the same countries in 2019 vs 2020 (before and during COVID-19).

**H0**: Mean difference between 2019 and 2020 unemployment rates is 0.
**H1**: Mean difference is not 0.

In [3]:
import pandas as pd
from scipy import stats
import numpy as np

# Create a subset with both years having valid data
paired_df = df[['Country Name', '2019', '2020']].dropna()

print("Sample of paired data:")
print(paired_df.head(10))
print("\nDescriptive Statistics:")
print(paired_df[['2019', '2020']].describe())

# Perform Paired t-test
ttest, pval = stats.ttest_rel(paired_df['2019'], paired_df['2020'])
print("\np-value:", pval)

if pval < 0.05:
    print("\nConclusion: Reject null hypothesis - There is a significant change in unemployment rates from 2019 to 2020.")
else:
    print("\nConclusion: Fail to reject null hypothesis - No significant change in unemployment rates from 2019 to 2020.")

Sample of paired data:
                  Country Name   2019   2020
0  Africa Eastern and Southern   6.91   7.56
1                  Afghanistan  11.22  11.71
2   Africa Western and Central   6.06   6.77
3                       Angola   7.42   8.33
4                      Albania  11.47  13.33
5                   Arab World  10.01  11.49
6         United Arab Emirates   2.23   3.19
7                    Argentina   9.84  11.46
8                      Armenia  18.30  21.21
9                    Australia   5.16   6.46

Descriptive Statistics:
             2019        2020
count  235.000000  235.000000
mean     7.087362    8.278809
std      5.129146    5.470319
min      0.100000    0.210000
25%      3.805000    4.620000
50%      5.530000    6.800000
75%      8.605000   10.230000
max     28.470000   29.220000

p-value: 2.0069299704307128e-39

Conclusion: Reject null hypothesis - There is a significant change in unemployment rates from 2019 to 2020.
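The paired p-value is far smaller than the independent-test p-value on the same two years because a paired t-test is a one-sample t-test on the per-country differences, which removes the large between-country variation. The sketch below shows both views using only the ten rows of the paired table printed above. It is an illustration, not a reproduction of the full-dataset result.

```python
import math
import statistics

# First ten rows of the paired table printed above (illustration only).
rates_2019 = [6.91, 11.22, 6.06, 7.42, 11.47, 10.01, 2.23, 9.84, 18.30, 5.16]
rates_2020 = [7.56, 11.71, 6.77, 8.33, 13.33, 11.49, 3.19, 11.46, 21.21, 6.46]
n = len(rates_2019)

# Paired view: test whether the mean per-country difference is zero.
differences = [after - before for before, after in zip(rates_2019, rates_2020)]
mean_diff = statistics.mean(differences)
paired_t = mean_diff / (statistics.stdev(differences) / math.sqrt(n))

# Unpaired view: same means, but the standard error now includes
# the between-country spread of each year.
se_unpaired = math.sqrt(
    statistics.variance(rates_2019) / n + statistics.variance(rates_2020) / n
)
unpaired_t = (statistics.mean(rates_2020) - statistics.mean(rates_2019)) / se_unpaired

print("Mean difference (2020 - 2019):", round(mean_diff, 3))
print("Paired t:", round(paired_t, 2), "| Unpaired t:", round(unpaired_t, 2))
```

The differences here are taken as 2020 minus 2019, so the statistic is positive; `stats.ttest_rel(paired_df['2019'], paired_df['2020'])` subtracts in the other order and yields the opposite sign.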


### **When to use a Z-test**

A Z-test is reasonable here because the sample is large (N > 30), so the sample mean is approximately normally distributed. We test whether the mean unemployment rate in 2021 differs significantly from 8%.

**One-sample Z test**

In [4]:
import pandas as pd
from scipy import stats
from statsmodels.stats import weightstats as stests

# H0: The mean unemployment rate in 2021 is 8%
data_2021 = df['2021'].dropna()

ztest, pval = stests.ztest(data_2021, x2=None, value=8)

print("Testing if mean unemployment rate in 2021 equals 8%")
print("Sample mean:", round(data_2021.mean(), 2))
print("p-value:", float(pval))

if pval < 0.05:
    print("\nConclusion: Reject null hypothesis - Mean unemployment rate is significantly different from 8%.")
else:
    print("\nConclusion: Fail to reject null hypothesis - Mean unemployment rate is not significantly different from 8%.")

Testing if mean unemployment rate in 2021 equals 8%
Sample mean: 8.22
p-value: 0.5408879166038716

Conclusion: Fail to reject null hypothesis - Mean unemployment rate is not significantly different from 8%.
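For reference, a one-sample Z statistic has the same form as the t statistic, but the p-value is read from the standard normal distribution. The sketch below is a minimal standard-library version. The helper `one_sample_z` and the synthetic sample are hypothetical, and using the sample standard deviation (n − 1 denominator) is an assumption about how the library call above estimates the spread.

```python
import math
import statistics

def one_sample_z(sample, mu0):
    """Return (z statistic, two-sided p-value) for H0: mean == mu0."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    z = (statistics.mean(sample) - mu0) / standard_error
    # Two-sided tail area of the standard normal: 2 * (1 - Phi(|z|))
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided

# Synthetic sample of 70 values centred exactly on 8.0 (hypothetical data).
synthetic_rates = [8.0 + 0.5 * ((i % 7) - 3) for i in range(70)]

z_same, p_same = one_sample_z(synthetic_rates, 8.0)  # H0 matches the sample mean
z_far, p_far = one_sample_z(synthetic_rates, 9.0)    # H0 is far from the sample mean
print("mu0 = 8:", z_same, p_same)
print("mu0 = 9:", round(z_far, 2), p_far)
```

When the hypothesized mean equals the sample mean, z is 0 and the p-value is 1; moving the hypothesized mean away from the data drives the p-value toward 0.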


### **Two-sample Z test**

Here we compare two groups of data, treating them as independent. We compare the unemployment rates in **2010** and **2020** to analyze the change over the decade.

**H0**: Mean unemployment rate of 2010 and 2020 is equal.
**H1**: Mean unemployment rate is not equal.

In [5]:
# Selecting unemployment rates for two different decades
unemployment_2010 = df['2010'].dropna()
unemployment_2020 = df['2020'].dropna()

print("Mean unemployment 2010:", round(unemployment_2010.mean(), 2))
print("Mean unemployment 2020:", round(unemployment_2020.mean(), 2))

# Perform Two-sample Z-test
ztest, pval1 = stests.ztest(unemployment_2010, x2=unemployment_2020,
                            value=0, alternative='two-sided')

print("\np-value:", float(pval1))

if pval1 < 0.05:
    print("\nConclusion: Reject null hypothesis - Significant difference between 2010 and 2020 unemployment rates.")
else:
    print("\nConclusion: Fail to reject null hypothesis - No significant difference between 2010 and 2020 unemployment rates.")

Mean unemployment 2010: 8.11
Mean unemployment 2020: 8.28

p-value: 0.7349469582916599

Conclusion: Fail to reject null hypothesis - No significant difference between 2010 and 2020 unemployment rates.


### **Chi-Square Test**

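The chi-square test of independence is applied to categorical variables. We categorize countries by unemployment level (Low: below 5%, Medium: 5% to below 10%, High: 10% and above) and test whether a country's category in 2019 is associated with its category in 2020. The null hypothesis is that the two categorizations are independent.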

In [6]:
import pandas as pd
from scipy.stats import chi2_contingency

# Create unemployment categories
def categorize_unemployment(rate):
    if pd.isna(rate):
        return None
    elif rate < 5:
        return 'Low'
    elif rate < 10:
        return 'Medium'
    else:
        return 'High'

# Apply categorization
df['Category_2019'] = df['2019'].apply(categorize_unemployment)
df['Category_2020'] = df['2020'].apply(categorize_unemployment)

# Create a contingency table
contingency_table = pd.crosstab(df['Category_2019'], df['Category_2020'])
print("Contingency Table (2019 vs 2020 Unemployment Categories):")
print(contingency_table)

# Perform the chi-square test
chi2_statistic, p_value, dof, expected_frequencies = chi2_contingency(contingency_table)

# Print the results
print('\nChi-square statistic:', chi2_statistic)
print('P-value:', p_value)
print('Degrees of freedom:', dof)
print('\nExpected frequencies:')
print(expected_frequencies)

if p_value < 0.05:
    print("\nConclusion: Reject null hypothesis - There is a significant association between 2019 and 2020 unemployment categories.")
else:
    print("\nConclusion: Fail to reject null hypothesis - No evidence of an association between 2019 and 2020 unemployment categories.")

Contingency Table (2019 vs 2020 Unemployment Categories):
Category_2020  High  Low  Medium
Category_2019                   
High             47    0       1
Low               1   68      27
Medium           14    1      76

Chi-square statistic: 272.34990728884395
P-value: 9.93673895842106e-58
Degrees of freedom: 4

Expected frequencies:
[[12.66382979 14.09361702 21.24255319]
 [25.32765957 28.18723404 42.48510638]
 [24.00851064 26.71914894 40.27234043]]

Conclusion: Reject null hypothesis - There is a significant association between 2019 and 2020 unemployment categories.
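The expected frequencies and the statistic can be checked by hand: under independence each expected count is (row total × column total) / grand total, and the statistic sums (observed − expected)² / expected over all cells. The sketch below does this with plain Python for the contingency table printed above. It applies no continuity correction, which is an assumption that matches this 3×3 case rather than a general description of `chi2_contingency`.

```python
# Observed counts copied from the contingency table printed above.
# Rows: 2019 category; columns: 2020 category (High, Low, Medium).
observed = [
    [47, 0, 1],    # High in 2019
    [1, 68, 27],   # Low in 2019
    [14, 1, 76],   # Medium in 2019
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Pearson chi-square statistic: sum of (O - E)^2 / E over all cells.
chi2 = sum(
    (obs - exp) ** 2 / exp
    for obs_row, exp_row in zip(observed, expected)
    for obs, exp in zip(obs_row, exp_row)
)
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

print("Chi-square statistic:", chi2)
print("Degrees of freedom:", dof)
```

Most of the statistic comes from the diagonal cells (High/High, Low/Low, Medium/Medium), where far more countries stay in the same category than independence would predict.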
