# Lab | Hypothesis Testing

**Objective**

Welcome to the Hypothesis Testing Lab, where we embark on an enlightening journey through the realm of statistical decision-making! In this laboratory, we delve into various scenarios, applying the powerful tools of hypothesis testing to scrutinize and interpret data.

From testing the mean of a single sample (One Sample T-Test), to investigating differences between independent groups (Two Sample T-Test), and comparing paired measurements within dependent samples (Paired Sample T-Test), our exploration knows no bounds. Furthermore, we'll venture into Analysis of Variance (ANOVA), which compares means across multiple groups.

So, grab your statistical tools, prepare your hypotheses, and let's embark on this fascinating journey of exploration and discovery in the world of hypothesis testing!
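
As a quick orientation, the sketch below shows one plausible way to run each of the tests named above with `scipy.stats`. The samples `a`, `b` and `c` are synthetic and purely illustrative (they are not part of the lab's datasets), and the means and sizes are arbitrary assumptions.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical samples, not taken from the lab's datasets
a = rng.normal(loc=50, scale=10, size=40)
b = rng.normal(loc=55, scale=10, size=40)
c = rng.normal(loc=60, scale=10, size=40)

# One-sample t-test: is the mean of `a` equal to 50?
one_sample = st.ttest_1samp(a, popmean=50)

# Two-sample (Welch) t-test: do independent groups `a` and `b` share a mean?
two_sample = st.ttest_ind(a, b, equal_var=False)

# Paired t-test: treats a[i] and b[i] as two measurements on the same unit
paired = st.ttest_rel(a, b)

# One-way ANOVA: do `a`, `b` and `c` all share the same mean?
anova = st.f_oneway(a, b, c)

results = [
    ("one-sample", one_sample),
    ("two-sample", two_sample),
    ("paired", paired),
    ("ANOVA", anova),
]
for name, res in results:
    print(f"{name:>10}: statistic={res.statistic:.3f}, p-value={res.pvalue:.4f}")
```

Each call returns a result object exposing `statistic` and `pvalue`, so the decision rule (compare `pvalue` with the chosen significance level) is the same across all four tests.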

**Challenge 1**

In this challenge, we will be working with Pokémon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [11]:
#libraries
import pandas as pd
import scipy.stats as st
import numpy as np
from statsmodels.multivariate.manova import MANOVA



In [12]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


- We posit that Dragon-type Pokémon have, on average, more HP than Grass-type Pokémon. Choose the proper test and, at the 5% significance level, comment on your findings.

In [13]:

 
from scipy.stats import ttest_ind

# Load the data
url = "https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv"
df = pd.read_csv(url)

# Ensure both Type 1 and Type 2 are considered
dragon = df[(df['Type 1'] == 'Dragon') | (df['Type 2'] == 'Dragon')]['HP']
grass = df[(df['Type 1'] == 'Grass') | (df['Type 2'] == 'Grass')]['HP']

# Perform one-tailed t-test: H1 is Dragon > Grass
t_stat, p_value = ttest_ind(dragon, grass, equal_var=False)  # Welch's t-test

# One-tailed p-value
p_value_one_tailed = p_value / 2

# Print results
print(f"Mean HP - Dragon: {dragon.mean():.2f}, Grass: {grass.mean():.2f}")
print(f"T-statistic: {t_stat:.4f}")
print(f"One-tailed p-value: {p_value_one_tailed:.4f}")

# Conclusion
if (t_stat > 0) and (p_value_one_tailed < 0.05):
    print("✅ We reject the null hypothesis: Dragon-type Pokémon have significantly higher HP than Grass-type at 5% level.")
else:
    print("❌ We fail to reject the null hypothesis: No significant evidence that Dragon-type Pokémon have more HP.")


Mean HP - Dragon: 82.90, Grass: 66.05
T-statistic: 4.0975
One-tailed p-value: 0.0001
✅ We reject the null hypothesis: Dragon-type Pokémon have significantly higher HP than Grass-type at 5% level.
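
Halving a two-sided p-value and asking SciPy for a one-sided test directly are two routes to the same one-sided Welch test. The sketch below compares them on synthetic samples; `group_a` and `group_b` are hypothetical stand-ins, not the Pokémon data, and the `alternative` keyword assumes SciPy 1.6 or newer.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)

# Hypothetical HP-like samples; the lab itself uses the Pokémon CSV instead
group_a = rng.normal(loc=83, scale=24, size=50)
group_b = rng.normal(loc=66, scale=19, size=95)

# Route 1: two-sided Welch test, then convert to a one-tailed p-value by hand.
# Halving is only valid when the statistic points in the hypothesized direction.
t_two, p_two = ttest_ind(group_a, group_b, equal_var=False)
p_manual = p_two / 2 if t_two > 0 else 1 - p_two / 2

# Route 2: request the one-sided test directly
# (H1: mean of group_a > mean of group_b)
t_one, p_one = ttest_ind(group_a, group_b, equal_var=False, alternative="greater")

print(f"t = {t_one:.4f}")
print(f"manual one-tailed p = {p_manual:.6f}")
print(f"alternative='greater' p = {p_one:.6f}")
```

Both routes should agree up to floating-point error. Passing `alternative` has the advantage that the direction of the hypothesis is stated in the call itself rather than in a separate sign check.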


- We posit that Legendary Pokémon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) than Non-Legendary Pokémon. Choose the proper test and, at the 5% significance level, comment on your findings.


In [None]:

# Select variables of interest
stats_cols = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']
df_clean = df[stats_cols + ['Legendary']].dropna()

# Convert boolean to categorical
df_clean['Legendary'] = df_clean['Legendary'].astype(str)

# Run MANOVA
manova = MANOVA.from_formula('HP + Attack + Defense + Q("Sp. Atk") + Q("Sp. Def") + Speed ~ Legendary', data=df_clean)
result = manova.mv_test()
print(result)


                   Multivariate linear model
                                                                
----------------------------------------------------------------
       Intercept         Value  Num DF  Den DF   F Value  Pr > F
----------------------------------------------------------------
          Wilks' lambda  0.0592 6.0000 793.0000 2100.8338 0.0000
         Pillai's trace  0.9408 6.0000 793.0000 2100.8338 0.0000
 Hotelling-Lawley trace 15.8953 6.0000 793.0000 2100.8338 0.0000
    Roy's greatest root 15.8953 6.0000 793.0000 2100.8338 0.0000
----------------------------------------------------------------
                                                                
----------------------------------------------------------------
          Legendary        Value  Num DF  Den DF  F Value Pr > F
----------------------------------------------------------------
             Wilks' lambda 0.7331 6.0000 793.0000 48.1098 0.0000
            Pillai's trace 0.2669 6.0000 793.

Conclusion:
Wilks' Lambda p-value for Legendary is < 0.0001, which is much less than the 5% significance threshold.

Result: We reject the null hypothesis.

There is strong statistical evidence that Legendary Pokémon differ significantly from Non-Legendary Pokémon across the set of stats (HP, Attack, Defense, Sp.Atk, Sp.Def, Speed).
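
A significant MANOVA says the groups differ on the stats taken jointly, but not which individual stats drive the difference. One possible follow-up, sketched here under stated assumptions, is a Welch t-test per stat with a Bonferroni-adjusted threshold. The `demo` table is synthetic (two stats only, made-up means and group sizes), and Bonferroni is just one of several reasonable corrections.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)

# Hypothetical stand-in for the Pokémon table: two stats and a boolean group flag
n_regular, n_legendary = 120, 30
demo = pd.DataFrame({
    "HP": np.concatenate([rng.normal(67, 25, n_regular), rng.normal(93, 20, n_legendary)]),
    "Speed": np.concatenate([rng.normal(65, 27, n_regular), rng.normal(100, 22, n_legendary)]),
    "Legendary": [False] * n_regular + [True] * n_legendary,
})

stats_cols = ["HP", "Speed"]
alpha = 0.05
alpha_bonferroni = alpha / len(stats_cols)  # Bonferroni: split alpha across the tests

rows = []
for col in stats_cols:
    legendary = demo.loc[demo["Legendary"], col]
    regular = demo.loc[~demo["Legendary"], col]
    # Two-sided Welch test, matching the "different stats" hypothesis
    t_stat, p_value = ttest_ind(legendary, regular, equal_var=False)
    rows.append({
        "stat": col,
        "t": t_stat,
        "p": p_value,
        "reject_H0": bool(p_value < alpha_bonferroni),
    })

followup = pd.DataFrame(rows)
print(followup)
```

On the real data, `stats_cols` would list all six stats and the threshold would become `0.05 / 6`.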

**Challenge 2**

In this challenge, we will be working with California housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [15]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 34)
- Hospital coordinates (-122, 37)

We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.

Hint:
- Write a function to calculate the Euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close to and far from either a hospital or school.
- Choose the proper test and, at the 5% significance level, comment on your findings.
 

In [16]:

from scipy.stats import ttest_ind

# Load data
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")

# Define school and hospital coordinates
school_coords = (-118, 34)
hospital_coords = (-122, 37)

# Function to compute Euclidean distance
def distance(lon1, lat1, lon2, lat2):
    return np.sqrt((lon1 - lon2)**2 + (lat1 - lat2)**2)

# Compute distances to school and hospital
df['dist_school'] = distance(df['longitude'], df['latitude'], *school_coords)
df['dist_hospital'] = distance(df['longitude'], df['latitude'], *hospital_coords)

# Classify houses
df['is_close'] = ((df['dist_school'] < 0.5) | (df['dist_hospital'] < 0.5))

# Compare median house values
close_prices = df[df['is_close']]['median_house_value']
far_prices = df[~df['is_close']]['median_house_value']

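# Perform independent two-sample Welch's t-test, one-sided to match the hypothesis
# H1: mean price of close houses > mean price of far houses (requires SciPy >= 1.6)
t_stat, p_value = ttest_ind(close_prices, far_prices, equal_var=False, alternative='greater')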

# Print summary
print(f"Mean house value (close): ${close_prices.mean():,.2f}")
print(f"Mean house value (far):   ${far_prices.mean():,.2f}")
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# Conclusion
alpha = 0.05
if p_value < alpha:
    print("✅ We reject the null hypothesis: Houses close to a school or hospital are significantly more expensive.")
else:
    print("❌ We fail to reject the null hypothesis: No significant difference in house prices.")


Mean house value (close): $246,951.98
Mean house value (far):   $180,678.44
T-statistic: 37.9923
P-value: 0.0000
✅ We reject the null hypothesis: Houses close to a school or hospital are significantly more expensive.
