# Lab | Hypothesis Testing

**Objective**

Welcome to the Hypothesis Testing Lab! In this lab we apply hypothesis tests to several scenarios and interpret the results.

We cover testing the mean of a single sample (One Sample T-Test), comparing independent groups (Two Sample T-Test), and comparing dependent samples (Paired Sample T-Test). We also use Analysis of Variance (ANOVA) to compare means across multiple groups.

So, grab your statistical tools, prepare your hypotheses, and let's get started!

**Challenge 1**

In this challenge, we will be working with Pokémon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [1]:
#libraries
import pandas as pd
import scipy.stats as st
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.multivariate.manova import MANOVA


In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


In [3]:
# Null values: check which columns contain at least one missing value.
# This must be the last expression in the cell, otherwise its result is not displayed.
df.isnull().any()


- We posit that Pokémon of type Dragon have, on average, higher HP than Pokémon of type Grass. Choose the proper test and, with 5% significance, comment on your findings.

In [4]:
# Understand the Hypotheses
#μ₁ = average HP of Dragon-type Pokémon
#μ₂ = average HP of Grass-type Pokémon
#Null hypothesis (H₀): μ₁ ≤ μ₂ (Dragon-type Pokémon do not have higher average HP)
#Alternative hypothesis (H₁): μ₁ > μ₂ (Dragon-type Pokémon have higher average HP)
#This is a one-tailed t-test (right-tailed), at α = 0.05 significance level.

In [5]:
#code here

# Filter Dragon and Grass type Pokémon
dragon_hp = df[df['Type 1'] == 'Dragon']['HP']
grass_hp = df[df['Type 1'] == 'Grass']['HP']

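# Perform a two-sample independent t-test (equal_var=False gives Welch's t-test).
# alternative='greater' (SciPy 1.6 or later) tests H₁: Dragon > Grass and returns the
# one-tailed p-value directly. Halving a two-sided p-value is only valid when the
# t-statistic is positive, so requesting the right-tailed test is the safer option.
t_stat, p_value_one_tailed = ttest_ind(dragon_hp, grass_hp, equal_var=False, alternative='greater')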

# Print the results
print(f"T-statistic: {t_stat:.4f}")
print(f"One-tailed p-value: {p_value_one_tailed:.4f}")


T-statistic: 3.3350
One-tailed p-value: 0.0008


In [6]:
#Interpretation
#If p_value_one_tailed < 0.05: Reject H₀ → Dragon-type Pokémon have significantly higher average HP.
#If p_value_one_tailed ≥ 0.05: Fail to reject H₀ → No evidence that Dragon-type Pokémon have higher average HP.
#Comment:
#The one-tailed p-value (0.0008) is below 0.05, so we reject H₀. At the 5% significance level, there is strong
#statistical evidence that Dragon-type Pokémon have a higher average HP than Grass-type Pokémon.
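
The relation between a two-sided and a one-tailed p-value can be sketched as follows. The samples are synthetic and the names `group_a`, `group_b`, and `p_halved` are made up for illustration; they are not the Pokémon data. The sketch assumes SciPy 1.6 or later, where `ttest_ind` accepts the `alternative` argument.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic samples for illustration only (not the Pokémon data).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=80, scale=20, size=40)
group_b = rng.normal(loc=70, scale=20, size=60)

# Two-sided Welch test, then the right-tailed version requested directly.
t_two, p_two = ttest_ind(group_a, group_b, equal_var=False)
t_one, p_one = ttest_ind(group_a, group_b, equal_var=False, alternative="greater")

# Halving the two-sided p-value gives the right-tailed p-value only when t > 0.
# When t < 0, the right-tailed p-value is 1 - p_two / 2 instead.
p_halved = p_two / 2 if t_two > 0 else 1 - p_two / 2

print(f"t = {t_one:.4f}")
print(f"one-tailed p (alternative='greater'): {p_one:.4f}")
print(f"one-tailed p (from two-sided):        {p_halved:.4f}")
```

Both routes should print the same p-value. Asking for the right-tailed test directly removes the need to check the sign of the t-statistic by hand.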

- We posit that Legendary Pokémon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) compared with Non-Legendary Pokémon. Choose the proper test and, with 5% significance, comment on your findings.


In [7]:
#code here
# The appropriate test is a Multivariate Analysis of Variance (MANOVA) because:
# - there are multiple dependent variables (the six stats),
# - there is one categorical independent variable: whether a Pokémon is Legendary (True or False),
# - we want to check whether the combination of stats differs between the two groups.

# Select only relevant columns; .copy() avoids pandas' SettingWithCopyWarning on the assignment below
stats_cols = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']
df_stats = df[stats_cols + ['Legendary']].copy()

# Convert boolean to string for MANOVA compatibility
df_stats['Legendary'] = df_stats['Legendary'].astype(str)

# Fit MANOVA model
# We use Q("Sp. Atk") and Q("Sp. Def") to handle column names with spaces/special characters.
manova = MANOVA.from_formula('HP + Attack + Defense + Q("Sp. Atk") + Q("Sp. Def") + Speed ~ Legendary', data=df_stats)
print(manova.mv_test())


                   Multivariate linear model
                                                                
----------------------------------------------------------------
       Intercept         Value  Num DF  Den DF   F Value  Pr > F
----------------------------------------------------------------
          Wilks' lambda  0.0592 6.0000 793.0000 2100.8338 0.0000
         Pillai's trace  0.9408 6.0000 793.0000 2100.8338 0.0000
 Hotelling-Lawley trace 15.8953 6.0000 793.0000 2100.8338 0.0000
    Roy's greatest root 15.8953 6.0000 793.0000 2100.8338 0.0000
----------------------------------------------------------------
                                                                
----------------------------------------------------------------
          Legendary        Value  Num DF  Den DF  F Value Pr > F
----------------------------------------------------------------
             Wilks' lambda 0.7331 6.0000 793.0000 48.1098 0.0000
            Pillai's trace 0.2669 6.0000 793.
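
MANOVA answers whether the six stats differ jointly between the two groups, but not which stats drive the difference. One possible follow-up, sketched here under assumptions, is a Welch t-test per stat with a Bonferroni-adjusted threshold. The helper `per_stat_welch` and the `demo` frame are hypothetical and use synthetic numbers, not the Pokémon data. In the notebook one would pass the original `df`, whose `Legendary` column is still boolean, rather than `df_stats`.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def per_stat_welch(data, group_col, stat_cols, alpha=0.05):
    """Two-sided Welch t-test per column, judged against alpha / number of tests."""
    mask = data[group_col]                 # assumes a boolean column
    threshold = alpha / len(stat_cols)     # Bonferroni correction
    rows = []
    for col in stat_cols:
        t_stat, p_val = ttest_ind(data.loc[mask, col], data.loc[~mask, col], equal_var=False)
        rows.append({"stat": col, "t": t_stat, "p": p_val, "reject_H0": p_val < threshold})
    return pd.DataFrame(rows)

# Synthetic demo frame (illustrative values only).
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Legendary": [True] * 30 + [False] * 70,
    "HP": np.concatenate([rng.normal(95, 10, 30), rng.normal(65, 10, 70)]),
    "Speed": rng.normal(70, 10, 100),
})

result = per_stat_welch(demo, "Legendary", ["HP", "Speed"])
print(result)
```

The Bonferroni threshold is a conservative choice. It keeps the overall chance of a false positive across the six tests at or below 5%.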

**Challenge 2**

In this challenge, we will be working with California housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [8]:
df1 = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df1.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


In [9]:
print(df1.columns)

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value'],
      dtype='object')


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 34)
- Hospital coordinates (-122, 37)

We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.

Hint:
- Write a function to calculate the Euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close to and far from either a hospital or a school.
- Choose the proper test and, with 5% significance, comment on your findings.
 

In [10]:
# Write a function to calculate the Euclidean distance from each house (neighborhood) to the school and to the hospital.
# Define coordinates
school_coords = (-118, 34)
hospital_coords = (-122, 37)

# Function to compute Euclidean distance
def euclidean_distance(lon1, lat1, lon2, lat2):
    return np.sqrt((lon2 - lon1)**2 + (lat2 - lat1)**2)

# Compute distances using the correct column names
df1['dist_to_school'] = euclidean_distance(df1['longitude'], df1['latitude'], *school_coords)
df1['dist_to_hospital'] = euclidean_distance(df1['longitude'], df1['latitude'], *hospital_coords)
# View results
df1[['longitude', 'latitude', 'dist_to_school', 'dist_to_hospital']].head()

Unnamed: 0,longitude,latitude,dist_to_school,dist_to_hospital
0,-114.31,34.19,3.694888,8.187319
1,-114.47,34.4,3.552591,7.966235
2,-114.56,33.69,3.45394,8.143077
3,-114.57,33.64,3.44884,8.154416
4,-114.57,33.57,3.456848,8.183508
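
As a quick sanity check, the distances for the first row (longitude -114.31, latitude 34.19) can be recomputed by hand. This is a minimal sketch that redefines the same function so it runs on its own. The comparison with `np.hypot` is an addition for illustration. Note that the distance is measured in raw degrees of longitude and latitude, as the exercise specifies, not in kilometers.

```python
import numpy as np

def euclidean_distance(lon1, lat1, lon2, lat2):
    """Straight-line distance between two points in coordinate (degree) units."""
    return np.sqrt((lon2 - lon1) ** 2 + (lat2 - lat1) ** 2)

# First row of the table above: longitude -114.31, latitude 34.19.
d_school = euclidean_distance(-114.31, 34.19, -118, 34)
d_hospital = euclidean_distance(-114.31, 34.19, -122, 37)

# The values should agree with the dist_to_school and dist_to_hospital columns.
assert np.isclose(d_school, 3.694888, atol=1e-6)
assert np.isclose(d_hospital, 8.187319, atol=1e-6)

# np.hypot computes sqrt(dx**2 + dy**2) and gives the same distance.
assert np.isclose(d_school, np.hypot(-118 - (-114.31), 34 - 34.19))

print(round(d_school, 6), round(d_hospital, 6))
```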


In [11]:
#You may also define a column that marks whether a house is close to either the school or the hospital:
df1['is_close'] = (df1['dist_to_school'] < 0.50) | (df1['dist_to_hospital'] < 0.50)

In [12]:
df1['is_close']

0        False
1        False
2        False
3        False
4        False
         ...  
16995    False
16996    False
16997    False
16998    False
16999    False
Name: is_close, Length: 17000, dtype: bool

In [14]:
# Divide the dataset into houses close to and far from either a hospital or a school.
# A house is "close" if its distance to either the school or the hospital is less than 0.50;
# otherwise, it is "far".

# Boolean column: True if close to the school or the hospital
df1['is_close'] = (df1['dist_to_school'] < 0.50) | (df1['dist_to_hospital'] < 0.50)

# Divide the dataset using the boolean mask directly (no need to compare with True/False)
df1_close = df1[df1['is_close']]
df1_far = df1[~df1['is_close']]

# Optional: check how many rows are in each
print("Number of close houses:", len(df1_close))
print("Number of far houses:", len(df1_far))

Number of close houses: 6829
Number of far houses: 10171


In [15]:
print("Average price (close):", df1_close['median_house_value'].mean())
print("Average price (far):", df1_far['median_house_value'].mean())

Average price (close): 246951.98213501245
Average price (far): 180678.44105790975
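
Comparing the two sample means is not yet a hypothesis test. With H₀: mean price (close) ≤ mean price (far) and H₁: mean price (close) > mean price (far), one reasonable choice is a right-tailed Welch t-test at α = 0.05. The sketch below assumes SciPy 1.6 or later. The helper `compare_close_vs_far` and the `demo` frame are hypothetical and use synthetic numbers, so the printed values say nothing about the real housing data.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def compare_close_vs_far(data, flag_col="is_close", value_col="median_house_value", alpha=0.05):
    """Right-tailed Welch t-test of H1: mean(close) > mean(far)."""
    close = data.loc[data[flag_col], value_col]    # assumes a boolean flag column
    far = data.loc[~data[flag_col], value_col]
    t_stat, p_val = ttest_ind(close, far, equal_var=False, alternative="greater")
    return t_stat, p_val, p_val < alpha

# Synthetic demo frame (illustrative values only).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "is_close": [True] * 200 + [False] * 300,
    "median_house_value": np.concatenate([
        rng.normal(250_000, 100_000, 200),
        rng.normal(180_000, 100_000, 300),
    ]),
})

t_stat, p_val, reject = compare_close_vs_far(demo)
print(f"T-statistic: {t_stat:.4f}")
print(f"One-tailed p-value: {p_val:.4g}")
print("Reject H0" if reject else "Fail to reject H0")
```

In the notebook, the call would be `compare_close_vs_far(df1)`. Welch's version is used because the two groups have different sizes and need not share the same variance.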
