# Lab | Hypothesis Testing

**Objective**

In this lab we apply hypothesis tests to real datasets and interpret the results.

We cover the one-sample t-test (mean of a single sample), the two-sample t-test (difference between independent groups), the paired t-test (dependent samples), and Analysis of Variance (ANOVA) for comparing means across more than two groups.

For each exercise, state the hypotheses, choose a suitable test, and comment on the result at the given significance level.

**Challenge 1**

In this challenge, we will be working with Pokemon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [3]:
#libraries
import pandas as pd
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

In [4]:
df_pokemon = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df_pokemon

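| | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bulbasaur | Grass | Poison | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | Ivysaur | Grass | Poison | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 2 | Venusaur | Grass | Poison | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | Mega Venusaur | Grass | Poison | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | Charmander | Fire | NaN | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 795 | Diancie | Rock | Fairy | 50 | 100 | 150 | 100 | 150 | 50 | 6 | True |
| 796 | Mega Diancie | Rock | Fairy | 50 | 160 | 110 | 160 | 110 | 110 | 6 | True |
| 797 | Hoopa Confined | Psychic | Ghost | 80 | 110 | 60 | 150 | 130 | 70 | 6 | True |
| 798 | Hoopa Unbound | Psychic | Dark | 80 | 160 | 60 | 170 | 130 | 80 | 6 | True |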


In [5]:
df_pokemon.shape

(800, 11)

In [8]:
df_pokemon.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        799 non-null    object
 1   Type 1      800 non-null    object
 2   Type 2      414 non-null    object
 3   HP          800 non-null    int64 
 4   Attack      800 non-null    int64 
 5   Defense     800 non-null    int64 
 6   Sp. Atk     800 non-null    int64 
 7   Sp. Def     800 non-null    int64 
 8   Speed       800 non-null    int64 
 9   Generation  800 non-null    int64 
 10  Legendary   800 non-null    bool  
dtypes: bool(1), int64(7), object(3)
memory usage: 63.4+ KB


In [7]:
df_pokemon.describe()

In [12]:
df_pokemon.isnull().sum()

Name            1
Type 1          0
Type 2        386
HP              0
Attack          0
Defense         0
Sp. Atk         0
Sp. Def         0
Speed           0
Generation      0
Legendary       0
dtype: int64

- We posit that Dragon-type Pokemon have, on average, higher HP than Grass-type Pokemon. Choose the proper test and, at the 5% significance level, comment on your findings.

In [13]:
# Set the hypothesis
# H0: mu HP_dragon <= mu HP_grass
# H1: mu HP_dragon > mu HP_grass

# Choose significance level
alpha = 0.05

# Collect data
hp_dragon = df_pokemon[(df_pokemon['Type 1'] == 'Dragon') | (df_pokemon['Type 2'] == 'Dragon')]['HP']
hp_grass = df_pokemon[(df_pokemon['Type 1'] == 'Grass') | (df_pokemon['Type 2'] == 'Grass')]['HP']

print("Dragon HP:")
print(hp_dragon.describe())
print("\n")
print("Grass HP:")
print(hp_grass.describe())

Dragon HP:
count     50.00000
mean      82.90000
std       25.65171
min       40.00000
25%       66.50000
50%       80.00000
75%       98.75000
max      150.00000
Name: HP, dtype: float64


Grass HP:
count     95.000000
mean      66.052632
std       18.861967
min       30.000000
25%       50.000000
50%       65.000000
75%       75.000000
max      123.000000
Name: HP, dtype: float64


In [14]:
# Perform the test (Welch's two-sample t-test, one-tailed: "greater" matches H1)
result = st.ttest_ind(hp_dragon, hp_grass, equal_var=False, alternative='greater')

print(f"Statistic: {result.statistic}")
print(f"P-value: {result.pvalue}")

Statistic: 4.097528915272702
P-value: 5.090769061176927e-05


In [15]:
if result.pvalue > alpha:
    print("We are not able to reject the null hypothesis")
else:
    print("We reject the null hypothesis")

We reject the null hypothesis
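
The statistic reported above can be cross-checked from the summary statistics printed by `describe()`. The following is a minimal sketch using only the standard library; `welch_t_from_summary` is a hypothetical helper name, and it assumes the printed `std` values are sample standard deviations (pandas' default).

```python
import math

def welch_t_from_summary(mean1, std1, n1, mean2, std2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom,
    computed from per-group means, sample standard deviations and sizes."""
    v1 = std1 ** 2 / n1  # squared standard error of the group 1 mean
    v2 = std2 ** 2 / n2  # squared standard error of the group 2 mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, dof

# Values copied from the Dragon and Grass describe() output above
t_stat, dof = welch_t_from_summary(82.9, 25.65171, 50, 66.052632, 18.861967, 95)
print(f"t = {t_stat:.4f}, dof = {dof:.1f}")
```

The hand-computed statistic agrees with the value from `st.ttest_ind` to the precision of the rounded summary statistics. Together with the reported p-value of about 5.1e-05, which is below 0.05, this supports rejecting H0 in favor of Dragon-type Pokemon having higher mean HP than Grass-type Pokemon.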


- We posit that Legendary Pokemon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) compared with Non-Legendary Pokemon. Choose the proper test and, at the 5% significance level, comment on your findings.


In [17]:
# Set the hypotheses (tested separately for each stat: HP, Attack, Defense, Sp. Atk, Sp. Def, Speed)
# H0: mu stat_legendary = mu stat_non_legendary
# H1: mu stat_legendary != mu stat_non_legendary

alpha = 0.05

# Collect data
legendary = df_pokemon[df_pokemon['Legendary']]
non_legendary = df_pokemon[~df_pokemon['Legendary']]

print(f"Legendary: {len(legendary)} Pokemon")
print(f"Non-Legendary: {len(non_legendary)} Pokemon")

Legendary: 65 Pokemon
Non-Legendary: 735 Pokemon


In [19]:
stats_to_test = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']

for stat in stats_to_test:
    legendary_stat = legendary[stat]
    non_legendary_stat = non_legendary[stat]
    
    # Welch's two-sided t-test (equal_var=False: variances are not assumed equal)
    result = st.ttest_ind(legendary_stat, non_legendary_stat, equal_var=False)
    
    print(f"\n{stat}:")
    print(f"  Legendary mean: {legendary_stat.mean():.2f}")
    print(f"  Non-Legendary mean: {non_legendary_stat.mean():.2f}")
    print(f"  P-value: {result.pvalue:.6f}")
    
    # Decision
    if result.pvalue > alpha:
        print("  We are not able to reject the null hypothesis")
    else:
        print("  We reject the null hypothesis")


HP:
  Legendary mean: 92.74
  Non-Legendary mean: 67.18
  P-value: 0.000000
  We reject the null hypothesis

Attack:
  Legendary mean: 116.68
  Non-Legendary mean: 75.67
  P-value: 0.000000
  We reject the null hypothesis

Defense:
  Legendary mean: 99.66
  Non-Legendary mean: 71.56
  P-value: 0.000000
  We reject the null hypothesis

Sp. Atk:
  Legendary mean: 122.18
  Non-Legendary mean: 68.45
  P-value: 0.000000
  We reject the null hypothesis

Sp. Def:
  Legendary mean: 105.94
  Non-Legendary mean: 68.89
  P-value: 0.000000
  We reject the null hypothesis

Speed:
  Legendary mean: 100.18
  Non-Legendary mean: 65.46
  P-value: 0.000000
  We reject the null hypothesis


For every stat tested, the p-value prints as 0.000000 at six decimal places, far below the 5% significance level, so we reject the null hypothesis in all six cases.
At the 5% significance level, Legendary Pokémon therefore differ significantly from Non-Legendary Pokémon in mean HP, Attack, Defense, Sp. Atk, Sp. Def, and Speed.
The two-sided test only establishes a difference; the sample means show that the Legendary mean is the higher one for every stat.
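
Because six tests are run at once, the chance of at least one false rejection is larger than 5%. One common safeguard, not part of the original lab, is the Bonferroni correction, which compares each p-value with alpha divided by the number of tests. The sketch below is only illustrative: `bonferroni_reject` is a hypothetical helper, and the p-values are made-up placeholders rather than results from this dataset.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return per-test reject decisions and the Bonferroni threshold.

    p_values maps a test name to its p-value; a test is rejected when
    its p-value is at or below alpha / (number of tests).
    """
    threshold = alpha / len(p_values)
    decisions = {name: p <= threshold for name, p in p_values.items()}
    return decisions, threshold

# Placeholder p-values for illustration only (NOT the lab's results)
example_p = {"stat_a": 0.004, "stat_b": 0.02, "stat_c": 0.30}
decisions, threshold = bonferroni_reject(example_p, alpha=0.05)
print(f"threshold = {threshold:.4f}")
print(decisions)
```

For the six stats above the threshold would be 0.05 / 6, roughly 0.0083. Every p-value in the output prints as 0.000000 at six decimals, so each is below that threshold and the conclusions would not change under this correction.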

**Challenge 2**

In this challenge, we will be working with California housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [5]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -114.31 | 34.19 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.4936 | 66900.0 |
| 1 | -114.47 | 34.4 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.82 | 80100.0 |
| 2 | -114.56 | 33.69 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.6509 | 85700.0 |
| 3 | -114.57 | 33.64 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.1917 | 73400.0 |
| 4 | -114.57 | 33.57 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.925 | 65500.0 |


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           17000 non-null  float64
 1   latitude            17000 non-null  float64
 2   housing_median_age  17000 non-null  float64
 3   total_rooms         17000 non-null  float64
 4   total_bedrooms      17000 non-null  float64
 5   population          17000 non-null  float64
 6   households          17000 non-null  float64
 7   median_income       17000 non-null  float64
 8   median_house_value  17000 non-null  float64
dtypes: float64(9)
memory usage: 1.2 MB


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 34)
- Hospital coordinates (-122, 37)

We consider a house (neighborhood) to be close to a school or hospital if the distance is less than 0.50, measured in the units of the coordinates (degrees).

Hint:
- Write a function to calculate the Euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close to and far from either a hospital or a school.
- Choose the proper test and, at the 5% significance level, comment on your findings.
 

In [22]:
# Define the function to calculate Euclidean distance
def euclidean_distance(lon1, lat1, lon2, lat2):
    return np.sqrt((lon2 - lon1)**2 + (lat2 - lat1)**2)

# School coordinates
school_lon = -118
school_lat = 34

# Hospital coordinates
hospital_lon = -122
hospital_lat = 37

# Calculate distances from each house to school and hospital
df['distance_to_school'] = euclidean_distance(df['longitude'], df['latitude'], school_lon, school_lat)
df['distance_to_hospital'] = euclidean_distance(df['longitude'], df['latitude'], hospital_lon, hospital_lat)

In [23]:
# Display the first few rows to check
print(df[['longitude', 'latitude', 'distance_to_school', 'distance_to_hospital', 'median_house_value']].head())

   longitude  latitude  distance_to_school  distance_to_hospital  \
0    -114.31     34.19            3.694888              8.187319   
1    -114.47     34.40            3.552591              7.966235   
2    -114.56     33.69            3.453940              8.143077   
3    -114.57     33.64            3.448840              8.154416   
4    -114.57     33.57            3.456848              8.183508   

   median_house_value  
0             66900.0  
1             80100.0  
2             85700.0  
3             73400.0  
4             65500.0  
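
The first row of the output above can be checked by hand. This is a minimal sketch using only the standard library; `dist_deg` is a hypothetical helper name, and the distance it returns is in coordinate units (degrees), not kilometres.

```python
import math

def dist_deg(lon1, lat1, lon2, lat2):
    """Euclidean distance between two (longitude, latitude) points,
    expressed in degrees rather than a physical length."""
    return math.hypot(lon2 - lon1, lat2 - lat1)

house = (-114.31, 34.19)  # first row of the housing dataset
school = (-118, 34)
hospital = (-122, 37)

d_school = dist_deg(*house, *school)
d_hospital = dist_deg(*house, *hospital)

# Same rule as the lab: close if either distance is below 0.50
is_close = d_school < 0.50 or d_hospital < 0.50
print(round(d_school, 6), round(d_hospital, 6), is_close)
```

Both distances match the first row of the printed table, and the house is classified as far because neither distance is below 0.50.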


In [24]:
df['is_close'] = (df['distance_to_school'] < 0.50) | (df['distance_to_hospital'] < 0.50)

# Check how many houses are close vs far
print(df['is_close'].value_counts())
print("\n")

# Separate the data into two groups
close_houses = df[df['is_close']]['median_house_value']
far_houses = df[~df['is_close']]['median_house_value']

print(f"Close houses: {len(close_houses)}")
print(f"Mean house value (close): {close_houses.mean():.2f}")
print(f"Far houses: {len(far_houses)}")
print(f"Mean house value (far): {far_houses.mean():.2f}")

is_close
False    10171
True      6829
Name: count, dtype: int64


Close houses: 6829
Mean house value (close): 246951.98
Far houses: 10171
Mean house value (far): 180678.44


In [25]:
# Set the hypothesis
# H0: mu house_value_close <= mu house_value_far
# H1: mu house_value_close > mu house_value_far

# Choose significance level
alpha = 0.05

# Perform the test (one-tailed: "greater" because we test if close houses are MORE expensive)
result = st.ttest_ind(close_houses, far_houses, equal_var=False, alternative='greater')

print(f"Statistic: {result.statistic}")
print(f"P-value: {result.pvalue}")

# Decision-making
if result.pvalue > alpha:
    print("\nWe are not able to reject the null hypothesis")
else:
    print("\nWe reject the null hypothesis")

Statistic: 37.992330214201516
P-value: 1.5032478884296307e-301

We reject the null hypothesis


With a p-value of about 1.5e-301, far below the 0.05 significance level, we reject the null hypothesis.
At the 5% significance level, the data support the claim that houses close to either a school or a hospital (distance < 0.50) have a higher mean value than houses farther away (246,951.98 vs. 180,678.44).