# Lab | Hypothesis Testing

**Objective**

Welcome to the Hypothesis Testing Lab, where we embark on an enlightening journey through the realm of statistical decision-making! In this laboratory, we delve into various scenarios, applying the powerful tools of hypothesis testing to scrutinize and interpret data.

From testing the mean of a single sample (One Sample T-Test), to investigating differences between independent groups (Two Sample T-Test), and exploring relationships within dependent samples (Paired Sample T-Test), our exploration knows no bounds. Furthermore, we'll venture into the realm of Analysis of Variance (ANOVA), unraveling the complexities of comparing means across multiple groups.

So, grab your statistical tools, prepare your hypotheses, and let's embark on this fascinating journey of exploration and discovery in the world of hypothesis testing!

**Challenge 1**

In this challenge, we will be working with Pokemon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [165]:
#libraries
import pandas as pd
import scipy.stats as st
import numpy as np



In [166]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


In [167]:
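# Normalize column names; regex=False makes '.' a literal dot on every pandas version
df.columns = df.columns.str.strip().str.lower().str.replace('.', '_', regex=False).str.replace(' ', '', regex=False)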
df

Unnamed: 0,name,type1,type2,hp,attack,defense,sp_atk,sp_def,speed,generation,legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


In [168]:
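# Count missing values per column (not displayed here because it is not the last expression in the cell)
df.isna().sum()
# Note: dropna() removes every row with any missing value, which includes all single-type Pokemon (missing type2)
df = df.dropna().reset_index(drop=True)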

In [169]:
df

Unnamed: 0,name,type1,type2,hp,attack,defense,sp_atk,sp_def,speed,generation,legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charizard,Fire,Flying,78,84,78,109,85,100,1,False
...,...,...,...,...,...,...,...,...,...,...,...
409,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
410,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
411,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
412,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True
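
Calling `dropna()` with no arguments removes every row that has a missing value in any column, which is why single-type Pokemon such as Charmander (empty `type2`) no longer appear in the table above. The following is a minimal sketch of inspecting missing values first and limiting the drop to the columns a test actually needs. The three-row frame is a hand-built sample copied from rows shown earlier, and the names `sample` and `needed` are illustrative only:

```python
import numpy as np
import pandas as pd

# Hand-built sample copied from rows displayed above (not the full dataset).
sample = pd.DataFrame({
    "name": ["Bulbasaur", "Charmander", "Diancie"],
    "type1": ["Grass", "Fire", "Rock"],
    "type2": ["Poison", np.nan, "Fairy"],
    "hp": [45, 39, 50],
})

# Count missing values per column before deciding what to drop.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# dropna() without arguments drops Charmander because type2 is missing.
dropped_all = sample.dropna().reset_index(drop=True)

# Restricting the check to the columns used in the test keeps Charmander.
needed = ["type1", "hp"]
dropped_subset = sample.dropna(subset=needed).reset_index(drop=True)

print(len(dropped_all), len(dropped_subset))
```

Whether single-type Pokemon should stay in the analysis is a modelling decision; the sketch only shows that the two calls keep different rows.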


- We posit that Dragon-type Pokemon have, on average, higher HP than Grass-type Pokemon. Choose the proper test and, at a 5% significance level, comment on your findings.

In [None]:
# Grp1 = Dragon (grp_A), Grp2 = Grass (grp_B); the means are mean HP
# H0: Grp1_mean <= Grp2_mean
# H1: Grp1_mean > Grp2_mean

In [170]:
grp_B=df[df["type1"]=="Grass"]["hp"]
grp_B

0       45
1       60
2       80
3       80
24      45
25      60
26      75
32      50
33      65
34      80
55      60
56      95
94      35
95      55
96      75
133     70
145     70
146     90
157     60
172     50
181     70
196     99
215     95
223     40
224     60
256     60
257     90
258     90
281    100
300     40
301     60
321     69
322    114
327     44
328     74
349     91
364     88
Name: hp, dtype: int64

In [171]:
grp_A=df[df["type1"]=="Dragon"]["hp"]
grp_A

76      91
183     75
184     75
201     95
202     95
207     80
208     80
209     80
210     80
212    105
213    105
245     58
246     68
247    108
248    108
352    100
353    100
356    125
357    125
358    125
408    108
Name: hp, dtype: int64

In [172]:
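from scipy import stats
# This returns the t-statistic and the p-value.
# Note: ttest_ind is two-sided by default; pass alternative="greater" to test the one-sided H1 directly.
stats.ttest_ind(grp_A, grp_B)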

TtestResult(statistic=4.649881427485321, pvalue=2.068191387085888e-05, df=56.0)

In [173]:
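# H0: Grp1_mean <= Grp2_mean
# H1: Grp1_mean > Grp2_mean
## The two-sided p-value (2.07e-05) is below the 0.05 significance level and the t-statistic is positive (4.65),
## so the one-sided p-value is half of it (about 1.03e-05). We reject H0: the data support the claim that
## Dragon (Grp1) has higher average HP than Grass (Grp2).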
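
Because H1 is directional, the test can also be run one-sided. The sketch below uses the HP values copied from the `grp_A` (Dragon) and `grp_B` (Grass) outputs above and compares the default two-sided call with `alternative="greater"`. It assumes a SciPy version that supports the `alternative` argument of `ttest_ind`; the Welch variant (`equal_var=False`) is an optional robustness check added here, not part of the original analysis:

```python
from scipy import stats

# HP values copied from the grp_A (Dragon) and grp_B (Grass) outputs above.
dragon_hp = [91, 75, 75, 95, 95, 80, 80, 80, 80, 105, 105,
             58, 68, 108, 108, 100, 100, 125, 125, 125, 108]
grass_hp = [45, 60, 80, 80, 45, 60, 75, 50, 65, 80, 60, 95, 35, 55, 75,
            70, 70, 90, 60, 50, 70, 99, 95, 40, 60, 60, 90, 90, 100, 40,
            60, 69, 114, 44, 74, 91, 88]

alpha = 0.05

# Default call: two-sided alternative (means differ in either direction).
two_sided = stats.ttest_ind(dragon_hp, grass_hp)

# One-sided call matching H1: mean(Dragon) > mean(Grass).
one_sided = stats.ttest_ind(dragon_hp, grass_hp, alternative="greater")

# Welch's variant drops the equal-variance assumption.
welch = stats.ttest_ind(dragon_hp, grass_hp, equal_var=False, alternative="greater")

print(f"two-sided p = {two_sided.pvalue:.3g}")
print(f"one-sided p = {one_sided.pvalue:.3g}")
print(f"Welch one-sided p = {welch.pvalue:.3g}")
print("reject H0" if one_sided.pvalue < alpha else "fail to reject H0")
```

When the t-statistic is positive, the one-sided p-value for "greater" is exactly half of the two-sided one, so both calls lead to the same decision here.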

- We posit that Legendary Pokemon have different stats (HP, Attack, Defense, Sp.Atk, Sp.Def, Speed) compared with Non-Legendary Pokemon. Choose the proper test and, at a 5% significance level, comment on your findings.


In [None]:
# H0: Legendary Pokemons have same stats (HP, Attack, Defense, Sp.Atk, Sp.Def, Speed) as Non-legendary
# H1: Legendary Pokemons have different stats (HP, Attack, Defense, Sp.Atk, Sp.Def, Speed) compared to Non-legendary

In [174]:
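# Select the six stat columns plus the 'legendary' grouping column
# (named stat_cols so the list does not shadow the scipy `stats` module imported earlier)
stat_cols = ["hp", "attack", "defense", "sp_atk", "sp_def", "speed", "legendary"]
df_stats = df[stat_cols]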


In [175]:
df_stats

Unnamed: 0,hp,attack,defense,sp_atk,sp_def,speed,legendary
0,45,49,49,65,65,45,False
1,60,62,63,80,80,60,False
2,80,82,83,100,100,80,False
3,80,100,123,122,120,80,False
4,78,84,78,109,85,100,False
...,...,...,...,...,...,...,...
409,50,100,150,100,150,50,True
410,50,160,110,160,110,110,True
411,80,110,60,150,130,70,True
412,80,160,60,170,130,80,True


In [176]:
from statsmodels.multivariate.manova import MANOVA

In [180]:

# Define the formula: all six stats explained by the 'legendary' status
formula = 'hp + attack + defense + sp_atk + sp_def + speed ~ legendary'

# Run the MANOVA model
manova = MANOVA.from_formula(formula, data=df_stats)

# Get the multivariate test results (Focus on Wilks' lambda)
manova_results = manova.mv_test()
print(manova_results)

                   Multivariate linear model
                                                                
----------------------------------------------------------------
       Intercept         Value  Num DF  Den DF   F Value  Pr > F
----------------------------------------------------------------
          Wilks' lambda  0.0530 6.0000 407.0000 1211.8954 0.0000
         Pillai's trace  0.9470 6.0000 407.0000 1211.8954 0.0000
 Hotelling-Lawley trace 17.8658 6.0000 407.0000 1211.8954 0.0000
    Roy's greatest root 17.8658 6.0000 407.0000 1211.8954 0.0000
----------------------------------------------------------------
                                                                
----------------------------------------------------------------
          legendary        Value  Num DF  Den DF  F Value Pr > F
----------------------------------------------------------------
             Wilks' lambda 0.7117 6.0000 407.0000 27.4810 0.0000
            Pillai's trace 0.2883 6.0000 407.

In [None]:
# The p-values of all four multivariate tests are below 0.05, so we reject the null hypothesis.
# The result supports the claim that Legendary Pokemon have different stats (HP, Attack, Defense, Sp.Atk, Sp.Def, Speed) from Non-Legendary Pokemon.
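
MANOVA tests the six stats jointly, so a significant result does not say which individual stats differ. One common follow-up is a separate two-sample test per stat with a multiple-comparison correction. The sketch below illustrates this with Welch t-tests and a Bonferroni-adjusted threshold. It runs on synthetic stand-in data, not the Pokemon dataset, and the names `demo`, `stat_columns` and `followup` are hypothetical:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data (NOT the Pokemon dataset): two stats, two groups.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "hp": np.concatenate([rng.normal(70, 10, 200), rng.normal(90, 10, 40)]),
    "speed": np.concatenate([rng.normal(65, 10, 200), rng.normal(95, 10, 40)]),
    "legendary": [False] * 200 + [True] * 40,
})

stat_columns = ["hp", "speed"]
alpha = 0.05
# Bonferroni: divide alpha by the number of tests performed.
bonferroni_alpha = alpha / len(stat_columns)

followup = {}
for col in stat_columns:
    legendary_values = demo.loc[demo["legendary"], col]
    regular_values = demo.loc[~demo["legendary"], col]
    # Welch's two-sided t-test, matching the two-sided H1 ("different stats").
    result = stats.ttest_ind(legendary_values, regular_values, equal_var=False)
    followup[col] = (result.statistic, result.pvalue, result.pvalue < bonferroni_alpha)

for col, (t_stat, p_value, significant) in followup.items():
    print(f"{col}: t = {t_stat:.2f}, p = {p_value:.3g}, significant = {significant}")
```

On the real data the same loop would run over the six stat columns of `df_stats`, with the threshold divided by six.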

**Challenge 2**

In this challenge, we will be working with California housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [181]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 34)
- Hospital coordinates (-122, 37)

We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.

Hint:
- Write a function to calculate the Euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close to and far from either a hospital or school.
- Choose the proper test and, at a 5% significance level, comment on your findings.
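
The first two hints can be sketched as follows, computing the Euclidean distance directly in coordinate units so that it is comparable with the 0.50 threshold. The first two rows are taken from `df.head()` above, while the last two rows are made-up points placed near the school and the hospital for illustration. The function name `euclidean_distance` and the frame name `houses` are hypothetical:

```python
import numpy as np
import pandas as pd

# First two rows come from df.head() above; the last two rows are made-up
# points near the school and the hospital, added only for illustration.
houses = pd.DataFrame({
    "longitude": [-114.31, -114.47, -118.10, -122.30],
    "latitude": [34.19, 34.40, 34.20, 37.10],
    "median_house_value": [66900.0, 80100.0, 250000.0, 400000.0],
})

school = (-118.0, 34.0)    # (longitude, latitude), as given in the task
hospital = (-122.0, 37.0)  # (longitude, latitude), as given in the task
threshold = 0.50           # in coordinate units, as given in the task


def euclidean_distance(frame, point):
    """Straight-line distance in coordinate units from each row to `point`."""
    lon, lat = point
    return np.hypot(frame["longitude"] - lon, frame["latitude"] - lat)


houses["distance_school"] = euclidean_distance(houses, school)
houses["distance_hospital"] = euclidean_distance(houses, hospital)

# A house is "close" if it is within the threshold of either location.
is_close = (houses["distance_school"] < threshold) | (houses["distance_hospital"] < threshold)
houses["distance_category"] = np.where(is_close, "close", "far")
print(houses[["longitude", "latitude", "distance_category"]])
```

Working on whole columns with `np.hypot` avoids a row-by-row `apply` and keeps the distances in the same units as the threshold.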
 

In [182]:
%pip install geopy pandas

Note: you may need to restart the kernel to use updated packages.


In [183]:
# H0: Houses close to either a school or a hospital are not more expensive than houses far from both.
# H1: Houses close to either a school or a hospital are more expensive than houses far from both.

In [184]:
from geopy.distance import great_circle

In [185]:

def calculate_great_circle_distance(row, target_point):
    """Return the great-circle distance in km from a row's coordinates to target_point."""
    # Note: the task hint asks for Euclidean distance in coordinate units;
    # this function computes great-circle distance in kilometers instead.

    # geopy expects points as (latitude, longitude)
    start_point = (row['latitude'], row['longitude'])
    end_point = target_point

    # great_circle returns a Distance object; .km gives kilometers
    # (.miles, .meters, etc. are also available)
    distance_km = great_circle(start_point, end_point).km
    return distance_km

In [186]:
school_point = (34.00, -118.00) 
hospital_point = (37.00, -122.00) 

In [187]:
df["distance_school"]=df.apply(calculate_great_circle_distance,axis=1, target_point=school_point)
df["distance_hospital"]=df.apply(calculate_great_circle_distance,axis=1, target_point=hospital_point)

In [188]:
df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,distance_school,distance_hospital
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0,340.418792,761.974459
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0,327.659611,738.576332
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0,319.542519,768.308921
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0,319.365481,770.371666
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0,320.561844,774.421877
...,...,...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0,917.047590,443.615947
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0,927.102523,454.929344
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0,1031.462839,573.239315
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0,1027.794143,569.086288


In [189]:
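# Caution: the distance columns are in km, whereas the task's 0.50 threshold refers to
# Euclidean distance in coordinate units, so "< 0.5" here means closer than 500 m.
cond1 = df["distance_school"] < 0.5
cond2 = df["distance_hospital"] < 0.5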

In [190]:
df["distance_category"] = np.where(cond1 | cond2, "close", "far")

In [191]:
df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,distance_school,distance_hospital,distance_category
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0,340.418792,761.974459,far
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0,327.659611,738.576332,far
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0,319.542519,768.308921,far
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0,319.365481,770.371666,far
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0,320.561844,774.421877,far
...,...,...,...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0,917.047590,443.615947,far
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0,927.102523,454.929344,far
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0,1031.462839,573.239315,far
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0,1027.794143,569.086288,far


In [192]:
# H0: avg_price_closer <= avg_price_far.
# H1: avg_price_closer > avg_price_far.

In [193]:
## If the average price of houses close to a school or hospital is significantly higher than that of houses far from both, the data support the claim that they are more expensive.

In [194]:
df_close=df[df["distance_category"]=="close"]["median_house_value"]
df_far=df[df["distance_category"]=="far"]["median_house_value"]

In [195]:
df_close

13747    137500.0
Name: median_house_value, dtype: float64

In [196]:
from scipy import stats
# This returns the t-statistic and the p-value
stats.ttest_ind(df_close, df_far,alternative="greater")

TtestResult(statistic=-0.6018226451778325, pvalue=0.7263498868187508, df=16998.0)

In [197]:
## The p-value (0.726) is higher than 0.05, so we fail to reject H0: there is no evidence that houses close to a
## school or hospital are more expensive. Note that the "close" group contains a single observation, so this
## test has very little power.
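
Since the `df_close` output above holds a single row, it can help to check group sizes before running the test. The sketch below wraps a one-sided Welch t-test in a size check. The function name `one_sided_ttest_with_size_check` and the `min_size` default are illustrative assumptions, and the sample prices are the one close value shown above plus the five values from `df.head()`:

```python
import pandas as pd
from scipy import stats


def one_sided_ttest_with_size_check(group_a, group_b, min_size=2, alpha=0.05):
    """Run a one-sided Welch t-test only if both groups have at least `min_size` rows.

    Returns a dict with the group sizes and, when the test ran, the p-value
    and the decision. `min_size=2` is only the bare minimum needed to
    estimate a variance; it is an illustrative default.
    """
    sizes = (len(group_a), len(group_b))
    if min(sizes) < min_size:
        return {"sizes": sizes, "ran": False, "pvalue": None, "reject_h0": None}
    result = stats.ttest_ind(group_a, group_b, equal_var=False, alternative="greater")
    return {"sizes": sizes, "ran": True, "pvalue": result.pvalue,
            "reject_h0": result.pvalue < alpha}


# 137500.0 is the single "close" value shown above; the "far" values are the
# five median_house_value entries from df.head(), used only as a small sample.
close_prices = pd.Series([137500.0])
far_prices = pd.Series([66900.0, 80100.0, 85700.0, 73400.0, 65500.0])

print(one_sided_ttest_with_size_check(close_prices, far_prices))
```

With one observation in the close group the function declines to run the test, which makes the sample-size problem visible instead of returning a p-value that is hard to interpret.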