# Lab | Hypothesis Testing

**Objective**

Welcome to the Hypothesis Testing Lab! In this lab, we apply hypothesis tests to several datasets and interpret the results.

The tests in scope are the one-sample t-test (the mean of a single sample), the two-sample t-test (differences between independent groups), the paired-sample t-test (differences within dependent samples), and Analysis of Variance (ANOVA), which compares means across multiple groups.

So, grab your statistical tools, prepare your hypotheses, and let's get started!

**Challenge 1**

In this challenge, we will be working with Pokémon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [1]:
#libraries
import pandas as pd
import scipy.stats as st
import numpy as np

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


1.1 We posit that Pokémon of type Dragon have, on average, more HP than Pokémon of type Grass. Choose the proper test and, at a 5% significance level, comment on your findings.

In [28]:
# Independent two-sample t-test: the two groups are defined by primary type ("Type 1")
df_dragon = df[df["Type 1"]=="Dragon"]["HP"]
df_grass = df[df["Type 1"]=="Grass"]["HP"]

In [None]:
df_dragon

In [None]:
df_grass

In [None]:
# Set the hypotheses

# H0: mean HP of Dragon-type Pokémon <= mean HP of Grass-type Pokémon
# H1: mean HP of Dragon-type Pokémon >  mean HP of Grass-type Pokémon

# Significance level: alpha = 0.05

In [38]:
st.ttest_ind(df_dragon, df_grass, equal_var=False, alternative="greater")

TtestResult(statistic=3.3349632905124063, pvalue=0.0007993609745420599, df=50.83784116232685)

**Answer**
- The p-value is the probability of observing a result at least as extreme as the one in our sample, assuming H0 is true.
- Interpretation: if Dragon-type Pokémon did not have a higher mean HP than Grass-type Pokémon, a t statistic of 3.33 or larger would occur with a probability of about 0.0008 (0.08%).
- Conclusion: since 0.0008 < 0.05, we reject H0 at the 5% significance level. The data support the claim that Dragon-type Pokémon have, on average, more HP than Grass-type Pokémon.
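
To make the numbers reported by `st.ttest_ind(..., equal_var=False, alternative="greater")` less of a black box, here is a minimal sketch that recomputes Welch's t statistic, the Welch-Satterthwaite degrees of freedom, and the one-sided p-value by hand. The two samples are synthetic stand-ins (their sizes, means, and spreads are arbitrary assumptions, not the Pokémon data), and `welch_greater` is a hypothetical helper name.

```python
import numpy as np
import scipy.stats as st

# Synthetic stand-ins for two independent samples (NOT the real Pokémon data).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=80, scale=24, size=30)
group_b = rng.normal(loc=67, scale=19, size=60)

def welch_greater(a, b):
    """One-sided Welch t-test of H1: mean(a) > mean(b), computed by hand."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard error of each sample mean (ddof=1 gives the sample variance).
    se2_a = a.var(ddof=1) / a.size
    se2_b = b.var(ddof=1) / b.size
    t_stat = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (se2_a**2 / (a.size - 1) + se2_b**2 / (b.size - 1))
    # Upper-tail probability, because the alternative is "greater".
    p_value = st.t.sf(t_stat, dof)
    return t_stat, dof, p_value

t_manual, dof_manual, p_manual = welch_greater(group_a, group_b)
result = st.ttest_ind(group_a, group_b, equal_var=False, alternative="greater")

print("manual:", t_manual, dof_manual, p_manual)
print("scipy: ", result.statistic, result.pvalue)
```

The manual values should agree with SciPy's up to floating-point error, which is a quick way to check that the chosen `alternative` matches the direction of H1.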

1.2 We posit that Legendary Pokémon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) compared with Non-Legendary Pokémon. Choose the proper test and, at a 5% significance level, comment on your findings.


In [43]:
df

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


In [3]:
# Independent two-sample t-test: split the six stats by the boolean "Legendary" flag
df_legendary = df[df["Legendary"]][["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]]
df_non_legendary = df[~df["Legendary"]][["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]]

In [4]:
df_legendary

Unnamed: 0,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed
156,90,85,100,95,125,85
157,90,90,85,125,90,100
158,90,100,90,125,85,90
162,106,110,90,154,90,130
163,106,190,100,154,100,130
...,...,...,...,...,...,...
795,50,100,150,100,150,50
796,50,160,110,160,110,110
797,80,110,60,150,130,70
798,80,160,60,170,130,80


In [5]:
df_non_legendary

Unnamed: 0,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed
0,45,49,49,65,65,45
1,60,62,63,80,80,60
2,80,82,83,100,100,80
3,80,100,123,122,120,80
4,39,52,43,60,50,65
...,...,...,...,...,...,...
787,85,100,122,58,75,54
788,55,69,85,32,35,28
789,95,117,184,44,46,28
790,40,30,35,45,40,55


In [19]:
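# Run one Welch t-test per stat.
# H0: mean of the stat for Legendary Pokémon <= mean for Non-Legendary Pokémon
# H1 (as coded, alternative="greater"): the Legendary mean is higher
# Note: the task says "different", which corresponds to alternative="two-sided".
# Every p-value printed below is under 0.5, so each t statistic is positive and the
# two-sided p-value is twice the one-sided value; the conclusions do not change.
stats = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]
alpha = 0.05
for stat in stats:
    t_stat, p_value = st.ttest_ind(df_legendary[stat], df_non_legendary[stat], equal_var=False, alternative="greater")
    print(p_value)
    if p_value < alpha:
        print(f"For {stat} we reject H0.")
    else:
        print(f"We fail to reject H0 for {stat}.")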

5.013455854017642e-14
For HP we reject H0.
1.260186224618323e-16
For Attack we reject H0.
2.4134992474596658e-11
For Defense we reject H0.
7.757307056119906e-22
For Sp. Atk we reject H0.
1.1474663932026413e-15
For Sp. Def we reject H0.
5.245081559412255e-19
For Speed we reject H0.
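
The statement in 1.2 asks whether the stats are *different*, which is a two-sided question, and it involves several tests at once. The following is a minimal sketch of how that could look: two-sided Welch tests, each compared against a Bonferroni-adjusted threshold (`alpha / number_of_tests`). The Bonferroni step is an addition not required by the exercise, and the `demo` table is synthetic (two made-up stat columns, not the real Pokémon data).

```python
import numpy as np
import pandas as pd
import scipy.stats as st

# Synthetic stand-in table (NOT the real Pokémon data): a boolean flag and two stats.
rng = np.random.default_rng(1)
n = 200
demo = pd.DataFrame({
    "Legendary": rng.random(n) < 0.2,
    "HP": rng.normal(70, 20, n),
    "Speed": rng.normal(65, 25, n),
})
# Shift one stat for the flagged rows so there is a real difference to detect.
demo.loc[demo["Legendary"], "Speed"] += 40

stats = ["HP", "Speed"]
alpha = 0.05
alpha_per_test = alpha / len(stats)  # Bonferroni: split alpha across the tests

results = {}
for stat in stats:
    flagged = demo.loc[demo["Legendary"], stat]
    others = demo.loc[~demo["Legendary"], stat]
    # Two-sided alternative: the means differ in either direction.
    res = st.ttest_ind(flagged, others, equal_var=False, alternative="two-sided")
    reject = bool(res.pvalue < alpha_per_test)
    results[stat] = (res.statistic, res.pvalue, reject)
    print(f"{stat}: t={res.statistic:.3f}, p={res.pvalue:.3g}, reject H0: {reject}")
```

Bonferroni is conservative; it is used here only because it is the simplest way to keep the overall false-positive rate at or below `alpha` when several tests are run together.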


**Challenge 2**

In this challenge, we will be working with California housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [25]:
housing_df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
housing_df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 34)
- Hospital coordinates (-122, 37)

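We consider a house (neighborhood) to be close to a school or hospital if the Euclidean distance between their coordinates is less than 0.50 (measured in degrees of longitude/latitude, since that is the unit of the coordinates).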
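
Written out, the closeness rule amounts to the following, treating longitude and latitude as plain planar coordinates (a simplification the exercise adopts). The symbols $h$ (a house) and $p$ (a point of interest) are notation introduced here for illustration.

```latex
d(h, p) = \sqrt{(\mathrm{lon}_h - \mathrm{lon}_p)^2 + (\mathrm{lat}_h - \mathrm{lat}_p)^2},
\qquad
\mathrm{close}(h) \iff \min\bigl(d(h, \mathrm{school}),\; d(h, \mathrm{hospital})\bigr) < 0.50
```

For example, the first row of the dataset sits at $(-114.31,\ 34.19)$, so its distance to the school at $(-118,\ 34)$ is $\sqrt{3.69^2 + 0.19^2} = \sqrt{13.6522} \approx 3.695$, well above the 0.50 threshold.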

Hint:
- Write a function to calculate the Euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close to and far from either a hospital or a school.
- Choose the proper test and, at a 5% significance level, comment on your findings.
 

In [33]:
### Step 1: Write a function to calculate the Euclidean distance from each house (neighborhood) to the school and to the hospital.

# Define school and hospital coordinates as (longitude, latitude)
school_coords = (-118, 34)
hospital_coords = (-122, 37)

# Function to calculate Euclidean distance (works on scalars and on whole columns)
def euclidean_distance(lon1, lat1, lon2, lat2):
    return np.sqrt((lon2 - lon1)**2 + (lat2 - lat1)**2)

# Vectorized call: pass the whole columns instead of applying the function row by row
housing_df["distance_to_school"] = euclidean_distance(
    housing_df["longitude"], housing_df["latitude"], *school_coords)

housing_df["distance_to_hospital"] = euclidean_distance(
    housing_df["longitude"], housing_df["latitude"], *hospital_coords)

# Create a new column: is the house close to either the school or the hospital?
housing_df["close_to_school_or_hospital"] = (
    (housing_df["distance_to_school"] < 0.50) | (housing_df["distance_to_hospital"] < 0.50)
)

# Display result
housing_df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,distance_to_school,distance_to_hospital,close_to_school_or_hospital
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0,3.694888,8.187319,False
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0,3.552591,7.966235,False
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0,3.453940,8.143077,False
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0,3.448840,8.154416,False
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0,3.456848,8.183508,False
...,...,...,...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0,9.082070,4.233675,False
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0,9.168915,4.332320,False
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0,10.057614,5.358694,False
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0,10.026465,5.322593,False


In [45]:
### Step 2: Divide your dataset into houses close to and far from either a hospital or a school.
# Both results are Series of median_house_value, selected with the boolean flag column
close_df = housing_df.loc[housing_df["close_to_school_or_hospital"], "median_house_value"]
far_df = housing_df.loc[~housing_df["close_to_school_or_hospital"], "median_house_value"]

In [47]:
close_df

2366     124700.0
2367     137500.0
2368     169100.0
2371     182900.0
2372     220800.0
           ...   
15090    177500.0
15170    500001.0
15253    277700.0
15254    319400.0
15686    286100.0
Name: median_house_value, Length: 6829, dtype: float64

In [49]:
far_df

0         66900.0
1         80100.0
2         85700.0
3         73400.0
4         65500.0
           ...   
16995    111400.0
16996     79000.0
16997    103600.0
16998     85800.0
16999     94600.0
Name: median_house_value, Length: 10171, dtype: float64

In [56]:
### Step 3: Choose the proper test and, at a 5% significance level, comment on your findings.

# Null Hypothesis (H₀): The mean of median_house_value for houses close to a school/hospital is not higher than for houses farther away.
# Alternative Hypothesis (H₁): The mean of median_house_value is higher for houses close to a school/hospital.

# Perform a one-sided independent two-sample t-test (Welch's version, equal_var=False)
t_stat, p_value = st.ttest_ind(close_df, far_df, equal_var=False, alternative="greater")

alpha = 0.05
if p_value < alpha:
    print("Conclusion: We reject the null hypothesis. Houses near a school or hospital (indeed) tend to be more expensive, which is in line with our assumptions.")
else:
    print("Conclusion: We fail to reject the null hypothesis. There is no significant evidence that houses near a school or hospital are more expensive.")

Conclusion: We reject the null hypothesis. Houses near a school or hospital (indeed) tend to be more expensive, which is in line with our assumptions.
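
With thousands of rows in each group, even a small price gap can be statistically significant, so it may help to report the size of the difference next to the p-value. The sketch below does that on synthetic stand-in data (the group sizes, means, and spreads are arbitrary assumptions, not the housing data). `summarize_difference` is a hypothetical helper, and the Cohen's d variant used (root of the average of the two sample variances) is one simple convention among several.

```python
import numpy as np
import scipy.stats as st

# Synthetic stand-ins for the two price groups (NOT the real housing data).
rng = np.random.default_rng(2)
close_values = rng.normal(loc=230_000, scale=110_000, size=500)
far_values = rng.normal(loc=195_000, scale=115_000, size=800)

def summarize_difference(a, b):
    """Return the mean difference, a Cohen's d estimate, and the one-sided Welch p-value."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    mean_diff = a.mean() - b.mean()
    # Standardize by the root of the average of the two sample variances
    # (an assumption of this sketch; other pooling conventions exist).
    sd_avg = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    res = st.ttest_ind(a, b, equal_var=False, alternative="greater")
    return {
        "mean_diff": mean_diff,
        "cohens_d": mean_diff / sd_avg,
        "p_value": res.pvalue,
    }

summary = summarize_difference(close_values, far_values)
print(summary)
```

Applied to the lab data, the same helper could be called as `summarize_difference(close_df, far_df)` to see how large the price gap is in dollars and in standard-deviation units.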
