# Lab | Hypothesis Testing

**Objective**

Welcome to the Hypothesis Testing Lab. In this lab, we apply hypothesis tests to several scenarios and interpret the results.

We cover testing the mean of a single sample (one-sample t-test), comparing two independent groups (two-sample t-test), comparing dependent samples (paired-sample t-test), and comparing means across multiple groups with Analysis of Variance (ANOVA).

For each exercise, state the hypotheses first, then choose the test and interpret the p-value.

**Challenge 1**

In this challenge, we will be working with Pokémon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [2]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind



In [3]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df.head()


Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False


- We posit that Dragon-type Pokémon have, on average, higher HP than Grass-type Pokémon. Choose the proper test and, at the 5% significance level, comment on your findings.

In [4]:
# H0: mean HP of Dragon-type <= mean HP of Grass-type
# H1: mean HP of Dragon-type > mean HP of Grass-type

dragon_hp = df[df["Type 1"] == "Dragon"]["HP"]
grass_hp = df[df["Type 1"] == "Grass"]["HP"]

# One-sided two-sample t-test on two independent groups
t_stat, p_val = ttest_ind(dragon_hp, grass_hp, alternative="greater")

print("The T-statistic:", t_stat)
print("The p_value :", p_val)

alpha = 0.05

if p_val > alpha:
    print("The p-value is greater than alpha. We fail to reject H0: there is not enough evidence that Dragon has higher mean HP than Grass.")
else:
    print("The p-value is less than alpha. We reject H0: Dragon has higher mean HP than Grass.")


The T-statistic: 3.590444254130357
The p_value : 0.0002567969150153481
The p-value is less than alpha. We reject H0: Dragon has higher mean HP than Grass.
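
By default, `ttest_ind` assumes both groups have equal variance. When the two groups may differ in size or spread, Welch's variant (`equal_var=False`) is a common cross-check. The sketch below compares the two variants on synthetic data; the arrays `group_a` and `group_b` are made-up stand-ins, not the Pokémon HP columns, and the `alternative` argument assumes SciPy 1.6 or newer.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Hypothetical samples with different sizes and spreads (not the real data)
group_a = rng.normal(loc=80, scale=25, size=30)
group_b = rng.normal(loc=65, scale=15, size=70)

# Student's t-test: pools the two variances (SciPy's default)
student = ttest_ind(group_a, group_b, alternative="greater")

# Welch's t-test: estimates each group's variance separately
welch = ttest_ind(group_a, group_b, equal_var=False, alternative="greater")

print("Student:", student.statistic, student.pvalue)
print("Welch:  ", welch.statistic, welch.pvalue)
```

If both variants lead to the same decision at the chosen significance level, the conclusion does not hinge on the equal-variance assumption.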


- We posit that Legendary Pokémon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) compared with non-Legendary Pokémon. Choose the proper test and, at the 5% significance level, comment on your findings.


In [13]:
# H0: Legendary and non-Legendary Pokemon have the same mean for the stat
# H1: Legendary and non-Legendary Pokemon have different means for the stat
stats_cols = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']

legendary_df = df[df["Legendary"]]
non_legendary_df = df[~df["Legendary"]]

alpha = 0.05

# Two-sided two-sample t-test, run once per stat
for col in stats_cols:
    t_stat, p_val = ttest_ind(legendary_df[col], non_legendary_df[col], alternative="two-sided")
    print(f"\n {col}")
    print("T-statistic:", t_stat)
    print("The p value:", p_val)

    if p_val > alpha:
        print(" We fail to reject H0: there is not enough evidence of a difference")
    else:
        print(" We reject H0: there is a difference between them")



 HP
T-statistic: 8.036124405043928
The p value: 3.3306476848461913e-15
 We reject H0: there is a difference between them

 Attack
T-statistic: 10.397321023700622
The p value: 7.827253003205333e-24
 We reject H0: there is a difference between them

 Defense
T-statistic: 7.181240122992339
The p value: 1.5842226094427255e-12
 We reject H0: there is a difference between them

 Sp. Atk
T-statistic: 14.191406210846289
The p value: 6.314915770427265e-41
 We reject H0: there is a difference between them

 Sp. Def
T-statistic: 11.03775106120522
The p value: 1.8439809580409597e-26
 We reject H0: there is a difference between them

 Speed
T-statistic: 9.765234331931898
The p value: 2.3540754436898437e-21
 We reject H0: there is a difference between them
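
Running six tests at 5% each raises the chance of at least one false positive. One simple, conservative guard is a Bonferroni correction, which compares each p-value against `alpha` divided by the number of tests. The sketch below reuses the p-values printed above; the names `p_values`, `adjusted_alpha`, and `decisions` are illustrative and not part of the original notebook.

```python
# p-values copied from the output of the six t-tests above
p_values = {
    "HP": 3.3306476848461913e-15,
    "Attack": 7.827253003205333e-24,
    "Defense": 1.5842226094427255e-12,
    "Sp. Atk": 6.314915770427265e-41,
    "Sp. Def": 1.8439809580409597e-26,
    "Speed": 2.3540754436898437e-21,
}

alpha = 0.05
# Bonferroni: split the overall significance level across all tests
adjusted_alpha = alpha / len(p_values)

# True means H0 is rejected for that stat at the adjusted level
decisions = {stat: p < adjusted_alpha for stat, p in p_values.items()}

for stat, rejected in decisions.items():
    verdict = "reject H0" if rejected else "fail to reject H0"
    print(f"{stat}: {verdict} at adjusted alpha = {adjusted_alpha:.4f}")
```

All six p-values are far below the adjusted threshold, so the conclusion for each stat is unchanged.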


**Challenge 2**

In this challenge, we will be working with California housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [11]:
housing = pd.read_csv('https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv')
housing.head()


Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 34)
- Hospital coordinates (-122, 37)

We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.

Hint:
- Write a function to calculate the Euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close to and far from either a hospital or school.
- Choose the proper test and, at the 5% significance level, comment on your findings.
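
Before applying the closeness rule to the full dataset, it can help to check it by hand on a few points. The sketch below uses three made-up coordinate pairs (not rows from the housing data) together with the school and hospital coordinates given above; the array names are illustrative.

```python
import numpy as np

school = (-118.0, 34.0)
hospital = (-122.0, 37.0)

def euclidean_distance(lon1, lat1, lon2, lat2):
    """Straight-line distance in coordinate units (degrees)."""
    return np.sqrt((lon1 - lon2) ** 2 + (lat1 - lat2) ** 2)

# Hypothetical locations: near the school, near neither, near the hospital
lon = np.array([-118.2, -120.0, -122.3])
lat = np.array([34.1, 35.5, 37.2])

dist_school = euclidean_distance(lon, lat, *school)
dist_hospital = euclidean_distance(lon, lat, *hospital)

# Close means within 0.50 of the school OR the hospital
is_close = (dist_school < 0.50) | (dist_hospital < 0.50)
print(is_close)
```

The first and third points fall within 0.50 of one of the landmarks, while the second is 2.5 away from both.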
 

In [19]:
school_coords = (-118, 34)
hospital_coords = (-122, 37)

def euclidean_distance(lon1, lat1, lon2, lat2):
    return np.sqrt((lon1 - lon2)**2 + (lat1 - lat2)**2)

housing["dist_school"] = euclidean_distance(housing["longitude"], housing["latitude"], school_coords[0], school_coords[1])
housing["dist_hospital"] = euclidean_distance(housing["longitude"], housing["latitude"], hospital_coords[0], hospital_coords[1])


# A house is close if it is within 0.50 of the school or the hospital
housing["is_close"] = (housing["dist_school"] < 0.50) | (housing["dist_hospital"] < 0.50)

# Split median house values into close and far groups
close_houses = housing[housing["is_close"]]["median_house_value"]
far_houses   = housing[~housing["is_close"]]["median_house_value"]

# H0: mean value of close houses <= mean value of far houses
# H1: mean value of close houses > mean value of far houses
# One-sided two-sample t-test on two independent groups
t_stat, p_val = ttest_ind(close_houses, far_houses, alternative="greater")

print(f"T-statistic = {t_stat}")
print(f"P-value = {p_val}")

alpha = 0.05

if p_val < alpha:
    print("We reject H0: houses close to a school or hospital are more expensive")
else:
    print("We fail to reject H0: there is not enough evidence that closer houses are more expensive")



T-statistic = 38.04632342033554
P-value = 2.408917945663922e-304
We reject H0: houses close to a school or hospital are more expensive
