# Lab | Hypothesis Testing

**Objective**

Welcome to the Hypothesis Testing Lab! In this lab we apply hypothesis tests to real datasets and interpret the results.

The tests range from the mean of a single sample (One Sample T-Test), to differences between independent groups (Two Sample T-Test), to differences within dependent samples (Paired Sample T-Test). We also look at Analysis of Variance (ANOVA) for comparing means across more than two groups.

So grab your statistical tools, prepare your hypotheses, and let's get started!

**Challenge 1**

In this challenge, we will be working with Pokémon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [1]:
#libraries
import pandas as pd
import scipy.stats as st
import numpy as np

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


- We posit that Pokémon of type Dragon have, on average, higher HP than Pokémon of type Grass. Choose the proper test and, at a 5% significance level, comment on your findings.

In [3]:
#H0: avg dragon HP <= avg grass HP
#H1: avg dragon HP > avg grass HP

In [4]:
# One-tailed two-sample t-test (independent samples)
# Groups are defined by primary type ("Type 1") only

alpha = 0.05

grass_hp = df[df["Type 1"] == "Grass"]["HP"]
dragon_hp = df[df["Type 1"] == "Dragon"]["HP"]

t_stat, p_value = st.ttest_ind(dragon_hp, grass_hp, alternative = "greater")

print(f"alpha: {alpha}, p_value: {p_value}")

if p_value < alpha:
    print("We reject the null hypothesis at the 5% significance level. The average HP of Dragon Pokémon is greater than that of Grass Pokémon.")
else:
    print("We fail to reject the null hypothesis at the 5% significance level.")

alpha: 0.05, p_value: 0.0002567969150153481
We reject the null hypothesis at the 5% significance level. The average HP of Dragon Pokémon is greater than that of Grass Pokémon.


In [5]:
grass_hp.mean()

67.27142857142857

In [6]:
dragon_hp.mean()

83.3125
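
By default, `st.ttest_ind` pools the two sample variances (`equal_var=True`). The two groups need not have the same spread, so it may be worth checking whether the conclusion survives Welch's version of the test, which drops that assumption. A minimal sketch, assuming SciPy 1.6 or later for the `alternative` argument; the helper name `compare_pooled_and_welch` and the sample values are made up for illustration and are not taken from the Pokémon data.

```python
import numpy as np
import scipy.stats as st


def compare_pooled_and_welch(sample_a, sample_b, alpha=0.05):
    """Run a one-sided two-sample t-test twice: pooled-variance and Welch.

    H1 in both cases: mean(sample_a) > mean(sample_b).
    Returns a dict mapping the test name to (t_stat, p_value, reject_h0).
    """
    results = {}
    for name, equal_var in (("pooled", True), ("welch", False)):
        t_stat, p_value = st.ttest_ind(
            sample_a, sample_b, equal_var=equal_var, alternative="greater"
        )
        results[name] = (float(t_stat), float(p_value), bool(p_value < alpha))
    return results


# Illustrative values only (hypothetical, not the lab's data).
group_a = np.array([90, 85, 100, 78, 95, 88, 110, 92])
group_b = np.array([60, 72, 65, 80, 55, 70, 68, 75, 62, 66])

for name, (t_stat, p_value, reject) in compare_pooled_and_welch(group_a, group_b).items():
    print(f"{name}: t = {t_stat:.3f}, p = {p_value:.4g}, reject H0: {reject}")
```

In the notebook, the equivalent call would be `compare_pooled_and_welch(dragon_hp, grass_hp)`.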

- We posit that Legendary Pokémon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) compared with Non-Legendary Pokémon. Choose the proper test and, at a 5% significance level, comment on your findings.


In [7]:
#H0: legendary stats = non-legendary stats
#H1: legendary stats != non-legendary stats

In [8]:
df.head(0)

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary


In [9]:
# Two-sample t-test (two-sided)
# We compare Legendary and non-Legendary Pokémon one stat at a time to test whether the means are equal.

alpha = 0.05

stats = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]

legendary = df[df["Legendary"] == True]

not_legendary = df[df["Legendary"] == False]

for stat in stats:
    t_stat, p_value = st.ttest_ind(legendary[stat], not_legendary[stat])
    print(f"For {stat}, t_stat: {t_stat}, p_value: {p_value}")

    if p_value < alpha:
        print(f"We reject the null hypothesis at the 5% significance level. There is a significant difference between Legendary and non-Legendary Pokémon in {stat}")
    else:
        print(f"We fail to reject the null hypothesis for {stat}")

For HP, t_stat: 8.036124405043928, p_value: 3.330647684846191e-15
We reject the null hypothesis at the 5% significance level. There is a significant difference between Legendary and non-Legendary Pokémon in HP
For Attack, t_stat: 10.397321023700622, p_value: 7.827253003205333e-24
We reject the null hypothesis at the 5% significance level. There is a significant difference between Legendary and non-Legendary Pokémon in Attack
For Defense, t_stat: 7.181240122992339, p_value: 1.5842226094427255e-12
We reject the null hypothesis at the 5% significance level. There is a significant difference between Legendary and non-Legendary Pokémon in Defense
For Sp. Atk, t_stat: 14.191406210846289, p_value: 6.314915770427266e-41
We reject the null hypothesis at the 5% significance level. There is a significant difference between Legendary and non-Legendary Pokémon in Sp. Atk
For Sp. Def, t_stat: 11.03775106120522, p_value: 1.8439809580409594e-26
We reject the
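
The loop above runs six tests at the same 5% level, so the chance of at least one false rejection across the whole family is larger than 5%. One common, conservative adjustment is the Bonferroni correction, which compares each p-value with alpha divided by the number of tests. The sketch below applies it to the p-values printed above; the function name `bonferroni_reject` is hypothetical, and Speed is left out because its p-value does not appear in the output.

```python
def bonferroni_reject(p_values, alpha=0.05, n_tests=None):
    """Return {label: reject?} using the Bonferroni threshold alpha / n_tests."""
    n_tests = len(p_values) if n_tests is None else n_tests
    threshold = alpha / n_tests
    return {label: p < threshold for label, p in p_values.items()}


# p-values copied from the output above; Speed is omitted because its value is not shown.
p_values = {
    "HP": 3.330647684846191e-15,
    "Attack": 7.827253003205333e-24,
    "Defense": 1.5842226094427255e-12,
    "Sp. Atk": 6.314915770427266e-41,
    "Sp. Def": 1.8439809580409594e-26,
}

# Six stats are tested in the loop, so alpha is divided by 6.
decisions = bonferroni_reject(p_values, alpha=0.05, n_tests=6)
for stat, reject in decisions.items():
    print(f"{stat}: reject H0 after Bonferroni correction: {reject}")
```

With p-values this small, the correction does not change any of the five decisions shown.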

In [10]:
stats = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]


In [11]:
legendary["HP"].mean()

92.73846153846154

In [12]:
not_legendary["HP"].mean()

67.18231292517007

In [13]:
legendary["Attack"].mean()

116.67692307692307

In [14]:
not_legendary["Attack"].mean()

75.66938775510204

In [15]:
legendary["Defense"].mean()

99.66153846153846

In [16]:
not_legendary["Defense"].mean()

71.55918367346939

In [17]:
legendary["Sp. Atk"].mean()

122.18461538461538

In [18]:
not_legendary["Sp. Atk"].mean()

68.45442176870748

In [19]:
legendary["Sp. Def"].mean()

105.93846153846154

In [20]:
not_legendary["Sp. Def"].mean()

68.89251700680272

In [21]:
legendary["Speed"].mean()

100.18461538461538

In [22]:
not_legendary["Speed"].mean()

65.45578231292517
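
The per-stat means computed in the cells above can also be collected into a single table with `groupby`. A minimal sketch on a tiny hand-made frame so that it runs on its own; the four rows and their numbers are hypothetical, and in the notebook the same call would be applied to the Pokémon `df`.

```python
import pandas as pd

stats = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]

# Hypothetical rows, only to make the example self-contained.
toy = pd.DataFrame(
    {
        "Legendary": [True, True, False, False],
        "HP": [100, 80, 50, 70],
        "Attack": [120, 100, 60, 80],
        "Defense": [90, 110, 50, 70],
        "Sp. Atk": [130, 110, 40, 60],
        "Sp. Def": [100, 120, 50, 70],
        "Speed": [110, 90, 60, 80],
    }
)

# One row per group (False, then True), one column per stat.
group_means = toy.groupby("Legendary")[stats].mean()
print(group_means)
```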

**Challenge 2**

In this challenge, we will be working with California housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [23]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df.head(10)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0
5,-114.58,33.63,29.0,1387.0,236.0,671.0,239.0,3.3438,74000.0
6,-114.58,33.61,25.0,2907.0,680.0,1841.0,633.0,2.6768,82400.0
7,-114.59,34.83,41.0,812.0,168.0,375.0,158.0,1.7083,48500.0
8,-114.59,33.61,34.0,4789.0,1175.0,3134.0,1056.0,2.1782,58400.0
9,-114.6,34.83,46.0,1497.0,309.0,787.0,271.0,2.1908,48100.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 37)
- Hospital coordinates (-122, 34)

We consider a house (neighborhood) to be close to a school or hospital if its Euclidean distance to either one, measured in coordinate units (degrees of longitude and latitude), is less than 0.50.

Hint:
- Write a function to calculate the Euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close to and far from either a hospital or a school.
- Choose the proper test and, at a 5% significance level, comment on your findings.
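
To make the 0.50 threshold concrete, here is a small worked example of the "close to either" rule on two points. It is only a sketch: the helper name `is_close` and the first test point are hypothetical, while the second point is the first row of the housing table shown above.

```python
import numpy as np

school = (-118, 37)
hospital = (-122, 34)
threshold = 0.5


def is_close(longitude, latitude):
    """True when the point lies within `threshold` of the school or the hospital."""
    dist_school = np.hypot(longitude - school[0], latitude - school[1])
    dist_hospital = np.hypot(longitude - hospital[0], latitude - hospital[1])
    return bool(dist_school < threshold or dist_hospital < threshold)


# Hypothetical point: 0.3 west and 0.2 north of the school.
# Distance to school = sqrt(0.3**2 + 0.2**2) = sqrt(0.13), about 0.36, so it counts as close.
print(is_close(-118.3, 37.2))    # True

# First row of the housing data shown above: far from both points.
print(is_close(-114.31, 34.19))  # False
```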
 

In [24]:
#Hypothesis (two-sided, matching the test run below):
#H0: mean price of houses near a school or hospital = mean price of houses far from both
#H1: mean price of houses near a school or hospital != mean price of houses far from both

In [25]:
#Euclidean distance between the points (x1, y1) and (x2, y2)
def euc_dist(x1, x2, y1, y2):
    return np.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)

In [26]:
#Distances used to divide the dataset into houses close to and far from EITHER a hospital OR a school

school = (-118, 37)
hospital = (-122, 34)

#distance from school (computed on whole columns, no row-by-row apply needed)
df["dist_school"] = euc_dist(df["longitude"], school[0], df["latitude"], school[1])

#distance from hospital
df["dist_hospital"] = euc_dist(df["longitude"], hospital[0], df["latitude"], hospital[1])

In [27]:
df.head(3)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,dist_school,dist_hospital
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0,4.638125,7.692347
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0,4.384165,7.540617
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0,4.773856,7.446456


In [28]:
df.dtypes

longitude             float64
latitude              float64
housing_median_age    float64
total_rooms           float64
total_bedrooms        float64
population            float64
households            float64
median_income         float64
median_house_value    float64
dist_school           float64
dist_hospital         float64
dtype: object

In [29]:
df["close_to_school_hospital"] = ((df["dist_school"] < 0.5) | (df["dist_hospital"] < 0.5))

In [30]:
df["close_to_school_hospital"].value_counts()

close_to_school_hospital
False    16995
True         5
Name: count, dtype: int64

In [31]:
#We need to be cautious: only 5 data points are close to a hospital or a school, so any test result rests on a very small group.

In [32]:
alpha = 0.05

close_price = df[df["close_to_school_hospital"] == True]["median_house_value"]

far_price = df[df["close_to_school_hospital"] == False]["median_house_value"]


In [33]:
t_stat, p_value = st.ttest_ind(close_price, far_price)
p_value

0.026799733071128685

In [34]:
if p_value < alpha:
    print("We reject the null hypothesis at the 5% significance level. There is a significant "
          "difference between the prices of houses near and far from schools or hospitals.")
else:
    print("We fail to reject the null hypothesis at the 5% significance level.")

We reject the null hypothesis at the 5% significance level. There is a significant difference between the prices of houses near and far from schools or hospitals.
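
The posit is directional (houses close to a school or hospital are *more* expensive), while the test above is two-sided, and one of the groups has only 5 observations. A hedged sketch of two follow-up checks: a one-sided Welch t-test and the rank-based Mann-Whitney U test, which does not assume normally distributed prices. The function name `directional_checks` and the sample prices are hypothetical; the sketch assumes SciPy 1.6 or later for the `alternative` argument of `ttest_ind`.

```python
import numpy as np
import scipy.stats as st


def directional_checks(close_prices, far_prices):
    """Return one-sided p-values for H1: close_prices tend to exceed far_prices."""
    _, p_welch = st.ttest_ind(
        close_prices, far_prices, equal_var=False, alternative="greater"
    )
    _, p_mwu = st.mannwhitneyu(close_prices, far_prices, alternative="greater")
    return {"welch_t": float(p_welch), "mann_whitney_u": float(p_mwu)}


# Made-up prices, only to show the call; not taken from the housing data.
close_sample = np.array([310000, 295000, 350000, 280000, 330000])
far_sample = np.array([150000, 210000, 180000, 240000, 200000, 175000, 190000, 220000])

for name, p_value in directional_checks(close_sample, far_sample).items():
    print(f"{name}: one-sided p-value = {p_value:.4g}")
```

In the notebook, the equivalent call would be `directional_checks(close_price, far_price)`.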
