# Lab | Hypothesis Testing

**Objective**

Welcome to the Hypothesis Testing Lab, where we embark on an enlightening journey through the realm of statistical decision-making! In this laboratory, we delve into various scenarios, applying the powerful tools of hypothesis testing to scrutinize and interpret data.

From testing the mean of a single sample (One Sample T-Test), to investigating differences between independent groups (Two Sample T-Test), and exploring relationships within dependent samples (Paired Sample T-Test), our exploration knows no bounds. Furthermore, we'll venture into the realm of Analysis of Variance (ANOVA), unraveling the complexities of comparing means across multiple groups.

So, grab your statistical tools, prepare your hypotheses, and let's embark on this fascinating journey of exploration and discovery in the world of hypothesis testing!

**Challenge 1**

In this challenge, we will be working with pokemon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [8]:
#libraries
import pandas as pd
import scipy.stats as st
import numpy as np
from math import sqrt



In [9]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df.head(50)

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
5,Charmeleon,Fire,,58,64,58,80,65,80,1,False
6,Charizard,Fire,Flying,78,84,78,109,85,100,1,False
7,Mega Charizard X,Fire,Dragon,78,130,111,130,85,100,1,False
8,Mega Charizard Y,Fire,Flying,78,104,78,159,115,100,1,False
9,Squirtle,Water,,44,48,65,50,64,43,1,False


- We posit that Dragon-type Pokemon have, on average, higher HP than Grass-type Pokemon. Choose the proper test and, at the 5% significance level, comment on your findings.

In [11]:
#code here
#H0 : mean HP of Dragon-type Pokemon <= mean HP of Grass-type Pokemon
#H1 : mean HP of Dragon-type Pokemon >  mean HP of Grass-type Pokemon

alpha = 0.05

In [12]:
df["Type 1"].value_counts()

Type 1
Water       112
Normal       98
Grass        70
Bug          69
Psychic      57
Fire         52
Electric     44
Rock         44
Dragon       32
Ground       32
Ghost        32
Dark         31
Poison       28
Steel        27
Fighting     27
Ice          24
Fairy        17
Flying        4
Name: count, dtype: int64

In [13]:
df["Type 2"].value_counts()

Type 2
Flying      97
Ground      35
Poison      34
Psychic     33
Fighting    26
Grass       25
Fairy       23
Steel       22
Dark        20
Dragon      18
Water       14
Ghost       14
Ice         14
Rock        14
Fire        12
Electric     6
Normal       4
Bug          3
Name: count, dtype: int64

In [14]:
dragon_df = df[(df["Type 1"] == "Dragon") | (df["Type 2"] == "Dragon")]
dragon_df

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
7,Mega Charizard X,Fire,Dragon,78,130,111,130,85,100,1,False
159,Dratini,Dragon,,41,64,45,50,50,50,1,False
160,Dragonair,Dragon,,61,84,65,70,70,70,1,False
161,Dragonite,Dragon,Flying,91,134,95,100,100,80,1,False
196,Mega Ampharos,Electric,Dragon,90,95,105,165,110,45,2,False
249,Kingdra,Water,Dragon,75,95,95,95,95,85,2,False
275,Mega Sceptile,Grass,Dragon,70,110,75,145,85,145,3,False
360,Vibrava,Ground,Dragon,50,70,50,50,50,70,3,False
361,Flygon,Ground,Dragon,80,100,80,80,80,100,3,False
365,Altaria,Dragon,Flying,75,70,90,70,105,80,3,False


In [15]:
dragon_grass = dragon_df[(dragon_df["Type 1"] == "Grass") | (dragon_df["Type 2"] == "Grass")]
dragon_grass

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
275,Mega Sceptile,Grass,Dragon,70,110,75,145,85,145,3,False


In [16]:
grass_df = df[(df["Type 1"] == "Grass") | (df["Type 2"] == "Grass")]
grass_df

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
48,Oddish,Grass,Poison,45,50,55,75,65,30,1,False
...,...,...,...,...,...,...,...,...,...,...,...
783,Pumpkaboo Super Size,Ghost,Grass,59,66,70,44,55,41,6,False
784,Gourgeist Average Size,Ghost,Grass,65,90,122,58,75,84,6,False
785,Gourgeist Small Size,Ghost,Grass,55,85,122,58,75,99,6,False
786,Gourgeist Large Size,Ghost,Grass,75,95,122,58,75,69,6,False


In [17]:
grass_dragon = grass_df[(grass_df["Type 1"] == "Dragon") | (grass_df["Type 2"] == "Dragon")]
grass_dragon

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
275,Mega Sceptile,Grass,Dragon,70,110,75,145,85,145,3,False
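
Mega Sceptile appears in both groups, so the two samples are not strictly disjoint. One possible way to handle this is to drop rows that belong to both types before testing. The sketch below uses a small hypothetical frame with rows copied from the outputs above; the names `demo`, `is_dragon`, `is_grass`, `dragon_only` and `grass_only` are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical mini-frame; rows copied from the outputs shown above.
demo = pd.DataFrame({
    "Name": ["Bulbasaur", "Dratini", "Mega Sceptile", "Kingdra"],
    "Type 1": ["Grass", "Dragon", "Grass", "Water"],
    "Type 2": ["Poison", None, "Dragon", "Dragon"],
    "HP": [45, 41, 70, 75],
})

# Boolean masks: a Pokemon counts as a type if either type column matches.
is_dragon = (demo["Type 1"] == "Dragon") | (demo["Type 2"] == "Dragon")
is_grass = (demo["Type 1"] == "Grass") | (demo["Type 2"] == "Grass")

overlap = demo[is_dragon & is_grass]       # rows in both groups
dragon_only = demo[is_dragon & ~is_grass]  # Dragon but not Grass
grass_only = demo[is_grass & ~is_dragon]   # Grass but not Dragon

print(overlap["Name"].tolist())
print(dragon_only["Name"].tolist())
print(grass_only["Name"].tolist())
```

Whether to exclude such rows is a judgment call; with a single overlapping Pokemon the effect on the test is likely small.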


In [19]:
dragon_df_hp = dragon_df["HP"]

In [20]:
grass_df_hp = grass_df["HP"]

In [21]:
dragon_df_hp.mean()

82.9

In [22]:
grass_df_hp.mean()

66.05263157894737

In [23]:
st.ttest_ind(dragon_df_hp, grass_df_hp, equal_var=False, alternative = "greater")

TtestResult(statistic=4.097528915272702, pvalue=5.0907690611769255e-05, df=77.58086781513519)

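The p-value (about 5.1e-05) is below alpha = 0.05, so we reject the null hypothesis that Dragon-type Pokemon have, on average, HP less than or equal to that of Grass-type Pokemon. The data support the claim that Dragon types have higher average HP.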
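
The fractional `df` in the result comes from Welch's test (`equal_var=False`), which does not pool the two variances. As a minimal sketch of what that computes, the hypothetical helper `welch_t` below works out the t statistic and the Welch-Satterthwaite degrees of freedom by hand. The two sample lists are invented to keep the arithmetic simple and are not taken from the Pokemon data.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    # Squared standard error of each sample mean (sample variance / n).
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof

# Hypothetical HP samples, chosen only to keep the arithmetic easy to follow.
group_a = [80, 90, 100]
group_b = [60, 70, 80, 90]

t_stat, dof = welch_t(group_a, group_b)
print(t_stat, dof)
```

For these samples the statistic is the square root of 3 and the degrees of freedom fall just below 5, which illustrates why Welch's test usually reports a non-integer `df`.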

- We posit that Legendary Pokemon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) compared with Non-Legendary Pokemon. Choose the proper test and, at the 5% significance level, comment on your findings.



H0 : Legendary Pokemon have the same stats (HP, Attack, Defense, Sp.Atk, Sp.Def, Speed) as non-legendary Pokemon
H1 : Legendary Pokemon have different stats (HP, Attack, Defense, Sp.Atk, Sp.Def, Speed) than non-legendary Pokemon

In [27]:
alpha = 0.05

In [28]:
stats = ["HP","Attack","Defense","Sp. Atk","Sp. Def","Speed"]

In [29]:
legendary_df = df[df["Legendary"] == True]
non_legendary_df = df[df["Legendary"] == False]

In [30]:
legendary_df.describe()

Unnamed: 0,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation
count,65.0,65.0,65.0,65.0,65.0,65.0,65.0
mean,92.738462,116.676923,99.661538,122.184615,105.938462,100.184615,3.769231
std,21.722164,30.348037,28.255131,31.104608,28.827004,22.952323,1.455262
min,50.0,50.0,20.0,50.0,20.0,50.0,1.0
25%,80.0,100.0,90.0,100.0,90.0,90.0,3.0
50%,91.0,110.0,100.0,120.0,100.0,100.0,4.0
75%,105.0,131.0,115.0,150.0,120.0,110.0,5.0
max,150.0,190.0,200.0,194.0,200.0,180.0,6.0


In [31]:
non_legendary_df.describe()

Unnamed: 0,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation
count,735.0,735.0,735.0,735.0,735.0,735.0,735.0
mean,67.182313,75.669388,71.559184,68.454422,68.892517,65.455782,3.284354
std,24.808849,30.490153,30.408194,29.091705,25.66931,27.843038,1.673471
min,1.0,5.0,5.0,10.0,20.0,5.0,1.0
25%,50.0,54.5,50.0,45.0,50.0,45.0,2.0
50%,65.0,72.0,66.0,65.0,65.0,64.0,3.0
75%,79.5,95.0,85.0,85.0,85.0,85.0,5.0
max,255.0,185.0,230.0,175.0,230.0,160.0,6.0


In [32]:

# List of stat columns to analyze
stats = ["HP","Attack","Defense","Sp. Atk","Sp. Def","Speed"]

# Perform t-test for each stat
t_test_results = {}
for stat in stats:
    t_stat, p_value = st.ttest_ind(legendary_df[stat], non_legendary_df[stat])
    t_test_results[stat] = p_value

# Output p-values for each stat
print(t_test_results)


{'HP': 3.3306476848461913e-15, 'Attack': 7.827253003205333e-24, 'Defense': 1.5842226094427255e-12, 'Sp. Atk': 6.314915770427265e-41, 'Sp. Def': 1.8439809580409597e-26, 'Speed': 2.3540754436898437e-21}


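Every p-value is far below alpha = 0.05, so for each of the six stats we reject the null hypothesis that Legendary and non-Legendary Pokemon have the same mean. Note that `st.ttest_ind` was called with its default `equal_var=True`, i.e. a pooled-variance (Student's) two-sided t-test.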
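
The decision rule can be written out explicitly. The sketch below reuses the p-values printed above. The Bonferroni threshold (`alpha` divided by the number of tests) is an optional, more conservative check that is not part of the original analysis, and the names `alpha_bonferroni` and `decisions` are illustrative.

```python
# p-values copied from the output above
p_values = {
    "HP": 3.3306476848461913e-15,
    "Attack": 7.827253003205333e-24,
    "Defense": 1.5842226094427255e-12,
    "Sp. Atk": 6.314915770427265e-41,
    "Sp. Def": 1.8439809580409597e-26,
    "Speed": 2.3540754436898437e-21,
}

alpha = 0.05
# Optional stricter threshold for six simultaneous tests (assumption, not in the notebook).
alpha_bonferroni = alpha / len(p_values)

# For each stat: (reject at alpha, reject at the Bonferroni threshold)
decisions = {
    stat: (p < alpha, p < alpha_bonferroni) for stat, p in p_values.items()
}

for stat, (reject, reject_strict) in decisions.items():
    print(f"{stat}: reject H0 = {reject}, reject H0 (Bonferroni) = {reject_strict}")
```

Because all six p-values are many orders of magnitude below either threshold, the conclusion does not change under the stricter rule.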

**Challenge 2**

In this challenge, we will be working with california-housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [36]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 37)
- Hospital coordinates (-122, 34)

We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.

Hint:
- Write a function to calculate euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close and far from either a hospital or school.
- Choose the proper test and, at the 5% significance level, comment on your findings.
 

In [38]:
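def calculate_distance(df):
    """
    Add Euclidean distances from each row to the school and to the hospital.

    For A(lon1, lat1) and B(lon2, lat2):
    d(A, B) = sqrt((lon2 - lon1)**2 + (lat2 - lat1)**2)

    Adds the columns "school_dist" and "hospital_dist" to df and returns it.
    """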
    S_coord = (-118, 37)
    H_coord = (-122, 34)

    df["school_dist"] = 0.0
    df["hospital_dist"] = 0.0
    
    for index, row in df.iterrows():
        d_s = sqrt((row["longitude"]-S_coord[0])**2+(row["latitude"]-S_coord[1])**2)
        d_h = sqrt((row["longitude"]-H_coord[0])**2+(row["latitude"]-H_coord[1])**2)
        
        df.at[index, "school_dist"] = float(d_s)  # Ensure it's a float
        df.at[index, "hospital_dist"] = float(d_h)  # Ensure it's a float
    return df
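
The row-by-row loop works, but the same distances can be computed on whole columns at once. The following is a minimal sketch of a vectorized alternative, assuming the column names `longitude` and `latitude` from the dataset. The function name `add_distances` and its parameters are hypothetical, and the demo frame holds the first two rows of the housing data.

```python
import numpy as np
import pandas as pd

def add_distances(frame, school=(-118, 37), hospital=(-122, 34)):
    """Return a copy of frame with school_dist and hospital_dist columns added."""
    out = frame.copy()  # leave the caller's frame unchanged
    # np.hypot(x, y) computes sqrt(x**2 + y**2) element-wise.
    out["school_dist"] = np.hypot(
        out["longitude"] - school[0], out["latitude"] - school[1]
    )
    out["hospital_dist"] = np.hypot(
        out["longitude"] - hospital[0], out["latitude"] - hospital[1]
    )
    return out

# First two rows of the housing data (longitude and latitude only).
demo = pd.DataFrame({"longitude": [-114.31, -114.47], "latitude": [34.19, 34.40]})
result = add_distances(demo)
print(result)
```

Unlike `calculate_distance`, this sketch returns a new frame instead of modifying its argument, which avoids side effects on `df`.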

In [39]:
print(df.columns)

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value'],
      dtype='object')


In [40]:
df2 = calculate_distance(df)
df2

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,school_dist,hospital_dist
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0,4.638125,7.692347
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0,4.384165,7.540617
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0,4.773856,7.446456
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0,4.801510,7.438716
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0,4.850753,7.442432
...,...,...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0,7.211380,6.957298
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0,7.275232,7.064630
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0,7.944533,8.170410
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0,7.920227,8.132035


In [41]:
lon1 =-114.31
lat1 = 34.19	
lon2 = -118
lat2 = 37


In [42]:
d = sqrt((lon2-lon1)**2+(lat2-lat1)**2)
d

4.638124621007934

In [43]:
df2.sort_values("school_dist").head(50)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,school_dist,hospital_dist
6904,-118.31,36.94,35.0,2563.0,530.0,861.0,371.0,2.325,80600.0,0.315753,4.718019
6776,-118.3,37.17,22.0,3480.0,673.0,1541.0,636.0,2.75,94500.0,0.344819,4.872258
4523,-118.05,36.64,34.0,2090.0,478.0,896.0,426.0,2.0357,74200.0,0.363456,4.75101
5596,-118.18,37.35,16.0,3806.0,794.0,1501.0,714.0,2.1212,108300.0,0.393573,5.080837
5597,-118.18,36.63,23.0,2311.0,487.0,1019.0,384.0,2.2574,104700.0,0.411461,4.637812
7771,-118.39,37.37,25.0,3295.0,824.0,1477.0,770.0,1.8325,105800.0,0.537587,4.938522
7849,-118.4,37.36,34.0,2465.0,619.0,1172.0,575.0,1.9722,116100.0,0.538145,4.924388
8010,-118.42,37.35,21.0,3302.0,557.0,1413.0,520.0,4.375,180400.0,0.546717,4.902948
8009,-118.42,37.36,18.0,2281.0,520.0,1425.0,465.0,1.7388,54400.0,0.553173,4.909786
8254,-118.45,37.37,26.0,3135.0,524.0,1385.0,523.0,4.337,139700.0,0.58258,4.894834


In [44]:
df2.sort_values("hospital_dist")

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,school_dist,hospital_dist
10692,-120.59,34.70,29.0,17738.0,3114.0,12427.0,2826.0,2.7377,28300.0,3.463827,1.574198
10633,-120.48,34.65,26.0,1933.0,316.0,1001.0,319.0,4.4628,134400.0,3.416563,1.653149
10619,-120.47,34.63,23.0,2441.0,463.0,1392.0,434.0,3.7917,142200.0,3.423127,1.654630
10632,-120.48,34.66,4.0,1897.0,331.0,915.0,336.0,4.1563,172600.0,3.409692,1.657106
10617,-120.47,34.64,8.0,2482.0,586.0,1427.0,540.0,3.0710,120400.0,3.416211,1.658463
...,...,...,...,...,...,...,...,...,...,...,...
16883,-123.83,41.88,18.0,1504.0,357.0,660.0,258.0,3.1300,116700.0,7.602848,8.089703
16949,-124.15,41.81,17.0,3276.0,628.0,3546.0,585.0,2.2868,103100.0,7.807599,8.100531
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0,7.920227,8.132035
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0,7.944533,8.170410


In [45]:
# School coordinates (-118, 37)
# Hospital coordinates (-122, 34)

In [46]:
df_close = df2[(df2["school_dist"]<0.5)|(df2["hospital_dist"]<0.5)]
df_close

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,school_dist,hospital_dist
4523,-118.05,36.64,34.0,2090.0,478.0,896.0,426.0,2.0357,74200.0,0.363456,4.75101
5596,-118.18,37.35,16.0,3806.0,794.0,1501.0,714.0,2.1212,108300.0,0.393573,5.080837
5597,-118.18,36.63,23.0,2311.0,487.0,1019.0,384.0,2.2574,104700.0,0.411461,4.637812
6776,-118.3,37.17,22.0,3480.0,673.0,1541.0,636.0,2.75,94500.0,0.344819,4.872258
6904,-118.31,36.94,35.0,2563.0,530.0,861.0,371.0,2.325,80600.0,0.315753,4.718019


In [48]:
df_far = df2[(df2["school_dist"]>0.5)|(df2["hospital_dist"]>0.5)]
df_far

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,school_dist,hospital_dist
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0,4.638125,7.692347
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0,4.384165,7.540617
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0,4.773856,7.446456
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0,4.801510,7.438716
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0,4.850753,7.442432
...,...,...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0,7.211380,6.957298
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0,7.275232,7.064630
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0,7.944533,8.170410
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0,7.920227,8.132035
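
Note that joining the two `>` conditions with `|` keeps every row that is far from at least one of the two places, which includes the rows that are close to the other one. A complementary split can be obtained by negating the "close" mask. The sketch below illustrates the difference on three rows of distances copied from the outputs above; the names `demo`, `is_close`, `far_with_or` and `far_complement` are illustrative.

```python
import pandas as pd

# Three rows of distances copied from the outputs above (one close, two far).
demo = pd.DataFrame({
    "school_dist": [0.315753, 0.537587, 4.638125],
    "hospital_dist": [4.718019, 4.938522, 7.692347],
})

is_close = (demo["school_dist"] < 0.5) | (demo["hospital_dist"] < 0.5)

# Filter as written in the cell above: far from at least one of the two places.
far_with_or = demo[(demo["school_dist"] > 0.5) | (demo["hospital_dist"] > 0.5)]

# Complement of "close": far from both places.
far_complement = demo[~is_close]

print(len(far_with_or))     # 3: the close row is still included
print(len(far_complement))  # 2: only rows far from both places
```

With `~is_close`, the close and far groups never share a row and together cover the whole dataset.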


In [52]:
df_close["median_house_value"].describe()

count         5.000000
mean      92460.000000
std       14823.730974
min       74200.000000
25%       80600.000000
50%       94500.000000
75%      104700.000000
max      108300.000000
Name: median_house_value, dtype: float64

In [54]:
df_far["median_house_value"].describe()

count     17000.000000
mean     207300.912353
std      115983.764387
min       14999.000000
25%      119400.000000
50%      180400.000000
75%      265000.000000
max      500001.000000
Name: median_house_value, dtype: float64

H0: Houses closer to a school or hospital are as expensive or cheaper than houses farther away.
H1: Houses closer to a school or hospital are more expensive than houses farther away.

alpha = 0.05


In [57]:
alpha = 0.05

In [63]:


t_stat, p_value = st.ttest_ind(df_close['median_house_value'], df_far['median_house_value'], equal_var=False, alternative = "greater" )

t_stat, p_value


(-17.169161722319306, 0.9999738667251178)

The one-sided Welch t-test gives a p-value of about 0.99997 (t ≈ -17.17), so we cannot reject the null hypothesis. The sample means actually point the opposite way: neighborhoods close to the school or hospital have a mean median house value of about 92,460, compared with about 207,301 for the far group. Note that `df_far` as built above contains all 17,000 rows, including the 5 close ones, although with so few close rows this barely affects the comparison. With only 5 neighborhoods meeting the closeness criterion, the test has little power and the estimate for the close group is highly variable, so no robust conclusion can be drawn about how house prices depend on the distance to a school or a hospital.
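
A p-value close to 1 in a one-sided test means that the observed difference lies in the direction opposite to the alternative hypothesis. The sketch below illustrates this with `st.ttest_ind` on two invented samples (not the housing data): for the same samples, the p-values for `alternative="greater"` and `alternative="less"` sum to 1.

```python
import scipy.stats as st

# Hypothetical samples in which the first group has the lower mean.
close_vals = [74.0, 81.0, 95.0, 105.0, 108.0]
far_vals = [150.0, 180.0, 210.0, 240.0, 265.0, 300.0]

# H1: mean(close_vals) > mean(far_vals)
greater = st.ttest_ind(close_vals, far_vals, equal_var=False, alternative="greater")
# H1: mean(close_vals) < mean(far_vals)
less = st.ttest_ind(close_vals, far_vals, equal_var=False, alternative="less")

print(greater.statistic, greater.pvalue)
print(less.pvalue)
```

A large p-value under "greater" does not by itself establish the reverse claim; that would be a separate hypothesis, ideally stated before looking at the data.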