# Lab | Hypothesis Testing

**Objective**

In this lab we apply hypothesis testing to several datasets and interpret the results.

We cover testing the mean of a single sample (One Sample T-Test), comparing two independent groups (Two Sample T-Test), comparing dependent samples (Paired Sample T-Test), and comparing means across more than two groups with Analysis of Variance (ANOVA).

For each question, state your hypotheses, choose the appropriate test, and comment on the result.

**Challenge 1**

In this challenge, we will be working with pokemon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [2]:
#libraries
import pandas as pd
import scipy.stats as st
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [4]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        799 non-null    object
 1   Type 1      800 non-null    object
 2   Type 2      414 non-null    object
 3   HP          800 non-null    int64 
 4   Attack      800 non-null    int64 
 5   Defense     800 non-null    int64 
 6   Sp. Atk     800 non-null    int64 
 7   Sp. Def     800 non-null    int64 
 8   Speed       800 non-null    int64 
 9   Generation  800 non-null    int64 
 10  Legendary   800 non-null    bool  
dtypes: bool(1), int64(7), object(3)
memory usage: 63.4+ KB


In [16]:
df.nunique()

Name          799
Type 1         18
Type 2         18
HP             94
Attack        111
Defense       103
Sp. Atk       105
Sp. Def        92
Speed         108
Generation      6
Legendary       2
dtype: int64

In [20]:
df.value_counts()

Name                Type 1    Type 2  HP   Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary
Abomasnow           Grass     Ice     90   92      75       92       85       60     4           False        1
Palpitoad           Water     Ground  75   65      55       65       55       69     5           False        1
Pidove              Normal    Flying  50   55      50       36       30       43     5           False        1
Pidgey              Normal    Flying  40   45      40       35       35       56     1           False        1
Pidgeotto           Normal    Flying  63   60      55       50       50       71     1           False        1
                                                                                                             ..
Helioptile          Electric  Normal  44   38      33       61       43       70     6           False        1
Heliolisk           Electric  Normal  62   55      52       109      94       109    6           False       

In [22]:
df['Type 1'].value_counts()

Type 1
Water       112
Normal       98
Grass        70
Bug          69
Psychic      57
Fire         52
Electric     44
Rock         44
Dragon       32
Ground       32
Ghost        32
Dark         31
Poison       28
Steel        27
Fighting     27
Ice          24
Fairy        17
Flying        4
Name: count, dtype: int64

In [24]:
df['Type 2'].value_counts()

Type 2
Flying      97
Ground      35
Poison      34
Psychic     33
Fighting    26
Grass       25
Fairy       23
Steel       22
Dark        20
Dragon      18
Water       14
Ghost       14
Ice         14
Rock        14
Fire        12
Electric     6
Normal       4
Bug          3
Name: count, dtype: int64

In [30]:
df['Type 1'].unique()

array(['Grass', 'Fire', 'Water', 'Bug', 'Normal', 'Poison', 'Electric',
       'Ground', 'Fairy', 'Fighting', 'Psychic', 'Rock', 'Ghost', 'Ice',
       'Dragon', 'Dark', 'Steel', 'Flying'], dtype=object)

In [150]:
# Hypothesis: Pokemons of type Dragon have, on average, more HP than Pokemons of type Grass.
# Choose the proper test and, with 5% significance, comment on your findings.

# Set the hypotheses:
# H0: mean HP of type 1 "Dragon" <= mean HP of type 1 "Grass"
# H1: mean HP of type 1 "Dragon" > mean HP of type 1 "Grass"

In [34]:
# Choose significance level
alpha = 0.05

In [44]:
# Collect the data: Group by "Type 1" and calculate the mean of "HP":
df_type1_HP = df.groupby('Type 1', as_index=False)['HP'].mean()
df_type1_HP

Unnamed: 0,Type 1,HP
0,Bug,56.884058
1,Dark,66.806452
2,Dragon,83.3125
3,Electric,59.795455
4,Fairy,74.117647
5,Fighting,69.851852
6,Fire,69.903846
7,Flying,70.75
8,Ghost,64.4375
9,Grass,67.271429


In [52]:
# Filter HP values for Type 1 = Dragon and Type 1 = Grass from original df:
hp_dragon = df[df["Type 1"] == "Dragon"]["HP"]
hp_grass = df[df["Type 1"] == "Grass"]["HP"]

# Perform a one-sided two-sample t-test; equal_var=False selects Welch's t-test (no equal-variance assumption)
t_stat, p_value = st.ttest_ind(hp_dragon, hp_grass, equal_var=False, alternative='greater')

# Display results
print(f"T-statistic: {t_stat}, P-value: {p_value}")

T-statistic: 3.3349632905124063, P-value: 0.0007993609745420599


In [152]:
# This means that:
# Type 1 = Dragon has an average HP of 83.31,
# Type 1 = Grass has an average HP of 67.27,
# The p-value is 0.0008 (about 0.08%),
# which is much lower than the significance level alpha = 0.05.
# H0: mean HP of type 1 "Dragon" <= mean HP of type 1 "Grass" ==> rejected
# H1: mean HP of type 1 "Dragon" > mean HP of type 1 "Grass" ==> supported by the data
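
As a cross-check on what `st.ttest_ind(..., equal_var=False)` computes, here is a minimal sketch of the Welch t-statistic and the Welch-Satterthwaite degrees of freedom using only the standard library. The two samples and the helper name `welch_t` are made up for illustration; they are not the Pokemon HP values.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Sample variances (n - 1 in the denominator), each divided by its sample size
    v_a = statistics.variance(sample_a) / n_a
    v_b = statistics.variance(sample_b) / n_b
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(v_a + v_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (v_a + v_b) ** 2 / (v_a ** 2 / (n_a - 1) + v_b ** 2 / (n_b - 1))
    return t, df

# Hypothetical HP-like samples (not taken from the dataset)
group_a = [80, 90, 100, 70, 85]
group_b = [60, 65, 70, 55, 75, 65]

t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.4f}, df = {dof:.2f}")
```

The p-value then comes from the t-distribution with `dof` degrees of freedom, which is the part SciPy handles for us.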

- We posit that Legendary Pokemons have different stats (HP, Attack, Defense, Sp.Atk, Sp.Def, Speed) compared with Non-Legendary ones. Choose the proper test and, with 5% significance, comment on your findings.


In [58]:
# Legendary Pokemons have different stats (HP, Attack, Defense, Sp.Atk, Sp.Def, Speed) compared with Non-Legendary ones.
# Choose the proper test and, with 5% significance, comment on your findings.
df["Legendary"].nunique()

2

In [64]:
# Set the hypotheses (for each stat separately):
# H0: mean stat of Legendary=TRUE = mean stat of Legendary=FALSE
# H1: mean stat of Legendary=TRUE != mean stat of Legendary=FALSE

In [80]:
# Filter Legendary values for TRUE and FALSE from original df:
legendary_true = df[df["Legendary"] == True]
legendary_false = df[df["Legendary"] == False]

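# First attempt: a Welch two-sample t-test (equal_var=False) on the full DataFrames.
# This raises the TypeError shown below, because the frames still contain non-numeric columns (Name, Type 1, Type 2).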
t_stat, p_value = st.ttest_ind(legendary_true, legendary_false, equal_var=False, alternative='greater')

# Display results
print(f"T-statistic: {t_stat}, P-value: {p_value}")

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

In [76]:
# Running multiple two-sample t-tests instead:
import statsmodels.stats.multitest as smm

# Filter Legendary values for TRUE and FALSE from original df:
legendary_true = df[df["Legendary"] == True]
legendary_false = df[df["Legendary"] == False]

# Columns to test
columns = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']

# Store results
p_values = []
t_stats = []

for col in columns:
    # Welch's t-test per column (equal_var=False); two-sided, because H1 only claims that the means differ
    t_stat, p_val = st.ttest_ind(legendary_true[col], legendary_false[col], equal_var=False, alternative='two-sided')
    t_stats.append(t_stat)
    p_values.append(p_val)

# Apply Bonferroni correction (alternative: use 'fdr_bh' for FDR correction) - suggested by ChatGPT
_, p_values_corrected, _, _ = smm.multipletests(p_values, alpha=0.05, method='bonferroni')

# Display results - suggested by ChatGPT
for col, t_stat, p_val, p_corr in zip(columns, t_stats, p_values, p_values_corrected):
    print(f"{col}: t-stat={t_stat:.3f}, p-value={p_val:.4f}, Bonferroni-corrected p-value={p_corr:.4f}")

HP: t-stat=8.981, p-value=0.0000, Bonferroni-corrected p-value=0.0000
Attack: t-stat=10.438, p-value=0.0000, Bonferroni-corrected p-value=0.0000
Defense: t-stat=7.637, p-value=0.0000, Bonferroni-corrected p-value=0.0000
Sp. Atk: t-stat=13.417, p-value=0.0000, Bonferroni-corrected p-value=0.0000
Sp. Def: t-stat=10.016, p-value=0.0000, Bonferroni-corrected p-value=0.0000
Speed: t-stat=11.475, p-value=0.0000, Bonferroni-corrected p-value=0.0000


In [82]:
# This means that:
# In all 6 categories ('HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed') the t-statistic is positive, so Legendary=True has the higher average value.
# All 6 p-values print as 0.0000, even after Bonferroni correction, far below the significance level of 0.05.
# H0: mean stats of Legendary=TRUE = mean stats of Legendary=FALSE ==> rejected
# H1: mean stats of Legendary=TRUE != mean stats of Legendary=FALSE ==> supported by the data
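
To make the Bonferroni step less of a black box, here is a minimal sketch of the correction done by hand: each p-value is multiplied by the number of tests and capped at 1. The helper name `bonferroni` and the raw p-values are hypothetical and chosen only so that the correction visibly changes a decision.

```python
def bonferroni(p_values, alpha=0.05):
    """Return Bonferroni-corrected p-values and the reject decisions."""
    m = len(p_values)
    # Multiply each p-value by the number of tests, capped at 1.0
    corrected = [min(p * m, 1.0) for p in p_values]
    # Reject H0 only where the corrected p-value is at most alpha
    reject = [p_corr <= alpha for p_corr in corrected]
    return corrected, reject

# Hypothetical raw p-values from six tests (not the notebook's results)
raw_p = [0.001, 0.02, 0.04, 0.3, 0.009, 0.5]
corrected_p, reject_h0 = bonferroni(raw_p)

for p, p_corr, rej in zip(raw_p, corrected_p, reject_h0):
    print(f"raw={p:.3f}  corrected={p_corr:.3f}  reject H0: {rej}")
```

With these made-up values, 0.009 would pass an uncorrected 0.05 threshold but fails once it is multiplied by 6.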

**Challenge 2**

In [110]:
df_ch = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df_ch.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 34)
- Hospital coordinates (-122, 37)

We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.

Hint:
- Write a function to calculate euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close and far from either a hospital or school.
- Choose the proper test and, with 5% significance, comment on your findings.
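
The first two hints can be sketched on plain coordinates before touching the DataFrame. This is a minimal illustration, assuming distances are measured in degrees of longitude and latitude; the names `euclidean_distance`, `is_close`, and the example points are hypothetical.

```python
import math

SCHOOL = (-118.0, 34.0)    # (longitude, latitude)
HOSPITAL = (-122.0, 37.0)  # (longitude, latitude)
THRESHOLD = 0.50

def euclidean_distance(lon, lat, point):
    """Straight-line distance in degrees between (lon, lat) and point."""
    return math.hypot(lon - point[0], lat - point[1])

def is_close(lon, lat):
    """True if the location is within THRESHOLD of the school or the hospital."""
    return (euclidean_distance(lon, lat, SCHOOL) < THRESHOLD
            or euclidean_distance(lon, lat, HOSPITAL) < THRESHOLD)

# Hypothetical locations: near the school, near the hospital, far from both
examples = [(-118.3, 34.2), (-121.9, 37.1), (-115.0, 34.0)]
labels = [is_close(lon, lat) for lon, lat in examples]
print(labels)
```

A location counts as "close" if either distance is under the threshold, so the "far" group is everything where both distances are at least 0.50.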
 

In [112]:
# Hypothesis: houses close to either a school or a hospital are more expensive.
# School coordinates (longitude -118, latitude 34), Hospital coordinates (longitude -122, latitude 37)
# We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.

In [118]:
# Write a function to calculate euclidean distance from each house (neighborhood) to the school and to the hospital.
# Include these distances as new columns in the dataframe df_ch.

def calculate_distances(df_ch):
    # Define coordinates as (longitude, latitude)
    school_coords = (-118, 34)
    hospital_coords = (-122, 37)

    # Compute Euclidean distances in degrees; the new columns are added to df_ch in place
    df_ch['distance_to_school'] = np.sqrt((df_ch['longitude'] - school_coords[0])**2 + (df_ch['latitude'] - school_coords[1])**2)
    df_ch['distance_to_hospital'] = np.sqrt((df_ch['longitude'] - hospital_coords[0])**2 + (df_ch['latitude'] - hospital_coords[1])**2)

    return df_ch

df_ch_dis = calculate_distances(df_ch)

In [120]:
df_ch_dis

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,distance_to_school,distance_to_hospital
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0,3.694888,8.187319
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0,3.552591,7.966235
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0,3.453940,8.143077
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0,3.448840,8.154416
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0,3.456848,8.183508
...,...,...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0,9.082070,4.233675
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0,9.168915,4.332320
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0,10.057614,5.358694
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0,10.026465,5.322593


In [124]:
# Test the function on a small frame with known answers:
# (-115, 34) lies 3.0 from the school; (-125, 37) lies 3.0 from the hospital.
df_ch_test = pd.DataFrame({'longitude': [-115, -125], 'latitude': [34, 37], 'median_house_value': [50000, 150000]})
df_ch_test = calculate_distances(df_ch_test)  # pass the test frame, not df_ch
assert np.allclose(df_ch_test['distance_to_school'], [3.0, np.sqrt(58)])
assert np.allclose(df_ch_test['distance_to_hospital'], [np.sqrt(58), 3.0])


In [132]:
# Divide your dataset into houses close and far from either a hospital or school.
# We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.
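# Define proximity based on distance threshold: lower than 0.50
# Note: this builds two "close" subsets (near the school, near the hospital); no "far" group is defined here.
df_ch_school = df_ch[df_ch["distance_to_school"] < 0.50]
df_ch_hospital = df_ch[df_ch["distance_to_hospital"] < 0.50]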

In [134]:
df_ch_school

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,distance_to_school,distance_to_hospital
2366,-117.51,34.00,36.0,3791.0,746.0,2258.0,672.0,3.2067,124700.0,0.490000,5.400009
2367,-117.51,33.97,35.0,352.0,62.0,184.0,57.0,3.6691,137500.0,0.490918,5.416733
2368,-117.51,33.95,12.0,9016.0,1486.0,4285.0,1457.0,4.9984,169100.0,0.492544,5.427946
2371,-117.52,33.99,14.0,13562.0,2057.0,7600.0,2086.0,5.2759,182900.0,0.480104,5.397268
2372,-117.52,33.89,2.0,17978.0,3217.0,7305.0,2463.0,5.1695,220800.0,0.492443,5.453668
...,...,...,...,...,...,...,...,...,...,...,...
8521,-118.49,34.02,28.0,2545.0,752.0,1548.0,679.0,2.9125,475000.0,0.490408,4.604400
8522,-118.49,34.02,28.0,1394.0,582.0,716.0,543.0,1.5132,450000.0,0.490408,4.604400
8523,-118.49,34.02,27.0,4725.0,1185.0,1945.0,1177.0,4.1365,470800.0,0.490408,4.604400
8524,-118.49,34.01,28.0,651.0,252.0,333.0,174.0,1.9722,500001.0,0.490102,4.610878


In [136]:
df_ch_hospital

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,distance_to_school,distance_to_hospital
12333,-121.51,37.02,19.0,2372.0,394.0,1142.0,365.0,4.0238,374600.0,4.630389,0.490408
12361,-121.53,36.85,23.0,3359.0,725.0,1862.0,651.0,2.6719,193600.0,4.536893,0.493356
12408,-121.56,37.08,17.0,6725.0,1051.0,3439.0,1027.0,6.4313,393100.0,4.707441,0.447214
12419,-121.57,37.02,17.0,2889.0,624.0,2681.0,608.0,2.9417,178000.0,4.676035,0.430465
12420,-121.57,36.98,14.0,5231.0,817.0,2634.0,799.0,4.9702,279800.0,4.650301,0.430465
...,...,...,...,...,...,...,...,...,...,...,...
15090,-122.25,37.08,20.0,1201.0,282.0,601.0,234.0,2.5556,177500.0,5.248705,0.262488
15170,-122.26,37.38,28.0,1103.0,164.0,415.0,154.0,7.8633,500001.0,5.438014,0.460435
15253,-122.27,37.32,37.0,2607.0,534.0,1346.0,507.0,5.3951,277700.0,5.408817,0.418688
15254,-122.27,37.24,30.0,2762.0,593.0,1581.0,502.0,5.1002,319400.0,5.360084,0.361248


In [140]:
# Hypothesis: houses close to either a school or a hospital are more expensive.
# Choose the proper test at 5% significance.
df_ch_school["median_house_value"].mean()

237026.5619266055

In [142]:
df_ch_hospital["median_house_value"].mean()

295407.86649440136

In [146]:
# Perform a one-sided Welch two-sample t-test (equal_var=False);
# alternative='greater' tests H1: mean(school group) > mean(hospital group)
t_stat, p_value = st.ttest_ind(df_ch_school["median_house_value"], df_ch_hospital["median_house_value"], equal_var=False, alternative='greater')

# Display results
print(f"T-statistic: {t_stat}, P-value: {p_value}")

T-statistic: -16.813958135097597, P-value: 1.0
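
A p-value of exactly 1.0 can look suspicious, so here is a rough sketch of where it comes from. It assumes a standard normal approximation to the t-distribution, which is reasonable only because both groups are large; the helper `normal_cdf` is hypothetical and SciPy itself uses the t-distribution.

```python
import math

def normal_cdf(x):
    """Standard normal CDF, written with erfc to stay accurate in the tails."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

t_stat = -16.813958135097597  # t-statistic reported above

# alternative='greater' asks how likely a statistic at least this large is under H0
p_greater = 1.0 - normal_cdf(t_stat)
# alternative='less' asks the opposite question
p_less = normal_cdf(t_stat)

print(f"approx. p (greater): {p_greater}")
print(f"approx. p (less):    {p_less:.3e}")
```

A strongly negative t-statistic under `alternative='greater'` therefore yields a p-value near 1, while the same statistic under `alternative='less'` yields a p-value near 0. The direction of the alternative matters as much as the size of the statistic.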


In [148]:
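# Comment your findings:
# Houses close to the school have a mean value of 237026.56 dollars; houses close to the hospital have a mean value of 295407.87 dollars.
# The test above compares the school group with the hospital group (H1: school > hospital).
# With t = -16.81 and p = 1.0, H0 is not rejected: houses near the school are NOT more expensive than houses near the hospital.
# The large negative t-statistic points the other way (hospital-area houses are more expensive), so the two means are clearly not equal.
# Note: this comparison does not yet test the stated hypothesis, which needs houses close to either facility vs. houses far from both.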
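
The stated hypothesis compares houses close to either facility against houses far from both. Below is a minimal sketch of that split followed by a one-sided Welch test. The toy DataFrame and all its values are made up for illustration and stand in for `df_ch`; the helper `distance_to` is hypothetical. Nothing here implies a result for the real data.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

# Hypothetical toy data standing in for df_ch (values invented for illustration)
toy = pd.DataFrame({
    "longitude": [-118.1, -117.9, -122.2, -121.9, -120.0, -119.5, -116.0, -115.0],
    "latitude": [34.1, 33.9, 37.1, 36.8, 36.0, 35.5, 33.0, 34.0],
    "median_house_value": [310000, 290000, 350000, 330000,
                           150000, 170000, 120000, 140000],
})

def distance_to(df, lon, lat):
    """Euclidean distance in degrees from every row to the point (lon, lat)."""
    return np.sqrt((df["longitude"] - lon) ** 2 + (df["latitude"] - lat) ** 2)

# A row is "close" if it is within 0.50 of the school OR the hospital
near_school = distance_to(toy, -118, 34) < 0.50
near_hospital = distance_to(toy, -122, 37) < 0.50
close_mask = near_school | near_hospital

close_values = toy.loc[close_mask, "median_house_value"]
far_values = toy.loc[~close_mask, "median_house_value"]

# H0: mean(close) <= mean(far); H1: mean(close) > mean(far)
result = st.ttest_ind(close_values, far_values, equal_var=False, alternative="greater")
print(f"close: n={len(close_values)}, far: n={len(far_values)}")
print(f"T-statistic: {result.statistic:.3f}, P-value: {result.pvalue:.5f}")
```

On the real dataset the same mask would be built from the `distance_to_school` and `distance_to_hospital` columns already computed above.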