# Lab | Hypothesis Testing

**Objective**

Welcome to the Hypothesis Testing Lab! In this lab, we apply hypothesis tests to several scenarios and interpret the results.

The tests we consider are the One Sample T-Test (testing the mean of a single sample), the Two Sample T-Test (comparing two independent groups), the Paired Sample T-Test (comparing dependent samples), and Analysis of Variance (ANOVA), which compares means across multiple groups.

So grab your statistical tools, prepare your hypotheses, and let's get started!

**Challenge 1**

In this challenge, we will be working with Pokémon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [2]:
#libraries
import pandas as pd
import scipy.stats as st
import numpy as np



In [3]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
display(df)

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
5,Charmeleon,Fire,,58,64,58,80,65,80,1,False
6,Charizard,Fire,Flying,78,84,78,109,85,100,1,False
7,Mega Charizard X,Fire,Dragon,78,130,111,130,85,100,1,False
8,Mega Charizard Y,Fire,Flying,78,104,78,159,115,100,1,False
9,Squirtle,Water,,44,48,65,50,64,43,1,False


- We posit that Dragon-type Pokémon have, on average, higher HP than Grass-type Pokémon. Choose the proper test and, at the 5% significance level, comment on your findings.

In [4]:
#code here

# Load the data
url = 'https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv'
pokemon_data = pd.read_csv(url)

# Filter Dragon and Grass-type Pokémon by primary type ('Type 1'); 'Type 2' is not considered
dragon_hp = pokemon_data[pokemon_data['Type 1'] == 'Dragon']['HP']
grass_hp = pokemon_data[pokemon_data['Type 1'] == 'Grass']['HP']

# Hypotheses
# H0: The average HP of Dragon-type Pokémon is equal to the average HP of Grass-type Pokémon
# H1: The average HP of Dragon-type Pokémon is greater than the average HP of Grass-type Pokémon

# Perform the T-Test
t_statistic, p_value = st.ttest_ind(dragon_hp, grass_hp, alternative='greater')

print("Comparison of HP between Dragon-type and Grass-type Pokémon:")
print(f"t-statistic = {t_statistic}, p-value = {p_value}")

# Interpret the p-value
alpha = 0.05  # significance level
if p_value < alpha:
    print("Reject the null hypothesis: Dragon-type Pokémon have, on average, more HP than Grass-type Pokémon.")
else:
    print("Fail to reject the null hypothesis: There is not enough evidence to claim that Dragon-type Pokémon have more HP than Grass-type Pokémon.")

# Interpretation:
#
# The t-statistic of 3.59 is positive, so the sample mean HP of Dragon-type Pokémon is higher than that of Grass-type Pokémon.
# The one-sided p-value of 0.00026 is far below the significance level of 0.05 (5%).
#
# Conclusion:
#
# Since the p-value is less than 0.05, we reject the null hypothesis. The data support the alternative hypothesis (H1) that Dragon-type Pokémon have, on average, more HP than Grass-type Pokémon.

Comparison of HP between Dragon-type and Grass-type Pokémon:
t-statistic = 3.590444254130357, p-value = 0.0002567969150153481
Reject the null hypothesis: Dragon-type Pokémon have, on average, more HP than Grass-type Pokémon.
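By default, `st.ttest_ind` runs Student's t-test, which assumes both groups have equal variances. When group sizes or spreads differ, Welch's t-test (`equal_var=False`) is a common robustness check. The sketch below compares the two on made-up samples; the arrays `group_a` and `group_b` are illustrative assumptions, not values from the Pokémon dataset.

```python
import numpy as np
import scipy.stats as st

# Made-up HP samples for illustration only; they are NOT taken from pokemon.csv
group_a = np.array([88, 95, 70, 105, 61, 90, 80, 100])
group_b = np.array([45, 60, 80, 50, 65, 70, 55, 75, 60, 66, 40, 90])

# Student's t-test (SciPy default) assumes equal variances in both groups
t_student, p_student = st.ttest_ind(group_a, group_b, alternative='greater')

# Welch's t-test drops the equal-variance assumption via equal_var=False
t_welch, p_welch = st.ttest_ind(group_a, group_b, equal_var=False, alternative='greater')

print(f"Student: t = {t_student:.3f}, one-sided p = {p_student:.4f}")
print(f"Welch:   t = {t_welch:.3f}, one-sided p = {p_welch:.4f}")
```

If both variants lead to the same decision at the chosen significance level, the conclusion does not hinge on the equal-variance assumption.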

- We posit that Legendary Pokémon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) compared with Non-Legendary Pokémon. Choose the proper test and, at the 5% significance level, comment on your findings.


In [5]:
#code here

# Filter statistics for Legendary and Non-Legendary Pokémon
legendary_stats = pokemon_data[pokemon_data['Legendary'] == True][['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']]
non_legendary_stats = pokemon_data[pokemon_data['Legendary'] == False][['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']]

# Perform a T-Test for each statistic
results = {}
for stat in legendary_stats.columns:
    t_statistic, p_value = st.ttest_ind(legendary_stats[stat], non_legendary_stats[stat])
    results[stat] = (t_statistic, p_value)

print("\nComparison of stats between Legendary and Non-Legendary Pokémon:")
for stat, (t_statistic, p_value) in results.items():
    print(f"{stat}: t-statistic = {t_statistic}, p-value = {p_value}")

# Interpret the p-values
significant_stats = []
for stat, (t_statistic, p_value) in results.items():
    if p_value < alpha:
        significant_stats.append(stat)

if significant_stats:
    print(f"Reject the null hypothesis for the following stats: {', '.join(significant_stats)}. Legendary Pokémon have different stats compared to Non-Legendary Pokémon.")
else:
    print("Fail to reject the null hypothesis for all stats: There is not enough evidence to claim that Legendary Pokémon have different stats than Non-Legendary Pokémon.")


# Interpretation:
#
# All six t-statistics are large and positive (between about 7.2 and 14.2), so Legendary Pokémon have the higher sample mean for every stat.
# All six p-values are far below 0.05 (the largest, for Defense, is about 1.6e-12), so each difference is statistically significant.
#
# Conclusion:
#
# We reject the null hypothesis of equal means for every stat (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed).
# The results support the alternative hypothesis (H1) that Legendary Pokémon have different stats compared to Non-Legendary Pokémon.


Comparison of stats between Legendary and Non-Legendary Pokémon:
HP: t-statistic = 8.036124405043928, p-value = 3.3306476848461913e-15
Attack: t-statistic = 10.397321023700622, p-value = 7.827253003205333e-24
Defense: t-statistic = 7.181240122992339, p-value = 1.5842226094427259e-12
Sp. Atk: t-statistic = 14.191406210846289, p-value = 6.314915770427265e-41
Sp. Def: t-statistic = 11.03775106120522, p-value = 1.8439809580409597e-26
Speed: t-statistic = 9.765234331931898, p-value = 2.3540754436898437e-21
Reject the null hypothesis for the following stats: HP, Attack, Defense, Sp. Atk, Sp. Def, Speed. Legendary Pokémon have different stats compared to Non-Legendary Pokémon.
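Running six separate t-tests at the 5% level raises the chance of at least one false positive. One simple safeguard is a Bonferroni correction, which divides the significance level by the number of tests. The sketch below applies it to the p-values printed in the cell output above; choosing Bonferroni over other corrections is an assumption made for illustration.

```python
# p-values copied from the cell output above
p_values = {
    "HP": 3.3306476848461913e-15,
    "Attack": 7.827253003205333e-24,
    "Defense": 1.5842226094427259e-12,
    "Sp. Atk": 6.314915770427265e-41,
    "Sp. Def": 1.8439809580409597e-26,
    "Speed": 2.3540754436898437e-21,
}

alpha = 0.05
# Bonferroni: divide the significance level by the number of tests performed
adjusted_alpha = alpha / len(p_values)

still_significant = [stat for stat, p in p_values.items() if p < adjusted_alpha]

print(f"Adjusted threshold: {adjusted_alpha:.4f}")
print("Significant after correction:", ", ".join(still_significant))
```

Because every p-value is many orders of magnitude below the adjusted threshold, the conclusion for all six stats is unchanged by the correction.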

**Challenge 2**

In this challenge, we will be working with California housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [6]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 37)
- Hospital coordinates (-122, 34)

We consider a house (neighborhood) to be close to a school or hospital if its Euclidean distance to it, measured in coordinate units, is less than 0.50.

Hint:
- Write a function to calculate euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close and far from either a hospital or school.
- Choose the proper test and, at the 5% significance level, comment on your findings.
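The first two hints can be sketched in vectorized form, which avoids a row-by-row loop. The example below is only an illustration: the `houses` DataFrame holds made-up rows with the same column names as the dataset, and the helper name `euclidean_distance` is hypothetical.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample using the same column names as california_housing.csv
houses = pd.DataFrame({
    "longitude": [-118.2, -121.9, -114.3, -117.7],
    "latitude": [37.1, 34.3, 34.2, 36.8],
    "median_house_value": [250000.0, 310000.0, 66900.0, 180000.0],
})

SCHOOL = (-118, 37)     # (longitude, latitude), from the exercise
HOSPITAL = (-122, 34)   # (longitude, latitude), from the exercise
THRESHOLD = 0.50

def euclidean_distance(df, point):
    """Distance from every row of df to point, in coordinate units."""
    return np.hypot(df["longitude"] - point[0], df["latitude"] - point[1])

# A house is "close" if it is within the threshold of either location
is_close = (euclidean_distance(houses, SCHOOL) < THRESHOLD) | (
    euclidean_distance(houses, HOSPITAL) < THRESHOLD
)

close_houses = houses[is_close]
far_houses = houses[~is_close]
print(len(close_houses), "close,", len(far_houses), "far")
```

The resulting boolean mask splits the data into the two groups that are then passed to the chosen test.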
 

In [10]:
# Load the data
url = 'https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv'
housing_data = pd.read_csv(url)

# Display the data
display(housing_data.head(10))

# Function to calculate Euclidean distance
def calculate_distance(row, school_coords=(-118, 37), hospital_coords=(-122, 34)):
    # Calculate distance to school
    distance_to_school = ((row['longitude'] - school_coords[0])**2 + (row['latitude'] - school_coords[1])**2)**0.5
    # Calculate distance to hospital
    distance_to_hospital = ((row['longitude'] - hospital_coords[0])**2 + (row['latitude'] - hospital_coords[1])**2)**0.5
    return distance_to_school, distance_to_hospital

# Apply the function to calculate distances
housing_data[['distance_to_school', 'distance_to_hospital']] = housing_data.apply(calculate_distance, axis=1, result_type='expand')

# Define the close proximity threshold
threshold = 0.50

# Classify houses based on distance to school or hospital
housing_data['close_to_school_or_hospital'] = ((housing_data['distance_to_school'] < threshold) | 
                                                (housing_data['distance_to_hospital'] < threshold)).astype(int)

# Separate datasets
close_houses = housing_data[housing_data['close_to_school_or_hospital'] == 1]
far_houses = housing_data[housing_data['close_to_school_or_hospital'] == 0]

# Perform a two-sided T-Test on median house values
t_statistic, p_value = st.ttest_ind(close_houses['median_house_value'], far_houses['median_house_value'])

print("Comparison of house values between those close to a school or hospital and those that are far:")
print(f"t-statistic = {t_statistic}, p-value = {p_value}")

# Interpret the p-value together with the sign of the t-statistic:
# the claim is directional, and a positive t means the close group has the higher mean
alpha = 0.05  # significance level
if p_value < alpha and t_statistic > 0:
    print("Reject the null hypothesis: Houses close to a school or hospital are more expensive on average.")
elif p_value < alpha:
    print("Reject the null hypothesis of equal means, but in the opposite direction: Houses close to a school or hospital are less expensive on average.")
else:
    print("Fail to reject the null hypothesis: There is not enough evidence to claim that house values differ between the two groups.")



Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0
5,-114.58,33.63,29.0,1387.0,236.0,671.0,239.0,3.3438,74000.0
6,-114.58,33.61,25.0,2907.0,680.0,1841.0,633.0,2.6768,82400.0
7,-114.59,34.83,41.0,812.0,168.0,375.0,158.0,1.7083,48500.0
8,-114.59,33.61,34.0,4789.0,1175.0,3134.0,1056.0,2.1782,58400.0
9,-114.6,34.83,46.0,1497.0,309.0,787.0,271.0,2.1908,48100.0


Comparison of house values between those close to a school or hospital and those that are far:
t-statistic = -2.2146147257665834, p-value = 0.026799733071128685
Reject the null hypothesis of equal means, but in the opposite direction: Houses close to a school or hospital are less expensive on average.


Interpretation:

The two-sided p-value of 0.027 is below the significance level of 0.05, so the mean house values of the two groups differ significantly. However, the t-statistic of -2.21 is negative, which means houses close to a school or hospital have a lower mean value than houses that are far.

Conclusion:

We reject the null hypothesis of equal means, but the difference runs opposite to the claim. The data do not support the assertion that houses close to a school or hospital are more expensive; in this dataset they are, on average, less expensive. A one-sided test with alternative='greater' would therefore fail to reject the null hypothesis.
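The hypothesis here is directional ("more expensive"), while `st.ttest_ind` called without an `alternative` argument reports a two-sided p-value. As a minimal sketch, the one-sided p-value can be derived from the reported numbers using the symmetry of the t distribution; the variable names below are illustrative.

```python
# Values copied from the cell output above (two-sided test)
t_statistic = -2.2146147257665834
p_two_sided = 0.026799733071128685
alpha = 0.05

# For a symmetric t distribution, the one-sided p-value for H1 "close > far"
# is p/2 when t > 0 and 1 - p/2 when t < 0
if t_statistic > 0:
    p_greater = p_two_sided / 2
else:
    p_greater = 1 - p_two_sided / 2

print(f"one-sided p-value (H1: close > far) = {p_greater:.4f}")
print("Reject H0" if p_greater < alpha else "Fail to reject H0")
```

Passing `alternative='greater'` to `st.ttest_ind` performs this one-sided test directly.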
