# Lab | Hypothesis Testing

**Objective**

This lab applies hypothesis testing to real datasets to support statistical decisions.

The relevant tools are the one-sample t-test (testing the mean of a single sample), the two-sample t-test (comparing two independent groups), the paired-sample t-test (comparing dependent samples), and Analysis of Variance (ANOVA) for comparing means across more than two groups.

For each question below, state the hypotheses, choose the appropriate test, and interpret the result at the stated significance level.

**Challenge 1**

In this challenge, we will be working with Pokémon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [1]:
#libraries
import pandas as pd
import scipy.stats as st
import numpy as np



In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


In [6]:
df["Type 1"].unique()

array(['Grass', 'Fire', 'Water', 'Bug', 'Normal', 'Poison', 'Electric',
       'Ground', 'Fairy', 'Fighting', 'Psychic', 'Rock', 'Ghost', 'Ice',
       'Dragon', 'Dark', 'Steel', 'Flying'], dtype=object)

- We posit that Dragon-type Pokémon have, on average, higher HP than Grass-type Pokémon. Choose the proper test and, at a 5% significance level, comment on your findings.

In [28]:
# Set the hypotheses (the claim under test goes in H1)

# H0: mean(dragons_hp) <= mean(grass_hp)
# H1: mean(dragons_hp) >  mean(grass_hp)

# Significance level = 0.05
dragons_hp = df[df["Type 1"]=="Dragon"]["HP"].dropna()
grass_hp = df[df["Type 1"]=="Grass"]["HP"].dropna()

In [29]:
from scipy.stats import ttest_ind

# Welch's two-sample t-test (equal_var=False), one-sided:
# alternative='greater' tests whether the first sample's mean exceeds the second's
t_stat, p_value = ttest_ind(dragons_hp, grass_hp, equal_var=False, alternative='greater')

# Print results
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# Conclusion
if p_value < 0.05:
    print("Reject H0: The mean HP of Dragon-type Pokémon is significantly greater than that of Grass-type Pokémon.")
else:
    print("Fail to Reject H0: There is no significant evidence that the mean HP of Dragon-type Pokémon is greater than that of Grass-type Pokémon.")


T-statistic: 3.3350
P-value: 0.0008
Reject H0: The mean HP of Dragon-type Pokémon is significantly greater than that of Grass-type Pokémon.
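The value passed to `alternative` decides which tail the p-value is taken from, so it has to match H1. The following is a minimal sketch, using made-up HP-like samples rather than the Pokémon data, that computes Welch's t-statistic by hand and shows how the one-sided and two-sided p-values relate. The array values and variable names are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical samples, for illustration only (not taken from pokemon.csv)
group_a = np.array([80, 95, 61, 108, 75, 90, 41, 100], dtype=float)
group_b = np.array([45, 60, 80, 65, 50, 70, 55, 75, 40], dtype=float)

# Welch's t statistic by hand: difference in means divided by its standard error
se_a_sq = group_a.var(ddof=1) / group_a.size
se_b_sq = group_b.var(ddof=1) / group_b.size
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(se_a_sq + se_b_sq)

# Same data, three choices of alternative hypothesis
t_less, p_less = ttest_ind(group_a, group_b, equal_var=False, alternative="less")
t_greater, p_greater = ttest_ind(group_a, group_b, equal_var=False, alternative="greater")
t_two, p_two = ttest_ind(group_a, group_b, equal_var=False)

print(f"t (manual) = {t_manual:.4f}, t (scipy) = {t_greater:.4f}")
print(f"p_less = {p_less:.4f}, p_greater = {p_greater:.4f}, p_two_sided = {p_two:.4f}")
```

The t-statistic is identical in all three calls; only the p-value changes. The two one-sided p-values sum to 1, and the two-sided p-value is twice the smaller of them, which is why testing in the wrong direction can turn a clear rejection into a p-value close to 1.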


- We posit that Legendary Pokémon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) compared with Non-Legendary Pokémon. Choose the proper test and, at a 5% significance level, comment on your findings.


In [17]:
stats_columns = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed", "Legendary"]

legendary_stats = df[df["Legendary"] == True][stats_columns].reset_index(drop=True)
non_legendary_stats = df[df["Legendary"] == False][stats_columns].reset_index(drop=True)

# Display the first few rows of each DataFrame
print("Legendary Pokémon Stats:")
print(legendary_stats.head())

print("\nNon-Legendary Pokémon Stats:")
print(non_legendary_stats.head())

Legendary Pokémon Stats:
    HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Legendary
0   90      85      100       95      125     85       True
1   90      90       85      125       90    100       True
2   90     100       90      125       85     90       True
3  106     110       90      154       90    130       True
4  106     190      100      154      100    130       True

Non-Legendary Pokémon Stats:
   HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Legendary
0  45      49       49       65       65     45      False
1  60      62       63       80       80     60      False
2  80      82       83      100      100     80      False
3  80     100      123      122      120     80      False
4  39      52       43       60       50     65      False


In [25]:
from scipy.stats import ttest_ind
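# Hypotheses (one-sided, matching alternative='greater' below)
# H0: mean(legendary_stats[stat]) <= mean(non_legendary_stats[stat])
# H1: mean(legendary_stats[stat]) >  mean(non_legendary_stats[stat])
# The question asks only whether the stats differ (two-sided); a one-sided
# p-value below alpha / 2 also rejects the two-sided null at level alpha.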

# Significance level
alpha = 0.05

# Results dictionary
results = {}

# Iterate over each stat column
for stat in ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]:
    # Extract the relevant columns as 1D arrays
    legendary_stat = legendary_stats[stat].dropna()
    non_legendary_stat = non_legendary_stats[stat].dropna()
    
    # Perform the t-test
    t_stat, p_value = ttest_ind(legendary_stat, non_legendary_stat, equal_var=False, alternative='greater')
    
    # Store the results
    results[stat] = {
        "t-statistic": t_stat,
        "p-value": p_value,
        "Conclusion": "Reject H0" if p_value < alpha else "Fail to Reject H0"
    }

# Display the results
for stat, res in results.items():
    print(f"Stat: {stat}")
    print(f"  t-statistic: {res['t-statistic']:.4f}")
    print(f"  p-value: {res['p-value']:.8f}")
    print(f"  Conclusion: {res['Conclusion']}\n")


Stat: HP
  t-statistic: 8.9814
  p-value: 0.00000000
  Conclusion: Reject H0

Stat: Attack
  t-statistic: 10.4381
  p-value: 0.00000000
  Conclusion: Reject H0

Stat: Defense
  t-statistic: 7.6371
  p-value: 0.00000000
  Conclusion: Reject H0

Stat: Sp. Atk
  t-statistic: 13.4174
  p-value: 0.00000000
  Conclusion: Reject H0

Stat: Sp. Def
  t-statistic: 10.0157
  p-value: 0.00000000
  Conclusion: Reject H0

Stat: Speed
  t-statistic: 11.4750
  p-value: 0.00000000
  Conclusion: Reject H0

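Running six tests at 5% each raises the chance of at least one false rejection across the whole family. One common safeguard is the Bonferroni correction, which compares each p-value with `alpha / number_of_tests`. The sketch below uses hypothetical p-values (they are not the values from this notebook) to show the mechanics. The p-values printed above are all below 0.00000001 at the displayed precision, so the conclusions there would not change under this correction.

```python
# Hypothetical p-values for six stats, for illustration only
p_values = {
    "HP": 0.001,
    "Attack": 0.0002,
    "Defense": 0.03,
    "Sp. Atk": 0.00001,
    "Sp. Def": 0.004,
    "Speed": 0.012,
}

alpha = 0.05
# Bonferroni: split the overall significance level evenly across the tests
adjusted_alpha = alpha / len(p_values)

# True means "reject H0" after correction
decisions = {stat: p < adjusted_alpha for stat, p in p_values.items()}

for stat, reject in decisions.items():
    verdict = "Reject H0" if reject else "Fail to Reject H0"
    print(f"{stat}: p = {p_values[stat]:.5f} vs {adjusted_alpha:.5f} -> {verdict}")
```

In this made-up example, Defense and Speed would be rejected at an uncorrected 5% level but not after the correction.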


**Challenge 2**

In this challenge, we will be working with California housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [30]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 37)
- Hospital coordinates (-122, 34)

We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.

Hint:
- Write a function to calculate euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close and far from either a hospital or school.
- Choose the proper test and, at a 5% significance level, comment on your findings.
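The first two hints can be sketched as follows. This is a minimal illustration, not the lab's reference solution: it assumes the task's coordinates are given as (longitude, latitude), treats the 0.50 threshold as a distance in coordinate units, and uses a few hypothetical rows in place of the full dataset. The helper name `euclidean_distance` and the `close` column are also assumptions.

```python
import numpy as np
import pandas as pd

# Coordinates from the task, read here as (longitude, latitude)
SCHOOL = (-118.0, 37.0)
HOSPITAL = (-122.0, 34.0)
THRESHOLD = 0.50  # in coordinate units, as stated in the task

def euclidean_distance(lon, lat, point):
    """Straight-line distance in coordinate units from (lon, lat) to point."""
    return np.sqrt((lon - point[0]) ** 2 + (lat - point[1]) ** 2)

# Hypothetical rows for illustration; in the lab these come from california_housing.csv
sample = pd.DataFrame({
    "longitude": [-118.1, -121.9, -114.31, -118.0],
    "latitude": [37.2, 34.3, 34.19, 37.6],
    "median_house_value": [150000.0, 220000.0, 66900.0, 98000.0],
})

dist_school = euclidean_distance(sample["longitude"], sample["latitude"], SCHOOL)
dist_hospital = euclidean_distance(sample["longitude"], sample["latitude"], HOSPITAL)

# A house is "close" if it is within the threshold of either location
sample["close"] = (dist_school < THRESHOLD) | (dist_hospital < THRESHOLD)

close_values = sample.loc[sample["close"], "median_house_value"]
far_values = sample.loc[~sample["close"], "median_house_value"]

print(sample[["longitude", "latitude", "close"]])
```

The two resulting Series, `close_values` and `far_values`, are the inputs a two-sample test would take. Operating on whole columns keeps the distance calculation vectorized, which avoids a row-by-row `apply`.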
 

In [None]:
!pip install geopy
from geopy.distance import geodesic




In [57]:
# School and hospital coordinates in (latitude, longitude) order, as geopy expects.
# The task lists them as (longitude, latitude).
school_coords = (37, -118)
hospital_coords = (34, -122)

# `houses` is not defined earlier in this notebook; assumed here to be a copy of df
houses = df.copy()

# Geodesic distance in kilometers between two (latitude, longitude) points
def calculate_geodesic_distance(coord1, coord2):
    return geodesic(coord1, coord2).km
# Apply the geodesic distance calculation to your DataFrame
houses["Distance_to_School"] = houses.apply(
    lambda row: calculate_geodesic_distance((row["latitude"], row["longitude"]), school_coords), axis=1
)

houses["Distance_to_Hospital"] = houses.apply(
    lambda row: calculate_geodesic_distance((row["latitude"], row["longitude"]), hospital_coords), axis=1
)

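# Threshold for "close" vs. "far", in kilometers.
# Note: the task defines "close" as a Euclidean distance below 0.50 in coordinate units;
# this cell uses a 100 km geodesic cutoff instead, so the groups differ from the task's definition.
distance_threshold = 100.0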

# Classify houses as close or far
houses["Close_to_School"] = houses["Distance_to_School"] <= distance_threshold
houses["Close_to_Hospital"] = houses["Distance_to_Hospital"] <= distance_threshold

# Create DataFrames for houses close and far
houses_close = houses[(houses["Close_to_School"]) | (houses["Close_to_Hospital"])]
houses_far = houses[~((houses["Close_to_School"]) | (houses["Close_to_Hospital"]))]

# Print the resulting DataFrames
print("Houses Close to School or Hospital:")
print(houses_close.shape)

print("\nHouses Far from Both:")
print(houses_far.shape)



Houses Close to School or Hospital:
(18, 13)

Houses Far from Both:
(16982, 13)


In [58]:
# Set the hypotheses (the claim under test goes in H1)

# H0: mean(close_prices) <= mean(far_prices)
# H1: mean(close_prices) >  mean(far_prices)

# Significance level = 0.05
close_prices = houses_close["median_house_value"].dropna()
far_prices = houses_far["median_house_value"].dropna()


In [56]:
close_prices.head()

4523     74200.0
5596    108300.0
5597    104700.0
6776     94500.0
6904     80600.0
Name: median_house_value, dtype: float64

In [59]:
# Welch's two-sample t-test, one-sided: is the mean price of close houses greater?
t_stat, p_value = ttest_ind(close_prices, far_prices, equal_var=False, alternative='greater')

# Print results
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.10f}")

T-statistic: -11.6854
P-value: 0.9999999994


In [None]:
# Fail to reject the null hypothesis: at the 5% level there is no evidence that houses close to the school or hospital are more expensive.
# The large negative t-statistic indicates the opposite: the close group has a much lower mean house value than the far group.
# Only 18 houses fall in the close group, so this result rests on a small sample.