# Lab | Hypothesis Testing

**Objective**

Welcome to the Hypothesis Testing Lab. In this lab, we apply hypothesis tests to several scenarios and interpret the results to reach a statistical decision.

We test the mean of a single sample (One Sample T-Test), compare two independent groups (Two Sample T-Test), and compare dependent samples (Paired Sample T-Test). We also use Analysis of Variance (ANOVA) to compare means across more than two groups.

For each challenge, state your hypotheses, choose the appropriate test, and comment on the result at the given significance level.
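As a rough orientation, the four tests named above map onto functions in `scipy.stats`. The sketch below runs each one on synthetic samples; the data, seed, and group means are made up for illustration and are not the lab data.

```python
import numpy as np
import scipy.stats as st

# Synthetic, illustrative samples only (not the lab data)
rng = np.random.default_rng(0)
a = rng.normal(loc=50, scale=10, size=40)
b = rng.normal(loc=55, scale=10, size=40)
c = rng.normal(loc=60, scale=10, size=40)

one_sample = st.ttest_1samp(a, popmean=50)        # mean of one sample vs. a fixed value
two_sample = st.ttest_ind(a, b, equal_var=False)  # two independent groups (Welch's t-test)
paired = st.ttest_rel(a, b)                       # two measurements on the same units
anova = st.f_oneway(a, b, c)                      # means of three or more groups

results = [("one-sample", one_sample), ("two-sample", two_sample),
           ("paired", paired), ("ANOVA", anova)]
for name, res in results:
    print(f"{name}: statistic={res.statistic:.3f}, p-value={res.pvalue:.4f}")
```

Each call returns a result object with `statistic` and `pvalue` attributes, which is the pattern the challenges below rely on.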

**Challenge 1**

In this challenge, we will be working with Pokémon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

## LOADING AND CLEANING

In [None]:
#libraries
import pandas as pd
import scipy.stats as st
import numpy as np


In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


In [3]:
df["Type 1"].unique() # no null values

array(['Grass', 'Fire', 'Water', 'Bug', 'Normal', 'Poison', 'Electric',
       'Ground', 'Fairy', 'Fighting', 'Psychic', 'Rock', 'Ghost', 'Ice',
       'Dragon', 'Dark', 'Steel', 'Flying'], dtype=object)

In [4]:
df["Type 2"].unique() # contains NaN; only Type 1 is used for testing, because dropping rows with a missing Type 2 would remove almost half of the data

array(['Poison', nan, 'Flying', 'Dragon', 'Ground', 'Fairy', 'Grass',
       'Fighting', 'Psychic', 'Steel', 'Ice', 'Rock', 'Dark', 'Water',
       'Electric', 'Fire', 'Ghost', 'Bug', 'Normal'], dtype=object)

In [5]:
mean_hp_dragon = df[df["Type 1"] == "Dragon"]["HP"].mean()
mean_hp_dragon

83.3125

In [6]:
mean_hp_grass = df[df["Type 1"] == "Grass"]["HP"].mean()
mean_hp_grass

67.27142857142857

In [7]:
# Total number of null values
total_nulls = df.isnull().sum().sum()
print("Total number of null values in the DataFrame:", total_nulls)

Total number of null values in the DataFrame: 387


In [8]:
# Number of null values per column
nulls_per_column = df.isnull().sum()
print("Null values per column:")
print(nulls_per_column)

Null values per column:
Name            1
Type 1          0
Type 2        386
HP              0
Attack          0
Defense         0
Sp. Atk         0
Sp. Def         0
Speed           0
Generation      0
Legendary       0
dtype: int64


## Challenge 1

- We posit that Pokémon of type Dragon have, on average, more HP than Pokémon of type Grass. Choose the proper test and, with 5% significance, comment on your findings.

In [9]:
# Two-sample t-test
# Set the hypotheses
# H0: mu HP of Dragon <= mu HP of Grass
# H1: mu HP of Dragon > mu HP of Grass
alpha = 0.05  # significance level

df_dragon = df[df["Type 1"]=="Dragon"]["HP"]
df_grass = df[df["Type 1"]=="Grass"]["HP"]

In [10]:
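# Welch's t-test (equal_var=False does not assume equal variances).
# Note: ttest_ind is two-sided by default. For the one-sided H1 above, pass
# alternative="greater", which halves the p-value when the statistic is positive.
p = st.ttest_ind(df_dragon, df_grass, equal_var=False)
print(f"statistic: {p.statistic}")
print(f"pvalue: {p.pvalue}")
print(f"df: {p.df}")

# The p-value (about 0.0016) is below the 5% significance level, so H0 is rejected.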

statistic: 3.3349632905124063
pvalue: 0.0015987219490841199
df: 50.83784116232685


In [11]:
if p.pvalue > alpha:
    print(f"We are not able to reject H0: p-value {p.pvalue} > {alpha}. No evidence that Dragon Pokémon have higher HP stats.")
else:
    print(f"Reject H0: p-value ({p.pvalue}) <= {alpha}. Dragon Pokémon tend to have a higher HP than Grass Pokémon.")

Reject H0: p-value (0.0015987219490841199) <= 0.05. Dragon Pokémon tend to have a higher HP than Grass Pokémon.
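Because H1 here is directional (Dragon > Grass), a one-sided test matches the hypotheses more closely than the default two-sided one. The following is a minimal sketch on synthetic data, assuming SciPy 1.6 or later for the `alternative` parameter; the group sizes, means, and seed are arbitrary and are not taken from the Pokémon data.

```python
import numpy as np
import scipy.stats as st

# Synthetic samples for illustration only
rng = np.random.default_rng(42)
group_a = rng.normal(loc=83, scale=20, size=30)
group_b = rng.normal(loc=67, scale=20, size=60)

# Default: two-sided alternative (means differ in either direction)
two_sided = st.ttest_ind(group_a, group_b, equal_var=False)

# One-sided alternative: mean of group_a is greater than mean of group_b
one_sided = st.ttest_ind(group_a, group_b, equal_var=False, alternative="greater")

print(f"statistic:         {two_sided.statistic:.4f}")
print(f"two-sided p-value: {two_sided.pvalue:.6f}")
print(f"one-sided p-value: {one_sided.pvalue:.6f}")
```

When the statistic is positive, the one-sided p-value is half of the two-sided one, so a result that is significant in the two-sided test stays significant in the one-sided test in that direction.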


- We posit that Legendary Pokémon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) compared with Non-Legendary Pokémon. Choose the proper test and, with 5% significance, comment on your findings.

In [12]:
# Two-sample t-test for every stat
# Set the hypotheses
# Stats compared between Legendary and Non-Legendary: ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]
# H0: mu of the stat for Legendary Pokémon = mu of the stat for Non-Legendary Pokémon
# H1: mu of the stat for Legendary Pokémon != mu of the stat for Non-Legendary Pokémon
# alpha = 0.05

In [13]:
# Compute the means (exploratory comparison)
mean_legendary = df[df["Legendary"]][["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]].mean()
mean_non_legendary = df[~df["Legendary"]][["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]].mean()
print(f"Means for Legendary:\n{mean_legendary}")
print(f"Means for Non-Legendary:\n{mean_non_legendary}")

Means for Legendary:
HP          92.738462
Attack     116.676923
Defense     99.661538
Sp. Atk    122.184615
Sp. Def    105.938462
Speed      100.184615
dtype: float64
Means for Non-Legendary:
HP         67.182313
Attack     75.669388
Defense    71.559184
Sp. Atk    68.454422
Sp. Def    68.892517
Speed      65.455782
dtype: float64


In [14]:
alpha = 0.05  # significance level
stats = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]

print("Comparison of stats for Legendary and Non-Legendary Pokémon:\n")

for stat in stats:
    # Raw data for Legendary and Non-Legendary Pokémon
    legendary_stat = df[df["Legendary"]][stat]
    non_legendary_stat = df[~df["Legendary"]][stat]
    
    # Run the two-sample (Welch's) t-test
    t_test_result = st.ttest_ind(legendary_stat, non_legendary_stat, equal_var=False)
    pvalue = t_test_result.pvalue
    
    # Print the results
    print(f"{stat}: t-value = {t_test_result.statistic:.4f}, p-value = {pvalue:.20f}")
    
    # Decision based on the p-value
    if pvalue > alpha:
        print(f"For {stat}, we are not able to reject H0: p-value {pvalue:.4f} > {alpha}. "
              f"No evidence that Legendary Pokémon have different {stat} in comparison with Non-Legendary Pokémon.")
    else:
        print(f"For {stat}, Reject H0: p-value ({pvalue:.4f}) <= {alpha}. "
              f"Legendary Pokémon have a different {stat} than Non-Legendary Pokémon.")
    print("\n")

Comparison of stats for Legendary and Non-Legendary Pokémon:

HP: t-value = 8.9814, p-value = 0.00000000000010026912
For HP, Reject H0: p-value (0.0000) <= 0.05. Legendary Pokémon have a different HP than Non-Legendary Pokémon.


Attack: t-value = 10.4381, p-value = 0.00000000000000025204
For Attack, Reject H0: p-value (0.0000) <= 0.05. Legendary Pokémon have a different Attack than Non-Legendary Pokémon.


Defense: t-value = 7.6371, p-value = 0.00000000004826998495
For Defense, Reject H0: p-value (0.0000) <= 0.05. Legendary Pokémon have a different Defense than Non-Legendary Pokémon.


Sp. Atk: t-value = 13.4174, p-value = 0.00000000000000000000
For Sp. Atk, Reject H0: p-value (0.0000) <= 0.05. Legendary Pokémon have a different Sp. Atk than Non-Legendary Pokémon.


Sp. Def: t-value = 10.0157, p-value = 0.00000000000000229493
For Sp. Def, Reject H0: p-value (0.0000) <= 0.05. Legendary Pokémon have a different Sp. Def than Non-Legendary Pokémon.


Speed: t-value = 11.4750, p-value = 0.000000
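Running six separate tests at 5% each raises the chance of at least one false rejection across the whole family. The lab does not ask for it, but one common and conservative safeguard is a Bonferroni correction, sketched below. The p-values are copied from the printed output above; Sp. Atk and Speed are left out because their printed values show no nonzero digits.

```python
alpha = 0.05
n_tests = 6  # HP, Attack, Defense, Sp. Atk, Sp. Def, Speed

# p-values as printed in the output above (Sp. Atk and Speed omitted, see lead-in)
pvalues = {
    "HP": 1.0026912e-13,
    "Attack": 2.5204e-16,
    "Defense": 4.826998495e-11,
    "Sp. Def": 2.29493e-15,
}

# Bonferroni: compare each p-value with alpha divided by the number of tests
adjusted_alpha = alpha / n_tests
still_significant = {stat: p <= adjusted_alpha for stat, p in pvalues.items()}

print(f"Adjusted threshold: {adjusted_alpha:.5f}")
for stat, significant in still_significant.items():
    print(f"{stat}: reject H0 after correction = {significant}")
```

With p-values this small, the stricter threshold does not change any of the listed decisions.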

## Challenge 2

In this challenge, we will be working with California housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [15]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 34)
- Hospital coordinates (-122, 37)

We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.

Hint:
- Write a function to calculate the Euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close to and far from either a hospital or a school.
- Choose the proper test and, with 5% significance, comment on your findings.
 

In [16]:
import numpy as np

# Function to calculate Euclidean distance
def euclidean_distance(lat1, lon1, lat2, lon2):
    return np.sqrt((lat2 - lat1)**2 + (lon2 - lon1)**2)

In [17]:
import pandas as pd

# Calculate distances
df["distance_to_school"] = euclidean_distance(df["latitude"], df["longitude"], 34, -118)
df["distance_to_hospital"] = euclidean_distance(df["latitude"], df["longitude"], 37, -122)

# Flag houses close to either location
df["close_to_school_or_hospital"] = (df["distance_to_school"] < 0.50) | (df["distance_to_hospital"] < 0.50)

In [18]:
# Create two groups using the boolean flag as a mask
close_houses = df.loc[df["close_to_school_or_hospital"], "median_house_value"]
far_houses = df.loc[~df["close_to_school_or_hospital"], "median_house_value"]

In [19]:
import scipy.stats as st

# H0: mean price of houses close to a school or hospital <= mean price of houses far from both
# H1: mean price of houses close to a school or hospital >  mean price of houses far from both

# Perform Welch's t-test. Note: ttest_ind is two-sided by default; for the
# one-sided H1 above, pass alternative="greater" to match the stated hypotheses.
alpha = 0.05
t_test_result = st.ttest_ind(close_houses, far_houses, equal_var=False)

# Comment findings
if t_test_result.pvalue <= alpha:
    print(f"Reject H0: p-value ({t_test_result.pvalue:.50f}) <= {alpha}. "
          f"Houses close to schools or hospitals have significantly higher prices.")
else:
    print(f"Fail to reject H0: p-value ({t_test_result.pvalue:.50f}) > {alpha}. "
          f"No evidence that houses close to schools or hospitals are more expensive.")

Reject H0: p-value (0.00000000000000000000000000000000000000000000000000) <= 0.05. Houses close to schools or hospitals have significantly higher prices.
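A p-value printed with a fixed number of decimals, as above, shows only zeros once it is smaller than the chosen precision. Scientific notation keeps the magnitude visible. The sketch below uses a hypothetical helper, `format_pvalue`, and a made-up example value rather than the lab's actual result.

```python
def format_pvalue(pvalue: float, digits: int = 3) -> str:
    """Format a p-value in scientific notation so tiny values stay readable."""
    return f"{pvalue:.{digits}e}"


example = 1.5e-60  # hypothetical value, not the result of the test above

fixed = f"{example:.50f}"            # fixed-point: every printed digit is zero
scientific = format_pvalue(example)  # scientific: the magnitude stays visible

print(fixed)
print(scientific)
```

In the cell above, this would amount to replacing `:.50f` with a format such as `:.3e` in the f-strings.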
