# Lab | Hypothesis Testing

**Objective**

Welcome to the Hypothesis Testing Lab! In this lab, we work through several scenarios, applying hypothesis tests to analyze and interpret data.

We cover testing the mean of a single sample (One Sample T-Test), comparing two independent groups (Two Sample T-Test), comparing dependent samples (Paired Sample T-Test), and comparing means across multiple groups with Analysis of Variance (ANOVA).

So, grab your statistical tools, prepare your hypotheses, and let's get started!

**Challenge 1**

In this challenge, we will be working with Pokémon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [1]:
#libraries
import pandas as pd
import scipy.stats as st
import numpy as np
from scipy.stats import ttest_ind



In [2]:
# Workaround for SSL certificate errors when reading the dataset
from io import StringIO

import pandas as pd
import requests

# Fetch the data without verifying the SSL certificate.
# verify=False is insecure; use it only as a local workaround.
response = requests.get("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv", verify=False)

# Ensure the request was successful
if response.status_code == 200:
    # Load the response text into a pandas DataFrame
    df = pd.read_csv(StringIO(response.text))
    print(df)
else:
    print("Failed to retrieve data: HTTP Status", response.status_code)



               Name   Type 1  Type 2  HP  Attack  Defense  Sp. Atk  Sp. Def  \
0         Bulbasaur    Grass  Poison  45      49       49       65       65   
1           Ivysaur    Grass  Poison  60      62       63       80       80   
2          Venusaur    Grass  Poison  80      82       83      100      100   
3     Mega Venusaur    Grass  Poison  80     100      123      122      120   
4        Charmander     Fire     NaN  39      52       43       60       50   
..              ...      ...     ...  ..     ...      ...      ...      ...   
795         Diancie     Rock   Fairy  50     100      150      100      150   
796    Mega Diancie     Rock   Fairy  50     160      110      160      110   
797  Hoopa Confined  Psychic   Ghost  80     110       60      150      130   
798   Hoopa Unbound  Psychic    Dark  80     160       60      170      130   
799       Volcanion     Fire   Water  80     110      120      130       90   

     Speed  Generation  Legendary  
0       45     

In [3]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


- We posit that Pokémon of type Dragon have, on average, more HP than Pokémon of type Grass. Choose the proper test and, with 5% significance, comment on your findings.

In [4]:
from scipy.stats import ttest_ind

In [5]:
# H0: mean HP of Dragon-type Pokémon <= mean HP of Grass-type Pokémon
# H1: mean HP of Dragon-type Pokémon > mean HP of Grass-type Pokémon
# Filter the data for Dragon and Grass types
df_dragonHP = df[df["Type 1"] == "Dragon"]["HP"]
df_grassHP = df[df["Type 1"] == "Grass"]["HP"]

# Perform Welch's t-test for unequal variances with the alternative hypothesis that Dragon HP is greater
t_test_result = ttest_ind(df_dragonHP, df_grassHP, equal_var=False, alternative='greater')

# Output the test results
print(t_test_result)

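TtestResult(statistic=3.3349632905124063, pvalue=0.0007993609745420599, df=50.83784116232685)

With a p-value of about 0.0008, below the 0.05 significance level, we reject H0: the data support the claim that Dragon-type Pokémon have a higher mean HP than Grass-type Pokémon. Note that the filter uses only `Type 1`, so Pokémon whose secondary type is Dragon or Grass are not included in either group.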
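To make the statistic and the fractional degrees of freedom in the result above less opaque, the sketch below recomputes Welch's t statistic and the Welch-Satterthwaite degrees of freedom by hand and compares them with `ttest_ind`. The two samples are made-up numbers for illustration, not the Pokémon data, and `welch_t` is a hypothetical helper name.

```python
import numpy as np
from scipy import stats


def welch_t(a, b):
    """Welch's t, degrees of freedom, and one-sided p-value for H1: mean(a) > mean(b)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    # Squared standard error of each sample mean
    va, vb = a.var(ddof=1) / a.size, b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    # Upper-tail probability of the t distribution
    p = stats.t.sf(t, dof)
    return t, dof, p


# Illustrative samples only (not taken from the dataset)
group_a = [78, 85, 90, 72, 95, 88, 80]
group_b = [60, 65, 70, 58, 75, 62, 68, 66]

t_manual, dof_manual, p_manual = welch_t(group_a, group_b)
result = stats.ttest_ind(group_a, group_b, equal_var=False, alternative="greater")

print(f"manual: t={t_manual:.4f}, df={dof_manual:.2f}, p={p_manual:.6f}")
print(f"scipy:  t={result.statistic:.4f}, p={result.pvalue:.6f}")
```

The degrees of freedom are not an integer because they are estimated from the two sample variances rather than from the pooled sample size.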


- We posit that Legendary Pokémon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) compared with Non-Legendary Pokémon. Choose the proper test and, with 5% significance, comment on your findings.


In [6]:
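# H0: the mean total of the six stats is the same for Legendary and Non-Legendary Pokémon
# H1: the mean total of the six stats differs between the two groups
# Combine the six stats into a single total per Pokémon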
df['stat'] = df[['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']].sum(axis=1)
legendary = df[df['Legendary'] == True]['stat']
non_legendary = df[df['Legendary'] == False]['stat']

# Perform Welch's t-test (does not assume equal variances)
t_stat, p_value = ttest_ind(legendary, non_legendary, equal_var=False)

# Print results
print(f"T-statistic: {t_stat}, P-value: {p_value}")

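T-statistic: 25.8335743895517, P-value: 9.357954335957446e-47

The p-value (about 9.4e-47) is far below 0.05, so we reject H0: the combined stat total differs between Legendary and Non-Legendary Pokémon, and the positive t-statistic indicates that Legendary Pokémon have the higher mean total. Because the six stats were summed into one value, this result says nothing about each stat taken individually.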
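The test above works on the sum of the six stats. One alternative reading of the question is to test each stat separately; a minimal sketch of that approach follows, using a Bonferroni correction for the six comparisons. This is an assumption about how the exercise could be approached, not the lab's reference solution. `compare_each_stat` is a hypothetical helper, and the data frame is synthetic stand-in data rather than the Pokémon dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

STAT_COLUMNS = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]


def compare_each_stat(frame, group_col="Legendary", columns=STAT_COLUMNS, alpha=0.05):
    """Run a two-sided Welch t-test per column, with a Bonferroni-corrected threshold."""
    corrected_alpha = alpha / len(columns)
    rows = []
    for col in columns:
        in_group = frame.loc[frame[group_col], col]
        out_group = frame.loc[~frame[group_col], col]
        t_stat, p_value = ttest_ind(in_group, out_group, equal_var=False)
        rows.append({"stat": col, "t": t_stat, "p": p_value,
                     "reject_H0": p_value < corrected_alpha})
    return pd.DataFrame(rows)


# Synthetic stand-in data: the flagged group is shifted upward on every stat
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({"Legendary": rng.random(n) < 0.2})
for col in STAT_COLUMNS:
    demo[col] = rng.normal(70, 15, n) + 30 * demo["Legendary"]

summary = compare_each_stat(demo)
print(summary)
```

With the real data, the same helper could be called as `compare_each_stat(df)`, provided the `Legendary` column is boolean.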


**Challenge 2**

In this challenge, we will be working with California housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [8]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 37)
- Hospital coordinates (-122, 34)

We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.

Hint:
- Write a function to calculate euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses close and far from either a hospital or school.
- Choose the proper test and, with 5% significance, comment on your findings.
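The first two hints can be sketched without a row-by-row loop, since NumPy functions operate on whole pandas columns at once. The example below is a minimal illustration on a four-row frame: the first and third rows reuse coordinates that appear in this dataset, while the other rows and the helper name `distance_to` are made up for the example.

```python
import numpy as np
import pandas as pd

SCHOOL = (-118.0, 37.0)
HOSPITAL = (-122.0, 34.0)
THRESHOLD = 0.50


def distance_to(frame, point):
    """Euclidean distance from every row's (longitude, latitude) to a fixed point."""
    return np.hypot(frame["longitude"] - point[0], frame["latitude"] - point[1])


# Small illustrative frame (partly made-up values)
demo = pd.DataFrame({
    "longitude": [-118.05, -121.90, -114.31, -118.00],
    "latitude": [36.64, 34.10, 34.19, 37.50],
    "median_house_value": [74200.0, 250000.0, 66900.0, 120000.0],
})

# A row is "close" if it is within the threshold of either landmark
is_close = (distance_to(demo, SCHOOL) < THRESHOLD) | (distance_to(demo, HOSPITAL) < THRESHOLD)
close, far = demo[is_close], demo[~is_close]

print(len(close), len(far))
```

Using a single boolean mask and its negation guarantees that every row lands in exactly one of the two groups.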
 

In [10]:
# Hypotheses:
# H0: The average price of houses close to a school or hospital is equal to the average price of houses not close to either.
# H1: The average price of houses close to a school or hospital is not equal to the average price of houses not close to either.

# Define the Euclidean distance function
def euclidean_distance(x1, y1, x2, y2):
    return np.sqrt((x2 - x1)**2 + (y2 - y1)**2)

# School and Hospital Coordinates
school_coords = (-118, 37)
hospital_coords = (-122, 34)

# Calculate distances to school and hospital (vectorized over the coordinate columns)
df['distance_to_school'] = euclidean_distance(df['longitude'], df['latitude'], *school_coords)
df['distance_to_hospital'] = euclidean_distance(df['longitude'], df['latitude'], *hospital_coords)

# Filter data for proximity to either school or hospital
close_to_school_or_hospital = df[(df['distance_to_school'] < 0.50) | (df['distance_to_hospital'] < 0.50)]
far_from_school_and_hospital = df[(df['distance_to_school'] >= 0.50) & (df['distance_to_hospital'] >= 0.50)]

# Welch's t-test for unequal variances
t_stat, p_value = st.ttest_ind(close_to_school_or_hospital['median_house_value'], far_from_school_and_hospital['median_house_value'], equal_var=False)

# Print the results
print(f"T-statistic: {t_stat}, P-value: {p_value}")
if p_value < 0.05:
    print("There is a statistically significant difference in median house values between houses close to and far from schools or hospitals.")
else:
    print("There is no statistically significant difference in median house values between houses close to and far from schools or hospitals.")

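T-statistic: -17.174167998688404, P-value: 5.220018561223529e-05
There is a statistically significant difference in median house values between houses close to and far from schools or hospitals.

The two-sided test rejects H0 at the 5% level (p is about 0.00005), so the mean house values of the two groups differ. However, the t-statistic is negative, which means the houses close to the school or hospital are cheaper on average than the rest. The data therefore do not support the claim that houses near a school or hospital are more expensive. The close group is also small (the listing in the following cell shows only a handful of neighborhoods), which limits how far this result can be generalized.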
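Because the claim being tested is directional ("more expensive"), a one-sided test is one way to match the hypothesis to the claim. The sketch below uses synthetic house values, not the California data, to show how the two-sided and one-sided p-values can disagree when the observed difference points the opposite way from the claim. All numbers and variable names here are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic values: the "close" group is deliberately cheaper than the "far" group
rng = np.random.default_rng(42)
close_values = rng.normal(90_000, 15_000, size=30)
far_values = rng.normal(200_000, 100_000, size=2_000)

# Two-sided: H1 is "the means differ"
two_sided = ttest_ind(close_values, far_values, equal_var=False)

# One-sided: H1 is "the close group has the higher mean"
one_sided = ttest_ind(close_values, far_values, equal_var=False, alternative="greater")

print(f"t = {two_sided.statistic:.2f}")
print(f"two-sided p = {two_sided.pvalue:.3g}")
print(f"one-sided p = {one_sided.pvalue:.3g}")
```

A small two-sided p-value only says the groups differ; the sign of the t-statistic, or a one-sided test, is needed to say in which direction.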


In [11]:
close_to_school_or_hospital

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,distance_to_school,distance_to_hospital
4523,-118.05,36.64,34.0,2090.0,478.0,896.0,426.0,2.0357,74200.0,0.363456,4.75101
5596,-118.18,37.35,16.0,3806.0,794.0,1501.0,714.0,2.1212,108300.0,0.393573,5.080837
5597,-118.18,36.63,23.0,2311.0,487.0,1019.0,384.0,2.2574,104700.0,0.411461,4.637812
6776,-118.3,37.17,22.0,3480.0,673.0,1541.0,636.0,2.75,94500.0,0.344819,4.872258
6904,-118.31,36.94,35.0,2563.0,530.0,861.0,371.0,2.325,80600.0,0.315753,4.718019


In [12]:
far_from_school_and_hospital

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,distance_to_school,distance_to_hospital
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0,4.638125,7.692347
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0,4.384165,7.540617
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0,4.773856,7.446456
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0,4.801510,7.438716
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0,4.850753,7.442432
...,...,...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0,7.211380,6.957298
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0,7.275232,7.064630
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0,7.944533,8.170410
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0,7.920227,8.132035
