# Lab | Hypothesis Testing

**Objective**

In this lab we apply hypothesis tests to real datasets and interpret the results.

The lab covers testing the mean of a single sample (One Sample T-Test), comparing two independent groups (Two Sample T-Test), comparing dependent samples (Paired Sample T-Test), and comparing means across multiple groups with Analysis of Variance (ANOVA).

For each challenge, state the hypotheses, choose the appropriate test, and comment on the result at the given significance level.

**Challenge 1**

In this challenge, we will be working with Pokémon data. The data can be found here:

- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv

In [3]:
#libraries
import pandas as pd
import scipy.stats as st
import numpy as np

In [6]:
df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/pokemon.csv")
df.columns

Index(['Name', 'Type 1', 'Type 2', 'HP', 'Attack', 'Defense', 'Sp. Atk',
       'Sp. Def', 'Speed', 'Generation', 'Legendary'],
      dtype='object')

- We posit that Dragon-type Pokémon have, on average, higher HP than Grass-type Pokémon. Choose the proper test and, at a 5% significance level, comment on your findings.

In [7]:
# State the hypotheses
# H0: mean HP of Dragon <= mean HP of Grass
# H1: mean HP of Dragon > mean HP of Grass

In [8]:
df

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,Mega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,Hoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,Hoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


In [9]:
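# A Pokémon belongs to a group if either of its two types matches
df_dragon = df[(df['Type 1'] == 'Dragon') | (df['Type 2'] == 'Dragon')]
df_grass = df[(df['Type 1'] == 'Grass') | (df['Type 2'] == 'Grass')]

# Mean HP of each group
print(df_dragon['HP'].mean())
print(df_grass['HP'].mean())

# Lowest Dragon HP and highest Grass HP
print(df_dragon['HP'].min())
print(df_grass['HP'].max())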

82.9
66.05263157894737
40
123


In [10]:
df_dragon

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
7,Mega Charizard X,Fire,Dragon,78,130,111,130,85,100,1,False
159,Dratini,Dragon,,41,64,45,50,50,50,1,False
160,Dragonair,Dragon,,61,84,65,70,70,70,1,False
161,Dragonite,Dragon,Flying,91,134,95,100,100,80,1,False
196,Mega Ampharos,Electric,Dragon,90,95,105,165,110,45,2,False
249,Kingdra,Water,Dragon,75,95,95,95,95,85,2,False
275,Mega Sceptile,Grass,Dragon,70,110,75,145,85,145,3,False
360,Vibrava,Ground,Dragon,50,70,50,50,50,70,3,False
361,Flygon,Ground,Dragon,80,100,80,80,80,100,3,False
365,Altaria,Dragon,Flying,75,70,90,70,105,80,3,False


In [11]:
df_grass

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,Mega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
48,Oddish,Grass,Poison,45,50,55,75,65,30,1,False
...,...,...,...,...,...,...,...,...,...,...,...
783,Pumpkaboo Super Size,Ghost,Grass,59,66,70,44,55,41,6,False
784,Gourgeist Average Size,Ghost,Grass,65,90,122,58,75,84,6,False
785,Gourgeist Small Size,Ghost,Grass,55,85,122,58,75,99,6,False
786,Gourgeist Large Size,Ghost,Grass,75,95,122,58,75,69,6,False
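
The Dragon listing above includes Mega Sceptile, whose types are Grass and Dragon, so at least one row falls in both groups, while an independent two-sample test assumes the groups do not share observations. The sketch below shows one possible way to detect and remove such overlap. The `toy` frame, its values, and the `has_type` helper are made up for illustration and are not part of the lab.

```python
import pandas as pd

# Toy stand-in for the Pokémon table (made-up rows, for illustration only)
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Type 1": ["Grass", "Dragon", "Grass", "Fire"],
    "Type 2": ["Dragon", None, "Poison", None],
    "HP": [70, 61, 45, 39],
})

def has_type(frame, type_name):
    """Boolean mask: rows whose primary or secondary type equals type_name."""
    return (frame["Type 1"] == type_name) | (frame["Type 2"] == type_name)

dragon_mask = has_type(toy, "Dragon")
grass_mask = has_type(toy, "Grass")

# Rows that would be counted in both groups
overlap = toy[dragon_mask & grass_mask]
print(len(overlap), "row(s) in both groups:", overlap["Name"].tolist())

# One option (an assumption, not the lab's method): drop shared rows from both groups
dragon_only = toy[dragon_mask & ~grass_mask]
grass_only = toy[grass_mask & ~dragon_mask]
```

Whether to drop shared rows, assign them to one group, or keep them is a judgment call; with only a few overlapping rows the test result is unlikely to change much.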


In [12]:
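# Two-sample t-test on HP. Note: ttest_ind is two-sided by default,
# while H1 here is one-sided (Dragon > Grass).
st.ttest_ind(df_dragon['HP'], df_grass['HP'])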

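TtestResult(statistic=4.4991348531252635, pvalue=1.4019427478617643e-05, df=143.0)

Findings: the reported p-value is two-sided. Because the statistic is positive and H1 is one-sided (Dragon > Grass), the one-sided p-value is half of it, about 7.0e-06. This is well below 0.05, so we reject H0: Dragon-type Pokémon have, on average, higher HP than Grass-type Pokémon.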
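
As a hedged illustration of the one-sided test, the sketch below passes `alternative="greater"` to `st.ttest_ind` and also shows Welch's variant (`equal_var=False`), which does not assume equal variances. The samples are synthetic, with arbitrary means and spreads; they are not the Pokémon data, so the printed values will differ from the result above.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)

# Synthetic HP-like samples (NOT the real data; parameters are arbitrary)
group_a = rng.normal(loc=83, scale=24, size=50)
group_b = rng.normal(loc=66, scale=19, size=95)

# Default: two-sided test with pooled variance
two_sided = st.ttest_ind(group_a, group_b)

# One-sided test of H1: mean(group_a) > mean(group_b)
one_sided = st.ttest_ind(group_a, group_b, alternative="greater")

# Welch's t-test: same one-sided H1, without assuming equal variances
welch = st.ttest_ind(group_a, group_b, equal_var=False, alternative="greater")

print("two-sided p:", two_sided.pvalue)
print("one-sided p:", one_sided.pvalue)
print("Welch one-sided p:", welch.pvalue)
```

When the statistic is positive, the one-sided p-value for "greater" equals half the two-sided p-value, which is why halving the default output gives the same answer.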

- We posit that Legendary Pokémon have different stats (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed) compared with Non-Legendary Pokémon. Choose the proper test and, at a 5% significance level, comment on your findings.


In [13]:
# H0: mean stat of Legendary = mean stat of Non-Legendary (for each stat)
# H1: mean stat of Legendary != mean stat of Non-Legendary

In [15]:
# Select the six stat columns for Legendary and Non-Legendary Pokémon
df_legend = df[df['Legendary'] == True][['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']]
df_nonlegend = df[df['Legendary'] == False][['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']]

In [17]:
stats = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']
legendary_df = df[df['Legendary'] == True]
non_legendary_df = df[df['Legendary'] == False]

In [19]:
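# f_oneway works along axis 0, so passing two DataFrames with the same
# columns yields one F statistic and one p-value per stat column
stat, p_value = st.f_oneway(df_legend, df_nonlegend)
print(stat)
print(p_value)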

[ 64.57929545 108.10428447  51.5702097  201.39601024 121.83194849
  95.35980156]
[3.33064768e-15 7.82725300e-24 1.58422261e-12 6.31491577e-41
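 1.84398096e-26 2.35407544e-21]

Findings: the values are in column order (HP, Attack, Defense, Sp. Atk, Sp. Def, Speed). All six p-values are far below 0.05 (the largest, for Defense, is about 1.6e-12), so we reject H0 for every stat: Legendary and Non-Legendary Pokémon differ in the mean of each of the six stats.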
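
With only two groups, one-way ANOVA is equivalent to the pooled-variance two-sample t-test (F equals t squared). The sketch below checks that on synthetic data and then runs one Welch t-test per column, comparing each p-value against a Bonferroni-adjusted threshold because several tests are run at once. The data, the group sizes, and the choice of Bonferroni are assumptions for illustration, not part of the lab.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

rng = np.random.default_rng(1)
cols = ["HP", "Attack", "Defense"]

# Synthetic stand-ins for the two groups (NOT the real data)
group_x = pd.DataFrame(rng.normal(95, 25, size=(60, 3)), columns=cols)
group_y = pd.DataFrame(rng.normal(70, 25, size=(700, 3)), columns=cols)

# With two groups, ANOVA's F equals the squared pooled-variance t statistic
f_stat, f_p = st.f_oneway(group_x["HP"], group_y["HP"])
t_res = st.ttest_ind(group_x["HP"], group_y["HP"])
print(np.isclose(f_stat, t_res.statistic ** 2), np.isclose(f_p, t_res.pvalue))

# One Welch t-test per column, judged against a Bonferroni-adjusted alpha
alpha = 0.05
bonferroni_alpha = alpha / len(cols)

rows = []
for col in cols:
    res = st.ttest_ind(group_x[col], group_y[col], equal_var=False)
    rows.append({
        "stat": col,
        "t": res.statistic,
        "p_value": res.pvalue,
        "reject_h0": res.pvalue < bonferroni_alpha,
    })

summary = pd.DataFrame(rows)
print(summary)
```

For the six stats in this challenge the adjusted threshold would be 0.05 / 6, roughly 0.0083, and the p-values reported above are all far below it.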


**Challenge 2**

In this challenge, we will be working with California housing data. The data can be found here:
- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv

In [20]:
df_house = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/california_housing.csv")
df_house.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


**We posit that houses close to either a school or a hospital are more expensive.**

- School coordinates (-118, 34)
- Hospital coordinates (-122, 37)

We consider a house (neighborhood) to be close to a school or hospital if the distance is lower than 0.50.

Hint:
- Write a function to calculate the Euclidean distance from each house (neighborhood) to the school and to the hospital.
- Divide your dataset into houses that are close to a hospital or school and houses that are far from both.
- Choose the proper test and, at a 5% significance level, comment on your findings.
 

In [23]:
# H0: mean price of houses close to a school or hospital <= mean price of houses far from both
# H1: mean price of houses close to a school or hospital > mean price of houses far from both

In [24]:
def get_distance_school(row):
    school = (-118, 34)
    return np.sqrt((row["longitude"] - school[0]) ** 2 + (row["latitude"] - school[1]) ** 2)

def get_distance_hospital(row):
    hospital = (-122, 37)
    return np.sqrt((row["longitude"] - hospital[0]) ** 2 + (row["latitude"] - hospital[1]) ** 2)

df_house["distance_school"] = df_house.apply(get_distance_school, axis=1)
df_house["distance_hospital"] = df_house.apply(get_distance_hospital, axis=1)
df_house["close"] = df_house.apply(lambda row: min(row["distance_school"], row["distance_hospital"]) < 0.5, axis=1)
df_house_close = df_house[df_house["close"] == True]
df_house_far = df_house[df_house["close"] == False]
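
The row-wise `apply` calls above can also be written with vectorized NumPy operations, which is usually faster on a table of this size. The sketch below is one possible version on a tiny made-up frame that reuses the `longitude` and `latitude` column names; the `euclidean_distance` helper and the sample coordinates are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample with the same column names as the housing data
houses = pd.DataFrame({
    "longitude": [-118.3, -122.0, -120.0],
    "latitude": [34.0, 37.4, 35.0],
})

SCHOOL = (-118, 34)
HOSPITAL = (-122, 37)
THRESHOLD = 0.5

def euclidean_distance(frame, point):
    """Distance from every row to a (longitude, latitude) point, in degrees."""
    return np.hypot(frame["longitude"] - point[0], frame["latitude"] - point[1])

dist_school = euclidean_distance(houses, SCHOOL)
dist_hospital = euclidean_distance(houses, HOSPITAL)

# A house is "close" if the nearer of the two points is within the threshold
is_close = np.minimum(dist_school, dist_hospital) < THRESHOLD
print(is_close.tolist())  # [True, True, False]
```

The first house is 0.3 from the school, the second is 0.4 from the hospital, and the third is more than 2 from both, so only the first two count as close.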



In [25]:
df_house_close

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,distance_school,distance_hospital,close
2366,-117.51,34.00,36.0,3791.0,746.0,2258.0,672.0,3.2067,124700.0,0.490000,5.400009,True
2367,-117.51,33.97,35.0,352.0,62.0,184.0,57.0,3.6691,137500.0,0.490918,5.416733,True
2368,-117.51,33.95,12.0,9016.0,1486.0,4285.0,1457.0,4.9984,169100.0,0.492544,5.427946,True
2371,-117.52,33.99,14.0,13562.0,2057.0,7600.0,2086.0,5.2759,182900.0,0.480104,5.397268,True
2372,-117.52,33.89,2.0,17978.0,3217.0,7305.0,2463.0,5.1695,220800.0,0.492443,5.453668,True
...,...,...,...,...,...,...,...,...,...,...,...,...
15090,-122.25,37.08,20.0,1201.0,282.0,601.0,234.0,2.5556,177500.0,5.248705,0.262488,True
15170,-122.26,37.38,28.0,1103.0,164.0,415.0,154.0,7.8633,500001.0,5.438014,0.460435,True
15253,-122.27,37.32,37.0,2607.0,534.0,1346.0,507.0,5.3951,277700.0,5.408817,0.418688,True
15254,-122.27,37.24,30.0,2762.0,593.0,1581.0,502.0,5.1002,319400.0,5.360084,0.361248,True


In [26]:
df_house_far

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,distance_school,distance_hospital,close
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0,3.694888,8.187319,False
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0,3.552591,7.966235,False
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0,3.453940,8.143077,False
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0,3.448840,8.154416,False
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0,3.456848,8.183508,False
...,...,...,...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0,9.082070,4.233675,False
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0,9.168915,4.332320,False
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0,10.057614,5.358694,False
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0,10.026465,5.322593,False


In [27]:
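# Two-sample t-test on median house value. Note: ttest_ind is two-sided by
# default, while H1 here is one-sided (close > far).
st.ttest_ind(df_house_close['median_house_value'], df_house_far['median_house_value'])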

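TtestResult(statistic=38.04632342033554, pvalue=4.817835891327844e-304, df=16998.0)

Findings: the statistic is positive and the two-sided p-value is about 4.8e-304, so the one-sided p-value (half of it) is also far below 0.05. We reject H0: houses close to the school or the hospital have, on average, a higher median house value than houses far from both.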