# S07_T01_Hypothesis Testing

### Ex1: Pick a sports-themed dataset you like and select one attribute from it. Compute the p-value and state whether it rejects the null hypothesis at an alpha of 5%

#### Terminology
Hypothesis testing relies on a few key terms:

 * Null Hypothesis: the hypothesis that sample observations result purely from chance. The null hypothesis tends to state that there’s no change.
 * Alternative Hypothesis: the hypothesis that sample observations are influenced by some non-random cause.
 * P-value: the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is correct; a smaller p-value means stronger evidence against the null hypothesis.
 * Alpha: the significance level; the probability of rejecting the null hypothesis when it is true, also known as the Type I error rate.
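
To make the meaning of alpha concrete, here is a minimal simulation sketch. It assumes synthetic, normally distributed data (not the Olympic dataset) and an arbitrary seed, trial count and sample size. Every sample is drawn with the null hypothesis true, so the share of rejections should land close to alpha.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # arbitrary seed, chosen only for repeatability
alpha = 0.05
n_trials, n = 2000, 30           # illustrative sizes, not taken from the notebook

# Every row is a sample from a population whose true mean is 0, so H0 (mu = 0) holds.
samples = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n))
p_values = stats.ttest_1samp(samples, popmean=0.0, axis=1).pvalue

# Share of samples in which H0 is wrongly rejected: the empirical Type I error rate.
type1_rate = float(np.mean(p_values < alpha))
print(f"Empirical Type I error rate: {type1_rate:.3f} (alpha = {alpha})")
```
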

#### Steps for Hypothesis Testing

Here are the steps to performing a hypothesis test:

 * State your null and alternative hypotheses. The null hypothesis typically states that nothing has changed.

 * Set your significance level, alpha. It is typically 5%, but other levels are used depending on how costly a Type I or Type II error would be.

 * Collect sample data and calculate sample statistics.

 * Calculate the p-value from the sample statistics. The most common approaches use a t-score or a z-score, both of which assume approximately normally distributed data.

 * Reject or do not reject the null hypothesis: if the p-value is greater than alpha, do not reject the null hypothesis; otherwise, reject it.
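
The five steps above can be sketched end to end as follows. This is only an illustration: the sample is synthetic, the hypothesised mean of 25 and the seed are placeholders, and `decide` is a hypothetical helper name rather than a library function.

```python
import numpy as np
from scipy import stats

def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the decision rule: do not reject H0 when p > alpha."""
    return "do not reject H0" if p_value > alpha else "reject H0"

# Step 1: hypotheses (illustrative): H0: mu = 25, H1: mu != 25
# Step 2: significance level
alpha = 0.05
# Step 3: collect a sample (synthetic here) and summarise it
rng = np.random.default_rng(42)
sample = rng.normal(loc=24.0, scale=6.0, size=50)
print(f"sample mean = {sample.mean():.2f}, sample std = {sample.std(ddof=1):.2f}")
# Step 4: p-value from a one-sample t-test
t_stat, p_value = stats.ttest_1samp(sample, popmean=25)
# Step 5: decision
print(f"t = {t_stat:.2f}, p = {p_value:.3f}: {decide(p_value, alpha)}")
```
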

In [115]:
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import ttest_1samp,ttest_ind,f_oneway

In [22]:
# We reuse the dataset from the previous Sprint: "120 years of Olympic history: athletes and results"

pd.set_option("display.max_rows",None)
statistics_df= pd.read_csv("athlete_events.csv")
statistics_df.head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
2,3,Gunnar Nielsen Aaby,M,24.0,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,
3,4,Edgar Lindenau Aabye,M,34.0,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
4,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,


In [23]:
# Replace any NaN values with zeros.
# Caveat: zero-filled Age, Height and Weight values pull the means down,
# so they should be excluded before testing those columns.
statistics_df.fillna(value=0, inplace=True)
statistics_df.head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,0
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,0
2,3,Gunnar Nielsen Aaby,M,24.0,0.0,0.0,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,0
3,4,Edgar Lindenau Aabye,M,34.0,0.0,0.0,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
4,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,0
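
Replacing missing values with zeros changes every mean computed afterwards. The following sketch uses a hypothetical five-row frame (not the Olympic data) to show the size of the effect: pandas skips NaN when averaging, whereas zeros are counted as real observations.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the five ages are missing.
toy = pd.DataFrame({"Age": [24.0, 23.0, np.nan, 34.0, np.nan]})

mean_skipping_nan = toy["Age"].mean()            # NaN is ignored: (24 + 23 + 34) / 3
mean_zero_filled = toy["Age"].fillna(0).mean()   # zeros count as ages: 81 / 5

print(mean_skipping_nan)   # 27.0
print(mean_zero_filled)    # 16.2
```
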


In [5]:
statistics_df.describe()

Unnamed: 0,ID,Age,Height,Weight,Year
count,271116.0,271116.0,271116.0,271116.0,271116.0
mean,68248.954396,24.663827,136.424553,54.305674,1978.37848
std,39022.286345,7.840652,73.45056,32.381492,29.877632
min,1.0,0.0,0.0,0.0,1896.0
25%,34643.0,21.0,157.0,47.0,1960.0
50%,68205.0,24.0,171.0,64.0,1988.0
75%,102097.25,28.0,180.0,75.0,2002.0
max,135571.0,97.0,226.0,214.0,2016.0


In [6]:
# I pick the Age attribute again for this first exercise
# Take a simple random sample
sample_dades = statistics_df.sample(50)
sample_dades.describe()

Unnamed: 0,ID,Age,Height,Weight,Year
count,50.0,50.0,50.0,50.0,50.0
mean,72468.48,24.52,138.04,53.96,1980.56
std,39264.801857,6.004216,70.241554,29.293699,27.662809
min,1855.0,0.0,0.0,0.0,1900.0
25%,41924.0,21.25,158.25,47.5,1960.0
50%,72544.0,24.0,170.0,64.0,1992.0
75%,101472.75,27.75,179.0,72.75,2001.5
max,134913.0,42.0,190.0,93.0,2016.0
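
Because `sample` is called without a seed, the 50 rows drawn (and every statistic derived from them) change on each run. A minimal sketch of a repeatable draw, assuming a hypothetical toy frame and an arbitrary `random_state` value:

```python
import pandas as pd

# Hypothetical toy data standing in for the athletes table.
toy = pd.DataFrame({"Age": range(18, 48)})

# The same random_state yields the same rows on every call.
first_draw = toy.sample(5, random_state=0)
second_draw = toy.sample(5, random_state=0)

print(first_draw.equals(second_draw))
```
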


In [7]:
# Compute the mean of the selected sample (x̄)
sample_dades["Age"].mean()

24.52

In [8]:
# Compute the mean age of all athletes (population mean μ = statistics_df["Age"].mean())
age_mitjana = statistics_df["Age"].mean()
age_mitjana

24.663826553947388

In [9]:
# State the hypotheses
# Null hypothesis (H0): μ = 25
# Alternative hypothesis (H1): μ < 25
# Sample mean (x̄) = 24.52
# Sample standard deviation (s) = 6.00
# Number of observations (n) = 50
# alpha = 5% (0.05)

# One-sample t-test. Note that it runs on the full Age column
# (n = 271,116, including the zero-filled ages), not on the 50-row sample.
from scipy import stats
alpha = 0.05
# alternative="less" matches the one-sided H1 (requires SciPy >= 1.6)
stat, p = stats.ttest_1samp(statistics_df["Age"], popmean=25, alternative="less")
print(f't-stat = {stat:.2f}\np-value = {p:.3f}')
print("We cannot reject H0" if p > alpha else "We can reject H0")


t-stat = -22.32
p-value = 0.000
We can reject H0


With these results we reject the null hypothesis H0 at the 5% significance level, since the p-value is below alpha: the data support a mean athlete age below 25 years. Keep in mind that missing ages were replaced with 0 earlier, which lowers the mean.

In [10]:
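# One-sample t-test of the full Age column against the mean of the 50-row sample.
# Note: this checks whether the population mean equals the sample mean; the usual
# direction is to test the sample against the population mean.
from scipy import stats
alpha = 0.05
stat, p = stats.ttest_1samp(statistics_df["Age"], sample_dades["Age"].mean())
print(f't-stat = {stat:.2f}\np-value = {p:.4f}')
print("We cannot reject H0" if p > alpha else "We can reject H0")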


t-stat = 9.55
p-value = 0.0000
We can reject H0


In [11]:
# Sample mean (x̄) = 24.52
# Population mean (μ) = 24.66
# Sample standard deviation (s) = 6.00
# Number of observations (n) = 50
# alpha = 5% (0.05)
# t = (24.52 - 24.66) / (6.00 / √50) ≈ -0.165
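
The hand calculation of the t statistic can be reproduced from summary statistics alone. The sketch below is an assumption-laden illustration: `one_sample_t` is a hypothetical helper, the inputs are the mean (24.52) and standard deviation (6.004216) reported by `describe()` for the 50-row sample, and the p-value is two-sided.

```python
import math
from scipy import stats

def one_sample_t(sample_mean, sample_std, n, mu0):
    """Return (t, two-sided p) for H0: mu = mu0 from summary statistics."""
    standard_error = sample_std / math.sqrt(n)
    t_stat = (sample_mean - mu0) / standard_error
    # Two-sided p-value from Student's t distribution with n - 1 degrees of freedom.
    p_two_sided = 2 * stats.t.sf(abs(t_stat), df=n - 1)
    return t_stat, p_two_sided

# Values read from the describe() output of the 50-row sample shown earlier.
t_stat, p_value = one_sample_t(sample_mean=24.52, sample_std=6.004216, n=50, mu0=24.66)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

With a t statistic this close to zero, the p-value is far above 0.05, so this particular sample gives no reason to doubt that it comes from a population with mean 24.66.
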

### Ex2: Continue with the sports-themed dataset you like and select two other attributes from it. Compute the p-values and state whether they reject the null hypothesis at an alpha of 5%

In [12]:
# Two attributes that may be related and give a good hypothesis are height and sex.
# Physiologically, the mean height of men is usually greater than that of women.
# Alternative hypothesis (Ha): male athletes are taller on average than female athletes
# Null hypothesis (H0): male athletes are not taller on average than female athletes

# I chose Student's t-test to compare the two height samples


Student’s t-test

Tests whether the means of two independent samples are significantly different.

Assumptions

* Observations in each sample are independent and identically distributed (iid).
* Observations in each sample are normally distributed.
* Observations in each sample have the same variance.

Interpretation

* H0: the means of the samples are equal.
* H1: the means of the samples are unequal.
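
The equal-variance assumption can be checked before choosing between the pooled (Student) test and Welch's variant. A minimal sketch, assuming synthetic normally distributed heights rather than the Olympic data; Levene's test is one common choice for comparing variances, not the only one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic heights in cm (illustrative values, not the Olympic data).
group_a = rng.normal(loc=179, scale=9, size=40)
group_b = rng.normal(loc=168, scale=9, size=40)

# Levene's test: H0 is that both groups have the same variance.
levene_stat, levene_p = stats.levene(group_a, group_b)

# Pooled (Student) test versus Welch's test, which drops the equal-variance assumption.
student = stats.ttest_ind(group_a, group_b, equal_var=True)
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Levene p = {levene_p:.3f}")
print(f"Student: t = {student.statistic:.3f}, p = {student.pvalue:.2e}")
print(f"Welch:   t = {welch.statistic:.3f}, p = {welch.pvalue:.2e}")
```

When both groups have the same size, the two variants produce the same t statistic and differ only in the degrees of freedom used for the p-value.
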

In [50]:
# Here we compare the mean height of women with the mean height of men.
# Null hypothesis (H0): the mean height of men is not greater than that of women
# Alternative hypothesis (H1): the mean height of men is greater than that of women

# We need both means, so we extract the heights for each sex:
# height_M and height_F

height_F = statistics_df.loc[statistics_df.Sex == "F", "Height"]
# Discard zeros (the zero-filled missing values), since they distort the mean
height_F = height_F[height_F != 0]
height_F.describe()

#alpha=0.05


count    67378.000000
mean       167.839740
std          8.778528
min        127.000000
25%        162.000000
50%        168.000000
75%        173.000000
max        213.000000
Name: Height, dtype: float64

In [51]:
height_M = statistics_df.loc[statistics_df.Sex == "M", "Height"]
# Discard zeros (the zero-filled missing values), since they distort the mean
height_M = height_M[height_M != 0]
height_M.describe()


count    143567.000000
mean        178.858463
std           9.360318
min         127.000000
25%         172.000000
50%         179.000000
75%         185.000000
max         226.000000
Name: Height, dtype: float64

In [80]:
sample_height_F = height_F.sample(40)
sample_height_M = height_M.sample(40)

In [81]:
mean_F = sample_height_F.mean()
round(mean_F,3)

167.275

In [100]:
mean_M = sample_height_M.mean()
round(mean_M,3)

179.3

In [101]:
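# Two-sample t-test on the means of two independent samples, with equal_var=False
# (Welch's test) because we cannot assume that the two variances are equal.
# equal_var must be the boolean False; a non-empty string such as "False" is truthy
# and selects the pooled test. With equal sample sizes the t statistic is the same
# either way and only the degrees of freedom differ.

alpha = 0.05
stat, p = stats.ttest_ind(sample_height_M, sample_height_F, equal_var=False)
print(f't-stat = {stat:.2f}\np-value = {p:.8f}')
print("We cannot reject H0" if p > alpha else "We can reject H0")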

t-stat = 5.66
p-value = 0.00000024
We can reject H0


Given this result, we reject the null hypothesis H0 at the 5% significance level: the mean height of male athletes is significantly greater than that of female athletes.
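
Since the alternative hypothesis here is directional (men taller than women), a one-sided test is the closer match. The sketch below assumes synthetic heights and SciPy 1.6 or later, where `ttest_ind` accepts the `alternative` argument. It shows that the one-sided p-value is half the two-sided one when the t statistic is positive.

```python
import math
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Synthetic heights in cm (illustrative values, not the Olympic data).
men = rng.normal(loc=179, scale=9, size=40)
women = rng.normal(loc=168, scale=9, size=40)

# Default test: H1 is "the means differ" (two-sided).
two_sided = stats.ttest_ind(men, women, equal_var=False)
# Directional test: H1 is "the mean of the first sample is greater".
one_sided = stats.ttest_ind(men, women, equal_var=False, alternative="greater")

print(f"t = {two_sided.statistic:.2f}")
print(f"two-sided p = {two_sided.pvalue:.2e}")
print(f"one-sided p = {one_sided.pvalue:.2e}")
```
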

### Ex3: Continue with the sports-themed dataset you like and select three attributes from it. Compute the p-value and state whether it rejects the null hypothesis at an alpha of 5%

**Choosing a parametric test: regression, comparison, or correlation**
 
Parametric tests usually have stricter requirements than nonparametric tests, and are able to make stronger inferences from the data. They can only be conducted with data that adheres to the common assumptions of statistical tests.

The most common types of parametric test include regression tests, comparison tests, and correlation tests.

#### Comparison tests

Comparison tests look for differences among group means. They can be used to test the effect of a categorical variable on the mean value of some other characteristic.

**T-tests** are used when comparing the means of precisely **two groups** (e.g. the average heights of men and women). 

**ANOVA and MANOVA tests** are used when comparing the means of **more than two groups** (e.g. the average heights of children, teenagers, and adults).

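We will use the ANOVA test. Note that only two groups (Basketball and Weightlifting) are compared below; in that case the one-way ANOVA is equivalent to the pooled two-sample t-test (F = t²).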

The one-way ANOVA tests the null hypothesis that two or more groups have the same population mean. The test is applied to samples from two or more groups, possibly with differing sizes.
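
As a minimal sketch of how `f_oneway` behaves, the example below uses synthetic weights (the group means, spreads and the third group are assumptions for illustration only). With two groups the F statistic equals the square of the pooled t statistic and the p-values coincide; with three or more groups only ANOVA applies directly.

```python
import math
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Synthetic weights in kg (illustrative values, not the Olympic data).
basketball = rng.normal(loc=85, scale=15, size=100)
weightlifting = rng.normal(loc=79, scale=22, size=100)
swimming = rng.normal(loc=71, scale=9, size=100)   # hypothetical third group

# Two groups: one-way ANOVA versus the pooled two-sample t-test.
f_two, p_two = stats.f_oneway(basketball, weightlifting)
t_stat, p_t = stats.ttest_ind(basketball, weightlifting, equal_var=True)
print(f"F = {f_two:.3f}, t^2 = {t_stat**2:.3f}, p(ANOVA) = {p_two:.4f}, p(t) = {p_t:.4f}")

# Three groups: H0 is that all three population means are equal.
f_three, p_three = stats.f_oneway(basketball, weightlifting, swimming)
print(f"F = {f_three:.3f}, p = {p_three:.2e}")
```
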

The attributes chosen are **mean weight (Weight) and Sport (Basketball, Weightlifting)**.

Null hypothesis (H0): the mean weight of Basketball athletes and the mean weight of Weightlifting athletes are equal.
Alternative hypothesis (H1): the mean weight of Basketball athletes and the mean weight of Weightlifting athletes are not equal.

We compute the mean of each group and draw a random sample from both groups.

In [109]:
Weight_Basket = statistics_df.loc[statistics_df.Sport == "Basketball", "Weight"]
# Discard zeros (the zero-filled missing values), since they distort the mean
Weight_Basket = Weight_Basket[Weight_Basket != 0]
Weight_Basket.describe()


count    3678.000000
mean       85.777053
std        14.817590
min        50.000000
25%        75.000000
50%        85.000000
75%        95.000000
max       156.000000
Name: Weight, dtype: float64

In [110]:
Weight_Halt = statistics_df.loc[statistics_df.Sport == "Weightlifting", "Weight"]
# Discard zeros (the zero-filled missing values), since they distort the mean
Weight_Halt = Weight_Halt[Weight_Halt != 0]
Weight_Halt.describe()

count    3803.000000
mean       78.726663
std        22.602393
min        47.000000
25%        60.000000
50%        75.000000
75%        90.000000
max       176.500000
Name: Weight, dtype: float64

In [111]:
sample_Weight_Basket = Weight_Basket.sample(100)
sample_Weight_Halt = Weight_Halt.sample(100)

In [112]:
sample_Weight_Basket.mean()

84.47

In [113]:
sample_Weight_Halt.mean()

79.13

In [119]:
alpha = 0.05
stat, p = f_oneway(sample_Weight_Basket, sample_Weight_Halt)

print(f'F-statistic = {stat:.3f}\np-value = {p:.3f}')
print('We cannot reject H0' if p > alpha else 'We can reject H0')

F-statistic = 4.288
p-value = 0.040
We can reject H0


We reject the null hypothesis at the 5% significance level (p = 0.040 < 0.05) and conclude that the mean weight of Basketball athletes differs from the mean weight of Weightlifting athletes.