# S07_T01_Hypothesis Testing

### Ex1: Pick a sports-themed dataset that you like and select one attribute from it. Compute the p-value and state whether it rejects the null hypothesis at an alpha of 5%

#### Terminology
Hypothesis testing relies on a few key terms:

 * Null hypothesis (H0): the hypothesis that the sample observations result purely from chance. It usually states that there is no change or no effect.
 * Alternative hypothesis (H1): the hypothesis that the sample observations are influenced by some non-random cause.
 * P-value: the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is correct. A smaller p-value means stronger evidence against the null hypothesis, and so in favor of the alternative.
 * Alpha: the significance level, that is, the probability of rejecting the null hypothesis when it is true. Such a false rejection is known as a Type 1 error.
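To make the meaning of alpha concrete, the following sketch simulates many samples for which the null hypothesis is true by construction and counts how often a t-test rejects it anyway. The function name `type1_error_rate`, the sample sizes, and the seed are illustrative assumptions and are not part of the original notebook.

```python
import numpy as np
from scipy import stats

def type1_error_rate(n_trials=2000, n=30, alpha=0.05, seed=0):
    """Share of t-tests that reject H0 even though H0 is true."""
    rng = np.random.default_rng(seed)
    # H0 is true by construction: every sample is drawn from a population with mean 0
    samples = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n))
    # One t-test per row (axis=1), each testing H0: mu = 0
    _, p_values = stats.ttest_1samp(samples, popmean=0.0, axis=1)
    # A rejection here is a false rejection, i.e. a Type 1 error
    return float(np.mean(p_values < alpha))

rate = type1_error_rate()
print(f"Share of false rejections: {rate:.3f}")
```

With enough trials the share of false rejections should settle close to the chosen alpha, which is what "significance level" means in practice.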

#### Steps for Hypothesis Testing

Here are the steps for performing a hypothesis test:

 * State your null and alternative hypotheses. To reiterate, the null hypothesis typically states that everything is as it normally was, that is, that nothing has changed.

 * Set your significance level, alpha. This is typically set at 5%, but other levels can be used depending on the situation and on how costly it is to commit a Type 1 or Type 2 error.

 * Collect sample data and calculate sample statistics.

 * Calculate the p-value from the sample statistics. Once you have the sample statistics, you can determine the p-value through different methods. The most common are the t-score and the z-score, both of which assume approximately normally distributed data.

 * Reject or do not reject the null hypothesis. If the p-value is greater than alpha, do not reject the null hypothesis; otherwise, reject it.
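The five steps above can be sketched end to end as follows. The ages in the list are invented purely for illustration, and the two-sided alternative is an assumption made to keep the example minimal.

```python
from scipy import stats

# Hypothetical ages, invented purely for illustration
ages = [23, 27, 21, 30, 25, 22, 28, 24, 26, 29]

# Step 1: state the hypotheses. H0: mu = 25, H1: mu != 25 (two-sided)
popmean = 25

# Step 2: set the significance level
alpha = 0.05

# Steps 3 and 4: ttest_1samp derives the sample statistics and the p-value
t_stat, p_value = stats.ttest_1samp(ages, popmean=popmean)

# Step 5: compare the p-value with alpha and decide
decision = "reject H0" if p_value <= alpha else "do not reject H0"
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, decision: {decision}")
```

The sample mean here is 25.5, close to the hypothesized 25 relative to the spread of the data, so the test does not reject H0.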

In [23]:
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import ttest_1samp, ttest_ind

In [3]:
# Load athlete_events.csv, a dataset of Olympic athletes with fields such as
# name, sex, age, height, weight, team, games, sport, event and medal.

pd.set_option("display.max_rows", None)
statistics_df = pd.read_csv("athlete_events.csv")
statistics_df.head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
2,3,Gunnar Nielsen Aaby,M,24.0,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,
3,4,Edgar Lindenau Aabye,M,34.0,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
4,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,


In [4]:
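# Replace any NaN values with zeros
# Caution: zero-filling numeric columns such as Age, Height and Weight lowers
# their means (note min = 0 in the describe() output that follows)
statistics_df.fillna(value=0, inplace=True)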
statistics_df.head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,0
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,0
2,3,Gunnar Nielsen Aaby,M,24.0,0.0,0.0,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,0
3,4,Edgar Lindenau Aabye,M,34.0,0.0,0.0,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
4,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,0
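The effect of replacing missing values with zeros can be seen on a small example. The sketch below reuses the five Height values shown in the `head()` output above and compares the mean after zero-filling with the mean that skips the missing entries; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Height values of the first five rows, with the two missing entries as NaN
heights = pd.Series([180.0, 170.0, np.nan, np.nan, 185.0], name="Height")

# Zero-filling treats the missing heights as real measurements of 0 cm
mean_zero_filled = heights.fillna(0).mean()

# Series.mean() skips NaN by default (skipna=True)
mean_missing_skipped = heights.mean()

# Number of rows that actually carry a height
n_valid = int(heights.notna().sum())

print(f"zero-filled mean:     {mean_zero_filled:.1f}")
print(f"missing-skipped mean: {mean_missing_skipped:.1f} (n = {n_valid})")
```

Because pandas already ignores NaN in `mean()`, `std()` and `describe()`, leaving the missing values in place, or dropping them per column with `dropna()`, is one way to keep the statistics unbiased.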


In [5]:
statistics_df.describe()

Unnamed: 0,ID,Age,Height,Weight,Year
count,271116.0,271116.0,271116.0,271116.0,271116.0
mean,68248.954396,24.663827,136.424553,54.305674,1978.37848
std,39022.286345,7.840652,73.45056,32.381492,29.877632
min,1.0,0.0,0.0,0.0,1896.0
25%,34643.0,21.0,157.0,47.0,1960.0
50%,68205.0,24.0,171.0,64.0,1988.0
75%,102097.25,28.0,180.0,75.0,2002.0
max,135571.0,97.0,226.0,214.0,2016.0


In [36]:
# Use the Age attribute again for this first exercise
# Draw a simple random sample of 50 rows (no random_state is set, so the sample changes on every run)
sample_dades = statistics_df.sample(50)
sample_dades.describe()

Unnamed: 0,ID,Age,Height,Weight,Year
count,50.0,50.0,50.0,50.0,50.0
mean,64476.14,25.58,124.14,48.49,1971.56
std,36474.993342,9.73546,82.418298,33.587305,30.38606
min,5867.0,0.0,0.0,0.0,1908.0
25%,34081.25,22.25,0.0,0.0,1953.0
50%,63647.0,26.0,170.0,61.5,1976.0
75%,89576.75,29.75,182.0,74.75,1996.0
max,134002.0,65.0,190.0,88.0,2012.0


In [37]:
# Compute the mean of the selected sample (x̄)
sample_dades["Age"].mean()

25.58

In [34]:
# Compute the mean age of all athletes (population mean μ = statistics_df["Age"].mean())
age_mitjana = statistics_df["Age"].mean()
age_mitjana

24.663826553947388

In [38]:
# State the hypotheses
# Null hypothesis (H0): μ = 25
# Alternative hypothesis (H1): μ < 25
# alpha = 5% (0.05)
# For reference, the 50-row sample above has x̄ = 25.58, s = 9.73 and n = 50,
# but the test below runs on the full Age column (n = 271116), not on that sample.

# One-sample t-test; alternative="less" matches the one-sided H1 (requires SciPy >= 1.6)
from scipy import stats

alpha = 0.05
stat, p = stats.ttest_1samp(statistics_df["Age"], popmean=25, alternative="less")
print(f't-stat = {stat:.2f}\np-value = {p:.2f}')
if p > alpha:
    print("We cannot reject H0")
else:
    print("We can reject H0")


t-stat = -22.32
p-value = 0.00
We can reject H0


The p-value is below alpha (0.05), so we reject the null hypothesis H0: at the 5% significance level, the data indicate that the mean athlete age is below 25.
Keep in mind that the Age column still contains the missing values that were replaced with 0 earlier, which pulls the mean down.
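The reported t-statistic can be cross-checked by hand from the `describe()` summary of the full Age column, using t = (x̄ − μ₀) / (s / √n). The sketch below copies the mean, standard deviation and count from that summary; the variable names are illustrative, and the one-sided p-value assumes H1: μ < 25 as stated in the hypotheses.

```python
import math
from scipy import stats

# Values copied from the describe() output of the full Age column
n = 271116
mean_age = 24.663827
std_age = 7.840652   # sample standard deviation (ddof=1), as pandas reports it
popmean = 25         # value stated in H0

# Standard error of the mean, then the t-statistic
standard_error = std_age / math.sqrt(n)
t_stat = (mean_age - popmean) / standard_error

# One-sided p-value for H1: mu < 25, from Student's t with n - 1 degrees of freedom
p_value = stats.t.cdf(t_stat, df=n - 1)

print(f"t = {t_stat:.2f}")
```

The very large n makes the standard error tiny, which is why a difference of about a third of a year from 25 yields such an extreme t-statistic.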

In [None]:
# Manual t-statistic for the 50-row sample against the population mean
# Sample mean (x̄) = 25.58
# Population mean (μ) = 24.66
# Sample standard deviation (s) = 9.73
# Number of observations (n) = 50
# alpha = 5% (0.05)
# t = (25.58 - 24.66) / (9.73 / √50) ≈ 0.669
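The manual calculation can be completed with a p-value. The following sketch treats 24.66 as the hypothesized mean and uses a two-sided alternative; both choices are assumptions, since the notebook cell stops at the t-statistic.

```python
import math
from scipy import stats

# Summary statistics of the 50-row sample, as listed in the cell above
sample_mean = 25.58
population_mean = 24.66   # used here as the hypothesized mean (assumption)
sample_std = 9.73
n = 50
alpha = 0.05

# t = (x̄ - μ) / (s / √n)
t_stat = (sample_mean - population_mean) / (sample_std / math.sqrt(n))

# Two-sided p-value: probability of a |t| at least this large under H0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

reject_h0 = p_value <= alpha
print(f"t = {t_stat:.4f}, p = {p_value:.3f}, reject H0: {reject_h0}")
```

With only 50 observations and a sample standard deviation near 10, a difference of less than one year is well within sampling noise, so this sample gives no grounds to reject H0.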

### Ex2: Continue with the sports-themed dataset that you like and select two other attributes from it. Compute the p-values and state whether they reject the null hypothesis at an alpha of 5%

### Ex3: Continue with the sports-themed dataset that you like and select three attributes from it. Compute the p-value and state whether it rejects the null hypothesis at an alpha of 5%