<a href="https://colab.research.google.com/github/Azimoj/WCS/blob/main/Hypothesis_testing_with_Python_sol.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Hypothesis testing with Python**

Use your knowledge of hypothesis testing with Python to answer the following research questions.

Please provide an interpretation of your result: whether or not you reject the null hypothesis and why.

## **Research question 1**
In previous years, 52% of parents believed that electronics and social media were the cause of their teenager’s lack of sleep. Do more parents today believe that their teenager’s lack of sleep is caused by electronics and social media?

**Population:** Parents with a teenager (age 13-18)

**Parameter of Interest:** p (proportion)

**Null Hypothesis:** p = 0.52

**Alternative Hypothesis:** p > 0.52 (note that this is a one-sided test)


**Data:** 1018 people were surveyed. 56% of those surveyed believe that their teenager’s lack of sleep is caused by electronics and social media.



proportions_ztest documentation : https://www.statsmodels.org/stable/generated/statsmodels.stats.proportion.proportions_ztest.html#statsmodels.stats.proportion.proportions_ztest

In [None]:
from statsmodels.stats.proportion import proportions_ztest

# Data
count = 0.56 * 1018  # 56% of the 1018 respondents (about 570 people)
nobs = 1018  # Total number of respondents
value = 0.52  # Proportion under the null hypothesis

# One-sided proportion z-test (Ha: p > 0.52)
z_stat, p_value = proportions_ztest(count=count, nobs=nobs, value=value, alternative='larger')

# Display the results
print(f"Z statistic: {z_stat:.3}")
print(f"P-value: {p_value:.3%}")

# Interpret the results
alpha = 0.05  # Significance level
if p_value < alpha:
    print("The p-value is below the significance level. Reject the null hypothesis.")
else:
    print("The p-value is not below the significance level. The null hypothesis is not rejected.")


Z statistic: 2.57
P-value: 0.507%
The p-value is below the significance level. Reject the null hypothesis.


The Z statistic measures how far the sample proportion lies from the value stated in the null hypothesis, in units of standard error. The further it is from zero, the larger the gap between the observed data and the null hypothesis.

The p-value is the probability of obtaining a Z statistic at least as extreme as the one observed, assuming the null hypothesis is true. A small p-value means the observed gap would be unlikely if there were no real effect, so the result is statistically significant.

Here the p-value (0.507%) is well below α = 0.05, so we reject the null hypothesis: the data suggest that more than 52% of parents now believe that electronics and social media cause their teenager's lack of sleep.
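As a cross-check, the same statistic can be reproduced by hand. The sketch below assumes the standard error is estimated from the sample proportion (0.56), which matches the 2.57 reported above; the helper name `one_sided_proportion_z` is illustrative and not part of statsmodels.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_proportion_z(p_hat, p0, n):
    """Return (z, p_value) for H0: p = p0 against Ha: p > p0.

    Assumption: the standard error is estimated from the sample
    proportion p_hat rather than from the null value p0.
    """
    se = sqrt(p_hat * (1 - p_hat) / n)
    z = (p_hat - p0) / se
    p_value = 1 - NormalDist().cdf(z)  # upper-tail probability
    return z, p_value


z, p_value = one_sided_proportion_z(p_hat=0.56, p0=0.52, n=1018)
print(f"z = {z:.2f}, p-value = {p_value:.3%}")
```

Computing the standard error from the null value 0.52 instead would give a slightly smaller statistic; either choice leads to the same decision here.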

## **Research Question 2** Body Mass Index
Considering adults in the [NHANES data](https://raw.githubusercontent.com/kshedden/statswpy/master/NHANES/merged/nhanes_2015_2016.csv), are the mean Body Mass Index (BMI) values of men and women significantly different?

**Population:** Adults in the NHANES data.

**Parameter of Interest:** $\mu_1$: mean BMI of women. $\mu_2$: mean BMI of men.

**Null Hypothesis:** $\mu_1 = \mu_2$

**Alternative Hypothesis:** $\mu_1 \neq \mu_2$

**Data:**
2976 Female Adults
$\mu_1 = 29.94$
$\sigma_1 = 7.75$

2759 Male Adults
$\mu_2 = 28.78$
$\sigma_2 = 6.25$

In [None]:
import pandas as pd
url = "https://raw.githubusercontent.com/kshedden/statswpy/master/NHANES/merged/nhanes_2015_2016.csv"
df = pd.read_csv(url)

df.head()


Unnamed: 0,SEQN,ALQ101,ALQ110,ALQ130,SMQ020,RIAGENDR,RIDAGEYR,RIDRETH1,DMDCITZN,DMDEDUC2,...,BPXSY2,BPXDI2,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST,HIQ210
0,83732,1.0,,1.0,1,1,62,3,1.0,5.0,...,124.0,64.0,94.8,184.5,27.8,43.3,43.6,35.9,101.1,2.0
1,83733,1.0,,6.0,1,1,53,3,2.0,3.0,...,140.0,88.0,90.4,171.4,30.8,38.0,40.0,33.2,107.9,
2,83734,1.0,,,1,1,78,3,1.0,3.0,...,132.0,44.0,83.4,170.1,28.8,35.6,37.0,31.0,116.5,2.0
3,83735,2.0,1.0,1.0,2,2,56,3,1.0,5.0,...,134.0,68.0,109.8,160.9,42.4,38.5,37.7,38.3,110.1,2.0
4,83736,2.0,1.0,1.0,2,2,42,4,1.0,4.0,...,114.0,54.0,55.2,164.9,20.3,37.4,36.0,27.2,80.4,2.0


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5735 entries, 0 to 5734
Data columns (total 28 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   SEQN      5735 non-null   int64  
 1   ALQ101    5208 non-null   float64
 2   ALQ110    1731 non-null   float64
 3   ALQ130    3379 non-null   float64
 4   SMQ020    5735 non-null   int64  
 5   RIAGENDR  5735 non-null   int64  
 6   RIDAGEYR  5735 non-null   int64  
 7   RIDRETH1  5735 non-null   int64  
 8   DMDCITZN  5734 non-null   float64
 9   DMDEDUC2  5474 non-null   float64
 10  DMDMARTL  5474 non-null   float64
 11  DMDHHSIZ  5735 non-null   int64  
 12  WTINT2YR  5735 non-null   float64
 13  SDMVPSU   5735 non-null   int64  
 14  SDMVSTRA  5735 non-null   int64  
 15  INDFMPIR  5134 non-null   float64
 16  BPXSY1    5401 non-null   float64
 17  BPXDI1    5401 non-null   float64
 18  BPXSY2    5535 non-null   float64
 19  BPXDI2    5535 non-null   float64
 20  BMXWT     5666 non-null   floa

After some research, the column of interest is BMXBMI, which holds the Body Mass Index.

In [None]:
men = df[df["RIAGENDR"] == 1]
women = df[df["RIAGENDR"] == 2]

print(f"{women.shape[0]} Female Adults  μ1={women['BMXBMI'].mean():.4}   σ1={women['BMXBMI'].std():.3}")
print(f"{men.shape[0]} Male Adults  μ2={men['BMXBMI'].mean():.4}   σ2={men['BMXBMI'].std():.3}")

2976 Female Adults  μ1=29.94   σ1=7.75
2759 Male Adults  μ2=28.78   σ2=6.25


ztest documentation : https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.ztest.html

ztest_ind documentation : https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.CompareMeans.ztest_ind.html#statsmodels.stats.weightstats.CompareMeans.ztest_ind

In [None]:
# Cleaning: drop rows with a missing BMI value
men_clean = men['BMXBMI'].dropna()
women_clean = women['BMXBMI'].dropna()

print(men_clean.shape[0], women_clean.shape[0])

2718 2944


In [None]:
import statsmodels.stats.weightstats as ws

col1 = ws.DescrStatsW(men_clean)
col2 = ws.DescrStatsW(women_clean)

cm_obj = ws.CompareMeans(col1, col2)

zstat, z_pval = cm_obj.ztest_ind(alternative='two-sided', usevar='unequal')

print(f"Z statistic: {zstat:.3}")
print(f"p-value: {z_pval:.10%}")

# Interpret the results
alpha = 0.05

if z_pval < alpha:
    print("The p-value is below the alpha threshold.")
    print("We therefore reject the null hypothesis.")
else:
    print("The p-value is not below the alpha threshold.")
    print("We do not have enough evidence to reject the null hypothesis.")

Z statistic: -6.23
p-value: 0.0000000472%
The p-value is below the alpha threshold.
We therefore reject the null hypothesis.


In [None]:
import statsmodels.stats.weightstats as ws

col1 = ws.DescrStatsW(men_clean)
col2 = ws.DescrStatsW(women_clean)

cm_obj = ws.CompareMeans(col1, col2)

zstat, z_pval = cm_obj.ztest_ind(alternative='two-sided', usevar='pooled')

print(f"Z statistic: {zstat:.3}")
print(f"p-value: {z_pval:.10%}")

# Interpret the results
alpha = 0.05

if z_pval < alpha:
    print("The p-value is below the alpha threshold.")
    print("We therefore reject the null hypothesis.")
else:
    print("The p-value is not below the alpha threshold.")
    print("We do not have enough evidence to reject the null hypothesis.")

Z statistic: -6.18
p-value: 0.0000000659%
The p-value is below the alpha threshold.
We therefore reject the null hypothesis.


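Mean BMI differs significantly between men and women: both the unequal-variance and the pooled-variance tests reject the null hypothesis. The negative Z statistic reflects that men (the first sample) have the lower mean BMI, 28.78 against 29.94 for women.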
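The unequal-variance statistic can be approximated from the summary values printed earlier. This is a sketch, not the statsmodels implementation: it assumes the rounded means and standard deviations shown above and the sample sizes left after dropping missing values (2718 men, 2944 women), so it agrees with the reported statistic only to about two decimal places. The function name `two_sample_z` is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sided z-test for a difference in means with unequal variances."""
    se = sqrt(sd1**2 / n1 + sd2**2 / n2)  # standard error of the difference
    z = (mean1 - mean2) / se
    p_value = 2 * NormalDist().cdf(-abs(z))  # both tails
    return z, p_value


# Men first, women second, matching the order passed to CompareMeans.
z, p_value = two_sample_z(28.78, 6.25, 2718, 29.94, 7.75, 2944)
print(f"z = {z:.2f}, p-value = {p_value:.2e}")
```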

# **Optional**

## Theoretical exercises

### Question 1

Some of the following statements refer to the null hypothesis, some to the alternate hypothesis.

State the null hypothesis,  H0 , and the alternative hypothesis  Ha , in terms of the appropriate parameter  (*μ* or *p*) .

**For example:**

*Statement:* At most 60% of Americans vote in presidential elections.

*Answer:* H0: p≤0.60; Ha: p>0.60

*Statement:* The mean number of years Americans work before retiring is 34.

*Answer:* H0: μ=34; Ha: μ≠34


**Your turn:**
1. The mean starting salary for San Jose State University graduates is at least USD 100,000 per year.

    **H0: μ≥100,000; Ha: μ<100,000**
2. Twenty-nine percent of high school seniors get drunk each month.

    **H0: p=0.29; Ha: p≠0.29**
3. Fewer than 5% of adults ride the bus to work in Los Angeles.

    **H0: p≥0.05; Ha: p<0.05** (the statement is a strict inequality, so it is the alternative hypothesis)
4. The mean number of cars a person owns in her lifetime is not more than ten.

    **H0: μ≤10; Ha: μ>10**
5. About half of Americans prefer to live away from cities, given the choice.

    **H0: p=0.5; Ha: p≠0.5**
6. Europeans have a mean paid vacation each year of six weeks.

    **H0: μ=6; Ha: μ≠6**
7. The chance of developing breast cancer is under 11% for women.

    **H0: p≥0.11; Ha: p<0.11** (the statement is a strict inequality, so it is the alternative hypothesis)
8. Private universities' mean tuition cost is more than USD 20,000 per year.

    **H0: μ≤20,000; Ha: μ>20,000** (the statement is a strict inequality, so it is the alternative hypothesis)



### Question 2

Over the past few decades, public health officials have examined the link between weight concerns and teen girls' smoking. Researchers surveyed a group of 273 randomly selected teen girls living in Massachusetts (between 12 and 15 years old). After four years the girls were surveyed again. Sixty-three said they smoked to stay thin. Is there good evidence that more than thirty percent of the teen girls smoke to stay thin? The alternative hypothesis is:

> a. p<0.30

> b. p≤0.30

> c. p≥0.30

> d. p>0.30

Null hypothesis (H0): the proportion of teen girls who smoke to stay thin is at most 30%:

H0: p ≤ 0.30

Alternative hypothesis (Ha): the proportion of teen girls who smoke to stay thin is greater than 30%:

Ha: p > 0.30

The question asks for evidence of "more than thirty percent", so the answer is **d**.
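Although the exercise only asks for the alternative hypothesis, the test itself can be sketched with the numbers given (63 of 273 girls). The code below is an illustration under one stated assumption: the standard error is computed from the null value 0.30.

```python
from math import sqrt
from statistics import NormalDist

count, n, p0 = 63, 273, 0.30
p_hat = count / n  # observed proportion

# Assumption: standard error computed under H0 (p = 0.30).
se = sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se
p_value = 1 - NormalDist().cdf(z)  # upper tail, because Ha: p > 0.30

print(f"p_hat = {p_hat:.3f}, z = {z:.2f}, p-value = {p_value:.3f}")
```

Because the observed proportion (about 23%) is below 30%, the upper-tail p-value is close to 1 and the data give no evidence in favor of Ha.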

### Question 3

State the Type I and Type II errors in complete sentences given the following statements.

**For example:**

*Statement:* The mean number of years Americans work before retiring is 34.

*Answer:* Type I error: We conclude that the mean is not 34 years, when it really is 34 years. Type II error: We conclude that the mean is 34 years, when in fact it really is not 34 years.

**Your turn:**
1. At most 60% of Americans vote in presidential elections.
2. The mean starting salary for San Jose State University graduates is at least USD 100,000 per year.
3. Twenty-nine percent of high school seniors get drunk each month.
4. Fewer than 5% of adults ride the bus to work in Los Angeles.
5. The mean number of cars a person owns in his or her lifetime is not more than ten.
6. About half of Americans prefer to live away from cities, given the choice.
7. Europeans have a mean paid vacation each year of six weeks.
8. The chance of developing breast cancer is under 11% for women.
9. Private universities' mean tuition cost is more than USD 20,000 per year.

A type I error (false-positive) occurs if an investigator rejects a null hypothesis that is actually true in the population; a type II error (false-negative) occurs if the investigator fails to reject a null hypothesis that is actually false in the population.
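The link between the significance level and the type I error rate can be checked by simulation. The sketch below is illustrative only: it assumes normally distributed data with a known standard deviation of 1, a true mean equal to the null value 0, and a two-sided z-test at α = 0.05; none of these settings come from the exercises.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

random.seed(0)  # fixed seed so the run is repeatable

alpha = 0.05
n, trials = 100, 2000
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value

false_positives = 0
for _ in range(trials):
    # H0 is true by construction: the data have mean 0 and sd 1.
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = fmean(sample) / (1 / sqrt(n))
    if abs(z) > z_crit:
        false_positives += 1  # rejecting a true H0 is a type I error

type_i_rate = false_positives / trials
print(f"Estimated type I error rate: {type_i_rate:.3f}")
```

The estimated rate should land close to α, since α is by definition the probability of rejecting a true null hypothesis.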

**1. At most 60% of Americans vote in presidential elections:**

Type I error: We conclude that more than 60% of Americans vote in presidential elections when, in reality, at most 60% do.

Type II error: We conclude that at most 60% of Americans vote in presidential elections when, in reality, more than 60% do vote.

**2. The mean starting salary for San Jose State University graduates is at least USD 100,000 per year:**

Type I error: We conclude that the mean starting salary is not at least USD 100,000 per year when, in reality, it is.

Type II error: We conclude that the mean starting salary is at least USD 100,000 per year when, in reality, it's not.

**3. Twenty-nine percent of high school seniors get drunk each month:**

Type I error: We conclude that the proportion of high school seniors who get drunk each month is not 29% when, in reality, it is 29%.

Type II error: We conclude that 29% of high school seniors get drunk each month when, in reality, the proportion is not 29%.

**4. Fewer than 5% of adults ride the bus to work in Los Angeles:**

Type I error: We conclude that fewer than 5% of adults ride the bus to work in Los Angeles when, in reality, 5% or more do. (The claim is a strict inequality, so it is the alternative hypothesis; the null hypothesis is p ≥ 0.05.)

Type II error: We conclude that 5% or more of adults ride the bus to work in Los Angeles when, in reality, fewer than 5% do.

**5. The mean number of cars a person owns in his or her lifetime is not more than ten:**

Type I error: We conclude that the mean number of cars a person owns in their lifetime is more than ten when, in reality, it's not.

Type II error: We conclude that the mean number of cars a person owns in their lifetime is not more than ten when, in reality, it is.

**6. About half of Americans prefer to live away from cities, given the choice:**

Type I error: We conclude that the proportion of Americans who prefer to live away from cities is not one half when, in reality, it is about one half.

Type II error: We conclude that about half of Americans prefer to live away from cities when, in reality, the proportion is not one half.

**7. Europeans have a mean paid vacation each year of six weeks:**

Type I error: We conclude that the mean paid vacation for Europeans each year is not six weeks when, in reality, it is six weeks.

Type II error: We conclude that the mean paid vacation for Europeans each year is six weeks when, in reality, it is not six weeks.

**8. The chance of developing breast cancer is under 11% for women:**

Type I error: We conclude that the chance of developing breast cancer is under 11% for women when, in reality, it is 11% or more. (The claim is a strict inequality, so it is the alternative hypothesis; the null hypothesis is p ≥ 0.11.)

Type II error: We conclude that the chance of developing breast cancer is 11% or more for women when, in reality, it is under 11%.

**9. Private universities' mean tuition cost is more than USD 20,000 per year:**

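Type I error: We conclude that the mean tuition cost for private universities is more than USD 20,000 per year when, in reality, it is USD 20,000 or less. (The claim is a strict inequality, so it is the alternative hypothesis; the null hypothesis is μ ≤ 20,000.)

Type II error: We conclude that the mean tuition cost for private universities is USD 20,000 or less per year when, in reality, it is more than USD 20,000.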

## **Python exercises**

### Research Question
Is there a significant difference between the population proportions of parents of black children and parents of Hispanic children who report that their child has had some swimming lessons?

**Populations:** All parents of black children age 6-18 and all parents of Hispanic children age 6-18

**Parameter of Interest:** p1 - p2, where p1 is the proportion for parents of Black children and p2 is the proportion for parents of Hispanic children

**Null Hypothesis:** p1 - p2 = 0

**Alternative Hypothesis:** p1 - p2 $\neq$ 0

**Data:**

247 Parents of Black Children. 36.8% of parents report that their child has had some swimming lessons.

308 Parents of Hispanic Children. 38.9% of parents report that their child has had some swimming lessons.

*Hint:* use ttest_ind() from statsmodels

In [None]:
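import numpy as np
from statsmodels.stats.weightstats import ttest_ind

# Sample sizes and reported proportions
n_black = 247
p1 = 0.368
n_hisp = 308
p2 = 0.389

# Only summary percentages are available, so 0/1 responses are simulated here.
# No seed is set: the draws, and therefore the results below, vary from run to run.
sample_black = np.random.binomial(n=1, p=p1, size=n_black)
sample_hisp = np.random.binomial(n=1, p=p2, size=n_hisp)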

In [None]:
print(sample_black)
print(sample_black.shape)
print(f"{sample_black.mean():.2%}")

[0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 1
 1 1 1 1 0 1 0 1 0 1 0 0 1 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0
 0 1 0 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 1 0 1 0 0 1
 1 0 0 1 0 1 1 0 0 0 0 0 1 0 0 0 0 1 0 1 0 1 0 0 1 0 0 1 0 1 0 1 0 1 0 0 0
 1 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 1 0 0 1 1 0 1 0 1 1 0 0 0 1 0 0 0 0 1
 0 0 1 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 0 1 0 0 0 1 1 0 1 1 0 0 0 1 0 0 0 1 1
 0 1 1 0 1 1 0 0 0 0 1 0 0 0 0 1 0 1 1 0 1 0 0 0 1]
(247,)
36.03%


In [None]:
tstat, p_value, d = ttest_ind(sample_black, sample_hisp)

print(tstat, p_value)

# Interpret the results
alpha = 0.05

if p_value < alpha:
    print("The p-value is below the alpha threshold.")
    print("We therefore reject the null hypothesis.")
else:
    print("The p-value is not below the alpha threshold.")
    print("We do not have enough evidence to reject the null hypothesis.")

-0.5509753903525183 0.5818730678288508
The p-value is not below the alpha threshold.
We do not have enough evidence to reject the null hypothesis.


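-> No significant difference between the two populations. Note that the samples above are simulated from the reported percentages, so the exact statistic changes from run to run (the simulated sample for Black children shows 36.03% rather than the reported 36.8%).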
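A deterministic alternative is to test the reported proportions directly instead of simulated draws. The sketch below rests on stated assumptions: the counts 91 and 120 are obtained by rounding 36.8% of 247 and 38.9% of 308, and a pooled two-proportion z-test is swapped in for the `ttest_ind` approach suggested in the hint.

```python
from math import sqrt
from statistics import NormalDist

n_black, n_hisp = 247, 308
# Assumption: counts reconstructed by rounding the reported percentages.
count_black = round(0.368 * n_black)  # 91
count_hisp = round(0.389 * n_hisp)    # 120

p1 = count_black / n_black
p2 = count_hisp / n_hisp

# Pooled proportion under H0: p1 - p2 = 0
p_pool = (count_black + count_hisp) / (n_black + n_hisp)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_black + 1 / n_hisp))

z = (p1 - p2) / se
p_value = 2 * NormalDist().cdf(-abs(z))  # two-sided

print(f"z = {z:.3f}, p-value = {p_value:.3f}")
```

The p-value is again far above 0.05, so this approach also fails to reject the null hypothesis, and it gives the same answer on every run.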

### Research Question
Let's say a cartwheeling competition was organized for some adults. The data look as follows:

(80.57, 98.96, 85.28, 83.83, 69.94, 89.59, 91.09, 66.25, 91.21, 82.7 , 73.54, 81.99, 54.01, 82.89, 75.88, 98.32, 107.2 , 85.53, 79.08, 84.3 , 89.32, 86.35, 78.98, 92.26, 87.01)

Is the average cartwheel distance (in inches) for adults more than 80 inches?

**Population:** All adults

**Parameter of Interest:** $\mu$, population mean cartwheel distance.

**Null Hypothesis:** $\mu$ = 80

**Alternative Hypothesis:** $\mu$ > 80

**Data:**

25 adult participants.

$\mu$ = 83.84

$\sigma$ = 10.72




In [None]:
import numpy as np
from statsmodels.stats.weightstats import ztest

data = np.array([80.57, 98.96, 85.28, 83.83, 69.94, 89.59, 91.09, 66.25, 91.21, 82.7 , 73.54, 81.99, 54.01, 82.89, 75.88, 98.32, 107.2 , 85.53, 79.08, 84.3 , 89.32, 86.35, 78.98, 92.26, 87.01])

n = data.shape[0]  # number of participants
mu = data.mean()
std = data.std()  # population formula (ddof=0)
print(n, mu, std, '\n')

# One-sided z-test of H0: mu = 80 against Ha: mu > 80
zstat, p_value = ztest(data, value=80, alternative='larger')
print(zstat, p_value, '\n')

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("The p-value is below the alpha threshold.")
    print("We therefore reject the null hypothesis.")
else:
    print("The p-value is not below the alpha threshold.")
    print("We do not have enough evidence to reject the null hypothesis.")

25 83.84320000000001 10.716018932420752 

1.756973189172546 0.039461189601168366 

The p-value is below the alpha threshold.
We therefore reject the null hypothesis.
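The reported statistic can be reproduced with the standard library. This sketch assumes, as the matching result suggests, that the z-test standardizes with the sample standard deviation (n - 1 in the denominator) rather than the `data.std()` value printed above; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

data = [80.57, 98.96, 85.28, 83.83, 69.94, 89.59, 91.09, 66.25, 91.21,
        82.7, 73.54, 81.99, 54.01, 82.89, 75.88, 98.32, 107.2, 85.53,
        79.08, 84.3, 89.32, 86.35, 78.98, 92.26, 87.01]

mu0 = 80  # value of the mean under H0
n = len(data)
x_bar = mean(data)
s = stdev(data)  # sample standard deviation (divides by n - 1)

z = (x_bar - mu0) / (s / sqrt(n))
p_value = 1 - NormalDist().cdf(z)  # upper tail, because Ha: mu > 80

print(f"n = {n}, mean = {x_bar:.2f}, s = {s:.2f}")
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

With only 25 observations, a t-test would be the more cautious choice, since it accounts for the extra uncertainty in estimating the standard deviation; its p-value is somewhat larger than the z-test's.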
