![](https://i.ibb.co/j5Zk7wT/dst-eda-4-11.png) <br/><br/>
![](https://i.ibb.co/R6RqSYM/dst-eda-4-12.png) <br/><br/>
![](https://i.ibb.co/Y8BLZYc/dst-eda-4-10.png) <br/><br/>

In [8]:
import pandas as pd

In [9]:
data = pd.read_csv('Data/pizzas.csv')

print('Table Shape: ', data.shape)
data.head()

Table Shape:  (35, 2)


| | Making Unit 1 | Making Unit 2 |
|---|---|---|
| 0 | 6.809 | 6.7703 |
| 1 | 6.4376 | 7.5093 |
| 2 | 6.9157 | 6.73 |
| 3 | 7.3012 | 6.7878 |
| 4 | 7.4488 | 7.1522 |


# Checking the data for normality


In [10]:
H0 = 'Data is distributed normally'
Ha = 'The data is not distributed normally'

In [11]:
alpha = 0.05 # Significance level

## Shapiro–Wilk test

In [12]:
from scipy.stats import shapiro

In [13]:
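# Note: shapiro flattens its input, so both columns are pooled into a single sample here
_, p = shapiro(data)

print(f"{p:.3f} {'>' if p > alpha else '<='} {alpha} {H0 if p > alpha else Ha}")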

0.204 > 0.05 Data is distributed normally


## D'Agostino test

In [14]:
from scipy.stats import normaltest

In [15]:
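# normaltest works column-wise on a DataFrame, so p holds one p-value per column
_, p = normaltest(data)

# Only the first column ('Making Unit 1') is reported; compare against alpha, the value that is printed
print(f"{p[0]:.3f} {'>' if p[0] > alpha else '<='} {alpha} {H0 if p[0] > alpha else Ha}")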

0.251 > 0.05 Data is distributed normally
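Both checks above summarize the table with a single number, while each making unit is really its own sample. A minimal sketch of running the two tests per column could look like this. The DataFrame `demo` is synthetic stand-in data (the column names are borrowed from the table above; the values and the seed are assumptions), so the printed p-values say nothing about the real pizza data.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro, normaltest

alpha = 0.05  # Significance level

# Synthetic stand-in for the pizza table (assumed values, not the real file)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Making Unit 1': rng.normal(7.0, 0.4, size=35),
    'Making Unit 2': rng.normal(7.0, 0.4, size=35),
})

# Test every column on its own and keep both p-values
normality = {}
for column in demo.columns:
    _, p_shapiro = shapiro(demo[column])
    _, p_dagostino = normaltest(demo[column])
    normality[column] = (p_shapiro, p_dagostino)
    verdict = 'normal' if min(p_shapiro, p_dagostino) > alpha else 'not normal'
    print(f"{column}: Shapiro p={p_shapiro:.3f}, D'Agostino p={p_dagostino:.3f} -> {verdict}")
```

With the real table, the same loop would iterate over `data.columns` instead of `demo.columns`.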


-----

# Independent T-test

In [16]:
from scipy.stats import ttest_ind

In [17]:
H0 = 'There is no significant difference between pizza diameters in different pizzerias.'
Ha = 'There is a significant difference between pizza diameters in different pizzerias.'

The dependent variable (pizza diameter) is quantitative, and the two groups are drawn from different populations. Hence, we use an independent two-sample t-test.

In [18]:
test_results = ttest_ind(data['Making Unit 1'], data['Making Unit 2'], equal_var=True)

p = round(test_results[1],2)
print(f"{'%.3f' % p} > {alpha} {H0 if p > alpha else Ha}")

0.470 > 0.05 There is no significant difference between pizza diameters in different pizzerias.
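Setting `equal_var=True` assumes that both groups have the same variance. One possible way to check this assumption is Levene's test, switching to Welch's t-test (`equal_var=False`) when the variances appear to differ. The sketch below illustrates the idea on synthetic samples; `unit_1`, `unit_2`, and the seed are assumptions, not the pizza data.

```python
import numpy as np
from scipy.stats import levene, ttest_ind

alpha = 0.05  # Significance level

# Synthetic samples standing in for the two making units (assumed values)
rng = np.random.default_rng(1)
unit_1 = rng.normal(7.0, 0.4, size=35)
unit_2 = rng.normal(7.0, 0.4, size=35)

# Levene's test: H0 is that the two groups have equal variances
_, p_levene = levene(unit_1, unit_2)
equal_var = bool(p_levene > alpha)

# Student's t-test if the variances look equal, Welch's t-test otherwise
_, p_ttest = ttest_ind(unit_1, unit_2, equal_var=equal_var)

print(f"Levene p={p_levene:.3f}, equal_var={equal_var}, t-test p={p_ttest:.3f}")
```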


# Spearman Correlation

In [19]:
from numpy.random import rand
from scipy.stats import spearmanr

In [20]:
H0 = 'There is no dependence between the variables'
Ha = 'There is dependence between the variables'

In [21]:
data1 = rand(1000) * 20             # uniform (non-normal) sample on [0, 20)
data2 = data1 + (rand(1000) * 10)   # data1 plus uniform noise on [0, 10), also non-normal

corr, p = spearmanr(data1, data2)

# print(corr, p)
print(f"{p:.3f} {'>' if p > alpha else '<='} {alpha} {H0 if p > alpha else Ha}")

0.000 <= 0.05 There is dependence between the variables
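The p-value only tells us that a monotonic relationship is detectable. The coefficient itself describes its strength and direction, so it is worth printing as well. Here is a small sketch of the same experiment with a seeded generator, so that the run is reproducible (the seed and the names `x` and `y` are assumptions):

```python
import numpy as np
from scipy.stats import spearmanr

# Seeded generator so that repeated runs give the same numbers
rng = np.random.default_rng(42)
x = rng.random(1000) * 20        # uniform sample on [0, 20)
y = x + rng.random(1000) * 10    # x plus uniform noise on [0, 10)

corr, p = spearmanr(x, y)

# corr close to +1 means a strong increasing monotonic relationship
print(f"Spearman rho = {corr:.3f}, p-value = {p:.3g}")
```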


# ANOVA test

In [22]:
from scipy.stats import f_oneway

As data, we use shell-size measurements of mussels grown at three different locations.

In [23]:
petersburg = [0.0974, 0.1352, 0.0817, 0.1016, 0.0968, 0.1064, 0.105]
magadan    = [0.1033, 0.0915, 0.0781, 0.0685, 0.0677, 0.0697, 0.0764, 0.0689]
tvarminne  = [0.0703, 0.1026, 0.0956, 0.0973, 0.1039, 0.1045]

In [24]:
_, p = f_oneway(petersburg, magadan, tvarminne)

In [25]:
H0 = 'There is no significant difference between the average size of a mussel shell in three different locations.'
Ha = 'There is a significant difference between the average size of a mussel shell in three different locations.'

In [26]:
print(f"{p:.3f} {'>' if p > alpha else '<='} {alpha} {H0 if p > alpha else Ha}")

0.008 <= 0.05 There is a significant difference between the average size of a mussel shell in three different locations.
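To see what `f_oneway` measures, we can compute the F-statistic by hand: the variance between the group means divided by the variance within the groups. The sketch below uses the mussel data from above. The helper names (`ss_between`, `ss_within`, and so on) are ours, and the result is compared with SciPy's as a sanity check.

```python
import numpy as np
from scipy.stats import f, f_oneway

petersburg = [0.0974, 0.1352, 0.0817, 0.1016, 0.0968, 0.1064, 0.105]
magadan    = [0.1033, 0.0915, 0.0781, 0.0685, 0.0677, 0.0697, 0.0764, 0.0689]
tvarminne  = [0.0703, 0.1026, 0.0956, 0.0973, 0.1039, 0.1045]

groups = [np.asarray(g) for g in (petersburg, magadan, tvarminne)]
k = len(groups)                           # number of groups
n_total = sum(len(g) for g in groups)     # total number of observations
grand_mean = np.concatenate(groups).mean()

# Sum of squares between groups: how far each group mean is from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Sum of squares within groups: spread of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = f.sf(f_manual, k - 1, n_total - k)   # upper tail of the F distribution

f_scipy, p_scipy = f_oneway(*groups)
print(f"manual: F={f_manual:.3f}, p={p_manual:.3f}")
print(f"scipy:  F={f_scipy:.3f}, p={p_scipy:.3f}")
```

A significant result only says that at least one location differs from the others. It does not say which one.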


_____

In [27]:
data = pd.read_csv('Data/blood_pressure.csv')

print('Table Shape: ', data.shape)
data.head()

Table Shape:  (120, 5)


| | patient | sex | agegrp | bp_before | bp_after |
|---|---|---|---|---|---|
| 0 | 1 | Male | 30-45 | 143 | 153 |
| 1 | 2 | Male | 30-45 | 163 | 170 |
| 2 | 3 | Male | 30-45 | 153 | 168 |
| 3 | 4 | Male | 30-45 | 153 | 142 |
| 4 | 5 | Male | 30-45 | 146 | 141 |


## Z-test

In [28]:
from statsmodels.stats import weightstats

In [34]:
H0 = 'There is no difference between the variables'
Ha = 'There is a difference between the variables'

In [35]:
# Note: ztest treats the two columns as independent samples, although they are paired readings from the same patients
_, p = weightstats.ztest(data['bp_before'], x2=data['bp_after'], value=0, alternative='two-sided')

#print(float(p))
print(f"{p:.3f} {'>' if p > alpha else '<='} {alpha} {H0 if p > alpha else Ha}")

0.002 <= 0.05 There is a difference between the variables
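Roughly speaking, a two-sample z-test divides the difference between the sample means by its standard error and compares the result with the standard normal distribution. The sketch below does this by hand with a pooled variance estimate. The samples `before` and `after` are synthetic (assumed values, not the blood pressure file), and `statsmodels` may differ in details such as the degrees-of-freedom correction.

```python
import numpy as np
from scipy.stats import norm

# Synthetic blood-pressure-like samples (assumed values, for illustration only)
rng = np.random.default_rng(7)
before = rng.normal(156, 11, size=120)
after = rng.normal(151, 14, size=120)

n1, n2 = len(before), len(after)

# Pooled variance of the two samples
pooled_var = ((n1 - 1) * before.var(ddof=1) + (n2 - 1) * after.var(ddof=1)) / (n1 + n2 - 2)
# Standard error of the difference between the two sample means
std_err = np.sqrt(pooled_var * (1 / n1 + 1 / n2))

z = (before.mean() - after.mean()) / std_err
p_value = 2 * norm.sf(abs(z))   # two-sided p-value

print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```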


## Proportion Z-test

The one-proportion z-test compares an observed proportion with a theoretical one.

The null hypothesis of this test is:

$H_0: p = p_0$ (the proportion of men among the patients who provided blood pressure data equals the hypothesized proportion $p_0$)

An alternative hypothesis can be two-sided, left-sided or right-sided:

$H_1$ (two-sided): $p \neq p_0$ (the proportion of men is not equal to the hypothesized value $p_0$)

$H_1$ (left-sided): $p < p_0$ (the proportion of men is less than the hypothesized value $p_0$)

$H_1$ (right-sided): $p > p_0$ (the proportion of men is greater than the hypothesized value $p_0$)

Let's hypothesize that the share of men is 40% and test our dataset against this value.

$p_0$: hypothetical proportion of males = 0.40

$x$: number of males in the sample = `len(data[data.sex == 'Male'])`

$n$: sample size = `len(data)`

Let's show how to use the `proportions_ztest` function to perform a **z-test**:

In [44]:
from statsmodels.stats.proportion import proportions_ztest

In [45]:
p_0 = 0.4

n = len(data)
x = len(data[data.sex == 'Male'])

print(f"Data length: {n}\nNumber of men: {x}")

Data length: 120
Number of men: 60


In [47]:
#perform one proportion z-test
_, p = proportions_ztest(count=x, nobs=n, value=p_0)

if p < 0.05:
    print("We reject the null hypothesis")
else:
    print("We cannot reject the null hypothesis.")

We reject the null hypothesis
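To make the decision above less of a black box, we can reproduce the statistic by hand from the numbers printed earlier ($x = 60$, $n = 120$, $p_0 = 0.4$). This is a sketch that assumes the default behaviour of `proportions_ztest`, which estimates the variance from the sample proportion. The names `p_hat` and `std_err` are ours.

```python
import math

p_0 = 0.4   # hypothesized proportion of men
n = 120     # sample size (from the output above)
x = 60      # number of men (from the output above)

p_hat = x / n                                  # observed proportion
std_err = math.sqrt(p_hat * (1 - p_hat) / n)   # standard error based on the sample proportion
z = (p_hat - p_0) / std_err

# Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

The p-value falls below 0.05, which agrees with the rejection of the null hypothesis reported above.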
