# Non-parametric tests (advanced)

In [1]:
import numpy as np
import pandas as pd

data1 = np.random.rand(10)
data2 = np.random.rand(10)

df = pd.DataFrame({'a':data1, 'b': data2})
df


,a,b
0,0.848166,0.676082
1,0.902807,0.873736
2,0.70549,0.761784
3,0.209794,0.687549
4,0.111564,0.898474
5,0.904098,0.100625
6,0.140254,0.750389
7,0.440624,0.1286
8,0.812959,0.369444
9,0.246469,0.383149


# 1. Mann-Whitney U test:
Use this test when you have two independent samples, and you want to determine if there's a significant difference between their distributions. The test is appropriate when the data is not normally distributed or when you have ordinal data (rankings or ordered categories). The Mann-Whitney U test is a non-parametric alternative to the independent t-test.

Objective: To compare two independent samples and determine if there is a significant difference between their distributions.

(H0): The two samples come from the same population, and there is no difference between their distributions.

(H1): The two samples come from different populations, and there is a difference between their distributions.

Interpretation: If the p-value is less than your chosen significance level (e.g., 0.05), you can reject the null hypothesis and conclude that there is a significant difference between the distributions of the two samples.

In [2]:
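from scipy.stats import mannwhitneyu

# Mann-Whitney U test on the two independent samples
stat, p = mannwhitneyu(df['a'], df['b'])
print('stat=%.3f, p=%.3f' % (stat, p))
if p < 0.05:
    print('Reject H0: the distributions of a and b differ.')
else:
    print('Fail to reject H0: no significant difference between a and b.')

stat=53.000, p=0.850
Fail to reject H0: no significant difference between a and b.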
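
To make the U statistic less of a black box, here is a minimal sketch that computes it by direct pairwise counting. The helper name `u_statistic` and the toy samples are hypothetical, chosen only for illustration; they are not part of SciPy or of the data above.

```python
import numpy as np

def u_statistic(x, y):
    """U for sample x: count of (x_i, y_j) pairs with x_i > y_j; ties count 0.5."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    greater = (x[:, None] > y[None, :]).sum()   # pairs where x wins
    ties = (x[:, None] == y[None, :]).sum()     # tied pairs
    return float(greater + 0.5 * ties)

# Hypothetical toy samples (not the notebook's random data)
x = [1.2, 3.4, 5.6, 7.8]
y = [0.5, 2.1, 4.3]

u_x = u_statistic(x, y)
u_y = u_statistic(y, x)
print(u_x, u_y)               # the two U values
print(u_x + u_y, len(x) * len(y))  # they always sum to n_x * n_y
```

A value of U near `n_x * n_y / 2` means neither sample tends to exceed the other, which is why the statistic of 53 out of a possible 100 above goes with a large p-value. In recent SciPy versions, `mannwhitneyu` should report the U value for its first argument; older releases may differ, so treat that as an assumption.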


# 2. Wilcoxon signed-rank test:

Use this test when you have two related (paired) samples, and you want to determine if there's a significant difference between their distributions. The test is suitable for non-normally distributed data or ordinal data. The Wilcoxon signed-rank test is a non-parametric alternative to the paired t-test.

Objective: To compare two related (paired) samples and determine if there is a significant difference between their distributions.

(H0): The median difference between the paired samples is zero.

(H1): The median difference between the paired samples is not zero.

Interpretation: If the p-value is less than your chosen significance level (e.g., 0.05), you can reject the null hypothesis and conclude that the median difference between the paired samples is significantly different from zero.

In [3]:
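from scipy.stats import wilcoxon

# Wilcoxon signed-rank test on the paired samples
stat, p = wilcoxon(df['a'], df['b'])
print('stat=%.3f, p=%.3f' % (stat, p))
if p < 0.05:
    print('Reject H0: the median difference between a and b is not zero.')
else:
    print('Fail to reject H0: the median difference between a and b is not significantly different from zero.')

stat=26.000, p=0.922
Fail to reject H0: the median difference between a and b is not significantly different from zero.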
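
The signed-rank statistic can also be built by hand, which shows what the test actually measures. The sketch below is illustrative only: `signed_rank_sums` is a hypothetical helper, the paired values are made up, and it assumes there are no ties among the absolute differences.

```python
import numpy as np

def signed_rank_sums(x, y):
    """Return (W+, W-): rank sums of positive and negative paired differences.

    Assumes no ties among the absolute differences; zero differences are dropped.
    """
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    d = d[d != 0]                                # discard zero differences
    order = np.argsort(np.abs(d))                # smallest |d| gets rank 1
    ranks = np.empty(len(d))
    ranks[order] = np.arange(1, len(d) + 1)
    w_plus = float(ranks[d > 0].sum())
    w_minus = float(ranks[d < 0].sum())
    return w_plus, w_minus

# Hypothetical paired measurements (not the notebook's random data)
before = [10, 12, 9, 14, 11]
after = [12, 11, 13, 17, 16]

w_plus, w_minus = signed_rank_sums(before, after)
statistic = min(w_plus, w_minus)   # two-sided statistic: the smaller rank sum
print(w_plus, w_minus, statistic)
```

The two sums always add up to `n * (n + 1) / 2`, so a very small minimum means the differences point almost entirely in one direction. As far as I know, SciPy's `wilcoxon` reports this smaller sum for a two-sided test.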


# 3. Kruskal-Wallis test:

Use this test when you have three or more independent samples, and you want to determine if there's a significant difference between their distributions. The test is appropriate for non-normally distributed data or ordinal data. The Kruskal-Wallis test is a non-parametric alternative to the one-way ANOVA.

Objective: To compare three or more independent samples and determine if there is a significant difference between their distributions.

(H0): All samples come from the same population, and there is no difference between their distributions.

(H1): At least one sample comes from a different population, and there is a difference between their distributions.

Interpretation: If the p-value is less than your chosen significance level (e.g., 0.05), you can reject the null hypothesis and conclude that there is a significant difference between the distributions of at least one of the samples.

In [4]:
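from scipy.stats import kruskal

data3 = np.random.rand(10)

df['c'] = data3

# Kruskal-Wallis test on the three independent samples
stat, p = kruskal(df['a'], df['b'], df['c'])
print('stat=%.3f, p=%.3f' % (stat, p))
if p < 0.05:
    print('Reject H0: at least one sample differs from the others.')
else:
    print('Fail to reject H0: no significant difference among a, b, and c.')
df

stat=3.097, p=0.213
Fail to reject H0: no significant difference among a, b, and c.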


,a,b,c
0,0.848166,0.676082,0.124798
1,0.902807,0.873736,0.446107
2,0.70549,0.761784,0.171693
3,0.209794,0.687549,0.264731
4,0.111564,0.898474,0.010127
5,0.904098,0.100625,0.827749
6,0.140254,0.750389,0.462716
7,0.440624,0.1286,0.058524
8,0.812959,0.369444,0.49711
9,0.246469,0.383149,0.254358
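
As a rough illustration of how the Kruskal-Wallis statistic arises, the sketch below ranks the pooled data and applies the textbook formula H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1). The helper `kruskal_h` and the three toy groups are hypothetical, and the code assumes no tied values (so no tie correction is applied). With three groups the reference chi-squared distribution has 2 degrees of freedom, for which the p-value is simply exp(-H / 2).

```python
import math
import numpy as np

def kruskal_h(*groups):
    """Kruskal-Wallis H from pooled ranks; assumes no ties in the pooled data."""
    sizes = [len(g) for g in groups]
    pooled = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    n_total = len(pooled)
    order = np.argsort(pooled)
    ranks = np.empty(n_total)
    ranks[order] = np.arange(1, n_total + 1)     # rank 1 = smallest value
    total, start = 0.0, 0
    for size in sizes:
        rank_sum = ranks[start:start + size].sum()
        total += rank_sum ** 2 / size
        start += size
    return 12.0 / (n_total * (n_total + 1)) * total - 3.0 * (n_total + 1)

# Hypothetical, clearly separated groups (not the notebook's random data)
g1 = [2.1, 3.4, 1.9]
g2 = [4.5, 5.2, 3.9]
g3 = [6.8, 7.1, 5.9]

h = kruskal_h(g1, g2, g3)
p_value = math.exp(-h / 2)            # valid only for 3 groups (2 degrees of freedom)
print(f"H={h:.3f}, p={p_value:.4f}")

# Same shortcut applied to the statistic reported above (3.097)
p_reported = math.exp(-3.097 / 2)
print(f"p for H=3.097: {p_reported:.3f}")
```

The chi-squared approximation is asymptotic, so for very small groups like these toy ones the p-value is only indicative.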


# 4. Spearman's rank correlation coefficient:

Use this test when you want to assess the strength and direction of the monotonic association between two continuous or ordinal variables, especially when the relationship is not necessarily linear. Spearman's rank correlation coefficient is a non-parametric alternative to the Pearson correlation coefficient.

Objective: To assess the strength and direction of the association between two variables, when the relationship is not necessarily linear.

(H0): There is no association between the two variables.

(H1): There is an association between the two variables.

Interpretation: The coefficient ranges from -1 to 1, where -1 indicates a perfect negative monotonic association, 0 indicates no monotonic association, and 1 indicates a perfect positive monotonic association. If the p-value is less than your chosen significance level (e.g., 0.05), you can reject the null hypothesis and conclude that there is a significant association between the two variables.

In [5]:
from scipy.stats import spearmanr

# Spearman's rank correlation between the two columns
coef, p = spearmanr(df['a'], df['b'])
print('coef=%.3f, p=%.3f' % (coef, p))
if p < 0.05:
    print('Reject H0: there is a significant association between a and b.')
else:
    print('Fail to reject H0: no significant association between a and b.')

coef=-0.418, p=0.229
Fail to reject H0: no significant association between a and b.
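
Spearman's coefficient is the Pearson correlation computed on ranks, which is why it captures monotonic but non-linear relationships. The sketch below illustrates this with made-up data; `ranks_no_ties` and `spearman_rho` are hypothetical helpers that assume no tied values.

```python
import numpy as np

def ranks_no_ties(values):
    """Ranks 1..n of the values; assumes all values are distinct."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks

def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(ranks_no_ties(x), ranks_no_ties(y))[0, 1])

# Hypothetical data: y grows monotonically but not linearly with x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3
y_decreasing = np.exp(-x)

rho = spearman_rho(x, y)
rho_decreasing = spearman_rho(x, y_decreasing)
pearson_r = float(np.corrcoef(x, y)[0, 1])

print(f"Spearman (x, x^3):     {rho:.3f}")
print(f"Spearman (x, exp(-x)): {rho_decreasing:.3f}")
print(f"Pearson  (x, x^3):     {pearson_r:.3f}")
```

Because the ranks of `x` and `x ** 3` are identical, Spearman's coefficient reaches its maximum, while Pearson's stays below 1 since the relationship is not a straight line.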


# 5. The Chi-squared test:
Use this test to compare the observed frequencies in a contingency table with the frequencies expected under the null hypothesis of independence. If the observed frequencies deviate significantly from the expected frequencies, the test returns a small p-value, which is evidence against the null hypothesis.

Objective: To assess the independence or association between two categorical variables in a contingency table.

(H0): The two categorical variables are independent, and there is no association between them.

(H1): The two categorical variables are dependent, and there is an association between them.

Interpretation: If the p-value is less than your chosen significance level (e.g., 0.05), you can reject the null hypothesis and conclude that there is a significant association between the two categorical variables.

In [6]:
from scipy.stats import chi2_contingency

# Example contingency table
observed_array = np.array([[10, 20, 30],
                           [20, 30, 20]])

# Convert the array to a DataFrame
observed_df = pd.DataFrame(observed_array, columns=['Category1', 'Category2', 'Category3'], index=['Group1', 'Group2'])

# Chi-squared test
chi2, p, dof, expected = chi2_contingency(observed_array)

print("Crosstab table:")
print(observed_df)
print(f"\nChi2: {chi2:.3f}, p-value: {p:.3f}, degrees of freedom: {dof}")
print("\nExpected frequencies:")
print(expected)


if p < 0.05:
    print('\nReject H0: the two categorical variables are associated (not independent).')
else:
    print('\nFail to reject H0: no significant association between the two categorical variables.')

Crosstab table:
        Category1  Category2  Category3
Group1         10         20         30
Group2         20         30         20

Chi2: 6.603, p-value: 0.037, degrees of freedom: 2

Expected frequencies:
[[13.84615385 23.07692308 23.07692308]
 [16.15384615 26.92307692 26.92307692]]

Reject H0: the two categorical variables are associated (not independent).
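
The expected frequencies and the statistic printed above can be reproduced by hand, which makes the "expected under independence" idea concrete. This is a minimal sketch using only NumPy on the same contingency table; the closed-form p-value exp(-chi2 / 2) is a shortcut that holds only because this table has 2 degrees of freedom.

```python
import math
import numpy as np

# Same contingency table as in the example above
observed = np.array([[10, 20, 30],
                     [20, 30, 20]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()

# Under independence, each expected count is row total * column total / grand total
expected = np.outer(row_totals, col_totals) / grand_total

# Pearson's chi-squared statistic: sum of (observed - expected)^2 / expected
chi2 = float(((observed - expected) ** 2 / expected).sum())
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Closed-form tail probability, valid only when dof == 2
p_value = math.exp(-chi2 / 2)

print(expected)
print(f"chi2={chi2:.3f}, dof={dof}, p={p_value:.3f}")
```

For tables with other degrees of freedom, the tail probability has no such simple form, and a library routine such as `chi2_contingency` is the practical choice.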
