<a href="https://colab.research.google.com/github/11eeys/data_analysis/blob/main/11_NormalityTest_SWtest_KStest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 💦🔥 **Normality test**

1. Small sample size (N < 50): use the **Shapiro-Wilk** test, comparable to the Shapiro-Wilk test in SPSS

2. Big sample size (N > 50): use the **Kolmogorov-Smirnov** test, comparable to the Kolmogorov-Smirnov test in SPSS



# <font color = 'green'> **1️⃣  Small Sample Size (N < 50)**
    Be aware that the following script generates different results for everyone running it!
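
The results differ on every run because the random numbers are drawn without a fixed seed. A minimal sketch of how a run could be made repeatable, assuming NumPy's `default_rng` generator; the seed value 42 and the variable names are arbitrary choices for illustration, not part of the original script:

```python
import numpy as np
from scipy.stats import shapiro

# Assumption: 42 is an arbitrary seed; any fixed integer works.
rng = np.random.default_rng(seed=42)
sample = rng.normal(loc=0, scale=1, size=30)

# Shapiro-Wilk test on the seeded sample
statistic, p_value = shapiro(sample)
print("Shapiro-Wilk Test Statistic:", statistic)
print("p-value:", p_value)

# A generator created with the same seed reproduces the same sample.
rng_again = np.random.default_rng(seed=42)
sample_again = rng_again.normal(loc=0, scale=1, size=30)
print("Same sample on re-run:", np.array_equal(sample, sample_again))
```

With a fixed seed, everyone running the cell should see the same numbers, which is handy when comparing results in class.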

## CTT (Critical Test Threshold)

- The p-value is the probability of observing a test statistic at least as extreme as the one obtained, assuming the null hypothesis is true. It says nothing about the size of the effect or the practical significance of the result.

- On the other hand, the critical test threshold (CTT) is the predetermined significance level (often denoted as alpha) at which you're willing to reject the null hypothesis. It is typically set to 0.05.
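
The comparison between the p-value and the CTT can be sketched as a small helper. The function name `interpret_normality` is hypothetical and only wraps the same `p_value > alpha` rule used in the scripts of this notebook:

```python
def interpret_normality(p_value, alpha=0.05):
    """Compare a p-value with the CTT (alpha) and return a short verdict."""
    if p_value > alpha:
        return "fail to reject H0 (sample looks Gaussian)"
    return "reject H0 (sample does not look Gaussian)"


# Illustrative p-values, not results from real data
print(interpret_normality(0.30))
print(interpret_normality(0.01))
print(interpret_normality(0.03, alpha=0.01))
```

Note that the same p-value (0.03 here) can lead to a different decision when a stricter CTT is chosen.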

In [2]:
import numpy as np # numeric calculation
import pandas as pd # data analysis
import seaborn as sns
from scipy.stats import shapiro # stats

# Generate a random dataset (replace this with your own data)
# 🔔 The following code line generates random numbers from a normal (Gaussian) distribution using NumPy's random.normal function.
# 🔔 loc=0, scale=1, size=50: creates an array of 50 numbers drawn from a normal distribution with a mean (loc) of 0 and a standard deviation (scale) of 1.
data1 = np.random.normal(loc=0, scale=1, size=50)
print(data1)
print('\n')

# Convert the data to a pandas DataFrame
df = pd.DataFrame(data1, columns=['Values'])
print(df)
print('\n')

# Save the DataFrame to a CSV file
df.to_csv('generated_data1.csv', index=True) # index=True to include the index column

# Perform Shapiro-Wilk test for normality
# 🔔 Tuple unpacking: shapiro() function returns a tuple containing two values: the test statistic and the p-value.
# By separating statistic and p_value with a comma, Python interprets the returned tuple and assigns each value to its corresponding variable.
statistic, p_value = shapiro(data1)

# Print the test statistic and p-value
print("Shapiro-Wilk Test Statistic:", statistic)
print("p-value:", p_value)

# Define the CTT (critical test threshold, a.k.a. alpha)
alpha = 0.05

# Interpret the result
if p_value > alpha:
    print("Sample looks Gaussian (fail to reject H0)")
else:
    print("Sample does not look Gaussian (reject H0)")



[ 0.13851682 -0.24666814  0.58791532  1.74441864 -0.07168202  0.31477223
  0.99115799 -0.74500005 -1.4451541   0.03077577  1.58468012 -0.45968916
  1.28929157  0.4049937  -0.46124217 -1.01394486 -0.50548163  0.35812892
  0.24101221  1.08364181 -0.94653664  0.22388825 -0.88254279 -0.60172592
  1.11592743  0.30063153  0.38804849  2.70159459  1.70557959 -0.0646981
  0.17773989  0.70815758 -1.62998429 -0.64886783 -0.30384395  0.00589283
 -0.38871424 -0.864991   -0.94263019  0.57786982 -1.11815901 -1.51330653
  0.93265581  0.50532732  0.00526326 -0.15085962  0.97522819  1.45389727
 -0.63892462  0.77402993]


      Values
0   0.138517
1  -0.246668
2   0.587915
3   1.744419
4  -0.071682
5   0.314772
6   0.991158
7  -0.745000
8  -1.445154
9   0.030776
10  1.584680
11 -0.459689
12  1.289292
13  0.404994
14 -0.461242
15 -1.013945
16 -0.505482
17  0.358129
18  0.241012
19  1.083642
20 -0.946537
21  0.223888
22 -0.882543
23 -0.601726
24  1.115927
25  0.300632
26  0.388048
27  2.701595
28  1.705580

# <font color = 'green'> **2️⃣ Big Sample Size (N > 50)**
    Be aware that the following script generates different results for everyone running it!

In [1]:
import numpy as np # numeric calculation
import pandas as pd # data analysis
import seaborn as sns
from scipy.stats import kstest, norm # stats


# Generate a random dataset (replace this with your own data)
# loc=0, scale=1, size=100: creates an array of 100 numbers drawn from a normal distribution with a mean (loc) of 0 and a standard deviation (scale) of 1.
data2 = np.random.normal(loc=0, scale=1, size=100)
print(data2)
print('\n')

# Convert the data to a pandas DataFrame
df = pd.DataFrame(data2, columns=['Values'])
print(df)
print('\n')

# Save the DataFrame to a CSV file
df.to_csv('generated_data2.csv', index=True)

# Perform Kolmogorov-Smirnov test for normality ('norm' compares against the standard normal, mean 0 and SD 1)
statistic, p_value = kstest(data2, 'norm')

# Print the test statistic and p-value
print("Kolmogorov-Smirnov Test Statistic:", statistic)
print("p-value:", p_value)

# Define the CTT (critical test threshold, a.k.a. alpha)
alpha = 0.05

# Interpret the result
if p_value > alpha:
    print("Sample looks Gaussian (fail to reject H0)")
else:
    print("Sample does not look Gaussian (reject H0)")

[-0.05180021 -0.17033552 -1.08745259 -1.0695269   1.31826621  1.33390365
 -1.3502719   1.46420154 -0.83039565  0.99843118  0.29265313 -1.27753863
  0.55278496 -1.15639912  0.81176796 -1.11027006  0.49318096  0.78587046
  1.1743581   2.65434614  0.81946106 -1.14578076 -1.06891344 -0.53341636
 -1.11087379  0.74547015  0.55199202 -0.22748093 -1.18069207  1.62437556
 -0.43894072 -1.02185948 -1.42771677 -1.04798866  1.10731008 -0.50665729
 -0.77921503  0.97252945  0.31500836  0.46193639  0.46750104  0.58087268
  0.12155903 -0.39403344 -0.9582721  -0.14544193  1.14940524 -0.08996439
 -1.20003645  1.52608333 -0.63094616 -0.520058   -0.84160219 -0.09575411
  0.53268374  0.39681726  0.31301332  0.13123437 -1.98069048 -0.35093974
  1.32839207  0.39753966 -0.15713512 -0.07387575 -2.3979359   1.25456809
 -0.20560868  1.4167996  -1.10939267  0.0066255  -0.07881875  0.68052731
  0.06736497  1.04970494  0.95947593  0.24401673 -1.02821396  0.22737295
  0.59064246  0.82019161  0.73860718 -0.56633691  0
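
One thing to keep in mind: `kstest(data, 'norm')` compares the sample with the *standard* normal distribution (mean 0, SD 1). That suits the generated data above, but real data with another mean or spread would be rejected even when it is bell-shaped. A minimal sketch, assuming hypothetical data with mean 5 and SD 2, that passes the sample's own mean and SD through the `args` parameter:

```python
import numpy as np
from scipy.stats import kstest

# Hypothetical data: normal, but not standard normal (mean 5, SD 2)
rng = np.random.default_rng(seed=0)
shifted = rng.normal(loc=5, scale=2, size=100)

# Compared with the standard normal N(0, 1): rejected because of the shift
stat_std, p_std = kstest(shifted, 'norm')

# Compared with a normal that uses the sample's own mean and SD
mu, sigma = shifted.mean(), shifted.std(ddof=1)
stat_fit, p_fit = kstest(shifted, 'norm', args=(mu, sigma))

print("Against N(0, 1):      statistic =", stat_std, " p-value =", p_std)
print("Against N(mu, sigma): statistic =", stat_fit, " p-value =", p_fit)
```

Estimating the mean and SD from the same sample tends to make the plain KS p-value too lenient; SPSS reports a Lilliefors-corrected significance for this reason, so the numbers may not match SPSS exactly.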

## 💿 💿 2️⃣-1️⃣ When you use your CSV file...
    Be aware that the following script generates the same result for everyone running it!

In [3]:
import numpy as np
import pandas as pd
from scipy.stats import kstest, norm
import urllib.request

# URL of the CSV file
url = 'https://raw.githubusercontent.com/ms624atyale/Data_NLP2024/main/GeneratedSampleData.csv'
response = urllib.request.urlopen(url)
content = response.read().decode('utf-8')
print(content)

# Read the CSV data from the URL into a pandas DataFrame
df = pd.read_csv(url)

# Specify the column name
df_final = pd.DataFrame(df, columns=['Values'])
print(df_final)
print('\n')

# Save the DataFrame to a CSV file with index numbers
df_final.to_csv('GeneratedSampleData.csv', index=True)

# Extract the column containing the data you want to test
data = df_final['Values']

# Perform Kolmogorov-Smirnov test for normality ('norm' compares against the standard normal, mean 0 and SD 1)
statistic, p_value = kstest(data, 'norm')

# Print the test statistic and p-value
print("Kolmogorov-Smirnov Test Statistic:", statistic)
print("p-value:", p_value)

# Define the critical test threshold (alpha)
alpha = 0.05

# Interpret the result
if p_value > alpha:
    print("Sample looks Gaussian (fail to reject H0)")
else:
    print("Sample does not look Gaussian (reject H0)")

HTTPError: HTTP Error 404: Not Found

## 💿 💿 2️⃣-2️⃣ When you use your CSV file...
    Be aware that the following script generates the same result for everyone running it!

In [4]:
import numpy as np
import pandas as pd
from scipy.stats import kstest, norm


file = pd.read_csv('/content/GeneratedSampleData.csv')
final_data = file['Values'] # [IMPORTANT] Column name should be specified, otherwise, ValueError occurs.
print(final_data)
print('\n')

# Convert the data to a pandas DataFrame
df = pd.DataFrame(final_data, columns=['Values'])
print(df)
print('\n')

# Perform Kolmogorov-Smirnov test for normality ('norm' compares against the standard normal, mean 0 and SD 1)
statistic, p_value = kstest(final_data, 'norm')

# Print the test statistic and p-value
print("Kolmogorov-Smirnov Test Statistic:", statistic)
print("p-value:", p_value)

# Interpret the result
alpha = 0.05
if p_value > alpha:
    print("Sample looks Gaussian (fail to reject H0)")
else:
    print("Sample does not look Gaussian (reject H0)")

FileNotFoundError: [Errno 2] No such file or directory: '/content/GeneratedSampleData.csv'
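
The `FileNotFoundError` above appears when the CSV file has not been uploaded to the Colab session. A minimal sketch of checking the path before reading it; the temporary folder stands in for `/content`, and the placeholder data written here is generated for illustration only and is not the original `GeneratedSampleData.csv`:

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from scipy.stats import kstest

# Assumption: a temporary folder stands in for Colab's /content directory.
workdir = Path(tempfile.mkdtemp())
csv_path = workdir / "GeneratedSampleData.csv"

# If the file is missing, write placeholder data so the rest can run.
if not csv_path.exists():
    rng = np.random.default_rng(seed=1)
    placeholder = pd.DataFrame({"Values": rng.normal(loc=0, scale=1, size=100)})
    placeholder.to_csv(csv_path, index=False)

# Read the file back and test the 'Values' column
file = pd.read_csv(csv_path)
final_data = file["Values"]
statistic, p_value = kstest(final_data, "norm")

print("Kolmogorov-Smirnov Test Statistic:", statistic)
print("p-value:", p_value)
```

In Colab, replacing `csv_path` with the path of an uploaded file would follow the same pattern.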

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns # needed for sns.histplot below
from scipy.stats import kstest, norm

# Load the data from the CSV file
file = pd.read_csv('/content/GeneratedSampleData.csv')
final_data = file['Values']  # Extract the column containing the data

# Perform Kolmogorov-Smirnov test for normality
statistic, p_value = kstest(final_data, 'norm')

# Print the test statistic and p-value
print("Kolmogorov-Smirnov Test Statistic:", statistic)
print("p-value:", p_value)

# Define the CTT (critical test threshold, a.k.a. alpha)
alpha = 0.05

# Interpret the result
if p_value > alpha:
    print("Sample looks Gaussian (fail to reject H0)")
else:
    print("Sample does not look Gaussian (reject H0)")

# Plot histogram of the data
plt.figure(figsize=(10, 6))
sns.histplot(final_data, kde=True, color='skyblue', stat='density')

# Plot normal distribution curve
mu, sigma = np.mean(final_data), np.std(final_data)
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
plt.plot(x, norm.pdf(x, mu, sigma), 'r-', label='Normal Distribution')

plt.xlabel('Data')
plt.ylabel('Density')
plt.title('Histogram of Data with Normal Distribution')
plt.legend()
plt.show()

FileNotFoundError: [Errno 2] No such file or directory: '/content/GeneratedSampleData.csv'

## ➡️ T-test (when the SW or KS test gives p > 0.05)
## ➡️➡️ Wilcoxon Rank Sum test (when the SW or KS test gives p < 0.05)
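
The choice between the two tests can be sketched as follows. The two groups are hypothetical generated samples, and the rule of applying the Shapiro-Wilk test to each group is one possible reading of the headings above, not a fixed recipe:

```python
import numpy as np
from scipy.stats import shapiro, ttest_ind, ranksums

# Hypothetical groups (replace these with your own data)
rng = np.random.default_rng(seed=7)
group_a = rng.normal(loc=0.0, scale=1.0, size=30)
group_b = rng.normal(loc=0.5, scale=1.0, size=30)

alpha = 0.05

# Normality check for each group (N < 50, so Shapiro-Wilk)
_, p_a = shapiro(group_a)
_, p_b = shapiro(group_b)

if p_a > alpha and p_b > alpha:
    # Both groups look Gaussian: parametric comparison
    test_name = "independent-samples t-test"
    statistic, p_value = ttest_ind(group_a, group_b)
else:
    # At least one group does not look Gaussian: rank-based comparison
    test_name = "Wilcoxon rank-sum test"
    statistic, p_value = ranksums(group_a, group_b)

print("Chosen test:", test_name)
print("Statistic:", statistic)
print("p-value:", p_value)
```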