<a href="https://colab.research.google.com/github/MuhammadHelmyOmar/NLP_From_Scratch/blob/main/chapter_4/statistical_significance_testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Statistical Significance Testing

### Intro
- A statistical significance test is used here to compare the performance of two systems (NLP classifiers in this notebook).
- It checks whether one model's advantage over the other is likely to hold on other test sets, rather than being an accident of one particular test set.
$$\delta(x) = M(A,x) - M(B,x)$$
- The effect size $\delta(x)$ is the difference between the scores of model A and model B on test set $x$ under some metric $M$.
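
The effect size can be computed directly from two sets of predictions. The following is a minimal sketch using accuracy as the metric $M$; the helper names `accuracy` and `effect_size` and the toy labels are hypothetical and are not taken from the notebook's dataset.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def effect_size(preds_a, preds_b, gold, metric=accuracy):
    """delta(x) = M(A, x) - M(B, x) for a metric M on test set x."""
    return metric(preds_a, gold) - metric(preds_b, gold)


# Hypothetical toy labels and predictions (illustration only).
gold    = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
preds_a = [0, 1, 2, 3, 0, 1, 2, 0, 0, 1]  # 9 of 10 correct
preds_b = [0, 1, 2, 1, 0, 1, 0, 3, 0, 2]  # 7 of 10 correct

delta = effect_size(preds_a, preds_b, gold)
print(round(delta, 2))
```

A positive value means A scored higher than B on this test set; on its own it says nothing about whether the difference is significant.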

### Steps of the statistical significance test:
1. Set up a null hypothesis $H_0$ and an alternative hypothesis $H_1$.
  - The null hypothesis usually states that there is **no difference** between the two systems.
  - The alternative hypothesis is the claim we want to support with evidence.
2. Choose a significance level $\alpha$.
3. Take a new sample from the original population and calculate the required statistics.
  - The statistics could be the mean, the standard deviation, etc.
4. Calculate the p-value.
  - Assuming the null hypothesis is true, the p-value is the probability of obtaining a statistic at least as extreme as the one computed in step 3 (less than or equal to it, or greater than or equal to it, depending on the direction of $H_1$).
5. If the p-value is less than $\alpha$, we have enough evidence to reject $H_0$ in favor of $H_1$. Otherwise, we fail to reject the null hypothesis.
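
The five steps can be traced on a small, fully hypothetical case: a binary classifier gets 60 of 100 test items right, and we ask whether that is better than guessing. The sketch below uses an exact binomial tail rather than sampling; the numbers and the helper `binomial_tail` are assumptions made for illustration only.

```python
from math import comb


def binomial_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Step 1: H0 says the classifier guesses at chance (accuracy 0.5);
#         H1 says its accuracy is higher than 0.5.
# Step 2: choose the significance level.
alpha = 0.05
# Step 3: hypothetical sample statistic, 60 correct out of 100 items.
n_items, n_correct = 100, 60
# Step 4: probability of a result at least this extreme if H0 is true.
p_value = binomial_tail(n_items, n_correct)
# Step 5: compare the p-value with alpha.
reject_h0 = p_value < alpha

print(f"p-value = {p_value:.4f}, reject H0: {reject_h0}")
```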



 -----------
- The two hypotheses of the statistical hypothesis test:
$$H_0: \delta(x) \le 0$$
$$H_1: \delta(x) > 0$$
- The question is whether we can rule out $H_0$ and support $H_1$, that is, that one system is significantly better than the other (the observation we are interested in).
- p-value: the probability of seeing a difference of $\delta(x)$ or higher, assuming the null hypothesis is true.
$$P(\delta(X) \ge \delta(x) \mid H_0 \text{ is true})$$
- We reject the null hypothesis and adopt the alternative hypothesis if $\delta(x)$ would be unlikely under $H_0$ (its probability is below the chosen threshold); that is, if the p-value is very low.
  - If the null hypothesis were right, a difference like the observed one would not be surprising, so we would expect a fairly high p-value.
  - If the null hypothesis is wrong, a difference this large is rare under it, so we get a low p-value.
- If we reject the null hypothesis, then "A is better than B" is statistically significant.
- To compute the p-value in NLP, we usually use non-parametric tests based on sampling.
  - Re-generate many instances of the experimental setup.
  - $x$ is the original test set and $x'$ is the synthesized test set.
  - Types of non-parametric tests in NLP:
    - Approximate randomization
    - Bootstrap test
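
Of the two tests listed, approximate randomization is not developed further in this notebook, so here is a rough sketch of one common way to set it up for accuracy. It assumes that, under $H_0$, the two systems' outcomes on each test item are exchangeable, so each pair is swapped with probability 0.5. The function name, the add-one smoothing, and the toy data are assumptions for illustration.

```python
import random


def approximate_randomization(correct_a, correct_b, trials=5_000, seed=0):
    """One-sided p-value for 'A is more accurate than B' via random pair swaps."""
    rng = random.Random(seed)
    # Work with integer counts of (A correct) - (B correct) to avoid float issues.
    observed = sum(correct_a) - sum(correct_b)
    at_least_as_large = 0
    for _ in range(trials):
        diff = 0
        for a, b in zip(correct_a, correct_b):
            # Under H0 the system labels are exchangeable: swap with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if diff >= observed:
            at_least_as_large += 1
    # Add-one smoothing keeps the estimate away from an exact 0.
    return (at_least_as_large + 1) / (trials + 1)


# Hypothetical per-item correctness (1 = correct, 0 = wrong) on 200 test items.
correct_a = [1] * 180 + [0] * 20
correct_b = [1] * 150 + [0] * 30 + [1] * 10 + [0] * 10

p_value = approximate_randomization(correct_a, correct_b)
print(p_value)
```

Each shuffled copy plays the role of a synthesized test set $x'$ generated as if the null hypothesis were true.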




### The Paired Bootstrap Test
- It is called "paired" because we compare the performance of one system on an observation with the performance of the other system on the same observation.

---

- It is not enough to test on only one particular test set $x$. We want to know whether the result generalizes to other test sets.
- The null hypothesis, $H_0: \delta(x) \le 0$, assumes that A is not better than B, or that there is no significant difference between A and B.
- Given the assumption $H_0: \delta(x) \le 0$, how likely are we to encounter a value as large as $\delta(x)$ on other test sets, described by the random variable $X$? This probability is called the p-value. $$P(\delta(X) \ge \delta(x) \mid H_0 \text{ is true})$$
- Common thresholds for the p-value are 0.05 and 0.01. We reject the null hypothesis $H_0$ if the p-value is below the threshold.
  - High p-value: we fail to reject the hypothesis (no confidence that the two models differ).
  - Low p-value: reject the hypothesis (the observed difference would be rare under $H_0$).
- A result is statistically significant if the p-value of the observed $\delta$ is below the threshold, in which case we reject the null hypothesis.

> **Computing p-value**
- In NLP, we use non-parametric tests based on sampling.
- Measure $\delta(x')$ for every synthesized test set $x'$.
- If 99% or more of the values of $\delta(x')$ are smaller than $\delta(x)$ (that is, the p-value is below a threshold of 0.01), we reject the null hypothesis $H_0$ and conclude that A is better than B.
- Common non-parametric tests in NLP are approximate randomization and the bootstrap test (most common).
- In paired tests we compare two aligned sets of observations. An observation of one set is paired with another from the other set.

> **The Paired Bootstrap Test**
- It can be applied to metrics such as precision, recall, F1, or BLEU.
- Name intuition: "bootstrapping" means repeatedly sampling with replacement from the original observed test set, assuming that this sample is representative of the population.
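
A minimal sketch of the paired bootstrap for accuracy follows. It relies on one assumption that is not stated above: because the bootstrap samples are drawn from $x$ itself, their differences are centered on $\delta(x)$ rather than on 0, so a common formulation counts the samples with $\delta(x') \ge 2\delta(x)$. The function name `paired_bootstrap`, the parameter `b`, and the toy data are hypothetical.

```python
import random


def paired_bootstrap(correct_a, correct_b, b=2_000, seed=0):
    """Return (delta(x), estimated one-sided p-value) for 'A beats B' in accuracy."""
    rng = random.Random(seed)
    n = len(correct_a)
    # Per-item difference; keeping each pair together is what makes the test "paired".
    diffs = [int(a) - int(c) for a, c in zip(correct_a, correct_b)]
    observed = sum(diffs)  # equals n * delta(x)
    exceed = 0
    for _ in range(b):
        # Draw a bootstrap test set: n items sampled with replacement.
        sample = rng.choices(diffs, k=n)
        # Assumed criterion: the samples are centered on delta(x), so compare with 2 * delta(x).
        if sum(sample) >= 2 * observed:
            exceed += 1
    return observed / n, exceed / b


# Hypothetical per-item correctness (1 = correct, 0 = wrong) on 200 test items.
correct_a = [1] * 180 + [0] * 20
correct_b = [1] * 150 + [0] * 30 + [1] * 10 + [0] * 10

delta, p_value = paired_bootstrap(correct_a, correct_b)
print(f"delta(x) = {delta:.3f}, p-value = {p_value:.4f}")
```

With real models, the two lists would be the per-item outcomes of each system on the same test set, in the same order.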

# Implementation of the paired bootstrap test
- Using sklearn

### Load data

- We will classify social media posts, news articles, or non-governmental organization reports.
- [Dataset](https://www.kaggle.com/datasets/rmisra/news-category-dataset)

In [None]:
# import kagglehub
import pandas as pd
# import os
import numpy as np

##### English Dataset

In [None]:
# dataset_dir = kagglehub.dataset_download("rmisra/news-category-dataset")

# json_file = os.path.join(dataset_dir, "News_Category_Dataset_v3.json")

# data = pd.read_json(json_file, lines=True)

In [None]:
# data.head(5)

In [None]:
# combine title and short description

# data["text"] = data["headline"] + "\n" + data["short_description"]

In [None]:
# category filtering

# print(data["category"].unique())

# categories = ["COMEDY", 'SPORTS', 'WELLNESS']

# data = data[data['category'].isin(categories)]
# data.head(3)

In [None]:
# Convert categories into discrete numerical values

# cat_map = {
#     "COMEDY": 0,
#     "SPORTS": 1,
#     "WELLNESS": 2
# }

# data["label"] = data['category'].apply(lambda x: cat_map[x])
# data.head(3)

##### Arabic Dataset

In [None]:
data = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Speech and Language Processing/chapter_4/Arabic_classifcation.csv")
data.head()

Unnamed: 0,Text,topic
0,استمر المنتخب اليوناني في تفوقه على نظيره المص...,sport
1,استمرار تفوق اليونان,sport
2,البطولة صعبة على جميع الفريق، وبطولات الكؤوس ل...,sport
3,إذن لماذا وافقت على القيام بالمهمة؟,sport
4,أتوجه بجزيل الشكر إلى رئيس القادسية السابق، فو...,sport


In [None]:
data.rename(columns={" topic ":"topic"}, inplace=True)

In [None]:
data['topic'].unique()

array(['sport', 'Politics', 'Technology', 'Economy'], dtype=object)

In [None]:
# Convert categories into discrete numerical values

cat_map = {
    "sport":0,
    "Politics":1,
    "Technology":2,
    "Economy":3
}

data['label'] = data["topic"].apply(lambda x: cat_map[x])

In [None]:
data['label'].unique()

array([0, 1, 2, 3])

In [None]:
data["label"].value_counts()

label,count
1,100
2,100
3,100
0,99


### Convert raw text to numerical values



In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

In [None]:
X, y = data["Text"].to_numpy(), data["label"].to_numpy()

vectorizer = TfidfVectorizer()

X = vectorizer.fit_transform(X)

In [None]:
X.shape, type(y)

((399, 5831), numpy.ndarray)

### Split the data

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


### Train a naive bayes classifier

In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
model_a = MultinomialNB()

model_a.fit(X_train, y_train)

### Train a logistic regression classifier

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
model_b = LogisticRegression(random_state=42)
model_b.fit(X_train, y_train)

### The Paired Bootstrap Test

In [None]:
# create a histogram of bootstrapping

In [None]:
model_a.predict(X_test) == y_test

array([False,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True, False,  True,  True,
        True,  True, False,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False,  True,  True, False, False,  True,
        True,  True, False, False,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False,  True,  True,  True, False,  True,  True, False,
        True,  True,  True,  True,  True,  True,  True,  True, False,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True, False, False,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True])

In [None]:
model_b.predict(X_test) == y_test

array([False,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True, False,  True,  True,  True, False,  True,  True,
        True,  True, False,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True, False,  True,
        True,  True, False, False,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True, False,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True, False,  True,  True,
        True,  True,  True])
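
The two boolean arrays above are the paired per-item outcomes that a bootstrap would resample. The sketch below shows one way the bootstrap histogram could be built; it uses hypothetical stand-in arrays and only the standard library, so it runs without the notebook's data. In the notebook, `model_a.predict(X_test) == y_test` and `model_b.predict(X_test) == y_test` could be passed instead. The helpers `bootstrap_deltas` and `text_histogram` are illustrative names, not part of any library.

```python
import random
from collections import Counter


def bootstrap_deltas(correct_a, correct_b, b=1_000, seed=0):
    """Accuracy difference (A minus B) on each of b bootstrap test sets."""
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = [int(a) - int(c) for a, c in zip(correct_a, correct_b)]
    return [sum(rng.choices(diffs, k=n)) / n for _ in range(b)]


def text_histogram(values, bin_width=0.02, bar_scale=10):
    """Render a rough text histogram, one line per bin (nearest bin center)."""
    bins = Counter(round(v / bin_width) for v in values)
    lines = []
    for idx in sorted(bins):
        bar = "#" * (bins[idx] // bar_scale)
        lines.append(f"{idx * bin_width:+.2f} | {bar} {bins[idx]}")
    return "\n".join(lines)


# Hypothetical stand-ins for the two correctness arrays (120 test items):
# 100 both correct, 4 only A correct, 8 only B correct, 8 both wrong.
correct_a = [True] * 100 + [True] * 4 + [False] * 8 + [False] * 8
correct_b = [True] * 100 + [False] * 4 + [True] * 8 + [False] * 8

deltas = bootstrap_deltas(correct_a, correct_b)
print(text_histogram(deltas))
```

The histogram is centered near the observed difference on the original test set, which is why a bootstrap p-value is usually computed relative to that observed value rather than relative to 0.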

# Resources
- [College Statistics | Khan Academy](https://www.khanacademy.org/math/ap-statistics/xfb5d8e68:inference-categorical-proportions/idea-significance-tests/v/idea-behind-hypothesis-testing)
- [Bootstrapping Statistics](https://www.youtube.com/watch?v=O_Fj4q8lgmc&list=PLqzoL9-eJTNDp_bWyWBdw2ioA43B3dBrl&index=1)
- [Bootstrap Hypothesis Testing in Statistics with Example |MarinStatsLectures](https://www.youtube.com/watch?v=9STZ7MxkNVg&list=WL&index=18)
- [Hypothesis Testing and The Null Hypothesis, Clearly Explained!!!](https://www.youtube.com/watch?v=0oc49DyA3hU)
- [p-values: What they are and how to interpret them](https://www.youtube.com/watch?v=vemZtEM63GY)
- [Bootstrapping Main Ideas!!!](https://www.youtube.com/watch?v=Xz0x-8-cgaQ)
- [Using Bootstrapping to Calculate p-values](https://youtu.be/N4ZQQqyIf6k?si=FkfqJQIKw79nD79L)
- [How P-Values Help Us Test Hypotheses: Crash Course Statistics #21](https://www.youtube.com/watch?v=bf3egy7TQ2Q&list=WL&index=94&ab_channel=CrashCourse)

If you find any mistakes, please contact me at: muhammadhelmymmo@gmail.com