# Numerical Experiments

Authors: Zeyad Ahmed and Paul Sheridan

Last update: 2024-06-10

Description: This notebook contains code for reproducing the numerical experimental results from Table 3 in the manuscript "A Fisher's exact test interpretation of the TF–IDF term-weighting scheme".

## Preliminaries

In [1]:
# Imports
import math
import scipy.stats as stats
import pandas as pd

In [5]:
# Custom functions
def tf_icf(nij, n, ni):
    return nij * math.log(n / ni)


def tf_idf(nij, d, bi):
    return nij * math.log(d / bi)


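def H(nij, ni, nj, n):
    # Upper-tail hypergeometric probability P(X >= nij): population size n,
    # ni success states, nj draws. sf(k) is P(X > k), hence the nij - 1.
    return stats.hypergeom.sf(nij - 1, n, ni, nj)


def h(nij, ni, nj, n):
    # Hypergeometric point probability P(X = nij)
    return math.comb(ni, nij) * math.comb(n - ni, nj - nij) / math.comb(n, nj)


def LogH(nij, ni, nj, n):
    # Negative natural log of the upper-tail probability
    return -math.log(H(nij, ni, nj, n))


def b(nij, nj, pi):
    # Binomial point probability: nij successes in nj trials, success probability pi
    return (math.comb(nj, nij) * pi**nij) * (1 - pi) ** (nj - nij)


def Qij(nij, ni, nj, n):
    # Hypergeometric tail P(X >= nij + 1) divided by the binomial point probability at nij
    pi = ni / n
    return H(nij + 1, ni, nj, n) / b(nij, nj, pi)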


def tficf_phi(nij, ni, nj, n):
    tficf = tf_icf(nij, n, ni)
    pij = nij / nj
    pi = ni / n
    return tficf + nij * math.log(pij) + (nj - nij) * (pi - pij) - Qij(nij, ni, nj, n)


def tfidf_psi(nij, ni, nj, n, bi, d):
    tfidf = tf_idf(nij, d, bi)
    pij = nij / nj
    return tfidf - nij * (1 - bi / d) * (1 - pij) - Qij(nij, ni, nj, n)


def display_params(nij, ni, nj, n, bi, d):
    pi = ni / n
    pij = nij / nj

    print(
        f"Parameter settings:\n"
        f"n = {n}\tnj = {nj}\n"
        f"ni = {ni}\tnij = {nij}\n"
        f"bi = {bi}\t\td = {d}\n"
        f"pi = {pi}\tpij = {pij}"
    )


def generate_stats(nij, ni, nj, n, bi, d):
    data = {
        "Formula": ["-logH", "TF-ICF+Phi", "TF-IDF+Psi", "TF-IDF"],
        "Value": [
            LogH(nij, ni, nj, n),
            tficf_phi(nij, ni, nj, n),
            tfidf_psi(nij, ni, nj, n, bi, d),
            tf_idf(nij, d, bi)
            ]
    }
    return data
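
As a cross-check on `H`, the upper-tail probability can also be obtained by summing hypergeometric point probabilities term by term. The sketch below uses only the standard library and exact rational arithmetic. `h_exact` and `H_exact` are hypothetical helper names introduced for illustration and are not part of the notebook. The parameters are those of the "Corollary 1, small $n$" setting used later.

```python
import math
from fractions import Fraction


def h_exact(k, ni, nj, n):
    """Hypergeometric point probability P(X = k) as an exact fraction."""
    return Fraction(math.comb(ni, k) * math.comb(n - ni, nj - k), math.comb(n, nj))


def H_exact(nij, ni, nj, n):
    """Upper tail P(X >= nij), summed term by term up to min(ni, nj)."""
    upper = min(ni, nj)
    return sum((h_exact(k, ni, nj, n) for k in range(nij, upper + 1)), Fraction(0))


# nij = 10, ni = 100, nj = 25, n = 1000
neg_log_H = -math.log(H_exact(10, 100, 25, 1000))
print(round(neg_log_H, 6))
```

The printed value should agree with the `-logH` row reported for that setting, up to floating-point rounding in the SciPy computation.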

## Numerical Experiments

The functions are evaluated under several parameter settings to compare how closely the four quantities agree when $ n $ is small and when $ n $ is large.

### Theorem 1 verification

#### Using small $n$

In [10]:
# Input values
n = 1000
ni = 150
bi = 4
nj = 100
nij = 25
d = 20

# Result
display_params(nij, ni, nj, n, bi, d)
df = pd.DataFrame(generate_stats(nij, ni, nj, n, bi, d))
df

Parameter settings:
n = 1000	nj = 100
ni = 150	nij = 25
bi = 4		d = 20
pi = 0.15	pij = 0.25


| | Formula | Value |
|---|---|---|
| 0 | -logH | 5.542875 |
| 1 | TF-ICF+Phi | 4.711109 |
| 2 | TF-IDF+Psi | 24.676417 |
| 3 | TF-IDF | 40.235948 |


$ - \log{H_{ij}} $ and $ \text{TF-ICF}(i,j) + \Phi_{ij} $ are close to each other but differ markedly from the other two quantities, because the conditions of Corollary 1 and Corollary 2 are violated in this setting.

#### Using large $ n $

In [11]:
# Input values
n = 10000
ni = 200
bi = 20
nj = 75
nij = 15
d = 75

# Result
display_params(nij, ni, nj, n, bi, d)
df = pd.DataFrame(generate_stats(nij, ni, nj, n, bi, d))
df

Parameter settings:
n = 10000	nj = 75
ni = 200	nij = 15
bi = 20		d = 75
pi = 0.02	pij = 0.2


| | Formula | Value |
|---|---|---|
| 0 | -logH | 24.897081 |
| 1 | TF-ICF+Phi | 23.689756 |
| 2 | TF-IDF+Psi | 10.977317 |
| 3 | TF-IDF | 19.826338 |


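As $ n $ gets larger, $ - \log{H_{ij}} $ and $ \text{TF-ICF}(i,j) + \Phi_{ij} $ agree more closely in relative terms: the gap is about 4.8% of $ - \log{H_{ij}} $ here, compared with about 15% in the small-$ n $ case.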
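
To make the comparison concrete, the gap between the two quantities can be expressed both in absolute terms and as a fraction of $ - \log{H_{ij}} $. The following is a minimal sketch using the values reported in the two Theorem 1 output tables; `relative_gap` is a hypothetical helper name, not a function from the notebook.

```python
def relative_gap(exact, approx):
    """Gap between two values as a fraction of the first one."""
    return (exact - approx) / exact


# Values copied from the two Theorem 1 output tables above:
# (-logH, TF-ICF+Phi)
reported = {
    "small n": (5.542875, 4.711109),
    "large n": (24.897081, 23.689756),
}

for label, (neg_log_h, tficf_phi_value) in reported.items():
    absolute = neg_log_h - tficf_phi_value
    relative = relative_gap(neg_log_h, tficf_phi_value)
    print(f"{label}: absolute gap = {absolute:.4f}, relative gap = {relative:.1%}")
```

Note that the absolute gap is slightly larger in the large-$ n $ setting, so it is the relative gap that shrinks.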

### Corollary 1 verification

#### Using small $ n $

In [12]:
# Input values
n = 1000
ni = 100
bi = 10
nj = 25
nij = 10
d = 40

# Result
display_params(nij, ni, nj, n, bi, d)
df = pd.DataFrame(generate_stats(nij, ni, nj, n, bi, d))
df

Parameter settings:
n = 1000	nj = 25
ni = 100	nij = 10
bi = 10		d = 40
pi = 0.1	pij = 0.4


| | Formula | Value |
|---|---|---|
| 0 | -logH | 9.740736 |
| 1 | TF-ICF+Phi | 9.244649 |
| 2 | TF-IDF+Psi | 9.244649 |
| 3 | TF-IDF | 13.862944 |


When the conditions of Corollary 1 are satisfied, $ \text{TF-IDF}(i,j) + \Psi_{ij} $ coincides with $ \text{TF-ICF}(i,j) + \Phi_{ij} $ and lies close to $ - \log{H_{ij}} $ (9.24 versus 9.74 in this setting).
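
The equality of the `TF-ICF+Phi` and `TF-IDF+Psi` rows can be traced back to the parameter values. In this setting `bi / d` equals `pi / pij`, and under that equality the formulas in `tficf_phi` and `tfidf_psi` reduce to the same expression. The sketch below checks this numerically with the shared `Qij` term left out. Treating this ratio equality as the relevant condition is an assumption read off the chosen parameters, not a restatement of Corollary 1.

```python
import math

# Parameters from the "Corollary 1, small n" cell
n, ni, nj, nij, bi, d = 1000, 100, 25, 10, 10, 40
pi, pij = ni / n, nij / nj

# Ratio check (assumed condition): bi / d versus pi / pij
print(bi / d, pi / pij)

# Same formulas as tficf_phi and tfidf_psi, without the common Qij term
tficf_part = nij * math.log(n / ni) + nij * math.log(pij) + (nj - nij) * (pi - pij)
tfidf_part = nij * math.log(d / bi) - nij * (1 - bi / d) * (1 - pij)

print(math.isclose(tficf_part, tfidf_part))
```

Since both functions subtract the same `Qij`, agreement of these two partial sums implies agreement of the full values.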

#### Using large $ n $

In [13]:
# Input values
n = 10000
ni = 200
bi = 8
nj = 100
nij = 25
d = 100

# Result
display_params(nij, ni, nj, n, bi, d)
df = pd.DataFrame(generate_stats(nij, ni, nj, n, bi, d))
df

Parameter settings:
n = 10000	nj = 100
ni = 200	nij = 25
bi = 8		d = 100
pi = 0.02	pij = 0.25


| | Formula | Value |
|---|---|---|
| 0 | -logH | 46.76981 |
| 1 | TF-ICF+Phi | 45.879105 |
| 2 | TF-IDF+Psi | 45.879105 |
| 3 | TF-IDF | 63.143216 |


When the conditions of Corollary 1 are satisfied, $ \text{TF-IDF}(i,j) + \Psi_{ij} $ lies close to $ - \log{H_{ij}} $, and the relative difference shrinks as $ n $ gets larger: about 1.9% here, compared with about 5.1% in the small-$ n $ case.

### Corollary 2 verification

#### Using small $ n $

In [14]:
# Input values
n = 1000
ni = 160
bi = 8
nj = 20
nij = 20
d = 50

# Result
display_params(nij, ni, nj, n, bi, d)
df = pd.DataFrame(generate_stats(nij, ni, nj, n, bi, d))
df

Parameter settings:
n = 1000	nj = 20
ni = 160	nij = 20
bi = 8		d = 50
pi = 0.16	pij = 1.0


| | Formula | Value |
|---|---|---|
| 0 | -logH | 37.699296 |
| 1 | TF-ICF+Phi | 36.651629 |
| 2 | TF-IDF+Psi | 36.651629 |
| 3 | TF-IDF | 36.651629 |


When the conditions of Corollary 2 are satisfied, $ \text{TF-IDF}(i,j) $ on its own lies close to $ - \log{H_{ij}} $ (36.65 versus 37.70 in this setting).
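
The last three rows of the table are identical because `nij == nj` in this setting. The linear term in `tfidf_psi` carries a factor `(1 - pij)`, which is then zero, and the tail probability in the numerator of `Qij` is a sum over an empty range. The sketch below reproduces both facts with the standard library only, writing the tail as a direct sum rather than calling SciPy.

```python
import math

# Parameters from the "Corollary 2, small n" cell
n, ni, nj, nij, bi, d = 1000, 160, 20, 20, 8, 50
pij = nij / nj  # 1.0 because nij == nj

# Linear term of tfidf_psi: vanishes through the factor (1 - pij)
linear_term = nij * (1 - bi / d) * (1 - pij)

# Numerator of Qij, P(X >= nij + 1): no more than nj items can be drawn,
# so the range of summation is empty.
tail_beyond = sum(
    math.comb(ni, k) * math.comb(n - ni, nj - k)
    for k in range(nij + 1, min(ni, nj) + 1)
) / math.comb(n, nj)

tf_idf_value = nij * math.log(d / bi)
print(linear_term, tail_beyond, round(tf_idf_value, 6))
```

With both terms equal to zero, `tfidf_psi` returns the plain TF-IDF value.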

#### Using large $ n $

In [15]:
# Input values
n = 10000
ni = 1200
bi = 15
nj = 80
nij = 80
d = 125

# Result
display_params(nij, ni, nj, n, bi, d)
df = pd.DataFrame(generate_stats(nij, ni, nj, n, bi, d))
df

Parameter settings:
n = 10000	nj = 80
ni = 1200	nij = 80
bi = 15		d = 125
pi = 0.12	pij = 1.0


| | Formula | Value |
|---|---|---|
| 0 | -logH | 171.997735 |
| 1 | TF-ICF+Phi | 169.621083 |
| 2 | TF-IDF+Psi | 169.621083 |
| 3 | TF-IDF | 169.621083 |


When the conditions of Corollary 2 are satisfied, $ \text{TF-IDF}(i,j) $ lies close to $ - \log{H_{ij}} $, and the relative difference shrinks as $ n $ gets larger: about 1.4% here, compared with about 2.8% in the small-$ n $ case.