# Numerical Experiments

Authors: Zeyad Ahmed and Paul Sheridan

Last update: 2024-12-19

Description: This notebook contains code for reproducing the numerical experimental results from Table 3 and Table 4 in the manuscript "A Fisher's exact test interpretation of the TF–IDF term-weighting scheme".

## Preliminaries

In [1]:
# Imports
import math
import scipy.stats as stats
import pandas as pd

In [2]:
# Custom functions
def tf_icf(nij, n, ni):
    return nij * math.log(n / ni)


def tf_idf(nij, d, bi):
    return nij * math.log(d / bi)


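def H(nij, ni, nj, n):
    # Upper tail P(X >= nij) of a hypergeometric distribution:
    # population size n, ni marked items, nj draws
    return stats.hypergeom.sf(nij - 1, n, ni, nj)


def h(nij, ni, nj, n):
    # Hypergeometric point probability P(X = nij)
    return math.comb(ni, nij) * math.comb(n - ni, nj - nij) / math.comb(n, nj)


def LogH(nij, ni, nj, n):
    # Negative natural log of the upper-tail probability
    return -math.log(H(nij, ni, nj, n))


def b(nij, nj, pi):
    # Binomial point probability P(Y = nij) for nj trials with success probability pi
    return (math.comb(nj, nij) * pi**nij) * (1 - pi) ** (nj - nij)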


def Qij(nij, ni, nj, n):
    pi = ni / n
    return H(nij + 1, ni, nj, n) / b(nij, nj, pi)


def tficf_phi(nij, ni, nj, n):
    tficf = tf_icf(nij, n, ni)
    pij = nij / nj
    pi = ni / n
    return tficf + nij * math.log(pij) + (nj - nij) * (pi - pij) - Qij(nij, ni, nj, n)


def tfidf_psi(nij, ni, nj, n, bi, d):
    tfidf = tf_idf(nij, d, bi)
    pij = nij / nj
    return tfidf - nij * (1 - bi / d) * (1 - pij) - Qij(nij, ni, nj, n)


def display_params(nij, ni, nj, n, bi, d):
    pi = ni / n
    pij = nij / nj

    print(
        f"Parameter settings:\n"
        f"n = {n}\tnj = {nj}\n"
        f"ni = {ni}\tnij = {nij}\n"
        f"bi = {bi}\t\td = {d}\n"
        f"pi = {pi}\tpij = {pij}"
    )


def calc_delta(val1, val2):
    # Absolute percentage difference, measured relative to val2
    return round(abs(100 * (val2 - val1) / val2), 4)


def generate_stats(nij, ni, nj, n, bi, d):
    # Evaluate each formula once; -logH is the reference for the Delta % column
    values = [
        LogH(nij, ni, nj, n),
        tficf_phi(nij, ni, nj, n),
        tfidf_psi(nij, ni, nj, n, bi, d),
        tf_idf(nij, d, bi),
    ]
    return {
        "Formula": ["-logH", "TF-ICF+Phi", "TF-IDF+Psi", "TF-IDF"],
        "Value": [round(v, 4) for v in values],
        "Delta %": [calc_delta(values[0], v) for v in values],
    }
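
The function `H` above relies on SciPy's survival function for the upper tail of the hypergeometric distribution. As an independent sanity check, the same tail can be summed exactly from binomial coefficients using only the standard library. This is a minimal sketch; the name `upper_tail_exact` is hypothetical and not part of the notebook, and the parameter values are copied from the first Theorem 1 cell.

```python
import math


def upper_tail_exact(nij, ni, nj, n):
    """P(X >= nij) for a hypergeometric X: population n, ni marked items, nj draws."""
    k_max = min(ni, nj)  # X cannot exceed the marked count or the number of draws
    denom = math.comb(n, nj)
    tail = sum(
        math.comb(ni, k) * math.comb(n - ni, nj - k)
        for k in range(nij, k_max + 1)
    )
    # Integer / integer true division stays accurate even for very large operands
    return tail / denom


# Parameter setting copied from the small-n Theorem 1 cell
p = upper_tail_exact(nij=25, ni=150, nj=100, n=1000)
neg_log_p = -math.log(p)
print(round(neg_log_p, 4))
```

The printed value can be compared against the `-logH` row reported for that cell.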

## Numerical Experiments

Each formula is evaluated under several parameter settings to compare how closely it approximates $-\log H_{ij}$ when $n$ is small and when $n$ is large.
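
The tables that follow summarize each comparison in a Delta % column. As defined in the notebook, `calc_delta(val1, val2)` divides by `val2`, the formula's own value, rather than by the `-logH` reference, so Delta % exceeds 100 whenever a formula's value is less than half the reference. The sketch below contrasts the two conventions on rounded values from the large-$n$ Theorem 1 table; `rel_to_reference` is a hypothetical helper added only for illustration.

```python
def calc_delta(val1, val2):
    # Same definition as in the notebook: difference relative to val2
    return round(abs(100 * (val2 - val1) / val2), 4)


def rel_to_reference(ref, val):
    # Hypothetical alternative: difference relative to the reference value
    return round(abs(100 * (val - ref) / ref), 4)


# Rounded -logH and TF-IDF+Psi values from the large-n Theorem 1 table
ref, approx = 24.8971, 10.9773
delta_notebook = calc_delta(ref, approx)
delta_reference = rel_to_reference(ref, approx)
print(delta_notebook, delta_reference)
```

Because the inputs here are already rounded to four decimals, the first number may differ from the tabulated Delta % in the last digits.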

### Theorem 1 verification

#### Using small $n$

In [3]:
# Input values
n = 1000
ni = 150
bi = 4
nj = 100
nij = 25
d = 20

# Result
display_params(nij, ni, nj, n, bi, d)
df = pd.DataFrame(generate_stats(nij, ni, nj, n, bi, d))
df

Parameter settings:
n = 1000	nj = 100
ni = 150	nij = 25
bi = 4		d = 20
pi = 0.15	pij = 0.25


| | Formula | Value | Delta % |
|---|---|---|---|
| 0 | -logH | 5.5429 | 0.0 |
| 1 | TF-ICF+Phi | 4.7111 | 17.6554 |
| 2 | TF-IDF+Psi | 24.6764 | 77.5378 |
| 3 | TF-IDF | 40.2359 | 86.2241 |


$-\log H_{ij}$ and $\text{TF-ICF}(i,j) + \Phi_{ij}$ differ substantially from $\text{TF-IDF}(i,j) + \Psi_{ij}$ and $\text{TF-IDF}(i,j)$ because the conditions of Corollary 1 and Corollary 2 are violated.
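
One corollary condition named in this notebook is that documents have equal length, which for these parameters would mean $n_j = n/d$. The sketch below tests that relation, together with a second relation, $n_i = b_i \, n_{ij}$, that happens to hold in the Corollary 1 and Corollary 2 cells. Treating the second relation as a corollary hypothesis is an assumption made here for illustration, not a statement quoted from the manuscript, and `structural_checks` is a hypothetical helper.

```python
def structural_checks(nij, ni, nj, n, bi, d):
    """Describe a parameter setting with two boolean checks.

    equal_length:   nj * d == n    (every document has length n / d)
    uniform_counts: ni == bi * nij (assumed: term i occurs nij times in each of bi documents)
    """
    return {"equal_length": nj * d == n, "uniform_counts": ni == bi * nij}


# Small-n Theorem 1 setting from the cell above
theorem1_small = structural_checks(nij=25, ni=150, nj=100, n=1000, bi=4, d=20)
# Small-n Corollary 1 setting from later in the notebook
corollary1_small = structural_checks(nij=10, ni=100, nj=25, n=1000, bi=10, d=40)

print(theorem1_small)
print(corollary1_small)
```

Under these assumed checks, the Theorem 1 setting fails both relations while the Corollary 1 setting passes both.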

#### Using large $ n $

In [4]:
# Input values
n = 10000
ni = 200
bi = 20
nj = 75
nij = 15
d = 75

# Result
display_params(nij, ni, nj, n, bi, d)
df = pd.DataFrame(generate_stats(nij, ni, nj, n, bi, d))
df

Parameter settings:
n = 10000	nj = 75
ni = 200	nij = 15
bi = 20		d = 75
pi = 0.02	pij = 0.2


| | Formula | Value | Delta % |
|---|---|---|---|
| 0 | -logH | 24.8971 | 0.0 |
| 1 | TF-ICF+Phi | 23.6898 | 5.0964 |
| 2 | TF-IDF+Psi | 10.9773 | 126.8048 |
| 3 | TF-IDF | 19.8263 | 25.5758 |


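As $n$ grows, $\text{TF-ICF}(i,j) + \Phi_{ij}$ tracks $-\log H_{ij}$ more closely in relative terms: Delta % falls from 17.6554 for $n = 1000$ to 5.0964 for $n = 10000$, although the absolute gap between the two values does not shrink.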
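
The comparison between the two Theorem 1 settings can be reproduced without SciPy. The sketch below mirrors the notebook's `LogH` and `tficf_phi` using exact integer arithmetic for the hypergeometric tail; the helper names `hyper_tail`, `binom_pmf`, and `neg_log_h` are hypothetical, and the relative gap follows the notebook's `calc_delta` convention of dividing by the approximating value.

```python
import math


def hyper_tail(k_min, ni, nj, n):
    # Exact P(X >= k_min) for a hypergeometric X (population n, ni marked, nj draws)
    total = sum(
        math.comb(ni, k) * math.comb(n - ni, nj - k)
        for k in range(k_min, min(ni, nj) + 1)
    )
    return total / math.comb(n, nj)


def binom_pmf(k, m, p):
    # Binomial point probability for m trials with success probability p
    return math.comb(m, k) * p**k * (1 - p) ** (m - k)


def neg_log_h(nij, ni, nj, n):
    return -math.log(hyper_tail(nij, ni, nj, n))


def tficf_phi(nij, ni, nj, n):
    # Same expression as the notebook's tficf_phi, with Qij inlined as q
    pi, pij = ni / n, nij / nj
    q = hyper_tail(nij + 1, ni, nj, n) / binom_pmf(nij, nj, pi)
    return nij * math.log(n / ni) + nij * math.log(pij) + (nj - nij) * (pi - pij) - q


settings = {
    "small n": dict(nij=25, ni=150, nj=100, n=1000),
    "large n": dict(nij=15, ni=200, nj=75, n=10000),
}
rel_gap = {}
for label, s in settings.items():
    ref, approx = neg_log_h(**s), tficf_phi(**s)
    rel_gap[label] = abs(100 * (approx - ref) / approx)
    print(label, round(ref, 4), round(approx, 4), round(rel_gap[label], 4))
```

Each printed row can be checked against the `-logH` and `TF-ICF+Phi` rows of the corresponding table.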

### Corollary 1 verification

#### Using small $ n $

In [5]:
# Input values
n = 1000
ni = 100
bi = 10
nj = 25
nij = 10
d = 40

# Result
display_params(nij, ni, nj, n, bi, d)
df = pd.DataFrame(generate_stats(nij, ni, nj, n, bi, d))
df

Parameter settings:
n = 1000	nj = 25
ni = 100	nij = 10
bi = 10		d = 40
pi = 0.1	pij = 0.4


| | Formula | Value | Delta % |
|---|---|---|---|
| 0 | -logH | 9.7407 | 0.0 |
| 1 | TF-ICF+Phi | 9.2446 | 5.3662 |
| 2 | TF-IDF+Psi | 9.2446 | 5.3662 |
| 3 | TF-IDF | 13.8629 | 29.7354 |


When the conditions of Corollary 1 are satisfied, $\text{TF-IDF}(i,j) + \Psi_{ij}$ coincides with $\text{TF-ICF}(i,j) + \Phi_{ij}$ and lies close to $-\log H_{ij}$ (Delta % of 5.3662).

#### Using large $ n $

In [6]:
# Input values
n = 10000
ni = 200
bi = 8
nj = 100
nij = 25
d = 100

# Result
display_params(nij, ni, nj, n, bi, d)
df = pd.DataFrame(generate_stats(nij, ni, nj, n, bi, d))
df

Parameter settings:
n = 10000	nj = 100
ni = 200	nij = 25
bi = 8		d = 100
pi = 0.02	pij = 0.25


| | Formula | Value | Delta % |
|---|---|---|---|
| 0 | -logH | 46.7698 | 0.0 |
| 1 | TF-ICF+Phi | 45.8791 | 1.9414 |
| 2 | TF-IDF+Psi | 45.8791 | 1.9414 |
| 3 | TF-IDF | 63.1432 | 25.9306 |


When the conditions of Corollary 1 are satisfied, $\text{TF-IDF}(i,j) + \Psi_{ij}$ lies close to $-\log H_{ij}$, and the relative difference shrinks as $n$ grows (Delta % of 1.9414 here versus 5.3662 for $n = 1000$).
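
The `TF-ICF+Phi` and `TF-IDF+Psi` rows coincide in both Corollary 1 tables. Both expressions subtract the same `Qij`, so it is enough to compare the remaining terms. The sketch below does this for the two settings used above, which both satisfy $n_j = n/d$ and $n_i = b_i \, n_{ij}$; whether these relations are exactly the hypotheses of Corollary 1 is an assumption here. The names `phi_side` and `psi_side` are hypothetical.

```python
import math


def phi_side(nij, ni, nj, n):
    # TF-ICF plus the correction terms of the notebook's tficf_phi, with Qij omitted
    pi, pij = ni / n, nij / nj
    return nij * math.log(n / ni) + nij * math.log(pij) + (nj - nij) * (pi - pij)


def psi_side(nij, nj, bi, d):
    # TF-IDF plus the correction term of the notebook's tfidf_psi, with Qij omitted
    pij = nij / nj
    return nij * math.log(d / bi) - nij * (1 - bi / d) * (1 - pij)


# The two Corollary 1 settings used above
cases = [
    dict(nij=10, ni=100, nj=25, n=1000, bi=10, d=40),
    dict(nij=25, ni=200, nj=100, n=10000, bi=8, d=100),
]
results = []
for c in cases:
    left = phi_side(c["nij"], c["ni"], c["nj"], c["n"])
    right = psi_side(c["nij"], c["nj"], c["bi"], c["d"])
    results.append((left, right))
    print(round(left, 4), round(right, 4), math.isclose(left, right))
```

The agreement follows from $p_i = (b_i/d)\,p_{ij}$ under the two relations, which turns $(n_j - n_{ij})(p_i - p_{ij})$ into $-n_{ij}(1 - b_i/d)(1 - p_{ij})$ and $\log(n/n_i) + \log p_{ij}$ into $\log(d/b_i)$.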

### Corollary 2 verification

#### Using small $ n $

In [7]:
# Input values
n = 1000
ni = 160
bi = 8
nj = 20
nij = 20
d = 50

# Result
display_params(nij, ni, nj, n, bi, d)
df = pd.DataFrame(generate_stats(nij, ni, nj, n, bi, d))
df

Parameter settings:
n = 1000	nj = 20
ni = 160	nij = 20
bi = 8		d = 50
pi = 0.16	pij = 1.0


| | Formula | Value | Delta % |
|---|---|---|---|
| 0 | -logH | 37.6993 | 0.0 |
| 1 | TF-ICF+Phi | 36.6516 | 2.8584 |
| 2 | TF-IDF+Psi | 36.6516 | 2.8584 |
| 3 | TF-IDF | 36.6516 | 2.8584 |


When the conditions of Corollary 2 are satisfied, $\text{TF-IDF}(i,j)$ on its own lies close to $-\log H_{ij}$ (Delta % of 2.8584).

#### Using large $ n $

In [8]:
# Input values
n = 10000
ni = 1200
bi = 15
nj = 80
nij = 80
d = 125

# Result
display_params(nij, ni, nj, n, bi, d)
df = pd.DataFrame(generate_stats(nij, ni, nj, n, bi, d))
df

Parameter settings:
n = 10000	nj = 80
ni = 1200	nij = 80
bi = 15		d = 125
pi = 0.12	pij = 1.0


| | Formula | Value | Delta % |
|---|---|---|---|
| 0 | -logH | 171.9977 | 0.0 |
| 1 | TF-ICF+Phi | 169.6211 | 1.4012 |
| 2 | TF-IDF+Psi | 169.6211 | 1.4012 |
| 3 | TF-IDF | 169.6211 | 1.4012 |


When the conditions of Corollary 2 are satisfied, $\text{TF-IDF}(i,j)$ lies close to $-\log H_{ij}$, and the relative difference shrinks as $n$ grows (Delta % of 1.4012 here versus 2.8584 for $n = 1000$).
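
In both Corollary 2 settings $n_{ij} = n_j$, so the upper tail of the hypergeometric distribution reduces to a single term, $\binom{n_i}{n_j} / \binom{n}{n_j}$, and the correction terms in `tfidf_psi` vanish because $1 - p_{ij} = 0$ and `Qij` is zero. The sketch below recomputes $-\log H_{ij}$ from that single term and compares it with plain TF-IDF, using only the standard library; `neg_log_h_full_document` is a hypothetical name.

```python
import math


def neg_log_h_full_document(ni, nj, n):
    # With nij == nj the tail is the single term C(ni, nj) / C(n, nj)
    return math.log(math.comb(n, nj)) - math.log(math.comb(ni, nj))


def tf_idf(nij, d, bi):
    # Same definition as in the notebook
    return nij * math.log(d / bi)


# The two Corollary 2 settings used above (nij equals nj in both)
cases = [
    dict(ni=160, nj=20, n=1000, bi=8, d=50),
    dict(ni=1200, nj=80, n=10000, bi=15, d=125),
]
rows = []
for c in cases:
    exact = neg_log_h_full_document(c["ni"], c["nj"], c["n"])
    approx = tf_idf(c["nj"], c["d"], c["bi"])
    rows.append((exact, approx))
    print(round(exact, 4), round(approx, 4))
```

The two printed pairs can be compared against the `-logH` and `TF-IDF` rows of the Corollary 2 tables.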

### Examples Typical of Real Data

#### Case 1

A large number of documents, $d$, and a small number of occurrences of a given word in a given document, $n_{ij}$.

In [9]:
# Input values
n = 10000
ni = 125
bi = 12
nj = 75
nij = 7
d = 175

# Result
display_params(nij, ni, nj, n, bi, d)
df = pd.DataFrame(generate_stats(nij, ni, nj, n, bi, d))
df

Parameter settings:
n = 10000	nj = 75
ni = 125	nij = 7
bi = 12		d = 175
pi = 0.0125	pij = 0.09333333333333334


| | Formula | Value | Delta % |
|---|---|---|---|
| 0 | -logH | 10.1385 | 0.0 |
| 1 | TF-ICF+Phi | 8.4774 | 19.5938 |
| 2 | TF-IDF+Psi | 12.7487 | 20.474 |
| 3 | TF-IDF | 18.7592 | 45.9544 |


This setup violates conditions of Theorem 1, such as the requirement that $n_j$ be sufficiently large.

#### Case 2

A rare term $i$ that occurs in only a few documents, i.e., $p_i$ and $b_i/d$ are both small.

In [10]:
# Input values
n = 12500
ni = 6
bi = 3
nj = 80
nij = 2
d = 200

# Result
display_params(nij, ni, nj, n, bi, d)
df = pd.DataFrame(generate_stats(nij, ni, nj, n, bi, d))
df

Parameter settings:
n = 12500	nj = 80
ni = 6	nij = 2
bi = 3		d = 200
pi = 0.00048	pij = 0.025


| | Formula | Value | Delta % |
|---|---|---|---|
| 0 | -logH | 7.424 | 0.0 |
| 1 | TF-ICF+Phi | 5.986 | 24.0226 |
| 2 | TF-IDF+Psi | 6.4716 | 14.7178 |
| 3 | TF-IDF | 8.3994 | 11.6125 |


This setup violates some conditions of Theorem 1 (such as $n_{ij}$ being large) and of Corollaries 1 and 2 (such as documents being of equal length).