# Numerical Experiments

Authors: Zeyad Ahmed and Paul Sheridan

Last update: 2024-06-10

Description: This notebook contains the code for reproducing the numerical results reported in Table 3 of the manuscript "A Fisher's exact test interpretation of the TF–IDF term-weighting scheme".

## Preliminaries

In [1]:
# Imports
import math
import scipy.stats as stats


In [2]:
# Custom functions
# @zeyad: Is anything gained by embedding these functions within a class? If we can drop the class without any harm, I'd do that.
# @zeyad: How about making function names and variable names match the names used in the paper? For example "hypergeometric_pmf" could be "h", and "Nij" could be "nij".
# @zeyad: Run the code through a linter so we are consistent with using whitespace, etc.
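class Functions:
    def tficf(self, Nij, N, Ni):
        """TF-ICF: term count times log inverse collection frequency."""
        return Nij * math.log(N / Ni)

    def tfidf(self, Nij, D, Bi):
        """TF-IDF: term count times log inverse document frequency."""
        return Nij * math.log(D / Bi)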
    
    def fisher(self, Nij, Ni, Nj, N, return_table=False):  # H
        """One-sided Fisher's exact test p-value. Not used in the experiments below."""
        # Rows: term i vs. all other terms; columns: document j vs. all other documents.
        contingency_table = [
            [Nij, Ni - Nij],
            [Nj - Nij, N - Ni - (Nj - Nij)],
        ]
        try:
            p_value = stats.fisher_exact(contingency_table, alternative='greater')[1]
        except ValueError:
            # fisher_exact rejects tables with negative entries
            print("Error: Check the values again.")
            print(contingency_table)
            return None
        if return_table:
            return p_value, contingency_table
        return p_value
    
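    def H(self, Nij, Ni, Nj, N):
        """Upper tail P(X >= Nij) of the hypergeometric distribution."""
        # scipy's argument order is (k, population size, number of successes, number of draws)
        return stats.hypergeom.sf(Nij - 1, N, Ni, Nj)

    def hypergeom_pmf(self, Nij, Ni, Nj, N):  # h
        """Hypergeometric probability P(X = Nij)."""
        return math.comb(Ni, Nij) * math.comb(N - Ni, Nj - Nij) / math.comb(N, Nj)

    def LogH(self, Nij, Ni, Nj, N):
        """Negative log of the tail probability H (printed as Eq3)."""
        return -math.log(self.H(Nij, Ni, Nj, N))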

    
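    def binomial(self, Nij, Nj, Pi):  # b
        """Binomial probability of Nij successes in Nj trials with success probability Pi."""
        return math.comb(Nj, Nij) * Pi**Nij * (1 - Pi)**(Nj - Nij)

    # @zeyad: I'd call this "Qij"
    def compute_Qij(self, Nij, Ni, Nj, N):
        """Ratio of the hypergeometric tail beyond Nij to the binomial probability at Nij."""
        Pi = Ni / N
        return self.H(Nij + 1, Ni, Nj, N) / self.binomial(Nij, Nj, Pi)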

    def eq8(self, Nij, Ni, Nj, N):
        tficf = self.tficf(Nij, N, Ni)
        Pij = Nij/Nj
        Pi = Ni/N
        Qij = self.compute_Qij(Nij, Ni, Nj, N)
        return tficf + Nij*math.log(Pij) + (Nj - Nij)*(Pi - Pij) - Qij
    
    def eq11(self, r, Ni, Nj, N, Bi, D):
        """
        Assumes nij = r and nil = rBil for l != j.
        """
        tfidf = self.tfidf(r, D, Bi)
        Pij = r / Nj
        return tfidf - r*(1 - Bi/D)*(1 - Pij) - self.compute_Qij(r, Ni, Nj, N)
    
    # @zeyad: We shouldn't have a function named "temp".
    def temp(self, r, Ni, Nj, N, Bi, D):
        """
        nij = r
        nil = rBil; l != j
        """
        tfidf = self.tfidf(r, D, Bi)
        Pi = Ni/N
        return tfidf - r*( (Bi/D) * (1-Pi) - 1 + r/D ) - self.compute_Qij(r, Ni, Nj, N)     
    
    def display_stats(self, Nij, Ni, Nj, N, Bi, D):
        Pi = Ni/N
        Pij = Nij/Nj
        
        print(
            f'Nij={Nij}\tNi={Ni}\tNj={Nj}\tN={N}\t\n'
            f'Bi={Bi}\tD={D}\tPi={Pi}\tPij={Pij}\n\n'
            f'Eq3: {self.LogH(Nij, Ni, Nj, N):.4f}\n'
            f'Eq8: {self.eq8(Nij, Ni, Nj, N):.4f}\n'
            f'Eq11: {self.eq11(Nij, Ni, Nj, N, Bi, D):.4f}\n'
            f'Eq12: {self.tfidf(Nij, D, Bi):.4f}\n'
            f'{"="*13}'
        )


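As a cross-check on the definitions above, the tail probability H can also be obtained by summing the hypergeometric pmf h over the upper tail, without scipy. The sketch below is not part of the original notebook: the function names `h`, `H`, and `neg_log_H` are hypothetical, and exact rational arithmetic is an assumption made here to avoid rounding in very small tail probabilities.

```python
import math
from fractions import Fraction


def h(nij, ni, nj, n):
    """Hypergeometric pmf P(X = nij), kept as an exact fraction."""
    return Fraction(math.comb(ni, nij) * math.comb(n - ni, nj - nij),
                    math.comb(n, nj))


def H(nij, ni, nj, n):
    """Upper tail P(X >= nij), summed term by term up to min(ni, nj)."""
    return sum(h(k, ni, nj, n) for k in range(nij, min(ni, nj) + 1))


def neg_log_H(nij, ni, nj, n):
    """Negative log of the tail probability (the quantity printed as Eq3)."""
    return -math.log(float(H(nij, ni, nj, n)))


# Parameters of the case with n=1000, ni=160, nj=20, nij=20
print(f"{neg_log_H(20, 160, 20, 1000):.4f}")  # 37.6993
```

With `nij = nj` the tail contains a single term, so H reduces to h at that point. The printed value agrees with the Eq3 output reported for this parameter set.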

## Numerical Experiments

@zeyad: We should write a one sentence summary for each of these code blocks.

In [4]:
# Input values
n = 1000
ni = 150
bi = 4
nj = 100
nij = 25
d = 20

# Result
f = Functions()
f.display_stats(nij, ni, nj, n, bi, d)


In [21]:
## KEEP

n = 10000
ni = 200
bi = 20
nj = 75
nij = 15
d = 75
pi = ni / n
pij = nij / nj

f = Functions()
f.display_stats(nij, ni, nj, n, bi, d)

Nij=15	Ni=200	Nj=75	N=10000	
Bi=20	D=75	Pi=0.02	Pij=0.2

Eq3: 24.8971
Eq8: 23.6898
Eq11: 10.9773
Eq12: 19.8263


In [19]:
## KEEP

n = 1000
ni = 160
bi = 8
nj = 20
nij = 20
d = 50

f = Functions()
f.display_stats(nij, ni, nj, n, bi, d)

Nij=20	Ni=160	Nj=20	N=1000	
Bi=8	D=50	Pi=0.16	Pij=1.0

Eq3: 37.6993
Eq8: 36.6516
Eq11: 36.6516
Eq12: 36.6516
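The identical Eq8, Eq11, and Eq12 values in this case follow from the arithmetic of the formulas defined in the Preliminaries. With nij = nj we have Pij = 1, so the log Pij term and both correction terms are zero, and n/ni = d/bi = 6.25 makes TF-ICF equal to TF-IDF. The sketch below reproduces that arithmetic with the standard library only. It leaves out the Qij term on the assumption that it is zero here, since the hypergeometric tail beyond nij = nj is empty. The variable names `eq8_no_q` and `eq11_no_q` are hypothetical.

```python
import math

# Parameters of this case
n, ni, bi, nj, nij, d = 1000, 160, 8, 20, 20, 50

pi_ = ni / n      # overall proportion of term i (trailing underscore avoids confusion with math.pi)
pij = nij / nj    # proportion of term i in document j

tficf = nij * math.log(n / ni)
tfidf = nij * math.log(d / bi)

# Eq8 and Eq11 as coded in the notebook, with the Qij term left out
eq8_no_q = tficf + nij * math.log(pij) + (nj - nij) * (pi_ - pij)
eq11_no_q = tfidf - nij * (1 - bi / d) * (1 - pij)

print(f"{eq8_no_q:.4f} {eq11_no_q:.4f} {tfidf:.4f}")  # 36.6516 36.6516 36.6516
```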


In [22]:
## KEEP

n = 10000
ni = 200
bi = 8
nj = 100
nij = 25
d = 100
pi = ni / n
pij = nij / nj

f = Functions()
f.display_stats(nij, ni, nj, n, bi, d)



Nij=25	Ni=200	Nj=100	N=10000	
Bi=8	D=100	Pi=0.02	Pij=0.25

Eq3: 46.7698
Eq8: 45.8791
Eq11: 45.8791
Eq12: 63.1432


In [23]:
## KEEP

n = 1000
ni = 100
bi = 10
nj = 25
nij = 10
d = 40
pi = ni / n
pij = nij / nj

f = Functions()
f.display_stats(nij, ni, nj, n, bi, d)



Nij=10	Ni=100	Nj=25	N=1000	
Bi=10	D=40	Pi=0.1	Pij=0.4

Eq3: 9.7407
Eq8: 9.2446
Eq11: 9.2446
Eq12: 13.8629


In [24]:
## KEEP

n = 10000
ni = 1200
bi = 15
nj = 80
nij = 80
d = 125
pi = ni / n
pij = nij / nj

f = Functions()
f.display_stats(nij, ni, nj, n, bi, d)



Nij=80	Ni=1200	Nj=80	N=10000	
Bi=15	D=125	Pi=0.12	Pij=1.0

Eq3: 171.9977
Eq8: 169.6211
Eq11: 169.6211
Eq12: 169.6211
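One way to read the five outputs above: expanding `eq8` and `eq11` as they are coded in the Preliminaries, the two expressions coincide whenever Pij/Pi = D/Bi, that is, when (nij/nj)/(ni/n) equals d/bi. Under that condition TF-ICF plus nij·log(Pij) equals TF-IDF, and the two correction terms agree as well. This is an observation about the code, not a claim taken from the manuscript. The sketch below checks the condition exactly for each parameter set, and the helper name `ratios_match` is hypothetical.

```python
from fractions import Fraction

# (n, ni, bi, nj, nij, d) for the five cases with reported output, in order
cases = [
    (10000, 200, 20, 75, 15, 75),
    (1000, 160, 8, 20, 20, 50),
    (10000, 200, 8, 100, 25, 100),
    (1000, 100, 10, 25, 10, 40),
    (10000, 1200, 15, 80, 80, 125),
]


def ratios_match(n, ni, bi, nj, nij, d):
    """True when (nij/nj) / (ni/n) equals d/bi, compared as exact fractions."""
    return Fraction(nij, nj) / Fraction(ni, n) == Fraction(d, bi)


matches = [ratios_match(*case) for case in cases]
print(matches)  # [False, True, True, True, True]
```

Only the first case fails the condition, which is consistent with it being the only case where the reported Eq8 and Eq11 values differ (23.6898 versus 10.9773).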
