# Heaps' Law

Heaps' law describes how the number of distinct words (the vocabulary) in a document grows with the total number of words in that document. It is given by the equation:

$$
V = KN^\beta
$$

Where:
- $V$ is the number of distinct words (the vocabulary size) in the document
- $N$ is the total number of words in the document, counting repetitions
- $K$ and $\beta$ are free parameters, determined empirically for each corpus
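
Taking logarithms of both sides gives a form that is often easier to work with. This is a standard rearrangement of the equation above, not something specific to this notebook:

$$
\log V = \log K + \beta \log N
$$

On a log-log plot of $V$ against $N$ the law is therefore a straight line with slope $\beta$ and intercept $\log K$. For $0 < \beta < 1$ the vocabulary keeps growing with $N$, but more slowly than the text itself.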

## Implementation

In [None]:
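# Commonly quoted lower and upper bounds for the Heaps' law parameters.
LOW_HEAPS_K = 10
LOW_HEAPS_BETA = 0.4
HIGH_HEAPS_K = 100
HIGH_HEAPS_BETA = 0.6

def linear_midpoint(lowpoint, highpoint):
    '''Returns the arithmetic midpoint of two values.'''
    return lowpoint + (highpoint - lowpoint) * 0.5

def exponential_midpoint(lowpoint, highpoint):
    '''Returns the geometric midpoint of two positive values.'''
    return lowpoint * (highpoint / lowpoint) ** 0.5

# K is interpolated geometrically (it is a multiplicative factor);
# beta is interpolated arithmetically (it is an exponent).
TYPICAL_HEAPS_K = exponential_midpoint(LOW_HEAPS_K, HIGH_HEAPS_K)
TYPICAL_HEAPS_BETA = linear_midpoint(LOW_HEAPS_BETA, HIGH_HEAPS_BETA)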

def calculate_heaps_law_size(text_size, k=TYPICAL_HEAPS_K, beta=TYPICAL_HEAPS_BETA):
    '''Predicts the number of distinct words in a text of a given size using Heaps' law.
    
    text_size: int, the size of the text (N in the equation)
    k: float, the K parameter of Heaps' law
    beta: float, the beta parameter of Heaps' law
    return: float, the predicted number of distinct words (V in the equation)
    '''
    return k * (text_size ** beta)
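
The midpoint choices above have a convenient consequence. With a geometric midpoint for $K$ and an arithmetic midpoint for $\beta$, the typical prediction is the geometric mean of the low and high predictions for any $N$. The sketch below checks this numerically. It re-declares the constants under shorter, hypothetical names (`LOW_K`, `heaps`, and so on) so that it runs on its own, and the value of `n` is only an example input.

```python
import math

# Same bounds as the cell above, re-declared so this sketch is self-contained.
LOW_K, HIGH_K = 10, 100
LOW_BETA, HIGH_BETA = 0.4, 0.6

def heaps(n, k, beta):
    """Heaps' law prediction V = K * N**beta."""
    return k * n ** beta

# Geometric midpoint for K, arithmetic midpoint for beta.
typical_k = math.sqrt(LOW_K * HIGH_K)
typical_beta = (LOW_BETA + HIGH_BETA) / 2

n = 170548  # example text size
low = heaps(n, LOW_K, LOW_BETA)
high = heaps(n, HIGH_K, HIGH_BETA)
typical = heaps(n, typical_k, typical_beta)

print(low, typical, high)
# The typical estimate equals the geometric mean of the two bounds.
print(math.isclose(typical, math.sqrt(low * high)))  # True
```

This follows directly from the algebra: $\sqrt{K_{lo} K_{hi}} \, N^{(\beta_{lo} + \beta_{hi})/2} = \sqrt{K_{lo} N^{\beta_{lo}} \cdot K_{hi} N^{\beta_{hi}}}$.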

## Application

In [19]:
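def calculate_and_print_document_sizes(filename):
    # 'mbcs' is the Windows ANSI codec; it is only available on Windows.
    with open(filename, 'r', encoding="mbcs") as file:
        document = file.read()
    # Note: len(document) counts characters, not words, so the Heaps' law
    # estimates below take the character count as N.
    document_size = len(document)
    print_document_sizes(filename, document, document_size)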

def print_document_sizes(filename, document, document_size):
    print('\n' + filename + ': ')
    print('Document size: ' + str(document_size))
    print('Distinct words: ' + str(len(set(document.split()))))
    print('Low Heaps\' law size: ' + str(calculate_heaps_law_size(document_size, LOW_HEAPS_K, LOW_HEAPS_BETA)))
    print('Typical Heaps\' law size: ' + str(calculate_heaps_law_size(document_size)))
    print('High Heaps\' law size: ' + str(calculate_heaps_law_size(document_size, HIGH_HEAPS_K, HIGH_HEAPS_BETA)))

calculate_and_print_document_sizes('./datasets/Alice.txt')
calculate_and_print_document_sizes('./datasets/Shakespeare.txt')


../Word2Vec/Alice.txt: 
Document size: 170548
Distinct words: 5980
Low Heaps' law size: 1238.0513336871213
Typical Heaps' law size: 13059.402742851606
High Heaps' law size: 137755.1926640068

../Word2Vec/Shakespear.txt: 
Document size: 5458198
Distinct words: 67505
Low Heaps' law size: 4952.445607328613
Typical Heaps' law size: 73879.61829895985
High Heaps' law size: 1102121.746056731
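
In both runs the observed number of distinct words falls between the low and the typical estimate. Note, however, that the cell above passes `len(document)`, a character count, as the text size, whereas Heaps' law is stated in terms of word counts. A minimal sketch of a word-based measurement follows, assuming the same whitespace tokenisation (`str.split`) that the cell uses for its distinct-word count. The function names and the sample sentence are hypothetical and not part of the notebook.

```python
def token_and_type_counts(text):
    """Return (N, V): total whitespace-separated tokens and distinct tokens."""
    tokens = text.split()
    return len(tokens), len(set(tokens))

def vocabulary_growth(text, step):
    """Return [(N, V), ...] measured after every `step` tokens of the text."""
    seen = set()
    curve = []
    for i, token in enumerate(text.split(), start=1):
        seen.add(token)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve

# Tiny illustrative input; a real run would read the document from a file.
sample = "the cat sat on the mat and the dog sat too"
n_tokens, n_types = token_and_type_counts(sample)
print(n_tokens, n_types)                  # 11 8
print(vocabulary_growth(sample, step=4))  # [(4, 4), (8, 6)]
```

Passing `n_tokens` rather than the character count to `calculate_heaps_law_size` would compare like with like. The `(N, V)` pairs from `vocabulary_growth` are the points one would plot to see how closely a given text follows the law.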
