# Part 1 Documentation

## Implementation explanation

Below is my implementation for producing the n-gram frequency counts. It consists of four functions:
- frequency_counts()
- traverse_tree(node, number_of_words, counts)
- handle_sentence(sentence_node, number_of_words, counts)
- retrieve_text(node)

### frequency_counts()
This function starts by listing the *.xml* files in the training directory, which together contain the corpus used as training data for this project. It then loops three times, once for each n-gram size *(unigram, bigram, trigram)*. In each pass it parses every file, finds its root, and calls the traverse_tree() function on each of the root's children, accumulating the counts in a single dictionary.


Once the frequency count dictionary for an n-gram size has been created, it is stored in a JSON file so that this process does not need to be repeated every time the language model is used.
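
Reading the stored counts back could look like the sketch below. The helper name `load_counts` and its `directory` parameter are assumptions made for illustration; only the file naming pattern `{number_of_words}_gram_counts.json` comes from the code in this notebook.

```python
import json
import os
from collections import defaultdict


def load_counts(number_of_words, directory):
    """Load one stored n-gram count file (hypothetical helper, not part of the notebook)."""
    path = os.path.join(directory, f"{number_of_words}_gram_counts.json")
    with open(path, encoding="utf-8") as fp:
        # json.load returns a plain dict; wrapping it makes unseen n-grams count as 0
        return defaultdict(int, json.load(fp))
```

Wrapping the result in a `defaultdict(int)` keeps lookups of unseen n-grams from raising a `KeyError`, which mirrors how the counts are built.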

### traverse_tree(node, number_of_words, counts)
This function takes 3 parameters:
- node (xml.etree.ElementTree.Element): The current node in the XML tree.
- number_of_words (int): The number of words in the n-grams to be counted.
- counts (collections.defaultdict): A dictionary to store the n-gram counts.


This function is a recursive, depth-first function that traverses the tree looking for *s* (sentence) tags. Due to the structure of the *.xml* files, the first child of the root (*teiHeader*) contains information about the text and its origins. This is irrelevant to the frequency counts, so the function does not descend into it and keeps searching the rest of the tree. Whenever an *s* tag is found, the handle_sentence() function is called on it and the search does not go any deeper into that sentence.
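
To illustrate the traversal on a small input, the sketch below applies the same rule (skip *teiHeader*, stop descending at *s*) to a made-up document. The sample XML and the helper `collect_sentences` are assumptions for illustration; they only mimic the tag names mentioned above and are not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# Made-up document that only imitates the tag names used in this section
SAMPLE = """
<doc>
  <teiHeader><s n="0"><w>ignored </w></s></teiHeader>
  <wtext>
    <div>
      <p><s n="1"><w>Hello </w><w>world </w></s></p>
      <p><s n="2"><w>Bye </w></s></p>
    </div>
  </wtext>
</doc>
"""


def collect_sentences(node, found):
    """Depth-first search that gathers <s> nodes and never enters <teiHeader>."""
    if node.tag == "teiHeader":
        return
    for child in node:
        if child.tag == "s":
            found.append(child)
        else:
            collect_sentences(child, found)


sentences = []
for child in ET.fromstring(SAMPLE):
    collect_sentences(child, sentences)

print([s.get("n") for s in sentences])  # ['1', '2']
```

The sentence inside *teiHeader* is never reached, while the two sentences nested under *wtext* are found in document order.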

### handle_sentence(sentence_node, number_of_words, counts)
This function takes 3 parameters:
- sentence_node (xml.etree.ElementTree.Element): The current sentence node in the XML tree.
- number_of_words (int): The number of words in the n-grams to be counted.
- counts (collections.defaultdict): A dictionary to store the n-gram counts.
  
The function appends the end marker `</s>` to the string returned by retrieve_text(), splits the result into words, and updates the n-gram frequencies stored in counts. Frequencies are only updated if the sentence contained words; if no words were present, the sentence is skipped.
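
The sliding-window step can be sketched in isolation as follows. The function name `count_n_grams` and the example sentence are assumptions for illustration; the window logic follows the description above.

```python
from collections import defaultdict


def count_n_grams(marked_sentence, number_of_words, counts):
    """Slide a window of `number_of_words` tokens over a whitespace-tokenised sentence."""
    words = marked_sentence.split()
    # A sentence with k tokens yields k - n + 1 windows (none if k < n)
    for index in range(len(words) - number_of_words + 1):
        counts[" ".join(words[index:index + number_of_words])] += 1


bigrams = defaultdict(int)
count_n_grams("<s> the cat saw the cat </s>", 2, bigrams)
print(dict(bigrams))
```

Here the bigram `the cat` is counted twice, and the markers take part in the windows, so `<s> the` and `cat </s>` are counted as well.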

### retrieve_text(node)
This function takes one parameter:
- node (xml.etree.ElementTree.Element): The current node in the XML tree.

It returns a str: the start marker `<s>` followed by the concatenated, lowercased text of the *w* (word) nodes found under the given node.
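
As a small worked example, the sketch below gathers the text of every *w* element under a sentence, however deeply nested, using `Element.iter`. The sample sentence (including the *mw* and *c* tags) and the helper `sentence_text` are hypothetical, and the sketch assumes each word element carries its own trailing whitespace.

```python
import xml.etree.ElementTree as ET

# Made-up sentence: one word sits directly under <s>, two are nested one level deeper
sentence = ET.fromstring(
    "<s><w>Of </w><mw><w>course </w><w>not </w></mw><c>.</c></s>"
)


def sentence_text(node):
    """Return '<s> ' followed by the lowercased text of all <w> descendants."""
    text = "<s> "
    for word in node.iter("w"):  # visits matching descendants in document order
        if word.text:
            text += word.text.lower()
    return text


print(repr(sentence_text(sentence)))  # '<s> of course not '
```

Elements other than *w*, such as the punctuation element in the sample, contribute nothing to the returned string.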

In [2]:
import xml.etree.ElementTree as ET
from collections import defaultdict
import json
import os

def frequency_counts():
    training_dir_path = '../corpus/train/'
    xml_files = [f for f in os.listdir(training_dir_path) if f.endswith('.xml')]

    # Create frequency counts for unigrams, bigrams, and trigrams and save to a json file
    for number_of_words in range(1, 4):
        n_gram_counts = defaultdict(int)

        for file in xml_files:
            path = os.path.join(training_dir_path, file)
            tree = ET.parse(path)
            root = tree.getroot()
            for child in root:
                traverse_tree(child, number_of_words, n_gram_counts)

        with open(f'../src/n_grams/{number_of_words}_gram_counts.json', 'w', encoding='utf-8') as fp:
            json.dump(n_gram_counts, fp)


def traverse_tree(node, number_of_words, counts):
    if node.tag != 'teiHeader':
        for child in node:
            if child.tag == 's':
                handle_sentence(child, number_of_words, counts)
            else:
                traverse_tree(child, number_of_words, counts)


def handle_sentence(sentence_node, number_of_words, counts):
    text = retrieve_text(sentence_node)
    # Skip sentences without words: only the start marker was returned
    if text.strip() != "<s>":
        # Append the end-of-sentence marker and split into tokens
        words = (text + " </s>").split()
        # Slide a window of number_of_words tokens over the sentence
        for index in range(len(words) - number_of_words + 1):
            n_gram = " ".join(words[index:index + number_of_words])
            counts[n_gram] += 1


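def retrieve_text(node):
    # Start-of-sentence marker, added once per sentence
    text = "<s> "
    # Visit every <w> descendant in document order, including nested ones
    for word in node.iter('w'):
        if word.text:
            # Text is concatenated as-is, so word separation relies on the
            # whitespace already present in each <w> element
            text += word.text.lower()
    return text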


frequency_counts()


## Implementation stages

The above is the final version of the code. It was initially developed to handle only one XML file from the test corpus and was later extended to handle the whole training set, contained in the *corpus/train/* directory.

Below is that *single file* implementation, along with a modified version of the final code that neither changes nor generates the JSON files. These two functions are used later in the testing phase.

In [6]:
import os

def single_file_frequency_counts(path):
    tree = ET.parse(path)
    root = tree.getroot()

    # Create frequency counts for unigrams, bigrams, and trigrams
    for number_of_words in range(1, 4):
        n_gram_counts = defaultdict(int)
        for child in root:
            traverse_tree(child, number_of_words, n_gram_counts)
            
def corpus_frequency_counts():
    test_dir_path = "test_3/"
    xml_files = [f for f in os.listdir(test_dir_path) if f.endswith('.xml')]

    # Create frequency counts for unigrams, bigrams, and trigrams
    for number_of_words in range(1, 4):
        n_gram_counts = defaultdict(int)

        for xml_file in xml_files:
            path = os.path.join(test_dir_path, xml_file)
            tree = ET.parse(path)
            root = tree.getroot()
            for child in root:
                traverse_tree(child, number_of_words, n_gram_counts)

## Testing

Before the initial *single file* implementation was extended, it was run on a short three-line extract of one of the training files, and the resulting JSON files were checked to verify that the generated counts were correct.
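
A check of this kind can also be automated. The sketch below is not the check that was actually run: the three sentences are a made-up stand-in for the extract, and `n_gram_counts` is a hypothetical helper that counts n-grams from plain strings rather than from XML.

```python
from collections import Counter


def n_gram_counts(sentences, n):
    """Count n-grams over plain-text sentences, adding <s> and </s> markers."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts


# Hypothetical three-sentence extract, not taken from the training files
extract = ["The dog barks", "The dog sleeps", "A cat sleeps"]
unigrams = n_gram_counts(extract, 1)
bigrams = n_gram_counts(extract, 2)
trigrams = n_gram_counts(extract, 3)

# Hand-computed expectations for individual n-grams
assert unigrams["<s>"] == 3 and unigrams["the"] == 2
assert bigrams["the dog"] == 2 and bigrams["sleeps </s>"] == 2
assert trigrams["<s> the dog"] == 2

# Each sentence has 5 tokens including markers, so it yields 5 - n + 1 n-grams
assert sum(bigrams.values()) == 3 * 4
assert sum(trigrams.values()) == 3 * 3
```

Checking the totals as well as individual entries catches off-by-one errors in the window range that a few spot checks might miss.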


Finally, once the implementation was finished, the performance of this section was measured using two metrics:
1. Time taken to create each n_gram
2. CPU and RAM usage

### Time Taken

In [4]:
import time

def timed_single_file_frequency_counts(path):
    tree = ET.parse(path)
    root = tree.getroot()
    text_node = root.find('.//wtext')

    # Create frequency counts for unigrams, bigrams, and trigrams (nothing is saved to disk)
    for number_of_words in range(1, 4):
        start_time = time.time()

        n_gram_counts = defaultdict(int)
        traverse_tree(text_node, number_of_words, n_gram_counts)

        end_time = time.time()  
        print(f"Time to process {number_of_words}_gram: {end_time - start_time} seconds")

def timed_corpus_frequency_counts():
    test_dir_path = "test_3/"
    xml_files = [f for f in os.listdir(test_dir_path) if f.endswith('.xml')]

    # Create frequency counts for unigrams, bigrams, and trigrams
    for number_of_words in range(1, 4):
        start_time = time.time()

        n_gram_counts = defaultdict(int)

        for file in xml_files:
            path = os.path.join(test_dir_path, file)
            tree = ET.parse(path)
            root = tree.getroot()
            for child in root:
                traverse_tree(child, number_of_words, n_gram_counts)

        end_time = time.time()
        print(f"Time to process {number_of_words}_gram: {end_time - start_time} seconds")

print("Test 1:")
timed_single_file_frequency_counts("test_1.xml")
print()

print("Test 2:")
timed_single_file_frequency_counts("test_2.xml")
print()

print("Test 3:")
timed_corpus_frequency_counts()
print()

Test 1:
Time to process 1_gram: 1.6927719116210938e-05 seconds
Time to process 2_gram: 7.867813110351562e-06 seconds
Time to process 3_gram: 4.76837158203125e-06 seconds

Test 2:
Time to process 1_gram: 0.010452032089233398 seconds
Time to process 2_gram: 0.010739803314208984 seconds
Time to process 3_gram: 0.010126829147338867 seconds

Test 3:
Time to process 1_gram: 5.182066917419434 seconds
Time to process 2_gram: 6.037155866622925 seconds
Time to process 3_gram: 6.28548789024353 seconds
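
The Test 1 durations are only a few microseconds, where a single `time.time()` measurement is coarse. A possible refinement, sketched here as an assumption rather than what was run above, is to use `time.perf_counter()`, which Python provides for measuring short durations, and to keep the best of several runs. The helper `time_call` is hypothetical.

```python
import time


def time_call(func, *args, repeats=5):
    """Return the shortest wall-clock duration of func(*args) over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload; in the notebook this would be one of the counting functions
elapsed = time_call(sum, range(100_000))
print(f"Best of 5 runs: {elapsed:.6f} seconds")
```

Taking the minimum rather than the mean reduces the influence of unrelated activity on the machine during any single run.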



### Computer Resources

In [5]:
import psutil

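def test_cpu_usage_single_file(path):
    process = psutil.Process(os.getpid())
    # The first cpu_percent() call only sets the reference point for the next call
    start_cpu = process.cpu_percent()

    single_file_frequency_counts(path)

    # CPU utilisation of this process since the previous call
    # (can exceed 100% when more than one core is used)
    end_cpu = process.cpu_percent()

    cpu_usage = end_cpu - start_cpu

    print(f"CPU Usage: {cpu_usage}%")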

def test_ram_usage_single_file(path):
    process = psutil.Process(os.getpid())
    start_ram = process.memory_info().rss / 1024 / 1024

    single_file_frequency_counts(path)

    end_ram = process.memory_info().rss / 1024 / 1024

    ram_usage = end_ram - start_ram

    print(f"RAM Usage: {ram_usage} MB")

def test_cpu_usage_corpus():
    process = psutil.Process(os.getpid())
    start_cpu = process.cpu_percent()

    corpus_frequency_counts()

    end_cpu = process.cpu_percent()

    cpu_usage = end_cpu - start_cpu

    print(f"CPU Usage: {cpu_usage}%")

def test_ram_usage_corpus():
    process = psutil.Process(os.getpid())
    start_ram = process.memory_info().rss / 1024 / 1024

    corpus_frequency_counts()

    end_ram = process.memory_info().rss / 1024 / 1024

    ram_usage = end_ram - start_ram

    print(f"RAM Usage: {ram_usage} MB")

print("Few lines test:")
test_cpu_usage_single_file("test_1.xml")
test_ram_usage_single_file("test_1.xml")

print("Single file test:")
test_cpu_usage_single_file("test_2.xml")
test_ram_usage_single_file("test_2.xml")

print("Corpus Test:")
test_cpu_usage_corpus()
test_ram_usage_corpus()

Few lines test:
CPU Usage: 154.3%
RAM Usage: 0.0 MB
Single file test:
CPU Usage: 99.2%
RAM Usage: 0.921875 MB
Corpus Test:
CPU Usage: 99.6%
RAM Usage: 56.125 MB
