### Code Explanation

The code below defines a function `extract_chunks` that splits Latin texts into non-overlapping chunks of $n$ tokens each, where $n$ is set by `chunk_size`.
Here is a step-by-step explanation of what the code does:

1. **Import Necessary Libraries**:
   - `cltk`: Used for the Latin tokenizer.
   - `re`, `os`: Used for regular expressions and file operations.
   - `glob`, `pandas`, `numpy`: Imported but not used in the code.

2. **Function Definition**: `extract_chunks`
   - **Purpose**: To split Latin texts into non-overlapping chunks based on the specified chunk size.
   - **Parameters**:
     - `directory_to_read`: Directory containing the text files to process.
     - `directory_to_write`: Directory where the processed chunks will be saved.
     - `threshold_to_slice`: Token count threshold above which texts will be split into chunks.
     - `chunk_size`: The size of each chunk in tokens.

3. **Helper Functions**:
   - `count_files(directory)`: Counts the number of `.txt` files in a directory.
   - `read_file(filepath)`: Reads the content of a file.
   - `preprocess(text)`: Removes Arabic numbers and non-word characters from the text.
   - `tokenize_latin_text(text)`: Lowercases and tokenizes Latin text using CLTK's Latin tokenizer.

4. **Directory Check**:
   - Ensures that the directory to save the chunks exists. If not, it creates the directory.

5. **Processing Each File**:
   - Iterates through each `.txt` file in the `directory_to_read`.
   - Tokenizes the text content.
   - If the number of tokens exceeds `threshold_to_slice`, splits the tokens into non-overlapping chunks of `chunk_size` tokens; the final chunk may be shorter.
   - Writes each chunk to a new file whose name carries the chunk number (`<name>_chunk<k>.txt`, numbered from 1).
   - Otherwise (token count at or below the threshold), writes the lowercased, preprocessed tokens, joined by spaces, to a single file with the original name.

6. **Summary Print Statement**:
   - Prints a summary message reporting how many `.txt` files `directory_to_write` now contains, including any that were already there before the run.

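The chunking step described in item 5 can be illustrated in isolation. The following is a minimal sketch, not the notebook's own code: the helper name `split_into_chunks` and the toy token list are assumptions introduced here for illustration. It shows that slicing in strides of `chunk_size` yields consecutive, non-overlapping chunks and that the last chunk simply holds whatever tokens remain.

```python
def split_into_chunks(tokens, chunk_size):
    """Split a token list into consecutive, non-overlapping chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    # Step through the list in strides of chunk_size; slicing past the end is safe
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


# Hypothetical toy input: 7 whitespace-separated tokens, chunk size 3
tokens = "arma virumque cano troiae qui primus ab".split()
chunks = split_into_chunks(tokens, chunk_size=3)
print([len(chunk) for chunk in chunks])  # [3, 3, 1]
```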
In [11]:
import cltk
from cltk.tokenizers import LatinWordTokenizer
from glob import glob
import re
import os
import pandas as pd
import numpy as np

def extract_chunks(directory_to_read, directory_to_write, threshold_to_slice, chunk_size):
    """
    Slice texts into non-overlapping chunks of `chunk_size` tokens.
    Before extracting the chunks, the function performs brief preprocessing: lowercasing,
    removing Arabic numerals and punctuation, and tokenizing the texts.

    Note: This function is designed for Latin texts using CLTK's Latin tokenizer.

    Parameters:
        directory_to_read (str): The directory containing the texts to slice.
        directory_to_write (str): The directory to write the results. If it doesn't exist, it will be created.
        threshold_to_slice (int): The token count threshold above which texts will be split into chunks.
        chunk_size (int): The size of each non-overlapping chunk.

    Returns:
        None. The results are written as .txt files to `directory_to_write`;
        texts above the threshold are split into chunks of at most `chunk_size` tokens.
    """

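    def count_files(directory):
        """Count the number of .txt files in the given directory."""
        count = 0
        # use the DirEntry API directly; joining `directory` with an entry
        # would duplicate the directory prefix for relative paths
        with os.scandir(directory) as entries:
            for entry in entries:
                if entry.is_file() and entry.name.endswith(".txt"):
                    count += 1
        return count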

    def read_file(filepath):
        """Read the content of a file."""
        with open(filepath, 'r', encoding='utf-8') as file:
            return file.read()

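    def preprocess(text):
        """Remove Arabic numerals and punctuation and return the cleaned text."""
        text = re.sub(r'\d+', '', text)      # digits match \w, so strip them explicitly
        text = re.sub(r'[^\w\s]', '', text)  # drop punctuation and other symbols
        return text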

    def tokenize_latin_text(text):
        """Lowercase and tokenize Latin text."""
        latin_tokenizer = LatinWordTokenizer()
        text = preprocess(text.lower())
        tokens = latin_tokenizer.tokenize(text)
        return tokens

    # ensure the directory to write exists
    if not os.path.exists(directory_to_write):
        os.makedirs(directory_to_write)
        print("Directory successfully created!")
    else:
        print(f"Directory {directory_to_write} already exists!")

    # process each file in the directory
    for file_name in os.listdir(directory_to_read):
        if file_name.endswith(".txt"):
            file_path = os.path.join(directory_to_read, file_name)
            tokens = tokenize_latin_text(read_file(file_path))

            if len(tokens) > threshold_to_slice:
                chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
                for i, chunk in enumerate(chunks):
                    chunk_text = " ".join(chunk)
                    chunk_file_name = f"{file_name[:-4]}_chunk{i + 1}.txt"
                    with open(os.path.join(directory_to_write, chunk_file_name), "w", encoding='utf-8') as f:
                        f.write(chunk_text)
            else:
                text = " ".join(tokens)
                with open(os.path.join(directory_to_write, file_name), "w", encoding='utf-8') as f:
                    f.write(text)

    print(f"""
    Every file has been written successfully.
    The new directory (path={directory_to_write}) contains {count_files(directory_to_write)} text samples.""")

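A note on the preprocessing regex: in Python's `re` module, digits belong to the `\w` class, so a pattern such as `[^\w\s]` used alone strips punctuation but leaves Arabic numerals in place. The sketch below illustrates the difference on a made-up sample sentence. The helper name `strip_digits_and_punctuation` is hypothetical and is not part of the notebook; the whitespace normalization at the end is likewise an added assumption.

```python
import re


def strip_digits_and_punctuation(text):
    """Remove Arabic numerals, then punctuation, then collapse whitespace."""
    text = re.sub(r"\d+", "", text)      # digits match \w, so remove them first
    text = re.sub(r"[^\w\s]", "", text)  # drop anything that is neither word char nor whitespace
    return " ".join(text.split())        # assumption: normalize runs of whitespace


sample = "Gallia est omnis divisa in partes tres, 1.1"
punctuation_only = re.sub(r"[^\w\s]", "", sample)
print(punctuation_only)                      # Gallia est omnis divisa in partes tres 11
print(strip_digits_and_punctuation(sample))  # Gallia est omnis divisa in partes tres
```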
In [12]:
%%time
directory_to_read = "../../corpora/corpus_test_chunks/"  # directory containing the texts to slice
# set the directory where you want to write the results
directory_to_write = "../../corpora/corpus_chunks/"
extract_chunks(directory_to_read=directory_to_read,
               directory_to_write=directory_to_write, threshold_to_slice=500, chunk_size=500)

Directory ../../corpora/corpus_chunks/ already exists!

    Every file has been written successfully.
    The new directory (path=../../corpora/corpus_chunks/) contains 1196 text samples.
CPU times: user 10.9 s, sys: 602 ms, total: 11.5 s
Wall time: 11.6 s

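One way to sanity-check the written samples is to recount the tokens in each output file and confirm that none exceeds the chunk size. The following is a rough sketch under stated assumptions: the helper name `chunk_lengths` is hypothetical, tokens are recounted by splitting on whitespace (which matches how the chunks are joined with spaces when written), and the demo runs on a temporary directory with made-up files rather than on the real corpus.

```python
import os
import tempfile


def chunk_lengths(directory):
    """Map each .txt file name in `directory` to its whitespace-token count."""
    lengths = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file() and entry.name.endswith(".txt"):
                with open(entry.path, "r", encoding="utf-8") as f:
                    lengths[entry.name] = len(f.read().split())
    return lengths


# Toy demo with two made-up chunk files (not the real corpus)
with tempfile.TemporaryDirectory() as tmp:
    for name, n_tokens in [("a_chunk1.txt", 500), ("a_chunk2.txt", 120)]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(" ".join(["verbum"] * n_tokens))
    lengths = chunk_lengths(tmp)
    oversized = {name: n for name, n in lengths.items() if n > 500}
    print(len(lengths), "files;", len(oversized), "exceed 500 tokens")
```

On the real output, the same call would be `chunk_lengths(directory_to_write)`.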