## Introduction: Parallel File Processing for 3GPP Documents

This Python script automates the processing of a collection of 3GPP document files. It handles large volumes of files in two parallel steps: unzipping the archives, then converting the documents. The script operates within a specified base directory, targeting the `3GPP-all` folder, and processes documents found in its subdirectories.

Key Features:
1. `file_exists`: Checks whether a file already exists, so that work already done can be skipped.
2. `unzip_task_directory`: Unzips the archives in the `3GPP-all` directory, skipping archives that have already been extracted.
3. Traverses the nested directory structure (release folders, then version folders) to collect the files to process.
4. Uses `ThreadPoolExecutor` to run the unzipping and document conversion tasks in parallel.
5. Uses multiple LibreOffice instances to convert documents in parallel.


In [None]:
import os
import zipfile
import subprocess
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def file_exists(file_path):
    return os.path.exists(file_path)

def unzip_task_directory(directory):
    print(f"Unzipping files in {directory}...")
    for item in os.listdir(directory):
        if item.endswith('.zip'):
            file_path = os.path.join(directory, item)
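            # Skip the archive if a file or folder named like it (minus '.zip') already exists.
            # splitext strips only the trailing extension, unlike str.replace.
            if not file_exists(os.path.splitext(file_path)[0]):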
                try:
                    with zipfile.ZipFile(file_path, 'r') as zip_ref:
                        zip_ref.extractall(directory)
                    print(f"  Unzipped {item}")
                except zipfile.BadZipFile:
                    print(f"  Warning: {file_path} is not a valid zip file and will be skipped.")
            else:
                print(f"  Skipping unzipping {item}, already exists.")


base_directory = "./3GPP-all"
base_directory = os.path.abspath(base_directory)
directories_to_process = []

# Collect directories for unzipping
for release_dir in os.listdir(base_directory):
    release_path = os.path.join(base_directory, release_dir)
    if os.path.isdir(release_path):
        for version_dir in os.listdir(release_path):
            version_path = os.path.join(release_path, version_dir)
            if os.path.isdir(version_path):
                directories_to_process.append(version_path)




## Unzip files

In [None]:


# First ThreadPoolExecutor for unzipping
with ThreadPoolExecutor(max_workers=10) as executor:
    # Consume the iterator so that exceptions raised in worker threads are re-raised here
    list(executor.map(unzip_task_directory, directories_to_process))
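
A possible variation, sketched under assumptions: `executor.map` only reports a worker exception when its results are iterated, so a failure in one directory can go unnoticed or stop the iteration early. The helper `run_in_parallel` below is hypothetical (it is not part of the original script). It uses `submit` and `as_completed` to record which inputs failed while the remaining tasks continue.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_in_parallel(task, items, max_workers=10):
    """Run task(item) for every item and return (succeeded, failed).

    Hypothetical helper for illustration; `failed` holds (item, exception) pairs.
    """
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the item it was created for
        future_to_item = {executor.submit(task, item): item for item in items}
        for future in as_completed(future_to_item):
            item = future_to_item[future]
            try:
                future.result()  # re-raises any exception from the worker
                succeeded.append(item)
            except Exception as exc:
                failed.append((item, exc))
    return succeeded, failed
```

With the names used in this notebook, a call could look like `succeeded, failed = run_in_parallel(unzip_task_directory, directories_to_process)`.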

### Use LibreOffice to convert doc files to docx and then to Markdown in parallel

The key problem I had to solve:
I ended up following a piece of advice for starting many LibreOffice instances in parallel. It works by adding a `-env:UserInstallation=file:///tmp/...` command-line option:

libreoffice -env:UserInstallation=file:///tmp/delete_me_#{timestamp} \
            --headless \
            --convert-to pdf \
            --outdir /tmp \
            /path/to/my_file.doc

The advice itself comes from a long discussion on a GitHub issue titled "Parallel conversions and synchronization".

GPT-4 breakdown:
The advice above describes a method for running multiple instances of LibreOffice in parallel for file conversion tasks. This technique is particularly useful when you need to process a large number of documents simultaneously, which is a common requirement in server-side applications and batch processing scripts.

Here's a breakdown of the response and how the method works:

1. **Multiple LibreOffice Instances**: By default, LibreOffice is designed to run as a single instance. This can be a limitation when trying to convert multiple documents at the same time, as each conversion task would need to wait for the previous one to complete.

2. **Using `-env:UserInstallation`**: The key to running multiple instances is the `-env:UserInstallation` command-line option. This option allows you to specify a unique user profile directory for each LibreOffice instance. By setting a different user profile directory for each instance (like using a unique `/tmp/delete_me_#{timestamp}` in the example), you essentially isolate these instances from each other.

3. **How it Works**:
   - `libreoffice`: The command to run LibreOffice.
   - `-env:UserInstallation=file:///tmp/delete_me_#{timestamp}`: This sets a unique user profile directory. The `#{timestamp}` part is a placeholder and should be replaced with a unique identifier for each instance, such as a timestamp or a unique sequence number.
   - `--headless`: This option runs LibreOffice without its GUI, which is necessary for server-side or command-line operations.
   - `--convert-to pdf`: This instructs LibreOffice to convert the input document to a PDF. This can be changed to other formats as needed.
   - `--outdir /tmp`: Specifies the output directory for the converted file.
   - `/path/to/my_file.doc`: The path to the document that needs to be converted.

4. **Benefits**:
   - **Parallel Processing**: This approach allows for true parallel processing of document conversions, significantly reducing the time required to process multiple files.
   - **Isolation of Instances**: Each instance operates independently, reducing the chances of conflicts or crashes affecting other instances.

5. **Use Cases**: This method is particularly beneficial in scenarios where you have to convert a large batch of documents in a short amount of time, such as in web servers, document management systems, or batch processing scripts.

6. **Cleanup**: Since this approach creates temporary user profiles, it's important to implement a cleanup mechanism to delete these temporary directories after the conversions are complete to avoid cluttering the file system.

This method is an effective solution for overcoming the limitations of LibreOffice's default single-instance mode, enabling efficient parallel processing of document conversion tasks.
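
The cleanup point in item 6 can be sketched as follows. This is a minimal illustration, not the notebook's own code: `build_convert_command` and `convert_with_temporary_profile` are hypothetical names, and the sketch assumes a `libreoffice` executable on the `PATH`. It relies on `tempfile.TemporaryDirectory`, which deletes the throwaway profile directory when the `with` block exits.

```python
import subprocess
import tempfile
from pathlib import Path

def build_convert_command(doc_path, out_dir, profile_dir, target_format="docx"):
    """Build the LibreOffice command line for one isolated conversion."""
    return [
        "libreoffice",
        # Path.as_uri() turns an absolute path into a file:// URI
        "-env:UserInstallation=" + Path(profile_dir).resolve().as_uri(),
        "--headless",
        "--convert-to", target_format,
        "--outdir", out_dir,
        doc_path,
    ]

def convert_with_temporary_profile(doc_path, out_dir, target_format="docx"):
    """Convert one document, then delete the temporary user profile."""
    # The profile directory is removed when the block exits,
    # even if the conversion raises an exception.
    with tempfile.TemporaryDirectory(prefix="libreoffice_profile_") as profile_dir:
        command = build_convert_command(doc_path, out_dir, profile_dir, target_format)
        subprocess.run(command, check=True, stderr=subprocess.PIPE)
```

Because every call creates its own profile directory, the function can be handed to a thread pool without the instances sharing state.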


In [None]:
import os
import subprocess
import uuid

def file_exists(file_path):
    return os.path.exists(file_path)


def convert_doc_to_docx_and_markdown(doc_path):
    directory = os.path.dirname(doc_path)
    docx_path = doc_path + 'x'
    markdown_file = os.path.splitext(docx_path)[0] + '.md'

    # Ensure a temp directory exists in the current working directory
    temp_dir = os.path.join(os.getcwd(), 'temp')
    os.makedirs(temp_dir, exist_ok=True)

    # Generate a unique identifier for the LibreOffice user profile.
    # A random UUID avoids the collisions a timestamp can produce when
    # several threads start a conversion at the same moment.
    unique_id = uuid.uuid4().hex

    # Create a LibreOffice user profile directory inside the temp folder
    temp_libreoffice_dir = os.path.join(temp_dir, f"libreoffice_temp_{unique_id}")
    os.makedirs(temp_libreoffice_dir, exist_ok=True)
    user_installation_path = f"file://{temp_libreoffice_dir}"

    # Convert DOC to DOCX
    if not file_exists(docx_path):
        try:
            subprocess.run([
                "libreoffice", 
                "-env:UserInstallation=" + user_installation_path,
                "--headless", 
                "--convert-to", "docx", 
                doc_path, 
                "--outdir", directory], 
                check=True, 
                stderr=subprocess.PIPE)
            print(f"    Converted {os.path.basename(doc_path)} to .docx")
        except subprocess.CalledProcessError as e:
            print(f"    Error converting {os.path.basename(doc_path)} to .docx: {e}")
            print(f"    LibreOffice error: {e.stderr.decode()}")

    # Check if DOCX file exists before converting to Markdown
    if file_exists(docx_path):
        if not file_exists(markdown_file):
            try:
                subprocess.run(['pandoc', '-s', docx_path, '-o', markdown_file], check=True)
                print(f"    Converted {os.path.basename(docx_path)} to Markdown")
            except subprocess.CalledProcessError as e:
                print(f"    Error converting {os.path.basename(docx_path)} to Markdown: {e}")
    else:
        print(f"    {docx_path} does not exist. Skipping Markdown conversion.")

def process_task_file(doc_file):
    print(f"Processing {doc_file}...")
    convert_doc_to_docx_and_markdown(doc_file)


## Convert files to docx and Markdown format in parallel

In [None]:
# Collect .doc files for processing
doc_files_to_process = []
for directory in directories_to_process:
    for item in os.listdir(directory):
        if item.endswith('.doc'):
            doc_files_to_process.append(os.path.join(directory, item))
            
# Second ThreadPoolExecutor for processing .doc files
with ThreadPoolExecutor(max_workers=20) as executor:
    list(executor.map(process_task_file, doc_files_to_process)) 
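
The first cell imports `TimeoutError`, but no timeout is applied anywhere, so a conversion process that hangs would keep its worker thread busy for as long as it runs. One way to bound each external command is the `timeout` argument of `subprocess.run`, sketched below. The helper name `run_with_timeout` and the 120-second default are assumptions made for illustration.

```python
import subprocess

def run_with_timeout(command, timeout_seconds=120):
    """Run command; return True on success, False on failure or timeout."""
    try:
        subprocess.run(
            command,
            check=True,
            stderr=subprocess.PIPE,
            timeout=timeout_seconds,
        )
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising TimeoutExpired
        print(f"    Timed out after {timeout_seconds}s: {command[0]}")
        return False
    except subprocess.CalledProcessError as e:
        print(f"    Command failed: {e.stderr.decode(errors='replace')}")
        return False
```

The LibreOffice and pandoc argument lists used in `convert_doc_to_docx_and_markdown` could be passed to this helper unchanged.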

## Clean up the folder: copy the files to a new folder, then keep only the Markdown and docx files

In [None]:
import os

def clean_directory(directory, keep_extensions=('.docx', '.md')):
    # A tuple default avoids the shared-mutable-default pitfall of a list
    total_files = sum(len(files) for _, _, files in os.walk(directory))
    processed_files = 0

    for root, dirs, files in os.walk(directory):
        for file in files:
            processed_files += 1
            if not any(file.endswith(ext) for ext in keep_extensions):
                file_path = os.path.join(root, file)
                print(f"Deleting: {file_path}")
                os.remove(file_path)  # Remove the file
            
            # Update and display the progress
            progress = (processed_files / total_files) * 100
            print(f"Progress: {progress:.2f}% ({processed_files}/{total_files})")

# Path to the directory you want to clean
directory_path = './3GPP-clean'

# Perform the cleaning
clean_directory(directory_path)
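
The copy step mentioned in the heading is not shown in the cell above, which deletes files in place. A sketch of one way to do it is below, assuming the source is `./3GPP-all` and the destination is `./3GPP-clean`. The helper `copy_tree_keeping` is hypothetical; it uses the `ignore` callback of `shutil.copytree` so that only the wanted extensions are copied in the first place.

```python
import os
import shutil

def copy_tree_keeping(src, dst, keep_extensions=(".docx", ".md")):
    """Copy src to dst, keeping the folder layout but only matching files."""
    keep = tuple(keep_extensions)

    def ignore(directory, names):
        # copytree skips every name returned here; directories are never skipped
        return [
            name for name in names
            if not os.path.isdir(os.path.join(directory, name))
            and not name.endswith(keep)
        ]

    # Raises FileExistsError if dst already exists
    shutil.copytree(src, dst, ignore=ignore)
```

A call such as `copy_tree_keeping('./3GPP-all', './3GPP-clean')` would leave the original folder untouched, which makes the cleanup easier to repeat.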



## 3GPP-clean Directory Markdown File Size Analysis

This Python script is designed to analyze the file sizes of Markdown (`.md`) documents in the `3GPP-clean` directory structure. The script will:

1. Traverse through the `Rel-*` folders, each corresponding to a different release of the 3GPP documentation.
2. Within each release, iterate through version subfolders.
3. Calculate the accumulated file size of all `.md` files within each version and release.
4. Compile this data into a comprehensive report, breaking down the sizes by version and release.
5. Convert file sizes to a more human-readable format (megabytes).
6. Save this report as a JSON file for easy reference.
7. Print a summary to the console for the entire repository and each individual release.

This utility is particularly useful for managing and understanding the distribution of document sizes within structured documentation repositories.

### How to Run the Script

- Ensure the script is executed in an environment with access to the `3GPP-clean` directory.
- Modify `directory_path` in the script to point to the location of your `3GPP-clean` directory.
- Run the script using a Python interpreter.
- The output will be a JSON file named `md_sizes_report.json`, and a console printout of the summarized data.

Below is the Python script that performs this analysis:


In [2]:
import os
import json

def bytes_to_megabytes(bytes_value):
    return bytes_value / (1024 * 1024)

def calculate_md_sizes(directory):
    report = {"total_size": 0, "releases": {}}

    for release in os.listdir(directory):
        release_path = os.path.join(directory, release)
        if os.path.isdir(release_path):
            release_size = 0
            report["releases"][release] = {"total_size": 0, "versions": {}}

            for version in os.listdir(release_path):
                version_path = os.path.join(release_path, version)
                if os.path.isdir(version_path):
                    version_size = 0

                    for file in os.listdir(version_path):
                        if file.endswith('.md'):
                            file_path = os.path.join(version_path, file)
                            version_size += os.path.getsize(file_path)

                    report["releases"][release]["versions"][version] = bytes_to_megabytes(version_size)
                    report["releases"][release]["total_size"] += version_size
                    release_size += version_size

            report["releases"][release]["total_size"] = bytes_to_megabytes(release_size)
            report["total_size"] += release_size

    report["total_size"] = bytes_to_megabytes(report["total_size"])
    return report

def save_report_to_json(report, filename):
    with open(filename, 'w') as file:
        json.dump(report, file, indent=4)

def print_summary(report):
    print(f"Total size of .md files in the repository: {report['total_size']:.2f} MB")
    for release, data in report['releases'].items():
        print(f"{release}: Total size of .md files: {data['total_size']:.2f} MB")

# Main execution
directory_path = './3GPP-clean'
md_sizes_report = calculate_md_sizes(directory_path)
json_filename = 'md_sizes_report.json'
save_report_to_json(md_sizes_report, json_filename)
print_summary(md_sizes_report)


Total size of .md files in the repository: 4238.73 MB
Rel-11: Total size of .md files: 256.03 MB
Rel-10: Total size of .md files: 215.36 MB
Rel-12: Total size of .md files: 280.84 MB
.ipynb_checkpoints: Total size of .md files: 0.00 MB
Rel-18: Total size of .md files: 594.03 MB
Rel-14: Total size of .md files: 340.49 MB
Rel-19: Total size of .md files: 9.28 MB
Rel-15: Total size of .md files: 523.03 MB
Rel-13: Total size of .md files: 353.18 MB
Rel-16: Total size of .md files: 548.83 MB
Rel-8: Total size of .md files: 186.79 MB
Rel-9: Total size of .md files: 194.27 MB
Rel-17: Total size of .md files: 736.61 MB


## 3GPP Documentation Analysis

This repository contains analysis data for the 3GPP documentation releases. The primary focus is on the file sizes of Markdown documents within each release.

## File Size Analysis

The analysis involves calculating the total size of Markdown (`.md`) files in each release of the 3GPP documentation. The data provides insights into the volume of documentation across different releases.

### Graphical Representation

Below is a bar plot that shows the total size of `.md` files in each release, from `Rel-8` to `Rel-19`. The sizes are represented in megabytes (MB).

<!-- ![3GPP Releases MD File Sizes](results/3gpp_releases_md_file_sizes.png) -->
<img src="3gpp_releases_md_file_sizes.png" alt="3GPP Releases MD File Sizes" width="50%" height="50%">



In [1]:
import os
import json

def bytes_to_megabytes(bytes_value):
    return bytes_value / (1024 * 1024)

def count_words_in_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        contents = file.read()
        words = contents.split()
        return len(words)

def calculate_md_sizes_and_word_count(directory):
    report = {"total_size": 0, "total_words": 0, "releases": {}}

    for release in os.listdir(directory):
        release_path = os.path.join(directory, release)
        if os.path.isdir(release_path):
            release_size, release_word_count = 0, 0
            report["releases"][release] = {"total_size": 0, "total_words": 0, "versions": {}}

            for version in os.listdir(release_path):
                version_path = os.path.join(release_path, version)
                if os.path.isdir(version_path):
                    version_size, version_word_count = 0, 0

                    for file in os.listdir(version_path):
                        if file.endswith('.md'):
                            file_path = os.path.join(version_path, file)
                            version_size += os.path.getsize(file_path)
                            version_word_count += count_words_in_file(file_path)

                    report["releases"][release]["versions"][version] = {
                        "size_mb": bytes_to_megabytes(version_size),
                        "words": version_word_count
                    }
                    report["releases"][release]["total_size"] += version_size
                    report["releases"][release]["total_words"] += version_word_count
                    release_size += version_size
                    release_word_count += version_word_count

            report["releases"][release]["total_size"] = bytes_to_megabytes(release_size)
            report["releases"][release]["total_words"] = release_word_count
            report["total_size"] += release_size
            report["total_words"] += release_word_count

    report["total_size"] = bytes_to_megabytes(report["total_size"])
    return report

def save_report_to_json(report, filename):
    with open(filename, 'w') as file:
        json.dump(report, file, indent=4)

def print_summary(report):
    print(f"Total size of .md files in the repository: {report['total_size']:.2f} MB")
    print(f"Total words in .md files in the repository: {report['total_words']}")
    for release, data in report['releases'].items():
        print(f"{release}: Total size of .md files: {data['total_size']:.2f} MB, Total words: {data['total_words']}")

# Main execution
directory_path = './3GPP-clean'
md_sizes_report = calculate_md_sizes_and_word_count(directory_path)
json_filename = 'md_sizes_word_count_report.json'
save_report_to_json(md_sizes_report, json_filename)
print_summary(md_sizes_report)


Total size of .md files in the repository: 4238.73 MB
Total words in .md files in the repository: 534914482
Rel-11: Total size of .md files: 256.03 MB, Total words: 32817026
Rel-10: Total size of .md files: 215.36 MB, Total words: 27820131
Rel-12: Total size of .md files: 280.84 MB, Total words: 36218498
Rel-18: Total size of .md files: 594.03 MB, Total words: 73825439
Rel-14: Total size of .md files: 340.49 MB, Total words: 43484442
Rel-19: Total size of .md files: 9.28 MB, Total words: 1221658
Rel-15: Total size of .md files: 523.03 MB, Total words: 65165959
Rel-13: Total size of .md files: 353.18 MB, Total words: 45118710
Rel-16: Total size of .md files: 548.83 MB, Total words: 69425169
Rel-8: Total size of .md files: 186.79 MB, Total words: 24117232
Rel-9: Total size of .md files: 194.27 MB, Total words: 24953249
Rel-17: Total size of .md files: 736.61 MB, Total words: 90746969
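
`count_words_in_file` reads each file fully into memory before splitting it. For very large Markdown files, a line-by-line variant is a possible alternative, sketched below. The name `count_words_streaming` is hypothetical, and `errors="replace"` is an assumption that changes behavior: undecodable bytes are replaced instead of raising `UnicodeDecodeError`.

```python
def count_words_streaming(file_path, encoding="utf-8"):
    """Count whitespace-separated words without loading the whole file."""
    total = 0
    with open(file_path, "r", encoding=encoding, errors="replace") as file:
        for line in file:
            # A word never spans a line break, so per-line counts add up
            total += len(line.split())
    return total
```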
