## Introduction: Parallel File Processing for 3GPP Documents

This Python script is designed to automate and optimize the processing of document files, specifically for a collection of 3GPP documents. It follows a two-step parallel processing approach to efficiently handle large volumes of files. The script operates within a specified base directory, targeting the `3GPP-all` folder, and processes documents found in its subdirectories.

Key Features:
1. `file_exists`: Verifies the existence of files, ensuring efficient handling of file operations.
2. `unzip_task_directory`: Automates the unzipping of archives in the `3GPP-all` directory, with checks to avoid unnecessary processing of already unzipped files.
3. Systematic traversal through nested directory structures, identifying and preparing files for processing.
4. Uses `ThreadPoolExecutor` to run the unzipping and document conversion tasks in parallel.
5. Runs multiple LibreOffice instances so that documents can be converted in parallel.
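
As a rough illustration of features 2 and 4, the sketch below unzips every archive under a directory in parallel and skips archives that were already extracted. The function names `unzip_one` and `unzip_all`, the `max_workers` value, and the rule "an archive counts as unzipped when a sibling folder with the same stem exists" are assumptions made for this example; the script's own `unzip_task_directory` may work differently.

```python
import zipfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def unzip_one(zip_path):
    """Extract one archive next to itself; return True if work was done."""
    zip_path = Path(zip_path)
    target = zip_path.with_suffix("")  # e.g. Rel-17/doc.zip -> Rel-17/doc/
    if target.exists():  # assumed "already unzipped" check
        return False
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return True


def unzip_all(base_dir, max_workers=8):
    """Unzip every .zip file below base_dir using a thread pool."""
    archives = sorted(Path(base_dir).rglob("*.zip"))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(unzip_one, archives))
    return sum(results)  # number of archives extracted in this run
```

Calling `unzip_all("3GPP-all")` a second time should extract nothing, because every archive already has its target folder. A thread pool is a reasonable fit here since the work is dominated by file I/O.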


## Unzip files

### Use LibreOffice to convert doc files to docx and then to markdown in parallel

The key problem I had to solve:
I ended up following some advice to start many LibreOffice instances in parallel. This works by adding a `-env:UserInstallation=file:///tmp/...` command-line argument:

    libreoffice -env:UserInstallation=file:///tmp/delete_me_#{timestamp} \
                --headless \
                --convert-to pdf \
                --outdir /tmp \
                /path/to/my_file.doc

The advice came from a long discussion on a GitHub issue titled "Parallel conversions and synchronization".

GPT-4 breakdown:
The advice above describes a method for running multiple instances of LibreOffice in parallel for file conversion tasks. This is useful when many documents have to be processed at once, a common requirement in server-side applications and batch processing scripts.

Here’s a breakdown of the response and how the method works:

1. **Multiple LibreOffice Instances**: By default, LibreOffice is designed to run as a single instance. This can be a limitation when trying to convert multiple documents at the same time, as each conversion task would need to wait for the previous one to complete.

2. **Using `-env:UserInstallation`**: The key to running multiple instances is the `-env:UserInstallation` command-line option. This option allows you to specify a unique user profile directory for each LibreOffice instance. By setting a different user profile directory for each instance (like using a unique `/tmp/delete_me_#{timestamp}` in the example), you essentially isolate these instances from each other.

3. **How it Works**:
   - `libreoffice`: The command to run LibreOffice.
   - `-env:UserInstallation=file:///tmp/delete_me_#{timestamp}`: This sets a unique user profile directory. The `#{timestamp}` part is a placeholder and should be replaced with a unique identifier for each instance, such as a timestamp or a unique sequence number.
   - `--headless`: This option runs LibreOffice without its GUI, which is necessary for server-side or command-line operations.
   - `--convert-to pdf`: This instructs LibreOffice to convert the input document to a PDF. This can be changed to other formats as needed.
   - `--outdir /tmp`: Specifies the output directory for the converted file.
   - `/path/to/my_file.doc`: The path to the document that needs to be converted.

4. **Benefits**:
   - **Parallel Processing**: This approach allows for true parallel processing of document conversions, significantly reducing the time required to process multiple files.
   - **Isolation of Instances**: Each instance operates independently, reducing the chances of conflicts or crashes affecting other instances.

5. **Use Cases**: This method is particularly beneficial in scenarios where you have to convert a large batch of documents in a short amount of time, such as in web servers, document management systems, or batch processing scripts.

6. **Cleanup**: Since this approach creates temporary user profiles, it's important to implement a cleanup mechanism to delete these temporary directories after the conversions are complete to avoid cluttering the file system.

This method is an effective solution for overcoming the limitations of LibreOffice's default single-instance mode, enabling efficient parallel processing of document conversion tasks.
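
The breakdown above can be sketched in Python roughly as follows. This is a minimal sketch, not the script's actual code: the helper names `build_command`, `convert_one`, and `convert_many`, the `lo_profile_` prefix, and the default of four workers are assumptions. The command-line flags are the ones from the example above, and the sketch assumes a `libreoffice` executable is on the `PATH`.

```python
import shutil
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(input_path, out_dir, profile_dir, target_format="docx"):
    """Build the LibreOffice command line for one isolated conversion."""
    profile_uri = Path(profile_dir).resolve().as_uri()  # file:///...
    return [
        "libreoffice",
        f"-env:UserInstallation={profile_uri}",
        "--headless",
        "--convert-to", target_format,
        "--outdir", str(out_dir),
        str(input_path),
    ]


def convert_one(input_path, out_dir, target_format="docx"):
    """Convert one file using its own throwaway user profile."""
    profile_dir = tempfile.mkdtemp(prefix="lo_profile_")
    try:
        cmd = build_command(input_path, out_dir, profile_dir, target_format)
        return subprocess.run(cmd, capture_output=True, text=True).returncode
    finally:
        # Cleanup step: remove the temporary profile even if conversion fails
        shutil.rmtree(profile_dir, ignore_errors=True)


def convert_many(input_paths, out_dir, max_workers=4):
    """Run several conversions at once; returns one exit code per input."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: convert_one(p, out_dir), input_paths))
```

Threads are sufficient here because each worker only waits on an external process. `tempfile.mkdtemp` gives every instance a unique profile directory, which plays the role of the `#{timestamp}` placeholder.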


## Convert files to DOCX and Markdown format in parallel

## Clean up the folder

First we copy the files to a new folder, then keep only the Markdown and DOCX files.
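
A minimal sketch of this cleanup step is shown below. It folds the copy and the filtering into a single pass by copying only the wanted files. The function name `copy_clean` and the `KEEP_SUFFIXES` set are assumptions for illustration, and the destination is assumed to lie outside the source tree.

```python
import shutil
from pathlib import Path

KEEP_SUFFIXES = {".md", ".docx"}


def copy_clean(src_dir, dst_dir, keep_suffixes=KEEP_SUFFIXES):
    """Copy only files with the wanted suffixes, preserving the folder layout."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    copied = 0
    for path in src_dir.rglob("*"):
        if path.is_file() and path.suffix.lower() in keep_suffixes:
            target = dst_dir / path.relative_to(src_dir)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)  # copy2 also keeps timestamps
            copied += 1
    return copied
```

For example, `copy_clean("3GPP-all", "3GPP-clean")` would leave the original tree untouched and return the number of files copied.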

## 3GPP-Clean Directory Markdown and DOCX File Size Analysis

This Python script is designed to analyze the file sizes of Markdown (`.md`) documents in the `3GPP-clean` directory structure. The script will:

1. Traverse through the `Rel-*` folders, each corresponding to a different release of the 3GPP documentation.
2. Within each release, iterate through version subfolders.
3. Calculate the accumulated file size of all `.md` files within each version and release.
4. Compile this data into a comprehensive report, breaking down the sizes by version and release.
5. Convert file sizes to a more human-readable format (megabytes).
6. Save this report as a JSON file for easy reference.
7. Print a summary to the console for the entire repository and each individual release.

This utility is particularly useful for managing and understanding the distribution of document sizes within structured documentation repositories.

### How to Run the Script

- Ensure the script is executed in an environment with access to the `3GPP-clean` directory.
- Modify `directory_path` in the script to point to the location of your `3GPP-clean` directory.
- Run the script using a Python interpreter.
- The output will be a JSON file named `md_sizes_report.json`, and a console printout of the summarized data.

Below is the Python script that performs this analysis:
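
The analysis can be sketched roughly as follows. This is a reconstruction from the steps listed above rather than the original code: the function names, the JSON layout, and the choice of 1 MB = 1024 × 1024 bytes are assumptions, while `directory_path` and `md_sizes_report.json` are the names used in the text.

```python
import json
from pathlib import Path

BYTES_PER_MB = 1024 * 1024  # assumed definition of a megabyte


def md_size_report(directory_path):
    """Sum .md file sizes per version and per release, in megabytes."""
    report = {"releases": {}, "total_mb": 0.0}
    total_bytes = 0
    for release_dir in sorted(Path(directory_path).glob("Rel-*")):
        if not release_dir.is_dir():
            continue
        versions = {}
        release_bytes = 0
        # Only .md files inside version subfolders are counted
        for version_dir in sorted(p for p in release_dir.iterdir() if p.is_dir()):
            size = sum(f.stat().st_size for f in version_dir.rglob("*.md") if f.is_file())
            versions[version_dir.name] = round(size / BYTES_PER_MB, 2)
            release_bytes += size
        report["releases"][release_dir.name] = {
            "total_mb": round(release_bytes / BYTES_PER_MB, 2),
            "versions": versions,
        }
        total_bytes += release_bytes
    report["total_mb"] = round(total_bytes / BYTES_PER_MB, 2)
    return report


def save_and_print(report, output_path="md_sizes_report.json"):
    """Write the report as JSON and print a per-release summary."""
    with open(output_path, "w", encoding="utf-8") as fh:
        json.dump(report, fh, indent=2)
    print(f"Whole repository: {report['total_mb']} MB")
    for name, data in report["releases"].items():
        print(f"{name}: {data['total_mb']} MB")
```

A typical call would be `save_and_print(md_size_report(directory_path))` with `directory_path` pointing at the `3GPP-clean` directory.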


## 3GPP Documentation Analysis

This repository contains analysis data for the 3GPP documentation releases. The primary focus is on the file sizes of Markdown documents within each release.

## File Size Analysis

The analysis involves calculating the total size of Markdown (`.md`) files in each release of the 3GPP documentation. The data provides insights into the volume of documentation across different releases.

### Graphical Representation

Below is a bar plot that shows the total size of `.md` files in each release, from `Rel-8` to `Rel-19`. The sizes are represented in megabytes (MB).

<!-- ![3GPP Releases MD File Sizes](results/3gpp_releases_md_file_sizes.png) -->
<img src="3gpp_releases_md_file_sizes.png" alt="3GPP Releases MD File Sizes" width="50%" height="50%">



In [1]:
import os

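def chunk_text(text, chunk_size=300, overlap=100):
    # Split text into windows of chunk_size words; consecutive windows share `overlap` words
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for i in range(0, len(words), step):
        chunk = ' '.join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks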


In [3]:
import torch
print(torch.cuda.is_available())


False


In [5]:
import sqlite3

def store_chunks_in_db(chunks, db_path='chunks.db'):
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute('CREATE TABLE IF NOT EXISTS chunks (id INTEGER PRIMARY KEY, text TEXT)')

    for chunk in chunks:
        cursor.execute('INSERT INTO chunks (text) VALUES (?)', (chunk,))

    conn.commit()
    conn.close()


In [6]:
# !pip install sentence_transformers

In [7]:
from sentence_transformers import SentenceTransformer
import numpy as np
import torch

def extract_chunks_from_db(db_path='chunks.db'):
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute('SELECT text FROM chunks')
    rows = cursor.fetchall()
    conn.close()
    return [row[0] for row in rows]



def convert_chunks_to_vectors(chunks):
    # Check if CUDA is available and set the device accordingly
    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    # Load the model and move it to the appropriate device
    model = SentenceTransformer('all-MiniLM-L6-v2', device=device)

    # Encode the chunks using the model on the specified device
    chunk_embeddings = model.encode(chunks, convert_to_tensor=True, device=device)

    # Move the embeddings to CPU and convert to numpy array
    return chunk_embeddings.cpu().detach().numpy()





In [9]:
def store_embeddings(embeddings, file_path='embeddings.npy'):
    np.save(file_path, embeddings)


In [10]:
import os

def collect_md_files(directory):
    md_files = []
    for root, _, files in os.walk(directory):
        for file in files:
            if file.endswith('.md'):
                md_files.append(os.path.join(root, file))
    return md_files


In [11]:
def read_md_files(md_files):
    contents = []
    for file_path in md_files:
        with open(file_path, 'r', encoding='utf-8') as file:
            contents.append(file.read())
    return contents


In [9]:
# from tqdm import tqdm

# # Assuming the functions from the previous response are already defined:
# # chunk_text, store_chunks_in_db, extract_chunks_from_db, convert_chunks_to_vectors, store_embeddings

# # Main execution
# directory_path = '/kaggle/input/3GPP-clean'  # Adjust this path to your directory
# md_files = collect_md_files(directory_path)
# md_contents = read_md_files(md_files)

# # Process each document with a progress bar
# for content in tqdm(md_contents, desc="Processing Documents"):
#     chunks = chunk_text(content)

#     print("Storing chunks in SQL")
#     store_chunks_in_db(chunks)

#     print("Extracting chunks from SQL")
#     extracted_chunks = extract_chunks_from_db()

#     print("Converting chunks to vectors")
#     embeddings = convert_chunks_to_vectors(extracted_chunks)

#     print("Storing embeddings as numpy array")
#     store_embeddings(embeddings)


In [13]:
from functools import lru_cache
from sentence_transformers import SentenceTransformer
import torch
import numpy as np
from tqdm import tqdm
import gc

@lru_cache(maxsize=1)
def load_model(device):
    # Load the model once per device; reloading it on every call is slow
    return SentenceTransformer('all-MiniLM-L6-v2', device=device)

def convert_chunks_to_vectors(chunks, batch_size=16):  # Reduced batch size
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model = load_model(device)

    all_embeddings = []
    # Encode in small batches to keep peak memory low
    for i in tqdm(range(0, len(chunks), batch_size), desc="Encoding Batches"):
        batch_chunks = chunks[i:i + batch_size]
        batch_embeddings = model.encode(batch_chunks, convert_to_tensor=True, device=device)
        all_embeddings.append(batch_embeddings.cpu().detach().numpy())

    return np.concatenate(all_embeddings, axis=0)

# Main execution
directory_path = os.path.join('3GPP-clean', 'Rel-17 - Copy')  # Adjust this path to your directory
md_files = collect_md_files(directory_path)
md_contents = read_md_files(md_files)

# Process each document with a progress bar
for content in tqdm(md_contents, desc="Processing Documents"):
    chunks = chunk_text(content)
    
    print("Storing chunks in SQL")
    store_chunks_in_db(chunks)
    
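    # Note: this reads back every chunk stored so far, not only this document's,
    # so each iteration re-encodes the chunks of all earlier documents as well.
    print("Extracting chunks from SQL")
    extracted_chunks = extract_chunks_from_db()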
    
    print("Converting chunks to vectors")
    embeddings = convert_chunks_to_vectors(extracted_chunks)
    
    print("Storing embeddings as numpy array")
    store_embeddings(embeddings)
    
    # Clear memory
    del chunks, extracted_chunks, embeddings
    torch.cuda.empty_cache()
    gc.collect()


Processing Documents:   0%|          | 0/2812 [00:00<?, ?it/s]

Storing chunks in SQL
Extracting chunks from SQL
Converting chunks to vectors


Encoding Batches: 100%|██████████| 49/49 [00:28<00:00,  1.69it/s]
Processing Documents:   0%|          | 1/2812 [00:34<26:50:24, 34.37s/it]

Storing embeddings as numpy array
Storing chunks in SQL
Extracting chunks from SQL
Converting chunks to vectors


Encoding Batches: 100%|██████████| 52/52 [00:30<00:00,  1.73it/s]
Processing Documents:   0%|          | 2/2812 [01:06<25:55:40, 33.22s/it]

Storing embeddings as numpy array
Storing chunks in SQL
Extracting chunks from SQL
Converting chunks to vectors


Encoding Batches: 100%|██████████| 52/52 [00:31<00:00,  1.68it/s]
Processing Documents:   0%|          | 3/2812 [01:40<25:57:40, 33.27s/it]

Storing embeddings as numpy array
Storing chunks in SQL
Extracting chunks from SQL
Converting chunks to vectors


Encoding Batches: 100%|██████████| 52/52 [00:30<00:00,  1.73it/s]
Processing Documents:   0%|          | 4/2812 [02:12<25:47:20, 33.06s/it]

Storing embeddings as numpy array
Storing chunks in SQL
Extracting chunks from SQL
Converting chunks to vectors


Encoding Batches: 100%|██████████| 53/53 [00:29<00:00,  1.79it/s]
Processing Documents:   0%|          | 5/2812 [02:44<25:26:39, 32.63s/it]

Storing embeddings as numpy array
Storing chunks in SQL
Extracting chunks from SQL
Converting chunks to vectors


Encoding Batches: 100%|██████████| 59/59 [00:30<00:00,  1.93it/s]
Processing Documents:   0%|          | 6/2812 [03:17<25:33:03, 32.78s/it]

Storing embeddings as numpy array
Storing chunks in SQL
Extracting chunks from SQL
Converting chunks to vectors


Encoding Batches: 100%|██████████| 60/60 [00:32<00:00,  1.87it/s]
Processing Documents:   0%|          | 7/2812 [03:52<26:01:56, 33.41s/it]

Storing embeddings as numpy array
Storing chunks in SQL
Extracting chunks from SQL
Converting chunks to vectors


Encoding Batches: 100%|██████████| 67/67 [00:35<00:00,  1.88it/s]
Processing Documents:   0%|          | 8/2812 [04:30<27:06:54, 34.81s/it]

Storing embeddings as numpy array
Storing chunks in SQL
Extracting chunks from SQL
Converting chunks to vectors


Encoding Batches: 100%|██████████| 75/75 [00:40<00:00,  1.84it/s]
Processing Documents:   0%|          | 9/2812 [05:13<29:07:09, 37.40s/it]

Storing embeddings as numpy array
Storing chunks in SQL
Extracting chunks from SQL
Converting chunks to vectors


Encoding Batches:  33%|███▎      | 37/112 [00:19<00:40,  1.85it/s]
Processing Documents:   0%|          | 9/2812 [05:35<29:01:50, 37.29s/it]


KeyboardInterrupt: 