### Overview 
* This notebook documents marker, a GitHub repository for extracting text from PDFs and converting it to Markdown: https://github.com/VikParuchuri/marker?tab=readme-ov-file.
* The main package is called marker-pdf. Install it with " *pip install marker-pdf* ".
* Be careful: by default this installs torch with CPU support only. Before installing, open a terminal in the folder where you will work and where you will keep your PDFs and the parsed Markdown.
* Create a virtual environment first with " *python -m venv parse_pdfs* ", then activate it with " *parse_pdfs\Scripts\activate* " (this is the Windows path; on Linux or macOS use " *source parse_pdfs/bin/activate* "). After that, install the marker-pdf package inside the environment.
* If you have an Nvidia GPU, go to *www.pytorch.org* and install a PyTorch build with CUDA support in your virtual environment. The site generates the command for your CUDA version; the one I used was " *pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121* "
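
To confirm which torch build ended up in the virtual environment, a quick check like the following can help. This is a minimal sketch, not part of marker itself: the helper name `describe_torch_device` is hypothetical, and it only relies on `torch.cuda.is_available()` from PyTorch.

```python
import importlib.util


def describe_torch_device():
    """Return 'missing', 'cpu', or 'cuda' depending on the local PyTorch install.

    Hypothetical helper for illustration; marker does not ship this function.
    """
    # Avoid an ImportError if torch has not been installed yet
    if importlib.util.find_spec("torch") is None:
        return "missing"
    import torch

    # True only when a CUDA-enabled build finds a usable GPU
    return "cuda" if torch.cuda.is_available() else "cpu"


print(describe_torch_device())
```

If this prints `cpu` on a machine with an Nvidia GPU, the CPU-only build is probably still installed and the CUDA build needs to be installed over it.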

### Convert a single file
* Convert a single file with this command: " *marker_single /path/to/file.pdf /path/to/output/folder --batch_multiplier 2 --max_pages 10 --langs English* "
* --batch_multiplier is the factor by which the default batch sizes are multiplied if you have extra VRAM. Higher values use more VRAM but process faster. It is set to 2 by default. The default batch sizes take ~3GB of VRAM.
* --max_pages is the maximum number of pages to process. Omit it to convert the entire document.
* --langs is a comma-separated list of the languages in the document, used for OCR.
* In my experience, this shorter command was enough for my use case: " *marker_single /path/to/file.pdf /path/to/output/folder* "
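
When the conversion is driven from Python rather than typed by hand, the command above can be assembled as an argument list. The sketch below uses only the flags listed in this section; the function name `build_marker_single_command` and its parameters are hypothetical, and the call that would actually launch marker is left commented out because it requires marker-pdf to be installed.

```python
import subprocess


def build_marker_single_command(pdf_path, output_dir, batch_multiplier=None,
                                max_pages=None, langs=None):
    """Build the argument list for marker_single (hypothetical helper)."""
    command = ["marker_single", str(pdf_path), str(output_dir)]
    # Optional flags are added only when a value is given
    if batch_multiplier is not None:
        command += ["--batch_multiplier", str(batch_multiplier)]
    if max_pages is not None:
        command += ["--max_pages", str(max_pages)]
    if langs:
        # --langs expects a comma-separated list
        command += ["--langs", ",".join(langs)]
    return command


command = build_marker_single_command(
    "/path/to/file.pdf", "/path/to/output/folder", max_pages=10, langs=["English"]
)
print(command)

# To run the conversion (assumes marker-pdf is installed in the active environment):
# subprocess.run(command, check=True)
```

Passing a list instead of a single string avoids quoting problems when paths contain spaces.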

### Convert multiple files
* Convert multiple files with this command: " *marker /path/to/input/folder /path/to/output/folder --workers 10 --max 10 --metadata_file /path/to/metadata.json --min_length 10000* "
* --workers is the number of PDFs to convert at once. It is set to 1 by default; increasing it raises throughput at the cost of more CPU/GPU usage. On a GPU, parallelism will not increase beyond INFERENCE_RAM / VRAM_PER_TASK.
* --max is the maximum number of PDFs to convert. Omit it to convert all PDFs in the folder.
* --min_length is the minimum number of characters that must be extracted from a PDF before it is considered for processing. If you are processing a lot of PDFs, I recommend setting it to avoid running OCR on PDFs that are mostly images, which slows everything down.
* --metadata_file is an optional path to a JSON file with metadata about the PDFs. If you provide it, it is used to set the language for each PDF. If not, DEFAULT_LANG is used.
* In my experience, this shorter command was enough for my use case: " *marker /path/to/input/folder /path/to/output/folder* "
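
The cap on parallelism described for --workers is simple arithmetic, sketched below. INFERENCE_RAM and VRAM_PER_TASK are the setting names mentioned above; the numeric values (16 GB and 4 GB), the rounding down, the floor of one worker, and the function name `effective_workers` are assumptions made for illustration.

```python
import math


def effective_workers(requested_workers, inference_ram_gb, vram_per_task_gb):
    """Estimate how many PDFs can be converted in parallel on a GPU.

    Hypothetical helper: assumes the limit is INFERENCE_RAM / VRAM_PER_TASK
    rounded down, and that at least one worker always runs.
    """
    gpu_limit = math.floor(inference_ram_gb / vram_per_task_gb)
    return max(1, min(requested_workers, gpu_limit))


# Requesting 10 workers with 16 GB of GPU memory and 4 GB per task
print(effective_workers(10, 16, 4))  # → 4
```

Under these assumed values, asking for 10 workers would not help: only 4 tasks fit in GPU memory at once.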

In [1]:
import re
import json
import tkinter as tk
from tkinter import filedialog

# Function to choose multiple input files
def choose_files():
    root = tk.Tk()
    root.withdraw()  # Hide the root window
    file_paths = filedialog.askopenfilenames(
        title="Select Markdown files",
        filetypes=(("Markdown files", "*.md"), ("All files", "*.*"))
    )
    root.destroy()  # Release the hidden root window once the dialog closes
    return file_paths

# Function to choose the location to save the output file
def save_file_dialog():
    root = tk.Tk()
    root.withdraw()  # Hide the root window
    output_file_path = filedialog.asksaveasfilename(
        title="Save Processed File",
        defaultextension=".jsonl",
        filetypes=(("JSON Lines files", "*.jsonl"), ("All files", "*.*"))
    )
    root.destroy()  # Release the hidden root window once the dialog closes
    return output_file_path

# Function to process the content by removing unwanted text and adjusting spacing
def process_content(content):
    content = re.sub(r'!\[\d+_image_\d+\.png\]\(\d+_image_\d+\.png\)', '', content)  # Remove image references
    # Remove the unwanted phrases first, so that stripping "Not" on its own
    # cannot break up a longer phrase before it has been matched
    for phrase in ("Not assessment", "assessment Not", "for assessment"):
        content = content.replace(phrase, "")
    # Remove "Not" only as a standalone word, so words such as "Note" stay intact
    content = re.sub(r'\bNot\b', '', content)
    content = re.sub(r'\n\s*\n', '\n\n', content)  # Collapse runs of blank lines into one
    return content

# Function to extract labeled data from the content
def extract_labeled_data(content):
    data = []
    # Split content into sections based on the competency type
    sections = re.split(r'##\s+(Mandatory|Core|Optional) Competencies', content)
    
    # Because the pattern has a capturing group, re.split returns
    # [preamble, group name, group content, group name, group content, ...],
    # so odd indices hold the group names and the next index holds that group's competencies
    for i in range(1, len(sections), 2):
        section = sections[i]
        section_content = sections[i+1]
        current_section = {"level": 1, "text": section}
        data.append(current_section)
        
        # The output uses 3 levels: level 1 is the competency group, level 2 is the
        # competency name, and level 3 is each "Level:" entry within a competency
        competencies = re.split(r'Competency:', section_content)
        
        for competency in competencies[1:]:
            competency_lines = competency.strip().split('\n')
            competency_name = competency_lines[0].strip()
            current_competency = {"level": 2, "text": "Competency: " + competency_name}
            data.append(current_competency)
            
            current_level = None
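            for line in competency_lines[1:]:
                line = line.strip()  # Tolerate indented "Level:" lines
                if line.startswith("Level:"):
                    # A new level starts: store the previous one first
                    if current_level:
                        data.append(current_level)
                    current_level = {"level": 3, "text": line}
                elif current_level:
                    # Continuation line: append it to the current level's text
                    current_level["text"] += " " + line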
            if current_level:
                data.append(current_level)
    
    return data

# Function to create a conversational structure from the labeled data
def create_conversational_structure(data):
    conversations = []
    human_prompt = "<human>"
    bot_response = "<bot>"

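    competencies = {}
    current_section = None  # Set by level 1 items; defined here so a level 2 item seen first cannot raise a NameError
    current_competency = None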

    # Organize the data into competencies and their respective levels
    for item in data:
        if item['level'] == 1:
            current_section = item['text']
        elif item['level'] == 2:
            current_competency = item['text']
            if current_competency not in competencies:
                competencies[current_competency] = {"section": current_section, "levels": []}
        elif item['level'] == 3:
            if current_competency:
                competencies[current_competency]['levels'].append(item['text']) 

    # Find the highest level number for each competency (stored as latest_level)
    latest_levels = {}
    for competency, details in competencies.items():
        latest_level = 0
        for level_text in details['levels']:
            level_match = re.search(r'Level:\s*(\d+)', level_text)
            if level_match:
                level = int(level_match.group(1))
                if level > latest_level:
                    latest_level = level
        latest_levels[competency] = latest_level

    # Create conversation structure for each level of each competency
    for competency, details in competencies.items():
        for level_text in details['levels']:
            level_match = re.search(r'Level:\s*(\d+)', level_text)
            level = level_match.group(1) if level_match else "unknown"
            latest_level = latest_levels[competency]
            conversation = {
                "text": (
                    f"{human_prompt} I want to write about {competency}\n"
                    f"{bot_response} Sure I can help with that. What level are you seeking?\n"
                    f"{human_prompt} How many levels are there in this competency?\n"
                    f"{bot_response} Based on the information that I have, the latest level for this {competency} competency is {latest_level}.\n"
                    f"{human_prompt} Okay, you could help me with level {level}.\n"
                    f"{bot_response} Sure, here is a proposed response: {level_text}\n"
                )
            }
            conversations.append(conversation)
    
    return conversations

# Function to save the conversations as a JSON Lines file
def save_conversations_as_jsonl(conversations, file_path):
    with open(file_path, 'w', encoding='utf-8') as file:
        for conversation in conversations:
            file.write(json.dumps(conversation) + '\n')

# Main function to execute the overall process
def main():
    file_paths = choose_files()
    if file_paths:
        all_data = []
        for file_path in file_paths:
            with open(file_path, 'r', encoding='utf-8') as file:
                content = file.read()
                processed_content = process_content(content)
                labeled_data = extract_labeled_data(processed_content)
                all_data.extend(labeled_data)

        conversations = create_conversational_structure(all_data)

        output_file_path = save_file_dialog()
        if output_file_path:
            save_conversations_as_jsonl(conversations, output_file_path)
            print(f"Processed file saved as: {output_file_path}")
        else:
            print("Save operation cancelled.")
    else:
        print("No files selected.")

if __name__ == "__main__":
    main()

Processed file saved as: D:/Documents/Dossiers du travail/Technical Excellence/Chatbots/Created chatbots/Parse PDFs faster/converted_pdfs/Raouf_SoE_RICS_APC/Raouf_SoE_RICS_APC_processed_v1.jsonl
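
Before using the saved file, it can be worth reading it back to confirm that every line is valid JSON with a "text" field, which is the structure written by save_conversations_as_jsonl. The following is a minimal sketch: `load_jsonl` is a hypothetical helper, and the demo writes a small sample to a temporary folder instead of touching the real output file.

```python
import json
import tempfile
from pathlib import Path


def load_jsonl(file_path):
    """Read a JSON Lines file and return one dict per non-empty line."""
    records = []
    with open(file_path, "r", encoding="utf-8") as file:
        for line_number, line in enumerate(file, start=1):
            line = line.strip()
            if not line:
                continue  # Skip blank lines
            record = json.loads(line)  # Raises an error on malformed JSON
            if "text" not in record:
                raise ValueError(f"line {line_number} has no 'text' field")
            records.append(record)
    return records


# Demo with a sample record shaped like the conversations created above
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.jsonl"
    sample = {"text": "<human> I want to write about Competency: Example\n<bot> Sure I can help with that.\n"}
    sample_path.write_text(json.dumps(sample) + "\n", encoding="utf-8")
    loaded = load_jsonl(sample_path)

print(len(loaded))
```

To check a real run, call `load_jsonl` with the path printed in the "Processed file saved as" message.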
