### Load the PCG file

In [2]:
# Download specific files from GitHub
!wget https://raw.githubusercontent.com/MasrourTawfik/Textra_Insights/main/Files/PCG_file.pdf

--2024-12-14 23:14:51--  https://raw.githubusercontent.com/MasrourTawfik/Textra_Insights/main/Files/PCG_file.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6092025 (5.8M) [application/octet-stream]
Saving to: ‘PCG_file.pdf.1’


2024-12-14 23:14:51 (81.4 MB/s) - ‘PCG_file.pdf.1’ saved [6092025/6092025]



### Digitization

- We need to digitize the filtered file because it is not editable, so we convert it to Markdown.
- To do that, we use the <a href = "https://github.com/VikParuchuri/marker">Marker engine</a>, which is built on top of <a href = "https://github.com/VikParuchuri/surya">Surya OCR</a>.
- Use a T4 GPU runtime.

In [None]:
!pip install marker-pdf
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

In [8]:
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

converter = PdfConverter(artifact_dict=create_model_dict())
rendered = converter("PCG_file.pdf")
text, _, images = text_from_rendered(rendered)

Loaded layout model datalab-to/surya_layout0 on device cuda with dtype torch.float16
Loaded texify model to cuda with torch.float16 dtype
Loaded recognition model vikp/surya_rec2 on device cuda with dtype torch.float16
Loaded table recognition model vikp/surya_tablerec on device cuda with dtype torch.float16
Loaded detection model vikp/surya_det3 on device cuda with dtype torch.float16


Recognizing layout: 100%|██████████| 5/5 [00:03<00:00,  1.32it/s]
Detecting bboxes: 100%|██████████| 7/7 [00:07<00:00,  1.07s/it]
Recognizing Text: 100%|██████████| 59/59 [01:16<00:00,  1.30s/it]
Recognizing equations: 0it [00:00, ?it/s]
Detecting bboxes: 100%|██████████| 26/26 [00:23<00:00,  1.11it/s]
Recognizing Text: 100%|██████████| 24/24 [00:26<00:00,  1.09s/it]
Recognizing tables: 100%|██████████| 17/17 [00:05<00:00,  3.40it/s]


In [9]:
markdown_file_path = "PCG_markdown1.md"

with open(markdown_file_path, "w") as md_file:
    md_file.write(text)

print(f"Markdown file saved as {markdown_file_path}")

Markdown file saved as PCG_markdown1.md


### Cleaning

- Remove Tables

In [10]:
def Remove_Tables(input_path, output_path):
    """
    Removes lines that start with '|' (table lines) from a Markdown file.

    Args:
        input_path (str): Path to the input Markdown file.
        output_path (str): Path to save the cleaned Markdown file.
    """
    try:
        with open(input_path, "r") as md_file:
            lines = md_file.readlines()

        cleaned_lines = [line for line in lines if not line.strip().startswith("|")]

        with open(output_path, "w") as cleaned_file:
            cleaned_file.writelines(cleaned_lines)

        print(f"Cleaned Markdown file saved as {output_path}")
    except Exception as e:
        print(f"An error occurred: {e}")



In [11]:
PCG_Markdown_path = "PCG_markdown1.md"
Cleaned_Markdown_path = "PCG_markdown2.md"

Remove_Tables(PCG_Markdown_path, Cleaned_Markdown_path)

Cleaned Markdown file saved as PCG_markdown2.md
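To see what this filter removes, here is a minimal in-memory sketch of the same rule applied to a string instead of a file. The helper name `strip_table_lines` and the sample text are illustrative assumptions, not part of the notebook.

```python
def strip_table_lines(markdown_text):
    """Return markdown_text without lines whose first non-blank character is '|'."""
    kept = [
        line for line in markdown_text.splitlines()
        if not line.strip().startswith("|")
    ]
    return "\n".join(kept)


# Hypothetical sample: one heading, one paragraph, and a three-line table.
sample = "\n".join([
    "### 2111. Frais de constitution :",
    "Some definition text.",
    "| Compte | Libellé |",
    "|--------|---------|",
    "| 2111   | Frais   |",
])

print(strip_table_lines(sample))
```

The rule only looks at the start of each line, so any table row that does not begin with `|` would pass through unchanged. It is worth skimming **PCG_markdown2** for leftovers.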


- Keep only accounts (remove the Rubriques and Postes).
- Check **PCG_markdown2** to find the pattern that account headings follow. In this run, an account heading starts with `### xxxx` or `#### xxxx`.
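One way to find the pattern without reading the whole file is to count headings by their `#` marker and by whether they start with a four-digit number. The sketch below does this on a few sample lines. The helper name `heading_shapes` and the sample headings are assumptions made for illustration.

```python
import re
from collections import Counter

# Heading marker (one or more '#'), a space, then the first token of the heading.
HEADING_RE = re.compile(r"^(#+) (\S+)")


def heading_shapes(lines):
    """Count headings by (marker, starts-with-four-digits flag)."""
    shapes = Counter()
    for line in lines:
        match = HEADING_RE.match(line.strip())
        if match:
            marker, first_token = match.groups()
            is_account = bool(re.match(r"\d{4}", first_token))
            shapes[(marker, is_account)] += 1
    return shapes


# Hypothetical lines standing in for the content of PCG_markdown2.md.
sample_lines = [
    "## 21. Immobilisations en non-valeurs",
    "### 211. Frais préliminaires",
    "### 2111. Frais de constitution :",
    "#### 2112. Frais préalables au démarrage :",
    "Plain paragraph text.",
]

for (marker, is_account), count in sorted(heading_shapes(sample_lines).items()):
    print(marker, "account" if is_account else "other", count)
```

On the real file, the same function can be called on the open file object, since iterating over a file yields its lines. The markers that show up with the account flag are the ones to keep in the next cell.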

In [12]:
import re

def Keep_Headers_and_NonHashed(input_path, output_path):
    """
    Keeps lines that either:
    1. Start with '### xxxx' or '#### xxxx' (where xxxx are exactly four digits), or
    2. Do not start with '#'.

    Args:
        input_path (str): Path to the input Markdown file.
        output_path (str): Path to save the filtered Markdown file.
    """
    try:
        # Account headers: three or four '#', a space, then four digits.
        account_header = re.compile(r"^#{3,4} \d{4}")

        with open(input_path, "r") as md_file:
            lines = md_file.readlines()

        cleaned_lines = [
            line for line in lines
            if account_header.match(line.strip()) or not line.strip().startswith("#")
        ]

        with open(output_path, "w") as cleaned_file:
            cleaned_file.writelines(cleaned_lines)

        print(f"Filtered Markdown file saved as {output_path}")
    except Exception as e:
        print(f"An error occurred: {e}")

In [13]:
input_path = "PCG_markdown2.md"
output_path = "PCG_markdown3.md"
Keep_Headers_and_NonHashed(input_path, output_path)

Filtered Markdown file saved as PCG_markdown3.md


#### Now ensure that **PCG_markdown3** is properly organized and that only Class 2 and 6 accounts are there.

- At this point, ensure that your accounts look something like this:

```bash
### 2111. Frais de constitution :
Les frais de constitution correspondent aux frais engagés lors de la constitution de l'entreprise. Il s'aqit généralement des honoraires, des frais liés aux formalités légales de constitution d'entreprise, les droits d'enregistrement sur les apports, les frais liés à la publicité ...
Le compte « 2111. Frais de constitution » est débité, pour le montant total des charges liées à la constitution, par le crédit des comptes de transferts de charges concernés (7197 ; 7397).
```
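The class check can be partly automated. Assuming the first digit of an account number is its class, the sketch below lists any account heading that falls outside Classes 2 and 6. The helper name `unexpected_account_ids` and the sample lines are hypothetical.

```python
import re

# Same heading shape as the filter above: '###' or '####', a space, four digits.
ACCOUNT_HEADING_RE = re.compile(r"^#{3,4} (\d{4})")


def unexpected_account_ids(lines, allowed_classes=("2", "6")):
    """Return account IDs whose first digit is not in allowed_classes."""
    unexpected = []
    for line in lines:
        match = ACCOUNT_HEADING_RE.match(line.strip())
        # str.startswith accepts a tuple, so allowed_classes must be a tuple.
        if match and not match.group(1).startswith(allowed_classes):
            unexpected.append(match.group(1))
    return unexpected


# Hypothetical lines standing in for the content of PCG_markdown3.md.
sample_lines = [
    "### 2111. Frais de constitution :",
    "Definition text.",
    "#### 6111. Achats de marchandises :",
    "### 7111. Ventes de marchandises :",
]

print(unexpected_account_ids(sample_lines))  # ['7111']
```

An empty list means every account heading belongs to an allowed class. It does not prove the definitions are well organized, so a manual read of **PCG_markdown3** is still needed.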

### To JSON

- We convert the cleaned Markdown file to JSON so that it is easier to work with.

In [14]:
import json
import re

# An account header looks like "### 2111. Title :" or "#### 2111. Title :".
HEADER_PATTERN = re.compile(r"^#{3,4} (\d{4})\.\s(.+):")

def markdown_to_json(input_path, output_path):
    """
    Converts a Markdown file with headers and definitions into a JSON file.

    Args:
        input_path (str): Path to the input Markdown file.
        output_path (str): Path to save the resulting JSON file.
    """
    try:
        # Read the content of the Markdown file
        with open(input_path, "r") as md_file:
            lines = md_file.readlines()

        data = []  # List to hold JSON objects
        current_entry = None

        # Parse the lines. Each line is tested once, so a header is never
        # copied into a definition and body lines are never added twice.
        for line in lines:
            line = line.strip()
            match = HEADER_PATTERN.match(line)

            if match:
                # A new account header closes the previous entry
                if current_entry:
                    data.append(current_entry)
                current_entry = {
                    "Id": match.group(1),
                    "Title": match.group(2).strip(),
                    "Definition": ""
                }
            elif current_entry and line:
                # Any other non-empty line belongs to the current definition
                current_entry["Definition"] += line + " "

        # Add the last entry
        if current_entry:
            data.append(current_entry)

        # Drop the trailing space left by the concatenation above
        for entry in data:
            entry["Definition"] = entry["Definition"].strip()

        # Save to JSON file (UTF-8, to match how json_to_csv reads it)
        with open(output_path, "w", encoding="utf-8") as json_file:
            json.dump(data, json_file, indent=4, ensure_ascii=False)

        print(f"JSON file saved as {output_path}")
    except Exception as e:
        print(f"An error occurred: {e}")



In [15]:
markdown_path = "PCG_markdown3.md"
json_output_path = "PCG.json"

markdown_to_json(markdown_path, json_output_path)

JSON file saved as PCG.json
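Before moving on, it can help to check the parsed entries for two common symptoms of a parsing problem: repeated account IDs and empty definitions. The sketch below works on a list of dictionaries with the same `Id`, `Title` and `Definition` keys. The helper name `check_entries` and the sample entries are assumptions for illustration.

```python
from collections import Counter


def check_entries(entries):
    """Return (duplicate_ids, ids_with_empty_definition) for a list of entries."""
    id_counts = Counter(entry["Id"] for entry in entries)
    duplicates = sorted(i for i, n in id_counts.items() if n > 1)
    empty = [e["Id"] for e in entries if not e["Definition"].strip()]
    return duplicates, empty


# Hypothetical entries shaped like the records written to PCG.json.
sample_entries = [
    {"Id": "2111", "Title": "Frais de constitution", "Definition": "Some text."},
    {"Id": "2112", "Title": "Hypothetical title", "Definition": ""},
    {"Id": "2111", "Title": "Frais de constitution", "Definition": "Repeated."},
]

print(check_entries(sample_entries))  # (['2111'], ['2112'])
```

To run it on the real output, load `PCG.json` with `json.load` (opened with `encoding="utf-8"`) and pass the resulting list to `check_entries`.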


### To CSV

In [16]:
import json
import csv

def json_to_csv(json_path, csv_path):
    """
    Converts a JSON file to a CSV file with columns Id, Title, and Definition.

    Args:
        json_path (str): Path to the input JSON file.
        csv_path (str): Path to save the output CSV file.
    """
    try:
        # Load the JSON data
        with open(json_path, "r", encoding="utf-8") as json_file:
            data = json.load(json_file)

        # Open the CSV file for writing
        with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
            # Create a CSV writer object
            csv_writer = csv.writer(csv_file)

            # Write the header row
            csv_writer.writerow(["Id", "Title", "Definition"])

            # Write the data rows
            for entry in data:
                csv_writer.writerow([entry["Id"], entry["Title"], entry["Definition"]])

        print(f"CSV file has been saved at {csv_path}")
    except Exception as e:
        print(f"An error occurred: {e}")


In [17]:
json_data_path = "PCG.json"
csv_output_path = "PCG.csv"

json_to_csv(json_data_path, csv_output_path)


CSV file has been saved at PCG.csv
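A quick way to confirm that the CSV is well formed is to read it back with `csv.DictReader`. The sketch below does the round trip in memory, so it does not depend on `PCG.csv`. The helper name `read_csv_rows` and the sample row are hypothetical.

```python
import csv
import io


def read_csv_rows(csv_text):
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(csv_text, newline="")))


# Write one hypothetical entry to an in-memory buffer, as json_to_csv does to a file.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["Id", "Title", "Definition"])
writer.writerow(["2111", "Frais de constitution", 'Text with a comma, and "quotes".'])

rows = read_csv_rows(buffer.getvalue())
print(rows[0]["Id"], "|", rows[0]["Definition"])
```

Commas and quotes inside a definition survive the round trip because `csv.writer` quotes such fields. The `Id` values come back as strings. Tools that infer column types may turn them into integers, so it is safer to read that column as text.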
