# Overview (problems & highlights)

# Problems addressed:
Through careful analysis of the baseline solution and our own experiments, we set out to tackle:

- **Missing dates** in the extracted text, caused by inadequate OCR or skipped pages (e.g. dates in an irregular font or embedded in an image, and dates appearing at the end of the document)

- **Lack of a benchmark dataset** and of consistent annotation rules with high-quality human annotation on which to test and improve the predictor

- **Reducing the computation** of LLM inference by giving it only the most relevant information as input

- **Fully using the LLM's knowledge** to judge between several candidate dates and even correct some OCR errors

In response to these problems, we present:


# Highlights of the work:

*  **Better-quality OCR** (we used the best-performing open-source OCR model we found: **PaddleOCR** by Baidu), especially effective at preserving and recognizing dates in an irregular font or format

*  **A benchmark dataset** strictly **annotated by our native French-speaking** members following a set of **reasonable and consistent annotation rules** (see the Evaluation part), including the most challenging examples

*  **Reduced LLM computation** while keeping good performance, thanks to a highly efficient input: **NER dates with their contextual text** from the document's **first and final pages**

*  **Prompting (few-shot learning and simple CoT)** to refine the LLM's (Qwen) inference and further **compensate for inevitable OCR flaws**


Result: compared with the Datapolitics baseline, **our predictor improves accuracy by more than 10 percentage points on our challenging benchmark dataset** and by more than 5 points on the class's collaborative annotation dataset (see the Evaluation part).
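The gains quoted above can be re-derived from the error counts printed in the Evaluation part. The snippet below is a small arithmetic check, not part of the pipeline; the totals of 99 and 199 evaluated documents are inferred from the printed counts and accuracies rather than stated explicitly in the notebook.

```python
def accuracy(n_docs, n_wrong):
    """Share of documents whose predicted date equals the gold date."""
    return (n_docs - n_wrong) / n_docs

# Error counts are the ones printed in the Evaluation part.
# The totals (99 and 199) are inferred from those counts and accuracies.
benchmark = {"ours": accuracy(99, 23), "baseline": accuracy(99, 34)}
class_set = {"ours": accuracy(199, 55), "baseline": accuracy(199, 65)}

for name, scores in [("benchmark", benchmark), ("class annotations", class_set)]:
    gain_points = (scores["ours"] - scores["baseline"]) * 100
    print(f"{name}: {scores['ours']:.2%} vs {scores['baseline']:.2%} "
          f"(+{gain_points:.2f} points)")
```

The gains are expressed in percentage points (absolute difference in accuracy), which is how the "more than 10" and "more than 5" figures should be read.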



# This pipeline covers:

1. PDF and original data preparation

2. OCR: PDF (to image) to text

3. Date extraction with NER, with contextual information

4. LLM (Qwen 2.5)-based predictor (prompted)

5. Evaluation

## PDF and original Data preparation


In [None]:
# %%capture
!pip install pymupdf pytesseract pdfplumber
import fitz  # PyMuPDF
import pytesseract
from PIL import Image
import io
import pdfplumber
import requests
!pip install datasets
from datasets import load_dataset,Dataset
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, AutoModelForTokenClassification, pipeline
import os
import re


In [None]:
#Get the url/cache and doc_id from the original dataset provided by the teacher
import pandas as pd

original_data_df = pd.read_csv("/content/dataset (2).csv")

# .copy() so that columns added later do not write into a view of original_data_df
new_data_df = original_data_df[['doc_id', 'url', 'cache']].copy()


In [None]:
#Function to download the PDF: try the cache URL first, then fall back to the original URL
def download_and_save_pdf(cache_url, url, filename):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }

    # try cache first
    try:
        # timeout (an arbitrary 60 s) so that a stalled server cannot block the whole run
        response = requests.get(cache_url, headers=headers, timeout=60)
        response.raise_for_status()

        with open(f"{filename}.pdf", "wb") as file: # the file is named after the row index in the df
            file.write(response.content)

    except Exception as e:
        print(f"Error cache url: {e}") # the message contains the failing URL
        print("trying other url ...")

        # if the cache fails, try the original url
        try:
            response = requests.get(url, headers=headers, timeout=60)
            response.raise_for_status()

            with open(f"{filename}.pdf", "wb") as file:
                file.write(response.content)

        except Exception as e2:
            print(f"Error normal url: {e2}")
            print("No urls work")
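A server can answer with status 200 and still return an HTML error page instead of a PDF, which `raise_for_status()` would not catch. As a hedged sketch (the helper name `looks_like_pdf` is hypothetical and not part of the notebook), the downloaded bytes could be checked for the `%PDF-` header before being written to disk:

```python
def looks_like_pdf(content: bytes) -> bool:
    """Return True if the payload carries the '%PDF-' header near its start."""
    # Some producers put a few stray bytes before the header,
    # so search a short prefix instead of requiring position 0.
    return b"%PDF-" in content[:1024]


# Example payloads (made up for illustration)
print(looks_like_pdf(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"))
print(looks_like_pdf(b"<html><body>403 Forbidden</body></html>"))
```

In `download_and_save_pdf`, such a check would sit between `raise_for_status()` and the `open(...)` call, raising an exception on failure so that the fallback URL is tried.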

## OCR: PDF to Text


In [None]:
!pip install paddlepaddle-gpu==2.5.2 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
!pip install paddleocr
from paddleocr import PaddleOCR, draw_ocr
import matplotlib.pyplot as plt
from PIL import Image
import numpy as np


# Initialize the OCR model
ocr = PaddleOCR(use_angle_cls=True, lang="fr",show_log=False)  # French recognition model


def text_from_pdf(pdf_path):
    pdf_doc = fitz.open(pdf_path)
    num_pages = len(pdf_doc)

    extracted_content = []  # list[str]
    # target pages: the first three and the last two, de-duplicated so that
    # short documents are not OCR-processed twice on the same page
    candidate_ids = [0, 1, 2, -2, -1]
    pages2consider = sorted({page_id % num_pages for page_id in candidate_ids if -num_pages <= page_id < num_pages})

    for page_idx in pages2consider:
        page = pdf_doc[page_idx]

        #PaddleOCR only accepts images: render the PDF page to an image
        pix = page.get_pixmap(dpi=300)
        image = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        image_np = np.array(image)

        #OCR-processing
        ocr_text = ocr.ocr(image_np, cls=True)
        if ocr_text[0] is not None:
            texts = [res[1][0] for res in ocr_text[0]]  # keep only the recognized strings
            extracted_content.extend(texts)

    pdf_doc.close()
    return extracted_content
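The page-selection rule used above (first pages plus final pages) can be isolated and tested without any PDF. The following is a minimal sketch; the function name `select_pages` and the `head`/`tail` parameters are illustrative assumptions, with defaults matching the three first and two last pages targeted in this notebook.

```python
def select_pages(num_pages, head=3, tail=2):
    """Return sorted, de-duplicated 0-based indices of the first `head`
    and last `tail` pages of a document with `num_pages` pages."""
    first = range(min(head, num_pages))
    last = range(max(num_pages - tail, 0), num_pages)
    return sorted(set(first) | set(last))


print(select_pages(10))  # long document: head and tail are disjoint
print(select_pages(4))   # head and tail overlap on one page
print(select_pages(1))   # single-page document
```

Working with a set avoids running OCR twice on the same page when the document is shorter than `head + tail` pages.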



In [None]:
#The mapping function: download one PDF, OCR it, then delete the file
def pdf_to_text_paddleOCR(row):
    download_and_save_pdf(row["cache"], row["url"], row.name)

    if os.path.exists(f"{row.name}.pdf"):
        text = text_from_pdf(f"{row.name}.pdf")
        row["text"] = text
        os.remove(f"{row.name}.pdf") # free disk space
    print(row.name)
    return row

In [None]:
#Apply the pdf_to_text_paddleOCR mapping to the examples (first 200 rows)
new_data_df["text"] = None
new_data_df.loc[:199] = new_data_df.loc[:199].apply(pdf_to_text_paddleOCR, axis=1)

#Save a copy as a checkpoint
new_data_df_2 = new_data_df.copy()

new_data_df_2.to_pickle("new_data_df_text_200.pkl")

0
1
2
3
4
Error cache url: 403 Client Error: Forbidden for url: https://datapolitics-public.s3.gra.io.cloud.ovh.net/LORIA/2785/384c7_D%C3%A9lib%C3%A9rations_Conseil_Communautaire_27_f%C3%A9vrier_2023.pdf
trying other url ...
5
6
7
8
9
10
11
Error cache url: 403 Client Error: Forbidden for url: https://datapolitics-public.s3.gra.io.cloud.ovh.net/LORIA/2512/b2cf4_CR_09_f%C3%A9vrier_2023.pdf
trying other url ...
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
Error cache url: 403 Client Error: Forbidden for url: https://datapolitics-public.s3.gra.io.cloud.ovh.net/LORIA/2965/18c3595af8e450d0b8afffe9827a617fcfa8450f_Rapport%20de%20P
trying other url ...
49
50
Error cache url: 403 Client Error: Forbidden for url: https://datapolitics-public.s3.gra.io.cloud.ovh.net/LORIA/1970/5011763f908fe9bdec498bdf9cb1517bb66fbb56_PV%20CC%20du%203
trying other url ...
51
52
53
54
55
56
57
58
59
60
61
62
Error cache url: 403 Client Error: Forbidd

KeyboardInterrupt: 

## Date NER with contextual information

In [None]:
#Initialize a well-performing French NER model 'camembert-ner-with-dates' (thanks to the recommendation of our classmate Xu Sun)
ner_tokenizer = AutoTokenizer.from_pretrained("Jean-Baptiste/camembert-ner-with-dates")
ner_model = AutoModelForTokenClassification.from_pretrained("Jean-Baptiste/camembert-ner-with-dates")
ner_pipe = pipeline('ner', model=ner_model, tokenizer=ner_tokenizer, aggregation_strategy="simple",device=0)

In [None]:
#Function to extract the OCR lines containing a date, together with the preceding line as context
def get_original_text_for_dates(texts, window_size=1):
    # window_size is currently unused: the context is fixed to the single preceding line
    results = []
    if not texts:
        return results

    for idx, text in enumerate(texts):
        for ner_dict in ner_pipe(text):
            if ner_dict["entity_group"] == "DATE":
                # a line holding several dates is appended once per detected date
                if idx > 0:
                    results.extend([texts[idx-1], text]) # in case the pre-context is separated from the date itself
                else: # idx == 0
                    results.append(text)

    return results

def text_to_NER_Date_map(row):
    Dates = get_original_text_for_dates(row['text'])
    row["Date_NER"] = Dates
    print(row.name)
    return row
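To illustrate the "date line plus preceding context" idea independently of the NER model, here is a hedged sketch in which a simple regular expression stands in for `camembert-ner-with-dates`. The regex, the function name `dates_with_context`, and the sample lines are assumptions made for illustration. Unlike the function above, this variant honors `window_size` and de-duplicates lines.

```python
import re

# Stand-in date detector (assumption): matches "6 février 2023" or "06/02/2023".
MONTHS = ("janvier|février|mars|avril|mai|juin|juillet|août|"
          "septembre|octobre|novembre|décembre")
DATE_RE = re.compile(
    rf"\b\d{{1,2}}(?:er)?\s+(?:{MONTHS})\s+\d{{4}}\b|\b\d{{1,2}}/\d{{1,2}}/\d{{4}}\b",
    re.IGNORECASE,
)

def dates_with_context(lines, window_size=1):
    """Keep each line containing a date plus the `window_size` lines before it,
    in document order and without duplicates."""
    keep = set()
    for idx, line in enumerate(lines):
        if DATE_RE.search(line):
            start = max(idx - window_size, 0)
            keep.update(range(start, idx + 1))
    return [lines[i] for i in sorted(keep)]


sample = [
    "Proces-verbal de la session du",
    "Conseil Communautaire du 6 février 2023",
    "présidence de Monsieur le Président",
    "Le quorum étant atteint",
    "Date de convocation : 27 janvier 2023",
]
print(dates_with_context(sample))
```

De-duplicating shortens the LLM input further, at the cost of losing the information that a line contained several dates.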

In [None]:
# Apply text_to_NER_Date_map to the examples (first 200 rows)
new_data_df['Date_NER'] = None
new_data_df.loc[:199] = new_data_df.loc[:199].apply(text_to_NER_Date_map, axis=1)
new_data_df.to_pickle("new_data_df_NER_200.pkl") # save a checkpoint

## LLM-based predictor

In [None]:
model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.float16)

model.eval()

In [None]:
import urllib.parse
def get_llm_date_prediction(text,url):
  prompt = f"""
  Voici des informations extraites d'un document administratif :
  {text}

  Question : Quelle est la date de publication de ce document ?

  Répondez uniquement par la date de publication au format JJ/MM/AAAA (jour/mois/année).
  Si aucune date correcte n'est pas mentionnée dans le texte, essayez de l'extraire à partir de l'URL suivante : {urllib.parse.unquote(url)}.
  Si la date semble incorrecte, corrigez-la utilisant tes connaissances et donnez uniquement la date corrigée sous la forme JJ/MM/AAAA.
  Exemple：Dans '37/02/2023', 37 est invalide pour le mois 02, corrigez-la et donnez donc 27/02/2023.
  Exemple: Dans'3 0 MAl 2023'corrigez-la et donnez donc '30/05/2023'



  Ne donnez aucune autre information ou texte en dehors de la date sous la forme JJ/MM/AAAA
  Réponse :
  """
  messages = [
    {"role": "system", "content": "Tu es un humain identifieur de la date de publication du texte"},
    {"role": "user", "content": prompt}
  ]

  chat_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
  )

  # tokenize the chat-formatted prompt (system + user messages) and move it to the model's device
  model_inputs = tokenizer([chat_text], return_tensors="pt").to(model.device)

  generated_ids = model.generate(
      **model_inputs,
      max_new_tokens=512
  )

  generated_ids = [
      output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
  ]

  response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

  return response


def text_to_prediction(row):
    llm_answer = get_llm_date_prediction(row['Date_NER'],row['url'])
    row["pred"] = llm_answer
    print(row.name, llm_answer)
    return row
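Because the evaluation compares prediction strings with gold strings exactly, any extra words in the LLM answer count as an error. A possible safeguard, sketched here under the assumption that answers are expected in `JJ/MM/AAAA` form (the helper name `normalize_prediction` is hypothetical), is to extract the first calendar-valid date from the raw answer:

```python
import re
from datetime import datetime

DATE_PATTERN = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")

def normalize_prediction(answer):
    """Return the first calendar-valid DD/MM/YYYY date found in `answer`,
    or None when the answer contains no valid date."""
    for match in DATE_PATTERN.finditer(answer):
        candidate = match.group(0)
        try:
            datetime.strptime(candidate, "%d/%m/%Y")  # rejects e.g. 37/02/2023
        except ValueError:
            continue
        return candidate
    return None


print(normalize_prediction("La date de publication est 27/02/2023."))
print(normalize_prediction("37/02/2023"))
```

Applied inside `text_to_prediction`, this would store a clean date (or `None`) in `row["pred"]` instead of the raw model output.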


In [None]:
#Get the predictions (first 200 rows)
new_data_df["pred"] = None
new_data_df.loc[:199] = new_data_df.loc[:199].apply(text_to_prediction, axis=1)
new_data_df.to_pickle("new_data_df_pred_200.pkl")


## Evaluation and our annotation rules
The evaluation part also features **our benchmark gold standard**, which we annotated consistently using the following rules, in priority order:

 1. Date prefixed exactly with "Mise en ligne le", "Publié le", etc.
 2. Dates declaring the approval or holding of the event: "avis du…", "procès verbal du…"
 3. Date appearing at the end of the document with the signature and prefixed with "fait à XX le…", "approuvé le…", etc.
 4. If the above rules fail, use the date in the URL


 We then compare the performance of our solution and Datapolitics on our benchmark and analyze the results.
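The priority order of the annotation rules can be made concrete with a small sketch. This is only an illustration of how the ordering resolves conflicts between several dates; the annotation itself was done by hand. The trigger phrases, the restriction to numeric `DD/MM/YYYY` dates, and the function name `pick_gold_date` are simplifying assumptions.

```python
import re

DATE = r"(\d{2}/\d{2}/\d{4})"

# Ordered from highest to lowest priority, mirroring rules 1 to 3.
# The trigger phrases are illustrative, not an exhaustive list.
RULES = [
    re.compile(r"(?:mise en ligne le|publié le)\s*:?\s*" + DATE, re.IGNORECASE),
    re.compile(r"(?:avis du|procès[- ]verbal du)\s*" + DATE, re.IGNORECASE),
    re.compile(r"(?:fait à .{1,40}? le|approuvé le)\s*" + DATE, re.IGNORECASE),
]

def pick_gold_date(text, url_date=None):
    """Return the date selected by the highest-priority matching rule,
    falling back to the date found in the URL (rule 4)."""
    for pattern in RULES:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return url_date


doc = "Procès-verbal du 06/02/2023 ... Fait à Nancy le 12/02/2023. Publié le 20/02/2023"
print(pick_gold_date(doc))
```

In this made-up example the three rules each match a different date, and the "Publié le" date wins because rule 1 is checked first.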



In [None]:
import pandas as pd
new_data_df = pd.read_pickle("/content/new_data_df_pred_200 .pkl")

gold_99 = pd.read_csv("/content/gold_annotations_99.csv") #Our gold annotation

pred_datapolitics = pd.read_csv("/content/dataset (2).csv") #Datapolitics's original dataset

In [None]:
##### Evaluation on our gold standard

# Join the 3 datasets on doc_id
merged_df = (
    new_data_df[['doc_id', 'pred']]
    .rename(columns={'pred': 'pred_new'})
    .merge(pred_datapolitics[['doc_id', 'published']].rename(columns={'published': 'pred_datapolitics'}), on='doc_id', how='left')
    .merge(gold_99[['doc_id', 'REAL_GOLD']], on='doc_id', how='left')
)
merged_df = merged_df.dropna(subset=['pred_new', 'pred_datapolitics', 'REAL_GOLD']) # filter out NaN

# Calculate results
differences_new = merged_df[merged_df['pred_new'] != merged_df['REAL_GOLD']]
differences_datapolitics = merged_df[merged_df['pred_datapolitics'] != merged_df['REAL_GOLD']]

count_differences_new = len(differences_new)
count_differences_datapolitics = len(differences_datapolitics)

accuracy_new = (merged_df['pred_new'] == merged_df['REAL_GOLD']).mean()
accuracy_datapolitics = (merged_df['pred_datapolitics'] == merged_df['REAL_GOLD']).mean()

print("Our gold v.s. Our predictions")
print(f"Number of wrong predictions: {count_differences_new}")
print(f"Accuracy: {accuracy_new:.2%}")
print('')
print("Our gold v.s. DataPolitics")
print(f"Number of wrong predictions: {count_differences_datapolitics}")
print(f"Accuracy: {accuracy_datapolitics:.2%}")


Our gold v.s. Our predictions
Number of wrong predictions: 23
Accuracy: 76.77%

Our gold v.s. DataPolitics
Number of wrong predictions: 34
Accuracy: 65.66%


**Our solution outperforms the baseline by:**

- Taking into account the publication date appearing at the end of the document

- Recovering, thanks to better OCR, irregular-font dates that are missing from the texts provided by the baseline solution

**Analysis of the remaining problems and possible solutions:**

- Handwritten dates are too difficult for the OCR ➡️ With enough budget, we could try the Google Vision API

- The LLM's 'reasoning' doesn't align with our human standards ➡️ Try more advanced prompting (e.g. more clearly stated rules) or even fine-tuning strategies (e.g. prefix-tuning)

- Some URLs can no longer be requested (roughly 4 out of 200)

- The publication date falls outside the scope of our targeted pages (rare)




We also ran the evaluation using the class's collaborative annotation. Although our performance is still higher than the baseline, we consider this evaluation less relevant because the collaborative annotation rules are not clear and consistent, and follow principles different from ours.👇


In [None]:
##### Evaluation on the class's annotations
!pip install datasets
from datasets import load_dataset,Dataset
ds = load_dataset("maribr/publication_dates_fr")

In [None]:
huggingface_df = ds['train'].to_pandas()

# Join the 3 dataframes on url
merged_df = (
    new_data_df[['url', 'pred']]
    .rename(columns={'pred': 'pred_new'})
    .merge(pred_datapolitics[['url', 'published']].rename(columns={'published': 'pred_datapolitics'}), on='url')
    .merge(huggingface_df[['url', 'Gold published date']], on='url', how='left')
)
merged_df = merged_df.dropna(subset=['pred_new', 'pred_datapolitics', 'Gold published date']) # filter out NaN

# Calculate results
differences_new = merged_df[merged_df['pred_new'] != merged_df['Gold published date']]
differences_datapolitics = merged_df[merged_df['pred_datapolitics'] != merged_df['Gold published date']]

count_differences_new = len(differences_new)
count_differences_datapolitics = len(differences_datapolitics)

accuracy_new = (merged_df['pred_new'] == merged_df['Gold published date']).mean()
accuracy_datapolitics = (merged_df['pred_datapolitics'] == merged_df['Gold published date']).mean()

print("The class annotations v.s. Our predictions")
print(f"Number of wrong predictions: {count_differences_new}")
print(f"Accuracy: {accuracy_new:.2%}")
print('')
print("The class annotations v.s. DataPolitics")
print(f"Number of wrong predictions: {count_differences_datapolitics}")
print(f"Accuracy: {accuracy_datapolitics:.2%}")

The class annotations v.s. Our predictions
Number of wrong predictions: 55
Accuracy: 72.36%

The class annotations v.s. DataPolitics
Number of wrong predictions: 65
Accuracy: 67.34%


In [None]:
new_data_df.loc[62]['Date_NER']

['Proces-verbal de la session du',
 'Conseil Communautaire du 6 février 2023',
 'présidence de Monsieur Jean-Louis CAMUS, Président.',
 'Date de convocation : 27 janvier 2023',
 'Le quorum étant atteint, le Président ouvre la séance',
 ' Approbation du proces-verbal de la séance du Conseil Communautaire du 13 décembre 2022',
 ' Approbation du proces-verbal de la séance du Conseil Communautaire du 13 décembre 2022',
 'Le Président donne lecture du procés-verbal de la session du conseil communautaire en date du 13 décembre 2022.',
 "Monsieur le Président rappelle l'ordre du jour :",
 ' Approbation du PV de séance du conseil communautaire du 13 décembre 2022',
 'Décisions du Président',
 " Débat d'orientations budgétaires 2023",
 'Affaires économiques.:Boulangerie de MARTIZAY : Assujettissement a la TVA',
 'Voirie : Programme de voirie 2023 - lancement des consultations',
 'Virement de crédits',
 'ARVC 2022 - 02',
 'En application des articles L 2322-1 et L2322-2 du Code général des colle

In [None]:
differences_new

| index | doc_id | pred_new | pred_datapolitics | REAL_GOLD |
|---|---|---|---|---|
| 4 | 3132/6df22_cms_viewFile.php | 23/01/2023 | 16/01/2023 | 16/01/2023 |
| 10 | 3132/b91a1_cms_viewFile.php | 13/02/2023 | 13/02/2023 | 06/03/2023 |
| 15 | 679/864c8_ca.pdf | 14/02/2023 | 17/02/2023 | 17/02/2023 |
| 20 | 1058/fb940_2023.020.pdf | 17/02/2023 | 17/02/2023 | 24/02/2023 |
| 23 | 1220/55539_pv30012023-signe.pdf | 30/01/2023 | 30/01/2023 | 27/02/2023 |
| 24 | 693/f5db7_RAPPORT_BP_2023_COMMUNE.pdf | 31/01/2023 | 01/01/2023 | 01/01/2023 |
| 27 | 3031/415be_TOME-3_Plan-dactions-Suivi-Evaluati... | 01/01/2023 | 12/01/2023 | 12/01/2023 |
| 30 | 2432/1775c_Couvron-2-Règlement.pdf | 27/01/2023 | 01/01/2023 | 01/01/2023 |
| 36 | 2120/6aa7b_8979238261_1352_pv-conseil-19-janvi... | 19/01/2023 | 19/01/2023 | 20/02/2023 |
| 37 | 6608/8e06f_Deliberations-du-23-janvier-2023.pdf | 17/01/2023 | 23/01/2023 | 23/01/2023 |
