# 📄 Intelligent Summarizer for Business PDF Documents with Transformers (Fine-tuning BART)

## 🎯 Project goal

This project aims to build a system that **automatically summarizes the content of PDF documents** using **Transformer** models. It targets concrete use cases such as:
- Quickly reading **company reports**, **meeting minutes**, or **industry studies**
- Condensing large documents to **save time**
- Integration into a **monitoring, archiving, or internal documentation** tool

The project was carried out as a learning exercise, with the aim of demonstrating:
- Proficiency with modern NLP libraries (`transformers`, `pdfplumber`, etc.)
- An understanding of abstractive summarization models (such as BART and T5)
- The ability to build a **complete**, working **pipeline**

---

## 🧠 Technical context

PDF is a standard document format in business, but it carries little structure. Processing it automatically raises several challenges:
- Correctly extracting the raw **text** from the file (avoiding spurious line breaks, handling tables)
- Respecting the **length limits** of NLP models (token limits)
- Producing a summary that is **fluent, coherent, and faithful** to the original content

To address these, the project relies on:
- `PyPDFLoader` from `langchain_community` for text extraction in the code below (`pdfplumber` is also listed among the dependencies)
- `facebook/bart-large-cnn` through Hugging Face's `transformers` library for summarization
- Chunking the document when necessary
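
The chunking step mentioned above can be sketched as a sliding window over token ids. This is a minimal illustration, not the notebook's own implementation: the function name `chunk_token_ids` and the `overlap` parameter are assumptions introduced here. With a real tokenizer, `max_tokens` would need to leave room for the special tokens added around each chunk.

```python
from typing import List


def chunk_token_ids(token_ids: List[int], max_tokens: int = 1024, overlap: int = 64) -> List[List[int]]:
    """Split a token-id sequence into windows of at most `max_tokens` tokens.

    Consecutive windows share `overlap` tokens, so some context is preserved
    across each boundary.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_tokens])
        if start + max_tokens >= len(token_ids):
            break  # this window already reaches the end of the sequence
    return chunks


# Toy example: 10 "tokens", windows of 4 with an overlap of 1
print(chunk_token_ids(list(range(10)), max_tokens=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each chunk would then be decoded back to text (or fed directly to the model) and summarized on its own.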

---

## 📦 Overall pipeline

PDF → Text extraction → Chunking (if needed) → Summarization (per chunk)
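
As a rough sketch of how these stages fit together, the function below takes the splitting and summarizing steps as plain callables. The names `summarize_document`, `split_on_blank_lines`, and `first_sentence` are hypothetical stand-ins used only for illustration; in the notebook, the summarizer would be the fine-tuned BART model.

```python
from typing import Callable, List


def summarize_document(
    text: str,
    split: Callable[[str], List[str]],
    summarize: Callable[[str], str],
) -> str:
    """Summarize each chunk independently and join the partial summaries."""
    chunks = [chunk for chunk in split(text) if chunk.strip()]
    return "\n".join(summarize(chunk) for chunk in chunks)


# Stand-ins for illustration only: split on blank lines, and "summarize"
# by keeping the first sentence of each chunk.
def split_on_blank_lines(text: str) -> List[str]:
    return text.split("\n\n")


def first_sentence(chunk: str) -> str:
    return chunk.strip().split(". ")[0].rstrip(".") + "."


demo = (
    "Invoices record a sale. They list prices.\n\n"
    "Receipts confirm payment. They are kept by both parties."
)
print(summarize_document(demo, split_on_blank_lines, first_sentence))
```

Keeping the stages as separate callables makes it easy to swap the toy summarizer for a model call without touching the orchestration.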

# Importing the libraries

In [None]:
# Install the project's dependencies
# !pip install pdfplumber
# !pip install ipywidgets
# !pip install langchain_community
# !pip install pypdf
# !pip install kagglehub
# !pip install pandas
# !pip install datasets
# !pip install transformers[torch]
# !pip install tensorboard
# !pip install evaluate
# !pip install nltk
# !pip install rouge-score
# !pip install absl-py


In [None]:
# 🔹 Document extraction and manipulation
from langchain_community.document_loaders import PyPDFLoader
from bs4 import BeautifulSoup
import re

# 🔹 Data processing
import pandas as pd
import numpy as np

# 🔹 NLP evaluation
import evaluate

# 🔹 Natural language processing
import nltk
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

# 🔹 Deep learning
import torch
import tensorboard

# 🔹 Dataset download
import kagglehub

# 🔹 Hugging Face Transformers and datasets
from transformers import (
    set_seed,
    AutoTokenizer, AutoModelForSeq2SeqLM,
    BartForConditionalGeneration, BartTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments, Seq2SeqTrainer
)

from datasets import load_dataset, DatasetDict, Dataset





In [None]:
seed = 42  # Any fixed value works
set_seed(seed)

# Redundant with set_seed, but makes it explicit that every level is seeded
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

## 🧰 Utility functions


In [None]:

def clean_special_characters(text):
    """
    Replace non-breaking spaces with regular spaces and remove zero-width spaces.
    """
    text = text.replace("\xa0", " ")  # non-breaking space
    text = text.replace("\u200b", "")  # zero-width space
    return text

def normalize_whitespace(text):
    """
    Collapse every run of whitespace (spaces, tabs, newlines) into a single space.
    """
    # str.split() with no argument splits on any whitespace run and drops empty strings
    return " ".join(text.split())

def remove_code_blocks(text):
    """
    Remove code blocks delimited by triple backticks (```) from the text.
    """
    # Non-greedy match; re.DOTALL lets "." span the line breaks inside a block
    return re.sub(r"```.*?```", "", text, flags=re.DOTALL)

def clean_text(text):
    """
    Clean the text: fix special characters, remove code blocks,
    then normalize whitespace.
    """
    text = clean_special_characters(text)
    text = remove_code_blocks(text)
    text = normalize_whitespace(text)
    return text

In [None]:
def preprocess_function(examples, tokenizer):
    """
    Preprocess input examples for the summarization task.

    Args:
        examples (dict): A batch of examples with 'article' and 'highlights' fields.
        tokenizer (transformers.PreTrainedTokenizer): The tokenizer used for encoding.

    Returns:
        dict: The tokenized inputs, with the tokenized targets stored under 'labels'.
    """

    # Tokenize the source articles, truncated to 1,024 tokens
    model_inputs = tokenizer(examples['article'], max_length=1024, truncation=True)

    # Tokenize the target summaries
    labels = tokenizer(text_target=examples['highlights'], max_length=128, truncation=True)

    model_inputs['labels'] = labels['input_ids']
    return model_inputs

# 📥 Loading and extracting the PDF content

In [None]:
# # To upload a PDF or TXT file interactively in a Jupyter notebook
# import ipywidgets as widgets
# from IPython.display import display

# upload = widgets.FileUpload(accept='.pdf,.txt', multiple=False)  # accept PDF and TXT

# display(upload)  # the uploaded content is then available in upload.value

In [None]:
# Load the PDF file (top-level `async for` works inside a Jupyter/IPython cell)
file_path = "18113_LESSON NOTE ON BUSINESS DOCUMENTS.pdf"
pages = []  # defined up front so later cells do not fail with a NameError
try:
    loader = PyPDFLoader(file_path, mode="single")
    async for page in loader.alazy_load():
        pages.append(page)
except (FileNotFoundError, ValueError):
    # Assumption: depending on the langchain_community version, a missing file
    # may surface as ValueError rather than FileNotFoundError.
    print("The PDF file was not found. Please check the file path.")
    # If the file is missing, set a default path or ask the user for one.


In [None]:
# Extract the text content (with mode="single", the whole PDF is loaded as one document)
doc = pages[0].page_content


In [None]:
print(doc)

BUSINESS DOCUMENTS 
Business documents are official papers which facilitate the act of buying and selling of goods. 
Business documents are documents that serve as records and evidence for both the buyer and 
seller. Each party keeps copies of the documents that it receives as well as those it sends out, in 
order to keep track of the transactions. 
Common examples of commercial documents are: 
1. Invoice                                             13.catalogue 
2. Credit Note                                  14. Price list 
3. Debit Note                                    15. Pro-forma Invoice 
4. Receipt         
5.      Order forms                                              
6.   Advice note   
7.    Quotation                                  
8.    Delivery note 
9.    Letter of enquiry 
10. Statement of account 
11. Consignment notes 
12.Trade journals 
QUOTATION 
This is a statement of the current price and terms of trade of a product or service. 
It is a statement prepared by 

# 🧠 Summary generation with a Transformer model (fine-tuning)

In [11]:
# Download latest version
path = kagglehub.dataset_download("banuprakashv/news-articles-classification-dataset-for-nlp-and-ml")

print("Path to dataset files:", path)

Path to dataset files: /home/nvidia/.cache/kagglehub/datasets/banuprakashv/news-articles-classification-dataset-for-nlp-and-ml/versions/1


In [13]:
df = pd.read_csv(path+"/business_data.csv")

In [14]:
df

Unnamed: 0,headlines,description,content,url,category
0,Nirmala Sitharaman to equal Morarji Desai’s re...,With the presentation of the interim budget on...,"Sitharaman, the first full-time woman finance ...",https://indianexpress.com/article/business/bud...,business
1,"‘Will densify network, want to be at least no....","'In terms of market share, we aim to double it...",The merger of Tata group’s budget airlines Air...,https://indianexpress.com/article/business/avi...,business
2,Air India group to induct an aircraft every si...,Air India currently has 117 operational aircra...,The Air India group plans to induct one aircra...,https://indianexpress.com/article/business/avi...,business
3,Red Sea woes: Exporters seek increased credit ...,Rising attacks forced shippers to consider the...,Indian exporters have asked the central govern...,https://indianexpress.com/article/business/red...,business
4,Air India group to induct a plane every 6 days...,"Apart from fleet expansion, 2024 will also see...",The Air India group plans to induct one aircra...,https://indianexpress.com/article/business/avi...,business
...,...,...,...,...,...
1995,"Two official teams from India, EU to discuss c...",India raised these issues in the Trade and Tec...,India and the European Union have constituted ...,https://indianexpress.com/article/business/two...,business
1996,"Adani family sells $1 billion stake to GQG, ot...",The group's flagship Adani Enterprises Ltd saw...,US-based boutique investment firm GQG Partners...,https://indianexpress.com/article/business/com...,business
1997,Housing sales up 8% in April-June period acros...,Housing sales rose 8 per cent annually during ...,Housing sales rose 8 per cent annually during ...,https://indianexpress.com/article/business/hou...,business
1998,Spike in tomato prices temporary; rates will c...,The maximum price of Rs 122 per kg has been re...,The spurt in prices of tomato is a temporary s...,https://indianexpress.com/article/business/eco...,business


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   headlines    2000 non-null   object
 1   description  2000 non-null   object
 2   content      2000 non-null   object
 3   url          2000 non-null   object
 4   category     2000 non-null   object
dtypes: object(5)
memory usage: 78.2+ KB


In [16]:
print(df["headlines"][0])

Nirmala Sitharaman to equal Morarji Desai’s record with her sixth straight budget


In [17]:
print(df["description"][0])

With the presentation of the interim budget on February 1, Nirmala Sitharaman will surpass the records of her predecessors like Manmohan Singh, Arun Jaitley, P Chidambaram, and Yashwant Sinha, who had presented five budgets in a row.


In [18]:
print(df["content"][0])

Sitharaman, the first full-time woman finance minister of the country, has presented five full budgets since July 2019 and will present an interim or vote-on-account budget next week.
With the presentation of the interim budget on February 1, Sitharaman will surpass the records of her predecessors like Manmohan Singh, Arun Jaitley, P Chidambaram, and Yashwant Sinha, who had presented five budgets in a row.
Desai, as finance minister, had presented five annual budgets and one interim budget between 1959-1964. The interim budget 2024-25 to be presented by Sitharaman on February 1, will be a vote-on-account that will give the government authority to spend certain sums of money till a new government comes to office after the April-May general elections.
ADVERTISEMENT
As the Parliamentary elections are due, Sitharaman’s interim budget may not contain any major policy changes. Speaking at an industry event last month, Sitharaman had ruled out any “spectacular announcement” in the inter

In [19]:
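bool(df.isnull().values.any())

False

No value is missing: `df.info()` above reports 2,000 non-null entries for each of the five columns. (Calling `any(df.isnull())` would iterate over the column names, which are non-empty strings, and therefore always return `True`.)  
In any case, our work here focuses on **description** and **content**.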

In [20]:
any(df["description"].isnull())

False

In [21]:
any(df["content"].isnull())

False

In [22]:
ad = df["content"].str.len()

Mean and median text length, in characters

In [23]:
ad.mean(), ad.median()

(np.float64(1650.582), np.float64(1188.0))

In [None]:
# Load the BART tokenizer and model for summary generation
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

In [None]:
# Load the dataset from the CSV file
dataset = load_dataset("csv", data_files=path+"/business_data.csv")

In [None]:
# Preprocess the dataset
dataset = dataset.remove_columns(['headlines', 'url', 'category'])
dataset = dataset.rename_column('content','article')
dataset = dataset.rename_column('description', 'highlights')

In [31]:
dataset

DatasetDict({
    train: Dataset({
        features: ['highlights', 'article'],
        num_rows: 2000
    })
})

In [None]:

# 1. Initial split: train (80%) + temp (20%)
split_dataset = dataset["train"].train_test_split(test_size=0.2, seed=seed)

# 2. Split temp into validation (10%) + test (10%)
# split_dataset["test"] holds the 20%; splitting it 50/50 gives 10% + 10%
temp_split = split_dataset["test"].train_test_split(test_size=0.5, seed=seed)

# 3. Gather everything into a new DatasetDict
final_dataset = DatasetDict({
    "train": split_dataset["train"],       # 80%
    "validation": temp_split["train"],     # 10%
    "test": temp_split["test"],            # 10%
})

# Check
print(final_dataset)

DatasetDict({
    train: Dataset({
        features: ['highlights', 'article'],
        num_rows: 1600
    })
    validation: Dataset({
        features: ['highlights', 'article'],
        num_rows: 200
    })
    test: Dataset({
        features: ['highlights', 'article'],
        num_rows: 200
    })
})
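
The arithmetic behind this two-stage split can be reproduced in plain Python. The sketch below is a simplified stand-in: it shuffles the row indices once and cuts them, whereas `train_test_split` is called twice in the notebook, so the rows that land in each split will differ. The function name `three_way_split` is an assumption introduced for illustration.

```python
import random
from typing import Dict, List


def three_way_split(n_rows: int, seed: int = 42) -> Dict[str, List[int]]:
    """Shuffle row indices once, then cut them into 80% / 10% / 10%."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    n_holdout = n_rows // 5      # first stage: 20% held out
    n_test = n_holdout // 2      # second stage: half of the held-out part
    holdout, train = indices[:n_holdout], indices[n_holdout:]
    return {
        "train": train,
        "validation": holdout[n_test:],
        "test": holdout[:n_test],
    }


splits = three_way_split(2000)
print({name: len(rows) for name, rows in splits.items()})
# {'train': 1600, 'validation': 200, 'test': 200}
```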


In [None]:
# Tokenize inputs and labels; preprocess_function needs the tokenizer, passed here via fn_kwargs
tokenized_dataset = final_dataset.map(preprocess_function, batched=True, fn_kwargs={"tokenizer": tokenizer},
                                      remove_columns=["highlights", "article"])

In [None]:
# Initialize the data collator for the summarization task
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
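
By default, `DataCollatorForSeq2Seq` pads label sequences with `-100`, the index that the cross-entropy loss ignores, which is why the evaluation code has to swap `-100` back to a real token id before decoding. The pure-Python sketch below only illustrates that idea; the helper names `pad_labels` and `restore_pad_token` and the toy token ids are assumptions, not part of the library.

```python
from typing import List

IGNORE_INDEX = -100  # value the loss function is told to skip


def pad_labels(batch: List[List[int]], pad_value: int = IGNORE_INDEX) -> List[List[int]]:
    """Right-pad every label sequence to the length of the longest one."""
    longest = max(len(labels) for labels in batch)
    return [labels + [pad_value] * (longest - len(labels)) for labels in batch]


def restore_pad_token(batch: List[List[int]], pad_token_id: int) -> List[List[int]]:
    """Swap the ignore index back to a real token id so the ids can be decoded."""
    return [[pad_token_id if t == IGNORE_INDEX else t for t in labels] for labels in batch]


# Toy ids for illustration; 1 is used here as the pad token id
padded = pad_labels([[0, 713, 2], [0, 713, 16, 10, 2]])
print(padded)                        # [[0, 713, 2, -100, -100], [0, 713, 16, 10, 2]]
print(restore_pad_token(padded, 1))  # [[0, 713, 2, 1, 1], [0, 713, 16, 10, 2]]
```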

In [None]:
# Load the ROUGE metric for evaluation
metric = evaluate.load("rouge")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    # -100 marks ignored positions; replace it with the pad token before decoding
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # rougeLsum expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    return result
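
To make the metric less of a black box, here is a simplified ROUGE-1 F1 computed by hand. It is only a sketch under stated assumptions: whitespace tokenization, lowercasing, and no stemming, so its scores will not match the `rouge` metric loaded above exactly. The function name `rouge1_f1` is introduced here for illustration.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a predicted and a reference summary."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: a word counts at most as often as it appears in both texts
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat lay on the mat", "the cat sat on the mat")
print(round(score, 4))  # 0.8333
```

Five of the six words overlap, so precision and recall are both 5/6, and so is their harmonic mean.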

In [None]:

# Define the training arguments for the Seq2Seq model
training_args = Seq2SeqTrainingArguments(
    output_dir="results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=3,
    fp16=True,
    predict_with_generate=True,
    report_to="tensorboard",
    logging_dir="logs",   
    warmup_steps=300,              
    label_smoothing_factor=0.1,  
    generation_max_length=1000,  
    generation_num_beams=4,       
    save_strategy="epoch",  
)


In [None]:
# Initialize the Seq2SeqTrainer with the model, training arguments, datasets, and data collator
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    processing_class=tokenizer,  # replaces the deprecated `tokenizer=` argument in recent transformers releases
    compute_metrics=compute_metrics,
)
trainer.train()

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.58.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


| Epoch | Training Loss | Validation Loss | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum |
|---|---|---|---|---|---|---|
| 1 | No log | 2.292958 | 0.391335 | 0.264654 | 0.33804 | 0.350953 |
| 2 | 2.192200 | 2.277229 | 0.406326 | 0.287527 | 0.360051 | 0.37006 |
| 3 | 1.947600 | 2.309469 | 0.425335 | 0.307998 | 0.37962 | 0.390772 |




TrainOutput(global_step=1200, training_loss=2.0301630147298177, metrics={'train_runtime': 241.6931, 'train_samples_per_second': 19.86, 'train_steps_per_second': 4.965, 'total_flos': 6706663897890816.0, 'train_loss': 2.0301630147298177, 'epoch': 3.0})
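
The reported `global_step=1200` can be checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, which is consistent with the numbers above but not stated in the notebook; the function name `total_optimizer_steps` is hypothetical.

```python
import math


def total_optimizer_steps(n_samples, batch_size, epochs, n_devices=1, grad_accum=1):
    """Number of optimizer updates for a run that keeps the last partial batch."""
    effective_batch = batch_size * n_devices * grad_accum
    steps_per_epoch = math.ceil(n_samples / effective_batch)
    return steps_per_epoch * epochs


steps = total_optimizer_steps(n_samples=1600, batch_size=4, epochs=3)
print(steps)                 # 1200
print(f"{300 / steps:.0%}")  # 25%
```

With 1,600 training rows and a batch size of 4, each epoch takes 400 steps, so `warmup_steps=300` covers a quarter of the whole run. It would also explain the "No log" entry for the first epoch if the default logging interval of 500 steps is in effect, since the first training-loss value is only recorded after step 400.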

In [None]:
# Visualize the training logs
%load_ext tensorboard
%tensorboard --logdir ./logs

ERROR: Could not find `tensorboard`. Please ensure that your PATH
contains an executable `tensorboard` program, or explicitly specify
the path to a TensorBoard binary by setting the `TENSORBOARD_BINARY`
environment variable.

In [None]:
# Reload from the checkpoint directory
# Make sure the model was saved to "results/checkpoint-1200" (or point to another checkpoint folder)
model = BartForConditionalGeneration.from_pretrained("results/checkpoint-1200")
tokenizer = BartTokenizer.from_pretrained("results/checkpoint-1200")




In [None]:
# Example text to summarize
# Use text2 (extracted from the PDF) or text1 (from the test split)
text1 = final_dataset["test"]["article"][0]
text2 = clean_text(doc)

# Tokenize the text
inputs = tokenizer(text1, return_tensors="pt", max_length=1024, truncation=True)


In [None]:
# Generate the summary
summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=700,
    min_length=100,
    num_beams=4,
    length_penalty=2.0,
    no_repeat_ngram_size=3,
)

# Decode the result
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Summary:", summary)


Summary: The 30-share BSE Sensex fell sharply by 505.19 points or 0.77 per cent to close at 65,280.45 points as 25 of its constituents ended in red and five in green. The barometer moved between 65,175.74 and 65,898.98 during the day. In global markets, Hong Kong, China, Japan and Australia sank up to 1.7 per cent following overnight losses in US equities as reports suggested the US job market remains much more resilient than expected.
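
Among the generation settings, `no_repeat_ngram_size=3` prevents any trigram from appearing twice in the output. The sketch below shows the idea on a list of token ids: given what has been generated so far, it returns the tokens that would complete an already-seen n-gram. It is an illustration written for this notebook, not the library's code, and the name `banned_next_tokens` is an assumption.

```python
from typing import List, Set


def banned_next_tokens(generated: List[int], ngram_size: int) -> Set[int]:
    """Tokens that would complete an n-gram already present in `generated`."""
    if ngram_size <= 0 or len(generated) < ngram_size:
        return set()

    # Last (n - 1) generated tokens: the prefix of the next n-gram
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


# The sequence ends with "5 6", and "5 6 7" was already produced, so 7 is banned
print(banned_next_tokens([5, 6, 7, 8, 5, 6], ngram_size=3))  # {7}
```

During beam search, the scores of banned tokens would be set to negative infinity before the next token is chosen.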


In [49]:
print(text1)

Benchmark stock indices Sensex declined by 505 points at close while Nifty settled lower at the 19,330 level due to profit-taking in financial, IT and oil shares after a record-breaking run and weak global trends.
The 30-share BSE Sensex fell sharply by 505.19 points or 0.77 per cent to close at 65,280.45 points as 25 of its constituents ended in red and five in green. The barometer moved between 65,175.74 and 65,898.98 during the day.
Ending its eight-day winning streak, the broader Nifty of the National Stock Exchange declined by 165.50 or 0.85 per cent to settle at 19,331.80. As many as 44 Nifty shares declined while six gained.
ADVERTISEMENT
Among major Sensex shares, PowerGrid fell the most by 2.76 per cent. IndusInd Bank dropped 2.34 per cent, HUL by 2.23 per cent and NTPC by 2.04 per cent.
ICICI Bank, HDFC Bank, HDFC, ITC, Infosys, L&T, Bajaj Finance, Kotak Bank, HCL Tech and Tech Mahindra were among the losers.
On the other hand, Tata Motors rose the most by 2.94 per cent, foll

# Conclusion


In this notebook, we built a complete fine-tuning pipeline for an automatic summarization task:

- 📄 **Loading PDF documents with LangChain**  
- ✂️ **Cleaning and preprocessing the raw text**  
- 📚 **Building a Hugging Face-compatible dataset**  
- 🧠 **Fine-tuning a BART model on our dataset**  
- 📊 **Evaluating the model with ROUGE during training and by visually inspecting a generated summary**

### 🎯 Results  

The trained model can now generate summaries suited to the structure of our dataset. ROUGE-1 on the validation split rose from 0.391 after the first epoch to 0.425 after the third. The model can be integrated into document-processing pipelines, in particular for applications such as:

- Automatic summarization of business PDF documents  
- Preprocessing for classification or QA tasks

### 🚀 Possible improvements

- Use a richer dataset with higher-quality summaries
- Improve the trained model through hyperparameter search  
- Add a more thorough data-cleaning step
- Experiment with other models (T5, Pegasus, Mistral, etc.)  
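
As one small example of deeper data cleaning, the article excerpts printed earlier contain standalone `ADVERTISEMENT` lines left over from the source website. A minimal sketch for dropping them is shown below; the function name `drop_boilerplate_lines` and its `markers` parameter are assumptions introduced here, and other leftovers may need their own rules.

```python
def drop_boilerplate_lines(text: str, markers=("ADVERTISEMENT",)) -> str:
    """Remove lines that consist solely of a known boilerplate marker."""
    kept = [line for line in text.splitlines() if line.strip() not in markers]
    return "\n".join(kept)


sample = "Sensex fell by 505 points.\nADVERTISEMENT\nPowerGrid fell the most."
print(drop_boilerplate_lines(sample))
```

Such a function could be applied to the `article` column before tokenization, for instance through the dataset's `map` method.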