We will develop a model that automatically extracts information from medical reports and summarizes them using large language models (LLMs).

In [1]:
import spacy
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import pandas as pd
import re
from transformers import T5Tokenizer, T5ForConditionalGeneration


We need a dataset of medical reports to train and test our model. We chose the Medical Transcriptions dataset on Kaggle.

In [2]:
df = pd.read_csv("/content/mtsamples.csv")

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,description,medical_specialty,sample_name,transcription,keywords
0,0,A 23-year-old white female presents with comp...,Allergy / Immunology,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr...","allergy / immunology, allergic rhinitis, aller..."
1,1,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 2,"PAST MEDICAL HISTORY:, He has difficulty climb...","bariatrics, laparoscopic gastric bypass, weigh..."
2,2,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 1,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ...","bariatrics, laparoscopic gastric bypass, heart..."
3,3,2-D M-Mode. Doppler.,Cardiovascular / Pulmonary,2-D Echocardiogram - 1,"2-D M-MODE: , ,1. Left atrial enlargement wit...","cardiovascular / pulmonary, 2-d m-mode, dopple..."
4,4,2-D Echocardiogram,Cardiovascular / Pulmonary,2-D Echocardiogram - 2,1. The left ventricular cavity size and wall ...,"cardiovascular / pulmonary, 2-d, doppler, echo..."
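Before cleaning, it can help to check how many transcriptions are missing and how the reports are spread across specialties. The sketch below uses a small made-up DataFrame (`toy`, with invented rows) so that it runs without the Kaggle file; only the column names `medical_specialty` and `transcription` come from the dataset shown above.

```python
import pandas as pd

# Made-up rows for illustration; the real data comes from mtsamples.csv.
toy = pd.DataFrame({
    "medical_specialty": ["Bariatrics", "Bariatrics", "Allergy / Immunology"],
    "transcription": ["Consult note text.", None, "Allergy note text."],
})

# Number of rows with no transcription (these cannot be summarized).
missing = toy["transcription"].isna().sum()

# Number of reports per specialty, most frequent first.
by_specialty = toy["medical_specialty"].value_counts()

print("Missing transcriptions:", missing)
print(by_specialty)
```

The same two calls can be run on `df` directly once the CSV is loaded.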


Medical reports can contain noise such as special characters, extra whitespace, or irrelevant words. We clean the text by removing these elements.

In [4]:
# Drop duplicates and missing values
df.drop_duplicates(inplace=True)
df.dropna(subset=["transcription"], inplace=True)

# Text cleaning function
def clean_text(text):
    text = text.lower()  # Lowercase
    text = re.sub(r"\s+", " ", text)  # Collapse runs of whitespace into one space
    text = re.sub(r"\[.*?\]|\(.*?\)", "", text)  # Remove bracketed or parenthesized annotations
    # Keep only word characters, whitespace, periods, and commas.
    # Hyphens and slashes are dropped too, so "23-year-old" becomes "23yearold".
    text = re.sub(r"[^\w\s.,]", "", text)
    return text.strip()

# Apply cleaning
df["clean_transcription"] = df["transcription"].apply(clean_text)

# Show cleaned data
print(df[["clean_transcription"]].head())

                                 clean_transcription
0  subjective, this 23yearold white female presen...
1  past medical history, he has difficulty climbi...
2  history of present illness , i have seen abc t...
3  2d mmode , ,1. left atrial enlargement with le...
4  1. the left ventricular cavity size and wall t...
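The output above shows a side effect of the cleaning step: removing hyphens fuses "23-year-old" into "23yearold", and the same happens to values such as a blood pressure written with a slash. One possible variant, sketched below, turns hyphens and slashes into spaces and collapses whitespace last. The function name `clean_text_keep_hyphens` is hypothetical and is not part of the notebook.

```python
import re

def clean_text_keep_hyphens(text: str) -> str:
    """Hypothetical variant of clean_text that preserves word boundaries."""
    text = text.lower()
    # Remove bracketed or parenthesized annotations, leaving a space behind.
    text = re.sub(r"\[.*?\]|\(.*?\)", " ", text)
    # Replace hyphens and slashes with spaces instead of deleting them.
    text = re.sub(r"[-/]", " ", text)
    # Drop the remaining special characters, keeping periods and commas.
    text = re.sub(r"[^\w\s.,]", "", text)
    # Collapse whitespace last so earlier removals leave no double spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(clean_text_keep_hyphens("This 23-year-old female (see note) uses over-the-counter spray."))
```

Whether splitting "124/78" into two numbers is preferable to fusing it depends on the downstream task; this sketch only avoids merging separate words into one token.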


In [5]:
# Download necessary NLTK resources
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download('punkt_tab') # Download the punkt_tab resource

# Initialize tools
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess_text(text):
    # Sentence tokenization
    sentences = sent_tokenize(text)

    # Word tokenization, stopword removal, and lemmatization
    processed_sentences = []
    for sent in sentences:
        words = word_tokenize(sent)
        words = [lemmatizer.lemmatize(word) for word in words if word.lower() not in stop_words]
        processed_sentences.append(" ".join(words))

    return " ".join(processed_sentences)

# Apply preprocessing
df["preprocessed_text"] = df["clean_transcription"].apply(preprocess_text)

# Show sample output
print(df[["preprocessed_text"]].head())

                                   preprocessed_text
0  subjective , 23yearold white female present co...
1  past medical history , difficulty climbing sta...
2  history present illness , seen abc today . ple...
3  2d mmode , ,1. left atrial enlargement left at...
4  1. left ventricular cavity size wall thickness...
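Stop-word removal can change the clinical meaning of a sentence, because common stop-word lists include negations such as "no" and "not". The sketch below shows one way to keep negations while filtering. `STOP_WORDS` is a tiny stand-in list written for this example (the notebook uses NLTK's English list), and `NEGATIONS` and `filter_tokens` are hypothetical names.

```python
# Stand-in stop-word list for illustration only; not NLTK's actual list.
STOP_WORDS = {"the", "a", "is", "of", "and", "no", "not", "with", "was"}

# Words to protect from removal because they flip the meaning of a finding.
NEGATIONS = {"no", "not", "nor", "without"}

def filter_tokens(tokens, keep_negations=True):
    """Drop stop words, optionally keeping negation words."""
    kept = []
    for tok in tokens:
        low = tok.lower()
        is_protected = keep_negations and low in NEGATIONS
        if low in STOP_WORDS and not is_protected:
            continue
        kept.append(tok)
    return kept

tokens = "there was no evidence of pneumonia".split()
print(filter_tokens(tokens, keep_negations=False))  # negation is dropped
print(filter_tokens(tokens, keep_negations=True))   # negation is kept
```

In `preprocess_text`, the same idea would amount to subtracting the protected words from `stop_words` before filtering.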


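We will use spaCy with the scispaCy model `en_core_sci_sm` to extract medical entities such as diseases, medications, and procedures. This model tags every span with a single generic `ENTITY` label, so it finds the mentions but does not distinguish between these types.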

In [6]:
# Install the en_core_sci_sm model if not already installed
#!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_sm-0.5.1.tar.gz

# Load medical NLP model
nlp = spacy.load("en_core_sci_sm")

# Function to extract named entities
def extract_medical_entities(text):
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]  # Extract entity text and type
    return entities

# Apply NER to dataset
df["medical_entities"] = df["preprocessed_text"].apply(extract_medical_entities)

# Show sample extracted entities
print(df[["preprocessed_text", "medical_entities"]].head())

                                   preprocessed_text  \
0  subjective , 23yearold white female present co...   
1  past medical history , difficulty climbing sta...   
2  history present illness , seen abc today . ple...   
3  2d mmode , ,1. left atrial enlargement left at...   
4  1. left ventricular cavity size wall thickness...   

                                    medical_entities  
0  [(subjective, ENTITY), (white female, ENTITY),...  
1  [(medical history, ENTITY), (difficulty climbi...  
2  [(history, ENTITY), (illness, ENTITY), (pleasa...  
3  [(mmode, ENTITY), (left atrial enlargement, EN...  
4  [(left ventricular cavity size, ENTITY), (thic...  
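Since every span carries the same `ENTITY` label, one simple way to make the extracted list more useful is to count how often each mention occurs in a report. The sketch below assumes the list-of-tuples format returned by `extract_medical_entities`; the helper name `top_entities` and the sample list are made up for illustration.

```python
from collections import Counter

def top_entities(entities, n=5):
    """Return the n most frequent entity texts, case-insensitively."""
    counts = Counter(text.lower() for text, _label in entities)
    return counts.most_common(n)

# Made-up sample in the same (text, label) format as the notebook output.
sample = [
    ("allergy", "ENTITY"),
    ("Allergy", "ENTITY"),
    ("nasal spray", "ENTITY"),
    ("zyrtec", "ENTITY"),
]
print(top_entities(sample, n=2))
```

Applied to the dataset, this could be run per row with `df["medical_entities"].apply(top_entities)`.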


We will use T5 (Text-to-Text Transfer Transformer), a general-purpose text-to-text model that performs summarization when the input starts with the `summarize:` prefix.

In [7]:
# Load T5 model and tokenizer
model_name = "t5-small"  # Use "t5-base" or "t5-large" for better performance
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Function to generate a summary
def summarize_text(text, max_length=150):
    input_text = "summarize: " + text  # T5 selects the task from a text prefix
    # Inputs longer than 512 tokens are truncated, so the tail of a long report is ignored
    inputs = tokenizer.encode(input_text, return_tensors="pt", max_length=512, truncation=True)
    summary_ids = model.generate(
        inputs, max_length=max_length, min_length=50,
        length_penalty=2.0, num_beams=4, early_stopping=True,
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Apply summarization on a sample report
sample_text = df["preprocessed_text"].iloc[0]  # Take first medical report
summary = summarize_text(sample_text)

print("\nOriginal Report:\n", sample_text)
print("\nGenerated Summary:\n", summary)


Original Report:
 subjective , 23yearold white female present complaint allergy . used allergy lived seattle think worse . past , tried claritin , zyrtec . worked short time seemed lose effectiveness . used allegra also . used last summer began using two week ago . appear working well . used overthecounter spray prescription nasal spray . asthma doest require daily medication think flaring up. , medication , medication currently ortho tricyclen allegra. , allergy , known medicine allergies. , objective , vitals weight 130 pound blood pressure 12478. , heent throat mildly erythematous without exudate . nasal mucosa erythematous swollen . clear drainage seen . tm clear. , neck supple without adenopathy. , lung clear. , assessment , allergic rhinitis. , plan,1 . try zyrtec instead allegra . another option use loratadine . think prescription coverage might cheaper.,2 . sample nasonex two spray nostril given three week . prescription written well .

Generated Summary:
 subjective, 23yearol
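Because `summarize_text` truncates its input at 512 tokens, anything past that point in a long report never reaches the model. A possible workaround is to split the report into sentence-aligned chunks and summarize each one. The sketch below is an illustration under stated assumptions: it uses word counts as a rough proxy for tokens, the 350-word budget is an arbitrary choice, and `chunk_sentences` is a hypothetical helper.

```python
import re

def chunk_sentences(text, max_words=350):
    """Group sentences into chunks of at most max_words words each.

    A single sentence longer than max_words still becomes its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        if not sent:
            continue
        n = len(sent.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

print(chunk_sentences("A b c. D e f. G h i.", max_words=6))
```

Each chunk could then be passed to `summarize_text` and the partial summaries joined, at the cost of one model call per chunk.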

We now test the model on new reports from other medical specialties.



In [9]:
test_1 = "The patient is a 72-year-old female with a long history of migraines and chronic tension headaches. She complains of increased frequency of headaches over the past 6 months. MRI of the brain was unremarkable. A trial of triptan medications was initiated, with instructions for follow-up in 4 weeks. She was also referred to physical therapy for neck stiffness."
summary_1 = summarize_text(test_1)

print("\nOriginal Report:\n", test_1)
print("\nGenerated Summary:\n", summary_1)


Original Report:
 The patient is a 72-year-old female with a long history of migraines and chronic tension headaches. She complains of increased frequency of headaches over the past 6 months. MRI of the brain was unremarkable. A trial of triptan medications was initiated, with instructions for follow-up in 4 weeks. She was also referred to physical therapy for neck stiffness.

Generated Summary:
 the patient is a 72-year-old female with a long history of migraines and chronic tension headaches. she complains of increased frequency of headaches over the past 6 months. MRI of the brain was unremarkable.
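The generated summary above repeats the first three sentences of the input almost word for word, which suggests the model is copying the start of the report and not condensing it. A quick way to flag this is to compare the summary against the start of the source and to measure how much shorter it is. The helpers `is_lead_copy` and `compression_ratio` below are hypothetical and only offer a rough check, not an evaluation metric.

```python
def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so comparisons ignore casing and spacing."""
    return " ".join(text.lower().split())

def is_lead_copy(source: str, summary: str) -> bool:
    """True if the summary is a verbatim prefix of the source after normalization."""
    return _normalize(source).startswith(_normalize(summary))

def compression_ratio(source: str, summary: str) -> float:
    """Summary length divided by source length, both counted in words."""
    return len(summary.split()) / max(len(source.split()), 1)

src = "The patient has migraines. MRI was unremarkable. Triptan started."
out = "the patient has migraines. MRI was unremarkable."
print(is_lead_copy(src, out))
print(round(compression_ratio(src, out), 2))
```

Running these on `test_1` and `summary_1` (or on the French pair) would show whether a given output is only a copy of the opening sentences.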


In [10]:
test_2 = "Le patient est une femme de 72 ans avec des antécédents de migraines chroniques. Elle se plaint d’une augmentation de la fréquence des maux de tête. L’IRM est normale. Traitement par triptan initié avec suivi dans 4 semaines."
summary_2 = summarize_text(test_2)

print("\nOriginal Report:\n", test_2)
print("\nGenerated Summary:\n", summary_2)


Original Report:
 Le patient est une femme de 72 ans avec des antécédents de migraines chroniques. Elle se plaint d’une augmentation de la fréquence des maux de tête. L’IRM est normale. Traitement par triptan initié avec suivi dans 4 semaines.

Generated Summary:
 le patient est une femme de 72 ans avec des antécédents de migraines chroniques. Elle se plaint d’une augmentation de la fréquence des maux de tête. L’IRM est normale.
