The goal of this notebook is to take the final dataset, after all standardization and enrichment, and begin preparing the data. This preparation involves creating case-diagnosis pairs that will be tokenized and used to train the LLM. In addition, I also intend to collect clinical studies and research on the diseases to use as context for the model.

In [None]:
%pip uninstall -y torch torchvision torchaudio

In [None]:
# Checking the PyTorch setup to make sure the GPU (Nvidia RTX 3050) is used for model training
import torch
print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version PyTorch was built with:", torch.version.cuda)
print("CUDA device:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "None")

2.4.1+cpu
CUDA available: False
CUDA version PyTorch was built with: None
CUDA device: None


In [1]:
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer



# Processing the dataset

In [None]:
# importing the merged and standardized dataset
merged_dataset = pd.read_csv("./datasets/merged_dataset.csv")

In [None]:
# dropping the leftover index column and previewing the dataset
merged_dataset.drop(columns=["Unnamed: 0"], inplace=True)
merged_dataset.head()

Unnamed: 0,diseases,abdomen acute,abdomen distended,abdominal bloating,abdominal colic,abdominal pain,abdominal tenderness,abnormal appearing skin,abnormal appearing tongue,abnormal breathing sounds,...,wrist pain,wrist stiffness or tightness,wrist swelling,wrist weakness,yellow color,yellow crust ooze,yellow sputum,yellowing of eyes,diseases_description,disease_risk_factors
0,Panic Disorder,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,A type of anxiety disorder characterized by un...,Symptoms of panic disorder often start in the ...
1,Panic Disorder,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,A type of anxiety disorder characterized by un...,Symptoms of panic disorder often start in the ...
2,Panic Disorder,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,A type of anxiety disorder characterized by un...,Symptoms of panic disorder often start in the ...
3,Panic Disorder,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,A type of anxiety disorder characterized by un...,Symptoms of panic disorder often start in the ...
4,Panic Disorder,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,A type of anxiety disorder characterized by un...,Symptoms of panic disorder often start in the ...


> With the data imported, the dataset must be processed for tokenization. It will be structured as case-diagnosis pairs: the case lists the symptoms marked as 1 for that instance, and the diagnosis holds the disease together with its description and risk factors.

In [6]:
# function that generates the pairs from the dataset
# only symptom columns are scanned; the text columns can never equal 1
COLUNAS_TEXTO = ["diseases", "diseases_description", "disease_risk_factors"]
COLUNAS = [col for col in merged_dataset.columns if col not in COLUNAS_TEXTO]
def gerar_pares(row):
    # Build the input from the symptoms marked as 1
    sintomas = [col.replace("_", " ") for col in COLUNAS if row[col] == 1]
    input_text = f"The pacient presents the following symptoms: {', '.join(sintomas)}."

    # Build the output with diagnosis + description + risk factors
    output_text = f'''
        Diagnosis: {row['diseases']}.\n
        Description: {row['diseases_description']}.\n
        Risk factors: {row['disease_risk_factors']}.
    '''
    
    return {"input": input_text, "output": output_text}  # return the generated pair

In [19]:
# now just go through the dataset and generate the pairs
caso_diagnostico = merged_dataset.apply(gerar_pares, axis=1).tolist()

In [21]:
# example of a pair generated from the dataset
caso_diagnostico[0]

{'input': 'The pacient presents the following symptoms: anxiety and nervousness, breathing fast, chest tightness, depressive or psychotic symptoms, irregular heartbeat, palpitations, shortness of breath.',
 'output': '\n        Diagnosis: Panic Disorder.\n\n        Description: A type of anxiety disorder characterized by unexpected panic attacks that last minutes or, rarely, hours. Panic attacks begin with intense apprehension, fear or terror and, often, a feeling of impending doom. Symptoms experienced during a panic attack include dyspnea or sensations of being smothered; dizziness, loss of balance or faintness; choking sensations; palpitations or accelerated heart rate; shakiness; sweating; nausea or other form of abdominal distress; depersonalization or derealization; paresthesias; hot flashes or chills; chest discomfort or pain; fear of dying and fear of not being in control of oneself or going crazy. Agoraphobia may also develop. Similar to other anxiety disorders, it may be inhe
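The output shown above carries the indentation of the triple-quoted f-string (`\n        Diagnosis: ...`) and an extra blank line between fields. If that whitespace is not wanted in the training text, one possible variant builds the string line by line. This is only a sketch: `format_output` and the toy row are hypothetical, and the field values are placeholders, not rows from the dataset.

```python
def format_output(row):
    """Build the diagnosis text without source-code indentation (hypothetical helper)."""
    fields = [
        ("Diagnosis", row["diseases"]),
        ("Description", row["diseases_description"]),
        ("Risk factors", row["disease_risk_factors"]),
    ]
    # rstrip(".") avoids a doubled period when a field already ends with one
    return "\n".join(f"{label}: {str(value).rstrip('.')}." for label, value in fields)


# Placeholder row; a pandas Series taken from merged_dataset is indexed the same way
toy_row = {
    "diseases": "Example Disease",
    "diseases_description": "Placeholder description.",
    "disease_risk_factors": "Placeholder risk factor",
}
print(format_output(toy_row))
```

Each field ends up on its own line with a single trailing period, which also keeps a few tokens per example out of the 512-token budget.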

# Selecting the LLM

For the model I chose BioGPT, a language model developed by Microsoft Research specifically for biomedical tasks. It follows the Transformer architecture (GPT-style) but was trained exclusively on biomedical text, such as PubMed articles, scientific abstracts, and specialized medical literature.

**Reference:** https://huggingface.co/microsoft/biogpt

**Reference article:**

LUO, Renqian et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics, [S.l.], v. 23, n. 6, Sept. 2022. Available at: https://doi.org/10.1093/bib/bbac409.

In [11]:
# example code for using the BioGPT model
from transformers import pipeline, set_seed
from transformers import BioGptTokenizer, BioGptForCausalLM
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")  # instantiate the model
tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")  # instantiate the tokenizer
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)  # create the text generator
set_seed(42)  # set the random seed

In [None]:
# example of generating text with the model
generator("Influenza is", max_length=20, num_return_sequences=5, do_sample=True, truncation=True)

[{'generated_text': 'Influenza is a respiratory infection caused by the influenza virus.'},
 {'generated_text': 'Influenza is a highly contagious respiratory disease, which causes severe illness of varying severity, with fatalities occurring'},
 {'generated_text': 'Influenza is still a worldwide public health problem.'},
 {'generated_text': 'Influenza is often severe in adults with the sequelae of pneumonia, prolonged viral shedding, and exacerbation of'},
 {'generated_text': 'Influenza is the main cause of seasonal epidemics of respiratory infection in both the hospital and community settings.'}]

# Tokenizing the data

Language models do not understand text directly. They need the text converted into numeric tokens. Tokenization converts the inputs and outputs into lists of numbers the model can process.
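To make the idea concrete, here is a minimal sketch of text-to-ids conversion with padding and an attention mask. It is a simplification under stated assumptions: it splits on whitespace and uses a tiny hand-built vocabulary, whereas the BioGPT tokenizer works on subword units with its own vocabulary. `build_vocab` and `encode` are hypothetical names used only for illustration.

```python
def build_vocab(texts):
    """Assign an integer id to every distinct lowercase word (toy vocabulary)."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab, max_length):
    """Map words to ids, truncate to max_length, then pad on the right."""
    ids = [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]
    ids = ids[:max_length]                      # truncation
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [vocab["<pad>"]] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,  # 1 = real token, 0 = padding
    }


vocab = build_vocab(["fever cough fatigue"])
encoded = encode("fever cough headache", vocab, max_length=5)
print(encoded)
# {'input_ids': [2, 3, 1, 0, 0], 'attention_mask': [1, 1, 1, 0, 0]}
```

The unknown word maps to `<unk>`, and the attention mask marks which positions hold real tokens, the same two fields (`input_ids`, `attention_mask`) that appear in the tokenized dataset.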

In [22]:
from datasets import Dataset

In [None]:
# create a Hugging Face dataset from the pairs
dataset = Dataset.from_list(caso_diagnostico)
print(dataset)  # prepared dataset

# train/validation split
dataset = dataset.train_test_split(test_size=0.15)
train_dataset = dataset['train']
eval_dataset = dataset['test']

Dataset({
    features: ['input', 'output'],
    num_rows: 263609
})
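The sizes of the two splits can be checked by hand from the row count above. The sketch below assumes the evaluation share is rounded up and the rest goes to training, which agrees with the 224067 / 39542 counts reported in the tokenization progress bars.

```python
import math

n_rows = 263609      # num_rows printed for the full dataset
test_size = 0.15     # fraction passed to train_test_split

n_eval = math.ceil(n_rows * test_size)   # evaluation share, rounded up (assumption)
n_train = n_rows - n_eval                # everything else is used for training

print(n_train, n_eval)
# 224067 39542
```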


In [24]:
# BioGPT is a causal (autoregressive) LM, so we concatenate input + output and train the model to predict the next token
def tokenize_function(example):  # tokenizes one example before training
    prompt = example["input"] + "\n" + example["output"]
    return tokenizer(prompt, truncation=True, padding="max_length", max_length=512)

# tokenized data
tokenized_train = train_dataset.map(tokenize_function)
tokenized_eval = eval_dataset.map(tokenize_function)

print(tokenized_train)

Map: 100%|██████████| 224067/224067 [06:22<00:00, 585.61 examples/s]
Map: 100%|██████████| 39542/39542 [01:06<00:00, 597.65 examples/s]

Dataset({
    features: ['input', 'output', 'input_ids', 'attention_mask'],
    num_rows: 224067
})





# Training the model

In [4]:
import torch
print(torch.cuda.is_available())

False


In [2]:
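# setting up the training configuration
# (`model` and `tokenizer` from the BioGPT cell above must already be loaded in this kernel session)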
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="./biogpt-finetuned",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=2,
    logging_dir='./logs',
    logging_steps=10,
)

# since this is a causal LM, we use this collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    tokenizer=tokenizer,
    data_collator=data_collator,
)



NameError: name 'tokenizer' is not defined
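With `mlm=False`, the collator configured in the cell above builds `labels` by copying `input_ids` and replacing padding positions with -100, so the padded tail of each 512-token example does not contribute to the loss. The one-position shift for next-token prediction happens inside the model. A plain-Python sketch of that masking step follows; `make_labels` and the pad id used here are hypothetical, chosen only for the toy example.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default
PAD_ID = 0           # hypothetical pad id for this toy example


def make_labels(input_ids, pad_id=PAD_ID):
    """Copy the ids into labels, masking padding so it is ignored by the loss."""
    return [IGNORE_INDEX if token_id == pad_id else token_id for token_id in input_ids]


input_ids = [12, 57, 9, 0, 0]   # three real tokens followed by two pads
print(make_labels(input_ids))
# [12, 57, 9, -100, -100]
```

Note that the whole concatenated text, symptoms included, is treated as a prediction target here; only padding is masked.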

In [None]:
# now that everything is prepared, train the model
trainer.train()
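Before launching the run it can help to estimate how many optimizer steps it involves. The sketch below assumes a single device and no gradient accumulation (neither is set in `TrainingArguments` above, so the defaults apply) and that the last partial batch of each epoch is kept.

```python
import math

n_train = 224067                   # rows in the tokenized training split
per_device_train_batch_size = 8    # from TrainingArguments
num_train_epochs = 3               # from TrainingArguments

steps_per_epoch = math.ceil(n_train / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch, total_steps)
# 28009 84027
```

Since `torch.cuda.is_available()` returned `False` above, all of these steps would run on the CPU with the current installation.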

In [None]:
# saving the trained model
trainer.save_model("biogpt-finetuned-symptom-diagnosis")
tokenizer.save_pretrained("biogpt-finetuned-symptom-diagnosis")

In [None]:
# inference test with the fine-tuned model
generator = pipeline('text-generation', model="biogpt-finetuned-symptom-diagnosis", tokenizer=tokenizer)
generator("The pacient presents the following symptoms: fever, cough, fatigue.", max_length=100)