Most of the records in the dataset were enriched with NLP to generate disease descriptions, and with web scraping of the Mayo Clinic website to extract each disease's risk factors. However, some values are still missing and need to be filled in. For this, I decided to use the BioGPT LLM to complete the data synthetically. The model was developed by Microsoft Research specifically for biomedical tasks. It follows the Transformer architecture (GPT-style) but was trained exclusively on biomedical text, such as PubMed articles, scientific abstracts, and specialized medical literature.

**Reference:** https://huggingface.co/microsoft/biogpt

**Reference article:**

LUO, Renqian et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics, [S.l.], v. 23, n. 6, Sept. 2022. Available at: https://doi.org/10.1093/bib/bbac409.

In [1]:
# importing the required libraries
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer


In [2]:
# loading the merged and standardized dataset
merged_dataset = pd.read_csv("./datasets/merged_dataset.csv")

In [3]:
# dropping the index column that to_csv saved, then previewing the dataset
merged_dataset.drop(columns=["Unnamed: 0"], inplace=True)
merged_dataset.head()

Unnamed: 0,diseases,abdomen acute,abdomen distended,abdominal bloating,abdominal colic,abdominal pain,abdominal tenderness,abnormal appearing skin,abnormal appearing tongue,abnormal breathing sounds,...,wrist pain,wrist stiffness or tightness,wrist swelling,wrist weakness,yellow color,yellow crust ooze,yellow sputum,yellowing of eyes,diseases_description,disease_risk_factors
0,Panic Disorder,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,A type of anxiety disorder characterized by un...,Symptoms of panic disorder often start in the ...
1,Panic Disorder,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,A type of anxiety disorder characterized by un...,Symptoms of panic disorder often start in the ...
2,Panic Disorder,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,A type of anxiety disorder characterized by un...,Symptoms of panic disorder often start in the ...
3,Panic Disorder,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,A type of anxiety disorder characterized by un...,Symptoms of panic disorder often start in the ...
4,Panic Disorder,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,A type of anxiety disorder characterized by un...,Symptoms of panic disorder often start in the ...


In [4]:
# example code for using the BioGPT model
import torch
from transformers import pipeline, set_seed
from transformers import BioGptTokenizer, BioGptForCausalLM
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt") # instantiating the model

# moving the model to the system GPU (Nvidia RTX 3050), falling back to the CPU if CUDA is unavailable
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt") # instantiating the tokenizer
# passing device explicitly so the pipeline does not warn that no `device` argument was given
generator = pipeline('text-generation', model=model, tokenizer=tokenizer, device=device) # creating the text generator
set_seed(42) # setting the random seed


## Filling in descriptions and risk factors

In [5]:
# selecting the diseases that have no description
diseases_list = merged_dataset.loc[merged_dataset['diseases_description'] == 'There is no description']['diseases'].unique()
diseases_list

array(['Juvenile osteochondrosis of tibial tubercle',
       'Rotator Cuff Injuries', 'Infectious gastroenteritis',
       'hypertrophic obstructive cardiomyopathy (hocm)',
       'Pulmonary congestion', 'Diabetes', 'idiopathic absence',
       'Allergy Specialty', 'Polymyalgia Rheumatica',
       'Acute bronchospasm', 'acute glaucoma', 'Insulin overdose',
       'stenosis of the tear duct', 'Omphalitis',
       'Glucocorticoid deficiency', 'overflow incontinence',
       'Peritonsillar Abscess', 'Alcohol withdrawal syndrome',
       'envenomation from spider or animal bite', 'Acontractile detrusor',
       'Vitreous degeneration', 'allergy to animals', 'Chronic ulcer',
       'Rheumatic tricuspid valve disease',
       'pain disorder affecting the neck',
       'paroxysmal ventricular tachycardia', 'pyogenic skin infection',
       'Viral exanthem', 'Noninfectious gastroenteritis',
       'persistent vomiting of unknown cause',
       'paroxysmal supraventricular tachycardia', 'Diabet

In [6]:
# selecting the diseases that have no risk factors
risk_factors_list = merged_dataset.loc[merged_dataset['disease_risk_factors'] == 'There is no risk factors for this disease']['diseases'].unique()
risk_factors_list

array(['Polyp of vocal cord', 'poisoning due to ethylene glycol',
       'Postmenopausal atrophic vaginitis',
       'cellulitis or abscess of mouth', 'eye alignment disorder',
       'headache after lumbar puncture', 'salivary gland disorder',
       'Juvenile osteochondrosis of tibial tubercle',
       'Traumatic AND/OR non-traumatic injury',
       'tinnitus of unknown cause', 'chronic pain disorder',
       'problem during pregnancy',
       'Liver and Intrahepatic Biliary Tract Carcinoma',
       'Choledocholithiasis', 'Hematoma, Subdural',
       'ischemia of the bowel', 'Pancreatitis, Acute',
       'foreign body in the vagina',
       'Pathological accumulation of air in tissues', 'Cysticercosis',
       'Induce (action)', 'teething syndrome',
       'Infectious gastroenteritis', 'Substance-related mental disorders',
       'Coronary Arteriosclerosis', 'idiopathic nonmenstrual bleeding',
       'Meibomian Cyst', 'Ovarian Torsion',
       'retinopathy due to high blood pressure'

In [7]:
# example of generating a disease description with the model
generator(f"{diseases_list[0]} has the following description: {diseases_list[0]} is", max_length=80, num_return_sequences=3, do_sample=True, truncation=True)

[{'generated_text': 'Juvenile osteochondrosis of tibial tubercle has the following description: Juvenile osteochondrosis of tibial tubercle is a clinical entity with several etiologies.'},
 {'generated_text': 'Juvenile osteochondrosis of tibial tubercle has the following description: Juvenile osteochondrosis of tibial tubercle is an underdiagnosed and undertreated condition, which is probably the most frequent and the most frequent bone developmental disorder in the paediatric knee.'},
 {'generated_text': 'Juvenile osteochondrosis of tibial tubercle has the following description: Juvenile osteochondrosis of tibial tubercle is found in a high proportion of the breed with the exception of the French bulldog and the American Bulldog.'}]

In [8]:
# example of generating risk factors
generator(f"{risk_factors_list[0]} has the risk factors:", max_length=80, num_return_sequences=3, do_sample=True, truncation=True)

[{'generated_text': 'Polyp of vocal cord has the risk factors: old age, tobacco exposure and previous laryngeal surgery.'},
 {'generated_text': 'Polyp of vocal cord has the risk factors: chronic laryngitis, vocal cord polyp (s), polyps of the hypopharynx (hypolaryngopharynx).'},
 {'generated_text': 'Polyp of vocal cord has the risk factors: tobacco, gastroesophageal reflux disease, thyroid diseases, laryngopharyngeal reflux, laryngopharyngeal hypersensitivity, diabetes and obesity.'}]

In [None]:
def generate_description(disease):
    """Generate three candidate descriptions and return the longest one."""
    lista_final = []
    description_list = generator(f"{disease} has the following description: {disease} is", max_length=80, num_return_sequences=3, do_sample=True, truncation=True)
    for description in description_list:
        # keep everything after the prompt's colon; maxsplit=1 preserves colons inside the generated text
        text = description.get('generated_text').split(':', 1)[1]
        lista_final.append(text)
    return max(lista_final, key=len)

In [19]:
def generate_risk_factors(disease):
    """Generate three candidate risk-factor lists and return the longest one."""
    lista_final = []
    risk_factors = generator(f"{disease} has the risk factors:", max_length=80, num_return_sequences=3, do_sample=True, truncation=True)
    for risk in risk_factors:
        # keep everything after the prompt's colon; maxsplit=1 preserves colons inside the generated text
        text = risk.get('generated_text').split(':', 1)[1]
        lista_final.append(text)
    return max(lista_final, key=len)
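
Both functions above recover the generated text by splitting on the colon in the prompt. A minimal sketch of an alternative is to remove the prompt itself when the output starts with it. The helper name `extract_completion` is hypothetical, and the sample string is made up purely for illustration; it is not model output.

```python
def extract_completion(generated_text: str, prompt: str) -> str:
    """Return only the text that follows the prompt (hypothetical helper)."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fallback: the decoded text may not reproduce the prompt verbatim,
    # so split on the first colon only and keep everything after it.
    return generated_text.split(":", 1)[-1].strip()


prompt = "Polyp of vocal cord has the risk factors:"
# Made-up string for illustration; it contains a second colon on purpose.
sample = "Polyp of vocal cord has the risk factors: smoking (ratio 2:1), reflux."
print(extract_completion(sample, prompt))
```

For the description prompt, passing only the part up to the colon as `prompt` would keep the "`{disease} is`" sentence start, which matches what the functions above return.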

In [11]:
# generating descriptions for the diseases in the dataset
description_mapping = {disease: generate_description(disease) for disease in diseases_list}

merged_dataset['diseases_description'] = merged_dataset.apply(
    lambda row: description_mapping.get(row['diseases'], row['diseases_description']),
    axis=1
)
merged_dataset.head()

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Unnamed: 0,diseases,abdomen acute,abdomen distended,abdominal bloating,abdominal colic,abdominal pain,abdominal tenderness,abnormal appearing skin,abnormal appearing tongue,abnormal breathing sounds,...,wrist pain,wrist stiffness or tightness,wrist swelling,wrist weakness,yellow color,yellow crust ooze,yellow sputum,yellowing of eyes,diseases_description,disease_risk_factors
0,Panic Disorder,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,A type of anxiety disorder characterized by un...,Symptoms of panic disorder often start in the ...
1,Panic Disorder,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,A type of anxiety disorder characterized by un...,Symptoms of panic disorder often start in the ...
2,Panic Disorder,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,A type of anxiety disorder characterized by un...,Symptoms of panic disorder often start in the ...
3,Panic Disorder,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,A type of anxiety disorder characterized by un...,Symptoms of panic disorder often start in the ...
4,Panic Disorder,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,A type of anxiety disorder characterized by un...,Symptoms of panic disorder often start in the ...


In [None]:
risk_mapping = {disease: generate_risk_factors(disease) for disease in risk_factors_list}

merged_dataset['disease_risk_factors'] = merged_dataset.apply(
    lambda row: risk_mapping.get(row['diseases'], row['disease_risk_factors']),
    axis=1
)
merged_dataset.head()

Unnamed: 0,diseases,abdomen acute,abdomen distended,abdominal bloating,abdominal colic,abdominal pain,abdominal tenderness,abnormal appearing skin,abnormal appearing tongue,abnormal breathing sounds,...,wrist pain,wrist stiffness or tightness,wrist swelling,wrist weakness,yellow color,yellow crust ooze,yellow sputum,yellowing of eyes,diseases_description,disease_risk_factors
0,Panic Disorder,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,A type of anxiety disorder characterized by un...,Symptoms of panic disorder often start in the ...
1,Panic Disorder,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,A type of anxiety disorder characterized by un...,Symptoms of panic disorder often start in the ...
2,Panic Disorder,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,A type of anxiety disorder characterized by un...,Symptoms of panic disorder often start in the ...
3,Panic Disorder,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,A type of anxiety disorder characterized by un...,Symptoms of panic disorder often start in the ...
4,Panic Disorder,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,A type of anxiety disorder characterized by un...,Symptoms of panic disorder often start in the ...
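
The two fill-in cells above use a row-wise `apply` with `dict.get`. A possible column-wise equivalent, sketched here on a toy frame rather than on `merged_dataset`, uses `Series.map` followed by `fillna`. The toy values and the name `mapping` are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for merged_dataset; the values are illustrative only.
df = pd.DataFrame({
    "diseases": ["A", "A", "B"],
    "disease_risk_factors": ["placeholder", "placeholder", "known factors"],
})
mapping = {"A": "generated factors"}

# map() yields NaN for diseases absent from the mapping;
# fillna() then restores the value the row already had.
df["disease_risk_factors"] = (
    df["diseases"].map(mapping).fillna(df["disease_risk_factors"])
)
print(df["disease_risk_factors"].tolist())
```

This avoids calling a Python lambda once per row, which may matter on a wide dataset like this one. It behaves the same as the `apply` version as long as no generated value is itself missing.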


In [16]:
merged_dataset.loc[merged_dataset['diseases_description'] == 'There is no description']['diseases'].unique()

array([], dtype=object)

In [21]:
merged_dataset.loc[merged_dataset['disease_risk_factors'] == 'There is no risk factors for this disease']['diseases'].unique()

array([], dtype=object)

In [None]:
# descriptions generated for the diseases that were missing one
description_mapping

{'Juvenile osteochondrosis of tibial tubercle': ' Juvenile osteochondrosis of tibial tubercle is a self-limited condition that starts by transient synovial metaplasia and cartilage mineralization within the synovium.',
 'Rotator Cuff Injuries': ' Rotator Cuff Injuries is one of the most common shoulder pathologies, affecting between 0.5% and 25% of patients.',
 'Infectious gastroenteritis': ' Infectious gastroenteritis is a gastrointestinal symptom, defined here to include any of the three following infectious etiologies; infection with enterotoxigenic Escherichia coli (ETEC) - which produce heat-stable (ST) or heat-labile (LT) or both (ETEC infection; non-bacterial gastroenteritis, NBG); and infection with norovirus - which produces both single stranded RNA',
 'hypertrophic obstructive cardiomyopathy (hocm)': ' hypertrophic obstructive cardiomyopathy (hocm) is characterized by an obstruction to left ventricular outflow by a fibrous tissue mass which causes severe, often progressive, l

In [None]:
# risk factors generated for the diseases that were missing them
risk_mapping

{'Polyp of vocal cord': ' inflammation, allergy, laryngopharyngeal reflux and the patients history of malignancy.',
 'poisoning due to ethylene glycol': ' high temperature as the primary cause, high humidity during the night, and use of unfiltered fresh-water supplies.',
 'Postmenopausal atrophic vaginitis': ' vaginal yeast infection (p = 0.037), bacterial vaginosis (p = 0.044), short duration of use of the product (p = 0.006) and the absence of any health education (p = 0.000).',
 'cellulitis or abscess of mouth': ' smoking, oral antibiotic medications, oral trauma and use of mouthwash, toothbrush and chewing stick.',
 'eye alignment disorder': ' 1.) strabismus, 2.) poor vision, 3.) abnormal eye alignment, and 4.) abnormal eye position.) The risk factors are the same for both children and adults.',
 'headache after lumbar puncture': ' a recent history of low back pain, previous lumbar puncture, and a higher number of cerebrospinal fluid proteins on spinal fluid examination; these para

> This is the final result: the complete, enriched, and standardized dataset!

In [23]:
# final result with the complete dataset
merged_dataset.to_csv('./datasets/merged_dataset.csv') # saving the file; the index is written too and shows up as "Unnamed: 0" when the CSV is read back
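
The `Unnamed: 0` column dropped right after loading comes from the index that `to_csv` writes by default. A small sketch of the round trip, using an in-memory string and a toy frame instead of the real file, shows the difference that `index=False` makes:

```python
import io
import pandas as pd

# Toy frame for illustration; not the real dataset.
df = pd.DataFrame({"diseases": ["Panic Disorder"], "abdominal pain": [0]})

# Default: the index is written as an unnamed first column.
with_index = pd.read_csv(io.StringIO(df.to_csv()))

# index=False: only the data columns are written.
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

If the save cell were switched to `index=False`, the `drop(columns=["Unnamed: 0"])` call in the loading cell would no longer find that column, so it would need `errors="ignore"` or to be removed.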