In [1]:
from experiment_logger import ExperimentLogger

In [2]:
# Libraries

# Data handling
import pandas as pd
import numpy as np
pd.set_option("display.max_colwidth", None)

# Configure pandas to show all columns and rows
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Column Information

### Identification and Follow-up
- **Patient ID**: Unique identifier for each patient.
- **Follow-up period from enrollment (days)**: Follow-up time, in days, since the patient enrolled in the study.
- **days_4years**: Follow-up days capped at 4 years (1460 days), used to control follow-up time.
- **Exit of the study**: Reason the patient left the study (e.g., discharge, death).
- **Cause of death**: Cause of death, if any (0 means the patient did not die during the study).

### Cardiovascular Events and Conditions
- **SCD_4years SinusRhythm**: Sudden cardiac death (SCD) within 4 years, in sinus rhythm.
- **HF_4years SinusRhythm**: Heart failure (HF) within 4 years, in sinus rhythm.
- **Number of ventricular premature contractions per hour**: Number of ventricular premature contractions per hour.
- **Non-sustained ventricular tachycardia (CH>10)**: Non-sustained ventricular tachycardia, when the number of ventricular premature contractions per hour is greater than 10.
- **Number of supraventricular premature beats in 24h**: Number of supraventricular premature beats in 24 hours.
- **Paroxysmal supraventricular tachyarrhythmia**: Paroxysmal supraventricular tachyarrhythmia.
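
In the sample rows displayed later in this notebook, the non-zero hourly rates equal the 24-hour ventricular premature beat count divided by 24 (2700 / 24 = 112.5 and 1854 / 24 = 77.25), and the `(CH>10)` flag is 1 in those rows. Other displayed rows show 0.0 in the hourly column even with non-zero 24-hour counts, so the stored column may be blank or zero-filled there. The sketch below only illustrates the apparent relationship; the helper name `vpc_per_hour_and_flag` and the third input value are hypothetical.

```python
import pandas as pd

def vpc_per_hour_and_flag(beats_24h: pd.Series, threshold: float = 10.0) -> pd.DataFrame:
    """Derive an hourly rate and a '> threshold' flag from 24-hour counts.

    Assumption: rate = count / 24 and flag = rate > 10, as suggested by two sample rows.
    """
    per_hour = beats_24h / 24.0
    flag = (per_hour > threshold).astype(int)
    return pd.DataFrame({"per_hour": per_hour, "ch_gt_10": flag})

# 2700 and 1854 come from the displayed rows; 120 is a made-up low count
example = vpc_per_hour_and_flag(pd.Series([2700.0, 1854.0, 120.0]))
print(example)
```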
  
### Demographic and Clinical Data
- **Age**: Patient's age.
- **Gender (male=1)**: Patient's gender (1 for male, 0 for female).
- **Weight (kg)**: Patient's weight in kilograms.
- **Height (cm)**: Patient's height in centimeters.
- **Body Mass Index (Kg/m2)**: Patient's body mass index (BMI).
- **NYHA class**: Heart failure class according to the New York Heart Association (NYHA) classification.
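
The BMI values in the rows displayed later in this notebook agree with the usual formula, weight in kilograms divided by the square of height in meters. A minimal sketch of that arithmetic, using the weight and height of the first two displayed patients; the function name `body_mass_index` is hypothetical.

```python
def body_mass_index(weight_kg: float, height_cm: float) -> float:
    """Return BMI in kg/m^2 from weight in kg and height in cm."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2

# Weight and height taken from the first two rows shown by df.head()
bmi_p0001 = round(body_mass_index(83, 163), 1)
bmi_p0002 = round(body_mass_index(74, 160), 1)
print(bmi_p0001, bmi_p0002)
```

Both results match the `Body Mass Index (Kg/m2)` values displayed for those patients (31.2 and 28.9).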

### Vital Signs and Diagnoses
- **Diastolic blood pressure (mmHg)**: Diastolic blood pressure (mmHg).
- **Systolic blood pressure (mmHg)**: Systolic blood pressure (mmHg).
- **HF etiology - Diagnosis**: Heart failure etiology or initial diagnosis.

### Comorbidities and Medical History
- **Diabetes (yes=1)**: History of diabetes.
- **History of dyslipemia (yes=1)**: History of dyslipidemia.
- **Peripheral vascular disease (yes=1)**: Peripheral vascular disease.
- **History of hypertension (yes=1)**: History of hypertension.
- **Prior Myocardial Infarction (yes=1)**: Prior myocardial infarction.
- **Prior implantable device**: Implantable device (e.g., pacemaker).
- **Prior Revascularization**: Prior revascularization procedure (e.g., coronary bypass).
- **Syncope**: Episodes of syncope (fainting).

### Habits and Consumption
- **daily smoking (cigarettes/day)**: Number of cigarettes per day.
- **smoke-free time (years)**: Years since quitting smoking.
- **cigarettes/year**: Annual cigarette consumption.
- **alcohol consumption (standard units)**: Alcohol consumption in standard units.

### Laboratory Tests
- **Albumin (g/L)**: Blood albumin level.
- **ALT or GPT (IU/L)**: ALT (GPT) enzyme level.
- **AST or GOT (IU/L)**: AST (GOT) enzyme level.
- **Normalized Troponin**: Normalized troponin.
- **Total Cholesterol (mmol/L)**: Total cholesterol.
- **Creatinine (µmol/L)**: Creatinine.
- **Gamma-glutamil transpeptidase (IU/L)**: GGT.
- **Glucose (mmol/L)**: Glucose.
- **Hemoglobin (g/L)**: Hemoglobin.
- **HDL (mmol/L)**: HDL cholesterol.
- **Potassium (mEq/L)**: Potassium.
- **LDL (mmol/L)**: LDL cholesterol.
- **Sodium (mEq/L)**: Sodium.
- **Pro-BNP (ng/L)**: Pro-brain natriuretic peptide.
- **Protein (g/L)**: Total protein.
- **T3 (pg/dL)**: Triiodothyronine (T3).
- **T4 (ng/L)**: Thyroxine (T4).
- **Troponin (ng/mL)**: Troponin.
- **TSH (mIU/L)**: Thyroid-stimulating hormone.
- **Urea (mg/dL)**: Urea.

### Imaging Exams and Cardiac Measurements
- **Signs of pulmonary venous hypertension (yes=1)**: Signs of pulmonary venous hypertension.
- **Cardiothoracic ratio**: Cardiothoracic ratio.
- **Left atrial size (mm)**: Left atrial size.
- **Right ventricle contractility (altered=1)**: Altered right ventricular contractility.
- **Right ventricle end-diastolic diameter (mm)**: Right ventricular end-diastolic diameter.
- **LVEF (%)**: Left ventricular ejection fraction (LVEF).
- **Mitral valve insufficiency**: Mitral valve insufficiency.
- **Mitral flow pattern**: Mitral flow pattern.
- **Left ventricular posterior wall thickness (mm)**: Left ventricular posterior wall thickness.
- **Septal thickness (mm)**: Interventricular septal thickness.
- **Left ventricle end-diastolic diameter (mm)**: Left ventricular end-diastolic diameter.
- **Left ventricle end-systolic diameter (mm)**: Left ventricular end-systolic diameter.

### Electrocardiograms and Holter
- **Hig-resolution ECG available**: High-resolution ECG available.
- **ECG rhythm**: Cardiac rhythm on the ECG.
- **Q-waves (necrosis, yes=1)**: Presence of Q waves on the ECG (necrosis).
- **PR interval (ms)**: PR interval on the ECG.
- **QRS duration (ms)**: Duration of the QRS complex.
- **QRS > 120 ms**: QRS duration greater than 120 ms.
- **QT interval (ms)**: QT interval.
- **QT corrected**: Corrected QT interval.
- **Average RR (ms)**: Mean RR interval.
- **Left ventricular hypertrophy (yes=1)**: Left ventricular hypertrophy.
- **Intraventricular conduction disorder**: Intraventricular conduction disorder.

### Holter
- **Holter available**: Holter monitoring available.
- **Holter onset (hh:mm:ss)**: Start time of the Holter monitoring.
- **Holter rhythm**: Rhythm recorded on the Holter.
- **minimum RR (ms)**: Minimum RR interval recorded.
- **Average RR (ms)**: Mean RR interval recorded.
- **maximum RR (ms)**: Maximum RR interval recorded.
- **RR range (ms)**: Range of the RR intervals on the Holter (maximum RR minus minimum RR).
- **Number of ventricular premature beats in 24h**: Number of ventricular premature beats in 24 hours.
- **Extrasystole couplets**: Ventricular extrasystole couplets.
- **Ventricular Extrasystole**: Number of ventricular extrasystoles.
- **Non-sustained ventricular tachycardia**: Non-sustained ventricular tachycardia.
- **Holter artifact burden (%)**: Percentage of artifacts in the Holter recording.
- **Longest RR pause (ms)**: Longest RR pause recorded, in milliseconds.
- **Bradycardia**: Presence of bradycardia (slow heart rate).
- **SDNN (ms)**: Standard deviation of the NN (normal-to-normal beat) intervals on the Holter.
- **SDANN (ms)**: Standard deviation of the mean NN intervals computed over 5-minute periods.
- **RMSSD (ms)**: Square root of the mean of the squared successive differences between RR intervals.
- **pNN50 (%)**: Percentage of consecutive NN intervals that differ by more than 50 ms.
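
The SDNN, RMSSD, and pNN50 definitions above can be turned directly into arithmetic on a sequence of NN intervals. The sketch below is illustrative only: the function name `hrv_time_domain` and the five interval values are made up, and the use of the sample standard deviation (`ddof=1`) is an assumption, since conventions vary between tools. SDANN is left out because it needs the recording split into 5-minute segments.

```python
import numpy as np

def hrv_time_domain(nn_ms):
    """Compute SDNN, RMSSD and pNN50 from NN intervals given in milliseconds."""
    nn = np.asarray(nn_ms, dtype=float)
    diffs = np.diff(nn)  # successive differences between consecutive intervals
    return {
        "SDNN": float(np.std(nn, ddof=1)),                      # sample standard deviation (assumption)
        "RMSSD": float(np.sqrt(np.mean(diffs ** 2))),           # root mean square of successive differences
        "pNN50": float(100.0 * np.mean(np.abs(diffs) > 50.0)),  # % of differences larger than 50 ms
    }

# Hypothetical NN intervals (ms), not taken from the dataset
metrics = hrv_time_domain([800, 860, 790, 810, 900])
print(metrics)
```

For these five intervals the successive differences are 60, -70, 20, and 90 ms, so three of the four exceed 50 ms in magnitude and pNN50 is 75%.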


### Blood Pressure
- **Systolic blood pressure >120mmHg**: Systolic pressure >120 mmHg.
- **Diastolic blood pressure >80mmHg**: Diastolic pressure >80 mmHg.

### Medications

- **Calcium channel blocker (yes=1)**: Use of a calcium channel blocker.
- **Diabetes medication (yes=1)**: Use of diabetes medication.
- **Amiodarone (yes=1)**: Use of amiodarone (antiarrhythmic).
- **Angiotensin-II receptor blocker (yes=1)**: Use of an angiotensin II receptor blocker.
- **Anticoagulants/antitrombotics (yes=1)**: Use of anticoagulants or antithrombotics.
- **Betablockers (yes=1)**: Use of beta blockers.
- **Digoxin (yes=1)**: Use of digoxin (for treating heart failure and arrhythmias).
- **Loop diuretics (yes=1)**: Use of loop diuretics.
- **Spironolactone (yes=1)**: Use of spironolactone (potassium-sparing diuretic).
- **Statins (yes=1)**: Use of statins (cholesterol-lowering medication).
- **Hidralazina (yes=1)**: Use of hydralazine (vasodilator).
- **ACE inhibitor (yes=1)**: Use of an angiotensin-converting enzyme (ACE) inhibitor.
- **Nitrovasodilator (yes=1)**: Use of a nitrovasodilator (vasodilating medication).


In [None]:
# from google.colab import drive

# drive.mount('/content/drive')
# link_definicao = '/content/drive/MyDrive/UFC_mestrado/Sigaa_UFC/Tese_dissertativo_Mestrado/Projeto_Tese_mestrado/02_Dataset/subject-info_definitions.csv'
# link_codes = '/content/drive/MyDrive/UFC_mestrado/Sigaa_UFC/Tese_dissertativo_Mestrado/Projeto_Tese_mestrado/02_Dataset/subject-info_codes.csv'

In [None]:
link_codes = r'D:\Projeto_Tese_mestrado\02_Dataset\dados_csv_info_definitions\subject-info_codes.csv'
link_definicao = r'D:\Projeto_Tese_mestrado\02_Dataset\dados_csv_info_definitions\subject-info_definitions.csv'
link_csv = r'D:\Projeto_Tese_mestrado\02_Dataset\dados_csv_info_definitions\ubject-info_limpo.csv'

In [None]:
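# Definition and code tables are semicolon-delimited; the code table is Latin-1 encoded
df_def = pd.read_csv(link_definicao, sep=';')
df_cod = pd.read_csv(link_codes, sep=';', encoding='latin-1')
# Cleaned subject table (comma-delimited, default encoding)
df = pd.read_csv(link_csv)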
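
The `sep` and `encoding` arguments matter for files like these. The cleaned table's header later appears as `Creatinine (?mol/L)`, which suggests, without proving it, that the `µ` character was lost at some earlier encoding step. The sketch below uses a made-up in-memory file (not the real dataset) to show a semicolon-delimited, Latin-1 encoded table being read with its non-ASCII character intact.

```python
import io
import pandas as pd

# Hypothetical stand-in for a semicolon-delimited, Latin-1 encoded file
raw_bytes = "Variable;Unit\nCreatinine;µmol/L\n".encode("latin-1")

# Passing the matching encoding keeps the 'µ' character readable
sample = pd.read_csv(io.BytesIO(raw_bytes), sep=";", encoding="latin-1")
print(sample)
```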

In [None]:
df.head()

Unnamed: 0,Patient ID,Follow-up period from enrollment (days),days_4years,Exit of the study,Cause of death,SCD_4years SinusRhythm,HF_4years SinusRhythm,Age,Gender (male=1),Weight (kg),Height (cm),Body Mass Index (Kg/m2),NYHA class,Diastolic blood pressure (mmHg),Systolic blood pressure (mmHg),HF etiology - Diagnosis,Diabetes (yes=1),History of dyslipemia (yes=1),Peripheral vascular disease (yes=1),History of hypertension (yes=1),Prior Myocardial Infarction (yes=1),Prior implantable device,Prior Revascularization,Syncope,daily smoking (cigarretes/day),smoke-free time (years),cigarettes /year,alcohol consumption (standard units),Albumin (g/L),ALT or GPT (IU/L),AST or GOT (IU/L),Normalized Troponin,Total Cholesterol (mmol/L),Creatinine (?mol/L),Gamma-glutamil transpeptidase (IU/L),Glucose (mmol/L),Hemoglobin (g/L),HDL (mmol/L),Potassium (mEq/L),LDL (mmol/L),Sodium (mEq/L),Pro-BNP (ng/L),Protein (g/L),T3 (pg/dL),T4 (ng/L),Troponin (ng/mL),TSH (mIU/L),Urea (mg/dL),Signs of pulmonary venous hypertension (yes=1),Cardiothoracic ratio,Left atrial size (mm),Right ventricle contractility (altered=1),Right ventricle end-diastolic diameter (mm),LVEF (%),Mitral valve insufficiency,Mitral flow pattern,Left ventricular posterior wall thickness (mm),Septal thickness (mm),Left ventricle end-diastolic diameter (mm),Left ventricle end-systolic diameter (mm),Hig-resolution ECG available,ECG rhythm,Q-waves (necrosis. 
yes=1),PR interval (ms),QRS duration (ms),QRS > 120 ms,QT interval (ms),QT corrected,Average RR (ms),Left ventricular hypertrophy (yes=1),Intraventricular conduction disorder,Holter available,Holter onset (hh:mm:ss),Holter rhythm,minimum RR (ms),Average RR (ms).1,maximum RR (ms),RR range (ms),Number of ventricular premature beats in 24h,Extrasystole couplets,Ventricular Extrasystole,Ventricular Tachycardia,Number of ventricular premature contractions per hour,Non-sustained ventricular tachycardia (CH>10),Number of supraventricular premature beats in 24h,Paroxysmal supraventricular tachyarrhythmia,Longest RR pause (ms),Bradycardia,SDNN (ms),SDANN (ms),RMSSD (ms),pNN50 (%),Calcium channel blocker (yes=1),Diabetes medication (yes=1),Amiodarone (yes=1),Angiotensin-II receptor blocker (yes=1),Anticoagulants/antitrombotics (yes=1),Betablockers (yes=1),Digoxin (yes=1),Loop diuretics (yes=1),Spironolactone (yes=1),Statins (yes=1),Hidralazina (yes=1),ACE inhibitor (yes=1),Nitrovasodilator (yes=1)
0,P0001,2065,1460,0.0,0,0,0,58,1,83,163,31.2,3,75,110,1,0,0,0,0,0,0,0,0,20,20,160600,0,42.4,10.0,20.0,1.0,5.4,106.0,20.0,5.7,132.0,1.29,4.6,3.36,141.0,1834.0,69.0,0.05,15.0,0.01,3.02,7.12,1,0.55,50.0,0.0,24.0,35.0,3,9.0,10.0,10.0,72.0,60.0,1,1,0,999,132,1,448,425,1111,0,2,1,0,1.0,375.0,984.0,2143.0,1768.0,2700.0,10.0,2.0,1.0,112.5,1.0,999.0,9.0,3840.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1,1,1,1,0,0,0,1,0
1,P0002,2045,1460,0.0,0,0,0,58,1,74,160,28.9,2,80,130,2,0,1,0,0,1,0,1,0,20,1,292000,10,40.4,20.0,20.0,1.0,6.18,121.0,44.0,5.6,126.0,0.98,4.6,4.06,140.0,570.0,75.0,0.04,12.0,0.01,3.27,10.47,0,0.52,39.0,0.0,24.0,35.0,1,0.0,12.0,14.0,54.0,38.0,1,0,1,206,110,0,440,406,1176,0,1,1,41923,0.0,390.0,682.0,1154.0,764.0,12.0,0.0,1.0,0.0,0.0,0.0,7.0,0.0,0.0,0.0,117.0,110.0,10.9,0.2,0,0,0,1,1,1,0,0,0,1,0,0,0
2,P0003,2044,1460,0.0,0,0,0,69,1,83,174,27.4,2,75,100,1,0,0,0,0,0,0,0,0,15,9,246375,13,40.1,23.0,28.0,1.0,5.3,87.0,25.0,5.7,132.0,2.04,4.7,2.97,138.0,403.0,76.0,0.05,13.0,0.01,0.93,10.02,0,0.52,41.0,0.0,20.0,39.0,1,9.0,9.0,10.0,55.0,44.0,1,1,0,999,84,0,336,438,588,0,0,1,0,1.0,300.0,667.0,1622.0,1322.0,1854.0,92.0,2.0,1.0,77.25,1.0,999.0,9.0,2315.0,0.0,0.0,0.0,0.0,0.0,0,0,0,1,1,1,1,1,0,0,0,0,0
3,P0004,2044,1460,0.0,0,0,0,56,0,84,165,30.9,2,75,155,8,1,1,0,1,0,0,0,0,0,0,0,0,40.9,24.0,23.0,1.0,6.21,77.0,37.0,17.8,127.0,1.03,4.3,3.49,136.0,695.0,72.0,0.05,16.0,0.01,2.07,8.91,0,0.57,43.0,0.0,24.0,38.0,1,2.0,13.0,11.0,56.0,46.0,1,0,0,202,152,1,440,465,896,0,2,1,40119,0.0,561.0,845.0,1154.0,593.0,1.0,0.0,1.0,0.0,0.0,0.0,17.0,1.0,0.0,0.0,79.0,65.0,28.9,2.3,0,1,0,1,1,1,0,1,1,0,0,0,0
4,P0005,2043,1460,0.0,0,0,0,70,1,97,183,29.0,2,85,125,2,1,1,0,1,1,3,2,0,30,9,525600,0,45.3,24.0,23.0,1.0,5.72,88.0,26.0,7.8,159.0,1.06,4.9,3.28,140.0,456.0,82.0,0.05,16.0,0.01,1.01,10.91,0,0.56,55.0,0.0,25.0,34.0,1,1.0,11.0,11.0,73.0,67.0,1,3,0,999,194,1,466,492,896,0,2,1,0,3.0,556.0,811.0,1000.0,444.0,77.0,0.0,2.0,0.0,0.0,0.0,999.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0,0,1,1,0,1,0,1,0,1,1


In [None]:
print(f'Number of columns: {df.shape[1]}')
print(f'Number of rows: {df.shape[0]}')

Number of columns: 105
Number of rows: 992


In [None]:
# dados = pd.DataFrame(df.loc[:,df.columns[0]:'Nitrovasodilator (yes=1)'])
#dados = df.loc[:,df.columns[0]:'Nitrovasodilator (yes=1)']
dados = df.copy()
dados.head()

Unnamed: 0,Patient ID,Follow-up period from enrollment (days),days_4years,Exit of the study,Cause of death,SCD_4years SinusRhythm,HF_4years SinusRhythm,Age,Gender (male=1),Weight (kg),Height (cm),Body Mass Index (Kg/m2),NYHA class,Diastolic blood pressure (mmHg),Systolic blood pressure (mmHg),HF etiology - Diagnosis,Diabetes (yes=1),History of dyslipemia (yes=1),Peripheral vascular disease (yes=1),History of hypertension (yes=1),Prior Myocardial Infarction (yes=1),Prior implantable device,Prior Revascularization,Syncope,daily smoking (cigarretes/day),smoke-free time (years),cigarettes /year,alcohol consumption (standard units),Albumin (g/L),ALT or GPT (IU/L),AST or GOT (IU/L),Normalized Troponin,Total Cholesterol (mmol/L),Creatinine (?mol/L),Gamma-glutamil transpeptidase (IU/L),Glucose (mmol/L),Hemoglobin (g/L),HDL (mmol/L),Potassium (mEq/L),LDL (mmol/L),Sodium (mEq/L),Pro-BNP (ng/L),Protein (g/L),T3 (pg/dL),T4 (ng/L),Troponin (ng/mL),TSH (mIU/L),Urea (mg/dL),Signs of pulmonary venous hypertension (yes=1),Cardiothoracic ratio,Left atrial size (mm),Right ventricle contractility (altered=1),Right ventricle end-diastolic diameter (mm),LVEF (%),Mitral valve insufficiency,Mitral flow pattern,Left ventricular posterior wall thickness (mm),Septal thickness (mm),Left ventricle end-diastolic diameter (mm),Left ventricle end-systolic diameter (mm),Hig-resolution ECG available,ECG rhythm,Q-waves (necrosis. 
yes=1),PR interval (ms),QRS duration (ms),QRS > 120 ms,QT interval (ms),QT corrected,Average RR (ms),Left ventricular hypertrophy (yes=1),Intraventricular conduction disorder,Holter available,Holter onset (hh:mm:ss),Holter rhythm,minimum RR (ms),Average RR (ms).1,maximum RR (ms),RR range (ms),Number of ventricular premature beats in 24h,Extrasystole couplets,Ventricular Extrasystole,Ventricular Tachycardia,Number of ventricular premature contractions per hour,Non-sustained ventricular tachycardia (CH>10),Number of supraventricular premature beats in 24h,Paroxysmal supraventricular tachyarrhythmia,Longest RR pause (ms),Bradycardia,SDNN (ms),SDANN (ms),RMSSD (ms),pNN50 (%),Calcium channel blocker (yes=1),Diabetes medication (yes=1),Amiodarone (yes=1),Angiotensin-II receptor blocker (yes=1),Anticoagulants/antitrombotics (yes=1),Betablockers (yes=1),Digoxin (yes=1),Loop diuretics (yes=1),Spironolactone (yes=1),Statins (yes=1),Hidralazina (yes=1),ACE inhibitor (yes=1),Nitrovasodilator (yes=1)
0,P0001,2065,1460,0.0,0,0,0,58,1,83,163,31.2,3,75,110,1,0,0,0,0,0,0,0,0,20,20,160600,0,42.4,10.0,20.0,1.0,5.4,106.0,20.0,5.7,132.0,1.29,4.6,3.36,141.0,1834.0,69.0,0.05,15.0,0.01,3.02,7.12,1,0.55,50.0,0.0,24.0,35.0,3,9.0,10.0,10.0,72.0,60.0,1,1,0,999,132,1,448,425,1111,0,2,1,0,1.0,375.0,984.0,2143.0,1768.0,2700.0,10.0,2.0,1.0,112.5,1.0,999.0,9.0,3840.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1,1,1,1,0,0,0,1,0
1,P0002,2045,1460,0.0,0,0,0,58,1,74,160,28.9,2,80,130,2,0,1,0,0,1,0,1,0,20,1,292000,10,40.4,20.0,20.0,1.0,6.18,121.0,44.0,5.6,126.0,0.98,4.6,4.06,140.0,570.0,75.0,0.04,12.0,0.01,3.27,10.47,0,0.52,39.0,0.0,24.0,35.0,1,0.0,12.0,14.0,54.0,38.0,1,0,1,206,110,0,440,406,1176,0,1,1,41923,0.0,390.0,682.0,1154.0,764.0,12.0,0.0,1.0,0.0,0.0,0.0,7.0,0.0,0.0,0.0,117.0,110.0,10.9,0.2,0,0,0,1,1,1,0,0,0,1,0,0,0
2,P0003,2044,1460,0.0,0,0,0,69,1,83,174,27.4,2,75,100,1,0,0,0,0,0,0,0,0,15,9,246375,13,40.1,23.0,28.0,1.0,5.3,87.0,25.0,5.7,132.0,2.04,4.7,2.97,138.0,403.0,76.0,0.05,13.0,0.01,0.93,10.02,0,0.52,41.0,0.0,20.0,39.0,1,9.0,9.0,10.0,55.0,44.0,1,1,0,999,84,0,336,438,588,0,0,1,0,1.0,300.0,667.0,1622.0,1322.0,1854.0,92.0,2.0,1.0,77.25,1.0,999.0,9.0,2315.0,0.0,0.0,0.0,0.0,0.0,0,0,0,1,1,1,1,1,0,0,0,0,0
3,P0004,2044,1460,0.0,0,0,0,56,0,84,165,30.9,2,75,155,8,1,1,0,1,0,0,0,0,0,0,0,0,40.9,24.0,23.0,1.0,6.21,77.0,37.0,17.8,127.0,1.03,4.3,3.49,136.0,695.0,72.0,0.05,16.0,0.01,2.07,8.91,0,0.57,43.0,0.0,24.0,38.0,1,2.0,13.0,11.0,56.0,46.0,1,0,0,202,152,1,440,465,896,0,2,1,40119,0.0,561.0,845.0,1154.0,593.0,1.0,0.0,1.0,0.0,0.0,0.0,17.0,1.0,0.0,0.0,79.0,65.0,28.9,2.3,0,1,0,1,1,1,0,1,1,0,0,0,0
4,P0005,2043,1460,0.0,0,0,0,70,1,97,183,29.0,2,85,125,2,1,1,0,1,1,3,2,0,30,9,525600,0,45.3,24.0,23.0,1.0,5.72,88.0,26.0,7.8,159.0,1.06,4.9,3.28,140.0,456.0,82.0,0.05,16.0,0.01,1.01,10.91,0,0.56,55.0,0.0,25.0,34.0,1,1.0,11.0,11.0,73.0,67.0,1,3,0,999,194,1,466,492,896,0,2,1,0,3.0,556.0,811.0,1000.0,444.0,77.0,0.0,2.0,0.0,0.0,0.0,999.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0,0,1,1,0,1,0,1,0,1,1


In [None]:

print(f'Number of columns: {dados.shape[1]}')
print(f'Number of rows: {dados.shape[0]}')

Number of columns: 105
Number of rows: 992


In [None]:
dados.iloc[:,2].describe()

count     992.000000
mean     1212.561492
std       439.881734
min        33.000000
25%      1082.250000
50%      1460.000000
75%      1460.000000
max      1460.000000
Name: days_4years, dtype: float64
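
The summary above (maximum, median, and third quartile all equal to 1460) and the displayed rows (a follow-up of 2065 days paired with `days_4years` of 1460) are consistent with the follow-up period being capped at 4 × 365 days. A minimal sketch of that capping, assuming this is how the column was built; the follow-up values below are illustrative.

```python
import pandas as pd

FOUR_YEARS_DAYS = 4 * 365  # 1460, the maximum reported by describe()

# Illustrative follow-up periods in days (2065 and 2045 appear in the displayed rows)
follow_up = pd.Series([2065, 2045, 33, 1460], name="Follow-up period from enrollment (days)")

# Cap every follow-up at four years
days_capped = follow_up.clip(upper=FOUR_YEARS_DAYS)
print(days_capped.tolist())
```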

In [None]:
dados[['Patient ID','Follow-up period from enrollment (days)','days_4years','Exit of the study','Cause of death']].head()

Unnamed: 0,Patient ID,Follow-up period from enrollment (days),days_4years,Exit of the study,Cause of death
0,P0001,2065,1460,0.0,0
1,P0002,2045,1460,0.0,0
2,P0003,2044,1460,0.0,0
3,P0004,2044,1460,0.0,0
4,P0005,2043,1460,0.0,0


In [None]:
pd.unique(dados['Exit of the study'])

array([0., 3., 1., 2.])
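
Listing the distinct codes does not show how many patients fall under each one. A small sketch of counting them, using made-up values rather than the real column; the meaning of each code is defined in the codes table and is not assumed here.

```python
import pandas as pd

# Illustrative exit codes only, not the real data
exit_codes = pd.Series([0.0, 3.0, 1.0, 0.0, 2.0, 0.0], name="Exit of the study")

# dropna=False would also count missing values as their own category
counts = exit_codes.value_counts(dropna=False).sort_index()
print(counts)
```

In the notebook the same call could be applied to `dados['Exit of the study']`.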

# **Dashboard and Report**

In [None]:
# Columns

# Identification and Follow-up
identificacao_acompanhamento = [
    'Patient ID',
    'Follow-up period from enrollment (days)',
    'days_4years',
    'Exit of the study',
    'Cause of death'
]

# Cardiovascular Events and Conditions
eventos_condicoes_cardiovasculares = [
    'SCD_4years SinusRhythm',
    'HF_4years SinusRhythm',
    'Number of ventricular premature contractions per hour',
    'Non-sustained ventricular tachycardia (CH>10)',
    'Number of supraventricular premature beats in 24h',
    'Paroxysmal supraventricular tachyarrhythmia'
]

# Demographic and Clinical Data
dados_demograficos_clinicos = [
    'Age',
    'Gender (male=1)',
    'Weight (kg)',
    'Height (cm)',
    'Body Mass Index (Kg/m2)',
    'NYHA class'
]

# Vital Signs and Diagnoses
sinais_vitais_diagnosticos = [
    'Systolic blood pressure (mmHg)',
    'HF etiology - Diagnosis'
]

# Comorbidities and Medical History
comorbidades_historicos = [
    'Diabetes (yes=1)',
    'History of dyslipemia (yes=1)',
    'Peripheral vascular disease (yes=1)',
    'History of hypertension (yes=1)',
    'Prior Myocardial Infarction (yes=1)',
    'Prior implantable device',
    'Prior Revascularization',
    'Syncope'
]

# Habits and Consumption
habitos_consumos = [
    'daily smoking (cigarretes/day)',
    'smoke-free time (years)',
    'cigarettes /year',
    'alcohol consumption (standard units)'
]

# Laboratory Tests
exames_laboratoriais = [
    'Albumin (g/L)',
    'ALT or GPT (IU/L)',
    'AST or GOT (IU/L)',
    'Normalized Troponin',
    'Total Cholesterol (mmol/L)',
    'Creatinine (?mol/L)',
    'Gamma-glutamil transpeptidase (IU/L)',
    'Glucose (mmol/L)',
    'Hemoglobin (g/L)',
    'HDL (mmol/L)',
    'Potassium (mEq/L)',
    'LDL (mmol/L)',
    'Sodium (mEq/L)',
    'Pro-BNP (ng/L)',
    'Protein (g/L)',
    'T3 (pg/dL)',
    'T4 (ng/L)',
    'Troponin (ng/mL)',
    'TSH (mIU/L)',
    'Urea (mg/dL)'
]

# Imaging Exams and Cardiac Measurements
exames_imagem_medidas = [
    'Signs of pulmonary venous hypertension (yes=1)',
    'Cardiothoracic ratio',
    'Left atrial size (mm)',
    'Right ventricle contractility (altered=1)',
    'Right ventricle end-diastolic diameter (mm)',
    'LVEF (%)',
    'Mitral valve insufficiency ',
    'Mitral flow pattern',
    'Left ventricular posterior wall thickness (mm)',
    'Septal thickness (mm)',
    'Left ventricle end-diastolic diameter (mm)',
    'Left ventricle end-systolic diameter (mm)'
]

# Electrocardiograms and Holter
eletrocardiogramas_holter = [
    'Hig-resolution ECG available',
    'ECG rhythm ',
    'Q-waves (necrosis. yes=1)',
    'PR interval (ms)',
    'QRS duration (ms)',
    'QRS > 120 ms ',
    'QT interval (ms)',
    'QT corrected ',
    'Average RR (ms)',
    'Left ventricular hypertrophy (yes=1)',
    'Intraventricular conduction disorder'
]

# Holter
holter = [
    'Holter available',
    'Holter onset (hh:mm:ss)',
    'Holter  rhythm ',
    'minimum RR (ms) ',
    'Average RR (ms).1',  # Holter mean RR; the duplicated header appears with a '.1' suffix in the loaded table
    'maximum RR (ms)',
    'RR range (ms)',
    'Number of ventricular premature beats in 24h',
    'Extrasystole couplets ',
    'Ventricular Extrasystole',
    'Non-sustained ventricular tachycardia (CH>10)',
    'Longest RR pause (ms)',
    'Bradycardia',
    'SDNN (ms)',
    'SDANN (ms)',
    'RMSSD (ms)',
    'pNN50 (%)'
]

# Blood Pressure
pressao_arterial = [
    'Systolic blood pressure (mmHg)',
    'Diastolic blood  pressure (mmHg)'
]

# Medications
medicamentos = [
    'Calcium channel blocker (yes=1)',
    'Diabetes medication (yes=1)',
    'Amiodarone (yes=1)',
    'Angiotensin-II receptor blocker (yes=1)',
    'Anticoagulants/antitrombotics  (yes=1)',
    'Betablockers (yes=1)',
    'Digoxin (yes=1)',
    'Loop diuretics (yes=1)',
    'Spironolactone (yes=1)',
    'Statins (yes=1)',
    'Hidralazina (yes=1)',
    'ACE inhibitor (yes=1)',
    'Nitrovasodilator (yes=1)'
]


titles = ['Identification and Follow-up',
          'Cardiovascular Events and Conditions',
          'Demographic and Clinical Data',
          'Vital Signs and Diagnoses',
          'Comorbidities and Medical History',
          'Habits and Consumption',
          'Laboratory Tests',
          'Imaging Exams and Cardiac Measurements',
          'Electrocardiograms and Holter',
          'Holter',
          'Blood Pressure',
          'Medications']

colunas = [identificacao_acompanhamento,
           eventos_condicoes_cardiovasculares,
           dados_demograficos_clinicos,
           sinais_vitais_diagnosticos,
           comorbidades_historicos,
           habitos_consumos,
           exames_laboratoriais,
           exames_imagem_medidas,
           eletrocardiogramas_holter,
           holter,
           pressao_arterial,
           medicamentos,
           ]

nomes = ['identificacao_acompanhamento',
           'eventos_condicoes_cardiovasculares',
           'dados_demograficos_clinicos',
           'sinais_vitais_diagnosticos',
           'comorbidades_historicos',
           'habitos_consumos',
           'exames_laboratoriais',
           'exames_imagem_medidas',
           'eletrocardiogramas_holter',
           'holter',
           'pressao_arterial',
           'medicamentos',
           ]
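
Several names in the lists above contain trailing or doubled spaces, and one header is duplicated in the loaded table, so it may be worth checking each group against the actual columns before using it for selection. A minimal sketch with a tiny made-up frame; `missing_columns` and the demo data are hypothetical.

```python
import pandas as pd

def missing_columns(frame: pd.DataFrame, groups: dict) -> dict:
    """Return, for each group, the listed names that are not columns of `frame`."""
    available = set(frame.columns)
    return {
        name: [col for col in cols if col not in available]
        for name, cols in groups.items()
    }

# Tiny illustrative frame; note the trailing space in 'ECG rhythm '
demo = pd.DataFrame(columns=["Age", "ECG rhythm ", "Average RR (ms)", "Average RR (ms).1"])
demo_groups = {
    "dados_demograficos_clinicos": ["Age", "Height (cm)"],
    "eletrocardiogramas_holter": ["ECG rhythm", "Average RR (ms)"],
}
report = missing_columns(demo, demo_groups)
print(report)
```

In the notebook, the equivalent call would be `missing_columns(dados, dict(zip(nomes, colunas)))`; any non-empty list points at a name that does not match a column exactly.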

# **Data Cleaning and Processing**

In [None]:
# Replace missing values with zero

dados = dados.iloc[:,1:]  # drop the first column ('Patient ID')
dados = dados.fillna(0)
dados['Holter onset (hh:mm:ss)'] = df['Holter onset (hh:mm:ss)'] # restore the time column from the original df, keeping its blanks instead of zeros
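
The intent of this cell is to zero-fill missing values everywhere except in the Holter onset column. One way to express that in a single step is sketched below on a made-up two-row frame; the helper name `fill_missing_except` is hypothetical. Keep in mind that after zero-filling, a missing measurement can no longer be told apart from a true zero.

```python
import numpy as np
import pandas as pd

def fill_missing_except(frame: pd.DataFrame, keep_missing: list, value=0) -> pd.DataFrame:
    """Fill NaN with `value` in every column except those listed in `keep_missing`."""
    filled = frame.copy()
    cols = [c for c in filled.columns if c not in keep_missing]
    filled[cols] = filled[cols].fillna(value)
    return filled

# Illustrative data only
demo = pd.DataFrame({
    "LVEF (%)": [35.0, np.nan],
    "Holter onset (hh:mm:ss)": [41923.0, np.nan],
})
cleaned = fill_missing_except(demo, keep_missing=["Holter onset (hh:mm:ss)"])
print(cleaned)
```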

In [None]:
dados.head()

Unnamed: 0,Follow-up period from enrollment (days),days_4years,Exit of the study,Cause of death,SCD_4years SinusRhythm,HF_4years SinusRhythm,Age,Gender (male=1),Weight (kg),Height (cm),Body Mass Index (Kg/m2),NYHA class,Diastolic blood pressure (mmHg),Systolic blood pressure (mmHg),HF etiology - Diagnosis,Diabetes (yes=1),History of dyslipemia (yes=1),Peripheral vascular disease (yes=1),History of hypertension (yes=1),Prior Myocardial Infarction (yes=1),Prior implantable device,Prior Revascularization,Syncope,daily smoking (cigarretes/day),smoke-free time (years),cigarettes /year,alcohol consumption (standard units),Albumin (g/L),ALT or GPT (IU/L),AST or GOT (IU/L),Normalized Troponin,Total Cholesterol (mmol/L),Creatinine (?mol/L),Gamma-glutamil transpeptidase (IU/L),Glucose (mmol/L),Hemoglobin (g/L),HDL (mmol/L),Potassium (mEq/L),LDL (mmol/L),Sodium (mEq/L),Pro-BNP (ng/L),Protein (g/L),T3 (pg/dL),T4 (ng/L),Troponin (ng/mL),TSH (mIU/L),Urea (mg/dL),Signs of pulmonary venous hypertension (yes=1),Cardiothoracic ratio,Left atrial size (mm),Right ventricle contractility (altered=1),Right ventricle end-diastolic diameter (mm),LVEF (%),Mitral valve insufficiency,Mitral flow pattern,Left ventricular posterior wall thickness (mm),Septal thickness (mm),Left ventricle end-diastolic diameter (mm),Left ventricle end-systolic diameter (mm),Hig-resolution ECG available,ECG rhythm,Q-waves (necrosis. 
yes=1),PR interval (ms),QRS duration (ms),QRS > 120 ms,QT interval (ms),QT corrected,Average RR (ms),Left ventricular hypertrophy (yes=1),Intraventricular conduction disorder,Holter available,Holter onset (hh:mm:ss),Holter rhythm,minimum RR (ms),Average RR (ms).1,maximum RR (ms),RR range (ms),Number of ventricular premature beats in 24h,Extrasystole couplets,Ventricular Extrasystole,Ventricular Tachycardia,Number of ventricular premature contractions per hour,Non-sustained ventricular tachycardia (CH>10),Number of supraventricular premature beats in 24h,Paroxysmal supraventricular tachyarrhythmia,Longest RR pause (ms),Bradycardia,SDNN (ms),SDANN (ms),RMSSD (ms),pNN50 (%),Calcium channel blocker (yes=1),Diabetes medication (yes=1),Amiodarone (yes=1),Angiotensin-II receptor blocker (yes=1),Anticoagulants/antitrombotics (yes=1),Betablockers (yes=1),Digoxin (yes=1),Loop diuretics (yes=1),Spironolactone (yes=1),Statins (yes=1),Hidralazina (yes=1),ACE inhibitor (yes=1),Nitrovasodilator (yes=1)
0,2065,1460,0.0,0,0,0,58,1,83,163,31.2,3,75,110,1,0,0,0,0,0,0,0,0,20,20,160600,0,42.4,10.0,20.0,1.0,5.4,106.0,20.0,5.7,132.0,1.29,4.6,3.36,141.0,1834.0,69.0,0.05,15.0,0.01,3.02,7.12,1,0.55,50.0,0.0,24.0,35.0,3,9.0,10.0,10.0,72.0,60.0,1,1,0,999,132,1,448,425,1111,0,2,1,0,1.0,375.0,984.0,2143.0,1768.0,2700.0,10.0,2.0,1.0,112.5,1.0,999.0,9.0,3840.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1,1,1,1,0,0,0,1,0
1,2045,1460,0.0,0,0,0,58,1,74,160,28.9,2,80,130,2,0,1,0,0,1,0,1,0,20,1,292000,10,40.4,20.0,20.0,1.0,6.18,121.0,44.0,5.6,126.0,0.98,4.6,4.06,140.0,570.0,75.0,0.04,12.0,0.01,3.27,10.47,0,0.52,39.0,0.0,24.0,35.0,1,0.0,12.0,14.0,54.0,38.0,1,0,1,206,110,0,440,406,1176,0,1,1,41923,0.0,390.0,682.0,1154.0,764.0,12.0,0.0,1.0,0.0,0.0,0.0,7.0,0.0,0.0,0.0,117.0,110.0,10.9,0.2,0,0,0,1,1,1,0,0,0,1,0,0,0
2,2044,1460,0.0,0,0,0,69,1,83,174,27.4,2,75,100,1,0,0,0,0,0,0,0,0,15,9,246375,13,40.1,23.0,28.0,1.0,5.3,87.0,25.0,5.7,132.0,2.04,4.7,2.97,138.0,403.0,76.0,0.05,13.0,0.01,0.93,10.02,0,0.52,41.0,0.0,20.0,39.0,1,9.0,9.0,10.0,55.0,44.0,1,1,0,999,84,0,336,438,588,0,0,1,0,1.0,300.0,667.0,1622.0,1322.0,1854.0,92.0,2.0,1.0,77.25,1.0,999.0,9.0,2315.0,0.0,0.0,0.0,0.0,0.0,0,0,0,1,1,1,1,1,0,0,0,0,0
3,2044,1460,0.0,0,0,0,56,0,84,165,30.9,2,75,155,8,1,1,0,1,0,0,0,0,0,0,0,0,40.9,24.0,23.0,1.0,6.21,77.0,37.0,17.8,127.0,1.03,4.3,3.49,136.0,695.0,72.0,0.05,16.0,0.01,2.07,8.91,0,0.57,43.0,0.0,24.0,38.0,1,2.0,13.0,11.0,56.0,46.0,1,0,0,202,152,1,440,465,896,0,2,1,40119,0.0,561.0,845.0,1154.0,593.0,1.0,0.0,1.0,0.0,0.0,0.0,17.0,1.0,0.0,0.0,79.0,65.0,28.9,2.3,0,1,0,1,1,1,0,1,1,0,0,0,0
4,2043,1460,0.0,0,0,0,70,1,97,183,29.0,2,85,125,2,1,1,0,1,1,3,2,0,30,9,525600,0,45.3,24.0,23.0,1.0,5.72,88.0,26.0,7.8,159.0,1.06,4.9,3.28,140.0,456.0,82.0,0.05,16.0,0.01,1.01,10.91,0,0.56,55.0,0.0,25.0,34.0,1,1.0,11.0,11.0,73.0,67.0,1,3,0,999,194,1,466,492,896,0,2,1,0,3.0,556.0,811.0,1000.0,444.0,77.0,0.0,2.0,0.0,0.0,0.0,999.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0,0,1,1,0,1,0,1,0,1,1


In [None]:
# dados.astype(str).str.replace(',', '.').astype(float)

In [None]:
# df_teste.astype(str).str.replace("", '.')

In [None]:
#df_teste.iloc[:,30]

In [None]:
#df_teste.iloc[:,27].astype(str).str.replace(',','.')

In [None]:
# np.unique(df['Number of ventricular premature contractions per hour'].astype(str))

In [None]:
# convert categorical columns to numeric
# from sklearn.preprocessing import LabelEncoder

# # Example
# encoder = LabelEncoder()
# df_teste[df_teste[object_columns]] = encoder.fit_transform(df[df_teste[object_columns]])

# 1. Convert the 'Age' column to numeric (if the values are numbers stored as strings)
#df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
# np.unique(pd.to_numeric(df['Age'], errors='coerce'))

# 2. Convert the 'Holter onset (hh:mm:ss)' column to seconds
#df['Holter onset (hh:mm:ss)'] = pd.to_timedelta(df['Holter onset (hh:mm:ss)']).dt.total_seconds()

# 3. Convert the 'Number of ventricular premature contractions per hour' column to numeric
#df['Number of ventricular premature contractions per hour'] = pd.to_numeric(df['Number of ventricular premature contractions per hour'], errors='coerce')



In [None]:
# # Select the columns of type 'object' that hold exactly two distinct values.

# from sklearn.preprocessing import OneHotEncoder

# # Get the columns of type 'object'
# object_columns = df_teste.select_dtypes(include='object').columns

# # Iterate over the object-typed columns
# for col in object_columns:
#     unique_values = np.unique(df_teste[col].astype(str))  # Unique values of the column
#     if len(unique_values) == 2:  # Check whether the column is binary
#         encoder = OneHotEncoder(sparse_output=False, drop='if_binary')  # One-hot encode binary columns
#         X_binario = encoder.fit_transform(df_teste[[col]])  # Transform into a single binary column

#         # Drop the original column and add the encoded one
#         df_teste = df_teste.drop(columns=[col])  # Drop the original column
#         df_teste[col + '_bin'] = X_binario  # Add the new binary column

#         print(f"Coluna {col} codificada e adicionada como '{col}_bin'.")



In [None]:
# # Run this cell only once

# # Replace missing values with '00:00:00' (fillna does not touch invalid entries such as '0')
#dados['Holter onset (hh:mm:ss)'] = dados['Holter onset (hh:mm:ss)'].fillna('00:00:00')

In [None]:
#dados['Holter onset (hh:mm:ss)'].head()

In [None]:
# # converting the time to a datetime format

# # Convert the time column to datetime and extract the time of day
# dados['Holter onset (hh:mm:ss)'] = pd.to_datetime(dados['Holter onset (hh:mm:ss)'], format='%H:%M:%S').dt.time

# # # Function to convert a time into seconds since the start of the day
# def time_to_seconds(t):
#     return t.hour * 3600 + t.minute * 60 + t.second

# # Apply the function to turn the time column into seconds
# dados['Holter onset (hh:mm:ss)'] = dados['Holter onset (hh:mm:ss)'].apply(time_to_seconds)
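
The commented cells above turn the Holter onset time into seconds since midnight. A minimal sketch of the same idea with `pd.to_timedelta`, assuming the column holds `hh:mm:ss` strings; the sample values below are hypothetical and the real column may need extra cleaning first.

```python
import pandas as pd

# Hypothetical sample values; the real column may contain other formats.
onset = pd.Series(["11:38:43", "00:00:00", None], name="Holter onset (hh:mm:ss)")

# Missing entries become midnight, mirroring the fillna('00:00:00') step.
onset = onset.fillna("00:00:00")

# Timedelta parsing reads hh:mm:ss strings directly, so no helper function is needed.
onset_seconds = pd.to_timedelta(onset).dt.total_seconds()
print(onset_seconds.tolist())  # [41923.0, 0.0, 0.0]
```

This avoids the intermediate `datetime.time` objects, at the cost of treating the value as a duration rather than a clock time.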


# **Attribute types**

In [None]:
# !pip install ydata-profiling
# pip install --upgrade ydata-profiling

In [None]:
import pandas as pd
from ydata_profiling import ProfileReport


# Build an exploratory analysis report
#profile = ProfileReport(dados, explorative=True)
#profile.to_file("relatorio.html")  # Saves the report as an HTML file
#profile.to_notebook_iframe()  # Displays the report inside the Jupyter Notebook


D-Tale

📌 Lets you explore the data interactively in a graphical interface. It detects categorical columns automatically.

In [None]:
#!pip install dtale


In [None]:
#import dtale
#dtale.show(dados)

Sweetviz

📌 Builds a visual report of the data, identifying variable types, descriptive statistics, and charts.

In [None]:
# pip install sweetviz

In [None]:
import sweetviz as sv

relatorio = sv.analyze(dados)
relatorio.show_html("sweetviz_relatorio.html")  # Generates an interactive HTML report



DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`



Report sweetviz_relatorio.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


AutoViz

📌 Generates automatic visualizations for a quick understanding of the data.

In [None]:
# pip install autoviz

In [None]:
from autoviz import AutoViz_Class

AV = AutoViz_Class()
dft = AV.AutoViz(dados)

Shape of your Data Set loaded: (992, 104)
#######################################################################################
######################## C L A S S I F Y I N G  V A R I A B L E S  ####################
#######################################################################################
Classifying variables in data set...
    Number of Numeric Columns =  49
    Number of Integer-Categorical Columns =  25
    Number of String-Categorical Columns =  0
    Number of Factor-Categorical Columns =  0
    Number of String-Boolean Columns =  0
    Number of Numeric-Boolean Columns =  30
    Number of Discrete String Columns =  0
    Number of NLP String Columns =  0
    Number of Date Time Columns =  0
    Number of ID Columns =  0
    Number of Columns to Delete =  0
    104 Predictors classified...
        No variables removed since no ID or low-information variables found in data set
30 numeric variables in data exceeds limit, taking top 30 variables
    List of variables

Unnamed: 0,Data Type,Missing Values%,Unique Values%,Minimum Value,Maximum Value,DQ Issue
Follow-up period from enrollment (days),int64,0.0,51.0,33.0,2065.0,Column has a high correlation with ['Exit of the study']. Consider dropping one of them.
days_4years,int64,0.0,25.0,33.0,1460.0,"Column has 137 outliers greater than upper bound (2026.62) or lower than lower bound(515.62). Cap them or remove them., Column has a high correlation with ['Exit of the study', 'Follow-up period from enrollment (days)']. Consider dropping one of them."
Exit of the study,float64,0.0,,0.0,3.0,No issue
Cause of death,int64,0.0,0.0,0.0,7.0,"Column has 205 outliers greater than upper bound (2.50) or lower than lower bound(-1.50). Cap them or remove them., Column has a high correlation with ['Exit of the study']. Consider dropping one of them."
SCD_4years SinusRhythm,int64,0.0,0.0,0.0,1.0,No issue
HF_4years SinusRhythm,int64,0.0,0.0,0.0,1.0,No issue
Age,int64,0.0,6.0,18.0,89.0,Column has 10 outliers greater than upper bound (97.00) or lower than lower bound(33.00). Cap them or remove them.
Gender (male=1),int64,0.0,0.0,0.0,1.0,No issue
Weight (kg),int64,0.0,7.0,37.0,130.0,Column has 16 outliers greater than upper bound (112.00) or lower than lower bound(40.00). Cap them or remove them.
Height (cm),int64,0.0,5.0,127.0,190.0,Column has 16 outliers greater than upper bound (186.50) or lower than lower bound(142.50). Cap them or remove them.


Number of All Scatter Plots = 465
All Plots done
Time to run AutoViz = 297 seconds 

 ###################### AUTO VISUALIZATION Completed ########################


In [None]:
filename = dados.copy()
target_variable = "Cause of death"
custom_plot_dir = "your_custom_plot_directory"

dft = AV.AutoViz(
    filename,
    sep=",",
    depVar=target_variable,
    dfte=None,
    header=0,
    verbose=2,
    lowess=False,
    chart_format="bokeh",
    max_rows_analyzed=150000,
    max_cols_analyzed=30,
    save_plot_dir=custom_plot_dir
)

Output hidden; open in https://colab.research.google.com to view.

-----------------------------------------------

In [None]:
dados.query('`Cause of death` != 1').shape


(931, 104)

In [None]:
df_filtrado = dados.query('`Cause of death` != 1')

In [None]:
import pandas as pd


# Count the classes in the dataset
class_counts = dados['Cause of death'].value_counts()


# Print each class count and its share of the dataset
for class_label, count in class_counts.items():
    print(f'Classe {class_label}: {count} -> {count/dados.shape[0]*100:.2f}%')


Classe 0: 726 -> 73.19%
Classe 6: 100 -> 10.08%
Classe 3: 94 -> 9.48%
Classe 1: 61 -> 6.15%
Classe 7: 11 -> 1.11%


In [None]:
dados.shape[0]

992

In [None]:
100 - 73.19

26.810000000000002
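
The class shares above set a floor for any accuracy figure: a model that always predicts class 0 already scores the majority share. A small sketch of that baseline, using only the counts printed above; the variable names are illustrative.

```python
# Class counts copied from the output above.
counts = {0: 726, 6: 100, 3: 94, 1: 61, 7: 11}

total = sum(counts.values())
majority = max(counts.values())

# Accuracy of always predicting the majority class on the full dataset.
baseline_all = majority / total

# Same baseline once class 1 is filtered out, as in df_filtrado.
filtered_total = total - counts[1]
baseline_filtered = majority / filtered_total

print(f"all rows: {baseline_all:.4f}")
print(f"without class 1: {baseline_filtered:.4f}")
```

Any accuracy reported on the filtered data is worth reading against the second figure rather than against zero.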

Testing with all the data except the date/time column

Classification by cause of death

In [None]:

import pandas as pd
import numpy as np

# Import scikit-learn modules
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score


X = df_filtrado.drop(['Cause of death','Holter onset (hh:mm:ss)'],axis =1)
y = df_filtrado['Cause of death']



# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Fit the scaler on the training data and transform it
scaler = StandardScaler()
X_treino_normalizado = scaler.fit_transform(X_train)
# Reuse the scaler fitted on the training data to transform the test data
X_teste_normalizado = scaler.transform(X_test)


# Build the pipeline (the inputs were already scaled above, so this scaler step is redundant)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Define the hyperparameter grid
param_grid = [
    {
        'classifier__penalty': ['l2'],
        'classifier__C': [0.1, 1, 10],
        'classifier__solver': ['newton-cg', 'lbfgs', 'sag']
    },
    {
        'classifier__penalty': ['l1'],
        'classifier__C': [0.1, 1, 10],
        'classifier__solver': ['liblinear', 'saga']
    },
    {
        'classifier__penalty': ['elasticnet'],
        'classifier__C': [0.1, 1, 10],
        'classifier__solver': ['saga'],
        'classifier__l1_ratio': [0.5]
    },
    {
        'classifier__penalty': [None],
        'classifier__solver': ['newton-cg', 'lbfgs', 'sag']
    }
]

# Configure GridSearchCV
grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=5,
    n_jobs=-1,
    scoring='accuracy'
)

# Run GridSearchCV
grid_search.fit(X_treino_normalizado, y_train)

# Best hyperparameters
print("Melhores hiperparâmetros encontrados:")
print(grid_search.best_params_)

# Evaluate the model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_teste_normalizado)

print("\nRelatório de classificação:")
print(classification_report(y_test, y_pred))

print("Matriz de confusão:")
print(confusion_matrix(y_test, y_pred))

print("Acurácia no conjunto de teste:")
print(f"{accuracy_score(y_test, y_pred):.2f}")


Melhores hiperparâmetros encontrados:
{'classifier__C': 1, 'classifier__penalty': 'l1', 'classifier__solver': 'liblinear'}

Relatório de classificação:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       153
           3       0.83      0.79      0.81        19
           6       0.67      0.77      0.71        13
           7       1.00      0.50      0.67         2

    accuracy                           0.96       187
   macro avg       0.88      0.76      0.80       187
weighted avg       0.96      0.96      0.96       187

Matriz de confusão:
[[153   0   0   0]
 [  0  15   4   0]
 [  0   3  10   0]
 [  0   0   1   1]]
Acurácia no conjunto de teste:
0.96
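
The AutoViz report earlier flagged `Cause of death` as highly correlated with `Exit of the study`, and `X` in this cell still contains that column, so part of the 0.96 may come from it. A minimal sketch of how one might check this with a cross-tabulation; the toy rows are hypothetical and only stand in for the real `dados` frame.

```python
import pandas as pd

# Hypothetical toy rows; replace with the real `dados` frame to run the actual check.
toy = pd.DataFrame({
    "Exit of the study": [0, 0, 0, 3, 3, 3, 2, 2],
    "Cause of death":    [0, 0, 0, 3, 6, 3, 1, 1],
})

# Rows: exit code, columns: cause of death.
table = pd.crosstab(toy["Exit of the study"], toy["Cause of death"])
print(table)

# Share of rows whose target could be guessed from the exit code alone
# (most frequent cause within each exit code).
purity = table.max(axis=1).sum() / len(toy)
print(f"target recoverable from exit code alone: {purity:.3f}")
```

If the share is close to 1 on the real data, dropping `Exit of the study` from `X` would be a reasonable next experiment.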


In [None]:
log = ExperimentLogger('Classificacao')
#log.preparar_modelo(y_real = y_test, y_pred = y_pred)
#log.salvando_relatorio('LogisticRegression', 'Usando grid_search testando com todos os atributos')
log.consultar_modelos()

(1467028638128, 'LogisticRegression', 'Usando grid_search testando com todos os atributos', '21/01/2025', '22:24:19', '{"Acuracia": 0.9572192513368984, "Precision": 0.875, "Recall": 0.7646761133603239, "Especificidade": 1.0, "F1_score": 0.797940797940798, "Taxa_falsa_descoberta(FDR)": 0.0, "Valor_preditivo_negativo(NPU)": 1.0, "Prevalencia": 0.08928571428571429, "Taxa_falsa_Omissao(for)": 0.0, "Sensibilidade(TPR)": 1.0, "Taxa_falso_negativo": 0.0, "Taxa_falso_positivo": 0.0, "Teste_razao_verossimilhanca_negativa(LR-)": 0.0, "Teste_razao_verossimilhanca_positiva(LR+)": 0.0, "media_real": 0.7967914438502673, "mediana_real": 0.0, "moda_real": 0.0, "desvio_padrao_real": 1.8183233666197116, "Q1_real": 0.0, "Q2_real": 0.0, "Q3_real": 0.0, "skewness_real": 2.1654719161915246, "kurtosis_real": 3.3590885949759963, "media_pred": 0.8074866310160428, "mediana_pred": 0.0, "moda_pred": 0.0, "desvio_padrao_pred": 1.8341232568938828, "Q1_pred": 0.0, "Q2_pred": 0.0, "Q3_pred": 0.0, "skewness_pred": 2.1

Testing with the ECG data available in the dataset (tabular measurements, not the raw signals)

Classification by cause of death

In [None]:
dados['Cause of death'].unique()

array([0, 3, 6, 1, 7], dtype=int64)

In [None]:
dados[~dados['Cause of death'].isin([1])].shape

(931, 104)

In [None]:
# Class imbalance
# Double quotes outside, so the nested single quotes also work before Python 3.12
print(f"class 0: {dados.query('`Cause of death` == 0').shape[0]}")
print(f"class 1: {dados.query('`Cause of death` == 1').shape[0]}")
print(f"class 3: {dados.query('`Cause of death` == 3').shape[0]}")
print(f"class 6: {dados.query('`Cause of death` == 6').shape[0]}")
print(f"class 7: {dados.query('`Cause of death` == 7').shape[0]}")

class 0: 726
class 1: 61
class 3: 94
class 6: 100
class 7: 11
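
With counts this uneven, reweighting the classes is one common response. The sketch below works out the weights given by the "balanced" heuristic, `n_samples / (n_classes * class_count)`, which is what scikit-learn documents for `class_weight='balanced'`. It assumes class 1 has been dropped, as in the models that follow.

```python
# Class counts after dropping class 1 (non-cardiac death), taken from the output above.
counts = {0: 726, 3: 94, 6: 100, 7: 11}

n_samples = sum(counts.values())
n_classes = len(counts)

# 'balanced' heuristic: n_samples / (n_classes * count_of_class)
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}

for label, weight in weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

Class 7 ends up weighted roughly 66 times more heavily than class 0, which shows how little evidence the model has for it.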


In [None]:
import pandas as pd
import numpy as np

# Import scikit-learn modules
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import RobustScaler

X = df_filtrado[eletrocardiogramas_holter][~df_filtrado['Cause of death'].isin([1])]
y = df_filtrado['Cause of death'][~df_filtrado['Cause of death'].isin([1])]  # only class 1 (non-cardiac death) is excluded here; classes 6 and 7 stay in the target. An earlier run on this sample reached 76% accuracy.




# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit RobustScaler on the training data only, then reuse it on the test data
scaler = RobustScaler()
X_treino_normalizado = scaler.fit_transform(X_train)
X_teste_normalizado = scaler.transform(X_test)

# Build the pipeline (the inputs were already scaled above, so this scaler step is redundant)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000,class_weight='balanced', C=0.1, penalty='l2', solver='newton-cg'))
])

# Define the hyperparameter grid
param_grid = [
    {
        'classifier__penalty': ['l2'],
        'classifier__C': [0.1, 1, 10],
        'classifier__solver': ['newton-cg', 'lbfgs', 'sag']
    },
    {
        'classifier__penalty': ['l1'],
        'classifier__C': [0.1, 1, 10],
        'classifier__solver': ['liblinear', 'saga']
    },
    {
        'classifier__penalty': ['elasticnet'],
        'classifier__C': [0.1, 1, 10],
        'classifier__solver': ['saga'],
        'classifier__l1_ratio': [0.5]
    },
    {
        'classifier__penalty': [None],
        'classifier__solver': ['newton-cg', 'lbfgs', 'sag']
    }
]

# Configure GridSearchCV
grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=5,
    n_jobs=-1,
    scoring='accuracy'
)

# Run GridSearchCV
grid_search.fit(X_treino_normalizado, y_train)

# Best hyperparameters
print("Melhores hiperparâmetros encontrados:")
print(grid_search.best_params_)

# Evaluate the model
best_model = grid_search.best_estimator_  # the model refit with the best hyperparameters
y_pred = best_model.predict(X_teste_normalizado)

print("\nRelatório de classificação:")
print(classification_report(y_test, y_pred))

print("Matriz de confusão:")
print(confusion_matrix(y_test, y_pred))

print("Acurácia no conjunto de teste:")
print(f"{accuracy_score(y_test, y_pred):.2f}")


Melhores hiperparâmetros encontrados:
{'classifier__C': 0.1, 'classifier__penalty': 'l1', 'classifier__solver': 'liblinear'}

Relatório de classificação:
              precision    recall  f1-score   support

           0       0.84      0.92      0.88       153
           3       0.25      0.16      0.19        19
           6       0.00      0.00      0.00        13
           7       0.00      0.00      0.00         2

    accuracy                           0.76       187
   macro avg       0.27      0.27      0.27       187
weighted avg       0.72      0.76      0.74       187

Matriz de confusão:
[[140   7   0   6]
 [ 14   3   0   2]
 [ 10   2   0   1]
 [  2   0   0   0]]
Acurácia no conjunto de teste:
0.76
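
In the cell above the features are scaled by hand and then scaled again inside the pipeline, and the search optimizes accuracy even though the minority classes are the ones of interest. A hedged sketch of an alternative setup: the pipeline receives raw features, so scaling is refit inside each fold, and the search is scored with macro-F1. The data is synthetic (`make_classification`) and only stands in for the real features; the reduced grid is an assumption for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the real features (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3,
    weights=[0.8, 0.1, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

# Macro-F1 gives every class the same weight, regardless of its frequency.
search = GridSearchCV(pipe, {"classifier__C": [0.1, 1, 10]}, cv=5, scoring="f1_macro")
search.fit(X_tr, y_tr)  # raw features: the pipeline scales within each fold

print(search.best_params_)
print(f"macro-F1 on held-out data: {search.score(X_te, y_te):.2f}")
```

Passing raw features also means `best_estimator_` can later be handed to `cross_validate` on unscaled `X` without a mismatch between training and evaluation inputs.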


In [None]:
log.preparar_modelo(y_real = y_test, y_pred = y_pred)
#log.salvando_relatorio('LogisticRegression', 'Usando grid_search eliminando class 1: morte não cardíaca')
log.consultar_modelos()

Salvamento concluído
(1467028638128, 'LogisticRegression', 'Usando grid_search testando com todos os atributos', '21/01/2025', '22:24:19', '{"Acuracia": 0.9572192513368984, "Precision": 0.875, "Recall": 0.7646761133603239, "Especificidade": 1.0, "F1_score": 0.797940797940798, "Taxa_falsa_descoberta(FDR)": 0.0, "Valor_preditivo_negativo(NPU)": 1.0, "Prevalencia": 0.08928571428571429, "Taxa_falsa_Omissao(for)": 0.0, "Sensibilidade(TPR)": 1.0, "Taxa_falso_negativo": 0.0, "Taxa_falso_positivo": 0.0, "Teste_razao_verossimilhanca_negativa(LR-)": 0.0, "Teste_razao_verossimilhanca_positiva(LR+)": 0.0, "media_real": 0.7967914438502673, "mediana_real": 0.0, "moda_real": 0.0, "desvio_padrao_real": 1.8183233666197116, "Q1_real": 0.0, "Q2_real": 0.0, "Q3_real": 0.0, "skewness_real": 2.1654719161915246, "kurtosis_real": 3.3590885949759963, "media_pred": 0.8074866310160428, "mediana_pred": 0.0, "moda_pred": 0.0, "desvio_padrao_pred": 1.8341232568938828, "Q1_pred": 0.0, "Q2_pred": 0.0, "Q3_pred": 0.0,

In [None]:
df['Cause of death'].unique()

array([0, 3, 6, 1, 7], dtype=int64)

Cause of death
- 0: survivor;
- 1: non-cardiac death;
- 3: SCD (sudden cardiac death);
- 6-7: pump-failure death;
- As of the study end date
- SCD is caused by arrhythmias of the heart

The arrhythmia most commonly associated with sudden cardiac death is ventricular fibrillation. These arrhythmias are rapid-firing impulses originating in the lower chambers of the heart (the ventricles).
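
The codes listed above can be kept in a small lookup table so that reports show readable labels. A minimal sketch; the dictionary name and the English label strings are illustrative, and grouping 6 and 7 under one label follows the list above.

```python
# Lookup table for the 'Cause of death' codes described above (names are illustrative).
CAUSE_LABELS = {
    0: "survivor",
    1: "non-cardiac death",
    3: "sudden cardiac death (SCD)",
    6: "pump-failure death",
    7: "pump-failure death",
}

codes = [0, 3, 6, 1, 7]  # order returned by unique() above
labels = [CAUSE_LABELS[code] for code in codes]

for code, label in zip(codes, labels):
    print(f"{code}: {label}")
```

With pandas, the same dictionary could be applied through `Series.map` to add a label column.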

# Applying cross-validation to LogisticRegression

In [None]:
from sklearn.model_selection import cross_validate, KFold,cross_val_predict

cv_estrategia = KFold(n_splits=5, shuffle =True, random_state= 45)
resultados = cross_validate(best_model, X,y, scoring='accuracy', cv = cv_estrategia, return_train_score =True)

In [None]:
resultados

{'fit_time': array([0.01353669, 0.01406097, 0.01759028, 0.01359105, 0.01451635]),
 'score_time': array([0.00354767, 0.00351906, 0.00304699, 0.00200939, 0.00409174]),
 'test_score': array([0.76470588, 0.66129032, 0.74731183, 0.75268817, 0.70967742]),
 'train_score': array([0.73387097, 0.73557047, 0.73288591, 0.74362416, 0.72885906])}

In [None]:
cv_estrategia = KFold(n_splits=5, shuffle=True, random_state=45)

# saving the models

for train_idx, test_idx in cv_estrategia.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    best_model.fit(X_train, y_train)
    y_pred = best_model.predict(X_test)

    # saving each k-fold
    #log.preparar_modelo(y_real = y_test, y_pred = y_pred)
    #log.salvando_relatorio("LogisticRegression 'classifier__C': 0.1, 'classifier__penalty': 'l1', 'classifier__solver': 'liblinear'", 'Usando grid_search, k-fold')



Salvamento concluído
Salvamento concluído
Salvamento concluído
Salvamento concluído
Salvamento concluído


In [None]:
# Display the results
print("Resultados do Cross-Validation:")
print("-" * 40)
print(f"{'Fold':<10} {'Fit Time':<15} {'Score Time':<15} {'Test Score':<15} {'Train Score':<15}")
print("-" * 40)

for i in range(len(resultados['fit_time'])):
    print(f"{i+1:<10} {resultados['fit_time'][i]:<15.6f} {resultados['score_time'][i]:<15.6f} "
          f"{resultados['test_score'][i]:<15.6f} {resultados['train_score'][i]:<15.6f}")

# Display the mean of the results
print("-" * 40)
print(f"{'Média':<10} {'':<15} {'':<15} "
      f"{np.mean(resultados['test_score']):<15.6f} {np.mean(resultados['train_score']):<15.6f}")

Resultados do Cross-Validation:
----------------------------------------
Fold       Fit Time        Score Time      Test Score      Train Score    
----------------------------------------
1          0.012048        0.002001        0.764706        0.733871       
2          0.011066        0.003512        0.661290        0.735570       
3          0.007809        0.003508        0.747312        0.732886       
4          0.010008        0.003520        0.752688        0.743624       
5          0.017075        0.002504        0.709677        0.728859       
----------------------------------------
Média                                      0.727135        0.734962       


In [None]:
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer
from sklearn.model_selection import StratifiedKFold

# 'X' holds the independent variables and 'y' is the dependent variable

cv_estrategia = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

resultados = cross_validate(
    best_model, X, y,
    scoring='accuracy',  # Evaluation metric for classification
    cv=cv_estrategia,
    return_train_score=True
)

print(f"Acurácia média nos conjuntos de teste: {resultados['test_score'].mean():.4f}")
print(f"Acurácia média nos conjuntos de treino: {resultados['train_score'].mean():.4f}")


Acurácia média nos conjuntos de teste: 0.7218
Acurácia média nos conjuntos de treino: 0.7360
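
The switch from `KFold` to `StratifiedKFold` matters most for the rarest class. The sketch below rebuilds a label vector from the class counts reported earlier and counts how many class 7 samples land in each test fold under both strategies. The feature matrix is a placeholder, since only the labels affect the split.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Label vector rebuilt from the class counts reported earlier (class 1 removed).
y_demo = np.repeat([0, 3, 6, 7], [726, 94, 100, 11])
X_demo = np.zeros((len(y_demo), 1))  # placeholder features; only the labels matter here


def class7_per_fold(splitter):
    """Number of class 7 samples in each test fold."""
    return [
        int((y_demo[test_idx] == 7).sum())
        for _, test_idx in splitter.split(X_demo, y_demo)
    ]


plain = class7_per_fold(KFold(n_splits=5, shuffle=True, random_state=45))
strat = class7_per_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=42))

print("KFold:          ", plain)
print("StratifiedKFold:", strat)
```

Stratification spreads the 11 samples as evenly as possible, while plain `KFold` leaves the per-fold count to chance.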


# Random Forest

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Assuming df_filtrado is your DataFrame and eletrocardiogramas_holter is the list of columns of interest
X = df_filtrado[eletrocardiogramas_holter][~df_filtrado['Cause of death'].isin([1])]
y = df_filtrado['Cause of death'][~df_filtrado['Cause of death'].isin([1])]

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Instantiate the Random Forest classifier
rf = RandomForestClassifier(random_state=42)

# Define the hyperparameter grid for GridSearchCV
param_grid = {
    'n_estimators': [100, 200, 300],  # Number of trees in the forest
    'max_depth': [None, 10, 20, 30],  # Maximum depth of the trees
    'min_samples_split': [2, 5, 10],  # Minimum number of samples required to split a node
    'min_samples_leaf': [1, 2, 4],    # Minimum number of samples required at a leaf node
    'bootstrap': [True, False]        # Whether bootstrap samples are used to build the trees
}

# Configure GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, scoring='accuracy')

# Fit GridSearchCV on the training data
grid_search.fit(X_train, y_train)

# Get the best model found by GridSearchCV
best_rf = grid_search.best_estimator_

# Predict on the test set
y_pred = best_rf.predict(X_test)

# Evaluate the model
print("Melhores hiperparâmetros encontrados:")
print(grid_search.best_params_)
print("\nRelatório de classificação:")
print(classification_report(y_test, y_pred))
print("Matriz de confusão:")
print(confusion_matrix(y_test, y_pred))
print("Acurácia no conjunto de teste:")
print(f"{accuracy_score(y_test, y_pred):.2f}")


Melhores hiperparâmetros encontrados:
{'bootstrap': True, 'max_depth': None, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 300}

Relatório de classificação:
              precision    recall  f1-score   support

           0       0.82      1.00      0.90       153
           3       0.00      0.00      0.00        19
           6       0.00      0.00      0.00        13
           7       0.00      0.00      0.00         2

    accuracy                           0.82       187
   macro avg       0.20      0.25      0.23       187
weighted avg       0.67      0.82      0.74       187

Matriz de confusão:
[[153   0   0   0]
 [ 19   0   0   0]
 [ 13   0   0   0]
 [  2   0   0   0]]
Acurácia no conjunto de teste:
0.82
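
The confusion matrix above shows every sample predicted as class 0, so the 0.82 accuracy is just the share of class 0 in the test set. A short sketch that recomputes accuracy and the macro-averaged recall (balanced accuracy) directly from that matrix makes the gap explicit.

```python
# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
cm = [
    [153, 0, 0, 0],
    [19, 0, 0, 0],
    [13, 0, 0, 0],
    [2, 0, 0, 0],
]

total = sum(sum(row) for row in cm)

# Accuracy: correctly classified samples (the diagonal) over all samples.
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

# Recall per class: diagonal entry over the row total.
recalls = [row[i] / sum(row) for i, row in enumerate(cm)]

# Balanced accuracy: unweighted mean of the per-class recalls.
balanced_accuracy = sum(recalls) / len(recalls)

print(f"accuracy: {accuracy:.2f}")
print(f"balanced accuracy: {balanced_accuracy:.2f}")
```

The balanced accuracy of 0.25 equals the macro-average recall in the classification report and is what a four-class model scores when it recognizes only one class.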


In [None]:
log.preparar_modelo(y_real = y_test, y_pred = y_pred)
#log.salvando_relatorio('RandomForestClassifier', "'bootstrap': True, 'max_depth': None, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 300'")
#log.consultar_modelos()

# Applying cross-validation to RandomForestClassifier

In [None]:
from sklearn.model_selection import cross_validate, KFold,cross_val_predict

# random forest
cv_estrategia = KFold(n_splits=5, shuffle =True, random_state= 45)
resultados = cross_validate(best_rf, X,y, scoring='accuracy', cv = cv_estrategia, return_train_score =True)

In [None]:
# Display the results
print("Resultados do Cross-Validation RandomForestClassifier:")
print("-" * 40)
print(f"{'Fold':<10} {'Fit Time':<15} {'Score Time':<15} {'Test Score':<15} {'Train Score':<15}")
print("-" * 40)

for i in range(len(resultados['fit_time'])):
    print(f"{i+1:<10} {resultados['fit_time'][i]:<15.6f} {resultados['score_time'][i]:<15.6f} "
          f"{resultados['test_score'][i]:<15.6f} {resultados['train_score'][i]:<15.6f}")

# Display the mean of the results
print("-" * 40)
print(f"{'Média':<10} {'':<15} {'':<15} "
      f"{np.mean(resultados['test_score']):<15.6f} {np.mean(resultados['train_score']):<15.6f}")

Resultados do Cross-Validation RandomForestClassifier:
----------------------------------------
Fold       Fit Time        Score Time      Test Score      Train Score    
----------------------------------------
1          2.377373        0.057850        0.786096        0.803763       
2          1.788126        0.060592        0.774194        0.802685       
3          1.908811        0.104280        0.779570        0.794631       
4          2.158906        0.065338        0.784946        0.790604       
5          1.849209        0.052167        0.768817        0.801342       
----------------------------------------
Média                                      0.778725        0.798605       


In [None]:
cv_estrategia = KFold(n_splits=8, shuffle=True, random_state=45)

# saving the models

for train_idx, test_idx in cv_estrategia.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    best_rf.fit(X_train, y_train)  # the best_rf model
    y_pred = best_rf.predict(X_test)

    # saving each k-fold
    log.preparar_modelo(y_real = y_test, y_pred = y_pred)
    log.salvando_relatorio("RandomForestClassifier 'bootstrap': True, 'max_depth': None, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 300", 'RandomForestClassifier, k-fold')


Salvamento concluído
Salvamento concluído
Salvamento concluído
Salvamento concluído
Salvamento concluído
Salvamento concluído
Salvamento concluído
Salvamento concluído


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Assuming df_filtrado is your DataFrame and eletrocardiogramas_holter is the list of columns of interest
X = df_filtrado[eletrocardiogramas_holter][~df_filtrado['Cause of death'].isin([1])]
y = df_filtrado['Cause of death'][~df_filtrado['Cause of death'].isin([1])]

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the hyperparameter grid for GridSearchCV
param_grid = {
    'hidden_layer_sizes': [(50,50,50), (50,100,50), (100,)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant','adaptive'],
}

# Instantiate the MLPClassifier model
mlp = MLPClassifier(max_iter=100)

# Configure GridSearchCV
grid_search = GridSearchCV(estimator=mlp, param_grid=param_grid, cv=5, n_jobs=-1, scoring='accuracy')

# Run GridSearchCV
grid_search.fit(X_train, y_train)

# Best hyperparameters
print("Melhores hiperparâmetros encontrados:")
print(grid_search.best_params_)

# Evaluate the model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

print("\nRelatório de classificação:")
print(classification_report(y_test, y_pred))

print("Matriz de confusão:")
print(confusion_matrix(y_test, y_pred))

print("Acurácia no conjunto de teste:")
print(f"{accuracy_score(y_test, y_pred):.2f}")


Melhores hiperparâmetros encontrados:
{'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 50, 50), 'learning_rate': 'constant', 'solver': 'sgd'}

Relatório de classificação:
              precision    recall  f1-score   support

           0       0.82      1.00      0.90       153
           3       0.00      0.00      0.00        19
           6       0.00      0.00      0.00        13
           7       0.00      0.00      0.00         2

    accuracy                           0.82       187
   macro avg       0.20      0.25      0.23       187
weighted avg       0.67      0.82      0.74       187

Matriz de confusão:
[[153   0   0   0]
 [ 19   0   0   0]
 [ 13   0   0   0]
 [  2   0   0   0]]
Acurácia no conjunto de teste:
0.82
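
The confusion matrix above assigns every test sample to class 0, so the 0.82 accuracy is simply the share of class 0 in the test split. As a small check, the sketch below recomputes accuracy, per-class recall, and balanced accuracy directly from the reported matrix. The variable names are illustrative and not part of the notebook.

```python
import numpy as np

# Confusion matrix reported above for the MLPClassifier test split
# (rows = true classes 0, 3, 6, 7; columns = predicted classes).
cm = np.array([
    [153, 0, 0, 0],
    [19, 0, 0, 0],
    [13, 0, 0, 0],
    [2, 0, 0, 0],
])

# Fraction of correct predictions (diagonal over total).
accuracy = np.trace(cm) / cm.sum()

# Recall per class: correct predictions divided by true samples in that class.
recall_per_class = np.diag(cm) / cm.sum(axis=1)

# Balanced accuracy is the unweighted mean of the per-class recalls.
balanced_accuracy = recall_per_class.mean()

# Accuracy of a trivial model that always predicts the most frequent class.
majority_share = cm.sum(axis=1).max() / cm.sum()

print(f"accuracy:          {accuracy:.3f}")
print(f"recall per class:  {recall_per_class}")
print(f"balanced accuracy: {balanced_accuracy:.3f}")
print(f"majority share:    {majority_share:.3f}")
```

Because accuracy and the majority share coincide here, a metric such as balanced accuracy or macro F1 is probably a more informative target for this problem than plain accuracy.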


In [None]:
log.preparar_modelo(y_real = y_test, y_pred = y_pred)
# log.salvando_relatorio('MLPClassifier redes neurais', "'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 50, 50), 'learning_rate': 'constant', 'solver': 'sgd'")
#log.consultar_modelos()

Salvamento concluído


# Applying cross-validation to MLPClassifier

In [None]:
from sklearn.model_selection import cross_validate, KFold, cross_val_predict

cv_estrategia = KFold(n_splits=5, shuffle=True, random_state=45)
resultados = cross_validate(best_model, X, y, scoring='accuracy', cv=cv_estrategia, return_train_score=True)

In [None]:
# Displaying the results
print("Resultados do Cross-Validation MLPClassifier:")
print("-" * 40)
print(f"{'Fold':<10} {'Fit Time':<15} {'Score Time':<15} {'Test Score':<15} {'Train Score':<15}")
print("-" * 40)

for i in range(len(resultados['fit_time'])):
    print(f"{i+1:<10} {resultados['fit_time'][i]:<15.6f} {resultados['score_time'][i]:<15.6f} "
          f"{resultados['test_score'][i]:<15.6f} {resultados['train_score'][i]:<15.6f}")

# Displaying the mean of the results
print("-" * 40)
print(f"{'Média':<10} {'':<15} {'':<15} "
      f"{np.mean(resultados['test_score']):<15.6f} {np.mean(resultados['train_score']):<15.6f}")

Resultados do Cross-Validation MLPClassifier:
----------------------------------------
Fold       Fit Time        Score Time      Test Score      Train Score    
----------------------------------------
1          2.219651        0.008656        0.796791        0.775538       
2          2.171323        0.007066        0.768817        0.782550       
3          2.304461        0.007532        0.779570        0.779866       
4          2.072987        0.019315        0.784946        0.778523       
5          1.415312        0.006522        0.768817        0.782550       
----------------------------------------
Média                                      0.779788        0.779806       
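
Train and test scores are nearly identical in every fold, which is what one would expect if the model mostly predicts the majority class. A hedged sketch of an alternative evaluation follows: it uses `StratifiedKFold` so each fold keeps the class proportions, puts `StandardScaler` inside a pipeline so scaling is fitted on the training folds only, and reports balanced accuracy and macro F1 next to accuracy. The data is synthetic (`X_demo`, `y_demo` are stand-ins, not the notebook's `X` and `y`), and the hyperparameters are only loosely based on the grid-search result above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the notebook's X and y (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_classes=3,
    weights=[0.8, 0.15, 0.05], random_state=0,
)

# Scaling lives inside the pipeline, so it is re-fitted on each training fold.
pipeline = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(50, 50, 50), activation="tanh",
                  solver="sgd", max_iter=300, random_state=0),
)

# Stratified folds preserve the class proportions in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=45)
metrics = ["accuracy", "balanced_accuracy", "f1_macro"]
scores = cross_validate(pipeline, X_demo, y_demo, cv=cv, scoring=metrics)

for metric in metrics:
    values = scores[f"test_{metric}"]
    print(f"{metric:<18} mean={values.mean():.3f} std={values.std():.3f}")
```

A large gap between accuracy and balanced accuracy in such a report would indicate that the minority classes are being ignored.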


In [None]:
# Saving the cross-validation results

cv_estrategia = KFold(n_splits=8, shuffle=True, random_state=45)

# saving the models

for train_idx, test_idx in cv_estrategia.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    best_model.fit(X_train, y_train)
    y_pred = best_model.predict(X_test)

    # saving each k-fold split
    log.preparar_modelo(y_real = y_test, y_pred = y_pred)
    log.salvando_relatorio("MLPClassifier 'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 50, 50), 'learning_rate': 'constant', 'solver': 'sgd'", 'MLPClassifier, k-fold')


Salvamento concluído
Salvamento concluído
Salvamento concluído
Salvamento concluído
Salvamento concluído
Salvamento concluído
Salvamento concluído
Salvamento concluído


# Neural Networks with Keras
Keras for classification purposes

In [None]:
# pip install tensorflow

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score



# Select the features and the target, excluding the unwanted classes
X = df_filtrado[eletrocardiogramas_holter][~df_filtrado['Cause of death'].isin([1])]
y = df_filtrado['Cause of death'][~df_filtrado['Cause of death'].isin([1])]

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)



y_train = to_categorical(y_train)
y_test = to_categorical(y_test)



model = Sequential()
model.add(Dense(64, input_dim=X_train.shape[1], activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(y_train.shape[1], activation='softmax'))


model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])


In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

history = model.fit(X_train, y_train, epochs=50, batch_size=10, validation_split=0.2)


Epoch 1/50
60/60 ━━━━━━━━━━━━━━━━━━━━ 3s 10ms/step - accuracy: 0.2969 - loss: 2.0078 - val_accuracy: 0.7919 - val_loss: 1.0323
Epoch 2/50
60/60 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.7437 - loss: 0.9834 - val_accuracy: 0.7919 - val_loss: 0.7328
Epoch 3/50
60/60 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.7598 - loss: 0.7589 - val_accuracy: 0.7919 - val_loss: 0.6797
Epoch 4/50
60/60 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.7540 - loss: 0.7326 - val_accuracy: 0.7919 - val_loss: 0.6656
Epoch 5/50
60/60 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.7685 - loss: 0.7134 - val_accuracy: 0.7919 - val_loss: 0.6656
Epoch 6/50
60/60 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.7688 - loss: 0.6665 - val_accuracy: 0.7987 - val_loss: 0.6579
Epoch 7/50
60/60 ━━━━━━━━━

In [None]:
y_pred = model.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis=1)
y_true = np.argmax(y_test, axis=1)

print("Relatório de Classificação:")
print(classification_report(y_true, y_pred_classes))

print("Matriz de Confusão:")
print(confusion_matrix(y_true, y_pred_classes))

print("Acurácia no Conjunto de Teste:")
print(f"{accuracy_score(y_true, y_pred_classes):.2f}")


6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 30ms/step
Relatório de Classificação:
              precision    recall  f1-score   support

           0       0.84      0.92      0.88       153
           3       0.23      0.16      0.19        19
           6       0.00      0.00      0.00        13
           7       0.00      0.00      0.00         2

    accuracy                           0.77       187
   macro avg       0.27      0.27      0.27       187
weighted avg       0.71      0.77      0.74       187

Matriz de Confusão:
[[141   6   6   0]
 [ 16   3   0   0]
 [  9   4   0   0]
 [  1   0   1   0]]
Acurácia no Conjunto de Teste:
0.77
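
One detail of the encoding used above: `to_categorical` sizes its output as the largest label plus one, so the codes 0, 3, 6, 7 yield eight columns, four of which never receive a positive example. The code still runs, but the network carries unused output units. A minimal sketch of remapping the codes to contiguous indices before one-hot encoding follows; it uses only NumPy, and `y_raw` is a hypothetical label vector, not the notebook's `y`.

```python
import numpy as np

# Hypothetical label vector using the same codes as 'Cause of death' here.
y_raw = np.array([0, 3, 0, 6, 7, 0, 3])

# np.unique returns the sorted distinct codes and, with return_inverse=True,
# the position of each sample's code within that sorted array.
classes, y_idx = np.unique(y_raw, return_inverse=True)

# One-hot encode the contiguous indices: one column per class actually present.
y_onehot = np.eye(len(classes))[y_idx]

# After prediction, column indices can be mapped back to the original codes.
predicted_idx = y_onehot.argmax(axis=1)
predicted_codes = classes[predicted_idx]

print("classes:        ", classes)
print("one-hot shape:  ", y_onehot.shape)
print("recovered codes:", predicted_codes)
```

With this mapping the final `Dense` layer would need `len(classes)` units, and reports would be produced from `predicted_codes` rather than from the raw `argmax` output.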


# Cross-validation with Keras - Neural Networks

In [None]:
def create_model(input_dim, num_classes):
    model = Sequential()
    model.add(Dense(64, input_dim=input_dim, activation='relu'))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(num_classes, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model


In [None]:
cv_estrategia = KFold(n_splits=8, shuffle=True, random_state=45)

# training and evaluating one model per fold

for train_idx, test_idx in cv_estrategia.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    y_train = to_categorical(y_train)
    y_test = to_categorical(y_test)

    model = create_model(input_dim = df_filtrado[eletrocardiogramas_holter].shape[1], num_classes=y_train.shape[1])
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    y_pred_classes = np.argmax(y_pred, axis=1)
    y_true = np.argmax(y_test, axis=1)

    print("Relatório de Classificação:")
    print(classification_report(y_true, y_pred_classes))

    print("Matriz de Confusão:")
    print(confusion_matrix(y_true, y_pred_classes))

    print("Acurácia no Conjunto de Teste:")
    print(f"{accuracy_score(y_true, y_pred_classes):.2f}")


26/26 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.3885 - loss: 69.1252
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 26ms/step
Relatório de Classificação:
              precision    recall  f1-score   support

           0       0.82      0.69      0.75        94
           3       0.00      0.00      0.00        11
           6       0.15      0.40      0.22        10
           7       0.00      0.00      0.00         2

    accuracy                           0.59       117
   macro avg       0.24      0.27      0.24       117
weighted avg       0.67      0.59      0.62       117

Matriz de Confusão:
[[65  4 19  6]
 [ 8  0  2  1]
 [ 5  0  4  1]
 [ 1  0  1  0]]
Acurácia no Conjunto de Teste:
0.59
26/26 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.1706 - loss: 101.1675
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 27ms/step
Relatório de Classificação:
              precision    reca
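
The loop above differs from the single-split Keras run in two ways that likely explain the much higher loss: the features are not standardized, and `model.fit` is called with its default of one epoch. A hedged sketch of fold-wise scaling follows, where the scaler is fitted on the training fold only and then applied to the test fold. The data is synthetic (`X_demo` is an assumption), and the Keras calls are left as comments so the block runs without TensorFlow.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in (assumption): three features on very different scales.
X_demo = rng.normal(loc=[0, 500, 70], scale=[1, 200, 15], size=(80, 3))

cv = KFold(n_splits=8, shuffle=True, random_state=45)
fold_means = []

for train_idx, test_idx in cv.split(X_demo):
    # Fit the scaler on the training fold only to avoid leaking test statistics.
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_demo[train_idx])
    X_test_scaled = scaler.transform(X_demo[test_idx])
    fold_means.append(X_train_scaled.mean(axis=0))

    # A Keras model would be built and trained here, for example:
    # model = create_model(input_dim=X_train_scaled.shape[1], num_classes=...)
    # model.fit(X_train_scaled, y_train, epochs=50, batch_size=10, verbose=0)
    # y_pred = model.predict(X_test_scaled)

print("folds processed:", len(fold_means))
print("max |train mean| after scaling:", np.abs(fold_means).max())
```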

# The code below is wrong, because it does not apply to classification

# Comparison with Linear Bayesian Regressors

In [None]:
# import pandas as pd

# from sklearn.linear_model import ARDRegression, BayesianRidge, LinearRegression

# # Select the features and the target, excluding the unwanted classes
# X = df_filtrado[eletrocardiogramas_holter][~df_filtrado['Cause of death'].isin([1])]
# y = df_filtrado['Cause of death'][~df_filtrado['Cause of death'].isin([1])]

# olr = LinearRegression().fit(X, y)
# brr = BayesianRidge(compute_score=True, max_iter=30).fit(X, y)
# ard = ARDRegression(compute_score=True, max_iter=30).fit(X, y)
# df_cof = pd.DataFrame(
#     {
#         #"Weights of true generative process": true_weights,
#         "ARDRegression": ard.coef_,
#         "BayesianRidge": brr.coef_,
#         "LinearRegression": olr.coef_,
#     }
# )

In [None]:
# import matplotlib.pyplot as plt
# import seaborn as sns
# from matplotlib.colors import SymLogNorm
# %matplotlib inline

# plt.figure(figsize=(10, 6))
# ax = sns.heatmap(
#     df_cof.T,
#     norm=SymLogNorm(linthresh=10e-4, vmin=-80, vmax=80),
#     cbar_kws={"label": "coefficients' values"},
#     cmap="seismic_r",
# )
# plt.ylabel("linear model")
# plt.xlabel("coefficients")
# plt.tight_layout(rect=(0, 0, 1, 0.95))
# plt.title("Models' coefficients")
# plt.show()


# Models such as BayesianRidge, ARDRegression, GaussianNB, GaussianMixture, PCA

- Bayesian Logistic Regression: A probabilistic approach to logistic regression that accounts for uncertainty in the model parameters.

- Discrete Generative Models: Models that assume a specific distribution for each class and generate data from those distributions.

- Mixture Models (GMM and the Expectation-Maximization algorithm): Models that assume the data are generated by a combination of Gaussian distributions, which makes them useful for modeling clusters or groups in the data.

- Variational Inference: A technique that approximates complex distributions with simpler ones, making inference in probabilistic models tractable.

- Gaussian Processes for Classification: Models that use Gaussian processes to place probability distributions over functions, enabling classification with uncertainty estimates.

- Bayesian Optimization with Gaussian Processes: An approach to optimizing complex, expensive-to-evaluate functions that uses Gaussian processes to model the objective function.

- Variational Autoencoder: Generative models that learn latent representations of the data, useful for tasks such as data generation and dimensionality reduction.

- Normalizing Flows: Models that transform simple distributions into complex ones, allowing arbitrary distributions to be modeled.

- Design of Probabilistic Machine Learning Systems: Building systems that incorporate uncertainty and probabilities into their models, allowing a richer and more flexible representation of the data.
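
Of the ideas listed above, a class-conditional generative model is the most direct fit for a classification target. As an illustration only, the sketch below fits `GaussianNB`, which assumes one Gaussian per class and feature, and reads class probabilities from `predict_proba`. It runs on synthetic blobs (`X_demo`, `y_demo` are assumptions), not on the Holter features.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic three-class data (assumption), standing in for the real features.
X_demo, y_demo = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

# Stratified split so every class appears in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# GaussianNB estimates a mean and variance per class and feature,
# then applies Bayes' rule to obtain class posteriors.
nb = GaussianNB().fit(X_tr, y_tr)
proba = nb.predict_proba(X_te)
y_hat = nb.predict(X_te)

print("posterior shape:", proba.shape)
print("first sample posteriors:", np.round(proba[0], 3))
print(f"test accuracy: {accuracy_score(y_te, y_hat):.2f}")
```

The posterior probabilities give a per-sample measure of uncertainty, which is the property the list above emphasizes.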

In [None]:
# import pandas as pd
# import numpy as np
# from sklearn.model_selection import train_test_split, GridSearchCV
# from sklearn.preprocessing import StandardScaler
# from sklearn.pipeline import Pipeline
# from sklearn.linear_model import LogisticRegression, BayesianRidge, ARDRegression
# from sklearn.naive_bayes import GaussianNB
# from sklearn.mixture import GaussianMixture
# from sklearn.decomposition import PCA
# from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# X = df_filtrado[eletrocardiogramas_holter][~df_filtrado['Cause of death'].isin([1])]
# y = df_filtrado['Cause of death'][~df_filtrado['Cause of death'].isin([1])]


pipelines

In [None]:
# # Pipeline for logistic regression (scikit-learn's LogisticRegression is regularized, not Bayesian)
# pipe_logistic = Pipeline([
#     ('scaler', StandardScaler()),
#     ('classifier', LogisticRegression())
# ])

# # Pipeline for Bayesian linear regression
# pipe_bayesian_ridge = Pipeline([
#     ('scaler', StandardScaler()),
#     ('classifier', BayesianRidge())
# ])

# # Pipeline for ARD Regression
# pipe_ard = Pipeline([
#     ('scaler', StandardScaler()),
#     ('classifier', ARDRegression())
# ])

# # Pipeline for Naive Bayes
# pipe_naive_bayes = Pipeline([
#     ('scaler', StandardScaler()),
#     ('classifier', GaussianNB())
# ])

# # Pipeline for Gaussian Mixture Model
# pipe_gmm = Pipeline([
#     ('scaler', StandardScaler()),
#     ('classifier', GaussianMixture())
# ])

# # Pipeline for Probabilistic PCA
# pipe_pca = Pipeline([
#     ('scaler', StandardScaler()),
#     ('classifier', PCA())
# ])


GridSearchCV

In [None]:
# param_grid_logistic = {
#     'classifier__C': [0.1, 1, 10],
#     'classifier__solver': ['newton-cg', 'lbfgs', 'sag']
# }

# param_grid_bayesian_ridge = {
#     'classifier__alpha_1': [1e-6, 1e-5, 1e-4],
#     'classifier__alpha_2': [1e-6, 1e-5, 1e-4]
# }

# param_grid_ard = {
#     'classifier__alpha_1': [1e-6, 1e-5, 1e-4],
#     'classifier__alpha_2': [1e-6, 1e-5, 1e-4]
# }

# param_grid_naive_bayes = {}

# param_grid_gmm = {
#     'classifier__n_components': [2, 3, 4],
#     'classifier__covariance_type': ['full', 'tied', 'diag', 'spherical']
# }

# param_grid_pca = {
#     'classifier__n_components': [2, 3, 4]
# }


In [None]:
# # Bayesian Logistic Regression
# grid_logistic = GridSearchCV(pipe_logistic, param_grid_logistic, cv=5, n_jobs=-1)
# grid_logistic.fit(X_train, y_train)
# print("Melhores parâmetros para Regressão Logística Bayesiana:", grid_logistic.best_params_)

# # # Bayesian Linear Regression
# # grid_bayesian_ridge = GridSearchCV(pipe_bayesian_ridge, param_grid_bayesian_ridge, cv=5, n_jobs=-1)
# # grid_bayesian_ridge.fit(X_train, y_train)
# # print("Melhores parâmetros para Regressão Linear Bayesiana:", grid_bayesian_ridge.best_params_)

# # ARD Regression
# grid_ard = GridSearchCV(pipe_ard, param_grid_ard, cv=5, n_jobs=-1)
# grid_ard.fit(X_train, y_train)
# print("Melhores parâmetros para ARD Regression:", grid_ard.best_params_)

# # Naive Bayes
# # grid_naive_bayes = GridSearchCV(pipe_naive_bayes, param_grid_naive_bayes, cv=5, n_jobs=-1)
# # grid_naive_bayes.fit(X_train, y_train)
# # print("Melhores parâmetros para Naive Bayes:", grid_naive_bayes.best_params_)

# # Gaussian Mixture Model
# grid_gmm = GridSearchCV(pipe_gmm, param_grid_gmm, cv=5, n_jobs=-1)
# grid_gmm.fit(X_train, y_train)
# print("Melhores parâmetros para Gaussian Mixture Model:", grid_gmm.best_params_)

# # Probabilistic PCA
# grid_pca = GridSearchCV(pipe_pca, param_grid_pca, cv=5, n_jobs=-1)
# grid_pca.fit(X_train, y_train)
# print("Melhores parâmetros para PCA Probabilístico:", grid_pca.best_params_)


In [None]:
# # Function to evaluate the model
# def avaliar_modelo(grid, X_test, y_test):
#     y_pred = grid.predict(X_test)
#     print("\nRelatório de classificação:")
#     print(classification_report(y_test, y_pred))
#     print("Matriz de confusão:")
#     print(confusion_matrix(y_test, y_pred))
#     print(f"Acurácia no conjunto de teste: {accuracy_score(y_test, y_pred):.2f}")

# # Evaluate each model
# print("\nAvaliação para Regressão Logística Bayesiana:")
# avaliar_modelo(grid_logistic, X_test, y_test)

# # print("\nAvaliação para Regressão Linear Bayesiana:")
# # avaliar_modelo(grid_bayesian_ridge, X_test, y_test)

# # print("\nAvaliação para ARD Regression:")
# # avaliar_modelo(grid_ard, X_test, y_test)

# print("\nAvaliação para Naive Bayes:")
# avaliar_modelo(grid_naive_bayes, X_test, y_test)

# print("\nAvaliação para Gaussian Mixture Model:")
# avaliar_modelo(grid_gmm, X_test, y_test)

# # print("\nAvaliação para PCA Probabilístico:")
# # avaliar_modelo(grid_pca, X_test, y_test)


# # all models are commented out because they do not apply
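
Not every estimator in the commented pipelines is equally out of place for a classification target. A quick way to check is scikit-learn's `is_classifier` helper, sketched below for the six estimators named above. The dictionary and its keys are illustrative only.

```python
from sklearn.base import is_classifier
from sklearn.decomposition import PCA
from sklearn.linear_model import ARDRegression, BayesianRidge, LogisticRegression
from sklearn.mixture import GaussianMixture
from sklearn.naive_bayes import GaussianNB

# Estimators taken from the commented-out pipelines above.
candidates = {
    "LogisticRegression": LogisticRegression(),
    "GaussianNB": GaussianNB(),
    "BayesianRidge": BayesianRidge(),
    "ARDRegression": ARDRegression(),
    "GaussianMixture": GaussianMixture(),
    "PCA": PCA(),
}

# is_classifier reports whether scikit-learn treats the estimator as a classifier.
usable = {name: is_classifier(est) for name, est in candidates.items()}

for name, ok in usable.items():
    print(f"{name:<20} classifier: {ok}")
```

Under this check only `LogisticRegression` and `GaussianNB` qualify as final steps of a classification pipeline; the regressors predict continuous values, and `GaussianMixture` and `PCA` are unsupervised.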