# Setup

First, let's set up the dependencies needed to run the Jupyter notebook.

## Installing dependencies

The project uses the following libraries:

- `pandas`: to operate on `DataFrame`s
- `matplotlib`: to plot charts
- `seaborn`: to help plot the most common chart types
- `skops`: to generate artifacts from the models
- `scikit-learn`: to train the models
- `ydata-profiling`: to profile the data and extract insights
- `statsmodels`: for statistical calculations (the notebook imports `variance_inflation_factor` from it)
- `numpy`: for numerical calculations

Installing and importing the libraries:

In [6]:
!pip install numpy pandas skops seaborn matplotlib ydata_profiling scikit-learn statsmodels pyqt6



In [7]:
import pandas as pd
import skops.io as sio
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
from ydata_profiling import ProfileReport
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MaxAbsScaler
from sklearn.model_selection import RandomizedSearchCV
from datetime import datetime
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pickle
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    precision_score,
    recall_score,
    f1_score
  )

In [8]:
matplotlib.use('TkAgg')

## Helper functions


1 - `plot_corr_graph`: Plots the correlation heatmap
  - Parameters:
    - `df` (`pd.DataFrame`): `DataFrame` to plot
    - `name` (`str`): Chart title (optional)
  - Returns:
    - `None`: the function returns no value

In [9]:
def plot_corr_graph(df: pd.DataFrame, name: str ='Correlation Matrix') -> None:
  corr_matrix = df.corr()
  plt.figure(figsize=(50, 50))
  sns.heatmap(corr_matrix, cmap='bwr')
  plt.title(name)

2 - `get_top_abs_correlations`: Gets the column pairs with the highest absolute correlations
  - Parameters:
    - `df` (`pd.DataFrame`): `DataFrame` to use
    - `n` (`int`): Number of pairs to return
  - Returns:
    - `pd.Series`: The top `n` absolute correlations, indexed by the pair of column names

In [10]:
def get_top_abs_correlations(df: pd.DataFrame, n: int = 5) -> pd.Series:
    au_corr = df.corr().abs().unstack()
    labels_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
      for j in range(0, i+1):
        labels_to_drop.add((cols[i], cols[j]))
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]
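
To make the return value concrete, here is a minimal sketch on a toy `DataFrame`. The data and the name `top_abs_correlations` are illustrative and not part of the notebook; the sketch reaches the same ranking with an upper-triangle mask instead of a set of labels to drop.

```python
import numpy as np
import pandas as pd

def top_abs_correlations(df: pd.DataFrame, n: int = 5) -> pd.Series:
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # perfectly correlated with "a"
    "c": [5, 3, 4, 1, 2],
})
top = top_abs_correlations(toy, n=2)
print(top)
```

The result is a `pd.Series` whose index is a pair of column names, which is the same shape the notebook prints later for the real data.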

**DEPRECATED**

3 - `calculate_precision`: Prints the accuracy, the classification report, and the confusion matrix of a model
  - Parameters:
    - `y_test` (`pd.DataFrame`): `DataFrame` with the test labels
    - `y_pred` (`pd.DataFrame`): `DataFrame` with the predictions
  - Returns:
    - `None`: no return value

In [11]:
def calculate_precision(y_test: pd.DataFrame, y_pred: pd.DataFrame) -> None:
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy}")

    # Classification report
    print("Classification Report:")
    print(classification_report(y_test, y_pred))

    # Confusion matrix
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))

4 - `export_model`: Exports a model to a `.pkl` file using `pickle`
  - Parameters:
    - `model`: Model to save
    - `model_name`: File name to save under (without the `.pkl` extension)

In [12]:
def export_model(model, model_name):
    with open(model_name + '.pkl', 'wb') as file:
        pickle.dump(model, file)
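
A minimal round-trip sketch of how a file written by `export_model` can be read back. The `load_model` helper and the dictionary standing in for a fitted estimator are hypothetical; unpickling should only be done on files you trust.

```python
import pickle
import tempfile
from pathlib import Path

def export_model(model, model_name):
    with open(model_name + '.pkl', 'wb') as file:
        pickle.dump(model, file)

def load_model(path):
    # Hypothetical counterpart to export_model
    with open(path, 'rb') as file:
        return pickle.load(file)

with tempfile.TemporaryDirectory() as tmp:
    base = str(Path(tmp) / "toy_model")
    # Stand-in object; in the notebook this would be a fitted estimator
    stand_in = {"n_neighbors": 5, "weights": "uniform"}
    export_model(stand_in, base)
    restored = load_model(base + ".pkl")

print(restored == stand_in)
```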

# Importing the data

We will use `dengue_sinan.csv`, which is stored on Google Drive; the cell below reads a local copy.

In [13]:
df = pd.read_csv(
    #"/content/drive/MyDrive/ACCS/dengue_sinan.csv",
    #"/content/dengue_sinan.csv",
    "./dengue_sinan.csv",
    low_memory=False
)

As a first step, let's get an overview of the data we are working with.

In [14]:
df.shape

(620211, 148)

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 620211 entries, 0 to 620210
Columns: 148 entries, NU_NOTIFIC to ID_CNS_SUS_HASHED
dtypes: float64(115), int64(8), object(25)
memory usage: 700.3+ MB


In [16]:
df.describe()

Unnamed: 0,NU_NOTIFIC,TP_NOT,SEM_NOT,NU_ANO,SG_UF_NOT,ID_MUNICIP,ID_REGIONA,ID_UNIDADE,SEM_PRI,SOUNDEX,...,DT_TRANSDM,DT_TRANSRM,DT_TRANSRS,DT_TRANSSE,NU_LOTE_V,NU_LOTE_H,CS_FLXRET,FLXRECEBI,IDENT_MICR,MIGRADO_W
count,620211.0,620211.0,620211.0,620211.0,620211.0,620211.0,619972.0,619654.0,620211.0,0.0,...,0.0,0.0,0.0,0.0,2.0,0.0,620209.0,0.0,613280.0,0.0
mean,163536.3,2.000003,202089.455401,2020.719987,29.020938,292045.76695,1390.364726,4224895.0,202075.828657,,...,,,,,0.0,,0.378242,,95447.26,
std,559275.0,0.001796,267.587171,2.684069,0.69421,6983.781566,64.124777,2537868.0,528.878688,,...,,,,,0.0,,0.484949,,52851620.0,
min,0.0,2.0,201552.0,2016.0,11.0,110004.0,1331.0,35.0,20211.0,,...,,,,,0.0,,0.0,,2.0,
25%,788.0,2.0,201922.0,2019.0,29.0,291080.0,1381.0,2505711.0,201921.0,,...,,,,,0.0,,0.0,,4.0,
50%,12005.0,2.0,202106.0,2021.0,29.0,291800.0,1385.0,2802074.0,202105.0,,...,,,,,0.0,,0.0,,4.0,
75%,62916.5,2.0,202330.0,2023.0,29.0,292740.0,1398.0,6602533.0,202328.0,,...,,,,,0.0,,1.0,,4.0,
max,9994664.0,3.0,202414.0,2024.0,53.0,530010.0,6255.0,9999396.0,202414.0,,...,,,,,0.0,,1.0,,29274000000.0,


So our `DataFrame` initially has `148` columns and `620211` rows.

## Initial cleaning

As a first step, we can find the columns in which **all** values are missing.

In [17]:
nan_cols = []

for col in df.columns:
  if df[col].isnull().all():
    nan_cols.append(col)
    print(f'{col} esta vazia')

SOUNDEX esta vazia
EVIDENCIA esta vazia
CON_FHD esta vazia
DT_TRANSUS esta vazia
DT_TRANSDM esta vazia
DT_TRANSRM esta vazia
DT_TRANSRS esta vazia
DT_TRANSSE esta vazia
NU_LOTE_H esta vazia
FLXRECEBI esta vazia
MIGRADO_W esta vazia
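
The same check can be written without an explicit loop. A minimal sketch on a toy `DataFrame` (the data is illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 2, 3],
    "empty": [np.nan, np.nan, np.nan],
    "partial": [1.0, np.nan, 3.0],
})

# isna().all() is True only for columns where every row is missing
nan_cols = toy.columns[toy.isna().all()].tolist()
cleaned = toy.drop(columns=nan_cols)
print(nan_cols)
```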


We can therefore remove these empty columns.

In [18]:
df.drop(nan_cols, inplace=True, axis=1)

Next, we can get rid of the columns that hold only a constant value.

In [19]:
const_cols = df.columns[df.nunique(dropna=False) <= 1]

for col in const_cols:
  print(f'{col} apresenta valores constantes')

ID_AGRAVO apresenta valores constantes
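
The `dropna=False` argument matters here: a column with a single value plus some nulls counts as two distinct values and is not flagged. A minimal sketch on toy data (column names are illustrative) showing the difference:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "constant": [7, 7, 7],
    "one_value_or_null": [7.0, np.nan, 7.0],
    "varied": [1, 2, 3],
})

# NaN counts as its own value, so only truly constant columns are flagged
strict = toy.columns[toy.nunique(dropna=False) <= 1].tolist()
# NaN is ignored, so "single value when present" columns are flagged too
lenient = toy.columns[toy.nunique(dropna=True) <= 1].tolist()
print(strict, lenient)
```

This is consistent with columns whose non-null values are all equal only being removed later, after the profiling report.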


Dropping these constant-valued columns:

In [20]:
df.drop(const_cols, inplace=True, axis=1)

We can now check the shape to see how many columns were removed:

In [21]:
df.shape

(620211, 136)

# Generating the report

We can use `ydata-profiling` to generate an initial report on our data. This produces a `report.html` file that may contain important information.

**Warning**: this cell can take a few minutes to run.

In [22]:
profile = ProfileReport(df, title="Profiling Report")
profile.to_file('report.html')

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

  return spearmanr(a, b)[0]
(using `df.profile_report(correlations={"auto": {"calculate": False}})`
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: 'could not convert string to float: 'M'')
  .reset_index(name=duplicates_key)

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]

Based on what we found in the report above, we can remove the following columns:

- NU_LOTE_V: only one non-null value
- NU_LOTE_I: only one non-null value
- SG_UF: all non-null values are equal
- ID_PAIS: all non-null values are equal
- GENGIVO: only two non-null values
- METRO: only two non-null values
- SANGRAM: only two non-null values
- TP_NOT: a single value that differs from all the others
- TP_SISTEMA: all non-null values are equal
- NDUPLIC_N: all non-null values are equal
- DT_TRANSSM: all non-null values are equal

In [23]:
df.drop([
    'TP_NOT',
    'SG_UF',
    'ID_PAIS',
    'NU_LOTE_I',
    'TP_SISTEMA',
    'NDUPLIC_N',
    'DT_TRANSSM',
    'NU_LOTE_V',
    'GENGIVO',
    'METRO',
    'SANGRAM'
], inplace=True, axis=1)
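
The "only one or two non-null values" criterion from the list above can also be checked programmatically. A minimal sketch, assuming a toy `DataFrame` and a hypothetical `min_non_null` threshold that is not part of the notebook:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_empty": [np.nan, np.nan, np.nan, 4.0],
    "half_empty": [1.0, np.nan, np.nan, 2.0],
    "useful": [1, 2, 3, 4],
})

min_non_null = 2  # hypothetical threshold, chosen only for this example
non_null = toy.count()  # number of non-null values per column
sparse = non_null[non_null < min_non_null].index.tolist()
print(sparse)
```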

# Transforming values

Let's convert values from `str` to `int` and `datetime`.

- To `datetime`:
  - DT_SIN_PRI
  - ID_OCUPA_N
  - DT_INVEST
  - DT_DIGITA
  - DT_NOTIFIC
  - DT_CHIK_S1
  - DT_CHIK_S2
  - DT_PRNT
  - DT_SORO
  - DT_NS1
  - DT_VIRAL
  - DT_INTERNA
  - DT_OBITO
  - DT_ALRM
  - DT_GRAV
  - DT_PCR
  - DT_ENCERRA

- To `int`:
  - ID_OCUPA_N
  - CS_SEXO

In [24]:
cols = [
    'DT_SIN_PRI',
    'ID_OCUPA_N',
    'DT_INVEST',
    'DT_DIGITA',
    'DT_NOTIFIC',
    'DT_CHIK_S1',
    'DT_CHIK_S2',
    'DT_PRNT',
    'DT_SORO',
    'DT_NS1',
    'DT_VIRAL',
    'DT_INTERNA',
    'DT_OBITO',
    'DT_ALRM',
    'DT_GRAV',
    'DT_PCR',
    'DT_ENCERRA'
]

for col in cols:
  df[col] = pd.to_datetime(df[col], errors='coerce')
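
With `errors='coerce'`, values that cannot be parsed become `NaT` instead of raising an error. A minimal sketch on a toy series (the values are illustrative):

```python
import pandas as pd

raw = pd.Series(["2016-03-05", "not a date", None])

# Unparseable and missing entries both end up as NaT
parsed = pd.to_datetime(raw, errors="coerce")
print(parsed)
```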



In [25]:
df['ID_OCUPA_N'] = pd.to_numeric(df['ID_OCUPA_N'], errors='coerce')
df['CS_SEXO'] = df['CS_SEXO'].replace({'M': 1, 'F': 0, 'I': 2})
df['NM_REFEREN'] = pd.factorize(df['NM_REFEREN'])[0]
df['NM_BAIRRO'] = pd.factorize(df['NM_BAIRRO'])[0]
df['NOBAIINF'] = pd.factorize(df['NOBAIINF'])[0]
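
`pd.factorize` returns integer codes in order of first appearance and, by default, encodes missing values as `-1`, so the factorized columns no longer contain nulls. A minimal sketch with illustrative values:

```python
import pandas as pd

names = pd.Series(["Centro", "Barra", None, "Centro"])

# codes: one integer per row; uniques: the distinct non-null labels
codes, uniques = pd.factorize(names)
print(codes, list(uniques))
```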



Removing `str` columns

- DS_OBS: notes about the case
- ID_CNS_SUS_HASHED: SUS ID

We can remove these columns because they are not data we can use in our development.

In [26]:
df.drop([
    'DS_OBS',
    'ID_CNS_SUS_HASHED'
], axis=1, inplace=True)

# Investigating

## Correlation

We can start by computing the correlation between the variables.

In [27]:
plot_corr_graph(df)

Viewing it as a table:

In [28]:
df.corr()

Unnamed: 0,NU_NOTIFIC,DT_NOTIFIC,SEM_NOT,NU_ANO,SG_UF_NOT,ID_MUNICIP,ID_REGIONA,ID_UNIDADE,DT_SIN_PRI,SEM_PRI,...,EPISTAXE,PETEQUIAS,HEMATURA,LACO_N,PLASMATICO,PLAQ_MENOR,COMPLICA,DT_DIGITA,CS_FLXRET,IDENT_MICR
NU_NOTIFIC,1.000000,0.031313,0.029076,0.026630,0.220976,0.223496,0.105601,-0.026054,0.021533,0.010667,...,-1.0,1.0,1.0,1.0,-0.497249,1.0,-1.0,0.002047,0.003735,0.001026
DT_NOTIFIC,0.031313,1.000000,0.999295,0.996920,0.009270,0.016407,0.013376,0.086760,0.744571,0.509323,...,1.0,-1.0,-1.0,-1.0,0.862609,-1.0,-1.0,0.047618,-0.229100,-0.000261
SEM_NOT,0.029076,0.999295,1.000000,0.999143,0.009248,0.016088,0.014159,0.084979,0.744096,0.509636,...,1.0,-1.0,-1.0,-1.0,0.866025,-1.0,-1.0,0.047573,-0.230101,-0.000204
NU_ANO,0.026630,0.996920,0.999143,1.000000,0.009226,0.015716,0.015036,0.083000,0.742367,0.509149,...,,,,,,,,0.047498,-0.230737,-0.000144
SG_UF_NOT,0.220976,0.009270,0.009248,0.009226,1.000000,0.988639,0.536426,-0.010616,0.007662,0.005276,...,,,,,,,,-0.001820,0.034225,-0.000054
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
PLAQ_MENOR,1.000000,-1.000000,-1.000000,,,1.000000,1.000000,1.000000,-1.000000,-1.000000,...,-1.0,1.0,1.0,1.0,-1.000000,1.0,,-1.000000,1.000000,
COMPLICA,-1.000000,-1.000000,-1.000000,,,-1.000000,-1.000000,-1.000000,-1.000000,-1.000000,...,,,,,,,1.0,-1.000000,1.000000,
DT_DIGITA,0.002047,0.047618,0.047573,0.047498,-0.001820,-0.000126,0.002813,0.000703,0.037301,0.024354,...,1.0,-1.0,-1.0,-1.0,0.936766,-1.0,-1.0,1.000000,-0.030260,-0.000261
CS_FLXRET,0.003735,-0.229100,-0.230101,-0.230737,0.034225,0.023313,0.019867,0.009835,-0.167251,-0.116386,...,-1.0,1.0,1.0,1.0,-1.000000,1.0,1.0,-0.030260,1.000000,-0.030456


We can use `get_top_abs_correlations` to get the pairs with the highest correlations:

In [29]:
print(get_top_abs_correlations(df, 10))

COMUNINF    MANI_HEMOR    1.0
ID_RG_RESI  PLASMATICO    1.0
ID_REGIONA  PLASMATICO    1.0
DT_PCR      PETEQUIAS     1.0
NM_BAIRRO   EPISTAXE      1.0
ID_MUNICIP  ID_LOGRADO    1.0
ID_UNIDADE  PETEQUIAS     1.0
            EPISTAXE      1.0
NM_BAIRRO   PETEQUIAS     1.0
ID_MUNICIP  HEMATURA      1.0
dtype: float64


To automate the process, we can now build a list of the highly correlated columns:

In [30]:
corr_matrix = df.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Flag columns whose absolute correlation with an earlier column is above 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
print(to_drop)

['SEM_NOT', 'NU_ANO', 'ID_MUNICIP', 'ID_DISTRIT', 'ID_LOGRADO', 'ID_GEO2', 'NM_REFEREN', 'DT_INVEST', 'MIALGIA', 'EXANTEMA', 'NAUSEA', 'DOR_COSTAS', 'ARTRALGIA', 'DT_CHIK_S1', 'DT_SORO', 'RESUL_PCR_', 'SOROTIPO', 'MUNICIPIO', 'COMUNINF', 'CODISINF', 'CRITERIO', 'EVOLUCAO', 'DT_ENCERRA', 'ALRM_ABDOM', 'ALRM_LETAR', 'ALRM_LIQ', 'GRAV_ENCH', 'GRAV_TAQUI', 'GRAV_EXTRE', 'GRAV_CONSC', 'GRAV_ORGAO', 'MANI_HEMOR', 'EPISTAXE', 'PETEQUIAS', 'HEMATURA', 'LACO_N', 'PLASMATICO', 'PLAQ_MENOR', 'COMPLICA', 'DT_DIGITA', 'CS_FLXRET']
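
To see which column of a correlated pair ends up in the list, here is a minimal sketch of the same filter on a toy `DataFrame` (the data and column names are illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "x_double": [2, 4, 6, 8, 10],  # perfectly correlated with "x"
    "y": [5, 3, 4, 1, 2],
})

corr_matrix = toy.corr().abs()
# Keep only the strict upper triangle: each column is compared with earlier ones
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if (upper[column] > 0.95).any()]
print(to_drop)
```

Because only the upper triangle is kept, the later column of each highly correlated pair is the one flagged, and the earlier one stays in the data.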


We can therefore remove the columns below. Compared with the automated list, `NM_REFEREN` is kept, and `HOSPITAL` and `UF` are dropped as well:
- SEM_NOT
- NU_ANO
- ID_MUNICIP
- ID_DISTRIT
- ID_LOGRADO
- ID_GEO2
- DT_INVEST
- MIALGIA
- EXANTEMA
- NAUSEA
- DOR_COSTAS
- ARTRALGIA
- DT_CHIK_S1
- DT_SORO
- RESUL_PCR_
- SOROTIPO
- MUNICIPIO
- HOSPITAL
- COMUNINF
- CODISINF
- CRITERIO
- EVOLUCAO
- DT_ENCERRA
- ALRM_ABDOM
- ALRM_LETAR
- ALRM_LIQ
- GRAV_ENCH
- GRAV_TAQUI
- GRAV_EXTRE
- GRAV_CONSC
- GRAV_ORGAO
- MANI_HEMOR
- EPISTAXE
- PETEQUIAS
- HEMATURA
- LACO_N
- PLASMATICO
- PLAQ_MENOR
- COMPLICA
- DT_DIGITA
- CS_FLXRET
- UF

In [31]:
to_drop = [
  'SEM_NOT',
  'NU_ANO',
  'ID_MUNICIP',
  'ID_DISTRIT',
  'ID_LOGRADO',
  'ID_GEO2',
  'DT_INVEST',
  'MIALGIA',
  'EXANTEMA',
  'NAUSEA',
  'DOR_COSTAS',
  'ARTRALGIA',
  'DT_CHIK_S1',
  'DT_SORO',
  'RESUL_PCR_',
  'SOROTIPO',
  'MUNICIPIO',
  'COMUNINF',
  'CODISINF',
  'CRITERIO',
  'EVOLUCAO',
  'DT_ENCERRA',
  'ALRM_ABDOM',
  'ALRM_LETAR',
  'ALRM_LIQ',
  'GRAV_ENCH',
  'GRAV_TAQUI',
  'GRAV_EXTRE',
  'GRAV_CONSC',
  'GRAV_ORGAO',
  'MANI_HEMOR',
  'EPISTAXE',
  'PETEQUIAS',
  'HEMATURA',
  'LACO_N',
  'PLASMATICO',
  'PLAQ_MENOR',
  'COMPLICA',
  'DT_DIGITA',
  'CS_FLXRET',
  'HOSPITAL',
  'UF'
]

df.drop(to_drop, inplace=True, axis=1)

In [32]:
plot_corr_graph(df)

# Handling null values

In [33]:
df.columns

Index(['NU_NOTIFIC', 'DT_NOTIFIC', 'SG_UF_NOT', 'ID_REGIONA', 'ID_UNIDADE',
       'DT_SIN_PRI', 'SEM_PRI', 'NU_IDADE_N', 'CS_SEXO', 'CS_GESTANT',
       'CS_RACA', 'CS_ESCOL_N', 'ID_MN_RESI', 'ID_RG_RESI', 'ID_BAIRRO',
       'NM_BAIRRO', 'ID_GEO1', 'NM_REFEREN', 'CS_ZONA', 'ID_OCUPA_N', 'FEBRE',
       'CEFALEIA', 'VOMITO', 'CONJUNTVIT', 'ARTRITE', 'PETEQUIA_N',
       'LEUCOPENIA', 'LACO', 'DOR_RETRO', 'DIABETES', 'HEMATOLOG', 'HEPATOPAT',
       'RENAL', 'HIPERTENSA', 'ACIDO_PEPT', 'AUTO_IMUNE', 'DT_CHIK_S2',
       'DT_PRNT', 'RES_CHIKS1', 'RES_CHIKS2', 'RESUL_PRNT', 'RESUL_SORO',
       'DT_NS1', 'RESUL_NS1', 'DT_VIRAL', 'RESUL_VI_N', 'DT_PCR', 'HISTOPA_N',
       'IMUNOH_N', 'HOSPITALIZ', 'DT_INTERNA', 'DDD_HOSP', 'TEL_HOSP',
       'TPAUTOCTO', 'COUFINF', 'COPAISINF', 'CO_BAINF', 'NOBAIINF',
       'CLASSI_FIN', 'DOENCA_TRA', 'CLINC_CHIK', 'DT_OBITO', 'ALRM_HIPOT',
       'ALRM_PLAQ', 'ALRM_VOM', 'ALRM_SANG', 'ALRM_HEMAT', 'ALRM_HEPAT',
       'DT_ALRM', 'GRAV_PULSO', 'GRAV_CON

First, we can see which columns of our `DataFrame` contain null values:

In [34]:
df.isna().sum()

NU_NOTIFIC         0
DT_NOTIFIC         0
SG_UF_NOT          0
ID_REGIONA       239
ID_UNIDADE       557
               ...  
GRAV_SANG     619633
GRAV_AST      619634
GRAV_MIOC     619633
DT_GRAV       619671
IDENT_MICR      6931
Length: 81, dtype: int64

Listing the columns that contain null values:

In [35]:
df.columns[df.isna().any()].tolist()

['ID_REGIONA',
 'ID_UNIDADE',
 'DT_SIN_PRI',
 'CS_SEXO',
 'CS_GESTANT',
 'CS_RACA',
 'CS_ESCOL_N',
 'ID_RG_RESI',
 'ID_BAIRRO',
 'ID_GEO1',
 'CS_ZONA',
 'FEBRE',
 'CEFALEIA',
 'VOMITO',
 'CONJUNTVIT',
 'ARTRITE',
 'PETEQUIA_N',
 'LEUCOPENIA',
 'LACO',
 'DOR_RETRO',
 'DIABETES',
 'HEMATOLOG',
 'HEPATOPAT',
 'RENAL',
 'HIPERTENSA',
 'ACIDO_PEPT',
 'AUTO_IMUNE',
 'DT_CHIK_S2',
 'DT_PRNT',
 'RES_CHIKS1',
 'RES_CHIKS2',
 'RESUL_PRNT',
 'RESUL_SORO',
 'DT_NS1',
 'RESUL_NS1',
 'DT_VIRAL',
 'RESUL_VI_N',
 'DT_PCR',
 'HISTOPA_N',
 'IMUNOH_N',
 'HOSPITALIZ',
 'DT_INTERNA',
 'DDD_HOSP',
 'TEL_HOSP',
 'TPAUTOCTO',
 'COUFINF',
 'COPAISINF',
 'CO_BAINF',
 'CLASSI_FIN',
 'DOENCA_TRA',
 'CLINC_CHIK',
 'DT_OBITO',
 'ALRM_HIPOT',
 'ALRM_PLAQ',
 'ALRM_VOM',
 'ALRM_SANG',
 'ALRM_HEMAT',
 'ALRM_HEPAT',
 'DT_ALRM',
 'GRAV_PULSO',
 'GRAV_CONV',
 'GRAV_INSUF',
 'GRAV_HIPOT',
 'GRAV_HEMAT',
 'GRAV_MELEN',
 'GRAV_METRO',
 'GRAV_SANG',
 'GRAV_AST',
 'GRAV_MIOC',
 'DT_GRAV',
 'IDENT_MICR']

For the `CLASSI_FIN` column, we drop the rows where the value is null:

In [36]:
df.dropna(subset=['CLASSI_FIN'], inplace=True)

After that, we can use `replace` to collapse the various codes into 0 when the patient does not have dengue and 1 when they do:

In [37]:
df['CLASSI_FIN'].unique()

array([ 8., 10.,  5.,  1.,  2., 12., 11.])

In [38]:
# Codes 5, 8, 1 and 2 become 0 (no dengue); codes 10, 11 and 12 become 1 (dengue)
df['CLASSI_FIN'] = df['CLASSI_FIN'].replace({
    5: 0, 8: 0, 1: 0, 2: 0,
    10: 1, 11: 1, 12: 1
})
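
An equivalent way to build the binary target is a membership test. A minimal sketch on the unique values printed above; the name `DENGUE_CODES` is illustrative, and unlike `replace` this variant maps any code outside the list to 0, which is an assumption of the sketch:

```python
import pandas as pd

DENGUE_CODES = [10, 11, 12]  # the codes the notebook maps to 1

classi = pd.Series([8.0, 10.0, 5.0, 1.0, 2.0, 12.0, 11.0])

# True where the code is one of the dengue codes, then cast to 0/1
target = classi.isin(DENGUE_CODES).astype(int)
print(target.tolist())
```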

Now let's handle the remaining fields individually.

For the fields:

- ID_REGIONA
- ID_UNIDADE
- ID_RG_RESI
- ID_BAIRRO
- DDD_HOSP
- TEL_HOSP
- ID_GEO1
- CS_ZONA
- COUFINF

we fill the nulls from the non-null values of the same column, using a forward fill followed by a backward fill:

In [39]:
cols = [
  'ID_REGIONA',
  'ID_UNIDADE',
  'ID_RG_RESI',
  'ID_BAIRRO',
  'DDD_HOSP',
  'TEL_HOSP',
  'ID_GEO1',
  'CS_ZONA',
  'COUFINF'
]

for col in cols:
  df[col] = df[col].ffill()
  df[col] = df[col].bfill()
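
A minimal sketch of what the forward fill followed by a backward fill does, on an illustrative series. Each gap takes the closest previous value, and the backward fill only covers gaps at the very start:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1381.0, np.nan, np.nan, 1385.0, np.nan])

# ffill propagates the last seen value; bfill then fills the leading gap
filled = s.ffill().bfill()
print(filled.tolist())
```

Note that the filled values depend on the order of the rows, since each gap copies its neighbor.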

For the symptom fields, we use `2` as the default value for nulls, since it means "no", and then recode `2` as `0`:

In [40]:
cols = [
  "FEBRE",
  "CEFALEIA",
  "VOMITO",
  "CONJUNTVIT",
  "ARTRITE",
  "PETEQUIA_N",
  "LEUCOPENIA",
  "LACO",
  "DOR_RETRO",
  "CLASSI_FIN",
  "DIABETES",
  "HEMATOLOG",
  'HEPATOPAT',
  'RENAL',
  'HIPERTENSA',
  'ACIDO_PEPT',
  'AUTO_IMUNE',
  'GRAV_PULSO',
  'GRAV_CONV',
  'GRAV_INSUF',
  'GRAV_HIPOT',
  'GRAV_HEMAT',
  'GRAV_MELEN',
  'GRAV_METRO',
  'GRAV_SANG',
  'GRAV_AST',
  'GRAV_MIOC',
  'ALRM_HIPOT',
  'ALRM_PLAQ',
  'ALRM_VOM',
  'ALRM_SANG',
  'ALRM_HEMAT',
  'ALRM_HEPAT',
  'DOENCA_TRA',
  'RESUL_VI_N',
]

for col in cols:
  df[col] = df[col].fillna(2)
  df[col] = df[col].replace(2, 0)

Setting null exam results to "not performed" (`4`):

In [41]:
cols = [
  'RES_CHIKS1',
  'RES_CHIKS2',
  'RESUL_PRNT',
  'RESUL_SORO',
  'RESUL_NS1',
  'IMUNOH_N',
  'HISTOPA_N'
]

for col in cols:
  df[col] = df[col].fillna(4)

Setting the value `0` for `CLINC_CHIK`:

In [42]:
df['CLINC_CHIK'] = df['CLINC_CHIK'].fillna(0)

Handling dates by converting them to integers (nanoseconds since the epoch), with `0` intended for missing values:

In [43]:
cols = [
  'DT_SIN_PRI',
  'DT_CHIK_S2',
  'DT_PRNT',
  'DT_NS1',
  'DT_VIRAL',
  'DT_PCR',
  'DT_OBITO',
  'DT_ALRM',
  'DT_GRAV',
  'DT_INTERNA',
  'DT_NOTIFIC'
]

start_date = pd.to_datetime('1970-01-01')

for col in cols:
  df[col] = pd.to_datetime(df[col], errors='coerce').view('int64')
  df[col] = df[col].fillna(0)
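
In the cell above, `.view('int64')` runs before `fillna(0)`, so a missing date has already become the `int64` minimum (the `-9223372036854775808` visible in the `df.head()` output further down) and there is nothing left for `fillna` to fill. A minimal sketch of one way to get `0` for missing dates instead, assuming it is acceptable to treat them as `1970-01-01`; the toy series is illustrative:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2016-03-05", None]), errors="coerce")
epoch = pd.Timestamp("1970-01-01")

# Fill NaT with the epoch first, then convert to integer nanoseconds
as_int = dates.fillna(epoch).astype("datetime64[ns]").astype("int64")
print(as_int.tolist())
```

Filling before the integer conversion keeps the sentinel out of the data, at the cost of making missing dates indistinguishable from a real `1970-01-01`.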



Filling nulls in the `CS_*` fields with each column's most frequent value (mode):

In [44]:
cols = [
  'CS_SEXO',
  'CS_GESTANT',
  'CS_RACA',
  'CS_ESCOL_N',
]

for col in cols:
  df[col] = df[col].fillna(df[col].mode()[0])
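
A minimal sketch of the mode-based fill on an illustrative series. `mode()` returns a `Series` because there can be ties, so `[0]` picks the first of the most frequent values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 0.0, 1.0, np.nan])

most_frequent = s.mode()[0]  # NaN is ignored when computing the mode
filled = s.fillna(most_frequent)
print(filled.tolist())
```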

Setting "no" (`2`) as the default value for `HOSPITALIZ`:

In [45]:
df['HOSPITALIZ'] = df['HOSPITALIZ'].fillna(2)

In [46]:
df['TPAUTOCTO'] = df['TPAUTOCTO'].fillna(1)

Setting `1` as the default value for `COPAISINF`:

In [47]:
df['COPAISINF'] = df['COPAISINF'].fillna(1)

Setting the mode as the default value for `CO_BAINF`:

In [48]:
df['CO_BAINF'] = df['CO_BAINF'].fillna(df['CO_BAINF'].mode()[0])

Setting the mode as the default value for `IDENT_MICR`:

In [49]:
df['IDENT_MICR'] = df['IDENT_MICR'].fillna(df['IDENT_MICR'].mode()[0])

We can now verify that every column has been handled:

In [50]:
df.columns[df.isna().any()].tolist()

[]

In [51]:
df.head()

Unnamed: 0,NU_NOTIFIC,DT_NOTIFIC,SG_UF_NOT,ID_REGIONA,ID_UNIDADE,DT_SIN_PRI,SEM_PRI,NU_IDADE_N,CS_SEXO,CS_GESTANT,...,GRAV_INSUF,GRAV_HIPOT,GRAV_HEMAT,GRAV_MELEN,GRAV_METRO,GRAV_SANG,GRAV_AST,GRAV_MIOC,DT_GRAV,IDENT_MICR
0,158,1457136000000000000,29,1381.0,2498731.0,1456876800000000000,201609,3009.0,1.0,6.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-9223372036854775808,4.0
1,298,1455494400000000000,29,1385.0,3280969.0,1455408000000000000,201607,4039.0,1.0,6.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-9223372036854775808,4.0
2,5082,1458864000000000000,29,1385.0,2800527.0,1458777600000000000,201612,4053.0,0.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-9223372036854775808,4.0
3,111262,1458777600000000000,29,1385.0,2706628.0,1458691200000000000,201612,4065.0,0.0,6.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-9223372036854775808,4.0
4,166,1457827200000000000,29,1381.0,2498731.0,1457740800000000000,201610,4067.0,0.0,6.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-9223372036854775808,4.0


## Exporting the data

Let's export the data to `csv`.

In [52]:
df['DT_NOTIFIC'].head()

0    1457136000000000000
1    1455494400000000000
2    1458864000000000000
3    1458777600000000000
4    1457827200000000000
Name: DT_NOTIFIC, dtype: int64

In [53]:
df_tmp = df.copy()

cols = [
  'DT_SIN_PRI',
  'DT_CHIK_S2',
  'DT_PRNT',
  'DT_NS1',
  'DT_VIRAL',
  'DT_PCR',
  'DT_OBITO',
  'DT_ALRM',
  'DT_GRAV',
  'DT_INTERNA',
  'DT_NOTIFIC'
]

for col in cols:
  df_tmp[col] = pd.to_datetime(df_tmp[col], errors='coerce', unit='ns')

df_tmp.to_csv('out.csv', index=True, sep=';')

In [54]:
df_tmp.head()

Unnamed: 0,NU_NOTIFIC,DT_NOTIFIC,SG_UF_NOT,ID_REGIONA,ID_UNIDADE,DT_SIN_PRI,SEM_PRI,NU_IDADE_N,CS_SEXO,CS_GESTANT,...,GRAV_INSUF,GRAV_HIPOT,GRAV_HEMAT,GRAV_MELEN,GRAV_METRO,GRAV_SANG,GRAV_AST,GRAV_MIOC,DT_GRAV,IDENT_MICR
0,158,2016-03-05,29,1381.0,2498731.0,2016-03-02,201609,3009.0,1.0,6.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,NaT,4.0
1,298,2016-02-15,29,1385.0,3280969.0,2016-02-14,201607,4039.0,1.0,6.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,NaT,4.0
2,5082,2016-03-25,29,1385.0,2800527.0,2016-03-24,201612,4053.0,0.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,NaT,4.0
3,111262,2016-03-24,29,1385.0,2706628.0,2016-03-23,201612,4065.0,0.0,6.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,NaT,4.0
4,166,2016-03-13,29,1381.0,2498731.0,2016-03-12,201610,4067.0,0.0,6.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,NaT,4.0


In [55]:
df.to_csv('out_brute.csv', index=False)

# Training

To simplify training, we create a `TrainingModels` class responsible for building and training the selected models. The hyperparameter values can be changed when the object is initialized.

The following models are used:
KNeighborsClassifier, DecisionTreeClassifier, RandomForestClassifier, LogisticRegression, and MLPClassifier

Their default hyperparameters are set in the class constructor:
 `knn_neighbors = 5`, `knn_weights = 'uniform'`, `knn_metric = 'euclidean'`, `dt_criterion = 'entropy'`, `dt_min_samples_split = 2`, `rf_n_estimators = 100`, `rf_criterion = 'entropy'`, `rf_max_depth = None`, `logistic_max_iter = 100`, `logistic_penalty = 'l2'`, `logistic_solver = 'lbfgs'`, `mlp_hidden_layer_sizes = (100, 100)`, `mlp_activation = 'relu'`, `mlp_solver = 'adam'`,
`mlp_learning_rate_init = 0.001`, `mlp_max_iter = 200`, `mlp_batch_size = 32`


In [56]:
class TrainingModels:
  def __init__(
    self,
    knn_neighbors: int = 5,
    knn_weights: str = 'uniform',
    knn_metric: str = 'euclidean',
    dt_criterion: str = 'entropy',
    dt_min_samples_split: int = 2,
    rf_n_estimators: int = 100,
    rf_criterion: str = 'entropy',
    rf_max_depth: int = None,
    logistic_max_iter: int = 100,
    logistic_penalty: str = 'l2',
    logistic_solver: str = 'lbfgs',
    mlp_hidden_layer_sizes: tuple[int, int] = (100, 100),
    mlp_activation: str = 'relu',
    mlp_solver: str = 'adam',
    mlp_learning_rate_init: float = 0.001,
    mlp_max_iter: int = 200,
    mlp_batch_size: int = 32
  ):
    self.knn = KNeighborsClassifier(
        n_neighbors=knn_neighbors,
        weights=knn_weights,
        metric=knn_metric
      )
    self.dt = DecisionTreeClassifier(
        criterion=dt_criterion,
        min_samples_split=dt_min_samples_split
      )
    self.rf = RandomForestClassifier(
        n_estimators=rf_n_estimators,
        criterion=rf_criterion,
        max_depth=rf_max_depth
      )
    self.logistic = LogisticRegression(
        max_iter=logistic_max_iter,
        penalty=logistic_penalty,
        solver=logistic_solver
      )
    self.mlp = MLPClassifier(
        hidden_layer_sizes=mlp_hidden_layer_sizes,
        activation=mlp_activation,
        solver=mlp_solver,
        learning_rate_init=mlp_learning_rate_init,
        max_iter=mlp_max_iter,
        batch_size=mlp_batch_size
      )

  def fit(self, X_train: pd.DataFrame, y_train: pd.DataFrame) -> None:
    print("Treinando Knn...")
    self.knn.fit(X_train, y_train)
    print("Treinando DT...")
    self.dt.fit(X_train, y_train)
    print("Treinando RF...")
    self.rf.fit(X_train, y_train)
    print("Treinando Logistic...")
    self.logistic.fit(X_train, y_train)
    print("Treinando MLP")
    self.mlp.fit(X_train, y_train)


  def plot_metrics(self, X_test, y_test):
        models = {'KNN': self.knn, 'Decision Tree': self.dt, 'Random Forest': self.rf, 'Logistic Regression': self.logistic, 'MLP': self.mlp}
        metrics = {'Accuracy': accuracy_score, 'Precision': precision_score, 'Recall': recall_score, 'F1 Score': f1_score}
        colors = ['b', 'g', 'r', 'c', 'm']

        bar_width = 0.15
        index = np.arange(len(models))
        metric_values = {}

        for model_name, model in models.items():
            y_pred = model.predict(X_test)
            # One list of scores per model, in the same order as `metrics`
            metric_values[model_name] = [metric_func(y_test, y_pred) for metric_func in metrics.values()]

            # Plot the confusion matrix of each model in its own figure
            cm = confusion_matrix(y_test, y_pred)
            plt.figure(figsize=(5, 4))
            sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
            plt.title(f'{model_name} Confusion Matrix')
            plt.xlabel('Predicted')
            plt.ylabel('Actual')
            plt.show()

        # Create the comparison figure only after the confusion matrices are shown,
        # so the grouped bars are drawn on this figure and not on a default-sized one
        plt.figure(figsize=(12, 6))
        for i, (metric_name, metric_color) in enumerate(zip(metrics.keys(), colors)):
            plt.bar(index + i * bar_width, [metric_value[i] for metric_value in metric_values.values()], bar_width, label=metric_name, color=metric_color)

        plt.xlabel('Models')
        plt.ylabel('Value')
        plt.title('Model Metrics Comparison')
        plt.xticks(index + bar_width * (len(metrics) - 1) / 2, models.keys())
        plt.legend()
        plt.tight_layout()
        plt.show()

  def _help_load_models(self, path):
      with open(path, 'rb') as file:
          return pickle.load(file)
    
  def load_models(self, knn: str, dt: str, rf: str, logistic: str, mlp: str):
      self.knn = self._help_load_models(knn)
      self.dt = self._help_load_models(dt)
      self.rf = self._help_load_models(rf)
      self.logistic = self._help_load_models(logistic)
      self.mlp = self._help_load_models(mlp)

In [57]:
models = TrainingModels()

In [58]:
scaler = MaxAbsScaler()
scaled_features = scaler.fit_transform(df)

df_scaled = pd.DataFrame(scaled_features, columns=df.columns)

X = df_scaled.drop(["CLASSI_FIN"], axis=1)
y = df_scaled['CLASSI_FIN']

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=45)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=45)
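
The cell above fits the scaler on the whole `DataFrame`, target included, before splitting. A common alternative is to split first and fit the scaler on the training features only, so that no statistics from the validation or test rows reach the training step. A minimal sketch on synthetic data (the toy columns `feature_a` and `feature_b` are illustrative), using the same split sizes as the notebook:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MaxAbsScaler

rng = np.random.default_rng(45)
toy = pd.DataFrame({
    "feature_a": rng.integers(1, 100, size=50),
    "feature_b": rng.normal(size=50),
    "CLASSI_FIN": rng.integers(0, 2, size=50),
})
X = toy.drop(columns=["CLASSI_FIN"])
y = toy["CLASSI_FIN"]

# 20% test, then 25% of the remainder as validation: a 60/20/20 split
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=45)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=45)

# Fit on the training features only, then reuse the same scaling everywhere
scaler = MaxAbsScaler().fit(X_train)
X_train_s = pd.DataFrame(scaler.transform(X_train), columns=X.columns, index=X_train.index)
X_val_s = pd.DataFrame(scaler.transform(X_val), columns=X.columns, index=X_val.index)
X_test_s = pd.DataFrame(scaler.transform(X_test), columns=X.columns, index=X_test.index)

print(len(X_train), len(X_val), len(X_test))
```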

In [None]:
models.fit(X_train, y_train)

Metrics evaluated:

- Accuracy
- Precision
- Recall
- F1 Score

In [None]:
models.plot_metrics(X_test, y_test)

The metrics show that `DecisionTree` and, above all, `RandomForest` perform best on the proposed problem, which matches what would be expected for this kind of problem.

As a possible improvement, the number of iterations (`max_iter`) of the logistic regression could be increased in the hope of a better result.

#### Saving the models

In [None]:
export_model(models.mlp, "mlp_pesos_padroes")
export_model(models.logistic, "logistic_pesos_padroes")
export_model(models.knn, "knn_pesos_padroes")
export_model(models.dt, "dt_pesos_padroes")
export_model(models.rf, "rf_pesos_padroes")

# Selecting Parameters

Let's use `RandomizedSearchCV` to try to find the best parameters.

In [None]:
param_dist = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, 30],
    'criterion': ['gini', 'entropy']
}
random_search = RandomizedSearchCV(estimator=RandomForestClassifier(), param_distributions=param_dist, scoring='accuracy', random_state=42)
random_search.fit(X_train, y_train)
best_params = random_search.best_params_
print("Best parameters found: ", best_params)

# score by F1, ROC curve
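
The note above mentions scoring by F1 and the ROC curve. A minimal sketch of how that could look with `RandomizedSearchCV`, assuming synthetic data from `make_classification` and a deliberately small search space so that it runs quickly; `X_toy`, `y_toy` and the parameter values are illustrative, not the notebook's:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X_toy, y_toy = make_classification(n_samples=120, n_features=8, random_state=42)

param_dist = {
    'n_estimators': [10, 20],
    'max_depth': [3, 5],
    'criterion': ['gini', 'entropy'],
}
search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=4,                     # number of sampled combinations
    scoring=['f1', 'roc_auc'],    # evaluate both metrics on every fold
    refit='f1',                   # pick and refit the best model by F1
    cv=3,
    random_state=42,
)
search.fit(X_toy, y_toy)
print("Best parameters found: ", search.best_params_)
```

With several metrics, `refit` names the one used to select `best_params_`, and the per-metric averages are available in `search.cv_results_` under keys such as `mean_test_f1` and `mean_test_roc_auc`.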

So the best values are:

- `n_estimators`: 100
- `max_depth`: None
- `criterion`: entropy
- `bootstrap`: True
- `class_weight`: balanced
- `max_features`: sqrt
- `min_samples_leaf`: 1
- `min_samples_split`: 10

In [None]:
param_dist = {
    'n_neighbors': [10, 20, 25, 30, 35, 40],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}
random_search = RandomizedSearchCV(KNeighborsClassifier(), param_distributions=param_dist, scoring='accuracy', random_state=42, verbose = 1)
random_search.fit(X_train, y_train)
best_params = random_search.best_params_
print("Best parameters found: ", best_params)

So the best values are:

- `n_neighbors`: 35
- `weights`: distance
- `metric`: manhattan

In [None]:
param_dist = {
    'criterion': ['gini', 'entropy'],
    'min_samples_split': [5, 10, 20]
}
random_search = RandomizedSearchCV(DecisionTreeClassifier(), param_distributions=param_dist, scoring='accuracy', random_state=42, verbose = 1)
random_search.fit(X_train, y_train)
best_params = random_search.best_params_
print("Best parameters found: ", best_params)

So the best values are:

- `criterion`: entropy
- `min_samples_split`: 10
- `splitter`: random

In [None]:
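param_dist = {
    # Note: 'elasticnet' is only supported by solver='saga', and 'liblinear'
    # does not support penalty=None; candidates with such combinations fail
    # to fit and receive a NaN score.
    'penalty': ['l2', 'elasticnet', None],
    'solver': ['newton-cg', 'lbfgs', 'liblinear'],
    'max_iter': [350, 400, 450, 500, 550, 600, 700, 800]
}
random_search = RandomizedSearchCV(LogisticRegression(), param_distributions=param_dist, scoring='accuracy', random_state=42, verbose=1)
# Fit on the training split only so the test set stays unseen during tuning
random_search.fit(X_train, y_train)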
best_params = random_search.best_params_
print("Best parameters found: ", best_params)

Thus, the best values are:

- `penalty`: l2
- `solver`: lbfgs
- `max_iter`: 200
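The logistic regression grid pairs `elasticnet` with solvers that do not implement it, so some sampled candidates cannot be fitted. One way to avoid wasting iterations is to pass `param_distributions` as a list of dictionaries, each describing only valid combinations. The sketch below assumes synthetic data and an added `saga` branch with illustrative `l1_ratio` values; neither appears in the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import MaxAbsScaler

# Synthetic data standing in for the notebook's training split
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_demo = MaxAbsScaler().fit_transform(X_demo)

# Each dictionary is sampled on its own, so only compatible
# penalty/solver pairs are ever tried.
param_dist = [
    {
        "penalty": ["l2"],
        "solver": ["newton-cg", "lbfgs", "liblinear"],
        "max_iter": [350, 500, 800],
    },
    {
        "penalty": ["elasticnet"],
        "solver": ["saga"],          # assumption: saga branch added for elasticnet
        "l1_ratio": [0.3, 0.7],      # illustrative values
        "max_iter": [2000],
    },
]

search = RandomizedSearchCV(
    LogisticRegression(),
    param_distributions=param_dist,
    n_iter=6,
    scoring="accuracy",
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)

# Count candidates whose fit failed (they would show up as NaN scores)
failed = int(np.isnan(search.cv_results_["mean_test_score"]).sum())
print("Failed candidates:", failed)
print("Best parameters found:", search.best_params_)
```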

In [None]:
param_dist = {
    'hidden_layer_sizes': [(100, 100), (150, 150)],
    'activation': ['relu', 'tanh', 'logistic'],
    'solver': ['adam', 'sgd', 'lbfgs'],
    'learning_rate_init': [0.0001, 0.001],
    'max_iter': [100, 200, 300],
    'batch_size': [16, 32]
}
random_search = RandomizedSearchCV(MLPClassifier(), param_distributions=param_dist, scoring='accuracy', random_state=42, verbose=1)
# Fit on the training split only so the test set stays unseen during tuning
random_search.fit(X_train, y_train)
best_params = random_search.best_params_
print("Best parameters found: ", best_params)

Thus, the best values are:

- `hidden_layer_sizes`: (150, 150)
- `activation`: relu
- `solver`: adam
- `learning_rate_init`: 0.001
- `max_iter`: 300
- `batch_size`: 32


In [None]:
new_models = TrainingModels(
    rf_n_estimators = 100,
    rf_criterion = 'entropy',
    rf_max_depth = None,
    knn_neighbors = 35,
    knn_weights = 'distance',
    knn_metric = 'manhattan',
    dt_criterion = 'entropy',
    dt_min_samples_split = 10,
    logistic_max_iter = 200,
    logistic_penalty = 'l2',
    logistic_solver = 'lbfgs',
    mlp_hidden_layer_sizes = (150, 150),
    mlp_activation = 'relu',
    mlp_solver = 'adam',
    mlp_learning_rate_init = 0.001,
    mlp_max_iter = 300,
    mlp_batch_size = 32
)
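The values above are copied by hand from each search. As a hedged alternative, the dictionary in `best_params_` can be unpacked directly into the estimator, or the already refitted `best_estimator_` can be reused. The sketch uses synthetic data and a reduced KNN grid as stand-ins; how this would plug into `TrainingModels` is not shown, since its constructor is defined elsewhere in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the notebook's training split
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

param_dist = {
    "n_neighbors": [10, 20, 35],
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan"],
}
search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions=param_dist,
    n_iter=5,
    scoring="accuracy",
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)

# Option 1: rebuild the estimator from the winning parameters
knn = KNeighborsClassifier(**search.best_params_).fit(X_demo, y_demo)

# Option 2: reuse the estimator the search already refitted (refit=True is the default)
knn_refit = search.best_estimator_

print(search.best_params_)
print(knn.get_params() == knn_refit.get_params())
```

Either option avoids transcription mistakes when a grid changes and the search is rerun.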

In [None]:
new_models.fit(X_train, y_train)

In [None]:
new_models.plot_metrics(X_test, y_test)

#### Saving the models

In [None]:
export_model(new_models.mlp, "mlp_parametros_livres")
export_model(new_models.logistic, "logistic_parametros_livres")
export_model(new_models.knn, "knn_parametros_livres")
export_model(new_models.dt, "dt_parametros_livres")
export_model(new_models.rf, "rf_parametros_livres")

## Using the exported models

We can load the models exported above and try them out on the data.

In [59]:
models = TrainingModels()
models.load_models(knn="./knn_parametros_livres.pkl",
                   dt="./dt_parametros_livres.pkl",
                   rf="./rf_parametros_livres.pkl",
                   logistic="./logistic_parametros_livres.pkl",
                   mlp="./mlp_parametros_livres.pkl")
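For reference, a minimal sketch of what a save-and-load round trip for a fitted scikit-learn model looks like with the standard `pickle` module. This is an assumption about the general mechanism suggested by the `.pkl` paths, not the notebook's `export_model` / `load_models` implementation; the file name and data are illustrative.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a small fitted model to persist
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
model = DecisionTreeClassifier(
    criterion="entropy", min_samples_split=10, random_state=0
).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "dt_demo.pkl"  # hypothetical file name
    # Serialize the fitted estimator to disk
    with open(path, "wb") as f:
        pickle.dump(model, f)
    # Load it back; only unpickle files from a trusted source
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```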

In [None]:
models.plot_metrics(X_test, y_test)

# Using the test datasets

For this, we use `d1.csv` as Dataset 1 and `d2.csv` as Dataset 2.

In [60]:
d1 = pd.read_csv("d1.csv")
d2 = pd.read_csv("d2.csv")

In [61]:
d1.head()

| | HEMATOLOG | HEMATURA | LACO | CEFALEIA | ID_GEO2 | FEBRE | NAUSEA | PETEQUIAS | CONJUNTVIT | EPISTAXE | ... | AUTO_IMUNE | HIPERTENSA | PLASMATICO | PETEQUIA_N | MIALGIA | LACO_N | ACIDO_PEPT | EXANTEMA | PLAQ_MENOR | CLASSI_FIN_BINARIO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | 0.0 |
| 1 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | 0.0 |
| 2 | 0.0 | -1.0 | 0.0 | 1.0 | -1.0 | 0.0 | 0.0 | -1.0 | 0.0 | -1.0 | ... | 0.0 | 0.0 | -1.0 | 0.0 | 1.0 | -1.0 | 0.0 | 0.0 | -1.0 | 1.0 |
| 3 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | 1.0 |
| 4 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | 0.0 |


In [62]:
d1.columns

Index(['HEMATOLOG', 'HEMATURA', 'LACO', 'CEFALEIA', 'ID_GEO2', 'FEBRE',
       'NAUSEA', 'PETEQUIAS', 'CONJUNTVIT', 'EPISTAXE', 'MANI_HEMOR',
       'DIABETES', 'ARTRALGIA', 'ARTRITE', 'LEUCOPENIA', 'DOR_COSTAS',
       'DOR_RETRO', 'VOMITO', 'RENAL', 'HEPATOPAT', 'AUTO_IMUNE', 'HIPERTENSA',
       'PLASMATICO', 'PETEQUIA_N', 'MIALGIA', 'LACO_N', 'ACIDO_PEPT',
       'EXANTEMA', 'PLAQ_MENOR', 'CLASSI_FIN_BINARIO'],
      dtype='object')

In [63]:
set(df) - set(d1)

{'ALRM_HEMAT',
 'ALRM_HEPAT',
 'ALRM_HIPOT',
 'ALRM_PLAQ',
 'ALRM_SANG',
 'ALRM_VOM',
 'CLASSI_FIN',
 'CLINC_CHIK',
 'COPAISINF',
 'COUFINF',
 'CO_BAINF',
 'CS_ESCOL_N',
 'CS_GESTANT',
 'CS_RACA',
 'CS_SEXO',
 'CS_ZONA',
 'DDD_HOSP',
 'DOENCA_TRA',
 'DT_ALRM',
 'DT_CHIK_S2',
 'DT_GRAV',
 'DT_INTERNA',
 'DT_NOTIFIC',
 'DT_NS1',
 'DT_OBITO',
 'DT_PCR',
 'DT_PRNT',
 'DT_SIN_PRI',
 'DT_VIRAL',
 'GRAV_AST',
 'GRAV_CONV',
 'GRAV_HEMAT',
 'GRAV_HIPOT',
 'GRAV_INSUF',
 'GRAV_MELEN',
 'GRAV_METRO',
 'GRAV_MIOC',
 'GRAV_PULSO',
 'GRAV_SANG',
 'HISTOPA_N',
 'HOSPITALIZ',
 'IDENT_MICR',
 'ID_BAIRRO',
 'ID_GEO1',
 'ID_MN_RESI',
 'ID_OCUPA_N',
 'ID_REGIONA',
 'ID_RG_RESI',
 'ID_UNIDADE',
 'IMUNOH_N',
 'NM_BAIRRO',
 'NM_REFEREN',
 'NOBAIINF',
 'NU_IDADE_N',
 'NU_NOTIFIC',
 'RESUL_NS1',
 'RESUL_PRNT',
 'RESUL_SORO',
 'RESUL_VI_N',
 'RES_CHIKS1',
 'RES_CHIKS2',
 'SEM_PRI',
 'SG_UF_NOT',
 'TEL_HOSP',
 'TPAUTOCTO'}

In [64]:
d2.head()

| | RENAL | ALRM_HIPOT | GRAV_HEMAT | CS_FLXRET | ID_BAIRRO | ACIDO_PEPT | VOMITO | GRAV_ENCH | HEPATOPAT | ID_DISTRIT | ... | LEUCOPENIA | DDD_HOSP | GRAV_PULSO | EVOLUCAO | TEL_HOSP | DT_SORO | FEBRE | DOR_COSTAS | ID_GEO2 | CLASSI_FIN_BINARIO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.0 | -1.0 | -1.0 | 0.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -9223372036854775808 | -1.0 | -1.0 | -1.0 | 0.0 |
| 1 | -1.0 | -1.0 | -1.0 | 0.0 | 6.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -9223372036854775808 | -1.0 | -1.0 | -1.0 | 0.0 |
| 2 | 2.0 | -1.0 | -1.0 | 1.0 | 192.0 | 2.0 | 2.0 | -1.0 | 2.0 | -1.0 | ... | 2.0 | -1.0 | -1.0 | 1.0 | -1.0 | -9223372036854775808 | 2.0 | 2.0 | -1.0 | 1.0 |
| 3 | -1.0 | -1.0 | -1.0 | 0.0 | 24.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -9223372036854775808 | -1.0 | -1.0 | -1.0 | 1.0 |
| 4 | -1.0 | -1.0 | -1.0 | 0.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | -9223372036854775808 | -1.0 | -1.0 | -1.0 | 0.0 |


In [65]:
d2.shape

(549797, 116)

In [66]:
df.shape

(549797, 81)

In [67]:
d2.columns

Index(['RENAL', 'ALRM_HIPOT', 'GRAV_HEMAT', 'CS_FLXRET', 'ID_BAIRRO',
       'ACIDO_PEPT', 'VOMITO', 'GRAV_ENCH', 'HEPATOPAT', 'ID_DISTRIT',
       ...
       'LEUCOPENIA', 'DDD_HOSP', 'GRAV_PULSO', 'EVOLUCAO', 'TEL_HOSP',
       'DT_SORO', 'FEBRE', 'DOR_COSTAS', 'ID_GEO2', 'CLASSI_FIN_BINARIO'],
      dtype='object', length=116)

In [68]:
set(df) - set(d2)

{'CLASSI_FIN', 'CS_SEXO', 'ID_OCUPA_N', 'NM_BAIRRO', 'NM_REFEREN', 'NOBAIINF'}

In [69]:
df.head()

| | NU_NOTIFIC | DT_NOTIFIC | SG_UF_NOT | ID_REGIONA | ID_UNIDADE | DT_SIN_PRI | SEM_PRI | NU_IDADE_N | CS_SEXO | CS_GESTANT | ... | GRAV_INSUF | GRAV_HIPOT | GRAV_HEMAT | GRAV_MELEN | GRAV_METRO | GRAV_SANG | GRAV_AST | GRAV_MIOC | DT_GRAV | IDENT_MICR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 158 | 1457136000000000000 | 29 | 1381.0 | 2498731.0 | 1456876800000000000 | 201609 | 3009.0 | 1.0 | 6.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -9223372036854775808 | 4.0 |
| 1 | 298 | 1455494400000000000 | 29 | 1385.0 | 3280969.0 | 1455408000000000000 | 201607 | 4039.0 | 1.0 | 6.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -9223372036854775808 | 4.0 |
| 2 | 5082 | 1458864000000000000 | 29 | 1385.0 | 2800527.0 | 1458777600000000000 | 201612 | 4053.0 | 0.0 | 5.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -9223372036854775808 | 4.0 |
| 3 | 111262 | 1458777600000000000 | 29 | 1385.0 | 2706628.0 | 1458691200000000000 | 201612 | 4065.0 | 0.0 | 6.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -9223372036854775808 | 4.0 |
| 4 | 166 | 1457827200000000000 | 29 | 1381.0 | 2498731.0 | 1457740800000000000 | 201610 | 4067.0 | 0.0 | 6.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -9223372036854775808 | 4.0 |


In [70]:
df.columns

Index(['NU_NOTIFIC', 'DT_NOTIFIC', 'SG_UF_NOT', 'ID_REGIONA', 'ID_UNIDADE',
       'DT_SIN_PRI', 'SEM_PRI', 'NU_IDADE_N', 'CS_SEXO', 'CS_GESTANT',
       'CS_RACA', 'CS_ESCOL_N', 'ID_MN_RESI', 'ID_RG_RESI', 'ID_BAIRRO',
       'NM_BAIRRO', 'ID_GEO1', 'NM_REFEREN', 'CS_ZONA', 'ID_OCUPA_N', 'FEBRE',
       'CEFALEIA', 'VOMITO', 'CONJUNTVIT', 'ARTRITE', 'PETEQUIA_N',
       'LEUCOPENIA', 'LACO', 'DOR_RETRO', 'DIABETES', 'HEMATOLOG', 'HEPATOPAT',
       'RENAL', 'HIPERTENSA', 'ACIDO_PEPT', 'AUTO_IMUNE', 'DT_CHIK_S2',
       'DT_PRNT', 'RES_CHIKS1', 'RES_CHIKS2', 'RESUL_PRNT', 'RESUL_SORO',
       'DT_NS1', 'RESUL_NS1', 'DT_VIRAL', 'RESUL_VI_N', 'DT_PCR', 'HISTOPA_N',
       'IMUNOH_N', 'HOSPITALIZ', 'DT_INTERNA', 'DDD_HOSP', 'TEL_HOSP',
       'TPAUTOCTO', 'COUFINF', 'COPAISINF', 'CO_BAINF', 'NOBAIINF',
       'CLASSI_FIN', 'DOENCA_TRA', 'CLINC_CHIK', 'DT_OBITO', 'ALRM_HIPOT',
       'ALRM_PLAQ', 'ALRM_VOM', 'ALRM_SANG', 'ALRM_HEMAT', 'ALRM_HEPAT',
       'DT_ALRM', 'GRAV_PULSO', 'GRAV_CON

In [71]:
# Fill the feature columns missing from d2 with the most frequent value in df
for col in ['CS_SEXO', 'ID_OCUPA_N', 'NM_BAIRRO', 'NM_REFEREN', 'NOBAIINF']:
    d2[col] = df[col].mode()[0]
# Copy the target from df; pandas aligns the values on the row index
d2['CLASSI_FIN'] = df['CLASSI_FIN']

In [72]:
df['CS_SEXO'].mode()

0    0.0
Name: CS_SEXO, dtype: float64

In [73]:
columns_with_nan = d2.columns[d2.isna().any()].tolist()
print("Columns with NaN values:", columns_with_nan)

Columns with NaN values: ['CLASSI_FIN']
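The cells above fill columns absent from `d2` with the mode from `df` and copy the target across, which leaves NaN values in `CLASSI_FIN`. The toy example below sketches one way such NaNs can arise: assigning a `Series` aligns on the row index, so rows whose index has no match in the source become NaN. The frames `train` and `new` and their values are invented for illustration; whether this or pre-existing NaNs in `df` explains the notebook's result is not something the output shows.

```python
import pandas as pd

# Toy stand-ins for df (training data) and d2 (new data); values are made up
train = pd.DataFrame(
    {
        "CS_SEXO": [0.0, 0.0, 1.0, 0.0],
        "FEBRE": [1.0, 0.0, 1.0, 1.0],
        "CLASSI_FIN": [1.0, 0.0, 1.0, 0.0],
    },
    index=[0, 1, 2, 3],
)
new = pd.DataFrame({"FEBRE": [1.0, 1.0, 0.0]}, index=[2, 3, 4])

# A scalar (the mode) is broadcast to every row
new["CS_SEXO"] = train["CS_SEXO"].mode()[0]

# A Series is aligned on the index: row 4 has no match in train, so it gets NaN
new["CLASSI_FIN"] = train["CLASSI_FIN"]

# Reorder the columns to match the training frame
new = new[train.columns]

columns_with_nan = new.columns[new.isna().any()].tolist()
print(new)
print("Columns with NaN values:", columns_with_nan)
```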


In [74]:
d2 = d2[df.columns]

In [75]:
d2.shape

(549797, 81)

In [76]:
df.dropna(subset=['CLASSI_FIN'], inplace=True)

In [77]:
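# Reassign instead of using inplace=True: d2 is a column selection of the
# original frame, so an in-place drop can raise SettingWithCopyWarning
d2 = d2.dropna(subset=['CLASSI_FIN'])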

In [78]:
d2.shape

(516653, 81)

In [79]:
y = d2['CLASSI_FIN']
x = d2.drop(['CLASSI_FIN'], axis=1)

In [None]:
models.plot_metrics(x, y)
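`plot_metrics` is the notebook's own helper, defined earlier. As a rough sketch of the kind of evaluation it presumably performs on unseen data, the same scikit-learn metrics imported at the top of the notebook can be computed by hand. The data and the single logistic regression model below are synthetic stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
)
from sklearn.model_selection import train_test_split

# Synthetic data: one part for training, one part playing the role of the new dataset
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=7)
X_tr, X_new, y_tr, y_new = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7
)

model = LogisticRegression(max_iter=200).fit(X_tr, y_tr)
y_pred = model.predict(X_new)

acc = accuracy_score(y_new, y_pred)
f1 = f1_score(y_new, y_pred)
cm = confusion_matrix(y_new, y_pred)  # rows: true class, columns: predicted class

print(f"Accuracy: {acc:.3f}")
print(f"F1 score: {f1:.3f}")
print("Confusion matrix:")
print(cm)
print(classification_report(y_new, y_pred))
```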