# Downloading data from the SINAN database

In [1]:
from pysus.online_data import SINAN
import pandas as pd

SINAN is a database of reported cases of diseases for which Brazilian law requires notification. Unfortunately, the data available for free download covers only the investigated cases, not all reported cases. Nevertheless, it is an interesting dataset.

To find out which diseases these are, we can use PySUS:

In [2]:
SINAN.list_diseases()

['Animais Peçonhentos',
 'Botulismo',
 'Cancer',
 'Chagas',
 'Chikungunya',
 'Colera',
 'Coqueluche',
 'Contact Communicable Disease',
 'Acidentes de Trabalho',
 'Dengue',
 'Difteria',
 'Esquistossomose',
 'Febre Amarela',
 'Febre Maculosa',
 'Febre Tifoide',
 'Hanseniase',
 'Hantavirose',
 'Hepatites Virais',
 'Intoxicação Exógena',
 'Leishmaniose Visceral',
 'Leptospirose',
 'Leishmaniose Tegumentar',
 'Malaria',
 'Meningite',
 'Peste',
 'Poliomielite',
 'Raiva Humana',
 'Sífilis Adquirida',
 'Sífilis Congênita',
 'Sífilis em Gestante',
 'Tétano Acidental',
 'Tétano Neonatal',
 'Tuberculose',
 'Violência Domestica',
 'Zika']

These diseases are available in countrywide tables, so if we want to see the cases of `Chagas` disease in the state of Minas Gerais, first we can check which years are available:

In [3]:
SINAN.get_available_years('chagas')



['CHAGBR00.dbc',
 'CHAGBR01.dbc',
 'CHAGBR02.dbc',
 'CHAGBR03.dbc',
 'CHAGBR04.dbc',
 'CHAGBR05.dbc',
 'CHAGBR06.dbc',
 'CHAGBR07.dbc',
 'CHAGBR08.dbc',
 'CHAGBR09.dbc',
 'CHAGBR10.dbc',
 'CHAGBR11.dbc',
 'CHAGBR12.dbc',
 'CHAGBR13.dbc',
 'CHAGBR14.dbc',
 'CHAGBR15.dbc',
 'CHAGBR16.dbc',
 'CHAGBR17.dbc',
 'CHAGBR18.dbc',
 'CHAGBR19.dbc']
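
The two-digit suffix in each file name encodes the year. As a minimal sketch, the years can be pulled out of such a list with a regular expression. `years_from_files` is a hypothetical helper, not part of PySUS, and it assumes every suffix refers to a year in the 2000s:

```python
import re

def years_from_files(file_names):
    """Extract four-digit years from names such as 'CHAGBR18.dbc'."""
    years = []
    for name in file_names:
        match = re.search(r"BR(\d{2})\.dbc$", name)
        if match:
            # Assumption: two-digit suffixes all refer to 20xx
            years.append(2000 + int(match.group(1)))
    return sorted(years)

files = ["CHAGBR00.dbc", "CHAGBR18.dbc", "CHAGBR19.dbc"]
print(years_from_files(files))
```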

We can see that we have data from 2000 until 2019. We can also check when the data was last updated and how big the files are:

In [4]:
from pysus.online_data import last_update
last_update('SINAN')

Unnamed: 0,folder,date,file_size,file_name
0,/dissemin/publicos/SINAN/DADOS/FINAIS,2021-09-30 10:04:00,26826,ACBIBR06.dbc
1,/dissemin/publicos/SINAN/DADOS/FINAIS,2021-09-30 10:04:00,641813,ACBIBR07.dbc
2,/dissemin/publicos/SINAN/DADOS/FINAIS,2021-09-30 10:04:00,998830,ACBIBR08.dbc
3,/dissemin/publicos/SINAN/DADOS/FINAIS,2021-09-30 10:04:00,1432723,ACBIBR09.dbc
4,/dissemin/publicos/SINAN/DADOS/FINAIS,2021-09-30 10:04:00,1562123,ACBIBR10.dbc
...,...,...,...,...
1052,/dissemin/publicos/SINAN/DADOS/PRELIM,2022-02-21 09:48:00,3702244,TUBEBR21.dbc
1053,/dissemin/publicos/SINAN/DADOS/PRELIM,2022-02-21 09:48:00,64058,TUBEBR22.dbc
1054,/dissemin/publicos/SINAN/DADOS/PRELIM,2021-10-15 11:37:00,24793234,VIOLBR20.dbc
1055,/dissemin/publicos/SINAN/DADOS/PRELIM,2021-10-15 11:37:00,16021135,VIOLBR21.dbc
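
If the listing is a pandas DataFrame, as the output suggests, it can be summarised like any other table. The sketch below rebuilds a toy frame from a few rows copied from the output above and totals the size per four-letter file prefix. It assumes that `file_size` is given in bytes and that the first four characters of the file name identify the disease:

```python
import pandas as pd

# Toy frame with a few rows copied from the listing above
listing = pd.DataFrame({
    "file_name": ["ACBIBR06.dbc", "ACBIBR07.dbc", "TUBEBR21.dbc",
                  "TUBEBR22.dbc", "VIOLBR20.dbc"],
    "file_size": [26826, 641813, 3702244, 64058, 24793234],
})

# Assumption: the first four characters identify the disease
listing["prefix"] = listing["file_name"].str[:4]

# Total size per prefix, in the same unit as file_size
total_size = listing.groupby("prefix")["file_size"].sum()
print(total_size)
```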


Now we can download it.

In [5]:
df = SINAN.download(year=2018,disease='Chagas')
df



Unnamed: 0,TP_NOT,ID_AGRAVO,DT_NOTIFIC,SEM_NOT,NU_ANO,SG_UF_NOT,ID_MUNICIP,ID_REGIONA,ID_UNIDADE,DT_SIN_PRI,...,DT_OBITO,CON_PROVAV,CON_OUTRA,CON_LOCAL,TPAUTOCTO,COUFINF,COPAISINF,COMUNINF,DOENCA_TRA,DT_ENCERRA
0,b'2',b'B571',b'20181219',b'201851',b'2018',b'15',b'150490',b'1490 ',b'2316005',b'20181205',...,b' ',b'2',b' ',b'4',b'1',b'15',b'1 ',b'150490',b'2',b'20190220'
1,b'2',b'B571',b'20181130',b'201848',b'2018',b'15',b'150450',b'1491 ',b'6559476',b'20181105',...,b' ',b'5',b' ',b' ',b' ',b' ',b'0 ',b' ',b' ',b'20190131'
2,b'2',b'B571',b'20181003',b'201840',b'2018',b'15',b'150490',b'1490 ',b'2316005',b'20180925',...,b' ',b'5',b' ',b'2',b'1',b'15',b'1 ',b'150490',b'2',b'20181206'
3,b'2',b'B571',b'20180109',b'201802',b'2018',b'15',b'150375',b'1492 ',b'2331691',b'20171219',...,b' ',b'5',b' ',b'2',b'1',b'15',b'1 ',b'150375',b'2',b'20180315'
4,b'2',b'B571',b'20180811',b'201832',b'2018',b'15',b'150450',b'1491 ',b'6559476',b'20180811',...,b' ',b'5',b' ',b'2',b'1',b'15',b'1 ',b'150450',b'2',b'20180814'
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4680,b'2',b'B571',b'20181018',b'201842',b'2018',b'16',b'160030',b' ',b'3043088',b'20181010',...,b' ',b'5',b' ',b'2',b'1',b'16',b'1 ',b'160030',b'2',b'20181023'
4681,b'2',b'B571',b'20180221',b'201808',b'2018',b'12',b'120020',b'1941 ',b'5336171',b'20180207',...,b' ',b' ',b' ',b' ',b' ',b' ',b'0 ',b' ',b' ',b'20180402'
4682,b'2',b'B571',b'20181016',b'201842',b'2018',b'15',b'150210',b'1496 ',b'2313367',b'20180417',...,b' ',b'9',b' ',b'9',b'3',b' ',b'0 ',b' ',b'2',b'20190801'
4683,b'2',b'B571',b'20180911',b'201845',b'2018',b'15',b'150080',b'1484 ',b'6996515',b'20180310',...,b' ',b'5',b' ',b' ',b'1',b'15',b'1 ',b'150150',b'2',b'20181120'


In this case the returned table is small, but if a table is larger than your computer's memory, you can download the data straight to disk and ask PySUS to return only the name of the resulting file:
```python
fname = SINAN.download(year=2015,disease='Dengue', return_fname=True)
```
You can then decide how to handle the data.

Let's look at the variables available in the downloaded dataframe:

In [None]:
df.info()

## Decoding the age in SINAN tables
In SINAN the age comes encoded. PySUS can decode the age column `NU_IDADE_N` into any of these units: years, months, days, or hours.

In [None]:
from pysus.preprocessing.decoders import decodifica_idade_SINAN
decodifica_idade_SINAN?
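
To make the encoding concrete, here is a plain-Python sketch of the decoding. It assumes the usual SINAN convention, in which the first digit of the code gives the unit (1 = hours, 2 = days, 3 = months, 4 = years) and the remaining three digits give the value. `decode_age_years` is a hypothetical helper for illustration only, not the PySUS implementation, and its results may differ slightly from those of `decodifica_idade_SINAN` (for example in the number of days assumed per year):

```python
# Divisors that convert each unit to years.
# Assumption: unit digit 1 = hours, 2 = days, 3 = months, 4 = years.
UNIT_DIVISOR = {1: 365.25 * 24, 2: 365.25, 3: 12, 4: 1}

def decode_age_years(code):
    """Convert an encoded age such as 4025 (25 years) to years."""
    try:
        unit, value = divmod(int(code), 1000)
    except ValueError:
        # Blank or malformed codes cannot be decoded
        return float("nan")
    if unit not in UNIT_DIVISOR:
        return float("nan")
    return value / UNIT_DIVISOR[unit]

print(decode_age_years(4025))  # 25 years
print(decode_age_years(3006))  # 6 months, expressed in years
```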

Let's convert the age to years and save it in a different column:

In [None]:
df['idade_anos'] = decodifica_idade_SINAN(df.NU_IDADE_N, 'Y')
df[['NU_IDADE_N', 'idade_anos']]

We can easily convert dates and numerical fields in the dataframe:

In [None]:
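for cname in df.columns:
    if cname.startswith('DT_'):
        # Columns prefixed with DT_ hold dates
        df[cname] = pd.to_datetime(df[cname])
    elif cname.startswith('ID_'):
        # Columns prefixed with ID_ hold codes; keep them unchanged if not numeric
        try:
            df[cname] = pd.to_numeric(df[cname])
        except ValueError:
            continue
df.info()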
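
The table printed earlier shows values as byte strings (for example `b'20181219'`) and blank fields as `b' '`. If the columns really hold bytes in your PySUS version, it may help to decode them first and to let pandas turn blanks into missing values. The following is a minimal sketch on a toy frame with values copied from the output above. `decode_bytes` is a hypothetical helper, and the `latin-1` encoding and the `%Y%m%d` date format are assumptions:

```python
import pandas as pd

# Toy frame mimicking the byte-string values shown in the output above
raw = pd.DataFrame({
    "DT_NOTIFIC": [b"20181219", b"20181130"],
    "DT_OBITO": [b" ", b" "],
    "ID_MUNICIP": [b"150490", b"150450"],
})

def decode_bytes(col):
    """Decode bytes to stripped strings; leave other values untouched."""
    return col.map(
        lambda v: v.decode("latin-1").strip() if isinstance(v, bytes) else v
    )

clean = raw.apply(decode_bytes)
for name in clean.columns:
    if name.startswith("DT_"):
        # Blank or unparseable dates become NaT instead of raising
        clean[name] = pd.to_datetime(clean[name], format="%Y%m%d", errors="coerce")
    elif name.startswith("ID_"):
        # Unparseable codes become NaN instead of raising
        clean[name] = pd.to_numeric(clean[name], errors="coerce")

print(clean.dtypes)
```

Using `errors="coerce"` trades silent missing values for robustness, so it is worth counting the resulting `NaT` and `NaN` entries before relying on the converted columns.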

## Saving the modified data
We can save our dataframe in any format we wish, so that we do not have to redo this processing next time. To keep only the data from the state of Minas Gerais, we filter the table using the UF code `31`.

In [None]:
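# Cast the notifying state (UF) code to integer so it can be compared with 31
df['SG_UF_NOT'] = df.SG_UF_NOT.astype(int)
# 31 is the UF code of Minas Gerais (MG); the output is a zip-compressed CSV
df[df.SG_UF_NOT==31].to_csv('chagas_MG_2018_mod.csv.zip',sep=';',compression='zip')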
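
The UF code used in the filter is the two-digit IBGE code of the state. To avoid hard-coding numbers, a small lookup table can translate state abbreviations into codes. This is a sketch with hypothetical names (`UF_CODES`, `uf_code`); it is not part of PySUS:

```python
# IBGE two-digit codes for Brazil's federative units (UF)
UF_CODES = {
    "RO": 11, "AC": 12, "AM": 13, "RR": 14, "PA": 15, "AP": 16, "TO": 17,
    "MA": 21, "PI": 22, "CE": 23, "RN": 24, "PB": 25, "PE": 26, "AL": 27,
    "SE": 28, "BA": 29, "MG": 31, "ES": 32, "RJ": 33, "SP": 35,
    "PR": 41, "SC": 42, "RS": 43, "MS": 50, "MT": 51, "GO": 52, "DF": 53,
}

# Reverse lookup: code -> abbreviation
UF_ABBREVIATIONS = {code: abbr for abbr, code in UF_CODES.items()}

def uf_code(abbreviation):
    """Return the IBGE code for a state abbreviation such as 'MG'."""
    return UF_CODES[abbreviation.upper()]

print(uf_code("MG"))
print(UF_ABBREVIATIONS[15])
```

With such a table, the filter can be written as `df[df.SG_UF_NOT == uf_code('MG')]`, which makes it harder for the state in the filter and the state in the file name to drift apart.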