# Downloading data from the SINAN database

In [1]:
from pysus.online_data import SINAN
import pandas as pd



SINAN is a database of reported cases of the diseases that Brazilian law requires to be notified. Unfortunately, the data available for free download covers only the investigated cases, not all reported cases. Nevertheless, it is an interesting dataset.

To find out which diseases these are, we can use PySUS:

In [2]:
SINAN.list_diseases()

['Animais Peçonhentos',
 'Botulismo',
 'Chagas',
 'Chikungunya',
 'Colera',
 'Coqueluche',
 'Dengue',
 'Difteria',
 'Esquistossomose',
 'Febre Amarela',
 'Febre Maculosa',
 'Febre Tifoide',
 'Hanseniase',
 'Hantavirose',
 'Hepatites Virais',
 'Intoxicação Exógena',
 'Leishmaniose Visceral',
 'Leptospirose',
 'Leishmaniose Tegumentar',
 'Malaria',
 'Meningite',
 'Peste',
 'Poliomielite',
 'Raiva Humana',
 'Tétano Acidental',
 'Tétano Neonatal',
 'Tuberculose',
 'Violência Domestica']

These diseases are available by state, so if we want to see the cases of `Chagas` disease in the state of Rio de Janeiro (`RJ`), we can first check which years are available:

In [3]:
SINAN.get_available_years('RJ', 'chagas')

['CHAGRJ07.dbc',
 'CHAGRJ08.dbc',
 'CHAGRJ09.dbc',
 'CHAGRJ10.dbc',
 'CHAGRJ11.dbc',
 'CHAGRJ12.dbc',
 'CHAGRJ13.dbc',
 'CHAGRJ14.dbc',
 'CHAGRJ15.dbc',
 'CHAGRJ16.dbc',
 'CHAGRJ17.dbc',
 'CHAGRJ18.dbc',
 'CHAGRJ19.dbc']
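
The two digits before the `.dbc` extension in each file name are the year. As a minimal sketch, the listing can be turned into four-digit years with a small helper. The name `years_from_filenames` is hypothetical and not part of PySUS, and the helper assumes every file belongs to the 2000s, which holds for the listing above but not necessarily for other diseases.

```python
import re


def years_from_filenames(filenames):
    """Extract four-digit years from names such as 'CHAGRJ07.dbc'."""
    years = []
    for name in filenames:
        # The two digits right before the '.dbc' extension are the year.
        match = re.search(r'(\d{2})\.dbc$', name, flags=re.IGNORECASE)
        if match is None:
            continue  # skip names that do not follow the pattern
        # Assumption: all files are from the 2000s.
        years.append(2000 + int(match.group(1)))
    return sorted(years)


files = ['CHAGRJ07.dbc', 'CHAGRJ08.dbc', 'CHAGRJ19.dbc']
years = years_from_filenames(files)
print(years[0], years[-1])
```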

We can see that we have data from 2007 through 2019. Now we can download it:

In [14]:
df = SINAN.download(year=2018,disease='Chagas')
df

Unnamed: 0,TP_NOT,ID_AGRAVO,DT_NOTIFIC,SEM_NOT,NU_ANO,SG_UF_NOT,ID_MUNICIP,ID_REGIONA,ID_UNIDADE,DT_SIN_PRI,...,DT_OBITO,CON_PROVAV,CON_OUTRA,CON_LOCAL,TPAUTOCTO,COUFINF,COPAISINF,COMUNINF,DOENCA_TRA,DT_ENCERRA
0,2,B571,2018-08-20,201834,2018,35,351570,1333,2773694,2018-07-30,...,,,,,,,0,,,2019-01-15
1,2,B571,2018-04-20,201816,2018,35,353760,1349,7036892,2018-04-20,...,,,,,,,0,,,2018-06-20
2,2,B571,2018-06-14,201824,2018,35,350635,1349,7786913,2018-05-31,...,,,,,,,0,,,2018-07-19
3,2,B571,2018-12-06,201849,2018,35,354100,1349,3551911,2018-06-01,...,,,,,,,0,,,2019-01-23
4,2,B571,2018-12-19,201851,2018,35,351720,1340,2790475,2018-11-28,...,,,,,,,0,,,2019-01-10
5,2,B571,2018-04-20,201816,2018,35,354160,1340,7580371,2018-04-19,...,,,,,,,0,,,2018-05-22
6,2,B571,2018-10-08,201841,2018,35,350280,1336,2043874,2018-08-22,...,,,,,,,0,,,2018-11-01
7,2,B571,2018-09-10,201837,2018,35,350960,1342,2087219,2018-09-01,...,,,,,,,0,,,2018-12-20
8,2,B571,2018-09-10,201837,2018,35,355030,1331,2077485,2018-09-05,...,,,,,,,0,,,2018-11-05
9,2,B571,2018-09-10,201837,2018,35,350560,1348,2049813,2018-06-27,...,,,,,,,0,,,2018-11-14


Let's look at the variables available in the downloaded dataframe:

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 99 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   TP_NOT      19 non-null     object 
 1   ID_AGRAVO   19 non-null     object 
 2   DT_NOTIFIC  19 non-null     object 
 3   SEM_NOT     19 non-null     object 
 4   NU_ANO      19 non-null     object 
 5   SG_UF_NOT   19 non-null     object 
 6   ID_MUNICIP  19 non-null     object 
 7   ID_REGIONA  19 non-null     object 
 8   ID_UNIDADE  19 non-null     object 
 9   DT_SIN_PRI  19 non-null     object 
 10  SEM_PRI     19 non-null     object 
 11  DT_NASC     19 non-null     object 
 12  NU_IDADE_N  19 non-null     int64  
 13  CS_SEXO     19 non-null     object 
 14  CS_GESTANT  19 non-null     object 
 15  CS_RACA     19 non-null     object 
 16  CS_ESCOL_N  19 non-null     object 
 17  SG_UF       19 non-null     object 
 18  ID_MN_RESI  19 non-null     object 
 19  ID_RG_RESI  19 non-null     obj

## Decoding the age in SINAN tables
In SINAN the age comes encoded. PySUS can decode the age column `NU_IDADE_N` into any of these units: years, months, days, or hours.

In [16]:
from pysus.preprocessing.decoders import decodifica_idade_SINAN
decodifica_idade_SINAN?

Signature:       decodifica_idade_SINAN(*args, **kwargs)
Type:            vectorize
String form:     <numpy.vectorize object at 0x7fb554cc31f0>
File:            /usr/local/lib/python3.8/dist-packages/numpy/__init__.py
Docstring:
In SINAN tables, age is often represented as an integer that needs to be parsed
to return the age in a standard chronological unit.
:param unidade: unit of the age: 'Y': years, 'M': months, 'D': days, 'H': hours
:param idade: integer or sequence of encoded integers.
:return:
Class docstring:
vectorize(pyfunc, otypes=None, doc=None, excluded=None, cache=False,
          signature=None)

Generalized function class.

Define a vectorized function which takes a nested sequence of objects or
numpy arrays as inputs and returns a single numpy array or a tuple of numpy


Let's convert the age to years and save it in a new column.

In [17]:
df['idade_anos'] = decodifica_idade_SINAN(df.NU_IDADE_N, 'Y')
df[['NU_IDADE_N', 'idade_anos']]

Unnamed: 0,NU_IDADE_N,idade_anos
0,4068,68.0
1,4049,49.0
2,4049,49.0
3,4045,45.0
4,4060,60.0
5,4035,35.0
6,4025,25.0
7,4071,71.0
8,4001,1.0
9,4055,55.0
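
The table suggests how the encoding works: the leading digit is the unit and the remaining three digits are the value, so `4068` means 68 years. The sketch below illustrates that idea in plain Python. It is not the PySUS implementation. Only the prefix `4` for years is visible in the output above; the prefixes `1` (hours), `2` (days) and `3` (months) are assumptions that should be checked against the SINAN data dictionary, and `decode_age_years` is a hypothetical name.

```python
import math

# Divisor that converts a value in each unit to years.
# Only prefix 4 (years) is confirmed by the output above;
# prefixes 1, 2 and 3 are assumed.
DIVISOR_TO_YEARS = {
    1: 365 * 24,  # hours (assumed)
    2: 365,       # days (assumed)
    3: 12,        # months (assumed)
    4: 1,         # years
}


def decode_age_years(code):
    """Decode a SINAN-style age code into years (illustrative only)."""
    unit, value = divmod(int(code), 1000)
    divisor = DIVISOR_TO_YEARS.get(unit)
    if divisor is None:
        return math.nan  # unknown unit prefix
    return value / divisor


for code in (4068, 4001, 3006):
    print(code, decode_age_years(code))
```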


We can easily convert dates and numerical fields in the dataframe:

In [18]:
for cname in df.columns:
    if cname.startswith('DT_'):
        df[cname] = pd.to_datetime(df[cname])
    elif cname.startswith('ID_'):
        try:
            df[cname] = pd.to_numeric(df[cname])
        except ValueError:
            continue
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 100 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   TP_NOT      19 non-null     object        
 1   ID_AGRAVO   19 non-null     object        
 2   DT_NOTIFIC  19 non-null     datetime64[ns]
 3   SEM_NOT     19 non-null     object        
 4   NU_ANO      19 non-null     object        
 5   SG_UF_NOT   19 non-null     object        
 6   ID_MUNICIP  19 non-null     int64         
 7   ID_REGIONA  19 non-null     int64         
 8   ID_UNIDADE  19 non-null     int64         
 9   DT_SIN_PRI  19 non-null     datetime64[ns]
 10  SEM_PRI     19 non-null     object        
 11  DT_NASC     19 non-null     datetime64[ns]
 12  NU_IDADE_N  19 non-null     int64         
 13  CS_SEXO     19 non-null     object        
 14  CS_GESTANT  19 non-null     object        
 15  CS_RACA     19 non-null     object        
 16  CS_ESCOL_N  19 non-null    
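
The loop above raises an error if a `DT_` column contains a value that cannot be parsed as a date. A more tolerant variant, sketched here on a made-up toy frame rather than on real SINAN data, uses `errors='coerce'` so that bad values become `NaT` or `NaN`, and keeps an `ID_` column unchanged when it is not fully numeric (as with the ICD code in `ID_AGRAVO`).

```python
import pandas as pd

# Toy frame standing in for a SINAN download (values are made up).
toy = pd.DataFrame({
    'DT_NOTIFIC': ['2018-08-20', '2018-04-20', ''],
    'ID_MUNICIP': ['351570', '353760', '350635'],
    'ID_AGRAVO': ['B571', 'B571', 'B571'],
})

for cname in toy.columns:
    if cname.startswith('DT_'):
        # Empty or unparseable dates become NaT instead of raising.
        toy[cname] = pd.to_datetime(toy[cname], errors='coerce')
    elif cname.startswith('ID_'):
        converted = pd.to_numeric(toy[cname], errors='coerce')
        # Replace the column only if every value converted cleanly.
        if converted.notna().all():
            toy[cname] = converted

print(toy.dtypes)
```

Note that this check also leaves a numeric column untouched when it has missing values, which may or may not be what you want.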

## Saving the modified data
We can save our dataframe in any format we wish, to avoid having to redo this processing next time.

In [None]:
df.to_csv('chagas_SP_2018_mod.csv',sep=';',compression='zip')
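
To load the file again, the same separator and compression have to be passed to `pd.read_csv`. The round trip below is a minimal sketch on a made-up toy frame written to a temporary directory. CSV does not store column types, so date columns have to be requested explicitly with `parse_dates`.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the processed SINAN data (values are made up).
toy = pd.DataFrame({
    'DT_NOTIFIC': pd.to_datetime(['2018-08-20', '2018-04-20']),
    'ID_MUNICIP': [351570, 353760],
})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'toy_mod.csv')
    # Same options as the cell above: ';' separator, zip compression.
    toy.to_csv(path, sep=';', compression='zip')
    restored = pd.read_csv(
        path,
        sep=';',
        compression='zip',
        index_col=0,                # the first column holds the saved index
        parse_dates=['DT_NOTIFIC'],  # restore the datetime dtype
    )

print(restored.dtypes)
```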