# Usage examples for the SINASC database
- This notebook shows how to read and load the data
- It then runs a simple analysis of the database

## Reading the data

In [None]:
# The first step is to download the data from GitHub, either by
# running the git clone command in your project folder
# or by manually downloading the files from the website

In [None]:
# The data is in parquet format and compressed with gzip.
# pandas is a Python package well suited to handling data like
# this, and it needs a parquet 'engine' such as fastparquet to
# read a parquet file

# To use pandas, we first need to install it
%pip install pandas

# We also need to install fastparquet
%pip install fastparquet

# Once both are installed, there is no need to run this cell
# again
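
To avoid re-running the install cell, one option is to check first whether the packages can already be imported. This is a minimal sketch using only the standard library; the helper name `missing_packages` is made up for this example and is not part of the repository.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]


# Hypothetical helper: lists which of the two packages still need installing
missing = missing_packages(["pandas", "fastparquet"])
if missing:
    print("Still need to install:", ", ".join(missing))
else:
    print("pandas and fastparquet are already installed")
```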

In [2]:
# With both installed, let's read the file

# Bring pandas into our environment with import.
# pandas calls fastparquet internally, so there is no need to
# import it manually
import pandas as pd

# read_parquet() from pandas handles the reading
# it also handles the compression without additional parameters
df = pd.read_parquet('db/DN2023.parquet.gzip')

# If the line above fails, you may need to change some
# settings or the file path. Check the cell below if that is the case
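
With 49 columns and millions of rows, it can help to load only the columns you need. The sketch below assumes the same relative path as above and passes `engine` and `columns` to `read_parquet()`; the three column names are picked only for illustration, and the read is skipped if the file is not found.

```python
from pathlib import Path

import pandas as pd

# Same relative path as in the cell above
parquet_path = Path("db") / "DN2023.parquet.gzip"

# Columns chosen only for illustration; any names shown by df.head() work
wanted_columns = ["IDADEMAE", "DTNASCMAE", "CODMUNNASC"]

if parquet_path.exists():
    # engine= makes the dependency on fastparquet explicit,
    # columns= loads only the listed columns to save memory
    df_small = pd.read_parquet(
        parquet_path, engine="fastparquet", columns=wanted_columns
    )
    print(df_small.shape)
else:
    print(f"File not found: {parquet_path.resolve()}")
```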

In [3]:
# This cell shows the current working directory, the folder
# that relative paths are resolved against.
# That folder joined with the relative path above gives the full
# path to the file. Ex. r:/GitHub/DB_SINASC/db/file.parquet.gzip

import os

os.getcwd()

'r:\\GitHub\\DB_SINASC'
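
The combination described above can also be done in code, which makes it easy to confirm that the file is where pandas will look for it. This is a small sketch with `pathlib`, assuming the relative path used in `read_parquet()` earlier.

```python
from pathlib import Path

# Folder the notebook is running from (same value as os.getcwd())
current_dir = Path.cwd()

# Relative path used in read_parquet() earlier in this notebook
relative_path = Path("db") / "DN2023.parquet.gzip"

# Joining both gives the full path pandas will try to open
full_path = current_dir / relative_path
print(full_path)

if full_path.is_file():
    print("File found")
else:
    print("File NOT found, check the folder or the file name")
```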

## Analysing

In [4]:
# With the data loaded correctly, we can begin to use pandas

# Let's see the first 5 rows
df.head()

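| | IDADEMAE | DTNASCMAE | RACACORMAE | ESTCIVMAE | QTDFILVIVO | QTDFILMORT | QTDGESTANT | QTDPARTNOR | QTDPARTCES | PARIDADE | ... | RACACOR | IDANOMAL | CODANOMAL | LOCNASC | CODESTAB | CODMUNNASC | IDADEPAI | TPMETESTIM | STDNEPIDEM | STDNNOVA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32.0 | 1990-10-10 | 4.0 | 2.0 | 3.0 | 2.0 | 4.0 | 3.0 | 0.0 | 1 | ... | 4.0 | 0 | | 1.0 | 2679477.0 | 110001 | 32.0 | | 0.0 | 1 |
| 1 | 18.0 | 2004-08-19 | 5.0 | 1.0 | | | | | | 0 | ... | 5.0 | 0 | | 1.0 | 2679477.0 | 110001 | | | 0.0 | 1 |
| 2 | 15.0 | 2007-10-01 | 1.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | ... | 1.0 | 0 | | 1.0 | 2679477.0 | 110001 | | | 0.0 | 1 |
| 3 | 32.0 | 1990-05-20 | 4.0 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1 | ... | 4.0 | 0 | | 1.0 | 2516500.0 | 110001 | 35.0 | | 0.0 | 1 |
| 4 | 27.0 | 1995-07-15 | 4.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | ... | 4.0 | 0 | | 1.0 | 2516500.0 | 110001 | | | 0.0 | 1 |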


This gives us a first impression of the dataframe:
- Example values from every displayed column
- A first hint of missing values (the blank entries)
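
The `...` in the output above means pandas hid the middle columns to fit the display. One way to see all of them is to lift the column limit temporarily. Below is a small sketch on a made-up dataframe; the `col_*` names are placeholders, not SINASC columns.

```python
import pandas as pd

# Toy dataframe with 30 columns, standing in for the 49 columns of the real one
toy = pd.DataFrame({f"col_{i}": [i, i + 1] for i in range(30)})

# The limit is lifted only inside the with-block and restored afterwards
with pd.option_context("display.max_columns", None):
    all_columns_text = repr(toy.head())

print(all_columns_text)
```

In a notebook, calling `display(df.head())` inside the same `with` block should show the full table in the usual rich format.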

In [5]:
# Now let's talk about dimensions: what is the size of
# this dataframe?

df.shape

(2537576, 49)

We have:
- 2,537,576 rows
- 49 columns
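
Unpacking the tuple returned by `df.shape` makes the two numbers easier to reuse and print. This tiny sketch hard-codes the shape shown above so that it runs without the file; with the data loaded, `df.shape` can be used directly.

```python
# Shape reported by df.shape above: (rows, columns)
shape = (2537576, 49)

n_rows, n_cols = shape

# The "," format option adds thousands separators
summary = f"{n_rows:,} rows x {n_cols} columns"
print(summary)

# Total number of cells in the dataframe
total_cells = n_rows * n_cols
print(f"{total_cells:,} cells in total")
```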

In [6]:
# Let's take a closer look at each column

# Dtype is the type of a column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2537576 entries, 0 to 2537575
Data columns (total 49 columns):
 #   Column      Dtype         
---  ------      -----         
 0   IDADEMAE    float32       
 1   DTNASCMAE   datetime64[ns]
 2   RACACORMAE  float32       
 3   ESTCIVMAE   float32       
 4   QTDFILVIVO  float32       
 5   QTDFILMORT  float32       
 6   QTDGESTANT  float32       
 7   QTDPARTNOR  float32       
 8   QTDPARTCES  float32       
 9   PARIDADE    int32         
 10  ESCMAE      float32       
 11  ESCMAE2010  float32       
 12  SERIESCMAE  float32       
 13  ESCMAEAGR1  float32       
 14  CODMUNNATU  float32       
 15  CODUFNATU   float32       
 16  NATURALMAE  float32       
 17  CODMUNRES   int32         
 18  CODOCUPMAE  float32       
 19  DTULTMENST  datetime64[ns]
 20  SEMAGESTAC  float32       
 21  GESTACAO    float32       
 22  GRAVIDEZ    float32       
 23  CONSPRENAT  float32       
 24  CONSULTAS   float32       
 25  MESPRENAT   float3
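
Many columns that hold whole numbers (ages, counts, category codes) show up as `float32`. A likely reason is that NumPy integer types cannot represent missing values, so the column is stored as float instead. If integer display is preferred, the nullable integer dtypes in pandas are one option. This is a sketch on made-up values, not on the real file.

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing value, stored the same way as IDADEMAE
ages = pd.Series([32.0, np.nan, 15.0], dtype="float32", name="IDADEMAE")

# "Int16" (capital I) is a nullable integer type: missing values become <NA>
ages_int = ages.astype("Int16")

print(ages_int)
```

The conversion only works when every non-missing value is a whole number, so it is worth checking a column before converting it.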

In [None]:
# Here we can see the number of missing values
# in each column

df.isna().sum()

IDADEMAE           34
DTNASCMAE       14422
RACACORMAE      52509
ESTCIVMAE       17065
QTDFILVIVO      37201
QTDFILMORT      56375
QTDGESTANT      40055
QTDPARTNOR      54082
QTDPARTCES      58606
PARIDADE            0
ESCMAE          17108
ESCMAE2010      27233
SERIESCMAE     872798
ESCMAEAGR1      20077
CODMUNNATU      40008
CODUFNATU       40049
NATURALMAE      40007
CODMUNRES           0
CODOCUPMAE     180826
DTULTMENST    1316285
SEMAGESTAC      20069
GESTACAO        20069
GRAVIDEZ         1383
CONSPRENAT      37292
CONSULTAS       10529
MESPRENAT       87947
KOTELCHUCK          0
PARTO             994
TPAPRESENT      20717
STTRABPART      47903
STCESPARTO      67918
TPROBSON        38248
TPNASCASSI      14257
DTNASC              0
HORANASC         1368
APGAR1          28032
APGAR5          27495
PESO              268
SEXO                0
RACACOR         45982
IDANOMAL            0
CODANOMAL     2511829
LOCNASC            79
CODESTAB        24775
CODMUNNASC          0
IDADEPAI  
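
Raw counts are easier to judge as a share of all rows. For example, CODANOMAL is missing in 2,511,829 of 2,537,576 rows, which is about 99%. The sketch below computes that percentage for every column and sorts it; the data is made up, with column names borrowed from the dataframe only for readability.

```python
import numpy as np
import pandas as pd

# Made-up data, not taken from the real file
toy = pd.DataFrame({
    "IDADEMAE": [32.0, 18.0, 15.0, 27.0],
    "IDADEPAI": [32.0, np.nan, np.nan, np.nan],
    "CODANOMAL": [None, None, None, None],
})

# isna() gives True/False per cell; the mean of booleans is the share of True
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)

print(missing_pct.round(1))
```

On the real dataframe the same line would be `df.isna().mean().mul(100).sort_values(ascending=False)`.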

In [None]:
# And here we can see the smallest and largest
# values (min, max),
# the mean and standard deviation (std),
# the 1st, 2nd and 3rd quartiles (25%, 50% and 75%)
# and the total number of non-missing values (count)

df.describe()

| | IDADEMAE | DTNASCMAE | RACACORMAE | ESTCIVMAE | QTDFILVIVO | QTDFILMORT | QTDGESTANT | QTDPARTNOR | QTDPARTCES | PARIDADE | ... | SEXO | RACACOR | IDANOMAL | LOCNASC | CODESTAB | CODMUNNASC | IDADEPAI | TPMETESTIM | STDNEPIDEM | STDNNOVA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2537542.0 | 2523154 | 2485067.0 | 2520511.0 | 2500375.0 | 2481201.0 | 2497521.0 | 2483494.0 | 2478970.0 | 2537576.0 | ... | 2537576.0 | 2491594.0 | 2537576.0 | 2537497.0 | 2512801.0 | 2537576.0 | 861895.0 | 1120692.0 | 2537573.0 | 2537576.0 |
| mean | 27.65704 | 1995-04-28 01:43:09.704473472 | 2.83784 | 1.986388 | 1.02436 | 0.272304 | 1.281611 | 0.6555804 | 0.4028645 | 0.6351045 | ... | 1.487671 | 2.83804 | 0.01015536 | 1.030128 | 2994136.75 | 320248.0 | 32.160557 | 1.563689 | 0.6774816 | 0.9999882 |
| min | 8.0 | 1957-07-12 00:00:00 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 94.0 | 110001.0 | 9.0 | 1.0 | 0.0 | 0.0 |
| 25% | 22.0 | 1990-04-22 00:00:00 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 1.0 | 0.0 | 1.0 | 2119528.0 | 260120.0 | 26.0 | 1.0 | 0.0 | 1.0 |
| 50% | 27.0 | 1995-09-11 00:00:00 | 4.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 4.0 | 0.0 | 1.0 | 2461005.0 | 320530.0 | 32.0 | 2.0 | 1.0 | 1.0 |
| 75% | 33.0 | 2000-07-27 00:00:00 | 4.0 | 2.0 | 2.0 | 0.0 | 2.0 | 1.0 | 1.0 | 1.0 | ... | 2.0 | 4.0 | 0.0 | 1.0 | 2798220.0 | 355030.0 | 37.0 | 2.0 | 1.0 | 1.0 |
| max | 65.0 | 2023-10-30 00:00:00 | 5.0 | 5.0 | 30.0 | 28.0 | 92.0 | 97.0 | 91.0 | 1.0 | ... | 2.0 | 5.0 | 1.0 | 5.0 | 9999999.0 | 530010.0 | 98.0 | 2.0 | 1.0 | 1.0 |
| std | 6.723352 | | 1.419769 | 1.404968 | 1.241332 | 0.6119844 | 1.460373 | 1.212463 | 0.7427981 | 0.4814009 | ... | 0.5001649 | 1.419894 | 0.1002608 | 0.258671 | 1999851.375 | 100699.7 | 7.730621 | 0.4959273 | 0.4674402 | 0.003438338 |
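
For columns that appear to store category codes (RACACORMAE, ESTCIVMAE and similar), the mean and standard deviation above are hard to interpret. Counting how often each code occurs is usually more informative. This is a sketch on made-up values, not on the real file.

```python
import numpy as np
import pandas as pd

# Made-up codes and ages, not taken from the real file
toy = pd.DataFrame({
    "ESTCIVMAE": [2.0, 1.0, 5.0, 2.0, np.nan, 1.0, 2.0],
    "IDADEMAE": [32.0, 18.0, 15.0, 32.0, 27.0, 22.0, 40.0],
})

# Frequency of each code, keeping missing values as their own group
code_counts = toy["ESTCIVMAE"].value_counts(dropna=False)
print(code_counts)

# describe() stays useful for truly numeric columns such as ages
age_summary = toy["IDADEMAE"].describe()
print(age_summary[["min", "50%", "max"]])
```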


A description of each column is available [here](../docs/features.md).