In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set()

In [None]:
df = pd.read_csv('../Data/CleanData/Test1.csv',dtype='object',index_col=0)
df.columns

## It may be possible to drop some columns that carry no information after slicing the nofet df (occupation, education level of the deceased, etc.)

In [None]:
# First drop columns that contain only NaNs. I expect this for variables related to pregnancy at the time of death.
print("Empty columns:")
print(df.loc[:,df.isna().all()].columns.to_list())

dfn = df.loc[:,~df.isna().all()]
print()
# Check other columns suspected of not containing any relevant information.
# GRU_ED2 is included because it should always be 1 or NaN.
for col in ['EST_CIVIL','GRU_ED2','NIVEL_EDU_FALLECIDO','ULTCURFAL','MAN_MUER']:
    print(col, dfn[col].unique())

### Strictly, only EST_CIVIL and GRU_ED2 are uninformative. Interestingly enough, there is someone in 6th grade (probably a typo). I'll drop the other 2 columns (NIVEL_EDU_FALLECIDO, ULTCURFAL) as well, keeping only MAN_MUER
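
The manual inspection above can also be automated. The following is a minimal sketch, assuming a toy frame with made-up values in place of `dfn`; it flags columns that have at most one distinct non-missing value. Columns dropped on judgment (for instance because their few values look like entry errors) would still need a manual decision.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for dfn; the values are made up for illustration.
toy = pd.DataFrame({
    'GRU_ED2': ['1', '1', np.nan, '1'],     # constant apart from NaN
    'MAN_MUER': ['1', '2', np.nan, '3'],    # several distinct values
    'EST_CIVIL': [np.nan, '9', '9', '9'],   # constant apart from NaN
}, dtype='object')

# nunique() ignores NaN by default, so a column with at most one
# distinct non-missing value carries no usable information.
n_distinct = toy.nunique()
uninformative = n_distinct[n_distinct <= 1].index.to_list()
print(uninformative)
```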

In [None]:
dfn = dfn.drop(columns=['EST_CIVIL','GRU_ED2','NIVEL_EDU_FALLECIDO','ULTCURFAL'])

## Start looking at some seemingly easy to understand variables:

PESO, AREA_RES, EDAD_MADRE, SEXO, TALLA, T_GES

In [None]:
cols = ['PESO','AREA_RES','EDAD_MADRE','SEXO','TALLA','T_GES']

fig = plt.figure(figsize=(20,10))
for i,col in enumerate(cols):
    axi = fig.add_subplot(2,3,i+1)
    axi.set_title(col)
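    # Count how many records fall in each category of the current column
    counts = dfn[col].value_counts()
    # Recent seaborn versions require x and y to be passed as keywords
    sns.barplot(x=counts.index, y=counts.values, ax=axi)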
    
plt.savefig('../Plots/InitialDists.png')
plt.show()

PESO has quite an interesting distribution. It looks normal if we exclude 1 and 9, but there is a large peak at 1. It may be related to underweight births. Note also that 1 represents the 0-1000 g range, while every other number represents a 500 g range. Even after accounting for that, the peak is still prominent.
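
One way to make the bin-width caveat concrete is to divide each count by the width of its bin. The sketch below uses hypothetical counts (the real ones would come from `dfn['PESO'].value_counts()`) and assumes only what is stated above: code 1 spans 1000 g and the other codes span 500 g. Code 9 is left out.

```python
import pandas as pd

# Hypothetical counts per PESO code, for illustration only.
counts = pd.Series({'1': 4000, '2': 600, '3': 900, '4': 1500})

# Bin widths in grams: code 1 covers 1000 g, every other code 500 g.
widths = pd.Series(500, index=counts.index)
widths['1'] = 1000

# Records per 100 g, which is comparable across bins of different width.
density = counts * 100 / widths
print(density)
```

With these made-up numbers the first bin still dominates after normalization, which is the situation described above.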

EDAD_MADRE is skewed to the right, with the peak at 3 (20-24 years old). The number of pregnancies in the 10-14 age group (bin 1) is worrying. The 15-19 group is also quite large.

Also, most mothers reside in municipal seats (AREA_RES = 1), and SEXO is split almost 50/50.

T_GES has peaks at 1 and 4 (<22 weeks and 38-41 weeks). The latter is about the normal length of a pregnancy, while the former roughly indicates the number of very early deliveries. It would be interesting to study the influence of this variable on newborn viability.

Most of the TALLA values are missing, so it is probably not a very informative variable.

## Now let us break the distributions down by TIPO_DEFUN.

In [None]:
cols = ['PESO','AREA_RES','EDAD_MADRE','SEXO','TALLA','T_GES']

fig = plt.figure(figsize=(20,10))
for i,col in enumerate(cols):
    axi = fig.add_subplot(2,3,i+1)
    axi.set_title(col)
    # Number of records for each (category, TIPO_DEFUN) pair
    counts = (dfn.groupby([col,'TIPO_DEFUN'])
                 .size()
                 .reset_index(name='Count'))
    sns.barplot(data=counts,x=col,y='Count',hue='TIPO_DEFUN',ax=axi)
    
plt.tight_layout()
plt.savefig('../Plots/InitialDiscrimDists.png')
plt.show()

The first plot explains the observations in the cells above: the PESO distribution of the newborns does look normal, and the peak at 1 is due almost exclusively to fetal deaths (orange). A small peak (green) can also be seen there, which probably makes PESO a good indicator of newborn viability. AREA_RES and EDAD_MADRE have very similar distributions across the TIPO_DEFUN groups.

From these plots we can start identifying correlations (at least to check the consistency of the data). The peak at 3 in SEXO (undetermined) is due exclusively to fetal deaths. The T_GES plot shows that most of the fetal deaths occur before 22 weeks. Depending on when sex differentiation takes place, the undetermined sex may simply reflect that these fetuses had not yet differentiated at the time of death.

There might be some correlation between TALLA and T_GES. There are two similar peaks in both of the plots, which may indicate some correspondence.

There don't seem to be substantial differences among the distributions in AREA_RES; that is, whether the mother lives in a city or in a rural area doesn't seem to determine whether the fetus lives or dies.
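
The visual impression for AREA_RES could be backed by a chi-square statistic on the contingency table. This is only a sketch on toy data: the `TIPO_DEFUN` labels and all counts are invented, and the statistic is computed by hand with pandas and NumPy rather than with a statistics library. A value near zero means the observed counts match what independence would predict.

```python
import numpy as np
import pandas as pd

# Toy records with the same fetal share (30%) in both areas.
toy = pd.DataFrame({
    'AREA_RES': ['1'] * 100 + ['2'] * 50,
    'TIPO_DEFUN': (['fetal'] * 30 + ['non-fetal'] * 70
                   + ['fetal'] * 15 + ['non-fetal'] * 35),
})

# Observed counts for every (AREA_RES, TIPO_DEFUN) pair.
observed = pd.crosstab(toy['AREA_RES'], toy['TIPO_DEFUN'])

# Counts expected if the two variables were independent:
# (row total * column total) / grand total.
row_totals = observed.sum(axis=1).to_numpy()
col_totals = observed.sum(axis=0).to_numpy()
expected = np.outer(row_totals, col_totals) / observed.to_numpy().sum()

# Pearson chi-square statistic.
chi2 = ((observed.to_numpy() - expected) ** 2 / expected).sum()
print(chi2)
```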

An additional (and more subtle) insight: in the EDAD_MADRE plot the distribution for fetal deaths is a bit more spread out than that for the newborns. This suggests that older women do get pregnant, but that their probability of a fetal death is higher.
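
Comparing raw counts makes this hard to judge by eye, because the age bins differ greatly in size. A row-normalized cross-tabulation gives the share of each outcome within every age bin instead. The sketch below uses toy records; the `TIPO_DEFUN` labels and the counts are assumptions, not values from the dataset.

```python
import pandas as pd

# Toy records; labels and counts are made up for illustration.
toy = pd.DataFrame({
    'EDAD_MADRE': ['3', '3', '3', '3', '7', '7', '7', '7'],
    'TIPO_DEFUN': ['fetal', 'non-fetal', 'non-fetal', 'non-fetal',
                   'fetal', 'fetal', 'fetal', 'non-fetal'],
})

# normalize='index' rescales each row to sum to 1, giving the share
# of each TIPO_DEFUN category within an age bin.
shares = pd.crosstab(toy['EDAD_MADRE'], toy['TIPO_DEFUN'], normalize='index')
print(shares)
```

If the fetal share rises with the EDAD_MADRE bin in the real data, that would support the reading above.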