<img src="https://i.imgur.com/6U6q5jQ.png"/>

<a target="_blank" href="https://colab.research.google.com/github/SocialAnalytics-StrategicIntelligence/TableOperations/blob/main/index.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Operations on Data Frames


Let's load the dengue surveillance data from [Peru](https://www.datosabiertos.gob.pe/dataset/vigilancia-epidemiol%C3%B3gica-de-dengue):

In [None]:
import pandas as pd
linkData="https://github.com/SocialAnalytics-StrategicIntelligence/TableOperations/raw/main/dengue_ok.pkl"

dengue = pd.read_pickle(linkData)

# checking format
dengue.info()

In [None]:
# Each row is a person:
dengue.head()

In [None]:
# some exploration
dengue.describe().apply(lambda s: s.apply('{0:.5f}'.format))

In [None]:
# exploring
dengue.enfermedad.value_counts()

Better labels:

In [None]:
dengue['enfermedad_text']=dengue.enfermedad.astype(str)

dengue.replace({'enfermedad_text':{'SIN_SEÑALES':'1_SIN_SEÑALES','ALARMA':'2_ALARMA','GRAVE':'3_GRAVE'}},inplace=True)
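A minimal sketch of what the relabelling does, on a made-up frame (`toy` and its `note` column are hypothetical, not part of the dengue data). The nested dictionary restricts the replacement to one column, and the numeric prefixes make alphabetical order match severity order:

```python
import pandas as pd

# hypothetical frame: 'note' holds the same string as a status on purpose
toy = pd.DataFrame({
    "status": ["GRAVE", "ALARMA", "SIN_SEÑALES"],
    "note":   ["ALARMA", "x", "y"],
})

relabel = {"SIN_SEÑALES": "1_SIN_SEÑALES", "ALARMA": "2_ALARMA", "GRAVE": "3_GRAVE"}

# {column: {old: new}} only touches the named column
toy_relabelled = toy.replace({"status": relabel})

# sorting the new labels now follows severity
ordered_labels = sorted(toy_relabelled["status"])
print(ordered_labels)
print(toy_relabelled)
```

The `note` column keeps its original `ALARMA` value because the mapping was scoped to `status`.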

In [None]:
# exploring
dengue.ano.value_counts(sort=False)

Discretizing:

In [None]:
binLimits=[0,15,50,110]
theLabels=["a_menor_a_16","b_entre_16y50","c_mayor_a_50"]
dengue["edad_grupos"]=pd.cut(dengue['edad'], include_lowest=True,
                                     bins=binLimits, 
                                     labels=theLabels,
                                     ordered=True)

# see

dengue.head()
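To check how the bin edges assign ages to groups, here is a small sketch on made-up ages placed on and around the edges (the values in `toy_ages` are illustrative only). Intervals are closed on the right, and `include_lowest=True` also lets the value 0 into the first group:

```python
import pandas as pd

# hypothetical ages sitting on and next to the bin edges
toy_ages = pd.Series([0, 15, 16, 50, 51, 110])

toy_groups = pd.cut(toy_ages,
                    bins=[0, 15, 50, 110],
                    labels=["a_menor_a_16", "b_entre_16y50", "c_mayor_a_50"],
                    include_lowest=True,
                    ordered=True)

# pair each age with the label it received
age_to_group = dict(zip(toy_ages, toy_groups.astype(str)))
print(age_to_group)
```

So 15 still belongs to the first group and 50 to the second, which is what the labels promise.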

A first look at the surface: severity by age group, as column shares:

In [None]:
pd.crosstab( dengue.enfermedad_text,dengue.edad_grupos, dropna=False, normalize='columns')

In [None]:
pd.crosstab(dengue.enfermedad_text,[dengue.sexo,dengue.edad_grupos], dropna=False, normalize='columns')
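As a quick illustration of `normalize='columns'`, this sketch builds a crosstab from six made-up records (`toy`, `severity` and `age_group` are hypothetical names). Each cell is divided by its column total, so every column adds up to 1:

```python
import pandas as pd

# hypothetical records: one row per person
toy = pd.DataFrame({
    "severity":  ["mild", "mild", "severe", "mild", "severe", "severe"],
    "age_group": ["young", "young", "young", "old", "old", "old"],
})

# shares within each age group (column), not within each severity (row)
toy_shares = pd.crosstab(toy.severity, toy.age_group, normalize="columns")
print(toy_shares)
```

Reading down a column answers "among people in this age group, what share had each severity?".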

# Yearly look

In [None]:
import altair as alt
alt.data_transformers.enable("vegafusion")


In [None]:
alt_dengue=alt.Chart(dengue)

enc_dengue=alt_dengue.encode(
    x='ano:T',
    y='mean(edad):Q',
    color='enfermedad_text:N',
)

enc_dengue.mark_line() + enc_dengue.mark_errorband()

More detailed:

In [None]:
enc_dengue=alt_dengue.encode(
    x='ano:T',
    y='median(edad):Q',
    color='enfermedad_text:N',
    tooltip=['median(edad)','ano:T']
).interactive()

enc_dengue.mark_line().facet(
    row='sexo:N',
    column='edad_grupos:N'
) 

In [None]:
enc_dengue=alt_dengue.encode(
    x='ano:T',
    y=alt.Y('sum(case):Q'),
    color='enfermedad_text:N',
    tooltip=['sum(case):Q','ano:T']
).interactive()
enc_dengue.mark_line().facet(
    row='sexo:N',
    column='edad_grupos:N'
)

The previous plot may require a logged Y-axis: 

In [None]:
enc_dengue=alt_dengue.encode(
    x='ano:T',
    y=alt.Y('sum(case):Q', scale=alt.Scale(type='log')),
    color='enfermedad_text:N',
    tooltip=['sum(case):Q','ano:T']
).interactive()

enc_dengue.mark_line().facet(
    row='sexo:N',
    column='edad_grupos:N'
)

Let's get the same results in tables:

In [None]:
indexList=['edad_grupos','ano','sexo','enfermedad_text']
aggregator={'edad': ['median']}
LevelByYear_medians=dengue.groupby(indexList,observed=True).agg(aggregator)
LevelByYear_medians

In [None]:
LevelByYear_medians.unstack(['sexo','enfermedad_text'])
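The same group-then-unstack pattern on a tiny made-up frame (column names `year`, `sex` and `age` are hypothetical) shows what happens to the index and the columns. Passing a list of functions to `agg` produces two-level columns, and `unstack` adds the moved index level as a third one:

```python
import pandas as pd

# hypothetical data: six people over two years
toy = pd.DataFrame({
    "year": [2020, 2020, 2020, 2021, 2021, 2021],
    "sex":  ["F", "F", "M", "F", "M", "M"],
    "age":  [10, 30, 20, 40, 50, 70],
})

# long result: index (year, sex), columns ('age', 'median')
toy_medians = toy.groupby(["year", "sex"]).agg({"age": ["median"]})

# wide result: index year, columns ('age', 'median', sex)
toy_wide = toy_medians.unstack("sex")
print(toy_wide)
```

A cell in the wide table is then addressed with a full column tuple, for instance `toy_wide.loc[2021, ("age", "median", "M")]`.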

Notice the multi-index:

In [None]:
LevelByYear_medians.info()

These are other possibilities, but not better than the lines:

In [None]:
alt_dengue=alt.Chart(dengue)
enc_dengue=alt_dengue.encode(
    x='ano:T',
    y=alt.Y('sum(case):Q', scale=alt.Scale(type='log')),
    column='enfermedad_text:N'
)
enc_dengue.mark_circle() 

In [None]:
alt_dengue=alt.Chart(dengue)
enc_dengue=alt_dengue.encode(
    x='ano:T',
    y=alt.Y('sum(case):Q', scale=alt.Scale(type='log')),
    column='enfermedad_text:N',
)
enc_dengue.mark_rule() 

In [None]:
alt_dengue=alt.Chart(dengue)
enc_dengue=alt_dengue.encode(
    x='ano:T',
    y=alt.Y('sum(case):Q', scale=alt.Scale(type='log')),
    column='enfermedad_text:N',
)
enc_dengue.mark_bar() 

Let's do some aggregation:

In [None]:
indexList=['edad_grupos','ano','sexo','enfermedad_text']
aggregator={'edad': ['median','mean','min','max']}
LevelByYear_statsFull=dengue.groupby(indexList,observed=True).agg(aggregator)
LevelByYear_statsFull

Now, some reshaping:

In [None]:
LevelByYear_statsFull.stack(future_stack=True)

# Mining location

Let's use _departamento_ and _provincia_: 

In [None]:
indexList=['ano','departamento','provincia','enfermedad_text']
aggregator={'case':['sum']}
ByYearPlace=dengue.groupby(indexList,observed=True).agg(aggregator)
ByYearPlace

Create a wide shape:

In [None]:
#long to wide
ByYearPlace.unstack()

In [None]:
# no missing values
ByYearPlace_wide=ByYearPlace.unstack().fillna(0)
ByYearPlace_wide

The idea is to get the share of people in ALARM status. For that we need the total number of cases per row:

In [None]:
sumCases=ByYearPlace_wide.sum(axis=1)
sumCases

In [None]:
# here you are:
shareAlarma=ByYearPlace_wide.loc[:,('case','sum','2_ALARMA')]/sumCases
shareAlarma.name='shareAlarma'
shareAlarma
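The share computation can be traced by hand on a tiny made-up table (the names `toy`, `province` and `status` are hypothetical). Province B has no ALARMA rows, so unstacking leaves a missing cell that `fillna(0)` turns into a zero share:

```python
import pandas as pd

# hypothetical counts: province A has an ALARMA row, province B does not
toy = pd.DataFrame({
    "year":     [2020, 2020, 2020, 2020],
    "province": ["A", "A", "B", "B"],
    "status":   ["1_SIN_SEÑALES", "2_ALARMA", "1_SIN_SEÑALES", "3_GRAVE"],
    "case":     [3, 1, 4, 1],
})

toy_long = toy.groupby(["year", "province", "status"]).agg({"case": ["sum"]})

# one column per status; absent combinations become 0
toy_wide = toy_long.unstack().fillna(0)

# ALARMA cases divided by all cases in the same row
toy_share = toy_wide.loc[:, ("case", "sum", "2_ALARMA")] / toy_wide.sum(axis=1)
print(toy_share)
```

Here A gets 1 out of 4 cases (0.25) and B gets 0 out of 5.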

Drop the multi-index so the grouping keys become regular columns:

In [None]:
shareAlarma=shareAlarma.reset_index()
shareAlarma

Let's find the worst province per department in each year:

In [None]:
where = shareAlarma.groupby(['ano','departamento'])['shareAlarma'].idxmax()
worst_prov_year = shareAlarma.loc[where].reset_index(drop=True)
worst_prov_year
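A small sketch of the `idxmax` pattern on made-up shares (the frame `toy_share` and its values are hypothetical). `idxmax` returns, for each group, the row label of the maximum, and `.loc` then pulls those full rows:

```python
import pandas as pd

# hypothetical shares for three provinces in two departments
toy_share = pd.DataFrame({
    "year":       [2020, 2020, 2020, 2021, 2021],
    "department": ["X", "X", "Y", "X", "X"],
    "province":   ["A", "B", "C", "A", "B"],
    "share":      [0.10, 0.30, 0.20, 0.50, 0.40],
})

# row label of the highest share within each (year, department)
where = toy_share.groupby(["year", "department"])["share"].idxmax()

# keep the whole row, not just the maximum value
toy_worst = toy_share.loc[where].reset_index(drop=True)
print(toy_worst)
```

When several rows tie for the maximum, `idxmax` keeps the first one, so a group where every share is 0 still yields a "worst" province. That is one reason to filter on a positive share afterwards.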

In [None]:
worst_prov_year.shareAlarma.describe()

In [None]:
# number of distinct provinces that were ever the worst in their department
len(worst_prov_year.provincia.value_counts())

In [None]:
# same count, keeping only rows with a positive ALARMA share
len(worst_prov_year[worst_prov_year.shareAlarma>0].provincia.value_counts())

Some filtering:

In [None]:
worst_ProvYear_alarma=worst_prov_year[worst_prov_year.shareAlarma>0].loc[:,['departamento','provincia']]
worst_ProvYear_alarma.reset_index(drop=True,inplace=True)
worst_ProvYear_alarma

In [None]:
indexList=['departamento','provincia']
aggregator={'provincia':['count']}
worst_ProvYear_alarma_Frequency=worst_ProvYear_alarma.groupby(indexList,observed=True).agg(aggregator)
worst_ProvYear_alarma_Frequency

The count tells how many years each province was the most affected in its department:

In [None]:
worst_ProvYear_alarma_Frequency.describe()

In [None]:
# final look
worst_ProvYear_alarma_Frequency.columns=['yearsAffected']
worst_ProvYear_alarma_Frequency=worst_ProvYear_alarma_Frequency[worst_ProvYear_alarma_Frequency.yearsAffected>2]
worst_ProvYear_alarma_Frequency.reset_index(inplace=True)
worst_ProvYear_alarma_Frequency

Let's plot:

In [None]:
alt_worstProv=alt.Chart(worst_ProvYear_alarma_Frequency)

enc_worstProv=alt_worstProv.encode(
    y='departamento',
    x='provincia',
    text='yearsAffected:O',
    size='yearsAffected:O'
)

enc_worstProv.mark_text()

Let's try another view, aggregating by department only:

In [None]:
indexList=['ano','departamento','enfermedad_text']
aggregator={'case':['sum']}
ByYearDepa=dengue.groupby(indexList,observed=True).agg(aggregator)
ByYearDepa_wide=ByYearDepa.unstack().fillna(0)
ByYearDepaAlarm=ByYearDepa_wide.loc[:,('case','sum','2_ALARMA')]/ByYearDepa_wide.sum(axis=1)
ByYearDepaAlarm.name='alarmShare'

ByYearDepaAlarm=ByYearDepaAlarm.reset_index()
ByYearDepaAlarm

In [None]:
ByYearDepaAlarm.describe()

In [None]:
# .copy() so the column added below does not raise SettingWithCopyWarning
ByYearDepaAlarm_focus=ByYearDepaAlarm[ByYearDepaAlarm.alarmShare>0].copy()

In [None]:
ByYearDepaAlarm_focus.describe()

In [None]:
edges=[-1, .10, .25, .5,1]
theLabels=["a.below10%","b.11-25%","c.26-50%","d.above50%"]
ByYearDepaAlarm_focus.loc[:,"alarmLevels"]=pd.cut(ByYearDepaAlarm_focus['alarmShare'],
                                            include_lowest=True,
                                            bins=edges, 
                                            labels=theLabels,
                                            ordered=True)

##
ByYearDepaAlarm_focus.head()

In [None]:
alt_WorstDepa=alt.Chart(ByYearDepaAlarm_focus).encode(x='ano:O',
                                                      y=alt.Y('departamento:N',
                                                              sort=alt.EncodingSortField(field='alarmShare',op='max',order='descending')))
enc1_WorstDepa=alt_WorstDepa.encode(
    color=alt.Color('alarmLevels:O').scale(scheme="lightgreyred", reverse=False)
)

enc1_WorstDepa.mark_rect()

In [None]:
enc2_WorstDepa=alt_WorstDepa.encode(
    text=alt.Text('alarmShare:Q', format=".1f"),
    opacity=alt.condition('datum.alarmShare >= 0.3', alt.value(1), alt.value(0)))
enc2_WorstDepa.mark_text(fontStyle='bold')

In [None]:
enc1_WorstDepa.mark_rect() + enc2_WorstDepa.mark_text()

You can find other color schemes [here](https://vega.github.io/vega/docs/schemes/).