<img src="https://i.imgur.com/6U6q5jQ.png"/>

<a target="_blank" href="https://colab.research.google.com/github/SocialAnalytics-StrategicIntelligence/TableOperations/blob/main/index.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Operations on Data Frames


Let me get the data on dengue from [Peru](https://www.datosabiertos.gob.pe/dataset/vigilancia-epidemiol%C3%B3gica-de-dengue):

In [None]:
import pandas as pd
linkData="https://github.com/SocialAnalytics-StrategicIntelligence/TableOperations/raw/main/dataFiles/dengue_ok.pkl"
dengue = pd.read_pickle(linkData)
dengue.info()

In [None]:
# some exploration
dengue.describe().apply(lambda s: s.apply('{0:.5f}'.format))

Each row is a person:

In [None]:
dengue.head()

If we wanted to count people, creating a column of ones helps:

In [None]:
dengue=dengue.assign(case=1)
dengue.head()

Let's start creating _data from these data_!

# Average Age - by Year and Symptoms

## Aggregation

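Since each row is a person, we can summarize people by some grouping variables, in this case year (_ano_) and dengue status (_enfermedad_). For each group we compute the mean and median age (_edad_) and the number of cases.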


In [None]:
indexList=['ano','enfermedad']
aggregator={'edad': ['mean','median'], 'case':['sum']}
ByYear_stats=dengue.groupby(indexList,observed=True).agg(aggregator)
ByYear_stats.head(20)

In [None]:
# notice the hierarchy: the columns are a MultiIndex
ByYear_stats.columns

For easier manipulation outside Python, we could flatten the column hierarchy:

In [None]:
# preview the flattened names: do they look ok?
["_".join(name) for name in ByYear_stats.columns]

In [None]:
# changing
ByYear_stats.columns=["_".join(name) for name in ByYear_stats.columns]
ByYear_stats.head(20)

In [None]:
# final look:
ByYear_stats.reset_index(drop=False,inplace=True)
ByYear_stats.head(20)
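As a minimal sketch of the same flattening idea, the cell below uses a small made-up table (the values are hypothetical, not taken from the dengue file). Each column label produced by `agg` is a tuple, and joining its parts gives one flat name per column.

```python
import pandas as pd

# toy data with hypothetical values, only to illustrate the steps
toy = pd.DataFrame({
    "ano": [2020, 2020, 2021, 2021],
    "edad": [10, 30, 20, 40],
    "case": [1, 1, 1, 1],
})

# the same kind of aggregation: two statistics for edad, one for case
toy_stats = toy.groupby("ano").agg({"edad": ["mean", "median"], "case": ["sum"]})

# each column label is a tuple such as ('edad', 'mean'); join the parts
flat_names = ["_".join(name) for name in toy_stats.columns]
toy_stats.columns = flat_names

# bring the grouping variable back as a regular column
toy_flat = toy_stats.reset_index()
print(toy_flat)
```

After this, every column has a single-level name, which is what most tools outside pandas expect.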

Notice a particular data type:

In [None]:
ByYear_stats.enfermedad.dtype

Saving to CSV would lose that _dtype_ (the categories and their order), so we use the pickle file format instead:

In [None]:
ByYear_stats.to_pickle('dataFiles/ByYear_stats.pkl') # R can read this via the reticulate package
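To see why the file format matters, here is a small sketch with a made-up ordered category (the column name `level` and its values are assumptions for illustration). It writes the same data frame to CSV and to pickle, reads both back, and compares the data types.

```python
import io
import os
import tempfile

import pandas as pd

# hypothetical ordered levels, not the ones in the dengue data
levels = ["low", "mid", "high"]
toy = pd.DataFrame({
    "level": pd.Categorical(["low", "high", "mid"], categories=levels, ordered=True)
})

# CSV round trip: the values survive, the categorical dtype does not
from_csv = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

# pickle round trip: the dtype (categories and order) is kept
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy.pkl")
    toy.to_pickle(path)
    from_pickle = pd.read_pickle(path)

print(from_csv["level"].dtype)
print(from_pickle["level"].dtype)
```

The CSV version comes back as plain text, so any ordering would have to be declared again after reading.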

## Reshaping

Notice the variables are in three columns: **edad_mean** / **edad_median** / **case_sum**. We could reshape those columns to a long format:

In [None]:
theVarsAsIndex=['ano','enfermedad']

# stacking  and resetting index
ByYear_LongStats=ByYear_stats.set_index(theVarsAsIndex).stack().reset_index()

#result
ByYear_LongStats

In [None]:
# just renaming
ByYear_LongStats.rename(columns={'level_2':'statsName',0:'statsValue'},inplace=True)
ByYear_LongStats
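The wide-to-long step can be sketched on a tiny made-up table (hypothetical values). With a single index column, the stacked level is named `level_1` after resetting the index, and the values land in a column named `0`, so both get renamed.

```python
import pandas as pd

# toy wide table with hypothetical values
wide = pd.DataFrame({
    "ano": [2020, 2021],
    "edad_mean": [20.0, 30.0],
    "case_sum": [2.0, 2.0],
})

# keep ano as the index, then stack every remaining column into rows
long = wide.set_index("ano").stack().reset_index()

# the former column names sit in 'level_1', the values in 0
long = long.rename(columns={"level_1": "statsName", 0: "statsValue"})
print(long)
```

Each original row becomes one row per statistic, which is the shape many plotting tools prefer.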

In [None]:
# still ordinal?
ByYear_LongStats.enfermedad.dtype

In [None]:
ByYear_LongStats.to_pickle('dataFiles/ByYear_LongStats.pkl')

# Dengue by Location (Province)

## Aggregating

We can redo the previous process, adding _departamento_ and _provincia_:

In [None]:
indexList=['ano','departamento','provincia','enfermedad']
aggregator={'case':['sum']}
ByYearPlace=dengue.groupby(indexList,observed=True).agg(aggregator)
ByYearPlace

The output data frame is in long format. Before flattening it, you could reshape it to a wide format:

## Long to wide

In [None]:
#simply
ByYearPlace.unstack()

In [None]:
# a more familiar look
ByYearPlace_wide=ByYearPlace.unstack().reset_index()
ByYearPlace_wide

In [None]:
# zero instead of missing
ByYearPlace_wide.fillna(0,inplace=True)
ByYearPlace_wide
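A small sketch may help to see where the missing values come from. The toy data below are hypothetical: province `B` has no `ALARMA` rows, so `unstack` has nothing to put in that cell and leaves it missing until `fillna` replaces it with zero.

```python
import pandas as pd

# toy data with hypothetical provinces; B never shows ALARMA
toy = pd.DataFrame({
    "provincia": ["A", "A", "B"],
    "enfermedad": ["ALARMA", "SIN_SEÑALES", "SIN_SEÑALES"],
    "case": [1, 1, 1],
})

counts = toy.groupby(["provincia", "enfermedad"]).agg({"case": ["sum"]})

# the innermost index level (enfermedad) moves to the columns
wide = counts.unstack()

# combinations that never occurred are missing; treat them as zero cases
filled = wide.fillna(0)
wide_filled = filled.reset_index()
print(wide_filled)
```

Filling with zero is reasonable here because a missing combination means no cases were recorded, not that the count is unknown.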

In [None]:
# as you expected, the columns are a MultiIndex
ByYearPlace_wide.columns

In [None]:
#prepare
["_".join(names) if names[1]!='' else names[0] for names in ByYearPlace_wide.columns]

In [None]:
# change
ByYearPlace_wide.columns=["_".join(names) if names[1]!='' else names[0] for names in ByYearPlace_wide.columns]
ByYearPlace_wide

What about finding the _provincia_ most affected in a _departamento_?

In [None]:
where = ByYearPlace_wide.groupby(['ano','departamento'])['case_sum_ALARMA'].idxmax()
worst_prov_year = ByYearPlace_wide.loc[where].reset_index(drop=True)
worst_prov_year
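The `idxmax` pattern can be sketched on a toy table (hypothetical names and counts). It returns, for each group, the row label where the maximum occurs, and `loc` then retrieves those rows. Note that a group with only zeros still returns a row, namely the first one in the group.

```python
import pandas as pd

# toy data: departamento Y has no ALARMA cases at all
toy = pd.DataFrame({
    "departamento": ["X", "X", "Y", "Y"],
    "provincia": ["x1", "x2", "y1", "y2"],
    "case_sum_ALARMA": [5.0, 9.0, 0.0, 0.0],
})

# row label of the maximum within each departamento
where = toy.groupby("departamento")["case_sum_ALARMA"].idxmax()

# use those labels to pull the complete rows
worst = toy.loc[where].reset_index(drop=True)
print(worst)
```

Here `y1` appears as the "worst" province of `Y` even though it has zero cases, so a filter on positive counts is a sensible follow-up.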

In [None]:
# number of distinct provinces that were the worst in their departamento in some year
len(worst_prov_year.provincia.value_counts())

In [None]:
# same count, cleaner: only provinces with at least one ALARMA case
len(worst_prov_year[worst_prov_year.case_sum_ALARMA>0].provincia.value_counts())

## Filtering

Let's filter some rows based on what we just computed:

In [None]:
worst_ProvYear_alarma=worst_prov_year[worst_prov_year.case_sum_ALARMA>0].loc[:,['departamento','provincia']]
worst_ProvYear_alarma.reset_index(drop=True,inplace=True)
worst_ProvYear_alarma

In [None]:
# adding a column of ones
worst_ProvYear_alarma['case']=1
worst_ProvYear_alarma

## Frequency table

With the filtered data, let's create a frequency table: how many times each _provincia_ was the worst in its _departamento_.

In [None]:
indexList=['departamento','provincia']
aggregator={'case':['sum']}
worst_ProvYear_alarma_Frequency=worst_ProvYear_alarma.groupby(indexList,observed=True).agg(aggregator)
worst_ProvYear_alarma_Frequency

In [None]:
# flattening with counts
worst_ProvYear_alarma_Frequency.reset_index()

In [None]:
# final look
worst_ProvYear_alarma_Frequency.columns=['case']
worst_ProvYear_alarma_Frequency.reset_index(inplace=True)
worst_ProvYear_alarma_Frequency

Saving the frequencies in a file:

In [None]:
worst_ProvYear_alarma_Frequency.to_csv('dataFiles/worst_ProvYear_alarma_Frequency.csv',index=False)

# The 'ALARM' symptoms level

## Aggregating

Let's check a previous data frame:

In [None]:
ByYearPlace_wide

This time, I want two variables:

In [None]:
indexList=['ano','departamento']
aggregator={'case_sum_SIN_SEÑALES':['sum'],'case_sum_ALARMA':['sum']}
ByYearPlace=ByYearPlace_wide.groupby(indexList,observed=True).agg(aggregator)
ByYearPlace.columns=['sum_SIN_SEÑALES','sum_ALARMA']
ByYearPlace.reset_index(inplace=True)
ByYearPlace

## Creating information

I will create a new variable:

In [None]:
ByYearPlace['rateAlarma']=(ByYearPlace['sum_ALARMA']/ByYearPlace['sum_SIN_SEÑALES'])
ByYearPlace['rateAlarma'].describe()

We got _inf_ values, which appear when the denominator is zero:

In [None]:
import numpy as np #identify with numpy
ByYearPlace[np.isinf(ByYearPlace.rateAlarma)]

We need to decide how to treat those cases. I set the rate to 1 and then dropped the two columns used to compute it:

In [None]:
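# set the infinite rates to 1, selecting the rows by condition instead of by position
ByYearPlace.loc[np.isinf(ByYearPlace.rateAlarma),'rateAlarma']=1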
ByYearPlace.drop(columns=['sum_SIN_SEÑALES','sum_ALARMA'],inplace=True)
ByYearPlace['rateAlarma'].describe()
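The following sketch, with made-up counts, shows how such values arise and one way to replace them. A positive numerator over zero gives _inf_, while zero over zero gives a missing value, which `np.isinf` does not flag. Setting the rate to 1 simply mirrors the decision taken above and is not the only defensible choice.

```python
import numpy as np
import pandas as pd

# toy counts (hypothetical): the second and third rows have a zero denominator
toy = pd.DataFrame({
    "sum_ALARMA": [2.0, 3.0, 0.0],
    "sum_SIN_SEÑALES": [4.0, 0.0, 0.0],
})

# 2/4 is 0.5, 3/0 is inf, 0/0 is NaN
toy["rateAlarma"] = toy["sum_ALARMA"] / toy["sum_SIN_SEÑALES"]

# flag the infinite rates and overwrite them by condition
is_inf = np.isinf(toy["rateAlarma"])
toy.loc[is_inf, "rateAlarma"] = 1
print(toy)
```

Selecting rows with a boolean mask keeps working if the data are updated and the affected rows move.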

### Discretizing

Sometimes you need a numerical variable as an ordinal variable: 

In [None]:
edges=[-1, .1, .25, .5,.75,1,2]
theLabels=["less10%","10-25%","25-50%","50-75%","75-100%","above100%"]
ByYearPlace["rateAlarma.cut"]=pd.cut(ByYearPlace['rateAlarma'], include_lowest=True,
                                     bins=edges, 
                                     labels=theLabels,
                                     ordered=True)
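To make the binning rule concrete, here is a sketch on a few made-up rates using the same edges. By default the intervals are closed on the right, so a value equal to an edge falls in the lower bin, and any value above the last edge (2 here) becomes missing.

```python
import pandas as pd

# hypothetical rates chosen to hit edges and fall outside the range
rates = pd.Series([0.0, 0.1, 0.3, 1.0, 1.5, 3.0])

edges = [-1, .1, .25, .5, .75, 1, 2]
labels = ["less10%", "10-25%", "25-50%", "50-75%", "75-100%", "above100%"]

binned = pd.cut(rates, bins=edges, labels=labels,
                include_lowest=True, ordered=True)

# compare each rate with the label it received
print(pd.DataFrame({"rate": rates, "bin": binned}))
```

If rates above 2 are possible in the real data, the last edge would need to be raised (for example to `float("inf")`) to avoid losing those rows.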

In [None]:
# we have
ByYearPlace

We could check the yearly behavior:

In [None]:
ByYearPlace.groupby('ano').describe()

Let's do some **filtering**:

In [None]:
ByYearPlace=ByYearPlace[ByYearPlace.ano>=2012]
ByYearPlace.reset_index(drop=True,inplace=True)
ByYearPlace

In [None]:
ByYearPlace.info()

In [None]:
# export as pickle to keep the ordered category created by pd.cut

ByYearPlace.to_pickle("dataFiles/ByYearPlace.pkl")