<img src="https://i.imgur.com/6U6q5jQ.png"/>

# Data Aggregating in Python


Sometimes you need to summarize your data at a higher level than the original unit of analysis, for example going from municipalities to states. This is what the aggregating capabilities in pandas are for.

We will use data from here:

In [None]:
%%html
<iframe width="800" height="500" src="https://covid.saude.gov.br/" allowfullscreen></iframe>


I downloaded the data for 2022 into the _FilesToAggregate_ folder:

In [None]:
import pandas as pd
import glob
import os

# collect the path of every CSV file in the folder
all_filenames = glob.glob(os.path.join('FilesToAggregate', "*.csv"))
all_filenames

In [None]:
# first try: read one file with the default separator (a comma)
pd.read_csv(all_filenames[0]).head()

In [None]:
# second try: tell pandas that the separator is a semicolon
pd.read_csv(all_filenames[0],sep=";").head()

In [None]:
# read every file into a data frame and keep them all in a list
list_dfs=[]
for aName in all_filenames:
    list_dfs.append(pd.read_csv(aName,sep=";"))

Let's check the column names and the shape of each data frame:

In [None]:
for aDF in list_dfs:
    print(aDF.columns,aDF.shape)

In [None]:
pd.concat(list_dfs,axis=0,ignore_index=True)

In [None]:
# the concatenation works, so let's save the result
covid=pd.concat(list_dfs,ignore_index=True,copy=False)
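The visual check of the column names can also be done programmatically before concatenating. The following is a minimal sketch on two hypothetical toy data frames (the names `frames`, `reference` and `stacked` are made up for illustration and are not part of the original files):

```python
import pandas as pd

# Hypothetical toy data frames standing in for the files read above
frames = [
    pd.DataFrame({'estado': ['SP'], 'casosNovos': [10]}),
    pd.DataFrame({'estado': ['RJ'], 'casosNovos': [4]}),
]

# Compare the columns of every data frame against the first one
reference = list(frames[0].columns)
same_columns = all(list(f.columns) == reference for f in frames)
print(same_columns)

# Stack the rows; ignore_index=True builds a fresh 0..n-1 index
stacked = pd.concat(frames, ignore_index=True)
print(stacked)
```

If `same_columns` were `False`, `pd.concat` would still run, but it would fill the columns missing from some files with missing values.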

Let's keep the columns needed:

In [None]:
toSelect=['regiao', 'estado', 'municipio','data', 'semanaEpi','casosNovos', 'obitosNovos']
covid=covid[toSelect]

In [None]:
# you have the data at the municipal level

covid.head()

In [None]:
# check
covid.info()

Let's keep only the rows with complete data:

In [None]:
# drop every row that has at least one missing value
covid=covid.dropna(how='any')
covid.info()
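Before dropping rows, it can help to see which columns have the missing values. This is a small sketch on a hypothetical toy data frame (`toy`, `missing_per_column` and `complete` are illustrative names, and the values are made up):

```python
import pandas as pd

# Hypothetical toy rows; the real data has more columns and rows
toy = pd.DataFrame({
    'estado': ['SP', 'SP', None, 'RJ'],
    'municipio': ['A', None, None, 'B'],
    'casosNovos': [10, 250, 300, 7],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Keep only the rows without any missing value
complete = toy.dropna(how='any')
print(len(toy), 'rows before,', len(complete), 'rows after')
```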

We have enough data. Let's look at different ways to aggregate it:

In [None]:
# sum of cases by estado
casesSumByState=covid.groupby('estado').agg({'casosNovos': 'sum'})
casesSumByState

In [None]:
# sum of cases by estado and week
casesSumByStateAndWeek=covid.groupby(['estado','semanaEpi']).agg({'casosNovos': 'sum'})
casesSumByStateAndWeek

In [None]:
# sum and mean of cases by estado and week
casesSumAndMeanByStateAndWeek=covid.groupby(['estado','semanaEpi']).agg({'casosNovos': ['sum','mean']})
casesSumAndMeanByStateAndWeek

In [None]:
# sum of cases and deaths by estado
CasesAndDeathsByState=covid.groupby('estado').agg({'casosNovos': 'sum', 'obitosNovos': 'sum'})
CasesAndDeathsByState

In [None]:
# sum and mean of cases and deaths by estado and week
CasesAndDeathsMeanAndSumByStateAndWeek=covid.groupby(['estado','semanaEpi']).agg({'casosNovos': ['sum','mean'],'obitosNovos':['sum','mean']})
CasesAndDeathsMeanAndSumByStateAndWeek

In all the previous cases, the grouping variables became the index of the result. When the result is simple, you can get back a traditional table like this:

In [None]:
CasesAndDeathsByState.reset_index()
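An alternative to `reset_index()` is to ask `groupby` not to use the grouping variable as the index in the first place, via its `as_index=False` argument. A minimal sketch on hypothetical toy data (the values are made up):

```python
import pandas as pd

# Hypothetical toy data using the same column names as above
toy = pd.DataFrame({
    'estado': ['SP', 'SP', 'RJ', 'RJ'],
    'casosNovos': [10, 20, 5, 7],
    'obitosNovos': [1, 0, 2, 1],
})

# as_index=False keeps 'estado' as a regular column
flat = toy.groupby('estado', as_index=False).agg({'casosNovos': 'sum', 'obitosNovos': 'sum'})
print(flat)
```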

However, when the result has a **multi index** (here, in both the rows and the columns), that is more challenging:

In [None]:
CasesAndDeathsMeanAndSumByStateAndWeek.head()

In [None]:
CasesAndDeathsMeanAndSumByStateAndWeek.columns

In [None]:
# join each (variable, statistic) pair into a single column name
newColumns=["_".join(levels) for levels in CasesAndDeathsMeanAndSumByStateAndWeek.columns]
newColumns

In [None]:
CasesAndDeathsMeanAndSumByStateAndWeek.columns=newColumns

In [None]:
CasesAndDeathsMeanAndSumByStateAndWeek

In [None]:
# turn the grouping variables in the index back into regular columns
CasesAndDeathsMeanAndSumByStateAndWeek.reset_index(inplace=True)
CasesAndDeathsMeanAndSumByStateAndWeek
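The renaming steps above can be avoided with pandas' named aggregation, where each output column is given a name together with the variable and the function that produce it. This is a sketch on hypothetical toy data, not the approach used above; the output names just mimic the ones built by joining the levels:

```python
import pandas as pd

# Hypothetical toy data using the same column names as above
toy = pd.DataFrame({
    'estado': ['SP', 'SP', 'SP', 'RJ'],
    'semanaEpi': [1, 1, 2, 1],
    'casosNovos': [10, 20, 30, 4],
    'obitosNovos': [1, 3, 0, 2],
})

# Each keyword is the new column name; the tuple is (variable, function)
named = (
    toy.groupby(['estado', 'semanaEpi'], as_index=False)
       .agg(casosNovos_sum=('casosNovos', 'sum'),
            casosNovos_mean=('casosNovos', 'mean'),
            obitosNovos_sum=('obitosNovos', 'sum'),
            obitosNovos_mean=('obitosNovos', 'mean'))
)
print(named)
```

The result already has flat column names and the grouping variables as regular columns, so no extra flattening or `reset_index()` is needed.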