# Data Aggregation


Sometimes the analysis calls for a coarser unit than the one the data were recorded in, for example states instead of municipalities. This is when you need the aggregating capabilities in pandas.

We will use data from the Brazilian Ministry of Health's COVID-19 panel:

In [1]:
%%html
<iframe width="700" height="300" src="https://covid.saude.gov.br/" allowfullscreen></iframe>


I downloaded the data for 2022 into the _FilesToAggregate_ folder:

In [2]:
import pandas as pd
import glob
import os

all_names = glob.glob(os.path.join('FilesToAggregate' , "*2022.csv"))
all_names

['FilesToAggregate/HIST_PAINEL_COVIDBR_2022_Parte2_20jul2022.csv',
 'FilesToAggregate/HIST_PAINEL_COVIDBR_2022_Parte1_20jul2022.csv']

In [3]:
dfs=[]
for name in all_names:
    dfs.append(pd.read_csv(name,sep=";"))

Let's check the names:

In [4]:
for df in dfs:
    print(df.columns)

Index(['regiao', 'estado', 'municipio', 'coduf', 'codmun', 'codRegiaoSaude',
       'nomeRegiaoSaude', 'data', 'semanaEpi', 'populacaoTCU2019',
       'casosAcumulado', 'casosNovos', 'obitosAcumulado', 'obitosNovos',
       'Recuperadosnovos', 'emAcompanhamentoNovos', 'interior/metropolitana'],
      dtype='object')
Index(['regiao', 'estado', 'municipio', 'coduf', 'codmun', 'codRegiaoSaude',
       'nomeRegiaoSaude', 'data', 'semanaEpi', 'populacaoTCU2019',
       'casosAcumulado', 'casosNovos', 'obitosAcumulado', 'obitosNovos',
       'Recuperadosnovos', 'emAcompanhamentoNovos', 'interior/metropolitana'],
      dtype='object')
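
Comparing the printed column lists by eye works for two files, but the check can also be automated. The sketch below uses two made-up frames standing in for the CSV parts (`part1`, `part2`, and `toy_dfs` are hypothetical names, not from the notebook) and confirms that every frame has the same columns in the same order before concatenating.

```python
import pandas as pd

# Two toy frames standing in for the two CSV parts (hypothetical data).
part1 = pd.DataFrame({"estado": ["AC", "AL"], "casosNovos": [10, 20]})
part2 = pd.DataFrame({"estado": ["AM", "AP"], "casosNovos": [30, 40]})
toy_dfs = [part1, part2]

# True only if every frame has the same column labels in the same order.
same_columns = all(df.columns.equals(toy_dfs[0].columns) for df in toy_dfs)
print(same_columns)

# Concatenate only after the check passes.
if same_columns:
    combined = pd.concat(toy_dfs, ignore_index=True)
    print(combined.shape)
```

If the check fails, `pd.concat` would still run, but it would fill the non-shared columns with missing values, so it is worth stopping to inspect the files first.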


In [5]:
# the columns match, so stack both parts into a single DataFrame
covid=pd.concat(dfs,ignore_index=True,copy=False)

The combined table has more than a million rows:

In [6]:
covid.shape[0]

1129419

Let's keep the columns needed:

In [7]:
toSelect=['regiao', 'estado', 'municipio','data', 'semanaEpi','casosNovos', 'obitosNovos']
covid=covid[toSelect]

In [19]:
# the data go down to the municipal level; the first rows shown are national totals

covid.head()

,regiao,estado,municipio,data,semanaEpi,casosNovos,obitosNovos
0,Brasil,,,2022-07-01,26,76045,284
1,Brasil,,,2022-07-02,26,37784,158
2,Brasil,,,2022-07-03,27,18575,53
3,Brasil,,,2022-07-04,27,45501,122
4,Brasil,,,2022-07-05,27,74591,396
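
The first rows above are national totals, with `estado` and `municipio` empty. If the files also hold state-level rows (with `municipio` empty) next to the municipal ones, summing every row per `estado` would count the same cases more than once. That is an assumption about the files, not something the output above confirms. A minimal sketch on made-up data of how to keep only municipal rows before aggregating:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one national total, one state total, two municipalities.
toy = pd.DataFrame({
    "regiao": ["Brasil", "Norte", "Norte", "Norte"],
    "estado": [np.nan, "AC", "AC", "AC"],
    "municipio": [np.nan, np.nan, "Rio Branco", "Xapuri"],
    "casosNovos": [30, 30, 20, 10],
})

# Keep only the rows that identify a municipality.
municipal = toy[toy["municipio"].notna()]

# Summing all rows counts the state total and its municipalities together.
all_rows_total = toy.groupby("estado")["casosNovos"].sum()
# Summing municipal rows only counts each case once.
municipal_total = municipal.groupby("estado")["casosNovos"].sum()

print(all_rows_total)
print(municipal_total)
```

Whether this filter is needed depends on how the real files are organized, so it is worth inspecting a few rows per `estado` before trusting the totals.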


We will aggregate by estado, so let's check for missing values first:

In [20]:
covid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1129419 entries, 0 to 1129418
Data columns (total 7 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   regiao       1129419 non-null  object
 1   estado       1129218 non-null  object
 2   municipio    1119570 non-null  object
 3   data         1129419 non-null  object
 4   semanaEpi    1129419 non-null  int64 
 5   casosNovos   1129419 non-null  int64 
 6   obitosNovos  1129419 non-null  int64 
dtypes: int64(3), object(4)
memory usage: 60.3+ MB
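
Missing values in the grouping column matter because `groupby` leaves those rows out by default. The sketch below uses a tiny made-up frame (not the real data) to count missing values per column and to show the effect of the `dropna` argument of `groupby`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the last row has no estado.
toy = pd.DataFrame({
    "estado": ["AC", "AC", np.nan],
    "casosNovos": [1, 2, 100],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Default behaviour: rows whose key is missing are excluded from the result.
default_groups = toy.groupby("estado")["casosNovos"].sum()

# dropna=False keeps a separate group for the missing key.
with_missing = toy.groupby("estado", dropna=False)["casosNovos"].sum()

print(default_groups)
print(with_missing)
```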


Only 201 of the 1,129,419 rows have no estado, so there is plenty of data to work with. Let's see different aggregation alternatives:

In [26]:
# sum of cases by estado
casesSumByState=covid.groupby('estado').agg({'casosNovos': 'sum'})
casesSumByState

estado,casosNovos
AC,98212
AL,147788
AM,330308
AP,91570
BA,739720
CE,776922
DF,613688
ES,1116860
GO,1279760
MA,167962


In [27]:
# sum of cases by estado and week
casesSumByStateAndWeek=covid.groupby(['estado','semanaEpi']).agg({'casosNovos': 'sum'})
casesSumByStateAndWeek

estado,semanaEpi,casosNovos
AC,1,34
AC,2,2464
AC,3,7796
AC,4,8592
AC,5,16090
...,...,...
TO,26,20306
TO,27,12422
TO,28,9914
TO,29,4148


In [28]:
# sum and mean of cases by estado and week
casesSumAndMeanByStateAndWeek=covid.groupby(['estado','semanaEpi']).agg({'casosNovos': ['sum','mean']})
casesSumAndMeanByStateAndWeek

,,casosNovos,casosNovos
estado,semanaEpi,sum,mean
AC,1,34,0.211180
AC,2,2464,15.304348
AC,3,7796,48.422360
AC,4,8592,53.366460
AC,5,16090,99.937888
...,...,...,...
TO,26,20306,20.573455
TO,27,12422,12.585613
TO,28,9914,10.044580
TO,29,4148,7.354610
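
It helps to be clear about what `mean` averages over here: it is the mean across the rows that fall in each group, not a mean per day or per inhabitant. The sketch below, on made-up data, adds `count` to the statistics and recomputes the mean as sum divided by the number of rows.

```python
import pandas as pd

# Hypothetical data: three rows in week 1 and one row in week 2.
toy = pd.DataFrame({
    "estado": ["AC", "AC", "AC", "AC"],
    "semanaEpi": [1, 1, 1, 2],
    "casosNovos": [0, 3, 6, 8],
})

stats = toy.groupby(["estado", "semanaEpi"]).agg(
    {"casosNovos": ["sum", "mean", "count"]}
)

# The group mean is the group sum divided by the number of rows in the group.
recomputed_mean = stats[("casosNovos", "sum")] / stats[("casosNovos", "count")]

print(stats)
print(recomputed_mean)
```

In the real data each group mixes many municipalities and days, so the mean is an average over municipality-day rows.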


In [29]:
# sum of cases and deaths by estado
CasesAndDeathsByState=covid.groupby('estado').agg({'casosNovos': 'sum', 'obitosNovos': 'sum'})
CasesAndDeathsByState

estado,casosNovos,obitosNovos
AC,98212,318
AL,147788,1302
AM,330308,734
AP,91570,254
BA,739720,5438
CE,776922,5106
DF,613688,1406
ES,1116860,2594
GO,1279760,4834
MA,167962,1088


In [36]:
# sum and mean of cases and deaths by estado and week
CasesAndDeathsMeanAndSumByStateAndWeek=covid.groupby(['estado','semanaEpi']).agg({'casosNovos': ['sum','mean'],'obitosNovos':['sum','mean']})
CasesAndDeathsMeanAndSumByStateAndWeek

,,casosNovos,casosNovos,obitosNovos,obitosNovos
estado,semanaEpi,sum,mean,sum,mean
AC,1,34,0.211180,2,0.012422
AC,2,2464,15.304348,4,0.024845
AC,3,7796,48.422360,2,0.012422
AC,4,8592,53.366460,22,0.136646
AC,5,16090,99.937888,56,0.347826
...,...,...,...,...,...
TO,26,20306,20.573455,20,0.020263
TO,27,12422,12.585613,2,0.002026
TO,28,9914,10.044580,6,0.006079
TO,29,4148,7.354610,16,0.028369


In all the previous cases, the grouping columns become the index of the result. To turn them back into regular columns and get a traditional table, reset the index:

In [30]:
CasesAndDeathsByState.reset_index()

,estado,casosNovos,obitosNovos
0,AC,98212,318
1,AL,147788,1302
2,AM,330308,734
3,AP,91570,254
4,BA,739720,5438
5,CE,776922,5106
6,DF,613688,1406
7,ES,1116860,2594
8,GO,1279760,4834
9,MA,167962,1088
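
An alternative to calling `reset_index()` afterwards is to ask `groupby` not to build the index in the first place, with `as_index=False`. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical data with two states.
toy = pd.DataFrame({
    "estado": ["AC", "AC", "AL"],
    "casosNovos": [5, 7, 11],
    "obitosNovos": [0, 1, 2],
})

# as_index=False keeps estado as a regular column in the result.
flat = toy.groupby("estado", as_index=False).agg(
    {"casosNovos": "sum", "obitosNovos": "sum"}
)

print(flat)
```

Both routes give the same table; which one to use is mostly a matter of taste.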


However, when the result has a **multi index** in its columns, getting a flat table takes an extra step:

In [37]:
CasesAndDeathsMeanAndSumByStateAndWeek.head()

,,casosNovos,casosNovos,obitosNovos,obitosNovos
estado,semanaEpi,sum,mean,sum,mean
AC,1,34,0.21118,2,0.012422
AC,2,2464,15.304348,4,0.024845
AC,3,7796,48.42236,2,0.012422
AC,4,8592,53.36646,22,0.136646
AC,5,16090,99.937888,56,0.347826


In [38]:
CasesAndDeathsMeanAndSumByStateAndWeek.columns

MultiIndex([( 'casosNovos',  'sum'),
            ( 'casosNovos', 'mean'),
            ('obitosNovos',  'sum'),
            ('obitosNovos', 'mean')],
           )

Each column label is a (variable, statistic) tuple, so joining the two parts with an underscore gives flat names:

In [43]:
newColumns=["_".join(levels) for levels in CasesAndDeathsMeanAndSumByStateAndWeek.columns]
newColumns

['casosNovos_sum', 'casosNovos_mean', 'obitosNovos_sum', 'obitosNovos_mean']

In [44]:
CasesAndDeathsMeanAndSumByStateAndWeek.columns=newColumns

In [46]:
CasesAndDeathsMeanAndSumByStateAndWeek

estado,semanaEpi,casosNovos_sum,casosNovos_mean,obitosNovos_sum,obitosNovos_mean
AC,1,34,0.211180,2,0.012422
AC,2,2464,15.304348,4,0.024845
AC,3,7796,48.422360,2,0.012422
AC,4,8592,53.366460,22,0.136646
AC,5,16090,99.937888,56,0.347826
...,...,...,...,...,...
TO,26,20306,20.573455,20,0.020263
TO,27,12422,12.585613,2,0.002026
TO,28,9914,10.044580,6,0.006079
TO,29,4148,7.354610,16,0.028369


In [47]:
CasesAndDeathsMeanAndSumByStateAndWeek.reset_index(inplace=True)
CasesAndDeathsMeanAndSumByStateAndWeek

,estado,semanaEpi,casosNovos_sum,casosNovos_mean,obitosNovos_sum,obitosNovos_mean
0,AC,1,34,0.211180,2,0.012422
1,AC,2,2464,15.304348,4,0.024845
2,AC,3,7796,48.422360,2,0.012422
3,AC,4,8592,53.366460,22,0.136646
4,AC,5,16090,99.937888,56,0.347826
...,...,...,...,...,...,...
805,TO,26,20306,20.573455,20,0.020263
806,TO,27,12422,12.585613,2,0.002026
807,TO,28,9914,10.044580,6,0.006079
808,TO,29,4148,7.354610,16,0.028369
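
The same flat table can be produced in one step with named aggregation, where each output column is given as `new_name=(column, statistic)`. The sketch below uses made-up data and reuses the column names from above; it is an alternative to the join-and-reset steps, not a correction of them.

```python
import pandas as pd

# Hypothetical data: two states, one week.
toy = pd.DataFrame({
    "estado": ["AC", "AC", "TO"],
    "semanaEpi": [1, 1, 1],
    "casosNovos": [2, 4, 9],
    "obitosNovos": [0, 2, 1],
})

# Named aggregation: the keyword is the output column name,
# the tuple is (input column, statistic).
flat = toy.groupby(["estado", "semanaEpi"], as_index=False).agg(
    casosNovos_sum=("casosNovos", "sum"),
    casosNovos_mean=("casosNovos", "mean"),
    obitosNovos_sum=("obitosNovos", "sum"),
    obitosNovos_mean=("obitosNovos", "mean"),
)

print(flat)
```

With this form the columns never become a multi index, so there is nothing to flatten afterwards.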


In [48]:
import os

CasesAndDeathsMeanAndSumByStateAndWeek.to_csv(os.path.join("FilesToAggregate","Aggregated_Covid.csv"),index=False)
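
A quick way to confirm that a file was written as intended is to read it back. The sketch below does a round trip on a small made-up table in a temporary folder (the file name `Aggregated_toy.csv` is hypothetical), so it does not touch the real output.

```python
import os
import tempfile

import pandas as pd

# Hypothetical aggregated table.
toy = pd.DataFrame({
    "estado": ["AC", "TO"],
    "semanaEpi": [1, 1],
    "casosNovos_sum": [34, 12],
})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "Aggregated_toy.csv")
    # index=False avoids writing the row numbers as an extra column.
    toy.to_csv(path, index=False)
    restored = pd.read_csv(path)

print(restored)
```

Without `index=False`, the file would gain an unnamed first column holding the old index, which then shows up as `Unnamed: 0` when the file is read again.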