Let's load some COVID data from Brazil:

In [1]:
linkCovid1='https://github.com/DACSS-PreProcessing/DFaggregating/raw/refs/heads/main/datafiles/HIST_PAINEL_COVIDBR_2022_Parte1_20jul2022.csv'
linkCovid2='https://github.com/DACSS-PreProcessing/DFaggregating/raw/refs/heads/main/datafiles/HIST_PAINEL_COVIDBR_2022_Parte2_20jul2022.csv'
dataCovid1=read.csv(linkCovid1,sep = ';')
dataCovid2=read.csv(linkCovid2,sep = ';')


Let's concatenate both data frames:

In [2]:
dataCovid=do.call(rbind,list(dataCovid1, dataCovid2))
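
As a side note, `rbind` stacks data frames by matching column names, so both parts need the same columns. Here is a minimal sketch on made-up data; `part1`, `part2` and their values are hypothetical stand-ins, not the COVID files:

```r
# Toy data frames standing in for the two CSV parts (hypothetical values)
part1 <- data.frame(estado = c("RO", "AC"), casosNovos = c(34, 5))
part2 <- data.frame(estado = c("AM", "AP"), casosNovos = c(12, 7))

# rbind() stacks the rows; columns are matched by name
stacked <- rbind(part1, part2)

# do.call() builds the same call from a list, handy when there are many parts
stackedToo <- do.call(rbind, list(part1, part2))

nrow(stacked)                   # rows of part1 plus rows of part2
identical(stacked, stackedToo)  # both approaches agree
```

With only two parts, `rbind(dataCovid1, dataCovid2)` would likely be enough; the `do.call` form pays off when the list of parts is longer.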

In [4]:
dataCovid$data=strptime(dataCovid$data, "%Y-%m-%d")
dataCovid$day=format(dataCovid$data,"%d")
dataCovid$year=format(dataCovid$data,"%Y")
dataCovid$month=format(dataCovid$data,"%m")
head(dataCovid)

Unnamed: 0_level_0,regiao,estado,municipio,coduf,codmun,codRegiaoSaude,nomeRegiaoSaude,data,semanaEpi,populacaoTCU2019,casosAcumulado,casosNovos,obitosAcumulado,obitosNovos,Recuperadosnovos,emAcompanhamentoNovos,interior.metropolitana,day,year,month
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<int>,<int>,<int>,<chr>,<dttm>,<int>,<int>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<chr>,<chr>,<chr>
1,Brasil,,,76,,,,2022-01-01,52,210147125,22291507,3986,619105,49,21581668,90734,,1,2022,1
2,Brasil,,,76,,,,2022-01-02,1,210147125,22293228,1721,619133,28,21581717,92378,,2,2022,1
3,Brasil,,,76,,,,2022-01-03,1,210147125,22305078,11850,619209,76,21591847,94022,,3,2022,1
4,Brasil,,,76,,,,2022-01-04,1,210147125,22323837,18759,619384,175,21603954,100499,,4,2022,1
5,Brasil,,,76,,,,2022-01-05,1,210147125,22351104,27267,619513,129,21615473,116118,,5,2022,1
6,Brasil,,,76,,,,2022-01-06,1,210147125,22386930,35826,619641,128,21626836,140453,,6,2022,1


In [5]:
saveRDS(dataCovid,"BrasilCovid.rds")

Now, check the data available:

In [None]:
str(dataCovid)

'data.frame':	1129419 obs. of  17 variables:
 $ regiao                : chr  "Brasil" "Brasil" "Brasil" "Brasil" ...
 $ estado                : chr  "" "" "" "" ...
 $ municipio             : chr  "" "" "" "" ...
 $ coduf                 : int  76 76 76 76 76 76 76 76 76 76 ...
 $ codmun                : int  NA NA NA NA NA NA NA NA NA NA ...
 $ codRegiaoSaude        : int  NA NA NA NA NA NA NA NA NA NA ...
 $ nomeRegiaoSaude       : chr  "" "" "" "" ...
 $ data                  : chr  "2022-01-01" "2022-01-02" "2022-01-03" "2022-01-04" ...
 $ semanaEpi             : int  52 1 1 1 1 1 1 1 2 2 ...
 $ populacaoTCU2019      : int  210147125 210147125 210147125 210147125 210147125 210147125 210147125 210147125 210147125 210147125 ...
 $ casosAcumulado        : num  22291507 22293228 22305078 22323837 22351104 ...
 $ casosNovos            : int  3986 1721 11850 18759 27267 35826 63292 49303 24382 34788 ...
 $ obitosAcumulado       : int  619105 619133 619209 619384 619513 619641 619822 6199

Let's keep only the rows where _estado_ is filled in (this drops the national "Brasil" rows):

In [None]:
dataCovid=dataCovid[dataCovid$estado!="",]

Let's keep some columns:

In [None]:

toSelect=c('regiao', 'estado', 'municipio','data', 'semanaEpi','casosNovos', 'obitosNovos')
covid=dataCovid[,toSelect]

head(covid)


Unnamed: 0_level_0,regiao,estado,municipio,data,semanaEpi,casosNovos,obitosNovos
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<int>,<int>,<int>
182,Norte,RO,,2022-01-01,52,34,3
183,Norte,RO,,2022-01-02,1,32,2
184,Norte,RO,,2022-01-03,1,91,3
185,Norte,RO,,2022-01-04,1,254,3
186,Norte,RO,,2022-01-05,1,232,3
187,Norte,RO,,2022-01-06,1,391,8


Let's format the dates, and extract the day, year and month:

In [3]:
covid$data=strptime(covid$data, "%Y-%m-%d")
covid$day=format(covid$data,"%d")
covid$year=format(covid$data,"%Y")
covid$month=format(covid$data,"%m")
head(covid)




Let's find out which years and months are available:

In [None]:
unique(covid$year)

In [None]:
unique(covid$month)

So, we have data from January to July 2022.
Let's compute the **count of new positive cases per month**:

In [None]:
sum(covid$casosNovos[covid$month=='07'])

In [None]:
sum(covid$casosNovos[covid$month=='06'])

In [None]:
sum(covid$casosNovos[covid$month=='05'])

...

In [None]:
sum(covid$casosNovos[covid$month=='01'])

We use **aggregation** to simplify the previous steps:

In [None]:
# sum of cases by month
casesSumByMonth=aggregate(data=covid,casosNovos~month,sum)
casesSumByMonth

month,casosNovos
<chr>,<int>
1,6278446
2,6721752
3,2320550
4,1000682
5,1141604
6,2677960
7,2192552
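
To see that aggregation really replaces the month-by-month sums, here is a small sketch on toy data (the `toy` data frame and its values are made up for illustration):

```r
# Toy data (hypothetical values); month is stored as text, like covid$month
toy <- data.frame(month = c("01", "01", "02", "02", "02"),
                  casosNovos = c(10, 20, 1, 2, 3))

# One subset-and-sum per month, as in the manual approach
manualJan <- sum(toy$casosNovos[toy$month == "01"])
manualFeb <- sum(toy$casosNovos[toy$month == "02"])

# The same totals in a single call: one output row per month
byMonth <- aggregate(data = toy, casosNovos ~ month, sum)
byMonth
```

The formula reads "casosNovos grouped by month", and the last argument is the function applied within each group.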


**AGGREGATING** capabilities allow us to produce useful output with little code:

* **The groupings**:

In the last example, _month_ was the **grouping** variable. We can have more than one of those:

In [None]:
# sum of cases by estado and month
casesSumByStateAndMonth=aggregate(data=covid,casosNovos~estado + month,sum)
casesSumByStateAndMonth

estado,month,casosNovos
<chr>,<chr>,<int>
AC,01,25752
AL,01,32320
AM,01,192126
AP,01,49954
BA,01,184908
CE,01,336702
DF,01,173908
ES,01,426590
GO,01,178684
MA,01,31844


* **The function to apply**:

We can have more than one function:

In [None]:
# sum and mean of cases by estado and week
casesSumAndMeanByStateAndWeek=aggregate(data=covid,casosNovos~estado + semanaEpi,
          function(x) c(mean = mean(x), sum = sum(x) ) )


head(casesSumAndMeanByStateAndWeek,30)

Unnamed: 0_level_0,estado,semanaEpi,casosNovos
Unnamed: 0_level_1,<chr>,<int>,"<dbl[,2]>"
1,AC,1,"0.2111801, 34"
2,AL,1,"1.8159341, 1322"
3,AM,1,"6.9569161, 3068"
4,AP,1,"7.8487395, 934"
5,BA,1,"3.9570406, 11606"
6,CE,1,"6.0890937, 7928"
7,DF,1,"729.2857143, 10210"
8,ES,1,"16.7678571, 9390"
9,GO,1,"6.2926267, 10924"
10,MA,1,"1.6086106, 2466"


...or better, with the mean and the sum as separate columns:

In [None]:
casesSumAndMeanByStateAndWeek=do.call(data.frame, aggregate(data=covid,casosNovos~estado + semanaEpi,
function(x) c(mean = mean(x), sum = sum(x) ) ))
head(casesSumAndMeanByStateAndWeek,30)

Unnamed: 0_level_0,estado,semanaEpi,casosNovos.mean,casosNovos.sum
Unnamed: 0_level_1,<chr>,<int>,<dbl>,<dbl>
1,AC,1,0.2111801,34
2,AL,1,1.8159341,1322
3,AM,1,6.9569161,3068
4,AP,1,7.8487395,934
5,BA,1,3.9570406,11606
6,CE,1,6.0890937,7928
7,DF,1,729.2857143,10210
8,ES,1,16.7678571,9390
9,GO,1,6.2926267,10924
10,MA,1,1.6086106,2466
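
Why the `do.call(data.frame, ...)` step helps: when the function returns several values, `aggregate` stores them in a single matrix column. A small sketch on toy data (values are made up) shows the difference:

```r
# Toy data (hypothetical values)
toy <- data.frame(estado = c("AC", "AC", "AL", "AL"),
                  casosNovos = c(2, 4, 10, 30))

# The function returns two values, so aggregate() keeps them in ONE matrix column
nested <- aggregate(data = toy, casosNovos ~ estado,
                    function(x) c(mean = mean(x), sum = sum(x)))
ncol(nested)     # estado plus a single matrix column
names(nested)

# do.call(data.frame, ...) expands that matrix into ordinary columns
flat <- do.call(data.frame, nested)
names(flat)
```

The flattened version is usually easier to filter, sort or export, since each statistic has its own column name.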


* **The variables transformed**:

We can apply the function to more than one variable:

In [None]:
# sum of cases and deaths by estado

CasesAndDeathsByState=aggregate(data=covid,
                                cbind(casosNovos,obitosNovos)~estado,
                                sum)

head(CasesAndDeathsByState,30)

Unnamed: 0_level_0,estado,casosNovos,obitosNovos
Unnamed: 0_level_1,<chr>,<int>,<int>
1,AC,98212,318
2,AL,147788,1302
3,AM,330308,734
4,AP,91570,254
5,BA,739720,5438
6,CE,776922,5106
7,DF,613688,1406
8,ES,1116860,2594
9,GO,1279760,4834
10,MA,167962,1088
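
A short sketch on toy data (made-up values) of how the left-hand side of the formula selects the variables; the dot shortcut is an alternative not used above:

```r
# Toy data (hypothetical values)
toy <- data.frame(estado = c("AC", "AC", "AL"),
                  casosNovos = c(2, 4, 10),
                  obitosNovos = c(0, 1, 3))

# cbind() on the left-hand side lists the variables to aggregate
viaCbind <- aggregate(data = toy, cbind(casosNovos, obitosNovos) ~ estado, sum)

# A dot on the left-hand side means "every column not used for grouping"
viaDot <- aggregate(data = toy, . ~ estado, sum)

viaCbind
viaDot
```

The dot form is only convenient when every remaining column can be summed, which is not the case for the full _covid_ data frame (it has text and date columns), so `cbind` is the safer choice there.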


* **The function according to the variable**:

The function can vary according to the variable. A single `aggregate` call applies the same function to every variable, so here we use **dplyr**:

In [None]:
library(dplyr)
covid |>
  group_by(month) |>
  summarize(casosNovos_VAR = var(casosNovos),
            casosNovos_SD = sd(casosNovos),
            obitosNovos_Median = median(obitosNovos),
            obitosNovos_Mean = mean(obitosNovos))


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




month,casosNovos_VAR,casosNovos_SD,obitosNovos_Median,obitosNovos_Mean
<chr>,<dbl>,<dbl>,<dbl>,<dbl>
1,326469.2,571.3748,0,0.09281227
2,306066.65,553.2329,0,0.28219244
3,72432.53,269.1329,0,0.11970739
4,82381.53,287.0218,0,0.04438116
5,89456.42,299.0927,0,0.03650708
6,61557.08,248.107,0,0.05624778
7,74893.74,273.6672,0,0.08545746
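
For comparison, a similar result can be assembled in base R by running one `aggregate` per function and merging the pieces. This is only a sketch on toy data (values and the output column names are made up), not a replacement for the dplyr code:

```r
# Toy data (hypothetical values)
toy <- data.frame(month = c("01", "01", "01", "02", "02", "02"),
                  casosNovos = c(1, 2, 3, 10, 20, 30),
                  obitosNovos = c(0, 0, 3, 1, 5, 9))

# One aggregate() call per variable/function pair
casesSD <- aggregate(data = toy, casosNovos ~ month, sd)
deathsMedian <- aggregate(data = toy, obitosNovos ~ month, median)

# Join the pieces on the grouping variable and give the columns clear names
perMonth <- merge(casesSD, deathsMedian, by = "month")
names(perMonth) <- c("month", "casosNovos_SD", "obitosNovos_Median")
perMonth
```

Each extra statistic needs another `aggregate` and another `merge`, which is why the `summarize` approach tends to read better as the number of statistics grows.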
