<img src="https://i.imgur.com/6U6q5jQ.png"/>

# Data Organization


Let me get the data on dengue from [Peru](https://www.datosabiertos.gob.pe/dataset/vigilancia-epidemiol%C3%B3gica-de-dengue):

In [1]:
import pandas as pd
linkData="https://github.com/SocialAnalytics-StrategicIntelligence/OrganizeExploreAndQuery/raw/main/dataFiles/dengue_ok.pkl"
dengue = pd.read_pickle(linkData)
dengue.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398943 entries, 0 to 398942
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   departamento  398943 non-null  object        
 1   provincia     398943 non-null  object        
 2   distrito      398943 non-null  object        
 3   ano           398943 non-null  int64         
 4   semana        398943 non-null  int64         
 5   sexo          398943 non-null  object        
 6   edad          398943 non-null  int64         
 7   enfermedad    398943 non-null  category      
 8   year          398931 non-null  datetime64[ns]
dtypes: category(1), datetime64[ns](1), int64(3), object(4)
memory usage: 24.7+ MB


In [2]:
# keep only the numeric columns: the float format cannot be applied to the datetime column 'year'
dengue.select_dtypes('number').describe().apply(lambda s: s.apply('{0:.5f}'.format))

,ano,semana,edad
count,398943.0,398943.0,398943.0
mean,2015.0617,22.61685,29.97476
std,6.14862,14.89333,18.5326
min,2000.0,1.0,0.0
25%,2011.0,11.0,15.0
50%,2016.0,19.0,27.0
75%,2020.0,34.0,42.0
max,2022.0,53.0,106.0


Each row is a person:

In [3]:
dengue.head()

,departamento,provincia,distrito,ano,semana,sexo,edad,enfermedad,year
0,HUANUCO,LEONCIO PRADO,LUYANDO,2000,47,M,9,SIN_SEÑALES,2000-01-01
1,HUANUCO,LEONCIO PRADO,LUYANDO,2000,40,F,18,SIN_SEÑALES,2000-01-01
2,HUANUCO,LEONCIO PRADO,JOSE CRESPO Y CASTILLO,2000,48,F,32,SIN_SEÑALES,2000-01-01
3,HUANUCO,LEONCIO PRADO,JOSE CRESPO Y CASTILLO,2000,37,F,40,SIN_SEÑALES,2000-01-01
4,HUANUCO,LEONCIO PRADO,MARIANO DAMASO BERAUN,2000,42,M,16,SIN_SEÑALES,2000-01-01


If we wanted to count people, creating a column of ones helps:

In [4]:
dengue=dengue.assign(case=1)
dengue.head()

,departamento,provincia,distrito,ano,semana,sexo,edad,enfermedad,year,case
0,HUANUCO,LEONCIO PRADO,LUYANDO,2000,47,M,9,SIN_SEÑALES,2000-01-01,1
1,HUANUCO,LEONCIO PRADO,LUYANDO,2000,40,F,18,SIN_SEÑALES,2000-01-01,1
2,HUANUCO,LEONCIO PRADO,JOSE CRESPO Y CASTILLO,2000,48,F,32,SIN_SEÑALES,2000-01-01,1
3,HUANUCO,LEONCIO PRADO,JOSE CRESPO Y CASTILLO,2000,37,F,40,SIN_SEÑALES,2000-01-01,1
4,HUANUCO,LEONCIO PRADO,MARIANO DAMASO BERAUN,2000,42,M,16,SIN_SEÑALES,2000-01-01,1
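
To see why a column of ones helps with counting, here is a minimal sketch on a tiny made-up frame (the `toy` data and its values are assumptions for illustration, not taken from the dengue file). Summing the ones within each group gives the same result as counting the rows of that group:

```python
import pandas as pd

# Toy data (made up for illustration): one row per reported case
toy = pd.DataFrame({
    "provincia": ["A", "A", "B", "A", "B"],
    "sexo": ["M", "F", "F", "M", "M"],
})

# A constant column of ones: summing it counts rows
toy = toy.assign(case=1)

# Sum of ones per group...
by_sum = toy.groupby("provincia")["case"].sum()

# ...matches the number of rows per group
by_size = toy.groupby("provincia").size()

print(by_sum)
print(by_size)
```

The advantage of the explicit column is that it can sit in an aggregation dictionary alongside other columns and functions.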


Let's start creating _data from these data_!

## Aggregation

In [15]:
indexList=['departamento', 'provincia', 'ano', 'semana','enfermedad']
aggregator={'case': ['sum']}
ByProvinceWeek_AllCases=dengue.groupby(indexList,observed=True).agg(aggregator)
ByProvinceWeek_AllCases

,,,,,case
,,,,,sum
departamento,provincia,ano,semana,enfermedad,
AMAZONAS,BAGUA,2000,18,SIN_SEÑALES,17
AMAZONAS,BAGUA,2000,19,SIN_SEÑALES,40
AMAZONAS,BAGUA,2000,20,SIN_SEÑALES,58
AMAZONAS,BAGUA,2000,21,SIN_SEÑALES,27
AMAZONAS,BAGUA,2000,22,SIN_SEÑALES,24
...,...,...,...,...,...
UCAYALI,PADRE ABAD,2022,52,SIN_SEÑALES,5
UCAYALI,PADRE ABAD,2022,52,ALARMA,2
UCAYALI,PURUS,2020,51,SIN_SEÑALES,1
UCAYALI,PURUS,2022,28,ALARMA,1


Notice that the columns are now a multi-index (column name plus aggregation function):

In [16]:
ByProvinceWeek_AllCases.columns

MultiIndex([('case', 'sum')],
           )

We can replace that multi-index with a single, flat column name:

In [18]:
ByProvinceWeek_AllCases.columns=['cases']
ByProvinceWeek_AllCases

,,,,,cases
departamento,provincia,ano,semana,enfermedad,
AMAZONAS,BAGUA,2000,18,SIN_SEÑALES,17
AMAZONAS,BAGUA,2000,19,SIN_SEÑALES,40
AMAZONAS,BAGUA,2000,20,SIN_SEÑALES,58
AMAZONAS,BAGUA,2000,21,SIN_SEÑALES,27
AMAZONAS,BAGUA,2000,22,SIN_SEÑALES,24
...,...,...,...,...,...
UCAYALI,PADRE ABAD,2022,52,SIN_SEÑALES,5
UCAYALI,PADRE ABAD,2022,52,ALARMA,2
UCAYALI,PURUS,2020,51,SIN_SEÑALES,1
UCAYALI,PURUS,2022,28,ALARMA,1
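
As an alternative sketch, pandas' named aggregation produces flat column names directly, so no renaming step is needed. The `toy` frame below is made up for illustration; only the `groupby(...).agg(...)` calls mirror the ones used above:

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "provincia": ["A", "A", "B", "B", "B"],
    "semana": [1, 1, 1, 2, 2],
    "case": [1, 1, 1, 1, 1],
})

keys = ["provincia", "semana"]

# Dictionary-of-lists aggregation: the columns become a MultiIndex
nested = toy.groupby(keys).agg({"case": ["sum"]})
print(nested.columns)

# Named aggregation, new_name=(source_column, function): the columns stay flat
flat = toy.groupby(keys).agg(cases=("case", "sum"))
print(flat.columns)
```

Either route gives the same numbers; named aggregation just skips the step of overwriting `.columns`.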


In [19]:
ByProvinceWeek_AllCases.reset_index(drop=False,inplace=True)

ByProvinceWeek_AllCases

,departamento,provincia,ano,semana,enfermedad,cases
0,AMAZONAS,BAGUA,2000,18,SIN_SEÑALES,17
1,AMAZONAS,BAGUA,2000,19,SIN_SEÑALES,40
2,AMAZONAS,BAGUA,2000,20,SIN_SEÑALES,58
3,AMAZONAS,BAGUA,2000,21,SIN_SEÑALES,27
4,AMAZONAS,BAGUA,2000,22,SIN_SEÑALES,24
...,...,...,...,...,...,...
36417,UCAYALI,PADRE ABAD,2022,52,SIN_SEÑALES,5
36418,UCAYALI,PADRE ABAD,2022,52,ALARMA,2
36419,UCAYALI,PURUS,2020,51,SIN_SEÑALES,1
36420,UCAYALI,PURUS,2022,28,ALARMA,1


In [59]:
ByProvinceWeek_AllCases.enfermedad.cat.categories,ByProvinceWeek_AllCases.enfermedad.cat.ordered

(Index(['SIN_SEÑALES', 'ALARMA', 'GRAVE'], dtype='object'), True)

In [60]:
ByProvinceWeek_AllCases.to_pickle('dataFiles/ByProvinceWeek_AllCases.pkl')

## Reshaping

### From Long to Wide

The object *ByProvinceWeek_AllCases* is in long format: the values sit in one column (`cases`), and the other columns serve as identifiers. Let's make a simple wide version, with one column per value of `enfermedad`:

In [21]:
ByProvinceWeek_AllCases.pivot_table(values='cases',
                            index=['departamento','provincia'],
                            columns='enfermedad',aggfunc="sum")

,enfermedad,SIN_SEÑALES,ALARMA,GRAVE
departamento,provincia,,,
AMAZONAS,ALTO AMAZONAS,0,0,0
AMAZONAS,AMBO,0,0,0
AMAZONAS,AREQUIPA,0,0,0
AMAZONAS,ASCOPE,0,0,0
AMAZONAS,ATALAYA,0,0,0
...,...,...,...,...
UCAYALI,TUMBES,0,0,0
UCAYALI,UCAYALI,0,0,0
UCAYALI,UTCUBAMBA,0,0,0
UCAYALI,VIRU,0,0,0
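
To make the long-to-wide step easier to follow, here is a minimal sketch on a small made-up frame (`long_toy` and its numbers are assumptions for illustration). Each distinct value of `enfermedad` becomes a column, repeated combinations are summed, and `fill_value=0` fills the combinations that never occur:

```python
import pandas as pd

# Toy long-format data (made up for illustration)
long_toy = pd.DataFrame({
    "provincia": ["A", "A", "B", "B", "B"],
    "enfermedad": ["SIN_SEÑALES", "ALARMA", "SIN_SEÑALES", "SIN_SEÑALES", "GRAVE"],
    "cases": [5, 2, 7, 1, 3],
})

# One row per provincia, one column per enfermedad value
wide_toy = long_toy.pivot_table(values="cases",
                                index="provincia",
                                columns="enfermedad",
                                aggfunc="sum",
                                fill_value=0)
print(wide_toy)
```

Here `enfermedad` holds plain strings, so the new columns come out in alphabetical order; with an ordered categorical, as in the dengue data, they follow the category order instead.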


For a simpler structure, let's add the year to the index and then turn the index back into regular columns:

In [30]:
ByProvinceWeek_AllCases_Wide=ByProvinceWeek_AllCases.pivot_table(values='cases',
                            index=['departamento', 'provincia','ano'],
                            columns='enfermedad',aggfunc="sum").reset_index(drop=False)
ByProvinceWeek_AllCases_Wide

enfermedad,departamento,provincia,ano,SIN_SEÑALES,ALARMA,GRAVE
0,AMAZONAS,ALTO AMAZONAS,2000,0,0,0
1,AMAZONAS,ALTO AMAZONAS,2001,0,0,0
2,AMAZONAS,ALTO AMAZONAS,2002,0,0,0
3,AMAZONAS,ALTO AMAZONAS,2003,0,0,0
4,AMAZONAS,ALTO AMAZONAS,2004,0,0,0
...,...,...,...,...,...,...
55655,UCAYALI,ZARUMILLA,2018,0,0,0
55656,UCAYALI,ZARUMILLA,2019,0,0,0
55657,UCAYALI,ZARUMILLA,2020,0,0,0
55658,UCAYALI,ZARUMILLA,2021,0,0,0


The columns are a plain index, but they still carry the name `enfermedad` left over from the pivot. Let's remove it:

In [36]:
ByProvinceWeek_AllCases_Wide.columns

Index(['departamento', 'provincia', 'ano', 'SIN_SEÑALES', 'ALARMA', 'GRAVE'], dtype='object')

In [37]:
ByProvinceWeek_AllCases_Wide.columns.name = None 

In [38]:
ByProvinceWeek_AllCases_Wide

,departamento,provincia,ano,SIN_SEÑALES,ALARMA,GRAVE
0,AMAZONAS,ALTO AMAZONAS,2000,0,0,0
1,AMAZONAS,ALTO AMAZONAS,2001,0,0,0
2,AMAZONAS,ALTO AMAZONAS,2002,0,0,0
3,AMAZONAS,ALTO AMAZONAS,2003,0,0,0
4,AMAZONAS,ALTO AMAZONAS,2004,0,0,0
...,...,...,...,...,...,...
55655,UCAYALI,ZARUMILLA,2018,0,0,0
55656,UCAYALI,ZARUMILLA,2019,0,0,0
55657,UCAYALI,ZARUMILLA,2020,0,0,0
55658,UCAYALI,ZARUMILLA,2021,0,0,0
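
The leftover columns name can also be dropped without assigning to `columns.name`. The sketch below uses a made-up frame (`long_toy` is an assumption for illustration) and `rename_axis`, which returns a new frame and therefore fits at the end of a method chain:

```python
import pandas as pd

# Toy long-format data (made up for illustration)
long_toy = pd.DataFrame({
    "provincia": ["A", "A", "B"],
    "enfermedad": ["ALARMA", "GRAVE", "ALARMA"],
    "cases": [2, 1, 4],
})

wide_toy = (long_toy
            .pivot_table(values="cases", index="provincia",
                         columns="enfermedad", aggfunc="sum")
            .reset_index())

# After the pivot, the columns axis is still named 'enfermedad'
print(wide_toy.columns.name)

# rename_axis(columns=None) returns a copy whose columns axis has no name
clean = wide_toy.rename_axis(columns=None)
print(clean.columns.name)
```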


In [40]:
ByProvinceWeek_AllCases_Wide.to_csv('dataFiles/ByProvinceWeek_AllCases_Wide.csv',index=False)

### Wide to Long

We should be able to transform this wide version back into a long one:

In [42]:
# probably not this one: with only 'departamento' as the index, stack() also stacks 'provincia' and 'ano'
ByProvinceWeek_AllCases_Wide.set_index('departamento').stack().reset_index()

,departamento,level_1,0
0,AMAZONAS,provincia,ALTO AMAZONAS
1,AMAZONAS,ano,2000
2,AMAZONAS,SIN_SEÑALES,0
3,AMAZONAS,ALARMA,0
4,AMAZONAS,GRAVE,0
...,...,...,...
278295,UCAYALI,provincia,ZARUMILLA
278296,UCAYALI,ano,2022
278297,UCAYALI,SIN_SEÑALES,0
278298,UCAYALI,ALARMA,0


In [46]:
ByProvinceWeek_AllCases_Long=ByProvinceWeek_AllCases_Wide.set_index(['departamento','provincia','ano']).stack().reset_index()
ByProvinceWeek_AllCases_Long

,departamento,provincia,ano,level_3,0
0,AMAZONAS,ALTO AMAZONAS,2000,SIN_SEÑALES,0
1,AMAZONAS,ALTO AMAZONAS,2000,ALARMA,0
2,AMAZONAS,ALTO AMAZONAS,2000,GRAVE,0
3,AMAZONAS,ALTO AMAZONAS,2001,SIN_SEÑALES,0
4,AMAZONAS,ALTO AMAZONAS,2001,ALARMA,0
...,...,...,...,...,...
166975,UCAYALI,ZARUMILLA,2021,ALARMA,0
166976,UCAYALI,ZARUMILLA,2021,GRAVE,0
166977,UCAYALI,ZARUMILLA,2022,SIN_SEÑALES,0
166978,UCAYALI,ZARUMILLA,2022,ALARMA,0


In [47]:
ByProvinceWeek_AllCases_Long.rename(columns={'level_3':'status',0:'cases'},inplace=True)
ByProvinceWeek_AllCases_Long

,departamento,provincia,ano,status,cases
0,AMAZONAS,ALTO AMAZONAS,2000,SIN_SEÑALES,0
1,AMAZONAS,ALTO AMAZONAS,2000,ALARMA,0
2,AMAZONAS,ALTO AMAZONAS,2000,GRAVE,0
3,AMAZONAS,ALTO AMAZONAS,2001,SIN_SEÑALES,0
4,AMAZONAS,ALTO AMAZONAS,2001,ALARMA,0
...,...,...,...,...,...
166975,UCAYALI,ZARUMILLA,2021,ALARMA,0
166976,UCAYALI,ZARUMILLA,2021,GRAVE,0
166977,UCAYALI,ZARUMILLA,2022,SIN_SEÑALES,0
166978,UCAYALI,ZARUMILLA,2022,ALARMA,0
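
As a hedged alternative to `set_index(...).stack()`, `melt` does the wide-to-long step and names the new columns in a single call. The frame `wide_toy` below is made up for illustration; the names `status` and `cases` follow the renaming used above:

```python
import pandas as pd

# Toy wide-format data (made up for illustration)
wide_toy = pd.DataFrame({
    "provincia": ["A", "B"],
    "ano": [2000, 2000],
    "SIN_SEÑALES": [5, 8],
    "ALARMA": [2, 0],
})

# id_vars stay as identifier columns; every other column is unpivoted
long_melt = wide_toy.melt(id_vars=["provincia", "ano"],
                          var_name="status",
                          value_name="cases")
print(long_melt)
```

Note that `melt` lists all rows for the first status, then all rows for the next one, whereas `stack` interleaves the statuses within each identifier; sort the result if the row order matters.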
