<img src="https://i.imgur.com/6U6q5jQ.png"/>

# Data Reshaping in Python


Let me get the data on dengue from [Peru](https://www.datosabiertos.gob.pe/dataset/vigilancia-epidemiol%C3%B3gica-de-dengue):

In [None]:
import pandas as pd
import os

dengue = pd.read_csv(os.path.join('FilesToReshape' , "datos_abiertos_vigilancia_dengue.csv"),on_bad_lines='warn')

Pandas offers **on_bad_lines='warn'** to report the lines it cannot parse, typically because they have more fields than the header. Those lines are skipped; as you can see, 8 lines were omitted. This is what you have now:

In [None]:
dengue.shape
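
To see what `on_bad_lines` does in isolation, here is a minimal sketch on made-up CSV text (the text and the name `toy` are illustrative, not part of the dengue file). The second data line has one field too many, so it is dropped. Depending on the pandas version, the notice shows up as a Python warning or as a message on standard error.

```python
import io
import pandas as pd

# Made-up CSV text: the header has 3 fields, the second data line has 4
raw = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

# 'warn' skips the malformed line and reports it; 'skip' would drop it silently
toy = pd.read_csv(io.StringIO(raw), on_bad_lines="warn")
print(toy.shape)  # only the well-formed rows remain
```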

You can also try reading the file in a different way:

In [None]:
dengue2=pd.read_table(os.path.join('FilesToReshape' , "datos_abiertos_vigilancia_dengue.csv"))
dengue2

You did not get a warning, and in fact you got 8 more rows. That is because `read_table` splits on tabs by default, so each line is read as a single column. You can try to identify what is wrong:

In [None]:
dengue2.iloc[87867:87873,0]

In [None]:
# search for the odd text seen above (a literal backslash followed by a comma):
dengue2[dengue2.iloc[:,0].str.contains("I\\,II",regex=False)]
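
A minimal sketch on made-up text (not the dengue file) reproduces this situation: with the default tab separator each line becomes one string, and counting commas per line is one way to flag the odd ones. The names `one_col` and `suspicious` are illustrative.

```python
import io
import pandas as pd

# Made-up comma-separated text; the second data line holds an extra, escaped comma
raw = "a,b,c\n1,2,3\n4,5\\,6,7\n8,9,10\n"

# The default separator is a tab, so each line is read as one string column
one_col = pd.read_table(io.StringIO(raw))
print(one_col.shape)

# Count commas per line and keep the lines that differ from the most common count
n_commas = one_col.iloc[:, 0].str.count(",")
suspicious = one_col[n_commas != n_commas.mode()[0]]
print(suspicious)
```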

I have prepared a cleaner version:

In [None]:
dengue = pd.read_csv(os.path.join('FilesToReshape' , "datos_abiertos_vigilancia_dengue_ok.csv"))
dengue

Let me select a subset of columns:

In [None]:
toSelect=['departamento', 'provincia', 'distrito','ano', 'semana', 'sexo','enfermedad']
dengue=dengue[toSelect].copy() # a copy, so later in-place changes do not act on a slice
dengue.head()

Since we know there were issues with the text, let's check the department values:

In [None]:
dengue.departamento.value_counts().index.to_list()

The value `\N` is a placeholder for missing data. Let's replace it in the whole data frame:

In [None]:
dengue.replace('\\N',None,regex=False,inplace=True)

Now, we keep only the complete rows:

In [None]:
dengue.dropna(how='any',inplace=True,ignore_index=True)

In [None]:
dengue.info()
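
An alternative, assuming you know the placeholder in advance, is to declare it when reading the file with the `na_values` parameter of `read_csv`. A minimal sketch on made-up text (the rows and the name `toy` are illustrative):

```python
import io
import pandas as pd

# Made-up CSV text where \N marks a missing value
raw = "departamento,sexo\nLIMA,F\n\\N,M\nPIURA,\\N\nLIMA,M\n"

# Treat \N as missing while reading, then keep only the complete rows
toy = pd.read_csv(io.StringIO(raw), na_values=["\\N"])
toy = toy.dropna(how="any").reset_index(drop=True)
print(toy)
```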

The data is about people, but since there is no identifier for a person, it is possible that rows are repeated:

In [None]:
dengue[dengue.duplicated(keep=False)].sort_values(by=['distrito','semana','sexo'])
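
The argument `keep=False` marks every member of a repeated group, while the default (`keep='first'`) marks only the later copies. A small sketch with made-up rows shows the difference:

```python
import pandas as pd

# Made-up rows: the first two are identical
toy = pd.DataFrame({"distrito": ["A", "A", "B", "A"],
                    "semana":   [1, 1, 1, 2],
                    "sexo":     ["F", "F", "M", "F"]})

all_copies = toy.duplicated(keep=False)  # flags both identical rows
later_only = toy.duplicated()            # flags only the second one
print(all_copies.sum(), later_only.sum())
```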

To ease the count, we could add a column of ones:

In [None]:
dengue=dengue.assign(case=1)
dengue

At this stage, we should aggregate the data:

In [None]:
CasesByWeek=dengue.groupby(['departamento', 'provincia', 'distrito','ano', 'semana','sexo','enfermedad']).agg({'case': ['sum']})
CasesByWeek

We can turn that multi-index structure into a simpler one:

In [None]:
CasesByWeek.columns=['cases'] # new name for the only column

CasesByWeek.reset_index(drop=False,inplace=True)

CasesByWeek
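
As a possible shortcut, named aggregation together with `as_index=False` gives a flat result in one step. This is a sketch on made-up rows, not a replacement for the steps above:

```python
import pandas as pd

# Made-up rows, one per case
toy = pd.DataFrame({"departamento": ["LIMA", "LIMA", "PIURA"],
                    "sexo": ["F", "F", "M"],
                    "case": [1, 1, 1]})

# as_index=False keeps the keys as columns; cases=(...) names the result directly
flat = toy.groupby(["departamento", "sexo"], as_index=False).agg(cases=("case", "sum"))
print(flat)
```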

## Reshaping

### From Long to Wide

The object *CasesByWeek* holds the values in one column, and the other columns serve as identifiers (an index). Let's make a simple wide version (one index):

In [None]:
CasesByWeek.pivot_table(values='cases',
                            index=['departamento'],
                            columns='enfermedad',aggfunc="sum")

The reshaping with two keys:

In [None]:
CasesByWeek.pivot_table(values='cases',
                            index=['departamento', 'provincia'],
                            columns='enfermedad',aggfunc="sum")

The reshaping with two keys and two column levels:

In [None]:
CasesByWeek.pivot_table(values='cases',
                            index=['departamento', 'provincia'],
                            columns=['enfermedad','sexo'],aggfunc="sum")

Have you noticed that the more keys the more missing values?
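
The missing values appear because each extra key multiplies the number of cells, and many combinations never occur in the data. A minimal sketch with made-up rows (the illness labels `X` and `Y` are hypothetical); `fill_value=0` is one option when an absent combination really means zero cases.

```python
import pandas as pd

toy = pd.DataFrame({"departamento": ["LIMA", "LIMA", "PIURA"],
                    "enfermedad": ["X", "Y", "X"],
                    "sexo": ["F", "M", "M"],
                    "cases": [3, 2, 5]})

# Cells with no matching rows come out as NaN
wide = toy.pivot_table(values="cases", index="departamento",
                       columns=["enfermedad", "sexo"], aggfunc="sum")
print(wide.isna().sum().sum())

# Same table, with absent combinations shown as 0
wide0 = toy.pivot_table(values="cases", index="departamento",
                        columns=["enfermedad", "sexo"], aggfunc="sum", fill_value=0)
print(wide0)
```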

Let's keep this last one, and turn its index back into regular columns:

In [None]:
CasesByWeek_Wide=CasesByWeek.pivot_table(values='cases',
                            index=['departamento', 'provincia'],
                            columns=['enfermedad','sexo'],aggfunc="sum").reset_index(drop=False)
CasesByWeek_Wide

The columns are now a multi-index; let's flatten them:

In [None]:
CasesByWeek_Wide.columns

In [None]:
# the columns are a MultiIndex, so the level names live in .names
CasesByWeek_Wide.columns.names = [None, None]

Now, concatenate the tuples:

In [None]:
["_".join(pair) for pair in CasesByWeek_Wide.columns[2:]]

In [None]:
# create the newNames
newNames=['departamento','provincia']
newNames.extend(["_".join(pair) for pair in CasesByWeek_Wide.columns[2:]])
newNames

In [None]:
# renaming
CasesByWeek_Wide.columns=newNames
CasesByWeek_Wide.columns
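
The same flattening can be sketched in one step on a made-up table (labels are hypothetical). Joining every tuple and stripping the trailing underscore also handles the key columns, whose second level is an empty string, so no slicing by position is needed.

```python
import pandas as pd

toy = pd.DataFrame({"departamento": ["LIMA", "LIMA", "PIURA"],
                    "enfermedad": ["X", "Y", "X"],
                    "sexo": ["F", "M", "M"],
                    "cases": [3, 2, 5]})
wide = toy.pivot_table(values="cases", index="departamento",
                       columns=["enfermedad", "sexo"], aggfunc="sum").reset_index()

# Key columns look like ('departamento', ''), so the join leaves a trailing '_'
wide.columns = ["_".join(pair).rstrip("_") for pair in wide.columns]
print(wide.columns.to_list())
```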

We could reformat the column names:

In [None]:
CasesByWeek_Wide.columns.str.title().str.replace('\\s','',regex=True).str.replace("Dengue","",regex=False)

In [None]:
# last step 
CasesByWeek_Wide.columns=CasesByWeek_Wide.columns.str.title().str.replace('\\s','',regex=True).str.replace("Dengue","",regex=False)

CasesByWeek_Wide
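
To follow what each string method in that chain does, here is a step-by-step sketch on hypothetical column names (similar in shape to the real ones, but made up):

```python
import pandas as pd

# Hypothetical column names
cols = pd.Index(["departamento", "DENGUE GRAVE_F", "DENGUE GRAVE_M"])

step1 = cols.str.title()                              # capitalize each word
step2 = step1.str.replace(r"\s", "", regex=True)      # drop whitespace
step3 = step2.str.replace("Dengue", "", regex=False)  # drop the common prefix
print(step3.to_list())
```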

### Wide to Long

We should be able to transform this wide version into a long one:

In [None]:
# maybe not this one: 'Provincia' gets stacked together with the counts
CasesByWeek_Wide.set_index('Departamento').stack().reset_index()

In [None]:
CasesByWeek_Long=CasesByWeek_Wide.set_index(['Departamento','Provincia']).stack().reset_index()
CasesByWeek_Long

In [None]:
CasesByWeek_Long.rename(columns={'level_2':'status',0:'cases'},inplace=True)
CasesByWeek_Long
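
`melt` is another route to the long format. The sketch below uses a made-up wide table (names and numbers are illustrative) and also splits the combined label back into two columns, assuming the label contains exactly one underscore.

```python
import pandas as pd

toy_wide = pd.DataFrame({"Departamento": ["LIMA", "PIURA"],
                         "Provincia": ["LIMA", "SULLANA"],
                         "Grave_F": [3.0, None],
                         "Grave_M": [2.0, 5.0]})

# Keep the keys as identifiers, move the other columns into rows, drop empty cells
toy_long = (toy_wide.melt(id_vars=["Departamento", "Provincia"],
                          var_name="status", value_name="cases")
                    .dropna(subset=["cases"])
                    .reset_index(drop=True))

# Split the combined label back into its two parts
toy_long[["enfermedad", "sexo"]] = toy_long["status"].str.split("_", expand=True)
print(toy_long)
```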