# Tidy Data

### Overview

'Tidy data' provides a standard way to organize data values within a dataset.

Principles of tidy data:

1. columns represent separate variables
2. rows represent unique, individual observations
3. observational units form tables

This may mean that you need to transform your table to meet these principles:

![Table](img/table-01.png)

This may make the resulting table harder to interpret. In that case, consider the context in which the table is being prepared: is it for reporting or for analysis?

Converting a table to meet the principles of 'tidy data' makes it easier to fix common data problems, such as columns containing values instead of variables (e.g. columns 'treatment a' and 'treatment b' are converted to a single 'treatment' column), and to transform the data into different shapes as needed.

We can convert columns that hold values into variable columns using the pandas function `pd.melt()` (also available as the DataFrame method `.melt()`).

![Table 2](img/table-02.png)

To `melt` data, specify the dataframe, `frame=df`, and the column we want to remain constant, `id_vars='name'`. The `value_vars` parameter specifies the columns you want to `melt`. If you don't specify any columns, `melt` will use all the columns not listed in the `id_vars` parameter. We can rename the new variable and value columns with the `var_name` and `value_name` parameters respectively.

![Table 3](img/table-03.png)
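The parameters described above can be sketched on a small table. The data here is hypothetical (the names and numbers are made up for illustration), and the `value_name='result'` label is an assumption rather than something taken from the figures.

```python
import pandas as pd

# Hypothetical example data: one row per person, one column per treatment.
df = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cat'],
    'treatment a': [12.0, 4.0, 7.0],
    'treatment b': [3.0, 9.0, 5.0],
})

# Keep 'name' fixed and melt the two treatment columns into rows.
df_tidy = pd.melt(
    frame=df,
    id_vars='name',
    value_vars=['treatment a', 'treatment b'],
    var_name='treatment',    # name for the column holding the old headings
    value_name='result',     # name for the column holding the old cell values
)

print(df_tidy)
```

Each person now appears once per treatment, so the three-row table becomes six rows with the columns `name`, `treatment`, and `result`.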

### Melting Data

Melting data is the process of turning columns of your data into rows of data. Consider the 'Airquality' dataset. In its tidy form, the variables `Ozone`, `Solar.R`, `Wind`, and `Temp` each have their own column. If, however, you wanted these variables to be in rows instead, you could melt the DataFrame. In doing so, you would make the data untidy! This is important to keep in mind: melting is a reshaping tool, not a guarantee of tidiness, and the shape you need depends on how your data is represented and what you want to do with it (e.g., a melted shape can make it easier to plot values).

There are two parameters you should be aware of: `id_vars` and `value_vars`. The `id_vars` represent the columns of the data **you do not want to melt** (i.e., keep in their current shape), while the `value_vars` represent the columns **you do wish to melt into rows**. By default, if no `value_vars` are provided, all columns not set in the `id_vars` will be melted.

Melting data will result in other rows being duplicated, e.g. the `Month` and `Day` fields are repeated with the same values.
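As a rough illustration of `id_vars`, `value_vars`, and the repeated identifier rows, the sketch below builds a three-row frame by hand instead of reading the CSV file. The values are taken from the first rows of the airquality data; the names `df_small`, `df_two`, and `df_all` are made up for this example.

```python
import pandas as pd

# Three rows of the airquality data, typed in by hand for a self-contained example.
df_small = pd.DataFrame({
    'Ozone': [41.0, 36.0, 12.0],
    'Solar.R': [190.0, 118.0, 149.0],
    'Wind': [7.4, 8.0, 12.6],
    'Temp': [67, 72, 74],
    'Month': [5, 5, 5],
    'Day': [1, 2, 3],
})

# Melt only two of the four measurement columns; the others are left out of the result.
df_two = pd.melt(df_small, id_vars=['Month', 'Day'], value_vars=['Ozone', 'Wind'])

# With no value_vars, every column outside id_vars is melted.
df_all = pd.melt(df_small, id_vars=['Month', 'Day'])

print(df_two.shape)
print(df_all.shape)
print(df_all['Day'].tolist())  # the Day values repeat once per melted column
```

The row count of the melted frame is the original row count multiplied by the number of melted columns, which is why the `Month` and `Day` values are repeated.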

In [1]:
import pandas as pd

df = pd.read_csv('data2/airquality.csv')
df.columns

Index(['Ozone', 'Solar.R', 'Wind', 'Temp', 'Month', 'Day'], dtype='object')

In [5]:
df.shape

(153, 6)

In [2]:
df.describe()

Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day
count,116.0,146.0,153.0,153.0,153.0,153.0
mean,42.12931,185.931507,9.957516,77.882353,6.993464,15.803922
std,32.987885,90.058422,3.523001,9.46527,1.416522,8.86452
min,1.0,7.0,1.7,56.0,5.0,1.0
25%,18.0,115.75,7.4,72.0,6.0,8.0
50%,31.5,205.0,9.7,79.0,7.0,16.0
75%,63.25,258.75,11.5,85.0,8.0,23.0
max,168.0,334.0,20.7,97.0,9.0,31.0


In [3]:
df.head()

Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day
0,41.0,190.0,7.4,67,5,1
1,36.0,118.0,8.0,72,5,2
2,12.0,149.0,12.6,74,5,3
3,18.0,313.0,11.5,62,5,4
4,,,14.3,56,5,5


Use `pd.melt()` to melt the `Ozone`, `Solar.R`, `Wind`, and `Temp` columns of airquality into rows. Do this by using `id_vars` to specify the columns you do not wish to melt: `Month` and `Day`.

In [9]:
df_melt = pd.melt(df, id_vars=['Month', 'Day'])
df_melt.head()

Unnamed: 0,Month,Day,variable,value
0,5,1,Ozone,41.0
1,5,2,Ozone,36.0
2,5,3,Ozone,12.0
3,5,4,Ozone,18.0
4,5,5,Ozone,


This exercise demonstrates that melting a DataFrame is not always appropriate if you want to make it tidy. You may have to perform other transformations depending on how your data is represented.

When melting DataFrames, it would be better to have column names more meaningful than `variable` and `value` (the default names used by `pd.melt()`).

You can rename the `variable` column by specifying an argument to the `var_name` parameter, and the `value` column by specifying an argument to the `value_name` parameter. 

In [10]:
df_melt = pd.melt(df, id_vars=['Month', 'Day'], var_name='measurement', value_name='reading')
df_melt.head()

Unnamed: 0,Month,Day,measurement,reading
0,5,1,Ozone,41.0
1,5,2,Ozone,36.0
2,5,3,Ozone,12.0
3,5,4,Ozone,18.0
4,5,5,Ozone,


### Pivot

This is the opposite of **melting**.

If in **melting** we turn columns into rows, in **pivoting** we turn unique values into separate columns.

Why pivot data?

- to re-shape it from analysis-friendly to report-friendly.
- to fix data that violates the **tidy data** principle that rows contain observations, e.g. when multiple variables are stored in the same column.

To **pivot**, use the pandas `.pivot()` method, which takes:

- an `index` parameter, used to specify the columns that you don't want pivoted. It is similar to the `id_vars` parameter of `pd.melt()`.
- a `columns` parameter, used to denote the column we want to pivot into new columns.
- a `values` parameter, used to denote the values that fill in the new columns.

The pivot process will fail if you have duplicate entries, e.g. two temperatures for the same date.

In such cases use the `pivot_table()` method. It takes the `aggfunc` parameter, by which you define how you want the duplicate values handled, e.g. `aggfunc=np.mean` denotes that the mean be calculated for any duplicate values.
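The difference between the two methods can be sketched on a tiny, hypothetical long-format table with one duplicated entry. The data and variable names are made up for illustration. The string `'mean'` is used for `aggfunc` as an equivalent of `np.mean`, since newer pandas versions may warn when the NumPy function is passed.

```python
import pandas as pd

# Hypothetical long-format data with a duplicate entry:
# two 'Temp' readings for Day 1.
df_long = pd.DataFrame({
    'Day': [1, 1, 1, 2],
    'measurement': ['Temp', 'Temp', 'Wind', 'Temp'],
    'reading': [67.0, 69.0, 7.4, 72.0],
})

# .pivot() cannot decide which duplicate to keep, so it raises ValueError.
try:
    df_long.pivot(index='Day', columns='measurement', values='reading')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# .pivot_table() aggregates the duplicates instead; here by taking the mean.
df_wide = df_long.pivot_table(
    index='Day',
    columns='measurement',
    values='reading',
    aggfunc='mean',
)

print(pivot_failed)
print(df_wide)
```

In the result, the two `Temp` readings for Day 1 collapse into their mean, and combinations with no reading at all (Day 2, `Wind`) are filled with `NaN`.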

In [11]:
df_pivot = df_melt.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading')
df_pivot.head()

Unnamed: 0_level_0,measurement,Ozone,Solar.R,Temp,Wind
Month,Day,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
5,1,41.0,190.0,67.0,7.4
5,2,36.0,118.0,72.0,8.0
5,3,12.0,149.0,74.0,12.6
5,4,18.0,313.0,62.0,11.5
5,5,,,56.0,14.3


You'll notice we didn't quite get back the original DataFrame.

What you got back instead was a pandas DataFrame with a **hierarchical index** (also known as a **MultiIndex**). Hierarchical indexes allow you to group columns or rows by another variable - in this case, by `Month` as well as `Day`.

There's a very simple method you can use to get back the original DataFrame from the pivoted DataFrame: `.reset_index()`.

In [12]:
df_pivot.index

MultiIndex(levels=[[5, 6, 7, 8, 9], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 

In [13]:
df_pivot_reset = df_pivot.reset_index()
df_pivot_reset.index

RangeIndex(start=0, stop=153, step=1)

In [15]:
df_pivot_reset.head()

measurement,Month,Day,Ozone,Solar.R,Temp,Wind
0,5,1,41.0,190.0,67.0,7.4
1,5,2,36.0,118.0,72.0,8.0
2,5,3,12.0,149.0,74.0,12.6
3,5,4,18.0,313.0,62.0,11.5
4,5,5,,,56.0,14.3
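Comparing the output above with the original `df.head()`, the column order and the `Temp` values (now floats) differ slightly. A minimal sketch of a round-trip check, using a hand-built three-row frame instead of the CSV file, might look like the following; the names `df_small`, `melted`, and `restored` are made up for this example.

```python
import pandas as pd

# Three rows of the airquality data, typed in by hand for a self-contained example.
df_small = pd.DataFrame({
    'Ozone': [41.0, 36.0, 12.0],
    'Solar.R': [190.0, 118.0, 149.0],
    'Wind': [7.4, 8.0, 12.6],
    'Temp': [67, 72, 74],
    'Month': [5, 5, 5],
    'Day': [1, 2, 3],
})

melted = pd.melt(df_small, id_vars=['Month', 'Day'],
                 var_name='measurement', value_name='reading')
restored = (
    melted.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading')
    .reset_index()
)

# Clear the leftover 'measurement' label on the columns axis.
restored.columns.name = None

# Put the columns back in the original order before comparing.
restored = restored[df_small.columns]

# 'Temp' was an integer column originally but comes back as float, because
# melting stacked it into one column together with the float measurements.
print(restored.dtypes['Temp'])

# After casting back to the original dtypes, the two frames match.
print(restored.astype(df_small.dtypes.to_dict()).equals(df_small))
```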


### Handling duplicate values


In [8]:
import pandas as pd
import numpy as np

# create a dataframe with duplicate data
df_dup = pd.read_csv('data2/airquality_dup.csv')
df_dup.head()

Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day
0,41.0,190.0,7.4,67,5,1
1,36.0,118.0,8.0,72,5,2
2,12.0,149.0,12.6,74,5,3
3,18.0,313.0,11.5,62,5,4
4,,,14.3,56,5,5


In [9]:
# melt the duplicated data
df_dup_melt = pd.melt(df_dup, id_vars=['Month', 'Day'], var_name='measurement', value_name='reading')
df_dup_melt.columns

Index(['Month', 'Day', 'measurement', 'reading'], dtype='object')

In [10]:
df_dup_melt.head()

Unnamed: 0,Month,Day,measurement,reading
0,5,1,Ozone,41.0
1,5,2,Ozone,36.0
2,5,3,Ozone,12.0
3,5,4,Ozone,18.0
4,5,5,Ozone,


In [12]:
# pivot the table and deal with duplicate values by 
# providing an aggregation function through the aggfunc parameter.
df_pivot = df_dup_melt.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading', aggfunc=np.mean)
df_pivot.head()

Unnamed: 0_level_0,measurement,Ozone,Solar.R,Temp,Wind
Month,Day,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
5,1,41.0,190.0,67.0,7.4
5,2,36.0,118.0,72.0,8.0
5,3,12.0,149.0,74.0,12.6
5,4,18.0,313.0,62.0,11.5
5,5,,,56.0,14.3


In [13]:
# reset the table index
df_pivot = df_pivot.reset_index()
df_pivot.head()

measurement,Month,Day,Ozone,Solar.R,Temp,Wind
0,5,1,41.0,190.0,67.0,7.4
1,5,2,36.0,118.0,72.0,8.0
2,5,3,12.0,149.0,74.0,12.6
3,5,4,18.0,313.0,62.0,11.5
4,5,5,,,56.0,14.3


In [14]:
df_pivot.shape

(153, 6)

In [15]:
# original duplicate table
df_dup.head()

Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day
0,41.0,190.0,7.4,67,5,1
1,36.0,118.0,8.0,72,5,2
2,12.0,149.0,12.6,74,5,3
3,18.0,313.0,11.5,62,5,4
4,,,14.3,56,5,5


In [16]:
df_dup.shape

(306, 6)

### Re-Shaping Data

Melting and pivoting are the basic tools used to reshape data.

A common problem is columns that contain multiple 'bits' of information, e.g. males up to 14, males between 15 and 24, etc., where sex and age group are stored in the same column names. Although this may be ideal for data reporting, it is unsuitable for analysis, since we cannot fit a model where age and gender are independent predictors.

In [25]:
import pandas as pd
import numpy as np

df_tb = pd.read_csv('data2/tb.csv')
df_tb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 201 entries, 0 to 200
Data columns (total 18 columns):
country    200 non-null object
year       201 non-null int64
m014       162 non-null float64
m1524      163 non-null float64
m2534      164 non-null float64
m3544      164 non-null float64
m4554      165 non-null float64
m5564      166 non-null float64
m65        164 non-null float64
mu         0 non-null float64
f014       160 non-null float64
f1524      160 non-null float64
f2534      162 non-null float64
f3544      160 non-null float64
f4554      161 non-null float64
f5564      162 non-null float64
f65        160 non-null float64
fu         0 non-null float64
dtypes: float64(16), int64(1), object(1)
memory usage: 28.3+ KB


In [27]:
df_tb.head()

Unnamed: 0,country,year,m014,m1524,m2534,m3544,m4554,m5564,m65,mu,f014,f1524,f2534,f3544,f4554,f5564,f65,fu
0,AD,2000,0.0,0.0,1.0,0.0,0.0,0.0,0.0,,,,,,,,,
1,AE,2000,2.0,4.0,4.0,6.0,5.0,12.0,10.0,,3.0,16.0,1.0,3.0,0.0,0.0,4.0,
2,AF,2000,52.0,228.0,183.0,149.0,129.0,94.0,80.0,,93.0,414.0,565.0,339.0,205.0,99.0,36.0,
3,AG,2000,0.0,0.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,0.0,0.0,0.0,
4,AL,2000,2.0,19.0,21.0,14.0,24.0,19.0,16.0,,3.0,11.0,10.0,8.0,8.0,5.0,11.0,


What we need are separate 'gender' and 'age' columns.

1. we can melt the data so the gender and age are in the same column

In [28]:
# pass in the dataframe we want to melt, setting the columns we 
# do not want changed through 'id_vars', the remaining columns will be melted
df_tb_melt = pd.melt(frame=df_tb, id_vars=['country', 'year'])
df_tb_melt.columns

Index(['country', 'year', 'variable', 'value'], dtype='object')

In [29]:
df_tb_melt.head()

Unnamed: 0,country,year,variable,value
0,AD,2000,m014,0.0
1,AE,2000,m014,2.0
2,AF,2000,m014,52.0
3,AG,2000,m014,0.0
4,AL,2000,m014,2.0


2. the 'variable' column represents both age and gender - we now carry out some string parsing to create a gender column.

In [30]:
# slice 1st char representing the gender
df_tb_melt['gender'] = df_tb_melt.variable.str[0]
df_tb_melt.head()

Unnamed: 0,country,year,variable,value,gender
0,AD,2000,m014,0.0,m
1,AE,2000,m014,2.0,m
2,AF,2000,m014,52.0,m
3,AG,2000,m014,0.0,m
4,AL,2000,m014,2.0,m


In [31]:
# slice the remaining chars representing the age group
df_tb_melt['age_group'] = df_tb_melt.variable.str[1:]
df_tb_melt.head()

Unnamed: 0,country,year,variable,value,gender,age_group
0,AD,2000,m014,0.0,m,014
1,AE,2000,m014,2.0,m,014
2,AF,2000,m014,52.0,m,014
3,AG,2000,m014,0.0,m,014
4,AL,2000,m014,2.0,m,014
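The two slicing steps can be reproduced on a small hand-built frame. This is a sketch only: it copies two rows and two count columns from the tb data shown above, and the names `df_small` and `melted` are made up for this example.

```python
import pandas as pd

# Two rows and two of the count columns from the tb data, typed in by hand.
df_small = pd.DataFrame({
    'country': ['AE', 'AF'],
    'year': [2000, 2000],
    'm014': [2.0, 52.0],
    'f1524': [16.0, 414.0],
})

melted = pd.melt(df_small, id_vars=['country', 'year'])

# The first character is the gender code, the rest is the age-group code.
melted['gender'] = melted['variable'].str[0]
melted['age_group'] = melted['variable'].str[1:]

print(melted[['variable', 'gender', 'age_group']])
```

The sliced values are strings, so the leading zero in a code such as `014` is kept.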


Another common way multiple variables are stored in columns is with a delimiter. Notice that the data in `df_ebola` dataframe has column names such as `Cases_Guinea` and `Deaths_Guinea`. Here, the underscore `_` serves as a delimiter between the first part (cases or deaths), and the second part (country).

In [32]:
import pandas as pd
import numpy as np

df_ebola = pd.read_csv('data2/ebola.csv')
df_ebola.head()

Unnamed: 0,Date,Day,Cases_Guinea,Cases_Liberia,Cases_SierraLeone,Cases_Nigeria,Cases_Senegal,Cases_UnitedStates,Cases_Spain,Cases_Mali,Deaths_Guinea,Deaths_Liberia,Deaths_SierraLeone,Deaths_Nigeria,Deaths_Senegal,Deaths_UnitedStates,Deaths_Spain,Deaths_Mali
0,1/5/2015,289,2776.0,,10030.0,,,,,,1786.0,,2977.0,,,,,
1,1/4/2015,288,2775.0,,9780.0,,,,,,1781.0,,2943.0,,,,,
2,1/3/2015,287,2769.0,8166.0,9722.0,,,,,,1767.0,3496.0,2915.0,,,,,
3,1/2/2015,286,,8157.0,,,,,,,,3496.0,,,,,,
4,12/31/2014,284,2730.0,8115.0,9633.0,,,,,,1739.0,3471.0,2827.0,,,,,


We can use Python's built-in string method `.split()` to split the column heading at the `_` into a list of strings.

We can then extract the first element of this list and assign it to a `type` variable, and the second element of the list to a `country` variable. We can accomplish this by accessing the `str` attribute of the column and using the `.get()` method to retrieve the `0` or `1` index, depending on the part you want.
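Before applying this to the full dataframe, the two string operations can be tried on a plain Series. This is a minimal sketch; the Series `labels` and the variable names `parts`, `kind`, and `country` are made up for illustration, with the strings taken from the column headings above.

```python
import pandas as pd

# Column headings from the ebola data, held as values in a Series.
labels = pd.Series(['Cases_Guinea', 'Deaths_Liberia', 'Cases_SierraLeone'])

# .str.split() turns each string into a list of parts.
parts = labels.str.split('_')

# .str.get(i) picks element i out of each list.
kind = parts.str.get(0)
country = parts.str.get(1)

print(parts.tolist())
print(kind.tolist())
print(country.tolist())
```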

In [33]:
# 1st melt the dataframe
df_ebola_melt = pd.melt(df_ebola, id_vars=['Date', 'Day'], var_name='type_country', value_name='counts')
df_ebola_melt.head()

Unnamed: 0,Date,Day,type_country,counts
0,1/5/2015,289,Cases_Guinea,2776.0
1,1/4/2015,288,Cases_Guinea,2775.0
2,1/3/2015,287,Cases_Guinea,2769.0
3,1/2/2015,286,Cases_Guinea,
4,12/31/2014,284,Cases_Guinea,2730.0


In [35]:
# 2nd Create a column called 'str_split' by splitting the 'type_country' column
# you have to access the 'str' attribute of type_country before you can use .split()
df_ebola_melt['str_split'] = df_ebola_melt['type_country'].str.split('_')
df_ebola_melt.head(2)

Unnamed: 0,Date,Day,type_country,counts,str_split
0,1/5/2015,289,Cases_Guinea,2776.0,"[Cases, Guinea]"
1,1/4/2015,288,Cases_Guinea,2775.0,"[Cases, Guinea]"


In [37]:
# 3rd create column 'type' using '.get()' to retrieve index 0 of 'str_split'
df_ebola_melt['type'] = df_ebola_melt['str_split'].str.get(0)

In [38]:
# 4th create column 'country' using '.get()' to retrieve index 1 of 'str_split'
df_ebola_melt['country'] = df_ebola_melt['str_split'].str.get(1)
df_ebola_melt.head()

Unnamed: 0,Date,Day,type_country,counts,str_split,country,type
0,1/5/2015,289,Cases_Guinea,2776.0,"[Cases, Guinea]",Guinea,Cases
1,1/4/2015,288,Cases_Guinea,2775.0,"[Cases, Guinea]",Guinea,Cases
2,1/3/2015,287,Cases_Guinea,2769.0,"[Cases, Guinea]",Guinea,Cases
3,1/2/2015,286,Cases_Guinea,,"[Cases, Guinea]",Guinea,Cases
4,12/31/2014,284,Cases_Guinea,2730.0,"[Cases, Guinea]",Guinea,Cases


In [39]:
# 5th tidy - remove the temp columns
df_ebola_melt = df_ebola_melt[['Date', 'Day', 'type', 'country', 'counts']]
df_ebola_melt.head()

Unnamed: 0,Date,Day,type,country,counts
0,1/5/2015,289,Cases,Guinea,2776.0
1,1/4/2015,288,Cases,Guinea,2775.0
2,1/3/2015,287,Cases,Guinea,2769.0
3,1/2/2015,286,Cases,Guinea,
4,12/31/2014,284,Cases,Guinea,2730.0
