# Tidy Data

**Tidy Data** paper by Hadley Wickham, PhD
* Formalize the way we describe the shape of data
* Gives us a goal when formatting our data
* "Standard way to organize data values within a dataset"

### Motivation for Tidy Data

Consider the following two tables. The information they convey is exactly the same; only the layout differs.

|    |Name   | Age | Height (cm)|
|----|-------|-----|------------|
|0   |Daniel |42   |167         |
|1   |John   |     |188         |
|2   |Jane   |24   |172         |

|           |0      |1    |2    |
|-----------|-------|-----|-----|
|Name       |Daniel |John |Jane |
|Age        |42     |     |24   |
|Height (cm)|167    |188  |172  |
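
As a rough illustration, the two layouts are transposes of one another. The sketch below rebuilds the first table by hand in pandas (the names `people` and `people_wide` are illustrative only) and flips it into the second layout.

```python
import pandas as pd

# Rebuild the first table by hand; None marks John's missing age.
people = pd.DataFrame({
    "Name": ["Daniel", "John", "Jane"],
    "Age": [42, None, 24],
    "Height (cm)": [167, 188, 172],
})

# Transposing swaps rows and columns, which gives the second layout.
people_wide = people.T

print(people)
print(people_wide)
```

Both objects hold the same values, but only the first has one row per person and one column per variable.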

### Principle of Tidy Data
* Columns represent separate variables
* Rows represent individual observations
* Each type of observational unit forms a table

**Tidy Data makes it easier to fix common data problems**

## Reshaping Data using Melt

**Data Problems we are trying to fix**
* Column headers are values, not variable names
* **Solution: pd.melt()**

In [1]:
import pandas as pd
df = pd.read_csv('https://assets.datacamp.com/production/repositories/666/datasets/c16448e3f4219f900f540c455fdf87b0f3da70e0/airquality.csv')
df.head()

|    |Ozone |Solar.R |Wind |Temp |Month |Day |
|----|------|--------|-----|-----|------|----|
|0   |41.0  |190.0   |7.4  |67   |5     |1   |
|1   |36.0  |118.0   |8.0  |72   |5     |2   |
|2   |12.0  |149.0   |12.6 |74   |5     |3   |
|3   |18.0  |313.0   |11.5 |62   |5     |4   |
|4   |NaN   |NaN     |14.3 |56   |5     |5   |


### Melt columns Ozone, Solar.R, Wind, and Temp into rows.

In [2]:
airquality_melt = pd.melt(frame=df, id_vars=['Month', 'Day'])
airquality_melt.head()

|    |Month |Day |variable |value |
|----|------|----|---------|------|
|0   |5     |1   |Ozone    |41.0  |
|1   |5     |2   |Ozone    |36.0  |
|2   |5     |3   |Ozone    |12.0  |
|3   |5     |4   |Ozone    |18.0  |
|4   |5     |5   |Ozone    |NaN   |


### Customizing Melted Data

In [3]:
airquality_melt2 = pd.melt(df, id_vars=['Month', 'Day'], var_name='Measurement', value_name='Reading')
airquality_melt2.tail()

|    |Month |Day |Measurement |Reading |
|----|------|----|------------|--------|
|607 |9     |26  |Temp        |70.0    |
|608 |9     |27  |Temp        |77.0    |
|609 |9     |28  |Temp        |75.0    |
|610 |9     |29  |Temp        |76.0    |
|611 |9     |30  |Temp        |68.0    |
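
For readers without network access, the same calls can be tried on a small stand-in. The sketch below uses a two-row DataFrame named `sample` (a hypothetical name, with values copied from the first rows of the airquality data) and also shows the `value_vars` parameter, which restricts the melt to the listed columns.

```python
import pandas as pd

# Two-row stand-in for the airquality data; only Ozone and Wind are kept.
sample = pd.DataFrame({
    "Ozone": [41.0, 36.0],
    "Wind": [7.4, 8.0],
    "Month": [5, 5],
    "Day": [1, 2],
})

# id_vars stay as columns; every other column is stacked into
# Measurement/Reading pairs.
melted = pd.melt(sample, id_vars=["Month", "Day"],
                 var_name="Measurement", value_name="Reading")
print(melted)

# value_vars limits the melt to the listed columns (here only Ozone).
only_ozone = pd.melt(sample, id_vars=["Month", "Day"], value_vars=["Ozone"])
print(only_ozone)
```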


## Pivoting Data

**Pivot: un-melting data**
* Opposite of melting
* In melting, we turned columns into rows
* Pivoting: turns unique values into separate columns
* Good for reshaping our data
    * Turns an analysis-friendly shape into a reporting-friendly shape
* Useful when our data violates a Tidy Data principle, such as each row containing one observation

### Example:

**UnTidy Data**

In this dataset, notice that 'tmax' and 'tmin' are reported in separate rows. We want a dataset in which each observation (date) has both its min and max temperature in a single row.

That is, every row is an observation and every column is a variable.

|    |date       |element |value |
|----|-----------|--------|------|
|0   |2010-01-30 |tmax    |27.8  |
|1   |2010-01-30 |tmin    |14.5  |
|2   |2010-02-02 |tmax    |27.3  |
|3   |2010-02-02 |tmin    |14.4  |

```python
# Spread the 'element' values (tmax, tmin) into their own columns, one row per date
weather_tidy = weather.pivot(index='date',
                             columns='element',
                             values='value')
```

**Tidy Data - Element column pivoted**

|    |date       |tmax |tmin |
|----|-----------|-----|-----|
|0   |2010-01-30 |27.8 |14.5 |
|1   |2010-02-02 |27.3 |14.4 |

**Note:** pivot() raises an error if the data has duplicate entries, i.e. if the same date has multiple tmin or tmax values. In those cases, use the pivot_table() method instead.
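
A minimal sketch of that failure mode, assuming a hand-built `weather` DataFrame with the four rows from the untidy table. The extra tmin reading of 13.9 is invented purely to create a duplicate and is not part of the original data.

```python
import pandas as pd

weather = pd.DataFrame({
    "date": ["2010-01-30", "2010-01-30", "2010-02-02", "2010-02-02"],
    "element": ["tmax", "tmin", "tmax", "tmin"],
    "value": [27.8, 14.5, 27.3, 14.4],
})

# One value per (date, element) pair, so pivot() works.
weather_tidy = weather.pivot(index="date", columns="element", values="value")
print(weather_tidy)

# Hypothetical second tmin reading for a date that already has one.
duplicate = pd.DataFrame({"date": ["2010-02-02"],
                          "element": ["tmin"],
                          "value": [13.9]})
weather_dup = pd.concat([weather, duplicate], ignore_index=True)

pivot_failed = False
try:
    weather_dup.pivot(index="date", columns="element", values="value")
except ValueError as err:
    # pandas cannot choose between the two tmin values for 2010-02-02.
    pivot_failed = True
    print("pivot() failed:", err)
```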

### Pivot Table

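* Has a parameter that specifies how to deal with duplicate values
    * Example: aggregate the duplicate values by taking their average
* To use pivot_table()
    * Pass the aggregation function as the 'aggfunc' parameter
        * Example: aggfunc='mean' (or np.mean, which requires `import numpy as np`)
    * If 'aggfunc' is omitted, pivot_table() averages duplicate values by default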
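
As a small sketch of the aggregation, the example below uses a hypothetical `weather_dup` DataFrame in which one date has two tmin readings (the second reading, 13.6, is invented for illustration). It passes the string `'mean'`; `np.mean` is also accepted, though some recent pandas versions emit a warning that recommends the string form.

```python
import pandas as pd

# Hypothetical readings: 2010-02-02 has two tmin values.
weather_dup = pd.DataFrame({
    "date": ["2010-02-02", "2010-02-02", "2010-02-02"],
    "element": ["tmax", "tmin", "tmin"],
    "value": [27.3, 14.4, 13.6],
})

# Duplicate (date, element) pairs are collapsed to their mean.
weather_mean = weather_dup.pivot_table(index="date",
                                       columns="element",
                                       values="value",
                                       aggfunc="mean")
print(weather_mean)
```

The result has a single row for 2010-02-02, with the two tmin readings replaced by their average.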

In [4]:
airquality_melt2.head()

|    |Month |Day |Measurement |Reading |
|----|------|----|------------|--------|
|0   |5     |1   |Ozone       |41.0    |
|1   |5     |2   |Ozone       |36.0    |
|2   |5     |3   |Ozone       |12.0    |
|3   |5     |4   |Ozone       |18.0    |
|4   |5     |5   |Ozone       |NaN     |


In [5]:
airquality_melt2.tail()

|    |Month |Day |Measurement |Reading |
|----|------|----|------------|--------|
|607 |9     |26  |Temp        |70.0    |
|608 |9     |27  |Temp        |77.0    |
|609 |9     |28  |Temp        |75.0    |
|610 |9     |29  |Temp        |76.0    |
|611 |9     |30  |Temp        |68.0    |


In [6]:
# No aggfunc is passed, so any duplicate readings would be averaged (the default)
airquality_pivot = airquality_melt2.pivot_table(index=['Month', 'Day'],
                                                columns='Measurement',
                                                values='Reading')
airquality_pivot.head()

|Month |Day |Ozone |Solar.R |Temp |Wind |
|------|----|------|--------|-----|-----|
|5     |1   |41.0  |190.0   |67.0 |7.4  |
|5     |2   |36.0  |118.0   |72.0 |8.0  |
|5     |3   |12.0  |149.0   |74.0 |12.6 |
|5     |4   |18.0  |313.0   |62.0 |11.5 |
|5     |5   |NaN   |NaN     |56.0 |14.3 |
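
In the pivoted result, Month and Day form a two-level row index rather than ordinary columns. One common way to get back to a flat table is `reset_index()`; the sketch below shows this on a small hand-built stand-in (the names `melted`, `pivoted`, and `flat` are illustrative).

```python
import pandas as pd

# Small stand-in for the melted airquality data.
melted = pd.DataFrame({
    "Month": [5, 5, 5, 5],
    "Day": [1, 1, 2, 2],
    "Measurement": ["Ozone", "Wind", "Ozone", "Wind"],
    "Reading": [41.0, 7.4, 36.0, 8.0],
})

pivoted = melted.pivot_table(index=["Month", "Day"],
                             columns="Measurement",
                             values="Reading")

# Month and Day live in the index after pivot_table();
# reset_index() moves them back into regular columns.
flat = pivoted.reset_index()

# Optional: drop the leftover 'Measurement' label on the column axis.
flat.columns.name = None
print(flat)
```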


## Beyond Melt and Pivot

* Melting and pivoting are basic tools
* Another common problem:
    * Columns contain multiple bits of information

In the following dataset, a column name such as m014 stands for 'Males, Age 0 - 14'. Instead of keeping all these columns, we want to reshape the dataset so that it has one column for sex and one column for age group.

In [7]:
df = pd.read_csv('https://assets.datacamp.com/production/repositories/666/datasets/cf05b5e01009dd5d61d7db5ac5fb790042e7fd09/tb.csv')
df.head()

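|    |country |year |m014 |m1524 |m2534 |m3544 |m4554 |m5564 |m65  |mu  |f014 |f1524 |f2534 |f3544 |f4554 |f5564 |f65  |fu  |
|----|--------|-----|-----|------|------|------|------|------|-----|----|-----|------|------|------|------|------|-----|----|
|0   |AD      |2000 |0.0  |0.0   |1.0   |0.0   |0.0   |0.0   |0.0  |NaN |NaN  |NaN   |NaN   |NaN   |NaN   |NaN   |NaN  |NaN |
|1   |AE      |2000 |2.0  |4.0   |4.0   |6.0   |5.0   |12.0  |10.0 |NaN |3.0  |16.0  |1.0   |3.0   |0.0   |0.0   |4.0  |NaN |
|2   |AF      |2000 |52.0 |228.0 |183.0 |149.0 |129.0 |94.0  |80.0 |NaN |93.0 |414.0 |565.0 |339.0 |205.0 |99.0  |36.0 |NaN |
|3   |AG      |2000 |0.0  |0.0   |0.0   |0.0   |0.0   |0.0   |1.0  |NaN |1.0  |1.0   |1.0   |0.0   |0.0   |0.0   |0.0  |NaN |
|4   |AL      |2000 |2.0  |19.0  |21.0  |14.0  |24.0  |19.0  |16.0 |NaN |3.0  |11.0  |10.0  |8.0   |8.0   |5.0   |11.0 |NaN |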


### Melting and Parsing

First we need to melt the data so that all the sex-age combinations end up in a single column. Call pd.melt(), passing the DataFrame we want to melt as 'frame' and the columns we want to keep fixed as 'id_vars'.

In [8]:
df_melt = pd.melt(frame=df, id_vars=['country', 'year'])
df_melt.head()

|    |country |year |variable |value |
|----|--------|-----|---------|------|
|0   |AD      |2000 |m014     |0.0   |
|1   |AE      |2000 |m014     |2.0   |
|2   |AF      |2000 |m014     |52.0  |
|3   |AG      |2000 |m014     |0.0   |
|4   |AL      |2000 |m014     |2.0   |


Now we can parse out the sex. To do this, we create a new column in the dataset called 'sex' and fill it with the first letter of each entry in the 'variable' column.

In [9]:
# Creates a new 'sex' column from the first character in the 'variable' column
df_melt['sex'] = df_melt['variable'].str[0]
df_melt.head()

|    |country |year |variable |value |sex |
|----|--------|-----|---------|------|----|
|0   |AD      |2000 |m014     |0.0   |m   |
|1   |AE      |2000 |m014     |2.0   |m   |
|2   |AF      |2000 |m014     |52.0  |m   |
|3   |AG      |2000 |m014     |0.0   |m   |
|4   |AL      |2000 |m014     |2.0   |m   |
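
The age group can plausibly be parsed the same way, by taking everything after the first character. The sketch below is one possible approach, not necessarily the one used in the source material. It works on a small stand-in named `tb_melt` (values copied from the tb data shown earlier), and the column name `age_group` is an assumption chosen for illustration.

```python
import pandas as pd

# Small stand-in for df_melt with a few of the column codes from tb.csv.
tb_melt = pd.DataFrame({
    "country": ["AD", "AE", "AF"],
    "year": [2000, 2000, 2000],
    "variable": ["m014", "f1524", "m65"],
    "value": [0.0, 16.0, 80.0],
})

# First character is the sex code.
tb_melt["sex"] = tb_melt["variable"].str[0]

# Everything after the first character is the age-group code
# ('age_group' is an illustrative column name).
tb_melt["age_group"] = tb_melt["variable"].str[1:]
print(tb_melt)
```

The age codes stay as strings here (for example '014' for ages 0 - 14), since they label ranges rather than numeric values.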
