# Tidy data

* When analyzing data, you often spend more time preparing the data than on the analysis itself.
* But if your data is in the right format, most tasks become easy, because *tools already exist* for the most common operations.

For example, this is not tidy:

In [1]:
import pandas as pd
instructors1 = pd.DataFrame({"instructor": ["Richard", "Janne", "Simo"],
              "lessons": ["Jupyter; Tidy data; Python", "Python parallel; Plotting; Profiling", "R"]})
instructors1

,instructor,lessons
0,Richard,Jupyter; Tidy data; Python
1,Janne,Python parallel; Plotting; Profiling
2,Simo,R


But this is tidy, because each row holds exactly one instructor and one lesson:

In [2]:
instructors2 = pd.DataFrame({"instructor": ["Richard"]*3 + ["Janne"]*3 + ["Simo"]*1,
              "lessons": ["Jupyter", "Tidy data", "Python", "Python parallel", "Plotting", "Profiling", "R"]})
instructors2

,instructor,lessons
0,Richard,Jupyter
1,Richard,Tidy data
2,Richard,Python
3,Janne,Python parallel
4,Janne,Plotting
5,Janne,Profiling
6,Simo,R
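
The first table does not have to be retyped by hand to get the second one. As a minimal sketch, assuming the lessons are always separated by `"; "` as in the example, the strings can be split into lists and each list element moved to its own row (the name `tidy` is just for illustration):

```python
import pandas as pd

# Same untidy table as above: several lessons packed into one string per row.
instructors1 = pd.DataFrame({
    "instructor": ["Richard", "Janne", "Simo"],
    "lessons": ["Jupyter; Tidy data; Python",
                "Python parallel; Plotting; Profiling",
                "R"],
})

# Split each string into a list, then give every list element its own row.
tidy = (
    instructors1
    .assign(lessons=instructors1["lessons"].str.split("; "))
    .explode("lessons")
    .reset_index(drop=True)
)
print(tidy)
```

The result should have the same seven rows as `instructors2`.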


This next table is also not tidy.  Why?

In [3]:
attendance = pd.DataFrame(
    {"tuesday": [25, 20, 15],
     "thursday": [20, 15, 10]},
    index=["week1", "week2", "week3"])
attendance

,tuesday,thursday
week1,25,20
week2,20,15
week3,15,10


The two columns hold different observations of the same type of data (attendance counts).  We can make the table "long" to make it tidy:

In [4]:
attendance2 = attendance.copy()
attendance2["week"] = attendance.index
attendance2 = attendance2.melt(id_vars="week", var_name="day")
attendance2.sort_values("week", inplace=True) 
attendance2.reset_index(drop=True, inplace=True)
attendance2

,week,day,value
0,week1,tuesday,25
1,week1,thursday,20
2,week2,tuesday,20
3,week2,thursday,15
4,week3,tuesday,15
5,week3,thursday,10


## In tidy data:
* One **row** is an **observation**
* Each **column** has a **single variable**
* Everything is homogeneous: all values in a column have the same type and meaning
* Each **table** (dataframe) is a **type of observation**

These principles aren't new: they are the same concept as "normal forms" in relational (SQL) databases.

## Exercises

### 01: Using tidy data
How would you answer the following questions with the tidy and non-tidy versions, using the above data?
* How many instructors are there?
* How many lessons is Richard teaching?
* What's the average attendance on Thursdays?
* How many different weekdays is the course taught on?

### 02: Why?
What are the advantages and disadvantages of the tidy vs non-tidy versions of the datasets above?

## See also
The tidy data paper: http://vita.had.co.nz/papers/tidy-data.pdf

Highly recommended: it's written to be read by normal people.

## Tidy data operations

The main point of tidy data is that, once data is in a standard form, standard tools can do most of what you need.

**We learn these in the Pandas lesson**.  

### Filter

Select only certain rows.

Example: In the first example above, which lessons did Richard teach?
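
As a rough sketch of what filtering looks like in pandas (the details come in the Pandas lesson), a boolean mask selects the matching rows. The table is rebuilt here so the cell runs on its own, and the name `richard` is just for illustration:

```python
import pandas as pd

# The tidy instructors table from above.
instructors2 = pd.DataFrame({
    "instructor": ["Richard"] * 3 + ["Janne"] * 3 + ["Simo"],
    "lessons": ["Jupyter", "Tidy data", "Python",
                "Python parallel", "Plotting", "Profiling", "R"],
})

# Build a boolean mask (one True/False per row) and keep only the True rows.
richard = instructors2[instructors2["instructor"] == "Richard"]
print(richard["lessons"].tolist())  # ['Jupyter', 'Tidy data', 'Python']
```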

### Select / apply

Select only certain columns, or apply an operation to all values in a column.
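
A small sketch of both ideas, using the long-form attendance table from above. `ROOM_CAPACITY` is a made-up number used only to have something to compute; it is not part of the original data:

```python
import pandas as pd

# The long-form attendance table, rebuilt so this cell runs on its own.
attendance_long = pd.DataFrame({
    "week": ["week1", "week1", "week2", "week2", "week3", "week3"],
    "day": ["tuesday", "thursday"] * 3,
    "value": [25, 20, 20, 15, 15, 10],
})

# Select: keep only the columns we care about.
day_and_count = attendance_long[["day", "value"]]

# Apply: run a function on every value of one column.
# ROOM_CAPACITY is a hypothetical number, used only for illustration.
ROOM_CAPACITY = 50
fill_rate = attendance_long["value"].apply(lambda n: n / ROOM_CAPACITY)

print(day_and_count.columns.tolist())
print(fill_rate.tolist())
```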

### Aggregate

Apply some "aggregate functions" to every value in an entire column, and return only one row.  Example: sum all rows, return one row with that sum.

Example: Total attendance over all days.
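
As a sketch, the total over all days is a single call on the `value` column of the long-form table (rebuilt here so the cell is self-contained). `agg` can compute several aggregates at once:

```python
import pandas as pd

# The long-form attendance table from above.
attendance_long = pd.DataFrame({
    "week": ["week1", "week1", "week2", "week2", "week3", "week3"],
    "day": ["tuesday", "thursday"] * 3,
    "value": [25, 20, 20, 15, 15, 10],
})

# One aggregate: collapse the whole column into a single number.
total = attendance_long["value"].sum()
print(total)  # 105

# Several aggregates at once: the result has one entry per function.
summary = attendance_long["value"].agg(["sum", "mean"])
print(summary)
```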

### Group

Split all data into groups depending on the values in a column.  Usually you then apply something to each group, and combine the results together.  This is also called **split-apply-combine**.

Example: split the course attendance data by `day`, then apply `sum` to each `value` column, then combine together to get attendance per weekday.
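
The example above might look like this in pandas. This is only a sketch of the idea, with the long-form table rebuilt so the cell runs on its own:

```python
import pandas as pd

# The long-form attendance table from above.
attendance_long = pd.DataFrame({
    "week": ["week1", "week1", "week2", "week2", "week3", "week3"],
    "day": ["tuesday", "thursday"] * 3,
    "value": [25, 20, 20, 15, 15, 10],
})

# Split by day, apply sum to each group's values, combine into one result.
per_day = attendance_long.groupby("day")["value"].sum()
print(per_day)
```

The result has one entry per weekday: 45 for Thursday and 60 for Tuesday.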

### Join

Connect two tables together using the value of one column.


## Making data tidy

There are certain common operations to make data tidy:

### Melt
Melt converts a dataframe from a "wide" form to a "long" form.  It may take some effort to get the column names all correct!

In [5]:
attendance2 = attendance.copy()
attendance2.index.name = "week"
attendance2.reset_index(inplace=True)
attendance2

,week,tuesday,thursday
0,week1,25,20
1,week2,20,15
2,week3,15,10


In [6]:
attendance2 = attendance2.melt(id_vars="week")
attendance2

,week,variable,value
0,week1,tuesday,25
1,week2,tuesday,20
2,week3,tuesday,15
3,week1,thursday,20
4,week2,thursday,15
5,week3,thursday,10


### Pivot
Pivot is the opposite of melt: it converts from long to wide.

In [7]:
attendance3 = attendance2.copy()
attendance3.head()

,week,variable,value
0,week1,tuesday,25
1,week2,tuesday,20
2,week3,tuesday,15
3,week1,thursday,20
4,week2,thursday,15


In [8]:
attendance3.pivot(index="week", columns="variable", values="value")

variable,thursday,tuesday
week,,
week1,20,25
week2,15,20
week3,10,15


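### Multiple types of data in same column

Here the `stat` column packs two variables, gender and age, into one string.  Splitting it gives each variable its own column.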

In [9]:
demographics = pd.DataFrame({"stat": ["f24", "m18", "f35"], "value": [1, 2, 3]})
demographics

,stat,value
0,f24,1
1,m18,2
2,f35,3


In [10]:
demographics['age'] = demographics['stat'].apply(lambda stat: int(stat[1:]))
demographics['gender'] = demographics['stat'].apply(lambda stat: stat[0])
demographics

,stat,value,age,gender
0,f24,1,24,f
1,m18,2,18,m
2,f35,3,35,f
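
The `apply` calls above work fine. As an alternative sketch, pandas string methods can do the same split in one step, assuming every `stat` value is one lowercase letter followed by digits, as in this example. The name `parts` is just for illustration:

```python
import pandas as pd

demographics = pd.DataFrame({"stat": ["f24", "m18", "f35"], "value": [1, 2, 3]})

# One named group per variable: a single letter, then one or more digits.
parts = demographics["stat"].str.extract(r"(?P<gender>[a-z])(?P<age>\d+)")
parts["age"] = parts["age"].astype(int)  # extract returns strings

# Attach the new columns to the original table (rows are matched by index).
demographics = demographics.join(parts)
print(demographics)
```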


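### Multiple types in one table

Here `temp` holds two kinds of measurement, the daily minimum and maximum, distinguished by `type`.  Pivoting gives each measurement its own column, with one row per day.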

In [11]:
temps = pd.DataFrame({"day": ["2019-01-01"]*2 + ["2019-01-02"]*2,
                      "type": ["min", "max", "min", "max"],
                      "temp": [-10, 5, -2, 7]})
temps

,day,type,temp
0,2019-01-01,min,-10
1,2019-01-01,max,5
2,2019-01-02,min,-2
3,2019-01-02,max,7


In [12]:
temps.pivot(index="day", columns="type", values="temp")

type,max,min
day,,
2019-01-01,5,-10
2019-01-02,7,-2


### Joining columns

In [13]:
courses = pd.DataFrame({"instructor": [0]*3 + [1]*3 + [2]*1,
              "lessons": ["Jupyter", "Tidy data", "Python", "Python parallel", "Plotting", "Profiling", "R"]})
courses

,instructor,lessons
0,0,Jupyter
1,0,Tidy data
2,0,Python
3,1,Python parallel
4,1,Plotting
5,1,Profiling
6,2,R


In [14]:
instructor_names = pd.DataFrame({"name": ["Richard", "Janne", "Simo"]},
                                index=[0, 1, 2])
instructor_names

,name
0,Richard
1,Janne
2,Simo


In [15]:
courses.join(instructor_names, on="instructor")

,instructor,lessons,name
0,0,Jupyter,Richard
1,0,Tidy data,Richard
2,0,Python,Richard
3,1,Python parallel,Janne
4,1,Plotting,Janne
5,1,Profiling,Janne
6,2,R,Simo
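
`join` matches a column against the other table's index. If you would rather match column to column, `merge` is one alternative. The following is a sketch under that assumption, with `names` and `merged` as illustrative names:

```python
import pandas as pd

courses = pd.DataFrame({
    "instructor": [0] * 3 + [1] * 3 + [2],
    "lessons": ["Jupyter", "Tidy data", "Python",
                "Python parallel", "Plotting", "Profiling", "R"],
})
instructor_names = pd.DataFrame({"name": ["Richard", "Janne", "Simo"]},
                                index=[0, 1, 2])

# Turn the index into an ordinary "instructor" column so both tables share a key.
names = instructor_names.rename_axis("instructor").reset_index()

# A left merge keeps every row of `courses`, in its original order.
merged = courses.merge(names, on="instructor", how="left")
print(merged)
```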


## What next?

You haven't learned *how* to actually do these things, just *what* there is to do.  We get to the how later.

There's a lot of related database theory, which might be interesting, but for the most part you can learn it by using `pandas`, which is coming up soon.