# Data Reshaping + Tidy Data

- [Data Reshaping](#Data-Reshaping)
    - [Melt Example](#Melt-Example)
    - [Pivot Table Example](#Pivot-Table-Example)
- [Tidy Data](#Tidy-Data)
    - [One Column with Multiple Variables](#One-Column-with-Multiple-Variables)
    - [One Variable in Multiple Columns](#One-Variable-in-Multiple-Columns)
    - [Multiple vars in 2 columns](#Multiple-vars-in-2-columns)
    - [Another gnarly example](#Another-gnarly-example)
    - [A More Complex Example](#A-More-Complex-Example)

In [1]:
import pandas as pd
import numpy as np

## Data Reshaping

- **long** data has many rows and few columns
- **wide** data has many columns
- a **melt** takes the data from wide to long
- a **spread**, or **pivot**, takes the data from long to wide
- a **transpose** swaps the rows and columns, i.e. it flips the dataframe across its diagonal

### Melt Example

In [2]:
np.random.seed(123)

# simple data for demonstration
df = pd.DataFrame({
    'a': np.random.randint(1, 11, 3),
    'b': np.random.randint(1, 11, 3),
    'c': np.random.randint(1, 11, 3),
    'x': np.random.randint(1, 11, 3),
    'y': np.random.randint(1, 11, 3),
    'z': np.random.randint(1, 11, 3),
})
df.head()

Unnamed: 0,a,b,c,x,y,z
0,3,2,7,2,1,5
1,3,4,2,10,10,1
2,7,10,1,1,4,1


Different ways of using `.melt`:

In [3]:
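# Uncomment one line at a time to compare the results
# df.melt()                                # melt every column into variable/value pairs
# df.melt(id_vars='a')                     # keep 'a' as an identifier, melt the rest
# df.melt(id_vars='x')                     # any column can serve as the identifier
# df.melt(id_vars=['a', 'b'])              # several identifier columns
# df.melt(value_vars=['x', 'y', 'z'])      # melt only these columns, drop the others
# df.melt(id_vars=['a', 'b'], value_vars=['x', 'y'], var_name='foo', value_name='bar')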
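To make the last variation concrete, here is a small sketch that retypes the demonstration values shown above as a literal frame (so it does not depend on the random seed) and runs the fully specified melt. The name `long_df` is only illustrative.

```python
import pandas as pd

# Same values as the demonstration frame above, typed in literally
df = pd.DataFrame({
    'a': [3, 3, 7],
    'b': [2, 4, 10],
    'c': [7, 2, 1],
    'x': [2, 10, 1],
    'y': [1, 10, 4],
    'z': [5, 1, 1],
})

# Keep 'a' and 'b' as identifier columns and stack 'x' and 'y' into rows.
# 'c' and 'z' are dropped because they are in neither id_vars nor value_vars.
long_df = df.melt(
    id_vars=['a', 'b'],
    value_vars=['x', 'y'],
    var_name='foo',      # column holding the old column names
    value_name='bar',    # column holding the old cell values
)
print(long_df)
```

Three rows and two melted columns give six rows in the result: one per original row per melted column.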

### Pivot Table Example

In [4]:
np.random.seed(123)
df = pd.DataFrame({
    'group': np.random.choice(['A', 'B', 'C'], 20),
    'subgroup': np.random.choice(['one', 'two'], 20),
    'x': np.random.randn(20),
})
df.head()

Unnamed: 0,group,subgroup,x
0,C,two,0.737369
1,B,one,1.490732
2,C,two,-0.935834
3,C,one,1.175829
4,A,one,-1.253881


In [5]:
df.pivot_table('x', 'subgroup', 'group')

subgroup,A,B,C
one,-0.71019,-0.669245,0.423405
two,-1.771533,-0.545111,0.087422
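`pivot_table` takes `values`, `index`, and `columns` (in that order when passed positionally) and aggregates with the mean by default. The sketch below uses a tiny hand-made frame (the numbers are placeholders, not the random data above) to show that the positional call, the keyword call, and a `groupby` followed by `unstack` all produce the same table.

```python
import pandas as pd

# Small hand-made frame; the values are placeholders chosen for easy arithmetic
df = pd.DataFrame({
    'group':    ['A', 'A', 'B', 'B', 'A', 'B'],
    'subgroup': ['one', 'two', 'one', 'two', 'one', 'two'],
    'x':        [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Positional form: values, index, columns
positional = df.pivot_table('x', 'subgroup', 'group')

# The same call with every argument spelled out, including the default aggregation
explicit = df.pivot_table(values='x', index='subgroup', columns='group', aggfunc='mean')

# Equivalent long-hand: group, aggregate, then move 'group' into the columns
via_groupby = df.groupby(['subgroup', 'group'])['x'].mean().unstack('group')

print(explicit)
```

Spelling out the keywords is usually easier to read later, since the positional order is easy to forget.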


## Tidy Data

Tidy Data Characteristics:

- data is tabular, i.e. made up of rows and columns
- there is one value per cell
- each variable is a column
- each observation is a row

General Ideas

- If several columns hold values with the same units, they probably belong in a single column.
- If one column holds measurements with different units, it should be spread out into separate columns.
- Should you be able to group by something that is currently spread across several columns? If so, combine those columns.
- Can the data be passed directly to seaborn?
- Can we ask interesting questions and answer them with a group by? In general, we **don't** want to be taking row or column averages.

For the rest of this lesson, we'll look at data that is **not** tidy.

### One Column with Multiple Variables

In [6]:
df = pd.DataFrame({
    'name': ['Sally', 'Jane', 'Billy', 'Suzy'],
    'pet': ['dog: max', 'dog: buddy', 'cat: grizabella', 'hamster: fred']
})
df

Unnamed: 0,name,pet
0,Sally,dog: max
1,Jane,dog: buddy
2,Billy,cat: grizabella
3,Suzy,hamster: fred
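One possible way to tidy this frame is to split the `pet` column on the `': '` separator so that each variable gets its own column. The column names `pet_type` and `pet_name` are assumptions made for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Sally', 'Jane', 'Billy', 'Suzy'],
    'pet': ['dog: max', 'dog: buddy', 'cat: grizabella', 'hamster: fred']
})

# Split "type: name" into two columns; expand=True returns a DataFrame
pet_parts = df['pet'].str.split(': ', expand=True)
pet_parts.columns = ['pet_type', 'pet_name']  # hypothetical column names

# Put the owner's name next to the two new columns
tidy = pd.concat([df[['name']], pet_parts], axis=1)
print(tidy)
```

With the type in its own column, a question like "how many dogs are there?" becomes a simple `value_counts` or group by on `pet_type`.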


### One Variable in Multiple Columns

In [7]:
np.random.seed(123)

df = pd.DataFrame(
    np.random.uniform(60, 100, (4, 4)),
    columns=['Sally', 'Jane', 'Billy', 'Suzy'],
    index=pd.Index(['spelling', 'math', 'reading', 'nuclear physics'], name='subject')
).round(1).reset_index()
df

Unnamed: 0,subject,Sally,Jane,Billy,Suzy
0,spelling,87.9,71.4,69.1,82.1
1,math,88.8,76.9,99.2,87.4
2,reading,79.2,75.7,73.7,89.2
3,nuclear physics,77.5,62.4,75.9,89.5


- What is the average spelling grade?
- What is Jane's average grade?

Sometimes it is desirable to "untidy" the data for quick analysis or visualization. For example, spread the subjects out into columns, with one row per student.
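A sketch of one way to approach the grades table: melt the student columns into a single `student` column, after which both questions become a group by. The frame below retypes the values displayed above; the names `student`, `grade`, `tidy`, and `wide` are chosen for this example.

```python
import pandas as pd

# Values copied from the output above
df = pd.DataFrame({
    'subject': ['spelling', 'math', 'reading', 'nuclear physics'],
    'Sally': [87.9, 88.8, 79.2, 77.5],
    'Jane': [71.4, 76.9, 75.7, 62.4],
    'Billy': [69.1, 99.2, 73.7, 75.9],
    'Suzy': [82.1, 87.4, 89.2, 89.5],
})

# Student names are values of a variable, so move them out of the header
tidy = df.melt(id_vars='subject', var_name='student', value_name='grade')

# Both questions are now answered the same way
by_subject = tidy.groupby('subject')['grade'].mean()
by_student = tidy.groupby('student')['grade'].mean()
print(by_subject['spelling'], by_student['Jane'])

# Deliberately "untidy" again: one row per student, one column per subject
wide = tidy.pivot_table(index='student', columns='subject', values='grade')
print(wide)
```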

### Multiple vars in 2 columns

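- This is an "incorrect melt": the data has been melted too far, so `measure` holds variable names and `measurement` mixes values with different units.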

In [8]:
df = pd.read_csv('./untidy-data/gapminder1.csv')
df.sample(10)

Unnamed: 0,year,country,measure,measurement
1338,1990,Switzerland,life_expect,78.03
108,2000,Bolivia,pop,8152620.0
1396,2005,Afghanistan,fertility,7.0685
638,1955,Switzerland,pop,4980000.0
1296,2000,Rwanda,life_expect,43.413
902,1955,Ecuador,life_expect,51.356
1003,1965,Haiti,life_expect,46.243
1899,1990,New Zealand,fertility,2.061
1802,2000,Italy,fertility,1.286
960,1970,Georgia,life_expect,68.158
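A minimal sketch of undoing this kind of melt, assuming each combination of year, country, and measure appears once. Since the CSV is not reproduced here, the frame below is a stand-in with the same four columns; the country name and all numbers are placeholders, not real gapminder values.

```python
import pandas as pd

# Stand-in for gapminder1.csv: same columns, placeholder values
df = pd.DataFrame({
    'year': [1955, 1955, 1955, 1960, 1960, 1960],
    'country': ['X'] * 6,
    'measure': ['pop', 'life_expect', 'fertility'] * 2,
    'measurement': [100.0, 50.0, 5.0, 110.0, 52.0, 4.5],
})

# Spread each measure into its own column, one row per (year, country)
tidy = (
    df.pivot_table(index=['year', 'country'], columns='measure', values='measurement')
      .reset_index()
)
tidy.columns.name = None  # drop the leftover 'measure' label on the column index
print(tidy)
```

`pivot_table` averages duplicates silently, so if the real file could contain repeated rows it would be worth checking for them first.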


### Another gnarly example

In [9]:
df = pd.read_csv('./untidy-data/gapminder2.csv')
df.head()

Unnamed: 0,country,life_expect_1955,life_expect_1960,life_expect_1965,life_expect_1970,life_expect_1975,life_expect_1980,life_expect_1985,life_expect_1990,life_expect_1995,...,pop_1960,pop_1965,pop_1970,pop_1975,pop_1980,pop_1985,pop_1990,pop_1995,pop_2000,pop_2005
0,Afghanistan,30.332,31.997,34.02,36.088,38.438,39.854,40.822,41.674,41.763,...,9829450,10997885,12430623,14132019,15112149,13796928,14669339,20881480,23898198,29928987
1,Argentina,64.399,65.142,65.634,67.065,68.481,69.942,70.774,71.868,73.275,...,20616009,22283100,23962313,26081880,28369799,30675059,33022202,35311049,37497728,39537943
2,Aruba,64.381,66.606,68.336,70.941,71.83,74.116,74.494,74.108,73.011,...,57203,59020,59039,59390,60266,64129,66653,67836,69539,71566
3,Australia,70.33,70.93,71.1,71.93,73.49,74.74,76.32,77.56,78.83,...,10361273,11439384,12660160,13771400,14615900,15788300,17022133,18116171,19164620,20090437
4,Austria,67.48,69.54,70.14,70.63,72.17,73.18,74.94,76.04,77.51,...,7047437,7270889,7467086,7578903,7549433,7559776,7722953,8047433,8113413,8184691
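Here each column name packs two variables together, a measure and a year. One possible approach is to melt everything except `country`, split the old column names on the last underscore, and then spread the measures back out. The frame below is a small stand-in with placeholder values and only two measures; the real file may contain others in the elided columns.

```python
import pandas as pd

# Stand-in for gapminder2.csv: same naming pattern, placeholder values
df = pd.DataFrame({
    'country': ['X', 'Y'],
    'life_expect_1955': [50.0, 60.0],
    'life_expect_1960': [52.0, 61.0],
    'pop_1955': [100, 200],
    'pop_1960': [110, 210],
})

# Step 1: wide to long, keeping the combined name in one column
long_df = df.melt(id_vars='country', var_name='measure_year', value_name='value')

# Step 2: split on the LAST underscore, since 'life_expect' itself contains one
parts = long_df['measure_year'].str.rsplit('_', n=1, expand=True)
long_df['measure'] = parts[0]
long_df['year'] = parts[1].astype(int)

# Step 3: one column per measure, one row per (country, year)
tidy = (
    long_df.pivot_table(index=['country', 'year'], columns='measure', values='value')
           .reset_index()
)
tidy.columns.name = None
print(tidy)
```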


### A More Complex Example

In [10]:
sales = pd.read_csv('./untidy-data/sales.csv')
sales

Unnamed: 0,Product,2016 Sales,2016 PPU,2017 Sales,2017 PPU,2018 Sales,2018 PPU
0,A,673,5,231,7,173,9
1,B,259,3,748,5,186,8
2,C,644,3,863,5,632,5
3,D,508,9,356,11,347,14
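The sales table has the same problem with a different separator: every column name combines a year with a measure (`Sales` or `PPU`). The sketch below retypes the table shown above and applies a melt, split, and pivot; the intermediate names (`long_sales`, `year_measure`, `tidy`) are chosen for this example.

```python
import pandas as pd

# Values copied from the output above
sales = pd.DataFrame({
    'Product': ['A', 'B', 'C', 'D'],
    '2016 Sales': [673, 259, 644, 508],
    '2016 PPU': [5, 3, 3, 9],
    '2017 Sales': [231, 748, 863, 356],
    '2017 PPU': [7, 5, 5, 11],
    '2018 Sales': [173, 186, 632, 347],
    '2018 PPU': [9, 8, 5, 14],
})

# Wide to long: one row per product per original column
long_sales = sales.melt(id_vars='Product', var_name='year_measure', value_name='value')

# "2016 Sales" -> year 2016, measure "Sales"
parts = long_sales['year_measure'].str.split(' ', expand=True)
long_sales['year'] = parts[0].astype(int)
long_sales['measure'] = parts[1]

# One row per (Product, year), with Sales and PPU as separate columns
tidy = (
    long_sales.pivot_table(index=['Product', 'year'], columns='measure', values='value')
              .reset_index()
)
tidy.columns.name = None
print(tidy)
```

Because `pivot_table` aggregates with the mean, the `Sales` and `PPU` columns come back as floats even though each cell had a single integer value.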
