## Putting a dataframe into long form

Melting a wide dataframe into long form lets us use `groupby` and other pandas functionality on it.

In [1]:
import pandas as pd

In [3]:
# make up some data
data = {'Observer': ['Sam', 'Hannah', 'Ted', 'Alice'],
        'loc_1': [['a', 'b', 'c', 'd'], ['d', 'b', 'b'], ['a', 'c'], ['a', 'a']],
        'loc_2': [['e', 'c', 'f', 'f', 'e'], ['e', 'e', 'f'], ['f', 'c'], ['c', 'c']],
        'loc_3': [['a', 'b', 'b', 'b', 'b'], ['f', 'a', 'a'], ['a', 'b'], ['b', 'b']],
       }
df = pd.DataFrame(data)
print(df)

  Observer         loc_1            loc_2            loc_3
0      Sam  [a, b, c, d]  [e, c, f, f, e]  [a, b, b, b, b]
1   Hannah     [d, b, b]        [e, e, f]        [f, a, a]
2      Ted        [a, c]           [f, c]           [a, b]
3    Alice        [a, a]           [c, c]           [b, b]


In [5]:
# Melt the dataframe into long-form
df_tidy = pd.melt(df, id_vars="Observer", var_name="Location", value_name="Bird codes")
print(df_tidy)

   Observer Location       Bird codes
0       Sam    loc_1     [a, b, c, d]
1    Hannah    loc_1        [d, b, b]
2       Ted    loc_1           [a, c]
3     Alice    loc_1           [a, a]
4       Sam    loc_2  [e, c, f, f, e]
5    Hannah    loc_2        [e, e, f]
6       Ted    loc_2           [f, c]
7     Alice    loc_2           [c, c]
8       Sam    loc_3  [a, b, b, b, b]
9    Hannah    loc_3        [f, a, a]
10      Ted    loc_3           [a, b]
11    Alice    loc_3           [b, b]
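
After melting, each cell in `Bird codes` still holds a whole list, so `groupby` cannot yet count individual codes. One possible next step, sketched here on the same made-up data, is `DataFrame.explode`, which gives every list element its own row. The names `df_obs` and `counts` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same made-up survey data as above, melted into long form
df = pd.DataFrame({
    "Observer": ["Sam", "Hannah", "Ted", "Alice"],
    "loc_1": [["a", "b", "c", "d"], ["d", "b", "b"], ["a", "c"], ["a", "a"]],
    "loc_2": [["e", "c", "f", "f", "e"], ["e", "e", "f"], ["f", "c"], ["c", "c"]],
    "loc_3": [["a", "b", "b", "b", "b"], ["f", "a", "a"], ["a", "b"], ["b", "b"]],
})
df_tidy = pd.melt(df, id_vars="Observer", var_name="Location", value_name="Bird codes")

# explode() puts each list element on its own row; the original index
# values repeat, so reset the index afterwards (df_obs is a hypothetical name)
df_obs = df_tidy.explode("Bird codes").reset_index(drop=True)

# Count how often each bird code was recorded at each location
counts = df_obs.groupby(["Location", "Bird codes"]).size()
print(counts)
```

With one observation per row, the grouped count is a single `groupby` call, and the same frame can be grouped by `Observer` instead without any further reshaping.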


## If a dataframe is "too long"

Sometimes a dataset is too long and needs to be reshaped into a wider form. Here, "too long" does not refer to the overall number of rows of individual observations. A dataset is "too long" when a single column holds more than one variable, which creates extra rows without adding any information compared with the same dataset in tidy form.

In [11]:
# Create a dataframe that is "too long": the attribute column mixes two variables
data = pd.DataFrame({"participant": [1, 2, 3, 1, 2, 3],
                     "attribute": ["age", "age", "age", "income", "income", "income"],
                     "value": [24, 57, 23, 30, 60, 28]})
# Print the dataframe
print(data)

   participant attribute  value
0            1       age     24
1            2       age     57
2            3       age     23
3            1    income     30
4            2    income     60
5            3    income     28


In [12]:
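# Pivot so that each attribute becomes its own column, one row per participant
data_tidy = data.pivot(index="participant",
                       columns="attribute",
                       values="value").reset_index()
# Drop the leftover "attribute" label on the column index
data_tidy.columns.name = None
print(data_tidy)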

   participant  age  income
0            1   24      30
1            2   57      60
2            3   23      28
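
`pivot` works here because every (participant, attribute) pair occurs exactly once. As a hedged sketch of what happens otherwise, the hypothetical frame `data_dup` below repeats one pair: `pivot` raises a `ValueError`, while `pivot_table` aggregates the duplicates instead. The choice of `aggfunc="mean"` is an assumption for illustration; whether averaging is appropriate depends on the data.

```python
import pandas as pd

# Hypothetical variant of the data above: participant 1 has two age records
# and no income record
data_dup = pd.DataFrame({"participant": [1, 1, 2, 2],
                         "attribute": ["age", "age", "age", "income"],
                         "value": [24, 26, 57, 60]})

# pivot() cannot decide which duplicate to keep, so it raises ValueError
try:
    data_dup.pivot(index="participant", columns="attribute", values="value")
except ValueError as err:
    print("pivot failed:", err)

# pivot_table() combines duplicates with aggfunc; pairs that never occur become NaN
wide = data_dup.pivot_table(index="participant",
                            columns="attribute",
                            values="value",
                            aggfunc="mean").reset_index()
wide.columns.name = None
print(wide)
```

Because of the missing value, the resulting columns are floats rather than integers.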
