### Pivoting

Often, we might want to reshape our data to make it easier to view certain relationships between the variables. The pivot() and pivot_table() functions from pandas let us reorganize the entire DataFrame as we wish. Let’s look at each in more detail.

#### The pivot() function

The pivot() function is called on a DataFrame and has three important parameters: index, columns, and values. To each of these parameters, we pass the name of an existing column of our DataFrame. Pandas then performs the following actions to obtain the new DataFrame:

* it takes the distinct entries from the column passed to index and makes these the row indices of the new DataFrame
* it takes the distinct entries from the column passed to columns and makes these the column labels of the new DataFrame
* it takes the entries from the column passed to values and uses them to fill in the new DataFrame, placing each one in the cell whose row index and column label come from the same original row
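
The three steps above can be sketched on a tiny, made-up table. The column names `day`, `city`, and `temp` are illustrative assumptions and are not part of the sensor example that follows. The second statement spells out the same reshape with set_index() and unstack(), which is roughly what pivot() does for a single values column.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the three steps.
toy = pd.DataFrame(
    {
        "day": ["mon", "mon", "tue", "tue"],
        "city": ["A", "B", "A", "B"],
        "temp": [10, 12, 11, 15],
    }
)

# index -> row labels, columns -> column labels, values -> cell contents
wide = toy.pivot(index="day", columns="city", values="temp")

# The same reshape written out by hand: build a (day, city) index,
# then move the "city" level from the rows into the columns.
manual = toy.set_index(["day", "city"])["temp"].unstack("city")

print(wide)
print(manual)
```

Both printed tables should have one row per day and one column per city.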


Suppose we have a sensor that reports the coordinates of some mobile device at equal time intervals. Here are the readings from the first four intervals:

In [1]:
import pandas as pd
import numpy as np

values = [3, 81, 1, 56, 71, 91, 54, 94, 64, 90, 21, 36]
coordinates = ["x", "y", "z"] * 4
time = [0] * 3 + [1] * 3 + [2] * 3 + [3] * 3
df = pd.DataFrame({"time": time, "coordinates": coordinates, "values": values})
df

,time,coordinates,values
0,0,x,3
1,0,y,81
2,0,z,1
3,1,x,56
4,1,y,71
5,1,z,91
6,2,x,54
7,2,y,94
8,2,z,64
9,3,x,90
10,3,y,21
11,3,z,36


A better representation for this data would be to have the x, y, and z coordinates in their own columns, and have a single row corresponding to each time interval. Let’s try to achieve this with the pivot() function. What should we set our parameters to be?

Well, the index should be the column time, since we want one row for each distinct time interval. This means that there will be 4 rows in the new DataFrame with the indices 0, 1, 2, and 3. Next, we would like a separate column for each coordinate, so we set the parameter columns equal to coordinates. This will create a column for each distinct value in the current column coordinates. Finally, we want the entries from the current column values to be the values of our new DataFrame. Here is our full command:



In [2]:
df_pivot = df.pivot(index="time", columns="coordinates", values="values")
df_pivot

coordinates,x,y,z
time,,,
0,3,81,1
1,56,71,91
2,54,94,64
3,90,21,36


#### The pivot_table() function

The pivot_table() function is a generalization of the pivot() function that allows for duplicated values in the pivoted index/column pairs. To demonstrate this, we need a new example where we have such duplicated values. Let’s suppose that our data contained coordinates from a second sensor which was paired with a different mobile device, defined as follows:



In [3]:
values2 = [6, 82, 9, 47, 8, 12, 64, 88, 53, 46, 59, 60]
# Let’s build a new DataFrame that stacks the readings of both devices.
df2 = pd.DataFrame(
    {"time": time * 2, "coordinates": coordinates * 2, "values": values + values2}
)
df2

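,time,coordinates,values
0,0,x,3
1,0,y,81
2,0,z,1
3,1,x,56
4,1,y,71
5,1,z,91
6,2,x,54
7,2,y,94
8,2,z,64
9,3,x,90
10,3,y,21
11,3,z,36
12,0,x,6
13,0,y,82
14,0,z,9
15,1,x,47
16,1,y,8
17,1,z,12
18,2,x,64
19,2,y,88
20,2,z,53
21,3,x,46
22,3,y,59
23,3,z,60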


In [4]:
# Now, what happens if we try the same pivot as before?
# df2.pivot(index='time', columns='coordinates', values='values')
# ValueError: Index contains duplicate entries, cannot reshape

Pandas raises a ValueError because the new index would contain duplicate entries. For example, rows 0 and 12 both have an x in the column coordinates and a 0 in the column time, so the entries of the column values from these two rows would map to the same cell of the new DataFrame. Since pandas doesn’t know which one to keep, it refuses to reshape. The pivot_table() function solves this with an additional parameter called aggfunc, which specifies a function that combines all the values mapping to the same cell into a single value. The default is the mean. Let’s take a look:



In [5]:
df2_pivot = df2.pivot_table(index="time", columns="coordinates", values="values")
df2_pivot

coordinates,x,y,z
time,,,
0,4.5,81.5,5.0
1,51.5,39.5,51.5
2,59.0,91.0,58.5
3,68.0,40.0,48.0
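
To make the two points above concrete, here is a minimal sketch that rebuilds df2 so it can run on its own. It checks that pivot() really raises a ValueError, counts the rows that share a (time, coordinates) pair, and then passes aggfunc="mean" explicitly, which should match the default behavior. The helper names `pivot_failed`, `dupes`, and `explicit_mean` are introduced here for illustration only.

```python
import pandas as pd

# Rebuild the two-device DataFrame from the cells above.
values = [3, 81, 1, 56, 71, 91, 54, 94, 64, 90, 21, 36]
values2 = [6, 82, 9, 47, 8, 12, 64, 88, 53, 46, 59, 60]
coordinates = ["x", "y", "z"] * 4
time = [0] * 3 + [1] * 3 + [2] * 3 + [3] * 3
df2 = pd.DataFrame(
    {"time": time * 2, "coordinates": coordinates * 2, "values": values + values2}
)

# pivot() cannot place two values in one cell, so it raises.
try:
    df2.pivot(index="time", columns="coordinates", values="values")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print(f"pivot() failed: {err}")

# Rows whose (time, coordinates) pair occurs more than once.
dupes = df2[df2.duplicated(subset=["time", "coordinates"], keep=False)]
print(len(dupes), "rows share an index/column pair with another row")

# Spelling out the aggregation instead of relying on the default.
explicit_mean = df2.pivot_table(
    index="time", columns="coordinates", values="values", aggfunc="mean"
)
print(explicit_mean)
```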


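Suppose that we wanted to compute, for each coordinate, how far apart the readings of the first and second mobile devices are. Since each cell aggregates exactly two readings, the maximum minus the minimum gives their absolute difference. We can define our own function for this: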



In [6]:
def distance(a):
    # With two readings per cell, max - min is their absolute difference.
    return np.max(a) - np.min(a)


df2_pivot = df2.pivot_table(
    index="time", columns="coordinates", values="values", aggfunc=distance
)
df2_pivot

coordinates,x,y,z
time,,,
0,3,1,8
1,9,63,79
2,10,6,11
3,44,38,24
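
The table above holds one difference per coordinate. If "distance" is instead taken to mean the straight-line (Euclidean) distance between the two devices at each time, which is an assumption going beyond what the text computes, the per-coordinate differences can be combined as follows. The names `per_axis` and `euclidean` are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the two-device DataFrame so the sketch runs on its own.
values = [3, 81, 1, 56, 71, 91, 54, 94, 64, 90, 21, 36]
values2 = [6, 82, 9, 47, 8, 12, 64, 88, 53, 46, 59, 60]
coordinates = ["x", "y", "z"] * 4
time = [0] * 3 + [1] * 3 + [2] * 3 + [3] * 3
df2 = pd.DataFrame(
    {"time": time * 2, "coordinates": coordinates * 2, "values": values + values2}
)

# Absolute difference between the two readings in each cell.
per_axis = df2.pivot_table(
    index="time",
    columns="coordinates",
    values="values",
    aggfunc=lambda s: s.max() - s.min(),
).astype(float)

# Euclidean distance per time step: sqrt(dx^2 + dy^2 + dz^2).
euclidean = np.sqrt((per_axis**2).sum(axis=1))
print(euclidean)
```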


In [7]:
# If instead we just want to list all the values, we can pass tuple as aggfunc:
df2_pivot = df2.pivot_table(
    index="time", columns="coordinates", values="values", aggfunc=tuple
)
df2_pivot

coordinates,x,y,z
time,,,
0,"(3, 6)","(81, 82)","(1, 9)"
1,"(56, 47)","(71, 8)","(91, 12)"
2,"(54, 64)","(94, 88)","(64, 53)"
3,"(90, 46)","(21, 59)","(36, 60)"


Note that pivot_table() with its default aggfunc (the mean) can only aggregate numeric columns passed to values, whereas pivot() does not aggregate at all: it only reshapes, so it works for both numeric and non-numeric columns. To see this better, let’s consider this example:



In [8]:
df = pd.DataFrame(
    {
        "foo": ["one", "one", "one", "two", "two", "two"],
        "bar": ["A", "B", "C", "A", "B", "C"],
        "baz": [1, 2, 3, 4, 5, 6],
        "zoo": ["x", "y", "z", "q", "w", "t"],
    }
)
df

,foo,bar,baz,zoo
0,one,A,1,x
1,one,B,2,y
2,one,C,3,z
3,two,A,4,q
4,two,B,5,w
5,two,C,6,t


In [9]:
# Let's try both
df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])


,baz,baz,baz,zoo,zoo,zoo
bar,A,B,C,A,B,C
foo,,,,,,
one,1,2,3,x,y,z
two,4,5,6,q,w,t


In [10]:
df.pivot_table(index='foo', columns='bar', values=['baz', 'zoo'])

,baz,baz,baz
bar,A,B,C
foo,,,
one,1,2,3
two,4,5,6


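The non-numeric column 'zoo' survives only in the output of pivot(), which reshapes without aggregating. With its default mean, pivot_table() cannot average the strings in 'zoo' and leaves that column out. Depending on the pandas version, the same call may raise a TypeError instead of silently dropping the column.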
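
A non-numeric column can still go through pivot_table() if the aggregation function is defined for strings. The sketch below is one possible way to do this, using aggfunc="first" to keep the first value in each cell. The names `df_mixed` and `kept` are illustrative, and the data is the same as in the example above.

```python
import pandas as pd

# Same data as the foo/bar/baz/zoo example, rebuilt to be self-contained.
df_mixed = pd.DataFrame(
    {
        "foo": ["one", "one", "one", "two", "two", "two"],
        "bar": ["A", "B", "C", "A", "B", "C"],
        "baz": [1, 2, 3, 4, 5, 6],
        "zoo": ["x", "y", "z", "q", "w", "t"],
    }
)

# "first" works for strings as well as numbers, so 'zoo' is kept.
kept = df_mixed.pivot_table(
    index="foo", columns="bar", values=["baz", "zoo"], aggfunc="first"
)
print(kept)
```

Because every (foo, bar) pair occurs only once here, taking the first value loses nothing. With duplicated pairs, "first" would discard the later readings.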