In [27]:
import pandas as pd
import numpy as np
from datetime import date

# Reshaping by pivoting DataFrame objects

## Pivoting data to and from value and indexes

`DataFrame.pivot(*, index=None, columns=None, values=None)` returns a reshaped DataFrame organized by the given index and column values.

This function does not support data aggregation. Passing multiple value columns results in a MultiIndex in the columns.
A `ValueError` is raised if any index/column pair appears more than once.

`DataFrame.pivot_table()` is a generalization of pivot that can handle duplicate values for one index/column pair.
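The difference between the two functions can be seen with a small sketch. The `readings` frame below is hypothetical data made up for illustration, with one deliberately duplicated (`interval`, `axis`) pair.

```python
import pandas as pd

# Hypothetical readings: interval 3 has two X readings, so the
# (interval, axis) pair (3, "X") is duplicated.
readings = pd.DataFrame(
    {
        "interval": [3, 3, 3, 3],
        "axis": ["X", "Y", "Z", "X"],
        "reading": [0.3, 0.2, 0.7, 0.4],
    }
)

# pivot() cannot decide which of the two X readings to keep.
try:
    readings.pivot(index="interval", columns="axis", values="reading")
    pivot_failed = False
except ValueError as exc:
    pivot_failed = True
    print("pivot() failed:", exc)

# pivot_table() aggregates the duplicates instead (mean by default).
wide = readings.pivot_table(index="interval", columns="axis", values="reading")
print(wide)
```

Here `pivot_table()` averages the two X readings, while `pivot()` refuses to reshape.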

![image.png](attachment:image.png)

In [103]:
sensor = pd.read_csv('data/accel.csv')
sensor

Unnamed: 0,interval,axis,reading
0,0,X,0.0
1,0,Y,0.5
2,0,Z,1.0
3,1,X,0.1
4,1,Y,0.4
5,1,Z,0.9
6,2,X,0.2
7,2,Y,0.3
8,2,Z,0.8
9,3,X,0.3


In [104]:
sensor.pivot(index='interval',
             columns='axis',
             values='reading')

axis,X,Y,Z
interval,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0.0,0.5,1.0
1,0.1,0.4,0.9
2,0.2,0.3,0.8
3,0.3,0.2,0.7


In [105]:
s_add = pd.Series([3, ' X', 0.4], index=['interval', 'axis', 'reading'])
sensor = pd.concat([sensor, s_add.to_frame().T])
sensor

Unnamed: 0,interval,axis,reading
0,0,X,0.0
1,0,Y,0.5
2,0,Z,1.0
3,1,X,0.1
4,1,Y,0.4
5,1,Z,0.9
6,2,X,0.2
7,2,Y,0.3
8,2,Z,0.8
9,3,X,0.3


While `pivot()` provides general purpose pivoting with various data types (strings, numerics, etc.), pandas also provides `pivot_table()` for pivoting with aggregation of numeric data.

The function `pivot_table()` can be used to create spreadsheet-style pivot tables.

It takes a number of arguments:

- `data`: a DataFrame object
- `values`: a column or a list of columns to aggregate
- `index`: a column, Grouper, array which has the same length as data, or list of them. Keys to group by on the pivot table index. If an array is passed, it is used in the same manner as column values.
- `columns`: a column, Grouper, array which has the same length as data, or list of them. Keys to group by on the pivot table column. If an array is passed, it is used in the same manner as column values.
- `aggfunc`: function to use for aggregation, defaulting to numpy.mean.

The result object is a `DataFrame` having potentially hierarchical indexes on the rows and columns. If the `values` column name is not given, the pivot table will include all of the data in an additional level of hierarchy in the column.
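As a minimal sketch of the last point, the hypothetical `sales` frame below (invented for illustration) is pivoted without a `values` argument, so every remaining numeric column is aggregated and the column index gains an extra level.

```python
import pandas as pd

# Hypothetical data with two numeric columns.
sales = pd.DataFrame(
    {
        "Province": ["ON", "ON", "QC", "QC"],
        "Quarter": ["Q1", "Q2", "Q1", "Q2"],
        "Sales": [13, 1, 6, 2],
        "Returns": [1, 0, 2, 1],
    }
)

# No `values` argument: both Sales and Returns are aggregated, and the
# columns become a two-level MultiIndex (source column, Quarter).
table = pd.pivot_table(sales, index="Province", columns="Quarter", aggfunc="sum")
print(table)
print(table.columns.nlevels)
```

A single cell is then addressed with a tuple, for example `table.loc["ON", ("Sales", "Q1")]`.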

If you pass `margins=True` to `pivot_table()`, special `All` columns and rows will be added with partial group aggregates across the categories on the rows and columns.

In [157]:
df_sales = pd.DataFrame(
    data={
        'Province': ['ON', 'QC', 'BC', 'AL', 'AL', 'MN', 'ON'],
        'City': [
            'Toronto',
            'Montreal',
            'Vancouver',
            'Calgary',
            'Edmonton',
            'Winnipeg',
            'Windsor',
        ],
        'Sales': [13, 6, 16, 8, 4, 3, 1],
    }
)
df_sales

Unnamed: 0,Province,City,Sales
0,ON,Toronto,13
1,QC,Montreal,6
2,BC,Vancouver,16
3,AL,Calgary,8
4,AL,Edmonton,4
5,MN,Winnipeg,3
6,ON,Windsor,1


In [158]:
pd.pivot_table(
    df_sales,
    values=['Sales'],
    index=['Province'],
    columns=['City'],
    aggfunc=np.sum,
    margins=True,
).stack()

Unnamed: 0_level_0,Unnamed: 1_level_0,Sales
Province,City,Unnamed: 2_level_1
AL,All,12.0
AL,Calgary,8.0
AL,Edmonton,4.0
BC,All,16.0
BC,Vancouver,16.0
MN,All,3.0
MN,Winnipeg,3.0
ON,All,14.0
ON,Toronto,13.0
ON,Windsor,1.0


In [159]:
pd.pivot_table(
    df_sales,
    values=['Sales'],
    index=['Province', 'City'],
    aggfunc=np.sum,
    margins=True,
)

Unnamed: 0_level_0,Unnamed: 1_level_0,Sales
Province,City,Unnamed: 2_level_1
AL,Calgary,8
AL,Edmonton,4
BC,Vancouver,16
MN,Winnipeg,3
ON,Toronto,13
ON,Windsor,1
QC,Montreal,6
All,,51


## Stacking and unstacking

Similar to the `pivot` function are the `.stack()` and `.unstack()` methods. 

The process of **stacking pivots a level of column labels to the row index**. 
**Unstacking** performs the opposite, that is, **pivoting a level of the row index into the column index**.

One difference between stacking/unstacking and pivoting is that the stack and unstack functions can pivot
specific levels of a hierarchical index. Also, where a pivot retains the same number of levels on an index,
stack and unstack always increase the levels on the index of one of the axes (columns for unstacking and
rows for stacking) and decrease the levels on the other axis.

![image.png](attachment:image.png)

The `stack()` function “compresses” a level in the DataFrame columns to produce either:
- a Series, in the case of a simple column Index
- a DataFrame, in the case of a MultiIndex in the columns.

If the columns have a MultiIndex, you can choose which level to stack. The stacked level becomes the new lowest (innermost) level of the row index.
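Choosing the level to stack can be sketched as follows. The frame `wide` and its level names `exp` and `animal` are hypothetical and exist only for this illustration. In the result, the stacked level appears as the innermost level of the row index.

```python
import pandas as pd

# Hypothetical frame with two column levels, named "exp" and "animal".
columns = pd.MultiIndex.from_tuples(
    [("A", "cat"), ("A", "dog"), ("B", "cat"), ("B", "dog")],
    names=["exp", "animal"],
)
wide = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8]], index=["r0", "r1"], columns=columns
)

# Stack the "animal" level: the columns keep only "exp" (A, B).
by_animal = wide.stack(level="animal")
print(by_animal)

# Stack the "exp" level instead: the columns keep only "animal" (cat, dog).
by_exp = wide.stack(level="exp")
print(by_exp)
```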

In [115]:
ladies

Unnamed: 0,city,name,year,height
1,Taganrog,Lustrova,1977,162
2,Rostov-on-Don,Grigoryan,1977,165
3,Krasnodar,Voronina,1979,158


In [123]:
ladies.stack()

1  city           Taganrog
   name           Lustrova
   year               1977
   height              162
2  city      Rostov-on-Don
   name          Grigoryan
   year               1977
   height              165
3  city          Krasnodar
   name           Voronina
   year               1979
   height              158
dtype: object
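The `ladies` frame is used above without its definition being shown. A minimal reconstruction, assuming it holds exactly the rows and the index displayed in the output above, could look like this:

```python
import pandas as pd

# Reconstructed from the displayed output; the original definition
# is not part of this section, so treat this as an assumption.
ladies = pd.DataFrame(
    {
        "city": ["Taganrog", "Rostov-on-Don", "Krasnodar"],
        "name": ["Lustrova", "Grigoryan", "Voronina"],
        "year": [1977, 1977, 1979],
        "height": [162, 165, 158],
    },
    index=[1, 2, 3],
)

# With a simple column Index, stack() returns a Series whose index
# has two levels: the original row label and the former column label.
stacked_ladies = ladies.stack()
print(stacked_ladies)
```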

In [125]:
tuples = list(
    zip(
        *[
            ['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
            ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'],
        ]
    )
)
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,0.613089,0.272102
bar,two,-1.54048,-0.05289
baz,one,-0.52848,-1.008783
baz,two,1.716575,-0.582551
foo,one,-0.310618,0.651248
foo,two,1.260391,-0.462776
qux,one,0.115768,2.422963
qux,two,0.471242,1.282186


In [127]:
stacked = df.stack()
stacked

first  second   
bar    one     A    0.613089
               B    0.272102
       two     A   -1.540480
               B   -0.052890
baz    one     A   -0.528480
               B   -1.008783
       two     A    1.716575
               B   -0.582551
foo    one     A   -0.310618
               B    0.651248
       two     A    1.260391
               B   -0.462776
qux    one     A    0.115768
               B    2.422963
       two     A    0.471242
               B    1.282186
dtype: float64

![image.png](attachment:image.png)

**Unstacking** performs a similar operation in the opposite direction, by moving a level of the row index into a level of the column axis.

By default, `unstack()` moves the last (innermost) level.

In [129]:
stacked.unstack()

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,0.613089,0.272102
bar,two,-1.54048,-0.05289
baz,one,-0.52848,-1.008783
baz,two,1.716575,-0.582551
foo,one,-0.310618,0.651248
foo,two,1.260391,-0.462776
qux,one,0.115768,2.422963
qux,two,0.471242,1.282186


In [144]:
stacked.unstack(level=['first', 'second'])

first,bar,bar,baz,baz,foo,foo,qux,qux
second,one,two,one,two,one,two,one,two
A,0.613089,-1.54048,-0.52848,1.716575,-0.310618,1.260391,0.115768,0.471242
B,0.272102,-0.05289,-1.008783,-0.582551,0.651248,-0.462776,2.422963,1.282186


In [141]:
df

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,0.613089,0.272102
bar,two,-1.54048,-0.05289
baz,one,-0.52848,-1.008783
baz,two,1.716575,-0.582551
foo,one,-0.310618,0.651248
foo,two,1.260391,-0.462776
qux,one,0.115768,2.422963
qux,two,0.471242,1.282186


In [143]:
df.unstack([0, 1])

   first  second
A  bar    one       0.613089
          two      -1.540480
   baz    one      -0.528480
          two       1.716575
   foo    one      -0.310618
          two       1.260391
   qux    one       0.115768
          two       0.471242
B  bar    one       0.272102
          two      -0.052890
   baz    one      -1.008783
          two      -0.582551
   foo    one       0.651248
          two      -0.462776
   qux    one       2.422963
          two       1.282186
dtype: float64
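The relationship between the two operations can be checked with a short sketch. It uses a smaller, seeded frame than the one above (the name `frame` and the seed are arbitrary choices for this illustration) and also shows unstacking a level selected by name.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
rng = np.random.default_rng(0)  # seeded so the example is repeatable
frame = pd.DataFrame(rng.standard_normal((4, 2)), index=index, columns=["A", "B"])

# stack() moves the columns into the row index; unstack() with no
# arguments moves that innermost level back, restoring the layout.
stacked = frame.stack()
round_trip = stacked.unstack()
print(round_trip)

# Unstacking a named level instead moves "first" into the columns.
by_first = stacked.unstack(level="first")
print(by_first)
```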

## Melting data to and from long and wide format

Melting is a type of un-pivoting, and is often referred to as changing a `DataFrame` object from **wide format** to
**long format**. Long format is common in various statistical analyses, and data you read may already be provided
in a melted form. You may also need to pass data in this format to other code that expects this organization.

Melting undoes a pivot.

The top-level `melt()` function and the corresponding `DataFrame.melt()` are useful to massage a `DataFrame` into a format where one or more columns are identifier variables, while all other columns, considered measured variables, are “unpivoted” to the row axis, leaving just two non-identifier columns, **“variable”** and **“value”**. The names of those columns can be customized by supplying the `var_name` and `value_name` parameters.
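A minimal sketch of customizing the two generated column names is shown here. The frame `people` and the names `measurement` and `amount` are illustrative choices, not part of the original notebook.

```python
import pandas as pd

people = pd.DataFrame(
    {
        "first": ["John", "Mary"],
        "last": ["Doe", "Bo"],
        "height": [5.5, 6.0],
        "weight": [130, 150],
    }
)

# id_vars stay as columns; value_vars are unpivoted into rows.
# var_name / value_name replace the default "variable" / "value" labels.
long = people.melt(
    id_vars=["first", "last"],
    value_vars=["height", "weight"],
    var_name="measurement",
    value_name="amount",
)
print(long)
```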

![image.png](attachment:image.png)

In [145]:
cheese = pd.DataFrame(
    {
        'first': ['John', 'Mary'],
        'last': ['Doe', 'Bo'],
        'height': [5.5, 6.0],
        'weight': [130, 150],
    }
)
cheese

Unnamed: 0,first,last,height,weight
0,John,Doe,5.5,130
1,Mary,Bo,6.0,150


In [148]:
cheese.melt()

Unnamed: 0,variable,value
0,first,John
1,first,Mary
2,last,Doe
3,last,Bo
4,height,5.5
5,height,6.0
6,weight,130
7,weight,150


In [146]:
cheese.melt(id_vars=['first', 'last'])

Unnamed: 0,first,last,variable,value
0,John,Doe,height,5.5
1,Mary,Bo,height,6.0
2,John,Doe,weight,130.0
3,Mary,Bo,weight,150.0
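Since melting undoes a pivot, pivoting a melted frame should bring back the wide layout. The sketch below assumes a single identifier column, `name`, that is unique per row; the frame is hypothetical.

```python
import pandas as pd

wide = pd.DataFrame(
    {"name": ["John", "Mary"], "height": [5.5, 6.0], "weight": [130.0, 150.0]}
)

# Wide -> long: one row per (name, variable) pair.
long = wide.melt(id_vars="name")

# Long -> wide again: "variable" supplies the column labels.
restored = long.pivot(index="name", columns="variable", values="value").reset_index()
restored.columns.name = None  # drop the leftover "variable" label
print(restored)
```

The round trip only works when the identifier columns uniquely identify each row; otherwise `pivot()` raises a `ValueError` because of duplicate entries.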


## Cross tabulations

Use `crosstab()` to compute a cross-tabulation of two (or more) factors. By default `crosstab()` computes a frequency table of the factors unless an array of values and an aggregation function are passed.

It takes a number of arguments:
- `index`: array-like, values to group by in the rows.
- `columns`: array-like, values to group by in the columns.
- `values`: array-like, optional, array of values to aggregate according to the factors.
- `aggfunc`: function, optional. Must be given together with `values`; if neither is passed, a frequency table is computed.
- `rownames`: sequence, default None, if passed, must match number of row arrays passed.
- `colnames`: sequence, default None, if passed, must match number of column arrays passed.
- `margins`: boolean, default False, add row/column margins (subtotals).
- `normalize`: boolean, {'all', 'index', 'columns'}, or {0, 1}, default False. Normalize by dividing all values by the sum of values.
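When plain arrays are passed rather than named Series, `rownames` and `colnames` supply the axis labels. A minimal sketch, using made-up `province` and `channel` arrays:

```python
import numpy as np
import pandas as pd

# Hypothetical factors as plain NumPy arrays (they carry no names).
province = np.array(["ON", "QC", "ON", "BC", "QC", "ON"])
channel = np.array(["web", "web", "store", "store", "web", "web"])

# rownames / colnames label the axes; margins=True adds "All" totals.
counts = pd.crosstab(
    province,
    channel,
    rownames=["province"],
    colnames=["channel"],
    margins=True,
)
print(counts)
```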

In [162]:
df_sales

Unnamed: 0,Province,City,Sales
0,ON,Toronto,13
1,QC,Montreal,6
2,BC,Vancouver,16
3,AL,Calgary,8
4,AL,Edmonton,4
5,MN,Winnipeg,3
6,ON,Windsor,1


In [160]:
pd.crosstab(
    index=df_sales.Province,
    columns=df_sales.City)

City,Calgary,Edmonton,Montreal,Toronto,Vancouver,Windsor,Winnipeg
Province,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
AL,1,1,0,0,0,0,0
BC,0,0,0,0,1,0,0
MN,0,0,0,0,0,0,1
ON,0,0,0,1,0,1,0
QC,0,0,1,0,0,0,0


We can normalize by the row or column totals with the `normalize` parameter. Each cell then shows a fraction of the corresponding total instead of a raw count.

In [161]:
pd.crosstab(
    index=df_sales.Province,
    columns=df_sales.City,
    normalize='columns')

City,Calgary,Edmonton,Montreal,Toronto,Vancouver,Windsor,Winnipeg
Province,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
AL,1.0,1.0,0.0,0.0,0.0,0.0,0.0
BC,0.0,0.0,0.0,0.0,1.0,0.0,0.0
MN,0.0,0.0,0.0,0.0,0.0,0.0,1.0
ON,0.0,0.0,0.0,1.0,0.0,1.0,0.0
QC,0.0,0.0,1.0,0.0,0.0,0.0,0.0
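The other `normalize` options work the same way. The sketch below uses a small hypothetical `orders` frame to compare `normalize='index'` (each row sums to 1) with `normalize='all'` (the whole table sums to 1).

```python
import pandas as pd

# Hypothetical orders: three from ON (two web, one store), one from QC.
orders = pd.DataFrame(
    {
        "province": ["ON", "ON", "ON", "QC"],
        "channel": ["web", "web", "store", "web"],
    }
)

# Fractions within each row (province).
by_row = pd.crosstab(orders["province"], orders["channel"], normalize="index")
print(by_row)

# Fractions of the grand total.
overall = pd.crosstab(orders["province"], orders["channel"], normalize="all")
print(overall)
```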


To change the aggregation function we can provide an argument to `values` and then specify `aggfunc`.

In [164]:
pd.crosstab(
    index=df_sales.Province,
    columns=df_sales.City,
    normalize='columns',
    values=df_sales.Sales,
    aggfunc=np.mean)

City,Calgary,Edmonton,Montreal,Toronto,Vancouver,Windsor,Winnipeg
Province,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
AL,1.0,1.0,0.0,0.0,0.0,0.0,0.0
BC,0.0,0.0,0.0,0.0,1.0,0.0,0.0
MN,0.0,0.0,0.0,0.0,0.0,0.0,1.0
ON,0.0,0.0,0.0,1.0,0.0,1.0,0.0
QC,0.0,0.0,1.0,0.0,0.0,0.0,0.0


To get row and column subtotals, use the `margins` parameter.

In [166]:
pd.crosstab(
    index=df_sales.Province,
    columns=df_sales.City,
    normalize='columns',
    values=df_sales.Sales,
    aggfunc=np.sum,
    margins=True,
    margins_name='total')

City,Calgary,Edmonton,Montreal,Toronto,Vancouver,Windsor,Winnipeg,total
Province,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
AL,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.235294
BC,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.313725
MN,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.058824
ON,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.27451
QC,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.117647
