# Meet the hardest functions of Pandas, Part II
## Master the when and how of `crosstab()`, `pivot()`, `melt()`
<img src='images/hard.jpg'></img>

### Introduction <src id='intro'></src>

### Setup <small id='setup'></small>

In [1]:
# Load necessary libraries
import pandas as pd
import seaborn as sns
import numpy as np

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Enable multiple cell outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

For the sample data, I will be using the `diamonds` dataset that is built into Seaborn. It is large enough and has variables that are well suited to summarizing with `crosstab()`:

In [2]:
diamonds = sns.load_dataset('diamonds')
diamonds.head()

,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


### Pandas `crosstab()`, the basics

Like many functions that compute grouped summary statistics, `crosstab()` works with categorical data. It groups two or more variables and computes a summary value for each combination of groups. Such operations are also possible with `groupby()` or `pivot_table()`, but as we are going to see later, `crosstab()` brings a number of benefits to your daily workflow.

The `crosstab()` function takes two or more lists, `pandas` Series, or dataframe columns and, by default, returns the frequency of each combination. I always like to start with an example so that you can better understand the definition, and then I will move on to explain the syntax.

`crosstab()` always returns a dataframe, and below is an example. The dataframe is a cross tabulation of two variables from `diamonds`: `cut` and `color`. Cross tabulation just means taking one variable and displaying its groups as the index (rows), then taking the other and displaying its groups as columns.

In [3]:
pd.crosstab(index=diamonds['cut'], columns=diamonds['color'])

color,D,E,F,G,H,I,J
cut,,,,,,,
Ideal,2834,3903,3826,4884,3115,2093,896
Premium,1603,2337,2331,2924,2360,1428,808
Very Good,1513,2400,2164,2299,1824,1204,678
Good,662,933,909,871,702,522,307
Fair,163,224,312,314,303,175,119


The syntax is fairly simple. `index` takes the variable whose groups become the index (rows), and `columns` does the same for the columns. If no aggregating function is given, each cell holds the count of observations in that combination. For example, the top left cell tells us that there are 2834 _ideally cut_ diamonds with color code _D_.

Next, for each combination we want to see the mean price. `crosstab()` provides the `values` parameter to introduce a third variable to aggregate on:
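To make the default behavior concrete, here is a minimal sketch on a tiny, made-up dataframe (the `toy` data is hypothetical and not part of `diamonds`). It suggests that, with no aggregating function, each cell is simply the number of rows in that combination, the same numbers you would get from `groupby()` with `size()`:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'cut':   ['Ideal', 'Ideal', 'Good', 'Good', 'Ideal', 'Good'],
    'color': ['D',     'E',     'D',    'D',    'D',     'E'],
})

# Default crosstab: each cell counts the rows with that (cut, color) pair
counts = pd.crosstab(index=toy['cut'], columns=toy['color'])

# The same counts via groupby: size() counts rows per group,
# unstack() moves 'color' from the index to the columns
counts_groupby = toy.groupby(['cut', 'color']).size().unstack(fill_value=0)

print(counts)
print(counts_groupby)
```

Because every row falls into exactly one cell, the cells of `counts` add up to the number of rows in `toy`.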

In [4]:
pd.crosstab(diamonds['cut'], diamonds['color'], values=diamonds['price'], aggfunc=np.mean).round(0)

color,D,E,F,G,H,I,J
cut,,,,,,,
Ideal,2629.0,2598.0,3375.0,3721.0,3889.0,4452.0,4918.0
Premium,3631.0,3539.0,4325.0,4501.0,5217.0,5946.0,6295.0
Very Good,3470.0,3215.0,3779.0,3873.0,4535.0,5256.0,5104.0
Good,3405.0,3424.0,3496.0,4123.0,4276.0,5079.0,4574.0
Fair,4291.0,3682.0,3827.0,4239.0,5136.0,4685.0,4976.0


Now, each cell contains the mean price for each combination of cut and color. To tell `pandas` that we want to compute the mean price, we pass the `price` column to `values`. Note that you always have to use `values` and `aggfunc` together. Otherwise, you will get an error. I also used `round()` to round the results to whole numbers.
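The rule about `values` and `aggfunc` can be checked with a small sketch. The `toy` dataframe below is made up for illustration, and I pass the string `'mean'` as `aggfunc` instead of `np.mean`, which should select the same aggregation:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'cut':   ['Ideal', 'Ideal', 'Good', 'Good'],
    'color': ['D', 'D', 'D', 'E'],
    'price': [100, 300, 200, 400],
})

# values and aggfunc together: mean price per (cut, color) pair
mean_price = pd.crosstab(toy['cut'], toy['color'],
                         values=toy['price'], aggfunc='mean')
print(mean_price)

# values without aggfunc: pandas refuses and raises a ValueError
try:
    pd.crosstab(toy['cut'], toy['color'], values=toy['price'])
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)
```

Combinations that never occur in the data (here, an _Ideal_ cut with color _E_) show up as `NaN` in the aggregated table.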

### Pandas `crosstab()` comparison with `pivot_table()` and `groupby()`

Before we move on to more fun stuff, I need to clarify the differences between the three functions that compute grouped summary stats.

I covered the differences between `pivot_table()` and `groupby()` in the [first part](https://towardsdatascience.com/meet-the-hardest-functions-of-pandas-part-i-7d1f74597e92) of the article. With `crosstab()` in the mix, the three differ in their syntax and in the shape of their results. Let's compute the last `crosstab()` table using all three:

In [5]:
# Using groupby()
diamonds.groupby(['cut', 'color'])['price'].mean().round(0)
# Using pivot_table()
diamonds.pivot_table(values='price', index='cut', columns='color', aggfunc=np.mean).round(0)
# Using crosstab()
pd.crosstab(index=diamonds['cut'], columns=diamonds['color'], values=diamonds['price'], aggfunc=np.mean).round(0)

cut        color
Ideal      D        2629.0
           E        2598.0
           F        3375.0
           G        3721.0
           H        3889.0
           I        4452.0
           J        4918.0
Premium    D        3631.0
           E        3539.0
           F        4325.0
           G        4501.0
           H        5217.0
           I        5946.0
           J        6295.0
Very Good  D        3470.0
           E        3215.0
           F        3779.0
           G        3873.0
           H        4535.0
           I        5256.0
           J        5104.0
Good       D        3405.0
           E        3424.0
           F        3496.0
           G        4123.0
           H        4276.0
           I        5079.0
           J        4574.0
Fair       D        4291.0
           E        3682.0
           F        3827.0
           G        4239.0
           H        5136.0
           I        4685.0
           J        4976.0
Name: price, dtype: float64

color,D,E,F,G,H,I,J
cut,,,,,,,
Ideal,2629.0,2598.0,3375.0,3721.0,3889.0,4452.0,4918.0
Premium,3631.0,3539.0,4325.0,4501.0,5217.0,5946.0,6295.0
Very Good,3470.0,3215.0,3779.0,3873.0,4535.0,5256.0,5104.0
Good,3405.0,3424.0,3496.0,4123.0,4276.0,5079.0,4574.0
Fair,4291.0,3682.0,3827.0,4239.0,5136.0,4685.0,4976.0


color,D,E,F,G,H,I,J
cut,,,,,,,
Ideal,2629.0,2598.0,3375.0,3721.0,3889.0,4452.0,4918.0
Premium,3631.0,3539.0,4325.0,4501.0,5217.0,5946.0,6295.0
Very Good,3470.0,3215.0,3779.0,3873.0,4535.0,5256.0,5104.0
Good,3405.0,3424.0,3496.0,4123.0,4276.0,5079.0,4574.0
Fair,4291.0,3682.0,3827.0,4239.0,5136.0,4685.0,4976.0


I think you already know your favorite. `groupby()` returns a series, while the other two return identical dataframes. However, it is possible to turn the `groupby` series into the same dataframe like this:

In [6]:
grouped = diamonds.groupby(['cut', 'color'])['price'].mean().round(0)
grouped.unstack()

color,D,E,F,G,H,I,J
cut,,,,,,,
Ideal,2629.0,2598.0,3375.0,3721.0,3889.0,4452.0,4918.0
Premium,3631.0,3539.0,4325.0,4501.0,5217.0,5946.0,6295.0
Very Good,3470.0,3215.0,3779.0,3873.0,4535.0,5256.0,5104.0
Good,3405.0,3424.0,3496.0,4123.0,4276.0,5079.0,4574.0
Fair,4291.0,3682.0,3827.0,4239.0,5136.0,4685.0,4976.0


> If you don't understand the syntax of `pivot_table()` and `unstack()`, I highly suggest you read the first part of the article.

When it comes to speed, `crosstab()` and `pivot_table()` are practically tied, and both are slower than `groupby()`:
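If you would rather check the equivalence programmatically than compare tables by eye, a sketch along these lines may help. The `toy` dataframe is made up, every cut and color combination is present so that no `NaN` values appear, and `'mean'` is passed as a string in place of `np.mean`:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'cut':   ['Ideal', 'Ideal', 'Good', 'Good', 'Ideal', 'Good'],
    'color': ['D', 'E', 'D', 'E', 'D', 'E'],
    'price': [100, 200, 300, 400, 500, 600],
})

# Three routes to the same mean-price table
via_groupby = toy.groupby(['cut', 'color'])['price'].mean().unstack()
via_pivot = toy.pivot_table(values='price', index='cut',
                            columns='color', aggfunc='mean')
via_crosstab = pd.crosstab(toy['cut'], toy['color'],
                           values=toy['price'], aggfunc='mean')

# Compare the numbers cell by cell
same_values = (np.allclose(via_groupby.values, via_pivot.values)
               and np.allclose(via_pivot.values, via_crosstab.values))
print(same_values)
```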

In [10]:
%%timeit
diamonds.pivot_table(values='price', index='cut', columns='color', aggfunc=np.mean)

11.3 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [11]:
%%timeit
pd.crosstab(index=diamonds['cut'], columns=diamonds['color'], values=diamonds['price'], aggfunc=np.mean)

11.2 ms ± 358 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [12]:
%%timeit
diamonds.groupby(['cut', 'color'])['price'].mean().unstack()

4.24 ms ± 41 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


As you can see, even when chained with `unstack()`, `groupby()` is more than 2.5 times faster than the other two (about 4.2 ms versus about 11 ms). This suggests that if you just want to group and compute summary stats, you should use the same ol' `groupby()`. The speed difference was even larger when I chained other methods like a simple `round()`.
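The `%%timeit` magic only works inside IPython or Jupyter. If you want to rerun the comparison in a plain script, here is a rough sketch using the standard `timeit` module. It runs on synthetic data that I made up as a stand-in for `diamonds`, so the absolute numbers will differ from the ones above and will vary by machine and `pandas` version:

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for diamonds (invented data, for timing only)
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    'cut': rng.choice(['Fair', 'Good', 'Ideal'], size=n),
    'color': rng.choice(list('DEFG'), size=n),
    'price': rng.integers(300, 19_000, size=n),
})

candidates = {
    'pivot_table': lambda: df.pivot_table(values='price', index='cut',
                                          columns='color', aggfunc='mean'),
    'crosstab': lambda: pd.crosstab(df['cut'], df['color'],
                                    values=df['price'], aggfunc='mean'),
    'groupby': lambda: df.groupby(['cut', 'color'])['price'].mean().unstack(),
}

# Best of 3 repeats, 10 calls each, reported as seconds per call
timings = {
    name: min(timeit.repeat(func, number=10, repeat=3)) / 10
    for name, func in candidates.items()
}
for name, seconds in timings.items():
    print(f'{name}: {seconds * 1000:.2f} ms per call')
```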

The rest of the comparison will mainly be about `pivot_table()` and `crosstab()`. As you saw, the two functions return results of the same shape. The first difference between them is that `crosstab()` can work with any array-like input.

It accepts array-like objects such as lists, `numpy` arrays, and dataframe columns (which are `pandas` Series). In contrast, `pivot_table()` only works on dataframes. In a helpful StackOverflow [thread](https://stackoverflow.com/questions/36267745/how-is-a-pandas-crosstab-different-from-a-pandas-pivot-table), I found out that if you use `crosstab()` on a dataframe, it calls `pivot_table()` under the hood.

Next are the parameters. Each function has parameters that the other lacks. The first and most popular one is `crosstab()`'s `normalize`. `normalize` accepts these options (from the documentation):
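As a quick illustration of the array-like point, the sketch below feeds `crosstab()` a plain Python list and a `numpy` array, with no dataframe involved. The values are made up for the example:

```python
import numpy as np
import pandas as pd

# Made-up inputs: a plain list and a numpy array of the same length
cuts = ['Ideal', 'Good', 'Ideal', 'Good', 'Ideal']
colors = np.array(['D', 'D', 'E', 'E', 'D'])

# No dataframe needed: crosstab lines the two sequences up by position
from_arrays = pd.crosstab(cuts, colors)
print(from_arrays)

# The same inputs wrapped in Series give the same counts
from_series = pd.crosstab(pd.Series(cuts, name='cut'),
                          pd.Series(colors, name='color'))
print(from_series)
```

The only visible difference is the labeling: Series carry their own names into the result, while bare lists and arrays get generic axis names.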

- If passed `all` or `True`, will normalize over all values.

- If passed `index` will normalize over each row.

- If passed `columns` will normalize over each column.

Let's see a simple example:

In [19]:
pd.crosstab(diamonds['cut'], diamonds['color'], normalize='all')

color,D,E,F,G,H,I,J
cut,,,,,,,
Ideal,0.05254,0.072358,0.070931,0.090545,0.057749,0.038802,0.016611
Premium,0.029718,0.043326,0.043215,0.054208,0.043752,0.026474,0.01498
Very Good,0.02805,0.044494,0.040119,0.042621,0.033815,0.022321,0.01257
Good,0.012273,0.017297,0.016852,0.016148,0.013014,0.009677,0.005692
Fair,0.003022,0.004153,0.005784,0.005821,0.005617,0.003244,0.002206


If passed `all`, `pandas` computes each cell's share of the overall total, so all the cells add up to 1 (up to floating-point error):

In [18]:
pd.crosstab(diamonds['cut'], diamonds['color'], normalize='all').values.sum()

1.0000000000000002

If passed `index` or `columns`, the same operation is done row-wise or column-wise, respectively:

In [20]:
pd.crosstab(diamonds['cut'], diamonds['color'], normalize='index')  # or normalize='columns'

color,D,E,F,G,H,I,J
cut,,,,,,,
Ideal,0.131502,0.181105,0.177532,0.226625,0.144541,0.097118,0.041576
Premium,0.116235,0.169458,0.169023,0.212022,0.171126,0.103546,0.058589
Very Good,0.125228,0.198643,0.179109,0.190283,0.150968,0.099652,0.056117
Good,0.134937,0.190175,0.185283,0.177538,0.14309,0.1064,0.062576
Fair,0.101242,0.13913,0.193789,0.195031,0.188199,0.108696,0.073913
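To see what each `normalize` option does, it can help to reproduce it by hand. The following sketch uses a small made-up dataframe (`toy` is hypothetical) and compares each option with the equivalent manual division of the raw counts:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'cut':   ['Ideal', 'Ideal', 'Good', 'Good', 'Ideal', 'Good', 'Ideal', 'Ideal'],
    'color': ['D', 'E', 'D', 'D', 'D', 'E', 'E', 'D'],
})

counts = pd.crosstab(toy['cut'], toy['color'])
by_all = pd.crosstab(toy['cut'], toy['color'], normalize='all')
by_row = pd.crosstab(toy['cut'], toy['color'], normalize='index')
by_col = pd.crosstab(toy['cut'], toy['color'], normalize='columns')

# Manual equivalents built from the raw counts
manual_all = counts / counts.values.sum()              # divide by grand total
manual_row = counts.div(counts.sum(axis=1), axis=0)    # divide by row totals
manual_col = counts.div(counts.sum(axis=0), axis=1)    # divide by column totals

print(by_row)
print(by_row.sum(axis=1))  # each row adds up to 1
print(by_col.sum(axis=0))  # each column adds up to 1
```

So with `normalize='index'` you read the table across a row ("of all _Ideal_ diamonds, what share has color _D_?"), and with `normalize='columns'` you read it down a column.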


In `crosstab()` you can also change the index and column names directly within the function using `rownames` and `colnames`. You don't have to do it manually afterwards. These two arguments are very useful when we group by multiple variables at a time, as you will see later.
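Here is a minimal sketch of `rownames` and `colnames` in action, using made-up lists as input. Plain lists carry no names of their own, so without these arguments `pandas` falls back to generic labels:

```python
import pandas as pd

# Made-up inputs with no names attached
cuts = ['Ideal', 'Good', 'Ideal', 'Good']
colors = ['D', 'D', 'E', 'E']

# Without rownames/colnames, the axes get generic default names
default_names = pd.crosstab(cuts, colors)
print(default_names.index.name, default_names.columns.name)

# With rownames/colnames, the axes are labeled directly in the call;
# each takes a list with one name per grouping variable
named = pd.crosstab(cuts, colors, rownames=['cut'], colnames=['color'])
print(named.index.name, named.columns.name)
```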

### Pandas `crosstab()`, customizing even further