# Data Wrangling: Join, Combine, and Reshape

First, I introduce the concept of hierarchical indexing in pandas, which is used extensively when joining, combining, and reshaping data. I then dig into the particular data manipulations. You can see various applied usages of these tools in [Ch 13: Data Analysis Examples](https://wesmckinney.com/book/data-wrangling.html#data-analysis-examples).


# [Hierarchical Indexing](https://wesmckinney.com/book/data-wrangling.html#pandas_hierarchical)

_Hierarchical indexing_ is an important feature of pandas that enables you to have multiple (two or more) index levels on an axis. Another way of thinking about it is that it lets you work with higher-dimensional data in a lower-dimensional form. Let’s start with a simple example: create a Series with a list of lists (or arrays) as the index:

In [2]:
import pandas as pd
import numpy as np

In [3]:
data = pd.Series(np.random.uniform(size=9),
                 index=[["a", "a", "a", "b", "b", "c", "c", "d", "d"],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])
data

a  1    0.205161
   2    0.666339
   3    0.384218
b  1    0.145549
   3    0.221350
c  1    0.578501
   2    0.914835
d  2    0.036500
   3    0.360519
dtype: float64

What you’re seeing is a prettified view of a Series with a `MultiIndex` as its index. The “gaps” in the index display mean “use the label directly above”:



In [4]:
data.index

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )
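To make the "gaps" concrete, here is a minimal sketch that looks at how the labels are stored. The name `example` and its fixed values (0.0 through 8.0 in place of the random draws) are illustrative assumptions so that the results are reproducible; they are not part of the session above.

```python
import numpy as np
import pandas as pd

# Same index as above, but with fixed values instead of random draws
example = pd.Series(np.arange(9.0),
                    index=[["a", "a", "a", "b", "b", "c", "c", "d", "d"],
                           [1, 2, 3, 1, 3, 1, 2, 2, 3]])

# Each level stores its distinct labels only once
outer_labels = list(example.index.levels[0])
inner_labels = list(example.index.levels[1])

# get_level_values expands a level back to one label per row, which
# fills in the blanks that the console display leaves out
outer_full = list(example.index.get_level_values(0))

print(outer_labels)
print(inner_labels)
print(outer_full)
```

The display omits a repeated outer label only for readability; every row still carries a complete `(outer, inner)` pair.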

With a hierarchically indexed object, so-called _partial_ indexing is possible, enabling you to concisely select subsets of the data:

In [5]:
data["b"]

1    0.145549
3    0.221350
dtype: float64

In [6]:
data["b":"c"]

b  1    0.145549
   3    0.221350
c  1    0.578501
   2    0.914835
dtype: float64

In [7]:
data.loc[["b","d"]]

b  1    0.145549
   3    0.221350
d  2    0.036500
   3    0.360519
dtype: float64
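Partial indexing has a natural counterpart: supplying a label for every level selects a single element. The sketch below assumes a stand-in Series named `example` with fixed values, so the selected numbers are predictable.

```python
import numpy as np
import pandas as pd

# Stand-in for `data` with fixed values (an assumption for reproducibility)
example = pd.Series(np.arange(9.0),
                    index=[["a", "a", "a", "b", "b", "c", "c", "d", "d"],
                           [1, 2, 3, 1, 3, 1, 2, 2, 3]])

# Partial key: only the outer label, so a Series indexed by the inner level
subset = example.loc["b"]

# Full key: one label per level, so a single scalar value
single = example.loc[("b", 3)]

print(subset)
print(single)
```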

Selection is even possible from an “inner” level. Here I select all of the values having the value `2` from the second index level:



In [8]:
data.loc[:, 2]

a    0.666339
c    0.914835
d    0.036500
dtype: float64
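The same inner-level selection can also be written with the `xs` (cross-section) method, which takes the level explicitly. This is a sketch on a stand-in Series named `example` with fixed values (an assumption made for reproducibility), not a replacement for the `loc` form shown above.

```python
import numpy as np
import pandas as pd

# Stand-in for `data` with fixed values
example = pd.Series(np.arange(9.0),
                    index=[["a", "a", "a", "b", "b", "c", "c", "d", "d"],
                           [1, 2, 3, 1, 3, 1, 2, 2, 3]])

# Select every row whose second-level label is 2, two ways
by_loc = example.loc[:, 2]
by_xs = example.xs(2, level=1)

print(by_loc)
print(by_xs)
```

Both forms drop the selected level from the result, leaving only the outer labels as the index.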

Hierarchical indexing plays an important role in reshaping data and in group-based operations like forming a pivot table. For example, you can rearrange this data into a DataFrame using its `unstack` method:



In [9]:
data.unstack()

          1         2         3
a  0.205161  0.666339  0.384218
b  0.145549       NaN  0.221350
c  0.578501  0.914835       NaN
d       NaN  0.036500  0.360519


The inverse operation of `unstack` is `stack`:

In [10]:
data.unstack().stack()

a  1    0.205161
   2    0.666339
   3    0.384218
b  1    0.145549
   3    0.221350
c  1    0.578501
   2    0.914835
d  2    0.036500
   3    0.360519
dtype: float64
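The round trip can be checked on fixed values. In this sketch the names `example`, `wide`, and `restored` are illustrative assumptions. I call `dropna` explicitly after `stack` because, as far as I know, whether `stack` itself discards the missing combinations depends on the pandas version.

```python
import numpy as np
import pandas as pd

# Stand-in for `data` with fixed values
example = pd.Series(np.arange(9.0),
                    index=[["a", "a", "a", "b", "b", "c", "c", "d", "d"],
                           [1, 2, 3, 1, 3, 1, 2, 2, 3]])

# unstack moves the inner level into the columns; label pairs that
# never occurred in the Series become missing values
wide = example.unstack()
n_missing = int(wide.isna().sum().sum())

# stack moves the columns back into the index; dropping the missing
# entries recovers the original nine observations
restored = wide.stack().dropna()

print(wide)
print(n_missing)
print(restored)
```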

`stack` and `unstack` will be explored in more detail later in [Reshaping and Pivoting](https://wesmckinney.com/book/data-wrangling.html#prep_reshape).


With a DataFrame, either axis can have a hierarchical index:



In [11]:
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
                     columns=[["Ohio", "Ohio", "Colorado"],
                              ["Green", "Red", "Green"]])

frame

     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11


The hierarchical levels can have names (as strings or any Python objects). If so, these will show up in the console output:

In [12]:
frame.index.names = ["key1", "key2"]
frame.columns.names = ["state", "color"]
frame

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11


These names supersede the `name` attribute, which is used only with single-level indexes.


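 - Be careful to note that the index names `"state"` and `"color"` are not part of the row labels (the `frame.index` values), even though the display prints them to the left of the column labels, directly above the row labels.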
 

You can see how many levels an index has by accessing its `nlevels` attribute:



In [13]:
frame.index.nlevels

2

With partial column indexing you can similarly select groups of columns:

In [14]:
frame["Ohio"]

color      Green  Red
key1 key2
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10
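Going one step further than a partial key, a tuple with one label per column level selects a single column, and `xs` can select on the inner column level. The following is a minimal sketch that rebuilds the same `frame`; the result names `ohio`, `ohio_red`, and `green` are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
                     columns=[["Ohio", "Ohio", "Colorado"],
                              ["Green", "Red", "Green"]])
frame.index.names = ["key1", "key2"]
frame.columns.names = ["state", "color"]

# Partial key on the outer column level: a DataFrame of the Ohio columns
ohio = frame["Ohio"]

# Full key as a tuple: exactly one column, returned as a Series
ohio_red = frame[("Ohio", "Red")]

# Cross-section on the inner column level, referring to it by name
green = frame.xs("Green", axis=1, level="color")

print(ohio)
print(ohio_red)
print(green)
```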


A `MultiIndex` can be created by itself and then reused; the columns in the preceding DataFrame with level names could also be created like this:

In [15]:
pd.MultiIndex.from_arrays([["Ohio", "Ohio", "Colorado"],
                           ["Green", "Red", "Green"]],
                          names=["state", "color"])

MultiIndex([(    'Ohio', 'Green'),
            (    'Ohio',   'Red'),
            ('Colorado', 'Green')],
           names=['state', 'color'])
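To illustrate the reuse, the sketch below assigns the index to a variable and passes it to two DataFrames. The names `columns`, `first`, `second`, and `same` are illustrative assumptions. `MultiIndex.from_tuples` is shown as one alternative constructor that yields an equal index.

```python
import numpy as np
import pandas as pd

# Build the column index once
columns = pd.MultiIndex.from_arrays([["Ohio", "Ohio", "Colorado"],
                                     ["Green", "Red", "Green"]],
                                    names=["state", "color"])

# Reuse it for any object with three columns
first = pd.DataFrame(np.arange(12).reshape((4, 3)), columns=columns)
second = pd.DataFrame(np.zeros((2, 3)), columns=columns)

# The same index, built from (state, color) tuples instead
same = pd.MultiIndex.from_tuples([("Ohio", "Green"),
                                  ("Ohio", "Red"),
                                  ("Colorado", "Green")],
                                 names=["state", "color"])

print(first)
print(columns.equals(same))
```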

## Reordering and Sorting Levels
At times you may need to rearrange the order of the levels on an axis or sort the data by the values in one specific level. The `swaplevel` method takes two level numbers or names and returns a new object with the levels interchanged (but the data is otherwise unaltered):
