<h1>Chapter 8: Data Wrangling: Join, Combine, and Reshape<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#8.1-Hierarchical-Indexing" data-toc-modified-id="8.1-Hierarchical-Indexing-1">8.1 Hierarchical Indexing</a></span><ul class="toc-item"><li><span><a href="#Reordering-and-Sorting-Levels" data-toc-modified-id="Reordering-and-Sorting-Levels-1.1">Reordering and Sorting Levels</a></span></li><li><span><a href="#Summary-Statistics-by-Level" data-toc-modified-id="Summary-Statistics-by-Level-1.2">Summary Statistics by Level</a></span></li><li><span><a href="#Indexing-with-a-DataFrame’s-columns" data-toc-modified-id="Indexing-with-a-DataFrame’s-columns-1.3">Indexing with a DataFrame’s columns</a></span></li></ul></li><li><span><a href="#8.2-Combining-and-Merging-Datasets" data-toc-modified-id="8.2-Combining-and-Merging-Datasets-2">8.2 Combining and Merging Datasets</a></span><ul class="toc-item"><li><span><a href="#Database-Style-DataFrame-Joins" data-toc-modified-id="Database-Style-DataFrame-Joins-2.1">Database-Style DataFrame Joins</a></span><ul class="toc-item"><li><span><a href="#many-to-one-merge" data-toc-modified-id="many-to-one-merge-2.1.1">many-to-one merge</a></span></li><li><span><a href="#Many-to-many-merge" data-toc-modified-id="Many-to-many-merge-2.1.2">Many-to-many merge</a></span></li></ul></li><li><span><a href="#Merging-on-Index" data-toc-modified-id="Merging-on-Index-2.2">Merging on Index</a></span><ul class="toc-item"><li><span><a href="#Pandas.merge-method" data-toc-modified-id="Pandas.merge-method-2.2.1">Pandas.merge method</a></span></li><li><span><a href="#DataFrame-join-instance-method" data-toc-modified-id="DataFrame-join-instance-method-2.2.2">DataFrame join instance method</a></span></li></ul></li><li><span><a href="#Concatenating-Along-an-Axis" data-toc-modified-id="Concatenating-Along-an-Axis-2.3">Concatenating Along an Axis</a></span></li><li><span><a href="#Combining-Data-with-Overlap" data-toc-modified-id="Combining-Data-with-Overlap-2.4">Combining Data with 
Overlap</a></span></li></ul></li><li><span><a href="#8.3-Reshaping-and-Pivoting" data-toc-modified-id="8.3-Reshaping-and-Pivoting-3">8.3 Reshaping and Pivoting</a></span><ul class="toc-item"><li><span><a href="#Reshaping-with-Hierarchical-Indexing" data-toc-modified-id="Reshaping-with-Hierarchical-Indexing-3.1">Reshaping with Hierarchical Indexing</a></span></li><li><span><a href="#Pivoting-“Long”-to-“Wide”-Format" data-toc-modified-id="Pivoting-“Long”-to-“Wide”-Format-3.2">Pivoting “Long” to “Wide” Format</a></span></li><li><span><a href="#Pivoting-“Wide”-to-“Long”-Format" data-toc-modified-id="Pivoting-“Wide”-to-“Long”-Format-3.3">Pivoting “Wide” to “Long” Format</a></span></li></ul></li></ul></div>

In [1]:
# If you use a Colab notebook, uncomment the following lines to mount your Google Drive.
# After that, the notebook can read and write files and data in your Google Drive.

#from google.colab import drive
#drive.mount('/content/drive')

In [2]:
# If you use a Colab notebook, change the current directory to the folder where you saved
# your notebook and data. For example, I save my Colab files and data in the following location:

#%cd /content/drive/MyDrive/Colab\ Notebooks

In [3]:
# import required libraries and modules, and define default setting for the notebook

import numpy as np
np.random.seed(12345)
np.set_printoptions(precision=4, suppress=True)

import pandas as pd # https://pandas.pydata.org/  Check the documentation there
from pandas import Series, DataFrame # import frequently used classes into the local namespace
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
pd.options.display.max_columns = 20
pd.options.display.max_colwidth = 80

import matplotlib.pyplot as plt
plt.rc("figure", figsize=(10, 6))

In many applications, data may be spread across a number of files or databases, or be
arranged in a form that is not convenient to analyze. This chapter focuses on tools to
help **combine, join, and rearrange** data. 

## 8.1 Hierarchical Indexing

Hierarchical indexing is an important feature of pandas that allows us to have multiple (two or more) index levels on an axis. It provides a way to work with higher dimensional data in a lower dimensional form.

In [4]:
# create a Series of size 9 with two levels of index by passing two lists of indices

data = pd.Series(np.random.uniform(size=9),
                 index=[["a", "a", "a", "b", "b", "c", "c", "d", "d"],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])
data

a  1    0.929616
   2    0.316376
   3    0.183919
b  1    0.204560
   3    0.567725
c  1    0.595545
   2    0.964515
d  2    0.653177
   3    0.748907
dtype: float64

In [5]:
# data is a Series with a MultiIndex as its index

data.index

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )

With a hierarchically indexed object, partial indexing is possible, allowing us to concisely select subsets of the data:

In [6]:
# In this example, the subset is the portion whose first index level equals "b"
data["b"]

1    0.204560
3    0.567725
dtype: float64

In [7]:
# In this example, the subset is the portion with the first level of index = "b":"c"
data["b":"c"]

b  1    0.204560
   3    0.567725
c  1    0.595545
   2    0.964515
dtype: float64

In [8]:
# In this example, the loc indexer extracts the subset whose first-level index is "b" or "d"
data.loc[["b", "d"]]

b  1    0.204560
   3    0.567725
d  2    0.653177
   3    0.748907
dtype: float64

In [9]:
# Selection is even possible from an “inner” level. 
# In this example, we select all of the values having the value 2 from the second index level

data.loc[:, 2]

a    0.316376
c    0.964515
d    0.653177
dtype: float64
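
Inner-level selection can also be written with the `xs` (cross-section) method, which names the level explicitly. The following is a minimal sketch, assuming a hypothetical Series `s` that mirrors the shape of `data` above (the values differ because a different random generator is used):

```python
import numpy as np
import pandas as pd

# Hypothetical two-level Series with the same index layout as `data`
rng = np.random.default_rng(0)
s = pd.Series(rng.uniform(size=9),
              index=[["a", "a", "a", "b", "b", "c", "c", "d", "d"],
                     [1, 2, 3, 1, 3, 1, 2, 2, 3]])

# Select every row whose second-level label is 2, in two equivalent ways
inner_loc = s.loc[:, 2]          # positional-style: "all of level 0, label 2 of level 1"
inner_xs = s.xs(2, level=1)      # cross-section: names the level being matched

print(inner_xs)
```

Both forms drop the matched level, so the result is indexed by the remaining outer labels only.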

In [10]:
# Hierarchical indexing plays an important role in reshaping data and in group-based
# operations like forming a pivot table

# For example, we can rearrange this data into a DataFrame using its unstack method:

data.unstack()

Unnamed: 0,1,2,3
a,0.929616,0.316376,0.183919
b,0.20456,,0.567725
c,0.595545,0.964515,
d,,0.653177,0.748907


In [11]:
# The inverse operation of unstack is stack:

data.unstack().stack()

a  1    0.929616
   2    0.316376
   3    0.183919
b  1    0.204560
   3    0.567725
c  1    0.595545
   2    0.964515
d  2    0.653177
   3    0.748907
dtype: float64

In [12]:
# With a DataFrame, either axis can have a hierarchical index:
# In this example, both axis-0 and axis-1 use hierarchical indices:

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
                     columns=[["Ohio", "Ohio", "Colorado"],
                              ["Green", "Red", "Green"]])
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [13]:
# The hierarchical levels can have names (as strings or any Python objects). If so, these
# will show up in the console output:
    
frame.index.names = ["key1", "key2"]
frame.columns.names = ["state", "color"]
frame

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [14]:
# nlevels attribute returns the number of index levels 

frame.index.nlevels

2

In [15]:
# nlevels attribute returns the number of column levels 

frame.columns.nlevels

2

In [16]:
# Select a subset of data using partial column indexing 

frame["Ohio"]

Unnamed: 0_level_0,color,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,0,1
a,2,3,4
b,1,6,7
b,2,9,10
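
A MultiIndex can also be built on its own and passed in, and a full tuple key selects a single column or cell. This is a small sketch under the assumption that the frame is rebuilt locally as `df` (a hypothetical name) with the same layout as `frame`:

```python
import numpy as np
import pandas as pd

# Build the column MultiIndex explicitly, including the level names
cols = pd.MultiIndex.from_arrays(
    [["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
    names=["state", "color"])

df = pd.DataFrame(np.arange(12).reshape((4, 3)),
                  index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
                  columns=cols)

# A full tuple key picks exactly one column (returned as a Series)
ohio_red = df[("Ohio", "Red")]

# Tuples on both axes pick a single cell
cell = df.loc[("b", 1), ("Colorado", "Green")]

print(ohio_red.tolist(), cell)
```

Partial keys such as `df["Ohio"]` return a sub-DataFrame, while full tuple keys return a Series or a scalar.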


### Reordering and Sorting Levels



In [17]:
# We may need to rearrange the order of the levels on an axis or sort the data
# by the values in one specific level. The swaplevel method takes two level numbers
# or names and returns a new object with the levels interchanged (but the data is
# otherwise unaltered)

frame.swaplevel("key1", "key2")

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
2,a,3,4,5
1,b,6,7,8
2,b,9,10,11


In [18]:
# sort_index by default sorts the data lexicographically using all the index levels, but
# we can choose to use only a single level or a subset of levels to sort by passing the
# level argument

# in this example, we sort the data by the level-1 index (key2)


print(frame,'\n')

frame.sort_index(level=1)

state      Ohio     Colorado
color     Green Red    Green
key1 key2                   
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11 



Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
b,1,6,7,8
a,2,3,4,5
b,2,9,10,11


In [19]:
# swap levels 0 and 1 and then sort by index level 0

frame.swaplevel(0, 1).sort_index(level=0)

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
1,b,6,7,8
2,a,3,4,5
2,b,9,10,11
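
As a hedged illustration of the two steps above, the sketch below rebuilds a similar frame as `df` (a hypothetical name, with a single column to keep it short), compares `swaplevel` with the more general `reorder_levels`, and checks whether the row index is sorted before and after `sort_index`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(4),
                  index=pd.MultiIndex.from_arrays(
                      [["a", "a", "b", "b"], [1, 2, 1, 2]],
                      names=["key1", "key2"]),
                  columns=["value"])

# For two levels, swaplevel and reorder_levels give the same arrangement
swapped = df.swaplevel("key1", "key2")
reordered = df.reorder_levels(["key2", "key1"])

# Swapping levels does not reorder rows, so the index is no longer sorted
sorted_before = swapped.index.is_monotonic_increasing
sorted_after = swapped.sort_index().index.is_monotonic_increasing

print(sorted_before, sorted_after)
```

Selection on a hierarchically indexed object generally performs better when the index is lexicographically sorted starting from the outermost level, which is what calling `sort_index()` produces.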


### Summary Statistics by Level

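To compute descriptive and summary statistics by level, we use groupby with its level argument, which specifies the index level to aggregate by. (Older pandas versions also accepted a level option directly on summary methods such as sum; that option was removed in pandas 2.0.)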

In [20]:
# aggregate by a level on the rows

# In this example, we group data by the "key2" index level and aggregate using the sum method

frame.groupby(level="key2").sum()

state,Ohio,Ohio,Colorado
color,Green,Red,Green
key2,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,6,8,10
2,12,14,16


In [21]:
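# aggregate by a level on the columns

# In this example, data are grouped by "color", and the sum method is used for data aggregation
# Note: the axis argument of groupby is deprecated as of pandas 2.1; on newer versions,
# frame.T.groupby(level="color").sum().T gives the same result

frame.groupby(level="color", axis="columns").sum()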

Unnamed: 0_level_0,color,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,2,1
a,2,8,4
b,1,14,7
b,2,20,10
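
Because `groupby(..., axis="columns")` is deprecated in recent pandas releases, one possible alternative is to transpose, group on the rows, and transpose back. This is a sketch only, with the frame rebuilt locally as `df` (a hypothetical name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((4, 3)),
                  index=pd.MultiIndex.from_arrays(
                      [["a", "a", "b", "b"], [1, 2, 1, 2]],
                      names=["key1", "key2"]),
                  columns=pd.MultiIndex.from_arrays(
                      [["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
                      names=["state", "color"]))

# Transpose so the column levels become row levels, group on "color",
# sum, and transpose back to the original orientation
by_color = df.T.groupby(level="color").sum().T

print(by_color)
```

The result has one column per color, with the two "Green" columns added together row by row.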


### Indexing with a DataFrame’s columns

In [22]:
# We may use one or more columns from a DataFrame as the row index; 
# alternatively, we may wish to move the row index into the DataFrame’s columns. 
# Here’s an example DataFrame:


frame = pd.DataFrame({"a": range(7), "b": range(7, 0, -1),
                      "c": ["one", "one", "one", "two", "two",
                            "two", "two"],
                      "d": [0, 1, 2, 0, 1, 2, 3]})
frame

Unnamed: 0,a,b,c,d
0,0,7,one,0
1,1,6,one,1
2,2,5,one,2
3,3,4,two,0
4,4,3,two,1
5,5,2,two,2
6,6,1,two,3


In [23]:
# DataFrame’s set_index method creates a new DataFrame using one or more of
# its columns as the index:

frame2 = frame.set_index(["c", "d"])
frame2

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1
one,0,0,7
one,1,1,6
one,2,2,5
two,0,3,4
two,1,4,3
two,2,5,2
two,3,6,1


In [24]:
# By default, the columns set as indices are removed from the DataFrame. 
# But we can leave them in by passing drop=False to set_index:

frame.set_index(["c", "d"], drop=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b,c,d
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
one,0,0,7,one,0
one,1,1,6,one,1
one,2,2,5,one,2
two,0,3,4,two,0
two,1,4,3,two,1
two,2,5,2,two,2
two,3,6,1,two,3


In [25]:
# reset_index, on the other hand, does the opposite of set_index; 
# the hierarchical index levels are moved into the columns:

frame2.reset_index()

Unnamed: 0,c,d,a,b
0,one,0,0,7
1,one,1,1,6
2,one,2,2,5
3,two,0,3,4
4,two,1,4,3
5,two,2,5,2
6,two,3,6,1
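
`reset_index` also accepts a `level` argument, so only part of a hierarchical index can be moved back into the columns. The sketch below assumes a local copy of the example data named `df` (hypothetical) and also checks that `set_index` followed by `reset_index` recovers the original columns:

```python
import pandas as pd

df = pd.DataFrame({"a": range(7), "b": range(7, 0, -1),
                   "c": ["one", "one", "one", "two", "two", "two", "two"],
                   "d": [0, 1, 2, 0, 1, 2, 3]})

indexed = df.set_index(["c", "d"])

# Move only the "d" level back into the columns; "c" stays as the row index
partial = indexed.reset_index(level="d")

# A full reset followed by reordering the columns restores the original frame
restored = indexed.reset_index()[df.columns.tolist()]

print(partial.head(3))
```

After a full reset the former index levels come first, which is why the columns are reordered before comparing with the original.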


## 8.2 Combining and Merging Datasets

Data contained in pandas objects can be combined in a number of ways:

**pandas.merge**\
Connect rows in DataFrames based on one or more keys. This will be familiar
to users of SQL or other relational databases, as it implements database join
operations.

**pandas.concat**\
Concatenate or “stack” objects together along an axis.

**combine_first**\
Splice together overlapping data to fill in missing values in one object with values
from another.

### Database-Style DataFrame Joins

Merge or join operations combine datasets by linking rows using one or more keys.
These operations are particularly important in relational databases (e.g., SQL-based).
The **pandas.merge** function is the main entry point for using these algorithms
on our data.

Table 8-1. Different join types with the how argument\
Table 8-2. pandas.merge function arguments

#### many-to-one merge

In [26]:
# This is an example of a many-to-one join; 
# the data in df1 has multiple rows labeled a and b, 
# whereas df2 has only one row for each value in the key column.

df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "a", "b"],
                    "data1": pd.Series(range(7), dtype="Int64")})
df2 = pd.DataFrame({"key": ["a", "b", "d"],
                    "data2": pd.Series(range(3), dtype="Int64")})
print(df1,'\n')

print(df2)

  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   a      5
6   b      6 

  key  data2
0   a      0
1   b      1
2   d      2


In [27]:
# Calling pandas.merge with these objects, we obtain

pd.merge(df1, df2)

# Note that we didn’t specify which column to join on. If that information is not
# specified, pandas.merge uses the overlapping column names as the keys. 
# In this example, the only overlapping column name is "key"

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,1,1
2,b,6,1
3,a,2,0
4,a,4,0
5,a,5,0


In [28]:
# It’s a good practice to specify explicitly what column to join on

pd.merge(df1, df2, on="key")

# We notice that the "c" and "d" values and associated data are missing from
# the result. By default, pandas.merge does an "inner" join; the keys in the result are
# the intersection, or the common set found in both tables. 

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,1,1
2,b,6,1
3,a,2,0
4,a,4,0
5,a,5,0


In [29]:
# Other possible options are "left", "right", and "outer". 
# The outer join takes the union of the keys, combining the effect of applying both left and right joins:
    
pd.merge(df1, df2, how="outer")

# In an outer join, rows from the left or right DataFrame objects that do not match
# on keys in the other DataFrame will appear with NA values in the other DataFrame’s
# columns for the nonmatching rows.

Unnamed: 0,key,data1,data2
0,b,0.0,1.0
1,b,1.0,1.0
2,b,6.0,1.0
3,a,2.0,0.0
4,a,4.0,0.0
5,a,5.0,0.0
6,c,3.0,
7,d,,2.0
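
To see where each row of an outer join came from, `pandas.merge` accepts `indicator=True`, which adds a `_merge` column. The following sketch uses hypothetical frames `left_df` and `right_df` with plain integer columns that mirror `df1` and `df2`, and also compares the row counts of left and right joins:

```python
import pandas as pd

left_df = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "a", "b"],
                        "data1": range(7)})
right_df = pd.DataFrame({"key": ["a", "b", "d"],
                         "data2": range(3)})

# indicator=True adds a "_merge" column with the values
# "left_only", "right_only", or "both"
outer = pd.merge(left_df, right_df, on="key", how="outer", indicator=True)

# A left join keeps every key from left_df; a right join keeps every key from right_df
left_join = pd.merge(left_df, right_df, on="key", how="left")
right_join = pd.merge(left_df, right_df, on="key", how="right")

print(outer[["key", "_merge"]])
```

Here the "c" row is marked `left_only` and the "d" row `right_only`, which matches the NA pattern in the outer join output.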


In [30]:
# If column names are different in each object, we can specify them separately

df3 = pd.DataFrame({"lkey": ["b", "b", "a", "c", "a", "a", "b"],
                    "data1": pd.Series(range(7), dtype="Int64")})
df4 = pd.DataFrame({"rkey": ["a", "b", "d"],
                    "data2": pd.Series(range(3), dtype="Int64")})
print(df3,'\n')
print(df4,'\n')

pd.merge(df3, df4, left_on="lkey", right_on="rkey")



  lkey  data1
0    b      0
1    b      1
2    a      2
3    c      3
4    a      4
5    a      5
6    b      6 

  rkey  data2
0    a      0
1    b      1
2    d      2 



Unnamed: 0,lkey,data1,rkey,data2
0,b,0,b,1
1,b,1,b,1
2,b,6,b,1
3,a,2,a,0
4,a,4,a,0
5,a,5,a,0


In [31]:
pd.merge(df3, df4, left_on="lkey", right_on="rkey", how="outer")

Unnamed: 0,lkey,data1,rkey,data2
0,b,0.0,b,1.0
1,b,1.0,b,1.0
2,b,6.0,b,1.0
3,a,2.0,a,0.0
4,a,4.0,a,0.0
5,a,5.0,a,0.0
6,c,3.0,,
7,,,d,2.0


#### Many-to-many merge

In [32]:
# Many-to-many merges form the Cartesian product of the matching keys. Here’s an example:
    
df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                    "data1": pd.Series(range(6), dtype="Int64")})
df2 = pd.DataFrame({"key": ["a", "b", "a", "b", "d"],
                    "data2": pd.Series(range(5), dtype="Int64")})
print(df1,'\n')
print(df2)

# how="left" uses all key combinations found in the left table
pd.merge(df1, df2, on="key", how="left")

# Since there were three "b" rows in the left DataFrame and two in the right one, there
# are six "b" rows in the result. 

  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5 

  key  data2
0   a      0
1   b      1
2   a      2
3   b      3
4   d      4


Unnamed: 0,key,data1,data2
0,b,0,1.0
1,b,0,3.0
2,b,1,1.0
3,b,1,3.0
4,a,2,0.0
5,a,2,2.0
6,c,3,
7,a,4,0.0
8,a,4,2.0
9,b,5,1.0
10,b,5,3.0


In [33]:
# The join method passed to the how keyword argument affects only the distinct 
# key values appearing in the result:

# how="inner" Use only the key combinations observed in both tables
pd.merge(df1, df2, how="inner")

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,0,3
2,b,1,1
3,b,1,3
4,b,5,1
5,b,5,3
6,a,2,0
7,a,2,2
8,a,4,0
9,a,4,2
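
Because many-to-many merges multiply rows, it can help to state the expected relationship and let pandas check it. The sketch below uses the `validate` argument of `pandas.merge`; the frame names `left_df` and `right_df` are hypothetical stand-ins for `df1` and `df2`:

```python
import pandas as pd
from pandas.errors import MergeError

left_df = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                        "data1": range(6)})
right_df = pd.DataFrame({"key": ["a", "b", "a", "b", "d"],
                         "data2": range(5)})

# Inner join: 3 "b" rows x 2 "b" rows + 2 "a" rows x 2 "a" rows = 10 rows
inner = pd.merge(left_df, right_df, on="key", how="inner")

# validate="many_to_one" requires the keys in the right frame to be unique;
# here they are not, so pandas raises MergeError instead of multiplying rows
try:
    pd.merge(left_df, right_df, on="key", validate="many_to_one")
    raised = False
except MergeError:
    raised = True

print(len(inner), raised)
```

Passing `validate` is optional, but it turns an accidental many-to-many merge into an explicit error.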


In [34]:
# To merge with multiple keys, pass a list of column names:

left = pd.DataFrame({"key1": ["foo", "foo", "bar"],
                     "key2": ["one", "two", "one"],
                     "lval": pd.Series([1, 2, 3], dtype='Int64')})
right = pd.DataFrame({"key1": ["foo", "foo", "bar", "bar"],
                      "key2": ["one", "one", "one", "two"],
                      "rval": pd.Series([4, 5, 6, 7], dtype='Int64')})

print(left,'\n')
print(right,'\n')

# how="outer" Use all key combinations observed in both tables together
pd.merge(left, right, on=["key1", "key2"], how="outer")

  key1 key2  lval
0  foo  one     1
1  foo  two     2
2  bar  one     3 

  key1 key2  rval
0  foo  one     4
1  foo  one     5
2  bar  one     6
3  bar  two     7 



Unnamed: 0,key1,key2,lval,rval
0,foo,one,1.0,4.0
1,foo,one,1.0,5.0
2,foo,two,2.0,
3,bar,one,3.0,6.0
4,bar,two,,7.0


In [35]:
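# Treatment of overlapping column names: when both DataFrames have a non-key column with the
# same name (here key2), pandas.merge appends the suffixes _x and _y by default.

# For example: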
pd.merge(left, right, on="key1")

Unnamed: 0,key1,key2_x,lval,key2_y,rval
0,foo,one,1,one,4
1,foo,one,1,one,5
2,foo,two,2,one,4
3,foo,two,2,one,5
4,bar,one,3,one,6
5,bar,one,3,two,7


In [36]:
# pandas.merge has a suffixes option for specifying strings to append to 
# overlapping names in the left and right DataFrame objects:

pd.merge(left, right, on="key1", suffixes=("_left", "_right"))

Unnamed: 0,key1,key2_left,lval,key2_right,rval
0,foo,one,1,one,4
1,foo,one,1,one,5
2,foo,two,2,one,4
3,foo,two,2,one,5
4,bar,one,3,one,6
5,bar,one,3,two,7


### Merging on Index

#### Pandas.merge method

In [37]:
# In some cases, the merge key(s) in a DataFrame will be found in its index (row
# labels). In this case, you can pass left_index=True or right_index=True (or both) to
# indicate that the index should be used as the merge key:
    
left1 = pd.DataFrame({"key": ["a", "b", "a", "a", "b", "c"],
                      "value": pd.Series(range(6), dtype="Int64")})
right1 = pd.DataFrame({"group_val": [3.5, 7]}, index=["a", "b"])
print(left1,'\n')
print(right1,'\n')

pd.merge(left1, right1, left_on="key", right_index=True)

  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5 

   group_val
a        3.5
b        7.0 



Unnamed: 0,key,value,group_val
0,a,0,3.5
2,a,2,3.5
3,a,3,3.5
1,b,1,7.0
4,b,4,7.0


In [38]:
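# Since the default merge method is to intersect the join keys, the "c" row was dropped above.
# We can instead form the union of the keys with an outer join (how="outer"):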

print(left1,'\n')
print(right1,'\n')
    
pd.merge(left1, right1, left_on="key", right_index=True, how="outer")


  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5 

   group_val
a        3.5
b        7.0 



Unnamed: 0,key,value,group_val
0,a,0,3.5
2,a,2,3.5
3,a,3,3.5
1,b,1,7.0
4,b,4,7.0
5,c,5,


In [39]:
# With hierarchically indexed data, things are more complicated, as joining on index is
# equivalent to a multiple-key merge:
    
lefth = pd.DataFrame({"key1": ["Ohio", "Ohio", "Ohio",
                               "Nevada", "Nevada"],
                      "key2": [2000, 2001, 2002, 2001, 2002],
                      "data": pd.Series(range(5), dtype="Int64")})

righth_index = pd.MultiIndex.from_arrays(
    [
        ["Nevada", "Nevada", "Ohio", "Ohio", "Ohio", "Ohio"],
        [2001, 2000, 2000, 2000, 2001, 2002]
    ]
)

righth = pd.DataFrame({"event1": pd.Series([0, 2, 4, 6, 8, 10], dtype="Int64",
                                           index=righth_index),
                       "event2": pd.Series([1, 3, 5, 7, 9, 11], dtype="Int64",
                                           index=righth_index)})
print(lefth,'\n')
print(righth,'\n')

     key1  key2  data
0    Ohio  2000     0
1    Ohio  2001     1
2    Ohio  2002     2
3  Nevada  2001     3
4  Nevada  2002     4 

             event1  event2
Nevada 2001       0       1
       2000       2       3
Ohio   2000       4       5
       2000       6       7
       2001       8       9
       2002      10      11 



In [40]:
# In this case, we have to indicate multiple columns to merge on as a list. Note that duplicate
# index values on the right (here Ohio, 2000) produce one output row per match:
    
pd.merge(lefth, righth, left_on=["key1", "key2"], right_index=True)

Unnamed: 0,key1,key2,data,event1,event2
0,Ohio,2000,0,4,5
0,Ohio,2000,0,6,7
1,Ohio,2001,1,8,9
2,Ohio,2002,2,10,11
3,Nevada,2001,3,0,1


In [41]:
pd.merge(lefth, righth, left_on=["key1", "key2"],
         right_index=True, how="outer")

Unnamed: 0,key1,key2,data,event1,event2
0,Ohio,2000,0.0,4.0,5.0
0,Ohio,2000,0.0,6.0,7.0
1,Ohio,2001,1.0,8.0,9.0
2,Ohio,2002,2.0,10.0,11.0
3,Nevada,2001,3.0,0.0,1.0
4,Nevada,2002,4.0,,
4,Nevada,2000,,2.0,3.0
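
An alternative to `right_index=True` is to name the index levels, move them into columns with `reset_index`, and merge on columns. This is a sketch under assumptions: the frames are rebuilt with plain integer dtypes, and the level names `key1` and `key2` are assigned here (the original `righth` index is unnamed):

```python
import pandas as pd

lefth = pd.DataFrame({"key1": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
                      "key2": [2000, 2001, 2002, 2001, 2002],
                      "data": range(5)})

idx = pd.MultiIndex.from_arrays(
    [["Nevada", "Nevada", "Ohio", "Ohio", "Ohio", "Ohio"],
     [2001, 2000, 2000, 2000, 2001, 2002]],
    names=["key1", "key2"])  # names added for this sketch
righth = pd.DataFrame({"event1": [0, 2, 4, 6, 8, 10],
                       "event2": [1, 3, 5, 7, 9, 11]}, index=idx)

# Merge the left columns against the right index
via_index = pd.merge(lefth, righth, left_on=["key1", "key2"], right_index=True)

# Merge on columns after moving the right index levels into columns
via_columns = pd.merge(lefth, righth.reset_index(), on=["key1", "key2"])

print(via_index.index.tolist(), via_columns.index.tolist())
```

Both approaches match the same rows. The index-based merge keeps the left frame's row labels (repeating a label for duplicate matches), while the column-based merge produces a fresh default index.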


In [42]:
# Using the indexes of both sides of the merge is also possible

left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                     index=["a", "c", "e"],
                     columns=["Ohio", "Nevada"]).astype("Int64")
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
                      index=["b", "c", "d", "e"],
                      columns=["Missouri", "Alabama"]).astype("Int64")
print(left2,'\n')
print(right2,'\n')

pd.merge(left2, right2, how="outer", left_index=True, right_index=True)

   Ohio  Nevada
a     1       2
c     3       4
e     5       6 

   Missouri  Alabama
b         7        8
c         9       10
d        11       12
e        13       14 



Unnamed: 0,Ohio,Nevada,Missouri,Alabama
a,1.0,2.0,,
b,,,7.0,8.0
c,3.0,4.0,9.0,10.0
d,,,11.0,12.0
e,5.0,6.0,13.0,14.0


#### DataFrame join instance method

In [43]:
# DataFrame has a join instance method to simplify merging by index. It can also be
# used to combine many DataFrame objects having the same or similar indexes but
# nonoverlapping columns. In the prior example, we could have written

left2.join(right2, how="outer")

Unnamed: 0,Ohio,Nevada,Missouri,Alabama
a,1.0,2.0,,
b,,,7.0,8.0
c,3.0,4.0,9.0,10.0
d,,,11.0,12.0
e,5.0,6.0,13.0,14.0


In [44]:
# Compared with pandas.merge, DataFrame’s join method performs a left join on the
# join keys by default. It also supports joining the index of the passed DataFrame on
# one of the columns of the calling DataFrame

print(left1,'\n')
print(right1,'\n')

left1.join(right1, on="key")

  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5 

   group_val
a        3.5
b        7.0 



Unnamed: 0,key,value,group_val
0,a,0,3.5
1,b,1,7.0
2,a,2,3.5
3,a,3,3.5
4,b,4,7.0
5,c,5,


In [45]:
# Lastly, for simple index-on-index merges, we can pass a list of DataFrames to join
# as an alternative to using the more general pandas.concat function described in the
# next section:

another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]],
                       index=["a", "c", "e", "f"],
                       columns=["New York", "Oregon"])
print(another,'\n')

print(left2,'\n')
print(right2,'\n')

left2.join([right2, another])

   New York  Oregon
a       7.0     8.0
c       9.0    10.0
e      11.0    12.0
f      16.0    17.0 

   Ohio  Nevada
a     1       2
c     3       4
e     5       6 

   Missouri  Alabama
b         7        8
c         9       10
d        11       12
e        13       14 



Unnamed: 0,Ohio,Nevada,Missouri,Alabama,New York,Oregon
a,1,2,,,7.0,8.0
c,3,4,9.0,10.0,9.0,10.0
e,5,6,13.0,14.0,11.0,12.0


In [46]:
left2.join([right2, another], how="outer")

Unnamed: 0,Ohio,Nevada,Missouri,Alabama,New York,Oregon
a,1.0,2.0,,,7.0,8.0
c,3.0,4.0,9.0,10.0,9.0,10.0
e,5.0,6.0,13.0,14.0,11.0,12.0
b,,,7.0,8.0,,
d,,,11.0,12.0,,
f,,,,,16.0,17.0


### Concatenating Along an Axis 

Another kind of data combination operation is referred to interchangeably as concatenation or stacking. 

Table 8-3. pandas.concat function arguments

In [47]:
# NumPy’s concatenate function can do this with NumPy arrays:
    
arr = np.arange(12).reshape((3, 4))
print(arr,'\n')

np.concatenate([arr, arr], axis=1) # axis=1 means concatenate along axis 1

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]] 



array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])

In [48]:
# we have three Series with no index overlap:

s1 = pd.Series([0, 1], index=["a", "b"], dtype="Int64")
s2 = pd.Series([2, 3, 4], index=["c", "d", "e"], dtype="Int64")
s3 = pd.Series([5, 6], index=["f", "g"], dtype="Int64")

print(s1,'\n')
print(s2,'\n')
print(s3,'\n')

a    0
b    1
dtype: Int64 

c    2
d    3
e    4
dtype: Int64 

f    5
g    6
dtype: Int64 



In [49]:
# pandas.concat with these objects in a list glues together the values and indexes:

pd.concat([s1, s2, s3])

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: Int64

In [50]:
# By default, pandas.concat works along axis="index", producing another Series. If
# we pass axis="columns" or axis=1, the result will be a DataFrame

pd.concat([s1, s2, s3], axis="columns")

Unnamed: 0,0,1,2
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


In [51]:
# In the previous cell there was no index overlap between the Series, so the row index of the
# result is the union (the "outer" join) of the indexes. Here, s4 shares the labels "a" and "b" with s1:
    
s4 = pd.concat([s1, s3])
print(s4,'\n')

pd.concat([s1, s4], axis="columns")

a    0
b    1
f    5
g    6
dtype: Int64 



Unnamed: 0,0,1
a,0.0,0
b,1.0,1
f,,5
g,,6


In [52]:
# We can instead intersect them by passing join="inner":

pd.concat([s1, s4], axis="columns", join="inner")

# compared to the last cell, the "f" and "g" labels disappeared because of the join="inner" option.

Unnamed: 0,0,1
a,0,0
b,1,1


In [53]:
# A potential issue is that the concatenated pieces are not identifiable in the result.
# Suppose instead we would like to create a hierarchical index on the concatenation axis.
# To do this, use the keys argument

result = pd.concat([s1, s1, s3], keys=["one", "two", "three"])
result

one    a    0
       b    1
two    a    0
       b    1
three  f    5
       g    6
dtype: Int64

In [54]:
result.unstack()

Unnamed: 0,a,b,f,g
one,0.0,1.0,,
two,0.0,1.0,,
three,,,5.0,6.0
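
One practical use of `keys` is that each concatenated piece can be recovered by its key. A minimal sketch, assuming plain integer Series named `p1` and `p3` (hypothetical names standing in for `s1` and `s3`):

```python
import pandas as pd

p1 = pd.Series([0, 1], index=["a", "b"])
p3 = pd.Series([5, 6], index=["f", "g"])

combined = pd.concat([p1, p1, p3], keys=["one", "two", "three"])

# Selecting on the outer level returns the original piece
piece = combined.loc["three"]

print(piece)
```

The outer level is dropped on selection, so `piece` has the same index and values as `p3`.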


In [55]:
# In the case of combining Series along axis="columns", the keys become the DataFrame column headers

pd.concat([s1, s2, s3], axis="columns", keys=["one", "two", "three"])

Unnamed: 0,one,two,three
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


In [56]:
# The same logic extends to DataFrame objects

# In this example, df1 and df2 are two DataFrame objects
df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=["a", "b", "c"],
                   columns=["one", "two"])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=["a", "c"],
                   columns=["three", "four"])
print(df1,'\n')
print(df2,'\n')

# we concatenate them along axis-1 (columns). 
pd.concat([df1, df2], axis="columns", keys=["level1", "level2"])

# Here the keys argument is used to create a hierarchical index where the first level can
# be used to identify each of the concatenated DataFrame objects.

   one  two
a    0    1
b    2    3
c    4    5 

   three  four
a      5     6
c      7     8 



Unnamed: 0_level_0,level1,level1,level2,level2
Unnamed: 0_level_1,one,two,three,four
a,0,1,5.0,6.0
b,2,3,,
c,4,5,7.0,8.0


In [57]:
# If we pass a dictionary of objects instead of a list, the dictionary’s keys will be used
# for the keys option

pd.concat({"level1": df1, "level2": df2}, axis="columns")

Unnamed: 0_level_0,level1,level1,level2,level2
Unnamed: 0_level_1,one,two,three,four
a,0,1,5.0,6.0
b,2,3,,
c,4,5,7.0,8.0


In [58]:
# we can name the created axis levels with the names argument:

pd.concat([df1, df2], axis="columns", keys=["level1", "level2"],
          names=["upper", "lower"])

upper,level1,level1,level2,level2
lower,one,two,three,four
a,0,1,5.0,6.0
b,2,3,,
c,4,5,7.0,8.0


In [59]:
# Sometimes, the row index does not contain any relevant data:

df1 = pd.DataFrame(np.random.standard_normal((3, 4)),
                   columns=["a", "b", "c", "d"])
df2 = pd.DataFrame(np.random.standard_normal((2, 3)),
                   columns=["b", "d", "a"])
print(df1,'\n')
print(df2,'\n')

          a         b         c         d
0  1.248804  0.774191 -0.319657 -0.624964
1  1.078814  0.544647  0.855588  1.343268
2 -0.267175  1.793095 -0.652929 -1.886837 

          b         d         a
0  1.059626  0.644448 -0.007799
1 -0.449204  2.448963  0.667226 



In [60]:
# In this case, we can pass ignore_index=True, which discards the indexes from each DataFrame 
# and concatenates the data in the columns only, assigning a new default index

pd.concat([df1, df2], ignore_index=True)

Unnamed: 0,a,b,c,d
0,1.248804,0.774191,-0.319657,-0.624964
1,1.078814,0.544647,0.855588,1.343268
2,-0.267175,1.793095,-0.652929,-1.886837
3,-0.007799,1.059626,,0.644448
4,0.667226,-0.449204,,2.448963
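
When the row labels do matter, overlapping labels may be a sign of a mistake. The sketch below contrasts `ignore_index=True` with `verify_integrity=True`, which makes `pandas.concat` raise an error on duplicate labels; the small frames `top` and `bottom` are hypothetical:

```python
import pandas as pd

top = pd.DataFrame({"a": [1, 2]})      # default index 0, 1
bottom = pd.DataFrame({"a": [3]})      # default index 0 (overlaps with top)

# verify_integrity=True refuses to produce duplicate labels on the concatenation axis
try:
    pd.concat([top, bottom], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True

# ignore_index=True discards the old labels and assigns a new default index
clean = pd.concat([top, bottom], ignore_index=True)

print(overlap_detected, clean.index.tolist())
```

Which option fits depends on whether the labels carry meaning: discard them with `ignore_index`, or protect them with `verify_integrity`.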


### Combining Data with Overlap

There is another data combination situation that can’t be expressed as either a merge
or concatenation operation. You may have two datasets with indexes that overlap in
full or in part.

In [61]:
# consider NumPy’s where function, which performs the array-oriented equivalent of an if-else expression

a = pd.Series([np.nan, 2.5, 0.0, 3.5, 4.5, np.nan],
              index=["f", "e", "d", "c", "b", "a"])
b = pd.Series([0., np.nan, 2., np.nan, np.nan, 5.],
              index=["a", "b", "c", "d", "e", "f"])
print(a,'\n')
print(b,'\n')

np.where(pd.isna(a), b, a) # whenever values in a are null, values from b are selected; np.where works by position and ignores the index labels

f    NaN
e    2.5
d    0.0
c    3.5
b    4.5
a    NaN
dtype: float64 

a    0.0
b    NaN
c    2.0
d    NaN
e    NaN
f    5.0
dtype: float64 



array([0. , 2.5, 0. , 3.5, 4.5, 5. ])

In [62]:
# The combine_first method performs the same kind of patching but lines up values by index label first.
# We can think of it as “patching” missing data in the calling object with data from the object we pass:

a.combine_first(b)

a    0.0
b    4.5
c    3.5
d    0.0
e    2.5
f    5.0
dtype: float64
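
The difference between positional and label-aligned patching can be made concrete. The sketch below rebuilds the two Series under the hypothetical names `sa` and `sb` and compares `np.where`, `fillna` with a Series argument, and `combine_first`:

```python
import numpy as np
import pandas as pd

sa = pd.Series([np.nan, 2.5, 0.0, 3.5, 4.5, np.nan],
               index=["f", "e", "d", "c", "b", "a"])
sb = pd.Series([0., np.nan, 2., np.nan, np.nan, 5.],
               index=["a", "b", "c", "d", "e", "f"])

# np.where pairs values by position, so label "f" in sa is patched
# with the first element of sb (which belongs to label "a")
positional = pd.Series(np.where(pd.isna(sa), sb, sa), index=sa.index)

# fillna with a Series aligns on labels but keeps the order of sa
aligned = sa.fillna(sb)

# combine_first also aligns on labels and returns the sorted union of the indexes
patched = sa.combine_first(sb)

print(positional["f"], aligned["f"], patched["f"])
```

For label "f", the positional approach picks 0.0 while the aligned approaches pick 5.0, the value that `sb` actually stores under "f".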

In [63]:
# With DataFrame objects, combine_first does the same thing column by column.
# The output will have the union of all the column names

df1 = pd.DataFrame({"a": [1., np.nan, 5., np.nan],
                    "b": [np.nan, 2., np.nan, 6.],
                    "c": range(2, 18, 4)})
df2 = pd.DataFrame({"a": [5., 4., np.nan, 3., 7.],
                    "b": [np.nan, 3., 4., 6., 8.]})
print(df1,'\n')
print(df2,'\n')

df1.combine_first(df2)

     a    b   c
0  1.0  NaN   2
1  NaN  2.0   6
2  5.0  NaN  10
3  NaN  6.0  14 

     a    b
0  5.0  NaN
1  4.0  3.0
2  NaN  4.0
3  3.0  6.0
4  7.0  8.0 



Unnamed: 0,a,b,c
0,1.0,,2.0
1,4.0,2.0,6.0
2,5.0,4.0,10.0
3,3.0,6.0,14.0
4,7.0,8.0,


## 8.3 Reshaping and Pivoting

There are a number of basic operations for rearranging tabular data. These are
referred to as **reshape** or **pivot** operations.

### Reshaping with Hierarchical Indexing

Hierarchical indexing provides a consistent way to rearrange data in a DataFrame.

There are two primary actions:\
**stack**\
This “rotates” or pivots from the columns in the data to the rows.\
**unstack**\
This pivots from the rows into the columns.

In [64]:
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(["Ohio", "Colorado"], name="state"),
                    columns=pd.Index(["one", "two", "three"],
                    name="number"))
data

number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5


In [65]:
# the stack method pivots from the columns in the data to the rows:

result = data.stack()
result

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int64

In [66]:
# the unstack method pivots from the rows in the data to the columns:

result.unstack()



number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5


In [67]:
# By default, the innermost level is unstacked (same with stack).
# We can unstack a different level by passing a level number or name:

print(result,'\n')

result.unstack(level=0) # rows by state become columns by state

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int64 



state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5


In [68]:
# instead of passing a level number to unstack, we can pass the name of the level:

result.unstack(level="state")

state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5
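
The level argument can be given as a position or as a name, and the default corresponds to the innermost level. A small self-contained sketch of that, rebuilding the same `data` frame as above:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(["Ohio", "Colorado"], name="state"),
                    columns=pd.Index(["one", "two", "three"], name="number"))
result = data.stack()

by_number = result.unstack(level=0)        # outermost level, selected by position
by_name = result.unstack(level="state")    # the same level, selected by name
innermost = result.unstack(level=-1)       # what unstack() does by default

# Both ways of selecting the "state" level give the same frame.
same_selection = by_number.equals(by_name)
print(same_selection)
```

Using the level name tends to be more robust than the position when index levels are later reordered.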


In [69]:
# Unstacking might introduce missing data if all of the values in the level aren’t found in each subgroup.

s1 = pd.Series([0, 1, 2, 3], index=["a", "b", "c", "d"], dtype="Int64")
s2 = pd.Series([4, 5, 6], index=["c", "d", "e"], dtype="Int64")
print(s1,'\n')
print(s2,'\n')

data2 = pd.concat([s1, s2], keys=["one", "two"])
print(data2,'\n')

data2.unstack()

a    0
b    1
c    2
d    3
dtype: Int64 

c    4
d    5
e    6
dtype: Int64 

one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: Int64 



        a     b  c  d     e
one     0     1  2  3  <NA>
two  <NA>  <NA>  4  5     6
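
If a placeholder is preferable to missing values, `unstack` accepts a `fill_value` argument. The sketch below uses plain integer Series (not the `Int64` extension dtype used above) and an arbitrary placeholder of -1; both choices are assumptions made for illustration.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2, 3], index=["a", "b", "c", "d"])
s2 = pd.Series([4, 5, 6], index=["c", "d", "e"])
data2 = pd.concat([s1, s2], keys=["one", "two"])

# Default: combinations that do not exist become NaN.
wide = data2.unstack()

# With fill_value, the same cells receive the placeholder instead.
filled = data2.unstack(fill_value=-1)

print(wide)
print(filled)
```

Three cells are affected here: ("one", "e"), ("two", "a"), and ("two", "b").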


In [70]:
# Stacking filters out missing data by default, so the operation is more easily invertible

data2.unstack().stack()

one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: Int64

In [71]:
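# with dropna=False, stacking keeps the missing data instead of filtering it out
# (in pandas 2.1 and 2.2, the opt-in future_stack=True implementation always keeps
# missing values and does not accept dropna)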

data2.unstack().stack(dropna=False)

one  a       0
     b       1
     c       2
     d       3
     e    <NA>
two  a    <NA>
     b    <NA>
     c       4
     d       5
     e       6
dtype: Int64

In [72]:
# When we unstack in a DataFrame, the level unstacked becomes the lowest level in the result.
# in this example, state becomes a column level lower than side by default

df = pd.DataFrame({"left": result, "right": result + 5},
                  columns=pd.Index(["left", "right"], name="side"))
print(df,'\n')

df.unstack(level="state")

side             left  right
state    number             
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10 



side   left          right
state  Ohio Colorado  Ohio Colorado
number
one       0        3     5        8
two       1        4     6        9
three     2        5     7       10


In [73]:
# Similarly, when we stack in a DataFrame, the level stacked becomes the lowest level in the result.
# in this example, side becomes a row level lower than number by default

df.unstack(level="state").stack(level="side")

state         Colorado  Ohio
number side
one    left          3     0
       right         8     5
two    left          4     1
       right         9     6
three  left          5     2
       right        10     7


### Pivoting “Long” to “Wide” Format

A common way to store multiple time series in databases and CSV files is what
is sometimes called long or stacked format. In this format, individual values are
represented by a single row in a table rather than multiple values per row.

In [74]:
# In this example, data is a pandas DataFrame with five columns

data = pd.read_csv("examples/macrodata.csv")
data = data.loc[:, ["year", "quarter", "realgdp", "infl", "unemp"]]
data.head()

   year  quarter   realgdp  infl  unemp
0  1959        1  2710.349  0.00    5.8
1  1959        2  2778.801  2.34    5.1
2  1959        3  2775.488  2.74    5.3
3  1959        4  2785.204  0.27    5.6
4  1960        1  2847.699  2.31    5.2


In [75]:
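# First, we use pandas.PeriodIndex (which represents time intervals rather than points
# in time) to combine the year and quarter columns, then set the index to
# datetime values at the start of each quarter (to_timestamp defaults to how="start")

# Notes:
# DataFrame.pop(item) returns the item and drops it from the DataFrame
# PeriodIndex.to_timestamp(freq=None, how='start') casts to a DatetimeIndex; when freq is
# omitted, it defaults to "D" for periods of a week or longer
# In pandas 2.2 and later, the year=/quarter= keywords of the PeriodIndex constructor are
# deprecated in favor of pd.PeriodIndex.from_fields(year=..., quarter=...)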

periods = pd.PeriodIndex(year=data.pop("year"),
                         quarter=data.pop("quarter"),
                         name="date")
print(periods,'\n')

data.index = periods.to_timestamp("D") # set the index of data as the timestamps
data.head()

PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3', '1960Q4', '1961Q1', '1961Q2',
             ...
             '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
             '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
            dtype='period[Q-DEC]', name='date', length=203) 



             realgdp  infl  unemp
date
1959-01-01  2710.349  0.00    5.8
1959-04-01  2778.801  2.34    5.1
1959-07-01  2775.488  2.74    5.3
1959-10-01  2785.204  0.27    5.6
1960-01-01  2847.699  2.31    5.2
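
The `how` argument of `to_timestamp` decides which edge of each period becomes the timestamp. A minimal sketch, assuming two hand-written quarterly periods instead of the macrodata file:

```python
import pandas as pd

# Two quarters written by hand (illustrative values only).
periods = pd.PeriodIndex(["1959Q1", "1959Q2"], freq="Q-DEC", name="date")

starts = periods.to_timestamp("D", how="start")   # first day of each quarter
ends = periods.to_timestamp("D", how="end")       # last instant of each quarter

print(starts)
print(ends.normalize())   # normalize() drops the time-of-day part
```

With `how="end"` the timestamps fall on the last day of the quarter, so `normalize()` is applied here only to make the dates easier to read.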


In [76]:
# select a subset of columns and give the columns index the name "item"

data = data.reindex(columns=["realgdp", "infl", "unemp"])
data.columns.name = "item"
data.head()

item         realgdp  infl  unemp
date
1959-01-01  2710.349  0.00    5.8
1959-04-01  2778.801  2.34    5.1
1959-07-01  2775.488  2.74    5.3
1959-10-01  2785.204  0.27    5.6
1960-01-01  2847.699  2.31    5.2


In [77]:
# Lastly, we reshape with stack, turn the new index levels into columns with reset_index, 
# and finally give the column containing the data values the name "value":

long_data = (data.stack()
             .reset_index()
             .rename(columns={0: "value"}))

long_data[:10]

        date     item     value
0 1959-01-01  realgdp  2710.349
1 1959-01-01     infl     0.000
2 1959-01-01    unemp     5.800
3 1959-04-01  realgdp  2778.801
4 1959-04-01     infl     2.340
5 1959-04-01    unemp     5.100
6 1959-07-01  realgdp  2775.488
7 1959-07-01     infl     2.740
8 1959-07-01    unemp     5.300
9 1959-10-01  realgdp  2785.204


Data is frequently stored this way in relational SQL databases, because a fixed schema (column
names and data types) allows the number of distinct values in the item column
to change as data is added to the table. In the previous example, date and item would
usually be the primary keys (in relational database parlance), offering both relational
integrity and easier joins. In some cases, the data may be more difficult to work with
in this format; we might prefer to have a DataFrame containing one column per
distinct item value, indexed by the timestamps in the date column. DataFrame’s pivot
method performs exactly this transformation.
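
Because date and item act like a primary key, `pivot` expects each (date, item) pair to appear only once. The sketch below uses a small hand-made long table (the frame `long_df` and its values are illustrative, not the macrodata set) to show the basic reshape and what happens when a pair is repeated.

```python
import pandas as pd

# A tiny long-format table: one row per (date, item) observation.
long_df = pd.DataFrame({
    "date": ["1959-01-01", "1959-01-01", "1959-04-01", "1959-04-01"],
    "item": ["infl", "unemp", "infl", "unemp"],
    "value": [0.0, 5.8, 2.34, 5.1],
})

# One column per distinct item, indexed by date.
wide = long_df.pivot(index="date", columns="item", values="value")
print(wide)

# Repeating a (date, item) pair makes the target cell ambiguous,
# and pivot raises a ValueError instead of picking a value.
dup = pd.concat([long_df, long_df.iloc[[0]]], ignore_index=True)
try:
    dup.pivot(index="date", columns="item", values="value")
    pivot_failed = False
except ValueError:
    pivot_failed = True
print(pivot_failed)
```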

In [78]:
# pivot long_data so that each distinct item becomes its own column, indexed by date
pivoted = long_data.pivot(index="date", columns="item",
                          values="value")
pivoted.head()

item        infl   realgdp  unemp
date
1959-01-01  0.00  2710.349    5.8
1959-04-01  2.34  2778.801    5.1
1959-07-01  2.74  2775.488    5.3
1959-10-01  0.27  2785.204    5.6
1960-01-01  2.31  2847.699    5.2


In [79]:
long_data.index.name = None

In [80]:
# The first two values passed are the columns to be used, respectively, as the row and
# column index, then finally an optional value column to fill the DataFrame. 
# Suppose we have two value columns that we want to reshape simultaneously

long_data["value2"] = np.random.standard_normal(len(long_data))
long_data[:10]

        date     item     value    value2
0 1959-01-01  realgdp  2710.349  0.802926
1 1959-01-01     infl     0.000  0.575721
2 1959-01-01    unemp     5.800  1.381918
3 1959-04-01  realgdp  2778.801  0.000992
4 1959-04-01     infl     2.340 -0.143492
5 1959-04-01    unemp     5.100 -0.206282
6 1959-07-01  realgdp  2775.488 -0.222392
7 1959-07-01     infl     2.740 -1.682403
8 1959-07-01    unemp     5.300  1.811659
9 1959-10-01  realgdp  2785.204 -0.351305


In [81]:
# By omitting the last argument, we obtain a DataFrame with hierarchical columns

pivoted = long_data.pivot(index="date", columns="item")
pivoted.head()

           value                    value2
item        infl   realgdp unemp      infl   realgdp     unemp
date
1959-01-01  0.00  2710.349   5.8  0.575721  0.802926  1.381918
1959-04-01  2.34  2778.801   5.1 -0.143492  0.000992 -0.206282
1959-07-01  2.74  2775.488   5.3 -1.682403 -0.222392  1.811659
1959-10-01  0.27  2785.204   5.6  0.128317 -0.351305 -1.313554
1960-01-01  2.31  2847.699   5.2 -0.615939  0.498327  0.174072


In [82]:
# Note that pivot is equivalent to creating a hierarchical index using set_index followed
# by a call to unstack:

unstacked = long_data.set_index(["date", "item"]).unstack(level="item")
unstacked

           value                     value2
item        infl    realgdp unemp      infl   realgdp     unemp
date
1959-01-01  0.00   2710.349   5.8  0.575721  0.802926  1.381918
1959-04-01  2.34   2778.801   5.1 -0.143492  0.000992 -0.206282
1959-07-01  2.74   2775.488   5.3 -1.682403 -0.222392  1.811659
1959-10-01  0.27   2785.204   5.6  0.128317 -0.351305 -1.313554
1960-01-01  2.31   2847.699   5.2 -0.615939  0.498327  0.174072
...          ...        ...   ...       ...       ...       ...
2008-07-01 -3.16  13324.600   6.0  0.996170 -0.972227 -1.266682
2008-10-01 -8.79  13141.920   6.9 -0.117213 -0.943992 -0.830112
2009-01-01  0.94  12925.410   8.1 -0.231267  0.377070  0.359247
2009-04-01  3.37  12901.504   9.2  0.069801  0.631954 -0.042890
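
The equivalence between `pivot` and `set_index` followed by `unstack` can be checked directly. This is a small sketch on a hand-made frame; `long_df` and the numbers in `value2` are made up for illustration.

```python
import pandas as pd

long_df = pd.DataFrame({
    "date": ["1959-01-01", "1959-01-01", "1959-04-01", "1959-04-01"],
    "item": ["infl", "unemp", "infl", "unemp"],
    "value": [0.0, 5.8, 2.34, 5.1],
    "value2": [1.0, 2.0, 3.0, 4.0],   # made-up second measurement
})

# Omitting `values` keeps every remaining column, giving hierarchical columns.
pivoted = long_df.pivot(index="date", columns="item")

# The same reshape written out as two explicit steps.
unstacked = long_df.set_index(["date", "item"]).unstack(level="item")

same = pivoted.equals(unstacked)
print(same)
```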


In [83]:
# select part of the pivot table

pivoted["value"].head()

item        infl   realgdp  unemp
date
1959-01-01  0.00  2710.349    5.8
1959-04-01  2.34  2778.801    5.1
1959-07-01  2.74  2775.488    5.3
1959-10-01  0.27  2785.204    5.6
1960-01-01  2.31  2847.699    5.2


### Pivoting “Wide” to “Long” Format

An inverse operation to pivot for DataFrames is pandas.melt. Rather than transforming
one column into many in a new DataFrame, it merges multiple columns into
one, producing a DataFrame that is longer than the input.

In [84]:
df = pd.DataFrame({"key": ["foo", "bar", "baz"],
                   "A": [1, 2, 3],
                   "B": [4, 5, 6],
                   "C": [7, 8, 9]})
df

   key  A  B  C
0  foo  1  4  7
1  bar  2  5  8
2  baz  3  6  9


In [85]:
# The "key" column may be a group indicator, and the other columns are data values.
# When using pandas.melt, we must indicate which columns (if any) are group indicators.
# Let’s use "key" as the only group indicator here

melted = pd.melt(df, id_vars="key")
melted

   key variable  value
0  foo        A      1
1  bar        A      2
2  baz        A      3
3  foo        B      4
4  bar        B      5
5  baz        B      6
6  foo        C      7
7  bar        C      8
8  baz        C      9


In [86]:
# Using pivot, we can reshape back to the original layout

reshaped = melted.pivot(index="key", columns="variable",
                        values="value")
reshaped

variable  A  B  C
key
bar       2  5  8
baz       3  6  9
foo       1  4  7


In [87]:
# Since the result of pivot creates an index from the column used as the row labels, we
# may want to use reset_index to move the data back into a column:

reshaped.reset_index()

variable  key  A  B  C
0         bar  2  5  8
1         baz  3  6  9
2         foo  1  4  7
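
Two traces of the round trip remain in the output above: the columns still carry the name "variable", and the rows are sorted by key rather than in their original order. One possible way to undo both, sketched here as an illustration rather than a required step:

```python
import pandas as pd

df = pd.DataFrame({"key": ["foo", "bar", "baz"],
                   "A": [1, 2, 3],
                   "B": [4, 5, 6],
                   "C": [7, 8, 9]})

melted = pd.melt(df, id_vars="key")
restored = (melted.pivot(index="key", columns="variable", values="value")
            .reset_index())

# Drop the leftover "variable" label on the columns index.
restored.columns.name = None

# pivot returned the keys in sorted order; realign to the original key order.
restored = restored.set_index("key").loc[df["key"]].reset_index()
print(restored)
```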


In [88]:
# We can also specify a subset of columns to use as value columns

pd.melt(df, id_vars="key", value_vars=["A", "B"])

   key variable  value
0  foo        A      1
1  bar        A      2
2  baz        A      3
3  foo        B      4
4  bar        B      5
5  baz        B      6
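
The default output column names "variable" and "value" can be replaced through the `var_name` and `value_name` arguments of `pandas.melt`. In this sketch the names "letter" and "reading" are arbitrary choices made for illustration.

```python
import pandas as pd

df = pd.DataFrame({"key": ["foo", "bar", "baz"],
                   "A": [1, 2, 3],
                   "B": [4, 5, 6],
                   "C": [7, 8, 9]})

# Same melt as above, but with custom names for the two generated columns.
melted_named = pd.melt(df, id_vars="key", value_vars=["A", "B"],
                       var_name="letter", value_name="reading")
print(melted_named)
```

Descriptive names here can make a later `pivot` call easier to read, since its `columns` and `values` arguments refer to these columns.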


In [89]:
# pandas.melt can be used without any group identifiers

pd.melt(df, value_vars=["A", "B", "C"])

  variable  value
0        A      1
1        A      2
2        A      3
3        B      4
4        B      5
5        B      6
6        C      7
7        C      8
8        C      9


In [90]:
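# A column that could serve as a group indicator, such as "key", can also be melted as a value column

pd.melt(df, value_vars=["key", "A", "B"])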

  variable value
0      key   foo
1      key   bar
2      key   baz
3        A     1
4        A     2
5        A     3
6        B     4
7        B     5
8        B     6
