# Data Wrangling: Join, Combine and Reshape
## DAT540 Introduction to Data Science
## University of Stavanger
### L11.1
#### Antorweep Chakravorty (antorweep.chakravorty@uis.no)

- **Hierarchical Indexing**
- enables multiple (two or more) index levels on an axis
- Creates a *MultiIndex* data structure that holds multiple index levels on an axis
- With a hierarchically indexed object, *partial* indexing is possible, which gives a concise way to select subsets of the data
- It also allows selection from an "inner" level
- Hierarchical indexing plays an important role in reshaping data and group-based operations
- A Series with a MultiIndex can be rearranged into a DataFrame using the *unstack* instance method
- Similarly, a DataFrame can be converted into a Series with a MultiIndex using the *stack* instance method, which turns the column labels into the innermost row index level
- A DataFrame allows a MultiIndex on either axis
- Hierarchical levels can have names, given as strings or any Python object
- MultiIndex can be created and reused using its *from_arrays* method
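
The points above on partial indexing, inner-level selection and *from_arrays* can be sketched on a small Series. This is a minimal illustration, not part of the lecture dataset; the level names `letter` and `number` and the values are made up for the example.

```python
import numpy as np
import pandas as pd

# Build a MultiIndex once so it can be reused (level names are illustrative)
idx = pd.MultiIndex.from_arrays(
    [['a', 'a', 'b', 'b', 'c'], [1, 2, 1, 3, 2]],
    names=['letter', 'number'],
)
s = pd.Series(np.arange(5.0), index=idx)

# Partial indexing: everything stored under the outer label 'a'
outer_a = s.loc['a']

# Selection from the inner level: every entry whose inner label is 2
inner_2 = s.loc[:, 2]

# The same MultiIndex object reused for a second Series
s2 = pd.Series(np.arange(5.0) * 10, index=idx)

print(outer_a)
print(inner_2)
```

Both selections drop the level that was fixed, so `outer_a` is indexed by `number` and `inner_2` by `letter`.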

- *Indexing with a DataFrame's columns*
- One or more columns of a DataFrame can be used as its row index. Conversely, the row index can be moved back into the DataFrame's columns
- The *set_index* instance method creates a new DataFrame that uses one or more of its columns as the index
- By default those columns are removed from the DataFrame; pass *drop=False* to keep them as columns as well
- *reset_index* instance method moves hierarchical index levels into columns
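
As a rough sketch of how *set_index*, its *drop* argument and *reset_index* relate, the toy frame below (column names `a`, `b`, `c` are assumptions for illustration) is indexed by two of its columns and then restored.

```python
import pandas as pd

frame = pd.DataFrame({'a': range(4),
                      'b': ['one', 'one', 'two', 'two'],
                      'c': [0, 1, 0, 1]})

# 'b' and 'c' become index levels and leave the columns
indexed = frame.set_index(['b', 'c'])

# drop=False keeps 'b' and 'c' as ordinary columns as well
kept = frame.set_index(['b', 'c'], drop=False)

# reset_index moves the index levels back into columns
restored = indexed.reset_index()

print(indexed.columns.tolist())
print(kept.columns.tolist())
print(restored.columns.tolist())
```

Note that `reset_index` places the former index levels first, so the column order of `restored` differs from that of `frame`.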

In [1]:
import numpy as np
import pandas as pd

# Get the adult dataset from UCI Machine Learning repo
adult = pd.read_csv('./data/adult_sanitized.csv')

In [2]:
# sort the adult dataset values by 'Native-Country', 'Sex'
adult.sort_values(by=['Native-Country', 'Sex'], inplace=True)
# strip surrounding whitespace from the 'Native-Country' and 'Sex' column values
adult['Native-Country'] = adult['Native-Country'].str.strip()
adult['Sex'] = adult['Sex'].str.strip()

# set the index of the adult data set using set_index to 'Native-Country', 'Sex'. 
# Store the resulting dataset into a variable adult_mi
adult_mi = adult.set_index(['Native-Country', 'Sex'])
# Display the last 3 rows of adult_mi
adult_mi.tail(3)


Native-Country,Sex,Unnamed: 0,Age,Workclass,Fnlwgt,Education,Education-Num,Marital-Status,Occupation,Relationship,Race,Capital-Gain,Capital-Loss,Weekly-Work-Hours,Salary
Yugoslavia,Male,29437,43,Private,557349,10th,6,Married-civ-spouse,Transport-moving,Husband,White,0,0,40,<=50K
Yugoslavia,Male,30543,22,Private,180060,HS-grad,9,Never-married,Exec-managerial,Not-in-family,White,0,0,40,<=50K
Yugoslavia,Male,31519,29,Private,273051,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,0,0,52,>50K


In [3]:
# Use loc to retrieve all rows from adult_mi with the MultiIndex key ('United-States', 'Female')
adult_mi.loc[('United-States', 'Female')].head(3)


Native-Country,Sex,Unnamed: 0,Age,Workclass,Fnlwgt,Education,Education-Num,Marital-Status,Occupation,Relationship,Race,Capital-Gain,Capital-Loss,Weekly-Work-Hours,Salary
United-States,Female,5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,0,0,40,<=50K
United-States,Female,8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,14084,0,50,>50K
United-States,Female,12,23,Private,122272,Bachelors,13,Never-married,Adm-clerical,Own-child,White,0,0,30,<=50K


In [10]:
# Create a Series with a two-level MultiIndex
# index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'], [1,2,3,1,3,1,2,2,3]]
# Values: 9 samples drawn from the standard normal distribution
data = pd.Series(np.random.randn(9),
                index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'], [1,2,3,1,3,1,2,2,3]])
data

a  1   -1.143905
   2   -0.479229
   3   -0.003369
b  1    0.597347
   3   -0.296286
c  1   -0.745082
   2    0.538509
d  2   -1.316708
   3   -0.803749
dtype: float64

In [11]:
# Unstack the Series to create a DataFrame
# The Series has two index levels, 0 and 1; unstack takes the level to pivot into columns (default: innermost)
# Here level 0 (the outer level) becomes the columns
data.unstack(0)


,a,b,c,d
1,-1.143905,0.597347,-0.745082,
2,-0.479229,,0.538509,-1.316708
3,-0.003369,-0.296286,,-0.803749


In [12]:
# The unstacked DataFrame can be transformed into a stacked series using the stack method
data.unstack().stack()


a  1   -1.143905
   2   -0.479229
   3   -0.003369
b  1    0.597347
   3   -0.296286
c  1   -0.745082
   2    0.538509
d  2   -1.316708
   3   -0.803749
dtype: float64

- *Reordering and Sorting Levels*
- The *swaplevel* instance method takes two level numbers or names and returns a new object with the levels interchanged
- The *sort_index* instance method sorts the data by index labels; passing the *level* argument sorts on a chosen level

In [13]:
# Display the first 2 rows of adult_mi before swapping the 'Native-Country' and 'Sex' levels
adult_mi.head(2)

Native-Country,Sex,Unnamed: 0,Age,Workclass,Fnlwgt,Education,Education-Num,Marital-Status,Occupation,Relationship,Race,Capital-Gain,Capital-Loss,Weekly-Work-Hours,Salary
?,Female,51,18,Private,226956,HS-grad,9,Never-married,Other-service,Own-child,White,0,0,30,<=50K
?,Female,93,30,Private,117747,HS-grad,9,Married-civ-spouse,Sales,Wife,Asian-Pac-Islander,0,1573,35,<=50K


In [14]:
# Swap the two levels with swaplevel; 'Native-Country' becomes the innermost index level
adult_mi.swaplevel('Native-Country', 'Sex').head(2)

Sex,Native-Country,Unnamed: 0,Age,Workclass,Fnlwgt,Education,Education-Num,Marital-Status,Occupation,Relationship,Race,Capital-Gain,Capital-Loss,Weekly-Work-Hours,Salary
Female,?,51,18,Private,226956,HS-grad,9,Never-married,Other-service,Own-child,White,0,0,30,<=50K
Female,?,93,30,Private,117747,HS-grad,9,Married-civ-spouse,Sales,Wife,Asian-Pac-Islander,0,1573,35,<=50K
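
The cells above exercise *swaplevel* but not *sort_index*. A small sketch of the two together follows; the frame, its level names `key1` and `key2`, and its values are invented for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {'value': [10, 20, 30, 40]},
    index=pd.MultiIndex.from_arrays(
        [['b', 'a', 'b', 'a'], [2, 1, 1, 2]], names=['key1', 'key2']),
)

# Interchange the two levels; the data itself is not reordered
swapped = df.swaplevel('key1', 'key2')

# Sort on the new outer level (key2); by default the remaining
# levels are used to order rows that tie on that level
by_outer = swapped.sort_index(level=0)

print(by_outer)
```

Sorting after a swap is a common pairing, because selection on a MultiIndex generally performs better when the index is sorted starting from the outermost level.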


- *Summary Statistics by Level*
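- Many descriptive and summary statistics on DataFrame and Series can be aggregated by index level over a particular axis. Older pandas versions exposed this through a *level* argument on the statistic itself; that argument was deprecated in pandas 1.3 and removed in 2.0, where *groupby(level=...)* is used instead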

In [16]:
# Get quick summary stats from adult_mi
# Compute the mean of the numeric columns for each 'Native-Country' level value
# (the level= argument of mean was removed in pandas 2.0; groupby(level=...) is the replacement)
adult_mi.groupby(level='Native-Country').mean(numeric_only=True).head(3)


Native-Country,Unnamed: 0,Age,Fnlwgt,Education-Num,Capital-Gain,Capital-Loss,Weekly-Work-Hours
?,16172.542955,38.74055,193384.587629,10.601375,1809.621993,118.469072,41.515464
Cambodia,14857.315789,37.789474,193080.368421,8.789474,1027.842105,183.052632,40.894737
Canada,14686.082645,42.545455,179852.933884,10.652893,1504.132231,129.933884,40.404959
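
Since the adult dataset may not be at hand, the same per-level aggregation can be tried on a tiny frame. This sketch uses `groupby(level=...)`; the level names `key1` and `key2` and the values are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {'x': [1.0, 2.0, 3.0, 4.0]},
    index=pd.MultiIndex.from_arrays(
        [['a', 'a', 'b', 'b'], [1, 2, 1, 2]], names=['key1', 'key2']),
)

# Aggregate over the outer level: one row per distinct key1 value
by_key1 = df.groupby(level='key1').mean()

# Aggregate over the inner level: one row per distinct key2 value
by_key2 = df.groupby(level='key2').sum()

print(by_key1)
print(by_key2)
```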


- **Combining and Merging Datasets**
- It is often necessary to join, combine or merge datasets from different repositories
- Data contained in pandas objects can be combined together in multiple ways
  - *pandas.merge* top-level method connects rows in DataFrames based on one or more keys, similar to SQL join operations
  - *pandas.concat* top-level method concatenates or "stacks" together objects along an axis
  - *combine_first* instance method splices together overlapping data, filling in missing values in one object with values from another
- *Database-Style DataFrame Joins*
- *Merge* or *Join* operations combine datasets by linking rows on one or more keys, using *pandas.merge*
- The *on* argument specifies the column(s) to be used for performing a join
- Without the on argument, merge uses overlapping column names as keys
- If the column names are different in each object, they can be specified separately using *left_on* and *right_on* arguments for the left and right dataframes respectively
- By default merge does an *inner join*
- The *how* argument allows 'left', 'right' and 'outer' joins 
- To merge with multiple keys, a list of column names is passed to the *on* argument
- *suffixes=()* argument takes a tuple and adds suffixes to overlapping columns when performing merge
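
Of the three approaches listed above, the cells that follow concentrate on *merge*. For completeness, here is a minimal sketch of *pandas.concat* and *combine_first*; all Series names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3], index=['c', 'd'])

# axis=0 (default): stack the objects into one longer Series
stacked = pd.concat([s1, s2])

# axis=1: place them side by side in a DataFrame; NaN where a label is missing
side_by_side = pd.concat([s1, s2], axis=1)

# combine_first: keep a's values, fill its missing entries from b
a = pd.Series([np.nan, 2.5, np.nan], index=['x', 'y', 'z'])
b = pd.Series([1.0, 9.9, 3.0], index=['x', 'y', 'z'])
patched = a.combine_first(b)

print(stacked)
print(side_by_side)
print(patched)
```

In `patched`, the value at `y` stays 2.5 because `combine_first` only consults the second object where the first is missing.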

In [17]:
np.random.seed(1234)
# Generate two data sets with random values.
df1 = pd.DataFrame({'key': ['b', 'b', 'a',  'c', 'a', 'a', 'd'],
                   'data1': np.random.randint(0, 10, 7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                   'data2': np.random.randint(0, 20, 3)})
print('df1:\n', df1)
print('df2:\n', df2)

df1:
   key  data1
0   b      3
1   b      6
2   a      5
3   c      4
4   a      8
5   a      9
6   d      1
df2:
   key  data2
0   a      9
1   b     11
2   d     12


In [18]:
# Perform a default inner join using pd.merge on df1 and df2.
# pandas automatically uses the overlapping column names as the keys.
# To merge on specific overlapping columns, name them with the "on" argument,
# which takes a single label or a list of labels
pd.merge(df1, df2)

,key,data1,data2
0,b,3,11
1,b,6,11
2,a,5,9
3,a,8,9
4,a,9,9
5,d,1,12


In [19]:
# Change the column names for key as key1 and key2 for df1 and df2 respectively
df1.columns = ['key1', 'data1']
df2.columns = ['key2', 'data2']
print('df1:\n', df1)
print('df2:\n', df2)

df1:
   key1  data1
0    b      3
1    b      6
2    a      5
3    c      4
4    a      8
5    a      9
6    d      1
df2:
   key2  data2
0    a      9
1    b     11
2    d     12


In [20]:
# Try performing a simple/default merge on them now using pd.merge
# With no overlapping column names, pandas raises a MergeError

try:
    pd.merge(df1, df2)
except pd.errors.MergeError:
    print('No common columns to perform merge on.')

No common columns to perform merge on.


In [21]:
# add a new column "data" to df1 and df2 with random values (len=7 and len=3 respectively)
df1['data'] = np.random.randn(7)
df2['data'] = np.random.randn(3)

print('df1:\n', df1)
print('df2:\n', df2)

# Perform merge on df1 and df2 with "left_on" and "right_on" argument specifying the key columns for each DataFrame
# Also specify the type of join as outer using the *how* argument
# Additionally use "suffixes" argument to specify a string to append to overlapping names in the left and right DFs
pd.merge(df1, df2, left_on='key1', right_on='key2', how='outer', suffixes=('_l', '_r'))

df1:
   key1  data1      data
0    b      3 -0.008679
1    b      6 -0.321061
2    a      5  1.056970
3    c      4 -0.590180
4    a      8 -0.387864
5    a      9 -0.046539
6    d      1 -0.998716
df2:
   key2  data2      data
0    a      9 -0.824126
1    b     11  0.515387
2    d     12  0.315523


,key1,data1,data_l,key2,data2,data_r
0,b,3,-0.008679,b,11.0,0.515387
1,b,6,-0.321061,b,11.0,0.515387
2,a,5,1.05697,a,9.0,-0.824126
3,a,8,-0.387864,a,9.0,-0.824126
4,a,9,-0.046539,a,9.0,-0.824126
5,c,4,-0.59018,,,
6,d,1,-0.998716,d,12.0,0.315523
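
Merging on several keys, mentioned in the bullet list earlier, is not exercised by the cells so far. A minimal sketch follows, with invented frames `left` and `right` that share the key columns `key1` and `key2` plus an overlapping column `val`.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'val': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'val': [4, 5, 6, 7]})

# A list passed to "on" makes each (key1, key2) pair act as the join key;
# the overlapping non-key column 'val' receives the given suffixes
merged = pd.merge(left, right, on=['key1', 'key2'],
                  how='outer', suffixes=('_l', '_r'))

print(merged)
```

The pair `('foo', 'one')` appears once on the left and twice on the right, so it yields two rows; pairs present on only one side are kept by the outer join with NaN on the other side.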


- Merge function arguments

<img src='images/merge_args.png'>

- *Merging on Index*
- Merging can also be performed on row indices (hierarchical and non-hierarchical)
- The *left_index=True* or *right_index=True* arguments (or both) indicate that the index should be used as the merge key
- Alternatively, DataFrame provides the *join* instance method for merging on indices
- The *join* method takes arguments similar to those of merge, such as *how*, and uses *lsuffix* and *rsuffix* for overlapping column names
- For index-on-index merges, *join* also accepts a list of DataFrames, joining all of them in a single call

In [22]:
# Transform df1 and df2 with set_index, to include key1 and key2 as their respective row indices 
df1.set_index('key1', inplace=True)
df2.set_index('key2', inplace=True)
print('df1:\n', df1)
print('df2:\n', df2)

df1:
       data1      data
key1                 
b         3 -0.008679
b         6 -0.321061
a         5  1.056970
c         4 -0.590180
a         8 -0.387864
a         9 -0.046539
d         1 -0.998716
df2:
       data2      data
key2                 
a         9 -0.824126
b        11  0.515387
d        12  0.315523


In [23]:
# Merge df1 and df2 using their indices.
# Set "left_index=True", "right_index=True" or both, depending on which side's index holds the merge key (both here)
pd.merge(df1, df2, left_index=True, right_index=True)

,data1,data_x,data2,data_y
a,5,1.05697,9,-0.824126
a,8,-0.387864,9,-0.824126
a,9,-0.046539,9,-0.824126
b,3,-0.008679,11,0.515387
b,6,-0.321061,11,0.515387
d,1,-0.998716,12,0.315523


In [24]:
# Perform merge on indices using "join" instance method
df1.join(df2, lsuffix='_l', rsuffix='_r', how='outer')

,data1,data_l,data2,data_r
a,5,1.05697,9.0,-0.824126
a,8,-0.387864,9.0,-0.824126
a,9,-0.046539,9.0,-0.824126
b,3,-0.008679,11.0,0.515387
b,6,-0.321061,11.0,0.515387
c,4,-0.59018,,
d,1,-0.998716,12.0,0.315523
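
The list form of *join* mentioned earlier can be sketched as follows. The three frames and their column names are invented for the example; they are given unique index labels and non-overlapping column names, which keeps the multi-frame join straightforward.

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2, 3]}, index=['a', 'b', 'c'])
mid = pd.DataFrame({'B': [10, 20]}, index=['a', 'b'])
right = pd.DataFrame({'C': [100, 300]}, index=['a', 'c'])

# Join several frames on their indices in one call (default how='left',
# so the result keeps the row labels of the calling frame)
joined = left.join([mid, right])

print(joined)
```

Labels missing from `mid` or `right` show up as NaN in the corresponding column of `joined`.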
