In [27]:
import pandas as pd
import numpy as np
from datetime import date

# Reshaping by concatenating, merging, joining and combining data

## Concatenating data in multiple objects

[Merge, join, concatenate and compare Userguide](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)

[Exercises](http://localhost:8888/notebooks/Projects/pandas_workshop/exercises/da_exercises_2.ipynb)

Concatenation is performed using the pandas function `pd.concat()`. The general syntax to concatenate data
is to pass a list of objects to be concatenated. 

`pd.concat()` concatenates objects along one axis and applies set logic (union or intersection) to the labels on the other axis:
- take the union of them all, `join='outer'`. This is the default option as it results in zero information loss.
- take the intersection, `join='inner'`.

Concatenation aligns objects only on their **row and column labels**, never on their values. Stacking rows this way is similar to the SQL `UNION ALL` operator: duplicate rows are kept.
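As a minimal sketch of the two `join` options, the example below stacks two small frames that share only one column. The frames `left` and `right` and their column names are made up for illustration.

```python
import pandas as pd

# Two hypothetical frames that share only column "b"
left = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
right = pd.DataFrame({"b": [5, 6], "c": [7, 8]})

# Default join='outer': union of the column labels, NaN where a column is missing
outer = pd.concat([left, right])

# join='inner': only the column labels present in every input
inner = pd.concat([left, right], join="inner")

print(list(outer.columns))  # ['a', 'b', 'c']
print(list(inner.columns))  # ['b']
```

Both results have four rows; only the set of columns differs.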

### Concatenation of the Series

Concatenation of the `Series` objects simply results in a new `Series`, with the values copied in sequence.

In [3]:
s1 = pd.Series(np.arange(0, 3))
s2 = pd.Series(np.arange(5, 8))
s1, s2

(0    0
 1    1
 2    2
 dtype: int64,
 0    5
 1    6
 2    7
 dtype: int64)

In [4]:
pd.concat([s1 ,s2])

0    0
1    1
2    2
0    5
1    6
2    7
dtype: int64
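The result above repeats the index labels 0 to 2. If the original labels carry no meaning, one option is to pass `ignore_index=True`, which builds a fresh integer index. This is a small sketch that reuses the same two series:

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.arange(0, 3))
s2 = pd.Series(np.arange(5, 8))

stacked = pd.concat([s1, s2])                        # keeps the original labels
renumbered = pd.concat([s1, s2], ignore_index=True)  # new 0..n-1 index

print(stacked.index.tolist())     # [0, 1, 2, 0, 1, 2]
print(stacked.loc[0].tolist())    # [0, 5] -> label 0 matches two rows
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4, 5]
```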

A fairly common use of the `keys` argument is to override the column names when creating a new `DataFrame` based on existing `Series`. 

In [66]:
s4 = pd.Series([0, 1, 2, 3], name='foo')
s5 = pd.Series([0, 1, 2, 3])
s6 = pd.Series([0, 1, 4, 5])
pd.concat([s4, s5, s6], axis=1, keys=['red', 'blue', 'yellow'])

Unnamed: 0,red,blue,yellow
0,0,0,0
1,1,1,1
2,2,2,4
3,3,3,5


### Concatenation of the DataFrames

By default, concatenating `DataFrame` objects appends their rows in order, which can produce duplicate labels in the row index.

The resulting set of column labels is the union of the column labels of the specified `DataFrame`
objects. This alignment is applied across all the source objects (there can be more than two). Pandas
inserts `NaN` values wherever a column in the result does not exist in the `DataFrame` object currently being
processed.

In [10]:
df1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'])
df1

Unnamed: 0,a,b,c
0,0,1,2
1,3,4,5
2,6,7,8


In [8]:
df2 = pd.DataFrame(np.arange(9, 18).reshape(3, 3), columns=['a', 'b', 'c'])
df2

Unnamed: 0,a,b,c
0,9,10,11
1,12,13,14
2,15,16,17


In [9]:
pd.concat([df1, df2])

Unnamed: 0,a,b,c
0,0,1,2
1,3,4,5
2,6,7,8
0,9,10,11
1,12,13,14
2,15,16,17


In [13]:
df2.rename(columns={'b': 'd'}, inplace=True)
df2

Unnamed: 0,a,d,c
0,9,10,11
1,12,13,14
2,15,16,17


In [14]:
pd.concat([df1, df2])

Unnamed: 0,a,b,c,d
0,0,1.0,2,
1,3,4.0,5,
2,6,7.0,8,
0,9,,11,10.0
1,12,,14,13.0
2,15,,17,16.0


It is possible to give each group of data in the result its own name using the `keys` parameter. This creates a
hierarchical index on the `DataFrame` object that lets you refer to each group of data independently via the
DataFrame object's `.loc` property. This is convenient if you later need to determine where data in the
resulting DataFrame object was sourced.

In [15]:
pd.concat([df1, df2], keys=['df1', 'df2'])

Unnamed: 0,Unnamed: 1,a,b,c,d
df1,0,0,1.0,2,
df1,1,3,4.0,5,
df1,2,6,7.0,8,
df2,0,9,,11,10.0
df2,1,12,,14,13.0
df2,2,15,,17,16.0
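As a short sketch of how the hierarchical index can be used afterwards, the rows that came from one source can be selected with `.loc` and the key given in `keys`. The frames are rebuilt here so the snippet runs on its own; the variable names `combined` and `from_df2` are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=["a", "b", "c"])
df2 = pd.DataFrame(np.arange(9, 18).reshape(3, 3), columns=["a", "d", "c"])

combined = pd.concat([df1, df2], keys=["df1", "df2"])

# Select only the rows that originated from df2 via the outer index level
from_df2 = combined.loc["df2"]

print(from_df2["d"].tolist())      # [10.0, 13.0, 16.0]
print(from_df2["b"].isna().all())  # True: df2 had no column "b"
```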


In [22]:
pd.concat([df1, df2], axis=1, keys=['df1', 'df2'])

Unnamed: 0_level_0,df1,df1,df1,df2,df2,df2
Unnamed: 0_level_1,a,b,c,a,d,c
0,0,1,2,9,10,11
1,3,4,5,12,13,14
2,6,7,8,15,16,17


The `pd.concat()` function allows you to specify the axis along which to concatenate.
With `axis=1` the result contains duplicate column names, because the concatenation first aligns by the row index labels
of each `DataFrame` object, and then places the columns of the first `DataFrame` object followed by those of the
second without regard to the column labels.

In [16]:
pd.concat([df1, df2], axis=1)

Unnamed: 0,a,b,c,a.1,d,c.1
0,0,1,2,9,10,11
1,3,4,5,12,13,14
2,6,7,8,15,16,17


In [18]:
df3 = pd.DataFrame(np.arange(20, 26).reshape(3, 2),
                   columns=['a', 'd'],
                   index=[2, 3, 4])
df3

Unnamed: 0,a,d
2,20,21
3,22,23
4,24,25


In [20]:
# alignment is along the row labels; column 'a' appears twice (from df1 and df3),
# and NaN is filled in where a row does not exist in one of the sources

pd.concat([df1, df3], axis=1)

Unnamed: 0,a,b,c,a.1,d
0,0.0,1.0,2.0,,
1,3.0,4.0,5.0,,
2,6.0,7.0,8.0,20.0,21.0
3,,,,22.0,23.0
4,,,,24.0,25.0


In [67]:
pd.concat({"x": df1, "y": df2, "z": df3})

Unnamed: 0,Unnamed: 1,a,b,c,d
x,0,0,1.0,2.0,
x,1,3,4.0,5.0,
x,2,6,7.0,8.0,
y,0,9,,11.0,10.0
y,1,12,,14.0,13.0
y,2,15,,17.0,16.0
z,2,20,,,21.0
z,3,22,,,23.0
z,4,24,,,25.0


### Concatenating with mixed ndims

You can concatenate a mix of `Series` and `DataFrame` objects. Each `Series` is converted to a `DataFrame` whose column name is the name of the `Series`.

In [62]:
s = pd.Series(['X0', 'X1', 'X2', 'X3'], name="X")
pd.concat([df1, s], axis=1)

Unnamed: 0,a,b,c,X
0,0.0,1.0,2.0,X0
1,3.0,4.0,5.0,X1
2,6.0,7.0,8.0,X2
3,,,,X3


Since we’re concatenating a `Series` to a `DataFrame`, we could have achieved the same result with `DataFrame.assign()`. To concatenate an arbitrary number of pandas objects (DataFrame or Series), use `concat`.

If unnamed Series are passed they will be numbered consecutively.

In [64]:
s3 = pd.Series(['_0', '_1', '_2', '_3'])
pd.concat([df1, s3, s3, s3], axis=1)

Unnamed: 0,a,b,c,0,1,2
0,0.0,1.0,2.0,_0,_0,_0
1,3.0,4.0,5.0,_1,_1,_1
2,6.0,7.0,8.0,_2,_2,_2
3,,,,_3,_3,_3


Passing `ignore_index=True` will drop all name references.

In [65]:
pd.concat([df1, s], axis=1, ignore_index=True)

Unnamed: 0,0,1,2,3
0,0.0,1.0,2.0,X0
1,3.0,4.0,5.0,X1
2,6.0,7.0,8.0,X2
3,,,,X3


#### Appending rows to a DataFrame

If you have a `Series` that you want to append as a single row to a `DataFrame`, you can convert the row into a `DataFrame` and use `concat`.

In [69]:
s7 = pd.Series(['X0', 'X1', 'X2', 'X3'], index=['a', 'b', 'c', 'D'])
pd.concat([df1, s7.to_frame().T], ignore_index=True)

Unnamed: 0,a,b,c,D
0,0,1,2,
1,3,4,5,
2,6,7,8,
3,X0,X1,X2,X3


### Specifying join type

A default concatenation actually performs an **outer join** operation on the labels of the axis *opposite* to the concatenation axis: the column labels when stacking rows (`axis=0`), and the row index when placing frames side by side (`axis=1`). The resulting set of labels is therefore the union of those labels.

The type of join can be changed by specifying `join` as a parameter.

In [21]:
pd.concat([df1, df3], axis=1, join='inner')

Unnamed: 0,a,b,c,a.1,d
2,6,7,8,20,21


## Merging and joining data

Pandas allows the merging of pandas objects with database-like join operations, using the `pd.merge()`
function and the `.merge()` method of a `DataFrame` object. A merge combines the data of two pandas
objects by finding **matching values** in one or more columns or row indexes. It then returns a new object that
represents a combination of the data from both, based on relational-database-like join semantics applied to
those values.

Merges are useful as they allow us to model a single DataFrame for each type of data (one of the rules of
having tidy data), but to be able to relate data in different `DataFrame` objects using values existing in both
sets of data.

The default type of join performed by `pd.merge()` is an inner join. To use another join method, specify the
join type using the `how` parameter of the `pd.merge()` function (or the `.merge()` method). The valid options
are:
- `inner`: This is the intersection of keys from both DataFrame objects
- `outer`: This is the union of keys from both DataFrame objects
- `left`: This only uses keys from the left DataFrame
- `right`: This only uses keys from the right DataFrame
- `cross`: Create the cartesian product of rows of both frames

The related `join()` method uses `merge` internally for index-on-index (the default) and column(s)-on-index joins. If you are joining on the index only, you may wish to use `DataFrame.join` to save yourself some typing.
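The effect of the `how` options can be sketched on two tiny frames with partially overlapping keys. The frames and column names below are hypothetical, and `how='cross'` assumes pandas 1.2 or later.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "rval": [4, 5, 6]})

# Collect the keys that survive each join type
keys_by_how = {
    how: sorted(left.merge(right, on="key", how=how)["key"])
    for how in ["inner", "outer", "left", "right"]
}
for how, keys in keys_by_how.items():
    print(how, keys)
# inner ['b', 'c']
# outer ['a', 'b', 'c', 'd']
# left ['a', 'b', 'c']
# right ['b', 'c', 'd']

# A cross join takes no key: every left row is paired with every right row
cross = left.merge(right, how="cross")
print(len(cross))  # 9
```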

In [24]:
customers = pd.DataFrame({
    'customer_id': [10, 11],
    'name': ['Mike', 'April'],
    'address': ['address for Mike', 'address for April']
})
customers

Unnamed: 0,customer_id,name,address
0,10,Mike,address for Mike
1,11,April,address for April


In [28]:
orders = pd.DataFrame({
    'customer_id': [10, 11, 10],
    'order_date': [date(2014, 12, 1), date(2014, 3, 12), date(2014, 4, 14)]
})
orders

Unnamed: 0,customer_id,order_date
0,10,2014-12-01
1,11,2014-03-12
2,10,2014-04-14


By default, pandas finds the columns with common labels in both objects, treats them as the join keys, and matches the values in those key columns across the two `DataFrame` objects.

In [30]:
customers.merge(orders)

Unnamed: 0,customer_id,name,address,order_date
0,10,Mike,address for Mike,2014-12-01
1,10,Mike,address for Mike,2014-04-14
2,11,April,address for April,2014-03-12


In [33]:
left_data = pd.DataFrame({
    'key1': ['a', 'b', 'c'],
    'key2': ['x', 'y', 'z'],
    'lval1': [0, 1, 2]
    },
    index=[0, 1, 2]
)

right_data = pd.DataFrame({
    'key1': ['a', 'b', 'c'],
    'key2': ['x', 'a', 'z'],
    'lval1': [6, 7, 8]
    },
    index=[1, 2, 3])

left_data, right_data

(  key1 key2  lval1
 0    a    x      0
 1    b    y      1
 2    c    z      2,
   key1 key2  lval1
 1    a    x      6
 2    b    a      7
 3    c    z      8)

In [35]:
left_data.merge(right_data, on='key1')

Unnamed: 0,key1,key2_x,lval1_x,key2_y,lval1_y
0,a,x,0,x,6
1,b,y,1,a,7
2,c,z,2,z,8


In [36]:
left_data.merge(right_data, on=['key1', 'key2'])

Unnamed: 0,key1,key2,lval1_x,lval1_y
0,a,x,0,6
1,c,z,2,8


### Checking for duplicate keys

You can use the `validate` argument to automatically check whether there are unexpected duplicates in the merge keys. Key uniqueness is checked before merge operations and so should protect against memory *overflows*. Checking key uniqueness is also a good way to ensure user data structures are as expected.

If the user is aware of duplicates in the right `DataFrame` but wants to ensure there are no duplicates in the left `DataFrame`, one can use `validate='one_to_many'` instead, which will not raise an exception for duplicate keys on the right. Conversely, `validate='many_to_one'` allows duplicate keys on the left and checks only that the right keys are unique.

In [77]:
left_data.loc[1, 'key1'] = 'a'
pd.merge(left_data, right_data, how='outer', on='key1', validate='one_to_one')

MergeError: Merge keys are not unique in left dataset; not a one-to-one merge

In [79]:
pd.merge(left_data, right_data, how='outer', on='key1', validate='many_to_one')

Unnamed: 0,key1,key2_x,lval1_x,key2_y,lval1_y
0,a,x,0.0,x,6
1,a,y,1.0,x,6
2,c,z,2.0,z,8
3,b,,,a,7


To perform a merge with the labels of the row indexes of the two `DataFrame` objects, you can use the
`left_index=True` and `right_index=True` parameters.

In [80]:
pd.merge(left_data, right_data, left_index=True, right_index=True)

Unnamed: 0,key1_x,key2_x,lval1_x,key1_y,key2_y,lval1_y
1,a,y,1,a,x,6
2,c,z,2,b,a,7


If you would like to merge based on columns with different names in each object, you can use the `left_on` and `right_on` parameters, passing the name or names of the columns to each respective parameter.

When both objects contain other columns with the same name, use the `suffixes` parameter to specify the suffixes appended to those column names.

In [45]:
ladies = pd.DataFrame({
    'city': ['Taganrog', 'Rostov-on-Don', 'Krasnodar'],
    'name': ['Lustrova', 'Grigoryan', 'Voronina'],
    'year': [1977, 1977, 1979],
    'height': [162, 165, 158]
    },
    index=[1, 2, 3])

boys = pd.DataFrame({
    'place': ['Taganrog', 'Rostov-on-Don', 'Krasnodar'],
    'name': ['Antoshkin', 'Goltsov', 'Limonov'],
    'year': [1976, 1977, 1975],
    'height': [172, 175, 177]
    },
    index=[4, 5, 6])

ladies.merge(boys, left_on='city', right_on='place', suffixes=['_ladies', '_boys'])

Unnamed: 0,city,name_ladies,year_ladies,height_ladies,place,name_boys,year_boys,height_boys
0,Taganrog,Lustrova,1977,162,Taganrog,Antoshkin,1976,172
1,Rostov-on-Don,Grigoryan,1977,165,Rostov-on-Don,Goltsov,1977,175
2,Krasnodar,Voronina,1979,158,Krasnodar,Limonov,1975,177


## Joining on index

The pandas library also provides a `.join()` method that can be used to perform a join using the index labels
of the two DataFrame objects (instead of values in columns).

If the columns in the two `DataFrame` objects do not have unique column names, you must specify suffixes using the `lsuffix` and `rsuffix` parameters (unlike `merge`, `join` does not apply suffixes automatically).

The default type of join performed is a *left join*. Note that this differs from the default of the `.merge()`
method, which defaults to inner. To change to an inner join, specify `how='inner'`.
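A minimal sketch of the difference between the default left join and an inner join, using two made-up frames whose indexes overlap only on label `'b'`:

```python
import pandas as pd

left = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])
right = pd.DataFrame({"y": [10, 20]}, index=["b", "d"])

default_join = left.join(right)             # how='left': keep every left label
inner_join = left.join(right, how="inner")  # keep only labels present in both

print(default_join.index.tolist())  # ['a', 'b', 'c']
print(inner_join.index.tolist())    # ['b']
```

In the left join, rows `'a'` and `'c'` get `NaN` in column `y` because `right` has no such labels.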

In [55]:
ladies.join(boys, lsuffix='_ladies', rsuffix='_boys', how='outer')

Unnamed: 0,city,name_ladies,year_ladies,height_ladies,place,name_boys,year_boys,height_boys
1,Taganrog,Lustrova,1977.0,162.0,,,,
2,Rostov-on-Don,Grigoryan,1977.0,165.0,,,,
3,Krasnodar,Voronina,1979.0,158.0,,,,
4,,,,,Taganrog,Antoshkin,1976.0,172.0
5,,,,,Rostov-on-Don,Goltsov,1977.0,175.0
6,,,,,Krasnodar,Limonov,1975.0,177.0


The data alignment here is on the indexes (row labels). This same behavior can be achieved using merge plus additional arguments instructing it to use the indexes:

In [57]:
# get the same effect

pd.merge(ladies, boys, left_index=True, right_index=True, how='outer', suffixes=['_ladies', '_boys'])

Unnamed: 0,city,name_ladies,year_ladies,height_ladies,place,name_boys,year_boys,height_boys
1,Taganrog,Lustrova,1977.0,162.0,,,,
2,Rostov-on-Don,Grigoryan,1977.0,165.0,,,,
3,Krasnodar,Voronina,1979.0,158.0,,,,
4,,,,,Taganrog,Antoshkin,1976.0,172.0
5,,,,,Rostov-on-Don,Goltsov,1977.0,175.0
6,,,,,Krasnodar,Limonov,1975.0,177.0


In [81]:
# get almost the same effect (the duplicate column names are not suffixed)
pd.concat([ladies, boys], axis=1)

Unnamed: 0,city,name,year,height,place,name.1,year.1,height.1
1,Taganrog,Lustrova,1977.0,162.0,,,,
2,Rostov-on-Don,Grigoryan,1977.0,165.0,,,,
3,Krasnodar,Voronina,1979.0,158.0,,,,
4,,,,,Taganrog,Antoshkin,1976.0,172.0
5,,,,,Rostov-on-Don,Goltsov,1977.0,175.0
6,,,,,Krasnodar,Limonov,1975.0,177.0


### Joining key columns on an index

`join()` takes an optional `on` argument, which may be a column name or a list of column names. It specifies that the passed `DataFrame` is aligned on that column (or those columns) of the calling `DataFrame`.

The default for `DataFrame.join` is to perform a *left join*, which uses only the keys found in the calling DataFrame. Other join types can be just as easily performed:
- `left`: use the calling frame’s index (or column if `on` is specified)
- `right`: use other’s index
- `outer`: form the union of the calling frame’s index (or column if `on` is specified) with other’s index, and sort it lexicographically
- `inner`: form the intersection of the calling frame’s index (or column if `on` is specified) with other’s index, preserving the order of the calling frame’s index
- `cross`: create the cartesian product of both frames, preserving the order of the left keys



In [89]:
ladies.join(boys.set_index('year'), on='year', lsuffix='_ladies', rsuffix='_boys')

Unnamed: 0,city,name_ladies,year,height_ladies,place,name_boys,height_boys
1,Taganrog,Lustrova,1977,162,Rostov-on-Don,Goltsov,175.0
2,Rostov-on-Don,Grigoryan,1977,165,Rostov-on-Don,Goltsov,175.0
3,Krasnodar,Voronina,1979,158,,,


## Combining data

To combine a `DataFrame` column by column with another `DataFrame` using a function, use the

`DataFrame.combine(other, func, fill_value=None, overwrite=True)` method.

The method combines the calling `DataFrame` with `other`, using `func` to merge each pair of columns. `func` receives two `Series` (one column from each frame) and returns a `Series` or a scalar. The row and column indexes of the resulting `DataFrame` will be the union of the two.

In [92]:
df_c1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]})
df_c2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
take_smaller = lambda s1, s2: s1 if s1.sum() < s2.sum() else s2
df_c1.combine(df_c2, take_smaller)

Unnamed: 0,A,B
0,0,3
1,0,3


To update null elements in a `DataFrame` with the values at the same locations in another `DataFrame`, use `DataFrame.combine_first(other)`.

The method combines two `DataFrame` objects by filling **null values in one DataFrame** with **non-null values from the other DataFrame**. The row and column indexes of the resulting `DataFrame` will be the union of the two. When calling `first.combine_first(second)`, the result keeps the value from `first` wherever both `first.loc[index, col]` and `second.loc[index, col]` are present, and takes the value from `second` only where `first` has a missing value.

In [93]:
df_cf1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]})
df_cf2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
df_cf1.combine_first(df_cf2)

Unnamed: 0,A,B
0,1.0,3.0
1,0.0,4.0


In [95]:
df1 = pd.DataFrame([[np.nan, 3.0, 5.0], [-4.6, np.nan, np.nan], [np.nan, 7.0, np.nan]])
df2 = pd.DataFrame([[-42.6, np.nan, -8.2], [-5.0, 1.6, 4]], index=[1, 2])
df1.combine_first(df2)

Unnamed: 0,0,1,2
0,,3.0,5.0
1,-4.6,,-8.2
2,-5.0,7.0,4.0


## Other methods to learn

### Merging ordered data

The `merge_ordered()` function combines time series and other ordered data. In particular, it has an optional `fill_method` keyword to fill/interpolate missing data.
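A small sketch of `merge_ordered()` with and without forward filling. The `prices` and `volumes` frames and their `day` key are invented for illustration.

```python
import pandas as pd

prices = pd.DataFrame({"day": [1, 3, 5], "price": [10.0, 11.0, 12.0]})
volumes = pd.DataFrame({"day": [2, 3, 4], "volume": [100, 200, 300]})

# Ordered outer merge on "day": missing values stay NaN
plain = pd.merge_ordered(prices, volumes, on="day")

# Same merge, but gaps are forward-filled from the previous row
filled = pd.merge_ordered(prices, volumes, on="day", fill_method="ffill")

print(plain["day"].tolist())     # [1, 2, 3, 4, 5]
print(filled["price"].tolist())  # [10.0, 10.0, 11.0, 11.0, 12.0]
```

The first `volume` value stays `NaN` even with `fill_method="ffill"`, because there is no earlier row to fill from.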

### Merging asof

`merge_asof()` is similar to an ordered left join, except that it matches on the nearest key rather than on equal keys. For each row in the left `DataFrame`, it selects by default the last row in the right `DataFrame` whose `on` key is less than or equal to the left’s key. Both DataFrames must be sorted by the key.

Optionally, an asof merge can perform a group-wise merge. This matches the `by` key exactly, in addition to the nearest match on the `on` key.
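The following sketch contrasts a plain asof merge with a group-wise one. The `trades` and `quotes` frames, and the `time`, `ticker` and `bid` columns, are hypothetical; both frames are already sorted by `time`.

```python
import pandas as pd

trades = pd.DataFrame({"time": [1, 5, 10], "ticker": ["A", "B", "A"]})
quotes = pd.DataFrame({
    "time": [1, 2, 3, 7],
    "ticker": ["A", "B", "A", "B"],
    "bid": [100, 200, 101, 201],
})

# Nearest earlier-or-equal quote, regardless of ticker
nearest = pd.merge_asof(trades, quotes[["time", "bid"]], on="time")

# Nearest earlier-or-equal quote for the same ticker only
by_ticker = pd.merge_asof(trades, quotes, on="time", by="ticker")

print(nearest["bid"].tolist())    # bids 100, 101, 201
print(by_ticker["bid"].tolist())  # bids 100, 200, 101
```

The trade at time 10 for ticker A picks up the quote from time 7 in the plain merge, but the quote from time 3 once matching is restricted to the same ticker.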

### Comparing objects

The `compare()` method allows you to compare two `DataFrame` or two `Series` objects and summarize their differences.

For example, you might want to compare two `DataFrame` objects and stack their differences side by side.

By default, if two corresponding values are equal, they will be shown as `NaN`. Furthermore, if all values in an entire row / column are equal, the row / column will be omitted from the result. The remaining differences will be aligned on columns.
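A minimal sketch of `compare()` on two made-up frames with identical labels (the method assumes pandas 1.1 or later). The result has a two-level column index, with `self` holding the caller's values and `other` holding the argument's.

```python
import pandas as pd

before = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
after = pd.DataFrame({"a": [1, 20, 3], "b": ["x", "y", "w"]})

# Only rows and columns that contain at least one difference are kept
diff = before.compare(after)
print(diff)

# keep_shape=True keeps every row and column, with NaN where values are equal
diff_full = before.compare(after, keep_shape=True)
print(diff_full.shape)  # (3, 4)
```

Row 0 is dropped from `diff` because both frames agree on it completely.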