# Join, Combine, and Reshape a DataFrame

---

Oftentimes, data is spread across different files and stored in different formats. The analyst has to be able to deal with this problem and appropriately join the data files in order to work with the data as a whole and not only one part of it. In this lecture, we will cover one of the most important and slightly advanced functionalities of pandas: how to join and combine several DataFrames, along with the somewhat familiar pivoting and cross-tabulation operations.


### Lecture outline

---

* Hierarchical Indexing (MultiIndex)


* Combining and Merging


* Joining and Concatenation


* Reshaping and Pivoting


* Groupby


* Cross Tabulation


* Long to Wide format


* Wide to Long format

In [1]:
import pandas as pd

import numpy as np

## Hierarchical Indexing (MultiIndex)

---

Before we delve into pandas merging and reshaping operations, it's essential to know what a hierarchical index is and how to work with it.

Hierarchical indexing is an important feature of pandas that enables you to have multiple (two or more) index levels on an axis. Somewhat abstractly, it provides a way for you to work with higher dimensional data in a lower dimensional form, like Series (1d) and DataFrame (2d).


> Note that operations on a hierarchically indexed DataFrame differ because there are several index levels. Hence, we have to specify which level to use.
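
To make the "higher dimensional data in a lower dimensional form" idea concrete, here is a minimal sketch. The names `idx`, `s`, `table`, `outer`, and `inner` are made up for illustration. A one-dimensional Series with a two-level index holds the same information as a two-dimensional table, and `unstack()` converts between the two.

```python
import pandas as pd

# Hypothetical two-level index built from all (outer, inner) combinations
idx = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["outer", "inner"])

# A 1d Series that is addressed by two keys
s = pd.Series([10, 20, 30, 40], index=idx)

# Move the "inner" level into the columns to get the equivalent 2d table
table = s.unstack("inner")

print(s)
print(table)
```

Each value can be reached either as `s.loc[("a", 2)]` in the Series or as `table.loc["a", 2]` in the table.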

#### Reference

[MultiIndex / advanced indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)


[Multiindexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#multiindexing)

### Intro

In [None]:
multi_df = pd.DataFrame(data=np.random.randint(100, size=9),
                        index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                               [1, 2, 3, 1, 3, 1, 2, 1, 3]],
                        columns=["values"])


multi_df

In [None]:
multi_df.index # Return index object

multi_df.index.levels # Return index levels

multi_df.index.names # Return names in index levels. Currently no names

In [None]:
multi_df.index.names = ["index_1", "index_2"]

multi_df.index.names

In [None]:
multi_df.columns.names = ["column_index"]

multi_df.columns.names

### Slicing

In [None]:
multi_df

In [None]:
multi_df.xs(key="a", axis=0, level=0) # Get values at specified index

multi_df.xs(key=2, axis=0, level=1) # Get values at specified index

multi_df.xs(key=("a", 3)) # Get values at several indexes

multi_df.xs(key=("a", 3), axis=0, level=[0, 1]) # Get values at several indexes and levels

multi_df.xs(key="values", axis=1) # Get values at vertical axis

Instead of the `xs()` method, we can use the familiar `loc` indexer for slicing along different axes.

In [None]:
All = slice(None) # Python built-in slicer

In [None]:
multi_df.loc["a"] # Slice at the first level

multi_df.loc[["a", "c"]] # Selective slice at the first level

multi_df.loc["a"].loc[:2] # Slice at the second level


multi_df.loc[("a", All), All] # Return all values for "a" index at the first level

multi_df.loc[(All, 1), All] # Return all 1's from the second level

multi_df.loc[(All, 1), ("values")] # Same as above one. Selects all first level index and "1" from the second level

multi_df.loc[(slice("a", "c"), 2), All] # Selective slicing at both index level

### Reordering and Sorting Levels

---

Sometimes we need to swap the index levels and/or sort a MultiIndex DataFrame by one or both index levels. Here is how to do that.

In [None]:
multi_df

In [None]:
multi_df.swaplevel("index_2", "index_1") # Swap or change the index levels

We can sort a MultiIndex DataFrame either by index or by values.

In [None]:
multi_df.sort_index(level=0) # Sort by index level 0

multi_df.sort_index(level=1) # Sort by index level 1

In [None]:
multi_df

In [None]:
multi_df.sort_values(by=("values")) # Sort by column

### Summary Statistics by Level

In [None]:
multi_df

In [None]:
multi_df.sum() # Sum up all the values

multi_df.groupby(level=0).sum() # Sum up numbers at level 0 (`sum(level=0)` was removed in pandas 2.0)

multi_df.groupby(level=1).sum() # Sum up numbers at level 1

Other statistical and/or arithmetic functions work the same way: we have to explicitly indicate the level at which we want to perform the operation.
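
As a rough illustration of other per-level reductions, the sketch below assumes a recent pandas version, where per-level aggregation is written with `groupby(level=...)`. The frame `demo_df` and its values are hypothetical and only mirror the shape of `multi_df`.

```python
import pandas as pd

# Small hypothetical frame with a two-level index (names are illustrative)
idx = pd.MultiIndex.from_arrays(
    [["a", "a", "b", "b"], [1, 2, 1, 2]],
    names=["index_1", "index_2"],
)
demo_df = pd.DataFrame({"values": [10, 20, 30, 40]}, index=idx)

# Mean per label of the outer level
level_0_mean = demo_df.groupby(level="index_1").mean()

# Maximum per label of the inner level
level_1_max = demo_df.groupby(level="index_2").max()

print(level_0_mean)
print(level_1_max)
```

Levels can be referenced either by position (`level=0`) or by name (`level="index_1"`); names tend to be easier to read once the index has them.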

### Set and Reset MultiIndex

---

We can set and reset multiple index levels in our DataFrame by using the `set_index()` and `reset_index()` methods.

In [None]:
multi_df.reset_index(level=0) # Reset level 0 index


multi_df.reset_index(level=1) # Reset level 1 index


multi_df.reset_index() # Reset all the index

In [None]:
multi_df = multi_df.reset_index() # Reset index and set it again


multi_df

In [None]:
multi_df

In [None]:
multi_df.set_index(keys=["index_1", "index_2"]) # Set columns as index

By default, the columns used as the index are removed from the DataFrame. However, we can keep them in the DataFrame by passing `drop=False`.

In [None]:
multi_df.set_index(keys=["index_1", "index_2"], drop=False)

## Combining and Merging

---

In this part we will see how we can bring multiple DataFrame objects together, either by merging them horizontally, or by concatenating them vertically, along with combining and joining DataFrames.


* `merge()` - for combining data on common columns or indices


    * supports inner (default)/left/right/full (outer)/cross joins
    * can only join two DataFrames at a time
    * supports column-column, index-column, index-index joins


That's not all. We will also see how the pandas `append()` method works.



> Bonus: **CROSS JOIN** or **CARTESIAN PRODUCT**



> Big Bonus: `merge_asof()` to merge on nearest keys rather than equal keys.
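
The index-column join mentioned in the list can be sketched as follows. The `orders` and `customers` frames are hypothetical. `indicator=True` is an optional `merge()` argument that adds a `_merge` column showing where each row came from, which is handy when checking a join.

```python
import pandas as pd

# Hypothetical data: the key is a column on the left and the index on the right
orders = pd.DataFrame({"customer_id": [1, 2, 2, 4],
                       "amount": [50, 20, 35, 10]})
customers = pd.DataFrame({"name": ["Ann", "Bob", "Cid"]}, index=[1, 2, 3])

# Left join: match the left column against the right index
merged = pd.merge(orders, customers, how="left",
                  left_on="customer_id", right_index=True,
                  indicator=True)

print(merged)
```

Rows whose `customer_id` has no match in `customers` keep their left-hand values, get `NaN` for `name`, and are marked `left_only` in `_merge`.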

#### Reference


[Merge, join, concatenate and compare](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)


[Merge](https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#merge)


[Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101)


[Database-style DataFrame or named Series joining/merging](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#merging-join)

### Merging


---

Database-Style joining.



![Venn Diagram](images/merge.png)

In [41]:
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                     'value': [10, 20, 30, 40]})


left

Unnamed: 0,key,value
0,A,10
1,B,20
2,C,30
3,D,40


In [42]:
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'],
                      'value': [20, 40, 50, 60]})


right

Unnamed: 0,key,value
0,B,20
1,D,40
2,E,50
3,F,60


In [43]:
pd.merge(left=left, right=right, how="inner", on="key") # Inner join

Unnamed: 0,key,value_x,value_y
0,B,20,20
1,D,40,40


In [44]:
pd.merge(left=left, right=right, how="left", on="key") # Left join

Unnamed: 0,key,value_x,value_y
0,A,10,
1,B,20,20.0
2,C,30,
3,D,40,40.0


In [45]:
pd.merge(left=left, right=right, how="right", on="key") # Right join

Unnamed: 0,key,value_x,value_y
0,B,20.0,20
1,D,40.0,40
2,E,,50
3,F,,60


In [46]:
pd.merge(left=left, right=right, how="outer", on="key") # Outer join

Unnamed: 0,key,value_x,value_y
0,A,10.0,
1,B,20.0,20.0
2,C,30.0,
3,D,40.0,40.0
4,E,,50.0
5,F,,60.0


If the columns we are merging on have different names, we can use the `left_on` and `right_on` arguments of the `merge()` function. To see these features in action, let's modify our DataFrames.

In [47]:
left = left.rename({"key": "first_left_key"}, axis=1)

left

Unnamed: 0,first_left_key,value
0,A,10
1,B,20
2,C,30
3,D,40


In [49]:
right = right.rename({"key": "first_right_key"}, axis=1)

right

Unnamed: 0,first_right_key,value
0,B,20
1,D,40
2,E,50
3,F,60


In [50]:
pd.merge(left=left, right=right, how="inner", left_on="first_left_key", right_on="first_right_key")

Unnamed: 0,first_left_key,value_x,first_right_key,value_y
0,B,20,B,20
1,D,40,D,40


What if we want to use two or more columns for merging? That's not a problem. First of all, we need to add new columns to our DataFrames to perform multiple column merge.

In [51]:
left = left.rename({"first_left_key": "key_1"}, axis=1)

left.insert(1, "key_2", left["key_1"].str.lower())

left

Unnamed: 0,key_1,key_2,value
0,A,a,10
1,B,b,20
2,C,c,30
3,D,d,40


In [52]:
right = right.rename({"first_right_key": "key_1"}, axis=1)

right.insert(1, "key_2", right["key_1"].str.lower())

right

Unnamed: 0,key_1,key_2,value
0,B,b,20
1,D,d,40
2,E,e,50
3,F,f,60


In [53]:
pd.merge(left=left, right=right, how="inner", on=["key_1", "key_2"]) # Inner join with multiple key


left.merge(right=right, how="inner", on=["key_1", "key_2"]) # Same as above

Unnamed: 0,key_1,key_2,value_x,value_y
0,B,b,20,20
1,D,d,40,40


We can also merge DataFrames by using the index. To do so, we first need to set the index of our DataFrames.

In [54]:
left = left.set_index("key_1")

left

key_1,key_2,value
A,a,10
B,b,20
C,c,30
D,d,40


In [55]:
right = right.set_index("key_1")

right

key_1,key_2,value
B,b,20
D,d,40
E,e,50
F,f,60


In [56]:
pd.merge(left=left, right=right, how="inner", left_index=True, right_index=True) # Inner join based on index

key_1,key_2_x,value_x,key_2_y,value_y
B,b,20,b,20
D,d,40,d,40


### Cross Join

---

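A cross join is the Cartesian product of the two DataFrames: every row of the left DataFrame is paired with every row of the right DataFrame, just as every `x` is paired with every `y` on the `X-Y` plane.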

![Venn Diagram](images/cross_join.png)

In [57]:
left

key_1,key_2,value
A,a,10
B,b,20
C,c,30
D,d,40


In [58]:
right

key_1,key_2,value
B,b,20
D,d,40
E,e,50
F,f,60


In [59]:
left.merge(right, how="cross")

Unnamed: 0,key_2_x,value_x,key_2_y,value_y
0,a,10,b,20
1,a,10,d,40
2,a,10,e,50
3,a,10,f,60
4,b,20,b,20
5,b,20,d,40
6,b,20,e,50
7,b,20,f,60
8,c,30,b,20
9,c,30,d,40
10,c,30,e,50
11,c,30,f,60
12,d,40,b,20
13,d,40,d,40
14,d,40,e,50
15,d,40,f,60


### `append()`

---

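`append()` appends the rows of the second DataFrame to the end of the first DataFrame. Columns in the second DataFrame that are not in the first DataFrame are added as new columns. Note that `DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0; on current versions, use `pd.concat()` instead.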

In [60]:
left.append(right, ignore_index=False) # Preserves the index of the DataFrame

key_1,key_2,value
A,a,10
B,b,20
C,c,30
D,d,40
B,b,20
D,d,40
E,e,50
F,f,60


In [61]:
left.append(right, ignore_index=True) # Resets the old index and sets new one

Unnamed: 0,key_2,value
0,a,10
1,b,20
2,c,30
3,d,40
4,b,20
5,d,40
6,e,50
7,f,60


Let's add one more column to the right DataFrame to see whether the `append()` method really adds new columns.

In [62]:
right["new_value"] = right["value"] * 2

right

key_1,key_2,value,new_value
B,b,20,40
D,d,40,80
E,e,50,100
F,f,60,120


In [63]:
left.append(right, ignore_index=False) # Indeed, "append()" method adds new column

key_1,key_2,value,new_value
A,a,10,
B,b,20,
C,c,30,
D,d,40,
B,b,20,40.0
D,d,40,80.0
E,e,50,100.0
F,f,60,120.0
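
`DataFrame.append()` is not available in pandas 2.0 and later. A minimal sketch of the same operations with `pd.concat()` follows; the small `top` and `bottom` frames are hypothetical stand-ins for `left` and `right`.

```python
import pandas as pd

# Hypothetical stand-ins for the left and right frames used above
top = pd.DataFrame({"key_2": ["a", "b"], "value": [10, 20]},
                   index=["A", "B"])
bottom = pd.DataFrame({"key_2": ["b", "e"], "value": [20, 50],
                       "new_value": [40, 100]},
                      index=["B", "E"])

# Counterpart of top.append(bottom, ignore_index=False): keeps the original index
stacked = pd.concat([top, bottom])

# Counterpart of top.append(bottom, ignore_index=True): builds a fresh 0..n-1 index
renumbered = pd.concat([top, bottom], ignore_index=True)

print(stacked)
print(renumbered)
```

As with `append()`, columns that exist in only one input (`new_value` here) are kept and filled with `NaN` where they are missing.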


### `merge_asof()`

---

Pandas provides special functions for merging time-series DataFrames. Perhaps the most useful and popular one is `merge_asof()`. It is similar to an ordered left join, except that it matches on the nearest key rather than on equal keys. By default (a backward search), for each row in the left DataFrame it selects the last row in the right DataFrame whose `on` key is less than or equal to the left’s key. Both DataFrames must be sorted by the key.

#### Reference


[pandas.merge_asof](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html#pandas-merge-asof)

In [66]:
trades = pd.DataFrame({'time': pd.to_datetime(['20160525 13:30:00.023',
                                               '20160525 13:30:00.038',
                                               '20160525 13:30:00.048',
                                               '20160525 13:30:00.048',
                                               '20160525 13:30:00.048']),
                       'ticker': ['MSFT', 'MSFT','GOOG', 'GOOG', 'AAPL'],
                       'price': [51.95, 51.95,720.77, 720.92, 98.00],
                       'quantity': [75, 155,100, 100, 100]},
                      columns=['time', 'ticker', 'price', 'quantity'])



trades

Unnamed: 0,time,ticker,price,quantity
0,2016-05-25 13:30:00.023,MSFT,51.95,75
1,2016-05-25 13:30:00.038,MSFT,51.95,155
2,2016-05-25 13:30:00.048,GOOG,720.77,100
3,2016-05-25 13:30:00.048,GOOG,720.92,100
4,2016-05-25 13:30:00.048,AAPL,98.0,100


In [67]:
quotes = pd.DataFrame({'time': pd.to_datetime(['20160525 13:30:00.023',
                                               '20160525 13:30:00.023',
                                               '20160525 13:30:00.030',
                                               '20160525 13:30:00.041',
                                               '20160525 13:30:00.048',
                                               '20160525 13:30:00.049',
                                               '20160525 13:30:00.072',
                                               '20160525 13:30:00.075']),
                       'ticker': ['GOOG', 'MSFT', 'MSFT','MSFT', 'GOOG', 'AAPL', 'GOOG','MSFT'],
                       'bid': [720.50, 51.95, 51.97, 51.99,720.50, 97.99, 720.50, 52.01],
                       'ask': [720.93, 51.96, 51.98, 52.00,720.93, 98.01, 720.88, 52.03]},
                      columns=['time', 'ticker', 'bid', 'ask'])


quotes

Unnamed: 0,time,ticker,bid,ask
0,2016-05-25 13:30:00.023,GOOG,720.5,720.93
1,2016-05-25 13:30:00.023,MSFT,51.95,51.96
2,2016-05-25 13:30:00.030,MSFT,51.97,51.98
3,2016-05-25 13:30:00.041,MSFT,51.99,52.0
4,2016-05-25 13:30:00.048,GOOG,720.5,720.93
5,2016-05-25 13:30:00.049,AAPL,97.99,98.01
6,2016-05-25 13:30:00.072,GOOG,720.5,720.88
7,2016-05-25 13:30:00.075,MSFT,52.01,52.03


In [70]:
pd.merge_asof(trades, quotes, on="time", by="ticker") # Approximate or nearest merge

Unnamed: 0,time,ticker,price,quantity,bid,ask
0,2016-05-25 13:30:00.023,MSFT,51.95,75,51.95,51.96
1,2016-05-25 13:30:00.038,MSFT,51.95,155,51.97,51.98
2,2016-05-25 13:30:00.048,GOOG,720.77,100,720.5,720.93
3,2016-05-25 13:30:00.048,GOOG,720.92,100,720.5,720.93
4,2016-05-25 13:30:00.048,AAPL,98.0,100,,


If you look carefully, you can see why `NaN` appears in the `AAPL` row. The right DataFrame `quotes` has no `AAPL` row with a time less than or equal to `13:30:00.048` (the time in the left table), so `NaN`s were introduced in the `bid` and `ask` columns.
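
The matching rule can be adjusted with the `direction`, `tolerance`, and `allow_exact_matches` arguments of `merge_asof()`. The sketch below uses two tiny hypothetical frames, `trades_demo` and `quotes_demo`, to show how these options change which quote is picked.

```python
import pandas as pd

# Hypothetical, already time-sorted inputs
trades_demo = pd.DataFrame({
    "time": pd.to_datetime(["2016-05-25 13:30:00.023",
                            "2016-05-25 13:30:00.048"]),
    "ticker": ["MSFT", "AAPL"],
    "price": [51.95, 98.00],
})
quotes_demo = pd.DataFrame({
    "time": pd.to_datetime(["2016-05-25 13:30:00.023",
                            "2016-05-25 13:30:00.049"]),
    "ticker": ["MSFT", "AAPL"],
    "bid": [51.95, 97.99],
})

# Default: backward search, exact time matches allowed
backward = pd.merge_asof(trades_demo, quotes_demo, on="time", by="ticker")

# Strictly earlier quotes only
strict = pd.merge_asof(trades_demo, quotes_demo, on="time", by="ticker",
                       allow_exact_matches=False)

# Closest quote in either direction, but no further than 2 ms away
nearest = pd.merge_asof(trades_demo, quotes_demo, on="time", by="ticker",
                        direction="nearest", tolerance=pd.Timedelta("2ms"))

print(backward)
print(strict)
print(nearest)
```

With the default settings only the `MSFT` trade finds a quote. Disallowing exact matches removes that match too, while the `nearest` search also picks up the `AAPL` quote that arrives one millisecond after the trade.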

### Combining

---

There is another data combination situation that can’t be expressed as either a merge or concatenation operation. Imagine the situation of having two datasets whose indexes overlap in full or part.

As a motivating example, consider NumPy’s `where()` function, which performs the array-oriented equivalent of an `if-else` expression.

In [None]:
series_1 = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
                     index=['f', 'e', 'd', 'c', 'b', 'a'])


series_1

In [None]:
series_2 = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, np.nan],
                     index=['f', 'e', 'd', 'c', 'b', 'a'])


series_2

If `series_1` is null then `series_2`, otherwise `series_1`

In [None]:
np.where(pd.isnull(series_1), series_2, series_1)

The pandas Series object has a `combine_first()` method, which performs the equivalent of the above operation along with pandas' usual data alignment logic.

In [None]:
series_2[:-2].combine_first(series_1[2:])

There is a `combine()` method which takes a function and combines the series according to this function. The function takes two scalars as inputs and returns a single element.

In [None]:
series_2.combine(series_1, max)

In [None]:
series_2.combine(series_1, min)

Now it's time to perform the same operations on DataFrames to see how they work when we have DataFrames instead of Series.

In [None]:
df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
                    'b': [np.nan, 2., np.nan, 6.],
                    'c': range(2, 18, 4)})


df1

In [None]:
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
                    'b': [np.nan, 3., 4., 6., 8.]})



df2

In [None]:
df1.combine_first(df2) # Updates null elements with value in the same location in other

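The pandas DataFrame `combine()` method takes another DataFrame and a function. The function receives two Series (the matching column from each DataFrame) and returns a Series or a scalar. In other words, it performs a column-wise combine with another DataFrame.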

In [None]:
df1.combine(df2, np.minimum) # np.minimum performs elementwise min operation

In [None]:
df1.combine(df2, np.maximum) # np.maximum performs elementwise max operation

In [None]:
df1.combine(df2, np.add) # np.add performs elementwise summation

## Joining and Concatenation

---


* `join()` - for combining data on a key column or an index


    * supports inner/left (default)/right/full
    * can join multiple DataFrames at a time
    * supports index-index and column-index joins (via `on`)


* `concat()` - for combining DataFrames across rows or columns


    * supports inner/full (default)
    * can join multiple DataFrames at a time
    * supports index-index joins



Under the hood, `join()` uses `merge()`, but it provides a more convenient way to join DataFrames on their indexes than a fully specified `merge()` call. Moreover, `join()` can combine many DataFrame objects that have the same or similar indexes but non-overlapping columns.

### Join

---


https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html


https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html


https://realpython.com/pandas-merge-join-and-concat/#pandas-join-combining-data-on-a-column-or-index


https://stackoverflow.com/questions/53645882/pandas-merging-101/65167356#65167356

In [64]:
left

key_1,key_2,value
A,a,10
B,b,20
C,c,30
D,d,40


In [65]:
right

key_1,key_2,value,new_value
B,b,20,40
D,d,40,80
E,e,50,100
F,f,60,120
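
A minimal sketch of how `join()` could be applied to frames shaped like `left` and `right` is shown below. The frames `left_demo`, `right_demo`, and `extra`, as well as the suffix names, are hypothetical and rebuilt here so that the example is self-contained.

```python
import pandas as pd

# Hypothetical copies of the frames displayed above, indexed by key_1
left_demo = pd.DataFrame(
    {"key_2": list("abcd"), "value": [10, 20, 30, 40]},
    index=pd.Index(list("ABCD"), name="key_1"),
)
right_demo = pd.DataFrame(
    {"key_2": list("bdef"), "value": [20, 40, 50, 60],
     "new_value": [40, 80, 100, 120]},
    index=pd.Index(list("BDEF"), name="key_1"),
)

# join() is a left join on the index by default;
# overlapping column names need suffixes
left_join = left_demo.join(right_demo, lsuffix="_left", rsuffix="_right")

# Keep only the index labels present in both frames
inner_join = left_demo.join(right_demo, how="inner",
                            lsuffix="_left", rsuffix="_right")

# Joining several frames at once works when their columns do not overlap
extra = pd.DataFrame({"flag": [True, False]},
                     index=pd.Index(["A", "B"], name="key_1"))
multi_join = left_demo.join([right_demo[["new_value"]], extra])

print(left_join)
print(inner_join)
print(multi_join)
```

Suffixes are only accepted when joining a single frame, which is why the multi-frame call passes frames with distinct column names.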


### Concatenation


https://stackoverflow.com/questions/49620538/what-are-the-levels-keys-and-names-arguments-for-in-pandas-concat-functio


https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html


https://realpython.com/pandas-merge-join-and-concat/#pandas-concat-combining-data-across-rows-or-columns

**Row Concatenation**


![Concatenation](images/concat_row.png)

**Column Concatenation**


![Concatenation](images/concat_column.png)
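
The two pictures can be reproduced with a small sketch. The frames `df_a` and `df_b` and their labels are made up for illustration. The `join` argument controls whether the non-concatenated axis keeps the union (`"outer"`, the default) or the intersection (`"inner"`) of labels, and `keys` records which input each row came from.

```python
import pandas as pd

# Hypothetical frames with partly overlapping columns and index labels
df_a = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["r1", "r2"])
df_b = pd.DataFrame({"x": [5, 6], "z": [7, 8]}, index=["r2", "r3"])

# Row concatenation: stack the frames vertically, keep the union of columns
rows_outer = pd.concat([df_a, df_b])

# Row concatenation keeping only the columns shared by all inputs
rows_inner = pd.concat([df_a, df_b], join="inner")

# Column concatenation: place the frames side by side, aligned on the index
cols = pd.concat([df_a, df_b], axis=1)

# keys adds an outer index level that labels the source of each row
keyed = pd.concat([df_a, df_b], keys=["a", "b"])

print(rows_outer)
print(rows_inner)
print(cols)
print(keyed)
```

Because `keyed` has a hierarchical index, the rows of one source can be selected again with `keyed.loc["a"]`.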

## Groups and Aggregations with groupby()

---

TODO: insert the `Group_By` notebook here.

In [None]:
athletes = pd.read_csv('athletes.csv')
athletes.info()

In [None]:
# Simply calling groupby returns a GroupBy object 
# This does not calculate anything yet!
g = athletes.groupby('nationality')[['gold', 'silver', 'bronze']]

In [None]:
# Calling an aggregation function on the GroupBy object
# applies the calculation for every group
# and constructs a DataFrame with the results
g.sum()

In [None]:
# We can select multiple columns to group by
# And we can select a subset of columns to aggregate
g = athletes.groupby(['sport', 'sex'])[['weight', 'height']]

In [None]:
# Because we selected only 2 columns, only those columns are aggregated, which makes the calculation cheaper
g.mean()

## Reshaping Rows and Columns with stack() and unstack()

In [None]:
m = pd.read_csv('monthly_data.csv')
m

In [None]:
# Preparation: move the 'YYYY' column into the index
m.set_index('YYYY', inplace=True)
m

In [None]:
# stack() moves the column labels into the innermost index level, so all values end up in a single column
m.stack()

In [None]:
# stack() also allows quick calculations over all cells
m.stack().sum()

In [None]:
w = athletes.groupby(['sport', 'sex'])['weight'].mean()
w

In [None]:
# unstack() takes the inner index level and creates a column for every unique value in that level
# It then moves the data into these columns
w.unstack()
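
Because `monthly_data.csv` is not available here, this sketch uses a tiny hypothetical frame, `monthly_demo`, to show that `stack()` and `unstack()` are inverse reshaping steps.

```python
import pandas as pd

# Hypothetical monthly table: one row per year, one column per month
monthly_demo = pd.DataFrame(
    {"JAN": [1.0, 2.0], "FEB": [3.0, 4.0]},
    index=pd.Index([2020, 2021], name="YYYY"),
)

# Wide to long: the result is a Series indexed by (year, month)
long_form = monthly_demo.stack()

# Long to wide: the inner index level becomes the columns again
wide_again = long_form.unstack()

print(long_form)
print(wide_again)
```

Individual cells stay addressable in both shapes: `long_form.loc[(2020, "JAN")]` and `wide_again.loc[2020, "JAN"]` refer to the same value.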

## Reshaping Rows and Columns with pivot()

---

TODO: insert the `Pivot_Table` notebook here.

In [None]:
p = pd.DataFrame({'id': [823905, 823905,
                         235897, 235897, 235897,
                         983422, 983422],
                  'item': ['prize', 'unit', 
                           'prize', 'unit', 'stock', 
                           'prize', 'stock'],
                  'value': [3.49, 'kg',
                            12.89, 'l', 50,
                            0.49, 4]})
p

In [None]:
# pivot() moves data from rows into columns
# so that we end up with a wider, shorter DataFrame

# `index` is the column that will be used for the row index
# `columns` is the column that will be used to create the column labels
# (keyword arguments are required in pandas 2.0 and later)
p.pivot(index='id', columns='item')

In [None]:
grades = pd.DataFrame([[6, 4, 5], [7, 8, 7], [6, 7, 9], [6, 5, 5], [5, 2, 7]], 
                       index = ['Mary', 'John', 'Ann', 'Pete', 'Laura'],
                       columns = ['test_1', 'test_2', 'test_3'])
grades.reset_index(inplace=True)
grades

In [None]:
# melt() is the opposite of pivot()
# It moves the data from several columns into a single "value" column
# The column names will show up in a new column called "variable"
grades.melt(id_vars=['index'])
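
To see that `melt()` and `pivot()` undo each other, here is a minimal round-trip sketch. The frame `grades_demo` and the column names `student`, `test`, and `grade` are hypothetical.

```python
import pandas as pd

# Hypothetical wide table: one row per student, one column per test
grades_demo = pd.DataFrame({"student": ["Mary", "John"],
                            "test_1": [6, 7],
                            "test_2": [4, 8]})

# Wide to long: one row per (student, test) pair
long_grades = grades_demo.melt(id_vars="student",
                               var_name="test", value_name="grade")

# Long to wide: back to one column per test
wide_grades = long_grades.pivot(index="student", columns="test",
                                values="grade")

print(long_grades)
print(wide_grades)
```

`var_name` and `value_name` replace the default `variable` and `value` column names, which makes the long table easier to read.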

## Long to Wide format

---

https://chrisalbon.com/python/data_wrangling/pandas_long_to_wide/

## Wide to Long format

---

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.wide_to_long.html


https://stackoverflow.com/questions/36537945/reshape-wide-to-long-in-pandas


https://stackoverflow.com/questions/22798934/pandas-long-to-wide-reshape-by-two-variables



# See Also

---

https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#grouping


https://towardsdatascience.com/reshape-pandas-dataframe-with-pivot-table-in-python-tutorial-and-visualization-2248c2012a31


https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html



https://stackoverflow.com/questions/15322632/python-pandas-df-groupby-agg-column-reference-in-agg


https://stackoverflow.com/questions/14916358/reshaping-dataframes-in-pandas-based-on-column-labels


https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html


https://stackoverflow.com/questions/47152691/how-to-pivot-a-dataframe

# Summary

---

Now you know how to merge and concatenate datasets. You will find these functions very useful for
combining data to produce richer results and to analyze them. A solid understanding of
how to merge data is absolutely essential when you are procuring, cleaning, and manipulating data. It's
worth knowing how to join different datasets quickly and which options are available when joining
them, and I would encourage you to check out the pandas docs on joining and concatenating data.