# Join, Combine, and Reshape a DataFrame

---

Oftentimes, the data is spread across different files and stored in different formats. The analyst has to be able to deal with this kind of problem and join the data files appropriately in order to operate on the whole data and not only on one part of it. In this lecture, we will cover one of the most important and slightly advanced functionalities of Pandas: how to join and combine several DataFrames, along with the somewhat familiar pivoting and cross-tabulation operations.


### Lecture outline

---

* Hierarchical Indexing (MultiIndex)


* Combining and Merging


* Joining and Concatenation


* Reshaping and Pivoting


    * Wide to Long format
    
    * Long to Wide format


* Groupby


* Pivot Table


* Cross Tabulation

In [1]:
import pandas as pd

import numpy as np

## Hierarchical Indexing (MultiIndex)

---

Before we delve deep into Pandas merging and reshaping operations, it's essential to know what a hierarchical index is and how to work with it.

Hierarchical indexing is an important feature of pandas that enables you to have multiple (two or more) index levels on an axis. Somewhat abstractly, it provides a way for you to work with higher dimensional data in a lower dimensional form, like Series (1d) and DataFrame (2d).


> Note that operations on a hierarchically indexed DataFrame differ from ordinary ones because there are several index levels. Hence, we have to state which level to use.
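To make the idea concrete, here is a minimal sketch that builds a two-level index explicitly with `pd.MultiIndex.from_arrays()` and stores two-dimensional information in a one-dimensional Series. The labels, level names, and values are made up for illustration and are not part of the lecture data.

```python
import pandas as pd

# Hypothetical labels: an outer level of letters and an inner level of numbers
index = pd.MultiIndex.from_arrays(
    [["a", "a", "b", "b"], [1, 2, 1, 2]],
    names=["outer", "inner"],
)

# A 1-d Series that carries 2-d information through its two index levels
series = pd.Series([10, 20, 30, 40], index=index)

n_levels = series.index.nlevels      # number of index levels
inner_of_a = series.loc["a"]         # partial key: everything under outer label "a"
single_value = series.loc[("b", 2)]  # full key: exactly one element

print(n_levels)
print(inner_of_a)
print(single_value)
```

A partial key drops the level it matched, so `inner_of_a` is indexed only by the inner level.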

#### Reference

[MultiIndex / advanced indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)


[Multiindexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#multiindexing)

### Intro

In [17]:
np.random.seed(425)

In [18]:
multi_df = pd.DataFrame(data=np.random.randint(100, size=9),
                        index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                               [1, 2, 3, 1, 3, 1, 2, 1, 3]],
                        columns=["values"])


multi_df

Unnamed: 0,Unnamed: 1,values
a,1,13
a,2,13
a,3,82
b,1,96
b,3,76
c,1,82
c,2,19
d,1,59
d,3,27


In [19]:
multi_df.index # Return index object

multi_df.index.levels # Return index levels

multi_df.index.names # Return names in index levels. Currently no names

FrozenList([None, None])

In [20]:
multi_df.index.names = ["index_1", "index_2"]

multi_df.index.names

FrozenList(['index_1', 'index_2'])

In [21]:
multi_df.columns.names = ["column_index"]

multi_df.columns.names

FrozenList(['column_index'])

### Slicing

In [22]:
multi_df

Unnamed: 0_level_0,column_index,values
index_1,index_2,Unnamed: 2_level_1
a,1,13
a,2,13
a,3,82
b,1,96
b,3,76
c,1,82
c,2,19
d,1,59
d,3,27


In [27]:
multi_df.xs(key="a", axis=0, level=0) # Get values at specified index

multi_df.xs(key=2, axis=0, level=1) # Get values at specified index

multi_df.xs(key=("a", 3)) # Get values at several indexes

multi_df.xs(key=("a", 3), axis=0, level=[0, 1]) # Get values at several indexes and levels

multi_df.xs(key="values", axis=1) # Select the "values" column (axis=1 is the column axis)

index_1  index_2
a        1          13
         2          13
         3          82
b        1          96
         3          76
c        1          82
         2          19
d        1          59
         3          27
Name: values, dtype: int64

Instead of the `xs()` method, we can use the familiar `loc` indexer to slice along either axis.

In [28]:
All = slice(None) # Python built-in slicer

In [29]:
All

slice(None, None, None)

In [30]:
multi_df

Unnamed: 0_level_0,column_index,values
index_1,index_2,Unnamed: 2_level_1
a,1,13
a,2,13
a,3,82
b,1,96
b,3,76
c,1,82
c,2,19
d,1,59
d,3,27


In [38]:
multi_df.loc["a"] # Slice at the first level

multi_df.loc[["a", "c"]] # Selective slice at the first level

multi_df.loc["a"].loc[:2] # Slice at the second level


multi_df.loc[("a", All), All] # Return all values for "a" index at the first level

multi_df.loc[(All, 1), All] # Return all 1's from the second level

multi_df.loc[(All, 1), ("values")] # Same as above one. Selects all first level index and "1" from the second level

multi_df.loc[(slice("a", "c"), 2), All] # Selective slicing at both index level

Unnamed: 0_level_0,column_index,values
index_1,index_2,Unnamed: 2_level_1
a,2,13
c,2,19
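As an alternative to spelling out `slice(None)`, pandas ships the `pd.IndexSlice` helper, which lets you write the same selections with ordinary `:` syntax. The sketch below uses a small made-up frame shaped like `multi_df`; its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Small frame shaped like the lecture's multi_df (values here are made up)
df = pd.DataFrame(
    {"values": np.arange(6)},
    index=pd.MultiIndex.from_product(
        [["a", "b", "c"], [1, 2]], names=["index_1", "index_2"]
    ),
)

idx = pd.IndexSlice

# All first-level labels, only label 2 on the second level
second_level_2 = df.loc[idx[:, 2], :]

# Labels "a" through "b" on the first level, everything on the second level
a_to_b = df.loc[idx["a":"b", :], :]

print(second_level_2)
print(a_to_b)
```

Label slices such as `"a":"b"` include both end points and expect the index to be sorted, which is the case here because `from_product` generates the labels in order.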


### Reordering and Sorting Levels

---

Sometimes, we need to swap the index levels and/or sort a MultiIndex DataFrame by one or both index levels. The methods below do exactly that.

In [39]:
multi_df

Unnamed: 0_level_0,column_index,values
index_1,index_2,Unnamed: 2_level_1
a,1,13
a,2,13
a,3,82
b,1,96
b,3,76
c,1,82
c,2,19
d,1,59
d,3,27


In [40]:
multi_df.swaplevel("index_2", "index_1") # Swap or change the index levels

Unnamed: 0_level_0,column_index,values
index_2,index_1,Unnamed: 2_level_1
1,a,13
2,a,13
3,a,82
1,b,96
3,b,76
1,c,82
2,c,19
1,d,59
3,d,27


We can sort a MultiIndex DataFrame either by index or by values.

In [43]:
multi_df.sort_index(level=0) # Sort by index level 0

multi_df.sort_index(level=1) # Sort by index level 1

Unnamed: 0_level_0,column_index,values
index_1,index_2,Unnamed: 2_level_1
a,1,13
b,1,96
c,1,82
d,1,59
a,2,13
c,2,19
a,3,82
b,3,76
d,3,27


In [44]:
multi_df

Unnamed: 0_level_0,column_index,values
index_1,index_2,Unnamed: 2_level_1
a,1,13
a,2,13
a,3,82
b,1,96
b,3,76
c,1,82
c,2,19
d,1,59
d,3,27


In [45]:
multi_df.sort_values(by=("values")) # Sort by column

Unnamed: 0_level_0,column_index,values
index_1,index_2,Unnamed: 2_level_1
a,1,13
a,2,13
c,2,19
d,3,27
d,1,59
b,3,76
a,3,82
c,1,82
b,1,96
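Swapping and sorting are often combined, because `swaplevel()` only changes the order of the levels and leaves the rows where they were. The following sketch, on a small made-up frame, swaps the levels and then sorts, and also sorts by both levels with a different direction for each.

```python
import pandas as pd

# Made-up frame whose inner level is deliberately out of order
df = pd.DataFrame(
    {"values": [5, 6, 7, 8]},
    index=pd.MultiIndex.from_tuples(
        [("a", 2), ("a", 1), ("b", 2), ("b", 1)],
        names=["index_1", "index_2"],
    ),
)

# Swap the two levels, then sort so the new outer level is ordered
swapped_sorted = df.swaplevel("index_1", "index_2").sort_index()

# Sort by both levels: index_1 ascending, index_2 descending
mixed_order = df.sort_index(level=["index_1", "index_2"], ascending=[True, False])

print(swapped_sorted)
print(mixed_order)
```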


### Summary Statistics by Level

In [46]:
multi_df

Unnamed: 0_level_0,column_index,values
index_1,index_2,Unnamed: 2_level_1
a,1,13
a,2,13
a,3,82
b,1,96
b,3,76
c,1,82
c,2,19
d,1,59
d,3,27


In [49]:
multi_df.sum() # Sum up all the values

multi_df.groupby(level=0).sum() # Sum within each label of level 0 (sum(level=0) was removed in pandas 2.0)

multi_df.groupby(level=1).sum() # Sum within each label of level 1

column_index,values
index_2,Unnamed: 1_level_1
1,250
2,32
3,185


Other statistical and arithmetic functions work the same way: we have to state explicitly at which level the operation should be performed.
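A per-level aggregation can be sketched with `groupby(level=...)`, which works across pandas versions. The frame below is a made-up stand-in for `multi_df`; the level can be given by position or, as here, by name.

```python
import pandas as pd

# Made-up two-level frame
df = pd.DataFrame(
    {"values": [1, 2, 3, 4]},
    index=pd.MultiIndex.from_tuples(
        [("a", 1), ("a", 2), ("b", 1), ("b", 2)],
        names=["index_1", "index_2"],
    ),
)

# One row per label of the chosen level
sum_by_first = df.groupby(level="index_1").sum()
mean_by_second = df.groupby(level="index_2").mean()

print(sum_by_first)
print(mean_by_second)
```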

### Set and Reset MultiIndex

---

We can set and reset multiple index levels in our DataFrame by using the `set_index()` and `reset_index()` methods.

In [50]:
multi_df

Unnamed: 0_level_0,column_index,values
index_1,index_2,Unnamed: 2_level_1
a,1,13
a,2,13
a,3,82
b,1,96
b,3,76
c,1,82
c,2,19
d,1,59
d,3,27


In [53]:
multi_df.reset_index(level=0) # Reset level 0 index


multi_df.reset_index(level=1) # Reset level 1 index


multi_df.reset_index() # Reset all the index

column_index,index_1,index_2,values
0,a,1,13
1,a,2,13
2,a,3,82
3,b,1,96
4,b,3,76
5,c,1,82
6,c,2,19
7,d,1,59
8,d,3,27


In [54]:
multi_df = multi_df.reset_index() # Reset index and set it again


multi_df

column_index,index_1,index_2,values
0,a,1,13
1,a,2,13
2,a,3,82
3,b,1,96
4,b,3,76
5,c,1,82
6,c,2,19
7,d,1,59
8,d,3,27


In [55]:
multi_df.set_index(keys=["index_1", "index_2"]) # Set columns as index

Unnamed: 0_level_0,column_index,values
index_1,index_2,Unnamed: 2_level_1
a,1,13
a,2,13
a,3,82
b,1,96
b,3,76
c,1,82
c,2,19
d,1,59
d,3,27


By default, the columns used as the index are removed from the DataFrame. However, we can keep them in the DataFrame by passing `drop=False`.

In [56]:
multi_df.set_index(keys=["index_1", "index_2"], drop=False)

Unnamed: 0_level_0,column_index,index_1,index_2,values
index_1,index_2,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
a,1,a,1,13
a,2,a,2,13
a,3,a,3,82
b,1,b,1,96
b,3,b,3,76
c,1,c,1,82
c,2,c,2,19
d,1,d,1,59
d,3,d,3,27


## Combining and Merging

---

In this part we will see how we can bring multiple DataFrame objects together, either by merging them horizontally, or by concatenating them vertically, along with combining and joining DataFrames.


* `merge()` - for combining data on common columns or indices


    * supports inner/left/right/outer (full) joins
    * can only join two DataFrames at a time
    * supports column-column, index-column, index-index joins


That's not all. We will also see how the Pandas `append()` method works.



> Bonus: **CROSS JOIN** or **CARTESIAN PRODUCT**



> Big Bonus: `merge_asof()` to merge on nearest keys rather than equal keys.

#### Reference


[Merge, join, concatenate and compare](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)


[Merge](https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#merge)


[Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101)


[Database-style DataFrame or named Series joining/merging](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#merging-join)

### Merging


---

Database-Style joining.



![Venn Diagram](images/merge.png)

In [57]:
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                     'value': [10, 20, 30, 40]})


left

Unnamed: 0,key,value
0,A,10
1,B,20
2,C,30
3,D,40


In [58]:
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'],
                      'value': [20, 40, 50, 60]})


right

Unnamed: 0,key,value
0,B,20
1,D,40
2,E,50
3,F,60


In [59]:
pd.merge(left=left, right=right, how="inner", on="key") # Inner join

Unnamed: 0,key,value_x,value_y
0,B,20,20
1,D,40,40


In [60]:
pd.merge(left=left, right=right, how="left", on="key") # Left join

Unnamed: 0,key,value_x,value_y
0,A,10,
1,B,20,20.0
2,C,30,
3,D,40,40.0


In [61]:
pd.merge(left=left, right=right, how="right", on="key") # Right join

Unnamed: 0,key,value_x,value_y
0,B,20.0,20
1,D,40.0,40
2,E,,50
3,F,,60


In [62]:
pd.merge(left=left, right=right, how="outer", on="key") # Outer join

Unnamed: 0,key,value_x,value_y
0,A,10.0,
1,B,20.0,20.0
2,C,30.0,
3,D,40.0,40.0
4,E,,50.0
5,F,,60.0
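The `_x` and `_y` endings in the outputs above are the default suffixes that `merge()` adds to overlapping non-key columns. As a small sketch, the `suffixes` and `indicator` parameters of `pd.merge()` make the result easier to read: the first renames the overlapping columns, the second adds a `_merge` column saying where each row came from. The suffix strings chosen here are arbitrary.

```python
import pandas as pd

left = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": [10, 20, 30, 40]})
right = pd.DataFrame({"key": ["B", "D", "E", "F"], "value": [20, 40, 50, 60]})

merged = pd.merge(
    left,
    right,
    how="outer",
    on="key",
    suffixes=("_left", "_right"),  # replaces the default "_x" / "_y"
    indicator=True,                # adds a "_merge" column
)

print(merged)
```

The `_merge` column takes the values `left_only`, `right_only`, or `both`, which is a quick way to check which keys failed to match.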


If the columns we are merging on have different names, we can use the `left_on` and `right_on` arguments of the `merge()` function. To see these features in action, let us modify our DataFrames.

In [63]:
left = left.rename({"key": "first_left_key"}, axis=1)

left

Unnamed: 0,first_left_key,value
0,A,10
1,B,20
2,C,30
3,D,40


In [64]:
right = right.rename({"key": "first_right_key"}, axis=1)

right

Unnamed: 0,first_right_key,value
0,B,20
1,D,40
2,E,50
3,F,60


In [65]:
pd.merge(left=left, right=right, how="inner", left_on="first_left_key", right_on="first_right_key")

Unnamed: 0,first_left_key,value_x,first_right_key,value_y
0,B,20,B,20
1,D,40,D,40


What if we want to use two or more columns for merging? That's not a problem. First of all, we need to add new columns to our DataFrames to perform multiple column merge.

In [66]:
left = left.rename({"first_left_key": "key_1"}, axis=1)

left.insert(1, "key_2", left["key_1"].str.lower())

left

Unnamed: 0,key_1,key_2,value
0,A,a,10
1,B,b,20
2,C,c,30
3,D,d,40


In [67]:
right = right.rename({"first_right_key": "key_1"}, axis=1)

right.insert(1, "key_2", right["key_1"].str.lower())

right

Unnamed: 0,key_1,key_2,value
0,B,b,20
1,D,d,40
2,E,e,50
3,F,f,60


In [69]:
pd.merge(left=left, right=right, how="inner", on=["key_1", "key_2"]) # Inner join with multiple key


left.merge(right=right, how="inner", on=["key_1", "key_2"]) # Same as above

Unnamed: 0,key_1,key_2,value_x,value_y
0,B,b,20,20
1,D,d,40,40


We can also merge DataFrames by using the index. To do so, we first need to set an index for our DataFrames.

In [70]:
left = left.set_index("key_1")

left

Unnamed: 0_level_0,key_2,value
key_1,Unnamed: 1_level_1,Unnamed: 2_level_1
A,a,10
B,b,20
C,c,30
D,d,40


In [71]:
right = right.set_index("key_1")

right

Unnamed: 0_level_0,key_2,value
key_1,Unnamed: 1_level_1,Unnamed: 2_level_1
B,b,20
D,d,40
E,e,50
F,f,60


In [72]:
pd.merge(left=left, right=right, how="inner", left_index=True, right_index=True) # Inner join based on index

Unnamed: 0_level_0,key_2_x,value_x,key_2_y,value_y
key_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
B,b,20,b,20
D,d,40,d,40


### Cross Join

---

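A cross join is the Cartesian product of the two DataFrames: every row of the left DataFrame is paired with every row of the right one, just as every `x` is paired with every `y` on the `X-Y` plane.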

![Venn Diagram](images/cross_join.png)

In [73]:
left

Unnamed: 0_level_0,key_2,value
key_1,Unnamed: 1_level_1,Unnamed: 2_level_1
A,a,10
B,b,20
C,c,30
D,d,40


In [74]:
right

Unnamed: 0_level_0,key_2,value
key_1,Unnamed: 1_level_1,Unnamed: 2_level_1
B,b,20
D,d,40
E,e,50
F,f,60


In [75]:
left.merge(right, how="cross")

Unnamed: 0,key_2_x,value_x,key_2_y,value_y
0,a,10,b,20
1,a,10,d,40
2,a,10,e,50
3,a,10,f,60
4,b,20,b,20
5,b,20,d,40
6,b,20,e,50
7,b,20,f,60
8,c,30,b,20
9,c,30,d,40
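Because a cross join pairs every left row with every right row, the result always has `len(left) * len(right)` rows. The sketch below checks this on two tiny made-up frames; note that `how="cross"` needs pandas 1.2 or newer.

```python
import pandas as pd

# Made-up frames with no columns in common
sizes = pd.DataFrame({"size": ["S", "M", "L"]})
colors = pd.DataFrame({"color": ["red", "blue"]})

# No key is given: every size is paired with every color
combos = sizes.merge(colors, how="cross")

print(combos)
print(len(combos) == len(sizes) * len(colors))
```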


### `append()`

---

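`append()` adds the rows of the second DataFrame to the end of the first DataFrame. Columns in the second DataFrame that are not in the first DataFrame are added as new columns. Note that `DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0, so the cells in this section run only on older versions; in newer versions `pd.concat([left, right])` gives the same result.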

In [76]:
left

Unnamed: 0_level_0,key_2,value
key_1,Unnamed: 1_level_1,Unnamed: 2_level_1
A,a,10
B,b,20
C,c,30
D,d,40


In [77]:
right

Unnamed: 0_level_0,key_2,value
key_1,Unnamed: 1_level_1,Unnamed: 2_level_1
B,b,20
D,d,40
E,e,50
F,f,60


In [78]:
left.append(right, ignore_index=False) # Preserves the index of the DataFrame

Unnamed: 0_level_0,key_2,value
key_1,Unnamed: 1_level_1,Unnamed: 2_level_1
A,a,10
B,b,20
C,c,30
D,d,40
B,b,20
D,d,40
E,e,50
F,f,60


In [79]:
left.append(right, ignore_index=True) # Resets the old index and sets new one

Unnamed: 0,key_2,value
0,a,10
1,b,20
2,c,30
3,d,40
4,b,20
5,d,40
6,e,50
7,f,60


Let us add one more column to the right DataFrame to see whether the `append()` method really adds new columns.

In [80]:
right["new_value"] = right["value"] * 2

right

Unnamed: 0_level_0,key_2,value,new_value
key_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
B,b,20,40
D,d,40,80
E,e,50,100
F,f,60,120


In [81]:
left.append(right, ignore_index=False) # Indeed, "append()" method adds new column

Unnamed: 0_level_0,key_2,value,new_value
key_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
A,a,10,
B,b,20,
C,c,30,
D,d,40,
B,b,20,40.0
D,d,40,80.0
E,e,50,100.0
F,f,60,120.0
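For pandas 2.0 and later, where `DataFrame.append()` no longer exists, the same row-wise stacking can be sketched with `pd.concat()`. The frames below are shortened, made-up versions of `left` and `right`.

```python
import pandas as pd

left = pd.DataFrame(
    {"key_2": ["a", "b"], "value": [10, 20]},
    index=pd.Index(["A", "B"], name="key_1"),
)
right = pd.DataFrame(
    {"key_2": ["b", "e"], "value": [20, 50], "new_value": [40, 100]},
    index=pd.Index(["B", "E"], name="key_1"),
)

# Counterpart of left.append(right, ignore_index=False)
stacked = pd.concat([left, right])

# Counterpart of left.append(right, ignore_index=True)
renumbered = pd.concat([left, right], ignore_index=True)

print(stacked)
print(renumbered)
```

As with `append()`, a column that exists in only one input (`new_value` here) is kept and filled with `NaN` for the rows that lack it.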


### `merge_asof()`

---

Pandas provides special functions for merging time-series DataFrames. Perhaps the most useful and popular one is `merge_asof()`. It is similar to an ordered left join, except that it matches on the nearest key rather than on equal keys. With the default settings (`direction="backward"`), each row in the left DataFrame is matched with the last row in the right DataFrame whose `on` key is less than or equal to the left’s key. Both DataFrames must be sorted by the key.

#### Reference


[pandas.merge_asof](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html#pandas-merge-asof)

In [82]:
trades = pd.DataFrame({'time': pd.to_datetime(['20160525 13:30:00.023',
                                               '20160525 13:30:00.038',
                                               '20160525 13:30:00.048',
                                               '20160525 13:30:00.048',
                                               '20160525 13:30:00.048']),
                       'ticker': ['MSFT', 'MSFT','GOOG', 'GOOG', 'AAPL'],
                       'price': [51.95, 51.95,720.77, 720.92, 98.00],
                       'quantity': [75, 155,100, 100, 100]},
                      columns=['time', 'ticker', 'price', 'quantity'])



trades

Unnamed: 0,time,ticker,price,quantity
0,2016-05-25 13:30:00.023,MSFT,51.95,75
1,2016-05-25 13:30:00.038,MSFT,51.95,155
2,2016-05-25 13:30:00.048,GOOG,720.77,100
3,2016-05-25 13:30:00.048,GOOG,720.92,100
4,2016-05-25 13:30:00.048,AAPL,98.0,100


In [83]:
quotes = pd.DataFrame({'time': pd.to_datetime(['20160525 13:30:00.023',
                                               '20160525 13:30:00.023',
                                               '20160525 13:30:00.030',
                                               '20160525 13:30:00.041',
                                               '20160525 13:30:00.048',
                                               '20160525 13:30:00.049',
                                               '20160525 13:30:00.072',
                                               '20160525 13:30:00.075']),
                       'ticker': ['GOOG', 'MSFT', 'MSFT','MSFT', 'GOOG', 'AAPL', 'GOOG','MSFT'],
                       'bid': [720.50, 51.95, 51.97, 51.99,720.50, 97.99, 720.50, 52.01],
                       'ask': [720.93, 51.96, 51.98, 52.00,720.93, 98.01, 720.88, 52.03]},
                      columns=['time', 'ticker', 'bid', 'ask'])


quotes

Unnamed: 0,time,ticker,bid,ask
0,2016-05-25 13:30:00.023,GOOG,720.5,720.93
1,2016-05-25 13:30:00.023,MSFT,51.95,51.96
2,2016-05-25 13:30:00.030,MSFT,51.97,51.98
3,2016-05-25 13:30:00.041,MSFT,51.99,52.0
4,2016-05-25 13:30:00.048,GOOG,720.5,720.93
5,2016-05-25 13:30:00.049,AAPL,97.99,98.01
6,2016-05-25 13:30:00.072,GOOG,720.5,720.88
7,2016-05-25 13:30:00.075,MSFT,52.01,52.03


In [84]:
pd.merge_asof(trades, quotes, on="time", by="ticker") # Approximate or nearest merge

Unnamed: 0,time,ticker,price,quantity,bid,ask
0,2016-05-25 13:30:00.023,MSFT,51.95,75,51.95,51.96
1,2016-05-25 13:30:00.038,MSFT,51.95,155,51.97,51.98
2,2016-05-25 13:30:00.048,GOOG,720.77,100,720.5,720.93
3,2016-05-25 13:30:00.048,GOOG,720.92,100,720.5,720.93
4,2016-05-25 13:30:00.048,AAPL,98.0,100,,


Looking carefully, you can see why `NaN` appears in the `AAPL` row. The only `AAPL` quote in the right DataFrame `quotes` is at `13:30:00.049`, so there is no `AAPL` quote with a time less than or equal to `13:30:00.048` (the time in the left table). As a result, `NaN`s were introduced in the `bid` and `ask` columns.
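The matching rule is easier to follow on small integer keys. The sketch below uses made-up frames and compares the default backward search with two variations controlled by the `allow_exact_matches` and `direction` parameters of `pd.merge_asof()`.

```python
import pandas as pd

# Made-up frames, both sorted by the key column "t"
left = pd.DataFrame({"t": [1, 5, 10], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"t": [2, 5, 7], "right_val": [20, 50, 70]})

# Default: last right key <= left key
backward = pd.merge_asof(left, right, on="t")

# Strictly earlier keys only: last right key < left key
strict = pd.merge_asof(left, right, on="t", allow_exact_matches=False)

# Look ahead instead: first right key >= left key
forward = pd.merge_asof(left, right, on="t", direction="forward")

print(backward)
print(strict)
print(forward)
```

For `t=1` the backward search finds nothing, because no right key is at or before 1, which is the same situation as the `AAPL` row above.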

### Combining

---

There is another data combination situation that can’t be expressed as either a merge or concatenation operation. Imagine the situation of having two datasets whose indexes overlap in full or part.

As a motivating example, consider NumPy’s `where()` function, which performs the array-oriented equivalent of an `if-else` expression.

In [85]:
series_1 = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
                     index=['f', 'e', 'd', 'c', 'b', 'a'])


series_1

f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64

In [86]:
series_2 = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, np.nan],
                     index=['f', 'e', 'd', 'c', 'b', 'a'])


series_2

f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64

If `series_1` is null then `series_2`, otherwise `series_1`

In [87]:
np.where(pd.isnull(series_1), series_2, series_1)

array([0. , 2.5, 2. , 3.5, 4.5, nan])

Pandas Series object has a `combine_first()` method, which performs the equivalent of the above operation along with Pandas usual data alignment logic.

In [88]:
series_2[:-2].combine_first(series_1[2:])

a    NaN
b    4.5
c    3.0
d    2.0
e    1.0
f    0.0
dtype: float64
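To see what `combine_first()` adds over the NumPy version, the sketch below first reproduces the "if null then other" rule with `Series.where()` on two made-up Series that share an index, and then combines Series with different indexes, where the result covers the union of the labels.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([np.nan, 2.5, np.nan], index=["a", "b", "c"])
s2 = pd.Series([1.0, 9.0, np.nan], index=["a", "b", "c"])

# Take s1 and fall back to s2 wherever s1 is NaN
patched = s1.combine_first(s2)

# The same rule written out explicitly
same_thing = s1.where(s1.notna(), s2)

# Different indexes: values are aligned by label, not by position
s3 = pd.Series([7.0], index=["d"])
aligned = s1.combine_first(s3)

print(patched)
print(aligned)
```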

There is a `combine()` method which takes a function and combines the series according to this function. The function takes two scalars as inputs and returns a single element.

In [89]:
series_2.combine(series_1, max)

f    0.0
e    2.5
d    2.0
c    3.5
b    4.5
a    NaN
dtype: float64

In [90]:
series_2.combine(series_1, min)

f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64

Now it's time to perform the same operations on DataFrames to see how they work when we have a DataFrame instead of a Series.

In [91]:
df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
                    'b': [np.nan, 2., np.nan, 6.],
                    'c': range(2, 18, 4)})


df1

Unnamed: 0,a,b,c
0,1.0,,2
1,,2.0,6
2,5.0,,10
3,,6.0,14


In [92]:
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
                    'b': [np.nan, 3., 4., 6., 8.]})



df2

Unnamed: 0,a,b
0,5.0,
1,4.0,3.0
2,,4.0
3,3.0,6.0
4,7.0,8.0


In [93]:
df1.combine_first(df2) # Updates null elements with value in the same location in other

Unnamed: 0,a,b,c
0,1.0,,2.0
1,4.0,2.0,6.0
2,5.0,4.0,10.0
3,3.0,6.0,14.0
4,7.0,8.0,


The function passed to the DataFrame `combine()` method takes two Series (one column from each DataFrame) and returns a Series or a scalar. In other words, `combine()` performs a column-wise combination with another DataFrame.

In [94]:
df1.combine(df2, np.minimum) # np.minimum performs elementwise min operation

Unnamed: 0,a,b,c
0,1.0,,
1,,2.0,
2,,,
3,,6.0,
4,,,


In [95]:
df1.combine(df2, np.maximum) # np.maximum performs elementwise max operation

Unnamed: 0,a,b,c
0,5.0,,
1,,3.0,
2,,,
3,,6.0,
4,,,


In [96]:
df1.combine(df2, np.add) # np.add performs elementwise summation

Unnamed: 0,a,b,c
0,6.0,,
1,,5.0,
2,,,
3,,12.0,
4,,,


## Joining and Concatenation

---


* `join()` - for combining data on a key column or an index


    * supports inner/left (default)/right/outer (full) joins
    * can join multiple DataFrames at a time
    * supports index-index joins


* `concat()` - for combining DataFrames across rows or columns


    * supports inner/outer (full, default) joins
    * can join multiple DataFrames at a time
    * supports index-index joins



Under the hood, `join()` uses `merge()`, but it offers a more concise way to join DataFrames on their indexes than a fully specified `merge()` call. Moreover, `join()` can combine many DataFrame objects that have the same or similar indexes but non-overlapping columns.
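The relationship between the two can be sketched on a pair of made-up frames with non-overlapping columns: a plain `join()` call gives the same result as an index-on-index left `merge()`.

```python
import pandas as pd

left = pd.DataFrame(
    {"left_value": [10, 20, 30]},
    index=pd.Index(["A", "B", "C"], name="key"),
)
right = pd.DataFrame(
    {"right_value": [200, 400]},
    index=pd.Index(["B", "D"], name="key"),
)

# Left join on the index by default
joined = left.join(right)

# The fully specified merge() call that does the same thing
merged = pd.merge(left, right, how="left", left_index=True, right_index=True)

print(joined)
print(joined.equals(merged))
```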

### Join

---

In [97]:
left

Unnamed: 0_level_0,key_2,value
key_1,Unnamed: 1_level_1,Unnamed: 2_level_1
A,a,10
B,b,20
C,c,30
D,d,40


In [98]:
right

Unnamed: 0_level_0,key_2,value,new_value
key_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
B,b,20,40
D,d,40,80
E,e,50,100
F,f,60,120


As we have overlapping columns in the `left` and `right` DataFrames, we have to use the `lsuffix` and `rsuffix` arguments when calling the `join()` method.

In [99]:
left.join(right, lsuffix="_left", rsuffix="_right") # By default performs LEFT join

Unnamed: 0_level_0,key_2_left,value_left,key_2_right,value_right,new_value
key_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
A,a,10,,,
B,b,20,b,20.0,40.0
C,c,30,,,
D,d,40,d,40.0,80.0


In [100]:
left.join(right, lsuffix="_caller", rsuffix="_other", how="inner") # INNER join index-to-index

Unnamed: 0_level_0,key_2_caller,value_caller,key_2_other,value_other,new_value
key_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
B,b,20,b,20,40
D,d,40,d,40,80


The `join()` method can join several DataFrames at once, whereas the `merge()` method can only join two at a time.

In [101]:
middle = pd.DataFrame({'key_1': ['A', 'B', 'C', 'D'],
                       'middle_value': [1, 2, 3, 4]})


middle = middle.set_index("key_1")


middle

Unnamed: 0_level_0,middle_value
key_1,Unnamed: 1_level_1
A,1
B,2
C,3
D,4


In [102]:
left = left.rename({"key_2":"left_key_2", "value":"left_value"}, axis=1)

right = right.rename({"key_2":"right_key_2", "value":"right_value"}, axis=1)

In [103]:
left

Unnamed: 0_level_0,left_key_2,left_value
key_1,Unnamed: 1_level_1,Unnamed: 2_level_1
A,a,10
B,b,20
C,c,30
D,d,40


In [104]:
middle

Unnamed: 0_level_0,middle_value
key_1,Unnamed: 1_level_1
A,1
B,2
C,3
D,4


In [105]:
right

Unnamed: 0_level_0,right_key_2,right_value,new_value
key_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
B,b,20,40
D,d,40,80
E,e,50,100
F,f,60,120


In [106]:
left.join([middle, right], how="inner")

Unnamed: 0_level_0,left_key_2,left_value,middle_value,right_key_2,right_value,new_value
key_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
B,b,20,2,b,20,40
D,d,40,4,d,40,80


### Concatenation


---

Concatenation is a bit different from the merging techniques we saw above. With merging, we can expect the resulting dataset to have rows from the first DataFrame mixed with the second DataFrame based on some commonality. Depending on the type of merge, we might also lose rows that don’t have matches in the other dataset.

With concatenation, your datasets are just stacked together along an axis — either the row axis or column axis. Visually, a concatenation with no parameters along rows would look like this:

#### Reference

[Merge, join, concatenate and compare](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)

**Row Concatenation**


![Concatenation](images/concat_row.png)

In [107]:
left = (left.reset_index(drop=True)
            .rename({"left_value":"value"}, axis=1))

left

Unnamed: 0,left_key_2,value
0,a,10
1,b,20
2,c,30
3,d,40


In [108]:
middle.insert(0, "middle_key_2", list(middle.index.str.lower()))

middle = (middle.reset_index(drop=True)
                .rename({"middle_value": "value"}, axis=1))

middle

Unnamed: 0,middle_key_2,value
0,a,1
1,b,2
2,c,3
3,d,4


In [109]:
right = (right.drop("new_value", axis=1)
              .reset_index(drop=True)
              .rename({"right_value": "value"}, axis=1))

right

Unnamed: 0,right_key_2,value
0,b,20
1,d,40
2,e,50
3,f,60


In [110]:
pd.concat([left, middle, right], axis=0) # By default performs OUTER join

Unnamed: 0,left_key_2,value,middle_key_2,right_key_2
0,a,10,,
1,b,20,,
2,c,30,,
3,d,40,,
0,,1,a,
1,,2,b,
2,,3,c,
3,,4,d,
0,,20,,b
1,,40,,d


In [111]:
pd.concat([left, middle, right], axis=0, join="inner") # INNER join

Unnamed: 0,value
0,10
1,20
2,30
3,40
0,1
1,2
2,3
3,4
0,20
1,40


In [112]:
pd.concat([left, middle, right], keys=["left_key_2", "middle_key_2", "right_key_2"], axis=0) # Creates MultiIndex

Unnamed: 0,Unnamed: 1,left_key_2,value,middle_key_2,right_key_2
left_key_2,0,a,10,,
left_key_2,1,b,20,,
left_key_2,2,c,30,,
left_key_2,3,d,40,,
middle_key_2,0,,1,a,
middle_key_2,1,,2,b,
middle_key_2,2,,3,c,
middle_key_2,3,,4,d,
right_key_2,0,,20,,b
right_key_2,1,,40,,d
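One practical use of `keys` is that the outer index level remembers which input each row came from, so a piece can be pulled back out with `loc`. The sketch below uses two made-up frames; the key labels and the level names passed through `names` are arbitrary choices.

```python
import pandas as pd

first = pd.DataFrame({"value": [10, 20]})
second = pd.DataFrame({"value": [1, 2]})

# Label each input and name the two resulting index levels
labelled = pd.concat(
    [first, second],
    keys=["first", "second"],
    names=["source", "row"],
)

# The outer level recovers one of the original pieces
second_again = labelled.loc["second"]

print(labelled)
print(second_again)
```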


**Column Concatenation**


![Concatenation](images/concat_column.png)

In [113]:
pd.concat([left, middle, right], axis=1) # Concatenation along the column axis (axis=1) - adding columns

Unnamed: 0,left_key_2,value,middle_key_2,value.1,right_key_2,value.2
0,a,10,a,1,b,20
1,b,20,b,2,d,40
2,c,30,c,3,e,50
3,d,40,d,4,f,60


In [114]:
pd.concat([left, middle, right], keys=["left_key_2", "middle_key_2", "right_key_2"], axis=1) # Column-wise MultiIndex

Unnamed: 0_level_0,left_key_2,left_key_2,middle_key_2,middle_key_2,right_key_2,right_key_2
Unnamed: 0_level_1,left_key_2,value,middle_key_2,value,right_key_2,value
0,a,10,a,1,b,20
1,b,20,b,2,d,40
2,c,30,c,3,e,50
3,d,40,d,4,f,60


## Reshaping and Pivoting

---

Sometimes, we need to reshape our DataFrame, that is, change its format. Reshaping can be done in two directions: we can convert long format data into wide format or vice versa.

#### Reference


[Reshaping and pivot tables](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html)

### Reshaping Rows and Columns with `stack()` and `unstack()`

In [115]:
monthly_data = pd.read_csv("data/monthly_data.csv")


monthly_data = monthly_data.set_index('YYYY') # Set "YYYY" column as index


monthly_data

Unnamed: 0_level_0,JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DEC,YEAR
YYYY,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
2008,10140,10239,10050,10111,10159,10159,10141,10117,10178,10148,10125,10182,10146
2009,10137,10140,10140,10141,10188,10168,10128,10165,10208,10166,10041,10068,10141
2010,10151,10034,10168,10194,10158,10166,10158,10129,10147,10135,10057,10133,10136
2011,10182,10161,10227,10192,10182,10154,10123,10130,10149,10182,10194,10099,10165
2012,10194,10286,10271,10053,10159,10127,10139,10155,10149,10109,10108,10085,10153
2013,10142,10169,10099,10155,10113,10180,10201,10176,10151,10129,10155,10170,10153
2014,10055,10031,10164,10148,10154,10184,10143,10117,10189,10142,10103,10172,10134
2015,10135,10164,10198,10214,10152,10195,10142,10152,10171,10186,10150,10217,10173
2016,10100,10099,10144,10122,10140,10137,10168,10183,10177,10214,10144,10283,10159
2017,10228,10151,10154,10211,10170,10134,10141,10162,10135,10176,10141,10120,10160


The `stack()` method moves the column labels into the innermost level of the row index, so the data from all columns ends up in a single column.

In [117]:
stacked_monthly_data = monthly_data.stack()

pd.DataFrame(stacked_monthly_data)

Unnamed: 0_level_0,Unnamed: 1_level_0,0
YYYY,Unnamed: 1_level_1,Unnamed: 2_level_1
2008,JAN,10140
2008,FEB,10239
2008,MAR,10050
2008,APR,10111
2008,MAY,10159
...,...,...
2017,SEP,10135
2017,OCT,10176
2017,NOV,10141
2017,DEC,10120


`unstack()` takes the inner index level and creates a column for every unique label in that level. It then moves the data into these columns.

In [118]:
stacked_monthly_data.unstack()

Unnamed: 0_level_0,JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DEC,YEAR
YYYY,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
2008,10140,10239,10050,10111,10159,10159,10141,10117,10178,10148,10125,10182,10146
2009,10137,10140,10140,10141,10188,10168,10128,10165,10208,10166,10041,10068,10141
2010,10151,10034,10168,10194,10158,10166,10158,10129,10147,10135,10057,10133,10136
2011,10182,10161,10227,10192,10182,10154,10123,10130,10149,10182,10194,10099,10165
2012,10194,10286,10271,10053,10159,10127,10139,10155,10149,10109,10108,10085,10153
2013,10142,10169,10099,10155,10113,10180,10201,10176,10151,10129,10155,10170,10153
2014,10055,10031,10164,10148,10154,10184,10143,10117,10189,10142,10103,10172,10134
2015,10135,10164,10198,10214,10152,10195,10142,10152,10171,10186,10150,10217,10173
2016,10100,10099,10144,10122,10140,10137,10168,10183,10177,10214,10144,10283,10159
2017,10228,10151,10154,10211,10170,10134,10141,10162,10135,10176,10141,10120,10160


`unstack()` might introduce missing data if not all of the values in the level are found in each of the subgroups. Let us consider the following example.

In [119]:
s1 = pd.Series([0, 1, 2, 3], index=["a", "b", "c", "d"])

s2 = pd.Series([4, 5, 6], index=["c", "d", "e"])

test_data = pd.concat([s1, s2], keys=["one", "two"])

test_data

one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64

In [120]:
test_data.unstack()

Unnamed: 0,a,b,c,d,e
one,0.0,1.0,2.0,3.0,
two,,,4.0,5.0,6.0
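If `NaN` is not wanted in the gaps, `unstack()` accepts a `fill_value` argument. The sketch below, on a shortened made-up version of the example, compares the default behaviour with `fill_value=0`; whether zero is a sensible placeholder depends on the data.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=["a", "b"])
s2 = pd.Series([2, 3], index=["b", "c"])
data = pd.concat([s1, s2], keys=["one", "two"])

# Default: label combinations that do not exist become NaN
with_nan = data.unstack()

# Missing combinations are filled with 0 instead
filled = data.unstack(fill_value=0)

print(with_nan)
print(filled)
```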


What if we `unstack()` the initial DataFrame?

In [121]:
monthly_data.head()

Unnamed: 0_level_0,JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DEC,YEAR
YYYY,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
2008,10140,10239,10050,10111,10159,10159,10141,10117,10178,10148,10125,10182,10146
2009,10137,10140,10140,10141,10188,10168,10128,10165,10208,10166,10041,10068,10141
2010,10151,10034,10168,10194,10158,10166,10158,10129,10147,10135,10057,10133,10136
2011,10182,10161,10227,10192,10182,10154,10123,10130,10149,10182,10194,10099,10165
2012,10194,10286,10271,10053,10159,10127,10139,10155,10149,10109,10108,10085,10153


In [122]:
unstacked_monthly_data = monthly_data.unstack()

unstacked_monthly_data

      YYYY
JAN   2008    10140
      2009    10137
      2010    10151
      2011    10182
      2012    10194
              ...  
YEAR  2013    10153
      2014    10134
      2015    10173
      2016    10159
      2017    10160
Length: 130, dtype: int64

Let us convert the unstacked result from a Pandas Series into a Pandas DataFrame and then reset the index.

In [123]:
pd.DataFrame(unstacked_monthly_data).reset_index() # We converted Wide format data into Long format

Unnamed: 0,level_0,YYYY,0
0,JAN,2008,10140
1,JAN,2009,10137
2,JAN,2010,10151
3,JAN,2011,10182
4,JAN,2012,10194
...,...,...,...
125,YEAR,2013,10153
126,YEAR,2014,10134
127,YEAR,2015,10173
128,YEAR,2016,10159


### Wide to Long format

---


When converting wide format into long format, we merge multiple columns into one, which produces a DataFrame that is longer than the input.

`melt()` is the opposite of `pivot()`: it moves the data from several columns into a single value column, with a second column recording which original column each value came from.

In [124]:
wide_data = pd.DataFrame([["Mary", 6, 4, 5, ],
                          ["John", 7, 8, 7],
                          ["Ann", 6, 7, 9],
                          ["Pete", 6, 5, 5],
                          ["Laura", 5, 2, 7]], 
                         columns = ["name", "test_1", "test_2", "test_3"])


wide_data

Unnamed: 0,name,test_1,test_2,test_3
0,Mary,6,4,5
1,John,7,8,7
2,Ann,6,7,9
3,Pete,6,5,5
4,Laura,5,2,7


In [125]:
pd.melt(wide_data, id_vars=["name"]) # Returns Long format

Unnamed: 0,name,variable,value
0,Mary,test_1,6
1,John,test_1,7
2,Ann,test_1,6
3,Pete,test_1,6
4,Laura,test_1,5
5,Mary,test_2,4
6,John,test_2,8
7,Ann,test_2,7
8,Pete,test_2,5
9,Laura,test_2,2


In [126]:
pd.melt(wide_data, id_vars=["name"], value_vars=["test_1"]) # Use one column as value variable

Unnamed: 0,name,variable,value
0,Mary,test_1,6
1,John,test_1,7
2,Ann,test_1,6
3,Pete,test_1,6
4,Laura,test_1,5


In [127]:
pd.melt(wide_data, id_vars=["name"], value_vars=["test_1", "test_2"]) # Use two columns as value variables

Unnamed: 0,name,variable,value
0,Mary,test_1,6
1,John,test_1,7
2,Ann,test_1,6
3,Pete,test_1,6
4,Laura,test_1,5
5,Mary,test_2,4
6,John,test_2,8
7,Ann,test_2,7
8,Pete,test_2,5
9,Laura,test_2,2


After converting our DataFrame from wide to long format, we see that there are two new columns, `variable` and `value`. We can rename them during the conversion by specifying the `var_name` and `value_name` arguments, respectively.

In [128]:
pd.melt(wide_data, id_vars=["name"], var_name="test", value_name="grades")

Unnamed: 0,name,test,grades
0,Mary,test_1,6
1,John,test_1,7
2,Ann,test_1,6
3,Pete,test_1,6
4,Laura,test_1,5
5,Mary,test_2,4
6,John,test_2,8
7,Ann,test_2,7
8,Pete,test_2,5
9,Laura,test_2,2


### Long to Wide format

---

To convert long format data into wide format, we use the `pivot()` method. `pivot()` moves data from rows into columns.

Let's first create long format data. `pivot()` is the inverse of the Pandas `melt()` operation we saw above.

In [129]:
raw_data = {"patient": [1, 1, 1, 2, 2], 
            "obs": [1, 2, 3, 1, 2], 
            "treatment": [0, 1, 0, 1, 0],
            "score": [6252, 24243, 2345, 2342, 23525]}


long_data = pd.DataFrame(raw_data, columns = ['patient', 'obs', 'treatment', 'score'])


long_data

Unnamed: 0,patient,obs,treatment,score
0,1,1,0,6252
1,1,2,1,24243
2,1,3,0,2345
3,2,1,1,2342
4,2,2,0,23525


In [130]:
wide_data = long_data.pivot(index="patient", columns="obs", values="score")


wide_data

obs,1,2,3
patient,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,6252.0,24243.0,2345.0
2,2342.0,23525.0,
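
To make the inverse relationship between `melt()` and `pivot()` concrete, here is a minimal round-trip sketch. The `scores` frame is a made-up subset of the test data used earlier, and the column names `test` and `grade` are illustrative choices. Note that `pivot()` sorts the index, so the rows come back in alphabetical order of `name`.

```python
import pandas as pd

scores = pd.DataFrame({"name": ["Mary", "John", "Ann"],
                       "test_1": [6, 7, 6],
                       "test_2": [4, 8, 7]})

# Wide -> long: one row per (name, test) pair
long_scores = scores.melt(id_vars="name", var_name="test", value_name="grade")

# Long -> wide: spread the "test" labels back into columns
wide_again = long_scores.pivot(index="name", columns="test", values="grade")

# pivot() leaves "name" in the index and names the column axis "test";
# undo both to get back the original layout
wide_again = wide_again.reset_index().rename_axis(columns=None)

print(wide_again)
```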


## Groupby

---

Sometimes we want to select data based on groups and understand aggregated data at the group level. Fortunately, Pandas has a `groupby()` method that simplifies such tasks. The idea behind `groupby()` is that it takes a DataFrame, splits it into chunks based on some key values, applies a computation to those chunks, and then combines the results back into another DataFrame. In Pandas this is referred to as the `split-apply-combine` pattern.


![Split_Apply_Combine](images/split_apply_combine.png)


---


* **Splitting** the data into groups based on some criteria.


* **Applying** a function to each group independently.


* **Combining** the results into a data structure.



The **Split** step is the most straightforward. We may wish to split the data set into groups based on some key(s) and do something with those groups.


In the **Apply** step we're doing one of the following:


* Aggregation


    * Compute group sum, mean, variance, etc.
    * Compute group size/count


* Transformation


    * Standardize data in a group
    * Filling NAs within groups with a value derived from each group
    
    
* Filtration


    * Filtering out data based on some criteria
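
The three kinds of **Apply** step listed above can be sketched on a toy DataFrame. The table and its column names (`team`, `points`) are invented for illustration, and centering on the group mean stands in for a full standardization.

```python
import pandas as pd

toy = pd.DataFrame({"team": ["x", "x", "y", "y", "y"],
                    "points": [10.0, 20.0, 1.0, 2.0, 3.0]})

grouped = toy.groupby("team")["points"]

# Aggregation: one value per group
group_means = grouped.mean()

# Transformation: the result has the same length as the input;
# here each value is centered on its own group mean
centered = grouped.transform(lambda s: s - s.mean())

# Filtration: keep only the rows of groups that satisfy a condition
big_groups = toy.groupby("team").filter(lambda g: len(g) >= 3)

print(group_means, centered, big_groups, sep="\n\n")
```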

#### Reference



[Group by: split-apply-combine](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html)


[Grouping](https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#grouping)


[Combining with stats and GroupBy](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html#combining-with-stats-and-groupby)

In [None]:
series = pd.Series(data=[0, 5, 10, 5, 10, 15, 10, 15, 20],
                   index=["A", "B", "C", "A", "B", "C", "A", "B", "C"])


series

In [None]:
series.groupby(by=series.index) # Returns a SeriesGroupBy object. Does not compute anything yet

In [None]:
series.groupby(by=series.index).sum() # Group by index and then sum them up

We can calculate several aggregation functions, such as count, mean, sum, etc.

In [None]:
series.groupby(by=series.index).agg([np.sum, np.mean, np.min, np.max]) # NumPy callables work; newer pandas versions may warn and suggest string names instead

In [None]:
series.groupby(by=series.index).agg(["sum", "mean", "count"])

#### Let's see how `groupby()` works with DataFrames

In [None]:
athletes = pd.read_csv("data/athletes.csv")


athletes.head()

Like a Series groupby, a DataFrame groupby returns a lazy object, here a `DataFrameGroupBy`. It is not a DataFrame itself, but it supports many DataFrame-like operations, such as column selection, filtering, and aggregation by columns.

In [None]:
athletes.groupby(by=["nationality"])
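
A small sketch of what such a grouped object offers. The `toy_athletes` frame is a hypothetical stand-in for the athletes data, with made-up values.

```python
import pandas as pd

# Hypothetical stand-in for the athletes table
toy_athletes = pd.DataFrame({"nationality": ["USA", "USA", "GER", "GER"],
                             "height": [1.80, 1.90, 1.70, 1.76],
                             "weight": [80, 90, 65, 71]})

by_country = toy_athletes.groupby("nationality")

print(type(by_country).__name__)   # name of the grouped object's class
print(by_country.ngroups)          # number of distinct groups

# Iterating yields (key, chunk) pairs; each chunk is an ordinary DataFrame
for country, chunk in by_country:
    print(country, chunk.shape)

# A single group can be pulled out as a DataFrame
usa = by_country.get_group("USA")
```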

Calling an aggregation function on the `GroupBy` object applies the calculation for every group and constructs a DataFrame with the results.

In [None]:
athletes.groupby(by=["nationality"])[["height", "weight"]].mean() # Mean height and weight by nationality

In [None]:
athletes.groupby(by=["sex", "nationality"])[["height", "weight"]].mean() # Mean height and weight by sex and nationality

Let's count the number of medals by country. To do so, we group by country and then sum the medal columns.

In [None]:
medal_counts = athletes.groupby(by=["nationality"])[["gold", "silver", "bronze"]].sum()

medal_counts

Not very informative, right? Let's sort the resulting DataFrame by values and see which countries won the most medals of each type.

In [None]:
medal_counts.sort_values(by=["gold", "silver", "bronze"], ascending=[False, False, False]).head()

In [None]:
medal_counts.nlargest(n=5, columns=["gold", "silver", "bronze"]) # Same as above
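
The equivalence of the two calls can be checked on a toy medal table. The country codes and counts below are invented; the check assumes no two rows are tied on all three columns.

```python
import pandas as pd

toy_medals = pd.DataFrame({"gold": [3, 5, 5, 1],
                           "silver": [2, 1, 4, 0],
                           "bronze": [0, 2, 1, 6]},
                          index=["AAA", "BBB", "CCC", "DDD"])

cols = ["gold", "silver", "bronze"]

# Sort by all three columns in descending order and keep the first rows
top_sorted = toy_medals.sort_values(by=cols, ascending=False).head(2)

# nlargest() uses the later columns only to break ties in the earlier ones
top_nlargest = toy_medals.nlargest(n=2, columns=cols)

print(top_sorted)
print(top_nlargest)
```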

Medal counts by sex and country. Do female athletes win more medals than male athletes?

In [None]:
medal_counts_by_sex = athletes.groupby(by=["nationality", "sex"])[["gold", "silver", "bronze"]].sum()


medal_counts_by_sex.nlargest(5, ["gold", "silver", "bronze"])

In [None]:
athletes[athletes["nationality"]=="RUS"][["sex", "gold", "silver", "bronze"]].groupby("sex").sum()

> <font color='red'>Do you notice anything odd in the above `groupby()`? What is it? Why did it happen?</font>

Let's see the average height and weight by sex and sport. We could even group them by country as well.

In [None]:
athletes.groupby(["sport", "sex"])[["weight", "height"]].mean()

`groupby()` is a powerful and commonly used tool for data cleaning and data analysis. Once you have grouped the data by some category you have a DataFrame of just those values and you can conduct aggregated analysis on the segments that you are interested in. The `groupby()` method follows a `split-apply-combine` approach - first the data is split into subgroups, then you can apply some transformation, filtering, or aggregation, and then the results are combined automatically by Pandas for us.

## Pivot Table

---


A pivot table is a way of summarizing data in a DataFrame for a particular purpose. It makes heavy use of
aggregation functions. A pivot table is itself a DataFrame, where the rows represent one variable that
you're interested in, the columns another, and the cells some aggregate value. A pivot table also tends to
include marginal values, which are the sums for each column and row. This lets you
see the relationship between two variables at a glance.


Behind the Pandas `pivot_table()` method is the `groupby()` facility combined with reshape operations that use hierarchical indexing.
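
A minimal sketch of that relationship: the same table is built once with `pivot_table()` and once with an explicit `groupby()` followed by `unstack()`. The toy frame is invented for illustration and has no missing combinations, so no `NaN` cells appear.

```python
import pandas as pd

toy = pd.DataFrame({"A": ["foo", "foo", "bar", "bar", "bar"],
                    "C": ["small", "large", "small", "small", "large"],
                    "D": [1, 2, 3, 4, 5]})

via_pivot = toy.pivot_table(index="A", columns="C", values="D", aggfunc="sum")

# Same result spelled out: group by both keys, aggregate,
# then move the inner index level ("C") into the columns
via_groupby = toy.groupby(["A", "C"])["D"].sum().unstack("C")

print(via_pivot)
print(via_groupby)
```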


> Pandas `pivot()` and `pivot_table()` are not the same. They are similar, and in some cases they complement each other.



`pivot_table()` is a generalization of `pivot()` that can handle duplicate values for one pivoted index/column pair, whereas `pivot()` can’t deal with duplicate values.
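
The difference shows up as soon as an index/column pair occurs twice. The sketch below uses a made-up frame in which patient 1 has two scores for the same observation.

```python
import pandas as pd

dup = pd.DataFrame({"patient": [1, 1, 2],
                    "obs": [1, 1, 1],      # patient 1 has two rows for obs 1
                    "score": [10, 30, 50]})

try:
    dup.pivot(index="patient", columns="obs", values="score")
    pivot_failed = False
except ValueError as err:
    # pivot() cannot decide which of the two scores belongs in the cell
    pivot_failed = True
    print("pivot() raised:", err)

# pivot_table() resolves the clash by aggregating (mean by default)
resolved = dup.pivot_table(index="patient", columns="obs", values="score")
print(resolved)
```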



**Pandas `pivot_table()` offers functionality similar to an Excel pivot table**

#### Reference


[Pivot tables](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html#pivot-tables)


[Pandas Pivot Table Explained](https://pbpython.com/pandas-pivot-table-explained.html)

In [None]:
pivot_df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
                               "bar", "bar", "bar", "bar"],
                         "B": ["one", "one", "one", "two", "two",
                               "one", "one", "two", "two"],
                         "C": ["small", "large", "large", "small",
                               "small", "large", "small", "small", "large"],
                         "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
                         "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})



pivot_df

The simplest Pivot Table

In [None]:
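pivot_df.pivot_table(index=["A"], values=["D", "E"]) # Mean (the default aggfunc) of the numeric columns; listing values avoids errors that newer pandas versions may raise for non-numeric columns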

We can pivot our DataFrame by two or more columns

In [None]:
pivot_df.pivot_table(index=["A", "B"], values=["D", "E"]) # Group by two columns; values restricts the mean to numeric columns

Pivot Table with column values


<div class="alert alert-info">

One of the confusing points with the `pivot_table()` is the use of `columns` and `values`. Remember, `columns` are optional - they provide an additional way to segment the actual `values` you care about. The aggregation functions are applied to the `values` you list.

</div>

In [None]:
pivot_df.pivot_table(index=["A", "B"],
                     columns=["E"],
                     values=["D"]) # D is the only numeric column left to aggregate

In [None]:
pivot_df.pivot_table(index=["A", "B"],
                     columns=["C"],
                     aggfunc=["mean", "sum"])

**Fully-fledged Pivot Table**

In [None]:
pivot_df.pivot_table(index=["A", "B"],
                     columns=["C"],
                     values="D",
                     aggfunc="sum",
                     margins=True,
                     margins_name="Total",
                     fill_value=0)

In [None]:
pivot_df.pivot_table(index=["A", "B"],
                     columns=["C"],
                     values=["D", "E"],
                     aggfunc="sum",
                     margins=True,
                     margins_name="Total",
                     fill_value=0)

`Pivot Tables` are incredibly useful when dealing with numeric data, especially if you're trying to summarize the data in some form. You'll regularly be creating new pivot tables on slices of data, whether you're exploring the data yourself or preparing data for others to report on. And of course, you can pass any function you want as the `aggfunc` argument, including those that you define yourself.
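
As a sketch of a user-defined aggregation, the hypothetical helper `value_range` below computes the spread of each group and is passed to `pivot_table()` like any built-in function. The toy data is invented.

```python
import pandas as pd

toy = pd.DataFrame({"A": ["foo", "foo", "foo", "bar", "bar"],
                    "D": [1, 2, 7, 4, 9]})


def value_range(values):
    """Spread between the largest and smallest value in a group (hypothetical helper)."""
    return values.max() - values.min()


# The custom function is called once per group of D values
ranges = toy.pivot_table(index="A", values="D", aggfunc=value_range)

print(ranges)
```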

## Cross-Tabulation

---

A cross-tabulation (or crosstab for short) is a special case of a pivot table that computes group frequencies, unless an array of values and an aggregation function are passed.
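
Both modes can be seen on a tiny example. The `cars` frame below is invented for illustration and only borrows column names from the automobile data used later in this section.

```python
import pandas as pd

cars = pd.DataFrame({"make": ["honda", "honda", "toyota", "toyota", "toyota"],
                     "body_style": ["sedan", "hatchback", "sedan", "sedan", "wagon"],
                     "price": [7000.0, 6500.0, 9000.0, 11000.0, 8000.0]})

# Default behaviour: count how often each (make, body_style) pair occurs
counts = pd.crosstab(index=cars["make"], columns=cars["body_style"])

# With values and aggfunc, each cell holds an aggregate instead of a count;
# combinations that never occur become NaN
mean_price = pd.crosstab(index=cars["make"], columns=cars["body_style"],
                         values=cars["price"], aggfunc="mean")

print(counts)
print(mean_price)
```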

#### Reference


[Cross tabulations](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html#cross-tabulations)


[Pandas Crosstab Explained](https://pbpython.com/pandas-crosstab.html)

Define column names for the data, since the data does not have any.

In [None]:
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

In [None]:
cross_df = pd.read_csv("data/automobile.data",
                       header=None,
                       names=headers,
                       na_values="?") # Convert "?" into NaN

In [None]:
cross_df.head()

The DataFrame contains many rows and is not convenient to work with. Let's keep only a few major car makers:

In [None]:
models = ["toyota", "nissan", "mazda", "honda",
          "mitsubishi", "subaru", "volkswagen", "volvo"]

In [None]:
cross_df = cross_df[cross_df["make"].isin(models)]

cross_df.head()

The simplest `cross-tab`. Let's count how many cars of each `body_style` these car makers produced.

In [None]:
pd.crosstab(index=cross_df["make"],
            columns=cross_df["body_style"])

In [None]:
cross_df.groupby(["make", "body_style"])["body_style"].count().unstack().fillna(0) # Same as above, but with groupby

In [None]:
cross_df.pivot_table(index="make", columns="body_style", aggfunc={"body_style": len}, fill_value=0) # Same with pivot_table

In [None]:
pd.crosstab(index=cross_df["make"],
            columns=cross_df["num_doors"],
            margins=True,
            margins_name="Total") # Include totals across rows and columns

Cross-tab is not only used to count frequencies. Let's calculate the average price per car maker and break it down by body style.

In [None]:
pd.crosstab(index=cross_df["make"],
            columns=cross_df["body_style"],
            values=cross_df["price"],
            aggfunc="mean").round(0).fillna("")

Pandas `crosstab()` can also take multiple columns and group by all of them. For example, if we want to see how the data is distributed by front wheel drive (fwd) and rear wheel drive (rwd), we can add the `drive_wheels` column to the list passed as the second argument of `crosstab()`.

In [None]:
pd.crosstab(cross_df["make"],
            [cross_df["body_style"],
             cross_df["drive_wheels"]])

In [None]:
pd.crosstab([cross_df["make"], cross_df["num_doors"]],
            [cross_df["body_style"],
             cross_df["drive_wheels"]],
            rownames=["Auto Manufacturer", "Doors"],
            colnames=['Body Style', "Drive Type"],
            dropna=False)

## Summary

---

Now you know how to merge and concatenate datasets. You will find these functions very useful for
combining data to produce richer results and to support further analysis. A solid understanding of
how to merge data is essential when you are procuring, cleaning, and manipulating data. It's
worth knowing how to join different datasets quickly and which options are available when joining
them, and I would encourage you to check out the pandas docs for joining and concatenating data.