# Join, Combine, and Reshape a DataFrame

---

Data often lives in several files and in different formats. An analyst has to be able to join those files appropriately so that operations run on the whole dataset rather than on one part of it. In this lecture, we cover one of the most important and slightly more advanced features of Pandas: how to join and combine several DataFrames, along with the somewhat familiar pivoting and cross-tabulation operations.


### Lecture outline

---

* Hierarchical Indexing (MultiIndex)


* Combining and Merging


* Joining and Concatenation


* Reshaping and Pivoting


    * Wide to Long format
    
    * Long to Wide format


* Groupby


* Pivot Table


* Cross Tabulation

In [None]:
import pandas as pd

import numpy as np

## Hierarchical Indexing (MultiIndex)

---

Before we delve into Pandas merging and reshaping operations, it's essential to know what a hierarchical index is and how to work with it.

Hierarchical indexing is an important feature of pandas that enables you to have multiple (two or more) index levels on an axis. Somewhat abstractly, it provides a way for you to work with higher dimensional data in a lower dimensional form, like Series (1d) and DataFrame (2d).


> Note that operations on a hierarchically indexed DataFrame differ because there are several index levels. Hence, we have to specify which level to use.
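
To make the "higher dimensional data in a lower dimensional form" idea concrete, here is a minimal sketch. The city and year labels and the numbers are made up for illustration and are not part of the lecture data.

```python
import pandas as pd

# Hypothetical measurements for two cities over two years
s = pd.Series(
    [10, 12, 20, 24],
    index=pd.MultiIndex.from_product(
        [["city_a", "city_b"], [2020, 2021]], names=["city", "year"]
    ),
)

# The Series is one-dimensional, yet each value is addressed by two keys
value = s.loc[("city_b", 2021)]

# Moving the inner level to the columns recovers the two-dimensional view
table = s.unstack(level="year")
```

The same two-key data can therefore be stored either as a 1d Series with a MultiIndex or as a 2d table, whichever is more convenient.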

#### Reference

[MultiIndex / advanced indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)


[Multiindexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#multiindexing)

### Intro

In [None]:
multi_df = pd.DataFrame(data=np.random.randint(100, size=9),
                        index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                               [1, 2, 3, 1, 3, 1, 2, 1, 3]],
                        columns=["values"])


multi_df

In [None]:
multi_df.index # Return index object

multi_df.index.levels # Return index levels

multi_df.index.names # Return names in index levels. Currently no names

In [None]:
multi_df.index.names = ["index_1", "index_2"]

multi_df.index.names

In [None]:
multi_df.columns.names = ["column_index"]

multi_df.columns.names

### Slicing

In [None]:
multi_df

In [None]:
multi_df.xs(key="a", axis=0, level=0) # Get values at specified index

multi_df.xs(key=2, axis=0, level=1) # Get values at specified index

multi_df.xs(key=("a", 3)) # Get values at several indexes

multi_df.xs(key=("a", 3), axis=0, level=[0, 1]) # Get values at several indexes and levels

multi_df.xs(key="values", axis=1) # Get values at vertical axis

Instead of the `xs()` method, we can use the familiar `loc` indexer to slice along either axis.

In [None]:
All = slice(None) # Built-in slice object; slice(None) selects everything along a level

In [None]:
multi_df.loc["a"] # Slice at the first level

multi_df.loc[["a", "c"]] # Selective slice at the first level

multi_df.loc["a"].loc[:2] # Slice at the second level


multi_df.loc[("a", All), All] # Return all values for "a" index at the first level

multi_df.loc[(All, 1), All] # Return all 1's from the second level

multi_df.loc[(All, 1), ("values")] # Same as above one. Selects all first level index and "1" from the second level

multi_df.loc[(slice("a", "c"), 2), All] # Selective slicing at both index level
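
As an alternative to building `slice(None)` by hand, pandas offers `pd.IndexSlice`, which allows the usual `:` syntax inside `loc`. The sketch below uses a small made-up DataFrame named `demo` (an assumption, so that the example is self-contained) and requires the index to be sorted.

```python
import numpy as np
import pandas as pd

# Hypothetical, sorted two-level index: ("a", 1), ("a", 2), ("b", 1), ...
demo = pd.DataFrame(
    {"values": np.arange(6)},
    index=pd.MultiIndex.from_product([["a", "b", "c"], [1, 2]]),
)

idx = pd.IndexSlice

# All first-level labels, only label 2 on the second level
second_level_2 = demo.loc[idx[:, 2], :]

# First-level labels "a" through "b" (inclusive), all second-level labels
a_to_b = demo.loc[idx["a":"b", :], :]
```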

### Reordering and Sorting Levels

---

Sometimes we need to swap the index levels and/or sort a MultiIndex DataFrame by one or both levels. The methods below do exactly that.

In [None]:
multi_df

In [None]:
multi_df.swaplevel("index_2", "index_1") # Swap or change the index levels

We can sort a MultiIndex DataFrame either by index or by values.

In [None]:
multi_df.sort_index(level=0) # Sort by index level 0

multi_df.sort_index(level=1) # Sort by index level 1

In [None]:
multi_df

In [None]:
multi_df.sort_values(by=("values")) # Sort by column

### Summary Statistics by Level

In [None]:
multi_df

In [None]:
multi_df.sum() # Sum up all the values

multi_df.groupby(level=0).sum() # Sum up numbers at the level 0

multi_df.groupby(level=1).sum() # Sum up numbers at the level 1

Other statistical and arithmetic functions work the same way: group by the index level we want with `groupby(level=...)` and then apply the function. The older `sum(level=...)` form was deprecated in pandas 1.3 and removed in pandas 2.0.
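
A minimal sketch of level-wise statistics on a small made-up DataFrame. The names `demo`, `outer`, and `inner` are assumptions for illustration. Levels can be referenced by name as well as by position.

```python
import pandas as pd

demo = pd.DataFrame(
    {"values": [1, 2, 3, 4]},
    index=pd.MultiIndex.from_tuples(
        [("a", 1), ("a", 2), ("b", 1), ("b", 2)], names=["outer", "inner"]
    ),
)

# Group by a named index level, then apply any aggregation
sum_by_outer = demo.groupby(level="outer").sum()
mean_by_inner = demo.groupby(level="inner").mean()
```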

### Set and Reset MultiIndex

---

We can set and reset the MultiIndex of a DataFrame with the `set_index()` and `reset_index()` methods.

In [None]:
multi_df.reset_index(level=0) # Reset level 0 index


multi_df.reset_index(level=1) # Reset level 1 index


multi_df.reset_index() # Reset all the index

In [None]:
multi_df = multi_df.reset_index() # Move all index levels back into columns and keep the result


multi_df

In [None]:
multi_df

In [None]:
multi_df.set_index(keys=["index_1", "index_2"]) # Set columns as index

By default, the columns used as the index are removed from the DataFrame. However, we can keep them by passing `drop=False`.

In [None]:
multi_df.set_index(keys=["index_1", "index_2"], drop=False)

## Combining and Merging

---

In this part we will see how we can bring multiple DataFrame objects together, either by merging them horizontally, or by concatenating them vertically, along with combining and joining DataFrames.


* `merge()` - for combining data on common columns or indices


    * supports inner/left/right/outer (full) joins
    * can only join two DataFrames at a time
    * supports column-column, index-column, index-index joins


That's not all. We will also see how to append the rows of one DataFrame to another.



> Bonus: **CROSS JOIN** or **CARTESIAN PRODUCT**



> Big Bonus: `merge_asof()` to merge on nearest keys rather than equal keys.

#### Reference


[Merge, join, concatenate and compare](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)


[Merge](https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#merge)


[Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101)


[Database-style DataFrame or named Series joining/merging](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#merging-join)

### Merging


---

Database-Style joining.



![Venn Diagram](images/merge.png)

In [None]:
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                     'value': [10, 20, 30, 40]})


left

In [None]:
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'],
                      'value': [20, 40, 50, 60]})


right

In [None]:
pd.merge(left=left, right=right, how="inner", on="key") # Inner join

In [None]:
pd.merge(left=left, right=right, how="left", on="key") # Left join

In [None]:
pd.merge(left=left, right=right, how="right", on="key") # Right join

In [None]:
pd.merge(left=left, right=right, how="outer", on="key") # Outer join
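
To see which side each row of a join comes from, `merge()` accepts `indicator=True`, which adds a `_merge` column. The sketch below rebuilds the two small DataFrames under the hypothetical names `left_demo` and `right_demo` so that it runs on its own.

```python
import pandas as pd

left_demo = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": [10, 20, 30, 40]})
right_demo = pd.DataFrame({"key": ["B", "D", "E", "F"], "value": [20, 40, 50, 60]})

# indicator=True adds a "_merge" column: "left_only", "right_only" or "both"
outer = pd.merge(left_demo, right_demo, how="outer", on="key", indicator=True)

# Count how many keys come from each side
origin_counts = outer["_merge"].value_counts().to_dict()
```

Because both inputs have a non-key column called `value`, the result carries them as `value_x` and `value_y`; the `suffixes` argument changes these defaults.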

If the columns we are merging on have different names, we can use the `left_on` and `right_on` arguments of the `merge()` function. To see them in action, let's modify our DataFrames.

In [None]:
left = left.rename({"key": "first_left_key"}, axis=1)

left

In [None]:
right = right.rename({"key": "first_right_key"}, axis=1)

right

In [None]:
pd.merge(left=left, right=right, how="inner", left_on="first_left_key", right_on="first_right_key")

What if we want to use two or more columns for merging? That's not a problem. First, we need to add new columns to our DataFrames so that we can merge on multiple columns.

In [None]:
left = left.rename({"first_left_key": "key_1"}, axis=1)

left.insert(1, "key_2", left["key_1"].str.lower())

left

In [None]:
right = right.rename({"first_right_key": "key_1"}, axis=1)

right.insert(1, "key_2", right["key_1"].str.lower())

right

In [None]:
pd.merge(left=left, right=right, how="inner", on=["key_1", "key_2"]) # Inner join with multiple key


left.merge(right=right, how="inner", on=["key_1", "key_2"]) # Same as above

We can also merge DataFrames on their index. To do so, we first need to set an index on each DataFrame.

In [None]:
left = left.set_index("key_1")

left

In [None]:
right = right.set_index("key_1")

right

In [None]:
pd.merge(left=left, right=right, how="inner", left_index=True, right_index=True) # Inner join based on index

### Cross Join

---

A cross join is the Cartesian product of the two DataFrames: every row of the left DataFrame is paired with every row of the right one.

![Venn Diagram](images/cross_join.png)

In [None]:
left

In [None]:
right

In [None]:
left.merge(right, how="cross")
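
A quick way to check the Cartesian-product behaviour is to count rows: the result has `len(left) * len(right)` of them. The sketch below uses two made-up DataFrames (`sizes` and `colors` are assumptions for illustration) and requires pandas 1.2 or later, where `how="cross"` was introduced.

```python
import pandas as pd

sizes = pd.DataFrame({"size": ["S", "M", "L"]})
colors = pd.DataFrame({"color": ["red", "blue"]})

# Every size is paired with every color: 3 * 2 = 6 rows
combos = sizes.merge(colors, how="cross")
```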

### `append()`

---

Appending adds the rows of the second DataFrame to the end of the first DataFrame. Columns in the second DataFrame that are not in the first DataFrame are added as new columns. The `DataFrame.append()` method was deprecated in pandas 1.4 and removed in pandas 2.0, so we use `pd.concat()`, which gives the same result.

In [None]:
pd.concat([left, right], ignore_index=False) # Preserves the index of the DataFrames

In [None]:
pd.concat([left, right], ignore_index=True) # Discards the old index and sets a new one

Let's add one more column to the right DataFrame to see whether appending really adds new columns.

In [None]:
right["new_value"] = right["value"] * 2

right

In [None]:
pd.concat([left, right], ignore_index=False) # Indeed, the new column is added

### `merge_asof()`

---

Pandas provides special functions for merging time-series DataFrames. Perhaps the most useful and popular one is `merge_asof()`. It is similar to an ordered left join, except that it matches on the nearest key rather than on equal keys. By default, for each row in the left DataFrame, it selects the last row in the right DataFrame whose `on` key is less than or equal to the left's key. Both DataFrames must be sorted by the key.

#### Reference


[pandas.merge_asof](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html#pandas-merge-asof)

In [None]:
trades = pd.DataFrame({'time': pd.to_datetime(['20160525 13:30:00.023',
                                               '20160525 13:30:00.038',
                                               '20160525 13:30:00.048',
                                               '20160525 13:30:00.048',
                                               '20160525 13:30:00.048']),
                       'ticker': ['MSFT', 'MSFT','GOOG', 'GOOG', 'AAPL'],
                       'price': [51.95, 51.95,720.77, 720.92, 98.00],
                       'quantity': [75, 155,100, 100, 100]},
                      columns=['time', 'ticker', 'price', 'quantity'])



trades

In [None]:
quotes = pd.DataFrame({'time': pd.to_datetime(['20160525 13:30:00.023',
                                               '20160525 13:30:00.023',
                                               '20160525 13:30:00.030',
                                               '20160525 13:30:00.041',
                                               '20160525 13:30:00.048',
                                               '20160525 13:30:00.049',
                                               '20160525 13:30:00.072',
                                               '20160525 13:30:00.075']),
                       'ticker': ['GOOG', 'MSFT', 'MSFT','MSFT', 'GOOG', 'AAPL', 'GOOG','MSFT'],
                       'bid': [720.50, 51.95, 51.97, 51.99,720.50, 97.99, 720.50, 52.01],
                       'ask': [720.93, 51.96, 51.98, 52.00,720.93, 98.01, 720.88, 52.03]},
                      columns=['time', 'ticker', 'bid', 'ask'])


quotes

In [None]:
pd.merge_asof(trades, quotes, on="time", by="ticker") # Approximate or nearest merge

If you look carefully, you can see why `NaN` appears in the `AAPL` row. The right DataFrame `quotes` has no `AAPL` row with a time at or before `13:30:00.048` (the time in the left table), so `NaN`s are introduced in the `bid` and `ask` columns.
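
The matching rule is easier to follow on integer keys. The sketch below uses two made-up DataFrames (`left_demo` and `right_demo` are assumptions for illustration) and compares the default backward search with `direction="forward"` and with `allow_exact_matches=False`.

```python
import pandas as pd

left_demo = pd.DataFrame({"t": [1, 5, 10], "left_val": ["a", "b", "c"]})
right_demo = pd.DataFrame({"t": [1, 2, 3, 6, 7], "right_val": [1, 2, 3, 6, 7]})

# Default: last right key that is <= the left key
backward = pd.merge_asof(left_demo, right_demo, on="t")

# Forward: first right key that is >= the left key
forward = pd.merge_asof(left_demo, right_demo, on="t", direction="forward")

# Backward, but equal keys no longer count as a match
strict = pd.merge_asof(left_demo, right_demo, on="t", allow_exact_matches=False)
```

For the left key `10` there is no later key on the right, so the forward search yields `NaN`; for the left key `1` the strict backward search finds nothing earlier and also yields `NaN`.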

### Combining

---

There is another data combination situation that can’t be expressed as either a merge or concatenation operation. Imagine the situation of having two datasets whose indexes overlap in full or part.

As a motivating example, consider NumPy’s `where()` function, which performs the array-oriented equivalent of an `if-else` expression.

In [None]:
series_1 = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
                     index=['f', 'e', 'd', 'c', 'b', 'a'])


series_1

In [None]:
series_2 = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, np.nan],
                     index=['f', 'e', 'd', 'c', 'b', 'a'])


series_2

If `series_1` is null then `series_2`, otherwise `series_1`

In [None]:
np.where(pd.isnull(series_1), series_2, series_1)

The Pandas Series object has a `combine_first()` method, which performs the equivalent of the above operation along with the usual Pandas data alignment logic.

In [None]:
series_2[:-2].combine_first(series_1[2:])

There is a `combine()` method which takes a function and combines the series according to this function. The function takes two scalars as inputs and returns a single element.

In [None]:
series_2.combine(series_1, max)

In [None]:
series_2.combine(series_1, min)

Now it's time to perform the same operations on DataFrames to see how they work there.

In [None]:
df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
                    'b': [np.nan, 2., np.nan, 6.],
                    'c': range(2, 18, 4)})


df1

In [None]:
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
                    'b': [np.nan, 3., 4., 6., 8.]})



df2

In [None]:
df1.combine_first(df2) # Updates null elements with value in the same location in other

The DataFrame `combine()` method performs a column-wise combine with another DataFrame. The function we pass takes two Series (one column from each DataFrame) and returns a Series or a scalar.

In [None]:
df1.combine(df2, np.minimum) # np.minimum performs elementwise min operation

In [None]:
df1.combine(df2, np.maximum) # np.maximum performs elementwise max operation

In [None]:
df1.combine(df2, np.add) # np.add performs elementwise summation

## Joining and Concatenation

---


* `join()` - for combining data on a key column or an index


    * supports inner/left (default)/right/full
    * can join multiple DataFrames at a time
    * supports index-index joins


* `concat()` - for combining DataFrames across rows or columns


    * supports inner/full (default)
    * can join multiple DataFrames at a time
    * supports index-index joins



Under the hood, `join()` uses `merge()`, but it offers a more concise way to join DataFrames on their index than a fully specified `merge()` call. Moreover, `join()` can be used to combine many DataFrame objects that have the same or similar indexes but non-overlapping columns.
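
The relationship between the two methods can be sketched as follows. The DataFrames `a` and `b` are made up for illustration; the point is that an index-to-index `join()` gives the same result as a `merge()` with `left_index=True` and `right_index=True`.

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2, 3]}, index=["k1", "k2", "k3"])
b = pd.DataFrame({"y": [10, 30]}, index=["k1", "k3"])

# join() performs a left join on the index by default
joined = a.join(b)

# The fully specified merge() equivalent
merged = pd.merge(a, b, how="left", left_index=True, right_index=True)
```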

### Join

---

In [None]:
left

In [None]:
right

Because `left` and `right` have overlapping column names, we have to pass the `lsuffix` and `rsuffix` arguments when calling `join()`.

In [None]:
left.join(right, lsuffix="_left", rsuffix="_right") # By default performs LEFT join

In [None]:
left.join(right, lsuffix="_caller", rsuffix="_other", how="inner") # INNER join index-to-index

Unlike `merge()`, which can only join two DataFrames at a time, `join()` can join several DataFrames in one call.

In [None]:
middle = pd.DataFrame({'key_1': ['A', 'B', 'C', 'D'],
                       'middle_value': [1, 2, 3, 4]})


middle = middle.set_index("key_1")


middle

In [None]:
left = left.rename({"key_2":"left_key_2", "value":"left_value"}, axis=1)

right = right.rename({"key_2":"right_key_2", "value":"right_value"}, axis=1)

In [None]:
left.join([middle, right], how="inner")

### Concatenation


---

Concatenation is a bit different from the merging techniques we saw above. With merging, we can expect the resulting dataset to have rows from the first DataFrame mixed with the second DataFrame based on some commonality. Depending on the type of merge, we might also lose rows that don’t have matches in the other dataset.

With concatenation, your datasets are just stacked together along an axis — either the row axis or column axis. Visually, a concatenation with no parameters along rows would look like this:

#### Reference

[Merge, join, concatenate and compare](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)

**Row Concatenation**


![Concatenation](images/concat_row.png)

In [None]:
left = (left.reset_index(drop=True)
            .rename({"left_value":"value"}, axis=1))

left

In [None]:
middle.insert(0, "middle_key_2", list(middle.index.str.lower()))

middle = (middle.reset_index(drop=True)
                .rename({"middle_value": "value"}, axis=1))

middle

In [None]:
right = (right.drop("new_value", axis=1)
              .reset_index(drop=True)
              .rename({"right_value": "value"}, axis=1))

right

In [None]:
pd.concat([left, middle, right], axis=0) # By default performs OUTER join

In [None]:
pd.concat([left, middle, right], axis=0, join="inner") # INNER join

In [None]:
pd.concat([left, middle, right], keys=["left_key_2", "middle_key_2", "right_key_2"], axis=0) # Creates MultiIndex
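
The `keys` argument is useful for tracing rows back to their source. A minimal sketch with two made-up DataFrames (`q1` and `q2` are assumptions for illustration):

```python
import pandas as pd

q1 = pd.DataFrame({"sales": [1, 2]})
q2 = pd.DataFrame({"sales": [3, 4]})

# keys adds an outer index level that records the source of every row
stacked = pd.concat([q1, q2], keys=["q1", "q2"])
q2_rows = stacked.loc["q2"]

# ignore_index=True discards the original labels and renumbers the rows
flat = pd.concat([q1, q2], ignore_index=True)
```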

**Column Concatenation**


![Concatenation](images/concat_column.png)

In [None]:
pd.concat([left, middle, right], axis=1) # Concatenation along the column axis: places the DataFrames side by side

In [None]:
pd.concat([left, middle, right], keys=["left_key_2", "middle_key_2", "right_key_2"], axis=1) # Column-wise MultiIndex

## Reshaping and Pivoting

---

Sometimes we need to reshape a DataFrame, that is, change its format. Reshaping goes in two directions: we can convert long format data into wide format, or vice versa.

#### Reference


[Reshaping and pivot tables](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html)

### Reshaping Rows and Columns with `stack()` and `unstack()`

In [None]:
monthly_data = pd.read_csv("data/monthly_data.csv")


monthly_data = monthly_data.set_index('YYYY') # Set "YYYY" column as index


monthly_data

`stack()` moves the column labels into the innermost level of the row index, so the data from all columns ends up in a single column (a Series).

In [None]:
stacked_monthly_data = monthly_data.stack()

stacked_monthly_data

`unstack()` takes the inner index level and creates a column for every unique index. It then moves the data into these columns.

In [None]:
stacked_monthly_data.unstack()
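
Since `data/monthly_data.csv` may not be at hand, here is a self-contained sketch of the same round trip. The small `wide` DataFrame and its numbers are made up; only the `YYYY` index name mirrors the lecture data.

```python
import pandas as pd

wide = pd.DataFrame(
    {"JAN": [1.0, 3.0], "FEB": [2.0, 4.0]},
    index=pd.Index([2000, 2001], name="YYYY"),
)

# stack(): column labels become the inner index level -> Series of length 4
long = wide.stack()

# unstack(): the inner index level goes back to the columns
round_trip = long.unstack()
```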

`unstack()` might introduce missing data if not all of the values in the level are found in each of the subgroups. Let's consider the following example.

In [None]:
s1 = pd.Series([0, 1, 2, 3], index=["a", "b", "c", "d"])

s2 = pd.Series([4, 5, 6], index=["c", "d", "e"])

test_data = pd.concat([s1, s2], keys=["one", "two"])

test_data

In [None]:
test_data.unstack()

What if we `unstack()` the initial DataFrame?

In [None]:
monthly_data.head()

In [None]:
unstacked_monthly_data = monthly_data.unstack()

unstacked_monthly_data

Let's convert the unstacked result from a Pandas Series to a Pandas DataFrame and then reset the index.

In [None]:
pd.DataFrame(unstacked_monthly_data).reset_index() # We converted Wide format data into Long format

### Wide to Long format

---


When converting wide format into long format, we merge multiple columns into one, which produces a DataFrame that is longer than the input.

`melt()` is the opposite of `pivot()`: it moves the data from several columns into a single column.

In [None]:
wide_data = pd.DataFrame([["Mary", 6, 4, 5],
                          ["John", 7, 8, 7],
                          ["Ann", 6, 7, 9],
                          ["Pete", 6, 5, 5],
                          ["Laura", 5, 2, 7]], 
                         columns = ["name", "test_1", "test_2", "test_3"])


wide_data

In [None]:
pd.melt(wide_data, id_vars=["name"]) # Returns Long format

In [None]:
pd.melt(wide_data, id_vars=["name"], value_vars=["test_1"]) # Use one column as value variable

In [None]:
pd.melt(wide_data, id_vars=["name"], value_vars=["test_1", "test_2"]) # Use two columns as value variables

After converting our DataFrame from wide to long format, we see that there are two new columns, `variable` and `value`. We can rename them during the conversion by specifying the `var_name` and `value_name` arguments, respectively.

In [None]:
pd.melt(wide_data, id_vars=["name"], var_name="test", value_name="grades")

### Long to Wide format

---

To convert long format data into wide format, we use the `pivot()` method. `pivot()` moves data from rows into columns.

Let's first create long format data. `pivot()` is the inverse of the Pandas `melt()` operation we saw above.

In [None]:
raw_data = {"patient": [1, 1, 1, 2, 2], 
            "obs": [1, 2, 3, 1, 2], 
            "treatment": [0, 1, 0, 1, 0],
            "score": [6252, 24243, 2345, 2342, 23525]}


long_data = pd.DataFrame(raw_data, columns = ['patient', 'obs', 'treatment', 'score'])


long_data

In [None]:
wide_data = long_data.pivot(index="patient", columns="obs", values="score")


wide_data
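
That `pivot()` undoes `melt()` can be checked with a short round trip. The `scores` DataFrame below is a made-up, reduced version of the grades example (an assumption for illustration).

```python
import pandas as pd

scores = pd.DataFrame(
    {"name": ["Mary", "John"], "test_1": [6, 7], "test_2": [4, 8]}
)

# Wide -> long: one row per (name, test) pair
long_scores = scores.melt(id_vars="name", var_name="test", value_name="grade")

# Long -> wide: one row per name, one column per test
wide_again = long_scores.pivot(index="name", columns="test", values="grade")
```

Calling `reset_index()` on `wide_again` turns `name` back into an ordinary column if the original layout is needed.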

## Groupby

---

Sometimes we want to select data based on groups and understand aggregated data at the group level. Fortunately, Pandas has a `groupby()` method for exactly this task. The idea behind `groupby()` is that it takes a DataFrame, splits it into chunks based on some key values, applies a computation to those chunks, and then combines the results back together into another DataFrame. In Pandas this is referred to as the `split-apply-combine` pattern.


![Split_Apply_Combine](images/split_apply_combine.png)


---


* **Splitting** the data into groups based on some criteria.


* **Applying** a function to each group independently.


* **Combining** the results into a data structure.



The **Split** step is the most straightforward. We may wish to split the data set into groups based on some key(s) and do something with those groups.


In the **Apply** step we're doing one of the following:


* Aggregation


    * Compute group sum, mean, variance, etc.
    * Compute group size/count


* Transformation


    * Standardize data in a group
    * Filling NAs within groups with a value derived from each group
    
    
* Filtration


    * Filtering out data based on some criteria
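
The three kinds of apply step listed above can be sketched on a tiny made-up DataFrame (the `team` and `points` columns are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"team": ["x", "x", "y", "y", "y"], "points": [1.0, 3.0, 10.0, None, 20.0]}
)
grouped = df.groupby("team")["points"]

# Aggregation: one value per group
totals = grouped.sum()

# Transformation: same length as the input; fill NAs with the group mean
filled = grouped.transform(lambda s: s.fillna(s.mean()))

# Filtration: keep only the rows of groups that satisfy a condition
big_teams = df.groupby("team").filter(lambda g: len(g) >= 3)
```

Aggregation shrinks the data to one row per group, transformation keeps the original shape, and filtration returns a subset of the original rows.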

#### Reference



[Group by: split-apply-combine](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html)


[Grouping](https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#grouping)


[Combining with stats and GroupBy](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html#combining-with-stats-and-groupby)

In [None]:
series = pd.Series(data=[0, 5, 10, 5, 10, 15, 10, 15, 20],
                   index=["A", "B", "C", "A", "B", "C", "A", "B", "C"])


series

In [None]:
series.groupby(by=series.index) # Returns SeriesGroupBy object. Does not compute anything yet

In [None]:
series.groupby(by=series.index).sum() # Group by index and then sum them up

We can calculate several aggregation functions, such as count, mean, sum, etc.

In [None]:
series.groupby(by=series.index).agg([np.sum, np.mean, np.min, np.max])

In [None]:
series.groupby(by=series.index).agg(["sum", "mean", "count"])

#### Let's see how `groupby()` works with DataFrames

In [None]:
athletes = pd.read_csv("data/athletes.csv")


athletes.head()

Like a Series groupby, a DataFrame groupby returns a lazy object, here a `DataFrameGroupBy`. It is not a DataFrame itself, but it supports many familiar operations, such as selecting columns and aggregating them.

In [None]:
athletes.groupby(by=["nationality"])
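
To look inside a GroupBy object without aggregating, we can count, fetch, or iterate over its groups. The sketch uses a made-up `people` DataFrame (an assumption, since `data/athletes.csv` may not be available).

```python
import pandas as pd

people = pd.DataFrame(
    {"nationality": ["USA", "GER", "USA"], "height": [1.80, 1.75, 1.90]}
)
grouped = people.groupby("nationality")

# Number of distinct group keys
n_groups = grouped.ngroups

# A single group is an ordinary DataFrame
usa = grouped.get_group("USA")

# Iterating yields (key, chunk) pairs
sizes = {key: len(chunk) for key, chunk in grouped}
```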

Calling an aggregation function on the `GroupBy` object applies the calculation for every group and constructs a DataFrame with the results.

In [None]:
athletes.groupby(by=["nationality"])[["height", "weight"]].mean() # Mean height and weight by nationality

In [None]:
athletes.groupby(by=["sex", "nationality"])[["height", "weight"]].mean() # Mean height and weight by sex and nationality

Let's count the number of medals by country. To do so, we group by country and then sum the medals.

In [None]:
medal_counts = athletes.groupby(by=["nationality"])[["gold", "silver", "bronze"]].sum()

medal_counts

Not very informative, right? Let's sort the resulting DataFrame by values and see which countries won the most medals of each type.

In [None]:
medal_counts.sort_values(by=["gold", "silver", "bronze"], ascending=[False, False, False]).head()

In [None]:
medal_counts.nlargest(n=5, columns=["gold", "silver", "bronze"]) # Same as above

Medal counts by sex and country. Do women win more medals than men?

In [None]:
medal_counts_by_sex = athletes.groupby(by=["nationality", "sex"])[["gold", "silver", "bronze"]].sum()


medal_counts_by_sex.nlargest(5, ["gold", "silver", "bronze"])

In [None]:
athletes[athletes["nationality"]=="RUS"][["sex", "gold", "silver", "bronze"]].groupby("sex").sum()

> <font color='red'>Do you notice anything odd in the above `groupby()`? What is it? Why did it happen?</font>

Let's see the average height and weight by sex and sport. We could even group them by country as well.

In [None]:
athletes.groupby(["sport", "sex"])[["weight", "height"]].mean()

`groupby()` is a powerful and commonly used tool for data cleaning and data analysis. Once you have grouped the data by some category you have a DataFrame of just those values and you can conduct aggregated analysis on the segments that you are interested in. The `groupby()` method follows a `split-apply-combine` approach - first the data is split into subgroups, then you can apply some transformation, filtering, or aggregation, and then the results are combined automatically by Pandas for us.

## Pivot Table

---


A pivot table is a way of summarizing data in a DataFrame for a particular purpose. It makes heavy use of
aggregation functions. A pivot table is itself a DataFrame, where the rows represent one variable that
you're interested in, the columns another, and the cells some aggregate value. A pivot table also tends to
include marginal values, which are the totals for each column and row. This lets you
see the relationship between two variables at a glance.


Behind the `pivot_table()` method of Pandas, there is `groupby()` facility combined with reshape operations utilizing hierarchical indexing.


> Pandas `pivot()` and `pivot_table()` are not the same. They are similar and in some cases they are complements.



`pivot_table()` is a generalization of `pivot()` that can handle duplicate values for one pivoted index/column pair, whereas `pivot()` can’t deal with duplicate values.
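
The difference shows up as soon as an index/column pair occurs more than once. In the made-up `dup` DataFrame below (an assumption for illustration), the pair `("r1", "c1")` appears twice, so `pivot()` raises a `ValueError` while `pivot_table()` aggregates the duplicates.

```python
import pandas as pd

dup = pd.DataFrame(
    {"row": ["r1", "r1", "r2"], "col": ["c1", "c1", "c1"], "val": [1, 3, 5]}
)

try:
    dup.pivot(index="row", columns="col", values="val")
    pivot_failed = False
except ValueError:
    # pivot() cannot decide which of the two values to place in the cell
    pivot_failed = True

# pivot_table() resolves the duplicates with an aggregation function
table = dup.pivot_table(index="row", columns="col", values="val", aggfunc="mean")
```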



**Pandas `pivot_table()` offers functionality similar to an Excel pivot table**

#### Reference


[Pivot tables](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html#pivot-tables)


[Pandas Pivot Table Explained](https://pbpython.com/pandas-pivot-table-explained.html)

In [None]:
pivot_df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
                               "bar", "bar", "bar", "bar"],
                         "B": ["one", "one", "one", "two", "two",
                               "one", "one", "two", "two"],
                         "C": ["small", "large", "large", "small",
                               "small", "large", "small", "small", "large"],
                         "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
                         "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})



pivot_df

The simplest Pivot Table

In [None]:
pivot_df.pivot_table(index=["A"], values=["D", "E"]) # Average of the numeric columns; the default aggregation is the mean

We can pivot our DataFrame by two or more columns

In [None]:
pivot_df.pivot_table(index=["A","B"], values=["D", "E"]) # Select the numeric columns explicitly

Pivot Table with column values


<div class="alert alert-info">

One of the confusing points with the `pivot_table()` is the use of `columns` and `values` . Remember, `columns` are optional - they provide an additional way to segment the actual `values` you care about. The aggregation functions are applied to the `values` you list.

</div>

In [None]:
pivot_df.pivot_table(index=["A", "B"],
                     columns=["E"],
                     values=["D"]) # Select the numeric column to aggregate explicitly

In [None]:
pivot_df.pivot_table(index=["A", "B"],
                     columns=["C"],
                     aggfunc=["mean", "sum"])

**Fully-fledged Pivot Table**

In [None]:
pivot_df.pivot_table(index=["A", "B"],
                     columns=["C"],
                     values="D",
                     aggfunc="sum",
                     margins=True,
                     margins_name="Total",
                     fill_value=0)

In [None]:
pivot_df.pivot_table(index=["A", "B"],
                     columns=["C"],
                     values=["D", "E"],
                     aggfunc="sum",
                     margins=True,
                     margins_name="Total",
                     fill_value=0)

`Pivot Tables` are incredibly useful when dealing with numeric data, especially if you're trying to summarize the data in some form. You'll regularly be creating new pivot tables on slices of data, whether you're exploring the data yourself or preparing data for others to report on. And of course, you can pass any function you want to the aggregate function, including those that you define yourself.
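
Passing a self-defined function works the same way as passing a built-in name. In the sketch below, `value_range` is a hypothetical helper and the small DataFrame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": ["foo", "foo", "bar", "bar"], "D": [1, 4, 2, 10]}
)

def value_range(values):
    """Hypothetical helper: spread between the largest and smallest value."""
    return values.max() - values.min()

# The custom function is applied to the "D" values of every group in "A"
spread = df.pivot_table(index="A", values="D", aggfunc=value_range)
```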

## Cross-Tabulation

---

A cross-tabulation (or crosstab for short) is a special case of a pivot table that computes group frequencies, unless an array of values and an aggregation function are passed.
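
A minimal sketch of the frequency-counting behaviour, using a made-up `cars` DataFrame (an assumption so that the example runs without the data file). It also shows that the counts match a `groupby()` followed by `unstack()`, and that `normalize="index"` turns counts into row shares.

```python
import pandas as pd

cars = pd.DataFrame(
    {"make": ["toyota", "toyota", "honda", "honda", "honda"],
     "body_style": ["sedan", "wagon", "sedan", "sedan", "hatchback"]}
)

# Frequency of every (make, body_style) combination
counts = pd.crosstab(cars["make"], cars["body_style"])

# The same table built by hand with groupby
via_groupby = cars.groupby(["make", "body_style"]).size().unstack(fill_value=0)

# Row-wise shares instead of raw counts
shares = pd.crosstab(cars["make"], cars["body_style"], normalize="index")
```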

#### Reference


[Cross tabulations](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html#cross-tabulations)


[Pandas Crosstab Explained](https://pbpython.com/pandas-crosstab.html)

Define column names for the data, since the data does not have any.

In [None]:
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

In [None]:
cross_df = pd.read_csv("data/automobile.data",
                       header=None,
                       names=headers,
                       na_values="?") # Convert "?" into NaN

In [None]:
cross_df.head()

The DataFrame contains many rows and is not convenient to work with. Let's extract only the top automobile producers, such as:

In [None]:
models = ["toyota", "nissan", "mazda", "honda",
          "mitsubishi", "subaru", "volkswagen", "volvo"]

In [None]:
cross_df = cross_df[cross_df["make"].isin(models)]

cross_df.head()

The simplest crosstab. Let's count how many cars of each `body_style` these car makers produced.

In [None]:
pd.crosstab(index=cross_df["make"],
            columns=cross_df["body_style"])

In [None]:
cross_df.groupby(["make", "body_style"])["body_style"].count().unstack().fillna(0) # Same as above, but with groupby

In [None]:
cross_df.pivot_table(index="make", columns="body_style", aggfunc={"body_style": len}, fill_value=0) # Same with pivot_table

In [None]:
pd.crosstab(index=cross_df["make"],
            columns=cross_df["num_doors"],
            margins=True,
            margins_name="Total") # Include totals across rows and columns

Cross-tabulation is not only used to count frequencies. Let's calculate the average price across car makers and break it down by car type.

In [None]:
pd.crosstab(index=cross_df["make"],
            columns=cross_df["body_style"],
            values=cross_df["price"],
            aggfunc="mean").round(0).fillna("")

Pandas `crosstab()` also accepts multiple columns and groups by all of them. For example, to see how the data is distributed by front-wheel drive (fwd) and rear-wheel drive (rwd), we can add the `drive_wheels` column to the list passed as the second argument to `crosstab()`.

In [None]:
pd.crosstab(cross_df["make"],
            [cross_df["body_style"],
             cross_df["drive_wheels"]])

In [None]:
pd.crosstab([cross_df["make"], cross_df["num_doors"]],
            [cross_df["body_style"],
             cross_df["drive_wheels"]],
            rownames=["Auto Manufacturer", "Doors"],
            colnames=['Body Style', "Drive Type"],
            dropna=False)

## Summary

---

Now you know how to merge and concatenate datasets. You will find these functions very useful for
combining data to get richer results and to do analysis with. A solid understanding of
how to merge data is absolutely essential when you are procuring, cleaning, and manipulating data. It's
worth knowing how to join different datasets quickly and which options you can use when joining
them, and I would encourage you to check out the pandas docs for joining and concatenating data.