# Combining Datasets with Pandas

## Learning Objectives

At the end of this notebook you should be able to:
- combine DataFrames with Pandas
- describe the different join methods (how DataFrames are joined)

Pandas offers several functions for combining two sets of data: `pd.merge()`, `df.merge()`, `df.join()`, and `pd.concat()`. For the most part, they do largely the same things. Note the slight difference in syntax: `merge()` and `concat()` can be called via the Pandas module, while `merge()` and `join()` can be called on a DataFrame instance.  
In some cases one of them is the better choice because it needs less code or makes a particular combination easier. The major differences lie in what each does by default: `merge()` joins on common columns, `join()` joins on indices, and `concat()` simply appends along a given axis.

You can find more detail about the differences between all three of these in the [docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html). We'll look at some examples below. 
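
As a quick illustration of these defaults, here is a minimal sketch on two tiny, made-up DataFrames (the names `left`, `right`, and the column names are assumptions for illustration only, not part of the wine data):

```python
import pandas as pd

# Two small, made-up frames that share a 'key' column and the default integer index.
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'left_val': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'right_val': [20, 30, 40]})

# merge(): matches rows on the shared column 'key' (inner join by default).
merged = pd.merge(left, right)

# join(): matches rows on the index (left join by default).
# Both frames have a 'key' column, so suffixes are needed to avoid a name clash.
joined = left.join(right, lsuffix='_left', rsuffix='_right')

# concat(): stacks the frames along axis=0 (rows) by default.
stacked = pd.concat([left, right])

print(merged)
print(joined)
print(stacked)
```

Comparing the three printed results shows how the same inputs lead to quite different tables depending on the default behaviour.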

In [None]:
# We'll go back to our wine data set. Who doesn't love wine?
import pandas as pd
wine_df = pd.read_csv('data/winequality-red.csv', delimiter=';')
wine_df.head()

In [None]:
# A glance at the values of the quality of wine in the DataFrame
wine_df.quality.unique()

In [None]:
# get_dummies is a function of the pandas module - you simply pass in a Pandas Series 
# or DataFrame, and it will convert a categorical variable into dummy/indicator variables. 
# The idea of dummy coding is to convert each category into a new column that marks whether a row belongs to it.
# (The values are 1/0 in older pandas versions and True/False in pandas 2.0 and later, unless you pass dtype=int.)
# We will cover this topic in more depth later.
quality_dummies = pd.get_dummies(wine_df.quality, prefix='quality')
quality_dummies.head()
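
To see what dummy coding does without loading the wine data, here is a small sketch on a made-up Series (the values are invented for illustration). Passing `dtype=int` is a choice made here so that the output shows 1/0 regardless of the pandas version:

```python
import pandas as pd

# A made-up categorical variable with three distinct values.
quality = pd.Series([5, 6, 5, 7], name='quality')

# One new column per category; dtype=int forces 1/0 instead of True/False.
dummies = pd.get_dummies(quality, prefix='quality', dtype=int)
print(dummies)
```

Each row has exactly one 1, in the column that corresponds to its original value.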

### Join()
Now let's look at the `join()` method. Remember, this joins on indices by default and is called on a dataframe instance. This means that we can simply join our quality dummies dataframe back to our original wine dataframe with the following code:

In [None]:
joined_df = wine_df.join(quality_dummies)
joined_df.head() 

The arguments of `.join` are the following: 
````
DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
````
With `how` we can specify which join method we want to use.

If we want to join on a specific common column instead of the index, we need to set that column as the index in both `df` and `other`. The joined DataFrame will then have this column as its index.

```
df.set_index('column_name').join(other.set_index('column_name'))
```

See the documentation [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html)
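
The following sketch compares two ways of joining on a column. The DataFrames `orders` and `customers` are made up for illustration. The second variant uses the `on` parameter of `join()`, which matches a column of the calling DataFrame against the index of `other` and keeps the caller's original index:

```python
import pandas as pd

# Made-up example data: several orders per customer.
orders = pd.DataFrame({'customer_id': [1, 2, 2, 3],
                       'amount': [10.0, 5.5, 7.0, 3.2]})
customers = pd.DataFrame({'customer_id': [1, 2, 3],
                          'name': ['Ada', 'Ben', 'Cleo']})

# Variant 1: set the common column as index on both sides.
# The result is indexed by 'customer_id'.
by_index = orders.set_index('customer_id').join(customers.set_index('customer_id'))

# Variant 2: only 'other' is indexed by the key; 'on' names the column in 'orders'.
# The result keeps the original integer index of 'orders'.
by_column = orders.join(customers.set_index('customer_id'), on='customer_id')

print(by_index)
print(by_column)
```

Which variant is more convenient depends on whether you want the key to end up as the index or stay a regular column.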


The `how` argument of `join()` and `merge()` specifies which keys are included in the resulting table. If a key combination does not appear in either the left or the right table, the corresponding values in the joined table will be NA. Note the different defaults: `join()` uses `how='left'`, whereas `merge()` uses `how='inner'`. Here is a summary of the `how` options and their SQL equivalent names:

Merge method | SQL Join Name | Description
---|---|---
left| LEFT OUTER JOIN | Use keys from left frame only
right | RIGHT OUTER JOIN | Use keys from right frame only
outer | FULL OUTER JOIN | Use union of keys from both frames
inner | INNER JOIN | Use intersection of keys from both frames


You can also think of it as set theory and use Venn diagrams to illustrate what happens in each method.

![Join Methods](./images/join_types.png)
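
To make the four options concrete, the sketch below merges two tiny, made-up DataFrames with each value of `how` and collects the keys that end up in the result (all names here are assumptions for illustration):

```python
import pandas as pd

# Keys 'b' and 'c' appear in both frames, 'a' only on the left, 'd' only on the right.
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'left_val': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'right_val': [20, 30, 40]})

keys_by_how = {}
for how in ['left', 'right', 'outer', 'inner']:
    result = pd.merge(left, right, on='key', how=how)
    keys_by_how[how] = result['key'].tolist()
    print(how, keys_by_how[how])

# Rows without a partner on the other side are filled with NaN.
left_result = pd.merge(left, right, on='key', how='left')
print(left_result)
```

The `inner` result contains only the shared keys, the `outer` result contains all four keys, and `left`/`right` keep exactly the keys of the respective frame.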

### Merge()
Let's look at the `merge()` method. `merge()` combines DataFrames on common columns by default. It can be used via the pandas module and can also be called on a DataFrame instance.

The arguments of `.merge` are the following: 
````
DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False,   
suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
````
See the documentation [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html).
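
A few of these arguments are easier to understand with an example. The sketch below uses made-up DataFrames (`wines` and `labels` are hypothetical) in which the key columns have different names, so `left_on` and `right_on` are needed. `indicator=True` adds a `_merge` column that records where each row came from, and `validate='many_to_one'` makes pandas raise an error if the keys in the right DataFrame are not unique:

```python
import pandas as pd

# Made-up data: the key is called 'quality' on the left and 'score' on the right.
wines = pd.DataFrame({'wine_id': [1, 2, 3], 'quality': [5, 6, 7]})
labels = pd.DataFrame({'score': [5, 6, 8], 'label': ['ok', 'good', 'great']})

merged = wines.merge(
    labels,
    how='left',               # keep every wine, even without a matching label
    left_on='quality',        # key column in 'wines'
    right_on='score',         # key column in 'labels'
    indicator=True,           # adds the '_merge' column
    validate='many_to_one',   # check that 'score' is unique in 'labels'
)
print(merged)
```

The wine with quality 7 has no matching score, so its `label` is missing and `_merge` reads `left_only`.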

In [None]:
# merge() needs a common column in both dataframes.
# Let's turn the index into a regular column and use it as the one to merge on:
wine_df_ind = wine_df.reset_index()
quality_dummies_ind = quality_dummies.reset_index()

In [None]:
# check result - you will see a new column called index in the dataframe
wine_df_ind.head()

In [None]:
# check result - you will see a new column called index in the dataframe
quality_dummies_ind.head()

In [None]:
# Merge quality_dummies_ind into wine_df_ind by calling merge() on the dataframe instance, using the common column 'index'
joined_df2a = wine_df_ind.merge(quality_dummies_ind, on='index')
joined_df2a.head()

In [None]:
# Join the two dataframes via the pandas module on the common column 'index'
joined_df2b = pd.merge(wine_df_ind, quality_dummies_ind, on='index')
joined_df2b.head()

### Concat()

Let's now look at concat.
`````
pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None,    
verify_integrity=False, sort=False, copy=True)
`````
Unlike `join()` and `merge()`, which always combine DataFrames side by side (adding columns), `concat()` lets you choose via `axis` whether to stack along rows (`axis=0`, the default) or along columns (`axis=1`).

See the documentation [here](https://pandas.pydata.org/docs/reference/api/pandas.concat.html).
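
Here is a minimal sketch of both directions on two small, made-up DataFrames (`a` and `b` are assumptions for illustration). It also shows how the `join` argument controls what happens to columns that exist in only one of the inputs:

```python
import pandas as pd

# Both frames have a column 'x'; 'y' and 'z' exist in only one of them.
a = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
b = pd.DataFrame({'x': [5, 6], 'z': [7, 8]})

# axis=0: stack rows. join='outer' (default) keeps all columns and fills gaps with NaN.
rows = pd.concat([a, b], axis=0, ignore_index=True)

# axis=0 with join='inner': keep only the columns present in every frame.
rows_inner = pd.concat([a, b], axis=0, join='inner', ignore_index=True)

# axis=1: place the frames side by side, aligned on the index.
cols = pd.concat([a, b], axis=1)

print(rows)
print(rows_inner)
print(cols)
```

Note that `axis=1` does not rename duplicate column names, so `cols` contains two columns called `x`.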

In [None]:
joined_df3 = pd.concat([quality_dummies, wine_df], axis=1)
joined_df3.head()

The images below show the difference between `axis=0` and `axis=1`.

**Concat with axis=0:**
![Concat Axis 0](./images/concat_axis_0.png)

---

**Concat with axis=1:**
![Concat Axis 1](./images/concat_axis_1.png)

(The pictures were part of [this](https://towardsdatascience.com/python-pandas-dataframe-join-merge-and-concatenate-84985c29ef78) blog post.)

### More examples on combining dataframes

Let's read in a different data set, since we're looking at combining multiple data sources.

In [None]:
red_wines_df = pd.read_csv('data/winequality-red.csv', delimiter=';')
white_wines_df = pd.read_csv('data/winequality-white.csv', delimiter=';')

We want to compare red and white wines regarding their mean fixed acidity per quality category.

In [None]:
# check out the included columns in the red wine dataset
red_wines_df.columns

In [None]:
# check out the included columns in the white wine dataset
white_wines_df.columns

In [None]:
# Let's build a new dataset which shows the mean fixed acidity per quality category for the red wines
red_wines_quality_df = red_wines_df.groupby('quality')['fixed acidity'].mean().reset_index()
red_wines_quality_df.head()

In [None]:
# ... and the same for the white wines
white_wines_quality_df = white_wines_df.groupby('quality')['fixed acidity'].mean().reset_index()
white_wines_quality_df.head()

In [None]:
# In order to compare red and white wines better, let's combine the two newly created dataframes. 
pd.merge(red_wines_quality_df, white_wines_quality_df, on=['quality'], suffixes=[' red', ' white'])

Let's try to reproduce the table above using `.join()` and `pd.concat()`.

In [None]:
# Since we want to join on a common column rather than on the indices, we need the .set_index() method.
red_wines_quality_df.set_index('quality').join(white_wines_quality_df.set_index('quality'), lsuffix = ' red', rsuffix = ' white').reset_index()

In [None]:
# For pd.concat we also need to set quality as the index, otherwise the column quality would appear twice in our dataframe.
# Try it out without the .set_index('quality') to see the difference.
# Can you think of a different solution for the naming of the columns so that we get the same result as above?
pd.concat([red_wines_quality_df.set_index('quality'), white_wines_quality_df.set_index('quality')], axis = 1, join="inner", keys=['red', 'white']).reset_index()
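
One possible answer to the naming question, sketched on made-up stand-ins for the two per-quality tables (the numbers below are invented, not the real wine means): rename the columns with `add_suffix()` before concatenating. This yields flat column names like those from `merge()`, whereas `keys=` produces a two-level column index.

```python
import pandas as pd

# Made-up stand-ins for the per-quality mean tables.
red = pd.DataFrame({'quality': [3, 4, 5], 'fixed acidity': [8.0, 7.5, 8.5]})
white = pd.DataFrame({'quality': [3, 4, 5], 'fixed acidity': [7.0, 6.5, 6.0]})

combined = pd.concat(
    [red.set_index('quality').add_suffix(' red'),      # 'fixed acidity' -> 'fixed acidity red'
     white.set_index('quality').add_suffix(' white')],  # 'fixed acidity' -> 'fixed acidity white'
    axis=1,
    join='inner',
).reset_index()
print(combined)
```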

## Check your understanding

1. Concatenate the two given DataFrames (`df1` and `df2`) along rows, then merge the result with the third DataFrame (`df3`) on the common column `student_id`.


In [None]:
df1 = pd.DataFrame({
    'student_id': ['S1', 'S2', 'S3', 'S4', 'S5'],
    'name': ['Erika Raaf', 'Nadja Berens', 'Florentin Kleist', 'Dorothea Eibl', 'Gerhard Bihlmeier'],
    'subject': ['Math', 'Biology', 'Biology', 'English', 'Philosophy']})
df2 = pd.DataFrame({
    'student_id': ['S6', 'S7', 'S8'],
    'name': ['Jens Hüls', 'Vera Kagan', 'Paula Brodersen'],
    'subject': ['Math', 'Math', 'Social Science']})
df3 = pd.DataFrame({
    'student_id': ['S1', 'S2', 'S3', 'S4', 'S5', 'S7', 'S8', 'S9', 'S10', 'S11', 'S12', 'S13'],
    'marks': [23, 45, 12, 67, 21, 55, 33, 14, 56, 83, 88, 12]})

```Python
result_data = pd.concat([df1, df2])
final_merged_data = pd.merge(result_data, df3, on='student_id')
final_merged_data
```


2. You have received weather data (temperatures) for the last year. The average temperature was measured for every month, but the maximum temperature is available for only a few months. Combine the two data sets without losing any information.

(Extra question: Can you fill the missing values in the column `Max TemperatureF` with the average maximum temperature?)

In [None]:
weather_mean_data = {'Mean TemperatureF': [53.1, 70., 34.93548387, 28.71428571, 32.35483871, 72.87096774, 70.13333333, 35., 62.61290323, 39.8, 55.4516129 , 63.76666667],
                     'Month': ['Apr', 'Aug', 'Dec', 'Feb', 'Jan', 'Jul', 'Jun', 'Mar', 'May', 'Nov', 'Oct', 'Sep']}
weather_max_data = {'Max TemperatureF': [68, 89, 91, 84], 'Month': ['Jan', 'Apr', 'Jul', 'Oct']}
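
One possible approach, shown as a sketch rather than the only solution: turn both dictionaries into DataFrames, merge them on `Month` with `how='left'` so that all twelve months are kept, and then fill the gaps with the mean of the available maximum temperatures. The variable names `weather_mean`, `weather_max`, and `weather` are chosen here for illustration.

```python
import pandas as pd

weather_mean_data = {'Mean TemperatureF': [53.1, 70., 34.93548387, 28.71428571, 32.35483871, 72.87096774,
                                           70.13333333, 35., 62.61290323, 39.8, 55.4516129, 63.76666667],
                     'Month': ['Apr', 'Aug', 'Dec', 'Feb', 'Jan', 'Jul', 'Jun', 'Mar', 'May', 'Nov', 'Oct', 'Sep']}
weather_max_data = {'Max TemperatureF': [68, 89, 91, 84], 'Month': ['Jan', 'Apr', 'Jul', 'Oct']}

weather_mean = pd.DataFrame(weather_mean_data)
weather_max = pd.DataFrame(weather_max_data)

# A left join keeps every month from weather_mean; months without a maximum get NaN.
weather = pd.merge(weather_mean, weather_max, on='Month', how='left')

# Extra question: replace the missing maxima with the mean of the four known values.
mean_of_max = weather['Max TemperatureF'].mean()
weather['Max TemperatureF'] = weather['Max TemperatureF'].fillna(mean_of_max)
print(weather)
```

Because every month in `weather_max` also appears in `weather_mean`, `how='outer'` would give the same set of rows here.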
