<a href="https://colab.research.google.com/github/Rossel/DataQuest_Courses/blob/master/030__Combining_Data_With_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# COURSE 4/6: DATA CLEANING AND ANALYSIS

# MISSION 2: Combining Data With Pandas

Learn how to combine data with pandas.



## 1. Introduction

In the last mission, we worked with just one data set, the 2015 World Happiness Report, to explore data aggregation. However, it's very common in practice to work with more than one data set at a time.

Often, you'll find that you need additional data to perform an analysis, or that you have the data but need to pull it from multiple sources. In this mission, we'll learn a couple of different techniques for combining data with pandas to handle situations like these.

We'll use what we learned in the last mission to analyze the 2015, 2016, and 2017 World Happiness Reports. Specifically, we'll look to answer the following question:

*Did world happiness increase, decrease, or stay about the same from 2015 to 2017?*

As a reminder, these reports assign each country a happiness score based on a poll question that asks respondents to rate their life on a scale of 0 - 10, so "world happiness" refers to this definition specifically.

Below are descriptions of some of the columns we'll be working with:


- `Country` - Name of the country
- `Region` - Name of the region the country belongs to
- `Happiness Rank` - The rank of the country, as determined by its happiness score
- `Happiness Score` - A score assigned to each country based on the answers to a poll question that asks respondents to rate their happiness on a scale of 0-10

Let's start by reading the 2015, 2016, and 2017 reports into separate pandas dataframes and adding a `Year` column to each to make it easier to distinguish between them.



In [None]:
# Import files directly using Google Colab
# Download the files from the links below:
# World_Happiness_2015.csv: https://drive.google.com/file/d/1iZ8_lHkMx7pI22s4ECfpNHKnOohyPfvU/view?usp=sharing
# World_Happiness_2016.csv: https://drive.google.com/file/d/1yi1YYJEJwzYMXZ1YsjdSVANNj_pCm3jI/view?usp=sharing
# World_Happiness_2017.csv: https://drive.google.com/file/d/1UjcEvCr5hj67-ZoBHwLmdOrxHKxMGGqR/view?usp=sharing

from google.colab import files
upload = files.upload()

Saving World_Happiness_2017.csv to World_Happiness_2017 (1).csv
Saving World_Happiness_2016.csv to World_Happiness_2016 (1).csv
Saving World_Happiness_2015.csv to World_Happiness_2015 (1).csv


In [None]:
# Import pandas and numpy libraries
import pandas as pd
import numpy as np

In [None]:
# Read the csv files
happiness2015 = pd.read_csv("World_Happiness_2015.csv")
happiness2016 = pd.read_csv("World_Happiness_2016.csv")
happiness2017 = pd.read_csv("World_Happiness_2017.csv")

**Instructions:**

Add a column called `Year` to each dataframe with the corresponding year. For example, the `Year` column in `happiness2015` should contain the value `2015` for each row.

In [None]:
happiness2015['Year'] = 2015
happiness2016['Year'] = 2016
happiness2017['Year'] = 2017

## 2. Combining Dataframes with the Concat Function

Let's start by exploring the `pd.concat()` [function](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html). The `concat()` function combines dataframes in one of two ways:

1. Stacked: Axis = 0 (This is the default option.)
![img](https://s3.amazonaws.com/dq-content/344/Concat_Updated.svg)

2. Side by Side: Axis = 1
![img](https://s3.amazonaws.com/dq-content/344/Concat_Axis1.svg)

Since `concat` is a function, not a method, we use the syntax below:

![img](https://s3.amazonaws.com/dq-content/344/Concat_syntax.svg)

In the next exercise, we'll use the `concat()` function to combine subsets of `happiness2015` and `happiness2016` and then debrief the results on the following screen.

Below are the subsets we'll be working with:




In [None]:
head_2015 = happiness2015[['Country','Happiness Score', 'Year']].head(3)
head_2015

| |Country|Happiness Score|Year|
|---|---|---|---|
|0|Switzerland|7.587|2015|
|1|Iceland|7.561|2015|
|2|Denmark|7.527|2015|


In [None]:
head_2016 = happiness2016[['Country','Happiness Score', 'Year']].head(3)
head_2016

| |Country|Happiness Score|Year|
|---|---|---|---|
|0|Denmark|7.526|2016|
|1|Switzerland|7.509|2016|
|2|Iceland|7.501|2016|


Let's use the `concat()` function to combine `head_2015` and `head_2016` next.



**Instructions:**

We've already saved the subsets from `happiness2015` and `happiness2016` to the variables `head_2015` and `head_2016`.

- Use the `pd.concat()` function to combine `head_2015` and `head_2016` along axis = 0. Remember to pass the `head_2015` and `head_2016` into the function as a list. Assign the result to `concat_axis0`.
- Use the `pd.concat()` function to combine `head_2015` and `head_2016` along axis = 1. Remember to pass `head_2015` and `head_2016` into the function as a list and set the `axis` parameter equal to `1`. Assign the result to `concat_axis1`.
- Use the variable inspector to view `concat_axis0` and `concat_axis1`.
  - Assign the number of rows in `concat_axis0` to a variable called `question1`.
  - Assign the number of rows in `concat_axis1` to a variable called `question2`.
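
One possible way to work through this exercise is sketched below. So that the snippet runs without the CSV files, it assumes `head_2015` and `head_2016` are rebuilt by hand from the three rows displayed above rather than sliced from `happiness2015` and `happiness2016`.

```python
import pandas as pd

# Rebuild the two three-row subsets from the values displayed above
# (an assumption made so the example runs without the CSV files).
head_2015 = pd.DataFrame({
    "Country": ["Switzerland", "Iceland", "Denmark"],
    "Happiness Score": [7.587, 7.561, 7.527],
    "Year": [2015, 2015, 2015],
})
head_2016 = pd.DataFrame({
    "Country": ["Denmark", "Switzerland", "Iceland"],
    "Happiness Score": [7.526, 7.509, 7.501],
    "Year": [2016, 2016, 2016],
})

# axis=0 (the default) stacks the dataframes vertically.
concat_axis0 = pd.concat([head_2015, head_2016], axis=0)

# axis=1 places the dataframes side by side, aligned on the index.
concat_axis1 = pd.concat([head_2015, head_2016], axis=1)

question1 = concat_axis0.shape[0]  # rows in the stacked result
question2 = concat_axis1.shape[0]  # rows in the side-by-side result
print(question1, question2)  # prints: 6 3
```

Stacking doubles the row count and keeps the three columns, while the side-by-side result keeps three rows and doubles the column count.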

## 3. Combining Dataframes with the Concat Function Continued

When you reviewed the results from the last exercise, you probably noticed that we merely pushed the dataframes together vertically or horizontally - none of the values, column names, or indexes changed. For this reason, when you use the `concat()` function to combine dataframes with the same shape and index, you can think of the function as "gluing" dataframes together.

![img](https://s3.amazonaws.com/dq-content/344/Glue.svg)

However, what happens if the dataframes have different shapes or columns? Let's confirm the `concat()` function's behavior when we combine dataframes that don't have the same shape in the next exercise.

We'll work with the following subsets:



In [None]:
head_2015 = happiness2015[['Year','Country','Happiness Score', 'Standard Error']].head(4)
head_2015

| |Year|Country|Happiness Score|Standard Error|
|---|---|---|---|---|
|0|2015|Switzerland|7.587|0.03411|
|1|2015|Iceland|7.561|0.04884|
|2|2015|Denmark|7.527|0.03328|
|3|2015|Norway|7.522|0.0388|


In [None]:
head_2016 = happiness2016[['Country','Happiness Score', 'Year']].head(3)
head_2016

| |Country|Happiness Score|Year|
|---|---|---|---|
|0|Denmark|7.526|2016|
|1|Switzerland|7.509|2016|
|2|Iceland|7.501|2016|


Notice in the subsets above that `head_2015` contains one column that `head_2016` does not - the `Standard Error` column. Let's confirm what happens when we concatenate them next.

**Instructions:**

We've already created the `head_2015` and `head_2016` variables.

- Use the `pd.concat()` function to combine `head_2015` and `head_2016` along axis = 0. Remember to pass the `head_2015` and `head_2016` into the function as a list. Assign the result to `concat_axis0`.
- Use the variable inspector to view `concat_axis0`.
  - Assign the number of rows in `concat_axis0` to a variable called `rows`.
  - Assign the number of columns in `concat_axis0` to a variable called `columns`.
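
A minimal sketch of this exercise follows. As an assumption for portability, the two subsets are typed in from the rows displayed above instead of being sliced from the full dataframes.

```python
import pandas as pd

# Subsets rebuilt from the values displayed above (assumption: no CSV files needed).
head_2015 = pd.DataFrame({
    "Year": [2015, 2015, 2015, 2015],
    "Country": ["Switzerland", "Iceland", "Denmark", "Norway"],
    "Happiness Score": [7.587, 7.561, 7.527, 7.522],
    "Standard Error": [0.03411, 0.04884, 0.03328, 0.0388],
})
head_2016 = pd.DataFrame({
    "Country": ["Denmark", "Switzerland", "Iceland"],
    "Happiness Score": [7.526, 7.509, 7.501],
    "Year": [2016, 2016, 2016],
})

# Columns are matched by name, not by position; a column missing from
# one dataframe is filled with NaN for that dataframe's rows.
concat_axis0 = pd.concat([head_2015, head_2016], axis=0)

rows = concat_axis0.shape[0]
columns = concat_axis0.shape[1]
missing = int(concat_axis0["Standard Error"].isna().sum())
print(rows, columns, missing)  # prints: 7 4 3
```

The three `NaN` values sit in the `Standard Error` column of the rows that came from `head_2016`.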

## 4. Combining Dataframes with Different Shapes Using the Concat Function

In the last exercise, we saw that the analogy of "gluing" dataframes together doesn't fully describe what happens when concatenating dataframes of different shapes. Instead, the function combined the data according to the corresponding column names:

![img](https://s3.amazonaws.com/dq-content/344/Concat_DifShapes.svg)

Note that because the `Standard Error` column didn't exist in `head_2016`, `NaN` values were created to signify those values are missing. By default, the `concat` function will keep ALL of the data, no matter if missing values are created.

Also, notice again the indexes of the original dataframes didn't change. If the indexes aren't meaningful, it can be better to reset them. This is especially true when we create duplicate indexes, because they could cause errors as we perform other data cleaning tasks.

Luckily, the `concat` function has a parameter, `ignore_index`, that can be used to clear the existing index and reset it in the result. Let's practice using it next.


**Instructions:**

- Use the `pd.concat()` function to combine `head_2015` and `head_2016` along axis = 0 again. This time, however, set the `ignore_index` parameter to `True` to reset the index in the result. Assign the result to `concat_update_index`.
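
The effect of `ignore_index` can be sketched as follows, again assuming the subsets are rebuilt by hand from the rows shown earlier in this mission.

```python
import pandas as pd

# Subsets rebuilt from the values displayed earlier (assumption: no CSV files needed).
head_2015 = pd.DataFrame({
    "Year": [2015, 2015, 2015, 2015],
    "Country": ["Switzerland", "Iceland", "Denmark", "Norway"],
    "Happiness Score": [7.587, 7.561, 7.527, 7.522],
    "Standard Error": [0.03411, 0.04884, 0.03328, 0.0388],
})
head_2016 = pd.DataFrame({
    "Country": ["Denmark", "Switzerland", "Iceland"],
    "Happiness Score": [7.526, 7.509, 7.501],
    "Year": [2016, 2016, 2016],
})

# Without ignore_index, the original labels 0..3 and 0..2 are kept,
# so the result contains duplicate index values.
kept_index = pd.concat([head_2015, head_2016], axis=0)
print(kept_index.index.is_unique)  # prints: False

# With ignore_index=True, the result gets a fresh 0..n-1 index.
concat_update_index = pd.concat([head_2015, head_2016], axis=0, ignore_index=True)
print(list(concat_update_index.index))  # prints: [0, 1, 2, 3, 4, 5, 6]
```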



## 5. Joining Dataframes with the Merge Function

Next, we'll explore the `pd.merge()` [function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html) - a function that can execute high-performance database-style joins. Note that unlike the `concat` function, the `merge` function only combines dataframes horizontally (axis=1) and can only combine two dataframes at a time. However, it can be valuable when we need to combine very large dataframes quickly, and it provides more flexibility in how data can be combined, as we'll see in the next couple of screens.

With the `merge()` function, we'll combine dataframes on a **key**, a shared index or column. When choosing a key, it's good practice to use keys with unique values to avoid duplicating data.

You can think of keys as creating a link from one dataframe to another using the common values or indexes. For example, in the diagram below, we linked the dataframes using common values in the `Country` columns.

![img](https://s3.amazonaws.com/dq-content/344/Merge_link.svg)

In the diagram below, we use those common country values to join or merge the dataframes.

![img](https://s3.amazonaws.com/dq-content/344/Merge.svg)

We'll explore the `merge` function in the next exercise using just three rows from `happiness2015` and `happiness2016`:

In [None]:
happiness2015[['Country','Happiness Rank','Year']].iloc[2:5]

| |Country|Happiness Rank|Year|
|---|---|---|---|
|2|Denmark|3|2015|
|3|Norway|4|2015|
|4|Canada|5|2015|


In [None]:
happiness2016[['Country','Happiness Rank','Year']].iloc[2:5]

| |Country|Happiness Rank|Year|
|---|---|---|---|
|2|Iceland|3|2016|
|3|Norway|4|2016|
|4|Finland|5|2016|


We'll use the following syntax:

![img](https://s3.amazonaws.com/dq-content/344/Merge_syntax.svg)


Let's practice using the `merge()` function next.


**Instructions:**

We've already saved three rows from `happiness2015` and `happiness2016` to variables named `three_2015` and `three_2016`.

- Use the `pd.merge()` function to join `three_2015` and `three_2016` on the `Country` column. Assign the result to `merged`.

In [None]:
three_2015 = happiness2015[['Country','Happiness Rank','Year']].iloc[2:5]
three_2016 = happiness2016[['Country','Happiness Rank','Year']].iloc[2:5]
merged = pd.merge(left=three_2015, right=three_2016, on='Country')
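
The earlier advice to prefer keys with unique values can be illustrated with a small sketch. Here `three_2015` is rebuilt from the rows shown above, while `dup_2016` is a hypothetical table, invented purely for illustration, in which the "Norway" row appears twice. The `validate` argument of `pd.merge()` is one way to have pandas check key uniqueness for us.

```python
import pandas as pd
from pandas.errors import MergeError

# Rebuilt from the rows displayed above.
three_2015 = pd.DataFrame({
    "Country": ["Denmark", "Norway", "Canada"],
    "Happiness Rank": [3, 4, 5],
    "Year": [2015, 2015, 2015],
})

# Hypothetical data: the "Norway" row is deliberately duplicated.
dup_2016 = pd.DataFrame({
    "Country": ["Norway", "Norway"],
    "Happiness Rank": [4, 4],
    "Year": [2016, 2016],
})

# Each matching pair of rows produces one output row, so the single
# 2015 "Norway" row is duplicated in the result.
merged_dup = pd.merge(left=three_2015, right=dup_2016, on="Country")
n_rows = len(merged_dup)
print(n_rows)  # prints: 2

# validate="one_to_one" asks pandas to raise an error if the key
# is not unique in both dataframes.
try:
    pd.merge(left=three_2015, right=dup_2016, on="Country", validate="one_to_one")
    raised = False
except MergeError:
    raised = True
print(raised)  # prints: True
```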

## 6. Joining on Columns with the Merge Function

Joining `three_2015` and `three_2016` in the last exercise resulted in a dataframe with just one row:

In [None]:
pd.merge(left=three_2015, right=three_2016, on='Country')

| |Country|Happiness Rank_x|Year_x|Happiness Rank_y|Year_y|
|---|---|---|---|---|---|
|0|Norway|4|2015|4|2016|


Let's look back to `three_2015` and `three_2016` to understand why. Since we joined the dataframes on the `Country` column, or used it as the key, the `merge()` function looked to match elements in the `Country` column in BOTH dataframes.

![img](https://s3.amazonaws.com/dq-content/344/Join_columns.svg)

The one country returned in `merged` was "Norway", the only element that appeared in the `Country` column in BOTH `three_2015` and `three_2016`.

This way of combining, or *joining*, data is called an *inner* join. An inner join returns only the intersection of the keys, or the elements that appear in both dataframes with a common key.

The term "join" originates from SQL (or structured query language), a language used to work with databases. If you're a SQL user, you'll recognize the following concepts. If you've never used SQL, don't worry! No prior knowledge is necessary for this mission, but we will learn SQL later in this path.

There are actually four different types of joins:

1. Inner: only includes elements that appear in both dataframes with a common key
2. Outer: includes all data from both dataframes
3. Left: includes all of the rows from the "left" dataframe along with any rows from the "right" dataframe with a common key; the result retains all columns from both of the original dataframes
4. Right: includes all of the rows from the "right" dataframe along with any rows from the "left" dataframe with a common key; the result retains all columns from both of the original dataframes

If the definition for *outer* joins sounds familiar, it's because we've already seen examples of outer joins! Recall that when we combined data using the `concat` function, it kept all of the data from all dataframes, no matter if missing values were created.

Since it's much more common to use inner and left joins for database-style joins, we'll focus on these join types for the remainder of the mission, but encourage you to explore the other options on your own.
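
As a quick way to compare the four join types, the sketch below runs each one on the same pair of dataframes and records how many rows come back. It assumes `three_2015` and `three_2016` are rebuilt by hand from the rows displayed earlier, so that it runs without the CSV files.

```python
import pandas as pd

# Rebuilt from the rows displayed earlier; the index mirrors .iloc[2:5].
three_2015 = pd.DataFrame({
    "Country": ["Denmark", "Norway", "Canada"],
    "Happiness Rank": [3, 4, 5],
    "Year": [2015, 2015, 2015],
}, index=[2, 3, 4])
three_2016 = pd.DataFrame({
    "Country": ["Iceland", "Norway", "Finland"],
    "Happiness Rank": [3, 4, 5],
    "Year": [2016, 2016, 2016],
}, index=[2, 3, 4])

# Run the same merge with each join type and count the resulting rows.
row_counts = {}
for how in ["inner", "outer", "left", "right"]:
    result = pd.merge(left=three_2015, right=three_2016, how=how, on="Country")
    row_counts[how] = len(result)

print(row_counts)  # prints: {'inner': 1, 'outer': 5, 'left': 3, 'right': 3}
```

Only "Norway" appears in both dataframes, so the inner join returns one row, the outer join returns all five distinct countries, and the left and right joins each return the three rows of the dataframe they preserve.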

Let's experiment with changing the join type next.

**Instructions:**

- Update `merged` to use a left join instead of an inner join. Set the `how` parameter to `'left'` in `merge()`. Assign the result to `merged_left`.
- Update `merged_left` so that the `left` parameter equals `three_2016` and the `right` parameter equals `three_2015`. Assign the result to `merged_left_updated`.
- Based on the results of this exercise, when using a left join, does changing the dataframe assigned to the `left` and `right` parameters change the result? Try to answer this question before moving on to the next screen.

## 7. Left Joins with the Merge Function

Let's summarize what we learned in the last exercise:

1. Changing the join type from an inner join to a left join resulted in a dataframe with more rows and created `NaN`s.
2. When using a left join, interchanging the dataframes assigned to the `left` and `right` parameters changes the results.

Let's look into the results in more detail. First, let's look at the case in which the "left" dataframe is `three_2015` and the "right" dataframe is `three_2016`:

In [None]:
pd.merge(left=three_2015, right=three_2016, how='left', on='Country')

| |Country|Happiness Rank_x|Year_x|Happiness Rank_y|Year_y|
|---|---|---|---|---|---|
|0|Denmark|3|2015|NaN|NaN|
|1|Norway|4|2015|4.0|2016.0|
|2|Canada|5|2015|NaN|NaN|


Recall that a left join includes all of the rows from the "left" dataframe along with any rows from the "right" dataframe with a common key.

![img](https://s3.amazonaws.com/dq-content/344/Left_join.svg)

Since the `Country` column was used as the key, only countries that appear in BOTH dataframes have a value in every column. "Norway" was the only value in the `Country` column in BOTH dataframes, so it's the only row with a value in every column.

When we interchanged the "left" and the "right" dataframes, the values changed:



In [None]:
pd.merge(left=three_2016, right=three_2015, how='left', on='Country')

| |Country|Happiness Rank_x|Year_x|Happiness Rank_y|Year_y|
|---|---|---|---|---|---|
|0|Iceland|3|2016|NaN|NaN|
|1|Norway|4|2016|4.0|2015.0|
|2|Finland|5|2016|NaN|NaN|


This time, we kept all of the rows from `three_2016`. "Norway" was still the only value in the `Country` column in BOTH dataframes, so it's the only row with a value in every column.

![img](https://s3.amazonaws.com/dq-content/344/Left_join_update.svg)

In summary, we'd use a left join when we don't want to drop any data from the left dataframe.

Note that a right join works the same as a left join, except it includes all of the rows from the "right" dataframe. Since it's far more common in practice to use a left join, we won't cover right joins in detail.

You may have also noticed above that the `merge` function added a suffix of either `_x` or `_y` to columns of the same name to distinguish between them.


| |Country|Happiness Rank_x|Year_x|Happiness Rank_y|Year_y|
|---|---|---|---|---|---|

Let's update those suffixes next to make our results easier to read.


**Instructions:**

- Update `merged_left` to use the suffixes `_2015` and `_2016`. Set the `suffixes` parameter to `('_2015', '_2016')` in `merge()`. Assign the result to `merged_suffixes`.
- Update `merged_left_updated` to use the suffixes `_2015` and `_2016`. Notice that the "left" dataframe is `three_2016` and the "right" dataframe is `three_2015`, so the order of the suffixes must be swapped. Assign the result to `merged_updated_suffixes`.
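
A minimal sketch of this exercise is shown below. It assumes the starting points are the two left joins from the previous screen, and it rebuilds `three_2015` and `three_2016` by hand from the rows displayed earlier. The first suffix applies to overlapping columns from the "left" dataframe and the second to those from the "right" dataframe, so the order is swapped in the second merge.

```python
import pandas as pd

# Rebuilt from the rows displayed earlier (assumption: no CSV files needed).
three_2015 = pd.DataFrame({
    "Country": ["Denmark", "Norway", "Canada"],
    "Happiness Rank": [3, 4, 5],
    "Year": [2015, 2015, 2015],
})
three_2016 = pd.DataFrame({
    "Country": ["Iceland", "Norway", "Finland"],
    "Happiness Rank": [3, 4, 5],
    "Year": [2016, 2016, 2016],
})

# Left dataframe is three_2015, so its suffix comes first.
merged_suffixes = pd.merge(
    left=three_2015, right=three_2016, how="left", on="Country",
    suffixes=("_2015", "_2016"),
)

# Left dataframe is three_2016, so the suffix order is swapped.
merged_updated_suffixes = pd.merge(
    left=three_2016, right=three_2015, how="left", on="Country",
    suffixes=("_2016", "_2015"),
)

print(list(merged_suffixes.columns))
print(list(merged_updated_suffixes.columns))
```

The key column `Country` is shared by both dataframes and therefore receives no suffix.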

## 8. Join on Index with the Merge Function

## 9. Challenge: Combine Data and Create a Visualization