# Working with Multiple DataFrames
## Introduction: Multiple DataFrames

In order to efficiently store data, we often spread related information across multiple tables.<br>
<br>
For instance, imagine that we own an e-commerce business and we want to track the products that have been ordered from our website.<br>
<br>
We could have one table with all of the following information:

* `order_id`
* `customer_id`
* `customer_name`
* `customer_address`
* `customer_phone_number`
* `product_id`
* `product_description`
* `product_price`
* `quantity`
* `timestamp`

However, a lot of this information would be repeated. If the same customer makes multiple orders, that customer's name, address, and phone number will be reported multiple times. If the same product is ordered by multiple customers, then the product price and description will be repeated. This will make our orders table big and unmanageable.<br>
<br>
So instead, we can split our data into three tables:

* `orders` would contain the information necessary to describe an order: `order_id`, `customer_id`, `product_id`, `quantity`, and `timestamp`
* `products` would contain the information to describe each product: `product_id`, `product_description` and `product_price`
* `customers` would contain the information for each customer: `customer_id`, `customer_name`, `customer_address`, and `customer_phone_number`

In this lesson, we will learn the Pandas commands that help us work with data stored in multiple tables.
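
To make the three-table layout concrete, here is a minimal sketch that builds tiny in-memory versions of `orders`, `products`, and `customers`. The rows are illustrative sample values only, and the column names follow the lists above; the lesson itself loads its data from CSV files.

```python
import pandas as pd

# Illustrative rows only; column names follow the three-table layout described above.
orders = pd.DataFrame({
    'order_id': [1, 2, 3],
    'customer_id': [2, 2, 3],
    'product_id': [3, 2, 1],
    'quantity': [1, 3, 1],
    'timestamp': ['2017-01-01', '2017-01-01', '2017-01-01'],
})
products = pd.DataFrame({
    'product_id': [1, 2, 3],
    'product_description': ['thing-a-ma-jig', 'whatcha-ma-call-it', 'doo-hickey'],
    'product_price': [5, 10, 7],
})
customers = pd.DataFrame({
    'customer_id': [2, 3],
    'customer_name': ['Jane Doe', 'Joe Schmo'],
    'customer_address': ['456 Park Ave.', '798 Broadway'],
    'customer_phone_number': ['949-867-5309', '112-358-1321'],
})

# Customer 2 places two orders, but their name and address are stored only once.
print(orders['customer_id'].value_counts())
print(len(customers))
```

Splitting the data this way means that a change to a customer's address touches one row in `customers` rather than every order that customer has placed.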

***

### Exercise

In the cell below, we've loaded in three DataFrames: `orders`, `products`, and `customers`.

In [1]:
import pandas as pd

orders = pd.read_csv('orders.csv')

products = pd.read_csv('products.csv')

customers = pd.read_csv('customers.csv')

1. Start by inspecting `orders`:

In [2]:
orders.head()

Unnamed: 0,order_id,customer_id,product_id,quantity,timestamp
0,1,2,3,1,2017-01-01
1,2,2,2,3,2017-01-01
2,3,3,1,1,2017-01-01
3,4,3,2,2,2017-02-01
4,5,3,3,3,2017-02-01


2. Now inspect `products`:

In [3]:
products.head()

Unnamed: 0,product_id,description,price
0,1,thing-a-ma-jig,5
1,2,whatcha-ma-call-it,10
2,3,doo-hickey,7
3,4,gizmo,3


3. Now inspect `customers`:

In [4]:
customers.head()

Unnamed: 0,customer_id,customer_name,address,phone_number
0,1,John Smith,123 Main St.,212-123-4567
1,2,Jane Doe,456 Park Ave.,949-867-5309
2,3,Joe Schmo,798 Broadway,112-358-1321


***

## Inner Merge I

Suppose we have the following three tables that describe our e-commerce business:

* `orders` — a table with information on each transaction:

|order_id|customer_id|product_id|quantity|timestamp|
|:-------|:----------|:---------|:-------|:--------|
|1|2|3|1|2017-01-01|
|2|2|2|3|2017-01-01|
|3|3|1|1|2017-01-01|
|4|3|2|2|2017-02-01|
|5|3|3|3|2017-02-01|
|6|1|4|2|2017-03-01|
|7|1|1|1|2017-02-02|
|8|1|4|1|2017-02-02|

* `products` — a table with product IDs, descriptions, and prices:

|product_id|description|price|
|:---------|:----------|:----|
|1|thing-a-ma-jig|5|
|2|whatcha-ma-call-it|10|
|3|doo-hickey|7|
|4|gizmo|3|

* `customers` — a table with customer names and contact information:

|customer_id|customer_name|address|phone_number|
|:----------|:------------|:------|:-----------|
|1|John Smith|123 Main St.|212-123-4567|
|2|Jane Doe|456 Park Ave.|949-867-5309|
|3|Joe Schmo|798 Broadway|112-358-1321|

If we just look at the <b>`orders`</b> table, we cannot really tell what happened in each order. However, if we refer to the other tables, we can get a more complete picture.<br>
<br>
Let us examine the order with an `order_id` of `1`. It was purchased by Customer $2$. To find out the customer's name, we look at the <b>`customers`</b> table and look for the item with a `customer_id` value of $2$. We can see that Customer $2$'s name is Jane Doe and that she lives at 456 Park Ave.<br>
<br>
Doing this kind of matching is called <b>merging</b> two DataFrames.
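
The manual lookup described above can be sketched with plain boolean filtering. This is only an illustration of the two-step matching for a single order, using small hand-built DataFrames that copy a few rows from the tables above.

```python
import pandas as pd

# A few rows copied from the orders and customers tables above.
orders = pd.DataFrame({
    'order_id': [1, 2, 3],
    'customer_id': [2, 2, 3],
    'product_id': [3, 2, 1],
})
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'customer_name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
    'address': ['123 Main St.', '456 Park Ave.', '798 Broadway'],
})

# Step 1: find which customer placed order 1.
customer_id = orders.loc[orders['order_id'] == 1, 'customer_id'].iloc[0]

# Step 2: look up that customer's row in the customers table.
customer_row = customers[customers['customer_id'] == customer_id].iloc[0]
print(customer_row['customer_name'], customer_row['address'])
```

Doing this by hand works for one order, but it has to be repeated for every row, which is the problem that merging solves.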

***

### Exercise

1. Examine the `orders` and `products` tables.<br>
<br>
What is the `description` of the product that was ordered in Order $3$?<br>
<br>
Give your answer as a string assigned to the variable `order_3_description`.

In [5]:
orders.head()

Unnamed: 0,order_id,customer_id,product_id,quantity,timestamp
0,1,2,3,1,2017-01-01
1,2,2,2,3,2017-01-01
2,3,3,1,1,2017-01-01
3,4,3,2,2,2017-02-01
4,5,3,3,3,2017-02-01


In [6]:
products.head()

Unnamed: 0,product_id,description,price
0,1,thing-a-ma-jig,5
1,2,whatcha-ma-call-it,10
2,3,doo-hickey,7
3,4,gizmo,3


In [7]:
customers.head()

Unnamed: 0,customer_id,customer_name,address,phone_number
0,1,John Smith,123 Main St.,212-123-4567
1,2,Jane Doe,456 Park Ave.,949-867-5309
2,3,Joe Schmo,798 Broadway,112-358-1321


In [8]:
order_3_description = 'thing-a-ma-jig'

2. Examine the `orders` and `customers` tables.<br>
<br>
What is the `phone_number` of the customer in Order $5$?<br>
<br>
Give your answer as a string assigned to the variable `order_5_phone_number`.

In [9]:
order_5_phone_number = '112-358-1321'

***

## Inner Merge II

It is easy to do this kind of matching for one row, but hard to do it for multiple rows.<br>
<br>
Luckily, Pandas can efficiently do this for the entire table. We use the `.merge` method.<br>
<br>
The `.merge` method looks for columns that are common between two DataFrames and then looks for rows where those columns' values are the same. It then combines the matching rows into a single row in a new table.<br>
<br>
We can call the `pd.merge` function with two tables like this:<br>
<br>
`new_df = pd.merge(orders, customers)`<br>
<br>
This will match up all of the customer information to the orders that each customer made.
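
As a small, self-contained sketch of this call, the example below merges hand-built versions of `orders` and `customers` (trimmed to a few columns for brevity). The only column the two DataFrames share is `customer_id`, so that is the column the merge matches on.

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [1, 2, 3],
    'customer_id': [2, 2, 3],
    'quantity': [1, 3, 1],
})
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'customer_name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
})

# With no extra arguments, pd.merge joins on the columns the two frames share
# (here, only customer_id).
new_df = pd.merge(orders, customers)
print(new_df)
```

Each order row gains the matching customer's name. John Smith has no orders in this sample, so he does not appear in the result.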

***

### Exercise

1. You are an analyst at Cool T-Shirts Inc. You are going to help them analyze some of their sales data.<br>
<br>
There are two DataFrames defined in the file <b>script.py</b>:

* `sales` contains the monthly revenue for Cool T-Shirts Inc. It has two columns: `month` and `revenue`.<br>
* `targets` contains the goals for monthly revenue for each month. It has two columns: `month` and `target`.<br>
<br>
Create a new DataFrame `sales_vs_targets` which contains the merge of `sales` and `targets`.

In [10]:
sales = pd.read_csv('sales.csv')

sales

Unnamed: 0,month,revenue
0,January,300
1,February,290
2,March,310
3,April,325
4,May,475
5,June,495


In [11]:
targets = pd.read_csv('targets.csv')

targets

Unnamed: 0,month,target
0,January,310
1,February,270
2,March,300
3,April,350
4,May,475
5,June,500


In [12]:
sales_vs_targets = pd.merge(sales, targets)

2. Display `sales_vs_targets` using `print`.

In [13]:
sales_vs_targets

Unnamed: 0,month,revenue,target
0,January,300,310
1,February,290,270
2,March,310,300
3,April,325,350
4,May,475,475
5,June,495,500


3. Cool T-Shirts Inc. wants to know the months when they crushed their targets.<br>
<br>
Select the rows from `sales_vs_targets` where `revenue` is greater than `target`. Save these rows to the variable `crushing_it`.

In [14]:
crushing_it = sales_vs_targets[sales_vs_targets.revenue > sales_vs_targets.target]

crushing_it

Unnamed: 0,month,revenue,target
1,February,290,270
2,March,310,300


***

## Inner Merge III

In addition to using `pd.merge`, each DataFrame has its own `merge` method. For instance, if you wanted to merge orders with customers, you could use:<br>
<br>
`new_df = orders.merge(customers)`<br>
<br>
This produces the same DataFrame as if we had called `pd.merge(orders, customers)`.<br>
<br>
We generally use this when we are joining more than two DataFrames together because we can "chain" the commands. The following command would merge `orders` to `customers`, and then the resulting DataFrame to `products`:<br>
<br>
`big_df = orders.merge(customers).merge(products)`
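
The chained form can be checked against the nested `pd.merge` form with a short sketch. The DataFrames here are small hand-built samples, not the lesson's CSV files.

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [1, 2],
    'customer_id': [2, 3],
    'product_id': [3, 1],
})
customers = pd.DataFrame({
    'customer_id': [2, 3],
    'customer_name': ['Jane Doe', 'Joe Schmo'],
})
products = pd.DataFrame({
    'product_id': [1, 3],
    'description': ['thing-a-ma-jig', 'doo-hickey'],
})

# Each .merge returns a new DataFrame, so the calls can be chained.
big_df = orders.merge(customers).merge(products)

# The chained form performs the same steps as nesting pd.merge calls.
nested = pd.merge(pd.merge(orders, customers), products)
print(big_df.equals(nested))
```

Chaining reads left to right in the order the merges happen, which tends to be easier to follow than nested function calls.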

***

### Exercise

1. We have some more data from Cool T-Shirts Inc. The number of men's and women's t-shirts sold per month is in a file called `men_women_sales.csv`. Load this data into a DataFrame called `men_women`.

In [15]:
men_women = pd.read_csv('men_women_sales.csv')

men_women

Unnamed: 0,month,men,women
0,January,30,35
1,February,29,35
2,March,31,29
3,April,32,28
4,May,47,50
5,June,49,45


2. Merge all three DataFrames (`sales`, `targets`, and `men_women`) into one big DataFrame called `all_data`.

In [16]:
all_data = sales.merge(targets).merge(men_women)

3. Display `all_data` using `print`.

In [17]:
all_data

Unnamed: 0,month,revenue,target,men,women
0,January,300,310,30,35
1,February,290,270,29,35
2,March,310,300,31,29
3,April,325,350,32,28
4,May,475,475,47,50
5,June,495,500,49,45


4. Cool T-Shirts Inc. thinks that they have more revenue in months where they sell more women's t-shirts.<br>
<br>
Select the rows of `all_data` where:
* `revenue` is greater than `target`<br>
<br>
AND<br>
<br>
* `women` is greater than `men`<br>
<br>
Save your answer to the variable `results`.

In [18]:
results = all_data[(all_data.revenue > all_data.target) & (all_data.women > all_data.men)]

results

Unnamed: 0,month,revenue,target,men,women
1,February,290,270,29,35


***

## Merge on Specific Columns

In the previous example, the `merge` function "knew" how to combine tables based on the columns that were the same between two tables. For instance, `products` and `orders` both had a column called `product_id`. This will not always be true when we want to perform a merge.<br>
<br>
In practice, the `products` and `customers` DataFrames often would not have columns named `product_id` or `customer_id`. Instead, the key column in each table would simply be called `id`, and it would be implied that `id` means the product ID in the `products` table and the customer ID in the `customers` table. They would look like this:

### Customers

|id|customer_name|address|phone_number|
|:----------|:------------|:------|:-----------|
|1|John Smith|123 Main St.|212-123-4567|
|2|Jane Doe|456 Park Ave.|949-867-5309|
|3|Joe Schmo|798 Broadway|112-358-1321|

### Products

|id|description|price|
|:---------|:----------|:----|
|1|thing-a-ma-jig|5|
|2|whatcha-ma-call-it|10|
|3|doo-hickey|7|
|4|gizmo|3|

**How would this affect our merges?**<br>
<br>
Because the `id` columns would mean something different in each table, our default merges would be wrong.<br>
<br>
One way that we could address this problem is to use `.rename` to rename the columns for our merges. In the example below, we will rename the column `id` to `customer_id`, so that orders and customers have a common column for the merge.<br>
<br>
`pd.merge(`<br>
    `orders,`<br>
    `customers.rename(columns={'id': 'customer_id'}))`
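
To see why the default merge goes wrong and how `rename` fixes it, here is a minimal sketch with hand-built DataFrames in which both tables use `id` as their key column. The sample rows are illustrative.

```python
import pandas as pd

orders = pd.DataFrame({
    'id': [1, 2, 3],
    'customer_id': [2, 2, 3],
    'quantity': [1, 3, 1],
})
customers = pd.DataFrame({
    'id': [1, 2, 3],
    'customer_name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
})

# Default merge joins on the shared `id` column, pairing order 1 with customer 1.
wrong = pd.merge(orders, customers)

# Renaming customers.id first makes customer_id the only shared column.
right = pd.merge(orders, customers.rename(columns={'id': 'customer_id'}))

print(wrong[['id', 'customer_id', 'customer_name']])
print(right[['id', 'customer_id', 'customer_name']])
```

In `wrong`, order 1 is attributed to John Smith even though its `customer_id` is 2. In `right`, the same order is matched to Jane Doe. Note that `.rename` returns a new DataFrame, so `customers` itself is left unchanged.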

***

### Exercise

1. Merge `orders` and `products` using `rename`. Save your results to the variable `orders_products`.

In [19]:
orders = pd.read_csv('orders2.csv')
orders

Unnamed: 0,id,product_id,customer_id,quantity,timestamp
0,1,3,2,1,2017-01-01
1,2,2,2,3,2017-01-01
2,3,1,3,1,2017-01-01
3,4,2,3,2,2016-02-01
4,5,3,3,3,2017-02-01
5,6,4,1,2,2017-03-01
6,7,1,1,1,2017-02-02
7,8,4,1,1,2017-02-02


In [20]:
products = pd.read_csv('products2.csv')
products

Unnamed: 0,id,description,price
0,1,thing-a-ma-jig,5
1,2,whatcha-ma-call-it,10
2,3,doo-hickey,7
3,4,gizmo,3


In [21]:
orders_products = pd.merge(
    orders,
    products.rename(columns={'id': 'product_id'}))

2. Display `orders_products` using print.

In [22]:
orders_products

Unnamed: 0,id,product_id,customer_id,quantity,timestamp,description,price
0,1,3,2,1,2017-01-01,doo-hickey,7
1,5,3,3,3,2017-02-01,doo-hickey,7
2,2,2,2,3,2017-01-01,whatcha-ma-call-it,10
3,4,2,3,2,2016-02-01,whatcha-ma-call-it,10
4,3,1,3,1,2017-01-01,thing-a-ma-jig,5
5,7,1,1,1,2017-02-02,thing-a-ma-jig,5
6,6,4,1,2,2017-03-01,gizmo,3
7,8,4,1,1,2017-02-02,gizmo,3


***

## Merge on Specific Columns II

In the previous exercise, we learned how to use `rename` to merge two DataFrames whose columns don't match.<br>
<br>
If we don't want to do that, we have another option. We could use the keywords `left_on` and `right_on` to specify which columns we want to perform the merge on. In the example below, the "left" table is the one that comes first (`orders`), and the "right" table is the one that comes second (`customers`). This syntax says that we should match the `customer_id` from orders to the `id` in customers.<br>
<br>
`pd.merge(`<br>
    `orders,`<br>
    `customers,`<br>
    `left_on='customer_id',`<br>
    `right_on='id')`<br>
<br>
If we use this syntax, we'll end up with two columns called `id`, one from the first table and one from the second. To keep the two columns distinguishable, `merge` appends a suffix to each overlapping name, so they become `id_x` and `id_y`.<br>
<br>
It will look like this:<br>

|id_x|customer_id|product_id|quantity|timestamp|id_y|customer_name|address|phone_number|
|:---|:----------|:---------|:-------|:--------|:---|:------------|:------|:-----------|
|1|2|3|1|2017-01-01 00:00:00|2|Jane Doe|456 Park Ave.|949-867-5309|
|2|2|2|3|2017-01-01 00:00:00|2|Jane Doe|456 Park Ave.|949-867-5309|
|3|3|1|1|2017-01-01 00:00:00|3|Joe Schmo|798 Broadway|112-358-1321|
|4|3|2|2|2016-02-01 00:00:00|3|Joe Schmo|798 Broadway|112-358-1321|
|5|3|3|3|2017-02-01 00:00:00|3|Joe Schmo|798 Broadway|112-358-1321|
|6|1|4|2|2017-03-01 00:00:00|1|John Smith|123 Main St.|212-123-4567|
|7|1|1|1|2017-02-02 00:00:00|1|John Smith|123 Main St.|212-123-4567|
|8|1|4|1|2017-02-02 00:00:00|1|John Smith|123 Main St.|212-123-4567|

The new column names `id_x` and `id_y` are not very helpful for us when we read the table. We can make them more useful with the keyword argument `suffixes`, which takes a list of suffixes to use instead of `_x` and `_y`.<br>
<br>
For example, we could use the following code to make the suffixes reflect the table names:<br>
<br>
`pd.merge(`<br>
    `orders,`<br>
    `customers,`<br>
    `left_on='customer_id',`<br>
    `right_on='id',`<br>
    `suffixes=['_order', '_customer']`<br>
`)`<br>
<br>
The resulting table would look like this:<br>

|id_order|customer_id|product_id|quantity|timestamp|id_customer|customer_name|address|phone_number|
|:-------|:----------|:---------|:-------|:--------|:----------|:------------|:------|:-----------|
|1|2|3|1|2017-01-01 00:00:00|2|Jane Doe|456 Park Ave.|949-867-5309|
|2|2|2|3|2017-01-01 00:00:00|2|Jane Doe|456 Park Ave.|949-867-5309|
|3|3|1|1|2017-01-01 00:00:00|3|Joe Schmo|798 Broadway|112-358-1321|
|4|3|2|2|2016-02-01 00:00:00|3|Joe Schmo|798 Broadway|112-358-1321|
|5|3|3|3|2017-02-01 00:00:00|3|Joe Schmo|798 Broadway|112-358-1321|
|6|1|4|2|2017-03-01 00:00:00|1|John Smith|123 Main St.|212-123-4567|
|7|1|1|1|2017-02-02 00:00:00|1|John Smith|123 Main St.|212-123-4567|
|8|1|4|1|2017-02-02 00:00:00|1|John Smith|123 Main St.|212-123-4567|
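
The effect of `suffixes` on the column names can be seen in a short sketch. The DataFrames below are trimmed, hand-built samples in which both tables have an `id` column.

```python
import pandas as pd

orders = pd.DataFrame({
    'id': [1, 2, 3],
    'customer_id': [2, 2, 3],
})
customers = pd.DataFrame({
    'id': [1, 2, 3],
    'customer_name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
})

# Default behavior: the overlapping `id` columns become id_x and id_y.
default_suffixes = pd.merge(orders, customers, left_on='customer_id', right_on='id')

# Custom suffixes name each `id` column after the table it came from.
custom_suffixes = pd.merge(
    orders, customers,
    left_on='customer_id', right_on='id',
    suffixes=['_order', '_customer'],
)
print(list(default_suffixes.columns))
print(list(custom_suffixes.columns))
```

Only columns whose names collide receive a suffix; `customer_id` and `customer_name` keep their original names.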

***

### Exercise

1. Merge `orders` and `products` using `left_on` and `right_on`. Use the suffixes `_orders` and `_products`. Save your results to the variable `orders_products`.

In [23]:
orders_products = pd.merge(
    orders,
    products,
    left_on='product_id',
    right_on='id',
    suffixes=['_orders', '_products']
)

2. Display `orders_products` using `print`.

In [24]:
orders_products

Unnamed: 0,id_orders,product_id,customer_id,quantity,timestamp,id_products,description,price
0,1,3,2,1,2017-01-01,3,doo-hickey,7
1,5,3,3,3,2017-02-01,3,doo-hickey,7
2,2,2,2,3,2017-01-01,2,whatcha-ma-call-it,10
3,4,2,3,2,2016-02-01,2,whatcha-ma-call-it,10
4,3,1,3,1,2017-01-01,1,thing-a-ma-jig,5
5,7,1,1,1,2017-02-02,1,thing-a-ma-jig,5
6,6,4,1,2,2017-03-01,4,gizmo,3
7,8,4,1,1,2017-02-02,4,gizmo,3


***

## Mismatched Merges

In our previous examples, there were always matching values when we were performing our merges. What happens when that is not true?<br>
<br>
Let us imagine that our `products` table is out of date and is missing the newest product: Product 5. What happens when someone orders it?

***

### Exercise

1. We have just released a new product with `product_id` equal to `5`. People are ordering this product, but we have not updated the products table.<br>
<br>
In the cell below, you will find two DataFrames: `products` and `orders`. Inspect these DataFrames.<br>
<br>
Notice that the third order in `orders` is for the mysterious new product, but that there is no product with `id` 5 in `products`.<br>

In [25]:
orders = pd.read_csv('orders3.csv')
products = pd.read_csv('products2.csv')

In [26]:
orders

Unnamed: 0,id,product_id,customer_id,quantity,timestamp
0,1,3,2,1,2017-01-01
1,2,2,2,3,2017-01-01
2,3,5,1,1,2017-01-01
3,4,2,3,2,2016-02-01
4,5,3,3,3,2017-02-01


In [27]:
products

Unnamed: 0,id,description,price
0,1,thing-a-ma-jig,5
1,2,whatcha-ma-call-it,10
2,3,doo-hickey,7
3,4,gizmo,3


2. Merge `orders` and `products` and save it to the variable `merged_df`.<br>
<br>
Inspect `merged_df`. What happened to the order with `id` 3?

In [28]:
merged_df = pd.merge(
    orders,
    products,
    left_on='product_id',
    right_on='id',
    suffixes=['_orders', '_products']
)

merged_df

Unnamed: 0,id_orders,product_id,customer_id,quantity,timestamp,id_products,description,price
0,1,3,2,1,2017-01-01,3,doo-hickey,7
1,5,3,3,3,2017-02-01,3,doo-hickey,7
2,2,2,2,3,2017-01-01,2,whatcha-ma-call-it,10
3,4,2,3,2,2016-02-01,2,whatcha-ma-call-it,10


***

## Outer Merge

In the previous exercise, we saw that when we merge two DataFrames whose rows do not match perfectly, we lose the unmatched rows.<br>
<br>
This type of merge (where we only include matching rows) is called an <i>inner merge</i>. There are other types of merges that we can use when we want to keep information from the unmatched rows.<br>
<br>
Suppose that two companies, Company A and Company B, have just merged. They each have a list of customers, but they keep slightly different data. Company A has each customer’s name and email. Company B has each customer’s name and phone number. They have some customers in common, but some are different.<br>

***

`company_a`

|name|email|
|:---|:----|
|Sally Sparrow|sally.sparrow@gmail.com|
|Peter Grant|pgrant@yahoo.com|
|Leslie May|leslie_may@gmail.com|

`company_b`

|name|phone|
|:---|:----|
|Peter Grant|212-345-6789|
|Leslie May|626-987-6543|
|Aaron Burr|303-456-7891|

***

If we wanted to combine the data from both companies without losing the customers who are missing from one of the tables, we could use an <i>Outer Join</i>. An <i>Outer Join</i> would include all rows from both tables, even if they do not match. Any missing values are filled in with `None` or `nan` (which stands for "Not a Number").<br>
<br>
`pd.merge(company_a, company_b, how='outer')`<br>
<br>
The resulting table would look like this:

|name|email|phone|
|:---|:----|:----|
|Sally Sparrow|sally.sparrow@gmail.com|`nan`|
|Peter Grant|pgrant@yahoo.com|212-345-6789|
|Leslie May|leslie_may@gmail.com|626-987-6543|
|Aaron Burr|`nan`|303-456-7891|
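
A minimal sketch of this outer merge, using hand-built versions of the `company_a` and `company_b` tables shown above, also shows one way to find the filled-in cells with `.isna()`.

```python
import pandas as pd

company_a = pd.DataFrame({
    'name': ['Sally Sparrow', 'Peter Grant', 'Leslie May'],
    'email': ['sally.sparrow@gmail.com', 'pgrant@yahoo.com', 'leslie_may@gmail.com'],
})
company_b = pd.DataFrame({
    'name': ['Peter Grant', 'Leslie May', 'Aaron Burr'],
    'phone': ['212-345-6789', '626-987-6543', '303-456-7891'],
})

# how='outer' keeps every customer from both tables.
combined = pd.merge(company_a, company_b, how='outer')

# isna() flags the cells that the outer merge had to fill in.
missing_phone = combined[combined['phone'].isna()]['name'].tolist()
missing_email = combined[combined['email'].isna()]['name'].tolist()
print(missing_phone, missing_email)
```

The row order of an outer merge can differ between pandas versions, so it is safer to filter on the missing values than to rely on row positions.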

***

### Exercise

1. There are two hardware stores in town: Store A and Store B. Store A's inventory is in DataFrame `store_a` and Store B's inventory is in DataFrame `store_b`. They have decided to merge into one big Super Store!<br>
<br>
Combine the inventories of Store A and Store B using an outer merge. Save the results to the variable `store_a_b_outer`.

In [29]:
store_a = pd.read_csv('store_a.csv')
store_a

Unnamed: 0,item,store_a_inventory
0,hammer,12
1,screwdriver,15
2,nails,200
3,screws,350
4,saw,6
5,duct tape,150
6,wrench,12
7,pvc pipe,54


In [30]:
store_b = pd.read_csv('store_b.csv')
store_b

Unnamed: 0,item,store_b_inventory
0,hammer,6
1,nails,250
2,saw,6
3,duct tape,150
4,pvc pipe,54
5,rake,10
6,shovel,15
7,wooden dowels,192


In [31]:
store_a_b_outer = pd.merge(store_a, store_b, how='outer')

2. Display `store_a_b_outer`.<br>
<br>
Which values are `nan` or `None`?<br>
<br>
What does that mean?

In [32]:
store_a_b_outer

Unnamed: 0,item,store_a_inventory,store_b_inventory
0,hammer,12.0,6.0
1,screwdriver,15.0,
2,nails,200.0,250.0
3,screws,350.0,
4,saw,6.0,6.0
5,duct tape,150.0,150.0
6,wrench,12.0,
7,pvc pipe,54.0,54.0
8,rake,,10.0
9,shovel,,15.0
10,wooden dowels,,192.0


<b>A</b>: A value displayed as `nan` means that the item is not in the respective store's inventory.<br>
The following items are missing in `store_a`:

* rake
* shovel
* wooden dowels

The following items are missing in `store_b`:

* screwdriver
* screws
* wrench

***

## Left and Right Merge

Let us return to the merge of Company A and Company B.

### Left Merge

Suppose we want to identify which customers are missing phone information. We would want a list of all customers who have `email`, but do not have `phone`.<br>
<br>
We could get this by performing a <i>Left Merge</i>. A <i>Left Merge</i> includes all rows from the first (left) table, but only rows from the second (right) table that match the first table.<br>
<br>
For this command, the order of the arguments matters. If the first DataFrame is `company_a` and we do a left join, we will only end up with rows that appear in `company_a`.<br>
<br>
By listing `company_a` first, we get all customers from Company A, and only customers from Company B who are <i>also</i> customers of Company A.<br>
<br>
`pd.merge(company_a, company_b, how='left')`<br>
<br>
The result would look like this:

|name|email|phone|
|:---|:----|:----|
|Sally Sparrow|sally.sparrow@gmail.com|`None`|
|Peter Grant|pgrant@yahoo.com|212-345-6789|
|Leslie May|leslie_may@gmail.com|626-987-6543|

Now let us say we want a list of all customers who have `phone` but no `email`. We can do this by performing a <i>Right Merge</i>.

### Right Merge

Right merge is the exact opposite of left merge. Here, the merged table will include all rows from the second (right) table, but only rows from the first (left) table that match the second table.<br>
<br>
By listing `company_a` first and `company_b` second, we get all customers from Company B, and only customers from Company A who are also customers of Company B.<br>
<br>
`pd.merge(company_a, company_b, how="right")`<br>
<br>
The result would look like this:

|name|email|phone|
|:---|:----|:----|
|Peter Grant|pgrant@yahoo.com|212-345-6789|
|Leslie May|leslie_may@gmail.com|626-987-6543|
|Aaron Burr|None|303-456-7891|
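
Both merges can be sketched side by side with hand-built versions of `company_a` and `company_b`. The sketch also checks that a right merge keeps the same customers as a left merge with the arguments swapped.

```python
import pandas as pd

company_a = pd.DataFrame({
    'name': ['Sally Sparrow', 'Peter Grant', 'Leslie May'],
    'email': ['sally.sparrow@gmail.com', 'pgrant@yahoo.com', 'leslie_may@gmail.com'],
})
company_b = pd.DataFrame({
    'name': ['Peter Grant', 'Leslie May', 'Aaron Burr'],
    'phone': ['212-345-6789', '626-987-6543', '303-456-7891'],
})

# Left merge: every Company A customer, with phone filled in where available.
left = pd.merge(company_a, company_b, how='left')
no_phone = left[left['phone'].isna()]['name'].tolist()

# Right merge: every Company B customer, with email filled in where available.
right = pd.merge(company_a, company_b, how='right')
no_email = right[right['email'].isna()]['name'].tolist()

# Swapping the arguments and using how='left' keeps the same customers
# as the right merge; only the column order differs.
swapped = pd.merge(company_b, company_a, how='left')

print(no_phone, no_email)
print(set(right['name']) == set(swapped['name']))
```

Because the two forms keep the same rows, many people stick to left merges and simply choose which DataFrame to list first.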

***

### Exercise

1. Let us return to the two hardware stores, Store A and Store B. They are not quite sure if they want to merge into a big Super Store just yet.<br>
<br>
Store A wants to find out what products they carry that Store B does not carry. Using a left merge, combine `store_a` to `store_b` and save the results to `store_a_b_left`.<br>
<br>
The items with null in `store_b_inventory` are carried by Store A, but not Store B.

In [33]:
store_a_b_left = pd.merge(store_a, store_b, how='left')

store_a_b_left

Unnamed: 0,item,store_a_inventory,store_b_inventory
0,hammer,12,6.0
1,screwdriver,15,
2,nails,200,250.0
3,screws,350,
4,saw,6,6.0
5,duct tape,150,150.0
6,wrench,12,
7,pvc pipe,54,54.0


2. Now, Store B wants to find out what products they carry that Store A does not carry. Use a left merge to combine the two DataFrames, but in the reverse order (i.e., `store_b` followed by `store_a`), and save the results to the variable `store_b_a_left`.<br>
<br>
Which items are not carried by Store A, but are carried by Store B?

In [34]:
store_b_a_left = pd.merge(store_b, store_a, how='left')

store_b_a_left

Unnamed: 0,item,store_b_inventory,store_a_inventory
0,hammer,6,12.0
1,nails,250,200.0
2,saw,6,6.0
3,duct tape,150,150.0
4,pvc pipe,54,54.0
5,rake,10,
6,shovel,15,
7,wooden dowels,192,


***

## Concatenate DataFrames

Sometimes, a dataset is broken into multiple tables. For instance, data is often split into multiple CSV files so that each download is smaller.<br>
<br>
When we need to reconstruct a single DataFrame from multiple smaller DataFrames, we can use the function <b>`pd.concat([df1, df2, df3, ...])`</b>. This works most cleanly when all of the DataFrames have the same columns; a column that is missing from one DataFrame is filled with `nan` for that DataFrame's rows.<br>
<br>
For instance, suppose that we have two DataFrames:<br>
<br>
<b>`df1`</b>

|name|email|
|:---|:----|
|Katja Obinger|k.obinger@gmail.com|
|Alison Hendrix|alisonH@yahoo.com|
|Cosima Niehaus|cosi.niehaus@gmail.com|
|Rachel Duncan|rachelduncan@hotmail.com|

<b>`df2`</b>

|name|email|
|:---|:----|
|Jean Gray|jgray@netscape.net|
|Scott Summers|ssummers@gmail.com|
|Kitty Pryde|kitkat@gmail.com|
|Charles Xavier|cxavier@hotmail.com|

If we want to combine these two DataFrames, we can use the following command:<br>
<br>
`pd.concat([df1, df2])`<br>
<br>
That would result in the following DataFrame:

|name|email|
|:---|:----|
|Katja Obinger|k.obinger@gmail.com|
|Alison Hendrix|alisonH@yahoo.com|
|Cosima Niehaus|cosi.niehaus@gmail.com|
|Rachel Duncan|rachelduncan@hotmail.com|
|Jean Gray|jgray@netscape.net|
|Scott Summers|ssummers@gmail.com|
|Kitty Pryde|kitkat@gmail.com|
|Charles Xavier|cxavier@hotmail.com|
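
One detail worth knowing is that `pd.concat` keeps each DataFrame's original row labels. The sketch below uses shortened, hand-built versions of `df1` and `df2` to compare the default behavior with the `ignore_index=True` option, which renumbers the rows.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['Katja Obinger', 'Alison Hendrix'],
                    'email': ['k.obinger@gmail.com', 'alisonH@yahoo.com']})
df2 = pd.DataFrame({'name': ['Jean Gray', 'Scott Summers'],
                    'email': ['jgray@netscape.net', 'ssummers@gmail.com']})

# Default: rows are stacked and each frame keeps its own index labels.
stacked = pd.concat([df1, df2])

# ignore_index=True discards the old labels and numbers the rows from 0.
renumbered = pd.concat([df1, df2], ignore_index=True)

print(list(stacked.index))     # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```

Duplicate index labels are harmless for display, but renumbering avoids surprises when selecting rows by label with `.loc` later on.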

***

### Exercise

1. An ice cream parlor and a bakery have decided to merge.<br>
<br>
The bakery's menu is stored in the DataFrame `bakery`, and the ice cream parlor's menu is stored in DataFrame `ice_cream`.<br>
<br>
Create their new menu by concatenating the two DataFrames into a DataFrame called `menu`.

In [35]:
bakery = pd.read_csv('bakery.csv')
bakery

Unnamed: 0,item,price
0,cookie,2.5
1,brownie,3.5
2,slice of cake,4.75
3,slice of cheesecake,4.75
4,slice of pie,5.0


In [36]:
ice_cream = pd.read_csv('ice_cream.csv')
ice_cream

Unnamed: 0,item,price
0,scoop of chocolate ice cream,3.0
1,scoop of vanilla ice cream,2.95
2,scoop of strawberry ice cream,3.05
3,scoop of cookie dough ice cream,3.25


In [37]:
menu = pd.concat([bakery, ice_cream])

2. Display `menu`.

In [38]:
menu

Unnamed: 0,item,price
0,cookie,2.5
1,brownie,3.5
2,slice of cake,4.75
3,slice of cheesecake,4.75
4,slice of pie,5.0
0,scoop of chocolate ice cream,3.0
1,scoop of vanilla ice cream,2.95
2,scoop of strawberry ice cream,3.05
3,scoop of cookie dough ice cream,3.25


***

## Review

This lesson introduced some methods for combining multiple DataFrames:

* Creating a DataFrame made by matching the common columns of two DataFrames is called a `merge`
* We can specify which columns should be matched by using the keyword arguments `left_on` and `right_on`
* We can combine DataFrames whose rows don’t all match using `left`, `right`, and `outer` merges and the `how` keyword argument
* We can stack or concatenate DataFrames with the same columns using `pd.concat`

***

### Exercise

1. Cool T-Shirts Inc. just created a website for ordering their products. They want you to analyze two datasets for them:<br>
* `visits` contains information on all visits to their landing page<br>
* `checkouts` contains all users who began to checkout on their website<br>
<br>
Inspect each DataFrame.

In [39]:
visits = pd.read_csv('visits.csv',
                     parse_dates=[1])

visits

Unnamed: 0,user_id,visit_time
0,319350b4-9951-47ef-b3a7-6b252099905f,2017-02-21 07:16:00
1,7435ec9f-576d-4ebd-8791-361b128fca77,2017-05-16 08:37:00
2,0b061e73-f709-42fa-8d1a-5f68176ff154,2017-04-12 19:32:00
3,9133d6f0-e68b-4c8d-bafd-ff2825e8dafe,2017-08-18 04:32:00
4,08d13edb-071c-4cfb-9ee4-8f377d0e932a,2017-07-08 06:24:00
...,...,...
95,442efd1c-8d7b-4d6a-83be-8f2a9e08b34f,2017-02-19 11:20:00
96,5679519b-a901-4970-8656-dbf60ffb618d,2017-07-20 04:23:00
97,26deb2d5-1d7e-4774-bf6e-1df2ee9ee59d,2017-09-06 07:29:00
98,fff8f87a-e4a2-4f2c-b3d4-93a4ece95c4f,2017-06-06 23:42:00


In [40]:
checkouts = pd.read_csv('checkouts.csv',
                        parse_dates=[1])

checkouts

Unnamed: 0,user_id,checkout_time
0,fe90a9f4-960a-4a0d-9160-e562adb79365,2017-11-09 09:25:00
1,1a35b7eb-f603-407d-91be-a2c3304066fd,2017-08-15 21:25:00
2,e2c24ee0-7fdf-4400-abde-b36378fe5ce6,2017-07-04 15:39:00
3,10dbd3c5-d610-44e9-9994-110a7950b6b4,2017-08-09 21:07:00
4,f028e9dd-77d0-4002-83f6-372a4837fda6,2017-10-27 08:57:00
...,...,...
75,7fe800cc-46e8-427c-a7af-f27198d305a1,2017-01-18 13:14:00
76,43db76fc-d522-450d-a371-ef2a683d5bfd,2017-03-26 21:31:00
77,851b52d1-31e9-468d-834f-5363fee108ac,2017-09-21 22:24:00
78,7435ec9f-576d-4ebd-8791-361b128fca77,2017-05-16 08:55:00


2. We want to know the amount of time from a user's initial visit to the website to when they start to check out.<br>
<br>
Use `merge` to combine `visits` and `checkouts` and save it to the variable `v_to_c`.

In [41]:
v_to_c = pd.merge(visits, checkouts)

v_to_c

Unnamed: 0,user_id,visit_time,checkout_time
0,319350b4-9951-47ef-b3a7-6b252099905f,2017-02-21 07:16:00,2017-02-21 07:27:00
1,319350b4-9951-47ef-b3a7-6b252099905f,2017-02-21 07:16:00,2017-02-21 07:40:00
2,7435ec9f-576d-4ebd-8791-361b128fca77,2017-05-16 08:37:00,2017-05-16 08:49:00
3,7435ec9f-576d-4ebd-8791-361b128fca77,2017-05-16 08:37:00,2017-05-16 08:55:00
4,08d13edb-071c-4cfb-9ee4-8f377d0e932a,2017-07-08 06:24:00,2017-07-08 06:32:00
...,...,...,...
75,23a8d1be-3f5c-4b59-aed7-c7f19c51612b,2017-08-11 13:49:00,2017-08-11 14:11:00
76,8d9ac96c-16be-418e-8df4-1a6202d0b36e,2017-10-07 10:23:00,2017-10-07 10:24:00
77,8d9ac96c-16be-418e-8df4-1a6202d0b36e,2017-10-07 10:23:00,2017-10-07 10:52:00
78,5679519b-a901-4970-8656-dbf60ffb618d,2017-07-20 04:23:00,2017-07-20 04:24:00


3. In order to calculate the time between visiting and checking out, define a column of `v_to_c` called `time` by subtracting the visit time from the checkout time.

In [42]:
v_to_c['time'] = v_to_c.checkout_time - \
                 v_to_c.visit_time
 
v_to_c

Unnamed: 0,user_id,visit_time,checkout_time,time
0,319350b4-9951-47ef-b3a7-6b252099905f,2017-02-21 07:16:00,2017-02-21 07:27:00,00:11:00
1,319350b4-9951-47ef-b3a7-6b252099905f,2017-02-21 07:16:00,2017-02-21 07:40:00,00:24:00
2,7435ec9f-576d-4ebd-8791-361b128fca77,2017-05-16 08:37:00,2017-05-16 08:49:00,00:12:00
3,7435ec9f-576d-4ebd-8791-361b128fca77,2017-05-16 08:37:00,2017-05-16 08:55:00,00:18:00
4,08d13edb-071c-4cfb-9ee4-8f377d0e932a,2017-07-08 06:24:00,2017-07-08 06:32:00,00:08:00
...,...,...,...,...
75,23a8d1be-3f5c-4b59-aed7-c7f19c51612b,2017-08-11 13:49:00,2017-08-11 14:11:00,00:22:00
76,8d9ac96c-16be-418e-8df4-1a6202d0b36e,2017-10-07 10:23:00,2017-10-07 10:24:00,00:01:00
77,8d9ac96c-16be-418e-8df4-1a6202d0b36e,2017-10-07 10:23:00,2017-10-07 10:52:00,00:29:00
78,5679519b-a901-4970-8656-dbf60ffb618d,2017-07-20 04:23:00,2017-07-20 04:24:00,00:01:00


4. Calculate the average time to checkout.

In [43]:
print(v_to_c.time.mean())

0 days 00:15:24.750000
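
Subtracting two datetime columns produces a timedelta column, which is why the mean prints as `0 days 00:15:24.750000`. If a plain number of minutes is more convenient, one option is the `.dt.total_seconds()` accessor. The sketch below uses two visit and checkout pairs copied from the output above rather than the full CSV files.

```python
import pandas as pd

# Two visit/checkout pairs taken from the v_to_c output above.
v_to_c = pd.DataFrame({
    'visit_time': pd.to_datetime(['2017-02-21 07:16:00', '2017-05-16 08:37:00']),
    'checkout_time': pd.to_datetime(['2017-02-21 07:27:00', '2017-05-16 08:49:00']),
})

# Checkout minus visit gives a timedelta for each row.
v_to_c['time'] = v_to_c['checkout_time'] - v_to_c['visit_time']

# Convert the timedeltas to minutes as ordinary floats.
minutes = v_to_c['time'].dt.total_seconds() / 60
print(minutes.mean())  # 11.5
```

Plain floats are easier to plot or compare against a threshold than timedelta values.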
