# Concatenating Tables with Set-Like Operations

One of the two ways of combining two tables is to stack one table on top of the other.  When stacking tables, we need to decide:

1. Whether to combine columns by position or by name (and, if combining by name, what to do with mismatches).
2. Which rows to keep.  For this decision, we will take some guidance from SQL clauses.

In [106]:
import pandas as pd
from dfply import *

## Three Types of Operations

* **Union:** Keeps rows from either table.
* **Intersection:** Only keeps rows common to both tables.
* **Set Difference/Except:** Keep rows from the left table *except* those in the right table.
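
The three operations above mirror Python's built-in set operators. As a minimal sketch (the rows below are made-up toy data, not the sales files used later), each table can be treated as a set of row tuples:

```python
# Each row is a tuple; each table is a set of rows (toy data for illustration).
left = {("Ann", 22), ("Bob", 19), ("Yolanda", 19)}
right = {("Ann", 22), ("Bob", 20)}

union_rows = left | right        # rows from either table
intersect_rows = left & right    # rows common to both tables
except_rows = left - right       # rows in left that are not in right

print(sorted(union_rows))
print(sorted(intersect_rows))
print(sorted(except_rows))
```

Because Python sets never hold duplicates, this sketch only models the distinct versions of the operations.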

## Set Operations in Action 

<img src="./img/table_verbs_set.gif" width=800>

## All Operations Match by Position

All operations

* Match columns by position
* Require same number/type of columns

## Distinct Versus All

* **UNION/INTERSECT/SET DIFFERENCE** are **DISTINCT**
    * Only keep distinct rows, removing duplicates.
* **UNION ALL/INTERSECT ALL/SET DIFFERENCE ALL**
    * Keep duplicate rows.
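
The distinct/all difference can be sketched in plain `pandas`. The two small frames below are hypothetical stand-ins for the sales tables; `pd.concat` plays the role of `UNION ALL`, and adding `drop_duplicates` approximates a distinct `UNION`:

```python
import pandas as pd

# Toy frames (illustrative values only).
may = pd.DataFrame({"Salesperson": ["Ann", "Bob"], "Compact": [22, 19]})
apr = pd.DataFrame({"Salesperson": ["Ann", "Bob"], "Compact": [22, 20]})

# UNION ALL: stack the rows and keep duplicates.
union_all_df = pd.concat([may, apr], ignore_index=True)

# UNION (distinct): stack the rows, then drop exact duplicate rows.
union_distinct_df = union_all_df.drop_duplicates().reset_index(drop=True)

print(len(union_all_df), len(union_distinct_df))
```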

**Note:** `pyspark` also includes `unionByName`, which matches columns by name and doesn't require them to be in the same order.
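
Matching by name versus by position also matters in `pandas`, where `pd.concat` aligns columns by name. The sketch below uses toy frames, and the relabeling step is just one possible way to imitate SQL-style positional matching:

```python
import pandas as pd

left = pd.DataFrame({"Sedan": [18], "SUV": [15]})
right = pd.DataFrame({"SUV": [6], "Sedan": [14]})  # same columns, different order

# pd.concat aligns on column names, so the column order in `right` does not matter.
by_name = pd.concat([left, right], ignore_index=True)

# To imitate positional matching, relabel a copy of the right table with the
# left table's column labels before stacking.
right_positional = right.copy()
right_positional.columns = left.columns
by_position = pd.concat([left, right_positional], ignore_index=True)

print(by_name)
print(by_position)
```

In the positional version, the SUV value from `right` lands under `Sedan`, which is why positional matching requires both tables to share the same column order.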

## Example - Auto Sales

In [107]:
sales_may = pd.read_csv('./data/auto_sales_may.csv')
sales_may

Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
0,Ann,22,18,15,12
1,Bob,19,12,17,20
2,Yolanda,19,8,32,15
3,Xerxes,12,23,18,9


In [108]:
sales_apr = pd.read_csv('./data/auto_sales_apr.csv')
sales_apr

Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
0,Ann,22,18,15,12
1,Bob,20,14,6,24
2,Yolanda,19,10,28,17
3,Xerxes,11,27,17,9


# Concatenating Tables with Set-Like Operations in `dfply`

Now let's look at combining tables with `union`, `intersect`, and `set_diff` in `dfply`.

## Unions with `dfply`

Use `left_table >> union(right_table)`

In [109]:
sales_may >> union(sales_apr)

Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
0,Ann,22,18,15,12
1,Bob,19,12,17,20
2,Yolanda,19,8,32,15
3,Xerxes,12,23,18,9
1,Bob,20,14,6,24
2,Yolanda,19,10,28,17
3,Xerxes,11,27,17,9


## `dfply.union` is distinct

Since Ann has the same sales in both months, her row is included only once.  Note that we can use `keep='first'` or `keep='last'` to determine which of the duplicate rows is kept.

In [110]:
sales_may >> union(sales_apr, keep='last')

Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
1,Bob,19,12,17,20
2,Yolanda,19,8,32,15
3,Xerxes,12,23,18,9
0,Ann,22,18,15,12
1,Bob,20,14,6,24
2,Yolanda,19,10,28,17
3,Xerxes,11,27,17,9
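
The `keep` argument behaves much like the `keep` argument of `pandas`' `drop_duplicates`. The sketch below uses toy frames and plain `pandas`; treating it as a model of what `dfply.union` does internally is an assumption, not something confirmed here:

```python
import pandas as pd

# Toy frames (illustrative values only).
may = pd.DataFrame({"Salesperson": ["Ann", "Bob"], "Compact": [22, 19]})
apr = pd.DataFrame({"Salesperson": ["Ann", "Bob"], "Compact": [22, 20]})

# Stacking without ignore_index keeps the original index labels: 0, 1, 0, 1.
stacked = pd.concat([may, apr])

first = stacked.drop_duplicates(keep="first")  # keeps the May copy of Ann's row
last = stacked.drop_duplicates(keep="last")    # keeps the April copy of Ann's row

print(first.index.tolist())
print(last.index.tolist())
```

The set of distinct rows is the same either way; only the position and index label of the surviving duplicate change.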


## Making `union_all`

We can use `pd.concat` to perform a `UNION ALL`

In [111]:
pd.concat([sales_apr, sales_may], ignore_index=True)

Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
0,Ann,22,18,15,12
1,Bob,20,14,6,24
2,Yolanda,19,10,28,17
3,Xerxes,11,27,17,9
4,Ann,22,18,15,12
5,Bob,19,12,17,20
6,Yolanda,19,8,32,15
7,Xerxes,12,23,18,9


## Making a `dfply.union_all`

In [112]:
@dfpipe
def union_all(left_df, right_df, ignore_index=True):
    return pd.concat([left_df, right_df], ignore_index=ignore_index)

In [113]:
sales_may >> union_all(sales_apr)

Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
0,Ann,22,18,15,12
1,Bob,19,12,17,20
2,Yolanda,19,8,32,15
3,Xerxes,12,23,18,9
4,Ann,22,18,15,12
5,Bob,20,14,6,24
6,Yolanda,19,10,28,17
7,Xerxes,11,27,17,9


## Adding a month column

Another way to keep both of Ann's sales rows is adding a month column (which we should probably do anyway).

In [10]:
sales_may >> mutate(month = 'May') >> union(sales_apr >> mutate(month = 'April'))

Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck,month
0,Ann,22,18,15,12,May
1,Bob,19,12,17,20,May
2,Yolanda,19,8,32,15,May
3,Xerxes,12,23,18,9,May
0,Ann,22,18,15,12,April
1,Bob,20,14,6,24,April
2,Yolanda,19,10,28,17,April
3,Xerxes,11,27,17,9,April


## Finding common rows with `dfply.intersect`

In [11]:
sales_may >> intersect(sales_apr)

Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
0,Ann,22,18,15,12
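
For comparison, a distinct intersection can be approximated in plain `pandas` with an inner merge on all shared columns. This is a sketch on toy frames, not necessarily how `dfply.intersect` is implemented:

```python
import pandas as pd

# Toy frames (illustrative values only).
may = pd.DataFrame({"Salesperson": ["Ann", "Bob"], "Compact": [22, 19]})
apr = pd.DataFrame({"Salesperson": ["Ann", "Bob"], "Compact": [22, 20]})

# With no `on` argument, merge joins on every column the two frames share,
# so only rows that are identical in both tables survive.
common = may.merge(apr, how="inner").drop_duplicates()

print(common)
```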


## Finding rows unique to the left table

Use `left_table >> dfply.set_diff(right_table)`

In [12]:
sales_may >> set_diff(sales_apr)

Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
1,Bob,19,12,17,20
2,Yolanda,19,8,32,15
3,Xerxes,12,23,18,9
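
A set difference can be sketched in plain `pandas` as a left merge with `indicator=True`, keeping the rows flagged `left_only`. The frames are toy data, and this is an illustrative alternative rather than the `dfply.set_diff` implementation:

```python
import pandas as pd

# Toy frames (illustrative values only).
may = pd.DataFrame({"Salesperson": ["Ann", "Bob"], "Compact": [22, 19]})
apr = pd.DataFrame({"Salesperson": ["Ann", "Bob"], "Compact": [22, 20]})

# indicator=True adds a `_merge` column saying where each row was found.
flagged = may.merge(apr.drop_duplicates(), how="left", indicator=True)

# Keep rows found only in the left table, then drop the helper column.
only_may = flagged[flagged["_merge"] == "left_only"].drop(columns="_merge")

print(only_may)
```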


## <font color="red"> Exercise 1 </font>

In the data folder, you will find 6 files, each containing a sample of 100,000 rows from the Uber data for one of the months apr14-sep14.  Perform the following tasks:

1. Use `glob` to get all 6 file paths.
2. Use a regular expression to create a `lambda` function that pulls the month from each file path.
3. Read the 6 data frames into a `dict` with keys equal to the month name and values containing the corresponding data frame.
4. Write a helper function that adds a month column to a data frame.  Use a dictionary comprehension to apply this helper to each `df`.
5. Use the accumulator pattern and `dfply.union` to combine these 6 data frames into one combined `df`.
6. Inspect the head and shape of the resulting `df`

In [18]:
# 1. Read the file paths using glob
from glob import glob
files = glob('./data/uber/uber-trip-data/uber-*.csv')
files = files[1:]
files

['./data/uber/uber-trip-data/uber-raw-data-apr14.csv',
 './data/uber/uber-trip-data/uber-raw-data-aug14.csv',
 './data/uber/uber-trip-data/uber-raw-data-sep14.csv',
 './data/uber/uber-trip-data/uber-raw-data-jul14.csv',
 './data/uber/uber-trip-data/uber-raw-data-jun14.csv',
 './data/uber/uber-trip-data/uber-raw-data-may14.csv']

In [20]:
# 2. Lambda function that pulls the month from a file path
import re
# Compile once; the dots in the path are escaped so they match literally
pattern = re.compile(r'\./data/uber/uber-trip-data/uber-raw-data-(\w{3})\d+\.csv$')
month_name = lambda file: pattern.match(file).group(1)
month_name

<function __main__.<lambda>(file)>

In [25]:
file_name_list = [month_name(file) for file in files]
   
file_name_list

['apr', 'aug', 'sep', 'jul', 'jun', 'may']

In [26]:
# 3. Dictionary mapping each month name to its data frame
import pandas as pd
def read_files(file_list):
    file_dict = {month_name(file): pd.read_csv(file) for file in file_list} 
    return file_dict

months_dict = read_files(files)
months_dict

{'apr':                  Date/Time      Lat      Lon    Base
 0         4/1/2014 0:11:00  40.7690 -73.9549  B02512
 1         4/1/2014 0:17:00  40.7267 -74.0345  B02512
 2         4/1/2014 0:21:00  40.7316 -73.9873  B02512
 3         4/1/2014 0:28:00  40.7588 -73.9776  B02512
 4         4/1/2014 0:33:00  40.7594 -73.9722  B02512
 5         4/1/2014 0:33:00  40.7383 -74.0403  B02512
 6         4/1/2014 0:39:00  40.7223 -73.9887  B02512
 7         4/1/2014 0:45:00  40.7620 -73.9790  B02512
 8         4/1/2014 0:55:00  40.7524 -73.9960  B02512
 9         4/1/2014 1:01:00  40.7575 -73.9846  B02512
 10        4/1/2014 1:19:00  40.7256 -73.9869  B02512
 11        4/1/2014 1:48:00  40.7591 -73.9684  B02512
 12        4/1/2014 1:49:00  40.7271 -73.9803  B02512
 13        4/1/2014 2:11:00  40.6463 -73.7896  B02512
 14        4/1/2014 2:25:00  40.7564 -73.9167  B02512
 15        4/1/2014 2:31:00  40.7666 -73.9531  B02512
 16        4/1/2014 2:43:00  40.7580 -73.9761  B02512
 17        4/1/2014 3

In [29]:
from dfply import *

# 4. Helper that adds a month column to a data frame
def add_month(df, name):
    return df >> mutate(month=name)


In [30]:

new_dict = {name: add_month(df, name) for name, df in months_dict.items()}

In [32]:
# 5. Accumulate the data frames into one with dfply.union
from functools import reduce
combined_df = reduce(lambda x, y: x >> union(y), new_dict.values())
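
The `reduce` call is a compact form of the accumulator pattern. An explicit loop version is sketched below on hypothetical toy frames; to keep the sketch free of `dfply`, `pd.concat` followed by `drop_duplicates` stands in for the distinct `union`:

```python
import pandas as pd

# Toy stand-ins for the monthly data frames (illustrative values only).
frames = {
    "apr": pd.DataFrame({"Base": ["B02512", "B02512"], "month": ["apr", "apr"]}),
    "may": pd.DataFrame({"Base": ["B02598"], "month": ["may"]}),
}

combined = None  # the accumulator
for df in frames.values():
    if combined is None:
        combined = df  # the first frame seeds the accumulator
    else:
        # Stand-in for a distinct union: stack, then drop exact duplicate rows.
        combined = pd.concat([combined, df]).drop_duplicates()

print(combined.shape)
```

Because the union is distinct, rows that are exact duplicates (such as two trips with identical values in every column) are collapsed into one; a `UNION ALL` style concatenation would keep them.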

In [33]:
combined_df.head()

Unnamed: 0,Date/Time,Lat,Lon,Base,month
0,4/1/2014 0:11:00,40.769,-73.9549,B02512,apr
1,4/1/2014 0:17:00,40.7267,-74.0345,B02512,apr
2,4/1/2014 0:21:00,40.7316,-73.9873,B02512,apr
3,4/1/2014 0:28:00,40.7588,-73.9776,B02512,apr
4,4/1/2014 0:33:00,40.7594,-73.9722,B02512,apr


(4451746, 5)

## Up Next

Stuff