<a href="https://colab.research.google.com/github/AKerby/dsci_325_module_7_more_data_management_in_python/blob/main/Joining%20Tables%20in%20Python%20Part%202.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Colab Prep

Execute the following code cells whenever you open or restart the notebook in Google Colab.

In [None]:
!pip install "polars[all]"

In [None]:
!wget https://github.com/WSU-DataScience/dsci_325_module_7_more_data_management_in_python/raw/main/sample_data.zip

In [None]:
!unzip ./sample_data.zip

In [None]:
!pip install more_polars

# Concatenating Tables with Set-Like Operations

One of the two ways of combining two tables is to stack one table on top of the other.  When stacking two tables, we need to decide

1. Whether to combine columns based on position or name (and, if combining by name, what to do with mismatches).
2. Which rows to keep.  Here we will take some guidance from SQL clauses.

## Three Types of Operations

* **Union:** Keeps rows from either table.
* **Intersection:** Keeps only rows that appear in both tables.
* **Set Difference/Except:** Keeps rows from the left table *except* those in the right table.

## Set Operations in Action

<img src="https://github.com/AKerby/dsci_325_module_7_more_data_management_in_python/blob/main/img/table_verbs_set.gif?raw=1" width=800>

## All Operations Match by Position

All operations

* Match columns by position
* Require same number/type of columns

## Distinct Versus All

* **UNION/INTERSECT/SET DIFFERENCE** are **DISTINCT**
    * Only keeps distinct rows, removing duplicates.
* **UNION ALL/INTERSECT ALL/SET DIFFERENCE ALL**
    * Keeps duplicate rows

In [None]:
import polars as pl

In [None]:
sales_apr = pl.read_csv('./sample_data/auto_sales_apr.csv')
sales_apr

Salesperson,Compact,Sedan,SUV,Truck
str,i64,i64,i64,i64
"""Ann""",22,18,15,12
"""Bob""",19,12,17,20
"""Yolanda""",19,8,32,15
"""Xerxes""",12,23,18,9


In [None]:
sales_may = pl.read_csv('./sample_data/auto_sales_may.csv')
sales_may

Salesperson,Compact,Sedan,SUV,Truck
str,i64,i64,i64,i64
"""Ann""",22,18,15,12
"""Bob""",20,14,6,24
"""Yolanda""",19,10,28,17
"""Xerxes""",11,27,17,9


In [None]:
sales_june = pl.read_csv('./sample_data/auto_sales_june.csv')
sales_june

Salesperson,Compact,Sedan,SUV,Truck
str,i64,i64,i64,i64
"""Ann""",18,19,17,12
"""Bob""",20,15,8,23
"""Yolanda""",21,9,22,19
"""Xerxes""",12,25,19,8


## Unions with `polars`

* Use `vstack` to perform a union on 2 tables.
* Use `pl.concat` to perform a union of 3+ tables.

In [None]:
sales_apr.vstack(sales_may)

Salesperson,Compact,Sedan,SUV,Truck
str,i64,i64,i64,i64
"""Ann""",22,18,15,12
"""Bob""",19,12,17,20
"""Yolanda""",19,8,32,15
"""Xerxes""",12,23,18,9
"""Ann""",22,18,15,12
"""Bob""",20,14,6,24
"""Yolanda""",19,10,28,17
"""Xerxes""",11,27,17,9


## `df.vstack` is NOT distinct

You need to perform a `unique` after the union to get distinct rows.

In [None]:
sales_may.vstack(sales_apr).unique()

Salesperson,Compact,Sedan,SUV,Truck
str,i64,i64,i64,i64
"""Ann""",22,18,15,12
"""Bob""",20,14,6,24
"""Yolanda""",19,10,28,17
"""Xerxes""",11,27,17,9
"""Bob""",19,12,17,20
"""Yolanda""",19,8,32,15
"""Xerxes""",12,23,18,9


In [None]:
sales_may.vstack(sales_apr).unique(keep='last')

Salesperson,Compact,Sedan,SUV,Truck
str,i64,i64,i64,i64
"""Bob""",20,14,6,24
"""Yolanda""",19,10,28,17
"""Xerxes""",11,27,17,9
"""Ann""",22,18,15,12
"""Bob""",19,12,17,20
"""Yolanda""",19,8,32,15
"""Xerxes""",12,23,18,9


## Columns are stacked by column location/order!

In [None]:
(sales_apr.select(['Salesperson','Compact','Sedan','SUV','Truck'])
.vstack(sales_may.select(['Salesperson','SUV','Truck','Compact','Sedan']))
)

SchemaError: cannot vstack: because column names in the two DataFrames do not match for left.name='Compact' != right.name='SUV'

## Adding a month column

Another way to keep both of Ann's sales rows is adding a month column (which we should probably do anyway).

In [None]:
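# Flag used by older polars releases to enable keyword arguments in with_columns
pl.Config.with_columns_kwargs = True

(sales_may
 .with_columns(month = pl.lit('May'))   # pl.lit marks 'May' as a literal value, not a column name
 .vstack(sales_apr
         .with_columns(month = pl.lit('April'))
        )
)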

Salesperson,Compact,Sedan,SUV,Truck,month
str,i64,i64,i64,i64,str
"""Ann""",22,18,15,12,"""May"""
"""Bob""",20,14,6,24,"""May"""
"""Yolanda""",19,10,28,17,"""May"""
"""Xerxes""",11,27,17,9,"""May"""
"""Ann""",22,18,15,12,"""April"""
"""Bob""",19,12,17,20,"""April"""
"""Yolanda""",19,8,32,15,"""April"""
"""Xerxes""",12,23,18,9,"""April"""


## No `INTERSECT` or `DIFFERENCE` in `polars`

As of Fall 2022, `polars` lacks both of these set operations.

## Combining multiple files using vstack

The first method of performing a union on more than two tables is to dot-chain `vstack`.

#### Combining the raw files

In [None]:
(sales_apr
 .vstack(sales_may)
 .vstack(sales_june)
)

Salesperson,Compact,Sedan,SUV,Truck
str,i64,i64,i64,i64
"""Ann""",22,18,15,12
"""Bob""",19,12,17,20
"""Yolanda""",19,8,32,15
"""Xerxes""",12,23,18,9
"""Ann""",22,18,15,12
"""Bob""",20,14,6,24
"""Yolanda""",19,10,28,17
"""Xerxes""",11,27,17,9
"""Ann""",18,19,17,12
"""Bob""",20,15,8,23
"""Yolanda""",21,9,22,19
"""Xerxes""",12,25,19,8


#### Adding a month column to each

This approach becomes messy when each table needs processing, as the processing of every subsequent table must be nested inside `vstack`.

In [None]:
(sales_apr
 .with_columns(month = pl.lit('April'))   # pl.lit marks the string as a literal value
 .vstack(sales_may
         .with_columns(month = pl.lit('May'))
        )
 .vstack(sales_june
         .with_columns(month = pl.lit('June'))
        )
)

Salesperson,Compact,Sedan,SUV,Truck,month
str,i64,i64,i64,i64,str
"""Ann""",22,18,15,12,"""April"""
"""Bob""",19,12,17,20,"""April"""
"""Yolanda""",19,8,32,15,"""April"""
"""Xerxes""",12,23,18,9,"""April"""
"""Ann""",22,18,15,12,"""May"""
"""Bob""",20,14,6,24,"""May"""
"""Yolanda""",19,10,28,17,"""May"""
"""Xerxes""",11,27,17,9,"""May"""
"""Ann""",18,19,17,12,"""June"""
"""Bob""",20,15,8,23,"""June"""
"""Yolanda""",21,9,22,19,"""June"""
"""Xerxes""",12,25,19,8,"""June"""


## Combining multiple files using concatenate

Another way to perform unions on many files is the function `pl.concat`, which stacks any number of tables that have the same columns.

#### Combining the raw files

In [None]:
pl.concat([sales_apr, sales_may, sales_june])

Salesperson,Compact,Sedan,SUV,Truck
str,i64,i64,i64,i64
"""Ann""",22,18,15,12
"""Bob""",19,12,17,20
"""Yolanda""",19,8,32,15
"""Xerxes""",12,23,18,9
"""Ann""",22,18,15,12
"""Bob""",20,14,6,24
"""Yolanda""",19,10,28,17
"""Xerxes""",11,27,17,9
"""Ann""",18,19,17,12
"""Bob""",20,15,8,23
"""Yolanda""",21,9,22,19
"""Xerxes""",12,25,19,8


#### Adding a month column to each

In [None]:
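# pl.lit marks each month name as a literal value rather than a column name
pl.concat([sales_apr.with_columns(month = pl.lit('April')),
           sales_may.with_columns(month = pl.lit('May')),
           sales_june.with_columns(month = pl.lit('June')),
           ])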

Salesperson,Compact,Sedan,SUV,Truck,month
str,i64,i64,i64,i64,str
"""Ann""",22,18,15,12,"""April"""
"""Bob""",19,12,17,20,"""April"""
"""Yolanda""",19,8,32,15,"""April"""
"""Xerxes""",12,23,18,9,"""April"""
"""Ann""",22,18,15,12,"""May"""
"""Bob""",20,14,6,24,"""May"""
"""Yolanda""",19,10,28,17,"""May"""
"""Xerxes""",11,27,17,9,"""May"""
"""Ann""",18,19,17,12,"""June"""
"""Bob""",20,15,8,23,"""June"""
"""Yolanda""",21,9,22,19,"""June"""
"""Xerxes""",12,25,19,8,"""June"""


## <font color="red"> Exercise 7.4.1</font>

In the data folder, you will find 6 files that contain a sample of 100,000 rows from the Uber data for the months apr14-sep14.  Perform the following tasks:

1. Read the April-August data frames.
2. Add a month column to each data frame.
3. Use `df.vstack` to combine these data frames into one combined `df`.
4. Use `pl.concat` to combine these data frames into one combined `df`.

In [None]:
# Your code here