# Joining Tables

In [None]:
import warnings
warnings.filterwarnings('ignore')

from reframe import Relation

# Load the two small example relations from comma-separated files.
r1 = Relation('/home/faculty/yasiro01/pub/R1.csv', sep=',')
r2 = Relation('/home/faculty/yasiro01/pub/R2.csv', sep=',')


In [None]:
r1

In [None]:
r2

## Cartesian Product

So far we have restricted ourselves to operators that work on one table (relation) at a time. This is logical in the sense that every operator produces a relation as its result. However, a typical database contains many tables, which may be related to one another. So how do we write queries that use multiple tables?

The first step toward applying the operators we have learned so far to multiple tables is to merge the tables together. We do this using the cartesian product.

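A cartesian product creates one table out of two by pairing each row in table A with each row in table B. If A has a columns and m rows, and B has b columns and n rows, the new relation has a + b columns and m \* n rows. In the example below, two relations with 2 columns and 3 rows each produce a result with 4 columns and 9 rows.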


### Relation 1

COL_1 | COL_2
------|------
 A | B
 C | D
 E | F
 

### Relation 2

COL_3 | COL_4
------|-------
 1 | A
 2 | C
 3 | A
 

### Cartesian Product: Result

COL_1 | COL_2 | COL_3 | COL_4
------|-------|-------|------
 A | B | 1 | A
 A | B | 2 | C
 A | B | 3 | A
 C | D | 1 | A
 C | D | 2 | C
 C | D | 3 | A
 E | F | 1 | A
 E | F | 2 | C
 E | F | 3 | A
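
The result table above can be reproduced with a few lines of plain Python. This is only a sketch: it stores each relation as a tuple of column names plus a list of row tuples, and `cartesian_product` is a hypothetical helper written for this example, not part of the `reframe` library.

```python
from itertools import product

# Relation 1 and Relation 2 from the tables above.
r1_cols = ("COL_1", "COL_2")
r1_rows = [("A", "B"), ("C", "D"), ("E", "F")]
r2_cols = ("COL_3", "COL_4")
r2_rows = [(1, "A"), (2, "C"), (3, "A")]


def cartesian_product(cols_a, rows_a, cols_b, rows_b):
    """Pair every row of A with every row of B (hypothetical helper)."""
    cols = cols_a + cols_b
    rows = [row_a + row_b for row_a, row_b in product(rows_a, rows_b)]
    return cols, rows


cols, rows = cartesian_product(r1_cols, r1_rows, r2_cols, r2_rows)
print(cols)
for row in rows:
    print(row)
```

The column count is the sum of the two column counts, and the row count is the product of the two row counts.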
 

### Cartesian Product: Challenge

Of course this can create an **enormous** table, so the cartesian product is almost always followed by a query that limits the number of rows by comparing a column in relation 1 against a column in relation 2.


### Cartesian Product: Query

COL_1 == COL_4

COL_1 | COL_2 | COL_3 | COL_4
------|-------|-------|------
 A | B | 1 | A
 A | B | 3 | A
 C | D | 2 | C
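
As a rough sketch of the product-then-query pattern, the following plain Python builds the cartesian product of the two example relations and then keeps only the rows where COL_1 equals COL_4. The tuple layout and the variable names are assumptions made for this illustration.

```python
from itertools import product

r1_rows = [("A", "B"), ("C", "D"), ("E", "F")]  # COL_1, COL_2
r2_rows = [(1, "A"), (2, "C"), (3, "A")]        # COL_3, COL_4

# Step 1: cartesian product, giving rows of (COL_1, COL_2, COL_3, COL_4).
combined = [a + b for a, b in product(r1_rows, r2_rows)]

# Step 2: query, keeping rows where COL_1 (index 0) equals COL_4 (index 3).
matched = [row for row in combined if row[0] == row[3]]

for row in matched:
    print(row)
```

Only three of the nine combined rows survive the query, matching the table above.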


## Natural Join

The natural join or `njoin` operator takes the pattern of cartesian product followed by query, and wraps it all into one operation subject to the following:

* The query condition tests for equality
* The query condition of equality applies to all columns with the same name in both relations

You can see this in the following diagram, where we have two relations. Both have a column named C1.

The resulting relation has a single C1 column and contains only the rows where C1 holds the same value in both relations. The remaining columns of each result row are filled in with the values from the matching rows.
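
The two rules for the natural join can be sketched in plain Python. This is an illustration under stated assumptions, not the `reframe` implementation: rows are dictionaries, `natural_join` is a hypothetical helper, and the C2 and C3 columns and all sample values are made up.

```python
def natural_join(rows_a, rows_b):
    """Join two lists of dict rows on every column name they share.

    Hypothetical helper; assumes all rows in a list have the same keys.
    """
    if not rows_a or not rows_b:
        return []
    shared = [col for col in rows_a[0] if col in rows_b[0]]
    joined = []
    for a in rows_a:
        for b in rows_b:
            # Equality must hold on every shared column.
            if all(a[col] == b[col] for col in shared):
                # Merging the dicts keeps one copy of each shared column.
                joined.append({**a, **b})
    return joined


left = [{"C1": 1, "C2": "x"}, {"C1": 2, "C2": "y"}, {"C1": 3, "C2": "z"}]
right = [{"C1": 1, "C3": "p"}, {"C1": 3, "C3": "q"}, {"C1": 4, "C3": "r"}]

result = natural_join(left, right)
for row in result:
    print(row)
```

Rows with C1 equal to 2 or 4 have no partner in the other relation, so they do not appear in the result.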


## SQL


In [None]:
%load_ext sql


In [None]:
%%sql

postgresql://yasiro01:@localhost/jtest


### Cartesian Product: SQL


In [None]:
%%sql



### Natural Join: SQL

In [None]:
%%sql



### Natural Join vs Cartesian Product

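Returning to the cartesian product example, note that a cartesian product followed by an equality query is not 100% equivalent to a natural join: the cartesian product retains and renames the second copy of the shared column, while the natural join keeps a single copy.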
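
The difference in columns can be seen with a small experiment. The sketch below uses Python's built-in `sqlite3` module with an in-memory database rather than the PostgreSQL server used in this notebook; the table names, column names, and values are invented for the illustration. Note that SQLite keeps both copies of the shared column under the same name instead of renaming the second one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE left_rel  (c1 INTEGER, c2 TEXT);
    CREATE TABLE right_rel (c1 INTEGER, c3 TEXT);
    INSERT INTO left_rel  VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO right_rel VALUES (1, 'p'), (3, 'q'), (4, 'r');
""")


def run(sql):
    """Run a query and return (column names, rows)."""
    cur = conn.execute(sql)
    names = [desc[0] for desc in cur.description]
    return names, cur.fetchall()


# Cartesian product followed by an equality query: both c1 columns remain.
cross_cols, cross_rows = run(
    "SELECT * FROM left_rel CROSS JOIN right_rel "
    "WHERE left_rel.c1 = right_rel.c1 ORDER BY left_rel.c1"
)

# Natural join: the shared column c1 appears only once.
natural_cols, natural_rows = run(
    "SELECT * FROM left_rel NATURAL JOIN right_rel ORDER BY left_rel.c1"
)

print(cross_cols, cross_rows)
print(natural_cols, natural_rows)
```

Both queries select the same matching rows; they differ only in how many copies of the shared column they return.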


In [None]:
%%sql



## Natural Join Example

Now let's look at a more realistic example. Our city and country tables present a problem:

* the column `name` is in both relations, but means different things
* the column `population` is in both relations, but means different things
* the column we would like to join on is the **countrycode** column, but it is called `code` in the country relation and `countrycode` in the city relation

We can remedy this in relational algebra by using the rename operator.
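
To make the idea concrete, here is a plain-Python sketch of rename followed by natural join. It does not use the `reframe` API: `rename` and `natural_join` are hypothetical helpers, the new column names `country_name` and `country_population` are arbitrary choices, and the rows are a tiny made-up sample whose population values are placeholders rather than real figures.

```python
def rename(rows, old, new):
    """Return copies of the rows with column `old` renamed to `new`."""
    return [{(new if k == old else k): v for k, v in row.items()} for row in rows]


def natural_join(rows_a, rows_b):
    """Join two lists of dict rows on every column name they share."""
    if not rows_a or not rows_b:
        return []
    shared = [col for col in rows_a[0] if col in rows_b[0]]
    return [
        {**a, **b}
        for a in rows_a
        for b in rows_b
        if all(a[col] == b[col] for col in shared)
    ]


city = [
    {"name": "Oslo", "countrycode": "NOR", "population": 300},
    {"name": "Bergen", "countrycode": "NOR", "population": 200},
    {"name": "Paris", "countrycode": "FRA", "population": 900},
]
country = [
    {"code": "NOR", "name": "Norway", "population": 1000},
    {"code": "FRA", "name": "France", "population": 5000},
]

# Rename so that countrycode is the only column name the relations share.
country = rename(country, "code", "countrycode")
country = rename(country, "name", "country_name")
country = rename(country, "population", "country_population")

joined = natural_join(city, country)
norway_cities = [row["name"] for row in joined if row["country_name"] == "Norway"]
print(norway_cities)
```

Without the three renames, the join would also demand equal `name` and `population` values and would find no shared `countrycode` column at all.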


In [None]:
city = Relation('/home/faculty/yasiro01/pub/city.csv')
country = Relation('/home/faculty/yasiro01/pub/country.csv')

In [None]:
city.head()

In [None]:
country.head()

### Find all cities in Norway

Relational Algebra


### Find all cities in Norway

Structured Query Language


In [None]:
%sql postgresql://yasiro01:@localhost/world


In [None]:
%%sql



In [None]:
%%sql



## Movie Database

The natural join operator works very well on the movie database because the moviecast table and the release_date table have two column names in common.


### Find the names of all of the lead actors in the movies released in October of 2015 in Norway

In [None]:
moviecast = Relation('/home/faculty/yasiro01/pub/cast.csv',sep=',')
release_date = Relation('/home/faculty/yasiro01/pub/release_dates.csv',sep=',')

In [None]:
moviecast.head()

In [None]:
release_date.head()

When you join these two relations directly, the query takes a while because it is doing a very large join. However, we can make it run more quickly by first applying a query to each relation, which reduces the size of the relations involved in the join.
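
The effect of filtering before joining can be sketched on a tiny made-up sample. Everything here is an assumption for illustration: the column names (`title`, `year`, `name`, `n`, `country`, `date`), the idea that `n == 1` marks the lead actor, the sample rows, and the `natural_join` helper, which counts how many row pairs it compares.

```python
def natural_join(rows_a, rows_b):
    """Join on all shared column names; also return the pairs compared."""
    if not rows_a or not rows_b:
        return [], 0
    shared = [col for col in rows_a[0] if col in rows_b[0]]
    joined, pairs_checked = [], 0
    for a in rows_a:
        for b in rows_b:
            pairs_checked += 1
            if all(a[col] == b[col] for col in shared):
                joined.append({**a, **b})
    return joined, pairs_checked


def is_lead(row):
    return row["n"] == 1


def is_target_release(row):
    return row["country"] == "Norway" and row["date"].startswith("2015-10")


cast = [
    {"title": "Film A", "year": 2015, "name": "Actor One", "n": 1},
    {"title": "Film A", "year": 2015, "name": "Actor Two", "n": 2},
    {"title": "Film B", "year": 2015, "name": "Actor Three", "n": 1},
    {"title": "Film C", "year": 2014, "name": "Actor Four", "n": 1},
]
releases = [
    {"title": "Film A", "year": 2015, "country": "Norway", "date": "2015-10-09"},
    {"title": "Film A", "year": 2015, "country": "USA", "date": "2015-09-25"},
    {"title": "Film B", "year": 2015, "country": "Norway", "date": "2015-11-20"},
    {"title": "Film C", "year": 2014, "country": "Norway", "date": "2014-10-03"},
]

# Join first, then query the large result.
joined, slow_pairs = natural_join(cast, releases)
slow_names = [r["name"] for r in joined if is_lead(r) and is_target_release(r)]

# Query each relation first, then join the smaller relations.
leads = [r for r in cast if is_lead(r)]
targets = [r for r in releases if is_target_release(r)]
small_join, fast_pairs = natural_join(leads, targets)
fast_names = [r["name"] for r in small_join]

print(slow_names, slow_pairs)
print(fast_names, fast_pairs)
```

Both orders give the same answer, but the second compares far fewer row pairs, and the gap grows with the size of the relations.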

In [None]:
%sql postgresql://yasiro01:@localhost/movies


In [None]:
%%sql

