# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), Apr 21, 2017**

In [1]:
using DataFrames # load package

## Joining DataFrames

### Preparing DataFrames for a join

In [2]:
x = DataFrame(ID=[1,2,3,4,missing], name = ["Alice", "Bob", "Conor", "Dave","Zed"])
y = DataFrame(id=[1,2,5,6,missing], age = [21,22,23,24,99])
x,y

(5×2 DataFrames.DataFrame
│ Row │ ID      │ name  │
├─────┼─────────┼───────┤
│ 1   │ 1       │ Alice │
│ 2   │ 2       │ Bob   │
│ 3   │ 3       │ Conor │
│ 4   │ 4       │ Dave  │
│ 5   │ missing │ Zed   │, 5×2 DataFrames.DataFrame
│ Row │ id      │ age │
├─────┼─────────┼─────┤
│ 1   │ 1       │ 21  │
│ 2   │ 2       │ 22  │
│ 3   │ 5       │ 23  │
│ 4   │ 6       │ 24  │
│ 5   │ missing │ 99  │)

In [3]:
rename!(x, :ID=>:id) # the join columns must have the same name in both DataFrames

│ Row │ id      │ name  │
├─────┼─────────┼───────┤
│ 1   │ 1       │ Alice │
│ 2   │ 2       │ Bob   │
│ 3   │ 3       │ Conor │
│ 4   │ 4       │ Dave  │
│ 5   │ missing │ Zed   │


### Standard joins: inner, left, right, outer, semi, anti

In [4]:
join(x, y, on=:id) # kind=:inner is the default; rows whose key is missing are matched to each other

│ Row │ id      │ name  │ age │
├─────┼─────────┼───────┼─────┤
│ 1   │ 1       │ Alice │ 21  │
│ 2   │ 2       │ Bob   │ 22  │
│ 3   │ missing │ Zed   │ 99  │


In [5]:
join(x, y, on=:id, kind=:left)

│ Row │ id      │ name  │ age     │
├─────┼─────────┼───────┼─────────┤
│ 1   │ 1       │ Alice │ 21      │
│ 2   │ 2       │ Bob   │ 22      │
│ 3   │ 3       │ Conor │ missing │
│ 4   │ 4       │ Dave  │ missing │
│ 5   │ missing │ Zed   │ 99      │


In [6]:
join(x, y, on=:id, kind=:right)

│ Row │ id      │ name    │ age │
├─────┼─────────┼─────────┼─────┤
│ 1   │ 1       │ Alice   │ 21  │
│ 2   │ 2       │ Bob     │ 22  │
│ 3   │ missing │ Zed     │ 99  │
│ 4   │ 5       │ missing │ 23  │
│ 5   │ 6       │ missing │ 24  │


In [7]:
join(x, y, on=:id, kind=:outer)

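│ Row │ id      │ name    │ age     │
├─────┼─────────┼─────────┼─────────┤
│ 1   │ 1       │ Alice   │ 21      │
│ 2   │ 2       │ Bob     │ 22      │
│ 3   │ 3       │ Conor   │ missing │
│ 4   │ 4       │ Dave    │ missing │
│ 5   │ missing │ Zed     │ 99      │
│ 6   │ 5       │ missing │ 23      │
│ 7   │ 6       │ missing │ 24      │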


In [8]:
join(x, y, on=:id, kind=:semi)

│ Row │ id      │ name  │
├─────┼─────────┼───────┤
│ 1   │ 1       │ Alice │
│ 2   │ 2       │ Bob   │
│ 3   │ missing │ Zed   │


In [9]:
join(x, y, on=:id, kind=:anti)

│ Row │ id │ name  │
├─────┼────┼───────┤
│ 1   │ 3  │ Conor │
│ 2   │ 4  │ Dave  │
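
The six joins above use the `join(...; kind=...)` form that was current when this notebook was written. As a hedged sketch only, assuming a DataFrames.jl 1.x release (where each kind of join is a separate function and matching on `missing` keys must be requested with `matchmissing = :equal`), roughly equivalent calls might look like this. The variable names `inner`, `left`, and so on are illustrative and do not come from the original notebook.

```julia
using DataFrames

# Same data as above, with the key column already named :id in both tables.
x = DataFrame(id = [1, 2, 3, 4, missing], name = ["Alice", "Bob", "Conor", "Dave", "Zed"])
y = DataFrame(id = [1, 2, 5, 6, missing], age = [21, 22, 23, 24, 99])

# Assumption: DataFrames.jl 1.x. There, missing keys raise an error by default,
# so matchmissing = :equal is passed to reproduce the matching shown above.
inner = innerjoin(x, y, on = :id, matchmissing = :equal)
left  = leftjoin(x, y, on = :id, matchmissing = :equal)
right = rightjoin(x, y, on = :id, matchmissing = :equal)
outer = outerjoin(x, y, on = :id, matchmissing = :equal)
semi  = semijoin(x, y, on = :id, matchmissing = :equal)
anti  = antijoin(x, y, on = :id, matchmissing = :equal)

# Row counts should agree with the outputs above; row order may differ.
(nrow(inner), nrow(left), nrow(right), nrow(outer), nrow(semi), nrow(anti))
```

Only the row counts are compared here because the order of rows in a join result is not something this sketch relies on.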


### Cross join

In [10]:
# a cross join does not require the on argument
# it produces the Cartesian product of its arguments
function expand_grid(;xs...) # a simple replacement for expand.grid in R
    reduce((x,y) -> join(x, DataFrame(Pair(y...)), kind=:cross),
           DataFrame(Pair(xs[1]...)), xs[2:end])
end

expand_grid(a=[1,2], b=["a","b","c"], c=[true,false])

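│ Row │ a │ b │ c     │
├─────┼───┼───┼───────┤
│ 1   │ 1 │ a │ true  │
│ 2   │ 1 │ a │ false │
│ 3   │ 1 │ b │ true  │
│ 4   │ 1 │ b │ false │
│ 5   │ 1 │ c │ true  │
│ 6   │ 1 │ c │ false │
│ 7   │ 2 │ a │ true  │
│ 8   │ 2 │ a │ false │
│ 9   │ 2 │ b │ true  │
│ 10  │ 2 │ b │ false │
│ 11  │ 2 │ c │ true  │
│ 12  │ 2 │ c │ false │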
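
The same Cartesian-product idea can be sketched for newer releases. This is a minimal sketch, assuming DataFrames.jl 1.x, where `crossjoin` is a standalone function. The helper name `expand_grid_v1` is hypothetical and is not part of the original notebook or of the library.

```julia
using DataFrames

# Hypothetical helper (assumes DataFrames.jl 1.x): one single-column DataFrame
# is built per keyword argument, and the pieces are folded together with crossjoin.
# At least one keyword argument is required, since reduce has no initial value here.
function expand_grid_v1(; xs...)
    reduce(crossjoin, [DataFrame(name => vals) for (name, vals) in pairs(xs)])
end

grid = expand_grid_v1(a = [1, 2], b = ["a", "b", "c"], c = [true, false])

# 2 * 3 * 2 combinations, one row each
nrow(grid)
```

Folding pairwise keeps the helper close in shape to the `reduce`-based version above, at the cost of creating an intermediate DataFrame at each step.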


### Complex cases of joins

In [11]:
x = DataFrame(id1=[1,1,2,2,missing,missing],
              id2=[1,11,2,21,missing,99],
              name = ["Alice", "Bob", "Conor", "Dave","Zed", "Zoe"])
y = DataFrame(id1=[1,1,3,3,missing,missing],
              id2=[11,1,31,3,missing,999],
              age = [21,22,23,24,99, 100])
x,y

(6×3 DataFrames.DataFrame
│ Row │ id1     │ id2     │ name  │
├─────┼─────────┼─────────┼───────┤
│ 1   │ 1       │ 1       │ Alice │
│ 2   │ 1       │ 11      │ Bob   │
│ 3   │ 2       │ 2       │ Conor │
│ 4   │ 2       │ 21      │ Dave  │
│ 5   │ missing │ missing │ Zed   │
│ 6   │ missing │ 99      │ Zoe   │, 6×3 DataFrames.DataFrame
│ Row │ id1     │ id2     │ age │
├─────┼─────────┼─────────┼─────┤
│ 1   │ 1       │ 11      │ 21  │
│ 2   │ 1       │ 1       │ 22  │
│ 3   │ 3       │ 31      │ 23  │
│ 4   │ 3       │ 3       │ 24  │
│ 5   │ missing │ missing │ 99  │
│ 6   │ missing │ 999     │ 100 │)

In [12]:
join(x, y, on=[:id1, :id2]) # joining on two columns

│ Row │ id1     │ id2     │ name  │ age │
├─────┼─────────┼─────────┼───────┼─────┤
│ 1   │ 1       │ 1       │ Alice │ 22  │
│ 2   │ 1       │ 11      │ Bob   │ 21  │
│ 3   │ missing │ missing │ Zed   │ 99  │


In [13]:
join(x, y, on=[:id1], makeunique=true) # with duplicates all combinations are produced (here :inner join)

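│ Row │ id1     │ id2     │ name  │ id2_1   │ age │
├─────┼─────────┼─────────┼───────┼─────────┼─────┤
│ 1   │ 1       │ 1       │ Alice │ 11      │ 21  │
│ 2   │ 1       │ 1       │ Alice │ 1       │ 22  │
│ 3   │ 1       │ 11      │ Bob   │ 11      │ 21  │
│ 4   │ 1       │ 11      │ Bob   │ 1       │ 22  │
│ 5   │ missing │ missing │ Zed   │ missing │ 99  │
│ 6   │ missing │ missing │ Zed   │ 999     │ 100 │
│ 7   │ missing │ 99      │ Zoe   │ missing │ 99  │
│ 8   │ missing │ 99      │ Zoe   │ 999     │ 100 │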


In [14]:
join(x, y, on=[:id1], kind=:semi) # a :semi join does not produce combinations: each matching row of x appears once

│ Row │ id1     │ id2     │ name  │
├─────┼─────────┼─────────┼───────┤
│ 1   │ 1       │ 1       │ Alice │
│ 2   │ 1       │ 11      │ Bob   │
│ 3   │ missing │ missing │ Zed   │
│ 4   │ missing │ 99      │ Zoe   │
