# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), Apr 21, 2018**

In [1]:
using DataFrames

## Possible pitfalls

### Know what is copied when creating a `DataFrame`

In [2]:
x = DataFrame(rand(3, 5))

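│ Row │ x1       │ x2        │ x3       │ x4       │ x5        │
├─────┼──────────┼───────────┼──────────┼──────────┼───────────┤
│ 1   │ 0.738683 │ 0.575963  │ 0.221199 │ 0.961925 │ 0.171644  │
│ 2   │ 0.247084 │ 0.0548955 │ 0.842629 │ 0.358732 │ 0.849723  │
│ 3   │ 0.428472 │ 0.677116  │ 0.712318 │ 0.780728 │ 0.0472096 │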


In [3]:
y = DataFrame(x)
x === y # no copying performed

true

In [4]:
y = copy(x)
x === y # not the same object

false

In [5]:
all(x[i] === y[i] for i in 1:ncol(x)) # but the columns are the same

true
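The difference between a new container and new columns can be reproduced without DataFrames. The sketch below is only an illustration: it assumes a data frame can be modelled as a plain vector of column vectors, and the names `cols`, `shallow` and `deep` are hypothetical.

```julia
# Hypothetical stand-in for a data frame: a plain vector of column vectors.
cols = [[1, 2, 3], [4, 5, 6]]

shallow = copy(cols)   # new outer container, same inner vectors
deep = deepcopy(cols)  # new outer container and new inner vectors

println(shallow === cols)        # false: the containers differ
println(shallow[1] === cols[1])  # true: the columns are shared
println(deep[1] === cols[1])     # false: the columns were duplicated
println(deep[1] == cols[1])      # true: the values are still equal

# Mutating a shared column is visible through the shallow copy only.
cols[1][1] = 100
println(shallow[1][1])  # 100
println(deep[1][1])     # 1
```

If independent columns are needed, `deepcopy` is one way to get them in this simplified model.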

In [6]:
x = 1:3; y = [1, 2, 3]; df = DataFrame(x=x,y=y) # vectors passed as columns are stored without copying; ranges are the exception

│ Row │ x │ y │
├─────┼───┼───┤
│ 1   │ 1 │ 1 │
│ 2   │ 2 │ 2 │
│ 3   │ 3 │ 3 │


In [7]:
y === df[:y] # the same object

true

In [8]:
typeof(x), typeof(df[:x]) # range is converted to a vector

(UnitRange{Int64}, Array{Int64,1})

### Do not modify the parent of `GroupedDataFrame`

In [9]:
x = DataFrame(id=repeat([1,2], outer=3), x=1:6)
g = groupby(x, :id)

DataFrames.GroupedDataFrame  2 groups with keys: Symbol[:id]
First Group:
3×2 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ id │ x │
├─────┼────┼───┤
│ 1   │ 1  │ 1 │
│ 2   │ 1  │ 3 │
│ 3   │ 1  │ 5 │
⋮
Last Group:
3×2 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ id │ x │
├─────┼────┼───┤
│ 1   │ 2  │ 2 │
│ 2   │ 2  │ 4 │
│ 3   │ 2  │ 6 │

In [10]:
x[1:3, 1]=[2,2,2]
g # the grouping is wrong now: g is only a view of x and still uses the old row assignment

DataFrames.GroupedDataFrame  2 groups with keys: Symbol[:id]
First Group:
3×2 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ id │ x │
├─────┼────┼───┤
│ 1   │ 2  │ 1 │
│ 2   │ 2  │ 3 │
│ 3   │ 1  │ 5 │
⋮
Last Group:
3×2 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ id │ x │
├─────┼────┼───┤
│ 1   │ 2  │ 2 │
│ 2   │ 2  │ 4 │
│ 3   │ 2  │ 6 │
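The same effect can be seen with a plain view in Base Julia. This is a minimal sketch, not how `groupby` is implemented: the row indices of the first group are hard-coded, and the names `ids` and `group1` are hypothetical.

```julia
# The id column from the example above, as a plain vector.
ids = [1, 2, 1, 2, 1, 2]

# Rows that belonged to group 1 when the "grouping" was computed.
group1 = view(ids, [1, 3, 5])
println(collect(group1))  # [1, 1, 1]

# Modify the parent after grouping, as in the cell above.
ids[1:3] = [2, 2, 2]

# The view still points at rows 1, 3 and 5, which no longer all hold id 1.
println(collect(group1))  # [2, 2, 1]
```

The stored row indices are fixed when the view is created, so later changes to the parent are reflected in the values but not in the grouping.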

### Remember that you can filter columns of a `DataFrame` using booleans

In [11]:
srand(1)
x = DataFrame(rand(5, 5))

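│ Row │ x1         │ x2       │ x3       │ x4        │ x5        │
├─────┼────────────┼──────────┼──────────┼───────────┼───────────┤
│ 1   │ 0.236033   │ 0.210968 │ 0.555751 │ 0.209472  │ 0.0769509 │
│ 2   │ 0.346517   │ 0.951916 │ 0.437108 │ 0.251379  │ 0.640396  │
│ 3   │ 0.312707   │ 0.999905 │ 0.424718 │ 0.0203749 │ 0.873544  │
│ 4   │ 0.00790928 │ 0.251662 │ 0.773223 │ 0.287702  │ 0.278582  │
│ 5   │ 0.488613   │ 0.986666 │ 0.28119  │ 0.859512  │ 0.751313  │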


In [12]:
x[x[:x1] .< 0.25] # oops: a single boolean index selects columns, so we filtered columns instead of rows

│ Row │ x1         │ x4        │
├─────┼────────────┼───────────┤
│ 1   │ 0.236033   │ 0.209472  │
│ 2   │ 0.346517   │ 0.251379  │
│ 3   │ 0.312707   │ 0.0203749 │
│ 4   │ 0.00790928 │ 0.287702  │
│ 5   │ 0.488613   │ 0.859512  │


In [13]:
x[x[:x1] .< 0.25, :] # this is probably what we wanted: the boolean vector selects rows

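│ Row │ x1         │ x2       │ x3       │ x4       │ x5        │
├─────┼────────────┼──────────┼──────────┼──────────┼───────────┤
│ 1   │ 0.236033   │ 0.210968 │ 0.555751 │ 0.209472 │ 0.0769509 │
│ 2   │ 0.00790928 │ 0.251662 │ 0.773223 │ 0.287702 │ 0.278582  │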


### Column selection for `DataFrame` creates aliases unless explicitly copied

In [14]:
x = DataFrame(a=1:3)
x[:b] = x[1] # alias
x[:c] = x[:, 1] # also alias
x[:d] = x[1][:] # copy
x[:e] = copy(x[1]) # explicit copy
display(x)
x[1,1] = 100
display(x)

│ Row │ a │ b │ c │ d │ e │
├─────┼───┼───┼───┼───┼───┤
│ 1   │ 1 │ 1 │ 1 │ 1 │ 1 │
│ 2   │ 2 │ 2 │ 2 │ 2 │ 2 │
│ 3   │ 3 │ 3 │ 3 │ 3 │ 3 │


│ Row │ a   │ b   │ c   │ d │ e │
├─────┼─────┼─────┼─────┼───┼───┤
│ 1   │ 100 │ 100 │ 100 │ 1 │ 1 │
│ 2   │ 2   │ 2   │ 2   │ 2 │ 2 │
│ 3   │ 3   │ 3   │ 3   │ 3 │ 3 │
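The distinction between aliasing and copying is ordinary Julia array behaviour, so it can be checked without a data frame. A minimal sketch using only Base, with hypothetical variable names:

```julia
a = [1, 2, 3]

alias = a         # another name for the same vector
sliced = a[:]     # indexing a plain vector with a colon allocates a new vector
copied = copy(a)  # explicit copy

a[1] = 100

println(alias[1])   # 100: the alias sees the change
println(sliced[1])  # 1: independent of a
println(copied[1])  # 1: independent of a
```

This matches columns `d` and `e` in the cell above, which were built from `x[1][:]` and `copy(x[1])` and therefore kept their original values.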
