# DataFrames

Julia has a general interface for tabular data, `Tables`, with several implementations. The most
widely used package is [DataFrames.jl](https://dataframes.juliadata.org/). It takes advantage
of Julia syntax to define certain operations.

## Define a DataFrame

In [1]:
using DataFrames
import Random

Random.seed!(11)
n = 100
data = DataFrame(id = 1:n, x = rand(n), y = randn(n))

Row,id,x,y
,Int64,Float64,Float64
1,1,0.498434,0.527515
2,2,0.389721,1.69565
3,3,0.26454,1.67104
4,4,0.719424,-0.253313
5,5,0.676602,1.90599
6,6,0.184079,1.0981
7,7,0.46979,-0.490607
8,8,0.568002,-1.63391
9,9,0.51084,0.526118
10,10,0.350315,-1.75928

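Because DataFrames.jl accepts table-like inputs, the keyword constructor used above is not the only way to build a `DataFrame`. The following is a minimal sketch, assuming a small made-up table (the names `cols` and `rows` are illustrative only), that builds the same data from a named tuple of vectors and from a vector of named tuples:

```julia
using DataFrames

# Two table-like representations of the same hypothetical data
cols = (id = [1, 2], x = [0.5, 0.7])            # named tuple of column vectors
rows = [(id = 1, x = 0.5), (id = 2, x = 0.7)]   # vector of row named tuples

df_cols = DataFrame(cols)
df_rows = DataFrame(rows)
df_kw   = DataFrame(id = [1, 2], x = [0.5, 0.7])  # keyword form, as in the cell above

println(df_cols == df_rows)  # true
println(df_cols == df_kw)    # true
```

All three calls should produce the same two-column table, so whichever form matches the shape of the incoming data can be used.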

## Subsetting

Let's select a column without making a copy:

In [2]:
data.id

100-element Vector{Int64}:
   1
   2
   3
   4
   5
   6
   7
   8
   9
  10
   ⋮
  92
  93
  94
  95
  96
  97
  98
  99
 100

In [3]:
data[!, :id]

100-element Vector{Int64}:
   1
   2
   3
   4
   5
   6
   7
   8
   9
  10
   ⋮
  92
  93
  94
  95
  96
  97
  98
  99
 100

Modifications to these vectors are reflected in the original data, because they are the stored columns themselves.

In [4]:
xaux = data.x

100-element Vector{Float64}:
 0.49843398493613866
 0.3897206442798764
 0.26454039637865223
 0.7194237171475113
 0.6766024102635798
 0.1840786120276101
 0.4697903183183523
 0.5680018585914868
 0.5108399240545426
 0.3503150251639139
 ⋮
 0.5055020246950553
 0.20199332879192944
 0.4411645679023518
 0.5312642809575425
 0.0985509072108206
 0.6776076972596555
 0.10551392206669663
 0.2581188423585902
 0.8888936999885325

In [5]:
xaux[1] = 100

100

In [6]:
first(data, 2)

Row,id,x,y
,Int64,Float64,Float64
1,1,100.0,0.527515
2,2,0.389721,1.69565


A copy of a column is created as follows:

In [7]:
xaux = data[:, :x]

100-element Vector{Float64}:
 100.0
   0.3897206442798764
   0.26454039637865223
   0.7194237171475113
   0.6766024102635798
   0.1840786120276101
   0.4697903183183523
   0.5680018585914868
   0.5108399240545426
   0.3503150251639139
   ⋮
   0.5055020246950553
   0.20199332879192944
   0.4411645679023518
   0.5312642809575425
   0.0985509072108206
   0.6776076972596555
   0.10551392206669663
   0.2581188423585902
   0.8888936999885325

In [8]:
xaux[1] = 10

10

In [9]:
first(data, 2)

Row,id,x,y
,Int64,Float64,Float64
1,1,100.0,0.527515
2,2,0.389721,1.69565

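The difference between the two indexing forms can also be checked directly with the identity operator `===`. This is a small sketch on a made-up three-row table (`df`, `alias` and `dup` are illustrative names), not part of the original notebook run:

```julia
using DataFrames

df = DataFrame(id = 1:3, x = [0.1, 0.2, 0.3])  # hypothetical data

alias = df[!, :x]   # no copy: the stored column itself
dup   = df[:, :x]   # copy: a new vector with the same values

println(alias === df.x)  # true, same object as `df.x`
println(dup === df.x)    # false, a different object
println(dup == df.x)     # true, equal contents

dup[1] = 99.0            # does not touch `df`
alias[1] = -1.0          # changes the column inside `df`
println(df.x[1])         # -1.0
```

Using `!` avoids an allocation but makes accidental mutation possible; `:` is the safer choice when the vector will be modified independently.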

More specific subsetting can be done by indexing rows and columns:

In [10]:
data.id[1:10]

10-element Vector{Int64}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

In [11]:
data.id[90:end]

11-element Vector{Int64}:
  90
  91
  92
  93
  94
  95
  96
  97
  98
  99
 100

In [12]:
data[1:10, 1:2]

Row,id,x
,Int64,Float64
1,1,100.0
2,2,0.389721
3,3,0.26454
4,4,0.719424
5,5,0.676602
6,6,0.184079
7,7,0.46979
8,8,0.568002
9,9,0.51084
10,10,0.350315


You can also create a view that references your DataFrame:

In [13]:
subdata = view(data, 1:10, 1:3)
subdata.x[1] = rand()
first(data, 2)

Row,id,x,y
,Int64,Float64,Float64
1,1,0.28589,0.527515
2,2,0.389721,1.69565

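To make the relationship between a view and its parent explicit, here is a minimal sketch on a made-up table (`df`, `sub` and `sub2` are illustrative names). It shows that a view is a `SubDataFrame`, that writes go through to the parent, and that the `@view` macro is an alternative spelling:

```julia
using DataFrames

df = DataFrame(id = 1:5, x = [1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical data

sub  = view(df, 1:2, [:id, :x])    # view over the first two rows
sub2 = @view df[1:2, [:id, :x]]    # macro form of the same view

sub.x[2] = 20.0                    # writes through to the parent

println(df.x[2])                   # 20.0
println(sub isa SubDataFrame)      # true
println(parent(sub) === df)        # true
println(sub2.x[2])                 # 20.0, since both views see the same data
```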

## Transform variables

To perform operations on columns, we use the `Pair` syntax `source => function`:

In [14]:
transform(data, :x => (z -> z .^ 2))

Row,id,x,y,x_function
,Int64,Float64,Float64,Float64
1,1,0.28589,0.527515,0.0817333
2,2,0.389721,1.69565,0.151882
3,3,0.26454,1.67104,0.0699816
4,4,0.719424,-0.253313,0.51757
5,5,0.676602,1.90599,0.457791
6,6,0.184079,1.0981,0.0338849
7,7,0.46979,-0.490607,0.220703
8,8,0.568002,-1.63391,0.322626
9,9,0.51084,0.526118,0.260957
10,10,0.350315,-1.75928,0.122721


In [15]:
typeof(:x => (z -> z .^ 2))

Pair{Symbol, Main.var"##235".var"#3#4"}

We can explicitly provide the output name:

In [16]:
transform(data, :x => (z -> z .^ 2) => :x2)

Row,id,x,y,x2
,Int64,Float64,Float64,Float64
1,1,0.28589,0.527515,0.0817333
2,2,0.389721,1.69565,0.151882
3,3,0.26454,1.67104,0.0699816
4,4,0.719424,-0.253313,0.51757
5,5,0.676602,1.90599,0.457791
6,6,0.184079,1.0981,0.0338849
7,7,0.46979,-0.490607,0.220703
8,8,0.568002,-1.63391,0.322626
9,9,0.51084,0.526118,0.260957
10,10,0.350315,-1.75928,0.122721

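The same `source => function => target` pattern extends to several source columns, in which case each column is passed to the function as a separate argument. A minimal sketch, assuming a made-up table and an illustrative output name `:total`:

```julia
using DataFrames

df = DataFrame(x = [1.0, 2.0, 3.0], y = [10.0, 20.0, 30.0])  # hypothetical data

# source columns => function of whole columns => output column name
out = transform(df, [:x, :y] => ((a, b) -> a .+ b) => :total)

println(out.total)   # [11.0, 22.0, 33.0]
println(names(df))   # ["x", "y"], since `df` itself is unchanged
```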

We can also vectorize any function with `ByRow`:

In [17]:
transform(data, :x => ByRow(sqrt))

Row,id,x,y,x_sqrt
,Int64,Float64,Float64,Float64
1,1,0.28589,0.527515,0.534687
2,2,0.389721,1.69565,0.624276
3,3,0.26454,1.67104,0.514335
4,4,0.719424,-0.253313,0.848188
5,5,0.676602,1.90599,0.822558
6,6,0.184079,1.0981,0.429044
7,7,0.46979,-0.490607,0.685413
8,8,0.568002,-1.63391,0.753659
9,9,0.51084,0.526118,0.714731
10,10,0.350315,-1.75928,0.591874

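`ByRow(f)` wraps a scalar function so that it is applied to each element, which is what an explicit broadcast does. The sketch below, on a made-up column with an illustrative output name `:root`, compares the two spellings:

```julia
using DataFrames

df = DataFrame(x = [1.0, 4.0, 9.0])  # hypothetical data

a = transform(df, :x => ByRow(sqrt) => :root)        # element-wise via ByRow
b = transform(df, :x => (z -> sqrt.(z)) => :root)    # element-wise via broadcasting

println(a.root)   # [1.0, 2.0, 3.0]
println(a == b)   # true
```

`ByRow` tends to read better when the function is already defined for scalars, while an anonymous function with dots is convenient for one-off expressions.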

Notice that the previous operations did not modify the original dataset. You can modify
your original dataset using the in-place function `transform!`:

In [18]:
transform!(data, :x => (z -> z .^ 2) => :x2)

Row,id,x,y,x2
,Int64,Float64,Float64,Float64
1,1,0.28589,0.527515,0.0817333
2,2,0.389721,1.69565,0.151882
3,3,0.26454,1.67104,0.0699816
4,4,0.719424,-0.253313,0.51757
5,5,0.676602,1.90599,0.457791
6,6,0.184079,1.0981,0.0338849
7,7,0.46979,-0.490607,0.220703
8,8,0.568002,-1.63391,0.322626
9,9,0.51084,0.526118,0.260957
10,10,0.350315,-1.75928,0.122721


In [19]:
data

Row,id,x,y,x2
,Int64,Float64,Float64,Float64
1,1,0.28589,0.527515,0.0817333
2,2,0.389721,1.69565,0.151882
3,3,0.26454,1.67104,0.0699816
4,4,0.719424,-0.253313,0.51757
5,5,0.676602,1.90599,0.457791
6,6,0.184079,1.0981,0.0338849
7,7,0.46979,-0.490607,0.220703
8,8,0.568002,-1.63391,0.322626
9,9,0.51084,0.526118,0.260957
10,10,0.350315,-1.75928,0.122721


The function `select` works in a similar way, but only includes the desired variables:

In [20]:
select(data, :x => (z -> z .^ 2) => :x2)

Row,x2
,Float64
1,0.0817333
2,0.151882
3,0.0699816
4,0.51757
5,0.457791
6,0.0338849
7,0.220703
8,0.322626
9,0.260957
10,0.122721


In [21]:
select(data, :x, :y)

Row,x,y
,Float64,Float64
1,0.28589,0.527515
2,0.389721,1.69565
3,0.26454,1.67104
4,0.719424,-0.253313
5,0.676602,1.90599
6,0.184079,1.0981
7,0.46979,-0.490607
8,0.568002,-1.63391
9,0.51084,0.526118
10,0.350315,-1.75928

Columns can also be selected by matching their names against a regular expression:

In [22]:
typeof(r"^x")

Regex

In [23]:
select(data, r"^x")

Row,x,x2
,Float64,Float64
1,0.28589,0.0817333
2,0.389721,0.151882
3,0.26454,0.0699816
4,0.719424,0.51757
5,0.676602,0.457791
6,0.184079,0.0338849
7,0.46979,0.220703
8,0.568002,0.322626
9,0.51084,0.260957
10,0.350315,0.122721

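A regular expression is one of several column selectors that `select` accepts. As a rough sketch on a made-up table with the same column names, the following compares it with `Not`, `Between` and `Cols`, printing only the resulting column names:

```julia
using DataFrames

df = DataFrame(id = 1:2, x = [0.1, 0.2], y = [1.0, 2.0], x2 = [0.01, 0.04])  # hypothetical data

println(names(select(df, Not(:id))))          # ["x", "y", "x2"]
println(names(select(df, Between(:x, :y))))   # ["x", "y"]
println(names(select(df, r"^x")))             # ["x", "x2"]
println(names(select(df, Cols(:y, r"^x"))))   # ["y", "x", "x2"]
```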

Let's add a new column:

In [24]:
insertcols!(data, :z => rand(100))

Row,id,x,y,x2,z
,Int64,Float64,Float64,Float64,Float64
1,1,0.28589,0.527515,0.0817333,0.415746
2,2,0.389721,1.69565,0.151882,0.145416
3,3,0.26454,1.67104,0.0699816,0.968211
4,4,0.719424,-0.253313,0.51757,0.122135
5,5,0.676602,1.90599,0.457791,0.650245
6,6,0.184079,1.0981,0.0338849,0.908385
7,7,0.46979,-0.490607,0.220703,0.368075
8,8,0.568002,-1.63391,0.322626,0.63738
9,9,0.51084,0.526118,0.260957,0.853006
10,10,0.350315,-1.75928,0.122721,0.260641


In [25]:
first(data, 5)

Row,id,x,y,x2,z
,Int64,Float64,Float64,Float64,Float64
1,1,0.28589,0.527515,0.0817333,0.415746
2,2,0.389721,1.69565,0.151882,0.145416
3,3,0.26454,1.67104,0.0699816,0.968211
4,4,0.719424,-0.253313,0.51757,0.122135
5,5,0.676602,1.90599,0.457791,0.650245

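By default `insertcols!` appends the new column at the end, as in the output above. It also accepts a position argument; the sketch below uses a made-up table and an illustrative column `:w` to show the resulting order:

```julia
using DataFrames

df = DataFrame(id = 1:3, x = [0.1, 0.2, 0.3])  # hypothetical data

insertcols!(df, :z => [1.0, 2.0, 3.0])      # appended as the last column
insertcols!(df, 2, :w => ["a", "b", "c"])   # inserted as column number 2

println(names(df))   # ["id", "w", "x", "z"]
```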

Let's remove a column:

In [26]:
select!(data, Not(:x2))

Row,id,x,y,z
,Int64,Float64,Float64,Float64
1,1,0.28589,0.527515,0.415746
2,2,0.389721,1.69565,0.145416
3,3,0.26454,1.67104,0.968211
4,4,0.719424,-0.253313,0.122135
5,5,0.676602,1.90599,0.650245
6,6,0.184079,1.0981,0.908385
7,7,0.46979,-0.490607,0.368075
8,8,0.568002,-1.63391,0.63738
9,9,0.51084,0.526118,0.853006
10,10,0.350315,-1.75928,0.260641


In [27]:
first(data, 5)

Row,id,x,y,z
,Int64,Float64,Float64,Float64
1,1,0.28589,0.527515,0.415746
2,2,0.389721,1.69565,0.145416
3,3,0.26454,1.67104,0.968211
4,4,0.719424,-0.253313,0.122135
5,5,0.676602,1.90599,0.650245


Another simple operation is to rename columns:

In [28]:
rename(data, :x => :xnew, :z => :znew)

Row,id,xnew,y,znew
,Int64,Float64,Float64,Float64
1,1,0.28589,0.527515,0.415746
2,2,0.389721,1.69565,0.145416
3,3,0.26454,1.67104,0.968211
4,4,0.719424,-0.253313,0.122135
5,5,0.676602,1.90599,0.650245
6,6,0.184079,1.0981,0.908385
7,7,0.46979,-0.490607,0.368075
8,8,0.568002,-1.63391,0.63738
9,9,0.51084,0.526118,0.853006
10,10,0.350315,-1.75928,0.260641


In [29]:
rename(data, [:x, :z] .=> [:xnew, :znew])

Row,id,xnew,y,znew
,Int64,Float64,Float64,Float64
1,1,0.28589,0.527515,0.415746
2,2,0.389721,1.69565,0.145416
3,3,0.26454,1.67104,0.968211
4,4,0.719424,-0.253313,0.122135
5,5,0.676602,1.90599,0.650245
6,6,0.184079,1.0981,0.908385
7,7,0.46979,-0.490607,0.368075
8,8,0.568002,-1.63391,0.63738
9,9,0.51084,0.526118,0.853006
10,10,0.350315,-1.75928,0.260641


In [30]:
rename(var -> var * "_new", data)

Row,id_new,x_new,y_new,z_new
,Int64,Float64,Float64,Float64
1,1,0.28589,0.527515,0.415746
2,2,0.389721,1.69565,0.145416
3,3,0.26454,1.67104,0.968211
4,4,0.719424,-0.253313,0.122135
5,5,0.676602,1.90599,0.650245
6,6,0.184079,1.0981,0.908385
7,7,0.46979,-0.490607,0.368075
8,8,0.568002,-1.63391,0.63738
9,9,0.51084,0.526118,0.853006
10,10,0.350315,-1.75928,0.260641


Remember to use `rename!` to actually apply the changes to the original dataset.

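The difference between `rename` and `rename!` can be seen in a short sketch on a made-up two-column table (`df` and `renamed` are illustrative names):

```julia
using DataFrames

df = DataFrame(x = [1, 2], z = [3, 4])  # hypothetical data

renamed = rename(df, :x => :xnew)   # returns a new DataFrame
println(names(df))                  # ["x", "z"]
println(names(renamed))             # ["xnew", "z"]

rename!(df, :x => :xnew)            # renames the column in place
println(names(df))                  # ["xnew", "z"]
```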
---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*