# Introduction to DataFrames
**This is from the excellent Julia tutorial by [Bogumił Kamiński](http://bogumilkaminski.pl/about/)** <br>
Entire tutorial available [here](https://github.com/bkamins/Julia-DataFrames-Tutorial)

In any data science application it is critical to understand our data. Two approaches are important here:
- viewing the data and metadescriptors of the data
- visualisation (discussed in Part 3 of this tutorial)

In this notebook we consider how to extract basic information about a `DataFrame` to better understand our data.

In [26]:
using DataFrames

## Getting basic information about a data frame

Let's start by creating a `DataFrame` object, `x`, so that we can learn how to get information on that data frame.

In [27]:
x = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

Row,A,B,C
,Int64,Float64?,String
1,1,1.0,a
2,2,missing,b


The standard `size` function works to get dimensions of the `DataFrame`,

In [28]:
size(x), size(x, 1), size(x, 2)

((2, 3), 2, 3)

as well as `nrow` and `ncol` from R.

In [29]:
nrow(x), ncol(x)

(2, 3)

`describe` gives basic summary statistics of data in your `DataFrame` (check out the help of `describe` for information on how to customize shown statistics).

In [30]:
describe(x)

Row,variable,mean,min,median,max,nmissing,eltype
,Symbol,Union…,Any,Union…,Any,Int64,Type
1,A,1.5,1,1.5,2,0,Int64
2,B,1.0,1.0,1.0,1.0,1,"Union{Missing, Float64}"
3,C,,a,,b,0,String
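
As a minimal sketch of customizing the statistics, `describe` accepts the names of the statistics to compute as positional `Symbol` arguments. The data frame below repeats `x` from above so that the snippet is self-contained; the variable name `stats` is only illustrative.

```julia
using DataFrames

x = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

# ask only for the minimum, the maximum, and the number of missing values
stats = describe(x, :min, :max, :nmissing)

# the result is itself a data frame with one row per column of `x`
names(stats)
```

Because the result is an ordinary `DataFrame`, it can be indexed or filtered like any other data frame.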


You can limit the columns shown by `describe` using the `cols` keyword argument:

In [31]:
describe(x, cols=1:2)

Row,variable,mean,min,median,max,nmissing,eltype
,Symbol,Float64,Real,Float64,Real,Int64,Type
1,A,1.5,1.0,1.5,2.0,0,Int64
2,B,1.0,1.0,1.0,1.0,1,"Union{Missing, Float64}"


`names` returns the names of all columns as strings:

In [32]:
names(x)

3-element Vector{String}:
 "A"
 "B"
 "C"

You can also get the names of the columns with a given `eltype`:

In [33]:
names(x, String)

1-element Vector{String}:
 "C"

Use `propertynames` to get a vector of `Symbol`s:

In [34]:
propertynames(x)

3-element Vector{Symbol}:
 :A
 :B
 :C
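
Besides a type, `names` also accepts the usual column selectors as its second argument. The sketch below rebuilds `x` so that it runs on its own; the result names (`reals`, `reals_or_missing`, `not_a`, `by_regex`) are only illustrative.

```julia
using DataFrames

x = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

# only :A qualifies, because the eltype of :B is Union{Missing, Float64}
reals = names(x, Real)

# allowing Missing in the type picks up :B as well
reals_or_missing = names(x, Union{Missing, Real})

# every column except :A
not_a = names(x, Not(:A))

# columns whose name matches a regular expression
by_regex = names(x, r"^[AB]$")
```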

Broadcasting `eltype` over `eachcol(x)` returns the element types of the columns:

In [35]:
eltype.(eachcol(x))

3-element Vector{Type}:
 Int64
 Union{Missing, Float64}
 String

Here we create a larger `DataFrame`

In [36]:
y = DataFrame(rand(1:10, 1000, 10), :auto)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,6,9,2,8,8,7,7,5,5,6
2,7,5,2,8,2,3,9,6,10,8
3,9,3,2,5,4,1,1,8,10,2
4,1,8,9,6,6,3,10,9,3,5
5,3,9,3,4,5,4,6,6,9,4
6,2,9,5,5,2,6,9,8,3,5
7,9,5,3,4,1,5,2,9,10,9
8,2,5,9,3,8,2,10,1,1,3
9,3,9,7,5,8,5,9,5,2,3
10,6,5,7,9,3,3,4,7,10,7


and then we can use `first` to peek into its first few rows

In [37]:
first(y, 5)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,6,9,2,8,8,7,7,5,5,6
2,7,5,2,8,2,3,9,6,10,8
3,9,3,2,5,4,1,1,8,10,2
4,1,8,9,6,6,3,10,9,3,5
5,3,9,3,4,5,4,6,6,9,4


and `last` to see its bottom rows.

In [38]:
last(y, 3)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,2,5,3,2,6,2,1,9,2,10
2,1,6,3,5,2,2,7,10,2,1
3,7,4,3,2,4,9,3,2,10,6


Using `first` and `last` without the number of rows returns the first/last row of the `DataFrame` as a `DataFrameRow`:

In [39]:
first(y)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,6,9,2,8,8,7,7,5,5,6


In [40]:
last(y)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1000,7,4,3,2,4,9,3,2,10,6


### Displaying large data frames

Create a wide and tall data frame:

In [41]:
df = DataFrame(rand(100, 100), :auto)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19
,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,0.45552,0.555433,0.740774,0.0596569,0.736545,0.939493,0.366589,0.902446,0.00850931,0.485007,0.304666,0.552089,0.894406,0.932637,0.677685,0.374379,0.458715,0.482847,0.935491
2,0.672901,0.184044,0.0518569,0.203401,0.738239,0.181238,0.784923,0.767198,0.385785,0.553629,0.13561,0.422574,0.593631,0.835718,0.552881,0.612443,0.50615,0.817292,0.321651
3,0.61181,0.583987,0.777144,0.152937,0.529241,0.336159,0.717097,0.597794,0.534722,0.387473,0.332585,0.854206,0.974176,0.931126,0.0312932,0.762219,0.679493,0.655252,0.514661
4,0.964767,0.648263,0.920944,0.431347,0.298726,0.650545,0.304841,0.450064,0.30747,0.885946,0.253616,0.5964,0.152767,0.11609,0.105655,0.786122,0.144385,0.19331,0.127736
5,0.828437,0.241369,0.7174,0.932986,0.420209,0.922786,0.106899,0.250276,0.146775,0.219766,0.0567867,0.81503,0.38704,0.395717,0.0572475,0.942152,0.72545,0.918649,0.52505
6,0.943227,0.964661,0.376658,0.637101,0.676966,0.744883,0.453424,0.379667,0.897841,0.544112,0.177826,0.116279,0.925145,0.969287,0.296276,0.0156786,0.511717,0.0735523,0.51169
7,0.714929,0.974504,0.178833,0.0190573,0.593729,0.564033,0.84106,0.627525,0.622828,0.194022,0.45045,0.370131,0.217474,0.83226,0.626027,0.139177,0.214245,0.948488,0.568373
8,0.481384,0.746923,0.775168,0.979524,0.0842161,0.693254,0.312373,0.507279,0.240905,0.301335,0.51099,0.699555,0.486382,0.0841138,0.674899,0.618077,0.677575,0.565751,0.441655
9,0.917328,0.738383,0.93366,0.629119,0.454977,0.500225,0.999938,0.582699,0.677117,0.958552,0.179419,0.938972,0.356777,0.823014,0.0529046,0.629719,0.252546,0.565918,0.768903
10,0.818866,0.339808,0.0103051,0.363723,0.665906,0.842247,0.727994,0.831841,0.601772,0.543042,0.108069,0.393683,0.303953,0.87759,0.540998,0.0237603,0.850595,0.350017,0.283235


We can see that most of its columns were not printed, and that only its first rows are shown. You can easily change this behavior by changing the values of `ENV["LINES"]` and `ENV["COLUMNS"]`.

In [42]:
ENV["LINES"] = 10

10

In [43]:
ENV["COLUMNS"] = 200

200

In [44]:
df

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19
,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,0.45552,0.555433,0.740774,0.0596569,0.736545,0.939493,0.366589,0.902446,0.00850931,0.485007,0.304666,0.552089,0.894406,0.932637,0.677685,0.374379,0.458715,0.482847,0.935491
2,0.672901,0.184044,0.0518569,0.203401,0.738239,0.181238,0.784923,0.767198,0.385785,0.553629,0.13561,0.422574,0.593631,0.835718,0.552881,0.612443,0.50615,0.817292,0.321651
3,0.61181,0.583987,0.777144,0.152937,0.529241,0.336159,0.717097,0.597794,0.534722,0.387473,0.332585,0.854206,0.974176,0.931126,0.0312932,0.762219,0.679493,0.655252,0.514661
4,0.964767,0.648263,0.920944,0.431347,0.298726,0.650545,0.304841,0.450064,0.30747,0.885946,0.253616,0.5964,0.152767,0.11609,0.105655,0.786122,0.144385,0.19331,0.127736
5,0.828437,0.241369,0.7174,0.932986,0.420209,0.922786,0.106899,0.250276,0.146775,0.219766,0.0567867,0.81503,0.38704,0.395717,0.0572475,0.942152,0.72545,0.918649,0.52505
6,0.943227,0.964661,0.376658,0.637101,0.676966,0.744883,0.453424,0.379667,0.897841,0.544112,0.177826,0.116279,0.925145,0.969287,0.296276,0.0156786,0.511717,0.0735523,0.51169
7,0.714929,0.974504,0.178833,0.0190573,0.593729,0.564033,0.84106,0.627525,0.622828,0.194022,0.45045,0.370131,0.217474,0.83226,0.626027,0.139177,0.214245,0.948488,0.568373
8,0.481384,0.746923,0.775168,0.979524,0.0842161,0.693254,0.312373,0.507279,0.240905,0.301335,0.51099,0.699555,0.486382,0.0841138,0.674899,0.618077,0.677575,0.565751,0.441655
9,0.917328,0.738383,0.93366,0.629119,0.454977,0.500225,0.999938,0.582699,0.677117,0.958552,0.179419,0.938972,0.356777,0.823014,0.0529046,0.629719,0.252546,0.565918,0.768903
10,0.818866,0.339808,0.0103051,0.363723,0.665906,0.842247,0.727994,0.831841,0.601772,0.543042,0.108069,0.393683,0.303953,0.87759,0.540998,0.0237603,0.850595,0.350017,0.283235
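
Changing `ENV` affects every subsequent display. As a hedged alternative for a single call, `show` takes the `allcols` and `allrows` keyword arguments. The sketch below uses a small illustrative data frame named `wide` and, as an assumption about how you might want to use it, also captures the printed text in a string.

```julia
using DataFrames

wide = DataFrame(rand(3, 100), :auto)

# print every column of this one data frame, regardless of the display width
show(wide, allcols = true)

# the same call can write into a string, which is handy for logging
text = sprint(io -> show(io, wide, allcols = true))
```

Passing `allrows = true` in the same way prints all rows instead of truncating the output vertically.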


### Most elementary get and set operations

Given the `DataFrame` `x` we have created earlier, here are various ways to grab one of its columns as a `Vector`.

In [45]:
x

Row,A,B,C
,Int64,Float64?,String
1,1,1.0,a
2,2,missing,b


In [46]:
x.A, x[!, 1], x[!, :A] # all get the vector stored in our DataFrame without copying it

([1, 2], [1, 2], [1, 2])

In [47]:
x."A", x[!, "A"] # the same using string indexing

([1, 2], [1, 2])

In [48]:
x[:, 1] # note that this creates a copy

2-element Vector{Int64}:
 1
 2

In [53]:
x[:, 1] === x[:, 1] # each call allocates a fresh copy, so the two results are not identical

false
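
To make the difference between `!` and `:` concrete, the following sketch checks object identity and then mutates each result. It rebuilds `x` so that it is self-contained; the names `alias` and `copied` are only illustrative.

```julia
using DataFrames

x = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

alias = x[!, :A]    # the very vector stored inside x
copied = x[:, :A]   # a freshly allocated copy

alias === x.A       # true
copied === x.A      # false

copied[1] = 100     # changes only the copy
x.A[1]              # still 1

alias[1] = 100      # writes through to the data frame
x.A[1]              # now 100
```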

To grab one row as a `DataFrame`, we can index as follows.

In [54]:
x[1:1, :]

Row,A,B,C
,Int64,Float64?,String
1,1,1.0,a


In [55]:
x[1, :] # this produces a DataFrameRow which is treated as 1-dimensional object similar to a NamedTuple

Row,A,B,C
,Int64,Float64?,String
1,1,1.0,a


We can grab a single cell or element with the same syntax that is used to grab an element of an array.

In [56]:
x[1, 1]

1

or a new `DataFrame` that is a subset of rows and columns

In [57]:
x[1:2, 1:2]

Row,A,B
,Int64,Float64?
1,1,1.0
2,2,missing


You can also use a `Regex` to select columns, and `Not` from InvertedIndices.jl to select both rows and columns:

In [61]:
x[Not(1), r"A"]

Row,A
,Int64
1,2


In [62]:
x[!, Not(2)] # ! indicates that underlying columns are not copied

Row,A,C
,Int64,String
1,1,a
2,2,b


In [63]:
x[:, Not(1)] # : means that the columns will get copied

Row,B,C
,Float64?,String
1,1.0,a
2,missing,b


Assignment of a scalar to a data frame can be done in ranges using broadcasting:

In [64]:
x[1:2, 1:2] .= 1
x

Row,A,B,C
,Int64,Float64?,String
1,1,1.0,a
2,1,1.0,b


Assignment of a vector whose length equals the number of assigned rows also works with broadcasting:

In [65]:
x[1:2, 1:2] .= [1,2]
x

Row,A,B,C
,Int64,Float64?,String
1,1,1.0,a
2,2,2.0,b


Assignment of another data frame of matching size and column names, again using broadcasting:

In [66]:
x[1:2, 1:2] .= DataFrame([5 6; 7 8], [:A, :B])
x

Row,A,B,C
,Int64,Float64?,String
1,5,6.0,a
2,7,8.0,b


**Caution**

With `df[!, :col]` and `df.col` syntax you get a direct (non copying) access to a column of a data frame.
This is potentially unsafe as you can easily corrupt data in the `df` data frame if you resize, sort, etc. the column obtained in this way.
Therefore such access should be used with caution.

Similarly, `df[!, cols]`, when `cols` is a collection of columns, produces a new data frame that holds the same (not copied) columns as the source `df` data frame. Modifying the data frame obtained via `df[!, cols]` might therefore cause problems with the consistency of `df`.

The `df[:, :col]` and `df[:, cols]` syntaxes always copy columns so they are safe to use (and should generally be preferred except for performance or memory critical use cases).

Here are examples of how `Cols` and `Between` can be used to select columns of a data frame.

In [67]:
x = DataFrame(rand(4, 5), :auto)

Row,x1,x2,x3,x4,x5
,Float64,Float64,Float64,Float64,Float64
1,0.259531,0.201792,0.946575,0.108957,0.527328
2,0.842374,0.539274,0.378469,0.382346,0.792338
3,0.823052,0.595051,0.935966,0.179118,0.170161
4,0.479652,0.597522,0.291508,0.110414,0.668503


In [68]:
x[:, Between(:x2, :x4)]

Row,x2,x3,x4
,Float64,Float64,Float64
1,0.201792,0.946575,0.108957
2,0.539274,0.378469,0.382346
3,0.595051,0.935966,0.179118
4,0.597522,0.291508,0.110414


In [69]:
x[:, Cols("x1", Between("x2", "x4"))]

Row,x1,x2,x3,x4
,Float64,Float64,Float64,Float64
1,0.259531,0.201792,0.946575,0.108957
2,0.842374,0.539274,0.378469,0.382346
3,0.823052,0.595051,0.935966,0.179118
4,0.479652,0.597522,0.291508,0.110414
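
`Cols` takes the union of its arguments and keeps the columns in the order in which they first appear, so it can also be used to reorder columns. A minimal sketch, using `names` only to show which columns would be selected (the result names `front` and `mixed` are illustrative):

```julia
using DataFrames

x = DataFrame(rand(4, 5), :auto)

# :x5 first, followed by every column except :x1
front = names(x, Cols(:x5, Not(:x1)))

# columns whose name ends in 1 or 2, followed by :x5
mixed = names(x, Cols(r"[12]$", :x5))
```

The same selectors can be passed directly when indexing, for example `x[:, Cols(:x5, Not(:x1))]`.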


### Views

You can simply create a view of a `DataFrame` (it is more efficient than creating a materialized selection). Here are the possible return value options.

In [70]:
@view x[1:2, 1]

2-element view(::Vector{Float64}, 1:2) with eltype Float64:
 0.25953065622770866
 0.8423742160105971

In [71]:
@view x[1,1]

0-dimensional view(::Vector{Float64}, 1) with eltype Float64:
0.25953065622770866

In [72]:
@view x[1, 1:2] # a DataFrameRow, the same as for x[1, 1:2] without a view

Row,x1,x2
,Float64,Float64
1,0.259531,0.201792


In [73]:
@view x[1:2, 1:2] # a SubDataFrame

Row,x1,x2
,Float64,Float64
1,0.259531,0.201792
2,0.842374,0.539274
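
Since a view does not copy data, writing to it changes the parent data frame, while a materialized selection is independent. The sketch below illustrates this with a small hypothetical data frame `tbl`; the names `sdf` and `copied` are also only illustrative.

```julia
using DataFrames

tbl = DataFrame(a = [1, 2, 3], b = [10, 20, 30])

sdf = @view tbl[1:2, :]   # a SubDataFrame; no data is copied
sdf[1, :a] = 99           # writes through to the parent
tbl[1, :a]                # 99
parent(sdf) === tbl       # true

copied = tbl[1:2, :]      # a materialized selection has its own columns
copied[1, :a] = 0
tbl[1, :a]                # still 99
```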


### Adding new columns to a data frame

In [74]:
df = DataFrame()

using `setproperty!`

In [75]:
x = [1, 2, 3]
df.a = x
df

Row,a
,Int64
1,1
2,2
3,3


In [76]:
df.a === x # no copy is performed

true

using `setindex!`

In [77]:
df[!, :b] = x
df[:, :c] = x
df

Row,a,b,c
,Int64,Int64,Int64
1,1,1,1
2,2,2,2
3,3,3,3


In [78]:
df.b === x # no copy

true

In [79]:
df.c === x # copy

false

In [80]:
df[!, :d] .= x
df[:, :e] .= x
df

Row,a,b,c,d,e
,Int64,Int64,Int64,Int64,Int64
1,1,1,1,1,1
2,2,2,2,2,2
3,3,3,3,3,3


In [81]:
df.d === x, df.e === x # both copy, so in this case `!` and `:` have the same effect

(false, false)

Note that in our data frame the columns `:a` and `:b` store the vector `x` itself (not a copy):

In [82]:
df.a === df.b === x

true

This can lead to silent errors. For example this code leads to a bug (note that calling `pairs` on `eachcol(df)` creates an iterator of (column name, column) pairs):

In [83]:
for (n, c) in pairs(eachcol(df))
    println("$n: ", pop!(c))
end

a: 3
b: 2
c: 3
d: 3
e: 3


Note that for column `:b` we printed `2`, because `3` had already been removed from it when we used `pop!` on column `:a` (both columns are the same vector).

Such mistakes sometimes happen. Because of this DataFrames.jl performs consistency checks before doing an expensive operation (most notably before showing a data frame).

In [84]:
df

AssertionError: AssertionError: Data frame is corrupt: length of column :c (2) does not match length of column 1 (1). The column vector has likely been resized unintentionally (either directly or because it is shared with another data frame).

We can investigate the columns to find out what happened:

In [85]:
collect(pairs(eachcol(df)))

5-element Vector{Pair{Symbol, AbstractVector{T} where T}}:
 :a => [1]
 :b => [1]
 :c => [1, 2]
 :d => [1, 2]
 :e => [1, 2]

The output confirms that the data frame `df` got corrupted.
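
One way to avoid this kind of aliasing, sketched here under the assumption that an extra copy is affordable, is to let the `DataFrame` constructor copy its inputs (the default, `copycols=true`), or to replace an aliased column with a copy. The names `safe` and `shared` are only illustrative.

```julia
using DataFrames

x = [1, 2, 3]

# copycols=true is the default: every column gets its own copy of x
safe = DataFrame(a = x, b = x)

# copycols=false stores x itself in both columns (fast, but aliased)
shared = DataFrame(a = x, b = x, copycols = false)

safe.a === x, safe.a === safe.b   # (false, false)
shared.a === shared.b === x       # true

# an existing alias can be broken by replacing one column with a copy
shared.b = copy(shared.b)
shared.a === shared.b             # false
```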

DataFrames.jl supports a complete set of `getindex`, `getproperty`, `setindex!`, `setproperty!`, `view`, broadcasting, and broadcasting assignment operations. The details are explained here: http://juliadata.github.io/DataFrames.jl/latest/lib/indexing/.

### Comparisons

In [86]:
using DataFrames

In [87]:
df = DataFrame(rand(2,3), :auto)

Row,x1,x2,x3
,Float64,Float64,Float64
1,0.134011,0.541065,0.995412
2,0.506748,0.295761,0.0493961


In [88]:
df2 = copy(df)

Row,x1,x2,x3
,Float64,Float64,Float64
1,0.134011,0.541065,0.995412
2,0.506748,0.295761,0.0493961


In [89]:
df == df2 # compares column names and contents

true

Create a minimally different data frame and use `isapprox` for comparison:

In [90]:
df3 = df2 .+ eps()

Row,x1,x2,x3
,Float64,Float64,Float64
1,0.134011,0.541065,0.995412
2,0.506748,0.295761,0.0493961


In [91]:
df == df3

false

In [92]:
isapprox(df, df3)

true

In [93]:
isapprox(df, df3, atol = eps()/2)

false
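
The effect of the tolerances is easier to see with a larger, known difference. In this sketch the data frames `a` and `b` are made up for illustration and differ by about `1e-6` in a single cell; the result names are illustrative as well.

```julia
using DataFrames

a = DataFrame(v = [1.0, 2.0])
b = DataFrame(v = [1.0, 2.0 + 1e-6])   # differs by about 1e-6 in one cell

exact = a == b                         # false: exact comparison
default_tol = isapprox(a, b)           # false: the default relative tolerance is much tighter
with_atol = isapprox(a, b, atol = 1e-5)  # true: absolute tolerance above the difference
with_rtol = isapprox(a, b, rtol = 1e-5)  # true: relative tolerance above the difference
```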

`missing` values are handled as in Julia Base:

In [94]:
df = DataFrame(a=missing)

Row,a
,Missing
1,missing


In [95]:
df == df

missing

In [96]:
df === df

true

In [97]:
isequal(df, df)

true
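
Because `==` can return `missing`, its result cannot be used directly as a condition in `if`. A minimal sketch of two ways around this, using a made-up data frame `m`: `isequal` always returns a `Bool`, and `coalesce` turns a `missing` result into a chosen default.

```julia
using DataFrames

m = DataFrame(a = [1, missing])

r = m == m                     # missing, since missing == missing is missing
same = isequal(m, m)           # true; isequal treats missing as equal to missing
strict = coalesce(m == m, false)  # false; an unknown result is treated as "not equal"
```

Which of the two is appropriate depends on whether an undetermined comparison should count as equal in your application.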