# Introduction to DataFrames
**This is from the excellent Julia tutorial by [Bogumił Kamiński](http://bogumilkaminski.pl/about/)** <br>
Entire tutorial available [here](https://github.com/bkamins/Julia-DataFrames-Tutorial)

Julia has a library to handle tabular data. DataFrames.jl provides a set of tools for working with tabular data in Julia. Its design and functionality are similar to those of pandas (in Python) and data.frame, data.table and dplyr (in R), making it a great general purpose data science tool, especially for those coming to Julia from R or Python.

DataFrames.jl plays a central role in the Julia Data ecosystem, and has tight integrations with a range of different libraries. DataFrames.jl isn't the only tool for working with tabular data in Julia – as noted below, there are some other great libraries for certain use-cases – but it provides great data wrangling functionality through a familiar interface.

Full details on working with DataFrames are provided in the [documentation](https://dataframes.juliadata.org/stable/). Here we aim to give a more intuitive understanding of DataFrames, and of how to apply them to real-world data, by constructing and manipulating data frames built from a variety of arrays.

As we'll see later in the tutorial, one of the key benefits of DataFrames in Julia is the number of libraries that it interfaces with. This can be extremely useful when we apply statistical and machine learning analysis on our data.

Let's get started by loading the `DataFrames` package (we also load `Random`, which provides the `randstring` function used below).

In [1]:
using DataFrames, Random

## Constructors and conversion

### Constructors

In this section, you'll see many ways to create a `DataFrame` using the `DataFrame()` constructor.

First, we could create an empty DataFrame,

In [2]:
DataFrame()

Or we could call the constructor using keyword arguments to add columns to the `DataFrame`.

In [3]:
DataFrame(A=1:3, B=rand(3), C=randstring.([3,3,3]), fixed=1)

Unnamed: 0_level_0,A,B,C,fixed
Unnamed: 0_level_1,Int64,Float64,String,Int64
1,1,0.960295,jKs,1
2,2,0.321839,cBj,1
3,3,0.0685316,dnn,1


Note in column `:fixed` that scalars are automatically broadcast.
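
To make the scalar broadcasting concrete, here is a minimal sketch (the variable name `df` is just for illustration) that checks that the scalar passed as `fixed` ends up as an ordinary column with one entry per row:

```julia
using DataFrames

# A scalar keyword argument is expanded to match the length of the other columns.
df = DataFrame(A=1:3, fixed=1)

# The resulting column is a regular vector with three entries.
df.fixed
size(df)
```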

We can create a `DataFrame` from a dictionary, in which case keys from the dictionary will be sorted to create the `DataFrame` columns.

In [4]:
x = Dict("A" => [1,2], "B" => [true, false], "C" => ['a', 'b'], "fixed" => Ref([1,1]))
DataFrame(x)

Unnamed: 0_level_0,A,B,C,fixed
Unnamed: 0_level_1,Int64,Bool,Char,Array…
1,1,1,a,"[1, 1]"
2,2,0,b,"[1, 1]"


This time we used `Ref` to protect a vector from being treated as a column, forcing it to be broadcast into every row of the `:fixed` column (note that the `[1,1]` vector is aliased in each row).
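
The aliasing mentioned above can be observed directly. The following is a small sketch, using the pair-based constructor rather than a dictionary for brevity; it shows that every row of `:fixed` holds the very same vector, so mutating one entry is visible in the other:

```julia
using DataFrames

# `Ref` makes the constructor treat the vector as a single value to repeat.
df = DataFrame("A" => [1, 2], "fixed" => Ref([1, 1]))

# Both rows refer to the same underlying vector object.
same_object = df.fixed[1] === df.fixed[2]

# Mutating the vector through row 1 also changes what row 2 shows.
push!(df.fixed[1], 99)
df.fixed[2]
```

If independent vectors per row are needed, one option is to build the column explicitly, for example with a comprehension such as `[[1, 1] for _ in 1:2]`.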

Rather than explicitly creating a dictionary first, as above, we could pass `DataFrame` arguments with the syntax of dictionary key-value pairs. 

Note that in this case, we use `Symbol`s to denote the column names and arguments are not sorted. For example, `:A`, the symbol, produces `A`, the name of the first column here:

In [5]:
DataFrame(:A => [1,2], :B => [true, false], :C => ['a', 'b'])

Unnamed: 0_level_0,A,B,C
Unnamed: 0_level_1,Int64,Bool,Char
1,1,1,a
2,2,0,b


Although using `Symbol`s rather than strings to denote column names is generally preferred (as it is faster), DataFrames.jl also accepts strings as column names, so this also works:

In [6]:
DataFrame("A" => [1,2], "B" => [true, false], "C" => ['a', 'b'])

Unnamed: 0_level_0,A,B,C
Unnamed: 0_level_1,Int64,Bool,Char
1,1,1,a
2,2,0,b


You can also pass a vector of pairs, which is useful if it is constructed programmatically:

In [7]:
DataFrame([:A => [1,2], :B => [true, false], :C => ['a', 'b'], :fixed => "const"])

Unnamed: 0_level_0,A,B,C,fixed
Unnamed: 0_level_1,Int64,Bool,Char,String
1,1,1,a,const
2,2,0,b,const
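
As an illustration of building the vector of pairs programmatically, here is a minimal sketch; the column names and the `fill`-based contents are made up for the example:

```julia
using DataFrames

# Hypothetical column names; in practice these might come from a file header.
colnames = [:a, :b, :c]

# Build one `name => column` pair per name; column i is filled with the value i.
cols = [name => fill(i, 2) for (i, name) in enumerate(colnames)]

df = DataFrame(cols)
names(df)
```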


Here we create a `DataFrame` from a vector of vectors, and each vector becomes a column.

In [8]:
DataFrame([rand(3) for i in 1:3], :auto)

Unnamed: 0_level_0,x1,x2,x3
Unnamed: 0_level_1,Float64,Float64,Float64
1,0.492692,0.238344,0.0804804
2,0.947955,0.342079,0.992415
3,0.724401,0.590962,0.252757


In [9]:
DataFrame([rand(3) for i in 1:3], [:x1, :x2, :x3])

Unnamed: 0_level_0,x1,x2,x3
Unnamed: 0_level_1,Float64,Float64,Float64
1,0.325778,0.698033,0.411067
2,0.753222,0.0172799,0.0997519
3,0.126513,0.567825,0.771204


In [10]:
DataFrame([rand(3) for i in 1:3], ["x1", "x2", "x3"])

Unnamed: 0_level_0,x1,x2,x3
Unnamed: 0_level_1,Float64,Float64,Float64
1,0.709795,0.918608,0.584487
2,0.347258,0.924728,0.282317
3,0.402179,0.299403,0.908819


As you can see, you pass either a vector of column names as the second argument or `:auto`, in which case column names are generated automatically.

In particular, passing a vector of scalars to the `DataFrame` constructor is not allowed.

In [11]:
DataFrame([1, 2, 3])

LoadError: ArgumentError: 'Vector{Int64}' iterates 'Int64' values, which doesn't satisfy the Tables.jl `AbstractRow` interface

Instead, if you have a vector of single values, use `permutedims` to turn it into a one-row matrix (this way you effectively pass a two-dimensional array to the constructor, which is supported in the same way as the vector of vectors case).

In [12]:
DataFrame(permutedims([1, 2, 3]), :auto)

Unnamed: 0_level_0,x1,x2,x3
Unnamed: 0_level_1,Int64,Int64,Int64
1,1,2,3


You can also pass a vector of `NamedTuple`s to construct a `DataFrame`:

In [4]:
v = [(a=1, b=2), (a=3, b=4)]
DataFrame(v)

Unnamed: 0_level_0,a,b
Unnamed: 0_level_1,Int64,Int64
1,1,2
2,3,4


Alternatively you can pass a `NamedTuple` of vectors:

In [5]:
n = (a=1:3, b=11:13)
DataFrame(n)

Unnamed: 0_level_0,a,b
Unnamed: 0_level_1,Int64,Int64
1,1,11
2,2,12
3,3,13


Here we create a `DataFrame` from a matrix,

In [6]:
DataFrame(rand(3,4), :auto)

Unnamed: 0_level_0,x1,x2,x3,x4
Unnamed: 0_level_1,Float64,Float64,Float64,Float64
1,0.437358,0.225007,0.691457,0.523883
2,0.984384,0.0424498,0.633248,0.78515
3,0.26922,0.204477,0.195621,0.649654


and here we do the same but also pass column names.

In [7]:
DataFrame(rand(3,4), Symbol.('a':'d'))

Unnamed: 0_level_0,a,b,c,d
Unnamed: 0_level_1,Float64,Float64,Float64,Float64
1,0.13163,0.923317,0.475829,0.136971
2,0.742835,0.0779672,0.980853,0.835251
3,0.529157,0.449248,0.651371,0.107638


or

In [8]:
DataFrame(rand(3,4), string.('a':'d'))

Unnamed: 0_level_0,a,b,c,d
Unnamed: 0_level_1,Float64,Float64,Float64,Float64
1,0.0629672,0.93016,0.983825,0.687891
2,0.459148,0.115949,0.0422231,0.706136
3,0.946613,0.95218,0.983836,0.769913


This is how you can create a data frame with no rows, but with predefined columns and their types:

In [9]:
DataFrame(A=Int[], B=Float64[], C=String[])

Unnamed: 0_level_0,A,B,C
Unnamed: 0_level_1,Int64,Float64,String
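
A typical reason to predefine columns and types is to fill the data frame row by row afterwards. The sketch below assumes the standard `push!` method for data frames, which is not otherwise covered in this section:

```julia
using DataFrames

# Start with typed but empty columns.
df = DataFrame(A=Int[], B=Float64[], C=String[])

# Rows can be appended as tuples (matched by position) ...
push!(df, (1, 2.0, "a"))

# ... or as named tuples (matched by column name).
push!(df, (A=2, B=3.0, C="b"))

nrow(df)
```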


Finally, we can create a `DataFrame` by copying an existing `DataFrame`.

Note that `copy` also copies the vectors.

In [10]:
x = DataFrame(a=1:2, b='a':'b')
y = copy(x)
(x === y), isequal(x, y), (x.a == y.a), (x.a === y.a)

(false, true, true, false)

Calling `DataFrame` on a `DataFrame` object works like `copy`.

In [11]:
x = DataFrame(a=1:2, b='a':'b')
y = DataFrame(x)
(x === y), isequal(x, y), (x.a == y.a), (x.a === y.a)

(false, true, true, false)

You can avoid copying the columns of a data frame (when possible) by passing the `copycols=false` keyword argument:

In [12]:
x = DataFrame(a=1:2, b='a':'b')
y = DataFrame(x, copycols=false)
(x === y), isequal(x, y), (x.a == y.a), (x.a === y.a)

(false, true, true, true)

The same rule applies to other constructors:

In [13]:
a = [1, 2, 3]
df1 = DataFrame(a=a)
df2 = DataFrame(a=a, copycols=false)
df1.a === a, df2.a === a

(false, true)
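
The practical consequence of `copycols=false` is that later changes to the source vector show up in the data frame. A minimal sketch of that behaviour:

```julia
using DataFrames

a = [1, 2, 3]
df_copy = DataFrame(a=a)                     # column is a copy of `a`
df_alias = DataFrame(a=a, copycols=false)    # column is `a` itself

# Mutate the original vector after constructing both data frames.
a[1] = 100

# Only the aliased data frame sees the change.
df_copy.a[1], df_alias.a[1]
```

Avoiding the copy saves memory and time for large columns, at the cost of this shared mutable state.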

You can create a similar uninitialized `DataFrame` based on an original one:

In [14]:
x = DataFrame(a=1, b=1.0)

Unnamed: 0_level_0,a,b
Unnamed: 0_level_1,Int64,Float64
1,1,1.0


In [15]:
similar(x)

Unnamed: 0_level_0,a,b
Unnamed: 0_level_1,Int64,Float64
1,5393135616,2.59033e-318


The number of rows in the new `DataFrame` can be passed as a second argument:

In [16]:
similar(x, 0)

Unnamed: 0_level_0,a,b
Unnamed: 0_level_1,Int64,Float64


In [17]:
similar(x, 2)

Unnamed: 0_level_0,a,b
Unnamed: 0_level_1,Int64,Float64
1,4724355824,2.33198e-314
2,4725700592,2.24785e-314
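
Since the values produced by `similar` are uninitialized, only the shape, column names, and element types are meaningful. A short sketch that checks exactly those properties:

```julia
using DataFrames

x = DataFrame(a=1, b=1.0)
y = similar(x, 2)

# The contents of `y` are arbitrary, but its structure mirrors `x`.
size(y)
names(y)
[eltype(col) for col in eachcol(y)]
```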


You can also create a new `DataFrame` from a `SubDataFrame` or a `DataFrameRow` (both are discussed in detail later in the tutorial; in particular, although a `DataFrameRow` is considered a 1-dimensional object similar to a `NamedTuple`, it gets converted to a 1-row `DataFrame` for convenience):

In [18]:
sdf = view(x, [1,1], :)

Unnamed: 0_level_0,a,b
Unnamed: 0_level_1,Int64,Float64
1,1,1.0
2,1,1.0


In [19]:
typeof(sdf)

SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}}

In [20]:
DataFrame(sdf)

Unnamed: 0_level_0,a,b
Unnamed: 0_level_1,Int64,Float64
1,1,1.0
2,1,1.0


In [21]:
dfr = x[1, :]

Unnamed: 0_level_0,a,b
Unnamed: 0_level_1,Int64,Float64
1,1,1.0


In [22]:
DataFrame(dfr)

Unnamed: 0_level_0,a,b
Unnamed: 0_level_1,Int64,Float64
1,1,1.0
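
One way to see the difference between a view and a `DataFrame` created from it is to modify the parent afterwards. This is a small sketch reusing the names from the cells above:

```julia
using DataFrames

x = DataFrame(a=1, b=1.0)
sdf = view(x, [1, 1], :)   # a view that shows row 1 of `x` twice
y = DataFrame(sdf)         # an independent data frame with copied data

# Change the parent data frame in place.
x.a[1] = 10

# The view reflects the change, while the materialized copy does not.
sdf.a, y.a
```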


### Conversion to a matrix

Let's start by creating a `DataFrame` with two rows and two columns.

In [23]:
x = DataFrame(x=1:2, y=["A", "B"])

Unnamed: 0_level_0,x,y
Unnamed: 0_level_1,Int64,String
1,1,A
2,2,B


We can create a matrix by passing this `DataFrame` to `Matrix` or `Array`.

In [24]:
Matrix(x)

2×2 Matrix{Any}:
 1  "A"
 2  "B"

In [25]:
Array(x)

2×2 Matrix{Any}:
 1  "A"
 2  "B"

This would work even if the `DataFrame` had some `missing`s:

In [26]:
x = DataFrame(x=1:2, y=[missing,"B"])

Unnamed: 0_level_0,x,y
Unnamed: 0_level_1,Int64,String?
1,1,missing
2,2,B


In [27]:
Matrix(x)

2×2 Matrix{Any}:
 1  missing
 2  "B"

In the two previous matrix examples, Julia created matrices with elements of type `Any`. We can see more clearly that the type of matrix is inferred when we pass, for example, a `DataFrame` of integers to `Matrix`, creating a 2D `Array` of `Int64`s:

In [28]:
x = DataFrame(x=1:2, y=3:4)

Unnamed: 0_level_0,x,y
Unnamed: 0_level_1,Int64,Int64
1,1,3
2,2,4


In [29]:
Matrix(x)

2×2 Matrix{Int64}:
 1  3
 2  4

In this next example, Julia correctly identifies that `Union` is needed to express the type of the resulting `Matrix` (which contains `missing`s).

In [30]:
x = DataFrame(x=1:2, y=[missing,4])

Unnamed: 0_level_0,x,y
Unnamed: 0_level_1,Int64,Int64?
1,1,missing
2,2,4


In [31]:
Matrix(x)

2×2 Matrix{Union{Missing, Int64}}:
 1   missing
 2  4

Note that we can't force a conversion of `missing` values to `Int`s!

In [32]:
Matrix{Int}(x)

LoadError: ArgumentError: cannot convert a DataFrame containing missing values to Matrix{Int64} (found for column y)
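
If a `Matrix{Int}` is really needed, the `missing` values have to be handled first. The sketch below shows two possible approaches, which are illustrative rather than the only options: dropping rows with `dropmissing`, or replacing `missing` with a placeholder (here `0`, chosen arbitrarily) by broadcasting `coalesce`:

```julia
using DataFrames

x = DataFrame(x=1:2, y=[missing, 4])

# Option 1: drop every row that contains a missing value.
m_dropped = Matrix{Int}(dropmissing(x))

# Option 2: replace missing values with a placeholder before converting.
m_filled = Matrix{Int}(coalesce.(x, 0))

m_dropped, m_filled
```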

### Conversion to `NamedTuple` related tabular structures

First, define a data frame:

In [33]:
x = DataFrame(x=1:2, y=["A", "B"])

Unnamed: 0_level_0,x,y
Unnamed: 0_level_1,Int64,String
1,1,A
2,2,B


Now we convert the `DataFrame` into a `NamedTuple` of vectors:

In [34]:
ct = Tables.columntable(x)

(x = [1, 2], y = ["A", "B"])

Next we convert it into a vector of `NamedTuple`s:

In [35]:
rt = Tables.rowtable(x)

2-element Vector{NamedTuple{(:x, :y), Tuple{Int64, String}}}:
 (x = 1, y = "A")
 (x = 2, y = "B")

We can perform the conversions back to a `DataFrame` using a standard constructor call:

In [36]:
DataFrame(ct)

Unnamed: 0_level_0,x,y
Unnamed: 0_level_1,Int64,String
1,1,A
2,2,B


In [37]:
DataFrame(rt)

Unnamed: 0_level_0,x,y
Unnamed: 0_level_1,Int64,String
1,1,A
2,2,B
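
The two representations are accessed differently: a column table is indexed by column first, a row table by row first. The following sketch assumes, as in the cells above, that `Tables` is reachable after `using DataFrames`, and checks that both round trips reproduce the original data frame:

```julia
using DataFrames

x = DataFrame(x=1:2, y=["A", "B"])

ct = Tables.columntable(x)   # NamedTuple of vectors
rt = Tables.rowtable(x)      # Vector of NamedTuples

# Column-first versus row-first access to the same cell.
ct.y[2], rt[2].y

# Both conversions can be reversed with the standard constructor.
DataFrame(ct) == x, DataFrame(rt) == x
```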


### Iterating data frame by rows or columns

Sometimes it is useful to create a wrapper around a `DataFrame` that produces its rows or columns.

For iterating columns you can use the `eachcol` function.

In [38]:
ec = eachcol(x)

Unnamed: 0_level_0,x,y
Unnamed: 0_level_1,Int64,String
1,1,A
2,2,B


A `DataFrameColumns` object behaves like a vector (note, though, that it is not an `AbstractVector`):

In [39]:
ec isa AbstractVector

false

In [40]:
ec[1]

2-element Vector{Int64}:
 1
 2

but you can also index into it using column names:

In [41]:
ec["x"]

2-element Vector{Int64}:
 1
 2
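
Beyond indexing, the object returned by `eachcol` is typically used in a loop. The sketch below iterates over `name => column` pairs via `pairs` and collects the element type of each column; it also checks that the columns are returned without copying:

```julia
using DataFrames

x = DataFrame(x=1:2, y=["A", "B"])

# `pairs` yields the column name together with the column vector.
for (name, col) in pairs(eachcol(x))
    println(name, " has element type ", eltype(col))
end

# Plain iteration yields just the columns.
coltypes = [eltype(col) for col in eachcol(x)]

# The columns are the data frame's own vectors, not copies.
eachcol(x)[1] === x.x
```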

Similarly, `eachrow` creates a `DataFrameRows` object that is a vector of its rows:

In [42]:
er = eachrow(x)

Unnamed: 0_level_0,x,y
Unnamed: 0_level_1,Int64,String
1,1,A
2,2,B


`DataFrameRows` is an `AbstractVector`

In [43]:
er isa AbstractVector

true

In [44]:
er[end]

Unnamed: 0_level_0,x,y
Unnamed: 0_level_1,Int64,String
2,2,B


Note that a data frame, as well as `DataFrameColumns` and `DataFrameRows` objects, is not type stable (these objects do not know the types of their columns). This is useful for avoiding compilation cost if you have very wide data frames with heterogeneous column types.

However, often (especially if a data frame is narrow) it is useful to create a lazy iterator that produces a `NamedTuple` for each row of the `DataFrame`. Its key benefit is that it is type stable (so it is useful when you want to perform some operations quickly on a small subset of columns of a `DataFrame`; this strategy is often used internally by the DataFrames.jl package):

In [45]:
nti = Tables.namedtupleiterator(x)

Tables.NamedTupleIterator{Tables.Schema{(:x, :y), Tuple{Int64, String}}, Tables.RowIterator{NamedTuple{(:x, :y), Tuple{Vector{Int64}, Vector{String}}}}}(Tables.RowIterator{NamedTuple{(:x, :y), Tuple{Vector{Int64}, Vector{String}}}}((x = [1, 2], y = ["A", "B"]), 2))

In [46]:
for row in enumerate(nti)
    @show row
end

row = (1, (x = 1, y = "A"))
row = (2, (x = 2, y = "B"))
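
To benefit from the type stability, the iterator is usually passed to a function, so that the loop is compiled for the concrete row type. A minimal sketch, where `sum_x` is a hypothetical helper written only for this example:

```julia
using DataFrames

x = DataFrame(x=1:2, y=["A", "B"])

# Inside this function the compiler knows the exact type of each `row`,
# so accessing `row.x` does not require dynamic dispatch.
function sum_x(rows)
    total = 0
    for row in rows
        total += row.x
    end
    return total
end

sum_x(Tables.namedtupleiterator(x))
```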


As with the previous options, you can easily convert a `NamedTupleIterator` back to a `DataFrame`.

In [47]:
DataFrame(nti)

Unnamed: 0_level_0,x,y
Unnamed: 0_level_1,Int64,String
1,1,A
2,2,B


### Handling of duplicate column names

We can pass the `makeunique` keyword argument to allow duplicate names (they get deduplicated by adding a suffix):

In [48]:
df = DataFrame(:a=>1, :a=>2, :a_1=>3; makeunique=true)

Unnamed: 0_level_0,a,a_2,a_1
Unnamed: 0_level_1,Int64,Int64,Int64
1,1,2,3


Otherwise, duplicates are not allowed.

In [49]:
df = DataFrame(:a=>1, :a=>2, :a_1=>3)

LoadError: ArgumentError: Duplicate variable names: :a. Pass makeunique=true to make them unique using a suffix automatically.

Observe that currently `nothing` is not printed when displaying a `DataFrame` in Jupyter Notebook:

In [50]:
df = DataFrame(x=[1, nothing], y=[nothing, "a"], z=[missing, "c"])

Unnamed: 0_level_0,x,y,z
Unnamed: 0_level_1,Union…,Union…,String?
1,1.0,,missing
2,,a,c


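Finally, you can use the `empty` and `empty!` functions to remove all rows from a data frame (`empty` returns a new data frame and leaves the original untouched, while `empty!` modifies it in place):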

In [51]:
empty(df)

Unnamed: 0_level_0,x,y,z
Unnamed: 0_level_1,Union…,Union…,String?


In [52]:
df

Unnamed: 0_level_0,x,y,z
Unnamed: 0_level_1,Union…,Union…,String?
1,1.0,,missing
2,,a,c


In [53]:
empty!(df)

Unnamed: 0_level_0,x,y,z
Unnamed: 0_level_1,Union…,Union…,String?


In [54]:
df

Unnamed: 0_level_0,x,y,z
Unnamed: 0_level_1,Union…,Union…,String?
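
The difference between the two functions can be summarized with row counts. This is a small sketch using a simpler data frame than the one above:

```julia
using DataFrames

df = DataFrame(x=[1, 2])

# `empty` returns a new data frame with the same columns and no rows.
e = empty(df)
rows_after_empty = (nrow(df), nrow(e))

# `empty!` removes the rows from `df` itself.
empty!(df)
rows_after_empty_bang = nrow(df)
```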
