# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), Apr 21, 2018**

In [1]:
using DataFrames # load package

## Load and save DataFrames
This section uses CSV.jl and JLD.jl but does not cover all of their features; refer to each package's documentation for the rest.

In [2]:
using CSV # reading and writing CSV files
using JLD # Julia native binary format

In [3]:
x = DataFrame(A=[true, false, true], B=[1, 2, missing],
              C=[missing, "b", "c"], D=['a', missing, 'c']) # create a simple DataFrame for testing purposes


| Row | A | B | C | D |
|---|---|---|---|---|
| 1 | true | 1 | missing | 'a' |
| 2 | false | 2 | b | missing |
| 3 | true | missing | c | 'c' |


In [4]:
eltypes(x)

4-element Array{Type,1}:
 Bool                           
 Union{Int64, Missings.Missing} 
 Union{Missings.Missing, String}
 Union{Char, Missings.Missing}  

In [5]:
CSV.write("x.csv", x) # save it to disk; make sure x.csv does not conflict with some file in your working directory

CSV.Sink{Void,DataType}(    CSV.Options:
        delim: ','
        quotechar: '"'
        escapechar: '\\'
        missingstring: ""
        dateformat: nothing
        decimal: '.'
        truestring: 'true'
        falsestring: 'false', IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=0, maxsize=Inf, ptr=1, mark=-1), "x.csv", 8, true, String["A", "B", "C", "D"], 4, false, Val{false})

In [6]:
print(read("x.csv", String)) # this is how it was saved

A,B,C,D
true,1,,a
false,2,b,
true,,c,c


In [7]:
# we can load it back; disable memory mapping so that on Windows the file can be deleted in the same session
y = CSV.read("x.csv", use_mmap=false)

| Row | A | B | C | D |
|---|---|---|---|---|
| 1 | true | 1 | missing | a |
| 2 | false | 2 | b | missing |
| 3 | true | missing | c | c |


In [8]:
eltypes(y) # all columns allow Missing by default; also column D is now String instead of Char

4-element Array{Type,1}:
 Union{Bool, Missings.Missing}  
 Union{Int64, Missings.Missing} 
 Union{Missings.Missing, String}
 Union{Missings.Missing, String}

In [9]:
save("x.jld", "x", x) # save to a file in a binary format; make sure that x.jld does not exist in your working directory

In [10]:
y = load("x.jld", "x") # this is identical to x

| Row | A | B | C | D |
|---|---|---|---|---|
| 1 | true | 1 | missing | 'a' |
| 2 | false | 2 | b | missing |
| 3 | true | missing | c | 'c' |


In [11]:
eltypes(y)

4-element Array{Type,1}:
 Bool                           
 Union{Int64, Missings.Missing} 
 Union{Missings.Missing, String}
 Union{Char, Missings.Missing}  

In [12]:
# be careful: this creates bigdf.csv and bigdf.jld in your working directory
bigdf = DataFrame(Bool, 10^3, 10^2) # 10^3 rows, 10^2 columns
@time CSV.write("bigdf.csv", bigdf)
@time save("bigdf.jld", "bigdf", bigdf)
getfield.(stat.(["bigdf.csv", "bigdf.jld"]), :size) # file sizes in bytes; you can expect JLD to be faster, and compress=true reduces the file size

  0.782157 seconds (688.90 k allocations: 30.828 MiB, 1.08% gc time)
  0.018250 seconds (203.61 k allocations: 3.339 MiB)


2-element Array{Int64,1}:
 595307
 154487
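
The comment above mentions `compress=true`. A minimal sketch of how the two variants could be compared is shown below, assuming that `save` from JLD accepts `compress` as a keyword argument. The data frame `df` and the file names `plain.jld` and `packed.jld` are made up for illustration, and the resulting sizes will vary with the data and package versions.

```julia
using DataFrames, JLD

# hypothetical test data: one column of random Bool values
df = DataFrame(A=rand(Bool, 10^5))

# write both files into a temporary directory that is removed afterwards
sizes = mktempdir() do dir
    plain = joinpath(dir, "plain.jld")    # illustrative file name
    packed = joinpath(dir, "packed.jld")  # illustrative file name
    save(plain, "df", df)                 # default: no compression
    save(packed, "df", df, compress=true) # assumption: compress keyword is supported
    (filesize(plain), filesize(packed))   # sizes in bytes
end

println(sizes)
```

Compression trades some write time for a smaller file, so it is worth timing both calls with `@time` on your own data before choosing.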

In [13]:
# clean up - do not run unless you are sure that it will not erase your important files
foreach(rm, ["x.csv", "x.jld", "bigdf.csv", "bigdf.jld"])
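
The warnings above about overwriting and deleting files can be avoided by working in a temporary directory. The sketch below assumes only `mktempdir`, `joinpath` and `read` from Base together with `CSV.write` as used earlier; the data frame `df` is a made-up example.

```julia
using DataFrames, CSV

# hypothetical small data frame, built like x above
df = DataFrame(A=[1, 2], B=["a", "b"])

# mktempdir creates a fresh directory and deletes it when the block ends,
# so nothing in the working directory is touched
csv_text = mktempdir() do dir
    path = joinpath(dir, "x.csv")
    CSV.write(path, df)
    read(path, String) # return the file contents before the directory is removed
end

print(csv_text)
```

Because the directory is created and removed by `mktempdir`, no manual `rm` step is needed.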