# Loading and Saving Data

This notebook shows how to load and save data to/from csv, jld2, xlsx, hdf5, and mat files.

There are also packages for reading R data files ([RData.jl](https://github.com/JuliaData/RData.jl)), numpy data files ([NPZ.jl](https://github.com/fhs/NPZ.jl)), [JSON](https://github.com/JuliaIO/JSON.jl), [Arrow](https://github.com/apache/arrow-julia) and more, but they are not covered in this tutorial.

The focus is on typical financial data sets, with dates and data. This mixture of types often needs conversion into Julia dates and integers/floating point numbers. In case your data set is simpler, skip those parts of the code.

## Load Packages and Extra Functions

The packages are loaded in the respective sections below. This allows you to run parts of this notebook without having to install all packages.

The data files created by this notebook are written to and loaded from the subfolder `Results`.

In [1]:
using Printf, Dates
include("src/printmat.jl")

if !isdir("Results")
    error("create the subfolder Results before running this program, perhaps with mkdir()")
end

# csv Files

## csv: loading

The csv ("comma-separated values") format provides a simple and robust method for moving data and it can be read by most software.


For *reading* a data file delimited by comma (,) and where the first line of the file contains variable names, use the following
```
(x,header) = readdlm(FileName,',',header=true)
```
Alternatively, use
```
x = readdlm(FileName,',',skipstart=1)
```
to disregard the first line. Extra arguments control the type of data (Float64, Int, etc), suppression of comment lines and more. 

When reading a file with mixed data types, you typically get data of the `Any` type, so an explicit conversion might be useful.

If you need more powerful write/read routines, try the [CSV.jl](https://github.com/JuliaData/CSV.jl) package.
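As a minimal sketch of the two `readdlm()` calls above, the following reads a small csv text from memory rather than from disk. The string `csvText` and its contents are made up for illustration, and an `IOBuffer` is used in place of a file name.

```julia
using DelimitedFiles

# Hypothetical csv text (made up for this example), with one empty field
csvText = "date,GSPC,DTB3\n02/01/1979,96.73,9.31\n03/01/1979,,9.31\n"

(x,header) = readdlm(IOBuffer(csvText),',',header=true)   #data and variable names
x1         = readdlm(IOBuffer(csvText),',',skipstart=1)   #data only, first line skipped

println("x is: ",summary(x))          #mixed strings and numbers give element type Any
println("names: ",vec(header))
println("same data: ",x == x1)
```

Both calls return the same data matrix. The difference is only whether the first line is returned as `header` or discarded.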

In [2]:
using DelimitedFiles

(x,header) = readdlm("Data/CsvFile.csv",',',header=true)  #read csv file
printblue("x is: ",summary(x),"\n")
printmat(x;colNames=vec(header))            #printmat() wants a vector of column names

printblue("There are mixed types and missings in this matrix. Maybe convert?")

x is: 6×3 Matrix{Any}

                GSPC      DTB3
02/01/1979    96.730          
03/01/1979               9.310
04/01/1979    98.580     9.310
05/01/1979    99.130     9.340
08/01/1979    98.800     9.360
09/01/1979    99.330     9.270

There are mixed types and missings in this matrix. Maybe convert?


## csv: convert Dates and Missing Values (extra)

The function `datesFix()` converts dates to Julia dates, while the function `readdlmFix()` replaces all non-numbers by `NaN` (use with `Float64`) or optionally `missing` (use when data is not `Float64`) and finally converts the whole matrix to `Float64` (or a type you can configure).

Most financial data is floating point, so we typically choose to convert to `NaN` rather than to `missing` as the former is much more easily handled in many computations.
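The trade-off between `NaN` and `missing` can be illustrated with a small sketch. The numbers are made up, and `Statistics` is a standard library. With `NaN` the vector stays a plain `Float64` vector, while `missing` changes the element type to a `Union`.

```julia
using Statistics

xNaN  = [96.73, NaN, 98.58]          #Float64 vector, NaN marks the gap
xMiss = [96.73, missing, 98.58]      #Union{Missing,Float64} vector

println(eltype(xNaN),"  ",eltype(xMiss))
println(mean(xNaN))                       #NaN propagates through the calculation
println(mean(filter(!isnan,xNaN)))        #drop the NaN first
println(mean(skipmissing(xMiss)))         #skip the missing value
```

In both cases the gap has to be handled explicitly before computing a statistic, so which representation is more convenient depends on what the rest of the code expects.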

In [3]:
"""
    datesFix(z,fmt="dd/mm/yyyy")

Convert to Julia dates, or `Time()` and `DateTime()`
"""
datesFix(z,fmt="dd/mm/yyyy") =  Date.(string.(z),fmt)


"""
    readdlmFix(x,Typ=Float64,missVal=NaN)

Change elements with missing data (' ') to either NaN or missing. 
`x` is the input matrix, `Typ` is the type of the output (Float64, Int, etc) and 
`missVal` is your choice of either `NaN` or `missing`.
"""
function readdlmFix(x,Typ=Float64,missVal=NaN)
    y = replace(z->!isa(z,Number) ? missVal : z,x)
    ismissing(missVal) && (Typ = Union{Missing,Typ})   #allow missing
    y = convert.(Typ,y)
    return y
end;

In [4]:
dN = datesFix(x[:,1],"dd/mm/yyyy")      #to proper Julia Date(s)
x2 = readdlmFix(x[:,2:end])             #convert to Float64 and replace empty by NaN

println("after datesFix() and readdlmFix():\n")
printmat(dN,x2;colNames=vec(header))    #printmat() wants a vector of column names

after datesFix() and readdlmFix():

                GSPC      DTB3
1979-01-02    96.730       NaN
1979-01-03       NaN     9.310
1979-01-04    98.580     9.310
1979-01-05    99.130     9.340
1979-01-08    98.800     9.360
1979-01-09    99.330     9.270



## csv: saving

To *write* csv data, the simplest approach is to create the matrix you want to save and then run
```
writedlm(FileName,matrix,',')
```
Alternatively, to write several matrices to the file (stacked vertically), use
```
fh = open(FileName, "w")
    writedlm(fh,matrix1,',')
    writedlm(fh,matrix2,',')
close(fh)
```

Another syntax which does the same thing is 
```
open(FileName, "w") do fh
    writedlm(...
end
```

If you only need/want limited precision in the file, round the numbers first, for instance, `round.(matrix1,digits=5)`.
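A minimal sketch that combines the `do` block syntax with rounding. The matrices are made up, and the file is written to a temporary folder (an assumption made here so the example does not depend on the `Results` subfolder).

```julia
using DelimitedFiles

matrix1 = [1.123456789 2.0;
           3.0         4.0]
matrix2 = [5.0 6.0]

FileName = joinpath(mktempdir(),"Stacked.csv")    #hypothetical temporary file

open(FileName, "w") do fh                         #the file is closed automatically at `end`
    writedlm(fh,round.(matrix1,digits=5),',')     #rounded to 5 digits
    writedlm(fh,matrix2,',')                      #stacked below matrix1
end

xBack = readdlm(FileName,',')                     #read the file again to check
println(xBack)
```

The `do` block form closes the file even if one of the `writedlm()` calls throws an error, which the explicit `open()`/`close()` pair does not.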

In [5]:
x   = [Date(1990) 1.1 1.2 1.3;
       Date(1991) 2.1 2.2 2.3]
header = ["" "X" "Y" "Z"]

xx = [header; x]                           #to save
writedlm("Results/NewCsvFile.csv",xx,',')  #write csv file

printblue("NewCsvFile.csv has been created in the subfolder Results. Check it out.")

NewCsvFile.csv has been created in the subfolder Results. Check it out.


# jld2

## jld2: loading

jld2 files can store very different types of data: dates, integers, floats, strings, dictionaries, etc. The format is a dialect of hdf5, designed to save different Julia objects. This avoids most type conversions.

The basic syntax of the [JLD2.jl](https://github.com/JuliaIO/JLD2.jl) package is 
```
(A,B) = load(FileName,"A","B")        #load some data
xx = load(FileName)                   #load all data into a Dict()
save(FileName,"Aaaa",A,"B",B)         #save data
jldsave(FileName;Aaaa=A,B)            #also to save data
```
(It is also possible to use the same syntax as for HDF5 below, except that we use `jldopen` instead of `h5open`.)

In [6]:
using FileIO, JLD2

xx = load("Data/JldFile.jld2")                    #load entire file
println("The variables are: ",keys(xx),"\n")      #list contents of the file
(dN,x) = (xx["d"],xx["x"])

println("loaded from jld2 file:\n")
printmat([dN x])
 
#(x,dN) = load("Data/JldFile.jld2","x","d");        #alternative way to read some of the data

The variables are: ["B", "C", "x", "d"]

loaded from jld2 file:

2019-05-14     1.100     1.200     1.300
2019-05-15     2.100     2.200     2.300



## jld2: saving

In [7]:
x   = [1.1 1.2 1.3;                                    #to save in a JLD2 file
       2.1 2.2 2.3]
d   = [Date(2019,5,14);                                #Julia dates
       Date(2019,5,15)]
B   = 1
C   = "Nice cat"

save("Results/NewJldFile.jld2","xxxx",x,"d",d,"B",B,"C",C)      #write jld2 file
#jldsave("Results/NewJldFile.jld2";xxxx=x,d,B,C)                #alternative way

println("NewJldFile.jld2 has been created in the subfolder Results")

NewJldFile.jld2 has been created in the subfolder Results


# xlsx

## xlsx: loading

The [XLSX.jl](https://github.com/felipenoris/XLSX.jl) package allows you to read and write xlsx files.

When reading from a sheet with mixed data types, you again get data of the `Any` type, which needs conversion (see the cell further below). Otherwise, try the `XLSX.gettable()` method.

In [8]:
using XLSX

data1 = XLSX.readxlsx("Data/xlsxFile.xlsx")   #reading the entire file
x     = data1["Data!A2:C5"]                    #extracting a part of the sheet "Data"
println("x is: ",summary(x),"\n")
#x = XLSX.readdata("Data/xlsxFile.xlsx","Data","A2:C5")   #does the same thing

println("part of the xlsx file:")
printmat(x)

#xx1 = XLSX.gettable(data1["Data"],infer_eltypes=true)   #to get types automatically
#(a1,a2,a3) = xx1.data;                                  #extract "columns" from xx1
#printmat(a1,a2,a3)

x is: 4×3 Matrix{Any}

part of the xlsx file:
1950-01-06    16.980  -999.990
1950-01-09    17.080  -999.990
1950-01-10    17.030     7    
1950-01-11    17.090     8    



## xlsx: convert dates and missing values (extra)

This uses the same functions as for csv. (See above. You need to run that cell once before using the functions.)

Notice that the xlsx file uses `-999.99` to indicate a missing value. This approach is fairly common. For that reason we replace those values with `NaN`.

In [9]:
dN = datesFix(x[:,1],"yyyy-mm-dd")      #to proper Julia Date(s)
x2 = readdlmFix(x[:,2:end])             #convert to Float64 and replace empty by NaN
replace!(x2,-999.99=>NaN)               #replace -999.99 by NaN

printmat(dN,x2)

1950-01-06    16.980       NaN
1950-01-09    17.080       NaN
1950-01-10    17.030     7.000
1950-01-11    17.090     8.000



## xlsx: saving

In [10]:
x = [Date(1980) 1.1 1.2 1.3;                          #writing a matrix to an xlsx file
     Date(1981) 2.1 2.2 2.3]
y = [11 12]

XLSX.openxlsx("Results/NewxlsxFile.xlsx",mode="w") do xf
  xf[1]["C2"] = x  #write to first sheet, matrix with upper left corner in cell C2
  XLSX.addsheet!(xf,"SecondSheet")  #create 2nd sheet
  xf[2]["A1"] = y                   #write to 2nd sheet
end

printblue("NewxlsxFile.xlsx has been created in the subfolder Results")

NewxlsxFile.xlsx has been created in the subfolder Results


# hdf5 (extra)

## hdf5: loading

hdf5 files are used in many computer languages. They can store different types of data: integers, floats, strings, but *not* Julia Dates.

The basic syntax of the [HDF5.jl](https://github.com/JuliaIO/HDF5.jl) package for reading is
```
fh = h5open(FileName,"r")   #open for reading
    (x,y) = read(fh,"x","y")
close(fh)
```
Dates can, for instance, be saved in a `Tx3` matrix (year, month, day), and then converted to Julia dates by the `Date()` function.

(There is also the `h5read(filename,"x")` command that reads one variable from a file.)

In [11]:
using HDF5

fh = h5open("Data/H5File.h5","r")                     #open for reading
    println("\nVariables in h5 file: ",keys(fh))
    (x,B,ymd) = read(fh,"x","B","ymd")                #load some of the data
close(fh)

dN = Date.(ymd[:,1],ymd[:,2],ymd[:,3])                #reconstructing dates from a 3-column matrix [year,month,day]

println("\ndates and x from h5 file is")
printmat([dN x])


Variables in h5 file: ["B", "C", "x", "ymd"]

dates and x from h5 file is
2019-05-14     1.100     1.200     1.300
2019-05-15     2.100     2.200     2.300



## hdf5: saving

To save dates, save either a matrix `[y m d]` or a date value from `Dates.value(date)`. See the tutorial on `Dates` for more details.
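The two ways of turning dates into plain integers can be sketched as follows. The dates are made up for the example, and the round trip back to `Date` is included only as a check.

```julia
using Dates

d = [Date(2019,5,14),Date(2019,5,15)]       #example dates

ymd  = [year.(d) month.(d) day.(d)]          #Tx3 matrix of integers
dVal = Dates.value.(d)                       #vector of integers (Rata Die day numbers)

d1 = Date.(ymd[:,1],ymd[:,2],ymd[:,3])       #back to Julia dates from [y m d]
d2 = Date.(rata2datetime.(dVal))             #back to Julia dates from the date values

println(ymd)
println(dVal)
println(d1 == d && d2 == d)
```

The `[y m d]` matrix is easier to read for users of other languages, while the single date value takes less space.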

To save data, the syntax is
```
fh = h5open(FileName,"w")   #open for writing
    write(fh,"x",x,"y",y)
close(fh)
```

(There is also the `h5write(filename,"x",x)` command that writes one variable to a file.)

In [12]:
x   = [1.1 1.2 1.3;
       2.1 2.2 2.3]
ymd = [2019 5 14;
       2019 5 15]

B   = 1
C   = "Nice cat"

fh = h5open("Results/NewH5File.h5","w")    #open file for writing
    write(fh,"x",x,"ymd",ymd,"B",B,"C",C)
close(fh)                                  #close file

printblue("NewH5File.h5 has been created in the subfolder Results")

NewH5File.h5 has been created in the subfolder Results


# mat files (extra)

## mat: loading

The [MAT.jl](https://github.com/JuliaIO/MAT.jl) package allows you to load/save (Matlab) mat files, which (in version 7.3 of the format) are a dialect of HDF5.

`matread()` reads a file into a dictionary, while `matwrite()` writes a dictionary. (There is also a `matopen()` function that can be used with `write()` similar to the HDF5 approach discussed above.)

In [13]:
using MAT

D = matread("Data/MatFile.mat")       #gives a Dict
printblue("variables in mat file: ",keys(D),"\n")
    
(dM,x) = (D["dM"],D["x"])             #extract some variables from Dict

printmat(dM,x)

printblue("The dates are from matlab's datenum(). Maybe convert?")

variables in mat file: ["B", "C", "dM", "x"]

737559         1.100     1.200     1.300
737560         2.100     2.200     2.300

The dates are from matlab's datenum(). Maybe convert?


## mat: converting dates

The next cell reuses functions from the tutorial on Dates, in order to convert between matlab and Julia dates.
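The offset of 366 used in the conversion comes from the different starting points: matlab's `datenum()` has day 1 at 1 January of year 0, while the Rata Die count used by Julia's `Dates` has day 1 at 1 January of year 1. A quick sanity check of this offset is sketched here, using the hypothetical names `toJulia()` and `toMatlab()` so as not to clash with the functions defined in the next cell.

```julia
using Dates

toJulia(mlnum)   = Date(rata2datetime(round(Int,mlnum) - 366))   #hypothetical name
toMatlab(jldate) = datetime2rata(jldate) + 366.0                  #hypothetical name

d = toJulia(737559.0)         #a datenum value from the mat file above
println(d)
println(toMatlab(d))          #round trip back to the datenum
```

Note that `round(Int,mlnum)` drops any time-of-day information stored in the fractional part of a datenum.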

In [14]:
mlNum2jlDate(mlnum) = Date(rata2datetime(round(Int,mlnum) - 366))    #to Julia Date from matlab datenum
jlDate2mlNum(jldate) = datetime2rata(jldate) + 366.0;

In [15]:
d = mlNum2jlDate.(dM)                #Matlab datenum to Julia date

println("\ndates and x from mat file is (after converting dates)")
printmat([d x])


dates and x from mat file is (after converting dates)
2019-05-14     1.100     1.200     1.300
2019-05-15     2.100     2.200     2.300



## mat: writing

In [16]:
x = [1.1 1.2 1.3;
     2.1 2.2 2.3]
d = [Date(2019,5,14);                         #Julia dates
     Date(2019,5,15)]
B = 1
C = "Nice cat"

dM = jlDate2mlNum.(d) #Julia Date to Matlab's datenum(), Float64

D = Dict("x"=>x,"B"=>B,"dM"=>dM,"C"=>C)
matwrite("Results/NewMatFile.mat",D)

printblue("\nNewMatFile.mat has been created in the subfolder Results")


NewMatFile.mat has been created in the subfolder Results
