In [32]:
using DelimitedFiles
using CSV
using XLSX
using DataFrames

# Downloading Data
Julia provides the `download()` function to fetch a file from the web. On Julia 1.6 and later it is backed by the `Downloads` standard library (built on libcurl), so no external tool is needed; older versions shell out to `wget`, `curl` or `fetch`, so one of these must be installed. The second argument is the local file name. A relative name like the one below is resolved against the working directory, which for this notebook is the directory containing the notebook:

In [2]:
download("https://raw.githubusercontent.com/nassarhuda/easy_data/master/programming_languages.csv",
    "programming_languages.csv")

"programming_languages.csv"

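Re-running the cell above downloads the file again each time. A minimal sketch of one way to avoid that, assuming Julia 1.6 or later where `Downloads.download` is available; `fetch_if_missing` is a hypothetical helper name, not part of Julia:

```julia
using Downloads

# Hypothetical helper: download `url` to `path` only if `path` does not exist yet.
function fetch_if_missing(url::AbstractString, path::AbstractString)
    isfile(path) && return path            # reuse the local copy
    return Downloads.download(url, path)   # returns the path it wrote to
end
```

Calling `fetch_if_missing(url, "programming_languages.csv")` with the URL used above would then touch the network only on the first run.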
Note that we can run shell commands from within a notebook or the Julia REPL by prefixing them with `;`:

In [4]:
;head programming_languages.csv

year,language
1951,Regional Assembly Language
1952,Autocode
1954,IPL
1955,FLOW-MATIC
1957,FORTRAN
1957,COMTRAN
1958,LISP
1958,ALGOL 58
1959,FACT
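The `;` prefix only works interactively (REPL or notebook), and `head` is not available on every platform. Inside a script, a shell command can be run with ``run(`head programming_languages.csv`)``, or the same preview can be done in plain Julia. The sketch below takes the second route; `first_lines` is a hypothetical helper name, and the demo file is created only so the snippet runs on its own:

```julia
# Hypothetical helper mimicking `head`: return the first `n` lines of a file.
function first_lines(path::AbstractString, n::Integer=10)
    open(path) do io
        collect(Iterators.take(eachline(io), n))
    end
end

# Small temporary file standing in for programming_languages.csv
demo_path = joinpath(mktempdir(), "demo.csv")
write(demo_path, "year,language\n1951,Regional Assembly Language\n1952,Autocode\n")

foreach(println, first_lines(demo_path, 2))
```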


# Reading & Writing Data

## Delimited Files
We can read and write any delimited text file using the `DelimitedFiles` package, a standard library that ships with Julia. It provides the `readdlm()` function to parse delimited files. There are three important things to note about this function:
1. The file name must be in double quotes (the method expects a `String`), but the delimiter must be in single quotes (the method expects a `Char`)
2. With `header=true` it returns two objects, the data and the header, separately; without it, it returns a single matrix with the header as the first row
3. It is flexible enough to handle any kind of delimited file, but for CSVs it is slower than the dedicated `CSV` package (see below)

In [18]:
data, header = readdlm("programming_languages.csv", ',', header=true)

(Any[1951 "Regional Assembly Language"; 1952 "Autocode"; … ; 2012 "Julia"; 2014 "Swift"], AbstractString["year" "language"])

In [20]:
data

73×2 Matrix{Any}:
 1951  "Regional Assembly Language"
 1952  "Autocode"
 1954  "IPL"
 1955  "FLOW-MATIC"
 1957  "FORTRAN"
 1957  "COMTRAN"
 1958  "LISP"
 1958  "ALGOL 58"
 1959  "FACT"
 1959  "COBOL"
 1959  "RPG"
 1962  "APL"
 1962  "Simula"
    ⋮  
 2003  "Scala"
 2005  "F#"
 2006  "PowerShell"
 2007  "Clojure"
 2009  "Go"
 2010  "Rust"
 2011  "Dart"
 2011  "Kotlin"
 2011  "Red"
 2011  "Elixir"
 2012  "Julia"
 2014  "Swift"
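Because the columns mix numbers and strings, `readdlm()` returns a `Matrix{Any}`, which is awkward for later computation. A minimal sketch of pulling the columns out as concretely typed vectors, using a small temporary file in place of `programming_languages.csv` (the variable names are illustrative):

```julia
using DelimitedFiles

# Small temporary file standing in for programming_languages.csv
demo_path = joinpath(mktempdir(), "demo.csv")
write(demo_path, "year,language\n1957,FORTRAN\n1958,LISP\n")

data, header = readdlm(demo_path, ',', header=true)

years     = Int.(data[:, 1])      # first column as Vector{Int}
languages = String.(data[:, 2])   # second column as Vector{String}
colnames  = vec(header)           # flatten the 1×2 header matrix into a vector
```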

Once we have some data, we can use the `writedlm()` function to write it to a file. Again, there are three important things to note about this function:
1. The file name must be in double quotes, and the delimiter in single quotes
2. For big files it is much slower than the `CSV` package
3. It has no header argument, so the header must be combined with the data in the argument being written, e.g. `[header ; data]`

In [21]:
writedlm("proglang.csv", [header ; data], ',')

In [22]:
;head proglang.csv

year,language
1951,Regional Assembly Language
1952,Autocode
1954,IPL
1955,FLOW-MATIC
1957,FORTRAN
1957,COMTRAN
1958,LISP
1958,ALGOL 58
1959,FACT
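The `[header ; data]` concatenation only works when the header is a 1×n row matrix, which is what `readdlm()` returns. If the column names are held in an ordinary vector instead, they need to be turned into a row first. A small sketch with made-up data, writing to a temporary file:

```julia
using DelimitedFiles

colnames = ["year", "language"]            # plain Vector, i.e. a column
header   = permutedims(colnames)           # 1×2 row matrix
data     = Any[1957 "FORTRAN"; 1958 "LISP"]

out_path = joinpath(mktempdir(), "out.csv")
writedlm(out_path, [header; data], ',')    # header row stacked on top of the data

foreach(println, readlines(out_path))
```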


## CSV Files
CSV files are really common, so Julia has the dedicated `CSV` package, which is optimised for CSVs and similar delimited text (but doesn't handle other formats such as XLSX). Its `read()` function takes two arguments: first the file name and second the sink. The sink is the kind of object the data will be returned in. It is usually a `DataFrame`, but in certain cases other sinks are more useful, e.g. reading directly into an SQLite database instance.

In [28]:
proglang = CSV.read("programming_languages.csv", DataFrame)

| Row | year | language |
|---|---|---|
| | Int64 | String31 |
| 1 | 1951 | Regional Assembly Language |
| 2 | 1952 | Autocode |
| 3 | 1954 | IPL |
| 4 | 1955 | FLOW-MATIC |
| 5 | 1957 | FORTRAN |
| 6 | 1957 | COMTRAN |
| 7 | 1958 | LISP |
| 8 | 1958 | ALGOL 58 |
| 9 | 1959 | FACT |
| 10 | 1959 | COBOL |
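To make the role of the sink more concrete: `CSV.read(source, DataFrame)` behaves like parsing the source with `CSV.File` and then handing the result to the `DataFrame` constructor. A minimal sketch, reading from an in-memory buffer with made-up rows so that no file is needed:

```julia
using CSV, DataFrames

csv_text = "year,language\n1957,FORTRAN\n1958,LISP\n"

# Parse and materialise in one call...
df_read = CSV.read(IOBuffer(csv_text), DataFrame)

# ...or parse first, then pass the parsed table to the sink ourselves.
df_file = DataFrame(CSV.File(IOBuffer(csv_text)))
```

Both calls should give the same two-row `DataFrame`.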


We can write data with the `write()` function:

In [29]:
CSV.write("proglang2.csv", proglang)

"proglang2.csv"

In [30]:
;head proglang2.csv

year,language
1951,Regional Assembly Language
1952,Autocode
1954,IPL
1955,FLOW-MATIC
1957,FORTRAN
1957,COMTRAN
1958,LISP
1958,ALGOL 58
1959,FACT
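Both `CSV.write()` and `CSV.read()` accept a `delim` keyword, so the same pair of functions can handle other delimiters. A short round-trip sketch using tabs, with a small made-up `DataFrame` and a temporary file:

```julia
using CSV, DataFrames

df = DataFrame(year = [1957, 1958], language = ["FORTRAN", "LISP"])

tsv_path = joinpath(mktempdir(), "proglang.tsv")
CSV.write(tsv_path, df; delim='\t')                      # tab-separated output

roundtrip = CSV.read(tsv_path, DataFrame; delim='\t')    # read it back
```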


## XLSX Files
XLSX files can be more complicated to parse than simple text files, so the `XLSX` package provides dedicated functions for reading and writing them. The most generic function for reading XLSX files is `readxlsx()`, which returns an `XLSX.XLSXFile` object containing all sheets. Each sheet is an `XLSX.Worksheet` object, but if we know the sheet name and column/row span, we can extract the values as a Julia array:

In [54]:
proglang = XLSX.readxlsx("programming_languages.xlsx")
println("The initial return is a $(typeof(proglang)) object")
println("This can be indexed to find the first sheet, which is a $(typeof(proglang[1])) object")
println("The values in the first sheet can be unpacked into an array:")
println(proglang["Sheet1!A1:B74"])

The initial return is a XLSX.XLSXFile object
This can be indexed to find the first sheet, which is a XLSX.Worksheet object
The values in the first sheet can be unpacked into an array:
Any["year" "language"; 1951 "Regional Assembly Language"; 1952 "Autocode"; 1954 "IPL"; 1955 "FLOW-MATIC"; 1957 "FORTRAN"; 1957 "COMTRAN"; 1958 "LISP"; 1958 "ALGOL 58"; 1959 "FACT"; 1959 "COBOL"; 1959 "RPG"; 1962 "APL"; 1962 "Simula"; 1962 "SNOBOL"; 1963 "CPL"; 1964 "Speakeasy"; 1964 "BASIC"; 1964 "PL/I"; 1966 "JOSS"; 1967 "BCPL"; 1968 "Logo"; 1969 "B"; 1970 "Pascal"; 1970 "Forth"; 1972 "C"; 1972 "Smalltalk"; 1972 "Prolog"; 1973 "ML"; 1975 "Scheme"; 1978 "SQL "; 1980 "C++ "; 1983 "Ada"; 1984 "Common Lisp"; 1984 "MATLAB"; 1984 "dBase III"; 1985 "Eiffel"; 1986 "Objective-C"; 1986 "LabVIEW "; 1986 "Erlang"; 1987 "Perl"; 1988 "Tcl"; 1988 "Wolfram Language "; 1989 "FL "; 1990 "Haskell"; 1991 "Python"; 1991 "Visual Basic"; 1993 "Lua"; 1993 "R"; 1994 "CLOS "; 1995 "Ruby"; 1995 "Ada 95"; 1995 "Java"; 1995 "Delphi 

We can convert the values from an `XLSX.Worksheet` to a `DataFrame` by wrapping them in the `DataFrame()` constructor. We also need to pass the `:auto` argument so that column names are generated automatically (`x1`, `x2`, ...). Note that the header row is then treated as ordinary data:

In [57]:
DataFrame(proglang["Sheet1!A1:B74"], :auto)

| Row | x1 | x2 |
|---|---|---|
| | Any | Any |
| 1 | year | language |
| 2 | 1951 | Regional Assembly Language |
| 3 | 1952 | Autocode |
| 4 | 1954 | IPL |
| 5 | 1955 | FLOW-MATIC |
| 6 | 1957 | FORTRAN |
| 7 | 1957 | COMTRAN |
| 8 | 1958 | LISP |
| 9 | 1958 | ALGOL 58 |
| 10 | 1959 | FACT |
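In the result above the header ended up as the first data row and every column has element type `Any`. One possible way to tidy this up is to promote the first row to column names and then narrow the column types by hand. The sketch below uses a small literal matrix in place of `proglang["Sheet1!A1:B74"]`:

```julia
using DataFrames

# Stand-in for the cell matrix returned by indexing a worksheet range
cells = Any["year" "language"; 1957 "FORTRAN"; 1958 "LISP"]

# Rows 2:end are the data; row 1 supplies the column names
df = DataFrame(cells[2:end, :], string.(cells[1, :]))

# Narrow the `Any` columns to concrete element types
df.year     = Int.(df.year)
df.language = String.(df.language)
```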


If we know which sheet we want to read, it's easier to use the `readtable()` function, which treats the first row as column names and returns an `XLSX.DataTable` object.

In [49]:
proglang = XLSX.readtable("programming_languages.xlsx", "Sheet1")
println(typeof(proglang))

XLSX.DataTable


The `XLSX.DataTable` can then be easily converted into a `DataFrame` by wrapping it in the `DataFrame()` constructor:

In [46]:
DataFrame(proglang)

| Row | year | language |
|---|---|---|
| | Any | Any |
| 1 | 1951 | Regional Assembly Language |
| 2 | 1952 | Autocode |
| 3 | 1954 | IPL |
| 4 | 1955 | FLOW-MATIC |
| 5 | 1957 | FORTRAN |
| 6 | 1957 | COMTRAN |
| 7 | 1958 | LISP |
| 8 | 1958 | ALGOL 58 |
| 9 | 1959 | FACT |
| 10 | 1959 | COBOL |


We can write XLSX files with the `XLSX.writetable()` function. This creates an XLSX file with one sheet containing our data, and is easiest to do with our data as a `DataFrame`. We pass the function three arguments:
1. Name of the XLSX file as a string
2. Values in each column, which can be obtained with `DataFrames.eachcol()`
3. Names of each column, which can be obtained with `DataFrames.names()`

In [60]:
proglang = DataFrame(XLSX.readtable("programming_languages.xlsx", "Sheet1"))
XLSX.writetable("proglang.xlsx", collect(DataFrames.eachcol(proglang)), DataFrames.names(proglang))
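As a self-contained illustration of the three arguments, the sketch below writes two made-up columns to a temporary workbook and reads them back. It assumes an XLSX.jl version in which `readtable()` returns an `XLSX.DataTable` (as in this notebook), and it uses the `overwrite`, `sheetname` and `infer_eltypes` keywords; without `overwrite=true`, `writetable()` refuses to replace an existing file.

```julia
using XLSX, DataFrames

columns = Any[[1957, 1958], ["FORTRAN", "LISP"]]   # one vector per column
labels  = ["year", "language"]                     # one name per column

xlsx_path = joinpath(mktempdir(), "demo.xlsx")
XLSX.writetable(xlsx_path, columns, labels; overwrite=true, sheetname="Sheet1")

# infer_eltypes=true asks XLSX to narrow the column types instead of leaving them as Any
df = DataFrame(XLSX.readtable(xlsx_path, "Sheet1"; infer_eltypes=true))
```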

In [62]:
DataFrames.names(proglang)

2-element Vector{String}:
 "year"
 "language"