# Data

This notebook covers loading and collecting data with Julia. Downloading, reading, and writing data are core skills for any data scientist, and here we look at the tools Julia provides for each of them.

### Import Statements

In [1]:
using BenchmarkTools
using DataFrames
using DelimitedFiles
using CSV
using XLSX



## Getting Data

You can download data from the web with Julia's `download` function. In the Julia REPL or an IJulia notebook you can also prefix a line with `;` to run it as a shell command, so the command-line tools you would normally use to download data in a terminal work as well.

In [3]:
P = download("https://raw.githubusercontent.com/nassarhuda/easy_data/master/programming_languages.csv", "programming_languages.csv")

"programming_languages.csv"

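As a minimal sketch, the download step can be made repeatable by skipping the request when the file is already on disk. This uses `Downloads.download` from the `Downloads` standard library (available since Julia 1.6, where `Base.download` became a thin wrapper around it). The helper name `fetch_if_missing` is hypothetical and not part of the original notebook.

```julia
using Downloads  # standard library since Julia 1.6

# Hypothetical helper: download `url` to `path` only if `path` does not exist yet.
function fetch_if_missing(url::AbstractString, path::AbstractString)
    if isfile(path)
        return path                       # reuse the local copy, no network access
    end
    return Downloads.download(url, path)  # returns the output path
end
```

Calling `fetch_if_missing(url, "programming_languages.csv")` with the URL from the cell above would then download the file once and return the local path on later runs.
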
## Reading Data From Text Files

You can use the `DelimitedFiles` package to read text files stored on your computer, for example in your working directory.

In [5]:
P, H = readdlm("programming_languages.csv", ','; header=true)

(Any[1951 "Regional Assembly Language"; 1952 "Autocode"; … ; 2012 "Julia"; 2014 "Swift"], AbstractString["year" "language"])

After running this command, `H` holds the header row and `P` holds the raw data:

In [6]:
P

73×2 Matrix{Any}:
 1951  "Regional Assembly Language"
 1952  "Autocode"
 1954  "IPL"
 1955  "FLOW-MATIC"
 1957  "FORTRAN"
 1957  "COMTRAN"
 1958  "LISP"
 1958  "ALGOL 58"
 1959  "FACT"
 1959  "COBOL"
 1959  "RPG"
 1962  "APL"
 1962  "Simula"
    ⋮  
 2003  "Scala"
 2005  "F#"
 2006  "PowerShell"
 2007  "Clojure"
 2009  "Go"
 2010  "Rust"
 2011  "Dart"
 2011  "Kotlin"
 2011  "Red"
 2011  "Elixir"
 2012  "Julia"
 2014  "Swift"

In [7]:
H

1×2 Matrix{AbstractString}:
 "year"  "language"

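Because `readdlm` returns a `Matrix{Any}` here, it is often convenient to pull the columns out as concretely typed vectors before working with them. The following is a sketch under stated assumptions: it writes a three-row sample (copied from the output above) to a temporary file, and the variable names are illustrative rather than taken from the notebook.

```julia
using DelimitedFiles

# A three-row sample in the same layout as programming_languages.csv.
path = tempname()
write(path, "year,language\n1951,Regional Assembly Language\n1952,Autocode\n1954,IPL\n")

data, header = readdlm(path, ','; header=true)

# Convert the untyped columns into concretely typed vectors.
years     = Int.(data[:, 1])
languages = String.(data[:, 2])
colnames  = String.(vec(header))
```
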
We can also use this package to write Julia objects such as matrices and vectors to text files with the `writedlm` function. Its arguments are:
* name of the text file you are writing to
* the object containing the data you want to write
* the delimiter

In [8]:
writedlm("programming_languages_dlm.txt", P, '-')

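One caveat when picking a delimiter: `writedlm` writes values as they are, so a delimiter that also occurs inside the data (such as the `-` in "FLOW-MATIC") makes the file ambiguous to read back. Below is a minimal round-trip sketch, assuming a tab is a safe delimiter for this data. The two sample rows and the variable names are illustrative.

```julia
using DelimitedFiles

rows = Any[1955 "FLOW-MATIC"; 2012 "Julia"]

path = tempname()
writedlm(path, rows, '\t')      # tab does not occur in any value

restored = readdlm(path, '\t')  # read back with the same delimiter
restored == rows                # the round trip preserves every cell
```
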
A more powerful way to read CSV files is Julia's `CSV` package. Passing `DataFrame` as the second argument to `CSV.read` loads the data into a DataFrame, which is a more convenient format for data cleaning and exploratory data analysis (EDA). Prefer this package (or the `XLSX` package for Excel files) over `DelimitedFiles`.

In [14]:
C = CSV.read("programming_languages.csv", DataFrame);
@show typeof(C)
C[1:10,:]

typeof(C) = DataFrame


| Row | year | language |
|-----|------|----------|
|     | Int64 | String31 |
| 1 | 1951 | Regional Assembly Language |
| 2 | 1952 | Autocode |
| 3 | 1954 | IPL |
| 4 | 1955 | FLOW-MATIC |
| 5 | 1957 | FORTRAN |
| 6 | 1957 | COMTRAN |
| 7 | 1958 | LISP |
| 8 | 1958 | ALGOL 58 |
| 9 | 1959 | FACT |
| 10 | 1959 | COBOL |

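`CSV.read` also accepts keyword arguments that control parsing. As a hedged sketch, the example below reads a small in-memory sample (rows copied from the output above) and uses the `types` keyword to request a plain `String` column instead of the inferred `String31`. The variable names are assumptions for illustration.

```julia
using CSV, DataFrames

text = """
year,language
1951,Regional Assembly Language
1952,Autocode
1954,IPL
"""

# The second argument is the "sink": the table type CSV.read should build.
df = CSV.read(IOBuffer(text), DataFrame; types=Dict(:language => String))

eltype(df.year)      # Int64, inferred from the data
eltype(df.language)  # String, as requested through `types`
```
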

Data frames in Julia are convenient because each column can be accessed as a property of the DataFrame object. This means that we can do things like this:

In [17]:
C.year

73-element Vector{Int64}:
 1951
 1952
 1954
 1955
 1957
 1957
 1958
 1958
 1959
 1959
 1959
 1962
 1962
    ⋮
 2003
 2005
 2006
 2007
 2009
 2010
 2011
 2011
 2011
 2011
 2012
 2014

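Property access is one of several ways to get at a column, and the forms differ in whether they copy. The sketch below builds a small DataFrame by hand (sample values taken from the output above, variable names illustrative) to show the difference.

```julia
using DataFrames

df = DataFrame(year = [1951, 1952, 1954],
               language = ["Regional Assembly Language", "Autocode", "IPL"])

a = df.year         # the stored column itself (no copy)
b = df[!, :year]    # same as df.year
c = df[:, :year]    # an independent copy
d = df[:, "year"]   # column names may also be given as strings

a === b             # true: both refer to the same vector
a === c             # false: c is a copy
```
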
You can get the column names of the data frame by calling the `names` function. You can also get some descriptive statistics of your data frame by calling the `describe` function:

In [19]:
@show names(C)
describe(C)

names(C) = ["year", "language"]


| Row | variable | mean | min | median | max | nmissing | eltype |
|-----|----------|------|-----|--------|-----|----------|--------|
|     | Symbol | Union… | Any | Union… | Any | Int64 | DataType |
| 1 | year | 1982.99 | 1951 | 1986.0 | 2014 | 0 | Int64 |
| 2 | language |  | ALGOL 58 |  | dBase III | 0 | String31 |

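`describe` can also be limited to selected statistics by passing their names as symbols. A minimal sketch on a hand-built DataFrame (sample values from the output above; the name `stats` is an assumption):

```julia
using DataFrames

df = DataFrame(year = [1951, 1952, 1954],
               language = ["Regional Assembly Language", "Autocode", "IPL"])

# Restrict the summary to a few statistics by passing their names as symbols.
stats = describe(df, :min, :max, :nmissing)
```

The result is itself a DataFrame with one row per column of `df`, so it can be indexed like any other, for example `stats.min`.
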

For funsies, here is a benchmark comparing the `DelimitedFiles` and `CSV` packages:

In [20]:
@btime P, H = readdlm("programming_languages.csv", ','; header=true);
@btime C = CSV.read("programming_languages.csv", DataFrame)

  202.600 μs (2438 allocations: 111.08 KiB)
  256.700 μs (260 allocations: 46.30 KiB)



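When the file path lives in a variable rather than a string literal, BenchmarkTools recommends interpolating it with `$` so that the global variable lookup is not part of the measurement. A small sketch, assuming a temporary sample file. The `samples` and `evals` values are arbitrary choices to keep the run short.

```julia
using BenchmarkTools, DelimitedFiles

path = tempname()
write(path, "year,language\n1951,Regional Assembly Language\n1952,Autocode\n")

# `$path` interpolates the global so its lookup is not timed.
# @belapsed returns the minimum measured time in seconds.
t = @belapsed readdlm($path, ','; header=true) samples=50 evals=1
```
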