# Julia Academy

## Data Science Course

# 1. Working with Data in Julia

**Huda Nassar**

**Source:** https://github.com/JuliaAcademy/DataScience/blob/main/01.%20Data.ipynb

In this notebook we will look at different ways of getting data and storing it in memory in Julia.


In [1]:
using BenchmarkTools
using DataFrames
using DelimitedFiles
using CSV
using XLSX

### Downloading data from the internet

We can use the `download` function, which does the equivalent of `wget`. We'll download a simple CSV file of programming languages and the year in which each one came out.

In [2]:
pl_url = download("https://raw.githubusercontent.com/nassarhuda/easy_data/master/programming_languages.csv",
                  "programming_languages.csv")

"programming_languages.csv"

We can use the `;head` shell command to view the first lines of the file we just downloaded:

In [3]:
;head programming_languages.csv

year,language
1951,Regional Assembly Language
1952,Autocode
1954,IPL
1955,FLOW-MATIC
1957,FORTRAN
1957,COMTRAN
1958,LISP
1958,ALGOL 58
1959,FACT
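
The `;head` shortcut relies on a shell that provides `head`. As a minimal pure-Julia sketch of the same peek, assuming a throwaway temp file (the path and the two-line limit are illustrative and not part of the course material), one could read just the first few lines:

```julia
# Illustrative temp file holding the first rows shown in the output above.
path = tempname()
write(path, "year,language\n1951,Regional Assembly Language\n1952,Autocode\n")

# Read only the first two lines; the file is closed when the block exits.
first_lines = open(path) do io
    collect(Iterators.take(eachline(io), 2))
end

foreach(println, first_lines)
```

Because `eachline` is lazy, only the requested lines are read, which matters for large files.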


### Reading data from text files

There are several options for reading data from a text file. We'll start with the `DelimitedFiles` package from the standard library. 

The following function will return a tuple of (data, header):

In [4]:
data_matrix, header = readdlm("programming_languages.csv", ',', header=true)

(Any[1951 "Regional Assembly Language"; 1952 "Autocode"; … ; 2012 "Julia"; 2014 "Swift"], AbstractString["year" "language"])

In [5]:
header

1×2 Array{AbstractString,2}:
 "year"  "language"

In [6]:
writedlm("programminglanguages_dlm.txt", data_matrix, '-')
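
To make the `(data, header)` return value concrete, here is a small round-trip sketch that uses only `DelimitedFiles`. The toy rows and the temp file are assumptions for illustration; they are not the course dataset.

```julia
using DelimitedFiles

# Toy data in the same two-column layout (assumed values).
rows = Any[1957 "FORTRAN"; 1958 "LISP"; 2012 "Julia"]
path = tempname()

# Write a header row followed by the data, comma-delimited.
open(path, "w") do io
    writedlm(io, ["year" "language"], ',')
    writedlm(io, rows, ',')
end

# header=true makes readdlm return a (data, header) tuple.
data, hdr = readdlm(path, ',', header=true)

println(size(data))   # dimensions of the data part, header excluded
println(hdr)          # 1×2 matrix holding the column names
```

The header comes back as a 1×2 matrix rather than a vector, which is why the notebook output above shows `1×2 Array{AbstractString,2}`.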

A more powerful option is the `CSV.jl` package, which is the recommended library for reading CSV files.

In [7]:
data_df = CSV.read("programming_languages.csv", DataFrame)

@show typeof(data_df)
data_df[1:10,:]

typeof(data_df) = DataFrame


| Row | year (Int64) | language (String) |
|---|---|---|
| 1 | 1951 | Regional Assembly Language |
| 2 | 1952 | Autocode |
| 3 | 1954 | IPL |
| 4 | 1955 | FLOW-MATIC |
| 5 | 1957 | FORTRAN |
| 6 | 1957 | COMTRAN |
| 7 | 1958 | LISP |
| 8 | 1958 | ALGOL 58 |
| 9 | 1959 | FACT |
| 10 | 1959 | COBOL |


A dataframe is a richer format than the default array format: its columns carry names and element types.

In [8]:
names(data_df)

2-element Array{String,1}:
 "year"
 "language"

In [9]:
describe(data_df)

| Row | variable (Symbol) | mean (Union…) | min (Any) | median (Union…) | max (Any) | nmissing (Int64) | eltype (DataType) |
|---|---|---|---|---|---|---|---|
| 1 | year | 1982.99 | 1951 | 1986.0 | 2014 | 0 | Int64 |
| 2 | language | | ALGOL 58 | | dBase III | 0 | String |


We can write a CSV file easily:

In [10]:
# :auto generates column names (x1, x2); DataFrames 1.0 and later require it for matrices
CSV.write("programminglanguages_CSV.csv", DataFrame(data_matrix, :auto))

"programminglanguages_CSV.csv"

### Reading XLSX files

Another common file type is the Excel spreadsheet, which we can read with the `XLSX.jl` package. A dedicated package is useful because Excel files can contain multiple sheets, are likely to have missing values, and may contain formula cells.

In [11]:
sprdsht = XLSX.readdata("data/zillow_data_download_april2020.xlsx", #file name
                        "Sale_counts_city", #sheet name
                        "A1:F9" #cell range
                        )

9×6 Array{Any,2}:
      "RegionID"  "RegionName"    …      "2008-03"      "2008-04"
  6181            "New York"             missing        missing
 12447            "Los Angeles"      1446           1705
 39051            "Houston"          2926           3121
 17426            "Chicago"          2910           3022
  6915            "San Antonio"   …  1479           1529
 13271            "Philadelphia"     1609           1795
 40326            "Phoenix"          1310           1519
 18959            "Las Vegas"        1618           1856

However, if we don't specify a cell range, the whole sheet is read and things will take longer.

In [12]:
sprdsht_full = XLSX.readtable("data/zillow_data_download_april2020.xlsx","Sale_counts_city");

This object is a tuple of two items. The first is a vector of vectors, where each vector corresponds to a column, and the second is the header with the column names:

In [13]:
sprdsht_full[1]

148-element Array{Any,1}:
 Any[6181, 12447, 39051, 17426, 6915, 13271, 40326, 18959, 54296, 38128  …  396952, 397236, 398030, 398104, 398357, 398712, 398716, 399081, 737789, 760882]
 Any["New York", "Los Angeles", "Houston", "Chicago", "San Antonio", "Philadelphia", "Phoenix", "Las Vegas", "San Diego", "Dallas"  …  "Barnard Plantation", "Windsor Place", "Stockbridge", "Mattamiscontis", "Chase Stream", "Bowdoin College Grant West", "Summerset", "Long Pond", "Hideout", "Ebeemee"]
 Any["New York", "California", "Texas", "Illinois", "Texas", "Pennsylvania", "Arizona", "Nevada", "California", "Texas"  …  "Maine", "Missouri", "Wisconsin", "Maine", "Maine", "Maine", "South Dakota", "Maine", "Utah", "Maine"]
 Any[1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  28750, 28751, 28752, 28753, 28754, 28755, 28756, 28757, 28758, 28759]
 Any[missing, 1446, 2926, 2910, 1479, 1609, 1310, 1618, 772, 1158  …  0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
 Any[missing, 1705, 3121, 3022, 1529, 1795, 1519, 1856, 1057, 1232  …  0, 0, 0, 0

We can easily convert this to a dataframe using the "splat" (`...`) operator which unwraps the tuple:

In [14]:
zillow_data = DataFrame(sprdsht_full...)  # equivalent to DataFrame(sprdsht_full[1], sprdsht_full[2])

| Row | RegionID (Any) | RegionName (Any) | StateName (Any) | SizeRank (Any) | 2008-03 (Any) | 2008-04 (Any) | 2008-05 (Any) |
|---|---|---|---|---|---|---|---|
| 1 | 6181 | New York | New York | 1 | missing | missing | missing |
| 2 | 12447 | Los Angeles | California | 2 | 1446 | 1705 | 1795 |
| 3 | 39051 | Houston | Texas | 3 | 2926 | 3121 | 3220 |
| 4 | 17426 | Chicago | Illinois | 4 | 2910 | 3022 | 2937 |
| 5 | 6915 | San Antonio | Texas | 5 | 1479 | 1529 | 1582 |
| 6 | 13271 | Philadelphia | Pennsylvania | 6 | 1609 | 1795 | 1709 |
| 7 | 40326 | Phoenix | Arizona | 7 | 1310 | 1519 | 1654 |
| 8 | 18959 | Las Vegas | Nevada | 8 | 1618 | 1856 | 1961 |
| 9 | 54296 | San Diego | California | 9 | 772 | 1057 | 1195 |
| 10 | 38128 | Dallas | Texas | 10 | 1158 | 1232 | 1240 |
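
The splat operator is not specific to `DataFrame`. A minimal sketch with a made-up two-argument function (`describe_table` and the toy tuple are hypothetical names used only for illustration) shows that `f(t...)` passes each element of the tuple as a separate argument:

```julia
# Hypothetical helper standing in for any two-argument constructor.
describe_table(columns, names) =
    string(length(columns), " columns: ", join(names, ", "))

# A (columns, names) tuple shaped like the one described above.
pair = ([[1, 2], ["a", "b"]], [:id, :label])

# Splatting the tuple is the same as indexing it by hand.
splatted = describe_table(pair...)
indexed  = describe_table(pair[1], pair[2])

println(splatted)
println(splatted == indexed)
```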


### Importing Serialized Data

Sometimes data is not stored as a plain text file. In that case we can use the package that matches each file type. We'll use a toy example of a very small matrix.

In [15]:
using JLD
jld_data = JLD.load("data/mytempdata.jld")

Dict{String,Any} with 1 entry:
  "tempdata" => [2 1446 … 1795 1890; 3 2926 … 3220 3405; 4 2910 … 2937 3224; 5 …

In [16]:
using NPZ
npz_data = npzread("data/mytempdata.npz")

4×5 Array{Int64,2}:
 2  1446  1705  1795  1890
 3  2926  3121  3220  3405
 4  2910  3022  2937  3224
 5  1479  1529  1582  1761

In [17]:
using MAT
matlab_data = matread("data/mytempdata.mat")

Dict{String,Any} with 1 entry:
  "tempdata" => [2 1446 … 1795 1890; 3 2926 … 3220 3405; 4 2910 … 2937 3224; 5 …

In [18]:
@show typeof(jld_data)
@show typeof(npz_data)
@show typeof(matlab_data)

typeof(jld_data) = Dict{String,Any}
typeof(npz_data) = Array{Int64,2}
typeof(matlab_data) = Dict{String,Any}


Dict{String,Any}
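
The sample `.jld`, `.npz` and `.mat` files are not needed to try the general idea. As a stand-in, the sketch below uses Julia's built-in `Serialization` standard library, which is a different format from the three packages above, with a toy matrix and temp file assumed for illustration:

```julia
using Serialization

# Toy matrix (assumed values), stored under a key as in the Dict outputs above.
tempdata = [2 1446 1705; 3 2926 3121]
path = tempname()

# Write the object to disk in Julia's native binary format, then read it back.
serialize(path, Dict("tempdata" => tempdata))
restored = deserialize(path)

println(typeof(restored))
println(restored["tempdata"] == tempdata)
```

Unlike the NPZ and MAT formats, this native format is not designed for exchange with other languages, so it is best kept for short-lived files.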

We can also load `rda` files using the `RData.jl` and `RCall.jl` libraries, but that requires having an R runtime installed, which I haven't done yet.

### Trying to answer simple questions from the programming languages dataset

Let's try and use the dataset we've loaded to answer some simple questions:

1. Which year was a given language created?
2. How many languages were created in a given year?

We'll start with the data in matrix format: `data_matrix`.

In [19]:
# Q1: Which year was a given language invented?
function year_created(data_matrix,language::String)
    loc = findfirst(data_matrix[:,2] .== language)
    return data_matrix[loc,1]
end

year_created(data_matrix, "Julia")

2012

But what if we run the function on a language which doesn't exist?

In [20]:
year_created(data_matrix, "W")

LoadError: ArgumentError: invalid index: nothing of type Nothing

This throws an error because `findfirst` returns `nothing` when no element matches, so `loc` has type `Nothing`, which cannot be used as an index. (An index that was merely out of range would raise a different error.)

We can add basic error handling as follows:

In [21]:
function year_created_handle_error(data_matrix,language::String)
    loc = findfirst(data_matrix[:,2] .== language)
    !isnothing(loc) && return data_matrix[loc,1]
    error("Error: Language not found.")
end

year_created_handle_error(data_matrix, "W")

LoadError: Error: Language not found.
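
Raising an error moves the decision to the caller. As a sketch of how a caller might handle it, assuming a toy matrix in the same layout (`toy`, `lookup_year` and `safe_year` are hypothetical names, not part of the notebook), a `try`/`catch` block can turn the failure into `missing`:

```julia
# Toy stand-in for `data_matrix` (assumed values).
toy = Any[1957 "FORTRAN"; 1958 "LISP"; 2012 "Julia"]

# Same pattern as above: raise an ErrorException when nothing matches.
function lookup_year(matrix, language::String)
    loc = findfirst(==(language), matrix[:, 2])
    isnothing(loc) && error("Language not found: $language")
    return matrix[loc, 1]
end

# The caller decides what a missing language means; here it becomes `missing`.
function safe_year(matrix, language::String)
    try
        return lookup_year(matrix, language)
    catch err
        err isa ErrorException || rethrow()   # let unrelated errors propagate
        return missing
    end
end

println(safe_year(toy, "Julia"))
println(safe_year(toy, "W"))
```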

In [22]:
# Q2: How many languages were created in a given year?
function how_many_per_year(data_matrix,year::Int64)
    year_count = length(findall(data_matrix[:,1].==year))
    return year_count
end

how_many_per_year(data_matrix, 2011)

4
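
`length(findall(...))` builds a vector of indices only to measure it. A possible alternative is `count` with a predicate, shown here on a toy vector of years (the values are assumed, and `how_many` is a hypothetical name):

```julia
# Toy year column (assumed values).
toy_years = [1957, 1957, 1958, 2011, 2011, 2011, 2011]

# count applies the predicate to each element and tallies the matches.
how_many(years, year::Integer) = count(==(year), years)

println(how_many(toy_years, 2011))
println(how_many(toy_years, 1900))
```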

#### Now let's use a dataframe to answer these questions

In [23]:
# Q1: Which year was a given language invented?
# it's a little more intuitive and you don't need to remember the column ids
function year_created(data_df, language::String)
    loc = findfirst(data_df.language .== language)
    return data_df.year[loc]
end

year_created(data_df, "Julia")

2012

In [24]:
year_created(data_df, "W")

LoadError: ArgumentError: invalid index: nothing of type Nothing

In [25]:
function year_created_handle_error(data_df,language::String)
    loc = findfirst(data_df.language .== language)
    !isnothing(loc) && return data_df.year[loc]
    error("Error: Language not found.")
end

year_created_handle_error(data_df, "W")

LoadError: Error: Language not found.

In [26]:
# Q2: How many languages were created in a given year?
function how_many_per_year(data_df, year::Int64)
    year_count = length(findall(data_df.year.==year))
    return year_count
end

how_many_per_year(data_df, 2011)

4

### Now let's treat this data as a dictionary.

We'll start by building a typed dictionary from the matrix:

In [27]:
data_dict = Dict{Integer, Vector{String}}()

Dict{Integer,Array{String,1}}()

In [28]:
# eg data_dict[67] = ["julia","programming"]

Now let's process the array with the years as keys and the values holding all the programming languages corresponding to that year in a vector of strings.

In [29]:

dict = Dict{Integer,Vector{String}}()
for i = 1:size(data_matrix, 1)
    year, lang = data_matrix[i, :]
    if year in keys(dict)
        dict[year] = push!(dict[year], lang)
        # note that push! isn't optimal here but it is correct and easy to understand
    else
        dict[year] = [lang]
    end
end
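
The `if`/`else` branch above can also be written with `get!`, which returns the existing value for a key or stores and returns a default. This is a sketch on a toy matrix (`toy` and `by_year` are hypothetical names with assumed values), not a replacement for the cell above:

```julia
# Toy stand-in for `data_matrix` (assumed values).
toy = Any[1957 "FORTRAN"; 1957 "COMTRAN"; 1958 "LISP"]

by_year = Dict{Int, Vector{String}}()
for i in 1:size(toy, 1)
    year, lang = toy[i, :]
    # Fetch the vector for this year, creating an empty one on first sight.
    langs = get!(() -> String[], by_year, year)
    push!(langs, lang)
end

println(by_year[1957])
println(length(by_year))
```

Since `push!` mutates the vector already stored in the dictionary, no reassignment is needed.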

However, since the data is sorted by year, we can take advantage of that in a smarter manner:

In [55]:
current_year = data_matrix[1,1]
data_dict[current_year] = [data_matrix[1, 2]]
for (i, next_year) in enumerate(data_matrix[2:end, 1])
    if next_year == current_year
        data_dict[current_year] = push!(data_dict[current_year], data_matrix[i+1, 2])
    else
        current_year = next_year
        data_dict[current_year] = [data_matrix[i+1, 2]]
    end
end

In [56]:
data_dict

Dict{Integer,Array{String,1}} with 45 entries:
  1991 => ["Python", "Visual Basic"]
  1993 => ["Lua", "R"]
  2005 => ["F#"]
  2010 => ["Rust"]
  1983 => ["Ada"]
  1957 => ["FORTRAN", "COMTRAN"]
  1987 => ["Perl"]
  2007 => ["Clojure"]
  1989 => ["FL "]
  1969 => ["B"]
  1952 => ["Autocode"]
  1963 => ["CPL"]
  2003 => ["Groovy", "Scala"]
  1958 => ["LISP", "ALGOL 58"]
  2014 => ["Swift"]
  1951 => ["Regional Assembly Language"]
  1997 => ["Rebol"]
  2000 => ["ActionScript"]
  1967 => ["BCPL"]
  1985 => ["Eiffel"]
  1968 => ["Logo"]
  1955 => ["FLOW-MATIC"]
  1984 => ["Common Lisp", "MATLAB", "dBase III"]
  2009 => ["Go"]
  1966 => ["JOSS"]
  ⋮    => ⋮

In [57]:
# Q1: Which year was a given language invented?
# now instead of looking in one long vector, we will look in many small vectors
function year_created(data_dict,language::String)
    keys_vec = collect(keys(data_dict))
    lookup = map(keyid -> findfirst(data_dict[keyid].==language), keys_vec)
    # now the lookup vector has `nothing` or a numeric value. We want to find the index of the numeric value.
    return keys_vec[findfirst((!isnothing).(lookup))]
end

year_created(data_dict,"Julia")

2012
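
Searching every vector works, but if this lookup is frequent, one option is to invert the dictionary once so that each language maps directly to its year. A minimal sketch on a toy dictionary (`toy_dict` and `year_of` are hypothetical names with assumed values):

```julia
# Toy year => languages dictionary in the same shape as `data_dict`.
toy_dict = Dict(1957 => ["FORTRAN", "COMTRAN"], 2012 => ["Julia"])

# Invert it: one language => year entry per language.
year_of = Dict(lang => year for (year, langs) in toy_dict for lang in langs)

println(year_of["Julia"])
println(get(year_of, "W", nothing))   # a default avoids a KeyError for unknown languages
```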

In [58]:
# Q2: How many languages were created in a given year?
how_many_per_year(data_dict,year::Int64) = length(data_dict[year])
how_many_per_year(data_dict,2011)

4