### Julia DataFrames

I am captivated by the Julia programming language. I had previously used Python and R for data science projects. What captured my attention was this statement:

>It “runs like C but reads like Python” (Perkel, 2019).

Julia has an extensive ecosystem of packages. Here I will explore one of them, DataFrames.jl.

Let's load DataFrames.jl with using and build a small DataFrame:

In [4]:
using DataFrames

function weight_friends()
    name = ["Andrew", "Ross", "Bob", "Marc"]
    weights = [180, 191.5, 200, 167.8]
    DataFrame(; name, weights)
end

df = weight_friends()

| Row | name | weights |
|-----|--------|---------|
| | String | Float64 |
| 1 | Andrew | 180.0 |
| 2 | Ross | 191.5 |
| 3 | Bob | 200.0 |
| 4 | Marc | 167.8 |


In [5]:
df.name

4-element Vector{String}:
 "Andrew"
 "Ross"
 "Bob"
 "Marc"

### CSV

Comma-separated values (CSV) files are a practical way to store tables.

Julia has the CSV.jl package for reading and writing them. Let's load it, write our DataFrame to a file called weights.csv, and read the file back as a string:

In [9]:
using CSV

function write_weights_csv()
    path = "weights.csv"
    CSV.write(path, df)
end

path = write_weights_csv()
read(path, String)

"name,weights\nAndrew,180.0\nRoss,191.5\nBob,200.0\nMarc,167.8\n"

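We can use CSV.read and pass DataFrame as the second argument to get the result as a DataFrame. Note that the name column comes back as String7, a fixed-width string type that CSV.jl uses for short strings: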

In [10]:
df = CSV.read(path, DataFrame)

| Row | name | weights |
|-----|---------|---------|
| | String7 | Float64 |
| 1 | Andrew | 180.0 |
| 2 | Ross | 191.5 |
| 3 | Bob | 200.0 |
| 4 | Marc | 167.8 |


### RDatasets.jl

The RDatasets package provides an easy way for Julia users to experiment with most of the standard data sets that are available in the core of R as well as datasets included with many of R's most popular packages.

https://github.com/JuliaStats/RDatasets.jl

In [12]:
using RDatasets
iris = dataset("datasets", "iris")

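| Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
|-----|-------------|------------|-------------|------------|---------|
| | Float64 | Float64 | Float64 | Float64 | Cat… |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 7 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 8 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
| 9 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 10 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |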


### How to Index and Summarize

How do we retrieve the SepalLength column as a vector?

In [13]:
iris.SepalLength

150-element Vector{Float64}:
 5.1
 4.9
 4.7
 4.6
 5.0
 5.4
 4.6
 5.0
 4.4
 4.9
 5.4
 4.8
 4.8
 ⋮
 6.0
 6.9
 6.7
 6.9
 5.8
 6.8
 6.7
 6.7
 6.3
 6.5
 6.2
 5.9

Alternatively, we can index a DataFrame like a two-dimensional Array. The first index selects rows and the second selects columns:

In [14]:
iris[!, :SepalLength]

150-element Vector{Float64}:
 5.1
 4.9
 4.7
 4.6
 5.0
 5.4
 4.6
 5.0
 4.4
 4.9
 5.4
 4.8
 4.8
 ⋮
 6.0
 6.9
 6.7
 6.9
 5.8
 6.8
 6.7
 6.7
 6.3
 6.5
 6.2
 5.9

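We can also use iris[:, :SepalLength], which returns a copy of the column :SepalLength.

iris[!, :SepalLength] avoids that copy: it returns the column vector stored in the DataFrame itself. This is faster and uses less memory, but any in-place modification of the result also changes the DataFrame.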

In [15]:
iris[:, :SepalLength]

150-element Vector{Float64}:
 5.1
 4.9
 4.7
 4.6
 5.0
 5.4
 4.6
 5.0
 4.4
 4.9
 5.4
 4.8
 4.8
 ⋮
 6.0
 6.9
 6.7
 6.9
 5.8
 6.8
 6.7
 6.7
 6.3
 6.5
 6.2
 5.9
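The difference between the two selectors is easiest to see by mutating what they return. Here is a minimal sketch, using a small hypothetical DataFrame `toy` that is not part of the data above:

```julia
using DataFrames

# Hypothetical toy table, used only to illustrate copy vs. no-copy access
toy = DataFrame(x = [1, 2, 3])

no_copy = toy[!, :x]   # the column vector stored in `toy` itself
copied  = toy[:, :x]   # a freshly allocated copy

copied[1] = 100        # changes only the copy
no_copy[2] = 200       # changes `toy` as well

toy.x                  # now [1, 200, 3]
no_copy === toy.x      # true: the very same object
copied === toy.x       # false: an independent vector
```

If you only read the column, both forms give the same values; the choice matters once you mutate the result or work with large tables.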

**The first index selects rows:**

In [16]:
iris[1, :]

| Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
|-----|-------------|------------|-------------|------------|---------|
| | Float64 | Float64 | Float64 | Float64 | Cat… |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |


#### Slicing

The first 4 values of the Species column, selected with a row range:

In [18]:
iris[1:4, :Species]

4-element CategoricalArrays.CategoricalArray{String,1,UInt8}:
 "setosa"
 "setosa"
 "setosa"
 "setosa"

### Filter and Subset

We filter rows with filter(source => f::Function, df).
* It is very similar to filter(f::Function, V::Vector) from Julia's Base module.
* It is a good example of how DataFrames.jl uses **multiple dispatch**: the package defines a new method of filter that accepts a DataFrame as an argument.

Filtering data with functions is powerful.
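To make the parallel between the two methods concrete, here is a small sketch that calls both on the same numbers. The table `toy` and its column `n` are hypothetical names chosen for illustration:

```julia
using DataFrames

# Base method: filter(f, collection)
odd_values = filter(isodd, [1, 2, 3, 4])   # keeps the odd elements

# DataFrames.jl method: filter(column => f, dataframe)
toy = DataFrame(n = [1, 2, 3, 4])          # hypothetical toy table
odd_rows = filter(:n => isodd, toy)        # keeps the rows where n is odd

odd_values                                  # [1, 3]
odd_rows.n                                  # [1, 3]
```

The function name is the same in both calls; Julia picks the method from the types of the arguments.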

What does the `=>` operator do?

"What does => do in replace!(x, 0=>4)? Does it create a pair, a replacement of all zeros by fours, or what?"

It creates a Pair. When replace receives a Pair as its second argument, multiple dispatch selects the method that, given an array or string x, replaces every item in x matching the first element of the Pair with the second element.

https://stackoverflow.com/questions/52488743/julia-1-0-0-what-does-the-operator-do

This small example shows that => creates a Pair:

julia> replace("julia", Pair("u", "o"))

"jolia"

julia> replace("julia", "u" => "o")

"jolia"

The => operator is a function that creates a Pair. The replace function then uses this Pair to decide what changes to make to its input, but => itself is completely independent of that.

For example, => is also used in the construction of dictionaries: Dict("A"=>1, "B"=>2) creates a dictionary with keys "A" and "B", and values 1 and 2. As you can see, => does not denote a transformation here.
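The following sketch, which needs only Base Julia, shows that a Pair is an ordinary value that can be stored in a variable and handed to different functions. The variable names `p` and `d` are illustrative:

```julia
p = "u" => "o"            # same value as Pair("u", "o")

typeof(p)                 # Pair{String, String}
p.first, p.second         # ("u", "o")

# The same Pair can be passed to different functions
replace("julia", p)       # "jolia"
d = Dict(p, "a" => "e")   # a Dict with keys "u" and "a"
d["u"]                    # "o"
```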

In [42]:
typeof(iris.Species)

CategoricalArrays.CategoricalVector{String, UInt8, String, CategoricalArrays.CategoricalValue{String, UInt8}, Union{}} (alias for CategoricalArrays.CategoricalArray{String, 1, UInt8, String, CategoricalArrays.CategoricalValue{String, UInt8}, Union{}})

#### CategoricalArrays.jl

This package provides tools for working with categorical variables, both with unordered (nominal variables) and ordered categories (ordinal variables), optionally with missing values.

https://github.com/JuliaData/CategoricalArrays.jl

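The predicate defined below accepts only String arguments, so we first extract the plain String values from the categorical array with unwrap. A related Stack Overflow question, "What is the perfect way to convert a categorical array to a numeric array?", covers conversion in more detail:
https://stackoverflow.com/questions/67469468/julia-what-is-the-perfect-way-to-convert-a-categorical-array-to-a-numeric-array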

In [44]:
using CategoricalArrays

v_species = unwrap.(iris.Species)


150-element Vector{String}:
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 ⋮
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"

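For comparison, here is a small sketch of the two conversions on a hypothetical three-element vector standing in for iris.Species. It assumes the `levelcode` function exported by CategoricalArrays.jl, which returns the position of a value among the levels:

```julia
using CategoricalArrays

# Hypothetical three-element vector standing in for iris.Species
species = categorical(["setosa", "virginica", "setosa"])

levels(species)                  # ["setosa", "virginica"]

as_strings = unwrap.(species)    # plain Vector{String}
as_codes = levelcode.(species)   # index of each value in levels(species)
```

Use `unwrap` when you need the original values and `levelcode` when you need integer codes.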
In [47]:
# A named predicate; it accepts only String, so it works on the unwrapped vector
get_setosa(name::String) = name == "setosa"

# Three ways to filter:

# 1. Pass the named predicate. filter works on vectors too, not only on DataFrames:
# filter(get_setosa, v_species)

# 2. Pass an anonymous function:
# filter(:Species => name -> name == "setosa", iris)

# 3. Pass a partially applied operator:
filter(:Species => ==("setosa"), iris)

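| Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
|-----|-------------|------------|-------------|------------|---------|
| | Float64 | Float64 | Float64 | Float64 | Cat… |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 7 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 8 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
| 9 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 10 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |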


The blog post linked below compares two ways of writing the same filter, where df is a DataFrame with a column a. The row-wise form is:

julia> filter(row -> row.a != 2, df)

A more efficient (faster to execute) way to express the same is:

julia> filter(:a => !=(2), df)

As you can see, the style is to pass a Pair of a column name and a predicate function (i.e. a function that returns a Bool). This has two benefits. Firstly, the operation is type stable (thus faster). Secondly, it avoids a cost of the row -> row.a != 2 form, where we define a new anonymous function with each call of filter, which causes compilation (unless the operation is wrapped in a function or we predefine the predicate function).

https://www.juliabloggers.com/dataframes-jl-why-do-we-have-both-subset-and-filter-functions/
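The two forms can be checked side by side. This is a minimal sketch on a hypothetical three-row table `toy` with a column `a`; it shows only that both calls select the same rows and does not measure the speed difference:

```julia
using DataFrames

toy = DataFrame(a = [1, 2, 3])   # hypothetical table with a column `a`

# Row-wise predicate: receives one DataFrameRow at a time
by_row = filter(row -> row.a != 2, toy)

# Column-wise predicate: receives the values of column `a` one at a time
by_col = filter(:a => !=(2), toy)

by_row == by_col                 # true: both keep the rows with a = 1 and a = 3

# !=(2) is a Base.Fix2 object with a fixed type, not a fresh closure
!=(2) isa Base.Fix2              # true
```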