# Global Workshop on Earth Observation with Julia
- 9-13 January 2023
- Terceira Island, Azores (Timezone: UTC -1; GMT-1), Portugal, AIR Centre

## Julia for beginners
> 10:20 – 12:15 Hands-on session 1

> Lazaro Alonso  
<img src="https://upload.wikimedia.org/wikipedia/commons/9/91/Octicons-mark-github.svg" width=20/> https://github.com/lazarusA 
> - <font color = teal> **Max Planck Institute for Biogeochemistry** </font>
> - Model-Data Integration Group 
- lalonso@bgc-jena.mpg.de

Find me also on social media:

<img src="https://upload.wikimedia.org/wikipedia/commons/d/d5/Mastodon_logotype_%28simple%29_new_hue.svg" width=30/>
<font color = dodgerblue> https://julialang.social/@LazaroAlonso </font>

<img src="https://upload.wikimedia.org/wikipedia/commons/4/4f/Twitter-logo.svg" width=30/>
<font color = dodgerblue>  https://twitter.com/LazarusAlon </font>


#### References:
- https://juliadatascience.io
    > by Jose Storopoli, Rik Huijzer and Lazaro Alonso
    
Here I will try to follow the workflow/logic from this book.
- https://docs.julialang.org
- https://docs.julialang.org/en/v1/manual/performance-tips/

**Disclaimer:** 

I'm just a user of the language like you, so a developer's perspective and in-depth technicalities are beyond the scope of this talk.

**Overview**

#### 10:20-10:40 Language Syntax
- **Variables**
- struct
- Boolean Operators and Numeric comparisons
- Functions (I ❤️ them) [mention filter]
- Multiple Dispatch
- Keyword arguments
- Anonymous Functions
- Conditionals

In [5]:
name = "Julia"
age = 11 # years

11

In [6]:
name

"Julia"

In [7]:
# operations on variables, addition, multiplication
10*age

110

In [29]:
10 + age

21

Q. What type of variables are those? 
> Use `typeof` and find out. 

In [8]:
typeof(age)

Int64

In [9]:
typeof(name)

String

- Variables
- **struct**
- Boolean Operators and Numeric comparisons
- Functions (I ❤️ them) [mention filter]
- Multiple Dispatch
- Keyword arguments
- Anonymous Functions
- Conditionals

Having variables around without any sort of hierarchy or relationships is not ideal. In Julia, we can define that kind of structured data with a struct (also known as a composite type).

_This is only the basics and not the whole story, but it is enough to get us started._

- https://docs.julialang.org/en/v1/manual/performance-tips/#Type-declarations

In [10]:
struct Language
    name::String
    title::String
    year_of_birth::Int64
    fast::Bool
end

In [11]:
fieldnames(Language)

(:name, :title, :year_of_birth, :fast)

Q. How to use them?
> Instantiate as in:

In [13]:
julia = Language("Julia", "Rapidus", 2012, true)

Language("Julia", "Rapidus", 2012, true)

In [14]:
python = Language("Python", "Letargicus", 1991, false)

Language("Python", "Letargicus", 1991, false)

<font color=red>We can’t change their values once they are instantiated.</font>
If we want to update some of those fields later, we use a `mutable struct` instead.

In [15]:
mutable struct MutableLanguage
    name::String
    title::String
    year_of_birth::Int64
    fast::Bool
end

In [16]:
julia_mutable = MutableLanguage("Julia", "Rapidus", 2012, true)

MutableLanguage("Julia", "Rapidus", 2012, true)

In [17]:
julia_mutable.title = "Python Obliteratus"

"Python Obliteratus"

In [18]:
julia_mutable

MutableLanguage("Julia", "Python Obliteratus", 2012, true)
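
For contrast, here is a minimal sketch of what happens when we try the same kind of assignment on an immutable `struct`. `Point2D` is a hypothetical type made up for this illustration; it is not part of the workshop material.

```julia
# Hypothetical immutable struct, used only for this illustration.
struct Point2D
    x::Float64
    y::Float64
end

p = Point2D(1.0, 2.0)

# Assigning to a field of an immutable struct throws an error,
# so we catch it and record that the assignment failed.
assignment_failed = try
    p.x = 3.0
    false
catch
    true
end

println(assignment_failed)  # true
println(p.x)                # 1.0, the value is unchanged
```

If the fields need to change after construction, declaring the type with `mutable struct` is the way to go.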

- Variables
- struct
- **Boolean Operators and Numeric comparisons**
- Functions (I ❤️ them) [mention filter]
- Multiple Dispatch
- Keyword arguments
- Anonymous Functions
- Conditionals

**`!` NOT, `&&` AND, `||` OR**

In [19]:
!true

false

In [20]:
(false && true) || (!false)

true

In [21]:
(6 isa Int64) && (6 isa Real)

true

In [22]:
# equality
1 == 1

true

In [23]:
# less than
1<2

true

In [26]:
# less than or equal to
3.14 <= 3.14

true

In [27]:
# mix with boolean 
(1 != 10) || (3.14 <= 2.71)

true

- Variables
- struct
- Boolean Operators and Numeric comparisons
- **Functions (I ❤️ them) [mention filter]**
- **Multiple Dispatch**
- Keyword arguments
- Anonymous Functions
- Conditionals

In [28]:
f_name(arg1, arg2) = arg1 + arg2

f_name (generic function with 1 method)

In [30]:
function function_name(arg1, arg2)
    return arg1 + arg2
end

function_name (generic function with 1 method)

In [32]:
f(x,y) = x + y

f (generic function with 2 methods)

In [33]:
f(1,2)

3

In [34]:
f(2.0, 1.0)

3.0

<font color = red> Type declarations </font> and multiple dispatch: Julia picks the method whose argument types match the call.

- https://docs.julialang.org/en/v1/manual/methods/#Defining-Methods

In [35]:
function round_number(x::Float64)
    return round(x)
end

round_number (generic function with 1 method)

In [36]:
function round_number(x::Int64)
    return x
end

round_number (generic function with 2 methods)
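
To see dispatch in action, we can call `round_number` with each argument type. This is a small sketch that redefines the two methods from above so it runs on its own; the call with a `String` is added only to show what happens when no method matches.

```julia
round_number(x::Float64) = round(x)
round_number(x::Int64) = x

# Integer literals are Int64 on a 64-bit machine (see `typeof(age)` earlier).
println(round_number(2.7))  # 3.0, the Float64 method
println(round_number(2))    # 2, the Int64 method

# No method accepts a String, so this call raises a MethodError.
no_string_method = try
    round_number("2.7")
    false
catch err
    err isa MethodError
end

println(no_string_method)   # true
```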

**multiple return values**

In [37]:
function add_multiply(x, y)
    addition = x + y
    multiplication = x * y
    return addition, multiplication
end

add_multiply (generic function with 1 method)

In [38]:
add_multiply(1, 2)

(3, 2)

In [39]:
out = add_multiply(1, 2)

(3, 2)

In [40]:
first(out)

3

In [41]:
last(out)

2

- Variables
- struct
- Boolean Operators and Numeric comparisons
- Functions (I ❤️ them) [mention filter]
- Multiple Dispatch
- **Keyword arguments**
- Anonymous Functions
- Conditionals

In [43]:
function logarithm(x::Real; base::Real=2.7182818284590)
    return log(base, x)
end

logarithm (generic function with 1 method)

In [44]:
logarithm(10)

2.3025850929940845

In [45]:
logarithm(10; base=2)

3.3219280948873626

- Variables
- struct
- Boolean Operators and Numeric comparisons
- Functions (I ❤️ them) [mention filter]
- Multiple Dispatch
- Keyword arguments
- **Anonymous Functions**
- Conditionals

Often we don’t care about the name of the function and want to quickly make one.

In [46]:
map(x -> 2.7182818284590^x, logarithm(2))

2.0
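
Anonymous functions are most useful as arguments to functions such as `map` and `filter`. A short sketch on a small vector (the variable names are just for illustration):

```julia
numbers = [1, 2, 3, 4, 5]

# map applies the anonymous function to every element
squares = map(x -> x^2, numbers)

# filter keeps the elements for which the anonymous function returns true
evens = filter(x -> iseven(x), numbers)

println(squares)  # [1, 4, 9, 16, 25]
println(evens)    # [2, 4]
```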

- Variables
- struct
- Boolean Operators and Numeric comparisons
- Functions (I ❤️ them) [mention filter]
- Multiple Dispatch
- Keyword arguments
- Anonymous Functions
- **Conditionals**

Let's compare two numbers, `a` and `b`.

In [51]:
function compare(a, b)
    if a < b
        "a is less than b"
    elseif a > b
        "a is greater than b"
    else
        "a is equal to b"
    end
end

compare (generic function with 1 method)

In [52]:
compare(3.14, 3.14)

"a is equal to b"

A pattern that I use a lot is the ternary operator `condition ? x : y`, chained here to cover all three cases:

In [53]:
function compare_ternary(a, b)
    a < b ? "a is less than b" : a > b ? "a is greater than b" : "a is equal to b"
end

compare_ternary (generic function with 1 method)

In [54]:
compare_ternary(3.14, 3.14)

"a is equal to b"

**The `for` loop ! Use it!**

In [58]:
for i in 1:5
    println(i)
end

1
2
3
4
5


In [59]:
for i ∈ 1:5
    println(i)
end

1
2
3
4
5


#### 10:40-11:00 Native Data Structures
- **String**
- Tuple
- NamedTuple
- UnitRange
- Array
- Pair
- Dict
- Symbol

In [60]:
str1 = "This is a string"

"This is a string"

In [61]:
typeof(str1)

String

In [62]:
str2 = """
    This is a big multiline string with a nested "quotation".
    As you can see.
    It is still a String to Julia.
    """

"This is a big multiline string with a nested \"quotation\".\nAs you can see.\nIt is still a String to Julia.\n"

In [63]:
typeof(str2)

String

In [65]:
# concatenation
hello = "Hello"
goodbye = "Goodbye"

"Goodbye"

In [66]:
hello*goodbye

"HelloGoodbye"

In [67]:
join([hello, goodbye], " ")

"Hello Goodbye"

In [68]:
# String Interpolation
"$hello $goodbye"

"Hello Goodbye"

In [69]:
function compare_interpolate(a, b)
    a < b ? "$a is less than $b" : a > b ? "$a is greater than $b" : "$a is equal to $b"
end

compare_interpolate (generic function with 1 method)

In [70]:
compare_interpolate(3.14, 3.14)

"3.14 is equal to 3.14"

**Functions to manipulate strings**

In [71]:
julia_string = "Julia is an amazing open source programming language"

"Julia is an amazing open source programming language"

In [72]:
contains(julia_string, "Julia")

true

In [73]:
startswith(julia_string, "Julia")

true

In [74]:
endswith(julia_string, "Julia")

false

In [75]:
lowercase(julia_string)

"julia is an amazing open source programming language"

In [76]:
uppercase(julia_string)

"JULIA IS AN AMAZING OPEN SOURCE PROGRAMMING LANGUAGE"

In [77]:
titlecase(julia_string)

"Julia Is An Amazing Open Source Programming Language"

In [78]:
lowercasefirst(julia_string)

"julia is an amazing open source programming language"

In [79]:
replace(julia_string, "amazing" => "awesome")

"Julia is an awesome open source programming language"

In [80]:
split(julia_string, " ")

8-element Vector{SubString{String}}:
 "Julia"
 "is"
 "an"
 "amazing"
 "open"
 "source"
 "programming"
 "language"

**String Conversions**

In [82]:
string(123)

"123"

In [81]:
parse(Int64, "123")

123

- String
- **Tuple**
- NamedTuple
- UnitRange
- Array
- Pair
- Dict
- Symbol

In [83]:
my_tuple = (1, 3.14, "Julia") # a tuple is immutable, like a struct

(1, 3.14, "Julia")

In [85]:
add_mul = add_multiply(1, 2)

(3, 2)

In [86]:
typeof(add_mul)

Tuple{Int64, Int64}

Anonymous functions can take several arguments, written as a tuple:

In [87]:
map((x, y) -> x^y, 2, 3)

8

In [88]:
map((x, y, z) -> x^y + z, 2, 3, 1)

9

- String
- Tuple
- **NamedTuple**
- UnitRange
- Array
- Pair
- Dict
- Symbol

In [89]:
my_namedtuple = (i=1, f=3.14, s="Julia")

(i = 1, f = 3.14, s = "Julia")

In [90]:
my_namedtuple.f

3.14

In [93]:
it = 1
ft = 3.14
st = "Julia"

"Julia"

Shortcut: start the named tuple with a semicolon `;` before the values, and the variable names become the field names.

In [94]:
my_quick_namedtuple = (; it, ft, st)

(it = 1, ft = 3.14, st = "Julia")

- String
- Tuple
- NamedTuple
- **UnitRange**
- Array
- Pair
- Dict
- Symbol

In [95]:
1:10

1:10

In [96]:
typeof(1:10)

UnitRange{Int64}

In [97]:
0.0:0.2:1.0

0.0:0.2:1.0

In [98]:
typeof(0.0:0.2:1.0)

StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}

- String
- Tuple
- NamedTuple
- UnitRange
- **Array**
- Pair
- Dict
- Symbol

If you want to _materialize_ a range into a collection, you can use the function `collect`.

In [100]:
collect(1:5)

5-element Vector{Int64}:
 1
 2
 3
 4
 5

In [101]:
myarray = [1, 2, 3]

3-element Vector{Int64}:
 1
 2
 3

Let’s start with array types. There are several, but we will focus on the following two:

- Vector{T}: one-dimensional array. Alias for Array{T, 1}.
- Matrix{T}: two-dimensional array. Alias for Array{T, 2}.

In [102]:
my_vector = Vector{Float64}(undef, 10)

10-element Vector{Float64}:
 2.212206257e-314
 2.2122062886e-314
 2.2122222173e-314
 2.3282579136e-314
 2.3532027585e-314
 2.2122063203e-314
 2.212206352e-314
 2.2122059527e-314
 2.2122063835e-314
 2.212206415e-314

In [103]:
my_matrix = Matrix{Float64}(undef, 10, 2)

10×2 Matrix{Float64}:
 4.88059e-313  8.48798e-314
 5.30499e-313  2.54639e-313
 6.36599e-313  2.75859e-313
 5.51719e-313  7.85138e-313
 5.51719e-313  8.27578e-313
 5.94159e-313  2.97079e-313
 4.03179e-313  3.18299e-313
 2.122e-314    8.70018e-313
 4.24399e-314  8.70018e-313
 4.24399e-314  2.122e-314

**Common arrays**

In [104]:
zeros(3,3)

3×3 Matrix{Float64}:
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0

In [105]:
ones(Int64, 3,3)

3×3 Matrix{Int64}:
 1  1  1
 1  1  1
 1  1  1

Array literals also work:

In [106]:
[[1 2]
 [3 4]]

2×2 Matrix{Int64}:
 1  2
 3  4

In [107]:
Float64[[1 2]
        [3 4]]

2×2 Matrix{Float64}:
 1.0  2.0
 3.0  4.0

In [108]:
# mix and match
[ones(Int, 2, 2) zeros(Int, 2, 2)]

2×4 Matrix{Int64}:
 1  1  0  0
 1  1  0  0

In [110]:
# array comprehension
[x^2 for x in 1:5]

5-element Vector{Int64}:
  1
  4
  9
 16
 25

In [112]:
[x*y for x in 1:5 for y in 1:2]

10-element Vector{Int64}:
  1
  2
  2
  4
  3
  6
  4
  8
  5
 10

In [113]:
[x*y for x in 1:5, y in 1:2]

5×2 Matrix{Int64}:
 1   2
 2   4
 3   6
 4   8
 5  10

In [114]:
# conditional
[x^2 for x in 1:5 if isodd(x)]

3-element Vector{Int64}:
  1
  9
 25

**Concatenation**: to chain together.

In [117]:
cat(ones(2), zeros(2), dims=1) # vcat

4-element Vector{Float64}:
 1.0
 1.0
 0.0
 0.0

In [118]:
cat(ones(2), zeros(2), dims=2) # hcat

2×2 Matrix{Float64}:
 1.0  0.0
 1.0  0.0

**Array Inspection**
- What type of elements are inside an array?

In [119]:
eltype(myarray)

Int64

In [121]:
length(myarray) # total number of elements

3

In [122]:
ndims(myarray) # number of dimensions

1

In [126]:
size(myarray) # array’s dimensions

(3,)

In [125]:
size(myarray, 1)

3

**Indexing and slicing**

In [129]:
# example vector
ex_vec = [1, 2, 3, 4, 5]

5-element Vector{Int64}:
 1
 2
 3
 4
 5

In [131]:
# example matrix
ex_mat = [[1 2 3]
          [4 5 6]
          [7 8 9]]

3×3 Matrix{Int64}:
 1  2  3
 4  5  6
 7  8  9

In [132]:
ex_vec[3]

3

In [133]:
ex_mat[2,1]

4

In [134]:
ex_vec[end]

5

In [138]:
ex_mat[end, begin]

7

**slicing**

In [139]:
ex_vec[2:4]

3-element Vector{Int64}:
 2
 3
 4

In [140]:
ex_mat[2, :]

3-element Vector{Int64}:
 4
 5
 6

**Manipulations**

In [141]:
ex_mat[2,2] = 100

100

In [142]:
ex_mat

3×3 Matrix{Int64}:
 1    2  3
 4  100  6
 7    8  9

In [143]:
ex_mat[3, :] = [17,16,15]

3-element Vector{Int64}:
 17
 16
 15

In [144]:
ex_mat

3×3 Matrix{Int64}:
  1    2   3
  4  100   6
 17   16  15

`reshape`

In [145]:
six_vector = [1, 2, 3, 4, 5, 6]

6-element Vector{Int64}:
 1
 2
 3
 4
 5
 6

In [146]:
three_two_matrix = reshape(six_vector, (3, 2))

3×2 Matrix{Int64}:
 1  4
 2  5
 3  6

In [147]:
reshape(three_two_matrix, (6, ))

6-element Vector{Int64}:
 1
 2
 3
 4
 5
 6

**Apply a function over every array element**

In [150]:
log.(ex_mat)

3×3 Matrix{Float64}:
 0.0      0.693147  1.09861
 1.38629  4.60517   1.79176
 2.83321  2.77259   2.70805

Dot `.` operator, broadcasting

In [151]:
ex_mat .+ 100

3×3 Matrix{Int64}:
 101  102  103
 104  200  106
 117  116  115

In [152]:
map(log, ex_mat)

3×3 Matrix{Float64}:
 0.0      0.693147  1.09861
 1.38629  4.60517   1.79176
 2.83321  2.77259   2.70805

In [153]:
map(x -> 3x, ex_mat)

3×3 Matrix{Int64}:
  3    6   9
 12  300  18
 51   48  45

In [154]:
(x -> 3x).(ex_mat)

3×3 Matrix{Int64}:
  3    6   9
 12  300  18
 51   48  45

`mapslices`

In [155]:
mapslices(sum, ex_mat; dims=1)

1×3 Matrix{Int64}:
 22  118  24

In [156]:
mapslices(sum, ex_mat; dims=2)

3×1 Matrix{Int64}:
   6
 110
  48
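
For a simple reduction like `sum`, the `dims` keyword gives the same result without `mapslices`. A small sketch that rebuilds `ex_mat` with the values it has at this point in the notebook:

```julia
ex_mat = [1 2 3; 4 100 6; 17 16 15]

col_sums = sum(ex_mat; dims=1)  # sum over rows, one value per column
row_sums = sum(ex_mat; dims=2)  # sum over columns, one value per row

println(col_sums == mapslices(sum, ex_mat; dims=1))  # true
println(row_sums == mapslices(sum, ex_mat; dims=2))  # true
println(vec(col_sums))                               # [22, 118, 24]
```

`mapslices` remains useful when the function has no `dims` keyword of its own.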

**Array Iteration**

In [157]:
simple_vector = [1, 2, 3]

empty_vector = Int64[]

for i in simple_vector
    push!(empty_vector, i + 1)
end

empty_vector

3-element Vector{Int64}:
 2
 3
 4

In [158]:
forty_twos = [42, 42, 42]

empty_vector = Int64[]

for i in eachindex(forty_twos)
    push!(empty_vector, i)
end

empty_vector

3-element Vector{Int64}:
 1
 2
 3

- String
- Tuple
- NamedTuple
- UnitRange
- Array
- **Pair**
- **Dict**
- **Symbol**

In [159]:
my_pair = "Julia" => 42

"Julia" => 42

In [160]:
name2number_map = Dict("one" => 1, "two" => 2)

Dict{String, Int64} with 2 entries:
  "two" => 2
  "one" => 1

In [161]:
sym = :some_text

:some_text
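
A few common operations on these three types, as a minimal sketch. It reuses the names from the cells above; the extra keys `"three"` and `"four"` are only for illustration.

```julia
my_pair = "Julia" => 42
name2number_map = Dict("one" => 1, "two" => 2)

# A Pair has a first and a second element
println(first(my_pair))  # Julia
println(last(my_pair))   # 42

# Look up, add, and safely query entries of a Dict
println(name2number_map["one"])            # 1
name2number_map["three"] = 3               # add a new entry
println(haskey(name2number_map, "two"))    # true
println(get(name2number_map, "four", 0))   # 0, the default for a missing key

# Dicts are unordered, so sort the keys for a stable listing
println(sort(collect(keys(name2number_map))))  # ["one", "three", "two"]

# A Symbol can also be built from a String
println(Symbol("some_text") == :some_text)  # true
```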

**Splat Operator**

In [162]:
add_elements(a, b, c) = a + b + c

add_elements (generic function with 1 method)

In [163]:
my_collection = [1, 2, 3]

3-element Vector{Int64}:
 1
 2
 3

In [164]:
add_elements(my_collection...)

6

In [165]:
add_elements(1:3...)

6

#### 11:00-11:20 using Pkg
- **Project Management**
- DataFrames
- Create a DataFrame
- using CSV
- write and read
- name and slicing

In [166]:
using Pkg

In [167]:
Pkg.status()

Status `~/.julia/environments/v1.8/Project.toml`
⌃ [295af30f] Revise v3.4.0
Info Packages marked with ⌃ have new versions available and may be upgradable.


In [168]:
Pkg.activate(".")

Activating new project at `~/Documents/JuliaEO_Presentation`


In [169]:
Pkg.status()

Status `~/Documents/JuliaEO_Presentation/Project.toml` (empty project)


In [170]:
Pkg.add("DataFrames")

Updating registry at `~/.julia/registries/General.toml`
Resolving package versions...
Updating `~/Documents/JuliaEO_Presentation/Project.toml`
 [a93c6f00] + DataFrames v1.4.4
Updating `~/Documents/JuliaEO_Presentation/Manifest.toml`
 [34da2185] + Compat v4.5.0
 [a8cc5b0e] + Crayons v4.1.1
 [9a962f9c] + DataAPI v1.14.0
 [a93c6f00] + DataFrames v1.4.4
 [864edb3b] + DataStructures v0.18.13
 [e2d170a0] + DataValueInterfaces v1.0.0
 [59287772] + Formatting v0.4.2
 [41ab1584] + InvertedIndices v1.2.0
 [82899510] + IteratorInterfaceExtensions v1.0.0
 [b964fa9f] + LaTeXStrings v1.3.0
 [e1d29d7a] + Missings v1.1.0
 [bac558e1] + OrderedCollections v1.4.1
 [2dfb63ee] + PooledArrays v1.4.2
 [08abe8d2] + PrettyTables v2.2.2
 …

In [171]:
Pkg.status()

Status `~/Documents/JuliaEO_Presentation/Project.toml`
 [a93c6f00] DataFrames v1.4.4


In [172]:
using DataFrames

In [173]:
df = DataFrame(; name=["Sally", "Bob", "Alice", "Hank"],
               grade_2020=[1, 5, 8.5, 4])

| Row | name | grade_2020 |
|-----|------|------------|
|     | String | Float64 |
| 1 | Sally | 1.0 |
| 2 | Bob | 5.0 |
| 3 | Alice | 8.5 |
| 4 | Hank | 4.0 |


In [174]:
Pkg.add("CSV")

Resolving package versions...
Installed CSV ─ v0.10.9
Updating `~/Documents/JuliaEO_Presentation/Project.toml`
 [336ed68f] + CSV v0.10.9
Updating `~/Documents/JuliaEO_Presentation/Manifest.toml`
 [336ed68f] + CSV v0.10.9
 [944b1d66] + CodecZlib v0.7.0
 [48062228] + FilePathsBase v0.9.20
 [842dd82b] + InlineStrings v1.3.2
 [69de0a69] + Parsers v2.5.2
 [91c51154] + SentinelArrays v1.3.16
 [3bb67fe8] + TranscodingStreams v0.9.10
 [ea10d353] + WeakRefStrings v1.4.2
 [76eceee3] + WorkerUtilities v1.6.1
 [a63ad114] + Mmap
 [83775a58] + Zlib_jll v1.2.12+3
Precompiling project...
  ✓ CSV
  1 dependency successfully precompiled in 8 seconds. 32 already precompiled.


In [175]:
Pkg.status()

Status `~/Documents/JuliaEO_Presentation/Project.toml`
 [336ed68f] CSV v0.10.9
 [a93c6f00] DataFrames v1.4.4


In [176]:
using CSV

In [177]:
CSV.write("grades.csv", df)

"grades.csv"

In [179]:
df = CSV.read("grades.csv", DataFrame)

| Row | name | grade_2020 |
|-----|------|------------|
|     | String7 | Float64 |
| 1 | Sally | 1.0 |
| 2 | Bob | 5.0 |
| 3 | Alice | 8.5 |
| 4 | Hank | 4.0 |


In [180]:
df.name

4-element Vector{String7}:
 "Sally"
 "Bob"
 "Alice"
 "Hank"

In [182]:
df[!, :name] # no copy: returns the column itself (also try :grade_2020)

4-element Vector{String7}:
 "Sally"
 "Bob"
 "Alice"
 "Hank"

In [183]:
# makes a new copy
df[:, :name]

4-element Vector{String7}:
 "Sally"
 "Bob"
 "Alice"
 "Hank"

In [184]:
df[1, :name]

"Sally"

In [185]:
df[1:2, :name]

2-element Vector{String7}:
 "Sally"
 "Bob"

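The difference between `!` and `:` matters as soon as we modify the result. A minimal sketch, assuming the DataFrames package is installed; the small table and the variable names `col_same` and `col_copy` are made up for this illustration.

```julia
using DataFrames

df_demo = DataFrame(; name=["Sally", "Bob"], grade_2020=[1.0, 5.0])

col_same = df_demo[!, :grade_2020]  # the column vector itself, no copy
col_copy = df_demo[:, :grade_2020]  # a fresh copy of the column

col_same[1] = 10.0                  # this also changes the DataFrame

println(df_demo.grade_2020[1])             # 10.0
println(col_copy[1])                       # 1.0, the copy is unaffected
println(col_same === df_demo.grade_2020)   # true, same object in memory
```

Use `!` when you want speed and are aware of the aliasing, and `:` when you need an independent vector.
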
#### 11:20-11:40 Tabular data
- **filter**
- subset
- select
- Categorical Data
- **Join**
- innerjoin, outerjoin, crossjoin, leftjoin, rightjoin, semijoin, antijoin

In [187]:
filter(x-> x>3, [1,2,3,4,5])

2-element Vector{Int64}:
 4
 5

For DataFrames

`filter(source => f::Function, df)`

In [188]:
df

| Row | name | grade_2020 |
|-----|------|------------|
|     | String7 | Float64 |
| 1 | Sally | 1.0 |
| 2 | Bob | 5.0 |
| 3 | Alice | 8.5 |
| 4 | Hank | 4.0 |


In [193]:
# for DataFrames
equals_alice(name) = name == "Alice"

equals_alice (generic function with 3 methods)

In [194]:
filter(:name => equals_alice, df)

| Row | name | grade_2020 |
|-----|------|------------|
|     | String7 | Float64 |
| 1 | Alice | 8.5 |


Also works for a vector!

In [195]:
filter(equals_alice, ["Alice", "Bob", "Dave"])

1-element Vector{String}:
 "Alice"

In [196]:
# anonymous function
filter(:name => n -> n == "Alice", df)

| Row | name | grade_2020 |
|-----|------|------------|
|     | String7 | Float64 |
| 1 | Alice | 8.5 |


A more concise way is to pass the partially applied comparison `==("Alice")`:

In [197]:
filter(:name => ==("Alice"), df)

| Row | name | grade_2020 |
|-----|------|------------|
|     | String7 | Float64 |
| 1 | Alice | 8.5 |


Or everyone except Alice?

In [200]:
filter(:name => !=("Alice"), df)

| Row | name | grade_2020 |
|-----|------|------------|
|     | String7 | Float64 |
| 1 | Sally | 1.0 |
| 2 | Bob | 5.0 |
| 3 | Hank | 4.0 |


- filter
- **subset**
- select
- Categorical Data
- **Join**
- innerjoin, outerjoin, crossjoin, leftjoin, rightjoin, semijoin, antijoin

`subset` works on complete columns, so a predicate that tests a single value has to be wrapped in `ByRow`.

In [201]:
subset(df, :name => ByRow(equals_alice))

| Row | name | grade_2020 |
|-----|------|------------|
|     | String7 | Float64 |
| 1 | Alice | 8.5 |


In [202]:
subset(df, :name => ByRow(name -> name == "Alice"))

| Row | name | grade_2020 |
|-----|------|------------|
|     | String7 | Float64 |
| 1 | Alice | 8.5 |


In [203]:
subset(df, :name => ByRow(==("Alice")))

| Row | name | grade_2020 |
|-----|------|------------|
|     | String7 | Float64 |
| 1 | Alice | 8.5 |


- filter
- subset
- **select**
- Categorical Data
- **Join**
- innerjoin, outerjoin, crossjoin, leftjoin, rightjoin, semijoin, antijoin

In [204]:
function responses()
    id = [1, 2]
    q1 = [28, 61]
    q2 = [:us, :fr]
    q3 = ["F", "B"]
    q4 = ["B", "C"]
    q5 = ["A", "E"]
    DataFrame(; id, q1, q2, q3, q4, q5)
end

responses (generic function with 1 method)

In [205]:
resp = responses()

| Row | id | q1 | q2 | q3 | q4 | q5 |
|-----|----|----|----|----|----|----|
|     | Int64 | Int64 | Symbol | String | String | String |
| 1 | 1 | 28 | us | F | B | A |
| 2 | 2 | 61 | fr | B | C | E |


In [215]:
select(resp, :id, :q1)

| Row | id | q1 |
|-----|----|----|
|     | Int64 | Int64 |
| 1 | 1 | 28 |
| 2 | 2 | 61 |


In [214]:
select(resp, "id", "q1", "q2")

| Row | id | q1 | q2 |
|-----|----|----|----|
|     | Int64 | Int64 | Symbol |
| 1 | 1 | 28 | us |
| 2 | 2 | 61 | fr |


In [213]:
# regex
select(resp, r"^q")

| Row | q1 | q2 | q3 | q4 | q5 |
|-----|----|----|----|----|----|
|     | Int64 | Symbol | String | String | String |
| 1 | 28 | us | F | B | A |
| 2 | 61 | fr | B | C | E |


In [211]:
select(resp, Not(:q5))

| Row | id | q1 | q2 | q3 | q4 |
|-----|----|----|----|----|----|
|     | Int64 | Int64 | Symbol | String | String |
| 1 | 1 | 28 | us | F | B |
| 2 | 2 | 61 | fr | B | C |


In [210]:
select(resp, Not([:q4, :q5]))

| Row | id | q1 | q2 | q3 |
|-----|----|----|----|----|
|     | Int64 | Int64 | Symbol | String |
| 1 | 1 | 28 | us | F |
| 2 | 2 | 61 | fr | B |


In [217]:
# mix and match columns that we want to preserve with columns that we do Not want
select(resp, :q5, Not(:q5))

| Row | q5 | id | q1 | q2 | q3 | q4 |
|-----|----|----|----|----|----|----|
|     | String | Int64 | Int64 | Symbol | String | String |
| 1 | A | 1 | 28 | us | F | B |
| 2 | E | 2 | 61 | fr | B | C |


In [218]:
select(resp, :q5, :)

| Row | q5 | id | q1 | q2 | q3 | q4 |
|-----|----|----|----|----|----|----|
|     | String | Int64 | Int64 | Symbol | String | String |
| 1 | A | 1 | 28 | us | F | B |
| 2 | E | 2 | 61 | fr | B | C |


**renaming columns via `select`**

In [219]:
select(resp, 1, :q1, :q2)

| Row | id | q1 | q2 |
|-----|----|----|----|
|     | Int64 | Int64 | Symbol |
| 1 | 1 | 28 | us |
| 2 | 2 | 61 | fr |


In [220]:
select(resp, 1 => "participant", :q1 => "age", :q2 => "nationality")

| Row | participant | age | nationality |
|-----|-------------|-----|-------------|
|     | Int64 | Int64 | Symbol |
| 1 | 1 | 28 | us |
| 2 | 2 | 61 | fr |


- filter
- subset
- select
- **Categorical Data**
- **Join**
- innerjoin, outerjoin, crossjoin, leftjoin, rightjoin, semijoin, antijoin

In [223]:
Pkg.add(["CategoricalArrays", "Dates"])

Resolving package versions...
Updating `~/Documents/JuliaEO_Presentation/Project.toml`
 [ade2ca70] + Dates
No Changes to `~/Documents/JuliaEO_Presentation/Manifest.toml`


In [224]:
using CategoricalArrays, Dates

In [233]:
function date_col()
    id = 1:4
    date = Date.(["28-01-2018", "03-04-2019", "01-08-2018", "22-11-2020"],
        dateformat"dd-mm-yyyy")
    age = ["adolescent", "adult", "infant", "adult"]
    DataFrame(; id, date, age)
end

date_col (generic function with 1 method)

In [234]:
date_col()

| Row | id | date | age |
|-----|----|------|-----|
|     | Int64 | Date | String |
| 1 | 1 | 2018-01-28 | adolescent |
| 2 | 2 | 2019-04-03 | adult |
| 3 | 3 | 2018-08-01 | infant |
| 4 | 4 | 2020-11-22 | adult |


In [236]:
sort(date_col(), :age)

| Row | id | date | age |
|-----|----|------|-----|
|     | Int64 | Date | String |
| 1 | 1 | 2018-01-28 | adolescent |
| 2 | 2 | 2019-04-03 | adult |
| 3 | 4 | 2020-11-22 | adult |
| 4 | 3 | 2018-08-01 | infant |


In [237]:
function fix_categ(df)
    levels = ["infant", "adolescent", "adult"]
    ages = categorical(df[!, :age]; levels, ordered=true)
    df[!, :age] = ages
    df
end

fix_categ (generic function with 1 method)

In [239]:
df_categ = fix_categ(date_col())

| Row | id | date | age |
|-----|----|------|-----|
|     | Int64 | Date | Cat… |
| 1 | 1 | 2018-01-28 | adolescent |
| 2 | 2 | 2019-04-03 | adult |
| 3 | 3 | 2018-08-01 | infant |
| 4 | 4 | 2020-11-22 | adult |


In [240]:
a = df_categ[1, :age]
b = df_categ[2, :age]
a < b

true
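
With ordered levels, sorting follows the level order instead of the alphabetical order we saw above. A minimal sketch on a plain vector, assuming the CategoricalArrays package is installed; `unwrap` is used here to get the plain strings back for printing.

```julia
using CategoricalArrays

ages = categorical(["adolescent", "adult", "infant", "adult"];
                   levels=["infant", "adolescent", "adult"], ordered=true)

println(levels(ages))          # ["infant", "adolescent", "adult"]
println(isordered(ages))       # true

# sort respects the declared level order, not the alphabet
sorted_ages = unwrap.(sort(ages))
println(sorted_ages)           # ["infant", "adolescent", "adult", "adult"]
```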

- filter
- subset
- select
- Categorical Data
- **Join**
- innerjoin, outerjoin, crossjoin, leftjoin, rightjoin, semijoin, antijoin

In [242]:
df_2021 = DataFrame(; name=["Bob 2", "Sally", "Hank"],
grade_2021=[9.5, 9.5, 6])

| Row | name | grade_2021 |
|-----|------|------------|
|     | String | Float64 |
| 1 | Bob 2 | 9.5 |
| 2 | Sally | 9.5 |
| 3 | Hank | 6.0 |


In [250]:
innerjoin(df, df_2021)

ArgumentError: ArgumentError: Missing join argument 'on'.

In [251]:
innerjoin(df, df_2021, on=:name)

| Row | name | grade_2020 | grade_2021 |
|-----|------|------------|------------|
|     | String7 | Float64 | Float64 |
| 1 | Sally | 1.0 | 9.5 |
| 2 | Hank | 4.0 | 6.0 |


Try the other joins yourself!
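
As a starting point for the other joins, here is a minimal sketch of three of them. It assumes the DataFrames package is installed and rebuilds the two tables so that it runs on its own; only row counts are printed because the row order of a join result should not be relied upon.

```julia
using DataFrames

df = DataFrame(; name=["Sally", "Bob", "Alice", "Hank"],
               grade_2020=[1, 5, 8.5, 4])
df_2021 = DataFrame(; name=["Bob 2", "Sally", "Hank"],
                    grade_2021=[9.5, 9.5, 6])

# leftjoin keeps every row of the left table; unmatched rows get `missing`
left = leftjoin(df, df_2021; on=:name)
println(nrow(left))                          # 4
println(count(ismissing, left.grade_2021))   # 2 (Bob and Alice)

# outerjoin keeps every row of both tables
outer = outerjoin(df, df_2021; on=:name)
println(nrow(outer))                         # 5

# antijoin keeps the rows of the left table with no match in the right one
anti = antijoin(df, df_2021; on=:name)
println(sort(anti.name))                     # ["Alice", "Bob"]
```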

#### 11:40-12:15 Variable Transformations
- **transform**
- groupby
- combine
- dropmissing
- coalesce [replace missing values]
- skipmissing

In [252]:
plus_one(grades) = grades .+ 1

plus_one (generic function with 1 method)

In [253]:
transform(df, :grade_2020 => plus_one)

| Row | name | grade_2020 | grade_2020_plus_one |
|-----|------|------------|---------------------|
|     | String7 | Float64 | Float64 |
| 1 | Sally | 1.0 | 2.0 |
| 2 | Bob | 5.0 | 6.0 |
| 3 | Alice | 8.5 | 9.5 |
| 4 | Hank | 4.0 | 5.0 |


In [254]:
# name the target column explicitly (here it overwrites grade_2020)
transform(df, :grade_2020 => plus_one => :grade_2020)

| Row | name | grade_2020 |
|-----|------|------------|
|     | String7 | Float64 |
| 1 | Sally | 2.0 |
| 2 | Bob | 6.0 |
| 3 | Alice | 9.5 |
| 4 | Hank | 5.0 |


In [255]:
# same result: renamecols=false keeps the source column name
transform(df, :grade_2020 => plus_one; renamecols=false)

| Row | name | grade_2020 |
|-----|------|------------|
|     | String7 | Float64 |
| 1 | Sally | 2.0 |
| 2 | Bob | 6.0 |
| 3 | Alice | 9.5 |
| 4 | Hank | 5.0 |


- transform
- **groupby**
- **combine**
- dropmissing
- coalesce [replace missing values]
- skipmissing

In [257]:
function all_grades(df1, df2)
    df1 = select(df1, :name, :grade_2020 => :grade)
    df2 = select(df2, :name, :grade_2021 => :grade)
    rename_bob2(data_col) = replace.(data_col, "Bob 2" => "Bob")
    df2 = transform(df2, :name => rename_bob2 => :name)
    return vcat(df1, df2)
end

all_grades (generic function with 1 method)

In [259]:
df_grades = all_grades(df, df_2021)

| Row | name | grade |
|-----|------|-------|
|     | String | Float64 |
| 1 | Sally | 1.0 |
| 2 | Bob | 5.0 |
| 3 | Alice | 8.5 |
| 4 | Hank | 4.0 |
| 5 | Bob | 9.5 |
| 6 | Sally | 9.5 |
| 7 | Hank | 6.0 |


In [261]:
groupby(df_grades, :name)

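First group (name = "Sally"):

| Row | name | grade |
|-----|------|-------|
|     | String | Float64 |
| 1 | Sally | 1.0 |
| 2 | Sally | 9.5 |

⋮

Last group (name = "Hank"):

| Row | name | grade |
|-----|------|-------|
|     | String | Float64 |
| 1 | Hank | 4.0 |
| 2 | Hank | 6.0 |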


In [262]:
Pkg.add("Statistics")

Resolving package versions...
Updating `~/Documents/JuliaEO_Presentation/Project.toml`
 [10745b16] + Statistics
No Changes to `~/Documents/JuliaEO_Presentation/Manifest.toml`


In [264]:
using Statistics

In [265]:
gdf = groupby(df_grades, :name)
combine(gdf, :grade => mean)

| Row | name | grade_mean |
|-----|------|------------|
|     | String | Float64 |
| 1 | Sally | 5.25 |
| 2 | Bob | 7.25 |
| 3 | Alice | 8.5 |
| 4 | Hank | 5.0 |


But what if we want to apply a function to multiple columns of our dataset?

In [266]:
group = [:A, :A, :B, :B]
X = 1:4
Y = 5:8
df_g = DataFrame(; group, X, Y)

| Row | group | X | Y |
|-----|-------|---|---|
|     | Symbol | Int64 | Int64 |
| 1 | A | 1 | 5 |
| 2 | A | 2 | 6 |
| 3 | B | 3 | 7 |
| 4 | B | 4 | 8 |


In [267]:
gdf = groupby(df_g, :group)
combine(gdf, [:X, :Y] .=> mean; renamecols=false)

| Row | group | X | Y |
|-----|-------|---|---|
|     | Symbol | Float64 | Float64 |
| 1 | A | 1.5 | 5.5 |
| 2 | B | 3.5 | 7.5 |


- transform
- groupby
- combine
- **dropmissing**
- **coalesce [replace missing values]**
- skipmissing

In [268]:
df_missing = DataFrame(;
    name=[missing, "Sally", "Alice", "Hank"],
    age=[17, missing, 20, 19],
    grade_2020=[5.0, 1.0, missing, 4.0],
)

| Row | name | age | grade_2020 |
|-----|------|-----|------------|
|     | String? | Int64? | Float64? |
| 1 | missing | 17 | 5.0 |
| 2 | Sally | missing | 1.0 |
| 3 | Alice | 20 | missing |
| 4 | Hank | 19 | 4.0 |


In [269]:
dropmissing(df_missing)

| Row | name | age | grade_2020 |
|-----|------|-----|------------|
|     | String | Int64 | Float64 |
| 1 | Hank | 19 | 4.0 |


In [270]:
dropmissing(df_missing, :name)

| Row | name | age | grade_2020 |
|-----|------|-----|------------|
|     | String | Int64? | Float64? |
| 1 | Sally | missing | 1.0 |
| 2 | Alice | 20 | missing |
| 3 | Hank | 19 | 4.0 |


In [271]:
dropmissing(df_missing, [:name, :age])

| Row | name | age | grade_2020 |
|-----|------|-----|------------|
|     | String | Int64 | Float64? |
| 1 | Alice | 20 | missing |
| 2 | Hank | 19 | 4.0 |


In [272]:
filter(:name => !ismissing, df_missing)

| Row | name | age | grade_2020 |
|-----|------|-----|------------|
|     | String? | Int64? | Float64? |
| 1 | Sally | missing | 1.0 |
| 2 | Alice | 20 | missing |
| 3 | Hank | 19 | 4.0 |


In [273]:
coalesce.([missing, "some value", missing], "zero")

3-element Vector{String}:
 "zero"
 "some value"
 "zero"

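The same broadcast works on numeric data. A small sketch on a plain vector with the same values as the `grade_2020` column; the replacement value `0.0` is an arbitrary choice for illustration.

```julia
grades = [5.0, 1.0, missing, 4.0]

# replace every missing entry with 0.0
filled = coalesce.(grades, 0.0)

println(filled)          # [5.0, 1.0, 0.0, 4.0]
println(eltype(filled))  # Float64, no Missing left in the element type
```
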
- transform
- groupby
- combine
- dropmissing
- coalesce [replace missing values]
- **skipmissing**

In [275]:
combine(df_missing, :grade_2020 => mean)

| Row | grade_2020_mean |
|-----|-----------------|
|     | Missing |
| 1 | missing |


In [276]:
combine(df_missing, :grade_2020 => mean ∘ skipmissing )

| Row | grade_2020_mean_skipmissing |
|-----|-----------------------------|
|     | Float64 |
| 1 | 3.33333 |
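
The same behaviour can be seen without a DataFrame. A minimal sketch on a plain vector with the values of the `grade_2020` column:

```julia
using Statistics

grades = [5.0, 1.0, missing, 4.0]

# any computation that touches a missing value returns missing
println(mean(grades))               # missing

# skipmissing iterates over the non-missing values only
println(mean(skipmissing(grades)))  # (5.0 + 1.0 + 4.0) / 3 ≈ 3.33333
```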


#### Summary
- Language Syntax
- Native Data Structures
- using Pkg
- Tabular data
- Variable Transformations

In [277]:
Pkg.status()

Status `~/Documents/JuliaEO_Presentation/Project.toml`
 [336ed68f] CSV v0.10.9
 [324d7699] CategoricalArrays v0.10.7
 [a93c6f00] DataFrames v1.4.4
 [ade2ca70] Dates
 [10745b16] Statistics
