# Using Julia for some data wrangling tasks


`Julia` is a scripting language, like R and Python, with similarities and differences:


  * Like R and Python, the basic syntax of `Julia` is easy enough to learn
  * Like R and unlike Python, `Julia` has a concept of generic methods and multiple dispatch as an alternative to object-oriented approaches
  * `Julia` has a richer type system with parametric types helpful for generic programming over concrete types.
  * `Julia`'s multiple dispatch is said to be as easy as S3 and as powerful as S4 (two of R's object systems)
  * Unlike R and Python (though like some of their variants), `Julia` uses LLVM to compile its methods just in time, trading compile-time latency on first use for very fast execution after compilation
  * When well written, `Julia` can match speeds of C and Fortran, so it is possible to avoid the "two-language" problem
  * Like R and Python, base `Julia` code is readily extended by add-on packages, which are easily managed by a package manager.
  * Like `R` and `Python`, `Julia` readily interfaces with other languages (R, Python, and C are good examples)
  * `Julia` inherits practices from Lisp, R, Python, Ruby, and Matlab (making anyone feel at home?)
  * For numerical programming, as is often done in Matlab and Python, `Julia` has many best-in-class packages (e.g. `SciML`)
  * For general purpose programming, as is often done with Python, `Julia` has a pretty rich set of packages
  * For statistical programming, as is often done with R, `Julia` has many packages and great promise for new-package development. Unlike R, almost all statistical features are in add-on packages, such as `DataFrames.jl` shown below.


This presentation will demonstrate a modest data wrangling task that might be familiar to R users or Pandas users.


## A few key Julia features


### Types


Julia has types (not classes) for different kinds of values.


Base number types include:


  * integer, float, rational, big numbers, complex


In [1]:
i,s,r,b,c = 1, 1.0, 1//1, big(1), 1 + 0im

(1, 1.0, 1//1, 1, 1 + 0im)

In [2]:
typeof(i), typeof(r), typeof(s), typeof(b), typeof(c)

(Int64, Rational{Int64}, Float64, BigInt, Complex{Int64})

  * concrete versus abstract (Real, Integer, AbstractFloat)


In [3]:
isa(i, Integer), isa(r, Integer), isa(b, Integer)

(true, false, true)

In [4]:
isa(s, Real), isa(s, AbstractFloat)

(true, true)
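The abstract/concrete distinction can be checked directly. The following is only a small sketch using base Julia functions (`isconcretetype`, `isabstracttype`, `supertype`, and the subtype operator `<:`); it is not part of the original presentation.

```julia
# Concrete types can have instances; abstract types only organize the hierarchy.
@show isconcretetype(Int64)      # true
@show isabstracttype(Integer)    # true

# `<:` tests the subtype relation and can be chained.
@show Int64 <: Integer <: Real   # true
@show Float64 <: Integer         # false

# Walk up the hierarchy starting from Float64.
@show supertype(Float64)         # AbstractFloat
@show supertype(AbstractFloat)   # Real
```

Methods are usually written against the abstract types (e.g. `Real`), so that they apply to every concrete type below them.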

Strings and symbols play a role in names for data frames:


In [5]:
"string", :symbol

("string", :symbol)

`Julia` has both `nothing` and `missing`, with `missing` playing the role of `NA` in R.


In [6]:
nothing, missing

(nothing, missing)
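A brief sketch of how the two differ in practice, using only base Julia: `missing` propagates through computations (much like `NA`), while `nothing` simply signals that there is no value to return. The vector `vals` is made up for illustration.

```julia
vals = [1, missing, 3]

@show 1 + missing               # missing: arithmetic propagates
@show missing == 1              # missing, not false (three-valued logic)
@show ismissing.(vals)          # flags the missing entry
@show sum(skipmissing(vals))    # 4: skip missing values explicitly
@show coalesce.(vals, 0)        # [1, 0, 3]: replace missing by a default

# `nothing` is what functions return when there is no answer, e.g. no match:
@show isnothing(findfirst(==(5), [1, 2, 3]))   # true
```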

### Containers


Base `Julia` provides many basic containers for values:


Vectors promote values to a common type


In [7]:
[1, 1.0, 1//1]

3-element Vector{Float64}:
 1.0
 1.0
 1.0

Tuples allow for heterogeneous containers (like a list in R)


In [8]:
(1, 1.0, 1//1)

(1, 1.0, 1//1)

In R, a matrix is a vector with a dimension attribute; in `Julia` vectors and matrices are distinct types, though both are special cases of the `Array{T, N}` type:


In [9]:
v = [1,2,3]

3-element Vector{Int64}:
 1
 2
 3

In [10]:
typeof(v)

Vector{Int64} (alias for Array{Int64, 1})

In [11]:
M = [1 2; 3 4]

2×2 Matrix{Int64}:
 1  2
 3  4

In [12]:
typeof(M)

Matrix{Int64} (alias for Array{Int64, 2})

The matrix `M`, as defined above, is entered row by row. We can create a row vector, mirroring `v`, but it is an array with 2 dimensions:


In [13]:
vr = [1 2 3]

1×3 Matrix{Int64}:
 1  2  3

In [14]:
typeof(vr)

Matrix{Int64} (alias for Array{Int64, 2})

Julia takes transposes seriously (i.e. `v'` is not `vr`):


In [15]:
v'

1×3 adjoint(::Vector{Int64}) with eltype Int64:
 1  2  3
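To make the difference concrete, here is a small sketch comparing the adjoint `v'` with the 1×3 matrix `vr`. The entries agree, but the types differ, and so does the result of multiplication. The `using LinearAlgebra` line is added here only to be explicit about where the adjoint wrapper type lives.

```julia
using LinearAlgebra  # standard library

v  = [1, 2, 3]
vr = [1 2 3]

@show v' == vr        # true: same shape and same entries
@show v' isa Matrix   # false: a lazy wrapper around `v`, not a Matrix
@show v' * v          # 14, a scalar (inner product)
@show vr * v          # [14], a 1-element Vector
@show size(v * v')    # (3, 3), the outer product
```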

Associative arrays are implemented by dictionaries or named tuples:


In [16]:
nt = (a=1, b=2, c=3)
nt.a

1

In [17]:
dct = Dict("a"=>1, "b"=>2, "c"=>3)
dct["a"]

1

(The keys of a named tuple are symbols, for the dictionary they are strings above, but may be other types.)


In Julia it is very much possible for external packages to provide additional container types. We will use `DataFrame` and `NamedTable` in the following.


### Functions


There are different ways to define a function and different types of functions. These define two methods for a generic function `f`:


In [18]:
f(x) = x^5 + x - 1

f (generic function with 1 method)

In [19]:
function f(x, y)
   x + 2x*y + y^2
end

f (generic function with 2 methods)

This defines an anonymous function which is then *assigned* to `g`:


In [20]:
g = x -> x^5 - x - 1

#1 (generic function with 1 method)

Both types of functions are called in the conventional way:


In [21]:
f(1), g(1)

(1, -1)

But generic functions have dispatch determined by the signature. Here we see the number of arguments dictates which method is called:


In [22]:
f(1,2) # uses f(x,y) not a call to f(x), which would error

9

Dispatch on the type of an argument is possible too, and perhaps more common. Here are default methods for `log` restricted to an initial argument of type `Number`:


In [23]:
methods(log, (Number,))

Packages and users can extend the `log` generic for other types, though it is *expected* that it be narrowed to types that they "own."
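As a sketch of dispatch on argument type, the following defines a made-up generic function `kind` with three methods, and then extends the base `+` operator for a type we "own". Both `kind` and `Meters` are hypothetical names used only for illustration.

```julia
# Three methods for one generic function; the most specific match is called.
kind(x::Integer)       = "integer"
kind(x::AbstractFloat) = "floating point"
kind(x)                = "something else"   # fallback for any other type

@show kind(1)        # "integer"
@show kind(big(1))   # "integer", since BigInt <: Integer
@show kind(1.0)      # "floating point"
@show kind(1//1)     # "something else": a Rational is neither

# Extending a Base generic, restricted to a type defined here.
struct Meters
    value::Float64
end
Base.:+(a::Meters, b::Meters) = Meters(a.value + b.value)

@show Meters(20.0) + Meters(1.5)   # Meters(21.5)
```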


---


Anonymous functions are useful with higher order programming. The above definition for `g` is just a binding of the name to the anonymous function. Bindings are dynamic:


In [24]:
g = 9.8

9.8

The generic function adds to a (global) method table. The method table can be modified, but the binding cannot:


In [25]:
f(x,y,z) = x^2 + y^2 + z^2

f (generic function with 3 methods)

In [26]:
#| error: true
f = 42

LoadError: invalid redefinition of constant Main.f

Functions may have *positional* arguments (possibly with default values) and *keyword* arguments (with default values):


In [27]:
h(a, b, c=3; d=4, e::Integer=5) = (a,b,c,d,e)
@show h(1, 2)
@show h(1, 2, 4)
@show h(1, 2; d=6)  # ; or , are okay when calling h; positional first

h(1, 2) = (1, 2, 3, 4, 5)
h(1, 2, 4) = (1, 2, 4, 4, 5)
h(1, 2; d = 6) = (1, 2, 3, 6, 5)


(1, 2, 3, 6, 5)

Function application is also available through the `|>` (pipe) operator:


In [28]:
3 |> f  # calls f(x) = x^5 + x - 1 with a value of `3`

245

The definition for this operator is just


In [29]:
#| eval: false (will error, it extends a base operator...)
|>(x, f) = f(x)

LoadError: invalid method definition in Main: function Base.|> must be explicitly imported to be extended
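A short sketch of chaining with the pipe, alongside function composition with `∘`, which is the closely related base operator. The vector `xs` and the name `root_of_sum` are made up for this example.

```julia
xs = [1, 4, 4]

# Piping reads left to right: first sum, then sqrt.
@show xs |> sum |> sqrt             # 3.0

# An anonymous function can sit in the middle of a pipeline.
@show xs |> (v -> v .+ 1) |> sum    # 12

# Composition builds a new function, applied right to left.
root_of_sum = sqrt ∘ sum
@show root_of_sum(xs)               # 3.0
```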

## The dot syntax


R is vectorized. Matlab is also, but needs a "dot" to disambiguate certain operations. Current `Julia` is not implicitly vectorized; instead a "dot" is used to broadcast function calls over the arguments (perhaps of different sizes):


In [30]:
x = [1,2,3]
f.(x)  # [f(1), f(2), f(3)] like `map(f, x)`

3-element Vector{Int64}:
   1
  33
 245

In [31]:
a = [:a, :b]  # "column vector"
b = [:c :d]   # row vector
h(x,y) = (x,y)
h.(a,b)

2×2 Matrix{Tuple{Symbol, Symbol}}:
 (:a, :c)  (:a, :d)
 (:b, :c)  (:b, :d)
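The dot works for operators as well as named functions, and the `@.` macro adds the dots to every call in an expression. A minimal sketch, with variable names chosen only for illustration:

```julia
x = [1, 2, 3]

a = x .^ 2 .+ 1            # explicit dots on every operation
b = @. x^2 + 1             # `@.` inserts the dots for us
c = sqrt.(abs.(x .- 2))    # nested dot calls are fused into one loop
m = x .* [10 20]           # length-3 column against 1×2 row gives a 3×2 matrix

@show a b c m
```

Scalars are treated as if they had whatever size is needed, which is why `x .- 2` works without wrapping `2` in a vector.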

## Iteration


Basic iteration can be done using a for loop:


In [32]:
a, b = 1, 1
for i in 1:3
    a, b = b, a + b
end
a, b

(3, 5)

Many objects are iterable (as the range `1:3` above). There are several *helper* functions for iteration. Among others, these examples show 3 ways to iterate over a matrix (each element, each row, each column):


In [33]:
M = [1 2; 3 4]
for r ∈ M # order of traversal down column then over row
    @show r
end

r = 1
r = 3
r = 2
r = 4


In [34]:
for r ∈ eachrow(M)
    @show r
end

r = [1, 2]
r = [3, 4]


In [35]:
for r ∈ eachcol(M)
    @show r
end

r = [1, 3]
r = [2, 4]


Basic iteration tasks can also be achieved with a comprehension:


In [36]:
[2i + 4 for i in 1:3]

3-element Vector{Int64}:
  6
  8
 10

This is similar to `map`, which takes a function rather than an expression:


In [37]:
map(i -> 2i + 4, 1:3)

3-element Vector{Int64}:
  6
  8
 10

Base `Julia` provides many other higher order functions, and add-on packages provide even more.
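A few of the base higher order functions, as a quick sketch. All functions used here are in base Julia; the variable names are illustrative.

```julia
odds   = filter(isodd, 1:10)             # keep elements where the predicate is true
total  = reduce(+, 1:4)                  # fold a binary operation over a collection
sumsq  = mapreduce(x -> x^2, +, 1:3)     # map, then reduce, in one pass
sumsq2 = sum(abs2, 1:3)                  # many reductions accept a function first
pairs_ = collect(zip(1:2, [:x, :y]))     # iterate two collections in lockstep

# `do` blocks pass a multi-line anonymous function as the first argument.
squares = map(1:3) do i
    i^2
end

@show odds total sumsq sumsq2 pairs_ squares
```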


## A data wrangling example using DataFrames


Base Julia is extended by add-on packages. The built-in package manager can install them.


In [38]:
#| eval: false
using Pkg
Pkg.add(["CSV", "DataFrames"])

    Updating registry at `~/.julia/registries/General.toml`
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.10/Project.toml`
  No Changes to `~/.julia/environments/v1.10/Manifest.toml`


Packages are included in a session via `using` (or `import`):


In [39]:
using CSV, DataFrames

The background for this data is a fictional college, Euphoria State. Each semester, student records are stored, in this example as CSV files. These records include many things, in particular the grade for each class by student. In the scenario below, there are historic grades and new enrollment data in separate files.


The `CSV` package parses structured data into a tabular format. Generally the data is in a file; here we read it in from a multi-line string and store the data as a `DataFrame` object:


In [40]:
s11_data = """
Column1,Term,Subject,Catalog,ID,Name,Session,Grade,Grade.In
1,1149,ACC,114,812315,"Abernathy,Alice",1,A,A
2,1149,MTH,123,812315,"Abernathy,Alice",1,C,C
3,1152,ENG,132,812315,"Abernathy,Alice",1,A,A
4,1152,ENG,211,812315,"Abernathy,Alice",1,B,B
5,1169,MTH,231,889995,"Ballew,Bob",1,A,A
6,1169,MTH,229,889995,"Ballew,Bob",1,A,A
7,1172,ENG,111,889995,"Ballew,Bob",1,B,B
8,1172,CSC,222,889995,"Ballew,Bob",1,A-,A-
9,1179,CSC,222,889995,"Ballew,Bob",1,F,F
10,1179,ENG,232,889995,"Ballew,Bob",1,A,A
11,1182,PSY,100,889995,"Ballew,Bob",1,B+,B+
12,1192,PSY,100,163486,"Carol,Carol",1,A,A
13,1192,MTH,123,163486,"Carol,Carol",1,A,A
14,1199,MTH,231,163486,"Carol,Carol",1,A,A
15,1202,MTH,232,163486,"Carol,Carol",1,W,W
"""

"Column1,Term,Subject,Catalog,ID,Name,Session,Grade,Grade.In\n1,1149,ACC,114,812315,\"Abernathy,Alice\",1,A,A\n2,1149,MTH,123,812315,\"Abernathy,Alice\",1,C,C\n3,1152,ENG,132,812315,\"Abernathy,Alice\",1,A,A\n4,1152,ENG,211,812315,\"Abernathy,Alice\",1,B,B\n5,1169,MTH,231,889995,\"Ball" ⋯ 168 bytes ⋯ ",Bob\",1,F,F\n10,1179,ENG,232,889995,\"Ballew,Bob\",1,A,A\n11,1182,PSY,100,889995,\"Ballew,Bob\",1,B+,B+\n12,1192,PSY,100,163486,\"Carol,Carol\",1,A,A\n13,1192,MTH,123,163486,\"Carol,Carol\",1,A,A\n14,1199,MTH,231,163486,\"Carol,Carol\",1,A,A\n15,1202,MTH,232,163486,\"Carol,Carol\",1,W,W\n"

In [41]:
s11 = CSV.read(IOBuffer(s11_data), DataFrame)

Row,Column1,Term,Subject,Catalog,ID,Name,Session,Grade,Grade.In
Unnamed: 0_level_1,Int64,Int64,String3,Int64,Int64,String15,Int64,String3,String3
1,1,1149,ACC,114,812315,"Abernathy,Alice",1,A,A
2,2,1149,MTH,123,812315,"Abernathy,Alice",1,C,C
3,3,1152,ENG,132,812315,"Abernathy,Alice",1,A,A
4,4,1152,ENG,211,812315,"Abernathy,Alice",1,B,B
5,5,1169,MTH,231,889995,"Ballew,Bob",1,A,A
6,6,1169,MTH,229,889995,"Ballew,Bob",1,A,A
7,7,1172,ENG,111,889995,"Ballew,Bob",1,B,B
8,8,1172,CSC,222,889995,"Ballew,Bob",1,A-,A-
9,9,1179,CSC,222,889995,"Ballew,Bob",1,F,F
10,10,1179,ENG,232,889995,"Ballew,Bob",1,A,A


Some things are non-essential: `Grade.In` is technical, `Column1` an artifact of writing to a CSV file, ...


### Access patterns


Values in DataFrames can be accessed by index, column name, etc.


In [42]:
s11[2,5], s11[2, :ID], s11[2, "ID"], s11[2, r"^I"]

(812315, 812315, 812315, DataFrameRow
 Row │ ID
     │ Int64
─────┼────────
   2 │ 812315)

The last one prints differently: a regular expression is a **column selector** that could *possibly* match zero, one, or more columns, so a `DataFrameRow` is returned rather than a bare value. The others match exactly one column, so the value is returned.


---


In [43]:
# 💻 What "Term" is recorded in the 5th row?
s11[5, :Term]

1169

---


All rows (or all columns) are implied by a colon, `:`:


In [44]:
s = first(s11, 3)  # first 3 rows
s.ID, s[:, 5], s[:, "ID"], s[:, :ID]

([812315, 812315, 812315], [812315, 812315, 812315], [812315, 812315, 812315], [812315, 812315, 812315])

The use of `:` above to reference all rows has an alternative:


In [45]:
s[!,5], s[!, :ID]

([812315, 812315, 812315], [812315, 812315, 812315])

The basic difference is that `:` makes a copy of the column, whereas `!` returns the stored column itself without copying. See this [blog post](https://bkamins.github.io/julialang/2022/10/28/indexing.html) for more detail.


When assigning a value in a column, the use of `s.ID` is convenient, as it replaces the current column.
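The difference between the two row selectors shows up when the returned vector is modified. A minimal sketch with a made-up one-column data frame `demo`:

```julia
using DataFrames

demo = DataFrame(ID = [1, 2, 3])

@show demo[!, :ID] === demo.ID   # true: the stored column itself
@show demo[:, :ID] === demo.ID   # false: a fresh copy

col_copy = demo[:, :ID]
col_copy[1] = 100                # changes only the copy
@show demo.ID[1]                 # 1

col_ref = demo[!, :ID]
col_ref[1] = 100                 # changes the data frame's column
@show demo.ID[1]                 # 100
```

Copying is the safer default; `!` (or `demo.ID`) avoids an allocation when the column is only read or is meant to be changed in place.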


## Combining data frames


Next we define some mock new data for the current semester


In [46]:
f22_data = """
Term,Session,Career,Class Nbr,Section,Subject,Catalog,Component,ID,Name,Gender,Phone,Email,Grade,Grade In,Repeat,Instructor ID,Instructor Name,Day,Mtg Start,Mtg End,Add Dt,User,Grade Base
1229,1,UGRD,36923,D001,MTH,105,LEC,163486,"Carol,Carol",F,555-1212,carol.carol@euphoria.edu,,,,24007235,Frank,M,10:10AM,12:05PM,8/17/22,23247055,GRD
1229,1,UGRD,36923,D001,MTH,105,LEC,163486,"Carol,Carol",F,555-1212,carol.carol@euphoria.edu,,,,24007235,Frank,W,10:10AM,12:05PM,8/17/22,23247055,GRD
1229,1,UGRD,34534,D001,ENG,110,LEC,163486,"Carol,Carol",F,555-1212,carol.carol@euphoria.edu,,,,43993434,Faythe,W,10:10:00AM,11:00AM,8/17/22,23247055,GRD
1229,1,UGRD,43244,D011,PSY,100,LEC,185109,"Brock,Erin",U,555-2121,Brock.erin@euphoria.edu,,,,5435352,Grace,Th,2:30PM,4:20PM,,,
1229,1,UGRD,36923,D001,MTH,105,LEC,185109,"Brock,Erin",U,555-2121,Brock.erin@euphoria.edu,,,,24007235,Frank,M,4:40PM,6:10PM,,,
1229,1,UGRD,44332,D200,ENG,111,LEC,659056,"Mallory,Yves",M,555-2211,mallory.eve@euphoria.edu,,,,75544555,Mike,W,8:00AM,9:50AM,,,
"""

"Term,Session,Career,Class Nbr,Section,Subject,Catalog,Component,ID,Name,Gender,Phone,Email,Grade,Grade In,Repeat,Instructor ID,Instructor Name,Day,Mtg Start,Mtg End,Add Dt,User,Grade Base\n1229,1,UGRD,36923,D001,MTH,105,LEC,163486,\"Carol,Carol\",F,555-1212,carol.carol@euph" ⋯ 481 bytes ⋯ "0PM,4:20PM,,,\n1229,1,UGRD,36923,D001,MTH,105,LEC,185109,\"Brock,Erin\",U,555-2121,Brock.erin@euphoria.edu,,,,24007235,Frank,M,4:40PM,6:10PM,,,\n1229,1,UGRD,44332,D200,ENG,111,LEC,659056,\"Mallory,Yves\",M,555-2211,mallory.eve@euphoria.edu,,,,75544555,Mike,W,8:00AM,9:50AM,,,\n"

This is read in as before:


In [47]:
f22 = CSV.read(IOBuffer(f22_data), DataFrame)

Row,Term,Session,Career,Class Nbr,Section,Subject,Catalog,Component,ID,Name,Gender,Phone,Email,Grade,Grade In,Repeat,Instructor ID,Instructor Name,Day,Mtg Start,Mtg End,Add Dt,User,Grade Base
Unnamed: 0_level_1,Int64,Int64,String7,Int64,String7,String3,Int64,String3,Int64,String15,String1,String15,String31,Missing,Missing,Missing,Int64,String7,String3,String15,String7,String7?,Int64?,String3?
1,1229,1,UGRD,36923,D001,MTH,105,LEC,163486,"Carol,Carol",F,555-1212,carol.carol@euphoria.edu,missing,missing,missing,24007235,Frank,M,10:10AM,12:05PM,8/17/22,23247055,GRD
2,1229,1,UGRD,36923,D001,MTH,105,LEC,163486,"Carol,Carol",F,555-1212,carol.carol@euphoria.edu,missing,missing,missing,24007235,Frank,W,10:10AM,12:05PM,8/17/22,23247055,GRD
3,1229,1,UGRD,34534,D001,ENG,110,LEC,163486,"Carol,Carol",F,555-1212,carol.carol@euphoria.edu,missing,missing,missing,43993434,Faythe,W,10:10:00AM,11:00AM,8/17/22,23247055,GRD
4,1229,1,UGRD,43244,D011,PSY,100,LEC,185109,"Brock,Erin",U,555-2121,Brock.erin@euphoria.edu,missing,missing,missing,5435352,Grace,Th,2:30PM,4:20PM,missing,missing,missing
5,1229,1,UGRD,36923,D001,MTH,105,LEC,185109,"Brock,Erin",U,555-2121,Brock.erin@euphoria.edu,missing,missing,missing,24007235,Frank,M,4:40PM,6:10PM,missing,missing,missing
6,1229,1,UGRD,44332,D200,ENG,111,LEC,659056,"Mallory,Yves",M,555-2211,mallory.eve@euphoria.edu,missing,missing,missing,75544555,Mike,W,8:00AM,9:50AM,missing,missing,missing


Over time the column names evolve. The old data has a minimal set, the new has more extensive repeated data:


In [48]:
names(s11)

9-element Vector{String}:
 "Column1"
 "Term"
 "Subject"
 "Catalog"
 "ID"
 "Name"
 "Session"
 "Grade"
 "Grade.In"

In [49]:
names(f22)

24-element Vector{String}:
 "Term"
 "Session"
 "Career"
 "Class Nbr"
 "Section"
 "Subject"
 "Catalog"
 "Component"
 "ID"
 "Name"
 "Gender"
 "Phone"
 "Email"
 "Grade"
 "Grade In"
 "Repeat"
 "Instructor ID"
 "Instructor Name"
 "Day"
 "Mtg Start"
 "Mtg End"
 "Add Dt"
 "User"
 "Grade Base"

This finds common column names using a Unicode infix operator for `intersect`:


In [50]:
nms = names(f22) ∩ names(s11)

7-element Vector{String}:
 "Term"
 "Session"
 "Subject"
 "Catalog"
 "ID"
 "Name"
 "Grade"

---


In [51]:
# 💻 The nms vector is a valid column selector. What is the data frame f22 with only the names from `nms`?
f22[:, nms]


Row,Term,Session,Subject,Catalog,ID,Name,Grade
Unnamed: 0_level_1,Int64,Int64,String3,Int64,Int64,String15,Missing
1,1229,1,MTH,105,163486,"Carol,Carol",missing
2,1229,1,MTH,105,163486,"Carol,Carol",missing
3,1229,1,ENG,110,163486,"Carol,Carol",missing
4,1229,1,PSY,100,185109,"Brock,Erin",missing
5,1229,1,MTH,105,185109,"Brock,Erin",missing
6,1229,1,ENG,111,659056,"Mallory,Yves",missing


In [52]:
# 💻 Wrap your previous command within `unique`. What is the difference?
f22[:, nms] |> unique # combines first two  into 1

Row,Term,Session,Subject,Catalog,ID,Name,Grade
Unnamed: 0_level_1,Int64,Int64,String3,Int64,Int64,String15,Missing
1,1229,1,MTH,105,163486,"Carol,Carol",missing
2,1229,1,ENG,110,163486,"Carol,Carol",missing
3,1229,1,PSY,100,185109,"Brock,Erin",missing
4,1229,1,MTH,105,185109,"Brock,Erin",missing
5,1229,1,ENG,111,659056,"Mallory,Yves",missing


(The `f22` data is arranged to have replicated data for each day a class meets.)


---


The `vcat` function combines objects vertically (there is also `hcat` and `hvcat`).


In [53]:
studs = vcat(s11[:,nms], unique(f22[:,nms]))

Row,Term,Session,Subject,Catalog,ID,Name,Grade
Unnamed: 0_level_1,Int64,Int64,String3,Int64,Int64,String15,String3?
1,1149,1,ACC,114,812315,"Abernathy,Alice",A
2,1149,1,MTH,123,812315,"Abernathy,Alice",C
3,1152,1,ENG,132,812315,"Abernathy,Alice",A
4,1152,1,ENG,211,812315,"Abernathy,Alice",B
5,1169,1,MTH,231,889995,"Ballew,Bob",A
6,1169,1,MTH,229,889995,"Ballew,Bob",A
7,1172,1,ENG,111,889995,"Ballew,Bob",B
8,1172,1,CSC,222,889995,"Ballew,Bob",A-
9,1179,1,CSC,222,889995,"Ballew,Bob",F
10,1179,1,ENG,232,889995,"Ballew,Bob",A


`DataFrames` provides much more functionality for other ways of combining data, such as joins.
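As a hedged illustration of joins, the following sketch uses two small made-up tables (`people` and `emails` are not part of the Euphoria State data) and the `innerjoin`, `leftjoin`, and `outerjoin` functions from `DataFrames`:

```julia
using DataFrames

# Hypothetical tables sharing an :ID key.
people = DataFrame(ID = [1, 2, 3], Name = ["Alice", "Bob", "Carol"])
emails = DataFrame(ID = [1, 3, 4],
                   Email = ["a@example.edu", "c@example.edu", "d@example.edu"])

both  = innerjoin(people, emails, on = :ID)   # only IDs present in both tables
left  = leftjoin(people, emails, on = :ID)    # every row of `people`; Email may be missing
outer = outerjoin(people, emails, on = :ID)   # IDs from either table

@show nrow(both) nrow(left) nrow(outer)
```

Unmatched cells are filled with `missing`, so the joined columns may change to a type that allows missing values.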


## Split-apply-combine


The [split-apply-combine](https://vita.had.co.nz/papers/plyr.pdf) strategy is often used, and here we see `DataFrames` supports it fairly naturally.


For this data, we want to create a new data structure for each student containing 1) their first semester, 2) their last semester, and 3) their gpa.


The first and last semesters are conveniently returned by `extrema` when applied to `Term`, given the manner in which the term is coded.


---


In [54]:
# 💻 What does extrema find for studs.Term?
extrema(studs.Term)  # min and max in one pass

(1149, 1229)

---


The `gpa` requires turning letter grades into numbers. Here is a simple way:


In [55]:
function grade_to_number(x)
    ismissing(x) && return x
    x == "A"  ? 4.0 :
    x == "A-" ? 3.7 :
    x == "B+" ? 3.3 :
    x == "B"  ? 3.0 :
    x == "B-" ? 2.7 :
    x == "C+" ? 2.3 :
    x == "C"  ? 2.0 :
    x == "D"  ? 1.0 :
    x == "F"  ? 0.0 : missing
end

grade_to_number (generic function with 1 method)
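An alternative, shown here only as a sketch, is to keep the grade points in a dictionary and look them up with `get`, which accepts a default for unknown keys. The names `GRADE_POINTS` and `grade_points` are made up for this example; the point values are the same as above. The `String(x)` conversion is a precaution, since `CSV` may store short strings in a specialized string type.

```julia
# Lookup table with the same values as the chain of comparisons.
const GRADE_POINTS = Dict(
    "A"  => 4.0, "A-" => 3.7,
    "B+" => 3.3, "B"  => 3.0, "B-" => 2.7,
    "C+" => 2.3, "C"  => 2.0,
    "D"  => 1.0, "F"  => 0.0,
)

# Unknown grades (e.g. "W") and missing values both map to `missing`.
grade_points(x) = ismissing(x) ? missing : get(GRADE_POINTS, String(x), missing)

@show grade_points("A-")
@show grade_points.(["A", "W", missing])
```

Adding a grade then means adding one entry to the table rather than another branch.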

The `gpa`  would just be done by applying `mean` (from the `Statistics` package).


In [56]:
using Statistics  # base Statistics module is very minimal, but has `mean`

---


In [57]:
# 💻  what goes wrong here? A one character fix is?
xs = grade_to_number(studs.Grade) # it needs grade_to_number.(studs.Grade) --- a dot to broadcast

missing

In [58]:
# 💻 After ensuring xs is a vector, try finding the mean. What value do you get?
mean(xs) # missing, need to somehow drop those

LoadError: MethodError: no method matching iterate(::Missing)

Closest candidates are:
  iterate(::LibGit2.GitRebase)
   @ LibGit2 /usr/local/share/julia/stdlib/v1.10/LibGit2/src/rebase.jl:48
  iterate(::LibGit2.GitRebase, ::Any)
   @ LibGit2 /usr/local/share/julia/stdlib/v1.10/LibGit2/src/rebase.jl:48
  iterate(::PosLenString)
   @ WeakRefStrings ~/.julia/packages/WeakRefStrings/31nkb/src/poslenstrings.jl:325
  ...


---


For `mean(xs)` we have to be a bit careful with


  * `missing` values (`ismissing`, `skipmissing`)
  * and empty iterators (`isempty`)


with this data:


In [59]:

function gpa(xs)
    isempty(xs) && return missing
    ys = grade_to_number.(xs)
    all(ismissing.(ys)) && return missing
    ys |> skipmissing |> mean
end

gpa (generic function with 1 method)
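The two guards in `gpa` can also be folded into one check on the `skipmissing` iterator, since an iterator with nothing left after skipping is empty. This is only a sketch of that idea; `mean_or_missing` is a hypothetical helper name, and it works on numbers rather than letter grades.

```julia
using Statistics

function mean_or_missing(xs)
    vals = skipmissing(xs)                 # lazy view without the missing entries
    isempty(vals) ? missing : mean(vals)   # covers both empty and all-missing input
end

@show mean([4.0, missing, 3.0])              # missing: mean propagates missing
@show mean_or_missing([4.0, missing, 3.0])   # 3.5
@show mean_or_missing([missing, missing])    # missing
@show mean_or_missing(Float64[])             # missing
```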

The function we apply to the dataframe for each unique student would be:


In [60]:
function summarize_student(u)
    m,n = size(u)
    fterm, lterm = extrema(u.Term)
    (F=fterm, L=lterm, N = m, gpa = gpa(u.Grade))
end

summarize_student (generic function with 1 method)

---


In [61]:
# 💻 Apply `summarize_student` to the data frame for Alice created by subsetting the rows:
df = studs[studs.Name .== "Abernathy,Alice",:]

Row,Term,Session,Subject,Catalog,ID,Name,Grade
Unnamed: 0_level_1,Int64,Int64,String3,Int64,Int64,String15,String3?
1,1149,1,ACC,114,812315,"Abernathy,Alice",A
2,1149,1,MTH,123,812315,"Abernathy,Alice",C
3,1152,1,ENG,132,812315,"Abernathy,Alice",A
4,1152,1,ENG,211,812315,"Abernathy,Alice",B


In [62]:
summarize_student(df)

(F = 1149, L = 1152, N = 4, gpa = 3.25)

---


We can use the `groupby` function to split the data frame on an ID, call the above on each student, and then combine into a data frame.


The `groupby` function splits the data:


In [63]:
students = groupby(studs, :ID)

Row,Term,Session,Subject,Catalog,ID,Name,Grade
Unnamed: 0_level_1,Int64,Int64,String3,Int64,Int64,String15,String3?
1,1149,1,ACC,114,812315,"Abernathy,Alice",A
2,1149,1,MTH,123,812315,"Abernathy,Alice",C
3,1152,1,ENG,132,812315,"Abernathy,Alice",A
4,1152,1,ENG,211,812315,"Abernathy,Alice",B

Row,Term,Session,Subject,Catalog,ID,Name,Grade
Unnamed: 0_level_1,Int64,Int64,String3,Int64,Int64,String15,String3?
1,1229,1,ENG,111,659056,"Mallory,Yves",missing


---


In [64]:
# 💻 Group the data by the student name
groupby(studs, :Name)

Row,Term,Session,Subject,Catalog,ID,Name,Grade
Unnamed: 0_level_1,Int64,Int64,String3,Int64,Int64,String15,String3?
1,1149,1,ACC,114,812315,"Abernathy,Alice",A
2,1149,1,MTH,123,812315,"Abernathy,Alice",C
3,1152,1,ENG,132,812315,"Abernathy,Alice",A
4,1152,1,ENG,211,812315,"Abernathy,Alice",B

Row,Term,Session,Subject,Catalog,ID,Name,Grade
Unnamed: 0_level_1,Int64,Int64,String3,Int64,Int64,String15,String3?
1,1229,1,ENG,111,659056,"Mallory,Yves",missing


---


The `GroupedDataFrame` object can be iterated over (but not broadcast over). Here we apply our function to each entry:


In [65]:
student_summaries = [summarize_student(student) for student ∈ students]

5-element Vector{NamedTuple{(:F, :L, :N, :gpa)}}:
 (F = 1149, L = 1152, N = 4, gpa = 3.25)
 (F = 1169, L = 1182, N = 7, gpa = 3.142857142857143)
 (F = 1192, L = 1229, N = 6, gpa = 4.0)
 (F = 1229, L = 1229, N = 2, gpa = missing)
 (F = 1229, L = 1229, N = 1, gpa = missing)

The `DataFrame` constructor can consume an array of named tuples that is produced by the comprehension, treating each tuple as a new row:


In [66]:
d = DataFrame(student_summaries)

Row,F,L,N,gpa
Unnamed: 0_level_1,Int64,Int64,Int64,Float64?
1,1149,1152,4,3.25
2,1169,1182,7,3.14286
3,1192,1229,6,4.0
4,1229,1229,2,missing
5,1229,1229,1,missing


### DataFrames mini language


DataFrames provides a minilanguage to support the actions:


  * `combine`: create a new data frame with columns coming from transformations
  * `select`: create a new data frame with same number of rows (cases) with only the specified columns
  * `transform`: like `select`, create a new data frame with the same number of cases, but keep the original columns along with any additional ones


Transformations apply a function to source columns and store the result(s) in destination columns. The `=>` pair notation separates the three parts. The pattern is


In [67]:
# source column(s) specifier => function => destination column(s) specifier

(The middle one is a function, which may be an anonymous function, in which case parentheses may be needed due to operator precedence.)


For example, below we will see `:Term => minimum => :F`, which will apply the `minimum` function to the `Term` column of each group. The `minimum` function is a *reduction* returning a scalar, which will be stored in the computed data frame under the variable name `F`. Similarly we have `:L` and `:N` computed:


In [68]:
students = groupby(studs, :ID)
combine(students,
        :Term => minimum => :F,
        :Term => maximum => :L,
        :Term => length => :N,
        :Grade => gpa => :gpa)

Row,ID,F,L,N,gpa
Unnamed: 0_level_1,Int64,Int64,Int64,Int64,Float64?
1,812315,1149,1152,4,3.25
2,889995,1169,1182,7,3.14286
3,163486,1192,1229,6,4.0
4,185109,1229,1229,2,missing
5,659056,1229,1229,1,missing
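The three verbs differ mainly in which rows and columns they keep. The sketch below uses a small made-up data frame `toy` to show the difference, and includes one anonymous function wrapped in parentheses so that the trailing `=> :range` is not absorbed into the function body.

```julia
using DataFrames

toy = DataFrame(g = ["a", "a", "b"], x = [1, 2, 10])

# combine: a reduction collapses the rows (one row per group when grouped).
total = combine(toy, :x => sum => :total)
by_g  = combine(groupby(toy, :g), :x => sum => :total)

# select: same number of rows, only the requested columns.
sel = select(toy, :x => ByRow(v -> 2v) => :double)

# transform: like select, but the existing columns are kept.
tr = transform(toy, :x => ByRow(v -> 2v) => :double)

# An anonymous function as the middle term needs parentheses.
rng = combine(toy, :x => (v -> maximum(v) - minimum(v)) => :range)

@show total by_g names(sel) names(tr) rng
```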


---


In [69]:
# 💻 group studs by :Term then apply `gpa`. Which term has the lowest gpa in the mock data set?
combine(groupby(studs, :Term), :Grade => gpa => :gpa)

Row,Term,gpa
Unnamed: 0_level_1,Int64,Float64?
1,1149,3.0
2,1152,3.5
3,1169,4.0
4,1172,3.35
5,1179,2.0
6,1182,3.3
7,1192,4.0
8,1199,4.0
9,1202,missing
10,1229,missing


---


## Transforming data examples


We continue with a larger set of randomly generated mock data. Here we read the data from an internet source, so first the built-in `download` function is called to download the file:


In [70]:
url = "https://raw.githubusercontent.com/jverzani/DataCampPresentation.jl/main/d.csv"
d = CSV.read(download(url), DataFrame)

Row,F,L,N,gpa
Unnamed: 0_level_1,Int64,Int64,Int64,Float64?
1,1112,1112,8,0.6
2,1112,1132,24,3.85
3,1112,1132,23,3.35
4,1112,1159,44,2.65
5,1112,1149,14,4.04
6,1112,1119,13,2.75
7,1112,1122,12,1.31
8,1112,1112,7,3.53
9,1112,1112,5,3.8
10,1112,1112,6,missing


### Filtering


We want to consider the more recent students only, so we filter out the students who started earlier:


In [71]:
d = filter(r -> r.F >= 1159, d)

Row,F,L,N,gpa
Unnamed: 0_level_1,Int64,Int64,Int64,Float64?
1,1159,1162,11,3.34
2,1159,1192,41,2.51
3,1159,1172,9,4.1
4,1159,1189,23,2.7
5,1159,1192,34,3.8
6,1159,1202,41,2.81
7,1159,1159,7,4.1
8,1159,1222,20,3.74
9,1159,1182,16,0.96
10,1159,1162,7,3.8


  * Somewhat idiosyncratically, `filter` for a data frame filters over rows (a preferred direction isn't obvious).
  * The call above is a bit wasteful, as we can filter in place with `filter!` (the call above allocates a new data frame).
  * As an alternative to `filter` there is `subset` (and `subset!`), which could also be used. E.g.:


In [72]:
subset(d, :F => ByRow(>=(1209)))

Row,F,L,N,gpa
Unnamed: 0_level_1,Int64,Int64,Int64,Float64?
1,1209,1212,15,1.76
2,1209,1209,9,0.1
3,1209,1222,23,3.78
4,1209,1212,13,1.36
5,1209,1212,12,3.56
6,1209,1222,25,3.24
7,1209,1229,27,3.33
8,1209,1229,30,2.82
9,1209,1212,13,0.1
10,1209,1229,26,2.66


As another alternative, the mini language can also be used with `filter`:


In [73]:
g1209(x) = x >= 1209
filter(:F => g1209, d)

Row,F,L,N,gpa
Unnamed: 0_level_1,Int64,Int64,Int64,Float64?
1,1209,1212,15,1.76
2,1209,1209,9,0.1
3,1209,1222,23,3.78
4,1209,1212,13,1.36
5,1209,1212,12,3.56
6,1209,1222,25,3.24
7,1209,1229,27,3.33
8,1209,1229,30,2.82
9,1209,1212,13,0.1
10,1209,1229,26,2.66


As a convenient alternative, we also have:


In [74]:
filter(:F => >=(1209), d)

Row,F,L,N,gpa
Unnamed: 0_level_1,Int64,Int64,Int64,Float64?
1,1209,1212,15,1.76
2,1209,1209,9,0.1
3,1209,1222,23,3.78
4,1209,1212,13,1.36
5,1209,1212,12,3.56
6,1209,1222,25,3.24
7,1209,1229,27,3.33
8,1209,1229,30,2.82
9,1209,1212,13,0.1
10,1209,1229,26,2.66


To explain a bit:


  * `>=(1209)` is a curried form of `>=(x,y)` with `y=1209`; there are a few such operators for convenience with such tasks.
  * `subset` needs `ByRow` (to ensure the function consumes an element of the column, not the entire column), whereas `filter` does not: `filter` passes each row's value to the function and keeps the rows for which it returns `true`, whereas `subset` passes the whole column.
  * The `r -> r.F >= 1209` anonymous function is probably clearer...
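The variants discussed above can be checked against each other. This sketch uses a small made-up data frame `terms` rather than the downloaded data, and adds plain logical indexing for comparison:

```julia
using DataFrames

terms = DataFrame(F = [1159, 1209, 1229], N = [11, 15, 5])

r1 = filter(:F => >=(1209), terms)               # function receives one value per row
r2 = filter(r -> r.F >= 1209, terms)             # function receives a whole row
r3 = subset(terms, :F => ByRow(>=(1209)))        # ByRow adapts the function to a column
r4 = subset(terms, :F => (v -> v .>= 1209))      # or broadcast over the column ourselves
r5 = terms[terms.F .>= 1209, :]                  # logical indexing

@show r1 == r2 == r3 == r4 == r5                 # true
```

One practical difference: `subset` also accepts a `skipmissing=true` keyword for columns with missing values, whereas the predicate passed to `filter` must itself return `true` or `false`.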


---


In [75]:
# 💻 Using `filter` extract those students whose first term was `1229`. How many were there?
filter(:F => ==(1229), d)

Row,F,L,N,gpa
Unnamed: 0_level_1,Int64,Int64,Int64,Float64?
1,1229,1229,5,missing
2,1229,1229,8,missing
3,1229,1229,9,missing
4,1229,1229,6,missing
5,1229,1229,9,missing
6,1229,1229,8,missing
7,1229,1229,9,missing
8,1229,1229,10,missing
9,1229,1229,9,missing
10,1229,1229,7,missing


In [76]:
# 💻 Can you answer the same question for all terms using `combine`, say?
combine(groupby(d, :F), :F => length => :cnt)

Row,F,cnt
Unnamed: 0_level_1,Int64,Int64
1,1159,456
2,1162,281
3,1169,719
4,1172,281
5,1179,800
6,1182,253
7,1189,762
8,1192,240
9,1199,701
10,1202,263


---


### Creating new columns


We want to compute how many semesters a student stayed. The data is computable as we have the first and last (`.F` and `.L`) semesters recorded. However, the semester uses an idiosyncratic storage (a leading `1`, a two-digit year, and a semester digit with spring=`2`, fall=`9`).


Here we decode:


In [77]:
function decode_semester(x)
    yr  = div(x - 1000, 10)   # 1229 -> 22
    val = rem(x, 10) == 2 ? 0.0 : 0.5 # 1229 -> 22 + 0.5; 1222 -> 22 + 0.0
    yr + val
end

decode_semester (generic function with 1 method)
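A few worked values may help. The sketch below is a variant, named `decode_term` only for this example, that splits off the year and the semester digit in one step with `divrem`; like `decode_semester`, it treats a final digit of `2` as spring and anything else as fall.

```julia
function decode_term(x)
    yr, sem = divrem(x - 1000, 10)     # 1229 -> (22, 9)
    yr + (sem == 2 ? 0.0 : 0.5)        # spring adds 0.0, fall adds 0.5
end

@show decode_term(1222)                        # 22.0 (spring of year 22)
@show decode_term(1229)                        # 22.5 (fall of year 22)
@show decode_term(1229) - decode_term(1159)    # 7.0 years between the two falls
```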

We want to combine the `:F` and `:L` columns and make a new column.  For our use, this becomes


In [78]:
Δ(f,l) = decode_semester(l) - decode_semester(f) + 1/2
transform!(d, [:F, :L] => ByRow(Δ) => :semesters)

Row,F,L,N,gpa,semesters
Unnamed: 0_level_1,Int64,Int64,Int64,Float64?,Float64
1,1159,1162,11,3.34,1.0
2,1159,1192,41,2.51,4.0
3,1159,1172,9,4.1,2.0
4,1159,1189,23,2.7,3.5
5,1159,1192,34,3.8,4.0
6,1159,1202,41,2.81,5.0
7,1159,1159,7,4.1,0.5
8,1159,1222,20,3.74,7.0
9,1159,1182,16,0.96,3.0
10,1159,1162,7,3.8,1.0


The subtlety above is the `ByRow` which is needed to broadcast the values here. In this example, we could have written `Δ` to broadcast with either:


In [79]:
Δ(f,l) = decode_semester.(l) - decode_semester.(f) .+ 1/2

Δ (generic function with 1 method)

Or using the `@.` **macro**:


In [80]:
@. Δ(f,l) = decode_semester(l) - decode_semester(f) + 1/2

Δ (generic function with 1 method)

Then we could have computed with:


In [81]:
d = transform(d, [:F, :L] => Δ => :alt_semesters)

Row,F,L,N,gpa,semesters,alt_semesters
Unnamed: 0_level_1,Int64,Int64,Int64,Float64?,Float64,Float64
1,1159,1162,11,3.34,1.0,1.0
2,1159,1192,41,2.51,4.0,4.0
3,1159,1172,9,4.1,2.0,2.0
4,1159,1189,23,2.7,3.5,3.5
5,1159,1192,34,3.8,4.0,4.0
6,1159,1202,41,2.81,5.0,5.0
7,1159,1159,7,4.1,0.5,0.5
8,1159,1222,20,3.74,7.0,7.0
9,1159,1182,16,0.96,3.0,3.0
10,1159,1162,7,3.8,1.0,1.0


---


In [82]:
# 💻 Can you compute the average number of classes taken per semester for each student?
avg(s,n) = n/(2s)
combine(d, [:semesters, :N] => ByRow(avg) => :avg)

Row,avg
Unnamed: 0_level_1,Float64
1,5.5
2,5.125
3,2.25
4,3.28571
5,4.25
6,4.1
7,7.0
8,1.42857
9,2.66667
10,3.5


---


### Counting


The number of semesters a student stays is of interest. At Euphoria State there are many good  reasons to transfer, so the simple model of 8 semesters and out is not typical.


A simple tally could be done as follows:


In [83]:
sems = unique(d.semesters)
cnt = Dict(s => 0 for s ∈ sems)  # initialize with a generator
for r ∈ eachrow(d)
    cnt[r.semesters] += 1
end
cnt

Dict{Float64, Int64} with 15 entries:
  5.0 => 170
  7.0 => 17
  0.5 => 2191
  7.5 => 11
  1.5 => 784
  1.0 => 1165
  5.5 => 134
  4.0 => 371
  6.0 => 66
  2.0 => 545
  3.5 => 462
  6.5 => 53
  3.0 => 344
  2.5 => 540
  4.5 => 259

Since `for` loops are fast in `Julia` this is actually performant, but the dictionary used for counting is not that convenient.
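One small convenience, sketched here with made-up values standing in for `d.semesters`: using `get` with a default avoids having to initialize the dictionary first, and collecting the pairs allows sorting by key for display.

```julia
# Hypothetical values in place of `d.semesters`.
semesters = [0.5, 1.0, 0.5, 2.0, 1.0, 0.5]

tally = Dict{Float64, Int}()
for s in semesters
    tally[s] = get(tally, s, 0) + 1    # unseen keys start from 0
end

# A Dict is unordered; collect the pairs and sort by key.
sorted_tally = sort(collect(tally); by = first)
@show sorted_tally   # [0.5 => 3, 1.0 => 2, 2.0 => 1]
```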


This counting can be achieved with `combine` followed by sorting:


In [84]:
df = combine(groupby(d, :semesters), nrow => :n)
sort(df, :semesters)

| Row | semesters | n |
|---|---|---|
| | Float64 | Int64 |
| 1 | 0.5 | 2191 |
| 2 | 1.0 | 1165 |
| 3 | 1.5 | 784 |
| 4 | 2.0 | 545 |
| 5 | 2.5 | 540 |
| 6 | 3.0 | 344 |
| 7 | 3.5 | 462 |
| 8 | 4.0 | 371 |
| 9 | 4.5 | 259 |
| 10 | 5.0 | 170 |


---


In [85]:
# 💻 What is the distribution of the number of courses a student took while at Euphoria State?
# Pass the argument `rev=true` to `sort` to order the values from largest to smallest. What is the most common number of courses?
df = combine(groupby(d, :N), nrow => :n)
sort(df, :n; rev=true)

| Row | N | n |
|---|---|---|
| | Int64 | Int64 |
| 1 | 8 | 683 |
| 2 | 5 | 612 |
| 3 | 9 | 575 |
| 4 | 6 | 376 |
| 5 | 13 | 303 |
| 6 | 14 | 294 |
| 7 | 10 | 281 |
| 8 | 12 | 271 |
| 9 | 7 | 227 |
| 10 | 15 | 188 |


In [86]:
# 💻 what is the distribution of the mean number of courses taken by first term (:F)?
# After sorting, which term had the most?
df = combine(groupby(d, :F), :N => mean => :n)
sort(df, :n; rev=true)

| Row | F | n |
|---|---|---|
| | Int64 | Float64 |
| 1 | 1169 | 24.0946 |
| 2 | 1159 | 24.0241 |
| 3 | 1179 | 21.9925 |
| 4 | 1189 | 21.6496 |
| 5 | 1199 | 20.3666 |
| 6 | 1172 | 17.9893 |
| 7 | 1162 | 17.7117 |
| 8 | 1182 | 17.5099 |
| 9 | 1192 | 17.0333 |
| 10 | 1209 | 16.1586 |


---


## Contingency tables


We are curious how the number of semesters has varied over the years and want a contingency table.


Grouping by more than one column is possible, as this shows:


In [87]:
df = combine(groupby(d, [:semesters, :F]), nrow => :N)
sort(df, [:F, :semesters])

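| Row | semesters | F | N |
|---|---|---|---|
| | Float64 | Int64 | Int64 |
| 1 | 0.5 | 1159 | 73 |
| 2 | 1.0 | 1159 | 74 |
| 3 | 1.5 | 1159 | 32 |
| 4 | 2.0 | 1159 | 39 |
| 5 | 2.5 | 1159 | 26 |
| 6 | 3.0 | 1159 | 35 |
| 7 | 3.5 | 1159 | 11 |
| 8 | 4.0 | 1159 | 49 |
| 9 | 4.5 | 1159 | 12 |
| 10 | 5.0 | 1159 | 33 |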


But a more familiar display is a contingency table. For that we reach for the `FreqTables` package:


In [114]:
Pkg.add("FreqTables")
import FreqTables: freqtable

   Resolving package versions...
   Installed Combinatorics ───── v1.0.2
   Installed FreqTables ──────── v0.4.6
   Installed CategoricalArrays ─ v0.10.8
   Installed NamedArrays ─────── v0.10.3
    Updating `~/.julia/environments/v1.10/Project.toml`
  [da1fdf0e] + FreqTables v0.4.6
    Updating `~/.julia/environments/v1.10/Manifest.toml`
  [324d7699] + CategoricalArrays v0.10.8
  [861a8166] + Combinatorics v1.0.2
  [da1fdf0e] + FreqTables v0.4.6
  [86f7a689] + NamedArrays v0.10.3
Precompiling packages...
   2333.5 ms  ✓ Combinatorics
   2586.9 ms  ✓ CategoricalArrays
   2739.2 ms  ✓ NamedArrays
   2088.7 ms  ✓ CategoricalArrays → CategoricalArraysRecipesBaseExt

The `freqtable` function is used like `R`'s `table` function (not `xtabs`, with its modeling formula interface):


In [115]:
m = freqtable(d.F, d.semesters)

15×15 Named Matrix{Int64}
Dim1 ╲ Dim2 │ 0.5  1.0  1.5  2.0  2.5  3.0  …  5.0  5.5  6.0  6.5  7.0  7.5
────────────┼──────────────────────────────────────────────────────────────
1159        │  73   74   32   39   26   35  …   33   12   28   10   11   11
1162        │ 100   28   26   12   10   11       9   14    5    8    6    0
1169        │ 103  136   49   73   32   45      51   20   24   35    0    0
1172        │ 101   24   22   14   17   14      18    9    9    0    0    0
1179        │ 138  167   69   71   37   54      38   79    0    0    0    0
1182        │  82   36   26   12   18   15      21    0    0    0    0    0
1189        │ 150  159   43   55   40   57       0    0    0    0    0    0
1192        │  72   22   20   10   24   17       0    0    0    0    0    0
1199        │ 114  106   53   81   54   52       0    0    0    0    0    0
1202        │ 109   18   30   33   29   44       0    0    0    0    0    0
1209        │ 172  135   77   82  253    0       0    0    0  

Students counted in the last nonzero entry of each row (the lower diagonal) are still enrolled; the other students have left.

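A contingency table is a matrix of counts indexed by the distinct values of two columns. As a rough sketch of that idea using only Base (this is not how `FreqTables` is implemented, and the short vectors are made-up stand-ins for `d.F` and `d.semesters`):

```julia
# Stand-in columns; in the text these are `d.F` and `d.semesters`.
F         = [1159, 1159, 1162, 1162, 1162]
semesters = [0.5,  1.0,  0.5,  0.5,  1.0]

rows = sort(unique(F))          # row labels
cols = sort(unique(semesters))  # column labels

# counts[i, j] = number of records with F == rows[i] and semesters == cols[j]
counts = [count(==((r, c)), zip(F, semesters)) for r in rows, c in cols]
```

What `freqtable` adds on top of such a matrix is the row and column names, which is why its result prints as a named matrix.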

---


In [116]:
# 💻 Make a contingency table of :F versus :L for d. Is the shape expected?
freqtable(d.F, d.L) # yes, as :F <= :L we get 0s when this is not the case

15×15 Named Matrix{Int64}
Dim1 ╲ Dim2 │ 1159  1162  1169  1172  1179  …  1209  1212  1219  1222  1229
────────────┼──────────────────────────────────────────────────────────────
1159        │   73    74    32    39    26  …    12    28    10    11    11
1162        │    0   100    28    26    12        9    14     5     8     6
1169        │    0     0   103   136    49       39    51    20    24    35
1172        │    0     0     0   101    24       21    16    18     9     9
1179        │    0     0     0     0   138       39    61    47    38    79
1182        │    0     0     0     0     0       15    24    10     9    21
1189        │    0     0     0     0     0       40    57    54    86   118
1192        │    0     0     0     0     0       10    24    17    25    50
1199        │    0     0     0     0     0       53    81    54    52   241
1202        │    0     0     0     0     0       18    30    33    29    44
1209        │    0     0     0     0     0      172   135    7

---


There are big variations between students who started in the fall versus the spring (more students start in the fall semester at Euphoria State). Here we select fall cohorts:


In [117]:
d1 = filter(r -> r.F % 10 == 9, d)
m = freqtable(d1.F, d1.semesters)

8×15 Named Matrix{Int64}
Dim1 ╲ Dim2 │ 0.5  1.0  1.5  2.0  2.5  3.0  …  5.0  5.5  6.0  6.5  7.0  7.5
────────────┼──────────────────────────────────────────────────────────────
1159        │  73   74   32   39   26   35  …   33   12   28   10   11   11
1169        │ 103  136   49   73   32   45      51   20   24   35    0    0
1179        │ 138  167   69   71   37   54      38   79    0    0    0    0
1189        │ 150  159   43   55   40   57       0    0    0    0    0    0
1199        │ 114  106   53   81   54   52       0    0    0    0    0    0
1209        │ 172  135   77   82  253    0       0    0    0    0    0    0
1219        │ 194  147  321    0    0    0       0    0    0    0    0    0
1229        │ 599    0    0    0    0    0  …    0    0    0    0    0    0

---


In [118]:
# 💻 Repeat the above, finding a contingency table for those starting in the spring semester (Term ends in 2)
d1 = filter(r -> r.F % 10 == 2, d)
m = freqtable(d1.F, d1.semesters)

7×14 Named Matrix{Int64}
Dim1 ╲ Dim2 │ 0.5  1.0  1.5  2.0  2.5  3.0  …  4.5  5.0  5.5  6.0  6.5  7.0
────────────┼──────────────────────────────────────────────────────────────
1162        │ 100   28   26   12   10   11  …   18    9   14    5    8    6
1172        │ 101   24   22   14   17   14      16   18    9    9    0    0
1182        │  82   36   26   12   18   15       9   21    0    0    0    0
1192        │  72   22   20   10   24   17       0    0    0    0    0    0
1202        │ 109   18   30   33   29   44       0    0    0    0    0    0
1212        │ 100   28   16   63    0    0       0    0    0    0    0    0
1222        │  84   85    0    0    0    0  …    0    0    0    0    0    0

---


This pattern of repeated data transformation is often done with a piping syntax, which can feel more natural. Here is one way to do so:


In [119]:
d |>
    x -> filter(r -> r.F % 10 == 9, x) |>
    x -> freqtable(x.F, x.semesters)

8×15 Named Matrix{Int64}
Dim1 ╲ Dim2 │ 0.5  1.0  1.5  2.0  2.5  3.0  …  5.0  5.5  6.0  6.5  7.0  7.5
────────────┼──────────────────────────────────────────────────────────────
1159        │  73   74   32   39   26   35  …   33   12   28   10   11   11
1169        │ 103  136   49   73   32   45      51   20   24   35    0    0
1179        │ 138  167   69   71   37   54      38   79    0    0    0    0
1189        │ 150  159   43   55   40   57       0    0    0    0    0    0
1199        │ 114  106   53   81   54   52       0    0    0    0    0    0
1209        │ 172  135   77   82  253    0       0    0    0    0    0    0
1219        │ 194  147  321    0    0    0       0    0    0    0    0    0
1229        │ 599    0    0    0    0    0  …    0    0    0    0    0    0

The anonymous functions are easy to write, but difficult to parse. Plus they add some boilerplate that would be nice to remove.

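Part of the parsing difficulty is that the body of an anonymous function extends as far to the right as it can. A small Base-only sketch with made-up values shows three spellings of the same two-step pipeline:

```julia
xs = [3, 8, 12, 15]

# With parentheses each stage is a separate function: sum(filter(iseven, xs)).
a = xs |> (v -> filter(iseven, v)) |> sum

# Without parentheses the body of the first anonymous function runs to the
# end of the expression, so the second stage is nested inside the first.
b = xs |> v -> filter(iseven, v) |> w -> sum(w)

# Function composition gives a reusable pipeline with no piping at all.
pipeline = sum ∘ (v -> filter(iseven, v))
c = pipeline(xs)
```

All three give the same value here; the difference is only in how the expression is grouped.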

In `Julia` there are **too** many solutions to this through add-on packages. One approach is to create a placeholder for the previous value so that it can be threaded through the next call; another is to create a simplified syntax for anonymous functions. The `Chain` package and its `@chain` macro do the former; the `Underscores` package can do the latter (there are also the related `Pipe`, `DataPipes`, `Lazy`, ... packages).


We will use `Chain`:


In [121]:
Pkg.add("Chain")
using Chain

   Resolving package versions...
   Installed Chain ─ v0.6.0
    Updating `~/.julia/environments/v1.10/Project.toml`
  [8be319e6] + Chain v0.6.0
    Updating `~/.julia/environments/v1.10/Manifest.toml`
  [8be319e6] + Chain v0.6.0
Precompiling packages...
    850.9 ms  ✓ Chain
  1 dependency successfully precompiled in 1 seconds. 462 already precompiled.


With `Chain` we use two simple rules:


  * we can use an underscore, `_`, to specify where the passed along value should fit into the next function call
  * if no `_` is used, it is passed to the first position.


Also with `Chain` the piping notation is implicit through a new line.


The above becomes:


In [122]:
@chain d begin
    filter(r -> r.F % 10 == 9, _)
    freqtable(_.F, _.semesters)
end

8×15 Named Matrix{Int64}
Dim1 ╲ Dim2 │ 0.5  1.0  1.5  2.0  2.5  3.0  …  5.0  5.5  6.0  6.5  7.0  7.5
────────────┼──────────────────────────────────────────────────────────────
1159        │  73   74   32   39   26   35  …   33   12   28   10   11   11
1169        │ 103  136   49   73   32   45      51   20   24   35    0    0
1179        │ 138  167   69   71   37   54      38   79    0    0    0    0
1189        │ 150  159   43   55   40   57       0    0    0    0    0    0
1199        │ 114  106   53   81   54   52       0    0    0    0    0    0
1209        │ 172  135   77   82  253    0       0    0    0    0    0    0
1219        │ 194  147  321    0    0    0       0    0    0    0    0    0
1229        │ 599    0    0    0    0    0  …    0    0    0    0    0    0

(The `Underscores.jl` package could avoid the remaining anonymous function.) Here we define a *closure* to create a function that fixes the semester:


In [123]:
function start_semester(x=:fall)
    s = x == :spring ? 2 : 9
    r -> r.F % 10 == s
end

start_semester (generic function with 2 methods)

Then the above can be written as:


In [124]:
@chain d begin
    filter(start_semester(:fall), _)
    freqtable(_.F, _.semesters)
end

8×15 Named Matrix{Int64}
Dim1 ╲ Dim2 │ 0.5  1.0  1.5  2.0  2.5  3.0  …  5.0  5.5  6.0  6.5  7.0  7.5
────────────┼──────────────────────────────────────────────────────────────
1159        │  73   74   32   39   26   35  …   33   12   28   10   11   11
1169        │ 103  136   49   73   32   45      51   20   24   35    0    0
1179        │ 138  167   69   71   37   54      38   79    0    0    0    0
1189        │ 150  159   43   55   40   57       0    0    0    0    0    0
1199        │ 114  106   53   81   54   52       0    0    0    0    0    0
1209        │ 172  135   77   82  253    0       0    0    0    0    0    0
1219        │ 194  147  321    0    0    0       0    0    0    0    0    0
1229        │ 599    0    0    0    0    0  …    0    0    0    0    0    0

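The closure can be exercised without a data frame. In this sketch the named tuples are made-up stand-ins for rows; the default argument is also why `start_semester` reports two methods (one with no argument, one with a single argument).

```julia
# Same definition as in the text.
function start_semester(x = :fall)
    s = x == :spring ? 2 : 9
    r -> r.F % 10 == s          # `s` is captured by the returned function
end

# Named tuples stand in for data-frame rows (an assumption for this sketch).
rows = [(F = 1159, N = 11), (F = 1162, N = 7), (F = 1169, N = 20)]

fall   = filter(start_semester(), rows)          # default argument is :fall
spring = filter(start_semester(:spring), rows)
```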
---


In [125]:
# 💻 Can you filter by fall semester; then filter by :F being 1199 or greater; then make a table of first semester by number of courses?
@chain d begin
    filter(start_semester(:fall), _)
    filter(:F => >=(1199), _)
    freqtable(_.F, _.N)
end

4×38 Named Matrix{Int64}
Dim1 ╲ Dim2 │   5    6    7    8    9   10  …   37   38   39   40   41   42
────────────┼──────────────────────────────────────────────────────────────
1199        │  37   23   42   48   21   15  …   10    7    8    2    3    1
1209        │  55   39   18   53   54   23       1    0    0    0    0    1
1219        │  62   34   12   65   53   35       0    0    0    0    0    0
1229        │  60   64   38  171  191   64  …    0    0    0    0    0    0

In [126]:
# 💻 Can you filter by fall semester; then filter by :gpa being 3.0 or greater; then make a table of first semester by number of courses?
# something like this will be needed: filter(:gpa => !ismissing, _)
@chain d begin
    filter(start_semester(:fall), _)
    filter(:gpa => !ismissing, _)
    filter(:gpa => >=(3.0), _)
end

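| Row | F | L | N | gpa | semesters | alt_semesters |
|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Float64? | Float64 | Float64 |
| 1 | 1159 | 1162 | 11 | 3.34 | 1.0 | 1.0 |
| 2 | 1159 | 1172 | 9 | 4.1 | 2.0 | 2.0 |
| 3 | 1159 | 1192 | 34 | 3.8 | 4.0 | 4.0 |
| 4 | 1159 | 1159 | 7 | 4.1 | 0.5 | 0.5 |
| 5 | 1159 | 1222 | 20 | 3.74 | 7.0 | 7.0 |
| 6 | 1159 | 1162 | 7 | 3.8 | 1.0 | 1.0 |
| 7 | 1159 | 1192 | 46 | 4.0 | 4.0 | 4.0 |
| 8 | 1159 | 1182 | 45 | 3.8 | 3.0 | 3.0 |
| 9 | 1159 | 1172 | 20 | 3.35 | 2.0 | 2.0 |
| 10 | 1159 | 1179 | 10 | 3.81 | 2.5 | 2.5 |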

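The predicates `>=(1199)` and `!ismissing` used in these filters are ordinary functions built from other functions. A short Base-only sketch on a made-up vector of GPA values:

```julia
at_least_3 = >=(3.0)        # a function of one argument: x -> x >= 3.0
at_least_3(3.5)             # true

gpa = [3.34, missing, 2.51, 4.0]

# `!f` builds the negated function, so `!ismissing` is x -> !ismissing(x).
present = filter(!ismissing, gpa)

# Dropping missing values first keeps the comparison from returning `missing`.
high = filter(>=(3.0), present)
```

The order matters: `missing >= 3.0` evaluates to `missing`, which `filter` cannot use as a true or false value.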

---


## Visualization


A visualization might be helpful. `Julia` has a few add-on packages for making plots: `PyPlot` uses the Python package `Matplotlib`; `GR` uses the GR graphing package; `Plots` is a very useful interface to those backends and others; `Makie` is a powerful package written in `Julia` which shines with 3-d graphics. Here we use the `PlotlyLight` interface to PlotlyJS, as it works quickly under Colab.


In [127]:
Pkg.add("PlotlyLight")
using PlotlyLight

   Resolving package versions...
   Installed StructTypes ──────── v1.11.0
   Installed JSON3 ────────────── v1.14.2
   Installed EasyConfig ───────── v0.1.16
   Installed Cobweb ───────────── v0.7.2
   Installed PlotlyLight ──────── v0.12.0
   Installed DefaultApplication ─ v1.1.0
    Updating `~/.julia/environments/v1.10/Project.toml`
  [ca7969ec] + PlotlyLight v0.12.0
    Updating `~/.julia/environments/v1.10/Manifest.toml`
  [ec354790] + Cobweb v0.7.2
  [3f0dd361] + DefaultApplication v1.1.0
  [acab07b0] + EasyConfig v0.1.16
  [0f8b85d8] + JSON3 v1.14.2
  [ca7969ec] + PlotlyLight v0.12.0
  [856f2bd8] + StructTypes v1.11.0
Precompiling packages...
   1716.4 m

`PlotlyLight` is a *lightweight* interface to PlotlyJS, with `Config` used to create `JSON` from `Julia` objects. The PlotlyJS API has some shortcuts for making multiple plots, but for pedagogical reasons we show how to add each trace one at a time.


For this graphic we have to be careful to remove the values on the diagonal, as we are looking for when students leave. First we define a function to make the plotting data (`x`, `y` values and a label) for a given semester.


In [128]:
function gather_data(s)
    sem = first(s.F)
    m = maximum(s.semesters)
    n = length(s.semesters)

    xs = 0.5:0.5:m
    ys = [sum(s.semesters .== i) for i ∈ xs] ./ n

    (x = xs[1:end-1], y = ys[1:end-1], name = string(sem))
end

gather_data (generic function with 1 method)

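Since `gather_data` only uses the `F` and `semesters` properties of its argument, it can be tried on a small stand-in. In this sketch a named tuple of made-up vectors plays the role of one group of the data frame, which is an assumption made so that no packages are needed:

```julia
# Same logic as `gather_data` in the text.
function gather_data(s)
    sem = first(s.F)
    m = maximum(s.semesters)
    n = length(s.semesters)
    xs = 0.5:0.5:m
    ys = [sum(s.semesters .== i) for i ∈ xs] ./ n
    (x = xs[1:end-1], y = ys[1:end-1], name = string(sem))
end

# One made-up cohort: four students, the longest stay being 1.5.
cohort = (F = [1159, 1159, 1159, 1159], semesters = [0.5, 0.5, 1.0, 1.5])

# The last point (1.5, the still-enrolled group) is dropped by `1:end-1`.
trace = gather_data(cohort)
```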
To make different plots with `PlotlyLight`, we set up a basic configuration, and reuse this for each layer:


In [129]:
cfg = Config(type="scatter", mode="lines+markers")  # Plotly joins mode flags with "+"
data = Config[]  # a typed array with no elements
for s  ∈ groupby(d, :F)
    first(s.F) == 1229 && continue
    plt = copy(cfg)
    plt.x, plt.y, plt.name = gather_data(s)
    push!(data, plt)
end
lyt = Config(width=800, height=500)
Plot(data, lyt)

A similar plot could be formed from the frequency table. In the above, we needlessly recreate that construction in `gather_data` with the comprehension.


---


In [130]:
# 💻 Filter the students so that only those who started in a fall term from fall 19 to fall 22 are shown.
df = @chain d begin
    filter(start_semester(:fall), _)
    filter(:F => >=(1199), _)
end

data = Config[]  # a typed array with no elements
for s  ∈ groupby(df, :F)
    first(s.F) == 1229 && continue
    plt = copy(cfg)
    plt.x, plt.y, plt.name = gather_data(s)
    push!(data, plt)
end
lyt = Config(width=800, height=500)
Plot(data, lyt)


---


Restricting the semesters shows a bit more detail about the change in patterns due to the pandemic. We might see an increase in students leaving after an initial semester.


## More data management


If that is to be investigated, we might want to see whether GPA has something to do with it. Perhaps it is lack of preparation due to the pandemic, perhaps not.


The `gpa` value is numeric, but we would prefer it to be categorical. The `cut` function from the `CategoricalArrays` package can perform that classification:


In [131]:
Pkg.add("CategoricalArrays")
import CategoricalArrays: cut

   Resolving package versions...
    Updating `~/.julia/environments/v1.10/Project.toml`
  [324d7699] + CategoricalArrays v0.10.8
  No Changes to `~/.julia/environments/v1.10/Manifest.toml`


In [132]:
rcode(gpa) = cut(gpa, [0.0, 2.0, 3.0, Inf];
                 labels=["lo", "medium", "hi"])
transform!(d, :gpa => rcode => :status)

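| Row | F | L | N | gpa | semesters | alt_semesters | status |
|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Float64? | Float64 | Float64 | Cat…? |
| 1 | 1159 | 1162 | 11 | 3.34 | 1.0 | 1.0 | hi |
| 2 | 1159 | 1192 | 41 | 2.51 | 4.0 | 4.0 | medium |
| 3 | 1159 | 1172 | 9 | 4.1 | 2.0 | 2.0 | hi |
| 4 | 1159 | 1189 | 23 | 2.7 | 3.5 | 3.5 | medium |
| 5 | 1159 | 1192 | 34 | 3.8 | 4.0 | 4.0 | hi |
| 6 | 1159 | 1202 | 41 | 2.81 | 5.0 | 5.0 | medium |
| 7 | 1159 | 1159 | 7 | 4.1 | 0.5 | 0.5 | hi |
| 8 | 1159 | 1222 | 20 | 3.74 | 7.0 | 7.0 | hi |
| 9 | 1159 | 1182 | 16 | 0.96 | 3.0 | 3.0 | lo |
| 10 | 1159 | 1162 | 7 | 3.8 | 1.0 | 1.0 | hi |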

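The classification itself is a lookup of which pair of breaks a value falls between. A Base-only sketch of that idea follows; `toy_cut` is a hypothetical helper, not part of `CategoricalArrays`, and it assumes intervals closed on the left (which is our understanding of the default for `cut`) and returns plain strings rather than categorical values.

```julia
breaks = [0.0, 2.0, 3.0, Inf]
labels = ["lo", "medium", "hi"]

# Hypothetical helper: the bin index is the position of the last break that
# is <= x, so each interval is closed on the left. Missing values pass through.
toy_cut(x) = ismissing(x) ? missing : labels[searchsortedlast(breaks, x)]

gpa = [3.34, 2.51, 0.96, missing, 2.0]
status = toy_cut.(gpa)
```

The sketch does not guard against values below the first break, which `toy_cut` would fail on.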

In [133]:
@chain d begin
    filter(start_semester(:fall), _)
    filter(:semesters => ==(0.5), _)
    freqtable(_.F, _.status)
end

8×4 Named Matrix{Int64}
Dim1 ╲ Dim2 │      lo   medium       hi  missing
────────────┼───────────────────────────────────
1159        │      11        8       33       21
1169        │      26       15       33       29
1179        │      29       26       27       56
1189        │      46       18       37       49
1199        │      32        7       26       49
1209        │      34       19       68       51
1219        │      34       28       67       65
1229        │       0        0        0      599

Working a bit more, we want to manipulate the frequency table, but our tools are easier to use with data frames. Unfortunately, we don't have a direct conversion at hand. Here we extract the table's values and column names to build the data frame, then insert the row names as its first column:


In [134]:
"""
    nt_2_df(m; nm=:ID)

Convert a named matrix (e.g., from `FreqTables`) into a data frame,
storing the row names in a first column named `nm`.
"""
function nt_2_df(m; nm=:ID)
    rnames, cnames = names(m, 1), names(m, 2)
    D = DataFrame(m.array, (Symbol∘string).(cnames))
    insertcols!(D, 1, nm => rnames)
    D
end

nt_2_df

In [135]:
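# proportion: sum of all but the last argument, divided by the last argument
_prop(x...) = sum(x[1:end-1]) / x[end]

@chain d begin
    filter(:F => !=(1229), _)                # drop the newest cohort
    filter(:semesters => ==(0.5), _)         # students with a single term
    freqtable(_.F, _.status)
    nt_2_df
    combine(:ID, Not(:ID) => (+) => :N, :)   # add row totals as :N, keep all columns
    combine(:ID, :N,
            [:lo, :missing, :N] => ByRow(_prop) => :lo,   # missing GPAs are pooled with "lo"
            [:medium, :N] => ByRow(_prop) => :medium,
            [:hi, :N] => ByRow(_prop) => :hi)
end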


| Row | ID | N | lo | medium | hi |
|---|---|---|---|---|---|
| | Int64 | Int64 | Float64 | Float64 | Float64 |
| 1 | 1159 | 73 | 0.438356 | 0.109589 | 0.452055 |
| 2 | 1162 | 100 | 0.38 | 0.21 | 0.41 |
| 3 | 1169 | 103 | 0.533981 | 0.145631 | 0.320388 |
| 4 | 1172 | 101 | 0.584158 | 0.128713 | 0.287129 |
| 5 | 1179 | 138 | 0.615942 | 0.188406 | 0.195652 |
| 6 | 1182 | 82 | 0.402439 | 0.170732 | 0.426829 |
| 7 | 1189 | 150 | 0.633333 | 0.12 | 0.246667 |
| 8 | 1192 | 72 | 0.416667 | 0.111111 | 0.472222 |
| 9 | 1199 | 114 | 0.710526 | 0.0614035 | 0.22807 |
| 10 | 1202 | 109 | 0.568807 | 0.0733945 | 0.357798 |

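The arithmetic behind the first row can be checked by hand with the counts shown earlier for the 1159 cohort (lo = 11, medium = 8, hi = 33, missing = 21). The sketch also shows why `+` works for the row totals: applied to whole columns, it adds them elementwise.

```julia
# Same definition as in the text: all but the last argument are summed,
# and the last argument is the denominator.
_prop(x...) = sum(x[1:end-1]) / x[end]

# `+` on column vectors gives row totals (counts for the 1159 and 1169 rows).
N = (+)([11, 26], [8, 15], [33, 33], [21, 29])

lo     = _prop(11, 21, 73)   # "lo" and "missing" are pooled
medium = _prop(8, 73)
hi     = _prop(33, 73)

lo + medium + hi ≈ 1.0       # the three proportions cover every student
```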

## Tangent: overriding a base method


This shoehorns in an example of defining a user-defined structure and a custom `show` method, a common and easy practice.


This example is to create an alternate display for the table we saw previously:


In [136]:
sems = unique(d.semesters)
cnt = Dict(s => 0 for s ∈ sems)  # initialize with a generator
for r ∈ eachrow(d)
    cnt[r.semesters] += 1
end
cnt

Dict{Float64, Int64} with 15 entries:
  5.0 => 170
  7.0 => 17
  0.5 => 2191
  7.5 => 11
  1.5 => 784
  1.0 => 1165
  5.5 => 134
  4.0 => 371
  6.0 => 66
  2.0 => 545
  3.5 => 462
  6.5 => 53
  3.0 => 344
  2.5 => 540
  4.5 => 259

Defining new types is as easy as calling `struct` appropriately:


In [137]:
struct PrisonCount
    x::Int
end

This creates an immutable struct; mutable structs are also possible.


We use the following Unicode string for the display:


In [138]:
tallies =  "\u007C"^4*"\u0338 "

"||||̸ "

Unicode is more commonly entered using LaTeX shortcuts (e.g., `\alpha[tab]`); the above uses code points. It also illustrates that `^` for strings is repetition and `*` is concatenation.

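A short sketch of these string operations, which also previews why slicing the tally string by position is safe here: the four bars are single-byte characters, while the combining slash takes two bytes.

```julia
tallies = "\u007C"^4 * "\u0338 "    # four bars, a combining slash, a space

length(tallies)        # 6 characters
ncodeunits(tallies)    # 7 bytes: U+0338 takes two bytes in UTF-8

# Indices 1 through 4 each point at a bar, so any prefix of bars is valid.
# `1:0` is an empty range and yields the empty string.
prefixes = [tallies[1:r] for r in 0:4]
```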

To extend the base `show` function for our new type, the function must be imported or qualified, as below, and the type of `x` must be restricted to our new type:


In [139]:
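function Base.show(io::IO, x::PrisonCount)
    d, r = divrem(x.x, 5)          # d full groups of five, r single strokes
    if d > 10
        print(io, "($d*5)... + ")  # abbreviate large counts
        d = mod(d, 10)             # and draw only the last few groups
    end
    print(io, tallies^d)
    println(io, tallies[1:r]) # add newline at end
end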

  * `Julia` permits a user to override base methods for any type, but the common practice is to do so only for types that the package developer owns. "Type piracy" can be an issue.
  * The `show` method defined above is the catch-all; it is also possible to override `show` based on the MIME type of the display. This notebook shows objects differently than the command line does.
  * Indexing into a string is fruitfully done above. The empty range created by `1:r` when `r=0` requires no special case.

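The catch-all and MIME-specific methods can be compared side by side. In this sketch `Stars` is a hypothetical type defined only for illustration; since we own the type, adding `show` methods for it is not type piracy.

```julia
# Hypothetical type used only for this sketch.
struct Stars
    n::Int
end

# Two-argument `show`: the catch-all used by `print`, `repr`, and `string`.
Base.show(io::IO, s::Stars) = print(io, "*"^s.n)

# Three-argument `show`: used for the richer "text/plain" display, for
# example at the REPL prompt.
Base.show(io::IO, ::MIME"text/plain", s::Stars) =
    print(io, "Stars(", s.n, "): ", "*"^s.n)

compact = repr(Stars(3))                       # uses the two-argument method
rich    = repr(MIME("text/plain"), Stars(3))   # uses the three-argument method
```

Other MIME types, such as `"text/html"`, can be handled the same way when a richer notebook display is wanted.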

Finally, we display the tallies stored in `cnt`:


In [140]:
for k ∈ sort(collect(keys(cnt)))
    print(k, " | ")
    print(PrisonCount(cnt[k]))
end

0.5 | (438*5)... + ||||̸ ||||̸ ||||̸ ||||̸ ||||̸ ||||̸ ||||̸ ||||̸ |
1.0 | (233*5)... + ||||̸ ||||̸ ||||̸ 
1.5 | (156*5)... + ||||̸ ||||̸ ||||̸ ||||̸ ||||̸ ||||̸ ||||
2.0 | (109*5)... + ||||̸ ||||̸ ||||̸ ||||̸ ||||̸ ||||̸ ||||̸ ||||̸ ||||̸ 
2.5 | (108*5)... + ||||̸ ||||̸ ||||̸ ||||̸ ||||̸ ||||̸ ||||̸ ||||̸ 
3.0 | (68*5)... + ||||̸ ||||̸ ||||̸ ||||̸ ||||̸ ||||̸ ||||̸ ||||̸ ||||
3.5 | (92*5)... + ||||̸ ||||̸ ||
4.0 | (74*5)... + ||||̸ ||||̸ ||||̸ ||||̸ |
4.5 | (51*5)... + ||||̸ ||||
5.0 | (34*5)... + ||||̸ ||||̸ ||||̸ ||||̸ 
5.5 | (26*5)... + ||||̸ ||||̸ ||||̸ ||||̸ ||||̸ ||||̸ ||||
6.0 | (13*5)... + ||||̸ ||||̸ ||||̸ |
6.5 | ||||̸ ||||̸ ||||̸ ||||̸ ||||̸ ||||̸ ||||̸ ||||̸ ||||̸ ||||̸ |||
7.0 | ||||̸ ||||̸ ||||̸ ||
7.5 | ||||̸ ||||̸ |
