# Final examples

### Bogumił Kamiński

Let us wrap up our tutorial with examples of joining and reshaping data.

### Joining and reshaping data frames

In [1]:
using DataFrames

In [2]:
using CSV

In [3]:
using Pipe

In [4]:
using Unitful

In [5]:
using Dates

Load the weather forecast data for two cities in Poland.

In [6]:
rainfall_long = CSV.File("rainfall_forecast.csv") |> DataFrame

Row,city,date,rainfall
,String,Date,Float64
1,Olecko,2020-11-16,2.9
2,Olecko,2020-11-17,4.1
3,Olecko,2020-11-19,4.3
4,Olecko,2020-11-20,2.0
5,Olecko,2020-11-21,0.6
6,Olecko,2020-11-22,1.0
7,Ełk,2020-11-16,3.9
8,Ełk,2020-11-19,1.2
9,Ełk,2020-11-20,2.0
10,Ełk,2020-11-22,2.0


Since the `rainfall` column holds measurements, it would be nice to attach units to the values. Unitful.jl makes this easy, and we take advantage of the fact that a `DataFrame` can store vectors of any Julia objects.

In [7]:
transform!(rainfall_long, :rainfall => x -> x .* u"mm", renamecols=false)

Row,city,date,rainfall
,String,Date,Quantit…
1,Olecko,2020-11-16,2.9 mm
2,Olecko,2020-11-17,4.1 mm
3,Olecko,2020-11-19,4.3 mm
4,Olecko,2020-11-20,2.0 mm
5,Olecko,2020-11-21,0.6 mm
6,Olecko,2020-11-22,1.0 mm
7,Ełk,2020-11-16,3.9 mm
8,Ełk,2020-11-19,1.2 mm
9,Ełk,2020-11-20,2.0 mm
10,Ełk,2020-11-22,2.0 mm


With `renamecols=false` we left the name of the transformed column unchanged when we did an in-place update of the data frame using the `transform!` function.
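To make the effect of `renamecols` concrete, here is a minimal sketch on a small hypothetical data frame. The `double` helper and the sample values are illustrative assumptions and are not part of the tutorial data. With a named function, the default output column name is the source column name joined with the function name.

```julia
using DataFrames

# Hypothetical sample data, not the tutorial's CSV file.
df = DataFrame(city=["Olecko", "Ełk"], rainfall=[2.9, 3.9])

# Illustrative helper: doubles every value in a column.
double(x) = 2 .* x

# Default behaviour: the result goes to a new, automatically named column.
renamed = transform(df, :rainfall => double)

# renamecols=false: the result is stored under the source column name.
kept = transform(df, :rainfall => double, renamecols=false)

println(names(renamed))
println(names(kept))
```

Both calls use the non-mutating `transform`, so `df` itself is left unchanged; `transform!` would apply the same rule in place.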

It would be nice to see the data in a wide format, so that each city is represented by a single column. We can achieve this using the `unstack` function:

In [8]:
rainfall_wide = unstack(rainfall_long, :date, :city, :rainfall)

Row,date,Olecko,Ełk
,Date,Quantit…?,Quantit…?
1,2020-11-16,2.9 mm,3.9 mm
2,2020-11-17,4.1 mm,missing
3,2020-11-19,4.3 mm,1.2 mm
4,2020-11-20,2.0 mm,2.0 mm
5,2020-11-21,0.6 mm,missing
6,2020-11-22,1.0 mm,2.0 mm


We can see that the "gaps" in the rainfall information for the `"Ełk"` column were automatically filled with `missing`.

There is also a `stack` function that does the reverse: transforms a data frame from wide to long format.
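As a rough sketch of the round trip between the two formats, the example below unstacks a small hypothetical long table and then stacks it back. The column names and values are made up for illustration; the `variable_name` and `value_name` keyword arguments are used to restore the original column names.

```julia
using DataFrames

# Hypothetical long-format data: city "B" has no entry for day "d2".
long = DataFrame(date=["d1", "d1", "d2"],
                 city=["A", "B", "A"],
                 rainfall=[1.0, 2.0, 3.0])

# Long -> wide: one row per date, one column per city.
wide = unstack(long, :date, :city, :rainfall)

# Wide -> long: every column except :date is treated as a measurement.
back = stack(wide, Not(:date), variable_name=:city, value_name=:rainfall)

# The gap that unstack filled with `missing` comes back as a row; drop it.
back = dropmissing(back)

println(wide)
println(back)
```

Note that the row order after `stack` follows the stacked columns, so it need not match the order of the original long table.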

Also note that one of the cities is `"Ełk"`, whose name contains the non-ASCII character `ł`. This is not a problem for DataFrames.jl. As an exercise, let us extract this column in two equivalent ways:

In [9]:
rainfall_wide.Ełk

6-element Vector{Union{Missing, Quantity{Float64, 𝐋, Unitful.FreeUnits{(mm,), 𝐋, nothing}}}}:
 3.9 mm
       missing
 1.2 mm
 2.0 mm
       missing
 2.0 mm

In [10]:
rainfall_wide."Ełk"

6-element Vector{Union{Missing, Quantity{Float64, 𝐋, Unitful.FreeUnits{(mm,), 𝐋, nothing}}}}:
 3.9 mm
       missing
 1.2 mm
 2.0 mm
       missing
 2.0 mm
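The two spellings above are equivalent ways of getting a column. A small sketch on hypothetical data (values chosen for illustration only) shows that property access and `!` indexing return the very same vector, while `:` indexing returns a copy:

```julia
using DataFrames

# Hypothetical data frame with a non-ASCII column name.
df = DataFrame("Ełk" => [3.9, missing, 1.2])

a = df.Ełk          # property access with an identifier
b = df."Ełk"        # property access with a string
c = df[!, "Ełk"]    # `!` row selector: no copy
d = df[:, :Ełk]     # `:` row selector: a fresh copy

println(a === b)    # same object
println(a === c)    # same object
println(a === d)    # different object with equal contents
```

The string form is handy when a column name is not a valid Julia identifier, for example when it contains spaces.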

Looking at the data, we note that there are still gaps: one of the days in the period (2020-11-18) is absent, because no rainfall was forecast for it in either city.

It would be nice to have information for all days in the considered period. Here is the way to do it:

In [11]:
all_days = DataFrame(date=Date.(2020, 11, 16:22))

Row,date
,Date
1,2020-11-16
2,2020-11-17
3,2020-11-18
4,2020-11-19
5,2020-11-20
6,2020-11-21
7,2020-11-22


In [12]:
@pipe leftjoin(all_days, rainfall_wide, on=:date) |>
      coalesce.(_, 0.0u"mm")

Row,date,Olecko,Ełk
,Date,Quantit…,Quantit…
1,2020-11-16,2.9 mm,3.9 mm
2,2020-11-17,4.1 mm,0.0 mm
3,2020-11-19,4.3 mm,1.2 mm
4,2020-11-20,2.0 mm,2.0 mm
5,2020-11-21,0.6 mm,0.0 mm
6,2020-11-22,1.0 mm,2.0 mm
7,2020-11-18,0.0 mm,0.0 mm


Note that we additionally used a broadcasted `coalesce` operation on the whole data frame returned from `leftjoin` to replace all `missing` values in it with `0.0u"mm"`, as in this case `missing` meant that no rain was forecast for that day.

This was safe to do here, as we knew that the `:date` column does not contain missing values. In particular, note that `leftjoin` would error by default if we tried to perform a join on a column that contains `missing` values (use the `matchmissing` keyword argument in joins to change this behavior).
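The sketch below replays both points on small hypothetical tables (the column names `day`, `rain`, `key`, `x`, and `y` are made up for illustration): first the join followed by a broadcasted `coalesce`, then the default error on `missing` keys and the `matchmissing=:equal` alternative.

```julia
using DataFrames

# Hypothetical data: day 3 has no forecast.
all_days = DataFrame(day=1:4)
forecast = DataFrame(day=[1, 2, 4], rain=[2.9, 4.1, 1.0])

# Rows of all_days without a match get `missing` in :rain.
joined = leftjoin(all_days, forecast, on=:day)

# Broadcasting over the data frame replaces every `missing` with 0.0.
filled = coalesce.(joined, 0.0)
sort!(filled, :day)   # make the row order explicit

# Joining on a key column that contains `missing` errors by default.
left = DataFrame(key=[1, missing], x=[10, 20])
right = DataFrame(key=[1, missing], y=[100, 200])
threw = try
    leftjoin(left, right, on=:key)
    false
catch
    true
end

# matchmissing=:equal treats `missing` keys as equal to each other.
matched = leftjoin(left, right, on=:key, matchmissing=:equal)

println(filled)
println(threw)
println(matched)
```

Sorting after the join is a deliberate choice here: it avoids relying on the row order that `leftjoin` happens to produce.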

### Conclusions

Before we finish, let us summarize the major features that DataFrames.jl provides:
1. A data frame is a matrix-like data structure. You can index it just like a matrix. The differences are
   - you can use strings or `Symbol`s to select columns
   - if you use `!` as the row selector, you get the whole column of the data frame without copying
2. You can quickly summarize the contents of a data frame using the `describe` function
3. You can add rows to a data frame in-place using `push!` (similarly, `append!` allows you to add multiple rows at the same time); `repeat`/`repeat!`, `hcat`, and `vcat` are also provided
4. You can work on a grouped data frame that is created using the `groupby` function. It is a view and works as if you had created a lookup index into a data frame.
5. There are `select`/`select!`/`transform`/`transform!`/`combine` functions that allow you to quickly transform/aggregate columns of a data frame or grouped data frame; there are also `mapcols`/`mapcols!` functions for quick aggregation of columns of a data frame
6. You can filter rows of a data frame using `filter` and `filter!` functions (also `subset` and `subset!` starting from version 1.0)
7. Use `sort` and `sort!` functions to sort data frames
8. You can join multiple data frames using `innerjoin`, `outerjoin`, `leftjoin`, `rightjoin`, `semijoin`, `antijoin`, and `crossjoin` functions (they work as you would expect if you know SQL)
9. If you want to iterate over rows or columns of a data frame, use the `eachrow` and `eachcol` functions (we have not discussed them, but they work exactly as in Julia Base)
10. You can change names of columns in a data frame using `rename` and `rename!` functions; to get names of columns of a data frame use `names` (strings) or `propertynames` (`Symbol`s)
11. To get number of rows and columns of a data frame use `nrow` and `ncol` functions
12. To flatten nested columns of a data frame use `flatten`
13. You can easily allow/disallow missing values in columns of a data frame using `allowmissing`/`allowmissing!`/`disallowmissing`/`disallowmissing!` functions
14. You can drop rows with missing data with `dropmissing`/`dropmissing!` functions
15. You can switch between [long and wide](https://en.wikipedia.org/wiki/Wide_and_narrow_data) representation of a data frame using `stack` and `unstack`
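As a compact recap, the sketch below exercises a handful of the functions from the list on a small hypothetical data frame. The data and the names `totals`, `wet`, `wet2`, `sorted`, and `clean` are illustrative assumptions only.

```julia
using DataFrames

# Hypothetical sample data.
df = DataFrame(city=["Olecko", "Ełk", "Olecko"], rainfall=[2.9, 3.9, 4.1])

# `!` hands back the stored column, `:` makes a copy.
same = df[!, :rainfall] === df.rainfall
copied = df[:, :rainfall] !== df.rainfall

# Add one row in place.
push!(df, (city="Ełk", rainfall=1.2))

# Aggregate per group: total rainfall per city.
totals = combine(groupby(df, :city), :rainfall => sum => :total)

# Two ways to keep rows with more than 3.0 units of rain.
wet = filter(:rainfall => >(3.0), df)
wet2 = subset(df, :rainfall => ByRow(>(3.0)))

# Sort by rainfall, largest first.
sorted = sort(df, :rainfall, rev=true)

# Allow missing values in a column, add an incomplete row, then drop it again.
allowmissing!(df, :rainfall)
push!(df, (city="Olecko", rainfall=missing))
clean = dropmissing(df)

println(totals)
println((nrow(df), ncol(df)))
println(names(df))
println(propertynames(df))
```

Here `filter` takes the predicate applied to each value, while `subset` expects a column-wise function, which is why the predicate is wrapped in `ByRow`.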

Additionally we have covered `freqtable` from FreqTables.jl, `@pipe` from Pipe.jl, and `lm` from GLM.jl packages that are often useful when wrangling data.

You can use many formats to store and read data frames; we have discussed the CSV.jl and Arrow.jl packages that provide such functionality.

Finally, we have shown how to integrate DataFrames.jl with plotting using PyPlot.jl and with physical units using Unitful.jl.

Of course this course was just an introduction.

You can find reviews of functionality of DataFrames.jl in:
* an official manual at https://juliadata.github.io/DataFrames.jl/stable/
* a tutorial going through all functionalities of DataFrames.jl at https://github.com/bkamins/Julia-DataFrames-Tutorial
* documentation strings of the respective functions