# Working with text files

### Bogumił Kamiński

In this notebook we will show how one can interact with CSV files when working with DataFrames.

In [1]:
using DataFrames

In [2]:
using CSV

In [3]:
using Arrow

In [4]:
using Statistics

First we download the data set we will work with and save it as the file auto.txt in the current working directory.

In [5]:
download("https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data-original",
         "auto.txt")

"auto.txt"

Let us check how the file looks inside using the `readlines` function:

In [6]:
readlines("auto.txt")

406-element Vector{String}:
 "18.0   8.   307.0      130.0      3504.      12.0   70.  1.\t\"chevrolet chevelle malibu\""
 "15.0   8.   350.0      165.0      3693.      11.5   70.  1.\t\"buick skylark 320\""
 "18.0   8.   318.0      150.0      3436.      11.0   70.  1.\t\"plymouth satellite\""
 "16.0   8.   304.0      150.0      3433.      12.0   70.  1.\t\"amc rebel sst\""
 "17.0   8.   302.0      140.0      3449.      10.5   70.  1.\t\"ford torino\""
 "15.0   8.   429.0      198.0      4341.      10.0   70.  1.\t\"ford galaxie 500\""
 "14.0   8.   454.0      220.0      4354.       9.0   70.  1.\t\"chevrolet impala\""
 "14.0   8.   440.0      215.0      4312.       8.5   70.  1.\t\"plymouth fury iii\""
 "14.0   8.   455.0      225.0      4425.      10.0   70.  1.\t\"pontiac catalina\""
 "15.0   8.   390.0      190.0      3850.       8.5   70.  1.\t\"amc ambassador dpl\""
 "NA     4.   133.0      115.0      3090.      17.5   70.  2.\t\"citroen ds-21 pallas\""
 "NA     8.   350.0      1

For this exercise we have chosen a typical file, which in practice means that things are not trivial.
* first, we note that it has no header with column names
* second, we see that the last column is tab-separated, while the earlier columns are separated by a varying number of spaces
* finally, we see that missing values are encoded as "NA" in this file

We will show several ways to parse the file into a `DataFrame`.
The first one is to replace tabs with spaces in the contents of the file and then load the result using the `CSV.File` function.

We start by getting the contents of the file into a single string.

In [7]:
raw_str = read("auto.txt", String)

"18.0   8.   307.0      130.0      3504.      12.0   70.  1.\t\"chevrolet chevelle malibu\"\n15.0   8.   350.0      165.0      3693.      11.5   70.  1.\t\"buick skylark 320\"\n18.0   8.   318.0      150.0      3436.      11.0   70.  1.\t\"plymouth satellite\"\n16.0   8.   304.0      150.0      3433.      12.0   70.  1.\t\"amc rebel sst\"\n17.0   8.   302.0      140.0      3449.      10.5   70.  1.\t\"ford torino\"\n15.0   8.   429.0      198.0      4341.      10.0   70.  1.\t\"ford galaxie 500\"\n14.0   8.   454.0      220.0      4354.       9.0   70.  1.\t\"chevrolet impala\"\n14.0   8.   440.0      215.0      4312.       8.5   70.  1.\t\"plymouth fury iii\"\n14.0   8.   455.0      225.0      4425.      10.0   70.  1.\t\"pontiac catalina\"\n15.0   8.   390.0      190.0      3850.       8.5   70.  1.\t\"amc ambassador dpl\"\nNA     4.   133.0      115.0      3090.      17.5   70.  2.\t\"citroen ds-21 pallas\"\nNA     8.   350.0      165.0      4142.      11.5   70.  1.\t\"chevrolet ch

Now we replace all tabs in this string with spaces. Note that in general this is not a safe operation: a string column could contain quoted tabs, and they would be replaced as well. Fortunately, this file has none, so we are safe.

In [8]:
str_no_tab = replace(raw_str, '\t'=>' ')

"18.0   8.   307.0      130.0      3504.      12.0   70.  1. \"chevrolet chevelle malibu\"\n15.0   8.   350.0      165.0      3693.      11.5   70.  1. \"buick skylark 320\"\n18.0   8.   318.0      150.0      3436.      11.0   70.  1. \"plymouth satellite\"\n16.0   8.   304.0      150.0      3433.      12.0   70.  1. \"amc rebel sst\"\n17.0   8.   302.0      140.0      3449.      10.5   70.  1. \"ford torino\"\n15.0   8.   429.0      198.0      4341.      10.0   70.  1. \"ford galaxie 500\"\n14.0   8.   454.0      220.0      4354.       9.0   70.  1. \"chevrolet impala\"\n14.0   8.   440.0      215.0      4312.       8.5   70.  1. \"plymouth fury iii\"\n14.0   8.   455.0      225.0      4425.      10.0   70.  1. \"pontiac catalina\"\n15.0   8.   390.0      190.0      3850.       8.5   70.  1. \"amc ambassador dpl\"\nNA     4.   133.0      115.0      3090.      17.5   70.  2. \"citroen ds-21 pallas\"\nNA     8.   350.0      165.0      4142.      11.5   70.  1. \"chevrolet chevelle conco
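The caveat about quoted tabs can be illustrated with a minimal sketch. The record below is hypothetical (it does not come from auto.txt), and the `count=1` variant assumes that the first tab on a line is always the column separator.

```julia
# A hypothetical record whose quoted name field itself contains a tab.
line = "18.0   8.\t\"ford\ttorino\""

# Blind replacement also rewrites the tab inside the quotes,
# so the name is silently changed.
blind = replace(line, '\t' => ' ')

# Replacing only the first tab keeps the quoted content intact
# (assumption: the first tab is the separator before the name).
safer = replace(line, '\t' => ' '; count=1)
```

Here `blind` no longer contains any tab, while `safer` still holds the tab between `ford` and `torino`.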

Finally we create an `IOBuffer` backed by the string we have just created.

In [9]:
io = IOBuffer(str_no_tab)

IOBuffer(data=UInt8[...], readable=true, writable=false, seekable=true, append=false, size=32149, maxsize=Inf, ptr=1, mark=-1)
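To see what an `IOBuffer` backed by a string offers, here is a minimal sketch that uses only Base functions. The name `io_demo` and its contents are made up for illustration.

```julia
# An in-memory stream over a short, made-up string.
io_demo = IOBuffer("first line\nsecond line\n")

# The usual stream functions work just as they would on an open file.
first_line = readline(io_demo)    # reads up to the first newline
rest = read(io_demo, String)      # reads everything that is left
at_end = eof(io_demo)             # true once the stream is exhausted
```

This is why functions that accept a file, such as `CSV.File`, can typically accept such a buffer as well.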

You can think of the variable `io` as an in-memory I/O stream. Therefore we can pass this stream to the `CSV.File` function to read it as if it were a CSV file. Note that in the options we specify that:
* the delimiter is space
* we ignore repeated (consecutive) occurrences of the delimiter (so we correctly handle our file, which has columns padded by spaces)
* we explicitly pass column names via the `header` keyword argument
* missing values are represented by the `"NA"` string in our file

Finally, note that we pass the result of the `CSV.File` call to the `DataFrame` constructor using `|>`.

In [10]:
df1 = CSV.File(io,
               delim=' ',
               ignorerepeated=true,
               header=[:mpg, :cylinders, :displacement, :horsepower,
                       :weight, :acceleration, :year, :origin, :name],
               missingstring="NA") |>
      DataFrame

Row,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin
,Float64?,Float64,Float64,Float64?,Float64,Float64,Float64,Float64
1,18.0,8.0,307.0,130.0,3504.0,12.0,70.0,1.0
2,15.0,8.0,350.0,165.0,3693.0,11.5,70.0,1.0
3,18.0,8.0,318.0,150.0,3436.0,11.0,70.0,1.0
4,16.0,8.0,304.0,150.0,3433.0,12.0,70.0,1.0
5,17.0,8.0,302.0,140.0,3449.0,10.5,70.0,1.0
6,15.0,8.0,429.0,198.0,4341.0,10.0,70.0,1.0
7,14.0,8.0,454.0,220.0,4354.0,9.0,70.0,1.0
8,14.0,8.0,440.0,215.0,4312.0,8.5,70.0,1.0
9,14.0,8.0,455.0,225.0,4425.0,10.0,70.0,1.0
10,15.0,8.0,390.0,190.0,3850.0,8.5,70.0,1.0


Note that not all columns of the data frame have been displayed (they do not fit on my screen) and that we see 30 rows.

It is easy to change the default maximum width and height of the output by setting appropriate values in the `ENV` dictionary.

In [11]:
ENV["COLUMNS"], ENV["LINES"] = 200, 15

(200, 15)

In [12]:
df1

Row,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
,Float64?,Float64,Float64,Float64?,Float64,Float64,Float64,Float64,String
1,18.0,8.0,307.0,130.0,3504.0,12.0,70.0,1.0,chevrolet chevelle malibu
2,15.0,8.0,350.0,165.0,3693.0,11.5,70.0,1.0,buick skylark 320
3,18.0,8.0,318.0,150.0,3436.0,11.0,70.0,1.0,plymouth satellite
4,16.0,8.0,304.0,150.0,3433.0,12.0,70.0,1.0,amc rebel sst
5,17.0,8.0,302.0,140.0,3449.0,10.5,70.0,1.0,ford torino
6,15.0,8.0,429.0,198.0,4341.0,10.0,70.0,1.0,ford galaxie 500
7,14.0,8.0,454.0,220.0,4354.0,9.0,70.0,1.0,chevrolet impala
8,14.0,8.0,440.0,215.0,4312.0,8.5,70.0,1.0,plymouth fury iii
9,14.0,8.0,455.0,225.0,4425.0,10.0,70.0,1.0,pontiac catalina
10,15.0,8.0,390.0,190.0,3850.0,8.5,70.0,1.0,amc ambassador dpl


Now the display is much nicer.

Let us discuss an alternative way to read in the original file.
This time we will first read the data in directly from the file.

In [13]:
df_raw = CSV.File("auto.txt", header=[:metrics, :name]) |> DataFrame

Row,metrics,name
,String,String
1,18.0 8. 307.0 130.0 3504. 12.0 70. 1.,chevrolet chevelle malibu
2,15.0 8. 350.0 165.0 3693. 11.5 70. 1.,buick skylark 320
3,18.0 8. 318.0 150.0 3436. 11.0 70. 1.,plymouth satellite
4,16.0 8. 304.0 150.0 3433. 12.0 70. 1.,amc rebel sst
5,17.0 8. 302.0 140.0 3449. 10.5 70. 1.,ford torino
6,15.0 8. 429.0 198.0 4341. 10.0 70. 1.,ford galaxie 500
7,14.0 8. 454.0 220.0 4354. 9.0 70. 1.,chevrolet impala
8,14.0 8. 440.0 215.0 4312. 8.5 70. 1.,plymouth fury iii
9,14.0 8. 455.0 225.0 4425. 10.0 70. 1.,pontiac catalina
10,15.0 8. 390.0 190.0 3850. 8.5 70. 1.,amc ambassador dpl


Note that this time CSV.jl auto-detected that tab is the right delimiter to split the columns (it was the only delimiter that produced a consistent number of columns).
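If you prefer not to rely on auto-detection, the delimiter can be passed explicitly with the `delim` keyword argument. The sketch below uses a two-line sample string (an assumption made to keep the example self-contained) instead of the auto.txt file.

```julia
using CSV, DataFrames

# Two made-up lines that mimic the layout of auto.txt.
sample = "18.0   8.   307.0\t\"chevrolet chevelle malibu\"\n" *
         "15.0   8.   350.0\t\"buick skylark 320\"\n"

# State the delimiter explicitly instead of letting CSV.jl guess it.
df_sample = CSV.File(IOBuffer(sample);
                     delim='\t',
                     header=[:metrics, :name]) |> DataFrame
```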

We will now split the `:metrics` column manually.

In [14]:
str_metrics = split.(df_raw.metrics)

406-element Vector{Vector{SubString{String}}}:
 ["18.0", "8.", "307.0", "130.0", "3504.", "12.0", "70.", "1."]
 ["15.0", "8.", "350.0", "165.0", "3693.", "11.5", "70.", "1."]
 ["18.0", "8.", "318.0", "150.0", "3436.", "11.0", "70.", "1."]
 ["16.0", "8.", "304.0", "150.0", "3433.", "12.0", "70.", "1."]
 ["17.0", "8.", "302.0", "140.0", "3449.", "10.5", "70.", "1."]
 ⋮
 ["27.0", "4.", "140.0", "86.00", "2790.", "15.6", "82.", "1."]
 ["44.0", "4.", "97.00", "52.00", "2130.", "24.6", "82.", "2."]
 ["32.0", "4.", "135.0", "84.00", "2295.", "11.6", "82.", "1."]
 ["28.0", "4.", "120.0", "79.00", "2625.", "18.6", "82.", "1."]
 ["31.0", "4.", "119.0", "82.00", "2720.", "19.4", "82.", "1."]

Now let us create an empty `df1_2` data frame that we will populate with appropriate columns.
The pattern we use here is typical when you e.g. perform repeated computations whose results you want to store in a `DataFrame`.

In [15]:
df1_2 = DataFrame([col => Float64[] for
                  col in [:mpg, :cylinders, :displacement, :horsepower, :weight, :acceleration, :year, :origin]])

Row,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin
,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64


Now we have a data frame that has 8 columns and 0 rows. It accepts floating point values. However, in columns `:mpg` and `:horsepower` we have to allow the data frame to hold missing values. We do it using the `allowmissing!` function.

In [16]:
allowmissing!(df1_2, [:mpg, :horsepower])

Row,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin
,Float64?,Float64,Float64,Float64?,Float64,Float64,Float64,Float64


Note that the element type of columns `:mpg` and `:horsepower` changed to `Float64?`, which signals that these columns allow missing values.

Now we are ready to populate our data frame.

In [17]:
for row in str_metrics
    push!(df1_2, [v == "NA" ? missing : parse(Float64, v) for v in row])
end

In [18]:
df1_2

Row,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin
,Float64?,Float64,Float64,Float64?,Float64,Float64,Float64,Float64
1,18.0,8.0,307.0,130.0,3504.0,12.0,70.0,1.0
2,15.0,8.0,350.0,165.0,3693.0,11.5,70.0,1.0
3,18.0,8.0,318.0,150.0,3436.0,11.0,70.0,1.0
4,16.0,8.0,304.0,150.0,3433.0,12.0,70.0,1.0
5,17.0,8.0,302.0,140.0,3449.0,10.5,70.0,1.0
6,15.0,8.0,429.0,198.0,4341.0,10.0,70.0,1.0
7,14.0,8.0,454.0,220.0,4354.0,9.0,70.0,1.0
8,14.0,8.0,440.0,215.0,4312.0,8.5,70.0,1.0
9,14.0,8.0,455.0,225.0,4425.0,10.0,70.0,1.0
10,15.0,8.0,390.0,190.0,3850.0,8.5,70.0,1.0


Finally, let us add the `:name` column from the `df_raw` data frame.

In [19]:
df1_2.name = df_raw.name

406-element Vector{String}:
 "chevrolet chevelle malibu"
 "buick skylark 320"
 "plymouth satellite"
 "amc rebel sst"
 "ford torino"
 ⋮
 "ford mustang gl"
 "vw pickup"
 "dodge rampage"
 "ford ranger"
 "chevy s-10"

In [20]:
df1_2

Row,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
,Float64?,Float64,Float64,Float64?,Float64,Float64,Float64,Float64,String
1,18.0,8.0,307.0,130.0,3504.0,12.0,70.0,1.0,chevrolet chevelle malibu
2,15.0,8.0,350.0,165.0,3693.0,11.5,70.0,1.0,buick skylark 320
3,18.0,8.0,318.0,150.0,3436.0,11.0,70.0,1.0,plymouth satellite
4,16.0,8.0,304.0,150.0,3433.0,12.0,70.0,1.0,amc rebel sst
5,17.0,8.0,302.0,140.0,3449.0,10.5,70.0,1.0,ford torino
6,15.0,8.0,429.0,198.0,4341.0,10.0,70.0,1.0,ford galaxie 500
7,14.0,8.0,454.0,220.0,4354.0,9.0,70.0,1.0,chevrolet impala
8,14.0,8.0,440.0,215.0,4312.0,8.5,70.0,1.0,plymouth fury iii
9,14.0,8.0,455.0,225.0,4425.0,10.0,70.0,1.0,pontiac catalina
10,15.0,8.0,390.0,190.0,3850.0,8.5,70.0,1.0,amc ambassador dpl


Before we move forward we should stress one very important thing related to the `df1_2.name = df_raw.name` assignment. After this operation the `:name` columns in `df1_2` and `df_raw` are the same object.
We can easily check it:

In [21]:
df1_2.name === df_raw.name

true

Such behavior is allowed for performance reasons. In this case we accepted it, as we want to discard the `df_raw` data frame and not use it in later analysis.
However, in general it would be safer to create a copy of the `:name` column when assigning it to the `df1_2` data frame.

It can be achieved like this:
```
df1_2[:, :name] = df_raw.name
```
or like this
```
df1_2.name = df_raw[:, :name]
```
We could also write:
```
df1_2[:, :name] = df_raw[:, :name]
```
but then two copies would be made, one when reading the data from `df_raw` and another when writing it to `df1_2`, and one of them is unnecessary.
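The practical difference between sharing and copying a column shows up as soon as one of the data frames is mutated. The following is a minimal sketch on made-up data; the names `src`, `aliased` and `copied` are illustrative only.

```julia
using DataFrames

src = DataFrame(name = ["a", "b", "c"])

aliased = DataFrame(id = 1:3)
aliased.name = src.name         # no copy: both data frames share one vector

copied = DataFrame(id = 1:3)
copied.name = src[:, :name]     # src[:, :name] allocates a fresh vector

# Mutate the source column in place.
src.name[1] = "changed"

shared_value = aliased.name[1]  # follows the change made through src
copied_value = copied.name[1]   # keeps the old value
```

After the mutation `shared_value` is `"changed"`, while `copied_value` is still `"a"`.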

We can check that the `df1` and `df1_2` data frames are equal using the `isequal` function, which confirms that we ended up with identical data frames.

In [22]:
isequal(df1_2, df1)

true

Note that, as the data frames contain missing values, comparing them with `==` would produce `missing`.

In [23]:
df1_2 == df1

missing
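The same three-valued logic can be observed on plain vectors, which may make the behavior easier to remember. This is a small sketch using only Base functions on made-up values.

```julia
a = [1.0, missing]
b = [1.0, missing]

# `==` cannot decide whether the two missing entries are equal.
eq_result = a == b

# `isequal` treats `missing` as equal to `missing`.
iseq_result = isequal(a, b)

# A definite mismatch in another position still gives `false`.
mismatch = [1.0, missing] == [2.0, missing]
```

Here `eq_result` is `missing`, `iseq_result` is `true`, and `mismatch` is `false`.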

Let us investigate yet another way to create our data frame, this time in one shot. This line is more difficult to understand, but it shows the power that DataFrames.jl gives you when working with data. In part 4 of the tutorial we will give more examples of supported transformations:

In [24]:
df1_3 = select(df_raw,
               :metrics =>
               ByRow(x -> something.(tryparse.(Float64, split(x)), missing)) =>
               [:mpg, :cylinders, :displacement, :horsepower, :weight, :acceleration, :year, :origin],
               :name)

Row,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
,Float64?,Float64,Float64,Float64?,Float64,Float64,Float64,Float64,String
1,18.0,8.0,307.0,130.0,3504.0,12.0,70.0,1.0,chevrolet chevelle malibu
2,15.0,8.0,350.0,165.0,3693.0,11.5,70.0,1.0,buick skylark 320
3,18.0,8.0,318.0,150.0,3436.0,11.0,70.0,1.0,plymouth satellite
4,16.0,8.0,304.0,150.0,3433.0,12.0,70.0,1.0,amc rebel sst
5,17.0,8.0,302.0,140.0,3449.0,10.5,70.0,1.0,ford torino
6,15.0,8.0,429.0,198.0,4341.0,10.0,70.0,1.0,ford galaxie 500
7,14.0,8.0,454.0,220.0,4354.0,9.0,70.0,1.0,chevrolet impala
8,14.0,8.0,440.0,215.0,4312.0,8.5,70.0,1.0,plymouth fury iii
9,14.0,8.0,455.0,225.0,4425.0,10.0,70.0,1.0,pontiac catalina
10,15.0,8.0,390.0,190.0,3850.0,8.5,70.0,1.0,amc ambassador dpl


Let us understand what is going on in the above expression.

The general rule for specifying transformations in `select` (and also in `transform` and `combine`) is:

`source_columns => transformation => destination_columns`

In this case we take the `:metrics` column from the source data frame. We process it element by element,
which is signaled by `ByRow` (you can think of this as broadcasting).
Each element is first `split` into pieces, and then we try parsing every piece as a `Float64`.
If parsing fails, `tryparse` returns `nothing`, which we convert to `missing` using the `something` function.
As a result we obtain, for each row, a vector of `Float64` or `Missing` values.
Since the last element of our transformation is a vector of column names, this vector is expanded into
that many columns.

Finally, we add the `:name` column from the source data frame.
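The function wrapped in `ByRow` can be tried on a single string to see each step in isolation. The input below is a shortened, made-up version of one `:metrics` entry.

```julia
# One shortened metrics string with a missing first value.
pieces = split("NA     4.   133.0")

# tryparse returns `nothing` for pieces that are not valid numbers.
parsed = tryparse.(Float64, pieces)

# something replaces each `nothing` with the fallback value `missing`.
cleaned = something.(parsed, missing)
```

The vector `cleaned` holds `missing`, `4.0` and `133.0`, which is the kind of vector that `select` then spreads over the destination columns.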

As above we check that the result is the same as `df1`.

In [25]:
isequal(df1_3, df1)

true

We can easily count the number of missing values in the `df1` data frame using the `eachcol` function, which returns an iterator over the columns of the data frame:

In [26]:
sum(count(ismissing, col) for col in eachcol(df1))

14

We could alternatively transform our data frame to a `Matrix` and use `count` on it (it would be slower, though, as unnecessary copies of the data would be made):

In [27]:
count(ismissing, Matrix(df1))

14

Or, if you like iterators:

In [28]:
count(ismissing, Iterators.flatten(eachcol(df1)))

14

Alternatively, we could use the `mapcols` function to get the number of missing values per column:

In [29]:
mapcols(x -> count(ismissing, x), df1)

Row,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,8,0,0,6,0,0,0,0,0


It is also easy to find the rows containing missing values using the `filter` function:

In [30]:
filter(row -> any(ismissing, row), df1)

Row,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
,Float64?,Float64,Float64,Float64?,Float64,Float64,Float64,Float64,String
1,missing,4.0,133.0,115.0,3090.0,17.5,70.0,2.0,citroen ds-21 pallas
2,missing,8.0,350.0,165.0,4142.0,11.5,70.0,1.0,chevrolet chevelle concours (sw)
3,missing,8.0,351.0,153.0,4034.0,11.0,70.0,1.0,ford torino (sw)
4,missing,8.0,383.0,175.0,4166.0,10.5,70.0,1.0,plymouth satellite (sw)
5,missing,8.0,360.0,175.0,3850.0,11.0,70.0,1.0,amc rebel sst (sw)
6,missing,8.0,302.0,140.0,3353.0,8.0,70.0,1.0,ford mustang boss 302
7,25.0,4.0,98.0,missing,2046.0,19.0,71.0,1.0,ford pinto
8,missing,4.0,97.0,48.0,1978.0,20.0,71.0,2.0,volkswagen super beetle 117
9,21.0,6.0,200.0,missing,2875.0,17.0,74.0,1.0,ford maverick
10,40.9,4.0,85.0,missing,1835.0,17.3,80.0,2.0,renault lecar deluxe


Assume we are interested in the brand of each car. We can extract it from the `:name` column using broadcasting like this:

In [31]:
df1.brand = first.(split.(df1.name))

406-element Vector{SubString{String}}:
 "chevrolet"
 "buick"
 "plymouth"
 "amc"
 "ford"
 ⋮
 "ford"
 "vw"
 "dodge"
 "ford"
 "chevy"

In [32]:
df1

Row,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name,brand
,Float64?,Float64,Float64,Float64?,Float64,Float64,Float64,Float64,String,SubStri…
1,18.0,8.0,307.0,130.0,3504.0,12.0,70.0,1.0,chevrolet chevelle malibu,chevrolet
2,15.0,8.0,350.0,165.0,3693.0,11.5,70.0,1.0,buick skylark 320,buick
3,18.0,8.0,318.0,150.0,3436.0,11.0,70.0,1.0,plymouth satellite,plymouth
4,16.0,8.0,304.0,150.0,3433.0,12.0,70.0,1.0,amc rebel sst,amc
5,17.0,8.0,302.0,140.0,3449.0,10.5,70.0,1.0,ford torino,ford
6,15.0,8.0,429.0,198.0,4341.0,10.0,70.0,1.0,ford galaxie 500,ford
7,14.0,8.0,454.0,220.0,4354.0,9.0,70.0,1.0,chevrolet impala,chevrolet
8,14.0,8.0,440.0,215.0,4312.0,8.5,70.0,1.0,plymouth fury iii,plymouth
9,14.0,8.0,455.0,225.0,4425.0,10.0,70.0,1.0,pontiac catalina,pontiac
10,15.0,8.0,390.0,190.0,3850.0,8.5,70.0,1.0,amc ambassador dpl,amc


Earlier we showed how one can manually find rows that have missing values. A common operation is the reverse, i.e. selecting only the rows that do not contain missing values. This can be achieved like this:

In [33]:
df2 = dropmissing(df1)

Row,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name,brand
,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,String,SubStri…
1,18.0,8.0,307.0,130.0,3504.0,12.0,70.0,1.0,chevrolet chevelle malibu,chevrolet
2,15.0,8.0,350.0,165.0,3693.0,11.5,70.0,1.0,buick skylark 320,buick
3,18.0,8.0,318.0,150.0,3436.0,11.0,70.0,1.0,plymouth satellite,plymouth
4,16.0,8.0,304.0,150.0,3433.0,12.0,70.0,1.0,amc rebel sst,amc
5,17.0,8.0,302.0,140.0,3449.0,10.5,70.0,1.0,ford torino,ford
6,15.0,8.0,429.0,198.0,4341.0,10.0,70.0,1.0,ford galaxie 500,ford
7,14.0,8.0,454.0,220.0,4354.0,9.0,70.0,1.0,chevrolet impala,chevrolet
8,14.0,8.0,440.0,215.0,4312.0,8.5,70.0,1.0,plymouth fury iii,plymouth
9,14.0,8.0,455.0,225.0,4425.0,10.0,70.0,1.0,pontiac catalina,pontiac
10,15.0,8.0,390.0,190.0,3850.0,8.5,70.0,1.0,amc ambassador dpl,amc


Now let us find all rows that correspond to the `"saab"` brand. You can do it in two ways, either by indexing or by using the `filter` function:

In [34]:
df2[df2.brand .== "saab", :]

Row,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name,brand
,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,String,SubStri…
1,25.0,4.0,104.0,95.0,2375.0,17.5,70.0,2.0,saab 99e,saab
2,24.0,4.0,121.0,110.0,2660.0,14.0,73.0,2.0,saab 99le,saab
3,25.0,4.0,121.0,115.0,2671.0,13.5,75.0,2.0,saab 99le,saab
4,21.6,4.0,121.0,115.0,2795.0,15.7,78.0,2.0,saab 99gle,saab


In [35]:
filter(:brand => ==("saab"), df2)

Row,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name,brand
,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,String,SubStri…
1,25.0,4.0,104.0,95.0,2375.0,17.5,70.0,2.0,saab 99e,saab
2,24.0,4.0,121.0,110.0,2660.0,14.0,73.0,2.0,saab 99le,saab
3,25.0,4.0,121.0,115.0,2671.0,13.5,75.0,2.0,saab 99le,saab
4,21.6,4.0,121.0,115.0,2795.0,15.7,78.0,2.0,saab 99gle,saab


Note that the `:brand => ==("saab")` syntax means that we take each element of the `:brand` column and pass it to the `==("saab")` function.

Now `==("saab")` is just a shorthand for `x -> x == "saab"`.
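The one-argument form of `==` can be tried outside of DataFrames.jl as well. Below is a minimal sketch on a plain vector; the names `is_saab` and `brands` are made up for illustration.

```julia
# `==("saab")` returns a function that compares its argument with "saab".
is_saab = ==("saab")

hit = is_saab("saab")
miss = is_saab("ford")

# Both calls below select the same elements.
brands = ["saab", "ford", "saab"]
short_form = filter(is_saab, brands)
long_form = filter(x -> x == "saab", brands)
```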

Alternatively we could do the filtering operation in the following way (this is a bit slower but you might find it more readable):

In [36]:
filter(row -> row.brand == "saab", df2)

Row,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name,brand
,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,String,SubStri…
1,25.0,4.0,104.0,95.0,2375.0,17.5,70.0,2.0,saab 99e,saab
2,24.0,4.0,121.0,110.0,2660.0,14.0,73.0,2.0,saab 99le,saab
3,25.0,4.0,121.0,115.0,2671.0,13.5,75.0,2.0,saab 99le,saab
4,21.6,4.0,121.0,115.0,2795.0,15.7,78.0,2.0,saab 99gle,saab


To finish this part of the tutorial let us save the `df2` data frame to the auto2.csv and auto2.arrow files. We will use them in the next parts of the course.

In [37]:
CSV.write("auto2.csv", df2)

"auto2.csv"

In [38]:
Arrow.write("auto2.arrow", df2)

"auto2.arrow"

Let us just quickly inspect what we have written to disk in auto2.csv (auto2.arrow file is binary) before we finish:

In [39]:
readlines("auto2.csv")

393-element Vector{String}:
 "mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name,brand"
 "18.0,8.0,307.0,130.0,3504.0,12.0,70.0,1.0,chevrolet chevelle malibu,chevrolet"
 "15.0,8.0,350.0,165.0,3693.0,11.5,70.0,1.0,buick skylark 320,buick"
 "18.0,8.0,318.0,150.0,3436.0,11.0,70.0,1.0,plymouth satellite,plymouth"
 "16.0,8.0,304.0,150.0,3433.0,12.0,70.0,1.0,amc rebel sst,amc"
 ⋮
 "27.0,4.0,140.0,86.0,2790.0,15.6,82.0,1.0,ford mustang gl,ford"
 "44.0,4.0,97.0,52.0,2130.0,24.6,82.0,2.0,vw pickup,vw"
 "32.0,4.0,135.0,84.0,2295.0,11.6,82.0,1.0,dodge rampage,dodge"
 "28.0,4.0,120.0,79.0,2625.0,18.6,82.0,1.0,ford ranger,ford"
 "31.0,4.0,119.0,82.0,2720.0,19.4,82.0,1.0,chevy s-10,chevy"

Note that by default CSV.jl used a comma to separate the fields in our file and wrote a header in the first row of the file.
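The default separator can be overridden with the `delim` keyword argument of `CSV.write`. The sketch below writes a small made-up data frame to a temporary file (the file name `demo.tsv` is an assumption) and reads it back.

```julia
using CSV, DataFrames

df_demo = DataFrame(mpg = [18.0, 15.0],
                    name = ["amc rebel sst", "ford torino"])

# Write to a temporary location, using a tab instead of the default comma.
path = joinpath(mktempdir(), "demo.tsv")
CSV.write(path, df_demo; delim='\t')

# The first line is still the header, now tab-separated.
lines = readlines(path)

# Read the file back, stating the same delimiter explicitly.
df_back = CSV.read(path, DataFrame; delim='\t')
```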