# Elegant Multithreading

No one wants slow CSV reading, especially in a performance-intensive task such as machine learning, where both the volume of data and the demand for compute resources are large. Our state-of-the-art data ecosystem takes full advantage of the multithreading functionality recently introduced in Julia.

It does so with careful attention to ease of use and without loss of capability. The CSV.jl package exposes threading through a simple keyword argument; here we compare it against the equivalent Python library, pandas.

In [1]:
]activate .

[32m[1m Activating[22m[39m environment at `~/odsceurpoe/ODSCEurope2020/Project.toml`


In [1]:
using Base.Threads
Threads.nthreads()

1
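The `1` above means this session has a single thread available. Julia fixes the thread count at startup; as far as I know, the usual way to raise it is to set the `JULIA_NUM_THREADS` environment variable before launching Julia. The following is a minimal sketch of what the threads are then used for. The function name `threaded_squares` is made up for illustration and has nothing to do with CSV.jl.

```julia
using Base.Threads

# Square every element in parallel. Each iteration writes to its own slot
# of `out`, so the threads never touch the same memory and no lock is needed.
# `threaded_squares` is a hypothetical helper, used only for illustration.
function threaded_squares(xs::Vector{Int})
    out = similar(xs)
    @threads for i in eachindex(xs)
        out[i] = xs[i]^2
    end
    return out
end

println("threads available: ", nthreads())
println(threaded_squares(collect(1:5)))
```

With a single thread the loop simply runs serially, so the same code works regardless of how Julia was started.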

In [56]:
using CSV, BenchmarkTools
using CSV.DataFrames

## Benchmarking CSV Reading

Real-world datasets often comprise varied forms of data and are very heterogeneous in nature unless special care is taken to homogenise them.

We therefore benchmark reading a common file type, CSV, and compare against the state of the art in Python, i.e. pandas.

Note that this simulates the case of a single machine. A library like Dask can scale across multiple cores, but the scaling observed here carries over with the same efficiency on the Julia side as well, so its benefits are clear.

In [57]:
df = CSV.read("/home/dhairyagandhi96/mlds/csv-benchmarks/mixed.csv")
first(df, 5)

| Row | col1 | col2 | col3 | col4 | col5 |
|---|---|---|---|---|---|
| | Float64⍰ | Float64⍰ | Int64⍰ | Dates…⍰ | String⍰ |
| 1 | 0.164184 | missing | -8618047875453619626 | missing | J7BSebe75ZSElvwjLCmP |
| 2 | missing | missing | -50295243866911532 | missing | missing |
| 3 | 0.742546 | 0.4025 | missing | missing | missing |
| 4 | missing | 0.3716 | 4175438224706417676 | missing | missing |
| 5 | missing | 0.1643 | -7177638387897108480 | 1960-10-17T14:20:12 | missing |


As can be seen, the data is a heavy mix of different data types and contains many `missing` values. This is intentional, to replicate a real-world use case as closely as possible.
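The script that produced `mixed.csv` is not shown here. As a rough sketch of how a file with a similar shape could be generated, the snippet below writes five columns (two floats, an integer, a timestamp, a random string) and leaves each field empty with some probability. The function `write_mixed_csv`, the `pmissing` parameter, and the date range are all assumptions made for illustration.

```julia
using Random, Dates

# Hypothetical generator; NOT the script used to create mixed.csv.
# Writes `nrows` rows of mixed-type data, blanking each field with
# probability `pmissing` so that a CSV reader sees it as a missing value.
function write_mixed_csv(path::AbstractString, nrows::Int;
                         pmissing = 0.5, rng = MersenneTwister(42))
    # Return an empty field with probability `pmissing`, else the value as text.
    maybe(value) = rand(rng) < pmissing ? "" : string(value)
    open(path, "w") do io
        println(io, "col1,col2,col3,col4,col5")
        for _ in 1:nrows
            fields = (
                maybe(rand(rng)),                                   # Float64
                maybe(round(rand(rng); digits = 4)),                # Float64, 4 decimals
                maybe(rand(rng, Int64)),                            # Int64
                maybe(DateTime(1950) + Second(rand(rng, 0:10^9))),  # DateTime
                maybe(randstring(rng, 20)),                         # 20-char String
            )
            println(io, join(fields, ','))
        end
    end
    return path
end

path = write_mixed_csv(tempname(), 10)
foreach(println, Iterators.take(eachline(path), 3))
```

Fixing the seed keeps the generated file reproducible, which matters when the same file is timed under several readers.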

# Benchmarking

## Run 1

This run sets up a baseline. We use the mixed CSV from earlier, reading it with CSV.jl in Julia and with pandas in Python. CSV.jl is first run with threading disabled, while pandas is run as is. We then compare against CSV.jl with threading enabled.
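To give some intuition for why threading can help a CSV reader, here is a toy sketch of the general idea: split the rows into chunks and parse each chunk in its own task. This is not CSV.jl's implementation, which is considerably more involved. The helpers `parse_row` and `parse_rows_threaded` are hypothetical and handle only integer columns.

```julia
using Base.Threads

# Parse one comma-separated line of integers; empty fields become `missing`.
# (Hypothetical helper for illustration only.)
parse_row(line) = [isempty(f) ? missing : parse(Int, f) for f in split(line, ',')]

# Split the rows into roughly one chunk per thread, parse the chunks
# concurrently, then concatenate the results in their original order.
function parse_rows_threaded(lines::Vector{String})
    isempty(lines) && return Vector{Union{Missing, Int}}[]
    chunk_len = cld(length(lines), nthreads())
    tasks = [Threads.@spawn map(parse_row, chunk)
             for chunk in Iterators.partition(lines, chunk_len)]
    return reduce(vcat, fetch.(tasks))
end

rows = parse_rows_threaded(["1,,3", "4,5,", "7,8,9"])
foreach(println, rows)
```

Because each chunk is parsed independently, the work needs no coordination until the final concatenation, which is what makes this kind of workload a good fit for threads.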

In [58]:
run(`python3 -m timeit -s "import pandas" -p "pandas.read_csv('csv-benchmarks/mixed.csv')"`)

10 loops, best of 3: 675 msec per loop


Process(`python3 -m timeit -s 'import pandas' -p "pandas.read_csv('csv-benchmarks/mixed.csv')"`, ProcessExited(0))

In [59]:
@btime CSV.read("/home/dhairyagandhi96/mlds/csv-benchmarks/mixed.csv", threaded = false);

  226.895 ms (158577 allocations: 47.27 MiB)


In [60]:
@btime CSV.read("/home/dhairyagandhi96/mlds/csv-benchmarks/mixed.csv", threaded = true);

  46.207 ms (109924 allocations: 43.29 MiB)


## Run 2

Having set up a baseline with the mixed CSV file, we now read a much larger version of the same file (~1 GB).

In [61]:
;ls -ltra --block=g csv-benchmarks/big_mixed.csv

-rw-rw-r-- 1 dhairyagandhi96 dhairyagandhi96 1G Jan 21 09:58 csv-benchmarks/big_mixed.csv


In [62]:
@btime CSV.read("/home/dhairyagandhi96/mlds/csv-benchmarks/big_mixed.csv", threaded = true);

  346.026 ms (951256 allocations: 430.31 MiB)


In [63]:
run(`python3 -m timeit -s "import pandas" -p "pandas.read_csv('csv-benchmarks/big_mixed.csv')"`)

10 loops, best of 3: 6.33 sec per loop


Process(`python3 -m timeit -s 'import pandas' -p "pandas.read_csv('csv-benchmarks/big_mixed.csv')"`, ProcessExited(0))

# That's a speedup of about **15x** with multithreading and about 3x without on the smaller file, rising to about **18x** with multithreading on the 1 GB file!
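The speedup factors follow directly from the timings reported in the two runs. The short sketch below redoes that arithmetic; the numbers are copied from the outputs above, and `speedup` is just an illustrative helper.

```julia
# Timings copied from the benchmark outputs above, in milliseconds.
pandas_small_ms       = 675.0     # pandas, mixed.csv
csv_serial_small_ms   = 226.895   # CSV.jl, threaded = false, mixed.csv
csv_threaded_small_ms = 46.207    # CSV.jl, threaded = true, mixed.csv
pandas_big_ms         = 6330.0    # pandas, big_mixed.csv
csv_threaded_big_ms   = 346.026   # CSV.jl, threaded = true, big_mixed.csv

# How many times faster the candidate is than the baseline.
speedup(baseline_ms, candidate_ms) = baseline_ms / candidate_ms

println("Run 1, CSV.jl serial vs pandas:     ", round(speedup(pandas_small_ms, csv_serial_small_ms); digits = 1), "x")
println("Run 1, CSV.jl threaded vs pandas:   ", round(speedup(pandas_small_ms, csv_threaded_small_ms); digits = 1), "x")
println("Run 1, CSV.jl threaded vs serial:   ", round(speedup(csv_serial_small_ms, csv_threaded_small_ms); digits = 1), "x")
println("Run 2, CSV.jl threaded vs pandas:   ", round(speedup(pandas_big_ms, csv_threaded_big_ms); digits = 1), "x")
```

These ratios are approximate. `timeit -p` reports process CPU time, while `@btime` reports the minimum wall-clock time, so the two tools are not measuring exactly the same quantity.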