In a [recent MotherDuck blog post](https://motherduck.com/blog/python-faker-duckdb-exploration/), 
the author generated 1 billion fake people records using Python in order to analyze the data 
with DuckDB. I suspect the point of the article was to showcase how awesome `duckdb` is at 
handling large amounts of local data, but it did spend the majority of its time explaining the 
data generation process, which made for a fun read.

One of the most interesting tidbits from the article to me was:

> I used the GNU Parallel technique discussed above with a hefty m6i.32xlarge instance on 
Amazon EC2, though generated a billion people in 1k parquet files. This took about 2 hours to 
generate.

Yikes, that's a lot of firepower! That machine 
[comes with 128 vCPUs and 512 GiB RAM](https://instances.vantage.sh/aws/ec2/m6i.32xlarge), and 
costs about $6 an hour. So pretty hefty indeed.

Being a big fan of Julia, I decided to see what it would be like to generate such a dataset 
with Julia. More concretely, I wanted to see if I could use fewer resources (only my laptop) 
*and* have the process run in significantly less time.


The results were initially disappointing, as I'll explain below, but in the end I did get that 
nice speed-up I was looking for.


## The Details

Refer to the original post for the full details, but I'll cover the basics here. A 
`person` record consists of the following randomly generated fields:
```
- id
- first_name
- last_name
- email
- company
- phone
```

Using the Python [Faker](https://faker.readthedocs.io/en/master/) library to generate the 
data and [GNU Parallel](https://www.gnu.org/software/parallel/) to parallelize the operation, 
the author created 1,000 parquet files with 1 million records each before populating a 
`duckdb` database for further analysis.

In this post, we'll explore Julia's own 
[Faker.jl](https://github.com/neomatrixcode/Faker.jl) package, and how to leverage the various 
built-in capabilities Julia has for concurrency and parallelism.


## Julia: First Attempt

As mentioned, Julia has its own Faker library. Using it is as simple as:

```julia
using Faker

Faker.first_name()
```

```
"Wilfredo"
```

Rather than putting all the fields in a dictionary, I created a struct:

```julia
struct Person
    id::String
    first_name::String
    last_name::String
    email::String
    company::String
    phone::String
end
```

Aside from being a natural thing to do in Julia, this ended up being a really handy vehicle 
for populating a DataFrame, as we'll see in a moment.

In order to construct the `Person` object, we have the following function, which is essentially 
the same as in the Python version in the original post:

```julia
function get_person()
    person = Person(
        Faker.random_int(min=1000, max=9999999999999),
        Faker.first_name(),
        Faker.last_name(),
        Faker.email(),
        Faker.company(),
        Faker.phone_number()
    )
    return person
end

get_person()
```

```
Person("8429894898777", "Christin", "Gleason", "Archie99@yahoo.com", "Mante, Hilll and Hessel", "1-548-869-5799 x26945")
```


This approach clearly suffers from the same deficiency as the original in that the generated 
email address bears absolutely no resemblance to the generated first and last names 😂. But 
that's ok, we're just making up data for mocking and testing purposes anyhow.

To create an array of `Person`s, we can use a comprehension:

```julia
list_of_five = [get_person() for _ in 1:5]
```

```
5-element Vector{Person}:
 Person("502327436522", "Simon", "Lind", "Franklyn.Satterfield@yahoo.com", "Rutherford-Barton", "054.718.0236")
 Person("1988647737198", "Charlott", "Jacobs", "Walter.Ziemann@hotmail.com", "Towne, Gorczany and Brekke", "839-605-0245 x477")
 Person("3335059941285", "Glory", "Brakus", "Nienow.Cassandra@lh.net", "Schuppe, Powlowski and Powlowski", "(122) 872-3081 x3643")
 Person("4996530776723", "Hedwig", "Pfannerstill", "wSchamberger@hg.net", "Langosh Group", "(594) 274-0196 x72486")
 Person("2217875886672", "Coletta", "Effertz", "Whitley.Bechtelar@mz.org", "Rippin Inc", "991.601.1323")
```

Notice how we get a `Vector` of `Person`s. This is where the struct starts to pay off: 
placing that vector in a `DataFrame` constructor creates a dataframe object for us without any 
hassle at all:

```julia
using DataFrames

df = DataFrame(list_of_five)
```

```
5×6 DataFrame
 Row │ id             first_name  last_name     email                           company                           phone                 
     │ String         String      String        String                          String                            String                
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 502327436522   Simon       Lind          Franklyn.Satterfield@yahoo.com  Rutherford-Barton                 054.718.0236
   2 │ 1988647737198  Charlott    Jacobs        Walter.Ziemann@hotmail.com      Towne, Gorczany and Brekke        839-605-0245 x477
   3 │ 3335059941285  Glory       Brakus        Nienow.Cassandra@lh.net         Schuppe, Powlowski and Powlowski  (122) 872-3081 x3643
   4 │ 4996530776723  Hedwig      Pfannerstill  wSchamberger@hg.net             Langosh Group                     (594) 274-0196 x72486
   5 │ 2217875886672  Coletta     Effertz       Whitley.Bechtelar@mz.org        Rippin Inc                        991.601.1323
```

That's pretty neat!
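
One way to picture what happens to a vector of structs here is that each field name becomes a 
column and each struct contributes one row. The sketch below mimics that idea using only Base 
Julia. It is an illustration of the concept, not how DataFrames actually implements it, and 
`MiniPerson` is a trimmed-down, hypothetical stand-in for `Person`:

```julia
# A trimmed-down, hypothetical record type standing in for `Person`.
struct MiniPerson
    id::String
    first_name::String
end

people = [MiniPerson("1", "Simon"), MiniPerson("2", "Glory")]

# Build one vector per field, keyed by the field name.
fields = fieldnames(MiniPerson)
columns = NamedTuple{fields}(Tuple([getfield(p, f) for p in people] for f in fields))

println(keys(columns))        # prints (:id, :first_name)
println(columns.first_name)   # prints ["Simon", "Glory"]
```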

Anyhow, with our basic functionality all set up, it's time to do some light benchmarking to 
get a sense of how this code will perform. I started off small by generating only 100,000 
records:

```julia
@time [get_person() for _ in 1:100_000] |> DataFrame;
```

```
 67.001679 seconds (51.67 M allocations: 2.484 GiB, 1.74% gc time, 0.19% compilation time: 26% of which was recompilation)
```

Oof, that result is not very comforting: taking over a minute just for 100,000 records does 
not bode well. Assuming linear scaling, it would take 67 * 10_000 seconds, or roughly 186 hours, 
to run the full thing 😰.
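
The linear-scaling estimate can be spelled out as a small calculation. This is just a sketch 
of the arithmetic; `estimate_hours` is a hypothetical helper name, and the 67 seconds is the 
rounded timing from the benchmark above:

```julia
# Extrapolate a measured timing to a larger workload, assuming linear scaling.
# `estimate_hours` is a hypothetical helper, not something from the original post.
function estimate_hours(measured_seconds, measured_records, target_records)
    seconds_per_record = measured_seconds / measured_records
    return seconds_per_record * target_records / 3600
end

# 67 seconds for 100,000 records, scaled up to 1 billion records
single_threaded = estimate_hours(67.0, 100_000, 1_000_000_000)
println(round(single_threaded, digits=1), " hours")   # prints 186.1 hours
```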

At this point, I'm thinking we can speed things up a bit by using multi-threading. But figuring 
out the right syntax for creating an array and then populating it with data using threading 
appeared a bit clunky. Luckily there exists the ThreadsX.jl package that allows us to use 
comprehensions for such things, specifically by using `ThreadsX.collect` over our comprehension:


```julia
using ThreadsX

@time ThreadsX.collect(get_person() for _ in 1:100_000) |> DataFrame;
```

```
 11.800948 seconds (51.89 M allocations: 2.534 GiB, 6.59% gc time, 1.21% compilation time)
```

Ok, so that's a little better, but getting a 5-6x speed-up from 12 threads is not that great. 
More importantly, by our linear scaling logic, the full 1 billion record run would still take 
roughly 33 hours on my laptop. And that's just to generate the data, never mind serializing 
it to disk.
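
For comparison, the manual approach of creating an array and populating it from multiple 
threads might look something like the sketch below, using only `Base.Threads`. The 
`fake_record` function is a hypothetical stand-in for `get_person()` so that the example runs 
without Faker.jl, and `collect_threaded` is likewise a made-up name:

```julia
using Base.Threads: @threads, nthreads

# Stand-in for `get_person()` so the sketch runs without Faker.jl (hypothetical).
fake_record(i) = (id = string(i), name = "person_$i")

function collect_threaded(n)
    # Preallocate the output, then let the threads fill in the slots.
    # Every iteration writes to a distinct index, so no locking is needed.
    results = Vector{typeof(fake_record(1))}(undef, n)
    @threads for i in 1:n
        results[i] = fake_record(i)
    end
    return results
end

records = collect_threaded(1_000)
println(length(records), " records built on ", nthreads(), " thread(s)")
```

Compared to this, wrapping a generator in `ThreadsX.collect` keeps the familiar comprehension 
style and avoids having to spell out the element type up front.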

Despite knowing it's a losing battle, I wrote a function to generate all the data and save it 
as parquet files, just like in the original post:

```julia
using Parquet2: writefile

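function save_the_people(num_people, num_files)
    # Wait for all scheduled writes to finish before returning
    @sync for i in 1:num_files
        # Zero-pad the file number so that the files sort in order
        file_num = string(i, pad=ndigits(num_files))
        file_loc = "./data/outfile_$(file_num).parquet"
        # Generate `num_people` records for this file using the available threads
        df = ThreadsX.collect(get_person() for _ in 1:num_people) |> DataFrame
        # Schedule the write as a task so it can overlap with the next batch
        @async writefile(file_loc, df; compression_codec=:snappy)
    end
end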
```
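
To see what the file-naming step produces, here it is pulled out on its own. This is a small 
sketch where `output_paths` is a hypothetical helper used only for illustration; it mirrors 
the `string(i, pad=ndigits(num_files))` call from `save_the_people`:

```julia
# Reproduce the file-naming step from `save_the_people` on its own.
# `output_paths` is a hypothetical helper used only for this illustration.
function output_paths(num_files; dir="./data")
    width = ndigits(num_files)  # 4 digits when num_files == 1000
    return ["$(dir)/outfile_$(string(i, pad=width)).parquet" for i in 1:num_files]
end

paths = output_paths(1000)
println(first(paths))   # prints ./data/outfile_0001.parquet
println(last(paths))    # prints ./data/outfile_1000.parquet
```

Padding to the width of the largest file number keeps the files in numeric order when they 
are listed or globbed alphabetically.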

## Appendix

The code in this post was run with Julia 1.8.5 and the following package versions:

```julia
using Pkg

Pkg.status(["Faker", "DataFrames", "Parquet2", "ThreadsX"])
```

```
Status `~/.julia/environments/v1.8/Project.toml`
  [a93c6f00] DataFrames v1.5.0
  [0efc519c] Faker v0.3.5
  [98572fba] Parquet2 v0.2.9
  [ac1d9e8a] ThreadsX v0.1.11
```

Additional hardware and software info:
```julia
versioninfo()
```

```
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 PRO 4750U with Radeon Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver2)
  Threads: 12 on 16 virtual cores
Environment:
  JULIA_NUM_THREADS = 12
  JULIA_EDITOR = code
```