In [14]:
import polars as pl
from mimesis import Fieldset
from mimesis.locales import Locale
from faux_lars import generate_dataframe
from faker import Faker

All benchmarks are arbitrary. These were run on a lightweight Nanobook with a Windows update restart pending, so treat the numbers as rough.


But `faux_lars` is **much** quicker than the libraries it is compared against, `mimesis` and `faker`.

`mimesis` and `faker` are far more feature-rich.

`mimesis`, in particular, has a great choice of both locales and customizability.

But for medium-sized data generation (10,000 to ~1,000,000 rows), I think `faux_lars` is a credible choice.
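To reproduce this kind of measurement outside a notebook, a rough stand-in for `%%timeit` can be built from the standard library. This is a minimal sketch, not what IPython does internally: `time_call` and `build_rows` are hypothetical names, and `build_rows` is only a placeholder workload to be swapped for the generator under test.

```python
import statistics
import timeit


def time_call(func, repeat=7, number=1):
    """Return (mean, std dev) in seconds per call over `repeat` runs."""
    # timeit.repeat returns one total time per run, each covering `number` calls.
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_call = [total / number for total in totals]
    # Population standard deviation across the runs.
    return statistics.mean(per_call), statistics.pstdev(per_call)


def build_rows(n=10_000):
    # Hypothetical placeholder workload; replace with the library call to benchmark.
    return [f"user{i}@example.com" for i in range(n)]


mean_s, std_s = time_call(build_rows)
print(f"{mean_s * 1e3:.2f} ms ± {std_s * 1e3:.2f} ms per loop (7 runs, 1 loop each)")
```

Keeping `repeat=7` and `number=1` matches the "7 runs, 1 loop each" shape of most results in this notebook, which makes the numbers easier to compare by eye.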

# 10,000 Rows Benchmark

In [20]:
rows = 10_000

### Faker

In [21]:
%%timeit
fake = Faker()
pl.DataFrame({"Name": [fake.name() for _ in range(rows)], "Email": [fake.ascii_email() for _ in range(rows)], "Phone": [fake.phone_number() for _ in range(rows)]})

13.7 s ± 6.2 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Mimesis

In [3]:
%%timeit
fs = Fieldset(locale=Locale.EN, i=rows)

df = pl.DataFrame({
    "Name": fs("person.full_name"),
    "Email": fs("email"),
    "Phone": fs("telephone", mask="+1 (###) #5#-7#9#"),
})

2.64 s ± 957 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Fauxlars

In [4]:
%%timeit
df = generate_dataframe({"Name": "name", "Email": "safe_email", "Phone": "mobile_number"}, rows, "en")

57.5 ms ± 8.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


# 100,000 Rows Benchmark

In [22]:
rows = 100_000

### Faker

In [23]:
%%timeit
fake = Faker()
pl.DataFrame({"Name": [fake.name() for _ in range(rows)], "Email": [fake.ascii_email() for _ in range(rows)], "Phone": [fake.phone_number() for _ in range(rows)]})

1min 58s ± 13 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Mimesis

In [6]:
%%timeit
fs = Fieldset(locale=Locale.EN, i=rows)

df = pl.DataFrame({
    "Name": fs("person.full_name"),
    "Email": fs("email"),
    "Phone": fs("telephone", mask="+1 (###) #5#-7#9#"),
})

The slowest run took 5.82 times longer than the fastest. This could mean that an intermediate result is being cached.
6.48 s ± 5.08 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Fauxlars

In [7]:
%%timeit
df = generate_dataframe({"Name": "name", "Email": "safe_email", "Phone": "mobile_number"}, rows, "en")

142 ms ± 17.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


# 1,000,000 Rows Benchmark

In [24]:
rows = 1_000_000

### Faker

Runs out of memory, so I interrupted the loop early.


In [25]:
%%timeit
fake = Faker()
pl.DataFrame({"Name": [fake.name() for _ in range(rows)], "Email": [fake.ascii_email() for _ in range(rows)], "Phone": [fake.phone_number() for _ in range(rows)]})

KeyboardInterrupt: 

### Mimesis

In [9]:
%%timeit
fs = Fieldset(locale=Locale.EN, i=rows)

df = pl.DataFrame({
    "Name": fs("person.full_name"),
    "Email": fs("email"),
    "Phone": fs("telephone", mask="+1 (###) #5#-7#9#"),
})

The slowest run took 5.58 times longer than the fastest. This could mean that an intermediate result is being cached.
1min 30s ± 1min 8s per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Fauxlars

In [10]:
%%timeit
df = generate_dataframe({"Name": "name", "Email": "safe_email", "Phone": "mobile_number"}, rows, "en")

4.58 s ± 422 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


# 10,000,000 Rows Benchmark

In [11]:
rows = 10_000_000

### Faker
I'm not going to run `faker` on 10,000,000 rows. 1,000,000 already went OOM. 😱

In [None]:
# %%timeit
# fake = Faker()
# pl.DataFrame({"Name": [fake.name() for _ in range(rows)], "Email": [fake.ascii_email() for _ in range(rows)], "Phone": [fake.phone_number() for _ in range(rows)]})

### Mimesis

In [12]:
%%timeit
fs = Fieldset(locale=Locale.EN, i=rows)

df = pl.DataFrame({
    "Name": fs("person.full_name"),
    "Email": fs("email"),
    "Phone": fs("telephone", mask="+1 (###) #5#-7#9#"),
})

7min 8s ± 2min 6s per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Fauxlars

In [13]:
%%timeit
df = generate_dataframe({"Name": "name", "Email": "safe_email", "Phone": "mobile_number"}, rows, "en")

18.5 s ± 1.31 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
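To put the results side by side, the mean timings reported above can be turned into rough speedup ratios. This is only a sketch: the numbers are copied by hand from the `%%timeit` outputs in this notebook (converted to seconds), `speedups` is a hypothetical helper, and `faker` has no entry at 1,000,000 or 10,000,000 rows because it never completed there.

```python
# Mean times in seconds, copied from the %%timeit outputs in this notebook.
timings = {
    10_000: {"faker": 13.7, "mimesis": 2.64, "faux_lars": 0.0575},
    100_000: {"faker": 118.0, "mimesis": 6.48, "faux_lars": 0.142},
    1_000_000: {"mimesis": 90.0, "faux_lars": 4.58},
    10_000_000: {"mimesis": 428.0, "faux_lars": 18.5},
}


def speedups(results, baseline="faux_lars"):
    """Return {rows: {library: library_time / baseline_time}}."""
    ratios = {}
    for rows, by_lib in results.items():
        base = by_lib[baseline]
        # Skip the baseline itself; its ratio would always be 1.
        ratios[rows] = {lib: t / base for lib, t in by_lib.items() if lib != baseline}
    return ratios


ratios = speedups(timings)
for rows, by_lib in ratios.items():
    for lib, ratio in by_lib.items():
        print(f"{rows:>10,} rows: faux_lars is ~{ratio:.0f}x faster than {lib}")
```

Several of the runs have very wide spreads (for example 1min 30s ± 1min 8s for `mimesis` at 1,000,000 rows), so these ratios are best read as orders of magnitude rather than precise figures.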
