Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Package restructuring #41

Open
Arkoniak opened this issue Mar 24, 2021 · 1 comment
Open

Package restructuring #41

Arkoniak opened this issue Mar 24, 2021 · 1 comment

Comments

@Arkoniak
Copy link
Collaborator

Arkoniak commented Mar 24, 2021

I want to summarize here problems that I see with current implementation and some ideas how to overcome it.

Behind the scenes data downloading

In current implementation, data is loaded invisibly for the user. Moreover, it is not only loaded invisibly, it also downloads invisibly.

It leads to the following issues:

  1. Unpredictable times of first geolocate call, it can take from milliseconds (actual lookup) to seconds or even minutes (when data is loaded).
  2. Uncontrollable behaviour: user can't choose whether he wants to load data from an existing file, whether he wants to update the database, or even which base to use.
  3. It is hard to switch from IPv4 to IPv6, an application should load both bases behind the scene and then somehow choose which one to use. See Add Support for IPv6 #21
  4. It is hard to change localization, once again, all necessary files should be loaded behind the scene. It generates additional memory pressure. See Localization #22
  5. It is hard to switch from CSV to MaxMindDB, because it is not quite clear which base to use. See Add Maxminddb support #26

Solution to all of these problems is the following methods which are accessible by user:

  1. load: it should accept various parameters and modes. User can choose between local and internet data loading, between different database formats and localization
  2. update!: it should accept parameters similar to `load, but it should validate the current state of the database and update database if new version is available.
  3. geolocate should be changed to geolocate(::DB, ::IP). For convenience, getindex method can be added db[IP] which works as geolocate.

Loaded Data structure and results

In the current implementation DataFrame is used as a storage format, and Dict{String, Any} used as a return query format.

It leads to the following issues

  1. DataFrame is type unstable by construction, so improper use can lead to unnecessary allocations and overall slowness.
  2. Row construction is rather slow
  3. Output is type unstable, making it the reason of slowdown in final application.

Possible solution:

  1. Use StructArray or Vector of GeoResult structs.
  2. Return GeoResult, which should be concretely typed and have a fixed number of fields. Use sentinel values instead of missing data.
@Arkoniak
Copy link
Collaborator Author

Arkoniak commented Apr 27, 2021

Current roadmap:

The next version is 0.6 which should include number of breaking changes. The main idea is to make code type-stable and remove Dict(String, Any) construction, which takes ~95% of the time.

Version 0.7 should target improvement of initial load

  • 1. Improve speed of load function
  • 2. Store DB in a binary form in order to provide a "lazy" load.

Version 0.8 may include support for MaxMind database binary support and conversion utilities. This one is questionable since MaxMind binary file is much slower than native Julia structures. In this case, it can be added as one of the features post 1.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant