Description
See the discussion starting from this comment. The key issue is that some CSV reading packages prefer to return a DataFrame by default: it is probably the most popular tabular data format in Julia and is preferable to the "typed alternatives" when the data is wide (NamedTuple-based alternatives perform poorly when the number of columns is in the thousands). In practice, both CSV.jl and TableReader.jl depend on DataFrames only so that they can output a DataFrame. Unfortunately, this slows down loading those packages, because the DataFrames code is complex. It also makes those packages less appealing as dependencies for tabular data packages that rely on alternative formats (like IndexedTables, StructArrays or TypedTables).
A proposed solution is to split the DataFrame definition and constructors out into a DataFramesBase package, which would have basically zero dependencies and would be very fast to load, so that CSV and TableReader can depend on DataFramesBase alone. DataFrames would then reexport DataFramesBase.DataFrame, so this change would not affect users.
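To make the proposed layout concrete, here is a minimal sketch using toy modules in a single file. Everything in it is an assumption for illustration: the struct fields, the constructor signature, the `nrow` helper and the `ToyReader` module are hypothetical stand-ins and do not reflect the actual DataFrames internals. In a real split the modules would be separate packages loaded with `using DataFramesBase` rather than the relative `using ..DataFramesBase` form used here.

```julia
# Toy stand-in for the proposed lightweight package: only the type and a
# basic constructor, with no dependencies. Field layout is hypothetical.
module DataFramesBase

export DataFrame

struct DataFrame
    columns::Vector{AbstractVector}
    names::Vector{Symbol}

    function DataFrame(columns::AbstractVector, names::AbstractVector{Symbol})
        length(columns) == length(names) ||
            throw(ArgumentError("number of columns and names must match"))
        all(c -> length(c) == length(first(columns)), columns) ||
            throw(ArgumentError("all columns must have the same length"))
        return new(collect(AbstractVector, columns), collect(Symbol, names))
    end
end

end # module DataFramesBase

# Toy stand-in for the full package: it reexports the same type and adds
# the heavier functionality on top of it.
module DataFrames

using ..DataFramesBase: DataFrame
export DataFrame, nrow

# Hypothetical helper representing functionality that stays in DataFrames.
nrow(df::DataFrame) = isempty(df.columns) ? 0 : length(first(df.columns))

end # module DataFrames

# Toy stand-in for a reader package such as CSV.jl or TableReader.jl:
# it depends only on the lightweight base module.
module ToyReader

using ..DataFramesBase: DataFrame

read_table() = DataFrame(AbstractVector[[1, 2, 3], ["a", "b", "c"]], [:id, :label])

end # module ToyReader

df = ToyReader.read_table()

# Both modules expose the very same type, so user code is unaffected.
same_type = DataFrames.DataFrame === DataFramesBase.DataFrame
rows = DataFrames.nrow(df)
```

The point of the sketch is that the reader never loads the heavy module, yet the value it returns works with everything the heavy module defines, because there is only one `DataFrame` type.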
As an added bonus, I suspect this may make it more appealing to port IndexedTables from its current "mutable table" structure (a dictionary of columns) to simply using a DataFrame: converting an IndexedTable to a DataFrame, replacing a few columns in various ways and converting back to an IndexedTable is a useful workflow.