[FEAT] [Scan Operator] Add Python I/O support (+ JSON) to MicroPartition reads #1578
@@ -19,6 +19,8 @@ use daft_table::Table;
use snafu::ResultExt;

use crate::DaftCoreComputeSnafu;
#[cfg(feature = "python")]
use crate::PyIOSnafu;

use daft_io::{IOConfig, IOStatsRef};
use daft_stats::TableMetadata;
@@ -84,20 +86,6 @@ fn materialize_scan_task(
    // Schema to cast resultant tables into, ensuring that all Tables have the same schema.
    // Note that we need to apply column pruning here if specified by the ScanTask
    let cast_to_schema = cast_to_schema.unwrap_or_else(|| scan_task.schema.clone());
    let cast_to_schema = match &column_names {
        None => Ok(cast_to_schema),
        Some(column_names) => Ok(Arc::new(
            Schema::new(
                cast_to_schema
                    .names()
                    .iter()
                    .filter(|name| column_names.contains(&name.as_str()))
                    .map(|name| cast_to_schema.get_field(name).unwrap().clone())
                    .collect(),
            )
            .context(DaftCoreComputeSnafu)?,
        )),
    }?;

    let table_values = match scan_task.storage_config.as_ref() {
        StorageConfig::Native(native_storage_config) => {
@@ -146,11 +134,22 @@ fn materialize_scan_task(
// Native CSV Reads
// ****************
FileFormatConfig::Csv(cfg @ CsvSourceConfig { .. }) => {
    let col_names = if !cfg.has_headers {
        Some(
            cast_to_schema
                .fields
                .values()
                .map(|f| f.name.as_str())
                .collect::<Vec<_>>(),
        )
    } else {
        None
    };
    urls.map(|url| {
        daft_csv::read::read_csv(
            url,
            None,
            None, // column_names.clone(), NOTE: `read_csv` seems to be buggy when provided with out-of-order column_names
            col_names.clone(),
            column_names.clone(),
Maybe we should assign …

Also, I wonder what the expected (and current!) behavior for ordering of columns is when we provide schema, column_names, and include_columns... Man, I hate CSVs. I feel like it should be: …

Offline discussion: …
            scan_task.limit,
            cfg.has_headers,
            Some(cfg.delimiter.as_bytes()[0]),
@@ -178,9 +177,82 @@ fn materialize_scan_task(
}
#[cfg(feature = "python")]
StorageConfig::Python(_) => {
    todo!("TODO: Implement Python I/O backend for MicroPartitions.")
    use pyo3::Python;
    match scan_task.file_format_config.as_ref() {
        FileFormatConfig::Parquet(ParquetSourceConfig {
            coerce_int96_timestamp_unit,
            ..
        }) => Python::with_gil(|py| {
            urls.map(|url| {
                crate::python::read_parquet_into_py_table(
                    py,
                    url,
                    cast_to_schema.clone().into(),
                    (*coerce_int96_timestamp_unit).into(),
                    scan_task.storage_config.clone().into(),
                    scan_task.columns.as_ref().map(|cols| cols.as_ref().clone()),
                    scan_task.limit,
                )
                .map(|t| t.into())
                .context(PyIOSnafu)
            })
            .collect::<crate::Result<Vec<_>>>()
        })?,
        FileFormatConfig::Csv(CsvSourceConfig {
            has_headers,
            delimiter,
            double_quote,
            ..
        }) => Python::with_gil(|py| {
            urls.map(|url| {
                crate::python::read_csv_into_py_table(
                    py,
                    url,
                    *has_headers,
                    delimiter,
                    *double_quote,
                    cast_to_schema.clone().into(),
                    scan_task.storage_config.clone().into(),
                    scan_task.columns.as_ref().map(|cols| cols.as_ref().clone()),
                    scan_task.limit,
                )
                .map(|t| t.into())
                .context(PyIOSnafu)
            })
            .collect::<crate::Result<Vec<_>>>()
        })?,
        FileFormatConfig::Json(_) => Python::with_gil(|py| {
            urls.map(|url| {
                crate::python::read_json_into_py_table(
                    py,
                    url,
                    cast_to_schema.clone().into(),
                    scan_task.storage_config.clone().into(),
                    scan_task.columns.as_ref().map(|cols| cols.as_ref().clone()),
                    scan_task.limit,
                )
                .map(|t| t.into())
                .context(PyIOSnafu)
            })
            .collect::<crate::Result<Vec<_>>>()
        })?,
    }
}
};
let cast_to_schema = match &column_names {
    None => Ok(cast_to_schema),
    Some(column_names) => Ok(Arc::new(
        Schema::new(
            cast_to_schema
                .names()
                .iter()
                .filter(|name| column_names.contains(&name.as_str()))
                .map(|name| cast_to_schema.get_field(name).unwrap().clone())
                .collect(),
        )
        .context(DaftCoreComputeSnafu)?,
    )),
}?;
@jaychia I moved the field pruning on the schema to after the read, since I believe that all of the I/O layers expect the schema to give the full (pre-projection) schema of the file, rather than the pruned schema. I at least know that this is the case for the CSV reader. This also matches the behavior of the non-scan-operator + micropartition path.
I'm not sure if we can make that strong of an assumption, actually, since this might not hold true in two scenarios:

Scenario 1: user-provided schema hints

I think if a user ran a multi-file CSV read with a schema hint, we would expect it to work on all of the CSV files, with the assumption that each CSV file could have any number of columns, but they should all at least have the hinted column.

Scenario 2: asymmetric files / schema inference

We perform schema inference on only the first file, so the inferred schema may not describe every file.

Solution

The current way we work around this in both our Python Parquet/CSV/JSON reads and in our native Parquet read is: …

For native CSV reads, we currently pass in a hardcoded None.

Offline discussion: …
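A minimal standalone sketch of the pre-projection-schema concern discussed in this thread; the reader, row, and helper below are hypothetical illustrations, not the daft_csv/daft_core APIs:

```rust
// Hypothetical illustration: a headerless, positional reader can only pair values
// with field names by position, so it needs the full (pre-projection) schema.
fn read_positionally<'a>(file_row: &[&'a str], schema: &[&'a str]) -> Vec<(&'a str, &'a str)> {
    schema.iter().copied().zip(file_row.iter().copied()).collect()
}

fn main() {
    let file_row = ["1", "x", "3.5"]; // file columns are a, b, c (no header row)
    let full_schema = ["a", "b", "c"];
    let pruned_schema = ["a", "c"]; // include_columns = ["a", "c"] applied before the read

    // Pruned schema handed to the reader: "c" silently picks up column b's value.
    assert_eq!(
        read_positionally(&file_row, &pruned_schema),
        vec![("a", "1"), ("c", "x")]
    );

    // Full schema handed to the reader, pruning applied afterwards: values stay
    // attached to the right fields.
    let include_columns = ["a", "c"];
    let pruned: Vec<_> = read_positionally(&file_row, &full_schema)
        .into_iter()
        .filter(|(name, _)| include_columns.contains(name))
        .collect();
    assert_eq!(pruned, vec![("a", "1"), ("c", "3.5")]);
}
```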
let casted_table_values = table_values

I think the current code would run … And also, I think based on the way our code is written right now, (1) will run it on the pre-pruned schema, and (2) will run it on the pruned schema.

Synced offline: this should be fine for now, since it doesn't affect correctness and will be short-lived (deprecated soon).

    .iter()
@jaychia Here was the include_columns column ordering bug; we were pruning columns while maintaining the original field ordering instead of using the new field ordering indicated by the order of include_columns.
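A small standalone sketch of the ordering difference described here; the field names are hypothetical and this is plain Rust rather than the daft_core Schema API:

```rust
fn main() {
    // Hypothetical file schema and projection; not daft_core types.
    let file_schema = ["a", "b", "c", "d"];
    let include_columns = ["c", "a"];

    // Pruning by filtering the file's fields keeps the file's ordering...
    let pruned_by_file_order: Vec<&str> = file_schema
        .iter()
        .copied()
        .filter(|name| include_columns.contains(name))
        .collect();
    assert_eq!(pruned_by_file_order, vec!["a", "c"]);

    // ...while building the pruned schema from `include_columns` keeps the
    // ordering that the projection asked for.
    let pruned_by_request_order: Vec<&str> = include_columns
        .iter()
        .copied()
        .filter(|name| file_schema.contains(name))
        .collect();
    assert_eq!(pruned_by_request_order, vec!["c", "a"]);
}
```

The two results differ whenever include_columns lists columns in a different order than the file's schema, which is exactly the case the comment above calls out.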