Conversation
Pull request overview
This PR introduces columnar CSV reading functionality to Scautable, providing an alternative memory format optimized for analytical workloads. The implementation adds a new ReadAs enum that allows users to choose between row-oriented iteration (the existing behavior) and column-oriented arrays (the new feature). The changes integrate cleanly with the existing compile-time CSV parsing infrastructure while maintaining backward compatibility.
Key Changes:
- Added `CsvColumnsReader` object providing `resource`, `absolutePath`, and `fromString` methods that return `NamedTuple[K, (Array[T1], Array[T2], ...)]` instead of iterators
- Introduced `ReadAs` enum (`Rows`/`Columns`) to unify both reading modes under the existing `CSV` API
- Extended `CsvOpts` with a `readAs` parameter and new convenience constructors
- Changed default `TypeInferrer` from `StringType` to `FromAllRows` for better type inference
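For orientation, the layout difference the PR introduces can be sketched with the standard library alone. Everything below (object name, sample data, method names) is illustrative, not Scautable's API:

```scala
// Stdlib-only sketch of row- vs column-oriented CSV reading; the sample data
// and all names here are illustrative, not part of Scautable's API.
object ReadModesSketch:
  val csv = "name,age,score\nAlice,30,95.5\nBob,25,87.3"

  // Row mode: one tuple per record, consumed via an iterator.
  def asRows(content: String): Iterator[(String, Int, Double)] =
    content.linesIterator.drop(1).map { line =>
      val f = line.split(',')
      (f(0), f(1).toInt, f(2).toDouble)
    }

  // Column mode: one array per column, the analytics-friendly layout.
  def asColumns(content: String): (Array[String], Array[Int], Array[Double]) =
    val rows = asRows(content).toVector
    (rows.map(_._1).toArray, rows.map(_._2).toArray, rows.map(_._3).toArray)
```

Column mode pays one up-front materialisation cost in exchange for cache-friendly scans over a single field, which is the analytical-workload trade-off the overview describes.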
Reviewed changes
Copilot reviewed 10 out of 12 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| scautable/src/csvColumns.scala | New 397-line file implementing columnar CSV reader with compile-time type inference and decoders for Array[T] types |
| scautable/src/csv.scala | Extended with columnar reading support via ReadAs parameter, adding ~350 lines of column construction logic |
| scautable/src/CsvOptions.scala | Added ReadAs enum and updated CsvOpts with readAs parameter and new convenience constructors |
| scautable/src/package.scala | Exported CsvColumnsReader and ReadAs to public API |
| scautable/test/src-jvm/CsvColumnsSuite.scala | Comprehensive 212-line test suite covering column reading for simple.csv, cereals.csv, and titanic.csv |
| scautable/test/src/io/github/quafadas/scautable/CsvFromStringSuite.scala | Updated existing tests to use new CsvOpts API with explicit TypeInferrer.StringType |
| scautable/test/src-jvm/testCsv.scala | Updated tests to use new CsvOpts API structure |
| scautable/test/resources/cereals.csv | Added 78-row test dataset with 16 columns for testing columnar reading |
| scautable/test/resources/*.xlsx | Added Excel test files (binary format) |
| benchmark/src/readbenchmark.scala | Minor formatting fix adding "end while" keyword |
| build.mill | Added comment about scalafmt command |
scautable/src/CsvOptions.scala
Outdated
```scala
/** CSV options with readAs mode */
inline def apply(readAs: ReadAs): CsvOpts = CsvOpts(HeaderOptions.Default, TypeInferrer.FromAllRows, ',', readAs)

/** CSV options with type inference and readAs mode */
```
The CsvOpts.apply overload combining typeInferrer and readAs lacks documentation. Similar to the comment on line 75, this convenience method should document its intended use case.
```diff
- /** CSV options with type inference and readAs mode */
+ /**
+  * CSV options with custom type inference and read mode.
+  *
+  * @param typeInferrer
+  *   How to infer types for columns (e.g., from all rows, as strings, etc.)
+  * @param readAs
+  *   Whether to read as rows (iterator) or columns (arrays)
+  *
+  * Uses default header handling (HeaderOptions.Default) and comma delimiter.
+  */
```
scautable/src/CsvOptions.scala
Outdated
```diff
  /** CSV options with custom header handling */
- inline def apply(headers: HeaderOptions): CsvOpts = CsvOpts(headers, TypeInferrer.StringType, ',')
+ inline def apply(headers: HeaderOptions): CsvOpts = CsvOpts(headers, TypeInferrer.StringType, ',', ReadAs.Rows)
```
The apply(headers: HeaderOptions) overload defaults to TypeInferrer.StringType, but the new default behavior in other places is TypeInferrer.FromAllRows (as seen in extractTypeInferrer changes). This inconsistency could be confusing - when a user specifies only HeaderOptions, they get StringType, but in other contexts they get FromAllRows. Consider aligning these defaults for consistency.
```diff
- inline def apply(headers: HeaderOptions): CsvOpts = CsvOpts(headers, TypeInferrer.StringType, ',', ReadAs.Rows)
+ inline def apply(headers: HeaderOptions): CsvOpts = CsvOpts(headers, TypeInferrer.FromAllRows, ',', ReadAs.Rows)
```
```scala
test("fromString - basic functionality") {
  // Note: string must be inline literal for compile-time processing
  val cols = CsvColumnsReader.fromString("""name,age,score
Alice,30,95.5
Bob,25,87.3
Charlie,35,92.1""")
```
The test uses a multi-line string literal but the comment on line 42 indicates "string must be inline literal for compile-time processing". However, this test doesn't verify that inline requirement is enforced. Consider adding a negative test case or documentation about what happens if non-inline strings are used, to help users understand the compile-time constraint.
```scala
    bh.consume(row)
    count += 1
  end while
  count
```
The added "end while" is good for consistency with Scala 3 style, but the indentation of the count return statement (line 35) appears incorrect - it should be aligned with the method body at the same level as the while loop, not indented to match the end while.
Suggested change (indentation only):

```scala
  count
```
scautable/src/csvColumns.scala
Outdated
```scala
def constructWithTypes[Hdrs <: Tuple: Type, ArrayData <: Tuple: Type]: Expr[NamedTuple[Hdrs, ArrayData]] =
  '{
    val content = $csvContentExpr
    val lines = content.linesIterator
    val (headers, iterator) = lines.headers($csvHeadersExpr, $delimiterExpr)

    val numCols = headers.length
    val buffers = Array.fill(numCols)(scala.collection.mutable.ArrayBuffer[String]())

    iterator.foreach { line =>
      val parsed = CSVParser.parseLine(line, $delimiterExpr)
      var i = 0
      while i < parsed.length && i < numCols do
        buffers(i) += parsed(i)
        i += 1
      end while
    }

    val typedColumns = ColumnsDecoder.decodeAllColumns[ArrayData](buffers)
    NamedTuple.build[Hdrs & Tuple]()(typedColumns)
```
The constructWithTypes function contains significant code duplication with similar functions in csv.scala (lines 484-508). The logic for creating buffers, parsing lines, and building NamedTuples is nearly identical. This duplication makes maintenance more difficult and increases the risk of inconsistencies. Consider extracting the common runtime logic into a shared helper function.
```diff
- def constructWithTypes[Hdrs <: Tuple: Type, ArrayData <: Tuple: Type]: Expr[NamedTuple[Hdrs, ArrayData]] =
-   '{
-     val content = $csvContentExpr
-     val lines = content.linesIterator
-     val (headers, iterator) = lines.headers($csvHeadersExpr, $delimiterExpr)
-     val numCols = headers.length
-     val buffers = Array.fill(numCols)(scala.collection.mutable.ArrayBuffer[String]())
-     iterator.foreach { line =>
-       val parsed = CSVParser.parseLine(line, $delimiterExpr)
-       var i = 0
-       while i < parsed.length && i < numCols do
-         buffers(i) += parsed(i)
-         i += 1
-       end while
-     }
-     val typedColumns = ColumnsDecoder.decodeAllColumns[ArrayData](buffers)
-     NamedTuple.build[Hdrs & Tuple]()(typedColumns)
+ // Helper to build a NamedTuple from CSV content, headers, delimiter, and a column decoder
+ private def buildNamedTupleFromCsv[Hdrs <: Tuple, ArrayData <: Tuple](
+   content: String,
+   csvHeaders: Option[Seq[String]],
+   delimiter: Char,
+   decodeColumns: Array[scala.collection.mutable.ArrayBuffer[String]] => ArrayData
+ ): NamedTuple[Hdrs, ArrayData] =
+   val lines = content.linesIterator
+   val (headers, iterator) = lines.headers(csvHeaders, delimiter)
+   val numCols = headers.length
+   val buffers = Array.fill(numCols)(scala.collection.mutable.ArrayBuffer[String]())
+   iterator.foreach { line =>
+     val parsed = CSVParser.parseLine(line, delimiter)
+     var i = 0
+     while i < parsed.length && i < numCols do
+       buffers(i) += parsed(i)
+       i += 1
+     end while
+   }
+   val typedColumns = decodeColumns(buffers)
+   NamedTuple.build[Hdrs & Tuple]()(typedColumns)
+
+ def constructWithTypes[Hdrs <: Tuple: Type, ArrayData <: Tuple: Type]: Expr[NamedTuple[Hdrs, ArrayData]] =
+   '{
+     buildNamedTupleFromCsv[Hdrs, ArrayData](
+       $csvContentExpr,
+       $csvHeadersExpr,
+       $delimiterExpr,
+       ColumnsDecoder.decodeAllColumns[ArrayData]
+     )
```
```scala
// Helper to extract ReadAs expression from CsvOpts
private[scautable] def extractReadAs(optsExpr: Expr[CsvOpts])(using Quotes): Expr[ReadAs] =
  import quotes.reflect.*
  // Unwrap Inlined nodes
  def unwrapInlined(term: Term): Term = term match
    case Inlined(_, _, body) => unwrapInlined(body)
    case other => other

  // Helper to unwrap NamedArg nodes
  def unwrapNamedArg(term: Term): Term = term match
    case NamedArg(_, value) => value
    case other => other

  val readAsType = TypeRepr.of[ReadAs]
  val term = unwrapInlined(optsExpr.asTerm)

  // Helper to check if a term is ReadAs.Columns
  def isColumns(t: Term): Boolean =
    unwrapInlined(t) match
      case Select(_, "Columns") => true
      case _ => false

  // Helper to check if a term is a default parameter call
  def isDefaultParam(t: Term): Boolean =
    unwrapInlined(t) match
      case Select(_, name) if name.contains("$default$") => true
      case Ident(name) if name.startsWith("readAs$") => true // Variable binding for default
      case _ => false

  // Helper to extract ReadAs from a term, returning proper expression
  def extractFromTerm(t: Term): Expr[ReadAs] =
    val unwrapped = unwrapInlined(t)
    if isColumns(unwrapped) then '{ ReadAs.Columns }
    else if isDefaultParam(unwrapped) then '{ ReadAs.Rows }
    else if unwrapped.tpe <:< readAsType then unwrapped.asExprOf[ReadAs]
    else '{ ReadAs.Rows }
    end if
  end extractFromTerm

  term match
    // Handle Block with variable bindings from default parameters
    case Block(statements, Apply(_, args)) =>
      args
        .collectFirst { case NamedArg("readAs", value) =>
          extractFromTerm(value)
        }
        .orElse {
          // Look for a ReadAs argument by type
          args.map(unwrapNamedArg).collectFirst {
            case t if t.tpe <:< readAsType => extractFromTerm(t)
          }
        }
        .getOrElse('{ ReadAs.Rows })

    // Try to extract arguments from Apply node
    case Apply(_, args) =>
      args
        .collectFirst { case NamedArg("readAs", value) =>
          extractFromTerm(value)
        }
        .orElse {
          // Look for a ReadAs argument by type
          args.map(unwrapNamedArg).collectFirst {
            case t if t.tpe <:< readAsType => extractFromTerm(t)
          }
        }
        .getOrElse('{ ReadAs.Rows })

    case _ =>
      // Check if it's CsvOpts.default
      if optsExpr.matches('{ CsvOpts.default }) then '{ ReadAs.Rows }
      else
        report.info(s"Could not extract ReadAs from CsvOpts (using Rows): ${optsExpr.show}")
        '{ ReadAs.Rows }
  end match
end extractReadAs
```
The extractReadAs method follows a very similar pattern to extractHeaderOptions, extractTypeInferrer, and extractDelimiter. While some duplication is expected in extraction logic, there's significant overlap in the term unwrapping and pattern matching structure. Consider if these could share more common utility functions for term inspection and unwrapping.
```scala
def constructColumnArrays[Hdrs <: Tuple: Type, ArrayData <: Tuple: Type]: Expr[NamedTuple[Hdrs, ArrayData]] =
  val filePathExpr = Expr(path)
  '{
    val source = scala.io.Source.fromFile($filePathExpr)
    val lines = source.getLines()
    val (headers, iterator) = lines.headers($csvHeadersExpr, $delimiterExpr)

    case '{ TypeInferrer.FromAllRows } =>
      val inferredTypeRepr = InferrerOps.inferrer(iter, false, Int.MaxValue, delimiter)
      inferredTypeRepr.asType match
        case '[v] => constructWithTypes[hdrs & Tuple, v & Tuple]
      end match
    val numCols = headers.length
    val buffers = Array.fill(numCols)(scala.collection.mutable.ArrayBuffer[String]())

    case '{ TypeInferrer.FirstN(${ Expr(n) }) } =>
      val inferredTypeRepr = InferrerOps.inferrer(iter, true, n, delimiter)
      inferredTypeRepr.asType match
        case '[v] => constructWithTypes[hdrs & Tuple, v & Tuple]
      end match
    iterator.foreach { line =>
      val parsed = CSVParser.parseLine(line, $delimiterExpr)
      var i = 0
      while i < parsed.length && i < numCols do
        buffers(i) += parsed(i)
        i += 1
      end while
    }

    case '{ TypeInferrer.FirstN(${ Expr(n) }, ${ Expr(preferIntToBoolean) }) } =>
      val inferredTypeRepr = InferrerOps.inferrer(iter, preferIntToBoolean, n, delimiter)
      inferredTypeRepr.asType match
        case '[v] => constructWithTypes[hdrs & Tuple, v & Tuple]
      end match
    source.close()

    case _ =>
      report.throwError("Could not infer literal header tuple.")
    end match
    val typedColumns = ColumnsDecoder.decodeAllColumns[ArrayData](buffers)
    NamedTuple.build[Hdrs & Tuple]()(typedColumns)
  }
end constructColumnArrays
```
The constructColumnArrays function is duplicated in csvColumns.scala. The logic for creating buffers, parsing CSV lines, closing sources, and decoding columns is identical. This creates unnecessary maintenance overhead. The same pattern appears in both readHeaderlineAsCsv and readCsvFromString methods.
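The shared runtime portion in question is small. As a rough illustration of the kind of helper that could be extracted, here is a self-contained sketch of the buffer-filling step, with `split` standing in for the PR's `CSVParser.parseLine` (so quoting rules are not handled) and the function name being hypothetical:

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical extracted helper: fill one ArrayBuffer per column from CSV lines.
// `split` stands in for the PR's CSVParser.parseLine, so quoted fields are not handled.
def fillColumnBuffers(lines: Iterator[String], numCols: Int, delimiter: Char): Array[ArrayBuffer[String]] =
  val buffers = Array.fill(numCols)(ArrayBuffer[String]())
  lines.foreach { line =>
    val parsed = line.split(delimiter)
    var i = 0
    while i < parsed.length && i < numCols do
      buffers(i) += parsed(i)
      i += 1
    end while
  }
  buffers
```

A helper of this shape could be spliced into the quoted blocks of `constructWithTypes`, `constructColumnArrays`, and the `fromString` path alike, leaving only the type-level wiring in the macros.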
```scala
if !isColumnMode then
  headerTupleExpr match
    case '{ $tup: hdrs } =>
      typeInferrerExpr match

        case '{ TypeInferrer.FromTuple[t]() } =>
          constructRowIterator[hdrs & Tuple, t & Tuple]

        case '{ TypeInferrer.StringType } =>
          constructRowIterator[hdrs & Tuple, StringyTuple[hdrs & Tuple] & Tuple]

        case '{ TypeInferrer.FirstRow } =>
          val inferredTypeRepr = InferrerOps.inferrer(iter, true, delimiter = delimiter)
          inferredTypeRepr.asType match
            case '[v] =>
              constructRowIterator[hdrs & Tuple, v & Tuple]
          end match

        case '{ TypeInferrer.FromAllRows } =>
          val inferredTypeRepr = InferrerOps.inferrer(iter, false, Int.MaxValue, delimiter)
          inferredTypeRepr.asType match
            case '[v] => constructRowIterator[hdrs & Tuple, v & Tuple]
          end match

        case '{ TypeInferrer.FirstN(${ Expr(n) }) } =>
          val inferredTypeRepr = InferrerOps.inferrer(iter, true, n, delimiter)
          inferredTypeRepr.asType match
            case '[v] => constructRowIterator[hdrs & Tuple, v & Tuple]
          end match

        case '{ TypeInferrer.FirstN(${ Expr(n) }, ${ Expr(preferIntToBoolean) }) } =>
          val inferredTypeRepr = InferrerOps.inferrer(iter, preferIntToBoolean, n, delimiter)
          inferredTypeRepr.asType match
            case '[v] => constructRowIterator[hdrs & Tuple, v & Tuple]
          end match

        case _ =>
          report.throwError("Could not infer literal header tuple.")
      end match
else // isColumnMode
  headerTupleExpr match
    case '{ $tup: hdrs } =>
      typeInferrerExpr match

        case '{ TypeInferrer.FromTuple[t]() } =>
          val arrayTypeRepr = toArrayTupleType(TypeRepr.of[t])
          arrayTypeRepr.asType match
            case '[arrTup] => constructColumnArrays[hdrs & Tuple, arrTup & Tuple]
          end match

        case '{ TypeInferrer.StringType } =>
          val stringyType = TypeRepr.of[StringyTuple[hdrs & Tuple]]
          val arrayTypeRepr = toArrayTupleType(stringyType)
          arrayTypeRepr.asType match
            case '[arrTup] => constructColumnArrays[hdrs & Tuple, arrTup & Tuple]
          end match

        case '{ TypeInferrer.FirstRow } =>
          val inferredTypeRepr = InferrerOps.inferrer(iter, true, delimiter = delimiter)
          val arrayTypeRepr = toArrayTupleType(inferredTypeRepr)
          arrayTypeRepr.asType match
            case '[arrTup] => constructColumnArrays[hdrs & Tuple, arrTup & Tuple]
          end match

        case '{ TypeInferrer.FromAllRows } =>
          val inferredTypeRepr = InferrerOps.inferrer(iter, false, Int.MaxValue, delimiter)
          val arrayTypeRepr = toArrayTupleType(inferredTypeRepr)
          arrayTypeRepr.asType match
            case '[arrTup] => constructColumnArrays[hdrs & Tuple, arrTup & Tuple]
          end match

        case '{ TypeInferrer.FirstN(${ Expr(n) }) } =>
          val inferredTypeRepr = InferrerOps.inferrer(iter, true, n, delimiter)
          val arrayTypeRepr = toArrayTupleType(inferredTypeRepr)
          arrayTypeRepr.asType match
            case '[arrTup] => constructColumnArrays[hdrs & Tuple, arrTup & Tuple]
          end match

        case '{ TypeInferrer.FirstN(${ Expr(n) }, ${ Expr(preferIntToBoolean) }) } =>
          val inferredTypeRepr = InferrerOps.inferrer(iter, preferIntToBoolean, n, delimiter)
          val arrayTypeRepr = toArrayTupleType(inferredTypeRepr)
          arrayTypeRepr.asType match
            case '[arrTup] => constructColumnArrays[hdrs & Tuple, arrTup & Tuple]
          end match

        case _ =>
          report.throwError("Could not infer literal header tuple.")
      end match
end if
```
The row-mode and column-mode branches (lines 510-598) contain extensive duplication of the type inference pattern matching logic. Each TypeInferrer case (FromTuple, StringType, FirstRow, FromAllRows, FirstN with variations) is handled twice with nearly identical code except for calling either constructRowIterator or constructColumnArrays. This doubles the maintenance burden and increases the risk of inconsistent behavior. Consider refactoring to parameterize the construction function.
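The shape of that refactor can be shown at the value level (the macro version would pass a polymorphic construction function instead). All names in this sketch are illustrative simplifications of the PR's types:

```scala
// Value-level sketch of the suggested refactor: one dispatch over the
// inferrer, with the mode-specific construction passed in as a function.
// `Inferrer`, `resolve`, and `read` are illustrative names, not the PR's API.
enum Inferrer:
  case StringType, FirstRow, FromAllRows
  case FirstN(n: Int)

// Toy "inference": Int if every sampled value parses as an Int, else String.
def inferType(values: Vector[String]): String =
  if values.nonEmpty && values.forall(_.toIntOption.isDefined) then "Int" else "String"

// The single shared match over all inferrer cases.
def resolve(inferrer: Inferrer, sample: Vector[String]): String =
  inferrer match
    case Inferrer.StringType  => "String"
    case Inferrer.FirstRow    => inferType(sample.take(1))
    case Inferrer.FromAllRows => inferType(sample)
    case Inferrer.FirstN(n)   => inferType(sample.take(n))

// Row and column modes share the dispatch; only `construct` differs.
def read[A](inferrer: Inferrer, sample: Vector[String], construct: String => A): A =
  construct(resolve(inferrer, sample))
```

At macro level the analogous move would be threading a polymorphic function (roughly `[v] => Type[v] ?=> Expr[...]`) through one `TypeInferrer` match, so adding a new inferrer case touches a single place.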
scautable/src/CsvOptions.scala
Outdated
```diff
- CsvOpts(headers, typeInferrer, ',')
+ CsvOpts(headers, typeInferrer, ',', ReadAs.Rows)

  /** CSV options with readAs mode */
```
The CsvOpts.apply overload for readAs parameter lacks documentation. While other overloads have doc comments explaining their purpose, this new convenience method should also document when and why to use it (e.g., "CSV options with readAs mode - use when you only need to specify whether to read as rows or columns").
```diff
- /** CSV options with readAs mode */
+ /**
+  * CSV options with readAs mode.
+  *
+  * Use this overload when you only need to specify whether to read the CSV as rows (iterator) or columns (arrays).
+  * All other options (headers, type inference, delimiter) use their defaults.
+  *
+  * @param readAs Whether to read as rows (iterator) or columns (arrays)
+  */
```
scautable/src/csvColumns.scala
Outdated
```scala
// Convert a value type tuple (Int, String, Double) to array type tuple (Array[Int], Array[String], Array[Double])
def toArrayTupleType(valueTypeRepr: TypeRepr): TypeRepr =
  valueTypeRepr.asType match
    case '[EmptyTuple] => TypeRepr.of[EmptyTuple]
    case '[h *: t] =>
      val headArrayType = TypeRepr.of[Array].appliedTo(TypeRepr.of[h])
      val tailArrayType = toArrayTupleType(TypeRepr.of[t])
      TypeRepr.of[*:].appliedTo(List(headArrayType, tailArrayType))
```
The toArrayTupleType function is duplicated in both csvColumns.scala and csv.scala. This creates a maintainability issue where changes need to be made in two places. Consider extracting this function to a shared utility object or trait that both files can use.
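If the value-tuple-to-array-tuple mapping were needed outside macro code, it could also live once as a match type. This is a compilable sketch of the analogue, not the PR's `TypeRepr`-based implementation:

```scala
// Match-type analogue of toArrayTupleType: (Int, String) reduces to
// (Array[Int], Array[String]). The PR does this with TypeRepr at macro time;
// `ToArrays` here is an illustrative name.
type ToArrays[T <: Tuple] <: Tuple = T match
  case EmptyTuple => EmptyTuple
  case h *: t     => Array[h] *: ToArrays[t]

// Compile-time evidence that the mapping reduces as described.
val evidence = summon[ToArrays[(Int, String, Double)] =:= (Array[Int], Array[String], Array[Double])]
```

A single definition like this (or the existing `TypeRepr` version moved to a shared utility object) would remove the duplicate in `csvColumns.scala` and `csv.scala`.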
@copilot can you fix CI? Also, please read the changes and remove extraneous code, in particular the API specific to this change. Also, update the docs.
… parameter (#113)

* Initial plan
* Plan: Remove CsvColumnsReader API, keep unified CSV with ReadAs, fix CI, update docs
* Remove CsvColumnsReader API, update tests and docs for unified CSV with ReadAs
* Apply scalafmt formatting and fix duplicate end while in benchmark
* Fix documentation comment for CsvOpts.apply(headers) to match implementation
* Fix CI: Resolve variable name conflict in ColumnOrient.md documentation

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Quafadas <24899792+Quafadas@users.noreply.github.com>
This introduces the ability to read CSV files directly into a columnar memory format.