
First test of columnar reading #112

Merged
Quafadas merged 5 commits into main from columnar on Dec 19, 2025

Conversation

@Quafadas
Owner

This introduces the ability to read CSV files straight into a columnar memory format.

Copilot AI review requested due to automatic review settings on December 13, 2025, 16:23
Contributor

Copilot AI left a comment


Pull request overview

This PR introduces columnar CSV reading functionality to Scautable, providing an alternative memory format optimized for analytical workloads. The implementation adds a new ReadAs enum that allows users to choose between row-oriented iteration (the existing behavior) and column-oriented arrays (the new feature). The changes integrate cleanly with the existing compile-time CSV parsing infrastructure while maintaining backward compatibility.

Key Changes:

  • Added CsvColumnsReader object providing resource, absolutePath, and fromString methods that return NamedTuple[K, (Array[T1], Array[T2], ...)] instead of iterators
  • Introduced ReadAs enum (Rows/Columns) to unify both reading modes under the existing CSV API (see the usage sketch after this list)
  • Extended CsvOpts with a readAs parameter and new convenience constructors
  • Changed default TypeInferrer from StringType to FromAllRows for better type inference
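
A minimal usage sketch of the column mode, based on the fromString test added in this PR; the column accessors and their types are assumptions drawn from the summary above rather than confirmed signatures:

// Minimal sketch, assuming the CsvColumnsReader.fromString API exercised in this PR's tests.
// Column mode yields a NamedTuple of typed arrays rather than an iterator of rows.
val cols = CsvColumnsReader.fromString("""name,age,score
Alice,30,95.5
Bob,25,87.3""")

val ages: Array[Int] = cols.age                    // assumed accessor: one typed Array per column
val meanScore = cols.score.sum / cols.score.length // columns behave like ordinary Arrays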

Reviewed changes

Copilot reviewed 10 out of 12 changed files in this pull request and generated 13 comments.

Summary per file:

  • scautable/src/csvColumns.scala: New 397-line file implementing columnar CSV reader with compile-time type inference and decoders for Array[T] types
  • scautable/src/csv.scala: Extended with columnar reading support via the ReadAs parameter, adding ~350 lines of column construction logic
  • scautable/src/CsvOptions.scala: Added ReadAs enum and updated CsvOpts with a readAs parameter and new convenience constructors
  • scautable/src/package.scala: Exported CsvColumnsReader and ReadAs to the public API
  • scautable/test/src-jvm/CsvColumnsSuite.scala: Comprehensive 212-line test suite covering column reading for simple.csv, cereals.csv, and titanic.csv
  • scautable/test/src/io/github/quafadas/scautable/CsvFromStringSuite.scala: Updated existing tests to use the new CsvOpts API with explicit TypeInferrer.StringType
  • scautable/test/src-jvm/testCsv.scala: Updated tests to use the new CsvOpts API structure
  • scautable/test/resources/cereals.csv: Added 78-row test dataset with 16 columns for testing columnar reading
  • scautable/test/resources/*.xlsx: Added Excel test files (binary format)
  • benchmark/src/readbenchmark.scala: Minor formatting fix adding the "end while" keyword
  • build.mill: Added a comment about the scalafmt command

/** CSV options with readAs mode */
inline def apply(readAs: ReadAs): CsvOpts = CsvOpts(HeaderOptions.Default, TypeInferrer.FromAllRows, ',', readAs)

/** CSV options with type inference and readAs mode */

Copilot AI Dec 13, 2025


The CsvOpts.apply overload combining typeInferrer and readAs lacks documentation. Similar to the comment on line 75, this convenience method should document its intended use case.

Suggested change
- /** CSV options with type inference and readAs mode */
+ /**
+  * CSV options with custom type inference and read mode.
+  *
+  * @param typeInferrer
+  *   How to infer types for columns (e.g., from all rows, as strings, etc.)
+  * @param readAs
+  *   Whether to read as rows (iterator) or columns (arrays)
+  *
+  * Uses default header handling (HeaderOptions.Default) and comma delimiter.
+  */


/** CSV options with custom header handling */
- inline def apply(headers: HeaderOptions): CsvOpts = CsvOpts(headers, TypeInferrer.StringType, ',')
+ inline def apply(headers: HeaderOptions): CsvOpts = CsvOpts(headers, TypeInferrer.StringType, ',', ReadAs.Rows)

Copilot AI Dec 13, 2025


The apply(headers: HeaderOptions) overload defaults to TypeInferrer.StringType, but the new default behavior in other places is TypeInferrer.FromAllRows (as seen in the extractTypeInferrer changes). This inconsistency could be confusing: when a user specifies only HeaderOptions, they get StringType, but in other contexts they get FromAllRows. Consider aligning these defaults for consistency.

Suggested change
- inline def apply(headers: HeaderOptions): CsvOpts = CsvOpts(headers, TypeInferrer.StringType, ',', ReadAs.Rows)
+ inline def apply(headers: HeaderOptions): CsvOpts = CsvOpts(headers, TypeInferrer.FromAllRows, ',', ReadAs.Rows)
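
A small sketch of the mismatch, assuming the CsvOpts overloads shown in this PR's diff:

// Both calls omit TypeInferrer, but the inferred default differs by overload.
CsvOpts(HeaderOptions.Default) // fills in TypeInferrer.StringType
CsvOpts(ReadAs.Rows)           // fills in TypeInferrer.FromAllRows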

Comment on lines 41 to 46
test("fromString - basic functionality") {
// Note: string must be inline literal for compile-time processing
val cols = CsvColumnsReader.fromString("""name,age,score
Alice,30,95.5
Bob,25,87.3
Charlie,35,92.1""")

Copilot AI Dec 13, 2025


The test uses a multi-line string literal but the comment on line 42 indicates "string must be inline literal for compile-time processing". However, this test doesn't verify that inline requirement is enforced. Consider adding a negative test case or documentation about what happens if non-inline strings are used, to help users understand the compile-time constraint.
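
A hypothetical negative case illustrating the constraint (the file path is invented):

// Hypothetical illustration of the compile-time requirement noted above: a value
// only known at runtime cannot drive compile-time parsing, so this should not compile.
val runtimeCsv: String = scala.io.Source.fromFile("data.csv").mkString
// CsvColumnsReader.fromString(runtimeCsv) // expected compile error: argument must be an inline string literal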

bh.consume(row)
count += 1
end while
count

Copilot AI Dec 13, 2025


The added "end while" is good for consistency with Scala 3 style, but the indentation of the count return statement (line 35) appears incorrect - it should be aligned with the method body at the same level as the while loop, not indented to match the end while.

Suggested change
-     count
+   count

Comment on lines 222 to 241
def constructWithTypes[Hdrs <: Tuple: Type, ArrayData <: Tuple: Type]: Expr[NamedTuple[Hdrs, ArrayData]] =
  '{
    val content = $csvContentExpr
    val lines = content.linesIterator
    val (headers, iterator) = lines.headers($csvHeadersExpr, $delimiterExpr)

    val numCols = headers.length
    val buffers = Array.fill(numCols)(scala.collection.mutable.ArrayBuffer[String]())

    iterator.foreach { line =>
      val parsed = CSVParser.parseLine(line, $delimiterExpr)
      var i = 0
      while i < parsed.length && i < numCols do
        buffers(i) += parsed(i)
        i += 1
      end while
    }

    val typedColumns = ColumnsDecoder.decodeAllColumns[ArrayData](buffers)
    NamedTuple.build[Hdrs & Tuple]()(typedColumns)

Copilot AI Dec 13, 2025


The constructWithTypes function contains significant code duplication with similar functions in csv.scala (lines 484-508). The logic for creating buffers, parsing lines, and building NamedTuples is nearly identical. This duplication makes maintenance more difficult and increases the risk of inconsistencies. Consider extracting the common runtime logic into a shared helper function.

Suggested change
- def constructWithTypes[Hdrs <: Tuple: Type, ArrayData <: Tuple: Type]: Expr[NamedTuple[Hdrs, ArrayData]] =
-   '{
-     val content = $csvContentExpr
-     val lines = content.linesIterator
-     val (headers, iterator) = lines.headers($csvHeadersExpr, $delimiterExpr)
-     val numCols = headers.length
-     val buffers = Array.fill(numCols)(scala.collection.mutable.ArrayBuffer[String]())
-     iterator.foreach { line =>
-       val parsed = CSVParser.parseLine(line, $delimiterExpr)
-       var i = 0
-       while i < parsed.length && i < numCols do
-         buffers(i) += parsed(i)
-         i += 1
-       end while
-     }
-     val typedColumns = ColumnsDecoder.decodeAllColumns[ArrayData](buffers)
-     NamedTuple.build[Hdrs & Tuple]()(typedColumns)
+ // Helper to build a NamedTuple from CSV content, headers, delimiter, and a column decoder
+ private def buildNamedTupleFromCsv[Hdrs <: Tuple, ArrayData <: Tuple](
+     content: String,
+     csvHeaders: Option[Seq[String]],
+     delimiter: Char,
+     decodeColumns: Array[scala.collection.mutable.ArrayBuffer[String]] => ArrayData
+ ): NamedTuple[Hdrs, ArrayData] =
+   val lines = content.linesIterator
+   val (headers, iterator) = lines.headers(csvHeaders, delimiter)
+   val numCols = headers.length
+   val buffers = Array.fill(numCols)(scala.collection.mutable.ArrayBuffer[String]())
+   iterator.foreach { line =>
+     val parsed = CSVParser.parseLine(line, delimiter)
+     var i = 0
+     while i < parsed.length && i < numCols do
+       buffers(i) += parsed(i)
+       i += 1
+     end while
+   }
+   val typedColumns = decodeColumns(buffers)
+   NamedTuple.build[Hdrs & Tuple]()(typedColumns)
+ def constructWithTypes[Hdrs <: Tuple: Type, ArrayData <: Tuple: Type]: Expr[NamedTuple[Hdrs, ArrayData]] =
+   '{
+     buildNamedTupleFromCsv[Hdrs, ArrayData](
+       $csvContentExpr,
+       $csvHeadersExpr,
+       $delimiterExpr,
+       ColumnsDecoder.decodeAllColumns[ArrayData]
+     )

Comment on lines +349 to +424
// Helper to extract ReadAs expression from CsvOpts
private[scautable] def extractReadAs(optsExpr: Expr[CsvOpts])(using Quotes): Expr[ReadAs] =
  import quotes.reflect.*

  // Unwrap Inlined nodes
  def unwrapInlined(term: Term): Term = term match
    case Inlined(_, _, body) => unwrapInlined(body)
    case other => other

  // Helper to unwrap NamedArg nodes
  def unwrapNamedArg(term: Term): Term = term match
    case NamedArg(_, value) => value
    case other => other

  val readAsType = TypeRepr.of[ReadAs]
  val term = unwrapInlined(optsExpr.asTerm)

  // Helper to check if a term is ReadAs.Columns
  def isColumns(t: Term): Boolean =
    unwrapInlined(t) match
      case Select(_, "Columns") => true
      case _ => false

  // Helper to check if a term is a default parameter call
  def isDefaultParam(t: Term): Boolean =
    unwrapInlined(t) match
      case Select(_, name) if name.contains("$default$") => true
      case Ident(name) if name.startsWith("readAs$") => true // Variable binding for default
      case _ => false

  // Helper to extract ReadAs from a term, returning proper expression
  def extractFromTerm(t: Term): Expr[ReadAs] =
    val unwrapped = unwrapInlined(t)
    if isColumns(unwrapped) then '{ ReadAs.Columns }
    else if isDefaultParam(unwrapped) then '{ ReadAs.Rows }
    else if unwrapped.tpe <:< readAsType then unwrapped.asExprOf[ReadAs]
    else '{ ReadAs.Rows }
    end if
  end extractFromTerm

  term match
    // Handle Block with variable bindings from default parameters
    case Block(statements, Apply(_, args)) =>
      args
        .collectFirst { case NamedArg("readAs", value) =>
          extractFromTerm(value)
        }
        .orElse {
          // Look for a ReadAs argument by type
          args.map(unwrapNamedArg).collectFirst {
            case t if t.tpe <:< readAsType => extractFromTerm(t)
          }
        }
        .getOrElse('{ ReadAs.Rows })

    // Try to extract arguments from Apply node
    case Apply(_, args) =>
      args
        .collectFirst { case NamedArg("readAs", value) =>
          extractFromTerm(value)
        }
        .orElse {
          // Look for a ReadAs argument by type
          args.map(unwrapNamedArg).collectFirst {
            case t if t.tpe <:< readAsType => extractFromTerm(t)
          }
        }
        .getOrElse('{ ReadAs.Rows })

    case _ =>
      // Check if it's CsvOpts.default
      if optsExpr.matches('{ CsvOpts.default }) then '{ ReadAs.Rows }
      else
        report.info(s"Could not extract ReadAs from CsvOpts (using Rows): ${optsExpr.show}")
        '{ ReadAs.Rows }
  end match
end extractReadAs

Copilot AI Dec 13, 2025


The extractReadAs method follows a very similar pattern to extractHeaderOptions, extractTypeInferrer, and extractDelimiter. While some duplication is expected in extraction logic, there's significant overlap in the term unwrapping and pattern matching structure. Consider if these could share more common utility functions for term inspection and unwrapping.
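
A sketch of the shared term-inspection utility the comment points toward; TermOps is an invented name and the cases are lifted from extractReadAs above:

import scala.quoted.*

// Hypothetical shared helper for the unwrapping step that extractReadAs,
// extractHeaderOptions, extractTypeInferrer, and extractDelimiter each repeat.
private[scautable] object TermOps:
  def unwrap(using q: Quotes)(term: q.reflect.Term): q.reflect.Term =
    import q.reflect.*
    term match
      case Inlined(_, _, body) => unwrap(body)  // strip inlining wrappers
      case NamedArg(_, value)  => unwrap(value) // strip named-argument wrappers
      case other               => other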

Comment on lines +484 to +508
+ def constructColumnArrays[Hdrs <: Tuple: Type, ArrayData <: Tuple: Type]: Expr[NamedTuple[Hdrs, ArrayData]] =
+   val filePathExpr = Expr(path)
+   '{
+     val source = scala.io.Source.fromFile($filePathExpr)
+     val lines = source.getLines()
+     val (headers, iterator) = lines.headers($csvHeadersExpr, $delimiterExpr)

-       case '{ TypeInferrer.FromAllRows } =>
-         val inferredTypeRepr = InferrerOps.inferrer(iter, false, Int.MaxValue, delimiter)
-         inferredTypeRepr.asType match
-           case '[v] => constructWithTypes[hdrs & Tuple, v & Tuple]
-         end match
+     val numCols = headers.length
+     val buffers = Array.fill(numCols)(scala.collection.mutable.ArrayBuffer[String]())

-       case '{ TypeInferrer.FirstN(${ Expr(n) }) } =>
-         val inferredTypeRepr = InferrerOps.inferrer(iter, true, n, delimiter)
-         inferredTypeRepr.asType match
-           case '[v] => constructWithTypes[hdrs & Tuple, v & Tuple]
-         end match
+     iterator.foreach { line =>
+       val parsed = CSVParser.parseLine(line, $delimiterExpr)
+       var i = 0
+       while i < parsed.length && i < numCols do
+         buffers(i) += parsed(i)
+         i += 1
+       end while
+     }

-       case '{ TypeInferrer.FirstN(${ Expr(n) }, ${ Expr(preferIntToBoolean) }) } =>
-         val inferredTypeRepr = InferrerOps.inferrer(iter, preferIntToBoolean, n, delimiter)
-         inferredTypeRepr.asType match
-           case '[v] => constructWithTypes[hdrs & Tuple, v & Tuple]
-         end match
+     source.close()

-       case _ =>
-         report.throwError("Could not infer literal header tuple.")
-     end match
+     val typedColumns = ColumnsDecoder.decodeAllColumns[ArrayData](buffers)
+     NamedTuple.build[Hdrs & Tuple]()(typedColumns)
+   }
+ end constructColumnArrays

Copilot AI Dec 13, 2025


The constructColumnArrays function is duplicated in csvColumns.scala. The logic for creating buffers, parsing CSV lines, closing sources, and decoding columns is identical. This creates unnecessary maintenance overhead. The same pattern appears in both readHeaderlineAsCsv and readCsvFromString methods.

Comment on lines +510 to +598
if !isColumnMode then
  headerTupleExpr match
    case '{ $tup: hdrs } =>
      typeInferrerExpr match

        case '{ TypeInferrer.FromTuple[t]() } =>
          constructRowIterator[hdrs & Tuple, t & Tuple]

        case '{ TypeInferrer.StringType } =>
          constructRowIterator[hdrs & Tuple, StringyTuple[hdrs & Tuple] & Tuple]

        case '{ TypeInferrer.FirstRow } =>
          val inferredTypeRepr = InferrerOps.inferrer(iter, true, delimiter = delimiter)
          inferredTypeRepr.asType match
            case '[v] =>
              constructRowIterator[hdrs & Tuple, v & Tuple]
          end match

        case '{ TypeInferrer.FromAllRows } =>
          val inferredTypeRepr = InferrerOps.inferrer(iter, false, Int.MaxValue, delimiter)
          inferredTypeRepr.asType match
            case '[v] => constructRowIterator[hdrs & Tuple, v & Tuple]
          end match

        case '{ TypeInferrer.FirstN(${ Expr(n) }) } =>
          val inferredTypeRepr = InferrerOps.inferrer(iter, true, n, delimiter)
          inferredTypeRepr.asType match
            case '[v] => constructRowIterator[hdrs & Tuple, v & Tuple]
          end match

        case '{ TypeInferrer.FirstN(${ Expr(n) }, ${ Expr(preferIntToBoolean) }) } =>
          val inferredTypeRepr = InferrerOps.inferrer(iter, preferIntToBoolean, n, delimiter)
          inferredTypeRepr.asType match
            case '[v] => constructRowIterator[hdrs & Tuple, v & Tuple]
          end match

        case _ =>
          report.throwError("Could not infer literal header tuple.")
  end match
else // isColumnMode
  headerTupleExpr match
    case '{ $tup: hdrs } =>
      typeInferrerExpr match

        case '{ TypeInferrer.FromTuple[t]() } =>
          val arrayTypeRepr = toArrayTupleType(TypeRepr.of[t])
          arrayTypeRepr.asType match
            case '[arrTup] => constructColumnArrays[hdrs & Tuple, arrTup & Tuple]
          end match

        case '{ TypeInferrer.StringType } =>
          val stringyType = TypeRepr.of[StringyTuple[hdrs & Tuple]]
          val arrayTypeRepr = toArrayTupleType(stringyType)
          arrayTypeRepr.asType match
            case '[arrTup] => constructColumnArrays[hdrs & Tuple, arrTup & Tuple]
          end match

        case '{ TypeInferrer.FirstRow } =>
          val inferredTypeRepr = InferrerOps.inferrer(iter, true, delimiter = delimiter)
          val arrayTypeRepr = toArrayTupleType(inferredTypeRepr)
          arrayTypeRepr.asType match
            case '[arrTup] => constructColumnArrays[hdrs & Tuple, arrTup & Tuple]
          end match

        case '{ TypeInferrer.FromAllRows } =>
          val inferredTypeRepr = InferrerOps.inferrer(iter, false, Int.MaxValue, delimiter)
          val arrayTypeRepr = toArrayTupleType(inferredTypeRepr)
          arrayTypeRepr.asType match
            case '[arrTup] => constructColumnArrays[hdrs & Tuple, arrTup & Tuple]
          end match

        case '{ TypeInferrer.FirstN(${ Expr(n) }) } =>
          val inferredTypeRepr = InferrerOps.inferrer(iter, true, n, delimiter)
          val arrayTypeRepr = toArrayTupleType(inferredTypeRepr)
          arrayTypeRepr.asType match
            case '[arrTup] => constructColumnArrays[hdrs & Tuple, arrTup & Tuple]
          end match

        case '{ TypeInferrer.FirstN(${ Expr(n) }, ${ Expr(preferIntToBoolean) }) } =>
          val inferredTypeRepr = InferrerOps.inferrer(iter, preferIntToBoolean, n, delimiter)
          val arrayTypeRepr = toArrayTupleType(inferredTypeRepr)
          arrayTypeRepr.asType match
            case '[arrTup] => constructColumnArrays[hdrs & Tuple, arrTup & Tuple]
          end match

        case _ =>
          report.throwError("Could not infer literal header tuple.")
  end match
end if

Copilot AI Dec 13, 2025


The row-mode and column-mode branches (lines 510-598) contain extensive duplication of the type inference pattern matching logic. Each TypeInferrer case (FromTuple, StringType, FirstRow, FromAllRows, FirstN with variations) is handled twice with nearly identical code except for calling either constructRowIterator or constructColumnArrays. This doubles the maintenance burden and increases the risk of inconsistent behavior. Consider refactoring to parameterize the construction function.
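
A rough sketch of the parameterisation, assuming the enclosing macro scope from the code above (Quotes in scope, plus toArrayTupleType, constructRowIterator, and constructColumnArrays); resolveAndConstruct is an invented helper name:

// Rough sketch, not the merged implementation: each TypeInferrer case computes
// inferredTypeRepr once, then calls this single dispatcher instead of duplicating
// the asType match for rows and columns.
def resolveAndConstruct[Hdrs <: Tuple: Type](
    inferredTypeRepr: TypeRepr,
    isColumnMode: Boolean
): Expr[Any] =
  if isColumnMode then
    toArrayTupleType(inferredTypeRepr).asType match
      case '[arrTup] => constructColumnArrays[Hdrs, arrTup & Tuple]
  else
    inferredTypeRepr.asType match
      case '[v] => constructRowIterator[Hdrs, v & Tuple]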

- CsvOpts(headers, typeInferrer, ',')
+ CsvOpts(headers, typeInferrer, ',', ReadAs.Rows)

/** CSV options with readAs mode */

Copilot AI Dec 13, 2025


The CsvOpts.apply overload for readAs parameter lacks documentation. While other overloads have doc comments explaining their purpose, this new convenience method should also document when and why to use it (e.g., "CSV options with readAs mode - use when you only need to specify whether to read as rows or columns").

Suggested change
- /** CSV options with readAs mode */
+ /**
+  * CSV options with readAs mode.
+  *
+  * Use this overload when you only need to specify whether to read the CSV as rows (iterator) or columns (arrays).
+  * All other options (headers, type inference, delimiter) use their defaults.
+  *
+  * @param readAs Whether to read as rows (iterator) or columns (arrays)
+  */

Comment on lines 213 to 220
// Convert a value type tuple (Int, String, Double) to array type tuple (Array[Int], Array[String], Array[Double])
def toArrayTupleType(valueTypeRepr: TypeRepr): TypeRepr =
  valueTypeRepr.asType match
    case '[EmptyTuple] => TypeRepr.of[EmptyTuple]
    case '[h *: t] =>
      val headArrayType = TypeRepr.of[Array].appliedTo(TypeRepr.of[h])
      val tailArrayType = toArrayTupleType(TypeRepr.of[t])
      TypeRepr.of[*:].appliedTo(List(headArrayType, tailArrayType))

Copilot AI Dec 13, 2025


The toArrayTupleType function is duplicated in both csvColumns.scala and csv.scala. This creates a maintainability issue where changes need to be made in two places. Consider extracting this function to a shared utility object or trait that both files can use.
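
A minimal sketch of that extraction; TupleTypeOps is a hypothetical name and the body mirrors the function shown above:

import scala.quoted.*

// Hypothetical shared home for toArrayTupleType so csv.scala and csvColumns.scala
// can both import one definition instead of maintaining two copies.
private[scautable] object TupleTypeOps:
  def toArrayTupleType(using q: Quotes)(valueTypeRepr: q.reflect.TypeRepr): q.reflect.TypeRepr =
    import q.reflect.*
    valueTypeRepr.asType match
      case '[EmptyTuple] => TypeRepr.of[EmptyTuple]
      case '[h *: t] =>
        val headArrayType = TypeRepr.of[Array].appliedTo(TypeRepr.of[h])
        TypeRepr.of[*:].appliedTo(List(headArrayType, toArrayTupleType(TypeRepr.of[t])))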

@Quafadas
Owner Author

@copilot can you fix CI?

Also, please read the changes and remove extraneous code, in particular the API specific to this change.

Also, update docs.

Contributor

Copilot AI commented Dec 13, 2025

@Quafadas I've opened a new pull request, #113, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI and others added 4 commits December 16, 2025 21:34
… parameter (#113)

* Initial plan

* Plan: Remove CsvColumnsReader API, keep unified CSV with ReadAs, fix CI, update docs

Co-authored-by: Quafadas <24899792+Quafadas@users.noreply.github.com>

* Remove CsvColumnsReader API, update tests and docs for unified CSV with ReadAs

Co-authored-by: Quafadas <24899792+Quafadas@users.noreply.github.com>

* Apply scalafmt formatting and fix duplicate end while in benchmark

Co-authored-by: Quafadas <24899792+Quafadas@users.noreply.github.com>

* Fix documentation comment for CsvOpts.apply(headers) to match implementation

Co-authored-by: Quafadas <24899792+Quafadas@users.noreply.github.com>

* Fix CI: Resolve variable name conflict in ColumnOrient.md documentation

Co-authored-by: Quafadas <24899792+Quafadas@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Quafadas <24899792+Quafadas@users.noreply.github.com>
Quafadas merged commit 0e35b07 into main on Dec 19, 2025
5 checks passed