## 0.15: New Features

- Experimental new CSV parser based on [Deephaven-CSV](https://github.com/deephaven/deephaven-csv).
- Experimental new `GeoDataFrame` class for working with geographical data (from GeoJson/Shapefile) and plotting it with [Kandy](https://github.com/Kotlin/kandy).
- Full `BigInteger` support.
- Custom SQL Database support by passing the `dbType` parameter to read functions.
- Improved parsing.
- Custom SQL DB registration.

### Experimental new CSV parser based on Deephaven-CSV

DataFrame's CSV parsing has been based on [Apache Commons CSV](https://commons.apache.org/proper/commons-csv/) from the beginning. While this has been sufficient for most applications, it had some issues: it could run out of memory on large files, it was slow, and our API lacked clarity, documentation, and completeness.

For DataFrame 0.15, we introduce a new, separate package [`org.jetbrains.kotlinx:dataframe-csv`](https://central.sonatype.com/artifact/org.jetbrains.kotlinx/dataframe-csv) which tries to solve all these issues at once. It's based on [Deephaven-CSV](https://github.com/deephaven/deephaven-csv), which makes it faster and more memory-efficient. And since we built the API from the ground up, we made sure it is complete, predictable, and carefully documented.

To try it yourself, explicitly add the dependency [`org.jetbrains.kotlinx:dataframe-csv`](https://central.sonatype.com/artifact/org.jetbrains.kotlinx/dataframe-csv) to your project or notebook, as such:

In [1]:
// this needs to be called before importing dataframe itself
USE {
    dependencies {
        implementation("org.jetbrains.kotlinx:dataframe-csv:0.15.0-RC2")
    }
}

In [2]:
%useLatestDescriptors
%use dataframe(v=0.15.0-RC2)
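
In a Gradle project rather than a notebook, the same dependency can be declared in the build script. This is a minimal sketch that assumes a Kotlin DSL build file (`build.gradle.kts`) and that the main `org.jetbrains.kotlinx:dataframe` artifact is used alongside the new module; adjust the versions to whatever your project uses.

```kotlin
// build.gradle.kts (assumed file name, Kotlin DSL)
dependencies {
    // main DataFrame library (assumption: your project already depends on it)
    implementation("org.jetbrains.kotlinx:dataframe:0.15.0-RC2")
    // the new, separate CSV module described above
    implementation("org.jetbrains.kotlinx:dataframe-csv:0.15.0-RC2")
}
```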

Given a large CSV file, such as the one below, running out of memory is still possible but now less likely:

In [6]:
// Old csv function:
DataFrame.readCSV(
    "../../../../dataframe-csv/src/test/resources/largeCsv.csv.gz",
)

java.lang.OutOfMemoryError: Ran out of memory reading this CSV-like file. You can try our new experimental CSV reader by adding the dependency "org.jetbrains.kotlinx:dataframe-csv:{VERSION}" and using `DataFrame.readCsv()` instead of `DataFrame.readCSV()`.

In [3]:
// New csv function:
DataFrame.readCsv(
    "../../../../dataframe-csv/src/test/resources/largeCsv.csv.gz",
)

Year,Age,Ethnic,Sex,Area,count
2018,0,1,1,1,795
2018,0,1,1,2,5067
2018,0,1,1,3,2229
2018,0,1,1,4,1356
2018,0,1,1,5,180
2018,0,1,1,6,738
2018,0,1,1,7,630
2018,0,1,1,8,1188
2018,0,1,1,9,2157
2018,0,1,1,12,177


40 million rows! Not bad, right? Most of the speed increase is due to Deephaven-CSV's ability to parse columns directly to the target type, like `Int` or `Double`, instead of parsing everything as a `String` and then converting it. DataFrame still reads everything into (boxed) memory, so there are limits to the size of the file you can read, but the CSV reader is no longer the limiting factor.
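
To make the difference between the two parsing strategies concrete, here is a toy sketch in plain Kotlin. It is not how Deephaven-CSV or DataFrame is implemented; the function names are hypothetical and the input is assumed to be a single column of non-negative integers separated by `\n`. The first function creates a `String` per cell and converts afterwards, while the second accumulates digits straight into a primitive `IntArray` without any intermediate strings.

```kotlin
// Toy illustration only: this is NOT Deephaven-CSV's implementation.

/** Two-step approach: materialize every cell as a String, then convert it. */
fun parseViaStrings(column: String): List<Int> =
    column.lines()           // allocates one String per cell
        .map { it.toInt() }  // second pass; the result is a List of boxed Ints

/** Direct approach: build each number digit by digit into a primitive array. */
fun parseDirect(column: String): IntArray {
    // one slot per line; assumes no trailing newline
    val result = IntArray(column.count { it == '\n' } + 1)
    var index = 0
    var current = 0
    for (ch in column) {
        if (ch == '\n') {
            result[index++] = current
            current = 0
        } else {
            // assumes ch is an ASCII digit
            current = current * 10 + (ch - '0')
        }
    }
    result[index] = current
    return result
}

fun main() {
    val column = "795\n5067\n2229"
    println(parseViaStrings(column))
    println(parseDirect(column).toList())
}
```

Both functions return the same numbers; the second one simply skips the intermediate `String` objects, which is the kind of saving the paragraph above refers to.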

Switching to the new API is, in most cases, as easy as replacing `readCSV` with `readCsv` (and `readTSV` with `readTsv`, etc.). However, there are a few differences in the API, so be sure to check the KDocs of the new functions.

Here's a small demonstration of the new API:

In [9]:
import java.util.Locale

DataFrame.readCsv(
    "../../../../dataframe-csv/src/test/resources/irisDataset.csv",
    delimiter = ',',

    // overwriting the given header
    header = listOf("sepalLength", "sepalWidth", "petalLength", "petalWidth", "species"),

    // skipping the first line in the file with old header
    skipLines = 1,

    // reading only 50 lines
    readLines = 50,

    // manually specifying the types of the columns, will be inferred otherwise
    colTypes = mapOf(
        "species" to ColType.String, // setting the type of the species column to String
        ColType.DEFAULT to ColType.Double, // setting type of all other columns to Double
    ),

    // manually specifying some parser options
    // Will be read from the global parser options `DataFrame.parser` otherwise
    parserOptions = ParserOptions(
        // setting the locale to US, uses `DataFrame.parser.locale` or `Locale.getDefault()` otherwise
        locale = Locale.US,
        // overriding null strings
        nullStrings = DEFAULT_DELIM_NULL_STRINGS + "nothing",
        // using the new faster double parser, true by default for readCsv
        useFastDoubleParser = true,
    ),

    // new! specifying the quote character
    quote = '\"',

    // specifying whether to ignore empty lines in between rows in the file, and plenty more options...
    ignoreEmptyLines = false,
    allowMissingColumns = true,
    ignoreSurroundingSpaces = true,
    trimInsideQuoted = false,
    parseParallel = true,
)

sepalLength,sepalWidth,petalLength,petalWidth,species
5.1,3.5,1.4,0.2,Setosa
4.9,3.0,1.4,0.2,Setosa
4.7,3.2,1.3,0.2,Setosa
4.6,3.1,1.5,0.2,Setosa
5.0,3.6,1.4,0.2,Setosa
5.4,3.9,1.7,0.4,Setosa
4.6,3.4,1.4,0.3,Setosa
5.0,3.4,1.5,0.2,Setosa
4.4,2.9,1.4,0.2,Setosa
4.9,3.1,1.5,0.1,Setosa


Since Deephaven-CSV supports it, we can now also read files whose columns are separated by multiple spaces, such as logs:

In [16]:
DataFrame.readDelimStr(
    """
    NAME                     STATUS   AGE      LABELS
    argo-events              Active   2y77d    app.kubernetes.io/instance=argo-events,kubernetes.io/metadata.name=argo-events
    argo-workflows           Active   2y77d    app.kubernetes.io/instance=argo-workflows,kubernetes.io/metadata.name=argo-workflows
    argocd                   Active   5y18d    kubernetes.io/metadata.name=argocd
    beta                     Active   4y235d   kubernetes.io/metadata.name=beta
    """.trimIndent(),
    hasFixedWidthColumns = true,
)

NAME,STATUS,AGE,LABELS
argo-events,Active,2y77d,app.kubernetes.io/instance=argo-event...
argo-workflows,Active,2y77d,app.kubernetes.io/instance=argo-workf...
argocd,Active,5y18d,kubernetes.io/metadata.name=argocd
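
For intuition, splitting on fixed-width columns can be sketched in plain Kotlin as follows. This is only an illustration under assumptions, not the library's algorithm: the hypothetical `parseFixedWidth` function assumes that column boundaries can be inferred from where each header name starts, and that header names contain no spaces.

```kotlin
/**
 * Toy sketch: infer column start offsets from the header row,
 * then cut every row (header included) at those offsets.
 */
fun parseFixedWidth(text: String): List<List<String>> {
    val lines = text.lines().filter { it.isNotBlank() }
    val header = lines.first()
    // A column starts wherever a non-space character follows a space (or at index 0).
    val starts = header.indices.filter { i ->
        header[i] != ' ' && (i == 0 || header[i - 1] == ' ')
    }
    return lines.map { line ->
        starts.mapIndexed { n, start ->
            // the last column runs to the end of the line
            val end = if (n + 1 < starts.size) starts[n + 1] else line.length
            if (start >= line.length) "" else line.substring(start, minOf(end, line.length)).trim()
        }
    }
}

fun main() {
    val text = """
        NAME          STATUS   AGE
        argo-events   Active   2y77d
        argocd        Active   5y18d
    """.trimIndent()
    parseFixedWidth(text).forEach(::println)
}
```

Cutting by offset rather than splitting on whitespace means a cell may itself contain single spaces, as long as it stays within its column.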


We provide a single overload (taking an `InputStream`) that exposes the underlying implementation, for when our API is not sufficient for your needs.

In [18]:
import io.deephaven.csv.containers.ByteSlice
import io.deephaven.csv.tokenization.Tokenizer
import java.io.InputStream

DataFrame.readCsv(
    inputStream = File("../../../../dataframe-csv/src/test/resources/irisDataset.csv").inputStream(),
    adjustCsvSpecs = {
        this
            .headerLegalizer {
                it.map { it.lowercase().replace('.', '_') }.toTypedArray()
            }
            .customDoubleParser(object : Tokenizer.CustomDoubleParser {
                override fun parse(bs: ByteSlice?): Double = TODO("Not yet implemented")
                override fun parse(cs: CharSequence?): Double = TODO("Not yet implemented")
            })
            // etc..
    },
)

sepal_length,sepal_width,petal_length,petal_width,variety
5.1,3.5,1.4,0.2,Setosa
4.9,3.0,1.4,0.2,Setosa
4.7,3.2,1.3,0.2,Setosa
4.6,3.1,1.5,0.2,Setosa
5.0,3.6,1.4,0.2,Setosa
5.4,3.9,1.7,0.4,Setosa
4.6,3.4,1.4,0.3,Setosa
5.0,3.4,1.5,0.2,Setosa
4.4,2.9,1.4,0.2,Setosa
4.9,3.1,1.5,0.1,Setosa


Finally, we now support reading from ZIP files directly, along with GZIP (already demonstrated above) and custom compression formats:

In [23]:
DataFrame.readCsv(
    "../../../../dataframe-csv/src/test/resources/testCSV.zip",
    // this can be manually specified, but is inferred automatically from the file extension
    // compression = Compression.Zip,
)

untitled,user_id,name,duplicate,username,duplicate1,duplicate11,double,number,time,empty
0,4,George,,abc,a,,1203.0,599.213,2021-01-07T15:12:32,
1,5,Paul,,paul,,,,214.211,2021-01-14T14:36:19,
2,8,Johnny,,qwerty,b,,20.0,412.214,2021-02-23T19:47,
3,10,Jack,,buk,,,2414.0,1.01,2021-03-08T23:38:52,
4,12,Samuel,,qwerty,,,inf,0.0,2021-04-01T02:30:22,


In [20]:
USE { dependencies("org.tukaani:xz:1.10", "org.apache.commons:commons-compress:1.27.1") }

In [22]:
import org.apache.commons.compress.archivers.tar.TarFile
import org.apache.commons.io.IOUtils
import org.apache.commons.compress.utils.SeekableInMemoryByteChannel

// custom compression format by specifying how to convert a compressed InputStream to a normal one
val tarCompression = Compression.Custom({ tarInputStream ->
    val tar = TarFile(SeekableInMemoryByteChannel(IOUtils.toByteArray(tarInputStream)))
    tar.getInputStream(tar.entries.first())
})

DataFrame.readCsv("irisDataset.tar", compression = tarCompression)

sepal.length,sepal.width,petal.length,petal.width,variety
5.1,3.5,1.4,0.2,Setosa
4.9,3.0,1.4,0.2,Setosa
4.7,3.2,1.3,0.2,Setosa
4.6,3.1,1.5,0.2,Setosa
5.0,3.6,1.4,0.2,Setosa
5.4,3.9,1.7,0.4,Setosa
4.6,3.4,1.4,0.3,Setosa
5.0,3.4,1.5,0.2,Setosa
4.4,2.9,1.4,0.2,Setosa
4.9,3.1,1.5,0.1,Setosa
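
The lambda in the custom compression example is simply a function from a compressed `InputStream` to a decompressed one. The sketch below shows that shape in isolation, using only the JDK. GZIP is used purely because the standard library ships it (the reader already handles GZIP on its own, as shown earlier), and the names `unwrapGzip` and `gzip` are hypothetical helpers for this illustration.

```kotlin
import java.io.ByteArrayInputStream
import java.io.ByteArrayOutputStream
import java.io.InputStream
import java.util.zip.GZIPInputStream
import java.util.zip.GZIPOutputStream

// Same shape as the lambda in the tar example:
// it takes the raw (compressed) stream and returns a decompressed one.
val unwrapGzip: (InputStream) -> InputStream = { compressed -> GZIPInputStream(compressed) }

/** Demo helper: gzip a string in memory. */
fun gzip(text: String): ByteArray {
    val buffer = ByteArrayOutputStream()
    GZIPOutputStream(buffer).use { it.write(text.toByteArray()) }
    return buffer.toByteArray()
}

fun main() {
    val compressed = gzip("a,b\n1,2\n")
    // Round trip: the wrapper turns the compressed bytes back into readable CSV text.
    val plain = unwrapGzip(ByteArrayInputStream(compressed)).bufferedReader().readText()
    println(plain)
}
```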


Writing is also supported; it still uses Apache Commons CSV under the hood.
The API is similar to the reading API:

In [24]:
val irisDf = DataFrame.readCsv("../../../../dataframe-csv/src/test/resources/irisDataset.csv")

irisDf.writeCsv("irisDataset.csv")

Some options can be specified:

In [32]:
irisDf.writeDelim(
    path = "irisDataset.csv",
    delimiter = ';',
    includeHeader = false,
    quoteMode = QuoteMode.ALL,
    escapeChar = '\\',
    commentChar = '#',
    headerComments = listOf("This is a comment", "This is another comment"),
    recordSeparator = "\n",
)

Similarly, a single overload exposes the underlying implementation:

In [31]:
irisDf.writeCsv(
    writer = File("irisDataset.csv").writer(),
    adjustCsvFormat = {
        this
            .setSkipHeaderRecord(true)
            .setHeader("sepalLength", "sepalWidth", "petalLength", "petalWidth", "species")
            .setTrailingData(true)
            .setNullString("null")
            // etc..
    },
)