## Train Disruptions in the Netherlands

<img alt="ns-delay.jpg" src="ns-delay.jpg" width="1000"/>

### Exploring data from Rijden de Treinen using [Kotlin for Data Analysis](https://kotlinlang.org/docs/data-analysis-overview.html)

Data from https://www.rijdendetreinen.nl/en/open-data/

Let's find out together:
- What causes delays?
- What's the longest delay in 2023 and where did it occur?
- Which track or station had the most delays in 2023?
- Do I get to complain about Dutch trains in live demos? (I came by car)

[disruptions 2023 CSV](data/disruptions/disruptions-2023.csv)

In [None]:
%use dataframe

In [None]:
val disruptions2023 = DataFrame.readCSV("data/disruptions/disruptions-2023.csv")
disruptions2023.schema()

Looking at the schema, we can see that most of the data was parsed correctly.
`rdt_lines_id: Double?` is a mistake, though.

From the website: "These are the IDs of the lines linked to a disruption by Rijden de Treinen, separated by a comma."
Understandably, a value like `"24,32"` is parsed as a `Double` instead of a `String`. Let's nudge the parser in the right direction
by supplying a manual type for this column when reading the data.

Let's also rename it to camel case while we're at it.
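
To see why a value like `"24,32"` is ambiguous, here is a minimal plain-Kotlin sketch. It does not describe how the DataFrame parser works internally; it only assumes that a parser which accepts decimal commas would happily read the text as a number. `Locale.GERMANY` and the function names `parseWithDecimalComma` and `parseLineIds` are illustrative choices, not part of the dataset or the library.

```kotlin
import java.text.NumberFormat
import java.util.Locale

// Reads the text as a decimal number using a locale with a decimal comma.
// Locale.GERMANY is only an example of such a locale (assumption).
fun parseWithDecimalComma(text: String): Double =
    NumberFormat.getInstance(Locale.GERMANY).parse(text).toDouble()

// What the column actually means: a comma-separated list of line IDs.
fun parseLineIds(text: String): List<Int> =
    text.split(",").map { it.trim().toInt() }

val asNumber = parseWithDecimalComma("24,32") // 24.32
val asIds = parseLineIds("24,32")             // [24, 32]
```

Both readings are valid for the same text, which is why telling the reader that the column is a `String` removes the guesswork.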

In [None]:
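val disruptions2023 = DataFrame.readCSV("data/disruptions/disruptions-2023.csv", colTypes = mapOf("rdt_lines_id" to ColType.String)).renameToCamelCase()
disruptions2023.schema()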

Now the schema looks better! One of the best things about using DataFrame in notebooks
is that type-safe column accessors are generated for you between cell executions!

In [None]:
disruptions2023.rdtLinesId

We can actually make this hidden process visible by tracking all code that's executed under the hood.

Libraries for the Kotlin Jupyter kernel and notebooks can be very powerful, as you can see!

In [None]:
%trackExecution


In [None]:
%trackExecution off

In [None]:
val a = dataFrame.a
val b = dataFrame.b

a

Anyway, let's get back to our data!

Let's make our data easier to work with:
- We already renamed to camelCase
- Remove Dutch columns in favor of English ones
- Remove NS and `cause` columns (in favor of rdt columns and statisticalCause respectively)
- Drop rows where durationMinutes is `null`
- Add helper columns for just the `date: LocalDate` and `duration: kotlin.time.Duration` for easier viewing and plotting
- Parse comma-separated columns as lists

For an overview, check out [DataFrame Operations](https://kotlin.github.io/dataframe/operations.html)
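
As a rough illustration of two of these steps (dropping rows without a duration and deriving the `date` and `duration` helper values), here is a plain-Kotlin sketch on a simplified row type. `RawRow`, `CleanRow`, `clean`, and the sample values are hypothetical, and `java.time` is used only to keep the sketch self-contained; the notebook itself does this with DataFrame operations.

```kotlin
import java.time.LocalDate
import java.time.LocalDateTime
import kotlin.time.Duration
import kotlin.time.Duration.Companion.minutes

// Hypothetical, simplified row shapes; the real data has more columns.
data class RawRow(val startTime: LocalDateTime, val durationMinutes: Int?)
data class CleanRow(val date: LocalDate, val duration: Duration)

// Drops rows whose durationMinutes is null and derives the two helper values.
fun clean(rows: List<RawRow>): List<CleanRow> =
    rows.mapNotNull { row ->
        row.durationMinutes?.let { CleanRow(row.startTime.toLocalDate(), it.minutes) }
    }

// Made-up sample values, for illustration only.
val rawSample = listOf(
    RawRow(LocalDateTime.parse("2023-01-01T08:15:00"), 90),
    RawRow(LocalDateTime.parse("2023-01-02T17:40:00"), null),
    RawRow(LocalDateTime.parse("2023-01-03T06:05:00"), 12),
)
val cleanSample = clean(rawSample)
```

A `kotlin.time.Duration` keeps the unit attached to the value, which makes it harder to mix up minutes and hours later on.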

In [None]:
// before
disruptions2023

In [None]:
import kotlin.time.Duration.Companion.minutes

val df1 = disruptions2023

    // we remove nsLines, Dutch columns, and causeEn (as statisticalCauseEn is better according to the docs)
    

    // let's also remove some rows where durationMinutes == null
    
    
    // Parsing minutes into kotlin.time.Duration and creating an extra date column
    

    // renaming columns to remove "rdt" and "En" from the beginning and end
    

df1

Almost perfect! However, we still have some list-like columns. We can split those into lists to make them more manageable.

In [None]:
val df2 = df1
    // splitting lines, linesId, stationNames, stationCodes by ","
    
    
    // converting linesId from List<String> to List<Int>
    .convert { linesId.cast<List<String>>() }.with { ids -> ids.map { it.trim().toInt() } }

df2

In [None]:
df2.schema()

Done! Now let's get to work!

Remember, we wanted to find:

- What's the longest delay in 2023 and where did it occur?
- What causes delays?
- Which track or station had the most delays in 2023?
- Do I get to complain about Dutch trains in live demos? (I came by car)


## Longest delay in 2023?

In [None]:
df2


## What causes delays?

I'm actually quite interested in these causes and what makes up a "cause group".
Let's find all groups and see what causes are inside :)

Note the nested DataFrames :)
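
To make the shape of the result concrete, here is a plain-Kotlin sketch of the same idea: group by cause group, count each cause within the group, and put the groups with the most distinct causes first. `CauseRow`, `causesPerGroup`, and the group and cause names are made up for illustration; the nested map plays the role of the nested DataFrame.

```kotlin
// Hypothetical, simplified row; the group and cause names below are made up.
data class CauseRow(val causeGroup: String, val statisticalCause: String)

// For each cause group, counts how often each cause occurs, then sorts the
// groups descending by their number of distinct causes.
fun causesPerGroup(rows: List<CauseRow>): List<Pair<String, Map<String, Int>>> =
    rows.groupBy { it.causeGroup }
        .mapValues { (_, group) -> group.groupingBy { it.statisticalCause }.eachCount() }
        .toList()
        .sortedByDescending { (_, causes) -> causes.size }

val causeSample = listOf(
    CauseRow("group A", "cause 1"),
    CauseRow("group A", "cause 2"),
    CauseRow("group A", "cause 1"),
    CauseRow("group B", "cause 3"),
)
val causeSummary = causesPerGroup(causeSample)
```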

In [None]:
df2
    // group by causeGroup and get `valueCounts()` of statisticalCause into "causes"
    
    // sort descending by the number of rows in causes 
    

## Which line had the most delays?

To find the line with the most delays, we first need to explode the `lines` column
to get a separate row for each line, then group by the `lines` column and count how many rows 
we get per individual line. Finally, sort descending by count.
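
The steps above can be sketched in plain Kotlin collections, which may help to see what "explode" means: a row with several lines becomes several rows with one line each. `LineRow`, `countPerLine`, and the line names are hypothetical and only serve the illustration.

```kotlin
// Hypothetical, simplified row: one disruption that affects several lines.
data class LineRow(val id: Int, val lines: List<String>)

fun countPerLine(rows: List<LineRow>): List<Pair<String, Int>> =
    rows
        // "explode": one (line, row) pair per line of each disruption
        .flatMap { row -> row.lines.map { line -> line to row } }
        // group by line and count the pairs in each group
        .groupingBy { (line, _) -> line }
        .eachCount()
        .toList()
        // sort descending by count
        .sortedByDescending { (_, count) -> count }

// Made-up sample values, for illustration only.
val lineSample = listOf(
    LineRow(1, listOf("Line A", "Line B")),
    LineRow(2, listOf("Line A")),
    LineRow(3, listOf("Line A", "Line C")),
)
val lineCounts = countPerLine(lineSample)
```

Note that a disruption affecting two lines is counted once for each of them, so the counts can add up to more than the number of disruptions.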

In [None]:
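val byLines = df2
    // explode lines
    .explode { lines }
    // groupBy lines
    .groupBy { lines }

// count rows per line and sort descending by count
byLines.count().sortByDesc("count")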

Well, what a surprise that was!

Now, this was per line. What about per station?
The data also lists the stations affected by each disruption in the `stationNames` column.

Let's do the same as before:

In [None]:
val byStation = df2
    .explode { stationNames }
    .groupBy { stationNames }

byStation.count().sortByDesc("count")

Interesting! We have another 'winner'.

Let's get some more information about the duration of the delay, because just a count doesn't tell the whole story.
We can `describe()` the `duration` column to get statistical details about it.
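
As a small illustration of why summary statistics matter here, the sketch below computes a few of them by hand on made-up durations in minutes. It is not a reimplementation of `describe()`, and the exact set of statistics that `describe()` reports is not reproduced; `DurationSummary` and `summarize` are hypothetical names.

```kotlin
// A handful of summary statistics for a list of durations in minutes.
data class DurationSummary(
    val count: Int,
    val min: Int,
    val median: Double,
    val mean: Double,
    val max: Int,
)

fun summarize(minutes: List<Int>): DurationSummary {
    require(minutes.isNotEmpty()) { "need at least one value" }
    val sorted = minutes.sorted()
    val mid = sorted.size / 2
    // For an even number of values, the median is the average of the two middle ones.
    val median =
        if (sorted.size % 2 == 1) sorted[mid].toDouble()
        else (sorted[mid - 1] + sorted[mid]) / 2.0
    return DurationSummary(sorted.size, sorted.first(), median, sorted.average(), sorted.last())
}

// Made-up durations: three short delays and one very long one.
val summary = summarize(listOf(5, 10, 20, 600))
```

With these sample values the median is 15.0 minutes while the mean is 158.75 minutes: a single long disruption pulls the mean far away from the typical case.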

In [None]:
byStation.aggregate {
    duration.describe().first() into "duration"
}

I don't know about you, but this requires some visualization, doesn't it?

Let's use Kandy, as it has excellent integration with notebooks and DataFrame.

In [None]:
%use kandy

Let's take a look at the examples: https://kotlin.github.io/kandy/examples.html

A boxplot looks like the best way to show the duration distribution for the top 10 "worst" stations.

In [None]:
val top10 =
    byStation.sortByGroupDesc {
        count()
//    durationMinutes.mean()
//    count() * durationMinutes.median()
//    count() * durationMinutes.mean()
    }
        .filter { it.index() < 10 }
        .concat()

top10

In [None]:
top10.plot {
    boxplot(x = stationNames, y = durationMinutes) {
        boxes.fillColor(stationNames.distinct()) {
            legend.type = LegendType.None
        }
        y {
            scale = continuous(transform = Transformation.LOG10)
        }
    }

    layout.size = 1000 to 500
}

## Do I get to complain about Dutch trains in a demo?