## Analyzing train disruptions in the Netherlands

Data source: https://www.rijdendetreinen.nl/en/open-data/disruptions#downloads

In [None]:
disruptions2023.schema()

Looking at the schema, we can see that most of the data was parsed correctly.
`rdt_lines_id: Double?` is a mistake, though.
 
From the website: "These are the IDs of the lines linked to a disruption by Rijden de Treinen, separated by a comma."
Understandably, a value like `"24,32"` is parsed as a `Double` instead of a `String`, since a comma can also be read as a decimal separator. Let's nudge the parser in the right direction when reading the data
by supplying a manual type for this column.
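
To see why the misparse is plausible, here is a small illustration using the JDK's own `NumberFormat` (not DataFrame's parser, whose exact behaviour is an assumption here). In a Dutch locale the comma is the decimal separator, so `"24,32"` is a perfectly valid number:

```kotlin
import java.text.NumberFormat
import java.util.Locale

// Illustration only: this is the JDK's NumberFormat, not DataFrame's internal parser.
// In the Dutch locale the comma acts as the decimal separator.
val dutchFormat: NumberFormat = NumberFormat.getInstance(Locale.forLanguageTag("nl-NL"))

// "24,32" was meant as line IDs 24 and 32, but a locale-aware parser sees a single number.
val misread: Double = dutchFormat.parse("24,32").toDouble()

// Kotlin's locale-independent conversion rejects the comma and returns null instead.
val strict: Double? = "24,32".toDoubleOrNull()
```

Any reader that tries locale-aware number parsing before falling back to `String` could end up with the first result, which is why an explicit column type helps.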

Let's also rename the columns to camel case while we're at it.
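
One way to do both steps could look like the sketch below. It assumes the CSV was saved locally as `disruptions-2023.csv` (a hypothetical path) and that your DataFrame version still offers the `readCSV` overload with a `colTypes` map; newer releases may expose a differently named CSV reader. In a notebook, `%use dataframe` is assumed to have been run already.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.renameToCamelCase
import org.jetbrains.kotlinx.dataframe.io.ColType
import org.jetbrains.kotlinx.dataframe.io.readCSV

// Hypothetical path: point this at wherever you saved the downloaded file.
val disruptions2023 = DataFrame
    .readCSV(
        "disruptions-2023.csv",
        // Keep the comma-separated IDs as a String instead of letting them be inferred as Double.
        colTypes = mapOf("rdt_lines_id" to ColType.String),
    )
    // Snake case to camel case, for example rdt_lines_id -> rdtLinesId.
    .renameToCamelCase()
```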

In [None]:
disruptions2023.schema()

Now the schema looks better! One of the best things about using DataFrame in notebooks
is that type-safe accessors are generated for you between cell executions!

We can actually make this hidden process visible by tracking all code that's executed under the hood.

Libraries for the Kotlin Jupyter kernel and notebooks can be very powerful, as you can see!
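
As a rough mental model, the generated accessors are extension properties tied to a schema marker type. The sketch below mimics that idea with toy stand-ins (`ToyFrame` and `ToySchema` are hypothetical and are not the real DataFrame classes; the actual generated code differs in detail):

```kotlin
// Toy stand-in for a data frame; NOT the real DataFrame class.
class ToyFrame<T>(private val columns: Map<String, List<Any?>>) {
    operator fun get(name: String): List<Any?> = columns.getValue(name)
}

// A marker type like this would be (re)generated whenever the schema changes.
interface ToySchema

// Generated-style extension properties: one typed accessor per column.
@Suppress("UNCHECKED_CAST")
val ToyFrame<ToySchema>.a: List<Int> get() = this["a"] as List<Int>

@Suppress("UNCHECKED_CAST")
val ToyFrame<ToySchema>.b: List<String> get() = this["b"] as List<String>

val toyFrame = ToyFrame<ToySchema>(mapOf("a" to listOf(1, 2), "b" to listOf("x", "y")))

// A typo in a column name is now a compile-time error rather than a runtime lookup failure.
val firstA: Int = toyFrame.a.first()
```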

In [None]:
%trackExecution
//

In [None]:
%trackExecution off

In [None]:
val a = dataFrame.a
val b = dataFrame.b

a

Anyway, let's get back to our data!

Let's remove the columns we don't need, and convert and rename some of the others.

In [None]:
// before
disruptions2023

In [None]:
import kotlin.time.Duration.Companion.minutes

val df1 = disruptions2023

    // we remove nsLines, the Dutch columns, and causeEn (statisticalCauseEn is better according to the docs)
    

    // let's also remove some rows where durationMinutes == null
    
    
    // Parsing minutes into kotlin.time.Duration and creating an extra date column
    

    // renaming columns to remove "rdt" and "En" from the beginning and end
    

df1

Almost perfect! 

However, some columns still hold comma-separated values in a single string.
We can split those into lists to make them more manageable.
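
Per cell, the split boils down to something like this plain-Kotlin sketch. `splitCell` is a hypothetical helper, not part of DataFrame, and the sample values are made up:

```kotlin
// Split a comma-separated cell into trimmed, non-empty parts; null becomes an empty list.
fun splitCell(cell: String?): List<String> =
    cell.orEmpty().split(",").map { it.trim() }.filter { it.isNotEmpty() }

// Made-up example values in the shape the dataset description suggests.
val stationCodes: List<String> = splitCell("ASD, UT")
val lineIds: List<Int> = splitCell("24,32").map { it.toInt() }
val noLines: List<String> = splitCell(null)
```

Filtering out empty parts means a missing or blank cell becomes an empty list instead of a list containing one empty string.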

In [None]:
val df2 = df1
    // splitting lines, linesId, stationNames, stationCodes to lists
    
    
    // convert linesId to List<Int>
    

df2

In [None]:
df2.schema()

Done! Now let's get to work! We can find all sorts of interesting stuff:

  - What's the longest delay duration in 2023?
  - What track had the most delays in 2023?
  - What causes delays?
  - Do I have the right to complain about Dutch trains in demos?
  

## Cause groups

I'm actually quite interested in these causes and what makes up a "cause group".
Let's find all groups and see what causes are inside :)

Note the nested DataFrames :)
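
To make the shape of that result concrete, here is a plain-Kotlin analogue with made-up data, where a `Map` per group plays the role of the nested DataFrame. The `Cause` class and the sample rows are hypothetical:

```kotlin
// Hypothetical, made-up rows: one (causeGroup, statisticalCause) pair per disruption.
data class Cause(val causeGroup: String, val statisticalCause: String)

val causes = listOf(
    Cause("infrastructure", "points failure"),
    Cause("infrastructure", "signal failure"),
    Cause("infrastructure", "points failure"),
    Cause("weather", "lightning"),
)

// For every group, count how often each cause occurs (a small table per group),
// then put the groups with the most distinct causes first.
val causesPerGroup: List<Pair<String, Map<String, Int>>> = causes
    .groupBy { it.causeGroup }
    .mapValues { (_, rows) -> rows.groupingBy { it.statisticalCause }.eachCount() }
    .toList()
    .sortedByDescending { (_, counts) -> counts.size }
```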

In [None]:
df2
    // group by causeGroup and aggregate by counting the values in statisticalCause
    
    // sort descending by the number of rows in the new "statisticalCauses" frame column
    

## Which line had the most delays?
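
A single disruption can affect several lines, so counting per line first needs one entry per (disruption, line) pair. That is the job of `explode` in the cell below. In plain Kotlin collections, with made-up sample data, the same idea looks roughly like this:

```kotlin
// Made-up sample: each disruption lists the lines it affected.
val disruptionLines: List<List<String>> = listOf(
    listOf("Amsterdam - Utrecht", "Utrecht - Arnhem"),
    listOf("Amsterdam - Utrecht"),
)

// "Explode" into one element per (disruption, line) pair, then count per line.
val countPerLine: List<Pair<String, Int>> = disruptionLines
    .flatten()
    .groupingBy { it }
    .eachCount()
    .toList()
    .sortedByDescending { (_, count) -> count }
```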

In [None]:
val byLines = df2
    .explode { lines }
    .groupBy { lines }

byLines.count().sortByDesc("count")

Well, what a surprise that was!

Now, this was per line; what about per station? The data also lists the affected stations for each disruption:


In [None]:
val byStation = df2
    .explode { stationNames }
    .groupBy { stationNames }

byStation.count().sortByDesc("count")

Let's get some more information about the duration of the delay, because just a count doesn't tell the whole story.
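
A tiny plain-Kotlin example with made-up numbers shows why: a station with fewer disruptions can still come out worse once durations are taken into account. `summarize` and `DurationSummary` are hypothetical helpers that compute only a handful of statistics and assume a non-empty list:

```kotlin
// Made-up durations in minutes per station, for illustration only.
val minutesByStation: Map<String, List<Int>> = mapOf(
    "Station A" to listOf(10, 12, 15, 11),
    "Station B" to listOf(5, 240),
)

data class DurationSummary(val count: Int, val mean: Double, val median: Double, val max: Int)

// Assumes a non-empty list of durations.
fun summarize(minutes: List<Int>): DurationSummary {
    val sorted = minutes.sorted()
    val mid = sorted.size / 2
    // Median: the middle value, or the average of the two middle values for an even count.
    val median = if (sorted.size % 2 == 1) sorted[mid].toDouble() else (sorted[mid - 1] + sorted[mid]) / 2.0
    return DurationSummary(sorted.size, sorted.average(), median, sorted.last())
}

val summaries: Map<String, DurationSummary> = minutesByStation.mapValues { (_, m) -> summarize(m) }
```

Here "Station A" has twice as many disruptions, yet "Station B" has a much higher mean and maximum duration.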


In [None]:
byStation.aggregate {
    duration.describe().first() into "duration"
}

Interesting! We have another 'winner'.

I don't know about you, but this requires some visualization, doesn't it?

Let's use Kandy, as it has excellent integration with notebooks and DataFrame.

Let's take a look at the examples: https://kotlin.github.io/kandy/examples.html

It looks like a box plot is the best way to show the top 10 "worst" stations.

In [None]:
%use kandy

In [None]:
val top10 = byStation.sortByGroupDesc {
    count()
//    durationMinutes.mean()
//    count() * durationMinutes.median()
//    count() * durationMinutes.mean()
}.filter { it.index() < 10 }

top10

In [None]:
top10.boxplot {
    x(stationNames named "name")
    y(durationMinutes)
}.configure {
//    y { scale = continuous(transform = Transformation.LOG10) }

    layout {
        size = 1000 to 500
    }
}

## Do I have the right to complain about Dutch trains in a demo?