### NS Data model
First, we need to collect all the archive files and combine them into a single file.
<br><br>
The first problem I ran into was that the data is too large for pandas to read in one go, so I set a `chunksize` and let pandas read each file in parts. I also print progress messages showing which file and chunk is being written. Writing the combined.csv file takes about 20 minutes.
<br><br>
Rule of Thumb in Business:
<br>
Exploratory analysis, smaller data → pandas with chunks
<br>
Production pipelines, big data → Dask/Spark
<br>
Long-term storage/analytics → write to Parquet/Delta Lake and query with SQL engines

pandas with chunks solution:<br>
This solution works well but is quite slow.

In [2]:
import pandas as pd
import glob
import os

# Sort so the yearly files are processed in a predictable order
files = sorted(glob.glob("NS_Data/Trein_archief/*.csv.gz"))

out_file = "combined.csv"
# Remove output from an earlier run; append mode would otherwise duplicate rows
if os.path.exists(out_file):
    os.remove(out_file)
first = True  # flag to write header only once

for i, f in enumerate(files, start=1):
    print(f"[{i}/{len(files)}] Processing file: {f}")

    for j, chunk in enumerate(pd.read_csv(f, compression="gzip", chunksize=100_000), start=1):
        print(f"   - Writing chunk {j} from {f}")

        # Append mode; only the very first chunk writes the header
        chunk.to_csv(out_file, mode="a", index=False, header=first)
        first = False

[1/14] Processing file: NS_Data/Trein_archief\services-2019.csv.gz
   - Writing chunk 1 from NS_Data/Trein_archief\services-2019.csv.gz
   - Writing chunk 2 from NS_Data/Trein_archief\services-2019.csv.gz
   - Writing chunk 3 from NS_Data/Trein_archief\services-2019.csv.gz
   - Writing chunk 4 from NS_Data/Trein_archief\services-2019.csv.gz
   - Writing chunk 5 from NS_Data/Trein_archief\services-2019.csv.gz
   - Writing chunk 6 from NS_Data/Trein_archief\services-2019.csv.gz
   - Writing chunk 7 from NS_Data/Trein_archief\services-2019.csv.gz
   - Writing chunk 8 from NS_Data/Trein_archief\services-2019.csv.gz
   - Writing chunk 9 from NS_Data/Trein_archief\services-2019.csv.gz
   - Writing chunk 10 from NS_Data/Trein_archief\services-2019.csv.gz
   - Writing chunk 11 from NS_Data/Trein_archief\services-2019.csv.gz
   - Writing chunk 12 from NS_Data/Trein_archief\services-2019.csv.gz
   - Writing chunk 13 from NS_Data/Trein_archief\services-2019.csv.gz
   - Writing chunk 14 from NS_Da

  for j, chunk in enumerate(pd.read_csv(f, compression="gzip", chunksize=100_000), start=1):


   - Writing chunk 77 from NS_Data/Trein_archief\services-2023.csv.gz
   - Writing chunk 78 from NS_Data/Trein_archief\services-2023.csv.gz
   - Writing chunk 79 from NS_Data/Trein_archief\services-2023.csv.gz
   - Writing chunk 80 from NS_Data/Trein_archief\services-2023.csv.gz
   - Writing chunk 81 from NS_Data/Trein_archief\services-2023.csv.gz
   - Writing chunk 82 from NS_Data/Trein_archief\services-2023.csv.gz
   - Writing chunk 83 from NS_Data/Trein_archief\services-2023.csv.gz
   - Writing chunk 84 from NS_Data/Trein_archief\services-2023.csv.gz
   - Writing chunk 85 from NS_Data/Trein_archief\services-2023.csv.gz
   - Writing chunk 86 from NS_Data/Trein_archief\services-2023.csv.gz
   - Writing chunk 87 from NS_Data/Trein_archief\services-2023.csv.gz
   - Writing chunk 88 from NS_Data/Trein_archief\services-2023.csv.gz
   - Writing chunk 89 from NS_Data/Trein_archief\services-2023.csv.gz
   - Writing chunk 90 from NS_Data/Trein_archief\services-2023.csv.gz
   - Writing chunk 9

  for j, chunk in enumerate(pd.read_csv(f, compression="gzip", chunksize=100_000), start=1):


   - Writing chunk 196 from NS_Data/Trein_archief\services-2023.csv.gz
   - Writing chunk 197 from NS_Data/Trein_archief\services-2023.csv.gz
   - Writing chunk 198 from NS_Data/Trein_archief\services-2023.csv.gz
   - Writing chunk 199 from NS_Data/Trein_archief\services-2023.csv.gz
   - Writing chunk 200 from NS_Data/Trein_archief\services-2023.csv.gz
   - Writing chunk 201 from NS_Data/Trein_archief\services-2023.csv.gz
   - Writing chunk 202 from NS_Data/Trein_archief\services-2023.csv.gz
   - Writing chunk 203 from NS_Data/Trein_archief\services-2023.csv.gz
   - Writing chunk 204 from NS_Data/Trein_archief\services-2023.csv.gz
   - Writing chunk 205 from NS_Data/Trein_archief\services-2023.csv.gz
   - Writing chunk 206 from NS_Data/Trein_archief\services-2023.csv.gz
   - Writing chunk 207 from NS_Data/Trein_archief\services-2023.csv.gz
   - Writing chunk 208 from NS_Data/Trein_archief\services-2023.csv.gz
   - Writing chunk 209 from NS_Data/Trein_archief\services-2023.csv.gz
   - W
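
A possible faster alternative, sketched here as an assumption rather than something I ran on the full data: the chunked loop above parses every row with pandas and then serializes it again, which is likely where most of the time goes. If all files share the same single-line header and each file ends with a newline (both are assumptions), the files can be concatenated as raw bytes instead. The function name `concat_gzipped_csvs` is made up for this example.

```python
import gzip
import shutil


def concat_gzipped_csvs(files, out_file):
    """Concatenate gzipped CSV files into one plain CSV with a single header.

    Assumes every file starts with the same one-line header and ends
    with a newline; no column or dtype validation is done.
    """
    with open(out_file, "wb") as f_out:
        for i, path in enumerate(files):
            with gzip.open(path, "rb") as f_in:
                header = f_in.readline()
                if i == 0:
                    f_out.write(header)  # keep the header of the first file only
                # Stream the remaining decompressed bytes without parsing them
                shutil.copyfileobj(f_in, f_out)


# Hypothetical usage with the folder from this notebook:
# import glob
# files = sorted(glob.glob("NS_Data/Trein_archief/*.csv.gz"))
# concat_gzipped_csvs(files, "combined.csv")
```

The trade-off is that nothing is checked along the way: if one year has different columns, the combined file will be silently inconsistent, whereas the pandas version at least parses each chunk.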

#### Using dask
The next block uses dask. dask is similar to pandas but has parallel processing and memory management built in, which helps when a dataset is bigger than system memory.
<br>
One issue is that gzip-compressed files cannot be split into blocks, so dask has to load each .csv.gz file as a single partition that must fit into memory. To work around this, I first decompress the files. Since these files are not too big, I wanted to try this approach as well. I couldn't get it to work yet (the runs below ended in a KeyboardInterrupt and a MemoryError), but I may revisit it in the future.

In [1]:
import gzip
import shutil
import glob
import os

files = glob.glob("NS_Data/Trein_archief/*.csv.gz")

for f in files:
    out_file = f[:-3]  # remove ".gz"
    print(f"Decompressing {f} -> {out_file}")
    with gzip.open(f, "rb") as f_in:
        with open(out_file, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)

Decompressing NS_Data/Trein_archief\services-2019.csv.gz -> NS_Data/Trein_archief\services-2019.csv


KeyboardInterrupt: 

In [2]:
import dask.dataframe as dd
df = dd.read_csv("NS_Data/Trein_archief/*.csv", blocksize="32MB")

print(df.head())

df.to_csv("combined_data.csv", single_file=True, index=False)

   Service:RDT-ID Service:Date Service:Type Service:Company  \
0          738804   2019-01-01    Intercity              NS   
1          738804   2019-01-01    Intercity              NS   
2          738804   2019-01-01    Intercity              NS   
3          738804   2019-01-01    Intercity              NS   
4          738804   2019-01-01    Intercity              NS   

   Service:Train number  Service:Completely cancelled  \
0                  1410                         False   
1                  1410                         False   
2                  1410                         False   
3                  1410                         False   
4                  1410                         False   

   Service:Partly cancelled  Service:Maximum delay  Stop:RDT-ID  \
0                     False                      1      6220112   
1                     False                      0      6220116   
2                     False                      0      6220120   
3         

MemoryError: 

#### Loading the data into Python
The pandas chunking block earlier created combined.csv, which contains the data from all the separate service files. We can now load it into Python in manageable pieces. I started with the first 100,000 rows and printed the head of the data.

In [2]:
import pandas as pd

df = pd.read_csv("combined.csv", nrows=100_000)

print(df.shape)
print(df.head())

(100000, 20)
   Service:RDT-ID Service:Date Service:Type Service:Company  \
0          738804   2019-01-01    Intercity              NS   
1          738804   2019-01-01    Intercity              NS   
2          738804   2019-01-01    Intercity              NS   
3          738804   2019-01-01    Intercity              NS   
4          738804   2019-01-01    Intercity              NS   

   Service:Train number  Service:Completely cancelled  \
0                  1410                         False   
1                  1410                         False   
2                  1410                         False   
3                  1410                         False   
4                  1410                         False   

   Service:Partly cancelled  Service:Maximum delay  Stop:RDT-ID  \
0                     False                      1      6220112   
1                     False                      0      6220116   
2                     False                      0      6220120 
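
Reading only the first 100,000 rows is fine for a first look, but statistics over the whole file need every row. A minimal sketch of how that could work without loading everything at once: read combined.csv in chunks and keep only a running total. The column name `Service:Type` comes from the output above; the function name `count_rows_per_type` is made up for this example. Each row is a stop, so this counts stop rows per service type, not services.

```python
from collections import Counter

import pandas as pd


def count_rows_per_type(csv_path, chunksize=100_000):
    """Count rows per 'Service:Type' without loading the whole file."""
    totals = Counter()
    # usecols keeps memory low by parsing only the column we need
    reader = pd.read_csv(csv_path, usecols=["Service:Type"], chunksize=chunksize)
    for chunk in reader:
        # Add this chunk's counts to the running totals
        totals.update(chunk["Service:Type"].value_counts().to_dict())
    return pd.Series(totals, dtype="int64").sort_values(ascending=False)


# Hypothetical usage:
# print(count_rows_per_type("combined.csv"))
```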

### Data analysis
Next up is some data analysis to understand what I am working with.<br>
Some possible statistics are:
#### Disruptions
Number of disruptions per day/week/month/year
<br>Average duration of disruptions
<br>Distribution of causes (e.g., equipment problems, overhead line issues, signal failures, weather influences)
<br>Top affected routes and stations
<br>Distribution by time of day (e.g., peak hours vs off-peak)
<br>Seasonal influences (e.g., more disruptions in autumn due to leaves, in winter due to snow)
<br>Average recovery time per cause

#### Stations
<br>Top 10 largest hubs (most disruptions or delays)
<br>Stations with the most platform changes
<br>Comparison of megastations vs small stations (number of disruptions or delays)
<br>Geographic heatmap of disruptions

#### Train Services
<br>Percentage of fully canceled vs partly canceled services
<br>Average delay per operator (NS, Arriva, etc.)
<br>Top 10 train numbers with the most delays
<br>Distribution of delays (e.g., <5 min, 5–15 min, >30 min)
<br>Platform changes per station or by train type (Intercity vs Sprinter)
<br>Punctuality by time of day
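
As a rough sketch of how a few of the train service statistics could be computed with pandas: the sample output earlier shows the same `Service:RDT-ID` on several rows (one per stop) with differing `Service:Maximum delay` values, so the rows are first reduced to one row per service by taking the maximum. That the delay is measured in minutes, the bucket edges, and the function name `service_level_stats` are my own assumptions.

```python
import pandas as pd


def service_level_stats(df):
    """Return cancellation percentages, mean delay per operator, and delay buckets."""
    # Collapse stop-level rows to one row per service
    services = df.groupby("Service:RDT-ID").agg({
        "Service:Company": "first",
        "Service:Completely cancelled": "max",
        "Service:Partly cancelled": "max",
        "Service:Maximum delay": "max",
    })

    # Percentage of fully and partly cancelled services
    cancelled_pct = services[
        ["Service:Completely cancelled", "Service:Partly cancelled"]
    ].mean() * 100

    # Mean of the per-service maximum delay, per operator
    delay_per_company = services.groupby("Service:Company")["Service:Maximum delay"].mean()

    # Delay distribution; bins are left-closed, e.g. "5-15 min" means 5 <= delay < 15
    bins = [-float("inf"), 5, 15, 30, float("inf")]
    labels = ["<5 min", "5-15 min", "15-30 min", ">=30 min"]
    delay_buckets = pd.cut(
        services["Service:Maximum delay"], bins=bins, labels=labels, right=False
    ).value_counts(sort=False)

    return cancelled_pct, delay_per_company, delay_buckets


# Hypothetical usage on the sample loaded earlier:
# cancelled_pct, delay_per_company, delay_buckets = service_level_stats(df)
```

On the 100,000-row sample this only describes the first part of 2019; for the full file the same aggregation would have to be done per chunk and merged, taking care that a service can straddle a chunk boundary.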