### NS Data model
First, we need to collect all the service files and combine them into a single file.
<br><br>
The first problem I ran into was that there is too much data for pandas to read in one go. I therefore set a `chunksize`, so that pandas reads each file in multiple parts, and added progress messages that show which file and chunk is being processed. Writing the combined.csv file takes about 20 minutes.
<br><br>
Rule of Thumb in Business:
<br>
Exploratory analysis, smaller data → pandas with chunks
<br>
Production pipelines, big data → Dask/Spark
<br>
Long-term storage/analytics → write to Parquet/Delta Lake and query with SQL engines

In [7]:
import glob
import os

import pandas as pd

# Sort so the files are always appended in the same order
files = sorted(glob.glob("NS Data/Trein ritten/*.csv.gz"))

out_file = "combined.csv"

# Start from a clean file; otherwise re-running this cell appends duplicate rows
if os.path.exists(out_file):
    os.remove(out_file)

first = True  # flag to write the header only once

for i, f in enumerate(files, start=1):
    print(f"[{i}/{len(files)}] Processing file: {f}")

    for j, chunk in enumerate(pd.read_csv(f, compression="gzip", chunksize=100_000), start=1):
        print(f"   - Writing chunk {j} from {f}")

        # Append each chunk; only the very first write includes the header
        chunk.to_csv(out_file, mode="a", index=False, header=first)
        first = False

[1/14] Processing file: NS Data/Trein ritten\services-2019.csv.gz
   - Writing chunk 1 from NS Data/Trein ritten\services-2019.csv.gz
   - Writing chunk 2 from NS Data/Trein ritten\services-2019.csv.gz
   - Writing chunk 3 from NS Data/Trein ritten\services-2019.csv.gz
   - Writing chunk 4 from NS Data/Trein ritten\services-2019.csv.gz
   - Writing chunk 5 from NS Data/Trein ritten\services-2019.csv.gz
   - Writing chunk 6 from NS Data/Trein ritten\services-2019.csv.gz
   - Writing chunk 7 from NS Data/Trein ritten\services-2019.csv.gz
   - Writing chunk 8 from NS Data/Trein ritten\services-2019.csv.gz
   - Writing chunk 9 from NS Data/Trein ritten\services-2019.csv.gz
   - Writing chunk 10 from NS Data/Trein ritten\services-2019.csv.gz
   - Writing chunk 11 from NS Data/Trein ritten\services-2019.csv.gz
   - Writing chunk 12 from NS Data/Trein ritten\services-2019.csv.gz
   - Writing chunk 13 from NS Data/Trein ritten\services-2019.csv.gz
   - Writing chunk 14 from NS Data/Trein ritte
   ...
   - Writing chunk 77 from NS Data/Trein ritten\services-2023.csv.gz
   - Writing chunk 78 from NS Data/Trein ritten\services-2023.csv.gz
   - Writing chunk 79 from NS Data/Trein ritten\services-2023.csv.gz
   - Writing chunk 80 from NS Data/Trein ritten\services-2023.csv.gz
   - Writing chunk 81 from NS Data/Trein ritten\services-2023.csv.gz
   - Writing chunk 82 from NS Data/Trein ritten\services-2023.csv.gz
   - Writing chunk 83 from NS Data/Trein ritten\services-2023.csv.gz
   - Writing chunk 84 from NS Data/Trein ritten\services-2023.csv.gz
   - Writing chunk 85 from NS Data/Trein ritten\services-2023.csv.gz
   - Writing chunk 86 from NS Data/Trein ritten\services-2023.csv.gz
   - Writing chunk 87 from NS Data/Trein ritten\services-2023.csv.gz
   - Writing chunk 88 from NS Data/Trein ritten\services-2023.csv.gz
   - Writing chunk 89 from NS Data/Trein ritten\services-2023.csv.gz
   - Writing chunk 90 from NS Data/Trein ritten\services-2023.csv.gz
   - Writing chunk 91 from NS Data
   ...
   - Writing chunk 196 from NS Data/Trein ritten\services-2023.csv.gz
   - Writing chunk 197 from NS Data/Trein ritten\services-2023.csv.gz
   - Writing chunk 198 from NS Data/Trein ritten\services-2023.csv.gz
   - Writing chunk 199 from NS Data/Trein ritten\services-2023.csv.gz
   - Writing chunk 200 from NS Data/Trein ritten\services-2023.csv.gz
   - Writing chunk 201 from NS Data/Trein ritten\services-2023.csv.gz
   - Writing chunk 202 from NS Data/Trein ritten\services-2023.csv.gz
   - Writing chunk 203 from NS Data/Trein ritten\services-2023.csv.gz
   - Writing chunk 204 from NS Data/Trein ritten\services-2023.csv.gz
   - Writing chunk 205 from NS Data/Trein ritten\services-2023.csv.gz
   - Writing chunk 206 from NS Data/Trein ritten\services-2023.csv.gz
   - Writing chunk 207 from NS Data/Trein ritten\services-2023.csv.gz
   - Writing chunk 208 from NS Data/Trein ritten\services-2023.csv.gz
   - Writing chunk 209 from NS Data/Trein ritten\services-2023.csv.gz
   - Writing chunk 2
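
Since this step only concatenates files, one possible shortcut is to skip pandas parsing and stream the decompressed lines straight into the output, writing the header once. The block below is a minimal sketch of that idea, not the approach used in this notebook. It assumes every file starts with the same header row, and `combine_gz_csv` is a hypothetical helper name.

```python
import gzip


def combine_gz_csv(files, out_file):
    """Concatenate gzipped CSV files into one plain CSV with a single header.

    Assumption: all input files share an identical header row.
    Returns the number of non-header lines written.
    """
    lines = 0
    header = None
    with open(out_file, "w", encoding="utf-8", newline="") as out:
        for f in files:
            with gzip.open(f, "rt", encoding="utf-8", newline="") as src:
                # Compare headers without their line endings
                file_header = src.readline().rstrip("\r\n")
                if header is None:
                    header = file_header
                    out.write(header + "\n")
                elif file_header != header:
                    raise ValueError(f"Header mismatch in {f}")

                last = "\n"
                for line in src:
                    out.write(line)
                    last = line
                    lines += 1

                # Guard against a file whose final line has no newline,
                # which would otherwise merge with the next file's first row
                if not last.endswith("\n"):
                    out.write("\n")
    return lines
```

It could be called as `combine_gz_csv(sorted(glob.glob("NS Data/Trein ritten/*.csv.gz")), "combined.csv")`. Because nothing is parsed, column types are not checked, so the pandas version remains the safer choice if the files might differ in layout.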

#### Loading the data into Python
The code block above created combined.csv, which contains the data from all the separate service files. We can now load it into Python in manageable pieces. To start, I read only the first 100,000 rows and printed the head of the data.
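
A first look only needs the first rows, but statistics over the whole file can reuse the same `chunksize` idea on the reading side. The sketch below counts the values of one column chunk by chunk, so only one chunk is in memory at a time. It assumes the `Service:Type` column from this dataset; `count_values_chunked` is a hypothetical helper, and the in-memory sample is made-up stand-in data so the block runs without combined.csv.

```python
import io
from collections import Counter

import pandas as pd


def count_values_chunked(source, column, chunksize=100_000):
    """Count how often each value occurs in `column`, reading in chunks."""
    counts = Counter()
    # usecols keeps memory low by loading only the column we need
    for chunk in pd.read_csv(source, usecols=[column], chunksize=chunksize):
        counts.update(chunk[column].value_counts().to_dict())
    return counts


# Tiny in-memory stand-in for combined.csv (illustrative rows, not real NS data)
sample = io.StringIO(
    "Service:Type,Service:Company\n"
    "Intercity,NS\n"
    "Sprinter,NS\n"
    "Intercity,NS\n"
)
print(count_values_chunked(sample, "Service:Type", chunksize=2))
```

For the real file, the call would be `count_values_chunked("combined.csv", "Service:Type")`.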

In [2]:
import pandas as pd

df = pd.read_csv("combined.csv", nrows=100_000)

print(df.shape)
print(df.head())

(100000, 20)
   Service:RDT-ID Service:Date Service:Type Service:Company  \
0          738804   2019-01-01    Intercity              NS   
1          738804   2019-01-01    Intercity              NS   
2          738804   2019-01-01    Intercity              NS   
3          738804   2019-01-01    Intercity              NS   
4          738804   2019-01-01    Intercity              NS   

   Service:Train number  Service:Completely cancelled  \
0                  1410                         False   
1                  1410                         False   
2                  1410                         False   
3                  1410                         False   
4                  1410                         False   

   Service:Partly cancelled  Service:Maximum delay  Stop:RDT-ID  \
0                     False                      1      6220112   
1                     False                      0      6220116   
2                     False                      0      6220120 