# Create train/test/validation split on the data


This notebook takes a DataFrame with at least the columns `['id', 'scheduleDateTime']` and assigns each `id` to a model set (train or test).

1. Load the `id` and `scheduleDateTime` columns from the input file
2. Assign every `id` to a model set using the chosen `strategy`
3. Write output to CSV file

### Parameters

--------------
- `input_file`: Filepath of flights data in format received from Schiphol
- `output_file`: Filepath to write output csv file with minimal modelling input
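- `strategy`: One of `['sample', 'timeseries']`. `'sample'` splits the ids at random; `'timeseries'` sorts by `scheduleDateTime` and uses the most recent flights as test data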
- `test_size`: (Optional) Default 0.3. Fraction to use as test data between 0 and 1
- `val_size`: (Optional) Default 0.1. Fraction to use as validation data between 0 and 1

### Returns

--------------

Output format 
    
    id                   |   model_set |
    123414481790510775   |  train      |
    123414479288269149   |  train      |
    123414479666542945   |  test       |
    123414479288365061   |  test       |
    123414479288274329   |  validation |
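The output format above includes a `validation` label, and `val_size` is listed as a parameter, although the split cell further down only produces `train` and `test`. As a minimal sketch of how a three-way random split could be labelled, assuming the sizes are fractions of the full data set: the function name `three_way_split`, the `seed` argument, and the toy ids are hypothetical, and it shuffles with pandas rather than calling `train_test_split`.

```python
import pandas as pd


def three_way_split(ids, test_size=0.3, val_size=0.1, seed=0):
    """Label every id as 'train', 'test' or 'validation' (random strategy)."""
    if not (0 < test_size < 1 and 0 <= val_size < 1 and test_size + val_size < 1):
        raise ValueError("test_size and val_size must leave room for training data")

    # Shuffle once, then cut the shuffled ids into three consecutive slices.
    shuffled = pd.Series(list(ids)).sample(frac=1, random_state=seed).reset_index(drop=True)
    n_test = int(len(shuffled) * test_size)
    n_val = int(len(shuffled) * val_size)
    n_train = len(shuffled) - n_test - n_val

    labels = ["train"] * n_train + ["test"] * n_test + ["validation"] * n_val
    return pd.DataFrame({"id": shuffled.values, "model_set": labels})


# Toy ids, not real flight ids.
split = three_way_split(range(1000, 1016), test_size=0.25, val_size=0.125)
print(split["model_set"].value_counts())
```

Each id receives exactly one label, so the result has the same two columns as the output format shown above.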

## File parameters


In [1]:
# parameters
input_file = "../lvt-schiphol-assignment-snakemake/data/model_input/delays_base_input.csv"
output_file = "train_test__sample__0.2.csv"
strategy = 'sample'
test_size = 0.2

In [2]:
assert 0 < test_size < 1
assert strategy in ['sample', 'timeseries']

## Imports

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split

import sys
sys.path.append("../")

from src.data.google_storage_io import read_csv_data, write_csv_data

AttributeError: module 'grpc' has no attribute 'AuthMetadataPlugin'

## Load data

In [None]:
%%time

df = read_csv_data(input_file)
df = df[["id", "scheduleDateTime"]]
df["scheduleDateTime"] = pd.to_datetime(df["scheduleDateTime"])
df.head()

## Make train/test/validation split

In [None]:
print(f"Strategy: {strategy}")

if strategy == 'sample':
    train_ids, test_ids = train_test_split(df["id"], test_size=test_size)
elif strategy == 'timeseries':
    # Sort chronologically so the most recent flights form the test set
    df = df.sort_values("scheduleDateTime").reset_index(drop=True)
    n_test = int(len(df) * test_size)
    train_ids, test_ids = df.iloc[:-n_test]["id"], df.iloc[-n_test:]["id"]
    
df_train_test = pd.concat([
    pd.DataFrame(dict(id = train_ids.values, model_set = "train")),
    pd.DataFrame(dict(id = test_ids.values, model_set = "test"))
])

df_train_test
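To illustrate what the `'timeseries'` strategy does, here is a small sketch on toy data. The ids, dates, and the name `test_fraction` (standing in for the `test_size` parameter) are made up for the example; only the sort-then-slice logic mirrors the cell above.

```python
import pandas as pd

# Toy data: eight flights in deliberately scrambled order (ids are made up).
day_offsets = [5, 2, 7, 1, 3, 0, 6, 4]
toy = pd.DataFrame({
    "id": day_offsets,
    "scheduleDateTime": pd.to_datetime("2018-01-01") + pd.to_timedelta(day_offsets, unit="D"),
})

test_fraction = 0.25  # stands in for the notebook's `test_size` parameter

# Sort chronologically, then take the last rows as the test set.
toy = toy.sort_values("scheduleDateTime").reset_index(drop=True)
n_test = int(len(toy) * test_fraction)
train_part, test_part = toy.iloc[:-n_test], toy.iloc[-n_test:]

# Every training flight is scheduled before every test flight.
print(train_part["scheduleDateTime"].max() < test_part["scheduleDateTime"].min())
print(test_part["id"].tolist())
```

Note that this slicing assumes `n_test` is at least 1; with `n_test == 0`, `iloc[:-0]` would return an empty training set.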

In [None]:
pd.merge(
    df_train_test,
    df, on="id", how="left")
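The merge above attaches `scheduleDateTime` to each label, which makes it possible to sanity-check the split. One way to do that, sketched here on hypothetical toy frames (`flights` and `labels` are stand-ins for `df` and `df_train_test`), is to summarise the date range per model set.

```python
import pandas as pd

# Toy stand-ins for `df` and `df_train_test`; values are made up.
flights = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "scheduleDateTime": pd.to_datetime(
        ["2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04"]
    ),
})
labels = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "model_set": ["train", "train", "train", "test"],
})

# validate="one_to_one" raises if an id occurs more than once on either side.
merged = pd.merge(labels, flights, on="id", how="left", validate="one_to_one")

# Row count and date range per model set.
summary = merged.groupby("model_set")["scheduleDateTime"].agg(["count", "min", "max"])
print(summary)
```

For a `'timeseries'` split, the `max` of the train rows should lie before the `min` of the test rows; for a `'sample'` split the ranges are expected to overlap.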

## Write output to CSV

Both local paths and Google Storage paths are handled.

In [11]:
# write output file
write_csv_data(df_train_test, output_file, index=False)

Writing file to local directory
File:	processed_flights.csv
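`write_csv_data` is a project helper whose internals are not shown here. As a rough local-only sketch of the same step with plain pandas, assuming the helper passes `index=False` through to `DataFrame.to_csv`, the following round trip checks that the 18-digit ids survive as integers. The in-memory buffer replaces a real file path for the example.

```python
import io

import pandas as pd

labels = pd.DataFrame({
    "id": [123414481790510775, 123414479666542945],
    "model_set": ["train", "test"],
})

# Write to an in-memory buffer instead of a file; index=False as in the notebook call.
buffer = io.StringIO()
labels.to_csv(buffer, index=False)
buffer.seek(0)

# Read the CSV back and confirm the ids are still 64-bit integers.
restored = pd.read_csv(buffer)
print(restored["id"].dtype)
print(restored["id"].tolist() == labels["id"].tolist())
```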



### Overview of the output data

In [13]:
df_train_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 487716 entries, 0 to 487715
Data columns (total 8 columns):
id                      487716 non-null int64
aircraftRegistration    487713 non-null object
airlineCode             486503 non-null float64
terminal                477391 non-null float64
serviceType             482937 non-null object
scheduleDateTime        487716 non-null datetime64[ns, Europe/Amsterdam]
actualOffBlockTime      487716 non-null datetime64[ns, Europe/Amsterdam]
scheduleDelaySeconds    487716 non-null float64
dtypes: datetime64[ns, Europe/Amsterdam](2), float64(3), int64(1), object(2)
memory usage: 29.8+ MB
