# Splitting GMSC "train full" data with scikit-learn

## Load data

In [1]:
from pandas import read_csv
train_full = read_csv('csv/train_full.csv', index_col=0)

## Fix seed for reproducibility

Let's make this notebook's output stable across runs!

* scikit-learn uses NumPy as its backend for numerical computing
* scikit-learn's `train_test_split` has a `shuffle` argument (`True` by default)
* Shuffling is random...
    * The `random_state` argument is `None` by default
    * That means the shuffle draws from NumPy's global random number generator

Let's set the state of NumPy's global random number generator before using scikit-learn to do anything random, including splitting the data!

In [2]:
from numpy import random
SEED = 42
random.seed(SEED)  # seeds NumPy's global RNG, which scikit-learn falls back to
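To see why this works, here is a minimal sketch on a toy DataFrame (the `toy` frame and its column `x` are made up for illustration, not part of the GMSC data). It assumes `random_state` is left at `None`, so the split depends only on the global NumPy state:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data (hypothetical, for illustration only)
toy = pd.DataFrame({"x": range(10)})

np.random.seed(42)
first_train, first_val = train_test_split(toy, test_size=0.2)

np.random.seed(42)  # reset the global state before repeating the split
second_train, second_val = train_test_split(toy, test_size=0.2)

# Same seed, same global state, same shuffle
same_split = (
    first_train.index.equals(second_train.index)
    and first_val.index.equals(second_val.index)
)
print(same_split)
```

Note that the seed has to be set again before the second call: every random operation advances the global state, so the order of cells matters.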

## Split data

Let's do an 80/20 split (80% train, 20% validation):

In [3]:
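from sklearn.model_selection import train_test_split
VAL_SIZE = 0.2  # fraction of rows held out for validation
# random_state stays None, so the shuffle uses the global NumPy RNG seeded earlier
train, val = train_test_split(train_full, test_size=VAL_SIZE)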
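An alternative that does not depend on global state is to pass `random_state` straight to `train_test_split`. This is not what this notebook does; it is just a sketch on a toy DataFrame (the `toy` frame is made up for illustration) to show that the result is then stable regardless of what else touched the global generator:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 42
VAL_SIZE = 0.2

# Toy stand-in for the real data (hypothetical, for illustration only)
toy = pd.DataFrame({"x": range(100)})

# Passing random_state makes each call self-contained and repeatable
train_a, val_a = train_test_split(toy, test_size=VAL_SIZE, random_state=SEED)
train_b, val_b = train_test_split(toy, test_size=VAL_SIZE, random_state=SEED)

print(len(train_a), len(val_a))
print(val_a.index.equals(val_b.index))
```

The trade-off: with an explicit `random_state` every call needs the argument, while the global seed covers everything in the notebook at once.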

Quick sanity check on shapes:

In [4]:
print(f"Train full shape: {train_full.shape}")
print(f"Train shape: {train.shape}")
print(f"Val shape: {val.shape}")

Train full shape: (150000, 17)
Train shape: (120000, 17)
Val shape: (30000, 17)
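Shapes only tell us the row counts add up. A slightly stronger check is that the two parts share no rows and together cover the whole input. A minimal sketch on a toy DataFrame (the `toy` frame is made up for illustration), assuming the index values are unique:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data (hypothetical, for illustration only)
toy = pd.DataFrame({"x": range(50)})
toy_train, toy_val = train_test_split(toy, test_size=0.2, random_state=0)

# No row should land in both parts...
overlap = toy_train.index.intersection(toy_val.index)
# ...and every row should land in one of them
covered = toy_train.index.union(toy_val.index)

print(overlap.empty)
print(len(covered) == len(toy))
```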


## Save data

In [5]:
train.to_csv("csv/train_sk.csv")
val.to_csv("csv/val_sk.csv")
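Since `to_csv` writes the index as the first column, the files can be read back the same way the full data was loaded, with `index_col=0`. A minimal round-trip sketch on a toy DataFrame, written to a temporary directory (the `toy` frame and the file name `toy_val.csv` are made up for illustration):

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for a saved split (hypothetical, for illustration only)
toy = pd.DataFrame({"x": [10, 20, 30]}, index=[7, 3, 5])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_val.csv")
    toy.to_csv(path)  # the index is written as the first column
    restored = pd.read_csv(path, index_col=0)  # and read back as the index

print(restored.shape == toy.shape)
print(list(restored.index) == list(toy.index))
```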