# Splitting data into train/validation/test datasets

It is important to split your full dataset into train/validation/test datasets, and reliably use the same datasets for your modeling tasks later.

In [13]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format='retina'

from sklearn.model_selection import train_test_split

# Set a random seed to ensure reproducibility across runs
RNG_SEED = 42
np.random.seed(seed=RNG_SEED)

## Load the pre-processed dataset

We will start with the processed dataset that we saved from the last notebook.

In [14]:
PATH = os.getcwd()
data_path = os.path.join(PATH, '../data_abs/cp_data_abs_cleaned.csv')

df = pd.read_csv(data_path)
print(f'Full DataFrame shape: {df.shape}')

Full DataFrame shape: (3612, 3)


## Separate the DataFrame into your input variables ($X$) and target variables ($y$)

The $X$ will be used as the input data, and $y$ will be used as the prediction targets for your ML model.

If your target variables are discrete (such as `metal`/`non-metal` or types of crystal structures), then you will be performing a classification task.
In our case, since our target variables are continuous values (absorption coefficient), we are performing a regression task.

In [15]:
X = df[['formula', 'energy']]
y = df['abs']

print(f'Shape of X: {X.shape}')
print(f'Shape of y: {y.shape}')

Shape of X: (3612, 2)
Shape of y: (3612,)


## Splitting data (and a word of caution)
### Normally, we could split the data with a single `sklearn` function

The scikit-learn `train_test_split` function randomly splits a dataset into train and test datasets.
Typically, you can use `train_test_split` to first split your data into "train" and "test" datasets, and then use the function again to split your "train" data into "train" and "validation" dataset splits.

As a rule of thumb, you can roughly aim for the following dataset proportions when splitting your data:

| | train split | validation split | test split |
| --- | --- | --- | --- |
| proportion<br> of original<br> dataset | 50% to 70% | 20% to 30% | 10% to 20% |

If you have copious amounts of data, it may suffice to train your models on just 50% of the data; that way, you have more samples to validate and test with.
If, however, you have a smaller dataset and thus very few training samples, you may wish to increase the proportion of training data when splitting.
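
The two-stage use of `train_test_split` described above can be sketched as follows. This is a minimal illustration on made-up toy arrays (`X_toy`, `y_toy` and the `*_toy` names are hypothetical, not part of this notebook's dataset). The split sizes are passed as absolute sample counts, which `train_test_split` accepts for `test_size`, so the second split does not need a rescaled fraction.

```python
import numpy as np
from sklearn.model_selection import train_test_split

RNG_SEED = 42

# Toy stand-in data (hypothetical): 100 samples with 2 features each
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# Target sizes as absolute sample counts (10% test, 20% validation)
n_total = len(X_toy)
n_test = int(round(0.10 * n_total))
n_val = int(round(0.20 * n_total))

# First split: hold out the test set
X_rest, X_test_toy, y_rest, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=n_test, random_state=RNG_SEED)

# Second split: carve the validation set out of the remaining data
X_train_toy, X_val_toy, y_train_toy, y_val_toy = train_test_split(
    X_rest, y_rest, test_size=n_val, random_state=RNG_SEED)

print(len(X_train_toy), len(X_val_toy), len(X_test_toy))  # 70 20 10
```

Passing integer counts is only one option; a float `test_size` also works, but for the second split it would have to be expressed relative to the remaining data rather than the original dataset.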

In [1]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=RNG_SEED)

print(X_train.shape)
print(X_test.shape)


In [17]:
#num_rows = len(X_train)
#print(f'There are in total {num_rows} rows in the X_train DataFrame.')

#num_unique_formulae = len(X_train['formula'].unique())
#print(f'But there are only {num_unique_formulae} unique formulae!\n')

#print('Unique formulae and their number of occurrences in the X_train DataFrame:')
#print(X_train['formula'].value_counts(), '\n')
#print('Unique formulae and their number of occurrences in the X_test DataFrame:')
#print(X_test['formula'].value_counts())
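
The commented-out cell above compares the number of rows with the number of unique formulae. As a rough sketch of why that comparison matters, the made-up toy data below (hypothetical formulae and energies, not taken from this dataset) has several rows per formula. A row-wise random split then places rows of the same formula in both the train and the test split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 3 formulae, each with 4 rows (one per energy value)
toy = pd.DataFrame({
    'formula': ['Al2O3'] * 4 + ['SiO2'] * 4 + ['TiO2'] * 4,
    'energy': [1.0, 2.0, 3.0, 4.0] * 3,
})

# Row-wise random split: half of the rows go to the test set
toy_train, toy_test = train_test_split(toy, test_size=0.5, random_state=42)

# Formulae that ended up in both splits
shared = set(toy_train['formula']) & set(toy_test['formula'])
print(f'Formulae in both train and test: {sorted(shared)}')
```

With 4 rows per formula and 6 rows in each split, at least one formula is necessarily shared here, whatever the random seed.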

## Splitting data, cautiously (manually)

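Instead of splitting row by row, we split by formula, so that each formula ends up in exactly one dataset.
First, we get a list of all of the unique formulae in the dataset.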

In [18]:
unique_formulae = X['formula'].unique()
print(f'{len(unique_formulae)}')

3612


In [19]:
# Set a random seed to ensure reproducibility across runs
np.random.seed(seed=RNG_SEED)

# Store a list of all unique formulae
all_formulae = unique_formulae.copy()

# Define the proportional size of the dataset split
val_size = 0.20
test_size = 0.10
train_size = 1 - val_size - test_size

# Calculate the number of samples in each dataset split
num_val_samples = int(round(val_size * len(unique_formulae)))
num_test_samples = int(round(test_size * len(unique_formulae)))
num_train_samples = int(round((1 - val_size - test_size) * len(unique_formulae)))

# Randomly choose the formulae for the validation dataset, and remove those from the unique formulae list
val_formulae = np.random.choice(all_formulae, size=num_val_samples, replace=False)
all_formulae = [f for f in all_formulae if f not in val_formulae]

# Randomly choose the formulae for the test dataset, and remove those from the unique formulae list
test_formulae = np.random.choice(all_formulae, size=num_test_samples, replace=False)
all_formulae = [f for f in all_formulae if f not in test_formulae]

# The remaining formulae will be used for the training dataset
train_formulae = all_formulae.copy()

print('Number of training formulae:', len(train_formulae))
print('Number of validation formulae:', len(val_formulae))
print('Number of testing formulae:', len(test_formulae))

Number of training formulae: 2529
Number of validation formulae: 722
Number of testing formulae: 361
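
The counts printed above can be reproduced with a little arithmetic. The sketch below uses the 3612 unique formulae and the 20%/10% proportions from the cell above. It also shows why the training set is taken as the remainder: rounding each proportion independently does not necessarily add up to the total. The variable names are chosen for this illustration only.

```python
# Worked arithmetic for 3612 unique formulae with a 20% / 10% val / test split
n_formulae = 3612
val_size, test_size = 0.20, 0.10

num_val = int(round(val_size * n_formulae))    # 722.4 rounds to 722
num_test = int(round(test_size * n_formulae))  # 361.2 rounds to 361

# Rounding the training proportion on its own gives 2528 (from about 2528.4)
num_train_rounded = int(round((1 - val_size - test_size) * n_formulae))

# Taking the remainder guarantees that the three counts add up to the total
num_train = n_formulae - num_val - num_test

print(num_train, num_val, num_test)            # 2529 722 361
print(num_train_rounded + num_val + num_test)  # 3611, one short of 3612
```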


In [20]:
# Split the original dataset into the train/validation/test datasets using the formulae lists above
df_train = df[df['formula'].isin(train_formulae)]
df_val = df[df['formula'].isin(val_formulae)]
df_test = df[df['formula'].isin(test_formulae)]

print(f'train dataset shape: {df_train.shape}')
print(f'validation dataset shape: {df_val.shape}')
print(f'test dataset shape: {df_test.shape}\n')

train dataset shape: (2529, 3)
validation dataset shape: (722, 3)
test dataset shape: (361, 3)
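
As an alternative to the manual procedure above (it is not the method used in this notebook), scikit-learn provides `GroupShuffleSplit`, which keeps all rows of a group in the same split. A minimal sketch on made-up toy data, where the `formula` column serves as the group label and the formula names `F0` to `F9` are placeholders:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data: 10 formulae ('F0' ... 'F9') with 3 rows each
toy = pd.DataFrame({
    'formula': [f'F{i}' for i in range(10) for _ in range(3)],
    'energy': [1.0, 2.0, 3.0] * 10,
})

# Here test_size is the fraction of groups (formulae), not of rows
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(toy, groups=toy['formula']))

toy_train = toy.iloc[train_idx]
toy_test = toy.iloc[test_idx]

# No formula should appear in both splits
shared = set(toy_train['formula']) & set(toy_test['formula'])
print(f'train rows: {len(toy_train)}, test rows: {len(toy_test)}, '
      f'shared formulae: {len(shared)}')
```

To obtain a validation split as well, the same splitter could be applied a second time to the training portion.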



To confirm that no formula appears in more than one of the datasets, we can check the pairwise intersections:

In [21]:
train_formulae = set(df_train['formula'].unique())
val_formulae = set(df_val['formula'].unique())
test_formulae = set(df_test['formula'].unique())

common_formulae1 = train_formulae.intersection(test_formulae)
common_formulae2 = train_formulae.intersection(val_formulae)
common_formulae3 = test_formulae.intersection(val_formulae)

print(f'# of common formulae in intersection 1: {len(common_formulae1)}; common formulae: {common_formulae1}')
print(f'# of common formulae in intersection 2: {len(common_formulae2)}; common formulae: {common_formulae2}')
print(f'# of common formulae in intersection 3: {len(common_formulae3)}; common formulae: {common_formulae3}')

# of common formulae in intersection 1: 0; common formulae: set()
# of common formulae in intersection 2: 0; common formulae: set()
# of common formulae in intersection 3: 0; common formulae: set()
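
The three pairwise checks above could also be folded into a small helper that fails loudly when any two splits overlap. This is only a sketch: `assert_disjoint` is a hypothetical helper written for this illustration, and the formulae passed to it are made up.

```python
from itertools import combinations

def assert_disjoint(**named_sets):
    """Raise an AssertionError if any two of the named collections share an element."""
    for (name_a, set_a), (name_b, set_b) in combinations(named_sets.items(), 2):
        overlap = set(set_a) & set(set_b)
        assert not overlap, f'{name_a} and {name_b} share: {sorted(overlap)}'

# Hypothetical formulae, for illustration only
assert_disjoint(train={'Al2O3', 'SiO2'}, val={'TiO2'}, test={'ZnO'})
print('All splits are mutually exclusive.')
```

In this notebook, the call would take the three formula sets built in the cell above as its keyword arguments.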


## Save split datasets to csv

Finally, after splitting the dataset into train/validation/test dataset splits, you can save them to disk for you to use later.

By saving these dataset splits to files, you can reuse exactly the same splits in your subsequent model training and comparison steps.
Use the same datasets for all of your models; that way, you can ensure a fair comparison.

Also, when you publish your results, you can include these dataset splits, so that others can use the exact datasets in their own studies.

In [22]:
# saving these splits into csv files
PATH = os.getcwd()

train_path = os.path.join(PATH, '../data_abs/cp_train.csv')
val_path = os.path.join(PATH, '../data_abs/cp_val.csv')
test_path = os.path.join(PATH, '../data_abs/cp_test.csv')

df_train.to_csv(train_path, index=False)
df_val.to_csv(val_path, index=False)
df_test.to_csv(test_path, index=False)
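
To check that a saved split can be read back unchanged, a round trip through CSV can be sketched as follows. The toy DataFrame and the temporary file path are assumptions made for this illustration; the real splits are written to the paths defined above.

```python
import os
import tempfile
import pandas as pd

# Hypothetical toy split, standing in for df_train
toy_train = pd.DataFrame({
    'formula': ['Al2O3', 'SiO2'],
    'energy': [1.5, 2.0],
    'abs': [0.25, 0.5],
})

# Write to a temporary directory and read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy_train.csv')
    toy_train.to_csv(toy_path, index=False)
    reloaded = pd.read_csv(toy_path)

# Raises an AssertionError if the reloaded data differs from the original
pd.testing.assert_frame_equal(toy_train, reloaded)
print('Round trip OK')
```

Writing with `index=False`, as in the cell above, keeps the row index out of the file, so the reloaded DataFrame has the same columns as the original.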

Remember, keep the test dataset locked away and forget about it until you have finalized your model!
**Never look at the test dataset!!** 