# Partition Raw CSV Dataset Into Training/Testing Subsets

Suppose we have a directory of CSV files that collectively represent a dataset.  Each row in a file is of the form
$$f_1, f_2, \ldots, f_N, \ell_1, \ell_2, \ldots, \ell_K$$ where each $f_i$ is a feature and each $\ell_j$ is a ground truth label/category.  Assume there are several CSV files, each with several rows.  There is no guarantee that the examples are nicely shuffled among the files or organized in any particular order.  You get what you get.

We want to take all the examples in all the files, shuffle them all together, and redistribute them into a new set of files.  This new set will be partitioned into subdirectories, according to our training needs.  For example, if we're doing standard training we'll want three subdirectories: one for training, one for validation, and one for testing.  If we're doing K-fold cross-validation, we'll want K + 1 subdirectories (K for training/validation and one for testing); here K is the number of folds, not the number of labels in the row format above.
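To make the partitioning concrete, here is a minimal sketch of how a set of proportions could be turned into non-overlapping slices of the shuffled list. The helper names `subset_boundaries` and `kfold_proportions` are hypothetical (they do not appear elsewhere in this notebook), and naming the folds `'0'`, `'1'`, ... is an assumption.

```python
def subset_boundaries(n_lines, proportions):
    """Map each subset name to a (start, end) slice of the shuffled line list."""
    boundaries = {}
    start = 0
    for name, fraction in proportions.items():
        # int() truncates, so a few lines at the very end may go unused
        end = start + int(fraction * n_lines)
        boundaries[name] = (start, end)
        start = end  # the next subset begins where this one ended
    return boundaries


def kfold_proportions(k):
    """Equal shares for k folds plus one held-out test subset (k + 1 in total)."""
    proportions = {str(i): 1 / (k + 1) for i in range(k)}  # fold names are an assumption
    proportions['test'] = 1 / (k + 1)
    return proportions


# Standard training: three subsets
standard = subset_boundaries(100, {'train': 0.6, 'validation': 0.2, 'test': 0.2})

# 4-fold cross-validation: five subsets of equal size
folds = subset_boundaries(100, kfold_proportions(4))
```

With 100 lines, `standard` maps `'train'` to `(0, 60)`, `'validation'` to `(60, 80)`, and `'test'` to `(80, 100)`, so no example lands in more than one subset.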

**ASSUMPTION**: We assume all the data from all the files fit in memory (+ virtual memory).  Obviously, this is not always the case.

In [None]:
SOURCE_DIR = 'cluster2D'
TARGET_DIR = 'dataset_cluster2D'
EXAMPLES_PER_FILE = 7

# List of subdirectories and the proportion of the total data allocated to each one
SUBSET_PROPORTIONS = {'train': 0.6, 'validation': 0.2, 'test': 0.2}

# For K-fold cross validation, do this instead:
# K = ...
# SUBSET_NAMES = [str(k) for k in range(K)] + ['test']
# SUBSET_PROPORTIONS = {name: 1/(K+1) for name in SUBSET_NAMES}

In [None]:
import random
import os
import numpy as np

# Get list of input filenames to be put in the blender
input_csv_file_list = [os.path.join(SOURCE_DIR, x) for x in os.listdir(SOURCE_DIR) if x.endswith('.csv')]

# Make a big list of all the lines in all the files and scramble it up
all_the_lines = []
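for filename in input_csv_file_list:
    with open(filename, "r") as fid:
        # Make sure every line ends with a newline so shuffled rows never get glued together
        all_the_lines += [line if line.endswith("\n") else line + "\n" for line in fid]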
random.shuffle(all_the_lines)

# For each subdirectory, find the block in 'all_the_lines' that will be written to it.
# [This would look nicer done with list comprehensions, but we need to do everything in-place.]
subset_start = 0
n_lines = len(all_the_lines)
for subset in SUBSET_PROPORTIONS.keys():
    
    subset_dir = os.path.join(TARGET_DIR, subset)
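    # exist_ok=True lets this cell be re-run; clear TARGET_DIR first to avoid stale files
    os.makedirs(subset_dir, exist_ok=True)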
    
    subset_len = int(SUBSET_PROPORTIONS[subset] * n_lines)
    subset_end = subset_start + subset_len
    num_files_in_subset = int(np.ceil(subset_len / EXAMPLES_PER_FILE))
    
    # For each output file to be written in this subdirectory
    a = subset_start
    for i in range(num_files_in_subset):
        b = min(subset_end, a + EXAMPLES_PER_FILE)
        file_path = os.path.join(subset_dir, str(i) + '.csv')
        lines = all_the_lines[a:b]
        with open(file_path, "w") as fid:
            fid.writelines(lines)
        a = b

    # Advance to the next block so the subsets do not overlap
    subset_start = subset_end
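
After writing the files, it can be worth checking that each subdirectory received roughly its share of the examples. The following is a minimal sketch of such a check, assuming the directory layout produced above; the helper name `count_examples` is hypothetical.

```python
import os


def count_examples(target_dir):
    """Return {subset_name: number_of_lines} for every subdirectory of target_dir."""
    counts = {}
    for subset in sorted(os.listdir(target_dir)):
        subset_dir = os.path.join(target_dir, subset)
        if not os.path.isdir(subset_dir):
            continue  # ignore stray files at the top level
        total = 0
        for name in os.listdir(subset_dir):
            if name.endswith('.csv'):
                with open(os.path.join(subset_dir, name), 'r') as fid:
                    total += sum(1 for _ in fid)  # one example per line
        counts[subset] = total
    return counts
```

Calling `count_examples(TARGET_DIR)` should give counts close to the ratios in `SUBSET_PROPORTIONS`. Their sum can fall slightly short of the number of input lines, because each subset length is truncated with `int()`.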