# Random Split
Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.

In [1]:
import azureml.dataprep as dprep

Azure ML Data Prep can split a data set into two. When training a machine learning model, it is often desirable to train the model on one subset of the data, then validate it on a different subset.

The `random_split(percentage, seed=None, split_dataflow_name=None)` method randomly splits a Dataflow into two distinct subsets. `percentage` is given as a fraction (for example, `0.1` for 10%). The first subset contains approximately that share of the rows, and the second contains the rest.

The `seed` parameter is optional. If a seed is not provided, a stable one is generated, so that the results for a specific Dataflow remain consistent. Each separate call to `random_split` receives a different seed.
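
To make the roles of `percentage` and `seed` concrete, here is a minimal pure-Python sketch of a seeded random split. It is not how Data Prep implements `random_split`. The function name `sketch_random_split` and the one-draw-per-row strategy are assumptions chosen only to illustrate why the first subset approximates `percentage` and why a fixed seed makes the split repeatable.

```python
import random


def sketch_random_split(rows, percentage, seed=None):
    """Illustrative only: split rows with one independent random draw per row.

    This is NOT the Data Prep implementation; it is an assumed model that
    shows why the first subset only approximates `percentage` of the rows.
    """
    rng = random.Random(seed)  # a fixed seed makes the draws repeatable
    first, second = [], []
    for row in rows:
        # Each row lands in the first subset with probability `percentage`.
        (first if rng.random() < percentage else second).append(row)
    return first, second


rows = list(range(10000))
test_rows, train_rows = sketch_random_split(rows, percentage=0.1, seed=12345)

# The two subsets are disjoint and together cover every original row.
assert len(test_rows) + len(train_rows) == len(rows)
assert set(test_rows).isdisjoint(train_rows)

# Reusing the seed reproduces the split exactly.
assert sketch_random_split(rows, 0.1, seed=12345)[0] == test_rows

print(len(test_rows), len(train_rows))
```

Under this model the size of the first subset varies from seed to seed, but the two sizes always add up to the original row count.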

The `split_dataflow_name` parameter is also optional. If a name is not provided, the second Dataflow returned is named after the first Dataflow, with the suffix "_split" appended.

To demonstrate, you can work through the following example. First, read the first 10,000 lines from a file. Since the contents of the file don't matter here, keeping just the first two columns is enough for a simple example.

In [2]:
dflow = dprep.read_csv(path='https://dpreptestfiles.blob.core.windows.net/testfiles/crime0.csv').take(10000)
dflow = dflow.keep_columns(['ID', 'Date'])
dflow = dflow.set_name('dflow_test_name')
profile = dflow.get_profile()
print('Row count of "%s": %d' % (dflow.name, profile.columns['ID'].count))

Row count of "dflow_test_name": 10000


Next, you can call `random_split` with the percentage set to 10% (the actual split ratio will be an approximation of `percentage`), then look at the row count of the first returned Dataflow. You should see that `dflow_test` has approximately 1,000 rows (10% of 10,000).

In [3]:
(dflow_test, dflow_train) = dflow.random_split(percentage=0.1)
profile_test = dflow_test.get_profile()
print('Row count of "%s": %d' % (dflow_test.name, profile_test.columns['ID'].count))

Row count of "dflow_test_name": 956


Now you can take a look at the row count of the second returned Dataflow. The row counts of `dflow_test` and `dflow_train` sum to exactly 10,000, because `random_split` produces two subsets that together make up the original Dataflow. Also note that the second Dataflow's name is "dflow_test_name_split", because no Dataflow name was supplied to `random_split`.

In [4]:
profile_train = dflow_train.get_profile()
print('Row count of "%s": %d' % (dflow_train.name, profile_train.columns['ID'].count))

Row count of "dflow_test_name_split": 9044
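
As a quick sanity check on the two printed counts, the arithmetic below confirms that they sum to the original 10,000 rows and shows how far the split is from an exact 10%. The standard deviation estimate assumes each row is assigned by an independent random draw (a binomial model). That is an assumption for illustration, not a documented property of `random_split`.

```python
import math

total_rows = 10000
percentage = 0.1
test_count = 956     # row count printed for dflow_test above
train_count = 9044   # row count printed for dflow_train above

# The two subsets account for every row of the original Dataflow.
assert test_count + train_count == total_rows

# Actual share of rows that ended up in the first subset.
actual_ratio = test_count / total_rows

# Assumed binomial model: mean n*p, standard deviation sqrt(n*p*(1-p)).
expected = total_rows * percentage
std_dev = math.sqrt(total_rows * percentage * (1 - percentage))
deviation = (test_count - expected) / std_dev

print(f"actual ratio: {actual_ratio:.4f}")
print(f"expected about {expected:.0f} rows, standard deviation {std_dev:.0f}")
print(f"observed deviation: {deviation:.2f} standard deviations")
```

Under that assumed model the standard deviation is 30 rows, so a count of 956 is roughly 1.5 standard deviations below 1,000, which is ordinary random variation.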


To specify a fixed seed, or a name for the second Dataflow, simply provide them to the `random_split` function.

In [5]:
(dflow_test, dflow_train) = dflow.random_split(percentage=0.1, seed=12345, split_dataflow_name='random_split_dflow_train')