# More Data Wrangling 5

In this notebook I demonstrate how to split a dataframe into two random subsets, allocating a certain percentage of the data to one object and the remainder to another. This is useful for machine learning and statistical learning analyses where we want to assess how well a statistical model predicts out-of-sample data. This is commonly referred to as a train/test split: the model is trained on part of the data and then tested on a different subset to see how well it generalises to unseen observations.

In [1]:
import pandas as pd

import numpy as np

In [2]:
maz = pd.read_csv('MarioMscData.csv')

In [3]:
# Split a dataframe into two random subsets.

# Looking at the length of the dataset from Mario's MSc:

len(maz)

589

In [4]:
# Splitting the dataset on a 75/25 random split using the .sample() method and assigning
# the sample to a new data frame (random_state fixes the seed so the split is reproducible):

maz_1 = maz.sample(frac = 0.75, random_state = 1234)
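
Since 75% of 589 rows is not a whole number, it can help to see how many rows `frac = 0.75` actually returns. The following is a minimal sketch on a synthetic frame of the same length as the dataset; the names `demo`, `x` and `demo_sample` are made up for illustration and are not part of the original data. As far as I am aware, pandas rounds `frac * len(df)` to the nearest whole number of rows.

```python
import pandas as pd

# Synthetic stand-in for the real data: 589 rows and one made-up column.
demo = pd.DataFrame({"x": range(589)})

# 0.75 * 589 = 441.75, which is not a whole number of rows.
demo_sample = demo.sample(frac=0.75, random_state=1234)

# Count the sampled rows and the rows left over for the second subset.
n_sampled = len(demo_sample)
n_remaining = len(demo) - n_sampled
print(n_sampled, n_remaining)
```

The two counts should line up with the index lengths reported for `maz_1` and `maz_2` further down.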

In [5]:
# Assigning the other 25% of the dataset to a second data frame using the .drop() method:

maz_2 = maz.drop(maz_1.index)

In [6]:
# Checking that the combined length of both subsets matches the length of the original data file:

len(maz_1) + len(maz_2)

589
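
Matching lengths do not by themselves show that every row landed in exactly one subset. The following is a minimal sketch of a stricter check, again on a synthetic frame rather than the real data; `df`, `part_a` and `part_b` are illustrative names, and the check assumes the index labels are unique.

```python
import pandas as pd

# Synthetic stand-in for the real data; all names here are illustrative.
df = pd.DataFrame({"x": range(589)})
part_a = df.sample(frac=0.75, random_state=1234)
part_b = df.drop(part_a.index)

# Boolean arrays: True where an original row label appears in each subset.
in_a = df.index.isin(part_a.index)
in_b = df.index.isin(part_b.index)

# XOR is True only when a row is in exactly one of the two subsets.
is_partition = bool((in_a ^ in_b).all())
print(is_partition)
```

If this prints `True`, no row is shared between the subsets and no row has been lost.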

In [7]:
# Using the index to show that every participant is in either maz_1 or maz_2:

maz_1.index.sort_values()

Int64Index([  0,   5,   6,   7,   9,  11,  12,  13,  16,  17,
            ...
            576, 577, 578, 580, 581, 582, 584, 586, 587, 588],
           dtype='int64', length=442)

In [8]:
maz_2.index.sort_values()

# Keep in mind that this approach only works if the index values are unique: .drop() removes
# every row whose label appears in maz_1.index, so duplicated labels would drop extra rows.

Int64Index([  1,   2,   3,   4,   8,  10,  14,  15,  18,  26,
            ...
            545, 552, 557, 560, 562, 565, 574, 579, 583, 585],
           dtype='int64', length=147)
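
The caveat about non-unique index values can be illustrated with a small sketch. The frames below are made up for the purpose (they are not from the MSc dataset), and resetting the index is one possible workaround, assuming the original labels do not need to be kept.

```python
import pandas as pd

# Two made-up frames stacked without resetting the index,
# so the labels 0 and 1 each appear twice.
first = pd.DataFrame({"x": [10, 20]})
second = pd.DataFrame({"x": [30, 40]})
stacked = pd.concat([first, second])
print(stacked.index.is_unique)

# Any three of the four rows must cover both labels,
# so drop removes every row and the remainder is empty.
picked = stacked.sample(frac=0.75, random_state=1234)
leftover = stacked.drop(picked.index)
print(len(picked), len(leftover))

# Giving each row its own label first restores the expected split.
clean = stacked.reset_index(drop=True)
clean_a = clean.sample(frac=0.75, random_state=1234)
clean_b = clean.drop(clean_a.index)
print(len(clean_a), len(clean_b))
```

Checking `index.is_unique` before splitting is a cheap way to catch this problem early.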