# **Data Splitting**

## Objectives
- Split the dataset into train and test sets for each of the maths, writing and reading exam scores, so that models can be trained to predict each exam score

## Inputs
- outputs/datasets/collection/student-exam-results.csv

## Outputs
- train and test sets for each of the math_score, reading_score and writing_score variables

## Additional comments

## Change working directory

Since this notebook lives in the jupyter_notebooks directory, we need to change the current working directory from jupyter_notebooks to its parent, the project's root directory, so that any directories created in later code cells are added in the correct place.

We access the current directory with the `os` module's `getcwd()` function.

In [1]:
import os
current_directory = os.getcwd()
current_directory


'/workspace/Exam-Scores-Analysis/jupyter_notebooks'

We now set the working directory to the parent of the current working directory, jupyter_notebooks.

- The `os.path.dirname()` function gets the parent directory
- The `os.chdir()` function sets the new current directory
- We do this to access all of the project's files and directories, rather than only those in the jupyter_notebooks directory

In [2]:
os.chdir(os.path.dirname(current_directory))
print("You set a new current directory")

You set a new current directory


We now use a code cell to confirm that the current working directory has been set properly.

In [3]:
current_directory = os.getcwd()
current_directory

'/workspace/Exam-Scores-Analysis'
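
One caveat with the `os.chdir()` cell above: running it a second time moves the working directory up another level. A minimal sketch of a guard, assuming the notebook folder is always named `jupyter_notebooks` (the helper name `resolve_project_root` is hypothetical and not part of the original notebook):

```python
import os


def resolve_project_root(cwd, notebooks_dir="jupyter_notebooks"):
    """Return the parent of cwd only when cwd is the notebooks folder."""
    # normpath removes any trailing separator so basename() sees the folder name
    cwd = os.path.normpath(cwd)
    if os.path.basename(cwd) == notebooks_dir:
        return os.path.dirname(cwd)
    # Not inside the notebooks folder: leave the path unchanged
    return cwd


# In the notebook this would be used as:
# os.chdir(resolve_project_root(os.getcwd()))
```

With this guard, re-running the cell leaves the directory at the project root instead of moving up again.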

## Load data

In [4]:
import pandas as pd
df = pd.read_csv('outputs/datasets/collection/student-exam-results.csv')
df.head()

Unnamed: 0,gender,ethnicity,parental_education,lunch_program,test_preparation_course,math_score,reading_score,writing_score,average_score,literacy_score
0,male,group A,high school,standard,completed,67,67,63,65,65
1,female,group D,some high school,free/reduced,none,40,59,55,51,57
2,male,group E,some college,free/reduced,none,59,60,50,56,55
3,male,group B,high school,standard,none,77,78,68,74,73
4,male,group E,associate's degree,standard,completed,78,73,68,73,70


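Now we create three datasets, one per target score. Each keeps the five feature columns and a single score column, and drops the other two exam scores along with average_score and literacy_score.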

In [5]:
df_math_score = df.drop(['reading_score', 'writing_score', 'average_score', 'literacy_score'], axis=1)
df_math_score.head()

Unnamed: 0,gender,ethnicity,parental_education,lunch_program,test_preparation_course,math_score
0,male,group A,high school,standard,completed,67
1,female,group D,some high school,free/reduced,none,40
2,male,group E,some college,free/reduced,none,59
3,male,group B,high school,standard,none,77
4,male,group E,associate's degree,standard,completed,78


In [6]:
df_reading_score = df.drop(['math_score', 'writing_score', 'average_score', 'literacy_score'], axis=1)
df_reading_score.head()

Unnamed: 0,gender,ethnicity,parental_education,lunch_program,test_preparation_course,reading_score
0,male,group A,high school,standard,completed,67
1,female,group D,some high school,free/reduced,none,59
2,male,group E,some college,free/reduced,none,60
3,male,group B,high school,standard,none,78
4,male,group E,associate's degree,standard,completed,73


In [7]:
df_writing_score = df.drop(['math_score', 'reading_score', 'average_score', 'literacy_score'], axis=1)
df_writing_score.head()

Unnamed: 0,gender,ethnicity,parental_education,lunch_program,test_preparation_course,writing_score
0,male,group A,high school,standard,completed,63
1,female,group D,some high school,free/reduced,none,55
2,male,group E,some college,free/reduced,none,50
3,male,group B,high school,standard,none,68
4,male,group E,associate's degree,standard,completed,68
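
The three cells above repeat one pattern: keep the features and one target, drop the remaining score columns. As a sketch only, the same filtering could be written once and reused. The names `score_columns`, `derived_columns` and `filter_for_target` are illustrative assumptions, and the two-row `sample` frame simply copies the first two rows of the `df.head()` output:

```python
import pandas as pd

score_columns = ["math_score", "reading_score", "writing_score"]
derived_columns = ["average_score", "literacy_score"]


def filter_for_target(df, target):
    """Keep the features and one target; drop the other score columns."""
    to_drop = [c for c in score_columns if c != target] + derived_columns
    return df.drop(to_drop, axis=1)


# Two rows copied from the df.head() output, for illustration only
sample = pd.DataFrame({
    "gender": ["male", "female"],
    "ethnicity": ["group A", "group D"],
    "parental_education": ["high school", "some high school"],
    "lunch_program": ["standard", "free/reduced"],
    "test_preparation_course": ["completed", "none"],
    "math_score": [67, 40],
    "reading_score": [67, 59],
    "writing_score": [63, 55],
    "average_score": [65, 51],
    "literacy_score": [65, 57],
})

# One filtered dataframe per target, keyed by the target column name
filtered = {target: filter_for_target(sample, target) for target in score_columns}
print(list(filtered["math_score"].columns))
```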


### Math Score

Now we can split these datasets to form train and test sets. First, we import the `train_test_split` function from scikit-learn.

In [8]:
from sklearn.model_selection import train_test_split

In [16]:
math_train_vars, math_test_vars, math_train_score, math_test_score = train_test_split(
    df_math_score.drop(['math_score'], axis=1),
    df_math_score['math_score'],
    test_size = 0.2,
    random_state = 7
)

Now we check to see if this has worked:

In [26]:
print(f"math_train_vars: {math_train_vars.shape}")
print(f"math_test_vars: {math_test_vars.shape}")
print(f"math_train_score: {math_train_score.shape}")
print(f"math_test_score: {math_test_score.shape}")


math_train_vars: (800, 5)
math_test_vars: (200, 5)
math_train_score: (800,)
math_test_score: (200,)


Excellent - we see that the math_train_vars and math_train_score datasets both have 800 entries, equivalent to 80% of the original 1000-record dataset. We also see that the math_test_vars and math_test_score datasets both have 200 entries, equivalent to the remaining 20%. For good measure, we will inspect the dataframes and series below:

In [27]:
math_train_vars

Unnamed: 0,gender,ethnicity,parental_education,lunch_program,test_preparation_course
600,female,group C,some college,free/reduced,none
80,female,group B,bachelor's degree,standard,none
158,female,group D,associate's degree,free/reduced,none
423,male,group D,associate's degree,standard,none
747,female,group A,high school,standard,none
...,...,...,...,...,...
579,female,group B,high school,free/reduced,none
502,male,group D,associate's degree,standard,none
537,female,group D,some high school,free/reduced,completed
196,male,group D,bachelor's degree,standard,none


In [18]:
math_test_vars

Unnamed: 0,gender,ethnicity,parental_education,lunch_program,test_preparation_course
778,male,group C,master's degree,free/reduced,none
334,male,group C,high school,standard,none
271,female,group D,some high school,standard,none
802,female,group B,some college,standard,completed
216,female,group B,associate's degree,standard,none
...,...,...,...,...,...
371,female,group B,some college,standard,completed
411,female,group B,bachelor's degree,free/reduced,none
644,female,group C,some high school,standard,none
981,male,group C,some college,standard,none


In [19]:
math_train_score

600    40
80     46
158    82
423    77
747    77
       ..
579    47
502    71
537    60
196    50
175    50
Name: math_score, Length: 800, dtype: int64

In [20]:
math_test_score

778    48
334    48
271    59
802    69
216    53
       ..
371    66
411    62
644    40
981    64
365    60
Name: math_score, Length: 200, dtype: int64

Upon inspection, the out-of-order index values show that the rows have also been shuffled.
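
The same checks can be made programmatically rather than by eye. A minimal sketch, assuming a stand-in dataframe of 1000 rows (the column names `feature` and `score` and their values are made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data with 1000 rows, mirroring the size of the real dataset
toy = pd.DataFrame({"feature": range(1000), "score": range(1000)})

train_vars, test_vars, train_score, test_score = train_test_split(
    toy.drop(["score"], axis=1),
    toy["score"],
    test_size=0.2,
    random_state=7,
)

# Features and target stay aligned row for row
assert train_vars.index.equals(train_score.index)
assert test_vars.index.equals(test_score.index)

# No row appears in both sets, and together they cover every row
assert set(train_vars.index).isdisjoint(test_vars.index)
assert len(train_vars) + len(test_vars) == len(toy)

# The train index is no longer in sorted order, i.e. the rows were shuffled
assert not train_vars.index.is_monotonic_increasing
```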

### Reading Score

We will now do the same with the reading_score variable

In [22]:
reading_train_vars, reading_test_vars, reading_train_score, reading_test_score = train_test_split(
    df_reading_score.drop(['reading_score'], axis=1),
    df_reading_score['reading_score'],
    test_size = 0.2,
    random_state = 7
)

Now we check to see if these have been created properly

In [28]:
print(f"reading_train_vars: {reading_train_vars.shape}")
print(f"reading_test_vars: {reading_test_vars.shape}")
print(f"reading_train_score: {reading_train_score.shape}")
print(f"reading_test_score: {reading_test_score.shape}")

reading_train_vars: (800, 5)
reading_test_vars: (200, 5)
reading_train_score: (800,)
reading_test_score: (200,)


Excellent - 800 records for the train set and 200 records for the test set. Since we used the same random state on a dataset with the same number of rows, the rows are shuffled and split in the same way as for math_score.
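
A minimal sketch of what reusing `random_state = 7` implies: for inputs with the same number of rows, `train_test_split` selects the same rows each time, so the train and test sets for different targets contain the same students. The toy dataframe and its column names are assumptions for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({
    "feature": range(1000),
    "math_score": range(1000),
    "reading_score": range(1000, 2000),
})

# Split once per target, reusing the same random_state and test_size
math_train, math_test = train_test_split(
    toy[["feature", "math_score"]], test_size=0.2, random_state=7
)
reading_train, reading_test = train_test_split(
    toy[["feature", "reading_score"]], test_size=0.2, random_state=7
)

# Compare which rows landed in each set
print(math_train.index.equals(reading_train.index))
print(math_test.index.equals(reading_test.index))
```

Both comparisons print `True`, because the shuffle depends only on the seed and the number of rows, not on the column values.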

### Writing Score

Finally, we split the writing_score dataset

In [29]:
writing_train_vars, writing_test_vars, writing_train_score, writing_test_score = train_test_split(
    df_writing_score.drop(['writing_score'], axis=1),
    df_writing_score['writing_score'],
    test_size = 0.2,
    random_state = 7
)

In [30]:
print(f"writing_train_vars: {writing_train_vars.shape}")
print(f"writing_test_vars: {writing_test_vars.shape}")
print(f"writing_train_score: {writing_train_score.shape}")
print(f"writing_test_score: {writing_test_score.shape}")

writing_train_vars: (800, 5)
writing_test_vars: (200, 5)
writing_train_score: (800,)
writing_test_score: (200,)


As before, we have 800 records for the train set and 200 for the test set

### Save files

We can now save these various datasets as CSV files in the outputs folder. First, we'll create the necessary folders:

In [40]:
import os
try:
    ! rm -r outputs/datasets/filtered
    ! rm -r outputs/datasets/split
    os.makedirs(name='outputs/datasets/filtered')
    print('directory outputs/datasets/filtered created')
    os.makedirs(name='outputs/datasets/split')
    print('directory outputs/datasets/split created')
except Exception as e:
    print(e)

directory outputs/datasets/filtered created
directory outputs/datasets/split created
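
The `! rm -r` lines rely on IPython shell escapes and a Unix-style shell. As an alternative sketch using only the standard library (the helper name `reset_directory` is hypothetical; this is not the approach the notebook itself takes):

```python
import os
import shutil


def reset_directory(path):
    """Delete path if it exists, then create it again, empty."""
    # ignore_errors=True makes a missing directory a no-op rather than an error
    shutil.rmtree(path, ignore_errors=True)
    # exist_ok=True avoids an error if the directory is somehow still present
    os.makedirs(path, exist_ok=True)
    print(f"directory {path} created")


# In the notebook this would be called as:
# reset_directory("outputs/datasets/filtered")
# reset_directory("outputs/datasets/split")
```

Like the original cell, this deletes everything already inside the target folders, so it should only be pointed at output directories that can be regenerated.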


Now we save the filtered datasets in their unsplit form

In [41]:
! rm outputs/datasets/filtered/math-score-dataset.csv
df_math_score.to_csv("outputs/datasets/filtered/math-score-dataset.csv", index=False)
print('math-score-dataset.csv created in outputs/datasets/filtered')

rm: cannot remove 'outputs/datasets/filtered/math-score-dataset.csv': No such file or directory
math-score-dataset.csv created in outputs/datasets/filtered


In [42]:
! rm outputs/datasets/filtered/reading-score-dataset.csv
df_reading_score.to_csv("outputs/datasets/filtered/reading-score-dataset.csv", index=False)
print('reading-score-dataset.csv created in outputs/datasets/filtered')

rm: cannot remove 'outputs/datasets/filtered/reading-score-dataset.csv': No such file or directory
reading-score-dataset.csv created in outputs/datasets/filtered


In [43]:
! rm outputs/datasets/filtered/writing-score-dataset.csv
df_writing_score.to_csv("outputs/datasets/filtered/writing-score-dataset.csv", index=False)
print('writing-score-dataset.csv created in outputs/datasets/filtered')

rm: cannot remove 'outputs/datasets/filtered/writing-score-dataset.csv': No such file or directory
writing-score-dataset.csv created in outputs/datasets/filtered


Now we can save the split datasets

In [44]:
try:
    
    ! rm -r outputs/datasets/split/math
    ! rm -r outputs/datasets/split/reading
    ! rm -r outputs/datasets/split/writing
    os.makedirs(name='outputs/datasets/split/math')
    print('directory outputs/datasets/split/math created')
    os.makedirs(name='outputs/datasets/split/reading')
    print('directory outputs/datasets/split/reading created')
    os.makedirs(name='outputs/datasets/split/writing')
    print('directory outputs/datasets/split/writing created')
except Exception as e:
    print(e)

rm: cannot remove 'outputs/datasets/split/math': No such file or directory
rm: cannot remove 'outputs/datasets/split/reading': No such file or directory
rm: cannot remove 'outputs/datasets/split/writing': No such file or directory
directory outputs/datasets/split/math created
directory outputs/datasets/split/reading created
directory outputs/datasets/split/writing created


In [47]:
! rm outputs/datasets/split/math/math-train-vars.csv
math_train_vars.to_csv("outputs/datasets/split/math/math-train-vars.csv", index=False)
print('math-train-vars.csv created in outputs/datasets/split/math')

! rm outputs/datasets/split/math/math-test-vars.csv
math_test_vars.to_csv("outputs/datasets/split/math/math-test-vars.csv", index=False)
print('math-test-vars.csv created in outputs/datasets/split/math')

! rm outputs/datasets/split/math/math-train-score.csv
math_train_score.to_csv("outputs/datasets/split/math/math-train-score.csv", index=False)
print('math-train-score.csv created in outputs/datasets/split/math')

! rm outputs/datasets/split/math/math-test-score.csv
math_test_score.to_csv("outputs/datasets/split/math/math-test-score.csv", index=False)
print('math-test-score.csv created in outputs/datasets/split/math')

math-train-vars.csv created in outputs/datasets/split/math
math-test-vars.csv created in outputs/datasets/split/math
math-train-score.csv created in outputs/datasets/split/math
rm: cannot remove 'outputs/datasets/split/math/math-test-score.csv': No such file or directory
math-test-score.csv created in outputs/datasets/split/math


In [48]:
! rm outputs/datasets/split/reading/reading-train-vars.csv
reading_train_vars.to_csv("outputs/datasets/split/reading/reading-train-vars.csv", index=False)
print('reading-train-vars.csv created in outputs/datasets/split/reading')

! rm outputs/datasets/split/reading/reading-test-vars.csv
reading_test_vars.to_csv("outputs/datasets/split/reading/reading-test-vars.csv", index=False)
print('reading-test-vars.csv created in outputs/datasets/split/reading')

! rm outputs/datasets/split/reading/reading-train-score.csv
reading_train_score.to_csv("outputs/datasets/split/reading/reading-train-score.csv", index=False)
print('reading-train-score.csv created in outputs/datasets/split/reading')

! rm outputs/datasets/split/reading/reading-test-score.csv
reading_test_score.to_csv("outputs/datasets/split/reading/reading-test-score.csv", index=False)
print('reading-test-score.csv created in outputs/datasets/split/reading')

rm: cannot remove 'outputs/datasets/split/reading/reading-train-vars.csv': No such file or directory
reading-train-vars.csv created in outputs/datasets/split/reading
rm: cannot remove 'outputs/datasets/split/reading/reading-test-vars.csv': No such file or directory
reading-test-vars.csv created in outputs/datasets/split/reading
rm: cannot remove 'outputs/datasets/split/reading/reading-train-score.csv': No such file or directory
reading-train-score.csv created in outputs/datasets/split/reading
rm: cannot remove 'outputs/datasets/split/reading/reading-test-score.csv': No such file or directory
reading-test-score.csv created in outputs/datasets/split/reading


In [49]:
! rm outputs/datasets/split/writing/writing-train-vars.csv
writing_train_vars.to_csv("outputs/datasets/split/writing/writing-train-vars.csv", index=False)
print('writing-train-vars.csv created in outputs/datasets/split/writing')

! rm outputs/datasets/split/writing/writing-test-vars.csv
writing_test_vars.to_csv("outputs/datasets/split/writing/writing-test-vars.csv", index=False)
print('writing-test-vars.csv created in outputs/datasets/split/writing')

! rm outputs/datasets/split/writing/writing-train-score.csv
writing_train_score.to_csv("outputs/datasets/split/writing/writing-train-score.csv", index=False)
print('writing-train-score.csv created in outputs/datasets/split/writing')

! rm outputs/datasets/split/writing/writing-test-score.csv
writing_test_score.to_csv("outputs/datasets/split/writing/writing-test-score.csv", index=False)
print('writing-test-score.csv created in outputs/datasets/split/writing')

rm: cannot remove 'outputs/datasets/split/writing/writing-train-vars.csv': No such file or directory
writing-train-vars.csv created in outputs/datasets/split/writing
rm: cannot remove 'outputs/datasets/split/writing/writing-test-vars.csv': No such file or directory
writing-test-vars.csv created in outputs/datasets/split/writing
rm: cannot remove 'outputs/datasets/split/writing/writing-train-score.csv': No such file or directory
writing-train-score.csv created in outputs/datasets/split/writing
rm: cannot remove 'outputs/datasets/split/writing/writing-test-score.csv': No such file or directory
writing-test-score.csv created in outputs/datasets/split/writing
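
The save cells above all follow one naming pattern, `outputs/datasets/split/<subject>/<subject>-<part>.csv`. A sketch of the same idea as a helper, assuming a hypothetical function `save_split` and toy stand-in data written to a temporary folder; it also reads one file back to confirm its shape. Note that `index=False` means the original row labels are not written to the files:

```python
import os
import tempfile

import pandas as pd


def save_split(subject, parts, base_dir):
    """Write each named part to <base_dir>/<subject>/<subject>-<part>.csv."""
    folder = os.path.join(base_dir, subject)
    os.makedirs(folder, exist_ok=True)
    paths = {}
    for part, data in parts.items():
        path = os.path.join(folder, f"{subject}-{part}.csv")
        data.to_csv(path, index=False)
        paths[part] = path
    return paths


# Toy stand-ins for math_train_vars and math_train_score
toy_vars = pd.DataFrame({"gender": ["female", "female"]}, index=[600, 80])
toy_score = pd.Series([40, 46], index=[600, 80], name="math_score")

base_dir = tempfile.mkdtemp()  # temporary folder so nothing real is overwritten
paths = save_split("math", {"train-vars": toy_vars, "train-score": toy_score}, base_dir)

# A saved Series comes back as a one-column DataFrame with a fresh 0..n-1 index
reloaded = pd.read_csv(paths["train-score"])
print(reloaded.shape, list(reloaded.index))
```

Because the row labels are dropped, the vars and score files can only be matched up again by row position, which holds as long as both are saved from the same split.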
