### Importing Necessary Modules
To build a machine learning model that predicts whether a passenger aboard the Titanic survived, the cleaned data needs to be split into training, validation and test sets. This can be done with the "train_test_split" function from the sklearn library.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

### Read in the Data

The cleaned Titanic dataset is read in with the pandas "read_csv" function, and the first 10 rows are displayed as shown below.

In [2]:
address = r"...\...\...\titanic_EDA\Datasets\clean_titanic_data.csv"

titanic_data = pd.read_csv(address)
titanic_data.head(10)

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Cabin_Indicator,Family_Count
0,0,3,0,22.0,7.25,0,1
1,1,1,1,38.0,71.2833,1,1
2,1,3,1,26.0,7.925,0,0
3,1,1,1,35.0,53.1,1,1
4,0,3,0,35.0,8.05,0,0
5,0,3,0,30.0,8.4583,0,0
6,0,1,0,54.0,51.8625,1,0
7,0,3,0,2.0,21.075,0,4
8,1,3,1,27.0,11.1333,0,2
9,1,2,1,14.0,30.0708,0,1
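The "Unnamed: 0" column in the output above looks like a row index that was written out when the cleaned file was saved. That is an assumption, since the cleaning step is not shown here. If it holds, the column says nothing about the passenger, yet it would become one of the features in the next step. A minimal sketch of dropping it, using a small stand-in frame ("sample" and "sample_clean" are illustrative names) rather than the real file:

```python
import pandas as pd

# Tiny stand-in for the first rows of the cleaned data
# (values copied from the displayed output, only some columns kept).
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Fare": [7.25, 71.2833, 7.925],
})

# Assumption: "Unnamed: 0" is only a leftover row index, so it is safe to drop.
# errors="ignore" makes the call a no-op if the column is not present.
sample_clean = sample.drop(columns=["Unnamed: 0"], errors="ignore")

print(list(sample_clean.columns))
```

Passing "index_col=0" to "read_csv" is another way to get the same result, because that column is then treated as the index rather than as data.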


### Separating the Data
The dataset is now split into "Features" and "Labels". The "Features", also called "Predictors", are the passenger attributes used to predict whether that passenger survived. The "Labels", also called the "Target", are the outcome being predicted: whether a passenger survived or not.

### Separation Method
The "Survived" column is assigned to the "labels" variable, and the dataset with that column dropped is assigned to the "features" variable. These two variables are also commonly called "X" and "y".

60% of the data is assigned to the training set and the remaining 40% is held out. The held-out 40% is then divided in half: one half is assigned to the validation set and the other half remains the test set.

In [3]:
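# Features: every column except the target. Labels: the target column.
features = titanic_data.drop('Survived', axis = 1)
labels = titanic_data['Survived']

# First split: 60% training, 40% held out
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size = 0.4)

# Second split: divide the held-out 40% evenly into validation and test sets
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size = 0.5)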
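The split above shuffles the rows differently on every run, so the three sets change each time the notebook is executed. The sketch below is a variant, not the method used in this notebook: it fixes the shuffle with "random_state" and uses "stratify" so each set keeps roughly the same share of survivors. It runs on synthetic stand-in data; the "demo" names and the seed value 42 are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 40% positive labels (not the real Titanic data).
demo = pd.DataFrame({
    "Survived": [1] * 40 + [0] * 60,
    "Fare": [float(i) for i in range(100)],
})
demo_features = demo.drop("Survived", axis=1)
demo_labels = demo["Survived"]

# random_state fixes the shuffle; stratify keeps the label ratio in each part.
Xd_train, Xd_rest, yd_train, yd_rest = train_test_split(
    demo_features, demo_labels, test_size=0.4, random_state=42, stratify=demo_labels
)
Xd_val, Xd_test, yd_val, yd_test = train_test_split(
    Xd_rest, yd_rest, test_size=0.5, random_state=42, stratify=yd_rest
)

# Report the size and the share of positive labels in each part.
for name, part in [("train", yd_train), ("validation", yd_val), ("test", yd_test)]:
    print(name, len(part), round(part.mean(), 2))
```

Whether stratification matters here depends on how imbalanced the "Survived" column is, so treat it as an option rather than a requirement.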

### Displaying the Separation Ratio
As seen in the output, the training set holds 60% of the data, the validation set holds 20% and the test set holds the remaining 20%.

In [4]:
for dataset in [y_train, y_val, y_test]:
    print(round(len(dataset) / len(labels), 2) * 100)

60.0
20.0
20.0
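The percentages confirm the sizes, but not that every row landed in exactly one set. A minimal sketch of that extra check, run on a synthetic stand-in frame (the "toy" data and all variable names here are illustrative, not part of the notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in frame; the values are made up for the check.
toy = pd.DataFrame({"Survived": [0, 1] * 25, "Age": range(50)})
toy_X, toy_y = toy.drop("Survived", axis=1), toy["Survived"]

# Same two-step 60/20/20 split as in the notebook.
tX_train, tX_rest, ty_train, ty_rest = train_test_split(toy_X, toy_y, test_size=0.4)
tX_val, tX_test, ty_val, ty_test = train_test_split(tX_rest, ty_rest, test_size=0.5)

parts = {"train": tX_train, "validation": tX_val, "test": tX_test}

# The splits keep the original row index, so a unique combined index
# means no row appears in more than one part.
combined = pd.concat(list(parts.values()))
no_overlap = combined.index.is_unique
full_coverage = len(combined) == len(toy)

for name, part in parts.items():
    print(f"{name}: {len(part)} rows ({len(part) / len(toy):.0%})")
print("no overlap:", no_overlap, "| full coverage:", full_coverage)
```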


### Saving Split Data
The separated data is now saved to individual CSV files, which are used later when building the machine learning model.

In [5]:
X_train.to_csv(r'...\...\...\titanic_EDA\Split_data\train_features.csv', index = False)

X_val.to_csv(r'...\...\...\titanic_EDA\Split_data\validation_features.csv', index = False)

X_test.to_csv(r'...\...\...\titanic_EDA\Split_data\test_features.csv', index = False)

y_train.to_csv(r'...\...\...\titanic_EDA\Split_data\train_labels.csv', index = False)

y_val.to_csv(r'...\...\...\titanic_EDA\Split_data\validation_labels.csv', index = False)

y_test.to_csv(r'...\...\...\titanic_EDA\Split_data\test_labels.csv', index = False)
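The six calls above can also be written as a loop over a dictionary, and the files can be read back to confirm they were written as expected. The sketch below assumes a temporary folder in place of the real "Split_data" directory and uses two small stand-in objects instead of the actual splits. The names "demo_splits", "out_dir" and "reloaded" are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-ins for two of the splits; the real objects come from train_test_split.
demo_splits = {
    "train_features": pd.DataFrame({"Pclass": [3, 1], "Fare": [7.25, 71.2833]}),
    "train_labels": pd.Series([0, 1], name="Survived"),
}

# Assumption: a temporary folder stands in for the Split_data directory.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    for name, data in demo_splits.items():
        data.to_csv(out_dir / f"{name}.csv", index=False)

    # Read the files back to confirm shapes and column names survived the round trip.
    reloaded = {name: pd.read_csv(out_dir / f"{name}.csv") for name in demo_splits}

print(reloaded["train_features"].shape)
print(list(reloaded["train_labels"].columns))
```

Note that the labels come back as a one-column DataFrame rather than a Series, so code that loads them later may need to select the "Survived" column explicitly.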