# *Load Data*

In [1]:
import pandas as pd

df = pd.read_csv('../Data/Titanic-Dataset-selected.csv')
df.head()

Unnamed: 0,Survived,Pclass,Age,Sex,Cabin
0,0,3,-0.592481,1,115
1,1,1,0.638789,0,81
2,1,3,-0.284663,0,115
3,1,1,0.407926,0,55
4,0,3,0.407926,1,115
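
The `Unnamed: 0` column in the output above usually appears when a CSV was saved together with its row index. A minimal sketch of one way to avoid it, assuming that is what happened here: pass `index_col=0` to `pd.read_csv`. The `csv_text` string is a hypothetical stand-in for the real file, built from the five rows shown above.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file: the five rows shown above,
# with an unnamed first column holding the saved row index.
csv_text = """,Survived,Pclass,Age,Sex,Cabin
0,0,3,-0.592481,1,115
1,1,1,0.638789,0,81
2,1,3,-0.284663,0,115
3,1,1,0.407926,0,55
4,0,3,0.407926,1,115
"""

# index_col=0 tells pandas to use the first column as the row index
# instead of loading it as a regular column named "Unnamed: 0".
df_demo = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_demo.columns.tolist())
```

With the real file, the same argument would be added to the existing `pd.read_csv` call.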


# *Split Data*

A stratified split keeps the proportions of classes A, B, and C in both the training and test sets as close as possible to those in the original dataset:
- Train: ['A', 'A', 'B'] (66% 'A', 33% 'B', 0% 'C')
- Test: ['A', 'B', 'C'] (33% 'A', 33% 'B', 33% 'C')

Here, each subset mirrors the overall distribution as closely as six samples allow. Class 'C' occurs only once in this example, so it can appear in only one of the two subsets.

In this notebook, we stratify on a single label column (`Survived`) so that every subset keeps the same survival ratio. The other columns are not stratified.
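
To make the A/B/C idea concrete, here is a small sketch that compares a plain random split with a stratified one. The toy label list (60 'A', 30 'B', 10 'C') is a made-up example, not part of the Titanic data.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 60% 'A', 30% 'B', 10% 'C'.
labels = ['A'] * 60 + ['B'] * 30 + ['C'] * 10

# Plain random split: class proportions may drift between the two halves.
rand_train, rand_test = train_test_split(
    labels, test_size=0.5, random_state=0
)

# Stratified split: each half keeps the 60/30/10 class proportions.
strat_train, strat_test = train_test_split(
    labels, test_size=0.5, random_state=0, stratify=labels
)

print('random    :', Counter(rand_train), Counter(rand_test))
print('stratified:', Counter(strat_train), Counter(strat_test))
```

In the stratified case both halves contain 30 'A', 15 'B', and 5 'C'. The random split has no such guarantee.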

In [7]:
# import train_test_split to split the data
from sklearn.model_selection import train_test_split

# split the data into training, validation, and test sets
# with a ratio of 80% training : 10% validation : 10% test
# stratify on the Survived label so that all three subsets
# keep the same ratio of Survived = 0 to Survived = 1

# split the 100% data into train 80% and test 20%
df_train, df_unseen = train_test_split(df, test_size=0.2, random_state=0, stratify=df['Survived'])

# split the held-out 20% in half:
# 10% of the original data for validation and 10% for testing
df_val, df_test = train_test_split(df_unseen, test_size=0.5, random_state=0, stratify=df_unseen['Survived'])

# check the row count of each subset
print(f'Count of Original Data: {df.shape[0]}')
print(f'Count of Training Data: {df_train.shape[0]}')
print(f'Count of Validation Data: {df_val.shape[0]}')
print(f'Count of Testing Data: {df_test.shape[0]}')

# check the Survived label counts (0 and 1) in each subset
# By comparing the counts across the different subsets (training, validation, testing),
# you can ensure that the splits are representative of the original data.
# If one subset has a significantly different distribution,
# it could indicate a problem with the split, leading to biased model performance.
print(f'Count label Original Data: {df.Survived.value_counts()}')
print(f'Count label Training Data: {df_train.Survived.value_counts()}')
print(f'Count label Validation Data: {df_val.Survived.value_counts()}')
print(f'Count label Testing Data: {df_test.Survived.value_counts()}')

Count of Original Data: 891
Count of Training Data: 712
Count of Validation Data: 89
Count of Testing Data: 90
Count label Original Data: Survived
0    549
1    342
Name: count, dtype: int64
Count label Training Data: Survived
0    439
1    273
Name: count, dtype: int64
Count label Validation Data: Survived
0    55
1    34
Name: count, dtype: int64
Count label Testing Data: Survived
0    55
1    35
Name: count, dtype: int64
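
The counts above can be sanity-checked with a little arithmetic. The sketch below assumes that `train_test_split` rounds the test portion up when `test_size` is a fraction, which is consistent with the sizes printed above. It then converts the label counts, copied by hand from the output, into survival rates so the subsets can be compared directly.

```python
import math

n_total = 891  # "Count of Original Data" in the output above

# First split, test_size=0.2: the held-out part is rounded up.
n_unseen = math.ceil(0.2 * n_total)
n_train = n_total - n_unseen

# Second split, test_size=0.5 of the held-out part, again rounded up.
n_test = math.ceil(0.5 * n_unseen)
n_val = n_unseen - n_test

print(f'train={n_train}, validation={n_val}, test={n_test}')

# (Survived = 0, Survived = 1) counts copied from the output above.
label_counts = {
    'original': (549, 342),
    'train': (439, 273),
    'validation': (55, 34),
    'test': (55, 35),
}

# Share of Survived = 1 in each subset.
survival_rates = {
    name: survived / (not_survived + survived)
    for name, (not_survived, survived) in label_counts.items()
}

for name, rate in survival_rates.items():
    print(f'{name:<10} survival rate: {rate:.3f}')
```

The sizes come out as 712, 89, and 90, and all four survival rates fall between roughly 0.382 and 0.389, so the stratified subsets track the original label ratio closely. On the actual DataFrames, `df.Survived.value_counts(normalize=True)` reports the same proportions directly.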
