Data splitting is the process of dividing a dataset into separate subsets to train and evaluate a machine learning model. The common types of splits are:
- Training Set (60-80%): Used to train the model.
- Validation Set (10-20%): Used to tune hyperparameters and prevent overfitting.
- Test Set (10-20%): Used for final model evaluation.

Important Considerations:
* Stratification: Used for classification problems to ensure the same class distribution across splits.
* Shuffle: Helps avoid any bias due to the dataset’s original order.
* Time Series Consideration: Time-series data should not be split randomly. Preserve chronological order so that future data does not leak into training.
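
To make the chronological-order point concrete, here is a minimal sketch of a single time-ordered hold-out split. The daily series is synthetic and only illustrative; the one setting that matters is `shuffle=False`, which makes `train_test_split` keep the row order, so the test set is simply the most recent slice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic daily series (hypothetical data, for illustration only)
dates = pd.date_range("2024-01-01", periods=100, freq="D")
ts = pd.DataFrame({"value": np.arange(100)}, index=dates)

# shuffle=False keeps the original row order, so the last 20% of rows
# become the test set and nothing from the future reaches training
train, test = train_test_split(ts, test_size=0.2, shuffle=False)

print("Last training date:", train.index.max().date())
print("First test date:   ", test.index.min().date())
```

Note that `stratify` cannot be combined with `shuffle=False`, so stratification and a strict chronological split are alternatives rather than options to stack.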

In [None]:
# Classical train-test split

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Load dataset (Titanic as an example)
df = pd.read_csv("titanic.csv")

# Define features (X) and target variable (y)
X = df.drop(columns=['Survived'])  # Independent variables
y = df['Survived']  # Target variable

# Perform train-test split (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the shape of the datasets
print(f"Train Set: {X_train.shape}, Test Set: {X_test.shape}")


In [None]:
# Train, validation, and test sets for hyperparameter tuning

# First, split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Split the training data further into training and validation sets.
# 0.125 of 80% = 10% of the total data, so the result is a 70-10-20 split (train-validation-test).
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.125, random_state=42)

print(f"Train Set: {X_train.shape}, Validation Set: {X_val.shape}, Test Set: {X_test.shape}")
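
The `0.125` above comes from rescaling: a validation share expressed as a fraction of the full dataset has to be divided by what remains after the test set is removed. The sketch below wraps that arithmetic in a helper. `three_way_split` is a hypothetical name, not a scikit-learn function, and the demo arrays are synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def three_way_split(X, y, val_frac=0.1, test_frac=0.2, random_state=42):
    """Hypothetical helper: val_frac and test_frac are fractions of the FULL dataset."""
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_frac, random_state=random_state
    )
    # val_frac of the whole equals val_frac / (1 - test_frac) of the remainder,
    # e.g. 0.1 / 0.8 = 0.125
    rel_val = val_frac / (1.0 - test_frac)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=rel_val, random_state=random_state
    )
    return X_train, X_val, X_test, y_train, y_val, y_test

# Synthetic demo data (illustrative only)
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.arange(1000) % 2

X_tr, X_va, X_te, y_tr, y_va, y_te = three_way_split(X_demo, y_demo)
print(f"Train: {len(X_tr)}, Validation: {len(X_va)}, Test: {len(X_te)}")
```

For an 80-10-10 split, the same helper would be called with `val_frac=0.1, test_frac=0.1`, which makes the second `test_size` roughly 0.111 instead of 0.125.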


In [None]:
# Stratified train-test split (for imbalanced classification)
# If the classes are imbalanced, stratification ensures that each split keeps the original class distribution.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Check class distribution
print("Class Distribution in Training Set:\n", y_train.value_counts(normalize=True))
print("Class Distribution in Test Set:\n", y_test.value_counts(normalize=True))
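
To see what stratification changes, the following sketch compares a plain split with a stratified one on synthetic labels with a 90/10 class imbalance. The data is made up for illustration; only `train_test_split` and its `stratify` parameter are taken from the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 900 negatives, 100 positives (hypothetical data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([0] * 900 + [1] * 100)

# Plain random split: the positive rate in the test set can drift from 10%
_, _, _, y_test_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Stratified split: the positive rate in the test set matches the full data
_, _, _, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Positive rate, full data:       ", y_demo.mean())
print("Positive rate, plain split:     ", y_test_plain.mean())
print("Positive rate, stratified split:", y_test_strat.mean())
```

The smaller the minority class, the more a plain split can distort it, which is why stratification matters most for strongly imbalanced or small datasets.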


In [5]:
# Time-series data splitting (sequential splitting / cross-validation).
# TimeSeriesSplit helps with hyperparameter tuning and model selection. It uses an
# expanding-window approach: each successive fold adds more past data to the training set.
# Unlike standard k-fold CV, the data cannot be shuffled randomly, because future
# values must not be used to predict past values.

# Dataset: AAPL.csv (Apple stock price data).

import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Load sample time-series dataset (AAPL stock data)
df = pd.read_csv("AAPL.csv", parse_dates=["Date"], index_col="Date")

# Ensure data is sorted by date (crucial for time-series)
df = df.sort_index()
#print(df)

# Selecting features (X) and target (y)
X = df[['Close', 'Volume']]  # Feature: Closing price & volume
y = df['Close'].shift(-1)     # Target: Next day's closing price (forecasting task)

# Drop last row with NaN due to shifting
#print(X.tail(),y.tail())
X, y = X.iloc[:-1], y.iloc[:-1]

# Initialize time-series cross-validation
tscv = TimeSeriesSplit(n_splits=5)  # Produces 5 consecutive train-test splits

# Perform time-series split
for fold, (train_index, test_index) in enumerate(tscv.split(X)):
    # Training indices always precede test indices, so past data comes before future data
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    print(f"Fold {fold+1}:")
    print(f"Train indices: {train_index[0]} to {train_index[-1]}")
    print(f"Test indices: {test_index[0]} to {test_index[-1]}\n")



Fold 1:
Train indices: 0 to 18
Test indices: 19 to 34

Fold 2:
Train indices: 0 to 34
Test indices: 35 to 50

Fold 3:
Train indices: 0 to 50
Test indices: 51 to 66

Fold 4:
Train indices: 0 to 66
Test indices: 67 to 82

Fold 5:
Train indices: 0 to 82
Test indices: 83 to 98
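
The fold boundaries printed above follow from how `TimeSeriesSplit` sizes its folds by default: each test fold has `n_samples // (n_splits + 1)` rows, and the first training window takes whatever is left at the start. The sketch below reproduces the same boundaries on a placeholder array. The row count of 99 is inferred from the printed indices, not read from `AAPL.csv`.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_samples, n_splits = 99, 5              # 99 rows is an assumption inferred from the output
test_size = n_samples // (n_splits + 1)  # default test fold size: 99 // 6 = 16

# Placeholder array: only the number of rows matters for the indices
placeholder = np.zeros((n_samples, 1))
folds = list(TimeSeriesSplit(n_splits=n_splits).split(placeholder))

for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(
        f"Fold {i}: train 0-{train_idx[-1]} ({len(train_idx)} rows), "
        f"test {test_idx[0]}-{test_idx[-1]} ({len(test_idx)} rows)"
    )
```

`TimeSeriesSplit` also accepts `test_size`, `max_train_size`, and `gap` arguments for fixing the test fold size, capping the training window (a sliding rather than expanding window), or leaving a buffer between training and test rows.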



In [None]:
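# K-fold cross-validation (for robust model evaluation)
# Useful when the dataset is small and all data must be used efficiently, or when a more
# reliable evaluation metric is needed than a single train-test split provides.
# Assumes X_train and y_train come from one of the non-time-series splits above;
# a shuffled KFold is not appropriate for time-series data.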

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)

for train_index, val_index in kf.split(X_train):
    X_train_fold, X_val_fold = X_train.iloc[train_index], X_train.iloc[val_index]
    y_train_fold, y_val_fold = y_train.iloc[train_index], y_train.iloc[val_index]
    
    # Train and evaluate model
    print(f"Training fold: {X_train_fold.shape}, Validation fold: {X_val_fold.shape}")
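
The loop above leaves the "train and evaluate model" step open. One possible way to fill it in is `cross_val_score`, which runs the fit-and-score cycle for every fold. The sketch below is an illustration under assumptions: the data is synthetic, and `LogisticRegression` with accuracy scoring is a stand-in, not a model the notebook prescribes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic binary classification data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Fits a fresh copy of the model on each training fold and scores it
# on the matching validation fold
scores = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=kf, scoring="accuracy")

print(f"Accuracy per fold: {np.round(scores, 3)}")
print(f"Mean: {scores.mean():.3f}, Std: {scores.std():.3f}")
```

Reporting the mean together with the spread across folds gives a more stable estimate than a single train-test split.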
