## Import Required Libraries

In [7]:
import sys
from pathlib import Path

try:
    import pandas as pd
except ImportError as e:
    print(f"Error: Required libraries not found: {e}")
    sys.exit(1)

print("✓ All libraries imported successfully")

✓ All libraries imported successfully


## Load the Data

In [8]:
def load_dataset(filepath: str) -> pd.DataFrame:
    path = Path(filepath)
    if not path.exists():
        raise FileNotFoundError(f"Dataset not found: {filepath}")
    return pd.read_csv(path)

try:
    train_df = load_dataset('../resources/Train_knight.csv')
    print(f"✓ Dataset loaded: {train_df.shape[0]} rows, {train_df.shape[1]} columns")
except FileNotFoundError as e:
    print(f"✗ Error: {e}")
    sys.exit(1)

✓ Dataset loaded: 398 rows, 31 columns


## Why Split the Data?

We split the data into **Training** and **Validation** sets to:
1. **Train** the model on one portion
2. **Validate** it on rows it never saw during training, to check whether it generalizes

### Common Split Ratios:
| Training | Validation | Use Case |
|----------|------------|----------|
| 80% | 20% | Most common, good balance |
| 70% | 30% | When you want more validation data |
| 90% | 10% | When you have lots of data |

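**We'll use an 80/20 split**, a common default rather than a fixed rule. With only 398 rows, it keeps most of the data for training while still leaving a usable validation set.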
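
To make the ratios in the table concrete, here is a small sketch of the row counts each one would give for a 398-row dataset. The helper name `split_sizes` is hypothetical, and it assumes the training count is truncated with `int()`, which is how this notebook computes its split index.

```python
def split_sizes(n_rows: int, train_ratio: float) -> tuple[int, int]:
    """Hypothetical helper: row counts for a given training ratio."""
    # int() truncates, so any leftover fractional row goes to validation
    n_train = int(n_rows * train_ratio)
    return n_train, n_rows - n_train


for ratio in (0.8, 0.7, 0.9):
    n_train, n_valid = split_sizes(398, ratio)
    print(f"{ratio:.0%} training -> {n_train} training rows, {n_valid} validation rows")

# 80% training -> 318 training rows, 80 validation rows
# 70% training -> 278 training rows, 120 validation rows
# 90% training -> 358 training rows, 40 validation rows
```

On a dataset this small, a 90/10 split would leave only 40 validation rows, which is one reason to prefer 80/20 here.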

## Split Function

In [9]:
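def split_dataset(
    df: pd.DataFrame, train_ratio: float = 0.8, random_state: int = 42
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """
    Randomly split a DataFrame into training and validation sets.

    Args:
        df: DataFrame to split
        train_ratio: Proportion for training, strictly between 0 and 1 (0.8 = 80%)
        random_state: Seed for reproducibility

    Returns:
        Tuple of (training_df, validation_df)

    Raises:
        ValueError: If train_ratio is not strictly between 0 and 1
    """
    if not 0 < train_ratio < 1:
        raise ValueError(f"train_ratio must be between 0 and 1, got {train_ratio}")

    # Shuffle all rows; a fixed seed makes the split reproducible
    df_shuffled = df.sample(frac=1, random_state=random_state).reset_index(drop=True)

    # Number of training rows, truncated toward zero (398 * 0.8 -> 318)
    split_index = int(len(df_shuffled) * train_ratio)

    # Copy the slices so each set is independent of the shuffled frame
    training_df = df_shuffled.iloc[:split_index].copy()
    validation_df = df_shuffled.iloc[split_index:].reset_index(drop=True)

    return training_df, validation_df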

In [10]:
training_df, validation_df = split_dataset(train_df, train_ratio=0.8)

print(f"Original dataset: {len(train_df)} rows")
print(f"Training set: {len(training_df)} rows ({len(training_df)/len(train_df)*100:.1f}%)")
print(f"Validation set: {len(validation_df)} rows ({len(validation_df)/len(train_df)*100:.1f}%)")

Original dataset: 398 rows
Training set: 318 rows (79.9%)
Validation set: 80 rows (20.1%)
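
The row counts add up, but it can also be worth confirming that the two sets really partition the original rows, with nothing lost or duplicated. The following is a minimal sketch on toy data; `check_split` is a hypothetical helper, not part of the notebook, and the toy frame only stands in for the real dataset.

```python
import pandas as pd


def check_split(original: pd.DataFrame, training: pd.DataFrame, validation: pd.DataFrame) -> bool:
    """Hypothetical helper: True if training + validation contain exactly the original rows."""
    same_size = len(training) + len(validation) == len(original)
    columns = list(original.columns)
    # Sort both sides the same way so that row order does not matter
    recombined = pd.concat([training, validation]).sort_values(columns).reset_index(drop=True)
    expected = original.sort_values(columns).reset_index(drop=True)
    return same_size and recombined.equals(expected)


# Toy data standing in for the real dataset
toy = pd.DataFrame({"id": range(10), "knight": ["Sith", "Jedi"] * 5})
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)
toy_train, toy_valid = shuffled.iloc[:8], shuffled.iloc[8:]

print(check_split(toy, toy_train, toy_valid))          # True
print(check_split(toy, toy_train, toy_train.iloc[:2]))  # False: rows repeated, others missing
```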


## Verify Class Distribution
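
Check that both sets have similar proportions of Jedi and Sith. A purely random split does not guarantee this: in the output below, Sith make up about 63% of the training set but only about 56% of the validation set.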

In [11]:
print("Class distribution in Training set:")
print(training_df['knight'].value_counts(normalize=True))

print("\nClass distribution in Validation set:")
print(validation_df['knight'].value_counts(normalize=True))

Class distribution in Training set:
knight
Sith    0.632075
Jedi    0.367925
Name: proportion, dtype: float64

Class distribution in Validation set:
knight
Sith    0.5625
Jedi    0.4375
Name: proportion, dtype: float64
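
If the gap between the two distributions matters, one option is a stratified split, which samples each class separately so that both sets keep roughly the same class proportions. This is a sketch of an alternative to the random split used in this notebook, not the notebook's method. `stratified_split` is a hypothetical helper, the toy frame is made-up data, and the code assumes the DataFrame has a unique index.

```python
import pandas as pd


def stratified_split(df: pd.DataFrame, label_col: str, train_ratio: float = 0.8,
                     random_state: int = 42) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Hypothetical helper: split so each class keeps a similar share in both sets."""
    # Sample train_ratio of the rows within each class separately
    training_df = df.groupby(label_col).sample(frac=train_ratio, random_state=random_state)
    # Every row that was not sampled goes to validation (requires a unique index)
    validation_df = df.drop(training_df.index)
    # Shuffle again so rows are no longer grouped by class
    training_df = training_df.sample(frac=1, random_state=random_state).reset_index(drop=True)
    validation_df = validation_df.sample(frac=1, random_state=random_state).reset_index(drop=True)
    return training_df, validation_df


# Toy data: 60 Sith and 40 Jedi
toy = pd.DataFrame({"knight": ["Sith"] * 60 + ["Jedi"] * 40, "power": range(100)})
train_s, valid_s = stratified_split(toy, "knight")

print(train_s["knight"].value_counts(normalize=True))
print(valid_s["knight"].value_counts(normalize=True))
```

On the toy data, both sets come out at 60% Sith and 40% Jedi. On real data the proportions match only up to rounding within each class.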


## Save the Split Files

In [12]:
training_df.to_csv('Training_knight.csv', index=False)
validation_df.to_csv('Validation_knight.csv', index=False)

print("✓ Files saved:")
print("  - Training_knight.csv")
print("  - Validation_knight.csv")

✓ Files saved:
  - Training_knight.csv
  - Validation_knight.csv
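
A quick way to confirm that a saved file is usable is to read it back and compare its shape and columns with the DataFrame that was written. The sketch below does this with toy data in a temporary directory, so it does not touch the real output files; the values in the toy frame are made up.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy data standing in for training_df
toy = pd.DataFrame({"power": [1.5, 2.0, 3.25], "knight": ["Jedi", "Sith", "Sith"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "Training_knight.csv"
    # index=False keeps the row index out of the file, as in the cell above
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

round_trip_ok = reloaded.shape == toy.shape and list(reloaded.columns) == list(toy.columns)
print(round_trip_ok)  # True
```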


## Summary

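- **80% Training** (318 rows): Most of the data goes to fitting the model
- **20% Validation** (80 rows): Held-out data for estimating performance on unseen samples; with only 80 rows, that estimate is somewhat noisy
- **Common convention**: 80/20 is a widely used default, not a fixed rule
- **Helps detect overfitting**: The split does not prevent overfitting by itself, but a model that merely memorizes the training data will score noticeably worse on the validation set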