# Train/Test Split - Pipeline Thickness Loss Dataset
## TASK 5: Baseline Evaluation - Data Splitting

**Date:** December 30, 2025  
**Dataset:** thickness_loss_dataset_engineered.csv  
**Split Ratio:** 80/20 (Train/Test)  
**Random State:** 42

---
## Setup

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')

print('✓ Libraries loaded')

✓ Libraries loaded


---
## Load Data

In [2]:
# Load engineered dataset (or original if feature engineering not done)
try:
    data = pd.read_csv('thickness_loss_dataset_engineered.csv')
    print('✓ Loaded engineered dataset')
except FileNotFoundError:
    data = pd.read_csv('thickness_loss_dataset.csv')
    print('✓ Loaded original dataset')

print(f'Shape: {data.shape[0]} rows, {data.shape[1]} columns')
data.head()

✓ Loaded engineered dataset
Shape: 1000 rows, 25 columns


Unnamed: 0,Pipe_Size_mm,Thickness_mm,Material,Grade,Max_Pressure_psi,Temperature_C,Corrosion_Impact_Percent,Thickness_Loss_mm,Material_Loss_Percent,Time_Years,...,Pressure_to_Thickness_Ratio,Corrosion_Time_Interaction,Temp_Pressure_Interaction,Pipe_Size_Category,Temperature_Category,Pressure_Category,Age_Category,Critical_Threshold_Flag,High_Corrosion_Flag,High_Pressure_Flag
0,800,15.48,Carbon Steel,ASTM A333 Grade 6,300,84.9,16.04,4.91,31.72,2,...,19.379845,32.08,25470.0,Large,Medium,Low,New,0,1,0
1,800,22.0,PVC,ASTM A106 Grade B,150,14.1,7.38,7.32,33.27,4,...,6.818182,29.52,2115.0,Large,Low,Low,New,0,0,0
2,400,12.05,Carbon Steel,API 5L X52,2500,0.6,2.12,6.32,52.45,7,...,207.46888,14.84,1500.0,Medium,Low,Very_High,Mature,1,0,1
3,1500,38.72,Carbon Steel,API 5L X42,1500,52.7,5.58,6.2,16.01,19,...,38.739669,106.02,79050.0,Extra_Large,Medium,High,Old,0,0,0
4,1500,24.32,HDPE,API 5L X65,1500,11.7,12.29,8.58,35.28,20,...,61.677632,245.8,17550.0,Extra_Large,Low,High,Old,0,0,0


---
## Prepare Features and Target

In [3]:
# Define target variable
target = 'Condition'

# Separate features and target
X = data.drop(columns=[target])
y = data[target]

print(f'Features (X): {X.shape[1]} columns')
print(f'Target (y): {y.name}')
print(f'\nTarget distribution:')
print(y.value_counts())

Features (X): 24 columns
Target (y): Condition

Target distribution:
Condition
Critical    487
Moderate    299
Normal      214
Name: count, dtype: int64


In [4]:
# Encode categorical variables if needed
categorical_cols = X.select_dtypes(include=['object', 'category']).columns.tolist()

if len(categorical_cols) > 0:
    print(f'Encoding {len(categorical_cols)} categorical columns:')
    print(categorical_cols)
    
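    # Label-encode each categorical feature, keeping one fitted encoder per
    # column so the integer codes can be decoded later
    feature_encoders = {}
    for col in categorical_cols:
        feature_encoders[col] = LabelEncoder()
        X[col] = feature_encoders[col].fit_transform(X[col].astype(str))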
    
    print('✓ Categorical features encoded')
else:
    print('No categorical columns to encode')

Encoding 6 categorical columns:
['Material', 'Grade', 'Pipe_Size_Category', 'Temperature_Category', 'Pressure_Category', 'Age_Category']
✓ Categorical features encoded
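
`LabelEncoder` assigns codes in sorted order of the unique values, so the integer codes produced above are alphabetical rather than meaningful. The sketch below reproduces that rule in plain Python on the `Material` and `Age_Category` values from the five preview rows, keeping one mapping per column so codes can be decoded later. The helper name `fit_label_maps` is hypothetical, and this only illustrates the encoding rule; it is not the notebook's implementation, and the full dataset may contain more categories than these rows show.

```python
# Values taken from the five preview rows shown by data.head()
rows = [
    {'Material': 'Carbon Steel', 'Age_Category': 'New'},
    {'Material': 'PVC', 'Age_Category': 'New'},
    {'Material': 'Carbon Steel', 'Age_Category': 'Mature'},
    {'Material': 'Carbon Steel', 'Age_Category': 'Old'},
    {'Material': 'HDPE', 'Age_Category': 'Old'},
]

def fit_label_maps(rows, columns):
    """Hypothetical helper: one {category: code} dict per column, sorted order."""
    maps = {}
    for col in columns:
        categories = sorted({row[col] for row in rows})
        maps[col] = {cat: code for code, cat in enumerate(categories)}
    return maps

label_maps = fit_label_maps(rows, ['Material', 'Age_Category'])

# Encode every row using the per-column mappings
encoded = [{col: label_maps[col][row[col]] for col in label_maps} for row in rows]

for col, mapping in label_maps.items():
    print(col, mapping)
```

On these rows `Age_Category` comes out as Mature → 0, New → 1, Old → 2, which does not follow age order. That is usually harmless for tree-based models but may matter for linear ones; if the order matters, one option is scikit-learn's `OrdinalEncoder` with an explicit `categories` argument.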


In [5]:
# Encode target variable
le_target = LabelEncoder()
y_encoded = le_target.fit_transform(y)

print('Target encoding:')
for i, label in enumerate(le_target.classes_):
    print(f'  {label} → {i}')

# Store mapping for later
target_mapping = dict(zip(le_target.classes_, le_target.transform(le_target.classes_)))
print(f'\nMapping: {target_mapping}')

Target encoding:
  Critical → 0
  Moderate → 1
  Normal → 2

Mapping: {'Critical': np.int64(0), 'Moderate': np.int64(1), 'Normal': np.int64(2)}


---
## Train/Test Split (80/20)

In [6]:
# Perform train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded,
    test_size=0.2,
    random_state=42,
    stratify=y_encoded  # Maintain class distribution
)

print('✓ Train/Test split completed')
print(f'\nSplit ratio: 80/20')
print(f'Random state: 42')
print(f'Stratified: Yes (maintains class balance)')

✓ Train/Test split completed

Split ratio: 80/20
Random state: 42
Stratified: Yes (maintains class balance)


---
## Verify Split

In [7]:
# Display split sizes
print('Dataset Sizes:')
print('='*60)
print(f'Total samples: {len(X)}')
print(f'\nTraining set: {len(X_train)} samples ({len(X_train)/len(X)*100:.1f}%)')
print(f'Test set:     {len(X_test)} samples ({len(X_test)/len(X)*100:.1f}%)')
print(f'\nFeatures: {X_train.shape[1]}')

Dataset Sizes:
============================================================
Total samples: 1000

Training set: 800 samples (80.0%)
Test set:     200 samples (20.0%)

Features: 24


In [8]:
# Check class distribution in splits
print('Class Distribution Verification:')
print('='*60)

# Original distribution
print('\nOriginal:')
for label, code in target_mapping.items():
    count = (y_encoded == code).sum()
    pct = count / len(y_encoded) * 100
    print(f'  {label}: {count} ({pct:.1f}%)')

# Training set distribution
print('\nTraining set:')
for label, code in target_mapping.items():
    count = (y_train == code).sum()
    pct = count / len(y_train) * 100
    print(f'  {label}: {count} ({pct:.1f}%)')

# Test set distribution
print('\nTest set:')
for label, code in target_mapping.items():
    count = (y_test == code).sum()
    pct = count / len(y_test) * 100
    print(f'  {label}: {count} ({pct:.1f}%)')

print('\n✓ Class distribution maintained across splits')

Class Distribution Verification:
============================================================

Original:
  Critical: 487 (48.7%)
  Moderate: 299 (29.9%)
  Normal: 214 (21.4%)

Training set:
  Critical: 390 (48.8%)
  Moderate: 239 (29.9%)
  Normal: 171 (21.4%)

Test set:
  Critical: 97 (48.5%)
  Moderate: 60 (30.0%)
  Normal: 43 (21.5%)

✓ Class distribution maintained across splits
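
The per-class counts above can be sanity-checked by hand. The sketch below applies simple proportional allocation (20% of each class, rounded) to the original class counts. This is an assumption about how the counts come out, not a description of scikit-learn's internal allocation, which may resolve rounding differently on other data.

```python
# Class counts from the target distribution output
class_counts = {'Critical': 487, 'Moderate': 299, 'Normal': 214}
test_size = 0.2

# Proportional allocation: each class contributes about 20% of its rows to the test set
test_counts = {label: round(n * test_size) for label, n in class_counts.items()}
train_counts = {label: n - test_counts[label] for label, n in class_counts.items()}

for label in class_counts:
    print(f'{label}: train={train_counts[label]}, test={test_counts[label]}')
print(f'Totals: train={sum(train_counts.values())}, test={sum(test_counts.values())}')
```

For this dataset the result agrees with the verification output: 390/239/171 rows for training and 97/60/43 for test.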


---
## Save Split Data

In [9]:
# Save train/test sets
X_train.to_csv('X_train.csv', index=False)
X_test.to_csv('X_test.csv', index=False)
pd.DataFrame(y_train, columns=['Condition']).to_csv('y_train.csv', index=False)
pd.DataFrame(y_test, columns=['Condition']).to_csv('y_test.csv', index=False)

print('✓ Saved files:')
print('  - X_train.csv')
print('  - X_test.csv')
print('  - y_train.csv')
print('  - y_test.csv')

✓ Saved files:
  - X_train.csv
  - X_test.csv
  - y_train.csv
  - y_test.csv


In [10]:
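# Save target mapping for reference
import json

# LabelEncoder returns NumPy integers, which json cannot serialize;
# convert the codes to plain Python ints first
serializable_mapping = {label: int(code) for label, code in target_mapping.items()}

with open('target_mapping.json', 'w') as f:
    json.dump(serializable_mapping, f, indent=2)

print('✓ Saved target_mapping.json')
print(f'  Mapping: {serializable_mapping}')

✓ Saved target_mapping.json
  Mapping: {'Critical': 0, 'Moderate': 1, 'Normal': 2}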
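
Downstream steps that read `y_train.csv` or model predictions will see integer codes, so the saved mapping is mainly useful in reverse. The following is a minimal sketch, assuming the file holds the label-to-code mapping shown in the target encoding output, with plain Python ints so that it is JSON-serializable. It writes to a temporary directory so it can run on its own, and the `predicted_codes` list is made up for illustration.

```python
import json
import os
import tempfile

# Mapping as printed in the target-encoding step, with plain Python ints
mapping = {'Critical': 0, 'Moderate': 1, 'Normal': 2}

# Round-trip the mapping through a JSON file in a throwaway directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'target_mapping.json')
    with open(path, 'w') as f:
        json.dump(mapping, f, indent=2)
    with open(path) as f:
        loaded = json.load(f)

# Invert label -> code into code -> label to decode predictions
code_to_label = {code: label for label, code in loaded.items()}

predicted_codes = [0, 2, 1, 0]  # illustrative values, not real model output
decoded = [code_to_label[code] for code in predicted_codes]
print(decoded)
```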

---
## Summary

### Split Configuration:
- **Train/Test Ratio:** 80/20
- **Random State:** 42 (reproducible)
- **Stratification:** Yes (maintains class balance)

### Dataset Sizes:
- **Training:** 800 samples
- **Test:** 200 samples
- **Features:** 24

### Files Generated:
- X_train.csv, X_test.csv
- y_train.csv, y_test.csv
- target_mapping.json

### Next Steps:
1. Feature scaling (if needed)
2. Train baseline models
3. Evaluate performance

---
**Train/Test Split Complete!**