# Day 4: Train/Test Split and Feature Scaling

**Course**: Machine Learning with Python: From Basics to Applications  
**Objective**: Learn to split data into training and test sets and apply feature scaling to numerical features to prepare for machine learning models.  
**Prerequisites**: Preprocessed Titanic dataset (`titanic_preprocessed.csv`) from Day 3, with missing values handled and categorical variables encoded.  
**Tools**: Pandas, scikit-learn (install with `pip install pandas scikit-learn`).  
**Dataset**: Titanic dataset (preprocessed on Day 3; the raw data is available from Kaggle).  

In this notebook, we will:  
1. Load the preprocessed Titanic dataset.  
2. Split data into training (80%) and test (20%) sets using `train_test_split`.  
3. Apply `StandardScaler` to scale numerical features (`Age`, `Fare`).  
4. Save the resulting datasets as CSVs for future use.  
5. Verify the splits and scaling.  

Let’s get started!

## Step 1: Import Libraries

We need Pandas for data handling and scikit-learn for splitting and scaling.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

## Step 2: Load Preprocessed Data

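Load the preprocessed Titanic dataset from Day 3 (`titanic_preprocessed.csv`). In this dataset, `Sex` is encoded as 0/1 and `Embarked` is one-hot encoded into the columns `Embarked_Q` and `Embarked_S`. The columns scaled in this notebook (`Age`, `Fare`) should have no missing values, although the output below shows that `Cabin` still contains some.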

In [2]:
# Load the preprocessed dataset
df = pd.read_csv('titanic_preprocessed.csv')

# Display first 5 rows to confirm
print("First 5 rows of preprocessed data:")
df.head()

First 5 rows of preprocessed data:


|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | False | True |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | False | False |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | False | True |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | False | True |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | False | True |


## Step 3: Separate Features and Target

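Separate the dataset into features (`X`) and target (`y`). The target is `Survived`; every other column goes into `X`. Note that this includes text columns such as `Name`, `Ticket`, and `Cabin`, which most scikit-learn models cannot use directly and which will need to be dropped or encoded before modeling.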

In [None]:
# Features (drop 'Survived')
X = df.drop('Survived', axis=1)

# Target
y = df['Survived']

# Verify shapes
print("X shape:", X.shape)
print("y shape:", y.shape)
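
Because every column other than `Survived` ends up in `X`, it can help to check which features are not numeric yet. The sketch below uses a small made-up DataFrame (the `toy` frame and its values are illustrative, not the course data) and lists the columns that `select_dtypes` does not treat as numbers. On the real dataset, boolean columns would also be reported by this check, since pandas does not count `bool` as a number type.

```python
import pandas as pd

# Toy frame that mimics a few columns of the preprocessed data (values are made up)
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Name": ["A", "B", "C"],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
})

# Same separation as above: everything except the target becomes a feature
X_toy = toy.drop(columns="Survived")
y_toy = toy["Survived"]

# Columns that are still text and would need dropping or encoding before modeling
non_numeric = X_toy.select_dtypes(exclude="number").columns.tolist()
print("Non-numeric feature columns:", non_numeric)
```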

## Step 4: Train/Test Split

Split the data into 80% training and 20% test sets using `train_test_split`. Set `random_state=42` for reproducibility.

In [None]:
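# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Work on explicit copies so the column assignments in Step 5 do not trigger SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()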

# Verify shapes
print("Training set shape (X_train, y_train):", X_train.shape, y_train.shape)
print("Test set shape (X_test, y_test):", X_test.shape, y_test.shape)
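
To see how `test_size=0.2` translates into row counts, and what `random_state` buys you, here is a minimal sketch on synthetic data. It assumes a frame with 891 rows (the size of the full Titanic training file); `X_demo`, `y_demo`, and the random values are placeholders, not the course data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 891 rows (values are random, for illustration only)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"Age": rng.uniform(1, 80, 891), "Fare": rng.uniform(0, 500, 891)})
y_demo = pd.Series(rng.integers(0, 2, 891), name="Survived")

# Same arguments as in the cell above
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
# Repeating the split with the same random_state selects the same rows
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print("Train rows:", len(Xa_train), "Test rows:", len(Xa_test))
print("Same test rows on both runs:", Xa_test.index.equals(Xb_test.index))
```

The test set size is rounded up, so 20% of 891 rows gives 179 test rows and leaves 712 for training.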

## Step 5: Feature Scaling

Apply `StandardScaler` to scale the numerical features (`Age`, `Fare`). Fit the scaler on the training data only, then use that fitted scaler to transform both the training and test data. Fitting on the full dataset would let test-set statistics leak into training.

In [None]:
# Initialize StandardScaler
scaler = StandardScaler()

# Fit and transform on training data (Age, Fare)
X_train[['Age', 'Fare']] = scaler.fit_transform(X_train[['Age', 'Fare']])

# Transform test data (using same scaler)
X_test[['Age', 'Fare']] = scaler.transform(X_test[['Age', 'Fare']])

# Verify scaling (training mean ~0, std ~1; pandas .std() uses ddof=1, so it comes out slightly above 1)
print("Training Age mean after scaling:", X_train['Age'].mean())
print("Training Age std after scaling:", X_train['Age'].std())
print("Training Fare mean after scaling:", X_train['Fare'].mean())
print("Training Fare std after scaling:", X_train['Fare'].std())

# Display first few rows of scaled training data
print("\nFirst 5 rows of X_train after scaling:")
X_train.head()
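
The following sketch shows what `StandardScaler` computes under the hood, using a handful of made-up ages (the `train` and `test` frames and `demo_scaler` are illustrative names only). It standardizes each value as (x - mean) / std with the population standard deviation (`ddof=0`), which is why pandas' default `.std()` (`ddof=1`) reports a value a little above 1, and why the test set does not end up with a mean of exactly 0.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values, split by hand into "training" and "test" parts
train = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0, 54.0]})
test = pd.DataFrame({"Age": [2.0, 27.0]})

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = demo_scaler.transform(test)        # reuses the training mean and std

# Manual equivalent: subtract the training mean, divide by the population std (ddof=0)
mu = train["Age"].mean()
sigma = train["Age"].std(ddof=0)
manual = ((train["Age"] - mu) / sigma).to_numpy()

print("Matches StandardScaler:", np.allclose(manual, train_scaled[:, 0]))
print("Std with ddof=0 (NumPy default):", train_scaled[:, 0].std())
print("Std with ddof=1 (pandas default):", pd.Series(train_scaled[:, 0]).std())
print("Test mean after scaling:", test_scaled[:, 0].mean())
```

With only five training rows the gap between the two standard deviations is large; with several hundred rows it shrinks to a fraction of a percent.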

## Step 6: Save Datasets

Save the split and scaled datasets as CSV files for use in future sessions.

In [None]:
# Save training and test datasets
X_train.to_csv('titanic_X_train.csv', index=False)
X_test.to_csv('titanic_X_test.csv', index=False)
y_train.to_csv('titanic_y_train.csv', index=False)
y_test.to_csv('titanic_y_test.csv', index=False)

print("Datasets saved as titanic_X_train.csv, titanic_X_test.csv, titanic_y_train.csv, titanic_y_test.csv")
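
As a small illustration of what `index=False` does, the sketch below writes a made-up two-row frame to an in-memory buffer with and without the index and reads it back. The `demo` frame and buffer names are hypothetical; `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# Made-up frame with non-consecutive row labels, as you get after a shuffled split
demo = pd.DataFrame({"Age": [0.5, -1.2], "Fare": [1.1, -0.3]}, index=[331, 733])

# Default behaviour: the index is written as an extra, unnamed first column
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print("Saved with index:   ", cols_with)
print("Saved without index:", cols_without)
```

Since `index=False` discards the original row labels, the saved `X` and `y` files stay aligned only by row order, so they should not be re-sorted or shuffled independently after loading.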

## Step 7: Verification

Check that the datasets were saved correctly and that scaling worked as expected.

In [None]:
# Verify saved files
print("\nVerify X_train.csv:")
print(pd.read_csv('titanic_X_train.csv').head())

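# Confirm no missing values in the scaled columns (other columns, such as Cabin, may still contain NaN)
print("\nMissing values in X_train (Age, Fare):", X_train[['Age', 'Fare']].isnull().sum().sum())
print("Missing values in X_test (Age, Fare):", X_test[['Age', 'Fare']].isnull().sum().sum())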

## Assignment

1. Run this notebook to split and scale the Titanic dataset.  
2. Verify the shapes of `X_train`, `X_test`, `y_train`, and `y_test` (about 712 rows for training and 179 for test if the dataset contains all 891 passengers).  
3. Confirm that `Age` and `Fare` in `X_train` have mean ~0 and std ~1 after scaling.  
4. Ensure the saved CSV files (`titanic_X_train.csv`, etc.) are created and contain the expected data.  
5. Submit a screenshot of the notebook output showing the shapes and scaling statistics.  

**Next Steps**: On Day 5, you’ll combine all preprocessing steps (Days 3-4) to prepare the Titanic dataset for modeling.