Feature Engineering Philosophy (Very Important)

Every feature must answer:

“What information am I adding that the model didn’t already have?”

We’ll focus on classic Titanic signals that are:

- Interpretable
- Low-risk
- Widely accepted

In [3]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold, cross_val_score

In [4]:
# Load Data
train_df = pd.read_csv("/kaggle/input/titanic/train.csv")
test_df  = pd.read_csv("/kaggle/input/titanic/test.csv")

In [9]:
#Feature engineering

# 1. Family Size: sum of siblings/spouses and parents/children + yourself
train_df["FamilySize"] = train_df["SibSp"] + train_df["Parch"] + 1

# 2. Is Alone: 1 if the passenger has no family on board, 0 otherwise
train_df["IsAlone"] = (train_df["FamilySize"] == 1).astype(int)

# 3. Title from Name: extract the text between the comma and the first period
# The capture group ([^\.]+) matches one or more non-period characters, e.g. "Mr", "Mrs", "Master"
# expand=False returns a Series (not a one-column DataFrame), so .str.strip() can follow directly
train_df["Title"] = train_df["Name"].str.extract(r",\s*([^\.]+)\.", expand=False)
train_df["Title"] = train_df["Title"].str.strip()

# Optional: Map rare titles to a 'Rare' category to help the model generalize
# This prevents overfitting on titles that only appear once or twice
rare_titles = ['Don', 'Rev', 'Dr', 'Major', 'Lady', 'Sir', 'Col', 'Capt', 'the Countess', 'Jonkheer']
train_df["Title"] = train_df["Title"].replace(rare_titles, 'Rare')
train_df["Title"] = train_df["Title"].replace(['Mlle', 'Ms'], 'Miss')
train_df["Title"] = train_df["Title"].replace('Mme', 'Mrs')

# Drop columns that are no longer needed (since we extracted their info)
# This prevents the model from trying to process the raw 'Name' string
train_df.drop(['Name', 'SibSp', 'Parch'], axis=1, inplace=True)
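The cell above modifies only train_df, while test_df was loaded but never transformed. One way to keep the two in sync is to wrap the same steps in a function and call it on both frames. The sketch below is a hedged illustration, not the notebook's own code: the function name `add_family_and_title_features` and the `COMMON_TITLES` list are hypothetical, and instead of the explicit `rare_titles` list it sends every title outside the common four to 'Rare', so that a title appearing only in the test data would also be grouped there. The three-row frame is made-up demo data.

```python
import pandas as pd

# Hypothetical helper, not part of the original notebook.
COMMON_TITLES = ["Mr", "Mrs", "Miss", "Master"]

def add_family_and_title_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with FamilySize, IsAlone and Title added."""
    out = df.copy()  # avoid mutating the caller's frame

    # Family size: siblings/spouses + parents/children + the passenger
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    out["IsAlone"] = (out["FamilySize"] == 1).astype(int)

    # Title: text between the comma and the first period in Name
    title = out["Name"].str.extract(r",\s*([^\.]+)\.", expand=False).str.strip()
    title = title.replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})

    # Assumption: anything outside the common titles (including a failed
    # match, which is NaN) is grouped as 'Rare'.
    out["Title"] = title.where(title.isin(COMMON_TITLES), "Rare")

    return out.drop(columns=["Name", "SibSp", "Parch"])

# Made-up rows in the same "Surname, Title. Given names" layout
demo = pd.DataFrame({
    "Name": ["Smith, Mr. John", "Doe, the Countess. of Somewhere", "Roe, Mlle. Anne"],
    "SibSp": [1, 0, 0],
    "Parch": [0, 0, 0],
})
demo_fe = add_family_and_title_features(demo)
print(demo_fe)
```

With a helper like this, the same call could be applied to both train_df and test_df right after loading, so the two frames always share the same engineered columns.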


In [10]:
# Define target & features
X = train_df.drop(columns=["Survived"])
y = train_df["Survived"]

In [11]:
# Train/Validation split
X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y # to preserve the original class distribution, since the Survival classes are imbalanced
)

In [12]:
#Pre-processing
#Identify column types
numeric_features = X_train.select_dtypes(include=["int64", "float64"]).columns
categorical_features = X_train.select_dtypes(include=["object"]).columns
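Only Name, SibSp and Parch were dropped earlier, so columns such as PassengerId, Ticket and Cabin are still in X and will be picked up by these selectors. Before one-hot encoding, it may be worth checking how many distinct values each column has and how much is missing. The following is a minimal sketch on a made-up toy frame; the helper name `describe_feature_columns` is hypothetical.

```python
import pandas as pd

def describe_feature_columns(X: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, distinct-value count and missing share per column."""
    return pd.DataFrame({
        "dtype": X.dtypes.astype(str),
        "n_unique": X.nunique(),          # NaN is not counted as a value
        "missing_frac": X.isna().mean(),  # share of missing entries
    })

# Toy stand-in for X; the values are invented for illustration only
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Sex": ["male", "female", "female", "male"],
    "Ticket": ["T-001", "T-002", "T-003", "T-004"],
    "Cabin": [None, "C1", "C2", None],
})
summary = describe_feature_columns(toy)
print(summary)
```

A column whose n_unique is close to the number of rows behaves like an identifier: one-hot encoding it creates roughly one indicator per row, which is the opposite of the "low-risk" features this section aims for.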

In [13]:
# Apply imputation & StandardScaler
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())  # put features on a common scale so the solver converges (avoids the ConvergenceWarning)
])

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore"))
    ]
)

In [14]:
# Preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

In [15]:
# Logistic regression model
model = LogisticRegression(max_iter=1000)  # max_iter caps the number of solver iterations (the default is 100)

# Full pipeline
clf = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("model", model)
    ]
)

In [16]:
# Cross-Validation
cv = StratifiedKFold(
    n_splits=5,
    shuffle=True,
    random_state=42
)

# Run Cross-Validation
scores_fe = cross_val_score(
    clf,
    X,
    y,
    cv=cv,
    scoring="accuracy"
)

In [18]:
scores_fe.mean()

np.float64(0.8383780051471973)

In [19]:
scores_fe.std()

np.float64(0.00668036552010111)

## Feature Engineering Results

Added features:
- FamilySize
- IsAlone
- Title

Results:
- Baseline CV accuracy: 0.807 (std 0.033)
- Feature-engineered CV accuracy: 0.838 (std 0.007)

Observations:
- Which features likely helped?
  - Title: probably the main driver of the improvement. It packs sex, approximate age, and social status into one categorical value. For example, "Master" marks young boys, who survived at a much higher rate than adult men. A linear model that sees only Sex and Age as separate inputs cannot easily represent that interaction.
  - IsAlone / FamilySize: these likely helped by making the effect of travelling with family explicit. Logistic regression is linear in its inputs, so a binary flag like IsAlone lets it capture a step that the raw SibSp and Parch counts do not express directly.
- Any increase in variance?
  - No, the spread decreased. The standard deviation across folds dropped from 0.033 to 0.007.
  - A lower standard deviation means the model performs more consistently from fold to fold, which suggests the new features help across the whole dataset rather than on a few lucky rows. With only five folds the standard deviation is itself a noisy estimate, so this is encouraging rather than conclusive.
- Does the improvement seem robust?
  - Accuracy increase: about 3.1 percentage points (0.807 → 0.838).
  - Standard deviation check: the new mean (0.838) sits just below one baseline standard deviation above the old mean (0.807 + 0.033 = 0.840). On mean accuracy alone, the gain is at the edge of what fold-to-fold noise could explain.
  - Stability: the drop in standard deviation to 0.007 is the stronger signal. Taken together with the higher mean, the improvement is likely a real predictive gain rather than random noise. A model with a small spread is also more predictable on unseen data than a "spiky" one, and is often preferred even at slightly lower accuracy.
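The comparison between the accuracy gain and the fold-to-fold spread can be checked with a few lines of arithmetic. This is a rough sketch using only the reported summary numbers; it assumes the baseline was also evaluated with 5 folds, and it treats folds as independent, which is only an approximation because their training sets overlap.

```python
import math

n_folds = 5  # assumption: same number of folds for both runs

# Reported summary statistics (rounded)
baseline_mean, baseline_std = 0.807, 0.033
fe_mean, fe_std = 0.8384, 0.0067

# Gain in accuracy, and how large it is relative to the baseline spread
gain = fe_mean - baseline_mean
gain_in_baseline_stds = gain / baseline_std

def standard_error(std: float, n: int) -> float:
    """Rough standard error of a mean over n folds."""
    return std / math.sqrt(n)

baseline_se = standard_error(baseline_std, n_folds)
fe_se = standard_error(fe_std, n_folds)

print(f"gain:                    {gain:.4f}")
print(f"gain / baseline std:     {gain_in_baseline_stds:.2f}")
print(f"baseline standard error: {baseline_se:.4f}")
print(f"engineered std. error:   {fe_se:.4f}")
```

A more direct check, if the baseline pipeline is still available, is to score both pipelines with the same `cv` object and compare the per-fold differences, since the same splits are then used for both.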

Summary Comparison
| Metric | Baseline | Engineered | Status |
|---|---|---|---|
| Mean CV accuracy | 0.807 | 0.8384 | Improved |
| Std across folds | 0.033 | 0.0067 | Much more stable |