# Project 4 – Predicting a Continuous Target with Regression (Titanic)

- **Author:** Deb St. Cyr
- **Date:** November 14, 2025

### **Project Overview**
In this project, I explore how different features and modeling techniques impact the accuracy of predicting passenger fare. Unlike earlier modules that used classification to predict whether a passenger survived, this project shifts to **regression**, where the goal is to estimate a numeric value.

I begin by preparing the Titanic dataset, handling missing values, engineering new features, and converting categorical variables into numeric form. Then I build multiple linear regression models using different feature combinations to understand how well each set explains variation in fare.

After identifying the best-performing feature set, I extend the analysis using alternative regression techniques—including **Ridge**, **Elastic Net**, and **Polynomial Regression**—to compare how model complexity and regularization affect predictive performance.

This notebook demonstrates a complete regression workflow:
- Data cleaning and preprocessing  
- Feature engineering  
- Multiple model training and evaluation  
- Interpretation of model behavior  
- Comparison of linear and nonlinear regression methods  

The goal is to understand which features are most useful for fare prediction and how different regression approaches influence accuracy, stability, and generalization.

In [1]:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, ElasticNet
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

plt.style.use("seaborn-v0_8")

## 1. Import and Inspect the Data
Load the Titanic data from seaborn and confirm dataset structure.


In [2]:
titanic = sns.load_dataset("titanic")
titanic.head()

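| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |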


## 2. Data Exploration and Preparation

Steps:
- Impute missing age values (median)
- Drop rows missing fare
- Create `family_size`
- Encode `sex` as numeric codes (`embarked` could be encoded the same way, but it is not used in the models below)


In [3]:
titanic.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

In [4]:
# Impute missing ages with the median age
titanic['age'] = titanic['age'].fillna(titanic['age'].median())

# Remove rows with missing fare; .copy() avoids a possible SettingWithCopyWarning below
titanic = titanic.dropna(subset=['fare']).copy()

# Feature engineering: siblings/spouses + parents/children + the passenger
titanic['family_size'] = titanic['sibsp'] + titanic['parch'] + 1

# Convert sex to numeric category codes (alphabetical order: female = 0, male = 1)
titanic['sex'] = titanic['sex'].astype('category').cat.codes

# Keep only the modeling columns, which also drops the NaN-heavy deck/embark_town/etc.
titanic = titanic[['age', 'family_size', 'pclass', 'sex', 'fare']].dropna()

titanic.head()

| | age | family_size | pclass | sex | fare |
|---|---|---|---|---|---|
| 0 | 22.0 | 2 | 3 | 1 | 7.25 |
| 1 | 38.0 | 2 | 1 | 0 | 71.2833 |
| 2 | 26.0 | 1 | 3 | 0 | 7.925 |
| 3 | 35.0 | 2 | 1 | 0 | 53.1 |
| 4 | 35.0 | 1 | 3 | 1 | 8.05 |
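
To make the `sex` encoding explicit, here is a minimal sketch of how `cat.codes` assigns numbers. It uses only the five `sex` values from the first `head()` output rather than the full dataset, and the variable names are illustrative.

```python
import pandas as pd

# The `sex` values of the first five passengers, copied from the head() output
sex = pd.Series(["male", "female", "female", "female", "male"])

as_category = sex.astype("category")

# Categories are stored in sorted order, so the code is the position in that order
mapping = dict(enumerate(as_category.cat.categories))
codes = as_category.cat.codes.tolist()

print(mapping)  # {0: 'female', 1: 'male'}
print(codes)    # [1, 0, 0, 0, 1]
```

The printed codes match the `sex` column in the prepared table: 1 means male and 0 means female.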


## 3. Feature Selection and Justification

We evaluate four cases:

1. Case 1: age  
2. Case 2: family_size  
3. Case 3: age + family_size  
4. Case 4: pclass + sex (my choice)  


In [5]:
# Case 1
X1 = titanic[['age']]
y1 = titanic['fare']

# Case 2
X2 = titanic[['family_size']]
y2 = titanic['fare']

# Case 3
X3 = titanic[['age', 'family_size']]
y3 = titanic['fare']

# Case 4 - my choice: pclass + sex (common fare predictors)
X4 = titanic[['pclass', 'sex']]
y4 = titanic['fare']

### Section 3 Reflection – Feature Selection

1. **Why might these features affect a passenger’s fare?**

   - **Age:** Age can sometimes relate to fare if ticket pricing differed for children vs. adults, or if traveling adults tended to book higher or lower class accommodations. Although the effect is small, age could still provide a slight relationship to fare.
   - **Family Size:** Larger groups may have purchased tickets together or traveled in lower classes to reduce cost, which could influence fare patterns.
   - **Pclass:** Class is strongly related to fare because 1st class tickets were significantly more expensive than 2nd and 3rd class. This is one of the most important predictors.
   - **Sex:** Sex may relate indirectly to fare because men and women tended to occupy different cabins or travel in different groups, which might correlate with class or ticket type.

2. **List all available features in the original Titanic dataset.**

   The full dataset includes:  
   `survived`, `pclass`, `sex`, `age`, `sibsp`, `parch`, `fare`, `class`,  
   `who`, `adult_male`, `deck`, `embark_town`, `alive`, `alone`, `embarked`,  
   plus engineered features such as `family_size`.

3. **Which other features could improve predictions and why?**

   - **Deck:** Could indicate luxury of accommodations (but contains many missing values).
   - **Embarked:** Different ports may have different fare structures.
   - **Embark_town:** Similar reasoning as embarked, related to departure location.
   - **Class (string version of pclass):** Redundant but correlates with social/economic status.
   
   These features might add explanatory power because ticket prices were strongly tied to socioeconomic status, cabin location, and travel route.

4. **How many variables are in Case 4?**

   Case 4 includes **two variables:** `pclass` and `sex`.

5. **Which variable(s) did you choose for Case 4 and why do you feel these could make good inputs?**

   I chose **pclass** and **sex** for Case 4.  
   - **Pclass** is directly tied to fare and represents one of the strongest predictors of ticket price.  
   - **Sex** is included because passenger gender often correlates with cabin type, ticket grouping, and even travel class. These factors indirectly influence fare and may help improve model performance.  
   
   Together, these variables provide a stronger foundation for predicting fare compared to using demographic features like age or family size alone.
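
Case 4 feeds `pclass` to the model as the numbers 1, 2, 3, which assumes each step in class changes fare by the same amount. One alternative, which this notebook does not use, is one-hot encoding. The sketch below applies `pd.get_dummies` to five rows copied from the first `head()` output; the `sample` frame and the choice of `drop_first=True` are assumptions made for illustration.

```python
import pandas as pd

# Five rows copied from the head() output; `embarked` is not used by the notebook's models
sample = pd.DataFrame({
    "pclass": [3, 1, 3, 1, 3],
    "embarked": ["S", "C", "S", "S", "S"],
})

# One indicator column per category; drop_first removes one level per feature
# so the indicators are not perfectly collinear with the intercept
encoded = pd.get_dummies(
    sample, columns=["pclass", "embarked"], drop_first=True, dtype=int
)

print(encoded)
```

On the full dataset `pclass` would produce two indicator columns and `embarked` would need its missing values handled first.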


## 4. Linear Regression Models
Train/test split, model fitting, predictions, and evaluation.


In [6]:
splits = {}

for i, (X, y) in enumerate([(X1, y1), (X2, y2), (X3, y3), (X4, y4)], start=1):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
    splits[f"X{i}_train"], splits[f"X{i}_test"] = X_train, X_test
    splits[f"y{i}_train"], splits[f"y{i}_test"] = y_train, y_test

In [7]:
titanic.isna().sum()

age            0
family_size    0
pclass         0
sex            0
fare           0
dtype: int64

In [10]:
models = {}
preds = {}

for i in range(1, 4 + 1):
    X_train = splits[f"X{i}_train"]
    y_train = splits[f"y{i}_train"]

    model = LinearRegression().fit(X_train, y_train)
    models[f"lr{i}"] = model

    preds[f"y{i}_pred_train"] = model.predict(X_train)
    preds[f"y{i}_pred_test"] = model.predict(splits[f"X{i}_test"])

In [17]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score


def evaluate_case(i):
    print(f"----- Case {i} -----")

    # Training predictions
    y_train = splits[f"y{i}_train"]
    y_pred_train = preds[f"y{i}_pred_train"]

    # Test predictions
    y_test = splits[f"y{i}_test"]
    y_pred_test = preds[f"y{i}_pred_test"]

    # Training R2
    print("Train R²:", r2_score(y_train, y_pred_train))

    # Test R2
    print("Test R²:", r2_score(y_test, y_pred_test))

    # Test RMSE (manual computation)
    rmse = mean_squared_error(y_test, y_pred_test) ** 0.5
    print("Test RMSE:", rmse)

    # Test MAE
    print("Test MAE:", mean_absolute_error(y_test, y_pred_test))
    print()

In [18]:
for i in range(1, 5):
    evaluate_case(i)

----- Case 1 -----
Train R²: 0.009950688019452314
Test R²: 0.0034163395508415295
Test RMSE: 37.97164180172938
Test MAE: 25.28637293162364

----- Case 2 -----
Train R²: 0.049915792364760736
Test R²: 0.022231186110131973
Test RMSE: 37.6114940041967
Test MAE: 25.02534815941641

----- Case 3 -----
Train R²: 0.07347466201590014
Test R²: 0.049784832763073106
Test RMSE: 37.0777586646559
Test MAE: 24.284935030470688

----- Case 4 -----
Train R²: 0.30902741887346497
Test R²: 0.339901132876393
Test RMSE: 30.90345156449409
Test MAE: 20.39966564200882



### Section 4 Reflection – Linear Regression Model Performance

#### 1. Did Case 1 overfit or underfit? Explain.
Case 1 used **age** as the only predictor. The Train R² was 0.0100 and the Test R² was 0.0034, which are extremely low. Because both the train and test scores were similarly poor, this case clearly **underfit**. Age alone does not meaningfully explain fare.

#### 2. Did Case 2 overfit or underfit? Explain.
Case 2 used **family_size** only. The Train R² (0.0499) and Test R² (0.0222) were slightly higher than Case 1 but still very low. The model performed poorly on both sets, which again indicates **underfitting**.

#### 3. Did Case 3 overfit or underfit? Explain.
Case 3 combined **age + family_size**. This did improve the performance a little (Train R² = 0.0735, Test R² = 0.0498), but the improvement was small and both values remained low. This model also **underfit**, meaning the features did not provide enough information to explain fare.

#### 4. Did Case 4 overfit or underfit? Explain.
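Case 4 used **pclass + sex**, and this model performed much better than the first three. Train R² was 0.3090 and Test R² was 0.3399. The test score was slightly higher than the train score, which suggests the model generalizes and **did not overfit**. About two-thirds of the variance in fare is still unexplained, so the model remains somewhat underfit, but far less so than Cases 1–3 because it uses more meaningful predictors.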

---

### Adding Age
1. **Did adding age improve the model?**  
   Adding age to family_size (Case 3) raised Test R² from 0.0222 (Case 2) to 0.0498, but the improvement was small and the model was still weak.

2. **Possible explanation:**  
   Fare was not strongly tied to age. Ticket prices depended more on travel class and accommodations than whether the passenger was older or younger. This explains why age contributed little to predictive power.

---

### Worst Case
1. **Which case performed the worst?**  
   Case 1 (age only) performed the worst.

2. **How do you know?**  
   It had the **lowest R²** on both the training and test sets and the **highest RMSE and MAE**.

3. **Would more training data help?**  
   No. The issue is the **weak feature**, not the dataset size. Age simply does not explain fare well.

---

### Best Case
1. **Which case performed the best?**  
   Case 4 (pclass + sex) performed the best.

2. **How do you know?**  
   It had the **highest R²** and the **lowest RMSE and MAE** across all cases. These features captured meaningful information about fare.

3. **Would more training data help?**  
   Possibly a little, but train and test R² are already close, so the model appears limited more by its features than by sample size. Larger improvements would require adding relevant features, not more rows.
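
The "would more training data help?" questions can be examined with a learning curve. The sketch below is one way to do that, using scikit-learn's `learning_curve` on synthetic data in which a single weak feature explains very little of the target. The data, sizes, and names are assumptions for illustration and are not the Titanic results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in for a weak predictor: the target is mostly noise
weak_feature = rng.normal(size=(n, 1))
target = 0.1 * weak_feature[:, 0] + rng.normal(size=n)

# Fit on growing subsets of the training folds and score with 5-fold CV
sizes, train_scores, val_scores = learning_curve(
    LinearRegression(),
    weak_feature,
    target,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    scoring="r2",
)

for size, train_r2, val_r2 in zip(
    sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)
):
    print(f"n_train={size:4d}  train R²={train_r2:.3f}  validation R²={val_r2:.3f}")
```

If both curves flatten at a low R² as the training size grows, more rows are unlikely to help and better features are the more promising route.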


## 5. Alternative Regression Models

#### 5.0 - Set up Best Case Data (Case 4)

In [None]:
X_best = X4
y_best = y4

# Train/test split for best case
X_train_best, X_test_best, y_train_best, y_test_best = train_test_split(
    X_best, y_best, test_size=0.2, random_state=123
)

#### 5.1 - Ridge Regression

In [None]:
from sklearn.linear_model import Ridge

ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train_best, y_train_best)

y_pred_ridge = ridge_model.predict(X_test_best)

#### 5.2 - Elastic Net Regression

In [21]:
from sklearn.linear_model import ElasticNet

elastic_model = ElasticNet(alpha=0.3, l1_ratio=0.5)
elastic_model.fit(X_train_best, y_train_best)

y_pred_elastic = elastic_model.predict(X_test_best)

#### 5.3 - Polynomial Regression (degree = 3)

In [22]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly = PolynomialFeatures(degree=3)
X_train_poly = poly.fit_transform(X_train_best)
X_test_poly = poly.transform(X_test_best)

poly_model = LinearRegression()
poly_model.fit(X_train_poly, y_train_best)

y_pred_poly = poly_model.predict(X_test_poly)

#### 5.4 - Compare Model Performance

In [23]:
def report(name, y_true, y_pred):
    rmse = mean_squared_error(y_true, y_pred) ** 0.5
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)

    print(f"{name} R²:   {r2:.3f}")
    print(f"{name} RMSE: {rmse:.2f}")
    print(f"{name} MAE:  {mae:.2f}\n")

In [26]:
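# Linear Regression baseline (Case 4).
# The Section 5 split reuses test_size=0.2 and random_state=123, so X_test_best
# contains the same rows as the Case 4 test set that lr4 was evaluated on.
y_pred_lr = models["lr4"].predict(X_test_best)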

report("Linear Regression", y_test_best, y_pred_lr)
report("Ridge", y_test_best, y_pred_ridge)
report("Elastic Net", y_test_best, y_pred_elastic)
report("Polynomial (deg 3)", y_test_best, y_pred_poly)

Linear Regression R²:   0.340
Linear Regression RMSE: 30.90
Linear Regression MAE:  20.40

Ridge R²:   0.340
Ridge RMSE: 30.89
Ridge MAE:  20.36

Elastic Net R²:   0.369
Elastic Net RMSE: 30.23
Elastic Net MAE:  19.18

Polynomial (deg 3) R²:   0.446
Polynomial (deg 3) RMSE: 28.30
Polynomial (deg 3) MAE:  17.61



### Section 5 Reflection – Comparing Alternative Regression Models

#### 1. What patterns does the cubic polynomial model seem to capture?

The polynomial model (degree 3) captures a more flexible and curved relationship between the predictors (pclass and sex) and fare. Because polynomial features include interaction terms and nonlinear transformations, the model is able to represent more complex price differences across passenger classes and between males and females. This helped the model fit patterns that a purely additive straight-line fit could not capture, such as a male/female fare gap that varies by class.

#### 2. Where does it perform well or poorly?

The polynomial model performed well overall, producing the **highest R² (0.446)** and **lowest errors (RMSE ≈ 28.30, MAE ≈ 17.61)** of all models. It likely performs best where the data is dense, such as common fare ranges, and where nonlinear effects exist; the notebook reports only aggregate test metrics, so this is not measured directly. Because the input here is simple (just two columns), the model may also overfit slightly by forcing curvature where only limited variation exists.

#### 3. Did the polynomial model outperform linear regression?

Yes.  
The baseline linear regression model (Case 4) produced an R² of **0.340**, while the polynomial model improved this to **0.446**. The errors also decreased significantly. This shows that adding non-linear interactions helped the model pick up additional relationships between class, sex, and fare.

#### 4. Where does the polynomial fit best?

The polynomial model likely fits best in the middle ranges of the fare values, where the majority of passengers are priced. It is also able to model differences between passenger classes more flexibly. However, `pclass` takes only three values and `sex` only two, so there are just six distinct input combinations. Terms beyond those needed to separate these six groups do not add meaningful complexity and may introduce noise at the extremes.
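
The point about low-dimensional inputs can be checked directly. The sketch below expands every possible (`pclass`, `sex`) pair with the same `PolynomialFeatures(degree=3)` transform and looks at the rank of the result. It assumes `sex` is coded 0/1 as in the preparation step; the `combos` array is built by hand rather than taken from the dataset.

```python
import itertools

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Every (pclass, sex) pair that can occur: 3 classes x 2 sex codes
combos = np.array(list(itertools.product([1, 2, 3], [0, 1])))

poly = PolynomialFeatures(degree=3)
expanded = poly.fit_transform(combos)

n_rows, n_columns = expanded.shape
rank = np.linalg.matrix_rank(expanded)

print("distinct input rows:   ", n_rows)     # 6
print("polynomial columns:    ", n_columns)  # 10
print("rank of expanded matrix:", rank)      # 6
```

Ten polynomial columns carry only six independent directions, so the cubic model has enough freedom to give each class-sex group its own predicted fare but no more than that. It cannot explain fare differences between passengers who share the same class and sex.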

---

### Comparing All Models

**Linear Regression (baseline):**  
- Simple and interpretable  
- R² = 0.340  
- Good starting point, but limited flexibility  

**Ridge Regression:**  
- R² = 0.340 (same as Linear)  
- Ridge does not help much because the model has few coefficients and no large weights to penalize  

**Elastic Net:**  
- R² = 0.369  
- Slight improvement over simple linear regression  
- Suggests that mild regularization can help on this train/test split, though the gain is small  

**Polynomial Regression (degree 3):**  
- R² = 0.446  
- Best performing model overall  
- Able to model nonlinear relationships, but may risk mild overfitting if degree is too high  
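
To illustrate why Ridge and plain linear regression land so close together with only two predictors, the sketch below fits both on synthetic data shaped like the Case 4 inputs. The coefficients, noise level, and sample size are made up for illustration and are not estimates from the Titanic data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 500

# Synthetic inputs with the same value ranges as pclass (1-3) and sex (0/1)
pclass = rng.integers(1, 4, size=n)
sex = rng.integers(0, 2, size=n)
X = np.column_stack([pclass, sex])

# Synthetic target: made-up coefficients plus noise
y = 90 - 25 * pclass - 10 * sex + rng.normal(scale=20, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
```

With hundreds of rows and two features, the `alpha=1.0` penalty is small relative to the data term, so the Ridge coefficients are shrunk only slightly toward zero.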

---

### Summary

Overall, Polynomial Regression performed the best on the Titanic fare prediction task, while Ridge and Linear Regression produced nearly identical results. Elastic Net improved performance slightly. This indicates that a modest amount of nonlinearity helps capture patterns in how class and sex influence fare, but additional regularization or higher-order models should be used carefully to avoid overfitting.


In [27]:
import pandas as pd

results = {
    "Model": ["Linear (Case 4)", "Ridge", "Elastic Net", "Polynomial (deg 3)"],
    "R²": [
        r2_score(y_test_best, y_pred_lr),
        r2_score(y_test_best, y_pred_ridge),
        r2_score(y_test_best, y_pred_elastic),
        r2_score(y_test_best, y_pred_poly),
    ],
    "RMSE": [
        (mean_squared_error(y_test_best, y_pred_lr) ** 0.5),
        (mean_squared_error(y_test_best, y_pred_ridge) ** 0.5),
        (mean_squared_error(y_test_best, y_pred_elastic) ** 0.5),
        (mean_squared_error(y_test_best, y_pred_poly) ** 0.5),
    ],
    "MAE": [
        mean_absolute_error(y_test_best, y_pred_lr),
        mean_absolute_error(y_test_best, y_pred_ridge),
        mean_absolute_error(y_test_best, y_pred_elastic),
        mean_absolute_error(y_test_best, y_pred_poly),
    ],
}

df_results = pd.DataFrame(results)
df_results

| | Model | R² | RMSE | MAE |
|---|---|---|---|---|
| 0 | Linear (Case 4) | 0.339901 | 30.903452 | 20.399666 |
| 1 | Ridge | 0.340454 | 30.8905 | 20.361302 |
| 2 | Elastic Net | 0.368549 | 30.225426 | 19.184114 |
| 3 | Polynomial (deg 3) | 0.446267 | 28.30432 | 17.614741 |


## 6. Final Thoughts & Insights

### 6.1 Summarize Findings

This project explored multiple regression models to predict Titanic passenger fares using different feature combinations.  
The early cases (1–3) performed poorly, with R² scores close to zero, showing that features such as **age** and **family_size** alone have very limited predictive power. The results highlight that these variables do not meaningfully explain fare variation.

The best traditional model in Section 4 was **Case 4**, which used **pclass** and **sex**. These features showed a much stronger relationship with fare (Test R² ≈ 0.34), which is consistent with socioeconomic status (represented by passenger class) being a major factor in fare pricing. The separate contributions of `pclass` and `sex` were not tested individually.

In Section 5, comparing alternative models revealed that:

- **Linear Regression** and **Ridge Regression** performed similarly.  
- **Elastic Net** offered a modest improvement.  
- **Polynomial Regression (degree 3)** provided the best performance overall, raising R² to **~0.446** and reducing error significantly.

This indicates that a modest amount of **nonlinearity** helps capture additional structure in the relationship between the predictors and fare.

---

### 6.2 Discuss Challenges

Predicting fare turned out to be more challenging than expected. Although the Titanic dataset is well-known and commonly used, fare values are affected by a number of underlying factors that were not included in the simplified feature set used for this project. For example:

- Fare distributions are **highly skewed**, with a few extremely expensive tickets.  
- Some ticket types were bundled or shared among family members, making the individual fare less straightforward.  
- Missing or inconsistent cabin and deck information limits the ability to capture the true effects of accommodations.  

These factors contributed to modest predictive performance across models.  
Even the best-performing model (Polynomial Regression) explained less than half of the variation in fare.

---

### 6.3 Optional Next Steps

If time allowed, several improvements could further enhance model performance:

1. **Include additional features** such as embarkation port, deck, ticket group size, or “alone/not alone” status.
2. **Try predicting age instead of fare**, a target that may yield clearer linear relationships.
3. **Apply log transformation to fare** to reduce skew and stabilize variance, which could improve linear regression performance.
4. **Use more advanced models**, such as Random Forest or Gradient Boosting, which often perform better on structured/tabular data.
5. **Investigate feature interactions systematically**, especially those involving passenger class and travel group structure.
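
As a sketch of next step 3, scikit-learn's `TransformedTargetRegressor` can fit a model on a log-transformed target and convert predictions back to the original units. The example below uses a synthetic right-skewed target as a stand-in for fare; the data and coefficients are assumptions for illustration, and this was not run as part of the notebook.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 300

# Synthetic class feature and a right-skewed (log-normal) target standing in for fare
pclass = rng.integers(1, 4, size=n).reshape(-1, 1)
fare = np.exp(4.5 - 0.8 * pclass[:, 0] + rng.normal(scale=0.5, size=n))

# Fit the regression on log1p(fare); predictions are mapped back with expm1
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_model.fit(pclass, fare)

pred = log_model.predict(pclass)
print("predictions are in original fare units:", pred[:5].round(2))
```

Because the predictions come back in the original units, RMSE, MAE, and R² can be computed the same way as for the other models and compared directly.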

---

### Summary

Overall, this project demonstrated the importance of thoughtful feature selection, the usefulness of evaluating multiple regression approaches, and the value of incorporating nonlinear modeling techniques. Of the features tested, passenger class and sex were the strongest predictors of fare, and Polynomial Regression achieved the best performance by allowing the model to learn more complex patterns within those features.
