# Project 4 – Predicting a Continuous Target with Regression: Titanic Dataset
**Name:** Lindsay Foster 
**Date:** 11/11/2025
- This project builds on the Titanic dataset used in Project 3. The goal is to predict `fare`, the amount of money a passenger paid for the journey, which is a continuous numeric target.

# Section 1: Import and Inspect the Data

In [17]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import ElasticNet, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

In [18]:
# Load Titanic dataset from seaborn and verify
titanic = sns.load_dataset("titanic")
titanic.head()

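| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |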


# Section 2: Data Exploration and Preparation

In [19]:
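# Fill missing ages with the median age
titanic["age"] = titanic["age"].fillna(titanic["age"].median())

# Drop rows with a missing fare (the regression target)
titanic = titanic.dropna(subset=["fare"])

# Family size = siblings/spouses + parents/children + the passenger
titanic["family_size"] = titanic["sibsp"] + titanic["parch"] + 1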

In [20]:
# Convert sex to numeric (male = 0, female = 1)
titanic["sex_num"] = titanic["sex"].map({"male": 0, "female": 1})

# Convert embarked to numeric codes (cat.codes assigns -1 to missing values)
titanic["embarked_num"] = titanic["embarked"].astype("category").cat.codes
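One detail worth checking in the `embarked` encoding: `cat.codes` numbers the categories in sorted order and assigns `-1` to missing values, so any row with a missing `embarked` would get `embarked_num == -1`. The sketch below shows this on a small made-up Series (the values are illustrative, not taken from the dataset) and one possible way to handle it, filling with the most frequent port first. Whether filling with the mode is appropriate for this project is an assumption.

```python
import pandas as pd

# Toy stand-in for titanic["embarked"]; illustrative values only
embarked = pd.Series(["S", "C", None, "Q", "S"])

# Categories are sorted alphabetically (C=0, Q=1, S=2); the missing value becomes -1
codes = embarked.astype("category").cat.codes
print(codes.tolist())  # [2, 0, -1, 1, 2]

# One option (an assumption, not the notebook's method): fill with the mode first
filled = embarked.fillna(embarked.mode()[0])
filled_codes = filled.astype("category").cat.codes
print(filled_codes.tolist())  # [2, 0, 2, 1, 2]
```

Since `embarked_num` is not used in Cases 1 through 4, this only matters if the column is added as a feature later.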


# Section 3: Feature Selection and Justification

In [21]:
# Case 1. age
X1 = titanic[["age"]]
y1 = titanic["fare"]

In [22]:
# Case 2. family_size
X2 = titanic[["family_size"]]
y2 = titanic["fare"]

In [23]:
# Case 3. age, family_size
X3 = titanic[["age", "family_size"]]
y3 = titanic["fare"]

In [24]:
# Case 4. passenger class
X4 = titanic[["pclass"]]
y4 = titanic["fare"]

## Reflection
- **Why might these features affect a passenger’s fare:** Age may affect fare if discounts were offered to children or seniors. Family size may affect fare because larger families buy more tickets and may pay more in total. Passenger class may affect fare because 1st, 2nd, and 3rd class tickets were priced differently.
- **List all available features:** survived, pclass, sex, age, sibsp, parch, fare, embarked, class, who, adult_male, deck, embark_town, alive, alone
- **Which other features could improve predictions and why:** Deck may be a good feature, since deck and class together could determine the fare. Embark town could also help if prices were higher for departures from larger cities.
- **How many variables are in your Case 4:** 1 - pclass
- **Which variable(s) did you choose for Case 4 and why do you feel those could make good inputs:** pclass (passenger class) is a good candidate because the Titanic had three passenger classes (1st, 2nd, and 3rd), and passengers paid more for better accommodations, so class directly affects fare.
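The reasoning about passenger class can be checked with a quick group-by. The sketch below uses only the five rows printed by `titanic.head()` in Section 1, which is far too few to conclude anything. It just illustrates the kind of check; running the same `groupby` on the full `titanic` frame would be the real test.

```python
import pandas as pd

# The five rows shown by titanic.head() in Section 1 (pclass and fare only)
head = pd.DataFrame({
    "pclass": [3, 1, 3, 1, 3],
    "fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Mean fare per passenger class
mean_fare = head.groupby("pclass")["fare"].mean()
print(mean_fare)
```

In these five rows the 1st class fares average about 62 and the 3rd class fares about 7.7, which is at least consistent with class being tied to price.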

# Section 4: Train a Regression Model (Linear Regression)

### 4.1 Split the Data

In [25]:
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.2, random_state=123)

X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.2, random_state=123)

X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y3, test_size=0.2, random_state=123)

X4_train, X4_test, y4_train, y4_test = train_test_split(X4, y4, test_size=0.2, random_state=123)
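Because all four calls above use the same rows, `test_size`, and `random_state`, they should produce the same train/test partition. A possible alternative is to split the row positions once and slice each case's columns from them. The sketch below uses a randomly generated toy frame (illustrative values only), and the names `feature_sets` and `splits` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the prepared titanic data (random, illustrative only)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.uniform(1, 80, size=50),
    "family_size": rng.integers(1, 8, size=50),
    "pclass": rng.integers(1, 4, size=50),
    "fare": rng.uniform(5, 100, size=50),
})

# Split the row positions once
train_idx, test_idx = train_test_split(
    np.arange(len(df)), test_size=0.2, random_state=123
)

# Slice each case's feature columns from the same train/test rows
feature_sets = {
    "Case 1": ["age"],
    "Case 2": ["family_size"],
    "Case 3": ["age", "family_size"],
    "Case 4": ["pclass"],
}
splits = {
    name: (
        df.iloc[train_idx][cols],    # X_train
        df.iloc[test_idx][cols],     # X_test
        df.iloc[train_idx]["fare"],  # y_train
        df.iloc[test_idx]["fare"],   # y_test
    )
    for name, cols in feature_sets.items()
}
```

This keeps the guarantee that every case is evaluated on identical test rows explicit, rather than relying on matching `random_state` values.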

### 4.2 Train and Evaluate Linear Regression Models (all 4 cases)

In [26]:
# Train models
lr_model1 = LinearRegression().fit(X1_train, y1_train)
lr_model2 = LinearRegression().fit(X2_train, y2_train)
lr_model3 = LinearRegression().fit(X3_train, y3_train)
lr_model4 = LinearRegression().fit(X4_train, y4_train)

# Predictions for Case 1
y_pred_train1 = lr_model1.predict(X1_train)
y_pred_test1 = lr_model1.predict(X1_test)

# Predictions for Case 2
y_pred_train2 = lr_model2.predict(X2_train)
y_pred_test2 = lr_model2.predict(X2_test)

# Predictions for Case 3
y_pred_train3 = lr_model3.predict(X3_train)
y_pred_test3 = lr_model3.predict(X3_test)

# Predictions for Case 4
y_pred_train4 = lr_model4.predict(X4_train)
y_pred_test4 = lr_model4.predict(X4_test)


### 4.3 Report Performance

In [27]:
# Case 1
print("Case 1: Training R²:", r2_score(y1_train, y_pred_train1))
print("Case 1: Test R²:", r2_score(y1_test, y_pred_test1))
print("Case 1: Test RMSE:", np.sqrt(mean_squared_error(y1_test, y_pred_test1)))
print("Case 1: Test MAE:", mean_absolute_error(y1_test, y_pred_test1))
print("------------------------------------------------------")

# Case 2
print("Case 2: Training R²:", r2_score(y2_train, y_pred_train2))
print("Case 2: Test R²:", r2_score(y2_test, y_pred_test2))
print("Case 2: Test RMSE:", np.sqrt(mean_squared_error(y2_test, y_pred_test2)))
print("Case 2: Test MAE:", mean_absolute_error(y2_test, y_pred_test2))
print("------------------------------------------------------")

# Case 3
print("Case 3: Training R²:", r2_score(y3_train, y_pred_train3))
print("Case 3: Test R²:", r2_score(y3_test, y_pred_test3))
print("Case 3: Test RMSE:", np.sqrt(mean_squared_error(y3_test, y_pred_test3)))
print("Case 3: Test MAE:", mean_absolute_error(y3_test, y_pred_test3))
print("------------------------------------------------------")

# Case 4
print("Case 4: Training R²:", r2_score(y4_train, y_pred_train4))
print("Case 4: Test R²:", r2_score(y4_test, y_pred_test4))
print("Case 4: Test RMSE:", np.sqrt(mean_squared_error(y4_test, y_pred_test4)))
print("Case 4: Test MAE:", mean_absolute_error(y4_test, y_pred_test4))
print("------------------------------------------------------")


Case 1: Training R²: 0.009950688019452314
Case 1: Test R²: 0.0034163395508415295
Case 1: Test RMSE: 37.97164180172938
Case 1: Test MAE: 25.28637293162364
------------------------------------------------------
Case 2: Training R²: 0.049915792364760736
Case 2: Test R²: 0.022231186110131973
Case 2: Test RMSE: 37.6114940041967
Case 2: Test MAE: 25.02534815941641
------------------------------------------------------
Case 3: Training R²: 0.07347466201590014
Case 3: Test R²: 0.049784832763073106
Case 3: Test RMSE: 37.0777586646559
Case 3: Test MAE: 24.284935030470688
------------------------------------------------------
Case 4: Training R²: 0.3005588037487471
Case 4: Test R²: 0.3016017234169923
Case 4: Test RMSE: 31.7873316928033
Case 4: Test MAE: 20.653703671484056
------------------------------------------------------
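The four near-identical reporting blocks above could be folded into one helper. The following is a minimal sketch, assuming a hypothetical function named `report`. It is shown on toy data where `y` is an exact linear function of `x`, so the expected behavior is easy to verify.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def report(name, model, X_train, X_test, y_train, y_test):
    """Return the four reported metrics as a dict (hypothetical helper)."""
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    return {
        "case": name,
        "train_r2": r2_score(y_train, pred_train),
        "test_r2": r2_score(y_test, pred_test),
        "test_rmse": np.sqrt(mean_squared_error(y_test, pred_test)),
        "test_mae": mean_absolute_error(y_test, pred_test),
    }


# Toy data (illustrative only): y = 2x + 1 exactly
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([3.0, 5.0, 7.0, 9.0])
X_test = np.array([[5.0], [6.0]])
y_test = np.array([11.0, 13.0])

model = LinearRegression().fit(X_train, y_train)
metrics = report("toy", model, X_train, X_test, y_train, y_test)
print(metrics)
```

In the notebook, it would be called as `report("Case 1", lr_model1, X1_train, X1_test, y1_train, y1_test)` and so on for the other cases.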


## Reflection
*Compare the train vs test results for each.*

- **Did Case 1 overfit or underfit? Explain:** Case 1 underfit, as R² is close to zero on both the training and test sets.
- **Did Case 2 overfit or underfit? Explain:** Case 2 underfit, as R² on both the training and test sets is higher than in Case 1 but still close to zero.
- **Did Case 3 overfit or underfit? Explain:** Case 3 underfit, as R² on both the training and test sets is better than in Cases 1 and 2 but still too close to zero.
- **Did Case 4 overfit or underfit? Explain:** Case 4 neither overfit nor underfit relative to the other cases; it generalizes well. Training and test R² are nearly identical (about 0.30), so passenger class explains about 30% of the variation in fare.
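The "close to zero" reasoning rests on what R² measures: R² = 1 − SS_res / SS_tot, so a model that always predicts the mean scores exactly 0. The sketch below computes this by hand with a hypothetical helper `r2_by_hand`, using the five fares printed by `titanic.head()` in Section 1 purely as sample numbers.

```python
import numpy as np


def r2_by_hand(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot (hypothetical helper for illustration)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot


# The five fares shown by titanic.head() in Section 1
fares = np.array([7.25, 71.2833, 7.925, 53.1, 8.05])

# Predicting the mean fare for every passenger gives R² of 0
baseline = np.full_like(fares, fares.mean())
print(r2_by_hand(fares, baseline))  # 0.0

# Perfect predictions give R² of 1
print(r2_by_hand(fares, fares))  # 1.0
```

Read this way, the test R² of about 0.003 in Case 1 means the age-only model is barely better than predicting the average fare for everyone.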

*Adding Age*

- **Did adding age improve the model:** It improved the model slightly (test R² rose from about 0.02 in Case 2 to about 0.05 in Case 3), but not enough to explain fare. Combining the two features works better than using either one alone.
- **Propose a possible explanation (consider how age might affect ticket price, and whether the data supports that):** Larger families may pay more because they buy more tickets, and cheaper tickets for children or seniors could also play a role. Still, only about 5% of fare variation is explained on the test set, a small improvement over Case 1 (age) or Case 2 (family_size) alone.
  
*Worst*

- **Which case performed the worst:** Case 1
- **How do you know:** Almost zero fare variation is explained and prediction errors are large. 
- **Do you think adding more training data would improve it (and why/why not):** No, there is enough data here; age just isn't a strong predictor of fare.

*Best*

- **Which case performed the best:** Case 4
- **How do you know:** About 30% of fare variation is explained and the prediction errors are smaller. 
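- **Do you think adding more training data would improve it (and why/why not):** It could help a little, but not much. Passenger class is a strong predictor of fare, and training and test R² are already nearly identical, so the model is limited by having pclass as its only input rather than by the amount of data.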
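One way to test the "would more data help" answers empirically is a learning curve, which refits the model on growing subsets and tracks the cross-validated score. The sketch below uses scikit-learn's `learning_curve` on synthetic data (a made-up linear relationship between class and fare plus noise, illustrative only). It shows the mechanics and says nothing about the real Titanic results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, learning_curve

# Synthetic stand-in (illustrative only): fare loosely tied to class, plus noise
rng = np.random.default_rng(0)
pclass = rng.integers(1, 4, size=400)
fare = 90 - 25 * pclass + rng.normal(0, 30, size=400)
X = pclass.reshape(-1, 1)

# Fit on 10%, 30%, 60%, and 100% of each training fold and score with R²
train_sizes, train_scores, test_scores = learning_curve(
    LinearRegression(), X, fare,
    train_sizes=[0.1, 0.3, 0.6, 1.0],
    cv=KFold(n_splits=5, shuffle=True, random_state=123),
    scoring="r2",
)

# Average the scores across the five folds for each training size
for n, tr, te in zip(train_sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"n={int(n):4d}  train R²={tr:.3f}  CV R²={te:.3f}")
```

If the cross-validated R² flattens out as `n` grows, more rows are unlikely to help and better features are the more promising route. Applying this to the notebook would mean passing `X4` and `y4` (or `X1` and `y1`) in place of the synthetic arrays.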

# Section 5: Compare Alternative Models (Ridge, Elastic Net, Polynomial Regression)

# Section 6: Final Thoughts & Insights