# Dowdle's California Housing Price Prediction
**Author:** Brittany Dowdle  
**Date:** March 19, 2025  
**Objective:** To inspect and explore the data, split it into training and test sets, train a linear regression model, and evaluate its performance with R², MAE, and RMSE.


## Introduction
This project uses the California housing dataset to predict median house value (`MedHouseVal`) from features such as median income (`MedInc`) and average rooms (`AveRooms`). I will explore the data, visualize the feature distributions, select features, train a linear regression model, and look at ways to improve performance.

****

## Imports
In the code cell below, import the necessary Python libraries for this notebook. All imports should be at the top of the notebook. 

In [3]:
# Import pandas for data manipulation and analysis (we might want to do more with it)
import pandas as pd

# Import numpy for numerical operations (we might want to do more with it)
import numpy as np

# Import matplotlib for creating static visualizations
import matplotlib.pyplot as plt

# Import seaborn for statistical data visualization (built on matplotlib)
import seaborn as sns

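# Import fetch_california_housing for loading the dataset
from sklearn.datasets import fetch_california_housing

# Import train_test_split for splitting data into training and test sets
from sklearn.model_selection import train_test_split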

# Import LinearRegression for building a linear regression model
from sklearn.linear_model import LinearRegression

# Import performance metrics for model evaluation
from sklearn.metrics import root_mean_squared_error, mean_absolute_error, r2_score

****
## Section 1. Load and Explore the Data

### 1.1 Load the dataset and display the first 10 rows
Load the California housing dataset directly from `scikit-learn`.
- The `fetch_california_housing` function returns a dictionary-like object with the data.
- Convert it into a pandas DataFrame.
- Display just the first 10 rows using `head()`.

In [None]:
# Load the data
housing = fetch_california_housing(as_frame=True)

# Get the combined pandas DataFrame (features plus target) from the fetched data
df = housing.frame

# Might be large. Display just the first 10 rows (you can change this number)
df.head(10)


### 1.2 Check for missing values and display summary statistics

In the cell below:
1. Use `info()` to check data types and missing values.
2. Use `describe()` to see summary statistics.
3. Use `isnull().sum()` to identify missing values in each column.

In [None]:
# Use info() to check data types and missing values
print("Info:")
df.info()

# Use describe() to see summary statistics
print("Describe:")
print(df.describe())

# Use isnull().sum() to check for missing values in each column
print("IsNull:")
print(df.isnull().sum())

Analysis: 

1) How many data instances (also called data records or data rows) are there? **20,640 data instances**

2) How many features (also columns or attributes) are there? **9 columns: 8 input features plus the target, MedHouseVal**

3) What are the names of the features? ("Feature" is used most often in ML projects.) **MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude, MedHouseVal**

4) Which features are numeric? **All of them. Every column has dtype float64.**

5) Which features are categorical (non-numeric)? **None since all are numeric**

6) Are there any missing values? How should they be handled? Should we delete a sparsely populated column? Delete an incomplete data row? Substitute with a different value? **The IsNull output is 0 for every column, so there are no missing values and no action is needed.**

7) What else do you notice about the dataset? Are there any data issues? **Possible data outliers, unusual values, and highly skewed distributions. Might need to investigate the extreme values to decide if they should be trimmed or transformed and check for feature correlations.**
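
As a rough illustration of how skew and feature correlations could be checked numerically, here is a minimal sketch. It uses synthetic stand-in data rather than the housing DataFrame, and the names `demo_df`, `symmetric`, and `right_skewed` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical), not the real housing dataset
rng = np.random.default_rng(42)
demo_df = pd.DataFrame({
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=1000),
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=1000),
})

# Skewness near 0 suggests a roughly symmetric distribution;
# large positive values suggest a long right tail
skewness = demo_df.skew()
print("Skewness:")
print(skewness)

# Pairwise Pearson correlations between the columns
correlations = demo_df.corr()
print("Correlations:")
print(correlations)
```

The same two calls, `skew()` and `corr()`, could be run on `df` to see which housing columns have long tails or move together.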
****

## Section 2. Visualize Feature Distributions
### 2.1 Create histograms, boxplots, and scatterplots

- Create histograms for all numeric features using `df.hist()` with 30 bins.
- Create boxenplots using `sns.boxenplot()`.
- Create scatter plots using `sns.pairplot()`.

First, histograms:

In [None]:
# Create histograms for all numeric features
df.hist(bins=30, figsize=(12, 8))
plt.tight_layout()

# Show the plot
plt.show()

Second, Boxenplots:

In [None]:
# Create a boxenplot for each column
for column in df.columns:
    plt.figure(figsize=(6, 4))
    sns.boxenplot(x=df[column])
    plt.title(f"Boxenplot for {column}")
    plt.tight_layout()  # Adjust the layout to avoid overlap
    # Show the plot
    plt.show()

Third, Scatter Plots:

*Pro Tip: Comment out after analysis to speed up the notebook.*

In [None]:
# Generate all scatter plots (pairwise plot for numerical columns)
sns.pairplot(df)

# Show the plot
plt.show()

****
## Section 3. Feature Selection and Justification
### 3.1 Choose two input features for predicting the target

- Select `MedInc` and `AveRooms` as predictors.
- Select `MedHouseVal` as the target variable.

In the following, `X` is capitalized because it represents a matrix, and `y` is lowercase because it represents a vector (both consistent with mathematical notation). The code stores them as `df_X` and `df_y`.

First:
- Create a list of input features
- Define the target feature string (the variable we want to predict)
- Define the input DataFrame
- Define the output Series


In [None]:
# Define the input features (predictors)
features = ['MedInc', 'AveRooms']

# Define the target variable (the value we want to predict)
target = 'MedHouseVal'

# Define the input DataFrame (X) - Matrix of features
df_X = df[features]

# Define the output Series (y) - Vector of target values
df_y = df[target]


# Display first few rows
print("Features:")
print(df_X.head())
print("Targets:")
print(df_y.head())

****
## Section 4. Train a Linear Regression Model
### 4.1 Split the data
Split the dataset into training and test sets (80% train / 20% test).

Call `train_test_split()` by passing in: 

- df_X – Feature matrix (input data) as a pandas DataFrame
- df_y – Target values as a pandas Series
- test_size – Fraction of data to use for testing (here, 0.2 = 20%)
- random_state – Seed value for reproducible splits

We'll get back four return values:

- X_train – Training set features (DataFrame)
- X_test – Test set features (DataFrame)
- y_train – Training set target values (Series)
- y_test – Test set target values (Series)
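
To illustrate what `random_state` does, the sketch below splits a small, hypothetical toy dataset twice with the same seed and checks that both calls select the same rows. The names `toy_X` and `toy_y` are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, two features, one target
toy_X = pd.DataFrame({"a": np.arange(10), "b": np.arange(10) * 2})
toy_y = pd.Series(np.arange(10) * 3, name="target")

# Same arguments and same seed: both calls should pick identical rows
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

print("Train rows:", len(X_tr1), "Test rows:", len(X_te1))
print("Same test rows both times:", X_te1.index.equals(X_te2.index))
```

Features and targets stay aligned because each row keeps its original index label in both outputs.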

In [None]:
# Split the data into training and test sets (80% train / 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    df_X, df_y, test_size=0.2, random_state=42
)

# Display confirmation of the split
print("Training set size:", X_train.shape[0])
print("Test set size:", X_test.shape[0])

### 4.2 Train the model
Create and fit a `LinearRegression` model.

LinearRegression – A class from sklearn.linear_model that creates a linear regression model.

model – An instance of the LinearRegression model. This object will store the learned coefficients and intercept after training.

fit() – Trains the model by finding the best-fit linear function for the training data (a line with one feature, a plane with two) using the Ordinary Least Squares (OLS) method.

X_train – The input features used to train the model.

y_train – The target values used to train the model.
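
To make the OLS idea more concrete, here is a minimal sketch that fits `LinearRegression` on synthetic data and then solves the same least-squares problem directly with `np.linalg.lstsq`. The data and the names `X_demo`, `y_demo`, and `demo_model` are hypothetical and are not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data: y = 3 + 2*x1 - 1*x2 + small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Fit with scikit-learn
demo_model = LinearRegression().fit(X_demo, y_demo)

# Solve the same least-squares problem directly.
# A leading column of ones makes the first solved weight the intercept.
X_with_ones = np.column_stack([np.ones(len(X_demo)), X_demo])
weights, *_ = np.linalg.lstsq(X_with_ones, y_demo, rcond=None)

print("sklearn intercept, coefficients:", demo_model.intercept_, demo_model.coef_)
print("lstsq   intercept, coefficients:", weights[0], weights[1:])
```

Both approaches minimize the sum of squared residuals, so the two printed lines should agree up to floating-point rounding.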


In [None]:
# Create the model
model = LinearRegression()

# Train (fit) the model using training data
model.fit(X_train, y_train)

# Display the coefficients and intercept
print("Model coefficients:", model.coef_)
print("Model intercept:", model.intercept_)

Make predictions for the test set.

The `model.predict()` method applies the fitted linear equation (the intercept plus each coefficient times its feature value) to the rows of `X_test` to compute predicted values.

`y_pred = model.predict(X_test)`

`y_pred` contains the predicted value for every row in `X_test`.
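
As a small check of what `predict()` computes, the sketch below fits a model on hypothetical toy data and compares `predict()` with the equation written out by hand. The names `X_toy`, `y_toy`, `toy_model`, and `X_new` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data that follows y = 1 + 2*x1 + 3*x2 exactly
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y_toy = 1.0 + 2.0 * X_toy[:, 0] + 3.0 * X_toy[:, 1]

toy_model = LinearRegression().fit(X_toy, y_toy)

# New rows to predict
X_new = np.array([[3.0, 2.0], [0.5, 4.0]])

# Prediction from the model versus the linear equation applied by hand
pred_from_model = toy_model.predict(X_new)
pred_by_hand = toy_model.intercept_ + X_new @ toy_model.coef_

print("predict():", pred_from_model)
print("by hand:  ", pred_by_hand)
```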


In [None]:
# Make predictions using the trained model
y_pred = model.predict(X_test)

# Create a DataFrame to show Actual and Predicted values
compare_df = pd.DataFrame({
    'Actual': y_test.values,
    'Predicted': y_pred
})

# Display the first few rows of the comparison
print("Compare the values:")
print(compare_df.head())

### 4.3 Report R^2, MAE, RMSE
Evaluate the model using R^2, MAE, and RMSE.

First:

- Coefficient of Determination (R^2) - This tells you how well the model explains the variation in the target variable. A value close to 1 means the model fits the data well; a value close to 0 means the model doesn’t explain the variation well.
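
For reference, R² can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up numbers (`y_true_demo` and `y_pred_demo` are hypothetical), and `r2_score` should return the same value for the same inputs.

```python
import numpy as np

# Hypothetical actual and predicted values
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 8.0])

# Residual sum of squares: variation the predictions fail to explain
ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)

# Total sum of squares: variation of the actual values around their mean
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(f"Manual R²: {r2_manual:.3f}")
```

Here `ss_res` is 1.5 and `ss_tot` is 20, so R² is 1 - 1.5/20 = 0.925. If predictions are worse than always guessing the mean, `ss_res` exceeds `ss_tot` and R² becomes negative.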



In [None]:
# Calculate R² (Coefficient of Determination)
r2 = r2_score(y_test, y_pred)

# Display the R² value
print(f"R² Score: {r2:.2f}")

Second:

- Mean Absolute Error (MAE) - This is the average of the absolute differences between the predicted values and the actual values. A smaller value means the model’s predictions are closer to the actual values.
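
A minimal sketch of the MAE calculation on made-up numbers (the arrays `actual` and `predicted` are hypothetical):

```python
import numpy as np

# Hypothetical actual and predicted values
actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.5, 5.5, 7.0, 8.0])

# Absolute size of each error, ignoring direction: 0.5, 0.5, 0.0, 1.0
abs_errors = np.abs(actual - predicted)

# MAE is the plain average of those absolute errors
mae_manual = abs_errors.mean()
print(f"Manual MAE: {mae_manual:.2f}")
```

Because MAE is in the same units as the target, a value of 0.5 here means predictions are off by half a unit on average.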

In [None]:
# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)

# Display the MAE value
print(f"Mean Absolute Error (MAE): {mae:.2f}")

Third:

- Root Mean Squared Error (RMSE) - This is the square root of the average of the squared differences between the predicted values and the actual values. It gives a sense of how far the predictions are from the actual values, with larger errors having more impact.
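
To show why larger errors have more impact on RMSE, the sketch below compares two hypothetical sets of predictions with the same MAE: one spreads the error evenly, the other concentrates it in a single row. The helper names `rmse_of` and `mae_of` are made up for this example.

```python
import numpy as np

def rmse_of(actual, predicted):
    """Square root of the mean squared difference."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def mae_of(actual, predicted):
    """Mean of the absolute differences."""
    return float(np.mean(np.abs(actual - predicted)))

# Hypothetical actual values and two sets of predictions
actual = np.zeros(4)
even_errors = np.array([1.0, 1.0, 1.0, 1.0])    # every row off by 1
one_big_error = np.array([0.0, 0.0, 0.0, 4.0])  # one row off by 4

print("Even errors:   MAE =", mae_of(actual, even_errors), "RMSE =", rmse_of(actual, even_errors))
print("One big error: MAE =", mae_of(actual, one_big_error), "RMSE =", rmse_of(actual, one_big_error))
```

Both cases have an MAE of 1, but the RMSE doubles from 1 to 2 when the error is concentrated in one row. RMSE is never smaller than MAE, and a wide gap between them points to a few large errors.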

In [None]:
# Calculate Root Mean Squared Error (RMSE)
rmse = root_mean_squared_error(y_test, y_pred)

# Display the RMSE value
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")

****

Analysis: How well did the model perform? Any surprises in the results?

* The R² score of about 0.46 means the model explains about 46% of the variance in the target, so more than half remains unexplained. Some ways to improve would be including additional relevant features or exploring more complex modeling approaches.
* The MAE and RMSE are relatively close. The RMSE is more sensitive to large errors, which explains why it is higher. For example, one of the rows shows an actual value of 5.00001 being predicted as 1.95573, which is a considerable underestimation.
* The linear regression might not be fully capturing the underlying patterns. The R², MAE, and RMSE hint that the underlying relationships might be non-linear!
* Switching to a non-linear model or investigating data points and applying data scaling or outlier treatment might improve predictions.
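
One possible form of outlier treatment is clipping extreme values to chosen percentiles. The sketch below is only an illustration on synthetic data. The 1st and 99th percentile cutoffs and the name `hypothetical_feature` are assumptions, not choices made in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed stand-in for a feature (hypothetical data)
rng = np.random.default_rng(7)
values = pd.Series(
    rng.lognormal(mean=1.0, sigma=1.0, size=1000), name="hypothetical_feature"
)

# Assumed cutoffs: the 1st and 99th percentiles
lower, upper = values.quantile([0.01, 0.99])

# Values outside the cutoffs are pulled in to the nearest cutoff;
# no rows are dropped
clipped = values.clip(lower=lower, upper=upper)

print("Max before clipping:", values.max())
print("Max after clipping: ", clipped.max())
```

Clipping keeps every row, which avoids shrinking the training set, but it also distorts the clipped rows, so any gain would need to be confirmed by re-running the evaluation metrics.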