<a href="https://colab.research.google.com/github/CharlotteHoyt/KWK-Goldman-Sachs-ML-Titanic-Survival-Data/blob/main/Charlotte_Hoyt_KWK_Titanic_Survival_Data_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# KWK Machine Learning Challenge 2025
## Titanic Survival Data Notebook

Make a copy of this notebook and follow along with the guided curriculum lessons!

Portions of code for you to complete are marked with a #TODO comment.

### Mounting files from Google Colab

In [7]:
# For more information on this, see this link: https://colab.research.google.com/notebooks/io.ipynb
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
# Basic import statements
import pandas as pd
from scipy import stats

# Loading Data into a Dataframe
First, we load the data from the CSV file into a pandas dataframe. Read more about pandas dataframes [here](https://pandas.pydata.org/docs/user_guide/dsintro.html).

After we load the data using the `read_csv` function, we preview the dataframe using `df.head()` and visually inspect the results to ensure it was loaded as we expected.

In [9]:
# TODO: Replace the path below with the path to the file on your own Google Drive.
df = pd.read_csv("/content/drive/My Drive/KWK_ML/titanic.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [10]:
# This should print a column summary if your dataframe was loaded correctly.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


# STAT1.1.1 - Data Types

Below is a list of the columns in the Titanic dataset.
Record what kind of data each represents.
Use the following categories to classify each variable:

* Continuous quantitative data
* Discrete quantitative data
* Nominal qualitative data
* Ordinal qualitative data
* Other common types (select which)

| Column Name   | Description                          | Data Type                      | Explanation                                                                          |
| ------------- | ------------------------------------ | ------------------------------ | ------------------------------------------------------------------------------------ |
| `PassengerId` | Unique identifier for each passenger | **Nominal Qualitative** | An identifier; the numbers label passengers and are not quantities. |
| `Survived`    | Survival status (0 = No, 1 = Yes)    | **Binary/Boolean** | Only has two possible values. |
| `Pclass`      | Ticket class (1st, 2nd, 3rd)         | **Ordinal Qualitative** | Categorical data with a meaningful order. |
| `Name`        | Passenger's full name                | **Text/String** | Names are free-form text that can't be categorized. |
| `Sex`         | Gender of passenger                  | **Nominal Qualitative** | Categorical data with no natural order. |
| `Age`         | Age in years                         | **Continuous Quantitative** | Any numerical value, possibly including fractions/decimals. |
| `SibSp`       | Number of siblings/spouses aboard    | **Discrete Quantitative** | Whole numbers; you can't have a partial count of siblings/spouses. |
| `Parch`       | Number of parents/children aboard    | **Discrete Quantitative** | Similarly, counted using whole numbers. |
| `Ticket`      | Ticket number                        | **Nominal Qualitative** | An ID; it contains numbers but is not quantitative. |
| `Fare`        | Ticket price                         | **Continuous Quantitative** | A ticket price can take fractional/decimal values. |
| `Cabin`       | Cabin number                         | **Nominal Qualitative** | An identifier; it includes numbers but is not quantitative. |
| `Embarked`    | Port of embarkation (C, Q, S)        | **Nominal Qualitative** | Categories representing locations (unordered). |


# STAT1.1.2 - Calculating Measures of Central Tendency
Let's start by finding the measures of central tendency of each numeric variable in the Titanic dataset.

By calculating the mean, median, and mode for columns like `Age`, `SibSp`, `Parch`, `Fare`, and `Pclass`, we can get a sense of what was typical for passengers on board: their average age, common ticket class, and usual family size.
These measures help summarize large amounts of data into a few meaningful numbers that tell the story of who was on the ship.

In [12]:
# Select numeric columns to analyze
# We'll focus on five continuous or discrete quantitative variables
cols = ["Age", "SibSp", "Parch", "Fare", "Pclass"]

# Loop through each column and compute measures of central tendency
for col in cols:
    print(f"\n--- {col.upper()} ---")

    # Calculate Mean
    # TODO: calculate the mean of the column and assign it to the variable `mean_value`
    mean_value = df[col].mean()
    print(f"Mean: {mean_value}")

    # Calculate Median
    # TODO: calculate the median of the column and assign it to the variable `median_value`
    median_value = df[col].median()
    print(f"Median: {median_value}")

    # Calculate Mode
    # This example is a bit more complicated, so it's done for you below.
    # Note: unlike mean() and median(), this call does not skip missing values,
    # which is why Age reports `nan` in the output below.
    mode_result = stats.mode(df[col], keepdims=True)
    mode_value = mode_result.mode[0]
    count_mode = mode_result.count[0]
    print(f"Mode: {mode_value} (appears {count_mode} times)")


--- AGE ---
Mean: 29.69911764705882
Median: 28.0
Mode: nan (appears 177 times)

--- SIBSP ---
Mean: 0.5230078563411896
Median: 0.0
Mode: 0 (appears 608 times)

--- PARCH ---
Mean: 0.38159371492704824
Median: 0.0
Mode: 0 (appears 678 times)

--- FARE ---
Mean: 32.204207968574636
Median: 14.4542
Mode: 8.05 (appears 43 times)

--- PCLASS ---
Mean: 2.308641975308642
Median: 3.0
Mode: 3 (appears 491 times)
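
The `Mode: nan` result for `Age` reflects the 177 missing ages being counted as if they were a value. As a minimal sketch of one way around this, the example below uses a small made-up series (not the Titanic data) and relies on pandas' `Series.mode()`, which skips missing values by default. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages (not the Titanic data), with three missing values
ages = pd.Series([22.0, 24.0, 24.0, np.nan, 30.0, np.nan, np.nan])

# Series.mode() ignores NaN by default and returns a Series
# (there may be ties), so we take the first entry
mode_age = ages.mode().iloc[0]
count_mode = int((ages == mode_age).sum())

print(f"Mode: {mode_age} (appears {count_mode} times)")
# prints: Mode: 24.0 (appears 2 times)
```

Even though the missing values outnumber any single age here, the reported mode is a real age.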


### 💭 Reflection: Interpreting Averages

**What do these averages tell us about who was on board the Titanic?**

TODO - answer

**What do the results for SibSp and Parch suggest about how people traveled?**

TODO - answer

# STAT1.1.3 - Calculating Spread and Variation

Now that we know what was typical for passengers aboard the Titanic, let's look at how much variation there was between them.

By calculating the **range**, **interquartile range** (IQR), and **standard deviation** for features like `Age`, `Fare`, and `Pclass`, we can see how widely passengers differed in age, ticket price, and social class.
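
To make the three measures concrete, here is a minimal sketch on a small made-up list of fares (not the Titanic data). It assumes pandas' default behavior: `quantile` uses linear interpolation and `std` returns the sample standard deviation.

```python
import pandas as pd

# Hypothetical toy fares (not the Titanic data)
fares = pd.Series([7.0, 8.0, 10.0, 13.0, 26.0, 80.0])

# Range: distance between the largest and smallest value
value_range = fares.max() - fares.min()

# Interquartile range: width of the middle 50% of the data
q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1

# Standard deviation: pandas uses the sample formula (ddof=1) by default
std_dev = fares.std()

print(f"Range: {value_range}, IQR: {iqr}, Std Dev: {std_dev:.2f}")
```

Notice how the single large fare stretches the range far more than it stretches the IQR, which is why the IQR is often preferred when outliers are present.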

In [None]:
# Select numeric columns to analyze
# We'll focus on five continuous or discrete quantitative variables
cols = ["Age", "SibSp", "Parch", "Fare", "Pclass"]

# Loop through each column and compute measures of spread
for col in cols:
    print(f"\n--- {col.upper()} ---")

    # TODO - calculate and print min, max, range, IQR, and standard deviation for each column

Now that we understand how measures like range, IQR, and standard deviation describe variation, let's see what that spread looks like visually. Box plots are a simple way to compare how much values differ within and across variables.

In [None]:
import matplotlib.pyplot as plt

# Define numeric columns to visualize
cols = ["Age", "SibSp", "Parch", "Fare", "Pclass"]

#TODO - create a box plot for each feature


A box plot shows how data is distributed at a glance. The box itself represents the middle 50% of the data (from the 25th percentile (Q1) to the 75th percentile (Q3)) with the line inside marking the median. The "whiskers" extend to show the overall range of typical values, and any points beyond them are outliers. A taller box means more variability, while a shorter one means the data is more consistent.
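
The pieces of a box plot can also be computed by hand. The sketch below uses made-up values and assumes the common 1.5 × IQR rule for whiskers, which is matplotlib's default; other tools may draw whiskers differently.

```python
import numpy as np

# Hypothetical toy values with one unusually large point
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

# The box: Q1, median, and Q3
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences: points outside these limits are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers reach the most extreme data points still inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Box: {q1} to {q3}, median {median}")
print(f"Whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```

Here the box spans 3.25 to 7.75, the whiskers stop at 1 and 9, and the value 30 is flagged as an outlier.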

## 💭 Reflection: Variation and Spread

**Which variable shows the most variation, and what does that tell us about conditions aboard the Titanic?**

TODO - answer

**Which variables show the least variation, and what might that suggest about passenger demographics?**

TODO - answer

# STAT1.1.4 - Correlation and Scatter Plots

Next, let's explore how different numeric features in the Titanic dataset relate to one another.
We'll calculate a correlation matrix using the Pearson correlation coefficient.
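
As a quick illustration of what a Pearson correlation matrix looks like, here is a minimal sketch on a made-up dataframe (the column names `x`, `y_up`, and `y_down` are hypothetical). `DataFrame.corr` uses the Pearson method by default; it is passed explicitly here for clarity.

```python
import pandas as pd

# Hypothetical toy data: one column rises with x, the other falls
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y_up": [2, 4, 6, 8, 10],
    "y_down": [10, 8, 6, 4, 2],
})

# Each cell holds the correlation between the row and column variables
corr_matrix = toy.corr(method="pearson")
print(corr_matrix)
```

Values near +1 mean two columns rise together, values near -1 mean one falls as the other rises, and values near 0 mean little linear relationship.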

In [None]:

# Select numeric columns to explore
cols = ["Survived", "Age", "SibSp", "Fare", "Parch", "Pclass"]

# Display the correlation matrix
# TODO - display the correlation matrix

### 💭 Reflection: Interpreting Correlations

**Fill in the following table to indicate how strongly each variable correlates with `Survived`.**

|Variable|Direction of Correlation|Magnitude of Correlation|Interpretation|
|:-|:-|:-|:-|
|Age| TODO - direction | TODO - magnitude | TODO-interpretation|
|Passenger Class| TODO - direction | TODO - magnitude | TODO-interpretation|
|Siblings/Spouses Aboard| TODO - direction | TODO - magnitude | TODO-interpretation|
|Parents/Children Aboard| TODO - direction | TODO - magnitude | TODO-interpretation|
|Fare| TODO - direction | TODO - magnitude | TODO-interpretation|


**Which variable has the strongest relationship with survival, and what might explain it?**

TODO - answer

**How does `Fare` relate to `Survived`, and why does this make sense historically?**

TODO - answer

**Why do variables like Age, SibSp, and Parch show such weak correlations with survival?**

TODO - answer

# STAT1.2.1 - Cleaning data

In [None]:
df.info()

The result shows that the following columns contain some missing values:


*   **Age**: 177 rows missing (20% of rows)
*   **Embarked**: 2 rows missing (0.2% of rows)
*   **Cabin**: 687 rows missing (77% of rows)

## Handling Missing Age Values

For `Age`, we can fill in the missing values with the median age of the overall dataset.

Using the median to fill in missing data is usually a safe bet for a few reasons:
*   It is resistant to outliers.
*   It keeps the center of the age distribution where it was, although it does add a spike at the median and slightly reduces the spread.
*   It keeps the dataset usable without biasing survival predictions too much. Dropping 20% of rows would shrink our training data and might remove meaningful patterns; filling them with the median is a low-distortion compromise.

## Handling Missing Embarked Values
The `Embarked` field contains the port that passengers embarked from. Since only 2 rows are missing this information, we could either drop them, or fill them in with the most common port. We'll do the latter.

## Handling Missing Cabin Values
The majority (77%) of these values are missing. With so little real data to work from, any fill-in value would be mostly guesswork, so it's best to drop this column altogether.
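
The three strategies above can be sketched on a tiny made-up dataframe. The column names mirror the Titanic columns, but the values are hypothetical, and this is only one reasonable way to write each step.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the Titanic data) with gaps in each column
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "S", None, "C"],
    "Cabin": [None, "C85", None, None],
})

# Fill missing ages with the median of the known ages
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Fill missing ports with the most common port (mode)
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Drop the mostly-empty column entirely
toy = toy.drop(columns=["Cabin"])

print(toy)
print(toy.isna().sum())
```

Assigning the result back to the column, rather than using `inplace=True`, tends to be the less surprising pattern in recent pandas versions.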



In [None]:
# Fill missing Age values with the median age
# TODO - fill as described

In [None]:
# Fill missing Embarked values with the most common port (mode)
# TODO - fill as described

In [None]:
# Drop the Cabin column since most values are missing
# TODO - drop as described

In [None]:
# re-run df.info() and see that the columns have updated as we expect.
df.info()

# STAT1.2.2 - Feature Engineering

Use the existing `SibSp` and `Parch` columns in the dataframe to engineer a new column, or feature, called `FamilySize` that captures the total number of family members onboard for a given passenger.
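
Creating a column from two existing ones is a single vectorized addition in pandas. The sketch below uses made-up counts (not the Titanic data) and follows the plain sum of the two columns; whether to add 1 so the passenger is counted too is a design choice this notebook does not specify.

```python
import pandas as pd

# Hypothetical toy counts (not the Titanic data)
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Row-by-row sum of the two columns becomes the new feature
toy["FamilySize"] = toy["SibSp"] + toy["Parch"]

print(toy)
```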

In [None]:
# Add the columns `Parch` and `SibSp` and assign them to a
# new column called `FamilySize`
# TODO - create new column as described

# Re-run df.info() to confirm the new column has been created
df.info()

In [None]:
# Compute the mean, median, mode, range, IQR,
# and standard deviation for the new column

# TODO - do as described above
# (hint, you already did something like this earlier in the notebook!)

# ML1.1.1 - Linear Regression

Train and evaluate a linear regression model that predicts a passenger's `Fare` from their `Age`.
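
Before working with the real data, it can help to see the whole workflow on synthetic data where the answer is known. The sketch below builds points that lie exactly on the line y = 2x + 1 (made-up values, not the Titanic data), so the fitted coefficient and intercept should come out very close to 2 and 1. The `random_state` value is arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data lying exactly on y = 2x + 1
X = np.arange(20, dtype=float).reshape(-1, 1)  # scikit-learn expects 2D features
y = 2.0 * X.ravel() + 1.0

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training rows, then predict the held-out rows
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Compare predictions with the true held-out values
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Coefficient: {model.coef_[0]:.3f}, Intercept: {model.intercept_:.3f}")
print(f"MSE: {mse:.6f}, R-squared: {r2:.3f}")
```

Real data will not fit this cleanly, so expect a larger MSE and a lower R-squared on the Titanic columns.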


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# TODO: Fill out the below sections, following the steps described.

# --- 1. Define features (X) and target (y) ---


# --- 2. Split into training and testing sets (80% train, 20% test) ---


# --- 3. Train the model ---


# --- 4. Make predictions on test set ---


# --- 5. Evaluate performance ---
# Print coefficient, intercept, MSE, and R-squared score


# --- 6. Visualize results ---



Next, let's add an additional independent variable / input feature: Pclass. Your code will look very similar, but will have two features in the input array rather than just one.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# TODO: Fill out the below sections, following the steps described.


# --- 1. Define features (X) and target (y) ---


# --- 2. Split into training and testing sets ---


# --- 3. Train model ---


# --- 4. Predict on test set ---


# --- 5. Evaluate performance ---


# --- 6. Visualize (optional) ---
# Here, we'll plot how well the model predicts fares for different Ages,
# using one color per Pclass to make the trend easier to see.



## 💭 Reflection Questions
**Based on your findings, how would you describe the relative correlation of Age and Pclass with Fare? Does one seem to have more potential than the other to predict a given passenger's fare? Why might that be?**

TODO - answer

# ML1.2.1 - Logistic Regression

In this exercise, you'll train a logistic regression model to predict whether a passenger survived the Titanic disaster.

You'll go step-by-step, selecting features, training the model, and evaluating its performance.

## Step 1: Select features and encode categorical data
Logistic regression only works with numeric features.
We'll convert `Sex` and `Embarked` into numeric form using [one-hot encoding](https://www.geeksforgeeks.org/machine-learning/ml-one-hot-encoding/).
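
One common way to one-hot encode in pandas is `pd.get_dummies`. The sketch below applies it to a made-up dataframe (not the Titanic data). Passing `drop_first=True` is an assumption made here to avoid redundant columns; it is a design choice, not something this notebook requires.

```python
import pandas as pd

# Hypothetical toy rows (not the Titanic data)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Age": [22.0, 38.0, 26.0],
})

# Each category becomes its own indicator column; drop_first=True
# removes the first category of each column, since it can be
# inferred from the remaining ones
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True)

print(encoded)
```

The result keeps `Age` unchanged and adds `Sex_male`, `Embarked_Q`, and `Embarked_S`; a row with none of the `Embarked_*` indicators set corresponds to the dropped category `C`.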

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import pandas as pd

# TODO: Fill out the below sections, following the steps described.

# --- 1. Define features and target ---


# --- 2. One-hot encode categorical columns ---


# --- 3. Split into training and testing sets ---



## Step 2: Train and evaluate the logistic regression model
Train the model and view coefficients, then evaluate model performance against test data.
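
The training and evaluation steps follow the same pattern as linear regression. Here is a minimal sketch on synthetic data (not the Titanic data) in which the label is 1 whenever the single feature is 20 or higher. The `max_iter`, `random_state`, and `stratify` settings are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data: label is 1 when the feature is 20 or higher
X = np.arange(40, dtype=float).reshape(-1, 1)
y = (X.ravel() >= 20).astype(int)

# stratify=y keeps the class balance the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the classifier and predict the held-out labels
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Coefficient: {clf.coef_[0][0]:.3f}, Intercept: {clf.intercept_[0]:.3f}")
print(f"Accuracy: {accuracy:.2f}")
```

A positive coefficient means that larger feature values push the prediction toward class 1; a negative coefficient pushes it toward class 0.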

In [None]:
# TODO: Fill out the below sections, following the steps described.

# --- 4. Train the logistic regression model ---


# --- 5. Make predictions on the test set ---


# --- 6. Evaluate model performance ---


# --- 7. Examine feature influence on survival ---


# Optional: Visualization



## 💭 Reflection Questions
**What percentage of passengers in the test dataset does the model correctly classify? (accuracy)**

TODO - answer


**When the model predicts survival, how frequently is it right? (precision)**

TODO - answer


**What percentage of true survivors does the model correctly catch? (recall)**

TODO - answer


**Overall, how well does the model balance precision and recall when predicting `Survived`? (F1)**

TODO - answer
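
To see how the four metrics relate, here is a worked example using made-up confusion-matrix counts (not results from the Titanic model). The counts are hypothetical and chosen only to keep the arithmetic simple.

```python
# Hypothetical counts for 10 test passengers
tp = 3  # predicted survived, actually survived
fp = 1  # predicted survived, actually did not
fn = 2  # predicted did not survive, actually survived
tn = 4  # predicted did not survive, actually did not

# Accuracy: share of all predictions that were correct
accuracy = (tp + tn) / (tp + fp + fn + tn)

# Precision: of the predicted survivors, how many really survived
precision = tp / (tp + fp)

# Recall: of the real survivors, how many the model found
recall = tp / (tp + fn)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.2f}, Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}, F1: {f1:.2f}")
```

With these counts, accuracy is 0.70, precision is 0.75, recall is 0.60, and F1 is about 0.67, which sits between precision and recall but closer to the lower of the two.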

