# Survival Rate of Passengers on the RMS Titanic

Let's try to predict which passengers survived the sinking of the RMS Titanic using our random forest model. The Titanic dataset contains passenger information from the ship's maiden voyage in 1912: age, sex, ticket class, number of siblings/spouses aboard, fare paid, and whether the passenger survived. We can use it to build a model that predicts a passenger's survival from these demographic and travel characteristics.

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer
# Import our local Random Forest and some helper functions
from randomforest import RandomForest, train_test_split, accuracy


# Download the Titanic dataset and preview the first 10 rows
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)
df.head(10)

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
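
The preview already shows gaps in `Age` and `Cabin`, so it is worth counting missing values before modeling. The sketch below uses a small hand-built frame (three rows copied from the preview, with most columns dropped) so that it runs without a download; the name `sample` is only illustrative, and on the real data the same call would be `df.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Three rows copied from the preview (PassengerId 5, 6, 7), columns trimmed
sample = pd.DataFrame({
    "PassengerId": [5, 6, 7],
    "Age": [35.0, np.nan, 54.0],
    "Cabin": [np.nan, np.nan, "E46"],
    "Embarked": ["S", "Q", "S"],
})

# Count missing values per column
missing_counts = sample.isna().sum()
print(missing_counts)
```

Columns with many gaps (such as `Cabin` in these rows) are usually candidates for dropping, while columns with a few gaps can be imputed.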


### Some interesting statistics regarding the survival rate of passengers

In [13]:
# Survival rate by passenger class
class_survival = df.groupby('Pclass')['Survived'].mean()
print("Survival Rates by Passenger Class:")
print(class_survival)

# Survival rate by sex
sex_survival = df.groupby('Sex')['Survived'].mean()
print("\nSurvival Rates by Sex:")
print(sex_survival)

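# Survival rate by age range.
# pd.cut bins are right-inclusive: (0, 12], (12, 18], (18, 35], (35, 60], (60, 100].
# Passengers with a missing Age fall into no bin and are left out of the group means.
df['AgeGroup'] = pd.cut(
    df['Age'],
    bins=[0, 12, 18, 35, 60, 100],
    labels=['Child', 'Teenager', 'Young Adult', 'Adult', 'Senior'])
age_survival = df.groupby('AgeGroup', observed=False)['Survived'].mean()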
print("\nSurvival Rates by Age Group:")
print(age_survival)

Survival Rates by Passenger Class:
Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64

Survival Rates by Sex:
Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

Survival Rates by Age Group:
AgeGroup
Child          0.579710
Teenager       0.428571
Young Adult    0.382682
Adult          0.400000
Senior         0.227273
Name: Survived, dtype: float64
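
To make the age grouping concrete, here is a small sketch of how `pd.cut` assigns ages to the bins used above. The ages in `ages` are made-up illustration values chosen to sit on the bin edges, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Illustrative ages, including bin edges and one missing value
ages = pd.Series([0.42, 12, 12.5, 18, 35, 60, 80, np.nan])

groups = pd.cut(
    ages,
    bins=[0, 12, 18, 35, 60, 100],
    labels=['Child', 'Teenager', 'Young Adult', 'Adult', 'Senior'],
)

# Each bin includes its right edge, so age 12 is a 'Child' and 12.5 a 'Teenager'
print(pd.DataFrame({"Age": ages, "AgeGroup": groups}))
```

Because the bins are right-inclusive, a passenger aged exactly 18 counts as a teenager and one aged exactly 60 as an adult. A missing age gets no group at all, which is why such passengers do not contribute to the age-group survival rates.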


Next, we clean and preprocess the dataset before training our random forest model.

In [11]:
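def preprocess_titanic_data(df):
    """Return a scaled feature matrix X and a label array y from the raw Titanic frame."""
    # Select features and target
    features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
    X = df[features].copy()
    y = df['Survived']

    # Encode categorical variables as integers, using a fresh encoder per column
    X['Sex'] = LabelEncoder().fit_transform(X['Sex'])
    # Fill missing ports of embarkation with the most common port before encoding
    embarked = X['Embarked'].fillna(X['Embarked'].mode()[0])
    X['Embarked'] = LabelEncoder().fit_transform(embarked)

    # Impute remaining missing values (e.g. Age) with the column median
    imputer = SimpleImputer(strategy='median')
    X = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)

    # Standardize features to zero mean and unit variance.
    # Note: the imputer and scaler are fit on the full dataset, before the train/test split.
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    return X, y.values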

# Prepare data for random forest
X, y = preprocess_titanic_data(df)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
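
One caveat with the function above: it fits the imputer and scaler on the full dataset before splitting, so statistics from the test rows leak into the training features. Below is a minimal sketch of a leakage-free alternative. It swaps in scikit-learn's own `train_test_split`, `ColumnTransformer`, and `OrdinalEncoder` instead of the notebook's local helpers, drops the scaling step (tree splits do not depend on feature scale), and runs on the ten preview rows rather than the full download. The names `toy`, `preprocess`, `X_train_p`, and `X_test_p` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# The ten preview rows, restricted to the model features and the target
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
    "Sex": ["male", "female", "female", "female", "male",
            "male", "male", "male", "female", "female"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0],
    "SibSp": [1, 1, 0, 1, 0, 0, 0, 3, 0, 1],
    "Parch": [0, 0, 0, 0, 0, 0, 0, 1, 2, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625, 21.075, 11.1333, 30.0708],
    "Embarked": ["S", "C", "S", "S", "S", "Q", "S", "S", "S", "C"],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
})

numeric = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
categorical = ["Sex", "Embarked"]

preprocess = ColumnTransformer([
    # Median-impute numeric columns
    ("num", SimpleImputer(strategy="median"), numeric),
    # Fill missing categories with the most frequent value, then integer-encode;
    # categories unseen during fit are mapped to -1 instead of raising an error
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
    ]), categorical),
])

X_raw = toy[numeric + categorical]
y_raw = toy["Survived"].values

# Split first, then fit the preprocessing on the training rows only
X_train_raw, X_test_raw, y_train_toy, y_test_toy = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42)
X_train_p = preprocess.fit_transform(X_train_raw)
X_test_p = preprocess.transform(X_test_raw)

print(X_train_p.shape, X_test_p.shape)
```

The key design choice is the order of operations: `fit_transform` sees only the training rows, and the test rows are transformed with the medians and category mappings learned from them.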

Now we can train our random forest model and evaluate it on the held-out test set.

In [12]:
# Train random forest
rf = RandomForest(n_trees=100, max_depth=10, n_jobs=-1)
rf.fit(X_train, y_train)

# Evaluate
predictions = rf.predict(X_test)
print(f"Accuracy: {accuracy(y_test, predictions):.4f}")

Accuracy: 0.8371
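
An accuracy figure is easier to interpret next to a trivial baseline that always predicts the most common training label. The sketch below assumes that the local `accuracy` helper computes the fraction of matching labels, which is the usual definition but is not shown in this notebook. The function names and the demo arrays are hypothetical.

```python
import numpy as np

def accuracy_sketch(y_true, y_pred):
    """Fraction of predictions that match the true labels (assumed definition)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    values, counts = np.unique(y_train, return_counts=True)
    majority = values[np.argmax(counts)]
    return accuracy_sketch(y_test, np.full(len(y_test), majority))

# Hypothetical labels for illustration only
y_train_demo = np.array([0, 0, 0, 1, 1])
y_test_demo = np.array([0, 1, 0, 1])
y_pred_demo = np.array([0, 1, 1, 1])

model_acc = accuracy_sketch(y_test_demo, y_pred_demo)
baseline_acc = majority_baseline_accuracy(y_train_demo, y_test_demo)
print(f"Model accuracy:    {model_acc:.4f}")
print(f"Baseline accuracy: {baseline_acc:.4f}")
```

On the real split, the equivalent call would be `majority_baseline_accuracy(y_train, y_test)`; the gap between that number and the model's accuracy shows how much the forest adds over guessing the majority class.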
