## Random Forests and Bagged and Boosted Decision Trees 
### Predicting Survival with the Titanic Dataset

Bronwyn Bowles-King

### 1. Introduction

This Jupyter notebook explores the use of random forest, bagged and boosted tree models to predict the survival of passengers in the sinking of the Titanic, a well-known maritime disaster that occurred in 1912. The dataset contains demographic and ticket details for each passenger, which may have influenced their chances of survival.

In this case, I am concerned with demonstrating the machine learning process, fine-tuning random forest, bagged and boosted models, and finding the most accurate and efficient option. The notebook covers data preparation, model training, and evaluation of model accuracy.

A basic random forest is first produced with default settings, and the features that contribute most to predicting whether a passenger survives are identified. The modelling process is then run again with different n_estimators (number of trees in the model) and max_depth (maximum depth of each tree) values, and the accuracy of each combination is evaluated.

In the last two sections, the process is repeated with the ensemble methods known as bagged and boosted tree models and the model performance is assessed. 

### 2. Preparation steps

#### 2.1 Import libraries and load dataset

In [1]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, AdaBoostClassifier

titanic_df = pd.read_csv("titanic.csv")

#### 2.2 Check for duplicates and missing values and impute with the median

The dataset is checked for duplicate rows and missing values, and the missing values in the Age column are then replaced with the median age. The Cabin column, which is mostly empty, is dropped in section 3.1.

In [2]:
print(f'Duplicate rows: {titanic_df.duplicated().sum()}')

print(f'\nMissing values by column: \n{titanic_df.isnull().sum()}')  

titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].median())

Duplicate rows: 0

Missing values by column: 
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
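
As a minimal illustration of what median imputation does, the sketch below fills the gaps in a small, made-up list of ages using only the standard library. The `ages` values and the `impute_median` helper are hypothetical and are not taken from the Titanic data. The notebook itself relies on pandas, whose `median` likewise ignores missing values by default.

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical ages with two missing entries (None)
ages = [22.0, None, 38.0, 26.0, None, 35.0]
imputed = impute_median(ages)
print(imputed)
```

The median is computed from the four observed ages only, so both gaps receive the same value.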


### 3. Create baseline random forest model 

#### 3.1 Isolate and prepare features and target for prediction

The features for prediction are gender ('Sex'), passenger class ('Pclass'), the fare paid ('Fare'), age in years ('Age'), number of family relations on the ship ('SibSp' and 'Parch'), and the port where the person boarded the ship ('Embarked'). Together, these are called 'features' and they form the X variable. Some of them will prove more useful in prediction than others. The target (y) to be predicted is the 'Survived' column.

The features and target are assigned below, and categorical columns are converted to boolean indicator columns with the pandas function get_dummies. This is necessary because scikit-learn estimators require numeric input.

In [3]:
X = titanic_df.drop(['Survived', 'Name', 'PassengerId', 'Ticket', 'Cabin'], axis=1)

X = pd.get_dummies(X)  # Categorical columns converted to numeric or boolean

print(X.dtypes)  # Check data types for ML suitability

y = titanic_df['Survived']

Pclass          int64
Age           float64
SibSp           int64
Parch           int64
Fare          float64
Sex_female       bool
Sex_male         bool
Embarked_C       bool
Embarked_Q       bool
Embarked_S       bool
dtype: object
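
To make the encoding step concrete, here is a small sketch of one-hot encoding written in plain Python. The `one_hot` helper and the four example values are hypothetical; the notebook uses pandas get_dummies, and this sketch only mimics the naming pattern seen in the output above.

```python
def one_hot(values, prefix):
    """Return one boolean column per category, named prefix_category."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{cat}": [v == cat for v in values]
        for cat in categories
    }

# Hypothetical embarkation codes for four passengers
embarked = ["S", "C", "S", "Q"]
encoded = one_hot(embarked, "Embarked")
for name, column in encoded.items():
    print(name, column)
```

Each passenger has exactly one True value across the three columns. For a two-category column such as Sex, the two resulting columns mirror each other, so they carry the same information.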


#### 3.2 Split the data into training and test sets

The data is split into a training set of 80% and a test set of 20% and the final size of the sets is printed to show the number of samples that the model will be working with in each case. 

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {len(X_train)} samples")  
print(f"Test set size: {len(X_test)} samples") 

Training set size: 712 samples
Test set size: 179 samples
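
The set sizes can be checked by hand. The sketch below assumes that a fractional test_size is rounded up to a whole number of samples and that the remainder goes to the training set, which is consistent with the sizes printed above. The total of 891 rows is simply the sum of those two sizes.

```python
import math

n_samples = 712 + 179   # total rows, from the sizes printed above
test_size = 0.2

# Assumption: the test set size is rounded up, the rest is used for training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```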


#### 3.3 Fit the model on training data

A random forest (rf) is instantiated below with default settings and then fitted on the training data. A random_state is set so that the results are reproducible when the code is run more than once. This baseline model is then used to make predictions on the test data, and its accuracy is printed. The baseline random forest model performs well with an accuracy of 0.804.

In [5]:
rf = RandomForestClassifier(random_state=42)

rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)

print(f"Baseline random forest accuracy: {accuracy_score(y_test, y_pred):.3f}")

Baseline random forest accuracy: 0.804
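
Accuracy is the fraction of test passengers whose outcome is predicted correctly. As a rough check, the sketch below works backwards from the rounded figure to the number of correct predictions it implies on a test set of 179. This is an inference from the printed output, not a value reported by the notebook.

```python
n_test = 179  # test set size from section 3.2

# Find every count of correct predictions that would print as 0.804
matches = [
    correct for correct in range(n_test + 1)
    if f"{correct / n_test:.3f}" == "0.804"
]
print(matches)
```

Only one count is consistent with the printed accuracy, so a single additional correct or incorrect prediction would change the third decimal place.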


### 4. Determine features contributing most to passenger survival and prediction accuracy

The RandomForestClassifier in sklearn exposes a feature_importances_ attribute, which scores how much each feature contributes to the splits made by the trees in the forest. The code below prints the score for each feature. The higher the score, the more the model relies on that feature. The scores do not show whether a feature raises or lowers the chance of survival.

Fare has the highest score of any single column. This is consistent with accounts that wealthier people in the upper passenger classes were able to board rescue boats first because their cabins were closer to the boats, although the importance score on its own does not establish the direction of the effect.

Age is the second-most important column, followed by gender, and passenger class, which is related to the fare paid. Because get_dummies splits gender into two columns (Sex_male and Sex_female), its importance is divided between them, and their combined score (0.286) is higher than that of Fare. The importance of age and gender is also not surprising, as women and children were given preference in boarding the rescue boats. The number of family relations and the place of embarkation have little influence on the predictions.

In [6]:
importances = rf.feature_importances_
feature_names = X.columns

indices = np.argsort(importances)[::-1]  # Sort indices

for index in indices:
    print(f"{feature_names[index]}: {importances[index]:.3f}")

Fare: 0.262
Age: 0.243
Sex_male: 0.164
Sex_female: 0.122
Pclass: 0.086
SibSp: 0.050
Parch: 0.037
Embarked_S: 0.014
Embarked_C: 0.014
Embarked_Q: 0.007
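
Since one-hot encoding spreads a categorical feature over several columns, it can help to add the scores back together per original feature. The sketch below does this with the values copied from the output above. Grouping by the text before the underscore is an assumption that happens to fit these column names.

```python
# Importances copied from the output above
importances = {
    "Fare": 0.262, "Age": 0.243, "Sex_male": 0.164, "Sex_female": 0.122,
    "Pclass": 0.086, "SibSp": 0.050, "Parch": 0.037,
    "Embarked_S": 0.014, "Embarked_C": 0.014, "Embarked_Q": 0.007,
}

# Group dummy columns by the part of the name before the underscore
grouped = {}
for column, score in importances.items():
    original = column.split("_")[0]
    grouped[original] = grouped.get(original, 0.0) + score

for name, score in sorted(grouped.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Grouped this way, gender has the highest combined score, ahead of fare and age.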


### 5. Test and tune the random forest model parameters 

The number of trees in the model (n_estimators) and the maximum depth of each tree (max_depth, the largest number of successive splits from the root to a leaf) are parameters that can be adjusted to improve the model outcomes. GridSearchCV is used here to test various n_estimators and max_depth values and find the optimum ones.

Values from 2 to 9 for max_depth and from 50 to 200 trees for n_estimators were checked. For simplicity, only the optimum and the neighbouring values on either side of it are shown below.

The results show that the best parameters are a max_depth of 6 and 130 trees (n_estimators). With these parameters, the mean cross-validation (CV) accuracy is 0.829, which is 0.025 higher than the baseline random forest accuracy (0.804) (section 3.3). Note that the CV accuracy is measured on folds of the training data, while the baseline accuracy is measured on the test set, so the two figures are not strictly comparable.

The accuracy for all the combinations of parameters tested is also displayed below to give a sense of how much the model changes with each increment in max_depth and n_estimators.

The same process will be followed in the next two sections to test two other models, the bagged and boosted tree models.

In [7]:
param_grid = {
    'n_estimators': [120, 130, 140],
    'max_depth': [5, 6, 7]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy'
)
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV accuracy: {grid_search.best_score_:.3f}")

Best parameters: {'max_depth': 6, 'n_estimators': 130}
Best CV accuracy: 0.829
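
To give a sense of the work a grid search does, the sketch below enumerates the candidate combinations for the grid above and counts the model fits needed with 5-fold cross-validation. It uses only the standard library and is an illustration of the idea rather than how GridSearchCV is implemented. With default settings, GridSearchCV also refits the best candidate once on the full training set.

```python
from itertools import product

param_grid = {
    "n_estimators": [120, 130, 140],
    "max_depth": [5, 6, 7],
}
cv_folds = 5

# Every combination of the listed values is one candidate model
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_fits = len(candidates) * cv_folds
print(len(candidates), n_fits)
```

The count grows multiplicatively with each extra parameter value, which is one reason to search a coarse range first and then narrow it, as was done here.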


In [8]:
results = pd.DataFrame(grid_search.cv_results_)

print("Random forest model accuracy results:")
for _, row in results.iterrows():
    print(f"Params: {row['params']} | CV accuracy: {row['mean_test_score']:.3f}")

Random forest model accuracy results:
Params: {'max_depth': 5, 'n_estimators': 120} | CV accuracy: 0.822
Params: {'max_depth': 5, 'n_estimators': 130} | CV accuracy: 0.822
Params: {'max_depth': 5, 'n_estimators': 140} | CV accuracy: 0.823
Params: {'max_depth': 6, 'n_estimators': 120} | CV accuracy: 0.826
Params: {'max_depth': 6, 'n_estimators': 130} | CV accuracy: 0.829
Params: {'max_depth': 6, 'n_estimators': 140} | CV accuracy: 0.823
Params: {'max_depth': 7, 'n_estimators': 120} | CV accuracy: 0.817
Params: {'max_depth': 7, 'n_estimators': 130} | CV accuracy: 0.822
Params: {'max_depth': 7, 'n_estimators': 140} | CV accuracy: 0.823
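
The differences between these combinations are small. The sketch below ranks the results copied from the output above and computes the gap between the best and worst combination. The tuple layout is my own choice for illustration.

```python
# (max_depth, n_estimators, CV accuracy) copied from the output above
cv_results = [
    (5, 120, 0.822), (5, 130, 0.822), (5, 140, 0.823),
    (6, 120, 0.826), (6, 130, 0.829), (6, 140, 0.823),
    (7, 120, 0.817), (7, 130, 0.822), (7, 140, 0.823),
]

ranked = sorted(cv_results, key=lambda r: r[2], reverse=True)
best, worst = ranked[0], ranked[-1]
spread = round(best[2] - worst[2], 3)

print(best, worst, spread)
```

A spread of about 0.012 across the whole grid suggests that, in this narrow range, the model is not very sensitive to the exact parameter values.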


### 6. Create a bagged tree model, test accuracy and fine-tune parameters

In this section, an ensemble approach is taken with bootstrap aggregating, also known as bagging. It involves training multiple decision trees on different bootstrap samples of the training data (random samples drawn with replacement) and then combining their predictions by voting or averaging. This reduces variance and helps prevent overfitting.
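
The two ingredients of bagging can be sketched in a few lines of plain Python: drawing a bootstrap sample and combining the trees' predictions by majority vote. The helper names and the example votes are hypothetical, and the sketch leaves out the tree training itself, which BaggingClassifier handles in the notebook.

```python
import random
from collections import Counter

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices at random, with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def majority_vote(predictions):
    """Return the class predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
n_samples = 712  # size of the training set in this notebook

sample = bootstrap_indices(n_samples, rng)
unique_fraction = len(set(sample)) / n_samples

# Hypothetical predictions from five trees for one passenger
votes = [1, 0, 1, 1, 0]
print(f"{unique_fraction:.2f}", majority_vote(votes))
```

Because sampling is done with replacement, each tree typically sees only around two thirds of the distinct training rows, which is what makes the trees differ from one another.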

A similar process is followed below whereby a baseline model is created, the model accuracy is checked, and the values for n_estimators and max_depth are adjusted. With the baseline bagged model, the accuracy is 0.765, lower than the baseline random forest above (0.804), so bagging does not yet show an improvement for this data.

However, after tuning, the bagged model reaches a CV accuracy of 0.826, close to that of the tuned random forest model from section 5 (0.829), with a lower max_depth of 5 and the same number of trees (130).

In [9]:
# Baseline bagged tree 
bagged = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),
    random_state=42
)
bagged.fit(X_train, y_train)
bagged_acc = bagged.score(X_test, y_test)
print(f"Baseline bagged tree accuracy: {bagged_acc:.3f}")

Baseline bagged tree accuracy: 0.765


In [10]:
# Test parameters 
bag_param_grid = {
    'n_estimators': [120, 130, 140],
    'estimator__max_depth': [4, 5, 6]
}

bagging = BaggingClassifier(estimator=DecisionTreeClassifier(random_state=42), random_state=42)
bag_grid = GridSearchCV(bagging, bag_param_grid, cv=5, scoring='accuracy')
bag_grid.fit(X_train, y_train)

print("Best bagging params:", bag_grid.best_params_)
print(f"Best bagging CV accuracy: {bag_grid.best_score_:.3f}")

Best bagging params: {'estimator__max_depth': 5, 'n_estimators': 130}
Best bagging CV accuracy: 0.826


In [11]:
# Compare accuracy results
bag_results = pd.DataFrame(bag_grid.cv_results_)
print("Bagging results:")
for _, row in bag_results.iterrows():
    print(f"Params: {row['params']}, CV accuracy: {row['mean_test_score']:.3f}")

Bagging results:
Params: {'estimator__max_depth': 4, 'n_estimators': 120}, CV accuracy: 0.822
Params: {'estimator__max_depth': 4, 'n_estimators': 130}, CV accuracy: 0.824
Params: {'estimator__max_depth': 4, 'n_estimators': 140}, CV accuracy: 0.822
Params: {'estimator__max_depth': 5, 'n_estimators': 120}, CV accuracy: 0.824
Params: {'estimator__max_depth': 5, 'n_estimators': 130}, CV accuracy: 0.826
Params: {'estimator__max_depth': 5, 'n_estimators': 140}, CV accuracy: 0.826
Params: {'estimator__max_depth': 6, 'n_estimators': 120}, CV accuracy: 0.823
Params: {'estimator__max_depth': 6, 'n_estimators': 130}, CV accuracy: 0.823
Params: {'estimator__max_depth': 6, 'n_estimators': 140}, CV accuracy: 0.822


### 7. Create a boosted tree model, test accuracy and fine-tune parameters

Boosting is now used with the decision tree model as a different ensemble approach. With boosting, trees are built sequentially, and each new tree tries to correct the errors of the previous ones by giving more weight to misclassified samples. Here, the Adaptive Boosting (AdaBoost) algorithm is applied to combine weaker 'learners' into a stronger one. Trees are added until the specified number of estimators (n_estimators) is reached or the training data is fitted perfectly.
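
The reweighting idea can be illustrated with one round of the classic two-class AdaBoost update. This is a simplified sketch with a hypothetical helper name and toy inputs. It assumes the weighted error lies strictly between 0 and 1, and it omits details such as the learning rate that AdaBoostClassifier applies.

```python
import math

def adaboost_round(weights, misclassified):
    """One reweighting round of discrete AdaBoost for two classes.

    weights: current sample weights (summing to 1)
    misclassified: booleans, True where the current tree was wrong
    """
    # Weighted error of the current tree (assumed to be between 0 and 1)
    error = sum(w for w, miss in zip(weights, misclassified) if miss)
    alpha = math.log((1 - error) / error)  # influence of the current tree
    # Increase the weight of misclassified samples only
    updated = [
        w * math.exp(alpha) if miss else w
        for w, miss in zip(weights, misclassified)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Five samples with equal weight, of which the third is misclassified
weights = [0.2] * 5
misclassified = [False, False, True, False, False]
alpha, new_weights = adaboost_round(weights, misclassified)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After this round, the misclassified sample carries half of the total weight, so the next tree is pushed to classify it correctly.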

Despite the advantages of this model, the baseline boosted tree accuracy is 0.760, the lowest of the three baseline models. After tuning, the best CV accuracy for the AdaBoost model is 0.822, which is slightly lower than the tuned bagged model (0.826, section 6) by 0.004 and the tuned random forest (0.829, section 5) by 0.007.

However, GridSearchCV found that an accuracy of 0.822 can be achieved with a max_depth of only 3 and 120 trees (n_estimators). This suggests that this model can achieve high cross-validation accuracy with fewer and shallower trees than the other two models. This can be useful when trees are becoming overly complex with another method.

Overall, the AdaBoost model is the most efficient of the tuned models in terms of tree depth and number of trees, and the slightly lower accuracy score can be an acceptable trade-off between model complexity and performance.

In [12]:
# Baseline boosted tree with AdaBoost
boosted = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(random_state=42),
    random_state=42
)
boosted.fit(X_train, y_train)
boosted_acc = boosted.score(X_test, y_test)
print(f"Boosted tree accuracy: {boosted_acc:.3f}")

Boosted tree accuracy: 0.760


In [13]:
boost_param_grid = {
    'n_estimators': [110, 120, 130],
    'estimator__max_depth': [2, 3, 4]
}

# Test parameters
boosting = AdaBoostClassifier(estimator=DecisionTreeClassifier(random_state=42), random_state=42)
boost_grid = GridSearchCV(boosting, boost_param_grid, cv=5, scoring='accuracy')
boost_grid.fit(X_train, y_train)

print("Best AdaBoost params:", boost_grid.best_params_)
print(f"Best AdaBoost CV accuracy: {boost_grid.best_score_:.3f}")

Best AdaBoost params: {'estimator__max_depth': 3, 'n_estimators': 120}
Best AdaBoost CV accuracy: 0.822


In [14]:
# Compare accuracy results
boost_results = pd.DataFrame(boost_grid.cv_results_)

print("Boosting results:")
for _, row in boost_results.iterrows():
    print(f"Params: {row['params']}, CV accuracy: {row['mean_test_score']:.3f}")

Boosting results:
Params: {'estimator__max_depth': 2, 'n_estimators': 110}, CV accuracy: 0.810
Params: {'estimator__max_depth': 2, 'n_estimators': 120}, CV accuracy: 0.809
Params: {'estimator__max_depth': 2, 'n_estimators': 130}, CV accuracy: 0.808
Params: {'estimator__max_depth': 3, 'n_estimators': 110}, CV accuracy: 0.812
Params: {'estimator__max_depth': 3, 'n_estimators': 120}, CV accuracy: 0.822
Params: {'estimator__max_depth': 3, 'n_estimators': 130}, CV accuracy: 0.820
Params: {'estimator__max_depth': 4, 'n_estimators': 110}, CV accuracy: 0.799
Params: {'estimator__max_depth': 4, 'n_estimators': 120}, CV accuracy: 0.796
Params: {'estimator__max_depth': 4, 'n_estimators': 130}, CV accuracy: 0.791
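
To compare the three approaches side by side, the sketch below gathers the accuracies and best parameters reported in the outputs of this notebook and computes each tuned model's gap to the tuned AdaBoost model. The dictionary layout is my own. Note that the baseline figures are test-set accuracies while the tuned figures are cross-validation accuracies.

```python
# Figures copied from the outputs in sections 3, 5, 6 and the boosting section
results = {
    "random forest": {"baseline_test": 0.804, "tuned_cv": 0.829,
                      "max_depth": 6, "n_estimators": 130},
    "bagged trees": {"baseline_test": 0.765, "tuned_cv": 0.826,
                     "max_depth": 5, "n_estimators": 130},
    "AdaBoost": {"baseline_test": 0.760, "tuned_cv": 0.822,
                 "max_depth": 3, "n_estimators": 120},
}

reference = results["AdaBoost"]["tuned_cv"]
gaps = {name: round(r["tuned_cv"] - reference, 3) for name, r in results.items()}

for name, r in results.items():
    print(f"{name}: tuned CV {r['tuned_cv']:.3f} "
          f"(+{gaps[name]:.3f} vs AdaBoost), "
          f"depth {r['max_depth']}, trees {r['n_estimators']}")
```

The gaps of 0.007 and 0.004 are small compared with the reduction in tree depth from 6 or 5 to 3.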


### References

Arora, A. (2023). Random Forest: Exploration of Bagging and Ensemble. Medium. https://medium.com/@ashisharora2204/random-forest-exploration-of-bagging-and-ensemble-a08efa5f608c

Geeks for Geeks. (2025). Pandas DataFrame iterrows() Method. https://www.geeksforgeeks.org/pandas/pandas-dataframe-iterrows

Geeks for Geeks. (2025). Single and Double Underscores in Python. https://www.geeksforgeeks.org/python/single-aad-double-underscores-in-python

Geeks for Geeks. (2024). How to fit categorical data types for random forest classification? https://www.geeksforgeeks.org/machine-learning/how-to-fit-categorical-data-types-for-random-forest-classification

Hunt, G. (2016). Titanic Dataset Investigation. https://ghunt03.github.io/DAProjects/DAP02/TitanicDatasetInvestigation.html

HyperionDev. (2025). Supervised Learning – Random Forests. Course materials. Private repository, GitHub.

Navlani, A. (2018). AdaBoost Classifier in Python. DataCamp. https://www.datacamp.com/tutorial/adaboost-classifier-python

numpy. (2024). np.argsort. https://numpy.org/doc/2.2/reference/generated/numpy.argsort.html

Paul, S. (2018). Ensemble learning: Bagging, boosting, stacking and cascading classifiers in machine learning using SKLEARN and MLEXTEND libraries. Medium. https://medium.com/@saugata.paul1010/ensemble-learning-bagging-boosting-stacking-and-cascading-classifiers-in-machine-learning-9c66cb271674

scikit-learn. (2024). GridSearchCV. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

scikit-learn. (2024). RandomForestClassifier. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

scikit-learn. (2024). sklearn.ensemble. https://scikit-learn.org/stable/api/sklearn.ensemble.html#module-sklearn.ensemble

scikit-learn. (2024). train_test_split. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Steiger, T. (2017). Analysis of Titanic Survival Data. https://ttsteiger.github.io/projects/titanic_report.html