Titanic: Machine Learning from Disaster - Predict survival on the Titanic

For more information on the Kaggle competition, see https://www.kaggle.com/c/titanic

In [6]:
'''
Import necessary libraries

Note: First make sure you are in the base conda environment and that scikit-learn is installed (pip install scikit-learn).

Scikit-learn (sklearn) is a popular Python library used for predictive data analysis. It offers a wide range of supervised and unsupervised learning algorithms, 
including classification, regression, clustering, and dimensionality reduction. 
Sklearn is known for its clean API and seamless integration with NumPy and SciPy, making it a powerful tool for tasks like model fitting, preprocessing, and evaluation.
'''

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Adjust console presentation of output
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [7]:

'''
## Load the Data

Start by loading our Titanic dataset. This dataset includes various information about passengers on the Titanic, 
including whether or not they survived, their class, their age, their fare, and more. 
We'll use this information to try to predict whether each passenger survived.

In a real project, data might come from various sources and in various formats, and assembling it into a single dataset can be a significant part of the data preprocessing task.
'''

df = pd.read_csv('data/titanic_data.csv')

In [9]:
'''
## Initial Data Exploration

Now that we have loaded our data, the first step is to explore it and understand what we're working with. 
We'll check out the first few rows of our dataframe, look at the summary statistics, and see what data types we have.
'''

df.head()

Unnamed: 0.1,Unnamed: 0,passenger_id,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,home.dest
0,0,1,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,"St Louis, MO"
1,1,2,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,"Montreal, PQ / Chesterville, ON"
2,2,3,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,"Montreal, PQ / Chesterville, ON"
3,3,4,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,"Montreal, PQ / Chesterville, ON"
4,4,5,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,"Montreal, PQ / Chesterville, ON"


In [10]:
df.describe()

Unnamed: 0.1,Unnamed: 0,passenger_id,pclass,survived,age,sibsp,parch,fare
count,1309.0,1309.0,1309.0,1309.0,1046.0,1309.0,1309.0,1308.0
mean,654.0,655.0,2.294882,0.381971,29.881135,0.498854,0.385027,33.295479
std,378.020061,378.020061,0.837836,0.486055,14.4135,1.041658,0.86556,51.758668
min,0.0,1.0,1.0,0.0,0.1667,0.0,0.0,0.0
25%,327.0,328.0,2.0,0.0,21.0,0.0,0.0,7.8958
50%,654.0,655.0,3.0,0.0,28.0,0.0,0.0,14.4542
75%,981.0,982.0,3.0,1.0,39.0,1.0,0.0,31.275
max,1308.0,1309.0,3.0,1.0,80.0,8.0,9.0,512.3292


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0    1309 non-null   int64  
 1   passenger_id  1309 non-null   int64  
 2   pclass        1309 non-null   int64  
 3   survived      1309 non-null   int64  
 4   name          1309 non-null   object 
 5   sex           1309 non-null   object 
 6   age           1046 non-null   float64
 7   sibsp         1309 non-null   int64  
 8   parch         1309 non-null   int64  
 9   ticket        1309 non-null   object 
 10  fare          1308 non-null   float64
 11  cabin         295 non-null    object 
 12  embarked      1307 non-null   object 
 13  home.dest     745 non-null    object 
dtypes: float64(2), int64(6), object(6)
memory usage: 143.3+ KB


There are a number of variables within this dataset:
* pclass = Passenger class of travel.
* survived = 1 if the passenger survived the sinking, 0 if not.
* name = Full name of the passenger, including title.
* sex = Passenger gender.
* age = Passenger age.
* sibsp = Count of siblings or spouses also aboard.
* parch = Count of parents or children also aboard.
* ticket = Ticket reference.
* fare = Fare paid.
* cabin = Cabin number.
* embarked = Port of embarkation. (S = Southampton (UK); C = Cherbourg (France); Q = Queenstown (Cobh, Ireland))
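
A quick way to see which of these columns have gaps is to count missing values per column. The sketch below uses a tiny hypothetical frame (`sample`) with made-up values rather than the real file, so only the pattern carries over, not the numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Titanic frame; values are invented for illustration.
sample = pd.DataFrame({
    'pclass': [1, 3, 2, 3],
    'age': [29.0, np.nan, 40.0, np.nan],
    'cabin': ['B5', None, None, None],
    'embarked': ['S', 'C', None, 'Q'],
})

# Number of missing entries per column, largest first.
missing_counts = sample.isna().sum().sort_values(ascending=False)

# Fraction of rows missing per column.
missing_share = sample.isna().mean()

print(missing_counts)
print(missing_share)
```

On the real data, the same two calls on `df` would show at a glance that 'cabin' has far more gaps than 'age' or 'embarked'.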

In [14]:
'''
## Data Preprocessing and Feature Engineering

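In this basic example, we fill missing numerical values with the column median and then drop every row that still has a missing value in any column. 
Because 'cabin' is missing for most passengers, this step removes most of the rows.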
In a more complex scenario, you might use more sophisticated techniques to handle missing values, 
like filling in missing values based on other data or using a machine learning algorithm to predict them.

Feature engineering is the process of creating new features or modifying existing ones to improve model performance. 
In this example, we won't be doing any complex feature engineering, but it's an important step in many machine learning projects.
'''

numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

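# After filling the numeric columns, drop rows that still have missing data in any (non-numeric) column.
# Note: this also drops rows missing 'cabin' or 'home.dest', even though neither is used as a feature later.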
df.dropna(inplace=True)
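
The blanket `dropna` above discards any row with a gap in any column, including columns the model never uses. One hedged alternative, which is not what this notebook does, is to impute and drop only on the columns that feed the model. The sketch below uses a hypothetical five-row frame (`sample`) and hypothetical column lists to show the difference in surviving rows.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; 'cabin' is mostly missing, as in the real data.
sample = pd.DataFrame({
    'survived': [1, 0, 0, 1, 0],
    'age': [29.0, np.nan, 2.0, np.nan, 25.0],
    'fare': [211.3, 7.9, np.nan, 151.6, 8.1],
    'embarked': ['S', 'C', 'S', None, 'Q'],
    'cabin': ['B5', None, None, None, None],
})

# Assumed feature lists for this sketch.
numeric_features = ['age', 'fare']
categorical_features = ['embarked']

cleaned = sample.copy()

# Fill gaps in the numeric features with each column's median.
medians = cleaned[numeric_features].median()
cleaned[numeric_features] = cleaned[numeric_features].fillna(medians)

# Drop a row only when a feature we actually use is still missing.
cleaned = cleaned.dropna(subset=categorical_features)

rows_blanket = len(sample.dropna())   # drop on any missing value in any column
rows_targeted = len(cleaned)          # drop only on missing model features

print('Rows kept with blanket dropna:', rows_blanket)
print('Rows kept with targeted dropna:', rows_targeted)
```

Switching the notebook itself to this approach would change the row count and therefore the accuracy figures reported further down.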

In [15]:
'''
## Train-Test Split

Before we start building our model, we'll split our data into a training set and a test set. 
This allows us to evaluate our model's performance on unseen data, which gives us a sense of how well our model is likely to perform on new data in the future.

We're using a 70/30 split for our data, meaning 70% of our data will go to the training set and 30% will go to the test set.

We also need to handle the categorical variables in our features. In the Titanic dataset, 'sex' and 'embarked' are categorical variables. 
'sex' can be 'male' or 'female', and 'embarked' can be 'C', 'Q', or 'S'. 

To make these usable in our model, we use a technique called one-hot encoding, which creates new columns for each unique category in each categorical variable. 
For each record, the column corresponding to its category will have a value of 1, and all other created columns will have values of 0.

We perform one-hot encoding using the pandas function `get_dummies()`. 
When used on a DataFrame, `get_dummies()` only converts the object or category dtype columns, and leaves the numerical columns as they are. 
So in our case, it will create dummy variables for 'sex' and 'embarked', while leaving 'pclass', 'age', 'sibsp', 'parch', and 'fare' unchanged.

In the following code:
- `X` represents our features or independent variables. This is the data that the model will learn from.
- `y` represents our target variable or dependent variable, which in this case is 'survived'. This is the outcome we are trying to predict.

We use the function `train_test_split` from the sklearn library to create our training and testing sets. 
We set a random seed (random_state=42) so that the split is identical each time we run the code.
'''

# Selecting features and target variable
features = ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']
X = pd.get_dummies(df[features])
y = df['survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) # use a random seed so we can reproduce the results 
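
To make the effect of `get_dummies()` concrete, here is a small sketch on a hypothetical eight-row frame (`toy`). It also passes `stratify=` to `train_test_split`, an optional argument this notebook does not use, which keeps the class proportions the same in both splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; values are invented for illustration.
toy = pd.DataFrame({
    'pclass': [1, 3, 2, 3, 1, 2, 3, 3],
    'sex': ['female', 'male', 'male', 'female', 'male', 'female', 'male', 'female'],
    'embarked': ['S', 'C', 'Q', 'S', 'S', 'C', 'S', 'Q'],
    'survived': [1, 0, 0, 1, 0, 1, 0, 1],
})

# Numeric columns pass through unchanged; each category becomes its own column.
X_toy = pd.get_dummies(toy[['pclass', 'sex', 'embarked']])
y_toy = toy['survived']
print(list(X_toy.columns))
# ['pclass', 'sex_female', 'sex_male', 'embarked_C', 'embarked_Q', 'embarked_S']

# Stratified split: both classes appear in the same proportion in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)
print(len(X_tr), 'training rows,', len(X_te), 'test rows')
```

Each row has exactly one active column among the 'sex_*' columns and one among the 'embarked_*' columns.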

In [17]:
'''
## Model Selection and Training

Now that we've prepared our data, we can build our model. In this example, we're going to use Logistic Regression, which is a good starting point for binary classification problems like this one. 

In a more complex project, you might try several different models, compare their performance, and even combine them into an ensemble model. You might also use techniques like cross-validation to get a better estimate of your model's performance.
'''

model = LogisticRegression(max_iter=1000)  # default is 100
model.fit(X_train, y_train)

LogisticRegression(max_iter=1000)
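
The cross-validation mentioned above can be sketched as follows. This is a minimal illustration on synthetic data from `make_classification`, not on the Titanic frame; the names `X_demo`, `y_demo`, and `cv_model` are placeholders for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data so the example runs without the Titanic CSV.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)

cv_model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: train on 4 folds, score on the held-out fold, repeat.
scores = cross_val_score(cv_model, X_demo, y_demo, cv=5, scoring='accuracy')

print('Fold accuracies:', scores.round(3))
print('Mean accuracy:', round(scores.mean(), 3))
```

Averaging over several folds gives a steadier estimate than a single train/test split, which matters more when the dataset is small.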

In [21]:
'''
## Model Evaluation

Finally, we're going to evaluate our model. We'll use accuracy as our metric, which tells us the proportion of passengers whose survival outcome the model predicted correctly. 

In different scenarios, other metrics may be more appropriate. For example, in a problem with imbalanced classes, precision, recall, or the F1 score might be a better measure of performance.
'''

y_pred = model.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))


Accuracy: 0.7948717948717948
Confusion Matrix:
[[22  5]
 [11 40]]


Accuracy measures overall model performance: the number of correct predictions (both positive and negative) divided by the total number of predictions. 
In this case, the accuracy is approximately 0.795, or 79.5%: the model predicted 62 of the 78 test cases correctly.

Confusion Matrix: a 2x2 matrix for binary classification that breaks the predictions down by actual and predicted class. In sklearn's output, rows are the actual classes and columns are the predicted classes. Here the positive class is 1 (survived).
* True Negatives (TN): 22 passengers were predicted not to survive and did not survive.
* False Positives (FP): 5 passengers were predicted to survive but did not survive.
* False Negatives (FN): 11 passengers were predicted not to survive but did survive.
* True Positives (TP): 40 passengers were predicted to survive and did survive.
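
The four cells of the confusion matrix are enough to derive accuracy as well as the precision, recall, and F1 score mentioned earlier. The sketch below plugs in the matrix printed above; the variable names are chosen for this example.

```python
import numpy as np

# Confusion matrix from the output above: rows = actual, columns = predicted.
cm = np.array([[22, 5],
               [11, 40]])

# ravel() flattens row by row: TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all predictions that were correct
precision = tp / (tp + fp)           # of predicted survivors, how many survived
recall = tp / (tp + fn)              # of actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f'Accuracy:  {accuracy:.3f}')
print(f'Precision: {precision:.3f}')
print(f'Recall:    {recall:.3f}')
print(f'F1 score:  {f1:.3f}')
```

The accuracy computed this way matches the value reported by `accuracy_score`.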

'''
## Hands on Learning

Now that we've gone through the process of building and evaluating a basic logistic regression model on the Titanic dataset, 
it's time for some hands-on practice!
'''

## Task 1: Try different classification models 
Try using a different classification model, such as a Decision Tree or Random Forest, and compare the performance with the Logistic Regression model.
How to do it: Import the new model from sklearn, train it on the training data, make predictions on the test data, and compute the accuracy like before.
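
As a rough guide to the pattern, here is a sketch that fits several classifiers in a loop and collects their test accuracy. It runs on synthetic data from `make_classification` so that the Titanic exercise below stays open; the names `candidates` and `results` are placeholders for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the Titanic features.
X_syn, y_syn = make_classification(n_samples=400, n_features=8, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42
)

# Every sklearn classifier exposes the same fit/predict interface.
candidates = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'decision_tree': DecisionTreeClassifier(random_state=42),
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, clf in candidates.items():
    clf.fit(Xs_train, ys_train)
    results[name] = accuracy_score(ys_test, clf.predict(Xs_test))

for name, acc in results.items():
    print(f'{name}: {acc:.3f}')
```

Because all three models share one train/test split, their accuracy figures are directly comparable.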

In [None]:
# required libraries for these models:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [26]:
'''
## Task: Train a Decision Tree Model

We will now train a Decision Tree classifier on our data. The process is similar to how we trained our Logistic Regression model: we create the model, then fit it on our training data. 
'''

# Create an instance of DecisionTreeClassifier


# Fit the model on the training data


'''
## Make Predictions and Evaluate the Decision Tree Model

With our Decision Tree model trained, we can now make predictions on our test data. We will compare these predictions with the actual outcomes to evaluate the performance of our model. 
'''

# Make predictions on the test set


# Calculate the accuracy of the Decision Tree model


# Print the accuracy of the Decision Tree model


# Print the confusion matrix of the Decision Tree model




In [27]:


'''
## Task: Train a Random Forest Model

Now we will train a Random Forest classifier on our data. Again, the process is similar to our previous models.
'''

# Create an instance of RandomForestClassifier


# Fit the model on the training data


'''
## Make Predictions and Evaluate the Random Forest Model

With our Random Forest model trained, we can now make predictions on our test data and evaluate its performance.
'''

# Make predictions on the test set


# Calculate the accuracy of the Random Forest model


# Print the accuracy of the Random Forest model


# Print the confusion matrix of the Random Forest model




## Task 2: Feature Engineering
Feature engineering is the process of using domain knowledge to create new features from the existing ones, with the goal of improving the performance of our machine learning model. 
It's a critical step in any data science project, and has the potential to significantly improve our model's predictive power.

Our current model uses the following features: passenger class (pclass), sex, age, number of siblings/spouses aboard (sibsp), number of parents/children aboard (parch), fare, and port of embarkation (embarked). But we have more information available in our dataset that we could potentially use to create new features.

For example, from the name column, we could extract titles like 'Mr', 'Mrs', and 'Miss', which might provide information about the passenger's social status and, in turn, their likelihood of survival.

Another example would be combining sibsp and parch into a single feature that represents the total size of the family aboard.

Try out some feature engineering ideas. Specifically, create a 'Title' feature and a 'Family Size' feature, include them in your model, and see how they impact the model's performance.


In [32]:
# Create 'Family Size' feature
df['family_size'] = df['sibsp'] + df['parch'] + 1  # +1 to include the passenger themselves

# Include 'family_size' in your features and train your model



# Fit the model


# Predict and evaluate performance




In [33]:
# Extract the title from the name
# Raw string so that '\.' is read as a regex escape rather than a Python string escape
df['title'] = df['name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Check the counts of each unique title
print(df['title'].value_counts())

# You might want to combine some of the rare titles together or with more common ones
df['title'] = df['title'].replace(['Lady', 'Countess','Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
df['title'] = df['title'].replace('Mlle', 'Miss')
df['title'] = df['title'].replace('Ms', 'Miss')
df['title'] = df['title'].replace('Mme', 'Mrs')

# Now, include 'title' in your features, perform one-hot encoding, and train your model




# Fit the model

# Predict and evaluate performance



Mr          120
Mrs          72
Miss         42
Master        8
Dr            4
Col           3
Major         2
Mme           1
Capt          1
Lady          1
Sir           1
Mlle          1
Countess      1
Name: title, dtype: int64
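
The title extraction and family-size steps can be tried in isolation on a handful of names. In the sketch below, the first four names come from the `df.head()` output earlier; the last two are hypothetical, added only to exercise the mapping. The dictionary `title_map` is one possible way to express the replacements and is an assumption of this example.

```python
import pandas as pd

names = pd.Series([
    'Allen, Miss. Elisabeth Walton',
    'Allison, Master. Hudson Trevor',
    'Allison, Mr. Hudson Joshua Creighton',
    'Allison, Mrs. Hudson J C (Bessie Waldo Daniels)',
    'Doe, Mlle. Jane',      # hypothetical name
    'Roe, Col. Richard',    # hypothetical name
])

# Capture the word that sits between a space and the first period.
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)

# Fold variants and rare titles into broader groups with a single mapping.
title_map = {'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs', 'Col': 'Rare'}
titles = titles.replace(title_map)
print(titles.tolist())
# ['Miss', 'Master', 'Mr', 'Mrs', 'Miss', 'Rare']

# Family size: siblings/spouses + parents/children + the passenger.
family = pd.DataFrame({'sibsp': [0, 1, 1], 'parch': [0, 2, 2]})
family['family_size'] = family['sibsp'] + family['parch'] + 1
print(family['family_size'].tolist())
# [1, 4, 4]
```

A single dictionary keeps all the grouping decisions in one place, which makes them easier to review than a chain of separate `replace` calls.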


In [34]:
# Select only a subset of features
features = ['pclass', 'sex', 'age', 'fare', 'title']

# Train your model using only these features
X = pd.get_dummies(df[features])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict and evaluate performance
y_pred = model.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))


Accuracy: 0.782051282051282


For further investigation, including which features matter most and how the features are distributed, these are good resources:

 https://towardsdatascience.com/predicting-the-survival-of-titanic-passengers-30870ccc7e8
 https://towardsdatascience.com/machine-learning-with-the-titanic-dataset-7f6909e58280