# Predictive Modeling using the Titanic Dataset
---

The goal of this project is to predict which passengers survived the Titanic disaster. The dataset records each passenger's age, sex, and ticket class, along with whether they survived.

First, you'll need the Titanic dataset, which is available from several sources. Here we download it from a public GitHub repository using Python's `requests` library and save a local copy.

In [2]:
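import os

import pandas as pd
import requests
from io import StringIO

url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
response = requests.get(url, timeout=30)
response.raise_for_status()  # Stop early if the download failed
titanic_df = pd.read_csv(StringIO(response.text))

# Create the output directory if it does not exist yet
os.makedirs('./data', exist_ok=True)
titanic_df.to_csv('./data/titanic.csv', index=False)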

Now that we have the data, let's do some data cleaning and feature engineering with Python.

In [3]:
# Load the data
titanic_df = pd.read_csv('./data/titanic.csv')

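# Fill missing ages with the median age
# (assign the result back; calling fillna with inplace=True on a selected column is deprecated in recent pandas)
titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].median())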

# Convert 'Sex' to a numerical variable (male:0, female:1)
titanic_df['Sex'] = titanic_df['Sex'].map({'male': 0, 'female': 1})

# Feature engineering: create a new feature 'FamilySize'
titanic_df['FamilySize'] = titanic_df['SibSp'] + titanic_df['Parch'] + 1

# Drop columns we don't need
titanic_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin', 'Embarked'], axis=1, inplace=True)

# Save the cleaned data
titanic_df.to_csv('./data/cleaned_titanic.csv', index=False)
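
To see what the cleaning steps above do, here is a minimal sketch on a tiny hand-made table. The three rows and their values are invented for illustration and are not taken from the Titanic data; only the column names match the real dataset.

```python
import pandas as pd

# Toy rows (hypothetical values, not from the real dataset)
toy_df = pd.DataFrame({
    'Age': [22.0, None, 40.0],
    'Sex': ['male', 'female', 'female'],
    'SibSp': [1, 0, 2],
    'Parch': [0, 0, 1],
})

# The median of the known ages (22 and 40) replaces the missing value
toy_df['Age'] = toy_df['Age'].fillna(toy_df['Age'].median())

# Encode sex as 0/1, using the same mapping as the cell above
toy_df['Sex'] = toy_df['Sex'].map({'male': 0, 'female': 1})

# Family size counts siblings/spouses, parents/children, and the passenger
toy_df['FamilySize'] = toy_df['SibSp'] + toy_df['Parch'] + 1

print(toy_df)
```

The missing age becomes 31.0, the median of the two known ages, and the last passenger gets a `FamilySize` of 4.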


Next, we will use R and the `randomForest` package to build a predictive model. We can call R scripts from Python using `rpy2`.

In [16]:
import os
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr, isinstalled
from rpy2.robjects.vectors import StrVector

# Set the CRAN mirror
robjects.r('''
options(repos = c(CRAN = "https://cloud.r-project.org"))
''')

# Install the randomForest package if it is not already installed
if not isinstalled('randomForest'):
    utils = importr('utils')
    utils.install_packages(StrVector(['randomForest']))

# Import the randomForest package
randomForest = importr("randomForest")

# Get the absolute path to the cleaned_titanic.csv file
# (forward slashes keep the path valid inside an R string on Windows)
current_dir = os.getcwd()
cleaned_titanic_path = os.path.join(current_dir, 'data', 'cleaned_titanic.csv').replace(os.sep, '/')

# Define your R script
r_script = f"""
# Load necessary libraries
library(randomForest)

# Load the data
titanic_df <- read.csv('{cleaned_titanic_path}')

# Convert 'Survived' to a factor
titanic_df$Survived <- as.factor(titanic_df$Survived)

# Split the data into training and testing sets
set.seed(123)
train_indices <- sample(seq_len(nrow(titanic_df)), floor(0.7 * nrow(titanic_df)))
train_df <- titanic_df[train_indices, ]
test_df <- titanic_df[-train_indices, ]

# Build the model
rf_model <- randomForest(Survived ~ ., data=train_df, ntree=500, mtry=3, importance=TRUE)

# Make predictions on the test set
predictions <- predict(rf_model, test_df)

# Evaluate the model
conf_matrix <- table(test_df$Survived, predictions)
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)

print(accuracy)
"""

# Run the R script
robjects.r(r_script)


[1] 0.8395522


This script trains a random forest on 70% of the data, predicts survival for the remaining 30%, and reports the accuracy of those predictions, which is about 84% in the run shown above.
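
The accuracy line in the R script divides the diagonal of the confusion matrix (the correct predictions) by the total number of test rows. The same calculation can be sketched in plain Python. The function name and the counts below are made up for illustration and are not the model's actual confusion matrix.

```python
# Rows are actual classes, columns are predicted classes (0 = died, 1 = survived).
# Hypothetical counts, chosen only to illustrate the arithmetic.
conf_matrix = [
    [50, 10],  # actual 0: 50 predicted 0, 10 predicted 1
    [5, 35],   # actual 1: 5 predicted 0, 35 predicted 1
]

def accuracy_from_confusion(matrix):
    """Return the share of correct predictions: sum of the diagonal / total."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

toy_accuracy = accuracy_from_confusion(conf_matrix)
print(toy_accuracy)
```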

The code above installs `randomForest` automatically if it is missing; you can also install it manually in R with `install.packages('randomForest')`. From here, you can compute more detailed evaluation metrics, tune the model parameters, or try different models to potentially get better results. This is just a simple demonstration of a data science project involving Python, R, and VS Code.
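
As one example of a more detailed metric, precision and recall for the "survived" class can be computed from the counts in a 2x2 confusion matrix. This is a minimal sketch in plain Python; the function name and the counts are hypothetical and do not come from the model above.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall for the positive class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts for the "survived" class:
# tp = true positives, fp = false positives, fn = false negatives
tp, fp, fn = 30, 10, 20
precision, recall = precision_recall(tp, fp, fn)

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```

Precision answers how many predicted survivors actually survived; recall answers how many actual survivors the model found.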