# Titanic competition - Kaggle


This notebook is part of the Kaggle competition "Titanic - Machine Learning from Disaster".

The goal is to build a machine learning model that predicts which passengers survived the shipwreck.

## Import Dependencies

In [1]:
import pandas as pd
import numpy as np

## Load dataset

In [2]:
train_data = pd.read_csv('../data/raw/train.csv') # training dataset will be used for training our model

train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
test_data = pd.read_csv('../data/raw/test.csv') # test dataset will be used for making predictions after we train our model on the training dataset

test_data.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


## Exploring a pattern

In [4]:
women = train_data.loc[train_data.Sex == 'female']["Survived"]
rate_women = sum(women)/len(women)

print("Fraction of women who survived:", rate_women)

Fraction of women who survived: 0.7420382165605095


In [5]:
men = train_data.loc[train_data.Sex == 'male']["Survived"]
rate_men = sum(men)/len(men)

print("Fraction of men who survived:", rate_men)

Fraction of men who survived: 0.18890814558058924


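As shown above, a single column such as Sex already reveals a strong pattern: about 74% of the women in the training data survived, compared with about 19% of the men.
A rule built on one column, however, ignores everything else known about each passenger, so it may miss patterns that only appear when several features are considered together.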
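
The same kind of check can be extended to more than one column with `groupby`. The sketch below is only an illustration: `toy` is a hypothetical, made-up frame that borrows the column names of `train.csv`, so its numbers say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy data (values are made up); only the column names match train.csv.
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "female", "male", "male", "male", "male"],
    "Pclass":   [1, 1, 3, 3, 1, 1, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()

# Grouping by two columns shows how the rate changes inside each sex group.
rate_by_sex_class = toy.groupby(["Sex", "Pclass"])["Survived"].mean()

print(rate_by_sex)
print(rate_by_sex_class)
```

Replacing `toy` with `train_data` would give the corresponding breakdown for the real training set.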

## Creating a machine learning model

In [6]:
from sklearn.ensemble import RandomForestClassifier

y = train_data["Survived"] # target variable we want to predict

# Features used for prediction; they must be present in both the training and test datasets.
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features]) # get_dummies converts categorical columns (here, Sex) into dummy/indicator columns
X_test = pd.get_dummies(test_data[features])
X_test = X_test.reindex(columns=X.columns, fill_value=0) # align the test columns with the training columns (same set, same order)

# Random forest classifier; other models and hyperparameters can be tried to compare results.
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y) # train the model on the features (X) and the target variable (y)
predictions = model.predict(X_test) # predicted Survived values for the test dataset

# Kaggle expects a submission file with the columns PassengerId and Survived.
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False) # index=False prevents pandas from writing row indices to the csv file
print("Your submission was successfully saved!")

Your submission was successfully saved!
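
One detail of the encoding step is worth illustrating: `pd.get_dummies` only creates columns for the categories it actually sees, so two frames encoded separately can end up with different columns. With the four features used here this is unlikely to matter, since both files presumably contain both values of `Sex`, but it can with other categorical columns. A minimal sketch on hypothetical toy frames (`train_toy` and `test_toy` are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frames; "Embarked" stands in for any categorical column.
train_toy = pd.DataFrame({"Pclass": [1, 3, 2], "Embarked": ["S", "C", "Q"]})
test_toy = pd.DataFrame({"Pclass": [3, 1], "Embarked": ["S", "S"]})  # "C" and "Q" never appear

X_toy = pd.get_dummies(train_toy)
X_test_toy = pd.get_dummies(test_toy)  # lacks the Embarked_C and Embarked_Q columns

# Reindex the test frame to the training columns: same set, same order,
# with any column missing from the test data filled with 0.
X_test_aligned = X_test_toy.reindex(columns=X_toy.columns, fill_value=0)

print(list(X_toy.columns))
print(list(X_test_aligned.columns))
```

After the `reindex` call both frames expose the same columns in the same order, which is what a fitted scikit-learn model expects at prediction time.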


## Measuring Model Accuracy

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

# Target
y = train_data["Survived"]

# Features
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])

# Train / validation split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Model
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,
    random_state=42
)

model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_val)

# Accuracy
accuracy = accuracy_score(y_val, y_pred)
print(f"Validation Accuracy: {accuracy:.4f}")


Validation Accuracy: 0.7709
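
An accuracy value is easier to judge next to a simple baseline. The sketch below shows what accuracy measures (the fraction of matching labels) and compares it with always predicting the most frequent class. The arrays `y_true` and `y_hat` are hypothetical, made-up values, not the notebook's validation data.

```python
import numpy as np

# Hypothetical labels and predictions (made up for illustration).
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0, 0, 1])
y_hat = np.array([0, 0, 1, 1, 1, 0, 0, 0, 0, 1])

# Accuracy: fraction of positions where the prediction equals the true label.
accuracy = float(np.mean(y_true == y_hat))

# Baseline: always predict the most frequent class in y_true.
majority_class = int(np.bincount(y_true).argmax())
baseline = float(np.mean(y_true == majority_class))

print(f"Accuracy: {accuracy:.2f}")
print(f"Majority-class baseline: {baseline:.2f}")
```

Applying the same comparison to `y_val` and `y_pred` would show how much the model gains over the trivial prediction on the real validation split.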


## Conclusion

- The model selected for prediction was a Random Forest classifier.
- The validation accuracy was approximately 77% (0.7709) on a held-out 20% split of the training data.
- This project intended to create a simple machine learning model for predicting the survivors of the shipwreck.
- Even though the ideal accuracy would be 1 (100%), this notebook shows how to choose, train, and evaluate a model for a prediction task.