The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, the task is to build a predictive model that answers the question: “What sorts of people were more likely to survive?” using passenger data (such as name, age, gender, and socio-economic class).
Our data is from the Kaggle website. The data has been split into two groups:
- training set (train.csv)
- test set (test.csv)
For the training set, the outcome (also known as the “ground truth”) is provided for each passenger. Our model will be based on “features” like passengers’ gender and class. We can also use feature engineering to create new features.
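As an illustration of feature engineering, the sketch below derives a `title` feature from the passenger name. It assumes names follow the pattern "Surname, Title. Given names"; the sample rows are made up for the example, and `extract_title` is a hypothetical helper, not part of the dataset or any library.

```python
import re

# Made-up sample rows in the style of train.csv (not real passengers).
passengers = [
    {"name": "Smith, Mr. John", "sex": "male", "pclass": 3},
    {"name": "Jones, Mrs. Anna (Mary Brown)", "sex": "female", "pclass": 1},
    {"name": "Taylor, Miss. Emma", "sex": "female", "pclass": 2},
]

# Assumption: the title sits between the first comma and the next period.
TITLE_PATTERN = re.compile(r",\s*([^.]+)\.")


def extract_title(name):
    """Return the honorific found in a name, or 'Unknown' if none matches."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else "Unknown"


# Add the engineered feature to every row.
for passenger in passengers:
    passenger["title"] = extract_title(passenger["name"])

titles = [passenger["title"] for passenger in passengers]
print(titles)
```

A title can hint at age group and social status, which is why it may be worth testing as an extra feature; whether it actually helps has to be checked on the training set.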
For the test set, the ground truth is not provided. It is our job to predict these outcomes: for each passenger in the test set, we use the model we trained to predict whether or not they survived the sinking of the Titanic.
survival ► 0 = No, 1 = Yes
pclass ► Ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd
sex ► Sex
Age ► Age in years; fractional if less than 1. If the age is estimated, it is in the form xx.5
sibsp ► Number of siblings / spouses aboard the Titanic. The dataset defines family relations in this way:
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
parch ► Number of parents / children aboard the Titanic. The dataset defines family relations in this way:
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children traveled only with a nanny, therefore parch=0 for them.
ticket ► Ticket number
fare ► Passenger fare
cabin ► Cabin number
embarked ► Port of Embarkation: C = Cherbourg, Q = Queenstown, S = Southampton
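The sibsp and parch columns described above can be combined into a single family-size feature. The following is a minimal sketch in plain Python: `add_family_features` and the `family_size` / `is_alone` names are hypothetical, and the rows are made up for the example.

```python
def add_family_features(row):
    """Return a copy of the row with family_size and is_alone added.

    family_size counts the passenger plus siblings/spouses (sibsp)
    and parents/children (parch) travelling with them.
    """
    family_size = int(row["sibsp"]) + int(row["parch"]) + 1
    return {**row, "family_size": family_size, "is_alone": family_size == 1}


# Made-up rows: a solo traveller, a couple with two children,
# and a child travelling only with a nanny (parch = 0 by definition).
rows = [
    {"sibsp": 0, "parch": 0},
    {"sibsp": 1, "parch": 2},
    {"sibsp": 0, "parch": 0},
]

enriched = [add_family_features(row) for row in rows]
for row in enriched:
    print(row)
```

Note that a child travelling only with a nanny ends up with `is_alone` set to true under this definition, which follows directly from how parch is defined in the dataset.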
The score is the fraction of passengers whose outcome we predict correctly. This is known as accuracy. An accuracy of 1.0 would place us among the best submissions.
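To make the metric concrete, here is a small sketch of how accuracy can be computed from true and predicted labels. The `accuracy` function and the label lists are illustrative only; the labels are made up and do not come from the dataset.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels: 0 = did not survive, 1 = survived.
truth = [0, 1, 1, 0, 1]
preds = [0, 1, 0, 0, 1]

score = accuracy(truth, preds)  # 4 of 5 predictions match
print(score)
```

An accuracy of 1.0 would mean every passenger in the test set was classified correctly.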
You will see the other phases throughout the notebook.
I hope you enjoy reading my code, and I would be happy to hear your opinions and suggestions!