My analysis of the Kaggle Titanic Dataset
- Drop the `Name`, `Ticket`, and `Cabin` columns.
- Transform the `Fare` column to indicate the difference from the median fare by passenger class.
- Impute missing `Age` values with the median based on sex and passenger class.
- Transform `SibSp` and `Parch` into a `FamSize` feature by taking their sum.
- Scores:
  - Random Forest: 0.79904 with `n_estimators=300, max_depth=6`.
  - Logistic Regression: 0.77512 with `degree=3, C=0.005`.
  - SVM: 0.77033 with `gamma='auto', C=3`.
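
The preprocessing steps above could be sketched in pandas roughly as follows. This is a minimal illustration, not the code behind the reported scores: the function name `preprocess` and the toy rows are hypothetical, the column names follow the Kaggle Titanic CSV, and dropping `SibSp` and `Parch` after summing them is an assumption.

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper sketching the steps listed above."""
    out = df.drop(columns=["Name", "Ticket", "Cabin"])

    # Express each fare as its difference from the median fare of its class.
    out["Fare"] = out["Fare"] - out.groupby("Pclass")["Fare"].transform("median")

    # Fill missing ages with the median age of passengers of the same sex and class.
    age_median = out.groupby(["Sex", "Pclass"])["Age"].transform("median")
    out["Age"] = out["Age"].fillna(age_median)

    # Family size as the sum of siblings/spouses and parents/children.
    out["FamSize"] = out["SibSp"] + out["Parch"]

    # Assumption: the two source columns are dropped once FamSize exists.
    return out.drop(columns=["SibSp", "Parch"])


# Toy rows for illustration only (not real passengers).
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Ticket": ["1", "2", "3", "4"],
    "Cabin": [None, "C85", None, None],
    "Pclass": [1, 1, 3, 3],
    "Sex": ["male", "male", "female", "female"],
    "Age": [40.0, None, 20.0, 30.0],
    "Fare": [80.0, 60.0, 8.0, 10.0],
    "SibSp": [1, 0, 0, 2],
    "Parch": [0, 0, 1, 1],
})
result = preprocess(toy)
print(result)
```

Using `groupby(...).transform("median")` keeps the result aligned with the original rows, so it can be subtracted from or used to fill a column directly.
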
- Imputed missing values using the median computed over both the train and test sets, instead of just the train set.
- Scores:
  - Random Forest: 0.78947 with `n_estimators=400, max_depth=7`.
  - Logistic Regression: 0.77033 with `degree=3, C=0.003`.
  - SVM: 0.78947 with `gamma='auto', C=1`.
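
One way to compute the medians over both sets is to concatenate them before grouping. The sketch below assumes the imputed column is `Age` and that the grouping is still by sex and passenger class, as in the earlier steps. The helper name `impute_age_combined` and the toy frames are hypothetical.

```python
import pandas as pd


def impute_age_combined(train: pd.DataFrame, test: pd.DataFrame):
    """Hypothetical helper: fill Age using medians computed over train + test."""
    combined = pd.concat([train, test], ignore_index=True)
    # One median per (Sex, Pclass) group, computed over both sets together.
    medians = combined.groupby(["Sex", "Pclass"])["Age"].median().rename("AgeMedian")

    def fill(df: pd.DataFrame) -> pd.DataFrame:
        out = df.copy()
        # Look up each row's group median by joining on the grouping columns.
        lookup = out.join(medians, on=["Sex", "Pclass"])["AgeMedian"]
        out["Age"] = out["Age"].fillna(lookup)
        return out

    return fill(train), fill(test)


# Toy frames for illustration only.
train = pd.DataFrame({"Sex": ["male", "male"], "Pclass": [3, 3], "Age": [20.0, None]})
test = pd.DataFrame({"Sex": ["male"] * 3, "Pclass": [3] * 3, "Age": [40.0, 30.0, None]})
train_filled, test_filled = impute_age_combined(train, test)
print(train_filled["Age"].tolist(), test_filled["Age"].tolist())
```

In this toy case the train-only median would be 20, while the combined median is 30, which shows how the two strategies can differ.
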
- Added 1 to `FamSize` and log-transformed `Fare`. Added the `max_features` parameter for random forests.
- Scores:
  - Random Forest: 0.79426 with `n_estimators=400, max_features=4, max_depth=5`.
  - Logistic Regression: 0.77033 with `degree=2, C=0.03`.
  - SVM: 0.77990 with `gamma='auto', C=3`.
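
A rough sketch of these changes, under several assumptions: the log is applied to the raw fare, and `np.log1p` is used so that zero fares stay finite (the notes above do not say which log variant was used). The `IsFemale` column, the feature selection, `random_state=0`, and the toy rows are hypothetical and only there to make the example runnable.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: FamSize counts the passenger too; Fare is log-scaled."""
    out = df.copy()
    out["FamSize"] = out["SibSp"] + out["Parch"] + 1
    # Assumption: log(1 + fare), which maps a fare of 0 to 0 instead of -inf.
    out["Fare"] = np.log1p(out["Fare"])
    return out


# Toy rows for illustration only.
toy = pd.DataFrame({
    "Pclass": [1, 3, 3, 2, 1, 3],
    "IsFemale": [1, 0, 1, 0, 1, 0],
    "SibSp": [1, 0, 0, 1, 0, 4],
    "Parch": [0, 0, 2, 1, 0, 1],
    "Fare": [71.28, 0.0, 7.92, 26.0, 53.1, 31.27],
    "Survived": [1, 0, 1, 0, 1, 0],
})
features = transform(toy)
X = features[["Pclass", "IsFemale", "FamSize", "Fare"]]

# Hyperparameters taken from the random forest score listed above.
model = RandomForestClassifier(
    n_estimators=400, max_features=4, max_depth=5, random_state=0
)
model.fit(X, toy["Survived"])
print(model.predict(X))
```

With `max_features=4`, each split considers up to four candidate features, so the feature matrix here is given exactly four columns.
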
- Create a `Title` feature from `Name`. Then drop `Name`.
- Create a `TicketSize` feature: Size of each group sharing a `Ticket` number.
  - Divide `Fare` by `TicketSize` to get per-person fare.
  - Then drop `TicketSize` and `Ticket`.
- Impute `Age` values using median, grouped by `Title` and `Sex`.
  - Then create a `Child` feature and drop `Age`.
- Transform `SibSp` and `Parch` into `FamSize` by addition.
  - Then transform `FamSize` into `LargeFam` and `SmallFam` indicators.
- Drop the `Embarked` and `Cabin` features.
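
The feature engineering above might look something like the following in pandas. The notes do not give the cut-offs, so the `child_age` and `large_fam` thresholds, the definition of `SmallFam` as a family of 1 to 3 relatives, the title regex, and the helper name `engineer` are all assumptions made for illustration.

```python
import pandas as pd


def engineer(df: pd.DataFrame, child_age: int = 16, large_fam: int = 4) -> pd.DataFrame:
    """Hypothetical helper; child_age and large_fam are assumed thresholds."""
    out = df.copy()

    # Title is the text between the comma and the first period,
    # e.g. "Braund, Mr. Owen Harris" -> "Mr".
    out["Title"] = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

    # Number of passengers sharing each ticket, then the per-person fare.
    ticket_size = out["Ticket"].map(out["Ticket"].value_counts())
    out["Fare"] = out["Fare"] / ticket_size

    # Fill missing ages with the median age of the same title and sex.
    age_median = out.groupby(["Title", "Sex"])["Age"].transform("median")
    out["Age"] = out["Age"].fillna(age_median)
    # Rows whose age is still missing compare as False and get Child = 0.
    out["Child"] = (out["Age"] < child_age).astype(int)

    # Family size by addition, then two indicator columns.
    fam_size = out["SibSp"] + out["Parch"]
    out["LargeFam"] = (fam_size >= large_fam).astype(int)
    out["SmallFam"] = ((fam_size > 0) & (fam_size < large_fam)).astype(int)

    return out.drop(
        columns=["Name", "Ticket", "Age", "SibSp", "Parch", "Embarked", "Cabin"]
    )


# Toy rows for illustration only.
toy = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley",
        "Heikkinen, Miss. Laina", "Palsson, Master. Gosta Leonard",
        "Palsson, Miss. Torborg Danira",
    ],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2", "349909", "349909"],
    "Fare": [7.25, 71.28, 7.92, 21.0, 21.0],
    "Sex": ["male", "female", "female", "male", "female"],
    "Age": [22.0, 38.0, 26.0, 2.0, None],
    "SibSp": [1, 1, 0, 3, 3],
    "Parch": [0, 0, 0, 1, 1],
    "Embarked": ["S", "C", "S", "S", "S"],
    "Cabin": [None, "C85", None, None, None],
})
result = engineer(toy)
print(result)
```

`Title` is left as a string column here. It would still need encoding (for example with `pd.get_dummies`) before being passed to a model.
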