In [13]:
import pandas as pd
import numpy as np
train_df = pd.read_csv("./data/train.csv")
test_df = pd.read_csv("./data/test.csv")
submission = pd.read_csv("./data/gender_submission.csv")

- We want to find patterns in train.csv that help us predict whether the passengers in test.csv survived

### Exploring a pattern
- The sample submission file, gender_submission.csv, assumes that all female passengers survived (and that all male passengers did not)
- Let's check whether this pattern holds in the training data

In [14]:
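# Select the Survived column (0 or 1) for female passengers
women = train_df.loc[train_df.Sex == 'female', 'Survived']
# Fraction of survivors, between 0 and 1
rate_women = sum(women)/len(women)

print("Fraction of women who survived:", rate_women)

Fraction of women who survived: 0.7420382165605095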


In [15]:
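# Select the Survived column (0 or 1) for male passengers
men = train_df.loc[train_df.Sex == 'male', 'Survived']
# Fraction of survivors, between 0 and 1
rate_men = sum(men)/len(men)

print("Fraction of men who survived:", rate_men)

Fraction of men who survived: 0.18890814558058924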


- Nice guess, but this gender-based submission bases its predictions on a single column
- By considering multiple columns, we can discover more complex patterns that may yield better-informed predictions
- Checking every possible pattern across many columns by hand would take a long time, so let's use machine learning to automate the search
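
To make the idea of a multi-column pattern concrete, here is a minimal sketch that computes survival rates first by one column and then by a combination of two. The `toy_df` rows are invented for illustration and are not taken from train.csv; with the real data, `train_df` would take its place.

```python
import pandas as pd

# Toy data for illustration only; these rows are NOT taken from train.csv.
toy_df = pd.DataFrame({
    "Pclass":   [1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "male", "female", "female", "male", "male"],
    "Survived": [1, 0, 1, 0, 0, 0],
})

# Survival rate for one column: the mean of a 0/1 column is the fraction of 1s.
rate_by_sex = toy_df.groupby("Sex")["Survived"].mean()

# Survival rate for every combination of two columns.
rate_by_class_and_sex = toy_df.groupby(["Pclass", "Sex"])["Survived"].mean()

print(rate_by_sex)
print(rate_by_class_and_sex)
```

Each extra column multiplies the number of groups to inspect, which is why inspecting them by hand quickly becomes impractical.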

## Random forest model

- We will build a random forest of 100 trees; each tree considers each passenger's data and votes on whether that passenger survived
- The outcome with the most votes wins
- We look for patterns in 4 columns: Pclass, Sex, SibSp and Parch
- The model builds its trees from patterns in train.csv, then generates predictions for the passengers in test.csv
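
The voting step described above can be sketched in plain Python. The helper `majority_vote` and the list `tree_votes` are hypothetical names used only for this illustration; they are not part of scikit-learn.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label that appears most often in `votes`."""
    # most_common(1) returns a list holding one (label, count) pair.
    return Counter(votes).most_common(1)[0][0]

# Hypothetical votes from 5 trees for one passenger (1 = survived, 0 = did not).
tree_votes = [1, 0, 1, 1, 0]
prediction = majority_vote(tree_votes)
print(prediction)
```

Note that scikit-learn's `RandomForestClassifier` combines its trees by averaging their predicted class probabilities rather than counting hard votes, so this sketch shows the idea rather than the library's exact procedure.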

In [18]:
from sklearn.ensemble import RandomForestClassifier

y = train_df["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_df[features])
X_test = pd.get_dummies(test_df[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_df.PassengerId, 'Survived': predictions})
output.to_csv('./submit/titanic_submit_answer2.csv', index=False)
print("Your submission was successfully saved!")

Your submission was successfully saved!
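
Calling `pd.get_dummies` separately on the training and test frames only works when both produce the same columns. As a hedged sketch of one way to guard against a mismatch, the toy example below reindexes the test matrix to the training columns. The frames and the names `toy_train`, `toy_test`, and `X_toy_test_aligned` are made up for this illustration.

```python
import pandas as pd

# Toy frames for illustration; the test frame happens to contain no "female" rows.
toy_train = pd.DataFrame({"Pclass": [1, 3, 2], "Sex": ["male", "female", "female"]})
toy_test = pd.DataFrame({"Pclass": [3, 1], "Sex": ["male", "male"]})

X_toy = pd.get_dummies(toy_train)
X_toy_test = pd.get_dummies(toy_test)  # lacks a Sex_female column

# Give the test matrix the training columns, in the same order;
# columns missing from the test data are filled with 0.
X_toy_test_aligned = X_toy_test.reindex(columns=X_toy.columns, fill_value=0)

print(list(X_toy.columns))
print(list(X_toy_test_aligned.columns))
```

Without the alignment step, a model fitted on `X_toy` would be given a test matrix with a different set of columns.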


In [17]:
output

| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
| ... | ... | ... |
| 413 | 1305 | 0 |
| 414 | 1306 | 1 |
| 415 | 1307 | 0 |
| 416 | 1308 | 0 |


In [19]:
from sklearn.ensemble import RandomForestClassifier

y = train_df["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_df[features])
X_test = pd.get_dummies(test_df[features])

model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=2)
model.fit(X, y)
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_df.PassengerId, 'Survived': predictions})
output.to_csv('./submit/titanic_submit_test.csv', index=False)
print("Your submission was successfully saved!")

Your submission was successfully saved!
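
The two cells above differ only in `max_depth` and `random_state`. One way to compare such settings without submitting each result is cross-validation on the training data. The sketch below uses synthetic data from `make_classification` as a stand-in, so its scores say nothing about the Titanic data; in the notebook, `X` and `y` would replace `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): not the Titanic data set.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

mean_scores = {}
for depth in (3, 5):
    candidate = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=1)
    # 5-fold cross-validation: fit on four fifths of the rows, score accuracy on the rest.
    scores = cross_val_score(candidate, X_demo, y_demo, cv=5)
    mean_scores[depth] = scores.mean()

print(mean_scores)
```

The setting with the higher mean accuracy is a reasonable candidate to submit, although cross-validation scores are estimates and can differ from the leaderboard score.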


It is only natural that the score does not reach 1.0: the entries that scored 1.0 simply submitted the answer key