# Titanic Survival Prediction

## Introduction

This is my first machine learning kernel. I will use the Titanic dataset from Kaggle to predict whether passengers survived the disaster.

## Steps
1. Load Data
2. Feature engineering
3. Train model using Logistic Regression
4. Conclusion

### Load data

In [1]:
#Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import make_scorer, accuracy_score


In [2]:
train = pd.read_csv("./input/train.csv")
test = pd.read_csv("./input/test.csv")

In [3]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Some observations:
- We can drop the Name and Ticket columns
- We can drop Fare because Pclass already captures similar information
- Cabin contains NaN values, but it may also be related to survival
- SibSp and Parch could be combined into a single column
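
The last observation can be sketched as follows. This is only an illustration on toy rows: the column names `FamilySize` and `IsAlone` are hypothetical and are not used anywhere else in this kernel.

```python
import pandas as pd

# Toy rows with the same column names as the Titanic data (values are illustrative only)
sample = pd.DataFrame({"SibSp": [1, 0, 1], "Parch": [0, 0, 2]})

# Hypothetical combined feature: relatives aboard plus the passenger themselves
sample["FamilySize"] = sample["SibSp"] + sample["Parch"] + 1

# Hypothetical flag for passengers travelling without relatives
sample["IsAlone"] = (sample["FamilySize"] == 1).astype(int)

print(sample)
```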

### Feature Engineering

In [4]:
train.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


Check missing values

In [5]:
print(pd.isnull(train).sum())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


- Drop Cabin because most of its values are missing (687 of 891 rows)


In [6]:
# Drop the columns we decided not to use; drop() avoids echoing the removed column
train = train.drop(columns=['Name', 'Ticket', 'Cabin'])

In [7]:
print(pd.isnull(train).sum())

PassengerId      0
Survived         0
Pclass           0
Sex              0
Age            177
SibSp            0
Parch            0
Fare             0
Embarked         2
dtype: int64


In [8]:
# fillna() rejects an array, so draw a single scalar instead of using size=1
test["Age"] = test["Age"].fillna(np.random.randint(low=30, high=60))
# Drop the training rows that still have missing values (Age, Embarked)
train = train.dropna(how='any')
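
Filling every missing age with one random integer gives all of those passengers the same arbitrary age. A common alternative, shown here only as a sketch on toy data and not as what this kernel does, is to fill with the median age computed from the training set. The names `train_demo`, `test_demo` and `age_median` are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for the real train/test data (values are illustrative only)
train_demo = pd.DataFrame({"Age": [22.0, np.nan, 38.0, 26.0]})
test_demo = pd.DataFrame({"Age": [np.nan, 30.0]})

# Compute the statistic on the training data only, then reuse it for the test data
age_median = train_demo["Age"].median()

train_filled = train_demo["Age"].fillna(age_median)
test_filled = test_demo["Age"].fillna(age_median)

print(age_median)
print(test_filled.tolist())
```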

### Plot data

In [None]:
plt.scatter(train[train['Survived']==0]['Age'],train[train['Survived']==0]['Pclass'],color='r', s=5, marker="o")
plt.scatter(train[train['Survived']==1]['Age'],train[train['Survived']==1]['Pclass'],color='g', s=5, marker="x")
plt.xlabel("age")
plt.ylabel("pclass")


plt.show()

In [None]:
train.tail()

In [None]:
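def age_to_int(value):
    # Bin age into ordinal codes: 1 = zero, 2 = (0, 20], 3 = (20, 38], 4 = above 38
    if value == 0:
        return 1
    if 0 < value <= 20:
        return 2
    if 20 < value <= 38:
        return 3
    if value > 38:
        return 4
    # Falls through (returns None) for missing or negative ages

def fare_to_int(value):
    # Bin fare into two codes: 1 = up to 128 (a fare of 0 counts as low), 2 = anything else
    if 0 <= value <= 128:
        return 1
    else:
        return 2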
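
The same age bins can also be expressed with `pd.cut`, which works on the whole column at once. This is a sketch under the assumption that the bin edges 0, 20 and 38 from `age_to_int` are the ones wanted; the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# Illustrative ages covering every bin edge used by age_to_int
ages = pd.Series([0, 5, 20, 21, 38, 39, 80])

# Right-closed intervals: (-inf, 0] -> 1, (0, 20] -> 2, (20, 38] -> 3, (38, inf) -> 4
age_cat = pd.cut(
    ages,
    bins=[-np.inf, 0, 20, 38, np.inf],
    labels=[1, 2, 3, 4],
)

print(age_cat.tolist())
```

One difference from the hand-written function: negative values land in the first bin here, and missing ages stay missing rather than becoming `None`.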
    

In [None]:


# Encode the text columns as integer category codes
train['Sex'] = train['Sex'].astype('category')
train['Sex_cat'] = train['Sex'].cat.codes
train['Embarked'] = train['Embarked'].astype('category')
train['Embarked_cat'] = train['Embarked'].cat.codes
# Bin the numeric columns with the helper functions defined above
train['Age_cat'] = train['Age'].apply(age_to_int)
train['Fare_cat'] = train['Fare'].apply(fare_to_int)
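
Category codes are assigned from the values present in each frame, so the train and test sets only get matching codes if they contain the same set of values. A minimal sketch of one way to guard against a mismatch, using an explicit category list (the toy frames and the name `embarked_dtype` are assumptions for the example):

```python
import pandas as pd

# Toy frames: the test frame happens to contain no "C" passengers
train_demo = pd.DataFrame({"Embarked": ["S", "C", "Q"]})
test_demo = pd.DataFrame({"Embarked": ["S", "S", "Q"]})

# Codes inferred per frame disagree: "S" is 2 in train but 1 in test
naive_train = train_demo["Embarked"].astype("category").cat.codes
naive_test = test_demo["Embarked"].astype("category").cat.codes

# Fixing the category list up front gives both frames the same mapping
embarked_dtype = pd.CategoricalDtype(categories=["C", "Q", "S"])
train_codes = train_demo["Embarked"].astype(embarked_dtype).cat.codes
test_codes = test_demo["Embarked"].astype(embarked_dtype).cat.codes

print(naive_test.tolist(), test_codes.tolist())
```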

In [None]:
train.tail()

### Train model

In [None]:
X=train.drop(['PassengerId','Survived','Sex','Age','Embarked','Fare'],axis=1)
y=train['Survived']

In [None]:
X.tail()


In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model = model.fit(X, y)

In [None]:
# Accuracy on the training data itself, not on a held-out set
model.score(X, y)
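
Scoring on the same rows the model was fitted on tends to overstate how well it generalises. A rough sketch of a hold-out check with `train_test_split` follows. It uses synthetic data from `make_classification` so that it runs on its own; the split size and the variable names are assumptions, not part of this kernel.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered Titanic features (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Hold back 20% of the rows for validation, keeping the class balance
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

clf = LogisticRegression().fit(X_tr, y_tr)

train_acc = clf.score(X_tr, y_tr)   # accuracy on rows the model has seen
val_acc = clf.score(X_val, y_val)   # accuracy on rows it has not seen
print(train_acc, val_acc)
```

With the real data, `X` and `y` from the cells above would take the place of `X_demo` and `y_demo`.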

### Prepare test data

In [None]:
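# Drop the same unused columns as in the training set
test = test.drop(columns=['Name', 'Ticket', 'Cabin'])
# Fill missing ages with one random integer in [30, 60); fillna() needs a scalar, not an array
test["Age"] = test["Age"].fillna(np.random.randint(low=30, high=60))
test["Embarked"] = test["Embarked"].fillna('S')
# Apply the same encodings and bins that were used for the training set
test['Sex'] = test['Sex'].astype('category')
test['Sex_cat'] = test['Sex'].cat.codes
test['Embarked'] = test['Embarked'].astype('category')
test['Embarked_cat'] = test['Embarked'].cat.codes
test['Age_cat'] = test['Age'].apply(age_to_int)
test['Fare_cat'] = test['Fare'].apply(fare_to_int)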

In [None]:
test.tail()

In [None]:
test.count()

In [None]:
passengerid=test['PassengerId']
test = test.drop(['PassengerId','Sex','Age','Embarked','Fare'],axis=1)

In [None]:
predictions = model.predict(test)

In [None]:

output = pd.DataFrame({ 'PassengerId' : passengerid, 'Survived': predictions })
output.to_csv('./output/submission.csv', index=False)