# The data has been split into two groups:

- Training set (train.csv)  
The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.


- Test set (test.csv)  
The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

#### Data Dictionary

Variable | Definition | Key
---------|------------|-----
survival | Survival | 0 = No, 1 = Yes
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd
sex | Sex |
Age | Age in years |
sibsp | # of siblings / spouses aboard the Titanic |
parch | # of parents / children aboard the Titanic |
ticket | Ticket number |
fare | Passenger fare |
cabin | Cabin number |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton

Variable Notes
- pclass: A proxy for socio-economic status (SES)
    - 1st = Upper
    - 2nd = Middle
    - 3rd = Lower
- age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
- sibsp: The dataset defines family relations in this way:
    - Sibling = brother, sister, stepbrother, stepsister
    - Spouse = husband, wife (mistresses and fiancés were ignored)
- parch: The dataset defines family relations in this way:
    - Parent = mother, father
    - Child = daughter, son, stepdaughter, stepson
    - Some children travelled only with a nanny, therefore parch=0 for them.
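
The description above mentions feature engineering, and the sibsp / parch definitions suggest one simple candidate. The sketch below is only an illustration on a hand-made sample: `FamilySize` and `IsAlone` are hypothetical feature names that do not appear in the dataset, and the rows are not real passengers.

```python
import pandas as pd

# Hand-made sample using the SibSp / Parch column names from train.csv (not real rows).
sample = pd.DataFrame({
    "SibSp": [1, 0, 3],
    "Parch": [0, 0, 2],
})

# Hypothetical engineered feature: relatives aboard plus the passenger themselves.
sample["FamilySize"] = sample["SibSp"] + sample["Parch"] + 1

# Hypothetical flag derived from it: True when the passenger has no relatives aboard.
sample["IsAlone"] = sample["FamilySize"] == 1

print(sample)
```

Note that, per the variable notes, a child travelling only with a nanny would be flagged as alone by this rule.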


### Read DataSet

In [2]:
%matplotlib inline
import pandas as pd

In [3]:
df = pd.read_csv('train.csv')

#### Show Data

In [5]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


#### Show columns

In [4]:
cols = df.columns[3]
cols

'Name'

### Clean dataset

#### Remove columns

_Name_

In [5]:
df.drop('Name', axis=1, inplace=True)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,male,35.0,0,0,373450,8.05,,S


_PassengerId_

In [6]:
ids = df['PassengerId']
ids[0:10]

0     1
1     2
2     3
3     4
4     5
5     6
6     7
7     8
8     9
9    10
Name: PassengerId, dtype: int64

In [7]:
df.drop('PassengerId', axis=1, inplace=True)

In [8]:
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,3,male,22.0,1,0,A/5 21171,7.25,,S
1,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
2,1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,1,1,female,35.0,1,0,113803,53.1,C123,S
4,0,3,male,35.0,0,0,373450,8.05,,S


#### Removing Null Values

_Checking fields_

In [9]:
df.notnull().all()

Survived     True
Pclass       True
Sex          True
Age         False
SibSp        True
Parch        True
Ticket       True
Fare         True
Cabin       False
Embarked    False
dtype: bool
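
`notnull().all()` only says whether a column has any gap. A possible complement, sketched here on a small made-up frame (`toy` and its values are assumptions, not rows from train.csv), is to count the missing entries per column so the size of the problem is visible before choosing a fix.

```python
import numpy as np
import pandas as pd

# Made-up frame with the same kinds of gaps as Age and Cabin.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.2833, 7.925, 8.05],
})

# Number of missing values per column.
missing_counts = toy.isnull().sum()

# Fraction of missing values per column (mean of a boolean mask).
missing_share = toy.isnull().mean()

print(missing_counts)
print(missing_share)
```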

_Change values of Cabin_

In [10]:
cabines = df['Cabin'].isnull()

In [11]:
df['Cabin'] = cabines
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,3,male,22.0,1,0,A/5 21171,7.25,True,S
1,1,1,female,38.0,1,0,PC 17599,71.2833,False,C
2,1,3,female,26.0,0,0,STON/O2. 3101282,7.925,True,S
3,1,1,female,35.0,1,0,113803,53.1,False,S
4,0,3,male,35.0,0,0,373450,8.05,True,S


_Set Age null values to Mean_

In [12]:
media = df['Age'].mean()
media

29.69911764705882

In [13]:
# Assign the result back; in-place fillna on a selected column is unreliable in recent pandas.
df['Age'] = df['Age'].fillna(media)

_Set Embarked null values to Mode_

In [14]:
moda = df['Embarked'].mode()[0]
moda

'S'

In [15]:
# Assign the result back; in-place fillna on a selected column is unreliable in recent pandas.
df['Embarked'] = df['Embarked'].fillna(moda)

_Checking fields again_

In [16]:
df.notnull().all()

Survived    True
Pclass      True
Sex         True
Age         True
SibSp       True
Parch       True
Ticket      True
Fare        True
Cabin       True
Embarked    True
dtype: bool

### Discrete data

In [17]:
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,3,male,22.0,1,0,A/5 21171,7.25,True,S
1,1,1,female,38.0,1,0,PC 17599,71.2833,False,C
2,1,3,female,26.0,0,0,STON/O2. 3101282,7.925,True,S
3,1,1,female,35.0,1,0,113803,53.1,False,S
4,0,3,male,35.0,0,0,373450,8.05,True,S


_Sex_

In [18]:
df = df.replace('male', 0)
df = df.replace('female', 1)
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,3,0,22.0,1,0,A/5 21171,7.25,True,S
1,1,1,1,38.0,1,0,PC 17599,71.2833,False,C
2,1,3,1,26.0,0,0,STON/O2. 3101282,7.925,True,S
3,1,1,1,35.0,1,0,113803,53.1,False,S
4,0,3,0,35.0,0,0,373450,8.05,True,S


_Embarked_

In [19]:
df = df.replace('S', 1)
df = df.replace('C', 2)
df = df.replace('Q', 3)
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,3,0,22.0,1,0,A/5 21171,7.25,True,1
1,1,1,1,38.0,1,0,PC 17599,71.2833,False,2
2,1,3,1,26.0,0,0,STON/O2. 3101282,7.925,True,1
3,1,1,1,35.0,1,0,113803,53.1,False,1
4,0,3,0,35.0,0,0,373450,8.05,True,1
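
`df.replace('S', 1)` looks at every column, so any other cell whose whole value is `'S'` would be rewritten too. A column-scoped alternative can be sketched with `Series.map`. The frame below is made up, and the Ticket value `"S"` is contrived purely to show the scoping; the integer codes are the ones used above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Ticket": ["S", "PC 17599", "113803"],  # contrived "S" ticket, for illustration only
})

# Same integer codes as in the cells above.
sex_codes = {"male": 0, "female": 1}
embarked_codes = {"S": 1, "C": 2, "Q": 3}

# map() touches only the column it is called on.
toy["Sex"] = toy["Sex"].map(sex_codes)
toy["Embarked"] = toy["Embarked"].map(embarked_codes)

print(toy)
```

Values missing from the dictionary become NaN under `map`, which makes unexpected categories easy to spot.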


#### Statistics

In [20]:
df.describe()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
count,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,0.383838,2.308642,0.352413,29.699118,0.523008,0.381594,32.204208,1.361392
std,0.486592,0.836071,0.47799,13.002015,1.102743,0.806057,49.693429,0.635673
min,0.0,1.0,0.0,0.42,0.0,0.0,0.0,1.0
25%,0.0,2.0,0.0,22.0,0.0,0.0,7.9104,1.0
50%,0.0,3.0,0.0,29.699118,0.0,0.0,14.4542,1.0
75%,1.0,3.0,1.0,35.0,1.0,0.0,31.0,2.0
max,1.0,3.0,1.0,80.0,8.0,6.0,512.3292,3.0
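
In the table above the Age median (50%) equals the mean, 29.699118. That is a side effect of filling the missing ages with the mean: enough rows now share that exact value for it to sit at the median. The sketch below reproduces the effect on a made-up series (the values are assumptions chosen for easy arithmetic) and also shows that the spread shrinks.

```python
import numpy as np
import pandas as pd

# Made-up ages with half of the entries missing.
ages = pd.Series([10.0, 20.0, np.nan, np.nan, np.nan, 60.0])

# Mean imputation, as done for the Age column.
filled = ages.fillna(ages.mean())

mean_before, mean_after = ages.mean(), filled.mean()
std_before, std_after = ages.std(), filled.std()
median_after = filled.median()

print(mean_before, mean_after)   # the mean is preserved
print(std_before, std_after)     # the standard deviation drops
print(median_after)              # the median lands on the fill value
```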


In [21]:
df.shape

(891, 10)

In [22]:
df.corr()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
Survived,1.0,-0.338481,0.543351,-0.069809,-0.035322,0.081629,0.257307,-0.316912,0.106811
Pclass,-0.338481,1.0,-0.1319,-0.331339,0.083081,0.018443,-0.5495,0.725541,0.045702
Sex,0.543351,-0.1319,1.0,-0.084153,0.114631,0.245489,0.182333,-0.140391,0.116569
Age,-0.069809,-0.331339,-0.084153,1.0,-0.232625,-0.179191,0.091566,-0.233123,0.007461
SibSp,-0.035322,0.083081,0.114631,-0.232625,1.0,0.414838,0.159651,0.04046,-0.059961
Parch,0.081629,0.018443,0.245489,-0.179191,0.414838,1.0,0.216225,-0.036987,-0.078665
Fare,0.257307,-0.5495,0.182333,0.091566,0.159651,0.216225,1.0,-0.482075,0.062142
Cabin,-0.316912,0.725541,-0.140391,-0.233123,0.04046,-0.036987,-0.482075,1.0,-0.013774
Embarked,0.106811,0.045702,0.116569,0.007461,-0.059961,-0.078665,0.062142,-0.013774,1.0
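
The Survived row is the part of this matrix that matters for modelling. As a reading aid, the sketch below copies that row from the output above into a Series and orders it by absolute value; `survived_corr` and `ranked` are names introduced here for illustration.

```python
import pandas as pd

# Correlations with Survived, copied from the df.corr() output above.
survived_corr = pd.Series({
    "Pclass": -0.338481,
    "Sex": 0.543351,
    "Age": -0.069809,
    "SibSp": -0.035322,
    "Parch": 0.081629,
    "Fare": 0.257307,
    "Cabin": -0.316912,
    "Embarked": 0.106811,
})

# Order by strength of the linear relationship, keeping the sign.
order = survived_corr.abs().sort_values(ascending=False).index
ranked = survived_corr.reindex(order)

print(ranked)
```

On the full frame the same row could presumably be obtained with `df.corr()['Survived'].drop('Survived')`. Recent pandas versions may need `numeric_only=True` while a string column such as Ticket is still present.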


### Preprocessing

In [23]:
from sklearn import preprocessing

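In [40]:
# Execution count 40: this cell ran after the Ticket column was dropped (In [25] below),
# so `dados` holds only the numeric and boolean columns.
dados = df.values
df.head()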

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,0,3,0,22.0,1,0,7.25,True,1
1,1,1,1,38.0,1,0,71.2833,False,2
2,1,3,1,26.0,0,0,7.925,True,1
3,1,1,1,35.0,1,0,53.1,False,1
4,0,3,0,35.0,0,0,8.05,True,1


# Test

In [25]:
df.drop('Ticket', inplace=True, axis=1)
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,0,3,0,22.0,1,0,7.25,True,1
1,1,1,1,38.0,1,0,71.2833,False,2
2,1,3,1,26.0,0,0,7.925,True,1
3,1,1,1,35.0,1,0,53.1,False,1
4,0,3,0,35.0,0,0,8.05,True,1


In [33]:
# Note: replace() returns a new DataFrame; without assignment df is left unchanged.
df.replace('W./C. 6607', df['Fare'].mean())
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,0,3,0,22.0,1,0,7.25,True,1
1,1,1,1,38.0,1,0,71.2833,False,2
2,1,3,1,26.0,0,0,7.925,True,1
3,1,1,1,35.0,1,0,53.1,False,1
4,0,3,0,35.0,0,0,8.05,True,1


#### Features

In [41]:
X = dados[:, 1:]
X

array([[3, 0, 22.0, ..., 7.25, True, 1],
       [1, 1, 38.0, ..., 71.2833, False, 2],
       [3, 1, 26.0, ..., 7.925, True, 1],
       ..., 
       [3, 1, 29.69911764705882, ..., 23.45, True, 1],
       [1, 0, 26.0, ..., 30.0, False, 2],
       [3, 0, 32.0, ..., 7.75, True, 3]], dtype=object)

#### Labels

In [42]:
y = dados[:, 0]
y[0:10]

array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1], dtype=object)

In [43]:
df.dtypes

Survived      int64
Pclass        int64
Sex           int64
Age         float64
SibSp         int64
Parch         int64
Fare        float64
Cabin          bool
Embarked      int64
dtype: object
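
Every column is numeric or boolean, yet `X` and `y` above are `dtype=object` because `df.values` has to fit integers, floats and booleans into a single array. One possible way to get plain numeric arrays is sketched below on a made-up frame (`toy` is an assumption that mirrors a few of the columns).

```python
import pandas as pd

# Made-up frame with the same mix of int, float and bool columns.
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Age": [22.0, 38.0, 26.0],
    "Cabin": [True, False, True],
})

# Mixed bool and numeric columns fall back to object, as in the output above.
mixed = toy.values

# Selecting by name and casting explicitly gives float features and integer labels.
X = toy.drop(columns="Survived").to_numpy(dtype=float)
y = toy["Survived"].to_numpy()

print(mixed.dtype, X.dtype, y.dtype)
```

Selecting the label by name also avoids relying on Survived being the first column.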

### Learning

In [44]:
from sklearn import tree

In [45]:
clf = tree.DecisionTreeRegressor()

In [46]:
clf.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
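
Survived is a 0/1 label, so a classifier is the more usual fit for this task than a regressor. The sketch below is not the notebook's model: it trains `DecisionTreeClassifier` on synthetic data and holds out a validation split so the score is not measured on the training rows. The data, the labelling rule, `max_depth=3` and the 25% split are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in features: a 0/1 column (like Sex) and a 1-3 column (like Pclass).
X = np.column_stack([
    rng.integers(0, 2, size=200),
    rng.integers(1, 4, size=200),
]).astype(float)

# Hypothetical labelling rule: the label simply copies the first column.
y = X[:, 0].astype(int)

# Hold out a quarter of the rows for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A shallow tree; max_depth is an arbitrary choice for this sketch.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Mean accuracy on the held-out rows.
val_accuracy = clf.score(X_val, y_val)
print(val_accuracy)
```

With the real data, the labels would need an integer dtype first, since the `y` above is `dtype=object`.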

In [None]:
tdf = pd.re