<center><h1>Titanic: Machine Learning from Disaster</h1></center>
![Titanic](img/titanic.jpg)

# Introduction to decision trees
- In the previous chapter we found subsets of individuals that have a higher chance of surviving. 
- A decision tree automates this process for you and outputs a classification model or classifier.

- The tree starts with all the data at the root node.
- It scans all the variables for the best one to split on.
- Once a variable is chosen, the data is split, and the process repeats one level down in each resulting node.
- The final nodes at the bottom of the decision tree are known as terminal nodes.
- The majority vote of the observations in a terminal node determines the prediction for new observations that end up in that node.

![Decision Tree](img/decisionTree.png)
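
The "best variable to split on" step can be made concrete with a small sketch. It assumes Gini impurity as the quality measure (the default criterion in scikit-learn's `DecisionTreeClassifier`) and uses made-up toy rows; the helper names `gini` and `best_split` are hypothetical, and this is a simplified illustration rather than the library's actual implementation.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Try every feature and threshold; return (feature, threshold, weighted impurity)."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must send observations to both sides
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or weighted < best[2]:
                best = (feature, threshold, weighted)
    return best

# Toy data (made up): columns are [sex, pclass], the label is survived
rows = [[1, 1], [1, 3], [1, 2], [0, 1], [0, 3], [0, 2]]
labels = [1, 1, 1, 0, 0, 0]

feature, threshold, impurity = best_split(rows, labels)
print(feature, threshold, impurity)
```

In this toy data the first column separates the two classes perfectly, so it is chosen as the split. A full tree would repeat the same search inside each of the two resulting nodes.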

# Instructions 

- Import the <b>numpy</b> library as <b>np</b>
- From <b>sklearn</b>, import the <b>tree</b> module

In [4]:
# Import the Numpy library

# Import 'tree' from scikit-learn library
from sklearn 

In [22]:
# Import the Numpy library
import numpy as np

# Import 'tree' from scikit-learn library
from sklearn import tree

## Cleaning and Formatting your Data

- Before you can start, you have to clean up the data so that you can use all the features available
- The Age variable includes missing values
- Substitute the missing values with the median of all present values


    train["Age"] = train["Age"].fillna(train["Age"].median())

- The <b>Sex</b> and <b>Embarked</b> variables are non-numeric, and <b>Embarked</b> also includes missing values
- You can assign each class in <b>Sex</b> a unique integer
- You can substitute the missing values in <b>Embarked</b> with the most common class, which is <b>"S"</b>
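
The cleaning steps above can be tried out on a tiny frame first. This is a minimal sketch on made-up data, not the Titanic set itself; it uses `.mode()` to look up the most common class and `.map()` with an explicit dictionary as one possible way to do the integer encoding.

```python
import numpy as np
import pandas as pd

# Toy data (made up) with the same kinds of gaps as the Titanic columns
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", np.nan, "C", "S"],
})

# Numeric column: fill gaps with the median of the observed values
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Categorical column: find the most common class, then fill gaps with it
most_common_port = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_common_port)

# Encode each class as an integer with an explicit mapping
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})
toy["Embarked"] = toy["Embarked"].map({"S": 0, "C": 1, "Q": 2})

print(toy)
```

Note that `.map()` turns any class missing from the dictionary into `NaN`, so it is worth imputing before encoding, as done here.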

## Instructions

- Assign the integer 1 to all females
- Impute missing values in <b>Embarked</b> with class <b>S</b> and in <b>Age</b> with its median. Use the <b>.fillna()</b> method.
- Replace each class of Embarked with a unique integer: <b>0</b> for <b>S</b>, <b>1</b> for <b>C</b>, and <b>2</b> for <b>Q</b>.
- Print the Sex and Embarked columns

In [8]:
########################## Ignore this part #####################################
import pandas as pd

# Load the train and test datasets to create two DataFrames
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)

test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)
#################################################################################

# Convert the male and female groups to integer form
train.loc[train.Sex == "male", "Sex"] = 0

# Impute the Embarked variable
train["Embarked"] = 

# Convert the Embarked classes to integer form
train.loc[train.Embarked == "S", "Embarked"] = 0

# Print the Sex and Embarked columns


In [15]:
########################## Ignore this part #####################################
import pandas as pd

# Load the train and test datasets to create two DataFrames
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)

test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)

#################################################################################

# Convert the male and female groups to integer form
train.loc[train.Sex == "male", "Sex"] = 0
train.loc[train.Sex == "female", "Sex"] = 1

# Impute the Embarked variable
train["Embarked"] = train["Embarked"].fillna("S")

# Impute the Age variable
train["Age"] = train["Age"].fillna(train["Age"].median())

# Convert the Embarked classes to integer form
train.loc[train.Embarked == "S", "Embarked"] = 0
train.loc[train.Embarked == "C", "Embarked"] = 1
train.loc[train.Embarked == "Q", "Embarked"] = 2

#Print the Sex and Embarked columns
print(train[["Sex", "Embarked"]].head())

## Creating your first decision tree
- Build your first decision tree, using scikit-learn and numpy. 
- Tree objects can be generated using the DecisionTreeClassifier class. 
- The methods that we will use take numpy arrays as inputs 
- We will reformat the DataFrame that we already have. 

We will need the following to build a decision tree

- <b>target</b>: A one-dimensional numpy array containing the target/response from the train data. (<b>Survived</b> in your case)
- <b>features</b>: A multidimensional numpy array containing the features/predictors from the train data. (e.g. <b>Sex</b>, <b>Age</b>)

Take a look at the sample code below to see what this would look like:

In [14]:
target = train["Survived"].values
features = train[["Sex", "Age"]].values
my_tree = tree.DecisionTreeClassifier()
my_tree = my_tree.fit(features, target)

- Sneak peek: See the importance of the features that are included in the decision tree
- This can be done by requesting the <b>.feature\_importances_</b> attribute of your tree object
- Another quick metric is the mean accuracy, which you can compute using the <b>.score()</b> method with <b>features_one</b> and <b>target</b> as arguments. Because these are the training data, this is the accuracy on data the tree has already seen.
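
To see what these two quantities look like, here is a small sketch on made-up data in which the first column fully determines the label and the second column does not help. The names `toy_features`, `toy_target`, and `toy_tree` are illustrative only.

```python
import numpy as np
from sklearn import tree

# Toy data (made up): the first column decides the label, the second is irrelevant
toy_features = np.array([[0, 5], [0, 3], [1, 5], [1, 3], [0, 5], [1, 3]])
toy_target = np.array([0, 0, 1, 1, 0, 1])

toy_tree = tree.DecisionTreeClassifier(random_state=1)
toy_tree = toy_tree.fit(toy_features, toy_target)

# One importance value per feature column; the values sum to 1
importances = toy_tree.feature_importances_
print(importances)

# Mean accuracy on the data passed in (here, the training data itself)
training_accuracy = toy_tree.score(toy_features, toy_target)
print(training_accuracy)
```

The importances come back in the same order as the columns of the feature array, so keeping the list of column names at hand makes them easier to read.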

Ok, time for you to build your first decision tree in Python! The train and testing data from chapter 1 are available in your workspace.

### Instructions

- Build the <b>target</b> and <b>features_one</b> numpy arrays. The target will be based on the <b>Survived</b> column in <b>train</b>. The features array will be based on the variables Passenger Class, Sex, Age, and Passenger Fare
- Build a decision tree <b>my_tree_one</b> to predict survival using <b>features_one</b> and <b>target</b>
- Look at the importance of features in your tree and compute the score

In [3]:
# Print the train data to see the available features
print(train)

# Create the target and features numpy arrays: target, features_one
target = train[___].values
features_one = train[["Pclass", "Sex", "Age", "Fare"]].values

# Fit your first decision tree: my_tree_one
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(___, ___)

# Look at the importance and score of the included features
print(my_tree_one.feature_importances_)
print(my_tree_one.score(___, ___))

In [16]:
# Print the train data to see the available features
#print(train.head())

# Create the target and features numpy arrays: target, features_one
target = train["Survived"].values
features_one = train[["Pclass", "Sex", "Age", "Fare"]].values

# Fit your first decision tree: my_tree_one
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(features_one, target)

# Look at the importance and score of the included features
print(my_tree_one.feature_importances_)
print(my_tree_one.score(features_one, target))

## Interpreting your decision tree
- The <b>feature\_importances_</b> attribute makes it simple to interpret the significance of the predictors you include.
- What variable plays the most important role in determining whether or not a passenger survived? 

Have a look at your model (<b>my_tree_one</b>) to find out!

1. [ ] Passenger Class

2. [ ] Sex/Gender

3. [ ] Passenger Fare

4. [ ] Age

1. [ ] Passenger Class

2. [ ] Sex/Gender

3. [ x ] Passenger Fare

4. [ ] Age

## Predict on unseen data
- In the last exercise we created simple predictions based on a single subset. 
- We can make use of some simple functions to "generate" our answer without having to manually perform subsetting.

1. First, you make use of the <b>.predict()</b> method. 
2. You call it on the model (<b>my_tree_one</b>) and pass it the feature values from the dataset for which predictions need to be made (<b>test</b>). 
3. You extract those features as a numpy array, in the same column order as for training.

<b>BUT NOT SO FAST</b>: There are missing values in the <b>Fare</b> and <b>Age</b> features, which will have to be imputed first.
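
A quick way to find out which columns need imputing is to count the missing values per column. The sketch below does this on a made-up frame standing in for the test set; `toy_test` and its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame (made up) standing in for the test set
toy_test = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Age": [34.5, np.nan, 62.0],
    "Fare": [7.8, 71.3, np.nan],
})

# Count missing values per column
missing_counts = toy_test.isnull().sum()
print(missing_counts)

# Keep only the names of the columns that actually contain gaps
columns_to_impute = missing_counts[missing_counts > 0].index.tolist()
print(columns_to_impute)
```

Running the same check on the real test set before calling `.predict()` shows whether any gaps remain in the selected features.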

### Instructions

- Impute the missing values for Fare and Age with the median of the respective column.
- Make a prediction on the test set using the <b>.predict()</b> method and <b>my_tree_one</b>. Assign the result to <b>my_prediction</b>.
- Create a data frame <b>my_solution</b> containing the solution and the passenger ids from the test set. Make sure the solution is in line with the standards set forth by Kaggle by naming the column appropriately.
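
For the Titanic competition, the submission file is a CSV with the two columns `PassengerId` and `Survived`. The sketch below shows one way to get that header from a data frame whose index holds the ids, using the `index_label` argument of `.to_csv()`. The ids and predictions are made up, and the output goes to an in-memory buffer so that nothing is written to disk.

```python
import io

import numpy as np
import pandas as pd

# Made-up values standing in for my_prediction and test["PassengerId"]
toy_prediction = np.array([0, 1, 0])
toy_ids = np.array([892, 893, 894])

# The ids become the index; the single data column is named "Survived"
toy_solution = pd.DataFrame(toy_prediction, index=toy_ids, columns=["Survived"])

# index_label names the index column in the CSV header
buffer = io.StringIO()
toy_solution.to_csv(buffer, index_label="PassengerId")
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

To write an actual file, pass a file name of your choice to `.to_csv()` instead of the buffer.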

In [4]:
########################## Ignore this part ######################################
test.loc[test.Sex == "male", "Sex"] = 0
test.loc[test.Sex == "female", "Sex"] = 1
#################################################################################

# Impute the missing value with the median
test["Fare"] = 
test["Age"] =

# Extract the features from the test set: Pclass, Sex, Age, and Fare.
test_features = test[[___, ___, ___, ___]].values

# Make your prediction using the test set
my_prediction = my_tree_one.predict(test_features)

# Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions
PassengerId = np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(my_prediction, PassengerId, columns = ["Survived"])
print(my_solution)

In [None]:
########################## Ignore this part #####################################
test.loc[test.Sex == "male", "Sex"] = 0
test.loc[test.Sex == "female", "Sex"] = 1
#################################################################################

# Impute the missing value with the median
test["Fare"] = test["Fare"].fillna(test["Fare"].median())
test["Age"] = test["Age"].fillna(test["Age"].median())

# Extract the features from the test set: Pclass, Sex, Age, and Fare.
test_features = test[["Pclass", "Sex", "Age", "Fare"]].values

# Make your prediction using the test set
my_prediction = my_tree_one.predict(test_features)
print(my_prediction)

# Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions
PassengerId = np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(my_prediction, PassengerId, columns = ["Survived"])
print(my_solution.head())

## Overfitting and how to control it

- With the default values for <b>max_depth</b> and <b>min_samples_split</b>, the tree keeps splitting until its nodes are pure or too small to split, which is likely to overfit our data
- Such a model describes the training data extremely well, but it does not generalize to new data
- We can improve on this by setting limits for <b>max_depth</b> and <b>min_samples_split</b>
- <b>max_depth</b>: The maximum number of levels in the tree. Splitting stops once this depth is reached.
- <b>min_samples_split</b>: The minimum number of observations a node must contain before it can be split (e.g. minimum 10 passengers).
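
The effect of these two arguments can be explored on synthetic data. The sketch below is an illustration under stated assumptions, not part of the Titanic exercise: it generates made-up noisy data, holds out half of it with `train_test_split`, and compares a default tree with a constrained one. All variable names are hypothetical.

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split

# Synthetic data (made up): one informative feature plus label noise
rng = np.random.RandomState(1)
X = rng.uniform(size=(400, 3))
y = (X[:, 0] > 0.5).astype(int)
flip = rng.uniform(size=400) < 0.2  # flip roughly 20% of the labels
y[flip] = 1 - y[flip]

X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.5, random_state=1
)

# Default settings: the tree grows until it fits the training data
deep_tree = tree.DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# Constrained settings: limited depth and a minimum node size for splitting
shallow_tree = tree.DecisionTreeClassifier(
    max_depth=3, min_samples_split=10, random_state=1
).fit(X_train, y_train)

for name, model in [("default", deep_tree), ("constrained", shallow_tree)]:
    print(name, model.get_depth(),
          model.score(X_train, y_train),
          model.score(X_holdout, y_holdout))
```

The gap between the training score and the held-out score is the thing to watch: a large gap suggests the tree has memorized noise in the training data.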

## Instructions

- Include the Siblings/Spouses Aboard, Parents/Children Aboard, and Embarked features in a new set of features.
- Fit your second tree <b>my_tree_two</b> with the new features, and control the model complexity by setting the <b>max_depth</b> and <b>min_samples_split</b> arguments.

In [5]:
# Create a new array with the added features: features_two
features_two = train[["Pclass","Age","Sex","Fare", ___, ___, ___]].values

#Control overfitting by setting "max_depth" to 10 and "min_samples_split" to 5 : my_tree_two
max_depth = 
min_samples_split =
my_tree_two = tree.DecisionTreeClassifier(max_depth = ___, min_samples_split = ____, random_state = 1)
my_tree_two = 

#Print the score of the new decision tree


In [None]:
# Create a new array with the added features: features_two
features_two = train[["Pclass","Age","Sex","Fare", "SibSp", "Parch", "Embarked"]].values

# Control overfitting by setting "max_depth" to 10 and "min_samples_split" to 5 : my_tree_two
max_depth = 10
min_samples_split = 5
my_tree_two = tree.DecisionTreeClassifier(max_depth = max_depth, min_samples_split = min_samples_split, random_state = 1)
my_tree_two = my_tree_two.fit(features_two, target)
my_tree_two_score = my_tree_two.score(features_two, target)

# Print the score of the new decision tree
print(my_tree_two_score)

## Feature-Engineering for our Titanic data set
While feature engineering is a discipline in itself, too broad to be covered here in detail, you will have a look at a simple example by creating your own new predictive attribute: <b>family_size</b>.

Assumption: "Larger families need more time to get together on a sinking ship"
Hence: Lower chance of survival

- Family size is determined by the variables <b>SibSp</b> and <b>Parch</b>
- Add a new variable <b>family_size</b>, which is the sum of <b>SibSp</b> and <b>Parch</b> plus one (the observation itself), to the test and train set.
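
As a quick check of the arithmetic, here is the same calculation on a made-up frame with three passengers. The frame `toy` is illustrative only; working on a copy leaves the original frame untouched.

```python
import pandas as pd

# Toy data (made up): siblings/spouses and parents/children aboard
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Work on a copy so the original frame keeps its columns
toy_two = toy.copy()
toy_two["family_size"] = toy_two["SibSp"] + toy_two["Parch"] + 1

print(toy_two)
```

A passenger travelling alone ends up with a family size of 1, because the passenger is counted as well.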

## Instructions
- Create a new train set train_two that differs from train only by having an extra column with your feature engineered variable family_size.
- Add your feature engineered variable <b>family_size</b> in addition to <b>Pclass</b>, <b>Sex</b>, <b>Age</b>, <b>Fare</b>, <b>SibSp</b> and <b>Parch</b> to <b>features_three</b>.
- Create a new decision tree as <b>my_tree_three</b> and fit the decision tree with your new feature set <b>features_three</b>. Then check out the score of the decision tree.

In [6]:
# Create train_two with the newly defined feature
train_two = train.copy()
train_two["family_size"] = 

# Create a new feature set and add the new feature
features_three = train_two[["Pclass", "Sex", "Age", "Fare", "SibSp", "Parch", ___]].values

# Define the tree classifier, then fit the model
my_tree_three = tree.DecisionTreeClassifier()
my_tree_three = 

# Print the score of this decision tree
print(my_tree_three.score(features_three, target))

In [None]:
# Create train_two with the newly defined feature
train_two = train.copy()
train_two["family_size"] = train_two[["SibSp", "Parch"]].sum(axis=1) + 1

# Create a new feature set and add the new feature
features_three = train_two[["Pclass", "Sex", "Age", "Fare", "SibSp", "Parch", "family_size"]].values

# Define the tree classifier, then fit the model
my_tree_three = tree.DecisionTreeClassifier()
my_tree_three = my_tree_three.fit(features_three, target)

# Print the score of this decision tree
print(my_tree_three.score(features_three, target))

## A Random Forest analysis in Python
A detailed study of Random Forests would take this tutorial a bit too far. However, since it's an often used machine learning technique, gaining a general understanding in Python won't hurt.

In layman's terms, the Random Forest technique handles the overfitting problem you faced with decision trees. 

- Grows multiple (very deep) classification trees using the training set. 
- Each tree is used to come up with a prediction
- Every outcome is counted as a vote. 

For example, if you have trained 3 trees and 2 of them say a passenger in the test set will survive while 1 says he will not, the passenger will be classified as a survivor. 

This approach of overtraining trees, but having the majority's vote count as the actual classification decision, reduces overfitting.
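
The voting idea can be written down in a few lines. This is a conceptual sketch with a hypothetical helper `majority_vote`, not how scikit-learn does it internally: its `RandomForestClassifier` averages the class probabilities predicted by the trees rather than counting hard votes.

```python
from collections import Counter

def majority_vote(votes):
    """Return the class that most trees predicted."""
    return Counter(votes).most_common(1)[0][0]

# Three trees: two predict survival (1), one predicts no survival (0)
tree_votes = [1, 1, 0]
forest_prediction = majority_vote(tree_votes)
print(forest_prediction)
```

With an even number of trees a tie is possible. This sketch then returns whichever class was counted first, which is one reason an odd number of voters is convenient for two-class problems.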

- Use <b>RandomForestClassifier()</b> class instead of the <b>DecisionTreeClassifier()</b> class.
- <b>n_estimators</b> sets the number of trees you wish to plant and average over. It is worth setting explicitly when using the <b>RandomForestClassifier()</b> class, since the default value has changed between scikit-learn versions.

## Instructions

- Build the random forest with n_estimators set to 100.
- Fit your random forest model with inputs features_forest and target.
- Compute the classifier predictions on the selected test set features.

In [9]:
########################## Ignore this part ######################################
test.loc[test.Embarked == "S", "Embarked"] = 0
test.loc[test.Embarked == "C", "Embarked"] = 1
test.loc[test.Embarked == "Q", "Embarked"] = 2
##################################################################################

# Import the `RandomForestClassifier`
from sklearn.ensemble import RandomForestClassifier

# We want the Pclass, Age, Sex, Fare,SibSp, Parch, and Embarked variables
features_forest = train[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values

# Building and fitting my_forest
forest = RandomForestClassifier(max_depth = 10, min_samples_split=2, n_estimators = ___, random_state = 1)
my_forest = forest.fit(___, ___)

# Print the score of the fitted random forest
print(my_forest.score(features_forest, target))

# Compute predictions on our test set features then print the length of the prediction vector
test_features = test[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values
pred_forest = my_forest.predict(___)
print(len(pred_forest))

In [None]:
# Ignore this part
test.loc[test.Embarked == "S", "Embarked"] = 0
test.loc[test.Embarked == "C", "Embarked"] = 1
test.loc[test.Embarked == "Q", "Embarked"] = 2
################################################################################

# Import the `RandomForestClassifier`
from sklearn.ensemble import RandomForestClassifier

# We want the Pclass, Age, Sex, Fare,SibSp, Parch, and Embarked variables
features_forest = train[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values

# Building and fitting my_forest
forest = RandomForestClassifier(max_depth = 10, min_samples_split=2, n_estimators = 100, random_state = 1)
my_forest = forest.fit(features_forest, target)

# Print the score of the fitted random forest
print(my_forest.score(features_forest, target))

# Compute predictions on our test set features then print the length of the prediction vector
test_features = test[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values

pred_forest = my_forest.predict(test_features)
print(len(pred_forest))

## Interpreting and Comparing
Remember how we looked at the <b>.feature\_importances_</b> attribute for the decision trees? 

- Well, you can request the same attribute from your random forest as well 
- You might also want to compare the models in some quick and easy way. For this, we can use the <b>.score()</b> method. 
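
A raw importance array is easier to read when each value is paired with its column name. The sketch below shows one way to do that with `zip` and `sorted`. The feature names and importance values are placeholders made up for illustration; with a fitted model you would pass the list of column names used for training and the model's `feature_importances_` instead.

```python
# Placeholder names and values (made up), standing in for the training
# columns and a fitted model's feature_importances_
feature_names = ["feature_a", "feature_b", "feature_c"]
toy_importances = [0.2, 0.7, 0.1]

# Pair each name with its importance and sort from most to least important
ranked = sorted(
    zip(feature_names, toy_importances),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, importance in ranked:
    print(f"{name:12s}{importance:.2f}")

top_feature = ranked[0][0]
```

The pairing only makes sense if the names are listed in the same order as the columns of the array the model was fitted on.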

For this exercise, you have <b>my_forest</b> and <b>my_tree_two</b> available to you. The features and target arrays are also ready for use.

## Instructions
- Explore the feature importance for both models
- Compare the mean accuracy score of the two models

In [7]:
# Request and print the `.feature_importances_` attribute
print(my_tree_two.feature_importances_)
print()

# Compute and print the mean accuracy score for both models
print(my_tree_two.score(features_two, target))
print()

In [None]:
#Request and print the `.feature_importances_` attribute
print(my_tree_two.feature_importances_)
print(my_forest.feature_importances_)

#Compute and print the mean accuracy score for both models
print(my_tree_two.score(features_two, target))
print(my_forest.score(features_two, target))

## Conclude and Submit
Based on your findings in the previous exercise, determine which feature was most important, and for which model.

Use <b>my_forest</b>, <b>my_tree_two</b>, and <b>feature\_importances_</b> to answer the following questions.

1. [ ] The most important feature was "<b>Age</b>", but it was more significant for "<b>my_tree_two</b>"

2. [ ] The most important feature was "<b>Sex</b>", but it was more significant for "<b>my_tree_two</b>"

3. [ ] The most important feature was "<b>Sex</b>", but it was more significant for "<b>my_forest</b>"

4. [ ] The most important feature was "<b>Age</b>", but it was more significant for "<b>my_forest</b>"

1. [ ] The most important feature was "<b>Age</b>", but it was more significant for "<b>my_tree_two</b>"

2. [ x ] The most important feature was "<b>Sex</b>", but it was more significant for "<b>my_tree_two</b>"

3. [ ] The most important feature was "<b>Sex</b>", but it was more significant for "<b>my_forest</b>"

4. [ ] The most important feature was "<b>Age</b>", but it was more significant for "<b>my_forest</b>"