# Project: Titanic Data Exploration

In 1912, the RMS Titanic struck an iceberg on its maiden voyage and sank, resulting in the deaths of most of its passengers and crew. I will explore the RMS Titanic passenger manifest to determine which features best predict whether a passenger survived. To complete this project, I will implement several conditional predictions and answer the questions below.

#### Loading the data: to work with the RMS Titanic passenger data, we first import the functionality we need and load the data into a pandas DataFrame.


In [8]:
import numpy as np
import pandas as pd
from IPython.display import display
%matplotlib inline

#Load the dataset
in_file = 'titanic_data.csv'
full_data = pd.read_csv(in_file)

#Print the first few entries of the RMS Titanic data
full_data.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S |


In [9]:
#Store the 'Survived' feature in a new variable and remove it from the dataset
outcomes = full_data['Survived']
data = full_data.drop('Survived', axis = 1)

#Show the new dataset with the target removed
data.head()

|   | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |
| 3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S |


#### Out of the first five passengers, if we predict that all of them survived, what would you expect the accuracy of our predictions to be?


In [11]:
def accuracy_score(truth, pred):
    """Returns the accuracy score for the true outcomes and the predictions"""
    
    #Ensure that the number of predictions matches the number of outcomes
    if len(truth) == len(pred):
        
        #Calculate and return the accuracy as a percent
        return "The predictions have an accuracy of {:.2f}%.".format((truth == pred).mean()*100)
    else:
        return "The number of predictions does not match the number of outcomes"
    
# Test the function
predictions = pd.Series(np.ones(5, dtype = int))
print(accuracy_score(outcomes[:5], predictions))

The predictions have an accuracy of 60.00%.
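
The 60% figure can be checked by hand: the `Survived` column for the first five passengers shown above reads 0, 1, 1, 1, 0, so predicting survival for everyone is right three times out of five. A minimal sketch of that arithmetic, with the labels copied from the table (the names `first_five` and `all_survived` are illustrative only and not part of the notebook):

```python
import pandas as pd

# Survived labels of the first five passengers, copied from the table above
first_five = pd.Series([0, 1, 1, 1, 0])

# Predict that every one of them survived
all_survived = pd.Series([1] * len(first_five))

# Fraction of matching entries, expressed as a percentage
accuracy = (first_five == all_survived).mean() * 100
print("{:.2f}%".format(accuracy))
```

This is the same comparison `accuracy_score` performs, without the length check or the message formatting.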


#### The majority of passengers did not survive, so what would our accuracy be if we assumed all passengers died aboard the ship?


In [14]:
def prediction_0(data):
    """Model with no features. Always predicts a passenger did not survive"""
    
    predictions = []
    for _, passenger in data.iterrows():
        
        #Predict the survival of 'passenger'
        predictions.append(0)
        
    #Return our predictions
    return pd.Series(predictions)

#Make the predictions
predictions = prediction_0(data)
print(accuracy_score(outcomes, predictions))

The predictions have an accuracy of 61.62%.
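
A constant prediction is right exactly as often as the class it predicts occurs, so this accuracy is simply the share of passengers who did not survive. A minimal sketch of that equivalence on made-up labels (the `toy_outcomes` values are hypothetical and not taken from the manifest):

```python
import pandas as pd

# Hypothetical 0/1 outcomes, not the real data
toy_outcomes = pd.Series([0, 0, 0, 1, 1])

# Predict 0 (did not survive) for everyone
all_died = pd.Series([0] * len(toy_outcomes))

# Accuracy of the constant prediction
accuracy = (toy_outcomes == all_died).mean()

# Share of passengers labelled 0, computed directly from the outcomes
share_not_survived = 1 - toy_outcomes.mean()

print(accuracy, share_not_survived)
```

Any model that uses features should beat this number to be worth keeping, which is why it serves as a baseline here.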


#### Exploring the dataset shows that a large majority of males did not survive, whereas a majority of females did. If we predict that every female passenger survived and every male passenger did not, what would our accuracy be?


In [18]:
def prediction_1(data):
    """Model with one feature: 
                - Predict a passenger survived if they are female"""
    predictions = []
    for _, passenger in data.iterrows():

        predictions.append(1 if passenger['Sex'] == 'female' else 0)
        
    # Return our predictions
    return pd.Series(predictions)
# Make the predictions
predictions = prediction_1(data)
print(accuracy_score(outcomes, predictions))

The predictions have an accuracy of 78.68%.
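
The observation that most males died and most females survived can be checked by grouping the outcomes by sex. A minimal sketch on a small made-up frame (the rows below are hypothetical stand-ins; on the real data the same call would presumably be applied to `full_data`):

```python
import pandas as pd

# Hypothetical stand-in rows, not taken from the real manifest
toy = pd.DataFrame({
    'Sex':      ['male', 'male', 'male', 'male',
                 'female', 'female', 'female', 'female'],
    'Survived': [0, 0, 0, 1, 1, 1, 1, 0],
})

# The mean of a 0/1 column is the survival rate within each group
survival_by_sex = toy.groupby('Sex')['Survived'].mean()
print(survival_by_sex)
```

A group whose rate is far from 0.5 is a good candidate for a one-feature rule, because predicting the majority outcome for that group is right most of the time.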


#### The survival statistics show that the majority of males younger than 10 survived the sinking, whereas most males aged 10 or older did not. Let's build on our previous prediction: if a passenger was female, we predict they survived. If a passenger was male and younger than 10, we also predict they survived. Otherwise, we predict they did not survive.


In [26]:
def predictions_2(data):
    """ Model with two features: 
            - Predict a passenger survived if they are female.
            - Predict a passenger survived if they are male and younger than 10. """
    
    predictions = []
    for _, passenger in data.iterrows():
        
        predictions.append(1 if passenger['Sex'] == 'female' \
                           or (passenger['Sex'] == 'male' and passenger['Age'] < 10) \
                           else 0)
    
    # Return our predictions
    return pd.Series(predictions)

# Make the predictions
predictions = predictions_2(data)
print(accuracy_score(outcomes, predictions))


The predictions have an accuracy of 79.35%.
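
The same two rules can also be written with boolean masks instead of a row-by-row loop. The sketch below is one possible vectorized equivalent, not the notebook's own code; `predictions_2_vectorized` and the `toy` frame are hypothetical names and values introduced for illustration:

```python
import numpy as np
import pandas as pd

def predictions_2_vectorized(data):
    """Predict survival for females and for males younger than 10."""
    is_female = data['Sex'] == 'female'
    # Comparisons with NaN evaluate to False, so a male passenger
    # of unknown age is not treated as younger than 10
    is_young_male = (data['Sex'] == 'male') & (data['Age'] < 10)
    # Combine the masks and convert True/False to 1/0
    return (is_female | is_young_male).astype(int).reset_index(drop=True)

# Hypothetical stand-in rows, including one missing age
toy = pd.DataFrame({
    'Sex': ['female', 'male', 'male', 'male'],
    'Age': [30.0, 4.0, 40.0, np.nan],
})
print(predictions_2_vectorized(toy).tolist())
```

The loop-based version treats a missing age the same way, since `passenger['Age'] < 10` is also false when the age is NaN.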


#### Find a series of features and conditions to split the data on to obtain an outcome prediction accuracy of at least 80%

In [41]:
def predictions_3(data):
    """ Model with multiple features. Makes a prediction with an accuracy of at least 80%. """
    
    predictions = []
    for _, passenger in data.iterrows():
        
        # Predict survival for all females, for males younger than 10
        # in first or second class, and for males who paid a fare above 500
        predictions.append(1 if (passenger['Sex'] == 'female') \
                           or (passenger['Sex'] == 'male' and passenger['Age'] < 10 and passenger['Pclass'] <= 2) \
                           or (passenger['Sex'] == 'male' and passenger['Fare'] > 500) else 0)
    
    # Return our predictions
    return pd.Series(predictions)

# Make the predictions
predictions = predictions_3(data)
print(accuracy_score(outcomes, predictions))


The predictions have an accuracy of 80.13%.
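
One way to look for conditions like these is to tabulate the survival rate and group size for combinations of features. The sketch below shows the idea on a small made-up frame; the rows are hypothetical, and this is only an assumption about how such an exploration could be done, not a record of how the conditions above were found:

```python
import pandas as pd

# Hypothetical stand-in rows, not taken from the real manifest
toy = pd.DataFrame({
    'Sex':      ['male', 'male', 'male', 'male', 'female', 'female'],
    'Pclass':   [1, 1, 3, 3, 1, 3],
    'Survived': [1, 1, 0, 0, 1, 0],
})

# Survival rate and number of passengers for each (Sex, Pclass) combination
summary = toy.groupby(['Sex', 'Pclass'])['Survived'].agg(['mean', 'count'])
print(summary)
```

The `count` column matters as much as the `mean`: a rule built on a group of only a few passengers may fit this dataset without generalizing.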


## Conclusion

After several iterations of exploring and conditioning on the data, I have built a rule-based algorithm that predicts the survival of each passenger aboard the RMS Titanic with 80.13% accuracy. The technique applied in this project is a manual implementation of a simple machine learning model, the decision tree. A decision tree splits a set of data into smaller and smaller groups (called nodes), one feature at a time. Each time a subset of the data is split, our predictions become more accurate if each of the resulting subgroups is more homogeneous (contains more similar labels) than before. The advantage of having a computer build the tree is that it can search the candidate splits more exhaustively and more precisely than the manual exploration above.
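
Homogeneity can be made concrete with an impurity measure. The sketch below uses Gini impurity, which is one common choice for decision trees; the project itself does not name a measure, so treat this as an illustrative assumption. The labels are made up, and `gini` and `weighted_gini` are hypothetical helper names:

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels; 0 means perfectly homogeneous."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)  # share of label 1
    return 2 * p * (1 - p)

def weighted_gini(groups):
    """Size-weighted average impurity of the groups produced by a split."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini(g) for g in groups)

# Hypothetical labels, not taken from the real manifest
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [0, 0, 0, 0]

print(gini(parent))                  # impurity before the split
print(weighted_gini([left, right]))  # impurity after the split
```

A split is worth making when the weighted impurity of the subgroups is lower than that of the parent, as it is here. An automated tree learner evaluates many candidate splits this way and keeps the best one at each node.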

A decision tree is just one of many models used in supervised learning. In supervised learning, we use features of the data to predict or model outcomes that have objective labels. That is, each data point has a known outcome value, such as a categorical, discrete label like 'Survived', or a numerical, continuous value like the price of a house.