# Titanic: Machine Learning from Disaster

## Getting Started

When the Titanic sank, 1502 of the 2224 passengers and crew were killed. One of the main reasons for this high level of casualties was the lack of lifeboats on this self-proclaimed "unsinkable" ship.

Those who have seen the movie know that some individuals were more likely to survive the sinking (lucky Rose) than others (poor Jack). In this course, you will learn how to apply machine learning techniques in Python to predict a passenger's chance of surviving.

Let's start by loading the training and test sets into your Python environment. You will use the training set to build your model, and the test set to validate it. The data is stored on the web as CSV files; their URLs are already available as character strings in the sample code. You can load this data with the <b>read_csv()</b> function from the Pandas library.

### Instructions 
    
- First, import the Pandas library as pd.
- Load the test data similarly to how the train data is loaded.
- Inspect the first couple rows of the loaded dataframes using the .head() method with the code provided.

In [1]:
# Import the Pandas library
import pandas as pd

# Load the train and test datasets to create two DataFrames
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)

test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)

# Print the `head` of the train and test DataFrames
print(train.head())
print(test.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
  

## Understanding your data

Before starting with the actual analysis, it's important to understand the structure of your data. Both <b>test</b> and <b>train</b> are DataFrame objects, which is how pandas represents datasets. You can easily explore a DataFrame using the <b>.describe()</b> method, which summarizes the numeric columns (features) of the DataFrame, including the count of non-missing observations, mean, max and so on. Another useful trick is to look at the dimensions of the DataFrame by requesting its <b>.shape</b> attribute (e.g., <b>your_data.shape</b>).

The training and test sets are already available in the workspace as <b>train</b> and <b>test</b>. Apply the <b>.describe()</b> method and print the <b>.shape</b> attribute of the training set. How many observations and variables does the training set include, and what is the count for the Age variable?

In [26]:
print(train.describe())
print(train.shape)

       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000         NaN    0.000000   
50%     446.000000    0.000000    3.000000         NaN    0.000000   
75%     668.500000    1.000000    3.000000         NaN    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  
(891, 12)


## Possible Answers
1. [ x ] The training set has 891 observations and 12 variables, count for Age is 714.

2. [ ] The training set has 418 observations and 11 variables, count for Age is 891.

3. [ ] The testing set has 891 observations and 11 variables, count for Age is 891.

4. [ ] The testing set has 418 observations and 12 variables, count for Age is 714.
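
The gap between the 891 rows reported by <b>.shape</b> and the count of 714 for Age comes from missing values: <b>.describe()</b> counts only non-missing entries. The following is a minimal sketch on a hypothetical toy DataFrame (the name <b>toy</b> and its values are made up for illustration and are not taken from the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the Titanic set): one age is missing.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 4.0],
    "Fare": [7.25, 71.28, 8.05, 16.70],
})

# .shape counts every row, whether or not it has missing values.
n_rows, n_cols = toy.shape

# .describe() reports, per numeric column, the number of non-missing entries.
summary = toy.describe()
non_missing = summary.loc["count"]

# Count the missing ages directly for comparison.
n_missing_age = toy["Age"].isna().sum()

print(n_rows, n_cols)
print(non_missing)
print(n_missing_age)
```

Applied to the training set, the same comparison would suggest that 891 - 714 = 177 passengers have no recorded age.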

## Rose vs Jack, or Female vs Male

How many people in your training set survived the Titanic disaster? To see this, you can use the <b>.value_counts()</b> method in combination with standard bracket notation to select a single column of a DataFrame:

    # absolute numbers
    train["Survived"].value_counts()

    # percentages
    train["Survived"].value_counts(normalize = True)

If you run these commands in the console, you'll see that 549 individuals died (62%) and 342 survived (38%). A simple heuristic for prediction is "majority wins": since most passengers died, you would predict that every unseen passenger does not survive.
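
To make the "majority wins" idea concrete, the sketch below computes how often that rule would be right on a handful of made-up labels. The Series <b>survived</b> and its values are hypothetical and only stand in for the real <b>Survived</b> column:

```python
import pandas as pd

# Hypothetical toy labels: 0 = died, 1 = survived.
survived = pd.Series([0, 0, 1, 0, 1, 0, 0, 1], name="Survived")

# The majority class is the most frequent label.
majority_class = survived.value_counts().idxmax()

# Predicting the majority class for everyone is correct exactly as often
# as that class occurs in the data.
baseline_accuracy = (survived == majority_class).mean()

print(majority_class)
print(baseline_accuracy)
```

On the training set, the same calculation would give 549 / 891, or roughly 0.62. Any model you build later should do better than this baseline to be worth using.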

To dive in a little deeper we can perform similar counts and percentage calculations on subsets of the Survived column. For example, maybe gender could play a role as well? You can explore this using the <b>.value_counts()</b> method for a two-way comparison on the number of males and females that survived, with this syntax:
    
    train["Survived"][train["Sex"] == 'male'].value_counts()
    train["Survived"][train["Sex"] == 'female'].value_counts()
    
To get proportions, you can again pass in the argument <b>normalize = True</b> to the <b>.value_counts()</b> method.
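
As an alternative to filtering one group at a time, a two-way table can be built in a single call with <b>pd.crosstab()</b>. This is a sketch on hypothetical toy data, not the approach the exercise asks for; the DataFrame <b>toy</b> and its values are made up:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the training set.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Absolute counts: one row per Sex, one column per Survived value.
counts = pd.crosstab(toy["Sex"], toy["Survived"])

# Proportions within each Sex (each row sums to 1).
rates = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")

print(counts)
print(rates)
```

Setting <b>normalize="index"</b> plays the same role here as <b>normalize=True</b> does for <b>.value_counts()</b> on a single group.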

### Instructions

- Calculate and print the survival rates in absolute numbers using the <b>.value_counts()</b> method.
- Calculate and print the survival rates as proportions by setting the <b>normalize</b> argument to <b>True</b>.
- Repeat the same calculations on subsets of the <b>Survived</b> column based on Sex.

In [27]:
# Passengers that survived vs passengers that passed away
print(train["Survived"].value_counts())

# As proportions
print(train["Survived"].value_counts(normalize=True))

# Males that survived vs males that passed away
print(train["Survived"][train["Sex"] == 'male'].value_counts())

# Females that survived vs Females that passed away
print(train["Survived"][train["Sex"] == 'female'].value_counts())

# Normalized male survival
print(train["Survived"][train["Sex"] == 'male'].value_counts(normalize=True))

# Normalized female survival
print(train["Survived"][train["Sex"] == 'female'].value_counts(normalize=True))

0    549
1    342
Name: Survived, dtype: int64
0    0.616162
1    0.383838
Name: Survived, dtype: float64
0    468
1    109
Name: Survived, dtype: int64
1    233
0     81
Name: Survived, dtype: int64
0    0.811092
1    0.188908
Name: Survived, dtype: float64
1    0.742038
0    0.257962
Name: Survived, dtype: float64


## Does age play a role?

Another variable that could influence survival is age, since children were probably saved first. You can test this by creating a new column with a categorical variable <b>Child</b>. <b>Child</b> will take the value 1 in cases where age is less than 18, and a value of 0 in cases where age is greater than or equal to 18.

To add this new variable you need to do two things: (i) create a new column, and (ii) provide the values for each observation (i.e., row) based on the age of the passenger.

Adding a new column with Pandas in Python is easy and can be done via the following syntax:

    your_data["new_var"] = 0
    
This code would create a new column in the <b>your_data</b> DataFrame titled <b>new_var</b> with <b>0</b> for each observation.

To set the values based on the age of the passenger, you make use of a boolean test inside the square bracket operator. With the <b>[]</b>-operator you create a subset of rows and assign a value to a certain variable of that subset of observations. For example,

    train.loc[train.Fare > 10, "new_var"] = 1
    
would give a value of <b>1</b> to the variable <b>new_var</b> for the subset of passengers whose fares are greater than 10. Remember that <b>new_var</b> keeps its value of 0 for all other passengers (including those with a missing fare).
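
The two steps can be sketched end to end on hypothetical toy data (the DataFrame <b>toy</b> and its fares are made up for illustration). The sketch uses <b>.loc</b>, the label-based indexer in current pandas releases; the older <b>.ix</b> indexer was removed in pandas 1.0:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one fare is missing and one is exactly 10.
toy = pd.DataFrame({"Fare": [7.25, 71.28, np.nan, 10.0]})

# Step (i): create the new column with a default value of 0.
toy["new_var"] = 0

# Step (ii): build a boolean mask and assign 1 where it is True.
# A comparison with a missing value evaluates to False, so that row keeps 0.
mask = toy["Fare"] > 10
toy.loc[mask, "new_var"] = 1

print(toy)
```

Only the second row ends up with <b>new_var</b> equal to 1: the missing fare and the fare of exactly 10 both fail the strict <b>> 10</b> test.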

A new column called <b>Child</b> in the train data frame has been created for you that takes the value <b>NaN</b> for all observations.

### Instructions

- Set the values of <b>Child</b> to <b>1</b> if the passenger's age is less than 18 years.
- Then assign the value <b>0</b> to observations where the passenger's age is greater than or equal to 18 years in the new <b>Child</b> column.
- Compare the normalized survival rates for those who are under 18 and those who are older. Use code similar to what you had in the previous exercise.

In [28]:
# Create the column Child and assign to 'NaN'
train["Child"] = float('NaN')

# Assign 1 to passengers under 18, 0 to those 18 or older.
# Passengers with a missing Age match neither condition and stay NaN.
train.loc[train.Age < 18, "Child"] = 1
train.loc[train.Age >= 18, "Child"] = 0

# Print normalized Survival Rates for passengers under 18
print(train["Survived"][train["Child"] == 1].value_counts(normalize=True))

# Print normalized Survival Rates for passengers 18 or older
print(train["Survived"][train["Child"] == 0].value_counts(normalize=True))

1    0.539823
0    0.460177
Name: Survived, dtype: float64
0    0.618968
1    0.381032
Name: Survived, dtype: float64
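
Because the count for Age is lower than the number of rows, some passengers have no recorded age. Those rows stay NaN in <b>Child</b> and fall into neither group. The sketch below, on hypothetical toy data, shows this and computes both survival rates in one step with <b>groupby</b> (taking the mean of a 0/1 column gives the proportion of 1s):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the third passenger has no recorded age.
toy = pd.DataFrame({
    "Age": [4.0, 30.0, np.nan, 17.0, 18.0, 40.0],
    "Survived": [1, 0, 0, 0, 1, 0],
})

# Start from NaN, then fill in 1 (under 18) and 0 (18 or older).
toy["Child"] = np.nan
toy.loc[toy["Age"] < 18, "Child"] = 1
toy.loc[toy["Age"] >= 18, "Child"] = 0

# Rows with a missing Age match neither condition.
n_unclassified = toy["Child"].isna().sum()

# groupby skips NaN group keys by default, so only the two groups remain.
rates = toy.groupby("Child")["Survived"].mean()

print(n_unclassified)
print(rates)
```

Whether to leave such passengers unclassified or to fill in their age first is a modelling choice; this sketch simply leaves them out.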


## First Prediction
In one of the previous exercises you discovered that in your training set, females had over a 50% chance of surviving and males had less than a 50% chance of surviving. Hence, you could use this information for your first prediction: all females in the test set survive and all males in the test set die.

You use your test set for validating your predictions. You might have noticed that, unlike the training set, the test set has no <b>Survived</b> column. You add such a column yourself and fill it with your predicted values.

### Instructions

- Create a variable <b>test_one</b>, identical to dataset <b>test</b>
- Add an additional column, <b>Survived</b>, that you initialize to zero.
- Use vector subsetting like in the previous exercise to set the value of <b>Survived</b> to 1 for observations whose <b>Sex</b> equals "female".
- Print the <b>Survived</b> column of predictions from the <b>test_one</b> dataset.


In [3]:
# Create a copy of test: test_one
test_one = test.copy()

# Initialize a Survived column to 0
test_one["Survived"] = 0

# Set Survived to 1 if Sex equals "female" and print the `Survived` column from `test_one`
# The mask is built from test_one itself, not from train, so that rows line up.
test_one.loc[test_one["Sex"] == "female", "Survived"] = 1
print(test_one["Survived"].head())

0    0
1    1
2    0
3    0
4    1
Name: Survived, dtype: int64
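
The same gender-based rule can be sketched on a small hypothetical test set (the DataFrame <b>toy_test</b> and its rows are made up). The key point is that the boolean mask has to come from the DataFrame being modified, so that each prediction is based on that row's own <b>Sex</b> value:

```python
import pandas as pd

# Hypothetical toy test set without a Survived column.
toy_test = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Sex": ["male", "female", "male"],
})

# Work on a copy so the original test data stays unchanged.
pred = toy_test.copy()
pred["Survived"] = 0
pred.loc[pred["Sex"] == "female", "Survived"] = 1

# Equivalent one-step version: convert the boolean mask to 0/1 integers.
alt = (toy_test["Sex"] == "female").astype(int)

print(pred[["PassengerId", "Survived"]])
```

Using <b>.copy()</b> rather than plain assignment matters here: without it, <b>pred</b> and <b>toy_test</b> would refer to the same object and the new column would appear in both.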
