<center><img src='img/ms_logo.jpeg' height=40% width=40%></center>


<center><h1>Building a Model to Predict Survival for Titanic Passengers</h1></center>


**Welcome to _DS2: Introduction to Machine Learning_**!  This course will be all about _predictive analytics_--that is, using data and algorithms to make accurate predictions.  For our introductory exercise, we're going to focus on one of the areas where machine learning really shines--**_Classification_**.  We're going to examine the data and build a simple model to predict whether or not a passenger survived the Titanic disaster.  Here's the catch: before we use any machine learning, we're going to build a classifier by hand to gain an intuition about how classification actually works.  
<br>
<br>
<center><h2>The Gameplan</h2></center>

We're going to start by building the simplest model possible, and then slowly add complexity as we notice patterns that can make our classifier more accurate.  

Recall that we've investigated this dataset before, in DS1. We're going to use our _Data Analysis_ and _Visualization_ skills from DS1 to investigate our dataset and see if we can find some patterns that we can use in our prediction algorithm. In order to successfully build a prediction algorithm, we'll use the following process:

**1.  Load and explore the data.**  
    --We'll begin by reading our data into a dataframe, and then visualizing our data to see if we can find certain groups that had higher survival rates than others.  At this step, we'll also remove the `Survived` column from the dataframe and store it in a separate variable.  
    
**2.  Write a prediction function.**
<br>
    -- We'll write a function that takes in a dataframe and predicts 0 (died) or 1 (survived) for each passenger based on whatever we decide is important.  This function should output a vector containing only 0's and 1's, where the first element is the prediction for the first passenger in the dataframe, the 2nd element is the prediction for the second passenger, etc.  
    
**3.  Write an evaluation function.**
<br>
    -- In order to evaluate how accurate our prediction function is, we'll need to track how it does.  To do this, we'll create a _confusion matrix_.  This matrix will exist as a dictionary that tracks the number of _True Positives_, _True Negatives_, _False Positives_, and _False Negatives_ our algorithm makes--don't worry if you haven't seen these terms before. We'll define them in a later section.
    
**4. Tweak our prediction function until we're happy!**
    -- Once we've built out the functions that underpin our predictive algorithm, we'll tweak them until we hit our desired accuracy metric.  In this case, **_we'll shoot for an accuracy of at least 80%._**
<br>
<br>
<center>Let's get started!</center>

In [73]:
#Import everything needed for the project.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns
%matplotlib inline

<center><h2>Step 1: Load and Explore the Data</h2></center>

In this section, we'll:

1. Read the data from `titanic.csv` and store it in a dataframe (you'll find this file in the `/datasets` folder).
2. Remove the `Survived` column from the dataframe and store it as a Pandas Series in a variable. 
3. Create a general purpose function that visualizes survivors vs deaths in any data frame passed in.
4. Clean our dataframe (remove unnecessary columns, deal with null values, etc).  
5. Explore our data and figure out which groups are most likely to survive.
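
Steps 1, 2, and 4 can be sketched on a tiny hand-made frame before touching the real file. The six rows below copy the first six passengers of the dataset, restricted to four columns; the names `toy_df` and `toy_labels` are illustrative, and filling missing ages with the median is only one of several reasonable ways to handle null values, not a prescribed one.

```python
import pandas as pd

# Toy stand-in for titanic.csv: the first six passengers, four columns only.
# The last passenger has a missing Age.
toy_df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass":   [3, 1, 3, 1, 3, 3],
    "Sex":      ["male", "female", "female", "female", "male", "male"],
    "Age":      [22.0, 38.0, 26.0, 35.0, 35.0, None],
})

# Step 2: pop() removes the column from the frame and returns it as a Series.
toy_labels = toy_df.pop("Survived")

# Step 4 (an assumption, one option among several): fill missing ages
# with the median of the known ages.
toy_df["Age"] = toy_df["Age"].fillna(toy_df["Age"].median())

print(toy_df.columns.tolist())
print(toy_labels.tolist())
```

Using `pop` keeps the "store, then remove" step to a single line, so the features and the labels cannot drift out of sync.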


NOTE: There are many ways to successfully visualize survival rates across the different features. The most intuitive way is a stacked bar chart, where 'survived' and 'dead' are different colors on the same bar.  For an easy explanation of how to make these bar charts, see [this Stack Overflow question](https://stackoverflow.com/questions/41622054/stacked-histogram-of-grouped-values-in-pandas).

In [87]:
# Read in the titanic.csv dataset from the /datasets folder.  
raw_df = pd.read_csv('datasets/titanic.csv')

# Store the Survived column in the labels variable.
# The column stays in the dataframe for now because the exploration below needs it.
labels = raw_df['Survived']


# Drop the columns we won't use as features.
columns_to_remove = ['PassengerId', 'Name', 'Ticket', 'Cabin']
titanic_df = raw_df.drop(columns_to_remove, axis=1)
titanic_df.head()

 
#study material for this dataset
#https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.mstats.zscore.html


Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


## Survival rate based on Sex

### 74% of women survived
### 19% of men survived


In [88]:

# Survived is 0 or 1, so its mean within each group is that group's survival rate.
titanic_df[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)

Unnamed: 0,Sex,Survived
0,female,0.742038
1,male,0.188908
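
The same group-then-average pattern is repeated for every feature in this section, so it can be wrapped in a small helper. This is a minimal sketch: the name `survival_rate_by` and the toy data are illustrative assumptions, and the helper expects the `Survived` column to still be present in the frame.

```python
import pandas as pd

def survival_rate_by(dataframe, column):
    """Return the share of survivors for each value of `column`, highest first."""
    return (dataframe.groupby(column)["Survived"]
                     .mean()
                     .sort_values(ascending=False))

# Toy data for illustration only (not the real dataset).
toy = pd.DataFrame({
    "Sex":      ["female", "female", "male", "male", "male", "female"],
    "Survived": [1, 1, 0, 1, 0, 0],
})

rates = survival_rate_by(toy, "Sex")
print(rates)
```

On the real data, a call such as `survival_rate_by(titanic_df, 'Pclass')` should give the same numbers as the longer expressions used in the cells of this section.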


## Survival rate based on Pclass
### 63.0% of 1st class passengers survived
### 47.3% of 2nd class passengers survived
### 24.2% of 3rd class passengers survived

In [89]:
titanic_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)

Unnamed: 0,Pclass,Survived
0,1,0.62963
1,2,0.472826
2,3,0.242363


## Survival rate based on where they embarked from
### 55.4% of people who boarded at Cherbourg, France survived
### 39.0% of people who boarded at Queenstown, Ireland survived
### 33.7% of people who boarded at Southampton, England survived

In [96]:
titanic_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index = False).mean().sort_values(by ='Survived', ascending = False)


Unnamed: 0,Embarked,Survived
0,C,0.553571
1,Q,0.38961
2,S,0.336957


## Survival rate based on how many siblings or spouses were on board
### Passengers with 1 or 2 siblings or spouses survived most often; beyond that, survival fell as the count rose
### 35% of people who had no siblings or spouses survived
### 54% of people who had 1 sibling or spouse survived
### 46% of people who had 2 siblings or spouses survived
### 25% of people who had 3 siblings or spouses survived
### 17% of people who had 4 siblings or spouses survived
### 0% of people who had 5 or 8 (the maximum) siblings or spouses survived

In [97]:
titanic_df[['SibSp', 'Survived']].groupby(['SibSp'], as_index = False).mean().sort_values(by ='Survived', ascending = False)

Unnamed: 0,SibSp,Survived
1,1,0.535885
2,2,0.464286
0,0,0.345395
3,3,0.25
4,4,0.166667
5,5,0.0
6,8,0.0


## Survival rate based on how many parents or children were on board
### 34% of people who had no parents or children on board survived
### 55% of people who had 1 parent or child survived
### 50% of people who had 2 parents or children survived
### 60% of people who had 3 parents or children survived
### 0% of people who had 4 parents or children survived
### 20% of people who had 5 parents or children survived
### 0% of people who had 6 parents or children survived

In [100]:
titanic_df[['Parch', 'Survived']].groupby(['Parch'], as_index = False).mean().sort_values(by ='Survived', ascending = True)

Unnamed: 0,Parch,Survived
4,4,0.0
6,6,0.0
5,5,0.2
0,0,0.343658
2,2,0.5
1,1,0.550847
3,3,0.6


## Removing and storing Survived as its own Series

In [110]:
# Keep the labels in their own Series, then drop the column from the features.
# drop() returns a new dataframe, so the result must be assigned back.
survived__column_df = titanic_df['Survived']
titanic_df = titanic_df.drop(['Survived'], axis=1)
titanic_df

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,male,22.0,1,0,7.2500,S
1,1,female,38.0,1,0,71.2833,C
2,3,female,26.0,0,0,7.9250,S
3,1,female,35.0,1,0,53.1000,S
4,3,male,35.0,0,0,8.0500,S
5,3,male,,0,0,8.4583,Q
6,1,male,54.0,0,0,51.8625,S
7,3,male,2.0,3,1,21.0750,S
8,3,female,27.0,0,2,11.1333,S
9,2,female,14.0,1,0,30.0708,C


In [108]:
survived__column_df

0      0
1      1
2      1
3      1
4      0
5      0
6      0
7      0
8      1
9      1
10     1
11     1
12     0
13     0
14     0
15     1
16     0
17     1
18     0
19     1
20     0
21     1
22     1
23     1
24     0
25     1
26     0
27     0
28     1
29     0
      ..
861    0
862    1
863    0
864    0
865    1
866    1
867    0
868    0
869    1
870    0
871    1
872    0
873    0
874    1
875    1
876    0
877    0
878    0
879    1
880    1
881    0
882    0
883    0
884    0
885    0
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64

# -----------------------------------------------------------------------------------------------------------


Next, we'll create a function that allows us to quickly visualize the survival rates of any dataframe of passengers.  This way, we can iterate quickly by slicing our dataframe and visualizing the survival rate to see if we can find any patterns that will be useful to us.  

As an example, if we wanted to visualize the survival rates of men versus women, we would create a dataframe object that contains only the information that matters to us, and then pass it into this function.  When completed, this function should output a histogram plot that looks like the ones seen in the Stack Overflow link listed above.  

In [4]:
# Create a function used to visualize survival rates for the data frame passed in
def visualize_survival_rates(dataframe, xlabel=None, ylabel="Count"):
    """    
    Inputs: dataframe--a pandas dataframe object consisting of the things you want visualized.  
            labels--a pandas series object that tells us whether each passenger died (0) or survived(1)
            
    Outputs: A 2 color histogram that visualizes the survival rate of passengers based on the values contained 
    within the dataframe.  For instance, if we pass in a dataframe containing only the Sex column, the plot 
    compares survivors and deaths for men and for women.
    
    NOTE: You should rely on the dataframe's .hist() method to do most of the heavy lifting for visualizations.  
    Any slicing of the dataframe should be done BEFORE you call this function.  For instance, if you want to visualize
    survival rates of men under 30 vs women under 30, you should create a dataframe containing only these rows and 
    columns before passing it into this function, rather than passing in the full original dataframe.  This will 
    allow you to keep the logic in this function simple.
    """
    pass
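
One possible shape for such a plot is a stacked bar chart built from a cross-tabulation of a feature against the labels. The sketch below is not the required solution: the name `plot_survival_counts` and its two-Series signature are assumptions chosen for illustration, and it uses `pd.crosstab` with a stacked bar plot rather than the `.hist()` method mentioned in the docstring.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_survival_counts(feature, labels, xlabel=None, ylabel="Count"):
    """Draw a stacked bar chart of deaths and survivors per feature value."""
    # Rows: feature values; columns: 0 (died) and 1 (survived); cells: passenger counts.
    counts = pd.crosstab(feature, labels)
    # Rename the 0/1 columns so the legend is readable.
    counts = counts.rename(columns={0: "Died", 1: "Survived"})
    ax = counts.plot(kind="bar", stacked=True)
    ax.set_xlabel(xlabel if xlabel is not None else feature.name)
    ax.set_ylabel(ylabel)
    return ax, counts

# Toy data for illustration only (not the real dataset).
toy_sex = pd.Series(["male", "female", "female", "male", "female"], name="Sex")
toy_survived = pd.Series([0, 1, 1, 0, 0], name="Survived")

ax, counts = plot_survival_counts(toy_sex, toy_survived)
print(counts)
```

Returning the counts alongside the axes makes it easy to check the numbers behind each bar.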
    

<center><h3>Building a Prediction Function</h3></center>

Next, we'll write a prediction function.  We'll use basic control flow to examine each row in the data set and make a prediction based on whatever we think is important.  If you explored the data set, you may have stumbled upon a few interesting discoveries, such as:

* Women were more likely to survive than men.  
* Rich people were more likely to survive than poor people.  
* Young people were more likely to survive than others.  

(NOTE: We made these up--don't automatically assume they're true without investigating first!)

These may seem obvious, but don't discount their usefulness! We can use these facts to build a prediction function that has decent accuracy! For instance, let's pretend that we found that 80% of all women survived.  Knowing this, if we then tell our algorithm to predict that all female passengers survived, we'll be right 80% of the time for female passengers! 

Complete the following prediction function.  It should take in a dataframe of titanic passengers.  Based on the things you think are important (just use a bunch of nested control flow statements), you'll output a 1 if you think this passenger survived, or a 0 if you think they died.  

The function should output an array where the first item is the prediction for the first row in the dataframe, the 2nd item in the array is the prediction for the second row in the dataframe, etc.  

In [5]:
def predict_survival(dataframe):
    predictions = []
    # WRITE YOUR PREDICTION CODE BELOW!
    
    
    return predictions
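
As a starting point, the simplest rule suggested by the exploration above (roughly 74% of women survived versus roughly 19% of men) is to predict survival for every woman and death for every man. The sketch below is one minimal baseline, not the intended final answer; the name `predict_by_sex` and the toy passengers are illustrative assumptions.

```python
import pandas as pd

def predict_by_sex(dataframe):
    """Predict 1 (survived) for every woman and 0 (died) for everyone else."""
    predictions = []
    # iterrows() yields (index, row) pairs in the dataframe's row order,
    # so predictions[i] lines up with the i-th passenger.
    for _, passenger in dataframe.iterrows():
        if passenger["Sex"] == "female":
            predictions.append(1)
        else:
            predictions.append(0)
    return predictions

# Toy passengers for illustration only.
toy_passengers = pd.DataFrame({
    "Sex":    ["male", "female", "female", "male"],
    "Pclass": [3, 1, 3, 1],
})

print(predict_by_sex(toy_passengers))  # [0, 1, 1, 0]
```

Further rules, such as a check on `Pclass` or `Age`, can be nested inside either branch once this baseline has been measured.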

<center><h3>Evaluating Your Predictions</h3></center>

Great! Now we've explored our data and made a bunch of predictions--but predictions are only useful if they're accurate.  To measure how accurate ours are, we're going to create a **_Confusion Matrix_** to track what we got right and wrong (and _how_ we were right and wrong).  

There are 4 different possible outcomes for each prediction:

1. **True Positive** -- You predicted they survived (1), and they actually survived (1). 
2. **True Negative** -- You predicted they died (0), and they actually died (0).
3. **False Positive** -- You predicted they survived (1), and they actually died (0).
4. **False Negative** -- You predicted they died (0), and they actually survived (1).

We're going to write a function that takes in our predictions and the actual labels (the "Survived" column we removed from the actual data frame), and determines which possible outcome we had for each prediction.  We will keep track of how many times each outcome happened by incrementing a counter for each in our _Confusion Matrix_ dictionary.
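
The four outcomes map directly onto four branches of an `if` statement. The sketch below shows one way to do the tally on hand-made lists; the name `count_outcomes` and the toy values are illustrative, and the code assumes every prediction and label is exactly 0 or 1.

```python
def count_outcomes(predictions, labels):
    """Tally TP, TN, FP and FN for paired 0/1 predictions and labels."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    # zip() pairs each prediction with the label at the same position.
    for predicted, actual in zip(predictions, labels):
        if predicted == 1 and actual == 1:
            counts["TP"] += 1
        elif predicted == 0 and actual == 0:
            counts["TN"] += 1
        elif predicted == 1 and actual == 0:
            counts["FP"] += 1
        else:
            # Only remaining case for 0/1 inputs: predicted 0, actually 1.
            counts["FN"] += 1
    return counts

toy_predictions = [1, 0, 1, 0, 1]
toy_labels      = [1, 0, 0, 1, 1]
print(count_outcomes(toy_predictions, toy_labels))
# {'TP': 2, 'TN': 1, 'FP': 1, 'FN': 1}
```

Because `zip` iterates over values in order, the same function works when `labels` is a pandas Series rather than a list.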


In [6]:
def create_confusion_matrix(predictions, labels):
    confusion_matrix = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    
    # Recall that each index in both 'predictions' and 'labels' refers to the corresponding row.  
    # E.g. predictions[0] and labels[0] both refer to row 0 in the dataframe that was passed into the 
    # prediction function.
    
    #TODO: Create the confusion matrix by comparing the values in predictions to the corresponding values in labels.  
    # Use the definitions in the text above to determine which item in the dictionary you should increment.  
    
    return confusion_matrix


def get_accuracy(confusion_matrix):
    # Create a function that returns the accuracy score for your classifier.  
    # The formula for accuracy = (TP + TN) / (TP + TN + FP + FN)
    pass
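
Accuracy is the number of correct predictions (true positives plus true negatives) divided by the total number of predictions. The parentheses matter in code: without them, Python performs the division before the additions. The sketch below works the arithmetic on a hand-made dictionary; the name `accuracy_from_counts` and the example counts are illustrative assumptions.

```python
def accuracy_from_counts(counts):
    """Return (TP + TN) / (TP + TN + FP + FN), or 0.0 if there are no predictions."""
    total = counts["TP"] + counts["TN"] + counts["FP"] + counts["FN"]
    if total == 0:
        # Avoid dividing by zero when the confusion matrix is empty.
        return 0.0
    return (counts["TP"] + counts["TN"]) / total

example_counts = {"TP": 2, "TN": 1, "FP": 1, "FN": 1}
print(accuracy_from_counts(example_counts))  # 0.6
```

Here 3 of the 5 predictions are correct, so the accuracy is 3 / 5 = 0.6, or 60%.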

<center><h3>Where to Go From Here</h3></center>

Now that you have a way to evaluate your predictions, modify your prediction function until you reach an accuracy of at least 80%!