# Data Science: Making Predictions
---

## 1- Introduction
### We can use data to learn, infer or predict future outcomes
---
__Prediction:__ With data science, we make predictions from information about past outcomes.


__Your Turn__: What would you like to predict?

__Past information from Titanic dataset:__ We know who survived and who did not, along with each passenger's characteristics, e.g. gender, age and class.

__Prediction:__ Imagine we build another Titanic, an accident happens and the ship starts to sink.  
We can predict whether you would survive based on the outcomes of the first accident (on the first Titanic).  


### We will use a reduced Titanic dataset for this task

In [None]:
import pandas
import matplotlib.pyplot as plt
import seaborn
%matplotlib inline 

import warnings
warnings.filterwarnings('ignore')

Now we load in the data from a CSV file.

In [None]:
data = pandas.read_csv('data/reduced_titanic.csv')

In [None]:
data.head(10)

## 2- Data Science Models
### Learning from the data to make a prediction
---

__Data Science Models:__
They use a set of rules and learn from the data to produce a result of interest.  
For example, whether a new passenger would survive, based on his/her characteristics.


![](images/DT_Titanic.png)

## Decision Trees 
### Models possible consequences for the observations using a tree structure
---

![](images/tree.png)

* A decision tree uses an upside-down tree structure to represent a number of decision paths and an outcome for each path   
* Decision trees are like a more complex "if-else" statement

![](images/tree2.png)

In the code below, we choose the "path" (the part of the code to run) depending on a "decision" or condition.

__Your Turn__: Run the code below.  
If it is cold today, which code block do we reach?  
Which part of the code do we reach if we type sunny instead?

In [None]:
# What is happening to the weather today?

weather = input("What is the weather like: ")

if weather == "raining":
    print("Take an umbrella")
    
elif weather == "cold":
    print("Take a coat")
    
elif weather == "sunny":
    print("Wear sunscreen")
    
else:
    print("Have a good day, whatever the weather!")
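The same if/else idea can be nested to mimic a small decision tree. The sketch below is hand-written for illustration only: the function name `guess_survival` and its rules are hypothetical, not rules learned from the Titanic data. It assumes the encoding used later in this notebook (0 for female, 1 for male).

```python
def guess_survival(gender, passenger_class):
    """Hand-written toy tree; the rules are made up for illustration."""
    if gender == 0:                   # first question: is the passenger female?
        if passenger_class <= 2:      # second question: 1st or 2nd class?
            return "survived"
        else:
            return "did not survive"
    else:                             # male branch
        if passenger_class == 1:      # second question: 1st class?
            return "survived"
        else:
            return "did not survive"


print(guess_survival(0, 1))
print(guess_survival(1, 3))
```

Each `if` plays the role of a node, each indented branch is a path, and each `return` is an outcome at the end of a path. A real model learns these questions from the data instead of having them written by hand.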

### Back to decision trees: Have you played the game Guess Who?

![](images/guesswho1.JPG)

__Your Turn__: What questions would eliminate most candidates?

a) Is it efficient to ask "Does he/she have a big nose"?

b) Is it efficient to ask "Is he/she bald"? 

### Decision Tree Models

 __Guess Who and feature ordering in a decision tree__

Just as the information you gain in the game above depends on which questions you ask first, 
the efficiency of training the model is mostly dictated by which feature goes in the top node, the second-level nodes, and so on.  
Information gain strategies: http://www.saedsayad.com/decision_tree.htm
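One common way to rank candidate questions is information gain: the drop in entropy of the target after splitting on a feature. The sketch below computes it for a tiny made-up table (the rows are invented for illustration and are not Titanic records); `entropy` and `information_gain` are helper names chosen here, not library functions. Note that scikit-learn's `DecisionTreeClassifier` uses Gini impurity by default, a closely related measure.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, feature, target):
    """Entropy of the target minus the weighted entropy after splitting on feature."""
    before = entropy([r[target] for r in rows])
    after = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after


# Made-up toy rows, NOT real Titanic records
rows = [
    {"Gender": 0, "Class": 1, "Survived": 1},
    {"Gender": 0, "Class": 2, "Survived": 1},
    {"Gender": 0, "Class": 3, "Survived": 1},
    {"Gender": 1, "Class": 1, "Survived": 0},
    {"Gender": 1, "Class": 2, "Survived": 0},
    {"Gender": 1, "Class": 3, "Survived": 0},
]

for feature in ["Gender", "Class"]:
    print(feature, round(information_gain(rows, feature, "Survived"), 3))
```

In this toy table, splitting on `Gender` separates the outcomes completely (a gain of 1 bit), while splitting on `Class` tells us nothing (a gain of 0), so `Gender` would be the better first question.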

Decision tree models are used to predict the value of a target variable based on several features (input variables).   
Each interior node corresponds to one of the features. Branches represent feature values.  
Each leaf represents a possible outcome of the target. 
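To see nodes, branches and leaves concretely, scikit-learn can print a fitted tree as text with `sklearn.tree.export_text`. The sketch below fits a tree on a small made-up table (not this notebook's data); the column names simply mirror the `Gender` and `Class` features used later, and `toy` and `toy_tree` are names chosen here.

```python
import pandas
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up toy data, for illustration only
toy = pandas.DataFrame({
    "Gender":   [0, 0, 0, 1, 1, 1],
    "Class":    [1, 2, 3, 1, 2, 3],
    "Survived": [1, 1, 0, 1, 0, 0],
})
features = ["Gender", "Class"]

toy_tree = DecisionTreeClassifier(random_state=0)
toy_tree.fit(toy[features], toy["Survived"])

# Lines with a feature name are node tests (branches);
# lines starting with "class:" are leaves (outcomes)
report = export_text(toy_tree, feature_names=features)
print(report)
```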


![](images/forecast.jpg)

## 3- Training our Model
### Since we don't have new data we can split our data into training and test sets:

* training (to train our model)
* test (used as new data to test our model)


The test data lets us check whether our model works well on data it has not seen, that is, how close our predictions get to the real results.

Now we split the data and drop the results column "Survived", since that is what we aim to predict.

In [None]:
# Split the data into X_train, X_test, y_train, and y_test

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
                                   data.drop('Survived', axis=1), 
                                   data['Survived'], 
                                   test_size=0.33, 
                                   random_state=42)
X_train.head()
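As a quick check of what the split does, the sketch below applies the same call to a made-up 100-row table (hypothetical data, not the Titanic file) and confirms that every row lands in exactly one of the two sets. The `toy_` names are chosen here so they do not overwrite the notebook's variables.

```python
import pandas
from sklearn.model_selection import train_test_split

# Made-up table with 100 rows, for illustration only
toy = pandas.DataFrame({
    "Gender":   [i % 2 for i in range(100)],
    "Class":    [i % 3 + 1 for i in range(100)],
    "Survived": [(i // 2) % 2 for i in range(100)],
})

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy.drop("Survived", axis=1),   # features only
    toy["Survived"],                # target column
    test_size=0.33,                 # about one third of the rows go to the test set
    random_state=42,                # fixed seed so the split is reproducible
)

# Sizes of the two sets
print(len(toy_X_train), len(toy_X_test))

# Row labels shared by both sets (expected to be empty)
overlap = set(toy_X_train.index) & set(toy_X_test.index)
print(overlap)
```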

Next, we train our model using the training data. 


In [None]:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

decision_tree = DecisionTreeClassifier()

Training our model with two features, Gender and Class.

In [None]:
features = ['Gender', 'Class']
model = decision_tree.fit(X_train[features], y_train)

In [None]:
# How good is the model?
model.score(X_test[features], y_test)
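For a classifier, `score` returns the accuracy: the fraction of test rows whose predicted label matches the true label. A minimal sketch on made-up data (not the Titanic file) that computes the same number by hand; the `toy_` names are chosen here for illustration.

```python
import pandas
from sklearn.tree import DecisionTreeClassifier

# Made-up training and test tables, for illustration only
toy_train = pandas.DataFrame({
    "Gender":   [0, 0, 1, 1],
    "Class":    [1, 3, 1, 3],
    "Survived": [1, 1, 0, 0],
})
toy_test = pandas.DataFrame({
    "Gender":   [0, 1, 1, 0],
    "Class":    [2, 2, 1, 3],
    "Survived": [1, 0, 1, 1],
})
features = ["Gender", "Class"]

toy_model = DecisionTreeClassifier(random_state=0)
toy_model.fit(toy_train[features], toy_train["Survived"])

# Accuracy by hand: share of predictions equal to the true labels
predictions = toy_model.predict(toy_test[features])
manual_accuracy = (predictions == toy_test["Survived"]).mean()

# Accuracy as reported by the model
model_accuracy = toy_model.score(toy_test[features], toy_test["Survived"])

print(manual_accuracy)
print(model_accuracy)
```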

## 4- Making Predictions
### Let's check if you would survive or not... :O

 
__Your Turn:__ Run the cells below for different cases:   
e.g. use female (encoded as 0) and class 1  
or use male (encoded as 1) and class 3   


In [None]:
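# Replace the example values below with the case you want to try
Gender = 0  # 0 = female, 1 = male
Class = 1   # ticket class, e.g. 1 or 3: how much can you pay for the tickets?

Result = decision_tree.predict([[Gender, Class]])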

In [None]:
Survive_Prob = decision_tree.predict_proba([[Gender, Class]])[0,1]*100
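`predict_proba` returns one row per passenger and one column per class, ordered as in the model's `classes_` attribute, so column 1 holds the probability of `Survived == 1` when the classes are 0 and 1. A minimal sketch on made-up data (not the Titanic file), with names chosen here for illustration:

```python
import pandas
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data: passengers with identical features but mixed outcomes
toy = pandas.DataFrame({
    "Gender":   [0, 0, 0, 0, 1, 1, 1, 1],
    "Class":    [1, 1, 1, 1, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})
features = ["Gender", "Class"]

toy_model = DecisionTreeClassifier(random_state=0)
toy_model.fit(toy[features], toy["Survived"])

# One new passenger: female (0) in class 1
new_passenger = pandas.DataFrame([[0, 1]], columns=features)
probabilities = toy_model.predict_proba(new_passenger)

print(toy_model.classes_)   # order of the probability columns
print(probabilities)        # one row, one column per class

# Look up the column for class 1 instead of assuming its position
survive_column = list(toy_model.classes_).index(1)
print(probabilities[0, survive_column] * 100)
```

Three of the four matching toy passengers survived, so the leaf they fall into reports a 75% chance of survival. In a decision tree, the probability is simply the share of each class among the training rows in that leaf.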

Would our new passenger survive or not?

In [None]:
####################### Decision Tree Prediction #######################
if Result==1: 
    print('He/She has a {0:.2f}% chance of survival! :)'.format(Survive_Prob))
else: 
    print('He/She has a {0:.2f}% chance of survival :('.format(Survive_Prob))

As you can see, the model predicts what would happen to a new person based on what happened to the passengers in our training data.

__Exercise__: Train your model using all features

In [None]:
features = ['Gender', ]  # complete this list
model = decision_tree.fit()  # give the parameters needed here

In [None]:
# How good is our model? Is it better or worse than the one with fewer features? Why?
model.score(X_test[features], y_test)