# What is Machine Learning?

![](images/machine_learning.png)

**_Machine learning is the science of programming computers so they can learn how to generalize from data_** 

#### Self driving cars 
![](images/autoauto_self_driving_cars.png)
*Training a model by **AutoAuto***

---

## Computer Science Approach vs. Machine Learning Approach

![alt text][csalg]

[csalg]: https://www.safaribooksonline.com/library/view/hands-on-machine-learning/9781491962282/assets/mlst_0101.png "Figure 1.1"

*Computer Science Algorithm Approach*

![alt text][mlalg]

[mlalg]: https://www.safaribooksonline.com/library/view/hands-on-machine-learning/9781491962282/assets/mlst_0102.png "Figure 1.2"

*Machine Learning Approach*
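
To make the contrast concrete, here is a minimal sketch under made-up assumptions: the hours-studied data, the `hand_written_rule` function and the cutoff of 5 are all hypothetical and only serve as an illustration. In the first approach a person writes the rule; in the second the algorithm works out a rule from labeled examples.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: hours studied, and whether the student passed (1) or not (0)
hours = [[1], [2], [3], [4], [6], [7], [8], [9]]
passed = [0, 0, 0, 0, 1, 1, 1, 1]

# Computer science approach: a person chooses the rule and writes it by hand
def hand_written_rule(h):
    return 1 if h >= 5 else 0

# Machine learning approach: the algorithm finds a rule from the examples
learned_model = DecisionTreeClassifier(max_depth=1, random_state=0)
learned_model.fit(hours, passed)

new_students = [[2.5], [7.5]]
print([hand_written_rule(h[0]) for h in new_students])
print(learned_model.predict(new_students).tolist())
```

If the data changes, the hand-written rule has to be edited by a person, while the learned model only needs to be fit again.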

### Common major steps
__1__ Data collection    
__2__ Data exploration    
__3__ Data preprocessing or feature engineering  
__4__ Training & testing a model  
__5__ Validating results  

![alt text][mlupdate]

[mlupdate]: https://www.safaribooksonline.com/library/view/hands-on-machine-learning/9781491962282/assets/mlst_0103.png "Figure 1.3"

*Machine Learning Has the Ability to Update*

---

## __Data Collection__

__Data__  
Pretend we collected the data below by asking a group of students the following:  
What activity are you enrolled in?  

In [None]:
activities = ['soccer', 'swimming', 'rock climbing'] # what are we creating in this line?
students = [5, 3, 8]

## __Data Exploration__

In [None]:
# Make a dataframe called student_activities
import pandas
student_activities = pandas.DataFrame({'activity': activities, 'number_students': students})
student_activities

In [None]:
# This special Jupyter command (a "magic") makes plots appear inside the notebook.
%matplotlib inline 
import seaborn 
seaborn.barplot(x = 'activity', y = 'number_students', data = student_activities)

__Your Turn:__  
Change the names in the __activities__ list and the numbers in the __students__ list.  
Then rerun that cell and every cell after it up to this point. Is the plot different? :)

## __Data preprocessing or feature engineering__

![](images/data_featuring.png)
*Data Preprocessing by **AutoAuto***

## Training & Testing a Model


![alt text][mlemojis]

[mlemojis]: https://static1.squarespace.com/static/57293859b09f959325ac2e33/t/574cc10db654f95fce4cda28/1464650023923/ "Figure 1.4"

*Machine Learning with Emojis by **Emily Barry***

---

### The Titanic dataset

![](images/titanic_boat.jpg)

In [None]:
import pandas
import matplotlib.pyplot as plt
import seaborn
%matplotlib inline 

import warnings
warnings.filterwarnings('ignore')

Now we load in the data from a CSV file.

In [None]:
data = pandas.read_csv('data/reduced_titanic.csv')

Let's see what the data looks like! 

In [None]:
data

If you want to look at one column at a time, here's how we do it...

In [None]:
data.Survived

__Your turn:__ Make the code in the cell above show a different column 
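
As a hint, here is a small sketch of the ways to pick columns in pandas. The `sample` table is a hypothetical stand-in, not the real Titanic file, so its columns and values are assumptions.

```python
import pandas

# Hypothetical stand-in table; the real CSV may have other columns and values
sample = pandas.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Gender': [1, 0, 0, 1],
    'Class': [3, 1, 2, 3],
})

print(sample.columns.tolist())              # list the available column names

one_column = sample['Class']                # bracket form, same result as sample.Class
two_columns = sample[['Gender', 'Class']]   # a list of names gives a smaller DataFrame

print(one_column.tolist())
print(two_columns.shape)
```

The bracket form also works for column names that contain spaces, where the dot form does not.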

#### We can use data to learn, infer or predict future outcomes
---
__Prediction:__ In data science, we predict future outcomes from information about past outcomes.  

__Past information from the Titanic dataset:__ We know who survived and who did not, along with each passenger's characteristics, e.g. gender, age and class.  

__Prediction:__ Imagine we build another Titanic, an accident happens and it starts to sink.  
We can predict whether you would survive based on the outcomes of the first accident (on the first Titanic). 
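
The simplest version of "predicting from past outcomes" is counting how often each group survived. The sketch below uses a tiny made-up table (the `past` values are hypothetical, not the real passenger list) and assumes the encoding used later in this notebook, where 0 means female and 1 means male.

```python
import pandas

# Hypothetical past outcomes: Gender 0 = female, 1 = male; Survived 1 = yes, 0 = no
past = pandas.DataFrame({
    'Gender':   [0, 0, 0, 0, 1, 1, 1, 1],
    'Survived': [1, 1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate per group
survival_rate = past.groupby('Gender')['Survived'].mean()
print(survival_rate)
```

A model such as a decision tree refines this idea by combining several characteristics at once.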

### Decision Trees

![](images/DT_Titanic.png)

#### Example: Guess who game

![](images/guesswho1.JPG)

__Your Turn__: What questions would eliminate most candidates?  

a) Is it efficient to ask "Does he/she have a big nose"?  

b) Is it efficient to ask "Is he/she bald"?  
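
One way to compare questions is to count how many candidates could still be left after the answer. This is a rough sketch: the counts below are hypothetical (not taken from the picture), and `worst_case_remaining` is a helper name made up for this example.

```python
def worst_case_remaining(yes_count, total):
    """Candidates left after the answer, assuming the less helpful answer comes back."""
    return max(yes_count, total - yes_count)

total = 24  # hypothetical number of faces on the board

# Hypothetical counts of candidates for whom the answer would be "yes"
questions = {'big nose': 6, 'bald': 5, 'wears a hat': 12}

for question, yes_count in questions.items():
    print(question, worst_case_remaining(yes_count, total))

# The best question leaves the fewest candidates in the worst case
best = min(questions, key=lambda q: worst_case_remaining(questions[q], total))
print(best)
```

Questions that split the candidates close to half and half are the safest. A decision tree chooses its questions in a similar spirit, preferring the ones that separate the data most cleanly.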

### Train / Test Data Split

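Testing a model on the same data you used to train it gives an overly optimistic score, because the model may simply have memorized those examples. What we care about is how it performs on data it has not seen.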

The fix is to split the data into a training set and a testing set:
   * There is no single right split for training/testing. scikit-learn's `train_test_split` defaults to 75%/25%; the cell below uses 67%/33% (`test_size=0.33`)
   * Data is typically denoted as **X** while labels are denoted with **y**  
   
Now we split the data and drop the outcome column "Survived" from **X**, since that is what we aim to predict.

In [None]:
# Split the data into X_train, X_test, y_train, and y_test

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
                                   data.drop('Survived', axis=1), 
                                   data['Survived'], 
                                   test_size=0.33, 
                                   random_state=42)
X_train.head()
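
To see why the split matters, the sketch below compares a model's score on its own training data with its score on held-out data. It uses synthetic data from `make_classification` instead of the Titanic file, so the variable names and numbers are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some randomly flipped labels (flip_y) to mimic noisy real data
X_demo, y_demo = make_classification(n_samples=300, n_features=5,
                                     flip_y=0.2, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.33, random_state=42)

# A fully grown tree can memorize every training example
demo_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_score = demo_tree.score(X_tr, y_tr)  # scored on data the tree has already seen
test_score = demo_tree.score(X_te, y_te)   # scored on data the tree has never seen
print(train_score, test_score)
```

The training score looks excellent because the tree has memorized those rows; the test score is the more honest estimate.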


Next, we train our model using the training data. 


In [None]:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

decision_tree = DecisionTreeClassifier()

Training our model with two features, Gender and Class.

In [None]:
features = ['Gender', 'Class']
model = decision_tree.fit(X_train[features], y_train)

In [None]:
# How good is the model? score() returns the share of test passengers it predicts correctly (accuracy).
model.score(X_test[features], y_test)
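
For a classifier, `score` reports accuracy: the fraction of predictions that match the true labels. The sketch below checks this by hand on scikit-learn's built-in iris data (used here only so the example runs without the Titanic file; the variable names are made up for the example).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = load_iris(return_X_y=True)
X_fit, X_eval, y_fit, y_eval = train_test_split(X_all, y_all,
                                                test_size=0.33, random_state=42)

clf = DecisionTreeClassifier(random_state=0).fit(X_fit, y_fit)

# Accuracy by hand: compare each prediction with the true label and average
predictions = clf.predict(X_eval)
manual_accuracy = np.mean(predictions == y_eval)

print(manual_accuracy)
print(clf.score(X_eval, y_eval))  # same number, computed by scikit-learn
```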

#### Making predictions

In [None]:
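from sklearn import neighbors, datasets
import warnings
warnings.filterwarnings('ignore')

# Load the iris measurements (X) and species labels (y)
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Classify a flower by looking at its 3 nearest neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# What kind of iris has a 3cm x 5cm sepal and a 4cm x 2cm petal?
result = knn.predict([[3, 5, 4, 2]])
print(iris.target_names[result])

# Share of the 3 neighbors that belong to each species
knn.predict_proba([[3, 5, 4, 2]])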

__Your turn__:  
Run all the cells below for different cases:     
e.g. use female (denoted by 0) and class 1    
or use male (denoted by 1) and class 3 

In [None]:
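Gender = ?  # replace ? with 0 (female) or 1 (male)
Class = ?  # replace ? with a ticket class, e.g. 1 or 3; how much can you pay for the tickets?

# predict expects one row per person, so the two values go in a nested list
Result = decision_tree.predict([[Gender, Class]])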

In [None]:
# Column 1 holds the probability of Survived = 1; multiply by 100 to get a percentage
Survive_Prob = decision_tree.predict_proba([[Gender, Class]])[0, 1] * 100

In [None]:
####################### Decision Tree Prediction #######################
if Result==1: 
    print('He/She has a {0:.2f}% chance of survival! :)'.format(Survive_Prob))
else: 
    print('He/She has a {0:.2f}% chance of survival :('.format(Survive_Prob))

As you can see, the model predicts what would happen to a new person based on what happened to the passengers it was trained on.

__Exercise__: Train your model using all features

In [None]:
features = ['Gender',  # complete this list
            ]
model = decision_tree.fit()  # give the parameters needed here

In [None]:
# How good is our model? Is it better or worse than the one with fewer features? Why?
model.score(X_test[features], y_test)

## Validating results

### 1. Is it possible to validate?
### 2. Is the way you are validating matching the problem you are claiming to solve?
### 3. If validation is failing:
#### Bad Data
> The data does not accurately represent the unknown population, there is not enough of it, the feature engineering needs improvement, etc.  

#### Bad Model
> The model is inadequate for the problem you are trying to solve, etc. 
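
One common way to check whether a score holds up is cross-validation, which repeats the train/test split several times and reports one score per split. This technique is not used elsewhere in this notebook; the sketch below is only an illustration on scikit-learn's built-in iris data, with variable names made up for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# cv=5 splits the data into 5 parts; each part takes one turn as the test set
scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                         X_iris, y_iris, cv=5)

print(scores)         # one accuracy value per split
print(scores.mean())  # average accuracy across the 5 splits
```

If the scores vary a lot from split to split, that can be a sign of too little data or of a model that is too sensitive to the particular rows it was trained on.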
