# Your first Scikit-Learn model

- Adapted for AI Black Belt - Yellow (May 2019).
- [Tutorial](https://github.com/amueller/ml-workshop-1-of-4/) created by Andreas Mueller (2019). MIT License.

---

## Loading data

In [None]:
import numpy as np
import pandas as pd

In [48]:
df = pd.read_csv("data/titanic.csv")

<div class="alert alert-success">

<b>EXERCISE</b>:

 <ul>
    <li>Create a 2D Numpy array <code>X</code> from the columns 'Pclass' and 'Age'.</li>
    <li>Create a 1D Numpy array <code>y</code> from the column 'Survived'.</li>
</ul>
</div>

In [14]:
# %load solutions/day1-05-01.py

The resulting data should look like this:

In [15]:
X[:10]

array([[ 3., 22.],
       [ 1., 38.],
       [ 3., 26.],
       [ 1., 35.],
       [ 3., 35.],
       [ 3., nan],
       [ 1., 54.],
       [ 3.,  2.],
       [ 3., 27.],
       [ 2., 14.]])

In [16]:
y[:10]

array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1])
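As a hedged illustration of how arrays of this shape can be pulled out of a DataFrame, the sketch below uses a small stand-in table rather than `data/titanic.csv`. The names `df_demo`, `X_demo` and `y_demo` are hypothetical, and the values are copied from the first rows shown above. This is one possible approach, not necessarily the one in the solution file.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the Titanic table (hypothetical frame; values taken
# from the first six rows printed above).
df_demo = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Selecting a list of columns gives a DataFrame, which becomes a 2D array.
X_demo = df_demo[["Pclass", "Age"]].to_numpy()

# Selecting a single column gives a Series, which becomes a 1D array.
y_demo = df_demo["Survived"].to_numpy()

print(X_demo.shape, y_demo.shape)
```

Because the two feature columns mix integers and floats, the resulting 2D array is stored as floats, which is why `Pclass` prints as `3.` rather than `3`.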

## Really simple API

We will now build our first Scikit-Learn classifier. But before that, we need to split the data into two parts:
- training: used to fit the classifier
- test: used to evaluate the classifier on data it has not seen during training

In [17]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
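To make the effect of the split concrete, here is a minimal sketch on a small synthetic array (`X_demo` and `y_demo` are hypothetical names, not the Titanic data). With no size arguments, `train_test_split` holds out 25% of the samples for testing, and `random_state` makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 synthetic samples with 2 features each, and alternating labels.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

# Default split: 75% training, 25% test.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print(X_tr.shape, X_te.shape)  # (15, 2) (5, 2)

# Same seed, same split: the result is reproducible.
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, random_state=0)
print(np.array_equal(X_te, X_te2))  # True
```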

Now the recipe is as follows:

0) Import your model class.

In [18]:
from sklearn.linear_model import LogisticRegression

1) Instantiate an object and set parameters.

In [19]:
clf = LogisticRegression()

2) Fit the model.

In [20]:
clf.fit(X_train, y_train)



ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Oops! Things don't seem to work as expected... The `X` array has missing entries, encoded as `NaN`. These are not handled by default in Scikit-Learn.

<div class="alert alert-success">

<b>EXERCISE</b>:

 <ul>
    <li>Debug the data loading procedure above to convert <code>NaN</code> to numerical values.</li>
</ul>
</div>
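One possible way to approach this, sketched on a small stand-in table, is to fill the missing ages before converting to NumPy. The names `df_demo`, `df_filled` and `X_demo` are hypothetical, and using the median is an assumption made for illustration. Other fill values, or dropping the affected rows, are equally valid choices.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the two feature columns, with one missing age.
df_demo = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
})

# Series.median() skips NaN by default, so this is the median of observed ages.
age_median = df_demo["Age"].median()

# Fill only the 'Age' column; the original frame is left unchanged.
df_filled = df_demo.fillna({"Age": age_median})
X_demo = df_filled[["Pclass", "Age"]].to_numpy()

print(np.isnan(X_demo).any())  # False
```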

3) Apply / evaluate

In [55]:
print(clf.predict(X_train))
print(y_train)

[0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0
 0 0 0 1 0 0 0 1 1 1 0 1 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0
 1 1 0 1 0 0 1 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0
 0 0 0 1 1 0 0 1 0 0 1 1 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 1 1 1 0 1 0 0 0 1 0
 0 0 1 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 1 0 0 1 0
 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 1 1 1
 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0
 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0
 0 1 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 0 1 0 0 1 1 0
 1 0 0 0 0 0 0 1 0 1 0 0 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0
 0 1 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0
 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 0 0 0 0 1 0 1 1 0 0 0 1 0 0 0
 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 1 1 0 0 0 0 0 1 0 0
 0 1 0 0 0 1 0 0 0 0 1 0 

In [56]:
clf.score(X_train, y_train)

0.687125748502994

In [57]:
clf.score(X_test, y_test)

0.7130044843049327
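For classifiers, `score` returns the mean accuracy, that is, the fraction of samples whose predicted label matches the true label. The sketch below checks this on synthetic data. The names `X_demo`, `y_demo` and `clf_demo` are hypothetical, and the data is generated only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data; the label depends on the sum of the features.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

clf_demo = LogisticRegression().fit(X_demo, y_demo)

# Accuracy computed by hand: fraction of predictions equal to the true labels.
manual_accuracy = np.mean(clf_demo.predict(X_demo) == y_demo)

print(np.isclose(manual_accuracy, clf_demo.score(X_demo, y_demo)))  # True
```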

## And again

In [58]:
from sklearn.ensemble import RandomForestClassifier

In [61]:
clf = RandomForestClassifier(n_estimators=100)

In [62]:
clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [63]:
clf.score(X_train, y_train)

0.7739520958083832

In [64]:
clf.score(X_test, y_test)

0.6995515695067265

## Exercises

<div class="alert alert-success">

<b>EXERCISE</b>:

 <ul>
    <li>Load the iris dataset from <code>data/iris.tsv</code>.</li>
    <li>Split it into training and test sets using <code>train_test_split</code>.</li>
</ul>
</div>

In [28]:
# %load solutions/day1-05-02.py

<div class="alert alert-success">

<b>EXERCISE</b>:

 <ul>
    <li>Then train and evaluate <code>KNeighborsClassifier</code>, <code>RandomForestClassifier</code> and <code>LogisticRegression</code> on the iris dataset.</li>
    <li>How do these perform on the training set vs the test set?</li>
    <li>Which one is the best on the training set, which one is the best on the test set?</li>
    <li>Are these results stable?</li>
</ul>
</div>

In [None]:
# %load solutions/day1-05-03.py
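As a hint for the stability question, one rough check is to repeat the split with different seeds and watch how much the test accuracy moves. The sketch below is an assumption-laden illustration rather than the reference solution: it uses the copy of iris bundled with Scikit-Learn (`load_iris`) instead of `data/iris.tsv`, and only one of the three classifiers. The names `X_iris`, `y_iris` and `test_scores` are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Bundled iris data, used here in place of data/iris.tsv.
X_iris, y_iris = load_iris(return_X_y=True)

test_scores = []
for seed in range(5):
    # A different seed gives a different train/test partition.
    X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, random_state=seed)
    clf_demo = KNeighborsClassifier().fit(X_tr, y_tr)
    test_scores.append(clf_demo.score(X_te, y_te))

# The spread of these values indicates how sensitive the result is to the split.
print(test_scores)
```

The same loop can be repeated for the other classifiers to compare how much each one varies.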

<div class="alert alert-success">

<b>EXERCISE</b> (optional):

 <ul>
    <li>Can you construct a binary classification dataset (using <code>np.random</code> for example) on which <code>LogisticRegression</code> achieves an accuracy of 1?</li>
    <li>Can you construct a binary classification dataset on which it achieves accuracy 0.5?</li>
</ul>
</div>

In [None]:
# %load solutions/day1-05-04.py