# Your first Scikit-Learn model

- Adapted for AI Black Belt - Yellow (May 2019).
- [Tutorial](https://github.com/amueller/ml-workshop-1-of-4/) created by Andreas Mueller (2019). MIT License.

---

## Loading data

In [None]:
import numpy as np
import pandas as pd

In [None]:
df = pd.read_csv("data/titanic.csv")

<div class="alert alert-success">

<b>EXERCISE</b>:

 <ul>
    <li>Create a 2D Numpy array <code>X</code> from the columns 'Pclass' and 'Age'.</li>
    <li>Create a 1D Numpy array <code>y</code> from the column 'Survived'.</li>
</ul>
</div>

In [None]:
# %load solutions/day1-05-01.py

The resulting data should look like this:

In [None]:
X[:10]

In [None]:
y[:10]

<div class="alert alert-success">

<b>EXERCISE</b>:

How many elements do <code>X</code> and <code>y</code> contain?
</div>
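As a reminder of the NumPy attributes that are useful here, the sketch below uses two hypothetical toy arrays (`toy_2d` and `toy_1d`, not the Titanic data) to show the difference between `shape`, `size` and `len`.

```python
import numpy as np

# Hypothetical toy arrays, standing in for a feature matrix and a label vector.
toy_2d = np.zeros((4, 2))   # 4 rows (samples), 2 columns (features)
toy_1d = np.zeros(4)        # 4 labels

# shape gives the dimensions, size the total number of elements.
print(toy_2d.shape, toy_2d.size)
print(toy_1d.shape, toy_1d.size, len(toy_1d))
```

Note that `len` only counts along the first axis, so for a 2D array it returns the number of rows rather than the number of elements.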

## Really simple API

We will now build our first Scikit-Learn classifier. But before that, we need to split the data into two parts:
- training: this is used for training the classifier,
- test: this is used for testing the classifier.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
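To see what the call above does, here is a minimal sketch on a hypothetical toy dataset of 20 samples (the names `X_toy` and `y_toy` are illustrative only). When neither `test_size` nor `train_size` is given, `train_test_split` shuffles the rows and holds out 25% of them for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples with 2 features each, and binary labels.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20) % 2

# random_state fixes the shuffle, so the split is reproducible.
X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, random_state=0
)

print(X_toy_train.shape, X_toy_test.shape)
print(y_toy_train.shape, y_toy_test.shape)
```

Passing `test_size=0.2`, for example, would hold out 20% of the samples instead.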

Now the recipe is as follows:

0) Import your model class.

In [None]:
from sklearn.linear_model import LogisticRegression

1) Instantiate an object and set parameters.

In [None]:
clf = LogisticRegression()

2) Fit the model.

In [None]:
clf.fit(X_train, y_train)

Oops! Things don't seem to work as expected... The `X` array has missing entries, encoded as `NaN`. These are not handled by default in Scikit-Learn.
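The failure can be reproduced on a small example. The sketch below uses a hypothetical toy array (`X_nan`, not the Titanic data) to count the missing entries and to show that fitting raises a `ValueError`. It then fills the gaps with the column median via `SimpleImputer`, which is only one of several possible strategies and not necessarily the one intended by the exercise.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy features: columns mimic (Pclass, Age), with one missing Age.
X_nan = np.array([[1, 22.0], [3, np.nan], [2, 35.0],
                  [3, 4.0], [1, 58.0], [2, 30.0]])
y_nan = np.array([1, 0, 1, 0, 1, 0])

# Count the missing entries.
n_missing = int(np.isnan(X_nan).sum())
print("missing entries:", n_missing)

# Fitting on data that contains NaN raises a ValueError.
try:
    LogisticRegression().fit(X_nan, y_nan)
    fit_failed = False
except ValueError as err:
    fit_failed = True
    print("fit failed:", err)

# One option (an assumption, not the tutorial's solution): median imputation.
X_filled = SimpleImputer(strategy="median").fit_transform(X_nan)
LogisticRegression().fit(X_filled, y_nan)
print(X_filled[1])
```

Filling values directly in the pandas `DataFrame` before converting to NumPy is an equally valid route.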

<div class="alert alert-success">

<b>EXERCISE</b>:

 <ul>
    <li>Debug the data loading procedure above to convert <code>NaN</code> to numerical values.</li>
</ul>
</div>

3) Apply / evaluate.

In [None]:
print(clf.predict(X_train))
print(y_train)

In [None]:
clf.score(X_train, y_train)

In [None]:
clf.score(X_test, y_test)
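For classifiers, `score` returns the mean accuracy, that is, the fraction of samples for which `predict` agrees with the true label. The sketch below checks this on hypothetical synthetic data (`X_demo`, `y_demo` and `demo_clf` are illustrative names, not part of the tutorial).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data: the label depends on the sum of the two features.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

demo_clf = LogisticRegression().fit(X_demo, y_demo)

# Accuracy computed by hand from the predictions...
manual_accuracy = np.mean(demo_clf.predict(X_demo) == y_demo)
# ...and the same quantity as reported by score.
builtin_accuracy = demo_clf.score(X_demo, y_demo)

print(manual_accuracy, builtin_accuracy)
```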

## And again

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
clf = RandomForestClassifier(n_estimators=100)

In [None]:
clf.fit(X_train, y_train)

In [None]:
clf.score(X_train, y_train)

In [None]:
clf.score(X_test, y_test)
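Because every Scikit-Learn classifier exposes the same `fit` and `score` methods, the recipe can be wrapped in a loop. The following is a minimal sketch on hypothetical synthetic data generated with `make_classification`. The dictionary `models` and the variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical synthetic binary classification problem.
X_syn, y_syn = make_classification(n_samples=200, n_features=4, random_state=0)
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(
    X_syn, y_syn, random_state=0
)

models = {
    "logistic regression": LogisticRegression(),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Same steps for every model: fit on the training set, score on both sets.
scores = {}
for name, model in models.items():
    model.fit(X_syn_train, y_syn_train)
    scores[name] = (
        model.score(X_syn_train, y_syn_train),
        model.score(X_syn_test, y_syn_test),
    )
    print(name, scores[name])
```

Comparing the two numbers for each model gives a first impression of how well it generalizes beyond the training data.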

## Exercises

<div class="alert alert-success">

<b>EXERCISE</b>:

 <ul>
    <li>Load the iris dataset from <code>data/iris.tsv</code>.</li>
    <li>Split it into training and test sets using <code>train_test_split</code>.</li>
</ul>
</div>

In [None]:
# %load solutions/day1-05-02.py

<div class="alert alert-success">

<b>EXERCISE</b>:

 <ul>
    <li>Train and evaluate <code>KNeighborsClassifier</code>, <code>RandomForestClassifier</code> and <code>LogisticRegression</code> on the iris dataset.</li>
    <li>How do these perform on the training set vs the test set?</li>
    <li>Which one is the best on the training set, which one is the best on the test set?</li>
    <li>Are these results stable?</li>
</ul>
</div>

In [None]:
# %load solutions/day1-05-03.py

<div class="alert alert-success">

<b>EXERCISE</b> (optional):

 <ul>
    <li>Can you construct a binary classification dataset (using <code>np.random</code> for example) on which <code>LogisticRegression</code> achieves an accuracy of 1?</li>
    <li>Can you construct a binary classification dataset on which it achieves accuracy 0.5?</li>
</ul>
</div>

In [None]:
# %load solutions/day1-05-04.py