# Recap on Scikit-Learn

AI Black Belt - Yellow (May 2019).

---

## Loading data

In [1]:
import numpy as np
import pandas as pd

In [10]:
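df = pd.read_csv("data/titanic.csv")
df = df.fillna(value=-1)  # Replace missing values (NaN) with the placeholder -1

X = df[['Pclass', 'Age']]  # Features: passenger class and age
y = df['Survived']         # Target: 1 if the passenger survived, 0 otherwise

# Hold out half of the rows for testing; random_state makes the split reproducible
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.5)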
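
To see what `test_size` and `random_state` do in the split above, here is a minimal sketch. It assumes a small synthetic DataFrame as a stand-in for `data/titanic.csv` (the values are random, not the real data), so it runs without the file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic columns (assumption: random values, not the real data).
rng = np.random.RandomState(0)
df = pd.DataFrame({
    "Pclass": rng.randint(1, 4, size=100),
    "Age": rng.uniform(1, 80, size=100),
    "Survived": rng.randint(0, 2, size=100),
})
X = df[["Pclass", "Age"]]
y = df["Survived"]

# test_size=0.5 holds out half of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.5)
print(X_train.shape, X_test.shape)  # (50, 2) (50, 2)

# The same random_state reproduces exactly the same split.
X_train_again, _, _, _ = train_test_split(X, y, random_state=0, test_size=0.5)
same_split = X_train.index.equals(X_train_again.index)

# A different random_state shuffles the rows differently.
X_train_other, _, _, _ = train_test_split(X, y, random_state=1, test_size=0.5)
different_split = not X_train.index.equals(X_train_other.index)
```

Train and test rows never overlap, and fixing `random_state` is what makes the scores later in the notebook reproducible.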

## Workflow

The recipe is as follows:

0) Import your model class.

In [11]:
from sklearn.linear_model import LogisticRegression

1) Instantiate an object and set parameters.

In [12]:
clf = LogisticRegression(solver="lbfgs")
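
Parameters passed to the constructor are stored on the object and can be inspected or changed before fitting. A short sketch using the standard estimator methods `get_params` and `set_params`; the value `C=0.5` is only an illustrative choice.

```python
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(solver="lbfgs")

# get_params() returns the hyperparameters as a dictionary.
params = clf.get_params()
print(params["solver"], params["C"])  # lbfgs 1.0

# set_params() changes a hyperparameter in place (0.5 is an arbitrary example value).
clf.set_params(C=0.5)
```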

2) Fit the model.

In [13]:
clf.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

3) Apply the model: predict labels, estimate class probabilities, and compute the accuracy.

In [14]:
clf.predict(X_train[:5])

array([0, 0, 0, 0, 0])

In [15]:
clf.predict_proba(X_train[:5])

array([[0.55951759, 0.44048241],
       [0.6279641 , 0.3720359 ],
       [0.76343521, 0.23656479],
       [0.7723757 , 0.2276243 ],
       [0.76644207, 0.23355793]])
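
The two outputs above are linked: each row of `predict_proba` sums to 1, its columns follow the order of `clf.classes_`, and `predict` returns the class with the highest probability. A minimal sketch of that relationship, assuming synthetic data from `make_classification` rather than the Titanic file:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature binary problem (assumption: stand-in for the Titanic data).
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
clf = LogisticRegression(solver="lbfgs").fit(X, y)

proba = clf.predict_proba(X[:5])  # shape (5, 2); columns ordered as clf.classes_
pred = clf.predict(X[:5])

# Each row is a probability distribution over the classes.
row_sums = proba.sum(axis=1)

# Picking the most probable class for each row reproduces predict().
pred_from_proba = clf.classes_[np.argmax(proba, axis=1)]
```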

In [16]:
clf.score(X_train, y_train)

0.7033707865168539

In [21]:
clf.score(X_test, y_test)

0.6883408071748879

For a classifier, `score` returns the mean accuracy, so this is the same as the following:

In [20]:
from sklearn.metrics import accuracy_score
y_test_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_test_pred))

0.6883408071748879
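
Accuracy is simply the fraction of samples whose predicted label matches the true label. The sketch below checks that the three ways of computing it agree; it again assumes synthetic data from `make_classification` in place of the Titanic file.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic two-feature binary problem (assumption: stand-in for the Titanic data).
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.5)

clf = LogisticRegression(solver="lbfgs").fit(X_train, y_train)
y_test_pred = clf.predict(X_test)

acc_score = clf.score(X_test, y_test)             # estimator method
acc_metric = accuracy_score(y_test, y_test_pred)  # metrics function
acc_manual = np.mean(y_test_pred == y_test)       # fraction of correct predictions
```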


<div class="alert alert-success">

<b>EXERCISE</b>:

 <ul>
    <li>Reproduce the workflow above with two classifiers of your choice. Check the <a href="https://scikit-learn.org/stable/modules/classes.html">documentation</a> for the full reference.</li>
    <li>Which one achieves the best accuracy on the test set?</li>
    <li>Are these results stable when you change <code>random_state</code> in <code>train_test_split</code>?</li>
</ul>
</div>
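
As a hint for the last question, one possible way to measure stability is to repeat the split with several seeds and look at the spread of the test scores. This is only a sketch: it assumes synthetic data from `make_classification` and reuses `LogisticRegression`, so swap in your own data and classifiers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-feature binary problem (assumption: stand-in for the Titanic data).
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

test_scores = []
for seed in range(10):  # 10 seeds is an arbitrary choice
    # A new random_state gives a new train/test split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=seed, test_size=0.5)
    clf = LogisticRegression(solver="lbfgs").fit(X_train, y_train)
    test_scores.append(clf.score(X_test, y_test))

# A large standard deviation means the result depends strongly on the split.
print("mean: %.3f, std: %.3f" % (np.mean(test_scores), np.std(test_scores)))
```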