In [2]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

%matplotlib inline

# 6.2 - Testing and Validation
In the previous notebook, we fit, or *trained*, an estimator using a feature matrix *X* and a label vector *y*. We then measured the accuracy of this estimator by comparing its predictions on *X* with *y*. This is called the estimator's **training accuracy**. Training accuracy is an important but deceptive metric: an estimator can overfit its training data, scoring very well on the data it has already seen and poorly on new data. The ability of a model to perform well on previously unseen data is called **generalizability**.
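
As a rough illustration of why training accuracy can mislead, the sketch below fits a 1-nearest-neighbor classifier to made-up data whose labels are random coin flips. The data, the choice of `KNeighborsClassifier`, and the variable names are assumptions for this demonstration only, not part of the original notebook. Because the model simply memorizes the training points, its training accuracy is perfect even though there is nothing to learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Hypothetical noisy data: labels are coin flips, unrelated to the features
rng = np.random.RandomState(0)
X_noise = rng.rand(200, 5)
y_noise = rng.randint(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_noise, y_noise, test_size=0.25, random_state=0)

# A 1-nearest-neighbor classifier memorizes the training set
knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, knn.predict(X_tr))
test_acc = accuracy_score(y_te, knn.predict(X_te))
print('Training accuracy:', train_acc)
print('Testing accuracy:', test_acc)
```

The gap between the two numbers is the symptom of overfitting that a held-out test set is meant to expose.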
### We test for generalizability by splitting our data into a *training set* and a *test set*.

In [276]:
# We take a shortcut this time in creating our X's and y's
iris = load_iris()
X = iris.data
y = iris.target

from sklearn.model_selection import train_test_split

# train_test_split() returns four arrays: X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, stratify = y)
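
To see what `test_size` and `stratify` do, the following sketch repeats the split on iris and inspects the result. The `random_state` argument and the `_demo` variable names are additions for reproducibility and to avoid overwriting the notebook's own variables. With `stratify`, each class should appear in nearly equal numbers within each split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_demo = load_iris()
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    iris_demo.data, iris_demo.target,
    test_size=0.25, stratify=iris_demo.target, random_state=0)

# Sizes of the two splits
print('Train shape:', X_tr_demo.shape, 'Test shape:', X_te_demo.shape)

# Number of samples per class in each split
train_counts = np.bincount(y_tr_demo)
test_counts = np.bincount(y_te_demo)
print('Train class counts:', train_counts)
print('Test class counts:', test_counts)
```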

### Let's see how this works

In [18]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

lr = LogisticRegression()

# fit model to training data
lr.fit(X_train, y_train)

# evaluate model's training performance
y_train_pred = lr.predict(X_train)
print('Training accuracy:', accuracy_score(y_train, y_train_pred))

# evaluate model's test performance
y_test_pred = lr.predict(X_test)
print('Testing accuracy:', accuracy_score(y_test, y_test_pred))

Training accuracy: 0.975
Testing accuracy: 0.933333333333


# Your turn
Predicting the species in the iris dataset was a **classification** problem, where each prediction comes from a set of distinct categories. The following is a **regression** problem, where each prediction is a continuous number.

In [178]:
X = np.random.rand(1000,3) * 100
y = (X[:,0] + 2 * X[:,1] + X[:,2]) + 10 * np.random.random(1000)

- Use the Ridge estimator, located in sklearn.linear_model.
- Accuracy does not apply to regression problems. Use mean_squared_error from sklearn.metrics instead.
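
As a quick check on what mean_squared_error measures, this small sketch computes it by hand and with scikit-learn on three made-up values (the numbers are purely illustrative). It is the average of the squared differences between true and predicted values, so lower is better and 0 means a perfect fit.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions, for illustration only
y_true_demo = np.array([3.0, 5.0, 8.0])
y_pred_demo = np.array([2.0, 7.0, 8.0])

# Squared errors are 1, 4 and 0, so the mean is 5 / 3
mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2)
mse_sklearn = mean_squared_error(y_true_demo, y_pred_demo)

print('Manual MSE: ', mse_manual)
print('sklearn MSE:', mse_sklearn)
```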

In [196]:
# Import the relevant estimator
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Create your training and testing data from X and y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5)

# Instantiate Ridge estimator object


# fit model to training data


# evaluate model's training performance


# evaluate model's test performance


# Let's go back to iris.

In [212]:
iris = load_iris()
X = iris.data
y = iris.target

from sklearn.model_selection import train_test_split

# train_test_split() returns four arrays: X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify = y)

# One of the commandments of predictive analysis: never change your model after seeing its test performance.
Why? Because tuning model parameters to raise the test score risks overfitting to the test data, and once that happens you have no unseen data left with which to measure generalizability. To assess a model without touching the test data, we hold out part of the training data as a *validation set*, so the data plays three roles: training, validation, and test. **Cross-validation** does this repeatedly: it splits the training data into *k* folds, trains on *k* - 1 of them, scores the model on the remaining fold, and averages the *k* scores. Scoring on validation data lets us compare models and settings while saving the test data for a single final evaluation.
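
The fold-by-fold procedure can be sketched by hand as follows. This is an illustration of the idea rather than the notebook's own code: the `_demo` names are hypothetical, and the loop uses `StratifiedKFold` because `cross_val_score` uses stratified folds when given a classifier and an integer `cv`. Each fold gets a fresh copy of the model via `clone`, so no fold sees a model trained on its own validation data.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)
model_demo = SVC(C=2000)

fold_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    # Train a fresh, unfitted copy of the model on four of the five folds
    fold_model = clone(model_demo)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # Score it on the fold that was held out
    fold_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))

cv_scores = cross_val_score(model_demo, X_demo, y_demo, cv=5)
print('Manual fold scores:', np.round(fold_scores, 3))
print('cross_val_score:   ', np.round(cv_scores, 3))
print('Mean validation accuracy:', np.mean(fold_scores))
```

In practice `cross_val_score` does all of this in one call; the loop is only there to make the steps visible.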

In [283]:
from sklearn.svm import SVC
# cross_val_score lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import cross_val_score

svm = SVC(C = 2000)

# fit model to training data
svm.fit(X_train, y_train)

# evaluate model's training performance
y_train_pred = svm.predict(X_train)
print('Training accuracy:', accuracy_score(y_train, y_train_pred))

# evaluate model's validation performance
print('Validation accuracy:', cross_val_score(estimator = svm, X = X_train, y = y_train, cv = 5).mean())

Training accuracy: 1.0
Validation accuracy: 0.929004329004


In [284]:
y_test_pred = svm.predict(X_test)
print('Testing accuracy:', accuracy_score(y_test, y_test_pred))

Testing accuracy: 0.921052631579
