We haven't yet worked with the Iris dataset in this module, so we'll start there. In the following example, we load the data and then separate out the feature petal_width and the target petal_length. Having plotted this data earlier, we know there is a linear relationship between the petal width and length: the wider the petal, the greater the length. We'll use a linear model to predict our target.

In [1]:
# Import numpy and seaborn
import numpy as np
import seaborn as sns

iris = sns.load_dataset("iris")
display(iris.head())

x = iris['petal_width']
X = np.array(x)[:, np.newaxis]
y = iris['petal_length']

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
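Scikit-learn estimators expect the feature matrix X to be two-dimensional, with one row per sample and one column per feature, which is why the cell above adds an axis with np.newaxis. As a minimal sketch, using a small made-up array in place of the real petal widths, the following shows the shape change and an equivalent reshape call:

```python
import numpy as np

# Hypothetical stand-in for a single feature column (not the Iris values)
x = np.array([0.2, 0.4, 1.3, 2.1])

# Add a second axis so each sample becomes a row with one feature
X = x[:, np.newaxis]

# reshape(-1, 1) produces the same column layout
X_alt = x.reshape(-1, 1)

print(x.shape)                   # (4,)
print(X.shape)                   # (4, 1)
print(np.array_equal(X, X_alt))  # True
```

Either form works; which one to use is mostly a matter of taste.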


First, we'll hold back a subset of the data to use only as the test set. We'll do this with scikit-learn's train_test_split utility. We'll name the other portion "remain" rather than "train" so that we don't confuse it with the actual training data later.

In [2]:
# Import the train_test_split utility
from sklearn.model_selection import train_test_split

# Create the "remaining" and test datasets
X_remain, X_test, y_remain, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

Then we'll create a training set and a validation set from the remaining data. We could have done this in one step, but breaking it down makes it easier to see that we removed a test subset and won't accidentally use it for evaluation until we're ready to test.

In [3]:
# Create the train and validation datasets

X_train, X_val, y_train, y_val = train_test_split(
    X_remain, y_remain, test_size=0.25, random_state=42)

# Print out sizes of train, validate, test datasets

print('Training data set samples:', len(X_train))
print('Validation data set samples:', len(X_val))
print('Test data set samples:', len(X_test))

Training data set samples: 90
Validation data set samples: 30
Test data set samples: 30
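The sizes above follow from the two test_size values: 20% of the 150 samples is held back first, and 25% of the remaining 120 becomes the validation set, which works out to a 60/20/20 split overall. A rough sketch of that arithmetic is below. The use of round() is only for this illustration and is not meant to mirror how train_test_split rounds internally.

```python
n_samples = 150  # number of rows in the Iris dataset

# First split: hold back 20% of all samples for testing
n_test = round(n_samples * 0.2)
n_remain = n_samples - n_test

# Second split: 25% of what remains becomes the validation set
n_val = round(n_remain * 0.25)
n_train = n_remain - n_val

print(n_train, n_val, n_test)  # 90 30 30

# Each subset as a fraction of the full dataset
print(n_train / n_samples, n_val / n_samples, n_test / n_samples)  # 0.6 0.2 0.2
```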


Now we can fit our model and evaluate it on our validation set.

In [4]:
# Import the linear regression estimator
from sklearn.linear_model import LinearRegression

# Instantiate the model
model = LinearRegression()

# Fit the model
model.fit(X_train, y_train)

# Use the VALIDATION set for prediction
y_predict = model.predict(X_val)

# Calculate the R-squared score on the validation set
from sklearn.metrics import r2_score
r2_score(y_val, y_predict)

0.9589442606386026

Well, that's a pretty good model score (R-squared), which we expect because we know the Iris dataset has a strong linear trend between petal width and petal length. Now would be the time to adjust any of the model hyperparameters and evaluate on the validation set again. We'll continue with the default model parameters for now. Hyperparameter tuning will be introduced in later Sprints.
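To make the R-squared score less of a black box, here is a minimal sketch that computes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The helper name r_squared and the small demo arrays are made up for illustration. For a single target like ours, this should agree with what r2_score returns.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Squared error left over after the model's predictions
    ss_res = np.sum((y_true - y_pred) ** 2)
    # Squared error of always predicting the mean of y_true
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical values, not taken from the Iris data
y_true_demo = [1.0, 2.0, 3.0, 4.0]
y_pred_demo = [1.1, 1.9, 3.2, 3.8]

print(r_squared(y_true_demo, y_pred_demo))  # approximately 0.98
```

A score of 1.0 means the predictions match the targets exactly, while a score near 0 means the model does no better than predicting the mean.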

Now, let's use the test set we held back above.

In [5]:
# Use the TEST set for prediction
y_predict_test = model.predict(X_test)

# Calculate the R-squared score on the test set
r2_score(y_test, y_predict_test)

0.9287783612248339

The R-squared score is a little lower than it was for the validation set. If we were to split the data, fit the model, and test again with a different random seed, the scores would be different, and the test score might even be higher than the validation score.
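To get a feel for how much the score depends on the split, the sketch below repeats a simplified train/test split with several seeds and records the test score each time. It makes a few assumptions: the data is synthetic and only roughly linear (a stand-in for the Iris measurements), the line is fitted with np.polyfit rather than LinearRegression, and the validation split is left out for brevity.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic, roughly linear data (an assumption standing in for Iris)
rng = np.random.default_rng(0)
x = rng.uniform(0.1, 2.5, size=150)
y = 1.0 + 2.2 * x + rng.normal(0.0, 0.4, size=150)

test_scores = []
for seed in range(5):
    # Shuffle the row indices with a seed-specific generator
    idx = np.random.default_rng(seed).permutation(len(x))
    test_idx, train_idx = idx[:30], idx[30:]

    # Fit a straight line on the training rows only
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)

    # Score the fitted line on the held-out rows
    y_pred = slope * x[test_idx] + intercept
    test_scores.append(r_squared(y[test_idx], y_pred))

# One score per seed; the values vary from split to split
print([round(s, 3) for s in test_scores])
```

The spread across seeds is a reminder that a single test score is an estimate, not an exact measure of how the model will perform on new data.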