---
title: Analysis
subtitle: The model development process in scikit-learn
authors:
  - name: Ian Carroll
    affiliations:
      - University of Maryland Baltimore County
      - NASA Goddard Space Flight Center
  - name: Rachel Wegener
    affiliations:
      - University of Maryland College Park
github: nasa-sarp/lesson-analysis-i-east
---

::::{grid}

:::{card}
:header: Context 🤔
You've created or have access to great data! Now what? This lesson starts moving from exploration to analysis.
:::

:::{card}
:header: Outcome 🎓
Building a model to make a prediction will become a concrete, well-defined process.
:::

:::{card}
:header: Skills 🤓
A new tool for understanding why your code produces an error, the `%debug` command!
:::

::::

In [None]:
import holoviews
import hvplot.xarray
import numpy
import sklearn.datasets
import sklearn.linear_model
import sklearn.model_selection
import xarray

options = xarray.set_options(display_style='text')
holoviews.opts.defaults(
    holoviews.opts.Histogram(active_tools=[], toolbar=None),
    holoviews.opts.Scatter(active_tools=[], toolbar=None),
    holoviews.opts.Image(active_tools=[]),
    holoviews.opts.Points(active_tools=[]),
    holoviews.opts.Layout(toolbar=None),
    holoviews.opts.ErrorBars(upper_head=None, lower_head=None)
)

In [None]:
holoviews.extension('bokeh')

## Models: Conceptual, Mathematical, and Numerical

intro stuff

## A Dataset for Regression

This lesson uses a toy dataset concerning diabetes that ships with scikit-learn. The "sample" dimension of the dataset corresponds to different study participants. The "feature" dimension corresponds to the different variables that may predict disease progression.

In [None]:
sk_diabetes = sklearn.datasets.load_diabetes()
x = xarray.DataArray(
    data=sk_diabetes.data,
    dims=('sample', 'feature'),
    coords={
        'feature': sk_diabetes.feature_names,
    },
    name='x',
)
x

In [None]:
x.hvplot.hist(groupby=['feature'], widget_location='top')

The response variable, which the features may help predict, is a quantitative measure of disease progression one year after baseline.

In [None]:
y = xarray.DataArray(
    data=sk_diabetes.target,
    dims='sample',
    name='y',
)
diabetes = xarray.merge((y, x))
diabetes

Visualization allows us to inspect the potential of any one feature to predict the response, but we will need a model to explore using all features at the same time.

In [None]:
diabetes.hvplot.scatter(y='y', x='x', groupby='feature', widget_location='top')

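Splitting the data into training and test sets is a route towards model evaluation: the model is fit on one part and evaluated on the other. Here each sample is randomly labeled, with probability 0.8 for "train" and 0.2 for "test".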

In [None]:
train_or_test = numpy.random.choice(
    a=('train', 'test'),
    size=diabetes.sizes['sample'],
    p=(0.8, 0.2),
)
train_or_test
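
Because the labels are drawn at random, the split changes every time the cell runs. A minimal sketch of a repeatable variant, assuming a fixed seed (0 here, an arbitrary choice) is acceptable and using NumPy's `default_rng` generator in place of `numpy.random.choice`:

```python
import numpy

# Assumption: seed 0 is arbitrary; any fixed integer gives a repeatable split.
rng = numpy.random.default_rng(seed=0)
labels = rng.choice(
    a=('train', 'test'),
    size=442,  # number of samples in the diabetes dataset
    p=(0.8, 0.2),
)

# Each label is drawn independently, so the counts are only roughly 80/20.
values, counts = numpy.unique(labels, return_counts=True)
print(dict(zip(values, counts)))
```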

In [None]:
diabetes['split'] = ('sample', train_or_test)
diabetes

In [None]:
gb = diabetes.groupby('split')
gb

In [None]:
test, train = gb
test['x']

Iterating over a groupby yields (label, dataset) pairs. Drop the groupby label and keep just the dataset from each group.

In [None]:
test = test[1]
train = train[1]
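
Unpacking the groupby directly relies on the groups arriving in sorted label order ("test" before "train"). A hedged alternative, shown here on a small made-up dataset rather than the diabetes data, is to collect the pairs into a dictionary and look each group up by name:

```python
import numpy
import xarray

# Hypothetical six-sample dataset, used only to illustrate the lookup.
toy = xarray.Dataset(
    {
        'y': ('sample', numpy.arange(6.0)),
        'split': (
            'sample',
            numpy.array(['train', 'test', 'train', 'train', 'test', 'train']),
        ),
    }
)

# Each item from the groupby is a (label, dataset) pair, so dict() keys by label.
groups = dict(toy.groupby('split'))
toy_train = groups['train']
toy_test = groups['test']
print(toy_train.sizes['sample'], toy_test.sizes['sample'])
```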

## Training

The scikit-learn package includes many kinds of models, and does a good job of keeping the interface to all of them similar.

In [None]:
model = sklearn.linear_model.LinearRegression()
model
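
As a rough illustration of that shared interface, the sketch below runs two different linear models through the same construct, `fit`, `predict` sequence. The synthetic data and the choice of `Ridge` as the second model are assumptions made for the example, not part of this lesson's analysis.

```python
import numpy
import sklearn.linear_model

# Synthetic inputs and outputs, invented for this example.
rng = numpy.random.default_rng(0)
features = rng.normal(size=(50, 3))
response = features @ numpy.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

candidates = {
    'linear': sklearn.linear_model.LinearRegression(),
    'ridge': sklearn.linear_model.Ridge(alpha=1.0),
}

# The loop body never needs to know which kind of model it is handling.
predictions = {}
for name, candidate in candidates.items():
    candidate.fit(features, response)
    predictions[name] = candidate.predict(features)

print({name: values.shape for name, values in predictions.items()})
```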

:::{seealso} Supervised Learning
Supervised learning is the subset of statistical models that rely on having both inputs and outputs on hand to train. The alternative, "unsupervised" learning, involves training models on inputs alone; for example, clustering high-dimensional inputs into unknown but similar categories, or change-point detection in a time series.
:::

In [None]:
x = train['x']
y = train['y']

Every scikit-learn model for supervised learning has a fit method that takes the inputs (a.k.a. features or predictors) as the first argument and the outputs (a.k.a. targets, labels, or responses) as the second argument.

In [None]:
model.fit(x, y)

A trained model can now be used to make predictions based on the inputs alone. Note that "prediction" is not used in modeling to mean predicting into the future (which is "forecasting"). The name "estimate" might be more logical.

In [None]:
train['estimate'] = ('sample', model.predict(x))
train
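
For a linear regression, the prediction is nothing more than a weighted sum of the inputs plus a constant. The sketch below checks this on a small noise-free dataset invented for the purpose, using the fitted `coef_` and `intercept_` attributes of `LinearRegression`; the name `toy_model` is hypothetical and chosen to avoid clobbering `model`.

```python
import numpy
import sklearn.linear_model

# Noise-free toy data: response = 3 * first feature - second feature + 5.
rng = numpy.random.default_rng(0)
features = rng.normal(size=(20, 2))
response = 3.0 * features[:, 0] - features[:, 1] + 5.0

toy_model = sklearn.linear_model.LinearRegression().fit(features, response)

# Reproduce predict() by hand from the fitted coefficients and intercept.
by_hand = features @ toy_model.coef_ + toy_model.intercept_
matches = numpy.allclose(by_hand, toy_model.predict(features))
print(matches)
```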

If the fitting procedure has worked, then the estimate should track the outputs "y". For the case of univariate outputs, a 1:1 plot is a good visual check.

In [None]:
(
    train.hvplot.scatter(x='y', y='estimate', groupby=[])
    * holoviews.Slope(1, 0).opts(color='orange')
)

## Evaluation

Quantitative evaluation, in the machine learning framework, should be performed on data that was not used during the fitting process.
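
Holding out data is common enough that scikit-learn provides a helper for it. As a sketch of an alternative to the hand-built split used in this lesson, `sklearn.model_selection.train_test_split` shuffles and divides plain arrays in one call. The `random_state=0` seed is an arbitrary assumption, and the example works on NumPy arrays rather than the xarray dataset.

```python
import sklearn.datasets
import sklearn.model_selection

# return_X_y=True gives the features and response as plain NumPy arrays.
features, response = sklearn.datasets.load_diabetes(return_X_y=True)

# test_size=0.2 holds out a fixed 20% of samples, unlike the random labels
# above, where the test fraction is only approximately 20%.
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(
    features, response, test_size=0.2, random_state=0
)
print(x_train.shape, x_test.shape)
```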

In [None]:
x = test['x']
y = test['y']

In [None]:
test['estimate'] = ('sample', model.predict(x))

In [None]:
test_plt = test.drop_dims('feature')
(
    test_plt.hvplot.scatter(x='y', y='estimate')
    * holoviews.Slope(1, 0).opts(color='orange')
)

Quantitative measures of the quality of the estimates begin with examining the residuals, or the differences between the observations and the estimates.

In [None]:
test['residual'] = test['y'] - test['estimate']
test['cludge'] = test['residual'] * 0 # all zeros: a dummy lower error so the error bars span only the residual

In [None]:
test_plt = test.drop_dims('feature')
(
    test_plt.hvplot.scatter(x='y', y='estimate')
    * test_plt.hvplot.errorbars(x='y', y='estimate', yerr1='cludge', yerr2='residual', hover=[])
    * holoviews.Slope(1, 0).opts(color='orange')
)

There are multiple ways to summarize the residuals. One is a scalar score called the coefficient of determination, or $R^2$.
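
Written out, with $y_i$ the observations, $\hat{y}_i$ the estimates, $\bar{y}$ the mean observation, and $n$ the number of samples (notation assumed here for illustration), the score compares the residual sum of squares to the total sum of squares:

$$
\mathrm{RSS} = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2, \qquad
\mathrm{TSS} = \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2, \qquad
R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}
$$

The total sum of squares equals $n$ times the variance of $y$ when the variance uses the $1/n$ normalization, which is the default for xarray's `var`.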

In [None]:
RSS = (test['residual'] ** 2).sum()
TSS = test['y'].var() * test.sizes['sample']

In [None]:
1 - RSS / TSS 

:::{seealso} Overfitting
An overfit model does not generalize beyond the data used for training.
:::
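
To make overfitting concrete, here is a minimal sketch using only NumPy. The data (noisy points around the line y = 2x), the sample sizes, and the polynomial degrees are all assumptions chosen for illustration. The flexible polynomial always matches the training points at least as well as the straight line, but it typically scores worse on fresh points.

```python
import numpy

def r_squared(observed, estimated):
    """Coefficient of determination, 1 - RSS / TSS."""
    rss = ((observed - estimated) ** 2).sum()
    tss = ((observed - observed.mean()) ** 2).sum()
    return 1 - rss / tss

rng = numpy.random.default_rng(0)

def sample(size):
    """Draw noisy points around the hypothetical line y = 2x."""
    x = rng.uniform(0, 1, size=size)
    return x, 2 * x + rng.normal(scale=0.3, size=size)

x_train, y_train = sample(12)
x_test, y_test = sample(200)

# Compare a straight line (degree 1) with a much more flexible curve (degree 8).
scores = {}
for degree in (1, 8):
    coefficients = numpy.polyfit(x_train, y_train, deg=degree)
    scores[degree] = (
        r_squared(y_train, numpy.polyval(coefficients, x_train)),
        r_squared(y_test, numpy.polyval(coefficients, x_test)),
    )
    print(degree, scores[degree])
```

A large drop from the first number (train) to the second (test) for a given degree is the signature to watch for.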

This or some other scalar metric is so useful, especially for checking for overfitting, that most scikit-learn models have a built-in method for returning this model "score".

When the score on the test set is much worse than the score on the training set, the model does not generalize. Unfortunately, there is no widely recognized threshold for how large a gap between the train and test scores is too large. There is always noise, and this one looks okay!

In [None]:
model.score(test['x'], test['y'])

In [None]:
model.score(train['x'], train['y'])
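
For regression models such as `LinearRegression`, the built-in `score` is the same $R^2$ computed by hand earlier. The sketch below checks that on the diabetes data using plain NumPy arrays; the `train_test_split` call and its `random_state=0` seed are assumptions for the example, so the numbers will differ from the ones in this notebook.

```python
import sklearn.datasets
import sklearn.linear_model
import sklearn.model_selection

features, response = sklearn.datasets.load_diabetes(return_X_y=True)
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(
    features, response, test_size=0.2, random_state=0
)
toy_model = sklearn.linear_model.LinearRegression().fit(x_train, y_train)

# R^2 by hand on the held-out data: 1 - RSS / TSS.
residual = y_test - toy_model.predict(x_test)
rss = (residual ** 2).sum()
tss = ((y_test - y_test.mean()) ** 2).sum()
by_hand = 1 - rss / tss

# The same quantity from the model's own method.
built_in = toy_model.score(x_test, y_test)
print(by_hand, built_in)
```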

Beyond the scalar values, and once overfitting has been ruled out, the next step in evaluating the model is to examine the residuals.

Checking a residual histogram provides a bulk overview. The residuals should be just noise, meaning they should look random. Different assumptions are possible, but at least the residuals should not be biased. Again, this looks okay.

In [None]:
(
    test['residual'].hvplot.hist()
    * holoviews.VLine(test['residual'].mean().item()).opts(color='red', line_dash='dotted')
    * holoviews.VLine(test['residual'].median().item())
)

Even if the distribution looks random, closer examination could reveal patterns in the residuals. For this model, the residuals appear to correlate with the response. That means there is some bias in our estimate, and we want to start considering a more flexible model. With great flexibility comes great potential for overfitting.

In [None]:
test.hvplot.scatter(x='y', y='residual', groupby=[])
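
One hypothetical way to put a number on the pattern in the scatter plot is the Pearson correlation between the response and the residuals. The sketch below is self-contained, so it refits a model on plain NumPy arrays with an assumed `random_state=0` split; its value will not match this notebook's split exactly.

```python
import numpy
import sklearn.datasets
import sklearn.linear_model
import sklearn.model_selection

features, response = sklearn.datasets.load_diabetes(return_X_y=True)
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(
    features, response, test_size=0.2, random_state=0
)
toy_model = sklearn.linear_model.LinearRegression().fit(x_train, y_train)

# Residuals on the held-out data, then their correlation with the response.
residual = y_test - toy_model.predict(x_test)
correlation = numpy.corrcoef(y_test, residual)[0, 1]
print(correlation)
```

For least-squares fits, some positive correlation between the observed response and the residuals is expected whenever the score is below 1, so the size of the value is more informative than its sign.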

## Next Steps

- classification versus regression
- feature importance
- so many ways to model

### Closing Poll

Please take the [closing poll](https://PollEv.com/clickable_images/WHersUlQe5SsG68KOHyfL/respond), which is anonymous, as all the others are.

:::{danger} Shutdown
Please shut down your server! (File > Hub Control Panel)
:::