# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import sys
import os
if not any(path.endswith('textbook') for path in sys.path):
    sys.path.append(os.path.abspath('../../..'))
from textbook_utils import *

# Summary

This case study has demonstrated the different purposes of modeling: description, inference, and prediction. For description, we sought a simple, understandable model. We handcrafted this model, beginning with our findings from the exploratory phase of the analysis. Every action we took to include a feature in the model, collapse categories, or transform a feature amounts to a decision we made while investigating the data.

In modeling a natural phenomenon such as the weight of a donkey, we would ideally make use of both physical and statistical models. In this case, the physical model is the representation of a donkey by a cylinder. We could have used this representation to estimate the weight of a donkey (cylinder) from its length and girth. Since girth is $2\pi r$, the volume of the cylinder, $\pi r^2 \times length$, is proportional to $girth^2 \times length$, and so

$$ weight \propto girth^2 \times length$$ 

This physical model suggests that the log-transformed weight is approximately linear in the logs of girth and length, for some constant $c$:

$$ \log(weight) = c + 2\log(girth) + \log(length)$$ 
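
As a quick check of this reasoning, the following sketch simulates idealized "cylinder donkeys" and regresses log weight on log girth and log length. Everything here is an assumption made for illustration: the measurement ranges, the density, and the noise level are made up and are not taken from the donkey data. If weight really is proportional to $girth^2 \times length$, the fitted coefficients should land near 2 and 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical measurements in cm (made-up ranges, not the real data)
n = 500
girth = rng.uniform(90, 130, size=n)
length = rng.uniform(70, 110, size=n)

# Cylinder volume is pi * r^2 * length with r = girth / (2 * pi),
# so volume = girth^2 * length / (4 * pi); weight = density * volume.
density = 1e-3                        # hypothetical, kg per cubic cm
noise = rng.normal(0, 0.05, size=n)   # noise on the log scale
weight = density * girth**2 * length / (4 * np.pi) * np.exp(noise)

# Least squares fit of log(weight) = c + a*log(girth) + b*log(length)
X = np.column_stack([np.ones(n), np.log(girth), np.log(length)])
coef, *_ = np.linalg.lstsq(X, np.log(weight), rcond=None)
c, a, b = coef

print(f"girth exponent: {a:.2f}, length exponent: {b:.2f}")
```

Real donkeys are not cylinders, so on actual data the fitted exponents need not match 2 and 1; comparing them with these values is one way to see how far the physical model carries.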

Given this physical model, you might wonder why we did not use logarithmic or square transformations in our model. We leave you to investigate such a model in greater detail. In general, though, if the range of measured values is small, then the log function is roughly linear over that range. To keep our model simple, we chose not to make these transformations, given the strength of the statistical model as seen in the high correlation between the girth and weight of the donkeys.
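
The claim that the log function is roughly linear over a small range can be checked numerically. This sketch fits a straight line to $\log(x)$ over a narrow interval and over a much wider one and compares the worst-case errors. The intervals are hypothetical values chosen for illustration, not the ranges observed in the data.

```python
import numpy as np

def max_linear_error(lo, hi, num=200):
    """Largest gap between log(x) and its least squares line on [lo, hi]."""
    x = np.linspace(lo, hi, num)
    slope, intercept = np.polyfit(x, np.log(x), deg=1)
    return np.max(np.abs(np.log(x) - (slope * x + intercept)))

# Hypothetical narrow range, roughly the scale of a girth in cm
max_err_narrow = max_linear_error(90, 130)

# A much wider range, where the curvature of log matters
max_err_wide = max_linear_error(1, 1000)

print(f"narrow range: {max_err_narrow:.4f}, wide range: {max_err_wide:.4f}")
```

Over the narrow interval the line tracks the log closely, while over the wide interval it does not, which is why the transformation matters less when the measurements span a small range.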

We added the categorical variables as indicator (dummy) variables. If we had kept all of the dummies for a categorical feature in the model, the predictions would not have changed. However, the model would be overparameterized, and the model fit would not have a unique solution. While this makes little difference for prediction, it's problematic for inference. That's why we left out one dummy variable for each categorical variable that we encoded. We could have left out any one of the dummies for a categorical feature, but we chose to drop the central or most common one. This way, the coefficient of each remaining dummy can be interpreted as how much that group differs from the common group. For BCS, we dropped the dummy that corresponds to a BCS of 3, and found that for a lower BCS we subtracted six or seven kg from the predicted weight, and for a BCS of 4 we added 20 kg.

To keep our model simple, we did not consider models where the slopes of the numeric features differ between groups. To fit such a model, we would construct variables that are the product of, say, girth and each BCS dummy. The model grows quickly in complexity with this approach. Given that we have about 500 donkeys, some of these groups (such as BCS under 2 and age over 5) would be fitted on only a few data points. We don't pursue this topic further, and refer you to other resources such as XXXX.
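
The points about dummy variables can be illustrated with a small simulation. This is a sketch on made-up data, not the donkey data: the column names `girth`, `bcs`, and `weight`, the three BCS levels, and the simulated effects are all assumptions chosen for illustration. It shows that keeping every dummy alongside an intercept makes the design matrix rank deficient yet leaves the predictions unchanged, and that interaction terms add one column per remaining dummy.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200

# Made-up data: BCS 3 is the most common level, as in the text
df = pd.DataFrame({
    "girth": rng.uniform(90, 130, size=n),
    "bcs": rng.choice([2, 3, 4], size=n, p=[0.2, 0.6, 0.2]),
})
effect = df["bcs"].map({2: -6.0, 3: 0.0, 4: 20.0})  # hypothetical effects
df["weight"] = -150 + 2.5 * df["girth"] + effect + rng.normal(0, 5, size=n)

dummies = pd.get_dummies(df["bcs"], prefix="bcs", dtype=float)
girth = df["girth"].to_numpy()
y = df["weight"].to_numpy()

# Intercept + girth + all three dummies: the dummies sum to the intercept
X_full = np.column_stack([np.ones(n), girth, dummies.to_numpy()])
# Drop the dummy for the reference group (BCS 3)
D_drop = dummies.drop(columns="bcs_3").to_numpy()
X_drop = np.column_stack([np.ones(n), girth, D_drop])

print(np.linalg.matrix_rank(X_full), "of", X_full.shape[1], "columns")
print(np.linalg.matrix_rank(X_drop), "of", X_drop.shape[1], "columns")

# lstsq returns one of many solutions for X_full, but the fitted values agree
coef_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
coef_drop, *_ = np.linalg.lstsq(X_drop, y, rcond=None)
pred_full = X_full @ coef_full
pred_drop = X_drop @ coef_drop
print("same predictions:", np.allclose(pred_full, pred_drop))

# With BCS 3 dropped, dummy coefficients are differences from the BCS 3 group
print("bcs_2, bcs_4 coefficients:", coef_drop[2:].round(1))

# Group-specific slopes: one girth-by-dummy product per remaining dummy
X_inter = np.column_stack([X_drop, girth[:, None] * D_drop])
print("columns with interactions:", X_inter.shape[1])
```

With only one numeric feature and one three-level category, the interaction model already has six columns; adding more features and categories multiplies the count, which is the growth in complexity described above.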

We did a lot of data dredging in this modeling exercise. We examined all possible models built from linear combinations of the numeric features, and we examined coefficients of dummy variables to decide whether to collapse categories. This is why it is important that we set aside data to assess the model. Evaluating the model on new data reassures us that the model we chose works well. The data that we set aside did not enter into any decision making when building the model, so it should give us a good sense of how well the model works for making predictions. Also, if we had more features to include in the model, the number of possible models would grow quickly. For example, a data set with 10 features has over 1,000 different combinations of variables to consider. Other approaches to model fitting when there are many features can be found in {numref}`Chapter %s <ch:regularization>`.
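
The count of candidate models can be verified directly. This sketch uses hypothetical feature names `x1` through `x10` and counts every non-empty subset of them, first with binomial coefficients and then by enumeration.

```python
from itertools import combinations
from math import comb

# Hypothetical feature names, for illustration only
features = [f"x{i}" for i in range(1, 11)]
p = len(features)

# Count subsets of each size k = 1, ..., p and add them up
n_subsets = sum(comb(p, k) for k in range(1, p + 1))

# Enumerate the same subsets explicitly
all_subsets = [
    subset
    for k in range(1, p + 1)
    for subset in combinations(features, k)
]

print(n_subsets, len(all_subsets))  # 2**10 - 1 non-empty subsets
```

Counting the intercept-only model as well gives $2^{10} = 1024$ models, and each additional feature doubles the total.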

Fitting models is often a balance between simplicity and complexity, and a balance between physical and statistical models. Physical models can be a good starting point in modeling, and statistical models can inform physical models.  Model fitting is both an art and a science. 