<a href="https://colab.research.google.com/github/PaulToronto/Math-and-Data-Science-Reference/blob/main/Scikit_learn_Estimators_Transformers_Predictors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scikit-learn - Estimators, Transformers and Predictors

https://scikit-learn.org/stable/developers/develop.html

- **IMPORTANT**: never use `.fit()` or `.fit_transform()` on anything other than the training set
    - that includes the validation set, the test set and new data
- Note that with a range scaler such as `MinMaxScaler`, the training set values will always be scaled to the specified range, but if new data contains outliers, these may end up scaled outside the range
    - to avoid this, set the `clip` hyperparameter to `True`
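
The two points above can be sketched with a small example. This is a minimal illustration, assuming the scaler in question is `MinMaxScaler` (whose constructor accepts `clip`); the arrays `X_train` and `X_new` are hypothetical toy data, not from the original notes.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: one feature, training values span 0..10
X_train = np.array([[0.0], [5.0], [10.0]])
# New data containing values outside the training range
X_new = np.array([[-5.0], [5.0], [20.0]])

# Fit on the training set only, then reuse the learned min/max on new data
scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)
X_new_scaled = scaler.transform(X_new)  # transform only, never refit

# With clip=True, transformed values are clipped to feature_range
clipping_scaler = MinMaxScaler(feature_range=(0, 1), clip=True)
clipping_scaler.fit(X_train)
X_new_clipped = clipping_scaler.transform(X_new)

print(X_new_scaled.ravel())   # outliers fall outside [0, 1]
print(X_new_clipped.ravel())  # outliers are clipped into [0, 1]
```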

## Estimators

- Any object that can estimate some parameters based on a dataset is called an **estimator**
    - They all have a `.fit()` method
        - It takes two parameters for supervised learning algorithms, the data and the labels
            - `estimator = estimator.fit(data, targets)`
        - It always takes at least one parameter, the dataset
            - `estimator = estimator.fit(data)`
        - Any other parameter needed to guide the estimation is considered a **hyperparameter** and is set as an instance variable, generally via a constructor argument
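
As a rough sketch of both `.fit()` forms, the example below uses `SimpleImputer` (data only) and `LinearRegression` (data and targets). The choice of these two classes and the toy arrays are assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Hypothetical toy dataset with one missing value
data = np.array([[1.0], [2.0], [np.nan], [4.0]])

# `strategy` is a hyperparameter, set through the constructor
imputer = SimpleImputer(strategy="median")
fitted = imputer.fit(data)  # only the dataset is passed; fit() returns the estimator itself

# Supervised case: fit() takes the data and the targets
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
reg = LinearRegression(fit_intercept=True)
reg = reg.fit(X, y)
```

Because `.fit()` returns the estimator, calls can be chained, for example `SimpleImputer(strategy="median").fit(data)`.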

## Transformers

- Some estimators can also transform a dataset. These are called **transformers**
    - They have a `.transform()` method
        - `new_data = transformer.transform(data)`
    - They return the transformed dataset
    - The transformer generally relies on the learned parameters
    - All transformers have a convenience method, `fit_transform()`, which is equivalent to calling `fit()` followed by `transform()`, but sometimes `fit_transform()` is optimized and runs much faster
        - `new_data = transformer.fit_transform(data)`
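
A minimal sketch of the equivalence described above, assuming `StandardScaler` as the transformer and a hypothetical toy array:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: three instances, two features
data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# Two-step version: learn the parameters, then apply them
scaler_a = StandardScaler()
scaler_a.fit(data)
two_step = scaler_a.transform(data)

# One-step convenience method on a fresh transformer
scaler_b = StandardScaler()
one_step = scaler_b.fit_transform(data)

# Both routes give the same transformed dataset
print(np.allclose(two_step, one_step))
```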

## Predictors

- Some estimators are capable of making predictions. These are called **predictors**. 
    - They have a `.predict()` method that takes a dataset of new instances and returns a dataset of corresponding predictions
        - `prediction = predictor.predict(data)`
    - They also have a `.score()` method that measures the quality of the predictions given a test set and, in the case of supervised learning algorithms, the corresponding labels
    - Classification algorithms usually offer a way to quantify certainty of a prediction using `decision_function` or `predict_proba`
        - `probability = predictor.predict_proba(data)`
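
The predictor methods above might be exercised as follows. This is only a sketch: `LogisticRegression` is an assumed choice of classifier, and the training and test arrays are hypothetical toy data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, two classes
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[0.5], [4.5]])
y_test = np.array([0, 1])

clf = LogisticRegression()
clf.fit(X_train, y_train)  # fit on the training set only

predictions = clf.predict(X_test)          # one predicted label per instance
accuracy = clf.score(X_test, y_test)       # mean accuracy for classifiers
probabilities = clf.predict_proba(X_test)  # one column per class, rows sum to 1
margins = clf.decision_function(X_test)    # signed confidence scores
```

Not every classifier offers both `predict_proba` and `decision_function`, so it is worth checking which one a given estimator provides.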

## Inspection

- All the estimator's hyperparameters are accessible directly via public instance variables
    - example: `imputer.strategy`
- All the estimator's learned parameters are accessible via public instance variables with an underscore suffix:
    - example: `imputer.statistics_`
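
A short sketch of both kinds of attribute, assuming `imputer` is a `SimpleImputer` with the median strategy and using a hypothetical toy array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data with missing values in each column
data = np.array([[1.0, 10.0],
                 [np.nan, 20.0],
                 [3.0, np.nan],
                 [5.0, 40.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(data)

# Hyperparameter: available as soon as the estimator is constructed
print(imputer.strategy)

# Learned parameter: exists only after fit(), note the trailing underscore
print(imputer.statistics_)  # per-column medians computed while ignoring NaN
```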

## Output of Transformers

- By default, Scikit-learn transformers output NumPy arrays (or sometimes SciPy sparse matrices) even when they are fed Pandas DataFrames as input.
- It is not too difficult to wrap that output into a DataFrame:

```python
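import pandas as pd

# `imputer` is assumed to be already fitted; `housing_num` is a DataFrame of numeric columns
X = imputer.transform(housing_num)  # returns a plain NumPy array
# Restore the column names and row index that the transform dropped
housing_tr = pd.DataFrame(X,
                          columns=housing_num.columns,
                          index=housing_num.index)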
```