![Erudio logo](img/erudio-logo-small.png)
---
![Sklearn logo](img/scikit-learn-logo-small.png)

# Machine Learning with scikit-learn

Sometimes the features you have available in your initial data have little predictive strength when used in the most straightforward way.  This might be true almost regardless of choice of model class and hyperparameters.  And yet it might also be true that there are synthetic features latent in the data that are highly predictive, but that have to be *engineered* (by a mechanical transformation applied to every sample, rather than by modifying individual samples) before they become powerful features.

At the same time, a high-dimensional model (whether of high dimension because of the initial data collection or because of the creation of extra synthetic features) may lend itself less well to modeling techniques.  In these cases, it can be more computationally tractable, as well as more predictive, to work with a subset of all available features.

We will spend several lessons on topics that can be thought of broadly as "Feature Engineering." This middle lesson focuses on creating new composite or derived features.

In [None]:
%matplotlib inline
from src.setup import *

# A synthetic example

Let us look at an artificial example where the raw features of a dataset are of absolutely no value, but it is possible to derive good predictions by creating synthetic features out of them. Obviously, real-world data will not be as neat as that, but it is useful to express the concept.

At first blush the loaded data seems fairly noisy without an obvious pattern.

In [None]:
linf.head()

In [None]:
# The features seem uncorrelated, and no univariate correlation with target
linf.corr()

In [None]:
# Distribution of features and target looks roughly Gaussian
linf.hist(figsize=(12,6));

In [None]:
# No obvious trends in the data as sequences
%matplotlib inline
linf.plot(figsize=(12,6), style='.');

We might hope to identify a relationship between features and target using a linear regression such as this:

In [None]:
from sklearn.linear_model import LinearRegression

X = linf.drop('TARGET', axis=1)
y = linf['TARGET']

X_train, X_test, y_train, y_test = train_test_split(X, y)
lr = LinearRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

Slightly different linear models do equally poorly in detecting any relationship between the features and the target.  Notice that the metric reported by `.score()` for these regressors is the $R^2$ score rather than, e.g., explained variance or mean absolute error (or others).


In [None]:
from sklearn.linear_model import Lasso, Ridge

lasso, ridge = Lasso(), Ridge()
lasso.fit(X_train, y_train)
ridge.fit(X_train, y_train)

lasso.score(X_test, y_test), ridge.score(X_test, y_test)
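
To make the difference between these regression metrics concrete, here is a minimal sketch on made-up numbers (the arrays are hypothetical and not taken from `linf`). A prediction that is off by a constant keeps a perfect explained variance but loses some $R^2$, while mean absolute error reports the size of the miss in the target's own units.

```python
import numpy as np
from sklearn.metrics import (
    explained_variance_score,
    mean_absolute_error,
    r2_score,
)

# Hypothetical targets and predictions: every prediction is too high by 1.0
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = y_true + 1.0

r2 = r2_score(y_true, y_pred)                  # penalizes the constant offset
ev = explained_variance_score(y_true, y_pred)  # ignores a constant offset
mae = mean_absolute_error(y_true, y_pred)      # average absolute miss

print(f"R^2: {r2:.3f}, explained variance: {ev:.3f}, MAE: {mae:.3f}")
```

Here $R^2$ comes out to about 0.8 while explained variance is 1.0, which is why it matters to know which metric `.score()` reports.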

## Adding a feature

Let us try creating a new feature that is entirely based on existing features.

In [None]:
linf['f1xf2'] = linf.feature_1 * linf.feature_2
linf.head()

In [None]:
X = linf.drop('TARGET', axis=1)
y = linf['TARGET']

X_train, X_test, y_train, y_test = train_test_split(X, y)
lr = LinearRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

The information we need was "latent" in the data the whole time; it just needed to be teased out.
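
The `linf` data is loaded by the course setup module, so here is a self-contained sketch of the same effect. It assumes a hypothetical target that is the product of two Gaussian features plus a little noise; the array names and the helper `holdout_r2` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical data: two independent Gaussian features, target is their product
X_raw = rng.normal(size=(2000, 2))
y_prod = X_raw[:, 0] * X_raw[:, 1] + rng.normal(scale=0.1, size=2000)

# Same data with the engineered product column appended
X_aug = np.column_stack([X_raw, X_raw[:, 0] * X_raw[:, 1]])


def holdout_r2(features, target):
    """Fit a linear regression on a train split; return R^2 on the test split."""
    X_tr, X_te, y_tr, y_te = train_test_split(features, target, random_state=0)
    return LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)


r2_raw = holdout_r2(X_raw, y_prod)
r2_aug = holdout_r2(X_aug, y_prod)
print(f"raw features R^2: {r2_raw:.3f}")
print(f"with product feature R^2: {r2_aug:.3f}")
```

The raw features score near zero because each one is individually uncorrelated with the product, while the augmented model can fit the target almost exactly.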

In fairness, we can note that other regressors manage to derive the synthetic feature through their algorithmic structure.  But these regressors will have their own "blind spots" also, relative to different datasets.

In [None]:
X = linf.drop('TARGET', axis=1)
y = linf['TARGET']
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
from sklearn.tree import DecisionTreeRegressor

dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train)
dtr.score(X_test, y_test)

In [None]:
from sklearn.svm import SVR

svr = SVR()
svr.fit(X_train, y_train)
svr.score(X_test, y_test)

In [None]:
X = linf.drop(['TARGET', 'f1xf2'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y)
dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train)
dtr.score(X_test, y_test)

In [None]:
svr = SVR()
svr.fit(X_train, y_train)
svr.score(X_test, y_test)

# Dimensionality Expansion

There are two standard ways in which you are likely to engineer new synthetic features based on existing features: polynomial features and one-hot encoding.  In a sense, the decompositions also do the same thing—but they create synthetic features globally across the parametric space, and they generally are used as replacements rather than supplements to raw features.

## Polynomial Features

Generating polynomial features will create a very large number of new features.  The basic idea is simple: we add new features that are the multiplicative products of up to degree=N of the existing features.  In the toy example at the beginning of this lesson, we manually synthesized one feature by multiplying two existing ones together.  The `PolynomialFeatures` class does so with all combinations of features.

Using polynomial features is often a large part of the reason it is important to go back and winnow features using feature selection.  Reducing 30 features to 15, for example, is unlikely to be hugely important to most models.  But reducing the 496 synthetic features in the example below becomes important (let alone the much larger number if you choose a higher degree or start with more raw features).  We look at this winnowing in the next lesson.

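With `degree=2`, and if the `interaction_only` option is not used, $N$ raw features produce the $N$ raw features themselves, their $N$ squares, the pairwise products, and one constant bias column.  The number of produced features is therefore: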

$$ \#Features = N + N + \frac{N \times (N-1)}{2} + 1 $$

E.g. for dimensions=30, it is 496; for dimensions=100, it is 5151.
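
The counts above can be checked with a short sketch. The helper names are hypothetical; the closed form $\binom{N+d}{d}$ for a full expansion of degree $d$ is the standard count of monomials of degree at most $d$ in $N$ variables, and it agrees with the degree-2 formula given here.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures


def degree2_count(n):
    """Count from the formula above: raw + squares + pairwise products + bias."""
    return n + n + n * (n - 1) // 2 + 1


def general_count(n, degree):
    """Number of monomials of total degree <= `degree` in n variables."""
    return comb(n + degree, degree)


for n in (30, 100):
    print(n, degree2_count(n), general_count(n, 2))

# Compare against what scikit-learn actually produces for 30 columns
placeholder = np.zeros((1, 30))
sk_count = PolynomialFeatures(2).fit_transform(placeholder).shape[1]
sk_count_inter = (
    PolynomialFeatures(2, interaction_only=True)
    .fit_transform(placeholder)
    .shape[1]
)
print(sk_count, sk_count_inter)
```

With `interaction_only=True` the $N$ squared terms disappear, so 30 raw features yield 466 columns rather than 496.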

In [None]:
cancer = load_breast_cancer()

In [None]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(2)
X_poly = poly.fit_transform(cancer.data)
print(cancer.data.shape)
print(X_poly.shape)

In [None]:
poly_names = poly.get_feature_names_out(cancer.feature_names)
pd.DataFrame(X_poly, columns=poly_names).head()

Just to show it off, let us look at how quickly higher-degree combinations explode the number of features.

In [None]:
PolynomialFeatures(3).fit_transform(cancer.data).shape

In [None]:
PolynomialFeatures(4).fit_transform(cancer.data).shape

Working with 5k, let alone 50k, features is quite unwieldy.  Even 500 is questionable, especially given that we have only about the same number of rows here.  But we look at feature selection/elimination in the next lesson.

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(max_depth=7, random_state=1)

We can see that we get significant improvement with polynomial features.  Note that even simple models do "pretty well" with the most naive attempts.  Here it is more illustrative to look at how often a model is *wrong* than its accuracy to highlight differences.

In [None]:
# Baseline: the raw features, with no polynomial expansion or scaling
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=42
)

acc = rfc.fit(X_train, y_train).score(X_test, y_test)
print(f"Error rate: {(1 - acc) * 100:.2f}%")

In [None]:
# Scale the engineered features
# (makes little difference for this model, but is good practice)
from sklearn.preprocessing import MinMaxScaler

poly = PolynomialFeatures(3)
X_poly = poly.fit_transform(cancer.data)

X_poly_scaled = MinMaxScaler().fit_transform(X_poly)

X_train, X_test, y_train, y_test = train_test_split(
    X_poly_scaled, cancer.target, random_state=42
)

acc = rfc.fit(X_train, y_train).score(X_test, y_test)
print(f"Error rate: {(1-acc)*100:.2f}%")
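
The cell above fits `MinMaxScaler` on all rows before splitting. A minimal sketch of one alternative, assuming we want the expansion and the scaler fit on the training split only, is to chain the steps with `make_pipeline`. Degree 2 is used here only to keep the run short, so the reported number is not directly comparable with the cells above; the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, random_state=42
)

# Each step is fit on the training rows only, then applied to the test rows
pipe = make_pipeline(
    PolynomialFeatures(2),
    MinMaxScaler(),
    RandomForestClassifier(max_depth=7, random_state=1),
)

pipe_acc = pipe.fit(X_tr, y_tr).score(X_te, y_te)
print(f"Error rate: {(1 - pipe_acc) * 100:.2f}%")
```

For a tree-based model the scaling step changes little, but the same pattern carries over to models that are sensitive to feature scale.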

## One-Hot Encoding

We have looked in previous lessons at the need to encode categorical values in **one-hot encoding**.  That is, we might have one feature with a number of class values encoded in it.  For many models, one-hot encoding either gives better results than using the class labels directly or is simply required for the code to operate.  In some cases, integer values might work algorithmically, but will distort the training by being interpreted in a quantitative or ordinal way.

The interfaces provided by scikit-learn are serviceable, but somewhat awkward.  `sklearn.preprocessing.LabelBinarizer` does almost what you want in some cases, but doesn't expose the clearest API.  The same can be said of `sklearn.preprocessing.OneHotEncoder` and `sklearn.preprocessing.LabelEncoder` and a couple of others.  I simply recommend using `pandas.get_dummies()` in place of these others.  The resulting encoding is equivalent in any case.

Let us look at a small toy example with categorical data.

In [None]:
pets = pd.read_csv('data/pets.csv')
pets

In [None]:
pd.get_dummies(pets, dummy_na=True)

Performing the same encoding using the native scikit-learn classes and methods is a little bit more work.  `OneHotEncoder` chooses somewhat different column names that can be a bit less descriptive.  In particular, when it is fit on a bare NumPy array, the generated names use generic prefixes such as `x0`, `x1`, etc. unless you pass the original column names to `get_feature_names_out()`.  Also, if you have a mixture of categorical and quantitative columns, specifying that is cumbersome compared to Pandas simply auto-detecting it for you.

On the other hand, if you do not wish to use Pandas, everything in scikit-learn happily works with the underlying NumPy arrays alone.

In [None]:
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit(pets)
one_hot_pets = enc.transform(pets)
columns = enc.get_feature_names_out()
pd.DataFrame(one_hot_pets.toarray(), columns=columns)

# Binning

## Binary values

In the very first example in this course, we manually binarized a target.  We decided that one of those 1-10 scale ratings would have a cut-off point for "success" versus "failure."  That sort of thing is perfectly easy to construct using Pandas predicate filters (or similarly in NumPy).  But `sklearn.preprocessing.Binarizer` is available to accomplish the same end.

In [None]:
data = pd.read_csv('data/Learning about Humans learning ML.csv')
target = data[['How successful has this tutorial been so far?']]
target.head()

In [None]:
(target>=8).head()

In [None]:
from sklearn.preprocessing import Binarizer

binary = Binarizer(threshold=7.5).fit_transform(target)
binary[:5]

In [None]:
pd.DataFrame(binary).describe()
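
The equivalence between the Pandas predicate and `Binarizer` can be checked on a small hypothetical column of ratings (the values are made up; the survey file is not needed). `Binarizer` maps values strictly greater than `threshold` to 1, which is why 7.5 matches `>= 8` for integer ratings.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import Binarizer

# Hypothetical 1-10 ratings
ratings = pd.DataFrame({"rating": [3, 7, 8, 10, 5, 9]})

# Pandas predicate: True where the rating is at least 8
by_predicate = (ratings >= 8).astype(int).to_numpy()

# Binarizer: 1 where the value is strictly greater than the threshold
by_binarizer = Binarizer(threshold=7.5).fit_transform(ratings.to_numpy())

print(by_predicate.ravel())
print(by_binarizer.ravel())
print("Same result:", np.array_equal(by_predicate, by_binarizer))
```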

## Binning to categorical values

A more general form of quantizing ordinal or continuous values can use the `KBinsDiscretizer` class.  The idea here is that we want to divide a range into separate values using cuts.  The class provides a variety of ways of deciding these cuts.  For example, perhaps my tutorial was not simply "successful" or "unsuccessful", but rather "terrible", "mediocre", "good", or "great" according to different audience members.

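Using `n_bins=2` simply binarizes, although with the default `quantile` strategy the cut falls at the median rather than at a threshold we choose.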

In [None]:
from sklearn.preprocessing import KBinsDiscretizer

binary = KBinsDiscretizer(n_bins=2, encode='ordinal').fit_transform(target)
binary[:5]

More bins allow us to illustrate some additional options.  The `ordinal` encoding assigns integers, counting up from zero, to the successive bins.

In [None]:
bins = KBinsDiscretizer(n_bins=4, encode='ordinal').fit_transform(target)
bins[:8]

Plain `onehot` creates a more compact sparse-matrix representation than `onehot-dense`.  For millions of rows of data this matters, not for hundreds.

In [None]:
cuts = KBinsDiscretizer(n_bins=4, encode='onehot-dense')
bins = cuts.fit_transform(target)
bins[:8]

As was mentioned when the example was first presented, 1-10 ratings tend to clump together near the top.  The default strategy is `quantile` which puts as equal a number into each bin as possible. Since we started with a limited number of ordinal values, this example is fairly "lumpy."  An example with continuous values would typically be able to divide more evenly.

In [None]:
print("Cut-offs:", cuts.bin_edges_[0])
print("Count per bin:", bins.sum(axis=0))

We also have an option to use a `uniform` strategy, which makes the numeric ranges equal in width at the cost of more uneven counts per bin.

In [None]:
cuts = KBinsDiscretizer(n_bins=4, encode="onehot", strategy="uniform")
bins = cuts.fit_transform(target)
bins

In [None]:
print("Cut-offs:", cuts.bin_edges_[0])
print("Count per bin:", bins.sum(axis=0))
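
To see the two strategies on continuous values, here is a minimal sketch using hypothetical skewed data drawn from an exponential distribution (the data and variable names are illustrative, not from the survey). Depending on the scikit-learn version, the `quantile` strategy may emit a warning about changing defaults; it does not affect this comparison.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)

# Hypothetical continuous, right-skewed values in a single column
values = rng.exponential(scale=2.0, size=(1000, 1))

counts = {}
for strategy in ("quantile", "uniform"):
    kbd = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy)
    codes = kbd.fit_transform(values).ravel().astype(int)
    counts[strategy] = np.bincount(codes, minlength=4)
    print(strategy, "cut-offs:", np.round(kbd.bin_edges_[0], 2))
    print(strategy, "count per bin:", counts[strategy])
```

With continuous data the `quantile` bins hold nearly equal counts, while the `uniform` bins have equal widths and most samples fall into the lowest one.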

## Next lesson

**Feature Selection**: In this lesson we looked at several ways of expanding dimensionality and creating *synthetic features* (or synthetic targets, sometimes).  Expansion is typically the first step and is followed by selection; that is, it is time to discard those engineered features that prove to be of little value.

<a href="SKLearn-09_FeatureSelection.ipynb"><img src="img/open-notebook.png" align="left"/></a>


---

Materials licensed under [CC BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/) by the authors