If you are using Google Colab, run the cell below to mount your Google Drive in your Colab instance. Adjust the path if your files are stored in a different location in your Google Drive.

If you are not using Google Colab, the cell does nothing and you can safely ignore it.

In [None]:
try:
    from google.colab import drive
    drive.mount('/content/drive/')
    %cd 'drive/My Drive/Colab Notebooks/06_Regression'
except ImportError:
    # not running in Google Colab, so there is nothing to mount
    pass

# Regression

- Classification predicts a **categorical** value
    - a finite set of values
- Regression predicts a **numerical** value
    - a possibly infinite set of values
    - can be interpolating or extrapolating

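The difference between interpolating and extrapolating can be illustrated with a small sketch. The data here is synthetic and purely an assumption for illustration: a straight line is fitted to values of x² that were observed only for x between 0 and 10, and the prediction error is then compared for a point inside and a point outside that range.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic training data (illustration only): y = x^2, observed for x in [0, 10]
x_train = np.arange(0, 11).reshape(-1, 1)
y_train = (x_train ** 2).ravel()

# fit a straight line to the curved data
model = LinearRegression().fit(x_train, y_train)

def abs_error(x):
    """Absolute error of the linear model against the true value x^2."""
    return abs(model.predict(np.array([[x]]))[0] - x ** 2)

error_inside = abs_error(5)    # interpolation: 5 lies within the training range
error_outside = abs_error(20)  # extrapolation: 20 lies far outside the training range

print("error at x=5: ", error_inside)
print("error at x=20:", error_outside)
```

Inside the observed range the line is a rough but usable approximation; outside of it, the error grows quickly because the model has seen no data there.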
## Regression Estimators

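Regression estimators in scikit-learn work in the same way as classification estimators: they provide the same ```fit``` and ```predict``` methods. Some commonly used regression models are: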

Linear Regression Models:
- [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
- [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge)

K-Nearest Neighbor Regression:
- [KNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html)

Decision Tree Regression:
- [DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)

Neural Network Regression:
- [MLPRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html)

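As a minimal sketch of this shared interface, the following cell fits several of the regressors listed above on the same data and asks each for a prediction. The data is synthetic (an assumption for illustration, not the fish dataset), and the hyperparameter values are arbitrary choices. ```MLPRegressor``` follows the same pattern but is left out here to keep the example fast.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# synthetic data (illustration only): y = 3*x + 2 plus a little noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + 2 + rng.normal(0, 0.5, size=100)

# different models, identical interface
regressors = {
    'LinearRegression': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'KNeighborsRegressor': KNeighborsRegressor(n_neighbors=5),
    'DecisionTreeRegressor': DecisionTreeRegressor(max_depth=3, random_state=42),
}

# predict the target for a new sample with x = 5 (true value without noise: 17)
X_new = np.array([[5.0]])
predictions = {}
for name, regressor in regressors.items():
    regressor.fit(X, y)
    predictions[name] = regressor.predict(X_new)[0]
    print(name, predictions[name])
```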
We will use a simple dataset about fish for the introduction of regression:

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

fish = pd.read_csv('fish.csv')
fish.head()

Let's have a look at some plots to determine how the weight and length of the fish are related to the age:

In [None]:
import itertools
import matplotlib.pyplot as plt

# create a list of all columns that we are considering
features = [ 'age', 'temp', 'weight', 'length' ]

# create all combinations of considered columns
combinations = itertools.combinations(features, 2)

# create a figure and specify its size
fig = plt.figure(figsize=(15,10))

# go through all combinations and create one plot for each
figure_index = 1
for combination in combinations:
    # add a sub plot to the figure
    axs = fig.add_subplot(2,3,figure_index)
    
    # plot the feature combination
    axs.scatter(fish[combination[0]], fish[combination[1]])
    
    # set the axis labels of the current sub plot
    axs.set_xlabel(combination[0])
    axs.set_ylabel(combination[1])
        
    # increase the figure index (otherwise all plots are drawn in the first subplot)
    figure_index+=1

    
# show the plot
plt.show()

It seems that there is a linear relationship between age and weight. We can fit a linear regression and add it to the plot. We use the age as a feature and the weight as target variable.

In [None]:
from sklearn.model_selection import train_test_split

# separate features and target variable
weight = fish['weight']

# special case: we only have one feature, so we must reshape the data here
features = fish['age'].values.reshape(-1, 1)

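# create a train/test split
# (weight_train / weight_test hold the age feature, weight_target_train / weight_target_test hold the weights)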
weight_train, weight_test, weight_target_train, weight_target_test = train_test_split(
    features, weight, test_size=0.4, random_state=42)

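The reshape step exists because scikit-learn expects the features as a two-dimensional array with one row per sample and one column per feature. The following sketch shows the difference in shape; the values are hypothetical and only stand in for ```fish['age']```.

```python
import numpy as np
import pandas as pd

# hypothetical one-column data standing in for fish['age']
ages = pd.Series([14, 28, 41, 55], name='age')

# a single column is one-dimensional: 4 values, no feature axis
one_dimensional = ages.values
# reshape(-1, 1) turns it into 4 samples with 1 feature each
two_dimensional = ages.values.reshape(-1, 1)

print(one_dimensional.shape)
print(two_dimensional.shape)

# a DataFrame with a single column is already two-dimensional,
# which is what selecting with a list of columns (fish[['age']]) would give
as_frame = ages.to_frame()
print(as_frame.shape)
```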
Now let's fit a linear regression:

In [None]:
from sklearn.linear_model import LinearRegression

# create and fit a linear regression
weight_estimator = LinearRegression()

weight_estimator.fit(weight_train, weight_target_train)

# plot the original values
plt.scatter(weight_train, weight_target_train, c='green', label='train')
plt.scatter(weight_test, weight_target_test, c='blue', label='test')

# plot the predicted values
plt.plot(fish['age'], weight_estimator.predict(features),c='red', label='prediction')

plt.xlabel('age')
plt.ylabel('weight')
plt.legend()
plt.show()

# print the model that was fitted (the regression formula)
print("weight = {}*age + {}".format(weight_estimator.coef_[0], weight_estimator.intercept_))

Now let's see if that also works for the length. We use the age as the feature again, and the length as the target variable:

In [None]:
# separate features and target variable
length = fish['length']

# create a train/test split
length_train, length_test, length_target_train, length_target_test = train_test_split(
    features, length, test_size=0.4, random_state=42)

# create and fit a linear regression
length_estimator = LinearRegression()
length_estimator.fit(length_train, length_target_train)

# plot the original values
plt.scatter(length_train, length_target_train, c='green', label='train')
plt.scatter(length_test, length_target_test, c='blue', label='test')

# plot the predicted values
plt.plot(fish['age'], length_estimator.predict(features),c='red', label='prediction')

plt.xlabel('age')
plt.ylabel('length')
plt.legend()
plt.show()

# print the model that was fitted (the regression formula)
print("length = {}*age + {}".format(length_estimator.coef_[0], length_estimator.intercept_))

## Polynomial Features


The fitted line does not match the data well: the relationship between age and length appears to be curved rather than linear, so a polynomial regression is a better choice here. We can fit such a regression by using a [```PolynomialFeatures``` transformer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) that generates all polynomial combinations of the features up to the chosen degree (for the single feature ```age``` and degree 2, these are ```age``` and ```age^2```). On these transformed features, we can then use the linear regression again to fit our model:

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

# create a transformer that generates polynomial features
transformer = PolynomialFeatures(degree=2, include_bias=False)
estimator = LinearRegression()
pipeline = Pipeline([ ('transformer', transformer), ('estimator', estimator)])
pipeline.fit(length_train, length_target_train)

prediction = pipeline.predict(features)

# plot the original values
plt.scatter(length_train, length_target_train, c='green', label='train')
plt.scatter(length_test, length_target_test, c='blue', label='test')

# create a new dataframe that contains the age and the predictions
d = fish[['age']]
d = d.assign(prediction=prediction)

# sort the data before plotting it
d = d.sort_values(by='age')

# plot the predicted values
plt.plot(d['age'], d['prediction'], c='red', label='prediction')

plt.xlabel('age')
plt.ylabel('length')
plt.legend()
plt.show()

# print the model that was fitted (the regression formula)
print("length = ", end='')
for i, f in enumerate(pipeline.named_steps['transformer'].get_feature_names_out(['age'])):
    if i > 0:
        print(" + ", end='')
    print("{}*{}".format(pipeline.named_steps['estimator'].coef_[i], f), end='')
print(" + {}".format(pipeline.named_steps['estimator'].intercept_))


## Evaluation

With a continuous target variable, it does not make sense to count how often we predicted the exact correct value. The measures used for regression rather check how close our prediction is to the correct value:

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt

# predict the values for the test data
predictions = pipeline.predict(length_test)

# evaluate using different measures
mse = mean_squared_error(length_target_test, predictions)
r2 = r2_score(length_target_test, predictions)

print("MSE:", mse)
print("RMSE:", sqrt(mse))
print("R^2:", r2)

# print the model that was fitted (the regression formula)
print("length = ", end='')
for i, f in enumerate(pipeline.named_steps['transformer'].get_feature_names_out(['age'])):
    if i > 0:
        print(" + ", end='')
    print("{}*{}".format(pipeline.named_steps['estimator'].coef_[i], f), end='')
print(" + {}".format(pipeline.named_steps['estimator'].intercept_))

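The measures used above can also be computed by hand, which makes it clearer what they express. The following sketch uses a few made-up values (not taken from the fish dataset) and compares the manual results with scikit-learn. MSE is the mean of the squared differences between true and predicted values, RMSE is its square root (and therefore in the same unit as the target), and R² is one minus the ratio of the squared errors of the model to the squared errors of always predicting the mean.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# made-up true values and predictions (illustration only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 8.5])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_pred) ** 2)

# RMSE: square root of the MSE
rmse_manual = np.sqrt(mse_manual)

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MSE: ", mse_manual, mean_squared_error(y_true, y_pred))
print("RMSE:", rmse_manual)
print("R^2: ", r2_manual, r2_score(y_true, y_pred))
```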
### Try it yourself
- Task 6.1.1

![xkcd comic](https://imgs.xkcd.com/comics/extrapolating.png)



## Feature Selection

- [```f_regression``` function](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html)
    - Performs an [F-Test](https://en.wikipedia.org/wiki/F-test) to determine feature importances

- [```SelectKBest``` class](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)
     - selects the ```k``` best features according to ```score_func```
     
- [```SelectFwe``` class](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFwe.html)
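    - selects all features whose p-value according to ```score_func``` is below the threshold ```alpha```, corrected for the family-wise error rate (```alpha``` divided by the number of features)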

- Recursive Feature Elimination: [```RFECV``` class](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html)
    - fits a model (any estimator that provides feature weights) and removes the ```step``` least important features
    - uses cross validation to find the optimal number of features
    
For more details, have a look at the [feature selection documentation of scikit-learn](https://scikit-learn.org/stable/modules/feature_selection.html).

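As a small sketch of how these selectors behave, the following cell applies ```SelectKBest``` and ```SelectFwe``` to synthetic data (an assumption for illustration) in which only the first of three columns influences the target:

```python
import numpy as np
from sklearn.feature_selection import SelectFwe, SelectKBest, f_regression

# synthetic data (illustration only):
# the target depends only on column 0, columns 1 and 2 are pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 4 * X[:, 0] + rng.normal(scale=0.1, size=200)

# keep the single feature with the highest F value
k_best = SelectKBest(score_func=f_regression, k=1).fit(X, y)
print("SelectKBest:", k_best.get_support())

# keep every feature whose p-value is below the corrected threshold
fwe = SelectFwe(score_func=f_regression, alpha=0.05).fit(X, y)
print("SelectFwe:  ", fwe.get_support())
print("p-values:   ", fwe.pvalues_)
```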
For this part, we will use all features in the fish dataset:

In [None]:
# separate features and target variable
fish_target = fish['length']
fish_data = fish.drop(columns=['length'])

# create a train/test split
data_train, data_test, target_train, target_test = train_test_split(
    fish_data, fish_target,test_size=0.2, random_state=42)

Let's first run an F-test on the polynomial features:

In [None]:
from sklearn.feature_selection import f_regression

# create a transformer
transformer = PolynomialFeatures(degree=2, include_bias=False)

# run the F-Test
f, pval = f_regression(transformer.fit_transform(data_train), target_train)

# prepare a dataframe to inspect the results
stat = pd.DataFrame({ 'feature': transformer.get_feature_names_out(fish_data.columns), 'F value': f, 'p value': pval })
stat['p value'] = round(stat['p value'], 2)

# show the results
display(stat)

In [None]:
from sklearn.feature_selection import SelectFwe

best = SelectFwe(f_regression, alpha=0.05)
transformer = PolynomialFeatures(degree=2, include_bias=False)

pipeline = Pipeline([ ('transformer', transformer), ('feature_selection', best), ('estimator', estimator)])

# fit the regression on the training data
pipeline.fit(data_train, target_train)

# predict the values for the test data
predictions = pipeline.predict(data_test)

# evaluate using different measures
mse = mean_squared_error(target_test, predictions)
r2 = r2_score(target_test, predictions)

print("MSE:", mse)
print("RMSE:", sqrt(mse))
print("R^2:", r2)

# get the selected features
selected_features = pipeline.named_steps['feature_selection'].get_support()
feature_index = 0

# print the model that was fitted (the regression formula)
print("length = ", end='')
for i, f in enumerate(pipeline.named_steps['transformer'].get_feature_names_out(fish_data.columns)):
    # check if the feature was selected
    if selected_features[i]:
        # only print a separator before the second and later selected features
        if feature_index > 0:
            print(" + ", end='')
        print("{}*{}".format(pipeline.named_steps['estimator'].coef_[feature_index], f), end='')
        feature_index += 1
print(" + {}".format(pipeline.named_steps['estimator'].intercept_))

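Instead of a fixed significance threshold, we can also let recursive feature elimination with cross validation (```RFECV```) choose the features as part of the sklearn pipeline. In the cell below, the linear regression is only used to rank and select the features; the final predictions are made by a decision tree that is trained on the selected features: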

In [None]:
from sklearn.feature_selection import RFECV
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# create the transformer, the scaler, and the two estimators
transformer = PolynomialFeatures(degree=2, include_bias=False)
scaler = StandardScaler()
# linear regression: only used by RFECV to rank the features
estimator_feature_selection = LinearRegression()
# decision tree: makes the final predictions on the selected features
estimator_regression = DecisionTreeRegressor()

# create the feature selection estimator
feature_selection = RFECV(estimator_feature_selection, cv=10)

# setup the pipeline
pipeline = Pipeline([ ('transformer', transformer), ('scaler', scaler), ('feature_selection', feature_selection), ('regression', estimator_regression)])

# fit pipeline
pipeline.fit(data_train, target_train)

# predict the values for the test data
predictions = pipeline.predict(data_test)

# evaluate
mse = mean_squared_error(target_test, predictions)
r2 = r2_score(target_test, predictions)
print("MSE:", mse)
print("RMSE:", sqrt(mse))
print("R^2:", r2)

# get the selected features
fs = pipeline.named_steps['feature_selection']
est = pipeline.named_steps['feature_selection'].estimator_

selected_features = fs.get_support()
feature_index = 0

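# print the linear model that RFECV fitted on the selected (scaled) features
# (this is the model used for feature selection; the predictions above come from the decision tree)
print("length = ", end='')
for i, f in enumerate(pipeline.named_steps['transformer'].get_feature_names_out(fish_data.columns)):
    # check if the feature was selected
    if selected_features[i]:
        # only print a separator before the second and later selected features
        if feature_index > 0:
            print(" + ", end='')
        print("{}*{}".format(est.coef_[feature_index], f), end='')
        feature_index += 1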
print(" + {}".format(est.intercept_))
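
To inspect what ```RFECV``` decided, its fitted attributes can be read directly. The following is a minimal sketch on synthetic data (an assumption for illustration) in which only columns 0 and 2 influence the target; the choice of ```cv=5``` and ```step=1``` is arbitrary.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# synthetic data (illustration only): columns 1 and 3 are pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 5 * X[:, 0] - 3 * X[:, 2] + rng.normal(scale=0.1, size=200)

# remove one feature per step and use 5-fold cross validation
# to decide how many features to keep
selector = RFECV(LinearRegression(), step=1, cv=5)
selector.fit(X, y)

print("number of selected features:", selector.n_features_)
print("selected mask:", selector.support_)
print("ranking (1 = selected):", selector.ranking_)
```

The two informative columns should always be among the selected features; whether a noise column is kept as well depends on the cross validation scores.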