
# Regression

## Examples

Can we estimate the number of distinct species for a study site?

![image of a natural environment](https://upload.wikimedia.org/wikipedia/commons/thumb/3/36/Hopetoun_falls.jpg/640px-Hopetoun_falls.jpg)

(taken from Wikipedia: [natural environment](https://en.wikipedia.org/wiki/Natural_environment))

Can we predict atmospheric $CO_2$ concentration levels in future years based on historical data?

![historic atmospheric co2 data](https://research.noaa.gov/Portals/0/EasyGalleryImages/1/864/co2_data_mlo.png)

(taken from [NOAA research news](https://research.noaa.gov/article/ArtMID/587/ArticleID/2764/Coronavirus-response-barely-slows-rising-carbon-dioxide), Monday, June 7, 2021)

## How to do regression?

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

In [None]:
features, target = make_regression(n_features=1, noise=10)

In [None]:
ax = sns.scatterplot(x=features.flatten(), y=target)
ax.set_xlabel("x = Features")
ax.set_ylabel("y = Target");

In [None]:
# Fit a linear model to the example data
linear_model = LinearRegression().fit(features, target)

# Plot the datapoints. x = features, y = target value
ax = sns.scatterplot(x=features.flatten(), y=target)

x_min = features.min()
x_max = features.max()

# Generate a prediction using the linear model on the example data
pred = linear_model.predict([[x_min], [x_max]])
y_min = pred.min()

# Plot the predicted line
ax.plot([x_min, x_max], pred, color="black", linestyle="--", linewidth=2)

# Define a new test point at x = 2
test_point = [2]

# Use the linear model to predict its target value
predicted_value = linear_model.predict([test_point])[0]

# Draw a point at the test point with its predicted value
ax.scatter(test_point, [predicted_value], color="red")

# Draw a vertical arrow from the x-axis at x = test_point to its predicted value
ax.arrow(
    test_point[0],
    y_min,
    0,
    predicted_value - y_min,
    color="red",
    head_width=0.1,
    head_length=10,
    length_includes_head=True,
)

# Draw a horizontal arrow from the point (x, y) = (test_point, predicted_value) to
# the y-axis at y = predicted_value
ax.arrow(
    test_point[0],
    predicted_value,
    -test_point[0] + x_min,
    0,
    color="red",
    head_width=10,
    head_length=0.1,
    length_includes_head=True,
)

# Add the linear formula to the plot
ax.text(-1, 50, "y = mx + b")

# Add labels to the axes
ax.set_xlabel("x = Features")
ax.set_ylabel("y = Target");
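
The formula `y = mx + b` shown in the plot can be read directly off the fitted model. The following is a minimal sketch, assuming the same kind of toy data as above; the fixed `random_state` and the names `m`, `b`, `x_new` and `manual_prediction` are illustrative choices that do not appear in the original cells.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Same kind of toy data as above; random_state is fixed only for reproducibility
features, target = make_regression(n_features=1, noise=10, random_state=0)
linear_model = LinearRegression().fit(features, target)

# m (slope) and b (intercept) of the fitted line y = mx + b
m = linear_model.coef_[0]
b = linear_model.intercept_

# Predicting by hand at x = 2 should agree with predict()
x_new = 2.0
manual_prediction = m * x_new + b
model_prediction = linear_model.predict([[x_new]])[0]

print(f"m = {m:.2f}, b = {b:.2f}")
print(f"by hand: {manual_prediction:.2f}, predict(): {model_prediction:.2f}")
```

With a single feature, `coef_` holds exactly one slope, which is why indexing with `[0]` is enough here.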

## But what about non-linear data?

In [None]:
features_a, target_a = make_regression(n_features=1, noise=10, random_state=100)

features_b, target_b = make_regression(
    n_features=1,
    noise=10,
    bias=100,
    random_state=50,
)

# Make a dataset by joining two datasets
features_c = np.r_[features_a, features_b - 3]
target_c = np.r_[target_a, target_b]

# Plot data points (x, y) = (features, target)
sns.scatterplot(x=features_c.flatten(), y=target_c);

In [None]:
# Fit a linear model to the example data
linear_model = LinearRegression().fit(features_c, target_c)

# Plot the datapoints. x = features, y = target value
ax = sns.scatterplot(x=features_c.flatten(), y=target_c)

x_min = features_c.min()
x_max = features_c.max()

# Generate a prediction using the linear model on the example data
pred = linear_model.predict([[x_min], [x_max]])

# Plot the predicted line
ax.plot([x_min, x_max], pred, color="black", linestyle="--", linewidth=2)

# Add the linear formula to the plot
ax.text(-3, 0, "y = mx + b")

# Add labels to the axes
ax.set_xlabel("x = Features")
ax.set_ylabel("y = Target");

### Nearest neighbours again
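
One possible reading of this heading is that nearest neighbours, used earlier for classification, can also predict a continuous target by averaging the targets of the closest training points. The sketch below assumes scikit-learn's `KNeighborsRegressor` with `n_neighbors=5`; both the choice of estimator and the value of `n_neighbors` are assumptions, not something the original material specifies.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Rebuild the joined, non-linear dataset from above
features_a, target_a = make_regression(n_features=1, noise=10, random_state=100)
features_b, target_b = make_regression(
    n_features=1, noise=10, bias=100, random_state=50
)
features_c = np.r_[features_a, features_b - 3]
target_c = np.r_[target_a, target_b]

# Fit a straight line and a nearest-neighbours regressor to the same data
linear_model = LinearRegression().fit(features_c, target_c)
knn_model = KNeighborsRegressor(n_neighbors=5).fit(features_c, target_c)

# score() returns R^2, here computed on the training data only
linear_score = linear_model.score(features_c, target_c)
knn_score = knn_model.score(features_c, target_c)

print(f"linear R^2: {linear_score:.3f}")
print(f"5-NN R^2:   {knn_score:.3f}")
```

These scores are measured on the training data, so they only show how closely each model can follow the points it has already seen, not how well it generalises.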

### Multidimensional features
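
The same linear model carries over to more than one feature: each feature gets its own slope and there is still a single intercept. Here is a hedged sketch with three synthetic features; the dataset size, the `random_state`, the example point and the `_md` variable names are all made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Three features per sample instead of one
features_md, target_md = make_regression(
    n_samples=100, n_features=3, noise=10, random_state=0
)
model_md = LinearRegression().fit(features_md, target_md)

# One slope per feature and one intercept: y = m1*x1 + m2*x2 + m3*x3 + b
print("slopes:", model_md.coef_)
print("intercept:", model_md.intercept_)

# Predicting by hand for an arbitrary point should agree with predict()
new_point = [0.5, -1.0, 2.0]
manual_md = np.dot(new_point, model_md.coef_) + model_md.intercept_
predicted_md = model_md.predict([new_point])[0]
print(f"by hand: {manual_md:.2f}, predict(): {predicted_md:.2f}")
```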

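## How good is our regression?

* Create a train/validation split, just like in classification!

* Classification accuracy can’t be used as a measure, because the targets are continuous values rather than class labels

* Instead, measure how far the predictions are from the true values, for example with:
    - Mean squared error
    - Mean absolute error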
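
The points above can be sketched as follows, assuming a 75/25 train/validation split and the toy data from earlier; the split ratio, the `random_state` values and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

features, target = make_regression(n_features=1, noise=10, random_state=0)

# Hold out 25% of the samples for validation
x_train, x_val, y_train, y_val = train_test_split(
    features, target, test_size=0.25, random_state=0
)

# Fit on the training part only, then predict the held-out part
model = LinearRegression().fit(x_train, y_train)
val_pred = model.predict(x_val)

# Mean squared error: average of the squared differences
mse = mean_squared_error(y_val, val_pred)
# Mean absolute error: average of the absolute differences
mae = mean_absolute_error(y_val, val_pred)

# The same quantities computed by hand, to show what the metrics mean
mse_by_hand = np.mean((y_val - val_pred) ** 2)
mae_by_hand = np.mean(np.abs(y_val - val_pred))

print(f"MSE: {mse:.2f}, MAE: {mae:.2f}")
```

Squaring makes the mean squared error more sensitive to a few large mistakes, while the mean absolute error stays in the same units as the target.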

# 4 Nonlinear regression

Here we compare a linear regressor with a nonlinear one (a random forest) on a one-dimensional regression task.

In [None]:
# Load numeric computation library
import numpy as np

# Load the plotting library
import seaborn as sns

# Load the random forest regressor class
from sklearn.ensemble import RandomForestRegressor

# Load the linear model class
from sklearn.linear_model import LinearRegression

In [None]:
# create a dataset
# here y_ideal is a nonlinear function of x
x = np.arange(0, 100, 2.0)
y_ideal = np.sin(x / 10) + (x / 50) ** 2

In [None]:
# add some noise to our target variable y_ideal
y = y_ideal + np.random.normal(size=len(y_ideal)) * 0.1

In [None]:
# fit linear model
lin = LinearRegression()
lin.fit(x.reshape(-1, 1), y)
lin_train_fit = lin.predict(x.reshape(-1, 1))

In [None]:
# fit random forest
rf = RandomForestRegressor()
rf.fit(x.reshape(-1, 1), y)
rf_train_fit = rf.predict(x.reshape(-1, 1))

In [None]:
# plot the fitted linear model and random forest
sns.scatterplot(x=x, y=y, color="black")
sns.lineplot(x=x, y=lin_train_fit, color="red", label="linear regression")
sns.lineplot(x=x, y=rf_train_fit, color="blue", label="random forest")

In [None]:
# generate a random test point (sampled between the min and max of x)
rand_point = np.random.uniform(x.min(), x.max())

In [None]:
# linear model
# y = slope * x + intercept
lm_predicted = lin.predict([[rand_point]])

In [None]:
# random forest
rf_predicted = rf.predict([[rand_point]])

In [None]:
# plot single predictions
sns.scatterplot(x=x, y=y, color="black")
sns.scatterplot(
    x=[rand_point],
    y=lm_predicted,
    marker="x",
    color="red",
    label="linear regression prediction",
)
sns.scatterplot(
    x=[rand_point],
    y=rf_predicted,
    marker="x",
    color="blue",
    label="random forest prediction",
)
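
The plots compare the two models by eye. A numerical comparison is sketched below, under the assumption that a quarter of the points are held out for validation and scored with mean squared error; the seeds, the split ratio and the `_mse` names are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Same nonlinear function as above, with seeded noise for reproducibility
rng = np.random.default_rng(0)
x = np.arange(0, 100, 2.0)
y = np.sin(x / 10) + (x / 50) ** 2 + rng.normal(size=len(x)) * 0.1

# Hold out 25% of the points so the models are scored on unseen data
x_train, x_val, y_train, y_val = train_test_split(
    x.reshape(-1, 1), y, test_size=0.25, random_state=0
)

lin = LinearRegression().fit(x_train, y_train)
rf = RandomForestRegressor(random_state=0).fit(x_train, y_train)

# Validation error of each model (lower is better)
lin_mse = mean_squared_error(y_val, lin.predict(x_val))
rf_mse = mean_squared_error(y_val, rf.predict(x_val))

print(f"linear regression MSE: {lin_mse:.3f}")
print(f"random forest MSE:     {rf_mse:.3f}")
```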