If you are using Google Colab, run the cell below to mount your Google Drive in your Colab instance. Adjust the path in the cell if your files are stored in a different Google Drive folder.

If you are not using Google Colab, the cell does nothing, so you can run it safely.

In [1]:
try:
    from google.colab import drive
    drive.mount('/content/drive/')
    %cd 'drive/My Drive/Colab Notebooks/06_Regression'
except ImportError:
    # not running on Google Colab: nothing to mount
    pass

## Exercise 6: Regression

### 6.1. Modeling characteristics of fish
The Fish dataset is a small dataset that illustrates the linear and non-linear dependencies that may exist between different attributes of the data. The dataset is provided in the fish.csv file. It contains 44 examples, each with four attributes: age, water temperature, weight and length.
#### 6.1.1 Load the Fish dataset and visualize it by combining different attributes. Can you make an assumption about the function to predict the weight and length based on one of the variables?

In [None]:
# Load fish data
import pandas as pd
fish = pd.read_csv('fish.csv')
fish.head()

In [None]:
import itertools
import matplotlib.pyplot as plt

# create a list of all columns that we are considering
features = [ 'age', 'temp', 'weight', 'length' ]

# create all combinations of considered columns
combinations = itertools.combinations(features, 2)

# create a figure and specify its size
fig = plt.figure(figsize=(15,10))

# go through all combinations and create one plot for each
figure_index = 1
for combination in combinations:
    # add a sub plot to the figure
    axs = fig.add_subplot(2,3,figure_index)
    
    # plot the feature combination
    axs.scatter(fish[combination[0]], fish[combination[1]])
    
    # set the axis labels of the current sub plot
    axs.set_xlabel(combination[0])
    axs.set_ylabel(combination[1])
        
    # increase the figure index (otherwise all plots are drawn in the first subplot)
    figure_index+=1

    
# show the plot
plt.show()

#### 6.1.2 Learn a linear regression model that predicts the weight of the fish, and another one that predicts the length of the fish based on the combination of attributes you find most convenient for this. Which types of regression work best? Do they apply equally to all combinations of attributes? Make sure to evaluate your results properly!
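
As a warm-up, the following minimal sketch shows the general fit, predict and evaluate pattern on synthetic data. It does not use the fish dataset, and the names `x`, `y`, `model` and `rmse` as well as the generating formula are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# synthetic stand-in data (NOT the fish dataset): y = 3*x + 10 plus noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 100, size=60).reshape(-1, 1)  # a single feature must be a 2D array
y = 3.0 * x.ravel() + 10.0 + rng.normal(0, 5, size=60)

# hold out 40% of the examples for testing
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.4, random_state=42)

# fit on the training part only
model = LinearRegression()
model.fit(x_train, y_train)

# evaluate on the held-out part
y_pred = model.predict(x_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# print the fitted regression formula and the test error
print("y = {:.3f} * x + {:.3f}".format(model.coef_[0], model.intercept_))
print("test RMSE: {:.3f}".format(rmse))
```

The fitted slope and intercept should land close to the values used to generate the data, which is a quick sanity check that is not available on real data.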

In [None]:
from sklearn.model_selection import train_test_split

# separate features and target variable
length = fish['length']
weight = fish['weight']

# special case: we only have one feature, so we must reshape the data here
features = fish['age'].values.reshape(-1, 1)

# create a train/test split
length_train, length_test, length_target_train, length_target_test = train_test_split(
    features, length, test_size=0.4, random_state=42)

weight_train, weight_test, weight_target_train, weight_target_test = train_test_split(
    features, weight, test_size=0.4, random_state=42)

Learn a linear regression model for the weight:

In [None]:
from sklearn.linear_model import LinearRegression

# create and fit a linear regression
#TODO: INSERT YOUR CODE HERE

# plot the original values
plt.scatter(weight_train, weight_target_train, c='green', label='train')
plt.scatter(weight_test, weight_target_test, c='blue', label='test')

# plot the predicted values
#TODO: INSERT YOUR CODE HERE

# format and show the plot
plt.xlabel('age')
plt.ylabel('weight')
plt.legend()
plt.show()

# print the model that was fitted (the regression formula)
#TODO: INSERT YOUR CODE HERE

Learn a linear regression model for the length:

In [None]:
# create and fit a linear regression
#TODO: INSERT YOUR CODE HERE

# plot the original values
plt.scatter(length_train, length_target_train, c='green', label='train')
plt.scatter(length_test, length_target_test, c='blue', label='test')

# plot the predicted values
#TODO: INSERT YOUR CODE HERE

# format and show the plot
plt.xlabel('age')
plt.ylabel('length')
plt.legend()
plt.show()

# print the model that was fitted (the regression formula)
#TODO: INSERT YOUR CODE HERE

Learn a polynomial regression model for the length:
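
A polynomial regression can be built by chaining a feature transformer and a linear model. The sketch below illustrates this on synthetic data. The step names `transformer` and `estimator` match the ones used in the cell below, while `poly_pipeline`, the degree of 2 and the data itself are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# synthetic stand-in data: a quadratic curve plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2.0 + 1.5 * x.ravel() - 0.1 * x.ravel() ** 2 + rng.normal(0, 0.05, size=50)

# degree=2 is an arbitrary example choice;
# include_bias=False leaves the constant term to the intercept of the linear model
poly_pipeline = Pipeline([
    ('transformer', PolynomialFeatures(degree=2, include_bias=False)),
    ('estimator', LinearRegression()),
])
poly_pipeline.fit(x, y)

# the linear model now has one coefficient per generated polynomial feature
coefficients = poly_pipeline.named_steps['estimator'].coef_
intercept = poly_pipeline.named_steps['estimator'].intercept_
print("coefficients:", coefficients)
print("intercept:", intercept)
```

Higher degrees fit the training data more closely but can overfit, so the degree is worth comparing on held-out data.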

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

# create a transformer that generates polynomial features
#TODO: INSERT YOUR CODE HERE

# setup and fit a pipeline
#TODO: INSERT YOUR CODE HERE

prediction = pipeline.predict(features)

# plot the original values
plt.scatter(length_train, length_target_train, c='green', label='train')
plt.scatter(length_test, length_target_test, c='blue', label='test')

# create a new dataframe that contains the age and the predictions
d = fish[['age']]
d = d.assign(prediction=prediction)

# sort the data before plotting it
d = d.sort_values(by='age')

# plot the predicted values
plt.plot(d['age'], d['prediction'], c='red', label='prediction')

plt.xlabel('age')
plt.ylabel('length')
plt.legend()
plt.show()

# print the model that was fitted (the regression formula)
print("length = ", end='')
# get_feature_names_out requires scikit-learn >= 1.0 (get_feature_names was removed in 1.2)
for i, f in enumerate(pipeline.named_steps['transformer'].get_feature_names_out(['age'])):
    if i > 0:
        print(" + ", end='')
    print("{}*{}".format(pipeline.named_steps['estimator'].coef_[i], f), end='')
print(" + {}".format(pipeline.named_steps['estimator'].intercept_))


#### 6.1.3 Measure the performance of the different regression models you learned before. Use 10-fold cross validation and RMSE for evaluation.
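
One possible way to combine 10-fold cross validation with RMSE is to collect the out-of-fold predictions and score them once. The sketch below does this on synthetic data. The data and the names `cv`, `cv_predictions`, `cv_mse`, `cv_rmse` and `cv_r2` are assumptions for illustration, and shuffling the folds is a design choice rather than a requirement.

```python
from math import sqrt

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_val_predict

# synthetic stand-in data: y = 2*x + 1 plus noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100).reshape(-1, 1)
y = 2.0 * x.ravel() + 1.0 + rng.normal(0, 1.0, size=100)

# 10 folds; every example is predicted exactly once by a model that did not see it
cv = KFold(n_splits=10, shuffle=True, random_state=42)
cv_predictions = cross_val_predict(LinearRegression(), x, y, cv=cv)

# score the out-of-fold predictions against the true values
cv_mse = mean_squared_error(y, cv_predictions)
cv_rmse = sqrt(cv_mse)
cv_r2 = r2_score(y, cv_predictions)

print("MSE:", cv_mse)
print("RMSE:", cv_rmse)
print("R^2:", cv_r2)
```

The same call also accepts a `Pipeline` in place of `LinearRegression()`, so a polynomial model can be evaluated in exactly the same way.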

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt

print("Weight as a function of age")
#TODO: INSERT YOUR CODE HERE
print("MSE:", mse_1)
print("RMSE:", sqrt(mse_1))
print("R^2:", r2_1)
print()

print("Length as a function of age (linear)")
#TODO: INSERT YOUR CODE HERE
print("MSE:", mse_2)
print("RMSE:", sqrt(mse_2))
print("R^2:", r2_2)
print()

print("Length as a function of age (polynomial)")
#TODO: INSERT YOUR CODE HERE
print("MSE:", mse_3)
print("RMSE:", sqrt(mse_3))
print("R^2:", r2_3)
print()

### 6.2. Feature Selection
In this exercise you will explore different feature selection methods for linear regression.

#### 1. First, fit a linear regression model to the “birthweight_train” dataset without any feature selection and evaluate the model on the “birthweight_test” dataset.

In [None]:
# Load train and test data
b_train = pd.read_csv('birthweight_train.csv', sep=';')
b_test = pd.read_csv('birthweight_test.csv', sep=';')

b_train.head()

In [None]:
# Split features and target

# drop the 'id' and 'Birthweight' columns from the feature set
# using .iloc[rows, columns] you can access parts of the dataframe using numeric indices
# the first argument specifies which records to keep (here : means all)
# the second argument specifies which columns to keep (here 0:-2 starts at the first column and stops before the second-to-last, i.e. it drops the last two columns)
features_train = b_train.iloc[:,0:-2]
target_train = b_train['Birthweight']

features_test = b_test.iloc[:,0:-2]
target_test = b_test['Birthweight']

features_train.head()

We start with a standard linear regression model using all the features.

In [None]:
#TODO: INSERT YOUR CODE HERE
print("RMSE:", sqrt(mse))

#### 2. Look at the results of an F-Regression and inspect the p-values for each feature. Fit a second regression model using only the significant features (p<=0.05). How does the performance of your model change?
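
The sketch below illustrates the mechanics on synthetic data: `f_regression` returns one F statistic and one p-value per feature, and a selector such as `SelectFwe` can sit in front of the regression inside a pipeline. The column names `a`, `b` and `noise`, the step name `selection` and the data are assumptions for illustration. Note that `SelectFwe` applies a family-wise correction, so it is stricter than a plain p<=0.05 cut-off.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFwe, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# synthetic stand-in data: only 'a' and 'b' influence the target, 'noise' does not
rng = np.random.default_rng(7)
X = pd.DataFrame({
    'a': rng.normal(size=200),
    'b': rng.normal(size=200),
    'noise': rng.normal(size=200),
})
y = 4.0 * X['a'] - 3.0 * X['b'] + rng.normal(0, 0.5, size=200)

# univariate F-test: one F statistic and one p-value per feature
f_values, p_values = f_regression(X, y)
f_test = pd.DataFrame({'feature': X.columns, 'F': f_values, 'p': p_values})
print(f_test)

# SelectFwe keeps the features whose p-value is below alpha / number of features
selection_pipeline = Pipeline([
    ('selection', SelectFwe(f_regression, alpha=0.05)),
    ('estimator', LinearRegression()),
])
selection_pipeline.fit(X, y)

# map the boolean selection mask back to column names
selected = X.columns[selection_pipeline.named_steps['selection'].get_support()]
print("selected features:", list(selected))
```

Because the selector is part of the pipeline, calling `predict` on new data applies the same column selection automatically.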

In [None]:
from sklearn.feature_selection import f_regression

# run the F-Test
#TODO: INSERT YOUR CODE HERE

# show the results
#TODO: INSERT YOUR CODE HERE

In [None]:
from sklearn.feature_selection import SelectKBest, SelectFwe

# create a pipeline with feature selection based on the F Test
#TODO: INSERT YOUR CODE HERE

# fit the regression on the training data
#TODO: INSERT YOUR CODE HERE

# predict the values for the test data
#TODO: INSERT YOUR CODE HERE

# evaluate on the test set
#TODO: INSERT YOUR CODE HERE
print("RMSE:", rmse)

#### 3. Look at the new model and inspect the p-values again. Are there any features for which the p-value has changed?

In [None]:
# run the F-Test again on selected features
#TODO: INSERT YOUR CODE HERE

# prepare a dataframe to inspect the results
#TODO: INSERT YOUR CODE HERE

# show the results
#TODO: INSERT YOUR CODE HERE

### 6.3. Predicting housing prices in Boston

The Housing dataset describes 506 houses in the suburbs of Boston in 1993. The dataset is provided in the housing.csv file. The houses are described by 12 continuous attributes and 1 binary attribute.

#### 1. Your task is to find a good regression model for determining the median value (column MEDV) of a house.
You may experiment with different regression models and parameter tuning. As always, it may help to first visualize different attribute combinations of the data.

In [None]:
# Load the training and test data
h_train = pd.read_csv('housing_train.csv', sep=';')
h_test = pd.read_csv('housing_test.csv', sep=';')
h_train.head()

In [None]:
# Split features and target
train_data = h_train.iloc[:,0:-1]
train_target = h_train['MEDV']

test_data = h_test.iloc[:,0:-1]
test_target = h_test['MEDV']

Run all regression approaches that you learned about in a grid search:
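
One way to compare several model families in a single search is to treat the estimator itself as a pipeline parameter. The following is a minimal sketch on synthetic data with only two model families. The data, the step names `scaler` and `estimator`, and all parameter values are assumptions chosen for illustration, not recommended settings.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data: a linear target on four features, the last one irrelevant
rng = np.random.default_rng(3)
X = rng.normal(size=(120, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(0, 0.3, size=120)

# scaling happens inside the pipeline, so it is re-fitted on each training fold
search_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('estimator', Ridge()),  # placeholder, replaced by the entries of the grid
])

# one dictionary per model family; the values are arbitrary example settings
param_grid = [
    {'estimator': [Ridge()], 'estimator__alpha': [0.1, 1.0, 10.0]},
    {'estimator': [KNeighborsRegressor()], 'estimator__n_neighbors': [3, 5, 7]},
]

search = GridSearchCV(search_pipeline, param_grid,
                      scoring='neg_root_mean_squared_error', cv=5)
search.fit(X, y)

# the score is a negated RMSE, so flip the sign for reporting
print("best parameters:", search.best_params_)
print("best CV RMSE: {:.3f}".format(-search.best_score_))
```

Further dictionaries for `DecisionTreeRegressor` or `MLPRegressor` can be appended to `param_grid` in the same way.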

In [None]:
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import RFECV
from sklearn.preprocessing import StandardScaler

#TODO: INSERT YOUR CODE HERE

Next steps: Look at the different parameters of the regression approaches and try to optimise them. Also experiment with feature selection to improve your model!

# ChatGPT Bonus Exercises
Reminder: Do not take the answers of ChatGPT at face value! Always cross-check with lecture slides, literature and/or the teaching staff!

### C.1. Discuss types of regression models
* Use ChatGPT to compare the different regression models you learned about in the lecture. What are common challenges and limitations associated with regression analysis, and how can the different models address these issues? Which guidelines does ChatGPT propose for selecting an appropriate regression model?

### C.2. Learn about more ways for feature selection in Python
* Ask ChatGPT about the difference between feature selection using the p-value with a threshold and recursively removing features. Finally, ask ChatGPT to generate Python code that evaluates the recursive elimination on the dataset from task 6.2.
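
For cross-checking generated code, the following minimal sketch shows what recursive feature elimination with cross validation can look like in scikit-learn. It runs on synthetic data rather than the dataset from task 6.2, and the data, the name `selector` and the chosen parameters are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# synthetic stand-in data: only the first two of six columns carry signal
rng = np.random.default_rng(11)
X = rng.normal(size=(150, 6))
y = 5.0 * X[:, 0] - 4.0 * X[:, 1] + rng.normal(0, 0.5, size=150)

# remove one feature per step and use 5-fold CV to pick the number of features
selector = RFECV(LinearRegression(), step=1, cv=5,
                 scoring='neg_root_mean_squared_error')
selector.fit(X, y)

print("number of selected features:", selector.n_features_)
print("selection mask:", selector.support_)
print("ranking (1 = selected):", selector.ranking_)
```

Unlike the univariate F-test, this procedure re-fits the model after each removal, so the importance of a feature is always judged in the context of the remaining ones.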

In [None]:
# put your code here

### C.3. Self-Assessment
* Ask ChatGPT to create a pen-and-paper exercise for graduate students that lets you practice the calculation of the evaluation metrics MAE, MSE and RMSE. 
* Ask ChatGPT to create an exam exercise for graduate students on linear regression. The task should include a question on how insignificant variables are identified and how they impact the model’s performance. Get the answer from ChatGPT and critically evaluate it.
* Ask ChatGPT to create three multiple-choice questions for graduate students about choosing the best regression model for a specific task.
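
To check a pen-and-paper solution for MAE, MSE and RMSE, the calculation can be reproduced in a few lines of plain Python. The values below are made up for illustration.

```python
from math import sqrt

# made-up values for illustration
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]

# error per example: predicted minus actual -> -1, 0, 2, 1
errors = [p - a for a, p in zip(actual, predicted)]
n = len(errors)

mae = sum(abs(e) for e in errors) / n   # (1 + 0 + 2 + 1) / 4 = 1.0
mse = sum(e ** 2 for e in errors) / n   # (1 + 0 + 4 + 1) / 4 = 1.5
rmse = sqrt(mse)                        # square root of the MSE

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", round(rmse, 4))
```

Squaring the errors makes the single largest error (2) contribute 4 of the 6 units in the MSE sum, which is why MSE and RMSE react more strongly to outliers than MAE.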