In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Regression Models Lab
## Linear and logistic regression: theory and practice

In this lab you'll revisit and expand on your knowledge of modelling in general, as well as the fundamentals of linear and logistic regression. As a reminder, _linear regression_ is a regression model (regressor), and _logistic regression_ is a classification model (classifier).

This time, you'll use generated data, in order to separate some of the complexity of handling various datasets from inspecting and evaluating models.

**Use vectorization as much as possible!** You should be able to complete the lab using for-loops only to track the training steps.

### Problem 1. Generate some data for multiple linear regression (1 point)
As an expansion to the lecture, you'll create a dataset and a model.

Create a dataset of some (e.g., 50-500) observations of several (e.g., 5-20) independent features. You can use random generators for them; think about what distributions you'd like to use. Let's call them $x_1, x_2, ..., x_m$. The resulting data matrix $X$ should be of size $n \times m$. It's best if all features have different ranges.

Create the dependent variable by assigning coefficients $\bar{a_1}, \bar{a_2}, ..., \bar{a_m}, \bar{b}$ and calculating $y$ as a linear combination of the input features. Add some random noise to the functional values. I've used bars over coefficients to avoid confusion with the model parameters later.

Save the dataset ($X$ and $y$), and "forget" that the coefficients have ever existed. "All" you have is the file and the implicit assumption that there is a linear relationship between $X$ and $y$.

In [18]:
# Generate x values using linspace
x = np.linspace(-3, 5, 500)

# Generate y values based on the equation y = 2x + 3 with added noise
noise = np.random.normal(0, 0.5, x.shape)  # Add some Gaussian noise
y = 2 * x + 3 + noise

# Convert x and y into a DataFrame
dataset = pd.DataFrame({'x': x, 'y': y})

dataset

Unnamed: 0,x,y
0,-3.000000,-3.323286
1,-2.983968,-3.508710
2,-2.967936,-2.092301
3,-2.951904,-2.462988
4,-2.935872,-2.875730
...,...,...
495,4.935872,13.084472
496,4.951904,12.420320
497,4.967936,12.912016
498,4.983968,12.966135


In [16]:
n_observations = 100
n_features = 5

# Generate random features with different ranges using vectorized operations
np.random.seed(42)  # For reproducibility
X = np.random.rand(n_observations, n_features) * np.array([1, 10, 100, 1000, 10000])

# Assign arbitrary coefficients for the linear combination
# (no intercept is added below, so the true bias term is 0)
true_coefficients = np.array([2, -3, 4.5, -1.5, 0.5])

# Calculate the dependent variable using vectorized operations and add noise
noise = np.random.normal(0, 10, n_observations)  # Gaussian noise
y = X @ true_coefficients + noise  # Matrix multiplication using @ operator

# Convert to a DataFrame for easier handling
data = pd.DataFrame(X, columns=[f'Feature_{i+1}' for i in range(n_features)])
data['Target'] = y

# Display the first few rows of the dataset
data


Unnamed: 0,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5,Target
0,0.374540,9.507143,73.199394,598.658484,1560.186404,187.147961
1,0.155995,0.580836,86.617615,601.115012,7080.725778,3045.800826
2,0.020584,9.699099,83.244264,212.339111,1818.249672,945.663470
3,0.183405,3.042422,52.475643,431.945019,2912.291402,1029.839072
4,0.611853,1.394939,29.214465,366.361843,4560.699842,1850.326991
...,...,...,...,...,...,...
95,0.992965,0.737966,55.385428,969.302536,5230.978442,1400.986474
96,0.629399,6.957487,45.454106,627.558080,5843.143119,2149.099792
97,0.901158,0.454464,28.096319,950.411484,8902.637839,3154.608690
98,0.455657,6.201326,27.738118,188.121160,4636.984049,2135.875646


In [17]:
data.to_csv('multiple_linear_regression_dataset.csv', index=False)

data.head()

Unnamed: 0,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5,Target
0,0.37454,9.507143,73.199394,598.658484,1560.186404,187.147961
1,0.155995,0.580836,86.617615,601.115012,7080.725778,3045.800826
2,0.020584,9.699099,83.244264,212.339111,1818.249672,945.66347
3,0.183405,3.042422,52.475643,431.945019,2912.291402,1029.839072
4,0.611853,1.394939,29.214465,366.361843,4560.699842,1850.326991


### Problem 2. Check your assumption (1 point)
Read the dataset you just saved (this is just to simulate starting a new project). It's a good idea to test and verify our assumptions. Find a way to check whether there really is a linear relationship between the features and output.
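
One possible check, sketched under assumptions: look at the correlation of each feature with the target, then fit ordinary least squares and inspect $R^2$ and the residuals. The stand-in data below only mimics problem 1 so that the snippet runs on its own; in the lab you would load the saved CSV (for example with `pd.read_csv`) instead. All variable names are illustrative.

```python
import numpy as np

# Stand-in data so the snippet is self-contained (assumption, not the saved file).
rng = np.random.default_rng(0)
X = rng.random((200, 3)) * np.array([1.0, 10.0, 100.0])
y = X @ np.array([2.0, -3.0, 4.5]) + 1.0 + rng.normal(0, 0.5, 200)

# Pearson correlation of every feature with the target (last row of the matrix).
corr_matrix = np.corrcoef(np.column_stack([X, y]), rowvar=False)
correlations = corr_matrix[-1, :-1]

# Least-squares fit with an intercept column, then R^2 from the residuals.
X_design = np.column_stack([X, np.ones(len(X))])
params, *_ = np.linalg.lstsq(X_design, y, rcond=None)
residuals = y - X_design @ params
r_squared = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

print("correlations:", correlations)
print("R^2:", r_squared)
```

An $R^2$ close to 1 together with residuals that show no visible pattern against any feature is consistent with a linear relationship. Note that a single feature can have a weak correlation with the target and still be part of a perfectly linear model.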

### Problem 3. Figure out the modelling function (1 point)
The modelling function for linear regression is of the form
$$ \tilde{y} = \sum_{i=1}^{m}a_i x_i + b $$

If you want to be clever, you can find a way to represent $b$ in the same way as the other coefficients.

Write a Python function which accepts coefficients and data, and ensure (test) it works correctly.
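
A minimal sketch of the "clever" representation, assuming the parameters are stored as one vector `[a_1, ..., a_m, b]`: append a column of ones to $X$ so that $b$ is multiplied by 1 and the whole model becomes a single matrix product. The names `add_bias_column` and `predict_linear` are hypothetical.

```python
import numpy as np

def add_bias_column(X):
    """Append a column of ones so that b acts as one more coefficient."""
    return np.column_stack([X, np.ones(X.shape[0])])

def predict_linear(params, X):
    """Compute sum_i a_i * x_i + b for every row; params = [a_1, ..., a_m, b]."""
    return add_bias_column(X) @ params

# Tiny hand-checkable test: two observations, two features.
X_demo = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
params_demo = np.array([2.0, -1.0, 0.5])  # a_1 = 2, a_2 = -1, b = 0.5
y_demo = predict_linear(params_demo, X_demo)
print(y_demo)  # row 1: 2*1 - 1*2 + 0.5 = 0.5; row 2: 2*3 - 1*4 + 0.5 = 2.5
```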

### Problem 4. Write the cost function and compute its gradients (1 point)
Use MSE as the cost function $J$. Find a way to compute, calculate, or derive its gradients w.r.t. the model parameters $a_1, ..., a_m, b$.

Note that computing the cost function value and its gradients are two separate operations. Quick reminder: use vectorization to compute all gradients (maybe with the exception of $\frac{\partial J}{\partial b}$) at the same time.
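
As a sketch, assuming the bias is folded into the parameter vector through a column of ones (so `X_aug` has $m + 1$ columns), the cost is $J = \frac{1}{n}\sum_j (\tilde{y}_j - y_j)^2$ and all gradients come out of one matrix product, $\nabla J = \frac{2}{n} X_{aug}^T(\tilde{y} - y)$. The function names are hypothetical; the finite-difference comparison is just one way to test the derivation.

```python
import numpy as np

def mse(params, X_aug, y):
    """Mean squared error of the linear model X_aug @ params."""
    residuals = X_aug @ params - y
    return np.mean(residuals ** 2)

def mse_gradient(params, X_aug, y):
    """Gradient of the MSE w.r.t. all parameters at once."""
    residuals = X_aug @ params - y
    return 2 * X_aug.T @ residuals / len(y)

# Random test problem (illustrative data only).
rng = np.random.default_rng(1)
X_aug = np.column_stack([rng.normal(size=(50, 3)), np.ones(50)])
y = rng.normal(size=50)
params = rng.normal(size=4)

# Central finite differences along each coordinate direction.
eps = 1e-6
numeric = np.array([
    (mse(params + eps * e, X_aug, y) - mse(params - eps * e, X_aug, y)) / (2 * eps)
    for e in np.eye(4)
])
analytic = mse_gradient(params, X_aug, y)
print(np.max(np.abs(numeric - analytic)))
```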

### Problem 5. Perform gradient descent (1 point)
Perform weight updates iteratively. Find a useful criterion for stopping. For most cases, just using a fixed (large) number of steps is enough.

You'll need to set a starting point (think about which one would be good, and how much it matters) and a learning rate.
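
A possible shape for the training loop, as a sketch under assumptions: start from zeros, use a fixed learning rate, and stop either after `n_steps` or when the gradient norm drops below a tolerance. Standardizing the features first is an assumption added here, because features with very different ranges make plain gradient descent hard to tune. The result is compared against `np.linalg.lstsq` as a sanity check.

```python
import numpy as np

def mse_gradient(params, X_aug, y):
    return 2 * X_aug.T @ (X_aug @ params - y) / len(y)

def gradient_descent(gradient_fn, X_aug, y, learning_rate=0.1,
                     n_steps=2000, tolerance=1e-10):
    """Plain gradient descent; the for-loop only tracks training steps."""
    params = np.zeros(X_aug.shape[1])
    for step in range(n_steps):
        grad = gradient_fn(params, X_aug, y)
        params = params - learning_rate * grad
        if np.linalg.norm(grad) < tolerance:  # stopping criterion
            break
    return params

# Illustrative data with features on different scales.
rng = np.random.default_rng(2)
X_raw = rng.random((200, 3)) * np.array([1.0, 10.0, 100.0])
y = X_raw @ np.array([2.0, -3.0, 4.5]) + 1.0 + rng.normal(0, 0.1, 200)

# Standardize each feature, then append the bias column.
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
X_aug = np.column_stack([X_std, np.ones(len(X_std))])

params_gd = gradient_descent(mse_gradient, X_aug, y)
params_ols, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print(params_gd, params_ols)
```

The fitted coefficients refer to the standardized features, so they differ from the ones used to generate the data; they can be mapped back by dividing by each feature's standard deviation.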

### Problem 6. Do other cost functions work? (2 points)
Repeat the process in problems 4 and 5 with MAE, and then again with the [Huber loss](https://en.wikipedia.org/wiki/Huber_loss). Both are less sensitive to outliers / anomalies than MSE, and the Huber loss was designed specifically for datasets with outliers.

Explain your findings. Is there a cost function that works much better? How about speed of training (measured in wall time)?
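
A sketch of the two alternative costs and their (sub)gradients, assuming a bias column of ones in `X_aug` and a Huber threshold `delta=1.0` (both are assumptions, as are the function names). The derivative of $|r|$ is $\mathrm{sign}(r)$, and the derivative of the Huber loss is the residual clipped to $[-\delta, \delta]$. The demo adds a few artificial outliers and compares a Huber fit with a least-squares fit.

```python
import numpy as np

def mae(params, X_aug, y):
    return np.mean(np.abs(X_aug @ params - y))

def mae_gradient(params, X_aug, y):
    # Subgradient: np.sign returns 0 where the residual is exactly 0.
    return X_aug.T @ np.sign(X_aug @ params - y) / len(y)

def huber(params, X_aug, y, delta=1.0):
    residuals = X_aug @ params - y
    abs_r = np.abs(residuals)
    quadratic = 0.5 * residuals ** 2          # used where |r| <= delta
    linear = delta * (abs_r - 0.5 * delta)    # used where |r| > delta
    return np.mean(np.where(abs_r <= delta, quadratic, linear))

def huber_gradient(params, X_aug, y, delta=1.0):
    residuals = X_aug @ params - y
    return X_aug.T @ np.clip(residuals, -delta, delta) / len(y)

# Illustrative data with five artificial outliers.
rng = np.random.default_rng(5)
X_aug = np.column_stack([rng.normal(size=(100, 3)), np.ones(100)])
true_params = np.array([2.0, -3.0, 4.5, 1.0])
y = X_aug @ true_params + rng.normal(0, 0.5, 100)
y[:5] += 50.0

params_huber = np.zeros(4)
for step in range(2000):
    params_huber -= 0.1 * huber_gradient(params_huber, X_aug, y)

# Closed-form least squares (the MSE minimizer) for comparison.
params_mse, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print("Huber:", params_huber)
print("MSE:  ", params_mse)
```

For the wall-time comparison, wrapping each training run between two calls to `time.perf_counter()` is one simple option.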

### Problem 7. Experiment with the learning rate (1 point)
Use your favorite cost function. Run several "experiments" with different learning rates. Try really small, and really large values. Observe and document your findings.
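
One way to organize such experiments, as a sketch: wrap a training run in a function of the learning rate and collect the final cost for each value. The data, the step count, and the chosen rates are illustrative assumptions; MSE is used here only because its gradient is short to write.

```python
import numpy as np

# Illustrative, already well-scaled data with a bias column.
rng = np.random.default_rng(3)
X_aug = np.column_stack([rng.normal(size=(200, 3)), np.ones(200)])
y = X_aug @ np.array([2.0, -3.0, 4.5, 1.0]) + rng.normal(0, 0.1, 200)

def final_mse(learning_rate, n_steps=100):
    """Train from zeros for n_steps and return the final MSE."""
    params = np.zeros(X_aug.shape[1])
    for _ in range(n_steps):
        residuals = X_aug @ params - y
        params -= learning_rate * 2 * X_aug.T @ residuals / len(y)
    return np.mean((X_aug @ params - y) ** 2)

learning_rates = [1e-4, 1e-2, 1e-1, 1.5]
final_costs = {lr: final_mse(lr) for lr in learning_rates}
for lr, cost in final_costs.items():
    print(f"learning rate {lr:g}: final MSE {cost:.3g}")
```

With these settings the smallest rate barely moves away from the starting point, the middle ones converge at different speeds, and the largest one diverges: each step overshoots the minimum, so the cost grows instead of shrinking.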

### Problem 8. Generate some data for classification (1 point)
You'll need to create two clusters of points (one cluster for each class). I recommend using `scikit-learn`'s `make_blobs()` ([info](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html)). Use as many features as you used in problem 1.

### Problem 9. Perform logistic regression (1 point)
Reuse the code you wrote in problems 3-7 as much as possible. If you wrote vectorized functions with variable parameters - you should find this easy. If not - it's not too late to go back and refactor your code.

The modelling function for logistic regression is
$$ \tilde{y} = \frac{1}{1+\exp{\left(-\left(\sum_{i=1}^{m}a_i x_i + b\right)\right)}} $$

Find a way to represent it using as much of your previous code as you can.

The most commonly used loss function is the [cross-entropy](https://en.wikipedia.org/wiki/Cross-entropy).

Experiment with different learning rates, basically repeating what you did in problem 7.
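
A compact sketch of how the pieces can fit together, under assumptions: the logistic model is the sigmoid applied to the same linear combination used for linear regression (with a bias column of ones), and the gradient of the mean cross-entropy is $\frac{1}{n} X_{aug}^T(\tilde{y} - y)$. Two Gaussian clusters generated with NumPy stand in for `make_blobs()` so the snippet has no extra dependencies; all names and settings are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(params, X_aug):
    """Sigmoid of the linear model; params = [a_1, ..., a_m, b]."""
    return sigmoid(X_aug @ params)

def cross_entropy(params, X_aug, y, eps=1e-12):
    # Clip probabilities to avoid log(0).
    p = np.clip(predict_proba(params, X_aug), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def cross_entropy_gradient(params, X_aug, y):
    return X_aug.T @ (predict_proba(params, X_aug) - y) / len(y)

# Two clusters, one per class (stand-in for make_blobs).
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1.5, 1.0, size=(100, 2)),
               rng.normal(1.5, 1.0, size=(100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])
X_aug = np.column_stack([X, np.ones(len(X))])

params = np.zeros(3)
initial_cost = cross_entropy(params, X_aug, y)  # equals log(2) at params = 0
for step in range(500):
    params -= 0.5 * cross_entropy_gradient(params, X_aug, y)

final_cost = cross_entropy(params, X_aug, y)
accuracy = np.mean((predict_proba(params, X_aug) >= 0.5) == y)
print(initial_cost, final_cost, accuracy)
```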

### * Problem 10. Continue experimenting and delving deep into ML
You just saw how modelling works and how to implement some code. Some of the things you can think about (I recommend you pause and ponder on some of them) are:
* Code: OOP can be your friend sometimes. `scikit-learn`'s models have `fit()`, `predict()` and `score()` methods.
* Data: What approaches work on non-generated data?
* Evaluation: How well do different models (and their "settings" - hyperparameters) actually work in practice? How do we evaluate a model in a meaningful way?
* Optimization - maths: Look at what `optimizers` (or solvers) are used in `scikit-learn` and why. Many "tricks" revolve around making the algorithm converge (finish) in fewer iterations, or making it more numerically stable.
* Optimization - code: Are there ways to make the code run faster?
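
To illustrate the OOP idea from the list above, here is a sketch of a hypothetical `GDLinearRegression` class that borrows the `fit()` / `predict()` / `score()` method names from `scikit-learn` without using the library. It trains with gradient descent on MSE and, like `scikit-learn`'s regressors, reports $R^2$ from `score()`. The class name, defaults, and demo data are assumptions.

```python
import numpy as np

class GDLinearRegression:
    """Hypothetical minimal linear regressor trained with gradient descent."""

    def __init__(self, learning_rate=0.1, n_steps=1000):
        self.learning_rate = learning_rate
        self.n_steps = n_steps
        self.params_ = None  # [a_1, ..., a_m, b], set by fit()

    @staticmethod
    def _augment(X):
        # Column of ones so the bias is learned like any other coefficient.
        return np.column_stack([X, np.ones(X.shape[0])])

    def fit(self, X, y):
        X_aug = self._augment(X)
        self.params_ = np.zeros(X_aug.shape[1])
        for _ in range(self.n_steps):
            residuals = X_aug @ self.params_ - y
            self.params_ -= self.learning_rate * 2 * X_aug.T @ residuals / len(y)
        return self

    def predict(self, X):
        return self._augment(X) @ self.params_

    def score(self, X, y):
        """Coefficient of determination R^2."""
        residuals = y - self.predict(X)
        return 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

# Illustrative usage on well-scaled synthetic data.
rng = np.random.default_rng(6)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -2.0]) + 0.5 + rng.normal(0, 0.1, 200)

model = GDLinearRegression().fit(X, y)
r2 = model.score(X, y)
print(model.params_, r2)
```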