# 1.2 Exercises – Non-linear Regression

In this notebook you will practice the non-linear regression techniques from the lecture using a **new dataset**.

## Dataset: Diamond Prices

We will use the [diamonds dataset](https://ggplot2.tidyverse.org/reference/diamonds.html), which contains prices and attributes of ~54,000 diamonds. For efficiency we work with a random sample of **500 diamonds**.

| Feature | Description |
|---------|-------------|
| `carat` | Weight of the diamond in carats (0.2–5.01 in the full dataset) |
| `price` | Price in US dollars ($326–$18,823 in the full dataset) |

The relationship between `carat` and `price` is clearly **non-linear**: price rises faster than proportionally with weight, so larger diamonds are disproportionately more expensive.

**Goal**: Use polynomial regression and other non-linear techniques to model diamond prices.

In [None]:
%matplotlib inline
import warnings
# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_context("notebook", font_scale=1.2)
sns.set_style("whitegrid")

## Exercise 1 – Load and explore the dataset

Load the diamonds dataset from the URL below and take a random sample of 500 rows (use `random_state=42`). Keep only the `carat` and `price` columns.

Display the shape, basic statistics (`.describe()`), and the first 5 rows.

```python
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/diamonds.csv"
```
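The sampling and column-selection steps can be sketched on a small made-up frame. The `toy` DataFrame and its values are hypothetical and only stand in for the real data, which would come from `pd.read_csv(url)`; the sample size of 4 is likewise chosen just to fit the toy frame.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real CSV (values are made up).
toy = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 1.0, 1.5, 2.0],
    "cut": ["Ideal", "Good", "Premium", "Ideal", "Fair", "Good"],
    "price": [400, 1200, 2500, 5000, 9000, 15000],
})

# Draw a reproducible random sample, then keep only the two columns of interest.
sample = toy.sample(n=4, random_state=42)[["carat", "price"]]

print(sample.shape)       # (rows, columns)
print(sample.describe())  # summary statistics per column
print(sample.head())      # first rows (at most 5)
```

Fixing `random_state` means the same rows are drawn on every run, which keeps results comparable between students.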

In [None]:
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/diamonds.csv"

# YOUR CODE HERE

## Exercise 2 – Visualize the relationship

Create a scatter plot of `carat` (x-axis) versus `price` (y-axis) using `sns.lmplot` with `fit_reg=False`.

Does the relationship look linear?

In [None]:
# YOUR CODE HERE

## Exercise 3 – Fit a linear model

Fit a `LinearRegression` model using `carat` as the single feature to predict `price`.

1. Compute and print the R² score.
2. Plot the data points and overlay the linear fit line.

**Hint**: Use `np.linspace` to generate evenly spaced x-values for plotting the line.
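As a rough illustration of the hint, the sketch below fits a straight line to synthetic, deliberately curved data (not the diamonds sample) and builds an evenly spaced grid for drawing the line. The variable names and the data-generating formula are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data, curved on purpose (an assumption for illustration only).
rng = np.random.default_rng(0)
x = rng.uniform(0.2, 2.5, size=200)
y = 3000 * x**2 + rng.normal(0, 500, size=200)

# scikit-learn expects a 2-D feature matrix, hence the reshape.
X = x.reshape(-1, 1)
model = LinearRegression().fit(X, y)
r2 = r2_score(y, model.predict(X))

# Evenly spaced x-values covering the data range, used only for plotting the line.
x_line = np.linspace(x.min(), x.max(), 100)
y_line = model.predict(x_line.reshape(-1, 1))

print(f"R^2 = {r2:.3f}")
```

Passing `x_line` and `y_line` to `plt.plot` on top of a scatter plot of the data would then overlay the fitted line.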

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# YOUR CODE HERE

## Exercise 4 – Evaluate the linear model

Is the linear model a good fit for this data? Why or why not? Write your answer below.

*Your answer here*

## Exercise 5 – Polynomial model (degree 2)

Add a new feature `carat^2` (the square of `carat`) to the dataset. Fit a `LinearRegression` model using both `carat` and `carat^2` as features.

1. Print the R² score and compare it to the linear model.
2. Plot the data and overlay the degree-2 polynomial curve.

**Hint**: To plot the curve, compute predictions as:

$$\hat{y} = \theta_0 + \theta_1 \cdot x + \theta_2 \cdot x^2$$

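where $\theta_0$ is `model.intercept_`, and $\theta_1$ and $\theta_2$ are the entries of `model.coef_`, in the same order as the feature columns (`carat`, then `carat^2`).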
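To make the mapping between the formula and the fitted model concrete, here is a minimal sketch on synthetic data (the data and variable names are assumptions for illustration). It evaluates the formula by hand and checks that the result agrees with `model.predict`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic quadratic data (made up for this example).
rng = np.random.default_rng(1)
x = rng.uniform(0.2, 2.5, size=200)
y = 500 + 1000 * x + 2500 * x**2 + rng.normal(0, 400, size=200)

# Feature matrix with columns x and x^2.
X = np.column_stack([x, x**2])
model = LinearRegression().fit(X, y)

# Coefficients follow the column order of X.
theta0 = model.intercept_
theta1, theta2 = model.coef_

# Evaluate the polynomial by hand on a plotting grid...
x_plot = np.linspace(x.min(), x.max(), 100)
y_manual = theta0 + theta1 * x_plot + theta2 * x_plot**2

# ...and compare with the model's own predictions on the same grid.
y_predict = model.predict(np.column_stack([x_plot, x_plot**2]))
print(np.allclose(y_manual, y_predict))
```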

In [None]:
# YOUR CODE HERE

## Exercise 6 – Higher-degree polynomials with scaling

Fit polynomial models of degrees 2 through 7 and plot all fits on a single figure.

**Important**: Higher-degree polynomial features can take very large values: for a 5-carat diamond, `carat^7` is 78,125. Use `MinMaxScaler` to scale each polynomial feature to the [0, 1] range before fitting.

**Steps**:
1. Loop over degrees 2 to 7.
2. For each degree, build a feature matrix with columns `carat`, `carat^2`, ..., `carat^d`.
3. Scale the polynomial features (degree ≥ 2) using `MinMaxScaler`.
4. Fit a `LinearRegression` and store the R² score.
5. Generate predictions for plotting (remember to scale the plot features with the **same** scaler).
6. Show all polynomial fits on one scatter plot with a legend that includes the R².
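The steps above could be organised roughly as follows. This is a sketch on synthetic data, not a reference solution: `poly_matrix` is a hypothetical helper, and for simplicity it scales every column, including the first-power one. The key point is that the plotting grid is transformed with the scaler that was fitted on the data, never with a freshly fitted one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler


def poly_matrix(x, degree):
    """Hypothetical helper: columns x, x^2, ..., x^degree."""
    return np.column_stack([x**d for d in range(1, degree + 1)])


# Synthetic data standing in for carat and price (made up).
rng = np.random.default_rng(2)
x = rng.uniform(0.2, 3.0, size=300)
y = 4000 * x**2 + rng.normal(0, 800, size=300)
x_plot = np.linspace(x.min(), x.max(), 100)

scores = {}  # degree -> R^2 on the fitted data
curves = {}  # degree -> predictions on the plotting grid
for degree in range(2, 8):
    scaler = MinMaxScaler()
    X = scaler.fit_transform(poly_matrix(x, degree))  # fit the scaler on the data
    model = LinearRegression().fit(X, y)
    scores[degree] = model.score(X, y)

    # Reuse the SAME scaler for the grid so both live in the same feature space.
    X_plot = scaler.transform(poly_matrix(x_plot, degree))
    curves[degree] = model.predict(X_plot)

for degree, r2 in scores.items():
    print(f"degree {degree}: R^2 = {r2:.4f}")
```

Each entry of `curves` can be drawn with `plt.plot(x_plot, curves[degree], label=...)`, with the matching value from `scores` formatted into the legend label.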

In [None]:
from sklearn.preprocessing import MinMaxScaler

# YOUR CODE HERE

## Exercise 7 – Train/test split: detecting overfitting

Split the data into 80% training and 20% test sets (use `random_state=42`). For each polynomial degree from 2 to 9:

1. Build polynomial features and scale them with `MinMaxScaler`.
2. Fit on the **training** set.
3. Compute R² on both the training and test sets.

Print a table showing degree, training R², and test R².

**Hint**: Use `train_test_split` from `sklearn.model_selection`. Make sure to `fit` the scaler on the training data only and `transform` both train and test.
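The hint about fitting the scaler on the training data only can be sketched as below, again on synthetic data with assumed variable names. Fitting the scaler on the full dataset would leak information about the test set into the preprocessing step.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic data (made up for this example).
rng = np.random.default_rng(3)
x = rng.uniform(0.2, 3.0, size=200)
y = 4000 * x**2 + rng.normal(0, 800, size=200)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

rows = []
for degree in range(2, 10):
    powers = np.arange(1, degree + 1)
    # Broadcasting builds the columns x, x^2, ..., x^degree in one step.
    X_train = x_train[:, None] ** powers
    X_test = x_test[:, None] ** powers

    # Fit the scaler on the training features only, then apply it to both sets.
    scaler = MinMaxScaler().fit(X_train)
    model = LinearRegression().fit(scaler.transform(X_train), y_train)

    r2_train = model.score(scaler.transform(X_train), y_train)
    r2_test = model.score(scaler.transform(X_test), y_test)
    rows.append((degree, r2_train, r2_test))

print(f"{'degree':>6} {'train R2':>9} {'test R2':>9}")
for degree, r2_train, r2_test in rows:
    print(f"{degree:>6d} {r2_train:>9.4f} {r2_test:>9.4f}")
```

Because the scaler only saw the training range, scaled test features may fall slightly outside [0, 1]; that is expected and not an error.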

In [None]:
from sklearn.model_selection import train_test_split

# YOUR CODE HERE

## Exercise 8 – Interpret the results

Based on the training vs. test R² values:
- Which polynomial degree gives the best **test** performance?
- At what degree do you start to see signs of **overfitting** (training R² much higher than test R²)?

Write your answer below.

*Your answer here*

## Exercise 9 – Using `PolynomialFeatures`

Instead of manually creating polynomial features, use `sklearn.preprocessing.PolynomialFeatures` to automatically generate them.

1. Create degree-3 polynomial features from `carat` (set `include_bias=False`).
2. Fit a `LinearRegression` model on these features.
3. Print the R² score.
4. Compare this result with your manual degree-3 polynomial from Exercise 6.
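Before comparing scores, it may help to see that `PolynomialFeatures` with `include_bias=False` produces the same columns as the manual construction. The three input values below are arbitrary and chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Arbitrary example values (not taken from the diamonds data).
x = np.array([0.5, 1.0, 2.0])

# Automatic construction: columns x, x^2, x^3 (no constant column).
poly = PolynomialFeatures(degree=3, include_bias=False)
X_auto = poly.fit_transform(x.reshape(-1, 1))

# Manual construction of the same columns.
X_manual = np.column_stack([x, x**2, x**3])

print(X_auto)
print(np.allclose(X_auto, X_manual))
```

With `include_bias=True` (the default) an extra column of ones would be prepended; `LinearRegression` already fits an intercept, so that column is not needed here.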

In [None]:
from sklearn.preprocessing import PolynomialFeatures

# YOUR CODE HERE

## Exercise 10 – Sigmoid curve fitting

As discussed in the lecture, many biological relationships follow a **sigmoid (logistic) function**:

$$f(x) = \frac{L}{1 + e^{-\theta_1(x - \theta_0)}}$$

where:
- $L$ is the curve's maximum value
- $\theta_0$ is the x-value of the sigmoid's midpoint
- $\theta_1$ controls the steepness

The cell below provides synthetic dose-response data. Your task:

1. Define a Python function `sigmoid(x, L, theta0, theta1)` implementing the formula above.
2. Fit the parameters using `scipy.optimize.curve_fit`.
3. Plot the data and the fitted sigmoid curve.
4. Print the fitted parameters.
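The general pattern of `curve_fit` is shown below on a different synthetic curve, so the parameter values here are not those of the exercise data. The starting values in `p0` are a heuristic assumed for this sketch (maximum observed response, median x, unit steepness); without a reasonable starting point the optimiser may fail to converge.

```python
import numpy as np
from scipy.optimize import curve_fit


def sigmoid(x, L, theta0, theta1):
    """Logistic curve with maximum L, midpoint theta0 and steepness theta1."""
    return L / (1 + np.exp(-theta1 * (x - theta0)))


# Synthetic data from known parameters (made up): L=50, theta0=12, theta1=0.8.
rng = np.random.default_rng(4)
x = np.linspace(0, 20, 40)
y = sigmoid(x, 50.0, 12.0, 0.8) + rng.normal(0, 1.0, size=x.size)

# Heuristic starting point (an assumption of this sketch).
p0 = [y.max(), np.median(x), 1.0]

# curve_fit returns the best-fit parameters and their covariance matrix.
params, covariance = curve_fit(sigmoid, x, y, p0=p0)
L_fit, theta0_fit, theta1_fit = params

print(f"L = {L_fit:.2f}, theta0 = {theta0_fit:.2f}, theta1 = {theta1_fit:.2f}")
```

The fitted curve can then be drawn by evaluating `sigmoid(x_grid, *params)` on a fine grid of x-values.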

In [None]:
# Synthetic dose-response data (provided)
# Generated from a sigmoid with L = 100, theta0 = 5, theta1 = 1.5, plus Gaussian noise (sd = 3)
np.random.seed(42)
dose = np.linspace(0, 10, 30)
response = 100 / (1 + np.exp(-(1.5 * (dose - 5)))) + np.random.normal(0, 3, size=30)
response = np.clip(response, 0, 100)
dose_response = pd.DataFrame({'dose': dose, 'response': response})

plt.figure(figsize=(9, 6))
plt.scatter(dose_response['dose'], dose_response['response'], s=60)
plt.xlabel('Dose')
plt.ylabel('Response (%)')
plt.title('Dose-Response Data')
plt.show()

In [None]:
from scipy.optimize import curve_fit

# YOUR CODE HERE

## Bonus – Log-transform the target variable

Sometimes a non-linear relationship can be made approximately linear by **transforming the target variable**.

1. Create `log_price = log(price)` using `np.log`.
2. Fit a simple `LinearRegression` model: `carat` → `log_price`.
3. Compute R² and plot the fit.
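4. Compare this R² to your polynomial models. Keep in mind that it is computed on `log_price`, not on `price`, so the two scores are not measured on the same scale.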

**Why does this work?** If price ≈ $e^{a + b \cdot \text{carat}}$, then $\log(\text{price}) \approx a + b \cdot \text{carat}$, which is linear!
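The argument above can be checked numerically on synthetic data generated from an exponential relationship. The constants `6.0` and `1.8` are made up for this sketch. It also back-transforms the predictions with `np.exp`, which gives an R² on the original scale that is more directly comparable to the polynomial models.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: y = exp(a + b*x + noise), with made-up a = 6.0 and b = 1.8.
rng = np.random.default_rng(5)
x = rng.uniform(0.2, 2.5, size=200)
y = np.exp(6.0 + 1.8 * x + rng.normal(0, 0.2, size=200))

# Fit a straight line to log(y); slope and intercept should recover b and a.
X = x.reshape(-1, 1)
model = LinearRegression().fit(X, np.log(y))
r2_log = model.score(X, np.log(y))

# Back-transform predictions to the original scale before comparing with other models.
y_hat = np.exp(model.predict(X))
r2_original = r2_score(y, y_hat)

print(f"intercept = {model.intercept_:.2f}, slope = {model.coef_[0]:.2f}")
print(f"R^2 (log scale) = {r2_log:.3f}, R^2 (original scale) = {r2_original:.3f}")
```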

In [None]:
# YOUR CODE HERE