# EV Car Prices

This assignment focuses on car prices. The data ('car_prices.xlsx') is a pre-processed version of original data scraped from bilbasen.dk by previous MAL1 students. The dataset contains 16 columns:

- **Price (DKK)**: The current listed price of the vehicle in Danish Kroner.
- **Model Year**: The manufacturing year of the vehicle.
- **Mileage (km)**: The total kilometres driven by the vehicle (odometer reading).
- **Electric Range (km)**: The estimated maximum driving range on a full charge.
- **Battery Capacity (kWh)**: The total capacity of the vehicle's battery in kilowatt-hours.
- **Energy Consumption (Wh/km)**: The vehicle's energy consumption in watt-hours per kilometre.
- **Annual Road Tax (DKK)**: The annual road tax cost in Danish Kroner.
- **Horsepower (bhp)**: The vehicle's horsepower (brake horsepower).
- **0-100 km/h (s)**: The time (in seconds) for the car to accelerate from 0 to 100 km/h.
- **Top Speed (km/h)**: The maximum speed the vehicle can achieve.
- **Towing Capacity (kg)**: The maximum weight the vehicle can tow.
- **Original Price (DKK)**: The price of the vehicle when first sold as new.
- **Number of Doors**: The total number of doors on the vehicle.
- **Rear-Wheel Drive**: A binary indicator (1 = Yes, 0 = No) for rear-wheel drive.
- **All-Wheel Drive (AWD)**: A binary indicator (1 = Yes, 0 = No) for all-wheel drive.
- **Front-Wheel Drive**: A binary indicator (1 = Yes, 0 = No) for front-wheel drive.

The first one, **Price**, is the response variable.

The **objective** of this assignment is:
1. Understand how linear algebra is used in Machine Learning, specifically for correlations and regression
2. Learn how to perform multiple linear regression, ridge regression, lasso regression and elastic net
3. Learn how to assess regression models

Please solve the tasks using this notebook as your template, i.e. insert code blocks and markdown blocks into this notebook and hand it in. Please use 42 as your random seed.


## Import data
 - Import the dataset 
 - Split the data in a training set and test set - make sure you extract the response variable
 - Remember to use the data appropriately; in the tasks below, we do not explicitly state when to use train and test - but in order to compare the models, you must use the same dataset for training and testing in all models.
 - Output: When you are done with this, you should have the following sets: `X` (the original dataset), `X_train`, `X_test`, `y_train`, `y_test`

In [10]:
# Code block for importing and creating data sets. Add more code blocks if needed.
import pandas as pd
from sklearn.model_selection import train_test_split

target = 'Price (DKK)'
data = pd.read_excel('car_prices.xlsx').drop(columns=['All-Wheel Drive (AWD)'])
X_train, X_test, y_train, y_test = train_test_split(data.drop(target, axis=1), data[target], test_size=0.2, random_state=42)

## Part 1: Linear Algebra
In this assignment, you have to solve all problems using linear algebra concepts. You are free to use SymPy or NumPy - though NumPy is **significantly** more efficient computationally than SymPy since NumPy is optimized for numerical computations with floating-point arithmetic. Since linear regression is purely numerical, NumPy is the better choice.


### Task 1: Regression



Linear regression finds the best-fitting line (or hyperplane) by solving for the **coefficient vector** $\mathbf{B}$ that minimizes the squared error:

$$
\mathbf{B} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}
$$

where:
- $\mathbf{X}$ is the **design matrix**, including a column of ones for the intercept.
- $\mathbf{y}$ is the **response variable** (target values).
- $\mathbf{B}$ contains the **regression coefficients**.

**Explanation of Each Step**
1. **Construct the matrix $X$**:
   - Each **row** represents a data point.
   - Each **column** represents a feature.
   - The **first column is all ones** to account for the **intercept**.

2. **Solve for $\mathbf{B}$ using the normal equation**:
   - Compute $X^T X$ (the Gram matrix of the features, which captures how the features co-vary).
   - Compute $X^T y$ (cross-product with the target variable).
   - Compute the **inverse of $X^T X$** and multiply by $X^T y$ to get $\mathbf{B}$.

3. **Interpret the results**:
   - The **first value** in $\mathbf{B}$ is the **intercept**.
   - The remaining values are the **coefficients for each feature**.
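
The steps above can be sketched on a small synthetic problem where the true coefficients are known. This is only an illustration: the toy data and the names `features`, `X_toy`, `y_toy` and `B_toy` are assumptions and are not part of the car dataset. The sketch solves the linear system $(X^T X) B = X^T y$ with `np.linalg.solve` instead of forming the inverse explicitly, which gives the same $\mathbf{B}$ and is usually more stable numerically.

```python
import numpy as np

# Hypothetical toy data: y = 3 + 2*x1 - 1*x2 plus small noise (not the car data).
rng = np.random.default_rng(42)
n = 200
features = rng.normal(size=(n, 2))
y_toy = 3.0 + 2.0 * features[:, 0] - 1.0 * features[:, 1] + rng.normal(scale=0.1, size=n)

# Design matrix with the column of ones placed first, so B_toy[0] is the intercept.
X_toy = np.column_stack([np.ones(n), features])

# Normal equation: solve (X^T X) B = X^T y without computing the inverse.
XtX = X_toy.T @ X_toy
Xty = X_toy.T @ y_toy
B_toy = np.linalg.solve(XtX, Xty)

# Cross-check against NumPy's least-squares solver.
B_lstsq, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

print("intercept:", B_toy[0])
print("slopes:", B_toy[1:])
print("matches lstsq:", np.allclose(B_toy, B_lstsq))
```

The recovered intercept and slopes should land close to 3, 2 and -1, since those are the values used to generate the toy data.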



In [11]:
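# Use this for Task 1. Add more code blocks if needed.
import numpy as np
# Append the column of ones for the intercept. It is added as the last column here,
# so the intercept is the last entry of B rather than the first.
X_train['Intercept'] = np.ones(X_train.shape[0])
X_test['Intercept'] = np.ones(X_test.shape[0])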

X_train_matrix = X_train.values
X_test_matrix = X_test.values

y_train_matrix = y_train.values
y_test_matrix = y_test.values

In [12]:
def get_B(X_matrix, y_matrix):
    """Solve the normal equation B = (X^T X)^{-1} X^T y."""
    Xt_matrix = np.transpose(X_matrix)
    XtX_matrix = np.dot(Xt_matrix, X_matrix)  # X^T X
    Xty_matrix = np.dot(Xt_matrix, y_matrix)  # X^T y
    B = np.dot(np.linalg.inv(XtX_matrix), Xty_matrix)
    return B

B = get_B(X_train_matrix, y_train_matrix)

B_explained = pd.DataFrame(B, index=X_train.columns, columns=['Coefficient'])
display(B_explained)


| Feature | Coefficient |
|---|---|
| Model Year | 18078.22 |
| Mileage (km) | -0.6251698 |
| Electric Range (km) | 106.8243 |
| Battery Capacity (kWh) | 38.07482 |
| Energy Consumption (Wh/km) | 111.4775 |
| Annual Road Tax (DKK) | -346.9803 |
| Horsepower (bhp) | 24.49535 |
| 0-100 km/h (s) | 6483.267 |
| Top Speed (km/h) | 145.7067 |
| Towing Capacity (kg) | 20.02742 |


### Task 2: Evaluating the Model

Once we have the regression coefficients $\mathbf{B}$, we can evaluate how well the model fits the data using two key metrics:

1. **Mean Squared Error (MSE)** – Measures the average squared difference between the predicted and actual values:
   $$
   MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
   $$
   - Lower MSE means better fit.

2. **$R^2$ (Coefficient of Determination)** – Measures how much of the variance in $y$ is explained by $X$:
   $$
   R^2 = 1 - \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}
   $$
   - $R^2$ typically lies between **0 and 1**, where **1** indicates a perfect fit and **0** means the model explains no variance; on held-out data it can also be negative.


**Explanation of Each Step**
1. **Compute Predictions**:  
   $$ \hat{y} = X B $$
   This gives the model’s predicted values.

2. **Compute MSE**:  
   - We square the residuals $(y - \hat{y})^2$ and take the mean.

3. **Compute $R^2$**:
   - **Total sum of squares** $SS_{total}$ measures the total variance in $y$.
   - **Residual sum of squares** $SS_{residual}$ measures the variance left unexplained by the model.
   - $R^2$ tells us what fraction of variance is explained.

**Interpreting the Results**
- **MSE**: Lower values indicate a better fit.
- **$R^2$ Score**:
  - **$R^2 = 1$** → Perfect fit (all points on the regression line).
  - **$R^2 = 0$** → Model is no better than predicting the mean of $y$.
  - **$R^2 < 0$** → Model performs worse than a simple average.

Implement the above steps using linear algebra so that you both create a regression model and calculate the MSE and $R^2$. Note, here you need to use `X_train`, `X_test`, `y_train` and `y_test` appropriately!
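
The three interpretation cases can be checked on a tiny hand-made example. This is a sketch only: the helper names `mse_score` and `r2_score_la` and the four-point vector are hypothetical and are not taken from the assignment data.

```python
import numpy as np

def mse_score(y_true, y_hat):
    """Mean squared error written as a dot product of the residual vector."""
    residuals = y_true - y_hat
    return (residuals @ residuals) / len(y_true)

def r2_score_la(y_true, y_hat):
    """R^2 = 1 - SS_residual / SS_total."""
    ss_residual = np.sum((y_true - y_hat) ** 2)
    ss_total = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_residual / ss_total

y_true = np.array([1.0, 2.0, 3.0, 4.0])
perfect = y_true.copy()                          # every prediction is exact
mean_only = np.full_like(y_true, y_true.mean())  # always predict the mean
reversed_pred = y_true[::-1]                     # systematically wrong predictions

for name, y_hat in [("perfect", perfect), ("mean only", mean_only), ("reversed", reversed_pred)]:
    print(name, "MSE:", mse_score(y_true, y_hat), "R^2:", r2_score_la(y_true, y_hat))
```

The reversed predictions give a negative $R^2$, which is the case where the model does worse than predicting the mean.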


In [13]:
# Use this for Task 2. Add more code blocks if needed.
# Predict on the test set with the coefficients fitted on the training set.
y_pred_matrix = np.dot(X_test_matrix, B)

mse = np.mean((y_test_matrix - y_pred_matrix)**2)

ss_total = np.sum((y_test_matrix - np.mean(y_test_matrix))**2)
ss_residual = np.sum((y_test_matrix - y_pred_matrix)**2)
r_squared = 1 - ss_residual / ss_total

print(f'Mean Squared Error: {mse}')
print(f'R^2: {r_squared}')

Mean Squared Error: 2774486709.2039394
R^2: 0.8644264442856893


## Part 2: Using Library Functions

### Task 4: Correlation and OLS
For this task you must do the following
 - Using library functions, produce the following:
   - Correlation matrix where the correlations are printed in the matrix and a heat map is overlaid
   - Ordinary least squares
   - Performance metrics: MSE, RMSE, $R^2$
   - Comment on the real world meaning of RMSE and $R^2$
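
One possible way to cover these items with pandas and scikit-learn is sketched below on synthetic data. The frame `toy` and its columns `x1`, `x2` and `target` are hypothetical stand-ins for the car data, and `LinearRegression` is just one library option for ordinary least squares, not necessarily the one the assignment expects.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the car data.
rng = np.random.default_rng(42)
toy = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
toy["target"] = 5.0 + 2.0 * toy["x1"] - 3.0 * toy["x2"] + rng.normal(scale=0.5, size=300)

# Pearson correlation matrix of all columns, including the target.
corr = toy.corr()
print(corr.round(3))

# Ordinary least squares fitted on the training split only.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["x1", "x2"]], toy["target"], test_size=0.2, random_state=42
)
ols_model = LinearRegression().fit(X_tr, y_tr)
y_hat = ols_model.predict(X_te)

# Performance metrics on the held-out split.
mse_toy = mean_squared_error(y_te, y_hat)
rmse_toy = np.sqrt(mse_toy)  # same units as the target
r2_toy = r2_score(y_te, y_hat)
print(f"MSE: {mse_toy:.4f}  RMSE: {rmse_toy:.4f}  R^2: {r2_toy:.4f}")
```

In a notebook, `corr.style.background_gradient(cmap='coolwarm', axis=None)` renders the same matrix with a colour overlay.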


In [14]:
correlation_matrix = data.corr()
correlation_target_matrix = correlation_matrix[target].sort_values(ascending=False)
display(correlation_target_matrix)

heatmap = correlation_matrix.style.background_gradient(cmap='coolwarm', axis=None)
display(heatmap)

ols = np.linalg.lstsq(X_train_matrix, y_train_matrix, rcond=None)
B_ols = ols[0]
B_ols_explained = pd.DataFrame(B_ols, index=X_train.columns, columns=['Coefficient'])
display(B_ols_explained)

y_pred_ols = np.dot(X_test_matrix, B_ols)
mse_lib = np.mean((y_test_matrix - y_pred_ols)**2)
rmse_lib = np.sqrt(mse_lib)
r_squared_lib = 1 - (np.sum((y_test_matrix - y_pred_ols)**2) / np.sum((y_test_matrix - np.mean(y_test_matrix))**2))

print(f'Mean Squared Error (Library): {mse_lib}')
print(f'Root Mean Squared Error (Library): {rmse_lib}')
print(f'R^2 (Library): {r_squared_lib}')

print("\n")

print("RMSE measures the average magnitude of the errors")
print("R2 measures the proportion of the variance in the dependent variable that is predictable from the independent variable(s)")



Price (DKK)                   1.000000
Original Price (DKK)          0.893071
Horsepower (bhp)              0.650156
Battery Capacity (kWh)        0.624558
Energy Consumption (Wh/km)    0.588180
Top Speed (km/h)              0.566843
Electric Range (km)           0.522351
Model Year                    0.405433
Towing Capacity (kg)          0.356940
Number of Doors               0.135778
Annual Road Tax (DKK)         0.073300
Rear-Wheel Drive              0.046230
Mileage (km)                 -0.207022
Front-Wheel Drive            -0.477583
0-100 km/h (s)               -0.541049
Name: Price (DKK), dtype: float64

| Feature | Price (DKK) | Model Year | Mileage (km) | Electric Range (km) | Battery Capacity (kWh) | Energy Consumption (Wh/km) | Annual Road Tax (DKK) | Horsepower (bhp) | 0-100 km/h (s) | Top Speed (km/h) | Towing Capacity (kg) | Original Price (DKK) | Number of Doors | Rear-Wheel Drive | Front-Wheel Drive |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Price (DKK) | 1.0 | 0.405433 | -0.207022 | 0.522351 | 0.624558 | 0.58818 | 0.0733 | 0.650156 | -0.541049 | 0.566843 | 0.35694 | 0.893071 | 0.135778 | 0.04623 | -0.477583 |
| Model Year | 0.405433 | 1.0 | -0.639181 | 0.44154 | 0.340616 | 0.137949 | -0.16344 | 0.161478 | -0.191305 | 0.134181 | 0.165483 | 0.13762 | 0.075447 | 0.138106 | -0.17522 |
| Mileage (km) | -0.207022 | -0.639181 | 1.0 | -0.087012 | -0.061631 | -0.013605 | 0.165516 | 0.09422 | -0.063117 | 0.140555 | -0.03414 | 0.037873 | -0.031265 | -0.036325 | -0.038853 |
| Electric Range (km) | 0.522351 | 0.44154 | -0.087012 | 1.0 | 0.730463 | 0.13116 | -0.033891 | 0.548337 | -0.497768 | 0.525905 | 0.180114 | 0.418936 | 0.216001 | 0.273725 | -0.507334 |
| Battery Capacity (kWh) | 0.624558 | 0.340616 | -0.061631 | 0.730463 | 1.0 | 0.484996 | 0.075723 | 0.603375 | -0.516159 | 0.496303 | 0.303371 | 0.586538 | 0.249358 | 0.107914 | -0.493198 |
| Energy Consumption (Wh/km) | 0.58818 | 0.137949 | -0.013605 | 0.13116 | 0.484996 | 1.0 | 0.226662 | 0.495509 | -0.385822 | 0.310819 | 0.426699 | 0.649474 | 0.26644 | -0.155324 | -0.30415 |
| Annual Road Tax (DKK) | 0.0733 | -0.16344 | 0.165516 | -0.033891 | 0.075723 | 0.226662 | 1.0 | 0.130265 | -0.105111 | 0.103639 | 0.09658 | 0.206875 | 0.02329 | -0.086018 | -0.05932 |
| Horsepower (bhp) | 0.650156 | 0.161478 | 0.09422 | 0.548337 | 0.603375 | 0.495509 | 0.130265 | 1.0 | -0.893347 | 0.861124 | 0.436839 | 0.719772 | 0.056067 | -0.150229 | -0.564218 |
| 0-100 km/h (s) | -0.541049 | -0.191305 | -0.063117 | -0.497768 | -0.516159 | -0.385822 | -0.105111 | -0.893347 | 1.0 | -0.794771 | -0.385387 | -0.600842 | 0.001795 | 0.139608 | 0.489892 |
| Top Speed (km/h) | 0.566843 | 0.134181 | 0.140555 | 0.525905 | 0.496303 | 0.310819 | 0.103639 | 0.861124 | -0.794771 | 1.0 | 0.293927 | 0.63003 | -0.072225 | 0.023334 | -0.552141 |


| Feature | Coefficient |
|---|---|
| Model Year | 18078.22 |
| Mileage (km) | -0.6251698 |
| Electric Range (km) | 106.8243 |
| Battery Capacity (kWh) | 38.07482 |
| Energy Consumption (Wh/km) | 111.4775 |
| Annual Road Tax (DKK) | -346.9803 |
| Horsepower (bhp) | 24.49535 |
| 0-100 km/h (s) | 6483.267 |
| Top Speed (km/h) | 145.7067 |
| Towing Capacity (kg) | 20.02742 |


Mean Squared Error (Library): 2774486707.5869985
Root Mean Squared Error (Library): 52673.396582971545
R^2 (Library): 0.8644264443647001


RMSE measures the average magnitude of the errors
R2 measures the proportion of the variance in the dependent variable that is predictable from the independent variable(s)

In real-world terms, the RMSE of about 52,673 DKK means that the predicted prices on the test set deviate from the listed prices by roughly 52,700 DKK in a root-mean-square sense, and $R^2 \approx 0.864$ means the model explains about 86% of the variance in the test-set prices.


### Task 5: Ridge, Lasso and Elastic Net
For Ridge, Lasso and Elastic Net to work as intended, you must build the models on scaled data: the penalty acts on coefficient magnitude, and a coefficient's magnitude depends on the units of its feature, so with unscaled data the features are penalized unequally. Feel free to use this code to scale the data:

```python
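import numpy as np
from sklearn.preprocessing import StandardScaler

# Standardize X (fit on the training set only, then apply to the test set)
scaler_X = StandardScaler()
X_train_scaled = scaler_X.fit_transform(X_train)
X_test_scaled = scaler_X.transform(X_test)

# Standardize y (np.asarray handles both a pandas Series and a NumPy array)
scaler_y = StandardScaler()
y_train_scaled = scaler_y.fit_transform(np.asarray(y_train).reshape(-1, 1)).flatten()
y_test_scaled = scaler_y.transform(np.asarray(y_test).reshape(-1, 1)).flatten()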
```
For the final task you must do the following
 - Build the following models:
   - Ridge regression (using multiple alphas)
   - Lasso regression (using multiple alphas)
   - Elastic Net (using multiple alphas)
 - Discussion and conclusion:
   - Discuss the MSE and $R^2$ of all 3 models and conclude which model has the best performance - note the MSE will be scaled!
   - Rebuild the OLS model from Task 4, but this time use the scaled data from this task - interpret the meaning of the model's coefficients
   - Use the coefficients of the best ridge and lasso model to print the 5 most important features and compare to the 5 most important features in the OLS with scaled data model. Do the models agree about which features are the most important?

Note: You may get a convergence warning; try increasing the `max_iter` parameter of the model (the default is 1000 - maybe set it to 100000)

In [15]:
# Use this for Task 5. Add more code blocks if needed.
from sklearn.base import clone
from sklearn.preprocessing import StandardScaler


scaler_X = StandardScaler()
scaler_y = StandardScaler()

X_to_scaled_train = X_train.drop(columns=['Intercept'])
X_to_scaled_test = X_test.drop(columns=['Intercept'])

# X_train_scaled = pd.DataFrame(scaler_X.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
# X_test_scaled = pd.DataFrame(scaler_X.transform(X_test), columns=X_test.columns, index=X_test.index)

X_train_scaled = pd.DataFrame(scaler_X.fit_transform(X_to_scaled_train), columns=X_to_scaled_train.columns, index=X_to_scaled_train.index)
X_test_scaled = pd.DataFrame(scaler_X.transform(X_to_scaled_test), columns=X_to_scaled_test.columns, index=X_to_scaled_test.index)
X_train_scaled['Intercept'] = np.ones(X_train.shape[0])
X_test_scaled['Intercept'] = np.ones(X_test.shape[0])

y_train_scaled = pd.DataFrame(scaler_y.fit_transform(y_train.values.reshape(-1, 1)).flatten(), index=y_train.index)
y_test_scaled = pd.DataFrame(scaler_y.transform(y_test.values.reshape(-1, 1)).flatten(), index=y_test.index)

alphas = np.logspace(-10, 10, 10)
display(alphas)

from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error, r2_score

models = [Ridge(fit_intercept=False), Lasso(fit_intercept=False), ElasticNet(fit_intercept=False)]


results = []

for base_model in models:
    for alpha in alphas:
        # Fresh, unfitted copy of the estimator for each alpha.
        model = clone(base_model).set_params(alpha=alpha)
        model.fit(X_train_scaled, y_train_scaled)
        y_pred_scaled = model.predict(X_test_scaled)
        # Map predictions back to DKK so MSE and R^2 are computed in the original units.
        y_pred = scaler_y.inverse_transform(y_pred_scaled.reshape(-1, 1)).flatten()
        mse = mean_squared_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)
        results.append({'Model': model, 'Alpha': alpha, 'MSE': mse, 'R^2': r2})

results_df = pd.DataFrame(results)
display(results_df.sort_values(by="R^2", ascending=False))



array([1.00000000e-10, 1.66810054e-08, 2.78255940e-06, 4.64158883e-04,
       7.74263683e-02, 1.29154967e+01, 2.15443469e+03, 3.59381366e+05,
       5.99484250e+07, 1.00000000e+10])

| | Model | Alpha | MSE | R^2 |
|---|---|---|---|---|
| 13 | `Lasso(alpha=0.0004641588833612782, fit_interce...` | 0.0004641589 | 2769458000.0 | 0.864672 |
| 23 | `ElasticNet(alpha=0.0004641588833612782, fit_in...` | 0.0004641589 | 2772091000.0 | 0.864544 |
| 12 | `Lasso(alpha=2.782559402207126e-06, fit_interce...` | 2.782559e-06 | 2774454000.0 | 0.864428 |
| 22 | `ElasticNet(alpha=2.782559402207126e-06, fit_in...` | 2.782559e-06 | 2774472000.0 | 0.864427 |
| 11 | `Lasso(alpha=1.6681005372000592e-08, fit_interc...` | 1.668101e-08 | 2774487000.0 | 0.864426 |
| 21 | `ElasticNet(alpha=1.6681005372000592e-08, fit_i...` | 1.668101e-08 | 2774487000.0 | 0.864426 |
| 10 | `Lasso(alpha=1e-10, fit_intercept=False)` | 1e-10 | 2774487000.0 | 0.864426 |
| 20 | `ElasticNet(alpha=1e-10, fit_intercept=False)` | 1e-10 | 2774487000.0 | 0.864426 |
| 0 | `Ridge(alpha=1e-10, fit_intercept=False)` | 1e-10 | 2774487000.0 | 0.864426 |
| 1 | `Ridge(alpha=1.6681005372000592e-08, fit_interc...` | 1.668101e-08 | 2774487000.0 | 0.864426 |


### Discussion and Conclusion

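The best model is Lasso with alpha ≈ 4.64e-4, which has the highest $R^2$ (0.864672) and the lowest MSE (about 2.769e9 DKK², since the predictions were transformed back to the original scale before scoring). The differences are very small: the best Elastic Net reaches $R^2$ = 0.864544, and the best Ridge reaches 0.864426, which is the same as unregularized OLS.

On the other hand, for alpha larger than 1.0 all the models perform poorly. A likely reason is that the data are already standardized, so a penalty of that size dominates the squared-error term and shrinks the coefficients too strongly toward zero.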
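
The poor performance for large alpha can be illustrated for Lasso on synthetic data. The sketch below assumes scikit-learn's documented Lasso objective, $\frac{1}{2n}\lVert y - Xw\rVert^2 + \alpha\lVert w\rVert_1$, for which all coefficients are zero once $\alpha \ge \max_j |x_j^T y| / n$. With standardized features and target this threshold is the largest absolute correlation between a feature and the target, so it cannot exceed 1. The data and the names `X_raw`, `y_raw` and `alpha_max` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales (not the car data).
rng = np.random.default_rng(42)
X_raw = rng.normal(size=(200, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y_raw = X_raw @ np.array([2.0, 0.3, 0.01, 0.0]) + rng.normal(size=200)

# Standardize features and target, as in the assignment.
X_std = StandardScaler().fit_transform(X_raw)
y_std = StandardScaler().fit_transform(y_raw.reshape(-1, 1)).ravel()

# Smallest alpha at which Lasso sets every coefficient to zero.
n = X_std.shape[0]
alpha_max = np.max(np.abs(X_std.T @ y_std)) / n
print("alpha_max:", alpha_max)

above = Lasso(alpha=1.01 * alpha_max, fit_intercept=False).fit(X_std, y_std)
below = Lasso(alpha=0.5 * alpha_max, fit_intercept=False).fit(X_std, y_std)
print("coefficients just above alpha_max:", above.coef_)
print("coefficients at half of alpha_max:", below.coef_)
```

Under these assumptions, any Lasso model with alpha above 1 on standardized data predicts a constant, which would explain an $R^2$ near or below zero. Ridge never sets coefficients exactly to zero, but it shrinks them in the same direction as alpha grows.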

In [16]:
X_train_scaled_matrix = X_train_scaled.values
X_test_scaled_matrix = X_test_scaled.values

y_train_scaled_matrix = y_train_scaled.values
y_test_scaled_matrix = y_test_scaled.values


scaled_B = get_B(X_train_scaled_matrix, y_train_scaled_matrix)

scaled_B_explained = pd.DataFrame(scaled_B, index=X_train_scaled.columns, columns=['Coefficient'])
display(scaled_B_explained)
scaled_B_explained['abs'] = scaled_B_explained['Coefficient'].abs()
display(scaled_B_explained.sort_values(by='abs', ascending=False).head(5))

| Feature | Coefficient |
|---|---|
| Model Year | 0.1679778 |
| Mileage (km) | -0.1032103 |
| Electric Range (km) | 0.07060633 |
| Battery Capacity (kWh) | 0.005387426 |
| Energy Consumption (Wh/km) | 0.01816163 |
| Annual Road Tax (DKK) | -0.06272769 |
| Horsepower (bhp) | 0.01798803 |
| 0-100 km/h (s) | 0.07702223 |
| Top Speed (km/h) | 0.02558098 |
| Towing Capacity (kg) | 0.04749138 |


| Feature | Coefficient | abs |
|---|---|---|
| Original Price (DKK) | 0.850039 | 0.850039 |
| Model Year | 0.167978 | 0.167978 |
| Mileage (km) | -0.10321 | 0.10321 |
| 0-100 km/h (s) | 0.077022 | 0.077022 |
| Electric Range (km) | 0.070606 | 0.070606 |


### OLS with scaled data

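The sign of each coefficient indicates whether the price increases or decreases as that feature increases, with the other features held fixed.

Because both the features and the target are standardized, each coefficient is the change in price, in standard deviations, for a one standard deviation increase in the feature. Coefficients with larger absolute values (further from 0) therefore indicate more important features.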
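
To relate a standardized coefficient back to the original units, it can be multiplied by the ratio of the target's standard deviation to the feature's standard deviation. The sketch below checks this for a single feature on synthetic data. The variable names and numbers are hypothetical and are not taken from the car dataset.

```python
import numpy as np

# Hypothetical single-feature data: y = 1000 + 25*x plus noise.
rng = np.random.default_rng(42)
n = 500
x = rng.normal(loc=50.0, scale=10.0, size=n)
y = 1000.0 + 25.0 * x + rng.normal(scale=20.0, size=n)

# Standardize both variables (population standard deviation, as StandardScaler uses).
x_std = (x - x.mean()) / x.std()
y_std = (y - y.mean()) / y.std()

# One-feature OLS slope on the standardized data (no intercept needed, both means are 0).
beta_std = (x_std @ y_std) / (x_std @ x_std)

# Convert back: change in y (original units) per one unit of x.
beta_orig = beta_std * y.std() / x.std()

# Slope fitted directly on the raw data for comparison.
slope_raw = np.polyfit(x, y, deg=1)[0]

print("standardized coefficient:", beta_std)
print("converted to original units:", beta_orig)
print("slope fitted on raw data:", slope_raw)
```

The converted coefficient matches the slope fitted on the raw data, so ranking by standardized coefficients is a statement about effect per standard deviation, not per unit.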

In [17]:
best_lasso = Lasso(alpha=0.0004641588833612782, fit_intercept=False)
best_lasso.fit(X_train_scaled, y_train_scaled)
best_lasso_coef = best_lasso.coef_.reshape(-1)

best_lasso_coef_explained = pd.DataFrame(best_lasso_coef, index=X_train_scaled.columns, columns=['Coefficient'])
best_lasso_coef_explained['abs'] = best_lasso_coef_explained['Coefficient'].abs()
display("Best Lasso Coefficients")
display(best_lasso_coef_explained.sort_values(by='abs', ascending=False).head(5))

'Best Lasso Coefficients'

| Feature | Coefficient | abs |
|---|---|---|
| Original Price (DKK) | 0.850289 | 0.850289 |
| Model Year | 0.168018 | 0.168018 |
| Mileage (km) | -0.102695 | 0.102695 |
| 0-100 km/h (s) | 0.072476 | 0.072476 |
| Electric Range (km) | 0.071143 | 0.071143 |


In [18]:
best_ridge = Ridge(alpha=1e-10, fit_intercept=False)
best_ridge.fit(X_train_scaled, y_train_scaled)
best_ridge_coef = best_ridge.coef_.reshape(-1)

best_ridge_coef_explained = pd.DataFrame(best_ridge_coef, index=X_train_scaled.columns, columns=['Coefficient'])
best_ridge_coef_explained['abs'] = best_ridge_coef_explained['Coefficient'].abs()
display("Best Ridge Coefficients")
display(best_ridge_coef_explained.sort_values(by='abs', ascending=False).head(5))

'Best Ridge Coefficients'

| Feature | Coefficient | abs |
|---|---|---|
| Original Price (DKK) | 0.850039 | 0.850039 |
| Model Year | 0.167978 | 0.167978 |
| Mileage (km) | -0.10321 | 0.10321 |
| 0-100 km/h (s) | 0.077022 | 0.077022 |
| Electric Range (km) | 0.070606 | 0.070606 |


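The scaled OLS model, the best Ridge model and the best Lasso model agree on the five most important features and on their order: Original Price, Model Year, Mileage, 0-100 km/h and Electric Range.
Ridge with alpha = 1e-10 reproduces the OLS coefficients to the displayed precision, while the Lasso coefficients differ slightly (for example 0.072476 instead of 0.077022 for 0-100 km/h). So the models pick out the same features, with only small differences in the weights.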