# 0. Meta

## 0.1. Packages

In [52]:
import pandas as pd
from datetime import datetime
import numpy as np
import joblib

from sklearn.linear_model import LinearRegression

from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import KFold, cross_val_predict

## 0.2. Functions

# 1. Data Import

In [53]:
X_test = pd.read_csv("../data/processed/X_test.csv")
y_test = pd.read_csv("../data/processed/y_test.csv")
X_train = pd.read_csv("../data/processed/X_train.csv")
y_train = pd.read_csv("../data/processed/y_train.csv")

# 2. Model Training and Hyperparameter Tuning

## 2.1. Linear Regression

Train a linear regression model and measure the combined time for fitting and prediction.

In [54]:
lr_model = LinearRegression()

# Time fitting and prediction together.
start = datetime.now()
lr_model.fit(X_train, y_train)
lr_pred = lr_model.predict(X_test)
stop = datetime.now()
delta = stop - start
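
The timing pattern above can be wrapped in a small helper so that fitting and prediction are measured separately. This is only a sketch and not part of the original notebook: `timed` is a hypothetical name, and it uses `time.perf_counter`, a high-resolution standard-library clock intended for measuring short durations, in place of `datetime.now()`.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the notebook this could be lr_model.fit or lr_model.predict.
total, seconds = timed(sum, range(1_000_000))
```

With the notebook's objects, a call might look like `_, fit_seconds = timed(lr_model.fit, X_train, y_train)` followed by `lr_pred, predict_seconds = timed(lr_model.predict, X_test)`.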

The target variable y_train was log-transformed during data preprocessing to improve normality and linearity, so the model's predictions are on the log scale. They must be back-transformed with np.expm1 before the model is assessed with metrics such as RMSE and R-squared on the original price scale. y_test is passed through log1p and expm1 as well, which returns it to its original scale.

In [55]:
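# y_test is on the original price scale; log1p here and expm1 below
# return it unchanged up to floating-point rounding.
y_test_log = np.log1p(y_test.to_numpy())

lr_pred_df = pd.DataFrame({'pred': lr_pred.flatten(), 'y_test': y_test_log.flatten()})

# Back-transform both columns from the log scale to the price scale.
lr_pred_df['pred'] = np.expm1(lr_pred_df['pred'])
lr_pred_df['y_test'] = np.expm1(lr_pred_df['y_test'])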
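
As a small worked example of the back-transformation, the sketch below shows that `np.log1p` and `np.expm1` are inverses, and that `np.expm1` overflows to infinity once its argument exceeds roughly 709, the point where the result no longer fits in float64. The price values are taken from the test rows shown in this notebook; the remaining inputs are illustrative.

```python
import numpy as np

prices = np.array([8500.0, 34950.0, 99800.0])

# log1p and expm1 are inverses, so a round trip recovers the prices.
log_prices = np.log1p(prices)
round_trip = np.expm1(log_prices)

# float64 tops out near 1.8e308, so expm1 overflows for arguments above ~709.78.
with np.errstate(over="ignore"):  # silence the overflow RuntimeWarning for the demo
    back = np.expm1(np.array([11.5, 709.0, 710.0, 3996071000.0]))

# True marks the entries that overflowed to infinity.
overflowed = np.isinf(back)
```

A log-scale prediction of about 11.5 corresponds to a price near 100,000, so any prediction in the hundreds or beyond is already far outside the plausible range.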



The back-transformation produced NaN or infinite values in some rows. These rows are dropped, and their indices are kept so that the observations responsible can be inspected. The number and percentage of dropped rows are also computed.

In [56]:
index_before = lr_pred_df.index
rows_before = lr_pred_df.shape[0]

lr_pred_df = lr_pred_df.replace([np.inf, -np.inf], np.nan).dropna()

index_after = lr_pred_df.index
rows_after = lr_pred_df.shape[0]

removed_indices = index_before.difference(index_after)
removed_rows = rows_before - rows_after
percent_removed = (removed_rows / rows_before) * 100

removed_rows_X_test = X_test.iloc[removed_indices]
removed_rows_y_test = y_test.iloc[removed_indices]
removed_rows_df = pd.concat([removed_rows_X_test, removed_rows_y_test], axis=1)
removed_rows_df['pre_backtrans_pred'] = lr_pred.flatten()[removed_indices]
filtered_removed_rows_df = removed_rows_df.loc[:, (removed_rows_df != 0).any()]

print('Rows before removing NaNs and Infs:', rows_before)
print('Rows after removing NaNs and Infs:', rows_after)
print('Number of rows removed:', removed_rows)
print('Percentage of rows removed:', percent_removed, '%')

print("Problematic rows in X_test and y_test:")
filtered_removed_rows_df

Rows before removing NaNs and Infs: 9215
Rows after removing NaNs and Infs: 9193
Number of rows removed: 22
Percentage of rows removed: 0.23874118285404233 %
Problematic rows in X_test and y_test:


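| | mileage | offerType | hp | year | make_Audi | make_BMW | make_Bentley | make_Corvette | make_DS | make_Fiat | ... | model_S7 | model_SLC 250 | model_T5 Shuttle | fuel_Diesel | fuel_Electric | fuel_Gasoline | gear_Automatic | gear_Manual | price | pre_backtrans_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 77 | 0.139248 | 0.5 | 0.538058 | 1.0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 83870 | 3996071000.0 |
| 101 | 0.3 | 0.0 | 0.732819 | 0.6 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 34950 | 5998644000.0 |
| 824 | 0.4068 | 0.0 | 0.33938 | 0.5 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 8500 | 7560187000.0 |
| 1032 | 0.283372 | 0.0 | 0.67883 | 0.7 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 35000 | 337960100.0 |
| 2314 | 0.403326 | 0.0 | 0.810958 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 99800 | 39486260000.0 |
| 2468 | 0.126833 | 0.0 | 0.77053 | 0.8 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 86885 | 337960100.0 |
| 2550 | 0.554074 | 0.0 | 0.493344 | 0.4 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 9985 | 5998644000.0 |
| 3623 | 0.398366 | 0.0 | 0.744444 | 0.3 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 39925 | 7201592000.0 |
| 3684 | 0.208001 | 0.5 | 0.538058 | 1.0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 45555 | 6188076000.0 |
| 4405 | 0.020801 | 0.75 | 0.409779 | 1.0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 33994 | 7184330000.0 |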
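
The row-removal step can be illustrated on a toy frame. The sketch below uses made-up values and is not the notebook's data; it shows how replacing infinities with NaN and calling `dropna` removes the affected rows, and how `Index.difference` recovers which rows were dropped.

```python
import numpy as np
import pandas as pd

# Toy frame with one infinite and one missing prediction (illustrative values).
toy = pd.DataFrame({
    "pred": [12000.0, np.inf, 8300.0, np.nan],
    "y_test": [11500.0, 83870.0, 8500.0, 34950.0],
})

# Treat +/-inf like missing values, then drop every row that contains a NaN.
clean = toy.replace([np.inf, -np.inf], np.nan).dropna()

# Labels present before but absent afterwards identify the dropped rows.
removed = toy.index.difference(clean.index)
share_removed = len(removed) / len(toy) * 100
```

The toy frame has a default RangeIndex, so labels and positions coincide. The same holds for `lr_pred_df` and `X_test` in this notebook, which is why `X_test.iloc[removed_indices]` selects the intended rows; with a non-default index, `.loc` would be the safer lookup.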


The linear regression model produces extremely large log-scale predictions for the problematic rows, on the order of 1e8 to 1e10 in the rows shown. Because float64 overflows once the argument of the exponential exceeds roughly 709, np.expm1 returns infinity for these rows. The full linear regression model therefore appears to be unsuitable. The evaluation metrics are calculated after removing the problematic rows, so they can only be interpreted to a limited extent: not all predictions are taken into account.

In [58]:
lr_r2 = r2_score(lr_pred_df['y_test'], lr_pred_df['pred'])
# Note: n is taken from the full X_test, although the metrics use only the rows kept in lr_pred_df.
lr_r2_adj = 1 - (1 - lr_r2) * ((len(X_test) - 1) / (len(X_test) - len(X_test.columns) - 1))
lr_rmse = np.sqrt(mean_squared_error(lr_pred_df['y_test'], lr_pred_df['pred']))
lr_seconds = delta.total_seconds()

lr_evaluation = pd.DataFrame({
    'model': ['lr'],
    'r2': [lr_r2],
    'r2_adj': [lr_r2_adj],
    'rmse': [lr_rmse],
    'seconds': [lr_seconds]
})

lr_evaluation

| | model | r2 | r2_adj | rmse | seconds |
|---|---|---|---|---|---|
| 0 | lr | 0.928132 | 0.920122 | 4906.257588 | 3.950021 |
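
The adjusted R-squared in the cell above follows the usual formula 1 - (1 - R²)(n - 1)/(n - p - 1), with n observations and p features. The sketch below wraps it in a helper; `adjusted_r2` is a hypothetical name and not part of the original notebook.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / dof


# Small check by hand: 1 - 0.5 * (10 / 5) = 0.0
example = adjusted_r2(0.5, n_samples=11, n_features=5)
```

Because the metrics are computed on the filtered frame, a call such as `adjusted_r2(lr_r2, len(lr_pred_df), X_test.shape[1])` may be the more consistent choice of n. The effect would be small here, since only 22 of 9215 rows were dropped.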


## 2.2. Regularized Linear Regression

### 2.2.1. Lasso Regression

### 2.2.2. Ridge Regression

### 2.2.3. Elastic Net Regression

## 2.3. Gaussian Process Regression

## 2.4. Bayesian Linear Regression

## 2.5. Robust Regression

### 2.5.1. Huber Regression

### 2.5.2. Quantile Regression

### 2.5.3. RANSAC Regression

### 2.5.4. Theil-Sen Regression

## 2.6. K-Nearest Neighbors Regression

## 2.7. Artificial Neural Networks

### 2.7.1. Multi-Layer Perceptron Regressor

## 2.8. Support Vector Regression

## 2.9. Decision Tree Regression

## 2.10. Ensemble

### 2.10.1. AdaBoost Regressor

### 2.10.2. Bagging Regressor

### 2.10.3. Extra Tree Regressor

### 2.10.4. Gradient Boosting Regressor

### 2.10.5. XGBoost Regressor

### 2.10.6. LightGBM Regressor

### 2.10.7. Random Forest Regressor

### 2.10.8. Extra Trees Regressor

### 2.10.9. Stacking Regressor

### 2.10.10. Voting Regressor

### 2.10.11. Histogram-based Gradient Boosting Regressor

## 2.11. Dimensionality-Reduced Regression