import pandas as pd

df = pd.read_csv('vehicles.csv')

df.columns

Index(['id', 'url', 'region', 'region_url', 'price', 'year', 'manufacturer',
       'model', 'condition', 'cylinders', 'fuel', 'odometer', 'title_status',
       'transmission', 'VIN', 'drive', 'size', 'type', 'paint_color',
       'image_url', 'description', 'county', 'state', 'lat', 'long',
       'posting_date'],
      dtype='object')

# Feature Engineering

# Training & Fitting Machine Learning Models

## 1. Linear Regression Model: Ridge Regression

### Rationale of Selection

- Ridge Regression adds an L2 penalty to the loss function. This shrinks coefficients but generally keeps all features in the model.
- Lasso (L1 penalty) tends to zero out coefficients of less important features, effectively performing feature selection. This can be useful in some scenarios, but among correlated features it tends to keep one and drop the others somewhat arbitrarily, so the selected set can be unstable and the shared signal is not spread across them.
- In real-world used-car datasets, you may have correlated features (e.g., year and odometer, or manufacturer and model). Ridge handles multicollinearity more gracefully: rather than arbitrarily zeroing out correlated features, it shrinks their coefficients toward each other and spreads the weight across them. This usually yields more stable results.
- Elastic Net (a mix of L1 and L2) is also a possibility, but if you don’t specifically need feature selection from Lasso, Ridge is a solid, simpler choice.
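
The contrast between the penalties can be sketched on synthetic data. This is a minimal illustration, not the notebook's actual pipeline: the data, the `alpha` values, and the helper `coef_spread` are assumptions chosen for the example, and plain least squares (`LinearRegression`) is included only as an unpenalized reference point.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 1000

# Part 1: ten independent features, only the first three drive the target.
X = rng.normal(size=(n, 10))
true_coef = np.array([5.0, -3.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_coef + rng.normal(scale=1.0, size=n)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks, keeps every feature
lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty: can set coefficients to exactly 0
ridge_nonzero = int(np.count_nonzero(ridge.coef_))
lasso_nonzero = int(np.count_nonzero(lasso.coef_))
print("non-zero coefficients -> ridge:", ridge_nonzero, "lasso:", lasso_nonzero)

# Part 2: two nearly identical features (a stand-in for a correlated pair).
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
X_corr = np.column_stack([x1, x2])
y_corr = 3 * x1 + 3 * x2 + rng.normal(scale=1.0, size=n)


def coef_spread(model, n_resamples=30):
    """Standard deviation of the first coefficient across bootstrap resamples."""
    coefs = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # sample rows with replacement
        coefs.append(model.fit(X_corr[idx], y_corr[idx]).coef_[0])
    return float(np.std(coefs))


ols_spread = coef_spread(LinearRegression())
ridge_spread = coef_spread(Ridge(alpha=1.0))
print("coefficient spread -> OLS:", round(ols_spread, 3), "ridge:", round(ridge_spread, 3))
```

In the second part, a smaller spread means the fitted coefficient changes less when the training rows change, which is the sense in which Ridge is described as more stable under multicollinearity.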

### Strengths
- Very fast to train on large tabular data.
- Interpretability: You can easily examine which features have higher coefficients.
- Good baseline to quickly gauge how much of the price variation a linear relationship can explain.

### Weaknesses
- Struggles with non-linear relationships unless you manually engineer polynomial or interaction terms.
- May still underfit if the real relationship between features and price is complex.
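
The first weakness can be illustrated with a target that depends on the square of a feature. The sketch below uses synthetic data (an assumption, not the vehicles dataset) and compares Ridge on the raw feature with Ridge on manually added polynomial terms via `PolynomialFeatures`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)

# Synthetic, purely quadratic relationship: y = x^2 + noise.
x = rng.uniform(-3, 3, size=(500, 1))
y = x[:, 0] ** 2 + rng.normal(scale=0.5, size=500)
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)

# Ridge on the raw feature can only fit a straight line.
linear = Ridge(alpha=1.0).fit(X_train, y_train)

# Adding x^2 as an engineered feature lets the same linear model fit the curve.
poly = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0),
).fit(X_train, y_train)

r2_linear = linear.score(X_test, y_test)  # R^2 on held-out data
r2_poly = poly.score(X_test, y_test)
print("R^2 raw feature:", round(r2_linear, 3), "| with polynomial terms:", round(r2_poly, 3))
```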

## 2. Tree-Based Model: Random Forest Regressor

### Rationale of Selection
- Random Forest is an ensemble of many decision trees, each trained on a bootstrap sample of the rows and, typically, restricted to a random subset of the columns at each split.
- It naturally captures non-linearities and feature interactions without extensive feature engineering.
- It is fairly robust to outliers in the features. Missing values usually need an explicit strategy such as imputation, depending on what the chosen implementation supports.
- Typically achieves strong performance on tabular data.
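
As a rough illustration of the interaction point, the sketch below fits a Random Forest and a Ridge model to a synthetic target that is the product of two features. The data and the hyperparameters (`n_estimators=100`, `max_features=2`) are assumptions picked for the example, not tuned values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic target driven by an interaction (x0 * x1); x2 is irrelevant.
X = rng.uniform(-2, 2, size=(2000, 3))
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=2000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree sees a bootstrap sample of rows (bootstrap=True is the default)
# and considers 2 of the 3 columns at every split.
forest = RandomForestRegressor(
    n_estimators=100, max_features=2, random_state=0
).fit(X_train, y_train)

# A linear model has no x0 * x1 term unless it is engineered by hand.
ridge = Ridge(alpha=1.0).fit(X_train, y_train)

r2_forest = forest.score(X_test, y_test)
r2_ridge = ridge.score(X_test, y_test)
print("R^2 forest:", round(r2_forest, 3), "| ridge:", round(r2_ridge, 3))
```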

### Strengths
- High predictive power and often outperforms simple linear models.
- Inherent feature importance assessment, which helps with interpretability.
- Averaging across multiple trees reduces overfitting compared to a single decision tree.
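
The last two strengths can be checked on a small synthetic example. The sketch compares a single fully grown decision tree with a forest on noisy data and then reads the forest's `feature_importances_`. The data are an assumption for illustration, and impurity-based importances should be read as a rough guide rather than a definitive ranking.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Noisy synthetic target: features 0 and 1 matter, feature 2 is pure noise.
X = rng.uniform(-3, 3, size=(1500, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single unpruned tree memorizes the training noise.
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Averaging many such trees smooths that noise out.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

r2_tree = tree.score(X_test, y_test)
r2_forest = forest.score(X_test, y_test)
print("R^2 single tree:", round(r2_tree, 3), "| forest:", round(r2_forest, 3))

# Impurity-based importances, one value per input column (they sum to 1).
importances = forest.feature_importances_
least_important = int(np.argmin(importances))
print("importances:", np.round(importances, 3))
```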

### Weaknesses
- Can be memory-intensive if a large number of trees is used.
- Predictions can be slower due to the need to average many decision trees.
- Less straightforward to interpret compared to a pure linear model (though feature importances can mitigate that).

## 3. Neural Network Model: Multilayer Perceptron (MLP)

### Rationale of Selection
- A feedforward MLP is the usual starting point among neural networks for tabular data (CNNs and LSTMs target image and sequence data, respectively).
- Can capture complex non-linear relationships if given enough layers/neurons.
- Offers flexibility with different architectures, regularization schemes (dropout, batch normalization), and activation functions.
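
A minimal sketch of such a model, assuming scikit-learn's `MLPRegressor` and synthetic data, is shown below. Note that `MLPRegressor` exposes L2 regularization through `alpha` but does not offer dropout or batch normalization; those would need a deep learning framework. The layer sizes and learning rate are illustrative guesses, not tuned values.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic non-linear target: sin(x0) + x1^2 + noise.
X = rng.uniform(-2, 2, size=(2000, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=2000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Inputs are standardized first; neural networks train poorly on unscaled features.
mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(
        hidden_layer_sizes=(64, 64),  # two hidden layers of 64 neurons each
        activation="relu",
        alpha=1e-3,                   # L2 penalty on the weights
        learning_rate_init=0.01,
        max_iter=500,
        random_state=0,
    ),
)
mlp.fit(X_train, y_train)

r2_mlp = mlp.score(X_test, y_test)  # R^2 on held-out data
print("R^2 MLP:", round(r2_mlp, 3))
```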

### Strengths
- Highly flexible universal approximator; can learn intricate patterns in the data.
- Scales well with more data, potentially outperforming simpler models given sufficient examples.
- Possible to use embedding layers for high-cardinality categorical features, reducing dimensionality.

### Weaknesses
- Tuning hyperparameters (number of layers, neurons, learning rate, etc.) can be complex and time-consuming.
- More prone to overfitting if not properly regularized.
- Less interpretable compared to linear or tree-based models.

## 4. Nearest Neighbor Model: K-Nearest Neighbors (KNN) Regressor

### Rationale of Selection
- **KNN Regressor** predicts a target value based on the average (or another aggregation) of the k nearest neighbors in feature space.
- No explicit “training” phase—KNN stores the entire dataset and performs distance-based lookups during prediction.
- Can handle complex boundaries since the decision (or prediction) is based purely on local neighbor information.
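
The prediction rule described above is short enough to write out directly. The function `knn_predict` below is a hypothetical helper written for illustration (brute-force Euclidean distances, unweighted mean); it is checked against scikit-learn's `KNeighborsRegressor` on random data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor


def knn_predict(X_train, y_train, X_query, k=3):
    """Predict each query point as the mean target of its k nearest training points."""
    # Pairwise Euclidean distances, shape (n_query, n_train).
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    # Indices of the k smallest distances for every query row.
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)


rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
y_train = rng.normal(size=200)
X_query = rng.normal(size=(10, 4))

manual = knn_predict(X_train, y_train, X_query, k=3)

# "Fitting" here only stores (and possibly indexes) the training data.
reference = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train).predict(X_query)
same_predictions = bool(np.allclose(manual, reference))
print("matches scikit-learn:", same_predictions)
```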

### Strengths
- Simple to implement and understand conceptually.
- Naturally captures non-linear relationships if the data in the local neighborhood is consistent.
- Few parameters: mainly k (number of neighbors) and distance metric (e.g., Euclidean).

### Weaknesses
- Can be **slow at prediction time** for large datasets because it must search through all or a large portion of the data.
- **Sensitive to feature scaling**—proper normalization/standardization is crucial.
- Choosing the optimal number of neighbors (k) can be non-trivial and must be tuned with techniques like cross-validation.
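
The scaling and tuning points can be sketched together. In the synthetic data below (an assumption for illustration), the informative feature lives on a 0 to 1 scale while an irrelevant feature spans 0 to 100,000, loosely mimicking an odometer-sized column. The pipeline standardizes the features and picks `k` by 5-fold cross-validation over a small, arbitrary candidate grid.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000

# Column 0 drives the target; column 1 is irrelevant but numerically huge.
X = np.column_stack([rng.uniform(0, 1, size=n), rng.uniform(0, 100_000, size=n)])
y = 10 * X[:, 0] + rng.normal(scale=0.1, size=n)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Without scaling, Euclidean distance is dominated by the large column.
unscaled = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)
r2_unscaled = unscaled.score(X_test, y_test)

# With scaling, and with k chosen by cross-validation on the training set.
k_grid = [1, 3, 5, 10, 20]
search = GridSearchCV(
    make_pipeline(StandardScaler(), KNeighborsRegressor()),
    param_grid={"kneighborsregressor__n_neighbors": k_grid},
    cv=5,
)
search.fit(X_train, y_train)
best_k = search.best_params_["kneighborsregressor__n_neighbors"]
r2_scaled = search.score(X_test, y_test)  # scored with the refit best pipeline

print("R^2 unscaled:", round(r2_unscaled, 3), "| scaled:", round(r2_scaled, 3), "| best k:", best_k)
```

Putting the scaler inside the pipeline means it is refit on each cross-validation training fold, so the validation folds do not leak into the scaling statistics.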