<small><font color=gray>Notebook author: <a href="https://www.linkedin.com/in/olegmelnikov/" target="_blank">Oleg Melnikov</a>, <a href="https://www.hse.ru/en/staff/sara/" target="_blank">Saraa Ali</a>  ©2025 onwards</font></small><hr style="margin:0;background-color:silver">

**[<font size=6>🚗Auto</font>](https://www.kaggle.com/t/9225c9c3931741ad9e384d5ba0180cc3)**. [**Instructions**](https://colab.research.google.com/drive/1owkYjuRGkx050LQnM3b3yTzd0Dr2XbeV) for running Colabs.

<small>**(Optional) CONSENT.** <mark>[ X ]</mark> We consent to sharing our Colab (after the assignment ends) with other students/instructors for educational purposes. We understand that sharing is optional and this decision will not affect our grade in any way. <font color=gray><i>(If ok with sharing your Colab for educational purposes, leave "X" in the check box.)</i></font></small>

In [None]:
%%time
%reset -f
from IPython.core.interactiveshell import InteractiveShell as IS; IS.ast_node_interactivity = "all"
import pandas as pd, time, numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_openml
from sklearn.multioutput import MultiOutputRegressor
from sklearn.linear_model import Ridge
ToCSV = lambda df, fname: df.round(2).to_csv(f'{fname}.csv', index_label='id') # rounds values to 2 decimals

class Timer():
  def __init__(self, lim:'RunTimeLimit'=60): self.t0, self.lim, _ = time.time(), lim, print('timer started')
  def ShowTime(self):
    msg = f'Runtime is {time.time()-self.t0:.0f} sec'
    print(f'\033[91m\033[1m' + msg + f' > {self.lim} sec limit!!!\033[0m' if (time.time()-self.t0-1) > self.lim else msg)

np.set_printoptions(linewidth=10000, precision=4, edgeitems=20, suppress=True)
pd.set_option('display.max_rows', 100, 'display.max_columns', 100, 'display.max_colwidth', 100, 'display.precision', 2, 'display.max_rows', 4)

db = fetch_openml('BNG(auto_price)')   # load databunch (dictionary)
tX = pd.DataFrame(db['data'], columns=db['feature_names'])
tX.symboling = tX.symboling.astype('float')
tX['price'] = db['target']
YCols = ['city-mpg','highway-mpg','price']  # 3 targets
tY = tX[YCols]
tX.drop(YCols, axis=1, inplace=True)
# tY = pd.Series(db['target'], name='price')
tX, vX, tY, DO_NOT_USE = train_test_split(tX, tY, train_size=0.7, random_state=0, shuffle=True)
# ToCSV(DO_NOT_USE, 'testY')   # Students cannot use these test values
del DO_NOT_USE
tX
tY
tmr = Timer() # runtime limit (in seconds). Add all of your code after the timer

timer started
CPU times: user 6.18 s, sys: 1 s, total: 7.19 s
Wall time: 24.7 s


In [None]:
tmr = Timer()

timer started


<hr color=red>

<font size=5>⏳</font> <strong><font color=orange size=5>Your Code, Documentation, Ideas and Timer - All Start Here...</font></strong>

**Student's Section** (between ⏳ symbols): add your code and documentation here.

## **Task 1. Preprocessing Pipeline**

Explain elements of your preprocessing pipeline i.e. feature engineering, subsampling, clustering, dimensionality reduction, etc.
1. Why did you choose these elements? (Something in EDA, prior experience,...? Btw, EDA is not required)
1. How do you evaluate the effectiveness of these elements?
1. What else have you tried that worked or didn't?

**Student's answer:**

## **Task 2. Modeling Approach**
Explain your modeling approach, i.e. ideas you tried and why you thought they would be helpful.

1. How did these decisions guide you in modeling?
1. How do you evaluate the effectiveness of these elements?
1. What else have you tried that worked or didn't?

**Student's answer:**

Below is an **improved baseline** that adds feature engineering to a Ridge model and is intended to outperform the simple baseline.

### Improvements implemented:
1. **Log transformation**: Apply log(1+|x|) to numerical features to reduce skew. Taking |x| discards the sign of negative values.
2. **Polynomial features**: Generate degree-2 polynomial features, including pairwise interactions
3. **Better encoding**: One-hot encode categorical features, dropping one of the two columns for binary features (`drop='if_binary'`) to reduce multicollinearity
4. **Stronger regularization**: Use Ridge with `alpha=10.0` to handle the expanded feature space
5. **Robust imputation**: Impute missing values before the transformations (median for numerical features, a constant `'missing'` category for categorical features)
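
The log step in item 1 can be illustrated on a toy array. This is a minimal sketch, not part of the pipeline: `abs_log` mirrors the transform used in the next cell, while `signed_log` is a hypothetical alternative that may be worth testing if a feature such as `symboling` takes negative values (an assumption about this dataset).

```python
import numpy as np

def abs_log(x):
    """log(1 + |x|): the transform used in the pipeline; the sign of x is discarded."""
    return np.log1p(np.abs(x))

def signed_log(x):
    """Hypothetical alternative: sign(x) * log(1 + |x|), which keeps the sign."""
    return np.sign(x) * np.log1p(np.abs(x))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(abs_log(x))     # symmetric: -3 and 3 map to the same value
print(signed_log(x))  # antisymmetric: -3 and 3 remain distinguishable
```

If negative and positive values of a feature carry different meaning, the symmetric version merges them, so the signed variant keeps more information for a linear model.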

In [None]:
# IMPROVED BASELINE WITH FEATURE ENGINEERING
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Identify categorical and numerical columns
categorical_cols = tX.select_dtypes(include=['object', 'category']).columns.tolist()
numerical_cols = tX.select_dtypes(include=['int64', 'float64']).columns.tolist()

print(f"Categorical columns: {categorical_cols}")
print(f"Numerical columns: {len(numerical_cols)} columns")

# Enhanced numerical transformer with log transforms and polynomial features
def safe_log_transform(X):
    """Apply log(1+|x|); zeros map to 0 and the sign of negative values is discarded"""
    return np.log1p(np.abs(X))

numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('log_transform', FunctionTransformer(safe_log_transform, validate=False)),
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2, include_bias=False, interaction_only=False))
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False, drop='if_binary'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Create a full pipeline with Ridge regression (higher alpha for regularization with more features)
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor

model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', MultiOutputRegressor(Ridge(alpha=10.0, random_state=0)))
])

# Fit the model
print("\nTraining improved model...")
model_pipeline.fit(tX, tY)

# Evaluate on training set
train_score = model_pipeline.score(tX, tY)
print(f'In-sample R^2 = {train_score:.4f}')

# Generate predictions for validation set
print("\nGenerating predictions...")
pY_improved = pd.DataFrame(model_pipeline.predict(vX), index=vX.index, columns=YCols)
ToCSV(pY_improved, 'Auto_improved_baseline')
print("Predictions saved to 'Auto_improved_baseline.csv'")

# Show sample predictions
print("\nSample predictions:")
pY_improved.head()


In-sample R^2 = 0.3245


### Advanced Version with PCA

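This version adds PCA after the polynomial expansion of the numerical features and keeps the smallest number of principal components that together explain 95% of their variance. Categorical features bypass PCA, and Ridge uses `alpha=5.0` because fewer features remain.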
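
How a fractional `n_components` picks the number of components can be seen on synthetic data. The sketch below is illustrative only: `X_demo`, `latent`, and `mixing` are made-up stand-ins, not the competition features.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 3 latent factors mixed into 10 observed columns, plus small noise
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(500, 10))

# A float in (0, 1) asks PCA for the fewest components whose
# cumulative explained variance exceeds that fraction
pca_demo = PCA(n_components=0.95, random_state=0).fit(X_demo)
cum_var = np.cumsum(pca_demo.explained_variance_ratio_)
print("components kept:", pca_demo.n_components_)
print("cumulative explained variance:", cum_var.round(3))
```

Inspecting `n_components_` after fitting on the real features shows how much the 95% threshold actually shrinks the polynomial feature set.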


In [None]:
# ADVANCED BASELINE WITH PCA FOR DIMENSIONALITY REDUCTION
from sklearn.decomposition import PCA

# Enhanced numerical transformer with PCA
numerical_transformer_pca = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('log_transform', FunctionTransformer(safe_log_transform, validate=False)),
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)),
    ('pca', PCA(n_components=0.95, random_state=0))  # Keep 95% of variance
])

# Combine preprocessing with PCA
preprocessor_pca = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer_pca, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Create pipeline with PCA
model_pipeline_pca = Pipeline(steps=[
    ('preprocessor', preprocessor_pca),
    ('regressor', MultiOutputRegressor(Ridge(alpha=5.0, random_state=0)))
])

# Fit the model
print("\nTraining advanced model with PCA...")
model_pipeline_pca.fit(tX, tY)

# Evaluate on training set
train_score_pca = model_pipeline_pca.score(tX, tY)
print(f'In-sample R^2 (with PCA) = {train_score_pca:.4f}')

# Generate predictions for validation set
print("\nGenerating predictions...")
pY_advanced = pd.DataFrame(model_pipeline_pca.predict(vX), index=vX.index, columns=YCols)
ToCSV(pY_advanced, 'Auto_advanced_baseline')
print("Predictions saved to 'Auto_advanced_baseline.csv'")

# Show sample predictions
print("\nSample predictions:")
pY_advanced.head()


### Trying Other Allowed Models

The competition allows linear models, SVC/SVM, and nearest neighbors. Let's try a support vector regressor (SVR) with an RBF kernel.
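
Kernel SVR fit time grows faster than quadratically with the number of rows, so fitting on the full training set may exceed the runtime limit. One option is to fit on a random subsample and predict on all rows. The sketch below uses synthetic data; `X_demo`, `Y_demo`, and the subsample size `n_sub` are hypothetical and would need tuning against the timer.

```python
import numpy as np
import pandas as pd
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Synthetic stand-in for the training features and two targets
X_demo = pd.DataFrame(rng.normal(size=(5000, 4)), columns=list("abcd"))
Y_demo = pd.DataFrame({"y1": X_demo.a + X_demo.b ** 2, "y2": X_demo.c - X_demo.d})

n_sub = 500  # hypothetical subsample size; tune against the runtime limit
X_sub = X_demo.sample(n=n_sub, random_state=0)  # random rows, original index kept
Y_sub = Y_demo.loc[X_sub.index]                 # matching target rows

svr_demo = MultiOutputRegressor(SVR(kernel="rbf", C=10.0, epsilon=0.1))
svr_demo.fit(X_sub, Y_sub)
pred = svr_demo.predict(X_demo)  # predictions for every row, not just the subsample
print(pred.shape)
```

Selecting the target rows with `Y_demo.loc[X_sub.index]` keeps features and targets aligned after sampling.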


In [None]:
# TRY SVM REGRESSOR (SVR)
from sklearn.svm import SVR

# Use simpler preprocessing for SVR (fit time grows with feature count and, much faster, with row count)
numerical_transformer_svr = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2, include_bias=False, interaction_only=True)),  # Only interactions
    ('pca', PCA(n_components=50, random_state=0))  # Reduce to 50 components for speed
])

preprocessor_svr = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer_svr, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Create pipeline with SVR (RBF kernel)
model_pipeline_svr = Pipeline(steps=[
    ('preprocessor', preprocessor_svr),
    ('regressor', MultiOutputRegressor(SVR(kernel='rbf', C=10.0, epsilon=0.1)))
])

# Fit the model
print("\nTraining SVR model...")
model_pipeline_svr.fit(tX, tY)

# Evaluate on training set
train_score_svr = model_pipeline_svr.score(tX, tY)
print(f'In-sample R^2 (SVR) = {train_score_svr:.4f}')

# Generate predictions for validation set
print("\nGenerating predictions...")
pY_svr = pd.DataFrame(model_pipeline_svr.predict(vX), index=vX.index, columns=YCols)
ToCSV(pY_svr, 'Auto_svr_baseline')
print("Predictions saved to 'Auto_svr_baseline.csv'")

print("\nSample predictions:")
pY_svr.head()


### Model Comparison Summary

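Compare the in-sample R² of all models. In-sample scores favor models that overfit, so treat this ranking as a rough guide rather than the final choice for submission.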
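
Because the targets for `vX` are withheld, an out-of-sample estimate has to come from a hold-out split of the training data. The sketch below shows the idea on synthetic data with two illustrative candidates; the data, the `candidates` dictionary, and the 25% hold-out share are assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in: a linear signal plus noise
X_demo = rng.normal(size=(400, 5))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=400)

# Hold out 25% of the rows; models never see them during fitting
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

candidates = {
    "ridge": Ridge(alpha=1.0),
    "knn_k1": KNeighborsRegressor(n_neighbors=1),  # memorizes the fit rows
}
results = {}
for name, est in candidates.items():
    est.fit(X_fit, y_fit)
    results[name] = (est.score(X_fit, y_fit), est.score(X_hold, y_hold))
    print(f"{name}: in-sample R2 = {results[name][0]:.3f}, hold-out R2 = {results[name][1]:.3f}")
```

The 1-nearest-neighbor model scores perfectly in-sample because each fit row is its own neighbor, yet it does worse than Ridge on the held-out rows, which is why in-sample R² alone can pick the wrong model.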


In [None]:
# COMPARE ALL MODELS
print("="*60)
print("MODEL COMPARISON SUMMARY")
print("="*60)
print(f"Improved Baseline (Ridge + PolyFeatures):     R² = {train_score:.4f}")
print(f"Advanced Baseline (Ridge + PCA):              R² = {train_score_pca:.4f}")
print(f"SVR Model (RBF Kernel):                       R² = {train_score_svr:.4f}")
print("="*60)

# Determine best model
scores = {
    'Improved': train_score,
    'Advanced (PCA)': train_score_pca,
    'SVR': train_score_svr
}
best_model = max(scores, key=scores.get)
print(f"\n✓ Best in-sample performance: {best_model} (R² = {scores[best_model]:.4f})")
print("\nSubmission files generated:")
print("  - Auto_improved_baseline.csv")
print("  - Auto_advanced_baseline.csv")
print("  - Auto_svr_baseline.csv")
print("\nRecommendation: Try all three on Kaggle and see which performs best on LB!")


# **References:**

1. Remember to cite your sources here as well! At the least, your textbook should be cited. Google Scholar allows you to effortlessly copy/paste an APA citation format for books and publications. Also cite StackOverflow, package documentation, and other meaningful internet resources to help your peers learn from these (and to avoid plagiarism claims).

<font color=green><h4><b>$\epsilon$. LLM Documentation if used</b></h4></font>

<font color=red><b>Your answer here.</b></font>

<font size=5>⌛</font> <strong><font color=orange size=5>Do not exceed competition's runtime limit!</font></strong>

<hr color=red>


In [None]:
tmr.ShowTime()    # measure Colab's runtime. Do not remove. Keep as the last cell in your notebook.

Runtime is 28 sec


## 💡**Starter Ideas**

1. Tune model hyperparameters and try different allowed models
1. Try linear and non-linear feature normalization: shift/scale, log, divide features by features (investigate the scatterplot matrix)
1. Try higher order feature interactions and polynomial features on a small subsample. Then identify key features or select key principal components. The final model can be trained on a larger or even full training sample. You can use [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) to reduce the feature set
1. Do a thorough EDA: look for feature augmentations that make the relationship between the features and each target closer to linear.
1. Evaluate predictions and focus on poorly predicted "groups":
  1. Strongest errors. E.g. observations where the prediction is far from the true value
1. Do scatter plots show piecewise linear shape? Can a separate linear model be used on each support, or can the pattern be linearized via transformations?
1. Try modeling each output separately from inputs or from another modeled output
1. Try stepwise selection and regularization and remove "unimportant" features from final model
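
The idea of modeling one output from another modeled output can be sketched with scikit-learn's `RegressorChain`, which feeds earlier targets to later models as extra features. The data and target ordering below are synthetic assumptions; whether chaining helps on the competition targets would need to be checked on a hold-out split.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.multioutput import RegressorChain

rng = np.random.default_rng(0)
# Synthetic stand-in: the second target depends on the first one
X_demo = rng.normal(size=(300, 3))
y_first = X_demo[:, 0] + 0.1 * rng.normal(size=300)
y_second = 2.0 * y_first + X_demo[:, 1] + 0.1 * rng.normal(size=300)
Y_demo = np.column_stack([y_first, y_second])

# order=[0, 1]: fit target 0 from X, then target 1 from X plus target 0.
# With the default cv=None, the true target 0 is used as the extra feature
# during fit, and its prediction is used at predict time.
chain = RegressorChain(Ridge(alpha=1.0), order=[0, 1]).fit(X_demo, Y_demo)
pred = chain.predict(X_demo)

print(pred.shape)
print("features seen by each model:", [est.coef_.shape[-1] for est in chain.estimators_])
```

Changing `order` controls which target is predicted first, so it is one more hyperparameter to compare alongside fully separate per-target models.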