<a href="https://colab.research.google.com/github/Rwolste/DS-3001-Assignments/blob/main/Linear_Model_Lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# === 1. Load the dataset ===
df = pd.read_csv("cars_hw.csv")

# === 2. Drop unnecessary index column ===
df.drop(columns=["Unnamed: 0"], inplace=True)

# === 3. One-hot encode categorical variables ===
categorical_cols = ['Make', 'Color', 'Body_Type', 'No_of_Owners',
                    'Fuel_Type', 'Transmission', 'Transmission_Type']
cars_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# === 4. Define features and target ===
X = cars_encoded.drop(columns=["Price"])
y = cars_encoded["Price"]

# === 5. Train-test split ===
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

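# === 6. Create transformation pipeline for selected features ===
# Note: log1p is applied to every column routed here (Mileage_Run and Make_Year),
# not to mileage alone, despite the step name.
numeric_transformer = Pipeline(steps=[
    ('log_mileage', FunctionTransformer(np.log1p)),               # log(1 + x); a named function keeps the pipeline picklable
    ('poly', PolynomialFeatures(degree=2, include_bias=False))    # adds squares and the pairwise interaction
])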

# === 7. ColumnTransformer to apply only on selected columns ===
preprocessor = ColumnTransformer(transformers=[
    ('numeric', numeric_transformer, ['Mileage_Run', 'Make_Year'])
], remainder='passthrough')  # keep all other columns unchanged

# === 8. Full pipeline: preprocess + linear regression ===
complex_model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

# === 9. Fit model ===
complex_model_pipeline.fit(X_train, y_train)

# === 10. Predict on train and test sets ===
y_train_pred = complex_model_pipeline.predict(X_train)
y_test_pred = complex_model_pipeline.predict(X_test)

# === 11. Evaluate performance ===
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)

# === 12. Print results ===
print("Model Performance:")
print(f"Train RMSE: ₹{train_rmse:,.0f}")
print(f"Test RMSE: ₹{test_rmse:,.0f}")
print(f"Train R²: {train_r2:.3f}")
print(f"Test R²: {test_r2:.3f}")


Model Performance:
Train RMSE: ₹140,383
Test RMSE: ₹142,266
Train R²: 0.859
Test R²: 0.824


The complex model is the better of the two: compared with the simple linear model, it has a slightly lower RMSE and a higher R². On the test set it reaches an RMSE of ₹142,266 and an R² of 0.824, close to its training values of ₹140,383 and 0.859, so the extra terms do not appear to cause meaningful overfitting. We saw some non-linearity in the mileage variable, and the gain comes mainly from handling that non-linear effect and the interactions. The improvement is incremental, though: complexity can help, but it should be added only when patterns in the data justify it.

To get here, we cleaned and explored the data, one-hot encoded the categorical variables, and split the data into training and test sets. A simple linear model performed well, and a complex model with log and interaction terms did slightly better. The takeaway is that careful feature engineering improves performance, and that added complexity, used sparingly, can help without causing overfitting.
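The overfitting claim can be checked directly from the metrics printed above. The snippet below is a minimal sketch: `generalization_gap` is a hypothetical helper introduced here for illustration, not part of the original notebook, and the numbers are copied by hand from the reported output rather than recomputed from the data.

```python
def generalization_gap(train_rmse, test_rmse, train_r2, test_r2):
    """Summarize how much worse a model does on test data than on training data.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    return {
        "rmse_increase": test_rmse - train_rmse,
        "rmse_increase_pct": 100 * (test_rmse - train_rmse) / train_rmse,
        "r2_drop": train_r2 - test_r2,
    }


# Metrics copied by hand from the model output reported above.
gap = generalization_gap(
    train_rmse=140_383,
    test_rmse=142_266,
    train_r2=0.859,
    test_r2=0.824,
)

print(f"Test RMSE is higher by ₹{gap['rmse_increase']:,.0f} "
      f"({gap['rmse_increase_pct']:.1f}%)")
print(f"R² drops by {gap['r2_drop']:.3f} from train to test")
```

With the reported values, the test RMSE is ₹1,883 (about 1.3%) higher than the training RMSE and R² drops by 0.035. A gap this small is consistent with the conclusion that the added terms did not cause notable overfitting, although a single train-test split is only one piece of evidence.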