# 🚗 Car Price Prediction - Improved Regression Model
This notebook implements a **better regression model** for predicting used car prices.
### Enhancements:
- Feature Engineering (Drop `name`, derive `Age` from `year`)
- One-Hot Encoding for categorical variables (`company`, `fuel_type`)
- Feature Scaling for numeric values (`kms_driven`, `Age`)
- Model Comparison (Linear Regression, Random Forest, Gradient Boosting)
- Hyperparameter Tuning using GridSearchCV

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


In [9]:
# Load Dataset
df = pd.read_csv('cleaned_data.csv')
df.drop(columns=['name'], inplace=True)  # Drop unnecessary column

# Feature Engineering
base_year = df['year'].max()
df['Age'] = base_year - df['year'] + 1  # Convert year to age

# Selecting Features & Target
X = df[['company', 'Age', 'fuel_type', 'kms_driven']]
y = df['Price']


In [10]:
# Encoding & Scaling
categorical_features = ['company', 'fuel_type']
numeric_features = ['Age', 'kms_driven']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

# Splitting Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [11]:
# Model Training & Evaluation
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
}

results = {}

for name, model in models.items():
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', model)
    ])
    pipeline.fit(X_train, y_train)
    
    y_pred = pipeline.predict(X_test)
    
    results[name] = {
        'MAE': mean_absolute_error(y_test, y_pred),
        'RMSE': np.sqrt(mean_squared_error(y_test, y_pred)),
        'R2 Score': r2_score(y_test, y_pred)
    }

# Convert results to DataFrame
results_df = pd.DataFrame(results).T
display(results_df)

| Model | MAE | RMSE | R2 Score |
|---|---|---|---|
| Linear Regression | 155791.286276 | 269641.356397 | 0.563964 |
| Random Forest | 133825.071509 | 275758.652986 | 0.543955 |
| Gradient Boosting | 156765.411312 | 279977.536982 | 0.529894 |

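In the results above, Random Forest has the lowest MAE but not the lowest RMSE. The sketch below computes the three metrics directly from their definitions on made-up prices to show how that can happen: RMSE squares the errors, so a single large miss costs more than several moderate ones. The helper `regression_metrics` and both prediction vectors are hypothetical and exist only for this illustration.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R2 computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    ss_res = np.sum(errors ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return {'MAE': mae, 'RMSE': rmse, 'R2 Score': 1.0 - ss_res / ss_tot}

# Made-up prices, chosen only to keep the arithmetic easy to follow
y_true = np.array([200000, 350000, 500000, 150000])

# Model A: every prediction is off by the same moderate amount
pred_a = y_true + np.array([60000, -60000, 60000, -60000])
# Model B: three small errors and one large miss
pred_b = y_true + np.array([10000, -10000, 10000, 150000])

metrics_a = regression_metrics(y_true, pred_a)
metrics_b = regression_metrics(y_true, pred_b)
print(metrics_a)
print(metrics_b)
```

Model B wins on MAE while Model A wins on RMSE and R2, which mirrors the split between Random Forest and Linear Regression in the table.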

In [12]:
# Hyperparameter Tuning for Random Forest
param_grid = {
    'regressor__n_estimators': [100, 200, 300],
    'regressor__max_depth': [10, 20, None]
}

rf_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(random_state=42))
])

grid_search = GridSearchCV(rf_pipeline, param_grid, cv=3, scoring='r2', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best Parameters
print("Best Parameters:", grid_search.best_params_)

# Final Model Evaluation
y_pred_best = grid_search.best_estimator_.predict(X_test)
final_r2 = r2_score(y_test, y_pred_best)
print(f"Final R² Score: {final_r2:.4f}")

Best Parameters: {'regressor__max_depth': 20, 'regressor__n_estimators': 100}
Final R² Score: 0.5449

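The tuned score differs from the untuned Random Forest only in the third decimal place, so a single train/test split may not be enough to rank the models. One possible check, sketched here on synthetic data from `make_regression` (an assumption standing in for the car dataset), is to score the baseline and the grid search on the same cross-validation folds. The names `X_demo`, `y_demo` and `demo_search` and the small grid are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic stand-in data (assumption: NOT the car dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=42)

# Reuse one fold definition so both models are scored on identical splits
cv = KFold(n_splits=3, shuffle=True, random_state=42)

baseline_scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                                  cv=cv, scoring='r2')

demo_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={'n_estimators': [10, 30], 'max_depth': [3, None]},
    cv=cv,
    scoring='r2',
)
demo_search.fit(X_demo, y_demo)

print(f"Linear Regression CV R2: {baseline_scores.mean():.4f}")
print(f"Best Random Forest CV R2: {demo_search.best_score_:.4f}")
print("Best parameters:", demo_search.best_params_)
```

In the notebook itself, the same idea would apply to the full pipelines and `X_train`, `y_train`, keeping the test set for one final evaluation.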

### 📌 **Conclusion**
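- On the held-out test set, **Linear Regression** has the highest R² (0.564) and the lowest RMSE, **Random Forest** has the lowest MAE, and Gradient Boosting trails on all three metrics.
- **Hyperparameter tuning barely changes Random Forest performance:** test R² moves from 0.5440 to 0.5449, still below Linear Regression.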
- Consider further optimizations with additional features like `Annual_Km_Driven`.
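
As a rough sketch of the `Annual_Km_Driven` idea, one plausible definition is kilometres driven divided by the car's age. That definition is an assumption, since the notebook does not define the feature. The toy rows below reuse the notebook's column names with made-up values.

```python
import pandas as pd

# Toy rows with the notebook's column names (values are made up)
cars = pd.DataFrame({
    'year': [2019, 2015, 2012],
    'kms_driven': [20000, 60000, 90000],
})

# Same Age convention as the notebook: the newest car has Age 1,
# so the division below never hits zero
base_year = cars['year'].max()
cars['Age'] = base_year - cars['year'] + 1

# Assumed definition: average kilometres driven per year of age
cars['Annual_Km_Driven'] = cars['kms_driven'] / cars['Age']
print(cars)
```

To try it in the pipeline, the new column would also need to be added to `X` and to `numeric_features` so that it is scaled with the other numeric inputs.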