# Vehicle Price Prediction — Complete Project
This notebook was generated automatically. It unzips the uploaded project archive, loads the dataset, performs simple preprocessing, trains models (Linear Regression and Random Forest), evaluates them, and saves the best model.

### What this notebook contains
- Data loading & inspection
- Simple EDA
- Preprocessing pipeline (imputation, encoding, scaling)
- Model training & evaluation
- Save model

Run the cells sequentially. If the dataset file is in a different location, update `DATASET_PATH`; if the price column has a different name, update `TARGET_COLUMN` in the modeling cell below.


In [None]:
# Unzip status: Unzipped to /mnt/data/vehicle_price_prediction_extracted
# Dataset search result: Found dataset: /mnt/data/vehicle_price_prediction_extracted/Vehicle Price Prediction/dataset.csv
import os
DATASET_PATH = "/mnt/data/vehicle_price_prediction_extracted/Vehicle Price Prediction/dataset.csv"
print('DATASET_PATH =', DATASET_PATH)
print('\nFiles in extraction directory:')
for root, dirs, files in os.walk('/mnt/data/vehicle_price_prediction_extracted'):
    for f in files:
        print(os.path.join(root, f))


In [None]:
import os
import pandas as pd
# DATASET_PATH is always a string here, so check that the file actually exists
if not os.path.exists(DATASET_PATH):
    raise FileNotFoundError(f'No CSV dataset found at {DATASET_PATH}. Put your CSV in the zip and re-run upload.')
df = pd.read_csv(DATASET_PATH)
print('Dataset shape:', df.shape)
display(df.head())
df.info()  # prints its summary directly and returns None, so display() is not needed


In [None]:
# Simple EDA
import matplotlib.pyplot as plt
print('Missing values per column:')
print(df.isnull().sum())
numeric = df.select_dtypes(include='number').columns.tolist()
print('\nNumeric columns:', numeric)
print('\nValue counts for top categorical columns:')
cat = df.select_dtypes(include=['object','category']).columns.tolist()
for c in cat[:5]:
    print('\n--', c)
    print(df[c].value_counts().head())


## Preprocessing & Modeling
Assumptions:
- The dataset contains a numeric target column representing price. **If your target column name is not `price`, update the `TARGET_COLUMN` variable** in the next cell.
- We'll split features into numerical and categorical, impute missing values, encode categoricals with OneHotEncoder, scale numerical features, and train two models.
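
As a minimal sketch of what these preprocessing steps produce, the example below runs the same kind of `ColumnTransformer` on a tiny, made-up DataFrame. The column names `mileage` and `fuel_type` and all values are illustrative assumptions, not columns known to exist in the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric and one categorical column, each with a gap.
toy = pd.DataFrame({
    "mileage": pd.Series([30000.0, np.nan, 80000.0, 52000.0]),
    "fuel_type": pd.Series(["Petrol", "Diesel", np.nan, "Petrol"], dtype="object"),
})

toy_preprocessor = ColumnTransformer(transformers=[
    # Numeric: fill gaps with the median, then standardize to zero mean, unit variance.
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["mileage"]),
    # Categorical: fill gaps with the label 'missing', then one-hot encode.
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["fuel_type"]),
])

toy_matrix = toy_preprocessor.fit_transform(toy)
# The result may be sparse or dense depending on its density; normalize to dense.
toy_dense = toy_matrix.toarray() if hasattr(toy_matrix, "toarray") else np.asarray(toy_matrix)
print(toy_dense.shape)  # one scaled numeric column plus one column per category
print(toy_dense)
```

The missing `fuel_type` entry becomes its own `missing` category, so the encoder never sees a NaN.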


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# CHANGE THIS if your price column has a different name
TARGET_COLUMN = 'price' if 'price' in df.columns else df.columns[-1]
print('Using target column:', TARGET_COLUMN)

X = df.drop(columns=[TARGET_COLUMN])
y = df[TARGET_COLUMN]

# 'number' also catches int32/float32 columns, which would otherwise be dropped silently
numeric_features = X.select_dtypes(include='number').columns.tolist()
categorical_features = X.select_dtypes(include=['object','category']).columns.tolist()

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    # scikit-learn >= 1.2 names this parameter sparse_output; older releases call it sparse
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Models to try
models = {
    'LinearRegression': Pipeline(steps=[('preprocessor', preprocessor), ('reg', LinearRegression())]),
    'RandomForest': Pipeline(steps=[('preprocessor', preprocessor), ('reg', RandomForestRegressor(n_estimators=100, random_state=42))])
}

results = {}
for name, model in models.items():
    print('\nTraining', name)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    # Take the square root manually; the squared=False argument was removed in recent scikit-learn
    rmse = mean_squared_error(y_test, preds) ** 0.5
    r2 = r2_score(y_test, preds)
    results[name] = {'rmse': rmse, 'r2': r2, 'model': model}
    print(f'{name} -> RMSE: {rmse:.4f}, R2: {r2:.4f}')

# Choose best by RMSE
best_name = min(results, key=lambda k: results[k]['rmse'])
best_model = results[best_name]['model']
print('\nBest model:', best_name)
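
For reference, the two metrics reported above can be computed by hand. The sketch below uses plain NumPy on made-up numbers; the helper name `rmse_and_r2` is hypothetical and not part of scikit-learn.

```python
import numpy as np

def rmse_and_r2(y_true, y_pred):
    """Return (RMSE, R^2) for two equal-length sequences of numbers."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    # RMSE: square root of the mean squared residual, in the same units as the target.
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    # R^2: 1 minus the ratio of residual variation to total variation around the mean.
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return rmse, r2

# Made-up values, for illustration only.
example_rmse, example_r2 = rmse_and_r2([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(example_rmse, example_r2)
```

Lower RMSE is better, which is why the best model is chosen with `min`; R² is better when it is closer to 1.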


In [None]:
import joblib
model_path = '/mnt/data/best_vehicle_price_model.joblib'
joblib.dump(best_model, model_path)
print('Saved best model to', model_path)


In [None]:
import matplotlib.pyplot as plt
best_preds = best_model.predict(X_test)
plt.figure(figsize=(6,6))
plt.scatter(y_test, best_preds, alpha=0.6)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs Predicted (Best Model)')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()])
plt.show()


### What you have now
- A runnable notebook that reads the dataset from the uploaded zip and trains models.
- A saved model at `/mnt/data/best_vehicle_price_model.joblib`.

If you want custom feature engineering, hyperparameter tuning, or a different target column name, edit the notebook cells accordingly.
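
One possible starting point for hyperparameter tuning is a grid search over the Random Forest settings. The sketch below is self-contained and runs on synthetic data; the columns `year` and `mileage`, the price formula, and the grid values are all assumptions chosen for illustration, not recommendations for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (hypothetical columns and price formula).
rng = np.random.default_rng(42)
n = 120
demo_X = pd.DataFrame({
    "year": rng.integers(2005, 2023, size=n),
    "mileage": rng.uniform(5_000, 150_000, size=n),
})
demo_y = 500 * (demo_X["year"] - 2000) - 0.05 * demo_X["mileage"] + rng.normal(0, 500, size=n)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("reg", RandomForestRegressor(random_state=42)),
])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'.
param_grid = {
    "reg__n_estimators": [25, 50],
    "reg__max_depth": [None, 5],
}

search = GridSearchCV(
    demo_pipeline,
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",  # negated so that higher is better
)
search.fit(demo_X, demo_y)

print("Best parameters:", search.best_params_)
print("Cross-validated RMSE:", -search.best_score_)
```

In the notebook itself, the same pattern should apply to the `RandomForest` pipeline from the modeling cell, with `X_train` and `y_train` in place of the synthetic data.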


In [1]:
import joblib
import pandas as pd

# Load the trained model from the path it was saved to above
model = joblib.load("/mnt/data/best_vehicle_price_model.joblib")

# Example input: the columns must match the feature columns used for training,
# so change the names and values to fit your dataset
sample = pd.DataFrame({
    'year': [2019],
    'mileage': [30000],
    'fuel_type': ['Petrol'],
    'transmission': ['Manual']
})

# Predict vehicle price
predicted_price = model.predict(sample)
print("Predicted Vehicle Price:", predicted_price[0])