🚗 Car Price Prediction — Colab Notebook
Goal: Build a step-by-step car price prediction model using the uploaded car data.csv file.
This notebook covers: data loading, EDA (Plotly), feature engineering, preprocessing, model training (Linear Regression & Random Forest), evaluation, and exporting the trained model.



---



🎯 Objectives
* Understand and clean the dataset.
* Explore relationships with interactive Plotly charts.
* Create features (car age) and preprocess categorical variables.
* Train baseline models and evaluate (MAE, RMSE, R²).
* Save the best model for deployment or reuse.







---



In [4]:
import pandas as pd
import numpy as np

In [3]:
df=pd.read_csv('/content/car data.csv')
df.head()

Unnamed: 0,Car_Name,Year,Selling_Price,Present_Price,Driven_kms,Fuel_Type,Selling_type,Transmission,Owner
0,ritz,2014,3.35,5.59,27000,Petrol,Dealer,Manual,0
1,sx4,2013,4.75,9.54,43000,Diesel,Dealer,Manual,0
2,ciaz,2017,7.25,9.85,6900,Petrol,Dealer,Manual,0
3,wagon r,2011,2.85,4.15,5200,Petrol,Dealer,Manual,0
4,swift,2014,4.6,6.87,42450,Diesel,Dealer,Manual,0


In [7]:
print(df.shape)

(301, 9)


In [8]:
df.columns = df.columns.str.strip().str.replace(' ', '_').str.lower()
print(df.dtypes)
print('\nNull counts:\n', df.isnull().sum())

car_name          object
year               int64
selling_price    float64
present_price    float64
driven_kms         int64
fuel_type         object
selling_type      object
transmission      object
owner              int64
dtype: object

Null counts:
 car_name         0
year             0
selling_price    0
present_price    0
driven_kms       0
fuel_type        0
selling_type     0
transmission     0
owner            0
dtype: int64


In [9]:
# Create a car_age feature using the dataset's max year as reference (so newest cars have age 0)
reference_year = df['year'].max()
df['car_age'] = reference_year - df['year']
# Drop 'year' if you won't use it directly
df = df.drop(columns=['year'])
df.head()

Unnamed: 0,car_name,selling_price,present_price,driven_kms,fuel_type,selling_type,transmission,owner,car_age
0,ritz,3.35,5.59,27000,Petrol,Dealer,Manual,0,4
1,sx4,4.75,9.54,43000,Diesel,Dealer,Manual,0,5
2,ciaz,7.25,9.85,6900,Petrol,Dealer,Manual,0,1
3,wagon r,2.85,4.15,5200,Petrol,Dealer,Manual,0,7
4,swift,4.6,6.87,42450,Diesel,Dealer,Manual,0,4
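
Because `car_age` is measured against the newest year in the data (2018 here, since the 2014 cars come out at age 4), the same reference year has to be reused for any car scored later. A minimal sketch, assuming a hypothetical helper `add_car_age` and constant `REFERENCE_YEAR` that are not part of the original notebook:

```python
import pandas as pd

# Assumption: 2018 is the newest model year in the training data
# (a 2014 car gets car_age 4 above). Fix it once and reuse it.
REFERENCE_YEAR = 2018


def add_car_age(frame: pd.DataFrame, reference_year: int = REFERENCE_YEAR) -> pd.DataFrame:
    """Return a copy with 'year' replaced by 'car_age' (hypothetical helper)."""
    out = frame.copy()
    out["car_age"] = reference_year - out["year"]
    return out.drop(columns=["year"])


# A made-up car to score later, not a row from the dataset
new_car = pd.DataFrame({"year": [2016], "present_price": [6.0]})
scored = add_car_age(new_car)
print(scored)
```

Recomputing `df['year'].max()` on new data would shift every age if a newer car appeared, which is why the sketch pins the value.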


# **Exploratory Data Analysis**

In [11]:
import plotly.express as px
import plotly.graph_objects as go

In [12]:
# Basic statistics
display(df.describe().T)

# Target distribution (selling_price)
fig = px.histogram(df, x='selling_price', nbins=30, title='Distribution of Selling Price')
fig.update_layout(xaxis_title='Selling Price (Lakhs)', yaxis_title='Count')
fig.show()


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
selling_price,301.0,4.661296,5.082812,0.1,0.9,3.6,6.0,35.0
present_price,301.0,7.628472,8.642584,0.32,1.2,6.4,9.9,92.6
driven_kms,301.0,36947.20598,38886.883882,500.0,15000.0,32000.0,48767.0,500000.0
owner,301.0,0.043189,0.247915,0.0,0.0,0.0,0.0,3.0
car_age,301.0,4.372093,2.891554,0.0,2.0,4.0,6.0,15.0


In [13]:
# Scatter: Present Price vs Selling Price
fig = px.scatter(df, x='present_price', y='selling_price', hover_data=['car_name','driven_kms'],
                 title='Present Price vs Selling Price (car resale)')
fig.update_layout(xaxis_title='Present Price (Lakhs)', yaxis_title='Selling Price (Lakhs)')
fig.show()

The Fortuner and Land Cruiser have the highest selling prices in this plot.

In [14]:
# Box: Selling Price by Fuel Type
fig = px.box(df, x='fuel_type', y='selling_price', points='all', title='Selling Price by Fuel Type')
fig.update_layout(xaxis_title='Fuel Type', yaxis_title='Selling Price (Lakhs)')
fig.show()


In [15]:
# Scatter: Driven kms vs Selling Price colored by Transmission
fig = px.scatter(df, x='driven_kms', y='selling_price', color='transmission',
                 hover_data=['car_name'], title='Driven Kms vs Selling Price')
fig.update_layout(xaxis_title='Driven Kms', yaxis_title='Selling Price (Lakhs)')
fig.show()

Among automatic cars, the Fortuner has the highest selling price.

Among manual cars, the Land Cruiser has the highest.

# **Machine Learning**

In [35]:
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


In [16]:
TARGET = 'selling_price'
FEATURES = ['present_price', 'driven_kms', 'owner', 'car_age', 'fuel_type', 'selling_type', 'transmission']

X = df[FEATURES].copy()
y = df[TARGET].copy()

# Quick check
X.head()

Unnamed: 0,present_price,driven_kms,owner,car_age,fuel_type,selling_type,transmission
0,5.59,27000,0,4,Petrol,Dealer,Manual
1,9.54,43000,0,5,Diesel,Dealer,Manual
2,9.85,6900,0,1,Petrol,Dealer,Manual
3,4.15,5200,0,7,Petrol,Dealer,Manual
4,6.87,42450,0,4,Diesel,Dealer,Manual


In [20]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) #80/20
print('Train shape:', X_train.shape, 'Test shape:', X_test.shape)

Train shape: (240, 7) Test shape: (61, 7)


In [22]:
# Preprocessing: scale numeric, one-hot encode categorical
numeric_features = ['present_price', 'driven_kms', 'owner', 'car_age']
categorical_features = ['fuel_type', 'selling_type', 'transmission']

numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])
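
The `handle_unknown='ignore'` setting matters once the model sees a category that was absent from the training split. A small sketch of that behavior on its own, using a made-up `"Electric"` fuel type that does not appear in the output shown above:

```python
from sklearn.preprocessing import OneHotEncoder

# Fit on the two fuel types visible in the notebook's sample rows
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit([["Petrol"], ["Diesel"]])

# "Electric" is a made-up category that was never seen during fit
encoded = encoder.transform([["Diesel"], ["Electric"]]).toarray()

print(encoder.categories_[0])
print(encoded)
```

The unseen category is encoded as a row of zeros instead of raising an error, so prediction still runs, although the model has no information about that fuel type.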

In [24]:
# Build pipelines for models
lin_pipe = Pipeline(steps=[('pre', preprocessor), ('model', LinearRegression())])
rf_pipe  = Pipeline(steps=[('pre', preprocessor), ('model', RandomForestRegressor(n_estimators=200, random_state=42))])

# Fit both
lin_pipe.fit(X_train, y_train)
rf_pipe.fit(X_train, y_train)

print("Models trained.")

Models trained.


In [30]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, explained_variance_score
import numpy as np

def evaluate_model(pipe, X_test, y_test, name='Model'):
    preds = pipe.predict(X_test)

    mae = mean_absolute_error(y_test, preds)
    mse = mean_squared_error(y_test, preds)
    rmse = np.sqrt(mse)   # ✅ manual RMSE
    r2 = r2_score(y_test, preds)
    evs = explained_variance_score(y_test, preds)
    mape = np.mean(np.abs((y_test - preds) / y_test)) * 100  # %

    # Ad-hoc "accuracy" defined as 100 - MAPE; not a standard regression metric, and negative when MAPE > 100
    accuracy = 100 - mape

    print(f"📊 {name} Evaluation")
    print(f"MAE       : {mae:.3f}")
    print(f"RMSE      : {rmse:.3f}")
    print(f"R² Score  : {r2:.3f} ({r2*100:.2f}% approx)")
    print(f"EVS       : {evs:.3f}")
    print(f"MAPE      : {mape:.2f}%")
    print(f"Accuracy  : {accuracy:.2f}%")  # ✅ accuracy in %

    return preds

print('Linear Regression evaluation:')
preds_lin = evaluate_model(lin_pipe, X_test, y_test, 'LinearRegression')

print('\nRandom Forest evaluation:')
preds_rf  = evaluate_model(rf_pipe, X_test, y_test, 'RandomForest')


Linear Regression evaluation:
📊 LinearRegression Evaluation
MAE       : 1.216
RMSE      : 1.866
R² Score  : 0.849 (84.89% approx)
EVS       : 0.854
MAPE      : 80.30%
Accuracy  : 19.70%

Random Forest evaluation:
📊 RandomForest Evaluation
MAE       : 0.619
RMSE      : 0.948
R² Score  : 0.961 (96.10% approx)
EVS       : 0.961
MAPE      : 16.80%
Accuracy  : 83.20%


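Random Forest outperforms Linear Regression on every metric reported above (MAE 0.619 vs 1.216, RMSE 0.948 vs 1.866, R² 0.961 vs 0.849). Its "accuracy" of 83.20% is simply 100 - MAPE, not a classification accuracy.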
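
Linear Regression scores an R² of 0.849 yet a MAPE of 80.30%. One plausible reason is how MAPE weights errors: the same absolute error counts far more on a cheap car than on an expensive one, and selling prices in this dataset start at 0.1 lakhs. A small sketch with made-up prices (not rows from the dataset) to illustrate:

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean absolute error, in the same units as the prices."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up prices in lakhs; every prediction is off by exactly 0.5
cheap_actual, cheap_pred = [0.2, 0.5, 1.0], [0.7, 1.0, 1.5]
pricey_actual, pricey_pred = [5.0, 10.0, 20.0], [5.5, 10.5, 20.5]

print(f"cheap : MAE={mae(cheap_actual, cheap_pred):.2f}  MAPE={mape(cheap_actual, cheap_pred):.1f}%")
print(f"pricey: MAE={mae(pricey_actual, pricey_pred):.2f}  MAPE={mape(pricey_actual, pricey_pred):.1f}%")
```

Both groups have the same MAE, but MAPE for the cheap group exceeds 100%, so "100 - MAPE" would be negative there. That is a reason to read MAE and RMSE alongside it.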

In [31]:
# Cross-validation (5-fold) on the training set for Random Forest
cv_scores = cross_val_score(rf_pipe, X_train, y_train, cv=5, scoring='r2')
print('Random Forest CV R2 (5-fold):', cv_scores.round(3), 'Mean:', cv_scores.mean().round(3))

Random Forest CV R2 (5-fold): [0.948 0.828 0.799 0.871 0.943] Mean: 0.878
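
The fold scores range from 0.799 to 0.948, so the mean alone hides a fair amount of variation on a dataset this small. A quick sketch that summarizes the printed scores with the standard library; the numbers are copied from the output above:

```python
import statistics

# Fold scores copied from the cross-validation output above
cv_r2 = [0.948, 0.828, 0.799, 0.871, 0.943]

mean_r2 = statistics.mean(cv_r2)
std_r2 = statistics.stdev(cv_r2)  # sample standard deviation across folds

print(f"CV R2: {mean_r2:.3f} +/- {std_r2:.3f} (min {min(cv_r2):.3f}, max {max(cv_r2):.3f})")
```

The held-out test R² of 0.961 sits above every fold score, which suggests the single 80/20 split may be a favorable one.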


In [32]:
# Extract feature names after ColumnTransformer
num_feats = numeric_features
cat_feats = rf_pipe.named_steps['pre'].named_transformers_['cat'].get_feature_names_out(categorical_features).tolist()
feature_names = num_feats + cat_feats

importances = rf_pipe.named_steps['model'].feature_importances_
feat_imp = pd.DataFrame({'feature': feature_names, 'importance': importances}).sort_values('importance', ascending=False)
feat_imp

Unnamed: 0,feature,importance
0,present_price,0.886373
3,car_age,0.057162
1,driven_kms,0.033753
10,transmission_Manual,0.00766
9,transmission_Automatic,0.004851
5,fuel_type_Diesel,0.003883
6,fuel_type_Petrol,0.002595
8,selling_type_Individual,0.002111
7,selling_type_Dealer,0.001209
2,owner,0.00039


In [33]:
fig = px.bar(feat_imp, x='importance', y='feature', orientation='h', title='Random Forest — Feature Importances')
fig.update_layout(yaxis={'categoryorder':'total ascending'})
fig.show()

# **Insights**
* Petrol cars generally have lower resale values than diesel cars.
* Older cars have much lower predicted selling prices.
* Present price is by far the most influential feature (importance of about 0.89 in the Random Forest).
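
The objectives mention saving the best model, but no cell above does so. Here is a minimal sketch of one way to do it with `joblib`. The stand-in pipeline (fitted on the five rows shown by `df.head()`, using three of the seven features) and the file name `car_price_rf.joblib` are assumptions so that the block runs on its own. In the notebook you would pass `rf_pipe` to `joblib.dump` instead.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Stand-in data: the five rows printed by df.head(), three columns only
X_demo = pd.DataFrame({
    "present_price": [5.59, 9.54, 9.85, 4.15, 6.87],
    "car_age": [4, 5, 1, 7, 4],
    "fuel_type": ["Petrol", "Diesel", "Petrol", "Petrol", "Diesel"],
})
y_demo = [3.35, 4.75, 7.25, 2.85, 4.60]

pre = ColumnTransformer([
    ("num", StandardScaler(), ["present_price", "car_age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["fuel_type"]),
])
demo_pipe = Pipeline([
    ("pre", pre),
    ("model", RandomForestRegressor(n_estimators=20, random_state=42)),
])
demo_pipe.fit(X_demo, y_demo)

# Persist the whole pipeline so the preprocessing travels with the model
model_path = os.path.join(tempfile.gettempdir(), "car_price_rf.joblib")  # hypothetical file name
joblib.dump(demo_pipe, model_path)

# Reload and confirm the restored pipeline predicts identically
restored = joblib.load(model_path)
same = (restored.predict(X_demo) == demo_pipe.predict(X_demo)).all()
print("Round trip OK:", same)
```

Saving the pipeline, not only the fitted forest, means new data can be passed in raw form with the same column names. The file should be loaded with a scikit-learn version compatible with the one used to save it.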