# 🚗 Vehicle Price Prediction Project
### Achieving an $R^2$ of 0.975 with Random Forest
In this project, we predict the selling price of used cars from features such as kilometers driven, fuel type, and engine power.
**Key Achievement:** Improved the model's $R^2$ score from -0.09 to **0.975** through more careful data cleaning (RegEx) and model selection.

In [18]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [19]:
dataset = pd.read_csv('/home/mahmoudahmad/IBM_AI_Engineering/projects/vehicle_price/Car details v3.csv')

In [20]:
dataset.sample(5)

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats
3045,Maruti Ritz VDi,2012,300000,70000,Diesel,Individual,Manual,First Owner,23.2 kmpl,1248 CC,73.94 bhp,190Nm@ 2000rpm,5.0
5988,Maruti Alto LXi,2012,200000,50000,Petrol,Individual,Manual,Second Owner,19.7 kmpl,796 CC,46.3 bhp,62Nm@ 3000rpm,5.0
5967,Maruti Alto 800 VXI,2015,270000,40000,Petrol,Individual,Manual,First Owner,22.74 kmpl,796 CC,47.3 bhp,69Nm@ 3500rpm,5.0
6678,Renault KWID 1.0,2016,220000,20000,Petrol,Individual,Manual,First Owner,23.01 kmpl,999 CC,67 bhp,91Nm@ 4250rpm,5.0
618,Hyundai i10 Era,2008,135000,80000,Petrol,Individual,Manual,Second Owner,19.81 kmpl,1086 CC,68.05 bhp,99.04Nm@ 4500rpm,5.0


In [21]:
dataset.isnull().sum()

name               0
year               0
selling_price      0
km_driven          0
fuel               0
seller_type        0
transmission       0
owner              0
mileage          221
engine           221
max_power        215
torque           222
seats            221
dtype: int64

In [22]:
dataset = dataset.dropna()
dataset.isnull().sum()

name             0
year             0
selling_price    0
km_driven        0
fuel             0
seller_type      0
transmission     0
owner            0
mileage          0
engine           0
max_power        0
torque           0
seats            0
dtype: int64

## 1. Data Cleaning & Feature Extraction 🧹
The dataset stores several numeric features as strings with units (e.g., '88.7 bhp', '1200 CC').
I used **Regular Expressions (RegEx)** and string manipulation to extract the numeric part of each value for the model.

In [23]:
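# Keep the leading number and drop the unit ('23.4 kmpl' -> 23.4, '1248 CC' -> 1248.0)
dataset['mileage'] = dataset['mileage'].str.split().str[0].astype(float)

dataset['engine'] = dataset['engine'].str.split().str[0].astype(float)

# Raw string so '\d' is not treated as an invalid escape sequence
dataset['max_power'] = dataset['max_power'].str.extract(r'(\d+\.?\d*)').astype(float)

# Free-text columns that are not used as features
dataset = dataset.drop(columns=['name', 'torque'])

# Drop rows where no number could be extracted
dataset = dataset.dropna()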

cc = dataset.select_dtypes(include='object').columns.to_list()
print("Columns to encode:", cc)

Columns to encode: ['fuel', 'seller_type', 'transmission', 'owner']
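The extraction step can be tried on a few made-up rows. This is a minimal sketch, not the notebook's data: the frame `raw` and its values are assumptions chosen for illustration, including a hypothetical unit-only entry (`'bhp'`) to show why a second `dropna()` may be needed after extraction.

```python
import pandas as pd

# Hypothetical sample rows; the last max_power value has a unit but no number.
raw = pd.DataFrame({
    "mileage": ["23.4 kmpl", "17.7 kmpl", "19.7 kmpl"],
    "engine": ["1248 CC", "1497 CC", "796 CC"],
    "max_power": ["74 bhp", "103.52 bhp", "bhp"],
})

clean = pd.DataFrame({
    # Split on whitespace and keep the first token (the number).
    "mileage": raw["mileage"].str.split().str[0].astype(float),
    "engine": raw["engine"].str.split().str[0].astype(float),
    # Capture the first integer or decimal; rows without a match become NaN.
    "max_power": raw["max_power"].str.extract(r"(\d+\.?\d*)", expand=False).astype(float),
})

print(clean)
print("Rows with no extractable max_power:", clean["max_power"].isna().sum())
```

The regex route is more forgiving than `split()`: a value with no number yields `NaN` instead of raising an error during the float conversion.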




In [24]:
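from sklearn.preprocessing import OneHotEncoder

cc = dataset.select_dtypes(include='object').columns.to_list()
# drop='first' removes one level per feature to avoid redundant dummy columns
encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded_data = encoder.fit_transform(dataset[cc])
# Note: this frame gets a fresh 0..n-1 index, while dataset keeps its original row labels after dropna()
encoded_data_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(cc))
# concat aligns on index labels, not on row position
prepped_data = pd.concat([dataset.drop(columns=cc), encoded_data_df], axis=1)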

In [25]:
# from sklearn.preprocessing import StandardScaler
# cc2 = prepped_data.select_dtypes(include='float64').columns.to_list()
# scaler = StandardScaler()
# scaled_data = scaler.fit_transform(prepped_data[cc2])
# scaled_data_df = pd.DataFrame(scaled_data,columns=scaler.get_feature_names_out(cc2))
# ready_data = pd.concat([prepped_data.drop(columns=cc2),scaled_data_df],axis=1)

In [26]:
prepped_data.isnull().sum()

year                            218
selling_price                   218
km_driven                       218
mileage                         218
engine                          218
max_power                       218
seats                           218
fuel_Diesel                     218
fuel_LPG                        218
fuel_Petrol                     218
seller_type_Individual          218
seller_type_Trustmark Dealer    218
transmission_Manual             218
owner_Fourth & Above Owner      218
owner_Second Owner              218
owner_Test Drive Car            218
owner_Third Owner               218
dtype: int64

In [27]:
prepped_data = prepped_data.dropna()
prepped_data.isnull().sum()

year                            0
selling_price                   0
km_driven                       0
mileage                         0
engine                          0
max_power                       0
seats                           0
fuel_Diesel                     0
fuel_LPG                        0
fuel_Petrol                     0
seller_type_Individual          0
seller_type_Trustmark Dealer    0
transmission_Manual             0
owner_Fourth & Above Owner      0
owner_Second Owner              0
owner_Test Drive Car            0
owner_Third Owner               0
dtype: int64

In [28]:
prepped_data

Unnamed: 0,year,selling_price,km_driven,mileage,engine,max_power,seats,fuel_Diesel,fuel_LPG,fuel_Petrol,seller_type_Individual,seller_type_Trustmark Dealer,transmission_Manual,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,2014.0,450000.0,145500.0,23.40,1248.0,74.00,5.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
1,2014.0,370000.0,120000.0,21.14,1498.0,103.52,5.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
2,2006.0,158000.0,140000.0,17.70,1497.0,78.00,5.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
3,2010.0,225000.0,127000.0,23.00,1396.0,90.00,5.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
4,2007.0,130000.0,120000.0,16.10,1298.0,88.20,5.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7901,2017.0,800000.0,45500.0,23.90,1582.0,126.20,5.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
7902,2014.0,1100000.0,120000.0,12.99,2494.0,100.60,8.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0
7903,2008.0,114999.0,100000.0,19.70,796.0,46.30,5.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
7904,2013.0,500000.0,92000.0,20.77,1248.0,88.76,7.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0


In [29]:
y = prepped_data['selling_price']
X = prepped_data.drop('selling_price',axis=1)

In [30]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [31]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

## 2. Model Training ⚙️
This section compares Linear Regression (the baseline) with a Random Forest Regressor.
- **Linear Regression** struggled with non-linear patterns ($R^2 \approx 0.60$).
- **Random Forest** captured those non-linear relationships far better ($R^2 \approx 0.975$).
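To illustrate why a linear model can fall behind on non-linear data, here is a small sketch on synthetic data. The quadratic target and the names `X_toy`, `y_toy`, and `scores` are assumptions made for the example and are unrelated to the car dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic, clearly non-linear relationship: y = x^2 plus a little noise.
rng = np.random.default_rng(42)
X_toy = rng.uniform(-3, 3, size=(400, 1))
y_toy = X_toy[:, 0] ** 2 + rng.normal(0, 0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

candidates = [
    ("linear", LinearRegression()),
    ("forest", RandomForestRegressor(n_estimators=100, random_state=42)),
]

# Fit each model on the same split and score it on the held-out part.
scores = {}
for name, estimator in candidates:
    estimator.fit(X_tr, y_tr)
    scores[name] = r2_score(y_te, estimator.predict(X_te))

print(scores)
```

A straight line cannot follow a symmetric curve, so the linear model's $R^2$ stays low here, while the forest approximates the curve with piecewise-constant splits.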

In [32]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
# model = LinearRegression()
# model.fit(X_train,y_train)
# MSE: 0.3682912019014394
# r^2 score: 0.6006711700599807

model = RandomForestRegressor(n_estimators=100,random_state=42)
model.fit(X_train_scaled,y_train)
# MSE: 0.023078559928502454
# r^2 score: 0.974976501514647

parameter,value
n_estimators,100
criterion,'squared_error'
max_depth,None
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_features,1.0
max_leaf_nodes,None
min_impurity_decrease,0.0
bootstrap,True


In [None]:
from sklearn.metrics import mean_squared_error, r2_score

y_pred = model.predict(X_test_scaled)
# Report plain MSE (in squared price units); np.sqrt of this value gives the RMSE
print(f"MSE: {mean_squared_error(y_test, y_pred)}")
print(f"r^2 score: {r2_score(y_test, y_pred)}")

MSE: 18141676186.855312
r^2 score: 0.9751381384314004
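MSE is expressed in squared price units, which makes a value like the one above hard to read. Taking the square root gives the RMSE, which is in the same units as `selling_price`. A minimal sketch using the recorded MSE:

```python
import numpy as np

# MSE copied from the recorded output above.
mse = 18141676186.855312

# RMSE is the square root of MSE, in the same units as the target.
rmse = np.sqrt(mse)
print(f"RMSE: {rmse:,.0f}")
```

This comes out at roughly 134,700, which can be read as a typical prediction error in price units, with large errors weighted more heavily than small ones.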


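## 3. Final Evaluation 🏆
**Result:** The Random Forest model achieved an **$R^2$ score of 0.975** on the held-out test set.
5-fold cross-validation gave a mean $R^2$ of about 0.963 (folds ranging from 0.947 to 0.977), which suggests the result does not depend on one favorable split.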

In [34]:
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')

print("Cross-Validation Scores:", cv_scores)
print("Mean R^2:", cv_scores.mean())

Cross-Validation Scores: [0.95836784 0.96229885 0.97715238 0.94658452 0.97014236]
Mean R^2: 0.9629091871332867
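With `cv=5` and a regressor, `cross_val_score` splits the rows with `KFold` in their stored order, without shuffling. If the file happens to be sorted in some way, shuffled folds can give a fairer picture, and reporting the spread alongside the mean shows how stable the score is. The sketch below uses synthetic data from `make_regression`; `X_demo`, `y_demo`, and the smaller forest are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the real features and target.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)

# Shuffle rows before splitting so the folds do not depend on row order.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
forest = RandomForestRegressor(n_estimators=50, random_state=42)

demo_scores = cross_val_score(forest, X_demo, y_demo, cv=cv, scoring="r2")

print(f"R^2 per fold: {np.round(demo_scores, 3)}")
print(f"mean = {demo_scores.mean():.3f}, std = {demo_scores.std():.3f}")
```

A small standard deviation across folds supports the claim that the model's performance is stable. Comparing the training-set score with the cross-validated score would be a more direct check for overfitting.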
