# INTRODUCTION

This project focuses on predicting the selling price of second-hand cars using machine learning. The goal is to build a regression model that estimates a car's current price from features such as its age in years, distance driven (km), condition, fuel economy, top speed, horsepower, and torque.

The dataset used contains details of multiple used cars. After loading and preparing the data, a Decision Tree Regressor model is trained to learn the relationship between the features and the target variable (price).

To improve the model's accuracy, GridSearchCV is used for hyperparameter tuning. This helps in finding the best set of parameters that reduce error and improve prediction performance.

Overall, the project demonstrates a step-by-step approach to building and optimizing a machine learning model for real-world use.



# IMPORTING NECESSARY LIBRARIES
Importing essential libraries such as Pandas, Matplotlib, and Seaborn.

In [54]:
#Importing the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# DATA LOADING AND PREPARATION

The dataset was first loaded using `pd.read_csv()` and stored in a DataFrame. Basic inspection was performed using:

- `data.info()` – to view column data types and non-null counts
- `data.describe()` – to examine statistical summaries of numeric columns
- `data.tail()` – to view the last few entries and confirm proper loading
- `data.head()` – to glance at the first few records
- `data.columns` – to get an overview of all column names in the dataset

After this initial examination, the following columns were dropped:
- `v.id`
- `on road now`
- `on road old`

These columns were removed because their descriptions were unclear and they were considered non-essential for the analysis.


In [55]:
#Load the dataset
data = pd.read_csv("Car_prices.csv")

In [56]:
#First five rows
data.head()

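| | v.id | on road old | on road now | years | km | rating | condition | economy | top speed | hp | torque | current price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 535651 | 798186 | 3 | 78945 | 1 | 2 | 14 | 177 | 73 | 123 | 351318.0 |
| 1 | 2 | 591911 | 861056 | 6 | 117220 | 5 | 9 | 9 | 148 | 74 | 95 | 285001.5 |
| 2 | 3 | 686990 | 770762 | 2 | 132538 | 2 | 8 | 15 | 181 | 53 | 97 | 215386.0 |
| 3 | 4 | 573999 | 722381 | 4 | 101065 | 4 | 3 | 11 | 197 | 54 | 116 | 244295.5 |
| 4 | 5 | 691388 | 811335 | 6 | 61559 | 3 | 9 | 12 | 160 | 53 | 105 | 531114.5 |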


In [57]:
#Last five rows
data.tail()

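| | v.id | on road old | on road now | years | km | rating | condition | economy | top speed | hp | torque | current price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 995 | 996 | 633238 | 743850 | 5 | 125092 | 1 | 6 | 11 | 171 | 95 | 97 | 190744.0 |
| 996 | 997 | 599626 | 848195 | 4 | 83370 | 2 | 9 | 14 | 161 | 101 | 120 | 419748.0 |
| 997 | 998 | 646344 | 842733 | 7 | 86722 | 1 | 8 | 9 | 196 | 113 | 89 | 405871.0 |
| 998 | 999 | 535559 | 732439 | 2 | 140478 | 4 | 5 | 9 | 184 | 112 | 128 | 74398.0 |
| 999 | 1000 | 590105 | 779743 | 5 | 67295 | 4 | 2 | 8 | 199 | 99 | 96 | 414938.5 |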


In [58]:
#Dataset information: column dtypes and non-null counts
#(info must be called with parentheses; bare `data.info` only shows the bound-method repr)
data.info()

In [59]:
data.describe()

| | v.id | on road old | on road now | years | km | rating | condition | economy | top speed | hp | torque | current price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 500.5 | 601648.286 | 799131.397 | 4.561 | 100274.43 | 2.988 | 5.592 | 11.625 | 166.893 | 84.546 | 103.423 | 308520.2425 |
| std | 288.819436 | 58407.246204 | 57028.9502 | 1.719079 | 29150.463233 | 1.402791 | 2.824449 | 2.230549 | 19.28838 | 20.51694 | 21.058716 | 126073.25915 |
| min | 1.0 | 500265.0 | 700018.0 | 2.0 | 50324.0 | 1.0 | 1.0 | 8.0 | 135.0 | 50.0 | 68.0 | 28226.5 |
| 25% | 250.75 | 548860.5 | 750997.75 | 3.0 | 74367.5 | 2.0 | 3.0 | 10.0 | 150.0 | 67.0 | 85.0 | 206871.75 |
| 50% | 500.5 | 601568.0 | 798168.0 | 5.0 | 100139.5 | 3.0 | 6.0 | 12.0 | 166.0 | 84.0 | 104.0 | 306717.75 |
| 75% | 750.25 | 652267.25 | 847563.25 | 6.0 | 125048.0 | 4.0 | 8.0 | 13.0 | 184.0 | 102.0 | 121.0 | 414260.875 |
| max | 1000.0 | 699859.0 | 899797.0 | 7.0 | 149902.0 | 5.0 | 10.0 | 15.0 | 200.0 | 120.0 | 140.0 | 584267.5 |


In [60]:
data.columns

Index(['v.id', 'on road old', 'on road now', 'years', 'km', 'rating',
       'condition', 'economy', 'top speed', 'hp', 'torque', 'current price'],
      dtype='object')

In [61]:
data.drop(columns=['v.id', 'on road old', 'on road now'], inplace=True)
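
Because the drop above runs in place, re-running that cell after the columns are gone raises a `KeyError`. A minimal sketch of a re-runnable alternative follows. It assumes a tiny frame built from the first two rows shown earlier, and `sample`, `to_drop`, `cleaned`, and `cleaned_again` are hypothetical names used only for illustration:

```python
import pandas as pd

# Two rows copied from the head() output above, for illustration only.
sample = pd.DataFrame({
    "v.id": [1, 2],
    "on road old": [535651, 591911],
    "on road now": [798186, 861056],
    "years": [3, 6],
    "km": [78945, 117220],
    "current price": [351318.0, 285001.5],
})

to_drop = ["v.id", "on road old", "on road now"]

# errors="ignore" skips labels that are already missing,
# so the same statement can be executed more than once.
cleaned = sample.drop(columns=to_drop, errors="ignore")
cleaned_again = cleaned.drop(columns=to_drop, errors="ignore")

print(list(cleaned_again.columns))
```

Returning a new frame instead of using `inplace=True` also leaves the original data available for comparison.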

# EXPLORATORY DATA ANALYSIS



In [None]:
# Histograms
data.hist(figsize=(15, 10), bins=30)
plt.suptitle("Histograms of Numeric Features", fontsize=16)
plt.tight_layout()
plt.show()


In [None]:
#Correlation heatmap
plt.figure(figsize=(12, 8))
corr_matrix = data.corr()
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap", fontsize=16)
plt.show()
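
The heatmap shows every pairwise correlation, but for price prediction the column of interest is the one for `current price`. The sketch below shows one way to rank features by the strength of their correlation with the target. It assumes a small frame built from the first five rows shown earlier (far too few rows to draw conclusions from), and `sample` and `target_corr` are hypothetical names:

```python
import pandas as pd

# First five rows from the head() output above; illustration only.
sample = pd.DataFrame({
    "years": [3, 6, 2, 4, 6],
    "km": [78945, 117220, 132538, 101065, 61559],
    "hp": [73, 74, 53, 54, 53],
    "current price": [351318.0, 285001.5, 215386.0, 244295.5, 531114.5],
})

# Correlation of each feature with the target, strongest absolute value first.
target_corr = (
    sample.corr()["current price"]
    .drop("current price")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```

On the full dataset, the same expression applied to `data` would give the ranking that the heatmap displays visually.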


# FEATURES AND TARGET

After cleaning the dataset, the features (`X`) and the target variable (`y`) are defined. The target is the `current price` of the car, while the features include all other relevant attributes.


In [None]:
#Features: every column except the target
X = data.drop(columns="current price")
#Target: the price to predict
y = data["current price"]

# SPLITTING THE DATASET

To evaluate model performance properly, the dataset was split into training and testing sets using `train_test_split` from `sklearn.model_selection`. This helps test how well the model generalizes to unseen data.


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
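
With 1,000 rows and `test_size=0.2`, the split should yield 800 training rows and 200 test rows. A quick sanity check on stand-in data is sketched below. The arrays are placeholders with the same shape as the cleaned dataset (8 feature columns), not the car data itself, and the `_demo`, `_tr`, and `_te` names are hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays: 1,000 rows and 8 feature columns.
X_demo = np.arange(8000).reshape(1000, 8)
y_demo = np.arange(1000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# Each row ends up in exactly one of the two sets.
print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the split repeatable, so later metric comparisons use the same test rows.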

# TRAINING THE DECISION TREE REGRESSOR MODEL

The `DecisionTreeRegressor` model is imported from `sklearn.tree`, trained on the training data with the `.fit()` method, and then used to predict prices for the test set with `.predict()`.



In [None]:
from sklearn.tree import DecisionTreeRegressor

#Fixed seed so repeated runs give the same tree
tree_model = DecisionTreeRegressor(random_state=1)

tree_model.fit(X_train, y_train)

In [None]:
y_pred1 = tree_model.predict(X_test)
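
A tree grown with default settings keeps splitting until its leaves are pure, so it can fit the training set almost perfectly while doing worse on unseen data. The sketch below illustrates that gap on synthetic data from `make_regression`. This is a stand-in, not the car dataset, and all `demo`-style names are hypothetical:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; not the car dataset.
X_demo, y_demo = make_regression(
    n_samples=500, n_features=8, noise=20.0, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# No depth limit: the tree is free to memorize the training rows.
demo_tree = DecisionTreeRegressor(random_state=1).fit(X_tr, y_tr)

train_r2 = demo_tree.score(X_tr, y_tr)  # R^2 on rows the tree has seen
test_r2 = demo_tree.score(X_te, y_te)   # R^2 on held-out rows
print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")
```

A large difference between the two scores is the usual sign of overfitting, which is what limits such as `max_depth` and `min_samples_leaf` are meant to control.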

In [None]:
#Evaluating the model
#Note: root_mean_squared_error requires scikit-learn 1.4 or newer
from sklearn.metrics import root_mean_squared_error, r2_score

print("RMSE:", root_mean_squared_error(y_test, y_pred1))

print("R2:", r2_score(y_test, y_pred1))
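
For reference, the two metrics used above can be computed by hand. The sketch below is a plain-Python illustration with made-up numbers, not the scikit-learn implementation. The function names `rmse` and `r2` and the sample values are hypothetical:

```python
import math

def rmse(y_true, y_pred):
    """Square root of the mean squared residual (same units as the target)."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r2(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up prices and predictions, for illustration only.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

print(rmse(actual, predicted))  # every prediction is off by 10
print(r2(actual, predicted))    # share of variance explained
```

RMSE is expressed in price units, so lower is better. R² is unitless, with 1.0 meaning a perfect fit.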

# HYPERPARAMETER TUNING USING `GridSearchCV`

To improve model performance, hyperparameter tuning using `GridSearchCV` was performed. 

This process searches through multiple combinations of parameters (like `max_depth`, `min_samples_split`, and `max_features`) to find the configuration that yields the best model performance.

After tuning, the model was retrained using the best parameters and evaluated again. In this case, we observed a lower error and improved accuracy.


In [None]:
from sklearn.model_selection import GridSearchCV
grid_model = GridSearchCV(
    estimator=tree_model,
    param_grid={
        'max_depth': [3, 5, 10, None],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'max_features': [None, 'sqrt', 'log2'],
        'criterion': ['squared_error', 'absolute_error']
    },
    cv=5,
    n_jobs=-1
)
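
To get a sense of how much work this search involves, the number of candidate settings can be counted directly. The sketch below reproduces the grid from the cell above in plain Python. The names `cv_folds`, `combinations`, `n_candidates`, and `n_fits` are hypothetical:

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": [None, "sqrt", "log2"],
    "criterion": ["squared_error", "absolute_error"],
}
cv_folds = 5

# One candidate per combination of a single value for each parameter.
combinations = list(product(*param_grid.values()))
n_candidates = len(combinations)

# Each candidate is fitted once per fold (the final refit is not counted).
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)
```

This gives 216 candidates and 1,080 fits. Trees using the `absolute_error` criterion tend to be slower to fit than those using `squared_error`, so that half of the grid is likely to dominate the run time.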

In [None]:
grid_model.fit(X_train, y_train)

In [None]:
y_pred = grid_model.predict(X_test)

In [None]:
#Evaluating the tuned model on the same test set

print("RMSE:", root_mean_squared_error(y_test, y_pred))

print("R2:", r2_score(y_test, y_pred))
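
The evaluation above reports test-set metrics but not which settings the search selected. The sketch below shows how a fitted `GridSearchCV` object exposes that information. It uses synthetic data from `make_regression` and a deliberately small grid so it runs quickly. The `demo`-style names are hypothetical and the data is not the car dataset:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; not the car dataset.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=20.0, random_state=1
)

demo_search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=1),
    param_grid={"max_depth": [3, 5, None], "min_samples_leaf": [1, 4]},
    cv=5,
)
demo_search.fit(X_demo, y_demo)

# Winning combination and its mean cross-validated score
# (the estimator's default score, R^2, since no scoring is given).
print(demo_search.best_params_)
print(demo_search.best_score_)

# With the default refit=True, best_estimator_ has already been
# retrained on all the data passed to fit().
print(demo_search.best_estimator_.get_depth())
```

In the notebook, the same attributes would be read from `grid_model` after `grid_model.fit(X_train, y_train)`.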

# CONCLUSION


Initially, the Decision Tree Regressor showed decent performance, but its root mean squared error (RMSE) was relatively high. After applying GridSearchCV to tune hyperparameters like `max_depth`, `min_samples_split`, and `min_samples_leaf`, the model's performance improved significantly. This demonstrates the role of hyperparameter tuning in improving predictive accuracy and in reducing overfitting or underfitting in machine learning models.
