### Predicting Car Prices with Linear Regression

In [3]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Load dataset
df = pd.read_csv("car_price.csv")
print(df.head())
print(df.info())
print(df.describe())


   car_id    brand  mileage  enginesize  age   price
0       1   Toyota    10000         1.2    3  550000
1       2  Hyundai    25000         1.4    2  680000
2       3   Maruti    60000         1.0    5  350000
3       4    Honda    20000         1.5    1  750000
4       5   Toyota    80000         1.2    4  520000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   car_id      10 non-null     int64  
 1   brand       10 non-null     object 
 2   mileage     10 non-null     int64  
 3   enginesize  10 non-null     float64
 4   age         10 non-null     int64  
 5   price       10 non-null     int64  
dtypes: float64(1), int64(4), object(1)
memory usage: 612.0+ bytes
None
         car_id        mileage  enginesize        age          price
count  10.00000      10.000000   10.000000  10.000000      10.000000
mean    5.50000   48500.000000    1.330000 

In [5]:
# Drop rows with missing values (a no-op here: df.info() shows all 10 rows are non-null)
df = df.dropna()
df


   car_id    brand  mileage  enginesize  age   price
0       1   Toyota    10000         1.2    3  550000
1       2  Hyundai    25000         1.4    2  680000
2       3   Maruti    60000         1.0    5  350000
3       4    Honda    20000         1.5    1  750000
4       5   Toyota    80000         1.2    4  520000
5       6  Hyundai    50000         1.3    4  480000
6       7     Ford    15000         1.6    1  460000
7       8    Honda   150000         1.5    4  610000
8       9   Maruti    40000         1.0    1  300000
9      10     Ford    35000         1.6    6  680000


In [6]:
df_encoded = pd.get_dummies(df, drop_first=True)
df_encoded.head()

   car_id  mileage  enginesize  age   price  brand_Honda  brand_Hyundai  brand_Maruti  brand_Toyota
0       1    10000         1.2    3  550000        False          False         False          True
1       2    25000         1.4    2  680000        False           True         False         False
2       3    60000         1.0    5  350000        False          False          True         False
3       4    20000         1.5    1  750000         True          False         False         False
4       5    80000         1.2    4  520000        False          False         False          True


In [11]:
X = df_encoded.drop("price", axis=1)
y = df_encoded["price"]
X


   car_id  mileage  enginesize  age  brand_Honda  brand_Hyundai  brand_Maruti  brand_Toyota
0       1    10000         1.2    3        False          False         False          True
1       2    25000         1.4    2        False           True         False         False
2       3    60000         1.0    5        False          False          True         False
3       4    20000         1.5    1         True          False         False         False
4       5    80000         1.2    4        False          False         False          True
5       6    50000         1.3    4        False           True         False         False
6       7    15000         1.6    1        False          False         False         False
7       8   150000         1.5    4         True          False         False         False
8       9    40000         1.0    1        False          False          True         False
9      10    35000         1.6    6        False          False         False         False
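The feature matrix above still contains `car_id`, which is only a row identifier and carries no pricing information. A minimal sketch of excluding it before encoding follows, assuming the same column names as in the output above. The three-row frame is a stand-in copied from the first rows of the dataset, and the results in the later cells were produced with `car_id` still included.

```python
import pandas as pd

# Stand-in for the notebook's df: first three rows copied from the output above
df = pd.DataFrame({
    "car_id": [1, 2, 3],
    "brand": ["Toyota", "Hyundai", "Maruti"],
    "mileage": [10000, 25000, 60000],
    "enginesize": [1.2, 1.4, 1.0],
    "age": [3, 2, 5],
    "price": [550000, 680000, 350000],
})

# Drop the identifier before encoding so it never becomes a feature
df_encoded = pd.get_dummies(df.drop(columns=["car_id"]), drop_first=True)

X = df_encoded.drop(columns=["price"])
y = df_encoded["price"]
print(list(X.columns))
```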


In [12]:
# X and y were already defined in the previous cell; display the target
y

0    550000
1    680000
2    350000
3    750000
4    520000
5    480000
6    460000
7    610000
8    300000
9    680000
Name: price, dtype: int64

In [13]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_test, y_train, y_test

(   car_id  mileage  enginesize  age  brand_Honda  brand_Hyundai  brand_Maruti  \
 5       6    50000         1.3    4        False           True         False   
 0       1    10000         1.2    3        False          False         False   
 7       8   150000         1.5    4         True          False         False   
 2       3    60000         1.0    5        False          False          True   
 9      10    35000         1.6    6        False          False         False   
 4       5    80000         1.2    4        False          False         False   
 3       4    20000         1.5    1         True          False         False   
 6       7    15000         1.6    1        False          False         False   
 
    brand_Toyota  
 5         False  
 0          True  
 7         False  
 2         False  
 9         False  
 4          True  
 3         False  
 6         False  ,
    car_id  mileage  enginesize  age  brand_Honda  brand_Hyundai  brand_Maruti  \
 8    

In [14]:
model = LinearRegression()
model.fit(X_train, y_train)
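Before looking at test metrics, it may help to compare the number of training rows with the number of parameters being fitted. The sketch below rebuilds the dataset from the rows printed earlier and repeats the same encoding and split. The `dtype=float` argument to `get_dummies` is an addition made here so that every column is numeric; the notebook itself does not use it.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# The ten rows printed earlier in the notebook
df = pd.DataFrame({
    "car_id": list(range(1, 11)),
    "brand": ["Toyota", "Hyundai", "Maruti", "Honda", "Toyota",
              "Hyundai", "Ford", "Honda", "Maruti", "Ford"],
    "mileage": [10000, 25000, 60000, 20000, 80000, 50000, 15000, 150000, 40000, 35000],
    "enginesize": [1.2, 1.4, 1.0, 1.5, 1.2, 1.3, 1.6, 1.5, 1.0, 1.6],
    "age": [3, 2, 5, 1, 4, 4, 1, 4, 1, 6],
    "price": [550000, 680000, 350000, 750000, 520000,
              480000, 460000, 610000, 300000, 680000],
})

df_encoded = pd.get_dummies(df, drop_first=True, dtype=float)
X = df_encoded.drop("price", axis=1)
y = df_encoded["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)

n_rows, n_features = X_train.shape
n_params = n_features + 1  # one coefficient per feature plus the intercept
print("training rows:", n_rows, "parameters:", n_params)
print("training R^2:", model.score(X_train, y_train))
```

With 8 training rows and 9 parameters, the model has enough freedom to pass through every training point, so a near-perfect training score would be expected and says nothing about how well the model generalizes.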


In [15]:
y_pred = model.predict(X_test)

print("MAE:", mean_absolute_error(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R² Score:", r2_score(y_test, y_pred))


MAE: 268464.02391975466
MSE: 78477176669.8655
RMSE: 280137.7815823233
R² Score: -1.1738830102455817
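The negative R² can be checked by hand from the numbers already shown. The sketch below uses the definition R² = 1 - SS_res / SS_tot, with the reported MSE and the prices of the two held-out rows. Those rows are taken to be indices 8 and 1, the only ones absent from the training split displayed above.

```python
import numpy as np

# Prices of the two held-out rows (indices 8 and 1 in the dataset)
y_test = np.array([300000, 680000])

# MSE as printed in the output above
mse = 78477176669.8655

ss_res = mse * len(y_test)                       # sum of squared residuals
ss_tot = np.sum((y_test - y_test.mean()) ** 2)   # squared error of predicting the mean
r2 = 1 - ss_res / ss_tot

print("SS_res:", ss_res)
print("SS_tot:", ss_tot)
print("R^2:", r2)
```

Because SS_res is more than twice SS_tot, the model's squared error on these two cars is larger than that of always predicting their mean price, which is what a negative R² means.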


In [16]:
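new_car = pd.DataFrame({
    # NOTE: these keys do not match the training columns
    # (car_id, mileage, enginesize, age, brand_*), so the reindex
    # below discards them and fills every feature with 0.
    "Year": [2020],
    "Mileage": [30000],
    "Engine": [1500],
    # include your other encoded columns...
})

# match columns
new_car = pd.get_dummies(new_car, drop_first=True).reindex(columns=X.columns, fill_value=0)

# With an all-zero row, this prediction equals model.intercept_
predicted_price = model.predict(new_car)
print("Predicted Car Price:", predicted_price[0])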


Predicted Car Price: 382215.9529320973
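The printed value deserves a caveat: the keys `Year`, `Mileage`, and `Engine` do not appear among the training columns, and `reindex` matches labels exactly, so every feature is filled with 0 and the result is the model's prediction for an all-zero row. A sketch of building the row with the training column names instead follows. The car's values (a Honda with mileage 30000, engine size 1.5, age 3, and `car_id` 11) are hypothetical, and `dtype=float` is an addition made here.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# The ten rows printed earlier in the notebook
df = pd.DataFrame({
    "car_id": list(range(1, 11)),
    "brand": ["Toyota", "Hyundai", "Maruti", "Honda", "Toyota",
              "Hyundai", "Ford", "Honda", "Maruti", "Ford"],
    "mileage": [10000, 25000, 60000, 20000, 80000, 50000, 15000, 150000, 40000, 35000],
    "enginesize": [1.2, 1.4, 1.0, 1.5, 1.2, 1.3, 1.6, 1.5, 1.0, 1.6],
    "age": [3, 2, 5, 1, 4, 4, 1, 4, 1, 6],
    "price": [550000, 680000, 350000, 750000, 520000,
              480000, 460000, 610000, 300000, 680000],
})

df_encoded = pd.get_dummies(df, drop_first=True, dtype=float)
X = df_encoded.drop("price", axis=1)
y = df_encoded["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)

# Hypothetical car, described with the same raw column names as the dataset
new_car = pd.DataFrame({
    "car_id": [11],
    "brand": ["Honda"],
    "mileage": [30000],
    "enginesize": [1.5],
    "age": [3],
})

# Encode WITHOUT drop_first: a single row has only one category, and
# drop_first would remove it. Then align to the training columns.
new_X = pd.get_dummies(new_car, dtype=float).reindex(columns=X.columns, fill_value=0.0)

print(new_X.iloc[0].to_dict())
print("Predicted price:", model.predict(new_X)[0])
```

Given the test metrics above, the number this produces should still be treated as illustrative only.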


#### Summary of the Car Price Prediction Model

In this project, a Linear Regression model was developed to predict car prices from the features available in the dataset: mileage, engine size, age, and brand (the `car_id` identifier was also left in the feature matrix). The dataset contains 10 cars and no missing values, so `dropna()` removed nothing. The categorical `brand` column was converted to numeric form using one-hot encoding. After preprocessing, the data was split into 8 training rows and 2 test rows to evaluate the model's performance on unseen data.

The Linear Regression model was then trained on the training rows and used to predict prices for the test rows. Four metrics were calculated: MAE (Mean Absolute Error), MSE (Mean Squared Error), RMSE (Root Mean Squared Error), and R² score. On this split the model scored an MAE of about 268,464, an RMSE of about 280,138, and an R² of about -1.17. These metrics indicate how well the model fits the data and how useful it is for practical prediction.

#### Conclusion

The results do not support using this model for prediction. A low MAE and RMSE and an R² close to 1 would indicate a good fit; here the opposite holds. An R² of about -1.17 means the model did worse on the test set than simply predicting the mean test price, and an MAE of roughly 268,000 is about half the average price in the dataset (538,000). The likely causes are the tiny dataset (8 training rows against 8 features plus an intercept), the inclusion of the `car_id` identifier as a feature, and a test set of only two cars, which makes any metric unstable.

Car pricing is also influenced by many complex factors, such as market demand, brand value, condition, maintenance history, and regional pricing, so even a well-fitted model's predictions will not be perfect. With more data, regularized linear models (Ridge/Lasso) or tree-based methods such as Random Forest Regression or XGBoost could further improve accuracy.
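As a rough illustration of the regularized alternative mentioned above, the sketch below fits a Ridge model and scores it with leave-one-out cross-validation, which uses every row for testing once and suits very small datasets better than a single two-row test split. The choices of `alpha=1.0`, feature standardization, dropping `car_id`, and `dtype=float` are assumptions made here and are not part of the notebook.

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The ten rows printed earlier in the notebook
df = pd.DataFrame({
    "car_id": list(range(1, 11)),
    "brand": ["Toyota", "Hyundai", "Maruti", "Honda", "Toyota",
              "Hyundai", "Ford", "Honda", "Maruti", "Ford"],
    "mileage": [10000, 25000, 60000, 20000, 80000, 50000, 15000, 150000, 40000, 35000],
    "enginesize": [1.2, 1.4, 1.0, 1.5, 1.2, 1.3, 1.6, 1.5, 1.0, 1.6],
    "age": [3, 2, 5, 1, 4, 4, 1, 4, 1, 6],
    "price": [550000, 680000, 350000, 750000, 520000,
              480000, 460000, 610000, 300000, 680000],
})

# Exclude the identifier and the target, then one-hot encode the brand
X = pd.get_dummies(df.drop(columns=["car_id", "price"]), drop_first=True, dtype=float)
y = df["price"]

# Standardize features so the penalty treats mileage and engine size comparably
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

# One fold per row: train on 9 cars, test on the remaining one
scores = cross_val_score(
    model, X, y, cv=LeaveOneOut(), scoring="neg_mean_absolute_error")
print("Leave-one-out MAE:", -scores.mean())
```

With only ten cars, the resulting error is still a coarse estimate, but it is less dependent on which two rows happen to land in the test set.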

Overall, the project demonstrates the end-to-end workflow (loading, encoding, splitting, fitting, and evaluating) for estimating car prices with an interpretable model. A much larger dataset is needed before its estimates would offer useful insights to buyers, sellers, and automotive businesses.