In [1]:
import pandas as pd

In [2]:
data= pd.read_csv('car data.csv')

In [3]:
data.head()

,Car_Name,Year,Selling_Price,Present_Price,Driven_kms,Fuel_Type,Selling_type,Transmission,Owner
0,ritz,2014,3.35,5.59,27000,Petrol,Dealer,Manual,0
1,sx4,2013,4.75,9.54,43000,Diesel,Dealer,Manual,0
2,ciaz,2017,7.25,9.85,6900,Petrol,Dealer,Manual,0
3,wagon r,2011,2.85,4.15,5200,Petrol,Dealer,Manual,0
4,swift,2014,4.6,6.87,42450,Diesel,Dealer,Manual,0


In [4]:
data.shape

(301, 9)

# Data Preprocessing

* Car Name Drop: Remove the "Car_Name" column, which is unlikely to significantly influence the car's price.

* Age Calculation: Create a new column, "Age," by subtracting the car's manufacturing year from a reference year (fixed at 2023 in this notebook), giving the age of each car in years.

* Categorical Variable Conversion: One-hot encode the categorical variables ("Fuel_Type," "Selling_type," and "Transmission") with pd.get_dummies so that the model receives numeric inputs. Passing drop_first=True drops one category per variable to act as the baseline, and the resulting boolean columns are then cast to 0/1 integers.
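
The effect of drop_first=True can be seen on a tiny frame. This is only a sketch: the toy data, and in particular the 'CNG' category, are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame; the 'CNG' value is a hypothetical third category for illustration.
toy = pd.DataFrame({"Fuel_Type": ["Petrol", "Diesel", "CNG", "Petrol"]})

# drop_first=True drops the first category in sorted order ('CNG' here),
# which becomes the implicit baseline; dtype=int yields 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=["Fuel_Type"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

A row with zeros in every dummy column therefore belongs to the dropped baseline category.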

In [5]:
data.drop(columns=['Car_Name'], inplace=True)

In [7]:

data = pd.get_dummies(data, columns=['Fuel_Type', 'Selling_type', 'Transmission'], drop_first=True)

In [9]:
data[['Fuel_Type_Diesel', 'Fuel_Type_Petrol', 'Selling_type_Individual', 'Transmission_Manual']] = data[['Fuel_Type_Diesel', 'Fuel_Type_Petrol', 'Selling_type_Individual', 'Transmission_Manual']].astype(int)

In [11]:
current_year = 2023 
data['Age'] = current_year - data['Year']

In [12]:
data.head()

,Year,Selling_Price,Present_Price,Driven_kms,Owner,Fuel_Type_Diesel,Fuel_Type_Petrol,Selling_type_Individual,Transmission_Manual,Age
0,2014,3.35,5.59,27000,0,0,1,0,1,9
1,2013,4.75,9.54,43000,0,1,0,0,1,10
2,2017,7.25,9.85,6900,0,0,1,0,1,6
3,2011,2.85,4.15,5200,0,0,1,0,1,12
4,2014,4.6,6.87,42450,0,1,0,0,1,9


# Splitting the Data:

In [13]:
from sklearn.model_selection import train_test_split

In [14]:
X= data.drop(['Selling_Price'],axis=1)

In [17]:
y= data.Selling_Price

In [19]:
X_train,X_test,y_train,y_test = train_test_split(X,y, test_size=0.15, random_state=42)

# Model Selection:

### I've chosen Lasso Regression for my car price prediction task because:
  1. Its L1 penalty can shrink the coefficients of less useful features to exactly zero, which acts as automatic feature selection.
  2. I need regularization to prevent overfitting, especially with a small dataset.
  3. Lasso provides clear model interpretability, helping me understand feature impacts.
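
The feature-selection behaviour mentioned in point 1 can be illustrated on synthetic data. This is a minimal sketch, assuming a made-up target that depends on only two of five random features; the alpha value is chosen for the illustration and is not a recommendation for the car dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# Only the first two columns influence the target; the other three are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = Lasso(alpha=0.5).fit(X, y)
coefs = model.coef_

# Coefficients of the noise columns are driven to exactly zero,
# while the two informative ones are kept (shrunk towards zero).
print(np.round(coefs, 2))
```

On the real model, the same inspection would be done by pairing car_model.coef_ with X_train.columns after fitting.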

In [21]:
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score

In [24]:
car_model= Lasso(alpha=1.0)

In [30]:
#Training the Model:
car_model.fit(X_train,y_train)

In [27]:
y_pred= car_model.predict(X_test)

In [28]:
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

In [29]:
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared (R2) Score: {r2:.2f}")

Mean Squared Error: 2.92
R-squared (R2) Score: 0.76


# Model Performance Summary

* Mean Squared Error (MSE): The average squared prediction error on the test set is approximately 2.92. Because the errors are squared, this value is in squared price units; its square root (RMSE) is about 1.71, which is a more direct measure of how far predictions typically deviate from the actual selling prices. Lower values are better.

* R-squared (R2) Score: With an R2 score of 0.76, the model explains approximately 76% of the variance in selling prices on the held-out test set. This suggests a reasonably good fit, although about a quarter of the variation remains unexplained.
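
To make the two metrics concrete, the sketch below computes them by hand and converts the reported MSE into an RMSE. The four-element arrays are made-up values for illustration, not results from the notebook; only the 2.92 figure comes from the output above.

```python
import math
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """1 minus the ratio of residual variation to total variation."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1 - ss_res / ss_tot)

# Made-up values, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

toy_mse = mse(y_true, y_pred)   # (1 + 0 + 1 + 0) / 4
toy_r2 = r2(y_true, y_pred)     # 1 - 2 / 20

# RMSE for the MSE reported by the notebook.
rmse_reported = math.sqrt(2.92)

print(toy_mse, toy_r2, round(rmse_reported, 2))
```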

In [31]:
import joblib

In [32]:
#save
carprice = "car_model_lasso.pkl"
joblib.dump(car_model, carprice)

['car_model_lasso.pkl']

In [33]:
num_samples = 5 

In [35]:
import numpy as np

In [40]:
# Draw the year first so that Age can be derived from it, as in training
years = np.random.choice(np.arange(2010, 2021), num_samples)
# One fuel type per car: 0 = baseline (dropped by drop_first), 1 = Diesel, 2 = Petrol,
# so the two fuel dummy columns are never both 1
fuel = np.random.choice([0, 1, 2], num_samples)

random_data = pd.DataFrame({
    'Year': years,
    'Present_Price': np.random.uniform(3, 15, num_samples),
    'Driven_kms': np.random.randint(1000, 80000, num_samples),
    'Owner': np.random.randint(0, 3, num_samples),
    'Fuel_Type_Diesel': (fuel == 1).astype(int),
    'Fuel_Type_Petrol': (fuel == 2).astype(int),
    'Selling_type_Individual': np.random.choice([0, 1], num_samples),
    'Transmission_Manual': np.random.choice([0, 1], num_samples),
    'Age': current_year - years  # same definition as in the training data
})

In [41]:
predicted_prices = car_model.predict(random_data)

In [42]:
for i, predicted_price in enumerate(predicted_prices):
    print(f"Sample {i + 1}: Predicted Price = {predicted_price:.2f}")

Sample 1: Predicted Price = 8.38
Sample 2: Predicted Price = 6.18
Sample 3: Predicted Price = 2.14
Sample 4: Predicted Price = 2.75
Sample 5: Predicted Price = 4.16


* In this project, a Lasso Regression model was trained to predict car selling prices. On the held-out 15% test split it reached a Mean Squared Error (MSE) of 2.92 and an R-squared (R2) Score of 0.76. The five randomly generated samples only demonstrate how to call the model: they carry no actual prices, so they cannot be used to measure accuracy. A meaningful evaluation on new data requires the actual selling prices of those cars, so that MSE and R2 can be computed on genuinely unseen examples.
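
Scoring a model on new data that does include actual prices could look like the following. This is a sketch under stated assumptions: evaluate_on_labeled_data is a hypothetical helper, and the data is synthetic stand-in data with only two features, not the car dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score

def evaluate_on_labeled_data(model, new_data, feature_cols, target_col="Selling_Price"):
    """Score a fitted model on new rows that include the actual target values."""
    y_new = new_data[target_col]
    y_hat = model.predict(new_data[feature_cols])
    return {"mse": mean_squared_error(y_new, y_hat), "r2": r2_score(y_new, y_hat)}

# Synthetic stand-in data (assumption: a simple linear relation plus noise).
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "Present_Price": rng.uniform(3, 15, 60),
    "Age": rng.integers(1, 12, 60),
})
frame["Selling_Price"] = (
    0.7 * frame["Present_Price"] - 0.1 * frame["Age"] + rng.normal(0, 0.3, 60)
)

features = ["Present_Price", "Age"]

# Fit on the first 40 rows, then score on the remaining 20 "new" labeled rows.
demo_model = Lasso(alpha=0.1).fit(frame.loc[:39, features], frame.loc[:39, "Selling_Price"])
scores = evaluate_on_labeled_data(demo_model, frame.loc[40:], features)

print(scores)
```

With the saved model, the same helper would be called on joblib.load("car_model_lasso.pkl") and a new DataFrame that has the training feature columns plus Selling_Price.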