# Car Price Prediction Report

This report summarizes the process and findings of building a model to predict car selling prices based on a given dataset.



In [None]:
!pip install -U langchain langchain-google-genai python-dotenv

In [None]:
import pandas as pd
import numpy as np
import sklearn

## 1. Data Loading and Exploration

- The dataset was loaded using pandas from a CSV file named `car.csv`.
- Initial exploration was performed using `df.head()`, `df.info()`, `df.isnull().sum()`, and `df.describe()`.

In [None]:
df = pd.read_csv('car.csv')

## Setting Up the LLM

In [None]:
import os

from dotenv import load_dotenv
from langchain.chat_models import init_chat_model
from langchain_core.messages import HumanMessage, SystemMessage

# Load the Gemini API key from a .env file
load_dotenv()
api_key = os.environ.get("gemini_api_key")

model = init_chat_model(
    "google_genai:gemini-2.0-flash",
    temperature=0,
    api_key=api_key
)

system_prompt = """
You are a machine learning engineer assisting me with this task.
Do exactly what the human message says."""
system_prompt = SystemMessage(content=system_prompt)

# Running message history, so each call sees the earlier exchanges
conversation = [system_prompt]

def ask_llm(user_input: str):
    """Send a prompt together with the conversation history and return the reply text."""
    conversation.append(HumanMessage(content=user_input))
    response = model.invoke(conversation)

    conversation.append(response)

    return response.content

# Each LLM cell below sends one human message through ask_llm



## 2. Understanding the Dataset with **LLM**

To get a better understanding of the dataset columns and their potential relevance for car price prediction, we utilized a Large Language Model (LLM).

- A human message containing the column names and a request to explain the dataset was passed to the LLM.
- The LLM's response provided insights into the meaning of each column and how they might influence the car selling price.

This process helped in identifying important features and potential preprocessing steps before model building.
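
The column names and types mentioned above can be condensed into a short text summary before they go into a prompt. The sketch below is one possible way to do that; the helper name `summarize_columns` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def summarize_columns(frame: pd.DataFrame, n_examples: int = 3) -> str:
    """Build one line per column: name, dtype, null count and a few example values."""
    lines = []
    for column in frame.columns:
        examples = frame[column].dropna().unique()[:n_examples].tolist()
        lines.append(
            f"{column} | dtype={frame[column].dtype} | "
            f"nulls={int(frame[column].isnull().sum())} | examples={examples}"
        )
    return "\n".join(lines)


# Toy stand-in for the car data (hypothetical values).
toy = pd.DataFrame(
    {
        "year": [2014, 2016, 2010],
        "fuel": ["Diesel", "Petrol", "Diesel"],
        "max_power": ["74", None, "bhp"],
    }
)
summary = summarize_columns(toy)
print(summary)
```

A summary like this stays small regardless of the number of rows and states the data types explicitly, which the printed form of a DataFrame does not.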

In [None]:
human_prompt = f""" {df} Explain what each column in this car dataset likely means based on its name and data type in sorted format and tell me the best ways to clean up messy parts
"""
print(ask_llm(human_prompt))

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
df.describe()

In [None]:
df.describe(include='all')

During the data cleaning phase, the LLM was asked to review the dataset's columns and suggest how to handle missing values and data inconsistencies. Its recommendations informed the cleaning steps described in the next section.

In [None]:
human_prompt = f""" use {df} if needed to tell me if i missed something on data cleaning
"""
print(ask_llm(human_prompt))

## 3. Data Cleaning

- Missing values in 'engine', 'seats', and 'mileage(km/ltr/kg)' were imputed with the median of their respective columns.
- Missing values in 'max_power' were imputed with the mean of the column after converting it to a numeric type (handling potential errors by coercing to NaN).
- A new feature, 'car_age', was created by subtracting the 'year' from 2025.
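
The three steps above can be tried on a small frame to see what each one does. This is a minimal sketch with made-up values; only the column names and the reference year 2025 come from the report.

```python
import numpy as np
import pandas as pd

# Toy frame reusing some of the report's column names; the values are invented.
toy = pd.DataFrame(
    {
        "year": [2014, 2018, 2010, 2020],
        "engine": [1248.0, np.nan, 1497.0, 998.0],
        "seats": [5.0, 5.0, np.nan, 7.0],
        "max_power": ["74", "bhp", None, "90"],
    }
)

# Strings that cannot be parsed as numbers (such as "bhp") become NaN instead of raising.
toy["max_power"] = pd.to_numeric(toy["max_power"], errors="coerce")

# A dict passed to fillna maps each column name to its own fill value.
toy = toy.fillna(
    {
        "engine": toy["engine"].median(),
        "seats": toy["seats"].median(),
        "max_power": toy["max_power"].mean(),
    }
)

# Age relative to the fixed reference year used in the report.
toy["car_age"] = 2025 - toy["year"]
print(toy)
```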

In [None]:
df['car_age'] = 2025 - df['year']  # new feature added

In [None]:
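# Convert 'max_power' to numeric; entries that cannot be parsed become NaN
df['max_power'] = pd.to_numeric(df['max_power'], errors='coerce')

# Median for 'engine', 'seats' and mileage; mean for 'max_power'
median_cols = ['engine', 'seats', 'mileage(km/ltr/kg)']
mean_cols = ['max_power']
df.fillna(df[median_cols].median(), inplace=True)
df.fillna(df[mean_cols].mean(), inplace=True)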

In [None]:
df.isnull().sum()

## 4. Exploratory Data Analysis (EDA)

- Visualizations were generated to understand the relationships between different features and the selling price:
    - Histogram of selling price distribution.
    - Scatter plots of selling price vs. 'km_driven' and 'car_age'.
    - Box plots showing selling price by 'fuel' type and 'transmission'.
    - Heatmap of the correlation matrix for numerical features.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(15, 10))

plt.subplot(2, 3, 1)
plt.hist(df['selling_price'], bins=30, edgecolor='black')
plt.title('Selling Price Distribution')
plt.xlabel('Price')

plt.subplot(2, 3, 2)
plt.scatter(df['km_driven'], df['selling_price'], alpha=0.6)
plt.title('Price vs KM Driven')
plt.xlabel('KM Driven')
plt.ylabel('Selling Price')

plt.subplot(2, 3, 3)
plt.scatter(df['car_age'], df['selling_price'], alpha=0.6)
plt.title('Price vs Car Age')
plt.xlabel('Car Age')
plt.ylabel('Selling Price')

plt.subplot(2, 3, 4)
sns.boxplot(data=df, x='fuel', y='selling_price')
plt.title('Price by Fuel Type')
plt.xticks(rotation=45)

plt.subplot(2, 3, 5)
sns.boxplot(data=df, x='transmission', y='selling_price')
plt.title('Price by Transmission')

plt.subplot(2, 3, 6)
correlation_matrix = df.select_dtypes(include=[np.number]).corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix')

plt.tight_layout()
plt.show()

In [None]:
human_prompt = f""" Interpret the findings from the following visualizations and tell me what insights I can gain about the relationships between features and the selling price:
- Selling Price Distribution histogram
- Selling Price vs KM Driven scatter plot
- Selling Price vs Car Age scatter plot
- Price by Fuel Type box plot
- Price by Transmission box plot
- Correlation Matrix heatmap

Based on these visualizations, recommend potential features that could be most impactful for predicting the selling price and suggest further analysis or preprocessing steps that might be beneficial for building a robust prediction model.

Use the current state of the dataframe as context if needed:
{df.to_string()}
"""
print(ask_llm(human_prompt))

In [None]:
df['seller_type'].unique(),df['fuel'].unique(),df['transmission'].unique()


## 5. Feature Engineering and Preprocessing

- Categorical features ('fuel', 'seller_type', 'transmission', 'owner') were converted into numerical representations using Label Encoding.
- The 'year' and 'name' columns were dropped after the correlation analysis: 'year' carries the same information as 'car_age', which is derived from it, and 'name' is a text column that was not encoded.
- The target variable 'selling_price' was separated from the features.
- The target variable 'selling_price' was scaled using `StandardScaler`.
- The data was split into training and testing sets (80% train, 20% test).
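
Label Encoding replaces each category with an integer code. The sketch below keeps one encoder per column in a dictionary so that every mapping can be inspected or inverted later. That is a variation for illustration, not the notebook's own approach (it reuses a single encoder), and the toy values are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical data (hypothetical values).
toy = pd.DataFrame(
    {
        "fuel": ["Diesel", "Petrol", "Diesel", "CNG"],
        "transmission": ["Manual", "Automatic", "Manual", "Manual"],
    }
)

# One encoder per column, stored by column name.
encoders = {}
for column in ["fuel", "transmission"]:
    encoders[column] = LabelEncoder()
    toy[column] = encoders[column].fit_transform(toy[column])

# LabelEncoder assigns integer codes in sorted order of the class labels.
fuel_mapping = {str(label): code for code, label in enumerate(encoders["fuel"].classes_)}
print(fuel_mapping)

# The stored encoder can turn codes back into the original labels.
decoded = [str(label) for label in encoders["fuel"].inverse_transform([0, 2])]
print(decoded)
```

The integer codes imply an ordering that the categories may not have; tree-based models are usually tolerant of this, while linear and distance-based models can be more sensitive to it.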

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['fuel'] = le.fit_transform(df['fuel'])
df['seller_type'] = le.fit_transform(df['seller_type'])
df['transmission'] = le.fit_transform(df['transmission'])
df['owner'] = le.fit_transform(df['owner'])

In [None]:
# # The car model could be extracted from the name here

# def extract_model_from_name(car_name):
#   parts = car_name.split(' ', 1)
#   if len(parts) > 1:
#     return parts[1]
#   return car_name # Return the whole name if only one word

# df['model'] = df['name'].apply(extract_model_from_name)


In [None]:
# prompt: analyze the correlation and tell me which columns to drop

# Analyze correlation matrix and identify columns to potentially drop
correlation_matrix = df.select_dtypes(include=[np.number]).corr()
print("Correlation Matrix:")
print(correlation_matrix)

target_correlation = correlation_matrix['selling_price'].abs().sort_values()
print("\nCorrelation with Selling Price")
print(target_correlation)

In [None]:
columns_to_drop = ['year', 'name']
df = df.drop(columns_to_drop, axis=1)

In [None]:
from sklearn.model_selection import train_test_split
x = df.drop(['selling_price'], axis=1)
y = df['selling_price']

In [None]:
x

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
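# Scale the target, then flatten to 1-D, the shape scikit-learn regressors expect for a single target
y = scaler.fit_transform(y.values.reshape(-1, 1)).ravel()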

In [None]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score

## 5. Model Selection and Training

- Four regression models were chosen and trained:
    - Linear Regression
    - Random Forest Regressor
    - Decision Tree Regressor
    - K-Nearest Neighbors Regressor
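
Training and scoring the four models follows the same fit, predict, evaluate pattern, which can be written as one loop over a dictionary. The sketch below uses synthetic data from `make_regression` as a stand-in, so the scores and the ranking it prints say nothing about the car dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption); the notebook uses the car features and scaled price.
X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "KNN": KNeighborsRegressor(n_neighbors=5),
}

# Fit each model on the training split and score it on the test split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    results[name] = {
        "mse": mean_squared_error(y_test, predictions),
        "r2": r2_score(y_test, predictions),
    }

for name, scores in results.items():
    print(f"{name}: MSE={scores['mse']:.2f}, R2={scores['r2']:.3f}")
```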
    

In [None]:
linear_regressor = LinearRegression()
rfr = RandomForestRegressor(n_estimators=100, random_state=42)
dtr = DecisionTreeRegressor(random_state=42)
knn = KNeighborsRegressor(n_neighbors=5) 



In [None]:
linear_regressor.fit(x_train, y_train)
y_pred_linear_reg = linear_regressor.predict(x_test)

In [None]:
rfr.fit(x_train, y_train)
y_pred_rfr = rfr.predict(x_test)

In [None]:
dtr.fit(x_train, y_train)
y_pred_dtr = dtr.predict(x_test)






In [None]:
knn.fit(x_train, y_train)
y_pred_knn = knn.predict(x_test)

## 6. Model Evaluation

- Mean Squared Error (MSE) and R-squared score were used to evaluate the performance of each model on the test set.
- The results showed that Random Forest Regressor and Decision Tree Regressor performed better in terms of MSE and R-squared compared to Linear Regression and KNN.
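
Because the target was standardized, the MSE values are in scaled units rather than price units, while R-squared is unchanged by the scaling. The sketch below computes both metrics by hand on made-up numbers, checks them against scikit-learn, and converts the error back to price units using the scaler's `scale_` attribute.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

# Made-up prices and predictions in original units.
prices = np.array([100_000.0, 250_000.0, 400_000.0, 650_000.0])
predicted_prices = np.array([120_000.0, 240_000.0, 380_000.0, 700_000.0])

# Standardize both with a scaler fitted on the actual prices.
scaler = StandardScaler()
y_true = scaler.fit_transform(prices.reshape(-1, 1)).ravel()
y_pred = scaler.transform(predicted_prices.reshape(-1, 1)).ravel()

# MSE is the mean of squared residuals; R^2 is 1 - SS_res / SS_tot.
mse_manual = np.mean((y_true - y_pred) ** 2)
r2_manual = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# The scaler divides by the standard deviation, so RMSE in price units is RMSE * scale_.
rmse_in_price_units = np.sqrt(mse_manual) * scaler.scale_[0]

print(f"MSE (scaled units): {mse_manual:.4f}")
print(f"R-squared: {r2_manual:.4f}")
print(f"RMSE (price units): {rmse_in_price_units:,.2f}")
```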

In [None]:
print("Linear Regression:")
print(mean_squared_error(y_test, y_pred_linear_reg))

print("\nRandom Forest Regression:")
print( mean_squared_error(y_test, y_pred_rfr))

print("\nDecision Tree Regression:")
print(mean_squared_error(y_test, y_pred_dtr))

print("\nK-Nearest Neighbors Regression:")
print(mean_squared_error(y_test, y_pred_knn))

In [None]:
print("\n linear regression: ")
print(r2_score(y_test, y_pred_linear_reg))

print("\n random forest regression: ")
print(r2_score(y_test, y_pred_rfr))

print("\n decision tree regression: ")
print(r2_score(y_test, y_pred_dtr))

print("\n knn regression: ")
print(r2_score(y_test, y_pred_knn))

# Random Forest regression also outperforms the Decision Tree


In [None]:
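import pickle

# Save the trained Random Forest model; the context manager closes the file afterwards
filename = 'car_price_model.pkl'
with open(filename, 'wb') as f:
    pickle.dump(rfr, f)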

## 7. Feature Importance

- Feature importance was calculated for the Random Forest and Decision Tree models.
- Features with importance greater than the mean importance were identified for both models. This helps in understanding which features contribute most to the predictions.
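
The impurity-based importances of a fitted tree model in scikit-learn are normalized to sum to 1, so the "greater than the mean" rule is equivalent to a threshold of 1 / (number of features). The sketch below shows this on synthetic data; the feature names are placeholders.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with placeholder feature names; only 2 of the 5 features carry signal.
X, y = make_regression(n_samples=300, n_features=5, n_informative=2, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
importances = forest.feature_importances_

# Importances sum to 1, so their mean equals 1 / n_features.
threshold = importances.mean()
selected = [name for name, value in zip(feature_names, importances) if value > threshold]

print(f"threshold = {threshold:.3f}")
print(selected)
```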

In [None]:
def features_above_mean_importance(model, feature_names):
    """Return the features whose importance exceeds the mean, most important first."""
    importance_df = pd.DataFrame({'feature': feature_names, 'importance': model.feature_importances_})
    importance_df = importance_df.sort_values('importance', ascending=False)
    mean_importance = importance_df['importance'].mean()
    return importance_df[importance_df['importance'] > mean_importance]['feature'].tolist()


important_features_rfr = features_above_mean_importance(rfr, x.columns)
important_features_dtr = features_above_mean_importance(dtr, x.columns)

print("decision tree", important_features_dtr)
print("random forest", important_features_rfr)

## 8. Prediction on Sample Data

- The trained Random Forest model was used to predict selling prices for a random sample of rows drawn from the full dataset, once with 10 rows and once with 100 rows.
- The scaled predictions were inverse transformed and compared with the actual selling prices.
- The R-squared score was calculated for each sample.
- Because the samples are drawn from the full dataset, they include rows the model was trained on, so these scores can overstate performance on unseen data.
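
For comparison, the same predict, inverse transform, compare pattern can be applied to held-out rows only. The sketch below uses synthetic data with a made-up price relationship and fits the target scaler on the training prices alone; both choices are assumptions for illustration and differ from the notebook, which fits the scaler on the full target.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: price falls with age and mileage (invented relationship).
age = rng.integers(1, 20, size=400)
km = rng.integers(5_000, 200_000, size=400)
price = 900_000 - 30_000 * age - 1.5 * km + rng.normal(0, 20_000, size=400)
X = np.column_stack([age, km])

X_train, X_test, price_train, price_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

# Fit the target scaler on training prices only, then train on the scaled target.
scaler = StandardScaler()
y_train = scaler.fit_transform(price_train.reshape(-1, 1)).ravel()
model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)

# Predict in scaled units, then map back to price units for a readable comparison.
predicted_price = scaler.inverse_transform(model.predict(X_test).reshape(-1, 1)).ravel()
r2_held_out = r2_score(price_test, predicted_price)

for actual, predicted in list(zip(price_test, predicted_price))[:5]:
    print(f"Actual Price: {actual:,.2f}, Predicted Price: {predicted:,.2f}")
print(f"R-squared on held-out rows: {r2_held_out:.4f}")
```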

In [None]:
# Random sample of rows from the full dataset (includes rows seen during training)
df_sample = df.sample(n=100, random_state=700).copy()
df_x_sample = df_sample.drop(['selling_price'], axis=1)
df_y_sample = df_sample['selling_price']

# Scale the actual prices with the fitted scaler; map predictions back to price units
df_y_sample_scaled = scaler.transform(df_y_sample.values.reshape(-1, 1))
y_pred_sample_scaled = rfr.predict(df_x_sample)
y_pred_sample = scaler.inverse_transform(y_pred_sample_scaled.reshape(-1, 1))
r2_subset = r2_score(df_y_sample_scaled, y_pred_sample_scaled)

print(f"\nPredictions for {len(df_sample)} randomly sampled rows:")
for i in range(len(df_sample)):
    print(f"Actual Price: {df_y_sample.iloc[i]:,.2f}, Predicted Price: {y_pred_sample[i][0]:,.2f}")
print(f"\nR-squared for {len(df_sample)} randomly sampled rows: {r2_subset:.4f}")

## Conclusion

This project predicted used car selling prices by following a structured machine learning workflow, with a Large Language Model (Gemini) assisting with dataset interpretation and data cleaning suggestions.


1. Four regression models (Linear Regression, Random Forest Regressor, Decision Tree Regressor, and K-Nearest Neighbors Regressor) were trained and evaluated using Mean Squared Error (MSE) and R-squared scores.

2. The Random Forest and Decision Tree models outperformed Linear Regression and KNN on both metrics. Feature importance analysis identified the features that contribute most to the predictions.

3. The trained Random Forest model was then used to make predictions on sample data and compare them with the actual prices.

4. On random samples drawn from the dataset, the model reached an R-squared of 0.9977 on 10 rows and 0.98 on 100 rows.

In [None]:
# Ask the LLM to draft a conclusion from the test-set R-squared scores
human_prompt = f"""
Based on the provided dataset (`{df}`), the features (`{x}`), and the test-set R-squared scores (linear regression: {r2_score(y_test, y_pred_linear_reg)}, random forest regression: {r2_score(y_test, y_pred_rfr)}, decision tree regression: {r2_score(y_test, y_pred_dtr)}, knn regression: {r2_score(y_test, y_pred_knn)}), write a comprehensive conclusion for this car price prediction project.

Your conclusion should:
1.  Summarize the project goal and the dataset used.
2.  Briefly touch upon the data cleaning and preprocessing steps, including the use of the LLM if relevant to the cleaning process.
3.  Discuss the models trained and their performance based on the provided R-squared scores.
4.  Highlight the best performing model and mention why it likely performed well (e.g., its ability to capture non-linear relationships).
5.  Mention the feature importance analysis and its value in understanding the key drivers of car price.
6.  Include the results from the sample predictions (first 10 rows and the 100 random rows) and interpret the R-squared scores for those samples.
7.  Overall, provide a clear and concise summary of the project's success and key takeaways.
"""
print(ask_llm(human_prompt))