# **Car Price Prediction using Machine Learning**

**Author:** Arya Paresh Pendbhaje

**Domain:** Data Science / Machine Learning

## **Project Overview**
In this project, we aim to build a predictive model that estimates the selling price of used cars based on various attributes. The dataset used for this analysis includes information such as the car's brand, model, year of manufacture, fuel type, transmission type, and previous ownership history.

## **Objective**
The primary goal is to develop a regression model using the **Linear Regression** algorithm. This model will learn the relationship between the car's features (e.g., kilometers driven, fuel type) and its selling price, allowing us to estimate prices for unseen data.

**Key Steps:**
1.  Data Import and Inspection
2.  Exploratory Data Analysis (EDA) & Visualization
3.  Data Preprocessing (Handling categorical text data)
4.  Feature Selection
5.  Model Training (Linear Regression)
6.  Model Evaluation (R-Squared Score)
7.  Interactive User Prediction

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

url = "https://raw.githubusercontent.com/YBI-Foundation/Dataset/main/Car%20Price.csv"
df = pd.read_csv(url)

df.head()

## **Data Inspection & Exploratory Data Analysis (EDA)**

Before training the model, it is crucial to understand the data's characteristics. We will perform the following checks:
* **Data Info:** checking data types (integers vs. objects) and looking for null values.
* **Descriptive Statistics:** analyzing mean, min, and max values.
* **Visualization:** using pairplots and histograms to observe the distribution of the target variable (`Selling_Price`) and correlations between features.
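
Correlations can also be inspected numerically. The sketch below uses a tiny made-up table (illustrative values, not rows from the real dataset) and `DataFrame.corr`; passing `numeric_only=True` skips text columns such as the fuel type.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'Year': [2012, 2014, 2016, 2018],
    'KM_Driven': [90000, 70000, 40000, 20000],
    'Selling_Price': [150000, 250000, 400000, 550000],
    'Fuel': ['Petrol', 'Diesel', 'Petrol', 'Diesel'],
})

# Pearson correlation between every pair of numeric columns
corr = toy.corr(numeric_only=True)

# How strongly each numeric feature moves with the target
print(corr['Selling_Price'].round(2))
```

In this toy table, newer cars are priced higher and heavily driven cars lower, so `Year` correlates positively and `KM_Driven` negatively with `Selling_Price`.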

In [None]:
df.info()

print("\nSummary Statistics:")
print(df.describe())

print("\nMissing Values:")
print(df.isnull().sum())

plt.figure(figsize=(8,5))
sns.histplot(df['Selling_Price'], kde=True)
plt.title('Distribution of Selling Price')
plt.show()

sns.pairplot(df)
plt.show()

## **Data Preprocessing**

Machine Learning models perform mathematical calculations, so they cannot directly interpret text data (strings). Our dataset contains several categorical columns that need to be converted into numbers:

1.  **Fuel:** (Petrol, Diesel, CNG, LPG, Electric) -> Converted to 0, 1, 2, 3, 4.
2.  **Seller_Type:** (Dealer, Individual, Trustmark Dealer) -> Converted to 0, 1, 2.
3.  **Transmission:** (Manual, Automatic) -> Converted to 0, 1.
4.  **Owner:** (First Owner, Second Owner, Third Owner, Fourth & Above Owner) -> Converted to 1, 2, 3, 4; Test Drive Car -> 0.

We will use the `replace()` function to map these text categories to integer values manually.
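
As a minimal sketch of this mapping, the toy example below applies `replace()` to a small made-up column (the values are illustrative, not taken from the dataset). It also shows `pd.get_dummies` as an alternative encoding that this notebook does not use: integer codes imply an order (CNG > Diesel > Petrol) that a linear model treats as meaningful, whereas one-hot indicator columns do not.

```python
import pandas as pd

# Toy data for illustration only; not rows from the car dataset
toy = pd.DataFrame({'Fuel': ['Petrol', 'Diesel', 'CNG', 'Petrol']})

# Manual integer encoding, in the style used in this notebook
fuel_codes = {'Petrol': 0, 'Diesel': 1, 'CNG': 2}
encoded = toy['Fuel'].replace(fuel_codes).astype(int)
print(encoded.tolist())

# Alternative (not used here): one indicator column per category
one_hot = pd.get_dummies(toy['Fuel'], prefix='Fuel')
print(one_hot.columns.tolist())
```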

In [None]:
print("Fuel Options:", df['Fuel'].value_counts())
print("Seller Options:", df['Seller_Type'].value_counts())
print("Transmission Options:", df['Transmission'].value_counts())
print("Owner Options:", df['Owner'].value_counts())

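# Integer code for each category, one mapping per column
encodings = {
    'Fuel': {'Petrol': 0, 'Diesel': 1, 'CNG': 2, 'LPG': 3, 'Electric': 4},
    'Seller_Type': {'Dealer': 0, 'Individual': 1, 'Trustmark Dealer': 2},
    'Transmission': {'Manual': 0, 'Automatic': 1},
    'Owner': {'First Owner': 1, 'Second Owner': 2, 'Third Owner': 3,
              'Fourth & Above Owner': 4, 'Test Drive Car': 0},
}

# replace() leaves unlisted categories as text, so astype(int) fails loudly
# if any value was missed instead of silently keeping strings in the column
for column, codes in encodings.items():
    df[column] = df[column].replace(codes).astype(int)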

df.head()

## **Define Target Variable (y) and Feature Matrix (X)**

Now that our data is numeric, we must separate it into:
* **Target (y):** The value we want to predict (`Selling_Price`).
* **Features (X):** The data used to make the prediction.

**Note:** We will drop the `Brand` and `Model` columns. Since there are hundreds of different car names, converting them to numbers would make the model overly complex for this specific project scope.

In [None]:
y = df['Selling_Price']

X = df.drop(['Brand', 'Model', 'Selling_Price'], axis=1)

print("X Shape:", X.shape)
print("y Shape:", y.shape)

## **Train-Test Split**

To evaluate our model fairly, we cannot test it on the same data used for training. We will split the dataset into two parts:
1.  **Training Set (70%):** Used to teach the model the relationship between features and price.
2.  **Testing Set (30%):** Used to test how well the model predicts prices for new cars.
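
To make the 70/30 split concrete, here is a small sketch on ten placeholder samples (made-up arrays, not the car data). Fixing `random_state` makes the shuffle reproducible, so the same rows land in each set on every run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten placeholder samples with two features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2
)
print(X_tr.shape, X_te.shape)  # 7 training rows, 3 test rows

# The same random_state reproduces the same partition
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2
)
print(np.array_equal(X_te, X_te_again))
```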

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

print("Training Set size:", X_train.shape)
print("Testing Set size:", X_test.shape)

## **Model Selection & Training**

We will use **Linear Regression** for this project. Linear Regression models the relationship between a target variable and one or more features by fitting a linear equation to the observed data. It is a simple, interpretable baseline for regression problems.
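
As an illustration of what fitting a linear equation means, the sketch below fits `LinearRegression` to synthetic points generated from y = 2x + 1 (made-up data, chosen so the answer is known in advance) and reads back the learned slope and intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data from y = 2x + 1 (illustrative only)
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * x.ravel() + 1.0

demo_model = LinearRegression().fit(x, y)

slope = demo_model.coef_[0]
intercept = demo_model.intercept_
print(f"slope={slope:.2f}, intercept={intercept:.2f}")

# A prediction is the intercept plus the weighted sum of the features
prediction = demo_model.predict(np.array([[4.0]]))[0]
print(round(prediction, 2))
```

With several features, as in this notebook, the model learns one coefficient per column of `X` plus a single intercept.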

In [None]:
from sklearn.linear_model import LinearRegression

lr_model = LinearRegression()

lr_model.fit(X_train, y_train)

print("Model Trained Successfully!")

## **Model Evaluation**

Now that the model is trained, we will use it to predict the prices of the cars in the `X_test` set. We will then compare these predicted prices against the actual prices (`y_test`) using the **R-Squared (R²) Score**.

* **R² Score:** A statistical measure of the proportion of variance in the target variable that is explained by the independent variables. A score of 1.0 indicates a perfect fit; the closer the score is to 1.0, the better the model explains the data.
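
The definition above corresponds to R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual values from their mean. A minimal sketch with made-up numbers, checked against scikit-learn's `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted values for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
r2_manual = 1.0 - ss_res / ss_tot

print(round(r2_manual, 4))
print(round(r2_score(y_true, y_hat), 4))
```

Both lines print the same value, since `r2_score` implements this formula.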

In [None]:
from sklearn.metrics import r2_score

# Predict prices for the held-out test set
y_pred = lr_model.predict(X_test)

# R² is a goodness-of-fit score, not an error: higher is better
r2 = r2_score(y_test, y_pred)
print("R-Squared Score:", r2)

plt.figure(figsize=(8,5))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.title("Actual Price vs Predicted Price")
plt.grid()
plt.show()

comparison = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(comparison.head())

## **Interactive Prediction System**

We can now use our trained model to predict the price of a car based on user input. Run the cell below, enter the car details as requested, and the system will output the estimated market value.

**Note:** For categorical features (like Fuel or Transmission), please enter the numeric code as shown in the prompt (e.g., enter `0` for Petrol).
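
Because `int(input(...))` raises `ValueError` on non-numeric text and accepts codes outside the listed options, a small validation step can make the prompt more forgiving. The function below, `parse_code`, is a hypothetical helper that is not part of the original notebook; it only checks a string against a set of allowed codes.

```python
def parse_code(text, allowed):
    """Return text as an int if it is one of the allowed codes, else None."""
    try:
        value = int(text.strip())
    except ValueError:
        return None
    return value if value in allowed else None


# Example: fuel codes 0-4 as listed in the prompt
fuel_codes = {0, 1, 2, 3, 4}
print(parse_code("1", fuel_codes))       # valid code
print(parse_code("7", fuel_codes))       # not one of the listed codes
print(parse_code("diesel", fuel_codes))  # not a number
```

In the interactive cell, a loop could keep prompting until `parse_code` returns something other than `None`.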

In [None]:
def predict_car_price():
    print("\n--- Enter Car Details to Predict Price ---")

    year = int(input("Year of Manufacture (e.g., 2018): "))

    km_driven = int(input("Kilometers Driven (e.g., 50000): "))

    print("\nFuel Type: [0: Petrol, 1: Diesel, 2: CNG, 3: LPG, 4: Electric]")
    fuel = int(input("Enter Fuel Code: "))

    print("\nSeller Type: [0: Dealer, 1: Individual, 2: Trustmark Dealer]")
    seller = int(input("Enter Seller Code: "))

    print("\nTransmission: [0: Manual, 1: Automatic]")
    transmission = int(input("Enter Transmission Code: "))

    print("\nOwner: [1: First Owner, 2: Second Owner, 3: Third Owner, 4: Fourth+, 0: Test Drive]")
    owner = int(input("Enter Owner Code: "))

    input_data = pd.DataFrame([[year, km_driven, fuel, seller, transmission, owner]],
                              columns=['Year', 'KM_Driven', 'Fuel', 'Seller_Type', 'Transmission', 'Owner'])

    predicted_price = lr_model.predict(input_data)

    print("\n------------------------------------")
    print(f"Estimated Selling Price: {predicted_price[0]:.2f}")
    print("------------------------------------")

predict_car_price()

## **Conclusion**

In this project, we successfully developed a Machine Learning model to predict car prices.

1.  **Data Analysis:** We explored the dataset and visualized price distributions.
2.  **Preprocessing:** We converted the categorical variables (`Fuel`, `Seller_Type`, `Transmission`, `Owner`) into numerical formats suitable for regression.
3.  **Modeling:** We implemented a Linear Regression model.
4.  **Result:** The R-Squared score measures how much of the variance in selling price the model explains on the test set. The scatter plot of actual versus predicted prices gives a visual check: the closer the points lie to the diagonal, the better the predictions.

This workflow demonstrates the practical application of Data Science in solving real-world pricing problems.