# Project Three: Predicting Car Price With Regression

<div>
<img src="../images/car.jpg" alt="car lot image" width="20%"/>
</div>

## Introduction
---

Cars are an integral part of modern life, serving as a primary mode of transportation and a significant economic asset. With millions of vehicles bought and sold each year, understanding what influences car prices, and why, is crucial for both buyers and sellers. To do that, we need data on factors that may determine a vehicle's value, such as current mileage, make, model, year, amenities, and engine specifications.

The [Car Prices Dataset](https://www.kaggle.com/datasets/sidharth178/car-prices-dataset?select=train.csv) provides all those features and more for predicting vehicle prices. Additionally, we will use this dataset to explore other insights, such as identifying which features contribute the most to price variations and understanding trends in car depreciation over time.
Some key questions we seek to answer include:
- Which factors—such as mileage, engine size, or brand—have the greatest impact on car price?
- Can we build an accurate predictive model for car prices using linear regression?
- How do different fuel types, transmission types, and drivetrain configurations affect a car’s value?
- Is there a pattern in car depreciation based on production year and mileage?

To predict a price, we'll use some form of regression. The first of three experiments will focus on linear regression. In the second experiment, we'll use another regression model to see whether accuracy improves. Finally, in the last experiment, we'll perform some data manipulation and compare how those changes affect both models.

The dataset has 18 columns (including the ID and the target) and around 20,000 rows. The columns are listed below, each with a description:
<table>
    <tr>
        <td><strong>Feature</strong></td>
        <td><strong>Description</strong></td>
    </tr>
    <tr>
        <td>ID</td>
        <td>Car ID</td>
    </tr>
    <tr>
        <td>Price</td>
        <td>The selling price of the car (Target Column)</td>
    </tr>
    <tr>
        <td>Levy</td>
        <td>Additional tax or charge on the vehicle</td>
    </tr>
    <tr>
        <td>Manufacturer</td>
        <td>The brand or manufacturer of the car</td>
    </tr>
    <tr>
        <td>Model</td>
        <td>The specific model of the car</td>
    </tr>
    <tr>
        <td>Prod. year</td>
        <td>The year the car was manufactured</td>
    </tr>
    <tr>
        <td>Category</td>
        <td>The type of car (e.g., Sedan, Jeep, Hatchback)</td>
    </tr>
    <tr>
        <td>Leather interior</td>
        <td>Indicates whether the car has a leather interior (Yes/No)</td>
    </tr>
    <tr>
        <td>Fuel type</td>
        <td>The type of fuel the car uses (e.g., Petrol, Diesel, CNG, Hybrid)</td>
    </tr>
    <tr>
        <td>Engine volume</td>
        <td>The engine capacity measured in liters (e.g., 2.0L, 3.5L)</td>
    </tr>
    <tr>
        <td>Mileage</td>
        <td>The total miles or kilometers the car has been driven</td>
    </tr>
    <tr>
        <td>Cylinders</td>
        <td>The number of cylinders in the engine</td>
    </tr>
    <tr>
        <td>Gear box type</td>
        <td>The type of transmission (e.g., Automatic, Manual)</td>
    </tr>
    <tr>
        <td>Drive wheels</td>
        <td>The drivetrain type (e.g., Front-Wheel Drive, Rear-Wheel Drive, All-Wheel Drive)</td>
    </tr>
    <tr>
        <td>Doors</td>
        <td>The number of doors on the vehicle (e.g., 2, 4, >5)</td>
    </tr>
    <tr>
        <td>Wheel</td>
        <td>Indicates whether the car is left-hand or right-hand drive</td>
    </tr>
    <tr>
        <td>Color</td>
        <td>The exterior color of the car</td>
    </tr>
    <tr>
        <td>Airbags</td>
        <td>The number of airbags the vehicle is equipped with</td>
    </tr>
</table>

In [32]:
# Imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from sklearn import datasets
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler, LabelEncoder

In [33]:
#dataset
df = pd.read_csv('train.csv')
df.head()

Unnamed: 0,ID,Price,Levy,Manufacturer,Model,Prod. year,Category,Leather interior,Fuel type,Engine volume,Mileage,Cylinders,Gear box type,Drive wheels,Doors,Wheel,Color,Airbags
0,45654403,13328,1399,LEXUS,RX 450,2010,Jeep,Yes,Hybrid,3.5,186005 km,6.0,Automatic,4x4,04-May,Left wheel,Silver,12
1,44731507,16621,1018,CHEVROLET,Equinox,2011,Jeep,No,Petrol,3.0,192000 km,6.0,Tiptronic,4x4,04-May,Left wheel,Black,8
2,45774419,8467,-,HONDA,FIT,2006,Hatchback,No,Petrol,1.3,200000 km,4.0,Variator,Front,04-May,Right-hand drive,Black,2
3,45769185,3607,862,FORD,Escape,2011,Jeep,Yes,Hybrid,2.5,168966 km,4.0,Automatic,4x4,04-May,Left wheel,White,0
4,45809263,11726,446,HONDA,FIT,2014,Hatchback,Yes,Petrol,1.3,91901 km,4.0,Automatic,Front,04-May,Left wheel,Silver,4


## Pre-processing
---

There are quite a few things I want to clean up in the dataset, but let's start with the basics.

In [34]:
# Combine manufacturer and model into a single nameplate column
df['Nameplate'] = df['Manufacturer'] + ' ' + df['Model']
# drop() returns a new frame; with inplace=True it returns None, so don't combine the two
df = df.drop(['Manufacturer', 'Model'], axis=1)
# Move Nameplate to the front and Price (the target) to the end
feature_cols = [c for c in df.columns if c not in ('Nameplate', 'Price')]
df = df[['Nameplate'] + feature_cols + ['Price']]
df.head()

### Handling nulls

In [None]:
df.isnull().sum()

While there are no nulls, except in the price, which we expected, the data does have rows where Levy uses a "-" instead of an actual null value. We'll just replace it with 0. Note that this treats a missing levy the same as a levy of zero.

In [None]:
df.replace('-', 0, inplace=True)

In [None]:
df.isnull().sum()

This shouldn't affect anything, since we mostly care about the price and the features related to the car itself, but it might help in the future.
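To make the placeholder issue concrete, here is a minimal sketch on a toy frame (the values are made up for illustration and are not taken from the dataset). It shows why `isnull()` reports nothing for "-" and how the replacement could be restricted to the Levy column only. The name `toy` is just an assumption for this example.

```python
import pandas as pd

# Toy frame standing in for the real data; values are illustrative only
toy = pd.DataFrame({"Levy": ["1399", "-", "862", "-"]})

# The "-" placeholder is an ordinary string, so isnull() does not count it
print(toy["Levy"].isnull().sum())

# Count the placeholder explicitly before replacing it
n_placeholder = (toy["Levy"] == "-").sum()

# Same replacement as in the notebook, but limited to the Levy column
toy["Levy"] = toy["Levy"].replace("-", 0).astype(int)
print(n_placeholder, toy["Levy"].tolist())
```

Replacing with `np.nan` instead of 0 would be another option if we later wanted to impute the levy rather than treat it as zero.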

### Duplicates

In [None]:
df.duplicated().value_counts()

There are some duplicates so let's just remove them.

In [None]:
df = df.drop_duplicates()

In [None]:
df.duplicated().value_counts()

### Dropping ID

We don't need the ID, as it shouldn't affect the price at all. Let's just drop it.

In [None]:
df = df.drop("ID", axis=1)

### Changing types from object to number

There are quite a few features that could be represented by a number instead of an object, so let's change those.

In [None]:
df.info()

In [None]:
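# Strip the " km" suffix (plain text, no regex needed) and store mileage as an integer
df["Mileage"] = df["Mileage"].str.replace(' km', '', regex=False).astype(int)
# Levy is still stored as strings after the "-" replacement, so cast it too
df['Levy'] = df['Levy'].astype(int)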

In [None]:
# Inspect the value counts of the two columns we want to double-check
for column in ["Doors", "Mileage"]:
    print(df[column].value_counts())
    print("\n" + "-"*50 + "\n")

In [None]:
df.info()
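The conversion above can be sketched on a small example. This is only an illustration on made-up rows shaped like the Mileage column. `pd.to_numeric` with `errors="coerce"` is shown as a more forgiving alternative to `astype(int)`, and the names `toy`, `raw`, and `parsed` are assumptions for this example.

```python
import pandas as pd

# Toy rows shaped like the Mileage column; not the real data
toy = pd.DataFrame({"Mileage": ["186005 km", "192000 km", "91901 km"]})

# Remove the literal " km" suffix, then cast the remaining digits to int
toy["Mileage"] = toy["Mileage"].str.replace(" km", "", regex=False).astype(int)
print(toy["Mileage"].dtype)

# astype(int) raises on anything that is not a number;
# to_numeric with errors="coerce" turns such values into NaN instead
raw = pd.Series(["1399", "n/a", "862"])
parsed = pd.to_numeric(raw, errors="coerce")
print(parsed.isna().sum())
```

Which one to use depends on whether we'd rather fail loudly on unexpected values or inspect them afterwards.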

## Data Analysis
---

Let's visualize some of the data to see if we can identify important relationships before we actually make a prediction. I'll be using the LabelEncoder() from sklearn to convert all categorical columns to numbers so we can use a heat map.

In [None]:
# Simple function to see the distribution of a categorical column
def plot_category_distribution(column, title):
    # Count how often each category appears in the column
    category_counts = df[column].value_counts()

    plt.figure(figsize=(8, 5))
    sns.barplot(x=category_counts.index, y=category_counts.values, palette="viridis")

    plt.title(f"Distribution of {title}", fontsize=14)
    plt.xlabel(title, fontsize=12)
    plt.ylabel("Count", fontsize=12)
    plt.xticks(rotation=45)
    plt.grid(axis="y", linestyle="--", alpha=0.7)

    plt.show()

In [None]:
plot_category_distribution("Fuel type", "Fuel Types")

In [None]:
# Label-encode every object (string) column so that df.corr() can include it
cat_cols = df.select_dtypes(include='object').columns
df[cat_cols] = df[cat_cols].apply(lambda col: LabelEncoder().fit_transform(col))

In [None]:
plt.figure(figsize=(14,7))
sns.heatmap(df.corr(),annot=True)
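To see what the label-encoding step does to a single column, here is a small sketch on a hand-written list of fuel types (illustrative values, not a sample from the dataset). LabelEncoder assigns integer codes in the sorted order of the labels, so the numbers carry no real ranking. That's worth keeping in mind when reading correlations for nominal columns in the heat map.

```python
from sklearn.preprocessing import LabelEncoder

# Hand-written example values, for illustration only
fuel = ["Petrol", "Diesel", "Hybrid", "Petrol", "CNG"]

encoder = LabelEncoder()
codes = encoder.fit_transform(fuel)

# classes_ holds the sorted unique labels; each code is an index into it
mapping = {str(label): idx for idx, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())

# The original labels can be recovered from the codes
restored = encoder.inverse_transform(codes).tolist()
```

One-hot encoding (for example with OneHotEncoder, which is already imported) would avoid the artificial ordering, at the cost of many more columns.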

## Experiment 1: Linear Regression
---

We're going to start with linear regression.

In [None]:
#Splitting the data
X = df.drop("Price", axis=1)
y = df["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
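# statsmodels OLS does not add an intercept on its own, so add a constant column first
X_train_const = sm.add_constant(X_train)
mlr_model = sm.OLS(y_train, X_train_const).fit()
print(mlr_model.summary())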

### Evaluation
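
One possible way to approach this step is with the `mean_squared_error` and `r2_score` functions that are already imported. The sketch below is only an assumption about how the evaluation could look. It uses synthetic data rather than the car dataset so that it runs on its own, and it fits sklearn's LinearRegression instead of the statsmodels model above. The numbers it prints say nothing about car prices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: NOT the car dataset, only a stand-in so the sketch is runnable
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + 10.0 + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# LinearRegression fits an intercept by default
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# RMSE is in the same units as the target; R^2 is the share of variance explained
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.3f}, R^2: {r2:.3f}")
```

With the real data, the same two metrics computed on `X_test` and `y_test` would give a baseline to compare against in the later experiments.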

## Experiment 2: Different Model
---

## Experiment 3: Changing Data
---

## Impact
---

## References
---