# Project Two: Predicting Car Price With Regression

<div>
<img src="../images/car.jpg" alt="car lot image" width="20%"/>
</div>

## Introduction
---

Cars are an integral part of modern life, serving as a primary mode of transportation and a significant economic asset. With millions of vehicles bought and sold each year, understanding what influences car prices, and why, is crucial for both buyers and sellers. To study this, we need data on the factors that may determine a vehicle's value, such as current mileage, make, model, year, amenities, and engine specifications.

The [Car Prices Dataset](https://www.kaggle.com/datasets/sidharth178/car-prices-dataset?select=train.csv) provides all those features and more for predicting vehicle prices. Additionally, we will use this dataset to explore other insights, such as identifying which features contribute the most to price variations and understanding trends in car depreciation over time.
Some key questions we seek to answer include:
- Which factors—such as mileage, engine size, or brand—have the greatest impact on car price?
- Can we build an accurate predictive model for car prices using linear regression?
- How do different fuel types, transmission types, and drivetrain configurations affect a car’s value?
- Is there a pattern in car depreciation based on production year and mileage?

To predict a price, we'll use some form of regression. The first of three experiments will focus on linear regression; in the second, we'll use another regression model to see whether accuracy improves. Finally, in the last experiment, we'll perform some data manipulation and compare how those changes affect both models.

The dataset has 18 columns (including the ID and the target, Price) and roughly 20,000 rows. Each column is listed below with a description:
<table>
    <tr>
        <td><strong>Feature</strong></td>
        <td><strong>Description</strong></td>
    </tr>
    <tr>
        <td>ID</td>
        <td>Car ID</td>
    </tr>
    <tr>
        <td>Price</td>
        <td>The selling price of the car (Target Column)</td>
    </tr>
    <tr>
        <td>Levy</td>
        <td>Additional tax or charge on the vehicle</td>
    </tr>
    <tr>
        <td>Manufacturer</td>
        <td>The brand or manufacturer of the car</td>
    </tr>
    <tr>
        <td>Model</td>
        <td>The specific model of the car</td>
    </tr>
    <tr>
        <td>Prod. year</td>
        <td>The year the car was manufactured</td>
    </tr>
    <tr>
        <td>Category</td>
        <td>The type of car (e.g., Sedan, Jeep, Hatchback)</td>
    </tr>
    <tr>
        <td>Leather interior</td>
        <td>Indicates whether the car has a leather interior (Yes/No)</td>
    </tr>
    <tr>
        <td>Fuel type</td>
        <td>The type of fuel the car uses (e.g., Petrol, Diesel, CNG, Hybrid)</td>
    </tr>
    <tr>
        <td>Engine volume</td>
        <td>The engine capacity measured in liters (e.g., 2.0L, 3.5L)</td>
    </tr>
    <tr>
        <td>Mileage</td>
        <td>The total miles or kilometers the car has been driven</td>
    </tr>
    <tr>
        <td>Cylinders</td>
        <td>The number of cylinders in the engine</td>
    </tr>
    <tr>
        <td>Gear box type</td>
        <td>The type of transmission (e.g., Automatic, Manual)</td>
    </tr>
    <tr>
        <td>Drive wheels</td>
        <td>The drivetrain type (e.g., Front-Wheel Drive, Rear-Wheel Drive, All-Wheel Drive)</td>
    </tr>
    <tr>
        <td>Doors</td>
        <td>The number of doors on the vehicle (e.g., 2, 4, >5)</td>
    </tr>
    <tr>
        <td>Wheel</td>
        <td>Indicates whether the car is left-hand or right-hand drive</td>
    </tr>
    <tr>
        <td>Color</td>
        <td>The exterior color of the car</td>
    </tr>
    <tr>
        <td>Airbags</td>
        <td>The number of airbags the vehicle is equipped with</td>
    </tr>
</table>

In [None]:
#Imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
#dataset
df = pd.read_csv('train.csv')
df.head()

## Data Analysis 
---

An interesting thing about this dataset is that the data comes in two files: `train.csv` (what we have loaded) and `test.csv`. `test.csv` has every price marked as `NaN`, which is what makes it the "test" set.

In [None]:
df2 = pd.read_csv('test.csv')
df2.head()

In [None]:
numerical_features = ["Levy", "Prod. year", "Engine volume", "Mileage", "Cylinders", "Airbags"]

for feature in numerical_features:
    plt.figure(figsize=(6, 4))
    sns.scatterplot(x=df[feature], y=df["Price"])
    plt.xlabel(feature)
    plt.ylabel("Price")
    plt.title(f"Price vs {feature}")
    plt.show()

## Pre-processing
---

There are quite a few things I want to clean up in the dataset, but let's start with the big one: the CSV files are split into "train" and "test". We'll combine the two datasets, do all our basic cleaning, and then separate them again. Luckily, the test set has no prices and no row in the training set is missing its price, so we can split them on whether the price is present.

### Combining datasets

In [None]:
df2.shape

In [None]:
df.shape

In [None]:
df_combined = pd.concat([df, df2], ignore_index=True)
df_combined.shape

Alright, we are ready to clean the data.

### Handling nulls

In [None]:
df_combined.isnull().sum()

There are no nulls except in the price, which we expected. However, the data sometimes uses a "-" in Levy instead of an actual null value, so we'll replace it with a real null.

In [None]:
df_combined.replace('-', np.nan, inplace=True)

In [None]:
df_combined.isnull().sum()

This shouldn't affect anything, since we mainly care about the price and the features related to the car, but it might help in the future.
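One thing worth checking after this replacement is the column's type: swapping "-" for a null does not by itself turn the remaining strings into numbers. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not rows from the dataset) to show one way to convert such a column with `pd.to_numeric`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration
toy = pd.DataFrame({
    "Levy": ["1399", "-", "862", "-"],
    "Price": [13328, 16621, 8467, 3607],
})

# Same replacement as above: "-" becomes a real missing value
toy = toy.replace("-", np.nan)

# The column still holds strings, so convert it to a numeric dtype;
# anything that cannot be parsed also becomes NaN
toy["Levy"] = pd.to_numeric(toy["Levy"], errors="coerce")

levy_missing = int(toy["Levy"].isna().sum())
print(toy.dtypes)
print("missing Levy values:", levy_missing)
```

Whether the real Levy column needs this step depends on how pandas read it, which `df_combined.info()` will show.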

### Duplicates

In [None]:
df_combined.duplicated().value_counts()

There are some duplicates so let's just remove them.

In [None]:
df_combined = df_combined.drop_duplicates()

In [None]:
df_combined.duplicated().value_counts()

### Checking for unusual types

In [None]:
df_combined.info()

There don't seem to be any issues here. I am concerned about the doors and wheels being objects rather than ints, but we'll move on.
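If we want to follow up on that concern later, a quick way to see why a column is stored as an object is to list its distinct values. This is a minimal sketch on a made-up frame; `toy` and its values are hypothetical and only mirror the examples given in the feature table.

```python
import pandas as pd

# Toy stand-in; the values are made up and only mirror the feature table
toy = pd.DataFrame({
    "Doors": ["4", "2", ">5", "4"],
    "Wheel": ["Left", "Right", "Left", "Left"],
    "Airbags": [12, 8, 2, 0],
})

# Select every column that is not numeric
non_numeric_cols = toy.select_dtypes(exclude="number").columns

# Collect the distinct values per column to spot entries like ">5"
# that keep a column from being read as an integer
unique_values = {col: sorted(toy[col].unique()) for col in non_numeric_cols}

for col, values in unique_values.items():
    print(col, values)
```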

### Splitting back the test and training dfs

In [None]:
df = df_combined[df_combined['Price'].notnull()]
df.shape

In [None]:
df_test = df_combined[df_combined['Price'].isna()]
df_test.shape

It looks like merging and separating the original rows maintained their general shape. Some rows were cleaned out, so the counts aren't equal to what we started with, but we can now move on to the more interesting part: predicting the car price.
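To make that shape check explicit, we can confirm that the two pieces add back up to the cleaned frame and count how many rows the cleaning removed. The sketch below runs on tiny made-up frames (`train`, `test`, and their values are hypothetical) and follows the same combine, clean, and split steps.

```python
import numpy as np
import pandas as pd

# Made-up frames: the training rows contain one exact duplicate
train = pd.DataFrame({"ID": [1, 2, 2], "Price": [13328.0, 16621.0, 16621.0]})
test = pd.DataFrame({"ID": [3, 4], "Price": [np.nan, np.nan]})

# Combine, then clean (only duplicate removal here)
combined = pd.concat([train, test], ignore_index=True).drop_duplicates()

# Split on whether the price is present
train_part = combined[combined["Price"].notnull()]
test_part = combined[combined["Price"].isna()]

# Rows lost to cleaning, and a check that the split loses nothing
rows_removed = len(train) + len(test) - len(combined)
split_is_complete = len(train_part) + len(test_part) == len(combined)

print("rows removed by cleaning:", rows_removed)
print("split covers every row:", split_is_complete)
```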

## Experiment 1: Linear Regression
---

We're going to start with linear regression.

In [None]:
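X_train = df.select_dtypes("number").drop(columns=["ID", "Price"])  # assumption: start with the already-numeric columns only
y_train = df["Price"]
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)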

### Evaluation
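Since `test.csv` has no prices, one option is to hold out part of the training rows and score the model on them. The sketch below shows that idea on synthetic data: the frame `toy`, its coefficients, and the 80/20 split are all assumptions made for illustration, not results from the car dataset.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the numeric features (made-up values)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Prod. year": rng.integers(1995, 2021, size=n),
    "Cylinders": rng.choice([4, 6, 8], size=n),
    "Airbags": rng.integers(0, 13, size=n),
})

# Made-up linear relationship plus noise, so the model has something to recover
toy["Price"] = (
    500 * (toy["Prod. year"] - 1995)
    + 1000 * toy["Cylinders"]
    + 200 * toy["Airbags"]
    + rng.normal(0, 500, size=n)
)

X = toy.drop(columns="Price")
y = toy["Price"]

# Hold out 20% of the labelled rows for validation
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_val)

# Root mean squared error is in the same units as the price
rmse = np.sqrt(mean_squared_error(y_val, y_pred))
r2 = r2_score(y_val, y_pred)

print(f"RMSE: {rmse:.2f}")
print(f"R^2:  {r2:.3f}")
```

On the real data the same pattern would apply to `df`, with the caveat that the scores will depend heavily on which features are included and how the categorical columns are encoded.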

## Experiment 2: Different Model
---

## Experiment 3: Changing Data
---

## Impact
---

## References
---