# **Used Car Price Prediction**

# Business Understanding

## Dataset Description
Source: [Kaggle | Car Price Prediction Challenge](https://www.kaggle.com/datasets/deepcontractor/car-price-prediction-challenge)

This dataset provides detailed information about used cars, including their specifications and pricing details. The dataset consists of 19,237 rows and 18 columns, covering various types of used cars such as sedans, SUVs, and hatchbacks.

| Column           | Description                                        |
| ---------------- | -------------------------------------------------- |
| ID               | Unique identifier for each car listing.            |
| Price            | Selling price of the car.                          |
| Levy             | Tax applied to importing/exporting the car.        |
| Manufacturer     | Brand or company that produced the car.            |
| Model            | Specific model of the car.                         |
| Prod. year       | Year the car was manufactured.                     |
| Category         | Type of car.                                       |
| Leather interior | Indicates whether the car has leather interior.    |
| Fuel type        | Type of fuel the car uses.                         |
| Engine volume    | Engine displacement.                               |
| Mileage          | Distance the car has traveled.                     |
| Cylinders        | Number of cylinders in the engine.                 |
| Gear box type    | Transmission type.                                 |
| Drive wheels     | Drivetrain configuration.                          |
| Doors            | Number of doors on the car.                        |
| Wheel            | Steering position.                                 |
| Color            | Exterior color of the car.                         |
| Airbags          | Number of airbags installed in the car for safety. |

## Background

The US used car market is influenced by various factors, including age, fuel efficiency, and engine size, all of which contribute to price fluctuations. Buyers and sellers struggle with pricing accuracy, often relying on subjective assessments rather than data-driven insights. By analyzing structured attributes such as mileage, transmission type, etc., machine learning models can offer valuable predictions to improve pricing transparency and decision-making.

## Problem Statement

Using regression-based machine learning techniques, this project aims to develop a predictive model that estimates accurate resale values of used cars based on historical data. The goal is to achieve an RMSE of less than 200 EUR, ensuring reliable pricing predictions. The model will be trained and evaluated within a six-month timeframe, facilitating timely implementation for practical market use. This solution enhances price transparency for buyers and sellers, contributing to greater efficiency and fairness in the US used car market.
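For reference, RMSE is the square root of the mean squared difference between actual and predicted prices. The sketch below shows one way to compute it and compare it against the 200 target; the `rmse` helper and the sample prices are hypothetical and are not taken from the dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# hypothetical prices, for illustration only
actual = [10000, 15000, 20000]
predicted = [10100, 14800, 20300]

score = rmse(actual, predicted)
print(f"RMSE: {score:.2f}")
print("meets target:", score < 200)
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones, which matters for a price column with extreme values.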

## Problem Breakdown


# Library & Function

## Library

In [16]:
# data manipulation
import pandas as pd
import numpy as np

# data viz
import seaborn as sns
import matplotlib.pyplot as plt

## Function

In [17]:
# overview
def check_overview(df):
    '''
    check_overview prints summary information about a dataset.

    Arguments:
    df = the dataset to inspect.

    Output:
    Overall dataset information, missing values, duplicated values, and the number of unique values in each column.
    '''
    # dataset overview (df.info() prints directly and returns None)
    df.info()

    # check missing values: share of missing cells, then counts per affected column
    print(f"\nmissing values: {round((df.isna().sum().sum() / df.size) * 100, 2)}% \n{df.isna().sum()[df.isna().sum()>0]}")

    # check duplicated rows: share of rows, then count
    print(f"\nduplicated values: {round(((df.duplicated().sum())/len(df))*100,2)}% \n{df.duplicated().sum()}\n")

    # check column names and number of unique values
    for col in df:
        print(f'{col}-#nunique: {df[col].nunique()}')

# Load Data

In [18]:
# load data from csv
df = pd.read_csv('car_price_prediction.csv')
df.tail(7)

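|       | ID       | Price | Levy | Manufacturer  | Model   | Prod. year | Category  | Leather interior | Fuel type | Engine volume | Mileage   | Cylinders | Gear box type | Drive wheels | Doors  | Wheel      | Color  | Airbags |
| ----- | -------- | ----- | ---- | ------------- | ------- | ---------- | --------- | ---------------- | --------- | ------------- | --------- | --------- | ------------- | ------------ | ------ | ---------- | ------ | ------- |
| 19230 | 45760891 | 470   | 645  | TOYOTA        | Prius   | 2011       | Hatchback | Yes              | Hybrid    | 1.8           | 307325 km | 4.0       | Automatic     | Front        | 04-May | Left wheel | Silver | 12      |
| 19231 | 45772306 | 5802  | 1055 | MERCEDES-BENZ | E 350   | 2013       | Sedan     | Yes              | Diesel    | 3.5           | 107800 km | 6.0       | Automatic     | Rear         | 04-May | Left wheel | Grey   | 12      |
| 19232 | 45798355 | 8467  | -    | MERCEDES-BENZ | CLK 200 | 1999       | Coupe     | Yes              | CNG       | 2.0 Turbo     | 300000 km | 4.0       | Manual        | Rear         | 02-Mar | Left wheel | Silver | 5       |
| 19233 | 45778856 | 15681 | 831  | HYUNDAI       | Sonata  | 2011       | Sedan     | Yes              | Petrol    | 2.4           | 161600 km | 4.0       | Tiptronic     | Front        | 04-May | Left wheel | Red    | 8       |
| 19234 | 45804997 | 26108 | 836  | HYUNDAI       | Tucson  | 2010       | Jeep      | Yes              | Diesel    | 2             | 116365 km | 4.0       | Automatic     | Front        | 04-May | Left wheel | Grey   | 4       |
| 19235 | 45793526 | 5331  | 1288 | CHEVROLET     | Captiva | 2007       | Jeep      | Yes              | Diesel    | 2             | 51258 km  | 4.0       | Automatic     | Front        | 04-May | Left wheel | Black  | 4       |
| 19236 | 45813273 | 470   | 753  | HYUNDAI       | Sonata  | 2012       | Sedan     | Yes              | Hybrid    | 2.4           | 186923 km | 4.0       | Automatic     | Front        | 04-May | Left wheel | White  | 12      |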


In [19]:
check_overview(df)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19237 entries, 0 to 19236
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   ID                19237 non-null  int64  
 1   Price             19237 non-null  int64  
 2   Levy              19237 non-null  object 
 3   Manufacturer      19237 non-null  object 
 4   Model             19237 non-null  object 
 5   Prod. year        19237 non-null  int64  
 6   Category          19237 non-null  object 
 7   Leather interior  19237 non-null  object 
 8   Fuel type         19237 non-null  object 
 9   Engine volume     19237 non-null  object 
 10  Mileage           19237 non-null  object 
 11  Cylinders         19237 non-null  float64
 12  Gear box type     19237 non-null  object 
 13  Drive wheels      19237 non-null  object 
 14  Doors             19237 non-null  object 
 15  Wheel             19237 non-null  object 
 16  Color             19237 non-null  object

Overview:
- The data has 19237 entries and 18 columns:
    - Categorical: Manufacturer, Model, Category, Leather interior, Fuel type, Gear box type, Drive wheels, Wheel, and Color.
    - Numerical: Price, Levy, Prod. year, Engine volume, Mileage, Cylinders, Doors, and Airbags.
- Some columns have an incorrect data type:
    - Levy, Engine volume, Mileage, and Doors are stored as object and should be float.
- Column naming is inconsistent and will be changed to snake_case.
- No missing values were detected, but Levy contains "-" values, which could be a non-standard missing value. The other columns, especially the categorical ones, need to be checked for similar non-standard missing values.
- 1.63% of entries are duplicates and will be deleted.
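To see why a column that should be numeric was loaded as `object`, it can help to list the values that fail numeric parsing. The sketch below does this on a toy sample whose values are copied from the `df.tail(7)` output; the `non_numeric_values` helper is a hypothetical name introduced here, not part of the notebook.

```python
import pandas as pd

def non_numeric_values(series):
    """Return the distinct values in `series` that cannot be parsed as numbers."""
    as_text = series.astype(str)
    parsed = pd.to_numeric(as_text, errors="coerce")
    return sorted(as_text[parsed.isna()].unique())

# toy sample shaped like the raw columns (values copied from df.tail(7))
sample = pd.DataFrame({
    "Levy": ["645", "1055", "-", "831"],
    "Mileage": ["307325 km", "107800 km", "300000 km", "161600 km"],
    "Engine volume": ["1.8", "3.5", "2.0 Turbo", "2.4"],
})

for col in sample.columns:
    print(col, "->", non_numeric_values(sample[col]))
```

In this sample, Levy fails only on "-", Engine volume only on the "Turbo" suffix, and every Mileage value fails because of the " km" unit.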

# Data Cleaning

## Column Name Handling

Column names are inconsistently formatted and will be changed to snake_case.

In [20]:
# change col. name type to snake case
df.columns = df.columns.str.lower().str.replace(' ', '_')

# check
df.columns

Index(['id', 'price', 'levy', 'manufacturer', 'model', 'prod._year',
       'category', 'leather_interior', 'fuel_type', 'engine_volume', 'mileage',
       'cylinders', 'gear_box_type', 'drive_wheels', 'doors', 'wheel', 'color',
       'airbags'],
      dtype='object')

Now the column names follow a consistent format and are easier to reference.
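Note that `prod._year` in the output above still contains a period. If a stricter snake_case is wanted, one option is to collapse every run of non-alphanumeric characters into a single underscore. This is a sketch of an alternative, not the approach used in this notebook, and `to_snake_case` is a hypothetical helper; adopting it would rename `prod._year` to `prod_year` in every later cell.

```python
import re

def to_snake_case(name):
    """Lowercase a column name and collapse non-alphanumeric runs into '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip().lower())
    return cleaned.strip("_")

columns = ["ID", "Prod. year", "Leather interior", "Gear box type"]
print([to_snake_case(c) for c in columns])
```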

In [21]:
# save
df_org = df.copy()

In [22]:
# load
df = df_org.copy()

## Duplicate Handling

In [29]:
# drop duplicate
df.drop_duplicates(keep='last',inplace=True)

# check
print('remaining data:',len(df))
print('duplicated data:',df.duplicated().sum())

remaining data: 18924
duplicated data: 0


Duplicate rows have been removed, keeping the last occurrence of each so that one copy of every listing is retained. 18924 records remain after deletion.
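As a side note on the `keep` argument: it controls which copy of each duplicated row survives, not how many rows survive. A small sketch on made-up data (the `toy` frame is illustrative only):

```python
import pandas as pd

# made-up rows: id 1 appears twice, id 3 appears three times
toy = pd.DataFrame({
    "id": [1, 1, 2, 3, 3, 3],
    "price": [100, 100, 200, 300, 300, 300],
})

kept_first = toy.drop_duplicates(keep="first")
kept_last = toy.drop_duplicates(keep="last")

# same number of rows either way; only the surviving index labels differ
print(len(kept_first), len(kept_last))
print(kept_first.index.tolist())
print(kept_last.index.tolist())
```

Only `keep=False` would drop every copy of a duplicated row, which is the setting that loses extra data.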

## Missing Value Handling

### Check Non-standard Missing Value

In [30]:
df.columns

Index(['id', 'price', 'levy', 'manufacturer', 'model', 'prod._year',
       'category', 'leather_interior', 'fuel_type', 'engine_volume', 'mileage',
       'cylinders', 'gear_box_type', 'drive_wheels', 'doors', 'wheel', 'color',
       'airbags'],
      dtype='object')

In [66]:
# categorical to check
cat_check = ['manufacturer','category','fuel_type','gear_box_type','drive_wheels','color']

# categorical value check
print('Categorical value check')
for i in cat_check:
    # print column and its value
    print(i,df[i].unique())
    print()

# numerical value check
print('Numerical value check')
for i in df.columns:
    # detect unique value
    unique = set(df[i].unique())
    # detect unique value that matched with 0, '-', or '0'
    matched = unique.intersection({0, '-', '0'})

    # print out matched value
    if matched:
        print(i,':',matched)


Categorical value check
manufacturer ['LEXUS' 'HONDA' 'FORD' 'HYUNDAI' 'TOYOTA' 'MERCEDES-BENZ' 'OPEL'
 'PORSCHE' 'JEEP' 'VOLKSWAGEN' 'AUDI' 'RENAULT' 'NISSAN' 'BMW' 'CHEVROLET'
 'SUBARU' 'DAEWOO' 'KIA' 'MITSUBISHI' 'SSANGYONG' 'MAZDA' 'GMC' 'FIAT'
 'INFINITI' 'ALFA ROMEO' 'SUZUKI' 'ACURA' 'LINCOLN' 'VAZ' 'GAZ' 'CITROEN'
 'LAND ROVER' 'MINI' 'DODGE' 'CHRYSLER' 'JAGUAR' 'ISUZU' 'SKODA'
 'DAIHATSU' 'BUICK' 'TESLA' 'CADILLAC' 'PEUGEOT' 'BENTLEY' 'VOLVO' 'სხვა'
 'HAVAL' 'HUMMER' 'SCION' 'UAZ' 'MERCURY' 'ZAZ' 'ROVER' 'SEAT' 'LANCIA'
 'MOSKVICH' 'MASERATI' 'FERRARI' 'SAAB' 'LAMBORGHINI' 'ROLLS-ROYCE'
 'PONTIAC' 'SATURN' 'ASTON MARTIN' 'GREATWALL']

category ['Jeep' 'Hatchback' 'Sedan' 'Microbus' 'Goods wagon' 'Universal' 'Coupe'
 'Minivan' 'Cabriolet' 'Limousine' 'Pickup']

fuel_type ['Hybrid' 'Petrol' 'Diesel' 'CNG' 'Plug-in Hybrid' 'LPG' 'Hydrogen']

gear_box_type ['Automatic' 'Variator' 'Manual' 'Tiptronic']

drive_wheels ['4x4' 'Front' 'Rear']

color ['Silver' 'Black' 'White' 'Grey' 'Blu

It was detected that:
- manufacturer has 'სხვა' (Georgian for "other").
- category has 'Universal'.
- levy has '-'.
- engine_volume has '0'.
- airbags has 0.

These columns need to be checked later to determine whether each value represents a missing value.
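Before deciding how to treat these values, it may be useful to measure how often each one occurs. The sketch below counts the share of placeholder values per column on a toy frame and then shows how a confirmed placeholder could be turned into `NaN`. The `placeholders` mapping and the toy values are assumptions for illustration; whether 0 airbags is a genuine value still has to be decided from the real data.

```python
import pandas as pd

# toy frame; the placeholder tokens mirror the ones flagged above
toy = pd.DataFrame({
    "levy": ["645", "-", "831", "-"],
    "engine_volume": ["1.8", "0", "2.4", "2.0 Turbo"],
    "airbags": [12, 0, 8, 4],
})

# hypothetical mapping: which token to treat as a placeholder in each column
placeholders = {"levy": ["-"], "engine_volume": ["0"], "airbags": [0]}

shares = {}
for col, tokens in placeholders.items():
    shares[col] = toy[col].isin(tokens).mean() * 100
    print(f"{col}: {shares[col]:.1f}% placeholder values")

# only once a token is confirmed as missing would it be replaced by NaN
toy["levy"] = toy["levy"].mask(toy["levy"] == "-")
print("levy missing after masking:", toy["levy"].isna().sum())
```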


### 'სხვა' in Manufacturer

In [67]:
df[df['manufacturer'] == 'სხვა']

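|      | id       | price | levy | manufacturer | model       | prod._year | category | leather_interior | fuel_type | engine_volume | mileage   | cylinders | gear_box_type | drive_wheels | doors  | wheel      | color  | airbags |
| ---- | -------- | ----- | ---- | ------------ | ----------- | ---------- | -------- | ---------------- | --------- | ------------- | --------- | --------- | ------------- | ------------ | ------ | ---------- | ------ | ------- |
| 2358 | 45779593 | 25089 | -    | სხვა         | IVECO DAYLY | 2007       | Microbus | No               | Diesel    | 2.3 Turbo     | 328000 km | 4.0       | Manual        | Rear         | 04-May | Left wheel | White  | 1       |
| 4792 | 39223518 | 9408  | -    | სხვა         | GONOW       | 2005       | Jeep     | Yes              | Petrol    | 2.3           | 102000 km | 4.0       | Manual        | Rear         | 04-May | Left wheel | Silver | 2       |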


To do:
- ~~Change column name to snake_case~~
- ~~Delete duplicate~~
- Check other potential non-standard missing values.
- Change levy, engine_volume, mileage, and doors to float.
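The remaining type conversion could look roughly like the sketch below, shown on toy rows copied from the `df.tail(7)` output. It rests on several assumptions that would need to be verified against the full data: that "-" in levy means missing, that "Turbo" is the only text suffix in engine_volume, and that the leading digits of values such as `04-May` carry the door count. The `is_turbo` column name is hypothetical.

```python
import pandas as pd

# toy rows copied from the df.tail(7) output
toy = pd.DataFrame({
    "levy": ["645", "-", "831"],
    "engine_volume": ["1.8", "2.0 Turbo", "2.4"],
    "mileage": ["307325 km", "300000 km", "161600 km"],
    "doors": ["04-May", "02-Mar", "04-May"],
})

# '-' becomes NaN because it cannot be parsed as a number
toy["levy"] = pd.to_numeric(toy["levy"], errors="coerce")

# keep the turbo information in a separate flag before stripping the suffix
toy["is_turbo"] = toy["engine_volume"].str.contains("Turbo")
toy["engine_volume"] = (
    toy["engine_volume"].str.replace(" Turbo", "", regex=False).astype(float)
)

# strip the unit suffix
toy["mileage"] = toy["mileage"].str.replace(" km", "", regex=False).astype(float)

# assumption: the leading digits of values such as '04-May' carry the door count
toy["doors"] = toy["doors"].str.extract(r"^(\d+)", expand=False).astype(float)

print(toy.dtypes)
print(toy)
```

Keeping the turbo flag as its own column avoids discarding information that may be relevant to price.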