# Exploratory Data Analysis (EDA)


## Understanding the attributes
- listing_id - the unique id for the listing of a used car
  - ID in the URL: https://www.sgcarmart.com/used_cars/info.php?ID=1292132
- title - title of the listing; free text attribute typically containing the **make, model, engine type/size**
- make - make/brand of the car
- model - model of the car (for the given make/brand)
- description - description of the listing; free text attribute
  - note: the data dictionary calls this the "title of the listing", which is a typo; it is the seller's free-form text description
- manufactured - Year the car was manufactured
- original_reg_date - Original registration date; date format string
  - Date the car was first registered, in Singapore or another country
- reg_date - Registration date; date format string
  - Registration date of the current COE (a COE is valid for 10 years)
- type_of_vehicle - Type of vehicle (e.g., "sports car", "luxury sedan")
- category - Category of the car; list of categories, **comma-separated**
- transmission - Type of transmission ("auto" or "manual")
- curb_weight - Weight in kg of the vehicle without any passenger or items
- power - Power of engine in kW
- fuel_type - Fuel type (e.g., "petrol", "diesel", "electric")
- engine_cap - Displacement of engine in cc (cubic centimeters)
  - i.e. engine displacement volume
- no_of_owners - Number of previous owners (>=1 since all cars are used)
- depreciation - Annual depreciation in SGD: the amount the owner loses on the value of the vehicle per year, assuming the vehicle is deregistered only at the end of its 10-year COE lifespan
  - e.g. for a 40K car, depreciation is about 4K/year
- coe - Certificate of entitlement value in SGD when first registered
  - original value of the COE when first registered
- road_tax - Road tax value in SGD calculated based on the engine capacity on a per annum basis
- dereg_value - Deregistration value in SGD that one gets back from the government upon deregistering the vehicle for use in Singapore
  - Amount received from the government if the car is deregistered after its 10-year COE
- mileage - Number of kilometers driven
- omv - Open Market Value in SGD assessed by the Singapore Customs
  - estimated car value (without COE, ARF)
- arf - Additional Registration Fee in SGD is a tax imposed upon registration of a vehicle
  - a higher OMV leads to an even higher ARF
- opc_scheme - Off-peak car scheme
  - cheaper COE; red-plate car that can only be driven during off-peak hours
- lifespan - The date stated as the lifespan marks the day by which the vehicle must be deregistered
  - seems to apply only to trucks, vans, buses, etc.
- eco_category - Eco category of vehicle
- features - Noteworthy features; free text attribute
- accessories - Noteworthy accessories; free text attribute
- indicative_price - General guide to the price in SGD of the vehicle
- price - Resale price in SGD of the car
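
Before relying on these descriptions, it helps to check each attribute against the data itself. The following is a minimal sketch, assuming the training data is loaded into a pandas DataFrame; the helper name `summarize_attributes` and the tiny sample frame are made up for illustration and are not part of the original analysis.

```python
import pandas as pd


def summarize_attributes(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, distinct-value count, NaN count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(dropna=True),  # NaN is not counted as a value
        "n_missing": df.isna().sum(),
    })


# Tiny made-up frame standing in for the real training data (assumption).
sample = pd.DataFrame({
    "make": ["toyota", "bmw", None],
    "transmission": ["auto", "auto", "manual"],
    "mileage": [50000.0, None, 120000.0],
})

summary = summarize_attributes(sample)
print(summary)
```

Sorting the result by `n_unique` or `n_missing` gives a quick way to spot constant columns, id-like columns, and attributes that need imputation.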

## Categorical attributes
* id: 
  * listing_id: to remove
* Free text: 
  * title (7.3k), description (20k), features (17k), accessories (17k): consider removing; too many distinct values across the 25K training rows
* Categories: 
  * **make** (96): NaN
  * **model** (799)
  * type_of_vehicle (11)
  * **category** (245): NaN (-); the comma-separated values should be split and reprocessed. Categories like "parf car" affect the dereg_value
  * transmission (2)
  * fuel_type (6): NaN
  * opc_scheme (4): NaN values are valid and mean the car is not on the OPC scheme; affects the COE value.
  * eco_category (1): to remove, every row has the value "uncategorized"
* Datetime: 
  * **manufactured** (72): NaN
  * original_reg_date (220): NaN values are valid, meaning the car has had only 1 owner. However, some rows with more than 6 owners are also NaN, which is not valid.
  * reg_date (4.7K): consider reducing the number of distinct values by converting to year, month, day; some dates are suspiciously old
  * lifespan (1.5K): NaN values are valid, because it only applies to trucks, vans, buses, etc.
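
The clean-up steps noted in this list can be sketched as follows. This is a rough illustration rather than a definitive pipeline: the function name `preprocess_categoricals`, the `category_` column prefix, the `"none"` placeholder for cars not on the OPC scheme, and the sample values are all assumptions, and the date strings in the real data may need an explicit `format` argument.

```python
import pandas as pd


def preprocess_categoricals(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    # listing_id is an identifier and eco_category is constant: drop both.
    out = out.drop(columns=["listing_id", "eco_category"], errors="ignore")

    # category: treat NaN and "-" as "no category", normalise spacing around
    # commas, then expand the comma-separated values into 0/1 columns.
    cats = out["category"].fillna("").replace("-", "")
    cats = cats.str.replace(r"\s*,\s*", ",", regex=True).str.strip()
    dummies = cats.str.get_dummies(sep=",").add_prefix("category_")
    out = pd.concat([out.drop(columns="category"), dummies], axis=1)

    # opc_scheme: NaN is a valid value meaning "not on the OPC scheme".
    out["opc_scheme"] = out["opc_scheme"].fillna("none")

    # reg_date: reduce cardinality by keeping only year and month.
    reg = pd.to_datetime(out["reg_date"], errors="coerce")
    out["reg_year"] = reg.dt.year
    out["reg_month"] = reg.dt.month
    return out.drop(columns="reg_date")


# Made-up rows for illustration only (assumption, not real listings).
sample = pd.DataFrame({
    "listing_id": [1, 2, 3],
    "category": ["parf car, premium ad car", "-", None],
    "opc_scheme": [None, "made-up scheme", None],
    "eco_category": ["uncategorized"] * 3,
    "reg_date": ["2015-12-09", "2018-01-31", "2010-06-15"],
})

result = preprocess_categoricals(sample)
print(result)
```

Expanding `category` into indicator columns keeps flags such as "parf car" available as separate features instead of one 245-value attribute.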

## Numerical attributes
* car properties: 
  * curb_weight (974): NaN; seems to have little impact on price
  * power (312): NaN
  * engine_cap (381): NaN and 0 values; possibly redundant with power?
  * **mileage** (6.5K): NaN
  * **no_of_owners** (7): NaN is invalid; no_of_owners should be greater than 0
* price: 
  * depreciation (3.9K): NaN
  * **coe** (2.6K)
  * road_tax (610): NaN; not a major factor in price, but if the road tax is high, the real price might be lower than the estimated price
  * dereg_value (19K): NaN; dereg_value is used to calculate depreciation
  * **omv** (14K): NaN
  * **arf** (14K): NaN
  * indicative_price: all values are NaN; to remove
  * price (3.4K)
    * `estimated price = (coe + omv + arf) - depreciation * used_coe_duration`
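
The estimated-price heuristic above can be written as a small function. This is only a sketch under stated assumptions: `used_coe_duration` is interpreted here as the number of years elapsed since `reg_date`, measured up to a hypothetical reference date `as_of`, and the sample row is made up. Rows with NaN in any input yield a NaN estimate.

```python
import pandas as pd


def estimate_price(df: pd.DataFrame, as_of: str) -> pd.Series:
    """(coe + omv + arf) - depreciation * years of COE already used."""
    reg = pd.to_datetime(df["reg_date"], errors="coerce")
    # Assumption: used COE duration = years between reg_date and as_of.
    used_years = (pd.Timestamp(as_of) - reg).dt.days / 365.25
    initial_cost = df["coe"] + df["omv"] + df["arf"]
    return initial_cost - df["depreciation"] * used_years


# Made-up row for illustration only (assumption, not a real listing).
sample = pd.DataFrame({
    "reg_date": ["2015-01-01"],
    "coe": [50000.0],
    "omv": [20000.0],
    "arf": [20000.0],
    "depreciation": [8000.0],
})

estimate = estimate_price(sample, as_of="2020-01-01")
print(estimate)
```

Comparing this estimate with the actual `price` column is one way to judge how much of the resale price the bolded attributes explain.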