# Introduction

## Objective


Buying a car is a big decision. This is especially true in Singapore, which has some of the highest vehicle running costs in the world. Because of these high costs, the pre-owned market is an attractive alternative for much of the population. But the pre-owned car market can be confusing, especially when it comes to pricing: How much does a car with a certain mileage cost? How much will a particular brand of car cost? Many factors affect the price of a pre-owned car.
Therefore, we decided that our project would be to predict the price of a pre-owned car. This is a multivariate problem, with factors such as mileage, brand and age all affecting the price.


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
sb.set()

from bs4 import BeautifulSoup as bs
import requests
import datetime
from datetime import date
import time

## Beautiful Soup

SGCarMart (https://www.sgcarmart.com/main/index.php) is one of the largest car listing websites in Singapore. With a massive array of data available about used cars, it is an ideal website to collect our data from. However, recording the data of every used car by hand would be inefficient, so we need a way to scrape it automatically.
Beautiful Soup is a Python module that helps to traverse a web page and extract information from it. When applying it to SGCarMart, we soon realised that the listing URLs follow a sequence: by simply changing the ID in the URL, we can view different listings on the website.
URL for a pre-owned car listing: https://www.sgcarmart.com/used_cars/info.php?ID=888000&DL=1000
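
As a minimal sketch of this idea, the snippet below builds a few listing URLs from consecutive IDs without sending any requests. The helper name `listing_url` and the constant `BASE_URL` are illustrative names introduced here, and the `DL=1000` parameter is simply copied from the URL above.

```python
# Illustrative names (not part of the notebook): BASE_URL, listing_url.
BASE_URL = 'https://www.sgcarmart.com/used_cars/info.php?ID={}&DL=1000'

def listing_url(listing_id):
    """Return the listing URL for a given numeric ID."""
    return BASE_URL.format(listing_id)

# Three consecutive IDs, starting from the example listing above.
urls = [listing_url(i) for i in range(888000, 888003)]
for u in urls:
    print(u)
```

Not every ID necessarily corresponds to a live listing, so any loop over IDs should be prepared to skip pages that do not parse.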



In [4]:
url = 'https://www.sgcarmart.com/used_cars/info.php?ID=888000&DL=1000'
resp = requests.get(url)
sg_car_mart = bs(resp.text, 'html.parser')

![usedcar](sgcarmart.png)

As shown in the image above, we want to extract all the data in the table, as these variables may be possible predictors for our model to predict car price. 

In [5]:
print(sg_car_mart.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <title>
   Used Peugeot 2008 Car for Sale in Singapore, Classic Credit Pte Ltd - sgCarMart
  </title>
  <meta content="Best Condition.Well Maintained By Agent With Warranty.1 Owner With Low Mileage.100% Accident Free.Sporty Black With Panoramic Roof.Hurry!View Now.We Provide Lowest Interest Rate. Flexible Loan And High Trade In Are Available. Package With Unrivaled Warranty Complete With Certificate And Test Report. Meet Our Friendly Consultant For An Non Obligation Advise." property="og:description">
   <meta content="Latest Price, Photos, Promotions of Used Peugeot 2008, Classic Credit Pte Ltd For Sale in Singapore! The Only Place For Smart Car Buyers." name="description"/>
   <meta content="Used Peugeot, Used Peugeot 2008, Used Peugeot car, Used Peugeot 

## Our Functions

### convert_string:

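This function strips every non-digit character from a scraped string (for example currency symbols and thousands separators) and returns the remaining digits, still as a string. All the data is in the form of strings when scraped, and so must be cleaned appropriately. A string that contains no digits is returned unchanged.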

### list_conversion:

Apart from registration date, price, depreciation and vehicle type, all of the other variables are scraped in a fixed order as a list. This function converts each string in that list to an integer, drops anything after an opening bracket or a decimal point, and fills in missing data points with 'N.A'. The entry at index 7 (Transmission) is text and is kept as it is.




In [3]:
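def convert_string(input_string):
    """Keep only the digits of a scraped string and return them as a string.

    A string without any digits is returned unchanged.
    """
    try:
        digits = [ch for ch in input_string if ch.isdigit()]
        return str(int("".join(digits)))
    except ValueError:  # no digits found
        return input_string


def list_conversion(variables):
    """Convert the scraped row_info strings to integers.

    The entry at index 7 is kept as text; entries without digits become 'N.A'.
    """
    listing = []
    for index, raw in enumerate(variables):
        if index == 7:
            listing.append(raw)
            continue
        digits = []
        for ch in raw:
            if ch == '(' or ch == '.':  # stop at a bracket or a decimal point (e.g. for power)
                break
            if ch.isdigit():
                digits.append(ch)
        try:
            listing.append(int("".join(digits)))
        except ValueError:  # nothing numeric before the bracket / decimal point
            listing.append('N.A')
    return listing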
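
To make the behaviour of the two helpers concrete, here is a small sketch that runs them on made-up sample strings. The helpers are restated compactly so that the snippet runs on its own, and the sample values are illustrative, not scraped data.

```python
def convert_string(input_string):
    # Keep digits only; return the input unchanged if it has no digits.
    try:
        return str(int("".join(ch for ch in input_string if ch.isdigit())))
    except ValueError:
        return input_string

def list_conversion(variables):
    listing = []
    for index, raw in enumerate(variables):
        if index == 7:  # text field, kept as is
            listing.append(raw)
            continue
        # Text before the first bracket or decimal point.
        head = raw.split('(')[0].split('.')[0]
        try:
            listing.append(int("".join(ch for ch in head if ch.isdigit())))
        except ValueError:
            listing.append('N.A')
    return listing

# Made-up inputs for illustration.
print(convert_string('$48,800'))   # 48800 (as a string)
print(convert_string('N.A.'))      # N.A. (unchanged, no digits)

sample = ['52,000 km (8.7k /yr)', '$742 /yr', '$30,522', '$50,001',
          '1,199 cc', '1,045 kg', '2016', 'Auto',
          '$18,203', '$18,203', '81.0 kW (108 bhp)', 'N.A.']
converted = list_conversion(sample)
print(converted)
```

Note that the bracketed part of the mileage and the decimal part of the power are discarded, and that the missing last value becomes 'N.A'.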

In [4]:
combinedlist = []

## Webscraping

We inspected the web page and saw that several of the fields share the same HTML pattern. The HTML tags below serve as an example:

![inspect2](inspect2.png)

    




In [2]:
## As we can see, the manufactured year is stored in <div class="row_info">2016</div>

## To get this value, we use the findAll method to collect every <div class="row_info"> element, and then extract the year from the results
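
The lookup described above can be sketched on a small, made-up HTML fragment. The fragment only imitates the `row_title` / `row_info` structure and is not copied from the live site; `find_all` is the current name of the `findAll` method used elsewhere in this notebook.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates the structure described above.
sample_html = """
<div class="row_title">Manufactured</div>
<div class="row_info">2016</div>
<div class="row_title">Transmission</div>
<div class="row_info">Auto</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# Collect the text of every div with the given class, in page order.
titles = [div.text.strip() for div in soup.find_all('div', {'class': 'row_title'})]
values = [div.text.strip() for div in soup.find_all('div', {'class': 'row_info'})]

# Pair each title with its value.
fields = dict(zip(titles, values))
print(fields)
```

Pairing titles with values by position only works while every title on the page has exactly one matching value, which is an assumption about the page layout.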

In [None]:
headers = ['Brand','Type', 'Reg_date', 'Coe_left', 'Dep']
count = 1
for ids in range(885500,890000,1): ## try 4500 consecutive listing IDs
    try:
        url_i = 'https://www.sgcarmart.com/used_cars/info.php?ID={}&DL=1000'.format(ids)
        resp = requests.get(url_i)
        sg_car_mart = bs(resp.text, 'html.parser')
        time.sleep(2) ## pause between requests
        vehmake = sg_car_mart.find('a', {'class':'link_redbanner'}).text.strip().split()[0]
        price = convert_string(sg_car_mart.find('td', {'class':'font_red'}).text.strip())
        depreciation = convert_string(sg_car_mart.findAll('td', {'valign':'top'})[0].text.strip())
        vehtype = sg_car_mart.findAll('tr', {'class':'row_bg1'})[0].text.strip().split('\n')[1]
        reg_d = sg_car_mart.findAll('td', {'valign':'top'})[2].text.strip()
        ## reg_d holds the registration date, followed by the remaining COE in brackets
        try:
            reg_coe = reg_d.split('(')
            reg_date = reg_coe[0]
            coe = reg_coe[1]
        except IndexError: ## no bracketed COE part
            coe = 'NA'

        ## values of the remaining variables, in page order
        var_list = [div.text.strip() for div in sg_car_mart.findAll('div',{'class':"row_info"})]
        if count == 1: ## take the column names from the first listing that parses
            for div in sg_car_mart.findAll('div',{'class':"row_title"}):
                headers.append(div.text.strip())

        var_list = list_conversion(var_list)

        combinedlist.append(vehmake+','+vehtype+','+reg_date+','+coe+','+depreciation+','+','.join(str(i) for i in var_list)+','+price+'\n')
        print(count)
        print(url_i)
        count+=1
    except Exception: ## skip IDs whose page cannot be parsed
        continue

headers.append('Price')
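
The registration date and the remaining COE arrive in a single scraped string, which the loop splits on the opening bracket. The sketch below isolates that step in a hypothetical helper, `split_reg_and_coe`; the sample string format is an assumption made for illustration. Unlike the loop above, this variant also trims whitespace and the closing bracket.

```python
def split_reg_and_coe(reg_field):
    """Hypothetical helper: split 'date (COE text)' into (date, coe_text)."""
    parts = reg_field.split('(')
    reg_date = parts[0].strip()
    # No '(' means no COE text was found; mirror the loop's 'NA' fallback.
    coe = parts[1].rstrip(')').strip() if len(parts) > 1 else 'NA'
    return reg_date, coe

# Made-up inputs for illustration.
with_coe = split_reg_and_coe('26-Jul-2016 (5yrs 2mths COE left)')
without_coe = split_reg_and_coe('26-Jul-2016')
print(with_coe)
print(without_coe)
```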

There are 17 candidate predictors and one response variable, which is the car price.

In [7]:
print(headers)
len(combinedlist)

['Brand', 'Type', 'Reg_date', 'Coe_left', 'Dep', 'Mileage', 'Road Tax', 'Dereg Value', 'COE', 'Engine Cap', 'Curb Weight', 'Manufactured', 'Transmission', 'OMV', 'ARF', 'Power', 'No. of Owners', 'Price']


4390

In [8]:
headers

['Brand',
 'Type',
 'Reg_date',
 'Coe_left',
 'Dep',
 'Mileage',
 'Road Tax',
 'Dereg Value',
 'COE',
 'Engine Cap',
 'Curb Weight',
 'Manufactured',
 'Transmission',
 'OMV',
 'ARF',
 'Power',
 'No. of Owners',
 'Price']

We then wrote all the scraped records to a CSV file. Data cleaning is done in a separate file.

In [9]:
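filename = 'sg_usedcar_data_4k.csv'

## the with-block closes the file even if a write fails
with open(filename, 'w') as f:
    f.write(','.join(str(i) for i in headers) + '\n')  ## header row
    for var in combinedlist:  ## each entry is already a comma-separated line ending in '\n'
        f.write(var)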
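
Joining fields with commas by hand breaks if a scraped value itself contains a comma. As a possible alternative (not what this notebook does), Python's built-in `csv` module quotes such values automatically. The sketch below uses made-up rows and a temporary file, and assumes each record is kept as a list of fields rather than a pre-joined string.

```python
import csv
import os
import tempfile

# Made-up sample rows for illustration only (not scraped data).
sample_headers = ['Brand', 'Type', 'Price']
sample_rows = [
    ['Peugeot', 'SUV', '48800'],
    ['Mercedes-Benz', 'Luxury Sedan, Hybrid', '120000'],  # field contains a comma
]

path = os.path.join(tempfile.mkdtemp(), 'sample_usedcar.csv')

# newline='' lets the csv module control line endings itself.
with open(path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(sample_headers)
    writer.writerows(sample_rows)

# Read the file back to check that the comma inside the field survived.
with open(path, newline='') as f:
    read_back = list(csv.reader(f))

print(read_back)
```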

## End of Webscrape