# Capstone

Sales transactions dominate media coverage of the Singapore property scene, while the rental market attracts far less attention. I attempt to address this information gap by identifying the factors that drive rental prices in Singapore's private residential property market.

A good understanding of these factors, together with robust predictions, will help owners set a suitable asking rent, while potential tenants can benefit by shortlisting units that fit their budget.

This notebook is the first part of the project workflow.

## Webscraping

We begin by scraping the data we need for our analysis from property portals.

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
# sns.set_style('whitegrid')
%matplotlib inline

# Widen the display options to see all columns at once
# (full option names avoid ambiguous-key errors in recent pandas versions)
pd.set_option('display.max_columns', 82)
pd.set_option('display.max_rows', 82)

from pprint import pprint
import warnings
warnings.filterwarnings('ignore')

In [2]:
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
from bs4 import BeautifulSoup

import copy
import urllib
import requests
import pickle
from time import sleep
import os
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

In [59]:
# We will scrape condominium listings from SRX property portal
# We will navigate by district, the portal has a neat & convenient url to do that
# 1st placeholder {} for district number, 2nd placeholder {} for page number
# url = "https://www.srx.com.sg/search/rent/condo?selectedDistrictIds={}&page={}"
# Sort postings by new to old
url = "https://www.srx.com.sg/search/rent/condo?selectedDistrictIds={}&orderCriteria=datePostedDesc&page={}"

# Manually set district to scrape, 1 <= district <= 28
district = 28
page = 1
# Search results start from page 1 for this portal

# Raw string so the backslashes in the Windows path are not read as escape sequences
chromedriver = r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(executable_path=chromedriver)

# Call webpage
driver.get(url.format(district, page))
# Webpage needs time to load!
sleep(5) # feature rich website, better give more time to load!

# Ensure that the website returned some results before we proceed any further
assert "No rental listings found" not in driver.page_source

# Extract the property results count, discard thousand separators, then cast as integer
propertycount = Selector(text=driver.page_source).xpath("//div[@class='has-properties']/strong[1]/text()").extract()
propertycount = int(propertycount[0].replace(',',''))

# The website conveniently provides the number of result pages, let's grab that too.
pages = Selector(text=driver.page_source).xpath("//div[@class='has-properties']/strong[3]/text()").extract()
pages = int(pages[0].replace(',',''))

print('Found {} condo results in {} pages for district {}.'.format(propertycount, pages, district))


AssertionError: 
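
The cell above fills a URL template with a district and a page number, then converts count strings such as `1,234` to integers. As a minimal sketch of those two steps in isolation (the helper names `build_search_url`, `parse_count` and `page_urls` are hypothetical and not part of the notebook; the URL template is copied from the cell above), the logic could be wrapped like this so that an out-of-range district fails early:

```python
# Sketch only: URL template copied from the cell above; helper names are hypothetical.
SEARCH_URL = ("https://www.srx.com.sg/search/rent/condo"
              "?selectedDistrictIds={}&orderCriteria=datePostedDesc&page={}")


def build_search_url(district, page):
    """Fill the template, rejecting values outside the ranges noted above."""
    if not 1 <= district <= 28:
        raise ValueError("district must be between 1 and 28")
    if page < 1:
        raise ValueError("page numbering starts at 1")
    return SEARCH_URL.format(district, page)


def parse_count(text):
    """Turn a count string such as '1,234' into an integer."""
    return int(text.strip().replace(',', ''))


def page_urls(district, pages):
    """List the result-page URLs for one district, pages 1..pages inclusive."""
    return [build_search_url(district, p) for p in range(1, pages + 1)]


urls = page_urls(28, parse_count('5'))
print(len(urls))
print(urls[0])
```

Keeping the URL construction separate from the browser calls makes it possible to check the pagination logic without launching Chrome.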

In [45]:
# WARNING: This cell may take a long time to run!!
# We need to go page by page and pull off those 20 links repeatedly
baseurl = 'https://www.srx.com.sg'
condo_links = []

# Results page count starts from 1 on the website URL, so iterate over 1..pages inclusive
for page in range(1, pages + 1):
    print('Grabbing URLs from page {} of {}...'.format(page, pages))
    driver.get(url.format(district, page))
    sleep(5)
    
    soup = driver.page_source
    soup = BeautifulSoup(soup, 'lxml')
    
    # URL is embedded in each picture object, we'll start from there
    for item in soup.find_all("div", {"class" : "col-xs-12 col-sm-6 col-md-6 listingPhotoMain"}):
        # Under that picture object, get the first child 'a href' object and read the href link
        condo_links.append(baseurl + item.find_next("a").get("href"))
    
print("Total URLs grabbed:", len(condo_links))

# Export and backup the created list as CSV
print("\nExporting grabbed URLs to CSV...")
filename = 'condo_links_D{}.csv'.format(district)
path = './dataset/'+filename
pd.DataFrame(condo_links, columns=['URL']).to_csv(path, index=False)
print("Export done! Saved as {}.".format(path))


Grabbing URLs from page 1 of 5...
Grabbing URLs from page 2 of 5...
Grabbing URLs from page 3 of 5...
Grabbing URLs from page 4 of 5...
Grabbing URLs from page 5 of 5...
Total URLs grabbed: 89

Exporting grabbed URLs to CSV...
Export done! Saved as ./dataset/condo_links_D28.csv.
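
The link-grabbing cell concatenates the base URL with each `href` and appends every result to the list. Two things can go wrong with that approach: an anchor without an `href` makes the concatenation fail, and a listing that shifts between pages while we scrape may be collected twice. The sketch below is one possible way to guard against both, using only the standard library. The function names and the sample hrefs are hypothetical; real hrefs come from the result pages.

```python
import csv
import io
from urllib.parse import urljoin

BASE_URL = 'https://www.srx.com.sg'


def absolute_links(hrefs, base=BASE_URL):
    """Resolve hrefs against the base URL and drop duplicates, keeping first-seen order."""
    seen = set()
    links = []
    for href in hrefs:
        if not href:  # skip anchors that carry no href at all
            continue
        link = urljoin(base, href)
        if link not in seen:
            seen.add(link)
            links.append(link)
    return links


def links_to_csv(links, handle):
    """Write the links under a single 'URL' header, like the CSV exported above."""
    writer = csv.writer(handle)
    writer.writerow(['URL'])
    writer.writerows([link] for link in links)


# Hypothetical hrefs for illustration only.
sample = ['/listings/1/example-a', '/listings/2/example-b', '/listings/1/example-a', None]
links = absolute_links(sample)

buffer = io.StringIO()  # stands in for an open file
links_to_csv(links, buffer)
print(links)
```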


In [51]:
# Create dataframe to store condo data
condo_df = pd.DataFrame(columns=["condo_id","URL","prop_name","prop_type","district","bed","bath","furnish",
                                 "area","tenure","unit_count","built_year","date_avail","room_type","lease",
                                 "model","developer","address","latitude","longitude","description","psf","rent"])

# Feature list
features = ['Property Name','Property Type','Asking','PSF','Built Year','Date Available From','Room Type',
            'Lease Term','Model','Developer','Address','District','Bedrooms','Bathrooms','Furnish','Area',
            'Land Tenure','No. of Units']
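
The feature labels and the DataFrame columns are kept in two separate lists, so the link between them lives only in list positions. As a hedged alternative, the sketch below pairs each portal label with its column name in one dictionary (the pairing follows the two lists above) and builds a row in a single pass. `build_row` and the sample listing are hypothetical, and unlike the notebook, which strips only a few fields, this version strips whitespace from every value.

```python
# Pairing taken from the two lists above: portal label -> DataFrame column.
FEATURE_TO_COLUMN = {
    'Property Name': 'prop_name',
    'Property Type': 'prop_type',
    'Asking': 'rent',
    'PSF': 'psf',
    'Built Year': 'built_year',
    'Date Available From': 'date_avail',
    'Room Type': 'room_type',
    'Lease Term': 'lease',
    'Model': 'model',
    'Developer': 'developer',
    'Address': 'address',
    'District': 'district',
    'Bedrooms': 'bed',
    'Bathrooms': 'bath',
    'Furnish': 'furnish',
    'Area': 'area',
    'Land Tenure': 'tenure',
    'No. of Units': 'unit_count',
}


def build_row(grab):
    """Call grab(label) once per feature and return {column: stripped value}."""
    return {column: grab(label).strip()
            for label, column in FEATURE_TO_COLUMN.items()}


# Stand-in for a real lookup: any callable that maps a portal label to its text.
fake_listing = {'Property Name': 'Example Residences', 'Asking': ' $3,000 '}
row = build_row(lambda label: fake_listing.get(label, 'EMPTY'))
print(row['prop_name'], row['rent'], row['bed'])
```

With a mapping like this, adding or removing a feature is a one-line change, and there is no index such as `features[11]` to keep in sync.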

In [53]:
# Function to find condo features in posting webpage
def grab(feature):
    # Some postings may omit certain features, so set this as the default sentinel value
    result = 'EMPTY'
    
    # The 18 features we want to grab are conveniently located within condo_info,
    # which the scraping loop assigns before calling this function
    keys = condo_info.find_all('span', {'class':'listing-about-main-key'})
    for key in keys:
        try:
            if feature in key.text:
                # The key/value pairs are placed side by side, so we have a neat way to access it.
                result = key.find_next_sibling().text
        except Exception:
            result = 'FAIL'  # Set a sentinel value for flagging any weird problem
    return result
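
`grab()` reads the module-level `condo_info`, so it only works once the scraping loop has assigned that variable, and it matches labels by substring. The sketch below reproduces the same matching rule with the data passed in explicitly, which makes it easy to try out without a browser. The name `lookup`, its `exact` flag and the sample pairs are all hypothetical; the pairs stand in for key/value texts already read from the page.

```python
def lookup(feature, pairs, default='EMPTY', exact=False):
    """Return the value for `feature` from (key, value) pairs.

    By default a key matches when it contains `feature`, and the last
    match wins, which mirrors the loop in grab(). With exact=True the
    stripped key must equal `feature`.
    """
    result = default
    for key, value in pairs:
        matched = (key.strip() == feature) if exact else (feature in key)
        if matched:
            result = value
    return result


# Hypothetical key/value texts for illustration only.
pairs = [('Property Name', 'Example Residences'),
         ('Bedrooms', '3'),
         ('Land Tenure', '99 years')]

print(lookup('Bedrooms', pairs))
print(lookup('Developer', pairs))
print(lookup('Tenure', pairs))
print(lookup('Tenure', pairs, exact=True))
```

The last two calls show the trade-off: substring matching tolerates extra text around a label, but a short label will also match any longer key that happens to contain it.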

In [54]:
# WARNING: This cell may take a long time to run!!
# Keep track of the URLs skipped due to errors/exception
skiplist = []

print("Grabbing condo data...")
for link in condo_links:
    # If webpage cannot be loaded for whatever reasons, we will just skip it
    try:
        driver.get(link)
    except:
        print("Skipping: ", link)
        skiplist.append(link)
        continue
    sleep(5)
    soup = driver.page_source
    soup = BeautifulSoup(soup, 'lxml')
    
    # Grab the listing id
    try:
        condo_id = soup.find('input', {'class':'listingId'}).get('value')
    except:
        condo_id = 'EMPTY'
    
    # Grab the GPS coordinates
    try:
        latitude = soup.find('input', {'id':'sideLatitude'}).get('value')
        longitude = soup.find('input', {'id':'sideLongitude'}).get('value')
    except:
        latitude =  'EMPTY'
        longitude = 'EMPTY'
    
    # The 18 features we want to grab are conveniently located within this object
    condo_info = soup.find('div', {'class':'listing-about-main row'})  # grab() references this object
    
    prop_name  = grab(features[0])
    prop_type  = grab(features[1])
    rent       = grab(features[2])
    psf        = grab(features[3]).strip()
    builtyr    = grab(features[4])
    date_avail = grab(features[5])
    room_type  = grab(features[6])
    lease      = grab(features[7])
    model      = grab(features[8])
    developer  = grab(features[9])
    address    = grab(features[10])
    district   = grab(features[11]).strip()  # NOTE: overwrites the integer `district` set in the search cell
    bed        = grab(features[12])
    bath       = grab(features[13])
    furnish    = grab(features[14])
    area       = grab(features[15]).strip()
    tenure     = grab(features[16])
    unit_count = grab(features[17])
    
    # Condo description is found in another object, but it contains a child node which we do not need
    # Make a copy first since we will take a destructive action, we do not want to alter the original page
    try:
        desc = copy.copy(soup.find('div', {'id':'listingDesc'}))
        desc.find('div', {'class':'listingDetailRoom'}).decompose()  # Prune unwanted child node 
        description = desc.text.strip()
    except:
        description = 'EMPTY'
    
    # Append to the DataFrame.
    condo_df.loc[len(condo_df)] = [condo_id, link, prop_name, prop_type, district, bed, bath, furnish, 
                                   area, tenure, unit_count, builtyr, date_avail, room_type, lease, model, 
                                   developer, address, latitude, longitude, description, psf, rent]
    
    if (len(condo_df) % 10) == 0:
        print(len(condo_df), "condo records grabbed")

print("\nTotal condo records grabbed:", len(condo_df))


Grabbing condo data...
10 condo records grabbed
20 condo records grabbed
30 condo records grabbed
40 condo records grabbed
50 condo records grabbed
60 condo records grabbed
70 condo records grabbed
80 condo records grabbed

Total condo records grabbed: 89
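
The scraping loop skips any page that fails to load and records the link in `skiplist`, while each parsed listing is appended to the DataFrame one row at a time. The sketch below isolates that control flow so it can be exercised without a browser. It is an assumption-laden outline, not the notebook's code: `scrape_all`, `fake_fetch` and `fake_parse` are hypothetical stand-ins for `driver.get(...)` and the parsing steps, and the sketch also keeps the reason for each failure.

```python
def scrape_all(links, fetch, parse):
    """Fetch and parse each link; return (rows, skipped) without stopping on errors."""
    rows, skipped = [], []
    for link in links:
        try:
            page = fetch(link)
        except Exception as exc:
            # Record the failure with its reason so the link can be retried later
            skipped.append((link, repr(exc)))
            continue
        rows.append(parse(link, page))
    return rows, skipped


# Hypothetical stand-ins for loading a page and extracting its fields.
def fake_fetch(link):
    if 'broken' in link:
        raise TimeoutError('page did not load')
    return '<html>{}</html>'.format(link)


def fake_parse(link, page):
    return {'URL': link, 'size': len(page)}


rows, skipped = scrape_all(['a', 'broken-b', 'c'], fake_fetch, fake_parse)
print(len(rows), 'rows;', len(skipped), 'skipped')
```

Because the rows are collected as a list of dictionaries, they can be turned into a DataFrame in one step at the end with `pd.DataFrame(rows)`, which avoids growing the DataFrame row by row.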


In [55]:
condo_df.shape

(89, 23)

In [56]:
# Export and backup the scraped condo data as CSV
print("\nExporting grabbed condo data to CSV...")
filename = 'condo_data_{}.csv'.format(district.split()[0])
# filename = 'condo_data_D2.csv'
path = './dataset/'+filename
condo_df.to_csv(path, index=False)
print("Export done! Saved as {}.".format(path))



Exporting grabbed condo data to CSV...
Export done! Saved as ./dataset/condo_data_D28.csv.
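
The export cell assumes that the `./dataset` folder already exists and that `district` is a string whose first token is the district code. A small sketch of the same path logic with the folder created on demand is shown below. `export_path` is a hypothetical helper, the label text after `D28` is made up, and the demonstration writes into a temporary directory rather than `./dataset`.

```python
import tempfile
from pathlib import Path


def export_path(district_label, base_dir='./dataset'):
    """Build '<base_dir>/condo_data_<code>.csv', creating base_dir if it is missing."""
    code = district_label.split()[0]  # first token, e.g. 'D28'
    folder = Path(base_dir)
    folder.mkdir(parents=True, exist_ok=True)
    return folder / 'condo_data_{}.csv'.format(code)


# Demonstrated in a temporary directory; the label text is hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    path = export_path('D28 (example area)', base_dir=Path(tmp) / 'dataset')
    path.write_text('condo_id,URL\n')
    created = path.exists()
    print(path.name)
```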


### Adrian's Comments:
This concludes the section on web scraping.