# Cape Town Property Prices Web Scraping
In this notebook, we will construct our own Cape Town house prices data set using the BeautifulSoup web scraping library. If you have no idea how websites work, check out [this video](https://www.youtube.com/watch?v=OZeoiotzPFg) before proceeding.   

### What is web scraping?
Web scraping is the process of extracting (or scraping) data from websites. We _could_ do this manually, but this would involve an unreasonable amount of work! Web scraping typically involves gathering data from websites using automated means - such as a Python script.

### Ethics of web scraping
Ideally we would want to use the Property24 website, but unfortunately their [terms and conditions](https://www.property24.com/terms-and-conditions) prohibit the act of web scraping on their website. Luckily, Pam Golding's website does not mention anything prohibiting this sort of activity, so we will scrape from there instead!   

> NOTE: always make sure you read the Ts&Cs of a website before you do any web scraping!

If you are unsure whether a website allows web scraping, check out their `robots.txt` file. You can do this simply by going to their homepage and adding `/robots.txt` to the end of the URL in your browser, e.g. https://www.pamgolding.co.za/robots.txt.   
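
If you would rather check `robots.txt` from Python, the standard library ships a parser for it. The sketch below is a minimal example and not part of the original notebook: the rules in `SAMPLE_ROBOTS_TXT` and the `example.com` URLs are made up so that it runs without touching the network, and `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, used here so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/property-search/houses"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/admin/login"))
```

For a live site you would call `parser.set_url(...)` followed by `parser.read()` instead of `parse`. Keep in mind that `robots.txt` only covers crawler rules, so it does not replace reading the Ts&Cs.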

As always, we will start by importing the libraries that we will be using.

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import math
import requests
from itertools import cycle
import time
from tqdm import tqdm

### Where do we scrape the data?
Unless you have been living under a rock (we really hope you haven't), you should be familiar with URLs, commonly referred to as web addresses.   
The URL is just like a street address - it tells your browser _where_ on the internet the page that you are looking for can be found.   

Take a moment to visit the [Pam Golding website](https://www.pamgolding.co.za) and do a few searches for properties in Cape Town. Do a search for houses, then for apartments, then change the area and so on. Each time you change a part of your search, take note of how the URL in your browser changes.   

The data set we are going to construct will contain a column for the house type, and another for the search area. So let's construct a few search URLs that we will use to iteratively scrape data for different property types in various Cape Town areas.   

Doing it in this way will allow us to customise the areas and/or property types we want to scrape.

In [2]:
# all urls start with the same base
home_url = 'https://www.pamgolding.co.za'

# extensions for each property type
types = {
    'house': '/property-search/houses-for-sale-',
    'apartment': '/property-search/apartments-for-sale-'
}

# extensions for each area 
# adding a filter for property type results in a seemingly random code being appended
# the links below correspond to searches which have been filtered to include only houses
# or apartments under R7 500 000
searches = [
    {'Area': 'city-bowl', 'Type': {'house': '/dad78287-41e7-4c1f-85ec-1ff461a00e9b',
                                   'apartment': '/9d49e802-a4cc-49a5-8379-d33181e515eb'}},
    {'Area': 'atlantic-seaboard', 'Type': {'house': '/32177ab3-3abe-43ba-8f00-cc7190b9ad46',
                                           'apartment': '/388d14b9-e65b-4e58-83a9-fbb51b13256e'}},
    {'Area': 'southern-suburbs', 'Type': {'house': '-cape-town/24b2fd12-6b12-46a0-bc47-3a2c3ae3935e',
                                           'apartment': '-cape-town/1009a515-22c9-4fc2-904e-93e77f9a8aeb'}},
    {'Area': 'southern-peninsula', 'Type': {'house': '/73e51a67-0108-40eb-926e-ee8b24dddc2e',
                                            'apartment': '/b2a28d2d-531a-46c5-93f2-e9fab54df72e'}},
    {'Area': 'northern-suburbs', 'Type': {'house': '-cape-town/24b2fd12-6b12-46a0-bc47-3a2c3ae3935e',
                                          'apartment': '-cape-town/3b70525b-132a-434f-8b50-b41f06e4b77f'}},
    {'Area': 'boland-winelands', 'Type': {'house': '/59b1a872-7e4c-47f2-afe2-805b6e0bba72',
                                          'apartment': '/b746329e-af2c-4dbb-9090-6d4212cdab2b'}}
]

Let's just give that a test to see if our URLs get constructed correctly.

In [3]:
# search each area listed in searches list
for search in searches:
    area = search['Area']
    
    for p_type in search['Type']:
        url_string = home_url + types[p_type] + area + search['Type'][p_type]
        print(url_string)

https://www.pamgolding.co.za/property-search/houses-for-sale-city-bowl/dad78287-41e7-4c1f-85ec-1ff461a00e9b
https://www.pamgolding.co.za/property-search/apartments-for-sale-city-bowl/9d49e802-a4cc-49a5-8379-d33181e515eb
https://www.pamgolding.co.za/property-search/houses-for-sale-atlantic-seaboard/32177ab3-3abe-43ba-8f00-cc7190b9ad46
https://www.pamgolding.co.za/property-search/apartments-for-sale-atlantic-seaboard/388d14b9-e65b-4e58-83a9-fbb51b13256e
https://www.pamgolding.co.za/property-search/houses-for-sale-southern-suburbs-cape-town/24b2fd12-6b12-46a0-bc47-3a2c3ae3935e
https://www.pamgolding.co.za/property-search/apartments-for-sale-southern-suburbs-cape-town/1009a515-22c9-4fc2-904e-93e77f9a8aeb
https://www.pamgolding.co.za/property-search/houses-for-sale-southern-peninsula/73e51a67-0108-40eb-926e-ee8b24dddc2e
https://www.pamgolding.co.za/property-search/apartments-for-sale-southern-peninsula/b2a28d2d-531a-46c5-93f2-e9fab54df72e
https://www.pamgolding.co.za/property-search/houses-

## Let's do some scraping!
Now, onto the fun stuff! Let's iterate over the search areas and property types and extract data on all of the properties that come up.   

There are two parts to web scraping. First, we need to fetch all of the raw HTML that makes up the website. Then, we sift through all that raw code to extract the pieces of information we are looking for.   

### Hypertext Markup Language (HTML)
A website typically has three components that make it work: HTML, CSS and JavaScript. **CSS** is used to define the _style_ of the page - fonts, colours etc. **JavaScript** is used to define the functionality of the website - _"what happens when I click this button?"_ etc. **HTML** can be thought of as the skeleton of the website: it defines exactly what goes on the page.   

Since we are just trying to extract words and numbers, we only need to look at the HTML of a website to get what we want.   

### Google Chrome developer tools
Google's Chrome browser has a really useful feature that we will be using to extract our data:
  - Open up Google Chrome
  - Go to http://web.ics.purdue.edu/~gchopra/class/public/pages/webdesign/05_simple.html
  - Right click anywhere on the page and select "Inspect"
  - In the panel on the right, move your mouse over the various **HTML tags** - notice how the different parts on the page are highlighted?  

## Step 1: Fetch HTML
Let's practice some scraping on this very simple website. First, we will use the `requests` library to reach out to the web server and load the website:

In [4]:
url = "http://web.ics.purdue.edu/~gchopra/class/public/pages/webdesign/05_simple.html"

request = requests.get(url)

request.text

'<html>\n\n<head>\n<title>A very simple webpage</title>\n<basefont size=4>\n</head>\n\n<body bgcolor=FFFFFF>\n\n<h1>A very simple webpage. This is an "h1" level header.</h1>\n\n<h2>This is a level h2 header.</h2>\n\n<h6>This is a level h6 header.  Pretty small!</h6>\n\n<p>This is a standard paragraph.</p>\n\n<p align=center>Now I\'ve aligned it in the center of the screen.</p>\n\n<p align=right>Now aligned to the right</p>\n\n<p><b>Bold text</b></p>\n\n<p><strong>Strongly emphasized text</strong>  Can you tell the difference vs. bold?</p>\n\n<p><i>Italics</i></p>\n\n<p><em>Emphasized text</em>  Just like Italics!</p>\n\n<p>Here is a pretty picture: <img src=example/prettypicture.jpg alt="Pretty Picture"></p>\n\n<p>Same thing, aligned differently to the paragraph: <img align=top src=example/prettypicture.jpg alt="Pretty Picture"></p>\n\n<hr>\n\n<h2>How about a nice ordered list!</h2>\n<ol>\n  <li>This little piggy went to market\n  <li>This little piggy went to SB228 class\n  <li>This l

## Step 2: Extract information
Now, let's use the `BeautifulSoup` library to sift through this raw HTML.

Let's start by extracting the h2 header item from the page:

In [5]:
# first use Beautiful Soup to parse the request as HTML
soup = BeautifulSoup(request.text, 'html.parser')
print(soup)

<html>
<head>
<title>A very simple webpage</title>
<basefont size="4"/>
</head>
<body bgcolor="FFFFFF">
<h1>A very simple webpage. This is an "h1" level header.</h1>
<h2>This is a level h2 header.</h2>
<h6>This is a level h6 header.  Pretty small!</h6>
<p>This is a standard paragraph.</p>
<p align="center">Now I've aligned it in the center of the screen.</p>
<p align="right">Now aligned to the right</p>
<p><b>Bold text</b></p>
<p><strong>Strongly emphasized text</strong>  Can you tell the difference vs. bold?</p>
<p><i>Italics</i></p>
<p><em>Emphasized text</em>  Just like Italics!</p>
<p>Here is a pretty picture: <img alt="Pretty Picture" src="example/prettypicture.jpg"/></p>
<p>Same thing, aligned differently to the paragraph: <img align="top" alt="Pretty Picture" src="example/prettypicture.jpg"/></p>
<hr/>
<h2>How about a nice ordered list!</h2>
<ol>
<li>This little piggy went to market
  <li>This little piggy went to SB228 class
  <li>This little piggy went to an expensive restaura

In [6]:
# now find the h2 item
h2 = soup.find('h2')
h2

<h2>This is a level h2 header.</h2>

In [7]:
# extract only the text
h2.text

'This is a level h2 header.'

Now, let's be a little more specific. Inspect the paragraph that is aligned to the center of the screen. You should see that its `<p>` tag has an additional attribute, `align="center"`.

In [8]:
# can you spot the element we are looking for in the html below?
soup.find_all('p')

[<p>This is a standard paragraph.</p>,
 <p align="center">Now I've aligned it in the center of the screen.</p>,
 <p align="right">Now aligned to the right</p>,
 <p><b>Bold text</b></p>,
 <p><strong>Strongly emphasized text</strong>  Can you tell the difference vs. bold?</p>,
 <p><i>Italics</i></p>,
 <p><em>Emphasized text</em>  Just like Italics!</p>,
 <p>Here is a pretty picture: <img alt="Pretty Picture" src="example/prettypicture.jpg"/></p>,
 <p>Same thing, aligned differently to the paragraph: <img align="top" alt="Pretty Picture" src="example/prettypicture.jpg"/></p>,
 <p>And finally, how about some <a href="http://www.yahoo.com/">Links?</a></p>,
 <p>Or let's just link to <a href="../../index.html">another page on this server</a></p>,
 <p>Remember, you can view the HTMl code from this or any other page by using the "View Page Source" command of your browser.</p>]

We can use this extra bit of information to extract only the `<p>` item that we want:

In [9]:
center = soup.find('p', {'align': 'center'})
print(center.text)

Now I've aligned it in the center of the screen.
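
Some values live in the attributes of a tag rather than in its text, and sometimes the element we ask for is simply not there. The sketch below shows both cases on a made-up HTML snippet: the `listing` class and the `data-*` attributes are invented for illustration and are not copied from any real site.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only.
html = """
<article class="listing" data-price="1500000" data-location="Gardens">
  <div class="bedroom">3</div>
  <div class="bathroom">2</div>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
listing = soup.find("article", {"class": "listing"})

# text inside a child element
bedrooms = listing.find("div", {"class": "bedroom"}).text.strip()

# attribute values are read with dictionary-style access on the tag
price = listing["data-price"]

# .get() returns None instead of raising KeyError when an attribute is absent
on_show = listing.get("data-isonshow")

# find() returns None when nothing matches, so check before using .text
garage_tag = listing.find("div", {"class": "garage"})
garage = garage_tag.text.strip() if garage_tag is not None else None

print(bedrooms, price, on_show, garage)
```

Note that everything comes back as a string (or `None`), so numbers still need converting afterwards.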


## The real reason we are here
The code below is just a more complicated version of what we did above. We have a list of URLs that correspond to different areas and property types, so we will iterate over each of these URLs and extract the information on all of the properties found in those areas.   
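
Because we will be sending many requests in a row, it is worth thinking about failures and about being polite to the server. As a rough sketch (not the notebook's own code), the hypothetical helper `fetch_html` below takes an object that behaves like `requests.Session`, raises on HTTP error codes, and pauses after each call. The one-second delay and ten-second timeout are arbitrary assumptions.

```python
import time


def fetch_html(session, url, delay_seconds=1.0, timeout=10):
    """Fetch url with session, fail loudly on HTTP errors, then pause.

    session is expected to behave like requests.Session: its get() returns
    a response object with raise_for_status() and a .text attribute.
    """
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    time.sleep(delay_seconds)    # space out consecutive requests
    return response.text
```

With the real library this would be called as `fetch_html(requests.Session(), url)`, and the returned string can be passed straight to `BeautifulSoup`.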

The `tqdm` library has been used to show the progress of the scraping in a nice and clean way. The status bar will show how many pages have been scraped for each search. Each page contains 10 listings on it.
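
Since each page holds 10 listings, the number of pages to request is the listing count divided by 10, rounded up. Here is that arithmetic on its own, as a small sketch: `pages_needed` is a hypothetical helper, and the exact wording of the header text ("104 properties") is an assumption. All we rely on is that the first word is the count.

```python
import math


def pages_needed(header_text: str, per_page: int = 10) -> int:
    """Turn header text such as '104 properties' into a number of result pages."""
    total = int(header_text.split()[0])  # the first token is the listing count
    return math.ceil(total / per_page)   # round up so a partial last page is included


print(pages_needed("104 properties"))  # 11
print(pages_needed("30 properties"))   # 3
```

Rounding up matters: 104 listings need 11 pages, because the last 4 listings sit on a page of their own.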

In [10]:
rows = []

# search each area listed in searches list
for search in searches:
    area = search['Area']
    
    for p_type in search['Type']:
        p_type_name = p_type
                         
        url = home_url + types[p_type] + area + search['Type'][p_type]
    
        # find the number of search pages to iterate over for the given area
        request = requests.get(url)
        soup = BeautifulSoup(request.text, 'html.parser')
        pages = int(soup.find('span', {'class': 'propCountHdr'}).text.split()[0])
        pages = math.ceil(pages/10)
        
        print('Scraping data for all ' + p_type_name + 's in ' + area + '...')

        # search every page
        for page in tqdm(range(pages)):

            # create the url and request html
            url = home_url + types[p_type] + area + search['Type'][p_type] + '/page' + str(page+1)
            request = requests.get(url)
            soup = BeautifulSoup(request.text, 'html.parser')

            # find all properties listed on page
            search_results = soup.find_all('article', {'class': 'searchResult'})

            # each result is one property
            for result in search_results:

                prop_data = {}
                prop_data['area'] = area
                prop_data['type'] = p_type_name

                # this information is contained in 'div' elements
                features = ['bedroom', 'bathroom', 'garage', 'erfSize', 'buildingSize']

                for feature in features:
                    try:
                        value = result.find('div', {'class': feature}).text.strip()
                        prop_data[feature] = value
                    except AttributeError:  # find() returned None: feature not listed
                        continue

                # these will be taken directly from the tag
                data_tags = ['data-isonshow', 'data-price', 
                             'data-location', 'data-date']

                for tag in data_tags:
                    try:
                        prop_data[tag] = result[tag]
                    except KeyError:  # attribute missing from this listing's tag
                        continue

                # property description is contained in a 'p' element
                #  try:
                    #  value = result.find('p', {'class': 'property-description'}).text.strip()
                    #  prop_data['description'] = value
                #  except:
                    #  continue

                # the url needs to have the base added on
                try:
                    prop_data['data-url'] = home_url + result['data-url']
                except KeyError:  # skip listings that have no url
                    continue
                
                rows.append(prop_data)

        
# turn this all into a data frame
df = pd.DataFrame(rows)

Scraping data for all houses in city-bowl...


100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:17<00:00,  5.66s/it]


Scraping data for all apartments in city-bowl...


100%|████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:51<00:00,  5.65s/it]


Scraping data for all houses in atlantic-seaboard...


100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:16<00:00,  5.58s/it]


Scraping data for all apartments in atlantic-seaboard...


100%|██████████████████████████████████████████████████████████████████████████████████| 12/12 [01:08<00:00,  5.63s/it]


Scraping data for all houses in southern-suburbs...


100%|██████████████████████████████████████████████████████████████████████████████████| 12/12 [01:07<00:00,  5.59s/it]


Scraping data for all apartments in southern-suburbs...


100%|██████████████████████████████████████████████████████████████████████████████████| 12/12 [01:08<00:00,  5.74s/it]


Scraping data for all houses in southern-peninsula...


100%|██████████████████████████████████████████████████████████████████████████████████| 12/12 [01:12<00:00,  5.82s/it]


Scraping data for all apartments in southern-peninsula...


100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:11<00:00,  5.76s/it]


Scraping data for all houses in northern-suburbs...


100%|██████████████████████████████████████████████████████████████████████████████████| 12/12 [01:10<00:00,  5.78s/it]


Scraping data for all apartments in northern-suburbs...


100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:23<00:00,  5.81s/it]


Scraping data for all houses in boland-winelands...


100%|██████████████████████████████████████████████████████████████████████████████████| 47/47 [03:44<00:00,  3.76s/it]


Scraping data for all apartments in boland-winelands...


100%|██████████████████████████████████████████████████████████████████████████████████| 47/47 [04:17<00:00,  5.79s/it]


## Cleaning
Before we save the data frame as a CSV file, let's do a quick bit of cleaning.

In [11]:
df.head()

| | area | bathroom | bedroom | buildingSize | data-date | data-isonshow | data-location | data-price | data-url | erfSize | garage | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | city-bowl | 3.5 | 3 | | 5/15/2018 12:48:53 PM | False | Bo-kaap | 6950000 | https://www.pamgolding.co.za/property-details/... | 156 m² | 2.0 | house |
| 1 | city-bowl | 2.0 | 4 | | 10/9/2018 3:07:41 PM | False | Oranjezicht | 6800000 | https://www.pamgolding.co.za/property-details/... | 270 m² | | house |
| 2 | city-bowl | 2.0 | 3 | 180 m² | 7/31/2018 10:03:10 AM | False | Tamboerskloof | 6595000 | https://www.pamgolding.co.za/property-details/... | 144 m² | | house |
| 3 | city-bowl | 2.0 | 3 | | 6/28/2018 1:00:40 PM | False | Tamboerskloof | 6490000 | https://www.pamgolding.co.za/property-details/... | 197 m² | | house |
| 4 | city-bowl | 2.0 | 3 | | 10/4/2018 4:06:34 PM | False | Oranjezicht | 6450000 | https://www.pamgolding.co.za/property-details/... | 153 m² | | house |


In [12]:
# keep only rows that have a listing url; .copy() avoids SettingWithCopyWarning
df = df[df['data-url'].notna()].copy()

# convert data types
for col in ['bathroom', 'bedroom', 'data-price', 'garage']:
    df[col] = pd.to_numeric(df[col])
df['data-date'] = pd.to_datetime(df['data-date'])

# remove m^2 from the size columns by keeping only the digits
for col in ['buildingSize', 'erfSize']:
    digits = df[col].astype('str').str.replace(r'[^0-9]', '', regex=True)
    df[col] = pd.to_numeric(digits)  # empty strings (missing sizes) become NaN
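
One thing to be aware of: keeping every digit also throws away a decimal point, so a value like "2.5" would turn into 25. Whether the site ever lists sizes with decimals is not something this notebook checks. If it does, a regular expression is a safer way to pull out the number. The sketch below is an alternative and not what the notebook ran: `parse_size` is a hypothetical helper, and treating spaces as thousands separators is an assumption.

```python
import re


def parse_size(text):
    """Return the first number found in text as a float, or None if there is none."""
    if not isinstance(text, str):
        return None  # e.g. NaN for listings without a size
    compact = re.sub(r"\s+", "", text)  # "1 200 m²" -> "1200m²"
    match = re.search(r"[0-9]+(?:\.[0-9]+)?", compact)
    return float(match.group()) if match else None


print(parse_size("156 m²"))    # 156.0
print(parse_size("1 200 m²"))  # 1200.0
print(parse_size("2.5 ha"))    # 2.5
print(parse_size(None))        # None
```

It could be applied to the raw text of a size column with something like `df['erfSize'].map(parse_size)`. It does not convert units, so a value in hectares would still need separate handling.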

In [13]:
df.head()

| | area | bathroom | bedroom | buildingSize | data-date | data-isonshow | data-location | data-price | data-url | erfSize | garage | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | city-bowl | 3.5 | 3.0 | | 2018-05-15 12:48:53 | False | Bo-kaap | 6950000 | https://www.pamgolding.co.za/property-details/... | 156.0 | 2.0 | house |
| 1 | city-bowl | 2.0 | 4.0 | | 2018-10-09 15:07:41 | False | Oranjezicht | 6800000 | https://www.pamgolding.co.za/property-details/... | 270.0 | | house |
| 2 | city-bowl | 2.0 | 3.0 | 180.0 | 2018-07-31 10:03:10 | False | Tamboerskloof | 6595000 | https://www.pamgolding.co.za/property-details/... | 144.0 | | house |
| 3 | city-bowl | 2.0 | 3.0 | | 2018-06-28 13:00:40 | False | Tamboerskloof | 6490000 | https://www.pamgolding.co.za/property-details/... | 197.0 | | house |
| 4 | city-bowl | 2.0 | 3.0 | | 2018-10-04 16:06:34 | False | Oranjezicht | 6450000 | https://www.pamgolding.co.za/property-details/... | 153.0 | | house |


In [14]:
df['area'].value_counts()

boland-winelands      940
southern-suburbs      231
northern-suburbs      149
atlantic-seaboard     140
southern-peninsula    125
city-bowl             104
Name: area, dtype: int64

In [15]:
df.dtypes

area                     object
bathroom                float64
bedroom                 float64
buildingSize            float64
data-date        datetime64[ns]
data-isonshow            object
data-location            object
data-price                int64
data-url                 object
erfSize                 float64
garage                  float64
type                     object
dtype: object

In [16]:
df.shape

(1689, 12)

In [17]:
df.to_csv('pam.csv', index=False)