# Web scraping from Speedhome

I built this scraper to collect rental data from [Speedhome](https://speedhome.com) so that I can analyze the rental industry in Kuala Lumpur based on the listings available on the website. Below is the documentation of my work.

**Disclaimer**: I scraped this site for my own personal project. Please scrape responsibly and ethically. See [Ethics in Web Scraping](https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01).


## Modules for web scraping
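Import the modules for web scraping, `requests` and `BeautifulSoup`, along with `sleep` and `randint` for pacing the requests and `pandas` for tabulating the results.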

In [1]:
import requests
from bs4 import BeautifulSoup
from time import sleep
from random import randint
import pandas as pd

BASE_URL = "https://speedhome.com"

Below is the core code to scrape from Speedhome.

As a proof of concept, the code here scrapes only 3 pages from Speedhome. For the real dataset, I scraped all 78 pages of rental adverts on Speedhome.

In [2]:
rental = []
url = "/rent?pg=1"
i = 0

while url:
    res = requests.get(f"{BASE_URL}{url}")
    print(f"Now scraping {BASE_URL}{url}")
    soup = BeautifulSoup(res.text, "html.parser")
    rent = soup.find_all(class_="pro-col pro-grid col-xs-12 col-sm-4")

    for home in rent:
        rental.append({
            "name":home.find(class_="pro-title").get_text(),
            "price":home.find(class_="price").get_text(),
            "features":home.find(class_="features-sub").get_text(",", strip=True).split(","),
            # Pair each count from the text with the label taken from its icon's file name.
            "facilities":[f"{count} {img['src'][16:-4]}" for count, img in zip(home.find(class_="facilities").get_text(",", strip=True).split(","), home.find(class_="facilities").find_all("img"))],

        })
    # Follow the "next" pagination link; stop when there is none.
    next_link = soup.find(class_="next")
    url = next_link["href"] if next_link else None

    # Space out requests so the server is not overloaded.
    print("Sleeping")
    sleep(randint(10, 60))

    # Stop after 3 pages as a proof of concept. The real data was scraped from all pages.
    i += 1
    if i > 2:
        break

Now scraping https://speedhome.com/rent?pg=1
Sleeping
Now scraping https://speedhome.com/rent?pg=2
Sleeping
Now scraping https://speedhome.com/rent?pg=3
Sleeping
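
The loop above follows each page's "next" link until there is none, or until a page cap is reached. A minimal sketch of just that control flow, with the HTTP and parsing details left out, might look like the following. `crawl`, `fetch_page`, `pause`, `max_pages` and `FAKE_SITE` are hypothetical names, and `fetch_page` is assumed to return the listings on a page together with the next URL (or `None`).

```python
# Hypothetical stand-in for the site: url -> (listings on that page, next url or None).
FAKE_SITE = {
    "/rent?pg=1": (["home A", "home B"], "/rent?pg=2"),
    "/rent?pg=2": (["home C"], "/rent?pg=3"),
    "/rent?pg=3": (["home D"], None),
}


def crawl(start_url, fetch_page, max_pages=None, pause=lambda: None):
    """Follow 'next' links from start_url and collect the listings of each page."""
    collected = []
    url = start_url
    pages_done = 0
    while url:
        listings, url = fetch_page(url)  # url now points at the next page, if any
        collected.extend(listings)
        pages_done += 1
        if max_pages is not None and pages_done >= max_pages:
            break
        if url:
            pause()  # wait only when another request will follow
    return collected


all_listings = crawl("/rent?pg=1", FAKE_SITE.__getitem__)
first_two_pages = crawl("/rent?pg=1", FAKE_SITE.__getitem__, max_pages=2)
print(all_listings)
print(first_two_pages)
```

In a real run, `pause` could be something like `lambda: sleep(randint(10, 60))`. Unlike the cell above, this sketch skips the pause after the last page, which is a design choice of the sketch rather than something the original code does.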


The image below is a snippet from Speedhome. The code above scrapes the `name`, `price`, `features` and `facilities` of every rental unit advertised.

![alt text](images/speedhomesnippet.png "Speedhome Rental Snippet")

For example, for the first listing in the image, the scraper will get the data below:
```python
    {'name': 'Regalia @ Jalan Sultan Ismail, Kuala Lumpur',
     'price': 'RM 2,800',
     'features': ['864 sqft', 'High-Rise', 'Fully furnished'],
     'facilities': ['2 bed', '2 bath', '1 parking'],
     ...}
```
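
In the scraping cell, each `facilities` entry pairs a count taken from the listing text with a label cut out of the icon's `src` by fixed slicing (`src[16:-4]`). The sketch below shows the same pairing on plain strings, but takes the label from the file name instead of fixed offsets. The icon paths are hypothetical and only chosen for illustration, since the real paths are not shown in this notebook, and `icon_label` is a made-up helper.

```python
from pathlib import PurePosixPath

# Counts as they would come out of the listing text.
counts = ["2", "2", "1"]

# Hypothetical icon paths, for illustration only.
icon_srcs = [
    "/img/facilities/bed.png",
    "/img/facilities/bath.png",
    "/img/facilities/parking.png",
]


def icon_label(src):
    """Return the file name without its directory or extension."""
    return PurePosixPath(src).stem


facilities = [f"{count} {icon_label(src)}" for count, src in zip(counts, icon_srcs)]
print(facilities)
```

Taking the file stem keeps working if the directory or the extension changes length, whereas a fixed slice would silently return the wrong text.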

The first three records, which correspond to the listings in the image above, are shown below.

In [3]:
rental[:3]

[{'name': 'Regalia @ Jalan Sultan Ismail, Kuala Lumpur',
  'price': 'RM 2,800',
  'features': ['864 sqft', 'High-Rise', 'Fully furnished'],
  'facilities': ['2 bed', '2 bath', '1 parking']},
 {'name': 'Puncak Prima Condo, Sri Hartamas, Kuala Lumpur',
  'price': 'RM 2,300',
  'features': ['1095 sqft', 'High-Rise', 'Fully furnished'],
  'facilities': ['3 bed', '2 bath', '2 parking']},
 {'name': 'Apartment Unit for Rent',
  'price': 'RM 2,000',
  'features': ['1800 sqft', 'High-Rise', 'Partially furnished'],
  'facilities': ['3 bed', '2 bath', '1 parking']}]

There are 2 types of rentals advertised on Speedhome:
1. A whole apartment/home
2. A room

I am only interested in the 1st type of rental, so I will remove the 2nd type from the data. The filter below keeps only the listings that have exactly three `facilities` entries (bed, bath and parking).

In [4]:
rental_clean = []

for i in rental:
    if len(i["facilities"]) == 3:
        rental_clean.append(i)

In [5]:
len(rental)

63

In [6]:
len(rental_clean)

44

So, of the 63 listings scraped, 19 were removed, leaving 44 listings of the 1st type of rental.
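
The same filter can be sketched as a list comprehension on a small hand-made sample. The records and the `is_whole_unit` helper are hypothetical, and the shape of the room listing is an assumption, since no room record is shown in this notebook.

```python
# Hand-made sample; the "Single room" record is an assumed shape for a room listing.
sample = [
    {"name": "Whole unit", "facilities": ["2 bed", "2 bath", "1 parking"]},
    {"name": "Single room", "facilities": ["1 bath"]},
    {"name": "Another unit", "facilities": ["3 bed", "2 bath", "2 parking"]},
]


def is_whole_unit(listing):
    """Keep listings that report all three facilities: bed, bath and parking."""
    return len(listing["facilities"]) == 3


whole_units = [listing for listing in sample if is_whole_unit(listing)]
removed = len(sample) - len(whole_units)
print(len(sample), len(whole_units), removed)
```

Naming the condition makes it easier to change later if the number of facilities turns out not to be a reliable way to tell rooms from whole units.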

Next, I clean the data further by splitting the `features` and `facilities` of each rental unit into separate fields.

1. `features` is divided into:

    * `sqft`: Square footage of the unit
    * `high_rise`: Whether the unit is high-rise or landed
    * `furnished`: The furnishing of the unit


2. `facilities` is divided into:

    * `bed`: Number of bedrooms
    * `bath`: Number of bathrooms
    * `parking`: Number of parking spaces available
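
The split listed above can be sketched on a single record as follows. `flatten_listing`, `FEATURE_KEYS` and `FACILITY_KEYS` are hypothetical names, and the sketch assumes, as the notebook does, that the values always appear in the same order. It builds a new dictionary rather than updating the record in place.

```python
FEATURE_KEYS = ("sqft", "high_rise", "furnished")
FACILITY_KEYS = ("bed", "bath", "parking")


def flatten_listing(listing):
    """Return a new dict with features and facilities split into named fields."""
    flat = {"name": listing["name"], "price": listing["price"]}
    # zip pairs each key with the value at the same position in the list.
    flat.update(zip(FEATURE_KEYS, listing["features"]))
    flat.update(zip(FACILITY_KEYS, listing["facilities"]))
    return flat


record = {
    "name": "Regalia @ Jalan Sultan Ismail, Kuala Lumpur",
    "price": "RM 2,800",
    "features": ["864 sqft", "High-Rise", "Fully furnished"],
    "facilities": ["2 bed", "2 bath", "1 parking"],
}
flat = flatten_listing(record)
print(flat)
```

Because the result leaves out the original `features` and `facilities` lists, there is nothing redundant to drop afterwards.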

In [7]:
for i in rental_clean:
    i.update({"sqft":i["features"][0], "high_rise":i["features"][1], "furnished":i["features"][2]})
    i.update({"bed":i["facilities"][0], "bath":i["facilities"][1], "parking":i["facilities"][2]})

rental_clean[:3]

[{'name': 'Regalia @ Jalan Sultan Ismail, Kuala Lumpur',
  'price': 'RM 2,800',
  'features': ['864 sqft', 'High-Rise', 'Fully furnished'],
  'facilities': ['2 bed', '2 bath', '1 parking'],
  'sqft': '864 sqft',
  'high_rise': 'High-Rise',
  'furnished': 'Fully furnished',
  'bed': '2 bed',
  'bath': '2 bath',
  'parking': '1 parking'},
 {'name': 'Puncak Prima Condo, Sri Hartamas, Kuala Lumpur',
  'price': 'RM 2,300',
  'features': ['1095 sqft', 'High-Rise', 'Fully furnished'],
  'facilities': ['3 bed', '2 bath', '2 parking'],
  'sqft': '1095 sqft',
  'high_rise': 'High-Rise',
  'furnished': 'Fully furnished',
  'bed': '3 bed',
  'bath': '2 bath',
  'parking': '2 parking'},
 {'name': 'Apartment Unit for Rent',
  'price': 'RM 2,000',
  'features': ['1800 sqft', 'High-Rise', 'Partially furnished'],
  'facilities': ['3 bed', '2 bath', '1 parking'],
  'sqft': '1800 sqft',
  'high_rise': 'High-Rise',
  'furnished': 'Partially furnished',
  'bed': '3 bed',
  'bath': '2 bath',
  'parking': '1 parking'}]

We then turn the data into a dataframe using `pandas` so that it can be easily saved to a CSV file.

In [8]:
df = pd.DataFrame(rental_clean)

In [9]:
df.head()

| | bath | bed | facilities | features | furnished | high_rise | name | parking | price | sqft |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 bath | 2 bed | [2 bed, 2 bath, 1 parking] | [864 sqft, High-Rise, Fully furnished] | Fully furnished | High-Rise | Regalia @ Jalan Sultan Ismail, Kuala Lumpur | 1 parking | RM 2,800 | 864 sqft |
| 1 | 2 bath | 3 bed | [3 bed, 2 bath, 2 parking] | [1095 sqft, High-Rise, Fully furnished] | Fully furnished | High-Rise | Puncak Prima Condo, Sri Hartamas, Kuala Lumpur | 2 parking | RM 2,300 | 1095 sqft |
| 2 | 2 bath | 3 bed | [3 bed, 2 bath, 1 parking] | [1800 sqft, High-Rise, Partially furnished] | Partially furnished | High-Rise | Apartment Unit for Rent | 1 parking | RM 2,000 | 1800 sqft |
| 3 | 2 bath | 3 bed | [3 bed, 2 bath, 1 parking] | [1080 sqft, High-Rise, Fully furnished] | Fully furnished | High-Rise | putra majestik, jalan ipoh | 1 parking | RM 1,800 | 1080 sqft |
| 4 | 4 bath | 4 bed | [4 bed, 4 bath, 1 parking] | [2902 sqft, High-Rise, Fully furnished] | Fully furnished | High-Rise | THE MARC RESIDENCE KLCC SUITES | 1 parking | RM 12,000 | 2902 sqft |


In [10]:
df.columns

Index(['bath', 'bed', 'facilities', 'features', 'furnished', 'high_rise',
       'name', 'parking', 'price', 'sqft'],
      dtype='object')

Since `features` and `facilities` are now redundant (we have split them into separate columns), we remove them from the data frame and reorder the remaining columns as below.

In [11]:
df = df[["name", "price", "sqft", "high_rise", "furnished", "bath", "bed", "parking"]]

In [12]:
df.head()

| | name | price | sqft | high_rise | furnished | bath | bed | parking |
|---|---|---|---|---|---|---|---|---|
| 0 | Regalia @ Jalan Sultan Ismail, Kuala Lumpur | RM 2,800 | 864 sqft | High-Rise | Fully furnished | 2 bath | 2 bed | 1 parking |
| 1 | Puncak Prima Condo, Sri Hartamas, Kuala Lumpur | RM 2,300 | 1095 sqft | High-Rise | Fully furnished | 2 bath | 3 bed | 2 parking |
| 2 | Apartment Unit for Rent | RM 2,000 | 1800 sqft | High-Rise | Partially furnished | 2 bath | 3 bed | 1 parking |
| 3 | putra majestik, jalan ipoh | RM 1,800 | 1080 sqft | High-Rise | Fully furnished | 2 bath | 3 bed | 1 parking |
| 4 | THE MARC RESIDENCE KLCC SUITES | RM 12,000 | 2902 sqft | High-Rise | Fully furnished | 4 bath | 4 bed | 1 parking |


The data is then saved to a CSV file.

In [13]:
df.to_csv("data/speedhome_proofofconcept.csv", encoding='utf-8', index=False)
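
Several values in this dataset contain commas (for example `RM 2,800` and most of the names), so they have to be quoted in the CSV file to survive a round trip. The sketch below illustrates this with Python's built-in `csv` module on an in-memory buffer instead of `pandas` and a real file. It is only an illustration of the quoting, not the code used for the saved file.

```python
import csv
import io

rows = [
    {"name": "Regalia @ Jalan Sultan Ismail, Kuala Lumpur", "price": "RM 2,800"},
    {"name": "Apartment Unit for Rent", "price": "RM 2,000"},
]

# Write to an in-memory buffer; fields containing commas are quoted automatically.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original values, commas included.
read_back = [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]
print(read_back == rows)
```

`DataFrame.to_csv` applies the same minimal quoting by default, so the file written above can be read back with `pd.read_csv` without the commas splitting the columns.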

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44 entries, 0 to 43
Data columns (total 8 columns):
name         44 non-null object
price        44 non-null object
sqft         44 non-null object
high_rise    44 non-null object
furnished    44 non-null object
bath         44 non-null object
bed          44 non-null object
parking      44 non-null object
dtypes: object(8)
memory usage: 2.8+ KB
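
The `df.info()` output above shows that every column is stored as `object`, meaning the prices, areas and counts are still text such as `RM 2,800` or `864 sqft`. A possible next step before any analysis, which this notebook does not perform, is to convert them to numbers. The helpers below are a hypothetical sketch and assume prices are whole ringgit amounts written as `RM` followed by digits and commas.

```python
import re


def parse_price(text):
    """Convert a price such as 'RM 2,800' to an integer (assumes whole ringgit)."""
    return int(text.replace("RM", "").replace(",", "").strip())


def leading_int(text):
    """Return the integer at the start of strings such as '864 sqft' or '2 bed'."""
    match = re.match(r"\s*(\d+)", text)
    if match is None:
        raise ValueError(f"no leading number in {text!r}")
    return int(match.group(1))


print(parse_price("RM 2,800"))
print(leading_int("864 sqft"))
print(leading_int("2 bed"))
```

With a data frame, these could be applied per column, for example `df["price"].map(parse_price)`. Both helpers raise `ValueError` on unexpected input rather than guessing.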


We've come to the end of scraping the data from Speedhome. This is the first time I have scraped data from a website. The core scraping code is not perfect because of my inexperience, but as a proof of concept it works.

As I said, please scrape responsibly and ethically if you want to do so.

Thank you for reading!