## ToDo
 * [x] get 10000 links
 * [x] locate required data in links
 * [ ] clean data (empty rows, numerical values, etc.)
 * [x] build the df and save it to a csv file


In [27]:
import requests
import lxml.html
import bs4
from bs4 import BeautifulSoup

import json
import pandas as pd

import random
import time

import logging
import collections

import selenium

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By

## Getting the required number of links

For this project we needed to collect 10,000 data points and save them to a csv. We used ImmoWeb because it has a large number of listings to draw from. After adding the required filters (no life annuity sales), our base link is: https://www.immoweb.be/en/search/house/for-sale?countries=BE&isALifeAnnuitySale=false&page=1&orderBy=newest

As this is just a demo project, we only work with the first page of results. In the full project we use string formatting to substitute the page number and loop through the pages until there is no next page and we hit ImmoWeb's error page, at which point we break out of the loop.

This does take a while, so we have included a text file with all the links we scraped. A cell further down contains the code to fill your list of links from this text file if necessary.
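
The page substitution described above can be sketched with the standard library. This is only an illustration: the helper name `build_search_url` is hypothetical, and the query parameters are copied from the base link rather than from any documented ImmoWeb API.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.immoweb.be/en/search/house/for-sale"


def build_search_url(page: int) -> str:
    """Return the search URL for one results page, with the same filters as the base link."""
    params = {
        "countries": "BE",
        "isALifeAnnuitySale": "false",
        "page": page,
        "orderBy": "newest",
    }
    # urlencode keeps the insertion order of the dict
    return f"{BASE_URL}?{urlencode(params)}"


# Print the URLs of the first three result pages
for page in range(1, 4):
    print(build_search_url(page))
```

Building the query from a dict keeps the filters in one place, which may be easier to adjust than a long f-string.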

In [28]:
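links = []
page = 1

# Start one browser and reuse it for every page instead of opening a new one per loop
driver = webdriver.Firefox()
driver.implicitly_wait(2)

try:
    while page <= 1:  # demo: first page only; use `while True` in the full project
        url = f"https://www.immoweb.be/en/search/house/for-sale?countries=BE&isALifeAnnuitySale=false&page={page}&orderBy=newest"
        driver.get(url)

        # find_elements returns an empty list (no exception) when nothing matches
        if driver.find_elements(By.CLASS_NAME, "page-error"):
            break  # ImmoWeb's error page: there is no next page

        link_soup = BeautifulSoup(driver.page_source, "lxml")

        list_items = link_soup.find_all("li", attrs={"class": "search-results__item"})
        # Slicing keeps the limit of 10 items without failing on shorter pages
        for list_item in list_items[:10]:
            for link in list_item.find_all("a", attrs={"class": "card__title-link"}):
                href = link.get("href")
                if href and href not in links:
                    links.append(href)

        page += 1
finally:
    driver.quit()  # always release the browser, even after an error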

This code fills the list of links from a file. The assumed name and location of the file are stored in `filename` and the assumed separator is `\n`. If the file you use is structured or located differently, edit the necessary variables before running the code.

In [3]:

links = []
filename = "./data/demo_links.txt"  # If you have the links saved elsewhere, just replace the path

# The context manager closes the file even if reading fails
with open(filename, "r") as my_file:
    for link in my_file.read().split("\n"):
        if link:  # skip empty lines, such as the one after the trailing newline
            links.append(link)

Now that we have our links, we need to remove any duplicates.

In [38]:
print(len(links))
links = list(dict.fromkeys(links))
print(len(links))

38
38
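
`dict.fromkeys` only removes links that are identical as strings. One possible reason for duplicates surviving (an unverified assumption about the site) is that the same listing shows up under URLs that differ only in their query string. The `drop_duplicates` check at the end of this notebook does find repeated rows, so this may be worth checking. A minimal sketch, with the hypothetical helpers `strip_query` and `unique_links` and made-up example URLs:

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Drop the query string and fragment so only scheme, host and path remain."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def unique_links(urls):
    """Deduplicate on the normalized URL while keeping the first-seen order."""
    return list(dict.fromkeys(strip_query(url) for url in urls))


# Made-up URLs: the first and third point at the same listing
sample = [
    "https://example.com/listing/1?searchId=abc",
    "https://example.com/listing/2?searchId=abc",
    "https://example.com/listing/1?searchId=xyz",
]
print(unique_links(sample))
```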


If we don't have enough links yet, the code below rechecks the newest listings every so often to see whether any new properties have been added, until the list reaches the required size.

In this demo project the required number of links is set to 30; normally this would be 10,000 (or however many links we need).

In [30]:
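REQUIRED_LINKS = 30  # demo value; 10 000 in the full project
url = "https://www.immoweb.be/en/search/house/for-sale?countries=BE&isALifeAnnuitySale=false&page=1&orderBy=newest"

while len(links) < REQUIRED_LINKS:
    driver = webdriver.Firefox()
    try:
        driver.implicitly_wait(2)
        driver.get(url)
        link_soup = BeautifulSoup(driver.page_source, "lxml")
    finally:
        driver.quit()  # always release the browser, even if the page load fails

    for list_item in link_soup.find_all("li", attrs={"class": "search-results__item"}):
        for link in list_item.find_all("a", attrs={"class": "card__title-link"}):
            href = link.get("href")
            # Only exact string matches are filtered out here; the same listing under a
            # slightly different href would still be added, which is possibly why
            # duplicates slip through (unverified)
            if href and href not in links:
                links.append(href)

    if len(links) < REQUIRED_LINKS:
        time.sleep(180)  # wait 3 minutes before checking for new listings

print("Done")
print(len(links))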

Done
38
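
The polling pattern of the cell above can be tried without a browser by passing the scraping step in as a function. This is a sketch, not the project's code: `collect_until`, `fetch_links` and `max_rounds` are hypothetical names, and the fake fetcher below stands in for the Selenium call.

```python
import time


def collect_until(target, fetch_links, links=None, wait_seconds=180,
                  max_rounds=None, sleep=time.sleep):
    """Call fetch_links() repeatedly until `target` unique links are collected."""
    links = list(links or [])
    seen = set(links)  # set membership is faster than scanning the list
    rounds = 0
    while len(links) < target:
        for href in fetch_links():
            if href not in seen:
                seen.add(href)
                links.append(href)
        rounds += 1
        if len(links) >= target or (max_rounds is not None and rounds >= max_rounds):
            break
        sleep(wait_seconds)  # pause before rechecking the newest listings
    return links


# Fake fetcher that returns a new batch of (partly repeated) links on each call
batches = iter([["a", "b"], ["b", "c"], ["d"]])
result = collect_until(4, lambda: next(batches), sleep=lambda seconds: None)
print(result)
```

The optional `max_rounds` guard keeps the loop from running forever when no new listings appear.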


If you really want to, you could save the links to a text file. This will make it so you do not need to scrape all the same data over and over again.

In [31]:
filename = "./data/demo_links.txt"

# The context manager closes the file even if writing fails
with open(filename, "w") as my_file:
    for link in links:
        my_file.write(link + "\n")
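
As a quick check that saving and loading are each other's inverse, the two steps can be wrapped in small helpers and run against a temporary file. The names `save_links` and `load_links` are hypothetical; the one-link-per-line format is the one used in this notebook.

```python
import tempfile
from pathlib import Path


def save_links(links, path):
    """Write one link per line, ending with a trailing newline."""
    Path(path).write_text("\n".join(links) + "\n", encoding="utf-8")


def load_links(path):
    """Read the links back, skipping empty lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


# Round trip through a temporary directory so no real data file is touched
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_links.txt"
    save_links(["https://example.com/a", "https://example.com/b"], path)
    print(load_links(path))
```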

## Get the data from the links

To give a short overview of what follows: we first make a soup from just one of the links so we can look at the available data and select what we need. Further below we loop over all the links, build a soup for each one, and save the required data to our dataframe. For this step a regular GET request alongside BeautifulSoup is enough to find the data we need.

In [32]:
r = requests.get(links[0])
property_soup = BeautifulSoup(r.content, "lxml")

Now we'll take our soup and look for the relevant data. Since ImmoWeb fills its site using data from a database, and that data is embedded in a script on the page, we simply need to locate it, parse it and store the relevant parts for easier access.
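
The parsing step boils down to cutting a JavaScript assignment down to its JSON payload. A minimal sketch on a made-up script string; the variable name `window.listing`, the helper `parse_assigned_json` and the sample values are assumptions for illustration, not ImmoWeb's actual markup.

```python
import json


def parse_assigned_json(script_text: str) -> dict:
    """Parse the JSON object on the right-hand side of `name = {...};`."""
    # Everything after the first "=" is the payload
    _, _, payload = script_text.partition("=")
    payload = payload.strip()
    # Drop the JavaScript statement terminator if present
    if payload.endswith(";"):
        payload = payload[:-1]
    return json.loads(payload)


sample_script = """
    window.listing = {"property": {"type": "HOUSE", "bedroomCount": 3}, "price": {"mainValue": 459000}};
"""
data = parse_assigned_json(sample_script)
print(data["property"]["type"], data["price"]["mainValue"])
```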

In [33]:
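# The listing data sits in a <script> inside the div with class "classified"
div_with_script = property_soup.find("div", attrs={"class": "classified"})
# Keep everything after the first "= ", i.e. the right-hand side of the assignment
script_text = div_with_script.script.text.split("= ", 1)[1]
# Drop the final character after the JSON (presumably the closing ";") before parsing
json_data = json.loads(script_text.rstrip()[:-1])

property_data = json_data["property"]
price_data = json_data["price"]
for key, value in property_data.items():
    print(key)
    print(f"{value}\n")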

type
HOUSE

subtype
VILLA

title
Villa with 3 bedr., garage, cellar, attic and paved garden

description
Perfectly maintained, ready to move in 4 façade house located in the heart of Kortenberg. Literally within walking distance of all shops, schools, public transport (bus and train), motorway access, etc. The house underwent several updates over the years and is as good as ready to move in. Beautiful light. The layout is as follows. Ground floor: entrance hall +/- 10m². Guest toilet with washbasin. Beautiful living room and dining area (laminate, together +/- 32m²) with gas fireplace and air conditioning. Individual kitchen +/- 8,5 m², fully equipped with: dishwasher, combi-oven, gas-cooker, cooker-hood, fridge, double sink and water-heater. Garage for one car + connections for washer and dryer. The garden is fully paved and fenced. At theside there is also a possibility to place bicycles etc. First floor: attic room with boiler (electric). 3 spacious bedrooms : +/- 15,37 -14,62 – 7,2

As you can see, the captured json contains pretty much all the data we need. For the data we are still missing, it's just a matter of accessing the different levels and attributes of the json. Luckily I already went through the entire structure of the json to locate all the necessary parts, which are used in the loop in the next section.
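
Because several levels of the json can be `None`, accessing nested attributes needs a fallback. One way to express that is a small helper; `get_nested` is a hypothetical name and the sample dict only mimics a few keys seen in the captured data.

```python
def get_nested(data, *keys, default="Unknown"):
    """Follow `keys` through nested dicts; return `default` if a level is missing or None."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or current.get(key) is None:
            return default
        current = current[key]
    return current


# Tiny stand-in for the "property" part of the json
sample = {"location": {"locality": "Kortenberg"}, "kitchen": None, "hasGarden": False}

print(get_nested(sample, "location", "locality"))
print(get_nested(sample, "kitchen", "type"))
print(get_nested(sample, "hasGarden"))
```

Note that `False` is returned as-is; only `None` and missing keys fall back to the default.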

## Finish the project

With all this knowledge and info, we can now simply loop over our links, make a call for every single one and put the required data in a dataframe. This is going to take a while, so drink some coffee, take a nap, review some code or just do anything you like. And if anyone asks what you're doing, just tell them you're creating the dataframe.

ETA: 4h 10min
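
The estimate matches the random pauses alone, assuming it was derived for the full 10,000 links with the `random.uniform(1.0, 2.0)` sleep used in the loop below; the time spent on the requests themselves comes on top of that.

```python
links_to_fetch = 10_000
average_pause = (1.0 + 2.0) / 2  # mean of random.uniform(1.0, 2.0), in seconds

total_seconds = links_to_fetch * average_pause
hours, remainder = divmod(int(total_seconds), 3600)
minutes = remainder // 60

print(f"{hours}h {minutes}min")  # prints "4h 10min"
```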

In [34]:
location = []
property_type = []
property_subtype = []
price = []
type_of_sale = []
number_of_bedrooms = []
living_area = []
kitchen = []
furnished = []
open_fireplace = []
terrace = []
terrace_orientation = []
garden = []
garden_orientation = []
surface_area_land = []
number_of_facades = []
pool = []
condition = []

# The column lists, in the same order as the values in `row` below
columns = [
    location, property_type, property_subtype, price, type_of_sale,
    number_of_bedrooms, living_area, kitchen, furnished, open_fireplace,
    terrace, terrace_orientation, garden, garden_orientation,
    surface_area_land, number_of_facades, pool, condition,
]

for index, link in enumerate(links, start=1):
    try:
        r = requests.get(link, timeout=30)
        soup = BeautifulSoup(r.content, "lxml")

        div_with_script = soup.find("div", attrs={"class": "classified"})
        script_text = div_with_script.script.text.split("= ", 1)[1]
        json_data = json.loads(script_text.rstrip()[:-1])

        property_data = json_data["property"]
        price_data = json_data["price"]
        building = property_data["building"]

        if property_data["hasTerrace"] is False:
            terrace_direction = "No Terrace"
        else:
            terrace_direction = property_data["terraceOrientation"] if property_data["terraceOrientation"] is not None else "Unknown"

        if property_data["hasGarden"] is False:
            garden_direction = "No Garden"
        else:
            garden_direction = property_data["gardenOrientation"] if property_data["gardenOrientation"] is not None else "Unknown"

        row = [
            property_data["location"]["locality"] if property_data["location"] is not None else "Unknown",
            property_data["type"],
            property_data["subtype"],
            price_data["mainValue"],
            price_data["type"],
            property_data["bedroomCount"],
            property_data["netHabitableSurface"],
            property_data["kitchen"]["type"] if property_data["kitchen"] is not None else "Unknown",
            json_data["transaction"]["sale"]["isFurnished"],
            property_data["fireplaceExists"],
            property_data["hasTerrace"] if property_data["hasTerrace"] is not None else "Unknown",
            terrace_direction,
            property_data["hasGarden"] if property_data["hasGarden"] is not None else "Unknown",
            garden_direction,
            property_data["land"]["surface"] if property_data["land"] is not None else "NaN",
            building["facadeCount"] if building is not None else "Unknown",
            property_data["hasSwimmingPool"],
            building["condition"] if building is not None else "Unknown",
        ]
    except Exception as exception:
        # Report the failing link and skip it; nothing has been appended for it yet
        print(type(exception).__name__)
        print(index)
        print(link)
    else:
        # Append only complete rows so that all lists keep the same length
        for column, value in zip(columns, row):
            column.append(value)

    time.sleep(random.uniform(1.0, 2.0))  # short random pause between requests

print("Done")

Done


We now have all the necessary data saved across multiple lists, so we simply build a dataframe from these lists, write that dataframe to a csv, and we're all done.

Congratulations, you just scraped a ton of data and saved it to a csv! This should conclude this project.

In [39]:
df = pd.DataFrame({})
df["Location"] = location
df["Property type"] = property_type
df["Property subtype"] = property_subtype
df["Price"] = price
df["Type of sale"] = type_of_sale
df["Number of bedrooms"] = number_of_bedrooms
df["Living area"] = living_area
df["Kitchen"] = kitchen
df["Furnished"] = furnished
df["Open fireplace"] = open_fireplace
df["Terrace"] = terrace
df["Terrace orientation"] = terrace_orientation
df["Garden"] = garden
df["Garden orientation"] = garden_orientation
df["Surface area land"] = surface_area_land
df["Number of facades"] = number_of_facades
df["Pool"] = pool
df["Condition"] = condition


df.to_csv("./data/demo_houses.csv", index=True)
df

Unnamed: 0,Location,Property type,Property subtype,Price,Type of sale,Number of bedrooms,Living area,Kitchen,Furnished,Open fireplace,Terrace,Terrace orientation,Garden,Garden orientation,Surface area land,Number of facades,Pool,Condition
0,Kortenberg,HOUSE,VILLA,459000,residential_sale,3,157.0,USA_HYPER_EQUIPPED,False,True,True,WEST,True,WEST,194,4,,AS_NEW
1,Oostende,HOUSE,HOUSE,275000,residential_sale,3,,INSTALLED,False,False,True,Unknown,Unknown,Unknown,0,,False,GOOD
2,Zonnebeke,HOUSE,HOUSE,215000,residential_sale,2,,INSTALLED,False,False,True,Unknown,True,SOUTH_WEST,529,3,,TO_BE_DONE_UP
3,Ieper,HOUSE,HOUSE,345000,residential_sale,3,,INSTALLED,False,False,True,Unknown,Unknown,Unknown,2685,4,,TO_BE_DONE_UP
4,Roeselare,HOUSE,HOUSE,239000,residential_sale,3,193.0,SEMI_EQUIPPED,False,False,Unknown,Unknown,Unknown,Unknown,149,2,,GOOD
5,Roeselare,HOUSE,HOUSE,270000,residential_sale,3,148.0,SEMI_EQUIPPED,False,False,Unknown,Unknown,True,SOUTH,173,2,,GOOD
6,Linkebeek,HOUSE,HOUSE,565000,residential_sale,3,140.0,USA_INSTALLED,False,True,True,Unknown,True,SOUTH_EAST,500,,,AS_NEW
7,Blaton,HOUSE,HOUSE,240000,residential_sale,2,167.0,INSTALLED,False,False,True,Unknown,Unknown,Unknown,972,4,,
8,Kortenberg,HOUSE,VILLA,459000,residential_sale,3,157.0,USA_HYPER_EQUIPPED,False,True,True,WEST,True,WEST,194,4,,AS_NEW
9,Oostende,HOUSE,HOUSE,275000,residential_sale,3,,INSTALLED,False,False,True,Unknown,Unknown,Unknown,0,,False,GOOD


Next step: data cleaning, because sometimes we actually want fully unique and usable values.

In [36]:
print(len(df))
print(len(df.drop_duplicates()))

38
29
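
A possible starting point for that cleaning step, shown on a small made-up frame rather than the scraped data. The placeholder strings `"Unknown"` and `"NaN"` are the ones used in the scraping loop; the function name `clean` and the choice of numeric columns are assumptions.

```python
import pandas as pd

# Made-up sample with one duplicated row and the placeholder strings from the loop
raw = pd.DataFrame({
    "Location": ["Kortenberg", "Oostende", "Kortenberg"],
    "Price": [459000, 275000, 459000],
    "Living area": [157.0, None, 157.0],
    "Garden": [True, "Unknown", True],
    "Surface area land": [194, "NaN", 194],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate rows, turn placeholders into missing values, coerce numeric columns."""
    cleaned = df.drop_duplicates().reset_index(drop=True)
    # Replace the placeholder strings with real missing values
    cleaned = cleaned.mask(cleaned.isin(["Unknown", "NaN"]))
    for column in ["Price", "Living area", "Surface area land"]:
        cleaned[column] = pd.to_numeric(cleaned[column], errors="coerce")
    return cleaned


cleaned = clean(raw)
print(len(raw), len(cleaned))
print(cleaned)
```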
