# Scraping Huizenzoeker.nl to Analyse the Dutch Housing Market

### Introduction
Which places in the Netherlands are hit hardest by the Dutch housing crisis, and which the least?
Currently, the housing crisis is one of the most prominent societal challenges in the Netherlands. This script scrapes information on the Dutch housing market from Huizenzoeker.nl, enabling us to analyse the market and clarify which areas are hit hardest by the housing crisis. It provides information such as gem. vraagprijs (average asking price), # verkochte woningen (number of homes sold), gem. vierkante meter prijs (average price per square metre), and % overboden (percentage bid over the asking price). The resulting dataframe is useful, for example, for first-time buyers who are having a hard time purchasing their first home in the current stressed Dutch housing market.

The script is divided into five main steps:
* **Step 1. Loading all the basics**: this step loads all the relevant packages and sets up the BeautifulSoup basis.
* **Step 2. Collecting the municipality URLs**: this step collects the URLs of the municipalities in the Netherlands. We first create a list of the province URLs (twelve in total, one for each province in the Netherlands). From these twelve province URLs we can scrape the municipality URLs, since each province page links to its corresponding municipalities.
* **Step 3. Scrape data from each url (municipality-level)**: this step scrapes the data from the municipality URLs generated in step 2.
* **Step 4. Scrape data from each url (province-level)**: this step scrapes the data for each province, using the same code as for the municipality-level scraper.
* **Step 5. Scrape data for individual streets (street-level)**: only included for a subset of the streets in Noord-Brabant/Tilburg/Tilburg.

## Step 1: Loading all the basics

These packages are needed to run our scraper, so make sure you install/load these first!

In [2]:
from bs4 import BeautifulSoup 
import requests
import re
import pandas as pd 
import time 
import json
from selenium import webdriver 
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

To use BeautifulSoup, we request the source code of the Huizenzoeker woningmarkt page and parse it.

In [3]:
url = 'https://www.huizenzoeker.nl/woningmarkt/'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')

## Step 2: Collecting the municipality URLs

**Construct a list of URLs for all provinces of the Netherlands** 

We first define a base_url and a list province_url; appending each entry of province_url to base_url gives the URL of the woningmarkt page of that province. We define the generate_links() function to join these parts of the URL.

In [6]:
base_url = 'https://www.huizenzoeker.nl/woningmarkt/' #fixed part municipality URLs
province_url = ['noord-holland/', 'zuid-holland/', 'zeeland/', 'noord-brabant/', 'utrecht/', 'flevoland/', 
                'friesland/', 'groningen/', 'drenthe/', 'overijssel/', 'gelderland/', 'limburg/'] #variable part municipality URLS

In [7]:
def generate_links(base_url,province_url): 
    '''Takes in a base_url and province_url, returns these inputs pasted together'''
    page_links = []
    for i in province_url:
        full_links = base_url + i
        page_links.append(full_links)  
    return page_links

page_links = generate_links(base_url,province_url)
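
As a quick sanity check, the same list can also be built with a list comprehension. The sketch below is an illustrative alternative and not part of the original notebook: the name `generate_links_compact` is hypothetical, and only three provinces are used to keep the example short.

```python
def generate_links_compact(base_url, province_url):
    """Return base_url joined with every province slug (hypothetical helper)."""
    return [base_url + slug for slug in province_url]


base_url = 'https://www.huizenzoeker.nl/woningmarkt/'
sample_provinces = ['noord-holland/', 'zuid-holland/', 'zeeland/']  # shortened for illustration

links = generate_links_compact(base_url, sample_provinces)
print(links[0])  # https://www.huizenzoeker.nl/woningmarkt/noord-holland/
```

Both versions return the URLs in the same order as `province_url`, so either can feed the Selenium loop that follows.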

**Construct a list of URLs for all municipalities of each province of the Netherlands**

We then visit each province URL in this list with Selenium (using window handling) and collect the links found on the page, which gives us the URL of each municipality in each province.

In [8]:
driver = webdriver.Chrome(ChromeDriverManager().install())



Current google-chrome version is 94.0.4606
Get LATEST driver version for 94.0.4606
Driver [C:\Users\danie\.wdm\drivers\chromedriver\win32\94.0.4606.61\chromedriver.exe] found in cache


In [9]:
page_urls_full = []

for link in page_links:
    driver.switch_to.window(driver.window_handles[-1])
    driver.get(link)
    time.sleep(2)
    
    for elem in driver.find_elements(By.XPATH, "//li//div//a[@href]"):  # find_elements_by_xpath was removed in newer Selenium releases
        urls = elem.get_attribute('href')
        page_urls_full.append(urls)
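
The XPath above collects every link that sits inside a list item, so the result may contain duplicates or links that are not municipality pages. The sketch below shows one way to filter such a list. It assumes that municipality URLs have the shape `/woningmarkt/<province>/<municipality>/`; the helper name `keep_municipality_urls` and the sample URLs are hypothetical.

```python
from urllib.parse import urlparse


def keep_municipality_urls(urls, base_path='/woningmarkt/'):
    """Keep unique URLs shaped like /woningmarkt/<province>/<municipality>/ (assumed layout)."""
    kept = []
    seen = set()
    for url in urls:
        if not url:  # get_attribute('href') can return None
            continue
        path = urlparse(url).path
        if not path.startswith(base_path):
            continue
        parts = [p for p in path[len(base_path):].split('/') if p]
        if len(parts) == 2 and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


sample = [
    'https://www.huizenzoeker.nl/woningmarkt/noord-brabant/tilburg/',
    'https://www.huizenzoeker.nl/woningmarkt/noord-brabant/tilburg/',  # duplicate
    'https://www.huizenzoeker.nl/woningmarkt/noord-brabant/',          # province page
    'https://www.huizenzoeker.nl/koop/',                               # made-up unrelated link
    None,
]
filtered = keep_municipality_urls(sample)
print(filtered)
```

Note that filtering changes the positions in the list, so any hard-coded slice indices used later in this notebook would have to be derived again.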

## Step 3: Scrape data from each url (municipality-level)

**Scraper for municipality data**

For each municipality we extract:
* *Trend data*: gem. vraagprijs, verkochte woningen, gem. vierkante meter prijs, % overboden (and how these numbers have changed compared to the previous month)
* *Other information*: besteedbaar inkomen, aantal inwoners, populatiegroei

#### Warning: Running the next cell for 'page_urls_full' will take approx. 30 minutes. You might want to replace page_urls_full with 'subset' (the Noord-Brabant subset defined in Step 5)!

In [9]:
fn = 'saved_municipality_data.json' #saving the data as a JSON file

def extract_city_trends(page_urls_full):
    '''Takes in a list of municipality urls to scrape, returns a list of dictionaries (one per municipality) with all trend data, the besteedbaar inkomen, aantal inwoners and populatie groei'''
    trend_list = []
    for page_url in page_urls_full:
        driver.get(page_url)
        time.sleep(5) 
        soup = BeautifulSoup(driver.page_source, 'html.parser')
            # 'Provincie naam'
        provincie_naam = soup.find_all('a')[6].get_text()
            # 'Stadsnaam'
        stad_naam = soup.find_all('h2')[0].get_text()
        stad_naam = stad_naam.replace('Woningmarkt','')
        stad_naam = stad_naam.replace(' ', '')
            # 'Gemiddelde vraagprijs'
        content = soup.find_all(class_='trend-graph')[0]
        if content.find(class_="trend-graph-icon") == None:
            gem_vraagprijs = content.find("h3",{"class":"trend-graph-value"}).get_text()
            tov_vorige_maand_vraagprijs = "NA"
        else:
            if content.find(class_="trend-graph-pill trend-down") != None:
                gem_vraagprijs = content.find("h3",{"class":"trend-graph-value"}).get_text()
                gem_vraagprijs = gem_vraagprijs.replace("(","")
                gem_vraagprijs = gem_vraagprijs.replace(",)","")
                gem_vraagprijs = gem_vraagprijs.replace(".", ",")
                tov_vorige_maand_vraagprijs = content.find("div",{"class":"trend-graph-pill trend-down"}).get_text()
                tov_vorige_maand_vraagprijs = tov_vorige_maand_vraagprijs.replace("\n\n","")
                tov_vorige_maand_vraagprijs = tov_vorige_maand_vraagprijs.replace(" t.o.v. vorige maand\n","")
            else:
                gem_vraagprijs = content.find("h3",{"class":"trend-graph-value"}).get_text()
                gem_vraagprijs = gem_vraagprijs.replace("(","")
                gem_vraagprijs = gem_vraagprijs.replace(",)","")    
                gem_vraagprijs = gem_vraagprijs.replace(".", ",")
                tov_vorige_maand_vraagprijs = content.find("div",{"class":"trend-graph-pill"}).get_text()
                tov_vorige_maand_vraagprijs = tov_vorige_maand_vraagprijs.replace("\n\n","")
                tov_vorige_maand_vraagprijs = tov_vorige_maand_vraagprijs.replace(" t.o.v. vorige maand\n","")
            # 'Aantal verkochte woningen'
        content = soup.find_all(class_='trend-graph')[1]
        if content.find(class_="trend-graph-icon") == None:
            verk_woningen = content.find("h3",{"class":"trend-graph-value"}).get_text()
            tov_vorige_maand_verkocht = "NA"
        else:
            if content.find(class_="trend-graph-pill trend-down") != None:
                verk_woningen = content.find("h3",{"class":"trend-graph-value"}).get_text()               
                tov_vorige_maand_verkocht = content.find("div",{"class":"trend-graph-pill trend-down"}).get_text()
                tov_vorige_maand_verkocht = tov_vorige_maand_verkocht.replace("\n\n","")
                tov_vorige_maand_verkocht = tov_vorige_maand_verkocht.replace(" t.o.v. vorige maand\n","")
            else:
                verk_woningen = content.find("h3",{"class":"trend-graph-value"}).get_text()             
                tov_vorige_maand_verkocht = content.find("div",{"class":"trend-graph-pill"}).get_text()
                tov_vorige_maand_verkocht = tov_vorige_maand_verkocht.replace("\n\n","")
                tov_vorige_maand_verkocht = tov_vorige_maand_verkocht.replace(" t.o.v. vorige maand\n","")
            # 'Gemiddelde vierkante meter prijs'
        content = soup.find_all(class_='trend-graph')[2]
        if content.find(class_="trend-graph-icon") == None:
            m2_prijs = content.find("h3",{"class":"trend-graph-value"}).get_text()
            tov_vorige_maand_m2_prijs = "NA"
        else:
            if content.find(class_="trend-graph-pill trend-down") != None:
                m2_prijs = content.find("h3",{"class":"trend-graph-value"}).get_text()     
                m2_prijs = m2_prijs.replace(".", ",")
                tov_vorige_maand_m2_prijs = content.find("div",{"class":"trend-graph-pill trend-down"}).get_text()
                tov_vorige_maand_m2_prijs = tov_vorige_maand_m2_prijs.replace("\n\n","")
                tov_vorige_maand_m2_prijs = tov_vorige_maand_m2_prijs.replace(" t.o.v. vorige maand\n","")
            else:
                m2_prijs = content.find("h3",{"class":"trend-graph-value"}).get_text() 
                m2_prijs = m2_prijs.replace(".", ",")
                tov_vorige_maand_m2_prijs = content.find("div",{"class":"trend-graph-pill"}).get_text() 
                tov_vorige_maand_m2_prijs = tov_vorige_maand_m2_prijs.replace("\n\n","")
                tov_vorige_maand_m2_prijs = tov_vorige_maand_m2_prijs.replace(" t.o.v. vorige maand\n","")
            # 'Percentage overboden'
        content = soup.find_all(class_='trend-graph')[3]
        if content.find(class_="trend-graph-icon") == None:
            perc_overboden = content.find("h3",{"class":"trend-graph-value"}).get_text()
            tov_vorige_maand_perc_overboden = "NA"
        else:
            if content.find(class_="trend-graph-pill trend-down") != None:
                perc_overboden = content.find("h3",{"class":"trend-graph-value"}).get_text()               
                tov_vorige_maand_perc_overboden = content.find("div",{"class":"trend-graph-pill trend-down"}).get_text()
                tov_vorige_maand_perc_overboden = tov_vorige_maand_perc_overboden.replace("\n\n","")
                tov_vorige_maand_perc_overboden = tov_vorige_maand_perc_overboden.replace(" t.o.v. vorige maand\n","")
            else:
                perc_overboden = content.find("h3",{"class":"trend-graph-value"}).get_text()             
                tov_vorige_maand_perc_overboden = content.find("div",{"class":"trend-graph-pill"}).get_text()
                tov_vorige_maand_perc_overboden = tov_vorige_maand_perc_overboden.replace("\n\n","")
                tov_vorige_maand_perc_overboden = tov_vorige_maand_perc_overboden.replace(" t.o.v. vorige maand\n","")
            # 'Besteedbaar inkomen'
        bes_inkomen = soup.find_all(class_='detail__income huizenzoeker-card single-value-graph-container')[0].get_text()
        bes_inkomen = bes_inkomen.replace('\n','')
        bes_inkomen = bes_inkomen.replace('Besteedbaar Inkomen Per Huishouden','')
        bes_inkomen = bes_inkomen.replace(".", ",")
            # 'Inwoners'
        content = soup.find("div", {"class": "buurt-info"})
        inwoners = content.find_all('p')[3].get_text()  # call get_text() so we search the paragraph text itself
        inwoners = re.search('Dat zijn(.+?)inwoners', inwoners)
        if inwoners:
            found_inwoners = inwoners.group(1)
            found_inwoners = found_inwoners.strip()
            found_inwoners = found_inwoners.replace(".", ",")
        else:
            found_inwoners = 'NA'
            # 'Bevolkingsgroei'
        content = soup.find("div", {"class": "buurt-info"})
        populatiegroei = content('p')[4].get_text()  # call get_text() so we search the paragraph text itself
        populatiegroei_increase = re.search('afgelopen jaar met (.+?) gegroeid', populatiegroei)
        if populatiegroei_increase:
            found_populatiegroei = populatiegroei_increase.group(1)
            found_populatiegroei = found_populatiegroei.strip()
        else:
            found_populatiegroei = 'NA'
        populatiegroei_decline = re.search('afgelopen jaar met (.+?) gekrompen', populatiegroei)
        if populatiegroei_decline:
            found_populatiegroei_decline = populatiegroei_decline.group(1)
            found_populatiegroei_decline = found_populatiegroei_decline.strip() 
        else:
            found_populatiegroei_decline = 'NA'
            # Append list
        save_obj = {"Provincie":provincie_naam, "Stad":stad_naam, 
                    "Gem. vraagprijs":gem_vraagprijs, "%Δ Vraagprijs (t.o.v vorige maand)": tov_vorige_maand_vraagprijs,
                    "Verkochte woningen":verk_woningen, "%Δ Verkochte woningen (t.o.v vorige maand)":tov_vorige_maand_verkocht,
                    "Gem. m² prijs":m2_prijs, "%Δ m² prijs (t.o.v vorige maand)":tov_vorige_maand_m2_prijs,
                    "% Vraagprijs overboden":perc_overboden, "%Δ Overboden (t.o.v vorige maand)":tov_vorige_maand_perc_overboden,
                    "Besteedbaar inkomen (per huishouden)":bes_inkomen,
                    "Aantal inwoners": found_inwoners,
                    "% Populatie stijging":found_populatiegroei, "% Populatie daling":found_populatiegroei_decline}
        trend_list.append(save_obj)
        with open(fn, 'a', encoding='utf-8') as f:  # append one JSON object per line
            f.write(json.dumps(save_obj)+'\n')
    return(trend_list)
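
Because the scraper appends one JSON object per line to `fn`, the saved file can be read back without re-running the 30-minute scrape. The sketch below shows one way to do that; the helper name `load_json_lines` and the demo file and records are made up for illustration and are not real scraped data.

```python
import json
import os
import tempfile


def load_json_lines(path):
    """Read a file with one JSON object per line into a list of dicts."""
    records = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


# Demonstration with a temporary file and made-up records
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_municipality_data.json')
with open(demo_path, 'a', encoding='utf-8') as f:
    f.write(json.dumps({"Stad": "Voorbeeldstad", "Verkochte woningen": "6"}) + '\n')
    f.write(json.dumps({"Stad": "Testdorp", "Verkochte woningen": "20"}) + '\n')

records = load_json_lines(demo_path)
print(len(records))  # 2
```

Since the file is opened in append mode, running the scraper twice adds the second run after the first, so it may be worth deleting or renaming the file between runs.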

In [36]:
df = extract_city_trends(page_urls_full) 
pd.DataFrame(df) 

Unnamed: 0,Provincie,Stad,Gem. vraagprijs,%Δ Vraagprijs (t.o.v vorige maand),Verkochte woningen,%Δ Verkochte woningen (t.o.v vorige maand),Gem. m² prijs,%Δ m² prijs (t.o.v vorige maand),% Vraagprijs overboden,%Δ Overboden (t.o.v vorige maand),Besteedbaar inkomen (per huishouden),Aantal inwoners,% Populatie stijging,% Populatie daling
0,Noord-Brabant,Alphen-Chaam,"€ 589,500",93.60%,6,200.00%,"€ 3,358",38.19%,7.15%,5.52%,"€ 45,700",10203,0.53%,
1,Noord-Brabant,Altena,"€ 344,500",-3.50%,20,-25.93%,"€ 2,773",-6.44%,8.45%,-0.34%,"€ 43,400",55967,1.05%,
2,Noord-Brabant,Asten,"€ 275,000",-40.86%,1,-85.71%,"€ 2,523",-10.18%,6.04%,-1.11%,"€ 40,200",16721,0.07%,
3,Noord-Brabant,Baarle-Nassau,€ 0,,0,-100.00%,€ 0,,13.37%,-1.65%,"€ 37,700",6859,0.18%,
4,Noord-Brabant,Bergeijk,"€ 442,500",36.15%,8,-27.27%,"€ 3,564",4.00%,6.10%,2.74%,"€ 43,600",18635,0.78%,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56,Noord-Brabant,Vught,"€ 525,000",29.63%,12,-36.84%,"€ 4,005",8.24%,11.00%,4.11%,"€ 43,500",26558,0.61%,
57,Noord-Brabant,Waalre,"€ 375,000",-16.20%,7,75.00%,"€ 3,291",-11.86%,10.70%,0.06%,"€ 47,300",17456,1.21%,
58,Noord-Brabant,Waalwijk,"€ 380,000",33.33%,16,-30.43%,"€ 3,009",9.54%,7.04%,1.56%,"€ 37,500",48637,0.82%,
59,Noord-Brabant,Woensdrecht,"€ 315,000",-3.08%,7,-61.11%,"€ 2,520",-7.32%,6.35%,-0.67%,"€ 38,800",21876,0.05%,
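
The inhabitant count in `extract_city_trends` is pulled out of a sentence with a regular expression. The sketch below isolates that step so it can be tried without a browser. The helper name `extract_inwoners` and the sample sentences are made up for illustration; the real page text may be worded differently.

```python
import re


def extract_inwoners(sentence):
    """Return the inhabitant count found in the sentence, or 'NA' if absent."""
    match = re.search(r'Dat zijn(.+?)inwoners', sentence)
    if not match:
        return 'NA'
    # Same clean-up as in the scraper: trim spaces, swap '.' for ','
    return match.group(1).strip().replace('.', ',')


print(extract_inwoners('Dat zijn 10.203 inwoners in totaal.'))  # 10,203
print(extract_inwoners('Geen gegevens beschikbaar.'))           # NA
```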


In [25]:
df = extract_city_trends(page_urls_full) #saving the output as 'df'
final_dataframe = pd.DataFrame(df) #dataframe with all data for all municipalities in the Netherlands
final_dataframe #displaying the scraped municipality data as a Pandas dataframe 

Unnamed: 0,Province,City,Gem. vraagprijs,%Δ Vraagprijs (t.o.v vorige maand),Verkochte woningen,%Δ Verkochte woningen (t.o.v vorige maand),Gem. m2 prijs,%Δ M2 prijs (t.o.v vorige maand),% Vraagprijs overboden,%Δ Overboden (t.o.v vorige maand),Besteedbaar inkomen (per huishouden),Aantal inwoners,% Populatie stijging,% Populatie daling
0,Noord-Holland,Aalsmeer,"€ 685,000",57.47%,12,-7.69%,"€ 4,476",9.22%,10.67%,3.42%,"€ 45,800",31859,0.41%,
1,Noord-Holland,Alkmaar,"€ 362,500",25.00%,38,-39.68%,"€ 3,926",10.62%,14.05%,1.43%,"€ 36,300",109436,0.81%,
2,Noord-Holland,Amstelveen,"€ 570,000",18.13%,21,-56.25%,"€ 4,724",1.88%,305.01%,296.30%,"€ 37,800",91675,0.92%,
3,Noord-Holland,Amsterdam,"€ 450,000",7.78%,230,-27.44%,"€ 6,961",5.90%,16.10%,0.37%,"€ 30,100",872757,1.13%,
4,Noord-Holland,Beemster,"€ 612,000",-12.26%,4,-33.33%,"€ 4,311",-6.89%,-0.23%,-12.10%,"€ 47,300",10022,2.81%,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
347,Limburg,ValkenburgaandeGeul,"€ 365,000",-5.19%,4,-55.56%,"€ 3,308",9.75%,12.60%,5.41%,"€ 35,600",16367,,0.63%
348,Limburg,Venlo,"€ 319,000",16.00%,32,-50.00%,"€ 2,727",17.80%,7.84%,-0.72%,"€ 33,700",101802,0.20%,
349,Limburg,Venray,"€ 297,000",8.99%,8,-33.33%,"€ 2,729",23.09%,8.98%,1.22%,"€ 39,100",43614,0.66%,
350,Limburg,Voerendaal,"€ 287,500",-11.13%,2,-75.00%,"€ 2,185",-12.53%,7.69%,-2.09%,"€ 40,800",12475,0.18%,


In [26]:
final_dataframe=pd.DataFrame(df) #dataframe with all data for all municipalities in the Netherlands

**Exporting the municipality-level dataframe as a CSV file, to import it into RStudio for analysis**

In [None]:
final_dataframe.to_csv('huizenzoeker_scraper_data.csv') 
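
All scraped values are stored as strings such as `€ 589,500` or `93.60%`, so they need to be converted before any numerical analysis. The sketch below shows one possible conversion. It assumes that the comma is a thousands separator and the point a decimal separator, as in the output shown above; the helper name `to_number` is hypothetical.

```python
def to_number(text):
    """Convert strings such as '€ 589,500' or '93.60%' to float; return None if not numeric."""
    if text is None:
        return None
    cleaned = text.replace('€', '').replace('%', '').replace(',', '').strip()
    try:
        return float(cleaned)
    except ValueError:  # e.g. 'NA' or an empty string
        return None


print(to_number('€ 589,500'))  # 589500.0
print(to_number('-25.93%'))    # -25.93
print(to_number('NA'))         # None
```

A function like this could be applied per column, for example with `final_dataframe['Gem. vraagprijs'].map(to_number)`, either here or after importing the CSV elsewhere.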

## Step 4: Scrape data from each url (province-level)

**Constructing the list of province URLs**

Here we reuse the generate_links function that we defined earlier to construct the province URLs. This time, we scrape the data from these URLs themselves instead of navigating to each individual municipality.

In [37]:
page_links = generate_links(base_url,province_url) 

**Scraper for province data**

For each province we again extract:
* *Trend data*: gem. vraagprijs, verkochte woningen, gem. vierkante meter prijs, % overboden (and how these numbers have changed compared to the previous month)
* *Other information*: besteedbaar inkomen, aantal inwoners, populatiegroei

In [42]:
fn ='saved_province_data.json' #saving the data to a JSON file

def extract_provincie_trends(page_links):
    '''Takes in a list of province urls to scrape, returns a list of dictionaries (one per province) with all trend data, the besteedbaar inkomen, aantal inwoners and populatie groei'''
    trend_list = []
    for page_link in page_links:
        driver.get(page_link)
        time.sleep(1) 
        soup = BeautifulSoup(driver.page_source, 'html.parser')
            # 'Provincie'
        provincie_naam = soup.find_all('h2')[0].get_text()
        provincie_naam = provincie_naam.replace('Woningmarkt','')
        provincie_naam = provincie_naam.replace(' ', '')
            # 'Gemiddelde vraagprijs'
        content = soup.find_all(class_='trend-graph')[0]
        if content.find(class_="trend-graph-icon") == None:
            gem_vraagprijs = content.find("h3",{"class":"trend-graph-value"}).get_text()
            tov_vorige_maand_vraagprijs = "NA"
        else:
            if content.find(class_="trend-graph-pill trend-down") != None:
                gem_vraagprijs = content.find("h3",{"class":"trend-graph-value"}).get_text()
                gem_vraagprijs = gem_vraagprijs.replace("(","")
                gem_vraagprijs = gem_vraagprijs.replace(",)","")
                gem_vraagprijs = gem_vraagprijs.replace(".", ",")
                tov_vorige_maand_vraagprijs = content.find("div",{"class":"trend-graph-pill trend-down"}).get_text()
                tov_vorige_maand_vraagprijs = tov_vorige_maand_vraagprijs.replace("\n\n","")
                tov_vorige_maand_vraagprijs = tov_vorige_maand_vraagprijs.replace(" t.o.v. vorige maand\n","")
            else:
                gem_vraagprijs = content.find("h3",{"class":"trend-graph-value"}).get_text()
                gem_vraagprijs = gem_vraagprijs.replace("(","")
                gem_vraagprijs = gem_vraagprijs.replace(",)","")    
                gem_vraagprijs = gem_vraagprijs.replace(".", ",")
                tov_vorige_maand_vraagprijs = content.find("div",{"class":"trend-graph-pill"}).get_text()
                tov_vorige_maand_vraagprijs = tov_vorige_maand_vraagprijs.replace("\n\n","")
                tov_vorige_maand_vraagprijs = tov_vorige_maand_vraagprijs.replace(" t.o.v. vorige maand\n","")
            # 'Aantal verkochte woningen'
        content = soup.find_all(class_='trend-graph')[1]
        if content.find(class_="trend-graph-icon") == None:
            verk_woningen = content.find("h3",{"class":"trend-graph-value"}).get_text()
            tov_vorige_maand_verkocht = "NA"
        else:
            if content.find(class_="trend-graph-pill trend-down") != None:
                verk_woningen = content.find("h3",{"class":"trend-graph-value"}).get_text()               
                tov_vorige_maand_verkocht = content.find("div",{"class":"trend-graph-pill trend-down"}).get_text()
                tov_vorige_maand_verkocht = tov_vorige_maand_verkocht.replace("\n\n","")
                tov_vorige_maand_verkocht = tov_vorige_maand_verkocht.replace(" t.o.v. vorige maand\n","")
            else:
                verk_woningen = content.find("h3",{"class":"trend-graph-value"}).get_text()             
                tov_vorige_maand_verkocht = content.find("div",{"class":"trend-graph-pill"}).get_text()
                tov_vorige_maand_verkocht = tov_vorige_maand_verkocht.replace("\n\n","")
                tov_vorige_maand_verkocht = tov_vorige_maand_verkocht.replace(" t.o.v. vorige maand\n","")
            # 'Gemiddelde vierkante meter prijs'
        content = soup.find_all(class_='trend-graph')[2]
        if content.find(class_="trend-graph-icon") == None:
            m2_prijs = content.find("h3",{"class":"trend-graph-value"}).get_text()
            tov_vorige_maand_m2_prijs = "NA"
        else:
            if content.find(class_="trend-graph-pill trend-down") != None:
                m2_prijs = content.find("h3",{"class":"trend-graph-value"}).get_text()     
                m2_prijs = m2_prijs.replace(".", ",")
                tov_vorige_maand_m2_prijs = content.find("div",{"class":"trend-graph-pill trend-down"}).get_text()
                tov_vorige_maand_m2_prijs = tov_vorige_maand_m2_prijs.replace("\n\n","")
                tov_vorige_maand_m2_prijs = tov_vorige_maand_m2_prijs.replace(" t.o.v. vorige maand\n","")
            else:
                m2_prijs = content.find("h3",{"class":"trend-graph-value"}).get_text() 
                m2_prijs = m2_prijs.replace(".", ",")
                tov_vorige_maand_m2_prijs = content.find("div",{"class":"trend-graph-pill"}).get_text() 
                tov_vorige_maand_m2_prijs = tov_vorige_maand_m2_prijs.replace("\n\n","")
                tov_vorige_maand_m2_prijs = tov_vorige_maand_m2_prijs.replace(" t.o.v. vorige maand\n","")
            # 'Percentage overboden'
        content = soup.find_all(class_='trend-graph')[3]
        if content.find(class_="trend-graph-icon") == None:
            perc_overboden = content.find("h3",{"class":"trend-graph-value"}).get_text()
            tov_vorige_maand_perc_overboden = "NA"
        else:
            if content.find(class_="trend-graph-pill trend-down") != None:
                perc_overboden = content.find("h3",{"class":"trend-graph-value"}).get_text()               
                tov_vorige_maand_perc_overboden = content.find("div",{"class":"trend-graph-pill trend-down"}).get_text()
                tov_vorige_maand_perc_overboden = tov_vorige_maand_perc_overboden.replace("\n\n","")
                tov_vorige_maand_perc_overboden = tov_vorige_maand_perc_overboden.replace(" t.o.v. vorige maand\n","")
            else:
                perc_overboden = content.find("h3",{"class":"trend-graph-value"}).get_text()             
                tov_vorige_maand_perc_overboden = content.find("div",{"class":"trend-graph-pill"}).get_text()
                tov_vorige_maand_perc_overboden = tov_vorige_maand_perc_overboden.replace("\n\n","")
                tov_vorige_maand_perc_overboden = tov_vorige_maand_perc_overboden.replace(" t.o.v. vorige maand\n","")
            # 'Besteedbaar inkomen'
        bes_inkomen = soup.find_all(class_='detail__income huizenzoeker-card single-value-graph-container')[0].get_text()
        bes_inkomen = bes_inkomen.replace('\n','')
        bes_inkomen = bes_inkomen.replace('Besteedbaar Inkomen Per Huishouden','')
        bes_inkomen = bes_inkomen.replace(".", ",")
            # 'Inwoners'
        content = soup.find("div", {"class": "buurt-info"})
        inwoners = content.find_all('p')[3].get_text()  # call get_text() so we search the paragraph text itself
        inwoners = re.search('Dat zijn(.+?)inwoners', inwoners)
        found_inwoners = 'NA'
        if inwoners:
            found_inwoners = inwoners.group(1)
            found_inwoners = found_inwoners.strip()
            found_inwoners = found_inwoners.replace(".", ",")
            # 'Bevolkingsgroei'
        content = soup.find("div", {"class": "buurt-info"})
        populatiegroei = content('p')[4].get_text()  # call get_text() so we search the paragraph text itself
        populatiegroei_increase = re.search('afgelopen jaar met (.+?) gegroeid', populatiegroei)
        if populatiegroei_increase:
            found_populatiegroei = populatiegroei_increase.group(1)
            found_populatiegroei = found_populatiegroei.strip()
        else:
            found_populatiegroei = 'NA'
        populatiegroei_decline = re.search('afgelopen jaar met (.+?) gekrompen', populatiegroei)
        if populatiegroei_decline:
            found_populatiegroei_decline = populatiegroei_decline.group(1)
            found_populatiegroei_decline = found_populatiegroei_decline.strip() 
        else:
            found_populatiegroei_decline = 'NA'
            # Append list
        save_obj = {"Provincie":provincie_naam, 
                    "Gem. vraagprijs":gem_vraagprijs, "%Δ Vraagprijs (t.o.v vorige maand)": tov_vorige_maand_vraagprijs,
                    "Verkochte woningen":verk_woningen, "%Δ Verkochte woningen (t.o.v vorige maand)":tov_vorige_maand_verkocht,
                    "Gem. m² prijs":m2_prijs, "%Δ m² prijs (t.o.v vorige maand)":tov_vorige_maand_m2_prijs,
                    "% Vraagprijs overboden":perc_overboden, "%Δ Overboden (t.o.v vorige maand)":tov_vorige_maand_perc_overboden,
                    "Besteedbaar inkomen (per huishouden)":bes_inkomen,
                    "Aantal inwoners": found_inwoners,
                    "% Populatie stijging":found_populatiegroei, "% Populatie daling":found_populatiegroei_decline}
        trend_list.append(save_obj)
        with open(fn, 'a', encoding='utf-8') as f:  # append one JSON object per line
            f.write(json.dumps(save_obj)+'\n')
    return(trend_list)
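
Both scrapers repeat the same two `.replace` calls for every month-over-month value. A small helper could take over that clean-up. The sketch below is an illustration only: the name `clean_trend_delta` is hypothetical, and the sample string is an assumption about what the pill text looks like, based on the substrings that the scraper removes.

```python
def clean_trend_delta(raw):
    """Strip layout newlines and the 't.o.v. vorige maand' suffix from a pill's text."""
    text = raw.replace("\n\n", "")
    text = text.replace(" t.o.v. vorige maand\n", "")
    return text.strip()


sample_raw = "\n\n-3.50% t.o.v. vorige maand\n"  # assumed shape of the pill text
print(clean_trend_delta(sample_raw))  # -3.50%
```

With such a helper, each of the four trend blocks would need a single call instead of two repeated lines in every branch.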

In [43]:
df2 = extract_provincie_trends(page_links) #saving the output as variable 'df2'
province_dataframe = pd.DataFrame(df2) #dataframe with all data for all provinces of the Netherlands
province_dataframe #displaying the Pandas dataframe

Unnamed: 0,Provincie,Gem. vraagprijs,%Δ Vraagprijs (t.o.v vorige maand),Verkochte woningen,%Δ Verkochte woningen (t.o.v vorige maand),Gem. m² prijs,%Δ m² prijs (t.o.v vorige maand),% Vraagprijs overboden,%Δ Overboden (t.o.v vorige maand),Besteedbaar inkomen (per huishouden),Aantal inwoners,% Populatie stijging,% Populatie daling
0,Noord-Holland,"€ 425,000",13.33%,1076,-19.76%,"€ 4,381",7.83%,28.65%,15.98%,"€ 36,200",2879527,0.92%,
1,Zuid-Holland,"€ 362,500",6.93%,1385,-34.64%,"€ 3,584",5.13%,20.98%,10.76%,"€ 35,800",3708696,0.95%,
2,Zeeland,"€ 282,500",2.73%,184,-33.81%,"€ 2,642",4.18%,10.05%,1.92%,"€ 36,900",383488,0.12%,
3,Noord-Brabant,"€ 350,000",3.24%,802,-44.27%,"€ 3,188",5.15%,8.74%,0.87%,"€ 38,100",2548585,0.71%,
4,Utrecht,"€ 425,000",9.25%,661,-1.34%,"€ 4,167",3.58%,13.03%,1.06%,"€ 39,500",1354834,0.94%,
5,Flevoland,"€ 335,000",3.08%,180,-30.23%,"€ 2,941",0.31%,16.40%,1.69%,"€ 39,500",423021,1.55%,
6,Friesland,"€ 285,000",3.64%,344,-8.99%,"€ 2,433",-1.42%,11.71%,1.07%,"€ 34,900",649957,0.35%,
7,Groningen,"€ 250,000",11.11%,332,-5.95%,"€ 2,535",6.56%,20.27%,4.51%,"€ 30,600",540009,0.38%,
8,Drenthe,"€ 304,000",3.05%,252,-22.46%,"€ 2,479",1.22%,11.58%,1.09%,"€ 37,100",493682,0.31%,
9,Overijssel,"€ 300,000",1.69%,497,-9.64%,"€ 2,693",4.87%,10.50%,0.66%,"€ 36,900",1162406,0.52%,


**Exporting the province-level dataset as a CSV file, to import it into RStudio for analysis**

In [21]:
province_dataframe.to_csv(r'huizenzoeker_province_data.csv')

## Step 5: Scrape data from each url (street-level)

To scrape data on street-level, we first need the URLs of each residence ("woonplaats"), after which we can scrape the URLs of each street in each residence. To obtain the residence URLs, we make use of the municipality URLs. However, since each municipality in the Netherlands contains multiple residences, and each residence contains a lot of streets, generating this dataset for each province in the Netherlands would take a lot of time (which is beyond the scope of this project).

Therefore, to give an impression of what street-level data looks like, we decided to focus on the province 'Noord-Brabant', the municipality 'Tilburg' and the residence 'Tilburg' (the municipality Tilburg has four residences: Berkel-Enschot, Tilburg, Udenhout and Zundert). From the residence 'Tilburg' we extract all the streets (such as Warandelaan, the street of Tilburg University!). From these street pages, we are able to extract data such as the distance to the nearest cinema or to the nearest child day care.

#### Scraping residence URLs for the subset Noord-Brabant

In [11]:
subset = page_urls_full[112:173] # subset for municipalities Noord-Brabant 

page_urls_residences = [] 

for link in subset:
    driver.switch_to.window(driver.window_handles[-1])
    driver.get(link)
    time.sleep(2)
    
    for elem in driver.find_elements(By.XPATH, "//li//div//a[@href]"):  # find_elements_by_xpath was removed in newer Selenium releases
        urls = elem.get_attribute('href')
        page_urls_residences.append(urls)
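
The slice `page_urls_full[112:173]` depends on the exact order and number of links collected in step 2, which can shift when the website changes. A less position-dependent alternative is to select URLs by their province slug. The sketch below assumes that every municipality URL contains its province slug as a path segment; the helper name `urls_for_province` and the sample URLs are hypothetical.

```python
def urls_for_province(urls, province_slug):
    """Return the URLs that contain '/<province_slug>/' (hypothetical helper)."""
    marker = '/' + province_slug.strip('/') + '/'
    return [url for url in urls if url and marker in url]


sample_urls = [
    'https://www.huizenzoeker.nl/woningmarkt/zeeland/goes/',
    'https://www.huizenzoeker.nl/woningmarkt/noord-brabant/tilburg/',
    'https://www.huizenzoeker.nl/woningmarkt/noord-brabant/breda/',
]
subset_demo = urls_for_province(sample_urls, 'noord-brabant/')
print(len(subset_demo))  # 2
```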

#### Using button.click( ) to navigate to the next pages of '../tilburg/tilburg'

In [12]:
# XPaths of the 21 pagination buttons (button[1] up to and including button[21])
x = [f'/html/body/div[2]/div/section[4]/div/div[2]/div/button[{i}]' for i in range(1, 22)]

**Scraping a list of street URLs for Tilburg/Tilburg**

In [13]:
residence_tilburgtilburg = page_urls_residences[255:256] #subset for streetnames of Tilburg
try_out = [] 

for link in residence_tilburgtilburg:
    driver.switch_to.window(driver.window_handles[-1])
    driver.get(link)
    time.sleep(2)
    try:
        driver.find_element(By.CLASS_NAME, 'cookie-consent-accept').click()
    except Exception:  # the button is absent once cookies have been accepted
        print('You probably already clicked on the accept button!')
    
    for button in x:
        click_next = driver.find_element(By.XPATH, button)
        click_next.click() 
        time.sleep(4)
    
        for elem in driver.find_elements(By.XPATH, "//li//div//a[@href]"):
            urls = elem.get_attribute('href')
            try_out.append(urls)

KeyboardInterrupt: 
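
Because the inner loop collects every matching link after each click, links that appear on every page (presumably navigation links, for instance) can end up in `try_out` more than once. The sketch below removes such duplicates while keeping the original order. The helper name `unique_in_order` and the sample values are made up for illustration.

```python
def unique_in_order(items):
    """Drop duplicates (and None) while keeping first-seen order."""
    return list(dict.fromkeys(item for item in items if item is not None))


demo = ['a/street-1/', 'a/nav/', 'a/street-2/', 'a/nav/', None, 'a/street-1/']
deduplicated = unique_in_order(demo)
print(deduplicated)  # ['a/street-1/', 'a/nav/', 'a/street-2/']
```

Deduplicating changes the length of the list, so a slice such as `try_out[1800:2014]` would need new indices afterwards.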

#### Scraping the data from the street pages

In [220]:
try_out_subset = try_out[1800:2014]

fn = 'saved_street_data.json' #saving the data as a JSON file

def extract_street_trends(street_urls):
    """Scrape the street-level figures for every URL and return them as a list of dicts."""
    trend_list_streets = []
    for url in street_urls:
        driver.get(url)
        time.sleep(2)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        # Street name (note: this also removes the spaces inside multi-word names)
        street_name = soup.find_all('h2')[0].get_text()
        street_name = street_name.replace('Woningmarkt', '').replace(' ', '')
        # Aantal woningen, woonplaats, gem. bouwjaar and gem. woonoppervlakte:
        # the first four 'trend-graph' blocks, in that order
        graphs = soup.find_all(class_='trend-graph')[:4]
        aantal_woningen, woonplaats, av_bouwjaar, gem_woonopv = [
            graph.find("h3", {"class": "trend-graph-value"}).get_text() for graph in graphs]
        gem_woonopv = gem_woonopv.replace('m²', '')
        # Distances to bioscoop, treinstation, busstation, school, kinderopvang and supermarkt:
        # the first six 'consumer-icon' blocks, in that order
        icons = soup.find_all(class_='consumer-icon')[:6]
        bioscoop, treinstation, busstation, school, kinderopvang, supermarkt = [
            icon.select('div > p')[1].get_text().replace(' m', '') for icon in icons]
        # Besteedbaar inkomen
        bes_inkomen = soup.find_all(class_='detail__income huizenzoeker-card single-value-graph-container')[0].get_text()
        bes_inkomen = bes_inkomen.replace('\n', '')
        bes_inkomen = bes_inkomen.replace('Besteedbaar Inkomen Per Huishouden', '')
        bes_inkomen = bes_inkomen.replace(".", ",")
        # Append list
        save_obj = {"Straat": street_name,
                    "Woonplaats": woonplaats,
                    "Aantal woningen": aantal_woningen,
                    "Gem. bouwjaar": av_bouwjaar,
                    "Gem. woonoppervlakte (m²)": gem_woonopv,
                    "Bioscoop (m)": bioscoop,
                    "Treinstation (m)": treinstation,
                    "Busstation (m)": busstation,
                    "School (m)": school,
                    "Kinderopvang (m)": kinderopvang,
                    "Supermarkt (m)": supermarkt,
                    "Besteedbaar inkomen": bes_inkomen}
        trend_list_streets.append(save_obj)
        # Write each record to the JSON file straight away, so an interrupted run keeps its progress
        with open(fn, 'a', encoding='utf-8') as f:
            f.write(json.dumps(save_obj) + '\n')
    return trend_list_streets
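
Since the function appends one JSON object per line to `saved_street_data.json`, the records scraped so far can be read back after an interrupted run. A minimal sketch, assuming the file only contains lines written by the function above. The helper name `load_saved_streets` is hypothetical:

```python
import json

def load_saved_streets(path):
    """Read a file with one JSON object per line into a list of dicts."""
    records = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

Calling `load_saved_streets('saved_street_data.json')` would return a list of dicts that can be passed to `pd.DataFrame` in the same way as the function's return value.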

In [223]:
street_records = extract_street_trends(try_out_subset) # list of dicts, one per street
try_out_streets = pd.DataFrame(street_records) #displaying the scraped data in a Pandas dataframe
try_out_streets

| | Straat | Woonplaats | Aantal woningen | Gem. bouwjaar | Gem. woonoppervlakte (m²) | Bioscoop (m) | Treinstation (m) | Busstation (m) | School (m) | Kinderopvang (m) | Supermarkt (m) | Besteedbaar inkomen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | VanMusschenbroekstraat | Tilburg | 27 | 1972 | 76 | 1007 | 1478 | 336 | 215 | 171 | 497 | € 17,204 |
| 1 | VanOldenbarneveltstraat | Tilburg | 32 | 1957 | 90 | 1728 | 1763 | 115 | 430 | 335 | 90 | € 17,204 |
| 2 | VanOosterzeestraat | Tilburg | 43 | 1978 | 116 | 3389 | 3201 | 81 | 340 | 105 | 357 | € 28,141 |
| 3 | VanOtterloostraat | Tilburg | 26 | 2007 | 173 | 3172 | 2979 | 159 | 150 | 101 | 287 | € 31,938 |
| 4 | VanSassevanYsseltstraat | Tilburg | 60 | 1968 | 115 | 968 | 683 | 210 | 381 | 155 | 293 | € 29,597 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 209 | Zwammerdamstraat | Tilburg | 6 | 2018 | 154 | 6845 | 642 | 258 | 497 | 475 | 1570 | - |
| 210 | Zwartsluishof | Tilburg | 41 | 2019 | 67 | 6877 | 785 | 391 | 645 | 556 | 1655 | - |
| 211 | Zwartvenseweg | Tilburg | 136 | 1970 | 115 | 4456 | 1368 | 300 | 488 | 145 | 391 | € 25,831 |
| 212 | Zwijsenplein | Tilburg | 1 | 1900 | 14463 | 524 | 753 | 176 | 655 | 229 | 313 | - |
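
All scraped values are stored as strings, and a missing income shows up as `-`. A minimal sketch of turning the income column into numbers, assuming the comma in values such as `€ 17,204` is a thousands separator (the scraper replaces the original `.` with `,`). The helper name `parse_euro` is hypothetical:

```python
def parse_euro(value):
    """Convert a string like '€ 17,204' to an int; return None for '-' or empty values."""
    digits = ''.join(ch for ch in value if ch in '0123456789')
    return int(digits) if digits else None

print(parse_euro('€ 17,204'))  # 17204
print(parse_euro('-'))         # None
```

Applied to the dataframe, this could look like `try_out_streets['Besteedbaar inkomen'].map(parse_euro)`.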


In [236]:
try_out_streets.to_csv(r'huizenzoeker_street_data.csv')