# Importing data from BBR into a relational database

This document shows how to retrieve data from the Danish building registry, BBR, and insert it into a relational database for easier use. <br>

As prerequisites, you need to be able to run Python and have a PostgreSQL database ready with a table to receive the BBR data.

# 1) Obtaining the data

1. Go to https://datafordeler.dk/ <br>
Click on "log in" and create a new web user. You should get a user name and password to access datafordeler. <br>
<br>
2. With your username and password, log in to the datafordeler self-service to retrieve data: <br> https://selfservice.datafordeler.dk/ <br>
<br>
3. It's not intuitive, but your account is linked to several "users", each with different permissions. Check the Users tab (Brugere) - if you only have the user "Webbruger", you need to create a new one. Click on the + tab and create a service user with the "user name and access code" method.<br>
<br>
4. You are now ready to request public data on datafordeler. Go to the Downloads tab (Filudtræk). The list is empty at first because you haven't requested any data yet. To get access to data, you need to create a download. You have three choices:
   - Clicking Opret creates a permanent download that is kept up to date and that you can use multiple times.
   - Clicking Download requests a one-time download of the dataset.
   - Clicking Predefined downloads a dataset with a fixed set of parameters (instead of customizing everything). In particular, you can use Predefined to download the BBR dataset with only up-to-date entries, in JSON or XML format.
<br>
5. You should now see a list of all available downloads. Give your download a name (Visningsnavn) and select BBR Totaludtræk in the list (or BBR Aktuelt Totaludtræk if using Predefined). Click Next. If you chose Opret or Download, you can now adjust a lot of parameters, such as downloading entries for only a specific municipality. If you used Predefined, the parameters are locked.<br><br>

6. Click Save (Gem). You will be taken back to the Downloads tab. If you used Opret or Predefined, you should see your data subscription there; you can modify it, or delete it if you don't expect to download the data again. You will receive an email with information on how to get your data.<br>
<br>
7. Actually getting the data is a bit tricky: you cannot download it from Datafordeler directly. You need to use an FTP client like https://filezilla-project.org/ <br>
Download and install FileZilla. When you launch it, enter the address provided in the email you got from Datafordeler, as well as your Datafordeler service user number (*not* your initial username: this is the user number you created in step 3) and password. Click Connect, and you should finally be able to see and download your files!
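
If you would rather script the download than use FileZilla, Python's standard `ftplib` module can do the same job. The sketch below is only an assumption about how that could look: `download_all` is a hypothetical helper, it expects an FTP connection that is already logged in, and whether the server wants plain FTP (`ftplib.FTP`) or FTP over TLS (`ftplib.FTP_TLS`) is something to check against the email from Datafordeler.

```python
import ftplib  # needed to create the connection, see the usage note below
from pathlib import Path


def download_all(ftp, dest_dir):
    """Download every entry listed in the server's current directory.

    `ftp` is an already logged-in ftplib.FTP or ftplib.FTP_TLS object.
    Returns the list of local paths that were written.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for name in ftp.nlst():  # names in the current remote directory
        target = dest / name
        with open(target, "wb") as fh:
            # retrbinary passes the file to fh.write block by block,
            # so the whole file is never held in memory at once.
            ftp.retrbinary("RETR " + name, fh.write)
        saved.append(target)
    return saved
```

To use it, create the connection yourself with the address, service user number and password from the previous steps, for example `ftp = ftplib.FTP(address)` followed by `ftp.login(user, password)`, then call `download_all(ftp, "bbr_download")`.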



### Note on file format

Because the full BBR dataset is very large, parsing it in one go will likely exhaust your computer's memory. For this reason, you might want to download it as an XML file: Python's standard library can parse XML files iteratively (with iterparse, used in section 2) without loading the whole file. As far as I know, the standard library offers no equivalent for JSON, so you would have to write that kind of function yourself.
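
As a rough illustration of iterative parsing, the sketch below counts elements per tag in a tiny made-up document. The sample, its namespace and the helper name `count_tags` are assumptions for illustration only; with a real extract you would pass the file path instead of the in-memory sample.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny made-up document standing in for a large extract (not real BBR data).
SAMPLE = b"""<root xmlns="http://example.invalid/ns">
  <ItemList>
    <Item><name>a</name></Item>
    <Item><name>b</name></Item>
  </ItemList>
</root>"""


def count_tags(source):
    """Count elements per tag, discarding each element's content once seen."""
    counts = Counter()
    # By default iterparse yields an ("end", element) pair each time an element closes.
    for _event, elem in ET.iterparse(source):
        counts[elem.tag] += 1
        elem.clear()  # drop the element's text, attributes and children
    return counts


counts = count_tags(io.BytesIO(SAMPLE))
for tag, n in counts.items():
    print(tag, n)
```

Note that the tags come back fully qualified, in the form `{namespace}name`, which is why the parsing code later in this document needs a way to build such names.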

# 2) Parsing the data

### Setup

In [None]:
import xml.etree.ElementTree as ET # package to parse XML 

In [None]:
import reprlib # package to limit print size if you want to avoid very long prints. 
# If you have many very long prints, the notebook might crash when you open it later.
# Tip: on Jupyter, press Esc then R then Y to reset a cell and delete its output.

In [None]:
xmlfile='C:/Users/.../myfile.xml' #Write the location of your XML file here.

### Parsing the XML file

The following bit of code is useful if you're working with a small dataset - for instance BBR for one municipality. But running it on the entire BBR dataset might crash your computer. See "Working with a large dataset" below.

In [None]:
tree = ET.parse(xmlfile)

In [None]:
root = tree.getroot()

In [None]:
root.findall("./") #To see the various item categories that make up BBR

In [None]:
root.findall("./{http://data.gov.dk/schemas/bbr}BygningList/{http://data.gov.dk/schemas/bbr}Bygning/") #To see all building parameters

### Working with a large dataset

Because BBR is very large, we want to use the iterparse method to parse it item-by-item instead of all at once. The procedure will scan the BBR file. When it reaches the end of an element, it will check if it's a building. If it is, it will extract the data we want about this building, and insert it into the PostgreSQL database. But first, we need to define some functions to help us with each of these steps.

First, we create a dictionary of namespaces to avoid working with the "http://...." namespaces in all XML tags.

In [None]:
nsmap = {} #creating a dictionary of namespaces
for event, elem in ET.iterparse(xmlfile, events=('start-ns','end')):
    if event=='start-ns':
        ns, url = elem
        nsmap[ns] = url
    else:
        elem.clear()
        break
print(reprlib.repr(nsmap))

In [None]:
def fixtag(ns, tag, nsmap): #this function helps us build tag names based on the namespaces above
    return('{' + nsmap[ns] + '}' + tag)
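
To see what these two helpers produce without opening the real file, here is a small self-contained sketch on a made-up sample. The namespace URL is the one used in the findall call earlier; the root element name, the sample content and the helper name `read_namespaces` are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample; only the namespace URL is taken from this document.
SAMPLE = b"""<Extract xmlns="http://data.gov.dk/schemas/bbr">
  <BygningList><Bygning><kommunekode>0101</kommunekode></Bygning></BygningList>
</Extract>"""


def read_namespaces(source):
    """Collect the namespace declarations found on the root element."""
    nsmap = {}
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload  # the default namespace has the prefix ''
            nsmap[prefix] = uri
        else:
            break  # first element reached: its declarations are already collected
    return nsmap


def fixtag(ns, tag, nsmap):
    """Build a fully qualified tag name such as '{url}Bygning'."""
    return "{" + nsmap[ns] + "}" + tag


nsmap = read_namespaces(io.BytesIO(SAMPLE))
bygning_tag = fixtag("", "Bygning", nsmap)
print(nsmap)
print(bygning_tag)
```

On this sample, the dictionary maps the empty prefix to the namespace URL, and the qualified tag is the same string that appears in the findall call above.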

Then we create a function to extract all the information we need about a building from its XML node, and return it as a single tuple that will be inserted in our database later on:

In [None]:
def child_text(elem, tag):
    # Return the text of a child element in any namespace, or None if the child is missing or empty.
    # (find(...).text would raise AttributeError on a missing child.)
    return elem.findtext('{*}' + tag) or None

def get_building_properties(elem):
    # Extract data about a building and return it as a tuple.
    # Add and remove items based on your needs, but the order in the tuple
    # must correspond to the order of the columns in your database.
    bbr_id = child_text(elem, 'id_lokalId')
    municipality_nr = child_text(elem, 'kommunekode')
    grund = child_text(elem, 'grund')
    husnummer = child_text(elem, 'husnummer')
    coord = child_text(elem, 'byg404Koordinat')
    construction_year = child_text(elem, 'byg026Opførelsesår')
    latest_renovation_year = child_text(elem, 'byg027OmTilbygningsår')
    bbr_use_category = child_text(elem, 'byg021BygningensAnvendelse')
    built_area = child_text(elem, 'byg041BebyggetAreal')
    total_building_area = child_text(elem, 'byg038SamletBygningsareal')
    n_floors = child_text(elem, 'byg054AntalEtager')
    wall_material = child_text(elem, 'byg032YdervæggensMateriale')
    roof_material = child_text(elem, 'byg033Tagdækningsmateriale')
    return (bbr_id, municipality_nr, grund, husnummer, coord, construction_year, latest_renovation_year, bbr_use_category, built_area, total_building_area, n_floors, wall_material, roof_material)
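
To try the extraction on a single node without the full file, the sketch below builds a made-up Bygning element that carries only two fields and reads them with findtext. The sample values are invented, and the assumption that some buildings lack some fields is mine rather than something documented here. The point is only that findtext returns None for an absent child, whereas find(...).text would raise AttributeError.

```python
import xml.etree.ElementTree as ET

NS = "http://data.gov.dk/schemas/bbr"

# Made-up building node with only two of the fields (values are invented).
SAMPLE = f"""<Bygning xmlns="{NS}">
  <id_lokalId>00000000-0000-0000-0000-000000000000</id_lokalId>
  <kommunekode>0101</kommunekode>
</Bygning>"""

elem = ET.fromstring(SAMPLE)

# '{*}' matches the tag in any namespace (supported from Python 3.8).
municipality_nr = elem.findtext("{*}kommunekode")
construction_year = elem.findtext("{*}byg026Opførelsesår")  # absent in the sample

print(municipality_nr)
print(construction_year)
```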

Now all we need is a function to insert the tuples we get from get_building_properties into the database, and finally we can write code to parse the XML file iteratively, extract data with get_building_properties, and insert rows in the database.

# 3) Inserting into the PostgreSQL database

### Setup

In [None]:
import psycopg as pg # package to communicate between Python and PostgreSQL

In [None]:
# Text file containing the connection string used to connect to the database
with open('database_parameters.txt', 'r') as f:
    params = f.read().strip() # the with block closes the file automatically
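
The insertion function in the next section assumes a table named buildings whose primary key constraint is called buildings_pkey, which is the name PostgreSQL gives by default to the primary key of a table called buildings. If you have not created the table yet, a statement along these lines could be a starting point. The column names follow the INSERT statement used in this document; the column types (everything as text) and the helper name `create_table_sql` are assumptions to adapt to your needs.

```python
# Column names follow the INSERT statement used in this document.
COLUMNS = [
    "bbr_id", "municipality_nr", "grund", "husnummer", "coord",
    "construction_year", "latest_renovation_year", "bbr_use_category",
    "built_area", "total_building_area", "n_floors",
    "wall_material", "roof_material",
]


def create_table_sql(table="buildings", columns=COLUMNS):
    """Build a CREATE TABLE statement with the first column as primary key.

    Every column is declared as text, which is an assumption: pick real
    types (integer, geometry, ...) once you know what you need.
    """
    first, *rest = columns
    lines = [f"    {first} text PRIMARY KEY"]
    lines += [f"    {name} text" for name in rest]
    return f"CREATE TABLE IF NOT EXISTS {table} (\n" + ",\n".join(lines) + "\n);"


print(create_table_sql())
```

The printed statement can be run once in psql or any other PostgreSQL client before filling the table.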

### Insertion function

The idea here is to write a function that inserts one row into the database. Adjust the SQL code to match the names of your table and columns, and remember that the column order must be the same as in the get_building_properties function! Note: it is possible to write a slightly simpler function that inserts all rows at once, but that requires building a dictionary with all rows first, which we want to avoid given the size of the dataset.

In [None]:
def insert_bbr1(row_tuple):
    # Replace the table and column names with the ones in your PostgreSQL database.
    # If a row with the same primary key already exists, it is updated instead of duplicated.
    sql = """
        INSERT INTO buildings(bbr_id, municipality_nr, grund, husnummer, coord,
            construction_year, latest_renovation_year, bbr_use_category, built_area,
            total_building_area, n_floors, wall_material, roof_material)
        VALUES(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
        ON CONFLICT ON CONSTRAINT buildings_pkey DO UPDATE SET
            (bbr_id, municipality_nr, grund, husnummer, coord,
             construction_year, latest_renovation_year, bbr_use_category, built_area,
             total_building_area, n_floors, wall_material, roof_material)
          = (EXCLUDED.bbr_id, EXCLUDED.municipality_nr, EXCLUDED.grund, EXCLUDED.husnummer, EXCLUDED.coord,
             EXCLUDED.construction_year, EXCLUDED.latest_renovation_year, EXCLUDED.bbr_use_category, EXCLUDED.built_area,
             EXCLUDED.total_building_area, EXCLUDED.n_floors, EXCLUDED.wall_material, EXCLUDED.roof_material);
    """
    connector = None
    try:
        # connect to the PostgreSQL database
        connector = pg.connect(params)
        # create a new cursor
        cur = connector.cursor()
        # execute the INSERT statement
        cur.execute(sql, row_tuple)
        # commit the changes to the database
        connector.commit()
        # close communication with the database
        cur.close()
    except Exception as error: # also covers pg.DatabaseError
        # the error is only printed, so one bad row does not stop the whole import
        print(error)
    finally:
        if connector is not None:
            connector.close()
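
Opening a new connection for every row is simple, but it is likely to be slow on a large extract. A possible middle ground (a swapped-in technique, not the method used in this document) is to keep one connection open and send rows in batches with the cursor's executemany method, so that only one batch of rows is in memory at a time. The helper names `batched` and `insert_batches` and the batch size are hypothetical; the connection is assumed to be a psycopg connection that you opened yourself.

```python
from itertools import islice


def batched(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def insert_batches(connection, sql, rows, size=1000):
    """Insert rows in batches over one already-open connection.

    `connection` is assumed to be a psycopg connection; only cursor(),
    executemany() and commit() are used. Returns the number of rows sent.
    """
    total = 0
    with connection.cursor() as cur:
        for batch in batched(rows, size):
            cur.executemany(sql, batch)
            connection.commit()  # commit per batch so progress survives an interruption
            total += len(batch)
    return total
```

Here `rows` can be a generator that yields one tuple per building, so the full list of rows never has to exist in memory.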

# 4) Running the code

In [None]:
def fill_database(xml):
    # Build the fully qualified tag names once, instead of on every iteration.
    sag_tag = fixtag('', 'BBRSag', nsmap)
    sag_list_tag = fixtag('', 'BBRSagList', nsmap)
    building_tag = fixtag('', 'Bygning', nsmap)
    building_list_tag = fixtag('', 'BygningList', nsmap)
    # The code scans through each branch of the XML tree. By default, iterparse
    # only reports an element when its end is reached ("end" events).
    for event, elem in ET.iterparse(xml):
        if elem.tag == sag_tag:
            # When we reach the end of BBRSag and BBRSagList, print a check and clear the memory.
            print('reached end of BBRSag')
            elem.clear()
        elif elem.tag == sag_list_tag:
            print('reached end of BBRSagList')
            elem.clear()
        elif elem.tag == building_tag:
            # When we reach the end of a building node, get that building's properties and
            # insert them as a row in the PostgreSQL table. Clear the memory after each building.
            row = get_building_properties(elem)
            insert_bbr1(row)
            elem.clear()
        elif elem.tag == building_list_tag:
            # The idea is to stop once we reach the end of the building list. For some reason
            # this point never seems to be reached: the code keeps running without adding
            # buildings to the database. So check once in a while whether the number of
            # buildings in the database is still increasing; if not, interrupt manually.
            print('reached end of building list')
            break

In [None]:
fill_database(xmlfile)
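
Before launching the import on the full extract, it can help to do a dry run on a tiny sample and collect the rows in a list instead of writing to the database. The sketch below is a simplified stand-in for fill_database under stated assumptions: the sample document, its root element name and its values are invented, only two fields are extracted, and `walk_buildings` is a hypothetical helper that takes any function to call for each row.

```python
import io
import xml.etree.ElementTree as ET

NS = "http://data.gov.dk/schemas/bbr"

# Made-up miniature extract; the root element name and all values are invented.
SAMPLE = f"""<Extract xmlns="{NS}">
  <BBRSagList><BBRSag><kommunekode>0101</kommunekode></BBRSag></BBRSagList>
  <BygningList>
    <Bygning><id_lokalId>a</id_lokalId><kommunekode>0101</kommunekode></Bygning>
    <Bygning><id_lokalId>b</id_lokalId><kommunekode>0147</kommunekode></Bygning>
  </BygningList>
  <EnhedList><Enhed><id_lokalId>c</id_lokalId></Enhed></EnhedList>
</Extract>""".encode("utf-8")


def walk_buildings(source, handle_row):
    """Call handle_row for each Bygning and stop when BygningList closes."""
    building_tag = "{" + NS + "}Bygning"
    list_tag = "{" + NS + "}BygningList"
    seen = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == building_tag:
            row = (elem.findtext("{*}id_lokalId"), elem.findtext("{*}kommunekode"))
            handle_row(row)
            seen += 1
            elem.clear()  # free the building's children once the row is extracted
        elif elem.tag == list_tag:
            break  # everything after the building list is ignored
    return seen


rows = []
count = walk_buildings(io.BytesIO(SAMPLE), rows.append)
print(count, rows)
```

Passing `rows.append` as the handler keeps the rows in memory for inspection; passing the insertion function instead would write them to the database.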