<a href="https://colab.research.google.com/github/Rohan-Dawar/CTBUH-scraper/blob/main/CTBUH_Scraper_ROHAN_DAWAR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraper for the CTBUH (Council on Tall Buildings and Urban Habitat) Database

* Author: [Rohan Dawar](rohandawar.com)
  * [GitHub](https://github.com/Rohan-Dawar)
  * [LinkedIn](https://www.linkedin.com/in/rohan-dawar/)
  * [Jovian](https://jovian.ai/rohan-dawar)

---

## **What is CTBUH?**
* The Council on Tall Buildings and Urban Habitat (CTBUH) is an organization that, among other things, tracks and maintains data on high-rises, skyscrapers, and other tall buildings
* [CTBUH Homepage](https://www.ctbuh.org/)
* [CTBUH Database](https://www.skyscrapercenter.com/explore-data)


## **What this notebook does**:
* This notebook defines the function `gen_df`, which scrapes the CTBUH database and returns a pandas dataframe of all buildings matching the criteria you pass
* Parameters include:
  * `region` (continent)
  * `country`
  * `city`
  * `getCoords`: whether to fetch coordinates for each building
  * `OHEcols`: which columns to one-hot encode

* Columns in the dataframe include:
  * Rank (height ranking among criteria)
  * Name (of the building, or address)
  * City
  * Completion (year)
  * Height (m)
  * Floors
  * Material (Concrete, Steel, Composite, etc.)
  * Use (Residential, Retail, Office, Hotel, Mixed, etc.)
  * Status (Completed, Under Construction, Proposed, etc.)
  * BuildPage (skyscrapercenter building page)
  * CityPage (skyscrapercenter city page)
  * CountryPage (skyscrapercenter country page)
  * BID (building ID, used in CTBUH database)
  * Coords (only if `getCoords=True`; latitude and longitude where location data is available, otherwise None)
  * Country

* This notebook also builds dictionaries (`regionCode`, `countryCode`, `cityCode`) that map continent, country and city names to their reference numbers in the CTBUH database

In [176]:
# Dependencies:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [177]:
base_url = 'https://www.skyscrapercenter.com'
map_url = f'{base_url}/map/building/'
db_url = f'{base_url}/explore-data'

In [178]:
# Finding the filter labels:
r = requests.get(db_url)
soup = BeautifulSoup(r.text, 'html.parser')
s = soup.find_all('div', class_="filter-label")
s

[<div class="filter-label">Structure Type</div>,
 <div class="filter-label">Status</div>,
 <div class="filter-label">Height</div>,
 <div class="filter-label">Material</div>,
 <div class="filter-label">Function</div>,
 <div class="filter-label">Region</div>,
 <div class="filter-label">Country</div>,
 <div class="filter-label">City</div>,
 <div class="filter-label">Min Year.</div>,
 <div class="filter-label">Max Year</div>,
 <div class="filter-label">Company</div>]

In [179]:
# Build the region code dict from the element that follows filter label index 5 ("Region"):
regionCode = {}
filterKeys = s[5].find_next()
filterOptions = filterKeys.find_all('option')
for fo in filterOptions:
  if fo.get('value'):  # skip options without a value
    regionCode[fo.text] = fo.get('value')
regionCode

{'Africa': '5',
 'Asia': '7',
 'Central America': '3',
 'Europe': '1',
 'Middle East': '6',
 'North America': '2',
 'Oceania': '8',
 'South America': '4'}

In [180]:
# Build the country code dict from the element that follows filter label index 6 ("Country"):
countryCode = {}
filterKeys = s[6].find_next()
filterOptions = filterKeys.find_all('option')
for fo in filterOptions:
  if fo.get('value'):  # skip options without a value
    countryCode[fo.text] = fo.get('value')
countryCode

{'Afghanistan': '3',
 'Albania': '4',
 'Algeria': '46',
 'Andorra ': '1',
 'Angola': '7',
 'Argentina': '9',
 'Armenia': '5',
 'Aruba': '12',
 'Australia': '11',
 'Austria': '10',
 'Azerbaijan': '13',
 'Bahamas': '23',
 'Bahrain': '18',
 'Bangladesh': '19',
 'Barbados': '185',
 'Belarus': '25',
 'Belgium': '16',
 'Benin': '186',
 'Bhutan': '190',
 'Bolivia': '21',
 'Bosnia and Herzegovina': '14',
 'Botswana': '24',
 'Brazil': '22',
 'Brunei': '178',
 'Bulgaria': '17',
 'Burkina Faso': '67',
 'Cambodia': '84',
 'Cameroon': '33',
 'Canada': '27',
 'Central African Republic': '29',
 'Chad': '151',
 'Chile': '32',
 'China': '34',
 'Colombia': '35',
 'Congo': '30',
 'Costa Rica': '36',
 'Croatia': '69',
 'Cuba': '37',
 'Cyprus': '39',
 'Czech Republic': '40',
 'Democratic Republic of the Congo': '28',
 'Denmark': '43',
 'Djibouti': '42',
 'Dominican Republic': '45',
 'Ecuador': '47',
 'Egypt': '49',
 'El Salvador': '148',
 'Equatorial Guinea': '192',
 'Eritrea': '50',
 'Estonia': '48',
 'Et

In [181]:
# Build the city code dict from the element that follows filter label index 7 ("City"):
cityCode = {}
filterKeys = s[7].find_next()
filterOptions = filterKeys.find_all('option')
for fo in filterOptions:
  if fo.get('value'):  # skip options without a value
    cityCode[fo.text] = fo.get('value')
cityCode

{'Aarhus': '879',
 'Abbeville': '1496',
 'Aberdeen': '1805',
 'Abidjan': '736',
 'Abilene': '1497',
 'Abu Dhabi': '629',
 'Abuja': '1128',
 'Acapulco': '1107',
 'Accra': '955',
 'Addis Ababa': '910',
 'Adelaide': '647',
 'Agawam': '2202',
 'Agra': '2276',
 'Ajax': '2855',
 'Ajman': '628',
 'Akita': '2448',
 'Akkrum': '1135',
 'Akron': '1502',
 'Al Fujayrah': '631',
 'Al Khobar': '1448',
 'Albany': '1503',
 'Albuquerque': '1498',
 'Aleppo (Alep)': '2158',
 'Alexandria': '2155',
 'Alexandria (VA)': '1513',
 'Algiers': '2156',
 'Alicante': '892',
 'Allentown': '2251',
 'Almaty': '1082',
 'Almelo': '1137',
 'Almere': '1133',
 'Alphen aan den Rijn': '1144',
 'Amarillo': '1504',
 'Americana': '2963',
 'Amersfoort': '1138',
 'Amherst': '1726',
 'Amiens': '915',
 'Amman': '1028',
 'Amsterdam': '1140',
 'Anapolis': '2649',
 'Anchorage': '1505',
 'Andelst': '1142',
 'Andorra La Vela': '627',
 'Ankara': '1477',
 'Annapolis': '2349',
 'Anshan': '742',
 'Antananarivo': '1097',
 'Antiguo Cuscatlán':

In [182]:
# Example cityCode usage:
cityCode['Chicago']

'1539'

In [183]:
# Example countryCode usage:
countryCode['Panama']

'123'

In [184]:
# Example regionCode usage:
regionCode['Asia']

'7'
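
Some scraped keys keep stray whitespace from the site (for example `'Andorra '` in the `countryCode` output above), so an exact-match lookup can fail on a name that looks correct. A minimal sketch of a more forgiving lookup follows; `lookup_code` is a hypothetical helper that is not part of this notebook, and the two-entry dict is only a stand-in for the scraped one.

```python
def lookup_code(codes, name):
    """Return the code for `name`, ignoring case and surrounding whitespace.

    `codes` is a name -> code dict such as cityCode or countryCode.
    Raises KeyError if no key matches.
    """
    wanted = name.strip().casefold()
    for key, value in codes.items():
        if key.strip().casefold() == wanted:
            return value
    raise KeyError(name)


# Stand-in for the scraped dict (values copied from the outputs above)
sample_codes = {'Andorra ': '1', 'Panama': '123'}
print(lookup_code(sample_codes, 'andorra'))  # 1
print(lookup_code(sample_codes, 'Panama'))   # 123
```

The linear scan keeps the original dicts untouched; for repeated lookups it may be simpler to normalize the keys once when the dict is built.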

In [185]:
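# Returns (latitude, longitude) for a building ID, or None if no coordinates can be extracted.
# Called by gen_df when getCoords=True.
def get_building_coords(bid):
  r = requests.get(map_url+bid)
  soup = BeautifulSoup(r.text, 'html.parser')
  try:
    # The first inline script contains the marker call: L.marker([lat, lon], ...)
    mainScript = soup.find_all('script', type="text/javascript")[0]
    cds = str(mainScript).split('var marker = L.marker(')[-1].split(',')[:2]
    s1, s2 = cds
    s1 = s1.replace('[','')
    s2 = s2.replace(']','')
    return (float(s1), float(s2))
  except (IndexError, ValueError):
    # No script tag, no marker call, or non-numeric text: treat as "no location data"
    return None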
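
The string splitting in `get_building_coords` can also be written as a single regular expression, which is easy to check offline without a request. The sketch below is an alternative, not the method used in this notebook; `parse_marker_coords` and the sample script text are hypothetical, and the `L.marker([lat, lon], ...)` shape is assumed from the split logic above.

```python
import re

# Matches "L.marker([<lat>, <lon>" and captures the two numbers
MARKER_RE = re.compile(r"L\.marker\(\s*\[\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)")


def parse_marker_coords(script_text):
    """Return (lat, lon) from the marker call in `script_text`, or None."""
    match = MARKER_RE.search(script_text)
    if match is None:
        return None
    return (float(match.group(1)), float(match.group(2)))


# Assumed shape of the page's inline script, for illustration only
sample = "var marker = L.marker([43.648006, -79.378372], {icon: icon}).addTo(map);"
print(parse_marker_coords(sample))          # (43.648006, -79.378372)
print(parse_marker_coords("var map = 1;"))  # None
```

One difference to keep in mind: the regex takes the first marker call it finds, whereas the split-based version uses the text after the last `var marker = L.marker(` occurrence.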

In [186]:
def gen_df(city='', country='', region='', getCoords=False, OHEcols=None) -> pd.DataFrame:
  """Returns a pandas dataframe of the buildings (and related data) matching the passed criteria

  If none of `city`, `country` or `region` is passed, the query is unfiltered and covers buildings worldwide, up to the row limit noted under Returns.

  Parameters
  ----------
  city : str, optional
      The CTBUH city code, an integer passed as a string
      Look it up in the `cityCode` dict
        eg. cityCode['Chicago'] -> '1539'

  country : str, optional
      The CTBUH country code, an integer passed as a string
      Look it up in the `countryCode` dict
        eg. countryCode['Panama'] -> '123'

  region : str, optional
      The CTBUH region code, an integer passed as a string; region refers to the continent
      Look it up in the `regionCode` dict
        eg. regionCode['Asia'] -> '7'

  getCoords : bool, default=False
      If True, `get_building_coords` is called for every building (one extra request each) and tries to return (latitude, longitude) coordinates in WGS 84
      The value is None when the CTBUH database has no location data for the building

  OHEcols : list, optional
      A list of columns to one-hot encode.
      `Use` and `Material` are the columns most commonly one-hot encoded.
      Best suited to columns with <= 12 unique values


  Returns
  -------
  Pandas Dataframe, type: pandas.core.frame.DataFrame
      Column Count >= 14
      Row Count <= 1000
  """

# Load Parser:
  url = f'{base_url}/explore-data?output=list&types%5B%5D=building&statuses%5B%5D=COM&statuses%5B%5D=UCT&statuses%5B%5D=STO&statuses%5B%5D=UC&statuses%5B%5D=PRO&height=&region_id={region}&country_id={country}&city_id={city}&min_year=&max_year=&filter_company=&output=list'
  r = requests.get(url)
  soup = BeautifulSoup(r.text, 'html.parser')
  s = soup.find_all('table', id='table-combined-base')
  sList = s[0].find('tbody').find_all('tr')

# Parse each table row into a dict, then build the dataframe in one step
# (DataFrame.append was removed in pandas 2.0):
  dfcols = ['Rank', 'Name', 'City', 'Blank', 'Completion', 'Height', 'Floors', 'Material', 'Use', 'Status', 'BuildPage', 'CityPage', 'CountryPage']
  rows = []
  for t in sList:
    tc = t.find_all('td')
    status = t.find('div').get('data-tippy-content')
    bpage, citypage, countrypage = [a['href'] for a in t.find_all('a')]
    vals = [td.text for td in tc]
    vals.extend([status, bpage, citypage, countrypage])
    rows.append(dict(zip(dfcols, vals)))
  df = pd.DataFrame(rows, columns=dfcols)

# Get Building ID From Building Page URL:
  df['BID'] = df['BuildPage'].apply(lambda x : x.split('/')[-1])

# Get Coords
  if getCoords:
    df['Coords'] = df['BID'].apply(get_building_coords)

# Data Cleaning
  df['City'] = df['City'].apply(lambda x: x.split('\n')[0])
  df = df[df['Height'] != '-'].copy()  # drop rows without a height; copy so later assignments do not warn
  df['Height'] = df['Height'].apply(lambda x: float(x.split(' m')[0].replace(',','') ) )
  df['Country'] = df['CountryPage'].apply(lambda x : x.split('/')[-1].title())
  if 'Blank' in df.columns:
    df = df.drop(columns=['Blank'])

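# One Hot Encode
  if OHEcols:
    for col in OHEcols:
      # Collect the distinct labels, splitting multi-valued cells such as 'residential / office'
      flat_list = [item for parts in df[col].apply(lambda x : x.split(' / ')).to_list() for item in parts]
      # De-duplicate in first-seen order and drop empty labels
      # (list.remove('') would raise ValueError when no cell is empty)
      fL = [word for word in dict.fromkeys(flat_list) if word]

      # Create one 0/1 column per label; regex=False treats the label as plain text:
      for word in fL:
        colname = f'{col}: {word}'
        df[colname] = np.where(df[col].str.contains(word, case=False, na=False, regex=False), 1, 0)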

  df = df.reset_index(drop=True)
  return df
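
The long query string in `gen_df` is easier to read and adjust when it is assembled from key/value pairs. Here is a minimal sketch using the standard library's `urllib.parse.urlencode`; `build_query_url` is a hypothetical helper, and the parameter names and status codes are copied from the URL in `gen_df` without further knowledge of the site's API.

```python
from urllib.parse import urlencode

base_url = 'https://www.skyscrapercenter.com'


def build_query_url(city='', country='', region=''):
    """Build the explore-data list URL for the given filter codes."""
    params = [
        ('output', 'list'),
        ('types[]', 'building'),
        # Status filters, copied verbatim from the URL used in gen_df
        ('statuses[]', 'COM'),
        ('statuses[]', 'UCT'),
        ('statuses[]', 'STO'),
        ('statuses[]', 'UC'),
        ('statuses[]', 'PRO'),
        ('height', ''),
        ('region_id', region),
        ('country_id', country),
        ('city_id', city),
        ('min_year', ''),
        ('max_year', ''),
        ('filter_company', ''),
        ('output', 'list'),  # the original URL repeats this parameter
    ]
    # urlencode percent-encodes the brackets, so 'types[]' becomes 'types%5B%5D'
    return f'{base_url}/explore-data?{urlencode(params)}'


print(build_query_url(city='1539'))
```

A list of pairs is used instead of a dict because `statuses[]` and `output` appear more than once.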

In [187]:
# Example Scrape 1: generate the dataframe for all buildings in Toronto, including coordinates:
df = gen_df(city=cityCode['Toronto'], getCoords=True)
df

,Rank,Name,City,Completion,Height,Floors,Material,Use,Status,BuildPage,CityPage,CountryPage,BID,Coords,Country
0,1,CC3,Toronto,2026,349.6,67,concrete,office,Proposed,/building/cc3/31719,/city/toronto,/country/canada,31719,"(43.648006, -79.378372)",Canada
1,2,1200 Bay Street,Toronto,-,326.5,87,,residential / office,Proposed,/building/1200-bay-street/38685,/city/toronto,/country/canada,38685,"(43.67004, -79.389847)",Canada
2,3,SkyTower,Toronto,2024,312.5,95,concrete,residential / hotel,Under Construction,/building/skytower/15295,/city/toronto,/country/canada,15295,"(43.643524, -79.374924)",Canada
3,4,212 King Street West,Toronto,2026,311.8,80,,residential / office,Proposed,/building/212-king-street-west/40045,/city/toronto,/country/canada,40045,"(43.647701, -79.38662)",Canada
4,5,The Hub,Toronto,2026,311.8,60,,office,Proposed,/building/the-hub/32493,/city/toronto,/country/canada,32493,"(43.642345, -79.378181)",Canada
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
812,813,T3 Sterling Road Building 5A,Toronto,2023,39.8,8,timber,office,Under Construction,/building/t3-sterling-road-building-5a/42380,/city/toronto,/country/canada,42380,"(43.654118, -79.44632)",Canada
813,814,77 Wade,Toronto,2022,38.2,8,composite,office,Under Construction,/building/77-wade/37723,/city/toronto,/country/canada,37723,"(43.659348, -79.445526)",Canada
814,815,Metro Park Condos North Tower,Toronto,-,37.2,12,,residential,Proposed,/building/metro-park-condos-north-tower/42668,/city/toronto,/country/canada,42668,"(43.717815, -79.332695)",Canada
815,816,222 Spadina Ave,Toronto,1964,34.0,10,,residential,Completed,/building/222-spadina-ave/40887,/city/toronto,/country/canada,40887,"(43.6511, -79.398148)",Canada


In [189]:
# Example Scrape 2: generate the dataframe for all buildings in the United Kingdom, including coordinates:
df = gen_df(country=countryCode['United Kingdom'], getCoords=True)
df

,Rank,Name,City,Completion,Height,Floors,Material,Use,Status,BuildPage,CityPage,CountryPage,BID,Coords,Country
0,1,The Shard,London,2013,306.0,73,composite,residential / hotel / office,Completed,/building/the-shard/451,/city/london,/country/united-kingdom,451,"(51.504478, -0.0865)",United-Kingdom
1,2,1 Undershaft,London,-,289.9,73,steel,office,Proposed,/building/1-undershaft/19785,/city/london,/country/united-kingdom,19785,"(51.514221, -0.08167)",United-Kingdom
2,3,22 Bishopsgate,London,2020,278.2,62,composite,office,Completed,/building/22-bishopsgate/18648,/city/london,/country/united-kingdom,18648,"(51.514542, -0.08285)",United-Kingdom
3,4,100 Leadenhall Street,London,-,247.0,57,,office,Proposed,/building/100-leadenhall-street/31353,/city/london,/country/united-kingdom,31353,"(51.513756, -0.080193)",United-Kingdom
4,5,1 Lansdowne Road East Tower,London,-,236.0,68,,residential,Proposed,/building/1-lansdowne-road-east-tower/31205,/city/london,/country/united-kingdom,31205,"(51.376789, -0.097322)",United-Kingdom
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
867,868,The Cube Building,London,2015,33.0,10,composite,residential,Completed,/building/the-cube-building/35631,/city/london,/country/united-kingdom,35631,"(51.531822, -0.094729)",United-Kingdom
868,869,Harbour Central Block B1,London,2019,31.0,9,concrete,residential,Completed,/building/harbour-central-block-b1/19445,/city/london,/country/united-kingdom,19445,"(51.499622, -0.020213)",United-Kingdom
869,870,Kidbrooke Station Square Building B,London,-,29.6,9,,residential,Proposed,/building/kidbrooke-station-square-building-b/...,/city/london,/country/united-kingdom,41074,"(51.462799, 0.027779)",United-Kingdom
870,871,Stadthaus,London,2009,29.0,9,timber/concrete,residential,Completed,/building/stadthaus/19918,/city/london,/country/united-kingdom,19918,"(51.53072, -0.08942)",United-Kingdom
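
`gen_df` converts only `Height` to a number; `Completion` and `Floors` stay as strings, with `-` marking a missing value, as in the rows above. A small sketch of a converter for those columns follows; `to_int_or_none` is a hypothetical helper, and treating any non-numeric text as missing is an assumption.

```python
def to_int_or_none(text):
    """Convert a scraped cell such as '2013' or '73' to int.

    Returns None for '-' or any other text that is not a whole number.
    """
    try:
        return int(text.strip().replace(',', ''))
    except ValueError:
        return None


print(to_int_or_none('2013'))  # 2013
print(to_int_or_none('-'))     # None
```

It could be applied column-wise, for example `df['Completion'].apply(to_int_or_none)`.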


In [190]:
# Example Scrape 3: generate the dataframe for all buildings in the Middle East, one-hot encoding the `Use` column:
df = gen_df(region=regionCode['Middle East'], OHEcols=['Use'])
df

,Rank,Name,City,Completion,Height,Floors,Material,Use,Status,BuildPage,CityPage,CountryPage,BID,Country,Use: residential,Use: hotel,Use: office,Use: serviced apartments,Use: retail,Use: education,Use: government,Use: hospital,Use: museum,Use: exhibition
0,1,Burj Mubarak Al Kabir,Kuwait City,-,1001.0,234,composite,residential / hotel / office,Canceled,/building/burj-mubarak-al-kabir/21,/city/kuwait-city,/country/kuwait,21,Kuwait,1,1,1,0,0,0,0,0,0,0
1,2,Jeddah Tower,Jeddah,-,1000.0,167,concrete,residential / serviced apartments,Under Construction,/building/jeddah-tower/2,/city/jeddah,/country/saudi-arabia,2,Saudi-Arabia,1,0,0,1,0,0,0,0,0,0
2,3,Burj Khalifa,Dubai,2010,828.0,163,steel/concrete,office / residential / hotel,Completed,/building/burj-khalifa/3,/city/dubai,/country/united-arab-emirates,3,United-Arab-Emirates,1,1,1,0,0,0,0,0,0,0
3,4,Uptown Dubai Tower 1,Dubai,-,711.0,-,concrete,office,Proposed,/building/uptown-dubai-tower-1/16264,/city/dubai,/country/united-arab-emirates,16264,United-Arab-Emirates,0,0,1,0,0,0,0,0,0,0
4,5,Dubai One,Dubai,-,711.0,161,,residential,Canceled,/building/dubai-one/20612,/city/dubai,/country/united-arab-emirates,20612,United-Arab-Emirates,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
933,934,PPA 30 Parcel 2.07 Tower A,Riyadh,2018,48.8,11,concrete,residential,Architecturally Topped Out,/building/ppa-30-parcel-207-tower-a/17187,/city/riyadh,/country/saudi-arabia,17187,Saudi-Arabia,1,0,0,0,0,0,0,0,0,0
934,935,PPA 30 Parcel 2.05 Office Tower,Riyadh,2018,48.0,9,concrete,office / retail,Architecturally Topped Out,/building/ppa-30-parcel-205-office-tower/17186,/city/riyadh,/country/saudi-arabia,17186,Saudi-Arabia,0,0,1,0,1,0,0,0,0,0
935,936,KAFD Conference Centre,Riyadh,2014,41.7,2,composite,exhibition,Completed,/building/kafd-conference-centre/17176,/city/riyadh,/country/saudi-arabia,17176,Saudi-Arabia,0,0,0,0,0,0,0,0,0,1
936,937,PPA 30 Parcel 2.05 Residential Tower,Riyadh,2018,40.0,8,concrete,residential / retail,Architecturally Topped Out,/building/ppa-30-parcel-205-residential-tower/...,/city/riyadh,/country/saudi-arabia,17185,Saudi-Arabia,1,0,0,0,1,0,0,0,0,0
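
The `Use: ...` columns above are built with a substring test (`str.contains`), so a label that happens to be contained in a longer label would be flagged as well. Below is a minimal pure-Python sketch of the same encoding with exact label matching; `one_hot_labels` is a hypothetical helper, and the three sample strings are taken from the `Use` column above.

```python
def one_hot_labels(values, sep=' / '):
    """Return (labels, rows).

    labels: the distinct labels in first-seen order.
    rows: one list of 0/1 flags per input value, aligned with labels.
    """
    # Split each cell into its labels, dropping empty strings
    split_values = [[part for part in value.split(sep) if part] for value in values]
    # dict.fromkeys de-duplicates while keeping first-seen order
    labels = list(dict.fromkeys(part for parts in split_values for part in parts))
    # Exact membership test per label, not a substring test
    rows = [[1 if label in parts else 0 for label in labels] for parts in split_values]
    return labels, rows


labels, rows = one_hot_labels(['residential / hotel / office', 'office', 'office / retail'])
print(labels)  # ['residential', 'hotel', 'office', 'retail']
print(rows)    # [[1, 1, 1, 0], [0, 0, 1, 0], [0, 0, 1, 1]]
```

Within pandas, `Series.str.get_dummies(sep=' / ')` offers a similar exact-label split, though its columns come back sorted rather than in first-seen order.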
