# Web Scraping Trailforks Using Selenium

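Trailforks is a platform for logging mountain bike trails all over the world. This notebook scrapes its Australian riding areas with a combination of requests/Beautiful Soup (for the static HTML) and Selenium (for content rendered by JavaScript).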

In [2]:
import requests, bs4
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException
import psutil

In [67]:
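def get_soup(url):
    """Download a page and return it parsed as a BeautifulSoup tree."""
    # TLS certificate verification is left at its default (enabled).
    res = requests.get(url)
    # Raise an HTTPError for 4xx/5xx responses instead of parsing an error page.
    res.raise_for_status()
    return bs4.BeautifulSoup(res.text, "lxml")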

Get the list of all riding areas in Australia and collect every *tr* (table row) element:

<img src="Trailforkes.png" width="600">

In [68]:
url = 'https://www.trailforks.com/region/australia/ridingareas/'
soup = get_soup(url)
trails = soup.find_all("tr")

Remove the first row and the last two rows, then extract the name, link, and number of trails for each area:

In [69]:
# Drop the header row and the two trailing non-data rows.
trails = trails[1:-2]
# Collect one record per riding area, then build the DataFrame in one step.
rows = []
for trail in trails:
    anchor = trail.find('a')
    rows.append({
        'name': anchor.text,
        'link': anchor['href'],
        'routes': trail.find_all('td')[3].text,
    })
df = pd.DataFrame(rows)
df

index,name,link,routes
0,420 Acres,https://www.trailforks.com/region/420-acres-18...,10
1,80 Acres,https://www.trailforks.com/region/80-acres/,19
2,Abergowrie,https://www.trailforks.com/region/abergowrie-3...,29
3,Adare Homestead,https://www.trailforks.com/region/adare-homest...,
4,Aireys Inlet,https://www.trailforks.com/region/aireys-inlet...,
...,...,...,...
95,Candlebark Park,https://www.trailforks.com/region/candlebark-p...,15
96,Cape Hillsborough,https://www.trailforks.com/region/cape-hillsbo...,4
97,Cape Pallarenda,https://www.trailforks.com/region/cape-pallare...,11
98,Cape Range,https://www.trailforks.com/region/cape-range-3...,31
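
The row-extraction step can be tried offline on a small HTML fragment. The sketch below is illustrative only: `SAMPLE_HTML`, the `example.com` links, and the `parse_rows` helper are hypothetical stand-ins that mimic the layout described above (a header row, data rows, and two trailing non-data rows); the real page markup may differ. It uses Python's built-in `html.parser` so that `lxml` is not required.

```python
import bs4

# Hypothetical markup: one header row, two data rows, two trailing rows.
SAMPLE_HTML = """
<table>
  <tr><th>Area</th><th>A</th><th>B</th><th>Trails</th></tr>
  <tr><td><a href="https://example.com/region/area-one/">Area One</a></td><td>x</td><td>y</td><td>10</td></tr>
  <tr><td><a href="https://example.com/region/area-two/">Area Two</a></td><td>x</td><td>y</td><td></td></tr>
  <tr><td colspan="4">footer 1</td></tr>
  <tr><td colspan="4">footer 2</td></tr>
</table>
"""


def parse_rows(html):
    """Return one dict per data row: name, link, and the fourth cell's text."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    # Skip the header row and the last two rows, as in the cell above.
    rows = soup.find_all("tr")[1:-2]
    records = []
    for row in rows:
        anchor = row.find("a")
        cells = row.find_all("td")
        records.append({
            "name": anchor.text.strip(),
            "link": anchor["href"],
            "routes": cells[3].text.strip(),
        })
    return records


records = parse_rows(SAMPLE_HTML)
```

Passing `records` to `pd.DataFrame` would give a table with the same three columns as the one above; an empty fourth cell simply yields an empty string, which matches the blank `routes` values in the output.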


The latitude and longitude are not present in the static HTML, so Selenium, a browser automation tool, is needed to access the content produced by JavaScript. A ChromeDriver binary for Google Chrome is included in the repo. The cell below starts the web driver; here Safari's driver is used and the Chrome line is left commented out.

In [10]:
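# Alternative: use the ChromeDriver binary shipped in the repo.
#driver = webdriver.Chrome('./chromedriver')
# Safari's driver lives at /usr/bin/safaridriver, which Selenium uses by default.
driver = webdriver.Safari()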


Each link in the dataframe is loaded in the web driver and the map bounds are read from the page. The midpoint of the two bounding corners gives an approximate centre for all the trails in that area.

In [77]:
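for i, url in enumerate(df.link):
    driver.get(url)
    try:
        # "id" is the locator strategy; recent Selenium 4 releases no longer
        # provide find_element_by_id.
        mapbounds = driver.find_element("id", "mapbounds").get_attribute('value')
    except NoSuchElementException:
        # Skip areas whose page has no mapbounds element; their coordinates stay empty.
        continue
    # The bounds are read in the order [lon, lat, lon, lat].
    latlong = [float(x) for x in mapbounds.replace('[', '').replace(']', '').split(',')]
    df.loc[i, 'lat'] = (latlong[1] + latlong[3])/2
    df.loc[i, 'long'] = (latlong[0] + latlong[2])/2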


In [78]:
df

index,name,link,routes,lat,long
0,420 Acres,https://www.trailforks.com/region/420-acres-18...,10,-37.884600,144.795400
1,80 Acres,https://www.trailforks.com/region/80-acres/,19,-35.096700,138.582250
2,Abergowrie,https://www.trailforks.com/region/abergowrie-3...,29,-18.436660,146.014265
3,Adare Homestead,https://www.trailforks.com/region/adare-homest...,,-27.497990,152.290850
4,Aireys Inlet,https://www.trailforks.com/region/aireys-inlet...,,-38.431325,144.079620
...,...,...,...,...,...
95,Candlebark Park,https://www.trailforks.com/region/candlebark-p...,15,-37.744770,145.140715
96,Cape Hillsborough,https://www.trailforks.com/region/cape-hillsbo...,4,-20.915530,149.033700
97,Cape Pallarenda,https://www.trailforks.com/region/cape-pallare...,11,-19.194105,146.744700
98,Cape Range,https://www.trailforks.com/region/cape-range-3...,31,-22.050150,113.976995
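
The bounds-to-centre arithmetic from the loop can be isolated in a small function and checked without a browser. This is a minimal sketch: `bounds_center` is a hypothetical helper name, and the sample string is made up for illustration rather than taken from the site. It assumes, as the loop does, that the four numbers arrive in the order longitude, latitude, longitude, latitude.

```python
def bounds_center(raw):
    """Return (lat, lon) at the midpoint of a bounds string."""
    # Strip brackets so that flat and nested lists are handled alike.
    cleaned = raw.replace("[", "").replace("]", "")
    lon_a, lat_a, lon_b, lat_b = (float(x) for x in cleaned.split(","))
    # Average the two corners on each axis.
    return (lat_a + lat_b) / 2, (lon_a + lon_b) / 2


# Hypothetical bounds value, not taken from the site.
sample = "[[144.5, -38.0], [145.0, -37.5]]"
lat, lon = bounds_center(sample)
```

A plain average of the corners is a reasonable approximation for areas as small as a trail network. It would not be correct for bounds that cross the 180° meridian, which does not arise for the Australian areas listed here.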


In [79]:
df['Sport'] = 'Mountain Biking'
# Write the table to Excel without the row index.
df.to_excel('trailforkes.xlsx', index=False)