# Webscraping form Walkhighlands 


Most of the walk pages I'm interested in on the walkhighlands use relative URLs with the format base/region/walk-name. Unfortunately, there is no one list given on the site so I need to scrape multiple pages: first the home page, then region specific pages (Sutherland, Torridon, ect), then finally area pages (which the region pages are split up into). Note: only highlands and islands links are used.

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as bs
import requests
import re

In [3]:
#used to find the right region to search (manually as there are not many to worry about here)
home=bs(requests.get('https://www.walkhighlands.co.uk/').content, "html.parser")
home_links=[a.get('href') for a in home.find_all('a') if a.get('href')!='#']
print(home_links)

I 'scraped' the region names from the home_links by copy and paste, this was efficient as the list I'm interested in is short.

In [3]:
walkhighlands_link='https://www.walkhighlands.co.uk'

region_links=['/sutherland/', '/ullapool/', '/torridon/', '/kintail/', '/lochness/', '/moray/', '/fortwilliam/', 
           '/cairngorms/', '/perthshire/', '/argyll/', '/lochlomond/', '/aberdeenshire/', '/angus/', '/skye/', 
           '/mull/', '/outer-hebrides/', '/arran/', '/islands/', '/islay-jura/', '/orkney/', '/shetland/']


Using each region link, the find_areas function scrapes the page for all the area page links, which it returns as a string

In [4]:
def find_areas(regions):
    base='https://www.walkhighlands.co.uk'
    area_pages=[]
    for region in regions:
        url=base+region
        soup=bs(requests.get(url).content,'html.parser')
        text=soup.find('tbody')
        area_pages+=[url+a.get('href') for a in text.find_all('a') 
                     if a.get('href')!='../grade.shtml'and a.get('href')!=None ]
    return area_pages


find_walks scrapes each area page for pages containing specific walks. The arguments passed are two lists, the first containing the urls for the area pages to be scraped, the second containing href that need to be excluded.

In [5]:
def find_walks(areas, href):
    walk_links=[]
    for area in areas:
        url=re.findall('https://www.walkhighlands.co.uk/[a-z-]*', area)[0]+'/' #strip .shtml
        soup=bs(requests.get(area).content,'html.parser')
        text=soup.find('tbody')
        try:
            walk_links+=[url+a.get('href') for a in text.find_all('a') if a.get('href') not in href 
                          and re.match('[a-zA-Z-]*.shtml', a.get('href'))]
        except:
            #print(area) #find any problematic walk links 
            pass
            
    
        try:
            walk_links+=[base+a.get('href') for a in text.find_all('a') if a.get('href') not in href 
                          and re.match('/[a-zA-Z-]*/[a-zA-Z-]*.shtml', a.get('href'))]
        
        except:
            #print(area) #find any problematic walk links
            pass
             
    
    return walk_links
    

This cell defines the list of href to exclude. The initial list was generated debugging by hand, then the loop searches for lond distance walking route to exclude. I chose to exclude long distance hikes as the walk highlands pages for them are not exactly comparible with the regular day hikes. 

In [4]:
soup=bs(requests.get('https://www.walkhighlands.co.uk/long-distance-routes.shtml').content,'html.parser')
#print(soup.prettify())
text=soup.find('tbody')
links=text.find_all('a')
excluded_href=['../grade.shtml','#','/lochness/south-loch-ness-trail.shtml',
               'affric-kintail-way.shtml','/lochness/affric-kintail-way.shtml','moray/dava-way.shtml','dava-way.shtml',
              'moray-coast-trail.shtml','cateran-trail.shtml','cowal-way.shtml','kintyre-way.shtml'
               ,'west-island-way.shtml','three-lochs-way.shtml','formartine-buchan-way.shtml','skye-trail.shtml',
              'hebridean-way.shtml','arran-coastal-way.shtml']

for link in links:
    href=link.get('href')
    #print(type(href))
    excluded_href+=[href, '/'+href]

'''for href in excluded_href:
    print(href)'''

'for href in excluded_href:\n    print(href)'

The final function required is get_walk_features, which scrapes the individual walk pages for information about each walk.

In [7]:
def get_walk_features(walk_link):
    walk=bs(requests.get(walk_link).content, 'html.parser')
    table=str(walk.find_all('dl'))
    
    name=walk.find('h1', {'itemprop':'name'}).string
    info=walk.find('p', {'itemprop':"description"}).string
    region=re.findall('/[a-zA-Z-]{1,100}/',walk_link)[0].strip('/')
    dist=float(re.findall('[0-9.]*km',table)[0].strip('km'))
    ascent=float(re.findall('[0-9]{1,6}m ',table)[0].strip('m '))
    corbett=len(re.findall('<dt>Corbett</dt>',table))
    munro=len(re.findall('<dt>Munro</dt>',table))
    grade=len(walk.find_all(src="//d3teiib5p3f439.cloudfront.net/images/boot.gif"))
    bog=len(walk.find_all(src="//d3teiib5p3f439.cloudfront.net/images/reed.gif"))
    rating=float(walk.find('span',{'itemprop':"ratingValue"}).string)
    
    array=[name,info,region,dist,ascent,corbett,munro,grade,bog,rating]
    return array

Finally I can run each of my fucntions in sequence to give me an array that I can turn into an DataFrame 

In [8]:
walk_pages=find_walks(find_areas(region_links), excluded_href)

In [14]:
final_array=[]
for page in walk_pages:
    final_array+=[get_walk_features(page)]

In [16]:
df=pd.DataFrame(final_array, columns=['name','info','region','dist','ascent','corbett','munro','grade','bog','rating'])

In [20]:
print(df)

                                                 name  \
0           Ceannabeinne Township Trail, near Durness   
1                Forsinard Flows and Tower, Forsinard   
2                                  Borgie Forest walk   
3                                       Melvich Beach   
4                           Portskerra pier and jetty   
...                                               ...   
1565  Da Kame, Da Sneug & Da Noup: the complete Foula   
1566         Fair Isle North Lighthouse & Observatory   
1567                             Ward Hill, Fair Isle   
1568         Malcolm's Head and Sheep Rock, Fair Isle   
1569                  The Complete Fair Isle explorer   

                                                   info      region   dist  \
0     This short walk round the site of an abandoned...  sutherland    1.0   
1     This short but truly unique walk gives a fasci...  sutherland    1.5   
2     This short forestry walk briefly follows the R...  sutherland   1.75   
3  

Saving my work as a csv to use in future notebooks

In [24]:
df.to_csv('Walkhighlands_raw.csv')

# Features

Let's take a quick look at what each feature means: 

name: the name of the walk (str)

info: a brief summary of the walk (str)

region: the region the walk is located in (str)

dist: distance in Kilometres (float)

ascent: Total ascent on the route in metres (float)

corbett: number of corbett (mountians between 2500-3000ft) summits on route (int)

munro: number of munro (mountians over 3000ft) summits on route (int)

grade: a value of 1-5 given to a walk based on how challenging the terrain is

bog: a value of 1-5 based on how wet/boggy the route is 


rating: The average score given to the route by uses on the site
