## Wild Swimming - Data Acquisition

In this project I use natural language processing to cluster descriptions of wild swimming locations. To do this I need a dataset of wild swimming location descriptions. Luckily for me, there is such a database here: http://www.wildswimming.co.uk/wild-swim-map-uk/?multi_region=wild-swim-map-uk . However, I need to scrape this data from the website before I can start work on it. 

I can do this using the web scraping skills I developed in my previous project "Web-Scraping---Fantasy-F1". I will use BeautifulSoup to scrape the text data. (The different locations each have a separate webpage, so I don't need to use Selenium to deal with JavaScript elements like I did before.) I will store this information in a pandas dataframe and then save it to a .csv file. 

In [1]:
import pandas as pd

# packages for web scraping
import requests
from bs4 import BeautifulSoup

The wild swimming website contains a list of wild swimming locations with links to a separate page for each location:  

e.g. http://www.wildswimming.co.uk/map/spitchwick-common/  

On the separate web pages there is a text description about the wild swim location:  

e.g. *Peaty water, clean from the mountain, this is the most popular and accessible Dart swimming location, especially in summer. Also known as Deeper Marsh, it has been a bathing place for generations. Grassy flats lead to rocky river shore, deeper on far side with high cliff behind.*  

These individual descriptions (one for each wild swim location) shall be the individual documents of my corpus. 

First we need to set up BeautifulSoup on the corpus webpage (again, http://www.wildswimming.co.uk/wild-swim-map-uk/?multi_region=wild-swim-map-uk ).

In [2]:
# website with the list of all wild swimming locations (the corpus)
corpus_website = "http://www.wildswimming.co.uk/wild-swim-map-uk/?multi_region=wild-swim-map-uk"

# get the page with the requests package and make 'soup' from it
# we can then search through the 'soup' for elements of the webpage
corpus_page = requests.get(corpus_website)
corpus_soup = BeautifulSoup(corpus_page.text, 'html.parser')
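
A request can hang or come back with an error status, so it may be worth wrapping the fetch-and-parse step in a small helper. The sketch below is one possible way to do that and is not part of the original notebook: `fetch_soup`, its `session` and `timeout` parameters, and the 10-second default are all assumptions made for illustration.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, session=None, timeout=10):
    """Download a page and return its parsed 'soup' (hypothetical helper)."""
    # use the supplied session if there is one, otherwise plain requests
    http = session or requests
    response = http.get(url, timeout=timeout)
    # raise an HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

With this in place the cell above would reduce to `corpus_soup = fetch_soup(corpus_website)`.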

Now we can find the list of locations, count how many locations there are, and pull out the individual link for each location.

In [3]:
locations = corpus_soup.find("ul", class_ = "category_list_view clearfix").find_all("li", class_ = "clearfix")

N_locations = len(locations)
print("There are", N_locations, "different wild swimming locations listed in the database")

# find the link for one example location
n = 3
document_webpage = locations[n].find("a")['href']  # link to the location's own page
print("An example link found on the corpus webpage:", document_webpage)

# the same link is reused below to scrape the page with the location description on it (the document)

There are 200 different wild swimming locations listed in the database
An example link found on the corpus webpage: http://www.wildswimming.co.uk/map/spitchwick-common/
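
When `class_` is given a string containing two class names, BeautifulSoup compares it with the `class` attribute as one exact string, so the lookup above would stop matching if the site ever reordered or added classes. A CSS selector passed to `select` checks each class independently. The sketch below shows the difference on a made-up HTML fragment that only imitates the structure assumed above; the `example.invalid` links are placeholders and the real page markup may differ.

```python
from bs4 import BeautifulSoup

# made-up fragment: same two classes as the real list, but in the other order
sample_html = """
<ul class="clearfix category_list_view">
  <li class="clearfix"><a href="http://example.invalid/map/swim-a/">Swim A</a></li>
  <li class="clearfix"><a href="http://example.invalid/map/swim-b/">Swim B</a></li>
</ul>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# exact-string match on the class attribute: the order differs, so nothing is found
exact_match = sample_soup.find("ul", class_="category_list_view clearfix")

# CSS selector: each class is matched on its own, whatever the order
links = [a["href"] for a in sample_soup.select("ul.category_list_view li.clearfix a[href]")]
```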


With the link we can use BeautifulSoup again to find the text description (the document) and the location name (the label).

In [4]:
# set up BeautifulSoup on new page
document_page = requests.get(document_webpage)
document_soup = BeautifulSoup(document_page.text, 'html.parser')

# find document and label
label = document_soup.find("h1", class_ = "main_title").find("a").contents[0]   # document label found here
document = document_soup.find("div", class_ = "posts post_spacer").find("p").contents[0]   # document found here

print("Location Name:", label)
print("Description:", document)

Location Name:  Spitchwick Common 
Description:  Peaty water, clean from the mountain, this is the most popular and accessible Dart swimming location, especially in summer. Also known as Deeper Marsh, it has been a bathing place for generations. Grassy flats lead to rocky river shore, deeper on far side with high cliff behind.


See how this matches what we expected from following the links before.  
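
The printed name and description carry stray spaces at their edges, because `.contents[0]` returns the first child node exactly as it appears in the HTML. It would also cut a description short if the paragraph contained a nested tag such as `<em>`. One possible alternative is `get_text`, sketched here on made-up markup that imitates the structure assumed above.

```python
from bs4 import BeautifulSoup

# made-up page fragment with padding spaces and a nested tag
sample_html = """
<h1 class="main_title"><a href="#"> Example Pool </a></h1>
<div class="posts post_spacer"><p> Clear water, <em>deep</em> on the far side. </p></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

title_link = sample_soup.find("h1", class_="main_title").find("a")
paragraph = sample_soup.find("div", class_="posts post_spacer").find("p")

# .contents[0] stops at the first child node and keeps its whitespace
first_child = paragraph.contents[0]

# get_text gathers all nested text; strip/split tidy up the whitespace
label = title_link.get_text(strip=True)
document = " ".join(paragraph.get_text().split())
```

Here `first_child` holds only the text before the `<em>` tag, while `document` holds the whole sentence.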

Now all we need to do is loop over all the links on the corpus webpage. I'll store the data in a pandas dataframe.

In [5]:
# empty list to fill with one record per location
rows = []

# loop through wild swim locations
for n in range(N_locations):
    
    if n == 37:         # for some reason this link does not work. http://www.wildswimming.co.uk/map/silent-pool-and-bolder-mere/
        continue        # all the others are fine so I'll just skip this one. 
    
    # navigate to page and get soup
    document_webpage = locations[n].find("a")['href']  # webpage found here.
    document_page = requests.get(document_webpage)
    document_soup = BeautifulSoup(document_page.text, 'html.parser')
    
    # extract document and label from soup
    label = document_soup.find("h1", class_ = "main_title").find("a").contents[0]   # document label found here
    document = document_soup.find("div", class_ = "posts post_spacer").find("p").contents[0]   # document found here

    # collect the record (DataFrame.append was removed in pandas 2.0)
    rows.append({'Location Name': label, 'Description': document})

# build the dataframe once, after the loop
df = pd.DataFrame(rows, columns=['Location Name', 'Description'])
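
Skipping index 37 by hand works for this snapshot of the website, but the position of the broken link could change if the list is updated. A possible alternative is to have the extraction step report when a page lacks the expected elements, and skip any page where that happens. `extract_record` below is a hypothetical helper, not part of the original notebook, and it is shown on made-up HTML rather than the real pages.

```python
from bs4 import BeautifulSoup


def extract_record(html):
    """Return (label, document) for a location page, or None if either is missing."""
    soup = BeautifulSoup(html, "html.parser")
    title_link = soup.select_one("h1.main_title a")
    paragraph = soup.select_one("div.posts.post_spacer p")
    if title_link is None or paragraph is None:
        return None  # page does not have the expected layout
    return title_link.get_text(strip=True), paragraph.get_text(strip=True)


# made-up pages: one with the assumed layout, one without it
good_page = (
    '<h1 class="main_title"><a href="#">Example Pool</a></h1>'
    '<div class="posts post_spacer"><p>Clear water.</p></div>'
)
broken_page = "<html><body><p>Page not found</p></body></html>"

# keep only the pages that produced a record
records = [rec for rec in map(extract_record, [good_page, broken_page]) if rec is not None]
```

Inside the loop this would mean calling `extract_record(document_page.text)` and using `continue` whenever it returns `None`.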

    


There we go. Now we have all the data off the website and stored in a handy dataframe. It looks like this:

In [6]:
df.head(10)

|   | Location Name | Description |
|---|---|---|
| 0 | Cornish Tipi Holidays | A luscious, chalk-green, spring-fed quarry lak... |
| 1 | Colliford Lake and Dozmary | A huge moorland lake, the highest and largest ... |
| 2 | St Nectan’s Kieve | A tall, slender Waterfall or Gorge at the head... |
| 3 | Spitchwick Common | Peaty water, clean from the mountain, this is... |
| 4 | Respryn Bridge | Wooded National Trust estate riverside walk. S... |
| 5 | Bodmin Parkway | Still in the Lanhydrock Estate, this swim is o... |
| 6 | Holne Pool | Small Waterfall or Gorge with sunbathing rock ... |
| 7 | Golitha Falls | Beautiful stream of young river Fowey runs thr... |
| 8 | Chagford Lido, Teign | River-fed swimming pool with café set in field... |
| 9 | Wellsfoot Island | Wonderful wooded island with red sand beach sh... |


I just need to save it to a .csv for use in the rest of my project. 

In [7]:
df.to_csv("corpus.csv", index=False)
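
The names and descriptions contain commas, quotation marks and accented characters, so a quick round trip can confirm that the file reads back unchanged. The sketch below uses a made-up two-row dataframe and a temporary file rather than the real `corpus.csv`; the file name `corpus_check.csv` and the explicit `encoding="utf-8"` are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# made-up sample with a comma, an accent and quotation marks in the text
sample_df = pd.DataFrame({
    "Location Name": ["Chagford Lido, Teign", "Example Pool"],
    "Description": ["River-fed swimming pool with café.", 'A "quoted" description.'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "corpus_check.csv")
    # fields containing commas or quotes are quoted automatically on write
    sample_df.to_csv(path, index=False, encoding="utf-8")
    reloaded = pd.read_csv(path, encoding="utf-8")

# compare cell by cell to confirm nothing changed on the way through
round_trip_ok = reloaded.values.tolist() == sample_df.values.tolist()
```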