# Web Scraping

## Extracting text


If you use spidering or just download a list of webpage URLs (e.g. with [curl](https://curl.haxx.se/docs/manpage.html) or [requests](http://docs.python-requests.org/en/latest/user/quickstart/)), you will be left with a collection of HTML files. These raw HTML files contain a lot of redundant material (e.g. adverts, menus, headings), known as "boilerplate". What we want is only the main text of the webpage (i.e. the news article, the blog post, etc.) in plain text, without the HTML tags. If you are not familiar with HTML, [there is a guide here](https://www.w3schools.com/html/).

### Justext

Webpages come in all shapes and sizes, and how easy it is to scrape the text of interest varies massively. If we're lucky, it's relatively easy to pick out the text, and there are fully automated tools available for that. One of these tools is [justext](https://github.com/miso-belica/jusText).

1. Have a read through the [description of the justext algorithm](http://corpus.tools/wiki/Justext/Algorithm).
2. There is a good [online demo of the tool](https://nlp.fi.muni.cz/projects/justext/). Try it out on a webpage, e.g. a BBC news article; this article demonstrates it well: <http://www.bbc.co.uk/bbcthree/article/cc72247b-e658-4af8-a838-dfe4e68e2776>.
3. Manually compare the filtered text and the text on the web page. How accurate is the text extraction? How much is missed, and how much text is incorrectly included? What are the potential impacts of this in a later analysis of the text?
4. You can use justext on the command line too. Use it to process a previously gathered webpage, or use the sample webpage available alongside this notebook ([5.html](5.html)), e.g.:
```
$ python3 -m justext -s English -o 5.txt 5.html
```

5. Examine the text produced. How much of the text is output correctly?
6. Note the `<h>` and `<p>` tags, indicating headers and paragraphs. These may be useful, if for instance you are only interested in headers or the actual body of the text, or want to separate them in later analysis. You'll learn next week how you could filter these, e.g. using regular expressions.
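
As a preview of that filtering, here is a minimal sketch that separates headers from paragraphs with a regular expression. It assumes each line of the justext output starts with `<h>` or `<p>` followed by the text. `split_justext_output` and `sample_output` are made-up names for illustration and are not part of justext.

```python
import re

# Assumed line format: "<h> some heading" or "<p> some paragraph".
LINE_PATTERN = re.compile(r"^<(h|p)>\s*(.*)$")


def split_justext_output(text):
    """Return (headers, paragraphs) as two lists of strings."""
    headers, paragraphs = [], []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            continue  # skip blank lines or anything without a leading tag
        tag, content = match.groups()
        if tag == "h":
            headers.append(content)
        else:
            paragraphs.append(content)
    return headers, paragraphs


# Hypothetical sample standing in for the contents of 5.txt.
sample_output = """<h> Noisy toys
<p> First paragraph of the article.
<p> Second paragraph of the article."""

headers, paragraphs = split_justext_output(sample_output)
print(headers)
print(paragraphs)
```

In practice you would read the text from the output file (e.g. `5.txt`) instead of using an inline sample.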


Often an automated approach will not provide accurate enough results. Fortunately, there are alternatives to manually copying and pasting the text. Many tools have been built to assist in parsing web pages and extracting the text of interest, although you need to write scripts with these tools for each set of websites. Some mimic a user's interaction with a website to get to the relevant data (e.g. [Selenium](http://seleniumhq.github.io/selenium/docs/api/py/)). [Scrapy](https://doc.scrapy.org/en/latest/index.html), as used in the other lab exercise, is another good option for grabbing and processing webpages.

### Beautiful Soup

Here we will be using the Python requests package, which makes downloading webpages easy: <http://docs.python-requests.org/en/latest/user/quickstart/>

As above, this will provide raw HTML files. The key part is extracting the relevant parts of the webpage. For this we will use Beautiful Soup: <https://www.crummy.com/software/BeautifulSoup/bs4/doc/>

In [4]:
import requests
from bs4 import BeautifulSoup

The basic process is to look at the HTML of the target webpage and look for ways of drilling down to the elements of interest, with the overall aim of extracting just the specific text of interest. This could be metadata, or actual text for further analysis. Beautiful Soup provides numerous methods; please consult [the documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for the options available.

All websites are different (although some have consistent structures), and often a custom scraper needs to be developed. Use a standard web browser to look at the information of interest: right-click on the first part of the data of interest and select "Inspect Element" (Firefox & Safari) or "Inspect" (Chrome). You will then see the HTML code for that element and the surrounding elements.

You can then use Beautiful Soup's functions for drilling down and traversing the relevant parts of the web page. You can also extract links and use the requests package to download further webpages for processing.

To demonstrate, we will be parsing Wikipedia pages. As an example we will look to extract plot summaries for Star Trek episodes: <https://en.wikipedia.org/wiki/List_of_Star_Trek:_The_Original_Series_episodes>. We can download this webpage with requests easily:

In [2]:
base_url = "https://en.wikipedia.org/wiki/List_of_Star_Trek:_The_Original_Series_episodes"
#load webpage
req = requests.get(base_url)

Next we use Beautiful Soup to put the HTML webpage into a parseable document:

In [3]:
soup = BeautifulSoup(req.text, "html.parser")

Use a web browser to view the [Wikipedia page](https://en.wikipedia.org/wiki/List_of_Star_Trek:_The_Original_Series_episodes). We are targeting the list of episodes in the tables starting under "Season 1 (1966–67)". Right-click on the table cell with the first episode title ("The Man Trap") and select "Inspect Element" (or just "Inspect" in Chrome). Note that this cell (`td`) is of class "summary", as is every title cell, and other cells are not. This is our way in. We are going to collect a list of titles from the table, along with the URL of the Wikipedia page about each episode.

In [4]:
episodes = [] #to store the list of episodes.
#find and loop through all tds (table cells) with class name "summary" (which we know is an episode title)
for episode_cell in soup.find_all('td', {'class': 'summary'}):
    title = episode_cell.a.text.strip() #Get the actual text from the cell.
    episode_url = episode_cell.a['href'] #extract the url
    episodes.append({'title': title, 'url': episode_url}) #store in dictionary format
    
episodes

[{'title': 'The Man Trap', 'url': '/wiki/The_Man_Trap'},
 {'title': 'Charlie X', 'url': '/wiki/Charlie_X'},
 {'title': 'Where No Man Has Gone Before',
  'url': '/wiki/Where_No_Man_Has_Gone_Before'},
 {'title': 'The Naked Time', 'url': '/wiki/The_Naked_Time'},
 {'title': 'The Enemy Within',
  'url': '/wiki/The_Enemy_Within_(Star_Trek:_The_Original_Series)'},
 {'title': "Mudd's Women", 'url': '/wiki/Mudd%27s_Women'},
 {'title': 'What Are Little Girls Made Of?',
  'url': '/wiki/What_Are_Little_Girls_Made_Of%3F'},
 {'title': 'Miri', 'url': '/wiki/Miri_(Star_Trek:_The_Original_Series)'},
 {'title': 'Dagger of the Mind', 'url': '/wiki/Dagger_of_the_Mind'},
 {'title': 'The Corbomite Maneuver', 'url': '/wiki/The_Corbomite_Maneuver'},
 {'title': 'The Menagerie',
  'url': '/wiki/The_Menagerie_(Star_Trek:_The_Original_Series)'},
 {'title': 'The Conscience of the King',
  'url': '/wiki/The_Conscience_of_the_King'},
 {'title': 'Balance of Terror', 'url': '/wiki/Balance_of_Terror'},
 {'title': 'Shor

Note that the URLs are relative. They can be made absolute with urljoin, using the base URL as a reference:

In [5]:
from urllib.parse import urljoin #https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urljoin
for episode in episodes:
    episode['url'] = urljoin(base_url,episode['url'])
    
episodes

[{'title': 'The Man Trap',
  'url': 'https://en.wikipedia.org/wiki/The_Man_Trap'},
 {'title': 'Charlie X', 'url': 'https://en.wikipedia.org/wiki/Charlie_X'},
 {'title': 'Where No Man Has Gone Before',
  'url': 'https://en.wikipedia.org/wiki/Where_No_Man_Has_Gone_Before'},
 {'title': 'The Naked Time',
  'url': 'https://en.wikipedia.org/wiki/The_Naked_Time'},
 {'title': 'The Enemy Within',
  'url': 'https://en.wikipedia.org/wiki/The_Enemy_Within_(Star_Trek:_The_Original_Series)'},
 {'title': "Mudd's Women",
  'url': 'https://en.wikipedia.org/wiki/Mudd%27s_Women'},
 {'title': 'What Are Little Girls Made Of?',
  'url': 'https://en.wikipedia.org/wiki/What_Are_Little_Girls_Made_Of%3F'},
 {'title': 'Miri',
  'url': 'https://en.wikipedia.org/wiki/Miri_(Star_Trek:_The_Original_Series)'},
 {'title': 'Dagger of the Mind',
  'url': 'https://en.wikipedia.org/wiki/Dagger_of_the_Mind'},
 {'title': 'The Corbomite Maneuver',
  'url': 'https://en.wikipedia.org/wiki/The_Corbomite_Maneuver'},
 {'title': '

Now we have a list of episodes and their individual Wikipedia pages for downloading. We can try justext on the first episode to see if automatic extraction will do:

In [6]:
import justext
episode_req = requests.get(episodes[0]['url']) #use requests to download episode webpage.



paragraphs = justext.justext(episode_req.text, justext.get_stoplist("English"))
text = ""

for p in paragraphs:
    if not p.is_boilerplate:
        text += p.text.strip() + "\n"
        
print(text)

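(The output is not shown here: the notebook run recorded `ModuleNotFoundError: No module named 'justext'`, so install the package first, e.g. with `pip install justext`.)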

Compare this to the [Wikipedia page](https://en.wikipedia.org/wiki/The_Man_Trap). It does a pretty good job of getting the text from the whole page. However, we only want to include the plot section. To do this, we need to be more specific about what to extract, so we can use Beautiful Soup.

Looking at the [Wikipedia page](https://en.wikipedia.org/wiki/The_Man_Trap), you can see the sections are headed with an `h2` element, so to find the "Plot" section we just need to go through the `h2` tags, find the one with "Plot" and hoover up all of the text between there and the next section. We can do this for the first episode as follows:

In [None]:
from bs4 import Tag #we need the Tag class from Beautiful Soup to check if a node we are looking at is Tag.

episode_soup = BeautifulSoup(episode_req.text, "html.parser") #use beautiful soup to decode into a parseable document.

episode_plot = ""

for h2 in episode_soup.find_all("h2"): #Go through all of the h2 elements.
    if h2.text.strip().startswith("Plot"): #This is the h2 with "Plot" (and "Plot Summary")
        node = h2.next_sibling #start looking for tags after the Plot h2, will be strings and Tags.
        while node is not None: #stop if we run out of siblings (avoids an AttributeError on None).
            if isinstance(node, Tag): #Check if this element is actually a Tag.
                if node.name == "p": #p tag, we want this.
                    episode_plot += node.text.strip() + "\n" #append the text from p.
                elif node.name == "h2": #at the next h2, so a new section, no longer the plot. Stop processing.
                    break
            node = node.next_sibling #get next element at same level.
                    
print(episode_plot)

You'll see we now just have the plot text. We just need to wrap this up in a loop and add the plot to each episode.

In [None]:
for episode in episodes:
    episode_plot = ""
    episode_req = requests.get(episode['url']) #do a new request for the episode page.
    episode_soup = BeautifulSoup(episode_req.text, "html.parser") #use beautiful soup to decode into a parseable document.
    for h2 in episode_soup.find_all("h2"): #Go through all of the h2 elements.
        if h2.text.strip().startswith("Plot"): #This is the h2 with "Plot" (and "Plot Summary")
            node = h2.next_sibling #start looking for tags after the Plot h2, will be strings and Tags.
            while node is not None: #stop if we run out of siblings (avoids an AttributeError on None).
                if isinstance(node, Tag): #Check if this element is actually a Tag.
                    if node.name == "p": #p tag, we want this.
                        episode_plot += node.text.strip() + "\n" #append the text from p.
                    elif node.name == "h2": #at the next h2, so a new section, no longer the plot. Stop processing.
                        break
                node = node.next_sibling #get next element at same level.

    episode['plot'] = episode_plot #add the plot to episode (empty string if no Plot section was found).




As this is in dictionary format, it's convenient to export it to JSON. It is simple to change this to write to a file instead (see e.g. the Twitter task: tweets_json).

In [None]:
import json
print(json.dumps(episodes,indent=4)) #print out the resulting json (pretty printed).
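
Writing to a file instead of printing is a small change. A minimal sketch, assuming a hypothetical output filename `episodes.json` and a small stand-in list in place of the scraped `episodes`:

```python
import json

# Stand-in data with the same shape as the scraped list (plot text is illustrative).
episodes = [
    {
        "title": "The Man Trap",
        "url": "https://en.wikipedia.org/wiki/The_Man_Trap",
        "plot": "Plot text goes here.",
    },
]

# ensure_ascii=False keeps any non-ASCII characters readable in the file.
with open("episodes.json", "w", encoding="utf-8") as f:
    json.dump(episodes, f, indent=4, ensure_ascii=False)

# Read the file back to check the round trip.
with open("episodes.json", encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded[0]["title"])
```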

The code is split up here to explain it more neatly in a notebook; the full code is available in [startrekscrape.py](startrekscrape.py).

## Wikipedia Exercise

To practice using Beautiful Soup, try extracting details of films by Stanley Kubrick from this page: <https://en.wikipedia.org/wiki/Filmography_and_awards_of_Stanley_Kubrick>. The aim is to extract the year and title for the 17 films listed. This is a little tricky as the year spans two rows in some cases. Some starting code is provided:

In [6]:
base_url = "https://en.wikipedia.org/wiki/Filmography_and_awards_of_Stanley_Kubrick"
#load webpage
req = requests.get(base_url)
#Use beautiful soup to decode webpage text into parseable document.
soup = BeautifulSoup(req.text, "html.parser")

table = soup.find('table', {'class': 'wikitable'}) #find just one table, as the one we want is the first
trs = table.find_all('tr') #all table rows
trs = trs[1:] #skip header row


for row in trs:
    td_value = row.find_all("td")
    year  = td_value[0].text
    title = td_value[1].text
    year = year.replace("Yes","") 
    title = title.replace("Yes","")   
    
    print(year)
    print(title)



1951

Day of the Fight

Flying Padre



1953

Fear and Desire

The Seafarers



1955

Killer's Kiss

1956

The Killing

1957

Paths of Glory

1960

Spartacus

1962

Lolita

1964

Dr. Strangelove

1968

2001: A Space Odyssey

1971

A Clockwork Orange

1975

Barry Lyndon

1980

The Shining

1987

Full Metal Jacket

1999

Eyes Wide Shut

2001

A.I. Artificial Intelligence



Hint: You can use the `rowspan` attribute to see if a cell covers multiple rows. You can also use [`find_all` with a limit](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-limit-argument) to return only a limited number of nodes from another node, e.g. `td`s in a `tr`.
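
To illustrate the hint, here is a minimal sketch on a made-up table (not the Kubrick page, whose markup may differ). It carries a year forward while a `rowspan` cell is still covering the following rows. The variable names and the table contents are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up table: the first year cell spans two rows.
html = """
<table>
  <tr><th>Year</th><th>Title</th></tr>
  <tr><td rowspan="2">2001</td><td>Film A</td></tr>
  <tr><td>Film B</td></tr>
  <tr><td>2005</td><td>Film C</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table").find_all("tr")[1:]  # skip header row

films = []
rows_left = 0  # how many more rows the current year cell still covers
year = None
for row in rows:
    if rows_left > 0:
        # The year cell from an earlier row still applies; the first td is the title.
        title_cell = row.find_all("td", limit=1)[0]
        rows_left -= 1
    else:
        year_cell, title_cell = row.find_all("td", limit=2)
        year = year_cell.text.strip()
        # rowspan="2" means this cell also covers one further row.
        rows_left = int(year_cell.get("rowspan", 1)) - 1
    films.append((year, title_cell.text.strip()))

print(films)
```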

#### Advanced

You can also use what you've learnt to parse details of something else from Wikipedia, e.g. other TV series, or albums from an artist's discography.

## Forum Exercise

The other lab workbook uses Scrapy to download pages from a forum; you can perform a similar task for forums with Requests and Beautiful Soup. You can choose any forum and use what you've learnt to parse thread posts into plain text. Start with an individual thread, and 1 page of posts within that thread. A good example to try is from Mumsnet on noisy baby toys (because... why not?): <https://www.mumsnet.com/Talk/toys_and_games_chat/3414974-noisy-baby-toys-which-are-the-worst>.

1. Try using justext to parse the first page of posts and examine the results. Is this good enough?
2. Use Beautiful Soup to collect just the posts from the first page and output as plain text.

Hint: one issue you may come across is `<br>` tags instead of newlines. If ignored, this will run lines of text in the same post directly next to each other (without even a space), e.g. something horrible like this:
"Allllllll of the toot tootSpin and bounce zebraAnd best (worst!) of all... peppa pig alphaphonic board- it has no off switch and the slightest touch sets it off for ages..."

This makes tokenisation (see next week) difficult and is tricky to rectify later. This is a good lesson in sanity checking the exported text: it is better to keep the text as close as possible to how it appears on the webpage at this stage. To deal with this issue, the `<br>` tags can be replaced with a newline character (or some other marker), e.g.:

```
for br in post.find_all("br"):
    br.replace_with("\n")
```

In [73]:
from bs4 import BeautifulSoup
import re

import requests

r  = requests.get("https://www.mumsnet.com/Talk/toys_and_games_chat/3414974-noisy-baby-toys-which-are-the-worst")

data = r.text

soup = BeautifulSoup(data, "html.parser") #name the parser explicitly to avoid a warning and get consistent results

posts = soup.find_all("div", {"class": "message"})

#Using regex
regexStripped = re.sub("<.*?>", " ", str(posts))
print(regexStripped)

#Using soup
soup = BeautifulSoup(str(posts), "html.parser")
all_text = ''.join(soup.find_all(string=True)) #join every text node in the posts
print(all_text)



[ 
 My best friend has a 6 month old and after 14 years of her gleefully finding the biggest most annoying toys she could for my children I am desperate to get my own back.  What are the current most ear shattering noisy toys ( preferably plastic with flashing lights) that you can buy now.  I remember vtech baby walkers were pretty appalling when mine were small. 
 ,  
 Following with interest, I'm in a similar situation   
 ,  
 Might just be us  @Sjjr23   I'm thinking the pink vtech walker looks horrific so might go for that 
 ,  
 I'm bumping this as I refuse to believe we are the only 2 vengeful people on here   
 ,  
  My 18 month is currently playing on her Vtech bounce and spin frog. It's very noisy I have the subtitles on TV . The volume has 2 settings though so she can always just switch it off  
 ,  
 Seems like vetch is the best/worst gift to give 
 ,  
 Lamaze Sunny Rabbit. It doesn't have an off button 😡 
 ,  
  The fisher price cookie jar is awful! One of those that the s

#### Advanced

You will notice that the topic's posts are spread across multiple pages; this means that "paging" needs to be performed to extract the posts from every page. This is a little tricky, but build on what you did to collect 1 page. You will need to consult the [requests documentation](http://docs.python-requests.org/en/latest/user/quickstart/) to discover how to pass parameters to match the links to other pages. Be careful not to have duplicate posts in your extraction (the original post is repeated on each page).
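
One way to structure the paging is sketched below, with the downloading kept separate from the de-duplication. This is only a sketch under assumptions: `collect_posts`, `fake_fetch_page` and `fake_pages` are hypothetical names, and the fake pages stand in for real downloads so the logic can be tried without touching the website.

```python
def collect_posts(fetch_page, max_pages=100):
    """Collect posts from successive pages, skipping duplicates.

    fetch_page(n) is a caller-supplied function (an assumption of this
    sketch) returning (posts, is_last) for page number n.
    """
    seen = set()
    collected = []
    for page in range(1, max_pages + 1):
        posts, is_last = fetch_page(page)
        for post in posts:
            if post not in seen:  # e.g. the original post repeated on each page
                seen.add(post)
                collected.append(post)
        if is_last:
            break
    return collected


# Fake pages standing in for real downloads. With requests, fetch_page
# might instead call requests.get(thread_url, params={"pg": n}).
fake_pages = {
    1: ["original post", "reply 1", "reply 2"],
    2: ["original post", "reply 3"],
}


def fake_fetch_page(n):
    return fake_pages[n], n == max(fake_pages)


all_posts = collect_posts(fake_fetch_page)
print(all_posts)
```

De-duplicating on the post text alone would also drop two genuinely identical replies; if the forum markup exposes a post identifier, that would be a safer key.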

The trickiest part is to know when to stop. Going past the number of pages available just brings the user to the final page. We will be covering regular expressions next week, so to help you out, the following code can be used to find the final page.

In [75]:
#this isn't complete code, you'll need to incorporate it into your solution.

from bs4 import BeautifulSoup
import re

import requests

results = []

lastpage = re.compile(r"page (\d+) of \1\b") #compiled regular expression, checking if we're at page n of n, i.e. last page (the \b stops "page 1 of 12" from matching).

url = "https://www.mumsnet.com/Talk/toys_and_games_chat/3414974-noisy-baby-toys-which-are-the-worst?pg=1"

#a loop over the page numbers would go here; only a single page is fetched below.
r  = requests.get(url)

data = r.text

soup = BeautifulSoup(data, "html.parser")

posts = soup.find_all("div", {"class": "message"})

#Using regex
regexStripped = re.sub("<.*?>", " ", str(posts))
print(regexStripped)

#...

at_last = False #assume we are not on the last page until the regex says otherwise.
pages = soup.find('div', {'class': 'pages'}).text.strip() #find pages element which has the "This is page x of y" text.
if lastpage.search(pages) is not None: #if our regex matches, then we're on last page, so make this the last one to parse.
    at_last = True
    
print(at_last)

[ 
 My best friend has a 6 month old and after 14 years of her gleefully finding the biggest most annoying toys she could for my children I am desperate to get my own back.  What are the current most ear shattering noisy toys ( preferably plastic with flashing lights) that you can buy now.  I remember vtech baby walkers were pretty appalling when mine were small. 
 ,  
 Following with interest, I'm in a similar situation   
 ,  
 Might just be us  @Sjjr23   I'm thinking the pink vtech walker looks horrific so might go for that 
 ,  
 I'm bumping this as I refuse to believe we are the only 2 vengeful people on here   
 ,  
  My 18 month is currently playing on her Vtech bounce and spin frog. It's very noisy I have the subtitles on TV . The volume has 2 settings though so she can always just switch it off  
 ,  
 Seems like vetch is the best/worst gift to give 
 ,  
 Lamaze Sunny Rabbit. It doesn't have an off button 😡 
 ,  
  The fisher price cookie jar is awful! One of those that the s
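
To see how the last-page check behaves, the following sketch compares the pattern with and without a trailing word boundary (`\b`) on a few made-up strings in the "This is page x of y" style. Without the boundary, the backreference can match just the first digit of a longer page count.

```python
import re

# The same pattern with and without a trailing word boundary.
lastpage_loose = re.compile(r"page (\d+) of \1")
lastpage_strict = re.compile(r"page (\d+) of \1\b")

# Made-up examples of the pager text.
samples = [
    "This is page 3 of 3",
    "This is page 2 of 3",
    "This is page 1 of 12",
]

results = {}
for text in samples:
    loose = lastpage_loose.search(text) is not None
    strict = lastpage_strict.search(text) is not None
    results[text] = (loose, strict)
    print(f"{text!r}: loose={loose}, strict={strict}")
```

The third sample is the interesting one: the loose pattern reports the last page on page 1 of 12, while the strict one does not.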

You can go further by parsing multiple threads from a (page of a) subforum (or "Talk" on Mumsnet), e.g. try extracting all of the threads from the first page from baby names discussions: <https://www.mumsnet.com/Talk/baby_names>.