# Scraping comments from the Guardian using BeautifulSoup 

In this tutorial I will demonstrate how `BeautifulSoup`, from the Python library `bs4`, can be used to scrape comments from news websites, using the Guardian as an example. The code presented here is largely derived from [tfeltwell's repository](https://github.com/tfeltwell/Guardian-comment-scraper).

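### Install bs4 

We are going to use two libraries, `bs4` and `urllib`. Only `bs4` needs installing: `urllib` is part of Python's standard library, so there is nothing to install for it.

```
pip install bs4
```

The code in this tutorial is written for Python 2, where `urlopen` and `urlencode` live directly in `urllib`. In Python 3 they moved to `urllib.request` and `urllib.parse` respectively.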

### Select an article from the Guardian 

Choose an article you wish to scrape the comments from. Note that if the article has no comments section, the code will raise an `AttributeError`. More on this later...

For this tutorial we will use the following url:

```python
url = 'https://www.theguardian.com/music/2019/feb/21/peter-tork-monkees-dies-aged-77'
```

### Get the article in HTML 

`BeautifulSoup`, a class in the `bs4` library, parses HTML. To download the HTML behind our url we can use `urllib`.

Start by importing `BeautifulSoup` and `urllib`, then download the page and parse it:

```python
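from bs4 import BeautifulSoup
import urllib

# Python 2: urlopen lives directly in urllib (Python 3 moved it to urllib.request)
html = urllib.urlopen(url).read()
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')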
```

The object `soup` represents the webpage as an HTML document.
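
The download step above uses `urllib.urlopen`, which exists only in Python 2. As a rough sketch of how the same step could be written to run under either version, the helper below picks whichever import is available. The name `fetch_html` is an assumption made for this example and is not part of the original code.

```python
try:
    from urllib.request import urlopen  # Python 3 location
except ImportError:
    from urllib import urlopen  # Python 2 location


def fetch_html(url):
    """Return the raw bytes served at `url` (hypothetical helper)."""
    response = urlopen(url)
    try:
        return response.read()
    finally:
        # Release the connection even if read() fails
        response.close()
```

The returned bytes can be passed straight to `BeautifulSoup`, for example `BeautifulSoup(fetch_html(url), 'html.parser')`.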

#### Quick start HTML guide 

- HTML is a markup language for creating web pages and web applications
- Structurally it contains nested HTML **elements**
- The elements are indicated in the document by **tags**
- Tags are enclosed in angle brackets, e.g. `<p>` is the start tag for a paragraph and `</p>` is the end tag
- A useful list of element tags can be found [here](https://www.w3schools.com/tags/)

As an example, a basic HTML document may look like:

```html

<!DOCTYPE html>
<html>
  <head>
    <title> This is a title </title>
  </head>
  <body>
    <p name = "paragraph-1"> My first paragraph </p>
  </body>
</html>

```

In the above, five different tags are used: `<html>`, `<head>`, `<title>`, `<body>` and `<p>`. Each tag has a corresponding end tag where the name of the tag is preceded by a `/`. 

Looking at the paragraph element: 

```html 
<p name = "paragraph-1"> My first paragraph </p> 
```
you see that the element has been given an **attribute**. The attribute is `name` and its value is `paragraph-1`. 

Attributes are useful for searching HTML documents. A document can contain many instances of the same tag (e.g. many paragraphs), but if each instance has a unique attribute value, searching for the tag and attribute together returns exactly the element you want.
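
To make this concrete, here is a small sketch that searches the example document by tag and attribute. The second paragraph is an addition made for this illustration so that there are two `<p>` elements to choose between.

```python
from bs4 import BeautifulSoup

html_doc = """
<!DOCTYPE html>
<html>
  <head>
    <title> This is a title </title>
  </head>
  <body>
    <p name="paragraph-1"> My first paragraph </p>
    <p name="paragraph-2"> My second paragraph </p>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# `name` is also the name of find()'s first argument (the tag name),
# so an attribute that is itself called "name" has to go through `attrs`.
first = soup.find('p', attrs={'name': 'paragraph-1'})
second = soup.find('p', attrs={'name': 'paragraph-2'})

first_text = first.get_text().strip()
second_text = second.get_text().strip()
print(first_text)
print(second_text)
```

Searching for `'p'` alone would return only the first paragraph; adding the attribute is what lets us pick out the second one.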

For more on HTML see the [Wikipedia]() page or have a go at [codecademy's](https://www.codecademy.com) course.

#### Back to web scraping 

We now have an HTML document for our url contained in the object `soup`. What we want to do is search the document for the comments and other relevant information.

At this point it is useful to open the `url` in your browser and view the HTML directly. This can be done by entering developer mode or, in Safari, by displaying the web inspector (Develop -> Show Web Inspector).

Let's start by finding the title of the article. The title is a *level one heading* and will therefore have an `h1` tag. By exploring the HTML in your browser, you will find that the `h1` tag also has attributes. Here we will use the `class` attribute to find the title.

```python
article_title = soup.find('h1', class_='content__headline')
```

If you inspect `article_title`, you will see that `find()` has returned the entire HTML element, tags included. As we only want the title text, we chain three methods: `getText()` extracts the text inside the element, `strip()` removes leading and trailing whitespace, and `encode()` converts the result to a UTF-8 encoded string.

```python
article_title = article_title.getText().strip().encode('utf-8')
```
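
The effect of each of the three calls is easiest to see on a small fragment. The sketch below uses a hand-written `h1` element with a placeholder headline, not the real article, and spells the first call `get_text()`, which is the name used in the Beautiful Soup documentation for the same operation.

```python
from bs4 import BeautifulSoup

# Hand-written fragment; the headline text is a placeholder.
fragment = '<h1 class="content__headline">\n  Café society \n</h1>'
heading = BeautifulSoup(fragment, 'html.parser').find('h1', class_='content__headline')

raw_text = heading.get_text()          # text only, surrounding whitespace kept
clean_text = raw_text.strip()          # leading/trailing whitespace removed
as_bytes = clean_text.encode('utf-8')  # UTF-8 encoded byte string

print(repr(raw_text))
print(repr(clean_text))
print(repr(as_bytes))
```

Note that in Python 3 the result of `encode()` is a `bytes` object, which cannot be concatenated with an ordinary string, so the encoding step is usually only wanted under Python 2.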

Next we need to find the url that links to the comments section of the article. As we are looking for a hyperlink, the element needed will have an `a` tag and an attribute `href` corresponding to the url.

```python

comment_url = soup.find(class_='discussion__heading').find('a')['href']

```

The above line of code first searches the HTML document for an element whose `class` attribute is `discussion__heading`. It then searches within that element for an element with an `a` tag. Finally it selects the value of the `href` attribute of that element. The corresponding HTML structure is:

```html

<div class="discussion__heading">
    <div class="container__meta modern-hidden">
        <h2 class="container__meta__title">
            <a href="https://www.theguardian.com/discussion/p/az4e5" data-link-name="View all comments">
            View all comments &gt;</a>
        </h2>
    </div>
</div>

```

From this we can see that `comment_url` now has the value `https://www.theguardian.com/discussion/p/az4e5` which is a webpage containing only the comments from the article.

If it is not possible to comment on the article, as is the case with some articles on the Guardian, you will get an `AttributeError` at this point. This is because no element with the class `discussion__heading` exists, so the first `find()` returns `None`, and calling `.find('a')` on `None` fails. Without the comment url it is not possible to proceed with scraping comments.
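
One way to avoid the error is to check each step of the lookup before moving on. The sketch below is only an illustration: `find_comment_url` is a hypothetical helper name, and the two fragments are hand-written stand-ins for an article with and without a discussion section.

```python
from bs4 import BeautifulSoup


def find_comment_url(soup):
    """Return the discussion link, or None when the page has no comment section."""
    heading = soup.find(class_='discussion__heading')
    if heading is None:
        return None
    link = heading.find('a')
    if link is None:
        return None
    # .get() returns None instead of raising if the href attribute is missing
    return link.get('href')


with_comments = BeautifulSoup(
    '<div class="discussion__heading"><h2>'
    '<a href="https://www.theguardian.com/discussion/p/az4e5">View all comments</a>'
    '</h2></div>',
    'html.parser')
without_comments = BeautifulSoup('<div class="content"></div>', 'html.parser')

found = find_comment_url(with_comments)
missing = find_comment_url(without_comments)
print(found)
print(missing)
```

The caller can then test the result with `is None` and skip the article rather than crash.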

We now want to get the HTML document for the comments using, again, urllib and BeautifulSoup:

```python
html = urllib.urlopen(comment_url).read()
comment_soup = BeautifulSoup(html, 'html.parser')
```

where `comment_soup` is now an HTML document containing the details of all comments on the article.

In developer mode you will find that the comments are stored as a list; each comment has its own element with a `li` tag. In order to scrape the comments and any additional information that might be needed we want to iterate over this list, storing the data for each comment in `comment_array`. 

To start, we will find the ID and time stamp of the comment along with the author's name and ID and create a dictionary to store the information: 

```python

comment_array=[]

for comment in comment_soup.select('li.d-comment'):
    
    commentObj = {}
    
    commentObj['id'] = comment['data-comment-id']
    commentObj['timestamp'] = comment['data-comment-timestamp']
    commentObj['author'] = comment['data-comment-author'].encode('utf-8')
    commentObj['author-id'] = comment['data-comment-author-id']
```

All of this information is stored in attributes of the `li` tag, so it can be read directly from each `comment` we iterate over. BeautifulSoup lets you treat a tag like a Python dictionary: the attribute names act as keys, and indexing the tag with a name returns that attribute's value.
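
The dictionary-style access can be tried out on a small hand-written list. In this sketch the attribute names follow the ones used above, but the values and the surrounding markup are invented for illustration.

```python
from bs4 import BeautifulSoup

# Invented markup: attribute names follow the tutorial, values are made up.
fragment = """
<ul>
  <li class="d-comment" data-comment-id="101" data-comment-author="alice"></li>
  <li class="d-comment" data-comment-id="102" data-comment-author="bob"></li>
  <li class="other">not a comment</li>
</ul>
"""

soup = BeautifulSoup(fragment, 'html.parser')

# CSS selector: <li> elements that carry the class "d-comment"
comments = soup.select('li.d-comment')

ids = [comment['data-comment-id'] for comment in comments]
first_attrs = comments[0].attrs  # all attributes of the tag as a dict
missing = comments[0].get('data-comment-timestamp')  # None rather than a KeyError

print(ids)
print(sorted(first_attrs))
print(missing)
```

Indexing with square brackets raises a `KeyError` when the attribute is absent, whereas `.get()` returns `None`, which can be handy for attributes that are not present on every comment.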

To obtain the text within each `li` tag corresponding to the comment text, we search for the element using the class attribute:

```python
    body = comment.find(class_='d-comment__body')

    if body.blockquote is not None:
        body.blockquote.clear()

    commentObj['text'] = body.getText().strip().encode('utf-8')
```

Here the `if` statement handles comments that contain a `<blockquote>` element, the HTML tag for quoted text. Calling `clear()` empties that element, so its contents are left out of the extracted text; note that `body.blockquote` refers only to the first `<blockquote>` inside the comment body. As with the title text, we then use the `getText()`, `strip()` and `encode()` functions.
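
The effect of `clear()` can be checked on a small fragment. The markup below is hand-written for illustration; only the class name `d-comment__body` is taken from the code above.

```python
from bs4 import BeautifulSoup

# Hand-written comment body containing one blockquote followed by the comment's own text.
fragment = (
    '<div class="d-comment__body">'
    '<blockquote><p>Text inside the blockquote</p></blockquote>'
    '<p>My reply</p>'
    '</div>'
)

body = BeautifulSoup(fragment, 'html.parser').find(class_='d-comment__body')

text_before = body.get_text().strip()

# body.blockquote is the first <blockquote> in the body, or None if there is none
if body.blockquote is not None:
    body.blockquote.clear()  # removes the element's contents, keeps the empty tag

text_after = body.get_text().strip()

print(text_before)
print(text_after)
```

After the call, only the text outside the blockquote remains.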

Other useful attributes include the number of people that have recommended or agreed with the comment and, if the comment is a reply to someone else's, the ID of the original comment:

```python

    recommend = comment.find(class_='d-comment__recommend')
    if recommend is not None:
        commentObj['reccomend-count'] = recommend['data-recommend-count']
    else:
        commentObj['reccomend-count'] = ''
            
    replyto = comment.find(class_='d-comment__reply-to-author')

    if replyto is not None:
        link = replyto.parent['href'].replace('#comment-', '')
        commentObj['reply-to'] = link
    else:
        commentObj['reply-to'] = ''
    
```

Again we have used `if` statements to account for comments that are not replies and comments that cannot be recommended.
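
As a sketch of how these two lookups behave, the example below runs them on two invented comments: one that is a reply with a recommend count, and one that has neither. The class names and the `#comment-` prefix come from the code above; the nesting of the elements is an assumption made for illustration and may not match the real page exactly.

```python
from bs4 import BeautifulSoup

# Invented markup for a reply and for a top-level comment without a recommend button.
reply_html = """
<li class="d-comment" data-comment-id="102">
  <a href="#comment-101"><span class="d-comment__reply-to-author">alice</span></a>
  <span class="d-comment__recommend" data-recommend-count="7"></span>
</li>
"""
plain_html = '<li class="d-comment" data-comment-id="103"></li>'


def extra_fields(comment):
    """Return (recommend count, reply-to id), using '' where a field is absent."""
    recommend = comment.find(class_='d-comment__recommend')
    count = recommend['data-recommend-count'] if recommend is not None else ''

    replyto = comment.find(class_='d-comment__reply-to-author')
    if replyto is not None:
        # The enclosing <a> links to the comment being replied to
        reply_id = replyto.parent['href'].replace('#comment-', '')
    else:
        reply_id = ''
    return count, reply_id


reply_fields = extra_fields(BeautifulSoup(reply_html, 'html.parser').find('li'))
plain_fields = extra_fields(BeautifulSoup(plain_html, 'html.parser').find('li'))
print(reply_fields)
print(plain_fields)
```

Both values are strings, so the count needs converting with `int()` before it can be used in arithmetic.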

Finally, we should add our data for this comment to `comment_array` before moving on to the next comment:

```python

    comment_array.append(commentObj)

```

### Put it all together 

Below is a script that will extract the comments from your chosen url and save them to a `json` file. It includes functions for:
- getting the HTML document, `get_html()` 
- saving the comments, `save()`
- scraping the comments, `get_comments()`

```python
from bs4 import BeautifulSoup
import urllib
import json


def get_html(url):
    # Download the page and parse it into a BeautifulSoup object
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    return soup

def save(fname, data):
    # Write the data to <fname>.json; the with-block closes the file afterwards
    fileName = fname + '.json'
    with open(fileName, 'w') as outFile:
        json.dump(data, outFile)

def get_comments(url):

    soup = get_html(url)

    comment_array = []

    for comment in soup.select('li.d-comment'):

        commentObj = {}

        # Metadata stored as attributes of the <li> tag
        commentObj['id'] = comment['data-comment-id']
        commentObj['timestamp'] = comment['data-comment-timestamp']
        commentObj['author'] = comment['data-comment-author'].encode('utf-8')
        commentObj['author-id'] = comment['data-comment-author-id']

        # Comment text, with the contents of the first blockquote removed
        body = comment.find(class_='d-comment__body')

        if body.blockquote is not None:
            body.blockquote.clear()

        commentObj['text'] = body.getText().strip().encode('utf-8')

        # Number of recommendations, if the comment can be recommended
        recommend = comment.find(class_='d-comment__recommend')

        if recommend is not None:
            commentObj['reccomend-count'] = recommend['data-recommend-count']
        else:
            commentObj['reccomend-count'] = ''

        # ID of the comment being replied to, if this comment is a reply
        replyto = comment.find(class_='d-comment__reply-to-author')

        if replyto is not None:
            link = replyto.parent['href'].replace('#comment-', '')
            commentObj['reply-to'] = link
        else:
            commentObj['reply-to'] = ''

        comment_array.append(commentObj)

    return comment_array


url = 'https://www.theguardian.com/music/2019/feb/21/peter-tork-monkees-dies-aged-77'
soup = get_html(url)
comment_url = soup.find(class_='discussion__heading').find('a')['href']

comments = get_comments(comment_url)

articletitle = soup.find('h1', class_='content__headline').getText().strip().encode('utf-8')
fname = articletitle + '_comments'
save(fname, comments)
```
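
Because the output file is named after the article title, a title containing characters such as `/` or `?` can produce an invalid file name. The sketch below shows one possible way to guard against that; `safe_filename` is a hypothetical helper, the sample title is invented, and the `save()` shown here is a variant that closes the file with a `with` block.

```python
import json
import os
import re
import tempfile


def safe_filename(title):
    """Replace runs of characters that are awkward in file names with underscores."""
    cleaned = re.sub(r'[^A-Za-z0-9._-]+', '_', title)
    return cleaned.strip('_')


def save(fname, data):
    # The with-block closes the file even if json.dump raises
    with open(fname + '.json', 'w') as out_file:
        json.dump(data, out_file)


# Demonstration in a throwaway directory with an invented title and comment
name = safe_filename('AC/DC: live at Wembley?')
base = os.path.join(tempfile.mkdtemp(), name + '_comments')
save(base, [{'id': '1', 'text': 'hello'}])

with open(base + '.json') as in_file:
    loaded = json.load(in_file)

print(name)
print(loaded)
```

The pattern keeps letters, digits, dots, underscores and hyphens; anything else, including spaces, becomes an underscore.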

### Multiple pages of comments? 

If the article has more than one page of comments, the above script will only collect some of them. This is because `comment_url` only directs to the first page of comments, so comments on any additional pages will be missed.

The urls for additional pages of comments have the structure `comment_url` with the additional text `?page=n` added to the end. Here we use `n` to denote the page number.
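
As a small illustration of this url structure, the sketch below builds the page urls for a discussion. The page count of 3 is an assumed value for the example, and the import is written so that it works under both Python 2 and Python 3.

```python
try:
    from urllib.parse import urlencode  # Python 3 location
except ImportError:
    from urllib import urlencode  # Python 2 location

comment_url = 'https://www.theguardian.com/discussion/p/az4e5'
total_pages = 3  # assumed for illustration

# urlencode({'page': 2}) gives 'page=2'; the '?' separates it from the path
page_urls = [comment_url + '?' + urlencode({'page': n})
             for n in range(1, total_pages + 1)]

for page_url in page_urls:
    print(page_url)
```

Using `urlencode` rather than plain string formatting keeps the query string correctly escaped if more parameters are added later.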

In order to adapt the above code to deal with articles that have multiple pages of comments we need to count the number of pages there are:

```python

comment_soup = get_html(comment_url)

pagination_btn = comment_soup.find_all('a', class_='pagination__action')
lastpagination_btn = comment_soup.find('a', class_='pagination__action--last')

if lastpagination_btn is not None:
    total_pages = int(lastpagination_btn['data-page'])
elif pagination_btn:
    total_pages = int(pagination_btn[-1]['data-page'])
else:
    total_pages = 1
    
```

To count the number of pages, we find the elements that contain hyperlinks to the additional pages of comments. Because they are links they have `a` tags, so we can search for `a` tags with the appropriate `class`. 

Depending on the number of additional pages an article can have:
- a button to take you straight to the last page, `lastpagination_btn`
- buttons to take you to specific pages, `pagination_btn` 
- only one page of comments and no buttons.

Each of these situations is handled within the `if...elif...else` statement.
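
The three branches can be exercised without touching the network by running the same logic on small fragments. The markup below is invented for illustration and only mimics the class names and the `data-page` attribute used above.

```python
from bs4 import BeautifulSoup


def count_pages(soup):
    """Three-way page count: last-page button, numbered buttons, or a single page."""
    last_btn = soup.find('a', class_='pagination__action--last')
    buttons = soup.find_all('a', class_='pagination__action')
    if last_btn is not None:
        return int(last_btn['data-page'])
    if buttons:
        return int(buttons[-1]['data-page'])
    return 1


# Invented fragments, one per situation
many_pages = '''
<a class="pagination__action" data-page="2">2</a>
<a class="pagination__action" data-page="3">3</a>
<a class="pagination__action pagination__action--last" data-page="12">last</a>
'''
few_pages = '''
<a class="pagination__action" data-page="2">2</a>
<a class="pagination__action" data-page="3">3</a>
'''
one_page = '<div class="d-discussion"></div>'

counts = [count_pages(BeautifulSoup(html, 'html.parser'))
          for html in (many_pages, few_pages, one_page)]
print(counts)
```

Searching with `class_` matches an element if any one of its classes equals the given name, which is why the last-page button is found even though it carries two classes.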

Now that we know how many comment pages there are, we can extend the original script to loop through all pages and scrape all of the comments with two new functions:

```python
def count_pages(soup):

    pagination_btn = soup.find_all('a', class_='pagination__action')
    lastpagination_btn = soup.find('a', class_='pagination__action--last')

    if lastpagination_btn is not None:
        total_pages = int(lastpagination_btn['data-page'])
    elif pagination_btn:
        total_pages = int(pagination_btn[-1]['data-page'])
    else:
        total_pages = 1

    return total_pages

def all_comments(soup):

    try:
        comment_url = soup.find(class_='discussion__heading').find('a')['href']
    except AttributeError:
        # The article has no discussion section
        return 0

    comment_soup = get_html(comment_url)
    pages = count_pages(comment_soup)

    allcomments = []
    for n in range(1, pages+1):

        # Build the url of page n, e.g. <comment_url>?page=2
        page_url = comment_url + '?' + urllib.urlencode({'page': n})
        comments = get_comments(page_url)

        allcomments = allcomments + comments

    return allcomments


url = 'https://www.theguardian.com/music/2019/feb/21/peter-tork-monkees-dies-aged-77'
soup = get_html(url)
comments = all_comments(soup)

article_title = soup.find('h1', class_='content__headline').getText().strip().encode('utf-8')
fname = article_title + '_allcomments'
save(fname, comments)
```

### Scraping comments from multiple articles 

If you have used the Guardian's [API](https://open-platform.theguardian.com) to obtain article data, including urls, scraping comments from all of the articles you have obtained involves iterating over a list of the urls. For instructions on how to use the Guardian's API see our tutorial [Downloading Article Data from the Guardian API](Downloading_Article_Data_from_the_Guardian_API.ipynb). 

Assuming you have a text file with one url per line, we can again adapt the code to loop through the file:

```python

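url_fname = './Guardian-comment-scraper/urls-2018-01-01-2018-01-02.txt'

# Read one url per line; the with-block closes the file afterwards
with open(url_fname) as url_file:
    url_list = [line.rstrip('\n') for line in url_file]

output = []  # one list of comments per article

for url in url_list:

    soup = get_html(url)
    comments = all_comments(soup)

    # all_comments() returns 0 when the article has no comment section
    if comments == 0:
        continue
    output.append(comments)

fname = 'all_url_comments'
save(fname, output)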

```

Here we read the lines of the text file into a list, `url_list`. We then iterate over the urls in the list using a `for` loop. For each url we scrape the comments using the functions we have used previously.
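
Reading the url file can be tried in isolation. The sketch below writes a throwaway file with placeholder urls and reads it back; `read_urls` is a hypothetical helper which, unlike the list comprehension above, also skips blank lines.

```python
import os
import tempfile


def read_urls(path):
    """Return the non-empty lines of `path`, stripped of surrounding whitespace."""
    with open(path) as url_file:
        return [line.strip() for line in url_file if line.strip()]


# Throwaway file with placeholder urls and a blank line in the middle
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('https://example.com/article-1\n\nhttps://example.com/article-2\n')

urls = read_urls(tmp.name)
os.remove(tmp.name)

print(urls)
```

Skipping blank lines avoids trying to download an empty url if the file ends with an extra newline.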

**Note:** the `if` statement in our `for` loop is there to account for articles that don't have a comments section. You will notice that in the function `all_comments()`, we have used a `try` block for finding `comment_url`, which allows us to handle the error if there is no comments section. The error is handled in the `except` block, which makes the function return `0`. Note that `0` is not the same as `None`; it is used here simply as a marker value that the loop checks for with `comments==0`.
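
It is worth keeping in mind that `0`, `None` and an empty list are three different values in Python, even though all of them count as false in an `if` test. The sketch below illustrates the difference with an alternative convention, in which a scraper is assumed to return `None` for an article without a discussion and an empty list for a discussion with no comments yet. The helper name `keep_scraped` and the sample data are invented for this example.

```python
def keep_scraped(results):
    """Drop entries for articles where no discussion was found (None)."""
    return [comments for comments in results if comments is not None]


# Invented results: one article with a comment, one without a discussion,
# and one whose discussion exists but is empty.
results = [[{'id': '1'}], None, []]
kept = keep_scraped(results)

zero_equals_none = (0 == None)  # False: the two values are not equal
all_falsy = not any([0, None, []])  # True: each one is false in a boolean test

print(kept)
print(zero_equals_none)
print(all_falsy)
```

Testing with `is None` keeps the empty discussion in the output, whereas a plain truthiness test such as `if not comments` would discard it along with the missing one.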

An easier solution is to use the `commentable` field when accessing the Guardian's API: by adding this (boolean) parameter to the parameter dictionary with a value of 1, only articles that can be commented upon will be returned.
