## BBC News Page



### In this example we are going to have a look at the BBC News page with a view to extracting details of the top read and watched stories.


### Using the Browser Inspector

All modern browsers allow you to access the underlying HTML code which makes up a web page.

It is the job of the Browser to interpret the HTML and present the information it represents on the screen in a user friendly manner.

In order to web scrape, you do need some understanding of HTML, but not a great deal. Like most languages used in computing, it is easier to read than to write, and we only need to be able to read it a little: recognise the different components or tags, and know a bit about the syntax of tags.

A more important requirement is to be able to match what we see on the screen with the underlying HTML. A thorough understanding of the HTML and CSS code will allow you to do this, but there is a far easier way.

This involves using the developer tools found in all modern browsers, in particular the 'element inspector'. This allows you to select an element on the web page (a table, part of a table, a link, almost anything) and have the corresponding HTML code highlighted.

These developer tools are available in:

* MS Edge
* Chrome
* Firefox

and probably most other modern web browsers.

You will generally find the 'Developer Tools' somewhere in the browser's menu system. However, the shortcut Ctrl + Shift + C currently opens the element inspector in all three of the browsers listed above.

## Information that we might want to scrape and save

### Using the BBC News page

1. List of most read with links
2. List of most watched with links

### If we are going to accumulate data over time, we will also want a timestamp. We will generate one each time we run the code, and each run will append its new data to a file.

## The packages we need

In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs      # pip install beautifulsoup4 but import from bs4
import time
from datetime import datetime
import csv


The `get` method from requests only needs to be given a string representing a URL.

Quite often, if you need to provide multiple parameters, you would build the URL string up first and then call `get` with it.
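
As a minimal sketch of building up a URL with parameters, the standard library's `urllib.parse.urlencode` can assemble the query string. The base URL and the parameter names below are hypothetical and chosen only for illustration; they are not used elsewhere in this notebook.

```python
from urllib.parse import urlencode

# Hypothetical base URL and parameters, for illustration only
base_url = 'https://www.example.com/search'
params = {'q': 'web scraping', 'page': 2}

# urlencode escapes each value and joins the pairs with '&'
full_url = f'{base_url}?{urlencode(params)}'
print(full_url)
```

The resulting string could then be passed to `requests.get`. Alternatively, `requests.get` accepts a `params` dictionary and builds the query string for you.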

## Quick example to show how BeautifulSoup works

In [None]:
r = requests.get('https://www.bbc.co.uk')
#print(r.text)


## We can make the output look a bit better

In [None]:
soup = bs(r.text, 'html.parser')   # name the parser explicitly to avoid a warning
prettyHTML = soup.prettify() 
#print(prettyHTML)

## We can save the output to a file in this formatted way

In [None]:
with open('page.html', 'w', encoding='utf-8') as fw :
    fw.write(prettyHTML)


## `find` and `find_all` allow you to search for tags

### `get` allows you to select the value of an attribute within a tag

### Typically we will be finding tags and then extracting values from them.

### What we need to ensure when doing this is that we have selected the correct tags. In any given webpage some of the common tags will occur many times as we will see.

### We can do this either by using a chain of tags which is unique and ends in the tag we want, or by making use of the attributes and values within a tag to find a unique combination which identifies the specific tag we want.

### This is why we need to inspect the HTML in order to identify these unique combinations.

### In HTML, tags are written in a specific way
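
A tag has a name, optional attributes written as `name="value"` pairs inside the opening tag, and its content between the opening and closing tags. The sketch below uses a made-up fragment of HTML (not taken from the BBC page) to show how BeautifulSoup exposes each of these parts. The names `demo_html` and `demo_soup` are used here only so that the notebook's own `soup` variable is left alone.

```python
from bs4 import BeautifulSoup as bs

# A made-up fragment for illustration: one <a> tag with two attributes
demo_html = '<p>Read <a class="story" href="/news/example">the full story</a></p>'
demo_soup = bs(demo_html, 'html.parser')

a_tag = demo_soup.find('a')
print(a_tag.name)           # the tag name
print(a_tag.get('href'))    # the value of the href attribute
print(a_tag.get('class'))   # class can hold several values, so a list is returned
print(a_tag.text)           # the text between <a> and </a>
```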

## We can find all of the images within the Web page

In [None]:
for imagelink in soup.find_all('img'):
    url = imagelink.get('src')
    print(url)

## We can find all of the links within the Web page

In [None]:
for url in soup.find_all('a') :
    print(f" {url.text} --> {url.get('href')}")

## Changing to the BBC News page

In [None]:
r = requests.get('https://www.bbc.co.uk/news')
soup = bs(r.text, 'html.parser')
prettyHTML = soup.prettify() 
with open('news_page.html', 'w', encoding='utf-8') as fw :
    fw.write(prettyHTML)

## What about list items?

In [None]:
for list_items in soup.find_all('li'):
    #print(f'{list_items} -->  ')
    print(f'{list_items.text}')

## I know from using the developer tools that I am interested in 'li' items with a 'data-entityid' attribute, so just list them

In [None]:
for list_items in soup.find_all('li'):
    id = list_items.get('data-entityid')
    if id is not None :
        print(id)


### I want to further refine this to get rid of the social media items


In [None]:
for list_items in soup.find_all('li'):
    id = list_items.get('data-entityid')
    if id is not None :
        if id[0:4] == 'most' :
            print(id)
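
The same filter can also be pushed into `find_all` itself, because BeautifulSoup accepts a compiled regular expression as an attribute value. This is only a sketch on a small made-up fragment: the `data-entityid` values imitate those on the page, but the `social-share-1` value in particular is invented for the example.

```python
import re
from bs4 import BeautifulSoup as bs

# Made-up fragment; the attribute values are illustrative assumptions
demo_html = '''
<ul>
  <li data-entityid="most-popular-read-1">First story</li>
  <li data-entityid="social-share-1">Share</li>
  <li>No id at all</li>
  <li data-entityid="most-popular-read-2">Second story</li>
</ul>
'''
demo_soup = bs(demo_html, 'html.parser')

# Keep only <li> tags whose data-entityid starts with 'most'
most_items = demo_soup.find_all('li', attrs={'data-entityid': re.compile(r'^most')})
most_ids = [li.get('data-entityid') for li in most_items]
print(most_ids)
```

Tags without the attribute are skipped automatically, so the separate `is not None` test is not needed in this form.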

## But these do not appear to be simple list items

In [None]:
for list_items in soup.find_all('li'):
    id = list_items.get('data-entityid')
    if id is not None :
        if id[0:4] == 'most' :
            print(f'{list_items.text} -------> {list_items}')

## We can use our saved file of HTML to help us find what we want. 

1. Load the file into a text editor like Notepad++
2. Search for the string 'data-entityid' and check you are in an 'li' item
3. Within the structure of the 'li' we want to find the tags associated with the displayed text and the URL of the link


```html
          <li class="gel-layout__item gs-o-faux-block-link gs-u-mb+ gel-1/2@m gel-1/5@xxl gs-u-float-left@m gs-u-clear-left@m gs-u-float-none@xxl" data-entityid="most-popular-read-1">
           <span class="gs-o-media">
            <span class="nw-c-most-read__rank gs-o-media__img gel-canon gel-1/12@xs gel-1/8@m gel-1/10@l gel-2/12@xxl gs-u-align-center">
             1
            </span>
            <div class="gs-o-media__body">
             <a class="gs-c-promo-heading nw-o-link gs-o-bullet__text gs-o-faux-block-link__overlay-link gel-pica-bold gs-u-pl-@xs" href="/news/business-57712618">
              <span class="gs-c-promo-heading__title gel-pica-bold">
               John Lewis plans to build 10,000 rental homes
              </span>
             </a>
            </div>
           </span>
          </li>
```

4. The URL has to be specified as the href value in an `<a>` tag, so that is quite straightforward. But notice that it is a relative address. We will need to fix that later.
5. The displayed text is within `<span></span>` tags, of which there are several, so we need to find something that makes this `<span>` unique. The answer appears to be the class value.

In [None]:
for list_items in soup.find_all('li'):
    id = list_items.get('data-entityid')
    if id is not None :
        if id[0:4] == 'most' :
            print(id)
            for anc in list_items.find('div') :
                print(anc.get('href'))
            for text in list_items.find('span',{'class' : 'gs-c-promo-heading__title'} ) :
                print(text)
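
The relative address noted in step 4 can be turned into a full URL in more than one way. One option, shown here as a sketch rather than as the method this notebook goes on to use, is `urljoin` from the standard library. The names `page_url`, `relative_href` and `absolute_url` are introduced just for this example.

```python
from urllib.parse import urljoin

page_url = 'https://www.bbc.co.uk/news'       # the page we requested
relative_href = '/news/business-57712618'     # href from the sample HTML above

# urljoin resolves the relative href against the page it came from
absolute_url = urljoin(page_url, relative_href)
print(absolute_url)

# An href that is already absolute is returned unchanged
already_absolute = urljoin(page_url, 'https://www.example.com/other')
print(already_absolute)
```

Compared with joining strings by hand, this copes with pages that mix relative and absolute links.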



## Now put it all together

                

### First create our csv file with a header record

#### We only need to run this once

In [None]:
# The output file

with open('BBC_top_hits.csv', 'w', encoding = 'utf-8', newline='') as fw:
    # newline='' lets the csv module control the line endings itself
    outfile = csv.writer(fw, delimiter=',', lineterminator='\n')
    outfile.writerow(["Date", "Item_pos", "Title", "Link"])
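
Because opening a file in `'w'` mode empties it, re-running the cell above by accident would discard any data already collected. A minimal sketch of one way to guard against that is to write the header only when the file does not exist yet. The helper name `ensure_csv_with_header` and the file name `BBC_top_hits_demo.csv` are hypothetical and used only for this illustration.

```python
import csv
import os

def ensure_csv_with_header(path, header):
    """Create the CSV with a header row only if it does not exist yet."""
    if os.path.exists(path):
        return False              # leave any existing data untouched
    with open(path, 'w', encoding='utf-8', newline='') as fw:
        csv.writer(fw).writerow(header)
    return True

# Hypothetical demo file, so the real output file is not affected
created = ensure_csv_with_header('BBC_top_hits_demo.csv',
                                 ["Date", "Item_pos", "Title", "Link"])
print(created)
```

Calling the function a second time returns `False` and leaves the file as it is.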


### A note about getting our timestamp

#### We will add a timestamp to each record we write to the file. This will allow us to show how the favourites change over time. To do this we use the `now()` method of `datetime` (imported at the beginning of the notebook). The `now()` method returns a datetime object, which you can print but which is not a string. To get a string in the format we want, we call its `strftime()` method and pass it that format.

In [None]:
# get current timestamp for file
now = datetime.now()
print(now)
current_time = now.strftime("%Y-%m-%dT%H:%M:%S")
print(current_time)
print(current_time[0:10])   # slice the formatted string (a datetime object cannot be sliced) to get just the date

In [None]:
## set up 

url_prefix = 'https://www.bbc.co.uk'

r = requests.get('https://www.bbc.co.uk/news')
#print(r.text)

soup = bs(r.text, 'html.parser')
#prettyHTML = soup.prettify() 
#print(prettyHTML)

# The output file
with open('BBC_top_hits.csv', 'a', encoding = 'utf-8', newline='') as fw:
    outfile = csv.writer(fw, delimiter=',', lineterminator='\n')

    # get current timestamp for file
    now = datetime.now()
    current_time = now.strftime("%Y-%m-%dT%H:%M:%S")

    for list_items in soup.find_all('li'):
        id = list_items.get('data-entityid')
        if id is not None :
            if id[0:4] == 'most' :
                #print(id)
                item_no = id
                for anc in list_items.find('div') :
                    #print(anc.get('href'))
                    href = anc.get('href')
                    full_url = url_prefix + href
                    
                for text in list_items.find('span',{'class' : 'gs-c-promo-heading__title'} ) :
                    #print(text)
                    print(f'{current_time},{item_no},{text}, {full_url}')
                    outfile.writerow([current_time,item_no,text, full_url])


### The code above works so we could stop at this point, but as we have isolated the text and the URL that we want uniquely, we can rewrite the code without the `for` loops.
### We do however need to explicitly extract the `a` tag from the `div` tag.

In [None]:
## set up 

url_prefix = 'https://www.bbc.co.uk'

r = requests.get('https://www.bbc.co.uk/news')
#print(r.text)

soup = bs(r.text, 'html.parser')
#prettyHTML = soup.prettify() 
#print(prettyHTML)

# The output file
with open('BBC_top_hits.csv', 'a', encoding = 'utf-8', newline='') as fw:
    outfile = csv.writer(fw, delimiter=',', lineterminator='\n')

    # get current timestamp for file
    now = datetime.now()
    current_time = now.strftime("%Y-%m-%dT%H:%M:%S")

    for list_items in soup.find_all('li'):
        id = list_items.get('data-entityid')
        if id is not None :
            if id[0:4] == 'most' :
                item_no = id
                
                anc = list_items.find('div')
                a_tag = anc.find('a')
                href = a_tag.get('href')
                full_url = url_prefix + href
                    
                span = list_items.find('span',{'class' : 'gs-c-promo-heading__title'} )
                text = span.text

                print(f'{current_time},{item_no},{text}, {full_url}')
                outfile.writerow([current_time,item_no,text, full_url])
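
If the page layout changes, `find` returns `None` and the calls that follow it fail with an `AttributeError`. The sketch below shows one possible way to make the extraction step tolerant of that, tested against a cut-down copy of the sample `<li>` shown earlier plus one deliberately malformed item. The helper `extract_item` is hypothetical and is not part of the notebook's code.

```python
from bs4 import BeautifulSoup as bs

def extract_item(li, url_prefix='https://www.bbc.co.uk'):
    """Return (title, full_url) for one <li>, or None if a part is missing."""
    a_tag = li.find('a')
    span = li.find('span', {'class': 'gs-c-promo-heading__title'})
    if a_tag is None or span is None or a_tag.get('href') is None:
        return None
    return span.text.strip(), url_prefix + a_tag.get('href')

# Cut-down sample item, followed by an invented item with no link
sample = '''
<li data-entityid="most-popular-read-1">
  <div><a href="/news/business-57712618">
    <span class="gs-c-promo-heading__title">John Lewis plans to build 10,000 rental homes</span>
  </a></div>
</li>
<li data-entityid="most-popular-read-2"><div>No link here</div></li>
'''
sample_soup = bs(sample, 'html.parser')
results = [extract_item(li) for li in sample_soup.find_all('li')]
print(results)
```

Items that come back as `None` can simply be skipped before writing to the CSV file.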
