# Practicing web scraping

This workbook is for me to practice using Jupyter and to see whether I can use Python to understand web scraping. Maybe, maybe not, we shall see...

Initial thoughts are we need to do the following:

- Load the correct Python libraries
- Choose a target website
- Practice opening webpages in the browser
- Practice reading them in code
- Practice parsing them in code
- Output some data to a usable text file or CSV

Need a good example to do this. Will try with a BBC Sport article, originally this one: https://www.bbc.com/sport/football/44659159

But then trying this one using Beautiful Soup: https://www.bbc.com/sport/football/44665445


## Open webpage in the browser

To do this I need to load in the webbrowser module and open the page.


In [1]:
import webbrowser
webpage = 'https://www.bbc.com/sport/football/44659159'
if webbrowser.open(webpage):
    print("It opened a browser and is awesome!")
else:
    print("Big fat fail!")

It opened a browser and is awesome!


## Read in a simple webpage...

Need to load in the requests module, and I'm away...


In [2]:
import requests
res = requests.get(webpage)
type(res)

requests.models.Response

In [3]:
res.status_code == requests.codes.ok

True

In [4]:
len(res.text)

160877

In [5]:
print(res.text[:250])

    <!DOCTYPE html>  <html id="sport-html" class="b-reith-sans-font no-js no-enhanced no-touch no-font-face no-av no-app no-csscolumns no-css-transitions no-css-2d-transforms no-flexbox no-svg"  lang="en"> <head> <title>Marouane Fellaini: Manchester 


## Test for correct read

The above should work, and show some of the page's HTML.

A programmatic check is better, though: the raise_for_status() method raises an exception if the request failed, as in the example given below.


In [6]:
res.raise_for_status()

In [7]:
res_test = requests.get('http://inventwithpython.com/page_that_does_not_exist')
try:
    res_test.raise_for_status()
except Exception as exc:
    print('There was a problem: %s' % (exc))

There was a problem: 404 Client Error: Not Found for url: http://inventwithpython.com/page_that_does_not_exist
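
As far as I know, raise_for_status() only raises for 4xx (client error) and 5xx (server error) status codes and stays quiet otherwise. Here is a minimal sketch of that rule using only the standard library. The helper names `is_error_status` and `describe_status` are made up for this note and are not part of requests.

```python
from http import HTTPStatus


def is_error_status(code):
    """Return True for the 4xx/5xx codes that raise_for_status() treats as errors."""
    return 400 <= code < 600


def describe_status(code):
    """Give a short human-readable label such as '404 Not Found'."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Not a status code the standard library knows about.
        phrase = 'Unknown'
    return '%d %s' % (code, phrase)


for code in (200, 404, 503):
    verdict = 'error' if is_error_status(code) else 'fine'
    print(describe_status(code), '->', verdict)
```

So the `res.status_code == requests.codes.ok` test earlier is stricter than raise_for_status(): it only accepts 200, whereas raise_for_status() also lets other non-error codes through.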


## Parse Text Required

In this case, let's try to read the headline and body of the article into a new text file, excluding any formatting, links, etc.


In [8]:
saved_str = ''
if res.status_code == requests.codes.ok:
    
    print('okay!')

okay!


In [9]:
# Find Header: the text between the end of the opening <h1 ...> tag
# and the next '<'.
position = res.text.find('<h1')
header_start = res.text.find('>', position) + 1
header_end = res.text.find('<', header_start)
header_text = res.text[header_start:header_end]
print(header_text)

Marouane Fellaini: Manchester United midfielder signs new two-year deal
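
One thing to watch: str.find() returns -1 when the marker is missing, and the slicing above would then give back an empty or wrong string rather than fail. Below is a small sketch of the same find-and-slice idea with that case guarded. `text_after_marker` is a hypothetical helper name and the sample HTML is invented for the demonstration.

```python
def text_after_marker(html, marker):
    """Return the text between the '>' that closes the tag containing
    `marker` and the next '<', or None if any of the pieces is missing."""
    position = html.find(marker)
    if position == -1:
        return None
    close = html.find('>', position)
    if close == -1:
        return None
    start = close + 1
    end = html.find('<', start)
    if end == -1:
        return None
    return html[start:end].strip()


sample = '<html><h1 class="story-headline">Example headline</h1></html>'
print(text_after_marker(sample, '<h1'))
print(text_after_marker(sample, '<h2'))
```

Returning None makes the "not found" case something the calling code has to deal with explicitly.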


In [10]:
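# Find Body: starts after the tag carrying the introduction class and
# runs up to the '</div> <script>' that follows the story on this page.
position = res.text.find('sp-story-body__introduction')
body_start = res.text.find('>', position) + 1
body_end = res.text.find('</div> <script>', body_start)
body_text = res.text[body_start:body_end]
print(body_text)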

Belgium midfielder Marouane Fellaini has thanked Jose Mourinho for his "faith" after signing a new two-year contract at Manchester United.</p><p>The 30-year-old, whose previous deal was due to expire on 30 June, had been expected to leave Old Trafford.</p><p>United manager <a href="https://www.bbc.co.uk/sport/football/43940979">Jose Mourinho said</a> he wanted to keep Fellaini and the club spent the past few months trying to persuade him to stay.</p><div id="bbccom_mpu_1_2_3" class="bbccom_slot" aria-hidden="true">
    <div class="bbccom_advert">
        <script type="text/javascript">
            (function() {
                if (window.bbcdotcom && bbcdotcom.adverts && bbcdotcom.adverts.slotAsync) {
                    bbcdotcom.adverts.slotAsync("mpu", [1,2,3]);
                }
            })();
        </script>
    </div>
</div><p>"I made this decision because I am very happy here," said Fellaini.</p><p>"I feel like this team, under Jose, still has a lot we want to achieve. I wo

## Clean up the body text

Remove the bits of code between <div\> and </div\> tags.  Also deal with <p\> and <a\> tags.


In [11]:
# Remove any <div> blocks, working backwards from the last one because
# they can be nested.
position = body_text.rfind('<div')
while position > -1:
    # Cut up to the matching </div>, or just the opening tag if there is none.
    position_end = max(body_text.find('</div>', position) + 6,
                       body_text.find('>', position) + 1)
    body_text = body_text[:position] + body_text[position_end:]
    position = body_text.rfind('<div')

print(body_text)

Belgium midfielder Marouane Fellaini has thanked Jose Mourinho for his "faith" after signing a new two-year contract at Manchester United.</p><p>The 30-year-old, whose previous deal was due to expire on 30 June, had been expected to leave Old Trafford.</p><p>United manager <a href="https://www.bbc.co.uk/sport/football/43940979">Jose Mourinho said</a> he wanted to keep Fellaini and the club spent the past few months trying to persuade him to stay.</p><p>"I made this decision because I am very happy here," said Fellaini.</p><p>"I feel like this team, under Jose, still has a lot we want to achieve. I would like to say a special thank you to Jose for the faith he has always shown in me."</p><p>Fellaini has scored 20 goals in 156 appearances for United since a &pound;27m move from Everton in 2013.</p><p>His new contract gives the option for a further season.</p><p>"I am very happy Marouane is staying with us," said Mourinho. "I always believed in his desire to stay with the club and I am de

In [12]:
# Replace each <p> and </p> with a new line.
body_text = body_text.replace('<p>', '\n').replace('</p>', '\n')
print(body_text)

Belgium midfielder Marouane Fellaini has thanked Jose Mourinho for his "faith" after signing a new two-year contract at Manchester United.

The 30-year-old, whose previous deal was due to expire on 30 June, had been expected to leave Old Trafford.

United manager <a href="https://www.bbc.co.uk/sport/football/43940979">Jose Mourinho said</a> he wanted to keep Fellaini and the club spent the past few months trying to persuade him to stay.

"I made this decision because I am very happy here," said Fellaini.

"I feel like this team, under Jose, still has a lot we want to achieve. I would like to say a special thank you to Jose for the faith he has always shown in me."

Fellaini has scored 20 goals in 156 appearances for United since a &pound;27m move from Everton in 2013.

His new contract gives the option for a further season.

"I am very happy Marouane is staying with us," said Mourinho. "I always believed in his desire to stay with the club and I am delighted that he has signed a new co

In [13]:
# Remove <a> and </a> tags, recording the embedded links.
link_list = []
position = body_text.find('<a ')
while position > -1:
    # href_char is the quote character that opens the href value;
    # the link runs from just after it to the matching closing quote.
    href_start = body_text.find('href=', position) + 5
    href_char = body_text[href_start:href_start+1]
    href_end = body_text.find(href_char, href_start + 1)
    link_list.append(body_text[href_start+1:href_end])
    print(link_list[-1])

    # Cut the opening <a ...> tag out of the text.
    position_end = body_text.find('>', position) + 1
    body_text = body_text[:position] + body_text[position_end:]
    position = body_text.find('<a ')

body_text = body_text.replace('</a>', '').strip()
print(body_text)

https://www.bbc.co.uk/sport/football/43940979
Belgium midfielder Marouane Fellaini has thanked Jose Mourinho for his "faith" after signing a new two-year contract at Manchester United.

The 30-year-old, whose previous deal was due to expire on 30 June, had been expected to leave Old Trafford.

United manager Jose Mourinho said he wanted to keep Fellaini and the club spent the past few months trying to persuade him to stay.

"I made this decision because I am very happy here," said Fellaini.

"I feel like this team, under Jose, still has a lot we want to achieve. I would like to say a special thank you to Jose for the faith he has always shown in me."

Fellaini has scored 20 goals in 156 appearances for United since a &pound;27m move from Everton in 2013.

His new contract gives the option for a further season.

"I am very happy Marouane is staying with us," said Mourinho. "I always believed in his desire to stay with the club and I am delighted that he has signed a new contract."

Fell
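
The string slicing above does the job for this page, but it depends on the exact markup. As a rough alternative, here is a sketch using html.parser from the standard library to collect the visible text and the link targets in one pass, skipping anything inside <div\> or <script\>. The class name `StoryTextParser` and the sample HTML are my own inventions, and this is only one way to do it.

```python
from html.parser import HTMLParser


class StoryTextParser(HTMLParser):
    """Collect visible text and link targets, ignoring <script> and <div> content."""

    SKIPPED = ('script', 'div')

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # greater than 0 while inside a skipped element
        self.chunks = []      # pieces of visible text
        self.links = []       # href values of <a> tags

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            if tag == 'a':
                href = dict(attrs).get('href')
                if href:
                    self.links.append(href)
            elif tag == 'p':
                self.chunks.append('\n')

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == 'p' and self.skip_depth == 0:
            self.chunks.append('\n')

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)

    def text(self):
        return ''.join(self.chunks).strip()


sample = ('Intro text.</p><p>United manager <a href="https://example.com/story">'
          'Jose Mourinho said</a> he wanted to keep him.</p>'
          '<div class="advert"><div><script>var x = "<p>";</script></div></div>'
          '<p>A &pound;27m move.</p>')
parser = StoryTextParser()
parser.feed(sample)
parser.close()
print(parser.text())
print(parser.links)
```

Counting depth rather than searching for the matching </div\> is what lets nested blocks be skipped without working backwards through the string.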

## Save the output to a text file

Need to create a file and save the output to it, including web address, clean text and links.
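
Before saving, note that the cleaned text still contains HTML entities such as `&pound;`. The standard library's html.unescape() turns these back into real characters. A small sketch on one sentence from the article (the variable names `raw` and `clean` are just for illustration):

```python
import html

raw = ('Fellaini has scored 20 goals in 156 appearances for United '
       'since a &pound;27m move from Everton in 2013.')
clean = html.unescape(raw)
print(clean)
```

The same call could be applied to the whole body text before writing it out. Since the result can contain non-ASCII characters like the pound sign, passing an explicit encoding such as `encoding='utf-8'` to open() is probably wise.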

In [14]:
# The with block closes the file automatically, even if a write fails.
with open('Fellaini Story.txt', 'w') as playFile:
    playFile.write(header_text + "\nFrom: " + webpage + "\n\n")
    playFile.write(body_text)
    if link_list:
        playFile.write("\n\n\n" + "Links from text:" + "\n")
    for link in link_list:
        playFile.write(link + "\n")
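
The initial list also mentioned CSV as an output option. A possible sketch with the standard csv module follows, writing one row per embedded link. The helper names, the column names and the `stories.csv` file name in the comment are all assumptions of mine, and the demo writes to an in-memory buffer rather than to disk.

```python
import csv
import io


def story_rows(headline, url, links):
    """Build one CSV row per embedded link (or a single row if there are none)."""
    if not links:
        return [[headline, url, '']]
    return [[headline, url, link] for link in links]


def write_story_csv(handle, headline, url, links):
    """Write a header row plus the story rows to an open text file handle."""
    writer = csv.writer(handle)
    writer.writerow(['headline', 'source_url', 'link'])
    writer.writerows(story_rows(headline, url, links))


# Demo on an in-memory buffer; for a real file something like
# open('stories.csv', 'w', newline='', encoding='utf-8') could be used instead.
buffer = io.StringIO()
write_story_csv(buffer,
                'Marouane Fellaini: Manchester United midfielder signs new two-year deal',
                'https://www.bbc.com/sport/football/44659159',
                ['https://www.bbc.co.uk/sport/football/43940979'])
print(buffer.getvalue())
```

The csv module takes care of quoting, so a headline containing a comma would still end up in a single column.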

# Using Beautiful Soup.

The initial practice above worked well. Now let's test out Beautiful Soup on this page and a couple of more complicated webpages. This list will be our guinea pigs:
- https://www.bbc.com/sport/football/44659159
- https://www.bbc.com/sport/football/44665445
- https://www.bbc.com/sport/football/45053886


Some work to do for fixtures data...
- https://www.bbc.com/sport/football/world-cup/scores-fixtures/2018-07-07

## Load in Beautiful Soup and create initial conditions

Load in the Beautiful Soup module and create an initial list of web addresses to use, including one which doesn't exist to test the exception code.


In [15]:
import bs4, requests

initial_link_list = ["https://www.bbc.com/sport/football/44659159", \
                     "https://www.bbc.com/sport/football/44665445", \
                     "http://inventwithpython.com/page_that_does_not_exist", \
                     "https://www.bbc.com/sport/football/45053886"]

In [16]:
wpi=1
for webpage in initial_link_list:
    # Get the webpage and check it has been acquired.  If not, skip to next item in the list.
    res = requests.get(webpage)
    try:
        res.raise_for_status()
    except Exception as exc:
        print('There was a problem: %s' % (exc))
        continue

There was a problem: 404 Client Error: Not Found for url: http://inventwithpython.com/page_that_does_not_exist


## With a single webpage

It's all well and good trying to do this in a for loop, but let's use a single webpage for now... This one: https://www.bbc.com/sport/football/44659159

In [17]:
webpage = initial_link_list[0]
res = requests.get(webpage)
page_soup = bs4.BeautifulSoup(res.text, "html.parser")
print(type(page_soup))

<class 'bs4.BeautifulSoup'>


### Header

Acquire the header information for the page

In [18]:
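# The next line doesn't work because square brackets in a CSS selector are for
# attributes, not classes (and the class is 'story-headline', with a hyphen).
# The tag-plus-class form would be 'h1.story-headline'.
# headers = page_soup.select('h1[.story_headline]')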

headers = page_soup.select('.story-headline')
print(len(headers))
for header in headers:
    print(header)
    print(header.attrs)
    print(header.getText())

1
<h1 class="story-headline gel-trafalgar-bold ">Marouane Fellaini: Manchester United midfielder signs new two-year deal</h1>
{'class': ['story-headline', 'gel-trafalgar-bold', '']}
Marouane Fellaini: Manchester United midfielder signs new two-year deal
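
To compare the selector forms, here is a small sketch run on a made-up HTML snippet rather than the live page. It assumes the bs4 package is installed, and the second heading is invented just to show the difference between the selectors.

```python
import bs4

html_doc = ('<h1 class="story-headline gel-trafalgar-bold ">Example headline</h1>'
            '<h2 class="story-headline">Not the main headline</h2>')
soup = bs4.BeautifulSoup(html_doc, 'html.parser')

# Any tag carrying the class.
by_class = soup.select('.story-headline')
# Only <h1> tags carrying the class.
by_tag_and_class = soup.select('h1.story-headline')
# Attribute form: square brackets test an attribute, here "class contains this word".
by_attribute = soup.select('h1[class~="story-headline"]')

print(len(by_class), len(by_tag_and_class), len(by_attribute))
print(by_tag_and_class[0].get_text())
```

On the BBC page there was only one element with the class, so the plain `.story-headline` selector was enough.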


### Body text and links

Next pull out the body text and the links as we did above.

In [19]:
# help(page_soup)
intropara = page_soup.select('.sp-story-body__introduction')
print(len(intropara))
print(intropara[0].getText())

1
Belgium midfielder Marouane Fellaini has thanked Jose Mourinho for his "faith" after signing a new two-year contract at Manchester United.


In [20]:
paragraphs = page_soup.select('div.story-body p')
print(len(paragraphs))
i=0
for paragraph in paragraphs:
    if paragraph.getText()==intropara[0].getText():
        print(i)
        break
    else: 
        i = i+1
else:
    print("\"" + intropara[0].getText() + "\" not found in paragraphs")
print(i)
opening_para = i

10
0
0
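
One wrinkle with the loop above: if the introduction is not found, `i` ends up equal to the number of paragraphs and still gets stored in opening_para. A sketch of the same search as a function that returns None instead, so the not-found case is explicit. `find_opening_index` is a hypothetical name, and it works on plain strings so it can be tried without a web page.

```python
def find_opening_index(paragraph_texts, intro_text):
    """Return the index of the first paragraph equal to intro_text, or None."""
    for index, text in enumerate(paragraph_texts):
        if text == intro_text:
            return index
    return None


texts = ['Intro paragraph.', 'Second paragraph.', 'Third paragraph.']
print(find_opening_index(texts, 'Intro paragraph.'))
print(find_opening_index(texts, 'Missing paragraph.'))
```

With the soup objects, the list of texts could be built with something like `[p.getText() for p in paragraphs]`.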


In [21]:
# Open the output file and write the headline, source and paragraphs.
# The with block closes the file automatically.
with open('Example Output File ' + str(wpi) + '.txt', 'w') as playfile:
    playfile.write(headers[0].getText() + "\nFrom: " + webpage + "\n\n")
    for i in range(opening_para, len(paragraphs)):
        playfile.write(paragraphs[i].getText() + "\n\n")

print("It ran something...")

res.close()
wpi = wpi + 1

It ran something...


## Create Loop to do this for multiple pages

Amalgamating all the above code gives:

In [22]:
# Initial Conditions
import bs4, requests

initial_link_list = ["https://www.bbc.com/sport/football/44659159", \
                     "https://www.bbc.com/sport/football/44665445", \
                     "http://inventwithpython.com/page_that_does_not_exist", \
                     "https://www.bbc.com/sport/football/45053886"]
wpi = 1

# Loop through webpages
for webpage in initial_link_list:
    print('Start of ' + webpage)
    res = requests.get(webpage)

    # Test webpage request.
    try:
        res.raise_for_status()
    except Exception as exc:
        print('There was a problem: %s' % (exc))
        continue
    
    # Read in webpage and relevant sections
    page_soup = bs4.BeautifulSoup(res.text, "html.parser")
    headers = page_soup.select('.story-headline')
    paragraphs = page_soup.select('div.story-body p')
    intropara = page_soup.select('.sp-story-body__introduction')

    # Find initial paragraph and move on to next webpage if not possible
    print(intropara[0].getText())
    opening_para = -1
    i=0
    for paragraph in paragraphs:
        if paragraph.getText()==intropara[0].getText():
            opening_para = i
            break
        else: 
            i = i+1
    else:
        print("\"" + intropara[0].getText() + "\" not found in paragraphs")
        continue
    
    # Open and output to a file    
    playfile = open('Example Output v1 File ' + str(wpi) + '.txt', 'w')
    
    # CREATE OUTPUT HERE....
    playfile.write(headers[0].getText() + "\nFrom: " + webpage + "\n\n")
    for i in range(opening_para,len(paragraphs)):
        if paragraphs[i].attrs != {}:
            print("Paragraph " + str(i))
            if 'class' in paragraphs[i].attrs and \
                paragraphs[i].attrs["class"] == ["sp-media-asset__smp-message"]:
                continue
            else:
                print("Additional attributes found!")
        playfile.write("Paragraph "+str(i) + "\n" + paragraphs[i].getText() + "\n\n")

    print("File created:  " + 'Example Output v1 File ' + str(wpi) + '.txt')
    
    playfile.close()
    res.close()
    wpi=wpi+1

Start of https://www.bbc.com/sport/football/44659159
Belgium midfielder Marouane Fellaini has thanked Jose Mourinho for his "faith" after signing a new two-year contract at Manchester United.
Paragraph 0
Additional attributes found!
File created:  Example Output v1 File 1.txt
Start of https://www.bbc.com/sport/football/44665445
England reached the World Cup semi-final for the first time since Italia 90 as Harry Maguire and Dele Alli struck either side of the interval to beat Sweden in Samara.
Paragraph 1
Additional attributes found!
Paragraph 9
Paragraph 17
Paragraph 27
Paragraph 33
File created:  Example Output v1 File 2.txt
Start of http://inventwithpython.com/page_that_does_not_exist
There was a problem: 404 Client Error: Not Found for url: http://inventwithpython.com/page_that_does_not_exist
Start of https://www.bbc.com/sport/football/45053886
Manchester United manager Jose Mourinho praised "monster" Paul Pogba after he captained the side and scored the opening goal in the win over

## Repeat loop for multiple pages using Beautiful Soup navigation

This improves the above code by using Beautiful Soup's own methods to navigate the tree:

In [23]:
# Initial Conditions
import bs4, requests

initial_link_list = ["https://www.bbc.com/sport/football/44659159", \
                     "https://www.bbc.com/sport/football/44665445", \
                     "http://inventwithpython.com/page_that_does_not_exist", \
                     "https://www.bbc.com/sport/football/45053886"]
wpi = 1

# Loop through webpages
for webpage in initial_link_list:
    print('Start of ' + webpage)
    res = requests.get(webpage)

    # Test webpage request.
    try:
        res.raise_for_status()
    except Exception as exc:
        print('There was a problem: %s' % (exc))
        continue
    
    # Read in webpage and relevant sections
    page_soup = bs4.BeautifulSoup(res.text, "html.parser")
    headers = page_soup.select('.story-headline')
    intropara = page_soup.select('.sp-story-body__introduction')
    paragraphs = intropara[0].next_siblings
    
    # Print initial paragraph
    print(intropara[0].getText())
    opening_para = -1
    i=0
    
    # Open and output to a file    
    playfile = open('Example Output v2 File ' + str(wpi) + '.txt', 'w')
    
    # CREATE OUTPUT HERE....
    playfile.write(headers[0].getText() + "\nFrom: " + webpage + "\n\n")
    playfile.write("Paragraph "+str(i) + "\n" + intropara[0].getText()+"\n\n")
    for paragraph in paragraphs:
        
        if paragraph.name in ('p', 'paragraph', 'h2', 'h3'):
            i=i+1
            playfile.write("Paragraph "+str(i) + "\n" + paragraph.getText() + "\n\n")

    print("File created:  " + 'Example Output v2 File ' + str(wpi) + '.txt')
    
    playfile.close()
    res.close()
    wpi=wpi+1

Start of https://www.bbc.com/sport/football/44659159
Belgium midfielder Marouane Fellaini has thanked Jose Mourinho for his "faith" after signing a new two-year contract at Manchester United.
File created:  Example Output v2 File 1.txt
Start of https://www.bbc.com/sport/football/44665445
England reached the World Cup semi-final for the first time since Italia 90 as Harry Maguire and Dele Alli struck either side of the interval to beat Sweden in Samara.
File created:  Example Output v2 File 2.txt
Start of http://inventwithpython.com/page_that_does_not_exist
There was a problem: 404 Client Error: Not Found for url: http://inventwithpython.com/page_that_does_not_exist
Start of https://www.bbc.com/sport/football/45053886
Manchester United manager Jose Mourinho praised "monster" Paul Pogba after he captained the side and scored the opening goal in the win over Leicester City on the first day of the Premier League season.
File created:  Example Output v2 File 3.txt
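
As a possible shortcut for the sibling loop, Beautiful Soup also has find_next_siblings(), which accepts a list of tag names and returns only the matching siblings. A sketch on an invented snippet (it assumes bs4 is installed, and the HTML is made up to mimic the story layout):

```python
import bs4

html_doc = ('<div class="story-body">'
            '<p class="sp-story-body__introduction">Intro paragraph.</p>\n'
            '<p>Second paragraph.</p>'
            '<div class="advert"><p>Advert text.</p></div>'
            '<h2>Sub-heading</h2>'
            '<p>Last paragraph.</p>'
            '</div>')
soup = bs4.BeautifulSoup(html_doc, 'html.parser')
intro = soup.select('.sp-story-body__introduction')[0]

# Only direct siblings are considered, so the <p> nested inside the
# advert <div> is skipped, as are the bare text nodes between tags.
following = intro.find_next_siblings(['p', 'h2', 'h3'])
texts = [tag.get_text() for tag in following]
print(texts)
```

This keeps the tag-name filtering out of the loop body, at the cost of building the whole list up front.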
