# Scraping the web in Python

## Introduction

To do some web scraping we need to install two libraries: [requests](https://pypi.org/project/requests/) and [Beautiful Soup](https://pypi.org/project/beautifulsoup4/).

If you do not have these installed, you can install them by running the following commands:

<pre>
>> pip install requests

>> pip install beautifulsoup4 
</pre>

The first, __requests__, allows us to send HTTP requests and retrieve complete web pages (the full HTML, not just the presented text). The second, __Beautiful Soup__, is used to parse the retrieved code, find elements, and extract text or other aspects that we are interested in.
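
As a minimal sketch of the parsing half of that division of labour, the snippet below feeds Beautiful Soup a small hard-coded HTML string. The string is a made-up stand-in for a page that requests would download, so the example runs without a network connection.

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet standing in for a downloaded page.
sample_html = "<html><body><h1>School Uniform</h1><p>Any footwear</p></body></html>"

# Parse the markup with Python's built-in HTML parser.
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Find elements by tag name and extract their text.
heading_text = sample_soup.find("h1").get_text()
paragraph_text = sample_soup.find("p").get_text()

print(heading_text)
print(paragraph_text)
```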

## Example

In this example we are going to get the school uniform requirements for Lochside Academy from the school's website.

### Loading the libraries

We load these libraries as follows:

In [94]:
# import our libraries
import requests
from bs4 import BeautifulSoup

### Create variables

We'll have a string for the school, another for the URL of the page to be scraped, and two empty lists for the uniform items of each year group, which we don't know yet.

In [95]:
school = "Lochside Academy" 
url = "https://lochside.aberdeen.sch.uk/school-uniform/"
items_s1_s3 = list()
items_s4_s6 = list()


### Using requests, get the code from our URL

In [96]:
r = requests.get(url)
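
A request can fail or hang, so it may be worth checking the response before parsing it. The sketch below wraps the call in a helper; the name `fetch_html` and the 10 second default timeout are assumptions for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Download a page and return its raw bytes, raising on HTTP errors.

    The function name and default timeout are illustrative choices.
    """
    # Without a timeout, requests can wait indefinitely for a slow server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.content
```

With this helper, the bytes returned by `fetch_html(url)` could be passed to Beautiful Soup in place of `r.content`.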


### Create a Beautiful Soup object

The coding convention is to call this 'soup', but you can call it what you like. Sticking to the convention makes reading similar code more straightforward.

In [97]:
soup = BeautifulSoup(r.content,"html.parser")


If we want to we can have a look at the contents of the soup object. It may not be particularly readable! 

Keeping Chrome or a similar browser open while you write your code, and using its Inspector to look at the page's HTML, can help you find what you're looking for.



In [98]:
print(soup)

<!DOCTYPE html>

<html lang="en-GB">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<link href="http://gmpg.org/xfn/11" rel="profile"/>
<link href="https://lochside.aberdeen.sch.uk/xmlrpc.php" rel="pingback"/>
<title>School Uniform</title>
<meta content="max-image-preview:large" name="robots">
<link href="//fonts.googleapis.com" rel="dns-prefetch">
<link href="https://lochside.aberdeen.sch.uk/feed/" rel="alternate" title=" » Feed" type="application/rss+xml">
<link href="https://lochside.aberdeen.sch.uk/comments/feed/" rel="alternate" title=" » Comments Feed" type="application/rss+xml"/>
<script type="text/javascript">
/* <![CDATA[ */
window._wpemojiSettings = {"baseUrl":"https:\/\/s.w.org\/images\/core\/emoji\/14.0.0\/72x72\/","ext":".png","svgUrl":"https:\/\/s.w.org\/images\/core\/emoji\/14.0.0\/svg\/","svgExt":".svg","source":{"concatemoji":"https:\/\/lochside.aberdeen.sch.uk\/wp-includes\/js\/wp-emoji-release.min.js?ver=6.4.3"}};
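
Printing a whole page produces a lot of output. As a hedged alternative, the sketch below uses a small made-up HTML string to show two gentler ways of inspecting a soup object: reading the page title directly, and using `prettify()` to get an indented version of the markup.

```python
from bs4 import BeautifulSoup

# A made-up page, much shorter than the real one.
sample_html = (
    "<html><head><title>School Uniform</title></head>"
    "<body><p>Any footwear</p></body></html>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# soup.title is the first <title> tag; .string is its text.
page_title = sample_soup.title.string
print(page_title)

# prettify() returns the markup as a string with one tag per line.
print(sample_soup.prettify())
```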

### Parsing the code

Now we can navigate through the _soup_ object looking for what we need.

We'll start by using the soup.find method. We can use this technique to find elements, rows, lists etc. 

In this case we are looking for a __div__ tag with a specific class name. 

In [99]:
mydiv = soup.find("div", {"class": "so-widget-sow-editor so-widget-sow-editor-base"})


What does that return?

In [100]:
print(mydiv)

<div class="so-widget-sow-editor so-widget-sow-editor-base"><h3 class="widget-title">School Uniform</h3>
<div class="siteorigin-widget-tinymce textwidget">
<h5 style="text-align: left;"><strong>S1-S3 :</strong><br/>
• NO leggings or denim<br/>
• Any footwear<br/>
• A black skirt* or black trousers<br/>
• A white polo shirt with or without school badge or a white formal shirt.<br/>
• A black school sweatshirt with the school badge or/and a black school blazer (the blazer is optional for S1-S3)<br/>
• A school tie is optional for S1-S3</h5>
<h5 style="text-align: left;"><strong>S4-S6 :</strong><br/>
• NO leggings or denim<br/>
• Any footwear<br/>
• A black skirt* or black trousers<br/>
• A white polo shirt with or without school badge or a white or black formal shirt.<br/>
• A black blazer with badge<br/>
• A school tie<br/>
• A black school sweatshirt, plain black v neck jumper or plain black cardigan can be worn under the blazer.<br/>
* Skirts should be a reasonable length as excessive
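
One detail of the lookup is worth knowing: when the class value contains several class names, Beautiful Soup can match it against the whole class string, so the names must appear in the same order as in the page. The sketch below illustrates this on a made-up one-line div, and shows a CSS selector via `select_one` as an order-independent alternative.

```python
from bs4 import BeautifulSoup

# A made-up div carrying the same two class names as the real page.
sample_html = (
    '<div class="so-widget-sow-editor so-widget-sow-editor-base">'
    "<h3>School Uniform</h3></div>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Exact match on the whole class string.
by_string = sample_soup.find(
    "div", {"class": "so-widget-sow-editor so-widget-sow-editor-base"}
)

# The same class names in the opposite order do not match.
swapped = sample_soup.find(
    "div", {"class": "so-widget-sow-editor-base so-widget-sow-editor"}
)

# A CSS selector matches the classes regardless of their order.
by_selector = sample_soup.select_one(
    "div.so-widget-sow-editor-base.so-widget-sow-editor"
)

print(by_string is not None, swapped is None, by_selector is not None)
```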

We can find the h5 headings within that div.

In [101]:
headers = soup.find('div', attrs={'class': 'so-widget-sow-editor so-widget-sow-editor-base'}).h5

But if we check what it returns, we see that it only contains the first one. Both the _soup.find_ method and the _.h5_ shortcut return only the first match.

In [102]:
print(headers)


<h5 style="text-align: left;"><strong>S1-S3 :</strong><br/>
• NO leggings or denim<br/>
• Any footwear<br/>
• A black skirt* or black trousers<br/>
• A white polo shirt with or without school badge or a white formal shirt.<br/>
• A black school sweatshirt with the school badge or/and a black school blazer (the blazer is optional for S1-S3)<br/>
• A school tie is optional for S1-S3</h5>


So we need something like the _find_all_ method. We can test it:

In [103]:
for headlines in mydiv.find_all("h5"):
    print(headlines.text.strip())
    print("----------------")


S1-S3 :
• NO leggings or denim
• Any footwear
• A black skirt* or black trousers
• A white polo shirt with or without school badge or a white formal shirt.
• A black school sweatshirt with the school badge or/and a black school blazer (the blazer is optional for S1-S3)
• A school tie is optional for S1-S3
----------------
S4-S6 :
• NO leggings or denim
• Any footwear
• A black skirt* or black trousers
• A white polo shirt with or without school badge or a white or black formal shirt.
• A black blazer with badge
• A school tie
• A black school sweatshirt, plain black v neck jumper or plain black cardigan can be worn under the blazer.
* Skirts should be a reasonable length as excessively short skirts are unsuitable for a school setting.
----------------
School bag: All pupils will be expected to have a school bag with them at all times.
----------------
School uniform can either be purchased direct from the suppliers online from Promprint Designs at http://www.pomprintdesigns.com/
------
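
To make the difference between the two methods concrete, here is a small sketch on a made-up fragment: `find` returns a single tag (or `None` when nothing matches), while `find_all` returns a list-like result that is simply empty when nothing matches.

```python
from bs4 import BeautifulSoup

# A made-up fragment with three h5 headings and no tables.
sample_html = "<div><h5>S1-S3 :</h5><h5>S4-S6 :</h5><h5>School bag</h5></div>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

first_h5 = sample_soup.find("h5")           # the first matching tag only
all_h5 = sample_soup.find_all("h5")         # every matching tag
no_table = sample_soup.find("table")        # None, because nothing matches
no_tables = sample_soup.find_all("table")   # empty, because nothing matches

print(first_h5.get_text())
print(len(all_h5))
print(no_table, len(no_tables))
```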

So we can create a list of all of the h5 headings (a list is what _find_all_ returns).

In [104]:
fives = mydiv.find_all("h5")


This will show the full list: 

In [105]:
print(fives)


[<h5 style="text-align: left;"><strong>S1-S3 :</strong><br/>
• NO leggings or denim<br/>
• Any footwear<br/>
• A black skirt* or black trousers<br/>
• A white polo shirt with or without school badge or a white formal shirt.<br/>
• A black school sweatshirt with the school badge or/and a black school blazer (the blazer is optional for S1-S3)<br/>
• A school tie is optional for S1-S3</h5>, <h5 style="text-align: left;"><strong>S4-S6 :</strong><br/>
• NO leggings or denim<br/>
• Any footwear<br/>
• A black skirt* or black trousers<br/>
• A white polo shirt with or without school badge or a white or black formal shirt.<br/>
• A black blazer with badge<br/>
• A school tie<br/>
• A black school sweatshirt, plain black v neck jumper or plain black cardigan can be worn under the blazer.<br/>
* Skirts should be a reasonable length as excessively short skirts are unsuitable for a school setting.</h5>, <h5 style="text-align: left;">School bag: All pupils will be expected to have a school bag with

We can find out how long the list is:

In [106]:
print(len(fives))

5


And we can look at individual list items

In [107]:
print(fives[0])

<h5 style="text-align: left;"><strong>S1-S3 :</strong><br/>
• NO leggings or denim<br/>
• Any footwear<br/>
• A black skirt* or black trousers<br/>
• A white polo shirt with or without school badge or a white formal shirt.<br/>
• A black school sweatshirt with the school badge or/and a black school blazer (the blazer is optional for S1-S3)<br/>
• A school tie is optional for S1-S3</h5>


In [108]:
print(fives[1])

<h5 style="text-align: left;"><strong>S4-S6 :</strong><br/>
• NO leggings or denim<br/>
• Any footwear<br/>
• A black skirt* or black trousers<br/>
• A white polo shirt with or without school badge or a white or black formal shirt.<br/>
• A black blazer with badge<br/>
• A school tie<br/>
• A black school sweatshirt, plain black v neck jumper or plain black cardigan can be worn under the blazer.<br/>
* Skirts should be a reasonable length as excessively short skirts are unsuitable for a school setting.</h5>


We can see that in the first two items the year group is held in _strong_ tags. We can use __find__ and __get_text__ methods to extract that. 

In [109]:
print(fives[0].find("strong").get_text())
print(fives[1].find("strong").get_text())

S1-S3 :
S4-S6 :
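
Headings without a _strong_ tag need some care, because `find` returns `None` for them and calling `get_text()` on `None` raises an `AttributeError`. The sketch below shows one way to guard against that; the helper name `heading_label` and the shortened sample headings are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Two made-up headings: one with a <strong> label, one without.
sample_html = (
    "<h5><strong>S1-S3 :</strong><br/>• Any footwear</h5>"
    "<h5>School bag: bring one every day.</h5>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")


def heading_label(h5_tag):
    """Return the text of the heading's <strong> tag, or None if it has none."""
    strong = h5_tag.find("strong")
    if strong is None:
        return None
    # strip=True removes leading and trailing whitespace from the text.
    return strong.get_text(strip=True)


labels = [heading_label(tag) for tag in sample_soup.find_all("h5")]
print(labels)
```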


And we can split on the space (creating a list) and get the first element of that. 



In [110]:
print(fives[0].find("strong").get_text().split(" ")[0])
print(fives[1].find("strong").get_text().split(" ")[0])

S1-S3
S4-S6
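
The string handling can be tried on its own, without any HTML. This short sketch shows what `split(" ")` produces for one of the labels, along with `partition(":")` as one possible alternative that does not depend on the space before the colon.

```python
label = "S1-S3 :"

# split(" ") cuts the string at every space and returns a list.
parts = label.split(" ")
print(parts)       # ['S1-S3', ':']
print(parts[0])    # S1-S3

# Alternative: take everything before the colon and trim the whitespace.
year_group = label.partition(":")[0].strip()
print(year_group)  # S1-S3
```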


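So now we have something to work with. We can loop over the headings, skip any that do not hold a year group in a _strong_ tag, and print each uniform item.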

In [183]:

for year in fives:
    strong = year.find("strong")
    # we're only interested in those with "S1-S3" or "S4-S6"
    if strong is None or not strong.get_text().startswith("S"):
        continue
    year_group = strong.get_text().split(" ")[0]
    print("[-----------------------]")
    print(f"{year_group=}\n")

    # iterating over a tag yields its children: nested tags and text nodes
    for line in year:
        line_text = str(line)
        if "strong" not in line_text and "<br/>" not in line_text and "NO " not in line_text:
            if "• " in line_text:
                # drop the footnote asterisk and the leading newline and bullet
                print(line_text.replace("*", "")[3:])
                print("~~~~~~~~~~")
        

[-----------------------]
~~~~~~~~~~
~~~~~~~~~~
~~~~~~~~~~
~~~~~~~~~~
~~~~~~~~~~
[-----------------------]
~~~~~~~~~~
~~~~~~~~~~
~~~~~~~~~~
~~~~~~~~~~
~~~~~~~~~~
~~~~~~~~~~
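
The inner loop works because iterating over a tag yields its direct children, which are a mix of nested tags (such as `<strong>` and `<br/>`) and plain text nodes. As a sketch of an alternative to the string tests, the example below keeps only the text nodes by checking for `NavigableString`. The sample heading is a shortened copy of the markup shown earlier.

```python
from bs4 import BeautifulSoup, NavigableString

# A shortened copy of the first h5 heading from the page.
sample_html = (
    "<h5><strong>S1-S3 :</strong><br/>\n"
    "• NO leggings or denim<br/>\n"
    "• Any footwear<br/>\n"
    "• A school tie is optional for S1-S3</h5>"
)
h5 = BeautifulSoup(sample_html, "html.parser").find("h5")

items = []
for child in h5:
    # Skip child tags such as <strong> and <br/>; keep only bare text nodes.
    if not isinstance(child, NavigableString):
        continue
    text = child.strip()
    # Keep bullet lines, but drop the "NO ..." prohibition line.
    if text.startswith("• ") and "NO " not in text:
        items.append(text[2:])

print(items)
```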


So, let's try to capture that data in some structure. In this case we can use a dictionary, with the year group as a key, and a list of uniform items as the values. 

In [180]:
Lochside = dict()

for year in fives:
    strong = year.find("strong")
    # we're only interested in those with "S1-S3" or "S4-S6"
    if strong is None or not strong.get_text().startswith("S"):
        continue
    year_group = strong.get_text().split(" ")[0]

    item_list = list()
    for line in year:
        line_text = str(line)
        if "strong" not in line_text and "<br/>" not in line_text and "NO " not in line_text:
            if "• " in line_text:
                # drop the footnote asterisk and the leading newline and bullet
                item_list.append(line_text.replace("*", "")[3:])
    Lochside[year_group] = item_list

What happens if we print the dictionary?

In [181]:
Lochside
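
The result should be a dictionary that maps each year group to a list of uniform items. As a self-contained sketch of that structure, the example below applies the same idea to a shortened excerpt of the page's markup. The function name `uniform_by_year` is hypothetical, and the excerpt leaves out most of the real items.

```python
from bs4 import BeautifulSoup, NavigableString

# A shortened excerpt of the page's h5 headings.
sample_html = """
<div>
<h5><strong>S1-S3 :</strong><br/>
• NO leggings or denim<br/>
• Any footwear<br/>
• A black skirt* or black trousers</h5>
<h5><strong>S4-S6 :</strong><br/>
• A black blazer with badge<br/>
• A school tie</h5>
<h5>School bag: All pupils will be expected to have a school bag with them at all times.</h5>
</div>
"""


def uniform_by_year(soup):
    """Map each year group (e.g. "S1-S3") to its list of uniform items."""
    result = {}
    for h5 in soup.find_all("h5"):
        strong = h5.find("strong")
        if strong is None:
            continue  # headings without a year group are skipped
        year_group = strong.get_text().split(" ")[0]
        items = []
        for child in h5:
            # Only bare text nodes hold the bullet lines.
            if isinstance(child, NavigableString):
                text = child.strip()
                if text.startswith("• ") and "NO " not in text:
                    items.append(text[2:].replace("*", ""))
        result[year_group] = items
    return result


uniform = uniform_by_year(BeautifulSoup(sample_html, "html.parser"))
for year_group, items in uniform.items():
    print(year_group, items)
```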