# Scraping the web in Python

## Introduction

To do some web scraping we need to install two libraries: [requests](https://pypi.org/project/requests/) and [Beautiful Soup](https://pypi.org/project/beautifulsoup4/).

If you do not have these installed, you can install them by running the following commands:

<pre>
>> pip install requests

>> pip install beautifulsoup4 
</pre>

The first, __requests__, allows us to send HTTP requests and retrieve complete web pages (the full HTML, not just the presented text). The second, __Beautiful Soup__, is used to parse the retrieved code, find elements and extract the text or other parts that we are interested in.

## Example

In this example we are going to get the school uniform items for Lochside Academy from the school's website.

### Loading the libraries

We load these libraries as follows:

In [None]:
# import our libraries
import requests
from bs4 import BeautifulSoup

### Create variables

We'll have a string for the school, another for the URL of the page to be scraped, and two empty lists for the uniform items of the S1-S3 and S4-S6 year groups, whose contents we don't know yet.

In [None]:
school = "Lochside Academy"
url = "https://lochside.aberdeen.sch.uk/school-uniform/"
items_s1_s3 = list()
items_s4_s6 = list()


### Using requests, get the code from our URL

In [None]:
r = requests.get(url)
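
A request can fail or hang, so it can be worth checking the response before parsing it. The following is a minimal sketch, not part of the original walkthrough: `fetch_html` is a hypothetical helper name, and the 10-second timeout is an arbitrary assumption.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw bytes of a page, raising an exception if the request fails.

    `fetch_html` and the default timeout are illustrative assumptions.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx status
    return response.content


# Example use (this performs a real network request, so it is commented out):
# html = fetch_html("https://lochside.aberdeen.sch.uk/school-uniform/")
```

The returned bytes can be passed to Beautiful Soup in the same way as `r.content`.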


### Create a Beautiful Soup object

The coding convention is to call this 'soup', but you can call it what you like. Sticking to the convention makes reading similar code more straightforward.

In [None]:
soup = BeautifulSoup(r.content,"html.parser")


If we want to, we can have a look at the contents of the soup object. It may not be particularly readable!

Keeping Chrome or a similar browser open while you write your code, and using its Inspector to look at the page's HTML, can help you find what you're looking for.



In [None]:
print(soup)
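
If the raw output is hard to read, Beautiful Soup's `prettify` method returns the same markup as an indented string. A small sketch on a hypothetical snippet of HTML (the markup below is made up for illustration, not copied from the school's page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup, standing in for a real page.
html = "<div><h5><strong>S1-S3 Uniform</strong></h5></div>"
demo_soup = BeautifulSoup(html, "html.parser")

# prettify() puts each tag and string on its own indented line.
pretty = demo_soup.prettify()
print(pretty)
```

On a real page the result can still be very long, so slicing the string (for example `pretty[:500]`) before printing may help.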

### Parsing the code

Now we can navigate through the _soup_ object looking for what we need.

We'll start by using the soup.find method. We can use this technique to find elements, rows, lists etc. 

In this case we are looking for a __div__ tag with a specific class name. 

In [None]:
mydiv = soup.find("div", {"class": "so-widget-sow-editor so-widget-sow-editor-base"})


What does that return?

In [None]:
print(mydiv)
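
One thing to be aware of: when the class value contains a space, Beautiful Soup compares it against the whole `class` attribute as a single string, so the class names must appear in the same order as in the page. A CSS selector via `select_one` does not depend on the order. The sketch below shows this on a hypothetical one-line document that reuses the class names from above:

```python
from bs4 import BeautifulSoup

# Hypothetical markup reusing the class names from the example above.
html = '<div class="so-widget-sow-editor so-widget-sow-editor-base"><h5>First</h5></div>'
demo_soup = BeautifulSoup(html, "html.parser")

# Same order as in the markup: the div is found.
exact = demo_soup.find("div", {"class": "so-widget-sow-editor so-widget-sow-editor-base"})

# Same classes in a different order: nothing is found.
swapped = demo_soup.find("div", {"class": "so-widget-sow-editor-base so-widget-sow-editor"})

# A CSS selector matches on each class separately, so order does not matter.
by_css = demo_soup.select_one("div.so-widget-sow-editor-base.so-widget-sow-editor")

print(exact is not None, swapped is None, by_css is not None)
```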

We can find headers within that.

In [None]:
headers = soup.find('div', attrs={'class': 'so-widget-sow-editor so-widget-sow-editor-base'}).h5

But if we check what it returns we see that it only finds the first one. This is what the _soup.find_ method does. 

In [None]:
print(headers)


So we need something like the _find_all_ method. We can test it:

In [None]:
for headlines in mydiv.find_all("h5"):
    print(headlines.text.strip())
    print("+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-")
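
To make the difference between the two methods concrete, here is a small sketch on hypothetical markup. It also shows what each method returns when nothing matches, which matters when chaining calls such as `.find(...).get_text()`:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two headings.
html = "<div><h5>First</h5><h5>Second</h5></div>"
demo_soup = BeautifulSoup(html, "html.parser")

first = demo_soup.find("h5")            # a single Tag: the first match only
every = demo_soup.find_all("h5")        # a list-like ResultSet of all matches
missing = demo_soup.find("h6")          # None when there is no match
missing_all = demo_soup.find_all("h6")  # an empty result when there is no match

print(first.get_text())
print([h.get_text() for h in every])
print(missing, len(missing_all))
```

Because `find` returns `None` when nothing matches, calling a method on its result raises an `AttributeError` in that case.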
    

So, we can create a list (it's what _find_all_ returns) of all of the H5 headings:

In [None]:
fives = mydiv.find_all("h5")


This will show the full list: 

In [None]:
print(fives)


We can find out how long the list is:

In [None]:
print(len(fives))

And we can look at individual list items:

In [None]:
print(fives[0])

In [None]:
print(fives[1])

We can see that in the first two items the year group is held in _strong_ tags. We can use __find__ and __get_text__ methods to extract that. 

In [None]:
print(fives[0].find("strong").get_text())
print(fives[1].find("strong").get_text())

And we can split on the space (creating a list) and get the first element of that. 



In [None]:
print(fives[0].find("strong").get_text().split(" ")[0])
print(fives[1].find("strong").get_text().split(" ")[0])
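
The same chain of calls can be tried on a hypothetical heading to see what each step produces. The markup and the label text below are assumptions made for illustration, not the real page content:

```python
from bs4 import BeautifulSoup

# Hypothetical heading: a bold label followed by a bullet item.
html = "<h5><strong>S1-S3 Uniform</strong><br/>• Black blazer</h5>"
heading = BeautifulSoup(html, "html.parser").find("h5")

whole_text = heading.get_text()             # text of the heading and everything inside it
label = heading.find("strong").get_text()   # text of the <strong> tag only
year_group = label.split(" ")[0]            # first word of the label

print(whole_text)
print(label)
print(year_group)
```

Calling `get_text` on the whole heading runs the label and the bullet together, which is why we narrow down to the _strong_ tag first.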

So now we have something to work with. 

In [None]:

for year in fives:
    strong = year.find("strong")
    if strong is None:
        continue  # no <strong> label, so this is not a year-group heading
    year_group = strong.get_text().split(" ")[0]
    if not year_group.startswith("S"):
        continue  # we're only interested in those with "S1-S3" or "S4-S6"
    print("[-----------------------]")
    print(f"{year_group=}\n")

    # iterating over a tag yields its direct children (tags and text)
    for line in year:
        line_text = str(line)
        if "strong" not in line_text and "<br/>" not in line_text and "NO " not in line_text:
            if "• " in line_text:
                print(line_text.replace("*", "")[3:])
                print("~~~~~~~~~~")
        

So, let's try to capture that data in some structure. In this case we can use a dictionary, with the year group as a key, and a list of uniform items as the values. 

In [None]:
Lochside = dict()

for year in fives:
    strong = year.find("strong")
    if strong is None:
        continue  # no <strong> label, so this is not a year-group heading
    year_group = strong.get_text().split(" ")[0]
    if not year_group.startswith("S"):
        continue  # we're only interested in those with "S1-S3" or "S4-S6"

    item_list = list()
    # iterating over a tag yields its direct children (tags and text)
    for line in year:
        line_text = str(line)
        if "strong" not in line_text and "<br/>" not in line_text and "NO " not in line_text:
            if "• " in line_text:
                item_list.append(line_text.replace("*", "")[3:])
    Lochside[year_group] = item_list
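
As an alternative to testing the string form of each child, we can check its type: the children of a tag are either tags (such as _strong_ or _br_) or `NavigableString` text nodes. The sketch below is one possible variation rather than the approach used above. It runs on hypothetical markup, and `collect_items` is a made-up helper name:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical markup, loosely modelled on the headings described above.
html = """
<div>
  <h5><strong>S1-S3 Uniform</strong><br/>• Black blazer<br/>• White shirt<br/>• NO trainers</h5>
  <h5>General information</h5>
  <h5><strong>S4-S6 Uniform</strong><br/>• Black blazer<br/>• School tie</h5>
</div>
"""


def collect_items(headings):
    """Map each year group (e.g. "S1-S3") to its list of bullet items."""
    result = {}
    for heading in headings:
        strong = heading.find("strong")
        if strong is None:
            continue  # no label, so not a year-group heading
        year_group = strong.get_text().split(" ")[0]
        if not year_group.startswith("S"):
            continue
        items = []
        for child in heading.children:
            if not isinstance(child, NavigableString):
                continue  # skip the <strong> and <br/> tags
            text = child.strip()
            # keep bullet lines, but leave out the "NO ..." lines
            if text.startswith("•") and "NO " not in text:
                items.append(text.lstrip("• ").strip())
        result[year_group] = items
    return result


demo_soup = BeautifulSoup(html, "html.parser")
uniform = collect_items(demo_soup.find_all("h5"))
print(uniform)
```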

What happens if we print the dictionary?

In [None]:
Lochside
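
A nested dictionary like this can be hard to read on one line. One option, sketched here with made-up data standing in for the scraped result, is to format it with the standard `json` module:

```python
import json

# Hypothetical data with the same shape as the scraped dictionary.
uniform = {
    "S1-S3": ["Black blazer", "White shirt"],
    "S4-S6": ["Black blazer", "School tie"],
}

# indent=2 puts each item on its own line; ensure_ascii=False keeps
# any non-ASCII characters readable instead of escaping them.
text = json.dumps(uniform, indent=2, ensure_ascii=False)
print(text)
```

The same string could also be written to a file if we wanted to keep the results.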

### Questions or comments?

## Coding Challenge

This month's coding challenge can be found [here](https://github.com/PythonAberdeen/user_group/tree/master/2024/2024-03/challenge)