**Disclaimer**: This educational content, including any code examples, is provided for instructional purposes only. The author does not endorse or encourage the unauthorised or illegal scraping of websites.

While Python with relevant libraries can be used for web scraping, it's crucial to conduct scraping activities in compliance with applicable laws, the website's terms of service, and ethical considerations. Always review and respect the rules set by the website you are scraping to ensure legal and responsible data collection practices.

#  Week 1: Web and Web Analytics

## Scraping an HTML page (loading and searching its contents)

The page you scrape can be either:

* Local: saved in a file on your computer
* Remote: somewhere on the web

To fully understand this notebook, please open the `example_html.html` file in another tab, and open its source code in a third tab (or, even better, in the browser's View > Developer Tools). You will see in a minute what the exact address of that file is.

For scraping, we need a few different libraries, most notably BeautifulSoup. Let's import them first:

In [None]:
import os
from urllib.request import urlopen
from bs4 import BeautifulSoup

We can simply pass the address of a web page as a string to `urlopen`, which opens it. Afterwards, BeautifulSoup parses the result into a `BeautifulSoup` object, which has many useful methods and attributes:

### Local Website

In [None]:
# for now we use a local file (os.getcwd() gets the Current Working Directory, aka. the folder you're in)
file_url = "file:///"+os.getcwd()+"/example_html.html"
website_source_code = urlopen(file_url)

# parse the website's content; this needs a parser, in this case Python's built-in HTML parser
soup = BeautifulSoup(website_source_code, 'html.parser')

# alternatively, the "lxml" parser can be used (it has to be installed separately)
# soup = BeautifulSoup(website_source_code, 'lxml')
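
Gluing the address together from strings works for the demo file. A slightly more portable way to build the same kind of address may be `pathlib` from the standard library, which percent-encodes special characters such as spaces and uses forward slashes on Windows. The sketch below assumes `example_html.html` sits in the current working directory; `local_file_url` is a hypothetical helper name, not something the notebook defines:

```python
from pathlib import Path


def local_file_url(filename):
    """Build a file:// URL for a file in the current working directory."""
    # resolve() makes the path absolute; as_uri() adds the file:// scheme
    # and percent-encodes characters such as spaces
    return Path(filename).resolve().as_uri()


file_url = local_file_url("example_html.html")
print(file_url)
```

The resulting string can be passed to `urlopen` in the same way as the hand-built `file_url` above.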

In [None]:
# here's the complete HTML of the page, but it's easier to read if you open its source using the URL above
print(soup.prettify())

In [None]:
# .find_all retrieves all <h1> tags:
h1Tags = soup.find_all('h1')
for h1 in h1Tags:
    print('Complete tag code: ', h1)
    print("Just the text in the tag: ", h1.text)

However, searching by name only matches the names of tags; it does not work with attributes of tags:

In [None]:
titleTags = soup.find_all('title')
for title in titleTags:
    print('Complete tag code: ', title)
    print("Just the text in the tag: ", title.text)
    
# nothing will be printed: the page has no <title> </title> tags (any 'title' attributes on other tags are not matched)
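
To make the difference between tag names and attribute names concrete, here is a small sketch on an inline HTML string. The markup and the variable names are made up for illustration and are not taken from `example_html.html`. It uses the `attrs` argument of `.find_all()` with the value `True` to ask for every tag that carries a `title` attribute, whatever its value:

```python
from bs4 import BeautifulSoup

# hypothetical snippet: no <title> tag, but one tag with a title attribute
html = '<p title="greeting">Hello</p><p>No title here</p>'
demo_soup = BeautifulSoup(html, "html.parser")

# searching by name looks for <title> tags, so nothing is found
title_tags = demo_soup.find_all("title")
print(len(title_tags))

# searching by attribute finds every tag that has a title attribute
tags_with_title = demo_soup.find_all(attrs={"title": True})
for tag in tags_with_title:
    # attribute values are read with dictionary-style access
    print(tag.name, tag["title"], tag.text)
```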

## Understanding the HTML is all about finding the components you need:

* .find_all( ) will find everything that matches the criteria and return it in a **list**
* .find( ) will find just the **first** item that matches the criteria

You can use them on the whole website, like `a_table = soup.find("table")`, or on an element you found before, like `first_row = a_table.find("tr")`.

You can search by type of tag, by class, or by id:
* `soup.find("h1")`,
* `soup.find(id="main_navigation")`,
* `soup.find(class_="warning_message")` (note the trailing underscore: `class` is a reserved word in Python, so `class=` is a syntax error)

Very often, you will fetch an element by its unique id:

In [None]:
middle_row = soup.find(id='middle_row')

print('Complete tag code: ', middle_row)
print("Just the text in the tag: ", middle_row.text)
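
The different lookups described in this section can be tried side by side on a small inline snippet. The markup below is invented for this example (only the id and class names are borrowed from the list above), so treat it as a sketch rather than a description of `example_html.html`. It also shows that `.find()` returns `None` when nothing matches, which is worth checking before reading `.text`:

```python
from bs4 import BeautifulSoup

# hypothetical markup, invented for this example
html = """
<h1>First heading</h1>
<h1>Second heading</h1>
<ul id="main_navigation"><li>Home</li></ul>
<p class="warning_message">Careful!</p>
"""
demo_soup = BeautifulSoup(html, "html.parser")

all_headings = demo_soup.find_all("h1")   # list-like, here with two tags
first_heading = demo_soup.find("h1")      # only the first match
navigation = demo_soup.find(id="main_navigation")
warning = demo_soup.find(class_="warning_message")  # class_ with an underscore
missing = demo_soup.find(id="does_not_exist")       # no match gives None

print([heading.text for heading in all_headings])
print(first_heading.text, "|", navigation.text, "|", warning.text)

# check for None first; calling .text on None raises an AttributeError
if missing is None:
    print("No element with that id")
```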

## Find children:

When a tag contains children (tags inside it), as above, you can extract them into a list. For example, the table row `<tr></tr>` above includes three table data cells `<td></td>`.

`.findChildren()` will give you a list with all tags inside a given tag.

You can specify exactly which children you want, as with `.find()`. So you could use
* `.findChildren("tr")` or
* `.findChildren(class_="warning_message")`

In [None]:
middle_row = soup.find(id='middle_row')
cells_in_the_row = middle_row.findChildren()
for cell in cells_in_the_row:
    print('Complete tag code: ', cell, "Just the text in the tag: ", cell.text)
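
One detail worth knowing: by default the search covers every tag nested anywhere inside the element, not just its direct children. The sketch below illustrates this on a made-up table row whose middle cell contains a `<b>` tag. It uses `.find_all()`, which the notebook uses elsewhere; `.findChildren()` is, to my knowledge, an older alias for the same search. The markup, the id `demo_row`, and the variable names are assumptions for illustration only:

```python
from bs4 import BeautifulSoup

# hypothetical table row; the middle cell contains a nested <b> tag
html = (
    '<table><tr id="demo_row">'
    "<td>left</td><td><b>middle</b></td><td>right</td>"
    "</tr></table>"
)
demo_soup = BeautifulSoup(html, "html.parser")
demo_row = demo_soup.find(id="demo_row")

# default search: every tag nested anywhere inside the row
all_nested = demo_row.find_all(True)
print([tag.name for tag in all_nested])

# recursive=False: only the tags directly inside the row
direct_children = demo_row.find_all(True, recursive=False)
print([tag.name for tag in direct_children])

# asking for a tag name keeps only the cells
cells = demo_row.find_all("td")
print([cell.text for cell in cells])
```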

You can also narrow a search down further. For example, here we look for all `div` tags that have the (CSS) class `hipster`:

In [None]:
class_elements = soup.find_all("div", {"class" : "hipster" })
for element in class_elements:
    print('whole tag:\n', str(element), '\n')
    print('Just the text: ', element.text)

Getting all the elements out of the table:

In [None]:
# list all tables, since we only have 1, use the first in the list at index 0
my_table = soup.find_all('table')[0]
# or just use: my_table = soup.find('table')

# loop the rows and keep the row number
row_num = 0
for row in my_table.find_all('tr'):
    print("Row: "+str(row_num))
    row_num = row_num+1

    #loop the cells in the row
    for cell in row.find_all('td'):
        print("whole html:", str(cell)+" \tJust content: "+cell.text)
        
# if you'd like, try to change this code to use .findChildren( ) rather than .find_all( )
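
If you want to keep the table contents rather than only print them, one possible approach is to collect the cell texts into a list of lists. This is a sketch on a made-up two-row table, not on the table from `example_html.html`; the names `demo_table` and `table_data` are hypothetical. It also uses `enumerate()` in place of the manual row counter:

```python
from bs4 import BeautifulSoup

# hypothetical two-row table, invented for this example
html = """
<table>
  <tr><td>1</td><td>one</td></tr>
  <tr><td>2</td><td>two</td></tr>
</table>
"""
demo_table = BeautifulSoup(html, "html.parser").find("table")

# one inner list per row, holding the stripped text of each cell
table_data = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in demo_table.find_all("tr")
]

# enumerate() supplies the row number, so no manual counter is needed
for row_num, row_values in enumerate(table_data):
    print("Row:", row_num, row_values)
```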

## Minitask: Now attempt to scrape something from a real online website:

Use the above code to make a list of all the degrees available in the Business School of the University of Edinburgh.
* You will need to get the source of the page the list is on and feed it into BeautifulSoup (see code above). Instead of our demo website (file://.....), use this URL: https://www.ed.ac.uk/studying/undergraduate/degrees/index.php?action=view&code=12
* Get the HTML component that holds all the degrees. Use developer tools to identify what type of component it is (hint: ul stands for "unordered list"). Does this component have a class or an id? How would you get a component when you know its id? (hint: proxy_degreeList)
* What type of tag are the actual names of the degrees in? (div, a, p, or something else) Hint: what tag surrounds the name of the course?
* Grab the children of that type from the component with all the names and, in a loop, extract only the text of each of them and print it.

I am posting the solution lower down, but do try to solve it by yourself first!

In [None]:
# copy-paste relevant parts of the code from above to start:

Only uncover the solutions once you have tried to complete the task:

Hint 1:

1. You will need to get the source of the page the list is on and feed it into BeautifulSoup (see code above). Instead of our demo website (file://.....), use this URL: https://www.ed.ac.uk/studying/undergraduate/degrees/index.php?action=view&code=12
``` 
file_url = "https://www.ed.ac.uk/studying/undergraduate/degrees/index.php?action=view&code=12" 
website_source_code = urlopen(file_url) 
soup_degrees_website = BeautifulSoup(website_source_code, 'html.parser') 
``` 

Hint 2:

2. Get the HTML component that holds all the degrees. Use developer tools to identify what type of component it is (hint: ul stands for "unordered list").
Does this component have a class or an id? How would you get a component when you know its id? (hint: proxy_degreeList)
``` 
degrees = soup_degrees_website.find(id='proxy_degreeList')
``` 

Hint 3:

3. What type of tag are the actual names of the degrees in? (div, a, p, or something else) Hint: what tag surrounds the name of the course?
``` 
for list_item in degrees.findChildren("a"): 
``` 

Hint 4:

4. Grab the children of that type from the component with all the names and, in a loop, extract only the text of each of them and print it.
```
for list_item in degrees.findChildren("a"):
    print("Degree Name:", list_item.text)
```

In [None]:
file_url = "https://www.ed.ac.uk/studying/undergraduate/degrees/index.php?action=view&code=12" 
website_source_code = urlopen(file_url) 
soup_degrees_website = BeautifulSoup(website_source_code, 'html.parser') 

In [None]:
degrees = soup_degrees_website.find(id='proxy_degreeList')

In [None]:
for list_item in degrees.findChildren("a"):
    print("Degree Name:", list_item.text) 
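
Real websites change over time. If the id ever disappears from the page, `.find()` returns `None` and the loop above fails with an `AttributeError`. A more defensive variant might look like the sketch below. `link_texts_in` is a hypothetical helper name, and `sample_html` is a made-up stand-in with placeholder names, NOT the real source of the degrees page, so that the example runs without a network connection:

```python
from bs4 import BeautifulSoup


def link_texts_in(html, element_id):
    """Return the text of every <a> tag inside the element with the given id."""
    page_soup = BeautifulSoup(html, "html.parser")
    container = page_soup.find(id=element_id)
    if container is None:
        # the id was not found, for example because the page layout changed
        return []
    return [link.get_text(strip=True) for link in container.find_all("a")]


# hypothetical stand-in for the downloaded page, NOT the real page source
sample_html = """
<ul id="proxy_degreeList">
  <li><a href="#">Example Degree A</a></li>
  <li><a href="#">Example Degree B</a></li>
</ul>
"""

for name in link_texts_in(sample_html, "proxy_degreeList"):
    print("Degree Name:", name)
```

For the real page, the first argument would be the downloaded source, for example `urlopen(file_url).read()`.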