# Basic Scraping with Beautiful Soup

_Note: adapted from Ivan Smirnov's exercises_


### HTML

HTML stands for HyperText Markup Language ([W3Schools introduction](https://www.w3schools.com/html/html_intro.asp)).

HTML elements are defined by tags
```
<b>Bold text</b>
```
Tags have attributes
```
<a href="https://rwth-aachen.de">RWTH Aachen</a>
```
Tags can be nested
```
<div id="main">
    <div class="sub">
        <a href="https://rwth-aachen.de">RWTH Aachen</a>
    </div>
    <div class="sub">
        <a href="https://hse.ru">HSE University</a>
    </div>
</div>
<div id="footer">
</div>
```
HTML elements can be selected with CSS selectors, for example by tag name
```
a
```
By ID
```
#main
```
By class
```
.sub
```
Nested selection
```
#main .sub
```
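
To see what each of these selectors matches, here is a minimal sketch that runs them against the nested example above. It assumes the `beautifulsoup4` package (introduced in the next section) is installed; the names `snippet`, `demo`, and `matches` are illustrative only, and `#footer .sub` is an extra selector added to show an empty result.

```python
from bs4 import BeautifulSoup

# The nested example from above, parsed with Python's built-in HTML parser.
snippet = """
<div id="main">
    <div class="sub">
        <a href="https://rwth-aachen.de">RWTH Aachen</a>
    </div>
    <div class="sub">
        <a href="https://hse.ru">HSE University</a>
    </div>
</div>
<div id="footer">
</div>
"""
demo = BeautifulSoup(snippet, "html.parser")

# Count how many elements each CSS selector matches.
selectors = ["a", "#main", ".sub", "#main .sub", "#footer .sub"]
matches = {css: len(demo.select(css)) for css in selectors}
print(matches)
# {'a': 2, '#main': 1, '.sub': 2, '#main .sub': 2, '#footer .sub': 0}
```

The nested selector `#main .sub` only matches elements with class `sub` that sit somewhere inside the element with ID `main`, which is why `#footer .sub` finds nothing.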

[web page example](https://www.rwth-aachen.de/cms/root/Studium/Vor-dem-Studium/Studiengaenge/~yev/Liste-Aktuelle-Studiengaenge/lidx/1/?page=1&aaaaaaaaaaaaaum=aaaaaaaaaaaaxqh&showall=1)

### Beautiful Soup

[BeautifulSoup library](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [1]:
#!pip install beautifulsoup4




In [3]:

html_doc = """
<html>
    <head>
        <title>Test html-document</title>
    </head>
    <body>
        <div id="main">
        <div class="sub">
            <a href="https://uni-mannheim.de">Uni Mannheim</a>
        </div>
        <div class="sub">
            <a href="https://rwth-aachen.de">RWTH Aachen</a>
        </div>
        <div class="sub">
            <a href="https://uni-heidelberg.de">Uni Heidelberg</a></div>
        </div>
        <div id="footer"></div>
    </body>
</html>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

soup, type(soup)

(
 <html>
 <head>
 <title>Test html-document</title>
 </head>
 <body>
 <div id="main">
 <div class="sub">
 <a href="https://uni-mannheim.de">Uni Mannheim</a>
 </div>
 <div class="sub">
 <a href="https://rwth-aachen.de">RWTH Aachen</a>
 </div>
 <div class="sub">
 <a href="https://uni-heidelberg.de">Uni Heidelberg</a></div>
 </div>
 <div id="footer"></div>
 </body>
 </html>,
 bs4.BeautifulSoup)

In [4]:
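# Tags can be reached as attributes; only the value of the last expression is displayed
soup.html.head.title       # the <title> Tag itself
soup.html.head.title.text  # the text inside the tag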

'Test html-document'

In [5]:
print(len(soup.select('div')))

5


In [6]:
print(len(soup.select('#main div')))

3
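
The selector `#main div` matches `div` elements at any depth inside `#main`. In the test document this happens to equal the number of direct children, so the sketch below uses a slightly different, hypothetical snippet (`nested_html`) with one extra level of nesting to show the difference between the descendant selector and the child selector `>`. It also shows `select_one`, which returns the first match or `None` instead of a list.

```python
from bs4 import BeautifulSoup

# Hypothetical variant of the test document with an extra nested <div>.
nested_html = """
<div id="main">
  <div class="sub"><a href="https://uni-mannheim.de">Uni Mannheim</a></div>
  <div class="sub">
    <div class="inner"><a href="https://rwth-aachen.de">RWTH Aachen</a></div>
  </div>
</div>
<div id="footer"></div>
"""
nested = BeautifulSoup(nested_html, "html.parser")

descendants = nested.select("#main div")    # any depth below #main
children = nested.select("#main > div")     # direct children of #main only
print(len(descendants), len(children))      # 3 2

first_link = nested.select_one("#main a")   # first match only
missing = nested.select_one("#footer a")    # no match
print(first_link.text, missing)             # Uni Mannheim None
```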


In [7]:
print(soup.prettify())

<html>
 <head>
  <title>
   Test html-document
  </title>
 </head>
 <body>
  <div id="main">
   <div class="sub">
    <a href="https://uni-mannheim.de">
     Uni Mannheim
    </a>
   </div>
   <div class="sub">
    <a href="https://rwth-aachen.de">
     RWTH Aachen
    </a>
   </div>
   <div class="sub">
    <a href="https://uni-heidelberg.de">
     Uni Heidelberg
    </a>
   </div>
  </div>
  <div id="footer">
  </div>
 </body>
</html>



In [8]:
soup.select('a')

[<a href="https://uni-mannheim.de">Uni Mannheim</a>,
 <a href="https://rwth-aachen.de">RWTH Aachen</a>,
 <a href="https://uni-heidelberg.de">Uni Heidelberg</a>]

In [9]:
first = soup.select('a')[0]
first, type(first)

(<a href="https://uni-mannheim.de">Uni Mannheim</a>, bs4.element.Tag)
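
A `Tag` also gives access to its name, text, and attributes, which is usually what a scraper needs to extract. The following is a small sketch on a single link; `link_html` and `link` are illustrative names, not part of the notebook.

```python
from bs4 import BeautifulSoup

link_html = '<a href="https://uni-mannheim.de">Uni Mannheim</a>'
link = BeautifulSoup(link_html, "html.parser").a

print(link.name)          # a
print(link.text)          # Uni Mannheim
print(link["href"])       # https://uni-mannheim.de
print(link.get("title"))  # None, because the attribute does not exist
print(link.attrs)         # {'href': 'https://uni-mannheim.de'}
```

Indexing with `link["title"]` would raise a `KeyError` for a missing attribute, so `get` is the safer choice when an attribute is optional.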

In [10]:
first.parent


<div class="sub">
<a href="https://uni-mannheim.de">Uni Mannheim</a>
</div>

In [11]:
first.parent.parent.parent

<body>
<div id="main">
<div class="sub">
<a href="https://uni-mannheim.de">Uni Mannheim</a>
</div>
<div class="sub">
<a href="https://rwth-aachen.de">RWTH Aachen</a>
</div>
<div class="sub">
<a href="https://uni-heidelberg.de">Uni Heidelberg</a></div>
</div>
<div id="footer"></div>
</body>

In [None]:
for el in first.parents:
    print(type(el), el.name)
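
As a self-contained sketch of what iterating over `.parents` yields, the example below walks upward from a link in a compact, hypothetical page (`page`) and collects the tag names. The last entry is the `BeautifulSoup` object itself, whose name is `[document]`. `find_parent` is shown as a shortcut for jumping to a specific ancestor.

```python
from bs4 import BeautifulSoup

# Compact stand-in for the test document (hypothetical content).
page = (
    '<html><body><div id="main"><div class="sub">'
    '<a href="https://uni-mannheim.de">Uni Mannheim</a>'
    "</div></div></body></html>"
)
anchor = BeautifulSoup(page, "html.parser").a

# Walk from the closest parent up to the document root.
parent_names = [el.name for el in anchor.parents]
print(parent_names)
# ['div', 'div', 'body', 'html', '[document]']

# Jump directly to the ancestor with a given attribute.
main_div = anchor.find_parent(id="main")
print(main_div["id"])  # main
```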

In [13]:
for el in first.parent.parent.children:
    print(el)



<div class="sub">
<a href="https://uni-mannheim.de">Uni Mannheim</a>
</div>


<div class="sub">
<a href="https://rwth-aachen.de">RWTH Aachen</a>
</div>


<div class="sub">
<a href="https://uni-heidelberg.de">Uni Heidelberg</a></div>




In [12]:
# Whitespace between tags is parsed as a text node, so next_sibling must be called twice to reach the next <div>
first.parent.next_sibling.next_sibling

<div class="sub">
<a href="https://rwth-aachen.de">RWTH Aachen</a>
</div>
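
Chaining `next_sibling` depends on exactly how much whitespace the source contains. A more robust alternative, sketched here on a shortened, hypothetical copy of the test document (`sibling_html`), is `find_next_sibling`, which skips text nodes and returns the next matching tag.

```python
from bs4 import BeautifulSoup, NavigableString

sibling_html = """<div id="main">
<div class="sub"><a href="https://uni-mannheim.de">Uni Mannheim</a></div>
<div class="sub"><a href="https://rwth-aachen.de">RWTH Aachen</a></div>
</div>"""
first_div = BeautifulSoup(sibling_html, "html.parser").select_one(".sub")

# The immediate sibling is the newline between the two <div> tags.
print(isinstance(first_div.next_sibling, NavigableString))  # True

# find_next_sibling ignores text nodes and returns the next <div>.
second_div = first_div.find_next_sibling("div")
print(second_div.a.text)  # RWTH Aachen
```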

In [None]:
soup.select('#main')

In [None]:
soup.select('.sub')

In [None]:
soup.find_all('div', class_='sub')
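
Putting the pieces together, a typical scraping step collects the text and URL of every link inside the matched elements. The sketch below does this with `find_all` on a compact copy of the test document and checks that the CSS selector `div.sub` finds the same elements; the names `doc`, `links`, and `same_result` are illustrative only.

```python
from bs4 import BeautifulSoup

doc = """
<div id="main">
  <div class="sub"><a href="https://uni-mannheim.de">Uni Mannheim</a></div>
  <div class="sub"><a href="https://rwth-aachen.de">RWTH Aachen</a></div>
  <div class="sub"><a href="https://uni-heidelberg.de">Uni Heidelberg</a></div>
</div>
<div id="footer"></div>
"""
parsed = BeautifulSoup(doc, "html.parser")

# Collect one record per link found inside a <div class="sub">.
links = [
    {"name": a.get_text(strip=True), "url": a["href"]}
    for div in parsed.find_all("div", class_="sub")
    for a in div.find_all("a")
]
for record in links:
    print(record["name"], record["url"])

# find_all with class_ and the CSS selector 'div.sub' return the same tags.
by_find_all = [str(tag) for tag in parsed.find_all("div", class_="sub")]
by_select = [str(tag) for tag in parsed.select("div.sub")]
same_result = by_find_all == by_select
print(same_result)  # True
```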