In [10]:
from IPython.display import HTML
def css_styling():
    # Read the custom stylesheet and close the file when done
    with open("../Data/www/styles/custom.css", "r") as css_file:
        styles = css_file.read()
    return HTML(styles)
css_styling()

In [11]:
import requests
import json
from IPython.display import HTML
import bs4
from IPython.display import display, Image

# Synopsis

In this unit we will cover:

* The structure of Web pages
* What is HTML/CSS
* How to extract information from HTML pages
* Techniques for navigating and scraping web pages


# Interacting with the Web (Part II)

In the previous units, we learned how to retrieve data from Web sources using APIs. But what if the organization hosting the data does not have the forethought or resources to create an API (or if they do not want to share their data)?  Then, we have to **crawl** their website and **scrape** their data.

To do this, we will be using our dependable `requests` library.  However, we will need to call upon a few other resources.  In particular, we will need to understand the code in which webpages are written.


+++

## Detour: A (very brief) intro to HTML

HTML is a markup language for describing web documents. It stands for **H**yper **T**ext **M**arkup **L**anguage. Together with CSS (**C**ascading **S**tyle **S**heets, for _styling_ web documents) and JavaScript (for _animating_ web documents), it is the language used to construct web pages.

HTML documents are built using a series of HTML _tags_. Each tag describes a different type of content. Web pages are built by putting together different tags.

This is the general HTML tag structure:

```html
<tagname tag_attribute1="attribute1value1 attribute1value2" tag_attribute2="attribute2value1">tag contents</tagname>
```
* Tags (usually) have both a start (or opening) tag, `<tagname>`, and an end (or closing) tag, `</tagname>`.
* Tags can also have attributes, which are declared _inside_ the opening tag.
* The actual tag _content_ goes in between the opening and closing tags.

Tags can be contained (nested) inside other tags, which defines relationships between them:

```html
<parent>
  <brother></brother>
  <sister>
    <grandson></grandson>
  </sister>
</parent>
```

* `<parent>` is the _parent_ tag of `<brother>` and `<sister>`
* `<brother>` and `<sister>` are the _children_ or _direct descendant_ tags of `<parent>`
* `<brother>`, `<sister>`, and `<grandson>` are the _descendant_ tags of `<parent>`
* `<brother>` and `<sister>` are _sibling_ tags
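
To make these relationships concrete, here is a minimal sketch that walks the snippet above and records each tag's nesting depth and parent. It uses `html.parser` from Python's standard library rather than the tools introduced later in this unit; the `TreePrinter` class and its `records` list are names made up for this illustration.

```python
from html.parser import HTMLParser

SNIPPET = """
<parent>
  <brother></brother>
  <sister>
    <grandson></grandson>
  </sister>
</parent>
"""

class TreePrinter(HTMLParser):
    """Record each opening tag together with its nesting depth and parent."""

    def __init__(self):
        super().__init__()
        self.stack = []    # tags that are currently open, outermost first
        self.records = []  # (depth, tag, parent) tuples in document order

    def handle_starttag(self, tag, attrs):
        parent = self.stack[-1] if self.stack else None
        self.records.append((len(self.stack), tag, parent))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Assumes well-formed input: every closing tag matches the last open tag
        self.stack.pop()

printer = TreePrinter()
printer.feed(SNIPPET)
for depth, tag, parent in printer.records:
    print("  " * depth + tag, "| parent:", parent)
```

Tags recorded at the same depth with the same parent are siblings; every tag recorded while `parent` is still on the stack is one of its descendants.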

Here's a very simple web document:

```html
<!DOCTYPE html>
<html>
  <head>
    <title>Page Title</title> 
  </head>

  <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
  </body>
</html> 
```

Here, `<h1>` and `<p>` are sibling tags, `<body>` is their parent tag, and all three are descendant tags of `<html>`.

When you access any URL, your browser (Chrome, Firefox, Safari, IE, etc.) is actually reading a document such as this one and using the tags within the document to decide how to render the page for you.

Jupyter is able to render a (python) string of HTML code as real HTML in the notebook itself!

In [13]:
from IPython.display import HTML

first_html = """
<!DOCTYPE html>
<html>
  <head>
    <title>Page Title</title>
  </head>
  
  <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
  </body>

</html> 
"""

HTML(first_html)

#### Let's look at what the different tags mean:

```html
<!-- This is how you write a comment in HTML. Comments will not show up in the browser -->

<!-- This line simply identifies the document type to be HTML-->
<!DOCTYPE html>
<!-- Content between <html> and </html> tags define everything about the document-->
<html>
  <!-- Tags inside the <head> are not rendered but provide general information about the document -->
  <head>
    <!-- Like the <title> tag which provides a title that appears in the browser's title and tab bars -->
    <title>Page Title</title>
  </head>
  
  <!-- Anything inside the <body> tags describes visible page content -->
  <body>
    <!-- The <h1> defines a header. The number defines the size of the header. -->
    <!-- There are 6 levels of headers: <h1> to <h6> -->
    <!-- The higher the number, the smaller the font used to display it. -->
    <h1>My First Heading</h1>
    <!-- The <p> represents a paragraph.-->
    <p>My first paragraph.</p>
  </body>
</html>
```

**Different levels of headers**

```html
<h1>This is heading 1</h1>
<h2>This is heading 2</h2>
<h3>This is heading 3</h3>
<h4>This is heading 4</h4>
<h5>This is heading 5</h5>
<h6>This is heading 6</h6> 
```

**Links**
```html
<a href="http://www.website.com">Click to go to website.com</a>
```

**Images**
```html
<!-- Notice that the image tag has no closing tag and no content outside the opening tag -->
<img src="smiley.gif">
```

**Lists**
```html
<!-- Unordered (bulleted) list -->
<ul>
  <li>One Element</li>
  <li>Another Element</li>
</ul>

<!-- Ordered (numbered) list -->
<ol>
  <li>First Ordered Element</li>
  <li>Second Ordered Element</li>
</ol>
```

**Tables**
```html
<table>
  <!-- An HTML table is defined as a series of rows (<tr>) -->
  <!-- The individual cell (<td>) contents are nested inside rows -->
  
  <!-- The header row is optional; its <tr> is the parent of the column headers (<th>) -->
  <tr>
    <th>First Header</th>
    <th>Second Header</th>
  </tr>
  <tr>
    <td>Row 2, Col 1</td>
    <td>Row 2, Col 2</td>
  </tr>
  <tr>
    <td>Row 3, Col 1</td>
    <td>Row 3, Col 2</td>
  </tr>
</table>
```

In [4]:
more_tags = """
<html>
<head>
  <title>More HTML Tags</title>
</head>
<body>
  <h1>This is heading 1</h1>
  <h2>This is heading 2</h2>
  <h3>This is heading 3</h3>
  <h4>This is heading 4</h4>
  <h5>This is heading 5</h5>
  <h6>This is heading 6</h6>

  <br>
  
  <a href="http://www.website.com">Click to go to website.com</a>

  <p><img src="../Data/www/images/smiley.png" alt="smiley face"></p>

  <ul>
    <li>One Element</li>
    <li>Another Element</li>
  </ul>

  <ol>
    <li>First Ordered Element</li>
    <li>Second Ordered Element</li>
  </ol>

  <table>
    <!-- An HTML table is defined as a series of rows (<tr>) -->
    <!-- The individual cell (<td>) contents are nested inside rows -->
    <tr>
      <!-- The <th> tag defines a column header -->
      <th>First Header</th>
      <th>Second Header</th>
    </tr>
    <tr>
      <td>Row 2, Col 1</td>
      <td>Row 2, Col 2</td>
    </tr>
    <tr>
      <td>Row 3, Col 1</td>
      <td>Row 3, Col 2</td>
    </tr>
  </table>
</body>
</html>
"""

HTML(more_tags)

First Header,Second Header
"Row 2, Col 1","Row 2, Col 2"
"Row 3, Col 1","Row 3, Col 2"


+++

If you want to know more about HTML, I recommend the excellent w3schools website: http://www.w3schools.com/html/html_intro.asp


#### Ok, back to web scraping

Now we are all HTML experts. Great! We're almost ready to start parsing and analyzing a scraped web page. There's just one last item of business we need to discuss before we get started.

### Viewing a page's source code

In order to extract elements of interest from a webpage we need to know where they sit in the webpage's HTML tree.
This means that you need to look at a webpage's HTML source code before you can even start scraping it. What's more, while scraping you will be switching back and forth between your code (we'll get there really soon, I promise!) and the webpage's source code.

How do we view a page's source code then?

* To view the **full page** source code:
  1. Right-click anywhere on the webpage **that is not a link**
  2. Click "View Page Source" (<kbd>CTRL</kbd>+<kbd>U</kbd>) in Firefox or Chrome, or "Show page source" (<kbd>&#8997;</kbd>+<kbd>&#8984;</kbd>+<kbd>U</kbd>) in Safari.
    * In order to view the source code in Safari the Develop menu must be enabled first: Preferences > Advanced > Show Develop menu in menu bar
    
* To view the source code zoomed-in on **a single element** (and with better formatting!):
  1. Right-click any element in the webpage.
  2. Click "Inspect Element"

##  Beautiful Soup, so rich and green, waiting in a hot tureen!

(*The Lobster Quadrille*, Alice in Wonderland)

We made it! We are now ready to start scraping web pages. To do so we are going to use [`BeautifulSoup`](http://www.crummy.com/software/BeautifulSoup/bs4/doc/), a powerful Python package for parsing web pages you have already downloaded. Normally you would use `requests` to GET the page and then `BeautifulSoup` to analyse it.

We will use the Wikipedia page for a player from Germany's national football team as an example: https://en.wikipedia.org/wiki/Erik_Durm, which has already been downloaded into the `Data/Day6-Web-Scraping/` folder. We start with a pre-downloaded HTML page so that there aren't a hundred requests for the same page hitting the same server at the same time (which will frequently result in you getting blocked from accessing that website!).

In [14]:
# Beautiful Soup version 4.x
import bs4

We start by opening the page and converting it to a `soup` object. Then, we use the `find` method to find the page's `<title>` tag and print its text.

In [16]:
# We specify the encoding of the file here because Windows
# has problems reading some characters in it.

with open("../Data/Day6-Web-Scraping/erik_durm_wiki.html", "r", encoding="utf-8") as wiki_file:
        soup = bs4.BeautifulSoup(wiki_file.read(), 'lxml')
        
title = soup.find('title')   #finds the FIRST <title> tag 
print(title.text)

Erik Durm - Wikipedia, the free encyclopedia


Beautiful Soup converts HTML tags into its own `Tag` objects. `Tag` objects have many useful attributes.

In [8]:
print(type(title))
print(title.text) # The text gives you the visible part of the tag
print(title.name) # The type of tag

<class 'bs4.element.Tag'>
Erik Durm - Wikipedia, the free encyclopedia
title


If a tag has any html attributes, they can be accessed in a very "pythonic" way. That is, they are organized as a dictionary!



In [9]:
h1 = soup.find("h1")

print(h1.attrs)
print(h1["class"])
print(h1["id"])

{'id': 'firstHeading', 'class': ['firstHeading'], 'lang': 'en'}
['firstHeading']
firstHeading
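
Because attributes behave like a dictionary, the usual dictionary caveats apply. The sketch below, run on a small hand-written snippet rather than the saved Wikipedia page, shows what happens when an attribute is missing; the variable names (`demo_soup`, `demo_h1`, and so on) are made up for this illustration, and the built-in `"html.parser"` is used so that no extra parser is needed.

```python
import bs4

snippet = '<h1 id="firstHeading" class="firstHeading" lang="en">Erik Durm</h1>'
demo_soup = bs4.BeautifulSoup(snippet, "html.parser")
demo_h1 = demo_soup.find("h1")

# Square brackets raise KeyError when the attribute is absent...
try:
    demo_h1["title"]
    missing_raises = False
except KeyError:
    missing_raises = True

# ...whereas .get() returns None, or a default of your choice.
title_attr = demo_h1.get("title")
title_or_default = demo_h1.get("title", "no title")

# "class" can hold several values, so it comes back as a list of strings.
class_values = demo_h1["class"]

print(missing_raises, title_attr, title_or_default, class_values)
```

When you loop over many tags and cannot be sure that every one of them carries the attribute, `.get()` is usually the safer choice.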


Instead of searching for `Tags` one by one, we can also retrieve them all at once.  As an example, let's find all level 2 headers. To this end, we use the `find_all` method.

In [17]:
headers = soup.find_all('h2')

print(headers)

[<h2>Contents</h2>, <h2><span class="mw-headline" id="Club_career">Club career</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Erik_Durm&amp;action=edit&amp;section=1" title="Edit section: Club career">edit</a><span class="mw-editsection-bracket">]</span></span></h2>, <h2><span class="mw-headline" id="International_career">International career</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Erik_Durm&amp;action=edit&amp;section=4" title="Edit section: International career">edit</a><span class="mw-editsection-bracket">]</span></span></h2>, <h2><span class="mw-headline" id="Career_statistics">Career statistics</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Erik_Durm&amp;action=edit&amp;section=7" title="Edit section: Career statistics">edit</a><span class="mw-editsection-bracket">]</span></span></h2>, <h2><spa

Too much information! In order to get only the information that we need, we must restrict the output to the desired attribute.

In [18]:
for header in headers:
    print(header.text)

Contents
Club career[edit]
International career[edit]
Career statistics[edit]
Honours[edit]
References[edit]
External links[edit]
Navigation menu


Another `Tag` that is useful, and that demonstrates some of the other useful attributes, is the `<a>` tag, which marks links to the webpages that our page points to:

In [19]:
links = soup.find_all('a')

for link in links[:10]:  # Showing just the first 10 links for brevity
    # href represents the target of the link
    # Where the link actually goes to!
    print('-----', link.text)
    print(link.get('href'))
    

----- 
None
----- navigation
#mw-head
----- search
#p-search
----- 
/wiki/File:Erik_Durm_IMG_1748.jpg
----- BVB
/wiki/Borussia_Dortmund
----- [1]
#cite_note-1
----- Pirmasens
/wiki/Pirmasens
----- Left back
/wiki/Defender_(association_football)#Full-back
----- Right back
/wiki/Defender_(association_football)#Full-back
----- Borussia Dortmund
/wiki/Borussia_Dortmund
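
Most of these `href` values are relative: they only make sense in combination with the address of the page they came from. A minimal sketch of turning them into full URLs with the standard library's `urljoin`, assuming the page's original address is the Wikipedia URL given earlier (the list `hrefs` is copied by hand from the output above):

```python
from urllib.parse import urljoin

# Assumption: the saved page originally lived at this address
base_url = "https://en.wikipedia.org/wiki/Erik_Durm"

# A few href values from the output above, including a missing one
hrefs = [None, "#mw-head", "/wiki/Borussia_Dortmund", "/wiki/Pirmasens"]

# Skip links without an href, resolve the rest against the base URL
absolute = [urljoin(base_url, href) for href in hrefs if href is not None]

for url in absolute:
    print(url)
```

Links that start with `#` point to a position inside the same page, so they resolve to the base URL plus that fragment.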


### Searching using attribute information

Some `Tag` elements have attributes associated with them, such as `id`, `class`, and `href`. A search can be restricted to tags whose attribute has a specific value, or to tags that simply have the attribute, whatever its value.

Note that in keyword searches we must write `class_` instead of `class`, because `class` is a reserved word in Python.



In [20]:
# Retrieve the element with the attribute "id" equal to "Early_career"
tag = soup.find(id="Early_career")
print(tag)
print(tag.text)

<span class="mw-headline" id="Early_career">Early career</span>
Early career


In [21]:
# Retrieve all elements with an href attribute
all_links = soup.find_all(href=True)
print(len(all_links))

373


In [22]:
# Retrieve inline citations -- they are <sup> elements with the class "reference"
soup.find_all("sup", class_="reference")[5:15]

[<sup class="reference" id="cite_ref-6"><a href="#cite_note-6"><span>[</span>6<span>]</span></a></sup>,
 <sup class="reference" id="cite_ref-7"><a href="#cite_note-7"><span>[</span>7<span>]</span></a></sup>,
 <sup class="reference" id="cite_ref-8"><a href="#cite_note-8"><span>[</span>8<span>]</span></a></sup>,
 <sup class="reference" id="cite_ref-9"><a href="#cite_note-9"><span>[</span>9<span>]</span></a></sup>,
 <sup class="reference" id="cite_ref-10"><a href="#cite_note-10"><span>[</span>10<span>]</span></a></sup>,
 <sup class="reference" id="cite_ref-11"><a href="#cite_note-11"><span>[</span>11<span>]</span></a></sup>,
 <sup class="reference" id="cite_ref-12"><a href="#cite_note-12"><span>[</span>12<span>]</span></a></sup>,
 <sup class="reference" id="cite_ref-2014_German_Super_Cup_13-0"><a href="#cite_note-2014_German_Super_Cup-13"><span>[</span>13<span>]</span></a></sup>,
 <sup class="reference" id="cite_ref-14"><a href="#cite_note-14"><span>[</span>14<span>]</span></a></sup>,
 <s

In [23]:
# Retrieve all tags with class=mw-headline and an id attribute (regardless of value)
soup.find_all(attrs={"class": "mw-headline", "id": True})

[<span class="mw-headline" id="Club_career">Club career</span>,
 <span class="mw-headline" id="Early_career">Early career</span>,
 <span class="mw-headline" id="Borussia_Dortmund">Borussia Dortmund</span>,
 <span class="mw-headline" id="International_career">International career</span>,
 <span class="mw-headline" id="Youth">Youth</span>,
 <span class="mw-headline" id="Senior">Senior</span>,
 <span class="mw-headline" id="Career_statistics">Career statistics</span>,
 <span class="mw-headline" id="Club">Club</span>,
 <span class="mw-headline" id="International">International</span>,
 <span class="mw-headline" id="Honours">Honours</span>,
 <span class="mw-headline" id="Club_2">Club</span>,
 <span class="mw-headline" id="International_2">International</span>,
 <span class="mw-headline" id="References">References</span>,
 <span class="mw-headline" id="External_links">External links</span>]
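
Attribute searches are not limited to exact strings and `True`. The sketch below illustrates three other kinds of filter that `find_all` accepts (a compiled regular expression, a list of tag names, and a function) on a small hand-written snippet modelled on the page; the variable names are made up for this illustration.

```python
import re
import bs4

html_doc = """
<h2><span class="mw-headline" id="Club_career">Club career</span></h2>
<h3><span class="mw-headline" id="Early_career">Early career</span></h3>
<h2><span class="mw-headline" id="Honours">Honours</span></h2>
<sup class="reference" id="cite_ref-6"><a href="#cite_note-6">[6]</a></sup>
"""
demo_soup = bs4.BeautifulSoup(html_doc, "html.parser")

# A compiled regular expression is searched for inside each attribute value
career_ids = [tag["id"] for tag in demo_soup.find_all(id=re.compile("career"))]

# A list of tag names matches any one of them
heading_names = [tag.name for tag in demo_soup.find_all(["h2", "h3"])]

# A function receives the attribute value (None if the attribute is absent)
internal_links = demo_soup.find_all(
    href=lambda value: value is not None and value.startswith("#")
)

print(career_ids)
print(heading_names)
print(len(internal_links))
```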

#### A little more HTML

`class` and `id` are special HTML attributes that allow for a rich connection between HTML and CSS and Javascript. Feel free to google the subject. We won't go into the details here. Just know that:

* The `id` attribute is used to uniquely identify a tag. This means that all `id` attributes should have different values in a webpage.

* The `class` attribute is used to identify tags which share certain properties. A tag can have more than one `class` value:
```html
   <!-- Separate extra classes by a space -->
   <tag class="first_class second_class">...</tag>
```

In the inline-citation example above, notice that all reference elements (`<sup>` tags) have the same `class` value but different `id` values.
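
The fact that one tag can carry several `class` values affects how searches behave. A small sketch on made-up HTML (the class names `note` and `warning` and the `id` values are invented for this illustration):

```python
import bs4

html_doc = """
<p class="note warning" id="first">A</p>
<p class="note" id="second">B</p>
<p class="warning" id="third">C</p>
"""
demo_soup = bs4.BeautifulSoup(html_doc, "html.parser")

# class_ matches a tag if ANY one of its class values equals the search term
notes = [tag["id"] for tag in demo_soup.find_all("p", class_="note")]

# A CSS selector can require several classes at the same time
both = [tag["id"] for tag in demo_soup.select("p.note.warning")]

# The multi-valued attribute comes back as a list of strings
first_classes = demo_soup.find(id="first")["class"]

print(notes, both, first_classes)
```

In other words, searching by a single class is a "contains" test rather than an exact match on the whole attribute string.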

+++

### Navigating the HTML tree with BeautifulSoup


Besides searching for elements anywhere in the whole HTML tree, Beautiful Soup also allows you to navigate the tree in any direction.

Let's try to get at the first paragraph (`<p>`) in the `Club career` section starting from the section's title tag.

Here's the relevant HTML snippet:

```html
    <h2>
      <span class="mw-headline" id="Club_career">Club career</span>
      <span class="mw-editsection">
        <span class="mw-editsection-bracket">[</span>
        <a href="/w/index.php?title=Erik_Durm&amp;action=edit&amp;section=1" title="Edit section: Club career">edit</a>
        <span class="mw-editsection-bracket">]</span>
      </span>
    </h2>
    <h3>
      <span class="mw-headline" id="Early_career">Early career</span>
      <span class="mw-editsection">
        <span class="mw-editsection-bracket">[</span>
        <a href="/w/index.php?title=Erik_Durm&amp;action=edit&amp;section=2" title="Edit section: Early career">edit</a>
        <span class="mw-editsection-bracket">]</span>
      </span>
    </h3>
    <p>Durm began his club career in 1998 at the academy of SG Rieschweiler....</p>
```

We can see that the paragraph sits *under* the "Club career" title:

In [25]:
section_headline = soup.find(id="Club_career")
print(section_headline)
print(section_headline.text)
section_headline.contents

<span class="mw-headline" id="Club_career">Club career</span>
Club career


['Club career']

The `contents` attribute lets us access everything that is inside a given tag. In this case we find only the visible text of the tag.

Looking at the webpage snippet, we see that the tag `<p>` is at the same level as the tags `<h2>` and `<h3>`.  Hence, we need to navigate up one level (to the `<h2>` tag), then navigate to its second sibling (first `<h3>` then `<p>`).

In [30]:
parent_h2 = section_headline.parent  # Up one level
print( parent_h2.name == "h2" )      # Is it the <h2> tag?
print(parent_h2.contents)            

True
[<span class="mw-headline" id="Club_career">Club career</span>, <span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Erik_Durm&amp;action=edit&amp;section=1" title="Edit section: Club career">edit</a><span class="mw-editsection-bracket">]</span></span>]


In [31]:
one_step = parent_h2.next_sibling
print(one_step.name)

None


In [33]:
two_steps = one_step.next_sibling
print(two_steps.name)

h3


We are only at the `<h3>` tag even though we moved two siblings forward. The reason is that not every sibling in the soup is an HTML element: the whitespace between tags (such as a line break) is stored as a text node, whose `name` is `None`.
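
To see these in-between siblings directly, here is a minimal sketch on a hand-written snippet in which the tags are separated by line breaks, as they are in the saved page. The snippet and the variable names are made up for this illustration.

```python
import bs4
from bs4.element import NavigableString, Tag

html_doc = "<div><h2>Club career</h2>\n<h3>Early career</h3>\n<p>First paragraph.</p></div>"
demo_soup = bs4.BeautifulSoup(html_doc, "html.parser")
demo_h2 = demo_soup.find("h2")

# The very next sibling is the line break between </h2> and <h3>
gap = demo_h2.next_sibling
gap_is_text = isinstance(gap, NavigableString)

# Listing every following sibling shows text nodes and tags alternating
kinds = [type(node).__name__ for node in demo_h2.next_siblings]

# Keeping only Tag objects skips the whitespace
tags_only = [node.name for node in demo_h2.next_siblings if isinstance(node, Tag)]

print(repr(str(gap)), gap_is_text)
print(kinds)
print(tags_only)
```

Filtering with `isinstance(node, Tag)` is one way to step over whitespace when you walk siblings by hand.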

In [35]:
three_steps = two_steps.next_sibling
print(three_steps.name)

None


In [36]:
four_steps = three_steps.next_sibling
print(four_steps.name)

p


In [38]:
print(four_steps.contents)

['Durm began his club career in 1998 at the academy of SG Rieschweiler, before joining the academy of ', <a href="/wiki/1._FC_Saarbr%C3%BCcken" title="1. FC Saarbrücken">1. FC Saarbrücken</a>, ' in 2008 where he became youth league top scorer of the 2009–2010 season with 13 goals.', <sup class="reference" id="cite_ref-pfaelzischer-merkur.de_2-0"><a href="#cite_note-pfaelzischer-merkur.de-2"><span>[</span>2<span>]</span></a></sup>, ' In July 2010, Durm was enrolled at the academy of ', <a href="/wiki/1._FSV_Mainz_05" title="1. FSV Mainz 05">1. FSV Mainz 05</a>, ' and won the 2010–11 Youth Federation Cup in Germany and Durm debuted and played his only game of the 2010–11 season for the second team of 1. FSV Mainz 05 on 4 December 2010 against ', <a href="/wiki/SV_Elversberg" title="SV Elversberg">SV Elversberg</a>, ' in the German ', <a href="/wiki/Regionalliga" title="Regionalliga">Regionalliga</a>, '.', <sup class="reference" id="cite_ref-3"><a href="#cite_note-3"><span>[</span>3<span>

Ok. Now we are where we wanted to be: we have the contents of the `<p>` tag. The detour through the whitespace siblings is something to always be mindful of. Web scraping can be, and very frequently will be, messy and will involve trial and error...

We can see that the contents of our desired element form a list. Let's obtain the number of elements and check what they contain.

In [39]:
print(len(four_steps.contents))
print(four_steps.contents[1])
print(four_steps.contents[5])

12
<a href="/wiki/1._FC_Saarbr%C3%BCcken" title="1. FC Saarbrücken">1. FC Saarbrücken</a>
<a href="/wiki/1._FSV_Mainz_05" title="1. FSV Mainz 05">1. FSV Mainz 05</a>


To find the desired tag, we chose an easily identifiable starting point (`id` is great because its value must be unique), navigated up the HTML tree to the correct parent, and then traversed its siblings until we got to the right one.

Clearly, this is not a very elegant solution. If there were hundreds of siblings that would have been very cumbersome. Fortunately, there is an alternative way:

In [41]:
parent_h2.find_next_sibling("p")

<p>Durm began his club career in 1998 at the academy of SG Rieschweiler, before joining the academy of <a href="/wiki/1._FC_Saarbr%C3%BCcken" title="1. FC Saarbrücken">1. FC Saarbrücken</a> in 2008 where he became youth league top scorer of the 2009–2010 season with 13 goals.<sup class="reference" id="cite_ref-pfaelzischer-merkur.de_2-0"><a href="#cite_note-pfaelzischer-merkur.de-2"><span>[</span>2<span>]</span></a></sup> In July 2010, Durm was enrolled at the academy of <a href="/wiki/1._FSV_Mainz_05" title="1. FSV Mainz 05">1. FSV Mainz 05</a> and won the 2010–11 Youth Federation Cup in Germany and Durm debuted and played his only game of the 2010–11 season for the second team of 1. FSV Mainz 05 on 4 December 2010 against <a href="/wiki/SV_Elversberg" title="SV Elversberg">SV Elversberg</a> in the German <a href="/wiki/Regionalliga" title="Regionalliga">Regionalliga</a>.<sup class="reference" id="cite_ref-3"><a href="#cite_note-3"><span>[</span>3<span>]</span></a></sup></p>

Much nicer!

Besides the `find_next_sibling` method, there are also `find_previous_sibling`, `find_next_siblings`, `find_previous_siblings`, `find_parent`, and many others.

The [Beautiful Soup documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) has a comprehensive list of all these methods. There is no need to memorize all of them. It's more important to realize that, as with any programming language, there is more than one way to get any element of the html tree. The trick is to *pick a good starting point* from where to start the scraping.
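
As an illustration of picking a starting point and moving from there, the sketch below combines a few of the navigation methods on a simplified, hand-written stand-in for the "Club career" section. The `id="content"` wrapper and the paragraph texts are made up; only the overall shape mirrors the real page.

```python
import bs4

html_doc = """
<div id="content">
  <h2><span id="Club_career">Club career</span></h2>
  <h3>Early career</h3>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""
demo_soup = bs4.BeautifulSoup(html_doc, "html.parser")

# Start from the unique id, then move up to the enclosing <h2>
headline = demo_soup.find(id="Club_career")
demo_h2 = headline.find_parent("h2")

# First matching sibling after the <h2>, and all matching siblings
first_p = demo_h2.find_next_sibling("p")
all_p = demo_h2.find_next_siblings("p")

# Moving backwards and further up works the same way
previous_h3 = first_p.find_previous_sibling("h3")
container = headline.find_parent("div")

print(first_p.text)
print(len(all_p))
print(previous_h3.text, container["id"])
```

Because these methods search by tag name, the whitespace text nodes between tags are skipped automatically.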

## Scraping images from a webpage

You can also use Beautiful Soup to get the source of an image from a webpage. It works just the same as for text.

In [42]:
# Some modules that will allow us to display images and other media in the notebook itself
from IPython.display import display, Image

In [44]:
for image in soup.find_all('img'):
    print(image)

<img src="www/images/Erik_Durm_IMG_1748.jpg"/>
<img src="www/images/Erik_Durm20140714_0009.jpg"/>
<img alt="Germany" class="thumbborder" data-file-height="600" data-file-width="1000" height="30" src="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/50px-Flag_of_Germany.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/75px-Flag_of_Germany.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/100px-Flag_of_Germany.svg.png 2x" width="50"/>
<img alt="" height="1" src="//en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1" style="border: none; position: absolute;" title="" width="1"/>
<img alt="Wikimedia Foundation" height="31" src="/static/images/wikimedia-button.png" srcset="/static/images/wikimedia-button-1.5x.png 1.5x, /static/images/wikimedia-button-2x.png 2x" width="88"/>
<img alt="Powered by MediaWiki" height="31" src="https://en.wikipedia.org/static/1.26wmf19/resources/assets/poweredby_mediawik

We can pinpoint a specific image and get its attributes:

In [45]:
images = soup.find_all('img')
img0 = images[0]
print(img0.attrs)

{'src': 'www/images/Erik_Durm_IMG_1748.jpg'}


Then we can display the image using its `src` attribute:

In [46]:
display(Image(url='../Data/' + img0['src']))

display(Image(url='../Data/' + images[1]['src']))


## Exercise: Scraping results from your Personality profile

For this exercise you will use your results from the personality quiz at [HEXACO](http://hexaco.org/hexaco-online). You did take the quiz, right? :)

Save the page with the quiz results to: `<path to the bootcamp directory>/Data/my_hexaco.html`

In [136]:
with open("../Data/my_hexaco.html", "r", encoding="utf-8") as hexaco_file:
    # Name the parser explicitly, as in the earlier cells
    soup = bs4.BeautifulSoup(hexaco_file.read(), "lxml")

1 - Find the `<table>` element that contains your results.

In [134]:
table = soup.find('table') # your search terms inside the `find` method
# tr_tags=table.find_all('tr')
# tr_tags[2].find('b').text
# td_tags=tr_tags[2].find_all('td')
# td_tags

2 -  Find all the scale names using the `table` variable from above

In [152]:
with open("../Data/my_hexaco.html", "r", encoding="utf-8") as hexaco_file:
    soup = bs4.BeautifulSoup(hexaco_file.read(), "lxml")
table = soup.find("table")

# Find all table rows, skipping the first two which don't matter
for tag in table.find_all("tr")[2:]:
    cells = tag.find_all("td")
    scale = cells[0]
    score = cells[1].select("div")[1]
    # Remove the nested description and button so only the scale name remains
    scale.find("div", class_="description").decompose()
    scale.find("a", class_="btn").decompose()
    # Print the scale name and the part of the title attribute after the colon
    print(scale.text.strip(), score["title"].split(":")[1].strip())
    # print(scale.text.strip(), score["data-original-title"].split(":")[1].strip())

Honesty-Humility 3.06
Sincerity 3.25
Fairness 3.25
Greed-Avoidance 3.50
Modesty 2.25
Emotionality 2.81
Fearfulness 2.50
Anxiety 2.00
Dependence 3.50
Sentimentality 3.25
eXtraversion 2.75
Social Self-Esteem 2.25
Social Boldness 4.00
Sociability 2.75
Liveliness 2.00
Agreeableness 2.88
Forgivingness 2.25
Gentleness 3.00
Flexibility 4.00
Patience 2.25
Conscientiousness 3.00
Organization 2.50
Diligence 3.00
Perfectionism 2.75
Prudence 3.75
Openness to Experience 2.94
Aesthetic Appreciation 3.50
Inquisitiveness 2.75
Creativity 3.25
Unconventionality 2.25
Altruism 3.00


3 - Now get both the scale names and your own scores associated with each scale

In [101]:
# Find all table rows, skipping the first two which don't matter
for tag in table.find_all("tr")[2:]:
    cells = tag.find_all("td")
    
    # Your code here