# Chapter 4: Encoding and Annotation Schemes
## Building a scraper

Programs from the book: [_Python for Natural Language Processing_](https://link.springer.com/book/9783031575488)

__Author__: Pierre Nugues

#### Using `requests`

In [1]:
import requests

url_en = 'https://en.wikipedia.org/wiki/Aristotle'
url_fr = 'https://fr.wikipedia.org/wiki/Aristote'
# A timeout keeps the call from hanging if the server does not answer
html_doc = requests.get(url_en, timeout=10).text
print(html_doc[:2000])

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-enabled vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-disabled skin-theme-clientpref-day vector-toc-available" lang="en" dir="ltr">
<head>
<meta charset="UTF-8">
<title>Aristotle - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limite

## Parsing HTML and a Wikipedia page

#### We import the modules

In [2]:
import bs4
from urllib.parse import urljoin

#### We load a page and parse it

In [3]:
url_en = 'https://en.wikipedia.org/wiki/Aristotle'
html_doc = requests.get(url_en, timeout=10).text
parse_tree = bs4.BeautifulSoup(html_doc, 'html.parser')

#### We extract elements

In [4]:
parse_tree.title

<title>Aristotle - Wikipedia</title>

In [5]:
parse_tree.title.text

'Aristotle - Wikipedia'

#### We extract header 1

In [7]:
parse_tree.h1.text
# Aristotle

'Aristotle'

#### We extract all the headers h2

In [8]:
headings = parse_tree.find_all('h2')
[heading.text for heading in headings]

['Contents',
 'Life',
 'Theoretical philosophy',
 'Natural philosophy',
 'Practical philosophy',
 'Transmission',
 'Surviving works',
 'Depictions in art',
 'Eponyms',
 'See also',
 'References',
 'Further reading',
 'External links']
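
To get a rough idea of what `find_all('h2')` does, the sketch below collects `h2` texts with the standard-library `html.parser` module instead of Beautiful Soup. This is a swapped-in technique for illustration only: the `HeadingCollector` class and the `sample` string are hypothetical and are not part of the book's code.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Hypothetical helper: collect the text of every <h2> element."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        # Start a new, empty heading when an <h2> opens
        if tag == 'h2':
            self.in_h2 = True
            self.headings.append('')

    def handle_endtag(self, tag):
        if tag == 'h2':
            self.in_h2 = False

    def handle_data(self, data):
        # Text inside nested tags such as <i> is appended too
        if self.in_h2:
            self.headings[-1] += data


# A small made-up document, not the Wikipedia page
sample = '<h1>Aristotle</h1><h2>Life</h2><p>...</p><h2>See <i>also</i></h2>'
collector = HeadingCollector()
collector.feed(sample)
collector.close()
headings = collector.headings
print(headings)
```

Beautiful Soup builds a full tree and copes with malformed markup, so it remains the more convenient choice for real pages.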

#### We extract the links

In [9]:
links = parse_tree.find_all('a', href=True)
links[:5]

[<a class="mw-jump-link" href="#bodyContent">Jump to content</a>,
 <a accesskey="z" href="/wiki/Main_Page" title="Visit the main page [z]"><span>Main page</span></a>,
 <a href="/wiki/Wikipedia:Contents" title="Guides to browsing Wikipedia"><span>Contents</span></a>,
 <a href="/wiki/Portal:Current_events" title="Articles related to current events"><span>Current events</span></a>,
 <a accesskey="x" href="/wiki/Special:Random" title="Visit a randomly selected article [x]"><span>Random article</span></a>]

#### The labels

In [10]:
[link.text for link in links][:15]

['Jump to content',
 'Main page',
 'Contents',
 'Current events',
 'Random article',
 'About Wikipedia',
 'Contact us',
 'Donate',
 'Help',
 'Learn to edit',
 'Community portal',
 'Recent changes',
 'Upload file',
 '\n\n\n\n\n\n',
 '\nSearch\n']
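
Some labels in the output above consist only of newlines or are padded with them. A minimal sketch of one way to normalize them follows, assuming plain string processing is enough; the `labels` list is a hand-copied sample of that output and `cleaned` is a name introduced here for illustration.

```python
# Sample labels copied from the output above
labels = ['Jump to content', 'Upload file', '\n\n\n\n\n\n', '\nSearch\n']

# str.split() with no argument splits on any run of whitespace and ignores
# leading and trailing whitespace, so joining the pieces normalizes a label
cleaned = [' '.join(label.split()) for label in labels]

# Drop the labels that contained whitespace only
cleaned = [label for label in cleaned if label]
print(cleaned)
```

With the notebook's variables, the same expression could be applied to `link.text` for each element of `links`.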

#### The links

In [11]:
[link.get('href') for link in links][:15]

['#bodyContent',
 '/wiki/Main_Page',
 '/wiki/Wikipedia:Contents',
 '/wiki/Portal:Current_events',
 '/wiki/Special:Random',
 '/wiki/Wikipedia:About',
 '//en.wikipedia.org/wiki/Wikipedia:Contact_us',
 'https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en',
 '/wiki/Help:Contents',
 '/wiki/Help:Introduction',
 '/wiki/Wikipedia:Community_portal',
 '/wiki/Special:RecentChanges',
 '/wiki/Wikipedia:File_upload_wizard',
 '/wiki/Main_Page',
 '/wiki/Special:Search']
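
Many of these addresses point to navigation, help, or special pages rather than to articles. The sketch below shows one possible heuristic to keep article links only. The `is_article_link` function is hypothetical, and its rule (a `/wiki/` path with no colon after the prefix) is an assumption that would also discard the articles whose title contains a colon.

```python
# Sample hrefs copied from the output above
hrefs = ['#bodyContent',
         '/wiki/Main_Page',
         '/wiki/Wikipedia:Contents',
         '//en.wikipedia.org/wiki/Wikipedia:Contact_us',
         '/wiki/Help:Contents',
         '/wiki/Special:Search']


def is_article_link(href):
    """Hypothetical filter: a /wiki/ path without a namespace prefix."""
    prefix = '/wiki/'
    return href.startswith(prefix) and ':' not in href[len(prefix):]


article_hrefs = [href for href in hrefs if is_article_link(href)]
print(article_hrefs)
```

Applied to the notebook's `links` list, the filter would take `link.get('href')` as argument.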

#### The absolute addresses

In [12]:
# Every link was selected with href=True, so link['href'] always exists
out = [urljoin(url_en, link['href']) for link in links]
out[:15]

['https://en.wikipedia.org/wiki/Aristotle#bodyContent',
 'https://en.wikipedia.org/wiki/Main_Page',
 'https://en.wikipedia.org/wiki/Wikipedia:Contents',
 'https://en.wikipedia.org/wiki/Portal:Current_events',
 'https://en.wikipedia.org/wiki/Special:Random',
 'https://en.wikipedia.org/wiki/Wikipedia:About',
 'https://en.wikipedia.org/wiki/Wikipedia:Contact_us',
 'https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en',
 'https://en.wikipedia.org/wiki/Help:Contents',
 'https://en.wikipedia.org/wiki/Help:Introduction',
 'https://en.wikipedia.org/wiki/Wikipedia:Community_portal',
 'https://en.wikipedia.org/wiki/Special:RecentChanges',
 'https://en.wikipedia.org/wiki/Wikipedia:File_upload_wizard',
 'https://en.wikipedia.org/wiki/Main_Page',
 'https://en.wikipedia.org/wiki/Special:Search']
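
The output above mixes several kinds of `href` values. As a small illustration of how `urljoin` resolves each kind against the page address, here is a sketch with one example per case. The `examples` dictionary and its keys are names introduced here, and the donation address is shortened to its host.

```python
from urllib.parse import urljoin

base = 'https://en.wikipedia.org/wiki/Aristotle'

# One href per kind; the first three are copied from the list of links
examples = {
    'fragment': '#bodyContent',
    'root-relative': '/wiki/Main_Page',
    'protocol-relative': '//en.wikipedia.org/wiki/Wikipedia:Contact_us',
    'absolute': 'https://donate.wikimedia.org/',
}

# urljoin keeps the scheme and host of base when the href lacks them
resolved = {kind: urljoin(base, href) for kind, href in examples.items()}
for kind, url in resolved.items():
    print(f'{kind:18} {url}')
```

A fragment is appended to the full page address, a root-relative path replaces the path of `base`, a protocol-relative address only inherits the scheme, and an absolute address is returned unchanged.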