# Requesting pages from the Internet and searching within a page
1. Request the main page of the site
2. Extract the list of links
3. Go through all the pages and calculate their length
4. Based on this data, try to find out who the main character of the Harry Potter books is

Let's try to request the main page of our educational site from the Internet using the `requests` module:

In [1]:
import requests   # Importing the requests module

In [2]:
# Using the get function from the requests module, we will request the site page and save the result to the site variable
site = requests.get('https://online.hse.ru/python-as-foreign/1/')  

print(site)

<Response [200]>


In [3]:
# error 404

The number 200 is a status code: it means that the computer has successfully found and received the page from the Internet. Let's see what happens if we request a non-existent page:

In [4]:
print(requests.get('https://online.hse.ru/python-as-foreign/1/11.html'))

<Response [404]>


The number 404 means that "the page could not be found". You may have seen this number yourself when trying to open a non-existent page. We can find out whether the page was found by accessing the variable `site.status_code`:

In [5]:
print(site.status_code)

200
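
Python's standard library already knows the names of these status codes. The sketch below is only an illustration and is not part of the `requests` module: the helper name `describe_status` is hypothetical, and the code uses the built-in `http` module.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the status code together with its standard short description."""
    status = HTTPStatus(code)  # raises ValueError if the code is unknown
    return f'{status.value} {status.phrase}'


print(describe_status(200))
print(describe_status(404))
```

The same helper could be called with `site.status_code` after a real request.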


To get the HTML code of the page as text, access the variable `site.text`:

In [6]:
print(site.text)

<html>
<head>
    <meta charset="UTF-8">
    <title>Оглавление</title>
</head>
<body>
    Статьи о персонажах: <a href="1.html">Гарри Поттер</a>,
    <a href="2.html">Джинни Уизли</a>,
    <a href="3.html">Лили Поттер</a>,
    <a href="4.html">Гермиона Грейнджер</a>,
    <a href="5.html">Сириус Блэк</a>,
    <a href="6.html">Рубеус Хагрид</a>,
    <a href="7.html">Рон Уизли</a>,
    <a href="8.html">Астория Гринграсс</a>,
    <a href="9.html">Люциус Малфой</a>,
    <a href="10.html">Драко Малфой</a>,
    <a href="11.html">Беллатриса Лестрейндж</a>.<br>
    По материалам <a href="https://harrypotter.fandom.com">Гарри Поттер вики</a>. <br>Распространяется на условиях CC-BY-SA
</body>
</html>


Variables like these, which belong to other variables, are called **attributes**. Attributes are syntactically similar to methods: they are also written with a dot after the name of the variable they belong to, but they have no parentheses, since they are not commands.
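
The difference between an attribute and a method can be seen on a toy example. The `Page` class below is hypothetical and only imitates a response object; it is not the class that `requests` actually returns.

```python
class Page:
    """A toy stand-in for a downloaded page (illustration only)."""

    def __init__(self, status_code, text):
        # Attributes: values stored inside the object
        self.status_code = status_code
        self.text = text

    def length(self):
        # Method: a command that computes something when called
        return len(self.text)


page = Page(200, '<html></html>')
print(page.status_code)  # attribute: a dot, no parentheses
print(page.length())     # method: a dot and parentheses
```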

If something strange appears instead of the Russian letters, the page was decoded with the wrong encoding, and the computer cannot read them. We can check which encoding was used by referring to the `encoding` attribute:

In [7]:
import requests   
site = requests.get('https://online.hse.ru/python-as-foreign/1/')
print(f'The page loaded in encoding {site.encoding}')

The page loaded in encoding UTF-8


Let's set the encoding explicitly to the [universal](https://ru.wikipedia.org/wiki/Юникод) UTF-8, which supports all the languages of the world:

In [8]:
site.encoding = 'utf-8'
print(site.text)

<html>
<head>
    <meta charset="UTF-8">
    <title>Оглавление</title>
</head>
<body>
    Статьи о персонажах: <a href="1.html">Гарри Поттер</a>,
    <a href="2.html">Джинни Уизли</a>,
    <a href="3.html">Лили Поттер</a>,
    <a href="4.html">Гермиона Грейнджер</a>,
    <a href="5.html">Сириус Блэк</a>,
    <a href="6.html">Рубеус Хагрид</a>,
    <a href="7.html">Рон Уизли</a>,
    <a href="8.html">Астория Гринграсс</a>,
    <a href="9.html">Люциус Малфой</a>,
    <a href="10.html">Драко Малфой</a>,
    <a href="11.html">Беллатриса Лестрейндж</a>.<br>
    По материалам <a href="https://harrypotter.fandom.com">Гарри Поттер вики</a>. <br>Распространяется на условиях CC-BY-SA
</body>
</html>


Sometimes simply specifying the correct encoding is not enough. In that case you need to explicitly decode the raw bytes received from the network into text with the correct encoding, using `.content.decode()`:

In [9]:
import requests   
site2 = requests.get('https://online.hse.ru/python-as-foreign/1/')
site_text = site2.content.decode('utf-8')
print(site_text)

<html>
<head>
    <meta charset="UTF-8">
    <title>Оглавление</title>
</head>
<body>
    Статьи о персонажах: <a href="1.html">Гарри Поттер</a>,
    <a href="2.html">Джинни Уизли</a>,
    <a href="3.html">Лили Поттер</a>,
    <a href="4.html">Гермиона Грейнджер</a>,
    <a href="5.html">Сириус Блэк</a>,
    <a href="6.html">Рубеус Хагрид</a>,
    <a href="7.html">Рон Уизли</a>,
    <a href="8.html">Астория Гринграсс</a>,
    <a href="9.html">Люциус Малфой</a>,
    <a href="10.html">Драко Малфой</a>,
    <a href="11.html">Беллатриса Лестрейндж</a>.<br>
    По материалам <a href="https://harrypotter.fandom.com">Гарри Поттер вики</a>. <br>Распространяется на условиях CC-BY-SA
</body>
</html>
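
To see why the wrong encoding produces unreadable text, here is a small offline sketch. It does not touch the network: the variable names are illustrative, and `latin-1` is chosen only as an example of a wrong encoding.

```python
title = 'Оглавление'

# A page travels over the network as bytes, not as letters
raw = title.encode('utf-8')
print(len(title), len(raw))  # each Cyrillic letter takes two bytes in UTF-8

# Decoding with the same encoding restores the original text
good = raw.decode('utf-8')
print(good)

# Decoding with a different encoding does not fail, but yields garbled characters
bad = raw.decode('latin-1')
print(repr(bad))
```

This is the same operation that `.content.decode('utf-8')` performs on the bytes of a downloaded page.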


Great, we can understand if a page has loaded, and if so, look at its text. To teach a computer how to work with this text, let's use the `BeautifulSoup` module:

In [10]:
from bs4 import BeautifulSoup

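# Let's ask the computer to recognize the html code of the site using BeautifulSoup.
# Naming the parser explicitly ('html.parser' is built into Python) avoids a warning.
soup = BeautifulSoup(site.text, 'html.parser')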

print(soup)

<html>
<head>
<meta charset="utf-8"/>
<title>Оглавление</title>
</head>
<body>
    Статьи о персонажах: <a href="1.html">Гарри Поттер</a>,
    <a href="2.html">Джинни Уизли</a>,
    <a href="3.html">Лили Поттер</a>,
    <a href="4.html">Гермиона Грейнджер</a>,
    <a href="5.html">Сириус Блэк</a>,
    <a href="6.html">Рубеус Хагрид</a>,
    <a href="7.html">Рон Уизли</a>,
    <a href="8.html">Астория Гринграсс</a>,
    <a href="9.html">Люциус Малфой</a>,
    <a href="10.html">Драко Малфой</a>,
    <a href="11.html">Беллатриса Лестрейндж</a>.<br/>
    По материалам <a href="https://harrypotter.fandom.com">Гарри Поттер вики</a>. <br/>Распространяется на условиях CC-BY-SA
</body>
</html>


It seems that nothing has changed. But now we can search within our page. For example, we can find links to other pages. Let's ask the computer to find a link (i.e. an `<a>...</a>` tag):

In [11]:
# Let's ask the computer to find the first link on the page
print(soup.find('a'))

<a href="1.html">Гарри Поттер</a>


Let's ask the computer to print the page address (in our case, `1.html`) and the link text (Гарри Поттер, i.e. Harry Potter) separately:

In [12]:
link = soup.find('a')

# Using the link.get() method, we can find out the value of any attribute we are interested in.
# The page address is in the href attribute, so we can find it through link.get('href')
href = link.get('href')

# The link text is in the link.text variable
print(f"File's link: {href}, link text: {link.text}")

File's link: 1.html, link text: Гарри Поттер


Simplify the code by getting rid of the `href` variable:

In [13]:
link = soup.find('a')

# We use double quotes inside the f-string so that the computer doesn't confuse
# the quotation marks of the string with the quotation marks around the name of the href attribute
print(f'File\'s path: {link.get("href")}, text link: {link.text}')

File's path: 1.html, text link: Гарри Поттер


Well, we have learned how to work with one link. Let's ask the computer to find them all:

In [14]:
# Let's ask the computer to find all the links on the page, and display a list of them
print(soup.find_all('a'))

[<a href="1.html">Гарри Поттер</a>, <a href="2.html">Джинни Уизли</a>, <a href="3.html">Лили Поттер</a>, <a href="4.html">Гермиона Грейнджер</a>, <a href="5.html">Сириус Блэк</a>, <a href="6.html">Рубеус Хагрид</a>, <a href="7.html">Рон Уизли</a>, <a href="8.html">Астория Гринграсс</a>, <a href="9.html">Люциус Малфой</a>, <a href="10.html">Драко Малфой</a>, <a href="11.html">Беллатриса Лестрейндж</a>, <a href="https://harrypotter.fandom.com">Гарри Поттер вики</a>]


BeautifulSoup offers a large number of methods for searching the code of web pages, but in this course we will use only the `find` and `find_all` methods.
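
The two methods behave differently when nothing matches, which is worth knowing before looping over results. The sketch below assumes a small hand-written HTML string instead of the real site, so it runs without a network connection.

```python
from bs4 import BeautifulSoup

# A hand-written fragment used only for this illustration
html = '<p>Articles: <a href="1.html">Harry Potter</a>, <a href="2.html">Ginny Weasley</a></p>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('a')          # the first matching tag
all_links = soup.find_all('a')  # a list of all matching tags
print(first.get('href'), first.text)
print(len(all_links))

# There are no <img> tags in the fragment
missing = soup.find('img')       # find gives None when nothing is found
no_images = soup.find_all('img') # find_all gives an empty list
print(missing, len(no_images))
```

Because `find` can return `None`, it is safer to check its result before calling `.get()` or reading `.text` on it.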

Let's make the output more beautiful by going through all the links with a for loop:

In [15]:
for link in soup.find_all('a'):
    print(f'Link path - {link.get("href")} and text "{link.text}"')

Link path - 1.html and text "Гарри Поттер"
Link path - 2.html and text "Джинни Уизли"
Link path - 3.html and text "Лили Поттер"
Link path - 4.html and text "Гермиона Грейнджер"
Link path - 5.html and text "Сириус Блэк"
Link path - 6.html and text "Рубеус Хагрид"
Link path - 7.html and text "Рон Уизли"
Link path - 8.html and text "Астория Гринграсс"
Link path - 9.html and text "Люциус Малфой"
Link path - 10.html and text "Драко Малфой"
Link path - 11.html and text "Беллатриса Лестрейндж"
Link path - https://harrypotter.fandom.com and text "Гарри Поттер вики"
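
Most of these addresses are relative (`1.html`), so they cannot be requested as they are. One possible way to turn them into full addresses is `urljoin` from the standard library. This is a sketch, not necessarily the approach the course takes later; the list `hrefs` is typed in by hand here rather than taken from `soup`.

```python
from urllib.parse import urljoin

# The address of the page on which the links were found
base_url = 'https://online.hse.ru/python-as-foreign/1/'

# A few of the href values printed above
hrefs = ['1.html', '11.html', 'https://harrypotter.fandom.com']

for href in hrefs:
    # Relative addresses are attached to base_url, full addresses stay unchanged
    print(urljoin(base_url, href))
```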


Next time we will learn how to analyze the connections between the characters by examining the individual pages.

The final version of the notebook:

In [16]:
# Importing the requests and BeautifulSoup modules
import requests   
from bs4 import BeautifulSoup

# Using the get function from the requests module, we will request the site page and save it to the site variable
site = requests.get('https://online.hse.ru/python-as-foreign/1/')  

# Specify the universal character encoding
site.encoding = 'utf-8'

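# Let's ask the computer to recognize the html code of the site using BeautifulSoup.
# Naming the parser explicitly ('html.parser' is built into Python) avoids a warning.
soup = BeautifulSoup(site.text, 'html.parser')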

# Let's go through all the links found on the page and, saving each link under the name link, do the following:
for link in soup.find_all('a'): 
    # Print the page address from the href attribute and the link text
    print(f'Link path - {link.get("href")} and text "{link.text}"')

Link path - 1.html and text "Гарри Поттер"
Link path - 2.html and text "Джинни Уизли"
Link path - 3.html and text "Лили Поттер"
Link path - 4.html and text "Гермиона Грейнджер"
Link path - 5.html and text "Сириус Блэк"
Link path - 6.html and text "Рубеус Хагрид"
Link path - 7.html and text "Рон Уизли"
Link path - 8.html and text "Астория Гринграсс"
Link path - 9.html and text "Люциус Малфой"
Link path - 10.html and text "Драко Малфой"
Link path - 11.html and text "Беллатриса Лестрейндж"
Link path - https://harrypotter.fandom.com and text "Гарри Поттер вики"
