# Calculating word-usage statistics for the pages of a website

Last time we learned how to connect to websites, download pages from the Internet and search for tags in their code. Today we will learn how to search the content of a page.

Let's assume that you and I have not read Harry Potter, but it is vital for us to understand who the main characters in this series of books are and how they interact with each other.

Our working hypothesis is that fan sites devote more text to the main characters, or at least to the ones most popular with readers, and that fans will not forget to mention the interactions between characters in their biographies.

Let's go through the pages and calculate the length of the text on each one. To do this, we will:
1. download the main page;
2. extract a list of links from it;
3. follow the links and check for each of them:
    * that the link points to our site and
    * that the page can be downloaded (the server returns response code 200);
4. for each of the selected links, calculate the length of the text on the page;
5. print the link name and the length of the text.

Download the main page of the training site and build the full addresses of the links it contains:

In [1]:
import requests   
from bs4 import BeautifulSoup

base_url = 'https://online.hse.ru/python-as-foreign/1/'

page = requests.get(base_url)
page.encoding = 'utf-8'
soup = BeautifulSoup(page.text)

for link in soup.find_all('a'):
    href = link.get('href', '')            # '' if the <a> tag has no href attribute
    if href.endswith('html'):              # All links to our site end in .html
        print(base_url + href, link.text)

https://online.hse.ru/python-as-foreign/1/1.html Гарри Поттер
https://online.hse.ru/python-as-foreign/1/2.html Джинни Уизли
https://online.hse.ru/python-as-foreign/1/3.html Лили Поттер
https://online.hse.ru/python-as-foreign/1/4.html Гермиона Грейнджер
https://online.hse.ru/python-as-foreign/1/5.html Сириус Блэк
https://online.hse.ru/python-as-foreign/1/6.html Рубеус Хагрид
https://online.hse.ru/python-as-foreign/1/7.html Рон Уизли
https://online.hse.ru/python-as-foreign/1/8.html Астория Гринграсс
https://online.hse.ru/python-as-foreign/1/9.html Люциус Малфой
https://online.hse.ru/python-as-foreign/1/10.html Драко Малфой
https://online.hse.ru/python-as-foreign/1/11.html Беллатриса Лестрейндж
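
String concatenation works here because `base_url` ends with a slash and every `href` is a relative file name. As a hedged alternative, the standard library's `urllib.parse.urljoin` also copes with absolute links and missing slashes. The sketch below uses a hypothetical helper name, `absolute_links`, and hard-coded sample `href` values rather than a live download:

```python
from urllib.parse import urljoin

base_url = 'https://online.hse.ru/python-as-foreign/1/'

def absolute_links(base, hrefs):
    """Turn the .html hrefs into absolute URLs, skipping missing or other hrefs."""
    return [urljoin(base, h) for h in hrefs if h and h.endswith('html')]

# Sample values for illustration only; None stands for an <a> tag without href.
sample_hrefs = ['1.html', '2.html', None, 'style.css']
print(absolute_links(base_url, sample_hrefs))
```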


Let's try to download each page and check that the download succeeds:

In [2]:
import requests   
from bs4 import BeautifulSoup

base_url = 'https://online.hse.ru/python-as-foreign/1/'

page = requests.get(base_url)
page.encoding = 'utf-8'
soup = BeautifulSoup(page.text)

for link in soup.find_all('a'):
    href = link.get('href', '')            # '' if the <a> tag has no href attribute
    if href.endswith('html'):              # All links to our site end in .html
        page = requests.get(base_url + href)
        if page.status_code == 200:        # Keep only pages that downloaded successfully
            print(base_url + href, link.text)

https://online.hse.ru/python-as-foreign/1/1.html Гарри Поттер
https://online.hse.ru/python-as-foreign/1/2.html Джинни Уизли
https://online.hse.ru/python-as-foreign/1/3.html Лили Поттер
https://online.hse.ru/python-as-foreign/1/4.html Гермиона Грейнджер
https://online.hse.ru/python-as-foreign/1/5.html Сириус Блэк
https://online.hse.ru/python-as-foreign/1/6.html Рубеус Хагрид
https://online.hse.ru/python-as-foreign/1/7.html Рон Уизли
https://online.hse.ru/python-as-foreign/1/8.html Астория Гринграсс
https://online.hse.ru/python-as-foreign/1/9.html Люциус Малфой
https://online.hse.ru/python-as-foreign/1/10.html Драко Малфой


Note that the page about Bellatrix Lestrange could not be loaded: it does not exist on the site, so the server returned response code 404 instead of 200.
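
The status check can be tried without network access. A minimal sketch, assuming stand-in response objects built with `SimpleNamespace` (the helper name `downloadable` is hypothetical; real `requests` responses expose the same `status_code` attribute):

```python
from http import HTTPStatus
from types import SimpleNamespace

def downloadable(responses):
    """Return the URLs whose response has status code 200 (OK)."""
    return [url for url, resp in responses.items()
            if resp.status_code == HTTPStatus.OK]

# Stand-ins for real responses: page 10 exists, page 11 does not.
fake_responses = {
    '10.html': SimpleNamespace(status_code=200),
    '11.html': SimpleNamespace(status_code=404),
}
print(downloadable(fake_responses))  # ['10.html']
```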

We will set the correct encoding for each of the loaded pages and calculate the length of its text through the `body` tag, which contains all the visible content of a page:

In [3]:
import requests   
from bs4 import BeautifulSoup

base_url = 'https://online.hse.ru/python-as-foreign/1/'

page = requests.get(base_url)
page.encoding = 'utf-8'
soup = BeautifulSoup(page.text)

for link in soup.find_all('a'):
    href = link.get('href', '')            # '' if the <a> tag has no href attribute
    if href.endswith('html'):
        page = requests.get(base_url + href)
        if page.status_code == 200:
            print(f'Link path - {href} and text "{link.text}"')
            page.encoding = 'utf-8'        # Set the encoding before reading page.text
            s = BeautifulSoup(page.text)
            print(f'The article length is {len(s.find("body").text)} symbols.')

Link path - 1.html and text "Гарри Поттер"
The article length is 58609 symbols.
Link path - 2.html and text "Джинни Уизли"
The article length is 19644 symbols.
Link path - 3.html and text "Лили Поттер"
The article length is 23711 symbols.
Link path - 4.html and text "Гермиона Грейнджер"
The article length is 55545 symbols.
Link path - 5.html and text "Сириус Блэк"
The article length is 22988 symbols.
Link path - 6.html and text "Рубеус Хагрид"
The article length is 13522 symbols.
Link path - 7.html and text "Рон Уизли"
The article length is 46010 symbols.
Link path - 8.html and text "Астория Гринграсс"
The article length is 2318 symbols.
Link path - 9.html and text "Люциус Малфой"
The article length is 18883 symbols.
Link path - 10.html and text "Драко Малфой"
The article length is 56949 symbols.


The results are consistent with our hypothesis: judging by text length, the main characters are Harry, Draco, Hermione and Ron, the secondary characters are Lily, Sirius, Ginny, Lucius and Hagrid, and Astoria Greengrass turns out to be a minor character.
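
To make the ranking explicit, the lengths printed above can be sorted in descending order. This sketch copies the numbers from the output into a dictionary by hand (the names `lengths` and `ranking` are introduced here for illustration) instead of downloading the pages again:

```python
lengths = {
    'Гарри Поттер': 58609, 'Джинни Уизли': 19644, 'Лили Поттер': 23711,
    'Гермиона Грейнджер': 55545, 'Сириус Блэк': 22988, 'Рубеус Хагрид': 13522,
    'Рон Уизли': 46010, 'Астория Гринграсс': 2318, 'Люциус Малфой': 18883,
    'Драко Малфой': 56949,
}

# Sort by text length, longest biography first.
ranking = sorted(lengths.items(), key=lambda item: item[1], reverse=True)
for name, length in ranking:
    print(f'{length:>6} {name}')
```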

Now let's rework the program so that it looks for mentions of other characters on each character's page. To do this, we will extend the existing program:
1. create a list of the names by which the characters can be found;
2. create a dictionary that maps each character to the set of names of characters mentioned on their biography page;
3. go through all the biography pages and on each page:
    1. try to find each name from the list of characters in the text of the page,
    2. add the names found to our dictionary of sets;
4. at the end, print the statistics and compare them with the contents of the books (without spoilers).
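
The steps above can be sketched on a tiny in-memory example before touching the network. The page texts below are invented English placeholders, not content from the site, and `pages`, `names` and `mentions` are hypothetical names:

```python
# Invented placeholder pages: title -> text of the biography.
pages = {
    'Harry Potter': 'Harry met Ron and Hermione on the train.',
    'Ron Weasley': 'Ron is a friend of Harry.',
}
names = ['Harry', 'Ron', 'Hermione']

mentions = {}  # page title -> set of other names found on that page
for title, text in pages.items():
    mentions.setdefault(title, set())
    for name in names:
        # Skip the character's own name, then look for the others in the text.
        if name not in title and name in text:
            mentions[title].add(name)

for title, found in mentions.items():
    print(f'{title}: {", ".join(sorted(found))}')
```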

Let's make a list of names, taking only the unchanging part (the stem) of each name, because Russian names change their endings depending on grammatical case. From Harry Potter we take the first name "Гарри" as it is, but from Hermione Granger we take only the stem "Гермион":

In [4]:
import requests   
from bs4 import BeautifulSoup

base_url = 'https://online.hse.ru/python-as-foreign/1/'

page = requests.get(base_url)
page.encoding = 'utf-8'
soup = BeautifulSoup(page.text)

chars = ['Гарри', 'Рон', 'Гермион', 'Сириус', 'Хагрид', 'Джинн', 'Лили', 'Астори', 'Люциус', 'Драко']
char_to_char = {} # Maps each character to the set of other characters mentioned on their page
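
The effect of using stems can be checked on an invented sample sentence (not taken from the site) in which a name appears in an inflected form:

```python
sentence = 'Гарри встретил Гермиону в библиотеке.'  # invented sample text

print('Гермиона' in sentence)  # False: this form of the name ends in "у", not "а"
print('Гермион' in sentence)   # True: the stem matches regardless of the ending
print('Гарри' in sentence)     # True: "Гарри" keeps the same form
```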

Let's add a search for mentions of characters to the code we already wrote for going through the texts of all the biographies:

In [5]:
import requests   
from bs4 import BeautifulSoup

base_url = 'https://online.hse.ru/python-as-foreign/1/'

page = requests.get(base_url)
page.encoding = 'utf-8'
soup = BeautifulSoup(page.text)

chars = ['Гарри', 'Рон', 'Гермион', 'Сириус', 'Хагрид', 'Джинн', 'Лили', 'Астори', 'Люциус', 'Драко']
char_to_char = {} # Maps each character to the set of other characters mentioned on their page

for link in soup.find_all('a'):
    href = link.get('href', '')            # '' if the <a> tag has no href attribute
    if href.endswith('html'):
        page = requests.get(base_url + href)
        if page.status_code == 200:
            page.encoding = 'utf-8'
            s = BeautifulSoup(page.text)
            body_text = s.find('body').text                 # Visible text of the biography

            if link.text not in char_to_char:               # If the dictionary has no key yet
                                                            # for the character being studied,
                char_to_char[link.text] = set()             # create it with an empty set
            for char in chars:                              # Go through the list of name stems
                if char not in link.text:                   # Skip the character whose page this is
                    if char in body_text:                   # If the character is mentioned on the page,
                        char_to_char[link.text].add(char)   # add the stem to the set of interactions

for hero in char_to_char:
    print(f'{hero} interacted with: {", ".join(char_to_char[hero])}.')

Гарри Поттер interacted with: Джинн, Люциус, Гермион, Хагрид, Сириус, Лили, Рон, Драко.
Джинни Уизли interacted with: Люциус, Гермион, Гарри, Лили, Сириус, Рон, Драко.
Лили Поттер interacted with: Джинн, Гарри, Хагрид, Сириус, Драко.
Гермиона Грейнджер interacted with: Джинн, Гарри, Хагрид, Сириус, Лили, Рон, Драко.
Сириус Блэк interacted with: Гермион, Гарри, Хагрид, Лили, Рон.
Рубеус Хагрид interacted with: Рон, Драко, Гермион, Гарри.
Рон Уизли interacted with: Джинн, Люциус, Гермион, Гарри, Хагрид, Сириус, Лили, Драко.
Астория Гринграсс interacted with: Люциус, Драко.
Люциус Малфой interacted with: Джинн, Гермион, Гарри, Астори, Хагрид, Сириус, Рон, Драко.
Драко Малфой interacted with: Джинн, Люциус, Гермион, Гарри, Астори, Хагрид, Сириус, Рон.


Note that Astoria Greengrass's page mentions only members of her family (the Malfoys), which corresponds to the text of the books:

`Astoria Greengrass interacted with: Draco, Lucius.`
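
Because the values are sets, the order of names in each printed line is arbitrary and may differ between runs. A minimal sketch, assuming reproducible output is wanted, sorts the names before joining them. The dictionary here is filled by hand with the Astoria entry from the output above:

```python
char_to_char = {'Астория Гринграсс': {'Люциус', 'Драко'}}

for hero, others in char_to_char.items():
    # sorted() returns the names in a fixed alphabetical order.
    line = f'{hero} interacted with: {", ".join(sorted(others))}.'
    print(line)
```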