# Notebook №8. Information systems

by a student of the IS-20-1 group, Khromenko Danil.
<br>

## The with statement for opening files

Last time we discussed working with files, but I didn't have time to tell you about an important construct,
the with statement, which is often used to close open files automatically.
Let's look at how it is used with an example.

First, let's create a file that we will open. Let it be test_123.py. If the folder with
this notebook already contains a file with this name that is valuable to you, then replace
the value of the filename variable with something else.

In [1]:
#file name - test_123.py
filename = 'test_123.py'

In [2]:
#open a file for writing
f = open(filename, 'w')
#write the "print('Hello, world!!!')" command to the file
f.write("print('Hello, world!!!')")
#close the file
f.close()

The standard problem with files is that any open file must be closed.
In the previous lecture I showed you this syntax:

In [3]:
#open the file and read it
open(filename).read()

"print('Hello, world!!!')"

This syntax is very short, but not very good, because closing the file is left to
the so-called garbage collector, and it is not known exactly when the file will be closed.

*In the standard Python implementation, called CPython, the garbage collector
is designed in such a way that the file is closed immediately after this line is executed,
but other implementations may behave differently, and in some situations
such code may cause problems.*

In [4]:
#It's better to do this
with open(filename) as f:
    print(f.read())

print('Hello, world!!!')


On entering the with block, a line equivalent to f = open(filename) is executed. Then
the indented lines are executed, and when the indentation ends,
the file is closed automatically. So, as long as no error occurs, these two lines are equivalent to:

In [5]:
#this code does the same as the previous one (as long as no error occurs)
f = open(filename)
print(f.read())
f.close()

print('Hello, world!!!')


In [6]:
#Here is another example
#open the file and close it after completing the with block
with open(filename) as f:
    #read information from a file
    print(f.read())
    #go back to the beginning of the file
    f.seek(0)
    #read information from a file
    print(f.read())

print('Hello, world!!!')
print('Hello, world!!!')


Here we call f.seek(0) to "rewind" the file to the beginning, so the second f.read() outputs its contents again. Without it, the second f.read() would return an empty string, because the first one has already read the file to the end.

Now let's try to do something with the file after the indentation.

In [7]:
with open(filename) as f:
    print(f.read())
    f.seek(0)
    print(f.read())
#attempt to read the file after the with block has closed it
#an error occurs because the file is already closed
print(f.read())

print('Hello, world!!!')
print('Hello, world!!!')


ValueError: I/O operation on closed file.

As you can see, immediately after the end of the block (indented, as usual), the file is closed.

The with syntax has several advantages over the traditional approach. Firstly, you
certainly won't forget to close the file: you can forget to write f.close(), but you can't forget
to end the indentation. Secondly, even if you don't forget f.close(), the program may never reach it
because an error occurred on the way.
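
The second advantage can be made concrete. A with block behaves roughly like a try/finally pair. The sketch below illustrates that idea and is not the exact expansion Python performs. The file name demo_with.txt and the temporary folder are made up for the example.

```python
import os
import tempfile

# hypothetical scratch file, so that nothing valuable is overwritten
path = os.path.join(tempfile.mkdtemp(), 'demo_with.txt')
with open(path, 'w') as f:
    f.write('Hello, world!!!')

# roughly what "with open(path) as f:" does for us
f = open(path)
try:
    content = f.read()
finally:
    # this line runs whether or not the try block raised an error
    f.close()

print(content)
print(f.closed)
```

The finally block is what guarantees that the file is closed; with simply saves us from writing it by hand.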

### A little bit about exceptions

In the code below, division by 0 occurs after the file has been opened. The construct<br>
<pre>
try:
    something
except Name_of_some_error:
    do_something_else
</pre>
allows the program, if an error of the Name_of_some_error type occurs, not to terminate
with an error message but to pass control immediately to the block
do_something_else, which handles the situation. Note that in the example below
the file is still open inside the except block, which is bad. This
can be compared to the following situation: you put the kettle on the stove, but then you got an urgent
call and ran off, and the burner was left on.

In [8]:
try:
    f = open(filename)
    print(f.read())
    #the error occurs here due to division by 0
    print(10/0)
    print('This will never be printed')
    f.close()
except ZeroDivisionError:
    print("Ups, I did it again!")
    f.seek(0)
    print(f.read())

print('Hello, world!!!')
Ups, I did it again!
print('Hello, world!!!')


In the next example the situation is different: although we ran off on an urgent call, the smart kettle immediately turned itself
off. As you will see, when we try to read from the file in the except block, we get an error, and this
is good: it means that the file was closed despite the error.

In [9]:
#In this code block, if an error occurs, the file will be closed automatically due to the with block
try:
    with open(filename) as f:
        print(f.read())
        print(10/0)
        print('This will never be printed')
except ZeroDivisionError:
    print("Ups, I did it again!")
    #The error occurs here because the user is trying to get information from a closed file
    print(f.read())

print('Hello, world!!!')
Ups, I did it again!


ValueError: I/O operation on closed file.
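
Instead of provoking a ValueError, you can ask the file object directly whether it is closed. Here is a small sketch that uses the closed attribute of file objects. The file name demo_closed.txt and the temporary folder are made up for the example.

```python
import os
import tempfile

# hypothetical scratch file for the example
path = os.path.join(tempfile.mkdtemp(), 'demo_closed.txt')
with open(path, 'w') as f:
    f.write('some text')

try:
    with open(path) as f:
        inside = f.closed      # False: we are still inside the with block
        10 / 0                 # the error interrupts the block
except ZeroDivisionError:
    after_error = f.closed     # True: with closed the file before we got here

print(inside, after_error)
```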

## Appending to a file

Usually we need to do only one thing with a file: either read it or write it. Sometimes we need
to modify a file. Most often this is done as follows: the file is first read into memory, then
modified in memory and written "from scratch" to the same place as before. For files that are not
very large, this method works fine.
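
The three steps just described (read, modify in memory, write from scratch) might look like this. This is only a sketch: the file name demo_modify.txt, the temporary folder and the replacement text are made up for the example.

```python
import os
import tempfile

# hypothetical scratch file for the example
path = os.path.join(tempfile.mkdtemp(), 'demo_modify.txt')
with open(path, 'w') as f:
    f.write('Hello, world!!!')

# 1. read the whole file into memory
with open(path) as f:
    text = f.read()

# 2. modify the text in memory
text = text.replace('world', 'students')

# 3. write the result "from scratch" to the same place
with open(path, 'w') as f:
    f.write(text)

# check what is in the file now
with open(path) as f:
    result = f.read()
print(result)
```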

However, sometimes we don't need to overwrite the file from scratch, only to add some information to the
end of it. Most often this is needed for writing logs, which store
information about the operation of a program (for example, a web server logs which
addresses accessed it and which pages were requested). To add something to the end of a file,
open it with the mode 'a' (from the word append) like this:

In [10]:
#writing information to the end of the file
with open(filename, 'a') as f:
    print("\n" + "print('Some new string')", file = f)

In [11]:
#Let's check that the old content remains in place
with open(filename) as f:
    print(f.read())

print('Hello, world!!!')
print('Some new string')
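
Since logs were mentioned as the typical use of the 'a' mode, here is a minimal sketch of a log writer. The helper name write_log, the file name demo.log and the line format are assumptions made for the example.

```python
import os
import tempfile
from datetime import datetime

def write_log(path, message):
    """Append one timestamped line to the log file (hypothetical helper)."""
    # mode 'a' creates the file if it is missing and never erases old lines
    with open(path, 'a') as f:
        print(datetime.now().isoformat(), message, file=f)

log_path = os.path.join(tempfile.mkdtemp(), 'demo.log')
write_log(log_path, 'program started')
write_log(log_path, 'program finished')

# both lines are in the file, the first one was not overwritten
with open(log_path) as f:
    lines = f.read().splitlines()
print(len(lines))
```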



## Extracting data from web pages

### Loading a web page: requests module

If the line below does not work for you, then run pip install requests or conda install
requests on the command line (for example, in Anaconda Prompt).

In [12]:
import requests

The requests module allows you to access web pages. There are two common ways
to access web pages: a get request and a post request (although there are actually many more types of HTTP requests). A get request is when you send some information to the server in the address itself.
For example, if you go to the address https://www.google.ru/?q=sevsu+department+IS,
then you are asking Google to search for "sevsu
department IS". A post request is when the information is sent in the body of the request, usually from a form,
for example a login and password, which will not be displayed in the browser's address bar.

We will use get requests for now.
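
To see how the part of the address after the question mark is built, here is a small sketch with the standard urllib.parse module. The dictionary params is made up for the example. The requests module accepts the same kind of dictionary through the params argument of requests.get, so usually you do not need to build the string by hand.

```python
from urllib.parse import urlencode

base = 'https://www.google.ru/'
params = {'q': 'sevsu department IS'}

# urlencode turns the dictionary into "key=value" pairs; spaces become "+"
query = urlencode(params)
url = base + '?' + query
print(url)

# with requests the same dictionary can be passed directly (not executed here):
# r = requests.get(base, params=params)
```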

In [13]:
r = requests.get('http://www.sevsu.ru')

In [14]:
#To check that the page has loaded normally, look at the ok attribute
r.ok

True

The value True indicates that everything went fine (the server responded with a status code below 400).

In [15]:
#attempt to navigate to a non-existent page
q = requests.get('http://www.sevsu.ru/anyabsentdirectory')
print(q.ok)

False


We tried to navigate to a non-existent page and it didn't load. Let's go back to the successful
request r.

In [16]:
#Let's look at the html source of the page with the command
print(r.text)

<!DOCTYPE html>
<html class="no-js" lang="ru">
<head>
    <meta charset="utf-8">
    <meta http-equiv="x-ua-compatible" content="ie=edge">
    <title>Севастопольский государственный университет</title>
    <meta name="viewport" content="width=device-width, initial-scale=1">

    <link rel="apple-touch-icon" sizes="57x57" href="/local/templates/sevgu/static/ico/apple-icon-57x57.png">
    <link rel="apple-touch-icon" sizes="60x60" href="/local/templates/sevgu/static/ico/apple-icon-60x60.png">
    <link rel="apple-touch-icon" sizes="72x72" href="/local/templates/sevgu/static/ico/apple-icon-72x72.png">
    <link rel="apple-touch-icon" sizes="76x76" href="/local/templates/sevgu/static/ico/apple-icon-76x76.png">
    <link rel="apple-touch-icon" sizes="114x114" href="/local/templates/sevgu/static/ico/apple-icon-114x114.png">
    <link rel="apple-touch-icon" sizes="120x120" href="/local/templates/sevgu/static/ico/apple-icon-120x120.png">
    <link rel="apple-touch-icon" sizes="144x144" href="/

### A little bit about HTML

What you see above is an HTML page. HTML (HyperText Markup Language) is
a markup language that was originally defined as an application of the SGML standard. Another descendant of SGML is
XML, which we will meet again.

Let's write a simple HTML page. It is most convenient to do this in any text editor, but I'll write it to a file right from this notebook.

In [17]:
#creating a string containing an html page
my_html = '''
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset = "UTF-8">
    <title>Title</title>
</head>
<body>
<h1>Hello</h1>
<p>I'm a paragraph.</p>
<hr>
<ol>
    <li>One</li>
    <li>Two</li>
</ol>

</body>
</html>
'''

In [18]:
#write our string to an html file
with open('my.html', 'w') as f:
    f.write(my_html)

Open my.html in a browser and you will see a simple web page. You can see that HTML is divided into special fragments called tags. In the text above there are the tags &lt;html&gt;, &lt;head&gt;, &lt;title&gt;, etc. Each tag marks some piece of the web page. The &lt;title&gt; tag is the page title. The &lt;ol&gt; tag marks an ordered list. The &lt;li&gt; tag corresponds to a list item. The &lt;p&gt; tag is a paragraph. All the listed tags are paired: they mark some fragment of text (possibly containing other tags) by placing it between the corresponding opening and closing tags (for example, &lt;li&gt; is the opening tag and &lt;/li&gt; is the closing tag; everything in between is a list item). The exception here is the &lt;hr&gt; tag, which draws a horizontal line (it works without &lt;/hr&gt;).

In fact, an HTML page is a set of nested tags. We can say that it is a tree with its root in the &lt;html&gt; tag. Each tag has children: the tags that are nested directly in it. For example, the children of the &lt;body&gt; tag are &lt;h1&gt;, &lt;p&gt;, &lt;hr&gt; and &lt;ol&gt;. The result is a kind of family tree.
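
To make the tree visible, here is a sketch that prints every tag with an indent that matches its nesting depth. It uses the HTMLParser class from the standard library, not the tools used later in this notebook. The class name TreePrinter, the VOID_TAGS set and the shortened sample page are made up for the example.

```python
from html.parser import HTMLParser

# tags that have no closing pair, such as <hr> in the text above
VOID_TAGS = {'hr', 'br', 'meta', 'img', 'link', 'input'}

class TreePrinter(HTMLParser):
    """Collects tag names, indented by nesting depth (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # remember the tag with an indent of two spaces per nesting level
        self.lines.append('  ' * self.depth + tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1

sample = '''
<html>
<head><title>Title</title></head>
<body>
<h1>Hello</h1>
<p>I'm a paragraph.</p>
<hr>
<ol><li>One</li><li>Two</li></ol>
</body>
</html>
'''

printer = TreePrinter()
printer.feed(sample)
print('\n'.join(printer.lines))
```

In the printed result, tags with the same indent under the same parent are siblings, and a tag with a larger indent is nested in the nearest tag above it with a smaller indent.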

We are interested in HTML because we want to extract information from such a tree. Tables are one of the most popular ways of storing information, so let's insert a small table into our file: it is marked with the &lt;table&gt; tag, each row of the table is marked with a &lt;tr&gt; tag inside &lt;table&gt;, and each cell is marked with a &lt;td&gt; tag inside &lt;tr&gt;.

In [19]:
#creating a string containing an html page
my_html = '''
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset = "UTF-8">
    <title>Title</title>
    <style type='text/css;'>
        table {
        border-collapse: collapse;
    }
    table, th, td {
        border: 1px solid black;
    }
    </style>
</head>
<body>
<h1>Hello</h1>
<p>I'm a paragraph.</p>
<hr>
<ol>
    <li>One</li>
    <li>Two</li>
</ol>
<table>
    <tr>
        <td>
            Cell 1
        </td>
        <td>
            Cell 2
        </td>
    </tr>
    <tr>
        <td>
            Cell 3
        </td>
        <td>
            Cell 4
        </td>
    </tr>
</table>
</body>
</html>
'''
#write our string to an html file
with open('my.html', 'w') as f:
    f.write(my_html)

Let's assume that this file lies somewhere on a remote site. Let's download it using requests and
try to extract some information.

In [20]:
r = requests.get('http://math-info.hse.ru/f/2015-16/all-py/my.html')

### BeautifulSoup

There are many packages for processing web pages. The problem with HTML is that most browsers are "forgiving", and therefore there are a lot of poorly written HTML pages on the web (pages that do not follow the HTML standard). However, processing even HTML that is not quite correct is not that difficult
if you have the right tools at hand.

We will use the *Beautiful Soup 4* package. It is included in the standard Anaconda distribution, but
if you use another Python distribution, you may have to install it manually
using pip install beautifulsoup4.

*A package called BeautifulSoup is most likely not what you need. That is the third
version (Beautiful Soup 3), and we will use the fourth, so we need the package
beautifulsoup4. To make it really fun, when importing you need to specify a different
package name, bs4, and import a class called BeautifulSoup. In
general, it's easy to get confused at first, but these difficulties only need to be overcome once, and
then it will be easier.*

In [21]:
#importing Beautiful Soup
from bs4 import BeautifulSoup

To use *BeautifulSoup*, you need to pass the text of the web page (as
a single string) to *BeautifulSoup*. To avoid a warning, I also specify the name of the parser (the
program that actually processes the HTML) explicitly. For compatibility I use
*html.parser* (it is part of the Python standard library and does not require installation), but you can also try
*lxml* if you have it installed.

In [22]:
#using BeautifulSoup
page = BeautifulSoup(r.text, 'html.parser')

In [23]:
#What now lies in the page variable? Let's take a look.
page


<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Title</title>
<style type="text/css;">
        table {
        border-collapse: collapse;
    }

    table, th, td {
        border: 1px solid black;
    }
    </style>
</head>
<body>
<h1>Hello</h1>
<p>I'm a paragraph.</p>
<hr/>
<ol>
<li>One</li>
<li>Two</li>
</ol>
<table>
<tr>
<td>
            Cell 1
        </td>
<td>
            Cell 2
        </td>
</tr>
<tr>
<td>
            Cell 3
        </td>
<td>
            Cell 4
        </td>
</tr>
</table>
</body>
</html>

We see that the page object looks very much like a string, but in fact it is not just a string.
You can query it.

In [24]:
#request to page: see what's inside the html tag
page.html

<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Title</title>
<style type="text/css;">
        table {
        border-collapse: collapse;
    }

    table, th, td {
        border: 1px solid black;
    }
    </style>
</head>
<body>
<h1>Hello</h1>
<p>I'm a paragraph.</p>
<hr/>
<ol>
<li>One</li>
<li>Two</li>
</ol>
<table>
<tr>
<td>
            Cell 1
        </td>
<td>
            Cell 2
        </td>
</tr>
<tr>
<td>
            Cell 3
        </td>
<td>
            Cell 4
        </td>
</tr>
</table>
</body>
</html>

We see what is inside the &lt;html&gt; tag (this is almost the entire page; only the very first line,
the doctype declaration, was "cut off"). You can go deeper and look at the contents of &lt;head&gt;.

In [25]:
#request to page: see what's inside the head tag
page.html.head

<head>
<meta charset="utf-8"/>
<title>Title</title>
<style type="text/css;">
        table {
        border-collapse: collapse;
    }

    table, th, td {
        border: 1px solid black;
    }
    </style>
</head>

Now we only see what's inside the &lt;head&gt; tag. We can go even deeper and get what is inside the &lt;title&gt; tag, which in turn is inside the &lt;head&gt; tag (we say that &lt;title&gt; is a descendant of &lt;head&gt;):

In [26]:
#request to page: see what's inside the title tag
page.html.head.title

<title>Title</title>

However, we don't have to write it in such detail: since there is only one &lt;title&gt; tag in the document, we can omit the fact that it is inside &lt;head&gt;, which is inside &lt;html&gt;.

In [27]:
page.head.title

<title>Title</title>

In [28]:
page.title

<title>Title</title>

In [29]:
#One of the descendants of <body> is <table>. You can get it like this:
page.body.table

<table>
<tr>
<td>
            Cell 1
        </td>
<td>
            Cell 2
        </td>
</tr>
<tr>
<td>
            Cell 3
        </td>
<td>
            Cell 4
        </td>
</tr>
</table>

Let's say I need to get multiple items with the same tag, for example, all rows
&lt;tr&gt;. The findAll method is used for this (in Beautiful Soup 4 it is also available under the name find_all):

In [30]:
#get multiple items with the same tag, for example, all rows <tr>
rows = page.body.table.findAll('tr')
rows

[<tr>
 <td>
             Cell 1
         </td>
 <td>
             Cell 2
         </td>
 </tr>,
 <tr>
 <td>
             Cell 3
         </td>
 <td>
             Cell 4
         </td>
 </tr>]

In [31]:
#number of rows in 'rows'
len(rows)

2

We see that this is a list of two elements, so you can loop over it.

In [32]:
#output the rows using a loop
for i, row in enumerate(rows, 1):
    print(i)
    print(row)

1
<tr>
<td>
            Cell 1
        </td>
<td>
            Cell 2
        </td>
</tr>
2
<tr>
<td>
            Cell 3
        </td>
<td>
            Cell 4
        </td>
</tr>


We have 2 rows, and each of them is the same kind of Beautiful Soup object as all
the previous ones, so you can apply the construction row.td to them.

In [33]:
#application of the "row.td" construction
for i, row in enumerate(rows):
    print(i)
    print(row.td)

0
<td>
            Cell 1
        </td>
1
<td>
            Cell 3
        </td>


We see that if there are several &lt;td&gt; tags inside the &lt;tr&gt; tag, then row.td takes the first of them.
So we got the first column. But we are interested not in the &lt;td&gt; tag itself, but in the string that
lies inside it. It can be printed like this.

In [34]:
#application of the "row.td.string" construction
for i, row in enumerate(rows):
    print(i)
    print(row.td.string)

0

            Cell 1
        
1

            Cell 3
        


You can see that there is unnecessary whitespace around the string. Remove it with the strip method.

In [35]:
#application of the "row.td.string.strip()" construction
for i, row in enumerate(rows):
    print(i)
    print(row.td.string.strip())

0
Cell 1
1
Cell 3


In [36]:
#Let's load the table as a list of lists
table = []
for row in rows:
    table.append([])
    for cell in row.findAll('td'):
        table[-1].append(cell.string.strip())
print(table)

[['Cell 1', 'Cell 2'], ['Cell 3', 'Cell 4']]


In [37]:
#Here is the same thing, but shorter, with the help of a list comprehension:
table = []
for row in rows:
    table.append([cell.string.strip() for cell in row.findAll('td')])
print(table)

[['Cell 1', 'Cell 2'], ['Cell 3', 'Cell 4']]


In [38]:
#Or even shorter (but more intricate):
table = [[cell.string.strip() for cell in row.findAll('td')]
         for row in rows]
print(table)

[['Cell 1', 'Cell 2'], ['Cell 3', 'Cell 4']]


Note that instead of some_beautiful_soup_object.findAll('sometag') you can write the shorter
some_beautiful_soup_object('sometag'). So the code can be made even shorter.

In [39]:
#even shorter
table = [[cell.string.strip() for cell in row('td')]
         for row in rows]
print(table)

[['Cell 1', 'Cell 2'], ['Cell 3', 'Cell 4']]
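
The .string attribute works in the cells above because each cell contains only text. If a cell holds nested tags (for example, bold text), .string is None and calling .strip() on it fails. The sketch below uses the get_text method of Beautiful Soup instead. The sample table with the &lt;b&gt; tag is made up for the example.

```python
from bs4 import BeautifulSoup

# made-up table: the first cell contains a nested <b> tag
sample = '''
<table>
<tr><td> Cell <b>1</b> </td><td>Cell 2</td></tr>
<tr><td>Cell 3</td><td>Cell 4</td></tr>
</table>
'''
soup = BeautifulSoup(sample, 'html.parser')

# .string gives None here, because the cell has more than one child
first_cell_string = soup.td.string
print(first_cell_string)

# get_text collects all the text inside the cell; strip=True trims each piece,
# and the pieces are joined with the separator ' '
table = [[cell.get_text(' ', strip=True) for cell in row.find_all('td')]
         for row in soup.find_all('tr')]
print(table)
```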


Besides the name, tags have other properties (in HTML they are called attributes): for example, in the line &lt;html lang="en"&gt; we see the lang property of the &lt;html&gt; tag, which has the value "en". Another important example of a tag with
properties is the &lt;a&gt; tag, which creates a link. It has an href property that stores the actual link.

Now imagine that we want to make a robot that walks around web pages,
going from one page to another by following links. Then we are faced with the task of extracting all hyperlinks from a page. To do this, we need to find all the &lt;a&gt; tags on the page and take
the href property from each of them. To begin with, let's show how to get a property of an object, for example, lang from html.
This is done as if our object were a dictionary and we were taking its value by key.

In [40]:
#request to page: see what's inside the html tag, in the property 'lang'
page.html['lang']

'en'

In [41]:
#If we request a property that the tag does not have, then we will get a KeyError, as with the dictionary
#ERROR: because there is an attempt to request a property that the tag does not have
page.html['strange']

KeyError: 'strange'

Just as with a dictionary, there is a get() method that returns None if there is no such property, or the default value that we specify.

In [42]:
#using the get() method to check for the presence of a property
page.html.get('strange')

In [43]:
#using the get() method to check for the presence of a property
page.html.get('strange', 'no-such-tag')

'no-such-tag'

In [44]:
#Now we will extract all the links from some site
r = requests.get('http://www.sevsu.ru')
page = BeautifulSoup(r.text, 'html.parser')

In [45]:
#Here are all the links on our page
page('a')

[<a class="link link--blue" href="/admission/" target="_blank" title="Приемная комиссия">Приемная комиссия</a>,
 <a class="link link--black" href="tel:+7 (8692) 222-911" title="Позвонить +7 (8692) 222-911"><i class="dev-notice__icon dev-notice__icon--tel"></i>+7 (8692) 222-911</a>,
 <a class="link link--black" href="/cdn-cgi/l/email-protection#a3d3d1cac6cee3d0c6d5d0d68dd1d6" title="Написать priem@sevsu.ru"><i class="dev-notice__icon dev-notice__icon--mail"></i><span class="__cf_email__" data-cfemail="d0a0a2b9b5bd90a3b5a6a3a5fea2a5">[email protected]</span></a>,
 <a class="header__logo" href="https://www.sevsu.ru">
 <img alt="" src="/local/templates/sevgu/static/img/content/logo-footer.svg"/>
 </a>,
 <a class="logout-icon" href="https://lk.sevsu.ru/user/sign-in/login" title="Личный кабинет">
 <img alt="Личный кабинет" class="select-sight__image" src="/local/html/icon-svg/menu-1.svg"/>
 </a>,
 <a class="faint-sight" href="/search/" title="Поиск">
 <img alt="Поиск" class="select-sight__im

As you can see, the findAll() method (or its abbreviated form, just parentheses) searches not
only the immediate "children" of a node (in genealogical terms), but all of its
descendants.

In [46]:
#Let's print the links ourselves
for link in page("a"):
    if link.get("href") is not None:
        print(link["href"])

/admission/
tel:+7 (8692) 222-911
/cdn-cgi/l/email-protection#a3d3d1cac6cee3d0c6d5d0d68dd1d6
https://www.sevsu.ru
https://lk.sevsu.ru/user/sign-in/login
/search/
?special_version=Y
#
https://www.sevsu.ru/admission/item/podat-dokumenty/
https://www.sevsu.ru/uni/omoo/pkig//admission/item/podat-dokumenty/
https://www.cn.sevsu.ru/admission/item/podat-dokumenty/
https://old.sevsu.ru/arab/admission/item/podat-dokumenty/
https://welcome.sevsu.ru/
https://old.sevsu.ru/studentam/
/uni/dap/service-dap
/uni/career
/uni/
https://lk.sevsu.ru/user/sign-in/login
/search/
?special_version=Y
#
https://www.sevsu.ru/admission/item/podat-dokumenty/
https://www.sevsu.ru/uni/omoo/pkig//admission/item/podat-dokumenty/
https://www.cn.sevsu.ru/admission/item/podat-dokumenty/
https://old.sevsu.ru/arab/admission/item/podat-dokumenty/
#
#
?special_version=Y
/uni/about/
/uni/rector/
/uni/nabsovet/
/uni/academic-board
https://lk.sevsu.ru/user/sign-in/login
/search/
?special_version=Y
#
https://old.sevsu.ru/english/

There are absolute hyperlinks that start with http, and local ones that lead to the same
site and are relative (that is, before /admission/ you need to write
http://www.sevsu.ru to get the full link to the corresponding page).
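
Relative links do not have to be completed by hand. A minimal sketch with the urljoin function from the standard urllib.parse module follows; the three links are taken from the output above, and the variable names are made up for the example.

```python
from urllib.parse import urljoin

base = 'http://www.sevsu.ru/'
links = ['/admission/', '/uni/career', 'https://welcome.sevsu.ru/']

# urljoin leaves absolute links alone and completes relative ones
full_links = [urljoin(base, link) for link in links]
for link in full_links:
    print(link)
```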

Now it is clear how our robot should act: for each of the links it has received, it should
load the corresponding page, find all the links on it, add them to the queue of pages
to explore, and so on. This is roughly how the web crawlers of search engines work. (Although, of course, they
are much more complicated.)
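
The queue logic of such a robot can be sketched without touching the network. In the example below the dictionary FAKE_SITE stands in for real pages, and the function name crawl is made up. In a real robot, get_links would download the page with requests and collect the href values with Beautiful Soup.

```python
from collections import deque

# a made-up "site": each page maps to the links found on it
FAKE_SITE = {
    '/': ['/uni/', '/admission/'],
    '/uni/': ['/uni/about/', '/'],
    '/admission/': ['/'],
    '/uni/about/': [],
}

def crawl(start, get_links):
    """Visit every page reachable from start, each one only once."""
    queue = deque([start])
    seen = {start}
    order = []
    while queue:
        page_url = queue.popleft()
        order.append(page_url)
        for link in get_links(page_url):
            if link not in seen:      # do not queue the same page twice
                seen.add(link)
                queue.append(link)
    return order

visited = crawl('/', lambda url: FAKE_SITE.get(url, []))
print(visited)
```

The seen set matters: pages link back to each other, and without it the robot would walk in circles forever.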

### P.S. Documentation is your friend

For BeautifulSoup, the documentation is here (http://www.crummy.com/software/BeautifulSoup/bs4/doc/),
and for requests it is here (http://docs.python-requests.org/en/latest/) (start with Quickstart). Of course, it's in
English, but as my programming teacher said, "after six months of programming classes,
you will consider English a subset of Russian."

Another source of information about libraries is web search, which will most often
give links to the question-and-answer site http://stackoverflow.com/.
For example, by typing how to parse table with beautifulsoup (https://www.google.ru/search?q=how+to+parse+table+with+beautifulsoup&gws_rd=cr&ei=wXaJVvzQKIfXyQO4v4PYDw) you will get
several links to stackoverflow with code examples. By the way, you can also ask your
own questions on stackoverflow, but first you need to make sure that they have not been answered before.