# 0. Installing course dependencies

In [1]:
!pip install -r ../requirements.txt

Collecting argparse~=1.4.0
  Using cached argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Installing collected packages: argparse
Successfully installed argparse-1.4.0


In [3]:
!conda install -c conda-forge ffmpeg -y

'conda' is not recognized as an internal or external command,
operable program or batch file.


Run the next cell if you want to download the embedding model. It is not required for this lab, so you can also do it later.

In [4]:
!python -m spacy download en_trf_distilbertbaseuncased_lg


✘ No compatible package found for 'en_trf_distilbertbaseuncased_lg' (spaCy
v3.2.1)



# 1. Touching the Internet

Solve the following task. Download [this page](https://raw.githubusercontent.com/IUCVLab/information-retrieval/main/datasets/facts.txt) and save it to a file whose name is derived from the URL. A file downloaded from a different URL must not be saved under the same name, e.g. [this file](https://github.com/IUCVLab/information-retrieval/blob/main/datasets/facts.txt).

Ref: [requests](https://docs.python-requests.org/en/latest/) library is cool.

In [12]:
import requests

url = "https://raw.githubusercontent.com/IUCVLab/information-retrieval/main/datasets/facts.txt"

# TODO: download and save to a file
request = requests.get(url)
open("facts.txt", 'wb').write(request.content)

13160
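
The cell above writes to the fixed name `facts.txt`, so the two URLs from the task would end up in the same file. Below is a minimal sketch of one way to derive the name from the URL instead, assuming that percent-encoding the whole URL is an acceptable naming scheme. `url_to_filename` is a hypothetical helper introduced here for illustration; it is not part of `requests`.

```python
from urllib.parse import quote

def url_to_filename(url: str) -> str:
    """Map a URL to a file name that is unique per URL (hypothetical helper)."""
    # quote() with safe="" percent-encodes every reserved character,
    # including "/" and ":", so the result contains no path separators.
    return quote(url, safe="")

raw_url = "https://raw.githubusercontent.com/IUCVLab/information-retrieval/main/datasets/facts.txt"
page_url = "https://github.com/IUCVLab/information-retrieval/blob/main/datasets/facts.txt"

raw_name = url_to_filename(raw_url)
page_name = url_to_filename(page_url)
print(raw_name)
print(page_name)
```

The download step would then open `url_to_filename(url)` instead of `"facts.txt"`. Percent-encoding is reversible, so distinct URLs always give distinct names, but very long URLs may exceed the file name length limit of the file system.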

# 2. Parsing different formats

Most likely, whatever you find on the Internet is binary, plain text, XML, or JSON. XML in turn covers XHTML, RSS, Atom, SOAP, XML-RPC, and others. Your task is to learn how to process these formats.

## 2.1. JSON

[The given file](http://sprotasov.ru/data/postnauka.txt) contains valid JSON. Parse it and print the URLs of all videos that have the `computer science` tag. Use the built-in features of `requests`, or just the `json` library ([ref](https://docs.python.org/3/library/json.html)).

In [13]:
import json
import requests

# TODO. Your code here
url = "http://sprotasov.ru/data/postnauka.txt"

request = requests.get(url)

document = json.loads(request.content)

for record in document:
    if "computer science" in record["tags"]:
        print(record["url"])


http://postnauka.ru/talks/31897
http://postnauka.ru/video/24306
http://postnauka.ru/faq/46974
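
The filtering step can be tried without network access on an inline sample. This is a sketch only: the records below are made up, and their shape (a JSON array of objects with `url` and `tags` fields) is an assumption taken from the code in the cell above. `urls_with_tag` is a hypothetical helper name.

```python
import json

# Made-up sample in the shape the cell above assumes:
# a JSON array of objects with "url" and "tags" fields.
sample = """
[
  {"url": "http://example.com/video/1", "tags": ["computer science", "math"]},
  {"url": "http://example.com/video/2", "tags": ["biology"]},
  {"url": "http://example.com/video/3"}
]
"""

def urls_with_tag(records, tag):
    """Return the URLs of all records whose tag list contains `tag`."""
    # .get() with a default tolerates records that lack a "tags" field.
    return [r["url"] for r in records if tag in r.get("tags", [])]

records = json.loads(sample)
matching = urls_with_tag(records, "computer science")
print(matching)
```

When the data comes from `requests`, calling `.json()` on the response object is the built-in alternative to `json.loads`.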


## 2.2. HTML

For a given StackExchange question page, extract the logins of the contributors (who asked and who answered) together with their vote counts. [bs4](https://beautiful-soup-4.readthedocs.io/en/latest/) will help you do the job.

I recommend using CSS or XPath selectors. `div` elements with the `post-layout` class represent posts (the question and each answer). Inside each one there is a `div` with the `votecell` class storing the number of votes and a `div` with the `user-details` class storing user info. My personal recommendation is to use CSS selectors, which are [documented here](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors).

In [14]:
import requests
from bs4 import BeautifulSoup

url = "https://math.stackexchange.com/questions/411486/understanding-the-singular-value-decomposition-svd"
print(url)

# TODO. Your code here should parse the HTML source of the page and find the contributors to the thread.
request = requests.get(url)
# Name the parser explicitly; otherwise bs4 picks one and emits a warning
document = BeautifulSoup(request.content, "html.parser")

for vote_count in document.find_all("div", {"class" : "js-vote-count"}):
    # vote count is 3rd generation child of every post
    post_layout = vote_count.parent.parent.parent
    author_user_details = post_layout.find("div", {"class" : "user-details", "itemprop" : "author"})
    name = author_user_details.a.string

    score = vote_count.string.strip()

    print(f"Contributor \"{name}\" has {score} votes.")


https://math.stackexchange.com/questions/411486/understanding-the-singular-value-decomposition-svd
Contributor "Celdor" has 20 votes.
Contributor "Ittay Weiss" has 16 votes.
Contributor "Tomasz Bartkowiak" has 8 votes.
Contributor "Bart Vanderbeke" has 4 votes.
Contributor "Bart Vanderbeke" has 3 votes.
Contributor "hgfei" has 2 votes.
Contributor "littleO" has 1 votes.
Contributor "TheSHETTY-Paradise" has 1 votes.
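
The same extraction can be written with CSS selectors, as recommended in the task text. This is a sketch under stated assumptions: the HTML fragment is made up and only mimics the class names mentioned above (`post-layout`, `votecell`, `js-vote-count`, `user-details`), and the user names are placeholders. The real page has much more markup and may change over time.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the structure described above;
# the real page contains far more markup.
html = """
<div class="post-layout">
  <div class="votecell"><div class="js-vote-count"> 20 </div></div>
  <div class="user-details" itemprop="author"><a href="/users/1">alice</a></div>
</div>
<div class="post-layout">
  <div class="votecell"><div class="js-vote-count"> 16 </div></div>
  <div class="user-details" itemprop="author"><a href="/users/2">bob</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

contributors = []
for post in soup.select("div.post-layout"):
    votes = post.select_one("div.votecell .js-vote-count")
    author = post.select_one('div.user-details[itemprop="author"] a')
    # Skip posts where either element is missing instead of raising.
    if votes is None or author is None:
        continue
    contributors.append((author.get_text(strip=True), int(votes.get_text(strip=True))))

print(contributors)
```

Starting from the post container and selecting downwards avoids the `parent.parent.parent` chain, which depends on the exact nesting depth of the vote counter.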


## 2.3. RSS feed

A lot of information is already organized in typed XML documents. A podcast, for example, is just an RSS feed. Parse [the feed of this podcast](http://sprotasov.ru/podcast/rss.xml) and print the time span between the first and the last episodes. Use [`feedparser`](https://waylonwalker.com/parsing-rss-python/) for this.

In [26]:
import feedparser
import datetime
rss = 'http://sprotasov.ru/podcast/rss.xml'
feed = feedparser.parse(rss) 

# TODO: complete the code to compute the time span of all the episodes.
total_time = datetime.timedelta()

for entry in feed["entries"]:
    h, m, s = entry["itunes_duration"].split(":")
    total_time += datetime.timedelta(hours = int(h), minutes = int(m), seconds = int(s))

print(f"Total time: {total_time}.")

Total time: 4:36:49.
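
The cell above sums the episode durations, while the task text asks for the span between the first and the last episode. A minimal sketch of that second reading follows, using only the standard library. The dates are made up, and it is an assumption that each feed entry exposes its publication date as an RFC 822 string (with `feedparser` this would presumably be `entry["published"]`). `publication_span` is a hypothetical helper name.

```python
from email.utils import parsedate_to_datetime

# Made-up publication dates in the RFC 822 style used by RSS <pubDate>.
published = [
    "Mon, 06 Jan 2020 10:00:00 +0000",
    "Wed, 15 Jan 2020 12:30:00 +0000",
    "Sat, 01 Feb 2020 10:00:00 +0000",
]

def publication_span(date_strings):
    """Return the timedelta between the earliest and the latest date."""
    dates = [parsedate_to_datetime(s) for s in date_strings]
    return max(dates) - min(dates)

span = publication_span(published)
print(f"Time span: {span}.")
```

Taking `max` and `min` instead of the first and last list items keeps the result correct regardless of the order in which the feed lists its episodes.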


# 3. Solving simple information retrieval task

As the name suggests, `information retrieval` is the discipline that helps retrieve information (from unstructured sources). So we will retrieve some information from [this news article](https://www.bbc.com/news/world-us-canada-59944889). Your task is to write code that answers the question: **How many people die every day in the US waiting for a transplant?** Make the code flexible enough. Test yourself by changing the link to [this one](https://www.americantransplantfoundation.org/about-transplant/facts-and-myths/).

In [60]:
import requests
from bs4 import BeautifulSoup
url = 'https://www.bbc.com/news/world-us-canada-59944889'
url2 = "https://www.americantransplantfoundation.org/about-transplant/facts-and-myths/"
question = 'How many people die every day in the US waiting for a transplant?'

# TODO. Impress me!
request = requests.get(url2)
# Name the parser explicitly; otherwise bs4 picks one and emits a warning
soup = BeautifulSoup(request.content, "html.parser")

# Score a sentence by the fraction of its words that also occur in the baseline
def get_similarity_score(sentence, baseline):
    base_words = set(baseline.split())
    sentence_words = sentence.split()
    if not sentence_words:
        return 0

    match_count = sum(1 for word in sentence_words if word in base_words)
    return match_count / len(sentence_words)

question_low = question.lower()
max_score = 0
answer = "Not found"
# Text nodes are joined with a " <|> " marker; the marker counts as a
# non-matching word, so sentences that span several elements score lower
for sentence in soup.get_text(" <|> ", strip=True).split("."):
    score = get_similarity_score(sentence.lower(), question_low)
    if score > max_score:
        max_score = score
        answer = sentence.replace(" <|> ", " ") + "."

print(f"Question: {question}")
print(f"Answer: {answer}")
print(f"Score: {max_score * 100:.02f}%.")

Question: How many people die every day in the US waiting for a transplant?
Answer:  On average, 17 people die every day from the lack of available organs for transplant.
Score: 31.58%.
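
The scoring above compares whitespace-separated words, so `transplant?` in the question does not match `transplant` in a sentence. The sketch below shows one possible variation and is not the method used in the cell: it strips punctuation with a regular expression and normalizes by the number of question words rather than sentence words. The second candidate sentence is made up for illustration, and `tokenize` and `overlap_score` are hypothetical names.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(sentence, question):
    """Fraction of the question's words that also occur in the sentence."""
    question_words = tokenize(question)
    if not question_words:
        return 0.0
    return len(question_words & tokenize(sentence)) / len(question_words)

question = "How many people die every day in the US waiting for a transplant?"
sentences = [
    "On average, 17 people die every day from the lack of available organs for transplant.",
    "One donor can save up to eight lives.",  # made-up distractor
]
best = max(sentences, key=lambda s: overlap_score(s, question))
print(best)
```

Normalizing by the question length means long sentences are not penalized for containing extra words, at the cost of favoring very long sentences that happen to mention many question words.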
