# 0. Installing course dependencies

In [1]:
!pip install -r ../requirements.txt
!conda install -c conda-forge ffmpeg -y
!python -m spacy download en_trf_distilbertbaseuncased_lg

# 1. Touching the Internet

Solve the following task: download [this page](https://raw.githubusercontent.com/IUCVLab/information-retrieval/main/datasets/facts.txt) and save it to a file whose name is derived from the URL. A file downloaded from a different URL, e.g. [this one](https://github.com/IUCVLab/information-retrieval/blob/main/datasets/facts.txt), must not be saved under the same name.

Ref: [requests](https://docs.python-requests.org/en/latest/) library is cool.

In [1]:
import requests
from hashlib import sha512

url = "https://raw.githubusercontent.com/IUCVLab/information-retrieval/main/datasets/facts.txt"
r = requests.get(url)
r.raise_for_status()  # stop here if the download failed

# Hash the full URL so that different URLs map to different file names
hash_filename = "./" + sha512(url.encode()).hexdigest() + ".txt"

with open(hash_filename, 'wb') as f:
    f.write(r.content)
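
The naming step can be checked on its own, without touching the network. The sketch below wraps it in a helper and compares the names produced for the two URLs from the task. The helper name `url_to_filename` and its `extension` parameter are illustrative assumptions, not part of the assignment.

```python
from hashlib import sha512


def url_to_filename(url: str, extension: str = ".txt") -> str:
    """Return a file name built from the SHA-512 hex digest of the URL (hypothetical helper)."""
    return sha512(url.encode("utf-8")).hexdigest() + extension


raw_url = "https://raw.githubusercontent.com/IUCVLab/information-retrieval/main/datasets/facts.txt"
page_url = "https://github.com/IUCVLab/information-retrieval/blob/main/datasets/facts.txt"

raw_name = url_to_filename(raw_url)
page_name = url_to_filename(page_url)

# Both URLs end in "facts.txt", yet the derived names differ
print(raw_name != page_name)
# The same URL always maps to the same name
print(url_to_filename(raw_url) == raw_name)
```

Hashing the whole URL, rather than taking its last path segment, is what keeps the two `facts.txt` files apart.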

# 2. Parsing different formats

Most of what you meet on the Internet is binary, plain text, XML, or JSON. XML in turn comes in many flavours: XHTML, RSS, Atom, SOAP, XML-RPC, ... Your task is to learn how to process different formats.

## 2.1. JSON

[The given file](http://sprotasov.ru/data/postnauka.txt) contains valid JSON. Parse it and print the URLs of all videos that have the `computer science` tag. Use the built-in features of `requests`, or just the `json` library ([ref](https://docs.python.org/3/library/json.html)).

In [2]:
import json
import requests

url = "http://sprotasov.ru/data/postnauka.txt"
resp = requests.get(url=url)
# 'utf-8-sig' drops a leading byte order mark, if any, before parsing
data = json.loads(resp.content.decode('utf-8-sig'))
# Collect the URLs of all records tagged 'computer science'
urls = [item['url'] for item in data if 'computer science' in item['tags']]
print(urls)

['http://postnauka.ru/talks/31897', 'http://postnauka.ru/video/24306', 'http://postnauka.ru/faq/46974']
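
The same filtering can be tried offline. This is a minimal sketch on a made-up payload: the records and `example.org` URLs are hypothetical and only mimic the `url` and `tags` fields used above. It also shows why the `utf-8-sig` codec matters when the data starts with a byte order mark.

```python
import json

# Hypothetical payload: a UTF-8 byte order mark followed by a JSON array
raw = b'\xef\xbb\xbf' + json.dumps([
    {"url": "http://example.org/video/1", "tags": ["computer science", "math"]},
    {"url": "http://example.org/video/2", "tags": ["biology"]},
    {"url": "http://example.org/video/3", "tags": ["computer science"]},
]).encode("utf-8")

# 'utf-8-sig' removes the leading byte order mark; plain 'utf-8' would keep it
# as the character U+FEFF, which json.loads refuses to parse
records = json.loads(raw.decode("utf-8-sig"))

cs_urls = [rec["url"] for rec in records if "computer science" in rec["tags"]]
print(cs_urls)
```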


## 2.2. HTML

For a given StackExchange question page, extract the logins of the contributors (who asked and who answered) together with their vote counts. [bs4](https://beautiful-soup-4.readthedocs.io/en/latest/) will help you do the job.

I recommend using CSS or XPath selectors. `div` elements with the `post-layout` class represent the posts. Inside each of them there is a `div` with the `votecell` class storing the vote count and a `div` with the `user-details` class storing user info. My personal recommendation is to use CSS selectors, which are [documented here](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors).

In [3]:
import requests
from bs4 import BeautifulSoup

url = "https://math.stackexchange.com/questions/411486/understanding-the-singular-value-decomposition-svd"
print(url)

resp = requests.get(url=url)
# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(resp.text, "html.parser")

votevals = []
names = []
# Each div.post-layout holds one post: its vote counter and its user card
for post in soup.select("div.post-layout"):
    vote_num = post.select("div.js-vote-count")[0]
    votevals.append(int(vote_num.get_text(strip=True)))
    user = post.select("div.user-details")[0]
    links = user.find_all('a')
    # A user card without a profile link gets a placeholder name
    names.append(links[0].get_text() if links else "None")

# Print only the posts with a non-zero vote count
for vote, name in zip(votevals, names):
    if vote:
        print(vote, name)

https://math.stackexchange.com/questions/411486/understanding-the-singular-value-decomposition-svd
23 Rodrigo de Azevedo
17 Ittay Weiss
9 None
4 Bart Vanderbeke
3 Bart Vanderbeke
2 hgfei
1 littleO
1 TheSHETTY-Paradise
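
To experiment with the selectors without hitting the live site, the sketch below runs them on a small inline snippet. The markup is hypothetical and heavily simplified; it only reuses the class names mentioned in the task, and the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class names from the task
html = """
<div class="post-layout">
  <div class="votecell"><div class="js-vote-count"> 12 </div></div>
  <div class="user-details"><a href="/users/1">alice</a></div>
</div>
<div class="post-layout">
  <div class="votecell"><div class="js-vote-count"> 3 </div></div>
  <div class="user-details">anonymous</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

results = []
for post in soup.select("div.post-layout"):
    # A descendant selector narrows the search to the vote cell of this post
    count = post.select_one("div.votecell div.js-vote-count")
    votes = int(count.get_text(strip=True))
    # select_one returns None when nothing matches, e.g. no profile link
    link = post.select_one("div.user-details a")
    results.append((votes, link.get_text(strip=True) if link else None))

print(results)
```

Using `select_one` instead of `select(...)[0]` turns a missing element into `None`, so the check for an absent profile link does not need a `try`/`except`.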


## 2.3. RSS feed

A lot of information is already organized in typed XML documents. A podcast, for example, is just an RSS feed. Parse [the feed of this podcast](http://sprotasov.ru/podcast/rss.xml) and print out the time span between the first and the last episodes. Use [`feedparser`](https://waylonwalker.com/parsing-rss-python/) for this.

In [4]:
import feedparser
from datetime import datetime, timedelta

rss = 'http://sprotasov.ru/podcast/rss.xml'
entries = feedparser.parse(rss)['entries']

# Alternative: total running time, summed from each entry's 'itunes_duration'
# total = sum((datetime.strptime(e['itunes_duration'], '%H:%M:%S')
#              - datetime(1900, 1, 1)).total_seconds() for e in entries)
# print(timedelta(seconds=total))

# Time span between the first and the last episode.
# 'published_parsed' is a time.struct_time; its first six fields are
# year, month, day, hour, minute, second.
dates = [datetime(*entry['published_parsed'][:6]) for entry in entries]
diff = max(dates) - min(dates)
print(diff)

# The following computation is not very precise!
# (it assumes 365-day years and 30-day months)
# https://stackoverflow.com/a/4040338
days = diff.days
years = days // 365
months = (days - 365 * years) // 30
days = days - years * 365 - months * 30
print("podcast took {} years, {} months, {} days".format(years, months, days))

1801 days, 10:30:00
podcast took 4 years, 11 months, 11 days
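
The breakdown into years, months and days above is approximate, as its own comment warns. A calendar-aware version needs only the standard library. The sketch below is one possible way to do it; the function name `calendar_diff` and the two dates are hypothetical, chosen only so that they are 1801 days apart, and are not taken from the feed.

```python
import calendar
from datetime import date


def calendar_diff(start: date, end: date) -> tuple:
    """Return (years, months, days) between two dates, assuming start <= end."""
    # Whole months elapsed; one less if the day of month has not been reached yet
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1
    # Shift the start date forward by that many months, clamping the day
    # to the length of the target month (e.g. Jan 31 -> Feb 28)
    year_shift, month_index = divmod(start.month - 1 + months, 12)
    year, month = start.year + year_shift, month_index + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    anchor = date(year, month, day)
    # Whatever is left after the whole months is counted in days
    return months // 12, months % 12, (end - anchor).days


# Hypothetical dates that are exactly 1801 days apart
first, last = date(2017, 1, 15), date(2021, 12, 21)
print((last - first).days)
print("{} years, {} months, {} days".format(*calendar_diff(first, last)))
```

For these dates the calendar-aware result is 4 years, 11 months, 6 days, while the 365/30 approximation applied to the same 1801 days gives 4 years, 11 months, 11 days.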


# 3. Solving simple information retrieval task

As the name suggests, `information retrieval` is the discipline that helps to retrieve information (from unstructured sources). Thus, we will retrieve some information from [this news article](https://www.bbc.com/news/world-us-canada-59944889). Your task is to write code that answers the question: **How many people die every day in the US waiting for a transplant?** Make the code flexible enough. Test yourself by changing the link to [this one](https://www.americantransplantfoundation.org/about-transplant/facts-and-myths/).

In [85]:
import requests
from bs4 import BeautifulSoup

# Alternative page to test with:
# url = 'https://www.americantransplantfoundation.org/about-transplant/facts-and-myths/'
url = 'https://www.bbc.com/news/world-us-canada-59944889'
question = 'How many people die every day in the US waiting for a transplant?'

resp = requests.get(url=url)
soup = BeautifulSoup(resp.text, "html.parser")
# Flatten the page body into one string and split it into rough sentences
body_text = soup.find_all("body")[0].text.replace("\n", " ").replace("\r", "")
sentences = body_text.split('.')

# Score each sentence by how many question words occur in it (substring match)
bow = question.split()
scores = []
for sentence in sentences:
    score = 0
    for word in bow:
        if word in sentence:
            score += 1
    scores.append(score)

# Print the best-scoring sentence (the first one in case of a tie)
index = scores.index(max(scores))
print(sentences[index])

 Currently 17 people die every day in the US waiting for a transplant, with more than 100,000 reportedly on the waiting list
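
The scoring idea can be separated from the download and tested on plain text. The sketch below is a slightly stricter variant, not the original solution: it compares lowercased whole words instead of substrings and splits on sentence-ending punctuation followed by whitespace. The function name `best_sentence` and the sample text are hypothetical.

```python
import re


def best_sentence(text: str, question: str) -> str:
    """Return the sentence sharing the most distinct words with the question."""
    # Lowercased word set of the question, punctuation ignored
    query_words = set(re.findall(r"\w+", question.lower()))
    # Naive sentence split: '.', '!' or '?' followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # max() keeps the first sentence in case of a tie
    return max(
        sentences,
        key=lambda s: len(query_words & set(re.findall(r"\w+", s.lower()))),
    )


# Hypothetical sample text, loosely modelled on the article
sample = (
    "Organ donation saves lives. "
    "Currently 17 people die every day in the US waiting for a transplant. "
    "More than 100,000 people are reportedly on the waiting list."
)
question = "How many people die every day in the US waiting for a transplant?"
print(best_sentence(sample, question))
```

Matching whole words avoids accidental hits such as the question word `a` matching any sentence that merely contains the letter, which the substring test cannot rule out.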
