# 0. [OPTIONAL] Installing course dependencies

These are dependencies for the whole course.

In [3]:
!pip install -r ../requirements.txt

Defaulting to user installation because normal site-packages is not writeable
Collecting argparse>=1.4.0
  Using cached argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Installing collected packages: argparse
Successfully installed argparse-1.4.0


You may skip the next block for now. You will need `ffmpeg` in week 12.

In [None]:
# !conda update -y base conda
!conda install -c conda-forge ffmpeg -y

Run the next cell if you want to download the embedding model. It is not required for this lab, so you can do it later.

In [None]:
!python -m spacy download en_trf_distilbertbaseuncased_lg

# 1. Touching the Internet

Solve the following task.
1. Download [this page](https://raw.githubusercontent.com/IUCVLab/information-retrieval/main/datasets/facts.txt)
2. Save it to a file with a **unique** name derived from the URL. NB: a file downloaded from a different URL must not be saved under the same name. E.g. [this file](https://github.com/IUCVLab/information-retrieval/blob/main/datasets/facts.txt) is a different file with different content!

Hints:
- [requests](https://docs.python-requests.org/en/latest/) library is cool.
- [hashlib](https://docs.python.org/3/library/hashlib.html) may help with computing hash strings.
- when you download and save the data, don't try to encode and decode it. Use binary format when working with streams and files. <span style="color:red">Discuss with your TA which encodings you know and how they differ</span>.
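
To make the last hint concrete, here is a small sketch of why the bytes are best left untouched. The sample string is made up for illustration. It shows that the same text maps to different byte sequences under different encodings, and that decoding with the wrong codec can silently corrupt the data instead of raising an error.

```python
# Hypothetical sample text: Latin letters plus one Cyrillic word.
text = "facts: факт"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16")
cp1251 = text.encode("cp1251")

# Same text, three byte sequences of different lengths.
print(len(utf8), len(utf16), len(cp1251))

# Decoding with the wrong codec does not raise here; it yields mojibake.
garbled = utf8.decode("cp1251")
print(garbled == text)

# Round-tripping with the matching codec restores the original text.
print(utf8.decode("utf-8") == text)
```

Writing `response.content` in `'wb'` mode avoids this whole class of problems, because no codec is involved at any point.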

In [4]:
import requests
from hashlib import sha512

url1 = "https://raw.githubusercontent.com/IUCVLab/information-retrieval/main/datasets/facts.txt"
url2 = "https://github.com/IUCVLab/information-retrieval/blob/main/datasets/facts.txt"


def download(url):
    """Save the response body to a file named after the SHA-512 hash of the URL."""
    r = requests.get(url)
    r.raise_for_status()  # do not save an error page as if it were the data
    # The hash is computed from the URL, so different URLs get different names.
    filename = "./" + sha512(url.encode()).hexdigest() + ".txt"
    # Write the raw bytes; the content is never decoded or re-encoded.
    with open(filename, "wb") as f:
        return f.write(r.content)  # number of bytes written


download(url1)
download(url2)


199805

# 2. Parsing different formats

Most probably, whatever you meet on the Internet is one of: binary, plain text, XML, or JSON. XML in turn splits into XHTML, RSS, Atom, SOAP, XML-RPC, and so on. Your task is to learn how to process different formats.

## 2.1. JSON

[The given file](http://sprotasov.ru/data/postnauka.txt) contains valid JSON. Parse this file and print all video URLs that have the `computer science` tag. Use the built-in features of `requests`, or just the `json` library ([ref](https://docs.python.org/3/library/json.html)).

Hint:
- if the file has issues with parsing, read about [the difference between `utf-8` and `utf-8-sig`](https://stackoverflow.com/questions/57152985/what-is-the-difference-between-utf-8-and-utf-8-sig).
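
The hint can be reproduced offline. The payload below is hypothetical (only the `url` and `tags` keys mirror the task). It starts with a UTF-8 byte order mark, which plain `utf-8` decoding keeps as a leading character that the `json` parser rejects, while `utf-8-sig` strips it.

```python
import json

# Hypothetical payload: a UTF-8 byte order mark followed by a small JSON document.
raw = b"\xef\xbb\xbf" + b'[{"url": "http://example.com/1", "tags": ["computer science"]}]'

# Plain utf-8 keeps the BOM as the character U+FEFF, which json rejects.
try:
    json.loads(raw.decode("utf-8"))
    bom_rejected = False
except json.JSONDecodeError:
    bom_rejected = True

# utf-8-sig strips the BOM, so the parser sees clean JSON.
items = json.loads(raw.decode("utf-8-sig"))
print(bom_rejected, items[0]["url"])
```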

In [5]:
import json
import requests

url = "http://sprotasov.ru/data/postnauka.txt"
r = requests.get(url)
# "utf-8-sig" drops a leading byte order mark (BOM), if any, before parsing.
r.encoding = "utf-8-sig"

for item in r.json():
    if "computer science" in item["tags"]:
        print(item["url"])

http://postnauka.ru/talks/31897
http://postnauka.ru/video/24306
http://postnauka.ru/faq/46974


## 2.2. HTML

For a given StackExchange question page, extract the logins of the contributors (who asked and who answered) along with the vote counts. [bs4](https://beautiful-soup-4.readthedocs.io/en/latest/) will help you do the job.

I recommend using CSS or XPath selectors. `div` elements with the `post-layout` class represent posts (the question and each answer). Inside each one there is a `div` with the `votecell` class storing the vote count and a `div` with the `user-details` class storing user info. My personal recommendation is to use CSS selectors, which are [documented here](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors).
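
As a rough illustration of the CSS-selector approach, here is a sketch that runs on a hand-written HTML fragment. The markup, user names and vote counts are invented; only the class names follow the ones used in this section, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class names used in this section.
html = """
<div class="post-layout">
  <div class="votecell"><div class="js-vote-count"> 7 </div></div>
  <div class="user-details"><a href="/users/1">alice</a></div>
</div>
<div class="post-layout">
  <div class="votecell"><div class="js-vote-count"> 2 </div></div>
  <div class="user-details"><a href="/users/2">bob</a></div>
  <div class="user-details"><span>no link here</span></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

results = []
for post in soup.select("div.post-layout"):
    # select_one returns the first match (or None) for a CSS selector.
    votes = post.select_one("div.votecell .js-vote-count").get_text(strip=True)
    # "div.user-details a" matches links nested inside user-details blocks.
    names = [a.get_text(strip=True) for a in post.select("div.user-details a")]
    results.append((names, votes))

print(results)
```

Compared with nested `find_all` calls, a selector such as `div.user-details a` expresses "links inside user blocks" in one step and skips blocks that have no link.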

In [6]:
import requests
from bs4 import BeautifulSoup

url = "https://math.stackexchange.com/questions/411486/"\
        "understanding-the-singular-value-decomposition-svd"
r = requests.get(url)
# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(r.content, "html.parser")

# Each post (the question and every answer) sits in a div.post-layout.
for div in soup.find_all("div", {"class": "post-layout"}):
    votes = div.find("div", {"class": "js-vote-count"}).getText().strip()
    # A post may list several users, e.g. an editor and the author.
    names = []
    for user in div.find_all("div", {"class": "user-details"}):
        name = user.find("a")
        if name:
            names.append(name.getText())

    print(f"({names}, {votes})")

(['Rodrigo de Azevedo', 'Celdor'], 23)
(['Ittay Weiss'], 17)
(['Tomasz Bartkowiak'], 10)
(['Bart Vanderbeke'], 4)
(['Bart Vanderbeke'], 3)
(['hgfei'], 2)
(['littleO'], 1)
(['TheSHETTY-Paradise'], 1)


## 2.3. RSS feed

A lot of information is already organized in typed XML documents. Podcasts, for example, are just RSS feeds. Parse [the feed of this podcast](http://sprotasov.ru/podcast/rss.xml) and print out:
- the number of episodes
- the length of the time span between the first and the last episodes (in days).

Use [`feedparser` library for this](https://waylonwalker.com/parsing-rss-python/).
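
The date arithmetic itself can be sketched without any network access. RSS publication dates normally follow the RFC 822 style, which the standard library parses with `email.utils.parsedate_to_datetime`; this is a stdlib alternative used here for illustration, not part of `feedparser`. The three dates below are invented and are not taken from the podcast feed.

```python
from email.utils import parsedate_to_datetime

# Hypothetical publication dates in the RFC 822 style used by RSS feeds.
published = [
    "Mon, 14 Nov 2022 18:25:00 +0000",
    "Sun, 01 May 2022 09:00:00 +0300",
    "Fri, 01 Jan 2021 12:00:00 +0000",
]

# Each string becomes a timezone-aware datetime, so mixed offsets compare correctly.
dates = [parsedate_to_datetime(p) for p in published]

# Taking min and max avoids relying on the order of entries in the feed.
span = max(dates) - min(dates)
print(f"Number of episodes: {len(dates)}")
print(f"Span: {span.days} days")
```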

In [7]:
import feedparser
from datetime import datetime

rss = 'http://sprotasov.ru/podcast/rss.xml'
parsed = feedparser.parse(rss)
print(parsed.entries[0].keys())
print(parsed.entries[0]['published_parsed'])
episodes = len(parsed.entries)

# entries[0] is the most recent episode here, so the last entry is the first one.
first_date = parsed.entries[-1]['published']
last_date = parsed.entries[0]['published']

# Ref: https://pynative.com/python-datetime-format-strftime/
# Ref: https://pynative.com/python-difference-between-two-dates/#how-to-measure-execution-time-in-python

# Day and month names may be abbreviated or written in full, so try each variant.
formats = ["%a, %d %b %Y %H:%M:%S %z", "%A, %d %b %Y %H:%M:%S %z", "%a, %d %B %Y %H:%M:%S %z", "%A, %d %B %Y %H:%M:%S %z"]


def parse_date(text):
    """Return the datetime for the first matching format, or raise ValueError."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")


span = parse_date(last_date) - parse_date(first_date)

# Number of episodes
print(f'Number of episodes: {episodes}')
# Length of time between the first and the last episodes, in days
print(f'Span of time between first and last episodes: {span.days} days')

dict_keys(['title', 'title_detail', 'summary', 'summary_detail', 'content', 'links', 'tags', 'published', 'published_parsed', 'itunes_duration', 'id', 'guidislink'])
time.struct_time(tm_year=2022, tm_mon=11, tm_mday=14, tm_hour=18, tm_min=25, tm_sec=0, tm_wday=0, tm_yday=318, tm_isdst=0)
Number of episodes: 18
Span of time between first and last episodes: 2030 days


# 3. [EXTRA TASK] Solving simple information retrieval task

As the name suggests, `information retrieval` is the discipline that helps retrieve information (from unstructured sources). Thus, we will retrieve some information from [this news article](https://www.bbc.com/news/world-us-canada-59944889). Your task is to write code that answers the question: **How many people die every day in the US waiting for a transplant?** Make the code flexible enough. Test yourself by changing the link to [this one](https://www.americantransplantfoundation.org/about-transplant/facts-and-myths/).

In [None]:
import requests
url = 'https://www.bbc.com/news/world-us-canada-59944889'
url2 = 'https://www.americantransplantfoundation.org/about-transplant/facts-and-myths/'

question = 'How many people die every day in the US waiting for a transplant?'

# TODO. Impress me!
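
One possible starting point, offered as a sketch rather than a reference solution: score every sentence that contains a number by how many words it shares with the question, and return the best one. The helper names (`words`, `best_numeric_sentence`) and the sample text are invented for illustration, and the figures in the sample are not taken from either article. The approach only sees numbers written with digits.

```python
import re


def words(text):
    """Lowercase alphabetic tokens of a string, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def best_numeric_sentence(question, text):
    """Return (sentence, numbers) for the number-bearing sentence that shares
    the most words with the question, or None if nothing overlaps."""
    q = words(question)
    best, best_score = None, 0
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        numbers = re.findall(r"\d[\d,]*", sentence)
        if not numbers:
            continue
        score = len(q & words(sentence))
        if score > best_score:
            best, best_score = (sentence, numbers), score
    return best


question = "How many people die every day in the US waiting for a transplant?"
# Made-up text for illustration only; the figures are NOT from either article.
sample = (
    "Doctors performed 3 operations last year. "
    "On average, 5 people die every day waiting for a transplant. "
    "The hospital opened in 1990."
)
print(best_numeric_sentence(question, sample))
```

With a real page, the text could come from something like `BeautifulSoup(requests.get(url).content, "html.parser").get_text(" ")`, after which the same function works for either link. Word overlap is a crude relevance measure; stop-word removal or the embedding model from section 0 would be natural next steps.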