# 0. [OPTIONAL] Installing course dependencies

These are dependencies for the whole course.

In [None]:
!pip install -r ../requirements.txt

You may skip the next block for now. You will need `ffmpeg` in week 12.

In [None]:
# !conda update -y base conda
!conda install -c conda-forge ffmpeg -y

Run the next cell if you want to download the embedding model. It is not required for this lab, so you can do it later.

In [None]:
!python -m spacy download en_trf_distilbertbaseuncased_lg

# 1. Touching the Internet

Solve the following task.
1. Download [this page](https://raw.githubusercontent.com/YusufRoshdy/information-retrieval/main/datasets/facts.txt)
2. Save it to a file with a **unique** name derived from the URL. NB: a file downloaded from a different URL must not be saved under the same name. E.g. [this file](https://github.com/YusufRoshdy/information-retrieval/blob/main/datasets/facts.txt) is a different file with different content!

Hints:
- [requests](https://docs.python-requests.org/en/latest/) library is cool.
- [hashlib](https://docs.python.org/3/library/hashlib.html) may help with computing hash strings.
- When you download and save the data, do not decode and re-encode it. Use binary mode when working with streams and files. <span style="color:red">Discuss with your TA which encodings you know and how they differ</span>.
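
To make the last hint concrete, here is a small sketch showing that the same text becomes different bytes under different encodings, and that decoding with the wrong codec silently changes the text. The sample string is made up for the illustration.

```python
# The sample string is illustrative only.
text = "café"

utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")
utf16_bytes = text.encode("utf-16-le")

print(utf8_bytes)    # b'caf\xc3\xa9'
print(latin1_bytes)  # b'caf\xe9'
print(utf16_bytes)   # b'c\x00a\x00f\x00\xe9\x00'

# Decoding UTF-8 bytes as Latin-1 does not fail, it just yields wrong text.
wrong = utf8_bytes.decode("latin-1")
print(wrong)  # cafÃ©
```

Writing `response.content` to a file opened in `'wb'` mode avoids this problem, because the bytes are stored exactly as they were received.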

In [None]:
import requests

url1 = "https://raw.githubusercontent.com/YusufRoshdy/information-retrieval/main/datasets/facts.txt"
url2 = "https://github.com/YusufRoshdy/information-retrieval/blob/main/datasets/facts.txt"

# TODO: download and save these documents

In [None]:
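import hashlib
import requests

url1 = "https://raw.githubusercontent.com/YusufRoshdy/information-retrieval/main/datasets/facts.txt"

response = requests.get(url1)

# The Content-Type header shows what the server claims to have sent.
print(response.headers['Content-Type'])
if response.status_code == 200:
  # Hash the URL so that every distinct URL maps to its own file name.
  filename = hashlib.sha256(url1.encode()).hexdigest()
  filename = f'{filename}.txt'

  # Write the raw bytes as received, without decoding them.
  with open(filename, 'wb') as file:
    file.write(response.content)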

In [None]:
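import hashlib
import requests

url2 = "https://github.com/YusufRoshdy/information-retrieval/blob/main/datasets/facts.txt"

response = requests.get(url2)

# The Content-Type header shows what the server claims to have sent.
print(response.headers['Content-Type'])
if response.status_code == 200:
  # A different URL gives a different hash, hence a different file name.
  filename = hashlib.sha256(url2.encode()).hexdigest()
  filename = f'{filename}.txt'

  # Write the raw bytes as received, without decoding them.
  with open(filename, 'wb') as file:
    file.write(response.content)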

# 2. Parsing different formats

Most likely, whatever you encounter on the Internet is one of: binary, plain text, XML, or JSON. XML in turn covers XHTML, RSS, Atom, SOAP, XML-RPC, and others. Your task is to learn how to process these formats.

## 2.1. JSON

[The given file](https://raw.githubusercontent.com/YusufRoshdy/information-retrieval/main/datasets/unique_videos.json) contains valid JSON. Parse it and print the URLs of all videos that have the `computer science` tag. Use the built-in features of `requests`, or just the `json` library ([ref](https://docs.python.org/3/library/json.html)).

Hint:
- If the file fails to parse, read about [the difference](https://stackoverflow.com/questions/57152985/what-is-the-difference-between-utf-8-and-utf-8-sig) between `utf-8` and `utf-8-sig`.
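
The hint can be reproduced without the network. The sketch below builds a hypothetical payload that starts with a UTF-8 byte-order mark (BOM) and uses the same keys as this lab's file (`videos`, `url`, `tags`); the URL inside it is a placeholder. Decoding with plain `utf-8` keeps the BOM as a leading character, which `json.loads` rejects, while `utf-8-sig` strips it.

```python
import json

# Hypothetical payload: a UTF-8 BOM followed by JSON with placeholder values.
raw = b'\xef\xbb\xbf{"videos": [{"url": "http://example.com/v1", "tags": ["computer science"]}]}'

# Plain utf-8 keeps the BOM as a leading U+FEFF character, and json rejects it.
try:
    json.loads(raw.decode("utf-8"))
    plain_utf8_ok = True
except json.JSONDecodeError:
    plain_utf8_ok = False

# utf-8-sig removes the BOM when it is present.
data = json.loads(raw.decode("utf-8-sig"))

print(plain_utf8_ok)             # False
print(data["videos"][0]["url"])  # http://example.com/v1
```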

In [None]:
import requests

url = "https://raw.githubusercontent.com/YusufRoshdy/information-retrieval/main/datasets/unique_videos.json"

response = requests.get(url)
data = response.json()  # parse the response body as JSON
videos = data.get('videos', [])

# Collect the URLs of videos tagged 'computer science'.
cs_urls = []
for video in videos:
  if 'computer science' in video.get('tags', []):
    cs_urls.append(video.get('url'))

for video_url in cs_urls:
  print(video_url)

http://www.tech_tutorialhub.com/lessons/video_1
http://www.coding_learnnow.com/topics/video_2
http://www.coding_learnnow.com/courses/video_3
http://www.coding_videostore.com/courses/video_4
http://www.cs_tutorialhub.com/topics/video_5


## 2.2. HTML

For a given StackExchange question, extract the logins of the contributors (who asked and who answered) together with their vote counts. [bs4](https://beautiful-soup-4.readthedocs.io/en/latest/) will help you do the job.

I recommend using CSS or XPath selectors. `div` elements with the `post-layout` class represent answers. Inside each there is a `div` with the `votecell` class storing the number of votes and a `div` with the `user-details` class storing user info. My personal recommendation is to use `css selectors`, which are [documented here](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors).

In [1]:
import requests
from bs4 import BeautifulSoup

url = "https://math.stackexchange.com/questions/411486/"\
        "understanding-the-singular-value-decomposition-svd"
print(url)

# TODO. Your code here should parse the HTML source page and find the contributors of the question.

response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')
# print(soup.prettify())
posts = soup.select('div.post-layout')
votes = []
user_details = []
for post in posts:
  votes_element = post.select_one('div.votecell')
  user_info = post.select_one('div.user-details')
  # Skip posts that lack either block so the two lists stay aligned.
  if votes_element is None or user_info is None:
    continue
  votes.append(votes_element.get_text(strip=True))
  user_details.append(user_info.get_text(strip=True))
for user, vote in zip(user_details, votes):
  print(f'{user} \t: {vote}')

ModuleNotFoundError: No module named 'bs4'

## 2.3. RSS feed

A lot of information is already organized in typed XML documents. Podcasts, for example, are just RSS feeds. Parse [the feed of this podcast](https://waylonwalker.com/rss.xml) and print out:
- the number of episodes
- the length of the time span between the first and the last episodes (in days).

Use [`feedparser` library for this](https://waylonwalker.com/parsing-rss-python/).

In [None]:
import feedparser
rss = 'https://waylonwalker.com/rss.xml'
feed = feedparser.parse(rss)  # keep the parsed feed; episodes are in feed.entries

# TODO: complete the code to compute the time span of all the episodes.
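
The time-span part of the task comes down to date arithmetic. Below is a minimal sketch, assuming each episode carries an RFC 822 date string (the format RSS uses for `pubDate`, which `feedparser` typically exposes as `entry.published`). The helper `span_in_days` and the sample dates are hypothetical and only illustrate the computation.

```python
from email.utils import parsedate_to_datetime

def span_in_days(pub_dates):
    """Return the number of whole days between the earliest and latest date."""
    moments = [parsedate_to_datetime(d) for d in pub_dates]
    return (max(moments) - min(moments)).days

# Hypothetical publication dates, for illustration only.
sample_dates = [
    "Mon, 01 Jan 2024 08:00:00 +0000",
    "Wed, 10 Jan 2024 20:00:00 +0000",
    "Fri, 05 Jan 2024 12:00:00 +0000",
]

print(len(sample_dates))           # number of episodes: 3
print(span_in_days(sample_dates))  # 9
```

Using `max` and `min` instead of the first and last list items avoids relying on the order of entries in the feed. With a real feed, a list such as `[entry.published for entry in feedparser.parse(rss).entries]` could be passed in, assuming every entry has a `published` field.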

# 3. [EXTRA TASK] Solving simple information retrieval task

As the name suggests, `information retrieval` is the discipline that helps retrieve information (from unstructured sources). Thus, we will retrieve some information from [this news article](https://www.nbcnews.com/health/health-news/organ-transplant-work-us-revamps-organ-donation-system-rcna76103). Your task is to write code that answers the question: **How many people die every day in the US waiting for a transplant?** Make the code flexible enough. Test yourself by changing the link to [this one](https://www.kidney.org/news/newsroom/factsheets/Organ-Donation-and-Transplantation-Stats).

In [None]:
import requests
url = 'https://www.nbcnews.com/health/health-news/organ-transplant-work-us-revamps-organ-donation-system-rcna76103'
url = 'https://www.kidney.org/news/newsroom/factsheets/Organ-Donation-and-Transplantation-Stats'
# url = 'https://www.bbc.com/news/world-us-canada-59944889' # One more link

question = 'How many people die every day in the US waiting for a transplant?'

# TODO. Impress me!
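
One possible starting point, not the intended solution: split the page text into sentences, keep the one that shares the most keywords with the question, and pull the first number out of it. The function name `answer_number`, the keyword set and the sample text are assumptions made for this sketch. A real page would first need its HTML reduced to plain text.

```python
import re

def answer_number(text, keywords):
    """Return the first number in the sentence that matches the most keywords."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    best_sentence, best_score = None, 0
    for sentence in sentences:
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        score = len(words & keywords)
        # Only a sentence containing a digit can answer "how many".
        if score > best_score and re.search(r"\d", sentence):
            best_sentence, best_score = sentence, score
    if best_sentence is None:
        return None
    match = re.search(r"\d[\d,]*", best_sentence)
    return int(match.group().replace(",", ""))

# Hand-written sample text, used only to exercise the function.
sample = (
    "More than 100,000 people are on the waiting list. "
    "Every day 17 people die waiting for a transplant. "
    "The system is being revamped."
)
keywords = {"people", "die", "day", "waiting", "transplant"}

print(answer_number(sample, keywords))  # 17
```

Keyword overlap is a crude relevance score. It ignores numbers written as words and can be misled by sentences that repeat the question's vocabulary, so treat the result as a baseline to improve on.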