# 0. [OPTIONAL] Installing course dependencies

These are dependencies for the whole course.

In [None]:
!pip install -r ../requirements.txt

You may skip the next block for now; you will need `ffmpeg` in week 12.

In [None]:
# !conda update -y base conda
!conda install -c conda-forge ffmpeg -y

Run the next cell if you want to download the embedding model. It is not required for this lab, so you can do it later.

In [None]:
!python -m spacy download en_trf_distilbertbaseuncased_lg

# 1. Touching the Internet

Solve the following task.
1. Download [this page](https://raw.githubusercontent.com/IUCVLab/information-retrieval/main/datasets/facts.txt)
2. Save it to a file with a **unique** name derived from the URL. NB: a file downloaded from a different URL must not be saved under the same name. E.g. [this file](https://github.com/IUCVLab/information-retrieval/blob/main/datasets/facts.txt) is a different file with different content!

Hints:
- [requests](https://docs.python-requests.org/en/latest/) library is cool.
- [hashlib](https://docs.python.org/3/library/hashlib.html) may help with computing hash strings.
- when you download and save the data, don't try to encode and decode it. Use binary format when working with streams and files. <span style="color:red">Discuss with your TA which encodings you know and how they differ</span>.
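
To make the last hint concrete, here is a minimal sketch of how one string becomes different byte sequences under different encodings, and why decoding with the wrong one corrupts the text. The sample string is an arbitrary illustration and is not taken from the course data.

```python
# The sample text is an arbitrary example chosen for illustration.
text = "naïve café"

utf8_bytes = text.encode("utf-8")        # 1 byte per ASCII char, 2 bytes for ï and é
utf16_bytes = text.encode("utf-16-le")   # 2 bytes per character for this string
latin1_bytes = text.encode("latin-1")    # exactly 1 byte per character

print(len(text), len(utf8_bytes), len(utf16_bytes), len(latin1_bytes))

# Decoding with the wrong encoding does not fail here, it silently garbles the text.
garbled = utf8_bytes.decode("latin-1")
print(garbled)

# Decoding with the encoding that produced the bytes restores the original string.
restored = utf8_bytes.decode("utf-8")
print(restored == text)
```

This is why the hint suggests binary mode: if the bytes are written exactly as received, no encoding has to be guessed at download time.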

In [3]:
import requests
import hashlib

url1 = "https://raw.githubusercontent.com/IUCVLab/information-retrieval/main/datasets/facts.txt"
url2 = "https://github.com/IUCVLab/information-retrieval/blob/main/datasets/facts.txt"

# Download a page and save it under a name derived from its URL
def save_webpage(url):
    response = requests.get(url)
    print(response)

    # Raise an exception for 4xx/5xx responses
    response.raise_for_status()

    # Derive a unique file name from the SHA-256 hash of the URL
    url_hash = hashlib.sha256(url.encode()).hexdigest()
    print(url_hash)

    # Write the raw bytes in binary mode, without decoding them
    with open(url_hash + ".txt", "wb") as file:
        file.write(response.content)


save_webpage(url1)
save_webpage(url2)

<Response [200]>
<md5 _hashlib.HASH object @ 0x7fe5147a5d30>
e2d9cd700ca8f8c01cd68193a1249d29030a1782108055b444a5ead7ab310425
<Response [200]>
<md5 _hashlib.HASH object @ 0x7fe51db63ad0>
7770f104796093e196fac1e6822438ba9a2dadc33030edb373caf84a4fc99005


# 2. Parsing different formats

Most likely, anything you come across on the Internet is one of: binary, plain text, XML, or JSON. XML in turn splits into XHTML, RSS, Atom, SOAP, XML-RPC, and so on. Your task is to learn how to process these formats.

## 2.1. JSON

[The given file](http://sprotasov.ru/data/postnauka.txt) contains valid JSON. Parse it and print the URLs of all videos that have the `computer science` tag. Use the built-in features of `requests`, or just the `json` library ([ref](https://docs.python.org/3/library/json.html)).

Hint:
- if the file has issues with parsing, read about [the difference](https://stackoverflow.com/questions/57152985/what-is-the-difference-between-utf-8-and-utf-8-sig) between `utf-8` and `utf-8-sig`.
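
The parsing issue the hint refers to can be reproduced offline. The sketch below uses a hypothetical payload (a tiny JSON document prefixed with a UTF-8 byte order mark) rather than the real file, and shows that decoding with `utf-8` keeps the mark while `utf-8-sig` strips it.

```python
import codecs
import json

# Hypothetical payload: a small JSON document saved with a UTF-8 byte order mark (BOM).
raw = codecs.BOM_UTF8 + b'{"videos": []}'

# Plain utf-8 keeps the BOM as the character U+FEFF, which json.loads rejects in a str.
try:
    json.loads(raw.decode("utf-8"))
    bom_rejected = False
except json.JSONDecodeError:
    bom_rejected = True

# utf-8-sig drops the BOM if it is present and otherwise behaves like utf-8.
data = json.loads(raw.decode("utf-8-sig"))
print(bom_rejected, data)
```

With `requests`, the same raw bytes are available as `response.content`, so the decoding step can be applied to them directly.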

In [4]:
import requests
import json

# URL of the JSON file
URL = "https://raw.githubusercontent.com/YusufRoshdy/information-retrieval/main/datasets/unique_videos.json"

# Fetch the content from the URL
response = requests.get(URL)

# Check if the response status code is 200 (OK)
if response.status_code == 200:
    # Load the JSON data from the response content
    data = json.loads(response.content)
    
    # Iterate through each video in the data
    for video in data['videos']:
        # Check if "computer science" is one of the tags
        if 'computer science' in video['tags']:
            # Print the URL
            print(video['url'])
else:
    print("Failed to fetch the JSON file!")



http://www.tech_tutorialhub.com/lessons/video_1
http://www.coding_learnnow.com/topics/video_2
http://www.coding_learnnow.com/courses/video_3
http://www.coding_videostore.com/courses/video_4
http://www.cs_tutorialhub.com/topics/video_5


## 2.2. HTML

For a given StackExchange page, extract the logins of the contributors (who asked and who answered) together with their vote counts. [bs4](https://beautiful-soup-4.readthedocs.io/en/latest/) will help you do the job.

I recommend using CSS or XPath selectors. `div` elements with the `post-layout` class represent answers. Inside each there is a `div` with the `votecell` class storing the number of votes and a `div` with the `user-details` class storing user info. Of the two, I personally prefer `css selectors`, which are [documented here](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors).

In [29]:
! pip install BeautifulSoup4

Defaulting to user installation because normal site-packages is not writeable
Collecting BeautifulSoup4
  Using cached beautifulsoup4-4.12.2-py3-none-any.whl (142 kB)
Collecting soupsieve>1.2
  Using cached soupsieve-2.4.1-py3-none-any.whl (36 kB)
Installing collected packages: soupsieve, BeautifulSoup4
Successfully installed BeautifulSoup4-4.12.2 soupsieve-2.4.1


In [5]:
import requests
from bs4 import BeautifulSoup

response = requests.get("https://math.stackexchange.com/questions/411486/understanding-the-singular-value-decomposition-svd")
soup = BeautifulSoup(response.text, 'html.parser')

# Find the contributors of the question and its answers, with their votes
for post in soup.select(".post-layout"):
    # get_text(strip=True) drops the whitespace that surrounds the text in the page markup
    user = post.select_one(".user-details a").get_text(strip=True)
    votes = post.select_one(".votecell .js-vote-count").get_text(strip=True)
    print(f"{user}: {votes} votes")


Rodrigo de Azevedo: 23 votes
Ittay Weiss: 17 votes
Tomasz Bartkowiak: 12 votes
Bart Vanderbeke: 4 votes
Bart Vanderbeke: 3 votes
hgfei: 2 votes
littleO: 1 votes
TheSHETTY-Paradise: 1 votes


## 2.3. RSS feed

A lot of information is already organized in typed XML documents. Podcasts, for example, are just RSS feeds. Parse [the feed of this podcast](http://sprotasov.ru/podcast/rss.xml) and print out:
- the number of episodes
- the length of the time span between the first and the last episodes (in days).

Use the [`feedparser` library](https://waylonwalker.com/parsing-rss-python/) for this.
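
The date arithmetic behind the second bullet can be tried in isolation. The sketch below deliberately uses only the standard library instead of `feedparser`, and the `pub_dates` values are hypothetical strings in the RFC 822 style that RSS uses for `<pubDate>`, not dates from the podcast feed.

```python
from email.utils import parsedate_to_datetime

# Hypothetical pubDate values, deliberately out of chronological order.
pub_dates = [
    "Mon, 06 Jan 2020 10:00:00 +0000",
    "Fri, 14 Feb 2020 18:30:00 +0000",
    "Wed, 01 Jan 2020 09:00:00 +0000",
]

# Parse each string into a timezone-aware datetime.
parsed = [parsedate_to_datetime(d) for d in pub_dates]

# Feed order is not guaranteed, so take min and max rather than the first and last item.
span_days = (max(parsed) - min(parsed)).days
print(len(parsed), span_days)
```

The same `max` minus `min` step applies to the dates that `feedparser` returns, once they are converted to `datetime` objects.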

In [39]:
! pip install feedparser

Defaulting to user installation because normal site-packages is not writeable
Collecting feedparser
  Using cached feedparser-6.0.10-py3-none-any.whl (81 kB)
Collecting sgmllib3k
  Using cached sgmllib3k-1.0.0-py3-none-any.whl
Installing collected packages: sgmllib3k, feedparser
Successfully installed feedparser-6.0.10 sgmllib3k-1.0.0


In [12]:
import feedparser
from datetime import datetime

rss_feed = feedparser.parse('https://waylonwalker.com/rss.xml')

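# Collect the publication dates (time.struct_time) of the entries that have one
published = [entry.published_parsed for entry in rss_feed.entries if 'published_parsed' in entry]
start_date = min(published)
end_date = max(published)
# Convert both struct_time values to datetime and take the difference in whole days
days_difference = (datetime(*end_date[:6]) - datetime(*start_date[:6])).days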
print(f"Number of episodes: {len(rss_feed.entries)}")
print(f"Days between first and last episode: {days_difference}")


Number of episodes: 695
Days between first and last episode: 8576


# 3. Solving simple information retrieval task

As the name suggests, `information retrieval` is the discipline that helps retrieve information (from unstructured sources). Thus, we will retrieve some information from [this news article](https://www.bbc.com/news/world-us-canada-59944889). Your task is to write code that answers the question: **How many people die every day in the US waiting for a transplant?** Make the code flexible enough. Test yourself by changing the link to [this one](https://www.americantransplantfoundation.org/about-transplant/facts-and-myths/).

In [7]:
import re

import requests
from bs4 import BeautifulSoup

url = 'https://www.bbc.com/news/world-us-canada-59944889'
url2 = 'https://www.americantransplantfoundation.org/about-transplant/facts-and-myths/'

question = 'How many people die every day in the US waiting for a transplant?'

# Download the article and reduce it to plain text
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
text = soup.get_text()

# Look for the number that directly precedes the expected phrase
match = re.search(r'(\d+) people die every day in the US waiting for a transplant', text)
if match:
    answer = match.group(1)
    print(f"{answer} people die every day in the US waiting for a transplant.")
else:
    print("Couldn't find the information.")


17 people die every day in the US waiting for a transplant.
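
The regular expression in the cell above matches only one exact phrasing, so it is likely to miss the answer on a page that words it differently. One possible way to make it more flexible, shown here as a rough sketch and not as the intended solution, is to pick the sentence that shares the most words with the question and take the first number in it. The function name `answer_how_many`, the stop-word list, and the `sample_text` sentences are all hypothetical.

```python
import re

def answer_how_many(question, text):
    """Return the first number from the sentence that shares the most words with the question."""
    # Hypothetical, hand-picked stop words that carry no meaning for the match.
    stop_words = {"how", "many", "the", "in", "a", "for"}
    keywords = set(re.findall(r"[a-z]+", question.lower())) - stop_words

    best_score, best_number = 0, None
    # Naive sentence split: break on whitespace that follows ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        number = re.search(r"\d+(?:,\d{3})*", sentence)
        if not number:
            continue  # a sentence without a number cannot answer "how many"
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        score = len(keywords & words)
        if score > best_score:
            best_score, best_number = score, number.group()
    return best_number

# Hypothetical text written for this example; it is not quoted from either page.
sample_text = (
    "More than 100,000 people are on the waiting list. "
    "On average, 17 people die each day waiting for an organ transplant. "
    "The first operation took place in 1954."
)
sample_question = "How many people die every day in the US waiting for a transplant?"
answer = answer_how_many(sample_question, sample_text)
print(answer)
```

The sketch still has clear limits: it ignores numbers written as words and returns the first number in the chosen sentence even when that number is a year.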
