# Homework (deadline 10.01.2025 11:59:59)

Write your solutions to the homework exercises in this notebook. Once the work is done, download the notebook file (`File > Download .ipynb`), rename it so that it follows the template `HW2_<SURNAME>_<NAME>.ipynb`, and upload it to [Google Classroom](https://classroom.google.com/c/NzIwNDg4NDAyMTA2/a/NzIwNDg4NDAyMTQz/details).

Remember that you can contact me via email if you have any problems. You can also visit me at the ISS on the fourth floor (room 415). I am usually there from 11ish, but please let me know in advance if you are coming because I might be busy.

## Task 1 (15 points)
You should all be able to get the file `links.jsonl` from [Google Classroom](https://drive.google.com/file/d/1f7mPwNCuihV-sogjIdFO9U6BS_ZW6ioM/view). It contains 20 links to articles from The Parliament Magazine and The Guardian. Your task is to:

* Read `links.jsonl` into Python (Google Colab)
* Scrape the articles from the given links
* Write the results to `HW2_<SURNAME>_<NAME>.jsonl`

For every article, you should extract the following information:

* Title of the article
* Author of the text (if present; otherwise the value should be `None`)
* Lead of the article (summary)
* Date of publication
* Text of the article

Additionally, you should count the number of words in each article and keep only the relevant information from `links.jsonl` (the source name and the Facebook counts).
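
The word count can be obtained by splitting the text on whitespace. Below is a minimal sketch, assuming that a "word" is any whitespace-separated token (so punctuation stays attached to the neighbouring word). The helper name `count_words` is hypothetical and not part of the assignment.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated tokens in a string."""
    # str.split() without arguments splits on any run of whitespace
    # and ignores leading and trailing whitespace.
    return len(text.split())


sample = "Lorem ipsum dolor sit amet,\n consectetur  adipiscing elit."
print(count_words(sample))
```

A stricter definition of a word (for example, one that ignores numbers or punctuation) would need a different tokenizer, but for this homework the simple split should be enough.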

*Reminder:* A JSON Lines file contains multiple JSON objects, one per line, as the name suggests. Each line should therefore look more or less like this (pretty-printed here for readability; in the file the whole object sits on a single line):

```python
{
    "title": "Why are Namedays better than Birthdays?",
    "author": "M. Biesaga",
    "date": "06.12.2019",
    "lead": "Scientists from one of the best Universities in the U.S. proved that the discussion about birthdays and namedays is finally over.",
    "content": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer eget sapien nisi. In placerat nisl felis, vel porttitor odio aliquam quis. Nulla a facilisis arcu. Suspendisse potenti. Interdum et malesuada fames ac ante ipsum primis in faucibus. Duis et semper urna. Mauris sit amet enim ex. Integer eget ultricies enim, at tristique sem. In eu eros nisl. Nulla vitae pretium risus, quis vestibulum dui. Nullam vitae dapibus quam. Maecenas commodo dictum ex, id vestibulum ex volutpat interdum. Cras tempor diam non urna auctor, vitae dignissim tortor tristique. Nunc consectetur mauris non lorem luctus aliquam. Mauris vitae ligula orci.",
    "source": "Journal of Scientific Science",
    "fb": {
        "likes": 112,
        "shares": 2,
        "comments": 43
    },
    "length": 98
}
```
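
To make the format concrete, here is a small sketch of a JSON Lines round trip that uses only the standard library. The records and the file name `example.jsonl` are made up for illustration; the file is written to a temporary directory.

```python
import json
import os
import tempfile

# Made-up records, only for illustration.
records = [
    {"title": "First", "length": 3},
    {"title": "Second", "length": 5},
]

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example.jsonl")

# Write: serialize each record separately and end it with a newline.
with open(path, "w", encoding="utf-8") as file:
    for record in records:
        file.write(json.dumps(record) + "\n")

# Read: parse every non-empty line as its own JSON object.
with open(path, "r", encoding="utf-8") as file:
    loaded = [json.loads(line) for line in file if line.strip()]

print(loaded == records)
```

Note that the file as a whole is not a valid JSON document, so `json.load(file)` would fail on it; each line has to be parsed on its own.
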
**IMPORTANT**: If you are going to send requests in a loop, please don't send more than one request per second. You can use `time.sleep(1)` from the `time` module to pause your code for one second. For example, in a loop it would look something like this:
```python
import time

for line in links:
    ## Send request
    time.sleep(1)
```
The loop runs its body and, at the end of each iteration, waits for one second.
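
The same idea can be wrapped in a small helper. This is only a sketch under stated assumptions: `fetch_politely` and `fake_fetch` are hypothetical names, the URLs are placeholders, and `fake_fetch` stands in for a real HTTP request so that the example runs offline. The delay is shortened to 0.1 seconds here; for the homework it should stay at 1 second.

```python
import time


def fetch_politely(urls, fetch, delay=1.0):
    """Call fetch(url) for every URL and pause `delay` seconds after each call."""
    results = []
    for url in urls:
        results.append(fetch(url))
        # Wait before the next request so the server is not flooded.
        time.sleep(delay)
    return results


def fake_fetch(url):
    """Stand-in for a real request; returns a dummy page."""
    return f"<html>{url}</html>"


start = time.monotonic()
pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"], fake_fetch, delay=0.1
)
elapsed = time.monotonic() - start
print(len(pages), round(elapsed, 1))
```

In real code, `fetch` could be a function that calls `requests.get(url)` and returns the response.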

In [None]:
## Import modules
import json
import requests as rq
from bs4 import BeautifulSoup
import time

## Solution 1

In [None]:
## Read links.jsonl to Python
with open("links.jsonl", "r") as file:
    links = [json.loads(line) for line in file.readlines()]

In [None]:
## Create an empty list
output = []

## Iterate over all elements of the links list
for line in links:
    ## Create an empty dictionary
    temp_dict = {}
    temp_dict["source"] = line["source_name"]
    temp_dict["fb"] = {}
    temp_dict["fb"]["likes"] = line["fb"]["likes"]
    temp_dict["fb"]["comments"] = line["fb"]["comments"]
    temp_dict["fb"]["shares"] = line["fb"]["shares"]

    ## Send a request
    response = rq.get(line["url"])

    ## Convert the response to an HTML object
    html = BeautifulSoup(response.content, "html.parser")

    if temp_dict["source"] == "The Parliament Magazine":
        temp_dict["title"] = html.select_one("div.av-title h1").text.strip()
        ## I don't extract the text from the author element here because I will
        ## do it later for all sources.
        temp_dict["author"] = html.select_one("p.av-authName a")
        temp_dict["lead"] = html.select_one("div.av-standFirst.playfair").text.strip()
        temp_dict["date"] = html.select_one("p.av-date").text.strip()
        ## Extract the content of the article with a generator expression.
        temp_dict["content"] = " ".join(
            p.text.strip() for p in html.select("div.av-main p")
        )
    elif temp_dict["source"] == "The Guardian":
        temp_dict["title"] = html.select_one("article div h1").text.strip()
        ## I don't extract the text from the author element here because I will
        ## do it later for all sources. In The Guardian, one article returns
        ## None because apparently it does not have an author. Calling
        ## .text.strip() on None would raise an AttributeError.
        temp_dict["author"] = html.select_one("address div a")
        temp_dict["date"] = html.select_one("details summary span").text.strip()
        temp_dict["lead"] = html.select_one("div p").text.strip()
        temp_dict["content"] = " ".join(
            p.text.strip() for p in html.select("article div#maincontent p")
        )
    ## Here, I check whether the author field is None. If not
    ## I extract the name of the author.
    if temp_dict["author"] is not None:
        temp_dict["author"] = temp_dict["author"].text.strip()
    ## Compute the length of the article.
    temp_dict["length"] = len(temp_dict["content"].split())

    output.append(temp_dict)
    time.sleep(1)
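
The `None` check for the author can also be factored into a small helper, which avoids repeating the same `if` for every optional field. This is a sketch, not part of the solution above: `text_or_none` is a hypothetical name, and `SimpleNamespace` is used as a stand-in for the element returned by `select_one()`, since only its `.text` attribute matters here.

```python
from types import SimpleNamespace


def text_or_none(node):
    """Return the stripped text of a node, or None when the node is missing."""
    if node is None:
        return None
    return node.text.strip()


# Stand-in for an element found by select_one().
found = SimpleNamespace(text="  M. Biesaga \n")
# select_one() returns None when nothing matches the selector.
missing = None

print(text_or_none(found))
print(text_or_none(missing))
```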

In [None]:
with open("HW2_Biesaga_Mikołaj.jsonl", "w") as file:
    for line in output:
        file.write(json.dumps(line) + "\n")
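
One detail worth knowing when writing the output: by default `json.dumps()` escapes non-ASCII characters, so names or quotation marks in the articles end up as `\uXXXX` sequences in the file. Both forms are valid JSON and load back to the same value. The short sketch below shows the difference on a made-up record.

```python
import json

# Made-up record with a non-ASCII character.
record = {"author": "Mikołaj"}

# Default behaviour: non-ASCII characters are escaped.
escaped = json.dumps(record)
# With ensure_ascii=False the characters are kept as they are.
readable = json.dumps(record, ensure_ascii=False)

print(escaped)
print(readable)
print(json.loads(escaped) == json.loads(readable))
```

If the readable form is preferred, it is safer to open the output file with `encoding="utf-8"` so that the characters are written consistently on every platform.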

## Solution 2

In [None]:
from newspaper import Article

## Open connection with the output file.
with open("HW2_Biesaga_Mikolaj.jsonl", "w") as write_file:
    ## Open connection with the file with links.
    with open("links.jsonl", "r") as read_file:
        ## Iterate over every single line of the file. Note that this method
        ## reads a single line at a time, so I never have more than one line
        ## in the working memory, unlike file.readlines(), which stores all
        ## lines in the working memory.
        for line in read_file:
            ## Convert a string into a dictionary.
            line = json.loads(line)
            ## Create an empty dictionary.
            temp_dict = {}
            ## Initialize Article object with the link.
            art = Article(line["url"])
            ## Download the website's HTML.
            art.download()
            ## Parse HTML code.
            art.parse()
            ## Get the main text of the article. The key is "content" so that
            ## the output matches the format required in the task.
            temp_dict["content"] = art.text_cleaned
            ## Get the length of the article.
            temp_dict["length"] = len(art.text_cleaned.split())
            ## Extract the publication date and convert it into a string.
            ## publish_date is None when no date could be detected.
            temp_dict["date"] = (
                art.publish_date.strftime("%Y-%m-%d %H:%M:%S")
                if art.publish_date
                else None
            )
            ## Get the title of the article.
            temp_dict["title"] = art.title
            ## Get the source name.
            temp_dict["source"] = line["source_name"]
            ## Convert a string to an HTML object
            html = BeautifulSoup(art.html, "html.parser")
            ## Extract author and lead similarly to the first approach.
            if line["source_name"] == "The Parliament Magazine":
                temp_dict["author"] = html.select_one("p.av-authName a")
                temp_dict["lead"] = html.select_one(
                    "div.av-standFirst.playfair"
                ).text.strip()
            else:
                temp_dict["author"] = html.select_one("address div a")
                temp_dict["lead"] = html.select_one("div p").text.strip()
            if temp_dict["author"]:
                temp_dict["author"] = temp_dict["author"].text.strip()
            ## Copy the Facebook statistics (likes, shares, comments).
            temp_dict["fb"] = {
                "likes": line["fb"]["likes"],
                "shares": line["fb"]["shares"],
                "comments": line["fb"]["comments"],
            }
            ## Write out the data to the file
            write_file.write(json.dumps(temp_dict) + "\n")
            ## Wait a second
            time.sleep(1)