# Homework (deadline 6.12.2024 11:59:59)

Write your solutions to the homework exercises in this notebook. Once you are done, download the notebook file (`File > Download .ipynb`), rename it to follow the template `HW1_2_<SURNAME>_<NAME>.ipynb`, and upload it to [Google Classroom](https://classroom.google.com/c/NzIwNDg4NDAyMTA2/a/NzIwNDg4NDAyMTQy/details).

Remember that you can contact me via email if you have any problems. You can also visit me at the ISS on the fourth floor (room 415). I am usually there from around 11, but please let me know in advance if you are coming because I might be busy.

## Task 1 (5 points)

Get 100 submissions from the `todayilearned` subreddit that include the word `science`.

Once you have the data, save the 10 submissions with the most comments to a JSON Lines file. Write out only the following fields:

* `author_name`
* `created_utc` - in a human-readable format
* `title`
* `num_comments`
* `url` - it should always exist, but write the code so that it does not raise an error if this field is `None` or missing.

The JSON Lines file should therefore contain one dictionary per line, each looking more or less like this:

```python
{
	'author_name' : 'manolito_gafotas',
	'created_utc' : '2022-11-11 11:11:11',
	'title' : 'TIL the announcer for Super Smash Bros Brawl was also the announcer for Bill Nye the Science Guy.',
	'num_comments' : 100,
	'url' : 'https://en.wikipedia.org/wiki/Pat_Cashman'
}
```
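
As a rough illustration of the JSON Lines format itself, the sketch below writes one JSON object per line and reads the lines back. The records and the helper names `write_jsonl` and `read_jsonl` are made up for this example, and an in-memory buffer stands in for a real file. Note that `json.dumps` emits double-quoted strings and writes `None` as `null`, so the file will not look exactly like the Python dictionary above.

```python
import io
import json

# Hypothetical records standing in for the selected submissions.
records = [
    {"title": "first", "num_comments": 3, "url": None},
    {"title": "second", "num_comments": 7, "url": "https://example.com"},
]


def write_jsonl(rows, handle):
    """Write each dictionary as one JSON object per line."""
    for row in rows:
        handle.write(json.dumps(row) + "\n")


def read_jsonl(handle):
    """Parse one JSON object from every non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]


buffer = io.StringIO()  # in-memory stand-in for a file opened with open(...)
write_jsonl(records, buffer)
buffer.seek(0)  # rewind so the lines can be read back
restored = read_jsonl(buffer)
```

Reading the file back this way is a cheap check that every line is valid JSON on its own.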

In [None]:
!pip install praw

In [None]:
## Load modules
import praw
import json
from datetime import datetime
from google.colab import userdata

## Retrieve our environment variables and assign them to names.
client_id = userdata.get("client_id")
client_secret = userdata.get("client_secret")
password = userdata.get("password")
user_agent = userdata.get("user_agent")
username = userdata.get("username")

In [None]:
## Connect to Reddit
reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    password=password,
    user_agent=user_agent,
    username=username,
    check_for_async=False,
)


## Define a function to convert date
def convert_date(date_float: float) -> str:
    """
    Takes a date in epoch time format and converts it into a string in human-readable date format.

    Parameters:
    -----------
        date_float (float): a float representing a date in epoch time format.

    Returns:
    --------
        (str) : a string representing a date in human-readable format.
    """
    return datetime.fromtimestamp(date_float).strftime("%Y-%m-%d %H:%M:%S")
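
One thing to keep in mind: `datetime.fromtimestamp` without a `tz` argument returns the local time of the machine running the notebook, while the field is called `created_utc`. A minimal sketch of a UTC-based variant is shown below. The name `convert_date_utc` is hypothetical, and the year-first layout is taken from the example output in the task description.

```python
from datetime import datetime, timezone


def convert_date_utc(date_float: float) -> str:
    """Convert epoch seconds to a 'YYYY-MM-DD HH:MM:SS' string in UTC."""
    # Passing tz=timezone.utc makes the result independent of the local timezone.
    as_utc = datetime.fromtimestamp(date_float, tz=timezone.utc)
    return as_utc.strftime("%Y-%m-%d %H:%M:%S")


epoch_start = convert_date_utc(0)
one_day_later = convert_date_utc(86400)
```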

In [None]:
## Create a generator of up to 100 search results
submissions = reddit.subreddit("todayilearned").search("science", limit=100)

## An empty list to which we will input dictionaries
output = []

## Iterate over a generator
for submission in submissions:
    ## Create an empty dictionary
    temp = {}
    ## Assign the author name if the account exists
    temp["author_name"] = submission.author.name if submission.author else None
    ## Assign the date in human-readable format
    temp["created_utc"] = convert_date(submission.created_utc)
    ## Assign the title
    temp["title"] = submission.title
    ## Assign the number of comments under a given submission
    temp["num_comments"] = submission.num_comments
    ## Assign the url, falling back to None if it is missing or empty
    temp["url"] = getattr(submission, "url", None) or None
    ## Append the current submission to the list
    output.append(temp)
    ## Sort the list by number of comments (ascending)
    output.sort(key=lambda x: x["num_comments"])
    ## Keep only the 10 submissions with the most comments
    output = output[-10:]

In [None]:
## Write out the results to the JSON Lines file
with open("Task1.jsonl", "w") as file:
    for line in output:
        file.write(json.dumps(line) + "\n")
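
The top-10 selection can also be expressed with `heapq.nlargest` from the standard library, which avoids re-sorting the list on every iteration. The sketch below is only an alternative under the assumption that all rows are already collected as dictionaries; the `rows` data is invented for illustration.

```python
import heapq

# Hypothetical rows standing in for the dictionaries built in the loop.
comment_counts = [5, 42, 7, 0, 13, 99, 1, 8, 21, 3, 64, 2]
rows = [
    {"title": f"post {i}", "num_comments": count}
    for i, count in enumerate(comment_counts)
]

# Pick the 10 rows with the most comments, largest first.
top_10 = heapq.nlargest(10, rows, key=lambda row: row["num_comments"])
top_counts = [row["num_comments"] for row in top_10]
```

Unlike the loop, this returns the rows in descending order of `num_comments`.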

## Task 2 (5 points)

Get all comments from the oldest submission from Task 1 and write them out to a JSON Lines file. Each comment should consist only of the following fields:

* `author_name`
* `created_utc` - in a human-readable format
* `score`
* `body`

The JSON Lines file should therefore contain one dictionary per line, each looking more or less like this:

```python
{
	'author_name' : 'manolito_gafotas',
	'created_utc' : '2011-11-11 11:11:11',
	'body' : "im in grade 12 now (i live in canada as well) but i seriously dont even remember us doing this.Maybe not till at least grade 6 or 7",
	'score' : 100
}
```
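
"All comments" includes replies nested under other comments, so the comment tree has to be flattened before it is written out. The sketch below illustrates the idea on a toy tree with a breadth-first walk. `ToyComment` and `flatten` are hypothetical names for this example and are not part of PRAW, which provides its own way of traversing a comment forest.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyComment:
    """Stand-in for a comment that can have nested replies."""
    body: str
    replies: list = field(default_factory=list)


def flatten(top_level):
    """Return every comment in the tree, visiting it level by level."""
    queue = deque(top_level)
    flat = []
    while queue:
        comment = queue.popleft()
        flat.append(comment)
        # Replies are queued behind the comments already waiting.
        queue.extend(comment.replies)
    return flat


thread = [
    ToyComment("a", [ToyComment("a1", [ToyComment("a1x")]), ToyComment("a2")]),
    ToyComment("b", [ToyComment("b1")]),
]
bodies = [comment.body for comment in flatten(thread)]
```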

In [None]:
!pip install praw

In [None]:
## Load modules
import praw
import json
from datetime import datetime  ## Needed by convert_date below
from tqdm import tqdm  ## Import a fancy module for progress bar
from google.colab import userdata

## Retrieve our environment variables and assign them to names.
client_id = userdata.get("client_id")
client_secret = userdata.get("client_secret")
password = userdata.get("password")
user_agent = userdata.get("user_agent")
username = userdata.get("username")

In [None]:
## Connect to Reddit
reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    password=password,
    user_agent=user_agent,
    username=username,
    check_for_async=False,
)


## Define a function to convert time
def convert_date(date_float: float) -> str:
    """
    Takes a date in epoch time format and converts it into a string in human-readable date format.

    Parameters:
    -----------
        date_float (float): a float representing a date in epoch time format.

    Returns:
    --------
        (str) : a string representing a date in human-readable format.
    """
    return datetime.fromtimestamp(date_float).strftime("%Y-%m-%d %H:%M:%S")

In [None]:
## Get up to 100 submissions into a list
submissions = list(reddit.subreddit("todayilearned").search("science", limit=100))
## Sort them by the date of creation
sorted_submissions = sorted(submissions, key=lambda x: x.created_utc)
## Pick the oldest submission
oldest_submission = sorted_submissions[0]

In [None]:
## Load all comments
oldest_submission.comments.replace_more(limit=None)

In [None]:
## Open the file
with open("Task2.jsonl", "w") as file:
    ## Iterate over all comments
    for comment in tqdm(oldest_submission.comments.list()):
        ## Create an empty dict
        temp = {}
        ## Assign the author name if the Redditor exists
        temp["author_name"] = comment.author.name if comment.author else None
        ## Assign the date in human-readable format
        temp["created_utc"] = convert_date(comment.created_utc)
        ## Assign the body of the comment
        temp["body"] = comment.body
        ## Assign the score of the comment
        temp["score"] = comment.score
        ## Write out to the file
        file.write(json.dumps(temp) + "\n")