# APIs 101 (oDCM)

*This tutorial focuses on pagination (i.e., looping through multiple pages) and parameters (i.e., modifying the response of an API call). We know you love dad jokes, so guess what? We're back with many more jokes, and you're going to learn how to save them all! Finally, we show you how to obtain user-level data from Reddit!*

--- 

## Learning Objectives

* Send HTTP requests to a web API, and retrieve JSON responses
* Use parameters to modify the results of an API call
* Iterate over multiple pages of JSON responses 
* Extract and store results of an API request in lists and files

--- 

## Acknowledgements
This course draws on a variety of online resources which can be retrieved from the [course website](https://odcm.hannesdatta.com/#student-profile--prerequisites). 


--- 

## Support Needed?
For technical issues outside of scheduled classes, please check the [support section](https://odcm.hannesdatta.com/docs/course/support) on the course website.

---

## 1. Icanhazdadjoke

### 1.1 Make an API request

[Icanhazdadjoke.com](https://icanhazdadjoke.com) is a simple website whose API lets users receive (randomized) *dad jokes*. Yes, we know that sounds stupid, but we like that API for its simplicity, which makes it ideal for explaining more about APIs.

So, the code cell below calls the joke API, and the result of the API request displays a joke. 

__Let's try it out__

Run the cell a few times to notice that with each call, you see a new joke.


In [None]:
# request JSON output from icanhazdadjoke API
import requests
url = "https://icanhazdadjoke.com"
response = requests.get(url, headers={"Accept": "application/json"})
joke_request = response.json() 
print(joke_request)

### 1.2 Use parameters to modify the API results   

__Importance__

Probably you agree that dad jokes per se aren't that exciting. Wouldn't it be amazing to search for particular jokes instead?

APIs certainly provide the functionality to *customize* requests. That's where APIs make the most difference! You have probably already modified the results of an API call a dozen times without even knowing it. For example, if you Google the word `cat`, the results page may look something like this:

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/apis101/images/google.png" width=60% align="left"  style="border: 1px solid black"/>


Note how the URL in the browser starts with [`google.com/search?q=cat...`](https://www.google.com/search?q=cat)? What happened here is that your search query was passed to the Google Search API, and hence returned the results of the search query `cat`. That search query is even already embedded in the link itself. Cool, right?

__Let's try it out__

So, rather than filling out the search box on the website of Icanhazdadjoke.com itself, you can also tweak it in the URL directly. Open your browser now at [https://icanhazdadjoke.com/search?term=cat](https://icanhazdadjoke.com/search?term=cat), and modify the `term` parameter to try a search for different jokes.


<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/apis101/images/cat_jokes.gif" width=60% align="left"  style="border: 1px solid black"/>

With the idea of passing parameters to a website in mind, we can update the `search_url` and include the `params` argument, which takes a dictionary with parameters that further specify our request. Run the cell below to see cat jokes here in Jupyter Notebook.

In [None]:
import requests
search_url = "https://icanhazdadjoke.com/search"

response = requests.get(search_url, 
                        headers={"Accept": "application/json"}, 
                        params={"term": "cat"})
joke_request = response.json()
print(joke_request)

The `joke_request` object now contains a list with all cat-related jokes (`joke_request['results']`), the search term (`cat`), and the total number of jokes (`10`).
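To see what the `params` dictionary does under the hood, the sketch below builds the same search URL by hand with Python's standard library. No request is sent. It assumes the API accepts the parameters exactly as they appear in the browser URL above; `query_string` and `full_url` are illustrative names only.

```python
from urllib.parse import urlencode

search_url = "https://icanhazdadjoke.com/search"
params = {"term": "cat"}

# urlencode() converts the dictionary into a query string: key=value pairs joined by "&"
query_string = urlencode(params)

# the query string is attached to the base URL after a question mark
full_url = f"{search_url}?{query_string}"
print(full_url)
```

Adding a second key to `params` (e.g., `"limit": 10`) simply extends the query string with `&limit=10`.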

#### Exercise 1
1. Change the search term parameter to `dog` and revisit `joke_request['results']`. How many dog jokes are there? 
2. Write a function `find_jokes()` that takes a query as an input parameter and returns the number of jokes from the `icanhazdadjoke` search API (tip: use your answer to question 1 as a starting point!). 




In [None]:
# your answer goes here!

#### Solutions

In [None]:
# Question 1 
search_url = "https://icanhazdadjoke.com/search"

response = requests.get(search_url, 
                        headers={"Accept": "application/json"}, 
                        params={"term": "dog"})
joke_request = response.json()
print(f"The number of dog jokes is: {joke_request['total_jokes']}")

In [None]:
# Question 2
def find_jokes(term):
    search_url = "https://icanhazdadjoke.com/search"

    response = requests.get(search_url, 
                            headers={"Accept": "application/json"}, 
                            params={"term": term})
    joke_request = response.json()
    num_results = joke_request['total_jokes']
    return num_results

find_jokes("some-searchterm-you-would-like-to-try-out")

### 1.3 Pagination

__Importance__

Transferring data is costly - not strictly in a monetary sense, but in *time*. So, APIs are typically very frugal in returning data. Ideally, they return only the targeted data that the user needs to see. On icanhazdadjoke.com, for example, that would be a few jokes at most. It saves the website owner from paying for bandwidth and guarantees that the site responds fast to user input (such as navigating the site or searching for jokes).

However, when using APIs for research purposes, we are frequently interested in obtaining *everything*. What's the use, for example, to get a book's most recent ten reviews, if there are hundreds of reviews written?

We think you see where we're going with this... 

__Let's try it out__

So, let's try to grab all of the 649 jokes currently available at Icanhazdadjoke.com. The API output, unfortunately, only shows the *first 20 jokes*. To retrieve the remaining 629 jokes, you need *pagination*. The API divides the data into smaller subsets that can be accessed on various pages, rather than returning all output at once. 

Let's retrieve the first batch of dad jokes. Note that here we're searching for the `term` `""` - an empty string - which brings us to the entire set of jokes available via the API. In practice, searching for `""` is often blocked by APIs, simply because the site doesn't *want* you to extract a complete copy of their data. In that case, you'd have to become creative to obtain your seeds.


In [None]:
search_url = "https://icanhazdadjoke.com/search"

response = requests.get(search_url, 
                        headers={"Accept": "application/json"}, 
                        params={"term": ""})
joke_request = response.json()
joke_request['results'] = '' # let's remove all jokes, and only look at the other attributes in the JSON response
joke_request

You notice that by default, each page contains 20 jokes (see `limit` in the JSON response above), where page 1 shows jokes 1 to 20, page 2 jokes 21 to 40, ..., and page 33 jokes 641 to 649. 

You can adjust the number of results on each page (max. 30) with the `limit` parameter (e.g., `params={"limit": 10}`). In practice, almost every API on the web limits the results of an API call (`100` is also a common cap).

In the example below, we set `limit` equal to `10`, `20`, and `30`, and see how it affects the number of total pages (`total_pages`) on which jokes are listed. 

In [None]:
for limit in range(10, 31, 10):  # note that range(a, b) runs from a to b-1; so the last value is exclusive (so from 10 to 30 with steps of 10)
    response = requests.get(search_url, 
                            headers={"Accept": "application/json"}, 
                            params={"term": "", 
                                   "limit": limit})
    joke_request = response.json()
    print(f"Limit {limit} gives {joke_request['total_pages']} pages")

As expected, we find that the higher the limit, the more results fit on a single page, and thus the *lower the number of pages* to loop through.
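The relationship between `limit` and the number of pages can also be worked out without calling the API. The sketch below assumes the total of 649 jokes mentioned in this tutorial (the live API may report a different number) and rounds up, because a partially filled last page still counts as a page.

```python
import math

total_jokes = 649  # number of jokes reported in this tutorial; the live total may differ

for limit in (10, 20, 30):
    # round up: a partially filled last page still counts as a page
    total_pages = math.ceil(total_jokes / limit)
    print(f"Limit {limit} gives {total_pages} pages")
```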

--- 
#### Exercise 2

In addition to the `limit` parameter, you can specify the current page number with the `page` parameter (e.g., `params={"term": "", "page": 2}`). See the example in the next cell:

In [None]:
response = requests.get(search_url,
                        headers={"Accept": "application/json"},
                        params={"term": "",
                                "limit": 5,
                                "page": 2})
response.json()

Adapt the function `find_jokes()` (see question 2 of exercise 1), such that it loops over *all available pages*, and stores the ids and jokes in a list. You can leave the `limit` parameter at its default value (20). Make sure that your function also works when you pass it a search `term`. 

Tip: To determine how many pages you need to loop through, you can use the `total_pages` field (e.g., there are only ten cat jokes, so in that case, 1 page would suffice).

In [None]:
# your answer goes here!

#### Solutions

In [None]:
def find_jokes(term):
    search_url = "https://icanhazdadjoke.com/search"
    page = 1
    jokes = []

    while True:  # alternatively, you can use a for-loop that goes from page 1 to total_jokes / 20 (rounded up)
        response = requests.get(search_url, 
                                headers={"Accept": "application/json"}, 
                                params={"term": term,  # optionally you can add "limit": 20 but that's already the default so it doesn't change anything
                                        "page": page})
        joke_request = response.json()
        jokes.extend(joke_request['results'])
        if joke_request['current_page'] < joke_request['total_pages']:  # more pages left? then request the next one
            page += 1
        else: 
            return jokes

output = find_jokes("cat") # try running it with "", too!

In [None]:
print(f"You've collected {len(output)} jokes")
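Once the jokes sit in a list, you may want to store them in a file so that you don't have to call the API again. The sketch below is one possible way to do so with the built-in `json` module. The two joke records are made-up placeholders standing in for the list returned by `find_jokes()`, and the file is written to a temporary folder only to keep the example self-contained.

```python
import json
import tempfile
from pathlib import Path

# hypothetical stand-in for the list returned by find_jokes()
jokes = [
    {"id": "abc123", "joke": "Placeholder joke number one."},
    {"id": "def456", "joke": "Placeholder joke number two."},
]

# written to a temporary folder here; use e.g. Path("jokes.json") to keep the file
with tempfile.TemporaryDirectory() as folder:
    file_path = Path(folder) / "jokes.json"

    # json.dump() writes the list of dictionaries to disk as JSON text
    with open(file_path, "w", encoding="utf-8") as f:
        json.dump(jokes, f, indent=2)

    # read the file back to check that nothing got lost
    with open(file_path, encoding="utf-8") as f:
        jokes_from_file = json.load(f)

print(f"Stored and reloaded {len(jokes_from_file)} jokes")
```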

### 1.4 Wrap-up

To sum up, we have seen how *parameters* can be a powerful tool when working with APIs. They allow you to tailor your request to be more specific or loop through multiple pages. 

In the API documentation, you typically find more information about the available parameters and the values they can take on. For example, the `icanhazdadjoke` [documentation](https://icanhazdadjoke.com/api) includes a section on the `/search` endpoint and the accepted parameters (`page`, `limit`, `term`). These parameters, however, differ from one API to another. So it's crucial to study each web service's API documentation carefully.


--- 
## 2. Reddit

### 2.1 Subreddits

[Reddit](https://reddit.com) is a widespread American social news aggregation and discussion site. The service uses an API to generate the website's content and grants public access to the API.

In this tutorial, we zoom in on "subreddits", which are niche communities centered around a particular topic. Users can post nearly anything in these subreddits, and you'd be surprised to find out what people are talking about. For example, see below for a screenshot of the [subreddit on Science](https://www.reddit.com/r/Science).

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/apis101/images/reddit_science.png" width=60% align="left"  style="border: 1px solid black"/>


__Let's try it out__

Subreddits all start with `reddit.com/r/...`. Here are a few examples:

- [askreddit](https://www.reddit.com/r/AskReddit), 
- [aww](https://www.reddit.com/r/aww/), 
- [gifs](https://www.reddit.com/r/gifs/), 
- [showerthoughts](https://www.reddit.com/r/Showerthoughts), 
- [lifehacks](https://www.reddit.com/r/lifehacks), 
- [getmotivated](https://www.reddit.com/r/GetMotivated), 
- [moviedetails](https://www.reddit.com/r/MovieDetails), 
- [todayilearned](https://www.reddit.com/r/todayilearned/), 
- [foodporn](https://www.reddit.com/r/FoodPorn/). 

Take your time to browse through some of the subreddits, and get familiar with the structure of the pages.

After a while, you'll probably notice that subreddits are hosted by moderators, who monitor whether the posts adhere to a set of (informal) rules.

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/apis101/images/reddit_moderators.png" width=60% align="left"  style="border: 1px solid black"/>

For example, links to papers you share in [`r/science`](https://www.reddit.com/r/science/) must be less than 6 months old. 

Other users can join a subreddit so that they receive updates about new posts and comments.




#### Exercise 3
Consult the [`marketing`](https://www.reddit.com/r/marketing/hot/) subreddit and answer the following questions: 
1. For your thesis, you need to collect survey responses. Are you allowed to share a link to your survey in this subreddit? Please elaborate on how you came to this conclusion. 
2. You post a link (and wonder how many users will potentially be able to see your post). How many users are subscribed to the subreddit? How many users are currently online?
3. As on other social media platforms, you can navigate to Reddit's user profiles and learn more about these users. Inspect the profile of one of the users who has posted on the subreddit (actually, it is one of the moderators), [`sixwaystop313`](https://www.reddit.com/user/sixwaystop313). Describe in your own words what types of information you can gather from this user. How is the feed organized?


In [None]:
# your answer goes here!

#### Solutions
1. No, the subreddit rules prescribe users not to post surveys and homework assignments (right sidebar).
2. `r/marketing` has about 370k members, and (at the time of writing this tutorial) about 160 of them were online.
3. On a user page, you find the bio, trophies, communities the user moderates, connected accounts, and most importantly: all user's posts and comments.

---


### 2.2 API headers  

**Importance**  

Let's now obtain some of the data we have seen on the "About" page of the `marketing` subreddit, using the Reddit API. 

To request data from the Reddit API, we need to include `headers` in our request. HTTP headers are a vital part of any API request, containing *meta-data associated with the request* (e.g., type of browser, language, expected data format, etc.). 

**Let's try it out**  

Below, we request the about page of the [`marketing`](https://www.reddit.com/r/marketing/) subreddit with a request that includes such headers. We make our first request to the Reddit API and parse the output in the upcoming exercise!


In [None]:
import requests
url = 'https://www.reddit.com/r/marketing/about/.json'

headers = {'authority': 'www.reddit.com', 'cache-control': 'max-age=10', 'upgrade-insecure-requests': '1', 'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36', 'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9', 'sec-fetch-site': 'same-origin', 'sec-fetch-mode': 'navigate', 'sec-fetch-user': '?1', 'sec-fetch-dest': 'document', 'accept-language': 'en-GB,en;q=0.9'}
response = requests.get(url, headers=headers)
json_response = response.json()

#### Exercise 4
1. First, take a look at the `json_response` object. Then, leave out the `headers` parameter in your request (so it becomes `requests.get(url)` instead), rerun the cell, and inspect the `json_response` another time. Are there any differences? 
2. Write a while-loop that prints the count of the number of currently active users of the `marketing` subreddit. Have your code pause every 5 seconds before refreshing. Stop the loop after 3 iterations. For pausing, use the function `time.sleep(5)`. Import the time package using `import time`.
3. Convert your code from the previous exercise into a function `get_usercount()` that takes a `subreddit` as input and returns the total number of users, and the number of currently active users as a dictionary. Test your function for the `science`, `skateboarding`, and `marketing` subreddits. How many total and currently active users do these communities have?


In [None]:
# your answer goes here!

#### Solutions
1. Without the `headers` parameter, the API returns an error code (429, "Too Many Requests"). Headers are frequently used to track who is using the API. The user of the "anonymous header" has pushed the boundaries too much!


In [None]:
# Question 2 
import time

i = 1
while i <= 3:
    url = 'https://www.reddit.com/r/marketing/about/.json'
    headers = {'authority': 'www.reddit.com', 'cache-control': 'max-age=10', 'upgrade-insecure-requests': '1', 'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36', 'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9', 'sec-fetch-site': 'same-origin', 'sec-fetch-mode': 'navigate', 'sec-fetch-user': '?1', 'sec-fetch-dest': 'document', 'accept-language': 'en-GB,en;q=0.9'}
    response = requests.get(url, headers=headers)
    json_response = response.json()
    
    print(json_response['data']['active_user_count'])
    i += 1
    time.sleep(5)

In [None]:
# Question 3
def get_usercount(subreddit):
    response = requests.get(f'https://www.reddit.com/r/{subreddit}/about/.json', headers=headers)
    json_response = response.json()
    out = {}
    out['subreddit'] = subreddit
    out['total_users'] = json_response['data']['subscribers']
    out['active_users'] = json_response['data']['active_user_count']
    return out
    
get_usercount('science')

In [None]:
get_usercount('skateboarding')

In [None]:
get_usercount('marketing')


---
### 2.3 Profile pages

In addition to subreddits (`r/...`) and about pages (`.../about/`), Reddit users have their own profile page. Let's have another look at the marketing moderator [profile](https://www.reddit.com/user/sixwaystop313) we saw before. Each of the `children` in the `data` is characterized by a type (e.g., `t1` = comment, `t3` = post; for details see [API documentation](https://redditclient.readthedocs.io/en/latest/reference/)), subreddit, timestamp, number of comments, upvotes, downvotes, and many others. 

__Let's try it out__

Run the API call below, and browse the results.

In [None]:
mod = "sixwaystop313"
response = requests.get(f'https://www.reddit.com/user/{mod}.json', headers=headers)
json_response = response.json()
json_response

That's a whole lot of output, which is difficult to go through. So let's copy it to a [JSON viewer](https://jsonviewer.stack.hu). Before we can do that, we have to convert the Python dictionary back into valid JSON text: the notebook displays the Pythonic `None`, `True`, and `False`, whereas JSON expects `null`, `true`, and `false` (the JSON viewer throws an error otherwise).

In [None]:
import json
# json.dumps() serializes the dictionary as valid JSON text (None -> null, True -> true, False -> false)
print(json.dumps(json_response, indent=2))

#### Exercise 5
1. The `json_response` object contains both comments and posts ordered chronologically (exactly as they appear on the profile page). Pick a comment (`kind`: `'t1'`) of the author and store the text of the comment in a variable called `comment_text`. 
2. What happens to `comment_text` once the author publishes another post? 
3. How many objects are stored in `json_response['data']['children']`? What does that mean? 

In [None]:
# your answer goes here!

#### Solutions
1. At the moment of creating this solutions file, the first item in the list is a comment, which we extract as follows:
`comment_text = json_response['data']['children'][0]['data']['body']`. In your case, the first comment may be the 2nd (or 3rd, 4th, ...) item if the items before it are posts. For that reason, the index `[0]` may need to change from time to time. 


In [None]:
# if this solution throws a "KeyError: body" error it means the most recent JSON object is not a comment of kind t1 (so change the 0 for 1, 2, ... until it runs) - see question 2
comment_text = json_response['data']['children'][0]['data']['body']
comment_text

2. Since the list items are ordered chronologically, new items are inserted at the *beginning* of the list and thus push existing items to the "right" (i.e., index 0 becomes index 1, etc.). Suppose that the author publishes another post; then index `[0]` would no longer contain a comment. Post items are structured differently from comment items, which could break your script once you try to parse non-existing fields. For example, posts do not have a `['body']` element that stores the comment text.

3. The object comprises 25 items, which implies that only the 25 most recent comments and posts are shown. Thus, we need to apply pagination to obtain historical records.

In [None]:
len(json_response['data']['children'])
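Because the position of the first comment shifts over time, a more robust alternative to a fixed index is to search for the first item of kind `t1`. The sketch below illustrates the idea on a heavily trimmed, made-up stand-in for `json_response` (the titles and comment texts are placeholders, and real items carry many more fields).

```python
# hypothetical, heavily trimmed stand-in for json_response
sample_response = {
    "data": {
        "children": [
            {"kind": "t3", "data": {"title": "Placeholder post title"}},
            {"kind": "t1", "data": {"body": "Placeholder comment text"}},
            {"kind": "t1", "data": {"body": "Another placeholder comment"}},
        ]
    }
}

comment_text = None
for item in sample_response["data"]["children"]:
    # 't1' marks a comment; posts ('t3') are skipped because they have no 'body'
    if item["kind"] == "t1":
        comment_text = item["data"]["body"]
        break  # stop at the first (i.e., most recent) comment

print(comment_text)
```

If the response contains no comment at all, `comment_text` simply stays `None` instead of raising a `KeyError`.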

### 2.4 Pagination

__Importance__

As you've just noticed, the API only returns a subset of all records (every time you scroll to the bottom of the page, it pulls in new data - ordered chronologically). After all, it would take ages to show all data for a user that has been active on Reddit since 2009! 

__Let's try it out__

Similar to `icanhazdadjoke`, we apply pagination to tell the API which part of the data to return. The difference, however, is that it's not a number (like `"page": 2`) but a string of characters that can only be obtained from the previous request (i.e., we cannot derive what the next key will be from a pattern, like page 2, 3, ..., etc.). The request we already made contains this "secret" key in the attribute `after`:


In [None]:
json_response['data']['after']

Next, we attach this key to our request with the `after` parameter to obtain the next subset of items and assign the responses to a variable called `json_response_after`: 

In [None]:
after = json_response['data']['after']
url = f'https://www.reddit.com/user/{mod}.json'
response = requests.get(url, 
                        headers=headers, 
                        params={"after": after})
json_response_after = response.json()
json_response_after

At the point of writing this tutorial (when you're doing this tutorial it's likely different!), the last item in `json_response` is the following post (`Detroit's Brewing Heritage' on tap at Historical Museum`): 

<img src="https://raw.githubusercontent.com/hannesdatta/course-odcm/dev/content/docs/tutorials/apis101/images/json_response.png" width=60% align="left"/>

The first and second items in `json_response_after` are the two comments below that ("Shame on ... us back." and "Are you ... comment /u/ehchip"). In other words, where one object ends, another begins. We apply this concept to loop over the first ten pages. At each iteration, we store the `after` attribute, which we use as a parameter in the follow-up request.

In [None]:
after = None
item_type = []

for counter in range(10): 
    url = f'https://www.reddit.com/user/{mod}.json'
    print('processing ' + url + ' with after parameter: ' + str(after))
    response = requests.get(url, 
                            headers=headers, 
                            params={"after": after})
    json_response = response.json()
    after = json_response['data']['after'] 

    # loop over all items in a request
    for item in json_response['data']['children']:
        item_type.append(item['kind'])

# Let's view the item types: 
item_type

#### Exercise 6
1. Why do we define `after = None` at the top of the code cell? Can we leave it out? 
2. Without looking at the list's length, how many items do you expect in `item_type`? 
3. Of all items in `item_type`, calculate the percentage of posts (`t3`) and comments (`t1`). What does this tell you? 
4. Convert the code snippet above into a function `reddit_activity()` that takes a `username`, `attribute`, and `num_pages` as inputs and returns the attribute for the given user. For example, `reddit_activity("sixwaystop313", "subreddit_name_prefixed", 40)` should return a list of the subreddits in which the user has posted or commented across the 1000 most recent items. 
5. Use the function written in question 4 to assess whether the moderator has actively contributed to the `r/marketing` subreddit recently.

In [None]:
# your answer goes here!

#### Solutions 
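1. In our first request, we don't know the value of `after` yet. Setting it to `None` makes `requests` leave the parameter out of the URL, so the API returns the first page. We cannot leave the line out, because the variable `after` used in `params={"after": after}` would otherwise be undefined. 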
2. We expect the list to have a size of 10 (number of requests) * 25 (number of items per request) = 250. 

In [None]:
# Question 3
def item_frequency(items, item_filter):
    total_items = len(items)
    item_filter_count = items.count(item_filter)
    return item_filter_count / total_items * 100
            
perc_comments = item_frequency(item_type, 't1')  # t1 = comment
perc_posts = item_frequency(item_type, 't3')     # t3 = post

print(f"The percentage of posts and comments is {perc_posts}% and {perc_comments}%, respectively")
# Comparing both percentages tells you whether, based on this subset of data, the author is more likely to start a new post or to comment on others' posts

In [None]:
# Question 4
def reddit_activity(username, attribute, num_pages):
    after = None
    activity = []

    for counter in range(num_pages): 
        url = f'https://www.reddit.com/user/{username}.json'
        print('processing ' + url + ' with after parameter: ' + str(after))
        response = requests.get(url, 
                                headers=headers, 
                                params={"after": after})
        json_response = response.json()
        after = json_response['data']['after']

        # loop over all items in a request
        for item in json_response['data']['children']:
            activity.append(item['data'][attribute])
    return activity

reddit_data = reddit_activity("sixwaystop313", "subreddit_name_prefixed", 40)

In [None]:
reddit_data

In [None]:
print(f"The percentage of posts and comments in the marketing subreddit is {item_frequency(reddit_data, 'r/marketing')}%")
# Thus, the moderator has not actively contributed to the marketing subreddit recently
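The loops above always request a fixed number of pages. The sketch below shows one way to stop early once the data runs out, under the assumption that the API signals the last page by returning `after` as `None`. It runs entirely offline: `FAKE_PAGES`, `fetch_page()`, and `collect_kinds()` are hypothetical names that stand in for the real API calls.

```python
# hypothetical stand-in for the API: each "page" holds items plus the key of the next page
FAKE_PAGES = {
    None: {"data": {"children": [{"kind": "t1"}, {"kind": "t3"}], "after": "key_a"}},
    "key_a": {"data": {"children": [{"kind": "t1"}, {"kind": "t1"}], "after": "key_b"}},
    "key_b": {"data": {"children": [{"kind": "t3"}], "after": None}},
}


def fetch_page(after):
    """Return the fake page that belongs to the given `after` key (no network involved)."""
    return FAKE_PAGES[after]


def collect_kinds(max_pages):
    after = None  # the first request has no key yet
    kinds = []
    for _ in range(max_pages):
        page = fetch_page(after)
        for item in page["data"]["children"]:
            kinds.append(item["kind"])
        after = page["data"]["after"]
        if after is None:  # no key for a next page: we have reached the end
            break
    return kinds


item_kinds = collect_kinds(max_pages=10)
print(item_kinds)
```

Without the `break`, a loop that keeps going after `after` has become `None` would start again at the first page and collect duplicates.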

---
### 2.5 Time Conversion

__Importance__

When retrieving data from an API - in particular *user-level data* - not only the content of a post or comment matters, but also when the comment or post was written. In other words, we seek to extract the date and time (timestamps) from a user's comments and posts on Reddit.

__Let's try it out__

Run the cell below, to extract the timestamp from `sixwaystop313`'s most recent activity on Reddit.


In [None]:
url = 'https://www.reddit.com/user/sixwaystop313.json'
response = requests.get(url, headers=headers)
response.json()['data']['children'][0]['data']['created_utc']

Hm... is *that* really a timestamp?

Well... computers handle time differently than humans. Programmers have largely converged on measuring timestamps as *the number of seconds passed since 1 January 1970 (UTC)*. With the `time` library, we can easily convert such a timestamp into a readable date and time:

In [None]:
import time 
time_example = response.json()['data']['children'][0]['data']['created_utc']
time_converted = time.gmtime(time_example)
print(time_converted)

From `time_converted` you can extract the day, month, and year separately:

In [None]:
print(f"The day is: {time_converted.tm_mday}")
print(f"The month is: {time_converted.tm_mon}")
print(f"The year is: {time_converted.tm_year}")

Or together, like this (characters that start with `%` have a special meaning, the `-` in  between these characters are literally the dashes you see in the output): 

In [None]:
print(time.strftime("%d-%m-%Y", time_converted))  
# %d = day
# %m = month
# %Y = year (4 digits) and %y = year (2 digits)
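To make the "seconds since 1 January 1970" idea more tangible, the sketch below converts two fixed timestamps: second `0` and an arbitrary example value that is not taken from the Reddit data. It also shows the `datetime` module as a possible alternative to `time`, which returns an object that is easy to compare and sort.

```python
import time
from datetime import datetime, timezone

# second 0 is the start of the Unix epoch
print(time.strftime("%d-%m-%Y", time.gmtime(0)))

# an arbitrary example timestamp (not taken from the Reddit data)
example_timestamp = 1600000000

# time.gmtime() interprets the timestamp in UTC
as_struct = time.gmtime(example_timestamp)
print(as_struct.tm_year, as_struct.tm_mon, as_struct.tm_mday)

# datetime offers the same conversion as a timezone-aware object
as_datetime = datetime.fromtimestamp(example_timestamp, tz=timezone.utc)
print(as_datetime.isoformat())
```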

--- 
#### Exercise 7 
1. In a similar way, you can convert the UTC time into an hour (`%H`) and minute (`%M`). Transform `time_example` into a readable time. At the time of writing, the output was `06:17` (yours will likely differ). 
2. Suppose we want to analyze the Reddit use of `sixwaystop313` throughout the day. More specifically, we want to know during what hours the user is most active on the platform. 
  * Use the function `reddit_activity()` you wrote earlier to pull in the UTC timestamps (set `num_pages` to `10`). 
  * Extract the hour from these timestamps. 
  * Determine the top 3 hours the user is most active on Reddit. You can assume that the total number of posts and comments is a reasonable proxy for time spent on the platform. 

In [None]:
# your answer goes here!

#### Solutions

In [None]:
# Question 1 
print(time.strftime("%H:%M", time_converted))

In [None]:
# Question 2
time_data = reddit_activity("sixwaystop313", "created_utc", 10)
hours = []

for timestamp in time_data: 
    time_converted = time.gmtime(timestamp)
    hours.append(time_converted.tm_hour)
    
for hour in range(24):
    print(f"Hour {hour}: {hours.count(hour)} items")
    
# Check out which hours are listed most to find the answer!
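Instead of reading the top 3 off the printed list, you could let Python rank the hours for you. The sketch below uses `collections.Counter` on a made-up list of hours; `sample_hours` is a hypothetical stand-in for the `hours` list built above.

```python
from collections import Counter

# hypothetical hours (0-23), standing in for the `hours` list built above
sample_hours = [14, 14, 14, 14, 9, 9, 9, 22, 22, 3]

# Counter tallies how often each hour occurs; most_common(3) returns the three highest counts
top_hours = Counter(sample_hours).most_common(3)

for hour, count in top_hours:
    print(f"Hour {hour}: {count} items")
```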

---

### 2.6 Building an API extraction module

__Importance__

Up to now, we've written functions that each carry out a separate task: one for obtaining data from the subreddit's about page, and one for obtaining particular attributes of a user's posts and comments.

However, when we use APIs for research, we are not so much interested in the results of "single-shot" API requests, but we would like to obtain a *copy* of the entire data, so that we can analyze it later.

So, the purpose of this section is to "stitch" together individual API requests. For now, we assume that we are interested in studying how the posting behavior of users currently active on the subreddit influences the total number of active users of the community.

In other words, we need to 
- obtain a list of users who have recently posted on the subreddit (the first 25), and
- store all of their posts and comments in a dataset.

__Let's try it out__

Let's first make use of a function `get_users()` that returns the users whose posts are currently listed on the `marketing` subreddit.

In [None]:
def get_users(subreddit):
    url = f'https://www.reddit.com/r/{subreddit}.json'
    response = requests.get(url, 
                            headers=headers)
    json_response = response.json()
    users = []
    # loop over all items in a request
    for item in json_response['data']['children']:
        users.append(item['data']['author'])
    return users

users = get_users('marketing')
users

We have also written our `reddit_activity()` function, which we can now call for the first user in the list (for prototyping purposes).

In [None]:
reddit_activity(users[0], "created_utc", 3)

The code above produced a list of timestamps of when that user was active. The function is not particularly useful yet, because it returns only one field. So, let's first extend the `reddit_activity()` function to...
- take a `list` of attributes to extract, rather than just one,
- store the results in a list of dictionaries, rather than a list of single values (easier for writing to CSV later), and
- always store the type of content (`kind`, e.g., post or comment).


In [None]:
def reddit_activity_updated(username, attributes, num_pages):
    after = None
    activity = []

    for counter in range(num_pages): 
        url = f'https://www.reddit.com/user/{username}.json'
        print('processing ' + url + ' with after parameter: ' + str(after))
        response = requests.get(url, 
                                headers=headers, 
                                params={"after": after})
        json_response = response.json()
        after = json_response['data']['after']

        # loop over all items in a request
        for item in json_response['data']['children']:
            tmp = {}
            tmp['kind'] = item['kind']
            for attribute in attributes:
                try:
                    tmp[attribute] = item['data'][attribute]
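                except KeyError:
                    pass # attribute missing for this item (e.g., posts have no `body`)
            activity.append(tmp)

        # stop once Reddit reports that there are no more pages
        if after is None:
            break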
    return activity

reddit_data = reddit_activity_updated(users[0], ["created_utc", "subreddit_name_prefixed", "body"], 10)

In [None]:
reddit_data

Observe that the updated `reddit_activity_updated()` function now returns multiple fields. It also handles fields that are not part of an item's data: for example, there is never a `body` for type `t3` (posts), only for `t1` (comments).
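To see the field check in isolation, here is a small offline sketch. The two items in `sample_items` are made up and only mimic the structure of Reddit's JSON. Instead of `try`/`except`, the sketch tests for the key with `in`, which is an alternative way to get the same result.

```python
# two hypothetical items mimicking the structure of Reddit's JSON (values are made up)
sample_items = [
    {"kind": "t1", "data": {"created_utc": 1610697600.0, "body": "Nice post!"}},
    {"kind": "t3", "data": {"created_utc": 1610701200.0, "title": "My first post"}},
]

attributes = ["created_utc", "body"]
extracted = []

for item in sample_items:
    tmp = {"kind": item["kind"]}
    for attribute in attributes:
        # only copy attributes that actually exist for this item
        if attribute in item["data"]:
            tmp[attribute] = item["data"][attribute]
    extracted.append(tmp)

print(extracted)
```

The comment (`t1`) ends up with a `body` key, while the post (`t3`) does not.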

__Exercise 8__

1. Extend the `reddit_activity_updated()` function so that it also stores the user's username in the results.

2. Write a loop that calls `reddit_activity_updated()` on all 25 most recent users of the subreddit `marketing`, and stores the result in a (long) list of dictionaries. Limit yourself to the first 10 pages per user (to save time doing this exercise).

In [None]:
# your answer goes here!

__Solutions__

*Question 1*

In [None]:
def reddit_activity_updated(username, attributes, num_pages):
    after = None
    activity = []

    for counter in range(num_pages): 
        url = f'https://www.reddit.com/user/{username}.json'
        print('processing ' + url + ' with after parameter: ' + str(after))
        response = requests.get(url, 
                                headers=headers, 
                                params={"after": after})
        json_response = response.json()
        after = json_response['data']['after']

        # loop over all items in a request
        for item in json_response['data']['children']:
            tmp = {}
            tmp['username'] = username # <-- SOLUTION TO QUESTION 1 IS HERE
            tmp['kind'] = item['kind']
            for attribute in attributes:
                try:
                    tmp[attribute] = item['data'][attribute]
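                except KeyError:
                    pass # attribute missing for this item (e.g., posts have no `body`)
            activity.append(tmp)

        # stop once Reddit reports that there are no more pages
        if after is None:
            break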
    return activity

reddit_data = reddit_activity_updated(users[0], ["created_utc", "subreddit_name_prefixed", "body"], 10)

In [None]:
# let's preview the first few results
reddit_data[0:4]

*Question 2*

In [None]:
users = get_users('marketing')

all_data = []
for user in users:
    print('processing content for user ' + user)
    reddit_data = reddit_activity_updated(user, ["created_utc", "subreddit_name_prefixed", "body"], 10)
    for item in reddit_data:
        all_data.append(item)

In [None]:
# preview the results
print(len(all_data))
all_data[0:5]

---

### 2.7 Exporting data to CSV

__Importance__

Alright, you've almost made it. We've accomplished a whole lot by now: we just wrote a tool that extracts user-level data and collects it in a list of dictionaries in Python. But how can we port the data to other software (e.g., R, Excel)?

We need to convert the data to a Comma Separated Values (CSV) file.

More specifically, we'd like to have a file with five columns, containing:
- the username,
- the type of content (post vs. comment),
- the timestamp (readable for humans),
- the subreddit name, and
- the body, if present.

To facilitate writing to a CSV file, we'll make use of the `csv` library.

__Let's try it out__

Here, we'll start with a code snippet that writes the `username`, content type (`kind`), and timestamp (`created_utc`) to a CSV file. Run the snippet and the next cells to see the result (it should create a new file `reddit_posts.csv` in your current working directory)!

In [None]:
import csv 

with open("reddit_posts.csv", "w", newline="") as csv_file: # <<- this is the line with the "flag"; see exercises below
    writer = csv.writer(csv_file, delimiter = ";")
    writer.writerow(["username", "kind", "created_utc"])
    for content in all_data:
        writer.writerow([content['username'], content['kind'], content['created_utc']])
print('done!')

Let's preview the content of that file directly in Jupyter, using the `pandas` library.

In [None]:
import pandas as pd

df = pd.read_csv("reddit_posts.csv", sep=";")
# shows top 10 rows
df.head(10)

Good job, so far. But the file is not ready yet, so let's work on the exercises by extending the code snippet above.

---

__Exercise 9__  
The `reddit_posts.csv` file currently includes only three columns (`username`, `kind`, `created_utc`). Please add the following columns as well: the subreddit name, the text of users' comments, and the timestamp converted to a date in YYYY-MM-DD format (e.g., 2021-01-15) and a time in HH:MM format (e.g., 08:00).

In [None]:
# your answer goes here!

__Solutions__

In [None]:
import csv 
import time

with open("reddit_posts.csv", "w", encoding="utf-8", newline="") as csv_file: # <<- this is the line with the "flag"; see exercises below
    writer = csv.writer(csv_file, delimiter = ";")
    writer.writerow(["username", "kind", "created_utc", "subreddit_name_prefixed", "date", "time", "body"])
    for content in all_data:
        time_converted = time.gmtime(content['created_utc'])
        datestamp = time.strftime("%Y-%m-%d", time_converted)
        timestamp = time.strftime("%H:%M", time_converted)
        # posts (t3) have no body, so fall back to an empty string
        bodytext = content.get('body', '')
        writer.writerow([content['username'], content['kind'], content['created_utc'], content['subreddit_name_prefixed'],
                        datestamp, timestamp, bodytext])
print('done!')
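The timestamp conversion can also be tried on its own. This is a minimal sketch with a made-up example value for `created_utc`. Reddit reports these timestamps as seconds since 1970-01-01 (UTC), which is what `time.gmtime()` expects.

```python
import time

created_utc = 1610697600  # example Unix timestamp (seconds since 1970-01-01 UTC)

# convert to a UTC time structure, then format date and time separately
time_converted = time.gmtime(created_utc)
datestamp = time.strftime("%Y-%m-%d", time_converted)
timestamp = time.strftime("%H:%M", time_converted)

print(datestamp, timestamp)  # prints: 2021-01-15 08:00
```

Note that `time.gmtime()` returns UTC; use `time.localtime()` instead if you prefer your computer's local time zone.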

In [None]:
import pandas as pd

df = pd.read_csv("reddit_posts.csv", sep=";")
# shows top 10 rows
df.head(10)
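Since our data is a list of dictionaries, `csv.DictWriter` from the same `csv` library is a possible alternative to `csv.writer`. The sketch below is not part of the solution above. It uses made-up records in `sample_data` and writes to an in-memory buffer rather than to `reddit_posts.csv`, so no file is overwritten.

```python
import csv
import io

# hypothetical records mimicking all_data (values are made up)
sample_data = [
    {"username": "alice", "kind": "t1", "created_utc": 1610697600.0, "body": "Nice post!"},
    {"username": "bob", "kind": "t3", "created_utc": 1610701200.0},
]

fieldnames = ["username", "kind", "created_utc", "body"]

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, delimiter=";",
                        restval="",             # value used when a key is missing (e.g., no body)
                        extrasaction="ignore")  # silently skip keys not listed in fieldnames
writer.writeheader()
writer.writerows(sample_data)

csv_text = buffer.getvalue()
print(csv_text)
```

With `restval=""`, missing fields such as `body` are written as empty cells, so no `try`/`except` is needed when writing the rows.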

### 2.8 Wrap-up

Good job - you've made it!

After working on this set of exercises, you should be able to further explore the Reddit API on your own. Does `sixwaystop313` spend the most time in the subreddits in which he gets the most upvotes? Did his posting behavior change over time? Are users who have posted recently more likely to be premium Reddit users? Think about which data you need to answer such questions, and then try to extract it.

At the same time, realize that we have only scratched the surface of what's possible with APIs. Headers and pagination played a vital role in requests and were sufficient thus far. Yet, the majority of [API endpoints](https://www.reddit.com/dev/api/) require authentication (OAuth), which is a whole topic on its own.