# DSCI 511: Data acquisition and pre-processing<br>Chapter 8: Establishing a Database with Documentation
## In-depth Exercises



### A. Building and Storing Conversational Thread Topologies

Run the script in the cell below to build the Reddit data object `data`. We'll be using it throughout the exercise.


In [None]:
import requests

def get_data(sub_id):
  post_url = "https://api.pushshift.io/reddit/search/submission/?ids=" + sub_id
  post_resp = requests.get(post_url)
  post = post_resp.json()

  data = [post['data'][0]]

  comments_url = "https://api.pushshift.io/reddit/submission/comment_ids/" + sub_id
  comments_resp = requests.get(comments_url)
  ids = comments_resp.json()

  batch_size = 500
  for batch_num in range(len(ids['data'])//batch_size):
    url = "https://api.pushshift.io/reddit/comment/search?ids=" + ','.join(ids['data'][batch_size*batch_num:batch_size*(batch_num + 1)])
    resp = requests.get(url)
    batch = resp.json()
    data.extend(batch['data'])

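  # Fetch the final partial batch. `data` also holds the submission, so the
  # number of comments fetched so far is len(data) - 1 (assuming every
  # earlier batch returned all of the comments it requested).
  if len(data) != len(ids['data']) + 1:
    url = "https://api.pushshift.io/reddit/comment/search?ids=" + ','.join(ids['data'][len(data) - 1:])
    resp = requests.get(url)
    batch = resp.json()
    data.extend(batch['data'])

  return data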

sub_id = "j1dynm"
data = get_data(sub_id)

#### A.1 Exercise: Reviewing the Reddit comment data structure
As with __Chapter 4's Exercise B.1__, take 5 minutes to review `data`, but this time focus on the following:
- What is the overall object type?
- What does a single element (comment) look like? (think schema)
- How do these data connect together, i.e., where's the 'thread'?

Write any responses to these questions that you determine in the response box below.

_Response._

In [None]:
## code here

#### A.2 Exercise: Fast access by comment id
As in __Chapter 4's Exercise B.2__, if we want to move quickly between comments, a convenient option is to re-format the data into a dictionary. In particular, construct a `dict` called `comments` from `data` with the following format:

```
comments = {
  id: comment,
  ...
}
```
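
As a minimal sketch of this re-keying pattern, the block below uses made-up records rather than real Reddit output. The names `toy_data` and `toy_comments` and all ids are hypothetical, and the `t1_`/`t3_` prefixes on `parent_id` are an assumption to check against `data` in A.1.

```python
# Hypothetical stand-in for `data`: a list of dicts, each with an 'id' field.
toy_data = [
    {"id": "s0", "title": "a submission"},
    {"id": "c1", "parent_id": "t3_s0", "body": "a top-level comment"},
    {"id": "c2", "parent_id": "t1_c1", "body": "a reply"},
]

# Key every record by its own id so that lookups no longer scan the list.
toy_comments = {post["id"]: post for post in toy_data}

print(toy_comments["c2"]["body"])
```

The same comprehension should carry over to `data`, provided every element exposes an `id` key.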

In [None]:
## code here

#### A.3 Exercise: Constructing the thread's index
Now that we have all of our data in this nice format, it would be great if we could see the actual thread's structure (topology). For this, create a nested dictionary of submission and comment ids, where the keys are the ids, and the values are dictionaries of any replies.

For example, a thread with two top-level comments, having one and two replies (respectively) would have the following structure:

```
{
  submission_id:{
    comment_id1: {comment_id2: {}},
    comment_id3: {comment_id4: {}, comment_id5: {}}
  }
}
```

For this first part, define the `path` of a comment to be the list of all `id`s, starting from the top of the tree, that must be passed through to arrive at the target comment (the target's own `id` included).

For example, the `path` for `comment_id2` would be

```
paths[comment_id2] = [submission_id, comment_id1, comment_id2]
```

Specify the paths in pseudocode like this for `comment_id5` and `comment_id3` in the markdown cell below.

_Response._

#### A.4 Determine comment paths recursively
In this part of the problem your job is to write a function:

```
def get_path(post_id, posts, subpath = []):
...
  
```

where `post_id` is the `post['id']` identifier for a `post`, and `posts` is a dictionary of all posts in the thread keyed by id (including the main submission and all comments).

The `subpath` argument begins as an empty list and is built up recursively. __Important__: the first time through Case 1 (below), while `subpath` is still empty, it should immediately be seeded with the `post_id`.

In particular, `get_path()` should handle one of two cases:

1. if the post is not the main submission, i.e., __it has a `parent_id`__, this function determines the next missing piece of the subpath (the `parent_id`, since the path is built from right to left) and places it at the front of an `updated_subpath`. This updated subpath is then passed along in `return get_path(parent_id, posts, updated_subpath)`. Note: this is the recursive step, where the function calls itself.
2. if the post is the main submission, i.e., it __has no `parent_id`__, then the job is already completed and you must `return` the `subpath`.
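
One possible reading of the two cases is sketched below on a toy thread. It is an illustration under stated assumptions, not the reference solution: `toy_posts` and the helper `strip_prefix` are hypothetical, `parent_id` is assumed to be a string with a type prefix such as `t1_` or `t3_`, and `subpath=None` stands in for the empty-list default to sidestep Python's shared mutable default arguments.

```python
# Hypothetical posts keyed by id; only the submission lacks a 'parent_id'.
toy_posts = {
    "s0": {"id": "s0"},
    "c1": {"id": "c1", "parent_id": "t3_s0"},
    "c2": {"id": "c2", "parent_id": "t1_c1"},
    "c3": {"id": "c3", "parent_id": "t3_s0"},
}

def strip_prefix(parent_id):
    # Assumption: parent ids look like 't1_abc' or 't3_abc'; drop the prefix.
    return parent_id.split("_", 1)[-1]

def get_path(post_id, posts, subpath=None):
    # Seed the path with the target id while the subpath is still empty.
    if not subpath:
        subpath = [post_id]
    parent_id = posts[post_id].get("parent_id")
    if parent_id is None:
        # Case 2: the main submission has no parent, so the path is complete.
        return subpath
    # Case 1: place the parent's id at the front and recurse on the parent.
    parent = strip_prefix(parent_id)
    updated_subpath = [parent] + subpath
    return get_path(parent, posts, updated_subpath)

print(get_path("c2", toy_posts))
```

Because `updated_subpath` is a new list at every step, the caller's list is never mutated, so passing `[]` explicitly behaves the same as omitting the argument.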



In [None]:
## code here

#### A.5 Build the branching thread object
For this part we'll initialize our branching thread object as a `tree` data type (a recursive dictionary) by building from `defaultdict`:

```
from collections import defaultdict
tree = lambda: defaultdict(tree)
```

With this you can start by calling:

```
thread = tree()
```
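
To see why this definition is useful, the short sketch below shows how such a tree creates nested levels on first access. The variable `demo`, the ids, and the helper `to_dict` are hypothetical names used only for illustration.

```python
from collections import defaultdict

tree = lambda: defaultdict(tree)
demo = tree()

# Merely reading a missing key creates an empty subtree for it on the fly.
demo["s0"]["c1"]["c2"]
demo["s0"]["c3"]

def to_dict(node):
    # Recursively convert the nested defaultdicts into plain dicts for display.
    return {key: to_dict(child) for key, child in node.items()}

print(to_dict(demo))
```

A side effect worth keeping in mind is that lookups of ids that do not exist also create entries, so membership checks are safer with `in` than with indexing.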

To populate our thread, build another function:

```
def branch(thread, path):
...
```

where `thread` is as above and `path` is the result of `get_path(post_id, posts, [])` for any `post_id`. In other words, `path` is the sequence of keys that locates `post_id` within the `thread`, which we will use to build the `thread`'s structure.

In particular, `branch` must do one of the following:

1. if the path has more than one element, apply itself (recursively) toward its leaf node by calling `branch(thread[path[0]], path[1:])`, or
2. if the path has only one element (i.e., the element is a leaf), set its value to an empty dictionary: `thread[path[0]] = {}`.

When this function has completed its operation, the `thread` object should be populated through mutability and recursion.
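
The two cases can be sketched as follows on a handful of made-up paths. This is one possible illustration rather than the reference solution: `toy_paths` and `toy_thread` are hypothetical, and the loop assumes paths are processed shortest first so that each parent is placed before its replies.

```python
from collections import defaultdict

tree = lambda: defaultdict(tree)

def branch(thread, path):
    if len(path) > 1:
        # Case 1: step into the first id and recurse on the rest of the path.
        branch(thread[path[0]], path[1:])
    else:
        # Case 2: a single id is a leaf, so give it an empty dictionary.
        thread[path[0]] = {}

# Hypothetical paths, in the form get_path() is described as returning.
toy_paths = [["s0", "c1", "c2"], ["s0"], ["s0", "c1"], ["s0", "c3"]]

toy_thread = tree()
# Shortest paths first: placing a parent after its replies would let the
# leaf assignment overwrite a subtree that was already filled in.
for path in sorted(toy_paths, key=len):
    branch(toy_thread, path)

print(dict(toy_thread))
```

Since `branch` returns nothing, the result lives entirely in `toy_thread`, which is the mutability the exercise refers to.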

In [None]:
## code here