# DSCI 511: Data acquisition and pre-processing<br>Chapter 7: Building and Maintaining a Robust Acquisition Stream

## Exercises
Note: numberings refer to the main notes.

#### 7.1.1.2 Exercise: Understanding API rate limits
Read each of the above API docs and describe how much API usage is allowed per day from each platform for a given app. Do all apps get the same bandwidth? What methods/metrics do the platforms use to determine limits and overuse? How should an app be constructed to maximize data access?

_Response._

#### 7.1.2.3 Exercise: robots.txt
Take a look at the robots file for each of `facebook.com` and `amazon.com`. Determine and discuss any allowances/disallowances for bots that you might create to crawl these sites. Do you infer any cultural differences around data sharing and access between these companies, and also in comparison with Twitter?

_Response._

#### 7.2.1.2 Exercise: understanding a crontab for a recurrent, whole-site data access application
Gutenberg is an open data repository, so we should be able to download all of its data!  To start, let's review the robots file on Project Gutenberg's website:
- http://www.gutenberg.org/robots.txt

What do you notice about this file? Is anyone allowed to crawl the site? Do you think Gutenberg uses the newer, big tech rules? How frequently can we make requests?

Use the `robotexclusionrulesparser` module from Section 7.1.2.4 to determine if we can access a given data file. Use the URL for the text copy of Moby Dick:
- https://www.gutenberg.org/files/2701/2701-0.txt
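
As a starting point, here is a minimal sketch of this kind of access check. It swaps in the standard library's `urllib.robotparser` for `robotexclusionrulesparser` so that it runs without extra installs, and it parses a made-up robots file (`SAMPLE_ROBOTS`, an assumption for illustration, not Gutenberg's actual file) instead of fetching one over the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only (NOT Gutenberg's real file).
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 5
Disallow: /ebooks/search/
Allow: /files/
"""


def build_parser(robots_text):
    """Load robots.txt rules from a string rather than from the network."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS)
url = "https://www.gutenberg.org/files/2701/2701-0.txt"

# can_fetch() answers "may this user agent request this URL?"
print("allowed:", parser.can_fetch("*", url))
# crawl_delay() reports the requested pause (in seconds) between requests, if any.
print("crawl delay:", parser.crawl_delay("*"))
```

To run the check against the real file, the sample string would be replaced by the downloaded contents of the robots file linked above.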

Following the above, review the instructions on mirroring the repository:
- https://www.gutenberg.org/wiki/Gutenberg:Mirroring_How-To

and explain why Gutenberg requests using the `rsync` command-line utility to copy its data. Can you decode the two presented crontab patterns?
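
To help with decoding, the sketch below splits a crontab entry into its five schedule fields (minute, hour, day of month, month, day of week) and the command. The entry and the script name `mirror.sh` are hypothetical, chosen only for illustration; they are not the patterns from the mirroring instructions.

```python
CRON_FIELDS = ("minute", "hour", "day of month", "month", "day of week")


def decode_crontab(line):
    """Split a crontab entry into its five schedule fields and the command."""
    parts = line.split(None, 5)  # five schedule fields, then the rest is the command
    if len(parts) < 6:
        raise ValueError("expected five schedule fields followed by a command")
    schedule = dict(zip(CRON_FIELDS, parts[:5]))
    return schedule, parts[5]


# Hypothetical entry for illustration; "mirror.sh" is a made-up script name.
schedule, command = decode_crontab("30 2 * * * /home/user/mirror.sh")
for field, value in schedule.items():
    meaning = "every value" if value == "*" else value
    print(f"{field:>12}: {meaning}")
print("     command:", command)
```

Read left to right, this hypothetical entry means "at minute 30 of hour 2, on every day of every month", that is, 02:30 daily.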

_Response._

In [None]:
## place code here

#### 7.3.2.2 Exercise: a script restarter using psutil that also kills zombies
Rewrite `check_process(name)` above using psutil to 1) obtain process names more easily, without regex, and 2) use them to restart our dummy process if it has finished after 3 or fewer passes through the while loop, or kill it if it is still running after 4 or more passes.
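
One possible way to organize the pieces is sketched below; it is a starting point under stated assumptions, not the notes' `check_process`. The helper names `find_processes` and `decide_action` are hypothetical, and the pass thresholds are taken from the exercise text. The surrounding while loop, and how the dummy process is launched, are left to the exercise.

```python
def find_processes(name):
    """Return live psutil.Process objects named `name`, skipping zombies."""
    # psutil is third-party; it is imported here so that decide_action()
    # below stays usable even where psutil is not installed.
    import psutil

    matches = []
    for proc in psutil.process_iter(["name", "status"]):
        is_zombie = proc.info["status"] == psutil.STATUS_ZOMBIE
        if proc.info["name"] == name and not is_zombie:
            matches.append(proc)
    return matches


def decide_action(is_running, passes):
    """Map the process state and loop pass count to an action.

    Thresholds follow the exercise: restart a finished process within the
    first 3 passes, kill one still running at pass 4 or later.
    """
    if not is_running and passes <= 3:
        return "restart"
    if is_running and passes >= 4:
        return "kill"
    return "wait"
```

Inside the loop, `bool(find_processes(name))` would supply `is_running`, and a `"kill"` decision could be carried out by calling `.kill()` on each returned process.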

In [None]:
## place code here

## Additional In-depth Exercises
### A. Setting up an allowable-paths spider 
Here, the overall goal will be to build a tool that uses the `robotexclusionrulesparser` module to identify all allowable next steps in a web-crawl (using beautiful soup to parse the page).
#### 1. Collecting lists of allowed links
To start, set up a function that
1. collects the robots.txt file for a given url
2. constructs all allowable paths granted to 'User-agent: *' specified by the file.
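
The two steps above can be sketched with the standard library alone. This is a simplified illustration, not a full implementation of the robots exclusion rules: `robots_url` and `wildcard_rules` are hypothetical helper names, the downloading step is omitted so the example runs offline, and wildcard patterns inside paths are returned as-is rather than expanded.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Build the robots.txt location for the site that hosts page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def wildcard_rules(robots_text):
    """Collect the Allow/Disallow paths listed under 'User-agent: *'."""
    rules = {"allow": [], "disallow": []}
    agents = []            # user agents named by the group being read
    reading_rules = False  # True once the current group's rule lines began
    for raw in robots_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if reading_rules:  # a user-agent line after rules starts a new group
                agents, reading_rules = [], False
            agents.append(value)
        elif field in rules:
            reading_rules = True
            if "*" in agents and value:  # an empty Disallow restricts nothing
                rules[field].append(value)
    return rules


# Made-up robots.txt content for illustration.
sample = """\
User-agent: googlebot
Disallow: /private/

User-agent: *
Allow: /files/
Disallow: /search   # internal search pages
Disallow:
"""
print(robots_url("https://www.gutenberg.org/files/2701/2701-0.txt"))
print(wildcard_rules(sample))
```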

In [None]:
## code here

#### 2. Determine which page links are allowed
Using the `robots.txt` file obtained in __(1)__, iterate over the specified site's hyperlinks (using beautiful soup) and output a dictionary of boolean values (keyed by the site's hyperlinks) that indicates which hyperlinks are robotically accessible.
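
A rough sketch of the idea follows. To stay runnable without extra installs it swaps in the standard library's `html.parser` for Beautiful Soup and `urllib.robotparser` for `robotexclusionrulesparser`; the names `LinkCollector` and `link_permissions`, the page, and the robots rules are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit
from urllib.robotparser import RobotFileParser


class LinkCollector(HTMLParser):
    """Gather the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.hrefs.append(value)


def link_permissions(page_url, html, robots_text, agent="*"):
    """Map each same-site hyperlink on the page to True/False accessibility."""
    rules = RobotFileParser()
    rules.parse(robots_text.splitlines())
    collector = LinkCollector()
    collector.feed(html)
    host = urlsplit(page_url).netloc
    permissions = {}
    for href in collector.hrefs:
        link = urljoin(page_url, href)  # resolve relative links
        # Off-site links are governed by their own robots.txt, so skip them.
        if urlsplit(link).netloc == host:
            permissions[link] = rules.can_fetch(agent, link)
    return permissions


# Made-up page and rules for illustration.
html = (
    '<a href="/files/1.txt">book</a> '
    '<a href="/private/x">hidden</a> '
    '<a href="https://example.org/other">elsewhere</a>'
)
robots = "User-agent: *\nDisallow: /private/\n"
result = link_permissions("https://example.com/index.html", html, robots)
print(result)
```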

__Bonus:__ Discuss the potential usage of any available site maps (what are these?).

In [None]:
## code here

#### 3. Plan a strategy
How will you approach a broader site crawl? Specifically, determine a strategy for going deep into the site, repeatedly determining any new paths and whether or not they are accessible.
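
One candidate strategy, offered as a sketch rather than a recommended answer, is a breadth-first crawl that tracks already-seen links and only visits robot-accessible ones. The function name `crawl`, the `get_links` and `is_allowed` callables, and the toy `SITE` dictionary are all hypothetical stand-ins for real page fetching and robots checks.

```python
from collections import deque


def crawl(start, get_links, is_allowed, max_pages=100):
    """Breadth-first crawl that only visits robot-accessible, unseen links."""
    seen = {start}          # every link ever queued, to avoid revisiting
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if not is_allowed(url):
            continue        # never fetch (or expand) a disallowed page
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Toy site graph standing in for real pages: path -> links found on that page.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/c", "/"],
    "/b": ["/private"],
    "/c": [],
    "/private": ["/secret"],
}
order = crawl(
    "/",
    get_links=lambda url: SITE.get(url, []),
    is_allowed=lambda url: not url.startswith("/private"),
)
print(order)
```

A real crawl would also pause between requests to honor any crawl delay the robots file requests.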

_Response._

### B. Non-authenticated Twitter API access (shh)
This week's chapter reveals a nice little secret about Twitter's API: the `robots.txt` file shows that we're technically allowed to pull data from the standard REST search API through their front end. We'll build this up in pieces, starting from our work with robots files.
#### 1. Scoping Twitter's crawlable content
To start, apply the two tools developed in __(A1&ndash;A2)__ to `twitter.com` to determine what's available from the site.

_Discussion._ 

#### 2. Reviewing all allowable links
Apply the allowable-links function from __(A1)__ to `twitter.com`, and review the output. In particular, focus on the path related to search. Discuss these URLs and what they actually provide access to.

[__Hint__. Go back to the example in __Sec. 7.1.2.4__]

_Discussion._ 

In [None]:
## code here

#### 3. Scoping utility with standard search parameters
Review the standard search parameters and attempt to pass some to confirm that we are indeed accessing the same resources (just viewing them wrapped in HTML).

- https://developer.twitter.com/en/docs/tweets/search/guides/standard-operators

In particular, which parameters give us generalized access to the search API?
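
When experimenting, it can help to build the query strings programmatically so that operators are URL-encoded consistently. The sketch below assumes the front-end search page lives at `https://twitter.com/search` and takes its query in a `q` parameter; the helper name `search_url`, the operator strings, and the extra `lang` parameter are illustrative choices, and whether the front end honors them is exactly what this exercise asks you to check.

```python
from urllib.parse import urlencode

# Assumed location of the front-end search page.
SEARCH_PAGE = "https://twitter.com/search"


def search_url(terms, **params):
    """Join search terms/operators into `q` and append any extra parameters."""
    query = " ".join(terms)
    return SEARCH_PAGE + "?" + urlencode({"q": query, **params})


# Illustrative operator strings; swap in any from the guide linked above.
example = search_url(["from:nasa", "since:2019-01-01"], lang="en")
print(example)
```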

_Discussion._ 

In [None]:
## code here

#### 4. Rigorous resource confirmation
Since we know the real-time search API is not available to robots (again, go back to  __Sec. 7.1.2.4__), we should probably assume that these robot-available data are not technically the same sample of tweets we'd get from authenticated access. But is this true?

Here, the job is to replicate the exact same API calls between the front-end scraper and Twitter's API to confirm whether we're indeed accessing the same resources.

_Discussion._ 

In [None]:
## code here

#### 5. Abstracting the resource
What we'd really like to have out of this is a second, unauthenticated Twitter search API that is based on their front end. But this means that we want 'tweet objects' (see Twitter: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object).

So, here the job is to build a function that works with BeautifulSoup to extract all content from a front-end search result (limited to 499 `q` query characters, but otherwise accepting the other standard search parameters where possible) and constructs something as close as possible to a list of tweet objects, with cursors for any available pagination.

_Discussion._ 

In [None]:
## code here

### C. Non-authenticated Twitter API access (again, shh)
Considering the outcome of Exercise B (see Solutions, Section B., above), it seems that if non-authenticated access is possible, it would be through a different endpoint. In fact, one such endpoint appears to exist, because a tool that uses it has already been built.

Consider the following Python module, `GetOldTweets3`:

- https://github.com/Mottl/GetOldTweets3/blob/master/bin/GetOldTweets3

#### 1. What's going on here?
Review the module's code on GitHub, particularly the README, to assess what it does and why it works. In addition, see if you can get it working in the notebook here. How can you access the code: via GitHub, or via pip?

_Discussion._ 

In [None]:
## code here

#### 2. Try out the Module
Now let's try out some of the examples in their README to see that everything works.

In [None]:
## code here

#### 3. Evaluate the Level of Access
Select an endpoint of Twitter's REST API that this 'API' appears to mimic, and compare the volume of requests that `GetOldTweets3` appears to provide against what the REST API allows for that endpoint.

_Discussion._

In [None]:
## code here

#### 4. Explain this Utility
Review Twitter's `robots.txt` and determine the best possible explanation you can for why this module is able to exist. In this discussion, include an assessment of its use as a data engineering resource and the potential for its continued existence.

_Discussion._ 

In [None]:
## code here