
Spider Silk


Delving Into the Deep Web

  • Before expounding on this topic, one should be familiar with how my search engine is connected to my backend and frontend (a rough sketch of the Flask/Crochet glue follows the diagram):
    Search Page (search submitted)
    ↓ (raw query is processed by Redux store then passed to Flask) ↓
    Flask Search Routes (search processed)
    ↓ (processed query is passed to Scrapy framework) ↓
    ↓ (asynchronous Crochet library facilitates the connection) ↓
    Scrapy Crawler (scrapes the web with the query)
    ↓ (yielded data is processed by Flask and sent back to Redux store) ↓
    Search Results (results displayed)
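    The sketch below shows one way the Flask layer can hand the processed query to the crawler through Crochet. The route name, import path, and in-memory results list are assumptions for illustration, not the project's exact wiring:
      from flask import Flask, jsonify, request
      from crochet import setup, wait_for
      from scrapy import signals
      from scrapy.crawler import CrawlerRunner
      from scrapy.signalmanager import dispatcher
      
      from spiders.deep_crawler import DeepCrawler1  # hypothetical import path; spider shown further down
      
      setup()  # hook Crochet into Twisted's reactor so Flask can drive Scrapy
      
      app = Flask(__name__)
      runner = CrawlerRunner()
      results = []
      
      def collect_item(item, response, spider):
          """Store each scraped item as the crawl yields it."""
          results.append(dict(item))
      
      dispatcher.connect(collect_item, signal=signals.item_scraped)
      
      @wait_for(timeout=30.0)  # block the request thread until the crawl's Deferred fires
      def run_crawl(query):
          return runner.crawl(DeepCrawler1, raw_query=query)
      
      @app.route('/api/search')
      def search():
          """Receive the processed query, crawl, and return the yielded data to the Redux store."""
          results.clear()
          run_crawl(request.args.get('q', ''))
          return jsonify(results)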
    
  • I wanted to access the wealth of information that lies a layer under the surface web. I focused my search on resource repositories like JSTOR and the Wiley Library.
  • I ran into many blockers doing this:
    • After several unsuccessful attempts to access JSTOR's records programmatically, I went back to the scrapy shell to check if something was going awry with the fetch and discovered the following:
      >>> fetch('https://www.jstor.org/')
      2022-02-09 10:01:40 [scrapy.core.engine] INFO: Spider opened
      2022-02-09 10:01:40 [scrapy.core.engine] DEBUG: Crawled (420) <GET https://www.jstor.org/> (referer: None)
      • I was receiving a 420 Enhance Your Calm response, a nonstandard status code originating from Twitter, returned when the client is rate limited.
      • This, comically, conveyed to me that they weren't open to my crawler using their resources. I moved on to other libraries out of respect for the site admins and to avoid expending additional time addressing manufactured obstacles.
    • The answer to why I couldn't access the Wiley Library wasn't as immediately forthcoming. My shell showed successful fetches for both the root and subdirectories:
      >>> fetch('https://onlinelibrary.wiley.com/')
      2022-02-09 10:16:32 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://onlinelibrary.wiley.com/?cookieSet=1> from <GET https://onlinelibrary.wiley.com/>
      2022-02-09 10:16:32 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://onlinelibrary.wiley.com/> from <GET https://onlinelibrary.wiley.com/?cookieSet=1>
      2022-02-09 10:16:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://onlinelibrary.wiley.com/> (referer: None)
      
      >>> fetch('https://onlinelibrary.wiley.com/action/doSearch?AllField=poetry')
      2022-02-09 10:18:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://onlinelibrary.wiley.com/action/doSearch?AllField=poetry> (referer: None)
      • I experimented with several methods in Scrapy to work out why I couldn't get a response from the library. Finally, after some research, I checked their robots.txt by navigating to https://onlinelibrary.wiley.com/robots.txt and laughed upon discovering that all the resources I hoped to access were disallowed:
        User-agent: *
        Disallow: /action
        Disallow: /help
        Disallow: /search
        ...
        
        Crawl-delay: 1
        
      • I had configured my crawler to abide by robots.txt documents for the sake of respectful crawling, to prevent my IP from being blacklisted, and, as mentioned earlier, to avoid spending time traversing intentional obstacles.
      • It still struck me as odd that my shell was able to get a successful response. I thought it could be because I'd moved my crawler settings to a new file (for Flask integration purposes) and commented out the one provided by Scrapy.
        • I tested this by uncommenting ROBOTSTXT_OBEY = True in the original settings file, starting a new shell, and fetching again.
        • The shell still received successful responses, which told me there must be another layer of settings governing the scrapy shell.
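        • For reference, the respectful-crawling configuration lives in the Scrapy settings module; here is a hedged sketch of the kind of values involved (illustrative, not necessarily my exact settings):
          # settings.py -- illustrative values, not this project's exact configuration
          BOT_NAME = 'spider_silk'  # hypothetical bot name
          
          ROBOTSTXT_OBEY = True             # respect each site's robots.txt rules
          DOWNLOAD_DELAY = 1.0              # pace requests, mirroring Crawl-delay directives
          CONCURRENT_REQUESTS_PER_DOMAIN = 2
          USER_AGENT = 'spider_silk (+https://example.com/about-crawler)'  # identify the crawler honestly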
    • From then on I checked robots.txt documents first (ab ovo), which saved me a lot of time. I uncovered a pattern: many resource repositories disallow crawling, across the board from county libraries to universities.
    • Eventually, I discovered one library that was open to being crawled. This would be my gateway to the deep web; I could almost taste the data. Still, I needed to familiarize myself with outfitting my spiders with POST capabilities through Scrapy, so I started by grounding myself in the following concepts:
      • GET and POST requests are the two ways a search form can be interacted with.
      • The former requires string interpolation to hit different URLs based on the raw query received from the frontend; here's a conceptual outline:
        url = 'https://example.com/resources/search?q='
        raw_query = 'dogs'
        req = scrapy.Request(url=f'{url}{raw_query}')
      • The latter requires Scrapy's FormRequest.from_response() method, which uses data from the initial fetch response to target the chosen input and populate it with the form data it's given. Once the request is yielded, the callback's response parameter will contain what the search returned.
        import scrapy
        from scrapy.http import FormRequest
        
        class DeepCrawler1(scrapy.Spider):
            """Deep crawling spider."""
        
            name = 'deep_crawler_1'
            start_urls = ['https://librarytechnology.org/repository/']
            # raw_query is expected to be set on the spider from outside (e.g. as a spider argument)
        
            def parse(self, response):
                """Send the POST request that submits the search form."""
                try:
                    data = {'q': self.raw_query}
                    request = FormRequest.from_response(
                        response,
                        method='POST',
                        formdata=data,
                        headers={'Content-Type': 'application/x-www-form-urlencoded'},
                        callback=self.process_search,
                    )
                    yield request
                except Exception:
                    print(f'End of the line error in parse method for {self.name}.')
        
            def process_search(self, response):
                """Process search results: each result is a form whose submit input holds the title."""
                try:
                    inputs = response.css('input.SubmitLink')
                    resource_indices = inputs.xpath("//input/../../input[@name='RC']/@value").getall()
                    for i in range(len(inputs)):
                        base_url = 'https://librarytechnology.org/document/'
                        resource_idx = resource_indices[i]
                        url = f'{base_url}{resource_idx}'
                        text = inputs[i].css('::attr(value)').get()
                        yield {'url': url, 'text': text}
                except Exception:
                    print(f'End of the line error in process_search method for {self.name}.')
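        • Note that raw_query isn't defined anywhere in the class above; it arrives through Scrapy's spider-argument mechanism. A minimal sketch of two ways it could be supplied (the values are just for illustration):
          # From Python (e.g. the Flask/Crochet layer): keyword arguments to crawl()
          # become attributes on the spider instance, so self.raw_query == 'dogs'.
          runner.crawl(DeepCrawler1, raw_query='dogs')
          
          # Or from the command line while developing:
          #   scrapy crawl deep_crawler_1 -a raw_query=dogs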
        • The entries on the site were not anchor tags; they were forms with a series of inputs inside them.
          <form style="display: inline; margin: 0;" action="https://librarytechnology.org/document/23656" method="post">
              <input type="hidden" name="SID" value="20220209933842771">
              <input type="hidden" name="code" value="bib">
              <input type="hidden" name="RC" value="23656">
              <input type="hidden" name="id" value="23656">
              <input type="hidden" name="SID" value="20220209933842771">
              <input type="hidden" name="Row" value="2">
              <input type="hidden" name="code" value="bib">
              <p style="margin-left: 5em; text-indent: -5em">
                  <strong><span style="width: 4em; text-align: right">2. </span></strong>
                  <input type="submit" name="submit" class="SubmitLink" value="Lust for money joins Australians' and New Zealanders’ lust for blood in the most borrowed library books Index">
                  <span class="journaltitle">Press Release</span>
                  <span style="font-size: 70%"> (full text <img src="/images/fulltext.gif" alt="Full text available" border="0">)</span>
                  [August 8, 2018]
              </p>
          </form>
      • I did this programmatically as well as in the shell, the latter so that I could use Scrapy's view(response) functionality, which opens a browser tab reflecting what information was harvested. The structure of the site is reconstructed from the scraped data and thus greatly assists in developing a precise course of action.
        >>> from scrapy.http import FormRequest
        >>> data = { 'q': 'dogs' }
        >>> request = FormRequest.from_response(
        ...             response,
        ...             method='POST', 
        ...             formdata=data, 
        ...             headers={ 'Content-Type': 'application/x-www-form-urlencoded' },
        ... )
        >>> fetch(request)
        2022-02-09 09:49:30 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://librarytechnology.org/repository/search.pl> (referer: None)
        >>> view(response)

Front to Back Validation

Here is an outline of my methodology:

  • Frontend
    • HTML attribute to prevent invalid input where appropriate
    • DOM element to contain backend errors (for complex validation)
    • CSS input/button/cursor to show prohibited activity
    • If statements on the frontend preventing the wrong data from being dispatched
  • Middle
    • React store alert with backend errors for malicious actors
    • console.error (where overt alerts would be excessive)
  • Backend
    • Flask decorator to ensure the user is logged in (sketched at the end of this section)
    • Flask form to protect the database (where a specific input format is needed)
    • If statements in Python functions to ensure the right conditions are met
    • Database constraints to ensure all data is stored usefully
  • Giving care to the very tip of your frontend, particularly HTML attributes, can go a long way. It's better to prevent the user from taking an action against their best interest than to give them an error at every wrong turn; this spares the client from having to discern the source of an error. HTML attributes will either conveniently flag the problematic input or prevent the wrong input from occurring at all (as with the maxlength and accept attributes).
  • It's safe to say that HTML attributes can account for all non-malicious users, which leaves only those who seek to subvert your security, whether for white/gray hat activity or for black hat purposes. In those cases, triggering a DOM alert is appropriate, which means you may even be saved from allocating space to errors on some pages: the typical client is steered away from the error altogether by your frontend pre-validation, and the atypical client is alerted (from the store, fed by a backend form), neither of which requires a DOM element to hold your errors for display.
  • Of course there will be exceptions; there are errors whose root cause the user will have to work out themselves, as is the case with logins and signups. Even there, however, there's room for pre-validation, especially on forms with less complex requirements. Whether it's graying out unaccepted files with the HTML accept attribute or halting excessive input with the maxlength attribute, stopping invalid inputs before they reach your second layer of validation (which is still critical) eases the user's experience and is a boon to your backend servers, since most invalid inputs are fielded before reaching their validators.
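  • To make the backend layer concrete, here is a minimal sketch assuming Flask-Login and Flask-WTF; the route, form, and field names are illustrative rather than this project's exact code:
    from flask import Blueprint, jsonify
    from flask_login import login_required, current_user
    from flask_wtf import FlaskForm
    from wtforms import StringField
    from wtforms.validators import DataRequired, Length
    
    bp = Blueprint('themes', __name__)
    
    class ThemeForm(FlaskForm):
        """Second validation layer: reject malformed input before it reaches the database."""
        name = StringField('name', validators=[DataRequired(), Length(max=50)])
    
    @bp.route('/api/themes', methods=['POST'])
    @login_required  # Flask decorator to ensure the user is logged in
    def create_theme():
        form = ThemeForm()
        if not form.validate_on_submit():  # if statement ensuring the right conditions are met
            return jsonify({'errors': form.errors}), 400  # errors the frontend store can alert on
        # ...persist the theme; database constraints (NOT NULL, length limits) form the final layer
        return jsonify({'name': form.data['name'], 'user_id': current_user.id}), 201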

Working With Different Regex Flavors

  • I've found that JavaScript's regex flavor has a few key differences from Python's. It all revolves around how much information is packed into the regex string or object versus the methods surrounding it. Python puts more responsibility on the methods to guide the pattern matching, while JavaScript embeds all the relevant information in the expression itself. Take the example of making a title abbreviator, as I did in this project:
    • Py
      import re
      
      long_time_zone = 'Pacific Daylight Time'
      short_time_zone = ''.join(re.findall(r'([A-Z]){1}\w+', long_time_zone))  # PDT
    • JS
      const longTimeZone = 'Pacific Daylight Time';
      const shortTimeZone = longTimeZone.replace(/([A-Z]){1}\w+|(\s)/g, '$1'); // PDT
  • JavaScript's emphasis on the expression over the method is illustrated by the previous example. The g flag gives the expression more specificity in its search, much as the findall method does for the Python expression. Even the flag placement in the code highlights the different philosophies: JavaScript keeps the flags as close to the expression as possible, while Python makes them arguments of the methods:
    • JS
      const str = 'apple';
      const pattern = /ApPlE/im;
      const match = pattern.test(str); // true
    • Py
      import re
      
      string = 'apple'
      pattern = r'ApPlE'
      match = re.search(pattern, string, flags=re.I | re.M)
      bool(match) # True
  • The RegExp constructor and the re.compile pattern object allow multiline, commented regular expressions to be built in their respective languages (JS and Py). re.compile is also handy because flags can be stored on the compiled object and reused across every call, as opposed to being passed to each method. That situation doesn't arise in JS's flavor of regex, since flags are appended directly to the expression. Here are two examples of multiline, commented regexes, one in JS and the other in Py, both dealing with dates.
    • JS
      const dateRegex = new RegExp([
                              '([A-Z]{1}[a-z]{2}),\\s', // day of the week
                              '(\\d{2}\\s[A-Z]{1}[a-z]{2}\\s\\d{4})\\s', // day, month, and year
                              '(\\d{2}:\\d{2}:\\d{2})\\s', // time
                              '(.*)' // time zone
                              ].join(''), 'g');
    • Py
      js_date_regex = re.compile(r'''
      ([A-Z]{1}[a-z]{2}\s[A-Z]{1}[a-z]{2}\s\d{2}\s\d{4}\s\d{2}:\d{2}:\d{2})\s # date and time
      ([A-Z]{1,5}[-|+]\d{4})\s # gmt offset
      \((.*)\) # time zone
      ''', re.VERBOSE)
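    • A quick check of the compiled pattern above (the sample string mimics JavaScript's Date output and is only an illustration):
      js_date = 'Wed Feb 09 2022 09:49:30 GMT-0800 (Pacific Standard Time)'
      match = js_date_regex.search(js_date)
      match.group(1)  # 'Wed Feb 09 2022 09:49:30'
      match.group(2)  # 'GMT-0800'
      match.group(3)  # 'Pacific Standard Time'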

Sharing Redux State Between Sibling Components

  • Originally, I was thinking about resetting the state in my create theme component every time the user switched themes. I came to realize this isn't desirable behavior, since a user may want to switch themes while continuing to create one. The create theme and edit theme components are naturally decoupled since they live in separate components, so there was no issue letting the create theme component's state differ from what is stored in the Redux store and present throughout the rest of the document.
  • However, given the number of obstacles I encountered while pursuing that undesirable behavior, I was deeply curious how I would have made the two communicate if I actually wanted them to (i.e., selecting a theme on the edit form resets the input values on the create form).
  • The crux is that although style values change for the whole application as soon as the use theme button is pressed in the edit theme component, the state isn't reset in the create theme component. I decided that if I wanted this behavior I could use React context, which is what I used to construct my modals.
  • I cycled through several ideas to arrive at that conclusion:
    • altering the DOM directly (since the create theme component's elements are available to the edit theme component in the DOM) doesn't last
    • merging the components so that they can share state is a step backwards in development since they were intentionally compartmentalized
    • passing down props is not a possibility since the components are siblings and React only supports passing props from parent to child
    • storing a boolean in the Redux store that flips when the use theme button is pressed, then having a conditional in the create theme component reset its state when that value is true and flip it back to false, is a viable option but seems more convoluted than using React context