URS v3.2.0

@JosephLai241 JosephLai241 released this 26 Feb 02:12

Release date: February 25, 2021

Summary

  • Added analytical tools
    • Word frequencies generator
    • Wordcloud generator
  • Significantly improved JSON structure
  • JSON is now the default export option; the --json flag is deprecated
  • Added numerous extra flags
  • Improved logging
  • Bug fixes
  • Code refactor

Full Changelog

Added

  • User Interface
    • Analytical tools
      • Word frequencies generator.
      • Wordcloud generator.
  • Source code
    • CLI
      • Flags
        • -e - Display additional example usage.
        • --check - Run a quick check for PRAW credentials and display the rate limit table after validation.
        • --rules - Include the Subreddit's rules in the scrape data (for JSON only). This data is included in the subreddit_rules field.
        • -f - Word frequencies generator.
        • -wc - Wordcloud generator.
        • --nosave - Only display the wordcloud; do not save to file.
      • Added metavar for args help message.
      • Added additional verbose feedback if invalid arguments are given.
    • Log decorators
      • Added new decorator to log individual argument errors.
      • Added new decorator to log when no Reddit objects are left to scrape after failing validation check.
      • Added new decorator to log when an invalid file is passed into the analytical tools.
      • Added new decorator to log when the scrapes directory is missing, which would cause the new make_analytics_directory() method in DirInit.py to fail.
        • This decorator is also defined in the same file to avoid a circular import error.
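The new decorators follow the standard functools.wraps pattern. A minimal sketch of an argument-error logger (the names and behavior here are hypothetical and do not match URS's actual decorators):

```python
import logging
from functools import wraps

def log_arg_errors(function):
    """Log argument errors before re-raising them.

    Hypothetical sketch of the decorator pattern described above,
    not URS's actual code.
    """
    @wraps(function)
    def wrapper(*args, **kwargs):
        try:
            return function(*args, **kwargs)
        except ValueError as error:
            logging.getLogger(__name__).warning("Invalid argument: %s", error)
            raise
    return wrapper

@log_arg_errors
def parse_n_results(value):
    """Parse a numeric CLI argument (illustrative)."""
    return int(value)
```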
    • ASCII art
      • Added new art for the word frequencies and wordcloud generators.
      • Added new error art displayed when a problem arises while exporting data.
      • Added new error art displayed when Reddit object validation is completed and there are no objects left to scrape.
      • Added new error art displayed when an invalid file is passed into the analytical tools.
  • README
    • Added new Contact section and moved contact badges into it.
      • Apparently the contact information was not prominent enough in previous versions, since users did not email the address created specifically for URS-related inquiries.
    • Added new sections for the analytical tools.
    • Updated demo GIFs
      • Moved all GIFs to a separate branch to avoid unnecessary clones.
      • Hosting static images on Imgur.
  • Tests
    • Added additional tests for analytical tools.

Changed

  • User interface
    • JSON is now the default export option. The --csv flag is now required to export to CSV instead.
    • Improved JSON structure.
      • PRAW scraping export structure:
        • Scrape details are now included at the top of each exported file in the scrape_details field.
          • Subreddit scrapes - Includes subreddit, category, n_results_or_keywords, and time_filter.
          • Redditor scrapes - Includes redditor and n_results.
          • Submission comments scrapes - Includes submission_title, n_results, and submission_url.
        • Scrape data is now stored in the data field.
          • Subreddit scrapes - data is a list containing submission objects.
          • Redditor scrapes - data is an object containing additional nested dictionaries:
            • information - a dictionary denoting Redditor metadata,
            • interactions - a dictionary denoting Redditor interactions (submissions and/or comments). Each interaction follows the Subreddit scrapes structure.
          • Submission comments scrapes - data is a list containing additional nested dictionaries.
            • Raw comments contain dictionaries of comment_id: SUBMISSION_METADATA.
            • Structured comments follow the structure seen in raw comments, but include an extra replies field in the submission metadata, holding a list of additional nested dictionaries of comment_id: SUBMISSION_METADATA. This pattern repeats down to third-level replies.
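As a concrete illustration, a Subreddit scrape exported under the new structure would resemble the following (all field values are made up; only the scrape_details/data layout follows the description above):

```python
# Illustrative example of the new export structure -- values are hypothetical.
subreddit_scrape = {
    "scrape_details": {
        "subreddit": "askreddit",
        "category": "hot",
        "n_results_or_keywords": "10",
        "time_filter": None,
    },
    "data": [
        {
            "title": "An example submission",
            "edited": False,
            # ...remaining submission fields omitted
        },
    ],
}
```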
      • Word frequencies export structure:
        • The original scrape data filepath is included in the raw_file field.
        • data is a dictionary containing word: frequency.
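Sketched as a Python dict, a word frequencies export would look roughly like this (the filepath and counts are hypothetical):

```python
# Illustrative example of the word frequencies export structure.
word_frequencies = {
    "raw_file": "scrapes/02-25-2021/subreddits/askreddit-hot-10-results.json",
    "data": {
        "reddit": 12,
        "python": 7,
    },
}
```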
    • Log:
      • scrapes.log is now named urs.log.
      • Validation of Reddit objects is now included - invalid Reddit objects will be logged as a warning.
      • Rate limit information is now included in the log.
  • Source code
    • Moved the PRAW scrapers into their own package.
    • The Subreddit scraper's "edited" field is now either a boolean (if the post was not edited) or a string (if it was).
      • Previous iterations did not distinguish between the two types and always returned a string.
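PRAW itself returns edited as False for unedited posts and a Unix timestamp otherwise, so the boolean-or-string field can be produced by a normalizer along these lines (a hypothetical helper, not URS's actual code):

```python
from datetime import datetime, timezone

def normalize_edited(edited):
    """Return False if the post was never edited, otherwise a UTC
    timestamp string. Hypothetical helper illustrating the
    boolean-or-string behavior described above."""
    if edited is False:
        return False
    return datetime.fromtimestamp(edited, tz=timezone.utc).strftime(
        "%m-%d-%Y %H:%M:%S"
    )
```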
    • Scrape settings for the basic Subreddit scraper are now cleaned within Basic.py, further streamlining conditionals in Subreddit.py and Export.py.
    • Returning the final scrape settings dictionary from all scrapers after execution for logging purposes, further streamlining the LogPRAWScraper class in Logger.py.
    • Passing the submission URL instead of the exception into the not_found list for submission comments scraping.
      • This is a part of a bug fix that is listed in the Fixed section.
    • ASCII art:
      • Modified the args error art to display specific feedback when invalid arguments are passed.
    • Upgraded from relative to absolute imports.
    • Replaced old header comments with docstring comment block.
    • Upgraded method comments to Numpy/Scipy docstring format.
  • README
    • Moved Releases section into its own document.
    • Deleted all media from master branch.
  • Tests
    • Updated absolute imports to match new directory structure.
    • Updated a few tests to match new changes made in the source code.
  • Community documents
    • Updated PULL_REQUEST_TEMPLATE:
      • Updated section for listing changes that have been made to match new Releases syntax.
      • Wrapped New Dependencies in a code block.
    • Updated STYLE_GUIDE:
      • Created new rules for method comments.
    • Added Releases:
      • Moved Releases section from main README to a separate document.

Fixed

  • Source code
    • PRAW scraper settings
      • Bug: Invalid Reddit objects (Subreddits, Redditors, or submissions) and their respective scrape settings would be added to the scrape settings dictionary even after failing validation.
      • Behavior: URS would try to scrape invalid Reddit objects, then throw an error mid-scrape because it is unable to pull data via PRAW.
      • Fix: Returning the invalid objects list from each scraper into GetPRAWScrapeSettings.get_settings() to circumvent this issue.
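The fix can be sketched as follows (function and variable names are illustrative, not URS's exact signatures): each scraper hands its invalid-objects list back so settings are only built for objects that passed validation.

```python
def get_settings(reddit_objects, invalid_objects, master_settings):
    """Build the scrape settings dictionary, skipping any Reddit
    object that failed validation. Illustrative sketch only."""
    for reddit_object in reddit_objects:
        if reddit_object in invalid_objects:
            continue  # do not add objects that failed validation
        master_settings[reddit_object] = []
    return master_settings
```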
    • Basic Subreddit scraper
      • Bug: The time filter all would be applied to categories that do not support time filter use, resulting in errors while scraping.
      • Behavior: URS would throw an error when trying to export the file, resulting in a failed run.
      • Fix: Added a conditional that checks whether the category supports a time filter and applies either the all time filter or None accordingly.
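PRAW only accepts a time_filter for the top, controversial, and search categories, so the conditional can be sketched like this (a simplification; URS's actual category handling may differ):

```python
# Categories that accept a time filter in PRAW.
TIME_FILTER_CATEGORIES = {"controversial", "top", "search"}

def choose_time_filter(category):
    """Return the "all" time filter only for categories that support
    one; otherwise return None. Illustrative sketch of the fix."""
    return "all" if category.lower() in TIME_FILTER_CATEGORIES else None
```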

Deprecated

  • User interface
    • Removed the --json flag since JSON is now the default export option.