URS v3.2.0

@JosephLai241 JosephLai241 released this 26 Feb 02:12

Release date: February 25, 2021

Summary

  • Added analytical tools
    • Word frequencies generator
    • Wordcloud generator
  • Significantly improved JSON structure
  • JSON is now the default export option; the --json flag is deprecated
  • Added numerous extra flags
  • Improved logging
  • Bug fixes
  • Code refactor

Full Changelog

Added

  • User Interface
    • Analytical tools
      • Word frequencies generator.
      • Wordcloud generator.
  • Source code
    • CLI
      • Flags
        • -e - Display additional example usage.
        • --check - Run a quick check for PRAW credentials and display the rate limit table after validation.
        • --rules - Include the Subreddit's rules in the scrape data (for JSON only). This data is included in the subreddit_rules field.
        • -f - Word frequencies generator.
        • -wc - Wordcloud generator.
        • --nosave - Only display the wordcloud; do not save to file.
      • Added metavar for args help message.
      • Added additional verbose feedback if invalid arguments are given.
    • Log decorators
      • Added new decorator to log individual argument errors.
      • Added new decorator to log when no Reddit objects are left to scrape after failing validation check.
      • Added new decorator to log when an invalid file is passed into the analytical tools.
      • Added new decorator to log when the scrapes directory is missing, which would cause the new make_analytics_directory() method in DirInit.py to fail.
        • This decorator is also defined in the same file to avoid a circular import error.
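The new decorators follow the standard functools.wraps pattern. A minimal sketch of an argument-error logger (the names and behavior here are hypothetical and do not match URS's actual decorators):

```python
import logging
from functools import wraps

def log_arg_errors(function):
    """Log argument errors before re-raising them.

    Hypothetical sketch of the decorator pattern described above,
    not URS's actual code.
    """
    @wraps(function)
    def wrapper(*args, **kwargs):
        try:
            return function(*args, **kwargs)
        except ValueError as error:
            logging.getLogger(__name__).warning("Invalid argument: %s", error)
            raise
    return wrapper

@log_arg_errors
def parse_n_results(value):
    """Parse a numeric CLI argument (illustrative)."""
    return int(value)
```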
    • ASCII art
      • Added new art for the word frequencies and wordcloud generators.
      • Added new error art displayed when a problem arises while exporting data.
      • Added new error art displayed when Reddit object validation is completed and there are no objects left to scrape.
      • Added new error art displayed when an invalid file is passed into the analytical tools.
  • README
    • Added new Contact section and moved contact badges into it.
      • Apparently the contact information was not prominent enough in previous versions, since users did not email the address created specifically for URS-related inquiries.
    • Added new sections for the analytical tools.
    • Updated demo GIFs
      • Moved all GIFs to a separate branch to avoid unnecessary clones.
      • Hosting static images on Imgur.
  • Tests
    • Added additional tests for analytical tools.

Changed

  • User interface
    • JSON is now the default export option. The --csv flag is now required to export to CSV instead.
    • Improved JSON structure.
      • PRAW scraping export structure:
        • Scrape details are now included at the top of each exported file in the scrape_details field.
          • Subreddit scrapes - Includes subreddit, category, n_results_or_keywords, and time_filter.
          • Redditor scrapes - Includes redditor and n_results.
          • Submission comments scrapes - Includes submission_title, n_results, and submission_url.
        • Scrape data is now stored in the data field.
          • Subreddit scrapes - data is a list containing submission objects.
          • Redditor scrapes - data is an object containing additional nested dictionaries:
            • information - a dictionary denoting Redditor metadata,
            • interactions - a dictionary denoting Redditor interactions (submissions and/or comments). Each interaction follows the Subreddit scrapes structure.
          • Submission comments scrapes - data is a list containing additional nested dictionaries.
            • Raw comments contain dictionaries of comment_id: SUBMISSION_METADATA.
            • Structured comments follow the structure seen in raw comments, but include an extra replies field in the submission metadata, holding a list of additional nested dictionaries of comment_id: SUBMISSION_METADATA. This pattern repeats down to third-level replies.
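As a concrete illustration, a Subreddit scrape exported under the new structure would resemble the following (all field values are made up; only the scrape_details/data layout follows the description above):

```python
# Illustrative example of the new export structure -- values are hypothetical.
subreddit_scrape = {
    "scrape_details": {
        "subreddit": "askreddit",
        "category": "hot",
        "n_results_or_keywords": "10",
        "time_filter": None,
    },
    "data": [
        {
            "title": "An example submission",
            "edited": False,
            # ...remaining submission fields omitted
        },
    ],
}
```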
      • Word frequencies export structure:
        • The original scrape data filepath is included in the raw_file field.
        • data is a dictionary containing word: frequency.
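Sketched as a Python dict, a word frequencies export would look roughly like this (the filepath and counts are hypothetical):

```python
# Illustrative example of the word frequencies export structure.
word_frequencies = {
    "raw_file": "scrapes/02-25-2021/subreddits/askreddit-hot-10-results.json",
    "data": {
        "reddit": 12,
        "python": 7,
    },
}
```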
    • Log:
      • scrapes.log is now named urs.log.
      • Validation of Reddit objects is now included - invalid Reddit objects will be logged as a warning.
      • Rate limit information is now included in the log.
  • Source code
    • Moved the PRAW scrapers into their own package.
    • The Subreddit scraper's "edited" field is now either a boolean (if the post was not edited) or a string (if it was).
      • Previous iterations did not distinguish between the two types and always returned a string.
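PRAW itself returns edited as False for unedited posts and a Unix timestamp otherwise, so the boolean-or-string field can be produced by a normalizer along these lines (a hypothetical helper, not URS's actual code):

```python
from datetime import datetime, timezone

def normalize_edited(edited):
    """Return False if the post was never edited, otherwise a UTC
    timestamp string. Hypothetical helper illustrating the
    boolean-or-string behavior described above."""
    if edited is False:
        return False
    return datetime.fromtimestamp(edited, tz=timezone.utc).strftime(
        "%m-%d-%Y %H:%M:%S"
    )
```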
    • Scrape settings for the basic Subreddit scraper are now cleaned within Basic.py, further streamlining conditionals in Subreddit.py and Export.py.
    • Returning the final scrape settings dictionary from all scrapers after execution for logging purposes, further streamlining the LogPRAWScraper class in Logger.py.
    • Passing the submission URL instead of the exception into the not_found list for submission comments scraping.
      • This is a part of a bug fix that is listed in the Fixed section.
    • ASCII art:
      • Modified the args error art to display specific feedback when invalid arguments are passed.
    • Upgraded from relative to absolute imports.
    • Replaced old header comments with docstring comment block.
    • Upgraded method comments to Numpy/Scipy docstring format.
  • README
    • Moved Releases section into its own document.
    • Deleted all media from master branch.
  • Tests
    • Updated absolute imports to match new directory structure.
    • Updated a few tests to match new changes made in the source code.
  • Community documents
    • Updated PULL_REQUEST_TEMPLATE:
      • Updated section for listing changes that have been made to match new Releases syntax.
      • Wrapped New Dependencies in a code block.
    • Updated STYLE_GUIDE:
      • Created new rules for method comments.
    • Added Releases:
      • Moved Releases section from main README to a separate document.

Fixed

  • Source code
    • PRAW scraper settings
      • Bug: Invalid Reddit objects (Subreddits, Redditors, or submissions) and their respective scrape settings would be added to the scrape settings dictionary even after failing validation.
      • Behavior: URS would try to scrape invalid Reddit objects, then throw an error mid-scrape because it is unable to pull data via PRAW.
      • Fix: Returning the invalid objects list from each scraper into GetPRAWScrapeSettings.get_settings() to circumvent this issue.
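The fix can be sketched as follows (function and variable names are illustrative, not URS's exact signatures): each scraper hands its invalid-objects list back so settings are only built for objects that passed validation.

```python
def get_settings(reddit_objects, invalid_objects, master_settings):
    """Build the scrape settings dictionary, skipping any Reddit
    object that failed validation. Illustrative sketch only."""
    for reddit_object in reddit_objects:
        if reddit_object in invalid_objects:
            continue  # do not add objects that failed validation
        master_settings[reddit_object] = []
    return master_settings
```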
    • Basic Subreddit scraper
      • Bug: The time filter all would be applied to categories that do not support time filter use, resulting in errors while scraping.
      • Behavior: URS would throw an error when trying to export the file, resulting in a failed run.
      • Fix: Added a conditional that checks whether the category supports a time filter and applies either the all time filter or None accordingly.
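PRAW only accepts a time_filter for the top, controversial, and search categories, so the conditional can be sketched like this (a simplification; URS's actual category handling may differ):

```python
# Categories that accept a time filter in PRAW.
TIME_FILTER_CATEGORIES = {"controversial", "top", "search"}

def choose_time_filter(category):
    """Return the "all" time filter only for categories that support
    one; otherwise return None. Illustrative sketch of the fix."""
    return "all" if category.lower() in TIME_FILTER_CATEGORIES else None
```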

Deprecated

  • User interface
    • Removed the --json flag since JSON is now the default export option.