New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement CNN Transcript Scraper #58
Conversation
402080e
to
03e2558
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is beautiful. The statement-cleaning utils are really well organized, concise, and well-documented. Bless. Nearly everything is small syntactical/typo things taking that 10x to an 11x. No need for a re-review after fixing.
I do have one open test file formatting question I'm curious about.
getNormalizedSpeaker, | ||
normalizeStatementSpeakers, | ||
removeNetworkStatements, | ||
removeUnattributableStatements, | ||
} from '../cnn' | ||
|
||
describe('isTranscriptListUrl', () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Pointing to a line that didn't change, I know, but here's where it matters.)
There are two things I'm doing in my test files that I'm not sure why I'm doing it, I just picked it up.
- Nesting all the tests within a file in an outer describe, like so. So here that would be
describe('utils/cnn')
around everything. This gives an explicit and concise description to the test output, although with our focused test files this can be easily determined by the test's own filename/location. - Prefixing function names with
#
. I think this might supposed to only be used when testing class methods, and I'm improperly using them everywhere?
Your thoughts on both would be 👍.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with the first; I have no insight on the second...
Do you know where that # convention was documented that you found it from? I simply don't know enough to know one way or another.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I just picked it up looking at other people's tests at some point. Probably late at night.
Scrapers are slightly different from crawlers, though we may find reason to abstract common elements between the two since they both work with HTTP requests. This implements our first scraper, the CNNTranscriptStatementScraper which takes a transcript URL and spits out statements (text attributed to a speaker). These statements eventually need to be processed and passed to ClaimBuster, which will refine the statements into claims. Issue #24
03e2558
to
fb325cf
Compare
Scrapers are slightly different from crawlers, though we may find reason to abstract common elements between the two since they both work with HTTP requests.
This implements our first scraper, the CNNTranscriptStatementScraper which takes a transcript URL and spits out statements (text attributed to a speaker).
These statements eventually need to be processed and passed to ClaimBuster, which will refine the statements into claims.
This does NOT create a queue for it.
Issue #24