Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement CNN Transcript Scraper #58

Merged
merged 1 commit into from Jun 14, 2019
Merged

Conversation

slifty
Copy link
Contributor

@slifty slifty commented Jun 13, 2019

Scrapers are slightly different from crawlers, though we may find reason to abstract common elements between the two since they both work with HTTP requests.

This implements our first scraper, the CNNTranscriptStatementScraper which takes a transcript URL and spits out statements (text attributed to a speaker).

These statements eventually need to be processed and passed to ClaimBuster, which will refine the statements into claims.

This does NOT create a queue for it.

Issue #24

@slifty slifty requested a review from reefdog June 13, 2019 15:56
src/server/utils/cnn.js Outdated Show resolved Hide resolved
@slifty slifty force-pushed the 24-add-cnn-transcript-crawler branch 2 times, most recently from 402080e to 03e2558 Compare June 13, 2019 16:03
Copy link
Collaborator

@reefdog reefdog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is beautiful. The statement-cleaning utils are really well organized, concise, and well-documented. Bless. Nearly everything is small syntactical/typo things taking that 10x to an 11x. No need for a re-review after fixing.

I do have one open test file formatting question I'm curious about.

src/server/utils/cnn.js Show resolved Hide resolved
src/server/utils/cnn.js Show resolved Hide resolved
src/server/utils/cnn.js Show resolved Hide resolved
src/server/utils/cnn.js Outdated Show resolved Hide resolved
src/server/utils/cnn.js Show resolved Hide resolved
getNormalizedSpeaker,
normalizeStatementSpeakers,
removeNetworkStatements,
removeUnattributableStatements,
} from '../cnn'

describe('isTranscriptListUrl', () => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Pointing to a line that didn't change, I know, but here's where it matters.)

There are two things I'm doing in my test files that I'm not sure why I'm doing it, I just picked it up.

  • Nesting all the tests within a file in an outer describe, like so. So here that would be describe('utils/cnn') around everything. This gives an explicit and concise description to the test output, although with our focused test files this can be easily determined by the test's own filename/location.
  • Prefixing function names with #. I think this might supposed to only be used when testing class methods, and I'm improperly using them everywhere?

Your thoughts on both would be 👍.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with the first; I have no insight on the second...

Do you know where that # convention was documented that you found it from? I simply don't know enough to know one way or another.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I just picked it up looking at other people's tests at some point. Probably late at night.

src/server/utils/cnn.js Outdated Show resolved Hide resolved
Scrapers are slightly different from crawlers, though we may find reason
to abstract common elements between the two since they both work with
HTTP requests.

This implements our first scraper, the CNNTranscriptStatementScraper which
takes a transcript URL and spits out statements (text attributed to
a speaker).

These statements eventually need to be processed and passed to
ClaimBuster, which will refine the statements into claims.

Issue #24
@slifty slifty force-pushed the 24-add-cnn-transcript-crawler branch from 03e2558 to fb325cf Compare June 13, 2019 20:35
@slifty slifty merged commit 1ccef1a into master Jun 14, 2019
@slifty slifty deleted the 24-add-cnn-transcript-crawler branch June 14, 2019 13:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants