Implement CNN Transcript Scraper #58

slifty · 2019-06-13T15:56:27Z

Scrapers are slightly different from crawlers, though we may find reason to abstract common elements between the two since they both work with HTTP requests.

This implements our first scraper, the CNNTranscriptStatementScraper, which takes a transcript URL and spits out statements (text attributed to a speaker).
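
To make "statements" concrete, here is a minimal sketch of turning raw transcript text into speaker-attributed statements. It is not the code in this PR: `extractStatements` and `SPEAKER_PATTERN` are hypothetical names, and the sketch assumes each speaker turn starts with an upper-case label followed by a colon.

```javascript
// Hypothetical sketch, not the PR's implementation.
// Assumption: a speaker turn starts with an upper-case label and a colon,
// e.g. "ANDERSON COOPER, CNN ANCHOR: Good evening."
const SPEAKER_PATTERN = /^([A-Z][A-Z0-9 .,'()\-]*):\s*(.*)$/

function extractStatements(transcriptText) {
  const statements = []
  for (const rawLine of transcriptText.split('\n')) {
    const line = rawLine.trim()
    if (line === '') continue

    const match = line.match(SPEAKER_PATTERN)
    if (match) {
      // A new speaker label starts a new statement.
      statements.push({ speaker: match[1].trim(), text: match[2].trim() })
    } else if (statements.length > 0) {
      // A line without a label continues the current speaker's statement.
      const current = statements[statements.length - 1]
      current.text = `${current.text} ${line}`.trim()
    }
    // Lines that appear before any speaker label cannot be attributed and are dropped.
  }
  return statements
}

const sample = [
  '(COMMERCIAL BREAK)',
  'ANDERSON COOPER, CNN ANCHOR: Good evening.',
  'Thanks for joining us.',
  'CUOMO: Hello.',
].join('\n')

console.log(extractStatements(sample))
```

Each returned object pairs a `speaker` with its `text`, which is the shape a later step could clean up before handing statements to ClaimBuster.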

These statements eventually need to be processed and passed to ClaimBuster, which will refine the statements into claims.

This does NOT create a queue for it.

Issue #24

reefdog

This is beautiful. The statement-cleaning utils are really well organized, concise, and well-documented. Bless. Nearly everything is small syntactical/typo things taking that 10x to an 11x. No need for a re-review after fixing.

I do have one open test file formatting question I'm curious about.

reefdog · 2019-06-13T16:41:58Z

src/server/utils/__test__/cnn.test.js

+  getNormalizedSpeaker,
+  normalizeStatementSpeakers,
+  removeNetworkStatements,
+  removeUnattributableStatements,
 } from '../cnn'

 describe('isTranscriptListUrl', () => {


(Pointing to a line that didn't change, I know, but here's where it matters.)

There are two things I'm doing in my test files without being sure why; I just picked them up.

1. Nesting all the tests within a file in an outer describe. Here that would be describe('utils/cnn') around everything. This gives an explicit and concise description to the test output, although with our focused test files this can be easily determined from the test's own filename/location.

2. Prefixing function names with #. I think this is supposed to be used only when testing class methods, and I'm improperly using it everywhere?
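
For illustration, both conventions can be sketched together. The `#` prefix is commonly associated with the RSpec habit of writing `#method` for instance methods and `.method` for class-level ones, so it may be a borrowed convention rather than anything Jest requires. In the sketch below, the small `describe`/`it` functions are stand-ins so the example runs under plain Node (a real test file would use Jest's globals and `expect`), and this `isTranscriptListUrl` is a hypothetical placeholder, not the real util.

```javascript
// Stand-ins for Jest's describe/it so the sketch runs under plain Node.
// They record the full test name: ancestor describe names joined by spaces.
const reportedNames = []
const describeStack = []

function describe(name, body) {
  describeStack.push(name)
  body()
  describeStack.pop()
}

function it(name, body) {
  body() // a throwing body means a failing test
  reportedNames.push([...describeStack, name].join(' '))
}

// Hypothetical placeholder; the real check lives in src/server/utils/cnn.js.
function isTranscriptListUrl(url) {
  return url.includes('/TRANSCRIPTS/')
}

// Convention 1: one outer describe named after the module under test.
describe('utils/cnn', () => {
  // Convention 2: the function name prefixed with #.
  describe('#isTranscriptListUrl', () => {
    it('accepts a transcript list URL', () => {
      if (!isTranscriptListUrl('http://example.com/TRANSCRIPTS/list.html')) {
        throw new Error('expected a transcript list URL to be accepted')
      }
    })

    it('rejects an unrelated URL', () => {
      if (isTranscriptListUrl('http://example.com/about.html')) {
        throw new Error('expected an unrelated URL to be rejected')
      }
    })
  })
})

console.log(reportedNames)
```

The recorded names show what the outer describe buys: every test name carries the module and function it belongs to, even when read away from the file path.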

Your thoughts on both would be 👍.

slifty

I'm fine with the first; I have no insight on the second...

Do you know where you found that # convention documented? I simply don't know enough to say one way or another.

reefdog

I think I just picked it up looking at other people's tests at some point. Probably late at night.

slifty requested a review from reefdog June 13, 2019 15:56

reefdog reviewed Jun 13, 2019

slifty force-pushed the 24-add-cnn-transcript-crawler branch 2 times, most recently from 402080e to 03e2558 Compare June 13, 2019 16:03

reefdog requested changes Jun 13, 2019

reefdog reviewed Jun 13, 2019

slifty force-pushed the 24-add-cnn-transcript-crawler branch from 03e2558 to fb325cf Compare June 13, 2019 20:35

reefdog approved these changes Jun 14, 2019

slifty merged commit 1ccef1a into master Jun 14, 2019

slifty deleted the 24-add-cnn-transcript-crawler branch June 14, 2019 13:44
