Batch job: write script to pull all current SED transcripts #5

andrewmarklloyd · 2018-10-31T02:55:42Z

Convert each PDF to a text format that can be scraped
Input the text into Bigtable as unclassified

andrewmarklloyd · 2018-11-11T06:07:00Z

I currently have a proof-of-concept script that gets all the PDF URLs and downloads the PDFs, plus a separate function that converts a PDF to text. I still need to figure out how to associate each transcript URL with the episode ID in the softwaredaily.com MongoDB, or something equivalent. The algorithm will be something like the following:

get all transcript objects from the WordPress API
for each transcript object:
- use the object's PDF URL to download the PDF
- convert the transcript PDF to text
- upload the text to Bigtable as unclassified
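
The loop above could be sketched in Python roughly as follows. This is only an illustration under assumptions, not the actual script: `Transcript`, `ingest_transcripts`, and the three callables (`download_pdf`, `pdf_to_text`, `upload_text`) are hypothetical names, and the real WordPress fetch, PDF conversion, and Bigtable write are left as functions passed in by the caller.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Transcript:
    """Hypothetical shape of one transcript object from the WordPress API."""
    post_id: int
    pdf_url: str


def ingest_transcripts(
    transcripts: Iterable[Transcript],
    download_pdf: Callable[[str], bytes],
    pdf_to_text: Callable[[bytes], str],
    upload_text: Callable[[int, str, str], None],
) -> List[int]:
    """Download, convert, and upload each transcript in turn.

    Returns the post ids that failed so the batch can be retried later.
    """
    failed: List[int] = []
    for transcript in transcripts:
        try:
            pdf_bytes = download_pdf(transcript.pdf_url)
            text = pdf_to_text(pdf_bytes)
            # Every new transcript is stored with the "unclassified" label.
            upload_text(transcript.post_id, text, "unclassified")
        except Exception:
            # One bad PDF should not abort the whole batch.
            failed.append(transcript.post_id)
    return failed
```

Passing the three steps in as callables keeps the loop independent of any particular HTTP, PDF, or Bigtable client, so it can be exercised with in-memory fakes before it is pointed at real services. Keying the upload on an id rather than the URL is also an assumption; it leaves room for the episode id mapping that is still to be worked out.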

andrewmarklloyd changed the title ~~Batch job: write script to pull all SED transcripts~~ Batch job: write script to pull all current SED transcripts Oct 31, 2018

andrewmarklloyd added this to To do in Feed 2.0 Nov 2, 2018

andrewmarklloyd added the ContentIngester label Nov 2, 2018

andrewmarklloyd moved this from To do to In progress in Feed 2.0 Nov 11, 2018
