This repository has been archived by the owner on Jul 21, 2019. It is now read-only.

Batch job: write script to pull all current SED transcripts #5

Open
andrewmarklloyd opened this issue Oct 31, 2018 · 1 comment

Comments


andrewmarklloyd commented Oct 31, 2018

  • Convert each transcript PDF to a text format that can be scraped
  • Insert the text into Bigtable as unclassified
andrewmarklloyd changed the title from "Batch job: write script to pull all SED transcripts" to "Batch job: write script to pull all current SED transcripts" on Oct 31, 2018
andrewmarklloyd added this to To do in Feed 2.0 on Nov 2, 2018

andrewmarklloyd commented Nov 11, 2018

I currently have a proof-of-concept script that gets all the PDF URLs and downloads the PDFs, plus another function that converts a PDF to text. I still need to figure out how to associate each transcript URL with the episode id in the softwaredaily.com MongoDB, or something equivalent. The algorithm will be something like the following:

  • get all transcript objects from the WordPress API
  • for each transcript object:
    • use the PDF URL to download the PDF
    • convert the transcript PDF to text
    • upload the text to Bigtable as unclassified
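
A minimal sketch of the loop above in Python, not the actual script. Everything named here is an assumption for illustration: the `Transcript` fields, the `process_transcripts` function, and the three callables passed in for downloading, converting, and uploading. The real WordPress response shape and the storage schema are not described in this issue, so those steps are left pluggable.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Transcript:
    # Hypothetical shape; the real WordPress transcript object may differ.
    post_id: int
    pdf_url: str


def process_transcripts(
    transcripts: Iterable[Transcript],
    download_pdf: Callable[[str], bytes],
    pdf_to_text: Callable[[bytes], str],
    upload_text: Callable[[Transcript, str], None],
) -> List[int]:
    """Download, convert, and upload each transcript.

    Returns the post ids that failed, so one bad PDF does not
    stop the rest of the batch.
    """
    failed: List[int] = []
    for transcript in transcripts:
        try:
            pdf_bytes = download_pdf(transcript.pdf_url)
            text = pdf_to_text(pdf_bytes)
            # The caller's upload function is expected to mark the
            # row as "unclassified" when it writes it.
            upload_text(transcript, text)
        except Exception:
            # Broad catch is deliberate for a batch job: record and move on.
            failed.append(transcript.post_id)
    return failed
```

Passing the three steps in as functions keeps the loop testable without network access, and the returned list of failed ids gives a simple way to re-run only the transcripts that did not make it through.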

andrewmarklloyd moved this from To do to In progress in Feed 2.0 on Nov 11, 2018