Extract your highlights and annotations from PDFs stored in Zotero and get insights into your readings and thoughts from the past.
I am a user of Zotero. I store all my research literature in Zotero collections. I annotate the PDFs with highlights and comments. I add notes to many of my Zotero entries.
I want a basic text analysis of a specific collection, or of particular files, to aid my literature review. The analysis can be quantitative (e.g. word counts and frequencies) or qualitative (e.g. the context in which words appear, and basic sentiment analysis).
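As a rough illustration of the two kinds of analysis mentioned above, the following sketch counts word frequencies and lists each occurrence of a keyword with a few words of context. The function names (`tokenize`, `word_frequencies`, `keyword_in_context`) and the sample text are made up for this example. Sentiment analysis is left out because it would need an external library.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_frequencies(text, top_n=5):
    """Return the top_n most common tokens with their counts."""
    return Counter(tokenize(text)).most_common(top_n)


def keyword_in_context(text, keyword, window=3):
    """Return (left context, keyword, right context) for every occurrence."""
    tokens = tokenize(text)
    target = keyword.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


# Made-up sample text standing in for extracted PDF content.
sample = ("Annotations capture what a reader found important. "
          "Highlights and annotations together record a reading history.")

print(word_frequencies(sample, top_n=3))
print(keyword_in_context(sample, "annotations", window=2))
```

The tokenizer here is deliberately naive (no stop-word removal, no stemming); a real report would probably want both.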
I would like the analysis to provide three artefacts:
- A pre-formatted “report” (HTML/PDF)
- An interactive “report” with “cookbooks” that allows me to tweak the analysis
- A data dump in a “tidy” format
- Create Zotero plugin
- PDF.js to scan attachments
- Export article/annotation content and metadata to a tidy format
- Platform-independent, interactive reports
- Python, in Jupyter
- R Notebooks
- Works with Binder
What do we want to get out of the text, highlights, and comments?
- Basic quantitative analysis of full texts.
- See Voyant: https://voyant-tools.org/?corpus=b5abdc236ff751a87090c7aa2ecdf14a
- Analysis of full texts and annotations
- What are the most frequent categories for individual papers?
- What are the relevant papers for individual categories?
- Interactively explore articles & annotations
- Show all occurrences of a certain category, including quotes from full texts
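The category questions above could be answered along these lines, assuming that categories are written as #hashtags inside annotation comments. The record layout (paper key plus comment) and the sample data are hypothetical; this is a sketch, not the plugin's actual export format.

```python
import re
from collections import Counter, defaultdict

# Hypothetical annotation records: (paper key, annotation comment).
annotations = [
    ("smith2019", "Good overview #methods #sampling"),
    ("smith2019", "Unclear sample size #methods"),
    ("lee2021", "Key definition #theory"),
    ("lee2021", "Compare with Smith #methods"),
]


def extract_tags(comment):
    """Return the hashtags in a comment, without the leading '#'."""
    return re.findall(r"#(\w+)", comment)


def tags_per_paper(records):
    """Count how often each category appears in each paper."""
    counts = defaultdict(Counter)
    for paper, comment in records:
        counts[paper].update(extract_tags(comment))
    return counts


def papers_per_tag(records):
    """Count how often each paper uses each category."""
    counts = defaultdict(Counter)
    for paper, comment in records:
        for tag in extract_tags(comment):
            counts[tag][paper] += 1
    return counts


def quotes_for_tag(records, tag):
    """Return every (paper, comment) pair that carries the given category."""
    return [(paper, comment) for paper, comment in records
            if tag in extract_tags(comment)]


print(tags_per_paper(annotations)["smith2019"].most_common(1))
print(papers_per_tag(annotations)["methods"].most_common())
print(quotes_for_tag(annotations, "theory"))
```

In an interactive notebook, `quotes_for_tag` would be the natural function to hook up to a search box or dropdown.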
“Tidy” data frame for exporting data
QUESTION: How should floating annotations be handled? Attach them to the closest section, or to the closest paragraph?
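One possible answer to the question above is to attach a floating annotation to the nearest paragraph on the same page. The sketch below assumes that both paragraphs and annotations come with a page number and a vertical position, with y increasing down the page. The classes `Paragraph` and `FloatingNote` are hypothetical and do not reflect what PDF.js or Zotero actually return.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    number: int      # paragraph number in the document
    page: int
    y_top: float     # assumed: y increases down the page
    y_bottom: float


@dataclass
class FloatingNote:
    page: int
    y: float
    text: str


def vertical_distance(note, para):
    """Distance from the note to the paragraph; 0 if the note sits inside it."""
    if para.y_top <= note.y <= para.y_bottom:
        return 0.0
    return min(abs(note.y - para.y_top), abs(note.y - para.y_bottom))


def closest_paragraph(note, paragraphs):
    """Return the nearest paragraph on the note's page, or None if there is none."""
    same_page = [p for p in paragraphs if p.page == note.page]
    if not same_page:
        return None
    return min(same_page, key=lambda p: vertical_distance(note, p))


paragraphs = [
    Paragraph(number=1, page=1, y_top=0, y_bottom=100),
    Paragraph(number=2, page=1, y_top=120, y_bottom=300),
    Paragraph(number=3, page=2, y_top=0, y_bottom=200),
]
note = FloatingNote(page=1, y=115, text="Check this claim")
print(closest_paragraph(note, paragraphs))
```

Mapping to the closest section would then follow from the paragraph, provided each paragraph records which section it belongs to.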
| Column | Type | Notes |
|---|---|---|
| ID | numeric | unique, sequential (1…) |
| section # | numeric | sequential (1…) |
| section ID | string / factor [if available] | Abstract / Introduction / etc. |
| paragraph # | numeric | Within section? Overall document? |
| word | string / factor | – |
| highlight ID | numeric | Corresponds to sentences/phrases highlighted. |
| annotation-text | string / dataframe | String can be prepared into own “tidy” data format? |
| code / tag | dataframe | #hashtag(s) |
- The table above does NOT include footnotes / endnotes.
- Working examples
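To make the “tidy” layout described in the table above more concrete, here is a minimal sketch that produces one row per word and writes the result as CSV using only the standard library. Several things are assumptions made for illustration: the input structures (`sections`, `highlights`), the field names, paragraph numbering restarting within each section, and highlights being given as token index ranges.

```python
import csv
import io
import re

FIELDS = ["id", "section_number", "section_id", "paragraph_number",
          "word", "highlight_id", "annotation_text", "tags"]


def tidy_rows(sections, highlights):
    """Build one row per word.

    sections:   list of (section_id, [paragraph_text, ...])
    highlights: list of dicts with 'id', 'section', 'paragraph',
                'start', 'end' (token range, end exclusive) and 'comment'
    """
    rows = []
    row_id = 1
    for s_num, (section_id, paragraphs) in enumerate(sections, start=1):
        # Assumption: paragraph numbers restart within each section.
        for p_num, text in enumerate(paragraphs, start=1):
            for w_idx, word in enumerate(re.findall(r"\w+", text)):
                # First highlight covering this token, if any.
                hit = next((h for h in highlights
                            if h["section"] == s_num
                            and h["paragraph"] == p_num
                            and h["start"] <= w_idx < h["end"]), None)
                comment = hit["comment"] if hit else ""
                rows.append({
                    "id": row_id,
                    "section_number": s_num,
                    "section_id": section_id,
                    "paragraph_number": p_num,
                    "word": word.lower(),
                    "highlight_id": hit["id"] if hit else "",
                    "annotation_text": comment,
                    "tags": " ".join(re.findall(r"#\w+", comment)),
                })
                row_id += 1
    return rows


def to_csv(rows):
    """Serialize the rows to a CSV string with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up document and highlight data.
sections = [
    ("Abstract", ["We study reading habits."]),
    ("Introduction", ["Annotations matter.", "They record thought."]),
]
highlights = [
    {"id": 1, "section": 1, "paragraph": 1, "start": 2, "end": 4,
     "comment": "core topic #habits"},
]

rows = tidy_rows(sections, highlights)
print(to_csv(rows))
```

The same rows could be loaded into a pandas or R data frame; here the tags are flattened into a single space-separated string rather than a nested data frame, which keeps the CSV flat.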