NHagar/substack-collection

This repository contains data collection code for a research project examining Substack newsletters. You can read a little about the research in this Tow Center report.

This code runs two types of data collection:

  1. Newsletter collection: For every category in Substack's internal taxonomy, collect all available newsletters.

  2. Post collection: For all available newsletters, collect every (non-paywalled) post.
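The two stages above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the `fetch_newsletters` and `fetch_posts` callables, and the `url` and `paywalled` field names, are hypothetical placeholders for whatever requests and response fields the real scripts use.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical signatures: the repository's real functions are not shown in this README.
FetchNewsletters = Callable[[str], Iterable[dict]]  # category -> newsletter records
FetchPosts = Callable[[str], Iterable[dict]]        # newsletter URL -> post records


def collect_newsletters(
    categories: Iterable[str], fetch_newsletters: FetchNewsletters
) -> Dict[str, dict]:
    """Stage 1: gather newsletters from every category, keyed by URL."""
    newsletters: Dict[str, dict] = {}
    for category in categories:
        for newsletter in fetch_newsletters(category):
            # Keep a single record per URL in case a newsletter is listed
            # under more than one category (an assumption about the data).
            record = newsletters.setdefault(
                newsletter["url"], {**newsletter, "categories": []}
            )
            record["categories"].append(category)
    return newsletters


def collect_posts(
    newsletters: Dict[str, dict], fetch_posts: FetchPosts
) -> List[dict]:
    """Stage 2: gather every non-paywalled post for each newsletter."""
    posts: List[dict] = []
    for url in newsletters:
        for post in fetch_posts(url):
            # "paywalled" is an assumed field name used for illustration.
            if post.get("paywalled"):
                continue
            posts.append({**post, "newsletter_url": url})
    return posts
```

Passing the fetch functions in as arguments keeps the sketch independent of any particular HTTP client and makes it easy to exercise with stub data.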

Installation

git clone https://github.com/NHagar/substack-collection.git

cd substack-collection

pip install -r requirements.txt

Usage

python category_collect.py collects the newsletters listed under every Substack category.

python newsletter_collect.py collects posts for the newsletters surfaced by category collection, so run it after category_collect.py (this takes about a week to run).

json_to_pq.py converts the raw JSON files into a Parquet dataset.

scripts/ contains one-off checks and analyses.
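As a rough illustration of the JSON-to-Parquet step performed by json_to_pq.py, the sketch below reads every `.json` file in a directory and writes the combined records to one Parquet file. It is not the repository's implementation: the function names, the flat directory layout, and the use of pandas (which needs `pyarrow` or `fastparquet` installed to write Parquet) are all assumptions.

```python
import json
from pathlib import Path
from typing import List


def load_records(raw_dir: str) -> List[dict]:
    """Read every .json file in raw_dir into one flat list of records."""
    records: List[dict] = []
    for path in sorted(Path(raw_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        # A file may hold a single object or a list of objects (assumed layout).
        records.extend(data if isinstance(data, list) else [data])
    return records


def write_parquet(records: List[dict], out_path: str) -> None:
    """Write records to a Parquet file using pandas."""
    # Imported here so load_records works without third-party packages.
    import pandas as pd

    pd.DataFrame.from_records(records).to_parquet(out_path, index=False)
```

A call such as `write_parquet(load_records("raw"), "posts.parquet")` would then produce the dataset, where both paths are placeholders.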
