This repository contains data collection code for a research project examining Substack newsletters. You can read a little about the research in this Tow Center report.

This code runs two types of data collection:

Newsletter collection: For every category in Substack's internal taxonomy, collect all available newsletters.
Post collection: For all available newsletters, collect every (non-paywalled) post.
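
The two stages above can be sketched roughly as follows. This is a minimal illustration of the control flow only, not the repository's actual implementation: the `fetch_newsletters` and `fetch_posts` callables, the `url` and `paywalled` keys, and the de-duplication by URL are all hypothetical stand-ins for whatever the real collection scripts do.

```python
from typing import Callable, Iterable, List


def collect_newsletters(
    categories: Iterable[str],
    fetch_newsletters: Callable[[str], Iterable[dict]],
) -> List[dict]:
    """Stage 1: gather the newsletters listed under every category."""
    seen = {}
    for category in categories:
        for newsletter in fetch_newsletters(category):
            # Assumption: a newsletter can appear in several categories,
            # so keep only the first copy, keyed on a hypothetical "url" field.
            seen.setdefault(newsletter["url"], newsletter)
    return list(seen.values())


def collect_posts(
    newsletters: Iterable[dict],
    fetch_posts: Callable[[str], Iterable[dict]],
) -> List[dict]:
    """Stage 2: gather every non-paywalled post for each newsletter."""
    posts = []
    for newsletter in newsletters:
        for post in fetch_posts(newsletter["url"]):
            # Assumption: paywalled posts carry a hypothetical boolean flag.
            if not post.get("paywalled", False):
                posts.append(post)
    return posts
```

Passing the fetchers in as arguments keeps the sketch free of any network code, so the same two functions can be exercised with stub data before pointing them at a real source.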

Installation

git clone https://github.com/NHagar/substack-collection.git

cd substack-collection

pip install -r requirements.txt

Usage

python category_collect.py runs data collection on all Substack categories

python newsletter_collect.py runs data collection for newsletters surfaced in category collection (this takes about a week to run)

json_to_pq.py translates raw JSON files into a Parquet dataset
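
For readers unfamiliar with this kind of conversion, the general idea can be sketched as below. This is not the contents of json_to_pq.py: the directory layout, the `source_file` column, and the function names are assumptions made for illustration, and the Parquet step assumes pandas and pyarrow are installed.

```python
import json
from pathlib import Path


def load_records(raw_dir):
    """Read every *.json file in raw_dir into one flat list of dicts."""
    records = []
    for path in sorted(Path(raw_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            payload = json.load(f)
        # Assumption: each file holds either one JSON object or a list of them.
        items = payload if isinstance(payload, list) else [payload]
        for item in items:
            # Record which file each row came from (hypothetical extra column).
            records.append({**item, "source_file": path.name})
    return records


def write_parquet(records, out_path):
    """Write the records to a single Parquet file."""
    # Imported lazily so load_records works without pandas installed.
    import pandas as pd

    # json_normalize flattens nested objects into dotted column names.
    pd.json_normalize(records).to_parquet(out_path, index=False)
```

A call such as `write_parquet(load_records("raw"), "posts.parquet")` would then produce one file; the `raw` directory and output name here are placeholders.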

scripts/ contains one-off checks and analyses
