Utility for adding archive.org links to markdown files in the format [...](original link) ([a](archive.org link))

NunoSempere/longnow-for-markdown

About

This utility takes a markdown file, and creates a new markdown file in which each link is accompanied by an archive.org link, in the format [...](original link) ([a](archive.org link)).
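
As a rough illustration of the target format, the sketch below builds the output text for a single link. The function name format_archived_link and the timestamp in the example archive URL are made up for illustration; this is not code taken from longnow.sh.

```shell
#!/usr/bin/env bash

# Hypothetical helper, not part of longnow.sh: prints one markdown link
# followed by its "(a)" archive link.
# Arguments: link text, original URL, archive.org URL.
format_archived_link() {
  local text="$1" url="$2" archive_url="$3"
  printf '[%s](%s) ([a](%s))\n' "$text" "$url" "$archive_url"
}

# Example call; the snapshot timestamp below is illustrative only.
format_archived_link "Example" "https://example.com" \
  "https://web.archive.org/web/20200101000000/https://example.com"
```

The original link is left untouched, and the archive link is appended right after it as a short "(a)" link.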

I use it to archive links in this forecasting newsletter, which contains the following footer:

Note to the future: All links are added automatically to the Internet Archive, using this tool (a). "(a)" for archived links was inspired by Milan Griffes (a), Andrew Zuckerman (a), and Alexey Guzey (a).

Requirements

This utility depends on archivenow, which in turn requires a Python installation. archivenow can be installed with

pip install archivenow ## respectively, pip3, pipx, etc. depending on the system. I use pipx

You can instead install it with a virtual environment, which is what works on Ubuntu 24.04:

cd ~/.local
git clone git@github.com:oduwsdl/archivenow.git
cd archivenow
python3 -m venv ./venv
source ./venv/bin/activate
pip install -r requirements.txt
pip install setuptools
pip install ./

ln -s $(realpath ./venv/bin/archivenow) ~/.local/bin/archivenow

longnow also requires jq. On Debian, it can be installed with:

sudo apt install jq

On other systems, use your distribution's package manager.
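
Before running the tool, it may help to confirm that both dependencies are reachable from the shell. The following is a minimal sketch; missing_deps is a hypothetical helper name, not something shipped with longnow.

```shell
#!/usr/bin/env bash

# Hypothetical pre-flight check, not part of longnow.sh: prints the name of
# each given command that cannot be found on the PATH.
missing_deps() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Report any of the two dependencies named above that are not installed.
missing="$(missing_deps archivenow jq)"
if [ -n "$missing" ]; then
  echo "Missing dependencies:" $missing >&2
fi
```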

Installation

Add this file to your path, for instance by giving it execute permissions (with chmod 755 longnow) and moving it to the /usr/bin folder:

curl https://raw.githubusercontent.com/NunoSempere/longNowForMd/master/longnow.sh > longnow
cat longnow ## probably a good idea to at least see what's there before giving it execute permissions
chmod 755 longnow
sudo mv longnow /usr/bin/longnow

Usage

$ longnow file.md

Initially, the process took a long time even for a reasonably sized file, so this was more of a "fire and forget, and then come back in a couple of hours" tool. The process can be safely stopped and restarted at any point: archive links are remembered, although the errors file is created again each time.

As of a recent iteration of this program, if archive.org already has a snapshot of the page, that snapshot is used instead of requesting a new one. This saves a great deal of time, but the copy used may be less up to date. If this behavior is not desired, it can be removed by hand, by deleting the lines around if [ "$urlAlreadyInArchiveOnline" == "" ]; then in the script.
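
One way such a lookup could work is sketched below. It is an assumption, not a copy of what longnow.sh does: it queries the public Wayback Machine availability endpoint with curl and extracts the snapshot URL with jq. The names WAYBACK_API, closest_snapshot_url and existing_snapshot are hypothetical.

```shell
#!/usr/bin/env bash

# Assumption: the public Wayback Machine availability endpoint, which
# returns JSON describing the closest snapshot of a URL, if there is one.
WAYBACK_API="https://archive.org/wayback/available"

# Reads an availability response on stdin and prints the snapshot URL,
# or nothing when no snapshot is listed.
closest_snapshot_url() {
  jq -r '.archived_snapshots.closest.url // empty'
}

# Hypothetical wrapper: prints an existing snapshot URL for "$1", if any.
existing_snapshot() {
  curl --silent --get --data-urlencode "url=$1" "$WAYBACK_API" | closest_snapshot_url
}

# Usage sketch (needs network access, so it is left commented out):
# snapshot="$(existing_snapshot "https://example.com")"
# if [ "$snapshot" = "" ]; then
#   echo "no snapshot yet; submit the page for archiving"
# fi
```

An empty result means no snapshot was found, which is the case in which a fresh archive request would still be needed.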

To do

  • Deal elegantly with images. Right now, they are also archived, and have to be removed manually afterwards.
  • Possibly: Throttle requests to the Internet Archive less. Right now, I'm sending a link roughly every 12 seconds, and then sleeping for a minute every 15 requests. This is probably too much throttling (the theoretical limit is 15 requests per minute), but I think that it does reduce the error rate.
  • Do the same thing but for html files, or other formats
  • Present to r/DataHoarder
  • Pull requests are welcome.
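
The throttling described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the script's actual loop: the function name throttle and the three variables are hypothetical, and their defaults simply mirror the numbers quoted above.

```shell
#!/usr/bin/env bash

# Hypothetical knobs; the defaults mirror the figures quoted in the list.
PAUSE_BETWEEN="${PAUSE_BETWEEN:-12}"  # seconds to wait after each request
BATCH_SIZE="${BATCH_SIZE:-15}"        # requests per batch
BATCH_PAUSE="${BATCH_PAUSE:-60}"      # extra seconds to wait after a batch

# Reads one URL per line on stdin and runs the given command on each,
# pausing after every request and pausing longer after every full batch.
throttle() {
  local handler="$1" count=0 url
  while IFS= read -r url; do
    [ -n "$url" ] || continue
    "$handler" "$url"
    count=$((count + 1))
    sleep "$PAUSE_BETWEEN"
    if [ $((count % BATCH_SIZE)) -eq 0 ]; then
      sleep "$BATCH_PAUSE"
    fi
  done
}

# Usage sketch, with echo standing in for the real archiving command:
# printf '%s\n' "https://example.com" "https://example.org" | throttle echo
```

Lowering PAUSE_BETWEEN or BATCH_PAUSE is the kind of change the second item in the list is considering.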

How to use to back up Google Files

You can download a .odt file from Google, and then convert it to a markdown file with

function odtToMd(){

  input="$1"
  root="$(echo "$input" | sed 's/\.odt$//' )"
  output="$root.md"

  pandoc -s "$input" -t markdown-raw_html-native_divs-native_spans-fenced_divs-bracketed_spans | awk ' /^$/ { print "\n"; } /./ { printf("%s ", $0); } END { print ""; } ' | sed -r 's/([0-9]+\.)/\n\1/g' | sed -r 's/\*\*(.*)\*\*/## \1/g'  | tr -s " " | sed -r 's/\\//g' | sed -r 's/\[\*/\[/g' | sed -r 's/\*\]/\]/g' > "$output"
  ## Explanation: 
  ## markdown-raw_html-native_divs-native_spans-fenced_divs-bracketed_spans: various flags to generate some markdown I like
  ## sed -r 's/\*\*(.*)\*\*/## \1/g': transform **Header** into ## Header
  ## sed -r 's/\\//g': Delete annoying "\"s
  ## awk ' /^$/ { print "\n"; } /./ { printf("%s ", $0); } END { print ""; } ': compress paragraphs; see https://unix.stackexchange.com/questions/6910/there-must-be-a-better-way-to-replace-single-newlines-only
  ## sed -r 's/([0-9]+\.)/\n\1/g': Makes lists nicer.
  ## tr -s " ": Replaces multiple spaces
}

## Use: odtToMd file.odt

Then run this tool (longnow file.md). Afterwards, convert the output file (file.longnow.md) back to html with

function mdToHTML(){
  input="$1"
  root="$(echo "$input" | sed 's/\.md$//' )"
  output="$root.html"
  pandoc -r gfm "$input" -o "$output"
  ## sed -i 's|\[ \]\(([^\)]*)\)| |g' "$input" ## This removes links around spaces, which are very annoying. See https://unix.stackexchange.com/questions/297686/non-greedy-match-with-sed-regex-emulate-perls
}

## Use: mdToHTML file.md

Then copy and paste the HTML into a Google doc and fix formatting mistakes.
