I generate four types of content (documents, photos, videos, audio) from many devices. Periodically, I need to organise the data on my laptop: renaming, deduplicating, and copying files around before loading them onto my NAS. The process is tedious, irreproducible, and error-prone.
A tool that reads from a "dump" folder and automatically sorts content to various drives on my home server:
- read data from a `dump` folder
- run a script
- find content organised by date/type on my server
Here we read from `./data/dump`, which contains some sample files. A migration plan is prepared and executed (copy + rename) to separate images, videos, audio, and archives into corresponding subdirectories under `./data/server`.
Output tree:

```
├── data
│   ├── server
│   │   ├── audio
│   │   ├── documents
│   │   │   └── 2020-05-28
│   │   │       ├── file-sample_100kB_20200528_075418.doc
│   │   │       ├── file-sample_100kB_20200528_075421.docx
│   │   │       ├── file_example_PPT_250kB_20200528_075527.ppt
│   │   │       ├── file_example_XLSX_10_20200528_075440.xlsx
│   │   │       └── file_example_XLS_10_20200528_075315.xls
│   │   ├── photo
│   │   │   └── 2020-05-28
│   │   │       ├── file_example_GIF_500kB_20200528_080038.gif
│   │   │       ├── file_example_JPG_100kB_20200528_075950.jpg
│   │   │       ├── file_example_MP3_700KB_20200528_074935.mp3
│   │   │       ├── file_example_OOG_1MG_20200528_075651.ogg
│   │   │       └── file_example_PNG_500kB_20200528_080012.png
│   │   └── video
│   │       └── 2020-05-28
│   │           ├── file_example_MP4_480_1_5MG_20200528_074720.mp4
│   │           └── file_example_WAV_1MG_20200528_074955.wav
```
CLI script:

```shell
python ./code/run.py --dump "./data/dump" --staging "./data/staging" --server "./data/server"
```
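The entry point could be wired up with a small `argparse` wrapper like the sketch below. This is an illustration only: the flag names are taken from the command above, but the actual structure of `run.py` may differ.

```python
import argparse


def parse_args(argv=None):
    # Flags mirror the CLI call shown above; help strings are illustrative.
    parser = argparse.ArgumentParser(
        description="Sort a dump folder into a date/type tree on the server"
    )
    parser.add_argument("--dump", required=True, help="source folder with unsorted files")
    parser.add_argument("--staging", default=None, help="optional intermediate folder")
    parser.add_argument("--server", required=True, help="final destination root")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(args.dump, args.staging, args.server)
```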
Reports
From arbitrarily nested directories and file types, I want to:
- list all files recursively
- timestamp files based on creation time (`_YYYY-MM-DD_H-M-S`)
- fuzzy-search time information in the file name
- deduplicate files (filename + extension + creation time)
- organise files in a tree based on file type and creation date (`YYYY-MM-DD`)
- load the new structure to my server
- handle errors gracefully
- stage files before loading to the final destination (optional)
- audit data migration plans
- provide building blocks to perform arbitrary migrations (including those within the NAS itself)
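The first two building blocks (recursive listing and deduplication) can be sketched with `pathlib`. This is a minimal sketch under my reading of the list above, not the project's actual code; the dedup key combines filename, extension, and creation time as stated:

```python
import os
from pathlib import Path


def list_files(root):
    # Recursively collect every regular file under root.
    return [p for p in Path(root).rglob("*") if p.is_file()]


def dedup_key(path):
    # Creation time approximated as min(ctime, mtime), as noted later in this README.
    st = os.stat(path)
    created = min(st.st_ctime, st.st_mtime)
    return (path.stem, path.suffix.lower(), int(created))


def deduplicate(paths):
    # Keep the first file seen for each (name, extension, creation time) key.
    seen, unique = set(), []
    for p in paths:
        key = dedup_key(p)
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique
```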
A migration is made of two function calls:

`plan()`

Implements a migration plan and includes these steps:

- Discover all files in `dump`
- Build a table to describe them (name, extension, time, size, type)
- Extend the table with information on where each file should be moved to (the migration table)
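The steps above might look like this sketch. Column names, the destination layout (`<server>/<type>/<YYYY-MM-DD>/<name>`), and the tiny extension map are my assumptions based on the output tree, not the repo's actual schema:

```python
import os
from datetime import datetime
from pathlib import Path

# Illustrative subset of the extension-to-type mapping.
TYPE_BY_EXT = {".jpg": "image", ".mp4": "video", ".mp3": "audio", ".zip": "archive"}


def plan(dump, server):
    rows = []
    for p in Path(dump).rglob("*"):
        if not p.is_file():
            continue
        st = os.stat(p)
        # Creation time approximated as min(ctime, mtime).
        created = datetime.fromtimestamp(min(st.st_ctime, st.st_mtime))
        ftype = TYPE_BY_EXT.get(p.suffix.lower(), "other")
        # Destination: <server>/<type>/<YYYY-MM-DD>/<name>
        dst = Path(server) / ftype / created.strftime("%Y-%m-%d") / p.name
        rows.append({
            "name": p.stem,
            "extension": p.suffix,
            "size": st.st_size,
            "type": ftype,
            "abspath_src": str(p.resolve()),
            "abspath_dst": str(dst),
        })
    return rows
```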
`execute()`

Takes the original absolute path of each file (`abspath_src`) and its new absolute path at the destination (`abspath_dst`). Each file is then moved or copied, replacing or skipping depending on the global settings:

- `replace`: if `True`, files with the same name are replaced
- `mode`: one of `['copy', 'move']`. Choose `'copy'` for idempotency.
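A minimal sketch of that behaviour with `shutil`, assuming the migration table rows carry `abspath_src` and `abspath_dst` keys as described (the real `execute()` may differ):

```python
import shutil
from pathlib import Path


def execute(migration_table, mode="copy", replace=False):
    # mode='copy' is idempotent and leaves the dump intact;
    # mode='move' empties the dump as it goes.
    assert mode in ("copy", "move")
    for row in migration_table:
        src = Path(row["abspath_src"])
        dst = Path(row["abspath_dst"])
        dst.parent.mkdir(parents=True, exist_ok=True)
        if dst.exists() and not replace:
            continue  # skip files already present at the destination
        if mode == "copy":
            shutil.copy2(src, dst)  # copy2 also preserves timestamps
        else:
            shutil.move(str(src), str(dst))
```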
File types are inferred from their extension (or suffix). I considered using `filetype`, but content sniffing is expensive when the `dump` directory is over the network, so I decided to assume that the file suffix reflects the true file type. Extensions are mapped by the `extensions_and_types()` function, which maps file_types ("image", "video", "audio", "archive") to file_extensions.
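A plausible shape for that mapping, with a helper that inverts it to look up a type from a suffix. The extension lists here are illustrative; the repo's real `extensions_and_types()` is likely longer:

```python
def extensions_and_types():
    # Map each file_type to the suffixes it claims; purely suffix-based,
    # with no content sniffing (cheap when the dump is over the network).
    return {
        "image": [".jpg", ".jpeg", ".png", ".gif"],
        "video": [".mp4", ".mov", ".avi"],
        "audio": [".mp3", ".ogg", ".wav", ".flac"],
        "archive": [".zip", ".tar", ".gz", ".7z"],
    }


def type_of(suffix):
    # Look up the file_type claiming this suffix; fall back to "other".
    for ftype, exts in extensions_and_types().items():
        if suffix.lower() in exts:
            return ftype
    return "other"
```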
The code is modular, and it's possible to perform custom migrations by altering the way plans are built. Inside `jobs` there are two simple scripts that demonstrate this.
- Locate the internal (LAN) IP address `<my.ip.addr.ess>` of your server (e.g. 192.168.1.110)
- From Finder, press `cmd + K` and type `afp://<my.ip.addr.ess>`. Replace `afp` with `smb` to use Samba.
- Authenticate with your username and password and you should see the volume mounted.
- For me these are under:
  - /Volumes/photo
  - /Volumes/video
  - /Volumes/audio
  - /Volumes/documents
- The function `has_time_info` assumes that "-" is used to separate datetime parts, while "_" is used as a general separator (for example, `FILE_20200101.jpg` is considered not to have time info).
- I assume the creation time is the minimum of ctime and mtime. Next, I might look at ways to read the creation time from the file metadata (with projects like `exif`).
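Under that convention, `has_time_info` and the creation-time heuristic might look like the sketch below. This is hypothetical code, not the repo's implementation, and it only recognises a dash-separated date part:

```python
import os
import re


def has_time_info(filename):
    # Split on "_" (the general separator) and look for a dash-separated
    # date part like 2020-01-01; a bare 20200101 token does not count.
    stem = os.path.splitext(filename)[0]
    return any(re.fullmatch(r"\d{4}-\d{2}-\d{2}", part) for part in stem.split("_"))


def creation_time(path):
    # Creation time approximated as the minimum of ctime and mtime.
    st = os.stat(path)
    return min(st.st_ctime, st.st_mtime)
```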
- Warn when `mode="copy"`, `dump` is over 5GB in size, and we are using a `staging` plan
- Add tests
Constructive feedback is welcome: open an issue or submit a PR :D