;doc: update command help
simonmichael committed Mar 25, 2024
1 parent be24d65 commit 70b75e4
Showing 1 changed file with 41 additions and 40 deletions.
81 changes: 41 additions & 40 deletions hledger/Hledger/Cli/Commands/Import.txt
@@ -21,48 +21,49 @@ hledger import bank.csv or perhaps hledger import *.csv.
Note you can import from any file format, though CSV files are the most
common import source, and these docs focus on that case.

Deduplication

import does time-based deduplication, to detect only the new
transactions since the last successful import. (This does not mean
"ignore transactions that look the same", but rather "ignore
transactions that have been seen before".) This is intended for when you
are periodically importing downloaded data, which may overlap with
previous downloads. Eg if every week (or every day) you download a
bank's last three months of CSV data, you can safely run
hledger import thebank.csv each time and only new transactions will be
imported.

Since the items being read (CSV records, eg) often do not come with
unique identifiers, hledger detects new transactions by date, assuming
that:
"Deduplication"

import tries to import only the transactions which are new since the
last import. So if your bank's CSV includes the last three months of
data, you can download and import it every month (or week, or day) and
only the new transactions will be imported each time.

It works as follows. For each imported FILE (usually a CSV file):

- It tries to find the latest date seen previously, by reading it from a
  hidden .latest.FILE in the same directory.
- Then it processes FILE, ignoring any transactions on or before the
  "latest seen" date.

And after a successful import, it updates the .latest.FILE(s) for next
time (unless --dry-run was used).

This is simple but fairly effective. It assumes:

1. new items always have the newest dates
2. item dates do not change across reads
3. and items with the same date remain in the same relative order
across reads.

These are often true of CSV files representing transactions, or true
enough so that it works pretty well in practice. 1 is important, but
violations of 2 and 3 amongst the old transactions won't matter (and if
you import often, the new transactions will be few, so less likely to be
the ones affected).

hledger remembers the latest date processed in each input file by saving
a hidden ".latest.FILE" file in FILE's directory (after a successful
import).

Eg when reading finance/bank.csv, it will look for and update the
finance/.latest.bank.csv state file. The format is simple: one or more
lines containing the same ISO-format date (YYYY-MM-DD), meaning "I have
processed transactions up to this date, and this many of them on that
date." Normally you won't see or manipulate these state files yourself.
But if needed, you can delete them to reset the state (making all
transactions "new"), or you can construct them to "catch up" to a
certain date.

Note deduplication (and updating of state files) can also be done by
print --new, but this is less often used.
2. item dates are stable across successive CSV downloads
3. the order of same-date items is stable across CSV downloads

These are true of most CSV files representing transactions, or true
enough. If you have a bank whose CSV dates or ordering occasionally
changes, you can reduce the chance of this happening in new transactions
by importing more often (and in old transactions it doesn't matter).

Note, import avoids reprocessing the same dates across successive runs,
but it does not detect transactions that are duplicated within a single
run. So eg if you downloaded but did not import bank.1.csv, and later
downloaded bank.2.csv with overlapping data, you should not import both
of them in a single run (hledger import bank.1.csv bank.2.csv); instead,
import them one at a time (hledger import bank.1.csv, then
hledger import bank.2.csv).

Normally you can ignore the .latest.* files, but if needed, you can
delete them (to make all transactions unseen), or construct/modify them
(to catch up to a certain date). The format is just a single ISO-format
date (YYYY-MM-DD), possibly repeated on multiple lines. It means "I have
seen transactions up to this date, and this many of them occurring on
that date".
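
For anyone constructing or inspecting a .latest.* file by hand, the format described above can be read and written with a few lines of Python. This is a hedged sketch: the helper names are hypothetical, and the handling of blank lines and mismatched dates is an assumption, not documented hledger behaviour.

```python
from datetime import date

def parse_latest(text):
    """Parse state file contents into (latest_date, count_seen_on_that_date)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return None, 0  # empty file: nothing seen yet (assumption)
    dates = {date.fromisoformat(ln) for ln in lines}
    if len(dates) != 1:
        raise ValueError("expected the same date on every line")
    return dates.pop(), len(lines)

def format_latest(latest, count):
    """Render the state: the ISO date repeated once per record seen on it."""
    return "".join(f"{latest.isoformat()}\n" for _ in range(count))

# "Catch up" to 2024-03-05, marking two records on that date as seen.
state_text = format_latest(date(2024, 3, 5), 2)
print(state_text, end="")
print(parse_latest(state_text))
```

Writing such text to finance/.latest.bank.csv would, under these assumptions, mark everything up to and including the first two records dated 2024-03-05 in finance/bank.csv as already seen.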

(hledger print --new also uses and updates these .latest.* files, but it
is not often used.)

Related: CSV > Working with CSV > Deduplicating, importing.
