;doc: update command help

simonmichael · Mar 25, 2024 · 70b75e4 · 70b75e4
1 parent be24d65
commit 70b75e4
Showing 1 changed file with 41 additions and 40 deletions.
diff --git a/hledger/Hledger/Cli/Commands/Import.txt b/hledger/Hledger/Cli/Commands/Import.txt
@@ -21,48 +21,49 @@ hledger import bank.csv or perhaps hledger import *.csv.
 Note you can import from any file format, though CSV files are the most
 common import source, and these docs focus on that case.
 
-Deduplication
-
-import does time-based deduplication, to detect only the new
-transactions since the last successful import. (This does not mean
-"ignore transactions that look the same", but rather "ignore
-transactions that have been seen before".) This is intended for when you
-are periodically importing downloaded data, which may overlap with
-previous downloads. Eg if every week (or every day) you download a
-bank's last three months of CSV data, you can safely run
-hledger import thebank.csv each time and only new transactions will be
-imported.
-
-Since the items being read (CSV records, eg) often do not come with
-unique identifiers, hledger detects new transactions by date, assuming
-that:
+"Deduplication"
+
+import tries to import only the transactions which are new since the
+last import. So if your bank's CSV includes the last three months of
+data, you can download and import it every month (or week, or day) and
+only the new transactions will be imported each time.
+
+It works as follows. For each imported FILE (usually a CSV file): - It
+tries to find the latest date seen previously, by reading it from a
+hidden .latest.FILE in the same directory. - Then it processes FILE,
+ignoring any transactions on or before the "latest seen" date.
+
+And after a successful import, it updates the .latest.FILE(s) for next
+time (unless --dry-run was used).
+
+This is simple but fairly effective. It assumes:
 
 1.  new items always have the newest dates
-2.  item dates do not change across reads
-3.  and items with the same date remain in the same relative order
-    across reads.
-
-These are often true of CSV files representing transactions, or true
-enough so that it works pretty well in practice. 1 is important, but
-violations of 2 and 3 amongst the old transactions won't matter (and if
-you import often, the new transactions will be few, so less likely to be
-the ones affected).
-
-hledger remembers the latest date processed in each input file by saving
-a hidden ".latest.FILE" file in FILE's directory (after a succesful
-import).
-
-Eg when reading finance/bank.csv, it will look for and update the
-finance/.latest.bank.csv state file. The format is simple: one or more
-lines containing the same ISO-format date (YYYY-MM-DD), meaning "I have
-processed transactions up to this date, and this many of them on that
-date." Normally you won't see or manipulate these state files yourself.
-But if needed, you can delete them to reset the state (making all
-transactions "new"), or you can construct them to "catch up" to a
-certain date.
-
-Note deduplication (and updating of state files) can also be done by
-print --new, but this is less often used.
+2.  item dates are stable across successive CSV downloads
+3.  the order of same-date items is stable across CSV downloads
+
+These are true of most CSV files representing transactions, or true
+enough. If you have a bank whose CSV dates or ordering occasionally
+changes, you can reduce the chance of this happening in new transactions
+by importing more often (and in old transactions it doesn't matter).
+
+Note, import avoids reprocessing the same dates across successive runs,
+but it does not detect transactions that are duplicated within a single
+run. So eg if you downloaded but did not import bank.1.csv, and later
+downloaded bank.2.csv with overlapping data, you should not import both
+of them in a single run (hledger import bank.1.csv bank.2.csv); instead,
+import them one at a time (hledger import bank.1.csv, then
+hledger import bank.2.csv).
+
+Normally you can ignore the .latest.* files, but if needed, you can
+delete them (to make all transactions unseen), or construct/modify them
+(to catch up to a certain date). The format is just a single ISO-format
+date (YYYY-MM-DD), possibly repeated on multiple lines. It means "I have
+seen transactions up to this date, and this many of them occurring on
+that date".
+
+(hledger print --new also uses and updates these .latest.* files, but it
+is not often used.)
 
 Related: CSV > Working with CSV > Deduplicating, importing.