<meta charset="UTF-8">
<h2 align="center"> Probability Theory: The Logic of Science - a biased review.</h2>
<p align="right" >
<i>P(A|B) = [P(A)*P(B|A)]/P(B), all the rest is commentary.</i> <br>-
Astral Codex Ten tagline.
</p>
<!--
<p>
If I told you that over the course of last year I have read Hartshorne's graduate math textbook
"<a href="https://en.wikipedia.org/wiki/Algebraic_Geometry_(book)">Algebraic Geometry</a>" and did all the exercises in it,
you would probably assume I have learned some mathematics.
</p>
<p>
If I told you that over the course of last year I have read
Thích Nhất Hạnh's "<a href="https://en.wikipedia.org/wiki/The_Miracle_of_Mindfulness">The Miracle of Mindfulness</a>",
and practiced all the techniques in it, you could deduce that I have developed some applicable skills stemming from a metaphysical doctrine.
</p>
<p>
In reality I did neither of these things.
</p>
<p>
I did, however, read E. T. Jaynes's
"<a href="https://www.cambridge.org/gb/academic/subjects/physics/theoretical-physics-and-mathematical-physics/probability-theory-logic-science">Probability Theory: The Logic of Science</a>"
(PT:TLoS from here on)
and <a href="https://github.com/jezgillen/JaynesProbabilityTheory"> solved (most of) the exercises</a>.
</p>
<p>
If you
<a href="https://www.lesswrong.com/posts/kXSETKZ3X9oidMozA/the-level-above-mine">have</a>
<a href="https://intelligence.org/research-guide/">heard</a>
<a href="https://statmodeling.stat.columbia.edu/2007/09/13/jaynes_is_no_gu/">anything</a>
about this book, you may have expected that
I have learned some mathematics, and developed some applicable skills stemming from a metaphysical doctrine.
In reality, I have mostly learned that in the 20th century disputes concerning probability and statistics,
physicists whose last names start with J were (almost) always right, and everyone else was almost always wrong.
Ok, fine, I did learn some math and got a fair bit of what can be called metaphysical indoctrination as well.
But the math was mostly learned in pursuit of understanding of an offhand remark or a solution to an exercise,
often by following the crumb trail of hints and references left by Jaynes in the text,
and only occasionally from the text itself. As for the metaphysical indoctrination, well, there was some of that, but
one does not simply join the
<a href="https://www.lesswrong.com/posts/fnEWQAYxcRnaYBqaZ/initiation-ceremony">Bayesian Conspiracy</a>
by reading a 700+ page book.
One must read a <a href="https://intelligence.org/rationality-ai-zombies/">1600+ page book</a> at least!
</p>
-->
<h3> On the origin of PT:TLoS.</h3>
<p>
Edwin Thompson (i.e. E.T.) Jaynes was a Ph.D student of Eugene Wigner, the
<a href="https://en.wikipedia.org/wiki/The_Unreasonable_Effectiveness_of_Mathematics_in_the_Natural_Sciences">unreasonably effective</a>
Nobel laureate physicist. Wigner is reported to have later characterized Jaynes as
"one of the two most under-appreciated people in physics."
Jaynes's PhD thesis was on ferroelectricity,
and apart from contributions to probability and statistical mechanics, he is perhaps most known for his
<a href="https://en.wikipedia.org/wiki/Jaynes%E2%80%93Cummings_model">work</a> in quantum optics.
</p>
<p>
Jaynes defended his PhD at Princeton in 1950, and then moved to Stanford.
He did what one is supposed to do there:
invested in a <a href="https://en.wikipedia.org/wiki/Varian_Associates">Palo Alto tech startup</a>.
Since his first field of research could be called "applied classical electrodynamics",
he also consulted for them, calculating behaviour of electrons in cavity resonators
and working on magnetic resonance. Apparently this led to him buying a fairly large
house -- though this was the 1950s, when
<a href="https://slatestarcodex.com/2019/07/23/book-review-the-electric-kool-aid-acid-test/">normal people</a>
could have houses in Palo Alto.
He bought an even larger house when he moved to Washington University in St. Louis in 1960.
</p>
<p>
In 1957 Jaynes published
<a href="https://bayes.wustl.edu/etj/articles/theory.1.pdf">two</a>
<a href="https://bayes.wustl.edu/etj/articles/theory.2.pdf">papers</a>
on "Information Theory and Statistical Mechanics"
concerned with formulating (Gibbs's picture of) statistical mechanics in terms of
information theory, first for classical and second for quantum systems.
At about the same time he delivered a series of lectures on
"Probability theory in science and engineering" at the Field Research Laboratory
of the Mobil oil company.
The <a href="https://bayes.wustl.edu/etj/articles/mobil.pdf">published version</a>
of 5 of these lectures is the first draft of the PT:TLoS. It includes
a now-extinct section on the Gibbs model and one titled "why does statistical mechanics work?",
as well as (much) briefer versions of chapters 1, 2, 4, 5, 6, 11, and 18 of PT:TLoS,
for a total of about 200 typed pages overall.
It also contains a "historical introduction" explaining
"how it could happen that a person who is a rather strange
mixture of two thirds theoretical physicist and one-third
electrical engineer could <s>grow up to be a hero and a scholar</s> get really worried about the foundations of probability theory".
The answer, of course, is by "trying to understand what statistical mechanics
is all about and how it is related to communication theory".
I'd say that's a struggle that still goes on for many of us!
</p>
<p>
Jaynes says that "in the years 1957–1970 the lectures were repeated, with steadily increasing content, at
many other universities and research laboratories." In 1974 some of this steadily increasing content
was assembled into a 446 page "fragmentary edition" entitled
"Probability theory with applications in science and engineering"
with a stated goal of eventually having "approximately 30 Lectures" in the project.
It now also included some of what will become chapters 10, 13, 19 and 22 of PT:TLoS,
as well as a chapter on irreversible statistical mechanics.
One can also find in it what is, in hindsight, a rather genteel "word of explanation and apology to mathematicians
who may happen on this book not written for them",
excusing the absence of measure-theoretic notions. Jaynes says that he "is not opposed" to them and
"will gladly use and teach them as soon as" he finds "one specific
real application where they are needed".
In the PT:TLoS the rejection of the
modern mathematical toolkit continues unabated (arguably with some detrimental effects,
more on this later), but any tone of apology is gone.
What a difference 24 little years made.
</p>
<p>
The magnum opus itself was woefully unfinished at the time of Jaynes's death in 1998.
The manuscript was massaged into a book shape by Jaynes's former graduate student Larry Bretthorst,
resulting in the 727 page commentary on Bayes's theorem that we are now reviewing.
</p>
<h3> What Jaynes taught.</h3>
<p>
While no <a href="https://www.scottaaronson.com/blog/?p=277">Australian fashion models</a>
seem to be available to distill the core idea of PT:TLoS into a single passage,
we can get something reasonably close from Jaynes himself.
Right from the start, he declares: "Our topic is the optimal processing of incomplete information",
and the focus is on producing "quantitative rules for conducting inference".
Note that while other frameworks might
<a href="https://slatestarcodex.com/2014/09/01/book-review-and-highlights-quantum-computing-since-democritus/">learn hypotheses</a>
<a href="https://en.wikipedia.org/wiki/Probably_approximately_correct_learning">consistent with data</a>,
Jaynes is after not just "good enough" processing, but an "optimal" one.
Of course, the "quantitative rules" mentioned turn out to be those of
"probability theory and all of its conventional mathematics, but
now viewed in a wider context than that of the standard textbooks."
This is the essential content of "Cox's theorems" and Jaynes spends the first chapter fleshing out
more precisely what the "quantitative rules for conducting inference" are and what they should look like,
and the second one re-proving Cox's results (i.e. that only probabilities allow us to do inference the way we would like).
</p>
<p>
With this first (but by no means last) tussle with foundations out of the way,
Jaynes proceeds to develop some of the
math needed for basic applications in "direct" and "inverse" probability.
Here, by "basic applications" I mean counting balls in urns.
(Lest you find this boring, let me remind you that counting things in urns
is not only a centuries-old pastime of probability theorists, but is
<a href="https://en.wikipedia.org/wiki/Attempts_to_overturn_the_2020_United_States_presidential_election">essential for the functioning of any democratic society</a>.)
And by "direct probability" I mean things like:
if there are a hundred red and a hundred blue <s>ballots</s> balls in an urn and you draw 10 "at random",
what is the probability that they are all red? That 9 of them are red and 1 is blue? Et cetera.
This is "sampling theory" and is covered in chapter 3,
with the question of what "at random" means getting some love in section 3.8.1.
"Inverse probability", on the other hand, is the old-school name for the more interesting kind of question:
suppose you draw 10 balls at random from an urn containing 200 balls,
and all 10 are red (this is your "data").
How likely is it that there were 0 red balls in the urn? How about 1 red ball? How about 100?
Here of course the answer depends on what we thought about the number of red balls in the urn before doing the drawing
-- if we have looked in the urn just before and counted the balls directly,
the drawing itself is unlikely to change our opinion about this "prior" count.
This innocuous observation is the fact that launched a thousand ships,
for this <b>prior</b> is the one missing ingredient after which Bayes's theorem
- yes, the P(A|B) = [P(A)*P(B|A)]/P(B) - finishes the job.
Jaynes lists 4 "principles" for obtaining this missing ingredient
(you know it's bad when there is more than one, and more than two is real trouble),
postpones further discussion to later chapters and proceeds to develop "inverse probability"
- aka hypothesis testing - assuming the prior is known somehow.
Along the way, we get introduced to measuring information (or "evidence")
provided by the data in decibels (which I believe Jaynes invented independently of the equivalent
"<a href="https://en.wikipedia.org/wiki/Hartley_(unit)">decibans</a>" of Turing and Good) in chapter 4.2,
and learn how to do multiple hypothesis testing in chapter <s>86</s> 4.4.
</p>
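<p>
To make the urn calculation concrete, here is a minimal sketch of the "inverse probability" computation in Python. The uniform prior over the red-ball count, and the particular pair of hypotheses compared in decibels, are my illustrative choices, not examples worked in the book:
</p>

```python
from math import comb, log10

N, n = 200, 10   # balls in the urn; balls drawn, all observed red

def likelihood(R):
    # Hypergeometric probability of drawing n red (and 0 blue) balls
    # "at random" without replacement, given R red balls among the N.
    return comb(R, n) * comb(N - R, 0) / comb(N, n)

# Uniform prior over R = 0..N, updated by Bayes's theorem.
prior = [1 / (N + 1)] * (N + 1)
posterior = [prior[R] * likelihood(R) for R in range(N + 1)]
Z = sum(posterior)
posterior = [p / Z for p in posterior]

# Evidence (in decibels, chapter 4 style) for "R = 150" over "R = 100":
evidence_db = 10 * log10(likelihood(150) / likelihood(100))
```

<p>
As one would hope, the posterior vanishes for R &lt; 10 (we did see 10 red balls) and climbs monotonically toward R = N.
</p>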
<p>
With all this hard work out of the way, we get to "queer uses of probability theory"
<s>also known as the seeds of CFAR curriculum</s>. While non-technical,
this chapter explains how to reason "in a Bayesian way" about telepathy,
why the same evidence presented to different people may make their opinions diverge further,
how the Bayesian nature of visual perception may explain optical illusions,
how not to weigh evidence in court, and other useful things like that.
"It's the priors, stupid" -- for the most part; yet the details are entertaining and sometimes illuminating.
</p>
<p>
By chapter 6 the break is over, and we return to our urns.
Amid some rather mundane calculations, some inspiring things happen.
Under the rubric of "effects of qualitative prior information" - of the type of knowing "who does what to whom" -
Jaynes introduces what we now can recognise as rudimentary probabilistic graphical models.
The question of the choice of a prior returns briefly, only to be postponed again.
For the most part it is a continuation of what has gone on before.
</p>
<p>
Chapter 7, dedicated to Gaussian distribution, is a change of pace.
While mathematically interesting, at first blush it may seem purely technical.
Yet there is a key question behind it: why is the Gaussian distribution so ubiquitous?
Of course, mathematical reality being what it is,
all good explanations are connected to each other;
but the side from which one approaches the network of explanations matters both philosophically,
and in terms of what further ideas it generates.
Here, as in many other situations, Jaynes has a favorite side.</p>
<p>
A "standard" answer is commonly taught: if a number we are considering is a
sum of many (sufficiently) independent random "pieces" the result will be approximately Gaussian.
Since many things have multiple "small causes", this is a common situation.
Mathematically, this is expressed as the
<a href="https://en.wikipedia.org/wiki/Central_limit_theorem">Central limit theorem</a>.
A mechanism that makes this work also explains why Gaussian distribution is connected
to least squares fitting of linear models, and, more generally,
illuminates why mean and variance are the only things that matter in a Gaussian distribution.
Thus Jaynes's favorite explanation is reached:
Gaussian distribution is the one we would obtain if we agree that we know some random number's mean and variance,
and nothing else.
It is the distribution of <b>maximal entropy</b> subject to that knowledge,
the one expressing total ignorance beyond those two values.
Thus, out of a technical-sounding question in a technical-looking chapter a major technical theme is born:
if you know something, and want to get a prior reflecting that knowledge and nothing else,
look for a maximal entropy distribution compatible with this knowledge.
What this means mathematically, and how to find the maximal entropy distribution
(at least for "finite" situations) is explored in chapter 11.
(This is also where the seams start to show:
while producing the Gaussian distribution as a maximal entropy one is easy
once the material in chapter 11 is absorbed,
as far as I can tell Jaynes never actually gets around to doing it.
Chapter 11 is in part II, where the completeness of the text begins to decline.)
</p>
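<p>
The central-limit half of this story is easy to watch happen numerically. A quick sketch, where the sample sizes and the Uniform(-1, 1) "pieces" are arbitrary choices of mine:
</p>

```python
import random
import statistics

random.seed(1)

N_TERMS = 100      # independent "small causes" per observation
N_SAMPLES = 5000

# Each observation is a sum of many independent Uniform(-1, 1) pieces;
# by the central limit theorem the sums should be approximately Gaussian
# with mean 0 and variance N_TERMS * Var(Uniform(-1, 1)) = N_TERMS / 3.
samples = [sum(random.uniform(-1.0, 1.0) for _ in range(N_TERMS))
           for _ in range(N_SAMPLES)]

mean = statistics.fmean(samples)
var = statistics.pvariance(samples)
sigma = var ** 0.5
# For a Gaussian, about 68.3% of observations fall within one sigma of the mean.
frac_within_sigma = sum(abs(x - mean) < sigma for x in samples) / N_SAMPLES
```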
<p>
Maximal entropy is one of the four methods for finding priors that Jaynes mentioned back in chapter 4,
the one most closely associated with Jaynes himself.
Another one is "group invariance" (more properly, "equivariance"), explored in Chapter 12 of PT:TLoS.
The name hides a simple idea and a surprising complication.
The idea is simple indeed: if your setting is unchanged by some modification (expanding some object by a factor of 2, for example)
- and this includes your state of knowledge
(if I don't know anything about the length of something then
I don't know anything about twice its length
<i>and I think my ignorance should be expressed the same way mathematically</i>) -
then your prior should be unchanged by this modification.
It turns out that in many situations this suffices to mostly determine the prior
(for the case of a length - also known as "scale" - parameter, the prior probability density
at length L is then proportional to 1/L). The surprising complication is
that often this is not enough. For simple examples like "scale" above the complication does not arise,
but for the case of determining "scale" and "location" simultaneously it does,
and Jaynes gets it wrong. The analysis hinges on the difference between something called
"right invariant (Haar) measure" and "left invariant (Haar) measure"
(the "correct" one to use, as explained, for example,
in the <a href="https://www.springer.com/gp/book/9780387960982">book</a> of Berger
(to which, by the way, Jaynes refers elsewhere in PT:TLoS)
is the right one).
In his generally very positive and friendly <a href="https://archive.siam.org/news/news.php?id=81">review</a>
Stanford statistician Persi Diaconis mentions that Jaynes has been accused of "not knowing his left from his right Haar measure".
In fact, in PT:TLoS Jaynes seems wholly oblivious to the issue in the first place.
His language is sufficiently imprecise to be confusing rather than enlightening
-- which is doubly strange since the explanations in Berger's book are considerably clearer.
</p>
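<p>
The invariance idea for the "scale" case can be checked in a couple of lines: the improper prior π(L) ∝ 1/L assigns to an interval a mass that depends only on the ratio of its endpoints, so rescaling all lengths by a constant (a change of units, say) leaves every such statement of ignorance unchanged. A small sketch, with an arbitrary rescaling factor of my choosing:
</p>

```python
import math

def mass(a, b):
    # Unnormalized mass assigned to the interval [a, b] by the improper
    # scale prior pi(L) = 1/L: the integral of dL/L from a to b.
    return math.log(b / a)

k = 7.3   # arbitrary rescaling factor, e.g. a change of units
original = mass(1.0, 3.0)
rescaled = mass(k * 1.0, k * 3.0)   # the same, up to floating-point rounding
```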
<p>
All of this "inference" business is about what to think, but who cares about that.
We want to know what to do! Thus, we need decision theory.
The shift in focus from inference to decision gives an occasion for some discoursing
on British vs. American priorities in life --
which is particularly amusing given that the main credit for decision theory goes to the
<a href="https://slatestarcodex.com/2017/05/26/the-atomic-bomb-considered-as-hungarian-high-school-science-fair-project/">Hungarian</a>
mathematician <a href="https://en.wikipedia.org/wiki/Abraham_Wald">Abraham Wald</a>,
of the "<a href="https://en.wikipedia.org/wiki/Survivorship_bias#In_the_military">it's the missing bullet hole locations that you need to worry about</a>" fame.
(Wald's dramatic life story is second perhaps only to that of <a href="https://en.wikipedia.org/wiki/Alexander_Grothendieck">Alexander Grothendieck</a> in its Hollywood potential.)
Wald's decision theory proceeds by assigning to each possible action (say: buy, sell) some utility,
dependent on the "true state of nature" (say, the price tomorrow).
The recommended action is then the one that maximizes the expected utility, "expected" meaning average over your beliefs about the true state of nature
(i.e. tomorrow's price). That is, ignoring transaction costs:
buy if the expected utility of tomorrow's price is higher than utility of today's price, and sell otherwise.
(Of course, economists, being naturally dismal, talk about minimizing loss - or cost -
rather than maximizing utility.)
</p>
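<p>
In code, Wald's recipe is short once the beliefs and the utilities are written out. A toy sketch, where the price distribution and the linear utilities are invented purely for illustration:
</p>

```python
# Beliefs about tomorrow's price (a made-up distribution); today's price known.
beliefs = {90.0: 0.2, 100.0: 0.5, 120.0: 0.3}
today = 100.0

def expected_utility(action):
    # Linear utilities, ignoring transaction costs: "buy" pays off the
    # price change in our favor, "sell" pays off its negative.
    if action == "buy":
        return sum(p * (price - today) for price, p in beliefs.items())
    return sum(p * (today - price) for price, p in beliefs.items())

# The recommended action maximizes expected utility over our beliefs.
recommended = max(["buy", "sell"], key=expected_utility)
```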
<p>
This may sound trivial, but that's because we are already talking in the language of "beliefs about the true state of nature" --
what a statistician may call "distribution of the model parameter",
something which is not really allowed in "orthodox" or "frequentist" approach to statistics.
Instead, a frequentist might be concerned with a "decision procedure" or "strategy" based on some data,
i.e. some process that takes in data and spits out the action to take.
This procedure should not be too wild, and what "not too wild" means is formalized by Wald
and is given the name "admissible" (a term which Jaynes seems to interpret as "good" and proceeds to rail against,
by providing some not-so-good admissible strategies; I think simply interpreting "admissible" as "not obviously stupid"
would've ameliorated that particular pet peeve).
Then the triumph of Bayesianism is at hand: many years after starting the study of "admissible strategies",
Wald proved that they are all equivalent to starting with
some prior "beliefs about the true state of nature", updating them based on the data - via Bayes's theorem, of course - and then applying the "obvious" rule above.
Moreover, in the case where the "decision" is actually "estimating a parameter", by varying your utility/loss function and applying the above strategy,
one may recover such estimators as "take the posterior maximum" (of which "classical" maximum likelihood is a special case), or "take the posterior mean".
Jaynes rightly points out that the shape of loss function can change the decision quite drastically:
in deciding between cutting your hair too short or too long, one type of error is much less costly than another;
the cost of various errors in a "William Tell-type scenario" is even further from the usual models.
</p>
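<p>
The point about loss functions is easy to demonstrate: minimizing expected loss over the same posterior picks out different estimates depending on the loss's shape. A sketch with a made-up discrete posterior of my own (squared loss recovers the posterior mean, absolute loss the posterior median, and an asymmetric "haircut" loss, where cutting too short is ten times worse than leaving it too long, pushes the decision far toward "too long"):
</p>

```python
# A made-up discrete posterior over a parameter theta.
posterior = {1.0: 0.1, 2.0: 0.2, 3.0: 0.4, 4.0: 0.2, 10.0: 0.1}

def expected_loss(estimate, loss):
    return sum(p * loss(estimate, theta) for theta, p in posterior.items())

candidates = [x / 10 for x in range(0, 120)]   # grid of possible estimates

def best_estimate(loss):
    return min(candidates, key=lambda e: expected_loss(e, loss))

mean_est = best_estimate(lambda e, t: (e - t) ** 2)    # the posterior mean
median_est = best_estimate(lambda e, t: abs(e - t))    # the posterior median
# "Haircut" loss: leaving hair too long (e > t) costs (e - t);
# cutting it too short (e < t) costs ten times as much.
haircut_est = best_estimate(lambda e, t: 10 * (t - e) if e < t else e - t)
```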
<p>
With this - essentially final - layer of theory, we are ready for some applications,
first in distinguishing a signal from the noise - and Jaynes does mean "signal" -
an electrical one, in volts (it is probably that one-third electrical engineer speaking),
and then in deciding what widgets to produce in our widget factory.
While the first, simpler, task is arguably
<a href="https://en.wikipedia.org/wiki/1983_Soviet_nuclear_false_alarm_incident">more important</a>,
it is the latter that is more revealing of both Jaynes's process and its flaws: the analysis is fine - great even -
when taken on its own, but there are no sanity checks, no robustness analysis.
If I actually had a widget factory, I would probably assign a rather low weight to the whole thing,
at least before hiring someone to vary the model and see how it flexes.
</p>
<p>
Among the issues that remain is the following.
Imagine I have a coin. I may say that the probability that it will land heads on the next toss is half,
but this is far from capturing all my beliefs about the coin.
Perhaps I have personally forged the coin to the most exacting specifications, or perhaps I have never seen it before in my life.
Now, imagine I see it be tossed and come up heads 10 times in a row.
What would my prediction of the next toss be now? In the first case, still pretty close to 50-50
(perhaps my manufacturing process was flawed, or perhaps I should just hedge against <a href="https://slatestarcodex.com/2015/08/20/on-overconfidence/">being overconfident</a>).
In the second case I might start to suspect that the coin is not fair, and adjust my forecast accordingly.
The question before us is how to account for this difference. Jaynes takes this up in chapter 18,
and essentially invents a two-level
<a href="https://en.wikipedia.org/wiki/Bayesian_hierarchical_modeling#Hierarchical_models">hierarchical Bayesian model</a>.
Roughly, if I record my beliefs about the coin in a probability "density" I assign to various statements
A_p = (the coin is biased to flip heads with probability p), then I can update this density based on the observed results of flipping the coin.
The difference in the two scenarios above is in the initial shape of the distribution for A_p - the "I forged this coin" initial distribution has a high peak near p=0.5,
while the "this is just some coin" one is more spread out
(incidentally, if our initial distribution of beliefs about A_p
is in the <a href="https://en.wikipedia.org/wiki/Beta_distribution">Beta family</a>,
then this is particularly <a href="https://en.wikipedia.org/wiki/Conjugate_prior">easy to do</a>, which is what makes the section 18.5 work out).
One thing to note, however, is that we are now talking about something like "probabilities of probabilities",
and this is not what we discussed when we were talking about the whole "extension of logic" business.
In fact, I agree with <a href="https://meaningness.com/probability-and-logic">the contention</a>
that "logic" in "the logic of science" is to be initially understood as "propositional calculus"
(and that this is the setting of Cox's theorems), with this hierarchical extension playing the role of "Aristotelian logic".
The question of probabilistic extensions of predicate (and higher-order) logic seems to be the subject of some current research.
Whether this has bearing on the question of Bayesianism being "a complete theory of rationality"
is slightly too philosophical for my usual tastes.
</p>
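<p>
The conjugate Beta-Binomial version of this two-coin story fits in a few lines. A sketch, where the specific Beta parameters for the two states of knowledge are my illustrative choices (and the A_p densities of section 18.5 need not be Beta at all):
</p>

```python
# Beliefs about p = P(heads), encoded as Beta(a, b) densities over [0, 1].
forged = (500.0, 500.0)    # "I forged this coin": sharply peaked near 1/2
mystery = (1.0, 1.0)       # "never seen it before": uniform over [0, 1]

def update(belief, heads, tails):
    # Conjugacy makes Bayes's rule a matter of adding the observed counts.
    a, b = belief
    return (a + heads, b + tails)

def predict_heads(belief):
    # Posterior predictive probability that the next toss lands heads:
    # the mean a / (a + b) of the Beta(a, b) density.
    a, b = belief
    return a / (a + b)

# Both observers watch the coin come up heads 10 times in a row.
p_forged = predict_heads(update(forged, 10, 0))    # barely moves from 1/2
p_mystery = predict_heads(update(mystery, 10, 0))  # now strongly favors heads
```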
<p> All of this is no doubt very thrilling (I mean, we are "only" solving the question of how one
should reason about - and act in!- the world;
we call it "inference" just to keep the excitement down and keep philosopher-logicians off our back). But it is
not nearly as much fun as the numerous polemical tirades against "the orthodoxy",
be it of the Fisher, Pearson, or Feller patriarchate. </p>
<h3> À la recherche du temps perdu.</h3>
<p>Chapters 8, 16 and 17 give some account of - and Jaynesian commentary on - classical statistics.
These were not there in the earlier drafts, which were more focused
on expounding Jaynes's theories and less on criticizing "the orthodoxy".
Perhaps this was also due to the ongoing nature of the polemic at the time.
In PT:TLoS the gloves are mostly off.
"Orthodox" statistics is described in terms of its "pathology" and "folly".
Jaynes's main charge is that the methods are "ad hoc" - a phrase that appears 47 times in PT:TLoS.
Coming from a work whose chief aim is to develop systematic rules of inference,
this is probably not surprising. </p>
<p>
If one were to pick out a single antagonist in the PT:TLoS it would have to be Sir Ronald Aylmer Fisher.
One could say that Fisher was a geneticist and a statistician. Or, one could say that he was
"the greatest of Darwin’s successors" and "the single most important figure in 20th century statistics".
Bradley Efron (another Stanford statistician) <a href="https://projecteuclid.org/journals/statistical-science/volume-13/issue-2/R-A-Fisher-in-the-21st-century-Invited-paper-presented/10.1214/ss/1028905930.full">writes</a> that "one
difficulty in assessing the importance of Fisherian
statistics is that it’s hard to say just what it is.
Fisher had an amazing number of important ideas
and some of them, like randomization inference and
conditionality, are contradictory. It’s a little as if in
economics Marx, Adam Smith and Keynes turned
out to be the same person."
</p>
<p>
Among many charges Jaynes lays at Fisher is that of establishing statistics as a collection
of (ad hoc!) recipes for analyzing data. In Jaynes's view Fisher's cookbooks (primarily "<a href="https://en.wikipedia.org/wiki/Statistical_Methods_for_Research_Workers">Statistical Methods for Research Workers
</a>", but also <a href="https://en.wikipedia.org/wiki/The_Design_of_Experiments">The Design of Experiments
</a>) established the situation in which a scientist was to follow the recipes,
but was not to question the reasoning behind these recipes.
</p>
<p>
Then, as per Jaynes:
</p>
<p>
"Whenever a real scientific problem arose that was not covered by the published recipes,
the scientist was expected to consult a professional statistician for advice on how to analyze
his data, and often on how to gather them as well. There developed a statistician–client
relationship rather like the doctor–patient one, and for the same reason. If there are simple
unifying principles (as there are today in the theory we are expounding), then it is easy to
learn them and apply them to whatever problem one has; each scientist can become his own
statistician. But in the absence of unifying principles, the collection of all the empirical,
logically unrelated procedures that a data analyst might need, like the collection of all the
logically unrelated medicines and treatments that a sick patient might need, was too large
for anyone but a dedicated professional to learn."
</p>
<p>
Jaynes's statement that "deep change in the sociology of
science – the relationship between scientist and statistician – is now underway" and that
"each scientist involved in data analysis can be his own
statistician" seems premature. My impression is that basic courses in "applied statistics"
are routinely
taught without even attempting to impart much conceptual understanding, and for many scientists
doing their own statistics is still dangerously close to rolling their own crypto.
</p>
<p>
Be that as it may, hardly anyone can be against getting scientists
to understand the statistics they are practicing. According to Jaynes,
one of the earliest attempts to do this was the 1939 "Theory of Probability" by (future Sir) Harold Jeffreys.
</p>
<p>
This book is perhaps the most direct prior influence on Jaynes and on PT:TLoS -
which, after all, is "dedicated to the memory of
Sir Harold Jeffreys, who saw the truth and preserved it."
</p>
<p>
In Jaynes's telling, Jeffreys "was buried under an avalanche of
criticism which simply ignored his mathematical demonstrations and substantive results
and attacked his ideology".
</p>
<p>
Jaynes writes:
</p>
<p>
"We need to recognize that a large part of their differences arose from the fact that
Fisher and Jeffreys were occupied with very different problems. Fisher studied biological
problems, where one had no prior information and no guiding theory (this was long before
the days of the DNA helix), and the data taking was very much like drawing from Bernoulli’s
urn. Jeffreys studied problems of geophysics, where one had a great deal of cogent prior
information and a highly developed guiding theory (all of Newtonian mechanics giving the
theory of elasticity and seismic wave propagation, plus the principles of physical chemistry
and thermodynamics), and the data taking procedure had no resemblance to drawing from
an urn. Fisher, in his cookbook defines statistics as the study of populations;
Jeffreys devotes virtually all of his analysis to problems of inference where there is no
population."
</p>
<p>
But just in case you had any doubt whose side he is on, Jaynes then adds:
</p>
<p>
"What Fisher was never able to see is that, from Jeffreys’ viewpoint, Fisher’s biological
problems were trivial, both mathematically and conceptually."
</p>
<p>
Them are fightin words!
</p>
<p>
Incidentally, Jaynes credits Fisher with having a "deep intuitive multidimensional space
intuition", which allowed him to calculate many sampling distributions for the first time,
but points out "that, just before
starting to produce those results, Fisher spent a year (1912–1913) as assistant to the theoretical
physicist Sir James Jeans, who was then preparing the second edition of his book
on kinetic theory and worked daily on calculations with high-dimensional multivariate
Gaussian distributions".
Yes, even these stem from a physicist whose last name starts with J!
</p>
<p>
A secondary antagonist is <a href="https://en.wikipedia.org/wiki/William_Feller">William Feller</a>,
the author of "the most successful treatise on probability ever written".
He is also accused by Jaynes of being too clever - and thus being able to get
away with not doing things systematically. According to Jaynes, "his readers get the impression that: (1)
probability theory has no systematic methods; it is a collection of isolated, unrelated clever tricks,
each of which works on one problem but not on the next one; (2) Feller was possessed of superhuman cleverness;
(3) only a person with such cleverness can hope to find new useful results in probability theory" -
with the unstated implication that we should doubt all three. As an illustration of "clever tricks" Jaynes
chooses the following problem:
</p>
<p>
"Peter and Paul toss a coin alternately starting with Peter, and the one who
first tosses ‘heads’ wins. What are the probabilities p, p' for Peter or Paul to win?
</p>
<p>
The direct, systematic computation would sum (1/2)^n over the odd and even integers:
p = Σ<sub>n≥0</sub> (1/2)^(2n+1) = 2/3, p' = Σ<sub>n≥1</sub> (1/2)^(2n) = 1/3.
</p>
<p>The clever trick notes instead that Paul will find himself in Peter’s shoes if Peter fails to
win on the first toss: <i>ergo</i>, p' = p/2, so p = 2/3, p' = 1/3."
</p>
<p>
The "ergo" is saying that Paul will win if (Peter does not win immediately)
and (Paul wins, given that Peter does not win immediately). The probability of the first clause is 1/2,
and that of the second is p (since after Peter tosses a tail
Paul's situation is the same as that of Peter at the start of the game); ergo, p' = p/2.
</p>
<p>
Alternatively, one can solve this problem by saying instead that either Peter wins immediately,
or Paul wins on the second toss, or they are back where they started.
In math, this says that p = 1/2 + (1/4)p -- here 1/2 is the probability of Peter's immediate win, 1/4 is the probability of (Peter not winning immediately, then Paul not winning right after),
and p is the probability of (Peter winning from there); solving gives p = 2/3 again.
</p>
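<p>All three routes agree, of course. A quick Python sketch (my own illustration, not from the book) checks the direct series, the "back where we started" recursion, and a Monte Carlo simulation against one another:</p>

```python
import random

# Route 1: direct series. Peter wins on tosses 1, 3, 5, ...
# p = sum over n >= 0 of (1/2)^(2n + 1)
p_series = sum((1 / 2) ** (2 * n + 1) for n in range(60))

# Route 2: the "back where we started" recursion p = 1/2 + (1/4) p,
# which solves to p = (1/2) / (1 - 1/4) = 2/3.
p_recursion = (1 / 2) / (1 - 1 / 4)

# Route 3: Monte Carlo sanity check.
random.seed(0)
trials = 200_000
peter_wins = 0
for _ in range(trials):
    toss_number = 0
    while True:
        toss_number += 1
        if random.random() < 0.5:           # heads: the game ends here
            peter_wins += toss_number % 2   # odd-numbered tosses are Peter's
            break
p_simulated = peter_wins / trials

print(p_series, p_recursion, p_simulated)   # all three close to 2/3
```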
<p>
Of course, Jaynes himself can do things that are clever;
his dexterity with, among other things, generating functions,
transform methods, and asymptotic expansions, can appear magical to those not
trained as applied mathematicians or physicists.
</p>
<p>But there is additional irony here in that this "Peter and Paul problem" is exactly the wrong example with which
to complain about
“isolated clever tricks and gamesmanship”! In fact, thinking about a system moving between states
and analyzing how likely it is to reach certain
"terminal states" - i.e., setting up and analyzing a Markov chain - is a fairly general method
for solving similar probability problems, one well connected to other key areas of probability theory.</p>
<p>This serves as an illustration of a deeper point - many clever tricks, when well understood, become powerful methods,
much more powerful indeed than straightforward but uninspiring computations.</p>
<p>There is less disagreement here than may at first appear.
I agree with Jaynes in calling for “general mathematical techniques which will work not only on our present problem,
but on hundreds of others”;
it’s just that your current “general technique” may solve a given problem,
but not explain what is going on in it (mathematician Paul Zeitz calls this “How vs. Why”).
A clever trick may lead you to a better general theory,
closer to answering the “why” question — as indeed the Peter and Paul
coin tossing example illustrates. I am arguing not for gamesmanship,
but for bringing the game to the next level.
</p>
<p>
There are many other things Jaynes has to say about "orthodox" statistics and statisticians.
One other such volley is a defense of Jeffreys in an argument with another "orthodox" statistician,
Jerzy Neyman, in which, according to Jaynes, "Jeffreys is clearly right" - the conclusion that
I see as the only reason for including the episode in the book
(since the actual nature of the dispute is not given explicitly).
What is my reason for including it in this review? Well, having read the relevant
parts of the original sources, I can report that Jeffreys was clearly wrong.
I encourage you to decide for yourself whether
I am wrong that Jaynes is wrong that Neyman is wrong.
</p>
<p>
In the <a href="https://archive.siam.org/news/news.php?id=81">review</a>
I have mentioned, Diaconis calls PT:TLoS "wonderfully out of date", saying that
"the wonderful part is that Jaynes discusses
and points to dozens of papers from the 1950s through the 1980s
that have slipped off the map." A noticeable fraction of this pointing is in fact
pointing fingers at people doing things wrong.
It also forces the reader to either mostly ignore these sidetracks and discussions,
or to follow them up. Either strategy is admissible - and I have found the second one
quite rewarding when I followed it -
but it does make reading PT:TLoS much less straightforward.
</p>
<p>
Jaynes's critiques are of course not limited to statistics.
He has things to say on set theory, measure theory, the infinite, Kolmogorov's
axiomatization of probability, generalized functions,
Gödel's incompleteness, and so on. I was much reassured by Jaynes saying early on that
"we shall find ourselves defending Kolmogorov
against his critics on many technical points" - not because I think Kolmogorov needs defending,
but, conversely, because this increased my confidence that Jaynes's math would be mostly right.
Yet the contents of Appendix B, in which much of the attack against modern mathematical
formalism is collected, convince me that Jaynes never grasped the goal that underpins much of
modern mathematical development: finding the right language and the right level of generality.
To me his insistence that using these modern techniques leads to errors is akin to
complaining that summing infinite series leads to errors: sure it does, if you do it "naively",
or even if you do it in a complex but incorrect way. That's precisely why mathematicians have thought
long and hard about how one could do it without running into problems,
and developed multiple sophisticated and precise theories about this
(the most common of which they now teach in the "sequences and series" part of courses on mathematical analysis).</p>
<p>
In the same vein, I found the chapter on "paradoxes of probability theory"
the second most disappointing
(after the chapter on ignorance priors and transformation groups).
</p>
<p>
In math, there are paradoxes of various kinds: roughly, there are true statements that
subvert naive intuition (à la the Banach-Tarski paradox),
there are faulty demonstrations (Achilles and the tortoise),
and there are arguments that reveal a deficiency of terminology or of definitions (Russell's paradox).
There remains a possibility of finding a paradox of yet another kind - a true contradiction,
but for the standard axioms of mathematics this has not yet happened.
Thus all "paradoxes" in PT:TLoS should be of the "non-contradictory" type.
Alas, some of them don't even rise to that level: "non-conglomerability" is essentially
a demonstration that assuming that probabilities satisfy only "finite additivity"
-- as opposed to the more restrictive "countable additivity" which is part of Kolmogorov's axioms
-- would allow some "probability" assignments that behave in pathological ways.
This is a good example of something that may "defend Kolmogorov
against his critics on a technical point", but is hardly a paradox.
The "Borel-Kolmogorov paradox" is mostly of terminological type -
it poses a question of how to make sense of "conditioning on event of probability zero".
It was pretty much solved by Kolmogorov - the solution being that there is no
intrinsic sense in which one can talk about it, though one can sometimes do so, for instance when
one has a sequence of events of positive probability "converging" to the event in question
(a resolution that Jaynes would love).
One common scenario is when the event in question is a level set of some "random variable" (in the technical sense)
- this is what arises most commonly in practice.
In full generality one has the theory of "disintegration" and of "conditioning on a sub-sigma-algebra".
All of this is part of well-developed theory, so passionately criticised by Jaynes in Appendix B.
Finally, the "marginalization paradox" is concerned with pathological behaviour of Bayesian inference in
some situations where improper priors are used, and is part of what Diaconis calls Jaynes's
"long-running debate with Dawid-Stone-Zidek".
I have delved into it to some depth during my reading,
looking up some of the papers of Dawid, Stone, and Zidek and all that,
but I seem to have happily forgotten whatever insights I might have found there, other than
"improper priors are a constant source of trouble (but maybe not in exactly the way Jaynes thinks)".
If anyone has a better understanding, I'd be happy to be enlightened - especially if they
manage to find at least "one specific real application" where these insights are needed.
</p>
<h3>Exegi monumentum.</h3>
<p>
What are we to make of all this,
<a href="https://slatestarcodex.com/2017/03/16/book-review-seeing-like-a-state/">as</a>
<a href="https://slatestarcodex.com/2016/12/02/contra-robinson-on-schooling/">the</a>
<a href="https://slatestarcodex.com/2019/07/23/book-review-the-electric-kool-aid-acid-test/">saying</a>
<a href="https://slatestarcodex.com/2015/09/05/if-you-cant-make-predictions-youre-still-in-a-crisis/">goes</a>?
</p>
<p>
PT:TLoS is, to put it mildly, a very special book. It is neither a textbook, nor a reference text,
nor a philosophical treatise, nor a history book - and it is a bit of all of those.
It is singularly shaped by the person of E. T. Jaynes: by his "two thirds theoretical physicist and one-third
electrical engineer" background, with its consequent interest in radars and in statistical mechanics,
by his unconventional thinking,
by the polemical style of his long-standing disputes with the statisticians of his age,
and by his untimely death.
</p>
<p>
Its chapters written earlier and polished for longer are some of the strongest, while those added late
are often more open to criticism, or incomplete. Yet for all those flaws, its influence is tremendous
- it has 7.5 thousand citations, including in such high impact texts as Taleb's "Black Swan",
Goodfellow et al.'s "Deep Learning", Koller and Friedman's "Probabilistic graphical models"
and many others. Notably, some of the references simply recommend it as
"an additional resource" on probability and information theory for those with
"absolutely no prior experience with these subjects" or even "to the general reader"
- a use for which I find it rather poorly suited,
and not just because it lacks many of the more recent developments.
At best, it may serve as a kind of "A Companion to Probability:
<a href="https://www.maa.org/press/maa-reviews/a-companion-to-analysis-a-second-first-and-first-second-course-in-analysis">A Second First and A First Second Course</a>
in Probability".
Overall, it may be one of those books that many wish to have read,
but not as many wish to actually read.
</p>
<p>
If this review seems overly critical - and though I do feel mildly apprehensive putting out a review of the work of
<a href="https://www.lesswrong.com/posts/kXSETKZ3X9oidMozA/the-level-above-mine">Nosferatu</a>
himself, how critical can I really be, given the amount of time I willingly spent with this tome?
- it may be because Jaynes has, by now, won many of his battles.
It is difficult to appreciate an insight once it becomes the usual mode of thinking,
the <a href="https://en.wikipedia.org/wiki/This_Is_Water">proverbial water</a>.
It is also, however, because the book itself is incomplete, and often frustrating.
</p>
<p>
In the very first paragraph of the editor's preface, G. Larry Bretthorst explains: </p>
<p>
"I could have written [the] latter chapters and
filled in the missing pieces, but if I did so, the work would no longer be Jaynes’; rather, it
would be a Jaynes–Bretthorst hybrid with no way to tell which material came from which
author. In the end, I decided the missing chapters would have to stay missing – the work
would remain Jaynes’".
</p>
<p>
This is a decision which one
<a href="https://www.amazon.com/gp/customer-reviews/RUJH5ZTNY9VH1/ref=cm_cr_arp_d_rvw_ttl?ie=UTF8&ASIN=0521592712">Amazon review</a>
calls "a bad mistake", saying "What [Jaynes] needed was an editor, but what he got instead was a hagiographer."
This is certainly how I felt when I was reading the book; now I am less sure.
Once you have struggled through it, the motivation to make the struggle less onerous diminishes,
and you begin to think that "keeping the work Jaynes'" may actually be a valid consideration,
and not just the lazy cop-out you thought it to be whilst in the thick of it all.
</p>
<p>
And yet I, too, find myself mourning for what this book could have been. I admit
that sometimes when faced with a choice (vanilla or chocolate? black jeans or blue?) I simply choose both.
We already have Jaynes's version. Can we not get the "completed version" as well?
Could we not write the missing chapters, explain the cryptic references,
solve the unsolved exercises and release the result to the world?
Someone who is better than me at organizing things, and someone who knows more than me about copyright and publishing
would need to think about it. On the one hand, we are in the 21st century, with the power of the internet,
crowdsourcing, and social campaigns. On the other hand, it is
my understanding that it will be almost the 22nd century
before the copyright on PT:TLoS expires.
</p>
<p>
And while we wait for that, we read the version we have.
The version which makes clear Jaynes's message: "progress in science
goes forward on the shoulders of doubters, not believers".
The version that urges you to think for yourself rather than to defer to the "orthodoxy",
whatever it may be called in your time - to see the truth and preserve it.
</p>