= OSCON 2008, Session 5: A History of Failure

Notes committed by Erik on 2008-07-23.

== Ancient Greece
- More than 2000 years ago
- Device (the Antikythera mechanism) showing the position of the stars, sun, planets, and moon
- First computer, but also first software collaboration
- Device was modified after it was created
  - Bugfixes
  - Feature creep

- Plundered by Romans
- Sank; recovered in 1901
- X-ray tomography: 2000 Greek characters on the outside
- (Funny EULA)

== Modern Times
- First bug: 1947, a real insect
- 1983: Therac-25 radiation treatment machine
  - PDP-11
  - Errors are caused by alpha particles and EM noise
  - Picks the wrong mode 1 in 250M times: massive radiation overdose
  - No hardware interlocks; software controlled
  - Picked the wrong mode 6 times in 3 years

- Overcorrection killed a rocket because of absolute velocity vs. smoothed velocity
- Self-destruct buttons

"It's possible to make mistakes so large they invalidate your entire worth as a human being"

- Australian = $40,000/year; over a lifespan of 80 years, $3.17M
- Metric = lifetimes of effort lost
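
The lifetime metric above is plain arithmetic. A minimal sketch, assuming the two figures from the notes ($3.17M per lifetime, 80 years per lifetime); the function names are made up for illustration. Its results will not reproduce every "lifetimes" or "years" figure quoted in these notes exactly, since any currency conversion or rounding used in the talk is not recorded here.

```python
LIFETIME_EARNINGS = 3.17e6  # dollars earned over one lifetime (figure from the notes)
LIFESPAN_YEARS = 80         # years in one lifetime (figure from the notes)

def lifetimes_lost(cost_dollars):
    """Express a dollar loss as a number of lifetimes of earnings."""
    return cost_dollars / LIFETIME_EARNINGS

def years_lost(cost_dollars):
    """Express a dollar loss as years of human effort."""
    return lifetimes_lost(cost_dollars) * LIFESPAN_YEARS

# Example: a $30M loss, with no currency conversion applied.
print(f"{lifetimes_lost(30e6):.1f} lifetimes, {years_lost(30e6):.0f} years")
```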

== Bug 1: AT&T 1990
- When a switch fails, it tells its neighbors; they remove it from the routing table while the bad switch spends 6 seconds trying to fix itself.
- Coming back up, it would do a 3-way handshake with its peers to add them back.
- Changed: still sends the fault, still fixes itself, but then just makes an outgoing call to the other switches.
- Bug: 1st switch makes the call while the 2nd switch is updating its routing table: crashes everyone!
- 75M calls were lost
- Lost revenue = $60M; 2300 years of productivity lost.
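
The race described above (a message arriving while the receiver is still updating its routing table) can be illustrated with a toy model. This is only a sketch based on the description in these notes, not AT&T's actual switch code; the `Switch` and `QueueingSwitch` classes and their methods are hypothetical.

```python
class Switch:
    """Toy switch that crashes if a call arrives mid-update (the bug)."""

    def __init__(self, name):
        self.name = name
        self.updating = False  # True while the routing table is being rewritten
        self.crashed = False
        self.peers = set()

    def on_call(self, peer_name):
        if self.updating:
            # A second message landed before the first update finished.
            self.crashed = True
            return
        self.updating = True
        self.peers.add(peer_name)

    def finish_update(self):
        self.updating = False


class QueueingSwitch(Switch):
    """Same switch, but calls that arrive mid-update wait in a queue."""

    def __init__(self, name):
        super().__init__(name)
        self.pending = []

    def on_call(self, peer_name):
        if self.updating:
            self.pending.append(peer_name)
            return
        super().on_call(peer_name)

    def finish_update(self):
        super().finish_update()
        if self.pending:
            self.on_call(self.pending.pop(0))


buggy = Switch("B")
buggy.on_call("A")    # starts the routing-table update
buggy.on_call("C")    # arrives before finish_update(): crash
print(buggy.crashed)  # True

safe = QueueingSwitch("B")
safe.on_call("A")
safe.on_call("C")     # queued instead of crashing
safe.finish_update()  # finishes A's update, then starts C's
safe.finish_update()
print(safe.crashed, sorted(safe.peers))  # False ['A', 'C']
```

If the two calls are spaced far enough apart that `finish_update()` runs between them, the buggy switch behaves correctly, which is why this kind of fault can survive casual testing.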

== 1996: Tiwai Point
- Aluminum smelter, computer controlled
- Comalco Australia programmed the computers
- AUS is 2 hours behind
- Leap year: the computers couldn't handle day 366.
- All computers crashed at midnight.
- 2 hours pass, and the same problem happens in AUS
- Cells melted and had to be replaced.
- Unknown cost.
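
The notes do not say how the smelter software counted days, so the following is only a sketch of the general failure: a day-number check that hard-codes 365 rejects the last day of a leap year. The function names are hypothetical.

```python
import calendar
from datetime import date

def days_in_year(year):
    """Return 366 for leap years, 365 otherwise."""
    return 366 if calendar.isleap(year) else 365

def buggy_is_valid_day(day_number):
    """Hard-coded year length: wrong on the last day of a leap year."""
    return 1 <= day_number <= 365

def is_valid_day(year, day_number):
    """Year-aware check that accepts day 366 only in leap years."""
    return 1 <= day_number <= days_in_year(year)

last_day = date(1996, 12, 31).timetuple().tm_yday
print(last_day)                      # 366
print(buggy_is_valid_day(last_day))  # False
print(is_valid_day(1996, last_day))  # True
print(is_valid_day(1997, 366))       # False
```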

== Space vehicles
=== 1996: Ariane 5
- Bug appeared 37 seconds after launch
- Veered off course dramatically
- 64-bit FP used to measure launch position
- Cast to a 16-bit int
- No exception handling!
- Overflow, negative! Rocket turned around!
- Reused code from Ariane 4, which only reached 1/2 the horizontal speed
- Testing? The bug would have shown up perfectly!!!
- The bug showed up afterwards in simulation
- $370M lost!
- 150 lifetimes, 12,000 years
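
The overflow above is easy to reproduce in miniature. The sketch below is Python rather than the flight software's language, and the input values are made up; it only shows how a number that fits in a 64-bit float can wrap to a negative value when forced into 16 bits, and what a range-checked conversion looks like. `unchecked_to_int16` and `checked_to_int16` are hypothetical helper names.

```python
import ctypes

INT16_MIN, INT16_MAX = -32768, 32767

def unchecked_to_int16(value):
    """Truncate to an integer and keep only the low 16 bits, like an unchecked cast."""
    return ctypes.c_int16(int(value)).value

def checked_to_int16(value):
    """Convert only if the value fits; otherwise raise instead of wrapping."""
    as_int = int(value)
    if not INT16_MIN <= as_int <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit in a signed 16-bit integer")
    return as_int

print(unchecked_to_int16(20000.0))  # 20000: within range, unchanged
print(unchecked_to_int16(40000.0))  # -25536: out of range, wraps negative

try:
    checked_to_int16(40000.0)
except OverflowError as err:
    print("rejected:", err)
```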

=== 1998: Mars Climate Orbiter
- Plummeted through the atmosphere
- Part of the code used imperial units, part metric
- Pound-force vs. newtons :P
- Testing budget was cut before launch
- Mars Lander failed as well
  - Thrusters stopped working
  - Landing gear started vibrating, so it thought it was on the ground
- 8300 years of time lost
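
To show the size of a pound-force vs. newton mix-up, here is a small sketch that tags a value with its unit at construction time. The `Impulse` class and the sample number are assumptions for illustration, not the mission's code or data; the conversion factor is the standard definition of the pound-force.

```python
from dataclasses import dataclass

LBF_TO_NEWTON = 4.4482216152605  # 1 pound-force expressed in newtons

@dataclass(frozen=True)
class Impulse:
    """Impulse stored internally in newton-seconds only."""
    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value):
        return cls(value)

    @classmethod
    def from_lbf_seconds(cls, value):
        return cls(value * LBF_TO_NEWTON)

reported = 100.0  # hypothetical number produced by code working in lbf*s

misread = Impulse.from_newton_seconds(reported)  # unit silently assumed
correct = Impulse.from_lbf_seconds(reported)     # unit stated explicitly

ratio = correct.newton_seconds / misread.newton_seconds
print(f"misreading the unit understates the impulse by a factor of {ratio:.2f}")
```

Forcing every caller through a constructor that names its unit means the mismatch has to be written down somewhere a reviewer or a test can see it.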

=== Deep Space 2: Hit Mars
- 644+ km/h
- Sat in storage
- Launched it, and it hit Mars
- Battery was dead!
- $30M, 10 lifetimes

== 2003: North American Blackouts
- 50M people
- 2.38 x AUS, 1/6 of USA
- Who's to blame?
  - El Nino
  - Canada blames New York, but it was a sunny day
  - Canada blames a nuclear power plant in Pennsylvania
  - New York blames Canada
- Europe was saying the USA had a 3rd-world electric grid
- 6 weeks later, there was a big blackout
- FirstEnergy in Ohio
  - 14:14 Alarm system fails *SILENTLY*
  - Display said everything was fine
  - Remained in that state for 27 minutes, then crashed
  - Hot spare failed silently after 13 minutes
  - 345 kV line goes down; alarm system isn't working
  - Automatic re-route; other lines pick up the load
  - 2 more lines go down; no one knows
  - 11 more lines go down, says MISO
  - MISO calls FirstEnergy to notify them, then their own power goes out
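
One common guard against an alarm system that fails silently is to make the display depend on a recent heartbeat rather than on the absence of alarms. The sketch below is a generic illustration, not a description of the utility's software; `AlarmFeed`, `display_status`, and the 60-second threshold are all assumptions.

```python
import time

class AlarmFeed:
    """Hypothetical alarm feed that records when it last reported in."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_heartbeat = clock()

    def heartbeat(self):
        """Called by the alarm system each time it completes a cycle."""
        self._last_heartbeat = self._clock()

    def seconds_since_heartbeat(self):
        return self._clock() - self._last_heartbeat

def display_status(feed, max_silence_seconds=60.0):
    """Report 'OK' only if the feed has been heard from recently."""
    if feed.seconds_since_heartbeat() > max_silence_seconds:
        return "STALE: alarm system has stopped reporting"
    return "OK"

# Demo with a fake clock so no real waiting is needed.
now = [0.0]
feed = AlarmFeed(clock=lambda: now[0])
print(display_status(feed))  # OK
now[0] += 27 * 60            # 27 minutes pass with no heartbeat
print(display_status(feed))  # STALE: alarm system has stopped reporting
```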

== Take away
- Race conditions
- Test
- Deploy in New Zealand first

== 1998: Auckland blackouts
- LOTR: where the orcs come from
- 5 weeks without power
- 150 MW of load, 110 MW of rated cable :P
- 4 cables, 1 failed
- Bad press recently, so no announcement
- 150 MW of power on 85 MW of cable
- Cable 2 fails -> 150 MW of power on 50 MW of cable
- Management willpower vs. physics. Classic.
- Blamed it on El Nino
- Actually a lack of sysadmins; engineers knew the cables were overloaded
- 1980: "We should replace the cables, guys"
- Cost: $150M to Mercury Energy, unknown cost to businesses
- Economic gain to Wellington: Priceless
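
The cable figures above can be turned into overload factors with one division each. This sketch uses only the load and capacity numbers from the notes; `overload_factor` is a hypothetical helper name.

```python
LOAD_MW = 150  # load quoted in the notes

def overload_factor(load_mw, capacity_mw):
    """How many times its rated capacity the cabling is carrying."""
    return load_mw / capacity_mw

# Rated capacity with all cables, after the first failure, after the second.
for capacity_mw in (110, 85, 50):
    factor = overload_factor(LOAD_MW, capacity_mw)
    print(f"{LOAD_MW} MW on {capacity_mw} MW of cable: {factor:.2f}x rated")
```

Each failure raises the factor for the surviving cables (1.36x, then 1.76x, then 3.00x), so every loss makes the next one more likely.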

== Sysadmins
- Hard to get people to listen to you: you sound like doomsayers
- Disk failure? We need RAID
- Power? UPS
- Listen to the sysadmins