= OSCON 2008, Session 5: A History of Failure

Erik, 2008-07-23 (commit eaa28789)

== Ancient Greece
- More than 2000 years ago
- Device (the Antikythera mechanism) that tracked the position of the stars, sun, planets, and moon
- First computer, but also first software collaboration
- Modification of the device after it was created
- Bugfixes
- Feature creep

- Plundered by Romans
- Sank, recovered in 1901.
- X-ray tomography revealed 2000 Greek characters on the outside
- (Funny EULA)

== Modern Times
- First bug: 1947. A real insect
- 1983: Therac-25 radiation treatment machine
- PDP-11
- Errors are caused by alpha particles and EM noise
- Picks the wrong mode 1 in 250M times, causing a massive radiation overdose
- No hardware interlocks, software controlled
- Picked the wrong mode 6 times in 3 years.

- Overcorrection killed a rocket because of absolute velocity vs. smoothed velocity
- Self-destruct buttons

"It's possible to make mistakes so large they invalidate your entire worth as a human being"

- Australian = $40,000/year, over a lifespan of 80 years, $3.17M
- Metric = lifetime effort lost
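
The lifetime metric above is simple arithmetic, and a small Python sketch makes it easy to apply to any dollar figure. The constant names and the two helper functions are illustrative assumptions, not something from the talk. The talk's own year counts may rest on different inputs (currency conversion, for example), so the sketch is not expected to reproduce them exactly.

```python
# Hypothetical helpers illustrating the "lifetime effort lost" metric.
# The two constants are the figures quoted in the notes.
LIFETIME_VALUE = 3.17e6   # dollars: quoted value of one lifetime of effort
LIFESPAN_YEARS = 80       # years: quoted lifespan


def lifetimes_lost(cost_dollars: float) -> float:
    """Convert a dollar loss into the number of lifetimes it represents."""
    return cost_dollars / LIFETIME_VALUE


def years_lost(cost_dollars: float) -> float:
    """Convert a dollar loss into years, at LIFESPAN_YEARS per lifetime."""
    return lifetimes_lost(cost_dollars) * LIFESPAN_YEARS


if __name__ == "__main__":
    cost = 30e6  # a $30M loss, used here only as sample input
    print(f"{lifetimes_lost(cost):.1f} lifetimes, {years_lost(cost):.0f} years")
```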

== Bug 1: AT&T 1990
- When a switch fails, it tells its neighbors, they remove it from their routing tables, and the bad switch spends 6 seconds trying to fix itself.
- Coming back up, it would do a three-way handshake with its peers to add them back.
- Changed: it still sends the fault and still self-fixes, but then just makes an outgoing call to the other switches.
- Bug: if the 1st switch made the call while the 2nd switch was updating its routing table, it crashed everyone!
- 75M calls were lost
- Lost revenue = $60M, 2300 years of productivity lost.

== 1996: Tiwai Point
- Aluminum smelter, computer controlled
- Comalco Australia programmed the computers
- In New Zealand, 2 hours ahead of AUS
- Leap year: the computers couldn't handle day 366.
- All computers crashed at midnight.
- 2 hours pass, and the same problem happens in AUS
- Cells melted, had to be replaced.
- Unknown cost.
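
The day-366 failure is easy to reproduce. The sketch below is an assumption about the general shape of such a bug, not Comalco's actual code: a hypothetical `schedule` table sized for 365 days breaks on the last day of a leap year, while sizing the table from the calendar does not.

```python
import calendar
from datetime import date

# Hypothetical lookup table with one slot per day, sized for a non-leap year.
DAYS_PER_YEAR = 365
schedule = ["normal"] * DAYS_PER_YEAR


def lookup_naive(d: date) -> str:
    """Buggy: assumes every year has exactly 365 days."""
    day_of_year = d.timetuple().tm_yday      # 1-based day number
    return schedule[day_of_year - 1]         # IndexError on day 366


def lookup_safe(d: date) -> str:
    """Sizes the table from the calendar instead of a hard-coded 365."""
    days = 366 if calendar.isleap(d.year) else 365
    table = ["normal"] * days
    return table[d.timetuple().tm_yday - 1]


if __name__ == "__main__":
    last_day = date(1996, 12, 31)            # day 366 of a leap year
    print(last_day.timetuple().tm_yday)
    print(lookup_safe(last_day))
    try:
        lookup_naive(last_day)
    except IndexError as exc:
        print("naive lookup failed:", exc)
```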

== Space vehicles
=== 1996: Ariane 5
- Bug triggered 37 seconds after launch
- Veered off course dramatically
- 64-bit FP to measure launch position
- Casting to 16-bit int
- No exception handling!
- Overflow, negative! Rocket turned around!
- Reused code from Ariane 4, which could only move at 1/2 the horizontal speed
- Testing? The bug showed up perfectly!!!
- The bug showed up afterwards in simulation
- $370M lost!
- 150 lifetimes, 12,000 years
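
The float-to-16-bit conversion is the core of the failure as the notes describe it. The flight software was not written in Python, so this is only a sketch of the arithmetic: `ctypes.c_int16` stands in for an unchecked narrowing conversion, and `to_int16_checked` is a hypothetical guarded alternative. The sample value 40000.0 is made up for illustration.

```python
import ctypes

INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_unchecked(x: float) -> int:
    """Truncate to an integer, then keep only the low 16 bits (wraps silently)."""
    return ctypes.c_int16(int(x)).value


def to_int16_checked(x: float) -> int:
    """Refuse values that do not fit instead of wrapping."""
    n = int(x)
    if not INT16_MIN <= n <= INT16_MAX:
        raise OverflowError(f"{x} does not fit in a signed 16-bit integer")
    return n


if __name__ == "__main__":
    reading = 40000.0                      # illustrative value, larger than 32767
    print(to_int16_unchecked(reading))     # -25536: a large positive became negative
    try:
        to_int16_checked(reading)
    except OverflowError as exc:
        print("rejected:", exc)
```

Failing loudly at the conversion is only half of the fix; something still has to handle the error, which is the point of the "No exception handling!" bullet.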

=== 1998: Mars Climate Orbiter
- Plummeted through the atmosphere
- Part of the code used imperial units, part used metric
- Pound-force vs. newtons :P
- Testing budget was cut before launch
- Mars Polar Lander failed as well
- Thrusters stopped working
- Landing gear started vibrating, so it thought it was on the ground
- 8300 years of time lost
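
The pound-force versus newton mix-up noted above can be sketched in a few lines. The `Force` class and `total_newtons` function are hypothetical names; the only outside fact used is the standard conversion factor, 1 lbf = 4.4482216152605 N. Tagging each value with its unit is one way to make such a mismatch visible, and is not a claim about how the mission software was structured.

```python
from dataclasses import dataclass

LBF_TO_NEWTON = 4.4482216152605  # newtons per pound-force


@dataclass(frozen=True)
class Force:
    """A force value that always carries its unit."""
    value: float
    unit: str  # "N" or "lbf"

    def to_newtons(self) -> float:
        if self.unit == "N":
            return self.value
        if self.unit == "lbf":
            return self.value * LBF_TO_NEWTON
        raise ValueError(f"unknown unit: {self.unit}")


def total_newtons(forces) -> float:
    """Sum forces only after converting each one to newtons."""
    return sum(f.to_newtons() for f in forces)


if __name__ == "__main__":
    readings = [Force(10.0, "lbf"), Force(10.0, "N")]
    naive = sum(f.value for f in readings)   # mixes units, so the sum is meaningless
    print(naive, total_newtons(readings))
```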

=== Deep Space 2: Hit Mars
- 644+ km/h
- Sat in storage
- Launched it, and it hit Mars
- Battery was dead!
- $30M, 10 lifetimes

== 2003: North American Blackouts
- 50M people
- 2.38 x the population of AUS, 1/6 of the USA
- Who's to blame?
- El Niño
- Canada blames lightning in New York, but it was a sunny day
- Canada blames a nuclear power plant in Pennsylvania
- New York blames Canada
- Europe was saying the USA had a 3rd world electric grid
- 6 weeks later, Europe had a big blackout of its own
- FirstEnergy in Ohio
- 14:14 Alarm system fails *SILENTLY*
- Display said everything was fine
- Remained in that state for 27 minutes, then crashed
- Hot spare failed silently after 13 minutes
- 345 kV line goes down, alarm system isn't working
- Automatic re-route, other lines pick up the load
- 2 more lines went down, no one knows
- 11 more lines go down, says MISO
- MISO calls FirstEnergy to notify them, then their own power went out
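
An alarm system that fails silently suggests the usual countermeasure: make the monitor prove it is alive. The sketch below is a generic heartbeat check, not a description of FirstEnergy's systems. `Heartbeat`, `beat`, and `is_stale` are hypothetical names, and the clock value is passed in so the example runs without waiting.

```python
class Heartbeat:
    """Tracks the last time a monitored component reported in."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self.last_beat = None  # no heartbeat seen yet

    def beat(self, now: float) -> None:
        """Called by the alarm system each time it completes a cycle."""
        self.last_beat = now

    def is_stale(self, now: float) -> bool:
        """True if the component has been silent for longer than the timeout."""
        if self.last_beat is None:
            return True
        return now - self.last_beat > self.timeout_seconds


if __name__ == "__main__":
    hb = Heartbeat(timeout_seconds=60)
    hb.beat(now=0)
    print(hb.is_stale(now=30))    # False: heard from it recently
    print(hb.is_stale(now=120))   # True: silent too long, so escalate
```

The check has to run somewhere independent of the thing it watches; a watchdog inside the failed alarm process would go quiet along with it.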

== Take away
- Race conditions
- Test
- Deploy in New Zealand first

== 1998: Auckland blackouts
- LOTR: Where the orcs come from
- 5 weeks without power
- 150 MW of load, 110 MW of rated cable :P
- 4 cables, 1 failed.
- Bad press recently, so no announcement
- 150 MW of power on 85 MW of cable
- Cable 2 fails -> 150 MW of power on 50 MW of cable
- Management willpower vs. physics. Classic.
- Blamed it on El Niño
- Actually a lack of sysadmins; engineers knew the cables were overloaded
- 1980: "We should replace the cables, guys"
- Cost: $150M to Mercury Energy, unknown cost to businesses
- Economic gain to Wellington: Priceless
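
The cable figures above reduce to a load-to-capacity ratio. This sketch uses only the numbers in the notes; the `utilization` helper is a hypothetical name.

```python
LOAD_MW = 150  # demand quoted in the notes


def utilization(load_mw: float, capacity_mw: float) -> float:
    """Load as a fraction of rated capacity; above 1.0 means overloaded."""
    return load_mw / capacity_mw


if __name__ == "__main__":
    # Rated capacity as quoted: initially, after one failure, after two failures.
    for capacity_mw in (110, 85, 50):
        ratio = utilization(LOAD_MW, capacity_mw)
        print(f"{LOAD_MW} MW on {capacity_mw} MW of cable: {ratio:.0%} of rating")
```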

== Sysadmins
- Hard to get people to listen to you; you sound like doomsayers
- Disk failure? We need RAID
- Power? UPS
- Listen to the sysadmins