my solutions of the code-katas from Dave Thomas
Python Ruby
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
kata1
kata2
kata4
kata5
kata6
README.org
buildout.cfg
code_kata_five.html
code_kata_seven.html
code_kata_six.html

README.org

Code Kata One - Supermarket Pricing

Problem

This kata arose from some discussions we’ve been having at the DFW Practioners meetings. The problem domain is something seemingly simple: pricing goods at supermarkets.

Some things in supermarkets have simple prices: this can of beans costs $0.65. Other things have more complex prices. For example:

three for a dollar (so what’s the price if I buy 4, or 5?) $1.99/pound (so what does 4 ounces cost?) buy two, get one free (so does the third item have a price?)

This kata involves no coding. The exercise is to experiment with various models for representing money and prices that are flexible enough to deal with these (and other) pricing schemes, and at the same time are generally usable (at the checkout, for stock management, order entry, and so on). Spend time considering issues such as:

does fractional money exist? when (if ever) does rounding take place? how do you keep an audit trail of pricing decisions (and do you need to)? are costs and prices the same class of thing? if a shelf of 100 cans is priced using “buy two, get one free”, how do you value the stock?

This is an ideal shower-time kata, but be careful. Some of the problems are more subtle than they first appear. I suggest that it might take a couple of weeks worth of showers to exhaust the main alternatives. Goal

The goal of this kata is to practice a looser style of experimental modelling. Look for as many different ways of handling the issues as possible. Consider the various tradeoffs of each. What techniques use best for exploring these models? For recording them? How can you validate a model is reasonable? What’s a Code Kata?

As a group, software developers don’t practice enough. Most of our learning takes place on the job, which means that most of our mistakes get made there as well. Other creative professions practice: artists carry a sketchpad, musicians play technical pieces, poets constantly rewrite works. In karate, where the aim is to learn to spar or fight, most of a student’s time is spent learning and refining basic moves. The more formal of these exercises are called kata.

To help developers get the same benefits from practicing, we’re putting together a series of code kata: simple, artificial exercises which let us experiment and learn without the pressure of a production environment. Our suggestions for doing the kata are:

find a place and time where you won’t be interrupted focus on the essential elements of the kata remember to look for feedback for every major decision if it helps, keep a journal of your progress have discussion groups with other developers, but try to have completed the kata first

There are no right or wrong answers in these kata: the benefit comes from the process, not from the result.

Answer

Karate chop

Problem

A binary chop (sometimes called the more prosaic binary search) finds the position of value in a sorted array of values. It achieves some efficiency by halving the number of items under consideration each time it probes the values: in the first pass it determines whether the required value is in the top or the bottom half of the list of values. In the second pass in considers only this half, again dividing it in to two. It stops when it finds the value it is looking for, or when it runs out of array to search. Binary searches are a favorite of CS lecturers.

This Kata is straightforward. Implement a binary search routine (using the specification below) in the language and technique of your choice. Tomorrow, implement it again, using a totally different technique. Do the same the next day, until you have five totally unique implementations of a binary chop. (For example, one solution might be the traditional iterative approach, one might be recursive, one might use a functional style passing array slices around, and so on). Goals

This Kata has three separate goals:

As you’re coding each algorithm, keep a note of the kinds of error you encounter. A binary search is a ripe breeding ground for “off by one” and fencepost errors. As you progress through the week, see if the frequency of these errors decreases (that is, do you learn from experience in one technique when it comes to coding with a different technique?). What can you say about the relative merits of the various techniques you’ve chosen? Which is the most likely to make it in to production code? Which was the most fun to write? Which was the hardest to get working? And for all these questions, ask yourself “why?”. It’s fairly hard to come up with five unique approaches to a binary chop. How did you go about coming up with approaches four and five? What techniques did you use to fire those “off the wall” neurons?

Specification

Write a binary chop method that takes an integer search target and a sorted array of integers. It should return the integer index of the target in the array, or -1 if the target is not in the array. The signature will logically be:

chop(int, array_of_int) -> int

You can assume that the array has less than 100,000 elements. For the purposes of this Kata, time and memory performance are not issues (assuming the chop terminates before you get bored and kill it, and that you have enough RAM to run it). Test Data

Here is the Test::Unit code I used when developing my methods. Feel free to add to it. The tests assume that array indices start at zero. You’ll probably have to do a couple of global search-and-replaces to make this compile in your language of choice (unless your enlightened choice happens to be Ruby).

def test_chop
  assert_equal(-1, chop(3, []))
  assert_equal(-1, chop(3, [1]))
  assert_equal(0,  chop(1, [1]))
  #
  assert_equal(0,  chop(1, [1, 3, 5]))
  assert_equal(1,  chop(3, [1, 3, 5]))
  assert_equal(2,  chop(5, [1, 3, 5]))
  assert_equal(-1, chop(0, [1, 3, 5]))
  assert_equal(-1, chop(2, [1, 3, 5]))
  assert_equal(-1, chop(4, [1, 3, 5]))
  assert_equal(-1, chop(6, [1, 3, 5]))
  #
  assert_equal(0,  chop(1, [1, 3, 5, 7]))
  assert_equal(1,  chop(3, [1, 3, 5, 7]))
  assert_equal(2,  chop(5, [1, 3, 5, 7]))
  assert_equal(3,  chop(7, [1, 3, 5, 7]))
  assert_equal(-1, chop(0, [1, 3, 5, 7]))
  assert_equal(-1, chop(2, [1, 3, 5, 7]))
  assert_equal(-1, chop(4, [1, 3, 5, 7]))
  assert_equal(-1, chop(6, [1, 3, 5, 7]))
  assert_equal(-1, chop(8, [1, 3, 5, 7]))
end

Kata Three: How Big, How Fast?

Problem

Rough estimation is a useful talent to possess. As you’re coding away, you may suddenly need to work out approximately how big a data structure will be, or how fast some loop will run. The faster you can do this, the less the coding flow will be disturbed.

So this is a simple kata: a series of questions, each asking for a rough answer. Try to work each out in your head. I’ll post my answers (and how I got them) in a week or so. How Big?

roughly how many binary digits (bit) are required for the unsigned representation of: 1,000 1,000,000 1,000,000,000 1,000,000,000,000 8,000,000,000,000

My town has approximately 20,000 residences. How much space is required to store the names, addresses, and a phone number for all of these (if we store them as characters)? I’m storing 1,000,000 integers in a binary tree. Roughly how many nodes and levels can I expect the tree to have? Roughly how much space will it occupy on a 32-bit architecture?

How Fast?

My copy of Meyer’s Object Oriented Software Construction has about 1,200 body pages. Assuming no flow control or protocol overhead, about how long would it take to send it over an async 56k baud modem line? My binary search algorithm takes about 4.5mS to search a 10,000 entry array, and about 6mS to search 100,000 elements. How long would I expect it to take to search 10,000,000 elements (assuming I have sufficient memory to prevent paging). Unix passwords are stored using a one-way hash function: the original string is converted to the ‘encrypted’ password string, which cannot be converted back to the original string. One way to attack the password file is to generate all possible cleartext passwords, applying the password hash to each in turn and checking to see if the result matches the password you’re trying to crack. If the hashes match, then the string you used to generate the hash is the original password (or at least, it’s as good as the original password as far as logging in is concerned). In our particular system, passwords can be up to 16 characters long, and there are 96 possible characters at each position. If it takes 1mS to generate the password hash, is this a viable approach to attacking a password?

Kata Four: Data Munging

Problem

Martin Fowler gave me a hard time for KataTwo, complaining that it was yet another single-function, academic exercise. Which, or course, it was. So this week let’s mix things up a bit.

Here’s an exercise in three parts to do with real world data. Try hard not to read ahead—do each part in turn. Part One: Weather Data

In weather.dat you’ll find daily weather data for Morristown, NJ for June 2002. Download this text file, then write a program to output the day number (column one) with the smallest temperature spread (the maximum temperature is the second column, the minimum the third column). Part Two: Soccer League Table

The file football.dat contains the results from the English Premier League for 2001/2. The columns labeled ‘F’ and ‘A’ contain the total number of goals scored for and against each team in that season (so Arsenal scored 79 goals against opponents, and had 36 goals scored against them). Write a program to print the name of the team with the smallest difference in ‘for’ and ‘against’ goals. Part Three: DRY Fusion

Take the two programs written previously and factor out as much common code as possible, leaving you with two smaller programs and some kind of shared functionality. Kata Questions

To what extent did the design decisions you made when writing the original programs make it easier or harder to factor out common code? Was the way you wrote the second program influenced by writing the first? Is factoring out as much common code as possible always a good thing? Did the readability of the programs suffer because of this requirement? How about the maintainability?

Python

First attempt were two dirty regular expression matching. Then I factored out the regular expression handling in a generic Matcher class. This class has a compute function, which skips the non matching lines and apply the given function to the result.

Kata Five - Bloom Filters

Problem

There are many circumstances where we need to find out if something is a member of a set, and many algorithms for doing it. If the set is small, you can use bitmaps. When they get larger, hashes are a useful technique. But when the sets get big, we start bumping in to limitations. Holding 250,000 words in memory for a spell checker might be too big an overhead if your target environment is a PDA or cell phone. Keeping a list of web-pages visited might be extravagant when you get up to tens of millions of pages. Fortunately, there’s a technique that can help.

Bloom filters are a 30-year-old statistical way of testing for membership in a set. They greatly reduce the amount of storage you need to represent the set, but at a price: they’ll sometimes report that something is in the set when it isn’t (but it’ll never do the opposite; if the filter says that the set doesn’t contain your object, you know that it doesn’t). And the nice thing is you can control the accuracy; the more memory you’re prepared to give the algorithm, the fewer false positives you get. I once wrote a spell checker for a PDP-11 which stored a dictionary of 80,000 words in 16kbytes, and I very rarely saw it let though an incorrect word. (Update: I must have mis-remembered these figures, because they are not in line with the theory. Unfortunately, I can no longer read the 8” floppies holding the source, so I can’t get the correct numbers. Let’s just say that I got a decent sized dictionary, along with the spell checker, all in under 64k.)

Bloom filters are very simple. Take a big array of bits, initially all zero. Then take the things you want to look up (in our case we’ll use a dictionary of words). Produce ‘n’ independent hash values for each word. Each hash is a number which is used to set the corresponding bit in the array of bits. Sometimes there’ll be clashes, where the bit will already be set from some other word. This doesn’t matter.

To check to see of a new word is already in the dictionary, perform the same hashes on it that you used to load the bitmap. Then check to see if each of the bits corresponding to these hash values is set. If any bit is not set, then you never loaded that word in, and you can reject it.

The Bloom filter reports a false positive when a set of hashes for a word all end up corresponding to bits that were set previously by other words. In practice this doesn’t happen too often as long as the bitmap isn’t too heavily loaded with one-bits (clearly if every bit is one, then it’ll give a false positive on every lookup). There’s a discussion of the math in Bloom filters at www.cs.wisc.edu/~cao/papers/summary-cache/node8.html.

So, this kata is fairly straightforward. Implement a Bloom filter based spell checker. You’ll need some kind of bitmap, some hash functions, and a simple way of reading in the dictionary and then the words to check. For the hash function, remember that you can always use something that generates a fairly long hash (such as MD5) and then take your smaller hash values by extracting sequences of bits from the result. On a Unix box you can find a list of words in /usr/dict/words (or possibly in /usr/share/dict/words). For others, I’ve put a word list up at pragprog.com/katadata/wordlist.txt.

Play with using different numbers of hashes, and with different bitmap sizes.

Part two of the exercise is optional. Try generating random 5-character words and feeding them in to your spell checker. For each word that it says it OK, look it up in the original dictionary. See how many false positives you get.

Notes during development

An interface for a checker can be divided in two important phases:

  • populate
  • check

In the first phase we read the word list and generate the set or do any other preparation operation, while in the second phase we actually check if a word is part of the dictionary.

The naive implementation uses just a set as data structure, reading all the words from the words file.

Then I started to reason about the performances, and how do I compare the results given by a Naive checking against the results of the Bloom filter. So it would be interesting to have some unit tests which are able to tell me if things are gettings slower.

Implementation

The first idea to implement the Bloom filter was to generate an array of booleans and use it smartly. Unfortunately booleans are not supported in as types in the array, so we have to resort to use long integers and do some math with them.

Kata Six: Anagrams

Problem

Back to non-realistic coding this week (sorry, Martin). Let’s solve some crossword puzzle clues.

In England, I used to waste hour upon hour doing newspaper crosswords. As crossword fans will know, English cryptic crosswords have a totally different feel to their American counterparts: most clues involve punning or word play, and there are lots of anagrams to work through. For example, a recent Guardian crossword had:

Down: ..

  1. Most unusual form of arrest (6)

The hint is the phrase ‘form of,’ indicating that we’re looking for an anagram. Sure enough ‘arrest’ has six letters, and can be arranged nicely into ‘rarest,’ meaning ‘most unusual.’ (Other anagrams include raster, raters, Sartre, and starer)

A while back we had a thread on the Ruby mailing list about finding anagrams, and I’d like to resurrect it here. The challenge is fairly simple: given a file containing one word per line, print out all the combinations of words that are anagrams; each line in the output contains all the words from the input that are anagrams of each other. For example, your program might include in its output:

kinship pinkish enlist inlets listen silent boaster boaters borates fresher refresh sinks skins knits stink rots sort

If you run this on the word list here you should find 2,530 sets of anagrams (a total of 5,680 words). Running on a larger dictionary (about 234k words) on my OSX box produces 15,048 sets of anagrams (including all-time favorites such as actaeonidae/donatiaceae, crepitant/pittancer, and (for those readers in Florida) electoral/recollate).

For added programming pleasure, find the longest words that are anagrams, and find the set of anagrams containing the most words (so “parsley players replays sparely” would not win, having only four words in the set). Kata Objectives

Apart from having some fun with words, this kata should make you think somewhat about algorithms. The simplest algorithms to find all the anagram combinations may take inordinate amounts of time to do the job. Working though alternatives should help bring the time down by orders of magnitude. To give you a possible point of comparison, I hacked a solution together in 25 lines of Ruby. It runs on the word list from my web site in 1.5s on a 1GHz PPC. It’s also an interesting exercise in testing: can you write unit tests to verify that your code is working correctly before setting it to work on the full dictionary.

Kata Seven: How’d I Do?

Problem

The last couple of kata have been programming challenges; let’s move back into mushier, people-oriented stuff this week.

This kata is about reading code critically—our own code. Here’s the challenge. Find a piece of code you wrote last year sometime. It should be a decent sized chunk, perhaps 500 to 1,000 lines long. Pick code which isn’t still fresh in your mind.

Now we need to do some acting. Read through this code three times. Each time through, pretend something different. Each time, jot down notes on the stuff you find.

The first time through, pretend that the person who wrote this code is the best programmer you know. Look for all the examples of great code in the program. The second time through, pretend that the person who wrote this code is the worst programmer you know. Look for all the examples of horrible code and bad design. The third (and last) time though, pretend that you’ve been told that this code contains serious bugs, and that the client is going to sue you to bankruptcy unless you fix them. Look for every potential bug in the code.

Now look at the notes you made. What is the nature of the good stuff you found? Would you find similar good stuff in the code you’re writing today. What about the bad stuff; are similar pieces of code sneaking in to your current code too. And finally, did you find any bugs in the old code? If so, are any of them things that that you’d want to fix now that you’ve found them. Are any of them systematic errors that you might still be making today? Moving Forward By Looking Back

Perhaps you’re not like me, but whenever I try this exercise I find things that pleasantly surprise me and things that make me cringe in embarrassment. I find the occasional serious bug (along with more frequent less but serious issues). So I try to make a point of looking back at my code fairly frequently.

However, doing this six months after you write code is not the best way of developing good software today. So the underlying challenge of this kata is this: how can we get into the habit of critically reviewing the code that we write, as we write it? And can we use the techniques of reading code with different expectations (good coder, bad coder, and bug hunt) when we’re reviewing our colleagues code?

Kata Eight: Conflicting Objectives

Problem

Why do we write code? At one level, we’re trying to solve some particular problem, to add some kind of value to the world. But often there are also secondary objectives: the code has to solve the problem, and it also has to be fast, or easy to maintain, or extend, or whatever. So let’s look at that.

For this kata, we’re going to write a program to solve a simple problem, and we’re going to write it with three different sub-objectives. Our program is going do process the dictionary we used in previous kata, this time looking for all six letter words which are composed of two concatenated smaller words. For example:

al + bums => albums bar + ely => barely be + foul => befoul con + vex => convex here + by => hereby jig + saw => jigsaw tail + or => tailor we + aver => weaver

Write the program three times.

The first time, make program as readable as you can make it. The second time, optimize the program to run fast fast as you can make it. The third time, write as extendible a program as you can.

Now look back at the three programs and think about how each of the three subobjectives interacts with the others. For example, does making the program as fast as possible make it more or less readable? Does it make easier to extend? Does making the program readable make it slower or faster, flexible or rigid? And does making it extendible make it more or less readable, slower or faster? Are any of these correlations stronger than others? What does this mean in terms of optimizations you may perform on the code you write?

Kata Nine: Back to the CheckOut

Problem

Back to the supermarket. This week, we’ll implement the code for a checkout system that handles pricing schemes such as “apples cost 50 cents, three apples cost $1.30.”

Way back in KataOne we thought about how to model the various options for supermarket pricing. We looked at things such as “three for a dollar,” “$1.99 per pound,” and “buy two, get one free.”

This week, let’s implement the code for a supermarket checkout that calculates the total price of a number of items. In a normal supermarket, things are identified using Stock Keeping Units, or SKUs. In our store, we’ll use individual letters of the alphabet (A, B, C, and so on). Our goods are priced individually. In addition, some items are multipriced: buy n of them, and they’ll cost you y cents. For example, item ‘A’ might cost 50 cents individually, but this week we have a special offer: buy three ‘A’s and they’ll cost you $1.30. In fact this week’s prices are:

Item Unit Special Price Price


A 50 3 for 130 B 30 2 for 45 C 20 D 15

Our checkout accepts items in any order, so that if we scan a B, an A, and another B, we’ll recognize the two B’s and price them at 45 (for a total price so far of 95). Because the pricing changes frequently, we need to be able to pass in a set of pricing rules each time we start handling a checkout transaction.

The interface to the checkout should look like:

co = CheckOut.new(pricing_rules) co.scan(item) co.scan(item)

   :

price = co.total

Here’s a set of unit tests for a Ruby implementation. The helper method price lets you specify a sequence of items using a string, calling the checkout’s scan method on each item in turn before finally returning the total price.

class TestPrice < Test::Unit::TestCase

def price(goods) co = CheckOut.new(RULES) goods.split(//).each { |item| co.scan(item) } co.total end

def test_totals assert_equal( 0, price(“”)) assert_equal( 50, price(“A”)) assert_equal( 80, price(“AB”)) assert_equal(115, price(“CDBA”))

assert_equal(100, price(“AA”)) assert_equal(130, price(“AAA”)) assert_equal(180, price(“AAAA”)) assert_equal(230, price(“AAAAA”)) assert_equal(260, price(“AAAAAA”))

assert_equal(160, price(“AAAB”)) assert_equal(175, price(“AAABB”)) assert_equal(190, price(“AAABBD”)) assert_equal(190, price(“DABABA”)) end

def test_incremental co = CheckOut.new(RULES) assert_equal( 0, co.total) co.scan(“A”); assert_equal( 50, co.total) co.scan(“B”); assert_equal( 80, co.total) co.scan(“A”); assert_equal(130, co.total) co.scan(“A”); assert_equal(160, co.total) co.scan(“B”); assert_equal(175, co.total) end end

There are lots of ways of implementing this kind of algorithm; if you have time, experiment with several. Objectives of the Kata

To some extent, this is just a fun little problem. But underneath the covers, it’s a stealth exercise in decoupling. The challenge description doesn’t mention the format of the pricing rules. How can these be specified in such a way that the checkout doesn’t know about particular items and their pricing strategies? How can we make the design flexible enough so that we can add new styles of pricing rule in the future?