
I don't know any OCaml, so I'll try to do as much as possible in Python. SKS uses some unix sockets; perhaps this is also the best way to make the Python and OCaml components talk to each other.

  • The testserver and testclient components can be altered to use unix sockets. [done]
  • Python manages synchronization connections and forwards them to the unix sockets. [done]
  • After synchronization, the acquired sets must somehow be transmitted to Python.
  • Python needs access to the prefix tree (add/delete hashes). I think locking is needed; for now, Python will take care of this: alter testadd.ml to listen on a unix socket for hashes to be added, and let Python use this socket only while no synchronization socket is in use.
  • Problem: currently, if I run the server in one program instance, add a key there, and then connect with another program instance as client, the key is not transmitted. The in-memory copy of the ptree does not seem to know that the on-disk copy changed. There is some "catchup" stuff in the OCaml code: when a key is added, only the db component knows this at first; the db component keeps a log that the recon component can query from time to time. It is probably best to combine testadd/testserver/testclient into one component sharing one ptree instance and event loop, and to communicate with this component via multiple unix sockets.
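For illustration, this is roughly how the Python side could push hashes through such a unix socket. A minimal sketch only: the socket path and the newline-terminated hex protocol are assumptions, not the actual testadd.ml interface.

import socket

def add_hashes(hashes, socket_path="add.sock"):
    # connect to the unix domain socket provided by the OCaml component
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(socket_path)
    try:
        for h in hashes:
            s.sendall(h + "\n")  # h: hex-encoded hash (assumption)
    finally:
        s.close()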

Python components

  • a server waiting for synchronization attempts from other servers [reconserver.py]
  • a server that provides a search API and a submission API [apiserver.py]
  • a component that deletes old database entries and regularly connects to other servers for synchronization [schedule.py, using cleanup.py and reconclient.py]

Both reconserver.py and reconclient.py will receive a list of hashes from the other server and need to process it. This common functionality will be factored out into hashreceive.py.

Problem: when the OCaml component does the synchronization, the hashes need to be transferred to the Python component. This is currently realized through the "hashes.ocaml2py.sock" unix domain socket provided by the Python component. If two Python programs run simultaneously, only one of them can provide the socket, so there should be only one Python server handling both the tasks of reconserver.py and schedule.py. I thought it would be nice to have a reconserver.py executable that can run permanently and a reconclient.py executable that can be run manually or used by schedule.py. However, it seems that running reconclient.py manually while reconserver.py is running will not be possible - but never mind.

One thing I noticed: the OCaml event loop can only run one callback at a time; it does not fork or create threads. In fact, I can't find the words "thread" or "fork" anywhere in the SKS source. I'll ignore this for the proof of concept, but it has the effect that connecting clients have to wait until the server has processed the other requests.

Another thing I learned: only one process at a time may use the database. If I run the testtrie program and call the testlist program (or in fact any program containing an init_db line) while it is running, I keep getting Bdb.DBError("PANIC: fatal region error detected; run recovery"). So I deleted everything but the testtrie program.

Current state

Look at the README to learn how to set up the test environment.

The next step is to add our own database (the BDB is only used to store the prefix tree containing the hashes) and to properly implement the ideas described in Planning. Tip: use sqlite-manager for testing.

To verify captcha signatures, we need a public key cryptography library for Python. There are python-m2crypto, python-openssl, python-gpgme and python-paramiko. I think python-paramiko is the easiest one to use. It is an SSH implementation that also provides functions to load a public key and verify a signature with it, which is exactly what we need. Later, we can perhaps even use it to make the authentication process for synchronization with other servers more secure.

Example of how to use paramiko with a key pair generated with ssh-keygen:

import base64, paramiko

def get_public_key(path="key.pub"):
    f = open(path)
    content = f.read()
    f.close()

    # the first two words of an OpenSSH public key file are the key type
    # and the base64 encoded key data
    keytype, key = content.split()[:2]

    if keytype != "ssh-rsa":
        raise ValueError("Need keytype ssh-rsa.")

    data = base64.decodestring(key)
    public_key = paramiko.RSAKey(data=data)

    return public_key

def get_private_key(path="key"):
    private_key = paramiko.RSAKey(filename=path)
    return private_key

def sign(private_key, text):
    sig_message = private_key.sign_ssh_data(paramiko.randpool, text)
    sig_message.rewind()

    # the SSH signature blob consists of the key type followed by the
    # raw signature
    keytype = sig_message.get_string()
    assert keytype == "ssh-rsa"

    signature = sig_message.get_string()

    return signature

def verify_signature(public_key, signature, text):
    # rebuild the SSH signature blob that verify_ssh_sig expects
    sig_message = paramiko.Message()
    sig_message.add_string("ssh-rsa")
    sig_message.add_string(signature)
    sig_message.rewind()

    return public_key.verify_ssh_sig(text, sig_message)

private_key = get_private_key()
public_key = get_public_key()

text = "hello"
signature = sign(private_key, text)
print len(signature)
print verify_signature(public_key, signature, text)

Progress update

By now, a test database entry can be created with test.py (which runs a small webserver with a webfinger profile and retrieves the entry from this server) and transmitted from one user directory server to the other (see README). Only entries with a valid captcha signature are accepted. Taking control samples is not implemented yet; that's the next step.

Before that, the plan needs to be extended: see the last item in Planning. This is done now.

I noticed that it is too much work to maintain the object-relational mapping myself, because the database structure changes very often at this stage of development. That's why I rewrote the database modules to use SQLAlchemy before proceeding.

Violations must be saved to a log file to be able to decide whether a partner must be kicked. A partner must be kicked only if there are too many violations compared with the overall traffic with it. So we also need a log that keeps track of how many entries have been transmitted in a certain period of time (a simple counter is not sufficient for this). This is done by the Conversations class in partners.py.
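The idea behind such a time-windowed counter could look like this minimal sketch (the actual Conversations implementation in partners.py may differ; the window length is made up):

import time, collections

class Conversations:
    # a plain counter cannot answer "how many entries in the last N
    # seconds", so keep timestamped events and expire old ones
    def __init__(self, window=3600 * 24 * 30):  # made-up window: 30 days
        self.window = window
        self.events = collections.deque()

    def register(self, count=1):
        self.events.append((time.time(), count))

    def total(self):
        cutoff = time.time() - self.window
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()
        return sum(count for timestamp, count in self.events)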

Now, the checks for violations must be implemented. I want to divide them into "offenses" and "violations". An offense is something that is not sufficient to kick a partner, but offenses accumulate. A violation is a reason to kick a partner immediately. Every time an offense is added for a partner, it is checked whether the number of offenses reaches a certain threshold; if it does, a violation is added. Each time a violation is added to a partner, an exception is thrown that breaks the execution of the process_hashes function.
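A minimal sketch of this mechanism (the class and method names are hypothetical and the threshold value is made up):

OFFENSE_THRESHOLD = 10  # made-up threshold

class ViolationException(Exception):
    """ Raised when a violation is added; aborts process_hashes. """

class Partner:
    def __init__(self, name):
        self.name = name
        self.offense_count = 0

    def add_offense(self, offense):
        self.offense_count += 1
        # offenses accumulate; enough of them become a violation
        if self.offense_count >= OFFENSE_THRESHOLD:
            self.add_violation("too many offenses")

    def add_violation(self, reason):
        # log the violation, then break the calling code
        raise ViolationException("%s: %s" % (self.name, reason))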

Now, the violations and offenses are implemented and process_hashes performs the checks, but most of this is untested. The test script must be enhanced: several servers should be able to run simultaneously on different ports, with different databases, but on the same machine and within the same copy of the program.

It would be nice to have a Trieserver class with the following interface:

  • a Trieserver(db_path="PTree") constructor that launches the trieserver process with the command line argument db_path
  • a synchronise_with_server(socket) function, returning the hashes we need to fetch from the Entryserver
  • a synchronise_with_client(socket) function, returning the hashes we need to fetch from the Entryserver
  • an add(hashes) function to add hashes to the trie
  • a close() function that terminates the trieserver process. Exit functions should be registered, too, because Python will not automatically terminate processes created with the subprocess module.

The Trieserver class is responsible for the client/server/add/hashes sockets; their names need to be prefixed with db_path to avoid ambiguity.
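A minimal sketch of such a class, assuming the trieserver executable takes the database path as its first argument (the executable path, the socket naming scheme and the wire protocol are assumptions):

import subprocess, atexit, socket

class Trieserver:
    def __init__(self, db_path="PTree"):
        self.db_path = db_path
        # launch the OCaml trieserver process, which provides the unix
        # domain sockets prefixed with db_path
        self.process = subprocess.Popen(["./trieserver", db_path])
        # python does not terminate subprocess children automatically
        atexit.register(self.close)

    def _socket(self, suffix):
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.connect(self.db_path + "." + suffix)
        return s

    def add(self, hashes):
        s = self._socket("add")
        try:
            for h in hashes:
                s.sendall(h + "\n")  # protocol is an assumption
        finally:
            s.close()

    def synchronise_with_server(self, sock):
        # forward the synchronization connection to the OCaml component
        # and collect the hashes to fetch from the Entryserver
        raise NotImplementedError

    def synchronise_with_client(self, sock):
        raise NotImplementedError

    def close(self):
        if self.process is not None:
            self.process.terminate()
            self.process.wait()
            self.process = None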

Then sduds.py should provide a class "sduds" with the following interface:

  • a sduds(synchro_port=20000, entryserver_port=synchro_port+1, ptreedb="PTree", entriesdb="entries.sqlite", partnersdb="partners.sqlite") constructor, creating a trieserver property
  • a connect_to_server(server) function
    • set up socket, authenticate
    • hashes = self.trieserver.synchronise_with_server(socket)
    • self.fetch_entries_from_partner(hashes, server)
  • a run_entryserver() function
  • a run_synchronisation_server() function
    • listen for connections on synchro_port
    • for each connection, start a thread: self.handle_client(clientsocket)
  • a handle_client(clientsocket) function:
    • check authentication, identify client
    • hashes = self.trieserver.synchronise_with_client(clientsocket)
    • self.fetch_entries_from_partner(hashes, client)
  • a fetch_entries_from_partner(hashes, partner) function, which fetches the entries from the other entryserver and does the checking
  • a fetch_entry_from_profile() function, adding or updating an entry by retrieving the information from the webfinger profile
  • the class should handle access to the partners db somehow
  • and finally, a close() function
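A skeleton along these lines, as a sketch only (the entryserver, the partners db handling and error handling are omitted; Trieserver is the class sketched above and the authenticate helper is hypothetical):

import socket, threading

class sduds:
    def __init__(self, synchro_port=20000, entryserver_port=None,
                 ptreedb="PTree", entriesdb="entries.sqlite",
                 partnersdb="partners.sqlite"):
        # entryserver_port defaults to synchro_port+1
        if entryserver_port is None:
            entryserver_port = synchro_port + 1
        self.synchro_port = synchro_port
        self.entryserver_port = entryserver_port
        self.trieserver = Trieserver(db_path=ptreedb)

    def connect_to_server(self, server):
        sock = socket.create_connection(server)  # server: (host, port)
        # ... authenticate ...
        hashes = self.trieserver.synchronise_with_server(sock)
        self.fetch_entries_from_partner(hashes, server)

    def run_synchronisation_server(self):
        listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        listener.bind(("", self.synchro_port))
        listener.listen(5)
        while True:
            clientsocket, address = listener.accept()
            # one thread per connection
            threading.Thread(target=self.handle_client,
                             args=(clientsocket,)).start()

    def handle_client(self, clientsocket):
        client = self.authenticate(clientsocket)  # hypothetical helper
        hashes = self.trieserver.synchronise_with_client(clientsocket)
        self.fetch_entries_from_partner(hashes, client)

    def fetch_entries_from_partner(self, hashes, partner):
        # fetch the entries from the partner's Entryserver and check them
        raise NotImplementedError

    def close(self):
        self.trieserver.close()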

Then, a test script can create many instances of sduds, run servers, add entries to these servers, and let the servers fetch entries from each other. It must provide a web server that serves valid webfinger profiles.

Now this is done. There is also a test script that runs two instances of the program, adds an entry to one instance, conducts a synchronization and checks whether the entry was transmitted to the other instance.

I added some more tests to verify that synchronization works in both directions with many entries, and that the most important violations/offenses work.

current work on branch "entry-deletion"

Now, I want to properly implement address submission. If a profile has changed, a new entry must be created and the old one must be removed. If a profile has vanished, the entry must be deleted. When an entry is deleted, its hash must be added to a deleted list and removed from the trie. Now, if we synchronize with another server that does not yet know that the entry must be deleted, the algorithm will notice that we lack a hash. We check whether this hash is in our deleted list, and if it is, we tell the other server that it should delete the entry, too (push a deletion request via http).
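As a rough sketch of this bookkeeping (all names are hypothetical; written as methods of the server class):

import time

def delete_entry(self, entry):
    # remember the hash (with a timestamp) so that later synchronizations
    # can recognize it as deliberately deleted
    self.deleted_hashes[entry.hash] = time.time()
    self.trieserver.delete([entry.hash])
    self.entries.remove(entry)

def process_missing_hash(self, partner, h):
    # the synchronization noticed that we lack this hash; if it is in
    # our deleted list, we deleted it on purpose, so instead of fetching
    # the entry we push a deletion request to the partner via http
    if h in self.deleted_hashes:
        partner.push_deletion_request(h)
    else:
        self.fetch_entry_from_partner(partner, h)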

We need an additional communication channel for the transmission of entries that are to be deleted. This channel needs to be authenticated. So we need authentication in both directions, but until now there are only passwords for one direction. I think I need to rewrite the partners module to take this into account, and write one overall web server that does the job EntryServer does now, receives to-be-deleted notifications after synchronization, provides an address submission interface, and finally provides a search interface.

Plan for authentication: to be able to use http, I keep the authentication with simple passwords and do not use paramiko. To simplify partnering, I'll also implement a partnering interface in the web server later. The synchronization port will no longer be saved in the database but will be queried from the web server.

Part of this is done, but the problem is that when we tell the other server that an entry should be deleted, it must be able to verify that this operation is legal. For this purpose, it must already know the new entries that will override the deleted ones. Thus, we must wait until the other side has already fetched the entries, which is not easily possible. It's perhaps better to run one SocketServer that a client can connect to. The whole synchronization process as described in Planning should be conducted by this SocketServer, except the part involving the hashtrie. I have to think about this:

These are the types of operations one partner can send to another one:

  • add new address
  • remove an address
  • update the entry for an address

Either the sender must make clear whether an entry should be updated or removed, or the receiver must conclude it from the existence/absence of a new entry. I decided on the receiver.

Algorithm (sketch):

  • we have a list of hashes that must be added and a list of hashes that must be deleted: addhashes, deletehashes
  • retrieve the entries for addhashes from the partner
  • add the entries to the database, overwriting existing webfinger_addresses if their submission timestamp is older, and taking control samples
  • remove the remaining entries from deletehashes, also taking control samples

better solution

The above-mentioned problem arises only because we don't know whether the entries have already been retrieved at the moment the deletion is pushed. If we find a way for both partners to know from each other that the deletion requests have been transmitted, the partners can wait for this event before they start retrieving the entries.

So a nice interface would be:

  • add_hashes, delete_hashes = connect(partner)
  • add_hashes, delete_hashes = listen(partner)

A server must know add_hashes before it can send delete_hashes to the partner. But after link_sockets is called, we can't use the connection to the partner anymore. So we need two separate connections for this task: one control line and one line that is directly forwarded to the OCaml component (after authentication).

connect and listen can be methods of SDUDS.

This step will need a lot of refactoring [branch WIP-refactor]. Now, the tests are passing again.

even better solution

Now, another alternative crossed my mind: the traffic of the synchronization socket can be tunneled through the control socket in a way that lets us know when the synchronization socket is closed, for example by dividing it into packets of known length. This allows me to get the hashes synchronously for both server and client, so only one channel is needed for synchronization. This works now. I can proceed with implementing entry deletion.
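A sketch of such a framing, assuming a big-endian 16-bit length prefix and a zero-length packet as end-of-stream marker (both assumptions; the real packet format may differ):

import struct

def tunnel_send(control_sock, data):
    # wrap a chunk of synchronization traffic in a length-prefixed packet
    control_sock.sendall(struct.pack("!H", len(data)) + data)

def tunnel_close(control_sock):
    # a zero-length packet tells the other side that the synchronization
    # socket was closed
    control_sock.sendall(struct.pack("!H", 0))

def tunnel_recv(control_sock):
    def read_exact(n):
        # read exactly n bytes from the control socket
        data = ""
        while len(data) < n:
            chunk = control_sock.recv(n - len(data))
            if not chunk:
                raise IOError("control connection closed prematurely")
            data += chunk
        return data

    (length,) = struct.unpack("!H", read_exact(2))
    if length == 0:
        return None  # the tunneled synchronization socket was closed
    return read_exact(length)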

Once again - the algorithm:

  • we have a list of hashes that must be added and a list of hashes that must be deleted: addhashes, deletehashes
  • retrieve the entries for addhashes from the partner
  • add the entries to the database, overwriting existing webfinger_addresses if their submission timestamp is older, and taking control samples
    • what to do if the submission timestamp of our own entry is more recent?
      • First case: the older entry is listed in the removed hashes list. This will not happen, since removed hashes are already filtered out of add_hashes.
      • Second case: it is not in this list. This might happen, for example, if a user submits his profile to multiple servers (which he shouldn't do, as this causes unnecessary synchronization traffic). The right thing is to discard the entry in every case: every entry represents a submission event, so we only need the most recent one. It is not necessary to notify the other server of this incident, as it will be informed about the newer entry automatically.
      • SHORT ANSWER: simply ignore it.
      • I'll implement this directly in Entrylist.save.
  • remove the remaining entries from deletehashes, also taking control samples (for this, a deleted hash also needs a retrieval timestamp)
    • go through delete_hashes
    • for some hashes:
      • get entry for this hash from database to have the webfinger address, then try to load the profile from this address
      • if it succeeds, there are two cases:
        • Case one: the hash changed. This means that the partner told us to delete the hash, but there is a new profile that the partner didn't tell us about (if it had told us, the old hash would have been deleted from delete_hashes when the new profile was added to the database). If the retrieval timestamp of the deletion is not too old, we can expect the partner to know the current state of the webfinger profile, so register a DeleteChangedOffense for the partner. Now that we know the current state of the profile, we delete the old entry anyway and add the new one.
        • Case two: the hash is the same as before. This means that the partner sent an invalid deletion request, so register a DeleteExistingOffense for the partner.
      • Set Offense.guilty to true only if the retrieval timestamp is not too old. If it is too old, the administrator should be notified that the chain of directory servers is too long.

Important: also count DeleteChangedOffense and DeleteExistingOffense only once per webfinger address
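A sketch of the whole procedure (the entry store, the offense classes and the retrieve_profile helper are hypothetical; the age threshold and sample rate are made up, and the once-per-webfinger-address deduplication is omitted):

import random, time

MAX_AGE = 3600 * 24 * 3  # made-up threshold for "too old" timestamps
SAMPLE_RATE = 0.1        # made-up control sample probability

def process_delete_hashes(self, partner, delete_hashes):
    # delete_hashes: list of (hash, retrieval_timestamp) tuples
    for h, retrieval_timestamp in delete_hashes:
        recent = time.time() - retrieval_timestamp < MAX_AGE
        if random.random() < SAMPLE_RATE:  # take a control sample
            entry = self.entries.get(h)
            profile = retrieve_profile(entry.webfinger_address)  # hypothetical
            if profile is not None:
                if profile.hash == h:
                    # case two: profile unchanged, invalid deletion request
                    partner.add_offense(DeleteExistingOffense(guilty=recent))
                    continue  # keep the entry
                # case one: the partner withheld a newer profile from us
                partner.add_offense(DeleteChangedOffense(guilty=recent))
                self.entries.add(profile)  # add the current state
        self.entries.delete(h)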

This is done, but not tested enough.

For the next steps, I have two ideas:

move the tunnel function to the OCaml component

... using threads. Forking would probably be problematic, as it would create a copy of the whole process memory, including the database cache for the trie.

I succeeded in writing a test program in OCaml that uses a thread to take data from stdin, manipulate it and write it to another channel, while the main program (Server.handle/Client.handle) keeps running and reads from this channel. I can rewrite trieserver so that it doesn't use any unix domain sockets anymore.

Now, this is done.

unify process_hashes_from_partner and process_submission using a Queue

... which consumes

  • tuples containing a profile address and submission timestamp
  • or tuples containing a profile address, an offline copy of the profile (or the claim that it does not exist anymore), and the partner who claims that this copy/claim is valid.

During this, I discovered a problem that must be resolved: with the current plan, anyone could resubmit all addresses in the database, causing an immense amount of synchronization traffic. So we need a way to verify that it was the profile owner who resubmitted his profile, and it would also be nice to have a way to prove to the partners that the owner wants his profile resubmitted.

Solution:

  • the profile itself contains a field submission_timestamp
    • A problem is that there is no motivation for the profile owner to keep his submission_timestamp constant over a longer period of time, even if the content did not change. But I think it is difficult to solve the problem without relying on this behaviour of the profile owner.

Another problem: to prevent DoS, submissions from the same address must be rejected if they occur too often. The submission timestamp is not suitable for this check, because a user can first submit his profile with a very old submission timestamp and then submit it many times, increasing the submission timestamp in large enough intervals up to the present.

Possible solution: keep an additional backlog containing only (timestamp, webfinger_address) tuples. But this won't help much - a bad user can still resubmit all profile addresses, which will cause a lot of traffic (but no synchronization traffic, which is resolved by the above). I think there is no way to prevent this; the proper way to handle it might be to set a maxsize for the queue and reject submissions when the queue is full. This should be done in a way that does not disturb synchronization. Idea: two queues, and one loop that works on them, preferring the synchronization queue.
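A minimal sketch of the two-queue idea (the maxsize bound is made up and process() is a hypothetical job handler):

import Queue

synchronization_queue = Queue.Queue()
submission_queue = Queue.Queue(maxsize=500)  # full queue rejects submissions

def worker():
    # one loop serving both queues, preferring the synchronization queue
    # so that a flood of submissions cannot disturb synchronization
    while True:
        try:
            job = synchronization_queue.get_nowait()
        except Queue.Empty:
            try:
                job = submission_queue.get(timeout=1.0)
            except Queue.Empty:
                continue  # re-check the synchronization queue
        process(job)  # hypothetical job handler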

The queue is now implemented.

Next steps:

  • I made a mistake calculating the offense percentage - we need to divide by the number of control samples taken, not by the total number of received entries (fixed)
  • Finally provide a web API (and user interface) to submit webfinger addresses (done)
  • provide a web API (and user interface) for searching (done, but very rudimentary and slow)
  • make a job scheduler for synchronization and expiration (done, but untested)
  • At some point, I will need to refactor again - perhaps using a "State" class that reflects better what is to be done in the worker than Entry/None does (done)

After the refactoring

Now I'm content with the interfaces of the classes defined in context.py and all lower-level classes. The sduds.py code might need to be improved later.

The next steps:

  • split lib.py into separate files (done)
  • split CLIs and library so that the test files can be named like the modules they test (done)
  • write documentation for lib using sphinx (done)
  • write tests for lib (done)
  • rename trieserver/trieserver -> trie_manager/manager (done)
  • write a new sduds.lib.communication module with a recvall function, as socket.makefile is not compatible with timeouts (see the sketch after this list). The _read and _write functions from synchronization.py can also be moved there. (done)
  • add cache for control samples (done)
  • rename Partner -> Peer, submission queue -> retrieval_queue
  • write a document that gives an overview of the layers
  • go through the layers, from the lowest (trie_manager) to the nearly-highest (context.py); for each layer:
    • replace use of socket.makefile().read() with sduds.lib.communication.recvall() and document usage of the communication module in doc/source/lib.rst
    • document the layer
    • write tests for the layer, fix bugs
  • TODO: normalize webfinger addresses (e.g. upper/lower case)
  • write integration tests, fix bugs
  • merge into develop branch
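The recvall function mentioned above could look like this sketch; unlike the file object returned by socket.makefile(), plain recv() calls honor sock.settimeout(), so the loop raises socket.timeout if the peer stalls:

def recvall(sock, length):
    # read exactly `length` bytes from the socket, respecting timeouts
    data = ""
    while len(data) < length:
        chunk = sock.recv(length - len(data))
        if not chunk:
            raise IOError("connection closed prematurely")
        data += chunk
    return data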

Link to HTML documentation (WIP)

Long-term TODO list