
Multi-reference alignment and classification in real space

Overview

The procedure is rather dated but still functional; hence, the documentation is still included.

Multiple correlation optimization using mcoEXMX.py or mcoAC.py

Another option is to use a classification algorithm based on correlation to different class averages (multiple correlation optimization, MCO). The goal is to find a distribution of particles that optimises the correlation to a set of different references.

MCO comes in two different flavors: the simple version is a local, gradient-like optimization (expectation maximization) akin to k-means classification (MCO-EXMX). Alternatively, stochastic elements (simulated annealing) can be incorporated into the classification procedure (MCO-AC) (Hrabe et al. 2012).

In short, the idea is to group the subtomograms such that they match their class average best. In an iterative procedure the assignment of particles is changed such that the (constrained) correlation of each class average to its respective class members gets optimized (using the constrained correlation ensures that the missing wedge is taken into account in the similarity measure). The simplest optimization algorithm is expectation maximization (EXMX), which is also the basis for subtomogram alignment. This algorithm converges rapidly, but it can end up in local minima. Therefore, in the AC algorithm we designed an 'outer loop' that performs EXMX multiple times. At the start of each iteration the class assignments are perturbed and then optimized using EXMX. The new assignments are accepted or rejected according to the so-called Metropolis criterion, which depends on a pseudo-temperature. Thus, MCO-AC is a (computationally much more demanding) extension of MCO-EXMX.
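As an illustration of the acceptance step described above, the following minimal sketch (plain Python, not the PyTom implementation; the score and temperature arguments are illustrative assumptions) shows how a Metropolis criterion decides whether a perturbed class assignment is kept:

import math
import random

def metropolis_accept(old_score, new_score, temperature):
    # Higher scores are better. A better assignment is always accepted; a worse one
    # is accepted with a probability that shrinks as the score difference grows and
    # as the pseudo-temperature is lowered.
    if new_score >= old_score:
        return True
    return random.random() < math.exp((new_score - old_score) / temperature)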

The methods are called using the scripts mcoEXMX.py or mcoAC.py like this:

mpirun --hostfile "pathToYourHostfile" -c "numberOfCPUs" pytom PathToPytom/bin/mcoAC.py -j class.xml
The job file 'class.xml' can be created using the PyTom user interface (see below). Make sure that all particles in the particle list have either the same class label (e.g., <Class Name="0"/>) or no class property at all. If the list contains more than one class, the program will take this as the a priori distribution and continue from there.
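To verify the class labels before starting a job, you can inspect the particle list from ipytom. This is a minimal sketch; the file name is an assumption and only methods used elsewhere on this page are called:

$ ipytom
from pytom.basic.structures import ParticleList

pl = ParticleList()
pl.fromXMLFile('particleList.xml')  # assumed name of your particle list

# collect all class labels; the set should contain a single label such as '0'
# (particles without a class property may show up as None)
labels = set()
for i in range(len(pl)):
    labels.add(pl[i].getClassName())
print(labels)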

Parameters for MCO-EXMX and MCO-AC

Required parameters for classification jobs are:

  • Particle List. A particle list in XML format that points to all subtomograms.
  • Mask. The mask may be spherical or non-symmetrical. To focus alignment on a particular region, you may use a mask of arbitrary shape. Spherical masks can be generated using PyTom following these instructions.
  • Number of local refinement rounds (MCO-EXMX): the number of EXMX iterations that will be run.
  • Number of classes: The initial number of classes used for classification. The final number of classes can be different because classes can be depopulated.
  • Convergence criterion: If the fraction of particles that change their class assignment drops below this value (0 ≤ value ≤ 1), classification will terminate (see the sketch after this list).
  • Temperature type (MCO-AC only): The annealing schedule, i.e. how the pseudo-temperature is decreased over the iterations.
  • Classification criterion (MCO-AC only): Either the Metropolis criterion (a worse class assignment is accepted with a probability that depends on the pseudo-temperature) or Threshold Acceptance (a worse assignment is accepted if its score deteriorates by less than the current threshold).
  • Initial temperature (MCO-AC only): The pseudo-temperature at the start of annealing; higher values make acceptance of worse class assignments more likely.
  • Annealing step (MCO-AC only): The amount by which the pseudo-temperature is decreased in each iteration.
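As referenced for the convergence criterion above, the classification stops once the fraction of particles that changed class between two iterations falls below the threshold. A minimal sketch (plain Python, not the PyTom implementation; the label lists are illustrative assumptions):

def has_converged(old_labels, new_labels, end_threshold):
    # fraction of particles whose class label changed between two iterations
    changed = sum(1 for old, new in zip(old_labels, new_labels) if old != new)
    return float(changed) / len(old_labels) < end_threshold

# example: 3 of 100 particles changed class and the threshold is 0.05, so classification stops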
### Output of MCO-EXMX and MCO-AC

We start with the simpler output of MCO-EXMX. Inside the destination directory, you will find:
  • RandomisedParticleList.xml: this is the initial random class distribution determined prior to starting the classification
  • directories 0,1,2, ... N: these are the directories for each EXMX iteration. Inside each of these directories you will find:
    • directories class0, class1, ..., classC: these C (=number of classes) directories contain class averages determined according to the current class distribution.
    • AlignmentList-1.xml, AlignmentList-2.xml, ... AlignmentList-N.xml: AlignmentLists after each iteration. These AlignmentLists contain the class assignments.
    • Scores determined for each subtomogram with the respective class average
    • classifiedParticles.xml: Particle list with class labels determined in the current EXMX iteration. Class averages will be created at the start of the next iteration.

Output of MCO-AC in the destination directory:

  • directories 0,1,2, ..., N_anneal : These are the directories for each annealing round.
    Inside of each directory there is the content of an MCO-EXMX run (see above).
  • swapList.xml: List of class swaps induced by annealing.
  • currentBestParticleList.xml: The best-scoring class assignment found so far. During the annealing iterations the current assignment may temporarily become worse, so the best one encountered is kept here.

MCO classification on the tutorial data

The tutorial shows how to apply both types of classification to the tutorial data: particles are classified directly after localization with MCO-EXMX (in the postLocalizationClassification folder) and after alignment with MCO-AC. Please note that this order was chosen based on our standard workflow but can be swapped for individual needs.
  • MCO-EXMX
    The postLocalizationClassification folder contains all scripts and files required to run a coarse classification on the tutorial particles. As specified in job.xml, it will classify all particles based on the orientations determined during localization. It will split the whole data set into 5 classes (NumberClasses) and run through 10 iterations (NumberIterations). It will stop before the 10th iteration if fewer than 5% of all particles (EndThreshold = 0.05) have changed their class assignment.
    Please note that the initial cluster assignments are distributed randomly, so results are not necessarily reproducible. If you want to reproduce results, you must start from the RandomisedParticleList.xml in the results directory; classification progress is deterministic from then on.
    The mergeClasses.sh script demonstrates how to merge multiple classes back into one. Running all commands in this script will ultimately display a similarity tree, from which you can determine which classes should be merged and used for further resolution refinement. Please also follow the scripts at the bottom of this page.
  • MCO-AC
    The mcoAClassification folder contains the classification scripts and results obtained after the particles previously classified with MCO-EXMX were aligned. As mentioned above and in Hrabe et al. 2012, MCO-AC is essentially an extended version of MCO-EXMX. Classification is set to 3 annealing iterations (NumberIterations), meaning that each of three rounds of MCO-EXMX is followed by an annealing step in which particles randomly change class labels. This stochastic process is controlled (at the bottom of job.xml) by AnnealingTemperature, where the temperature and the annealing step are specified. The most important property of AnnealingCriterion is AllowClassCollapse. The random assignment may cause extinction, i.e. no particles are assigned to one of the classes. If you set AllowClassCollapse to True, you allow this to happen. If it is set to False, the algorithm will repeat the random assignment up to 10 times in case a class goes missing; if the class is still empty after 10 tries, the algorithm will continue with one class fewer (a minimal sketch of this retry behaviour follows below). While MCO-AC is running, you will see result folders come and go. These folders store intermediate MCO-EXMX results and are deleted after each iteration. The real MCO-AC result is currentBestParticleList.xml, which contains the best class assignment found so far.
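The following minimal sketch (plain Python, not the PyTom implementation; names are illustrative assumptions) outlines the retry behaviour described above: a random perturbation of the class labels is repeated up to 10 times if it would leave a class empty, unless AllowClassCollapse permits the collapse.

import random

def perturb_class_labels(labels, number_of_classes, allow_class_collapse=False, max_tries=10):
    # randomly reassign a subset of particles; retry if a class would end up empty
    for attempt in range(max_tries):
        new_labels = list(labels)
        for index in random.sample(range(len(labels)), max(1, len(labels) // 10)):
            new_labels[index] = random.randrange(number_of_classes)
        if allow_class_collapse or len(set(new_labels)) == number_of_classes:
            return new_labels
    # every retry left a class empty: continue with one class fewer
    return new_labels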

Merging classes

It is a common problem of classification algorithms that they tend to ignore small classes. A solution to this problem is to oversample the class number in the classification: e.g., if ~4 classes are expected to be present in the dataset, one may initially group the dataset into ~20 classes, which increases the chance that a sparse class will be distinguished rather than fused into a more populated class. Of the ~20 classes, many will then be essentially identical and can subsequently be merged again.

In order to merge classes by hierarchical clustering, run the following commands.

  1. Average: Generate class averages with this script:
    $ ipytom
    from pytom.basic.structures import ParticleList

    pl = ParticleList()
    pl.fromXMLFile('bestAfter31Its.xml')
    pl.sortByClassLabel()
    pls = pl.splitByClass()

    for i in xrange(len(pls)):
        c = pls[i]
        print i, ' == classlabel : ', c[0].getClassName(), '    Class size: ', len(c)
        # write one class average per class into the classAverages directory
        c.average('classAverages/class' + str(i) + '.em')

    # collect the class averages into a particle list for the next step
    classAveragesList = ParticleList('classAverages')
    classAveragesList.loadDirectory()
    classAveragesList.toXMLFile('classAverageList.xml')

  2. CCC: Determine a correlation matrix between all class averages, which can then be used for hierarchical clustering. Copy the code below into a script and run it either in parallel or sequentially.
    $ ipytom
    from pytom.cluster.correlationMatrixStructures import CorrelationMatrixJob
    from pytom.basic.structures import ParticleList,Mask,Particle
    from pytom.cluster.correlationMatrix import distributedCorrelationMatrix
    import os

    pl = ParticleList()
    pl.fromXMLFile('classAverageList.xml')

    mask = Mask('./mask.em')
    job = CorrelationMatrixJob(particleList=pl, mask=mask, resultMatrixName='correlationMatrix.em',
                               applyWedge=False, binningFactor=2, lowestFrequency=0, highestFrequency=17)
    distributedCorrelationMatrix(job, False)

  3. CCC to ascii: Convert the matrix from EM format to a text file that can be imported into R:
    $ ipytom
    from pytom_volume import read
    from pytom.tools.files import writeMatrix2RMatrix
    m = read('correlationMatrix.em')
    writeMatrix2RMatrix(m,'correlationMatrix-R.txt')
    
  4. Clustering: Perform hierarchical classification with scipy using the following commands:
    $ ipytom
    from pytom.basic.files import read
    from pytom_volume import abs
    from pytom_numpy import vol2npy
    from scipy.cluster.hierarchy import linkage,dendrogram
    from matplotlib.pyplot import show

    # convert correlations into distances (1 - correlation)
    matrix = abs(read('correlationMatrix.em')-1)

    classification = linkage(vol2npy(matrix))
    print classification
    dendrogram(classification)
    show()

    You should see a hierarchical tree that suggests which classes should be merged. (A sketch for cutting this tree into a flat grouping follows after this list.)
  5. Merge classes: Using the script below you first split the particle list according to the previously defined (too fine) classes and subsequently merge selected classes (in this example classes 0, 2, and 3 into one class, classes 1 and 4 into another, and 5 and 6 into a third new class).
    $ ipytom
    from pytom.basic.structures import ParticleList

    pl = ParticleList()
    pl.fromXMLFile('bestAfter31Its.xml')  # the classified particle list used in step 1
    pl.sortByClassLabel()
    pls = pl.splitByClass()

    proteinState1 = pls[0] + pls[2] + pls[3]
    proteinState1.average('proteinInState1')
    proteinState2 = pls[1] + pls[4]
    proteinState2.average('proteinInState2')

    outliers = pls[5] + pls[6]
    outliers.average('outliers')
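If you prefer a flat grouping over reading the dendrogram from step 4, scipy's fcluster can cut the tree at a chosen distance. This is a minimal sketch that continues the ipytom session from step 4; the cut-off value of 0.7 is an arbitrary assumption:

from scipy.cluster.hierarchy import fcluster

# cut the hierarchical tree at distance 0.7 (assumed value); class averages that end
# up with the same group index are candidates for merging
flat_groups = fcluster(classification, t=0.7, criterion='distance')
print(flat_groups)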

Create a classification job in the terminal with mcoEXMXJob.py and mcoACJob.py

Please note that we provide the mcoEXMXJob.py and mcoACJob.py scripts to set up classification jobs from the terminal. Check the terminal parameters with mcoEXMXJob.py --help and mcoACJob.py --help for details.
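For example, following the invocation pattern used above, the help texts can be printed like this:

pytom PathToPytom/bin/mcoEXMXJob.py --help
pytom PathToPytom/bin/mcoACJob.py --help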

Coarse classification after localization

You can perform classification directly after localization in order to get rid of obvious false positives such as gold beads. To do so, we recommend setting up an MCO-EXMX classification job (without annealing).

Hence, you can directly classify your reconstructed subtomograms using the coarse alignment determined during localization. Please read below how to set up an MCO-EXMX job, which is essentially identical to setting up a coarse MCO-AC job. You will have to use the particle list you obtained after localization.

After MCO-EXMX, you want to select the good subtomograms. You can do so by running the following script:

$ ipytom
from pytom.basic.structures import ParticleList

# 6 is the iteration number; classifiedParticles.xml is the result of that iteration
pl = ParticleList()
pl.fromXMLFile('./6/classifiedParticles.xml')
pl.sortByClassLabel()
pls = pl.splitByClass()

for cl in pls:
    className = cl[0].getClassName()
    cl.average('class' + str(className) + '.em')
    print className, ' contains ', len(cl), ' particles'

# choose the result classes you want to use for further processing, e.g., classes 1,2,3,4
# the decision on which particles to take is based on the averages written out above
ribos = pls[1] + pls[2] + pls[3] + pls[4]
ribos.toXMLFile('filtered_pl.xml')
ribos.average('./filteredIniRef.em')
