Spark cheminformatics utils

This repository contains some utilities that can be used to build Spark-based massively parallel cheminformatics pipelines.

Parsers

This package provides custom Spark input formats for chemical structures. To use it in your maven project please add this entries to your pom.xml file:

<repositories>
    ...
    <repository>
        <id>pele.farmbio.uu.se</id>
        <url>http://pele.farmbio.uu.se/artifactory/libs-snapshot</url>
    </repository>
    ...
</repositories>

<dependencies>
    ...
    <groupId>se.uu.farmbio</groupId>
        <artifactId>parsers</artifactId>
        <version>0.0.1-SNAPSHOT</version>
    </dependency>
    ...
</dependencies>

Usage

This is how you read a SDF file using the SDFInputFormat (for other input formats works in the same way).

val sc = new SparkContext(conf)
//This sets the number of SDF loaded as a single string, in many cases it makes sense to 
//have it set to more than 1 in order to reuse objects within a single mapper. Default
//value is 30 for SDF.
sc.hadoopConfiguration.set("se.uu.farmbio.parsers.SDFRecordReader.size", 30)
val sdfRDD = sc.hadoopFile[LongWritable, Text, SDFInputFormat]("/path/to/molecules.sdf")

Signature Generation

This library creates complete molecule-signatures (a type of molecular fingerprint) and provides functionality to create LabeledPoint and Vectors which can be used for machine-learning using Spark. To use this library using Maven, add the following to your pom.xml file:

<repositories>
    ...
    <repository>
        <id>pele.farmbio.uu.se</id>
        <url>http://pele.farmbio.uu.se/artifactory/libs-snapshot</url>
    </repository>
    ...
</repositories>

<dependencies>
    ...
    <groupId>se.uu.farmbio</groupId>
        <artifactId>sg</artifactId>
        <version>0.0.1-SNAPSHOT</version>
    </dependency>
    ...
</dependencies>

The API include the following functions, where each function is provided in two flavours; either the straightforward way of computing directly, or by sending data together with your objects (i.e. if you wish to assign an identifier to each molecule/record). See the explination bellow.

Function name	Description
`atom2SigRecord`	Performs Signature-generation on a single molecule and returns a `Map[String,Int]` of (Signature->#occurrences)
`atom2SigRecordDecision`	Does the same thing as `atom2SigRecord`, but also requires a decision-class/regression-value for each molecule
`sig2ID`	Takes an RDD of Signature-Records and computes both the "universe of Signatures" and transform each record to use the Signature IDs instead
`sig2LP`	Transform all molecule-records that uses (Signature ID->#occurrences) into an RDD of `LabeledPoint`
`saveAsLibSVMFile`	Allows callers to save `RDD[LabeledPoint]` as LibSVM-format (simply a wrapper of the same function in Sparks `MLUtils` object)
`loadLibSVMFile`	Allows callers to read a LibSVM-formatted file into an `RDD[LabeledPoint]`(simply a wrapper of the same function in Sparks `MLUtils` object)
`saveSign2IDMapping`	Allows callers to save the "Signature universe" to file
`loadSign2IDMapping`	Allows callers to load a "Signature universe" from file
`atom2LP`	Allows callers to compute a `LabeledPoint` from an `IAtomContainer` using an existing Signature Universe (i.e. useful for testing models with new data
`atoms2LP`	Allows callers to compute several `LabeledPoint` objects form an RDD of `IAtomContainers`, this method is made for creating test-data using an existing "Signature Universe"
`atoms2LP_UpdataSignMapCarryData`	This function allows the caller to compute `LabeledPoint` objects for several molecules at the same time, and simultaneously adding new signatures (that not previously has been seen) to the Signature Universe. This function can be used instead of using the standard fashion `atom2SigRecord`, `sig2ID` and `sig2LP` after each other (pass `null` as value for parameter `old_signMap`)
`atoms2Vectors`	Allows callers to compute `Vector`s using an existing "Signature Universe", to create testing data

Note that all types defined can be found in the uu.farmbio.sg.types-object.

Usage

The default-way of using this API is the following:

// Creating LabeledPoint-records for ML
val mols: RDD[IAtomContainer] = ... 
val moleculesAfterSG = mols.map{molecule => SGUtils.atom2SigRecordDecision(molecule, molecule.decision_class, h_start=1, h_stop=3)};
val (sig2ID_records, sig2ID_universe) = SGUtils.sig2ID(moleculesAfterSG);
val resultAsLP: RDD[LabeledPoint] = SGUtils.sig2LP(sig2ID_records);
...
// Creating Vector-records to predict on
val testMols: RDD[IAtomContainer] = ...
val test: RDD[Vector] = SGUtils.atoms2Vectors(testMols, sig2ID_universe, h_start=1, h_stop=3);
...

where molecule should be an IAtomContainer created by a reader from CDK. The resultAsLP can then be used in ML-analysis in later steps, and the sig2ID_universe is the linking between a certain signature to the signature ID that is used in the LabeledPoint-records.

Assigning data to your records

If you wish to assign some molecule ID or further data together wit your records, you can use the same _carryData-version of the API-function. Each of those functions will let you send data of type RDD[(T, requiredType)] where the requiredType is the type of the 'normal' API-function. The type T can be anything (following the Spark-requirement of it being serializable, either by standard Java Serialization or by Kryo, depending on your settings). If you wish to send more than one type of data with your record, simply put it into a tuple: (name, year) for instance.

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
parsers		parsers
sg		sg
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Spark cheminformatics utils

Parsers

Usage

Signature Generation

Usage

Assigning data to your records

About

Releases

Packages

Languages

License

StaffanArvidsson/spark-cheminformatics

Folders and files

Latest commit

History

Repository files navigation

Spark cheminformatics utils

Parsers

Usage

Signature Generation

Usage

Assigning data to your records

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages