feat(distances): add hamming, jaro and jaro-winkler
- add new metrics: hamming, jaro and jaro-winkler
- add utils and functional api
BenTenmann committed Jun 5, 2022
1 parent 87783fe commit d9f7154
Showing 17 changed files with 1,315 additions and 17 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -181,5 +181,8 @@ cython_debug/
# scripts
scripts

# IDE
.idea

# notes
notes/
47 changes: 42 additions & 5 deletions README.md
@@ -40,8 +40,40 @@ As the header suggests, `setriq` is a no-frills Python package for fast computat
a focus on immunoglobulins. It is a declarative framework and borrows many concepts from the popular `torch` library. It
has been optimized for parallel compute on CPU architectures.

The object-based API performs pairwise, all-v-all distance computations. This decision was made to maximize consistency
and cohesion.

Available distance functions:
* CDRdist
* Levenshtein
* TCRdist
* Hamming
* Jaro
* Jaro-Winkler
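
For orientation, the three metrics added in this commit can be sketched in pure Python. This is an illustrative restatement of the logic in the accompanying `C++` sources, not the `setriq` API: the function names are hypothetical, and the unit mismatch score, the equal Jaro weights, `p=0.1` and the length check in `hamming` are assumptions of the sketch.

```python
def hamming(a: str, b: str, mismatch_score: float = 1.0) -> float:
    """Add a fixed penalty for every position at which two strings differ."""
    if len(a) != len(b):
        # assumption of this sketch: reject unequal lengths explicitly
        raise ValueError('Hamming distance assumes equal-length sequences')
    return mismatch_score * sum(x != y for x, y in zip(a, b))


def jaro(a: str, b: str, weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Jaro distance: 0 for identical strings, 1 when no characters match."""
    if not a or not b:
        # one empty string gives the maximal distance, two empty strings the minimal
        return float(bool(a) or bool(b))
    window = max(len(a), len(b)) // 2 - 1
    if window < 0:
        # both strings have length 1
        return 0.0 if a == b else 1.0

    matched_a = [False] * len(a)
    matched_b = [False] * len(b)
    for i, char in enumerate(a):
        left, right = max(i - window, 0), min(i + window + 1, len(b))
        for j in range(left, right):
            if char == b[j] and not matched_b[j]:
                matched_a[i] = matched_b[j] = True
                break

    m = sum(matched_a)
    if m == 0:
        return 1.0

    # matched characters in their original order; out-of-order pairs are transpositions
    seq_a = [c for c, hit in zip(a, matched_a) if hit]
    seq_b = [c for c, hit in zip(b, matched_b) if hit]
    t = sum(x != y for x, y in zip(seq_a, seq_b)) / 2
    similarity = weights[0] * m / len(a) + weights[1] * m / len(b) + weights[2] * (m - t) / m
    return 1.0 - similarity


def jaro_winkler(a: str, b: str, p: float = 0.1, max_l: int = 4) -> float:
    """Shrink the Jaro distance according to the length of the shared prefix."""
    prefix = 0
    while prefix < min(len(a), len(b), max_l) and a[prefix] == b[prefix]:
        prefix += 1
    return jaro(a, b) * (1 - prefix * p)
```

All three return distances rather than similarities, so identical sequences score 0.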

These distance functions are available either through the object-based API (as seen above), which provides the CPU-based
parallelism, or the functional API in `setriq.single_dispatch`. Unlike the object-based API, the functional API does a
single comparison between two sequences for every call, i.e. it exposes the `C++` distance functions without the
parallelism wrapper. This can be useful for integration of `setriq` with other tools such as `PySpark`. For example:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

from setriq import single_dispatch as sd

spark = SparkSession \
    .builder \
    .appName("setriq-spark") \
    .getOrCreate()

df = spark.createDataFrame([('CASSLKPNTEAFF',), ('CASSAHIANYGYTF',), ('CASRGATETQYF',)], ['sequence'])
df = df.withColumnRenamed('sequence', 'a').crossJoin(df.withColumnRenamed('sequence', 'b'))

lev_udf = udf(sd.levenshtein, returnType=DoubleType()) # single dispatch levenshtein distance
df = df.withColumn('distance', lev_udf('a', 'b'))
df.show()
```

Note that the functions in `setriq.single_dispatch` always return a single float value.
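
Because each call yields one float, an all-v-all comparison can be assembled by iterating over sequence pairs. The following is a minimal sketch, assuming a stand-in `pairwise_distance` function (hypothetical, not part of `setriq`) with the same two-strings-in, one-float-out shape. Judging by the PySpark example above, `sd.levenshtein` could presumably be passed in its place.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def pairwise_distance(a: str, b: str) -> float:
    """Stand-in metric (hypothetical): differing positions plus the length difference."""
    return float(sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b)))


def all_v_all(sequences: Sequence[str], fn: Callable[[str, str], float]) -> List[float]:
    """Apply a single-pair distance function to every unordered pair of sequences."""
    return [fn(a, b) for a, b in combinations(sequences, 2)]


sequences = ['CASSLKPNTEAFF', 'CASSAHIANYGYTF', 'CASRGATETQYF']
distances = all_v_all(sequences, pairwise_distance)  # one float per pair, three pairs in total
```

This loop runs serially in Python, so it does not replace the parallelism of the object-based API.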

## Requirements
A `Python` version of 3.7 or above is required, as well as a `C++` compiler equipped with OpenMP. The package has been
@@ -62,8 +94,13 @@ brew install libomp llvm
1. Dash, P., Fiore-Gartland, A.J., Hertz, T., Wang, G.C., Sharma, S., Souquette, A., Crawford, J.C., Clemens, E.B.,
Nguyen, T.H., Kedzierska, K. and La Gruta, N.L., 2017. Quantifiable predictive features define epitope-specific T cell
receptor repertoires. Nature, 547(7661), pp.89-93. (https://doi.org/10.1038/nature22383)
2. Jaro, M.A., 1989. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa,
Florida. Journal of the American Statistical Association, 84(406), pp.414-420.
3. Levenshtein, V.I., 1966, February. Binary codes capable of correcting deletions, insertions, and reversals. In
Soviet physics doklady (Vol. 10, No. 8, pp. 707-710).
4. python-Levenshtein (https://github.com/ztane/python-Levenshtein)
5. Thakkar, N. and Bailey-Kellogg, C., 2019. Balancing sensitivity and specificity in distinguishing TCR groups by CDR
sequence similarity. BMC bioinformatics, 20(1), pp.1-14. (https://doi.org/10.1186/s12859-019-2864-8)
6. Van der Loo, M.P., 2014. The stringdist package for approximate string matching. R J., 6(1), p.111.
7. Winkler, W.E., 1990. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record
linkage.
22 changes: 22 additions & 0 deletions include/setriq/metrics/Hamming.h
@@ -0,0 +1,22 @@
//
// Created by Benjamin Tenmann on 20/02/2022.
//

#ifndef SETRIQ_HAMMING_H
#define SETRIQ_HAMMING_H

#include "utils/type_defs.h"

namespace metric {
    class Hamming {
    private:
        double mismatch_score_{};

    public:
        explicit Hamming(const double &mismatch_score) : mismatch_score_{mismatch_score} {};

        double forward(const std::string &a, const std::string &b) const;
    };
}

#endif //SETRIQ_HAMMING_H
27 changes: 27 additions & 0 deletions include/setriq/metrics/Jaro.h
@@ -0,0 +1,27 @@
//
// Created by Benjamin Tenmann on 05/03/2022.
//

#ifndef SETRIQ_JARO_H
#define SETRIQ_JARO_H

#include <array>
#include "utils/type_defs.h"

typedef std::array<double, 3> jaro_weighting_t;

namespace metric {
    class Jaro {
    private:
        jaro_weighting_t weights_ = {1. / 3, 1. / 3, 1. / 3};

    public:
        Jaro() = default;

        explicit Jaro(jaro_weighting_t weights) : weights_(weights) {};

        double forward(const std::string &a, const std::string &b) const;
    };
}

#endif //SETRIQ_JARO_H
27 changes: 27 additions & 0 deletions include/setriq/metrics/JaroWinkler.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
//
// Created by Benjamin Tenmann on 21/02/2022.
//

#ifndef SETRIQ_JAROWINKLER_H
#define SETRIQ_JAROWINKLER_H

#include "utils/type_defs.h"
#include "metrics/Jaro.h"

namespace metric {
    class JaroWinkler {
    private:
        double p_ = 0.;
        size_t max_l_ = 4;
        Jaro jaro_{};

    public:
        JaroWinkler() = default;

        explicit JaroWinkler(const double &p, const size_t &max_l, Jaro jaro) : p_{p}, max_l_{max_l}, jaro_{jaro} {};

        double forward(const std::string &a, const std::string &b) const;
    };
}

#endif //SETRIQ_JAROWINKLER_H
20 changes: 20 additions & 0 deletions src/setriq/_C/metrics/Hamming.cpp
@@ -0,0 +1,20 @@
//
// Created by Benjamin Tenmann on 20/02/2022.
//

#include "metrics/Hamming.h"

double metric::Hamming::forward(const std::string &a, const std::string &b) const {
    /*!
     * Compute the Hamming distance between two input strings.
     * Both strings are assumed to have the same length: `b` is indexed up to `a.size()` without a bounds check.
     *
     * @param a: an input string to be compared
     * @param b: an input string to be compared
     */
    double distance = 0.;
    for (auto i = 0ul; i < a.size(); i++) {
        if (a[i] != b[i])
            distance += this->mismatch_score_;
    }
    return distance;
}
81 changes: 81 additions & 0 deletions src/setriq/_C/metrics/Jaro.cpp
@@ -0,0 +1,81 @@
//
// Created by Benjamin Tenmann on 05/03/2022.
//

#include <cmath>
#include "metrics/Jaro.h"

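// fully parenthesised so the macros expand safely inside larger expressions
#define either_zero(x, y) (((x) == 0) || ((y) == 0))
#define max(x, y) ((x) > (y) ? (x) : (y))
#define min(x, y) ((x) > (y) ? (y) : (x))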


void collapse_into_match_str(const std::string& sequence, const std::vector<size_t>& matches_idx, char* match_str) {
    auto&& j = 0ul;
    for (const auto& idx : matches_idx) {
        if (idx) {
            match_str[j] = sequence[idx - 1];
            j++;
        }
    }
}

double metric::Jaro::forward(const std::string &a, const std::string &b) const {
    /*!
     * Compute the Jaro distance between two input strings.
     * Adapted from https://github.com/markvanderloo/stringdist/blob/master/pkg/src/jaro.c
     *
     * @param a: an input string to be compared
     * @param b: an input string to be compared
     */
    const auto& s_i = a.size();
    const auto& s_j = b.size();
    if (either_zero(s_i, s_j))
        // if one of the strings is of length 0 and the other isn't, then the distance is maximal (1)
        // if both are length 0, then the distance is minimal, i.e. 0
        return (double) ((s_i > 0) || (s_j > 0));

    const auto& max_len = s_i > s_j ? s_i : s_j;
    const auto& max_match_distance = (int) std::floor(max_len / 2) - 1;
    if (max_match_distance < 0)
        // catch the case when both strings are of length == 1
        return a[0] == b[0] ? 0.0 : 1.0;

    auto&& matches_s_i = std::vector<size_t>(s_i, 0);
    auto&& matches_s_j = std::vector<size_t>(s_j, 0);

    auto&& n_matches = 0ul;
    for (auto i = 0; i < s_i; i++) {
        const auto& left = max((i - max_match_distance), 0);
        const auto& right = min((i + max_match_distance) + 1, s_j);
        // can we collapse this in some way?
        for (auto j = left; j < right; j++) {
            if ((a[i] == b[j]) && (matches_s_j[j] == 0)) {
                n_matches++;
                matches_s_i[i] = i + 1;
                matches_s_j[j] = j + 1;
                break;
            }
        }
    }
    if (n_matches == 0)
        return 1.0;

    char *match_str_i = new char[n_matches];
    char *match_str_j = new char[n_matches];

    collapse_into_match_str(a, matches_s_i, match_str_i);
    collapse_into_match_str(b, matches_s_j, match_str_j);

    auto&& t = 0.0;
    for (auto k = 0ul; k < n_matches; k++) {
        if (match_str_i[k] != match_str_j[k])
            t += 0.5;
    }
    delete[] match_str_i;
    delete[] match_str_j;

    const auto& m = (double) n_matches;
    // allow arbitrary weighting
    return 1 - (this->weights_[0] * (m / s_i) + this->weights_[1] * (m / s_j) + this->weights_[2] * ((m - t) / m));
}
27 changes: 27 additions & 0 deletions src/setriq/_C/metrics/JaroWinkler.cpp
@@ -0,0 +1,27 @@
//
// Created by Benjamin Tenmann on 21/02/2022.
//

#include "metrics/JaroWinkler.h"

size_t min_sequence_len(const std::string& a, const std::string& b) {
    const auto& length_a = a.size();
    const auto& length_b = b.size();
    return length_a < length_b ? length_a : length_b;
}

double metric::JaroWinkler::forward(const std::string &a, const std::string &b) const {
    /*!
     * Compute the Jaro-Winkler distance between two input strings.
     *
     * @param a: an input string to be compared
     * @param b: an input string to be compared
     */
    const auto& jaro_distance = this->jaro_.forward(a, b);
    const auto& min_length = min_sequence_len(a, b);

    // count the length of the shared prefix, capped at max_l_; check the bounds before indexing
    size_t l = 0;
    while ((l < min_length) && (l < this->max_l_) && (a[l] == b[l]))
        l++;
    return jaro_distance * (1 - l * this->p_);
}
