# Getting Started with StringCompare

**StringCompare** is a Python package providing efficient string comparison functions, such as [edit distances](https://en.wikipedia.org/wiki/Edit_distance), the [Jaro-Winkler distance](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance), and token-based similarity functions (e.g. the [Jaccard similarity index](https://en.wikipedia.org/wiki/Jaccard_index)).

The package's backend is implemented in C++ for efficiency. Pure Python implementations are also provided for testing purposes.

## Installation

The development version of **StringCompare** can be installed from GitHub using [pip](https://pypi.org/project/pip/):

```bash
    pip install setuptools pybind11
    git clone https://github.com/OlivierBinette/StringCompare.git
    pip install -e ./StringCompare
```

You can test the installation by running:

```bash
    python -c "import stringcompare"
```

If no error message appears, Python was able to load **StringCompare** successfully.

### Installation Notes

We recommend using **StringCompare** with Python 3.6 or later. If you encounter installation issues, verify that [pybind11](https://pybind11.readthedocs.io/en/stable/) is installed correctly on your system.
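The two checks above can be scripted before attempting a build. The following is a minimal sketch using only the standard library; the helper name `check_environment` is hypothetical and not part of **StringCompare**, and the list of required packages is taken from the install commands in this guide.

```python
import sys
from importlib.util import find_spec

MIN_PYTHON = (3, 6)  # minimum version recommended in this guide


def check_environment(min_python=MIN_PYTHON, required=("setuptools", "pybind11")):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    # Compare only (major, minor) against the recommended minimum.
    if sys.version_info[:2] < tuple(min_python):
        problems.append(
            "Python {}.{} found, but {}.{} or later is recommended".format(
                sys.version_info[0], sys.version_info[1], *min_python
            )
        )
    # find_spec returns None when a top-level package cannot be located.
    for name in required:
        if find_spec(name) is None:
            problems.append("{} is not importable".format(name))
    return problems


print(check_environment())
```

An empty list only means that the packages can be located; it does not guarantee that pybind11 is compatible with your compiler.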

On some Linux systems, pybind11 is not compatible with the system-provided version of gcc. On Ubuntu 21.10, run the following commands before installing **StringCompare**:

```bash
    sudo apt install gcc-9 g++-9
    export CC=gcc-9 CXX=g++-9
```

You can then reinstall **StringCompare**:
```bash
    make clean
    pip install --force-reinstall -e ./StringCompare
```

If installation issues persist, consider running your code within a [Docker](https://www.docker.com/) container. We recommend the `python:3.7.9` base image. For example, after installing Docker, you can launch an interactive bash session and install **StringCompare** as follows:
```bash
    sudo docker run -it python:3.7.9 bash
    git clone https://github.com/OlivierBinette/StringCompare.git
    pip install -e ./StringCompare
    python
    >>> import stringcompare
```

Please report all installation issues on the [issue tracker](https://github.com/OlivierBinette/StringCompare/issues).

## Structure of the Package

**StringCompare** currently has one main module, the **distance** module, which contains string distance functions.

Each string distance function is implemented as a class that can be instantiated with a given set of parameters. The `compare()` method can then be used to compare strings. For example:

```python
from stringcompare import Levenshtein

cmp = Levenshtein()
cmp.compare("Olivier", "Olivia")
```

```
0.26666666666666666
```
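The value returned above is consistent with a Levenshtein distance d normalized as 2d / (|s| + |t| + d): "Olivier" and "Olivia" are two edits apart, and 2·2 / (7 + 6 + 2) = 4/15 ≈ 0.2667. The pure-Python sketch below reproduces that number under this assumption. The formula is inferred from the outputs shown in this section rather than taken from the package's source, and the names `levenshtein` and `normalized_levenshtein` are illustrative, not part of the **StringCompare** API.

```python
def levenshtein(s, t):
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    previous = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, a in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to ""
        for j, b in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,             # delete a from s
                current[j - 1] + 1,          # insert b into s
                previous[j - 1] + (a != b),  # substitute a with b, or match
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(s, t):
    """Assumed normalization 2d / (|s| + |t| + d), which lies in [0, 1]."""
    d = levenshtein(s, t)
    total = len(s) + len(t) + d
    return 2 * d / total if total else 0.0


print(normalized_levenshtein("Olivier", "Olivia"))  # approximately 0.2667
```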

There are also vectorized forms of the comparison function:

```python
cmp.elementwise(["Olivier", "Oliver"], ["Olivia", "Binette"])
```

```
array([0.26666667, 0.63157895])
```

```python
cmp.pairwise(["Olivier", "Oliver"], ["Olivia", "Binette"])
```

```
array([[0.26666667, 0.66666667],
       [0.28571429, 0.63157895]])
```
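Judging from the outputs above, `elementwise` compares items at matching positions, while `pairwise` compares every item of the first list with every item of the second, giving one row per item of the first list. The sketch below shows these semantics in plain Python. It is not the package's implementation: `length_gap` is a toy stand-in for a real comparison function, and raising an error on lists of unequal length is a choice made for this sketch only.

```python
def elementwise(compare, left, right):
    """Compare left[i] with right[i] for each position i."""
    if len(left) != len(right):
        raise ValueError("both lists must have the same length")
    return [compare(s, t) for s, t in zip(left, right)]


def pairwise(compare, left, right):
    """Compare every item of left with every item of right.

    Entry [i][j] of the result is compare(left[i], right[j]).
    """
    return [[compare(s, t) for t in right] for s in left]


def length_gap(s, t):
    """Toy comparison used for illustration: difference in string length."""
    return abs(len(s) - len(t))


left = ["Olivier", "Oliver"]
right = ["Olivia", "Binette"]

print(elementwise(length_gap, left, right))  # [1, 1]
print(pairwise(length_gap, left, right))     # [[1, 0], [0, 1]]
```

With two lists of lengths n and m, `pairwise` performs n × m comparisons, so `elementwise` is the cheaper choice when the lists are already aligned.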

### Available Functions

The following string distance functions are currently implemented:

- Levenshtein distance (`Levenshtein`)
- Damerau-Levenshtein distance (`DamereauLevenshtein`)
- Longest common substring (LCS) distance (`LCSDistance`)
- Jaro distance (`Jaro`)
- Jaro-Winkler distance (`JaroWinkler`)
- Character difference distance (`CharacterDifference`)

Refer to the [API documentation](https://olivierbinette.github.io/StringCompare/source/stringcompare.html) for full information.