<h1 style="border:2px solid Purple;text-align:center">Domain Understanding</h1>

In this notebook we will go through the *problem domain* related concepts, namely:

1. **International Chemical Identifier** (InChI)
2. **Levenshtein distance** - The problem metric

**As usual, please do remember to upvote if you like the content :)**

<h1 style="border:2px solid Purple;text-align:center">1.  International Chemical Identifier (InChI)</h1>

The IUPAC International Chemical Identifier (InChI™) is a non-proprietary identifier for chemical substances that can be used in printed and electronic data sources, thus enabling easier linking of diverse data compilations. It was developed under IUPAC Project 2000-025-1-800 during the period 2000-2004.

The identifiers describe chemical substances in terms of layers of information: the atoms and their bond connectivity, tautomeric information, isotope information, stereochemistry, and electronic charge information. Not all layers have to be provided; for instance, the tautomer layer can be omitted if that type of information is not relevant to the particular application.

To avoid generating different InChIs for tautomeric structures, an input chemical structure is normalized before the InChI is generated, which reduces it to its so-called core parent structure. This may involve changing bond orders, rearranging formal charges, and possibly adding and removing protons. Different input structures may give the same result; for example, acetic acid and acetate would both give the same core parent structure, that of acetic acid. A core parent structure may be disconnected, consisting of more than one component, in which case the sublayers in the InChI usually consist of sublayers for each component, separated by semicolons (periods for the chemical formula sublayer). One way this can happen is that all metal atoms are disconnected during normalization; so, for example, the InChI for tetraethyllead will have five components, one for lead and four for the ethyl groups.

The first layer of the InChI, the main layer, refers to this core parent structure, giving its chemical formula, non-hydrogen connectivity without bond order (/c sublayer), and hydrogen connectivity (/h sublayer). The /q portion of the charge layer gives its charge, and the /p portion of the charge layer tells how many protons (hydrogen ions) must be added to or removed from it to regenerate the original structure. If present, the stereochemical layer, with sublayers /b, /t, /m and /s, gives stereochemical information, and the isotopic layer /i (which may contain sublayers /h, /b, /t, /m and /s) gives isotopic information. These are the only layers which can occur in a standard InChI.


Every InChI starts with the string "InChI=" followed by the version number, currently 1. For a standard InChI, the version number is followed by the letter S. The standard InChI is a fully standardized InChI flavor that maintains the same level of attention to structure details and the same conventions for drawing perception. The remaining information is structured as a sequence of layers and sub-layers, with each layer providing one specific type of information. The layers and sub-layers are separated by the delimiter "/" and start with a characteristic prefix letter (except for the chemical formula sub-layer of the main layer). The six layers, with their important sublayers, are:

1. Main layer
    a. Chemical formula (no prefix). This is the only sublayer that must occur in every InChI.
    b. Atom connections (prefix: "c"). The atoms in the chemical formula (except for hydrogens) are numbered in sequence; this sublayer describes which atoms are connected by bonds to which other ones.
    c. Hydrogen atoms (prefix: "h"). Describes how many hydrogen atoms are connected to each of the other atoms.

2. Charge layer
    a. charge sublayer (prefix: "q")
    b. proton sublayer (prefix: "p" for "protons")

3. Stereochemical layer
    a. double bonds and cumulenes (prefix: "b")
    b. tetrahedral stereochemistry of atoms and allenes (prefixes: "t", "m")
    c. type of stereochemistry information (prefix: "s")

4. Isotopic layer (prefixes: "i", "h", as well as "b", "t", "m", "s" for isotopic stereochemistry)

5. Fixed-H layer (prefix: "f"); contains some or all of the above types of layers except atom connections; may end with "o" sublayer; never included in standard InChI

6. Reconnected layer (prefix: "r"); contains the whole InChI of a structure with reconnected metal atoms; never included in standard InChI


**Examples**

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/a477cdf40019fca7d2f885b88e040b631fe421e4)


InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3

![](https://upload.wikimedia.org/wikipedia/commons/thumb/d/de/L-ascorbic_acid_with_InChI_numbering.svg/330px-L-ascorbic_acid_with_InChI_numbering.svg.png)

InChI=1S/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5/h2,5,7-8,10-11H,1H2/t2-,5+/m0/s1
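
To make the layer structure concrete, here is a minimal sketch that splits an InChI string on the "/" delimiter and separates the unprefixed chemical formula from the prefixed sublayers. The function name `split_inchi` and the shape of the returned dictionary are hypothetical choices made for illustration; this is plain string handling, not a chemistry-aware parser, and it does not validate the contents of any layer.

```python
def split_inchi(inchi: str) -> dict:
    """Split an InChI string into version, formula, and prefixed sublayers.

    Illustrative only: it follows the "/"-delimited layout described above
    and does not check that the layers are chemically meaningful.
    """
    marker = "InChI="
    if not inchi.startswith(marker):
        raise ValueError("an InChI must start with 'InChI='")

    # Layers and sublayers are separated by "/"; the first piece is the version.
    version, *segments = inchi[len(marker):].split("/")

    parsed = {
        "version": version.rstrip("S"),     # e.g. "1"
        "standard": version.endswith("S"),  # True for a standard InChI
        "formula": None,
        "layers": [],                       # (prefix, content) pairs, in order
    }

    # The chemical formula is the only sublayer without a prefix letter,
    # so it is assumed here to be a first segment that is not lowercase-led.
    if segments and not segments[0][:1].islower():
        parsed["formula"] = segments.pop(0)

    # Every remaining segment starts with a one-letter prefix (c, h, q, p, ...).
    # A list is used because a prefix such as "h" may occur more than once.
    for seg in segments:
        if seg:
            parsed["layers"].append((seg[0], seg[1:]))
    return parsed


ethanol = split_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(ethanol["formula"])  # C2H6O
print(ethanol["layers"])   # [('c', '1-2-3'), ('h', '3H,2H2,1H3')]
```

For the L-ascorbic acid string above, the same function additionally yields the stereochemical sublayers `t`, `m` and `s`.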

<h1 style="border:2px solid Purple;text-align:center">2.  Levenshtein distance</h1>

1. Levenshtein distance (LD) is a measure of the difference between two strings, which we will refer to as the source string (s) and the target string (t). The distance is the minimum number of single-character deletions, insertions, or substitutions required to transform s into t.

2. For example, if s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed. The strings are already identical.
   If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to transform s into t.

3. The greater the Levenshtein distance, the more different the strings are.
   Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who introduced the distance in 1965. If you can't spell or pronounce Levenshtein, the metric is also sometimes called edit distance.

4. The Levenshtein distance algorithm has been used in:

    a. Spell checking

    b. Speech recognition

    c. DNA analysis

    d. Plagiarism detection


**The Algorithm**

**Step 1**                   

Set n to be the length of s.
Set m to be the length of t.
If n = 0, return m and exit.
If m = 0, return n and exit.
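Construct a matrix d containing 0..n rows and 0..m columns, so that row i corresponds to the first i characters of s and column j to the first j characters of t.

**Step 2**

Initialize the first row to 0..m.
Initialize the first column to 0..n.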

**Step 3** 

Examine each character of s (i from 1 to n).

**Step 4** 

Examine each character of t (j from 1 to m).

**Step 5** 

If s[i] equals t[j], the cost is 0.
If s[i] doesn't equal t[j], the cost is 1.

**Step 6**

Set cell d[i,j] of the matrix equal to the minimum of:
a. The cell immediately above plus 1: d[i-1,j] + 1.
b. The cell immediately to the left plus 1: d[i,j-1] + 1.
c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost.

**Step 7** 

After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m].
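
The steps above can be sketched in Python as follows. This is a minimal illustration, not an optimized implementation. It assumes the matrix has one row per prefix of s and one column per prefix of t, so that `d[i-1][j]` is the cell above and `d[i][j-1]` is the cell to the left. Because Python strings are 0-indexed, the i-th character of s is `s[i - 1]`. The function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(s: str, t: str) -> int:
    """Levenshtein distance between s and t using a full (n+1) x (m+1) matrix."""
    # Step 1: lengths and the trivial cases.
    n, m = len(s), len(t)
    if n == 0:
        return m
    if m == 0:
        return n

    # Steps 1-2: build the matrix and fill the first row and first column.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # deleting i characters of s gives the empty string
    for j in range(m + 1):
        d[0][j] = j  # inserting j characters gives the first j characters of t

    # Steps 3-6: fill the remaining cells row by row.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # cell above: deletion
                d[i][j - 1] + 1,         # cell to the left: insertion
                d[i - 1][j - 1] + cost,  # diagonal: match or substitution
            )

    # Step 7: the distance is in the bottom-right cell.
    return d[n][m]


print(levenshtein("test", "test"))  # 0
print(levenshtein("test", "tent"))  # 1
```

The full matrix uses memory proportional to n times m. That is fine for short strings, and it keeps the code close to the step-by-step description.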


**Example**

![](https://miro.medium.com/max/866/1*1Eo8zD0vesOJraQq_VsEaA.jpeg)

![](https://miro.medium.com/max/448/1*xyoq20suqByW8wzlKe9O-A.png)

The distance is in the lower right hand corner of the matrix, i.e. **3**.
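
Since Levenshtein distance is the problem metric here, the sketch below shows one plausible way to score a batch of predicted InChI strings against the true ones. Averaging the per-pair distances is an assumption made for illustration, as are the names `edit_distance` and `mean_edit_distance`; check the official evaluation description for the exact definition. The distance function keeps only the previous matrix row, which gives the same result as the full matrix while using less memory.

```python
from statistics import mean


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance, keeping only the previous row of the matrix."""
    prev = list(range(len(t) + 1))  # row 0: 0..m
    for i, cs in enumerate(s, start=1):
        curr = [i]  # first column of row i
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # cell above
                            curr[j - 1] + 1,      # cell to the left
                            prev[j - 1] + cost))  # diagonal cell
        prev = curr
    return prev[-1]


def mean_edit_distance(y_true, y_pred):
    """Average Levenshtein distance over paired strings (assumed scoring rule)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return mean(edit_distance(a, b) for a, b in zip(y_true, y_pred))


truth = [
    "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "InChI=1S/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5/h2,5,7-8,10-11H,1H2/t2-,5+/m0/s1",
]
# Hypothetical predictions: the first is exact, the second has one wrong
# character in the stereochemical layer ("5+" predicted as "5-").
preds = [
    "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "InChI=1S/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5/h2,5,7-8,10-11H,1H2/t2-,5-/m0/s1",
]
print(mean_edit_distance(truth, preds))  # 0.5
```

A lower score means the predictions are closer to the true strings, and a score of 0 means every prediction matches exactly.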


**Sources**

1. https://medium.com/@ethannam/understanding-the-levenshtein-distance-equation-for-beginners-c4285a5604f0
2. https://people.cs.pitt.edu/~kirk/cs1501/assignments/editdistance/Levenshtein%20Distance.htm