# AIMSim Demo
This notebook demonstrates the key uses of _AIMSim_ as a graphical user interface, command line tool, and scripting utility. For detailed explanations and to view the source code for _AIMSim_, visit our online documentation.

## Installing _AIMSim_
For users with Python already in use on their devices, it is _highly_ recommended to first create a virtual environment before installing _AIMSim_. This package has a large number of dependencies with only a handful of versions supported, so conflicts are likely unless a virtual environment is used.

For new Python users, the authors recommend installing Anaconda Navigator to manage dependencies for _AIMSim_ and make installation easier overall. Once Anaconda Navigator is ready, create a new environment with Python 3.7, open a terminal or command prompt in this environment, and follow the instructions below.

We recommend installing _AIMSim_ using the commands shown below (omit the exclamation points and the `%%capture` line unless you are running in a Jupyter notebook):

In [4]:
%%capture
!pip install aimsim

Now, start the _AIMSim_ GUI by typing `python -m aimsim` or simply `aimsim` into the command line.

## Graphical User Interface Walkthrough
For most users, the Graphical User Interface (GUI) will provide access to all the key functionalities in _AIMSim_. The GUI presents dropdowns and text fields representing settings that would otherwise need to be configured in a file by hand. The GUI writes this file to disk as part of execution so that it can be used as a 'starting point' for more advanced use cases.

**Important Note**: The _AIMSim_ GUI _cannot_ be run from a Jupyter Notebook hosted on Binder. To run the _AIMSim_ GUI alongside this tutorial, you will need to download this notebook and run it from a local installation of Jupyter, or follow the installation instructions above and start _AIMSim_ from there. You can install Jupyter [here](https://jupyter.org/install).
<div>
<img src="attachment:image-6.png" width="250"/>
</div>



### A. Database File
This field accepts a file or directory path containing an input set of molecules in one of the accepted formats: SMILES strings, Protein Data Bank files, and Excel files containing these data types.

Example:

`/Users/chemist/Desktop/SMILES_database.smi`

#### A1. Similarity Plots
Checking this box will generate a similarity distribution with _AIMSim's_ default color scheme and labels. To customize this plot further, edit the configuration file produced by _AIMSim_ by clicking `Open Config`, then re-submit the file through the command line interface.

Example:
<div>
<img src="attachment:image-4.png" width="200"/>
</div>

In addition to the similarity distribution, this will create a heatmap showing pairwise comparisons between the molecules in the database. As above, edit the configuration file to control the appearance of this plot.

Example:
<div>
<img src="attachment:image-5.png" width="200"/>
</div>

#### A2. Property Similarity Checkboxes
Like in the previous two examples, checking this box will create a plot showing how a provided molecular property varies according to the chosen molecular fingerprint. For this to work, data must be provided in a comma-separated value format (which can be generated using Excel with Save As... -> CSV) where the rightmost column is a numerical value (the property of interest).

Example:

| SMILES | Boiling Point |
|--------|---------------|
| C      | -161.6        |
| CC     | -89           |
| CCC    | -42           |
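
A file with this layout can also be produced without Excel. The sketch below uses Python's standard `csv` module to write the three rows from the table and read them back. The file name `boiling_points.csv`, the temp-directory location, and the presence of a header row are assumptions made for illustration, not requirements stated by _AIMSim_.

```python
import csv
import tempfile
from pathlib import Path

# SMILES strings with the numerical property of interest in the rightmost column.
rows = [
    ("SMILES", "Boiling Point"),  # header row (assumed, mirrors the table above)
    ("C", -161.6),
    ("CC", -89),
    ("CCC", -42),
]

# Written to the system temp directory for illustration; any location works.
csv_path = Path(tempfile.gettempdir()) / "boiling_points.csv"
with csv_path.open("w", newline="") as handle:
    csv.writer(handle).writerows(rows)

# Read the file back to confirm the layout: SMILES first, property last.
with csv_path.open(newline="") as handle:
    written = list(csv.reader(handle))
print(written)
```

The printed path of `csv_path` is what would be entered in the `Database File` field.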


### B. Target Molecule
Provide a SMILES string representing a single molecule for comparison to the provided database of molecules. In the screenshot above, the provided molecule is "CO", methanol. Any valid SMILES strings are accepted, and any errors in the SMILES string will not affect the execution of other tasks.

#### B1. Similarity Heatmap
Like the similarity plots shown above, this checkbox will generate a similarity distribution comparing the single target molecule specified above with the entire molecular database. This is particularly useful when considering a new addition to a dataset, where _AIMSim_ can help in determining if the provided molecule's structural motifs are already well represented in the data.

### C. Similarity Measure
This dropdown includes all of the similarity metrics currently implemented in _AIMSim_. The default selected metric is likely a great starting point for most users, and the additional metrics are provided for advanced users or more specific use cases. 

Available Similarity Measures are automatically updated according to the fingerprint currently selected. Not all metrics are compatible with all fingerprints, and _AIMSim_ will only allow the user to select valid combinations.

Below is a complete list of all similarity measures currently implemented in _AIMSim_.

| #  | Name                   | Input Aliases                                                                                                                                  |
| -- | ---------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| 1  | l0\_similarity         | \-                                                                                                                                             |
| 2  | l1\_similarity         | manhattan\_similarity, taxicab\_similarity, city\_block\_similarity, snake\_similarity                                                         |
| 3  | l2\_similarity         | euclidean\_similarity                                                                                                                          |
| 4  | cosine                 | driver-kroeber, ochiai                                                                                                                         |
| 5  | dice                   | sorenson, gleason                                                                                                                              |
| 6  | dice\_2                | \-                                                                                                                                             |
| 7  | dice\_3                | \-                                                                                                                                             |
| 8  | tanimoto               | jaccard-tanimoto                                                                                                                               |
| 9  | simple\_matching       | sokal-michener, rand                                                                                                                           |
| 10 | rogers-tanimoto        | \-                                                                                                                                             |
| 11 | russel-rao             | \-                                                                                                                                             |
| 12 | forbes                 | \-                                                                                                                                             |
| 13 | simpson                | \-                                                                                                                                             |
| 14 | braun-blanquet         | \-                                                                                                                                             |
| 15 | baroni-urbani-buser    | \-                                                                                                                                             |
| 16 | kulczynski             | \-                                                                                                                                             |
| 17 | sokal-sneath           | sokal-sneath\_1                                                                                                                                |
| 18 | sokal-sneath\_2        | sokal-sneath-2, symmetric\_sokal\_sneath, symmetric-sokal-sneath                                                                               |
| 19 | sokal-sneath\_3        | sokal-sneath-3                                                                                                                                 |
| 20 | sokal-sneath\_4        | sokal-sneath-4                                                                                                                                 |
| 21 | jaccard                | \-                                                                                                                                             |
| 22 | faith                  | \-                                                                                                                                             |
| 23 | michael                | \-                                                                                                                                             |
| 24 | mountford              | \-                                                                                                                                             |
| 25 | rogot-goldberg         | \-                                                                                                                                             |
| 26 | hawkins-dotson         | \-                                                                                                                                             |
| 27 | maxwell-pilliner       | \-                                                                                                                                             |
| 28 | harris-lahey           | \-                                                                                                                                             |
| 29 | consonni−todeschini\_1 | consonni−todeschini-1                                                                                                                          |
| 30 | consonni−todeschini\_2 | consonni−todeschini-2                                                                                                                          |
| 31 | consonni−todeschini\_3 | consonni−todeschini-3                                                                                                                          |
| 32 | consonni−todeschini\_4 | consonni−todeschini-4                                                                                                                          |
| 33 | consonni−todeschini\_5 | consonni−todeschini-5                                                                                                                          |
| 34 | austin-colwell         | \-                                                                                                                                             |
| 35 | yule\_1                | yule-1                                                                                                                                         |
| 36 | yule\_2                | yule-2                                                                                                                                         |
| 37 | holiday-fossum         | fossum, holiday\_fossum                                                                                                                        |
| 38 | holiday-dennis         | dennis, holiday\_dennis                                                                                                                        |
| 39 | cole\_1                | cole-1                                                                                                                                         |
| 40 | cole\_2                | cole-2                                                                                                                                         |
| 41 | dispersion             | choi                                                                                                                                           |
| 42 | goodman-kruskal        | goodman\_kruskal                                                                                                                               |
| 43 | pearson-heron          | pearson\_heron                                                                                                                                 |
| 44 | sorgenfrei             | \-                                                                                                                                             |
| 45 | cohen                  | \-                                                                                                                                             |
| 46 | peirce\_1              | peirce-1                                                                                                                                       |
| 47 | peirce\_2              | peirce-2                                                                                                                                       |
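
To illustrate how the `Input Aliases` column relates to the `Name` column, here is a minimal sketch that maps a handful of aliases copied from the table back to their canonical names. The `canonical_name` helper is hypothetical and is not part of _AIMSim_, which performs its own lookup internally.

```python
# A few rows copied from the table above: canonical name -> accepted aliases.
ALIASES = {
    "l1_similarity": [
        "manhattan_similarity",
        "taxicab_similarity",
        "city_block_similarity",
        "snake_similarity",
    ],
    "l2_similarity": ["euclidean_similarity"],
    "cosine": ["driver-kroeber", "ochiai"],
    "dice": ["sorenson", "gleason"],
    "tanimoto": ["jaccard-tanimoto"],
}

# Invert into a flat lookup so that any accepted spelling maps to its canonical name.
CANONICAL = {name: name for name in ALIASES}
for name, aliases in ALIASES.items():
    for alias in aliases:
        CANONICAL[alias] = name


def canonical_name(user_input: str) -> str:
    """Hypothetical helper: return the table's canonical name for a measure."""
    return CANONICAL[user_input.strip().lower()]


print(canonical_name("ochiai"))  # prints: cosine
```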

### D. Molecular Descriptor
This dropdown includes all of the molecular descriptors, mainly fingerprints, currently implemented in _AIMSim_:

|#|Fingerprint|
|---|---|
|1|morgan|
|2|topological|
|3|daylight|

Each of these fingerprints should be generally applicable for chemical problems, though they are all provided to serve as an easy way to compare the results according to fingerprinting approach.

Additional descriptors are included with _AIMSim_ which are not mathematically compatible with some of the similarity measures. When such a descriptor is selected, the incompatible similarity measures will be removed from the dropdown.

#### D1. Show Experimental Descriptors
This checkbox adds additional molecular descriptors into the `Molecular Descriptor` dropdown. These are marked as _experimental_ because they are generated using third-party libraries over which we have very little or no control. The descriptors generated by these libraries should be used only when the user has a very specific need for a descriptor as implemented in one of the packages below:
 - [ccbmlib](https://doi.org/10.12688/f1000research.22292.2): All molecular fingerprints included in the `ccbmlib` library have been reproduced in _AIMSim_. Read about these fingerprints [in the `ccbmlib` repository](https://github.com/vogt-m/ccbmlib).
 - [mordred](https://doi.org/10.1186/s13321-018-0258-y): All 1000+ descriptors included in `mordred` are available in _AIMSim_, though as of January 2022 it seems that `mordred` is no longer being maintained and has a significant number of bugs. Use at your own risk.
 - [PaDELPy](https://doi.org/10.1002/jcc.21707): [This package](https://github.com/ecrl/padelpy) provides access to all of the molecular descriptors included as part of the PaDEL-Descriptor standalone Java program.
 
### E. Run
Pressing this button will call a number of input checkers to verify that the information entered into the fields above is valid, and then the tasks will be passed into _AIMSim_ for execution. Additional input to _AIMSim_ needed for some tasks may be requested from the command line.

For large collections of molecules with substantial run times, your operating system may report that _AIMSim_ has stopped responding and should be closed. This is likely not the case, and _AIMSim_ is simply executing your requested tasks. If unsure, try checking the `Verbose` checkbox discussed below, which will provide near-constant output while _AIMSim_ is running.

### F. Open Config
Using your system's default text editor, this button will open the configuration file generated by _AIMSim_ after pressing the run button. This is useful for fine-tuning your plots or re-running the exact same tasks in the future. This configuration file can also access additional functionality present in _AIMSim_ which is not included in the GUI, such as the sampling ratio for the data (covered in greater depth in the __Command Line and Configuration Files__ section below). To use this configuration file, include the name of the file after your call to _AIMSim_ on the command line, i.e.:

`aimsim aimsim-ui-config.yaml` or `python -m aimsim aimsim-ui-config.yaml`

Because of the way Python installs libraries like _AIMSim_, this file will likely be saved somewhere difficult to find among many other internal Python files. It is highly recommended to make a copy of this file in a more readily accessible location, or copy the contents of this file into another one. The name of the file can also be changed to something more meaningful (e.g., JWB-Solvent-Screen-123.yaml) as long as the file extension (.yaml) is still included.

### G. Verbose
Selecting this checkbox will cause _AIMSim_ to emit near-constant updates to the command line on its status during execution. This is useful to confirm that _AIMSim_ is executing and has not crashed, and also to provide additional information about errors in the input data.

For large datasets, this may generate a _significant_ amount of command line output. A pairwise comparison of 10,000 molecules would require 100,000,000 (10,000 \* 10,000) operations, generating at least that many lines of text in the console.
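
The growth in output can be estimated before enabling `Verbose`. This small sketch applies the same arithmetic as above (every molecule compared against every molecule, itself included). `pairwise_operations` is an illustrative helper, not an _AIMSim_ function.

```python
def pairwise_operations(n_molecules: int) -> int:
    """Number of comparisons when every molecule is compared against every molecule."""
    return n_molecules * n_molecules


for n in (5, 1_000, 10_000):
    print(f"{n:>6,} molecules -> {pairwise_operations(n):>11,} comparisons")
```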

Example __Verbose__ output:

```
Reading SMILES strings from C:\path\to\file\small.smi
Processing O=S(C1=CC=CC=C1)(N2CCOCC2)=O (1/5)
Processing O=S(C1=CC=C(C(C)(C)C)C=C1)(N2CCOCC2)=O (2/5)
Processing O=S(C1=CC=C(C2=CC=CC=C2)C=C1)(N3CCOCC3)=O (3/5)
Processing O=S(C1=CC=C(OC)C=C1)(N2CCOCC2)=O (4/5)
Processing O=S(C1=CC=C(SC)C=C1)(N2CCOCC2)=O (5/5)
Computing similarity of molecule num 1 against 1
Computing similarity of molecule num 2 against 1
Computing similarity of molecule num 3 against 1
Computing similarity of molecule num 4 against 1
Computing similarity of molecule num 5 against 1
Computing similarity of molecule num 1 against 2
```

### H. Outlier Check
Checking this will have _AIMSim_ create an Isolation Forest (read more about this in [scikit-learn's documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html)) to identify possible outliers in the input database of molecules. The results from this approach are _non-deterministic_ because of the underlying algorithm driving the Isolation Forest, so this feature is intended to be a "sanity check" rather than a quantitative measure of 'outlier-ness'. To truly determine how different a single example molecule is from a set of molecules, use the `Compare Target Molecule` functionality discussed above.

### I. Enable Multiple Workers
This checkbox will enable multiprocessing, speeding up execution time on the data. By default, _AIMSim_ will use __all__ physical cores available on your machine, which may impact performance of other programs.

The user should only enable this option with datasets of a few hundred or more molecules. This is because there is additional processing time associated with creating and destroying multiple processes, so for small data sets it is faster to simply execute the comparisons serially.
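
As a rough way to apply this advice in a script, the sketch below picks a worker count from the dataset size. The cutoff of 300 molecules is an assumed stand-in for "a few hundred", and `suggest_n_workers` is a hypothetical helper rather than _AIMSim_'s own logic. Note that `os.cpu_count()` reports logical rather than physical cores, so the result should be treated as an upper bound.

```python
import os


def suggest_n_workers(n_molecules: int, min_molecules: int = 300) -> int:
    """Hypothetical rule of thumb: stay serial for small datasets."""
    if n_molecules < min_molecules:
        # Starting and stopping processes would likely cost more than it saves.
        return 1
    # os.cpu_count() counts logical cores and may return None.
    return os.cpu_count() or 1


print(suggest_n_workers(50))
print(suggest_n_workers(5_000))
```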

## Command Line and Configuration Files
For users who prefer to use _AIMSim_ without a user interface, a command line interface is provided. This requires the user to manually write configuration files, but allows access to more granular control and some additional features which are not included in the GUI. This can be invoked by typing `aimsim config.yaml` into your terminal or command window, where `config.yaml` is a configuration file you have provided or copied from the _AIMSim_ repository.

Below is a 'maximum specification' file to be used with _AIMSim_, showing all possible settings and tasks which _AIMSim_ can ingest. Any overall settings which are left out will be inferred by _AIMSim_, and any tasks which are not included will simply not be executed. Each field used in the file is explained afterward.

### Maximum Specification File
```
is_verbose: # bool
molecule_database: # Excel / CSV / text file
molecule_database_source_type: # str
similarity_measure: # str. Set to 'determine' if automatic identification is required
fingerprint_type: # str. Set to 'determine' if automatic identification is required
measure_id_subsample: # [0, 1] Subsample used for measure search
sampling_ratio: # [0, 1] Subsample used for all tasks
n_workers: # [int, 'auto'] number of processes, or let AIMSim decide
global_random_seed: # int or 'random'
    
tasks:
  compare_target_molecule:
    target_molecule_smiles:
    similarity_plot_settings:
        plot_color: # Set a color recognized by matplotlib
        shade: # bool
        plot_title: 
    log_file_path:
    
  visualize_dataset:
    heatmap_plot_settings:
      cmap: # matplotlib recognized cmap (color map) used for heatmap.
      plot_title: # str
      annotate: # bool
    similarity_plot_settings:
      plot_color: # str
      shade: # bool
      xticklabels: # str
      yticklabels: # str
    
  see_property_variation_w_similarity:
    log_file_path: # str
    property_plot_settings:
      plot_color: # Set a color recognized by matplotlib
  
  identify_outliers:  
    random_state: # integer
    output: # filepath or "terminal" to control where results are shown
    plot_outliers: # True or False
    pair_similarity_plot_settings: # Only meaningful if plot_outliers is True
      plot_color: # Set a color recognized by matplotlib
  
  cluster:
    n_clusters: # int
    clustering_method: # str
    log_file_path: # str
    cluster_file_path: # str
    cluster_plot_settings:
      cluster_colors: # list. Ensure len(list) >= n_cluster
      embedding:
        method: # str
        random_state: # int
```
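
Most runs need only a small part of this specification. As a hedged example, the Python sketch below writes a minimal configuration containing a single task, using only fields from the specification above. All values are placeholders: the database path is the example path from the GUI walkthrough, and the `morgan_fingerprint` and `tanimoto` names are assumed to be accepted here in the same form used elsewhere in this tutorial.

```python
import tempfile
from pathlib import Path

# Minimal configuration: overall settings plus one task (placeholder values).
config_text = """\
is_verbose: True
molecule_database: '/Users/chemist/Desktop/SMILES_database.smi'
molecule_database_source_type: text
similarity_measure: tanimoto
fingerprint_type: morgan_fingerprint
n_workers: 1
global_random_seed: 42

tasks:
  compare_target_molecule:
    target_molecule_smiles: 'CO'
    similarity_plot_settings:
      plot_color: 'green'
      shade: True
      plot_title: 'Similarity to methanol'
"""

# Saved to the system temp directory for illustration; any location works.
config_path = Path(tempfile.gettempdir()) / "config.yaml"
config_path.write_text(config_text)
print(f"Run with: aimsim {config_path}")
```

Settings and tasks left out of the file fall back to the behavior described above: omitted settings are inferred and omitted tasks are not executed.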

#### Overall _AIMSim_ Settings
These settings impact how all tasks run by _AIMSim_ will be executed.
 - `is_verbose`: Must be either `True` or `False`. When `True`, _AIMSim_ will emit text updates to the command line during execution, useful for debugging.
 - `molecule_database`: A file path to an Excel workbook, text file containing SMILES strings, or PDB file surrounded by single quotes, i.e. `'/User/my_user/smiles_database.smi'`. Can also point to a directory containing a group of PDB files, but the file path must end with a '/' (or '\' for Windows).
 - `molecule_database_source_type`: The type of data to be input to _AIMSim_, being either `text`, `excel`, or `pdb`.
 - `similarity_measure`: The similarity measure to be used during all tasks, chosen from the list of supported similarity measures. Automatic similarity measure determination is also supported, and can be performed by specifying `determine`.
 - `fingerprint_type`: The fingerprint type or molecular descriptor to be used during all tasks, chosen from the list of supported descriptors. Automatic determination is also supported, and can be performed by specifying `determine`.
 - `measure_id_subsample`: A decimal number between 0 and 1 specifying what fraction of the dataset to use for automatic determination of similarity measure and fingerprint. For a dataset of 10,000 molecules, setting this to `0.1` would run only 1000 randomly selected molecules, dramatically reducing runtime. This field is only needed if `determine` is used in either of the prior fields.
 - `sampling_ratio`: A decimal number between 0 and 1 specifying what fraction of the dataset to use for tasks. For a dataset of 10,000 molecules, setting this to `0.1` would run only 1000 randomly selected molecules, dramatically reducing runtime.
 - `n_workers`: Either an integer or the string 'auto'. With an integer, _AIMSim_ will create that many processes for its operation. This number should be less than or equal to the number of _physical_ CPU cores in your computer. Set this option to 'auto' to let _AIMSim_ configure multiprocessing for you.
 - `global_random_seed`: Integer to be passed to all non-deterministic functions in _AIMSim_. By default, this value is 42 to ensure consistent results between subsequent executions of _AIMSim_. This seed will override the random seeds provided to any other _AIMSim_ tasks. Alternatively, specify 'random' to allow _AIMSim_ to randomly generate a seed.
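
What a sampling ratio does can be sketched in a few lines of plain Python. The `subsample` function below is an illustration of seeded random subsampling, not _AIMSim_'s implementation; the default seed of 42 simply mirrors the default mentioned above.

```python
import random


def subsample(molecules, sampling_ratio, seed=42):
    """Draw a reproducible random subset of the given fraction of molecules."""
    if not 0 <= sampling_ratio <= 1:
        raise ValueError("sampling_ratio must lie in [0, 1]")
    n_keep = round(len(molecules) * sampling_ratio)
    # A dedicated Random instance keeps the draw independent of global state.
    return random.Random(seed).sample(molecules, n_keep)


dataset = [f"mol_{i}" for i in range(10_000)]  # stand-ins for 10,000 molecules
subset = subsample(dataset, 0.1)
print(len(subset))  # prints: 1000
```

Because the seed is fixed, calling `subsample(dataset, 0.1)` again returns the same 1000 entries, which is the kind of repeatability a fixed `global_random_seed` is meant to provide.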

#### Task-Specific Settings
The settings fields below dictate the behavior of _AIMSim_ when performing its various tasks.

##### Compare Target Molecule
Generates a similarity distribution for the dataset compared to an individual molecule.
 - `target_molecule_smiles`: SMILES string for the molecule used in comparison to the dataset.
 - `similarity_plot_settings`: Controls the appearance of the distribution.
     - `plot_color`: Can be any color recognized by the _matplotlib_ library.
     - `shade`: `True` or `False`, whether or not to shade in the area under the curve.
     - `plot_title`: String containing text to be written above the plot.
 - `log_file_path`: String specifying a file to write output to for the execution of this task. Useful for debugging.

##### Visualize Dataset
Generates a pairwise comparison matrix for all molecules in the dataset.
 - `heatmap_plot_settings`: Control the appearance of the plot.
     - `cmap`: _matplotlib_ recognized cmap (color map) used for heatmap.
     - `plot_title`: String containing text to be written above the plot.
     - `annotate`: `True` or `False`, controls whether or not _AIMSim_ will write annotations over the heatmap.
 - `similarity_plot_settings`: Controls the appearance of the distribution.
     - `plot_color`: Can be any color recognized by the _matplotlib_ library.
     - `shade`: `True` or `False`, whether or not to shade in the area under the curve.
     - `xticklabels`: String containing text to write for the x-axis labels.
     - `yticklabels`: String containing text to write for the y-axis labels.

##### Property Variation Visualization
Creates a plot of how a given property in the input molecule set varies according to the structural fingerprint chosen.
 - `log_file_path`: String specifying a file to write output to for the execution of this task. Useful for debugging or retrospection.
 - `property_plot_settings`: Control the appearance of the plot.
     - `plot_color`: Any color recognized by the _matplotlib_ library.
  
##### Identify Outliers
Trains an [IsolationForest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html) on the input data to check for potential outliers.
 - `random_state`: An integer to pass through to random_state in sklearn. _AIMSim_ sets this to 42 by default.
 - `output`: A string which specifies where the output of the outlier search should go. This can be either a filepath or "terminal" to write the output directly to the terminal.
 - `plot_outliers`: Set this to `True` to generate a 2D plot of which molecules are potential outliers.
 - `pair_similarity_plot_settings`: Only meaningful if plot_outliers is True, allows access to plot settings.
     - `plot_color`: Any color recognized by the _matplotlib_ library.
       
##### Cluster
Use a clustering algorithm to make groups from the database of molecules.
 - `n_clusters`: The number of clusters to group the molecules into.
 - `clustering_method`: Optional string specifying a clustering method implemented in `sklearn`, one of `kmedoids`, `ward`, or `complete_linkage`. `complete_linkage` will be chosen by default if no alternative is provided.
 - `log_file_path`: String specifying a file to write output to for the execution of this task. Useful for debugging.
 - `cluster_file_path`: String specifying a file path where _AIMSim_ will output the result of clustering. Useful for comparing multiple clustering approaches or saving the results of large data sets.
 - `cluster_plot_settings`: Control the appearance of the clustering plot.
     - `cluster_colors`: A list of strings, each of which is a color recognized by _matplotlib_ to use for the clusters. Must specify at least as many colors as there are clusters. Additional colors will be ignored.
     - `embedding`: Specify a dimensionality reduction method for the resulting clusters of data.
         - `method`: String specifying a dimensionality reduction method implemented in `sklearn`, either `pca`, `tsne`, or `mds`.
         - `random_state`: Optional integer which will be passed through to the sklearn dimensionality reduction call. Makes subsequent runs reproducible by fixing the outcome of the dimensionality reduction.
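
The rule for `cluster_colors` (at least one color per cluster, extras ignored) can be expressed as a short check. `resolve_cluster_colors` is a hypothetical helper for validating a configuration before running it, not part of _AIMSim_.

```python
def resolve_cluster_colors(cluster_colors, n_clusters):
    """Return exactly n_clusters colors, or raise if too few were supplied."""
    if len(cluster_colors) < n_clusters:
        raise ValueError(
            f"need at least {n_clusters} colors, got {len(cluster_colors)}"
        )
    return list(cluster_colors[:n_clusters])  # additional colors are ignored


colors = resolve_cluster_colors(["red", "blue", "green", "orange"], n_clusters=3)
print(colors)  # prints: ['red', 'blue', 'green']
```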

## Writing Scripts with _AIMSim_
Advanced users may wish to use _AIMSim_ to create their own descriptors, use the descriptors provided in _AIMSim_ for something else entirely, or utilize the various similarity scores. Brief explanations for how to access the core functionalities of _AIMSim_ from a Python script are shown below.

### Making Custom Descriptors
Any arbitrary numpy array can be provided as a molecular descriptor, though correct function with the similarity metrics provided with _AIMSim_ is not guaranteed.

In [2]:
from aimsim.ops.descriptor import Descriptor
desc = Descriptor()

With the `Descriptor` class instantiated, one can then call the methods to set the value(s) of the descriptor.

In [14]:
import numpy as np
custom_desc = np.array([1, 2, 3])
desc.set_manually(custom_desc)
desc.numpy_

array([1, 2, 3])

This same function can be achieved by passing in a numpy array for the keyword argument `value` in the constructor for `Descriptor`, as shown below:

In [15]:
desc = Descriptor(custom_desc)
desc.numpy_

array([1, 2, 3])

The above code is useful for individually changing a descriptor for one molecule in a `MoleculeSet` but is obviously not practical for bulk custom descriptors. To assign descriptors for an entire set of molecules at once, instantiate the `MoleculeSet` class and call the `_set_descriptor` method passing in a 2-dimensional numpy array of descriptors.

```
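import numpy as np
from aimsim.chemical_datastructures.molecule_set import MoleculeSet

molset = MoleculeSet(
    '/path/to/database/smiles.txt',
    'text',
    False,
    'tanimoto'
)
# 2-dimensional array of descriptors: one row per molecule in the set
molset._set_descriptor(np.array([[1, 2, 3], [4, 5, 6]]))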
```

### Generating Descriptors with _AIMSim_
Because _AIMSim_ is able to generate such a wide variety of molecular fingerprints and descriptors from only the SMILES strings, you may want to avoid re-inventing the wheel and use the descriptors that it generates. There are two general approaches to doing this, and the approach used depends on what other code you already have in place:
 1. If you have only SMILES strings to turn into fingerprints/descriptors, the `Molecule` class should be used to handle generating the molecule object and generating the descriptors.
 2. If you have already created a molecule using `RDKit`, you must provide the existing molecule in your call to the constructor in `Molecule`.

These approaches are covered in this order below.

In [1]:
# with a SMILES string
smiles = "CO"
from aimsim.chemical_datastructures.molecule import Molecule
mol = Molecule(mol_smiles=smiles)
mol.set_descriptor(fingerprint_type="atom-pair_fingerprint")
mol.get_descriptor_val()

array([0, 0, 0, ..., 0, 0, 0], dtype=int8)

In [17]:
# with an RDKit molecule
from rdkit import Chem

mol_graph = Chem.MolFromSmiles(smiles)

mol = Molecule(mol_graph=mol_graph)
mol.set_descriptor(fingerprint_type="mordred:nAtom")
mol.get_descriptor_val()

array([6])

### Accessing _AIMSim_ Similarity Metrics
As of January 2022, _AIMSim_ implements 47 unique similarity metrics for use in comparing two numbers and/or two sets of numbers. These metrics were pulled from a variety of sources, including some original implementations, so it may be of interest to use this code in your own work.

All of the similarity metrics can be accessed through the `SimilarityMeasure` class, as shown below.

In [3]:
from aimsim.ops.descriptor import Descriptor
from aimsim.ops.similarity_measures import SimilarityMeasure
from rdkit.Chem import MolFromSmiles

sim_mes = SimilarityMeasure("driver-kroeber")
desc_1 = Descriptor()
desc_1.make_fingerprint(
    MolFromSmiles("COC"),
    "morgan_fingerprint",
)
desc_2 = Descriptor()
desc_2.make_fingerprint(
    MolFromSmiles("CCCC"),
    "morgan_fingerprint",
)
out = sim_mes(
    desc_1,
    desc_2,
)
out

0.22360679774997896

A complete list of supported similarity measures, and the names by which _AIMSim_ recognizes them, is given in the Similarity Measure section of the GUI walkthrough.
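
For intuition about what such a measure computes, the sketch below evaluates the cosine similarity (listed in the table with the aliases `driver-kroeber` and `ochiai`) on two binary fingerprints represented as sets of "on" bit positions. For binary vectors this reduces to the number of shared bits divided by the geometric mean of the two bit counts. This is a plain-Python illustration with made-up bit sets, not _AIMSim_'s implementation, and the zero returned for an empty fingerprint is a convention chosen here.

```python
import math


def cosine_binary(bits_a: set, bits_b: set) -> float:
    """Cosine similarity of two binary fingerprints given as sets of 'on' bits."""
    if not bits_a or not bits_b:
        return 0.0  # convention chosen for this sketch
    shared = len(bits_a & bits_b)
    return shared / math.sqrt(len(bits_a) * len(bits_b))


fp_1 = {0, 1, 2, 3}                    # hypothetical fingerprint with 4 bits set
fp_2 = {1, 2, 3, 4, 5, 6, 7, 8, 9}     # hypothetical fingerprint with 9 bits set
print(cosine_binary(fp_1, fp_2))       # 3 shared bits / sqrt(4 * 9) = 0.5
```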