In [None]:
!./init.sh

# Assignments

The following assignments will test your knowledge of unsupervised learning. Note that the data is nontrivial and some tasks can prove difficult. However, the point of the assignments is to practice and experiment with the techniques you have learned. Therefore, do not get discouraged if a single method does not give you the results you expect. Instead, try a few different approaches and see if you can spot any common patterns in the results.

## Tools

Tools and concepts that might not have been discussed previously, but could be useful for these assignments:

- [RDKit](https://www.rdkit.org/docs/index.html)
    - [Molecular Descriptors](https://stackoverflow.com/questions/64141686/calculate-descriptors-with-rdkit)
    - [Molecular Fingerprints](https://www.rdkit.org/docs/GettingStartedInPython.html#morgan-fingerprints-circular-fingerprints)
- [Feature Scaling](https://scikit-learn.org/stable/modules/preprocessing.html)
- [Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
- [molplotly](https://github.com/wjm41/molplotly)
- [Getting Clusters out of Hierarchical Clustering](https://stackoverflow.com/questions/44428512/get-a-list-of-clusters-formed-from-dendrogram-in-python)
- [pChEMBL Values](https://chembl.gitbook.io/chembl-interface-documentation/frequently-asked-questions/chembl-data-questions#what-is-pchembl)
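
As a quick illustration of the last item: ChEMBL defines pChEMBL as the negative base-10 logarithm of a molar potency value (IC50, XC50, EC50, AC50, Ki, Kd or Potency). The minimal sketch below converts a few nanomolar values to that scale. The helper name `pchembl_from_nm` is made up for this example and is not part of any library.

```python
import math


def pchembl_from_nm(value_nm: float) -> float:
    """Convert a potency given in nanomolar to the -log10(molar) scale.

    Hypothetical helper for illustration only.
    """
    if value_nm <= 0:
        raise ValueError("concentration must be positive")
    return -math.log10(value_nm * 1e-9)


# Higher pChEMBL means a more potent compound.
for nm in (1.0, 100.0, 10_000.0):
    print(f"{nm:>8.1f} nM -> pChEMBL {pchembl_from_nm(nm):.2f}")
```

A potency of 1 nM therefore corresponds to a pChEMBL of 9, and every tenfold loss in potency lowers the value by one unit.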

Feel free to use online resources and the scientific literature as well. Also pay attention to the various columns in the data sets. Some of them might be useful for your analysis and point you to the relevant literature. Familiarizing yourself with the data can help you better understand the results of your analysis.

# 1. Visualizing Chemical Space

In this assignment, you will use your previous knowledge about molecular descriptors and unsupervised learning to cluster molecules. The data set you will be working with is saved in the `molecules.csv` file. It provides only the structures of the molecules as SMILES strings. These molecules can be clustered in multiple ways. Select the set of descriptors and the clustering method that, in your opinion, capture the relationships between the molecules in the data set most meaningfully. Note that this assignment does not have a single correct answer. Therefore, do not be afraid to explore and present multiple distinct solutions.
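
One possible starting point, sketched under several assumptions, is to describe each molecule with a Morgan fingerprint, compute pairwise Tanimoto distances, and cut a hierarchical clustering tree into flat clusters. The SMILES list below is a made-up stand-in for the contents of `molecules.csv`, and the radius, fingerprint size, linkage method and number of clusters are arbitrary choices rather than recommended settings.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical stand-in for the SMILES column of molecules.csv
smiles = ["CCO", "CCCO", "CCCCO", "c1ccccc1", "Cc1ccccc1", "CCc1ccccc1"]
mols = [Chem.MolFromSmiles(s) for s in smiles]

# Morgan (circular) fingerprints; radius and size are assumptions
generator = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fps = [generator.GetFingerprint(mol) for mol in mols]

# Pairwise Tanimoto distance matrix (1 - similarity)
n = len(fps)
dist = np.zeros((n, n))
for i in range(n):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps)
    dist[i, :] = 1.0 - np.array(sims)

# Average-linkage hierarchical clustering on the condensed distance matrix,
# then cut the tree into at most two flat clusters
tree = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(tree, t=2, criterion="maxclust")

for smi, label in zip(smiles, labels):
    print(label, smi)
```

Swapping the fingerprint for a table of scaled physicochemical descriptors, or changing the linkage method and cut level, gives alternative groupings that are worth comparing.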

In the discussion of your results, focus mainly on answering the following questions and include code and data to support your answers:

- How many distinct groups of compounds do you think there are in the data set and why?
- Was the dimensionality reduction method you chose able to group structurally similar compounds together?
- What do you think are the main advantages and disadvantages of the dimensionality reduction method you chose?
- Do you obtain similar results with multiple dimensionality reduction methods? If yes, which ones? If no, why do you think that is?
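
To compare dimensionality reduction methods in a reproducible way, it can help to embed the same feature matrix with each method and score how well known or suspected groups separate in two dimensions. The sketch below does this for PCA and t-SNE on synthetic bit vectors. The data, the group labels, the perplexity and the use of the silhouette score are all assumptions made for illustration and say nothing about the assignment data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Two synthetic "series" of 30 molecules each: every series tends to set
# bits in a different half of a 64-bit fingerprint (made-up data)
group = np.repeat([0, 1], 30)
probs_a = np.r_[np.full(32, 0.6), np.full(32, 0.1)]
probs_b = np.r_[np.full(32, 0.1), np.full(32, 0.6)]
probs = np.where(group[:, None] == 0, probs_a, probs_b)
X = (rng.random((60, 64)) < probs).astype(float)

# Embed the same matrix with a linear and a nonlinear method
embeddings = {
    "PCA": PCA(n_components=2, random_state=0).fit_transform(X),
    "t-SNE": TSNE(
        n_components=2, perplexity=10, init="pca", random_state=0
    ).fit_transform(X),
}

# Silhouette score of the known groups in each 2D embedding
scores = {name: silhouette_score(emb, group) for name, emb in embeddings.items()}
for name, score in scores.items():
    print(f"{name}: silhouette = {score:.2f}")
```

On real data the group labels are not known in advance, so cluster assignments or scaffold annotations would have to take their place, and a visual inspection of the plots remains important.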

# 2. Clustering Active and Inactive Molecules

The `Q99685_papyrus.tsv` file contains bioactivity data for 700 compounds from the [Papyrus](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-022-00672-x) data set. The compounds were tested for inhibition of monoglyceride lipase (MGLL, [Q99685](https://www.uniprot.org/uniprotkb/Q99685/entry)). Use your knowledge of clustering methods to cluster the molecules in the data set.

Try to find answers to the following questions and support your answers with data and code:

- Can you find clusters with a higher proportion of more active molecules? Discuss a few examples.
- What do you think is the most potent chemical series? List the measured bioactivities for all compounds in the series you identified.
- What do the most potent compounds have in common?
- If you use a dimensionality reduction method to plot the chemical space, do you obtain results similar to those of the clustering method? Do compounds belonging to one cluster also appear close to one another in the chemical space depiction?
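
Once every compound has a cluster label, the first question above comes down to a per-cluster summary. The sketch below computes the fraction of active compounds and the median activity per cluster with pandas. The table is made-up toy data, the column names `cluster` and `pchembl_value_Mean` are assumptions about how the labelled data might look, and the activity threshold of 6.5 is an arbitrary choice that should be adapted and justified.

```python
import pandas as pd

# Toy data: cluster labels plus a mean pChEMBL value per compound
# (column names and values are assumptions for illustration)
df = pd.DataFrame(
    {
        "cluster": [1, 1, 1, 2, 2, 2, 3, 3],
        "pchembl_value_Mean": [8.1, 7.4, 6.9, 5.2, 6.6, 4.9, 5.5, 5.0],
    }
)

# Assumed activity cutoff; choose and justify your own
threshold = 6.5
df["active"] = df["pchembl_value_Mean"] >= threshold

# Size, fraction of actives and median activity for each cluster
summary = (
    df.groupby("cluster")
    .agg(
        n=("active", "size"),
        active_fraction=("active", "mean"),
        median_pchembl=("pchembl_value_Mean", "median"),
    )
    .sort_values("active_fraction", ascending=False)
)
print(summary)
```

Small clusters can show extreme fractions by chance, so it is worth reporting the cluster size next to the fraction of actives.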