# The Basic Tools of the Deep Life Sciences
Welcome to DeepChem's introductory tutorial for the deep life sciences. This series of notebooks is a step-by-step guide for you to get to know the new tools and techniques needed to do deep learning for the life sciences. We'll start from the basics, assuming that you're new to machine learning and the life sciences, and build up a repertoire of tools and techniques that you can use to do meaningful work in the life sciences.

**Scope:** This tutorial will encompass both the machine learning and data handling needed to build systems for the deep life sciences.

## Colab

This tutorial and the rest in the series are designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/The_Basic_Tools_of_the_Deep_Life_Sciences.ipynb)


## Why do the DeepChem Tutorial?

**1) Career Advancement:** Applying AI in the life sciences is a booming
industry at present. There are a host of newly funded startups and initiatives
at large pharmaceutical and biotech companies centered around AI. Learning and
mastering DeepChem will bring you to the forefront of this field and
prepare you for a career in it.

**2) Humanitarian Considerations:** Disease is the oldest cause of human
suffering. From the dawn of human civilization, humans have suffered from pathogens,
cancers, and neurological conditions. One of the greatest achievements of
the last few centuries has been the development of effective treatments for
many diseases. By mastering the skills in this tutorial, you will be able to
stand on the shoulders of the giants of the past to help develop new
medicine.

**3) Lowering the Cost of Medicine:** The art of developing new medicine is
currently an elite skill that can only be practiced by a small core of expert
practitioners. By enabling the growth of open source tools for drug discovery,
you can help democratize these skills and open up drug discovery to more
competition. Increased competition can help drive down the cost of medicine.

## Getting Extra Credit
If you're excited about DeepChem and want to get more involved, there are some things that you can do right now:

* Star DeepChem on GitHub! - https://github.com/deepchem/deepchem
* Join the DeepChem forums and introduce yourself! - https://forum.deepchem.io
* Say hi on the DeepChem gitter - https://gitter.im/deepchem/Lobby
* Make a YouTube video teaching the contents of this notebook.


## Prerequisites

This tutorial sequence assumes some basic familiarity with the Python data science ecosystem, including libraries such as NumPy, Pandas, and TensorFlow. We'll provide brief refreshers on the basics throughout the tutorial, so don't worry if you're not an expert.

## Setup

The first step is to get DeepChem up and running. We recommend using Google Colab to work through this tutorial series. You'll need to run the following command to install DeepChem in your Colab notebook. Because we are going to use a model based on TensorFlow, we've added `[tensorflow]` to the pip install command so that the necessary dependencies are installed as well.

In [4]:
%pip install --pre deepchem[tensorflow]

Note: you may need to restart the kernel to use updated packages.


You can of course run this tutorial locally if you prefer. In this case, keep in mind that the above cell installs DeepChem and its TensorFlow dependencies into whichever Python environment is currently active, so you may want to use a dedicated environment. In either case, we can now import the `deepchem` package to play with.

In [5]:
import deepchem as dc
dc.__version__

'2.6.1'
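If the import above fails, it can help to check what is actually installed in the active environment. The following is a minimal sketch using only the Python standard library; `installed_version` is a hypothetical helper written for this illustration, not part of DeepChem, and it assumes the pip distribution name matches the import name.

```python
import importlib.util
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it cannot be imported.

    Hypothetical helper for illustration; it is not part of DeepChem.
    """
    # find_spec() looks the package up without actually importing it.
    if importlib.util.find_spec(package_name) is None:
        return None
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        # Importable, but no distribution metadata under this exact name
        # (for example, a source checkout or a differently named wheel).
        return "unknown"


for name in ("deepchem", "tensorflow"):
    print(f"{name}: {installed_version(name)}")
```

A result of `None` means the package is not visible to the current interpreter, which usually points to the install having run in a different environment or to a kernel that needs restarting.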

# Training Your First Model with DeepChem

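Deep learning can be used to solve many kinds of problems, but the basic workflow is usually the same. Here are the typical steps you follow.

1. Select the dataset you will train your model on (or create a new dataset if no suitable one exists).
2. Create the model.
3. Train the model on the training set.
4. Evaluate the model on a validation set that is independent of the training set.
5. Use the model to make predictions about new data.

With DeepChem, each of these steps can take as little as one or two lines of Python code. In this tutorial, we will walk through a basic example that shows the complete workflow for solving a real-world scientific problem.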

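The problem we will solve is predicting the solubility of small molecules given their chemical formulas. This is a very important property in drug development: if a proposed drug isn't soluble enough, too little of it is likely to reach the patient's bloodstream to have a therapeutic effect. The first thing we need is a dataset of measured solubilities for real molecules. One of the core components of DeepChem is MoleculeNet, a diverse collection of chemical and molecular datasets. For this tutorial, we can use the Delaney solubility dataset. The solubility values in this dataset are expressed as log(solubility), where solubility is measured in moles/liter.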
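To make the units concrete, here is a small sketch of converting between log(solubility) and solubility in moles/liter. It assumes the logarithm is base 10, which is the usual convention for log-solubility values; the two helper functions are hypothetical names introduced for this illustration and are not part of DeepChem.

```python
import math


def log_solubility_to_molar(log_s):
    """Convert log10(solubility) to solubility in moles/liter (assumes base 10)."""
    return 10.0 ** log_s


def molar_to_log_solubility(mol_per_liter):
    """Convert solubility in moles/liter to log10(solubility) (assumes base 10)."""
    return math.log10(mol_per_liter)


# Each step of 1.0 in log(solubility) is a tenfold change in concentration.
for log_s in (-4.0, -2.0, 0.0):
    molar = log_solubility_to_molar(log_s)
    print(f"log(solubility) = {log_s:+.1f}  ->  {molar:g} mol/L")
```

Working on the log scale keeps molecules whose solubilities differ by several orders of magnitude on a comparable numeric range, which is convenient for a regression model.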

In [6]:
import deepchem as dc
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer='GraphConv')
train_dataset, valid_dataset, test_dataset = datasets

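I won't say too much about this code right now. We will see many similar examples in later tutorials. There are two details I do want to draw your attention to. First, notice the `featurizer` argument passed to the `load_delaney()` function. Molecules can be represented in many ways, so we tell it which representation we want to use, or in more technical language, how to "featurize" the data. Second, notice that we actually get three different datasets back: a training set, a validation set, and a test set (`train_dataset`, `valid_dataset`, and `test_dataset`, respectively). Each of these serves a different function in the standard deep learning workflow.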
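To illustrate what it means to have three separate datasets, here is a minimal pure-Python sketch of a random 80/10/10 split over sample indices. This is only an illustration: the fractions, the seed, and the `split_indices` helper are assumptions made for this example, and the MoleculeNet loader chooses its own splitting strategy, which may differ from a simple random shuffle.

```python
import random


def split_indices(n_samples, frac_train=0.8, frac_valid=0.1, seed=0):
    """Randomly partition range(n_samples) into train/valid/test index lists.

    Hypothetical helper for illustration; not DeepChem's splitter.
    """
    indices = list(range(n_samples))
    # A seeded generator makes the split reproducible from run to run.
    random.Random(seed).shuffle(indices)
    n_train = int(frac_train * n_samples)
    n_valid = int(frac_valid * n_samples)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # whatever remains
    return train, valid, test


train_idx, valid_idx, test_idx = split_indices(100)
print(len(train_idx), len(valid_idx), len(test_idx))

# No sample may appear in more than one subset.
assert not set(train_idx) & set(valid_idx)
assert not set(train_idx) & set(test_idx)
assert not set(valid_idx) & set(test_idx)
```

The key property is that the subsets are disjoint: the model only ever learns from the training indices, so scores on the other two sets reflect how it handles molecules it has not seen.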

Now that we have our data, the next step is to create a model. We will use a particular kind of model called a "graph convolutional network", or "graphconv" for short.

In [7]:
model = dc.models.GraphConvModel(n_tasks=1, mode='regression', dropout=0.2)

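Here again I will not say much about the code. Later tutorials will give much more information about `GraphConvModel`, as well as the other types of models provided by DeepChem.

We now need to train the model on the training set. We simply give it the dataset and tell it how many epochs of training to perform (that is, how many complete passes through the training set to make).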
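To show what an "epoch" means in practice, here is a toy sketch in plain Python: mini-batch gradient descent fitting a straight line. It is not how `GraphConvModel` is implemented; the model, data, batch size, and learning rate are all made up for this illustration, and only the `nb_epoch` name is borrowed from the tutorial.

```python
def train_linear_model(xs, ys, nb_epoch=100, batch_size=2, learning_rate=0.05):
    """Fit y = w * x + b by mini-batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    for _ in range(nb_epoch):
        # One epoch = one complete pass over the training set, batch by batch.
        for start in range(0, len(xs), batch_size):
            batch_x = xs[start:start + batch_size]
            batch_y = ys[start:start + batch_size]
            n = len(batch_x)
            # Gradients of the batch mean squared error with respect to w and b.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(batch_x, batch_y)) / n
            grad_b = sum(2 * (w * x + b - y) for x, y in zip(batch_x, batch_y)) / n
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b = train_linear_model(xs, ys, nb_epoch=500)
print(f"w = {w:.3f}, b = {b:.3f}")
```

More epochs give the optimizer more passes over the same data, which generally lowers the training loss. On real datasets, too many passes can also increase the risk of fitting noise in the training set.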

In [8]:
model.fit(train_dataset, nb_epoch=100)



0.10950069427490235

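If everything has gone well, we should now have a fully trained model! But is it any good? To find out, we must evaluate the model on data it was not trained on. We do that by selecting an evaluation metric and calling `evaluate()` on the model. For this example, let's use the squared Pearson correlation coefficient, also known as r<sup>2</sup>, as our metric. We can evaluate it on both the training set and the test set.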

In [9]:
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print("Training set score:", model.evaluate(train_dataset, [metric], transformers))
print("Test set score:", model.evaluate(test_dataset, [metric], transformers))

Training set score: {'pearson_r2_score': 0.9236002953506689}
Test set score: {'pearson_r2_score': 0.6644477884769016}


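Notice that it has a higher score on the training set than on the test set. Models usually perform better on the training set than on the test set. This is called "overfitting", and it is the reason why evaluating your model on an independent test set is essential.

Our model still has quite respectable performance on the test set. For comparison, a model that produced totally random outputs would have a correlation of 0, while one that made perfect predictions would have a correlation of 1. Our model does quite well, so now we can use it to make predictions about other molecules we care about.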
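To make the 0-to-1 scale concrete, here is a pure-Python sketch of the squared Pearson correlation. The `pearson_r2` function is a hypothetical stand-in written for this illustration, not DeepChem's implementation, and the toy values are made up.

```python
def pearson_r2(y_true, y_pred):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n
    # Un-normalized covariance and variances; the 1/n factors cancel in the ratio.
    cov = sum((t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred))
    var_true = sum((t - mean_true) ** 2 for t in y_true)
    var_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    return cov ** 2 / (var_true * var_pred)


y_true = [1.0, 2.0, 3.0, 4.0]

print("perfect predictions:     ", pearson_r2(y_true, y_true))
print("scaled predictions:      ", pearson_r2(y_true, [10.0, 20.0, 30.0, 40.0]))
print("uncorrelated predictions:", pearson_r2(y_true, [1.0, -1.0, -1.0, 1.0]))
```

Note that the scaled predictions score as well as the perfect ones: this metric measures how well predictions track the true values linearly, not how close they are in absolute terms, so it is often reported alongside an error metric.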

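Since this is just a tutorial and we don't have any other molecules we specifically want to predict, let's just use the first ten molecules from the test set. For each one, we print the predicted log(solubility), the true log(solubility), and the chemical structure (represented as a SMILES string).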

In [10]:
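# Predict on the first ten test molecules only.
solubilities = model.predict_on_batch(test_dataset.X[:10])
# zip() stops at the shortest input, so only ten rows are printed.
for molecule, solubility, test_solubility in zip(test_dataset.ids, solubilities, test_dataset.y):
    # Columns: predicted value, true value, SMILES string.
    print(solubility, test_solubility, molecule)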

[-1.7492032] [-1.60114461] c1cc2ccc3cccc4ccc(c1)c2c34
[0.9326974] [0.20848251] Cc1cc(=O)[nH]c(=S)[nH]1
[-0.39125377] [-0.01602738] Oc1ccc(cc1)C2(OC(=O)c3ccccc23)c4ccc(O)cc4 
[-1.9638431] [-2.82191713] c1ccc2c(c1)cc3ccc4cccc5ccc2c3c45
[-1.6334522] [-0.52891635] C1=Cc2cccc3cccc1c23
[1.6670077] [1.10168349] CC1CO1
[-0.42416394] [-0.88987406] CCN2c1ccccc1N(C)C(=S)c3cccnc23 
[-1.2786667] [-0.52649706] CC12CCC3C(CCc4cc(O)ccc34)C2CCC1=O
[-1.0821681] [-0.76358725] Cn2cc(c1ccccc1)c(=O)c(c2)c3cccc(c3)C(F)(F)F
[0.64967585] [-0.64020358] ClC(Cl)(Cl)C(NC=O)N1C=CN(C=C1)C(NC=O)C(Cl)(Cl)Cl 
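One quick way to summarize the ten rows above is the mean absolute error between the two columns. The sketch below copies the (predicted, true) pairs from the printed output and uses plain Python; it is only a summary of these ten molecules, in the same units as the printed values, and says nothing about the rest of the test set.

```python
# (predicted, true) pairs copied from the printed output above.
pairs = [
    (-1.7492032, -1.60114461),
    (0.9326974, 0.20848251),
    (-0.39125377, -0.01602738),
    (-1.9638431, -2.82191713),
    (-1.6334522, -0.52891635),
    (1.6670077, 1.10168349),
    (-0.42416394, -0.88987406),
    (-1.2786667, -0.52649706),
    (-1.0821681, -0.76358725),
    (0.64967585, -0.64020358),
]

abs_errors = [abs(predicted - true) for predicted, true in pairs]
mae = sum(abs_errors) / len(abs_errors)
# Index of the row with the largest absolute error.
worst = max(range(len(pairs)), key=lambda i: abs_errors[i])

print(f"Mean absolute error over {len(pairs)} molecules: {mae:.3f}")
print(f"Largest error: {abs_errors[worst]:.3f} (row {worst + 1})")
```

Looking at individual errors like this complements the correlation score, because it shows how far off single predictions can be even when the overall trend is captured.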


# Congratulations! Time to join the Community!

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

## Join the DeepChem Gitter
The DeepChem [Gitter](https://gitter.im/deepchem/Lobby) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!

## Citing This Tutorial
If you found this tutorial useful please consider citing it using the provided BibTeX. 

In [None]:
@manual{Intro1, 
 title={The Basic Tools of the Deep Life Sciences}, 
 organization={DeepChem},
 author={Ramsundar, Bharath}, 
 howpublished = {\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/The_Basic_Tools_of_the_Deep_Life_Sciences.ipynb}}, 
 year={2021}, 
} 