A collection of resources useful for leveraging big data and AI for drug discovery. It mainly serves as an orientation for new lab folks.

Introduction to Bioinformatics/Cheminformatics

(***) An Introduction to Statistical Learning by Robert Tibshirani.
If you have little statistical background, read this book first.

(***) Machine Learning course at Coursera by Andrew Ng.
A required course for anyone interested in machine learning.

(***) R & Bioconductor Manual.
Make sure you have run the code before taking on any bioinformatics project.

(***) HT Sequence Analysis with R and Bioconductor.
Make sure you have run the code before taking on any NGS project.

(***) ChemmineR: Cheminformatics Toolkit for R.
We suggest running the code before taking on any cheminformatics project.

(**) Step by Step to practice deep learning.
PyTorch tutorial for deep learning.

(***) HarvardX Biomedical Data Science Open Online Training.
A comprehensive tutorial on data science, with code from rafalab.

Fundamental papers

I have read each of these papers at least 10 times, including the supplementary materials! All of them are three stars!

Field review

(***) Hallmarks of Cancer: The Next Generation, by Robert A. Weinberg.
Fundamental to understanding cancer.

(***) Tumor Metastasis: Molecular Insights and Evolving Paradigms, by Robert A. Weinberg.
Fundamental to understanding cancer metastasis.

(***) Cancer genome landscapes, by Bert Vogelstein.
Fundamental to understanding cancer genomics.

(***) Cancer transcriptome profiling at the juncture of clinical translation, by Arul M. Chinnaiyan.
A review of cancer transcriptomics.

(***) Ewing sarcoma: historical perspectives, current state-of-the-art, and opportunities for targeted therapy in the future.
A typical review on the therapeutic discovery of one cancer.

(***) Opportunities and challenges in phenotypic drug discovery: an industry perspective.
Our drug discovery approach is one type of phenotypic screening.

(***) Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges, by Purvesh Khatri.
A very nice summary of method development in pathway analysis.

(***) Deep learning by Yann LeCun, Yoshua Bengio & Geoffrey Hinton.
Deep learning review.

Statistical method development

(***) Significance analysis of microarrays applied to the ionizing radiation response, by Robert Tibshirani.
Development of SAM, a popular method to perform differential expression analysis using microarray data.

(***) limma: Linear Models for Microarray Data.
Development of LIMMA, another popular method to perform differential expression analysis using microarray data.

(***) Differential expression analysis for sequence count data, by Simon Anders.
Development of DESeq, a popular method to perform differential expression analysis using RNA-seq data.

(***) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles, by Jill P. Mesirov.
Development of GSEA, the most popular gene set enrichment analysis method and fundamental to understanding our drug discovery method.

(***) Adjusting batch effects in microarray expression data using Empirical Bayes methods.
Development of ComBat, a method to correct batch effects.

(***) Emergence of Scaling in Random Networks by Albert-László Barabási.
Discovery of scale-free networks.

(***) PathSim: Meta path-based top-k similarity search in heterogeneous information networks by Jiawei Han.
A typical machine learning approach to mining heterogeneous networks.

(***) MuSiC: Identifying mutational significance in cancer genomes, by Li Ding.
Development of MuSiC, a popular method to identify significant mutations in cancer genomes.
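
As a rough illustration of the GSEA entry above, the running-sum enrichment score can be sketched in plain Python. This is a minimal sketch, not the GSEA software: it walks down a ranked gene list, steps up at genes in the set (weighted by the absolute ranking score) and down otherwise, and reports the largest deviation from zero. The function name, the toy genes, and the scores are made up for illustration, and the permutation test and score normalization from the paper are left out.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Running-sum enrichment score of gene_set over a ranked gene list.

    ranked_genes: gene names sorted by their ranking score (best first).
    scores: ranking scores in the same order as ranked_genes.
    weight: exponent applied to |score| for genes in the set.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_weights = [abs(s) ** weight if h else 0.0 for s, h in zip(scores, hits)]
    total_hit = sum(hit_weights)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")

    running = 0.0
    best = 0.0
    for is_hit, w in zip(hits, hit_weights):
        # Step up proportionally to the gene's weight, or down by a fixed amount.
        running += w / total_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy example (hypothetical genes and scores).
genes = ["A", "B", "C", "D", "E", "F"]
values = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
top_es = enrichment_score(genes, values, {"A", "B"})
bottom_es = enrichment_score(genes, values, {"E", "F"})
```

A set concentrated at the top of the list gets a positive score and a set concentrated at the bottom gets a negative one, which is the property our drug discovery approach relies on.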

Informatics method development and application

(***) The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease, by Justin Lamb.
The first paper to describe our drug discovery approach.

(***) Discovery and Preclinical Validation of Drug Indications Using Compendia of Public Gene Expression Data, from Atul Butte's lab.
The basis of our drug discovery work, and a great demonstration of writing a computational paper (from method development to experimental validation).

(***) Relating protein pharmacology by ligand chemistry by Michael J Keiser and Brian K Shoichet.
The development of SEA, a method to predict drug-target interactions, and another great demonstration of writing a computational paper.

(***) Characterization of drug-induced transcriptional modules: towards drug repositioning and functional understanding by Peer Bork.
Starts with data analysis and ends with a few biological experiments.

(***) Cross-Species Regulatory Network Analysis Identifies a Synergistic Interaction between FOXM1 and CENPF that Drives Prostate Cancer Malignancy by Andrea Califano.
Starts with data analysis and ends with a few biological experiments.

(***) Elucidating compound mechanism of action by network perturbation analysis by Andrea Califano.
Starts with data analysis and ends with a few biological experiments.

(***) Discovery of drug mode of action and drug repositioning from transcriptional responses.
Starts with data analysis and ends with a few biological experiments.

(***) Imagenet classification with deep convolutional neural networks.
Development of AlexNet, the convolutional neural network (CNN) that made deep learning popular for image classification.

Computational analysis

(***) Drug-target network by Barabási.
Network analysis of drug-target interactions.

(***) Comprehensive molecular portraits of human breast tumours from TCGA.
A typical genomic analysis paper from TCGA.

(***) Mutational landscape and significance across 12 major cancer types, by Li Ding.
A phenomenal paper on pan-cancer genomic analysis.

(***) Comprehensive Characterization of Molecular Differences in Cancer between Male and Female Patients, by Han Liang.
A phenomenal paper on pan-cancer genomic analysis.

(***) Genetics of rheumatoid arthritis contributes to biology and drug discovery, by Robert M. Plenge.
Great work using genetics for drug discovery.

(***) The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity.
Phenomenal work using cell line data to discover biomarkers.

(***) A comprehensive time-course–based multicohort analysis of sepsis and sterile inflammation reveals a robust diagnostic gene set by Purvesh Khatri.
Phenomenal work using public microarray data to discover biomarkers.

(***) Prediction of biological targets for compounds using multiple-category Bayesian models trained on chemogenomics databases by Jeremy Jenkins.
A typical machine learning paper in cheminformatics.

(***) Do structurally similar molecules have similar biological activity?
A typical data analysis paper in cheminformatics.
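
The cheminformatics papers above rest on measuring how similar two molecules are. A common choice is the Tanimoto coefficient between structural fingerprints, sketched below in plain Python. This is only a sketch: fingerprints are represented as sets of "on" bit positions, and the example bit positions are invented. In practice a toolkit such as RDKit or ChemmineR would generate the fingerprints from real structures.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        # Two empty fingerprints share nothing measurable; report 0 by convention here.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical fingerprints (bit positions are invented for illustration).
compound_1 = {1, 2, 3}
compound_2 = {2, 3, 4}
similarity = tanimoto(compound_1, compound_2)
```

Two compounds are usually called structurally similar when this value exceeds a chosen cutoff; the cutoff depends on the fingerprint type and is a judgment call.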

Shape our future

(***) Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma.
Application of single-cell analysis in a cancer study.

(***) Single-cell transcriptomics uncovers distinct molecular signatures of stem cells in chronic myeloid leukemia.
Application of single-cell analysis toward personalized cancer therapy.

(***) Brown Adipogenic Reprogramming Induced by a Small Molecule by Sheng Ding.
Using small molecules to control cell development.

(***) Correlating chemical sensitivity and basal gene expression reveals mechanism of action, from Stuart Schreiber.
Usage of pharmacogenomics data to understand drug mechanisms.

(***) A Next Generation Connectivity Map: L1000 Platform And The First 1,000,000 Profiles from Todd R. Golub.
LINCS, the dataset we primarily used for drug discovery.

(***) Integrative clinical genomics of metastatic cancer.
We have a lot of experience working on primary cancer; now it is time to turn our attention to metastatic cancer, from which the majority of patients die.

(***) Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks.
Uses a deep learning GAN to translate between domains (here, unpaired image-to-image translation).

Outstanding tools and datasets for translational drug discovery

We use liver cancer as an example, but the same approach can be applied to other cancers. Only outstanding tools and datasets for liver cancer drug discovery are listed.


(***) ClinicalTrials.gov.
Search drugs used in liver cancer clinical trials.

(**) Cancer Today (Globocan): Data visualization tools that present current national estimates of cancer incidence, mortality, and prevalence.
Search liver cancer incidence.

(*) UK Biobank. UK Biobank Engine.
Search public clinical/molecular data for liver cancer and search genetic variants for liver cancer via Stanford Biobank Engine.

(**) COSMIC.
Search somatic mutations in liver cancer.

Target Discovery

(***) cBioPortal.
Search molecular alterations for liver cancer from public datasets including TCGA.

(***) GTEx.
Search gene expression in normal liver tissues.

(***) The Human Protein Atlas.
Search protein expression and pathology for liver cancer.

(***) Cancer Cell Line Encyclopedia.
Search gene expression in liver cancer cell lines.

(*) Project Achilles.
Search essential genes in liver cancer cells.

(***) GEO.
Search functional genomics data for liver cancer, requiring additional computational analysis to create a liver cancer signature.

(***) Enrichr.
Search enriched TFs/pathways/biological processes/cell types given a list of genes.

(***) STRING DB.
Visualize protein-protein interactions.
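
To make the idea of "enriched" pathways in the Enrichr entry above concrete, here is a minimal sketch of an over-representation test based on the hypergeometric distribution. It is an assumption that this matches any particular tool's scoring; Enrichr and similar tools combine or adjust such statistics in their own ways. The function name and the numbers in the example are made up.

```python
from math import comb


def hypergeom_pvalue(n_background, n_in_set, n_query, n_overlap):
    """P(overlap >= n_overlap) when n_query genes are drawn without replacement.

    n_background: total number of genes considered.
    n_in_set: number of background genes annotated to the pathway.
    n_query: size of the input gene list.
    n_overlap: number of input genes that fall in the pathway.
    """
    max_overlap = min(n_in_set, n_query)
    total = comb(n_background, n_query)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_query - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 10 genes in total, 3 in the pathway,
# and a 3-gene query list that hits all 3 pathway genes.
p_value = hypergeom_pvalue(10, 3, 3, 3)
```

When many gene sets are tested at once, the resulting p-values need a multiple-testing correction before they are interpreted.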

Drug Discovery

(***) PubChem.
Everything you need to know about a compound or drug.

(**) DrugBank.
Search drug-target-indication relationships.

(**) SEA.
Predict targets of a given compound.

(***) LINCS.
Predict drugs given a liver cancer signature.

(**) ChemMine.
Very useful for chemical structure enrichment analysis.
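
The LINCS entry above mentions predicting drugs from a disease signature. One simple way to picture this is to ask whether a drug's expression signature is anti-correlated with the disease signature. The sketch below uses Spearman rank correlation over shared genes as a stand-in; the actual Connectivity Map and LINCS pipelines use their own connectivity scores, so treat this as an illustration only. All gene names and values are invented.

```python
def average_ranks(values):
    """Ranks starting at 1, with ties given the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def reversal_score(disease_sig, drug_sig):
    """Correlation over shared genes; values near -1 suggest reversal."""
    shared = sorted(set(disease_sig) & set(drug_sig))
    return spearman([disease_sig[g] for g in shared],
                    [drug_sig[g] for g in shared])


# Hypothetical log fold changes (gene names and values are made up).
disease = {"G1": 2.0, "G2": 1.0, "G3": -1.0, "G4": -2.0}
drug_a = {"G1": -1.5, "G2": -0.5, "G3": 0.7, "G4": 1.9}
score_a = reversal_score(disease, drug_a)
```

Ranking all profiled drugs by such a score and looking at the most negative ones is the basic idea; any candidate still needs experimental validation.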

NGS analysis

(***) RNASEQ blog.
A great collection of RNA-seq analysis methods/applications.

(*) RPKM, FPKM and TPM, clearly explained

(*) RNA-seq workflow: gene-level exploratory analysis and differential expression
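
For the RPKM/FPKM/TPM entry above, the difference between the units is easiest to see in a small calculation. The sketch below uses the standard definitions for single-end read counts (for paired-end data, FPKM applies the same formula to fragments). The gene names, counts, and lengths are invented, and real pipelines typically use effective rather than annotated gene lengths.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of gene length per million mapped reads."""
    total_reads = sum(counts.values())
    return {
        gene: counts[gene] * 1e9 / (total_reads * lengths_bp[gene])
        for gene in counts
    }


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rate = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    scale = sum(rate.values())
    return {gene: rate[gene] / scale * 1e6 for gene in counts}


# Hypothetical sample: two genes with equal read counts but different lengths.
counts = {"geneA": 100, "geneB": 100}
lengths = {"geneA": 1000, "geneB": 2000}
rpkm_values = rpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
```

TPM values always sum to one million within a sample, which makes proportions comparable across samples; RPKM values do not have that property.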

Python packages

(***) Anaconda.
We suggest using Anaconda to manage Python packages.

(***) scikit-learn.
A popular Python machine learning package.

(**) RDKit.
A free Python library for processing chemical structures.

(***) PyTorch.
Deep learning framework.

R/Bioconductor packages

(**) ggplot cheatsheet. A must-read for visualizing data in R.

(**) ChemmineR: Cheminformatics Toolkit for R

(**) biomaRt.
A great package to map IDs.

(***) GEOquery.
Search and download data from GEO.

(**) cgdsr. An API to access cBioPortal data.

(**) pheatmap.
Visualize heatmaps.

(*) Easy Way to Mix Multiple Graphs on The Same Page.