Welcome to your DataCamp project audition! This notebook must be filled out and vetted before a contract can be signed and you can start creating your project.

There are two parts to be completed in this notebook:

1. **Project information**:  The title of the project, a project description, assumed student background, etc.
2. **Project introduction**: The first three text and code cells that will form the introduction of your project.

When you are happy with this document, include the file as an attachment in an email to me as well as the datasets used. If you have any questions, feel free to reach out to me at any time!

David Venturi<br>
david.venturi@datacamp.com

# 1. Project information

**Project title**: DrugBank: A Temporal Analysis.

**Name:** Your full name.

**Email address associated with your DataCamp account:** You can find this email [here](https://www.datacamp.com/profile/account_settings) if you have a DataCamp account.

**Project description**: Widely used in drug discovery, DrugBank is a publicly available resource that stores high-quality information on drugs and their targets. Since its inception in 2005, DrugBank has been frequently used by pharmaceutical companies, medicinal chemists, students and the general public. In this project, we will perform a temporal analysis of DrugBank: we will observe how the database has evolved over time, which gives us a bird's-eye view of progress in drug discovery as a whole and of any trends that have emerged. We will explore the dataset with data visualization packages such as *ggplot2* and *ggvis*.

**Dataset(s) used**: The dataset is a subset of the full information stored in the DrugBank database. The dataset consists of information on thousands of drugs including their commercial names, indications (i.e. diseases/symptoms they are intended to treat), targets, drug manufacturers and so on.

**Assumed student knowledge**: Familiarity with the *tidyverse* set of packages would be helpful but is not strictly necessary. Specifically, we will be using the *dplyr* and *ggplot2* packages as well as the pipe operator, `%>%`. We also make use of the *ggvis* and *???* packages to produce some interactive visualizations.

# 2. Project introduction

The final output of a DataCamp project looks like a blog post: pairs of text and code cells that tell a story about data. The text is written from the perspective of the data analyst and *not* from the perspective of an instructor on DataCamp. So, for this blog post intro, all you need to do is pretend like you're writing a blog post -- forget the part about instructors and students.

Below you'll see the structure of a DataCamp project: a series of "tasks" where each task consists of a title, a **single** text cell, and a **single** code cell. There are 8-12 tasks in a project and each task can have up to 10 lines of code. What you need to do:
1. Read through the template structure.
2. As best you can, divide your project as it is currently visualized in your mind into tasks.
3. Fill out the template structure for the first three tasks of your project.

As you are completing each task, you may wish to consult the project notebook format in our [documentation](https://authoring.datacamp.com/projects/projects-format.html). Only the `@context` and `@solution` cells are relevant to this audition.

## 1. *Drugs* ... Past and Present

DrugBank is a freely available public database that is widely used in the field of drug discovery. It stores extensive information on drugs and their targets. Established in 2005, DrugBank has since been adding so-called "drug cards": drug records that contain supplementary information on the drugs (e.g. chemical structure, commercial names, ADMET properties, etc.) as well as their known targets (usually proteins). These drug cards are regularly curated by experts who ensure that the data inserted into DrugBank are accurate and up to date. These data are useful for pharmaceutical scientists and biochemists who are involved in drug development. DrugBank may be accessed at: https://www.drugbank.ca/

The dataset we are using here is a subset of the full DrugBank database that has the information we need for our analysis. Specifically, we'll be doing a temporal analysis of some aspects of the DrugBank database. We'll observe how the database has evolved over time in terms of the numbers of drugs, targets and indications stored.

For starters, let's have a look at the *drugs* of the DrugBank database and how their numbers have increased over time. Let's display the following statistics over time since the inception of DrugBank:
- The number of small-molecule drugs
- The number of biotech drugs
- The number of drugs
- The number of approved drugs


***Add logo of drugbank (or a royalty-free drug-related image) to the right hand side of this context cell. Make sure that the image you use has a [permissive license](https://support.google.com/websearch/answer/29508?hl=en) and display them using [Markdown](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet#images).***

In [2]:
# Code for the first task
# It should consist of up to 10 lines of code (not including comments)
# and take at most 5 seconds to execute on an average laptop.


# ************
# CODE
# 
# You will create a cumulative plot starting from "2005-06-13" all the way till 
# the current day. It will be a line graph containing four different plots for 
# the four different statistics (X-axis = 'created' column of the 'drugs' data 
# frame, Y-axis = the different stats mentioned above). The values of the 
# different statistics at the current date should match those displayed at the 
# following link:
# https://www.drugbank.ca/stats
# 
# Wanna do something different?
# Figures 1c/1d of the paper, [2007] Drug-Target Network, offer an alternative 
# way of displaying the same above information; a bar graph for each year with 
# the bars being filled by subcategories. In this case, you will split the above
# cumulative graph into two bar plots because while the drugs are either 
# small-molecule or biotech (i.e. mutually exclusive), the 'approved' category
# intersects with both (i.e. there are approved and non-approved biotech drugs 
# as well as approved and non-approved small-molecule drugs):
# - plot 1 --> bar length: #drugs in this year, each bar containing two 
#   sub-categories (small-molecule, biotech)
# - plot 2 --> bar length: #drugs in this year, each bar containing two 
#   sub-categories (approved, non-approved)
# 
# Wanna do something fancy?
# Let it be an interactive plot! Here's what it may look like. Depending on the 
# position of the mouse cursor, display a vertical line there along with a tool 
# tip that shows this info:
# - The value on the x-axis (i.e. the creation date)
# - The values of all the different stats at this date
# - (optional) log-scale y-axis?
# Need inspiration?  -->  check out these links:
# - https://www.r-graph-gallery.com/129-use-a-loop-to-add-trace-with-plotly/
# - https://www.r-graph-gallery.com/interactive-charts/
# - https://moderndata.plot.ly/interactive-r-visualizations-with-d3-ggplot2-rstudio/
# - https://www.r-graph-gallery.com/get-the-best-from-ggplotly/
# - https://www.datacamp.com/courses/ggvis-data-visualization-r-tutorial
# - https://www.statmethods.net/advgraphs/interactive.html
# 
# Wanna do something sneaky? (keep as last resort, we may not need this)
# Let tasks 1 and 2 be the same visualization, but the first is static and the 
# second is interactive. In the second, you can add check boxes to toggle on/off 
# some of the stats if the viz is getting crowded. You may also consider adding 
# more stats than what I mentioned above (e.g. #targets, #different subcategories 
# of targets, enzymes/carriers/transporters, #manufacturers, etc.). Again, check 
# the stats on the DrugBank website to make sure that the values of these stats 
# at the current date are correct (or close enough).
# 
# Finally
# If you're feeling creative, feel free to add your own touch. One other thing 
# that is worth mentioning is that there is some information that I couldn't 
# find in the files on the Git repo (most obvious is the drugs' market release 
# dates). What we have in the files is only the creation dates of their respective 
# records in the DrugBank database. For example, a drug released in 1982 would have 
# a creation date of "2005-06-13" as this is the date that the DrugBank database 
# was established. This is why, with the current data in the repo, we couldn't 
# display the visualization in Fig 1 of the paper: [2007] Drug-Target Network
# ************
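
A minimal sketch of the cumulative plot described in the notes above, not a definitive solution. It assumes a `drugs` data frame with one row per drug; only the `created` column is named in the notes, so the `type` and `approved` columns and all values in the toy data are hypothetical stand-ins for the real dataset.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for the 'drugs' data frame.
# 'type' and 'approved' are assumed column names, not taken from DrugBank.
drugs <- data.frame(
  created  = as.Date(c("2005-06-13", "2005-06-13", "2007-03-01",
                       "2010-08-20", "2015-01-05")),
  type     = c("small molecule", "biotech", "small molecule",
               "small molecule", "biotech"),
  approved = c(TRUE, TRUE, FALSE, TRUE, FALSE)
)

# Sort by creation date, then keep one running total per statistic
cumulative <- drugs %>%
  arrange(created) %>%
  mutate(
    n_drugs          = row_number(),
    n_small_molecule = cumsum(type == "small molecule"),
    n_biotech        = cumsum(type == "biotech"),
    n_approved       = cumsum(approved)
  )

# Stack the four running totals into long format so that ggplot2 can
# draw one line per statistic
long <- data.frame(
  created   = rep(cumulative$created, times = 4),
  statistic = rep(c("all drugs", "small molecule", "biotech", "approved"),
                  each = nrow(cumulative)),
  count     = c(cumulative$n_drugs, cumulative$n_small_molecule,
                cumulative$n_biotech, cumulative$n_approved)
)

p <- ggplot(long, aes(x = created, y = count, colour = statistic)) +
  geom_step() +
  labs(x = "Record creation date", y = "Cumulative count")
print(p)
```

With the real data, the last row of `cumulative` is the one to compare against the totals on the DrugBank statistics page.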

## 2. Now, It's The *Targets*' Turn

From the previous plot, we observed blah, blah, blah. This is interesting information because yada, yada, yada.

Now that we've had a look at some drug statistics, let's shift our attention to the drug *targets*. In this task, we'll show some target statistics and how they evolved over the lifetime of the DrugBank database. Let's go!

***Character Limit: 800, Paragraph Limit: 3***

In [3]:
# Code for the second task
# It should consist of up to 10 lines of code (not including comments)
# and take at most 5 seconds to execute on an average laptop.


# ************
# CODE
# 
# You will create a cumulative plot similar to the previous one, but with different 
# stats on the Y-axis. They are:
# - #targets
# - #membrane targets
# - #cytoplasm targets
# - #nucleus targets
# - #organelles targets
# - #exterior targets
# - #other targets?
# If one or two categories are particularly difficult/time-consuming to get, put 
# into "other targets". The information of which subcategories the targets belong 
# to will likely be in the Gene Ontology information of the targets. Unfortunately, 
# this information is probably external to DrugBank. If the above subcategories 
# are too tough to deal with, use the stats below instead:
# - #targets
# - #enzymes
# - #transporters
# - #carriers
# 
# Again...
# You may make this into an interactive plot if you wish (e.g. ggvis)
# 
# By the way...
# The following paper (Table 1) contains an evolution of the different stats of DrugBank over time:
# https://academic.oup.com/nar/article/46/D1/D1074/4602867
# ************
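
One possible way to build the per-category totals for the fallback statistics (targets, enzymes, transporters, carriers), sketched under assumptions: the `targets` data frame, its `category` column, and every value in it are hypothetical, and the real files may need joining or reshaping before they reach this form.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: one row per protein record, labelled with an
# assumed 'category' column and the creation date of the record.
targets <- data.frame(
  created  = as.Date(c("2005-06-13", "2005-06-13", "2006-02-01",
                       "2008-09-15", "2008-11-30", "2012-04-10",
                       "2012-07-22")),
  category = c("enzyme", "target", "target", "transporter",
               "target", "enzyme", "carrier")
)

# Count the records added per category and year, then accumulate the
# yearly additions into a running total within each category
yearly <- targets %>%
  mutate(year = as.integer(format(created, "%Y"))) %>%
  count(category, year, name = "added") %>%
  group_by(category) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(total = cumsum(added)) %>%
  ungroup()

p <- ggplot(yearly, aes(x = year, y = total, colour = category)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Cumulative count")
print(p)
```

The `added` column holds the per-year increments, so the same table could also feed a bar chart of yearly additions.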

## 3. Next Up, Drug-Target *Interactions*

At this point, we have now observed *drugs* and *targets* over time. We now look at drug-target *interactions*.

***Character Limit: 800, Paragraph Limit: 3***

In [5]:
# Code for the third task
# It should consist of up to 10 lines of code (not including comments)
# and take at most 5 seconds to execute on an average laptop.


# ************
# CODE
# 
# You will create a network visualization similar to Fig 2 of the paper: 
# [2007] Drug-Target Network
# 
# I realize this may be too much to ask, but it would be cool if... (and you can 
# ignore this if it is too much):
# - the outputted visualization is modifiable by the user; i.e. there is a slider 
#   that the user can use to slide the time back and forth.
# - the user can drag components in the visualization as they please.
# 
# If it is more convenient for you, you can:
# - exclude drugs that are not approved (along with their targets)
# - keep only targets of a specific category (e.g. membrane proteins) and their 
# interacting drugs
# - exclude drugs for which "created < 01-01-2010" to make data size more manageable
# - ...
# 
# Again...
# You may make this into an interactive plot if you wish (or not).
# ************
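
A rough sketch of how a time-dependent drug-target network might be assembled, assuming the *igraph* package is acceptable for this task (the notes do not name a network package). The `interactions` edge list, its identifiers and dates, and the helper `network_at()` are all hypothetical; a slider would only need to call the helper with a different cutoff date.

```r
library(igraph)

# Hypothetical edge list: one row per drug-target interaction, with the
# creation date of the record. All identifiers and dates are made up.
interactions <- data.frame(
  drug    = c("drug_A", "drug_A", "drug_B", "drug_C", "drug_C"),
  target  = c("target_1", "target_2", "target_2", "target_3", "target_1"),
  created = as.Date(c("2005-06-13", "2005-06-13", "2008-02-10",
                      "2011-05-01", "2016-09-30"))
)

# Hypothetical helper: build the network as it stood on a given date
network_at <- function(edges, cutoff) {
  snapshot <- edges[edges$created <= as.Date(cutoff), c("drug", "target")]
  g <- graph_from_data_frame(snapshot, directed = FALSE)
  # Flag drug vertices so they can be styled differently from targets
  V(g)$is_drug <- V(g)$name %in% snapshot$drug
  g
}

g_2010 <- network_at(interactions, "2010-01-01")
g_all  <- network_at(interactions, "2020-01-01")

plot(g_2010,
     vertex.color = ifelse(V(g_2010)$is_drug, "tomato", "skyblue"))
```

Filtering the edge list before building the graph keeps each snapshot small, which also covers the suggestions above about restricting the data to approved drugs or to a date range.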

*Stop here! Only the first three tasks. :)*

***Next tasks will involve diseases, manufacturers, a case study of a drug (2 tasks), a case study of a disease (2 tasks).***

***When you submit, tell them you may use extra data (that won't exceed their data size limits) in the full project (if it gets accepted, that is). Also mention that the project topic may change slightly: instead of "evolutionary analysis", it may be "exploratory analysis". If you think this will be a problem, maybe call the analysis "exploratory" in the first place?***