<hr>

# <font color='DarkBlue'>DETECTING POSITIVE SELECTION PRACTICAL NOTEBOOK</font>

<hr>
    
Dr Graham S. Sellers *g.sellers@hull.ac.uk*

![manhattan](images/manhattan_skyline2.jpg)

## <font color='DarkBlue'>Overview</font>

This practical follows up on Domino's *Detecting positive selection* lecture.  
Ideally you will have attended/viewed Domino's lecture and so understand the background of this practical.  





## <font color='DarkBlue'>Genome-wide association studies (GWAS)</font>
   
   
GWAS can be used to detect positive selection by discovering genes that play a significant part in a particular trait. The sequenced genomes of two groups that differ in a trait of interest are "scanned" for single nucleotide polymorphisms (SNPs) and compared to detect any that associate with the trait. The GWAS outputs each SNP shared between all genomes, along with a *p-value* for its association with the trait of interest.

In this practical we will look at how to visualise GWAS data in the form of Manhattan plots.  
We will interpret the results and look at the function of genes under selection.  

First, we shall learn the process on a small test dataset and then move on to applying it to some real-world large data.  

This practical will give you:  
* Experience in visualising big GWAS data
* Knowledge of how to interpret a Manhattan plot
* Understanding of how genes under positive selection can be detected




# <font color='DarkBlue'>Introduction to Jupyter Notebooks</font>

<hr>

For this practical we are using **Jupyter Notebook**. This document is a Jupyter Notebook. It is a web browser based text editor that is also able to execute code. This allows us to provide both the instructions and the command line exercises together for this practical.

This is an *R* based session in a Jupyter notebook. You are probably familiar with running *R* in something like *RStudio*. Jupyter Notebooks operate slightly differently, but otherwise the *R* code is the same.

The code is written and run in the grey cells (boxes) prefixed by `[ ]:`

Here is one:

In [None]:

You can run code in a cell by clicking in it, typing the code, and then pressing Shift-Enter. You can also click the triangular `Run` button in the menu bar.

Try it below; the 'hello world' `print` command has been entered for you.

In [None]:
print('hello world')

As you can see, the output of the code is presented directly below the cell. In the case above we have "hello world".

The same goes for any plots you make. As a demonstration, run the cell below:

In [None]:
plot(1, 1)

### <font color='DarkBlue'>Well Done!</font>

You now have the basic skills required to use a Jupyter Notebook. Use these skills to progress through the rest of the practical.

**Any questions, just ask a demonstrator.**

# <font color='DarkBlue'>Section 1: Manhattan plot creation</font>

<hr>

In this section we will learn how to deal with GWAS data and how to make a Manhattan plot in *R*.

We will get to grips with the process on the small test dataset provided.  

### STEP 1: Import the data

<hr>

#### Read in the data:

The input file is a tab-delimited text file (i.e. each column is separated by <-TAB->).  
So here we use the `read.csv()` function and include `sep = '\t'` to make *R* read it in correctly as tab-delimited.  
The `header = TRUE` bit keeps the column names as headers - we'll kinda need these.  


In [None]:
# import the data:

test = read.csv('data/test_gwas.txt', sep = '\t', header = TRUE)
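
If you want to convince yourself what `sep = '\t'` is doing, here is a minimal sketch on a tiny made-up file. The file contents and the object names (`tmp_file`, `toy_csv`, `toy_delim`) are invented for illustration and are not part of the practical data. It also shows `read.delim()`, a base *R* function that assumes tab-delimited input by default.

```r
# write a tiny, made-up tab-delimited file to a temporary location
tmp_file = tempfile(fileext = '.txt')
writeLines(c('chrom\tbp\tpvalue\tgene',
             '1\t1500\t0.03\tgeneA',
             '2\t98000\t2e-9\tgeneB'),
           tmp_file)

# read.csv() with sep = '\t', as used in this practical
toy_csv = read.csv(tmp_file, sep = '\t', header = TRUE)

# read.delim() defaults to sep = '\t' and header = TRUE
toy_delim = read.delim(tmp_file)

# check that both approaches give the same dataframe
identical(toy_csv, toy_delim)
```

Either function is fine for this practical; we stick with `read.csv()` so the separator is stated explicitly.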

### STEP 2: Explore the data

<hr>

#### Let's have a look at the data:
Use `head()` to eyeball the first 6 rows of our data:

In [None]:
head(test)

Now use `names()` to see the names of each column:

In [None]:
names(test)

How big is our dataframe? `dim()` gives the dimensions in two numbers:  
*(The first is number of rows, second is columns)*

In [None]:
dim(test)
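
A couple of other base *R* functions can also be handy when exploring GWAS data. The sketch below uses a small made-up dataframe (`toy`, with invented values and gene names) that has the same four columns as the test data, so you can see what each function returns without waiting on a large file.

```r
# a small, made-up dataframe with the same columns as the test data
toy = data.frame(chrom = c(1, 1, 2, 2, 2),
                 bp = c(1200, 56000, 300, 4100, 98000),
                 pvalue = c(0.2, 3e-10, 0.04, 0.5, 1e-4),
                 gene = c('geneA', 'geneB', 'geneC', 'geneC', 'geneD'))

# str() shows the type of each column alongside the first few values
str(toy)

# table() counts how many SNPs there are on each chromosome
snps_per_chrom = table(toy$chrom)
snps_per_chrom

# range() gives the smallest and largest p-value
range(toy$pvalue)
```

You could try the same functions on `test` in a new cell.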

#### What is it we have?

We have a dataframe of SNPs from a GWAS analysis.  
Each row is data representing a SNP.  
So, 21,751 SNPs in total (*see `dim()` above*).  
  
The 4 columns are:  
"chrom", "bp", "pvalue", and "gene"

"chrom": the chromosome the SNP is located on.  
"bp": the position (base pair) at which the SNP occurs on the chromosome.  
"pvalue": the probability of the GWAS association for the SNP (*p-value*).  
"gene": the gene the SNP is located in on the chromosome.  
  
We are specifically interested in the *p-value* of each SNP, so let's quickly `plot()` the *p-values*:

In [None]:
plot(test$pvalue)

**This makes no sense at all! So, let's look at it properly...**  
  
When dealing with GWAS data, there are often many orders of magnitude difference in probability. To account for this, and to make the smallest *p-values* stand out as the largest values on the plot, we apply a negative log10 transformation to the *p-values*, i.e. -log10(*p-value*).

A fixed genome-wide *p-value* threshold of 5 × 10<sup>-8</sup> is widely used to identify SNP association in GWAS. In the context of a Manhattan plot this translates into a threshold of -log10(5e-8) (~7.3). Log transformed *p-values* must exceed this threshold before they are considered "significant". This means that a SNP is only considered if it has a *p-value* ≤ 0.00000005.  

So let's `plot()` the -log10 *p-values* and add in a dashed line at y = -log10(5e-8) using `abline()`:

In [None]:
plot(-log10(test$pvalue))

abline(h = -log10(5e-8), lty = 2)
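
To make the threshold arithmetic concrete, here is a minimal sketch using a handful of made-up *p-values* (`toy_p` is invented for illustration). It shows the -log10 transformation and which values count as significant at the 5e-8 cut-off.

```r
# a few made-up p-values spanning many orders of magnitude
toy_p = c(0.5, 0.01, 1e-5, 5e-8, 2e-12)

# the standard genome-wide threshold on the -log10 scale (~7.3)
threshold = -log10(5e-8)
threshold

# transform the p-values: small p-values become large numbers
toy_logp = -log10(toy_p)
round(toy_logp, 2)

# a SNP is significant if its p-value is at or below 5e-8
is_sig = toy_p <= 5e-8
sum(is_sig)
```

Notice that the transformation flips the direction: the smaller the *p-value*, the higher the point sits on the plot.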

**OK, now we have something!**  

There are clearly some points above the threshold.  
But what does this mean?  
Where are the chromosomes and which genes are important?  

So many questions (no really, I'm sure... riveting, isn't it?)

### STEP 3: Import the function
<hr>

*R* is a very powerful analytic tool. It, like many others, has the ability to create functions.  
Functions are small "programs" that execute a set of commands.  
Here we will use a premade function that will perform the heavy lifting for this practical.

**!DO NOT PANIC!** This is as simple as point-and-click. Once clicked, forever in memory. We're good to go :)

We are going to import the *R* code for the function from the "functions" directory that forms part of this practical:

In [None]:
# import the man.plot() function:

source('functions/man.plot.R')

The function is now loaded into the *R* session and will persist until you close the notebook.

#### The function and how to use it:
`man.plot()` firstly plots the figure in a clear manner:  
The chromosomes are separated and labelled, the threshold is indicated and SNPs above it are highlighted.  
The gene with the highest significant SNP on each chromosome is labelled.

Additionally, it generates an output of all significant genes detected.  
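
If you are curious how a function like this might work, below is a rough sketch of a much simpler version written in base *R*. To be clear, this is **not** the code inside `man.plot()`. The function name `toy_manhattan()`, the colours and the made-up `toy` data are all invented for illustration, and the sketch skips the gene labelling. It assumes a dataframe with the same four columns as our test data.

```r
# a simplified, hypothetical Manhattan plot function (not man.plot() itself)
toy_manhattan = function(gwas, threshold = -log10(5e-8)) {
  # sort SNPs by chromosome, then by position
  gwas = gwas[order(gwas$chrom, gwas$bp), ]
  gwas$logp = -log10(gwas$pvalue)

  # offset each chromosome so they sit side by side on the x axis
  chrom_len = sapply(split(gwas$bp, gwas$chrom), max)
  offsets = cumsum(c(0, head(chrom_len, -1)))
  names(offsets) = names(chrom_len)
  gwas$x = gwas$bp + unname(offsets[as.character(gwas$chrom)])

  # alternate point colours by chromosome, highlight significant SNPs
  chrom_index = match(gwas$chrom, unique(gwas$chrom))
  point_col = ifelse(chrom_index %% 2 == 0, 'grey60', 'grey20')
  is_sig = gwas$logp >= threshold
  point_col[is_sig] = 'red'

  # draw the plot, label chromosomes and add the threshold line
  plot(gwas$x, gwas$logp, col = point_col, pch = 20,
       xaxt = 'n', xlab = 'chromosome', ylab = '-log10(p-value)')
  axis(1, at = sapply(split(gwas$x, gwas$chrom), mean),
       labels = names(chrom_len))
  abline(h = threshold, lty = 2)

  # return the significant SNPs as a dataframe
  gwas[is_sig, c('chrom', 'bp', 'pvalue', 'gene')]
}

# try it on a small made-up dataset
toy = data.frame(chrom = c(1, 1, 1, 2, 2, 3),
                 bp = c(100, 2500, 7000, 400, 9000, 1500),
                 pvalue = c(0.3, 1e-9, 0.02, 0.6, 4e-11, 0.001),
                 gene = c('geneA', 'geneB', 'geneB', 'geneC', 'geneD', 'geneE'))

toy_sig = toy_manhattan(toy)
toy_sig
```

The key idea is the x axis: each chromosome's positions are shifted along so that the chromosomes line up one after another, which gives the plot its "skyline" look.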


Use `man.plot()` to plot the data:  
*Note: we need to assign the output of the function to an R object, here called "test_plot"*

In [None]:
# plot the data:

test_plot = man.plot(test)

The plot is not very easy to view like this, so let's make it better.

### STEP 4: Save the plot as an image

<hr>

Plots viewed in the cells of a notebook are not ideal. It can be difficult to read text or see points clearly, and you can't (*easily*) add them to a document. For these reasons it is better to save them out as an image file.  

In _R_ this can be done with `png()`, a basic function that allows you to customise the name and dimensions of your plot. First call `png()` to open the image file, then run the code for the plot (in our case the `man.plot()` code from above), and finally call `dev.off()` to close the file and save the image.  

Let's save our plot as a png:  


In [None]:
# save the plot as an image:

png('plots/test.png',  # path to file
    units = 'px',
    height = 600, width = 2000,
    pointsize = 30)

# here is the basic man.plot() code from above:
test_plot = man.plot(test)

dev.off()  # finishes the command to save the image

The image is now saved in the "plots" directory. Go there and look at the image that has been created. Much easier to view it as an image like this, right?  
  
**Question:**  
How would you rename the image?  


### STEP 5: View the significant genes

<hr>

Looking at the `man.plot()` function's description above there should be some useful information in the *R* object we created with the plot.

Now let's call back the *R* object we created and have a look at what we have: 

In [None]:
# view the man.plot() output:

test_plot
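
If you ever need to pull out significant genes by hand, the sketch below shows one way to do it in base *R*. It works on a small made-up dataframe (`toy`, with invented values and gene names) rather than on the `man.plot()` output, whose exact layout you can see by running the cell above.

```r
# a small, made-up GWAS dataframe
toy = data.frame(chrom = c(1, 1, 1, 2, 2),
                 bp = c(100, 2500, 2600, 400, 9000),
                 pvalue = c(0.3, 1e-9, 3e-10, 0.6, 4e-11),
                 gene = c('geneA', 'geneB', 'geneB', 'geneC', 'geneD'))

# keep only the SNPs at or below the standard 5e-8 cut-off
toy_sig = toy[toy$pvalue <= 5e-8, ]

# find the smallest p-value for each significant gene
best_per_gene = aggregate(pvalue ~ gene, data = toy_sig, FUN = min)

# order the genes from most to least significant
best_per_gene = best_per_gene[order(best_per_gene$pvalue), ]
best_per_gene
```

A gene can contain several significant SNPs, which is why the sketch collapses the table down to one row per gene.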

The gene column contains the names of all the significant genes detected. If this were real data there would be some actual gene names that we could search for on Google. We shall do precisely that.

### Task: Google the genes

<hr>

Instead of having no real genes to google, let's google **IGF1**, an important gene across the animal kingdom, and discover its function.

Google search "IGF1 gene".  
Wikipedia is a good starting point (*yes, I actually said that*)  

What have we discovered about the gene's function?  
Do you think it matters that there may not be a record of the gene's function for your species?

Discuss with a demonstrator.


## <font color='DarkBlue'>Outcomes of Section 1</font>

<hr>

**So far we have learned:**
* how to import GWAS data
* simple ways to explore the data
* how to create and interpret a Manhattan plot
* how to look at genes under selection and discover their functions
* how to export the plot as an image

### *For the next section of the practical you will modify existing code to generate the relevant outputs* ###

You have already used the relevant code above and will just need to modify it accordingly. This next section therefore needs little explanation, but does require you to read and pay attention to the code cells you run. 

<hr>

# <font color='DarkBlue'>Section 2: Manhattan plots from real-world BIG data</font>

<hr>

![dogs](images/dogs_header.jpg)

## <font color='DarkBlue'>BIG DOG DATA</font>

<hr>

Dogs are labeled as "man's best friend". Humans and dogs have a relationship stretching back at least 15,000 years. Around two centuries ago there was an explosion of dog breeds. Man's best friend suddenly underwent some strong, artificial selection for particular traits. We will look for the signal of this selection and discover the genes under selection for each trait.  

The data we are going to use is from Plassais et al. (2019). Lead author, Jocelyn Plassais, kindly shared it with us for the purposes of this practical. We will be reproducing some of the Manhattan plots they produced in their paper. Please have a read of it, it's a good one.

In this section you will use the skills learned from **Section 1** and apply them to the large GWAS data from two particular dog breed traits:

#### <font color='DarkBlue'>1. Face furnishings</font>
#### <font color='DarkBlue'>2. Breed height</font>

*Please note*: this is genuinely big data and the code cells you run on it may take a little time.  
Be patient, ask a demonstrator if you have any concerns.

### IMPORTANT! A note on the threshold

As we have learned above, many studies use a threshold set at -log10(5e-8).

However, in the data we are using for this section of the practical, the significance threshold was Bonferroni corrected. This is a common practice in GWAS and is done to reduce false positives. The correction gives a slightly higher, more stringent threshold value. We will use this corrected threshold value for all the dog data.

**dog data threshold = 8.46**

To use this we simply add the `man.plot()` function's `threshold` argument like this:

`my_plot = man.plot(my_data, threshold = 8.46)`
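
To get a feel for what a threshold of 8.46 means, the sketch below converts it back into a *p-value* and compares it with the standard 5e-8 cut-off. The second half is based on assumptions: it supposes the value came from a standard Bonferroni correction (cut-off = alpha / number of tests) with alpha = 0.05. Neither the alpha nor the number of tests is stated in this practical, so treat `alpha`, `implied_tests` and the helper `bonferroni_threshold()` as illustrative only.

```r
# the dog data threshold on the -log10 scale
dog_threshold = 8.46

# convert it back into a p-value cut-off
dog_p = 10^(-dog_threshold)
dog_p

# it is smaller (more stringent) than the standard 5e-8 cut-off
dog_p < 5e-8

# ASSUMPTION: a Bonferroni correction with alpha = 0.05
alpha = 0.05

# the number of tests this would imply (illustrative only)
implied_tests = alpha / dog_p
implied_tests

# hypothetical helper: Bonferroni threshold on the -log10 scale
bonferroni_threshold = function(alpha, n_tests) {
  -log10(alpha / n_tests)
}

# this recovers the dog data threshold
bonferroni_threshold(alpha, implied_tests)
```

The more SNPs that are tested, the smaller the corrected *p-value* cut-off becomes, and so the higher the threshold line sits on the Manhattan plot.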


## <font color='DarkBlue'>Study 1: Face furnishing</font>

<hr>

**We start with dog beards!**  

The data file for face furnishing is located in the "**data**" directory and is called "**furnish_gwas.txt**".  
*cheeky hint: replace test... with the new file name*

Using your skills, modify the code in the following cell to:
1. import the correct data file
2. give it a meaningful name


In [None]:
# import the data:

test = read.csv('data/test_gwas.txt', sep = '\t', header = TRUE)

Use the empty cell below to explore the data:

In [None]:
# explore the data:






### Big data warning!

Here, you would usually make the plot in the cell to quickly view it. However, due to the size of the data we are plotting, this is not really feasible. It takes a very, very long time. Plus, it isn't really clear to view in the cell.

It takes much less time (but still some time) to simply save the plot as an image, so that is what you will do. You can then view it afterwards (it should be in the "plots" directory under the name you gave it).

Now save the plot as an image:
1. give the image a meaningful name (make sure it is in the plots directory)
2. add the `man.plot()` code
3. give this a meaningful name
4. make it take in the newly imported data (from above)
5. add in the `threshold` argument and give it the correct value

In [None]:
# save the plot as an image:

png('plots/test.png',  # modify name
    height = 600, width = 2000,
    units = 'px',
    pointsize = 30)
# man.plot() code - modify to correct data, don't forget to add the threshold (see above)
test_plot = man.plot(test)

dev.off()

Then, look at the *R* object generated by `man.plot()` and see what genes there are:  
(*refer to the plot image you have just saved to see which stand out*)

In [None]:
# view the man.plot() output:

test_plot

Finally, Google some of the genes and determine their functions.  
Consider their functions and how this relates to the face furnishing trait.

Discuss your findings with a demonstrator.

## <font color='DarkBlue'>Study 2: Breed height</font>

<hr>

**Now do it all again with dog breed height**

The data file for breed height is located in the "**data**" directory and is called "**height_gwas.txt**".

In [None]:
# import data:

test = read.csv('data/test_gwas.txt', sep = '\t', header = TRUE)

In [None]:
# explore data here:






In [None]:
# save the plot as an image:

png('plots/test.png',  # modify name
    height = 600, width = 2000,
    units = 'px',
    pointsize = 30)

# man.plot() code - modify to correct data, don't forget to add the threshold (see above)
test_plot = man.plot(test)

dev.off()

In [None]:
# view the man.plot() output:

test_plot

As before, Google some of the genes, consider their functions and how this relates to the breed height trait.

Discuss your findings with a demonstrator.

## <font color='DarkBlue'>Outcomes of Section 2</font>

<hr>

**You will now have:**
* interpreted two separate large GWAS datasets
* discovered the functions of significant genes
* considered how these genes relate to the traits of interest




## <font color='DarkBlue'>Congratulations! You have completed Practical 1: Detecting positive selection.</font>

<hr>

**You should now have the skills to revisit this code and use it on a new dataset, see the assignment outline below.**

<hr>

# <font color='DarkBlue'>Assignment</font>

<hr>

You will need to return to this notebook for part of the MCQ assignment for this section of the module. You will be required to run the relevant code on a new dataset, generate a plot and interpret the outcome.

#### <font color='DarkBlue'>You will need to do this completely independently.</font>  
No help will be given so having understood the practical will greatly improve your chances of getting a good grade.

#### <font color='DarkBlue'>This is worth 20% of the module.</font>

When the MCQ assignment is available, log on to Jupyter Lab as per the instructions given. Open this Jupyter notebook, run all the relevant code in the cells below and then answer the questions.





# <font color='DarkBlue'>MCQ section</font>

You have already covered all the code you need to generate the outputs for the MCQ in this notebook. As in the practical, all you need to do is modify the code to do the right thing for each step.

This is dog data so remember the **threshold** needs to be added to `man.plot()`.

The data file for an unknown trait is located in the "**data**" directory and is called "**unknown_trait_gwas.txt**". 

In [None]:
# import the data:



In [None]:
# explore the data:



In [None]:
# import the man.plot() function:



In [None]:
# save the plot as an image:



In [None]:
# view the man.plot() output:



### After completing the steps above, keep this notebook open as you do the MCQ. You will need it. ###

### <font color='DarkBlue'>References</font>

Plassais J, Kim J, Davis BW, Karyadi DM, Hogan AN, Harris AC, Decker B, Parker HG, Ostrander EA (2019) Whole genome sequencing of canids reveals genomic regions under selection and variants influencing morphology. *Nature Communications* 10:1489.