# Manhattan Plot Practical
In this practical we will look at how to visualise GWAS (genome-wide association study) data in the form of a Manhattan plot.  
We will also look at interpreting the results and what they mean in the big picture.  
First, we shall learn the process on a small test dataset and then move on to applying it to a large real-world dataset.

## Section 1: Manhattan plot creation
In this section we will learn how to deal with GWAS data and how to make a Manhattan plot.  
We will learn the process on the small test data provided.  

### Step 1: Import the data

#### Read in the data:

The input file is a tab-delimited text file (i.e. each column is separated by <-TAB->).  
So here we use the `read.csv()` function and include `sep = '\t'` so that R reads it in correctly as tab-delimited.  
The `header = TRUE` bit tells R that the first line of the file holds the column names - we'll need these later.  


In [None]:
test = read.csv('data/test_gwas.txt', sep = '\t', header = TRUE)
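
To see what `sep = '\t'` is doing, here is a minimal sketch that writes a tiny tab-delimited file to a temporary location and reads it back. The file contents and gene names are invented for illustration only. It also uses `read.delim()`, base R's shortcut for tab-delimited files, which should give the same result.

```r
# Write a tiny, made-up tab-delimited file to a temporary location
tmp = tempfile(fileext = '.txt')
writeLines(c('chrom\tbp\tpvalue\tgene',
             '1\t1000\t0.5\tGENE_A',
             '2\t2000\t1e-9\tGENE_B'), tmp)

# Read it back two ways
via_csv = read.csv(tmp, sep = '\t', header = TRUE)
via_delim = read.delim(tmp)  # tab-delimited with a header row by default

identical(via_csv, via_delim)  # do both approaches agree?
str(via_csv)                   # column names and types
```

Either call works for this practical; `read.csv()` with `sep = '\t'` just makes the delimiter explicit.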

### Step 2: Explore the data

#### Let's have a look at the data:
Use `head()` to eyeball the first 6 rows of our data:

In [None]:
head(test)

Now use `names()` to see the names of each column:

In [None]:
names(test)

How big is our dataframe? `dim()` gives the dimensions in two numbers:  
*(The first is number of rows, second is columns)*

In [None]:
dim(test)

#### What is it we have?

We have a dataframe of SNPs (single nucleotide polymorphisms, pronounced "snips") from a GWAS analysis.  
Each row represents one SNP.  
So, 21,751 SNPs in total!  
  
The 4 columns are:  
"chrom", "bp", "pvalue", and "gene"

"chrom": chromosome the SNP is located on.  
"bp": base pair position at which the SNP occurs on the chromosome.  
"pvalue": significance of the GWAS association for the SNP.  
"gene": the gene the SNP is located in on the chromosome.  
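
A few more base R functions can help summarise columns like these. The sketch below uses a made-up four-row dataframe with the same column names (all values are invented), so it runs on its own; the same calls could be tried on `test`.

```r
# A made-up miniature of the GWAS table (values invented for illustration)
toy = data.frame(chrom = c('1', '1', '2', 'X'),
                 bp = c(1500, 98000, 4200, 77000),
                 pvalue = c(0.2, 3e-9, 0.04, 0.6),
                 gene = c('GENE_A', 'GENE_B', 'GENE_C', 'GENE_D'),
                 stringsAsFactors = FALSE)

str(toy)           # column names and types in one view
table(toy$chrom)   # number of SNPs per chromosome
range(toy$pvalue)  # smallest and largest pvalue
```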
  
We are specifically interested in the significance of the GWAS association per SNP, so let's quickly `plot()` the *pvalues*:

In [None]:
plot(test$pvalue)

This makes no sense at all! Now, let's look at it properly...  
  
When dealing with GWAS data we take the -log10 of the *pvalue*.  
This stretches out the very small *pvalues*, so the strongest associations appear as the tallest points.  
The -log10 transformed *pvalue* must exceed a given threshold before the association is considered significant.  
The standard genome-wide threshold is -log10(5e-8) (~7.3).  
This means that a SNP is only considered significant with a raw *pvalue* < 0.00000005!  
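
To get a feel for the transformation, here is a small worked example on a handful of made-up pvalues (the values are chosen purely for illustration):

```r
# Made-up pvalues, from unremarkable to extremely small
p = c(0.05, 1e-3, 1e-7, 1e-20)

neg_log_p = -log10(p)   # smaller pvalue -> larger transformed value
neg_log_p

threshold = -log10(5e-8)  # the standard genome-wide threshold
threshold

neg_log_p > threshold   # which of the made-up pvalues would pass?
```

Note that a pvalue of 1e-7, which would look tiny in most other settings, still falls short of the threshold.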

So let's `plot()` the -log10 pvalues and add in a dashed line at y =  -log10(5e-8) using `abline()`:

In [None]:
plot(-log10(test$pvalue))

abline(h = -log10(5e-8), lty = 2)

Oh, now we have something!  \
But what does this mean? How can we make this make sense? Where are the chromosomes? Which chromosome and which gene is important?  
So many questions (no really, I'm sure... riveting, isn't it?)

### Step 3: Make the data make sense
R is a very powerful analytic tool. It, like many others, has the ability to create functions.  
Functions are small "programs" that execute a set of commands.  
Here we will use a premade function that will perform the heavy lifting for this practical.
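
If you have not met functions before, here is a minimal sketch of what one looks like. The helper `neg_log10()` is hypothetical and not part of this practical's code; it simply wraps the -log10 transformation.

```r
# A hypothetical helper (not part of the practical's code):
# takes one or more pvalues and returns their -log10 values
neg_log10 = function(p) {
  -log10(p)
}

neg_log10(0.001)           # call it on a single value
neg_log10(c(0.05, 5e-8))   # or on several values at once
```

Once defined, the function can be reused as often as needed, which is exactly why a premade plotting function saves so much work.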

**!DO NOT PANIC!** This is as simple as point-and-click: run the cell once and the function stays in memory for the rest of the session. We're good to go :)

We are going to import the code for the function from the "functions" directory that forms part of this practical:

In [None]:
source('functions/man.plot.R')

The function is now loaded into the R session and will persist until you close the notebook.  

#### The function and how to use it:
The function, `man.plot()`, draws the Manhattan plot and also:  
* plots the transformed *pvalue*, i.e. -log10(pvalue)
* plots the relative base pair position on the chromosome (*shows the closeness of the SNPs*)
* separates, labels and colours the chromosomes (*to better distinguish them*)
* adds the threshold line at -log10(5e-8)
* highlights significant SNPs  
* labels significant SNPs with the gene they are located in
* generates an output dataframe of all significant genes  
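
The source of `man.plot()` is not shown here, so the following is only a rough base R sketch of the kind of steps such a function might perform. It is not the practical's implementation. The data, gene names and colours are all invented for illustration.

```r
# Made-up GWAS-style data: two chromosomes, positions sorted within each
set.seed(1)
toy = data.frame(chrom = rep(c('1', '2'), each = 50),
                 bp = c(sort(sample(1e6, 50)), sort(sample(1e6, 50))),
                 pvalue = runif(100),
                 gene = paste0('GENE_', 1:100),
                 stringsAsFactors = FALSE)
toy$pvalue[c(20, 75)] = c(1e-9, 1e-10)  # plant two strong "hits"

# Transformed pvalue
toy$logp = -log10(toy$pvalue)

# Relative position: shift each chromosome by the combined span of those before it
chrom_order = unique(toy$chrom)
chrom_len = sapply(chrom_order, function(ch) max(toy$bp[toy$chrom == ch]))
offset = c(0, cumsum(chrom_len))[seq_along(chrom_order)]
names(offset) = chrom_order
toy$pos = toy$bp + unname(offset[toy$chrom])

# Alternate the point colour by chromosome
cols = c('grey40', 'steelblue')[(match(toy$chrom, chrom_order) - 1) %% 2 + 1]

# Draw the plot, the threshold line, and highlight the significant SNPs
threshold = -log10(5e-8)
is_sig = toy$logp > threshold
plot(toy$pos, toy$logp, col = cols, pch = 20,
     xlab = 'Relative position', ylab = '-log10(pvalue)')
abline(h = threshold, lty = 2)
points(toy$pos[is_sig], toy$logp[is_sig], col = 'red', pch = 20)

# Output dataframe of the significant SNPs
sig = toy[is_sig, c('chrom', 'bp', 'pvalue', 'gene')]
sig
```

The premade function takes care of all of this (plus the labelling) in a single call, which is why we use it below.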


Use `man.plot()` to plot the data:  
*NB: we need to assign the output of the function to an R object, here called "test_plot"*

In [None]:
test_plot = man.plot(test)

Now let's call back the R object we created to see all the significant genes:  

In [None]:
test_plot

What can we do with this information (*if it were not just test data*)?  
What would you expect there to be at this point?

### Tweaking the plot
There are some small aesthetic adjustments you can make to the plot. We can change the colour and size of the points, or zoom in on specific chromosomes. These small tweaks can make a plot feel like your own, plus it's a bit of practice in itself.  
  
Take a moment to look at the arguments we can feed in to the function:  

`man.plot(df, chroms, threshold, highlight, point.cex, point.cols, line.col)`

| Argument | Description |
| :----------- | :----------- |
| **df**     | a gwas dataframe | 
| **chroms** | a list of chromosomes to isolate |
| **threshold** | value for the significance threshold for SNPs | 
| **highlight** | colour for highlighting significant SNPs | 
| **point.cex** | point size | 
| **point.cols** | a list of colours for alternating chromosomes | 
| **line.col** | colour for the significance threshold line |
  
  
In the cell below there is the code to look at chromosomes 5, 13 and Y in the test data. The threshold value is changed to 7.5 and the chromosomes are coloured lightblue, pink and purple. SNPs above the threshold line (now orange) are highlighted blue and the plot points are much larger.  
  
Please run the code and then tweak the plot as you wish:

In [None]:
test_plot = man.plot(test,
                     chroms = c('5', '13', 'Y'),
                     threshold = 7.5,
                     highlight = 'blue',
                     point.cex = 1.5,
                     point.cols = c('lightblue', 'pink', 'purple'),
                     line.col = 'orange')

### Exporting the plot

Plots viewed in the cells of a notebook are not ideal. It can be difficult to read text or see points clearly, and you can't add them to a document. For these reasons it is better to save them out as an image file.  

In _R_ this can be done with `png()`, a basic function that allows you to customise the name and dimensions of your plot.  

Let's save the basic test plot as a png.  


In [None]:
png('plots/test.png',  # path to file
    units = 'px',  # units of the image (in this case pixels)
    height = 600, width = 2000,  # height and width of plot in pixels (see units)
    pointsize = 30)  # base size of the plotted text, in points

# here is the basic man.plot() code:
test_plot = man.plot(test)

dev.off()  # closes the graphics device so the image file is written

The image is now saved in the "plots" directory. Go there and look at the image that has been created.  
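
The same open, draw, close pattern works for any base R plot. Here is a minimal sketch that saves a simple scatter plot to a temporary file; the file location, dimensions and plot contents are arbitrary choices made for illustration.

```r
# Save to a throwaway location so nothing in the practical is overwritten
out_file = tempfile(fileext = '.png')

png(out_file, units = 'px', height = 400, width = 600, pointsize = 20)  # open the device
plot(1:10, main = 'Test image')                                         # draw the plot
dev.off()                                                               # close the device

file.exists(out_file)  # was the image written?
```

Nothing appears in the notebook while the device is open: everything drawn between `png()` and `dev.off()` goes to the file instead.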
  
How would you rename the image?  
What if you wanted the image to be more square?  
Where would you add the code for your altered plot from above?  

Which values would you need to change to make the above happen?

## Section 2: Big dog data


In [None]:
height = read.csv('data/height_gwas.txt', sep = '\t', header = TRUE)

In [None]:
height_plot = man.plot(height, threshold = 8.46)
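
The `threshold` argument appears to be given on the -log10 scale, so it can help to convert it back to a raw pvalue. The snippet below is just that arithmetic; why 8.46 was chosen for the height data is not stated here.

```r
threshold = 8.46

raw_p = 10^(-threshold)  # undo the -log10 transformation
raw_p                    # the raw pvalue cut-off this threshold corresponds to

-log10(raw_p)            # transforming it again returns the threshold
```

This cut-off is stricter than the standard 5e-8 used for the test data.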

In [None]:
png('plots/height.png', height = 600, width = 2000, units = 'px', pointsize = 30)
par(mar = c(3, 1.5, 0, 1))

height_plot = man.plot(height, threshold = 8.46)

dev.off()

In [None]:
height_plot