<a href="https://colab.research.google.com/github/Shayros/Developing-hypotheses/blob/master/Assessing_possible_implications_of_your_hypothesis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assessing possible implications of your hypothesis
___ 

**Looking at correlations with a disease dataset**

*OncoLnc*

We will be using a cancer dataset from [OncoLnc](https://www.oncolnc.org), a site that lets you explore survival correlations and download clinical data from TCGA.

![OncoLnc](figures/OncoLnc.png)

First, select your mRNA, microRNA, or lncRNA of interest. For the purpose of this example we will continue with Timp2. Once you select Timp2, the site gives you a list of cancer types that have data available for it. In task 1 we looked at Timp2 during lung development, and in task 2 we assessed its relationship with lung adenocarcinoma. Therefore, we are going to select lung adenocarcinoma (LUAD).

![LUAD](figures/Data%20of%20interest.png)

Once you select the type of cancer, the page asks how you want to divide the data. For this example I selected the top 25% and the bottom 25%.

![Percentage](figures/Timp2_25_percentile.png)

At the bottom of the resulting graph you can download the data.

**Cleaning data**

The data will be downloaded as a CSV file, similar to the picture below (*note: I changed the Dead values to 0 and the Alive values to 1*).

![TCGA csv](figures/OncoLnc%20TCGA.png)
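
The recoding can also be done in pandas instead of by hand. The sketch below is a minimal example on invented rows, not real TCGA data; the column names follow the ones used in this notebook, and the `Dead`/`Alive` labels are assumed to be spelled exactly like that in the download. One caution: the `event_observed` argument in lifelines expects 1 where the death was observed and 0 where the patient was censored (still alive at last follow-up), so this sketch maps Dead to 1 and Alive to 0, which is the reverse of the coding noted above. Check which convention your file uses before fitting.

```python
import pandas as pd

# Toy rows standing in for an OncoLnc download. The values are invented
# for illustration and are NOT real TCGA data.
toy = pd.DataFrame({
    "Patient": ["P1", "P2", "P3", "P4"],
    "Days": [120, 450, 900, 1300],
    "Status": ["Dead", "Alive", "Dead", "Alive"],
    "Expression": [10.2, 55.1, 8.7, 60.3],
    "Group": ["Low", "High", "Low", "High"],
})

# Assumed coding: 1 = death observed, 0 = censored (alive at last follow-up).
status_codes = {"Dead": 1, "Alive": 0}
toy["Status"] = toy["Status"].map(status_codes)

# map() produces NaN for any label missing from the dictionary,
# so stop early if an unexpected label slipped through.
assert toy["Status"].notna().all(), "unexpected Status label"
toy["Status"] = toy["Status"].astype(int)

print(toy[["Days", "Status", "Group"]])
```

Doing the recoding in code keeps the original download untouched and makes the chosen convention visible to anyone rerunning the notebook.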


In [0]:
# Load the data into Python
import os
import pandas as pd

# Path to the downloaded file, relative to the current working directory
filename = os.path.abspath(os.path.join('Desktop', 'LUAD_7077_25_25_1.csv'))
# read_csv accepts the path directly, so no file handle is left open
readCSV = pd.read_csv(filename)
readCSV.head()

# Getting rid of the columns we do not need
readCSV.drop(["Patient", "Expression"], axis=1, inplace=True)
readCSV.head()

print("Number of Observations:", readCSV.shape[0])
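
Before plotting, it can help to confirm that both expression groups are present and to see how the status codes are distributed within each. The following is a small sketch on invented rows (not real TCGA data) that uses the same `Days`, `Status` and `Group` column names as the cell above.

```python
import pandas as pd

# Invented rows for illustration only; NOT real TCGA data.
toy = pd.DataFrame({
    "Days": [120, 450, 900, 1300, 200, 75],
    "Status": [1, 0, 1, 0, 1, 1],
    "Group": ["Low", "High", "Low", "High", "High", "Low"],
})

# Number of patients in each expression group
group_sizes = toy["Group"].value_counts()

# Status codes broken down by group (rows: Group, columns: Status)
status_by_group = pd.crosstab(toy["Group"], toy["Status"])

print(group_sizes)
print(status_by_group)
```

With the real download, replace `toy` with `readCSV`; a group with very few rows or only one status code is a sign that the file or the recoding needs another look.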

**Building graphs**

Now that the data is clean, let's check whether Timp2 has an implication in cancer survival by drawing a survival plot.

In [0]:
from lifelines import KaplanMeierFitter

kmf = KaplanMeierFitter()

# Follow-up time in days
T = readCSV['Days']
# Event indicator; lifelines reads 1 as an observed death and 0 as censored
C = readCSV['Status']

# Boolean mask: True for high-expression patients, False for low
ix = readCSV['Group'] == 'High'

# Fit and plot the low-expression group first
kmf.fit(T[~ix], C[~ix], label='Low Timp2 Expression')
ax = kmf.plot()

# Overlay the high-expression group on the same axes
kmf.fit(T[ix], C[ix], label='High Timp2 Expression')
kmf.plot(ax=ax)

![Timp2_Survival_plot](figures/Timp2%20survival%20plot.png)

# Summary

**These three tasks have taught you:**

1. How to look up data on molecules that you are interested in
2. How to look at the relationship of those molecules with others or with disease/function
3. How your molecule is related to cancer survival (you can also run a survival analysis on a dataset for the condition you are interested in)

**With these three tasks you are answering:**

1. What are you interested in?
2. What is its relationship with previous studies, or what do previous studies say about your molecule?
3. Why is it important to study the implications of that molecule?


# References

**Cleaning data**: https://github.com/johnrandazzo/surv_nflrb

**Graphs**: https://lifelines.readthedocs.io/en/latest/Quickstart.html#kaplan-meier-and-nelson-aalen
