<a href="https://colab.research.google.com/github/Shayros/Developing-hypotheses/blob/master/Assessing_possible_implications_of_your_hypothesis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assessing possible implications of your hypothesis
___ 

**Looking at correlations with a disease dataset**

*OncoLnc*

We will be using a cancer dataset. For this example we will use [OncoLnc](https://www.oncolnc.org), a site that lets you explore survival correlations and download clinical data from TCGA.

![OncoLnc](https://github.com/Shayros/Developing-hypotheses/blob/master/figures/OncoLnc.png?raw=1)

First, select your mRNA, microRNA, or lncRNA of interest. For the purpose of this example we will continue with Timp2. Once you select Timp2, the site gives you a list of cancer types that have data available for it. In task 1 we looked at Timp2 during lung development, and in task 2 we assessed its relationship with lung adenocarcinoma. Therefore, we are going to select lung adenocarcinoma (LUAD).

![LUAD](https://github.com/Shayros/Developing-hypotheses/blob/master/figures/Data%20of%20interest.png?raw=1)

Once you select the type of cancer, the page asks how you want to divide the data. For this example I selected the top 25% and the bottom 25% of patients by expression.

![Percentage](https://github.com/Shayros/Developing-hypotheses/blob/master/figures/Timp2_25_percentile.png?raw=1)
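
To make the percentile split concrete, here is a minimal sketch of one way to assign patients to a `Low` and a `High` group by expression rank. It is only an illustration: OncoLnc's exact cutoff rule is not described here, and the function name `split_by_percentile`, the patient IDs, and the expression values are all invented for the example.

```python
def split_by_percentile(expression, lower_pct=25, upper_pct=25):
    """Map patient -> 'Low' or 'High'; patients in the middle are left out.

    expression: dict of patient ID -> expression value (hypothetical input).
    """
    # Patient IDs ordered from lowest to highest expression
    ranked = sorted(expression, key=expression.get)
    n_low = len(ranked) * lower_pct // 100
    n_high = len(ranked) * upper_pct // 100
    groups = {patient: "Low" for patient in ranked[:n_low]}
    groups.update({patient: "High" for patient in ranked[len(ranked) - n_high:]})
    return groups

# Toy data (invented): eight patients with made-up expression values
expression = {"P1": 5.0, "P2": 1.0, "P3": 9.0, "P4": 3.0,
              "P5": 7.0, "P6": 2.0, "P7": 8.0, "P8": 4.0}
groups = split_by_percentile(expression)
for patient in sorted(groups):
    print(patient, groups[patient])
```

With 25% on each side, half of the patients are excluded, which is why the downloaded file contains fewer rows than the full cohort.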

Below the resulting graph you will find an option to download the data.

**Cleaning data**

The data will be downloaded as a CSV file, similar to the picture below (note: I changed the Dead values to 0 and the Alive values to 1).

![TCGA csv](https://github.com/Shayros/Developing-hypotheses/blob/master/figures/OncoLnc%20TCGA.png?raw=1)
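
The recoding step can also be done in code instead of by hand. The sketch below assumes the `Status` column holds the strings `Alive` and `Dead` and that the columns are `Patient`, `Days`, `Status`, `Expression`, and `Group`; the rows themselves are invented. Keep in mind that lifelines reads a 1 in the event column as "event observed" (death) and a 0 as "censored", so the sketch uses the conventional Dead = 1, Alive = 0 mapping, which is the reverse of the one in the note above. Swap the dictionary values if you want to reproduce the encoding used in this notebook.

```python
import csv
import io

# Toy stand-in for the downloaded file; rows and patient IDs are invented.
raw = """Patient,Days,Status,Expression,Group
P1,350,Dead,1200.5,Low
P2,900,Alive,1100.0,Low
P3,120,Dead,5400.2,High
P4,1500,Alive,6100.8,High
"""

# Conventional survival-analysis encoding: 1 = event observed, 0 = censored.
STATUS_TO_EVENT = {"Dead": 1, "Alive": 0}

def encode_status(rows):
    """Return copies of the rows with Status mapped to 0/1 and Days as int."""
    encoded = []
    for row in rows:
        new_row = dict(row)
        new_row["Status"] = STATUS_TO_EVENT[row["Status"].strip().capitalize()]
        new_row["Days"] = int(row["Days"])
        encoded.append(new_row)
    return encoded

rows = encode_status(csv.DictReader(io.StringIO(raw)))
for row in rows:
    print(row["Patient"], row["Days"], row["Status"])
```

With pandas, the same mapping can be written as `readCSV['Status'].map({'Dead': 1, 'Alive': 0})`.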


In [0]:
# Load the data into Python
import pandas as pd
import os
filename = os.path.abspath(os.path.join('Desktop', 'LUAD_7077_25_25_1.csv'))
# read_csv accepts a path directly, so no open file handle is left behind
readCSV = pd.read_csv(filename)
readCSV.head()

# Getting rid of columns that the survival plot does not need
readCSV.drop(["Patient", "Expression"], axis=1, inplace=True)
readCSV.head()

print("Number of Observations:", readCSV.shape[0])

**Building graphs**

Now that the data are clean, let's check whether Timp2 expression is associated with cancer survival by drawing a survival plot.

In [0]:
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()

C = readCSV['Status']  # event indicator
T = readCSV['Days']    # follow-up time in days
groups = readCSV['Group']
# Boolean mask: True for the high-expression group
# (assumes the Group column uses the labels 'High' and 'Low')
ix = groups == 'High'

# Fit and plot the low-expression group first
kmf.fit(T[~ix], C[~ix], label='Low Timp2')
ax = kmf.plot()

# Fit the high-expression group and draw it on the same axes
kmf.fit(T[ix], C[ix], label='High Timp2')

plt.title('Survival of lung adenocarcinoma depending on Timp2 expression')
kmf.plot(ax=ax)
plt.show()

![Timp2_Survival_plot](https://github.com/Shayros/Developing-hypotheses/blob/master/figures/Timp2%20survival%20curve.png?raw=1)
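
To see what `KaplanMeierFitter` estimates, it can help to compute a small curve by hand. The following is a minimal pure-Python sketch of the Kaplan-Meier (product-limit) estimator on invented toy data; it is not how lifelines is implemented, and the function name `kaplan_meier` is made up for this example. At each time with at least one event, the survival probability is multiplied by `1 - deaths / number at risk`.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return [(time, survival probability)] for each time with >= 1 event.

    durations: follow-up times; events: 1 if the event was observed, 0 if censored.
    """
    deaths = Counter(t for t, e in zip(durations, events) if e == 1)
    leaving = Counter(durations)  # everyone who exits the risk set at time t
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: fraction of those at risk who survive past t
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set after time t
        at_risk -= leaving[t]
    return curve

# Toy data (invented): five patients, two of them censored
days = [5, 8, 8, 12, 20]
status = [1, 1, 0, 1, 0]
curve = kaplan_meier(days, status)
for t, s in curve:
    print(t, round(s, 3))
```

Censored patients never lower the curve themselves; they only shrink the number at risk for later events, which is why the event coding matters so much.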

In [0]:
# Calculating Cox regression
import pandas as pd
import os
from lifelines import CoxPHFitter

filename = os.path.abspath(os.path.join('Desktop', 'LUAD_7077_25_25_1.csv'))

# Reload the file, because Expression was dropped for the survival plot
readCSV = pd.read_csv(filename)
readCSV.head()

# Getting rid of columns
readCSV.drop(["Patient"], axis=1, inplace=True)
readCSV.head()

# Fit a Cox proportional hazards model with Expression as the only covariate
cph = CoxPHFitter()
cph.fit(readCSV[['Expression','Days','Status']], duration_col="Days", event_col="Status")

cph.print_summary()

In [0]:
#Output:
<lifelines.CoxPHFitter: fitted with 246 observations, 96 censored>
      duration col = Days
         event col = Status
number of subjects = 246
  number of events = 150
    log-likelihood = -661.129
  time fit was run = 2018-12-16 19:45:53 UTC
---
             coef  exp(coef)  se(coef)      z      p  lower 0.95  upper 0.95   
Expression 0.0000     1.0000    0.0000 1.3608 0.1736     -0.0000      0.0000   
---
Signif. codes: 0 '***' 0.0001 '**' 0.001 '*' 0.01 '.' 0.05 ' ' 1
Concordance = 0.529
Likelihood ratio test = 1.751 on 1 df, p=0.18574
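
The `z` and `p` columns in the summary are linked by the Wald test: `z` is the coefficient divided by its standard error, and `p` is the two-sided tail probability of a standard normal distribution at `z`. The coefficient and its standard error most likely print as 0.0000 because they are expressed per unit of raw expression and the expression values are large; `z` and `p` do not depend on that scale. As a small check, this sketch recomputes the p-value from the reported `z` using only the standard library (the function name `wald_p_value` is made up for the example).

```python
import math

def wald_p_value(z):
    """Two-sided p-value for a standard-normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

z_timp2 = 1.3608  # z reported for Expression in the summary above
p_timp2 = wald_p_value(z_timp2)
print(round(p_timp2, 4))
```

The result should agree with the `p` column (0.1736) up to rounding, so at the usual 0.05 threshold this fit gives no evidence that Timp2 expression is associated with the event.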

In the second task we drew a Venn diagram showing that ITGB1 is implicated in both lung development and disease. ITGB1 belongs to the integrin family, whose members are "membrane receptors involved in cell adhesion and recognition in a variety of processes including embryogenesis, hemostasis, tissue repair, immune response and metastatic diffusion of tumor cells" ([GeneCards](https://www.genecards.org/cgi-bin/carddisp.pl?gene=ITGB1)). We therefore repeat the survival analysis for ITGB1.

In [0]:
import pandas as pd
import os
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

filename = os.path.abspath(os.path.join('Desktop', 'LUAD_3688_25_25_1.csv'))

# read_csv accepts a path directly, so no open file handle is left behind
readCSV = pd.read_csv(filename)
readCSV.head()

# Getting rid of columns that the survival plot does not need
readCSV.drop(["Patient", "Expression"], axis=1, inplace=True)
readCSV.head()

print("Number of Observations:", readCSV.shape[0])

kmf = KaplanMeierFitter()

C = readCSV['Status']  # event indicator
T = readCSV['Days']    # follow-up time in days
groups = readCSV['Group']
# Boolean mask: True for the high-expression group
# (assumes the Group column uses the labels 'High' and 'Low')
ix = groups == 'High'

# Fit and plot the low-expression group first
kmf.fit(T[~ix], C[~ix], label='Low ITGB1')
ax = kmf.plot()

# Fit the high-expression group and draw it on the same axes
kmf.fit(T[ix], C[ix], label='High ITGB1')

plt.title('Survival of lung adenocarcinoma depending on ITGB1 expression')
kmf.plot(ax=ax)
plt.show()

![ITGB1](https://github.com/Shayros/Developing-hypotheses/blob/master/figures/Figure_2.png?raw=1)
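
The plot lets you compare the two curves by eye. A common way to test whether two survival curves differ is the log-rank test, which is not part of the original analysis; lifelines provides it as `lifelines.statistics.logrank_test`. The sketch below is a minimal pure-Python version on invented toy data, meant only to show the idea: at every event time, compare the deaths observed in one group with the number expected if both groups shared the same hazard. The function name `logrank_sketch` and all values are made up.

```python
import math

def logrank_sketch(days_a, events_a, days_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    event_times = sorted(
        {t for t, e in zip(days_a, events_a) if e == 1}
        | {t for t, e in zip(days_b, events_b) if e == 1}
    )
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for d in days_a if d >= t)  # at risk in group A
        n_b = sum(1 for d in days_b if d >= t)  # at risk in group B
        d_a = sum(1 for d, e in zip(days_a, events_a) if d == t and e == 1)
        d_b = sum(1 for d, e in zip(days_b, events_b) if d == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        # Expected deaths in group A if both groups had the same hazard
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0  # no informative event times
    chi2 = observed_minus_expected ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square with 1 df
    return chi2, p_value

# Toy data (invented): 1 = event observed, 0 = censored
low_days, low_events = [6, 13, 21, 30, 37], [1, 1, 1, 0, 1]
high_days, high_events = [10, 22, 32, 45, 60], [1, 0, 1, 1, 0]
chi2, p_value = logrank_sketch(low_days, low_events, high_days, high_events)
print(round(chi2, 3), round(p_value, 3))
```

For a real analysis it is safer to rely on the lifelines implementation and to use this sketch only to understand what the statistic measures.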

In [0]:
# Cox regression
import pandas as pd
import os
from lifelines import CoxPHFitter

filename = os.path.abspath(os.path.join('Desktop', 'LUAD_3688_25_25_1.csv'))

# Reload the file, because Expression was dropped for the survival plot
readCSV = pd.read_csv(filename)
readCSV.head()

# Getting rid of columns
readCSV.drop(["Patient"], axis=1, inplace=True)
readCSV.head()

# Fit a Cox proportional hazards model with Expression as the only covariate
cph = CoxPHFitter()
cph.fit(readCSV[['Expression','Days','Status']], duration_col="Days", event_col="Status")

cph.print_summary()

In [0]:
#output:
<lifelines.CoxPHFitter: fitted with 246 observations, 104 censored>
      duration col = Days
         event col = Status
number of subjects = 246
  number of events = 142
    log-likelihood = -626.287
  time fit was run = 2018-12-16 19:53:54 UTC
---
              coef  exp(coef)  se(coef)       z      p  lower 0.95  upper 0.95   
Expression -0.0000     1.0000    0.0000 -0.3175 0.7509     -0.0000      0.0000   
---
Signif. codes: 0 '***' 0.0001 '**' 0.001 '*' 0.01 '.' 0.05 ' ' 1
Concordance = 0.515
Likelihood ratio test = 0.102 on 1 df, p=0.74951

# Summary

**These three tasks have taught you:**

1.   How to look up data on molecules that interest you
2.   How to examine the relationship of those molecules with others or with disease/function
3.   How your molecule is related to cancer survival (you can also run a survival analysis on a dataset for the condition you are interested in)

**With these three tasks you are answering:**

**Task 1:** What are you interested in?

**Task 2:** What is the relationship of that molecule with previous studies? Or, what do previous studies say about your molecule?

**Task 3:** Why is it important to study that molecule?


# References

**Cleaning data**: https://github.com/johnrandazzo/surv_nflrb

**Graphs**: https://lifelines.readthedocs.io/en/latest/Quickstart.html#kaplan-meier-and-nelson-aalen
