# Case-Control Finder  
## Finds cases and controls for a given condition within the Sequence Read Archive

__Import dependencies and load data__

In [None]:
%load_ext rpy2.ipython

In [None]:
%%bash
wget https://cran.r-project.org/src/contrib/rjson_0.2.20.tar.gz
R CMD INSTALL rjson_0.2.20.tar.gz

In [None]:
%%R
library(rjson)

In [None]:
import json
import pandas as pd
from functions import *
import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns

In [None]:
%%R
metadata_file_tsv <- read.table(file = "./data/experiment_to_terms.tsv", header = FALSE, sep = "\t")

__1. Do you have a set of samples that you would like to restrict the retrieval to?__ 

These may be SRA samples that you have preprocessed and/or for which you have expression data. If so, point `available_data_f` to a JSON file containing a list of SRA experiment accessions. See `./data/experiments_in_hackathon_data.json` for an example. Leave it as `None` to search all available samples.

In [None]:
available_data_f = None  ## <-- INPUT HERE

r = load_metadata(available_data_f)
sample_to_terms = r[0]
term_name_to_id = r[1]
sample_to_type = r[2]
sample_to_study = r[3]
sample_to_runs = r[4]
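
As described in step 1, the restriction file is a JSON list of SRA experiment accessions. The following is a minimal sketch of how such a file could be written, assuming a flat JSON array is all that `load_metadata` expects. The accessions, the helper `write_accession_list`, and the output path are hypothetical and only serve as illustration.

```python
import json
import os
import tempfile

# Hypothetical accessions, for illustration only (not real experiments).
experiment_accessions = ["SRX0000002", "SRX0000001", "SRX0000003", "SRX0000001"]

def write_accession_list(accessions, path):
    """Write a de-duplicated, sorted list of accessions as a JSON array."""
    unique = sorted(set(accessions))
    with open(path, "w") as f:
        json.dump(unique, f, indent=2)
    return unique

# Hypothetical location; any writable path works.
restriction_path = os.path.join(tempfile.gettempdir(), "my_experiments.json")
written = write_accession_list(experiment_accessions, restriction_path)

# Read the file back to confirm that it holds a plain list of strings.
with open(restriction_path) as f:
    loaded = json.load(f)
```

Setting `available_data_f = restriction_path` in the cell above would then point the notebook at this file.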

__2. Enter the term you are looking for (in place of `'glioblastoma multiforme'`)__

In [None]:
term = 'glioblastoma multiforme' ## <-- INPUT HERE

__3. List terms to remove from control set__ 

In the example below, `'disease'` and `'disease of cellular proliferation'` will be removed from the controls.  

In [None]:
blacklist_terms = set([
    'disease', 
    'disease of cellular proliferation'
]) ## <-- INPUT HERE
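
To illustrate what removing blacklisted terms from the control set means, here is a small sketch on toy data. It is not the implementation used by `match_case_to_controls`. It assumes that a candidate control is dropped whenever any of its terms is in the blacklist, and the helper `drop_blacklisted` and the toy samples are hypothetical.

```python
def drop_blacklisted(sample_to_terms, candidates, blacklist):
    """Keep only the candidates that carry none of the blacklisted terms."""
    return [s for s in candidates if not (set(sample_to_terms[s]) & blacklist)]

# Toy metadata: sample -> set of ontology term names (hypothetical).
toy_sample_to_terms = {
    "S1": {"brain", "disease"},
    "S2": {"brain"},
    "S3": {"liver", "disease of cellular proliferation"},
}
toy_blacklist = {"disease", "disease of cellular proliferation"}

# S1 and S3 each carry a blacklisted term, so only S2 remains.
kept = drop_blacklisted(toy_sample_to_terms, ["S1", "S2", "S3"], toy_blacklist)
```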

__4. Create case and controls__

In [None]:
case, control = term_to_run(sample_to_terms, term)
ret = match_case_to_controls(term, control, case, sample_to_terms,
    sample_to_study, blacklist_terms, term_name_to_id, sample_to_type,
    filter_poor=True, filter_cell_line=True, filter_differentiated=True,
    sample_to_runs=sample_to_runs, by_run=False)
df = ret[0]
control_confound = ret[1]
case_confound = ret[2]
tissue_intersections = ret[3]

create_summary_plots(df)
plt.show()
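
The first step of the cell above splits the samples into cases and controls for the query term. The sketch below shows one plausible way to do such a split on toy data. It assumes that a sample is a case if the term appears among its annotated terms and a candidate control otherwise. The actual behavior of `term_to_run` is defined in `functions.py` and may differ, and `split_by_term` is a hypothetical name.

```python
def split_by_term(sample_to_terms, term):
    """Partition samples into (case, control) by presence of `term`."""
    case = sorted(s for s, terms in sample_to_terms.items() if term in terms)
    control = sorted(s for s, terms in sample_to_terms.items() if term not in terms)
    return case, control

# Toy metadata: sample -> set of ontology term names (hypothetical).
toy_terms = {
    "S1": {"brain", "glioblastoma multiforme"},
    "S2": {"brain"},
    "S3": {"liver"},
}

toy_case, toy_control = split_by_term(toy_terms, "glioblastoma multiforme")
```

In the notebook, the candidate controls are further filtered (blacklisted terms, poor-quality samples, cell lines, differentiated cells) and matched to cases by `match_case_to_controls`.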

__5. Browse other metadata terms that are associated with cases and controls.__ 

Choose whether to view cases or controls: set the following variable to `True` to view cases or `False` to view controls.

In [1]:
view_cases = False ## <-- INPUT HERE

Enter the tissue or cell type on which to subset your samples:

In [None]:
# Use a separate variable so the disease term from step 2 is not overwritten.
tissue_term = 'brain' ## <-- INPUT HERE

if view_cases:
    condition = 'case'
else:
    condition = 'control'
view_exps = select_case_control_experiment_set(df, condition, tissue_term)
with open('./data/term-in.json', 'w') as f:
    json.dump(view_exps, f)

The following cell plots, for each metadata term that appears in at least 10% of the samples in the current subset, the proportion of samples annotated with that term:

In [None]:
%%R
source("./Metadata_plot.R")
bp
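
The 10% threshold mentioned above can be illustrated in Python on toy data. This is a sketch of the idea only, not a port of `Metadata_plot.R`. It assumes each sample counts at most once per term and that the threshold is inclusive, and the function name `frequent_term_proportions` is hypothetical.

```python
from collections import Counter

def frequent_term_proportions(sample_to_terms, samples, min_fraction=0.10):
    """Return {term: proportion of samples} for terms at or above min_fraction."""
    n = len(samples)
    if n == 0:
        return {}
    counts = Counter()
    for s in samples:
        # set() ensures a sample contributes at most once per term.
        counts.update(set(sample_to_terms[s]))
    return {t: c / n for t, c in counts.items() if c / n >= min_fraction}

# Toy subset of 20 samples: every sample has 'brain', 5 have 'cortex', 1 has 'rare'.
toy_samples = ["S%d" % i for i in range(20)]
toy_annotations = {s: {"brain"} for s in toy_samples}
for s in toy_samples[:5]:
    toy_annotations[s].add("cortex")
toy_annotations["S19"].add("rare")

proportions = frequent_term_proportions(toy_annotations, toy_samples)
```

Here `'rare'` is annotated on 1 of 20 samples (5%), so it falls below the threshold and is left out of `proportions`.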

In [None]:
%%R
source("./Metadata_table.R")
query_disease_metadata_top10_table

In [None]:
%%R
source("./Metadata_piecharts.R")
query_cell_line

__6. Create output file__

Enter the name of the file to which the cases and controls should be written:

In [None]:
output_file = 'cases_vs_controls.csv' ## <- OUTPUT FILE HERE

df.to_csv(output_file)