# Data-Analysis

Import and analyze data

In [1]:
import json  # needed by get_correlations for the JSON fallback

import pandas as pd
import numpy as np

In [112]:
%%time
prep_path = 'data/TreehouseCompendiumSamples_LibraryPrep.tsv'
prep_df = pd.read_csv(prep_path, sep='\t', index_col=0)

CPU times: user 24 ms, sys: 4 ms, total: 28 ms
Wall time: 26.1 ms


In [3]:
%%time
corr_path = 'data/v5_all_by_all.2018-02-04.tsv'
corr_df = pd.read_csv(corr_path, sep='\t', index_col=0)

CPU times: user 2min 26s, sys: 22.4 s, total: 2min 48s
Wall time: 2min 49s


In [116]:
%%time
meta_path = 'data/clinical.tsv'
meta_df = pd.read_csv(meta_path, sep='\t', index_col=0)

CPU times: user 132 ms, sys: 0 ns, total: 132 ms
Wall time: 129 ms


In [82]:
%%time
type_path = 'data/DiseaseAnnotations_2018-04_Labels.csv'
type_df = pd.read_csv(type_path, sep=',', index_col=1)

CPU times: user 60 ms, sys: 12 ms, total: 72 ms
Wall time: 66.1 ms


In [10]:
print(prep_df['libSelType'].value_counts())
print('NA', len(prep_df['libSelType']) - prep_df['libSelType'].count())

polyASelection            11505
riboDepletion               191
presumed riboDepletion       32
unknown                       9
exomeSelection                8
Name: libSelType, dtype: int64
NA 1003


In [51]:
print(len(prep_df), len(meta_df), len(corr_df))

12748 11340 11340
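The prep table has more rows than the clinical table and the correlation matrix, so it can help to check which sample IDs are shared before combining them. A minimal sketch on toy indexes; the IDs and the helper name `compare_ids` are made up for illustration and are not part of the data or the notebook:

```python
import pandas as pd

def compare_ids(left, right):
    """Return counts of unique IDs as (shared, only in left, only in right)."""
    left_ids = pd.Index(left).unique()
    right_ids = pd.Index(right).unique()
    shared = left_ids.intersection(right_ids)
    only_left = left_ids.difference(right_ids)
    only_right = right_ids.difference(left_ids)
    return len(shared), len(only_left), len(only_right)

# Hypothetical IDs: 'S1' is duplicated on the left, 'S4' exists only on the right.
toy_prep_ids = pd.Index(['S1', 'S1', 'S2', 'S3'])
toy_corr_ids = pd.Index(['S2', 'S3', 'S4'])

overlap = compare_ids(toy_prep_ids, toy_corr_ids)
print(overlap)
```

On the real tables the same call would be `compare_ids(prep_df.index, corr_df.index)`, assuming both are indexed by sample ID.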


### [?] Duplicate Entries in library prep tsv

In [61]:
prep_df.loc['TCGA-AB-2860-03']

                 UMENDcount      libSelType
THid
TCGA-AB-2860-03         NaN  polyASelection
TCGA-AB-2860-03         NaN             NaN


In [76]:
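# Rows that repeat an earlier ID (first occurrence not counted)
n_repeats = prep_df.index.duplicated().sum()
# All rows whose ID occurs more than once (first occurrence counted)
n_in_dup_groups = prep_df.index.duplicated(keep=False).sum()
print(n_repeats)
print(n_in_dup_groups)
# If every duplicated ID occurs exactly twice, the second count is double the first
assert n_in_dup_groups == 2 * n_repeats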

833
1666


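There are 833 sample IDs that each appear exactly twice in the prep tsv (1666 rows in total). In the one pair inspected above, only one of the two rows has `libSelType` filled in.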
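One possible way to collapse such pairs is to keep the first non-missing value per column for each ID. This is only a sketch on a toy frame with made-up IDs; it assumes the two rows of a pair never disagree, which has not been checked here for all 833 pairs:

```python
import numpy as np
import pandas as pd

# Toy stand-in for prep_df: 'A-01' appears twice, once with libSelType missing.
toy_prep = pd.DataFrame(
    {
        'UMENDcount': [np.nan, np.nan, 5.0],
        'libSelType': ['polyASelection', np.nan, 'riboDepletion'],
    },
    index=pd.Index(['A-01', 'A-01', 'B-01'], name='THid'),
)

# GroupBy.first() returns the first non-null entry of each column per group,
# so a filled value wins over a missing one regardless of row order.
dedup = toy_prep.groupby(level=0).first()
print(dedup)
```

If the rows of a pair could carry conflicting values, it would be safer to inspect them first, for example with `prep_df[prep_df.index.duplicated(keep=False)]`.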

### Method to get correlations from TSV or JSON

In [46]:
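def get_correlations(sample):
    """Return {other_sample: correlation} for one focus sample."""
    if sample in corr_df.index:
        # Key by the row's own column labels rather than zipping with
        # corr_df.index, which is only right if rows and columns share an order.
        corr_dict = corr_df.loc[sample].to_dict()
    else:
        # Sample is not in the matrix: fall back to its per-sample JSON file.
        with open('data/%s/2.0.json' % sample, 'r') as f:
            corr_dict = json.load(f)['correlations_vs_focus_sample']
    return corr_dict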

# Experiments/Results

In [49]:
sample_A = 'TH01_0121_S01'
sample_B = 'TH01_0123_S01'

# 1.

Recheck results: are most similar samples still AML in v5?

### TH01_0121_S01

https://jupyter.treehouse.gi.ucsc.edu/user/kellerjordan/files/TURG-Research/data/TH01_0121_S01/Summary.html

Of `6` most correlated samples, `3/6 = 0.5` were AML.

### TH01_0123_S01

https://jupyter.treehouse.gi.ucsc.edu/user/kellerjordan/files/TURG-Research/data/TH01_0123_S01/Summary.html

Of `6` most correlated samples, `6/6 = 1.0` were AML.

# 2.

For each, what fraction of top 95% (corr > 0.87) were AML?

In [101]:
def top_samples(sample, top_95=0.87):
    # Samples whose correlation with `sample` exceeds the threshold.
    # (The result list is not named `top_samples`, to avoid shadowing the function.)
    correlations = get_correlations(sample)
    return [s for s, c in correlations.items() if c > top_95]

def get_prefix(sample):
    last_pos = max(sample.rfind('_'), sample.rfind('-'))
    return sample[:last_pos]

def get_prefixes(samples):
    return [get_prefix(s) for s in samples]

def fraction_aml(sample):
    corr_samples = top_samples(sample)
    prefix_corr_samples = get_prefixes(corr_samples)
    # Look up the disease label by ID prefix (the ID without its final suffix)
    corr_diseases = type_df.loc[prefix_corr_samples, 'Diagnosis/Disease']
    disease_counts = corr_diseases.value_counts()
    # .get avoids a KeyError when none of the top samples is AML
    return disease_counts.get('acute myeloid leukemia', 0) / len(corr_samples)

In [104]:
print('Results:')
print(sample_A, fraction_aml(sample_A))
print(sample_B, fraction_aml(sample_B))

Results:
TH01_0121_S01 0.74358974359
TH01_0123_S01 0.590476190476
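To make the calculation concrete, here is a small worked example on invented data. The sample IDs, correlations, and labels below are hypothetical and chosen only to show how the threshold, the prefix lookup, and the fraction fit together; `get_prefix` is copied from the cell above so the snippet runs on its own:

```python
from collections import Counter

def get_prefix(sample):
    # Drop everything from the last '_' or '-' onward (same logic as above).
    last_pos = max(sample.rfind('_'), sample.rfind('-'))
    return sample[:last_pos]

# Hypothetical correlations with some focus sample.
toy_corr = {
    'TH99_0001_S01': 0.95,
    'TCGA-XX-0001-03': 0.90,
    'TCGA-XX-0002-03': 0.88,
    'TH99_0002_S01': 0.50,
}
# Hypothetical disease labels keyed by ID prefix, like type_df's index.
toy_labels = {
    'TH99_0001': 'acute myeloid leukemia',
    'TCGA-XX-0001': 'acute myeloid leukemia',
    'TCGA-XX-0002': 'acute lymphoblastic leukemia',
    'TH99_0002': 'glioma',
}

top = [s for s, c in toy_corr.items() if c > 0.87]
counts = Counter(toy_labels[get_prefix(s)] for s in top)
toy_fraction = counts['acute myeloid leukemia'] / len(top)
print(len(top), toy_fraction)
```

Three of the four toy samples pass the threshold and two of those are AML, so the fraction is 2/3.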


# 3.

Are the most correlated samples still placed in the AML cluster on tumormap?

https://tumormap.ucsc.edu/?p=CKCC/v5&node=TH26_0657_S04,TH26_0657_S05,TH26_0657_S03,TCGA-AB-2915-03,TCGA-AB-2868-03,TCGA-AB-2952-03&x=24,25,24,346,346,344&y=323,321,322,40,42,40


https://tumormap.ucsc.edu/?p=CKCC/v5&node=TCGA-AB-2934-03,TCGA-AB-2890-03,TCGA-AB-2927-03,TCGA-AB-2936-03,TCGA-AB-2915-03,TCGA-AB-2868-03&x=340,349,350,349,346,346&y=44,43,44,48,40,42

For the top six most correlated samples in each case, all that were labeled AML are placed in the AML cluster. Those that are labeled ALL are not.

# 4.

How correlated are other ALL samples with AML? Specifically, what impact does RiboD preparation have on their correlation? Use the `type_df` to find ALL samples of interest.

This is the only section where preparation type matters: in every other case I search only the correlation matrix, which contains only PolyA-prepared samples.
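One way this comparison might be set up is to compute, for each ALL sample, its mean correlation with the AML samples and then summarize by library prep type. The sketch below uses a toy table with invented IDs and numbers; the column name `mean_corr_with_aml` is hypothetical, and for RiboD samples the correlations would presumably have to come from the per-sample JSON files rather than the matrix:

```python
import pandas as pd

# Hypothetical per-sample summary for ALL samples (values are made up).
toy_all = pd.DataFrame(
    {
        'libSelType': ['polyASelection', 'polyASelection',
                       'riboDepletion', 'riboDepletion'],
        'mean_corr_with_aml': [0.80, 0.84, 0.88, 0.90],
    },
    index=['ALL_1', 'ALL_2', 'ALL_3', 'ALL_4'],
)

# Number of samples and average correlation with AML per prep type.
by_prep = toy_all.groupby('libSelType')['mean_corr_with_aml'].agg(['count', 'mean'])
print(by_prep)
```

With real data, `libSelType` would come from `prep_df` and the set of ALL samples from `type_df`, after the duplicate IDs in the prep table have been dealt with.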

# 5.

Find the samples of any type that are most correlated with AML (but not AML themselves).
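A possible approach, sketched on a toy 4x4 matrix: take the rows for AML samples and the columns for everything else, average each column, and sort. All IDs, labels, and values below are invented, and using the mean over AML samples as the ranking score is an assumption, not something fixed by the question:

```python
import pandas as pd

samples = ['aml_1', 'aml_2', 'other_1', 'other_2']
# Hypothetical symmetric correlation matrix.
toy_corr = pd.DataFrame(
    [[1.00, 0.95, 0.90, 0.60],
     [0.95, 1.00, 0.86, 0.70],
     [0.90, 0.86, 1.00, 0.50],
     [0.60, 0.70, 0.50, 1.00]],
    index=samples, columns=samples,
)
# Hypothetical disease labels for the same samples.
toy_disease = pd.Series(
    ['acute myeloid leukemia', 'acute myeloid leukemia', 'disease X', 'disease Y'],
    index=samples,
)

is_aml = toy_disease == 'acute myeloid leukemia'
aml_ids = toy_disease[is_aml].index
other_ids = toy_disease[~is_aml].index

# Mean correlation of each non-AML sample with all AML samples, highest first.
ranked = toy_corr.loc[aml_ids, other_ids].mean(axis=0).sort_values(ascending=False)
print(ranked)
```

Here `other_1` ranks first because it correlates more strongly with both AML samples than `other_2` does.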