# Data-Exploration

Import and explore data

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import sys
import json
import glob

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from utils import *

In [3]:
%%time
corr_df = load_tsv('/data/archive/compendium/v5/v5_all_by_all.2018-02-04.tsv')

CPU times: user 1min 47s, sys: 30.7 s, total: 2min 17s
Wall time: 2min 18s
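
`load_tsv` comes from the local `utils` module, which is not shown in this notebook. A minimal stand-in, assuming it simply reads a tab-separated file and uses the first column as the index, might look like the sketch below. The name `load_tsv_sketch` and that behavior are assumptions, not the actual helper.

```python
import io

import pandas as pd


def load_tsv_sketch(path_or_buffer, index_col=0):
    """Hypothetical stand-in for utils.load_tsv: read a tab-separated table."""
    return pd.read_csv(path_or_buffer, sep='\t', index_col=index_col)


# Tiny in-memory table used instead of the real compendium file.
toy_tsv = io.StringIO("THid\tS1\tS2\nS1\t1.0\t0.5\nS2\t0.5\t1.0\n")
toy_df = load_tsv_sketch(toy_tsv)
print(toy_df.shape)
```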


In [4]:
%%time
meta_df = load_tsv('/data/archive/compendium/v5/clinical.tsv')

CPU times: user 152 ms, sys: 64 ms, total: 216 ms
Wall time: 152 ms


In [5]:
%%time
prep_df = load_tsv('~/work/TURG/resources/TreehouseCompendiumSamples_LibraryPrep.tsv')

CPU times: user 20 ms, sys: 4 ms, total: 24 ms
Wall time: 21.8 ms


In [6]:
%%time
type_df = pd.read_csv('~/work/TURG/resources/DiseaseAnnotations_2018-04_Labels.csv', sep=',', index_col=1)


## Dataset Info

#### `corr_df` : all by all correlation matrix

Pairwise correlations between 11,340 samples. The measure is presumably Pearson correlation, but this has not been confirmed (?).

Indexed by Treehouse Sample ID.

In [7]:
len(corr_df)

11340
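
Since the correlation measure is still unconfirmed, a few structural checks can at least confirm that the table behaves like a correlation matrix. The sketch below runs on a small synthetic matrix rather than on `corr_df`; the helper name `check_corr_matrix` is hypothetical.

```python
import numpy as np
import pandas as pd


def check_corr_matrix(df, tol=1e-6):
    """Return basic properties expected of a correlation matrix."""
    values = df.to_numpy(dtype=float)
    square = values.shape[0] == values.shape[1]
    return {
        'square': square,
        'same_labels': list(df.index) == list(df.columns),
        'symmetric': square and bool(np.allclose(values, values.T, atol=tol)),
        'unit_diagonal': square and bool(np.allclose(np.diag(values), 1.0, atol=tol)),
        'in_range': bool((np.abs(values) <= 1 + tol).all()),
    }


# Synthetic expression data: 50 "genes" (rows) by 5 "samples" (columns).
rng = np.random.default_rng(0)
expr = pd.DataFrame(rng.normal(size=(50, 5)), columns=[f'S{i}' for i in range(5)])

# DataFrame.corr computes pairwise correlations between columns.
toy_corr = expr.corr(method='pearson')
checks = check_corr_matrix(toy_corr)
print(checks)
```

Passing these checks would not prove the values are Pearson correlations. To test that guess directly, one could recompute the correlation for a handful of sample pairs from the underlying expression data and compare against the stored values.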

#### `meta_df`

Clinical metadata for the same 11,340 samples as in `corr_df`, with 45 columns including `disease` and `source`.

Indexed by Treehouse Sample ID.

In [8]:
len(meta_df)

11340

#### `prep_df`

Library preparation info for 12,748 rows (a superset of the samples in `corr_df` and `meta_df`). The index contains 833 duplicate entries; each duplicated ID appears exactly twice, and the extra row always has `libSelType = NaN`.

Indexed by Treehouse Sample ID.

In [9]:
print('Total samples:', len(prep_df), '\n')
print('Value counts:')
print(prep_df['libSelType'].value_counts(), '\n')
print('Number of NA values:', prep_df['libSelType'].isna().sum(), '\n')
print('Example of duplicate row:')
print(prep_df.loc['TCGA-AB-2860-03'], '\n')
print('Total number of duplicate rows:', prep_df.index.duplicated().sum())

Total samples: 12748 

Value counts:
polyASelection            11505
riboDepletion               191
presumed riboDepletion       32
unknown                       9
exomeSelection                8
Name: libSelType, dtype: int64 

Number of NA values: 1003 

Example of duplicate row:
                 UMENDcount      libSelType
THid                                       
TCGA-AB-2860-03         NaN  polyASelection
TCGA-AB-2860-03         NaN             NaN 

Total number of duplicate rows: 833
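
One way to collapse these duplicates, assuming the annotated row should win over the `NaN` row, is to sort so that missing `libSelType` values come last within each ID and then keep the first row per ID. This is a sketch on toy data, not necessarily how the duplicates ought to be resolved.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the duplicate pattern seen in prep_df.
toy_prep = pd.DataFrame(
    {'libSelType': ['polyASelection', np.nan, 'riboDepletion', np.nan]},
    index=pd.Index(['A', 'A', 'B', 'C'], name='THid'),
)

deduped = (
    toy_prep
    .assign(_missing=toy_prep['libSelType'].isna())  # True where the label is NaN
    .reset_index()
    .sort_values(['THid', '_missing'])               # annotated rows first within each ID
    .drop_duplicates(subset='THid', keep='first')
    .set_index('THid')
    .drop(columns='_missing')
)
print(deduped)
```

IDs whose only row is `NaN` (like `C` here) are kept, so no sample is lost by the cleanup.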


#### `type_df`

Dataset of 11,586 donors and their associated disease annotations. The relationship between these 11,586 donors and the samples in the other datasets is not yet clear.

Many of the donor IDs in this dataset cannot be found in `corr_df` (the all-by-all correlation matrix), and at least some are also absent from the file structure under `/data/archive/downstream`.

Indexed by Treehouse Donor ID (which makes up the first part of Treehouse Sample ID).

In [10]:
print('Number of donors:', len(type_df), '\n')
type_df.head()

Number of donors: 11586 



| Treehouse Donor ID | TH Unique Donor Key | Diagnosis/Disease | Diagnostic group | Histology |
|---|---|---|---|---|
| TH03_0003 | 3 | sarcoma | Sarcoma other (all other types) | undifferentiated |
| TH03_0004 | 4 | hepatoblastoma | Liver | fetal and embryonal elements |
| TH03_0005 | 5 | nasopharyngeal carcinoma | Other | non-keratinizing |
| TH03_0006 | 6 | rhabdomyosarcoma | Sarcoma other (all other types) | anaplastic |
| TH03_0007 | 7 | acute myeloid leukemia | Hematopoietic | |


In [11]:
# for example
donor_id = 'TH01_0660'
print('Donor id in type_df:', donor_id in type_df.index)
print('Donor id in file structure:', len(glob.glob('/data/archive/downstream/%s*' % donor_id)) != 0)

Donor id in type_df: True
Donor id in file structure: False
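
To quantify how many donors in `type_df` also appear in the other tables, one option is to derive a donor ID from each sample ID and intersect the sets. The sketch below assumes Treehouse sample IDs look like `<donor ID>_<sample suffix>` (for example `TH01_0660_S01`). That format is an assumption based on the note that the donor ID makes up the first part of the sample ID, and it has not been checked against the real data. IDs that do not fit the pattern, such as TCGA barcodes, are returned unchanged. The helper `donor_from_sample` is hypothetical.

```python
import re

# Assumed layout: 'TH' + optional letters + digits + '_' + digits, then '_' + suffix.
DONOR_PATTERN = re.compile(r'^(TH[A-Za-z]*\d+_\d+)_.+$')


def donor_from_sample(sample_id):
    """Return the assumed donor prefix of a sample ID, or the ID itself."""
    match = DONOR_PATTERN.match(sample_id)
    return match.group(1) if match else sample_id


# Toy IDs standing in for corr_df.index and type_df.index.
sample_ids = ['TH01_0660_S01', 'TH03_0003_S02', 'TCGA-AB-2860-03']
donor_ids = {'TH01_0660', 'TH03_0004'}

sample_donors = {donor_from_sample(s) for s in sample_ids}
shared = donor_ids & sample_donors    # donors with at least one sample
missing = donor_ids - sample_donors   # donors with no matching sample
print('shared:', sorted(shared))
print('missing:', sorted(missing))
```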


## Thoughts

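When retrieving correlations, we can only get correlations against the samples in `corr_df`, i.e. the reference set (?). This is somewhat problematic, as it means that a sample outside `corr_df` can only be compared against the reference set, never against another sample outside it.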
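
For a sample that is already in the reference set, retrieval amounts to a row lookup in the matrix. A minimal sketch on a toy matrix follows; the helper `most_correlated` is hypothetical and the values are made up for illustration.

```python
import pandas as pd

ids = ['S0', 'S1', 'S2', 'S3']
toy_corr = pd.DataFrame(
    [[1.0, 0.9, 0.2, 0.5],
     [0.9, 1.0, 0.3, 0.4],
     [0.2, 0.3, 1.0, 0.8],
     [0.5, 0.4, 0.8, 1.0]],
    index=ids, columns=ids,
)


def most_correlated(corr, sample_id, k=2):
    """Return the k reference samples most correlated with sample_id."""
    row = corr.loc[sample_id].drop(sample_id)  # exclude the self-correlation
    return row.nlargest(k)


top = most_correlated(toy_corr, 'S0')
print(top)
```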

If the samples missing from the file structure are missing systematically rather than at random, any analysis built on the retrievable samples will be biased.

I chose to retrieve only a single sample for each donor in `type_df`. Right now, though, that sample might come from a donor who already has samples in `corr_df`, so some donors could be counted twice.
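
A possible way to avoid that double-counting is to drop donors already represented in `corr_df` before picking one sample per donor. The sketch uses toy IDs and a precomputed `donor` column (assumed to be derived from the sample ID). Taking the first sample per donor in sorted order is an arbitrary choice.

```python
import pandas as pd

# Candidate samples found on disk, with their (assumed) donor IDs.
candidates = pd.DataFrame({
    'sample': ['TH01_0001_S02', 'TH01_0001_S01', 'TH01_0002_S01', 'TH01_0003_S01'],
    'donor': ['TH01_0001', 'TH01_0001', 'TH01_0002', 'TH01_0003'],
})

# Donors that already contribute a sample to the reference set (toy values).
reference_donors = {'TH01_0002'}

selected = (
    candidates[~candidates['donor'].isin(reference_donors)]  # skip known donors
    .sort_values('sample')                                   # deterministic pick
    .drop_duplicates(subset='donor', keep='first')           # one sample per donor
    .reset_index(drop=True)
)
print(selected)
```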