- TCGA (The Cancer Genome Atlas: cancer genomic data with associated clinical data), hosted on Google Cloud and accessible via Python. Example notebooks and scripts are in the [GitHub repository](https://github.com/isb-cgc/examples-Python).

- [Google Genomics](https://cloud.google.com/genomics/docs/quickstart)

- [List of publicly available clinical datasets](https://github.com/EpistasisLab/ClinicalDataSources) from the Epistasis Lab at UPenn

- [Open data](https://avillach-lab.hms.harvard.edu/access-data/open-data) from the Avillach Lab at Harvard

- [Columbia Open Health Data](https://www.nature.com/articles/sdata2018273)

- [openFDA](https://open.fda.gov/)

- [Public datasets on GitHub](https://github.com/awesomedata/awesome-public-datasets)

- [eICU MIT clinical data](https://github.com/MIT-LCP/eicu-code)

- [MGH/MF Waveform data](https://alpha.physionet.org/content/mghdb/1.0.0/)

- [PhysioNet clinical datasets](https://alpha.physionet.org/about/database/#image)

## Columbia Open Health Data

Ta, Casey N.; Dumontier, Michel; Hripcsak, George; Tatonetti, Nicholas P.; Weng, Chunhua (2018): Columbia Open Health Data, a database of EHR prevalence and co-occurrence of conditions, drugs, and procedures. figshare. Collection.

In [13]:
import os
import wget
# more dataset links at https://figshare.com/collections/Columbia_Open_Health_Data_a_database_of_EHR_prevalence_and_co-occurrence_of_conditions_drugs_and_procedures/4151252/1
url_dict = {
'5year_paired_concepts' : 'https://ndownloader.figshare.com/files/12272924',
'lifetime_paired_concept_deviations': 'https://ndownloader.figshare.com/files/13154816',
'5year_single_concept_deviations' : 'https://ndownloader.figshare.com/files/13154819',
'lifetime_single_concept_deviations' : 'https://ndownloader.figshare.com/files/13154756'
}
output_directory = 'data'
# wget treats `out` as a directory only if it already exists, so create it first
os.makedirs(output_directory, exist_ok=True)
filename = wget.download(url_dict['5year_paired_concepts'], out=output_directory)
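Re-running a download cell repeats the transfer, which is slow for the larger files. A minimal sketch of a cached download follows, using only the standard library instead of `wget`. The helper name `download_if_missing` and its explicit `filename` argument are assumptions made for illustration, not part of the original notebook.

```python
import os
import urllib.request


def download_if_missing(url, output_directory, filename):
    """Download `url` to output_directory/filename unless that file already exists.

    Hypothetical helper: the caller chooses the local file name explicitly.
    """
    os.makedirs(output_directory, exist_ok=True)
    path = os.path.join(output_directory, filename)
    if not os.path.exists(path):
        # urlretrieve streams the response body to `path`
        urllib.request.urlretrieve(url, path)
    return path


# Example call (not executed here; the local file name is an arbitrary choice):
# path = download_if_missing(url_dict['5year_paired_concepts'], 'data',
#                            '5year_paired_concepts.tsv')
```

Because the helper returns the local path either way, it can stand in wherever the cells here use `filename`.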

In [23]:
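import pandas as pd
# The file has four tab-separated columns and no header row. With only three
# names, pandas uses the first column as the index and the labels shift.
# The third column is assumed to be the co-occurrence count.
pd.read_csv(filename,sep='\t',
            names=['concept_id_1','concept_id_2','concept_count','prevalence']
           ).head()

| | concept_id_1 | concept_id_2 | concept_count | prevalence |
|---|---|---|---|---|
| 0 | 8507 | 8515 | 8907 | 0.004975 |
| 1 | 8507 | 8516 | 43498 | 0.024295 |
| 2 | 8507 | 8522 | 148 | 8.3e-05 |
| 3 | 8507 | 8527 | 182994 | 0.102207 |
| 4 | 8507 | 8552 | 14273 | 0.007972 |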


## [Quantitative Dehydration Estimation](https://alpha.physionet.org/content/qde/1.0.0/)

In [3]:
import wget
output_directory = 'data'
filename = wget.download('https://alpha.physionet.org/files/qde/1.0.0/dehydration_estimation.csv?download',out=output_directory)

In [5]:
import pandas as pd
f = pd.read_csv(filename,sep=',')
display(f.shape)
f.head()

(90, 33)

| | id | age [years] | height [cm] | running speed [km/h] | running interval | weight measured using Kern DE 150K2D [kg] | weight measured using InBody 720 [kg] | total body water using InBody 720 [l] | impedance right arm at 1000kHz [Ohm] | impedance left arm at 1000kHz [Ohm] | ... | temperature lower leg [degree C] | sweat chloride [mmol/l] | sweat osmolality [mmol/kg] | salivary amylase [units/l] | salivary chloride [mmol/l] | salivary cortisol [ng/ml] | salivary cortisone [ng/ml] | salivary osmolality [mmol/kg] | salivary potassium [mmol/l] | salivary protein concentration [mg/l] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 29.0 | 190.0 | 8.0 | 0 | 85.515 | 85.9 | 54.0 | 231.27 | 232.63 | ... | 31.9 | NaN | NaN | NaN | 28.0 | NaN | NaN | 76.0 | NaN | 576.8 |
| 1 | 1 | NaN | NaN | NaN | 1 | 85.275 | 85.56 | 53.7 | 234.75 | 240.08 | ... | 31.3 | 56.0 | 146.0 | NaN | 29.0 | NaN | NaN | 78.0 | NaN | 544.1 |
| 2 | 1 | NaN | NaN | NaN | 2 | 84.895 | 85.32 | 54.0 | 230.81 | 233.95 | ... | 31.2 | 55.0 | 134.0 | 111700.0 | 35.0 | 1.24 | 11.8 | 84.0 | 37.0 | 537.3 |
| 3 | 1 | NaN | NaN | NaN | 3 | 84.54 | 84.9 | 54.0 | 231.96 | 236.32 | ... | 30.9 | 53.0 | 123.0 | NaN | 38.0 | 0.947 | 10.6 | 95.0 | 38.0 | 595.6 |
| 4 | 1 | NaN | NaN | NaN | 4 | 84.185 | 84.48 | 53.9 | 227.03 | 232.07 | ... | 31.7 | 36.0 | 185.0 | 154110.0 | 37.0 | 0.727 | 9.64 | 91.0 | 32.0 | 541.3 |
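In the preview, age, height, and running speed are filled only on the first row of subject `1` (running interval 0) and missing on the later intervals. Assuming those blanks mean "same as the subject's first row", which the source does not state, a per-subject forward fill is one way to propagate them. The sketch below uses a small invented frame that mimics the layout; the values for subject `2` are made up for illustration.

```python
import pandas as pd

# Toy frame mimicking the layout above; subject 2 is invented.
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "running interval": [0, 1, 2, 0, 1],
    "age [years]": [29.0, None, None, 35.0, None],
    "height [cm]": [190.0, None, None, 172.0, None],
})

# Columns assumed to be constant within a subject
static_cols = ["age [years]", "height [cm]"]

# Forward-fill within each subject so values never leak across ids
df[static_cols] = df.groupby("id")[static_cols].ffill()
```

Grouping by `id` before filling matters: a plain `ffill()` would copy the last value of one subject into the next subject whenever that subject's first row is itself missing.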


## Tappy Keystroke Data

In [73]:
import wget
output_directory='data'
filename = wget.download('https://alpha.physionet.org/static/published-projects/tappy/tappy-keystroke-data-1.0.0.zip',out=output_directory)

In [74]:
filename

'data/tappy-keystroke-data-1.0.0.zip'

In [75]:
import zipfile
with zipfile.ZipFile(filename, 'r') as zip_ref:
    zip_ref.extractall(output_directory)

In [76]:
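import os
# The archive unpacks into a folder named after the zip file (base name without extension)
extracted_dir = os.path.splitext(os.path.basename(filename))[0]
full_extracted_dir = os.path.join(output_directory, extracted_dir)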

In [77]:
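import os
# Select the nested archives by extension; os.listdir returns entries in arbitrary order
zip_files = [f for f in os.listdir(full_extracted_dir) if f.endswith('.zip')]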

In [78]:
for f in zip_files:
    with zipfile.ZipFile(full_extracted_dir+'/'+f, 'r') as zip_ref:
        zip_ref.extractall(full_extracted_dir)

In [79]:
os.listdir(full_extracted_dir)

['Archived-users.zip',
 'Archived users',
 'Tappy Data',
 'Archived-Data.zip',
 'SHA256SUMS.txt']
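The listing shows that the downloaded archive contains further zip files that must be unpacked in a second pass. The two passes can be sketched as one helper that finds the inner archives from the outer archive's own member list, so it does not depend on directory listing order. The function name `extract_nested_zips` is hypothetical, and the sketch assumes only one level of nesting.

```python
import zipfile
from pathlib import Path


def extract_nested_zips(archive_path, output_directory):
    """Extract an archive, then extract every .zip member next to where it landed.

    Hypothetical helper; handles one level of nesting only.
    """
    output_directory = Path(output_directory)
    with zipfile.ZipFile(archive_path) as outer:
        outer.extractall(output_directory)
        # Inner archives are identified from the member list, not from os.listdir
        inner_names = [n for n in outer.namelist() if n.lower().endswith(".zip")]

    extracted = []
    for name in inner_names:
        inner_path = output_directory / name
        with zipfile.ZipFile(inner_path) as inner:
            inner.extractall(inner_path.parent)
        extracted.append(inner_path)
    return extracted
```

A call such as `extract_nested_zips(filename, output_directory)` would cover both extraction cells, under the assumptions stated.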

In [89]:
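# Keep only the extracted folders; sort because os.listdir order is arbitrary
data_folders = sorted(d for d in os.listdir(full_extracted_dir)
                      if os.path.isdir(os.path.join(full_extracted_dir, d)))
data_folders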

['Archived users', 'Tappy Data']

In [90]:
# Map each folder name to the full paths of the files it contains
data_files = {}
for d in data_folders:
    folder = os.path.join(full_extracted_dir, d)
    data_files[d] = [os.path.join(folder, x) for x in os.listdir(folder)]

In [91]:
data_files['Archived users'][0]

'data/tappy-keystroke-data-1.0.0/Archived users/User_PJU53Y7KVB.txt'

In [92]:
import yaml

In [93]:
with open(data_files['Archived users'][0], 'r') as stream:
    data_loaded = yaml.safe_load(stream)

In [94]:
data_loaded

{'BirthYear': None,
 'Gender': 'Male',
 'Parkinsons': True,
 'Tremors': True,
 'DiagnosisYear': None,
 'Sided': 'Left',
 'UPDRS': "Don't know",
 'Impact': 'Mild',
 'Levadopa': True,
 'DA': False,
 'MAOB': False,
 'Other': False}

In [95]:
data_files['Tappy Data'][0]

'data/tappy-keystroke-data-1.0.0/Tappy Data/NMMGWRY6SO_1703.txt'
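The two example paths suggest a naming scheme: `User_<key>.txt` for the profile and `<key>_<4 digits>.txt` for the keystroke logs, where `<key>` is a 10-character identifier that also appears in the `UserKey` column. That scheme is inferred from the two file names shown, not documented here, so the patterns below are assumptions. A minimal sketch for pairing each profile with its keystroke files:

```python
import os
import re

# Patterns inferred from the two example file names (assumptions)
USER_FILE = re.compile(r"^User_([A-Z0-9]{10})\.txt$")
DATA_FILE = re.compile(r"^([A-Z0-9]{10})_(\d{4})\.txt$")


def group_files_by_user(user_paths, data_paths):
    """Return {user_key: {'user_file': path, 'data_files': [paths]}}."""
    users = {}
    for path in user_paths:
        match = USER_FILE.match(os.path.basename(path))
        if match:
            users[match.group(1)] = {"user_file": path, "data_files": []}
    for path in data_paths:
        match = DATA_FILE.match(os.path.basename(path))
        # Keystroke files without a matching profile are skipped
        if match and match.group(1) in users:
            users[match.group(1)]["data_files"].append(path)
    return users
```

With the dictionary built in the cells above, the call would be `group_files_by_user(data_files['Archived users'], data_files['Tappy Data'])`.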

In [96]:
import pandas as pd

In [97]:
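# First attempt: every line in the file ends with a trailing tab, so each row
# has 9 fields. With only 8 names, pandas uses the first field as the index and
# the labels shift one column to the right (see the output below).
f = pd.read_csv(data_files['Tappy Data'][0],
                sep='\t',header=None,
                names=['UserKey','Date','Timestamp','Hand',
                       'Hold_time','Direction','Latency_time',
                       'Flight_time'])
f.head()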

| | UserKey | Date | Timestamp | Hand | Hold_time | Direction | Latency_time | Flight_time |
|---|---|---|---|---|---|---|---|---|
| NMMGWRY6SO | 170301 | 08:45:42.125 | L | 187.5 | LL | 421.9 | 281.3 | NaN |
| NMMGWRY6SO | 170301 | 08:45:42.422 | L | 203.1 | LL | 281.3 | 93.8 | NaN |
| NMMGWRY6SO | 170301 | 08:48:29.031 | L | 203.1 | LL | 296.9 | 125.0 | NaN |
| NMMGWRY6SO | 170301 | 08:48:29.266 | L | 218.8 | LL | 218.8 | 15.6 | NaN |
| NMMGWRY6SO | 170301 | 08:48:29.484 | R | 187.5 | LR | 250.0 | 31.3 | NaN |


In [98]:
# Source: https://stackoverflow.com/questions/18857352/python-remove-very-last-character-in-file

import os


def truncate_utf8_chars(filename, count, ignore_newlines=True):
    """
    Truncates last `count` characters of a text file encoded in UTF-8.
    :param filename: The path to the text file to read
    :param count: Number of UTF-8 characters to remove from the end of the file
    :param ignore_newlines: Set to true, if the newline character at the end of the file should be ignored
    """
    with open(filename, 'rb+') as f:
        last_char = None

        size = os.fstat(f.fileno()).st_size

        offset = 1
        chars = 0
        while offset <= size:
            f.seek(-offset, os.SEEK_END)
            b = ord(f.read(1))

            if ignore_newlines:
                if b == 0x0D or b == 0x0A:
                    offset += 1
                    continue

            if b & 0b10000000 == 0 or b & 0b11000000 == 0b11000000:
                # This is the first byte of a UTF8 character
                chars += 1
                if chars == count:
                    # When `count` characters have been found, move the current position back
                    # by one byte (to include the byte just checked) and truncate the file
                    f.seek(-1, os.SEEK_CUR)
                    f.truncate()
                    return
            offset += 1

In [100]:
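import csv
filename=data_files['Tappy Data'][0]
# Remove the last character of the file. This modifies the file in place, so
# re-running the cell on the same file removes one more character each time.
truncate_utf8_chars(filename, 1)
# The ninth, empty name absorbs the empty field produced by the trailing tab
col_names=['UserKey','Date','Timestamp','Hand','Hold_time','Direction','Latency_time','Flight_time','']
all_rows = []
with open(filename, newline = '') as rows:
    row_reader = csv.reader(rows, delimiter='\t')
    for row in row_reader:
        all_rows.append(row)
# csv.reader yields strings, so every column is text until converted
df = pd.DataFrame(all_rows,columns=col_names).drop('',axis=1)
print(df.shape)
df.head()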

(6547, 8)


| | UserKey | Date | Timestamp | Hand | Hold_time | Direction | Latency_time | Flight_time |
|---|---|---|---|---|---|---|---|---|
| 0 | NMMGWRY6SO | 170301 | 08:45:42.125 | L | 187.5 | LL | 421.9 | 281.3 |
| 1 | NMMGWRY6SO | 170301 | 08:45:42.422 | L | 203.1 | LL | 281.3 | 93.8 |
| 2 | NMMGWRY6SO | 170301 | 08:48:29.031 | L | 203.1 | LL | 296.9 | 125.0 |
| 3 | NMMGWRY6SO | 170301 | 08:48:29.266 | L | 218.8 | LL | 218.8 | 15.6 |
| 4 | NMMGWRY6SO | 170301 | 08:48:29.484 | R | 187.5 | LR | 250.0 | 31.3 |
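For the trailing-tab problem alone, `pandas.read_csv` can be told not to use the first column as the index by passing `index_col=False`, which the pandas documentation describes for files with a delimiter at the end of each line. The sketch below applies it to two rows copied from the preview, held in memory so it runs without the dataset. It is offered as a possible alternative and does not address whatever the file truncation step was meant to fix.

```python
import io

import pandas as pd

# Two rows copied from the preview; each line ends with a trailing tab
raw = (
    "NMMGWRY6SO\t170301\t08:45:42.125\tL\t187.5\tLL\t421.9\t281.3\t\n"
    "NMMGWRY6SO\t170301\t08:45:42.422\tL\t203.1\tLL\t281.3\t93.8\t\n"
)

col_names = ['UserKey', 'Date', 'Timestamp', 'Hand',
             'Hold_time', 'Direction', 'Latency_time', 'Flight_time']

# index_col=False keeps the first field as a regular column, so the eight
# names line up with the first eight fields and the empty ninth is dropped
df = pd.read_csv(io.StringIO(raw), sep='\t', header=None,
                 names=col_names, index_col=False)
```

Unlike the `csv.reader` route, this leaves the file on disk untouched and parses the timing columns as numbers rather than strings.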
