# Importing and downloading datasets

The longitudinalCLL package is designed to help you access clinical, blood cell count, proteomic and metabolic data for a longitudinal chronic lymphocytic leukemia study. Some of this data is password protected. If you would like full access to everything this package has to offer, please contact the Payne lab at BYU.

Here are some examples of how to download the datasets for the first time:

In [1]:
import longitudinalCLL

metab = longitudinalCLL.get_metabolic()
metab.load_dataset() 

Unnamed: 0,subject,date,Na mmol/L,K mmol/L,Cl mmol/L,ECO2 mmol/L,AGAP mmol/L,AHDL mg/dL,VLDL mg/dL,LDL mg/dL,...,A/G,GLUC mg/dL,BUN mg/dL,CA mg/dL,CRE2 mg/dL,BN/CR,eGFR NonAfrican Am,eGFR African,bilirubin mg/dL,Testing
0,1,2019-10-25,,,,,,43.0,,,...,,95,,,,,,,,BYU SHC
1,1,2019-11-22,,,,,,40.0,,,...,,94,,,,,,,,BYU SHC
2,1,2019-12-19,138.0,4.1,103.0,26.1,13.0,42.0,,,...,1.2,95,18.0,8.8,1.0,18.0,,,,BYU SHC
3,1,2020-01-31,140.0,3.7,104.0,26.2,13.5,40.0,,,...,1.3,95,14.0,8.6,1.0,14.0,,,,BYU SHC
4,1,2020-02-14,140.0,3.9,102.0,27.1,14.8,42.0,,,...,1.5,99,13.0,8.6,0.9,14.4,,,,BYU SHC
5,1,2020-02-21,139.0,3.8,102.0,26.4,14.4,47.0,,,...,1.4,97,16.0,8.8,0.9,17.8,,,,BYU SHC
6,1,2020-03-26,139.0,4.4,100.0,22.0,,,,,...,2.3,89,17.0,10.0,1.0,17.0,92.0,,0.6,LabCorp
7,1,2020-06-29,141.0,4.0,103.0,21.0,,40.0,39.0,139.0,...,2.4,85,18.0,9.3,1.21,15.0,73.0,84.0,0.03,LabCorp
8,1,2020-07-29,140.0,4.2,100.0,21.0,,42.0,58.0,120.0,...,2.4,85,12.0,9.6,0.98,12.0,94.0,109.0,0.3,LabCorp
9,1,2020-09-17,140.0,4.2,102.0,25.0,,39.0,60.0,133.0,...,2.4,90,14.0,9.6,0.98,14.0,94.0,109.0,0.4,LabCorp


This is an example of how to load a dataset. The same pattern works with get_proteomic, get_clinical and get_cbc for their respective data. Each getter returns an object of type Metabolic, Proteomic, Clinical or CBC, depending on which getter you use. The load_dataset function returns a pandas data frame and also saves that data frame as a member variable of the object, so the data frame can be accessed after downloading with the following command:

In [2]:
metab.data_frame

Unnamed: 0,subject,date,Na mmol/L,K mmol/L,Cl mmol/L,ECO2 mmol/L,AGAP mmol/L,AHDL mg/dL,VLDL mg/dL,LDL mg/dL,...,A/G,GLUC mg/dL,BUN mg/dL,CA mg/dL,CRE2 mg/dL,BN/CR,eGFR NonAfrican Am,eGFR African,bilirubin mg/dL,Testing
0,1,2019-10-25,,,,,,43.0,,,...,,95,,,,,,,,BYU SHC
1,1,2019-11-22,,,,,,40.0,,,...,,94,,,,,,,,BYU SHC
2,1,2019-12-19,138.0,4.1,103.0,26.1,13.0,42.0,,,...,1.2,95,18.0,8.8,1.0,18.0,,,,BYU SHC
3,1,2020-01-31,140.0,3.7,104.0,26.2,13.5,40.0,,,...,1.3,95,14.0,8.6,1.0,14.0,,,,BYU SHC
4,1,2020-02-14,140.0,3.9,102.0,27.1,14.8,42.0,,,...,1.5,99,13.0,8.6,0.9,14.4,,,,BYU SHC
5,1,2020-02-21,139.0,3.8,102.0,26.4,14.4,47.0,,,...,1.4,97,16.0,8.8,0.9,17.8,,,,BYU SHC
6,1,2020-03-26,139.0,4.4,100.0,22.0,,,,,...,2.3,89,17.0,10.0,1.0,17.0,92.0,,0.6,LabCorp
7,1,2020-06-29,141.0,4.0,103.0,21.0,,40.0,39.0,139.0,...,2.4,85,18.0,9.3,1.21,15.0,73.0,84.0,0.03,LabCorp
8,1,2020-07-29,140.0,4.2,100.0,21.0,,42.0,58.0,120.0,...,2.4,85,12.0,9.6,0.98,12.0,94.0,109.0,0.3,LabCorp
9,1,2020-09-17,140.0,4.2,102.0,25.0,,39.0,60.0,133.0,...,2.4,90,14.0,9.6,0.98,14.0,94.0,109.0,0.4,LabCorp
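
Because the result is an ordinary pandas data frame, the usual pandas tools apply to it. The following is a minimal sketch on a small hand-copied excerpt of the output above rather than on the real download. It assumes the date column arrives as text, which the output suggests but does not confirm.

```python
import pandas as pd

# Hand-copied excerpt of the output above; in practice use metab.data_frame.
df = pd.DataFrame(
    {
        "subject": [1, 1, 1],
        "date": ["2019-10-25", "2019-11-22", "2019-12-19"],
        "AHDL mg/dL": [43.0, 40.0, 42.0],
        "GLUC mg/dL": [95, 94, 95],
    }
)

# Assumption: dates arrive as text, so convert them before any time-based work.
df["date"] = pd.to_datetime(df["date"])

# One measurement over time for one subject, indexed by visit date.
glucose = df.loc[df["subject"] == 1].set_index("date")["GLUC mg/dL"].sort_index()
print(glucose)
```

If the real data frame already stores dates as datetimes, the conversion step is harmless and can be dropped.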


The load_dataset function accepts several arguments. Two of these are subjects, which selects the subjects to pull from the data frame, and redownload, which downloads the data again to fix any errors that might have arisen during the initial download. Here is an example of each parameter in action:

In [3]:
metab.load_dataset(redownload = True)

Unnamed: 0,subject,date,Na mmol/L,K mmol/L,Cl mmol/L,ECO2 mmol/L,AGAP mmol/L,AHDL mg/dL,VLDL mg/dL,LDL mg/dL,...,A/G,GLUC mg/dL,BUN mg/dL,CA mg/dL,CRE2 mg/dL,BN/CR,eGFR NonAfrican Am,eGFR African,bilirubin mg/dL,Testing
0,1,2019-10-25,,,,,,43.0,,,...,,95,,,,,,,,BYU SHC
1,1,2019-11-22,,,,,,40.0,,,...,,94,,,,,,,,BYU SHC
2,1,2019-12-19,138.0,4.1,103.0,26.1,13.0,42.0,,,...,1.2,95,18.0,8.8,1.0,18.0,,,,BYU SHC
3,1,2020-01-31,140.0,3.7,104.0,26.2,13.5,40.0,,,...,1.3,95,14.0,8.6,1.0,14.0,,,,BYU SHC
4,1,2020-02-14,140.0,3.9,102.0,27.1,14.8,42.0,,,...,1.5,99,13.0,8.6,0.9,14.4,,,,BYU SHC
5,1,2020-02-21,139.0,3.8,102.0,26.4,14.4,47.0,,,...,1.4,97,16.0,8.8,0.9,17.8,,,,BYU SHC
6,1,2020-03-26,139.0,4.4,100.0,22.0,,,,,...,2.3,89,17.0,10.0,1.0,17.0,92.0,,0.6,LabCorp
7,1,2020-06-29,141.0,4.0,103.0,21.0,,40.0,39.0,139.0,...,2.4,85,18.0,9.3,1.21,15.0,73.0,84.0,0.03,LabCorp
8,1,2020-07-29,140.0,4.2,100.0,21.0,,42.0,58.0,120.0,...,2.4,85,12.0,9.6,0.98,12.0,94.0,109.0,0.3,LabCorp
9,1,2020-09-17,140.0,4.2,102.0,25.0,,39.0,60.0,133.0,...,2.4,90,14.0,9.6,0.98,14.0,94.0,109.0,0.4,LabCorp


In [4]:
metab.load_dataset(subjects=[1])

Unnamed: 0,subject,date,Na mmol/L,K mmol/L,Cl mmol/L,ECO2 mmol/L,AGAP mmol/L,AHDL mg/dL,VLDL mg/dL,LDL mg/dL,...,A/G,GLUC mg/dL,BUN mg/dL,CA mg/dL,CRE2 mg/dL,BN/CR,eGFR NonAfrican Am,eGFR African,bilirubin mg/dL,Testing
0,1,2019-10-25,,,,,,43.0,,,...,,95,,,,,,,,BYU SHC
1,1,2019-11-22,,,,,,40.0,,,...,,94,,,,,,,,BYU SHC
2,1,2019-12-19,138.0,4.1,103.0,26.1,13.0,42.0,,,...,1.2,95,18.0,8.8,1.0,18.0,,,,BYU SHC
3,1,2020-01-31,140.0,3.7,104.0,26.2,13.5,40.0,,,...,1.3,95,14.0,8.6,1.0,14.0,,,,BYU SHC
4,1,2020-02-14,140.0,3.9,102.0,27.1,14.8,42.0,,,...,1.5,99,13.0,8.6,0.9,14.4,,,,BYU SHC
5,1,2020-02-21,139.0,3.8,102.0,26.4,14.4,47.0,,,...,1.4,97,16.0,8.8,0.9,17.8,,,,BYU SHC
6,1,2020-03-26,139.0,4.4,100.0,22.0,,,,,...,2.3,89,17.0,10.0,1.0,17.0,92.0,,0.6,LabCorp
7,1,2020-06-29,141.0,4.0,103.0,21.0,,40.0,39.0,139.0,...,2.4,85,18.0,9.3,1.21,15.0,73.0,84.0,0.03,LabCorp
8,1,2020-07-29,140.0,4.2,100.0,21.0,,42.0,58.0,120.0,...,2.4,85,12.0,9.6,0.98,12.0,94.0,109.0,0.3,LabCorp
9,1,2020-09-17,140.0,4.2,102.0,25.0,,39.0,60.0,133.0,...,2.4,90,14.0,9.6,0.98,14.0,94.0,109.0,0.4,LabCorp


The redownload option takes a little longer to run because it downloads the data again, and the second data frame only shows data from subject 1, as requested. The default values of redownload and subjects are False and an empty list ([]), respectively.
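
The effect of the subjects argument can be pictured with plain pandas. This is only a sketch of the idea, not the package's implementation: filter_subjects is a hypothetical helper, and the rows for subjects 2 and 3 are made-up values for illustration.

```python
import pandas as pd


def filter_subjects(df, subjects=None):
    """Hypothetical helper: keep rows for the given subjects, or all rows if none are given."""
    if not subjects:
        # None or an empty list means "no filtering", mirroring the empty-list default.
        return df
    return df[df["subject"].isin(subjects)]


# Made-up table; only the column names follow the output shown earlier.
df = pd.DataFrame({"subject": [1, 1, 2, 3], "GLUC mg/dL": [95, 94, 88, 101]})

only_one = filter_subjects(df, subjects=[1])
everyone = filter_subjects(df)
print(only_one)
```

The sketch uses None instead of [] as the default to avoid a mutable default argument in Python; both are treated the same way here.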

load_dataset can also take a version argument. The versioning scheme is specific to each dataset, and the proteomic dataset has the most versions, so we will use it to show this functionality. To see which versions of a dataset exist, call the following function after getting a dataset object:

In [5]:
prot = longitudinalCLL.get_proteomic()
prot.versions()

AttributeError: 'Proteomic' object has no attribute 'index_path'

This call is meant to list all of the versions that you can download; in the run captured above it raised an AttributeError instead. By default, load_dataset loads the most current version of each dataset. To specify which version of the dataset you would like to download, call the load_dataset function like so:

In [None]:
prot.load_dataset(version = "July_noMBR_FP")
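
The "most current version by default" behaviour can be sketched as follows. This is an illustration under stated assumptions, not the package's code: resolve_version is a hypothetical helper, the list is assumed to be ordered oldest to newest, and every version name except July_noMBR_FP is a placeholder.

```python
def resolve_version(available, version=None):
    """Hypothetical helper: pick the version to load from a list of known names."""
    if not available:
        raise ValueError("no versions are available")
    if version is None:
        # Assumption: the list is ordered oldest to newest.
        return available[-1]
    if version not in available:
        raise ValueError(f"unknown version {version!r}; choose one of {available}")
    return version


# Placeholder names, apart from the one used in the cell above.
known = ["example_older", "July_noMBR_FP", "example_newest"]

default_choice = resolve_version(known)
explicit_choice = resolve_version(known, version="July_noMBR_FP")
print(default_choice, explicit_choice)
```

Raising an error for an unknown name, rather than falling back silently, makes a mistyped version string visible straight away.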