# Loading methylation datasets

In [1]:
import methylcheck
from pathlib import Path # a tidier (pythonic) way to refer to files and folder paths

Previously, I processed a batch of IDAT files from the command line by running this command in the folder that contains them:
```
python -m methylprep process -d . --all
```
That creates several new Python pickle (`.pkl`) files in the folder:

```
beta_values.pkl
poobah_values.pkl
control_probes.pkl
m_values.pkl
noob_meth_values.pkl
noob_unmeth_values.pkl
sample_sheet_meta_data.pkl
```
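
Before loading, it can help to confirm that the folder actually contains these files. The following is a minimal sketch using only the standard library; the helper name `missing_outputs` is hypothetical (it is not part of `methylprep` or `methylcheck`), and the file list is simply copied from the listing above.

```python
from pathlib import Path

# File names copied from the listing above.
EXPECTED_FILES = (
    'beta_values.pkl',
    'poobah_values.pkl',
    'control_probes.pkl',
    'm_values.pkl',
    'noob_meth_values.pkl',
    'noob_unmeth_values.pkl',
    'sample_sheet_meta_data.pkl',
)

def missing_outputs(folder, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `folder` (hypothetical helper)."""
    folder = Path(folder)
    return [name for name in expected if not (folder / name).is_file()]

# Example: list whichever expected files are absent from the current folder.
print(missing_outputs('.'))
```

An empty list means every file named above was found.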

## Loading beta values
The default data format for `methylcheck` functions is a Pandas DataFrame (like a spreadsheet with columns and rows), but the `methylcheck.load` function will let you load a variety of file formats.

In [2]:
filepath = Path('/Volumes/LEGX/Barnes/one')
df = methylcheck.load(filepath)
# The first argument is the folder or file you want to load.
# Pointing directly at the pickle file also works (a relative path is enough
# if you start your Jupyter notebook in the same folder):
df = methylcheck.load(Path(filepath,'beta_values.pkl'))
# By default, probes that failed p-value detection are removed from the dataframe,
# unless you used the `--no_poobah` option during processing.

Files: 100%|██████████| 1/1 [00:00<00:00, 50.68it/s]
INFO:methylcheck.load_processed:loaded data (127852, 1) from 1 pickled files (0.038s)
Files:   0%|          | 0/1 [00:00<?, ?it/s]INFO:numexpr.utils:Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
Files: 100%|██████████| 1/1 [00:00<00:00,  9.42it/s]
INFO:methylcheck.load_processed:loaded data (127852, 1) from 1 pickled files (0.187s)


#### Gotcha: pointing to a folder with multiple sets of files
If you point to a folder with multiple batches of samples from different array types, you'll get an error.
All samples in the path you choose need to have the same number of probes (the same array type).
Don't do this; it raises an error:

In [3]:
# multifilepath = Path('/Volumes/LEGX/<somefolder>')
# df = methylcheck.load(multifilepath)
# the error says the probe counts don't match.

> ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 226618 and the array at index 1 has size 244827

Instead, move the files for each array type into separate folders and point to each one separately.
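
Once each array type has its own folder, every folder can be loaded on its own. A minimal sketch, assuming a parent folder with one subfolder per array type; `load_each_subfolder` is a hypothetical helper, and the loader is passed in as an argument so that `methylcheck.load` (or any other function that takes a path) can be plugged in.

```python
from pathlib import Path

def load_each_subfolder(parent, loader):
    """Call `loader` once per immediate subfolder of `parent`.

    Returns a dict mapping each subfolder name to whatever `loader` returned.
    Hypothetical helper, not part of methylcheck.
    """
    results = {}
    for sub in sorted(Path(parent).iterdir()):
        if sub.is_dir():
            results[sub.name] = loader(sub)
    return results

# Intended use (assumes methylcheck is installed and each subfolder holds one array type):
#   import methylcheck
#   frames = load_each_subfolder('/path/to/parent', methylcheck.load)
```

Keeping the results in a dict keyed by folder name makes it harder to mix up DataFrames with different probe counts later on.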

## Open Sesame
`methylcheck.load` works on the CSVs produced by the `sesame` `R` package.

You must specify an additional parameter: `format='sesame'`.

```df = methylcheck.load('.', format='sesame')```


In [4]:
filepath = Path('/Volumes/LEGX/Barnes/output_test')
df = methylcheck.load(filepath, format='sesame', verbose=True)
df

INFO:methylcheck.load_processed:2 files matched
INFO:methylcheck.load_processed:203319730027_R01C01, (866553, 1) --> 1
INFO:methylcheck.load_processed:203319730027_R01C01, (866553, 1) --> 2


Probe_ID,203319730027_R01C01,203319730027_R01C01
cg00000029,,
cg00000103,,
cg00000109,0.933105,0.933105
cg00000155,0.947754,0.947754
cg00000158,0.958496,0.958496
...,...,...
rs9363764,0.056000,0.056000
rs939290,0.548828,0.548828
rs951295,0.048615,0.048615
rs966367,0.964355,0.964355


## Loading from methylprep CSV output
`methylcheck` assumes you want to load data the fastest way, from a single high-performance Python 3 pickled DataFrame.
But there are times when you want to load from the CSV output files instead. One use case is examining
probes that were filtered out by poobah (p-value probe detection): the CSVs contain this information,
whereas the pickled DataFrame has it removed by default.

In [5]:
filepath = Path('/Volumes/LEGX/Barnes/mouse_test')
df = methylcheck.load(filepath, format='beta_csv')
df

Files: 100%|██████████| 6/6 [00:00<00:00,  8.74it/s]
  from pandas import Panel
INFO:methylcheck.load_processed:merging...
100%|██████████| 6/6 [00:00<00:00, 2400.86it/s]


IlmnID,204879580038_R01C02,204879580038_R02C02,204879580038_R03C02,204879580038_R04C02,204879580038_R05C02,204879580038_R06C02
cg00531009_BC21,0.930176,0.912109,0.919922,0.901855,0.924805,0.943848
cg00747726_TC21,0.883789,0.869141,0.880859,0.886230,0.895996,0.905762
cg01326836_BC21,0.206055,0.239990,0.239014,0.224976,0.229004,0.239990
cg01603043_TC21,0.858887,0.858887,0.876953,0.858887,0.869141,0.898926
cg01683044_BC21,0.058990,0.058990,0.056000,0.062988,0.067993,0.080994
...,...,...,...,...,...,...
ch985669618_BC21,0.816895,0.833984,,,,
ch985669618_BC21,0.816895,0.833984,,,,
ch985669618_BC21,0.816895,0.833984,,,,
ch991719372_BC21,,,,,,


## Other file formats supported by `methylcheck.load`:
Supported formats: `('beta_value', 'm_value', 'meth', 'meth_df', 'noob_df', 'sesame', 'beta_csv')`,
where `'beta_value'` is the default.


### 'meth_df' format
Returns two dataframes, representing the raw methylated and unmethylated fluorescence intensity values for all probes (in rows) and samples (in columns).

In [6]:
filepath = Path('/Volumes/LEGX/Barnes/mouse_test')
(meth,unmeth) = methylcheck.load(filepath, format='meth_df')
(meth.shape, unmeth.shape)

  from pandas import Panel
100%|██████████| 6/6 [00:00<00:00, 2850.68it/s]
100%|██████████| 6/6 [00:00<00:00, 2738.39it/s]
INFO:methylcheck.load_processed:(127852, 6) (127852, 6)


((127852, 6), (127852, 6))

### 'noob_df' format
Returns two dataframes, representing the NOOB (background-subtracted) methylated and unmethylated fluorescence intensity values for all probes (in rows) and samples (in columns).

In [7]:
filepath = Path('/Volumes/LEGX/Barnes/mouse_test')
(noob_meth,noob_unmeth) = methylcheck.load(filepath, format='noob_df')
(noob_meth.shape, noob_unmeth.shape)

  from pandas import Panel
100%|██████████| 6/6 [00:00<00:00, 1549.62it/s]
100%|██████████| 6/6 [00:00<00:00, 2265.76it/s]
INFO:methylcheck.load_processed:(127852, 6), (127852, 6)


((127852, 6), (127852, 6))

### 'meth' format
Returns a list of Python `SampleDataContainer` objects. This input is used in some older `methylcheck` functions, but you may never need it. You can get the same data as DataFrames from the `meth_df` format.

In [8]:
filepath = Path('/Volumes/LEGX/Barnes/mouse_test')
containers = methylcheck.load(filepath, format='meth')
containers

Files: 100%|██████████| 6/6 [00:01<00:00,  3.68it/s]
INFO:methylcheck.load_processed:Produced a list of Sample objects (use obj._SampleDataContainer__data_frame to get values)...


[<methylcheck.load_processed.SampleDataContainer at 0x7f82a9259b70>,
 <methylcheck.load_processed.SampleDataContainer at 0x7f82aa053470>,
 <methylcheck.load_processed.SampleDataContainer at 0x7f82a9259b38>,
 <methylcheck.load_processed.SampleDataContainer at 0x7f82a92597b8>,
 <methylcheck.load_processed.SampleDataContainer at 0x7f82a9259d68>,
 <methylcheck.load_processed.SampleDataContainer at 0x7f82a9259e80>]

### Loading CSVs using Pandas
The `methylcheck.load('.', format='beta_csv')` option will only load beta values into a single DataFrame. If you need to load everything, use Pandas.

Here is a simple way to load all CSV files in a folder into a list:

In [9]:
import pandas as pd
filepath = Path('/Volumes/LEGX/Barnes/mouse_test')
# rglob also searches subfolders, so every CSV found under filepath is loaded
sample_list = [pd.read_csv(csv_file) for csv_file in filepath.rglob('*.csv')]
len(sample_list)

9

In [10]:
sample_list[-1]

Unnamed: 0,IlmnID,noob_meth,noob_unmeth,poobah_pval,meth,unmeth,beta_value,m_value
0,cg00531009_BC21,8944,431.0,0.002,9260.0,1257.0,0.944,4.373
1,cg00747726_TC21,6108,533.0,0.007,6430.0,1479.0,0.906,3.517
2,cg01326836_BC21,1379,4255.0,0.004,1702.0,9570.0,0.240,-1.625
3,cg01603043_TC21,3255,266.0,0.015,3578.0,899.0,0.899,3.608
4,cg01683044_BC21,97,998.0,0.024,360.0,2490.0,0.081,-3.352
...,...,...,...,...,...,...,...,...
127847,ch985669618_BC21,379,37.0,0.096,702.0,230.0,0.734,3.311
127848,ch985669618_BC21,379,37.0,0.096,702.0,230.0,0.734,3.311
127849,ch985669618_BC21,379,37.0,0.096,702.0,230.0,0.734,3.311
127850,ch991719372_BC21,196,34.0,0.170,513.0,192.0,0.594,2.499
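
Because each per-sample CSV keeps a `poobah_pval` column, the probes that failed detection can be listed directly. A minimal sketch using the standard-library `csv` module on a few rows copied from the output above; the function name `failed_probes` and the 0.05 cutoff are assumptions for illustration, not values taken from `methylprep`.

```python
import csv
import io

# A few rows copied from the table above (only the columns needed here).
SAMPLE_CSV = """IlmnID,poobah_pval,beta_value
cg00531009_BC21,0.002,0.944
cg01683044_BC21,0.024,0.081
ch985669618_BC21,0.096,0.734
ch991719372_BC21,0.170,0.594
"""

def failed_probes(csv_file, pval_cutoff=0.05):
    """Return probe IDs whose poobah_pval exceeds `pval_cutoff` (assumed cutoff)."""
    reader = csv.DictReader(csv_file)
    return [row['IlmnID'] for row in reader if float(row['poobah_pval']) > pval_cutoff]

print(failed_probes(io.StringIO(SAMPLE_CSV)))
# prints ['ch985669618_BC21', 'ch991719372_BC21']
```

For one DataFrame taken from `sample_list` (called `sample` here for illustration), the equivalent pandas filter would be `sample[sample['poobah_pval'] > 0.05]`.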


### 'm_value'
As an alternative to beta values, `methylcheck` can also load the M-values for each sample, returned as a DataFrame.

In [11]:
filepath = Path('/Volumes/LEGX/Barnes/mouse_test')
df = methylcheck.load(filepath, format='m_value')
df

Files: 100%|██████████| 1/1 [00:00<00:00, 50.02it/s]
INFO:methylcheck.load_processed:loaded data (127852, 6) from 1 pickled files (0.039s)


IlmnID,204879580038_R01C02,204879580038_R02C02,204879580038_R03C02,204879580038_R04C02,204879580038_R05C02,204879580038_R06C02
cg00531009_BC21,4.042,3.552,3.738,3.377,3.830,4.373
cg00747726_TC21,3.119,2.854,3.064,3.146,3.307,3.517
cg01326836_BC21,-1.910,-1.637,-1.639,-1.753,-1.724,-1.625
cg01603043_TC21,2.944,2.851,3.220,2.908,3.013,3.608
cg01683044_BC21,-3.904,-3.925,-3.979,-3.807,-3.681,-3.352
...,...,...,...,...,...,...
ch985669618_BC21,3.541,3.514,,,,
ch985669618_BC21,3.541,3.514,,,,
ch985669618_BC21,3.541,3.514,,,,
ch991719372_BC21,,,,,,
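
For orientation, beta values and M-values are two views of the same methylated and unmethylated intensities. The sketch below uses commonly cited definitions: beta is the methylated intensity divided by the total intensity plus an offset, and the M-value is the log2 ratio of the two intensities. The function names and the defaults `offset=100` and `alpha=1` are assumptions; they were chosen because they closely reproduce the `beta_value` and `m_value` columns of the per-sample CSV shown earlier, not because the source states them.

```python
import math

def beta_value(meth, unmeth, offset=100):
    """Fraction methylated; `offset` (assumed 100) stabilises low-intensity probes."""
    return meth / (meth + unmeth + offset)

def m_value(meth, unmeth, alpha=1):
    """Log2 ratio of methylated to unmethylated intensity; `alpha` (assumed 1) avoids log(0)."""
    return math.log2((meth + alpha) / (unmeth + alpha))

# noob_meth and noob_unmeth for cg00531009_BC21, copied from the CSV row shown earlier.
meth, unmeth = 8944, 431.0
print(round(beta_value(meth, unmeth), 3))  # the CSV lists beta_value 0.944
print(round(m_value(meth, unmeth), 2))     # the CSV lists m_value 4.373
```

Under these assumptions the computed M-values agree with the CSV to within a few thousandths; the small remaining differences may come from rounding of the displayed intensities.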


## Loading both beta_values and meta_data
Use `methylcheck.load_both(<path>)`.
Note that `.load_both()` looks for the meta_data file generated by `methylprep process` and will not recognize meta data from other packages, such as `sesame`.

In [12]:
filepath = Path('/Volumes/LEGX/GEO/GSE75196')
df, meta = methylcheck.load_both(filepath)
print(df.shape)
print(meta.head())

Files: 100%|██████████| 1/1 [00:00<00:00,  8.26it/s]
INFO:methylcheck.load_processed:loaded data (485512, 24) from 1 pickled files (0.162s)
INFO:methylcheck.load_processed:meta.Sample_IDs match data.index (OK)


(24, 485512)
       GSM_ID                          Sample_Name  Sentrix_ID  \
0  GSM1944936  genomic DNA from placental biopsy 1  9376561054   
1  GSM1944939  genomic DNA from placental biopsy 2  9376561054   
2  GSM1944942  genomic DNA from placental biopsy 3  9376561054   
3  GSM1944944  genomic DNA from placental biopsy 4  9376561054   
4  GSM1944946  genomic DNA from placental biopsy 5  9376561054   

  Sentrix_Position                                             source     Sex  \
0           R01C01  placental biopsy collected <30mins of delivery...    Male   
1           R02C01     placental biopsy collected <30mins of delivery    Male   
2           R03C01     placental biopsy collected <30mins of delivery  Female   
3           R04C01     placental biopsy collected <30mins of delivery    Male   
4           R05C01     placental biopsy collected <30mins of delivery  Female   

        disease gestation (wk)    tissue            description  \
0  preeclampsia             36  plac