# Example Pyllelic Use-Case Notebook

<img src="https://github.com/Paradoxdruid/pyllelic/blob/master/assets/pyllelic_logo.png?raw=true" alt="pyllelic logo" width="100"/>

## Background

This notebook illustrates how to import and use `pyllelic` in a Jupyter environment.

Source code: <https://github.com/Paradoxdruid/pyllelic>

Documentation: <https://paradoxdruid.github.io/pyllelic/>

## Pre-setup / File preparation

### Obtaining fastq data

We can download RRBS (reduced representation bisulfite sequencing) data from the ENCODE project:
<http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibMethylRrbs/>

Those files are in unaligned FASTQ format, so we need to align them to a human reference genome.
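
For orientation, each FASTQ record spans four lines: an identifier, the sequence, a separator, and a quality string. The following is a minimal sketch for peeking at the first few records of a gzipped download using only the standard library. The helper name `peek_fastq` is hypothetical and is not part of pyllelic.

```python
import gzip
from itertools import islice


def peek_fastq(path, n_records=2):
    """Return the first n_records as (identifier, sequence, quality) tuples."""
    records = []
    with gzip.open(path, "rt") as handle:
        while len(records) < n_records:
            # A FASTQ record is exactly four consecutive lines
            block = list(islice(handle, 4))
            if len(block) < 4:
                break  # end of file or truncated record
            header, seq, _separator, qual = (line.rstrip("\n") for line in block)
            records.append((header, seq, qual))
    return records
```

Calling `peek_fastq` on one of the downloaded `.fastq.gz` files is a quick way to confirm the download is intact before starting the much slower alignment step.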

### Retrieve promoter sequence

We will download the genomic region of interest from the UCSC Genome browser and save it in a text file:

In [1]:
# Allow import of module from above notebooks directory
import sys
sys.path.append('..')

```python
from pyllelic import process
# Retrieve promoter genomic sequence
process.retrieve_seq("{prom_filename}.txt", chrom="chr5", start=1293200, end=1296000)
```

### Preparing bisulfite-converted genome

To prepare the genome and align reads, we'll use [bismark](https://github.com/FelixKrueger/Bismark), bowtie2, and samtools (through its pysam wrapper).

First, we need to download a genomic index sequence: <http://hgdownload.soe.ucsc.edu/goldenPath/hg19>

```python
# Processing imports
from pathlib import Path

# Set up file paths
genome = Path("/{your_directory}/{genome_file_directory}")
fastq = Path("/{your_directory}/{your_fastq_file.fastq.gz}")
```

**WARNING:** The next command is processor, RAM, and time intensive, and only needs to be run once!

```python
# Prepare genome via bismark
process.prepare_genome(genome) # can optionally give path to bowtie2 if not in PATH
```

### Aligning reads

**WARNING:** The next command is processor, RAM, and time intensive, and only needs to be run once!

```python
# Convert fastq to bismark-aligned bam
from pyllelic import process
process.bismark(genome, fastq)
```

Notes:

* We recommend renaming your `.bam` file so that its name encodes the cell line and tissue. Our convention is: `fh_CELLLINE_TISSUE.REGION.bam`
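
The naming convention pairs with the `fname_pattern` regular expression that appears (commented out) in the configuration cell later in this notebook. As a rough sketch, assuming that pattern is what pulls the cell-line label out of a file name, the following shows what it captures. The file names and the helper `cell_line_from_filename` are hypothetical.

```python
import re

# Pattern copied from the commented-out fname_pattern in the configuration cell
FNAME_PATTERN = re.compile(r"^[a-zA-Z]+_([a-zA-Z0-9]+)_.+bam$")


def cell_line_from_filename(filename):
    """Return the captured cell-line label, or None if the name does not fit."""
    match = FNAME_PATTERN.match(filename)
    return match.group(1) if match else None


print(cell_line_from_filename("fh_HELA_CERVIX.TERT.bam"))  # HELA
print(cell_line_from_filename("sample.bam"))  # None
```

A file that does not follow the convention yields no label here, which is one reason to rename files before analysis.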

Next, we need to sort and index the bam file using samtools functions.

```python
# Sort the bamfile
bamfile = Path("/{your_directory}/{bam_filename}.bam")
process.sort_bam(bamfile)
```

```python
# Create an index of the sorted bamfile
sorted_bam = Path("/{your_directory}/{bam_filename}_sorted.bam")
process.index_bam(sorted_bam)
```

### Organize directory contents

* Place sorted bam files and index files (again, rename to capture cell-line and tissue info) in the `test` folder for analysis by pyllelic.
* Place the promoter sequence in your main directory.

## Set-up

In [None]:
from pyllelic import pyllelic

1. Set up your disk location: `base_path` should be the directory we'll do our work in.
2. Create a sub-directory named `test` under `base_path` and put the `.bam` and `.bai` files in it.
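
Before running the analysis, it can help to confirm that every `.bam` file in `test` has an index next to it. The sketch below uses only `pathlib`. The helper `bams_missing_index` is hypothetical (not part of pyllelic), and it assumes the index is named either `sample.bam.bai` or `sample.bai`.

```python
from pathlib import Path


def bams_missing_index(base_path, test_dir="test"):
    """List .bam files under base_path/test_dir that have no .bai index beside them."""
    folder = Path(base_path) / test_dir
    missing = []
    for bam in sorted(folder.glob("*.bam")):
        # Accept either naming style: sample.bam.bai or sample.bai
        candidates = [bam.with_name(bam.name + ".bai"), bam.with_suffix(".bai")]
        if not any(candidate.exists() for candidate in candidates):
            missing.append(bam.name)
    return missing
```

An empty list from `bams_missing_index(base_path)` suggests the folder is laid out as described above.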

In [None]:
config = pyllelic.configure(
    base_path="/home/jovyan/assets/",
    prom_file="tert_genome.txt",
    prom_start=1293200,
    prom_end=1296000,
    chrom="5",
    offset=1293000,
    # viz_backend="plotly",
    # fname_pattern=r"^[a-zA-Z]+_([a-zA-Z0-9]+)_.+bam$",
    # test_dir="test",
    # results_dir="results",
)

## Main Parsing Functions

### Find files to analyze

In [None]:
files_set = pyllelic.make_list_of_bam_files(config)

In [None]:
# files_set # uncomment for debugging

### Perform full methylation analysis and generate data object

**WARNING:** This step is processor and time intensive and will perform all processing and analysis of the bam file data.

In [None]:
data = pyllelic.pyllelic(config=config, files_set=files_set)

### Check main data outcomes

In [None]:
# ", ".join(data.positions)

In [None]:
data.means.head()

In [None]:
data.modes.head()

In [None]:
data.diffs.head()

In [None]:
# data.diffs.loc[:,data.diffs.any()]

In [None]:
# data.quma_results[data.means.index[0]].values.head()

In [None]:
data.individual_data.loc[data.means.index[0], data.means.columns[0]]

## Write Output

### Save entire object as pickle

In [None]:
# data.save_pickle("test_data.pickle")

### Save QUMA results to Excel

In [None]:
# data.save("output.xlsx")

### Save analysis files (means, modes, diffs) to Excel

In [None]:
# data.write_means_modes_diffs("Full_") # Sets the filename stub

### Reopen saved object

In [None]:
# data = pyllelic.GenomicPositionData.from_pickle("test_data.pickle")

## Data Analysis

### View raw data of methylation percentage per read

In [None]:
df = data.individual_data

### Statistical Tests for Normality

In [None]:
summ = data.summarize_allelic_data()

In [None]:
summ.head()

In [None]:
ad_big_diffs = summ.pivot(index="position", columns="cellLine", values="ad_stat")

In [None]:
ad_big_diffs = ad_big_diffs.dropna(axis=1, how="all").count(axis=1).to_frame()

In [None]:
ad_big_diffs.head()

### Raw QUMA results

In [None]:
data.quma_results[data.means.index[0]].values.head()

In [None]:
data.individual_data.head()

## Visualizing Data

### Histograms of reads at a cell line and genomic position

In [None]:
CELL = data.means.index[1]
POS = data.means.columns[5]
data.histogram(CELL, POS)

### Heatmap of mean methylation values

In [None]:
data.heatmap(min_values=1, width=600, height=800, data_type="means")

### Bar chart of significant methylation differences

In [None]:
data.sig_methylation_differences()

### Bar graphs of individual reads

In [None]:
data.reads_graph()

### Find values with a methylation difference above a threshold

In [None]:
import plotly.express as px

In [None]:
import pandas as pd

In [None]:
DIFF_THRESHOLD = 0.001

In [None]:
large_diffs = data.diffs[(data.diffs >= DIFF_THRESHOLD).any(axis=1)].dropna(
    how="all", axis=1
)
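
To make the filter above concrete, here is a small sketch on a made-up table of differences (rows are cell lines, columns are positions). All values, labels, and the threshold are invented for illustration. It keeps rows where at least one difference meets the threshold, then drops columns that are empty in the surviving rows.

```python
import pandas as pd

nan = float("nan")

# Toy stand-in for data.diffs: rows are cell lines, columns are positions
diffs = pd.DataFrame(
    {
        "1293100": [0.0, 0.2, nan],
        "1293200": [0.0, 0.0, nan],
        "1293300": [0.3, nan, nan],
        "1293400": [nan, nan, 0.01],
    },
    index=["CELL_A", "CELL_B", "CELL_C"],
)

THRESHOLD = 0.1  # hypothetical value for this toy table

# Keep rows with at least one difference at or above the threshold
keep_rows = (diffs >= THRESHOLD).any(axis=1)

# Drop positions that have no data left among the kept rows
large = diffs[keep_rows].dropna(how="all", axis=1)

print(large)
```

Here `CELL_C` is dropped because its only difference falls below the threshold, and position `1293400` disappears because no remaining row has a value there. Comparisons against `NaN` evaluate to `False`, so missing values never cause a row to be kept.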

In [None]:
large_diffs.head()

In [None]:
interesting = {}
for index, row in large_diffs.iterrows():
    if row.loc[(row != 0)].dropna().any():
        interesting[index] = row.loc[(row != 0)].dropna()

In [None]:
big_diffs = pd.DataFrame.from_dict(interesting)

In [None]:
big_diffs.head()

In [None]:
big_diffs_counts = big_diffs.dropna(axis=1, how="all").count(axis=1).to_frame()

In [None]:
fig = px.bar(big_diffs_counts, template="seaborn")
fig.update_layout(xaxis_type="linear", showlegend=False, barmode="stack")
fig.update_xaxes(
    tickformat="r", tickangle=45, nticks=40, title="Position", range=[1292981, 1295979]
)
fig.update_yaxes(title="# of large differences")
fig.update_traces(width=50)
fig.show()

### Plotting read methylation distributions

In [None]:
df = data.individual_data.copy()

In [None]:
df.head()

In [None]:
def to_1D(series):
    """From https://towardsdatascience.com/dealing-with-list-values-in-pandas-dataframes-a177e534f173"""
    if isinstance(series, pd.Series):
        series = series.dropna()
        return pd.Series([x for _list in series for x in _list])

In [None]:
POS = data.means.columns[5]

In [None]:
to_1D(df.loc[:,POS]).value_counts()

In [None]:
to_1D(df.loc[:,POS])

In [None]:
to_1D(df.loc[:,POS]).mode()

#### Distribution at one position (all cell lines pooled)

In [None]:
px.bar(to_1D(df.loc[:,POS]).value_counts(normalize=True))

#### Distribution across a cell line

In [None]:
CELL = data.means.index[1]

In [None]:
df2 = df.loc[CELL]

In [None]:
# df2.head()

In [None]:
df2.explode().value_counts()

In [None]:
px.violin(df2.explode())

#### Distribution across all cell lines

In [None]:
all_values = []

In [None]:
for each in df.index:
    # Flatten this cell line's per-position read lists into one Series
    temp = pd.Series(df.loc[each].explode())
    all_values.append(temp)

In [None]:
flat_list = [item for sublist in all_values for item in sublist]

In [None]:
flat_series = pd.Series(flat_list)

In [None]:
flat_series.value_counts()

In [None]:
px.violin(flat_series)

### Identify large differences

In [None]:
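def find_big_diffs(df, min_diff: float):
    """Map each position to |mean - mode| of its pooled reads when the gap exceeds min_diff."""
    out = {}
    for each in df.columns:
        # Pool the per-read values from every cell line at this position
        values = to_1D(df[each].dropna())
        mean = values.mean()
        mode = values.mode()
        if not mode.empty:
            diff = abs(mean - mode.values[0])
            if diff > min_diff:
                out[each] = diff

    # Implicitly returns None when no position exceeds min_diff
    if out:
        return out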

In [None]:
big_ones = find_big_diffs(df, 0.1)
big_ones
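
One way to read the mean-versus-mode gap, offered here as an interpretation rather than something the notebook states: when reads at a position are split between fully methylated and unmethylated, the mean lands between the two groups while the mode sits on one of them, so the gap is large. The sketch below uses invented per-read methylation fractions and only the standard library.

```python
from statistics import mean, mode


def mean_mode_gap(read_fractions):
    """Absolute difference between the mean and the most common value."""
    return abs(mean(read_fractions) - mode(read_fractions))


# Every read half methylated: mean and mode coincide
uniform_reads = [0.5] * 10

# Reads split between fully methylated and unmethylated
split_reads = [1.0] * 6 + [0.0] * 4

print(round(mean_mode_gap(uniform_reads), 3))  # 0.0
print(round(mean_mode_gap(split_reads), 3))  # 0.4
```

With the `min_diff` of 0.1 used above, only the split set of reads would be flagged.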