# Example Pyllelic Use-Case Notebook

## Background

This notebook demonstrates how to import and use `pyllelic` in a Jupyter environment.

See https://github.com/Paradoxdruid/pyllelic for further details.

## Pre-setup

### Obtaining fastq data

We can download RRBS (reduced representation bisulfite sequencing) data from the ENCODE project:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeHaibMethylRrbs/

Those files are in unaligned FASTQ format, so we first need to align them to a human reference genome.

### Aligning reads (using command line tools)

To align reads, we'll use bowtie2 and samtools.  **TODO:** Rewrite the below to use `process.py` commands.

First, we need to download a genomic index sequence: http://hgdownload.soe.ucsc.edu/goldenPath/hg19

Typical command:
```shell
bowtie2 -p 8 -x {bowtie_index_filename_without_suffix} -U {fastq_file_name} | samtools view -bS - > out.bam
```

Notes:
* `-p` sets the number of threads (processor cores) to use; adjust for your system
* Instead of `out.bam`, use a filename that encodes cell line and tissue. Our convention is: `fh_CELLLINE_TISSUE.TERT.bam`

Then, we need to sort the resultant bam file.

Typical command:
```shell
samtools sort -o sorted.bam out.bam
```

Finally, we need to build an index file (**pyllelic** can also build it if it is missing).

Typical command:
```shell
samtools index sorted.bam
```

Now the sorted file (again, renamed to capture cell-line and tissue information) is ready to be placed in the `test` folder for analysis by pyllelic!
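The three command-line steps above (align, sort, index) can be sketched as a small Python helper that only assembles the command strings. This is a minimal illustration, not part of pyllelic: the function name `build_alignment_commands`, the index prefix, the FASTQ name, and the intermediate `.unsorted.bam` filename are all hypothetical.

```python
import shlex


def build_alignment_commands(index_prefix, fastq, out_stem, threads=8):
    """Return the align, sort, and index shell commands as strings.

    Hypothetical helper: it mirrors the bowtie2/samtools commands shown
    above but does not run anything itself.
    """
    q = shlex.quote  # protect paths that contain spaces or shell characters
    unsorted_bam = f"{out_stem}.unsorted.bam"  # hypothetical intermediate name
    sorted_bam = f"{out_stem}.bam"
    align = (
        f"bowtie2 -p {threads} -x {q(index_prefix)} -U {q(fastq)} "
        f"| samtools view -bS - > {q(unsorted_bam)}"
    )
    sort = f"samtools sort -o {q(sorted_bam)} {q(unsorted_bam)}"
    index = f"samtools index {q(sorted_bam)}"
    return [align, sort, index]


commands = build_alignment_commands(
    index_prefix="hg19_index",           # hypothetical bowtie2 index prefix
    fastq="reads.fastq.gz",              # hypothetical FASTQ file
    out_stem="fh_CELLLINE_TISSUE.TERT",  # follows the naming convention above
)
for command in commands:
    print(command)
    # To actually run a step:
    # subprocess.run(command, shell=True, check=True)
```

Building the strings separately from running them makes it easy to review the exact commands before committing to the long alignment step.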

### Aligning reads (using process.py)

In [None]:
# Processing imports
from pathlib import Path

import pyllelic  # needed for the processing calls below

In [None]:
# Set up file paths
index = Path(
    "/home/andrew/allellic/hg19.p13.plusMT.no_alt_analysis_set//hg19.p13.plusMT.no_alt_analysis_set"
)
fastq = Path("/home/andrew/allellic/wgEncodeHaibMethylRrbsU87HaibRawDataRep1.fastq.gz")

**WARNING:** The next command is processor, RAM, and time intensive, and only needs to be run once!

In [None]:
# pyllelic.bowtie2_fastq_to_bam(index=index, fastq=fastq)

Next, we need to sort and index the bam file using samtools functions.

In [None]:
bamfile = Path("/home/andrew/allellic/wgEncodeHaibMethylRrbsU87HaibRawDataRep1.bam")
pyllelic.samtools_sort(bamfile)

In [None]:
sorted_bam = Path("")  # path to the sorted bam file produced above
pyllelic.samtools_index(sorted_bam)

## Set-up

In [1]:
import pyllelic

In [2]:
# set up your disk location:
# base_path should be the directory we'll do our work in
# make a sub-directory under base_path with a folder named "test"
# and put the .bam and .bai files in "test"

pyllelic.set_up_env_variables(
    base_path="/Users/abonham/documents/test_allelic/",
    prom_file="TERT-promoter-genomic-sequence.txt",
    prom_start="1293000",
    prom_end="1296000",
    chrom="5",
    offset=1298163,
)

# pyllelic.set_up_env_variables(
#     base_path="/home/andrew/allellic/",
#     prom_file="TERT-promoter-genomic-sequence.txt",
#     prom_start="1293000",
#     prom_end="1296000",
#     chrom="chr5",
# )

## Main Parsing Functions

In [3]:
files_set = pyllelic.make_list_of_bam_files()  # finds bam files

In [4]:
# Uncomment for debugging:
files_set

['fh_SW1710_URINARY_TRACT.TERT.bam', 'fh_NCIH196_LUNG.TERT.bam']

In [5]:
# indexes the bam files and creates bam_output folders/files
positions = pyllelic.index_and_fetch(files_set)

  0%|          | 0/2 [00:00<?, ?it/s]

In [6]:
# Uncomment for debugging:
positions

['1293104',
 '1293139',
 '1293561',
 '1293588',
 '1293690',
 '1293730',
 '1294004',
 '1294031',
 '1294196',
 '1294223',
 '1294235',
 '1294262',
 '1294316',
 '1294369',
 '1294419',
 '1294446',
 '1294812',
 '1294872',
 '1294945',
 '1294972',
 '1295089',
 '1295116',
 '1295246',
 '1295320',
 '1295365',
 '1295393',
 '1295430',
 '1295590',
 '1295680',
 '1295743',
 '1295770',
 '1295876',
 '1295903',
 '1295937',
 '1295979']

In [7]:
# Only needs to be run once, generates static files
# pyllelic.genome_parsing()

# Can also take sub-list of directories to process
# pyllelic.genome_parsing([pyllelic.config.bam_directory / "fh_BONHAM_TISSUE.TERT.bam"])

In [8]:
cell_types = pyllelic.extract_cell_types(files_set)

In [9]:
# Uncomment for debugging
cell_types

['SW1710', 'NCIH196']
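As an illustration of how the `fh_CELLLINE_TISSUE.TERT.bam` naming convention relates to the output above, here is a minimal sketch that pulls the cell line out of a filename. It is not pyllelic's implementation of `extract_cell_types`; the helper name `cell_line_from_filename` is hypothetical.

```python
from pathlib import Path


def cell_line_from_filename(filename):
    """Return the CELLLINE part of 'fh_CELLLINE_TISSUE.TERT.bam'.

    Hypothetical helper illustrating the naming convention only.
    """
    parts = Path(filename).name.split("_")
    if len(parts) < 3 or parts[0] != "fh":
        raise ValueError(f"unexpected filename format: {filename}")
    return parts[1]  # second underscore-separated field is the cell line


files = ["fh_SW1710_URINARY_TRACT.TERT.bam", "fh_NCIH196_LUNG.TERT.bam"]
cell_lines = [cell_line_from_filename(name) for name in files]
print(cell_lines)  # ['SW1710', 'NCIH196']
```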

In [10]:
# Set filename to whatever you want
df_list = pyllelic.run_quma_and_compile_list_of_df(
    cell_types,
    "test7.xlsx",
    run_quma=False,  # skips the quma step; set to True to run it
)

In [11]:
# Uncomment for debugging
df_list.keys()

dict_keys(['SW1710', 'NCIH196'])

In [12]:
means = pyllelic.process_means(df_list, positions, files_set)

In [13]:
# Uncomment for debugging
means

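|  | 1293104 | 1293139 | 1293561 | 1293588 | 1293690 | 1293730 | 1294004 | 1294031 | 1294196 | 1294223 | ... | 1295393 | 1295430 | 1295590 | 1295680 | 1295743 | 1295770 | 1295876 | 1295903 | 1295937 | 1295979 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SW1710 |  |  |  |  | 0.8 |  |  | 1 | 1.0 |  | ... |  |  | 0.961538 |  | 0.964286 |  | 1.0 |  |  |  |
| NCIH196 |  |  | 0.0 |  | 0.279874 |  | 0.697674 | 1 | 0.283784 |  | ... |  |  | 0.619469 | 1.0 | 0.475352 |  | 0.901099 | 1.0 |  |  |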


In [14]:
modes = pyllelic.process_modes(df_list, positions, files_set)

In [15]:
# Uncomment for debugging
modes

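|  | 1293104 | 1293139 | 1293561 | 1293588 | 1293690 | 1293730 | 1294004 | 1294031 | 1294196 | 1294223 | ... | 1295393 | 1295430 | 1295590 | 1295680 | 1295743 | 1295770 | 1295876 | 1295903 | 1295937 | 1295979 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SW1710 |  |  |  |  | 1.0 |  |  | 1 | 1 |  | ... |  |  | 1.0 |  | 1.0 |  | 1 |  |  |  |
| NCIH196 |  |  | 0.0 |  | 0.5 |  | 1.0 | 1 | 0 |  | ... |  |  | 0.666667 | 1.0 | 0.25 |  | 1 | 1.0 |  |  |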


In [16]:
diff = pyllelic.find_diffs(means, modes)

In [17]:
# Uncomment for debugging
diff

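|  | 1293104 | 1293139 | 1293561 | 1293588 | 1293690 | 1293730 | 1294004 | 1294031 | 1294196 | 1294223 | ... | 1295393 | 1295430 | 1295590 | 1295680 | 1295743 | 1295770 | 1295876 | 1295903 | 1295937 | 1295979 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SW1710 |  |  |  |  | -0.2 |  |  | 0 | 0.0 |  | ... |  |  | -0.0384615 |  | -0.0357143 |  | 0.0 |  |  |  |
| NCIH196 |  |  | 0.0 |  | -0.220126 |  | -0.302326 | 0 | 0.283784 |  | ... |  |  | -0.0471976 | 0.0 | 0.225352 |  | -0.0989011 | 0.0 |  |  |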
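The values shown above are consistent with each difference being the mean minus the mode at that position (for example, 0.8 − 1.0 = −0.2 for SW1710 at 1293690). The sketch below reproduces that arithmetic with the standard library. The per-read methylation fractions in `reads` are hypothetical, chosen only to give the same mean and mode, and this is not the code pyllelic uses internally.

```python
from statistics import mean, mode

# Hypothetical per-read methylation fractions at a single position
reads = [1.0, 1.0, 1.0, 1.0, 0.0]

read_mean = mean(reads)  # average methylation across reads
read_mode = mode(reads)  # most common methylation value
difference = read_mean - read_mode

print(round(read_mean, 6), read_mode, round(difference, 6))  # 0.8 1.0 -0.2
```

A difference far from zero indicates that the average is being pulled away from the most common value, which is the pattern a mixed population of reads would produce.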


## Write Output to Excel Files

In [None]:
# Set the filename to whatever you want
pyllelic.write_means_modes_diffs(means, modes, diff, "Test8")

## Visualizing Data

In [None]:
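# Read back the diff file; the prefix must match the name passed to
# write_means_modes_diffs above (assumes the "_diff.xlsx" suffix)
final_data = pyllelic.pd.read_excel(
    pyllelic.config.base_directory.joinpath("Test8_diff.xlsx"),
    dtype=str,
    index_col=0,
    engine="openpyxl",
)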

In [None]:
final_data

In [18]:
individual_data = pyllelic.return_individual_data(df_list, positions, files_set)

Position:   0%|          | 0/35 [00:00<?, ?it/s]

Cell Line:   0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
# Uncomment for debugging
individual_data

In [None]:
pyllelic.histogram(individual_data, "SW1710", "1293690")

In [None]:
pyllelic.histogram(individual_data, "NCIH196", "1294004")

In [None]:
final_data.loc["SORTED"]

## Statistical Tests for Normality

In [19]:
pyllelic.summarize_allelic_data(individual_data, diff)

|  | cellLine | position | ad_stat | p_crit | diff | raw |
|---|---|---|---|---|---|---|
| 0 | SW1710 | 1294945 | 3.552949 | 0.943 | 0.157143 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 1 | SW1710 | 1295089 | 3.56016 | 0.954 | -0.333333 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, ... |
| 2 | SW1710 | 1295590 | 8.204686 | 0.978 | -0.038462 | [0.6666666666666666, 0.6666666666666666, 0.666... |
| 3 | SW1710 | 1295743 | 10.434122 | 0.998 | -0.035714 | [0.75, 0.75, 0.75, 0.75, 0.75, 1.0, 1.0, 1.0, ... |
| 4 | NCIH196 | 1293690 | 28.94414 | 1.066 | -0.220126 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 5 | NCIH196 | 1294004 | 3.844062 | 1.012 | -0.302326 | [0.0, 0.0, 0.0, 0.3333333333333333, 0.33333333... |
| 6 | NCIH196 | 1294196 | 11.181318 | 1.041 | 0.283784 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 7 | NCIH196 | 1294316 | 9.396286 | 1.042 | 0.328947 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 8 | NCIH196 | 1294419 | 10.317001 | 1.035 | 0.443182 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 9 | NCIH196 | 1294945 | 7.18854 | 1.053 | -0.192157 | [0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.4, 0.4, 0.4, ... |
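One way to read the summary above is to compare each `ad_stat` with its `p_crit`. The sketch below assumes that `ad_stat` is an Anderson-Darling statistic and `p_crit` is the matching critical value, so that a statistic above the critical value means normality is rejected at that significance level. That interpretation of the column names is an assumption, and the helper `rejects_normality` is hypothetical. The three rows are copied from the summary.

```python
# (cell line, position, ad_stat, p_crit), copied from the summary above
rows = [
    ("SW1710", "1294945", 3.552949, 0.943),
    ("SW1710", "1295089", 3.56016, 0.954),
    ("NCIH196", "1293690", 28.94414, 1.066),
]


def rejects_normality(ad_stat, critical_value):
    """Hypothetical helper: True when the statistic exceeds the critical value."""
    return ad_stat > critical_value


flagged = [
    (cell_line, position)
    for cell_line, position, ad_stat, p_crit in rows
    if rejects_normality(ad_stat, p_crit)
]
print(flagged)
```

Under that assumption, all three rows are flagged as non-normal, which fits a summary that reports only positions whose read-level methylation is unlikely to be normally distributed.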
