# Filter for cells that have at least 200 reads mapping to the mitochondrial genome and at least one read covering the variants 3010G>A and 9698T>C. The paper reports that 1,077 cells were found (see below), so this notebook tries to reproduce that number

# Similar to the paper:
For the AML datasets previously generated by 10X Genomics (Zheng et al., 2017b), cells from two patients (AML027 and AML035) were analyzed for mitochondrial genotypes. Aligned and processed .bam files were downloaded from the 10X website (https://support.10xgenomics.com/single-cell-gene-expression/datasets/) and further processed using custom Python scripts. Cell barcodes associated with at least 200 reads uniquely aligning to the mitochondrial genome were considered for downstream analysis. Barcodes were further filtered by requiring coverage by at least one read at two specific variants at mtDNA positions 3010 and 9698. We note that we did not observe a barcode that contained a read to support both alternate alleles (3010G > A and 9698T > C). We determined that 4 out of 1,077 cells were derived from the recipient (Figure 7M), a higher estimate than in the previously reported analysis performed with nuclear genome variants (reported exactly 0%) (Zheng et al., 2017b), though these four cells were not included in the published analysis as they did not pass the author’s barcode/transcriptome filters. We did not observe a well-covered set of variants separating the donor/recipient pair in the AML027 dataset, and did not further analyze it for mutations but only for determining well-covered barcodes (Figures S7G and S7H).

## Steps: 
1. Download the AML035 BAM and index files.
2. Extract the mitochondrial reads and index the resulting BAM.
3. Get the cell barcodes (corrected CB and uncorrected CR tags) from the mitochondrial BAM.
4. Create a text file of the reads covering each of the two variant positions.
5. Loop through those files and build a per-barcode table that counts how many times each nucleotide (reference or alternate) is seen at each position.
6. Keep only barcodes with at least 200 mitochondrial reads.



## Load packages and set parameters

In [1]:
import glob
import os
import pandas as pd
from tqdm import tqdm
import numpy as np
import pysam
import time
from collections import defaultdict
import pickle
from itertools import product
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
from bam_barcodes_function import extract_barcode_info

In [3]:
ORIG = "data/aml035_post_transplant_possorted_genome_bam.bam"
BAM = "data/aml035_post_transplant_possorted_genome_bam.MT.bam"
BARCODE_INFO = "data/barcode_data_aml035_post_transplant_possorted_genome_bam.MT.p"

GENOME = "/data2/genome/human_GRCh38/cellranger/refdata-cellranger-GRCh38-3.0.0/fasta/genome.fa"
NAME = 'aml_035'

In [4]:
nucs = ["A", "T","C", "G"]
variants = ["3010", "9698"]


In [5]:
if not os.path.exists("data"):
    os.mkdir("data")

## 1. Download files


In [None]:
# f-strings are needed so that {ORIG} is replaced by the output path
cmd = f"wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/aml035_post_transplant/aml035_post_transplant_possorted_genome_bam.bam -O {ORIG}"
print(cmd)
os.system(cmd)

cmd = f"wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/aml035_post_transplant/aml035_post_transplant_possorted_genome_bam_index.bam.bai -O {ORIG}.bai"
print(cmd)
os.system(cmd)
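
`os.system` returns the exit status but does not stop the notebook when a download or a samtools call fails. As a minimal sketch (the helper name `run` is an assumption, not part of the original notebook), the same print-then-execute pattern can be wrapped so that a non-zero exit status raises an exception:

```python
import subprocess


def run(cmd: str) -> None:
    """Echo a shell command, run it, and raise if it exits non-zero."""
    print(cmd)
    # shell=True keeps redirections such as "> out.bam" working;
    # check=True raises subprocess.CalledProcessError on failure.
    subprocess.run(cmd, shell=True, check=True)
```

With this helper, a cell would call `run(cmd)` in place of the `print(cmd)` / `os.system(cmd)` pair.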

## 2. Extract mitochondria reads

In [6]:
cmd = f"samtools view {ORIG} -q 30 MT -b > {BAM}"
print(cmd)
os.system(cmd)

#index file
cmd = f"samtools index {BAM}"
print(cmd)
os.system(cmd)

samtools view data/aml035_post_transplant_possorted_genome_bam.bam -q 30 MT -b > data/aml035_post_transplant_possorted_genome_bam.MT.bam
samtools index data/aml035_post_transplant_possorted_genome_bam.MT.bam


0

## 3. Get barcode info from BAM files

In [7]:
extract_barcode_info(BAM, BARCODE_INFO)

1475448it [00:17, 86474.29it/s]


(defaultdict(int,
             {'AATGCGTGTTCTCA': 553,
              'ACGGATTGATCGAC': 7,
              'ACGTGATGTGCTTT': 14,
              'ACTCTCCTATTCCT': 783,
              'ATGGGTACTATGGC': 67,
              'ATTGGTCTACCGAT': 55,
              'ATTTGCACTCGCAA': 1118,
              'CATGTTACCTAAGC': 35,
              'CCAAGTGAGGTACT': 14,
              'CGCAGGACGGTAAA': 739,
              'GAGCTCCTGGAACG': 92,
              'GAGGTACTCCGATA': 453,
              'GAGTCTGAGATACC': 158,
              'GCGCATCTAACAGA': 12,
              'TAGGCTGATATCTC': 298,
              'TATGGGTGGATGAA': 476,
              'TCTAGTTGTGACTG': 60,
              'TCTCAAACCTAGCA': 515,
              'AAGGCTACAACGAA': 334,
              'ACCGCGGATGCTCC': 5,
              'GGCTCACTCTCGAA': 1118,
              'CCCTTACTCATTCT': 2,
              'GAGGGCCGCTGTAG': 1,
              'CTCCTACTGTTGTG': 14,
              'GAGCATAGGAGGGT': 1,
              'GCAGCCGATACTGG': 7,
              'TACGACGAAGTGTC': 347,
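
The source of `extract_barcode_info` is not shown in this notebook. The sketch below is therefore not that function; it only illustrates, under stated assumptions, how reads per barcode could be tallied from the `CB` (corrected) or `CR` (uncorrected) tags. The names `count_barcodes` and `count_barcodes_in_bam` are hypothetical.

```python
from collections import Counter


def count_barcodes(reads, tag="CB"):
    """Count reads per barcode; reads that lack the tag are skipped."""
    counts = Counter()
    for read in reads:
        if read.has_tag(tag):
            counts[read.get_tag(tag)] += 1
    return counts


def count_barcodes_in_bam(bam_path, tag="CB"):
    """Tally barcodes over every read in a BAM file."""
    import pysam  # imported lazily so count_barcodes can be tried without pysam

    with pysam.AlignmentFile(bam_path, "rb") as bam:
        return count_barcodes(bam.fetch(until_eof=True), tag)
```

Calling `count_barcodes_in_bam(BAM)` would give a mapping comparable in shape to the dictionary printed above, although the exact numbers depend on how the original function treats each read.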
  

### Load barcode data

In [8]:
[CR_read_number,CB_read_number,BC_read_number, barcodes, corrected_barcodes, barcode_pairs] = pickle.load(open(BARCODE_INFO,"rb"))

In [9]:
print('Number of CB (corrected) barcodes {}'.format(len(CB_read_number)))
print('Number of CR (uncorrected) barcodes {}'.format(len(CR_read_number)))
print('Number of BC (sample index) barcodes {}'.format(len(BC_read_number)))
BC_read_number

Number of CB (corrected) barcodes 44018
Number of CR (uncorrected) barcodes 172823
Number of BC (sample index) barcodes 12


defaultdict(int,
            {'CGCAGGAG': 122082,
             'GCACCAGT': 123264,
             'ATACTGAG': 130876,
             'GATGCCTC': 147658,
             'GAATACTG': 100772,
             'TAGTACCA': 115213,
             'ATTGTTTC': 100875,
             'CGCGTGCA': 152216,
             'CGGAGACT': 145618,
             'TCCTATGA': 133150,
             'ATTCGTGC': 131056,
             'TCGACAAT': 72668})

## 4. Filter for cell barcodes with at least 200 reads

In [10]:
count = 0
CB_filter = set()
for key in CB_read_number:
    if CB_read_number[key] >= 200:
        CB_filter.add(key)
        count += 1
print(count)

1426
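
The same threshold can be written as a set comprehension, which avoids the separate counter. This is a sketch only; `barcodes_with_min_reads` and `MIN_READS` are names introduced here for illustration.

```python
MIN_READS = 200  # threshold quoted from the paper's methods


def barcodes_with_min_reads(read_counts, min_reads=MIN_READS):
    """Return the set of barcodes with at least min_reads reads."""
    return {bc for bc, n in read_counts.items() if n >= min_reads}
```

For example, `len(barcodes_with_min_reads(CB_read_number))` would report the number of barcodes that pass the filter.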


In [11]:
nucs = ["A", "T","C", "G"]
variants = ["3010", "9698"]

CB_df = pd.DataFrame(index=CB_filter, columns=["Number of Reads"]+list(map(lambda x: "".join(x), product(variants,nucs))), dtype=int)
CB_df.loc[:,:] = 0

for i in CB_filter:
    CB_df.loc[i, "Number of Reads"] = CB_read_number[i]
CB_df

Unnamed: 0,Number of Reads,3010A,3010T,3010C,3010G,9698A,9698T,9698C,9698G
ACCACCTGTTCCCG-3,256.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
CATCGCTGTGGTAC-3,239.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
CACTTAACTTATCC-1,208.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
CTGGCACTTCTATC-3,335.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
TATTGCTGTGCTGA-1,430.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...
CTTAGACTTGAGCT-1,613.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GGACCTCTTAAGCC-2,211.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
TTTCACGACTAGAC-1,351.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ACAAGAGACTCGCT-3,579.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Extract reads covering the variant positions 3010 and 9698


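### samtools fillmd -e compares each read to the reference and replaces every matching base with "=". The awk step then looks up the base at the variant position and, if it is not "=", appends it to the end of the line as V:letter. Only reads that do not show the reference base at that position are therefore written to the output file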

In [12]:
cmd = f"""samtools view -b {BAM} MT:3010-3010 | samtools fillmd -e - {GENOME} | grep -v "^@"| awk -v pos="3010" 'BEGIN {{OFS = FS = "\t" }} ; {{n=split($10,a,"") ; if(a[(pos-$4)+1] != "=" ) print $0, "V:" a[(pos-$4)+1]}}' > data/{NAME}_3010_reads.txt"""
cmd
os.system(cmd)

0

In [13]:
cmd = f"""samtools view -b {BAM} MT:9698-9698 | samtools fillmd -e - {GENOME} | grep -v "^@"| awk -v pos="9698" 'BEGIN {{OFS = FS = "\t" }} ; {{n=split($10,a,"") ; if(a[(pos-$4)+1] != "=" ) print $0, "V:" a[(pos-$4)+1]}}' > data/{NAME}_9698_reads.txt"""
cmd
os.system(cmd)

0
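
The awk commands above index the read sequence at `pos - $4 + 1`, which matches the reference position only when the alignment contains no soft clips, insertions, deletions, or skipped regions before that position. As a hedged sketch of a CIGAR-aware lookup (the function `base_at` is an assumption, not part of the original pipeline):

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def base_at(ref_pos, read_start, cigar, seq):
    """Return the read base aligned to 1-based ref_pos, or None if not covered."""
    ref = read_start  # next reference position to be consumed (1-based)
    idx = 0           # next index into seq (0-based)
    for length, op in _CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=X":    # consumes both read and reference
            if ref <= ref_pos < ref + length:
                return seq[idx + (ref_pos - ref)]
            ref += length
            idx += length
        elif op in "IS":   # consumes read only
            idx += length
        elif op in "DN":   # consumes reference only
            if ref <= ref_pos < ref + length:
                return None  # position is deleted or skipped in this read
            ref += length
        # H and P consume neither read nor reference
    return None
```

For a read with CIGAR `3S7M` starting at 3008, the plain offset would land on a soft-clipped base, whereas `base_at` skips the three clipped bases first. The function works the same on `fillmd -e` output, where it returns `=` for a base that matches the reference.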

In [27]:
nucs = ["A", "T","C", "G"]
variants = ["3010", "9698"]
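# Base credited to a barcode when the V: field is empty.
# Note: 9698T>C implies that T is the reference at 9698, so check "C" against GENOME.
ref_var = {"3010":"G", "9698":"C"}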

CB_df = pd.DataFrame(index=CB_filter, columns=["Number of Reads"]+list(map(lambda x: "".join(x), product(variants,nucs))), dtype=int)
CB_df.loc[:,:] = 0

for i in CB_filter:
    CB_df.loc[i, "Number of Reads"] = CB_read_number[i]
CB_df

rm_slash=False
for v in variants:
    print(f"data/{NAME}_{v}_reads.txt")
    with open(f"data/{NAME}_{v}_reads.txt", "r") as f:
        lines = list(map(lambda x: x.strip(), f.readlines()))
    #print(lines)
    for i in lines:
        if "CB:Z:" in i:                
            if rm_slash:
                curr_bc = i.split("CB:Z:")[1].split("\t")[0].split("-")[0]
            else:
                curr_bc = i.split("CB:Z:")[1].split("\t")[0]
            
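            # Count the base reported in the trailing V: field for this barcode
            if curr_bc in CB_df.index:
                if i[-1] == ":":  # empty V: field (no base found at the offset); counted as the reference
                    CB_df.loc[curr_bc, v+ref_var[v]] += 1
                elif i[-1] in nucs:  # skip anything that is not A/T/C/G (e.g. N)
                    CB_df.loc[curr_bc, v+i[-1]] += 1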
                    
CB_df

data/aml_035_3010_reads.txt
data/aml_035_9698_reads.txt


Unnamed: 0,Number of Reads,3010A,3010T,3010C,3010G,9698A,9698T,9698C,9698G
ACCACCTGTTCCCG-3,256.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0
CATCGCTGTGGTAC-3,239.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0
CACTTAACTTATCC-1,208.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0
CTGGCACTTCTATC-3,335.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0
TATTGCTGTGCTGA-1,430.0,0.0,0.0,0.0,0.0,0.0,0.0,20.0,0.0
...,...,...,...,...,...,...,...,...,...
CTTAGACTTGAGCT-1,613.0,0.0,0.0,0.0,0.0,0.0,0.0,17.0,0.0
GGACCTCTTAAGCC-2,211.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
TTTCACGACTAGAC-1,351.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,0.0
ACAAGAGACTCGCT-3,579.0,0.0,0.0,0.0,0.0,0.0,0.0,16.0,0.0
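
The loop above finds the barcode by searching the whole line for `CB:Z:` and reads the base from the last character. A slightly stricter sketch, assuming the tab-separated layout written by the awk step (the function `parse_variant_line` is hypothetical), splits the line into fields first:

```python
def parse_variant_line(line):
    """Return (barcode, base) from one line written by the awk step.

    base is "" when the awk lookup fell outside the read sequence.
    Returns None if the line has no CB tag or no trailing V: field.
    """
    fields = line.rstrip("\n").split("\t")
    # Only accept CB:Z: at the start of a field, not inside another field
    barcode = next((f[5:] for f in fields if f.startswith("CB:Z:")), None)
    last = fields[-1]
    if barcode is None or not last.startswith("V:"):
        return None
    return barcode, last[2:]
```

Returning the empty string separately from a real base leaves the decision of how to treat uncovered offsets to the caller.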


In [26]:
#3010G->A and 9698T>C

## Cells with different variants

In [28]:
CB_df[(CB_df["9698T"] > 0) & (CB_df["9698C"] > 0)]

Unnamed: 0,Number of Reads,3010A,3010T,3010C,3010G,9698A,9698T,9698C,9698G
CATCATACCCAACA-3,464.0,0.0,0.0,0.0,0.0,0.0,1.0,14.0,0.0


In [29]:
CB_df[(CB_df["9698T"] > 0)]

Unnamed: 0,Number of Reads,3010A,3010T,3010C,3010G,9698A,9698T,9698C,9698G
CATCATACCCAACA-3,464.0,0.0,0.0,0.0,0.0,0.0,1.0,14.0,0.0


In [30]:
CB_df[(CB_df["9698C"] > 0)]

Unnamed: 0,Number of Reads,3010A,3010T,3010C,3010G,9698A,9698T,9698C,9698G
ACCACCTGTTCCCG-3,256.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0
CATCGCTGTGGTAC-3,239.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0
CACTTAACTTATCC-1,208.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0
CTGGCACTTCTATC-3,335.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0
TATTGCTGTGCTGA-1,430.0,0.0,0.0,0.0,0.0,0.0,0.0,20.0,0.0
...,...,...,...,...,...,...,...,...,...
CTTAGACTTGAGCT-1,613.0,0.0,0.0,0.0,0.0,0.0,0.0,17.0,0.0
GGACCTCTTAAGCC-2,211.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
TTTCACGACTAGAC-1,351.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,0.0
ACAAGAGACTCGCT-3,579.0,0.0,0.0,0.0,0.0,0.0,0.0,16.0,0.0


In [31]:
CB_df[(CB_df["9698C"] > 0) & (CB_df["3010G"] > 0)]

Unnamed: 0,Number of Reads,3010A,3010T,3010C,3010G,9698A,9698T,9698C,9698G
ATGAAACTCCAAGT-2,579.0,0.0,0.0,0.0,7.0,0.0,0.0,9.0,0.0
CTTAGGGACTCTAT-2,753.0,0.0,0.0,0.0,6.0,0.0,0.0,23.0,0.0
GGCCGAACTATCGG-1,287.0,0.0,0.0,0.0,3.0,0.0,0.0,9.0,0.0
TAGGTCGAAGGGTG-3,1434.0,0.0,0.0,0.0,3.0,0.0,0.0,39.0,0.0
CTAACGGAAGAAGT-2,398.0,0.0,0.0,0.0,3.0,0.0,0.0,5.0,0.0
TCCCAGACATCTTC-2,685.0,0.0,0.0,0.0,3.0,0.0,0.0,12.0,0.0
TGACCAGATGGTGT-2,201.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
ACCCAAGATGTCAG-3,725.0,0.0,0.0,0.0,1.0,0.0,0.0,16.0,0.0
TCGAGCCTGTTGAC-1,677.0,0.0,0.0,0.0,3.0,0.0,0.0,4.0,0.0
ACGTTTACACCATG-2,240.0,1.0,0.0,0.0,1.0,0.0,0.0,6.0,0.0


In [32]:
CB_df[(CB_df["9698C"] > 0) & (CB_df["3010A"] > 0)]

Unnamed: 0,Number of Reads,3010A,3010T,3010C,3010G,9698A,9698T,9698C,9698G
AGTACGTGAAGATG-2,234.0,1.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
GCGAAGGAGGTAGG-1,376.0,1.0,0.0,0.0,0.0,0.0,0.0,21.0,0.0
TAACAATGCTACCC-2,1237.0,2.0,0.0,0.0,0.0,0.0,0.0,22.0,0.0
GCCGAGTGTGCATG-2,999.0,2.0,0.0,0.0,0.0,0.0,0.0,15.0,0.0
AATGATACCTGTGA-3,1009.0,2.0,0.0,0.0,0.0,0.0,0.0,25.0,0.0
CTAGTTTGAGCGTT-2,269.0,1.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0
GATATCCTCCAACA-2,226.0,4.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0
GGGAACGATCGACA-2,283.0,1.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0
CTCGAAGACCTACC-1,455.0,4.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0
ACGTTTACACCATG-2,240.0,1.0,0.0,0.0,1.0,0.0,0.0,6.0,0.0


In [33]:
CB_df[((CB_df["9698C"] == 0) & (CB_df["9698T"] > 0)) | ((CB_df["3010G"] == 0) & (CB_df["3010A"] > 0)) |
     ((CB_df["9698C"] > 0) & (CB_df["9698T"] == 0)) | ((CB_df["3010G"] > 0) & (CB_df["3010A"] == 0))]

Unnamed: 0,Number of Reads,3010A,3010T,3010C,3010G,9698A,9698T,9698C,9698G
ACCACCTGTTCCCG-3,256.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0
CATCGCTGTGGTAC-3,239.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0
CACTTAACTTATCC-1,208.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0
CTGGCACTTCTATC-3,335.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0
TATTGCTGTGCTGA-1,430.0,0.0,0.0,0.0,0.0,0.0,0.0,20.0,0.0
...,...,...,...,...,...,...,...,...,...
CTTAGACTTGAGCT-1,613.0,0.0,0.0,0.0,0.0,0.0,0.0,17.0,0.0
GGACCTCTTAAGCC-2,211.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
TTTCACGACTAGAC-1,351.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,0.0
ACAAGAGACTCGCT-3,579.0,0.0,0.0,0.0,0.0,0.0,0.0,16.0,0.0


In [34]:
CB_df[((CB_df["9698C"] > 0) | (CB_df["9698T"] > 0)) &  ((CB_df["3010G"] > 0) | (CB_df["3010A"] > 0))]

Unnamed: 0,Number of Reads,3010A,3010T,3010C,3010G,9698A,9698T,9698C,9698G
ATGAAACTCCAAGT-2,579.0,0.0,0.0,0.0,7.0,0.0,0.0,9.0,0.0
AGTACGTGAAGATG-2,234.0,1.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
GCGAAGGAGGTAGG-1,376.0,1.0,0.0,0.0,0.0,0.0,0.0,21.0,0.0
TAACAATGCTACCC-2,1237.0,2.0,0.0,0.0,0.0,0.0,0.0,22.0,0.0
CTTAGGGACTCTAT-2,753.0,0.0,0.0,0.0,6.0,0.0,0.0,23.0,0.0
GCCGAGTGTGCATG-2,999.0,2.0,0.0,0.0,0.0,0.0,0.0,15.0,0.0
GGCCGAACTATCGG-1,287.0,0.0,0.0,0.0,3.0,0.0,0.0,9.0,0.0
TAGGTCGAAGGGTG-3,1434.0,0.0,0.0,0.0,3.0,0.0,0.0,39.0,0.0
AATGATACCTGTGA-3,1009.0,2.0,0.0,0.0,0.0,0.0,0.0,25.0,0.0
CTAACGGAAGAAGT-2,398.0,0.0,0.0,0.0,3.0,0.0,0.0,5.0,0.0


In [36]:
len(CB_df[((CB_df["9698C"] > 0) | (CB_df["9698T"] > 0)) &  ((CB_df["3010G"] > 0) | (CB_df["3010A"] > 0))])

27

In [37]:
CB_df[((CB_df["9698C"] > 0) | (CB_df["9698T"] > 0) | (CB_df["9698A"] > 0) | (CB_df["9698G"] > 0)) & 
      ((CB_df["3010C"] > 0) | (CB_df["3010T"] > 0) | (CB_df["3010A"] > 0) | (CB_df["3010G"] > 0))]

Unnamed: 0,Number of Reads,3010A,3010T,3010C,3010G,9698A,9698T,9698C,9698G
ATGAAACTCCAAGT-2,579.0,0.0,0.0,0.0,7.0,0.0,0.0,9.0,0.0
ACCTCCGAGGAGGT-3,451.0,0.0,0.0,1.0,0.0,0.0,0.0,11.0,0.0
AGTTCTACTCAAGC-2,736.0,0.0,0.0,2.0,0.0,0.0,0.0,19.0,0.0
CCCGAACTCCAAGT-2,387.0,0.0,5.0,0.0,0.0,0.0,0.0,7.0,0.0
TACGCCACGTAAAG-1,554.0,0.0,0.0,1.0,0.0,0.0,0.0,8.0,0.0
AATGTCCTGTCTTT-1,509.0,0.0,0.0,1.0,0.0,0.0,0.0,22.0,0.0
AGTACGTGAAGATG-2,234.0,1.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
TAATGAACATTGGC-3,448.0,0.0,0.0,1.0,0.0,0.0,0.0,24.0,0.0
GCGAAGGAGGTAGG-1,376.0,1.0,0.0,0.0,0.0,0.0,0.0,21.0,0.0
ACCCAGCTAACGTC-1,260.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,0.0


In [38]:
len(CB_df[((CB_df["9698C"] > 0) | (CB_df["9698T"] > 0) | (CB_df["9698A"] > 0) | (CB_df["9698G"] > 0)) & 
      ((CB_df["3010C"] > 0) | (CB_df["3010T"] > 0) | (CB_df["3010A"] > 0) | (CB_df["3010G"] > 0))])

60
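
The queries above distinguish barcodes covered at both positions from barcodes covered at either one. A small sketch that reports both counts in one pass may make the comparison with the paper's number easier; it works on a plain dictionary of per-barcode counts, and the name `summarize_coverage` is an assumption introduced here.

```python
def summarize_coverage(rows, positions=("3010", "9698"), nucs="ATCG"):
    """Count barcodes covered at any position and at all positions.

    rows maps barcode -> {"3010A": n, "3010T": n, ...}; missing keys count as 0.
    """
    any_pos = both = 0
    for counts in rows.values():
        # One boolean per position: is there at least one read with any base?
        covered = [sum(counts.get(p + n, 0) for n in nucs) > 0 for p in positions]
        any_pos += any(covered)
        both += all(covered)
    return {"any": any_pos, "both": both}
```

A dictionary of this shape can be obtained from the table with `CB_df.drop(columns="Number of Reads").to_dict(orient="index")`.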