## This notebook does the following
- Reads in two files containing filenames 
- Examines the lists of filenames to see which ones don't have a matching pair
- Outputs three lists: 
    - a list of L001 filenames that have a partner for merge
    - a list of L002 filenames that have a partner for merge
    - a list of filenames that don't have a partner & therefore don't need to be merged
    
**Scroll to the bottom of this notebook to see a summary of file numbers**

## Navigate to the directory with the lists

In [1]:
# navigate to parent folder of generated data
# this should be a location on your laptop
%cd ~/Documents/work/lawson_lab/deepcelllineage/mitolin/data/gen/nguyen_nc_2018/

/Users/drb/Documents/work/lawson_lab/deepcelllineage/mitolin/data/gen/nguyen_nc_2018


In [2]:
# show folder names
!ls -1

20190502-KB
20190527-pairr1r2
20190613-fasta-on-laptop
20190613-fastas
20190702-fastas-on-hpc
20190809-fastq2uamgfil
20190809-r1r2lists-i1-rename
20190820-pairlanelists-DRB
20190821-uamgfil2bqsrecal-DRB


In [3]:
# move to folder with lists
%cd ~/Documents/work/lawson_lab/deepcelllineage/mitolin/data/gen/nguyen_nc_2018/20190820-pairlanelists-DRB/

/Users/drb/Documents/work/lawson_lab/deepcelllineage/mitolin/data/gen/nguyen_nc_2018/20190820-pairlanelists-DRB


In [4]:
# check the lists you expect are there
!ls

cellslist.txt          lane2list-paired.txt   unpairedlist.txt
lane1list-paired.txt   lane2list.txt          unpairedlistnolane.txt
lane1list.txt          lanesmgdlist.txt


In [5]:
# view the first 12 lines of text in your list
!head -12 lane1list.txt

filtered-uamerged-aligned-i1-lib001-A01-L001.bam
filtered-uamerged-aligned-i1-lib001-A02-L001.bam
filtered-uamerged-aligned-i1-lib001-A03-L001.bam
filtered-uamerged-aligned-i1-lib001-A04-L001.bam
filtered-uamerged-aligned-i1-lib001-A05-L001.bam
filtered-uamerged-aligned-i1-lib001-A06-L001.bam
filtered-uamerged-aligned-i1-lib001-A07-L001.bam
filtered-uamerged-aligned-i1-lib001-A08-L001.bam
filtered-uamerged-aligned-i1-lib001-A09-L001.bam
filtered-uamerged-aligned-i1-lib001-A10-L001.bam
filtered-uamerged-aligned-i1-lib001-A11-L001.bam
filtered-uamerged-aligned-i1-lib001-A12-L001.bam


In [6]:
# create variable for input files
file1 = 'lane1list.txt'
file2 = 'lane2list.txt'

## Make sets with L001 and L002 files

Read the lines of a file into a set object

In [7]:
set1 = set()
with open(file1) as r1:
    for line in r1:
        set1.add(line)

In [8]:
# view the whole set
set1

{'filtered-uamerged-aligned-i1-lib001-A01-L001.bam\n',
 'filtered-uamerged-aligned-i1-lib001-A02-L001.bam\n',
 'filtered-uamerged-aligned-i1-lib001-A03-L001.bam\n',
 'filtered-uamerged-aligned-i1-lib001-A04-L001.bam\n',
 'filtered-uamerged-aligned-i1-lib001-A05-L001.bam\n',
 'filtered-uamerged-aligned-i1-lib001-A06-L001.bam\n',
 'filtered-uamerged-aligned-i1-lib001-A07-L001.bam\n',
 'filtered-uamerged-aligned-i1-lib001-A08-L001.bam\n',
 'filtered-uamerged-aligned-i1-lib001-A09-L001.bam\n',
 'filtered-uamerged-aligned-i1-lib001-A10-L001.bam\n',
 'filtered-uamerged-aligned-i1-lib001-A11-L001.bam\n',
 'filtered-uamerged-aligned-i1-lib001-A12-L001.bam\n',
 'filtered-uamerged-aligned-i1-lib001-B01-L001.bam\n',
 'filtered-uamerged-aligned-i1-lib001-B02-L001.bam\n',
 'filtered-uamerged-aligned-i1-lib001-B03-L001.bam\n',
 'filtered-uamerged-aligned-i1-lib001-B04-L001.bam\n',
 'filtered-uamerged-aligned-i1-lib001-B05-L001.bam\n',
 'filtered-uamerged-aligned-i1-lib001-B06-L001.bam\n',
 'filtered

In [9]:
# view only part of the set (sorted() already returns a list)
sorted(set1)[:5]

['filtered-uamerged-aligned-i1-lib001-A01-L001.bam\n',
 'filtered-uamerged-aligned-i1-lib001-A02-L001.bam\n',
 'filtered-uamerged-aligned-i1-lib001-A03-L001.bam\n',
 'filtered-uamerged-aligned-i1-lib001-A04-L001.bam\n',
 'filtered-uamerged-aligned-i1-lib001-A05-L001.bam\n']

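Same as above, but slice off the lane suffix: everything from the last "-L" to the end of the line (for example "-L001.bam" plus the trailing newline).

Do this so that the names of the same sample from lane 1 and lane 2 are equal (==) in the downstream comparison.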

In [10]:
set1 = set()
with open(file1) as r1:
    for line in r1:
        set1.add(line[:line.rfind('-L')]) 

In [11]:
list(set1)[:5]

['filtered-uamerged-aligned-i1-lib001-A05',
 'filtered-uamerged-aligned-i1-lib002-B03',
 'filtered-uamerged-aligned-i1-lib001-A11',
 'filtered-uamerged-aligned-i1-lib002-H07',
 'filtered-uamerged-aligned-i1-lib002-G07']

Repeat for lane 2 (L002)

This could be improved by writing one function that builds the set for any list file, instead of repeating the loop for each lane

In [12]:
set2 = set()
with open(file2) as r2:
    for line in r2:
        set2.add(line[:line.rfind('-L')])

In [13]:
list(set2)[:5]

['filtered-uamerged-aligned-i1-lib001-A05',
 'filtered-uamerged-aligned-i1-lib002-B03',
 'filtered-uamerged-aligned-i1-lib002-G10',
 'filtered-uamerged-aligned-i1-lib002-H07',
 'filtered-uamerged-aligned-i1-lib002-C05']
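
The two loops above could be folded into one helper along these lines. This is only a sketch: `read_basenames` and `LANE_SUFFIX` are hypothetical names that are not part of the original notebook, and the regular expression assumes every lane suffix looks like `-L` followed by three digits and `.bam`. Unlike `line[:line.rfind('-L')]`, which drops the final character when `-L` is not found (because `rfind` returns -1), this version leaves a name without a lane suffix unchanged. The demo file is made up so the block runs on its own.

```python
import re
import tempfile
from pathlib import Path

# Assumed suffix pattern: '-L' + three digits + '.bam' at the end of the name.
LANE_SUFFIX = re.compile(r'-L\d{3}\.bam$')

def read_basenames(path):
    """Return the set of names in `path` with the lane suffix removed."""
    names = set()
    with open(path) as handle:
        for line in handle:
            name = line.strip()      # drop the newline and surrounding spaces
            if name:                 # skip blank lines
                names.add(LANE_SUFFIX.sub('', name))
    return names

# Demo on a throwaway file with made-up sample names.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / 'demo-lane1list.txt'
    demo.write_text('sample-A01-L001.bam\nsample-A02-L001.bam\n\n')
    demo_names = read_basenames(demo)

print(sorted(demo_names))
# ['sample-A01', 'sample-A02']
```

In the notebook this would presumably be called once per list, for example `set1 = read_basenames(file1)` and `set2 = read_basenames(file2)`.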

Check the number of filenames in each set

In [14]:
print(len(set1))
print(len(set2))

131
165


## Create lists of sample pairs

Get the filenames which are in both lists and store them in the variable "paired".

In [15]:
paired = set1.intersection(set2)
print(len(paired))

118


Suffixes need to be included in the generated lists of filenames

In [16]:
# create suffix variables
suffix1 = '-L001.bam'
suffix2 = '-L002.bam'

Using the "paired" set, generate two new lists of filenames with correct suffixes

In [17]:
# create two new writable files; the with block closes them automatically
with open(file1.replace('.txt', '-paired.txt'), 'w') as s1_out, \
     open(file2.replace('.txt', '-paired.txt'), 'w') as s2_out:
    # add filenames from paired set to each file with correct suffixes
    for name in sorted(paired):
        s1_out.write(name + suffix1 + '\n')
        s2_out.write(name + suffix2 + '\n')

In [18]:
# check new files are present
!ls

cellslist.txt          lane2list-paired.txt   unpairedlist.txt
lane1list-paired.txt   lane2list.txt          unpairedlistnolane.txt
lane1list.txt          lanesmgdlist.txt


In [19]:
!head -12 lane1list-paired.txt

filtered-uamerged-aligned-i1-lib001-A01-L001.bam
filtered-uamerged-aligned-i1-lib001-A02-L001.bam
filtered-uamerged-aligned-i1-lib001-A03-L001.bam
filtered-uamerged-aligned-i1-lib001-A04-L001.bam
filtered-uamerged-aligned-i1-lib001-A05-L001.bam
filtered-uamerged-aligned-i1-lib001-A06-L001.bam
filtered-uamerged-aligned-i1-lib001-A07-L001.bam
filtered-uamerged-aligned-i1-lib001-A08-L001.bam
filtered-uamerged-aligned-i1-lib001-A09-L001.bam
filtered-uamerged-aligned-i1-lib001-A10-L001.bam
filtered-uamerged-aligned-i1-lib001-B01-L001.bam
filtered-uamerged-aligned-i1-lib001-B02-L001.bam


In [20]:
!head -12 lane2list-paired.txt

filtered-uamerged-aligned-i1-lib001-A01-L002.bam
filtered-uamerged-aligned-i1-lib001-A02-L002.bam
filtered-uamerged-aligned-i1-lib001-A03-L002.bam
filtered-uamerged-aligned-i1-lib001-A04-L002.bam
filtered-uamerged-aligned-i1-lib001-A05-L002.bam
filtered-uamerged-aligned-i1-lib001-A06-L002.bam
filtered-uamerged-aligned-i1-lib001-A07-L002.bam
filtered-uamerged-aligned-i1-lib001-A08-L002.bam
filtered-uamerged-aligned-i1-lib001-A09-L002.bam
filtered-uamerged-aligned-i1-lib001-A10-L002.bam
filtered-uamerged-aligned-i1-lib001-B01-L002.bam
filtered-uamerged-aligned-i1-lib001-B02-L002.bam


## Generate a list of files that DON'T have a partner to be merged with

In [21]:
# count & get filenames which are in set1 but not in set2
set1_only = set1.difference(set2)
print(len(set1_only))
set1_only

13


{'filtered-uamerged-aligned-i1-lib001-A11',
 'filtered-uamerged-aligned-i1-lib001-A12',
 'filtered-uamerged-aligned-i1-lib002-A09',
 'filtered-uamerged-aligned-i1-lib002-C03',
 'filtered-uamerged-aligned-i1-lib002-C10',
 'filtered-uamerged-aligned-i1-lib002-D09',
 'filtered-uamerged-aligned-i1-lib002-F05',
 'filtered-uamerged-aligned-i1-lib002-F07',
 'filtered-uamerged-aligned-i1-lib002-G07',
 'filtered-uamerged-aligned-i1-lib002-G08',
 'filtered-uamerged-aligned-i1-lib002-H02',
 'filtered-uamerged-aligned-i1-lib002-H03',
 'filtered-uamerged-aligned-i1-lib002-H09'}

In [22]:
# count & get filenames which are in set2 but not in set1
set2_only = set2.difference(set1)
print(len(set2_only))
set2_only

47


{'filtered-uamerged-aligned-i1-lib002-A01',
 'filtered-uamerged-aligned-i1-lib002-A03',
 'filtered-uamerged-aligned-i1-lib002-A05',
 'filtered-uamerged-aligned-i1-lib002-A07',
 'filtered-uamerged-aligned-i1-lib002-A08',
 'filtered-uamerged-aligned-i1-lib002-A10',
 'filtered-uamerged-aligned-i1-lib002-A11',
 'filtered-uamerged-aligned-i1-lib002-B04',
 'filtered-uamerged-aligned-i1-lib002-B05',
 'filtered-uamerged-aligned-i1-lib002-B06',
 'filtered-uamerged-aligned-i1-lib002-B10',
 'filtered-uamerged-aligned-i1-lib002-B11',
 'filtered-uamerged-aligned-i1-lib002-B12',
 'filtered-uamerged-aligned-i1-lib002-C01',
 'filtered-uamerged-aligned-i1-lib002-C04',
 'filtered-uamerged-aligned-i1-lib002-C05',
 'filtered-uamerged-aligned-i1-lib002-C08',
 'filtered-uamerged-aligned-i1-lib002-C11',
 'filtered-uamerged-aligned-i1-lib002-D03',
 'filtered-uamerged-aligned-i1-lib002-D04',
 'filtered-uamerged-aligned-i1-lib002-D08',
 'filtered-uamerged-aligned-i1-lib002-D10',
 'filtered-uamerged-aligned-i1-l

In [24]:
# Generate a list file with filenames that don't have pairs
# the lane suffix is left off; remember to append .bam extension
with open('unpairedlistnolane.txt', 'w') as unpaired:
    for name in sorted(set1_only):
        unpaired.write(name + '.bam\n')
    for name in sorted(set2_only):
        unpaired.write(name + '.bam\n')
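
The list written above leaves the lane out of each name. If the full filenames are needed instead (for example, to locate the unpaired BAM files on disk), one possible variant is sketched below. The helper `with_lane_suffix` and the sample names are hypothetical; this is not necessarily how any other list file in this folder was produced.

```python
def with_lane_suffix(lane1_only, lane2_only,
                     suffix1='-L001.bam', suffix2='-L002.bam'):
    """Return full filenames for unpaired samples, lane 1 names first."""
    return ([name + suffix1 for name in sorted(lane1_only)] +
            [name + suffix2 for name in sorted(lane2_only)])

# Made-up sets standing in for set1_only and set2_only.
demo_unpaired = with_lane_suffix({'sample-A11'}, {'sample-B04', 'sample-A01'})
print(demo_unpaired)
# ['sample-A11-L001.bam', 'sample-A01-L002.bam', 'sample-B04-L002.bam']
```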

## Summary

Before processing

- 131 "lane 1" bam files
- 165 "lane 2" bam files

After processing

- 118 "lane 1" & "lane 2" pairs
- 13 "lane 1" bam files without a "lane 2" match
- 47 "lane 2" bam files without a "lane 1" match
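
As a quick sanity check, the counts above should be internally consistent: every file in a lane list is either paired or unpaired. A minimal sketch using the numbers from this summary (the variable names are made up for illustration):

```python
# Counts copied from the summary above.
lane1_total, lane2_total = 131, 165
paired_count, lane1_only, lane2_only = 118, 13, 47

# Each lane's files split into paired and unpaired.
assert paired_count + lane1_only == lane1_total
assert paired_count + lane2_only == lane2_total

files_to_merge = 2 * paired_count                           # 236 files in pairs
unpaired_total = lane1_only + lane2_only                    # 60 files with no partner
total_files = lane1_total + lane2_total                     # 296 files overall
unique_samples = paired_count + lane1_only + lane2_only     # 178 distinct names

assert files_to_merge + unpaired_total == total_files
print(files_to_merge, unpaired_total, total_files, unique_samples)
# 236 60 296 178
```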