# Prepare the input data for nucleosome density analysis

### Aim
This notebook prepares the table used as input for the nucleosome density analysis (Jupyter notebook).

### Input files
- IES length information: a text file containing IES IDs and the corresponding lengths (in bp)
- read counts on IESs from nucleosomal samples: text files with read counts (htseq-count output) on IESs from whole-genome sequencing of DNase-digested new MAC DNA. Provide an experimental sample (gene-of-interest_PGM-KD) and a control sample (ND7_PGM-KD).
- read counts on IESs from undigested samples: text files with read counts (htseq-count output) on IESs from whole-genome sequencing of undigested new MAC DNA. Provide an experimental sample (gene-of-interest_PGM-KD) and a control sample (ND7_PGM-KD). This must be exactly the same DNA that was DNase-digested for the nucleosomal sample; it is used for normalization.
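
For orientation, htseq-count writes one feature ID and its count per line, separated by a tab, followed by summary lines whose names begin with `__` (for example `__no_feature`). The following is a minimal sketch of reading such a table into a dictionary. The function name `parse_htseq_counts` and the IES IDs in the example are hypothetical and not part of this notebook.

```python
import io


def parse_htseq_counts(handle):
    """Return {feature ID: read count} from an htseq-count style text handle.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    counts = {}
    for line in handle:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        feature_id, count = fields[0], fields[1]
        if feature_id.startswith("__"):
            continue  # skip htseq-count summary lines such as __no_feature
        counts[feature_id] = int(count)
    return counts


# made-up example input; real files would be opened with gzip.open(fn, mode="rt")
example = io.StringIO(
    "IES_example_1\t12\n"
    "IES_example_2\t0\n"
    "__no_feature\t345\n"
    "__ambiguous\t6\n"
)
example_counts = parse_htseq_counts(example)
print(example_counts)
```

The merging code in this notebook keeps the values as strings and does not filter the summary lines; they stay out of the output only because the rows are taken from the IES length file.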

### Parameters 
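- newfilename: name of the output file
- fns: names of the input files to combine; the IES length file must have a name starting with `IES_length.`, because the code uses the key `IES_length` to define the rows of the output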

### Output
Gzip-compressed, space-delimited text file containing the following columns: IES ID, IES length, and the read counts from each of the provided samples
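
As a rough illustration of this layout, the sketch below reads a small table of the same shape with Python's `csv` module. The column names follow the sample names used further down; the IES IDs and all numbers are invented for the example.

```python
import csv
import io

# invented example in the layout of the output table: a header line followed by
# one space-delimited line per IES (the real file is gzip-compressed and would
# be opened with gzip.open(newfilename, mode="rt"))
example_table = io.StringIO(
    "IES IES_length ND7_PGM_DNase ND7_PGM_MAC\n"
    "IES_example_1 28 12 40\n"
    "IES_example_2 45 0 37\n"
)

reader = csv.DictReader(example_table, delimiter=" ")
rows = list(reader)

# all fields are read as strings, so numeric columns need an explicit conversion
for row in rows:
    print(row["IES"], int(row["IES_length"]), int(row["ND7_PGM_DNase"]))
```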

In [1]:
import gzip

In [2]:
# enter how the new table should be saved
newfilename = 'ICOPs.merged.lengths.IES.downsampled.htseq-count.txt.gz'

rec_d = {}

# specify the files that need to be combined
fns = ['IES_length.txt.gz',
       'ND7_PGM_DNase.IES.downsampled.htseq-count.txt.gz', 
       'ND7_PGM_MAC.IES.htseq-count.txt.gz', 
       'ICOP1_2_PGM_DNase.IES.htseq-count.txt.gz', 
       'ICOP1_2_PGM_MAC.IES.downsampled.htseq-count.txt.gz']

In [3]:
for fn in fns:
    # the sample name is the part of the file name before the first "."
    kd = fn.split(".")[0]
    # store the IES IDs and their values for each input file in a nested dictionary
    rec_d.setdefault(kd, {})  # sample name as key
    with gzip.open(fn, mode='rt') as fh:
        for line in fh:
            atoms = line.split()
            if len(atoms) < 2:
                continue  # skip blank or incomplete lines
            rec_d[kd][atoms[0]] = atoms[1]  # IES ID -> value (length or read count)

In [4]:
kds = list(rec_d.keys())
ies_names = list(rec_d["IES_length"].keys())

# generate the header line of the output table
header_str = "IES " + " ".join(kds)

In [5]:
# generate the subsequent lines and store each line as an element in a list
lines = [header_str]

# each IES and its values for the different samples form one line
for ies in ies_names:
    # join() puts the delimiter only between fields, so no line ends with a
    # trailing delimiter (which would break the pandas import)
    aline = " ".join([ies] + [rec_d[kd][ies] for kd in kds])
    lines.append(aline)

In [6]:
# save the lines to the file whose name is specified in newfilename
with gzip.open(newfilename, 'wt') as f:
    for line in lines:
        f.write(line + "\n")
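
The merge above looks up every IES from the length table in every sample, so an IES that is absent from one of the count files raises a `KeyError`. One possible pre-check, sketched here with a hypothetical helper `find_missing_ies` and invented data, lists the missing IDs per sample before any line is built.

```python
def find_missing_ies(rec_d, reference_key="IES_length"):
    """Return {sample name: sorted list of IES IDs absent from that sample}.

    Hypothetical helper for illustration; rec_d has the same nested layout as
    in this notebook: {sample name: {IES ID: value}}.
    """
    reference_ids = set(rec_d[reference_key])
    missing = {}
    for sample, values in rec_d.items():
        absent = sorted(reference_ids - set(values))
        if absent:
            missing[sample] = absent
    return missing


# invented example: the second IES is absent from one count table
example_rec_d = {
    "IES_length": {"IES_example_1": "28", "IES_example_2": "45"},
    "ND7_PGM_DNase": {"IES_example_1": "12"},
    "ND7_PGM_MAC": {"IES_example_1": "40", "IES_example_2": "37"},
}
missing_ies = find_missing_ies(example_rec_d)
print(missing_ies)
```

An empty result means every sample covers all IES IDs in the length table, so the merge can run without lookup errors.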