# W3D3: Files

## Ex1: Reading a FASTA file and collecting the whole sequence in a single variable `dna`

Note that this was intended as an extension of the ORF solution (which, by the way, was not that easy to solve). The main learning goals were:
- you read data from a file instead of getting it by other means
- you can collect data from multiple lines and then concatenate it into a single string
- **you can add that additional piece of code to an existing script to increase its functionality.**

This is a great opportunity to demonstrate the use of tiny, supertransparent example input for developing and debugging. Consider this tiny example file, called `test.txt`.

In [None]:
%%bash
cat test.txt

In [None]:
in_file_name = 'test.txt'

infile = open(in_file_name, 'r')
infile.readline()               # skip the header line (starts with '>')
dna = ''
for line in infile:
    dna += line.strip('\n')     # remove the newline, then append to dna
infile.close()

print(dna)
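
The same steps can also be wrapped in a function that uses a `with` block, so the file is closed automatically. The sketch below is one possible variant, not the reference solution. The function name `read_single_fasta` and the file `tiny_example.fasta` with its contents are made up for illustration, and the code assumes the file holds exactly one record whose first line is the header.

```python
def read_single_fasta(file_name):
    """Return the sequence of a single-record FASTA file as one string."""
    dna = ''
    with open(file_name, 'r') as infile:
        infile.readline()              # skip the header line (starts with '>')
        for line in infile:
            dna += line.strip()        # drop newline and surrounding whitespace
    return dna


# Build a tiny, made-up test file so the result can be checked by eye.
with open('tiny_example.fasta', 'w') as handle:
    handle.write('>made_up_header\nATGC\nGGTA\nTT\n')

dna = read_single_fasta('tiny_example.fasta')
print(dna)
```

With an input this small, it is easy to verify by eye that the three sequence lines end up joined in the right order.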

## Ex2: selecting lines from a file

This solution is only partially complete (it does not write to an outfile), but it is less complex and also shows how to 'prototype' a solution.

First, make a small testfile:

In [None]:
%%bash
head plantsvshuman_outmft6.csv > test_plantsvshuman_outmft6.csv
cat test_plantsvshuman_outmft6.csv

Then, we do the 'bare minimum'. For prototyping, output goes to 'Standard Out'; we can always modify the code to write to a file later.

Important: the procedure is as follows. Loop through the file and, for each line, split the line into its individual fields on `<TAB>`. This results in a list. Then pick the necessary elements from the list. And remember, each element is of type 'string', so you have to convert it to an integer before you can subtract.
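
To see these steps on one line in isolation, here is a small sketch. The line below is made up (its values do not come from the real file); only the column positions follow the description above.

```python
# A made-up, tab-separated line with 12 fields (values are illustrative only).
line = 'q1\tt1\t98.5\t100\t1\t0\t5\t104\t20\t119\t1e-50\t180\n'

line_elements = line.strip('\n').split('\t')   # list of strings
print(line_elements[6])                        # still a string: '5'

start_q = int(line_elements[6])   # 7th column
end_q = int(line_elements[7])     # 8th column
start_t = int(line_elements[8])   # 9th column
end_t = int(line_elements[9])     # 10th column

# Both differences are 99 for this made-up line, so the comparison holds.
same_length = (end_q - start_q) == (end_t - start_t)
print(same_length)
```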

In [None]:
in_filename = 'test_plantsvshuman_outmft6.csv'
infile = open(in_filename, 'r')  # 'rU' is deprecated; Python 3 handles all newline styles by default

infile.readline() # read the first line, don't do anything with it

for line in infile:                   # then loop through the remaining lines
    line = line.strip('\n')           # remove newline
    line_elements = line.split('\t')  # split on tab, returns list
    end_q = int(line_elements[7])     # end_q is the 8th column (index 7!); conv. to int!
    start_q = int(line_elements[6])   # start_q is the 7th column (index 6!); conv. to int!
    end_t = int(line_elements[9])     # end_t is the 10th column (index 9!); conv. to int!
    start_t = int(line_elements[8])   # start_t is the 9th column (index 8!); conv. to int!
    if end_q - start_q == end_t - start_t: # compare the two differences
        print(line)

infile.close()
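
The prototype prints to Standard Out. One way to turn it into a version that writes to a file is sketched below. The function name `select_equal_length_hits`, the file names, and the two data lines are all made up for this example; the column positions are the ones used above. Copying the header line to the output is an assumption, not something the exercise prescribes.

```python
def select_equal_length_hits(in_filename, out_filename):
    """Copy lines whose query and target spans have equal length; return the count."""
    n_written = 0
    with open(in_filename, 'r') as infile, open(out_filename, 'w') as outfile:
        header = infile.readline()
        outfile.write(header)                    # assumption: keep the header
        for line in infile:
            fields = line.strip('\n').split('\t')
            q_span = int(fields[7]) - int(fields[6])
            t_span = int(fields[9]) - int(fields[8])
            if q_span == t_span:
                outfile.write(line)              # line still ends with '\n'
                n_written += 1
    return n_written


# Tiny made-up input: a header plus two data lines (only the first matches).
with open('tiny_hits.tsv', 'w') as handle:
    handle.write('\t'.join('c%d' % i for i in range(10)) + '\n')
    handle.write('q1\tt1\t0\t0\t0\t0\t5\t104\t20\t119\n')
    handle.write('q2\tt2\t0\t0\t0\t0\t1\t50\t1\t60\n')

n_selected = select_equal_length_hits('tiny_hits.tsv', 'tiny_hits_selected.tsv')
print(n_selected)
```

Returning the number of written lines gives a cheap check that the selection did what was expected on the test input.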

## Ex3: Crane data - tabular format to XML

The crane data should be familiar by now. But, as always, it's good practice to take a quick look at the contents before starting to process the file:

In [None]:
%%bash
head -3 first_and_last_100_lines_crane_data.txt

The output file is in XML format (KML is one variety of XML). This requires writing a few default header lines first.

In [None]:
in_filename = 'first_and_last_100_lines_crane_data.txt'
out_filename = 'crane_data_v2.xml'

XML_HEAD = '<?xml version="1.0" encoding="utf-8"?>'
CRANEDATA_START = '<CraneData>'
CRANEDATA_END = '</CraneData>'

outfile = open(out_filename, 'w')

outfile.write(XML_HEAD+'\n')
outfile.write(CRANEDATA_START+'\n')


Then we open the tabular input file for reading. We read the first line separately. This is the header, and we store all the field names in a variable `col_names`. Note that `col_names` is a list.

In [None]:
infile = open(in_filename, 'r')
headerline = infile.readline()
col_names = headerline.strip().split('\t')

Let's see what is in `col_names`:

In [None]:
col_names

Then we loop through the rest of the lines. Every line is split on tabs after removing the newline. The index of each element in the list that holds the data (`col_data`) corresponds to the index of its name in the list that holds the column names (`col_names`). Every element of the line is then written to the XML file. Note that the tag name is taken from the list `col_names`, while the value is taken from the list `col_data`.
All fields of one line together form an 'Event'. Each 'Event' is opened by the `<Event>` tag and closed by the `</Event>` tag.

In [None]:
for line in infile:
    col_data = line.strip('\n').split('\t')   # one list of values per line
    outfile.write('   <Event>\n')             # open the Event
    for tag, value in zip(col_names, col_data):
        # the same index in both lists: the tag name and its value
        out_line = '      <%s>%s</%s>\n' % (tag, value, tag)
        outfile.write(out_line)
    outfile.write('   </Event>\n')            # close the Event
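
One caveat with building XML by string formatting: a value containing `&`, `<` or `>` would make the file invalid. A possible safeguard, sketched here, is to pass values through `escape` from the standard library module `xml.sax.saxutils`. The helper name `make_element` and the sample values are made up for illustration.

```python
from xml.sax.saxutils import escape


def make_element(tag, value, indent='      '):
    """Return one '<tag>value</tag>' line with special characters escaped."""
    return '%s<%s>%s</%s>\n' % (indent, tag, escape(value), tag)


# A made-up value that contains all three problematic characters.
element_line = make_element('comments', 'tag A & B <checked>')
print(element_line)
```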

The only remaining task is to write the mandatory closing tag of the XML file and to close both files.

In [None]:
outfile.write(CRANEDATA_END+'\n')
outfile.close()
infile.close()

In [None]:
%%bash
head -50 crane_data_v2.xml
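
Besides looking at the first lines, a quick sanity check is to let an XML parser read the result. The sketch below uses `xml.etree.ElementTree` from the standard library on a small made-up file with the same layout; the file name `tiny_crane.xml` and its contents are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# Made-up miniature of the output format, with two events.
with open('tiny_crane.xml', 'w') as handle:
    handle.write('<?xml version="1.0" encoding="utf-8"?>\n')
    handle.write('<CraneData>\n')
    for i in range(2):
        handle.write('   <Event>\n')
        handle.write('      <id>%d</id>\n' % i)
        handle.write('   </Event>\n')
    handle.write('</CraneData>\n')

tree = ET.parse('tiny_crane.xml')        # raises ParseError if not well-formed
root = tree.getroot()
n_events = len(root.findall('Event'))    # direct <Event> children of the root
print(root.tag)
print(n_events)
```

The same two calls (`ET.parse` and `findall`) could be pointed at `crane_data_v2.xml` to count the events that were written.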