## Exercise 6.1: Exception handling

In case you haven't met them yet, sets have two methods for removing an element: `remove()` and `discard()`.

A. Now that we've learned more about exception handling, explain what happens when you run the script below.

B. Create a script that holds a list of at least five to-do tasks and celebrates each time you finish one. The first time you call a task, remove it from the list and print a sentence saying that the task is now in progress. If you call the same task a second time, congratulate yourself on having completed it. Try to do this using **try** and **except** statements.


In [None]:
list_of_letters = ['a', 'a', 'b', 'c','c','c','d','e']
print ('ORIGINAL')
set_of_letters = set(list_of_letters)
print (set_of_letters)
 
print ('DISCARD')
set_of_letters.discard('q')
print (set_of_letters)
 
print ('REMOVE')
set_of_letters.remove('q')
print (set_of_letters)
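For part B, one possible starting point is sketched below. It relies on the fact that `list.remove()` raises a `ValueError` when the item is not in the list. The function name `do_task` and the task names are made up for illustration, and the list is shortened to three tasks.

```python
tasks = ["write report", "email advisor", "run analysis"]

def do_task(task, todo):
    """Remove task from todo and report whether it was pending or already done."""
    try:
        # list.remove() raises ValueError if the task is no longer in the list
        todo.remove(task)
    except ValueError:
        return f"Congratulations, '{task}' is already finished!"
    return f"'{task}' is now in progress."

print(do_task("email advisor", tasks))  # first call: task is removed
print(do_task("email advisor", tasks))  # second call: task is already gone
```

Note that this sketch also congratulates you for a task that was never on the list; telling those two cases apart is left as part of the exercise.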

## Exercise 6.2: Improving the FASTA parser in the `sequence_tools.py` module (from Exercise 5.5)
Let's improve the FASTA parser! Make it handle:
1. non-existent filenames. Print a message indicating the error.
2. files that do not conform to the FASTA format (i.e. >gene for IDs, and strings of A, T, G, or C for sequence). Print a message indicating the error.
3. sequences in mixed case. Make sure the program accepts both upper- and lower-case letters without producing an error.

Write a short script running the FASTA parser showing it can handle the above three situations.
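The three situations above could be handled along the lines of the sketch below. It is not the parser from Exercise 5.5: the names `check_fasta_lines` and `safe_parse` are hypothetical, and the returned dictionary of ID to sequence is just one assumed layout.

```python
VALID_BASES = set("ATGC")

def check_fasta_lines(lines):
    """Return {id: sequence}, raising ValueError if the input is not FASTA."""
    records = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            current_id = line[1:]
            records[current_id] = ""
        elif current_id is None:
            raise ValueError("sequence found before any '>' header")
        else:
            seq = line.upper()  # accept lower-case and mixed-case input
            bad = set(seq) - VALID_BASES
            if bad:
                raise ValueError(f"unexpected characters: {sorted(bad)}")
            records[current_id] += seq
    return records

def safe_parse(filename):
    """Parse a FASTA file, printing a message instead of crashing on errors."""
    try:
        with open(filename) as handle:
            return check_fasta_lines(handle)
    except FileNotFoundError:
        print(f"Error: no such file: {filename}")
    except ValueError as err:
        print(f"Error: {filename} is not valid FASTA ({err})")
    return None
```

Catching specific exception types (`FileNotFoundError`, `ValueError`) rather than using a bare `except` keeps unrelated bugs visible.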

## Exercise 6.3: Filetypes
The `cerevisiae_genome.fasta` file in `resources/` is very large, so we might prefer to save disk space by compressing it. Look up the **gzip** Linux command and use it to compress the FASTA file.

However, your FASTA parser can no longer read the compressed file. Create a new function called `open_file_by_mimetype` that identifies the filetype and returns an open filehandle without error. This will require the **mimetypes** module (look it up, particularly the function `mimetypes.guess_type`) and the **gzip** module (look it up, particularly the `gzip.open` function). Modify your FASTA parser accordingly so it can deal with more file types.
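A minimal sketch of such a function is shown below, assuming only plain-text and gzip-compressed files need to be supported. `mimetypes.guess_type` returns a `(type, encoding)` pair based on the filename alone, and the encoding is `'gzip'` for names ending in `.gz`. The default `mode="rt"` is an assumption made here so that both branches return text rather than bytes.

```python
import gzip
import mimetypes

def open_file_by_mimetype(filename, mode="rt"):
    """Return an open text handle, decompressing gzip files transparently."""
    # guess_type looks only at the filename, not at the file contents
    _, encoding = mimetypes.guess_type(filename)
    if encoding == "gzip":
        return gzip.open(filename, mode)
    return open(filename, mode)
```

Because the guess is based on the name, a gzip file without a `.gz` suffix would be opened as plain text; checking the file contents would need a different approach.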


## Exercise 6.4: All code is bug-free until your first user.
Another coworker wrote an AMAZING secondary structure analysis script and ran it on the protein 3GV2, using [the pdb file for 3GV2](http://www.rcsb.org/pdb/explore/explore.do?structureId=3GV2) as input. She asks if you will analyze her protein, [interleukin-19](http://www.rcsb.org/pdb/explore/explore.do?structureId=1N1F), as well. (HINT: use PDB code 1N1F.) Crud! This protein breaks the code. Why? Rewrite the code so it works on both interleukin-19 and the original 3GV2 HIV capsid protein.

The script she shared is below. I recommend also downloading the pdb file for 3GV2, to see what a working final output should look like.


In [None]:
import sys, os
 
full_seq = []
helix_aa = []
sheet_aa = []
atoms = []
pd="/Users/myang/YangLab/PythonBootcamp/resources/"
f1 = open(pd+'1N1F.pdb' ,'r')
for next in f1:
    tmp = next.strip().split()
    if tmp[0] == 'SEQRES':
        if tmp[2] == 'A':
            full_seq.extend(tmp[4:])
    elif tmp[0] == 'HELIX':
        try:
            int(tmp[5])
        except:
            tmp[5] = tmp[5][:-1]
        helix_aa.append(tmp[:9])
    elif tmp[0] == 'SHEET':
        sheet_aa.append(tmp[:10])
    elif tmp[0] == 'ATOM':
        if len(tmp) < 12:
            begin = tmp[0:2]
            end = tmp[3:]
            middle = [tmp[2][:3], tmp[2][4:]]
            tmp = begin + middle + end
        try:
            int(tmp[5])
        except:
            continue
        atoms.append(tmp)
 
######################
 
num_helix_res = 0.0
print (f"There are {full_seq} residues in the sequence")
 
# Set up a listing of features by residue, then fill it in as we go along
feature = ['Other']*(10000)
 
for aa in helix_aa:
    # We add 1 because there are b-a+1 residues between a and b, inclusive
    num_helix_res += float(aa[8]) - float(aa[5]) + 1
    for i in range(int(aa[5]), int(aa[8])+1):
        feature[i] = 'Helix'
 
num_sheet_res = 0.0
for sheet in sheet_aa:
    num_sheet_res += float(sheet[9]) - float(sheet[6]) + 1
    for i in range(int(sheet[6]), int(sheet[9])+1):
        feature[i] = 'Sheet'
 
 
 # atom[4] == chain id
 # atom[5] == residue #
 # atom[10] == b-factor
 
 
helix_bfactors = {}
sheet_bfactors = {}
other_bfactors = {}
 
for atom in atoms:
    Chain = atom[4]
    BFactor = float(atom[10])
    ResidueNum = int(atom[5])
 
    if feature[ResidueNum] == 'Helix':
        if Chain not in helix_bfactors:
            helix_bfactors[Chain] = []
        helix_bfactors[Chain].append(BFactor)
    elif feature[ResidueNum] == 'Sheet':
        if Chain not in sheet_bfactors:
            sheet_bfactors[Chain] = []
        sheet_bfactors[Chain].append(BFactor)
    else:
        if Chain not in sheet_bfactors:
            other_bfactors[Chain] = []
        other_bfactors[Chain].append(BFactor)
print (len(list(set(full_seq))))
for chain in helix_bfactors:
    # I could have used any of the different bfactor listings
    avg_helix = sum(helix_bfactors[chain])/len(helix_bfactors[chain])
    avg_sheet = sum(sheet_bfactors[chain])/len(sheet_bfactors[chain])
    avg_other = sum(other_bfactors[chain])/len(other_bfactors[chain])
    print (f'{chain}\t{avg_helix:.5f}\t{avg_sheet:.5f}\t{avg_other:.5f}')

## Exercise 6.5: GENO2FASTA - CHALLENGE

Take the "51.2.2M" data from the resources/ directory. Can you write a script that, given an individual, uses the geno/ind/snp files to turn that individual's data into fasta format? Here are some things to keep in mind:
1. Separate each fasta sequence in your file by chromosome. 
2. Make sure to keep all information you have for the individual in the header line (">") of the fasta - use "|" to separate different bits of information (including chromosome information). 
3. The 0/1/2 values count copies of the first of the two alleles listed in the "snp" file. A zero means the individual carries no copies of the first allele, a one means one copy, and a two means both copies are the first allele. If you have a '1', randomly choose one of the two alleles to keep in the fasta - look into the [**random** module](https://docs.python.org/3.8/library/random.html). You are essentially treating diploid individuals as haploid calls.
4. For missing data "9", use "N". 
5. Don't worry about adjacent positions not in the set of SNPs. That means if you have 100 SNPs in the ".snp" file belonging to chromosome 1, your fasta file sequence for chromosome 1 should have 100 As, Gs, Cs, Ts, and Ns.
6. Take the time to try stubbing and adding functions to make your code more readable. Pseudocode this and figure out everything you need before you start!
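The genotype rules in points 3 and 4 can be sketched as a single helper, which is also a natural first function to stub. The name `genotype_to_base` and its parameters are hypothetical; the genotype code is assumed to arrive as a one-character string read from the geno file.

```python
import random

def genotype_to_base(code, first_allele, second_allele, rng=random):
    """Turn one genotype character into a single (pseudo-haploid) base."""
    if code == "2":   # both copies are the first allele
        return first_allele
    if code == "0":   # no copies of the first allele
        return second_allele
    if code == "1":   # heterozygous: pick one allele at random
        return rng.choice([first_allele, second_allele])
    if code == "9":   # missing data
        return "N"
    raise ValueError(f"unexpected genotype code: {code!r}")

print(genotype_to_base("2", "A", "G"))
print(genotype_to_base("9", "A", "G"))
```

Passing a seeded `random.Random` instance as `rng` makes the random choices reproducible, which helps when checking the output by hand.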

## Exercise 6.6: The timing is everything!
In your Jupyter Notebook, try out the "run -p" command to check the timing of exercise 6.5. How fast is your code? What takes the longest? Can you think of any way to make your code run faster? It's okay if not, but if you want to discuss this, feel free to ask - sometimes you will already have found the fastest way, and sometimes (but not always - I'm not the greatest with speed) I might have a few more ideas.


In [None]:
run??
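The `-p` option runs the script under the Python profiler. Outside a notebook, a similar report can be produced with the standard-library `cProfile` and `pstats` modules, as in the sketch below. `slow_sum` is a stand-in function invented for the example, not part of Exercise 6.5.

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Toy workload: sum of squares computed with an explicit loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total

# Collect profiling data only while the workload runs
profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()

# Format the five most expensive entries, sorted by cumulative time
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The function with the largest cumulative time is usually the first place to look for speed-ups.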

## Exercise 6.7: ANNO file

In the Reich 1240K dataset, there's an ANNO file that gives more information on each individual that's been sequenced. Explore what information is found in each column. Then, write a function that returns a unique list of the 'Group IDs' available for any particular entry in the 'Publication' column. Note that columns are separated by tabs; regular spaces do not separate columns.

How many unique Group IDs are found for McCollScience2018?
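One way to approach this is sketched below with the standard-library `csv` module, splitting on tabs only. The function name `unique_group_ids` is hypothetical, and the default column names `"Group ID"` and `"Publication"` are assumptions: check the header line of your ANNO file and pass the exact names it uses.

```python
import csv

def unique_group_ids(anno_path, publication,
                     group_col="Group ID", pub_col="Publication"):
    """Return the sorted unique group IDs listed for one publication."""
    groups = set()
    with open(anno_path, newline="", encoding="utf-8") as handle:
        # Split on tabs only, and treat quote characters as ordinary text
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row[pub_col] == publication:
                groups.add(row[group_col])
    return sorted(groups)
```

Calling `len()` on the returned list gives the number of unique Group IDs for that publication.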

## Exercise 6.8: Doublecheck your data - CHALLENGE

A. After Exercise 6.7, let's make sure you have a script to check the amount of data available. There are two ways of doing this - do both and use the Group ID for Japan_Jomon.SG to check if they match up.
1. In the ANNO file, you can use the SNPs hit on autosomal targets.
2. In the GENO file, you can count the number of '9's present like before (or 0s, 1s, and 2s present). 

B. Note that there's only one individual with the Japan_Jomon.SG ID. But Vietnam_BA_DongSonCulture.SG has four individuals - we might want to count a SNP as available if even one of these individuals has data for it. Modify #2 so you can account for that.
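The "at least one individual has data" rule in part B can be sketched as follows. This assumes the GENO file is plain text with one line per SNP and one character per individual, so that an individual corresponds to a column index. The function name `count_snps_with_data` is hypothetical.

```python
def count_snps_with_data(geno_lines, columns):
    """Count SNP rows where at least one of the given individuals is not '9'."""
    count = 0
    for line in geno_lines:
        row = line.strip()
        if not row:
            continue  # skip blank lines
        # any() stops at the first individual with a non-missing call
        if any(row[col] != "9" for col in columns):
            count += 1
    return count

toy_geno = ["0929", "9999", "1299"]
print(count_snps_with_data(toy_geno, [1]))
print(count_snps_with_data(toy_geno, [0, 1]))
```

This counts every row in the file. To compare against the autosomal count in the ANNO file, you would likely also need to restrict the rows to autosomal SNPs using the chromosome column of the snp file.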