## T1 Some initial tips for accessing databases with Python/Biopython

We strongly recommend spending some time playing around with Python in other contexts as well (see, for example, [a tutorial](https://github.com/Biocomputing-Teaching/Learning-Python-for-Biocomputing/blob/main/BasicPythonIntro.ipynb)).

To run these examples you need to have previously installed `Biopython`. To do so, in your command line run
```
conda install -c conda-forge biopython
```

Once installed, you are ready to follow this lesson.

Let us start by defining a simple sequence in Python. Here, the sequence is a plain string (`str`):

In [None]:
my_seq = "AGTACACTGGT"
print(my_seq)
type(my_seq)

Dealing with strings is routine work in Python, but here we want to take advantage of another type of object, the `Seq` object from the `Biopython` library. This library is extremely useful for many bioinformatics tasks and we will make extensive use of it during the course. Let us have a look at how it handles some simple jobs. First, let us create a `Seq` object:

In [None]:
from Bio.Seq import Seq
my_seq = Seq("AGTACACTGGT")
print(my_seq)
type(my_seq)

The `Seq` object differs from the Python string in the methods it supports. The following calls are not available on a plain string:

In [None]:
my_seq.complement()

In [None]:
my_seq.reverse_complement()
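For comparison, here is a minimal sketch of what you would have to write yourself to get the same results from a plain string. It assumes an uppercase DNA sequence containing only A, C, G and T; the names `my_str` and `complement_table` are introduced here for illustration only.

```python
# Plain-string version of complement / reverse complement.
# Assumption: uppercase, unambiguous DNA (only A, C, G, T).
my_str = "AGTACACTGGT"

# Translation table that maps every base to its complementary base
complement_table = str.maketrans("ACGT", "TGCA")

complement = my_str.translate(complement_table)
reverse_complement = complement[::-1]  # reverse the complemented string

print(complement)          # TCATGTGACCA
print(reverse_complement)  # ACCAGTGTACT
```

This works, but you would have to extend it yourself for lowercase letters, RNA or ambiguity codes, which is the kind of bookkeeping the `Seq` object spares you.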

Now, let us read a string from a file

In [None]:
# readlines() returns a list with one string per line
with open("cftr.fa", "r") as myfile:
    data = myfile.readlines()
print(data)
type(data)

Hmm, not entirely satisfactory. `readlines()` reads all the lines, yes, but returns them as a list. What I really want is a single string with all the information. One option is to delete the "\n" characters that mark the end of each line:

In [None]:
# read() returns the whole file as one string; then drop the newlines
with open("cftr.fa", "r") as myfile:
    data = myfile.read().replace('\n', '')
print(data)
type(data)

This is much nicer: it creates a single string object with all the information. Unfortunately, the FASTA header line and the sequence are now fused together, so making sense of the information would take much more work. It is better to rely on Biopython again:
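To make that extra work concrete, here is a rough sketch of a hand-written FASTA reader that keeps headers and sequences apart. The function name `read_fasta`, the file name `example.fa` and the two toy records are made up for this illustration; the sketch ignores the many corner cases of real files, which is why the `SeqIO` call that follows is the better choice.

```python
import os
import tempfile

def read_fasta(path):
    """Return a list of (header, sequence) tuples from a FASTA file."""
    records = []
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # a new header closes the previous record, if there was one
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            else:
                chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Toy two-record file with made-up sequences, written to a temporary folder
fasta_text = ">seq1 toy record\nAGTACA\nCTGGT\n>seq2 another toy record\nTTGACC\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.fa")
    with open(path, "w") as fh:
        fh.write(fasta_text)
    records = read_fasta(path)

for header, seq in records:
    print(header.split()[0], len(seq))
```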

In [None]:
from Bio import SeqIO
for seq_record in SeqIO.parse("cftr.fa", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))
type(seq_record)

This is much, much better. I just read a file in FASTA format and Biopython did everything I needed.
The next step is to download a sequence directly from the NCBI database instead of having it already on disk. To do so, I have to call the server:

In [None]:
from Bio import Entrez
Entrez.email = "jvilla@uic.cat"  # Always tell NCBI who you are
handle = Entrez.efetch(db="protein", id="NP_000483.3", rettype="fasta", retmode="text")
print(handle.read())
type(handle)

It is better to save the sequence information to a file, to avoid repeated access to the NCBI servers. This time we ask for the full GenBank record (`rettype="gb"`) rather than only the FASTA sequence.

In [None]:
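protID = "NP_000483.3"
# rettype="gb" asks for the full GenBank record, not just the FASTA sequence
handle = Entrez.efetch(db="protein", id=protID, rettype="gb", retmode="text")

# the content is in GenBank format, so use a matching file extension
filename = protID + ".gb"
with open(filename, "w") as out_handle:
    out_handle.write(handle.read())
handle.close()
print("Info Saved")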
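The idea of not hitting the server twice can be sketched as a small helper that downloads only when the local file is missing. Everything here is illustrative: `fetch_cached` is a hypothetical helper, and `fake_fetch` stands in for the real download so that the example runs without network access. In the notebook, the `fetch` argument could be something like `lambda: Entrez.efetch(db="protein", id=protID, rettype="gb", retmode="text").read()`.

```python
import os
import tempfile

def fetch_cached(path, fetch):
    """Return the text stored at `path`, calling `fetch()` only if the file is missing."""
    if not os.path.exists(path):
        text = fetch()  # the only place where the server would be contacted
        with open(path, "w") as out_handle:
            out_handle.write(text)
    with open(path) as in_handle:
        return in_handle.read()

# Stand-in for the real download (assumption: returns the record as text)
calls = []
def fake_fetch():
    calls.append(1)
    return "LOCUS       TOY_RECORD\n//\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_record.gb")
    first = fetch_cached(path, fake_fetch)   # file missing: fetches and saves
    second = fetch_cached(path, fake_fetch)  # file present: read from disk

print(len(calls))  # number of times the stand-in fetcher was called
```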

The cool thing is that you can now parse all the information contained in the record:

In [None]:
# SeqIO.read() expects exactly one record in the file and opens/closes it for us
record = SeqIO.read(filename, "genbank")
dir(record)

The `dir()` function gives you a `list` of the attributes and methods that the `record` object has. Let us see what is in some of them:

In [None]:
print(record.id)
print(record.name)
print(record.description)
print(len(record.features))
record.seq

# Working with PDB files

`Biopython`, and in particular its `Bio.PDB` module, provides a convenient way to deal with biomolecular structures deposited in the Protein Data Bank. Here is a very short primer to get started. Check the [Biopython tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html) to learn more.

In [None]:
# first, let us download a PDB file
import urllib.request  # the request submodule must be imported explicitly
urllib.request.urlretrieve('http://files.rcsb.org/download/6O1V.pdb', '6o1v.pdb')

In [None]:
# and see how the file can be initially parsed. Much more to come in the upcoming sessions
from Bio.PDB.PDBParser import PDBParser
parser = PDBParser(PERMISSIVE=1)
structure_id = "6o1v"
filename = "6o1v.pdb"
structure = parser.get_structure(structure_id, filename)

In [None]:
resolution = structure.header["resolution"]
keywords = structure.header["keywords"]
print(resolution)
print(keywords)
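To get a feeling for what the parser reads, remember that a PDB file is plain text with fixed-width columns. The sketch below pulls a few fields out of `ATOM` lines by column position. The helper `parse_atom_line` and the sample lines are made up for illustration (the coordinates are not taken from 6O1V); for real work, the `structure` object returned by `PDBParser` is the safer route.

```python
def parse_atom_line(line):
    """Split one ATOM/HETATM line into a few fields using the fixed PDB columns."""
    return {
        "atom": line[12:16].strip(),     # columns 13-16: atom name
        "residue": line[17:20].strip(),  # columns 18-20: residue name
        "chain": line[21],               # column 22: chain identifier
        "resseq": int(line[22:26]),      # columns 23-26: residue number
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }

# Hand-made sample lines (illustrative values only)
sample_lines = [
    "REMARK toy example, not a real structure",
    "ATOM      1  N   MET A   1      11.104  13.207   2.100  1.00 20.00           N",
    "ATOM      2  CA  MET A   1      12.560  13.329   2.212  1.00 20.00           C",
]

# Keep only coordinate records and parse them
atoms = [parse_atom_line(l) for l in sample_lines if l.startswith(("ATOM", "HETATM"))]
for atom in atoms:
    print(atom["chain"], atom["residue"], atom["resseq"], atom["atom"], atom["xyz"])
```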

This is all for today!

Take-home messages:
- Biopython allows you to easily process sequence data
- You need to understand how to deal with local files