# First - read the manuscript

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2951089/



# What Can BioRuby Do?


* Sequence and feature (e.g. gene, exon, intron) creating/reading/writing
* Sequence database retrieval (e.g. embl, genbank, swissprot…)
* Sequence Alignment (many, many types: fasta, clustal, phylip…)
* Sequence and feature database creation/query
* Sequence search/analysis (e.g. blast, ‘virtual restrictions’, & many more)
* Sequence annotation
* Sequence assembly
* Physical mapping (e.g. clones, contigs, etc.)
* Taxonomy representation and database query/retrieval
* Bio-ontology manipulation/exploration (e.g. Gene Ontology)
* Phenotype representation and query (e.g. OMIM)
* Phylogeny analyses
* ...and much more


# We will start from the command-line

BioRuby provides its own command line (i.e. its own "terminal window")

    /Lectures/$  bioruby              < NOTE WHERE WE START!  IN THE LECTURES FOLDER!!!!
    Creating directory (/home/osboxes/.bioruby/shell/session) ... done
    Creating directory (/home/osboxes/.bioruby/shell/plugin) ... done
    Creating directory (/home/osboxes/.bioruby/data) ... done

    . . . B i o R u b y   i n   t h e   s h e l l . . .

      Version : BioRuby 1.5.1 / Ruby 2.4.2

    bioruby> 



## Demo of FASTQ Sequence quality masking

It is important to mask low-quality base reads before doing a wide range of downstream operations like alignment or SNP-calling.

We will now take a sequence file (in <code> Lectures/files/SP1.fq </code>) and we will use the BioRuby command line to mask it.

The code to do this is:

---------------------------
    Bio::FlatFile.open('./files/SP1.fq').each do |entry|  # each sequence will be put into 'entry'
        hq = entry.mask(20)                               # set the quality filter to 20
        puts hq.output_fasta(entry.entry_id)              # print the FASTA out again, with the same ID as the head
    end

-----------------
In BioRuby's command line it looks like this:


    bioruby> Bio::FlatFile.open('./files/SP1.fq').each do |entry|
    bioruby+   hq = entry.mask(20)               < NOTE:  'bioruby+' means that it is waiting for more input
    bioruby+   puts hq.output_fasta(entry.entry_id)
    bioruby+ end
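
To make the masking step less of a black box, here is a minimal plain-Ruby sketch of the idea. It is not BioRuby's own code. It assumes the quality string uses the Phred+33 ("Sanger") encoding and that low-quality bases are replaced with 'n'; the method name `mask_low_quality` is made up for this illustration.

```ruby
# Plain-Ruby illustration of quality masking (NOT BioRuby's implementation).
# ASSUMPTION: qualities are Phred+33 encoded, i.e. score = ASCII code - 33.
def mask_low_quality(bases, quality_string, threshold = 20, mask_char = 'n')
  scores = quality_string.each_char.map { |char| char.ord - 33 }
  bases.each_char.zip(scores).map do |base, score|
    score < threshold ? mask_char : base   # keep the base only if its score reaches the threshold
  end.join
end

# 'I' encodes a score of 40, '5' encodes 20, '#' encodes 2
puts mask_low_quality('ACGTAC', 'II5I#I')       # only the '#' position is masked
puts mask_low_quality('ACGTAC', 'II5I#I', 30)   # a stricter threshold masks more positions
```

In the real lecture code, `entry.mask(20)` does this work for you on every record in the FASTQ file.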
    




# Command line as a debugger

One possible use of the BioRuby command line is as a debugger - you can explore an object "live" until you decide what command you want to use, then you put that command into your code.


# END OF  COMMAND LINE

For the rest of the course, we will use BioRuby inside Jupyter, just as we have before.



----------------------

----------------------

-----------------------



# BioRuby - Documentation

http://bioruby.org/rdoc/

# Sequences and Sequence Features

We will focus on Sequences in this class.  
* how to retrieve them
* how to write them to files
* how to change their formats
* how to explore their features (e.g. genes)
* how to create databases from them, to make them easier to find/query

# BioRuby's Sequence objects

Like all other Ruby objects, you create a new Sequence object using "new"

First, look at the documentation for Bio::Sequence, Bio::Sequence::AA, Bio::Sequence::NA (http://bioruby.org/rdoc/Bio/Sequence.html and find the other two yourself :-) )



In [None]:
require 'bio'

puts "MYSEQ0 = Bio::Sequence.new('ACTTTGC')"
myseq0 = Bio::Sequence.new("ACTTTGC")
puts myseq0.class
puts myseq0.seq
puts myseq0.seq.class
puts

puts "MYSEQ1 = Bio::Sequence.auto('ACTTTGC')"
myseq1 = Bio::Sequence.auto("ACTTTGC")
puts myseq1.class
puts myseq1.seq
puts myseq1.seq.class
puts

puts "MYSEQ2 = Bio::Sequence::NA.new('ACTTTGC')"
myseq2 = Bio::Sequence::NA.new("ACTTTGC")
puts myseq2.class
puts myseq2.seq
puts myseq2.seq.class
puts

#puts myseq2.public_methods.join("\n")
puts "myseq2 = #{myseq2}"

myseq3 = myseq2.reverse_complement
puts "myseq3 = #{myseq3}  (reverse complement)"

myseq2.reverse_complement!    #   <---- note the ! ->  When you see this in Ruby, it means the method will change the object
puts "myseq2 = #{myseq2} (reverse complement!)"

# this is the better way to do it:
puts myseq2.to_s    # to_s means "to string" - you will see this in a lot of Ruby objects
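
The `!` convention is not specific to BioRuby. The plain-Ruby sketch below uses an ordinary String, plus a hypothetical helper called `reverse_complement` that exists only for this illustration. It shows the difference between a method that returns a new object and one that changes the object itself:

```ruby
# Hypothetical helper for illustration only.
# In real code you would use Bio::Sequence::NA#reverse_complement.
def reverse_complement(dna)
  dna.reverse.tr('ACGTacgt', 'TGCAtgca')
end

seq = 'ACTTTGC'.dup            # .dup gives us a String that we are allowed to modify

rc = reverse_complement(seq)   # returns a NEW string
puts "seq = #{seq}"            # seq has not changed
puts "rc  = #{rc}"

seq.reverse!                   # the ! version changes seq itself
puts "seq = #{seq} (after reverse!)"
```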


# Limitations

* BioRuby does not have built-in connections to every database on earth (e.g. no connection to TAIR or AraPort)
 * You will still need to use REST interfaces and regular expressions...often!
* BioRuby does not recognize every kind of sequence identifier (e.g. At1g287748 means nothing to BioRuby)



# Sequence Object Methods



In [None]:
require 'bio'

aaseq = Bio::Sequence.auto("BBSTD")
puts aaseq.seq.class
puts aaseq.seq.public_methods.sort
puts
aaseq = Bio::Sequence.auto("AAAACCCCCCCTTTTTTTGGGGGG")   # NOTE: despite the variable name, this one is a nucleotide sequence
puts aaseq.seq.class
puts aaseq.seq.public_methods.sort



## writing sequences

* :output 
* :list_output_formats
* :output_fasta

Note that the documentation for Bio::Sequence says:

    Included Modules

        Bio::Sequence::Format
        Bio::Sequence::SequenceMasker 

That's where these "output" methods are coming from...



In [None]:
aaseq.list_output_formats

In [None]:
puts aaseq.output_fasta    # ask for fasta
puts aaseq.output    # the default is fasta

In [None]:
puts aaseq.output(:fastq)   # specify that you want fastq


## searching sequences 

see:  https://www.bioinformatics.org/sms/iupac.html

This is accomplished in BioRuby using the .to_re method, which turns a sequence containing IUPAC ambiguity codes into a regular expression.



In [None]:
require 'bio'

seq = Bio::Sequence::NA.new("actgggggatccc")
search = Bio::Sequence::NA.new("ATMM")   # M indicates it is an A or C

re = Regexp.new(search.to_re)
puts re.source   # to see the content of a regular expression call .source

match = seq.seq.match(re)
puts match
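
If you want to see what such a conversion involves, here is a minimal plain-Ruby sketch. It is an illustration, not BioRuby's implementation: the table `IUPAC_NA` and the method `iupac_to_regexp` are names invented here, and the sketch handles DNA letters in lower case only.

```ruby
# IUPAC nucleotide ambiguity codes written as regular expression fragments.
# Names invented for this illustration (NOT part of BioRuby).
IUPAC_NA = {
  'a' => 'a', 'c' => 'c', 'g' => 'g', 't' => 't',
  'm' => '[ac]', 'r' => '[ag]', 'w' => '[at]',
  's' => '[cg]', 'y' => '[ct]', 'k' => '[gt]',
  'v' => '[acg]', 'h' => '[act]', 'd' => '[agt]', 'b' => '[cgt]',
  'n' => '[acgt]'
}.freeze

def iupac_to_regexp(pattern)
  # fetch raises a KeyError if the pattern contains a letter that is not in the table
  Regexp.new(pattern.downcase.each_char.map { |code| IUPAC_NA.fetch(code) }.join)
end

re = iupac_to_regexp('ATMM')
puts re.source

match = 'actgggggatccc'.match(re)
puts "found '#{match[0]}' at string index #{match.begin(0)} (counting from 0)"
```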



## Other common sequence methods
(from the BioRuby Tutorial)

    seq = Bio::Sequence::NA.new("atgcatgcaaaa")
    puts seq.complement

    puts seq.subseq(3,8) # gets subsequence of positions 3 to 8 (starting from 1)
    puts seq.gc_percent 
    puts seq.composition 
    puts seq.translate 
    puts seq.translate(2)        # translate from frame 2
    puts seq.translate(1,11)     # codon table 11
    puts seq.translate.codes
    puts seq.translate.names
    puts seq.translate.composition
    puts seq.translate.molecular_weight
    puts seq.complement.translate


In [None]:
seq = Bio::Sequence::NA.new("atgcatgcaaaa")
puts seq.complement

puts seq.subseq(3,8) # gets subsequence of positions 3 to 8 (starting from 1)
puts seq.gc_percent 
puts seq.composition 
puts seq.translate 
puts seq.translate(2)        # translate from frame 2
puts seq.translate(1,11)     # codon table 11
puts seq.translate.codes
puts seq.translate.names
puts seq.translate.composition
puts seq.translate.molecular_weight
puts seq.complement.translate
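
Two of these methods are easy to reproduce by hand, which is a good way to check that you understand what they report. The plain-Ruby sketch below works on an ordinary String. The method names mirror the BioRuby ones but are defined here only for illustration, and `gc_fraction` is an invented name that returns a fraction, not a percentage.

```ruby
# Plain-Ruby versions for illustration only (the real methods live in Bio::Sequence::NA).

# Positions count from 1 and both ends are included; Ruby strings count from 0.
def subseq(seq, from, to)
  seq[(from - 1)..(to - 1)]
end

# Count how many times each letter appears.
def composition(seq)
  counts = Hash.new(0)
  seq.each_char { |base| counts[base] += 1 }
  counts
end

# Fraction of the sequence that is G or C (between 0.0 and 1.0).
def gc_fraction(seq)
  counts = composition(seq.downcase)
  (counts['g'] + counts['c']).to_f / seq.length
end

seq = 'atgcatgcaaaa'
puts subseq(seq, 3, 8)
puts composition(seq)
puts (gc_fraction(seq) * 100).round(1)
```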

# Loading common database formats

BioRuby can import a wide range of Sequence files, in many formats:

    $ ls -l /home/osboxes/.rvm/gems/ruby-2.4.2/gems/bio-1.5.1/lib/bio/db
    total 324
    -rw-r--r-- 1 osboxes osboxes  7198 Oct 26 05:58 aaindex.rb
    drwxrwxr-x 2 osboxes osboxes  4096 Oct 26 05:58 biosql
    drwxrwxr-x 2 osboxes osboxes  4096 Oct 26 05:58 embl
    -rw-r--r-- 1 osboxes osboxes 15444 Oct 26 05:58 fantom.rb
    drwxrwxr-x 2 osboxes osboxes  4096 Oct 26 05:58 fasta
    -rw-r--r-- 1 osboxes osboxes  9384 Oct 26 05:58 fasta.rb
    drwxrwxr-x 2 osboxes osboxes  4096 Oct 26 05:58 fastq
    -rw-r--r-- 1 osboxes osboxes 18543 Oct 26 05:58 fastq.rb
    drwxrwxr-x 2 osboxes osboxes  4096 Oct 26 05:58 genbank
    -rw-r--r-- 1 osboxes osboxes 61643 Oct 26 05:58 gff.rb
    -rw-r--r-- 1 osboxes osboxes 10036 Oct 26 05:58 go.rb
    drwxrwxr-x 2 osboxes osboxes  4096 Oct 26 05:58 kegg
    -rw-r--r-- 1 osboxes osboxes  9021 Oct 26 05:58 lasergene.rb
    -rw-r--r-- 1 osboxes osboxes  1478 Oct 26 05:58 litdb.rb
    -rw-r--r-- 1 osboxes osboxes  7622 Oct 26 05:58 medline.rb
    -rw-r--r-- 1 osboxes osboxes  5328 Oct 26 05:58 nbrf.rb
    -rw-r--r-- 1 osboxes osboxes 12178 Oct 26 05:58 newick.rb
    -rw-r--r-- 1 osboxes osboxes 56187 Oct 26 05:58 nexus.rb
    drwxrwxr-x 2 osboxes osboxes  4096 Oct 26 05:58 pdb
    -rw-r--r-- 1 osboxes osboxes   517 Oct 26 05:58 pdb.rb
    drwxrwxr-x 2 osboxes osboxes  4096 Oct 26 05:58 phyloxml
    -rw-r--r-- 1 osboxes osboxes 11233 Oct 26 05:58 prosite.rb
    -rw-r--r-- 1 osboxes osboxes 14132 Oct 26 05:58 rebase.rb
    drwxrwxr-x 2 osboxes osboxes  4096 Oct 26 05:58 sanger_chromatogram
    -rw-r--r-- 1 osboxes osboxes 14249 Oct 26 05:58 soft.rb
    -rw-r--r-- 1 osboxes osboxes  6022 Oct 26 05:58 transfac.rb
    
(NOTE: some of these are folders, if the database contains different things - e.g. kegg has genes, pathways, compounds, etc.)

BioRuby is also able to guess (usually correctly!) what kind of sequence file you give it.  

So, you can either tell it what kind of data you have (here, `data` must be an open stream, such as a File object, not a plain String):

---------------
<code>
    ff = Bio::FlatFile.new(Bio::GenBank, data)
</code>
----------------

Or, you can ask BioRuby to look at the data and make the decision itself:

----------------------
<code>
    ff = Bio::FlatFile.auto(data)

    ff.each_entry do |entry|    # many database formats, like FASTA, allow multiple records in a single file/string
      p entry.entry_id          # identifier of the entry
      p entry.definition        # definition of the entry
      p entry.seq               # sequence data of the entry
    end
</code>
----------------------
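
To get a feeling for how a format can be guessed at all, here is a deliberately simple plain-Ruby sketch that only looks at the first non-blank line. This is not how BioRuby does it (its detection is more careful); the `SIGNATURES` table and the `guess_format` method are invented for this illustration.

```ruby
# Very rough format guess based on the first non-blank line. Illustration only!
# ASSUMPTION: each format can be recognised by how its first line starts.
SIGNATURES = {
  fasta:   /\A>/,
  fastq:   /\A@/,
  embl:    /\AID\s{3}/,
  genbank: /\ALOCUS\s/,
  gff3:    /\A##gff-version\s+3/
}.freeze

def guess_format(text)
  first_line = text.each_line.map(&:strip).find { |line| !line.empty? }
  return :unknown if first_line.nil?

  SIGNATURES.each { |name, pattern| return name if first_line.match?(pattern) }
  :unknown
end

puts guess_format(">seq1\nACGT\n")
puts guess_format("ID   AT3G54340; SV 1; linear; genomic DNA\n")
puts guess_format("just some text\n")
```

A real detector has to cope with formats that look alike at the start, which is one reason to let `Bio::FlatFile.auto` do the job.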

## Let's try this for ourselves:


In [None]:
require 'net/http'   # this is how you access the Web
require 'bio'
address = URI('http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=ensemblgenomesgene&format=embl&id=At3g54340')  # create a "URI" object (Uniform Resource Identifier: https://en.wikipedia.org/wiki/Uniform_Resource_Identifier)

response = Net::HTTP.get_response(address)  # use the Net::HTTP object "get_response" method
                                             
record = response.body
puts record[0..99]  # show just a little bit (the first 100 characters)...

# create a local file with this data
File.open('At3g54340.embl', 'w') do |myfile|  # w makes it writable
  myfile.puts record
end


puts "Trying Method 1a - directly create a Bio::EMBL object from string"
#  METHOD 1a - Create an object of the correct type from a string
entry = Bio::EMBL.new(record)
puts entry.class
puts "The record is #{entry.definition} "



#puts "\n\nTrying Method 1b - directly create a Bio::EMBL object from a file"
# METHOD 1b - Create an object of the correct type from a file
#entry = Bio::EMBL.open('At3g54340.embl')
#puts entry.class
#puts "The record is #{entry.definition} "


# ==========================================================
# see documentation at http://bioruby.org/rdoc/Bio/FlatFile

puts "\n\nTrying method 2 - create a new FlatFile object of the correct type"

# we can ask what type it is...
datafiletype = Bio::FlatFile.autodetect(record)
puts datafiletype

datafile1 = Bio::FlatFile.new(datafiletype, File.open('At3g54340.embl', 'r')) # use that to create the correct type
# note that the .new requires a "stream" as its second argument.  This is ugly, but not difficult...

puts datafile1.class 

datafile1.each_entry do |entry|   # the FILE is not the same as the RECORD - multiple records can exist in a file
  puts entry.class
  puts "The record is #{entry.definition} "
end




# ==========================================================
# see documentation at http://bioruby.org/rdoc/Bio/FlatFile

puts "\n\nTrying method 3 - create a new FlatFile object using auto-detect for type"
datafile2 = Bio::FlatFile.auto('At3g54340.embl')
puts datafile2.class  

datafile2.each_entry do |entry| # the FILE is not the same as the RECORD - multiple records can exist in a file
  puts entry.class
  puts "The record is *#{entry.definition}* " unless entry.definition.nil? || entry.definition.empty?  # see this? Useful!
end



# Methods available for the common Sequence files:

http://bioruby.org/rdoc/Bio/EMBL


    #cc
    #comment
    #data_class
    #date_created
    #date_modified
    #dblinks
    #division
    #dt
    #each_cds
    #each_gene
    #entry  *******
    #entry_id
    #entry_name
    #entry_version
    #features    ***********
    #fh
    #ft
    #id_line
    #molecule
    #molecule_type
    #naseq
    #ntseq
    #os
    #release_created
    #release_modified
    #seq
    #seqlen
    #sequence_length
    #species    ************
    #sq
    #sv
    #to_biosequence
    #topology
    #version 
    
    http://bioruby.org/rdoc/Bio/EMBLDB/Common.html
    
    #ac
    #accession   *********
    #accessions
    #de
    #definition
    #description
    #dr
    #keywords
    #kw
    #oc
    #og
    #os
    #ref
    #references 
    
    http://bioruby.org/rdoc/Bio/DB.html
    
    #entry_id
    #exists?
    #fetch
    #get
    #tags 

## Sequence features - exons, introns, polyA sites, etc.

http://bioruby.org/rdoc/Bio/Feature.html

http://bioruby.org/rdoc/Bio/Feature/Qualifier.html


In [None]:
require 'bio' 

datafile2 = Bio::FlatFile.auto('At3g54340.embl')
puts datafile2.class  

datafile2.each_entry do |entry| # the FILE is not the same as the RECORD - multiple records can exist in a file
# shows accession and organism
  next unless entry.accession

  puts entry.class
  puts "# #{entry.accession} - #{entry.species}"

  # iterates over each element in 'features'
  entry.features.each do |feature|
    position = feature.position
    puts "\n\n\n\nPOSITION = #{position}"
    qual = feature.assoc            # feature.assoc gives you a hash built from the Bio::Feature::Qualifier objects
                                    # (i.e. qual['key'] = value, for example qual['gene'] = "CYP450")
    puts "Associations = #{qual}"
    # skips the feature if "/translation=" is not found
    next unless qual['translation']    # this is an indication that the feature is a transcript

    # collects gene name and so on and joins it into a string
    gene_info = [
      qual['gene'], qual['product'], qual['note'], qual['function']
    ].compact.join(', ')
    puts "TRANSCRIPT FOUND!\nGene Info:  #{gene_info}"
    # shows nucleic acid sequence
    puts "\n\n>NA splicing('#{position}') : #{gene_info}"
    puts entry.naseq.class   # this is a Bio::Sequence::NA    Look at the documentation to understand the .splicing() method
    puts entry.naseq.splice(position)  # http://bioruby.org/rdoc/Bio/Sequence/Common.html#method-i-splice

    # shows amino acid sequence translated from nucleic acid sequence
    puts "\n\n>AA translated by splicing('#{position}').translate"
    puts entry.naseq.splicing(position).translate

    # shows amino acid sequence in the database entry (/translation=)
    puts "\n\n>AA original translation"
    puts qual['translation']
  end

  
  puts "\n\nNumber of features #{entry.features.length}"
  

end


<pre>


</pre>

# Creating your own features

The BioRuby Class that allows you to add new features is the Bio::Sequence class.  Other classes, like Bio::EMBL, are representations of DATABASE RECORDS - you don't manipulate EMBL records!  But you can "clone" these into a Bio::Sequence object, which allows you to add new features.

See the code below to understand this more deeply:

In [None]:
require 'bio' 

datafile2 = Bio::FlatFile.auto('At3g54340.embl')
puts "datafile 2 is of Class: #{datafile2.class}\n\n\n"


entry =  datafile2.next_entry   # this is a way to get just one entry from the FlatFile

puts "working on #{entry.accession}"
puts "entry is of class #{entry.class}"
puts "This entry has:#{entry.features.length} features right now"
puts "entry can respond to a call to RETRIEVE features? #{entry.respond_to?('features')}"
puts "entry can respond to a call to SET features? #{entry.respond_to?('features=')} <-------!!!"


puts "\n\nconverting it to a Bio::Sequence"
bioseq = entry.to_biosequence  # this is how you convert a database entry to a Bio::Sequence

puts "bioseq is of class #{bioseq.class}"
puts "the equivalent of Bio::EMBL.accession is Bio::Sequence.primary_accession:  #{bioseq.primary_accession}\n\t\t\t\t (I agree, that is a bit frustrating ;-) )"
puts "This Bio::Sequence has:#{bioseq.features.length} features, the same as the Bio::EMBL - it is a clone, but...."
puts "bioseq can respond to a call to RETRIEVE features? #{bioseq.respond_to?('features')}"
puts "bioseq can respond to a call to SET features? #{bioseq.respond_to?('features=')} <-------!!!"



<pre>
  

</pre>

# Creating your own Bio::Features
  
You need to create a new [Bio::Feature](http://bioruby.org/rdoc/Bio/Feature.html) object:

------------


**Public Class Methods**

*new(feature = '', position = '', qualifiers = [])*

    Create a new Bio::Feature object. 
    
Arguments:

        (required) feature: type of feature (e.g. “exon”)

        (required) position: position of feature (e.g. “complement(1532..1799)”)

        (opt) qualifiers: list of Bio::Feature::Qualifier objects (default: [])

Returns

        Bio::Feature object


**Public Instance Methods**

*append(a)*

    Appends a Qualifier object to the Feature.

Arguments:

        (required) qualifier: Bio::Feature::Qualifier object

Returns

        Bio::Feature object

---------------
<pre>


</pre>

**NOTE:  You should read about [INSDC Feature tables](http://www.insdc.org/files/feature_table.html), <span style="color: red;">especially how to format the location portion of a feature!!  [Location Examples](http://www.insdc.org/files/feature_table.html#3.4.3)</span>**

<pre>

</pre>
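
To make the location format more concrete, here is a minimal plain-Ruby sketch that reads the simplest kinds of location strings. It is an illustration under strong assumptions: it only understands `N`, `N..M`, `complement(...)` around the whole location, and `join(...)` of simple ranges. The `Span` struct and the `parse_location` method are names invented here, and BioRuby has its own, much more complete, location handling.

```ruby
# Illustration only: handles 'N', 'N..M', 'complement(...)' and 'join(a..b,c..d)'.
# It does NOT handle '<', '>', '^', remote accessions or complement() inside join().
Span = Struct.new(:from, :to, :strand)

def parse_location(location)
  text = location.strip
  strand = 1
  if (match = text.match(/\Acomplement\((.+)\)\z/))
    strand = -1                 # the feature is on the reverse strand
    text = match[1]
  end
  text = text.sub(/\Ajoin\((.+)\)\z/, '\1')   # unwrap join(...) if it is there

  text.split(',').map do |part|
    from, to = part.split('..').map { |number| Integer(number) }
    Span.new(from, to || from, strand)        # a single position has from == to
  end
end

p parse_location('120..124')
p parse_location('complement(190..194)')
p parse_location('join(1..5,10..20)')
```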

In [None]:
require 'bio'  

datafile2 = Bio::FlatFile.auto('At3g54340.embl')
entry =  datafile2.next_entry   # this is a way to get just one entry from the FlatFile
puts "\n\nconverting it to a Bio::Sequence"
bioseq = entry.to_biosequence  # this is how you convert a database entry to a Bio::Sequence
puts "This entry has: #{bioseq.features.length} features at the beginning"


f1 = Bio::Feature.new('myrepeat','120..124')
f1.append(Bio::Feature::Qualifier.new('repeat_motif', 'AAGCC'))
f1.append(Bio::Feature::Qualifier.new('note', 'found by repeatfinder 2.0'))
f1.append(Bio::Feature::Qualifier.new('strand', '+'))
bioseq.features << f1  # you can append features one-by-one, using the << operator of Ruby arrays 

f2 = Bio::Feature.new('myrepeat','complement(190..194)')   # NOTE THE FORMAT HERE!  See note in RED above!!!!!!!!!!
f2.append(Bio::Feature::Qualifier.new('repeat_motif', 'AAGCC'))
f2.append(Bio::Feature::Qualifier.new('note', 'found by repeatfinder 2.0'))
f2.append(Bio::Feature::Qualifier.new('strand', '-'))
bioseq.features << f2


puts "This entry has: #{bioseq.features.length} features after appending two individual features"

bioseq.features.concat([ f1, f2 ])   # or you can take an array of features and concatenate with the .features array

puts "This entry has: #{bioseq.features.length} features after concatenating a list of two new features"

bioseq.features.each do |feature|
  featuretype = feature.feature
  next unless featuretype == "myrepeat"   # only look at the features we created ourselves
  position = feature.position
  puts "\n\n\n\nFEATURE #{featuretype} @ POSITION = #{position}"
  qual = feature.assoc            # feature.assoc gives you a hash built from the Bio::Feature::Qualifier objects
                                  # (i.e. qual['key'] = value, for example qual['gene'] = "CYP450")
  puts "Associations = #{qual}"
end

puts "\n\n\ndone"


# Prove that you understand

easy:  Modify the code above to print a list of the SNPs in this EMBL record.

harder:  For each SNP, report if it is in an exon or an intron

NOTE: from the Bio::Feature documentation


### Attributes
    feature[RW]
    Returns type of feature in String (e.g 'CDS', 'gene')
    
    position[RW]
    Returns position of the feature in String (e.g. 'complement(123..146)')

    qualifiers[RW]
    Returns an Array of Qualifier objects.




# General Feature Format - a concise way to represent sequence features

It is common to want to exchange/publish sequence annotations.  There are many ways to do this - EMBL files, GenBank files, etc.; however, these formats are extremely "rich" and complicated.  The bioinformatics community has, for many years, looked for a "quick and dirty" way to exchange sequence features.  The solution they came to was called "General Feature Format" or "GFF".  GFF, basically, includes basic information about a sequence feature on a single line, consisting of 9 tab-separated fields:  
    
    seqid   source   type   start    end    score    strand    phase    attributes
    
These fields allow you to say what kind of feature, appears in what position, on which strand of which sequence, along with some additional metadata and annotations.

Unfortunately, "quick and dirty" was not a very good idea!  For example, there was no agreement in the community about what types of features there could be, or how you could group features together (e.g. the exons of a specific gene).  Basically, it ended up not being very useful as a way to **exchange** feature data, because there was no agreement about... anything! So GFF went through a few iterations to try to solve these problems.

There have been 3 increasingly rigorous versions of GFF created over the past ~10 years:  GFF, GFF2 and GFF3.  We are going to look at GFF3 - the most complex - because it fits with the discussion of ontologies we had a few weeks ago.

The definition of the GFF3 Format is HERE: http://www.ensembl.org/info/website/upload/gff3.html

The abbreviated explanation is:

-----------



**Fields**

    The first line of a GFF3 file must be a comment that identifies the version, e.g.

        ##gff-version 3

Fields must be tab-separated. Also, all but the final field in each feature line must contain a value; "empty" columns should be denoted with a '.'

  - **seqid** - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Important note: the seq ID must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly. See the example GFF output below.
  - **source** - name of the program that generated this feature, or the data source (database or project name)
  -  **type** - type of feature. Must be a term or accession from the **[SOFA sequence ontology](https://github.com/The-Sequence-Ontology/SO-Ontologies/blob/master/so.obo)**
  -  **start** - Start position of the feature, with sequence numbering starting at 1.
  -  **end** - End position of the feature, with sequence numbering starting at 1.
  -  **score** - A floating point value.
  -  **strand** - defined as + (forward) or - (reverse).
  -  **phase** - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.
  - **attributes** - A semicolon-separated list of tag-value pairs, providing additional information about each feature. Some of these tags are predefined, e.g. ID, Name, Alias, Parent - see the GFF documentation for more details.

The part I want you to pay attention to is "**type**" above.  It says that the "type" field is constrained to be a valid term from the Sequence Ontology.  You can [browse the Sequence Ontology here](http://www.sequenceontology.org/browser/obob.cgi).   Note that this is a **simple** explanation of the GFF3 format.  [The complete explanation is much more detailed](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md)!!

<pre>


</pre>

A typical line of GFF3 looks like this:

     Chr3    GeneScan     coding_exon     1300     9745     0.97     +     1     ID=mrna0001.1;Name=sonichedgehog.e1

This says that, on the sequence called Chr3, there is a [coding_exon](http://www.sequenceontology.org/browser/current_svn/term/SO:0000195) that runs on the + strand from sequence position 1300 to 9745. It was discovered by the GeneScan software, which gave it a likelihood score of 0.97 (97%), and it is in phase 1 relative to the coding sequence (i.e. the second base of the feature is the first base of a codon).  It has an ID of mrna0001.1, and it has a display-name of "sonichedgehog.e1".
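
Because a GFF3 line is just nine tab-separated fields, you can take one apart with a few lines of plain Ruby. The sketch below is only an illustration (the constant `GFF3_COLUMNS` and the method `parse_gff3_line` are names invented here), and it assumes a well-formed line with simple `tag=value` attributes and no escaped characters.

```ruby
# Illustration only: split one GFF3 feature line into a Hash.
# ASSUMPTION: the line is well-formed and the attributes are plain tag=value pairs.
GFF3_COLUMNS = %i[seqid source type start end score strand phase attributes].freeze

def parse_gff3_line(line)
  values = line.chomp.split("\t")
  raise ArgumentError, "expected 9 fields, found #{values.size}" unless values.size == 9

  record = GFF3_COLUMNS.zip(values).to_h
  record[:start] = Integer(record[:start])
  record[:end]   = Integer(record[:end])
  record[:attributes] = record[:attributes].split(';').map { |pair| pair.split('=', 2) }.to_h
  record
end

line = "Chr3\tGeneScan\tcoding_exon\t1300\t9745\t0.97\t+\t1\tID=mrna0001.1;Name=sonichedgehog.e1"
feature = parse_gff3_line(line)

puts "#{feature[:type]} on #{feature[:seqid]} (#{feature[:strand]} strand)"
puts "length: #{feature[:end] - feature[:start] + 1}"   # start and end are both included
puts "name:   #{feature[:attributes]['Name']}"
```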

This may look trivial - and in fact, it is quite a simple format! - however, it can be difficult to create these files.  For example, the 'phase' field can be tricky to calculate.  Most importantly, the attributes field is used to create parent-child relationships (for example, Exon with ID E1.1 belongs to Transcript T1.1 which belongs to Gene G1) and these are quite difficult to both create and parse.  A larger example of [a Rice (_Oryza sativa_) GFF3 file is here](http://ftp.gramene.org/archives/PAST_RELEASES/release45/data/vcf/oryza_sativa/Duitama/WGSOryza_CIAT_LSU_USDA_NCGR_SV/MV89-80_bowtie2_NGSEP_SV.gff)

For my exercises, I want you to not worry about the Attributes field, and you may use a "." character in the "phase" field ("." means "I don't know/don't care").  The field I want you to pay attention to is the type field - when you need to refer to a type of sequence feature, I want you to look-up what the correct Sequence Ontology Term is. 

Finally, at the beginning of the GFF3 file, you must have a metadata line "##gff-version 3".  At the end of the GFF3, following the directive "##FASTA" on a single line, you **may** (optional) include the FASTA-formatted sequences corresponding to the features in the GFF file.  The FASTA identifier (in the >header of the FASTA record) should be identical to the identifier in Field #1 of the GFF.  For example:


    ##gff-version 3
    seq1.1   CutScan  restriction_enzyme_cut_site    3    8    .    +    .   ID=EcoRI_1;Name=EcoRI_CutSite_1
    ##FASTA
    >seq1.1
    ATAGAATTCTTGCGATGGCGAGTTTACCGGCGTAT
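
A file like this is simple enough to write with plain Ruby. The sketch below rebuilds the example above from a Hash; it is one possible approach, not a standard API. The method names `gff3_line` and `gff3_document` are invented here, and missing score, strand or phase values are written as "." as described earlier.

```ruby
# Illustration only: build GFF3 text from plain Ruby Hashes.
# ASSUMPTION: attribute values contain no characters that would need escaping.
def gff3_line(feature)
  attributes = feature[:attributes].map { |tag, value| "#{tag}=#{value}" }.join(';')
  [
    feature[:seqid], feature[:source], feature[:type],
    feature[:start], feature[:end],
    feature.fetch(:score, '.'), feature.fetch(:strand, '.'), feature.fetch(:phase, '.'),
    attributes
  ].join("\t")
end

def gff3_document(features, sequences = {})
  lines = ['##gff-version 3'] + features.map { |feature| gff3_line(feature) }
  unless sequences.empty?
    lines << '##FASTA'
    sequences.each { |id, seq| lines << ">#{id}" << seq }   # FASTA id must match field 1
  end
  lines.join("\n") + "\n"
end

features = [
  { seqid: 'seq1.1', source: 'CutScan', type: 'restriction_enzyme_cut_site',
    start: 3, end: 8, strand: '+',
    attributes: { 'ID' => 'EcoRI_1', 'Name' => 'EcoRI_CutSite_1' } }
]
puts gff3_document(features, { 'seq1.1' => 'ATAGAATTCTTGCGATGGCGAGTTTACCGGCGTAT' })
```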
    



<pre>
  

</pre>
  
# GFF and BioRuby


In BioRuby, we can **read** GFF3 (but we cannot write it... unfortunately...)

The four objects you need to understand are:

- [Bio::GFF](http://bioruby.org/rdoc/Bio/GFF.html) --> the parent object, responds to #records method call

- [Bio::GFF::GFF3](http://bioruby.org/rdoc/Bio/GFF/GFF3) - this represents a GFF **file** (all of the features, plus the metadata, plus the FASTA)

- [Bio::GFF::GFF3::Record](http://bioruby.org/rdoc/Bio/GFF/GFF3/Record) - this represents a single line of the 9-field GFF format - i.e. a single record in the file.

- [Bio::GFF::GFF2::Record](http://bioruby.org/rdoc/Bio/GFF/GFF2/Record) --> called by the GFF3::Record object - this gives you most of the method calls of a GFF Record.

There are a few frustrating aspects to the BioRuby GFF3 objects.  

1. They consume a String (not a file or filehandle or other streaming object)
2. They consume the entire string - all or nothing - so they use a LOT of memory!

Regardless, we will continue...

This is the code for parsing GFF into BioRuby objects:



In [None]:
require 'bio'

gff = <<END
##gff-version 3
seq1.1\tCutScan\trestriction_enzyme_cut_site\t3\t8\t.\t+\t.\tID=EcoRI_1;Name=EcoRI_CutSite_1
seq1.1\tCutScan\trestriction_enzyme_cut_site\t10\t15\t.\t+\t.\tID=EcoRI_2;Name=EcoRI_CutSite_2
##FASTA
>seq1.1
ATAGAATTCTTGCGATGGCGAGTTTACCGGCGTAT

END

record = Bio::GFF::GFF3.new(gff)

record.records.each do |feature|  # look at the Bio::GFF::GFF2 documentation to see the attributes
  puts feature.class   # I am surprised that this is not a Seq::Feature object...???  Anyway...
  puts feature.id  # a shortcut to the ID attribute (a requirement of GFF3)
  print "starts at nucleotide #{feature.start} until nucl. #{feature.end} "
  print "of type #{feature.feature} from #{feature.source} "
  puts  "with name #{feature.get_attribute('Name')}"  # NOTE you can call individual attributes by their name!
  puts feature.attributes_to_hash

  puts "\n\n"
end

puts "the original record:"
puts record.to_s
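
If the memory problem mentioned above becomes a real issue with a large file, one possible workaround is to read the GFF3 one line at a time with plain Ruby, without using the BioRuby objects. The sketch below is only an illustration: the method name `each_gff3_feature` is invented here, it returns raw arrays of nine strings (not BioRuby records), and it stops reading at the `##FASTA` directive. A `StringIO` stands in for a real file so that the example runs on its own.

```ruby
require 'stringio'

# Illustration only: yield the fields of each feature line, one line at a time,
# so the whole file never has to be held in memory.
def each_gff3_feature(io)
  return enum_for(:each_gff3_feature, io) unless block_given?

  io.each_line do |line|
    line = line.chomp
    break if line == '##FASTA'                     # the features end where the FASTA begins
    next if line.empty? || line.start_with?('#')   # skip blank lines, comments and directives
    yield line.split("\t")
  end
end

gff_text = "##gff-version 3\n" \
           "seq1.1\tCutScan\trestriction_enzyme_cut_site\t3\t8\t.\t+\t.\tID=EcoRI_1\n" \
           "##FASTA\n" \
           ">seq1.1\n" \
           "ATAGAATTCTTGCGATGGCGAGTTTACCGGCGTAT\n"

each_gff3_feature(StringIO.new(gff_text)) do |fields|
  puts "#{fields[2]} from #{fields[3]} to #{fields[4]} on #{fields[0]}"
end
```

With a real file you would pass an open File object (for example inside a `File.open(...) do |file| ... end` block) in place of the `StringIO`.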
