# First - read the manuscript

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2951089/



# What Can BioRuby Do?


* Sequence and feature (e.g. gene, exon, intron) creating/reading/writing
* Sequence database retrieval (e.g. embl, genbank, swissprot…)
* Sequence alignment (many formats: fasta, clustal, phylip…)
* Sequence and feature database creation/query
* Sequence search/analysis (e.g. blast, ‘virtual restrictions’, & many more)
* Sequence annotation
* Sequence assembly
* Physical mapping (e.g. clones, contigs, etc.)
* Taxonomy representation and database query/retrieval
* Bio-ontology manipulation/exploration (e.g. Gene Ontology)
* Phenotype representation and query (e.g. OMIM)
* Phylogeny analyses
* ...and much more


# We will start from the command-line

BioRuby provides its own command line (i.e. its own "terminal window")

    /Lectures/$  bioruby              < NOTE WHERE WE START!  IN THE LECTURES FOLDER!!!!
    Creating directory (/home/osboxes/.bioruby/shell/session) ... done
    Creating directory (/home/osboxes/.bioruby/shell/plugin) ... done
    Creating directory (/home/osboxes/.bioruby/data) ... done

    . . . B i o R u b y   i n   t h e   s h e l l . . .

      Version : BioRuby 1.5.1 / Ruby 2.4.2

    bioruby> 



## Demo of FASTQ Sequence quality masking

It is important to mask low-quality base calls before many downstream operations, such as alignment or SNP calling.

We will now take a sequence file (in <code> Lectures/files/SP1.fq </code>) and we will use the BioRuby command line to mask it.

The code to do this is:

---------------------------
    Bio::FlatFile.open('./files/SP1.fq').each do |entry|  # each sequence will be put into 'entry'
        hq = entry.mask(20)                               # mask bases whose quality score is below 20
        puts hq.output_fasta(entry.entry_id)              # print the masked sequence as FASTA, keeping the original entry ID
    end

-----------------
In BioRuby's command line it looks like this:


    bioruby> Bio::FlatFile.open('./files/SP1.fq').each do |entry|
    bioruby+   hq = entry.mask(20)               < NOTE:  'bioruby+' means that it is waiting for more input
    bioruby+   puts hq.output_fasta(entry.entry_id)
    bioruby+ end
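
To make clear what the mask(20) call above is doing, here is a minimal plain-Ruby sketch of quality masking. It is not BioRuby's implementation: the helper name mask_low_quality is hypothetical, the read and quality string are invented, and it assumes Sanger (Phred+33) quality encoding and that "low quality" means a score strictly below the threshold.

```ruby
# Plain-Ruby sketch of quality masking (no BioRuby required).
# Assumption: qualities are Sanger/Phred+33 encoded, so score = ASCII code - 33.
def mask_low_quality(bases, quality_string, threshold = 20, mask_char = 'n')
  scores = quality_string.bytes.map { |byte| byte - 33 }
  bases.chars.zip(scores).map { |base, score|
    score < threshold ? mask_char : base   # keep the base only if it is good enough
  }.join
end

read    = 'ACGTACGT'   # invented read
quality = 'IIII!!5I'   # 'I' = Q40, '!' = Q0, '5' = Q20

puts mask_low_quality(read, quality)   # ACGTnnGT
```

The two bases with quality 0 are replaced by 'n'; the base with quality exactly 20 is kept, because only scores below the threshold are masked in this sketch.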
    




# Command line as a debugger

One possible use of the BioRuby command line is as a debugger - you can explore an object "live" until you decide what command you want to use, then you put that command into your code.


# BioRuby - Documentation

http://bioruby.org/rdoc/

# Sequences and Sequence Features

We will focus on Sequences in this class.  
* how to retrieve them
* how to write them to files
* how to change their formats
* how to explore their features (e.g. genes)
* how to create databases from them, to make them easier to find/query

# BioRuby's Sequence objects

Like all other Ruby objects, you create a new Sequence object using "new".

First, look at the documentation for Bio::Sequence, Bio::Sequence::AA, and Bio::Sequence::NA (start at http://bioruby.org/rdoc/Bio/Sequence.html, and find the other two yourself :-) )



In [33]:
require 'bio'

puts "MYSEQ0 = Bio::Sequence.new('ACTTTGC')"
myseq0 = Bio::Sequence.new("ACTTTGC")
puts myseq0.class
puts myseq0.seq
puts myseq0.seq.class
puts

puts "MYSEQ1 = Bio::Sequence.auto('ACTTTGC')"
myseq1 = Bio::Sequence.auto("ACTTTGC")
puts myseq1.class
puts myseq1.seq
puts myseq1.seq.class
puts

puts "MYSEQ2 = Bio::Sequence::NA.new('ACTTTGC')"
myseq2 = Bio::Sequence::NA.new("ACTTTGC")
puts myseq2.class
puts myseq2.seq
puts myseq2.seq.class
puts

#puts myseq2.public_methods.join("\n")
puts "myseq2 = #{myseq2}"

myseq3 = myseq2.reverse_complement
puts "myseq3 = #{myseq3}  (reverse complement)"

myseq2.reverse_complement!    #   <---- note the ! ->  When you see this in Ruby, it means the method will change the object
puts "myseq2 = #{myseq2} (reverse complement!)"

# this is the better way to do it:
puts myseq2.to_s    # to_s means "to string" - you will see this in a lot of Ruby objects


MYSEQ0 = Bio::Sequence.new('ACTTTGC')
Bio::Sequence
ACTTTGC
String

MYSEQ1 = Bio::Sequence.auto('ACTTTGC')
Bio::Sequence
actttgc
Bio::Sequence::NA

MYSEQ2 = Bio::Sequence::NA.new('ACTTTGC')
Bio::Sequence::NA
actttgc
Bio::Sequence::NA

myseq2 = actttgc
myseq3 = gcaaagt  (reverse complement)
myseq2 = gcaaagt (reverse complement!)
gcaaagt
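
The same two ideas (reverse complement, and the "!" convention for methods that change the object) can be tried on an ordinary Ruby String. This is only a sketch using core String methods, not how BioRuby implements reverse_complement; the variable names are made up for the example.

```ruby
# Reverse complement with core String methods only.
dna = 'actttgc'
revcomp = dna.reverse.tr('acgt', 'tgca')   # reverse, then swap a<->t and c<->g
puts revcomp                               # gcaaagt
puts dna                                   # actttgc  (the original is untouched)

# The "!" versions of String methods modify the object in place.
copy = dna.dup
copy.reverse!
puts copy                                  # cgtttca
```

The result gcaaagt agrees with the BioRuby output shown above for the same input.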


# Limitations

* BioRuby does not have built-in connections to every database on earth (e.g. no connection to TAIR or AraPort)
 * You will still need to use REST interfaces and regular expressions...often!
* BioRuby does not recognize every kind of sequence identifier (e.g. At1g287748 means nothing to BioRuby)
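
As an illustration of the "regular expressions" point, the sketch below pulls locus identifiers out of free text. The pattern is an assumption about the usual Arabidopsis AGI locus-code layout (AT, then a chromosome symbol 1-5, C or M, then G and five digits); the constant name and the input text are invented for the example.

```ruby
# Assumed AGI locus-code layout: AT + chromosome (1-5, C, M) + G + five digits.
AGI_PATTERN = /\bAT[1-5CM]G\d{5}\b/i

text = 'Candidate loci: At3g54340, AT5G20240 and not_a_gene.'   # invented text
ids  = text.scan(AGI_PATTERN).map(&:upcase)                     # normalise the case

puts ids.inspect   # ["AT3G54340", "AT5G20240"]
```

A check like this is also a cheap way to validate an identifier before sending it to a REST interface.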



# Sequence Object Methods



In [48]:
aaseq = Bio::Sequence.auto("BBSTD")
puts aaseq.seq.class
puts aaseq.seq.public_methods.sort
puts
aaseq = Bio::Sequence.auto("AAAACCCCCCCTTTTTTTGGGGGG")
puts aaseq.seq.class
puts aaseq.seq.public_methods.sort



Bio::Sequence::AA
[:!, :!=, :!~, :%, :*, :+, :+@, :-@, :<, :<<, :<=, :<=>, :==, :===, :=~, :>, :>=, :[], :[]=, :__id__, :__send__, :ascii_only?, :b, :between?, :bytes, :bytesize, :byteslice, :capitalize, :capitalize!, :casecmp, :casecmp?, :center, :chars, :chomp, :chomp!, :chop, :chop!, :chr, :clamp, :class, :clear, :clone, :codepoints, :codes, :composition, :concat, :count, :crypt, :define_singleton_method, :delete, :delete!, :display, :downcase, :downcase!, :dump, :dup, :each_byte, :each_char, :each_codepoint, :each_line, :empty?, :encode, :encode!, :encoding, :end_with?, :enum_for, :eql?, :equal?, :extend, :force_encoding, :freeze, :frozen?, :getbyte, :gsub, :gsub!, :hash, :hex, :include?, :index, :insert, :inspect, :instance_eval, :instance_exec, :instance_of?, :instance_variable_defined?, :instance_variable_get, :instance_variable_set, :instance_variables, :intern, :is_a?, :itself, :kind_of?, :length, :lines, :ljust, :lstrip, :lstrip!, :match, :match?, :method, :methods, :molecula
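
The list above is long because a sequence object also carries every method of an ordinary Ruby String (note names such as capitalize and chomp). One way to shorten it is to subtract the methods every String already has. The sketch below shows the idea on a hypothetical ToySequence class, used here only so that the example runs without BioRuby.

```ruby
# ToySequence is a hypothetical stand-in for a String-based sequence class.
class ToySequence < String
  def gc_count
    count('gc')
  end
end

seq = ToySequence.new('atgc')

# Everything the object can do, minus what any String can do.
added = seq.public_methods - String.instance_methods
puts added.inspect
```

On a real object the same subtraction would be written as aaseq.seq.public_methods - String.instance_methods.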

## writing sequences

* :output 
* :list_output_formats
* :output_fasta

Note that the documentation for Bio::Sequence says:

    Included Modules

        Bio::Sequence::Format
        Bio::Sequence::SequenceMasker 

That's where these "output" methods are coming from...



In [16]:
aaseq.list_output_formats

[:fastq, :fasta, :raw, :fasta_ncbi, :fastq_sanger, :fastq_solexa, :fastq_illumina, :fasta_numeric, :qual]

In [22]:
puts aaseq.output_fasta    # ask for fasta
puts aaseq.output    # the default is fasta

>. 
BBSTD

>. 
BBSTD



In [43]:
puts aaseq.output(:fastq)   # specify that you want fastq


@. 
BBSTD
+
!!!!!
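
To show what a FASTA record consists of (a ">" header line followed by the sequence, usually wrapped to a fixed width), here is a plain-Ruby sketch. The helper to_fasta and the identifier demo_seq are hypothetical; in real code you would use the BioRuby output methods shown above.

```ruby
# Build a FASTA record by hand: header line, then the sequence wrapped to 'width' columns.
def to_fasta(id, sequence, width = 60)
  lines = sequence.scan(/.{1,#{width}}/)   # cut the sequence into chunks
  ([">#{id}"] + lines).join("\n") + "\n"
end

record = to_fasta('demo_seq', 'ACGT' * 20, 30)   # 80 bases, 30 per line
puts record
```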




## searching sequences 

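A search pattern can contain IUPAC ambiguity codes (see:  https://www.bioinformatics.org/sms/iupac.html)

In BioRuby, this is accomplished by converting the pattern sequence into a regular expression using .to_re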



In [64]:
seq = Bio::Sequence::NA.new("actgggggatccc")
search = Bio::Sequence::NA.new("ATMM")   # M indicates it is an A or C

re = Regexp.new(search.to_re)
puts re.source   # to see the content of a regular expression call .source

match = seq.seq.match(re)
puts match



at[acm][acm]
atcc
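
The idea behind .to_re can be sketched in plain Ruby: each IUPAC code is replaced by a character class listing the bases it stands for. This is a simplified illustration, not BioRuby's code; the IUPAC table and the helper iupac_to_regexp are written for this example. Note that BioRuby's own output above also includes the ambiguity letter itself in each class, for example [acm] rather than [ac].

```ruby
# Standard IUPAC nucleotide codes mapped to regular-expression fragments.
IUPAC = {
  'a' => 'a', 'c' => 'c', 'g' => 'g', 't' => 't',
  'r' => '[ag]', 'y' => '[ct]', 'm' => '[ac]', 'k' => '[gt]',
  's' => '[cg]', 'w' => '[at]', 'b' => '[cgt]', 'd' => '[agt]',
  'h' => '[act]', 'v' => '[acg]', 'n' => '[acgt]'
}

def iupac_to_regexp(pattern)
  Regexp.new(pattern.downcase.chars.map { |code| IUPAC.fetch(code) }.join)
end

re = iupac_to_regexp('ATMM')
puts re.source              # at[ac][ac]
puts 'actgggggatccc'[re]    # atcc
```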


## Other common sequence methods
(from the BioRuby Tutorial)

    seq = Bio::Sequence::NA.new("atgcatgcaaaa")
    puts seq.complement

    puts seq.subseq(3,8) # gets subsequence of positions 3 to 8 (starting from 1)
    puts seq.gc_percent 
    puts seq.composition 
    puts seq.translate 
    puts seq.translate(2)        # translate from frame 2
    puts seq.translate(1,11)     # codon table 11
    puts seq.translate.codes
    puts seq.translate.names
    puts seq.translate.composition
    puts seq.translate.molecular_weight
    puts seq.complement.translate



ttttgcatgcat
gcatgc
33
{"a"=>6, "t"=>2, "g"=>2, "c"=>2}
MHAK
CMQ
MHAK
["Met", "His", "Ala", "Lys"]
["methionine", "histidine", "alanine", "lysine"]
{"M"=>1, "H"=>1, "A"=>1, "K"=>1}
485.6050000000001
FCMH
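
A few of these results can be checked by hand with core String methods, which helps when reading the numbers above. This is a plain-Ruby sketch, not BioRuby code; in particular, truncating the GC percentage to an integer is an assumption chosen to match the output shown above.

```ruby
seq = 'atgcatgcaaaa'

# subseq(3,8): positions are counted from 1, Ruby string indices from 0.
puts seq[(3 - 1)..(8 - 1)]                 # gcatgc

# GC percentage, truncated to an integer (assumption, see lead-in).
gc_percent = (100.0 * seq.count('gc') / seq.length).to_i
puts gc_percent                            # 33

# Codons read in frame 1 and frame 2; leftover bases are dropped.
puts seq.scan(/.../).inspect               # ["atg", "cat", "gca", "aaa"]
puts seq[1..-1].scan(/.../).inspect        # ["tgc", "atg", "caa"]
```

The codon lists line up with the translations MHAK (frame 1) and CMQ (frame 2) in the output above.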
