# Introduction to Bioconductor

The Bioconductor project is an open-source repository of R packages, datasets, and workflows for analyzing biological data. It complements CRAN (the Comprehensive R Archive Network) by providing software tools to explore, understand, and solve both simple and complex molecular biology questions. Hence Bioconductor's tagline: "open source software for bioinformatics".

Molecular biology questions usually concern the structure or the function of an organism's building blocks, and very often how those building blocks interact. Examining and analyzing molecular structure is therefore a key step in identifying what a given component does.

We are interested in locating and describing specific positions in a genome because this lets us learn about diversity, evolution, hereditary changes, and more. To make this easier, we subdivide the genome. The information in a genome is written in the DNA alphabet. Think of a genome as a set of books in which each book is a chromosome. The number of chromosomes varies widely between organisms. Chromosomes usually come in pairs, but more than two sets are also common. Each chromosome holds ordered genetic sequences, like the chapters of a book. To find specific genetic instructions we look at genes, which are like pages, each containing a recipe, typically for making a protein.

Some genes produce proteins and some do not; these are called coding and non-coding genes, respectively. Coding genes are expressed through proteins that carry out specific functions. A protein is produced in two steps: transcription (DNA to RNA) followed by translation (RNA to protein).

## Central Dogma of Molecular Biology

The Central Dogma states that once "information" has passed into protein it cannot get out again. In more detail, the transfer of information from nucleic acid to nucleic acid, or from nucleic acid to protein, may be possible, but transfer from protein to protein, or from protein to nucleic acid, is impossible. Here, information means the precise determination of sequence, either of bases in the nucleic acid or of amino acid residues in the protein.

![Central Dogma](https://ib.bioninja.com.au/_Media/central-dogma_med.jpeg)

## Installing Bioconductor

In [1]:
# install.packages("BiocManager")  # install the Bioconductor package manager (run once)

# BiocManager::install()  # install or update the core Bioconductor packages

paste0("The number of available packages is: ", length(BiocManager::available()))  # count the packages available for installation

'getOption("repos")' replaces Bioconductor standard repositories, see
'?repositories' for details

replacement repositories:
    CRAN: https://cran.r-project.org
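
Before installing anything, it can help to check which of the packages used in this notebook are already present. The following is a minimal sketch in base R; `missing_pkgs()` is a hypothetical helper written for this example, not a BiocManager function, and the installation calls are left commented out so that nothing is downloaded by accident.

```r
# Hypothetical helper (not part of BiocManager): return the packages that are not installed.
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages used in this notebook
needed <- c("BiocManager", "Biostrings",
            "BSgenome.Scerevisiae.UCSC.sacCer3", "org.Hs.eg.db")

to_install <- missing_pkgs(needed)
to_install

# Uncomment to install whatever is missing:
# if ("BiocManager" %in% to_install) install.packages("BiocManager")
# BiocManager::install(setdiff(to_install, "BiocManager"))
```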



In [2]:
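# BiocManager::install("BSgenome.Scerevisiae.UCSC.sacCer3")  # run once to install
library(BSgenome.Scerevisiae.UCSC.sacCer3)  # yeast (S. cerevisiae) genome, UCSC build sacCer3

yeast <- BSgenome.Scerevisiae.UCSC.sacCer3

class(yeast)

# length() counts every sequence in the object, which may include the mitochondrial genome (chrM)
paste0("The number of chromosomes in yeast genome: ", length(yeast))

print("Chromosomes : ")

seqlengths(yeast)  # named vector of sequence lengths, in bases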

[1] "Chromosomes : "


In [3]:
getSeq(yeast, "chrXI")

getSeq(yeast, "chrXI", end = 13) # first 13 bases

  666816-letter "DNAString" instance
seq: CACCACACCCACACACCACACCCACACACACACCAC...GTGGTGTGTGTGTGGGTGTGGGTGTGGTGTGTGTGT

  13-letter "DNAString" instance
seq: CACCACACCCACA

## Biostrings

The Biostrings package provides classes and functions for representing biological strings such as DNA, RNA, and amino acid sequences. It also offers pattern matching (for example, short read alignment) and a pairwise alignment function implementing Smith-Waterman local alignment and Needleman-Wunsch global alignment, the classic sequence alignment algorithms (see Durbin et al. 1998 for a description). There are also functions for reading and writing files in formats such as FASTA.

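In [4]:
DNA <- DNAString("AGTCAT")

DNA

RNAString(DNA)  # transcription: same sequence with T replaced by U (DNA taken as the coding strand)

translate(DNA)  # translation: each codon (3 bases) becomes one amino acid

translate(DNA) == translate(RNAString(DNA))  # translating the DNA or its RNA gives the same protein

complement(DNA)  # complementary strand (A<->T, C<->G)

reverseComplement(DNA)  # complementary strand read in the opposite direction

reverse(complement(DNA)) == reverseComplement(DNA)  # reversing the complement gives the reverse complement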


  6-letter "DNAString" instance
seq: AGTCAT

  6-letter "RNAString" instance
seq: AGUCAU

  2-letter "AAString" instance
seq: SH

  6-letter "DNAString" instance
seq: TCAGTA

  6-letter "DNAString" instance
seq: ATGACT
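
To make explicit what these calls compute, here is a base-R sketch that works on plain character strings and runs without Bioconductor. The function names (`transcribe_chr()`, `translate_chr()`, `revcomp_chr()`) are made up for this illustration, the codon table is the standard genetic code written out by hand, and the input is assumed to be upper-case A, C, G, T only. In real analyses, prefer the Biostrings functions above.

```r
# Standard genetic code; codons are ordered over bases T, C, A, G (first base varies slowest)
bases <- c("T", "C", "A", "G")
grid <- expand.grid(b3 = bases, b2 = bases, b1 = bases, stringsAsFactors = FALSE)
codon_table <- setNames(
  strsplit("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", "")[[1]],
  paste0(grid$b1, grid$b2, grid$b3)
)

# Transcription (coding-strand view): replace T with U
transcribe_chr <- function(dna) chartr("T", "U", dna)

# Translation: split into codons and look each one up ("*" marks a stop codon)
translate_chr <- function(dna) {
  n_codons <- nchar(dna) %/% 3          # trailing bases that do not fill a codon are ignored
  if (n_codons == 0) return("")
  starts <- (seq_len(n_codons) - 1) * 3 + 1
  codons <- substring(dna, starts, starts + 2)
  paste(codon_table[codons], collapse = "")
}

# Reverse complement: swap A<->T and C<->G, then reverse the string
revcomp_chr <- function(dna) {
  comp <- chartr("ACGT", "TGCA", dna)
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

transcribe_chr("AGTCAT")  # "AGUCAU"
translate_chr("AGTCAT")   # "SH"
revcomp_chr("AGTCAT")     # "ATGACT"
```

The three results agree with the `RNAString`, `AAString`, and reverse-complement outputs printed above.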

### Pattern-matching in Biostrings

Sequence patterns in DNA help us find interesting features such as sequence repeats, stop codons, poly-A tails, conserved sequences, binding sites, and more. When analyzing sequence patterns, our goal is to discover how often they occur, how regularly they are spaced (periodicity), and how long they are.

Common questions answered by sequence pattern matching include: where does a gene start, where does a protein end, which regions make a gene expressed or silent, which regions are conserved between organisms, and what is the overall genetic variation?

In [5]:
XI <- getSeq(yeast, "chrXI")  # chromosome XI as a DNAString

matchPattern("ATTGCATGCA", XI)  # exact matches on the given strand

  Views on a 666816-letter DNAString subject
subject: CACCACACCCACACACCACACCCACACACACACC...GGTGTGTGTGTGGGTGTGGGTGTGGTGTGTGTGT
views:
     start    end width
[1]  23542  23551    10 [ATTGCATGCA]
[2] 162051 162060    10 [ATTGCATGCA]
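
The three quantities mentioned above (how often a pattern occurs, how regularly it is spaced, and how long it is) can be read off the match positions. Below is a small base-R sketch on a toy sequence; `find_starts()` is a hypothetical helper, not a Biostrings function, and it assumes the pattern contains only plain letters (no regular-expression metacharacters). On the `Views` object returned by `matchPattern()`, the accessors `start()` and `width()` give the same information.

```r
# Hypothetical helper: start positions of all (possibly overlapping) exact matches.
# A zero-width lookahead lets the regex engine report overlapping hits.
find_starts <- function(pattern, subject) {
  hits <- gregexpr(paste0("(?=", pattern, ")"), subject, perl = TRUE)[[1]]
  if (hits[1] == -1) integer(0) else as.integer(hits)
}

toy <- "ATGCATGCATTTATGCA"   # made-up sequence for illustration
starts <- find_starts("ATGCA", toy)

starts            # match positions: 1 5 13
length(starts)    # occurrence frequency: 3
diff(starts)      # spacing between consecutive matches: 4 8
nchar("ATGCA")    # match length: 5
```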

In [6]:
findPalindromes(XI)

  Views on a 666816-letter DNAString subject
subject: CACCACACCCACACACCACACCCACACACACACC...GGTGTGTGTGTGGGTGTGGGTGTGGTGTGTGTGT
views:
        start    end width
   [1]    222    230     9 [TAACCGTTA]
   [2]    458    465     8 [GTATATAC]
   [3]    827    836    10 [ATGACGTCAT]
   [4]    853    860     8 [TAATATTA]
   [5]    887    894     8 [AAATATTT]
   ...    ...    ...   ... ...
[6322] 665629 665639    11 [ATATGACATAT]
[6323] 665676 665684     9 [AAATTATTT]
[6324] 666204 666212     9 [ATGGGCCAT]
[6325] 666249 666256     8 [GTATATAC]
[6326] 666477 666485     9 [TAACGGTTA]

Enzymes such as restriction enzymes must recognize a very specific sequence to carry out their task, and they bind to the DNA in only one specific configuration. Luckily so, because you don't want a 'pacman' that cuts DNA at random places.

DNA is double-stranded, so it has two 'sides' to which an enzyme can bind. A palindromic sequence reads the same on both strands when each strand is read in its own 5'-to-3' direction; in other words, it equals its own reverse complement (see image below). This means that the enzyme recognizes the sequence no matter which side it approaches the DNA from.

A palindromic sequence also increases the chance that both strands of DNA are cut. It is even possible that two enzymes work as a dimer to cut the palindromic sequence, further increasing efficiency.
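
In the biological sense used here, "palindromic" means that a sequence equals its own reverse complement, not that its letters read the same backwards. A minimal base-R sketch of that check follows; `is_rc_palindrome()` is a hypothetical helper for illustration and assumes upper-case A, C, G, T input. It accepts only perfect palindromes, whereas `findPalindromes()` can, depending on its arguments, also report two matching arms separated by a short loop, which would explain odd-width hits such as `TAACCGTTA` in the output above.

```r
# Hypothetical helper: TRUE if a DNA string equals its own reverse complement.
is_rc_palindrome <- function(dna) {
  comp <- chartr("ACGT", "TGCA", dna)                       # complement each base
  rc <- paste(rev(strsplit(comp, "")[[1]]), collapse = "")  # read it in the other direction
  identical(dna, rc)
}

is_rc_palindrome("ATGACGTCAT")  # TRUE: hit [3] in the output above
is_rc_palindrome("GTATATAC")    # TRUE: hit [2] in the output above
is_rc_palindrome("TAACCGTTA")   # FALSE: the arms TAAC/GTTA pair up, but the central C does not
is_rc_palindrome("AGTCAT")      # FALSE: its reverse complement is ATGACT
```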

## Gene annotation

DNA annotation, or genome annotation, is the process of identifying the locations of genes and all of the coding regions in a genome and determining what those genes do. This information falls into several classes of data, and each class has its own database. Each database also uses its own ID for a given gene, so the keys must be mapped to one another before you can quickly gather all of the information about a particular gene.

In [7]:
# BiocManager::install("org.Hs.eg.db")  # run once to install the human annotation package
library(org.Hs.eg.db)

uniprot_keys <- keys(org.Hs.eg.db, keytype = "UNIPROT")  # all UniProt IDs known to the database
head(uniprot_keys)
length(uniprot_keys)

# Map each UniProt ID to its gene symbol and OMIM ID
head(select(org.Hs.eg.db, keys = uniprot_keys, keytype = "UNIPROT", columns = c("SYMBOL", "OMIM")))

'select()' returned 1:many mapping between keys and columns


| UNIPROT | SYMBOL | OMIM |
|---|---|---|
| P04217 | A1BG | 138670 |
| V9HWD8 | A1BG | 138670 |
| P01023 | A2M | 103950 |
| P01023 | A2M | 104300 |
| P01023 | A2M | 614036 |
| P18440 | NAT1 | 108345 |
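
The "1:many mapping" message means that some keys appear on more than one row, as `P01023` does above with three OMIM IDs. One way to work with such results is to collapse them to one row per key. The sketch below does this in base R on a data frame typed in from the six rows shown above, so it runs without the annotation package; the `;` separator is an arbitrary choice.

```r
# The six rows printed above, entered by hand
ann <- data.frame(
  UNIPROT = c("P04217", "V9HWD8", "P01023", "P01023", "P01023", "P18440"),
  SYMBOL  = c("A1BG", "A1BG", "A2M", "A2M", "A2M", "NAT1"),
  OMIM    = c("138670", "138670", "103950", "104300", "614036", "108345"),
  stringsAsFactors = FALSE
)

# How many distinct OMIM IDs does each UniProt key map to?
omim_per_key <- tapply(ann$OMIM, ann$UNIPROT, function(x) length(unique(x)))
omim_per_key

# Collapse to one row per (UNIPROT, SYMBOL) pair, joining the OMIM IDs with ";"
collapsed <- aggregate(OMIM ~ UNIPROT + SYMBOL, data = ann,
                       FUN = function(x) paste(unique(x), collapse = ";"))
collapsed
```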


In [8]:
help("SYMBOL")

Credit to [Mohmed AbdelFattah](https://www.youtube.com/channel/UC_nDXDXs0j_-1PgUYcmtf1w)