<a href="https://colab.research.google.com/github/nunososorio/bhs/blob/main/NSO_PracticalClass_II.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Practical Training: Omics, Bioinformatics, and Pharmacogenomics

By **Nuno S. Osório** 🖋️

👋 Welcome to this tutorial! We will explore how to access and use omics-related databases. We will focus on retrieving data from **BioMart** using the pybiomart Python package, and then analyze the retrieved data. This tutorial is designed to run in Jupyter notebook environments and includes exercises that involve running Python code. 🐍💻

You can access an interactive cloud version of the notebook here (https://colab.research.google.com/github/nunososorio/bhs/blob/main/NSO_PracticalClass_II.ipynb).

Let's dive in! 🏊‍♂️


# Introduction to Omics and Its Role in Drug Development 🧪💊

Welcome to this practical exploration of omics and its significant role in drug development. 🎓🔬

## Omics 🧬

Omics is an integrative field of study in biology that encompasses disciplines such as genomics, proteomics, and metabolomics. The goal of omics is to collectively characterize and quantify pools of biological molecules, providing a holistic view of the structure, function, and dynamics of an organism. 🌐

## Omics in Drug Development 💡

In drug development, omics technologies play a crucial role in understanding the molecular mechanisms of diseases. This molecular-level understanding aids in the identification of potential drug targets and the development of effective therapeutic agents. 💊

## Databases in Omics Research 🗃️

Bioinformatics databases are essential resources in omics research. Databases such as [Ensembl](https://www.ensembl.org/index.html) and [UniProt](https://www.uniprot.org/) store a wealth of omics data. Accessing and analyzing these data is a critical step in drug discovery and development. 📚

## Programmatic Access to Databases 💻

For large-scale and reproducible analysis, programmatic access to these databases is often more efficient than manual data retrieval through web interfaces. Python, a popular language in bioinformatics, offers several libraries for this purpose. For instance, the pybiomart library provides an interface to Ensembl's BioMart service, enabling the retrieval and analysis of omics data directly within Python scripts. 🐍

In the following sections of this tutorial, we will delve deeper into how to use this Python library for accessing and analyzing omics data. 🚀


## Setup
First, we need to install the necessary Python package. Run the following command in your environment:


In [None]:
!pip install pybiomart


Now, import the library:

In [11]:
from pybiomart import Server

## Explore

Let's start by listing all the types of data we can retrieve from the Ensembl database using the pybiomart library:

In [None]:
# Connect to the server
server = Server(host='http://www.ensembl.org')

# Select the human dataset
dataset = server.marts['ENSEMBL_MART_ENSEMBL'].datasets['hsapiens_gene_ensembl']

# List all available attributes
attributes = dataset.list_attributes()
attributes



As you can see, it returns a table listing the attributes (3205) that can be retrieved for each entry.

Here’s a brief explanation of some of the attributes:

- **'ensembl_gene_id'**: The Ensembl ID of the gene.
- **'ensembl_transcript_id'**: The Ensembl ID of the transcript.
- **'description'**: The description of the gene.
- **'chromosome_name'**: The name of the chromosome where the gene is located.
- **'start_position'**: The start position of the gene on the chromosome.
- **'end_position'**: The end position of the gene on the chromosome.
- **'strand'**: The strand of the gene on the chromosome.
- **'transcript_count'**: The number of transcripts of the gene.
- **'percentage_gene_gc_content'**: The percentage of GC content in the gene.
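
With several thousand attributes, scanning the table by eye is impractical. Below is a minimal sketch of a keyword search, assuming `list_attributes()` returns a pandas DataFrame with `name`, `display_name`, and `description` columns. The helper `search_table` and the three-row toy table are illustrative stand-ins (not part of pybiomart) so that the cell runs without a network connection.

```python
import pandas as pd


def search_table(table: pd.DataFrame, keyword: str) -> pd.DataFrame:
    """Return the rows in which any column contains `keyword` (case-insensitive)."""
    mask = pd.Series(False, index=table.index)
    for column in table.columns:
        # Treat missing values as empty text so they never match.
        text = table[column].fillna("").astype(str)
        mask |= text.str.contains(keyword, case=False, regex=False)
    return table[mask]


# Toy stand-in for the attribute table (assumed column layout).
toy_attributes = pd.DataFrame({
    "name": ["ensembl_gene_id", "chromosome_name", "percentage_gene_gc_content"],
    "display_name": ["Gene stable ID", "Chromosome/scaffold name", "Gene % GC content"],
    "description": ["", "", ""],
})

# Find every attribute that mentions GC content.
gc_rows = search_table(toy_attributes, "gc")
print(gc_rows)
```

In the notebook, `search_table(dataset.list_attributes(), 'gc')` would apply the same search to the real table, and the helper should work the same way on the table returned by `dataset.list_filters()`.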



We can query the database by selecting the attributes and filters of interest. The available filters are:

In [None]:
# List all available filters
filters = dataset.list_filters()
filters


To get information on a single gene, we can query the attributes of interest and keep the row whose Ensembl gene ID matches:

In [21]:
from pybiomart import Server

# Connect to the server
server = Server(host='http://www.ensembl.org')

# Select the dataset
dataset = server.marts['ENSEMBL_MART_ENSEMBL'].datasets['hsapiens_gene_ensembl']

# Define the gene id
gene_id = 'ENSG00000001626'  # replace with your gene ID

# Define the attributes
attributes = ['ensembl_gene_id', 'external_gene_name']

# Query the dataset (returns a pandas DataFrame)
all_genes = dataset.query(attributes=attributes)

# Select the gene by its stable ID in pandas; passing an Ensembl stable ID
# to the 'gene_id' filter returns an empty table
result = all_genes[all_genes['Gene stable ID'] == gene_id]

# Print the result
print(result)


In [14]:
# Define the gene id
gene_id = 'ENSG00000001626'

# Define the attributes
attributes = ['ensembl_gene_id', 'external_gene_name', 'description', 'chromosome_name', 'start_position', 'end_position', 'percentage_gene_gc_content', 'gene_biotype']

# Query the dataset, then keep the row whose stable ID matches
# (the 'gene_id' filter returns an empty table for Ensembl stable IDs)
all_genes = dataset.query(attributes=attributes)
result = all_genes[all_genes['Gene stable ID'] == gene_id]

result


In [24]:
from pybiomart import Server

# Connect to the server
server = Server(host='http://www.ensembl.org')

# Select the dataset
dataset = server.marts['ENSEMBL_MART_ENSEMBL'].datasets['hsapiens_gene_ensembl']

# Define the gene id (use the full Ensembl stable ID, not just its numeric part)
gene_id = 'ENSG00000248333'  # replace with your gene ID

# Define the attributes
attributes = ['ensembl_gene_id', 'external_gene_name', 'description', 'chromosome_name', 'start_position', 'end_position', 'percentage_gene_gc_content', 'gene_biotype']

# Query the dataset, then keep the row whose stable ID matches
all_genes = dataset.query(attributes=attributes)
result = all_genes[all_genes['Gene stable ID'] == gene_id]

# Print the result
print(result)


Printed output like this is not easy to read. Don't worry: `dataset.query` already returns a pandas DataFrame, so ending a cell with the DataFrame's name displays it as a formatted table that is easier to read and manipulate. Here’s how you can do it:

In [None]:
import pandas as pd
# 'result' is already a pandas DataFrame; keep it under a descriptive name.
genes_df = pd.DataFrame(result)

# Display the DataFrame.
genes_df
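
When the table holds a single gene with many attributes, a wide one-row display can still be awkward. One option, sketched here under the assumption that the columns carry the display names seen in the query output (`Gene stable ID`, `Gene name`, and so on), is to transpose the table so that each attribute gets its own line. The toy DataFrame below is a hand-written stand-in for a real query result, limited to a few columns so the cell runs offline.

```python
import pandas as pd

# Toy stand-in for a one-gene query result (hand-written, not fetched from BioMart).
genes_df = pd.DataFrame({
    "Gene stable ID": ["ENSG00000001626"],
    "Gene name": ["CFTR"],
    "Chromosome/scaffold name": ["7"],
    "Gene type": ["protein_coding"],
})

# Use the stable ID as the index, then transpose:
# attributes become rows and the gene becomes the single column.
gene_card = genes_df.set_index("Gene stable ID").T
print(gene_card)
```

The same two calls should work unchanged on a multi-gene table, giving one column per gene for a side-by-side view.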


## Exercise 1: Exploring Gene Information

Now that we have our data in a more readable format, let's start exploring it. For this exercise, you will retrieve and analyze information about a specific gene.

**Task:**

1. Retrieve information about the gene with the Ensembl ID 'ENSG00000139618'.
2. Display the information in a readable format.
3. Analyze the information and answer the following questions:
    - What is the name of the gene?
    - On which chromosome is the gene located?
    - What is the start and end position of the gene on the chromosome?
    - What is the percentage of GC content in the gene?
    - What is the biotype of the gene?
    - What is the status of the gene in the BioMart database?

Write your code in the cell below:

In [None]:
# Write your code here


## Exercise 2: Comparing Genes

In this exercise, you will compare two genes based on their GC content and biotype.

**Task:**

1. Retrieve information about the genes with the Ensembl IDs 'ENSG00000139618' and 'ENSG00000248333'.
2. Compare the two genes based on their GC content and biotype.
3. Answer the following questions:
    - Which gene has a higher GC content?
    - Do the two genes have the same biotype?
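
As a hint for the comparison step, here is a minimal pandas sketch of selecting several genes at once and comparing two of their columns. It assumes the column display names seen in the query output (`Gene stable ID`, `Gene % GC content`, `Gene type`). The gene IDs and GC values are made-up placeholders, not real Ensembl records, so the answers to the exercise are still yours to find.

```python
import pandas as pd

# Made-up rows: the IDs and GC values are placeholders, not real Ensembl data.
toy_genes = pd.DataFrame({
    "Gene stable ID": ["TOY_GENE_A", "TOY_GENE_B", "TOY_GENE_C"],
    "Gene % GC content": [41.5, 52.0, 38.2],
    "Gene type": ["protein_coding", "lncRNA", "protein_coding"],
})

ids_of_interest = ["TOY_GENE_A", "TOY_GENE_B"]

# Keep only the rows for the genes being compared.
selected = toy_genes[toy_genes["Gene stable ID"].isin(ids_of_interest)]

# ID of the selected gene with the highest GC content.
highest_gc_id = selected.loc[selected["Gene % GC content"].idxmax(), "Gene stable ID"]

# True only when all selected genes share a single biotype.
same_biotype = selected["Gene type"].nunique() == 1

print(selected)
print("Highest GC content:", highest_gc_id)
print("Same biotype:", same_biotype)
```

Replacing `toy_genes` with a real query result and `ids_of_interest` with the two Ensembl IDs from the task should give the comparison the exercise asks for.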

Write your code in the cell below:

In [None]:
# Write your code here


## Scenario-Based Exercise: Investigating a Genetic Disorder

Imagine you are a geneticist investigating a rare genetic disorder. You suspect that the disorder is caused by a mutation in one of two genes: 'ENSG00000139618' or 'ENSG00000248333'.

You decide to use the BioMart database to gather more information about these two genes.

**Task:**

1. Retrieve information about the genes 'ENSG00000139618' and 'ENSG00000248333'.
2. Analyze the information and answer the following questions:
    - On which chromosomes are the genes located?
    - What are the start and end positions of the genes on their respective chromosomes?
    - What are the biotypes of the genes?
    - What are the statuses of the genes in the BioMart database?
3. Based on your analysis, do you think one of these genes could be responsible for the genetic disorder? Why or why not?

Write your code in the cell below:

In [None]:
# Write your code here


## Conclusion

Congratulations on completing the tutorial! 🎉 You've learned how to access and use the BioMart database using Python. You've also gained experience in analyzing gene information, which is a crucial skill in bioinformatics and pharmacogenomics.

Keep practicing and exploring the BioMart database. There's a wealth of information waiting to be discovered! 🚀
