<a href="https://colab.research.google.com/github/nunososorio/bhs/blob/main/NSO_PracticalClass_II.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Practical Training: The Role of Omics, Bioinformatics, and Pharmacogenomics

By **Nuno S. Osório** 🖋️

👋 Welcome to this tutorial! We will explore how to access and use omics-related databases. We will focus on retrieving data from **BioMart** with its Python client package and then analyze the retrieved data. This tutorial is designed to run in Jupyter notebook environments and includes exercises that involve running Python code. 🐍💻

You can access an interactive cloud version of the notebook here (https://colab.research.google.com/github/nunososorio/bhs/blob/main/NSO_PracticalClass_II.ipynb).

Let's dive in! 🏊‍♂️


## Introduction

Databases are crucial in the **Target-to-Hit** and **Hit-to-Lead** steps of drug discovery. 🎯💊 They provide a wealth of information about potential drug targets and the compounds that could interact with these targets. Accessing and analyzing this data can help identify potential new drugs. 🧪🔬

Accessing these databases can be done via their respective websites. However, for reproducible and large-scale analysis, accessing the database programmatically via code is more efficient. In this tutorial, we will guide you on how to do this. 🖥️📚

The **BioMart** database is a powerful query-oriented data management system that provides unified access to distributed research data. It brings together biological, experimental, and medical data to aid the translation of genomic information into effective new drugs. 🧬💡

The `biomart` package is a Python client library for accessing BioMart data. 🐍📦
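
For context, BioMart servers accept queries written as a small XML document, and a client library builds such a document on your behalf. The sketch below assembles one with Python's standard library so you can see the moving parts. The helper name `build_query_xml` is hypothetical, and the XML layout follows the commonly documented BioMart query format; treat it as an illustration, not as the exact request the `biomart` package sends.

```python
import xml.etree.ElementTree as ET

def build_query_xml(dataset, filters, attributes):
    """Build a BioMart-style XML query string (illustrative helper, not part of any library)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",   # ask for tab-separated output
        "header": "0",        # no header row
        "uniqueRows": "0",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Filters choose which rows come back.
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": str(value)})
    # Attributes choose which columns come back.
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")

query_xml = build_query_xml(
    "hsapiens_gene_ensembl",
    {"ensembl_gene_id": "ENSG00000139618"},
    ["ensembl_gene_id", "chromosome_name"],
)
print(query_xml)
```

The same two ideas, filters for rows and attributes for columns, reappear in every query later in this tutorial.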


## Setup
First, we need to install the necessary Python package. Run the following command in your environment:


In [None]:
!pip install biomart


Now, import the library:

In [None]:
from biomart import BiomartServer


## Explore

Let's start by learning which types of data we can retrieve from the BioMart database using the biomart client. You can list the available databases using the following code:

In [None]:
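# Connect to the BioMart service hosted by Ensembl.
# Assumption: the original central portal URL (www.biomart.org) may no longer respond.
server = BiomartServer("http://www.ensembl.org/biomart")

# Print the databases (marts) hosted on this server.
server.show_databases()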


Note that in Python, attributes that start with an underscore are typically used for internal purposes and are not meant to be accessed directly, so you can ignore any such names when exploring these objects.

Let's peek into the human genes dataset ('hsapiens_gene_ensembl') in the BioMart database:

In [None]:
ensembl = server.datasets['hsapiens_gene_ensembl']
ensembl


As you can see, this returns a dataset object rather than the data itself. The dataset holds a wealth of information about human genes, organized as attributes (the columns you can request) and filters (the conditions you can apply). You can list them with `ensembl.show_attributes()` and `ensembl.show_filters()`.

Here’s a brief explanation of some of the attributes:

- **'ensembl_gene_id'**: The Ensembl ID of the gene.
- **'ensembl_transcript_id'**: The Ensembl ID of the transcript.
- **'description'**: The description of the gene.
- **'chromosome_name'**: The name of the chromosome where the gene is located.
- **'start_position'**: The start position of the gene on the chromosome.
- **'end_position'**: The end position of the gene on the chromosome.
- **'strand'**: The strand of the gene on the chromosome.
- **'transcript_count'**: The number of transcripts of the gene.
- **'percentage_gc_content'**: The percentage of GC content in the gene.
- **'gene_biotype'**: The biotype of the gene.
- **'status'**: The status of the gene in the BioMart database.
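
A dataset can expose hundreds of attributes, so it helps to search their names by keyword. The sketch below does this on a plain list of the names described above; `find_names` is a hypothetical helper. With a live connection you could pass it the dataset's own attribute names instead (for example `ensembl.attributes.keys()`, assuming the client exposes attributes as a dictionary).

```python
def find_names(names, keyword):
    """Return the names that contain `keyword`, ignoring case (illustrative helper)."""
    keyword = keyword.lower()
    return sorted(name for name in names if keyword in name.lower())

# Attribute names copied from the list above.
attribute_names = [
    "ensembl_gene_id", "ensembl_transcript_id", "description",
    "chromosome_name", "start_position", "end_position", "strand",
    "transcript_count", "percentage_gc_content", "gene_biotype", "status",
]

# Find every attribute whose name mentions a position.
position_attributes = find_names(attribute_names, "position")
print(position_attributes)
```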



If we want to find a gene by 'ensembl_gene_id' we can use:

In [None]:
# 'ensembl' is the dataset object; give it a shorter alias for the gene queries below.
gene = ensembl

# Use the 'search' method to retrieve the gene whose ensembl_gene_id is 'ENSG00000139618'.
# 'filters' selects the rows and 'attributes' selects the columns to return.
# If the server rejects an attribute (recent Ensembl releases may no longer offer 'status'), remove it from the list.
genes = gene.search({'filters': {'ensembl_gene_id': 'ENSG00000139618'}, 'attributes': ['ensembl_gene_id', 'external_gene_name', 'description', 'chromosome_name', 'start_position', 'end_position', 'percentage_gc_content', 'gene_biotype', 'status']})

# 'genes' is an HTTP response object; its body is tab-separated text with one line per matching gene.
genes.text



Not easy to read... Don't worry: you can convert the result into a pandas DataFrame for easier reading and manipulation. Here’s how you can do it:

In [None]:
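import io
import pandas as pd

# Column names, in the same order as the 'attributes' requested above.
columns = ['ensembl_gene_id', 'external_gene_name', 'description', 'chromosome_name', 'start_position', 'end_position', 'percentage_gc_content', 'gene_biotype', 'status']

# The search returns tab-separated text without a header row, so parse it and name the columns.
genes_df = pd.read_csv(io.StringIO(genes.text), sep='\t', header=None, names=columns)

# Display the DataFrame.
genes_df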
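
Once the results are in a DataFrame, derived columns are one line away. The sketch below computes gene length from the start and end positions, using two made-up placeholder rows (not real Ensembl values) so that it runs without a server connection. Both coordinates are treated as inclusive, hence the `+ 1`.

```python
import pandas as pd

# Placeholder rows for illustration only; these are NOT real Ensembl values.
example_df = pd.DataFrame({
    "ensembl_gene_id": ["GENE_A", "GENE_B"],
    "start_position": [1000, 5000],
    "end_position": [1999, 5499],
})

# Gene length in base pairs, with both coordinates counted as inclusive.
example_df["gene_length"] = example_df["end_position"] - example_df["start_position"] + 1
print(example_df)
```

The same expression should work on `genes_df`, provided the position columns were parsed as numbers.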


## Exercise 1: Exploring Gene Information

Now that we have our data in a more readable format, let's start exploring it. For this exercise, you will retrieve and analyze information about a specific gene.

**Task:**

1. Retrieve information about the gene with the Ensembl ID 'ENSG00000139618'.
2. Display the information in a readable format.
3. Analyze the information and answer the following questions:
    - What is the name of the gene?
    - On which chromosome is the gene located?
    - What is the start and end position of the gene on the chromosome?
    - What is the percentage of GC content in the gene?
    - What is the biotype of the gene?
    - What is the status of the gene in the BioMart database?

Write your code in the cell below:

In [None]:
# Write your code here


## Exercise 2: Comparing Genes

In this exercise, you will compare two genes based on their GC content and biotype.

**Task:**

1. Retrieve information about the genes with the Ensembl IDs 'ENSG00000139618' and 'ENSG00000248333'.
2. Compare the two genes based on their GC content and biotype.
3. Answer the following questions:
    - Which gene has a higher GC content?
    - Do the two genes have the same biotype?
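
As a hint, a comparison like this can be written with a few pandas calls once both genes sit in one DataFrame. The sketch below uses made-up placeholder rows (not real Ensembl values); replace `example_df` with the DataFrame you build from your own query.

```python
import pandas as pd

# Placeholder rows for illustration only; these are NOT real Ensembl values.
example_df = pd.DataFrame({
    "ensembl_gene_id": ["GENE_A", "GENE_B"],
    "percentage_gc_content": [41.5, 38.2],
    "gene_biotype": ["protein_coding", "lncRNA"],
})

# Find the row with the highest GC content, then read the gene ID stored in that row.
top_row = example_df["percentage_gc_content"].idxmax()
highest_gc_gene = example_df.loc[top_row, "ensembl_gene_id"]

# The genes share a biotype only if the column holds a single distinct value.
same_biotype = example_df["gene_biotype"].nunique() == 1

print(highest_gc_gene, same_biotype)
```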

Write your code in the cell below:

In [None]:
# Write your code here


## Scenario-Based Exercise: Investigating a Genetic Disorder

Imagine you are a geneticist investigating a rare genetic disorder. You suspect that the disorder is caused by a mutation in one of two genes: 'ENSG00000139618' or 'ENSG00000248333'.

You decide to use the BioMart database to gather more information about these two genes.

**Task:**

1. Retrieve information about the genes 'ENSG00000139618' and 'ENSG00000248333'.
2. Analyze the information and answer the following questions:
    - On which chromosomes are the genes located?
    - What are the start and end positions of the genes on their respective chromosomes?
    - What are the biotypes of the genes?
    - What are the statuses of the genes in the BioMart database?
3. Based on your analysis, do you think one of these genes could be responsible for the genetic disorder? Why or why not?

Write your code in the cell below:

In [None]:
# Write your code here


## Conclusion

Congratulations on completing the tutorial! 🎉 You've learned how to access and use the BioMart database using Python. You've also gained experience in analyzing gene information, which is a crucial skill in bioinformatics and pharmacogenomics.

Keep practicing and exploring the BioMart database. There's a wealth of information waiting to be discovered! 🚀
