<a href="https://colab.research.google.com/github/AlaseeriRawan/ACMG-PVS1-M-S/blob/main/pct_of_exon.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **What is Ensembl?**
**Ensembl** is a comprehensive and widely used database in genomics and bioinformatics that provides detailed information about the genomes of a wide range of species, including humans, mice, plants, and many others. It was launched as a joint project between the European Bioinformatics Institute (EMBL-EBI) and the Wellcome Trust Sanger Institute and is now maintained by EMBL-EBI.

### Key Features of Ensembl:

1. **Genome Annotation:**
   - Ensembl provides detailed annotations of genomes, including information about genes, transcripts, exons, regulatory elements, and variations. These annotations are based on the latest genomic research and data, providing high-quality, up-to-date resources for researchers.

2. **Gene and Transcript Information:**
   - For each gene in a genome, Ensembl provides information about its location, structure (including exons, introns, and untranslated regions), and function.
   - It also provides details about different transcripts (mRNA sequences produced from the gene), including protein-coding and non-coding transcripts.

3. **Comparative Genomics:**
   - Ensembl includes tools for comparing genomes across different species. This allows researchers to study evolutionary relationships, identify conserved regions, and explore functional genomics.

4. **Variation Data:**
   - Ensembl hosts extensive data on genetic variations, such as single nucleotide polymorphisms (SNPs), insertions/deletions (indels), and structural variants. This is crucial for studying genetic diversity, disease associations, and personalized medicine.

5. **Regulation:**
   - The database includes information on regulatory elements like promoters, enhancers, and transcription factor binding sites, helping researchers understand how genes are regulated at the genomic level.

6. **Tools and Interfaces:**
   - **Ensembl Genome Browser:** A web-based interface that allows users to explore and visualize genomic data interactively.
   - **BioMart:** A powerful data mining tool for querying and exporting data from Ensembl.
   - **APIs:** Ensembl provides programmatic access to its data through RESTful APIs, making it easy for developers and bioinformaticians to integrate Ensembl data into their analyses.

7. **Multi-Species Data:**
   - Ensembl is not limited to human data; it covers a broad range of species, making it a valuable resource for comparative genomics and evolutionary studies.

### Why Ensembl is Important:

- **Centralized Resource:** Ensembl is a one-stop-shop for genomic data, integrating data from multiple sources and presenting it in a user-friendly way.
- **High-Quality Annotations:** The data in Ensembl is curated and updated regularly, ensuring that researchers have access to the most reliable and current information.
- **Broad Accessibility:** Ensembl is freely available and used by researchers around the world, supporting a wide range of applications from basic research to clinical genomics.

### Practical Uses of Ensembl:

- **Gene Annotation:** Researchers use Ensembl to find information about specific genes, including their sequences, functions, and associated variations.
- **Disease Research:** Ensembl's variation data helps researchers understand the genetic basis of diseases by linking variations to phenotypes.
- **Comparative Studies:** Ensembl's comparative genomics tools allow researchers to compare genes and genomes across species, shedding light on evolutionary processes.
- **Bioinformatics Pipelines:** Ensembl data is often integrated into bioinformatics pipelines for tasks like genome annotation, variant calling, and functional genomics.

### Example Use Case:
If you're studying the *Titin* (TTN) gene, Ensembl would allow you to:
- Access the gene's full sequence and structure, including all known exons and introns.
- Explore different transcripts of the TTN gene, such as the primary coding transcript and its alternative splicing variants.
- Examine variations in the TTN gene that may be linked to muscular diseases.

### Accessing Ensembl:
- **Website:** [Ensembl.org](https://www.ensembl.org) provides a user-friendly web interface for exploring all of this data.
- **Programmatic Access:** For large-scale data analysis, you can use Ensembl's APIs or download data directly from their FTP servers.
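
As a rough illustration of programmatic access, the sketch below builds a request URL for the Ensembl REST server's `xrefs/symbol` endpoint, which maps an external identifier such as a RefSeq ID to Ensembl IDs. The helper names `build_xrefs_url` and `fetch_xrefs` are made up for this example, and only the Python standard library is used. `fetch_xrefs` needs network access and is defined but not called here.

```python
import json
import urllib.parse
import urllib.request

REST_SERVER = "https://rest.ensembl.org"  # public Ensembl REST server


def build_xrefs_url(symbol: str, species: str = "homo_sapiens") -> str:
    """Return the xrefs/symbol URL for an external identifier (hypothetical helper)."""
    path = f"/xrefs/symbol/{urllib.parse.quote(species)}/{urllib.parse.quote(symbol)}"
    return REST_SERVER + path


def fetch_xrefs(symbol: str, species: str = "homo_sapiens") -> list:
    """Query the REST server and decode the JSON reply (requires network access)."""
    request = urllib.request.Request(
        build_xrefs_url(symbol, species),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as reply:
        return json.load(reply)


print(build_xrefs_url("NM_001267550"))
```

Keeping URL construction separate from the network call makes the lookup easy to check offline before running it against the live server.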

Ensembl is an essential tool in modern genomics, offering comprehensive resources that make it easier for researchers to explore and understand the complexities of genomes.

# **What is GTF?**
A **GTF** file is a type of text file used in bioinformatics to store and describe gene structure annotations, such as the positions of exons, introns, and other genomic features within a genome.

### What is a GTF File?

**GTF** is a tab-delimited text file format, and it is widely used to represent gene annotations, particularly in the context of genome assemblies. The GTF format is very similar to the General Feature Format (GFF), but with some specific conventions that make it more suited for certain tasks, especially those involving transcript annotations.

### Structure of a GTF File:

A GTF file typically contains the following fields, separated by tabs:

1. **seqname**: The name of the sequence (e.g., chromosome or scaffold) where the feature is located.
2. **source**: The name of the program or database that generated the annotation (e.g., Ensembl, HAVANA).
3. **feature**: The type of genomic feature (e.g., gene, transcript, exon, CDS).
4. **start**: The starting position of the feature on the sequence (1-based indexing).
5. **end**: The ending position of the feature on the sequence (1-based and inclusive, so a feature's length is `end - start + 1`).
6. **score**: A floating-point number representing the confidence or score of the feature (often `.` if not used).
7. **strand**: The strand on which the feature is located (`+` for the forward strand, `-` for the reverse strand).
8. **frame**: The reading frame of the feature (`0`, `1`, `2`, or `.` if not applicable). This is particularly relevant for coding sequences (CDS).
9. **attribute**: A semicolon-separated list of key-value pairs that provide additional information about the feature, such as gene ID, transcript ID, gene name, transcript name, and exon number.
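
The nine fields can be read with a few lines of plain Python. This is a minimal sketch, assuming a well-formed, tab-separated record; `parse_gtf_line` and `GTF_COLUMNS` are names invented for the example.

```python
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute"]


def parse_gtf_line(line: str) -> dict:
    """Split one GTF record into its nine named fields (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields))
    # Coordinates are stored as text; convert them so lengths can be computed.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record


line = '1\thavana\texon\t11869\t12227\t.\t+\t.\tgene_id "ENSG00000223972"; exon_number "1";'
rec = parse_gtf_line(line)
length = rec["end"] - rec["start"] + 1  # both coordinates are inclusive
print(rec["feature"], rec["strand"], length)
```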

### Example of a GTF File:

Here are three example GTF entries: a gene, one of its transcripts, and that transcript's first exon:

```
chr1    HAVANA  gene       11869   14409   .       +       .       gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";
chr1    HAVANA  transcript 11869   14409   .       +       .       gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1"; transcript_name "DDX11L1-201"; transcript_source "havana"; transcript_biotype "processed_transcript";
chr1    HAVANA  exon       11869   12227   .       +       .       gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1"; gene_name "DDX11L1"; transcript_name "DDX11L1-201";
```

### Key Components:

- **gene_id** and **transcript_id**: These fields uniquely identify the gene and transcript. They are crucial for linking exons to their corresponding transcripts and genes.
- **feature**: The feature type can be `gene`, `transcript`, `exon`, `CDS`, etc.
- **start** and **end**: These specify the genomic coordinates of the feature.
- **strand**: Indicates whether the feature is on the forward or reverse DNA strand.
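
To use `gene_id` and `transcript_id` for linking, the attribute column first has to be split into key-value pairs. The sketch below assumes every value is double-quoted, as in the example entries above; `parse_attributes` is a hypothetical helper, and keys that occur more than once on a line keep only their last value.

```python
import re

# One key followed by a double-quoted value, e.g. gene_id "ENSG00000223972"
ATTR_PATTERN = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(attribute: str) -> dict:
    """Turn a GTF attribute string into a dict (hypothetical helper)."""
    return dict(ATTR_PATTERN.findall(attribute))


attrs = parse_attributes(
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; '
    'exon_number "1"; gene_name "DDX11L1";'
)
print(attrs["gene_id"], attrs["transcript_id"], attrs["exon_number"])
```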

### Common Uses of GTF Files:

1. **RNA-Seq Analysis**: GTF files are used to map sequencing reads to specific genes and transcripts, helping to quantify gene expression levels.
2. **Genome Annotation**: Researchers use GTF files to understand the structure of genes, including the locations of exons, introns, and UTRs (untranslated regions).
3. **Comparative Genomics**: By comparing GTF files across species, researchers can study gene conservation and evolutionary relationships.
4. **Variant Annotation**: GTF files help in linking genetic variants to specific regions of genes, such as exons or regulatory regions.

### Tools for Working with GTF Files:

- **HTSeq**: A Python package used for counting reads mapped to genes, often using GTF files.
- **gffread**: A utility for converting and filtering GTF and GFF files, originally distributed with the Cufflinks package.
- **pandas (Python)**: Can be used to parse and analyze GTF files in a tabular format.
- **BEDTools**: A suite of tools for comparing genomic features, which can work with GTF files.

### Differences Between GTF and GFF:

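While GTF is similar to GFF (General Feature Format), GTF is more tailored for gene and transcript annotations and follows stricter conventions: feature lines carry a `gene_id` attribute (and a `transcript_id` for transcript-level features) written as `key "value";` pairs, whereas GFF3 writes attributes as `key=value` pairs and links features through `ID` and `Parent`. These fixed conventions make GTF easier to parse when grouping exons and CDS segments by transcript.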
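
One visible difference is the attribute syntax: GTF uses `key "value";` pairs, while GFF3 uses `key=value` pairs separated by semicolons. The sketch below only rewrites that syntax and is not a full converter; a real conversion would also need `ID`/`Parent` relationships and character escaping, which dedicated tools handle. `gtf_to_gff3_attributes` is a name made up for this illustration.

```python
import re


def gtf_to_gff3_attributes(attribute: str) -> str:
    """Rewrite GTF-style attributes in GFF3 key=value syntax (illustrative only)."""
    pairs = re.findall(r'(\S+) "([^"]*)"', attribute)
    return ";".join(f"{key}={value}" for key, value in pairs)


print(gtf_to_gff3_attributes('gene_id "ENSG00000223972"; gene_name "DDX11L1";'))
```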

# **Human Ensembl GTF**
A **Human Ensembl GTF** file is a specific type of GTF (Gene Transfer Format) file that contains gene annotations for the human genome, as provided by the Ensembl project. These annotations include detailed information about the structure of human genes, such as their positions, exons, transcripts, and other related genomic features.

### Key Points about Human Ensembl GTF Files:

1. **Source of Annotations:**
   - The annotations in the Human Ensembl GTF file are generated by the Ensembl project, which integrates data from various sources, including experimental data, computational predictions, and manual curation. The Ensembl project provides high-quality, up-to-date annotations for the human genome and other species.

2. **Genome Assembly:**
   - The GTF file corresponds to a specific version of the human genome assembly, such as GRCh38 (the current GRC human reference assembly) or the older GRCh37. The assembly is given in the file name, and Ensembl GTF files also record it in the `#!` header lines at the top of the file.

3. **Content of the GTF File:**
   - **Gene Annotations**: Information about the start and end positions of genes, their names, and other related metadata.
   - **Transcript Annotations**: Details about the different transcripts produced from each gene, including the start and end positions of exons, coding sequences (CDS), and untranslated regions (UTRs).
   - **Exon Structure**: Positions of exons for each transcript, which are crucial for understanding the gene’s structure and for RNA-Seq analyses.
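   - **Other Features**: Non-coding RNA genes, pseudogenes, and similar gene-level annotations, depending on the file. Regulatory elements are generally distributed in separate regulation files rather than in the gene-annotation GTF.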

4. **Applications:**
   - **RNA-Seq Analysis:** Used to map sequencing reads to specific genes and transcripts to measure gene expression levels.
   - **Variant Annotation:** Helps link genetic variants to specific regions within genes (e.g., exons, introns, UTRs).
   - **Gene Structure Studies:** Used to study the structure of genes, including alternative splicing and transcript diversity.

### Example Use of a Human Ensembl GTF File:

For example, if you are conducting an RNA-Seq analysis, you would download the Human Ensembl GTF file corresponding to the genome assembly you are working with (e.g., GRCh38). Reads are aligned to the genome sequence itself; the GTF file is then supplied to the aligner or quantification tool so that aligned reads can be assigned to annotated genes and transcripts, allowing you to quantify gene expression accurately.

### How to Access Human Ensembl GTF Files:

1. **Ensembl FTP Site:**
   - You can download Human Ensembl GTF files from the Ensembl FTP site. The files are organized by genome assembly and Ensembl release version.
   - Example link: `ftp://ftp.ensembl.org/pub/release-112/gtf/homo_sapiens/` (Replace `112` with the relevant release version).

2. **Ensembl Genome Browser:**
   - You can navigate the Ensembl Genome Browser, search for the human genome, and download the corresponding GTF file from the "Downloads" section.

3. **BioMart:**
   - Ensembl’s BioMart tool allows for customized queries and downloads of gene annotations, which can be output in GTF format.
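
For scripted downloads, the FTP location can be assembled from the release number. This is a sketch that simply follows the directory and file-name pattern of the example link above; `ensembl_gtf_url` is a hypothetical helper, and whether the same layout holds for every release, species, or assembly is an assumption that should be checked on the FTP site.

```python
def ensembl_gtf_url(release: int, assembly: str = "GRCh38",
                    species: str = "homo_sapiens") -> str:
    """Build the FTP URL of an Ensembl GTF file (hypothetical helper)."""
    directory = f"ftp://ftp.ensembl.org/pub/release-{release}/gtf/{species}/"
    # File names start with the species name, first letter capitalized.
    filename = f"{species.capitalize()}.{assembly}.{release}.gtf.gz"
    return directory + filename


print(ensembl_gtf_url(112))
```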

### Naming Convention:
- The naming of a Human Ensembl GTF file typically follows this pattern: `Homo_sapiens.GRCh38.112.gtf.gz`, where:
  - **Homo_sapiens**: Indicates that this is the human genome.
  - **GRCh38**: The genome assembly version.
  - **112**: The Ensembl release version.
  - **gtf.gz**: The file is compressed in Gzip format.
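
The naming pattern can also be taken apart programmatically, for instance to record which assembly and release an analysis used. The regular expression below is a sketch written for names shaped like the example above; `parse_gtf_filename` is a made-up helper, and other file-name variants may not match.

```python
import re

NAME_PATTERN = re.compile(
    r"^(?P<species>[A-Za-z_]+)\.(?P<assembly>.+)\.(?P<release>\d+)\.gtf(?P<gz>\.gz)?$"
)


def parse_gtf_filename(filename: str) -> dict:
    """Split an Ensembl GTF file name into its parts (hypothetical helper)."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognized GTF file name: {filename}")
    return {
        "species": match.group("species"),
        "assembly": match.group("assembly"),
        "release": int(match.group("release")),
        "compressed": match.group("gz") is not None,
    }


info = parse_gtf_filename("Homo_sapiens.GRCh38.112.gtf.gz")
print(info)
```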

### Example Line from a Human Ensembl GTF File:

```
1       ensembl gene    11869   14409   .       +       .       gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_biotype "transcribed_unprocessed_pseudogene";
1       ensembl transcript      11869   14409   .       +       .       gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1"; transcript_name "DDX11L1-201";
1       ensembl exon    11869   12227   .       +       .       gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1"; gene_name "DDX11L1"; transcript_name "DDX11L1-201";
```


The Human Ensembl GTF file is a crucial resource in genomics, providing comprehensive gene annotations that are essential for many types of genetic and genomic analyses. It serves as a standardized format for representing gene structures, making it an indispensable tool in bioinformatics.

# Install the necessary packages
### complete

In [None]:
# Installation
!pip install pandas gffutils
!pip install openpyxl
!wget ftp://ftp.ensembl.org/pub/release-112/gtf/homo_sapiens/Homo_sapiens.GRCh38.112.gtf.gz
!gunzip Homo_sapiens.GRCh38.112.gtf.gz

In [2]:
import pandas as pd
import requests

In [3]:
# Define the GTF file path
gtf_file = 'Homo_sapiens.GRCh38.112.gtf'
# Define the column names for the GTF file
columns = ['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attribute']
# Load the GTF file into a pandas DataFrame
df = pd.read_csv(gtf_file, sep='\t', comment='#', names=columns, low_memory=False)
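
Loading the whole human GTF into memory can be slow on small machines. As an alternative sketch, not the approach this notebook takes, the file can be streamed line by line and only the records of one gene kept. `open_gtf` and `iter_gene_records` are hypothetical helpers, and the in-memory sample stands in for the real file.

```python
import gzip
import io
from typing import Iterator, TextIO


def open_gtf(path: str) -> TextIO:
    """Open a plain or gzip-compressed GTF file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def iter_gene_records(handle: TextIO, gene_id: str) -> Iterator[list]:
    """Yield the nine fields of every record whose gene_id matches."""
    needle = f'gene_id "{gene_id}"'
    for line in handle:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9 and needle in fields[8]:
            yield fields


# Tiny in-memory stand-in for the real file; the second gene ID is a placeholder.
sample = io.StringIO(
    "# header line\n"
    '1\thavana\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972";\n'
    '1\thavana\tgene\t20000\t21000\t.\t-\t.\tgene_id "GENE_PLACEHOLDER";\n'
)
rows = list(iter_gene_records(sample, "ENSG00000223972"))
print(len(rows), rows[0][2])
```

With the downloaded file, `open_gtf("Homo_sapiens.GRCh38.112.gtf")` would supply the handle instead of the sample.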

In [4]:
# df.head()  # uncomment to preview the DataFrame holding the whole Ensembl GTF

# Get Ensembl gene and transcript IDs from RefSeq ID, with exon and CDS count

In [None]:
# @title Transcript Number
# Prompt the user to input the RefSeq ID
refseq_id = input("Please enter the RefSeq ID (e.g., NM_001267550): ").strip()

# Query the Ensembl REST API for the Ensembl IDs cross-referenced to this RefSeq ID
server = "https://rest.ensembl.org"
ext = f"/xrefs/symbol/homo_sapiens/{refseq_id}"

response = requests.get(server + ext, headers={"Content-Type": "application/json"})
response.raise_for_status()  # raises an HTTPError if the request failed

decoded = response.json()

# Initialize variables
gene_id = None
transcript_id = None

# Assign the Ensembl gene ID and transcript ID (if several entries match, the last one is kept)
for item in decoded:
    if 'id' in item and item['type'] == 'gene':
        gene_id = item['id']
        print(f"RefSeq ID: {refseq_id} -> Ensembl Gene ID: {gene_id}")
    elif 'id' in item and item['type'] == 'transcript':
        transcript_id = item['id']
        print(f"RefSeq ID: {refseq_id} -> Ensembl Transcript ID: {transcript_id}")

# Display the assigned values
print(f"\nAssigned Gene ID: {gene_id}")
print(f"Assigned Transcript ID: {transcript_id}")

# Stop early with a clear message instead of failing inside the filters below
if gene_id is None or transcript_id is None:
    raise ValueError(f"No Ensembl gene and transcript ID found for {refseq_id}; check the RefSeq ID.")

# Filter the DataFrame for the gene (regex=False: match the ID as plain text)
gene_df = df[df['attribute'].str.contains(gene_id, regex=False)]

# Further filter for the specific transcript ID (transcript, exon, CDS, codon and UTR rows)
transcript_df = gene_df[gene_df['attribute'].str.contains(transcript_id, regex=False)]

# Exon and CDS rows of that transcript
exons_df = transcript_df[transcript_df['feature'] == 'exon']
cds_df = transcript_df[transcript_df['feature'] == 'CDS']

# All transcript IDs annotated for this gene (useful for manual checks)
transcript_ids = gene_df['attribute'].str.extract(r'transcript_id "([^"]+)"')[0].dropna().unique()

# Count the number of exons and CDS segments
exon_count = len(exons_df)
cds_count = len(cds_df)

# Display the counts
print(f"The transcript {transcript_id} has {exon_count} exons and {cds_count} CDS segments")
print("If the above information is correct, please proceed with the rest of the code below.")

# DATAFRAMES
# gene_df : df for the entire gene with all transcripts
# exons_df : df of just the exons for my particular transcript
# cds_df : df containing just the CDS segments for my transcript
# transcript_df : df containing all the data for the transcript, including exons, CDS, start and stop codons, and UTR regions
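
The ID selection can be tried without network access by running the same loop on a hand-written reply. The payload below is hypothetical: it only mimics the `type` and `id` keys that the cell reads, and the transcript ID is a placeholder rather than a real mapping. `pick_ensembl_ids` is a name invented for this sketch.

```python
def pick_ensembl_ids(decoded: list) -> tuple:
    """Return (gene_id, transcript_id) from an xrefs/symbol reply; None where absent."""
    gene_id = None
    transcript_id = None
    for item in decoded:
        if item.get("type") == "gene":
            gene_id = item.get("id")
        elif item.get("type") == "transcript":
            transcript_id = item.get("id")
    return gene_id, transcript_id


# Hypothetical reply; the transcript ID is a placeholder, not a real mapping.
sample_reply = [
    {"type": "gene", "id": "ENSG00000155657"},
    {"type": "transcript", "id": "ENST00000000000"},
]
print(pick_ensembl_ids(sample_reply))
print(pick_ensembl_ids([]))  # an empty reply yields (None, None)
```

Checking for `None` before filtering the DataFrame gives a clearer failure than an error raised inside `str.contains`.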

# IF YOU GET THE WRONG TRANSCRIPT:

In [None]:
# If the total exon number is not right, list every transcript of the gene and compare
# exon counts to find the Ensembl transcript that matches your RefSeq ID.

# Prompt the user to input the Ensembl gene ID
gene_id_check = input("Please enter the Ensembl gene ID as shown above (e.g., ENSG00000155657): ").strip()

# Filter the DataFrame for the exon features of that gene
exons_df_check = df[(df['attribute'].str.contains(gene_id_check, regex=False)) & (df['feature'] == 'exon')]

# Extract transcript IDs and count exon entries for each transcript
exon_counts_check = exons_df_check['attribute'].str.extract(r'transcript_id "([^"]+)"')[0].value_counts()

# Display the number of exons for each transcript
print(f"Exon counts for each transcript in gene {gene_id_check}:")
for tx_id, count in exon_counts_check.items():  # tx_id avoids overwriting transcript_id
    print(f"Transcript ID: {tx_id}, Number of Exons: {count}")
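
The per-transcript exon count does not depend on pandas. The sketch below counts exon records per transcript with `collections.Counter`; the records and IDs (`GENE_X`, `TX_A`, and so on) are invented for illustration, and `exon_counts` is a hypothetical helper.

```python
import re
from collections import Counter

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def exon_counts(gtf_lines, gene_id: str) -> Counter:
    """Count exon records per transcript for one gene (hypothetical helper)."""
    counts = Counter()
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue  # keep exon rows only
        if f'gene_id "{gene_id}"' not in fields[8]:
            continue  # keep rows of the requested gene only
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up records: two transcripts of GENE_X plus one row of another gene.
lines = [
    '2\tx\texon\t100\t200\t.\t-\t.\tgene_id "GENE_X"; transcript_id "TX_A"; exon_number "1";',
    '2\tx\texon\t300\t400\t.\t-\t.\tgene_id "GENE_X"; transcript_id "TX_A"; exon_number "2";',
    '2\tx\tCDS\t300\t400\t.\t-\t0\tgene_id "GENE_X"; transcript_id "TX_A"; exon_number "2";',
    '2\tx\texon\t100\t200\t.\t-\t.\tgene_id "GENE_X"; transcript_id "TX_B"; exon_number "1";',
    '2\tx\texon\t900\t950\t.\t+\t.\tgene_id "GENE_Y"; transcript_id "TX_C"; exon_number "1";',
]
counts = exon_counts(lines, "GENE_X")
for tx_id, n in counts.most_common():
    print(tx_id, n)
```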


# Code Execution

# Calculate % of the exon

In [None]:
# @title Exon Number
# Input the exon number; the CDS segment inside that exon is what gets measured
cds_number = input("Please enter the exon number you want to analyze (e.g., 15): ").strip()

# Filter for the CDS row carrying that exon number in the CDS df of the transcript
myCDS_df = cds_df[cds_df['attribute'].str.contains(f'exon_number "{cds_number}"', regex=False)]
if myCDS_df.empty:
    raise ValueError(f"Exon {cds_number} has no CDS row; it may be non-coding or out of range.")

# Extract the start, end, and attribute information for the selected CDS segment
cds_start = myCDS_df['start'].values[0]
cds_end = myCDS_df['end'].values[0]
cds_length = cds_end - cds_start + 1  # coding length of the exon (coordinates are inclusive)
cds_attributes = myCDS_df['attribute'].values[0]

# Extract the genomic start and end of the transcript (this span includes introns)
transcript_row = transcript_df[transcript_df['feature'] == 'transcript']
transcript_start = transcript_row['start'].values[0]
transcript_end = transcript_row['end'].values[0]
transcript_length = transcript_end - transcript_start + 1

# Total coding length: sum of all CDS segments, plus 3 for the stop codon,
# which GTF CDS rows do not include
cds_lengths = cds_df['end'] - cds_df['start'] + 1
full_cds_length = cds_lengths.sum() + 3

# Share of the transcript's genomic span (introns included); informational only,
# not used for the PVS1 call below
exon_percentage = (cds_length / transcript_length) * 100

# Display the start, end, and coding length of the selected exon
print(f"Exon {cds_number} Start: {cds_start}")
print(f"Exon {cds_number} End: {cds_end}")
print(f"Exon {cds_number} Length: {cds_length}")

# Display the start and end of the transcript and the total CDS length
print(f"Transcript Start: {transcript_start}")
print(f"Transcript End: {transcript_end}")
print(f"CDS Length: {full_cds_length}")

# Percentage of the coding sequence contributed by the selected exon
pct_ex = (cds_length / full_cds_length) * 100
print(f"Exon {cds_number} is {pct_ex:.2f}% of the total CDS length.")

if pct_ex > 9.99:
    print("PVS1_STRONG")
else:
    print("PVS1_MODERATE")
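
The arithmetic in the cell above can be checked on small numbers. In this sketch the segment lengths are made up, and the function names `exon_cds_percentage` and `pvs1_strength` are hypothetical. The `9.99` cut-off and the `+ 3` stop-codon adjustment simply mirror the cell above; they are not an independent statement of the PVS1 criteria.

```python
def exon_cds_percentage(exon_cds_length: int, cds_segment_lengths: list,
                        stop_codon: int = 3) -> float:
    """Percentage of the full CDS (segments plus stop codon) covered by one exon."""
    total = sum(cds_segment_lengths) + stop_codon
    return exon_cds_length / total * 100


def pvs1_strength(percentage: float, threshold: float = 9.99) -> str:
    """Apply the same cut-off as the notebook cell above."""
    return "PVS1_STRONG" if percentage > threshold else "PVS1_MODERATE"


segments = [30, 300, 267]  # made-up CDS segment lengths; 597 bases + 3 = 600
for exon_length in segments:
    pct = exon_cds_percentage(exon_length, segments)
    print(f"{exon_length} bp -> {pct:.2f}% -> {pvs1_strength(pct)}")
```

With these numbers, the 30-base segment falls below the cut-off while the 300-base segment exceeds it, so the two branches of the decision are both exercised.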