## Mapping CNV Segments to Genes

### Overview
This notebook uses PyEnsembl's genome metadata to look up the genes that overlap each retrieved CNV segment. The Chromosome, Start, and End columns (the genomic position of the CNV segment) specify the locus. Gene annotations come from Ensembl release 84, which is built on GRCh38, the latest major Genome Reference Consortium human assembly, and is accessed through PyEnsembl.
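
As a rough illustration of what "looking up genes at a locus" means, the sketch below checks whether a gene interval overlaps a CNV segment on the same chromosome, using inclusive coordinates. The `overlaps` helper and the example coordinates are hypothetical and not part of PyEnsembl, which performs its own lookup internally.

```python
def overlaps(segment, gene):
    """Return True if two (contig, start, end) intervals share at least one base.

    Coordinates are treated as inclusive on both ends.
    """
    same_contig = segment[0] == gene[0]
    return same_contig and segment[1] <= gene[2] and gene[1] <= segment[2]


# Hypothetical coordinates, for illustration only
segment = ("1", 1_000, 5_000)
print(overlaps(segment, ("1", 4_500, 9_000)))  # gene starts inside the segment
print(overlaps(segment, ("1", 5_001, 9_000)))  # gene starts just past the segment
print(overlaps(segment, ("2", 4_500, 9_000)))  # same positions, different chromosome
```

Only the first call reports an overlap; the second gene begins one base after the segment ends, and the third lies on a different chromosome.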


PyEnsembl installation and reference genome data (GRCh38) download

In [1]:
!pip install pyensembl
!pyensembl install --release 84 --species homo_sapiens

Collecting pyensembl
  Downloading pyensembl-2.3.13-py3-none-any.whl.metadata (9.4 kB)
Collecting typechecks<1.0.0,>=0.0.2 (from pyensembl)
  Downloading typechecks-0.1.0.tar.gz (3.4 kB)
  Preparing metadata (setup.py) ... done
Collecting datacache<2.0.0,>=1.4.0 (from pyensembl)
  Downloading datacache-1.4.1-py3-none-any.whl.metadata (1.9 kB)
Collecting memoized-property>=1.0.2 (from pyensembl)
  Downloading memoized-property-1.0.3.tar.gz (5.0 kB)
  Preparing metadata (setup.py) ... done
Collecting tinytimer<1.0.0,>=0.0.0 (from pyensembl)
  Downloading tinytimer-0.0.0.tar.gz (2.1 kB)
  Preparing metadata (setup.py) ... done
Collecting gtfparse<3.0.0,>=2.5.0 (from pyensembl)
  Downloading gtfparse-2.5.0-py3-none-any.whl.metadata (2.1 kB)
Collecting serializable<1.0.0,>=0.2.1 (from pyensembl)
  Downloading serializable-0.4.1-py3-none-any.whl.metadata (2.2 kB)
Collecting pylint<3.0.0,>=2.17.2 (from pyensembl)
  Downloading pylint-2.17.7-py3-none-any.whl

Import libraries

In [3]:
import pandas as pd
from pyensembl import EnsemblRelease
from google.colab import files

Get CNV data from uploaded files on Google Colab

In [4]:
uploaded = files.upload()
df = pd.read_csv("ground_truth_combined.csv", delimiter=',')
print(df.head())

Saving ground_truth_combined.csv to ground_truth_combined.csv
                            GDC_Aliquot Chromosome  Start        End  \
0  2eb1196d-5e6d-4223-953c-d558c75a8c2e       chr1  13116  248945703   
1  2eb1196d-5e6d-4223-953c-d558c75a8c2e       chr2  10587  242183243   
2  2eb1196d-5e6d-4223-953c-d558c75a8c2e       chr3  18519  198181744   
3  2eb1196d-5e6d-4223-953c-d558c75a8c2e       chr4  11961  190122722   
4  2eb1196d-5e6d-4223-953c-d558c75a8c2e       chr5  11882  181363900   

   Copy_Number  Major_Copy_Number  Minor_Copy_Number    Case_ID  Sample  
0            4                  2                  2  C3L-00359       1  
1            4                  2                  2  C3L-00359       1  
2            4                  2                  2  C3L-00359       1  
3            4                  2                  2  C3L-00359       1  
4            4                  2                  2  C3L-00359       1  


Or retrieve combined CNV data locally

In [None]:
file_path = "D:/GDC-data/ground_truth_combined.csv"
df = pd.read_csv(file_path, delimiter=',')
print(df.head())

Gene data retrieval function using PyEnsembl

---



In [14]:
# use Ensembl release 84 (GRCh38, human), matching the release installed above
data = EnsemblRelease(84)
data.download()
data.index()

# function to get gene data for the locus
def get_gene_data(chromosome, start, end):
    genes = data.genes_at_locus(contig=chromosome, position=start, end=end)
    if not genes:
        return None, None, None, None

    # Choose the longest gene at the locus
    prominent_gene = max(genes, key=lambda gene: gene.end - gene.start)

    gene_name = prominent_gene.name
    gene_biotype = prominent_gene.biotype
    # Ensembl coordinates are 1-based and inclusive, so add 1 to count every base
    gene_length = prominent_gene.end - prominent_gene.start + 1
    exon_count = len(prominent_gene.exons)

    return gene_name, gene_biotype, gene_length, exon_count

Append gene data columns to existing dataframe

In [19]:
# apply function to each row in dataframe
df[['gene_name', 'gene_biotype', 'gene_length', 'exon_count']] = df.apply(
    lambda row: pd.Series(get_gene_data(row['Chromosome'], row['Start'], row['End'])),
    axis=1
)

print(df.head(10)) # updated dataframe

                            GDC_Aliquot Chromosome     Start        End  \
0  2eb1196d-5e6d-4223-953c-d558c75a8c2e       chr1     13116  248945703   
1  2eb1196d-5e6d-4223-953c-d558c75a8c2e       chr2     10587  242183243   
2  2eb1196d-5e6d-4223-953c-d558c75a8c2e       chr3     18519  198181744   
3  2eb1196d-5e6d-4223-953c-d558c75a8c2e       chr4     11961  190122722   
4  2eb1196d-5e6d-4223-953c-d558c75a8c2e       chr5     11882  181363900   
5  2eb1196d-5e6d-4223-953c-d558c75a8c2e       chr6    100116   32468178   
6  2eb1196d-5e6d-4223-953c-d558c75a8c2e       chr6  32469273   32536402   
7  2eb1196d-5e6d-4223-953c-d558c75a8c2e       chr6  32537412  170740469   
8  2eb1196d-5e6d-4223-953c-d558c75a8c2e       chr7     20608  159334386   
9  2eb1196d-5e6d-4223-953c-d558c75a8c2e       chr8     61774  125589296   

   Copy_Number  Major_Copy_Number  Minor_Copy_Number    Case_ID  Sample  \
0            4                  2                  2  C3L-00359       1   
1            4          

### Gene Data Selection

The following gene attributes were chosen to add relevant features to the current ground truth table:

- Gene name: the common gene name, included to make the data easier to interpret
- Gene biotype: the type of gene, which can indicate whether the CNV affects protein-coding genes
- Gene length: a measurement used to check for patterns among CNV segments
- Exon count: the number of exons in the gene, used to check whether the CNV affects specific regions of a gene
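
As one possible use of the biotype column, the sketch below flags rows whose mapped gene is protein-coding. It assumes the Ensembl biotype label `protein_coding`; the rows and the `is_protein_coding` helper are hypothetical and stand in for the real dataframe.

```python
# Hypothetical rows standing in for the annotated ground truth table
rows = [
    {"gene_name": "GENE_A", "gene_biotype": "protein_coding"},
    {"gene_name": "GENE_B", "gene_biotype": "lincRNA"},
    {"gene_name": None, "gene_biotype": None},  # no gene found at the locus
]


def is_protein_coding(row):
    """Return True when the mapped gene carries the protein_coding biotype."""
    return row["gene_biotype"] == "protein_coding"


flags = [is_protein_coding(row) for row in rows]
print(flags)
```

Rows with no mapped gene carry `None` for the biotype and are simply flagged as not protein-coding.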

### Additional Notes
- PyEnsembl's `genes_at_locus()` returns all genes that overlap the given locus.
- We map the most "prominent" gene to the corresponding row of the ground truth dataframe.
- Currently, the criterion for the prominent gene is the longest gene at the locus (see the lambda function passed to `max`, which compares gene lengths).
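
The selection rule in the notes above can be sketched without PyEnsembl. The `Gene` namedtuple, the `pick_longest` helper, and the gene names and coordinates below are hypothetical stand-ins for the objects that `genes_at_locus()` returns.

```python
from collections import namedtuple

# Hypothetical stand-in for the gene objects returned by PyEnsembl
Gene = namedtuple("Gene", ["name", "start", "end"])


def pick_longest(genes):
    """Return the gene spanning the most bases, or None if the list is empty."""
    if not genes:
        return None
    return max(genes, key=lambda gene: gene.end - gene.start)


genes = [
    Gene("GENE_A", 100, 900),
    Gene("GENE_B", 200, 5_000),
    Gene("GENE_C", 4_000, 4_500),
]
print(pick_longest(genes).name)  # GENE_B
```

If two genes tie on length, `max` returns the first one it encounters, so the result then depends on the order of the list.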