# SCNA analysis step 2: Get locations of genes

- Input: SCNA table
- Output: SCNA table with locations (in base pairs)
- Steps:
    1. From the SCNA table, get a list of all the unique values in the "gene" column
    2. From Ensembl, look up each gene's chromosome, start position, and end position (in base pairs), and store each in its own column.
    3. Create a column "first", defined by:
        - `np.where(scna["start"] > scna["end"], scna["end"], scna["start"])`
        - That is, take the smaller of the two coordinates: if a gene's end comes before its start, use the end; otherwise use the start. This lets us sort all genes by location using a single column.
    4. Create a column "last", defined by:
        - `np.where(scna["end"] > scna["start"], scna["end"], scna["start"])`
        - That is, take the larger of the two coordinates: if a gene's end comes after its start, use the end; otherwise use the start.
    5. Join these columns into the SCNA table, which then has the columns: cancer_type, Patient_ID, gene, Database_ID, cna_val, passes, chromosome, start, end, first, last
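
Steps 3 and 4 amount to an element-wise minimum and maximum of the two coordinates. The following is a minimal sketch on a made-up table (the gene names and coordinates are illustrative, not real data). It uses `np.minimum` and `np.maximum` as an alternative to the `np.where` expressions above; for these inputs the two approaches give the same result, including leaving missing values as NaN.

```python
import numpy as np
import pandas as pd

# Toy table; gene names and coordinates are made up for illustration.
toy = pd.DataFrame({
    "gene": ["A", "B", "C"],
    "start": [100.0, 500.0, np.nan],
    "end": [200.0, 400.0, np.nan],
})

# "first" is the smaller coordinate and "last" is the larger one.
# NaN in either input propagates to the result.
toy = toy.assign(
    first=np.minimum(toy["start"], toy["end"]),
    last=np.maximum(toy["start"], toy["end"]),
)

print(toy)
```

Sorting by `["chromosome", "first"]` would then order genes by position regardless of how start and end were reported.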


# TODO: After Chelsie finishes the better way to get location data, start using that

- Also, we should get the arm, not just the chromosome

## Setup

In [1]:
import cptac
import pandas as pd
import numpy as np
import pyensembl
import datetime
import os

TIME_START = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')

# Note: If you are running this yourself, you will need to first run
# the step 1 notebook (01_mark_scnas) and sub in the appropriate output
# file name for STEP1_FILE_NAME. We didn't include the output files in
# the repository because they exceed GitHub's 100 MB per file limit.

STEP1_DIR = "01_outputs"
STEP1_FILE_NAME = "scna_cutoff_0.2_20200706_092210.tsv.gz"
STEP1_FILE_PATH = os.path.join(STEP1_DIR, STEP1_FILE_NAME)

STEP2_DIR = "02_outputs"
os.makedirs(STEP2_DIR, exist_ok=True)

STEP2_FILE_PATH = os.path.join(STEP2_DIR, f"locations_{TIME_START}_from_{STEP1_FILE_NAME}")

In [2]:
print(STEP2_FILE_PATH)

02_outputs/locations_scna_cutoff_0.2_20200706_092210.tsv.gz


In [3]:
cnas = pd.read_csv(STEP1_FILE_PATH, sep="\t", dtype={"Database_ID": 'O'})

In [4]:
print(f"{cnas.shape[0]:,}")

25,460,266


In [5]:
# Make sure we have the Ensembl data downloaded
# The most recent release is 100, but 99 is still
# recent (Jan 2020), and PyEnsembl only supports
# up to 99 right now.
ensembl = pyensembl.EnsemblRelease(99)

try:
    ensembl.genes() # If this fails, we need to download the data again.
except ValueError as e:
    if str(e).startswith("Missing genome data file from "):
        ensembl.download()
        ensembl.index()
    else:
        raise e from None

## Get gene locations
First we'll get the unique gene name and Database_ID pairs in the data, so we only look up each gene once.

In [6]:
unique_genes = cnas[["gene", "Database_ID"]].drop_duplicates()

In [7]:
print(f"{unique_genes.shape[0]:,}")

90,383


Then, we'll get the location data for all those genes.

In [8]:
chrs = []
starts = []
ends = []

for idx in unique_genes.index:
    db_id = unique_genes.loc[idx, "Database_ID"]
   
    if pd.notnull(db_id):
        try:
            info = ensembl.gene_by_id(db_id)
        except ValueError as e:
            if str(e).startswith("Gene not found: "):
                pass # Fall through to the next try/except and attempt lookup by name instead of ID
            else:
                raise e from None
        else:
            chrs.append(info.contig)
            starts.append(info.start)
            ends.append(info.end)
            continue
            
    # We get to this try/except either if Database_ID is null,
    # or if nothing was found for it. That way, if we don't find
    # anything with the Database_ID, we try again with the gene name.
    # It appears that some genes have old names that are out of date,
    # such as LSMD1. If we want to get even better coverage, we could
    # try querying HGNC with old gene names if the lookup below returns nothing.
    try:
        info = ensembl.genes_by_name(unique_genes.loc[idx, "gene"])
    except ValueError as e:
        if str(e).startswith("No results found for query"):
            chrs.append(np.nan)
            starts.append(np.nan)
            ends.append(np.nan)
        else:
            raise e from None
    else:
        chrs.append(info[0].contig)
        starts.append(info[0].start)
        ends.append(info[0].end)
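
The ID-then-name fallback in the loop above can also be expressed as a small helper function. The sketch below is not the notebook's code: `locate_gene`, `fake_by_id`, `fake_by_name`, and the lookup tables are hypothetical names with made-up coordinates. The two stub functions stand in for `ensembl.gene_by_id` and `ensembl.genes_by_name` and only mimic raising `ValueError` on a miss, so the example runs without PyEnsembl installed.

```python
import math
from collections import namedtuple

# Stand-in for the gene objects used above; only the three
# attributes read in this notebook are modeled.
Gene = namedtuple("Gene", ["contig", "start", "end"])

# Hypothetical lookup tables with made-up coordinates.
BY_ID = {"ENSG_FAKE_1": Gene("1", 100, 200)}
BY_NAME = {"GENE2": [Gene("2", 300, 900)]}

def fake_by_id(db_id):
    """Stub for a lookup by ID; raises ValueError on a miss."""
    try:
        return BY_ID[db_id]
    except KeyError:
        raise ValueError(f"Gene not found: {db_id}") from None

def fake_by_name(name):
    """Stub for a lookup by name; returns a list of matches."""
    try:
        return BY_NAME[name]
    except KeyError:
        raise ValueError(f"No results found for query {name}") from None

def locate_gene(db_id, name, by_id=fake_by_id, by_name=fake_by_name):
    """Return (chromosome, start, end), trying the ID first, then the name."""
    if db_id is not None:
        try:
            gene = by_id(db_id)
            return gene.contig, gene.start, gene.end
        except ValueError:
            pass  # Fall through to the name lookup
    try:
        gene = by_name(name)[0]  # Take the first match, as in the loop above
    except ValueError:
        return math.nan, math.nan, math.nan
    return gene.contig, gene.start, gene.end

print(locate_gene("ENSG_FAKE_1", "GENE1"))   # found by ID
print(locate_gene(None, "GENE2"))            # found by name
print(locate_gene("ENSG_MISSING", "GENE3"))  # not found at all
```

Note one simplification: this sketch treats any `ValueError` as a miss, whereas the loop above inspects the error message and re-raises unexpected errors. It also checks `db_id is not None` rather than `pd.notnull(db_id)`.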

And now we'll add that location data to the unique_genes table.

In [9]:
unique_genes = unique_genes.assign(
    chromosome=chrs,
    start=starts,
    end=ends
)

Quick check of the fraction of genes for which we found no location data

In [10]:
pd.isnull(unique_genes["start"]).sum() / unique_genes.shape[0]

0.08377681643671929
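
As a side note, the same fraction can be computed with `Series.isna().mean()`, since the mean of a boolean column is the share of `True` values. A minimal sketch on a made-up table (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy table with one missing start position out of four genes.
toy = pd.DataFrame({
    "gene": ["A", "B", "C", "D"],
    "start": [100.0, np.nan, 300.0, 400.0],
})

# Count of missing values divided by the number of rows...
frac_by_sum = toy["start"].isna().sum() / toy.shape[0]

# ...equals the mean of the boolean mask.
frac_by_mean = toy["start"].isna().mean()

print(frac_by_sum, frac_by_mean)
```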

Now we'll create the "first" and "last" columns, as noted above.

In [11]:
unique_genes = unique_genes.assign(
    first=np.where(unique_genes["start"] > unique_genes["end"], unique_genes["end"], unique_genes["start"]),
    last=np.where(unique_genes["start"] > unique_genes["end"], unique_genes["start"], unique_genes["end"])
)

Finally, we'll join this into the cnas table.

In [12]:
cnas = cnas.merge(
    right=unique_genes,
    how="outer",
    on=["gene", "Database_ID"],
    validate="many_to_one"
)
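
To illustrate what `validate="many_to_one"` and the join type do here, the sketch below repeats the merge on a made-up table and adds pandas' `indicator=True` option, which records where each row came from. The table contents are illustrative only. Because the right-hand table is built from the key pairs of the left-hand table, every row in this toy example matches on both sides, so the outer join returns the same rows a left join would.

```python
import pandas as pd

# Toy per-patient table; values are made up for illustration.
left = pd.DataFrame({
    "Patient_ID": ["P1", "P2", "P1"],
    "gene": ["A", "A", "B"],
    "Database_ID": ["ID_A", "ID_A", "ID_B"],
})

# One row per (gene, Database_ID) pair, derived from the left table.
right = left[["gene", "Database_ID"]].drop_duplicates().assign(
    start=[100.0, 500.0],
)

merged = left.merge(
    right,
    how="outer",
    on=["gene", "Database_ID"],
    validate="many_to_one",  # Raises MergeError if right-hand keys repeat
    indicator=True,          # Adds a "_merge" column: left_only/right_only/both
)

print(merged["_merge"].value_counts())
```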

Now we'll keep only the rows for which we found location data.

In [13]:
cnas = cnas[pd.notnull(cnas["start"])]

In [14]:
cnas.to_csv(STEP2_FILE_PATH, index=False, compression="gzip", sep="\t")

In [15]:
cnas.head(10)

| | Patient_ID | gene | Database_ID | cna_val | cancer_type | passes | chromosome | start | end | first | last |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CPT000814 | 7SK | ENSG00000232512 | -0.058 | br | False | 21 | 25095022.0 | 25103670.0 | 25095022.0 | 25103670.0 |
| 1 | CPT001846 | 7SK | ENSG00000232512 | 0.065 | br | False | 21 | 25095022.0 | 25103670.0 | 25095022.0 | 25103670.0 |
| 2 | X01BR001 | 7SK | ENSG00000232512 | 1.036 | br | True | 21 | 25095022.0 | 25103670.0 | 25095022.0 | 25103670.0 |
| 3 | X01BR008 | 7SK | ENSG00000232512 | 0.09 | br | False | 21 | 25095022.0 | 25103670.0 | 25095022.0 | 25103670.0 |
| 4 | X01BR009 | 7SK | ENSG00000232512 | 0.375 | br | True | 21 | 25095022.0 | 25103670.0 | 25095022.0 | 25103670.0 |
| 5 | X01BR010 | 7SK | ENSG00000232512 | 0.211 | br | True | 21 | 25095022.0 | 25103670.0 | 25095022.0 | 25103670.0 |
| 6 | X01BR015 | 7SK | ENSG00000232512 | -0.086 | br | False | 21 | 25095022.0 | 25103670.0 | 25095022.0 | 25103670.0 |
| 7 | X01BR017 | 7SK | ENSG00000232512 | 0.192 | br | False | 21 | 25095022.0 | 25103670.0 | 25095022.0 | 25103670.0 |
| 8 | X01BR018 | 7SK | ENSG00000232512 | -0.038 | br | False | 21 | 25095022.0 | 25103670.0 | 25095022.0 | 25103670.0 |
| 9 | X01BR020 | 7SK | ENSG00000232512 | 1.117 | br | True | 21 | 25095022.0 | 25103670.0 | 25095022.0 | 25103670.0 |
