# Preprocess Gene Annotations

This notebook creates a table of gene annotations by:
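1. Collecting all Ensembl IDs used in the data sets processed by agora-data-tools (the earlier approach of querying Biomart for every Ensembl ID is deprecated)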
2. Querying MyGene for annotation about those IDs
3. Querying Ensembl for the most recent Ensembl release for each ID
4. Building a permalink to the Ensembl archive page for each ID

This gene annotation table is read in by `agoradataprocessing/process.py` to be used in the `gene_info` transformation. 

***Note:*** *This notebook is exploratory and should eventually be converted to a Python script that is run through an automated process.*

## Installation requirements

#### Linux / Windows / Mac

Install R: https://cran.r-project.org/

Install Python and agora-data-tools following the instructions in this repository's README. This notebook assumes it is being run from the same `pipenv` virtual environment as agora-data-tools. 

Then install the following packages using `pip`:
```
pip install rpy2 mygene
```

#### Note for Macs with Apple silicon (M1 and newer chips, 2020 onward)

Install as above, but make sure that your R installation is the arm64 version (R-4.X.X-arm64.pkg) so that the architecture matches what pip is using. 
You may also need to install an older version of `rpy2` on the Mac:
```
pip install rpy2==3.5.12
```

In [None]:
from rpy2.robjects import r
import pandas as pd
import mygene
import numpy as np
import requests
import agoradatatools.etl.utils as utils
import agoradatatools.etl.extract as extract
import preprocessing_utils

r(
    'if (!require("BiocManager", character.only = TRUE)) { install.packages("BiocManager") }'
)
r('if (!require("biomaRt")) { BiocManager::install("biomaRt") }')

r.library("biomaRt")

ensembl_ids_filename = "../../output/ensembl_id_list.txt"
archive_filename = "../../output/ensembl_archive_list.csv"
config_filename = "../../../../config.yaml"

# Part 1: Get gene annotation data

## [Deprecated] Query Biomart for a list of all Ensembl IDs in the database of human genes

Here we use the R library `biomaRt`. There is no canonical Python library with the features we need for this script. 

*We no longer get all genes from BioMart, so this section is unused. The code is here in case we need it again.*

In [None]:
"""
ensembl_ids_df = preprocessing_utils.r_query_biomart()
ensembl_ids_df = preprocessing_utils.filter_hasgs(
    df=ensembl_ids_df, chromosome_name_column="chromosome_name"
)
print(str(ensembl_ids_df.shape[0]) + " genes remaining after HASG filtering.")
"""

## Get Ensembl IDs from data sets that will be processed by agora-data-tools

Loop through all data sets in the config file to get all Ensembl IDs used in every data set. Exclude `gene_metadata` since that's the file we are building, and `druggability` since that data is deprecated.

In [None]:
file_ensembl_list = preprocessing_utils.get_all_adt_ensembl_ids(
    config_filename=config_filename,
    exclude_files=["gene_metadata", "druggability"],
    token=None,
)
print("")
print(str(len(file_ensembl_list)) + " Ensembl IDs found.")
print(file_ensembl_list[0:5])

Create a data frame with these IDs so it can be merged with the MyGene query results below.

In [None]:
ensembl_ids_df = pd.DataFrame({"ensembl_gene_id": file_ensembl_list})

""" Removed due to no longer getting genes from BioMart, but saving code
# Add Ensembl IDs that are in the files but not in the biomart result
missing = set(file_ensembl_list) - set(ensembl_ids_df["ensembl_gene_id"])
print(
    str(len(missing))
    + " genes from the data files are missing from Biomart results and will be added."
)

missing_df = pd.DataFrame({"ensembl_gene_id": list(missing), "chromosome_name": ""})
ensembl_ids_df = pd.concat([ensembl_ids_df, missing_df])
"""

ensembl_ids_df = ensembl_ids_df.dropna(subset=["ensembl_gene_id"])
print(len(ensembl_ids_df))

In [None]:
# Write to a file to save the list of IDs
ensembl_ids_df.to_csv(
    path_or_buf=ensembl_ids_filename, sep="\t", header=False, index=False
)

## Get info on each gene from mygene

In [None]:
mg = mygene.MyGeneInfo()

mygene_output = mg.getgenes(
    ensembl_ids_df["ensembl_gene_id"],
    fields=["symbol", "name", "summary", "type_of_gene", "alias"],
    as_dataframe=True,
)

mygene_output.index.rename("ensembl_gene_id", inplace=True)
mygene_output.head()

In [None]:
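# The "notfound" column may be absent when every ID is matched,
# so guard against a KeyError before counting
if "notfound" in mygene_output.columns:
    not_found = mygene_output["notfound"].eq(True)
else:
    not_found = pd.Series(False, index=mygene_output.index)

print("Annotations found for " + str((~not_found).sum()) + " genes.")
print("No annotations found for " + str(not_found.sum()) + " genes.")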

# Part 2: Clean the data

## Join and standardize columns / values

For consistency with the `agora-data-tools` transform process, this step uses the standardization functions from `agoradatatools.etl.utils`.

In [None]:
gene_table_merged = pd.merge(
    left=ensembl_ids_df,
    right=mygene_output,
    how="left",
    on="ensembl_gene_id",
    validate="many_to_many",
)

gene_table_merged = utils.standardize_column_names(gene_table_merged)
gene_table_merged = utils.standardize_values(gene_table_merged)

print(gene_table_merged.shape)
gene_table_merged.head()

## Fix alias field

Fix `NaN` values in the `alias` field and make sure every alias value is a list, not a string.
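
As a rough illustration of what this normalization involves, the sketch below uses a hypothetical helper, `to_alias_list`, written in plain Python. It is not the implementation of `preprocessing_utils.standardize_list_item`, which is not shown in this notebook. It only assumes that missing values should become empty lists and that a single alias arrives as a bare string.

```python
import math


def to_alias_list(value):
    """Hypothetical stand-in: coerce one alias value into a list of strings."""
    # Missing values (None or float NaN) become an empty list
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return []
    # A single alias arrives as a bare string
    if isinstance(value, str):
        return [value]
    # Anything else is assumed to already be list-like
    return list(value)


print(to_alias_list(float("nan")))
print(to_alias_list("ABC1"))
print(to_alias_list(["ABC1", "XYZ2"]))
```

Wrapping a string explicitly matters because calling `list()` on a string would split it into individual characters.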

In [None]:
gene_table_merged["alias"] = gene_table_merged["alias"].apply(
    preprocessing_utils.standardize_list_item
)

## Remove duplicate Ensembl IDs from the list

Duplicates in the list typically have the same Ensembl ID but different gene symbols. This usually happens when a single Ensembl ID maps to multiple Entrez IDs in the NCBI database. For every set of duplicated rows with the same Ensembl ID, we remove all rows but the first row in the set, and the symbols and aliases of the removed rows get added to the "alias" field of the first row.
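
The merge rule described above can be sketched in plain Python. This is only an illustration under stated assumptions: `merge_duplicate_rows` is a hypothetical function operating on a list of dictionaries with made-up IDs and symbols, not the implementation of `preprocessing_utils.merge_duplicate_ensembl_ids`, which works on the data frame and is not shown here.

```python
def merge_duplicate_rows(rows):
    """Keep the first row per Ensembl ID and fold later symbols/aliases into it."""
    merged = {}
    for row in rows:
        gene_id = row["ensembl_gene_id"]
        if gene_id not in merged:
            # First occurrence: keep the row, copying the alias list
            merged[gene_id] = {**row, "alias": list(row["alias"])}
            continue
        first = merged[gene_id]
        # Later occurrence: its symbol and aliases become aliases of the first row
        for name in [row["symbol"], *row["alias"]]:
            if name and name != first["symbol"] and name not in first["alias"]:
                first["alias"].append(name)
    return list(merged.values())


# Made-up example rows, for illustration only
rows = [
    {"ensembl_gene_id": "GENE_1", "symbol": "AAA", "alias": ["A1"]},
    {"ensembl_gene_id": "GENE_1", "symbol": "BBB", "alias": ["A1", "B1"]},
    {"ensembl_gene_id": "GENE_2", "symbol": "CCC", "alias": []},
]
deduplicated = merge_duplicate_rows(rows)
print(deduplicated)
```

In this sketch the first row's symbol is kept as the primary symbol, and names already present are skipped so that the alias list contains no repeats.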

In [None]:
# For printing only
dupes = gene_table_merged["ensembl_gene_id"].duplicated()
dupe_ids = gene_table_merged.loc[dupes, "ensembl_gene_id"]
print(
    gene_table_merged.loc[
        gene_table_merged["ensembl_gene_id"].isin(dupe_ids),
        ["ensembl_gene_id", "symbol", "alias"],
    ]
)

# Remove duplicates
gene_table_merged = preprocessing_utils.merge_duplicate_ensembl_ids(gene_table_merged)

In [None]:
print(str(len(dupe_ids.drop_duplicates())) + " duplicated genes have been processed.")
print(gene_table_merged.shape)
print(gene_table_merged.loc[gene_table_merged["ensembl_gene_id"].isin(dupe_ids), "alias"])

# Part 3: Create Ensembl archive permalinks

## Get a table of Ensembl archive URLs

This is where we need to use the R biomaRt library specifically, instead of any of the available Python interfaces to Biomart, to get a table of Ensembl release versions and their corresponding archive URLs. 

In [None]:
archive_df = r.listEnsemblArchives()
archive_df.to_csvfile(path=archive_filename, row_names=False, quote=False)

print(archive_df)

## Query Ensembl for each gene's version

Ensembl's REST API accepts at most 1000 genes per request, so the IDs are queried in batches of 1000.
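
A minimal sketch of the batching pattern is shown below. It assumes nothing about how `preprocessing_utils.query_ensembl_version_api` is actually written: `query_in_batches` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real HTTP request so the example runs offline.

```python
def query_in_batches(ensembl_ids, fetch_batch, batch_size=1000):
    """Call fetch_batch on consecutive slices of at most batch_size IDs."""
    results = []
    for start in range(0, len(ensembl_ids), batch_size):
        batch = ensembl_ids[start : start + batch_size]
        results.extend(fetch_batch(batch))
    return results


# Stand-in for the real API call: records how large each batch was
batch_sizes = []


def fake_fetch(batch):
    batch_sizes.append(len(batch))
    return [{"id": gene_id} for gene_id in batch]


ids = [f"GENE{n}" for n in range(2500)]
results = query_in_batches(ids, fake_fetch)
print(batch_sizes)  # → [1000, 1000, 500]
```

Passing the request function in as an argument keeps the slicing logic separate from the network code, which makes the batching easy to check without calling the API.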

In [None]:
versions = preprocessing_utils.query_ensembl_version_api(
    ensembl_ids=gene_table_merged["ensembl_gene_id"].tolist()
)

versions.tail()

In [None]:
versions.groupby("release").size()

In [None]:
# Check that all IDs are the same between the result and the gene table
print(len(versions["id"]))
print(len(gene_table_merged))
print(
    all(versions["id"].isin(gene_table_merged["ensembl_gene_id"]))
    and all(gene_table_merged["ensembl_gene_id"].isin(versions["id"]))
)

In [None]:
# Make sure everything is GRCh38, not GRCh37
all(versions["assembly"] == "GRCh38")

## Create permalinks based on archive version

**Not all of these versions have an archive.** For a release without an archive, we fall back to the closest earlier archived release, but the resulting link isn't guaranteed to work.
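
The fallback rule and the link format can be illustrated with a small standalone sketch. Everything here is an assumption made for illustration: the `ARCHIVE_URLS` mapping uses placeholder URLs rather than real Ensembl archive addresses, and `closest_archived_release` and `build_permalink` are hypothetical names. Only the `/Homo_sapiens/Gene/Summary?db=core;g=` path comes from this notebook.

```python
from bisect import bisect_right

# Hypothetical archive table: release number -> archive base URL (placeholder URLs)
ARCHIVE_URLS = {
    95: "https://archive-95.example.org",
    100: "https://archive-100.example.org",
    104: "https://archive-104.example.org",
}


def closest_archived_release(release, archived_releases):
    """Return the largest archived release <= release, or None if there is none."""
    ordered = sorted(archived_releases)
    position = bisect_right(ordered, release)
    return ordered[position - 1] if position > 0 else None


def build_permalink(gene_id, release, archive_urls=ARCHIVE_URLS):
    """Build a gene summary link on the closest archive, or "" if none applies."""
    closest = closest_archived_release(release, archive_urls)
    if closest is None:
        return ""
    return archive_urls[closest] + "/Homo_sapiens/Gene/Summary?db=core;g=" + gene_id


print(build_permalink("GENE_A", 104))  # release 104 is archived, so it is used directly
print(build_permalink("GENE_B", 98))   # falls back to release 95
print(build_permalink("GENE_C", 90))   # no earlier archive, so the link is empty
```

Returning an empty string when no earlier archive exists keeps that case explicit, so it can be caught by a later check that every gene has a URL.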

In [None]:
archive_table = pd.read_csv(archive_filename)

# Remove GRCh37 from the archive list
archive_table = archive_table[archive_table["version"] != "GRCh37"].reset_index()

archive_table["numeric_version"] = archive_table["version"].astype(int)


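def closest_release(release, archive_table):
    """Return `release` if it is archived, else the closest earlier archived release."""
    archived = archive_table["numeric_version"]
    # `release in archive_table` would test the column names, so check the values
    if release in archived.values:
        return release

    return max([V for V in archived if V <= release])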

In [None]:
versions["closest_release"] = 0

releases = versions["release"].drop_duplicates().astype(int)

# Only have to call closest_release once per version, instead of >70k times
for release in releases:
    versions.loc[versions["release"] == str(release), "closest_release"] = (
        closest_release(release, archive_table)
    )

versions.groupby("closest_release").size()

In [None]:
versions["permalink"] = ""

for i in versions.index:
    match = archive_table["numeric_version"] == versions.at[i, "closest_release"]
    # Select the matching URL directly; to_string() never returns an empty
    # string, so the previous length check could not detect a missing match
    urls = archive_table.loc[match, "url"]
    if len(urls) > 0:
        versions.at[i, "permalink"] = (
            urls.iloc[0] + "/Homo_sapiens/Gene/Summary?db=core;g=" + versions.at[i, "id"]
        )

versions.head()

In [None]:
versions[versions["closest_release"] < 100].head()

In [None]:
print(versions["permalink"][0])
print(versions["permalink"][25])

In [None]:
# Does every gene have an associated URL?
url_base_len = len(archive_table["url"][0]) + 1
all([len(url) > url_base_len for url in versions["permalink"]])

# Part 4: Add permalinks to the gene table

In [None]:
# Select and rename in one step to avoid an in-place rename on a slice
versions = versions[["id", "release", "possible_replacement", "permalink"]].rename(
    columns={"id": "ensembl_gene_id", "release": "ensembl_release"}
)

gene_table_merged = pd.merge(
    left=gene_table_merged,
    right=versions,
    how="left",
    on="ensembl_gene_id",
    validate="one_to_one",
)

print(gene_table_merged.shape)
gene_table_merged.head()

### Final cleanup
Each `possible_replacement` entry is either an empty list or a list of dictionaries. For non-empty entries, the Ensembl IDs need to be pulled out into a list of strings.

Then remove unneeded columns.

In [None]:
gene_table_merged["possible_replacement"] = gene_table_merged[
    "possible_replacement"
].apply(lambda pr: pr if len(pr) == 0 else [x["stable_id"] for x in pr])

gene_table_merged = gene_table_merged[
    [
        "ensembl_gene_id",
        "name",
        "alias",
        "summary",
        "symbol",
        "type_of_gene",
        "ensembl_release",
        "possible_replacement",
        "permalink",
    ]
]

gene_table_merged

### Write to a file
This will get uploaded to Synapse as [syn25953363](https://www.synapse.org/#!Synapse:syn25953363).

In [None]:
gene_table_merged = gene_table_merged.sort_values(by="ensembl_gene_id").reset_index(
    drop=True
)
gene_table_merged.to_feather("../../output/gene_table_merged_GRCh38.p14.feather")