# Accessing Ensembl sequences from GraphQL and refget

The following Python notebook details how to access Ensembl sequences from refget after resolving a query against the Ensembl GraphQL service.

## Use of async

The code below uses `async` and `await` throughout because it executes within a Python notebook, where top-level `await` is available. It can be edited to run in a synchronous environment. Please consult the [`gql`](https://gql.readthedocs.io/en/latest/async/async_usage.html) documentation for the steps needed to adapt it.
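
Outside a notebook there is usually no running event loop, so a coroutine has to be driven explicitly. The following is a minimal sketch using the standard library's `asyncio.run()`. Here `demo_query()` and `run_sync()` are hypothetical names, `demo_query()` stands in for an async GraphQL call, and no network request is made.

```python
import asyncio


async def demo_query():
    # Hypothetical stand-in for an async GraphQL call; returns a fixed payload
    await asyncio.sleep(0)
    return {"genomes": [{"genome_id": "example-id"}]}


def run_sync(coro):
    # asyncio.run() creates an event loop, runs the coroutine to completion
    # and closes the loop. It raises RuntimeError when a loop is already
    # running, which is why a notebook uses a bare `await` instead.
    return asyncio.run(coro)


result = run_sync(demo_query())
print(result["genomes"][0]["genome_id"])
```

In a plain script, the same pattern would apply to the notebook's own `main()` coroutine.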

In [1]:
from gql import Client, gql
from gql.transport.aiohttp import AIOHTTPTransport
import requests
import refget

## Global variables

We define the global variables used by the rest of the notebook here: the target GraphQL and refget servers, plus the assembly and gene we wish to search for. We have specified `GRCh38.p14` (accession `GCA_000001405.29`) and the gene `JAG1`.

In [2]:
# Default endpoints
ensembl_graphql = "https://beta.ensembl.org/data/graphql"
ensembl_refget = "https://beta.ensembl.org/data/refget"

# If set to True we use the requests-based helper
# to communicate with refget. Otherwise we use
# the Python refget client library
use_requests_refget = True

# Entities to query for
gca = "GCA_000001405.29"
gene_symbol = "JAG1"

## Code

The following functions query GraphQL using the `gql` library and retrieve sequences from refget, either through a custom function called `resolve_sequence()` or via the official [refget Python client](https://pypi.org/project/refget/).

In [3]:
async def get_genome_id(session, gca):
    genome_id_graphql_query = gql(
        """query GetGenomeID($assembly_accession_id: String!) {
    genomes(
      by_keyword: {
        assembly_accession_id: $assembly_accession_id
      }) 
    {
      genome_id
    }
  }"""
    )

    genome_id_result = await session.execute(
        genome_id_graphql_query, variable_values={"assembly_accession_id": gca}
    )
    return genome_id_result["genomes"][0]["genome_id"]


async def get_transcripts(session, symbol, genome_id):
    # Gene query for all transcripts associated with a gene symbol
    gene_query = gql(
        """
      query GetTranscriptSequences($symbol: String!, $genome_id: String!) {
        genes(
          by_symbol: {symbol: $symbol, genome_id: $genome_id}
        ) {
          stable_id
          unversioned_stable_id
          version
          symbol
          transcripts {
            stable_id
            symbol
            type
            metadata {
              mane {
                value
              }
              canonical {
                value
              }
            }
            product_generating_contexts {
              product_type
              cdna {
                sequence {
                  checksum
                }
              }
              cds {
                sequence {
                  checksum
                }
              }
              product {
                stable_id
                sequence {
                  checksum
                }
              }
            }
          }
        }
      }
  """
    )
    result = await session.execute(
        gene_query, variable_values={"genome_id": genome_id, "symbol": symbol}
    )
    return result


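def resolve_sequence(checksum, refget_url=ensembl_refget, start=None, end=None):
    # Fetch a sequence (or sub-sequence) from refget as plain text
    url = f"{refget_url}/sequence/{checksum}"
    params = {}
    # Compare against None so that a start or end of 0 is still sent
    if start is not None:
        params["start"] = start
    if end is not None:
        params["end"] = end
    r = requests.get(url, headers={"Accept": "text/plain"}, params=params)
    r.raise_for_status()
    return r.text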


def refget_client_resolve_sequence(checksum, refget_url=ensembl_refget, start=None, end=None):
    rgc = refget.RefGetClient(f"{refget_url}/sequence/")
    sequence = rgc.refget(checksum, start=start, end=end)
    return sequence
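
To make the `start` and `end` parameters concrete, here is a small offline sketch of how a sub-sequence request URL can be assembled. `build_refget_url()` is a hypothetical helper, not part of this notebook or of the refget client. It assumes the refget convention that `start` is 0-based and inclusive while `end` is exclusive.

```python
from urllib.parse import urlencode


def build_refget_url(checksum, refget_url, start=None, end=None):
    # Hypothetical helper: builds the URL only, no request is sent
    params = {}
    # Test against None so that start=0 is kept in the query string
    if start is not None:
        params["start"] = start
    if end is not None:
        params["end"] = end
    url = f"{refget_url}/sequence/{checksum}"
    return f"{url}?{urlencode(params)}" if params else url


example_url = build_refget_url(
    "d8d95cac5218b333fba43ed6ebe4b017",
    "https://beta.ensembl.org/data/refget",
    start=0,
    end=10,
)
print(example_url)
```

Under that convention, `start=0, end=10` asks for the first ten bases of the sequence.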


## Main method

Our main method gets the `genome_id`, then the associated transcripts, loops through them to find the canonical transcript, and prints a summary of each sequence.

In [4]:
async def main():
    # Select your transport with a defined url endpoint
    transport = AIOHTTPTransport(url=ensembl_graphql)

    # Create a GraphQL client using the defined transport. We have defined an async session
    async with Client(transport=transport, fetch_schema_from_transport=True) as session:
        genome_id = await get_genome_id(session, gca)
        genes = await get_transcripts(session, symbol=gene_symbol, genome_id=genome_id)

        for gene in genes["genes"]:
            for transcript in gene["transcripts"]:
                if transcript["metadata"]["canonical"]:
                    for pgc in transcript["product_generating_contexts"]:
                        # seq_type avoids shadowing the built-in type()
                        for seq_type in ("cdna", "cds", "product"):
                            checksum = pgc[seq_type]["sequence"]["checksum"]
                            if use_requests_refget:
                                seq = resolve_sequence(checksum)
                            else:
                                seq = refget_client_resolve_sequence(checksum)
                            # Products carry their own stable_id; otherwise use the transcript's
                            stable_id = transcript["stable_id"]
                            if "stable_id" in pgc[seq_type]:
                                stable_id = pgc[seq_type]["stable_id"]
                            print(
                                f"{stable_id}\t{seq_type}\tlength:{len(seq)}\tretrieved_checksum:{checksum}"
                            )
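
The nested loops above depend on the shape of the GraphQL response. The sketch below walks a hand-written response that uses the same field names as the query, without contacting any server. `canonical_checksums()` is a hypothetical helper, the stable IDs and checksums are placeholders, and the sketch assumes a non-canonical transcript reports `canonical` as `None`.

```python
def canonical_checksums(response):
    # Hypothetical helper: collect (stable_id, sequence_type, checksum)
    # for every canonical transcript in a response-shaped dictionary
    rows = []
    for gene in response["genes"]:
        for transcript in gene["transcripts"]:
            canonical = transcript["metadata"]["canonical"]
            if not (canonical and canonical["value"]):
                continue
            for pgc in transcript["product_generating_contexts"]:
                for seq_type in ("cdna", "cds", "product"):
                    entry = pgc[seq_type]
                    # Only the product entry has its own stable_id
                    stable_id = entry.get("stable_id", transcript["stable_id"])
                    rows.append((stable_id, seq_type, entry["sequence"]["checksum"]))
    return rows


# Placeholder data mirroring the structure requested by the query
mock_response = {
    "genes": [
        {
            "transcripts": [
                {
                    "stable_id": "TX1.1",
                    "metadata": {"canonical": {"value": True}},
                    "product_generating_contexts": [
                        {
                            "cdna": {"sequence": {"checksum": "aaa"}},
                            "cds": {"sequence": {"checksum": "bbb"}},
                            "product": {
                                "stable_id": "P1.1",
                                "sequence": {"checksum": "ccc"},
                            },
                        }
                    ],
                },
                {
                    "stable_id": "TX2.1",
                    "metadata": {"canonical": None},
                    "product_generating_contexts": [],
                },
            ]
        }
    ]
}

rows = canonical_checksums(mock_response)
for row in rows:
    print("\t".join(row))
```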


In [5]:
# Run the main method defined above.
# Note the use of await, which makes the notebook wait for main()
# to finish executing
await main()

ENST00000254958.10	cdna	length:5940	retrieved_checksum:d8d95cac5218b333fba43ed6ebe4b017
ENST00000254958.10	cds	length:3657	retrieved_checksum:27ec2b034fb386c69f48653d2f18daf2
ENSP00000254958.4	product	length:1218	retrieved_checksum:84706f0ee2a4d23ae4050036e62dac5d
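
The three reported lengths can be cross-checked against each other. The sketch below parses the output shown above and assumes that the CDS includes the terminating stop codon while the protein sequence does not, so the CDS should be three bases longer than three times the protein length.

```python
# Output lines copied from the notebook run above
output = (
    "ENST00000254958.10\tcdna\tlength:5940\tretrieved_checksum:d8d95cac5218b333fba43ed6ebe4b017\n"
    "ENST00000254958.10\tcds\tlength:3657\tretrieved_checksum:27ec2b034fb386c69f48653d2f18daf2\n"
    "ENSP00000254958.4\tproduct\tlength:1218\tretrieved_checksum:84706f0ee2a4d23ae4050036e62dac5d\n"
)

# Map each sequence type to its reported length
lengths = {}
for line in output.splitlines():
    stable_id, seq_type, length_field, checksum_field = line.split("\t")
    lengths[seq_type] = int(length_field.split(":")[1])

# One codon per amino acid plus one stop codon (assumption stated above)
expected_cds = 3 * (lengths["product"] + 1)
# Bases of the cDNA that lie outside the CDS
non_coding = lengths["cdna"] - lengths["cds"]

print(expected_cds == lengths["cds"], non_coding)
```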
