# Analyze SARS-CoV-2 Spike Glycoprotein Variants
[Work in progress]

This notebook queries the Knowledge Graph for variants in the SARS-CoV-2 S gene and analyzes mutations that may affect ACE2 binding to the Spike glycoprotein, as well as mutations at the polybasic cleavage site.

In [1]:
import pandas as pd
from py2neo import Graph

In [2]:
pd.options.display.max_rows = None  # display all rows
pd.options.display.max_columns = None  # display all columns

In [3]:
graph = Graph("bolt://132.249.238.185:7687", user="reader", password="demo")

## Analysis of missense mutations in the SARS-CoV-2 Spike glycoprotein

![Spike protein](../../docs/Spikeprotein.png)

**a.** ACE2 receptor binding domain of the SARS-CoV-2 spike protein. **b.** Polybasic cleavage site with three predicted O-glycosylation sites.

Reference: Andersen, K.G., Rambaut, A., Lipkin, W.I. et al. The proximal origin of SARS-CoV-2. Nat Med 26, 450–452 (2020). [doi:10.1038/s41591-020-0820-9](https://doi.org/10.1038/s41591-020-0820-9)

### Query KG for the S gene and its gene product

In [4]:
query = """
MATCH (g:Gene{name:'S'})-[:ENCODES]->(p:Protein) 
RETURN g.name AS geneName, p.accession AS proteinAccession, p.sequence AS proteinSequence
"""
df = graph.run(query).to_data_frame()

In [5]:
df.head()

Unnamed: 0,geneName,proteinAccession,proteinSequence
0,S,uniprot:P0DTC2,MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSS...


In [6]:
seq = df['proteinSequence'].iloc[0]

In [7]:
print("Sequence of the Spike glycoprotein:")
print(seq)

Sequence of the Spike glycoprotein:
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVK

### Find strains with mutations in the Spike glycoprotein receptor-binding domain (RBD)

Six amino acids in the RBD of the spike protein have been shown to be critical for binding to ACE2 receptors: 

L455, F486, Q493, S494, N501, Y505

Residue numbers are 1-based, but Python uses zero-based indices, so we subtract 1 to find the position in the sequence.

In [8]:
print(seq[455-1], seq[486-1], seq[493-1], seq[494-1], seq[501-1], seq[505-1])

L F Q S N Y
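
The index conversion can be wrapped in a small helper that also confirms each expected residue. This is a minimal sketch: `residue_at` and `check_residues` are hypothetical helpers, not part of the notebook, and `demo_seq` is only the first ten residues of the sequence printed above, so the block runs without a database connection.

```python
def residue_at(seq, position):
    """Return the amino acid at a 1-based residue position."""
    if not 1 <= position <= len(seq):
        raise IndexError(f"position {position} is outside 1..{len(seq)}")
    return seq[position - 1]


def check_residues(seq, labels):
    """Map labels such as 'L455' to True/False: does seq carry that residue there?"""
    return {label: residue_at(seq, int(label[1:])) == label[0] for label in labels}


# Hypothetical demo input: the first ten residues of the Spike sequence.
demo_seq = "MFVFLVLLPL"
print(check_residues(demo_seq, ["M1", "F2", "P9", "A5"]))
# {'M1': True, 'F2': True, 'P9': True, 'A5': False}
```

With the full sequence in `seq`, `check_residues(seq, ["L455", "F486", "Q493", "S494", "N501", "Y505"])` would perform the same check as the cell above, one label at a time.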


#### The following variants are missense mutations at these six RBD residues

In [9]:
query = """
MATCH (g:Gene{name:'S'})-[:HAS_VARIANT]->(v:Variant{variantConsequence:'missense_variant'})
<-[:HAS_VARIANT]-(s:Strain)-[:FOUND_IN]->(l:Location)
WHERE v.proteinPosition IN [455, 486, 493, 494, 501, 505] AND s.hostTaxonomyId = 'taxonomy:9606'
RETURN DISTINCT v.geneVariant as geneVariant, v.proteinVariant AS proteinVariant, 
s.name AS strainName, s.collectionDate AS collectionDate, l.name AS location, 
labels(l) AS locationType, s.id AS strainId
"""
graph.run(query).to_data_frame()

Unnamed: 0,geneVariant,proteinVariant,strainName,collectionDate,location,locationType,strainId
0,S:c.1480Tca>Cca,QHD43416.1:p.494S>P,hCoV-19/Sweden/20-07044/2020,2020-04-05,Uppsala,"[Location, Admin1]",https://www.gisaid.org/EPI_ISL_445230
1,S:c.1478cAa>cTa,QHD43416.1:p.493Q>L,hCoV-19/USA/WI-GMF-M00004/2020,2020-05-24,Vernon County,"[Location, Admin2]",https://www.gisaid.org/EPI_ISL_455581
2,S:c.1478cAa>cTa,QHD43416.1:p.493Q>L,hCoV-19/USA/WI-GMF-04314/2020,2020-05-25,Vernon County,"[Location, Admin2]",https://www.gisaid.org/EPI_ISL_455578
3,S:c.1501Aat>Tat,QHD43416.1:p.501N>Y,hCoV-19/USA/NY-NYUMC836/2020,2020-04-21,Queens,"[Location, City]",https://www.gisaid.org/EPI_ISL_456109


#### Find variants around these critical residues within a 5 amino acid window (+/- 2 residues)

In [10]:
query = """
MATCH (g:Gene{name:'S'})-[:HAS_VARIANT]->(v:Variant{variantConsequence:'missense_variant'})
<-[:HAS_VARIANT]-(s:Strain)-[:FOUND_IN]->(l:Location)
WHERE v.proteinPosition IN [453,454,455,456,457, 484,485,486,487,488, 491,492,493,494,495,496,
                            499,500,501,502,503,504,505,506,507] 
      AND s.hostTaxonomyId = 'taxonomy:9606'
RETURN DISTINCT v.geneVariant as geneVariant, v.proteinVariant AS proteinVariant, 
s.name AS strainName, s.collectionDate AS collectionDate, s.sex AS sex, s.age as age, 
l.name AS location, labels(l) AS locationType, s.id AS strainId
"""
df = graph.run(query).to_data_frame()
df.fillna('', inplace=True)
df

Unnamed: 0,geneVariant,proteinVariant,strainName,collectionDate,sex,age,location,locationType,strainId
0,S:c.1480Tca>Cca,QHD43416.1:p.494S>P,hCoV-19/Sweden/20-07044/2020,2020-04-05,male,66.0,Uppsala,"[Location, Admin1]",https://www.gisaid.org/EPI_ISL_445230
1,S:c.1478cAa>cTa,QHD43416.1:p.493Q>L,hCoV-19/USA/WI-GMF-M00004/2020,2020-05-24,,,Vernon County,"[Location, Admin2]",https://www.gisaid.org/EPI_ISL_455581
2,S:c.1478cAa>cTa,QHD43416.1:p.493Q>L,hCoV-19/USA/WI-GMF-04314/2020,2020-05-25,,,Vernon County,"[Location, Admin2]",https://www.gisaid.org/EPI_ISL_455578
3,S:c.1450Gaa>Aaa,QHD43416.1:p.484E>K,hCoV-19/England/NOTT-1115C0/2020,2020-05-18,,,England,"[Location, Admin1]",https://www.gisaid.org/EPI_ISL_461895
4,S:c.1507Gtt>Ttt,QHD43416.1:p.503V>F,hCoV-19/USA/NY-NYUMC287/2020,2020-04-05,,,Brooklyn,"[Location, City]",https://www.gisaid.org/EPI_ISL_428801
5,S:c.1453Ggt>Agt,QHD43416.1:p.485G>S,hCoV-19/England/LOND-D4A25/2020,2020-04-10,,,England,"[Location, Admin1]",https://www.gisaid.org/EPI_ISL_444170
6,S:c.1501Aat>Tat,QHD43416.1:p.501N>Y,hCoV-19/USA/NY-NYUMC836/2020,2020-04-21,,,Queens,"[Location, City]",https://www.gisaid.org/EPI_ISL_456109
7,S:c.1452gaA>gaC,QHD43416.1:p.484E>D,hCoV-19/Thailand/Trang_5008/2020,2020-03-23,female,21.0,Trang,"[Location, Admin1]",https://www.gisaid.org/EPI_ISL_455588
8,S:c.1452gaA>gaC,QHD43416.1:p.484E>D,hCoV-19/Thailand/Trang_5008/2020,2020-03-23,female,21.0,Trang,"[Location, City]",https://www.gisaid.org/EPI_ISL_455588
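
The position list in the query above can also be generated rather than typed by hand, which helps avoid transcription slips. The sketch below assumes a symmetric window; `window_positions` is a hypothetical helper name.

```python
def window_positions(centers, flank=2):
    """Return the sorted, de-duplicated positions within +/- flank of each center."""
    positions = set()
    for center in centers:
        positions.update(range(center - flank, center + flank + 1))
    return sorted(positions)


# The six critical RBD residues, each with a +/- 2 window.
rbd_window = window_positions([455, 486, 493, 494, 501, 505])
print(rbd_window)
```

Overlapping windows (for example around 493 and 494) are merged automatically because the positions are collected in a set.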


### Find strains with mutations in the polybasic cleavage site
This site has three predicted O-glycosylation sites:
    
S673, T678, S686

In [11]:
print(seq[673-1], seq[678-1], seq[686-1])

S T S


In [12]:
query = """
MATCH (g:Gene{name:'S'})-[:HAS_VARIANT]->(v:Variant{variantConsequence:'missense_variant'})
<-[:HAS_VARIANT]-(s:Strain)-[:FOUND_IN]->(l:Location)
WHERE v.proteinPosition IN [673, 678, 686] 
      AND s.hostTaxonomyId = 'taxonomy:9606'
RETURN DISTINCT v.geneVariant as geneVariant, v.proteinVariant AS proteinVariant, 
s.name AS strainName, s.collectionDate AS collectionDate, l.name AS location, 
labels(l) AS locationType, s.id AS strainId
"""
graph.run(query).to_data_frame()

Unnamed: 0,geneVariant,proteinVariant,strainName,collectionDate,location,locationType,strainId
0,S:c.2033aCt>aTt,QHD43416.1:p.678T>I,hCoV-19/USA/CA-SR0181/2020,2020-03-24,San Diego,"[Location, City]",https://www.gisaid.org/EPI_ISL_437596
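
The position queries in this notebook differ only in the list of positions, so the list could be passed as a query parameter, the same mechanism the regional query further down uses for `$region`. This is a sketch, not the notebook's own code: `POSITION_QUERY` and `run_position_query` are hypothetical names, and the function assumes `graph` is the py2neo `Graph` object created at the top of the notebook.

```python
POSITION_QUERY = """
MATCH (g:Gene{name:'S'})-[:HAS_VARIANT]->(v:Variant{variantConsequence:'missense_variant'})
<-[:HAS_VARIANT]-(s:Strain)-[:FOUND_IN]->(l:Location)
WHERE v.proteinPosition IN $positions AND s.hostTaxonomyId = $host
RETURN DISTINCT v.geneVariant AS geneVariant, v.proteinVariant AS proteinVariant,
s.name AS strainName, s.collectionDate AS collectionDate, l.name AS location,
labels(l) AS locationType, s.id AS strainId
"""


def run_position_query(graph, positions, host="taxonomy:9606"):
    """Run the missense-variant query for the given 1-based protein positions."""
    # Keyword arguments are sent to the server as Cypher parameters,
    # so the query text stays the same for every position list.
    cursor = graph.run(POSITION_QUERY, positions=list(positions), host=host)
    return cursor.to_data_frame()
```

Under these assumptions, `run_position_query(graph, [673, 678, 686])` would issue the same request as the cell above.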


#### Find variants around these residues within a 5 amino acid window (+/- 2 residues)

In [13]:
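# NOTE: the +/- 2 window around T678 is 676-680, but the list below contains 670
# where 680 would be expected; the output shown reflects the list as written.
query = """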
MATCH (g:Gene{name:'S'})-[:HAS_VARIANT]->(v:Variant{variantConsequence:'missense_variant'})
<-[:HAS_VARIANT]-(s:Strain)-[:FOUND_IN]->(l:Location)
WHERE v.proteinPosition IN [671,672,673,674,675, 676,677,678,679,670, 684,685,686,687,688] 
      AND s.hostTaxonomyId = 'taxonomy:9606'
RETURN DISTINCT v.geneVariant as GeneVariant, v.proteinVariant AS proteinVariant, 
s.name AS strainName, s.collectionDate AS collectionDate, s.sex AS sex, s.age as age, l.name AS location, 
labels(l) AS locationType, s.id AS strainId
ORDER BY v.proteinVariant, s.collectionDate
"""
df = graph.run(query).to_data_frame()
df.fillna('', inplace=True)
df

Unnamed: 0,GeneVariant,proteinVariant,strainName,collectionDate,sex,age,location,locationType,strainId
0,S:c.2008Ata>Tta,QHD43416.1:p.670I>L,hCoV-19/England/NORW-E8DF8/2020,2020-04-26,,,England,"[Location, Admin1]",https://www.gisaid.org/EPI_ISL_453558
1,S:c.2008Ata>Tta,QHD43416.1:p.670I>L,hCoV-19/England/NORW-EB28D/2020,2020-05-03,,,England,"[Location, Admin1]",https://www.gisaid.org/EPI_ISL_457413
2,S:c.2008Ata>Tta,QHD43416.1:p.670I>L,hCoV-19/England/NORW-EB33F/2020,2020-05-03,,,England,"[Location, Admin1]",https://www.gisaid.org/EPI_ISL_457419
3,S:c.2008Ata>Tta,QHD43416.1:p.670I>L,hCoV-19/England/NORW-EBD40/2020,2020-05-06,,,England,"[Location, Admin1]",https://www.gisaid.org/EPI_ISL_457504
4,S:c.2008Ata>Tta,QHD43416.1:p.670I>L,hCoV-19/England/NORW-EBC07/2020,2020-05-06,,,England,"[Location, Admin1]",https://www.gisaid.org/EPI_ISL_457490
5,S:c.2008Ata>Tta,QHD43416.1:p.670I>L,hCoV-19/England/NORW-EC626/2020,2020-05-07,,,England,"[Location, Admin1]",https://www.gisaid.org/EPI_ISL_457542
6,S:c.2008Ata>Tta,QHD43416.1:p.670I>L,hCoV-19/England/NORW-EBC34/2020,2020-05-08,,,England,"[Location, Admin1]",https://www.gisaid.org/EPI_ISL_457492
7,S:c.2008Ata>Tta,QHD43416.1:p.670I>L,hCoV-19/England/NORW-EA1AF/2020,2020-05-09,,,England,"[Location, Admin1]",https://www.gisaid.org/EPI_ISL_457332
8,S:c.2010atA>atG,QHD43416.1:p.670I>M,hCoV-19/Scotland/CVR2973/2020,2020-04-22,,,Scotland,"[Location, Admin1]",https://www.gisaid.org/EPI_ISL_448213
9,S:c.2010atA>atG,QHD43416.1:p.670I>M,hCoV-19/Scotland/CVR3171/2020,2020-04-27,,,Scotland,"[Location, Admin1]",https://www.gisaid.org/EPI_ISL_449201
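
The `proteinVariant` strings in these tables encode the protein accession, the residue position, and the reference and alternate amino acids. The sketch below splits them apart so that positions can be compared against the residues of interest. The pattern is an assumption inferred only from the rows shown in this notebook (single-residue substitutions such as `QHD43416.1:p.678T>I`); other variant types are not handled, and `parse_protein_variant` is a hypothetical helper.

```python
import re

# Assumed format: <accession>:p.<position><ref>><alt>, e.g. QHD43416.1:p.678T>I
PROTEIN_VARIANT = re.compile(
    r"^(?P<accession>[^:]+):p\.(?P<position>\d+)(?P<ref>[A-Z*])>(?P<alt>[A-Z*])$"
)


def parse_protein_variant(text):
    """Split a substitution string into accession, 1-based position, ref and alt residue."""
    match = PROTEIN_VARIANT.match(text)
    if match is None:
        raise ValueError(f"unrecognized protein variant: {text!r}")
    return {
        "accession": match.group("accession"),
        "position": int(match.group("position")),
        "ref": match.group("ref"),
        "alt": match.group("alt"),
    }


print(parse_protein_variant("QHD43416.1:p.678T>I"))
# {'accession': 'QHD43416.1', 'position': 678, 'ref': 'T', 'alt': 'I'}
```

The parsed `ref` residue can then be compared with `seq[position - 1]` as a consistency check between the variant annotation and the reference sequence.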


### Find strains with mutations in the SARS-CoV-2 "S" gene at different locations

In [14]:
query = """
MATCH (g:Gene{name:'S'})-[:HAS_VARIANT]->(v:Variant{variantConsequence:'missense_variant'})
<-[:HAS_VARIANT]-(s:Strain)-[:FOUND_IN]->(l:Location)-[:IN*]->(r:USRegion{name:$region}) 
RETURN v.name AS geneVariant, l.name AS location, r.name AS region
ORDER BY v.name, l.name
"""

#### Variants in the US West Region

In [15]:
region = 'West Region'
west = graph.run(query, region=region).to_data_frame()
west.head(10)

Unnamed: 0,geneVariant,location,region
0,S:c.1042Gca>Aca,Washington,West Region
1,S:c.1042Gca>Aca,Washington,West Region
2,S:c.1151cCt>cTt,King County,West Region
3,S:c.1151cCt>cTt,San Diego,West Region
4,S:c.1153Act>Gct,Washington County,West Region
5,S:c.1183Gtc>Atc,San Diego,West Region
6,S:c.1214gAt>gTt,Arizona,West Region
7,S:c.1240Caa>Gaa,Arizona,West Region
8,S:c.1240Caa>Gaa,Arizona,West Region
9,S:c.1240Caa>Gaa,Arizona,West Region


#### Variants in the US Northeast Region

In [16]:
region = 'Northeast Region'
northeast = graph.run(query, region=region).to_data_frame()
northeast.head(10)

Unnamed: 0,geneVariant,location,region
0,S:c.1076aGc>aAc,Queens,Northeast Region
1,S:c.1150Cct>Tct,Queens,Northeast Region
2,S:c.1150Cct>Tct,Queens,Northeast Region
3,S:c.13Ctt>Ttt,Brooklyn,Northeast Region
4,S:c.13Ctt>Ttt,Brooklyn,Northeast Region
5,S:c.13Ctt>Ttt,Brooklyn,Northeast Region
6,S:c.13Ctt>Ttt,Hudson County,Northeast Region
7,S:c.13Ctt>Ttt,Manhattan,Northeast Region
8,S:c.13Ctt>Ttt,Massachusetts,Northeast Region
9,S:c.13Ctt>Ttt,Massachusetts,Northeast Region


#### Find variants in common

In [17]:
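# Inner join on geneVariant keeps only the variants present in both regions
in_common = pd.merge(west, northeast, on='geneVariant')
# Each variant can match many row pairs, so reduce to one row per variant
in_common[['geneVariant']].drop_duplicates().head(25)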

Unnamed: 0,geneVariant
0,S:c.13Ctt>Ttt
340,S:c.1558Gca>Tca
348,S:c.162ttG>ttT
350,S:c.1841gAt>gGt
1345444,S:c.205Cat>Tat
1345446,S:c.2576aCt>aTt
1345450,S:c.3301Cac>Tac
1345464,S:c.3352Gac>Tac
1345465,S:c.3788cCa>cTa
1345471,S:c.433Tac>Cac
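
The large index values in the output above (for example 1345444) most likely arise because `geneVariant` is not unique in either frame: an inner merge pairs every West row with every Northeast row that shares the same variant. The sketch below illustrates the row-count arithmetic with plain Python. The input lists are toy placeholders, not data from the Knowledge Graph, and both function names are hypothetical.

```python
from collections import Counter


def merged_row_count(left_keys, right_keys):
    """Number of rows an inner join on a non-unique key produces."""
    left, right = Counter(left_keys), Counter(right_keys)
    # Each shared key contributes (rows on the left) * (rows on the right).
    return sum(left[key] * right[key] for key in left.keys() & right.keys())


def common_keys(left_keys, right_keys):
    """Keys present on both sides, without any row multiplication."""
    return sorted(set(left_keys) & set(right_keys))


# Toy placeholder keys: "v1" occurs 3 times on the left and 2 times on the right.
west_keys = ["v1", "v1", "v1", "v2"]
northeast_keys = ["v1", "v1", "v3"]
print(merged_row_count(west_keys, northeast_keys))  # 6
print(common_keys(west_keys, northeast_keys))       # ['v1']
```

In pandas, calling `drop_duplicates()` on the `geneVariant` column of each frame before `pd.merge` would give the same list of shared variants without building the intermediate product.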


#### Find variants unique to the West region

In [18]:
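# Left join: West rows without a Northeast match get NaN in the *_y columns
in_west_only = pd.merge(west, northeast, on='geneVariant', how='left')
in_west_only.fillna('', inplace=True)
# Keep only the West variants that had no match in the Northeast
in_west_only.query("location_y == ''", inplace=True)
in_west_only[['geneVariant','location_x', 'region_x']].drop_duplicates().head(25)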

Unnamed: 0,geneVariant,location_x,region_x
0,S:c.1042Gca>Aca,Washington,West Region
2,S:c.1151cCt>cTt,King County,West Region
3,S:c.1151cCt>cTt,San Diego,West Region
4,S:c.1153Act>Gct,Washington County,West Region
5,S:c.1183Gtc>Atc,San Diego,West Region
6,S:c.1214gAt>gTt,Arizona,West Region
7,S:c.1240Caa>Gaa,Arizona,West Region
13,S:c.1241cAa>cCa,Arizona,West Region
354,S:c.1424gCc>gTc,Arizona,West Region
356,S:c.1426Ggt>Agt,Idaho,West Region
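
The left-merge-and-filter pattern above is an anti-join. One alternative, sketched here, uses the `indicator` option of `DataFrame.merge` instead of filling missing values with an empty-string sentinel. `only_in_left` is a hypothetical helper and the two demo frames are toy placeholders, not results from the Knowledge Graph.

```python
import pandas as pd


def only_in_left(left, right, key):
    """Rows of `left` whose `key` value does not occur in `right` (an anti-join)."""
    # De-duplicate the right-hand keys so the merge cannot multiply rows.
    right_keys = right[[key]].drop_duplicates()
    merged = left.merge(right_keys, on=key, how="left", indicator=True)
    # indicator=True adds a `_merge` column: 'left_only' marks unmatched rows.
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")


# Toy placeholder frames with the same column names as the notebook's results.
west_demo = pd.DataFrame({"geneVariant": ["v1", "v2", "v2"], "location": ["a", "b", "c"]})
northeast_demo = pd.DataFrame({"geneVariant": ["v2", "v3"], "location": ["d", "e"]})
print(only_in_left(west_demo, northeast_demo, "geneVariant"))
```

Swapping the two arguments gives the variants found only in the second frame, so one helper covers both directions of the comparison.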


#### Find variants unique to the Northeast region

In [19]:
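# Left join: Northeast rows without a West match get NaN in the *_y columns
in_northeast_only = pd.merge(northeast, west, on='geneVariant', how='left')
in_northeast_only.fillna('', inplace=True)
# Keep only the Northeast variants that had no match in the West
in_northeast_only.query("location_y == ''", inplace=True)
in_northeast_only[['geneVariant','location_x', 'region_x']].drop_duplicates().head(25)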

Unnamed: 0,geneVariant,location_x,region_x
0,S:c.1076aGc>aAc,Queens,Northeast Region
1,S:c.1150Cct>Tct,Queens,Northeast Region
343,S:c.1501Aat>Tat,Queens,Northeast Region
344,S:c.1507Gtt>Ttt,Brooklyn,Northeast Region
353,S:c.1586aAg>aGg,Pittsburgh,Northeast Region
356,S:c.1662gaG>gaT,Manhattan,Northeast Region
357,S:c.1708Gct>Act,Queens,Northeast Region
358,S:c.1709gCt>gTt,Massachusetts,Northeast Region
359,S:c.1749gaG>gaT,Manhattan,Northeast Region
1345455,S:c.1853aCa>aTa,Manhattan,Northeast Region
