# Analysis of UniProt variant review release

**Test protein:**
* UniProt ID: [Q9BYX4](https://www.uniprot.org/uniprot/Q9BYX4)
* Ensembl: [IFIH1 (ENSG00000115267)](https://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000115267;r=2:162267074-162318684)

**Conclusions:**

1. Consequences appear to be captured for only one transcript. I may be misreading this, but on Ensembl this gene has 7 transcripts, 4 of which are protein coding. Searching for the gene on UniProt returns 3 entries ([Q9BYX4](https://www.uniprot.org/uniprot/Q9BYX4), [A0A3B3IRK8](https://www.uniprot.org/uniprot/A0A3B3IRK8), [A0A7P0Z4A9](https://www.uniprot.org/uniprot/A0A7P0Z4A9)), each associated with a different transcript, except Q9BYX4, which is associated with two: ENST00000421365 and ENST00000649979. Is it therefore reasonable to expect consequences to be calculated for both of these transcripts?
2. For certain multiallelic variants (e.g. `rs778858888`) the consequences are not fully captured, while for others (e.g. `rs553228814`) they are properly represented. Is this because synonymous variants are not included? (The consequences in the file are `missense`, `frameshift` and `stop gained`.) Is a consequence-based filter applied?
3. The key `genomicLocation` is not a genomic location per se; it holds an HGVS variant identifier (which does contain the genomic location). I think it would make sense to provide both this identifier and the genomic location, as Ensembl does:

Instead of:
```xml
    <genomicLocation>NC_000002.12:g.162310899C&gt;G</genomicLocation>
```
something like:
```xml
    <genomicLocation seq_region_name="2" start="162310899">
```
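
Until the file carries explicit coordinates, they can be recovered from the HGVS string on the consumer side. The following is a minimal sketch, not part of the original analysis: the function name `parse_hgvs_substitution` and the output keys are illustrative, and it only handles simple substitutions on the primary chromosome accessions (`NC_000001` to `NC_000024`). Deletions, duplications and repeat notations return `None`.

```python
import re

# Matches simple genomic substitutions such as NC_000002.12:g.162310899C>G.
# Deletions, duplications and repeats (e.g. "g.162278213dup") are not handled.
HGVS_SUBSTITUTION = re.compile(
    r'^NC_0000(?P<chrom>\d{2})\.\d+:g\.'
    r'(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$'
)


def parse_hgvs_substitution(hgvs):
    """Return chromosome, position and alleles, or None if not a substitution."""
    match = HGVS_SUBSTITUTION.match(hgvs)
    if match is None:
        return None
    number = int(match.group('chrom'))
    # RefSeq numbers the human chromosomes 1-22, then X (23) and Y (24).
    chrom = {23: 'X', 24: 'Y'}.get(number, str(number))
    return {
        'seq_region_name': chrom,
        'start': int(match.group('pos')),
        'ref': match.group('ref'),
        'alt': match.group('alt'),
    }


parsed = parse_hgvs_substitution('NC_000002.12:g.162310899C>G')
print(parsed)
```

Note that `xml.etree.ElementTree` already decodes `&gt;` to `>` when reading the element text, so the parsed value can be passed in directly.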

In [16]:
import sys
import requests
from itertools import chain
import pandas as pd

import xml.etree.ElementTree as ET


In [17]:
uniprot_file = 'Q9BYX4_mod.xml'

tree = ET.parse(uniprot_file)
assert isinstance(tree, ET.ElementTree)

root = tree.getroot()
assert isinstance(root, ET.Element)


In [147]:
variants = []

# General annotation:
uniprot_id = root.find('accession').text
hgnc_symbol = root.find('geneName').text

for variant in root.findall('feature'):

    db_refs = [{'id': x.get('id'), 'source': x.get('type')} for x in variant.findall('dbReference')]
    location = variant.find('location').find('position').get('position') if variant.find('location').find('position') is not None else None

    # Loop through all variants of this feature:
    for v in variant.findall('variant'):
        var_data = {
            'location': location,
            'uniprotId': uniprot_id,
            'geneSymbol': hgnc_symbol,
            'db_refs': db_refs,

            # Polyphen annotation:
            'polyPhen_type': v.find("variantPrediction[@predAlgorithmNameType='PolyPhen']").get('predictionValType') if v.find("variantPrediction[@predAlgorithmNameType='PolyPhen']") is not None else None,
            'polyPhen_score': v.find("variantPrediction[@predAlgorithmNameType='PolyPhen']").get('score') if v.find("variantPrediction[@predAlgorithmNameType='PolyPhen']") is not None else None,

            # SIFT annotation:
            'sift_type': v.find("variantPrediction[@predAlgorithmNameType='SIFT']").get('predictionValType') if v.find("variantPrediction[@predAlgorithmNameType='SIFT']") is not None else None,
            'sift_score': v.find("variantPrediction[@predAlgorithmNameType='SIFT']").get('score') if v.find("variantPrediction[@predAlgorithmNameType='SIFT']") is not None else None,

            # Genomic location:
            'HGVS': v.find('genomicLocation').text,
            'cytogenicBand': v.find('cytogeneticBand').text,
            'variantLocation': [{'id': x.get('seqId'), 'loc': x.get('loc')} for x in v.findall('variantLocation')],
            
            # Effects:
            'codon': v.find('codon').text if v.find('codon')  is not None else None,
            'consequence': v.find('consequenceType').text if v.find('consequenceType')  is not None else None,
            
            # Phenotype annotation:
            'somaticStatus': v.find('somaticStatus').text if v.find('somaticStatus')  is not None else None,
        }
        
        
        variants.append(var_data)

xf = pd.DataFrame(variants)
xf.loc[xf.codon.notna()]

Unnamed: 0,location,uniprotId,geneSymbol,db_refs,polyPhen_type,polyPhen_score,sift_type,sift_score,HGVS,cytogenicBand,variantLocation,codon,consequence,somaticStatus
0,3,Q9BYX4,IFIH1,"[{'id': 'rs778858888', 'source': 'ExAC'}]",benign,0.015,tolerated - low confidence,0.22,NC_000002.12:g.162318299A>T,2q24.2,"[{'id': 'ENST00000649979', 'loc': 'p.Asn3Lys'}...",AAT/AAA,missense,false
1,4,Q9BYX4,IFIH1,"[{'id': 'RCV000729683', 'source': 'ClinVar'}, ...",benign,0.001,tolerated - low confidence,0.4,NC_000002.12:g.162318297C>T,2q24.2,"[{'id': 'ENST00000649979', 'loc': 'p.Gly4Glu'}...",GGG/GAG,missense,false
2,5,Q9BYX4,IFIH1,"[{'id': 'rs1485039163', 'source': 'TOPMed'}]",benign,0.0,tolerated - low confidence,0.18,NC_000002.12:g.162318294T>C,2q24.2,"[{'id': 'ENST00000649979', 'loc': 'p.Tyr5Cys'}...",TAT/TGT,missense,false
3,6,Q9BYX4,IFIH1,"[{'id': 'rs1576243763', 'source': 'Ensembl'}]",probably damaging,0.936,deleterious - low confidence,0.0,NC_000002.12:g.162318291G>T,2q24.2,"[{'id': 'ENST00000649979', 'loc': 'p.Ser6Tyr'}...",TCC/TAC,missense,false
4,9,Q9BYX4,IFIH1,"[{'id': 'rs375179358', 'source': 'ESP'}, {'id'...",benign,0.0,tolerated,0.56,NC_000002.12:g.162318283C>T,2q24.2,"[{'id': 'ENST00000649979', 'loc': 'p.Glu9Lys'}...",GAG/AAG,missense,false
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
944,,Q9BYX4,IFIH1,"[{'id': 'rs777469992', 'source': 'ExAC'}, {'id...",benign,0.0,tolerated,1.0,NC_000002.12:g.162267225C>T,2q24.2,"[{'id': 'ENST00000649979', 'loc': 'p.Cys1018Ty...",TGC/TAC,missense,false
945,,Q9BYX4,IFIH1,"[{'id': 'rs755779570', 'source': 'ExAC'}, {'id...",benign,0.228,tolerated - low confidence,0.7,NC_000002.12:g.162267218T>G,2q24.2,"[{'id': 'ENST00000649979', 'loc': 'p.Leu1020Ph...",TTA/TTC,missense,false
946,,Q9BYX4,IFIH1,"[{'id': 'rs1345435216', 'source': 'TOPMed'}]",benign,0.118,tolerated - low confidence,0.41,NC_000002.12:g.162267219A>G,2q24.2,"[{'id': 'ENST00000649979', 'loc': 'p.Leu1020Se...",TTA/TCA,missense,false
947,,Q9BYX4,IFIH1,"[{'id': 'rs1385374454', 'source': 'gnomAD'}]",probably damaging,0.999,deleterious - low confidence,0.0,NC_000002.12:g.162267212A>C,2q24.2,"[{'id': 'ENST00000649979', 'loc': 'p.Ser1022Ar...",AGT/AGG,missense,false


## Checking overlapping transcript consequences

In this test we check, for a given variant, whether consequence terms have been calculated for one or more of the overlapping transcripts.
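
The cells below reference `rsId` and `transcript` columns, which the parsing cell shown earlier does not create. How they were derived is not shown in this notebook. One plausible sketch, assuming the rsID is the first cross-reference whose id starts with `rs` and the transcripts are the `variantLocation` ids starting with `ENST` (both helper names are hypothetical):

```python
def first_rs_id(db_refs):
    """Return the first dbSNP-style identifier (rs...) among the cross-references."""
    for ref in db_refs:
        ref_id = ref.get('id') or ''
        if ref_id.startswith('rs'):
            return ref_id
    return None


def transcript_ids(variant_location):
    """Return the Ensembl transcript ids (ENST...) listed for a variant."""
    return [
        loc['id'] for loc in variant_location
        if (loc.get('id') or '').startswith('ENST')
    ]


# Illustrative record: 'RCV_PLACEHOLDER' is a made-up id standing in for a
# ClinVar cross-reference; the other values are taken from the output above.
example = {
    'db_refs': [
        {'id': 'RCV_PLACEHOLDER', 'source': 'ClinVar'},
        {'id': 'rs778858888', 'source': 'ExAC'},
    ],
    'variantLocation': [{'id': 'ENST00000649979', 'loc': 'p.Asn3Lys'}],
}

rs_id = first_rs_id(example['db_refs'])
transcripts = transcript_ids(example['variantLocation'])
print(rs_id, transcripts)
```

With the DataFrame from the parsing cell, the columns could then be added with `xf.assign(rsId=xf.db_refs.map(first_rs_id), transcript=xf.variantLocation.map(transcript_ids))`.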

In [126]:
variant_id = 'rs778858888'

# Checking parsed xml data:
print(f'Number of variant entries for {variant_id}: {len(xf.loc[xf.rsId == variant_id])}')
print(xf.loc[xf.rsId == variant_id,['rsId', 'transcript', 'consequence', 'codon']])

# Checking consequences in ENSEMBL REST API:
url = f'https://rest.ensembl.org/vep/human/id/{variant_id}?content-type=application/json'
consequences = requests.get(url)
consequences_data = consequences.json()
consequences_df = (
    pd.DataFrame(consequences_data[0]['transcript_consequences'])
    .loc[lambda df: df.gene_symbol == hgnc_symbol, ['consequence_terms', 'gene_id', 'transcript_id', 'codons']]
    .assign(alleles = consequences_data[0]['allele_string'])
)

consequences_df

Number of variant entries for rs778858888: 1
          rsId         transcript consequence    codon
0  rs778858888  [ENST00000649979]    missense  AAT/AAA


Unnamed: 0,consequence_terms,gene_id,transcript_id,codons,alleles
0,[synonymous_variant],ENSG00000115267,ENST00000421365,aaT/aaC,A/G/T
1,[missense_variant],ENSG00000115267,ENST00000421365,aaT/aaA,A/G/T
8,[synonymous_variant],ENSG00000115267,ENST00000648433,aaT/aaC,A/G/T
9,[missense_variant],ENSG00000115267,ENST00000648433,aaT/aaA,A/G/T
10,[synonymous_variant],ENSG00000115267,ENST00000649979,aaT/aaC,A/G/T
11,[missense_variant],ENSG00000115267,ENST00000649979,aaT/aaA,A/G/T
12,[upstream_gene_variant],ENSG00000115267,ENST00000679938,,A/G/T
13,[upstream_gene_variant],ENSG00000115267,ENST00000679938,,A/G/T


So apparently not all overlapping transcripts are covered for `rs778858888`:
* In the XML file: `ENST00000649979`
* From the Ensembl REST API: `ENST00000421365`, `ENST00000648433`, `ENST00000649979`, `ENST00000679938`
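
The comparison in the list above amounts to a set difference. A minimal sketch, with the transcript ids copied from the two outputs and a hypothetical helper name:

```python
def missing_transcripts(xml_transcripts, vep_transcripts):
    """Return transcripts reported by VEP but absent from the XML, sorted."""
    return sorted(set(vep_transcripts) - set(xml_transcripts))


# Transcript ids as they appear in the parsed XML and in the VEP response.
# VEP lists each transcript once per alternate allele, hence the repeats.
xml_transcripts = ['ENST00000649979']
vep_transcripts = [
    'ENST00000421365', 'ENST00000421365',
    'ENST00000648433', 'ENST00000648433',
    'ENST00000649979', 'ENST00000649979',
    'ENST00000679938', 'ENST00000679938',
]

missing = missing_transcripts(xml_transcripts, vep_transcripts)
print(missing)
```

In the notebook the same inputs would come from `xf.loc[xf.rsId == variant_id, 'transcript']` (a column of lists, so it needs flattening) and `consequences_df.transcript_id`.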

Moreover, `rs778858888` is a multiallelic variant, but only the missense allele (`AAT/AAA`) is captured; the synonymous one (`aaT/aaC`) is not. What about another multiallelic variant, e.g. `rs553228814`:

In [127]:
print(xf.loc[xf.rsId == 'rs553228814',['rsId', 'transcript', 'consequence', 'codon', 'HGVS']])


           rsId         transcript consequence    codon  \
17  rs553228814  [ENST00000649979]    missense  AGG/AGT   
18  rs553228814  [ENST00000649979]    missense  AGG/AGC   

                           HGVS  
17  NC_000002.12:g.162318245C>A  
18  NC_000002.12:g.162318245C>G  
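
To check allele coverage systematically rather than by eye, the codon changes in the XML can be compared with those reported by VEP. This is a sketch under two assumptions: VEP marks the changed base with upper case (`aaT/aaC`) while the XML uses upper case throughout, so both are upper-cased before comparing, and entries without a codon (e.g. upstream variants) are skipped. The helper name is hypothetical.

```python
def uncaptured_codon_changes(xml_codons, vep_codons):
    """Return VEP codon changes that have no counterpart in the XML, sorted."""
    captured = {codon.upper() for codon in xml_codons if isinstance(codon, str)}
    reported = {codon.upper() for codon in vep_codons if isinstance(codon, str)}
    return sorted(reported - captured)


# Values for rs778858888, copied from the outputs above; None stands for the
# upstream_gene_variant rows, which carry no codon.
xml_codons = ['AAT/AAA']
vep_codons = ['aaT/aaC', 'aaT/aaA', 'aaT/aaC', 'aaT/aaA', None, None]

uncaptured = uncaptured_codon_changes(xml_codons, vep_codons)
print(uncaptured)
```

For `rs778858888` this leaves only the synonymous change, which is consistent with the suspicion that synonymous consequences are filtered out.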


In [137]:
multiallelics = xf.rsId.value_counts().loc[lambda x: x > 2].index
xf.loc[xf.rsId.isin(multiallelics), ['rsId', 'transcript', 'consequence', 'codon', 'HGVS']]

Unnamed: 0,rsId,transcript,consequence,codon,HGVS
153,rs1178412309,[ENST00000649979],missense,GGT/GCT,NC_000002.12:g.162310899C>G
154,rs1178412309,[ENST00000649979],missense,GGT/GAT,NC_000002.12:g.162310899C>T
156,rs1178412309,[ENST00000649979],missense,GGT/GTT,NC_000002.12:g.162310899C>A


In [150]:
xf.loc[xf.location.notna()]

Unnamed: 0,location,uniprotId,geneSymbol,db_refs,polyPhen_type,polyPhen_score,sift_type,sift_score,HGVS,cytogenicBand,variantLocation,codon,consequence,somaticStatus
0,3,Q9BYX4,IFIH1,"[{'id': 'rs778858888', 'source': 'ExAC'}]",benign,0.015,tolerated - low confidence,0.22,NC_000002.12:g.162318299A>T,2q24.2,"[{'id': 'ENST00000649979', 'loc': 'p.Asn3Lys'}...",AAT/AAA,missense,false
1,4,Q9BYX4,IFIH1,"[{'id': 'RCV000729683', 'source': 'ClinVar'}, ...",benign,0.001,tolerated - low confidence,0.4,NC_000002.12:g.162318297C>T,2q24.2,"[{'id': 'ENST00000649979', 'loc': 'p.Gly4Glu'}...",GGG/GAG,missense,false
2,5,Q9BYX4,IFIH1,"[{'id': 'rs1485039163', 'source': 'TOPMed'}]",benign,0.0,tolerated - low confidence,0.18,NC_000002.12:g.162318294T>C,2q24.2,"[{'id': 'ENST00000649979', 'loc': 'p.Tyr5Cys'}...",TAT/TGT,missense,false
3,6,Q9BYX4,IFIH1,"[{'id': 'rs1576243763', 'source': 'Ensembl'}]",probably damaging,0.936,deleterious - low confidence,0.0,NC_000002.12:g.162318291G>T,2q24.2,"[{'id': 'ENST00000649979', 'loc': 'p.Ser6Tyr'}...",TCC/TAC,missense,false
4,9,Q9BYX4,IFIH1,"[{'id': 'rs375179358', 'source': 'ESP'}, {'id'...",benign,0.0,tolerated,0.56,NC_000002.12:g.162318283C>T,2q24.2,"[{'id': 'ENST00000649979', 'loc': 'p.Glu9Lys'}...",GAG/AAG,missense,false
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
536,589,Q9BYX4,IFIH1,"[{'id': 'RCV000652114', 'source': 'ClinVar'}, ...",,,,,NC_000002.12:g.162278213dup,2q24.2,"[{'id': 'ENST00000649979', 'loc': 'p.Ala589fs'}]",,frameshift,false
548,596,Q9BYX4,IFIH1,"[{'id': 'RCV000768232', 'source': 'ClinVar'}, ...",,,,,NC_000002.12:g.162277662_162277672del,2q24.2,"[{'id': 'ENST00000649979', 'loc': 'p.Lys596fs'}]",,frameshift,false
630,673,Q9BYX4,IFIH1,"[{'id': 'RCV000704039', 'source': 'ClinVar'}, ...",,,,,NC_000002.12:g.162277443del,2q24.2,"[{'id': 'ENST00000649979', 'loc': 'p.Asp673fs'}]",,frameshift,false
632,674,Q9BYX4,IFIH1,"[{'id': 'RCV000419118', 'source': 'ClinVar'}, ...",,,,,NC_000002.12:g.162277436_162277439ATCT[1],2q24.2,"[{'id': 'ENST00000649979', 'loc': 'p.Arg674fs'}]",,frameshift,false


In [68]:
assoc.count()

454045

```xml
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="entryFeature">
    <xs:complexType>
      <xs:sequence>
        <xs:element type="xs:string" name="name"/>
        <xs:element type="xs:string" name="accession"/>
        <xs:element type="xs:string" name="proteinName"/>
        <xs:element type="xs:string" name="geneName"/>
        <xs:element type="xs:byte" name="proteinExistence"/>
        <xs:element name="sequence">
          <xs:complexType>
            <xs:simpleContent>
              <xs:extension base="xs:string">
                <xs:attribute type="xs:long" name="checksum"/>
                <xs:attribute type="xs:byte" name="version"/>
              </xs:extension>
            </xs:simpleContent>
          </xs:complexType>
        </xs:element>
        <xs:element type="xs:short" name="taxid"/>
        <xs:element type="xs:string" name="organismName"/>
        <xs:element name="feature" maxOccurs="unbounded" minOccurs="0">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="location">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="position" minOccurs="0">
                      <xs:complexType>
                        <xs:simpleContent>
                          <xs:extension base="xs:string">
                            <xs:attribute type="xs:short" name="position" use="optional"/>
                          </xs:extension>
                        </xs:simpleContent>
                      </xs:complexType>
                    </xs:element>
                    <xs:element name="begin" minOccurs="0">
                      <xs:complexType>
                        <xs:simpleContent>
                          <xs:extension base="xs:string">
                            <xs:attribute type="xs:short" name="position" use="optional"/>
                          </xs:extension>
                        </xs:simpleContent>
                      </xs:complexType>
                    </xs:element>
                    <xs:element name="end" minOccurs="0">
                      <xs:complexType>
                        <xs:simpleContent>
                          <xs:extension base="xs:string">
                            <xs:attribute type="xs:short" name="position" use="optional"/>
                          </xs:extension>
                        </xs:simpleContent>
                      </xs:complexType>
                    </xs:element>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
              <xs:element name="evidence" maxOccurs="unbounded" minOccurs="0">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="dbReference">
                      <xs:complexType>
                        <xs:simpleContent>
                          <xs:extension base="xs:string">
                            <xs:attribute type="xs:int" name="id" use="optional"/>
                            <xs:attribute type="xs:string" name="type" use="optional"/>
                          </xs:extension>
                        </xs:simpleContent>
                      </xs:complexType>
                    </xs:element>
                  </xs:sequence>
                  <xs:attribute type="xs:string" name="code" use="optional"/>
                </xs:complexType>
              </xs:element>
              <xs:element name="dbReference" maxOccurs="unbounded" minOccurs="0">
                <xs:complexType>
                  <xs:simpleContent>
                    <xs:extension base="xs:string">
                      <xs:attribute type="xs:string" name="id" use="optional"/>
                      <xs:attribute type="xs:string" name="type" use="optional"/>
                    </xs:extension>
                  </xs:simpleContent>
                </xs:complexType>
              </xs:element>
              <xs:element name="variant">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element type="xs:string" name="cytogeneticBand"/>
                    <xs:element type="xs:string" name="genomicLocation"/>
                    <xs:element name="variantLocation" maxOccurs="unbounded" minOccurs="0">
                      <xs:complexType>
                        <xs:simpleContent>
                          <xs:extension base="xs:string">
                            <xs:attribute type="xs:string" name="loc" use="optional"/>
                            <xs:attribute type="xs:string" name="seqId" use="optional"/>
                            <xs:attribute type="xs:string" name="source" use="optional"/>
                          </xs:extension>
                        </xs:simpleContent>
                      </xs:complexType>
                    </xs:element>
                    <xs:element type="xs:string" name="codon" minOccurs="0"/>
                    <xs:element type="xs:string" name="consequenceType"/>
                    <xs:element type="xs:string" name="wildType"/>
                    <xs:element type="xs:string" name="mutatedType"/>
                    <xs:element name="populationFrequencies" minOccurs="0">
                      <xs:complexType>
                        <xs:simpleContent>
                          <xs:extension base="xs:string">
                            <xs:attribute type="xs:string" name="populationName" use="optional"/>
                            <xs:attribute type="xs:float" name="frequency" use="optional"/>
                            <xs:attribute type="xs:string" name="source" use="optional"/>
                          </xs:extension>
                        </xs:simpleContent>
                      </xs:complexType>
                    </xs:element>
                    <xs:element name="variantPrediction" maxOccurs="unbounded" minOccurs="0">
                      <xs:complexType>
                        <xs:sequence>
                          <xs:element type="xs:string" name="source"/>
                        </xs:sequence>
                        <xs:attribute type="xs:string" name="predictionValType" use="optional"/>
                        <xs:attribute type="xs:string" name="predictorType" use="optional"/>
                        <xs:attribute type="xs:float" name="score" use="optional"/>
                        <xs:attribute type="xs:string" name="predAlgorithmNameType" use="optional"/>
                      </xs:complexType>
                    </xs:element>
                    <xs:element type="xs:string" name="somaticStatus"/>
                    <xs:element name="variantClinicalSignificance" maxOccurs="unbounded" minOccurs="0">
                      <xs:complexType>
                        <xs:sequence>
                          <xs:element type="xs:string" name="source" maxOccurs="unbounded" minOccurs="0"/>
                        </xs:sequence>
                        <xs:attribute type="xs:string" name="type" use="optional"/>
                      </xs:complexType>
                    </xs:element>
                    <xs:element name="association" maxOccurs="unbounded" minOccurs="0">
                      <xs:complexType>
                        <xs:sequence>
                          <xs:element type="xs:string" name="name"/>
                          <xs:element type="xs:string" name="description"/>
                          <xs:element name="evidence" maxOccurs="unbounded" minOccurs="0">
                            <xs:complexType>
                              <xs:sequence>
                                <xs:element name="dbReference">
                                  <xs:complexType>
                                    <xs:simpleContent>
                                      <xs:extension base="xs:string">
                                        <xs:attribute type="xs:string" name="id" use="optional"/>
                                        <xs:attribute type="xs:string" name="type" use="optional"/>
                                      </xs:extension>
                                    </xs:simpleContent>
                                  </xs:complexType>
                                </xs:element>
                              </xs:sequence>
                              <xs:attribute type="xs:string" name="code" use="optional"/>
                            </xs:complexType>
                          </xs:element>
                          <xs:element name="dbReference">
                            <xs:complexType>
                              <xs:simpleContent>
                                <xs:extension base="xs:string">
                                  <xs:attribute type="xs:int" name="id" use="optional"/>
                                  <xs:attribute type="xs:string" name="type" use="optional"/>
                                </xs:extension>
                              </xs:simpleContent>
                            </xs:complexType>
                          </xs:element>
                        </xs:sequence>
                        <xs:attribute type="xs:string" name="isDisease" use="optional"/>
                      </xs:complexType>
                    </xs:element>
                    <xs:element name="description" maxOccurs="unbounded" minOccurs="0">
                      <xs:complexType>
                        <xs:sequence>
                          <xs:element type="xs:string" name="value"/>
                          <xs:element type="xs:string" name="source"/>
                        </xs:sequence>
                      </xs:complexType>
                    </xs:element>
                  </xs:sequence>
                  <xs:attribute type="xs:string" name="sourceType" use="optional"/>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
            <xs:attribute type="xs:string" name="type" use="optional"/>
            <xs:attribute type="xs:string" name="id" use="optional"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```