# TBX1 Cohort Analysis Report

TBX1 (T-box 1) is a transcription factor crucial for the prenatal establishment of the thymus [Lindsay et al. (2001)](https://pubmed.ncbi.nlm.nih.gov/11242049/). In humans, monoallelic loss-of-function mutations in TBX1 have been associated with velo-cardio-facial, or DiGeorge, syndrome [Merscher et al. (2001)](https://pubmed.ncbi.nlm.nih.gov/11239417/). Interestingly, missense mutations that increase TBX1 activity also result in a phenotypically similar syndrome [Zweier et al. (2007)](https://pubmed.ncbi.nlm.nih.gov/17273972/), characterized by velopharyngeal insufficiency, hypoplastic thymus, immunodeficiency, hypoparathyroidism, craniofacial dysmorphia, hearing impairment and cardiac defects [Papangeli et al. (2013)](https://pubmed.ncbi.nlm.nih.gov/23799583/). Penetrance appears highly variable across TBX1 mutations: some patients present without signs of immunodeficiency, while others present with life-threatening congenital athymia requiring corrective treatment with a thymic transplant [Kreins et al. (2021)](https://pubmed.ncbi.nlm.nih.gov/33815417/).

In [1]:
import hpotk
import gpsea

store = hpotk.configure_ontology_store()
hpo = store.load_minimal_hpo(release='v2024-08-13')
print(f'Loaded HPO v{hpo.version}')
print(f"Using gpsea version {gpsea.__version__}")


Loaded HPO v2024-08-13
Using gpsea version 0.9.1


### TBX1
We used the [Matched Annotation from NCBI and EMBL-EBI (MANE)](https://www.ncbi.nlm.nih.gov/refseq/MANE/) transcript and the corresponding protein identifier for TBX1.

In [20]:
gene_symbol = 'TBX1'
mane_tx_id = 'NM_001379200.1'  # MANE Select transcript
mane_protein_id = 'NP_001366129.1'

In [21]:
from ppktstore.registry import configure_phenopacket_registry
from gpsea.preprocessing import configure_caching_cohort_creator, load_phenopackets

phenopacket_store_release = '0.1.23' 
registry = configure_phenopacket_registry()

with registry.open_phenopacket_store(release=phenopacket_store_release) as ps:
    phenopackets = tuple(ps.iter_cohort_phenopackets(gene_symbol))

cohort_creator = configure_caching_cohort_creator(hpo)

cohort, qc = load_phenopackets(
    phenopackets=phenopackets, 
    cohort_creator=cohort_creator,
)

qc.summarize()

Individuals Processed: 100%|██████████| 26/26 [00:00<00:00, 794.06individuals/s]
Validated under permissive policy


In [22]:
from gpsea.view import CohortViewer

viewer = CohortViewer(hpo)
viewer.process(cohort=cohort, transcript_id=mane_tx_id)

n,HPO Term
20,Abnormal facial shape
14,Hypoparathyroidism
13,Low-set ears
13,Velopharyngeal insufficiency
12,Hypertelorism
11,Micrognathia
9,Blepharophimosis
8,Global developmental delay
5,Sensorineural hearing impairment
5,Narrow nose

n,Disease
26,DiGeorge syndrome

n,Variant key,HGVS,Variant Class
5,22_19766631_19766632_TA_T,c.1280del (p.Tyr427PhefsTer42),FRAMESHIFT_VARIANT
4,22_19766537_19766538_CG_T,c.1185_1186delinsT (p.Gly396AlafsTer73),FRAMESHIFT_VARIANT
3,22_19766003_19766003_G_C,c.1036+1G>C (None),SPLICE_DONOR_VARIANT
3,22_19766601_19766602_TC_T,c.1250del (p.Ser417TrpfsTer52),FRAMESHIFT_VARIANT
3,22_19766671_19766694_GCCGGCCGGCGCCCTACCCGCTGC_G,c.1326_1348del (p.Pro444TrpfsTer174),FRAMESHIFT_VARIANT
2,22_19764224_19764224_C_G,c.609C>G (p.His203Gln),MISSENSE_VARIANT
1,22_19766651_19766659_CCACTATCT_C,c.1301_1308del (p.His434ArgfsTer189),FRAMESHIFT_VARIANT
1,22_19765959_19765959_G_GAACCCCGTGGC,c.994_1004dup (p.Ser336ThrfsTer47),FRAMESHIFT_VARIANT
1,22_19765921_19765921_G_A,c.955G>A (p.Gly319Ser),MISSENSE_VARIANT
1,22_19766765_19766765_C_--31bp--,c.1426_1455dup (p.Ala476_Ala485dup),INFRAME_INSERTION

Variant effect,Count
FRAMESHIFT_VARIANT,17 (65%)
MISSENSE_VARIANT,4 (15%)
SPLICE_DONOR_VARIANT,3 (12%)
INFRAME_INSERTION,1 (4%)
INFRAME_DELETION,1 (4%)


In [24]:
# The corresponding UniProt entry (O43435) has a disordered region (residues 23-72) with compositional bias at residues 37-72; the sequence length is 398.
# The API call that we use to generate the protein graphic fails because no protein is found for ID NP_001366129.1 (the MANE Select protein identifier).

In [25]:
from gpsea.view import CohortVariantViewer
cvv = CohortVariantViewer(tx_id=mane_tx_id)
cvv.process(cohort=cohort)

Count,Variant key,HGVS,Overlapping Exons,Effects
5,22_19766631_19766632_TA_T,c.1280del (p.Tyr427PhefsTer42),7,frameshift
4,22_19766537_19766538_CG_T,c.1185_1186delinsT (p.Gly396AlafsTer73),7,frameshift
3,22_19766671_19766694_GCCGGCCGGCGCCCTACCCGCTGC_G,c.1326_1348del (p.Pro444TrpfsTer174),7,frameshift
3,22_19766003_19766003_G_C,c.1036+1G>C (-),-,splice donor
3,22_19766601_19766602_TC_T,c.1250del (p.Ser417TrpfsTer52),7,frameshift
2,22_19764224_19764224_C_G,c.609C>G (p.His203Gln),3,missense
1,22_19766765_19766765_C_--31bp--,c.1426_1455dup (p.Ala476_Ala485dup),7,inframe insertion
1,22_19760998_19761055_--58bp--_C,c.173_229del (p.Arg58_Pro76del),1,inframe deletion
1,22_19765921_19765921_G_A,c.955G>A (p.Gly319Ser),6,missense
1,22_19765959_19765959_G_GAACCCCGTGGC,c.994_1004dup (p.Ser336ThrfsTer47),6,frameshift
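
The variant keys in the tables above appear to follow a `chrom_start_end_ref_alt` layout, with long alleles abbreviated as `--Nbp--`. The sketch below splits a key into those parts. The layout is inferred from the keys shown here rather than taken from the GPSEA documentation, and `VariantKey` and `parse_variant_key` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantKey:
    """Hypothetical container for the parts of a variant key string."""
    chrom: str
    start: int
    end: int
    ref: str
    alt: str

    @property
    def span(self) -> int:
        # Number of reference positions covered, assuming inclusive coordinates.
        return self.end - self.start + 1


def parse_variant_key(key: str) -> VariantKey:
    # Assumes exactly five underscore-separated fields, as in the tables above.
    chrom, start, end, ref, alt = key.split("_")
    return VariantKey(chrom, int(start), int(end), ref, alt)


most_common = parse_variant_key("22_19766631_19766632_TA_T")
long_deletion = parse_variant_key("22_19760998_19761055_--58bp--_C")
print(most_common, most_common.span)
print(long_deletion, long_deletion.span)
```

For the keys listed in this report, the span computed this way equals the length of the reference allele (2 for `TA`, 58 for `--58bp--`), which supports reading the coordinates as inclusive.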


## Genotype-Phenotype Correlation (GPC) Analysis

In [27]:
from gpsea.analysis.pcats import configure_hpo_term_analysis
from gpsea.analysis.clf import prepare_classifiers_for_terms_of_interest

analysis = configure_hpo_term_analysis(hpo)

pheno_clfs = prepare_classifiers_for_terms_of_interest(
    cohort=cohort,
    hpo=hpo,
)

In [28]:
from gpsea.model import VariantEffect
from gpsea.view import MtcStatsViewer
from gpsea.analysis.clf import monoallelic_classifier
from gpsea.analysis.predicate import variant_effect

missense = variant_effect(VariantEffect.MISSENSE_VARIANT, mane_tx_id)

missense_clf = monoallelic_classifier(
    a_predicate=missense,
    b_predicate=~missense,
    a_label="Missense",
    b_label="Other"
)

missense_result = analysis.compare_genotype_vs_phenotypes(
    cohort=cohort,
    gt_clf=missense_clf,
    pheno_clfs=pheno_clfs,
)

viewer = MtcStatsViewer()
viewer.process(missense_result)


Code,Reason,Count
HMF01,Skipping term with maximum frequency that was less than threshold 0.4,13
HMF03,Skipping term because of a child term with the same individual counts,11
HMF05,Skipping term because one genotype had zero observations,1
HMF08,Skipping general term,42
HMF09,Skipping term with maximum annotation frequency that was less than threshold 0.4,41


In [29]:
from gpsea.view import summarize_hpo_analysis

summarize_hpo_analysis(hpo=hpo, result=missense_result)

HPO term,Missense (count),Missense (percent),Other (count),Other (percent),Corrected p value,p value
Blepharophimosis [HP:0000581],4/4,100%,5/14,36%,0.477124,0.082353
Hypoparathyroidism [HP:0000829],1/4,25%,13/17,76%,0.477124,0.087719
Micrognathia [HP:0000347],4/4,100%,7/14,50%,0.477124,0.119281
Low-set ears [HP:0000369],4/4,100%,9/14,64%,0.833333,0.277778
Sensorineural hearing impairment [HP:0000407],1/1,100%,4/13,31%,0.857143,0.357143
Abnormal facial shape [HP:0001999],4/4,100%,16/21,76%,1.0,0.549407
Hypertelorism [HP:0000316],2/2,100%,10/13,77%,1.0,1.0
Abnormal location of ears [HP:0000357],4/4,100%,9/9,100%,1.0,1.0
Aplasia/Hypoplasia of the mandible [HP:0009118],4/4,100%,7/7,100%,1.0,1.0
Abnormal axial skeleton morphology [HP:0009121],4/4,100%,7/7,100%,1.0,1.0
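
The corrected p values in the table above are consistent with a Benjamini-Hochberg adjustment over the 12 tests that the summary at the end of this report lists for the Missense comparison. The sketch below is a plain-Python version of that procedure, not GPSEA's own code. It assumes that the table is sorted by p value and shows only the first 10 of the 12 tests, so the two tests not shown are padded with 1.0.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down, so that each adjusted value
    # never exceeds the adjusted value of a larger p value.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Nominal p values copied from the table above.
shown = [0.082353, 0.087719, 0.119281, 0.277778, 0.357143,
         0.549407, 1.0, 1.0, 1.0, 1.0]
n_tests = 12  # "Tests performed" for Missense vs. Other in the summary
# Assumption: the tests missing from the table have p = 1.0.
p_values = shown + [1.0] * (n_tests - len(shown))

adjusted = benjamini_hochberg(p_values)
for p, q in zip(shown, adjusted):
    print(f"{p:.6f} -> {q:.6f}")
```

The backward pass explains why Blepharophimosis, Hypoparathyroidism and Micrognathia share one corrected value: the third-ranked term gives 0.119281 × 12 / 3 ≈ 0.4771, and the two smaller p values are capped at that value.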


In [30]:
from gpsea.analysis.predicate import variant_key

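# Most common variant in the cohort. The label follows the numbering c.1253del; p.Tyr418PhefsTer42,
# whereas the MANE transcript annotation reported above is c.1280del (p.Tyr427PhefsTer42).
Tyr418PhefsTer42 = variant_key("22_19766631_19766632_TA_T")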

Tyr418PhefsTer42_clf = monoallelic_classifier(
    a_predicate=Tyr418PhefsTer42,
    b_predicate=~Tyr418PhefsTer42,
    a_label="Tyr418PhefsTer42",
    b_label="Other"
)

Tyr418PhefsTer42_result = analysis.compare_genotype_vs_phenotypes(
    cohort=cohort,
    gt_clf=Tyr418PhefsTer42_clf,
    pheno_clfs=pheno_clfs,
)

summarize_hpo_analysis(hpo=hpo, result=Tyr418PhefsTer42_result)


HPO term,Tyr418PhefsTer42 (count),Tyr418PhefsTer42 (percent),Other (count),Other (percent),Corrected p value,p value
Global developmental delay [HP:0001263],5/5,100%,3/20,15%,0.014756,0.001054
Narrow nose [HP:0000460],5/5,100%,0/6,0%,0.015152,0.002165
Blepharophimosis [HP:0000581],5/5,100%,4/13,31%,0.137255,0.029412
Velopharyngeal insufficiency [HP:0000220],5/5,100%,8/19,42%,0.144022,0.041149
Micrognathia [HP:0000347],5/5,100%,6/13,46%,0.28366,0.101307
Low-set ears [HP:0000369],5/5,100%,8/13,62%,0.580882,0.24895
Hypertelorism [HP:0000316],5/5,100%,7/10,70%,0.902875,0.505495
Abnormal facial shape [HP:0001999],5/5,100%,15/20,75%,0.902875,0.544043
Sensorineural hearing impairment [HP:0000407],1/5,20%,4/9,44%,0.902875,0.58042
Abnormal location of ears [HP:0000357],5/5,100%,8/8,100%,1.0,1.0
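
The nominal p values for the two top-ranked terms can be reproduced from the counts in the table with a two-sided Fisher exact test on a 2×2 table (genotype group by phenotype present or absent). That GPSEA applies this test here is an inference from the `fet_results` parameter name used later in this report and from the matching numbers. The function below is a plain-Python sketch written for illustration, not the library's implementation.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p value for the 2x2 table [[a, b], [c, d]].

    Rows are genotype groups; columns are phenotype present / absent.
    """
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    denom = comb(n, row1)

    def prob(x):
        # Hypergeometric probability of x "present" individuals in row 1.
        return comb(col1, x) * comb(n - col1, row1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, row1 - (n - col1))
    highest = min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed
    # one; the small tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Global developmental delay: 5/5 in Tyr418PhefsTer42 vs. 3/20 in Other.
p_gdd = fisher_exact_two_sided(5, 0, 3, 17)
# Narrow nose: 5/5 in Tyr418PhefsTer42 vs. 0/6 in Other.
p_narrow_nose = fisher_exact_two_sided(5, 0, 0, 6)
print(f"{p_gdd:.6f} {p_narrow_nose:.6f}")
```

Both values agree with the table's `p values` column to the precision shown there (0.001054 and 0.002165).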


## Summary

In [31]:
from gpseacs.report import GpseaAnalysisReport, GPAnalysisResultSummary

f_results = (
    GPAnalysisResultSummary.from_multi(result=missense_result),
    GPAnalysisResultSummary.from_multi(result=Tyr418PhefsTer42_result),
)

report = GpseaAnalysisReport(name=gene_symbol, 
                             cohort=cohort, 
                             fet_results=f_results,
                             gene_symbol=gene_symbol,
                             mane_tx_id=mane_tx_id,
                             mane_protein_id=mane_protein_id,
                             caption="")

In [32]:
from gpseacs.report import GpseaNotebookSummarizer
summarizer = GpseaNotebookSummarizer(hpo=hpo, gpsea_version=gpsea.__version__)
summarizer.summarize_report(report=report)

Genotype (A),Genotype (B),Tests performed,Significant tests
Missense,Other,12,0

HPO Term,Tyr418PhefsTer42,Other,p-val,adj. p-val
Global developmental delay [HP:0001263],5/5 (100%),3/20 (15%),0.001,0.015
Narrow nose [HP:0000460],5/5 (100%),0/6 (0%),0.002,0.015


In [33]:
summarizer.process_latex(report=report)

Output to ../../supplement/tex/TBX1_summary_draft.tex
