Current schema:


```
root
 |-- hallmarks: struct (nullable = true)
 |    |-- attributes: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- pmid: long (nullable = true)
 |    |    |    |-- attribute_name: string (nullable = true)
 |    |    |    |-- description: string (nullable = true)
 |    |-- cancer_hallmarks: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- pmid: long (nullable = true)
 |    |    |    |-- description: string (nullable = true)
 |    |    |    |-- label: string (nullable = true)
 |    |    |    |-- promote: boolean (nullable = true)
 |    |    |    |-- suppress: boolean (nullable = true)
 |    |-- function_summary: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- pmid: long (nullable = true)
 |    |    |    |-- description: string (nullable = true)
 ```
 
The COSMIC team uploads the hallmarks data to a Google Cloud Storage bucket: [gs://otar007-cosmic/archives/CTTV007-10-02-2021_v94_hallmarks.tsv.gz](gs://otar007-cosmic/archives/CTTV007-10-02-2021_v94_hallmarks.tsv.gz)
 
 
GraphQL API browser: [https://api-beta-dot-open-targets-eu-dev.appspot.com/api/v4/graphql/browser](https://api-beta-dot-open-targets-eu-dev.appspot.com/api/v4/graphql/browser)

Query:

```
query OpenTargetsGeneticsQuery {
  target(ensemblId: "ENSG00000136754") {
    hallmarks {
      attributes {
        name
        reference {
          pubmedId
          description
        }
      }
      rows {
        promote
        suppress
        label
      }
      functions{
        description
        pubmedId
      }
    }
  }
}
```
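
The same query can also be sent from Python. The block below is a minimal sketch and not the notebook's own method: it assumes the POST endpoint is the browser URL without the trailing `/browser`, and that `ensemblId` can be passed as a `String!` variable. The names `HallmarksQuery`, `build_payload` and `fetch_hallmarks` are hypothetical helpers introduced for illustration.

```python
import json
import urllib.request

# Assumption: the POST endpoint is the browser URL minus "/browser".
GRAPHQL_URL = "https://api-beta-dot-open-targets-eu-dev.appspot.com/api/v4/graphql"

# Same fields as the query above, with the Ensembl ID turned into a variable.
# Assumption: the schema accepts ensemblId as a String! variable.
HALLMARKS_QUERY = """
query HallmarksQuery($ensemblId: String!) {
  target(ensemblId: $ensemblId) {
    hallmarks {
      attributes { name reference { pubmedId description } }
      rows { promote suppress label }
      functions { description pubmedId }
    }
  }
}
"""


def build_payload(ensembl_id: str) -> bytes:
    """Serialise the query and its variable into a JSON request body."""
    body = {"query": HALLMARKS_QUERY, "variables": {"ensemblId": ensembl_id}}
    return json.dumps(body).encode("utf-8")


def fetch_hallmarks(ensembl_id: str) -> dict:
    """POST the query and return the decoded JSON response (needs network access)."""
    request = urllib.request.Request(
        GRAPHQL_URL,
        data=build_payload(ensembl_id),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `fetch_hallmarks("ENSG00000136754")` would then return the response as a dictionary, provided the endpoint assumption holds.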

In [3]:
%%bash

# Use ${HOME}: a tilde inside quotes is not expanded by bash.
DATA_DIR="${HOME}/project_data/ot/hallmarks"

# Create directory:
mkdir -p "${DATA_DIR}"

# Fetching data:
gsutil cp -r gs://otar007-cosmic/archives/CTTV007-10-02-2021_v94_hallmarks.tsv.gz ${DATA_DIR}

# Split columns on tabs only, so multi-word values stay in one column:
gzcat "${DATA_DIR}/CTTV007-10-02-2021_v94_hallmarks.tsv.gz" | head -3 | column -t -s $'\t'

gzcat "${DATA_DIR}/CTTV007-10-02-2021_v94_hallmarks.tsv.gz" | wc -l


GENE_SYMBOL  CELL_TYPE                 PUBMED_PMID  HALLMARK                  IMPACT    DESCRIPTION                                                                    CELL_LINE
ABI1         hepatocellular carcinoma  28339046     proliferative signalling  promotes  overexpression of ABI1 increases and KD decreases cell proliferation           HepG2 and MHCC97H
ABI1         hepatocellular carcinoma  28339046     invasion and metastasis   promotes  overexpression of ABI1 increases and KD decreases cell migration and invasion  HepG2 and MHCC97H
    3716


The hallmark file has 3716 lines including the header, so 3715 data rows. Let's read it into a pandas DataFrame:

In [1]:
import pandas as pd
import json
import gzip

hallmark_file = '/Users/dsuveges/project_data/ot/hallmarks/CTTV007-10-02-2021_v94_hallmarks.tsv.gz'

hallmark_df = pd.read_csv(hallmark_file, sep='\t', compression='gzip', encoding='cp1252', dtype=str)

print(hallmark_df.head())
len(hallmark_df)


  GENE_SYMBOL                 CELL_TYPE PUBMED_PMID                  HALLMARK  \
0        ABI1  hepatocellular carcinoma    28339046  proliferative signalling   
1        ABI1  hepatocellular carcinoma    28339046   invasion and metastasis   
2        ABI1                       NaN    16025998            role in cancer   
3        ABI1                       NaN     9694699            role in cancer   
4        ABI1                       NaN    23552839            role in cancer   

     IMPACT                                        DESCRIPTION  \
0  promotes  overexpression of ABI1 increases and KD decrea...   
1  promotes  overexpression of ABI1 increases and KD decrea...   
2       TSG                                                TSG   
3    fusion                                             fusion   
4       TSG                                                TSG   

           CELL_LINE  
0  HepG2 and MHCC97H  
1  HepG2 and MHCC97H  
2                NaN  
3                NaN  
4                NaN  

3715
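
The gap between the 3716 lines reported by `wc -l` and the 3715 rows in the DataFrame is the header line. A minimal standard-library sketch of the same count follows. It uses a tiny in-memory stand-in for the gzipped TSV (two rows copied from the output above), and `count_data_rows` is a hypothetical helper.

```python
import csv
import gzip
import io


def count_data_rows(handle) -> int:
    """Count tab-separated records in an open text handle, excluding the header."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header line
    return sum(1 for _ in reader)


# In-memory stand-in for the real file: one header line plus two data rows.
raw = (
    "GENE_SYMBOL\tHALLMARK\n"
    "ABI1\tproliferative signalling\n"
    "ABI1\tinvasion and metastasis\n"
)
compressed = gzip.compress(raw.encode("cp1252"))

# gzip.open accepts a file object as well as a path.
with gzip.open(io.BytesIO(compressed), mode="rt", encoding="cp1252", newline="") as handle:
    n_rows = count_data_rows(handle)

print(n_rows)  # three lines in the file, two data rows
```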

In [13]:
hallmark_df.loc[hallmark_df.GENE_SYMBOL == 'ABI1']
# hallmark_df.loc[hallmark_df.HALLMARK.isna()] # -> zero rows


|    | GENE_SYMBOL | CELL_TYPE | PUBMED_PMID | HALLMARK | IMPACT | DESCRIPTION | CELL_LINE |
|---:|:---|:---|:---|:---|:---|:---|:---|
|  0 | ABI1 | hepatocellular carcinoma | 28339046 | proliferative signalling | promotes | overexpression of ABI1 increases and KD decrea... | HepG2 and MHCC97H |
|  1 | ABI1 | hepatocellular carcinoma | 28339046 | invasion and metastasis | promotes | overexpression of ABI1 increases and KD decrea... | HepG2 and MHCC97H |
|  2 | ABI1 | NaN | 16025998 | role in cancer | TSG | TSG | NaN |
|  3 | ABI1 | NaN | 9694699 | role in cancer | fusion | fusion | NaN |
|  4 | ABI1 | NaN | 23552839 | role in cancer | TSG | TSG | NaN |
|  5 | ABI1 | NaN | 9010225 | suppression of growth | promotes | fibroblasts overexpressing e3B1 have reduced g... | NaN |
|  6 | ABI1 | NaN | 9694699 | function summary | NaN | Abl-interacting adaptor protein | NaN |
|  7 | ABI1 | AML | 9694699 | fusion partner | NaN | KMT2A | NaN |
|  8 | ABI1 | mouse pre-B | 18453543 | invasion and metastasis | promotes | KD inhibits the Bcr-Abl-stimulated abnormal cy... | baf3 |
|  9 | ABI1 | NaN | 23552839 | mouse model | NaN | development of prostatic intraepithelial neopl... | NaN |


In [16]:
hallmark_df.HALLMARK.unique()

array(['proliferative signalling', 'invasion and metastasis',
       'role in cancer', 'suppression of growth', 'function summary',
       'fusion partner', 'mouse model', 'angiogenesis',
       'differentiation and development',
       'global regulation of gene expression',
       'change of cellular energetics',
       'genome instability and mutations',
       'escaping programmed cell death', 'types of alteration in cancer',
       'clinical impact', 'impact of mutation on function',
       'cell division control', 'tumour promoting inflammation',
       'cell replicative immortality', 'senescence',
       'escaping immune response to cancer', 'interaction with pathogen'],
      dtype=object)

In [22]:
# The ten cancer hallmark labels; every other HALLMARK value is a different kind of annotation.
hallmark_annotations = [
    'proliferative signalling',
    'invasion and metastasis',
    'suppression of growth',
    'angiogenesis',
    'change of cellular energetics',
    'genome instability and mutations',
    'escaping programmed cell death',
    'tumour promoting inflammation',
    'cell replicative immortality',
    'escaping immune response to cancer',
]

hallmarks_only = hallmark_df.loc[hallmark_df.HALLMARK.isin(hallmark_annotations)]

# Unique hallmark/impact combinations for ABI1:
print(
    hallmarks_only.loc[hallmarks_only.GENE_SYMBOL == 'ABI1', ['HALLMARK', 'IMPACT']]
    .drop_duplicates()
)

# ABI1 rows whose IMPACT is neither 'promotes' nor 'suppresses':
print(
    hallmark_df.loc[(hallmark_df.GENE_SYMBOL == 'ABI1') & (~hallmark_df.IMPACT.isin(['promotes', 'suppresses']))]
    .drop_duplicates()
)

                    HALLMARK      IMPACT
0   proliferative signalling    promotes
1    invasion and metastasis    promotes
5      suppression of growth    promotes
10   invasion and metastasis  suppresses
   GENE_SYMBOL                 CELL_TYPE PUBMED_PMID          HALLMARK  \
2         ABI1                       NaN    16025998    role in cancer   
3         ABI1                       NaN     9694699    role in cancer   
4         ABI1                       NaN    23552839    role in cancer   
6         ABI1                       NaN     9694699  function summary   
7         ABI1                       AML     9694699    fusion partner   
9         ABI1                       NaN    23552839       mouse model   
11        ABI1                       NaN    10499589  function summary   
15        ABI1  hepatocellular carcinoma    28339046    role in cancer   

      IMPACT                                        DESCRIPTION  \
2        TSG                                                T

In [23]:
print(
    hallmark_df.loc[(hallmark_df.GENE_SYMBOL == 'ABI1') & (hallmark_df.IMPACT.isin(['promotes','suppresses']))]
    .drop_duplicates()
)

   GENE_SYMBOL                 CELL_TYPE PUBMED_PMID  \
0         ABI1  hepatocellular carcinoma    28339046   
1         ABI1  hepatocellular carcinoma    28339046   
5         ABI1                       NaN     9010225   
8         ABI1              mouse pre-B     18453543   
10        ABI1              glioblastoma    26473374   
12        ABI1         colorectal cancer    24913355   
13        ABI1             breast cancer    17951403   
14        ABI1             breast cancer    17951403   

                    HALLMARK      IMPACT  \
0   proliferative signalling    promotes   
1    invasion and metastasis    promotes   
5      suppression of growth    promotes   
8    invasion and metastasis    promotes   
10   invasion and metastasis  suppresses   
12   invasion and metastasis    promotes   
13   invasion and metastasis    promotes   
14  proliferative signalling    promotes   

                                          DESCRIPTION          CELL_LINE  
0   overexpression of A
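
The `promote` and `suppress` booleans in the schema can be derived from the `IMPACT` column. The block below is a sketch under that assumption, not a confirmed part of the pipeline. The first three rows are copied from the ABI1 output above; the last row, with a missing `IMPACT`, is hypothetical and only shows that a missing value yields `False` for both flags.

```python
import pandas as pd

rows = pd.DataFrame({
    "HALLMARK": [
        "proliferative signalling",
        "invasion and metastasis",
        "suppression of growth",
        "change of cellular energetics",  # hypothetical row with no IMPACT value
    ],
    "IMPACT": ["promotes", "suppresses", "promotes", None],
})

# Element-wise comparison; a missing IMPACT compares unequal to both strings.
flags = rows.assign(
    promote=rows.IMPACT.eq("promotes"),
    suppress=rows.IMPACT.eq("suppresses"),
)

print(flags[["HALLMARK", "promote", "suppress"]])
```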

In [25]:
hallmark_df.loc[(hallmark_df.GENE_SYMBOL == 'ABI1') & (hallmark_df.HALLMARK == 'role in cancer')]

|    | GENE_SYMBOL | CELL_TYPE | PUBMED_PMID | HALLMARK | IMPACT | DESCRIPTION | CELL_LINE |
|---:|:---|:---|:---|:---|:---|:---|:---|
|  2 | ABI1 | NaN | 16025998 | role in cancer | TSG | TSG | NaN |
|  3 | ABI1 | NaN | 9694699 | role in cancer | fusion | fusion | NaN |
|  4 | ABI1 | NaN | 23552839 | role in cancer | TSG | TSG | NaN |
| 15 | ABI1 | hepatocellular carcinoma | 28339046 | role in cancer | oncogene | oncogene | HepG2 and MHCC97H |


In [28]:
hallmark_df.loc[(hallmark_df.GENE_SYMBOL == 'ABI1') & (~hallmark_df.HALLMARK.isin(hallmark_annotations))]

|    | GENE_SYMBOL | CELL_TYPE | PUBMED_PMID | HALLMARK | IMPACT | DESCRIPTION | CELL_LINE |
|---:|:---|:---|:---|:---|:---|:---|:---|
|  2 | ABI1 | NaN | 16025998 | role in cancer | TSG | TSG | NaN |
|  3 | ABI1 | NaN | 9694699 | role in cancer | fusion | fusion | NaN |
|  4 | ABI1 | NaN | 23552839 | role in cancer | TSG | TSG | NaN |
|  6 | ABI1 | NaN | 9694699 | function summary | NaN | Abl-interacting adaptor protein | NaN |
|  7 | ABI1 | AML | 9694699 | fusion partner | NaN | KMT2A | NaN |
|  9 | ABI1 | NaN | 23552839 | mouse model | NaN | development of prostatic intraepithelial neopl... | NaN |
| 11 | ABI1 | NaN | 10499589 | function summary | NaN | participates in the transduction of signals fr... | NaN |
| 15 | ABI1 | hepatocellular carcinoma | 28339046 | role in cancer | oncogene | oncogene | HepG2 and MHCC97H |


In [57]:
# HALLMARK values not yet accounted for: neither cancer hallmarks
# nor the four annotation types already inspected above.
known_annotations = hallmark_annotations + ['function summary', 'role in cancer', 'fusion partner', 'mouse model']
not_known = [h for h in hallmark_df.HALLMARK.unique() if h not in known_annotations]

In [60]:
hallmark_df.loc[hallmark_df.HALLMARK.isin(not_known)].HALLMARK.value_counts()

types of alteration in cancer           369
impact of mutation on function          217
differentiation and development         171
clinical impact                         125
cell division control                    90
global regulation of gene expression     51
senescence                               33
interaction with pathogen                 9
Name: HALLMARK, dtype: int64

In [68]:
hallmark_df.loc[(hallmark_df.GENE_SYMBOL == 'ABI1') &
                (hallmark_df.HALLMARK != 'role in cancer') &
               (~hallmark_df.HALLMARK.isin(hallmark_annotations))]



|    | GENE_SYMBOL | CELL_TYPE | PUBMED_PMID | HALLMARK | IMPACT | DESCRIPTION | CELL_LINE |
|---:|:---|:---|:---|:---|:---|:---|:---|
|  6 | ABI1 | NaN | 9694699 | function summary | NaN | Abl-interacting adaptor protein | NaN |
|  7 | ABI1 | AML | 9694699 | fusion partner | NaN | KMT2A | NaN |
|  9 | ABI1 | NaN | 23552839 | mouse model | NaN | development of prostatic intraepithelial neopl... | NaN |
| 11 | ABI1 | NaN | 10499589 | function summary | NaN | participates in the transduction of signals fr... | NaN |


In [69]:
hallmark_df.loc[(hallmark_df.HALLMARK == 'function summary')]


|      | GENE_SYMBOL | CELL_TYPE | PUBMED_PMID | HALLMARK | IMPACT | DESCRIPTION | CELL_LINE |
|-----:|:---|:---|:---|:---|:---|:---|:---|
|    6 | ABI1 | NaN | 9694699 | function summary | NaN | Abl-interacting adaptor protein | NaN |
|   11 | ABI1 | NaN | 10499589 | function summary | NaN | participates in the transduction of signals fr... | NaN |
|   17 | ABL1 | NaN | 23316053 | function summary | NaN | protein-tyrosine kinase, plays a prominent rol... | NaN |
|   20 | ABL1 | NaN | 25999467 | function summary | NaN | phosphorylates RAS proteins to allosterically ... | NaN |
|   33 | ACKR3 | vascular endothelium | 26119946 | function summary | NaN | scavenger receptor for chemokine CXCL12 | NaN |
|  ... | ... | ... | ... | ... | ... | ... | ... |
| 3526 | PTPN11 | NaN | 12826400 | function summary | NaN | Src homology-2 domain-containing protein tyros... | NaN |
| 3598 | PTPRB | NaN | 19451274 | function summary | NaN | a receptor-type tyrosine phosphatase involved ... | NaN |
| 3629 | FOXL2 | NaN | 24049064 | function summary | NaN | a forkhead type transcription factor involved ... | NaN |
| 3657 | ACVR2A | colorectal cancer | 26497569 | function summary | NaN | Activin ACVR2A signaling stabilises p21 via SMAD4 | SW480 |


In [87]:
(
    hallmark_df
    .groupby(['HALLMARK'])
    .agg(
        {
            "CELL_LINE": lambda x: x.dropna().unique().tolist(),
            "CELL_TYPE": lambda x: x.dropna().unique().tolist(),
        }
    )
)




| HALLMARK | CELL_LINE | CELL_TYPE |
|:---|:---|:---|
| angiogenesis | [HepG2, PDVC57, MCF-7, BGC-823 and MKN-28, RH3... | [hepatocellular carcinoma, papillary thyroid c... |
| cell division control | [Hela and U2OS, NIH-3T3, KMS11, KMM1, Cal27, S... | [mouse brain, mouse, mouse fibroblasts, endoth... |
| cell replicative immortality | [HONE1, BJ, U87MG] | [haematopoietic stem cells, triple negative br... |
| change of cellular energetics | [Raji, MCF-7, RPMI-8826, MCF7, SK-OV-3, 3AO, H... | [prostate cancer, adipocytes and L6 myotubes, ... |
| clinical impact | [A549, H1993, HCC827, H1975, A549, Pc9, Pc9 GR... | [GBM, gallbladder cancer, renal cell carcinoma... |
| differentiation and development | [32D.C10, NB4-55 ,HL-60, U-937, NB-39-nu, NGP,... | [mouse, murine myeloid, sensory dorsal root ga... |
| escaping immune response to cancer | [FO-1, HepG2/C3A, SNU-423, SK-HEP-1, PLC/PRF/... | [melanoma, colorectal cancer, pancreatic ducta... |
| escaping programmed cell death | [SW480 and HT-29, MiaPaCa-2, H460, HCT-15, and... | [colorectal cancer, glioma-initiating cells, m... |
| function summary | [LNCaP , MCF-7, HeLa, HEK293, SW480] | [vascular endothelium, lung cancer, Ewing sarc... |
| fusion partner | [UACC812, ZR-75-30, Karpas 231, VAL] | [AML, CML, lipoma, prostate cancer, ALL, papil... |


In [89]:
len(hallmark_df.loc[hallmark_df.HALLMARK == 'function summary'])

427

In [98]:
(
    hallmark_df
    .groupby('GENE_SYMBOL')
    .agg({'HALLMARK': lambda x: len(x.loc[x == 'function summary'])})
    .value_counts('HALLMARK')
)

HALLMARK
1    217
2     73
3     14
0      8
4      3
5      2
dtype: int64
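
The distribution printed above can be cross-checked with simple arithmetic: weighting each count by its number of genes gives the total number of `function summary` rows, and summing the gene counts should give the number of distinct genes in the file. The block below only restates the printed numbers; the variable names are introduced for illustration.

```python
# Number of 'function summary' rows per gene -> number of genes, copied from the output above.
distribution = {1: 217, 2: 73, 3: 14, 0: 8, 4: 3, 5: 2}

# Total 'function summary' rows: each gene contributes its own count.
total_rows = sum(n_rows * n_genes for n_rows, n_genes in distribution.items())

# Total genes covered by the distribution.
total_genes = sum(distribution.values())

print(total_rows)   # 217 + 146 + 42 + 0 + 12 + 10 = 427
print(total_genes)  # 317
```

So 427 of the 3715 rows are function summaries, and 8 of the 317 genes have none.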

In [104]:
print('\n'.join(hallmark_df.loc[hallmark_df.GENE_SYMBOL == 'NPM1'].HALLMARK.unique()))

impact of mutation on function
types of alteration in cancer
interaction with pathogen
fusion partner
cell division control
differentiation and development
function summary
genome instability and mutations
escaping programmed cell death
proliferative signalling
invasion and metastasis
change of cellular energetics
suppression of growth
senescence
role in cancer
clinical impact


In [105]:
hallmark_df.loc[(hallmark_df.GENE_SYMBOL == 'NPM1') & (hallmark_df.HALLMARK == 'role in cancer')]

|      | GENE_SYMBOL | CELL_TYPE | PUBMED_PMID | HALLMARK | IMPACT | DESCRIPTION | CELL_LINE |
|-----:|:---|:---|:---|:---|:---|:---|:---|
| 2353 | NPM1 | mouse fibroblasts | 19033198 | role in cancer | oncogene | oncogene | MEFs |
| 2354 | NPM1 | lymphoid malignancies | 18212245 | role in cancer | TSG | TSG | NaN |
| 2355 | NPM1 | NaN | 21278791 | role in cancer | oncogene, TSG, fusion | oncogene, TSG, fusion | NaN |
| 2356 | NPM1 | ALCL | 8122112 | role in cancer | NaN | fusion | NaN |
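
Two things stand out in the NPM1 rows: a single row can carry several comma-separated roles, and `IMPACT` can be empty while `DESCRIPTION` still holds the role. The block below is one possible way to normalise this, offered as a sketch rather than the notebook's method. It reads the role from `DESCRIPTION` and produces one row per role, using values copied from the table above.

```python
import pandas as pd

roles = pd.DataFrame({
    "PUBMED_PMID": ["19033198", "18212245", "21278791", "8122112"],
    "DESCRIPTION": ["oncogene", "TSG", "oncogene, TSG, fusion", "fusion"],
})

tidy = (
    roles
    .assign(ROLE=roles.DESCRIPTION.str.split(","))  # list of roles per row
    .explode("ROLE")                                 # one row per role
    .assign(ROLE=lambda df: df.ROLE.str.strip())     # drop the space after each comma
    .drop(columns="DESCRIPTION")
    .reset_index(drop=True)
)

print(tidy)
```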


In [106]:
hallmark_df.loc[(hallmark_df.GENE_SYMBOL == 'NPM1')]

|      | GENE_SYMBOL | CELL_TYPE | PUBMED_PMID | HALLMARK | IMPACT | DESCRIPTION | CELL_LINE |
|-----:|:---|:---|:---|:---|:---|:---|:---|
| 2324 | NPM1 | AML | 16455950 | impact of mutation on function | NaN | most common forms of NPM1 mutation in exon 12 ... | NaN |
| 2325 | NPM1 | salivary gland adenoid cystic carcinoma | 27501253 | types of alteration in cancer | NaN | frequently overexpressed and higher levels ass... | ACCM |
| 2326 | NPM1 | NaN | 23271972 | interaction with pathogen | NaN | escorts an Epstein-Barr Virus nuclear antigen ... | NaN |
| 2327 | NPM1 | MDS and AML | 8570204 | fusion partner | NaN | MLF1 | NaN |
| 2328 | NPM1 | APL | 8562957 | fusion partner | NaN | RARA | NaN |
| 2329 | NPM1 | ALCL | 8122112 | fusion partner | NaN | ALK | NaN |
| 2330 | NPM1 | colorectal cancer | 23536448 | cell division control | promotes | KD results in block of cell cycle progression ... | MIP101, RKO, HCT116 |
| 2331 | NPM1 | mouse fibroblast | 11051553 | cell division control | NaN | binds to unduplicated centrosomes, dissociates... | swiss 3T3 |
| 2332 | NPM1 | acute myeloid leukaemia | 27669739 | differentiation and development | suppresses | W288fs mutant inhibits myeloid differentiation... | OCI-AML3 |
| 2333 | NPM1 | AML | 28111462 | types of alteration in cancer | NaN | patients with frameshift mutation in NPM1 have... | NaN |


In [128]:
def classify_annotation(hallmark):
    """Group a HALLMARK value into one of three annotation types."""
    if hallmark in hallmark_annotations:
        return 'hallmark'
    if hallmark == 'function summary':
        return 'function summary'
    return 'other annotation'


# One example gene and description per HALLMARK value, rendered as a markdown table:
annotation_examples = (
    hallmark_df
    .groupby(['HALLMARK'])
    .agg({
        "GENE_SYMBOL": lambda x: x.iloc[0],  # first row of the group
        "DESCRIPTION": lambda x: x.iloc[0],  # first row of the group
    })
    .reset_index()
    .assign(ANNOT_TYPE=lambda df: df.HALLMARK.apply(classify_annotation))
    .sort_values('ANNOT_TYPE')
    .to_markdown()
)

print(annotation_examples)

|    | HALLMARK                             | GENE_SYMBOL   | DESCRIPTION                                                                                                                                                                                                                   | ANNOT_TYPE       |
|---:|:-------------------------------------|:--------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------|
|  8 | function summary                     | ABI1          | Abl-interacting adaptor protein                                                                                                                                                                                               | function summary |
|  0 | angiogenesis                         | ABL1          | loss of Abl kinases lea

In [118]:
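# First attempt: this tests the DESCRIPTION values, which never match a hallmark name,
# so every row falls through to 'other annotation'.
annotation_examples.DESCRIPTION.apply(lambda x: "hallmark" if x in hallmark_annotations else 'function summary' if x == 'function summary' else 'other annotation')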

HALLMARK
angiogenesis                            other annotation
cell division control                   other annotation
cell replicative immortality            other annotation
change of cellular energetics           other annotation
clinical impact                         other annotation
differentiation and development         other annotation
escaping immune response to cancer      other annotation
escaping programmed cell death          other annotation
function summary                        other annotation
fusion partner                          other annotation
genome instability and mutations        other annotation
global regulation of gene expression    other annotation
impact of mutation on function          other annotation
interaction with pathogen               other annotation
invasion and metastasis                 other annotation
mouse model                             other annotation
proliferative signalling                other annotation
role in cancer        

In [123]:
# Second attempt: this tests the HALLMARK index instead, which classifies the rows correctly.
annotation_examples.index.to_series().apply(lambda x: "hallmark" if x in hallmark_annotations else 'function summary' if x == 'function summary' else 'other annotation')

HALLMARK
angiogenesis                                    hallmark
cell division control                   other annotation
cell replicative immortality                    hallmark
change of cellular energetics                   hallmark
clinical impact                         other annotation
differentiation and development         other annotation
escaping immune response to cancer              hallmark
escaping programmed cell death                  hallmark
function summary                        function summary
fusion partner                          other annotation
genome instability and mutations                hallmark
global regulation of gene expression    other annotation
impact of mutation on function          other annotation
interaction with pathogen               other annotation
invasion and metastasis                         hallmark
mouse model                             other annotation
proliferative signalling                        hallmark
role in cancer        
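
With the three annotation types separated, one gene's rows can be sorted into the three arrays of the schema at the top. The block below is a plain-Python sketch of that mapping. The mapping itself is an assumption inferred from the schema field names, and `build_hallmarks` is a hypothetical helper. The sample rows are copied from the ABI1 output.

```python
HALLMARK_LABELS = {
    "proliferative signalling", "invasion and metastasis", "suppression of growth",
    "angiogenesis", "change of cellular energetics", "genome instability and mutations",
    "escaping programmed cell death", "tumour promoting inflammation",
    "cell replicative immortality", "escaping immune response to cancer",
}


def build_hallmarks(rows):
    """Sort one gene's rows into the schema's three arrays (assumed mapping)."""
    hallmarks = {"attributes": [], "cancer_hallmarks": [], "function_summary": []}
    for row in rows:
        pmid = int(row["PUBMED_PMID"])  # the schema stores pmid as a long
        if row["HALLMARK"] in HALLMARK_LABELS:
            hallmarks["cancer_hallmarks"].append({
                "pmid": pmid,
                "description": row["DESCRIPTION"],
                "label": row["HALLMARK"],
                "promote": row["IMPACT"] == "promotes",
                "suppress": row["IMPACT"] == "suppresses",
            })
        elif row["HALLMARK"] == "function summary":
            hallmarks["function_summary"].append(
                {"pmid": pmid, "description": row["DESCRIPTION"]}
            )
        else:
            hallmarks["attributes"].append({
                "pmid": pmid,
                "attribute_name": row["HALLMARK"],
                "description": row["DESCRIPTION"],
            })
    return hallmarks


abi1_rows = [
    {"PUBMED_PMID": "28339046", "HALLMARK": "proliferative signalling", "IMPACT": "promotes",
     "DESCRIPTION": "overexpression of ABI1 increases and KD decreases cell proliferation"},
    {"PUBMED_PMID": "9694699", "HALLMARK": "function summary", "IMPACT": None,
     "DESCRIPTION": "Abl-interacting adaptor protein"},
    {"PUBMED_PMID": "9694699", "HALLMARK": "fusion partner", "IMPACT": None,
     "DESCRIPTION": "KMT2A"},
]
result = build_hallmarks(abi1_rows)
print(result)
```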

In [130]:
print(
    hallmark_df
    .head()
    .to_markdown(index=False)
)

| GENE_SYMBOL   | CELL_TYPE                |   PUBMED_PMID | HALLMARK                 | IMPACT   | DESCRIPTION                                                                   | CELL_LINE         |
|:--------------|:-------------------------|--------------:|:-------------------------|:---------|:------------------------------------------------------------------------------|:------------------|
| ABI1          | hepatocellular carcinoma |      28339046 | proliferative signalling | promotes | overexpression of ABI1 increases and KD decreases cell proliferation          | HepG2 and MHCC97H |
| ABI1          | hepatocellular carcinoma |      28339046 | invasion and metastasis  | promotes | overexpression of ABI1 increases and KD decreases cell migration and invasion | HepG2 and MHCC97H |
| ABI1          | nan                      |      16025998 | role in cancer           | TSG      | TSG                                                                           | nan               |
| ABI

In [160]:
def describe_column(colname):
    """Print the number of non-null and unique values in a column of hallmark_df."""
    # Narrow the column to its non-null values:
    s = hallmark_df[colname].dropna()

    print(f'Not null values: {len(s)}')
    print(f'Unique values: {s.nunique()}')


colname = 'DESCRIPTION'
describe_column(colname)


Not null values: 3715
Unique values: 2966
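
The same two numbers can be computed for every column at once. The block below is a small sketch on a toy frame (three ABI1 rows from the output above, two columns only); `summary` is a name introduced for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "GENE_SYMBOL": ["ABI1", "ABI1", "ABI1"],
    "CELL_TYPE": ["hepatocellular carcinoma", None, None],
})

# notna().sum() counts non-null cells per column; nunique() ignores missing values by default.
summary = pd.DataFrame({
    "not_null": toy.notna().sum(),
    "unique": toy.nunique(),
})

print(summary)
```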


In [165]:
hallmark_df.GENE_SYMBOL.value_counts()

PTPN11      73
CEBPA       56
MET         51
CYLD        49
GATA1       41
            ..
ARHGEF12     1
SNX9         1
XPA          1
ABL2         1
CDX2         1
Name: GENE_SYMBOL, Length: 317, dtype: int64

In [144]:
ensembl_id = 'ENSG00000179295'


In [169]:
import json

data = '''
{
  "data": {
    "target": {
      "hallmarks": {
        "attributes": [
          {
            "name": "senescence",
            "reference": {
              "pubmedId": 29505847,
              "description": "hepatocyte-specific KD induces hepatocyte senescence in mice with oncogene-driven hepatocellular carcinoma"
            }
          },
          {
            "name": "senescence",
            "reference": {
              "pubmedId": 25736378,
              "description": "KD induces senescence in mammospheres"
            }
          },
          {
            "name": "role in cancer",
            "reference": {
              "pubmedId": 25865556,
              "description": "oncogene"
            }
          },
          {
            "name": "role in cancer",
            "reference": {
              "pubmedId": 15710330,
              "description": "oncogene"
            }
          },
          {
            "name": "role in cancer",
            "reference": {
              "pubmedId": 28814887,
              "description": "oncogene"
            }
          },
          {
            "name": "role in cancer",
            "reference": {
              "pubmedId": 26622699,
              "description": "oncogene"
            }
          },
          {
            "name": "role in cancer",
            "reference": {
              "pubmedId": 27582544,
              "description": "TSG"
            }
          },
          {
            "name": "role in cancer",
            "reference": {
              "pubmedId": 27582544,
              "description": "oncogene"
            }
          },
          {
            "name": "role in cancer",
            "reference": {
              "pubmedId": 21575863,
              "description": "TSG"
            }
          },
          {
            "name": "clinical impact",
            "reference": {
              "pubmedId": 32944401,
              "description": "elevated expression is associated with reduced survival"
            }
          },
          {
            "name": "clinical impact",
            "reference": {
              "pubmedId": 25865556,
              "description": "high expression is associated with reduced survival"
            }
          },
          {
            "name": "clinical impact",
            "reference": {
              "pubmedId": 24297342,
              "description": "high expression is associated with reduced survival"
            }
          },
          {
            "name": "clinical impact",
            "reference": {
              "pubmedId": 26622699,
              "description": "elevated expression correlates with lymph node metastasis at presentation"
            }
          },
          {
            "name": "mouse model",
            "reference": {
              "pubmedId": 29323748,
              "description": "p.E76K conditional knock-in promotes colitis-associated colorectal cancer development"
            }
          },
          {
            "name": "types of alteration in cancer",
            "reference": {
              "pubmedId": 25865556,
              "description": "frequently increased expression (typically poorly differentiated tumours)"
            }
          },
          {
            "name": "types of alteration in cancer",
            "reference": {
              "pubmedId": 30371878,
              "description": "missense mutation in 42% of tumours, the most common being p.E76K"
            }
          },
          {
            "name": "types of alteration in cancer",
            "reference": {
              "pubmedId": 24297342,
              "description": "frequently increased expression"
            }
          },
          {
            "name": "types of alteration in cancer",
            "reference": {
              "pubmedId": 30371878,
              "description": "missense mutations in 9% of tumours"
            }
          },
          {
            "name": "types of alteration in cancer",
            "reference": {
              "pubmedId": 28814887,
              "description": "frequently increased expression"
            }
          },
          {
            "name": "types of alteration in cancer",
            "reference": {
              "pubmedId": 26622699,
              "description": "frequently increased expression"
            }
          },
          {
            "name": "global regulation of gene expression",
            "reference": {
              "pubmedId": 12425940,
              "description": "acts as a negative regulator of LIF signalling-mediated neuronal gene expression"
            }
          },
          {
            "name": "differentiation and development",
            "reference": {
              "pubmedId": 30764849,
              "description": "downregulation enhances cell differentiation via SRC and CTNNB1"
            }
          },
          {
            "name": "differentiation and development",
            "reference": {
              "pubmedId": 29659837,
              "description": "differentiation of primary chondrocytes from mice heterozygous for D61G mutation is impaired via ERK activation"
            }
          },
          {
            "name": "differentiation and development",
            "reference": {
              "pubmedId": 21670473,
              "description": "KD inhibits myeloid and erythroid differentiation of CD34 positive cells"
            }
          },
          {
            "name": "differentiation and development",
            "reference": {
              "pubmedId": 32751109,
              "description": "involved in the regulation of several processes central to embryo development"
            }
          },
          {
            "name": "cell division control",
            "reference": {
              "pubmedId": 32944401,
              "description": "KD delays cell G1 to S phase transition via PI3K/AKT/GSK3_ pathway-mediated degradation of cyclin D"
            }
          },
          {
            "name": "cell division control",
            "reference": {
              "pubmedId": 31807022,
              "description": "overexpression increases cell transition from G1 to S, a larger effect is observed for a E76K mutation-containing gene"
            }
          },
          {
            "name": "cell division control",
            "reference": {
              "pubmedId": 18640765,
              "description": "D61Y and E76K gain of function mutations promote cell progression to S and G2, in association with cyclin D upregulation and downregulation of p27 and p21"
            }
          },
          {
            "name": "interaction with pathogen",
            "reference": {
              "pubmedId": 11743164,
              "description": "activation by sh2 domain binding of tyrosine phosphorylated Helicobacter pylori protein CagA is required for CagA-induced gastric carcinoma-associated morphological changes"
            }
          },
          {
            "name": "impact of mutation on function",
            "reference": {
              "pubmedId": 31807022,
              "description": "E76K mutation enhances cell proliferation, migration and invasion effects effected by overexpression"
            }
          },
          {
            "name": "impact of mutation on function",
            "reference": {
              "pubmedId": 30375388,
              "description": "most commonly reported cancer mutation E76K changes basal autoinhibited conformation to open substrate accessible state"
            }
          }
        ],
        "rows": [
          {
            "promote": true,
            "suppress": false,
            "label": "genome instability and mutations"
          },
          {
            "promote": false,
            "suppress": true,
            "label": "genome instability and mutations"
          },
          {
            "promote": false,
            "suppress": true,
            "label": "angiogenesis"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "angiogenesis"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "angiogenesis"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "angiogenesis"
          },
          {
            "promote": false,
            "suppress": true,
            "label": "escaping programmed cell death"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "escaping programmed cell death"
          },
          {
            "promote": false,
            "suppress": true,
            "label": "escaping programmed cell death"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "escaping programmed cell death"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "escaping programmed cell death"
          },
          {
            "promote": false,
            "suppress": true,
            "label": "escaping programmed cell death"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "escaping programmed cell death"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "escaping programmed cell death"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "escaping programmed cell death"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "proliferative signalling"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "proliferative signalling"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "proliferative signalling"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "proliferative signalling"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "proliferative signalling"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "proliferative signalling"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "invasion and metastasis"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "invasion and metastasis"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "invasion and metastasis"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "invasion and metastasis"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "invasion and metastasis"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "invasion and metastasis"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "tumour promoting inflammation"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "tumour promoting inflammation"
          },
          {
            "promote": false,
            "suppress": true,
            "label": "tumour promoting inflammation"
          },
          {
            "promote": false,
            "suppress": false,
            "label": "change of cellular energetics"
          },
          {
            "promote": false,
            "suppress": false,
            "label": "change of cellular energetics"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "suppression of growth"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "suppression of growth"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "suppression of growth"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "proliferative signalling"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "escaping immune response to cancer"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "escaping immune response to cancer"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "escaping immune response to cancer"
          },
          {
            "promote": true,
            "suppress": false,
            "label": "escaping immune response to cancer"
          }
        ],
        "functions": [
          {
            "description": "WT and E76K mutant PTPN11 dephosphorylates SRC phosphorylated KRAS WT and G12V mutant, thereby enhancing RAF binding to RAS",
            "pubmedId": 30644389
          },
          {
            "description": "Src homology-2 domain-containing protein tyrosine phosphatase involved in growth factor and cytokine-dependent intracellular signal transduction regulating multiple processes",
            "pubmedId": 12826400
          }
        ]
      }
    }
  }
}
'''

d = json.loads(data)
d

{'data': {'target': {'hallmarks': {'attributes': [{'name': 'senescence',
      'reference': {'pubmedId': 29505847,
       'description': 'hepatocyte-specific KD induces hepatocyte senescence in mice with oncogene-driven hepatocellular carcinoma'}},
     {'name': 'senescence',
      'reference': {'pubmedId': 25736378,
       'description': 'KD induces senescence in mammospheres'}},
     {'name': 'role in cancer',
      'reference': {'pubmedId': 25865556, 'description': 'oncogene'}},
     {'name': 'role in cancer',
      'reference': {'pubmedId': 15710330, 'description': 'oncogene'}},
     {'name': 'role in cancer',
      'reference': {'pubmedId': 28814887, 'description': 'oncogene'}},
     {'name': 'role in cancer',
      'reference': {'pubmedId': 26622699, 'description': 'oncogene'}},
     {'name': 'role in cancer',
      'reference': {'pubmedId': 27582544, 'description': 'TSG'}},
     {'name': 'role in cancer',
      'reference': {'pubmedId': 27582544, 'description': 'oncogene'}},
   

In [223]:
pd.DataFrame(d['data']['target']['hallmarks']['rows']).label.unique()

array(['genome instability and mutations', 'angiogenesis',
       'escaping programmed cell death', 'proliferative signalling',
       'invasion and metastasis', 'tumour promoting inflammation',
       'change of cellular energetics', 'suppression of growth',
       'escaping immune response to cancer'], dtype=object)
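
The API encodes the direction of each hallmark as two booleans, so a row can carry neither flag. The sketch below tallies rows by label and direction so that such rows stand out. It uses only the standard library; the sample rows are a handful copied from the PTPN11 response above, and `direction` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

# A few rows copied from the PTPN11 response above (not the full list).
sample_rows = [
    {"promote": True, "suppress": False, "label": "angiogenesis"},
    {"promote": False, "suppress": True, "label": "escaping programmed cell death"},
    {"promote": True, "suppress": False, "label": "escaping programmed cell death"},
    {"promote": False, "suppress": False, "label": "change of cellular energetics"},
    {"promote": False, "suppress": False, "label": "change of cellular energetics"},
]


def direction(row):
    """Collapse the two booleans into a single readable direction (hypothetical helper)."""
    if row["promote"] and row["suppress"]:
        return "both"
    if row["promote"]:
        return "promotes"
    if row["suppress"]:
        return "suppresses"
    return "neither"


# Count how many rows fall into each (label, direction) combination.
tally = Counter((row["label"], direction(row)) for row in sample_rows)
for (label, kind), count in sorted(tally.items()):
    print(f"{label}: {kind} x{count}")
```

The same tally can be run on the full response by passing `d['data']['target']['hallmarks']['rows']` instead of `sample_rows`.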

In [180]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Start (or reuse) a local Spark session
spark = (
    SparkSession.builder
    .master('local[*]')
    .getOrCreate()
)

In [216]:
target_file = '/Users/dsuveges/project_data/ot/target_index/target_21.04/targets'
###
### Getting the unique list of hallmark labels: 10 are expected, and the output below lists exactly 10.
###
(
    spark.read.parquet(target_file)
    .select(
        F.explode(F.col('hallMarks.cancer_hallmarks')).alias('cancer_hallmarks')
    )
    .select(
        F.col('cancer_hallmarks.label')
    )
    .distinct()
    .show()
)

+--------------------+
|               label|
+--------------------+
|tumour promoting ...|
|cell replicative ...|
|escaping immune r...|
|genome instabilit...|
|invasion and meta...|
|proliferative sig...|
|        angiogenesis|
|escaping programm...|
|change of cellula...|
|suppression of gr...|
+--------------------+



In [221]:
print(
    spark.read.parquet(target_file)
    .select(
        F.explode(F.col('hallMarks.attributes')).alias('attributes')
    )
    .select(
        F.col('attributes.attribute_name')
    )
    .distinct()
    .toPandas()
    .to_markdown(index=False)
)

| attribute_name                       |
|:-------------------------------------|
| fusion partner                       |
| global regulation of gene expression |
| differentiation and development      |
| impact of mutation on function       |
| cell division control                |
| mouse model                          |
| types of alteration in cancer        |
| senescence                           |
| clinical impact                      |
| interaction with pathogen            |
| role in cancer                       |


In [226]:
hallmark_df.loc[(hallmark_df.GENE_SYMBOL == 'PTPN11') & (hallmark_df.HALLMARK.isin(hallmark_annotations)),['HALLMARK','IMPACT','PUBMED_PMID']]

Unnamed: 0,HALLMARK,IMPACT,PUBMED_PMID
3527,genome instability and mutations,promotes,26755576
3528,genome instability and mutations,suppresses,22890240
3529,angiogenesis,suppresses,23065156
3530,angiogenesis,promotes,26004555
3531,angiogenesis,promotes,17962719
3532,angiogenesis,promotes,19008228
3533,escaping programmed cell death,suppresses,12594211
3534,escaping programmed cell death,promotes,26004555
3535,escaping programmed cell death,suppresses,17330819
3536,escaping programmed cell death,promotes,29207183


In [229]:
hallmark_df.loc[(hallmark_df.HALLMARK == 'change of cellular energetics') & (hallmark_df.IMPACT.isna()),['GENE_SYMBOL', 'HALLMARK', 'IMPACT']]

Unnamed: 0,GENE_SYMBOL,HALLMARK,IMPACT
3557,PTPN11,change of cellular energetics,
3558,PTPN11,change of cellular energetics,


In [231]:
hallmark_df.loc[(hallmark_df.HALLMARK.isin(hallmark_annotations)) & (hallmark_df.IMPACT.isna()),['GENE_SYMBOL', 'HALLMARK', 'IMPACT', 'DESCRIPTION']]

Unnamed: 0,GENE_SYMBOL,HALLMARK,IMPACT,DESCRIPTION
929,CYLD,tumour promoting inflammation,,liver-specific expression of deubiquitinase de...
1496,FGFR2,invasion and metastasis,,"splice isoforms b and c, with preference for d..."
1978,LIFR,invasion and metastasis,,induction of LIFR confers a dormancy phenotype...
2412,PAX5,escaping programmed cell death,,suppression of B isoform of Pax5 leads to an i...
2769,RABEP1,invasion and metastasis,,disruption of Rabaptin-5 Ser407 phosphorylatio...
2819,RAP1GDS1,invasion and metastasis,,stimulates migration
2849,RBM10,escaping programmed cell death,,expression positively correlates with the expr...
2850,RBM10,escaping programmed cell death,,expression positively correlates with the expr...
3557,PTPN11,change of cellular energetics,,mice expressing a brain-specific exon 4 deleti...
3558,PTPN11,change of cellular energetics,,mice expressing protein tyrosine phosphatase d...
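
The two PTPN11 rows with a missing IMPACT line up with the two `change of cellular energetics` rows in the API response, where both `promote` and `suppress` are false. A minimal sketch of a mapping that would produce this, assuming IMPACT only ever holds `promotes`, `suppresses`, or a missing value; `impact_to_flags` is a hypothetical helper and is not taken from the actual pipeline.

```python
def impact_to_flags(impact):
    """Map a raw IMPACT value to the two API booleans (assumed mapping, hypothetical helper)."""
    # Missing values arrive as NaN (a float) in pandas, so anything that is not a string is treated as empty.
    text = impact.strip().lower() if isinstance(impact, str) else ""
    return {"promote": text == "promotes", "suppress": text == "suppresses"}


print(impact_to_flags("promotes"))
print(impact_to_flags("suppresses"))
print(impact_to_flags(float("nan")))
```

Under this assumption a missing IMPACT silently becomes a row with no direction, which is why these rows are worth listing explicitly.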


In [237]:
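# Note: the function rows are labelled 'function summary' in HALLMARK, so 'functions' matches nothing
# and those rows remain in attributes_df (see the 'Processing: function summary' output of the loop below).
attributes_df = hallmark_df.loc[~hallmark_df.HALLMARK.isin(hallmark_annotations + ['functions'])]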

In [244]:
for attribute in attributes_df.HALLMARK.unique():
    print(f'\nProcessing: {attribute}')
    print(attributes_df.loc[attributes_df.HALLMARK == attribute].drop(['CELL_TYPE','PUBMED_PMID','CELL_LINE'], axis=1).head())


Processing: role in cancer
   GENE_SYMBOL        HALLMARK    IMPACT DESCRIPTION
2         ABI1  role in cancer       TSG         TSG
3         ABI1  role in cancer    fusion      fusion
4         ABI1  role in cancer       TSG         TSG
15        ABI1  role in cancer  oncogene    oncogene
28        ABL1  role in cancer    fusion      fusion

Processing: function summary
   GENE_SYMBOL          HALLMARK IMPACT  \
6         ABI1  function summary    NaN   
11        ABI1  function summary    NaN   
17        ABL1  function summary    NaN   
20        ABL1  function summary    NaN   
33       ACKR3  function summary    NaN   

                                          DESCRIPTION  
6                     Abl-interacting adaptor protein  
11  participates in the transduction of signals fr...  
17  protein-tyrosine kinase, plays a prominent rol...  
20  phosphorylates RAS proteins to allosterically ...  
33            scavenger receptor for chemokine CXCL12  

Processing: fusion partner
 

In [245]:
hallmark_df.loc[ hallmark_df.HALLMARK == "interaction with pathogen"].DESCRIPTION.iloc[0]

'cellular corepressor that inhibits the expression of HPV-encoded E6 and E7 oncoproteins which antagonise p53 and pRB tumour suppressor activity'

In [247]:

hallmark_df.loc[ (hallmark_df.HALLMARK == 'role in cancer') & (hallmark_df.DESCRIPTION != hallmark_df.IMPACT)]

Unnamed: 0,GENE_SYMBOL,CELL_TYPE,PUBMED_PMID,HALLMARK,IMPACT,DESCRIPTION,CELL_LINE
882,CUX1,,25190083,role in cancer,"TSG,oncogene","TSG, oncogene",
1330,EWSR1,,17415412,role in cancer,"oncogene, fusion","fusion, oncogene",
1355,EZH2,,26845405,role in cancer,,"oncogene, TSG",
1465,FCGR2B,,11753646,role in cancer,"oncogene, fusion","fusion, oncogene",
2356,NPM1,ALCL,8122112,role in cancer,,fusion,
2405,PAX5,,17851532,role in cancer,TSG,haploinsufficient TSG,
2474,PDCD1LG2,,24497532,role in cancer,"oncogene, fusion","fusion, oncogene",
3080,SMAD4,cholangiocellular carcinoma,16767220,role in cancer,,TSG,
3081,SMAD4,prostate cancer,21289624,role in cancer,,TSG,
3186,TET2,murine myeloid malignancies,21803851,role in cancer,TSG,TSG,
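
Several of the mismatches above differ only in token order or spacing, for example `oncogene, fusion` against `fusion, oncogene`. A minimal sketch of an order-insensitive comparison, assuming both columns hold comma-separated lists; `normalise_roles` is a hypothetical helper, not part of the notebook, and the pairs are copied from the table above.

```python
def normalise_roles(value):
    """Split a comma-separated role string into a set of trimmed tokens (hypothetical helper)."""
    # Missing values (None or NaN) are not strings and map to the empty set.
    if not isinstance(value, str):
        return frozenset()
    return frozenset(token.strip() for token in value.split(",") if token.strip())


# (IMPACT, DESCRIPTION) pairs copied from the mismatch table above.
pairs = [
    ("TSG,oncogene", "TSG, oncogene"),
    ("oncogene, fusion", "fusion, oncogene"),
    ("TSG", "haploinsufficient TSG"),
    (None, "fusion"),
]

for impact, description in pairs:
    same = normalise_roles(impact) == normalise_roles(description)
    print(f"{impact!r} vs {description!r}: {'same roles' if same else 'different'}")
```

With this comparison only the genuinely different rows remain, such as a missing IMPACT or the `haploinsufficient TSG` description. The TET2 row prints identically in both columns, which would be consistent with a whitespace difference, though the output alone does not confirm that.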


In [248]:
hallmark_df.loc[ (hallmark_df.HALLMARK == 'role in cancer') & (hallmark_df.GENE_SYMBOL == 'ARHGAP26')]

Unnamed: 0,GENE_SYMBOL,CELL_TYPE,PUBMED_PMID,HALLMARK,IMPACT,DESCRIPTION,CELL_LINE
176,ARHGAP26,,10908648,role in cancer,TSG,TSG,
3687,ARHGAP26,,8649427,role in cancer,,a GTPase activating protein that binds to foca...,
3693,ARHGAP26,gastric cancer,26146084,role in cancer,fusion,fusion,
3694,ARHGAP26,infant acute monocytic leukaemia,15382263,role in cancer,fusion,fusion,
3695,ARHGAP26,JMML,10908648,role in cancer,fusion,fusion,
3696,ARHGAP26,ovarian cancer,31004081,role in cancer,TSG,TSG,


In [249]:
hallmark_df.loc[ (~hallmark_df.HALLMARK.isin(hallmark_annotations)) & (hallmark_df.DESCRIPTION.isna())]

Unnamed: 0,GENE_SYMBOL,CELL_TYPE,PUBMED_PMID,HALLMARK,IMPACT,DESCRIPTION,CELL_LINE


In [250]:
hallmark_df.loc[hallmark_df.HALLMARK.isin(hallmark_annotations)]

Unnamed: 0,GENE_SYMBOL,CELL_TYPE,PUBMED_PMID,HALLMARK,IMPACT,DESCRIPTION,CELL_LINE
0,ABI1,hepatocellular carcinoma,28339046,proliferative signalling,promotes,overexpression of ABI1 increases and KD decrea...,HepG2 and MHCC97H
1,ABI1,hepatocellular carcinoma,28339046,invasion and metastasis,promotes,overexpression of ABI1 increases and KD decrea...,HepG2 and MHCC97H
5,ABI1,,9010225,suppression of growth,promotes,fibroblasts overexpressing e3B1 have reduced g...,
8,ABI1,mouse pre-B,18453543,invasion and metastasis,promotes,KD inhibits the Bcr-Abl-stimulated abnormal cy...,baf3
10,ABI1,glioblastoma,26473374,invasion and metastasis,suppresses,loss of the Abi1 gene enhances the Crk Tyr251 ...,
...,...,...,...,...,...,...,...
3688,ARHGAP26,ductus arteriosus smooth muscle cells,30592323,proliferative signalling,promotes,KD decreases cell proliferation,
3689,ARHGAP26,ovarian cancer,31004081,suppression of growth,promotes,KD leads to increased proliferation,SKOV3
3690,ARHGAP26,colorectal cancer,28834752,invasion and metastasis,suppresses,KD promotes membrane blebbing-based invasion a...,SW480
3691,ARHGAP26,ductus arteriosus smooth muscle cells,30592323,invasion and metastasis,promotes,KD decreases cell migration,


In [2]:
hallmark_df.loc[hallmark_df.HALLMARK=='role in cancer']

Unnamed: 0,GENE_SYMBOL,CELL_TYPE,PUBMED_PMID,HALLMARK,IMPACT,DESCRIPTION,CELL_LINE
2,ABI1,,16025998,role in cancer,TSG,TSG,
3,ABI1,,9694699,role in cancer,fusion,fusion,
4,ABI1,,23552839,role in cancer,TSG,TSG,
15,ABI1,hepatocellular carcinoma,28339046,role in cancer,oncogene,oncogene,HepG2 and MHCC97H
28,ABL1,,19260121,role in cancer,fusion,fusion,
...,...,...,...,...,...,...,...
3687,ARHGAP26,,8649427,role in cancer,,a GTPase activating protein that binds to foca...,
3693,ARHGAP26,gastric cancer,26146084,role in cancer,fusion,fusion,
3694,ARHGAP26,infant acute monocytic leukaemia,15382263,role in cancer,fusion,fusion,
3695,ARHGAP26,JMML,10908648,role in cancer,fusion,fusion,


In [3]:
hallmark_df.loc[hallmark_df.PUBMED_PMID=='9694699']

Unnamed: 0,GENE_SYMBOL,CELL_TYPE,PUBMED_PMID,HALLMARK,IMPACT,DESCRIPTION,CELL_LINE
3,ABI1,,9694699,role in cancer,fusion,fusion,
6,ABI1,,9694699,function summary,,Abl-interacting adaptor protein,
7,ABI1,AML,9694699,fusion partner,,KMT2A,
