```json
      "properties": {
        "datasourceId": {
          "const": "europepmc"
        },
        "datatypeId": {
          "$ref": "#/definitions/datatypeId"
        },
        "diseaseFromSource": {
          "$ref": "#/definitions/diseaseFromSource"
        },
        "diseaseFromSourceId": {
          "$ref": "#/definitions/diseaseFromSourceId"
        },
        "literature": {
          "$ref": "#/definitions/literature"
        },
        "resourceScore": {
          "$ref": "#/definitions/resourceScore"
        },
        "targetFromSource": {
          "$ref": "#/definitions/targetFromSource"
        },
        "textMiningSentences": {
          "$ref": "#/definitions/textMiningSentences"
        },
        "diseaseFromSourceMappedId": {
          "$ref": "#/definitions/diseaseFromSourceMappedId"
        }
      },
```

What the `textMiningSentences` definition looks like:

```json
    "textMiningSentences": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "dEnd": {
            "type": "integer",
            "description": "Index position where disease name ends in sentence"
          },
          "dStart": {
            "type": "integer",
            "description": "Index position where disease name starts in sentence"
          },
          "section": {
            "type": "string",
            "description": "Section in which sentence occurs",
            "enum": [
              "title",
              "table",
              "other",
              "figure",
              "appendix",
              "abstract"
            ]
          },
          "tEnd": {
            "type": "integer",
            "description": "Index position where target name ends in sentence"
          },
          "text": {
            "type": "string",
            "description": "Sentence text"
          },
          "tStart": {
            "type": "integer",
            "description": "Index position where target name starts in sentence"
          }
        }
      },
      "uniqueItems": true
    },
```
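
As a rough illustration of what this definition constrains, the sketch below checks a single sentence record against the property types and the `section` enum shown above, using plain Python. The function name `check_sentence` and the sample record are made up for illustration; a real pipeline would more likely rely on a JSON Schema validator.

```python
# Hypothetical helper: mirrors the property types and the `section` enum
# from the textMiningSentences definition; not part of any library.
ALLOWED_SECTIONS = ("title", "table", "other", "figure", "appendix", "abstract")
INTEGER_FIELDS = ("dStart", "dEnd", "tStart", "tEnd")


def check_sentence(sentence):
    """Return a list of human-readable problems found in one sentence record."""
    problems = []
    for field in INTEGER_FIELDS:
        value = sentence.get(field)
        # The schema lists no required fields, so only check fields that are present.
        # bool is a subclass of int in Python, so exclude it explicitly.
        if field in sentence and (not isinstance(value, int) or isinstance(value, bool)):
            problems.append(f"{field} must be an integer")
    if "text" in sentence and not isinstance(sentence["text"], str):
        problems.append("text must be a string")
    if "section" in sentence and sentence["section"] not in ALLOWED_SECTIONS:
        problems.append(f"section {sentence['section']!r} is not in the enum")
    return problems


# Made-up record for illustration only.
example = {"text": "GENE1 is linked to disease X.", "tStart": 0, "tEnd": 5,
           "dStart": 19, "dEnd": 28, "section": "abstract"}
print(check_sentence(example))  # []
print(check_sentence({**example, "section": "Introduction"}))
```

The second call reports a problem because a free-text heading such as `Introduction` is not one of the six allowed `section` values.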

In [2]:
import pyspark.sql
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark import SparkContext
import re

# SparkContext.setSystemProperty('spark.executor.memory', '20g')

spark = (
    pyspark.sql.SparkSession
    .builder
    .appName("phenodigm_parser")
    .config("spark.executor.memory", '10g')
    .config("spark.driver.bindAddress", "localhost")
    .config("spark.driver.memory", '10g')
    .getOrCreate()
)


print('Spark version: ', spark.version)


Spark version:  3.0.0


In [3]:
df = spark.read.parquet('/Users/dsuveges/project_data/epmc_evidence/part-00432-aa7d3f34-a39f-4dbf-856a-f97d17ecba7e-c000.snappy.parquet')
df.show()

+--------------------+----------+---------+--------------------+--------------------+----+----+--------------+--------+----------+-------------+--------------------+-------------+--------+------+------+-----+-----+-----+-----------------+----+----------------+--------+-------+----+--------+----------------------+----+
|           organisms|      pmid|  pubDate|             section|                text|end1|end2|evidence_score|isMapped|keywordId1|   keywordId2|              label1|       label2|relation|start1|start2| type|type1|type2|AlteredExpression| Any|GeneticVariation|Negative|Neutral|  No|Positive|RegulatoryModification| Yes|
+--------------------+----------+---------+--------------------+--------------------+----+----+--------------+--------+----------+-------------+--------------------+-------------+--------+------+------+-----+-----+-----+-----------------+----+----------------+--------+-------+----+--------+----------------------+----+
|[microbes, Entero...|PMC6272440|2015-4-

In [7]:
df.printSchema()

root
 |-- organisms: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- pmid: string (nullable = true)
 |-- pubDate: string (nullable = true)
 |-- section: string (nullable = true)
 |-- text: string (nullable = true)
 |-- end1: long (nullable = true)
 |-- end2: long (nullable = true)
 |-- evidence_score: double (nullable = true)
 |-- isMapped: boolean (nullable = true)
 |-- keywordId1: string (nullable = true)
 |-- keywordId2: string (nullable = true)
 |-- label1: string (nullable = true)
 |-- label2: string (nullable = true)
 |-- relation: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- endr: long (nullable = true)
 |    |    |-- labelr: string (nullable = true)
 |    |    |-- startr: long (nullable = true)
 |    |    |-- typer: string (nullable = true)
 |-- start1: long (nullable = true)
 |-- start2: long (nullable = true)
 |-- type: string (nullable = true)
 |-- type1: string (nullable = true)
 |-- type2: string (nulla

In [23]:
(
    df
    .filter(
        col('isMapped') &  # isMapped is already a boolean column
        (df.keywordId2.contains('EFO'))
    )
    .limit(10)
    .toPandas()
    .to_json('cicaful.json', orient='records', lines=True)
)

In [24]:
%%bash 
cat cicaful.json | jq

{
  "organisms": [
    "Acer nikoense",
    "Fabaceae",
    "lianas",
    "alstonia",
    "tree",
    "Schisandraceae",
    "Alstonia macrophylla",
    "A. nikoense",
    "Schisandra chinensis",
    "Gnetaceae",
    "rats",
    "willow",
    "cinchona",
    "Schisandrae Chinensis Fructus",
    "Gnetum",
    "Aceraceae",
    "S. chinensis",
    "SCF",
    "apple",
    "Apocynaceae",
    "Sophora flavescens",
    "animal",
    "A. macrophylla",
    "G. gnemonoides",
    "S. flavescens",
    "Gnetum gnemonoides",
    "human",
    "plant"
  ],
  "pmid": "PMC6273509",
  "pubDate": "2016-8-27",
  "section": "1. \nIntroduction",
  "text": "Diabetes mellitus is a chronic condition characterized by hyperglycemia, due to the metabolic impairment of insulin production/secretion and/or utilization in human body.",
  "end1": 115,
  "end2": 17,
  "evidence_score": 1,
  "isMapped": true,
  "keywordId1": "ENSG00000254647",
  "keywordId2": "EFO_0000400",
  "label1": "insulin",
  "label2": "Diabetes mel

In [65]:
'''
The expression below is used to generate evidence
'''

(
    df
    .filter(col('type') == "GP-DS")
    .withColumn('tmp',col('pmid'))
    .withColumnRenamed("keywordId1", "targetFromSourceId")
    .withColumnRenamed("keywordId2", "diseaseFromSourceMappedId")
    .groupBy(['pmid', 'targetFromSourceId', 'diseaseFromSourceMappedId'])
    .agg(
        first(col("label1")).alias("targetFromSource"),
        first(col("label2")).alias("diseaseFromSource"),
        collect_set(col('tmp')).alias('literature'),
        collect_set(col('evidence_score')).alias('resourceScores'),
        collect_set(
            struct(  
                col("text"),
                col('start1').alias('tStart'),
                col("end1").alias('tEnd'),
                col('start2').alias('dStart'),
                col("end2").alias('dEnd'), 
                col('section'),
                col('evidence_score').alias('score')
            )
        ).alias('textMiningSentences')
    )
    .withColumn('datasourceId',lit('europepmc'))
    .withColumn('datatypeId',lit('literature'))
    .drop(*['pmid'])
    .select(["datasourceId", "datatypeId", "targetFromSource", "targetFromSourceId",
            "diseaseFromSource","diseaseFromSourceMappedId","literature",
             "textMiningSentences", 'resourceScores'])
    .write.format('json').mode('overwrite').option("compression", "org.apache.hadoop.io.compress.GzipCodec")
    .save('test.json.gz')
)
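
To make the aggregation easier to follow, here is a plain-Python sketch of the same grouping logic applied to a few made-up rows. It only mimics the shape of the Spark expression above (one record per `pmid`, target and disease triple, with the matching sentences collected). The rows and the helper name `group_evidence` are invented for illustration, and the de-duplication that `collect_set` applies to the sentence structs is not reproduced.

```python
from collections import defaultdict

# Made-up rows using the column names of the input DataFrame.
rows = [
    {"pmid": "PMC1", "type": "GP-DS", "keywordId1": "ENSG_A", "keywordId2": "EFO_1",
     "label1": "geneA", "label2": "disease1", "text": "s1", "section": "abstract",
     "start1": 0, "end1": 5, "start2": 10, "end2": 18, "evidence_score": 2.0},
    {"pmid": "PMC1", "type": "GP-DS", "keywordId1": "ENSG_A", "keywordId2": "EFO_1",
     "label1": "geneA", "label2": "disease1", "text": "s2", "section": "other",
     "start1": 3, "end1": 8, "start2": 20, "end2": 28, "evidence_score": 1.0},
    # Placeholder type value, only there to show that the filter drops the row.
    {"pmid": "PMC2", "type": "OTHER", "keywordId1": "ENSG_B", "keywordId2": "X_1",
     "label1": "geneB", "label2": "thing1", "text": "s3", "section": "title",
     "start1": 0, "end1": 5, "start2": 9, "end2": 14, "evidence_score": 10.0},
]


def group_evidence(rows):
    """Group gene-disease rows by (pmid, target, disease) and collect sentences."""
    grouped = defaultdict(list)
    for row in rows:
        if row["type"] == "GP-DS":  # same filter as the Spark expression
            key = (row["pmid"], row["keywordId1"], row["keywordId2"])
            grouped[key].append(row)

    evidence = []
    for (pmid, target_id, disease_id), members in grouped.items():
        evidence.append({
            "datasourceId": "europepmc",
            "datatypeId": "literature",
            "targetFromSource": members[0]["label1"],
            "targetFromSourceId": target_id,
            "diseaseFromSource": members[0]["label2"],
            "diseaseFromSourceMappedId": disease_id,
            "literature": [pmid],
            # Distinct scores, sorted here only to make the output predictable.
            "resourceScores": sorted({m["evidence_score"] for m in members}),
            "textMiningSentences": [
                {"text": m["text"], "tStart": m["start1"], "tEnd": m["end1"],
                 "dStart": m["start2"], "dEnd": m["end2"],
                 "section": m["section"], "score": m["evidence_score"]}
                for m in members
            ],
        })
    return evidence


evidence = group_evidence(rows)
print(len(evidence), evidence[0]["resourceScores"])  # 1 [1.0, 2.0]
```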

## Recapitulating scores from old evidence

In [67]:
import json
import pandas as pd 

evid_str = '''
{
  "type": "literature",
  "sourceID": "europepmc",
  "target": {
    "activity": "http://identifiers.org/cttv.activity/up_or_down",
    "id": "http://identifiers.org/uniprot/O14813",
    "target_name": "ARIX",
    "target_type": "http://identifiers.org/cttv.target/protein_evidence"
  },
  "disease": {
    "id": "http://www.ebi.ac.uk/efo/EFO_1001985",
    "name": "CFEOM"
  },
  "evidence": {
    "unique_experiment_reference": "http://europepmc.org/abstract/MED/11882252",
    "resource_score": {
      "type": "summed_total",
      "value": 15.2,
      "method": {
        "description": "Custom text-mining method for target-disease association"
      }
    },
    "literature_ref": {
      "lit_id": "http://europepmc.org/abstract/MED/11882252",
      "mined_sentences": [
        {
          "section": "abstract",
          "t_start": 218,
          "t_end": 221,
          "d_start": 192,
          "d_end": 196,
          "text": "We now identify additional pedigrees with CFEOM1 to determine if the disorder is genetically heterogeneous and, if so, if any affected members of CFEOM1 pedigrees or sporadic cases of classic CFEOM harbor mutations in ARIX, the CFEOM2 disease gene."
        },
        {
          "section": "abstract",
          "t_start": 327,
          "t_end": 330,
          "d_start": 302,
          "d_end": 306,
          "text": "All demonstrated autosomal dominant inheritance, and nine were consistent with linkage to FEOM1. Two small CFEOM1 families were not linked to FEOM1, and both were consistent with linkage to FEOM3. We screened two CFEOM1 families consistent with linkage to FEOM2 and 5 sporadic individuals with classic CFEOM and did not detect ARIX mutations."
        },
        {
          "section": "abstract",
          "t_start": 33,
          "t_end": 36,
          "d_start": 128,
          "d_end": 132,
          "text": "Thus far, we have not identified ARIX mutations in any affected members of CFEOM1 pedigrees or in any sporadic cases of classic CFEOM."
        },
        {
          "section": "other",
          "t_start": 97,
          "t_end": 100,
          "d_start": 51,
          "d_end": 55,
          "text": "Neuropathology studies of DS [2,3] and one form of CFEOM (CFEOM1) [4], and the identification of ARIX as the gene mutated in a second form of CFEOM (CFEOM2) [5], however, support our hypothesis that CFEOM results from maldevelopment of the oculomotor (nIII) and/or trochlear (nIV) nuclei and DS results from maldevelopment of the abducens (nVI) nucleus."
        },
        {
          "section": "other",
          "t_start": 373,
          "t_end": 376,
          "d_start": 493,
          "d_end": 497,
          "text": "To determine if CFEOM1 is indeed genetically homogeneous, we identified all unpublished CFEOM1 pedigrees in our database, analyzed them for linkage to the FEOM loci, and found that most but not all were consistent with linkage to FEOM1. The two small pedigrees not linked to FEOM1 were consistent with linkage to FEOM3. In addition, to further define the spectrum of human ARIX mutations, we identified all CFEOM1 families consistent with linkage to FEOM2 or sporadic individuals with classic CFEOM and determined that none harbored mutations in the ARIX gene."
        },
        {
          "section": "other",
          "t_start": 153,
          "t_end": 156,
          "d_start": 102,
          "d_end": 106,
          "text": "Genomic DNA samples from affected member of pedigree BC and K and 5 sporadic individuals with classic CFEOM were used as templates to sequence the three ARIX exons and flanking introns."
        },
        {
          "section": "other",
          "t_start": 43,
          "t_end": 46,
          "d_start": 77,
          "d_end": 81,
          "text": "We now find that we are unable to identify ARIX mutations underlying classic CFEOM in either sporadic cases or in individuals from CFEOM1 families."
        },
        {
          "section": "other",
          "t_start": 66,
          "t_end": 69,
          "d_start": 42,
          "d_end": 46,
          "text": "Lastly, sporadic individuals with classic CFEOM were screened for ARIX mutations."
        }
      ]
    },
    "provenance_type": {
      "database": {
        "version": "2021-01-25",
        "id": "EuropePMC"
      }
    },
    "is_associated": true,
    "date_asserted": "2002-01-01T00:00:00Z",
    "evidence_codes": [
      "http://www.targetvalidation.org/evidence/literature_mining",
      "http://purl.obolibrary.org/obo/ECO_0000213"
    ]
  },
  "validated_against_schema_version": "1.7.5",
  "access_level": "public",
  "unique_association_fields": {
    "target_id": "http://identifiers.org/uniprot/O14813",
    "publication_id": "http://europepmc.org/abstract/MED/11882252",
    "disease_id": "http://www.ebi.ac.uk/efo/EFO_1001985"
  },
  "literature": {
    "references": [
      {
        "lit_id": "http://europepmc.org/abstract/MED/11882252"
      }
    ]
  }
}
'''

evid = json.loads(evid_str)
score = evid['evidence']['resource_score']['value']

evid_df = pd.DataFrame(evid['evidence']['literature_ref']['mined_sentences'])

evid_df.head()

| | section | t_start | t_end | d_start | d_end | text |
|---|---|---|---|---|---|---|
| 0 | abstract | 218 | 221 | 192 | 196 | We now identify additional pedigrees with CFEO... |
| 1 | abstract | 327 | 330 | 302 | 306 | All demonstrated autosomal dominant inheritanc... |
| 2 | abstract | 33 | 36 | 128 | 132 | Thus far, we have not identified ARIX mutation... |
| 3 | other | 97 | 100 | 51 | 55 | Neuropathology studies of DS [2,3] and one for... |
| 4 | other | 373 | 376 | 493 | 497 | To determine if CFEOM1 is indeed genetically h... |
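
One detail worth checking in the old evidence is how the offsets are defined. In the mined sentences of the evidence string, `t_start`/`t_end` and `d_start`/`d_end` appear to be zero-based with an inclusive end, so a mention is recovered with `text[start:end + 1]`. The sketch below verifies this on the last sentence of the evidence string; the helper name `mention` is made up for illustration.

```python
def mention(text, start, end):
    """Slice a mention out of a sentence, assuming zero-based, end-inclusive offsets."""
    return text[start:end + 1]


# Last mined sentence from the old evidence string.
sentence = {
    "t_start": 66, "t_end": 69, "d_start": 42, "d_end": 46,
    "text": "Lastly, sporadic individuals with classic CFEOM were screened for ARIX mutations.",
}

print(mention(sentence["text"], sentence["t_start"], sentence["t_end"]))  # ARIX
print(mention(sentence["text"], sentence["d_start"], sentence["d_end"]))  # CFEOM
```

The same check could be run over the whole table with `evid_df.apply(lambda r: mention(r.text, r.t_start, r.t_end), axis=1)`. Note that the record printed earlier from the new parquet data suggests an exclusive end instead (`end1` is 115, one past the last character of "insulin"), so the two conventions may differ by one.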


In [69]:
evid_df.section.value_counts()

other       5
abstract    3
Name: section, dtype: int64

In [71]:
(
    df
    .filter(col('type') == "GP-DS")
    .withColumn('tmp',col('pmid'))
    .withColumnRenamed("keywordId1", "targetFromSourceId")
    .withColumnRenamed("keywordId2", "diseaseFromSourceMappedId")
    .groupBy(['pmid', 'targetFromSourceId', 'diseaseFromSourceMappedId'])
    .agg(
        first(col("label1")).alias("targetFromSource"),
        first(col("label2")).alias("diseaseFromSource"),
        collect_set(col('tmp')).alias('literature'),
        collect_set(col('evidence_score')).alias('resourceScores'),
        collect_set(
            struct(  
                col("text"),
                col('start1').alias('tStart'),
                col("end1").alias('tEnd'),
                col('start2').alias('dStart'),
                col("end2").alias('dEnd'), 
                col('section'),
                col('evidence_score').alias('score')
            )
        ).alias('textMiningSentences')
    )
    .withColumn('datasourceId',lit('europepmc'))
    .withColumn('datatypeId',lit('literature'))
    .filter(size(col('resourceScores')) > 3)
    .show()
)

+----------+------------------+-------------------------+--------------------+--------------------+------------+--------------------+--------------------+------------+----------+
|      pmid|targetFromSourceId|diseaseFromSourceMappedId|    targetFromSource|   diseaseFromSource|  literature|      resourceScores| textMiningSentences|datasourceId|datatypeId|
+----------+------------------+-------------------------+--------------------+--------------------+------------+--------------------+--------------------+------------+----------+
|PMC6683429|              null|              EFO_0000616|                 CD8|              tumour|[PMC6683429]|[2.0, 1.0, 10.0, ...|[[The analysis of...|   europepmc|literature|
|PMC6366006|   ENSG00000001626|             Orphanet_586|                CFTR|                  CF|[PMC6366006]|[2.0, 5.0, 1.0, 1...|[[The validation ...|   europepmc|literature|
|PMC6305104|              null|              EFO_1001951|                 JAK|colorectal carcinoma|[PMC63

The Europe PMC ID `PMC6868240` has many pieces of evidence supporting the association between `ENSG00000169174` and `Orphanet_139396`. The same evidence is extracted from the old evidence file for comparison. First, the PMC ID is mapped to its PubMed ID:

```
PMC6868240 -> 31748600
```

In [72]:
%%bash

gzcat /Users/dsuveges/project_data/ot/evidence_input/21.02/epmc/cttv025-25-01-2021.json.gz \
    | grep 31748600  \
    | grep ENSG00000169174 \
    | grep Orphanet_139396 \
    > old_pmid:31748600_ensembl:ENSG00000169174_efo:Orphanet_139396.json
    


Process is interrupted.
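
If the shell pipeline is too slow or gets interrupted, the same filtering can be sketched in Python by streaming the gzipped file once and keeping only the lines that contain all three identifiers. This is a plain substring match, just like the chained `grep` calls. The function name `filter_evidence` and the output file name in the example call are made up for illustration.

```python
import gzip


def filter_evidence(path, tokens, out_path):
    """Copy lines containing every token from a gzipped file to out_path; return the count."""
    kept = 0
    with gzip.open(path, "rt", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if all(token in line for token in tokens):
                dst.write(line)
                kept += 1
    return kept


# Example call (input path taken from the cell above; output name is hypothetical):
# filter_evidence(
#     "/Users/dsuveges/project_data/ot/evidence_input/21.02/epmc/cttv025-25-01-2021.json.gz",
#     ["31748600", "ENSG00000169174", "Orphanet_139396"],
#     "old_evidence_subset.json",
# )
```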


In [6]:
df.select(col('section')).distinct().show()

+--------------------+
|             section|
+--------------------+
|2.7. 
Standard Bi...|
|Simultaneous upre...|
|Determination of ...|
|Resveratrol pretr...|
|        Western blot|
|Evaluation of oth...|
|Establishment of ...|
|FcγR-TLR Cross-Ta...|
|3.8. 
Effect of t...|
|    Serologic Assays|
|Measurement of ce...|
|  Material & Methods|
|The anti-cancerou...|
|Inflammatory medi...|
|Association of TC...|
|Combination Of Su...|
|Integration of va...|
| Patient eligibility|
|Dynamic response ...|
|Ex-vivo vascular ...|
+--------------------+
only showing top 20 rows
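
The `section` values in the input are free-text headings, whereas the `textMiningSentences` schema only allows six values. One possible way to reconcile the two, shown purely as an assumption and not as what the pipeline actually does, is to keep headings that already match an enum value and map everything else to `other`. The function name `normalise_section` is hypothetical.

```python
ALLOWED_SECTIONS = ("title", "table", "other", "figure", "appendix", "abstract")


def normalise_section(heading):
    """Map a free-text section heading to a schema enum value (assumed rule)."""
    cleaned = " ".join(heading.split()).lower()  # collapse whitespace and newlines
    return cleaned if cleaned in ALLOWED_SECTIONS else "other"


print(normalise_section("Abstract"))           # abstract
print(normalise_section("Western blot"))       # other
print(normalise_section("1. \nIntroduction"))  # other
```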

