# Updating the validation lab data model

Based on the discussion we had on 2022.01.12 with David, Stuart and Andrew, we were extensively discuss the data model of the evidence coming from the validation lab.


The data model I arrived to the meeting:

```json
{
  "targetFromSource": "UBE2C",
  "resourceScore": 4.02,
  "confidence": "Failed validation",
  "microsatelliteStabilityStatus": "MSI",
  "diseaseSubtype": "A",
  "diseaseCellLines": [
    {
      "name": "KM12",
      "id": "SIDM00150",
      "tissue": "Large Intestine",
      "tissueId": "UBERON_0000059"
    }
  ],
  "biomarkers": [
    {
      "name": "TP53mut",
      "variants": [
        {
          "name": "TP53:H179R",
          "functionalConsequenceId": "SO_0001583"
        },
        {
          "name": "TP53:V73fs*50",
          "functionalConsequenceId": "SO_0000865"
        }
      ]
    },
    {
      "name": "APCwt",
      "variants": [
        {
          "name": "APC:N1818fs*2",
          "functionalConsequenceId": "SO_0000865"
        }
      ]
    },
    {
      "name": "KRASwt"
    }
  ]
}
```

Proposed changes:
* The microsatellites and the CRIS status are put into the `biomarkers` object. The APC, KRAS, PK53 variants are removed.
* Resource score is a float.
* Adding new column: `sourceProject`. This column tells which project provides the association to be validated. For the colon cancer dataset it is constant: `OTAR015`. Further datasets might have different sources.
* Adding new column: effect direction. For colon cancer dataset is constant `positive` (`statisticalTestTail` can hold this information?)
* Adding new columns:
    * `contrast` - will be sourced from Stuart.
    * `studyOverview` - will be sourced from Stuart.
    * `statisticalTestTail` - for this datasource it is `upper tail` all the time.
    
    
Based on further discussions with Panos from Validation lab, these changes were propsed:

* New field to contain hypothesis -> list with name and description
* New field: expectedConfidence -> flag to indicate if the validation true for the hypothesis
* Add project code
* Add project name

For this round, we don't care much about robustness, we just make sure the code is running, we don't have to be robust and flexible.
    
    
How it should look like:

```json
{
  "datasrouceId": "crispr_validation",
  "dataTypeId": "ot_validation",
  "diseaseFromSourceMappedId": "EFO_0005842",
  "targetFromSource": "UBE2C",
  "resourceScore": 4.02,
  "confidence": "Failed validation",
  "expectedConfidence": "Passed validation",
  "diseaseCellLines": [
    {
      "name": "KM12",
      "id": "SIDM00150",
      "tissue": "Large Intestine",
      "tissueId": "UBERON_0000059"
    }
  ],
  "biomarkers": [
    {
      "name": "TP53mut"
    },
    {
      "name": "APCwt"
    },
    {
      "name": "KRASwt"
    },
    {
      "name": "Microsatellite instability: MSI"
    },
    {
      "name": "CRIS subtype: A"
    }
  ],
  "statisticalTestTail": "upper tail",
  "contrast": "Loss of cell viability vs control.",
  "studyOverview": "CellTitreGio measurement",
  "validationHypotheses": [
      {
          "hypothesis": "MSI",
          "description": "This description will be provided by the validation lab"
      }
   ],
  "projectId": "OTAR015",
  "projectDescription": "CRISPR Cas9 Target ID"
}
```

In [29]:
# Using exclusively pyspark:
import pandas as pd
import json
import requests
from functools import reduce

from pyspark.conf import SparkConf
from pyspark.sql.types import ArrayType, StringType, IntegerType, StructType
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, struct, lit, when, udf, array, expr

sparkConf = (
    SparkConf()
    .set('spark.driver.memory', '15g')
    .set('spark.executor.memory', '15g')
    .set('spark.driver.maxResultSize', '0')
    .set('spark.debug.maxToStringFields', '2000')
    .set('spark.sql.execution.arrow.maxRecordsPerBatch', '500000')
)
spark = (
    SparkSession.builder
    .config(conf=sparkConf)
    .master('local[*]')
    .getOrCreate()
)


cell_lines_file = '/Users/dsuveges/project_data/validation_lab/COlines.txt'
cell_lines_annotation_file = '/Users/dsuveges/project_data/validation_lab/model_list_20211124.csv'
cell_cancer_driver_mutation = '/Users/dsuveges/project_data/validation_lab/mutations_summary_20211124.csv'
validation_file = '/Users/dsuveges/project_data/validation_lab/CTG_CO_Partner-Preview-Matrix_v6a.txt'

Preparing cell line annotation data:

In [8]:
annot = (
    spark.read
    .option("multiline",True)
    .csv(cell_lines_annotation_file, header=True, sep=',', quote='"')
    .persist()
)

""" This is a map that provides recipie to generate the biomarker objects
If a value cannot be found in the map, the value will be returned.
"""
biomarkerMaps = {
    
    'MS_status': {
        'description': 'Micro satellite stability'
    },
    'CRIS_subtype': {
        'description': 'colorectal cancer intrinsic subtypes (CRIS) defined by distinctive molecular, functional and phenotypic peculiarities',
        'prefix': 'CRIS-'
    },
    'KRAS_status': {
        'description': 'KRAS mutation status',
        'prefix': 'KRAS-'
    },
    'TP53_status': {
        'description': 'TP53 mutation status'
        'prefix': 'TP53-'
    },
    'APC_status': {
        'description': 'APC mutation status'
        'prefix': 'ACP-'
    }
}


diseaseCellLines_model = (
    annot
    .select(
        col('model_name').alias('name'), 
        col('model_id').alias('id'), 
        col('tissue')
    )
)


validation_lab_cell_lines = (
    # Reading cell metadata from validation lab:
    spark.read.csv(cell_lines_file, sep='\t', header=True)
    
    # Renaming columns:
    .withColumnRenamed('CO_line', 'name')
    
    # Updating some of the cell lines' name:
    .withColumn('name',
         when(col('name') == 'HT29', 'HT-29')
        .when(col('name') == 'HCT116', 'HCT-116')
        .when(col('name') == 'LS180', 'LS-180')
        .otherwise(col('name'))
    )  
    
    # Joining dataset with cell model data read downloaded from Sanger website:
    .join(diseaseCellLines_model, on='name', how='left')
    
    # Adding UBERON code to tissues (it's constant colon)
    .withColumn('tissueID', lit('UBERON_0000059'))
    
    # generating disease cell lines object:
    .withColumn(
        'diseaseCellLines',
        array(struct(col('name'), col('id'), col('tissue'), col('tissueId')))
    )
    # .drop(*['id', 'tissue', 'tissueId', 'CRIS_subtype'])
    .persist()
)


validation_lab_cell_lines.show()

+-------+---------+------------+-----------+-----------+----------+---------+---------------+--------------+--------------------+
|   name|MS_status|CRIS_subtype|KRAS_status|TP53_status|APC_status|       id|         tissue|      tissueID|    diseaseCellLines|
+-------+---------+------------+-----------+-----------+----------+---------+---------------+--------------+--------------------+
|  SW626|      MSS|           ?|        mut|        mut|       mut|SIDM01168|Large Intestine|UBERON_0000059|[{SW626, SIDM0116...|
|  HT-29|      MSS|           B|         wt|        mut|       mut|SIDM00136|Large Intestine|UBERON_0000059|[{HT-29, SIDM0013...|
|  SW837|      MSS|           B|        mut|        mut|       mut|SIDM00833|Large Intestine|UBERON_0000059|[{SW837, SIDM0083...|
|  MDST8|      MSS|           D|         wt|         wt|       mut|SIDM00527|Large Intestine|UBERON_0000059|[{MDST8, SIDM0052...|
|HCT-116|      MSI|           D|        mut|         wt|        wt|SIDM00783|Large Intesti

In [76]:
c = 'KRAS_status'

@udf
def get_biomarker(columnName, biomarker):
    if biomarker == '?':
        return None
    
    # If no data is provided, we'll return the value as it is:
    if biomarkerMaps[columnName] == {}:
        return biomarker
    
    return biomarkerMaps[columnName]['prefix'] + biomarker

(
    validation_lab_cell_lines
    .withColumn(c, struct( get_biomarker(lit(c), col(c))).alias('name'))
    .show(1, vertical=True)
#     .printSchema()
)

-RECORD 0--------------------------------
 name             | SW626                
 MS_status        | MSS                  
 CRIS_subtype     | ?                    
 KRAS_status      | mut                  
 TP53_status      | mut                  
 APC_status       | mut                  
 id               | SIDM01168            
 tissue           | Large Intestine      
 tissueID         | UBERON_0000059       
 diseaseCellLines | [{SW626, SIDM0116... 
 cica             | {KRAS-mut}           
only showing top 1 row



In [66]:
# This list covers all the biomarkers:
biomarker_list = [get_biomarker(lit(x), col(x)) for x in biomarkerMaps.keys()]

annotated_cell_lines = (

    # Reading cell metadata from validation lab:
    spark.read.csv(cell_lines_file, sep='\t', header=True)
    
    # Renaming columns:
    .withColumnRenamed('CO_line', 'name')
    
    # Updating some of the cell lines' name:
    .withColumn('name',
         when(col('name') == 'HT29', 'HT-29')
        .when(col('name') == 'HCT116', 'HCT-116')
        .when(col('name') == 'LS180', 'LS-180')
        .otherwise(col('name'))
    )  
    
    # Joining dataset with cell model data read downloaded from Sanger website:
    .join(diseaseCellLines_model, on='name', how='left')
    
    # Adding UBERON code to tissues (it's constant colon)
    .withColumn('tissueID', lit('UBERON_0000059'))
    
    # generating disease cell lines object:
    .withColumn(
        'diseaseCellLines',
        array(struct(col('name'), col('id'), col('tissue'), col('tissueId')))
    )

    # Collecting biomarkers:
    .withColumn('biomarkers', array(*biomarker_list))
    
    # Select relevant columns:
    .select('name', 'diseaseCellLines', 'biomarkers')
    
    .persist()
)

In [81]:
# Generating list of specific expressions to parse biomarkers:
expressions = map(lambda biomarker: (biomarker, struct( get_biomarker(lit(biomarker), col(biomarker))).alias('name')), biomarkerMaps.keys())


res_df = reduce(lambda DF,value: DF.withColumn(*value), expressions, validation_lab_cell_lines)
res_df.select('name',array(*biomarkerMaps.keys()).alias('biomarkers')).show()


+-------+--------------------+
|   name|          biomarkers|
+-------+--------------------+
|  SW626|[{MSS}, {null}, {...|
|  HT-29|[{MSS}, {CRIS-B},...|
|  SW837|[{MSS}, {CRIS-B},...|
|  MDST8|[{MSS}, {CRIS-D},...|
|HCT-116|[{MSI}, {CRIS-D},...|
|   KM12|[{MSI}, {CRIS-A},...|
|    RKO|[{MSI}, {null}, {...|
| LS-180|[{MSI}, {CRIS-A},...|
+-------+--------------------+



In [63]:
# Generating list of specific expressions to parse biomarkers:
expressions = map(lambda biomarker: (biomarker, struct( get_biomarker(lit(biomarker), col(biomarker))).alias('name')), biomarkerMaps.keys())


# evidence = (
(
    spark.read.csv(validation_file, sep='\t', header=True)
    .withColumnRenamed('gene', 'targetFromSource')
    .withColumnRenamed('cell-line', 'name')
    .withColumn('resourceScore', col('effect-size').cast("double"))
    .withColumn(
        'confidence',
        when(col('resourceScore') >= 38, lit('Passed validation'))
        .otherwise(lit('Failed validation'))
    )
    .withColumn(
        'expectedConfidence',
        when(col('expected-to-pass') == 'TRUE', lit('Passed validation'))
        .otherwise(lit('Failed validation'))
    )
    .join(annotated_cell_lines, on='name', how='left')
    .drop(*['name', 'pass-fail', 'expected-to-pass', 'effect-size'])
    
    # Adding constants:
    .withColumn('statisticalTestTail', lit('upper tail'))
    .withColumn('contrast', lit('Loss of cell viability vs control'))
    .withColumn('studyOverview', lit('CellTitreGio measurement'))
    
    # This column is specific for this dataset:
    .withColumn('datasourceId', 'ot_crispr_validation')
    .withColumn('datatypeId', 'ot_validation_lab')
    .withColumn("diseaseFromSourceMappedId", "EFO_0005842")
    
    # This should be added to the crispr dataset as well:
    .withColumn('projectId', 'OTAR015')
    .withColumn('projectDescription', 'CRISPR Cas9 Target ID')
    
    # This column is specific for genes, will be updated later:
    .withColumn('validationHypotheses', 
                struct(
                    lit('MSI').alias('hypothesis'), 
                    lit('This description will be provided by the validation lab').alias('description')
                )
               )
    .write.format('json').mode('overwrite').option('compression', 'gzip').save('validation.json.gz')
#     .show(1, vertical=True, truncate=False)
)

#   "confidence": "Failed validation",
#   "expectedConfidence": "Passed validation",
#   "statisticalTestTail": "upper tail",
#   "contrast": "Loss of cell viability vs control.",
#   "studyOverview": "CellTitreGio measurement",
#   "validationHypotheses": [
#       {
#           "hypothesis": "MSI",
#           "description": "This description will be provided by the validation lab"
#       }
#    ],
#   "projectId": "OTAR015",
#   "projectDescription": "CRISPR Cas9 Target ID"

In [64]:
%%bash

gzcat validation.json.gz/*gz | head -n1 | jq

{
  "targetFromSource": "ARHGEF7",
  "resourceScore": 59.9,
  "confidence": "Passed validation",
  "expectedConfidence": "Passed validation",
  "diseaseCellLines": [
    {
      "name": "SW626",
      "id": "SIDM01168",
      "tissue": "Large Intestine",
      "tissueId": "UBERON_0000059"
    }
  ],
  "biomarkers": [
    "{name=MSS}",
    null,
    "{name=KRAS-mut}",
    "{name=TP53-mut}",
    "{name=ACP-mut}"
  ],
  "statisticalTestTail": "upper tail",
  "contrast": "Loss of cell viability vs control",
  "studyOverview": "CellTitreGio measurement",
  "validationHypotheses": {
    "hypothesis": "MSI",
    "description": "This description will be provided by the validation lab"
  }
}


```json
{
  "datasourceId": "ot_crispr_validation",
  "dataTypeId": "ot_validation_lab",
  "projectId": "OTAR015",
  "projectDescription": "CRISPR Cas9 Target ID",
  "targetFromSource": "ARHGEF7",
  "targetId": "ENSG00000102606",
  "diseaseFromSourceMappedId": "EFO_0005842",
  "diseaseId": "EFO_0005842",
  "resourceScore": 59.9,
  "confidence": "significant",
  "expectedConfidence": "significant",
  "diseaseCellLines": [
    {
      "name": "SW626",
      "id": "SIDM01168",
      "tissue": "Large Intestine",
      "tissueId": "UBERON_0000059"
    }
  ],
  "biomarkers": [
    {
      "name": "MSS",
      "description": "Micro-satellite stability"
    },
    {
      "name": "KRAS-mut",
      "description": "KRAS mutation status"
    },
    {
      "name": "TP53-mut",
      "description": "TP53 mutation status"
    },
    {
      "name": "ACP-mut",
      "description": "ACP mutation status"
    }
  ],
  "statisticalTestTail": "upper tail",
  "contrast": "Loss of cell viability vs control",
  "studyOverview": "CellTitreGio measurement",
  "validationHypotheses": [
    {
      "hypothesis": "MSI",
      "description": "Microsatellite stability."
    }
  ]
}
```