# Checking target object

* **Date:** 2021.05.18
* **Version:** v8



## Fetching data

The v8 release of the target object is downloaded from the Google Cloud Storage bucket.

In [None]:
%%bash


gsutil cp -r gs://ot-team/jarrod/target-outputs/v8 /Users/dsuveges/project_data/target_index/

In [7]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StringType, IntegerType, TimestampType, StructType

# establish spark connection
spark = (
    SparkSession.builder
    .getOrCreate()
)


new_target = (
    spark.read.json('/Users/dsuveges/project_data/target_index/v8/target-beta')
    .persist()
)

new_target.show()

+----------------+--------------------+--------------+--------------+--------------------+--------------------+--------------------+--------------------+----+--------------------+--------------------+---------------+--------------------+--------------------+--------------------+--------------------+--------------------+----+--------------------+--------------------+
|alternativeGenes|        approvedName|approvedSymbol|       biotype|          constraint|             dbXrefs|functionDescriptions|     genomicLocation|  go|           hallmarks|          homologues|             id|          proteinIds|   safetyLiabilities|subcellularLocations|            synonyms|         targetClass| tep|        tractability|       transcriptIds|
+----------------+--------------------+--------------+--------------+--------------------+--------------------+--------------------+--------------------+----+--------------------+--------------------+---------------+--------------------+--------------------+----

In [8]:
new_target.printSchema()

root
 |-- alternativeGenes: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- approvedName: string (nullable = true)
 |-- approvedSymbol: string (nullable = true)
 |-- biotype: string (nullable = true)
 |-- constraint: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- constraintType: string (nullable = true)
 |    |    |-- exp: double (nullable = true)
 |    |    |-- obs: long (nullable = true)
 |    |    |-- oe: double (nullable = true)
 |    |    |-- oeLower: double (nullable = true)
 |    |    |-- oeUpper: double (nullable = true)
 |    |    |-- score: double (nullable = true)
 |    |    |-- upperBin: long (nullable = true)
 |    |    |-- upperBin6: long (nullable = true)
 |    |    |-- upperRank: long (nullable = true)
 |-- dbXrefs: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- source: string (nullable = true)
 |-- functionDescriptions

## Conclusions

Number of targets in the target object: 17,643

1. The `alternativeGenes` column is a good way to keep together all gene IDs that share a location but come from different builds. (Only 50 such genes exist.)
2. Interestingly, some `approvedSymbol` values are ambiguous (one symbol is shared by multiple targets):
 * There are 40 such symbols.
 * For 14 of them, the symbol is shared by more than 2 targets.
 * When one symbol is shared by two targets, one of the targets is mapped to the conventional location, while the other is mapped to an alternative scaffold. Interestingly, these entries are not linked via the `alternativeGenes` column.
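
The symbol tallies in point 2 are easy to sanity-check outside Spark. The following is a minimal plain-Python sketch, not the notebook's own method: `symbol_counts` holds the per-symbol target counts printed further down in this notebook, plus `CCL4L2` with its two entries, and `shared_by_more_than` is a hypothetical helper introduced only for illustration.

```python
# Per-symbol target counts copied from this notebook's output.
# This is only a subset of the ambiguous symbols (the full dataset has 40).
symbol_counts = {
    "U2": 4, "5S_rRNA": 3, "U4": 3, "7SK": 3, "U3": 20, "U6": 11,
    "5_8S_rRNA": 3, "Metazoa_SRP": 54, "SNORA70": 8, "U8": 12,
    "SNORA63": 5, "SNORA72": 3, "SNORA75": 3, "Y_RNA": 239,
    "CCL4L2": 2,
}


def shared_by_more_than(counts, threshold):
    """Return the symbols used by more than `threshold` targets, sorted by name."""
    return sorted(symbol for symbol, n in counts.items() if n > threshold)


ambiguous = shared_by_more_than(symbol_counts, 1)         # shared by 2+ targets
highly_ambiguous = shared_by_more_than(symbol_counts, 2)  # shared by 3+ targets

print(len(ambiguous), len(highly_ambiguous))
```

On the full dataset the same distinction corresponds to filtering the grouped counts with `> 1` (all ambiguous symbols) versus `> 2`, which is the filter applied in the cell that computes `multi_symbol`.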

In [70]:
(
    new_target
    .filter(F.col('alternativeGenes').isNotNull())
    .select('alternativeGenes')
    .count()
)

50

In [62]:
dataset_size = new_target.count()

# Symbols shared by more than two targets, with one example name per symbol:
colname = 'approvedSymbol'
multi_symbol = (
    new_target
    .groupby(colname)
    .agg(
        F.count(F.col(colname)).alias('count'),
        F.first(F.col('approvedName'))
    )
    .filter(F.col('count') > 2)
    .persist()
)
print(multi_symbol.count())
multi_symbol.show()
(
    new_target
    .filter(F.col('alternativeGenes').isNotNull())
    .select('id', 'alternativeGenes', 'approvedName', 'approvedSymbol', 'genomicLocation')
    .show(100)
)


(
    new_target
    .filter(F.col('approvedSymbol')=='U2')
    .select('id', 'alternativeGenes', 'approvedName', 'approvedSymbol', 'genomicLocation')
    .show()
)

14
+--------------+-----+--------------------------+
|approvedSymbol|count|first(approvedName, false)|
+--------------+-----+--------------------------+
|            U2|    4|      U2 spliceosomal RNA |
|       5S_rRNA|    3|         5S ribosomal RNA |
|            U4|    3|      U4 spliceosomal RNA |
|           7SK|    3|                  7SK RNA |
|            U3|   20|      Small nucleolar R...|
|            U6|   11|      U6 spliceosomal RNA |
|     5_8S_rRNA|    3|       5.8S ribosomal RNA |
|   Metazoa_SRP|   54|      Metazoan signal r...|
|       SNORA70|    8|      small nucleolar R...|
|            U8|   12|      U8 small nucleola...|
|       SNORA63|    5|      Small nucleolar R...|
|       SNORA72|    3|      Small nucleolar R...|
|       SNORA75|    3|      Small nucleolar R...|
|         Y_RNA|  239|                    Y RNA |
+--------------+-----+--------------------------+

+---------------+--------------------+--------------------+--------------+--------------------+


In [67]:
(
    new_target
    .filter(F.col('approvedSymbol') == 'CCL4L2')
    .select('id', 'alternativeGenes', 'approvedName', 'approvedSymbol', 'genomicLocation')
    .show(truncate=False)
)

(
    new_target
    .filter(F.col('alternativeGenes').isNotNull())
    .select('id', 'alternativeGenes', 'approvedName', 'approvedSymbol', 'genomicLocation')
    .show(100)
)

+---------------+----------------------------------+------------------------------------+--------------+--------------------------------------------+
|id             |alternativeGenes                  |approvedName                        |approvedSymbol|genomicLocation                             |
+---------------+----------------------------------+------------------------------------+--------------+--------------------------------------------+
|ENSG00000275313|[ENSG00000276125, ENSG00000282604]|C-C motif chemokine ligand 4 like 2 |CCL4L2        |[CHR_HSCHR17_10_CTG4, 36314484, 36312669, 1]|
|ENSG00000276070|null                              |C-C motif chemokine ligand 4 like 2 |CCL4L2        |[17, 36212878, 36210924, 1]                 |
+---------------+----------------------------------+------------------------------------+--------------+--------------------------------------------+
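
The two `CCL4L2` rows above illustrate the missing link between same-symbol targets. Here is a minimal plain-Python sketch of that check, using only those two rows. The `targets` list and the helper names are illustrative, and treating a `CHR_` prefix as the marker of an alternative scaffold is an assumption based on the row shown, not something taken from the pipeline.

```python
from itertools import combinations

# The two CCL4L2 rows from the output above: (id, alternativeGenes, chromosome).
targets = [
    ("ENSG00000275313", ["ENSG00000276125", "ENSG00000282604"], "CHR_HSCHR17_10_CTG4"),
    ("ENSG00000276070", None, "17"),
]


def is_alternative_scaffold(chromosome):
    # Assumption: alternative scaffolds carry a "CHR_" prefix, as in the row above.
    return chromosome.startswith("CHR_")


def unlinked_pairs(rows):
    """Return pairs of target ids where neither lists the other in alternativeGenes."""
    pairs = []
    for (id_a, alt_a, _), (id_b, alt_b, _) in combinations(rows, 2):
        linked = id_b in (alt_a or []) or id_a in (alt_b or [])
        if not linked:
            pairs.append((id_a, id_b))
    return pairs


print(unlinked_pairs(targets))
print([(target_id, is_alternative_scaffold(chrom)) for target_id, _, chrom in targets])
```

For these rows the single pair comes back as unlinked: the scaffold entry lists two other gene IDs, but not the one on chromosome 17.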

+---------------+--------------------+--------------------+--------------+--------------------+
|  

* `biotype` is always set.
* However, for 286 targets it is set to an empty string. Why? Probably a missing annotation.
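
Because an empty string is not null, a null check alone would not flag these 286 targets. Below is a minimal plain-Python sketch of a stricter "missing" test. `is_missing` and the sample values are hypothetical, chosen only to show the three cases.

```python
def is_missing(value):
    """Treat None, empty, and whitespace-only strings as missing."""
    return value is None or value.strip() == ""


# Hypothetical sample of biotype values: real labels, an empty string, and a null.
sample = ["protein_coding", "", None, "lncRNA"]

not_null = [v for v in sample if v is not None]         # what a null check keeps
informative = [v for v in sample if not is_missing(v)]  # what is actually usable

print(len(not_null), len(informative))
```

In the notebook itself the empty values are selected with an explicit comparison against `''`, which is why they show up as a separate group in the `biotype` counts.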

In [76]:
column = 'biotype'

(
    new_target
    .groupby(column)
    .count()
    .show(30)
)


(
    new_target
    .filter(F.col(column) == '')
    .select('id', 'biotype','alternativeGenes', 'approvedName', 'approvedSymbol', 'genomicLocation')
    .show(100)
)

+--------------------+-----+
|             biotype|count|
+--------------------+-----+
|polymorphic_pseud...|   15|
|transcribed_unita...|   41|
|     IG_C_pseudogene|    1|
|                rRNA|   12|
|                sRNA|    1|
|              lncRNA| 4870|
|     TR_V_pseudogene|    8|
|               snRNA|  575|
|transcribed_proce...|  149|
|           IG_C_gene|    3|
|     TR_J_pseudogene|    1|
|           IG_J_gene|    6|
|  unitary_pseudogene|   29|
|           TR_V_gene|   23|
|      protein_coding| 5850|
|processed_pseudogene| 2947|
|               miRNA|  535|
|           TR_J_gene|   22|
|     IG_J_pseudogene|    1|
|transcribed_unpro...|  278|
|           TR_C_gene|    2|
|              snoRNA|  297|
|     rRNA_pseudogene|  136|
|          pseudogene|    4|
|     IG_V_pseudogene|   52|
|             Mt_tRNA|    6|
|           IG_V_gene|   34|
|unprocessed_pseud...|  781|
|            misc_RNA|  651|
|                    |  286|
+--------------------+-----+
only showing t
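
The table above can be checked with some quick arithmetic, written here as a plain-Python sketch. The counts are copied from the 30 rows shown, and the total of 17,643 is the figure quoted in the conclusions. The variable names are illustrative only.

```python
# Counts of the 30 biotype rows displayed above, in the order shown.
shown_counts = [
    15, 41, 1, 12, 1, 4870, 8, 575, 149, 3,
    1, 6, 29, 23, 5850, 2947, 535, 22, 1, 278,
    2, 297, 136, 4, 52, 6, 34, 781, 651, 286,
]
total_targets = 17643  # total quoted in the conclusions

shown_total = sum(shown_counts)
not_shown = total_targets - shown_total  # targets in biotypes beyond the 30 rows
empty_fraction = 286 / total_targets     # share of targets with an empty biotype

print(shown_total, not_shown, f"{empty_fraction:.1%}")
```

The displayed rows add up to 17,616, so, assuming the quoted total is correct, the remaining 27 targets fall into biotypes cut off by `show(30)`. The empty biotype accounts for roughly 1.6% of all targets.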

* `constraint` is set for about 5.4k targets.

In [84]:
column = 'constraint'


(
    new_target
    .filter(F.col(column).isNotNull())
    .count()
)

(
    new_target
    .filter(F.col(column).isNotNull())
    .show(1, vertical=True, truncate=False)
)

(
    new_target
    .filter(F.col(column).isNotNull())
    .select('approvedSymbol', F.explode(F.col("constraint")))
    .printSchema()
#     .show(1, vertical=True, truncate=False)
)

(
    new_target
    .filter(F.col(column).isNotNull())
    .select('approvedSymbol', F.explode(F.col("constraint")).alias('constraint'))
    .select('approvedSymbol', 'constraint.constraintType')
    .show()
)

-RECORD 0-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 alternativeGenes     | null                                                                                                                                                                                                                                                                                                             

In [97]:
column = 'targetClass'

(
    new_target
    .filter(F.col(column).isNotNull())
    .select('approvedSymbol', F.explode(column).alias(column))
    .select('approvedSymbol', F.col('targetClass.id'), F.col('targetClass.label'), F.col('targetClass.level'))
    .show(truncate=False)
)


+--------------+----+------------------------------------------+-----+
|approvedSymbol|id  |label                                     |level|
+--------------+----+------------------------------------------+-----+
|CFH           |3   |Secreted protein                          |l1   |
|CFLAR         |601 |Unclassified protein                      |l1   |
|NOS2          |1   |Enzyme                                    |l1   |
|MMP25         |82  |Metallo protease                          |l3   |
|MMP25         |82  |Metallo protease M10A subfamily           |l5   |
|MMP25         |82  |Metallo protease MAM clan                 |l4   |
|MMP25         |82  |Protease                                  |l2   |
|MMP25         |82  |Enzyme                                    |l1   |
|LYPLA2        |646 |Enzyme                                    |l1   |
|LYPLA2        |646 |Hydrolase                                 |l2   |
|GABRA3        |1173|Ion channel                               |l1   |
|GABRA