OK, there is a problem:
1. Due to some messing around in the GWAS Catalog, the study table was not exported properly, so a handful of studies are now missing.
2. Once the issue was fixed, Jack generated the study table again; however, the manual mappings were not added for the FinnGen and UKBB studies.
3. This now needs to be sorted out before the platform pipelines are run.

Steps:
* Fetch the old and new study tables.
* Join the tables together.
* Verify the table content.
* Start up the cluster.
* Generate evidence.
* Validate evidence.

In [51]:
%%bash

DESC='/Users/dsuveges/project_data/genetics/study_tables'

# Create a clean folder for the study files (":?" aborts if DESC is unset or empty):
mkdir -p "${DESC}"
rm -rf "${DESC:?}"/*

OLD='gs://genetics-portal-dev-staging/v2d/210601/studies.parquet'
NEW='gs://genetics-portal-dev-staging/v2d/210922/studies.parquet'

# Fetch both datasets:
gsutil cp -r $OLD $DESC/studies_old.parquet
gsutil cp -r $NEW $DESC/studies_new.parquet

ls -la $DESC

total 6656
drwxr-xr-x  4 dsuveges  384566875      128  7 Nov 23:24 .
drwxr-xr-x  4 dsuveges  384566875      128  7 Nov 21:43 ..
-rw-r--r--  1 dsuveges  384566875  1746889  7 Nov 23:24 studies_new.parquet
-rw-r--r--  1 dsuveges  384566875  1657489  7 Nov 23:24 studies_old.parquet


Copying gs://genetics-portal-dev-staging/v2d/210601/studies.parquet...
/ [1 files][  1.6 MiB/  1.6 MiB]
Operation completed over 1 objects/1.6 MiB.
Copying gs://genetics-portal-dev-staging/v2d/210922/studies.parquet...
/ [1 files][  1.7 MiB/  1.7 MiB]
Operation completed over 1 objects/1.7 MiB.


In [2]:
import pyspark.sql
from pyspark.sql.functions import col, size

# Local Spark session for working with the study tables:
spark = (
    pyspark.sql.SparkSession
    .builder
    .appName("phenodigm_parser")
    .config("spark.executor.memory", '10g')
    .config("spark.driver.bindAddress", "localhost")
    .config("spark.driver.memory", '10g')
    .getOrCreate()
)

In [52]:
old_file = '/Users/dsuveges/project_data/genetics/study_tables/studies_old.parquet'
new_file = '/Users/dsuveges/project_data/genetics/study_tables/studies_new.parquet'
merged_file = '/Users/dsuveges/project_data/genetics/study_tables/studies_merged.parquet'

old_df = spark.read.parquet(old_file).persist()
new_df = spark.read.parquet(new_file).persist()

print(f'Old study count: {old_df.count()}')
print(f'New study count: {new_df.count()}')

Old study count: 33781
New study count: 36395
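
The two counts show that the new table is larger, but not which studies differ or whether any old study is absent from the new table. Below is a minimal plain-Python sketch of that comparison. The function name `compare_study_ids` and the study IDs are hypothetical, chosen only for illustration; on the real tables the same check could be done in Spark, for example with `old_df.join(new_df, on='study_id', how='left_anti')`.

```python
def compare_study_ids(old_ids, new_ids):
    """Split study IDs into those only in old, only in new, and shared."""
    old_set, new_set = set(old_ids), set(new_ids)
    return {
        "only_old": sorted(old_set - new_set),
        "only_new": sorted(new_set - old_set),
        "shared": sorted(old_set & new_set),
    }


# Hypothetical IDs, for illustration only (not taken from the study tables).
old_ids = ["STUDY_A", "STUDY_B", "STUDY_C"]
new_ids = ["STUDY_B", "STUDY_C", "STUDY_D", "STUDY_E"]

report = compare_study_ids(old_ids, new_ids)
for key, ids in report.items():
    print(f"{key}: {len(ids)} {ids}")
```

Studies listed under `only_old` would be the ones to look at first, since they are the candidates for having been lost in the export.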


In [28]:
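# Merge the new and old study tables:
merged_data = (
    new_df
    # Drop rows from the new dataset with a null phenotype annotation.
    # Note: rows where trait_efos is an empty array are not null, so they are kept.
    .filter(col('trait_efos').isNotNull())

    # Append the old dataset to complement the new one:
    .union(old_df)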
    
    # Drop duplicated lines:
    .distinct()
    .coalesce(1)
    .persist()
)

merged_data.write.format('parquet').mode('overwrite').save(merged_file)

In [29]:
merged_data.count()

36402

In [31]:
merged_data.filter(col('trait_efos').isNull()).count()

0
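
A null count of zero does not necessarily mean every study has a mapping: a missing annotation can also be stored as an empty array, which is not null (the `NEALE2_30270_raw` record further down shows `trait_efos` as `[]`). The following is a plain-Python sketch of one possible merge rule, assuming the goal is one row per `study_id`, with the new row preferred and the old `trait_efos` used only when the new one is empty. The function name `merge_studies`, the study IDs and the EFO identifiers are placeholders, not the method used in this notebook.

```python
def merge_studies(new_rows, old_rows):
    """Return one row per study_id, preferring rows from the new table.

    When a new row has no EFO mapping (None or empty list) and the old
    row has one, the old mapping is copied into the new row.
    """
    old_by_id = {row["study_id"]: row for row in old_rows}
    merged = {}
    for row in new_rows:
        old = old_by_id.get(row["study_id"])
        if not row.get("trait_efos") and old and old.get("trait_efos"):
            row = {**row, "trait_efos": old["trait_efos"]}
        merged[row["study_id"]] = row
    # Keep studies that exist only in the old table.
    for study_id, row in old_by_id.items():
        merged.setdefault(study_id, row)
    return list(merged.values())


# Placeholder records, for illustration only.
new_rows = [
    {"study_id": "STUDY_A", "trait_efos": []},
    {"study_id": "STUDY_B", "trait_efos": ["EFO_0000001"]},
]
old_rows = [
    {"study_id": "STUDY_A", "trait_efos": ["EFO_0000002"]},
    {"study_id": "STUDY_B", "trait_efos": ["EFO_0000001"]},
    {"study_id": "STUDY_C", "trait_efos": None},
]

merged = merge_studies(new_rows, old_rows)
unmapped = [row["study_id"] for row in merged if not row["trait_efos"]]
print(len(merged), unmapped)
```

Testing `not row["trait_efos"]` treats null and empty arrays the same way, which is the distinction the `isNull()` count above cannot make.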

Uploading the merged dataset to the Google Cloud Storage bucket:

In [32]:
%%bash

MERGED='/Users/dsuveges/project_data/genetics/study_tables/studies_merged.parquet'

gsutil cp -r $MERGED gs://genetics-portal-dev-staging/v2d/210922/

gsutil ls -la gs://genetics-portal-dev-staging/v2d/210922/

  18596009  2021-09-28T07:43:18Z  gs://genetics-portal-dev-staging/v2d/210922/ld_analysis_input.tsv#1632814998129248  metageneration=1
  74041364  2021-09-28T07:43:18Z  gs://genetics-portal-dev-staging/v2d/210922/locus_overlap.parquet#1632814998973176  metageneration=1
   1737179  2021-11-05T13:20:09Z  gs://genetics-portal-dev-staging/v2d/210922/prev_studies.parquet#1636118409652139  metageneration=1
   1746889  2021-11-05T16:09:39Z  gs://genetics-portal-dev-staging/v2d/210922/studies.parquet#1636128579763206  metageneration=1
   8632246  2021-09-28T07:43:18Z  gs://genetics-portal-dev-staging/v2d/210922/toploci.parquet#1632814998077861  metageneration=1
    860630  2021-10-25T12:34:18Z  gs://genetics-portal-dev-staging/v2d/210922/trait_efo-2021-10-25.parquet#1635165258148742  metageneration=1
    860657  2021-11-05T18:55:38Z  gs://genetics-portal-dev-staging/v2d/210922/trait_efo-2021-11-05.parquet#1636138538105813  metageneration=1
                                 gs://genetics-portal-

Copying file:///Users/dsuveges/project_data/genetics/study_tables/studies_merged.parquet/._SUCCESS.crc [Content-Type=application/octet-stream]...
/ [1 files][    8.0 B/    8.0 B]
Copying file:///Users/dsuveges/project_data/genetics/study_tables/studies_merged.parquet/part-00000-ba9d2827-0e36-46d1-b4df-8a697531086a-c000.snappy.parquet [Content-Type=application/octet-stream]...
\ [2 files][  2.2 MiB/  2.2 MiB]
Copying file:///Users/dsuveges/project_data/genetics/study_tables/studies_merged.parquet/.part-00000-ba9d2827-0e36-46d1-b4df-8a697531086a-c000.snappy.parquet.crc [Content-Type=application/octet-stream]...
| [2 files][  2.2 MiB/  2.2 MiB]

In [69]:
%%bash

gcloud dataproc clusters create \
  genetics-evidence-generation \
  --image-version=1.4 \
  --properties=spark:spark.debug.maxToStringFields=100,spark:spark.executor.cores=31,spark:spark.executor.instances=1 \
  --master-machine-type=n1-standard-32 \
  --master-boot-disk-size=1TB \
  --zone=europe-west1-d \
  --single-node \
  --max-idle=5m \
  --region=europe-west1 \
  --project=open-targets-genetics-dev
  


Waiting on operation [projects/open-targets-genetics-dev/regions/europe-west1/operations/6c5e0291-e401-35d9-ac00-64ae1c80d49d].
Waiting for cluster creation operation...
......done.
Created [https://dataproc.googleapis.com/v1/projects/open-targets-genetics-dev/regions/europe-west1/clusters/genetics-evidence-generation] Cluster placed in zone [europe-west1-d].


In [None]:
%%ding 
%%bash


# Set Genetics Portal evidence source files:
export timeStamp='2021-10-25'

L2G_FILE="gs://genetics-portal-dev-staging/l2g/211015/predictions/l2g.full.211015.parquet"
STUDY_FILE="gs://genetics-portal-dev-staging/v2d/210922/studies_merged.parquet"
TOPLOCI_FILE="gs://genetics-portal-dev-staging/v2d/210922/toploci.parquet"
VARIANTINDEX="gs://genetics-portal-staging/variant-annotation/190129/variant-annotation.parquet"
ECO_CODES="gs://genetics-portal-data/lut/vep_consequences.tsv"
OUT_FILE="gs://genetics-portal-analysis/l2g-platform-export/data/genetics_portal-${timeStamp}"

SCRIPT_DIR='/Users/dsuveges/repositories/evidence_datasource_parsers/'

gcloud dataproc jobs submit pyspark \
  --cluster=genetics-evidence-generation \
  --project=open-targets-genetics-dev \
  --region=europe-west1 \
  ${SCRIPT_DIR}/modules/GeneticsPortal.py -- \
  --locus2gene ${L2G_FILE} \
  --toploci ${TOPLOCI_FILE} \
  --study ${STUDY_FILE} \
  --threshold 0.05 \
  --variantIndex ${VARIANTINDEX}  \
  --ecoCodes ${ECO_CODES} \
  --outputFile ${OUT_FILE}

In [42]:
%%ding
%%bash

OUTFILE='gs://genetics-portal-analysis/l2g-platform-export/data/genetics_portal-2021-10-25'
TEMP_DIR='/Users/dsuveges/project_data/genetics/evid'

mkdir -p $TEMP_DIR

gsutil cp -r $OUTFILE $TEMP_DIR

Copying gs://genetics-portal-analysis/l2g-platform-export/data/genetics_portal-2021-10-25/_SUCCESS...
/ [1 files][    0.0 B/    0.0 B]
Copying gs://genetics-portal-analysis/l2g-platform-export/data/genetics_portal-2021-10-25/part-00000-ad727899-d556-4607-abfa-6d3b4cdc97ed-c000.json.gz...
/ [2 files][286.5 KiB/286.5 KiB]
Copying gs://genetics-portal-analysis/l2g-platform-export/data/genetics_portal-2021-10-25/part-00001-ad727899-d556-4607-abfa-6d3b4cdc97ed-c000.json.gz...
- [3 files][576.2 KiB/576.2 KiB]
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you
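
With the evidence files downloaded, a first validation pass can simply confirm that every line in each `part-*.json.gz` file parses as JSON and count the records. The sketch below uses only the standard library. The function name `count_evidence` and the demo file are hypothetical, and it assumes the usual Spark layout of one JSON object per line; it does not check the evidence against any schema.

```python
import gzip
import json
import tempfile
from pathlib import Path


def count_evidence(folder):
    """Count JSON records in each part-*.json.gz file of a folder.

    Raises json.JSONDecodeError if any line is not valid JSON.
    """
    counts = {}
    for part in sorted(Path(folder).glob("part-*.json.gz")):
        n_records = 0
        with gzip.open(part, "rt", encoding="utf-8") as handle:
            for line in handle:
                json.loads(line)
                n_records += 1
        counts[part.name] = n_records
    return counts


# Demo on a throwaway file; the records are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "part-00000-demo.json.gz"
    with gzip.open(demo, "wt", encoding="utf-8") as handle:
        handle.write('{"id": 1}\n{"id": 2}\n')
    counts = count_evidence(tmp)

print(counts)
```

In this notebook the function would presumably be pointed at the `genetics_portal-2021-10-25` folder copied into `TEMP_DIR`.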

In [45]:
(
    merged_data
    .filter(col('study_id') == 'NEALE2_30270_raw')
    .show(1, vertical=True, truncate=False)
)

-RECORD 0----------------------------------------
 study_id             | NEALE2_30270_raw         
 pmid                 |                          
 pub_date             | 2018-08-01               
 pub_journal          |                          
 pub_title            |                          
 pub_author           | UKB Neale v2             
 trait_reported       | Mean sphered cell volume 
 trait_efos           | []                       
 ancestry_initial     | [European=344729]        
 ancestry_replication | []                       
 n_initial            | 344729                   
 n_replication        | 0                        
 n_cases              | null                     
 trait_category       | Uncategorised            
 num_assoc_loci       | 494                      
 has_sumstats         | true                     



In [46]:
(
    new_df
    .filter(col('study_id') == 'NEALE2_30270_raw')
    .show(1, vertical=True, truncate=False)
)

-RECORD 0----------------------------------------
 study_id             | NEALE2_30270_raw         
 pmid                 |                          
 pub_date             | 2018-08-01               
 pub_journal          |                          
 pub_title            |                          
 pub_author           | UKB Neale v2             
 trait_reported       | Mean sphered cell volume 
 trait_efos           | []                       
 ancestry_initial     | [European=344729]        
 ancestry_replication | []                       
 n_initial            | 344729                   
 n_replication        | 0                        
 n_cases              | null                     
 trait_category       | Uncategorised            
 num_assoc_loci       | 494                      
 has_sumstats         | true                     



In [65]:
(
    new_df
    .filter((size(col("trait_efos")) != 0) & (~col('study_id').startswith('GCST')))
    .select('study_id', 'trait_reported', 'trait_efos')
    .show(truncate=False)
    
)

+----------------------------------------+-----------------------------------------------------------+---------------------------------------+
|study_id                                |trait_reported                                             |trait_efos                             |
+----------------------------------------+-----------------------------------------------------------+---------------------------------------+
|FINNGEN_R5_AB1_AMOEBIASIS               |Amoebiasis                                                 |[EFO_0007144]                          |
|FINNGEN_R5_AB1_ANOGENITAL_HERPES_SIMPLEX|Anogenital herpesviral [herpes simplex] infection          |[EFO_0007282]                          |
|FINNGEN_R5_AB1_ARTHROPOD                |Arthropod-borne viral fevers and viral haemorrhagic fevers |[EFO_0000763, MONDO_0018087]           |
|FINNGEN_R5_AB1_ASPERGILLOSIS            |Aspergillosis                                              |[EFO_0007157]                          |

In [68]:
(
    new_df
    .filter(size(col("trait_efos")) == 0)
    .select('trait_category')
    .distinct()
    .show(truncate=False)
    
)

+--------------+
|trait_category|
+--------------+
|Uncategorised |
+--------------+
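
To see which sources the unmapped studies come from, the empty `trait_efos` rows can be grouped by the alphabetic prefix of `study_id`. This is a small plain-Python sketch: the helper names `source_of` and `unmapped_by_source` are hypothetical, and apart from the two records shown in the outputs above, the rows are made-up placeholders.

```python
import re
from collections import Counter


def source_of(study_id):
    """Leading alphabetic prefix of a study ID, e.g. 'FINNGEN' or 'GCST'."""
    match = re.match(r"[A-Za-z]+", study_id)
    return match.group(0) if match else "UNKNOWN"


def unmapped_by_source(rows):
    """Count studies without an EFO mapping (None or empty list) per source."""
    return Counter(
        source_of(row["study_id"]) for row in rows if not row["trait_efos"]
    )


rows = [
    # Records taken from the outputs above:
    {"study_id": "NEALE2_30270_raw", "trait_efos": []},
    {"study_id": "FINNGEN_R5_AB1_AMOEBIASIS", "trait_efos": ["EFO_0007144"]},
    # Made-up placeholders:
    {"study_id": "NEALE2_00000_raw", "trait_efos": []},
    {"study_id": "GCST000000", "trait_efos": None},
]

summary = unmapped_by_source(rows)
for source, n_unmapped in summary.most_common():
    print(source, n_unmapped)
```

Note that the prefix stops at the first non-letter, so `NEALE2_...` IDs are grouped under `NEALE`.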

