# Generating CRISPR evidence

* Study and data files are stored in a Google Cloud Storage bucket.
* The study table contains study-level annotation and points to the corresponding data file.
* Location: `gs://ot-team/dsuveges/CRISPR_screens/*`

## Process

1. Fetching files
2. Reading study file
3. Looping through every row of the study table (one row per study):
    * Read data file.
    * Select hits: genes that reach the significance threshold
    * Do this on the lower tail, the upper tail, or both, as indicated in the study file
    * Pool hits together
    * Explode annotated diseases
    * Map diseases to EFO
4. Pool formatted data into single dataframe
5. Save as `json.gz`
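
The hit selection and disease explosion steps above can be illustrated on toy data. This is a minimal sketch, not the notebook's implementation: the gene names, FDR values and the `select_hits` helper are made up for illustration, and only the column names (`neg|fdr`, `pos|fdr`, `lowerTail`, `upperTail`, `threshold`, `diseases`) follow the files described here.

```python
import pandas as pd

# Toy MAGeCK-style gene summary (all values are made up).
toy_data = pd.DataFrame({
    'id': ['GENE_A', 'GENE_B', 'GENE_C'],
    'neg|fdr': [0.005, 0.50, 0.90],
    'pos|fdr': [0.99, 0.002, 0.70],
})

# Toy study row: test both tails at FDR <= 0.01, two pipe-separated diseases.
toy_study = {
    'lowerTail': True,
    'upperTail': True,
    'threshold': 0.01,
    'diseases': 'Dementia|Tauopathy',
}


def select_hits(data, study):
    """Collect the significant genes from each tail requested by the study."""
    tails = []
    for flag, column, label in [('lowerTail', 'neg|fdr', 'lower'),
                                ('upperTail', 'pos|fdr', 'upper')]:
        if study[flag]:
            tails.append(
                data.loc[data[column] <= study['threshold'], ['id', column]]
                .rename(columns={column: 'resourceScore'})
                .assign(effectDirection=label)
            )
    if not tails:
        # Neither tail requested: return an empty frame with the same columns.
        return pd.DataFrame(columns=['id', 'resourceScore', 'effectDirection'])
    return pd.concat(tails, ignore_index=True)


toy_hits = (
    select_hits(toy_data, toy_study)
    # Attach the study-level disease string, split it, then one row per disease:
    .assign(diseaseFromSource=toy_study['diseases'])
    .assign(diseaseFromSource=lambda df: df['diseaseFromSource'].str.split('|'))
    .explode('diseaseFromSource', ignore_index=True)
)
print(toy_hits)
```

Each of the two toy hits is repeated once per annotated disease, which is why the number of evidence rows grows with the length of the disease list.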

In [9]:
%%bash

gsutil cp -r gs://ot-team/dsuveges/CRISPR_screens/* .
ls -lah

total 3608
drwxrwxr-x   6 dsuveges  384566875   192B 14 Jun 22:58 .
drwxrwxr-x  48 dsuveges  384566875   1.5K 14 Jun 22:27 ..
drwxrwxr-x   3 dsuveges  384566875    96B 14 Jun 22:28 .ipynb_checkpoints
-rw-rw-r--   1 dsuveges  384566875   1.7M 14 Jun 22:58 OTAR036_Screen4_5_NEGPOSvBFP.gene_summary.tsv
-rw-rw-r--   1 dsuveges  384566875   429B 14 Jun 22:58 OTAR_crispr_studies.tsv
-rw-rw-r--   1 dsuveges  384566875    14K 14 Jun 22:58 Prototyping crispr evidence.ipynb


Copying gs://ot-team/dsuveges/CRISPR_screens/OTAR036_Screen4_5_NEGPOSvBFP.gene_summary.tsv...
Copying gs://ot-team/dsuveges/CRISPR_screens/OTAR_crispr_studies.tsv...
Operation completed over 2 objects/1.7 MiB.


In [1]:
import pandas as pd
import gzip
import json
import ontoma
import logging


class disease_map(object):

    def __init__(self):
        self.ontoma = ontoma.interface.OnToma()

    def map_disease(self, disease_name):
        logging.info(f"Mapping '{disease_name}'")

        # Search disease name using OnToma and accept perfect matches
        ontoma_mapping = self.ontoma.find_term(disease_name, verbose=True)

        # If there's some mapping available:
        if ontoma_mapping:

            # Extracting term if no action is required:
            if ontoma_mapping['action'] is None:
                return ontoma_mapping['term'].split('/')[-1]

            # When there is an exact match, but action is required:
            elif ontoma_mapping['quality'] == "match":

                # Match in HP or ORDO, check if there is a match in MONDO too. If so, give preference to MONDO hit
                mondo_mapping = self.search_mondo(disease_name)

                if mondo_mapping:
                    # Mondo mapping good - return
                    if mondo_mapping['exact']:
                        return mondo_mapping['term'].split('/')[-1]
                    # Mondo mapping bad - return ontoma
                    else:
                        return ontoma_mapping['term'].split('/')[-1]
                else:
                    # Mondo mapping bad - return ontoma
                    return ontoma_mapping['term'].split('/')[-1]

            else:

                # xref search didn't work, try MONDO as the last resort
                mondo_mapping = self.search_mondo(disease_name)
                if mondo_mapping:
                    if mondo_mapping['exact']:
                        return mondo_mapping['term'].split('/')[-1]
                    else:
                        return None
                else:
                    # Record the unmapped disease
                    return None

        else:
            # No match in EFO, HP or ORDO
            mondo_mapping = self.search_mondo(disease_name)
            if mondo_mapping:
                if mondo_mapping['exact']:
                    return mondo_mapping['term'].split('/')[-1]
                else:
                    return None
            else:
                return None
            
    def search_mondo(self, disease_name):

        disease_name = disease_name.lower()

        # mondo_lookup works like a dictionary lookup: if the disease is not in there, it raises an error instead of returning `None`
        try:
            mondo_term = self.ontoma.mondo_lookup(disease_name)
            return {
                # Stored under 'term' because map_disease reads mondo_mapping['term']
                'term': mondo_term,
                'name': self.ontoma.get_mondo_label(mondo_term),
                'exact': True
            }
        except KeyError:
            exact_ols_mondo = self.ontoma._ols.besthit(disease_name,
                                                       ontology=['mondo'], field_list=['iri', 'label'], exact=True)

            if exact_ols_mondo:
                return {'term': exact_ols_mondo['iri'], 'name': exact_ols_mondo['label'], 'exact':True}

            else:
                ols_mondo = self.ontoma._ols.besthit(disease_name,
                                                     ontology=['mondo'],
                                                     field_list=['iri', 'label'],
                                                     bytype='class')
                if ols_mondo:
                    return {'term': ols_mondo['iri'], 'name': ols_mondo['label'], 'exact': False}
                else:
                    return None
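
The branching in `map_disease` boils down to a fixed order of preference between the OnToma result and the MONDO result. The sketch below restates that order as a pure function, so it can be checked without OnToma installed. `short_id` and `choose_mapping` are hypothetical helpers, the input dictionaries only mimic the keys used above (`term`, `action`, `quality`, `exact`), and the `action` value and MONDO IRI in the example are illustrative placeholders. Unlike the class, which queries MONDO only when needed, the sketch takes both results as arguments.

```python
def short_id(iri):
    """Return the last path segment of an ontology IRI, e.g. 'EFO_0000249'."""
    return iri.split('/')[-1]


def choose_mapping(ontoma_mapping, mondo_mapping):
    """Pick a term following the same preference order as map_disease."""
    mondo_is_exact = bool(mondo_mapping) and mondo_mapping['exact']

    # 1. OnToma hit that needs no further action wins outright.
    if ontoma_mapping and ontoma_mapping['action'] is None:
        return short_id(ontoma_mapping['term'])

    # 2. Otherwise an exact MONDO hit is preferred.
    if mondo_is_exact:
        return short_id(mondo_mapping['term'])

    # 3. Fall back to an exact OnToma match that was flagged for action.
    if ontoma_mapping and ontoma_mapping['quality'] == 'match':
        return short_id(ontoma_mapping['term'])

    # 4. Nothing acceptable: the disease stays unmapped.
    return None


# Hypothetical inputs mimicking the dictionaries used in the class:
clean_hit = {'term': 'http://www.ebi.ac.uk/efo/EFO_0000249', 'action': None, 'quality': 'match'}
flagged_hit = {'term': 'http://purl.obolibrary.org/obo/HP_0002180', 'action': 'check', 'quality': 'match'}
exact_mondo = {'term': 'http://purl.obolibrary.org/obo/MONDO_0000001', 'exact': True}
fuzzy_mondo = {'term': 'http://purl.obolibrary.org/obo/MONDO_0000001', 'exact': False}

print(choose_mapping(clean_hit, None))
print(choose_mapping(flagged_hit, exact_mondo))
print(choose_mapping(flagged_hit, fuzzy_mondo))
print(choose_mapping(None, fuzzy_mondo))
```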
            

## Study file:

In [32]:

# Initialize disease mapping:
dm = disease_map()

# tsv file with study level information:
study_file = 'OTAR_crispr_studies.tsv'

# Read study file:
study_df = pd.read_csv(study_file, sep='\t')

pd.set_option('display.max_columns', None)  
pd.set_option('display.width', 200)
pd.set_option('display.max_colwidth', 90)

study_df.iloc[0]

INFO     - ontoma.downloaders - ZOOMA to EFO mappings - Parsed 3663 rows
INFO     - ontoma.downloaders - OMIM to EFO mappings - Parsed 8561 rows


studyName                                                                                  OTAR036_TAU_uptake_1
description           Comparison of functional populations of iPSC neurons based on their ability to take up...
diseases              Neurodegeneration|Dementia|Tauopathy(ies)|Alzheimer’s Disease|Frontotemporal dementia|...
cellTypes                                                                                  iPSC derived neurons
cellLineBackground                                                                                          NaN
lowerTail                                                                                                 False
upperTail                                                                                                  True
threshold                                                                                                  0.01
dataFile                                                          OTAR036_Screen4_5_NEGPOSvBFP.gene_summ

### Data file:

In [5]:
data_file = study_df.dataFile[0]
data = pd.read_csv(data_file, sep='\t')
print(data.head())
print(data.iloc[0].to_markdown())

        id  num  neg|score  neg|p-value   neg|fdr  neg|rank  neg|goodsgrna  \
0     PIGF    5   0.000005     0.000026  0.463696         1              5   
1   sgSGK1    5   0.000013     0.000058  0.463696         2              5   
2    CDCA3    4   0.000020     0.000069  0.463696         3              3   
3  sgRAD21    5   0.000029     0.000130  0.654703         4              4   
4  sgRPL19    5   0.000052     0.000246  0.766520         5              4   

   neg|lfc  pos|score  pos|p-value   pos|fdr  pos|rank  pos|goodsgrna  pos|lfc  
0 -6.12980    0.99999      1.00000  0.999995     20081              0 -6.12980  
1 -6.39920    0.99999      0.99999  0.999995     20080              0 -6.39920  
2 -0.87142    0.94927      0.94914  0.999795     19056              0 -0.87142  
3 -7.27630    0.24794      0.45240  0.999795      9006              1 -7.27630  
4 -8.09340    0.96851      0.96859  0.999795     19435              0 -8.09340  
|               | 0                  |
|:----

In [192]:
def process_study(row):
    # Extract file name and load data:
    data_file = row['dataFile']
    data_df = pd.read_csv(data_file, sep='\t')

    # Create empty df to store hits:
    hits = pd.DataFrame(columns=data_df.columns).rename(columns={'neg|fdr': 'resourceScore', 'pos|fdr': 'effectDirection'})

    # Filtering for the lower tail:
    if row['lowerTail']:
        filtered_df = (
            data_df.loc[data_df['neg|fdr'] <= row['threshold']].copy()
            .rename(columns={'neg|fdr': 'resourceScore'})
            .drop('pos|fdr', axis=1)
            .assign(effectDirection='lower')
        )
        hits = pd.concat([hits, filtered_df])

    # Filtering for the upper tail:
    if row['upperTail']:
        filtered_df = (
            data_df.loc[data_df['pos|fdr'] <= row['threshold']].copy()
            .rename(columns={'pos|fdr': 'resourceScore'})
            .drop('neg|fdr', axis=1)
            .assign(effectDirection='upper')
        )
        hits = pd.concat([hits, filtered_df])
        
    # No significant hit in this study:
    if len(hits) == 0:
        return None

    # Format data:
    hits = (
        hits
        # Renaming columns:
        .rename(columns={
            'id': 'targetFromSourceId',
            'description': 'studyOverview',
            
        })
        
        # Dropping unused columns:
        .drop(['num', 'neg|score', 'neg|p-value', 'neg|rank',
               'neg|goodsgrna', 'neg|lfc', 'pos|score', 'pos|p-value',
               'pos|rank', 'pos|goodsgrna', 'pos|lfc'], axis=1)
        
        # Creating a few columns from study table
        .assign(
            studyId=row['studyName'],
            description=row['description'],
            cohortPhenotypes= row['diseases'],
            diseaseCellLines=row['cellTypes'],
            datasourceId='ot_crispr',
            datatypeId='affected_pathway'
        )
        
        # Generating list columns for cells and diseases:
        # (split on the pipe only; removing every space would also fuse multi-word names)
        .assign(
            cohortPhenotypes = lambda x: x['cohortPhenotypes'].str.split('|'),
            diseaseFromSource = lambda x: x['cohortPhenotypes'],  # Will be exploded
            diseaseCellLines= lambda x: x['diseaseCellLines'].str.split('|'),
        )
        
        # Rows are exploded into each disease term:
        .explode('diseaseFromSource', ignore_index=True)
    )

    # Each disease term is then mapped to EFO:
#     hits['diseaseFromSourceMappedId'] = hits['diseaseFromSource'].apply(dm.map_disease)
    
    
    return hits


# Looping through all studies in the study table:
# Read the corresponding data file
# Process data according to the study file annotation
# Map diseases to EFO
# Collect data:
hits_list = (
    # Processing data:
    study_df.apply(process_study, axis=1)
    
    # Dropping studies with no hits:
    .dropna()
    
    # Convert to list
    .tolist()
)

# Pooling hits together:
hits_df = pd.concat(hits_list)
print(hits_df.head())

# Save gzipped json:
(
    hits_df
    .to_json('otar_crispr.json.gz', compression='gzip', orient='records', lines=True)
)

FileNotFoundError: [Errno 2] No such file or directory: 'Screen4_5_NEGPOSvPOSPOS.gene_summary.txt'
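
A failure like the one above can be caught before any processing starts by checking every `dataFile` entry against the data folder. The following is a minimal sketch under stated assumptions: `find_missing_files` is a hypothetical helper, and the study table and file names are toy values created in a temporary directory.

```python
import tempfile
from pathlib import Path

import pandas as pd


def find_missing_files(study_table, data_dir):
    """Return the dataFile entries that do not exist under data_dir."""
    data_dir = Path(data_dir)
    return [
        name
        for name in study_table['dataFile'].dropna()
        if not (data_dir / name).is_file()
    ]


with tempfile.TemporaryDirectory() as tmp:
    # Only one of the two referenced files is actually created:
    (Path(tmp) / 'present.gene_summary.tsv').write_text('id\tneg|fdr\tpos|fdr\n')

    toy_studies = pd.DataFrame({
        'studyName': ['study_1', 'study_2'],
        'dataFile': ['present.gene_summary.tsv', 'absent.gene_summary.txt'],
    })
    missing = find_missing_files(toy_studies, tmp)

print(missing)
```

Reporting all missing files up front avoids fixing them one failed run at a time.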

In [50]:
%%bash

gzcat otar_crispr.json.gz | head -n1 | jq 

{
  "targetFromSourceId": "AP2M1",
  "resourceScore": 0.00495,
  "effectDirection": "upper",
  "studyId": "OTAR036_TAU_uptake_1",
  "description": "Comparison of functional populations of iPSC neurons based on their ability to take up monomeric/aggregated tau protein",
  "cohortPhenotypes": [
    "Neurodegeneration",
    "Dementia",
    "Tauopathy(ies)",
    "Alzheimer’s Disease",
    "Frontotemporal dementia",
    "Pick’s Disease"
  ],
  "diseaseCellLines": [
    "iPSC derived neurons"
  ],
  "datasourceId": "ot_crispr",
  "datatypeId": "affected_pathway",
  "diseaseFromSource": "Neurodegeneration",
  "diseaseFromSourceMappedId": "HP_0002180"
}


In [48]:
%%bash

gsutil cp otar_crispr.json.gz gs://ot-team/dsuveges/CRISPR_screens/

Copying file://otar_crispr.json.gz [Content-Type=application/json]...
Operation completed over 1 objects/498.0 B.


In [8]:
import pandas as pd

## Input:
datafile = '/Users/dsuveges/Downloads/Hosted Partner Platform Instance/Sample data/Shared_Data/OTAR033/hacat.il17a.dropout.gene_summary.txt'
filterColumn = 'pos|fdr'
threshold = 0.01

# Read data, filter and rename columns:
mageck_df = (
    pd.read_csv(datafile, sep='\t')
    .rename(columns={filterColumn: 'resourceScore', 'id': 'targetFromSourceId'})
    [['targetFromSourceId', 'resourceScore']]
    .loc[lambda df: df.resourceScore <= threshold]
)
mageck_df.head()

Unnamed: 0,targetFromSourceId,resourceScore
6087,CREBBP,0.000495
7380,PTEN,0.000495
8239,PHF12,0.000495
12083,SIRT6,0.006855
12178,EP300,0.000495


Unnamed: 0,id,num,neg|score,neg|p-value,neg|fdr,neg|rank,neg|goodsgrna,neg|lfc,pos|score,pos|p-value,pos|fdr,pos|rank,pos|goodsgrna,pos|lfc
0,RPS19,10,2.1450000000000002e-17,2.7466e-07,1.5e-05,1,10,-5.291,1.0,1.0,1.0,18011,0,-5.291
1,CENPM,15,7.286400000000001e-17,2.7466e-07,1.5e-05,2,15,-3.3295,1.0,1.0,1.0,18024,0,-3.3295
2,EIF4A3,10,3.0187e-15,2.7466e-07,1.5e-05,3,10,-4.1431,1.0,1.0,1.0,18023,0,-4.1431
3,NHP2L1,10,5.2812e-15,2.7466e-07,1.5e-05,4,10,-4.4097,1.0,1.0,1.0,18022,0,-4.4097
4,RBMX,11,6.8757e-15,2.7466e-07,1.5e-05,5,11,-3.0211,1.0,1.0,1.0,18021,0,-3.0211


In [193]:
import logging 

input_file = '/Users/dsuveges/repositories/evidence_datasource_parsers/CRISPR_screens/OTAR_crispr_studies.tsv'
data_folders = '/Users/dsuveges/repositories/evidence_datasource_parsers/CRISPR_screens'
output_file = 'otar_crispr-2021-08-04.json.gz'

# Reading study file:
study_df = pd.read_csv(input_file, sep='\t')

if 'Studied cell type' in study_df.loc[0].tolist():
    logging.info('Dropping column descriptions')
    study_df.drop(0, axis=0, inplace=True)
    
# Processing study data:
study_df = (
    study_df
    
    # drop rows with no study id or data file:
    .loc[study_df.studyId.notna() & study_df.dataFile.notna()]
    .assign(
        # Explode mapped diseases:
        diseaseFromSourceMappedId=study_df.diseases.str.replace(' ', '').str.split('|'),
        
        # Explode data files:
        dataFiles=study_df.dataFile.str.replace(' ', '').str.split('|')
    )
    # rename columns:
    .rename(
        columns={
            'isDerived': 'isCellTypeDerived',
            'library': 'crisprScreenLibrary',
            'mode': 'crisprStudyMode',
            'populations': 'contrast',
            'studyDescription': 'studyOverview'
        }
    )
)

# Reading all data files and filtering for significant hits:
hits = (
    study_df[['studyId', 'dataFiles', 'dataFileType', 'filterColumn', 'threshold']]
    .explode('dataFiles')
    .assign(dataFile=lambda df: df.dataFiles.apply(lambda x: f'{data_folders}/{x}'))
    .apply(parse_MAGeCK_file, axis=1)
)

# Concatenate all hits into one single dataframe:
hits_df = (
    pd.concat(hits.to_list())
    .reset_index(drop=True)
)

# Merging:
merged_dataset = (
    study_df
    .assign(direction=lambda df: df.filterColumn.map(FILTER_COLUMN_MAP))
    .merge(hits_df, on='studyId', how='inner')
    .explode('diseaseFromSourceMappedId')
    .assign(
        datasourceId='ot_crispr',
        datatypeId='ot_partner'
    )
    [[
        'targetFromSourceId', 'diseaseFromSourceMappedId', 
        'projectId', 'studyId', 'studyOverview', 'contrast', 'crisprScreenLibrary',
        'cellType', 'cellLineBackground', 'geneticBackground',
        'direction','resourceScore', 
        'datasourceId', 'datatypeId'
    ]]
)

# Save file:
(
    merged_dataset
#     .apply(lambda row: [row.dropna()], axis=1)
    .to_json(output_file, compression='gzip', orient='records', lines=True)
)

In [194]:
FILTER_COLUMN_MAP = {
    'pos|fdr': 'upper tail',
    'neg|fdr': 'lower tail',
    'neg|p-value': 'lower tail',
    'pos|p-value': 'upper tail'
}

def parse_MAGeCK_file(row) -> pd.DataFrame:
    """Read a MAGeCK gene summary file and return the genes passing the threshold, with renamed columns and the study ID. Returns None if the file is not found."""
    datafile = row['dataFile']
    filterColumn = row['filterColumn']
    threshold = float(row['threshold'])
    studyId = row['studyId']
    
    # Read data, filter and rename columns:
    try:
        mageck_df = (
            pd.read_csv(datafile, sep='\t')
            .rename(columns={filterColumn: 'resourceScore', 'id': 'targetFromSourceId'})
            .loc[lambda df: df.resourceScore <= threshold]
            [['targetFromSourceId', 'resourceScore']]
            .assign(studyId=studyId)
        )
        logging.info(f'Number of genes reaching the threshold: {len(mageck_df)}')
        return mageck_df
    except FileNotFoundError:
        logging.info(f'Study skipped as file was not found: {datafile}')

parse_MAGeCK_file(row)

Unnamed: 0,targetFromSourceId,resourceScore,studyId
414,YIF1B,6.322700e-03,OTAR036_TAU_uptake_1
507,TCP10L2,6.379900e-03,OTAR036_TAU_uptake_1
638,ABCC9,8.230800e-03,OTAR036_TAU_uptake_1
2084,ART3,2.747500e-03,OTAR036_TAU_uptake_1
2454,IL15,2.269300e-03,OTAR036_TAU_uptake_1
...,...,...,...
20076,WDR7,1.259700e-04,OTAR036_TAU_uptake_1
20077,PIK3R4,1.232600e-06,OTAR036_TAU_uptake_1
20078,AP2M1,2.465300e-07,OTAR036_TAU_uptake_1
20079,LRP1,2.465300e-07,OTAR036_TAU_uptake_1


In [195]:
pd.concat(hits.to_list()).reset_index(drop=True)

Unnamed: 0,targetFromSourceId,resourceScore,studyId
0,YIF1B,6.322700e-03,OTAR036_TAU_uptake_1
1,TCP10L2,6.379900e-03,OTAR036_TAU_uptake_1
2,ABCC9,8.230800e-03,OTAR036_TAU_uptake_1
3,ART3,2.747500e-03,OTAR036_TAU_uptake_1
4,IL15,2.269300e-03,OTAR036_TAU_uptake_1
...,...,...,...
8518,ENSG00000186298,4.733800e-02,OTAR2054_RBM3
8519,ENSG00000102710,1.986400e-02,OTAR2054_RBM3
8520,ENSG00000119801,2.620000e-07,OTAR2054_RBM3
8521,ENSG00000115875,7.870000e-07,OTAR2054_RBM3


In [196]:
%%bash

gzcat otar_crispr-2021-08-04.json.gz | head -n1 | jq

{
  "targetFromSourceId": "YIF1B",
  "diseaseFromSourceMappedId": "EFO_0000249",
  "projectId": "OTAR036",
  "studyId": "OTAR036_TAU_uptake_1",
  "studyOverview": "Comparison of functional populations of iPSC neurons based on their ability to take up monomeric/aggregated tau protein",
  "contrast": "TauNEGTransPOS (only transferrin taken up) vs TauPOSTransPOS  (both proteins taken up)",
  "crisprScreenLibrary": "Kosuke v1.1 (Behan et al., 2018)",
  "cellType": "iPSC derived cortical neurons",
  "cellLineBackground": "KOLF2-C1",
  "geneticBackground": null,
  "direction": "upper tail",
  "resourceScore": 0.0063227,
  "datasourceId": "ot_crispr",
  "datatypeId": "ot_partner"
}


In [197]:
import json
import gzip

# Serialise row by row, dropping null fields so they are omitted from the JSON:
json_list = [json.dumps(row.dropna().to_dict()) for _, row in merged_dataset.iterrows()]
with gzip.open(output_file, 'wt') as f:
    f.write('\n'.join(json_list))



In [188]:
len(json_list)

8966

In [189]:
%%bash 
gzcat otar_crispr-2021-08-04.json.gz | head -n1 | jq


{
  "targetFromSourceId": "YIF1B",
  "diseaseFromSourceMappedId": "EFO_0000249",
  "projectId": "OTAR036",
  "studyId": "OTAR036_TAU_uptake_1",
  "studyOverview": "Comparison of functional populations of iPSC neurons based on their ability to take up monomeric/aggregated tau protein",
  "contrast": "TauNEGTransPOS (only transferrin taken up) vs TauPOSTransPOS  (both proteins taken up)",
  "crisprScreenLibrary": "Kosuke v1.1 (Behan et al., 2018)",
  "cellType": "iPSC derived cortical neurons",
  "cellLineBackground": "KOLF2-C1",
  "direction": "upper tail",
  "resourceScore": 0.0063227,
  "datasourceId": "ot_crispr",
  "datatypeId": "ot_partner"
}


In [148]:
json.dump(merged_dataset.iloc[0].dropna().to_dict(), output_fh)

In [163]:
merged_dataset.resourceScore.max()

0.199708

In [164]:
hits_df

Unnamed: 0,targetFromSourceId,resourceScore,studyId
0,YIF1B,6.322700e-03,OTAR036_TAU_uptake_1
1,TCP10L2,6.379900e-03,OTAR036_TAU_uptake_1
2,ABCC9,8.230800e-03,OTAR036_TAU_uptake_1
3,ART3,2.747500e-03,OTAR036_TAU_uptake_1
4,IL15,2.269300e-03,OTAR036_TAU_uptake_1
...,...,...,...
8518,ENSG00000186298,4.733800e-02,OTAR2054_RBM3
8519,ENSG00000102710,1.986400e-02,OTAR2054_RBM3
8520,ENSG00000119801,2.620000e-07,OTAR2054_RBM3
8521,ENSG00000115875,7.870000e-07,OTAR2054_RBM3


In [190]:
merged_dataset.columns.sort_values()

Index(['cellLineBackground', 'cellType', 'contrast', 'crisprScreenLibrary',
       'datasourceId', 'datatypeId', 'direction', 'diseaseFromSourceMappedId',
       'geneticBackground', 'projectId', 'resourceScore', 'studyId',
       'studyOverview', 'targetFromSourceId'],
      dtype='object')

In [175]:
merged_dataset.projectId.unique()

array(['OTAR036', 'OTAR033', 'OTAR035', 'OTAR2054'], dtype=object)

In [191]:
%%bash

pwd

/Users/dsuveges/project/random_notebooks/issue-1597_CRISPR_evidence


In [201]:
(
    study_df
    .head()
    .explode('dataFiles')
    .assign(
        dataFile=lambda df: df.apply(lambda x: f'{data_folders}/{x["projectId"]}/{x["dataFiles"]}', axis=1)
    )
    .iloc[2]['dataFile']
)






'/Users/dsuveges/repositories/evidence_datasource_parsers/CRISPR_screens/OTAR033/hacat.il17a.facs.gene_summary.txt'

In [1]:
merged.head()

NameError: name 'merged' is not defined

In [3]:
import pandas as pd
import json
import gzip


evidence_file = '/Users/dsuveges/repositories/evidence_datasource_parsers/otar_crispr-2021-08-12.json.gz'
evid_data = []

with gzip.open(evidence_file, 'rt') as f:
    for line in f:
        evid_data.append(json.loads(line))
        
evid_df = pd.DataFrame(evid_data)
evid_df.head()

Unnamed: 0,targetFromSourceId,diseaseFromSourceMappedId,projectId,studyId,studyOverview,contrast,crisprScreenLibrary,cellType,cellLineBackground,statisticalTestTail,resourceScore,datasourceId,datatypeId
0,YIF1B,EFO_0000249,OTAR036,OTAR036_TAU_uptake_1,Comparison of functional populations of iPSC n...,TauNEGTransPOS (only transferrin taken up) vs ...,"Kosuke v1.1 (Behan et al., 2018)",iPSC derived cortical neurons,KOLF2-C1,upper tail,0.006323,ot_crispr,ot_partner
1,YIF1B,EFO_0003096,OTAR036,OTAR036_TAU_uptake_1,Comparison of functional populations of iPSC n...,TauNEGTransPOS (only transferrin taken up) vs ...,"Kosuke v1.1 (Behan et al., 2018)",iPSC derived cortical neurons,KOLF2-C1,upper tail,0.006323,ot_crispr,ot_partner
2,TCP10L2,EFO_0000249,OTAR036,OTAR036_TAU_uptake_1,Comparison of functional populations of iPSC n...,TauNEGTransPOS (only transferrin taken up) vs ...,"Kosuke v1.1 (Behan et al., 2018)",iPSC derived cortical neurons,KOLF2-C1,upper tail,0.00638,ot_crispr,ot_partner
3,TCP10L2,EFO_0003096,OTAR036,OTAR036_TAU_uptake_1,Comparison of functional populations of iPSC n...,TauNEGTransPOS (only transferrin taken up) vs ...,"Kosuke v1.1 (Behan et al., 2018)",iPSC derived cortical neurons,KOLF2-C1,upper tail,0.00638,ot_crispr,ot_partner
4,ABCC9,EFO_0000249,OTAR036,OTAR036_TAU_uptake_1,Comparison of functional populations of iPSC n...,TauNEGTransPOS (only transferrin taken up) vs ...,"Kosuke v1.1 (Behan et al., 2018)",iPSC derived cortical neurons,KOLF2-C1,upper tail,0.008231,ot_crispr,ot_partner


In [4]:
len(evid_df)

7276

In [5]:
len(evid_df.drop_duplicates())

7276

In [20]:
uniq_fields = ['diseaseFromSourceMappedId', 'targetFromSourceId', 'studyId', 'resourceScore']


len(evid_df[uniq_fields].drop_duplicates())

7276

In [17]:
evid_df[uniq_fields].loc[evid_df[uniq_fields].duplicated()]

Unnamed: 0,diseaseFromSourceMappedId,targetFromSourceId,studyId,statisticalTestTail
1001,EFO_0000676,BMPR1A,OTAR033_IL17A-TNFa_1,upper tail
1376,EFO_0000676,RPS19,OTAR033_IL17A-TNFa_2,lower tail
1379,EFO_0000676,NHP2L1,OTAR033_IL17A-TNFa_2,lower tail
1380,EFO_0000676,RBMX,OTAR033_IL17A-TNFa_2,lower tail
1381,EFO_0000676,SNRPD3,OTAR033_IL17A-TNFa_2,lower tail
...,...,...,...,...
5564,EFO_0000676,KMT2D,OTAR033_IL4_2,lower tail
5573,EFO_0000676,TAF2,OTAR033_IL4_2,lower tail
5641,EFO_0000676,LDLR,OTAR033_IL4_2,lower tail
5677,EFO_0000676,CWC25,OTAR033_IL4_2,lower tail


In [19]:
evid_df.loc[
    (evid_df.diseaseFromSourceMappedId == 'EFO_0000676')
    & (evid_df.targetFromSourceId == 'KMT2D')
    & (evid_df.studyId == 'OTAR033_IL4_2')
]



Unnamed: 0,targetFromSourceId,diseaseFromSourceMappedId,projectId,studyId,studyOverview,contrast,crisprScreenLibrary,cellType,cellLineBackground,statisticalTestTail,resourceScore,datasourceId,datatypeId
3693,KMT2D,EFO_0000676,OTAR033,OTAR033_IL4_2,Differenciated HaCaT keratinocytes were stimul...,,"Kosuke v1.1 (Behan et al., 2018)",HaCaT keratinocytes,,lower tail,7.7e-05,ot_crispr,ot_partner
5564,KMT2D,EFO_0000676,OTAR033,OTAR033_IL4_2,Differenciated HaCaT keratinocytes were stimul...,,"Kosuke v1.1 (Behan et al., 2018)",HaCaT keratinocytes,,lower tail,0.16912,ot_crispr,ot_partner
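
The two KMT2D rows above share disease, target and study but carry different scores, so they only show up as duplicates when `resourceScore` is left out of the key. A minimal sketch of how such conflicts could be listed and collapsed follows. Keeping the smallest score per key is an assumption made for illustration, not a rule stated in this notebook, and the TAF2 score is a made-up value.

```python
import pandas as pd

key_fields = ['diseaseFromSourceMappedId', 'targetFromSourceId', 'studyId']

toy_evidence = pd.DataFrame({
    'diseaseFromSourceMappedId': ['EFO_0000676', 'EFO_0000676', 'EFO_0000676'],
    'targetFromSourceId': ['KMT2D', 'KMT2D', 'TAF2'],
    'studyId': ['OTAR033_IL4_2', 'OTAR033_IL4_2', 'OTAR033_IL4_2'],
    'resourceScore': [7.7e-05, 0.16912, 0.001],  # TAF2 score is made up
})

# keep=False flags every member of a duplicated group, not only the repeats.
conflicts = toy_evidence.loc[toy_evidence.duplicated(subset=key_fields, keep=False)]

# One possible resolution (an assumption): keep the smallest score per key.
deduplicated = (
    toy_evidence
    .sort_values('resourceScore')
    .drop_duplicates(subset=key_fields, keep='first')
    .sort_index()
)

print(conflicts)
print(deduplicated)
```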


In [8]:
import pandas as pd


df = pd.DataFrame([
    {'a': 2, 'b': 24},
    {'a': 4, 'b': None},
    {'a': 12, 'b': 100}
])

(
    df
    .dropna().to_json('test.json.gz', compression='gzip', orient='records', lines=True)

)

In [9]:
%%bash

gzcat test.json.gz

{"a":2,"b":24.0}
{"a":12,"b":100.0}
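
As the output shows, `DataFrame.dropna()` removes the whole row that contains a null, so the record with `a = 4` is lost. If the goal is to keep every record and only omit the null fields, the nulls can be filtered per record instead. The sketch below reuses the toy frame from the cell above and assumes all fields are scalars (`pd.notna` returns an array for list-valued cells, which would need separate handling).

```python
import json

import pandas as pd

df = pd.DataFrame([
    {'a': 2, 'b': 24},
    {'a': 4, 'b': None},
    {'a': 12, 'b': 100},
])

# Convert to one dict per row, then drop only the null fields of each record.
records = [
    {key: value for key, value in record.items() if pd.notna(value)}
    for record in df.to_dict(orient='records')
]

# One JSON document per line, as in the evidence files.
json_lines = [json.dumps(record) for record in records]
print('\n'.join(json_lines))
```

All three records survive, and the second one simply lacks the `b` key.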
