# Gathering Metadata on Ribo-Seq Studies

This notebook documents how we parse the outputs of GEO and ARGEOS searches, so that our list of Ribo-Seq studies to be processed can be reproduced.

In [2]:
import pandas as pd 
from info_from_pmid import get_info_from_medline

### Reading Inputs

ARGEOS (https://ar-geos.org/) is a search tool for ArrayExpress and GEO. We submit the query below through its web portal and receive an output that makes inconsistent meta-information easier to handle.

"Ribosome Profiling" OR "Ribo-Seq" 

In [5]:
argeos_path = '../data/argeos_Ribosome_profiling_Human.tsv'

argeos = pd.read_csv(argeos_path, sep="\t")

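# Keep only the SRA accession, i.e. the text after "=" in the SRA link.
# Rows without an SRA link (the string "None" or a missing value) are left as-is.
for i in argeos.index:
    sra = argeos.at[i, "SRA"]
    if isinstance(sra, str) and sra != "None":
        argeos.at[i, "SRA"] = sra.split("=")[1]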
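
The same clean-up can also be written without an explicit loop. The sketch below is only illustrative: the link format and the accessions are assumptions made up for the example, not values taken from the ARGEOS export.

```python
import pandas as pd

# Toy SRA column (hypothetical values): a link, the string "None", and a missing value.
sra = pd.Series([
    "https://www.ncbi.nlm.nih.gov/sra?term=SRP000001",
    "None",
    None,
])

# Only entries that contain "=" are treated as links.
has_link = sra.str.contains("=", na=False)

# Replace links with the text after the last "="; leave everything else untouched.
sra = sra.where(~has_link, sra.str.split("=").str[-1])

print(sra.tolist())
```

Checking for `"="` rather than for the string `"None"` means that rows read in as missing values are skipped as well.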

We also broaden the search by querying GEO directly, because I found that a subset of studies was not picked up by the ARGEOS search. This search is likewise carried out through the web portal, against the GEO DataSets database (GDS: https://www.ncbi.nlm.nih.gov/gds).

"Ribosome Profiling" OR "Ribo-Seq" 

In [4]:
geo_ribosome_profiling_path = '../data/geo_ribosomeprofiling.tsv'

geo_ribosome_profiling = pd.read_csv(geo_ribosome_profiling_path, sep='\t')

### Exploration of Inputs


In [6]:
# Collect the unique accessions found by either search, preserving order.
all_accessions = []
for accession in list(geo_ribosome_profiling.Accession) + list(argeos.Accession):
    if accession not in all_accessions:
        all_accessions.append(accession)

In [7]:
len(all_accessions)

492
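
As a cross-check on the count above, the same union can be computed with plain Python containers, which also makes it easy to see what each search contributes on its own. The accession lists below are made-up placeholders, not our data.

```python
# Toy accession lists standing in for the two search results (hypothetical values).
geo_accessions = ["GSE100001", "GSE100002", "GSE100003"]
argeos_accessions = ["GSE100002", "GSE100004"]

# dict.fromkeys keeps the first occurrence of each key, so order is preserved.
unique_accessions = list(dict.fromkeys(geo_accessions + argeos_accessions))

# Accessions reported by only one of the two searches.
only_geo = sorted(set(geo_accessions) - set(argeos_accessions))
only_argeos = sorted(set(argeos_accessions) - set(geo_accessions))

print(len(unique_accessions), only_geo, only_argeos)
```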

### Generation of Superset 

In [6]:
# Columns shared by both search outputs, used as merge keys.
on = ["Accession", "Title", "Organism", "Samples", "SRA", "Release_Date"]
superset = pd.merge(geo_ribosome_profiling, argeos, how='outer', on=on)
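
To see which search each row of an outer merge came from, `pd.merge` accepts `indicator=True`. The sketch below uses two small made-up tables (hypothetical accessions and titles) rather than our real inputs.

```python
import pandas as pd

# Toy stand-ins for the two search results (hypothetical values).
geo = pd.DataFrame({
    "Accession": ["GSE100001", "GSE100002"],
    "Title": ["Study A", "Study B"],
})
argeos = pd.DataFrame({
    "Accession": ["GSE100002", "GSE100003"],
    "Title": ["Study B", "Study C"],
})

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'.
merged = pd.merge(geo, argeos, how="outer", on=["Accession", "Title"], indicator=True)

source_counts = merged["_merge"].value_counts().to_dict()
print(source_counts)
```

Rows only match when every key column agrees, so a study whose title or sample count is recorded differently by the two sources would appear twice in the merged table. Inspecting `_merge` alongside `Accession` is one way to spot such cases.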

In [7]:
# The merge leaves two columns, 'Type_x' (GEO) and 'Type_y' (ARGEOS).
# Collapse them into a single 'Type' column, preferring the GEO value.
columns = ['Type']
type_df = pd.DataFrame(index=superset.index, columns=columns)
for i, row in superset.iterrows():
    if str(row['Type_x']) != 'nan':
        type_df.at[i, 'Type'] = str(row['Type_x'])
    else:
        type_df.at[i, 'Type'] = str(row['Type_y'])

In [8]:
superset = pd.concat([superset, type_df], axis=1)

In [9]:
for col in superset.columns:
    if 'Type_' in col:
        del superset[col]
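
The three cells above (build `Type`, concatenate, drop the suffixed columns) could probably be condensed with `Series.combine_first`. This is a sketch on a made-up frame, not a drop-in replacement: unlike the loop above, it leaves rows with no type in either source as missing values instead of the string `'nan'`.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the suffixed columns left by an outer merge (hypothetical values).
merged = pd.DataFrame({
    "Accession": ["GSE100001", "GSE100002", "GSE100003"],
    "Type_x": ["Expression profiling", np.nan, np.nan],
    "Type_y": [np.nan, "Other", np.nan],
})

# Take Type_x where present, otherwise fall back to Type_y.
merged["Type"] = merged["Type_x"].combine_first(merged["Type_y"])
merged = merged.drop(columns=["Type_x", "Type_y"])

print(merged)
```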

In [10]:
import validators 

columns = ['GSE', 'GSE_Supplementary', 'BioProject']
link_df = pd.DataFrame(index=superset.index, columns=columns)

gse_supp_base = 'ftp://ftp.ncbi.nlm.nih.gov/geo/series/'
gse_base = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc="

for i, row in superset.iterrows():
    if str(row['Supplementary Links']) != 'nan':
        links = superset.at[i, 'Supplementary Links'].split(';')
        for link in links: 
            if "GSE" in link:
                gse_accession = link.split('/')[-2]

                link_df.at[i, 'GSE_Supplementary'] = link
                link_df.at[i, 'GSE'] = gse_base + gse_accession
            elif "PRJ" in link:
                link_df.at[i, "BioProject"] = link


    else:
        if "GSE" in row['Link']:
            gse_accession = row['Link'].split('=')[-1]
            link_df.at[i, 'GSE'] = row['Link']
            link_df.at[i, 'BioProject'] = row['BioProject link (NCBI)']
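            # GEO groups series directories by replacing the last three digits
            # of the accession with "nnn" (e.g. GSE12345 -> GSE12nnn).
            # Note: validators.url only checks that the string is a well-formed
            # URL, not that the directory actually exists.
            gse_supp_link = gse_supp_base + gse_accession[:-3] + "nnn/" + gse_accession + "/suppl"
            if validators.url(gse_supp_link):
                link_df.at[i, 'GSE_Supplementary'] = gse_supp_link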
            else: 
                link_df.at[i,'GSE_Supplementary'] = ""

        else:
            link_df.at[i,'GSE'] = ""
            link_df.at[i,'GSE_Supplementary'] = ""
            link_df.at[i,'BioProject'] = ""
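
The supplementary-link construction used in the cell above can be isolated in a small helper, which makes the `nnn` directory convention easier to test. The function name `geo_suppl_url` and the accessions are hypothetical and only serve the example.

```python
GEO_FTP_BASE = "ftp://ftp.ncbi.nlm.nih.gov/geo/series/"


def geo_suppl_url(gse_accession: str) -> str:
    """Build the supplementary-file directory URL for a GEO series accession."""
    # Replace the last three digits with "nnn" to get the grouping directory,
    # e.g. GSE12345 -> GSE12nnn and GSE123456 -> GSE123nnn.
    stub = gse_accession[:-3] + "nnn"
    return f"{GEO_FTP_BASE}{stub}/{gse_accession}/suppl"


print(geo_suppl_url("GSE12345"))
print(geo_suppl_url("GSE123456"))
```

Slicing with `[:-3]` rather than a fixed prefix length keeps the directory correct for accessions with more than five digits.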

In [11]:
superset = pd.concat([superset, link_df], axis=1)

In [12]:
for col in superset.columns:
    if col in ['Link', 
    'Supplementary Links', 
    'Supplementary Types', 
    'BioProject link (NCBI)', 
    'BioProject link (EBI)', 
    'All References', 
    'Platform', 
    'Type of molecule', 
    'Impact factor 2018', 
    'Summary'
    ]:
        del superset[col]

In [14]:
# Look up publication details for each study, using the PubMed ID when
# available and otherwise a DOI parsed from the 'doi or pubmed id' column.
paper_info_columns = ["PMID", "authors", "abstract", "title", "doi", "date_published", "PMC", "journal"]
paper_info_df = pd.DataFrame(columns=paper_info_columns, index=superset.index)

for i, row in superset.iterrows():
    if str(row['PubMed ID']) != "nan":
        if len(str(row['PubMed ID'])) > 0:
            info_dict = get_info_from_medline(row['PubMed ID'])
            for item in info_dict:
                paper_info_df.at[i, item] = info_dict[item]
    elif str(row['doi or pubmed id']) != "nan":
        if "doi" in row['doi or pubmed id']:
            # Keep only the DOI itself, i.e. the text after "doi.org/".
            query = row['doi or pubmed id'].split('doi.org/')[-1]

            if len(str(query)) > 0:
                info_dict = get_info_from_medline(query)
                for item in info_dict:
                    paper_info_df.at[i, item] = info_dict[item]
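
The DOI parsing step in the cell above can be checked in isolation. The helper name `extract_doi` and the example DOI are hypothetical. Unlike the cell above, this sketch returns an empty string when the text does not contain `doi.org/`, instead of passing the whole string on as a query.

```python
def extract_doi(reference: str) -> str:
    """Return the DOI from a 'https://doi.org/...' style string, or '' if absent."""
    if "doi.org/" not in reference:
        return ""
    # Everything after the last "doi.org/" is taken to be the DOI.
    return reference.split("doi.org/")[-1].strip()


print(extract_doi("https://doi.org/10.1000/xyz123"))
print(repr(extract_doi("no identifier here")))
```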

In [15]:
paper_info_df.to_csv("data/ribo_seq_paper_info.csv", index=False)

In [16]:
superset = pd.concat([superset, paper_info_df], axis=1)

for col in superset.columns:
    if col in ['PubMed ID', 'doi or pubmed id', 'All references', 'Journal', 'Contact']:
        del superset[col] 

In [17]:
superset.to_csv("data/ribosome_profiling_superset.tsv", sep="\t", index=False)