# Guide to the monthly dataset_list update

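Usage:
* Set the global variable DATE_FROM to a few days before the last update so that the search misses nothing; the format is "YYYY/MM/DD". For example, if the last update was on October 1 and the notebook is run a month later on November 1, set DATE_FROM="2019/09/28". (For the first run it can be set to "2019/10/01", because the existing list was created on 2019/10/09.)

* Set OLD_LIST_DIR to the path of the old dataset list.

* Run all the code below. The new dataset list is written to the same directory, with a file name like "pubmed_dataset_list_updated_20191101.csv" (the date is generated by the code).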
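
As a minimal sketch of the first step, the value of DATE_FROM could be derived from the last update date instead of being typed by hand. The helper name `date_from_for` and the three-day margin are assumptions for illustration and are not part of the notebook.

```python
from datetime import date, timedelta

def date_from_for(last_update: date, margin_days: int = 3) -> str:
    """Return a "YYYY/MM/DD" string a few days before the last update."""
    return (last_update - timedelta(days=margin_days)).strftime("%Y/%m/%d")

# Last update on 2019-10-01 with a three-day safety margin
DATE_FROM = date_from_for(date(2019, 10, 1))
print(DATE_FROM)  # 2019/09/28
```

Entries found twice because of the overlap are harmless, since the notebook deduplicates by pubmed_id at the end.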

In [1]:
from pymed import PubMed
import os
import pandas as pd
from Bio import Entrez
import time

OLD_LIST_DIR="pubmed_dataset_list_unique_new.csv"

NEW_LIST_DIR="pubmed_dataset_list_updated_"+time.strftime("%Y%m%d", time.localtime())+".csv"

DATE_FROM="2019/10/01"

ALLOWED_JOURNALS = [
    'Cancer Cell',
    'Cell',
    'Cell Stem Cell',
    'Cell Syst',
    'Elife',
    'Immunity',
    'Mol. Cell',
    'Nat Biomed Eng',
    'Nat Commun',
    'Nat. Cell Biol.',
    'Nat. Genet.',
    'Nat. Immunol.',
    'Nat. Med.',
    'Nat. Methods',
    'Nat. Neurosci.',
    'Nature',
    'Neuron',
    'Science',
    'Sci Immunol',
    'Sci Transl Med',
    'Cancer Discov'
]

Search PubMed for the datasets to add, collecting the title, journal name, and PubMed ID of each paper.

In [2]:
"""
get titles and pubmed_id of single-cell related papers from pubmed

filter:
1. Allowed journals
2. Time 
    "The first description of single-cell transcriptome analysis based on a next-generation sequencing platform was published in 2009"
3. Keywords in title/abstract: single cell OR scRNA
"""

class PubMedAPI():
    """
    wrapper of PubMed API,
    based on the pymed package
    """
    def __init__(self):
        self.pubmed = PubMed(tool="MyTool", email="my@email.address")

    def get_multiple_article_info(self,
                                  query_string: str = "",
                                  max_results: int = 20,
                                  ):
        """
        :param query_string: string, query sentence; example:"single-cell[Title]"
        :param max_results: int, maximum results
        """
        def _construct_query_string_by_journal_and_time():
            """
            pre-construct query_string
            filter: allowed journals and time(DATE_FROM-present)
            """
            query_string_by_journal = '"[Journal] OR "'.join(ALLOWED_JOURNALS)
            query_string_by_journal = '("' + query_string_by_journal + '"[Journal]) AND '
            query_string_by_journal = query_string_by_journal + '(("'+DATE_FROM+'"[Date - Publication] : "3000"[Date - Publication])) AND '
            return query_string_by_journal

        # construct query string (add journal filter)
        query_string = _construct_query_string_by_journal_and_time() + query_string
        results = self.pubmed.query(query_string, max_results)

        title = list()
        journal = list()
        pubmed_id = list()

        for article in results:
            try:
                journal.append(article.journal)
            except AttributeError:
                continue
            # strip only a trailing "." so titles ending in "?" or a letter stay intact
            title.append(article.title.rstrip("."))
            # "splitlines()[0]" can deal with occasional format anomaly
            pubmed_id.append(article.pubmed_id.splitlines()[0])

        df = pd.DataFrame(data = {
            "title": title,
            "journal": journal,
            "pubmed_id": pubmed_id
            })

        return df
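
To make the journal and date filter easier to inspect, the sketch below rebuilds the same prefix outside the class and prints the full query for a two-journal list. The function name `build_filter_prefix` is hypothetical; the string layout follows `_construct_query_string_by_journal_and_time` above.

```python
def build_filter_prefix(journals, date_from):
    """Build the '(journals) AND (date range) AND ' prefix used by the notebook."""
    journal_part = '("' + '"[Journal] OR "'.join(journals) + '"[Journal])'
    date_part = ('(("' + date_from + '"[Date - Publication] : '
                 '"3000"[Date - Publication]))')
    return journal_part + " AND " + date_part + " AND "

prefix = build_filter_prefix(["Cell", "Nature"], "2019/10/01")
# Append the keyword filter exactly as the notebook does
print(prefix + "(single-cell[Title/Abstract] OR scRNA[Title/Abstract])")
```

Printing the query this way makes it possible to paste it into the PubMed search box and compare the hit count with the notebook's result.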

In [3]:
add_list=PubMedAPI().get_multiple_article_info("(single-cell[Title/Abstract] OR scRNA[Title/Abstract])",3000)

In [4]:
add_list

Unnamed: 0,title,journal,pubmed_id
0,Single-cell proteomics reveals changes in expr...,eLife,31682227
1,Bayesian Inference of Allelic Inclusion Rates ...,Cell systems,31677971
2,Stress-Induced Metabolic Disorder in Periphera...,Cell,31675497
3,Landscape and Dynamics of Single Immune Cells ...,Cell,31675496
4,In vitro culture of cynomolgus monkey embryos ...,"Science (New York, N.Y.)",31672918
5,Dissecting primate early post-implantation dev...,"Science (New York, N.Y.)",31672917
6,"Chronic Stress Induces Activity, Synaptic, and...",Neuron,31672263
7,Transcriptional Basis of Mouse and Human Dendr...,Cell,31668803
8,The oomycete Lagenisma coscinodisci hijacks ho...,Nature communications,31666506
9,Ontogenic changes in hematopoietic hierarchy d...,Cancer discovery,31662298


Next, look up the GSE accession(s) of each dataset from its pubmed_id.

In [5]:
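def getGSEFromPubmedID(pubmed_id:str="") ->str:
    """Return the GEO series linked to a PubMed ID, joined by "_", or "notAvailable"."""
    try:
        # ask Entrez for GEO DataSets (gds) records linked to this PubMed article
        handle = Entrez.elink(dbfrom='pubmed', id=pubmed_id, db='gds')
        link_results = Entrez.read(handle)
        handle.close()
        # drop the first three characters of each UID and prefix "GSE"
        # (any zero padding is kept, e.g. GSE030114)
        id_list = ["GSE" + item['Id'][3:] for item in link_results[0]["LinkSetDb"][0]["Link"]]
        return "_".join(id_list)
    except Exception:
        # no linked record (empty LinkSetDb) or a failed request
        return "notAvailable"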
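
The slice `item['Id'][3::]` keeps any zero padding in the UID, which would explain an accession such as GSE030114 in the output. A minimal sketch of a stricter conversion follows. It assumes that GEO series UIDs in the `gds` database consist of the digit 2 followed by the series number zero-padded to eight digits; this is consistent with the IDs seen here but has not been checked against NCBI documentation. The helper name `gds_uid_to_gse` is hypothetical.

```python
def gds_uid_to_gse(uid: str):
    """Convert a GDS UID like '200030114' to 'GSE30114'; return None otherwise."""
    # Assumed layout: '2' + series number zero-padded to 8 digits
    if len(uid) == 9 and uid.isdigit() and uid.startswith("2"):
        return "GSE" + str(int(uid[1:]))  # int() drops the zero padding
    return None

print(gds_uid_to_gse("200130430"))  # GSE130430
print(gds_uid_to_gse("200030114"))  # GSE30114
print(gds_uid_to_gse("100000570"))  # None (not a series UID under this assumption)
```

Returning None for other UIDs lets the caller skip linked records that are not series instead of labelling them as GSE.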

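The block below prints an "Email address is not specified." warning. It can be ignored, or avoided by setting Entrez.email beforehand as the warning itself describes.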

In [6]:
gseID=list()
for pubmed_id in add_list["pubmed_id"]:
    gseID.append(getGSEFromPubmedID(pubmed_id))
add_list["gseID"]=gseID

Email address is not specified.

To make use of NCBI's E-utilities, NCBI requires you to specify your
email address with each request.  As an example, if your email address
is A.N.Other@example.com, you can specify it as follows:
   from Bio import Entrez
   Entrez.email = 'A.N.Other@example.com'
In case of excessive usage of the E-utilities, NCBI will attempt to contact
a user at the email address provided before blocking access to the
E-utilities.


In [30]:
add_list

Unnamed: 0,title,journal,pubmed_id,gseID
0,Single-cell proteomics reveals changes in expr...,eLife,31682227,notAvailable
1,Bayesian Inference of Allelic Inclusion Rates ...,Cell systems,31677971,notAvailable
2,Stress-Induced Metabolic Disorder in Periphera...,Cell,31675497,notAvailable
3,Landscape and Dynamics of Single Immune Cells ...,Cell,31675496,notAvailable
4,In vitro culture of cynomolgus monkey embryos ...,"Science (New York, N.Y.)",31672918,GSE030114
5,Dissecting primate early post-implantation dev...,"Science (New York, N.Y.)",31672917,notAvailable
6,"Chronic Stress Induces Activity, Synaptic, and...",Neuron,31672263,notAvailable
7,Transcriptional Basis of Mouse and Human Dendr...,Cell,31668803,notAvailable
8,The oomycete Lagenisma coscinodisci hijacks ho...,Nature communications,31666506,notAvailable
9,Ontogenic changes in hematopoietic hierarchy d...,Cancer discovery,31662298,notAvailable


In [31]:
old_list = pd.read_csv(OLD_LIST_DIR,  index_col=0)

In [32]:
old_list

X1,title,journal,pubmed_id,gseID
0,A Pooled Single-Cell Genetic Screen Identifies...,Nature genetics,31477929,notAvailable
1,Heterogeneity Of Human Bone Marrow And Blood N...,Nature communications,31477722,GSE130430
2,Single-Cell Transcriptomics Reveals Multi-Step...,Nature communications,31477698,GSE122743
4,Complex Oscillatory Waves Emerging From Cortic...,Cell stem cell,31474560,notAvailable
6,Single-Cell Analysis Of Crohn'S Disease Lesion...,Cell,31474370,notAvailable
7,Data Denoising With Transfer Learning In Singl...,Nature methods,31471617,notAvailable
8,Position Β57 Of I-,Science immunology,31471352,notAvailable
12,Identification Of Somatic Mutations In Single ...,Nature communications,31467286,notAvailable
15,Liquid Biopsy-Based Single-Cell Metabolic Phen...,Nature communications,31451693,notAvailable
18,Single-Cell Profiling Guided Combinatorial Imm...,Nature communications,31444334,GSE122336


Filter out GSE accessions that are already in use; if every GSE accession of a paper is already in use, drop the paper entirely.

In [33]:
used_gseID=set()
for each in old_list["gseID"]:
    used_gseID=used_gseID.union(set(each.split("_")))
used_gseID.discard("notAvailable")

add_gse=add_list["gseID"]
processed_add_gse=list()
for each in add_gse:
    if each=="notAvailable":
        processed_add_gse.append(each)
    else:
        processed_each=[i for i in each.split("_") if i not in used_gseID]
        used_gseID=used_gseID.union(set(processed_each))  # update used_gseID
        processed_add_gse.append("_".join(processed_each))

add_list["gseID"]=processed_add_gse
add_list=add_list[add_list["gseID"]!=""]
add_list

Unnamed: 0,title,journal,pubmed_id,gseID
0,Single-cell proteomics reveals changes in expr...,eLife,31682227,notAvailable
1,Bayesian Inference of Allelic Inclusion Rates ...,Cell systems,31677971,notAvailable
2,Stress-Induced Metabolic Disorder in Periphera...,Cell,31675497,notAvailable
3,Landscape and Dynamics of Single Immune Cells ...,Cell,31675496,notAvailable
4,In vitro culture of cynomolgus monkey embryos ...,"Science (New York, N.Y.)",31672918,GSE030114
5,Dissecting primate early post-implantation dev...,"Science (New York, N.Y.)",31672917,notAvailable
6,"Chronic Stress Induces Activity, Synaptic, and...",Neuron,31672263,notAvailable
7,Transcriptional Basis of Mouse and Human Dendr...,Cell,31668803,notAvailable
8,The oomycete Lagenisma coscinodisci hijacks ho...,Nature communications,31666506,notAvailable
9,Ontogenic changes in hematopoietic hierarchy d...,Cancer discovery,31662298,notAvailable
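
The filtering rule can be checked on toy data without contacting PubMed. The sketch below is a plain-Python restatement of the cell above; the function name `filter_used_gse` and the sample accessions are made up for illustration.

```python
def filter_used_gse(new_ids, used):
    """Drop already-used GSE accessions from each '_'-joined entry.

    An entry becomes "" when every accession in it was already used;
    "notAvailable" entries pass through unchanged.
    """
    used = set(used)
    result = []
    for entry in new_ids:
        if entry == "notAvailable":
            result.append(entry)
            continue
        kept = [gse for gse in entry.split("_") if gse not in used]
        used.update(kept)  # later entries must not reuse these either
        result.append("_".join(kept))
    return result

filtered = filter_used_gse(
    ["GSE1_GSE2", "notAvailable", "GSE2", "GSE3_GSE1"],
    used={"GSE1"},
)
print(filtered)  # ['GSE2', 'notAvailable', '', 'GSE3']
```

Entries that come back as "" correspond to the rows the notebook drops with `add_list["gseID"]!=""`.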


Next, merge the datasets to add with the old dataset list and deduplicate by pubmed id to obtain the new dataset list.

In [34]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
new_list=pd.concat([old_list, add_list])
new_list

Unnamed: 0,title,journal,pubmed_id,gseID
0,A Pooled Single-Cell Genetic Screen Identifies...,Nature genetics,31477929,notAvailable
1,Heterogeneity Of Human Bone Marrow And Blood N...,Nature communications,31477722,GSE130430
2,Single-Cell Transcriptomics Reveals Multi-Step...,Nature communications,31477698,GSE122743
4,Complex Oscillatory Waves Emerging From Cortic...,Cell stem cell,31474560,notAvailable
6,Single-Cell Analysis Of Crohn'S Disease Lesion...,Cell,31474370,notAvailable
7,Data Denoising With Transfer Learning In Singl...,Nature methods,31471617,notAvailable
8,Position Β57 Of I-,Science immunology,31471352,notAvailable
12,Identification Of Somatic Mutations In Single ...,Nature communications,31467286,notAvailable
15,Liquid Biopsy-Based Single-Cell Metabolic Phen...,Nature communications,31451693,notAvailable
18,Single-Cell Profiling Guided Combinatorial Imm...,Nature communications,31444334,GSE122336


In [35]:
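# compare IDs as strings: the CSV typically yields integers, the search yields strings
new_list["pubmed_id"]=new_list["pubmed_id"].astype(str)
new_list.drop_duplicates(subset="pubmed_id", keep='first', inplace=True)
new_list.index=range(0,len(new_list))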
new_list

Unnamed: 0,title,journal,pubmed_id,gseID
0,A Pooled Single-Cell Genetic Screen Identifies...,Nature genetics,31477929,notAvailable
1,Heterogeneity Of Human Bone Marrow And Blood N...,Nature communications,31477722,GSE130430
2,Single-Cell Transcriptomics Reveals Multi-Step...,Nature communications,31477698,GSE122743
3,Complex Oscillatory Waves Emerging From Cortic...,Cell stem cell,31474560,notAvailable
4,Single-Cell Analysis Of Crohn'S Disease Lesion...,Cell,31474370,notAvailable
5,Data Denoising With Transfer Learning In Singl...,Nature methods,31471617,notAvailable
6,Position Β57 Of I-,Science immunology,31471352,notAvailable
7,Identification Of Somatic Mutations In Single ...,Nature communications,31467286,notAvailable
8,Liquid Biopsy-Based Single-Cell Metabolic Phen...,Nature communications,31451693,notAvailable
9,Single-Cell Profiling Guided Combinatorial Imm...,Nature communications,31444334,GSE122336
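
One point worth checking before deduplication: `pubmed_id` values read from the CSV are typically parsed as integers, while those built from `article.pubmed_id` are strings, and an integer never compares equal to a string. The plain-Python sketch below illustrates normalizing the key first; the function name `dedupe_by_pubmed_id` and the sample rows are hypothetical.

```python
def dedupe_by_pubmed_id(rows):
    """Keep the first row for each pubmed_id, comparing IDs as strings."""
    seen = set()
    unique = []
    for row in rows:
        key = str(row["pubmed_id"]).strip()
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

rows = [
    {"pubmed_id": 31477929, "gseID": "notAvailable"},    # as read from the CSV
    {"pubmed_id": "31477929", "gseID": "notAvailable"},  # as returned by the search
    {"pubmed_id": "31682227", "gseID": "notAvailable"},
]
print(len(dedupe_by_pubmed_id(rows)))  # 2
```

Without the `str(...)` normalization the first two rows would both survive, because `31477929 == "31477929"` is False in Python.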


In [36]:
# header=True writes the column names; the old string value only worked because it was truthy
new_list.to_csv(NEW_LIST_DIR, header=True, sep=',')