### Goal 01: randomly sample 1000 articles from the AD endnote library. (1000 out of 21865 articles)
### Goal 02: turn articles from the AD library in 2020 and 2019 into two files respectively so that the NIO annotator can process them.
### Goal 02: split the 1000 articles into batches so that the NIO and PTC annotators can process them.
### Goal 03: for the PTC annotator, separate the articles that have pmids from those that don't. For the articles with pmids, write the pmids into the ptc/PmidCode.Python/input_pmid folder, one pmid per line. <strike>For the articles without pmids, add the articles in the xxx.PubTator format to the folder ptc/nonPmidCode.Python/input. Each PubTator document should be limited to 100 abstracts for processing reasons, so batches might need to be created.</strike> For the articles without pmids, look up the pmids by title with esearch, then process these articles the same way as those with pmids.
- process the retrieved results
- some titles are duplicated; they need special handling if the frequency of the uids returned for the articles is used
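
The title lookup described above could be sketched roughly as follows. This is only an illustration, not the notebook's actual lookup code: `title_query`, `pick_pmid` and `find_pmid` are hypothetical helper names, the `[Title]` field restriction and the "exactly one hit" rule are assumptions, and the e-mail address is a placeholder. Only `Entrez.esearch` and `Entrez.read` come from Biopython.

```python
def title_query(title):
    """Build a PubMed query restricted to the title field (assumed query style)."""
    cleaned = " ".join(title.replace('"', " ").split())
    return f'"{cleaned}"[Title]'


def pick_pmid(id_list):
    """Keep a PMID only when the search is unambiguous (exactly one distinct hit)."""
    unique = sorted(set(id_list))
    return unique[0] if len(unique) == 1 else None


def find_pmid(title, email="you@example.org"):
    """Query PubMed for one title; needs Biopython and network access."""
    from Bio import Entrez  # imported here so the pure helpers work without Biopython

    Entrez.email = email  # placeholder address; NCBI asks callers to identify themselves
    handle = Entrez.esearch(db="pubmed", term=title_query(title), retmax=5)
    record = Entrez.read(handle)
    handle.close()
    return pick_pmid(record["IdList"])
```

Articles for which `find_pmid` returns `None` would then be the ones to exclude, in line with Goal 04.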

### Goal 04: the pre-processing for the articles should be the same for both NIO and PTC annotators. 
- The number of articles put into the NIO and PTC annotators should be the same  
- Exclude the articles without pmids: for the articles without pmids, find the pmids with the titles by esearch. If no pmid is found, exclude these articles.
    - link ccc to nlm; the ccc and nlm link file is missing for the articles before 2006
    - based on the year distributions, the articles before 2006 can't be ignored, create the ccc and nlm link file for articles before 2006
    - among the 4736 articles before 2006, 1035 don't have a database id number. These have to be ignored like the ISI articles, because we can't retrieve the PTC annotation for them.
    - no need to create the ccc and nlm link file for the articles before 2006, because the distribution of database is defaultdict(<class 'int'>, {'ISI': 14, 'NLM': 3680}). The ISI articles only account for a small proportion.
    - combine the ccc and nlm link files in two periods of time into one file 
- Save all metadata in one json file. Save metadata for each article in the format of:    
[{"authors": ["Zhang, X., et al."], "year": 2021, "date": "Jun 7", "pmcid": null, "database provider": "NLM", "accession number": 33822840, "title": "An APP ectodomain mutation outside of the Aβ domain promotes Aβ production in vitro and deposition in vivo.", "abstract": "Familial Alzheimer's disease (FAD)-linked mutations in the APP gene..."}, {...}]  
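
A minimal sketch of writing and re-reading such a metadata file is shown below. The helper names `save_metadata` and `load_metadata` and the file name `metadata.json` are assumptions made for the example; the record reuses the sample above (its abstract is truncated there too). Python's `None` is written as JSON `null`.

```python
import json
import os
import tempfile

record = {
    "authors": ["Zhang, X., et al."],
    "year": 2021,
    "date": "Jun 7",
    "pmcid": None,  # becomes null in the JSON file
    "database provider": "NLM",
    "accession number": 33822840,
    "title": "An APP ectodomain mutation outside of the Aβ domain promotes "
             "Aβ production in vitro and deposition in vivo.",
    "abstract": "Familial Alzheimer's disease (FAD)-linked mutations in the APP gene...",
}


def save_metadata(records, file_path):
    """Write all article records into one JSON file, keeping non-ASCII text readable."""
    with open(file_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


def load_metadata(file_path):
    """Read the list of article records back."""
    with open(file_path, encoding="utf-8") as f:
        return json.load(f)


# round trip through a temporary folder (hypothetical file name)
with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "metadata.json")
    save_metadata([record], target)
    loaded = load_metadata(target)
```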

### Goal 05: turn the articles into the input formats of two annotators
- turn into the input format of the NIO annotator  
- turn into the input format of the PTC annotator
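
For the PTC side, the conversion could look like the sketch below. It follows the `id|t|title` / `id|a|abstract` layout used in the commented-out lines of `article_to_txt` and the 100-abstract limit mentioned in Goal 03; the function names are hypothetical, and the NIO input format is not covered because the notebook does not spell it out here.

```python
def to_pubtator(article):
    """Render one article dict as a PubTator title/abstract block."""
    sid = article["sourceid"]
    return f"{sid}|t|{article['title']}\n{sid}|a|{article['abstract']}\n\n"


def batch_pubtator(articles, batch_size=100):
    """Yield PubTator documents that hold at most batch_size abstracts each."""
    for start in range(0, len(articles), batch_size):
        chunk = articles[start:start + batch_size]
        yield "".join(to_pubtator(a) for a in chunk)


def pmid_lines(articles):
    """One pmid per line for the input_pmid folder; only NLM records carry pmids."""
    return "\n".join(a["sourceid"] for a in articles if a["sourcedb"] == "NLM")


# toy articles (made-up ids and text) in the shape returned by article_to_txt
toy_articles = [
    {"sourcedb": "NLM", "sourceid": str(n), "title": f"t{n}", "abstract": f"a{n}"}
    for n in range(250)
]
toy_batches = list(batch_pubtator(toy_articles))
```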


In [1]:
path = "/Users/yidesdo21/Projects/inputs/articles/06_time_slicing/raw/"
nio_path = "/Users/yidesdo21/Projects/inputs/articles/06_time_slicing/"  # each single raw articles
pmid_path = "/Users/yidesdo21/Projects/ptc/PmidCode.Python/input_pmid/"
non_pmid_path = "/Users/yidesdo21/Projects/ptc/nonPmidCode.Python/input/"
output_path = "/Users/yidesdo21/Projects/outputs/12_time_slicing/"
pmid_link_path = "/Users/yidesdo21/Projects/inputs/articles/06_time_slicing/link_ccc_to_nlm/"
pre_cutoff = "2006_to_2015.txt"
post_cutoff = "2016_to_2020.txt"
pre_link_file = "2006_to_2015_pmid_link.txt"
post_link_file = "2016_to_2020_pmid_link.txt"
all_name = "all.txt"
name_2020 = "2020.txt"   # the third input to article_to_txt is post_link_file 
name_2019 = "2019.txt"   # the third input to article_to_txt is post_link_file 

In [2]:
from itertools import groupby
import random
import os
from collections import defaultdict, Counter
from Bio import Entrez
import json
import pandas as pd
import altair as alt
alt.data_transformers.disable_max_rows()
alt.data_transformers.enable('data_server')
alt.renderers.enable('default')

RendererRegistry.enable('default')

In [3]:
## change these two variables when setting pre-cutoff dataset and the post-cutoff dataset
# raw_file = pre_cutoff
# article_to_txt function

In [4]:
raw_file = all_name   # change if need another year

with open(path+raw_file) as f:
    contents = f.read().replace("\t", "").splitlines()

In [5]:
# some articles don't have abstracts. 
# Using % to separate the content will have issues.
len(contents)

78676

In [6]:
# contents[:12]

In [7]:
# ~1000 articles in 2020 and 2019 respectively 
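# split the lines into runs of non-empty lines and runs of blank lines,
# then concatenate each non-empty run (metadata line + abstract) with the blank run that follows it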
i = (list(g) for _, g in groupby(contents, key=''.__ne__))
sepa_content = [a + b for a, b in zip(i, i)]
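
To see what the `groupby` pairing above produces, here is a small sketch on made-up lines (the metadata strings are hypothetical, not taken from the library). `split_records` is just the two lines above wrapped in a function for the demonstration.

```python
from itertools import groupby

# toy export: blank lines separate the records; the second record has no abstract
toy_lines = [
    "NLM;111;title one;Author, A., et al.;2020;Jan;",
    "abstract one",
    "",
    "NLM;222;title two;Author, B., et al.;2019;",
    "",
    "NLM;333;title three;Author, C., et al.;2019;Feb;",
    "abstract three",
    "",
]


def split_records(lines):
    """Pair each run of non-empty lines with the blank run that follows it."""
    runs = (list(g) for _, g in groupby(lines, key="".__ne__))
    return [text + blank for text, blank in zip(runs, runs)]


records = split_records(toy_lines)
# 3 = metadata + abstract + "", 2 = the abstract is missing
lengths = [len(r) for r in records]

# without a trailing blank line the last record has no partner and zip drops it
without_trailing_blank = split_records(toy_lines[:-1])
```

This is why the later cells treat `len(record) != 3` as "no abstract". It also suggests checking that the raw file ends with a blank line, since otherwise the final article would be silently lost.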

In [8]:
len(sepa_content)

28085

In [9]:
sepa_content[0]

['\ufeffCCC:000612215200005;Protective effect of Terminalia chebula Retz. extract against A beta aggregation and A beta-induced toxicity in Caenorhabditis elegans;Zhao, L. H., et al.;2021;Mar;',
 'Ethnopharmacological relevance: Terminalia chebula Retz. (T.chebula) is an important medicinal plant in Tibetan medicine and Ayurveda. T.chebula is known as the "King of Tibetan Medicine", due to its widespread clinical pharmacological activity such as anti-inflammatory, antioxidative, antidiabetic as well as anticancer in lots of in vivo and in vitro models. In this study, we use transgenic and/or RNAi Caenorhabditis elegans (C.elegans) model to simulation the AD pathological features induced by A beta, to detect the effect of TWE on improving A beta-induced toxicity and the corresponding molecular mechanism. Aim of study: The study aimed to tested the activities and its possible mechanism of T.chebula to against A beta 1-42 induced toxicity and A beta 1-42 aggregation. Materials and methods

### the following data visualization is moved to raw_data_and_evaluation.ipynb

In [9]:
##
# exclude the articles that don't have the abstracts
documents = list()
no_abs = list()
cnt = 0

for i in sepa_content:
    if len(i) != 3:
        cnt += 1
        no_abs.append(i)
    else:
        documents.append(i)
        
print(cnt)
print(len(documents))

6221
21864


In [10]:
# check the articles without abstracts -- the year distributions
no_abs_doc = list()
years = list()

for d in no_abs:
    metadata = d[0].split(";")
#     print(metadata)
    
    if len(metadata) < 3:  # the metadata is in another format, ignore them for now
#         ignore.append(document)
        continue
    
    year = metadata[-3]
    
    if not year.isdigit():  # the metadata is in another format, the year is not the third last item of the metadata
        year = metadata[-2]
        if not year.isdigit():  # the metadata is in another format, ignore the ~20 for now
#             ignore.append(document)
            continue
    
    no_abs_doc.append(d)
    years.append(year)
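
The year lookup in this cell is repeated in the later cell that filters `documents`, so it could be factored into one helper. The sketch below mirrors the same rule (third-last field first, then second-last); `extract_year` is a hypothetical name and is not used elsewhere in the notebook.

```python
def extract_year(metadata_line):
    """Return the year string from a ';'-separated metadata line, or None.

    The year is normally the third-last field ("...;2021;Mar;") and the
    second-last one when the month is missing ("...;2015;").
    """
    parts = metadata_line.split(";")
    if len(parts) < 3:  # the metadata is in another format
        return None
    for candidate in (parts[-3], parts[-2]):
        if candidate.isdigit():
            return candidate
    return None
```

With such a helper, both loops would reduce to `year = extract_year(d[0])` followed by a `None` check.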

In [11]:
years_df = pd.DataFrame(years)
years_df.columns = ["year"]

In [12]:
alt.Chart(years_df).mark_bar().encode(
    alt.X("year:N"),
    y='count()',
    tooltip=['year',"count()"]
)

In [37]:
# years

In [31]:
# no_abs

In [11]:
6221/21864

0.28453165020124405

In [27]:
# check the year distribution for the articles
years = list()
before_2006 = list()   # this list is created to create the ccc and nlm link for the files before 2006
all_doc = list()   # this list is created as a filtering of all documents
year_2007 = list()
ignore = list()

for document in documents:
    metadata = document[0].split(";")
    if len(metadata) < 3:  # the metadata is in another format, ignore them for now
        ignore.append(document)
        continue
        
    year = metadata[-3]
    
    if not year.isdigit():  # the metadata is in another format, the year is not the third last item of the metadata
        year = metadata[-2]
        if not year.isdigit():  # the metadata is in another format, ignore the ~20 for now
            ignore.append(document)
            continue
    
    if int(year) == 2007:
        year_2007.append(document)
    
    if int(year) < 2006:
        before_2006.append(document)
    
    all_doc.append(document)
    years.append(year)


In [28]:
len(ignore)

24

In [24]:
len(year_2007)

291

In [13]:
len(all_doc)

21840

In [14]:
len(before_2006)

4736

In [15]:
years_df = pd.DataFrame(years)
years_df.columns = ["year"]

In [16]:
years_df

| | year |
|---|---|
| 0 | 2021 |
| 1 | 2021 |
| 2 | 2021 |
| 3 | 2021 |
| 4 | 2021 |
| ... | ... |
| 21835 | 1987 |
| 21836 | 1985 |
| 21837 | 1985 |
| 21838 | 1982 |


In [17]:
alt.Chart(years_df).mark_bar().encode(
    alt.X("year:N"),
    y='count()',
    tooltip=['year',"count()"]
)

#### use the following lines if article sampling is needed

In [11]:
# sample 1000 articles from 21865 articles
# because some articles don't have ids, they will be excluded in the next stage
# so we sample more than 1000 articles here
random.seed(10)
sampled_list = random.sample(documents, 1200)

In [50]:
# sampled_list[:4]

#### create a function to link the "CCC" accession numbers with the "NLM" accession numbers

In [17]:
# 1. if #("CCC" and "NLM" ids) == 1: link title to "CCC", dump
# 2. if #("CCC" and "NLM" ids) == 2: 
#    2.1 one "CCC" and one "NLM": link title to "NLM"
#    2.2 two different "CCC": link title to the first "CCC", articles will be excluded anyway, dump
#    2.3 two same "CCC": link title to the first "CCC", dump
# 3. if #("CCC" and "NLM" ids) == 3:
#    3.1 two same "CCC" and one "NLM": link title to "NLM"
#    3.2 two different "CCC" and one "NLM": link title to the first "CCC", dump
#    3.3 one "CCC" and two different "NLM": link title to the first "CCC", dump
# 4. if #("CCC" and "NLM" ids) > 3: dump


# link articles starting with link "CCC" or "ISI" with "NLM"
def link_ccc_to_nlm(path, link_file1, link_file2, multiple = False):
    """input -- two link files, each holding {"title1": ["id1","id2",...], "title2": ["id...",...], ...}
       output -- a dictionary that links "CCC:..." to "NLM:..."
    multiple: whether to keep the titles that were returned with more than
    one pmid. Default is False (the parameter is not used in the body yet)"""

    # combine the two files that link "CCC" with "NLM"
    # two articles appear in both files, but this isn't an issue:
    #    one article has more than three linked NLM ids, so it will be dumped anyway
    #    the other article has the same single NLM id in both files
    with open (path+link_file1) as f:
        link_data1 = json.load(f)

    with open (path+link_file2) as f:
        link_data2 = json.load(f)

    link_data = dict()
    link_data.update(link_data1)
    link_data.update(link_data2)
        
        
    ccc_to_nlm = dict() 

    for k,v in link_data.items():
        len_v = len(v)

        if len_v == 2:
            id1, id2 = v[0], v[1]
            if id1.split("_")[0] != id2.split("_")[0]:
                ccc_to_nlm[id1.replace("_", ":")] = id2.replace("_", ":")

        elif len_v == 3: 
            id1, id2, id3 = v[0], v[1], v[2]
            if id1 == id2 and id3.startswith("NLM"):
                ccc_to_nlm[id1.replace("_", ":")] = id3.replace("_", ":")
    
    return ccc_to_nlm
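
The rules implemented by `link_ccc_to_nlm` can be checked on a toy dictionary without reading the link files. The sketch below applies the same two conditions as the function (rules 2.1 and 3.1) to made-up link data. The ids follow the `CCC_...` / `NLM_...` pattern that the function's `split("_")` and `replace("_", ":")` calls imply, and every id and title is hypothetical.

```python
def resolve_link(ids):
    """Return (ccc_id, nlm_id) when the rules allow a link, else None."""
    if len(ids) == 2:
        first, second = ids
        # rule 2.1: the two ids come from different databases
        if first.split("_")[0] != second.split("_")[0]:
            return first.replace("_", ":"), second.replace("_", ":")
    elif len(ids) == 3:
        first, second, third = ids
        # rule 3.1: the same "CCC" id twice, followed by one "NLM" id
        if first == second and third.startswith("NLM"):
            return first.replace("_", ":"), third.replace("_", ":")
    return None  # every other case is dumped


toy_links = {
    "title a": ["CCC_000527779200038", "NLM_11111111"],  # rule 2.1: linked
    "title b": ["CCC_1", "CCC_2"],                       # rule 2.2: dumped
    "title c": ["CCC_3", "CCC_3", "NLM_22222222"],       # rule 3.1: linked
    "title d": ["CCC_4"],                                # rule 1: dumped
    "title e": ["CCC_5", "NLM_6", "NLM_7"],              # rule 3.3: dumped
}
toy_ccc_to_nlm = dict(
    pair for pair in map(resolve_link, toy_links.values()) if pair is not None
)
```

Like the original function, the two-id case assumes the "CCC" id comes first; a pair stored as NLM then CCC would be linked in the wrong direction.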

In [18]:
link_file = link_ccc_to_nlm(pmid_link_path,"2006_to_2015_pmid_link.txt","2016_to_2020_pmid_link.txt")

In [43]:
# link_file

In [19]:
# doc = sampled_list  # for sampling
# doc = documents   # for time slicing


def article_to_txt(doc, path, link_file1, link_file2, link=True):  ## link: whether to link "CCC" or "ISI" with "NLM"
    """turn the sampled articles into individual txt files as the 
    input for the NIO annotator and the PTC annotator
    the code below is taken from "endnote_to_input.ipynb"
    input -- doc: a list of lists, [["title1","abstract1",""], ["title2", "abstract2", ""], [...], ...]
    input -- path, link_files: inputs to the function <link_ccc_to_nlm>
    output -- articles: a list of dictionaries, 
            [{"sourcedb":..., "sourceid":..., "title": ..., "abstract":..., "text":...}, {...}, ...]"""
    
    article_num = len(doc)  
    articles = list()
    article_names = list()
#     txts = ""   # this is for the PTC annotator
#     txts_list = list()  # PTC suggests to have 100 abstracts in each file
#     txts_cnt = 0
    article_cnt = 0
    revise_cnt = 0
    # build the CCC -> NLM lookup once; an empty dict disables the linking
    link_dict = link_ccc_to_nlm(path, link_file1, link_file2) if link else dict()
    check_ids = set()
#     cnt = 0
    wrong_cnt = 0
    
    ## developing starts
    
    for i in range(article_num):
        art_dict = dict()
        article = doc[i]
        src_title,abstract = article[0].split(";"),article[1]
        len_src = len(src_title)
        
        if src_title[0].startswith("PMC"):   # have full text annotations in PTC
            if src_title[1] == "NLM":  # ['PMC7953340', 'NLM', '33421595', "PET measurement of longitudinal amyloid load identifies the earliest stages of amyloid-beta accumulation during Alzheimer's disease progression in Down syndrome", 'Zammit, M. D., et al.', '2021', 'Mar', '']
                source_db = "NLM"
                source_id,title,author,year,date = src_title[2],src_title[3],src_title[4],src_title[5],src_title[6]
                text = title+". "+abstract  # because of the PTC annotator, can't have space after the dot ?? is it??
            else:  # the rubbish data are ignored
                if len_src == 6 or len_src == 7:
                    source_db = "NLM"
                    source_id,title,author,year,date = src_title[1],src_title[2],src_title[3],src_title[4],src_title[5]
                    text = title+". "+abstract
                else:
                    continue  # skip malformed records instead of reusing the previous article's values

        elif src_title[0] == "NLM":
            source_db = src_title[0]
            source_id,title,author,year,date = src_title[1],src_title[2],src_title[3],src_title[4],src_title[5]
            text = title+". "+abstract

        elif src_title[0].startswith("CCC") or src_title[0].startswith("ISI") or src_title[0]=="\ufeffCCC:000612215200005": 
            ## add codes to link "CCC" or "ISI" with "NLM"
            ## add a parameter to link "CCC" or "ISI" with "NLM" or not
            ## input -- src_title[0], e.g. CCC:000527779200038
            ## output -- xxx:1234455
            ## link_dict is the dictionary that links "CCC:..." to "NLM:..."
            if link == False:
                link_dict = dict()
            if src_title[0] == "\ufeffCCC:000612215200005":
                src_title[0] = "CCC:000612215200005"
            
            t = link_dict.get(src_title[0], src_title[0])
            db_id = t.split(":")
            source_db = db_id[0]
            source_id = db_id[1]
            title,author,year,date = src_title[1],src_title[2],src_title[3],src_title[4]
            text = title+". "+abstract

        
        elif src_title[0].isnumeric():
            if len_src == 5 or len_src == 6:
                source_db = "NLM"   # the numbers are PMIDs. Like the articles starting with "NLM"
                source_id = src_title[0]
                if source_id == "0011140685":   # this is an error 
                    source_id = "11140685"
                title,author,year,date = src_title[1],src_title[2],src_title[3],src_title[4]
                text = title+". "+abstract
                
            elif len_src == 7:
                if src_title[1].isnumeric():
                    source_db = "NLM" # the first number is PMCID, the second number is PMID. Like the articles starting with "PMC".
                    source_id = src_title[1]
                    title,author,year,date = src_title[2],src_title[3],src_title[4],src_title[5]
                    text = src_title[2]+". "+abstract
                else:
                    source_db = "NLM" 
                    source_id = src_title[0]
                    title = src_title[1]+". "+src_title[2]
                    author,year,date = src_title[3],src_title[4],src_title[5]
                    text = title+". "+abstract    # might run into issues because there are dots inside the titles   

            else:
                continue  # unexpected number of fields; skip instead of reusing the previous article's values

        else:  # the articles without database and id numbers are excluded
#             cnt += 1
#             print(src_title)
            continue
        
        # sanity check, some classifications are wrong
        if not year.isdigit():
            print("db:",source_db)
            print("id:",source_id)
            print("title:",title)
            print("author:",author)
            print("year:",year)
            print("date:",date)
            print(article)
            print(src_title)
            wrong_cnt += 1
            for token in src_title:   # fall back to the first-found 4-digit field; the last match wins
                if len(token) == 4 and token.isdigit():
                    year = token
            print("---------")
 
        if not year.isdigit():
            revise_cnt += 1
        
#         if source_id in ["32654262","4862319","4848473"]:
#             print(article)
#             print(src_title)
#             print("---------")
            
        
        art_dict["sourcedb"] = source_db
        art_dict["sourceid"] = source_id
        art_dict["title"] = title.lower()
        art_dict["author"] = author
        art_dict["year"] = year
        art_dict["date"] = date
        art_dict["abstract"] = abstract.lower()
        art_dict["text"] = text.lower()     # title+abstract


        # remove duplications in the articles
        check_id = source_db+"_"+source_id
        if check_id not in check_ids:    # avoid same id but different titles/abstracts, or same id but similar titles/abstract, 
            check_ids.add(check_id)          # we can't distinguish them for now 
            
            if art_dict not in articles:   # avoid duplications, exact the same title, id, and abstract
                article_names.append(source_db+"_"+source_id)
                articles.append(art_dict) 
                
        ## developing ends
        
#         txts += source_id+"|t|"+title+"\n"+source_id+"|a|"+abstract+"\n"+"\n"

#         txts_cnt += 1
#         if txts_cnt >= 100:
#             txts_list.append(txts)
#             txts = ""
#             txts_cnt = 0

#         article_cnt += 1
#         if article_cnt >= 1000:
#             break
#     print(cnt)
    print(wrong_cnt)
    print(revise_cnt)
    return articles


  

In [20]:
articles = article_to_txt(doc=all_doc, path=pmid_link_path, 
                          link_file1="2006_to_2015_pmid_link.txt",
                          link_file2="2016_to_2020_pmid_link.txt", link=True)

db: CCC
id: 000633493900004
title: Bacterial sepsis increases hippocampal fibrillar amyloid plaque load and neuroinflammation in a mouse model of Alzheimer & rsquo
author: s disease
year: Basak, J. M., et al.
date: 2021
['CCC:000633493900004;Bacterial sepsis increases hippocampal fibrillar amyloid plaque load and neuroinflammation in a mouse model of Alzheimer & rsquo;s disease;Basak, J. M., et al.;2021;May;', 'Background: Sepsis, a leading cause for intensive care unit admissions, causes both an acute encephalopathy and chronic brain dysfunction in survivors. A history of sepsis is also a risk factor for future development of dementia symptoms. Similar neuropathologic changes are associated with the cognitive decline of sepsis and Alzheimer?s disease (AD), including neuroinflammation, neuronal death, and synaptic loss. Amyloid plaque pathology is the earliest pathological hallmark of AD, appearing 10 to 20 years prior to cognitive decline, and is present in 30% of people over 65. As s

db: NLM
id: 4659940
title: 26549211
author: Delta-secretase cleaves amyloid precursor protein and regulates the pathogenesis in Alzheimer's disease
year: Zhang, Z., et al.
date: 2015
["4659940;26549211;Delta-secretase cleaves amyloid precursor protein and regulates the pathogenesis in Alzheimer's disease;Zhang, Z., et al.;2015;", "The age-dependent deposition of amyloid-beta peptides, derived from amyloid precursor protein (APP), is a neuropathological hallmark of Alzheimer's disease (AD). Despite age being the greatest risk factor for AD, the molecular mechanisms linking ageing to APP processing are unknown. Here we show that asparagine endopeptidase (AEP), a pH-controlled cysteine proteinase, is activated during ageing and mediates APP proteolytic processing. AEP cleaves APP at N373 and N585 residues, selectively influencing the amyloidogenic fragmentation of APP. AEP is activated in normal mice in an age-dependent manner, and is strongly activated in 5XFAD transgenic mouse model and

db: NLM
id: 4515302
title: 26246868
author: The Effect of Age on Osteogenic and Adipogenic Differentiation Potential of Human Adipose Derived Stromal Stem Cells (hASCs) and the Impact of Stress Factors in the Course of the Differentiation Process
year: Kornicka, K., et al.
date: 2015
['4515302;26246868;The Effect of Age on Osteogenic and Adipogenic Differentiation Potential of Human Adipose Derived Stromal Stem Cells (hASCs) and the Impact of Stress Factors in the Course of the Differentiation Process;Kornicka, K., et al.;2015;', 'Human adipose tissue is a great source of autologous mesenchymal stem cells (hASCs), which are recognized for their vast therapeutic applications. Their ability to self-renew and differentiate into several lineages makes them a promising tool for cell-based therapies in different types of degenerative diseases. Thus it is crucial to evaluate age-related changes in hASCs, as the elderly are a group that will benefit most from their considerable potential. In t

db: NLM
id: 4070095
title: 24915960
author: The expression of apoptosis inducing factor (AIF) is associated with aging-related cell death in the cortex but not in the hippocampus in the TgCRND8 mouse model of Alzheimer's disease
year: Yu, W., et al.
date: 2014
["4070095;24915960;The expression of apoptosis inducing factor (AIF) is associated with aging-related cell death in the cortex but not in the hippocampus in the TgCRND8 mouse model of Alzheimer's disease;Yu, W., et al.;2014;", "BACKGROUND: Recent evidence has suggested that Alzheimer's disease (AD)-associated neuronal loss may occur via the caspase-independent route of programmed cell death (PCD) in addition to caspase-dependent mechanisms. However, the brain region specificity of caspase-independent PCD in AD-associated neurodegeneration is unknown. We therefore used the transgenic CRND8 (TgCRND8) AD mouse model to explore whether the apoptosis inducing factor (AIF), a key mediator of caspase-independent PCD, contributes to cell

db: NLM
id: 4119225
title: 25072324
author: Cerebrospinal fluid markers including trefoil factor 3 are associated with neurodegeneration in amyloid-positive individuals
year: Paterson, R. W., et al.
date: 2014
['4119225;25072324;Cerebrospinal fluid markers including trefoil factor 3 are associated with neurodegeneration in amyloid-positive individuals;Paterson, R. W., et al.;2014;', "We aimed to identify cerebrospinal fluid (CSF) biomarkers associated with neurodegeneration in individuals with and without CSF evidence of Alzheimer pathology. We investigated 287 Alzheimer's Disease Neuroimaging Initiative (ADNI) subjects (age=74.9+/-6.9; 22/48/30% with Alzheimer's disease/mild cognitive impairment/controls) with CSF multiplex analyte data and serial volumetric MRI. We calculated brain and hippocampal atrophy rates, ventricular expansion and Mini Mental State Examination decline. We used false discovery rate corrected regression analyses to assess associations between CSF variables and a

db: NLM
id: 4090451
title: 25050129
author: Fumanjian, a Classic Chinese Herbal Formula, Can Ameliorate the Impairment of Spatial Learning and Memory through Apoptotic Signaling Pathway in the Hippocampus of Rats with Abeta 1-40 -Induced Alzheimer's Disease
year: Hu, H. Y., et al.
date: 2014
["4090451;25050129;Fumanjian, a Classic Chinese Herbal Formula, Can Ameliorate the Impairment of Spatial Learning and Memory through Apoptotic Signaling Pathway in the Hippocampus of Rats with Abeta 1-40 -Induced Alzheimer's Disease;Hu, H. Y., et al.;2014;", "Alzheimer's disease (AD) is the most common form of dementia and lacks disease-altering treatments. Fumanjian (FMJ), a famous classic Chinese herbal prescription for dementia, was first recorded in the Complete Works of Jingyue during the Ming Dynasty. This study aimed to investigate whether FMJ could prevent cognitive deficit and take neuroprotective effects in Abeta 1-40-induced rat model through apoptotic signaling pathway. AD model was est

db: NLM
id: 4031619
title: 24386896
author: Axonal BACE1 dynamics and targeting in hippocampal neurons: a role for Rab11 GTPase
year: Buggia-Prevot, V., et al.
date: 2014
['4031619;24386896;Axonal BACE1 dynamics and targeting in hippocampal neurons: a role for Rab11 GTPase;Buggia-Prevot, V., et al.;2014;', "BACKGROUND: BACE1 is one of the two enzymes that cleave amyloid precursor protein to generate Alzheimer's disease (AD) beta amyloid peptides. It is widely believed that BACE1 initiates APP processing in endosomes, and in the brain this cleavage is known to occur during axonal transport of APP. In addition, BACE1 accumulates in dystrophic neurites surrounding brain senile plaques in individuals with AD, suggesting that abnormal accumulation of BACE1 at presynaptic terminals contributes to pathogenesis in AD. However, only limited information is available on BACE1 axonal transport and targeting. RESULTS: By visualizing BACE1-YFP dynamics using live imaging, we demonstrate that BACE1 u

db: NLM
id: 3671299
title: 23762130
author: Amyloidosis in Alzheimer's Disease: The Toxicity of Amyloid Beta (A beta ), Mechanisms of Its Accumulation and Implications of Medicinal Plants for Therapy
year: Prasansuklab, A. and T. Tencomnao
date: 2013
["3671299;23762130;Amyloidosis in Alzheimer's Disease: The Toxicity of Amyloid Beta (A beta ), Mechanisms of Its Accumulation and Implications of Medicinal Plants for Therapy;Prasansuklab, A. and T. Tencomnao;2013;", 'Alzheimer\'s disease (AD) is a progressive neurodegenerative disorder that leads to memory deficits and death. While the number of individuals with AD is rising each year due to the longer life expectancy worldwide, current therapy can only somewhat relieve the symptoms of AD. There is no proven medication to cure or prevent the disease, possibly due to a lack of knowledge regarding the molecular mechanisms underlying disease pathogenesis. Most previous studies have accepted the "amyloid hypothesis," in which the neuropathoge

db: NLM
id: 3634454
title: 23485990
author: Effect of Different Phospholipids on alpha-Secretase Activity in the Non-Amyloidogenic Pathway of Alzheimer's Disease
year: Grimm, M. O., et al.
date: 2013
["3634454;23485990;Effect of Different Phospholipids on alpha-Secretase Activity in the Non-Amyloidogenic Pathway of Alzheimer's Disease;Grimm, M. O., et al.;2013;", "Alzheimer's disease (AD) is characterized by extracellular accumulation of amyloid-beta peptide (Abeta), generated by proteolytic processing of the amyloid precursor protein (APP) by beta- and gamma-secretase. Abeta generation is inhibited when the initial ectodomain shedding is caused by alpha-secretase, cleaving APP within the Abeta domain. Therefore, an increase in alpha-secretase activity is an attractive therapeutic target for AD treatment. APP and the APP-cleaving secretases are all transmembrane proteins, thus local membrane lipid composition is proposed to influence APP processing. Although several studies have focuse

db: NLM
id: 3325610
title: 22258516
author: Measurement of altered AbetaPP isoform expression in frontal cortex of patients with Alzheimer's disease by absolute quantification real-time PCR
year: Tharp, W. G., et al.
date: 2012
["3325610;22258516;Measurement of altered AbetaPP isoform expression in frontal cortex of patients with Alzheimer's disease by absolute quantification real-time PCR;Tharp, W. G., et al.;2012;", "Enzymatic cleavage of amyloid-beta protein precursor (AbetaPP) produces amyloid-beta (Abeta) peptides which form the insoluble cortical plaques characteristic of Alzheimer's disease (AD). AbetaPP is post-transcriptionally processed into three major isoforms with differential cellular and tissue expression patterns. Changes in AbetaPP isoform expression may be indicative of disease pathogenesis in AD, but accurately measuring AbetaPP gene isoforms has been difficult to standardize, reproduce, and interpret. In light of this, we developed a set of isoform specific absolute

db: NLM
id: 3302888
title: 22427802
author: Small-animal PET imaging of amyloid-beta plaques with [11C]PiB and its multi-modal validation in an APP/PS1 mouse model of Alzheimer's disease
year: Manook, A., et al.
date: 2012
["3302888;22427802;Small-animal PET imaging of amyloid-beta plaques with [11C]PiB and its multi-modal validation in an APP/PS1 mouse model of Alzheimer's disease;Manook, A., et al.;2012;", "In vivo imaging and quantification of amyloid-beta plaque (Abeta) burden in small-animal models of Alzheimer's disease (AD) is a valuable tool for translational research such as developing specific imaging markers and monitoring new therapy approaches. Methodological constraints such as image resolution of positron emission tomography (PET) and lack of suitable AD models have limited the feasibility of PET in mice. In this study, we evaluated a feasible protocol for PET imaging of Abeta in mouse brain with [(11)C]PiB and specific activities commonly used in human studies. In vivo 

db: NLM
id: 22970285
title: Disturbed Ca2+ homeostasis increases glutaminyl cyclase expression
author:  connecting two early pathogenic events in Alzheimer's disease in vitro
year: De Kimpe, L., et al.
date: 2012
["3436868;22970285;Disturbed Ca2+ homeostasis increases glutaminyl cyclase expression; connecting two early pathogenic events in Alzheimer's disease in vitro;De Kimpe, L., et al.;2012;", "A major neuropathological hallmark of Alzheimer's disease (AD) is the deposition of aggregated beta amyloid (Abeta) peptide in the senile plaques. Abeta is a peptide of 38-43 amino acids and its accumulation and aggregation plays a key role early in the disease. A large fraction of beta amyloid is N-terminally truncated rendering a glutamine that can subsequently be cyclized into pyroglutamate (pE). This makes the peptide more resistant to proteases, more prone to aggregation and increases its neurotoxicity. The enzyme glutaminyl cyclase (QC) catalyzes this conversion of glutamine to pE. In b

db: NLM
id: 3025006
title: 21283692
author: Whole transcriptome sequencing reveals gene expression and splicing differences in brain regions affected by Alzheimer's disease
year: Twine, N. A., et al.
date: 2011
["3025006;21283692;Whole transcriptome sequencing reveals gene expression and splicing differences in brain regions affected by Alzheimer's disease;Twine, N. A., et al.;2011;", "Recent studies strongly indicate that aberrations in the control of gene expression might contribute to the initiation and progression of Alzheimer's disease (AD). In particular, alternative splicing has been suggested to play a role in spontaneous cases of AD. Previous transcriptome profiling of AD models and patient samples using microarrays delivered conflicting results. This study provides, for the first time, transcriptomic analysis for distinct regions of the AD brain using RNA-Seq next-generation sequencing technology. Illumina RNA-Seq analysis was used to survey transcriptome profiles from total 

db: NLM
id: 3079704
title: 21526230
author: Non-esterified fatty acids generate distinct low-molecular weight amyloid-beta (Abeta42) oligomers along pathway different from fibril formation
year: Kumar, A., et al.
date: 2011
['3079704;21526230;Non-esterified fatty acids generate distinct low-molecular weight amyloid-beta (Abeta42) oligomers along pathway different from fibril formation;Kumar, A., et al.;2011;', "Amyloid-beta (Abeta) peptide aggregation is known to play a central role in the etiology of Alzheimer's disease (AD). Among various aggregates, low-molecular weight soluble oligomers of Abeta are increasingly believed to be the primary neurotoxic agents responsible for memory impairment. Anionic interfaces are known to influence the Abeta aggregation process significantly. Here, we report the effects of interfaces formed by medium-chain (C9-C12), saturated non-esterified fatty acids (NEFAs) on Abeta42 aggregation. NEFAs uniquely affected Abeta42 aggregation rates that depended o

db: NLM
id: 20474036
title: Tetrapeptides, as small-sized peptidic inhibitors
author:  synthesis and their inhibitory activity against BACE1
year: Kakizawa, T., et al.
date: 2010
['CCC:000278392200001;Tetrapeptides, as small-sized peptidic inhibitors; synthesis and their inhibitory activity against BACE1;Kakizawa, T., et al.;2010;Jun;', "beta-Site amyloid precursor protein cleaving enzyme 1 (BACE1) is known to be involved in the production of amyloid beta-peptide in Alzheimer's disease and is a major target for current drug design. We previously reported substrate-based peptidomimetics, KMI-compounds as potent BACE1 inhibitors. In this study, we designed and synthesized tetrapeptides as low molecular-sized inhibitors. These exhibited high potency against recombinant BACE1, with the highest IC50 value of 34.6 nm from KMI-927. Copyright (C) 2010 European Peptide Society and John Wiley & Sons, Ltd.", '']
['CCC:000278392200001', 'Tetrapeptides, as small-sized peptidic inhibitors', ' synthe

db: NLM
id: 2529401
title: 18800168
author: Abeta mediated diminution of MTT reduction--an artefact of single cell culture?
year: Ronicke, R., et al.
date: 2008
['2529401;18800168;Abeta mediated diminution of MTT reduction--an artefact of single cell culture?;Ronicke, R., et al.;2008;', "The 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl-tetrazoliumbromide (MTT) reduction assay is a frequently used and easily reproducible method to measure beta-amyloid (Abeta) toxicity in different types of single cell culture. To our knowledge, the influence of Abeta on MTT reduction has never been tested in more complex tissue. Initially, we reproduced the disturbed MTT reduction in neuron and astroglia primary cell cultures from rats as well as in the BV2 microglia cell line, utilizing four different Abeta species, namely freshly dissolved Abeta (25-35), fibrillar Abeta (1-40), oligomeric Abeta (1-42) and oligomeric Abeta (1-40). In contrast to the findings in single cell cultures, none of these Abeta sp

db: NLM
id: 154150
title: 12631385
author: The prolyl isomerase Pin1 in breast development and cancer
year: Wulf, G., et al.
date: 2003
['154150;12631385;The prolyl isomerase Pin1 in breast development and cancer;Wulf, G., et al.;2003;', 'The prolyl isomerase Pin1 specifically isomerizes certain phosphorylated Ser/Thr-Pro bonds and thereby regulates various cellular processes. Pin1 is a target of several oncogenic pathways and is overexpressed in human breast cancer. Its overexpression can lead to upregulation of cyclin D1 and transformation of breast epithelial cells in collaboration with the oncogenic pathways. In contrast, inhibition of Pin1 can suppress the transformation of breast epithelial cells. In addition, Pin1 knockout in mice prevents massive proliferation of breast epithelial cells during pregnancy. Pin1 plays a pivotal role in breast development and may be a promising new anticancer target.', '']
['154150', '12631385', 'The prolyl isomerase Pin1 in breast development and 

db: NLM
id: 0011140686
title: Ab peptide vaccination prevents memory loss in an animal model of Alzheimer's disease [correction in Nature 2001
author: 412:660]
year: Morgan, D., et al.
date: 2000
["0011140686;Ab peptide vaccination prevents memory loss in an animal model of Alzheimer's disease [correction in Nature 2001;412:660];Morgan, D., et al.;2000;", "Vaccinations with amyloid-beta peptide (A beta) can dramatically reduce amyloid deposition in a transgenic mouse model of Alzheimer's disease. To determine if the vaccinations had deleterious or beneficial functional consequences, we tested eight months of A beta vaccination in a different transgenic model for Alzheimer's disease in which mice develop learning deficits as amyloid accumulates. Here we show that vaccination with A beta protects transgenic mice from the learning and age-related memory deficits that normally occur in this mouse model for Alzheimer's disease. During testing for potential deleterious effects of the vaccine

In [40]:
len(articles)

20536

In [41]:
output_path

'/Users/yidesdo21/Projects/outputs/12_time_slicing/'

In [42]:
# save metadata for all articles in one json file
with open(output_path+"metadata/"+"articles.json", "w") as f:
    json.dump(articles, f, indent=4, sort_keys=True)

In [43]:
# articles may contain duplicates; removing them is part of the preprocessing step
# this check should print nothing, because article_to_txt already removes the duplicates
pmid_check = dict()

for article in articles:
    art_db, art_id, art_title, art_abs = article["sourcedb"], article["sourceid"], article["title"], article["abstract"]
    f_name = art_db+"_"+art_id
    
    if f_name not in pmid_check:
        pmid_check[f_name] = article
    
    else:
        print(art_id)
        print(art_title)  # duplicated title
        existed_title = pmid_check.get(f_name).get("title")
        existed_abs = pmid_check.get(f_name).get("abstract")
        print(existed_title)  # original title
        print(art_title == existed_title)
        print("-------")
        print(art_abs)
        print("-------")
        print(existed_abs)
        print(art_abs == existed_abs)
        print("----------------")
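The check above only prints duplicates. A minimal sketch of removing them, assuming the first occurrence of each `sourcedb_sourceid` key should be kept, could look like this. The helper name `dedupe_articles` and the sample records are hypothetical and are not part of the notebook.

```python
def dedupe_articles(articles):
    """Keep the first article seen for each sourcedb_sourceid key."""
    seen = {}        # key -> first article with that key
    duplicates = []  # later articles that reuse an existing key
    for article in articles:
        key = article["sourcedb"] + "_" + article["sourceid"]
        if key in seen:
            duplicates.append(article)
        else:
            seen[key] = article
    return list(seen.values()), duplicates


# hypothetical sample records, for illustration only
sample = [
    {"sourcedb": "NLM", "sourceid": "1", "title": "a"},
    {"sourcedb": "CCC", "sourceid": "1", "title": "b"},
    {"sourcedb": "NLM", "sourceid": "1", "title": "a (dup)"},
]
unique, dups = dedupe_articles(sample)
print(len(unique), len(dups))
```

Returning the duplicates separately keeps the title and abstract comparison from the cell above possible without printing inside the helper.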



In [44]:
articles[:5]

[{'sourcedb': 'CCC',
  'sourceid': '000612215200005',
  'title': 'protective effect of terminalia chebula retz. extract against a beta aggregation and a beta-induced toxicity in caenorhabditis elegans',
  'author': 'Zhao, L. H., et al.',
  'year': '2021',
  'date': 'Mar',
  'abstract': 'ethnopharmacological relevance: terminalia chebula retz. (t.chebula) is an important medicinal plant in tibetan medicine and ayurveda. t.chebula is known as the "king of tibetan medicine", due to its widespread clinical pharmacological activity such as anti-inflammatory, antioxidative, antidiabetic as well as anticancer in lots of in vivo and in vitro models. in this study, we use transgenic and/or rnai caenorhabditis elegans (c.elegans) model to simulation the ad pathological features induced by a beta, to detect the effect of twe on improving a beta-induced toxicity and the corresponding molecular mechanism. aim of study: the study aimed to tested the activities and its possible mechanism of t.chebula

In [42]:
# sampled_list

In [43]:
# article_names

### for the PTC annotator, split the articles with pmids and the rest.
- There are three databases: {'CCC', 'ISI', 'NLM'}. The articles with NLM have pmids, while the rest don't.
- Nearly half of the documents don't need to go through the "processing raw document" api: {'NLM': 449, 'CCC': 526, 'ISI': 25}. This saves a lot of time.

In [45]:
d = defaultdict(int)

for article in articles:
    sourcedb, sourceid = article["sourcedb"], article["sourceid"]
    d[sourcedb] += 1

print(d)

defaultdict(<class 'int'>, {'CCC': 3830, 'NLM': 16452, 'ISI': 254})


In [46]:
16452/(16452+3830+254)

0.8011297234125438

In [47]:
## For the articles with pmids, 
# add the pmid number into the ptc/PmidCode.Python/input_pmid folder. 
# Each pmid number will be a line. Need to split into batches: 1000 per batch works, 1500 doesn't work.

## For the articles without pmids,
# the retrieving process is easy as we have done that above,
# the problem is to set how many articles for each batch as it relates to the retrieving speed
# here we set 25 articles/batch

cnt = 0
pmids = list()
pmids_list = list()
pmids_batch = 1000
pmid_all_cnt = 0
non_pmids = ""
non_pmids_cnt = 0
non_pmids_list = list()
non_pmids_batch = 25
non_pmids_titles = defaultdict(list)


for article in articles:
    sourcedb, sourceid = article["sourcedb"], article["sourceid"]
    if sourcedb == "NLM":
        pmids.append(str(sourceid))
        pmid_all_cnt += 1

#         need to create batches for the articles with pmids too 
#         not sure how many articles per batch is appropriate, after test, 1000 is the upper limit
        cnt += 1
        if cnt >= pmids_batch:
            pmids_list.append(pmids)
            pmids = list()
            cnt = 0

# this is the last batch, which doesn't reach the condition to be added to the list
# skip it when it is empty (the count is an exact multiple of pmids_batch)
if pmids:
    pmids_list.append(pmids)

pmid_check = 0

for i in pmids_list:
    pmid_check += len(i)
    
assert pmid_all_cnt == pmid_check
    
#     else:
#         title, abstract = article["title"], article["abstract"]
#         non_pmids_titles[title].append(sourcedb+"_"+sourceid)

#         non_pmids += sourceid+"|t|"+title+"\n"+sourceid+"|a|"+abstract+"\n"+"\n"
    
#         non_pmids_cnt += 1
        
#         # this code has a bug: it loses the last batch because that batch
#         # never reaches the condition to be added to the whole list
#         if non_pmids_cnt >= non_pmids_batch:
#             non_pmids_list.append(non_pmids)
#             non_pmids = ""
#             non_pmids_cnt = 0
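Both batching loops above depend on remembering to append the final partial batch. A slicing-based sketch avoids that bug by construction. The helper name `make_batches` is hypothetical, and the integer list only stands in for the real pmid strings.

```python
def make_batches(items, batch_size):
    """Split items into consecutive batches of at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # slicing keeps the final partial batch and yields no empty batch
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# stand-in for the 16452 pmids, split with the 1000-per-batch limit found above
batches = make_batches(list(range(16452)), 1000)
print(len(batches), len(batches[-1]))
```

The same helper would work for the non-pmid articles with a batch size of 25.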
    

In [48]:
len(non_pmids_list)  

0

In [22]:
# time to process ~500 articles for raw document processing
# batch for 100 -- 2256/60
# batch for 50 -- 3358/60
# batch for 25 -- 


55.96666666666667

In [123]:
# print(pmids)

#### For the articles with pmids
- pmid list -> ptc annotator -- Retrieving PubTator annotations -> binary output -> process and save as txt file -> create the time-sliced dataset by create_time_sliced_dataset.ipynb

In [49]:
len(pmids_list)

17

In [50]:
pmid_path

'/Users/yidesdo21/Projects/ptc/PmidCode.Python/input_pmid/'

In [67]:
file_name = raw_file.split(".")[0]
batch_cnt = 0

for p in pmids_list:
    batch_cnt += 1
    
    with open(pmid_path+file_name+"_0"+str(batch_cnt)+".pmid", "w") as f:
        for pmid in p:
            f.write("%s\n" % pmid)

In [69]:
output_path

'/Users/yidesdo21/Projects/outputs/12_time_slicing/'

In [70]:
batch_cnt

17

In [71]:
pmid_all_cnt

16452

In [72]:
# check the number of articles is correct
ptc_check_path = "/Users/yidesdo21/Projects/ptc/PmidCode.Python/input_pmid/"
ptc_nums = 0
ptc_set_num = 0

for c in range(batch_cnt):
    ptc_file = ptc_check_path+file_name+"_0"+str(c+1)+".pmid"
    
    with open(ptc_file) as f:
        ptc_lines = f.readlines()  
        ptc_set_num += len(list(set(ptc_lines)))
        ptc_nums += len(ptc_lines)    

assert pmid_all_cnt == ptc_nums
print(ptc_set_num)    # unique pmids summed per batch file; a value below ptc_nums means duplicated pmids within a file

16452


In [73]:
# process the retrieved results
# the retrieved results used to be in binary mode
# updated: the retrieved results are now read in str mode

pmid_contents = list()

for c in range(batch_cnt):
    pmid_file = output_path+"ptc_results/"+file_name+"_0"+str(c+1)+"_pmid.PubTator"

    with open(pmid_file) as f:
        pmid_results = f.read().replace("\t", " ").split("\n")
#         pmid_results = f.read().replace("\\t", " ").split("\\n")

        # group each article with its annotations; records are separated by empty lines ('')
        pmid_groups = (list(g) for _, g in groupby(pmid_results, key=''.__ne__))
        pmid_content = [a + b for a, b in zip(pmid_groups, pmid_groups)]
        
        pmid_contents.extend(pmid_content)
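The `groupby`/`zip` pairing above relies on non-empty and empty groups strictly alternating. As a simpler alternative (not the method used in this notebook), the records can be split on blank lines directly. This is a sketch under the assumption that each record is a `pmid|t|...` line, a `pmid|a|...` line, and optional annotation lines, followed by a blank line. The function name and the sample text are hypothetical. Unlike the cell above, it neither replaces tabs nor keeps the trailing `''` entry in each record.

```python
def split_pubtator_records(text):
    """Split PubTator-format text into one list of lines per article."""
    records = []
    for block in text.strip().split("\n\n"):
        # drop stray empty lines left over from runs of blank lines
        lines = [line for line in block.split("\n") if line]
        if lines:
            records.append(lines)
    return records


# hypothetical two-article sample in PubTator layout
sample_text = (
    "1|t|Title one\n"
    "1|a|Abstract one\n"
    "1\t0\t5\tTitle\tGene\t123\n"
    "\n"
    "2|t|Title two\n"
    "2|a|Abstract two\n"
    "\n"
)
records = split_pubtator_records(sample_text)
record_pmids = [rec[0].split("|", 1)[0] for rec in records]
print(len(records), record_pmids)
```

Collecting the pmid of each record this way also makes it possible to find which requested pmids are missing from the results, since `len(pmid_contents)` can be lower than the number of pmids sent.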

In [74]:
pmid_results

['72777|t|Immune response of Lewis rats to peptide C1 (residues 68-88) of guinea pig and rat myelin basic proteins.',
 '72777|a|Peptide C1 (residues 68-88) from GP and rat BP differ by a single amino acid interchange at residue 79. This residue is serine in GP C1 and threonine in rat C1. GP C1 was encephalitogenic in Le rats at doses as low as 15 ng. Rat C1 was encephalitogenic at doses of 1,500 ng or greater. LNC from rats challenged with 25 X 10(-4) micronmol of GP C1 and 250 X 10(-4) micronmol of rat C1 showed a proliferative response in vitro to both peptides, but in each instance the magnitude of the response was greater to the GP peptide. GP C1 also induced higher levels of circulating antibodies at 25 X 10(-4) micronmol, but the specificity of antibodies produced by the two peptides was the same. These results have been interpreted as indicating that the presence of serine at position 79 in GP C1 results in the stimulation of greater numbers of T cells involved in (a) the induct

In [75]:
len(pmid_contents)

16442

In [76]:
pmid_contents[:10]

[["28174113|t|Involvement of GluN2B subunit containing N-methyl-d-aspartate (NMDA) receptors in mediating the acute and chronic synaptotoxic effects of oligomeric amyloid-beta (Abeta) in murine models of Alzheimer's disease (AD).",
  "28174113|a|To elucidate whether a permanent reduction of the GluN2B subunit affects the pathology of Alzheimer's disease (AD), we cross-bred mice heterozygous for GluN2B receptors in the forebrain (hetGluN2B) with a mouse model for AD carrying a mutated amyloid precursor protein with the Swedish and Arctic mutation (mAPP) resulting in a hetGluN2B/mAPP transgenic. By means of voltage-sensitive dye imaging (VSDI) in the di-synaptic hippocampal pathway and the recording of field excitatory postsynaptic potentials (fEPSPs), hippocampal slices of all genotypes (WT, hetGluN2B, mAPP and hetGluN2B/mAPP, age 9-18 months) were tested for spatiotemporal activity propagation and long-term potentiation (LTP) induction. CA1-LTP induced by high frequency stimulation (HF

In [119]:
# # abandoned: this brings too many conversion problems in the other code
# # save as a txt file as an input to create the time-sliced dataset
# with open(output_path+"ptc_results/"+file_name+'_pmid.txt', 'w') as f:
#     for item in pmid_contents:
#         f.write("{}\n".format(item))

In [None]:
# then we can convert the annotation results to other formats, like json,
# for further processing into Neo4j


#### For the articles without pmids
#### This part is finished. The output file is under pmid_link_path. Don't run. 
- find the pmids with the titles for the articles without pmids
- use esearch

In [121]:
pmid_link_path

'/Users/yidesdo21/Projects/inputs/articles/06_time_slicing/link_ccc_to_nlm/'

In [120]:
# {title:[CCC_accession number]}
# when appending the corresponding PMID for the title, just append to the value of the list
## non_pmids_titles[title].append("pmid1")
# non_pmids_titles


In [105]:
# initialize these before running the esearch over titles
Entrez.email = "yiyuanp1@student.unimelb.edu.au"
titles = list(non_pmids_titles.keys())  # the titles and the id_lists should have the same index
process_cnt = 0
id_lists = list()

In [114]:
# the esearch might not be correct 100% of the time,
# for instance, for the article 'Amyloid-beta(1-42) induced glutamatergic receptor and transporter expression changes in the mouse hippocampus'
# it returns three pmid numbers, one of them is correct.


for t in titles:
    handle = Entrez.esearch(db="pubmed", term=t, retmax=5)
    record = Entrez.read(handle)
    id_list = record["IdList"]

    id_lists.append(id_list)

# run this loop later, because the code might fail here
#     for i in id_list:
#         non_pmids_titles[t].append("NLM_"+i)
    
    handle.close()
    
    process_cnt += 1
    
    if process_cnt % 100 == 0:
        print("Processing %s uids" % process_cnt)
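Because a long run of title lookups can fail partway through, a resumable loop with a pause between requests may help. The sketch below is an assumption-laden illustration, not the notebook's code. `search_titles` and `fake_search` are hypothetical names, and the `search` argument stands for any callable that returns a list of ids for a title, for example a small wrapper around the `Entrez.esearch` and `Entrez.read` calls used above. The default delay reflects NCBI's guideline of at most three requests per second without an API key.

```python
import time


def search_titles(titles, search, done=None, delay=0.34, retries=3):
    """Look up each title, skipping titles that already have a result in `done`."""
    results = dict(done or {})
    for title in titles:
        if results.get(title) is not None:
            continue  # already retrieved in an earlier run
        for attempt in range(1, retries + 1):
            try:
                results[title] = list(search(title))
                break
            except Exception:  # broad on purpose in this sketch
                if attempt == retries:
                    results[title] = None  # mark as failed so a rerun retries it
                else:
                    time.sleep(delay * attempt)
        time.sleep(delay)  # stay under the request rate limit
    return results


# stand-in for the real search: fails once, then answers
calls = {"n": 0}

def fake_search(title):
    calls["n"] += 1
    if calls["n"] == 1:
        raise IOError("temporary failure")
    return ["111"] if title == "title a" else []

found = search_titles(["title a", "title b"], fake_search, delay=0)
print(found)
```

Passing the previous `found` dictionary back in as `done` resumes the run without repeating lookups that already succeeded.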


Processing 4800 uids
Processing 4900 uids
Processing 5000 uids
Processing 5100 uids
Processing 5200 uids
Processing 5300 uids
Processing 5400 uids
Processing 5500 uids
Processing 5600 uids
Processing 5700 uids
Processing 5800 uids
Processing 5900 uids
Processing 6000 uids


In [117]:
for c,t in enumerate(titles):  
    title_ids = id_lists[c]
    for i in title_ids:
        non_pmids_titles[t].append("NLM_"+i)
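Since esearch can return several pmids for one title, it may help to separate titles with exactly one candidate from ambiguous and unmatched ones before reusing the links. The following is a minimal sketch. The function name is hypothetical, the first sample entry is the example from this notebook, and the other two entries are invented for illustration.

```python
def split_by_match_count(title_to_ids):
    """Sort titles into unique, ambiguous and missing pmid matches."""
    unique, ambiguous, missing = {}, {}, []
    for title, ids in title_to_ids.items():
        # only the "NLM_"-prefixed entries are pmid candidates
        pmids = [i[len("NLM_"):] for i in ids if i.startswith("NLM_")]
        if len(pmids) == 1:
            unique[title] = pmids[0]
        elif pmids:
            ambiguous[title] = pmids
        else:
            missing.append(title)
    return unique, ambiguous, missing


links = {
    "Label-Free SERS Strategy for ...": ["CCC_000527779200038", "NLM_32227892"],
    "hypothetical ambiguous title": ["CCC_1", "NLM_10", "NLM_11", "NLM_12"],
    "hypothetical unmatched title": ["ISI_2"],
}
unique, ambiguous, missing = split_by_match_count(links)
print(unique, ambiguous, missing)
```

Ambiguous titles would still need a manual check or a stricter title comparison, and unmatched titles are the ones to exclude.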

In [119]:
# save the pmid links for each title in a file for future use
# {'Label-Free SERS Strategy for ...':
#  ['CCC_000527779200038', 'NLM_32227892']}
save_name = raw_file.split(".")[0]

with open(pmid_link_path+save_name+"_pmid_link.txt", 'w') as json_file:
    json.dump(non_pmids_titles, json_file)

In [107]:
# ptc_path = "/Users/yidesdo21/Projects/ptc/nonPmidCode.Python/input/"

In [24]:
# # for PTC -- nonpmids
# for count, value in enumerate(non_pmids_list):
#     f_name = "random_batch_"+str(count)
#     with open(ptc_path+f_name+".PubTator", "w") as f:
#         f.write(value)
    

### Save results for the NIO annotator

In [128]:
# # create the batches for NIO inputs
# # the IntelliJ can't save 1000 results

# volume = 100
# for i in range(int(len(articles)/volume)):
#     path = nio_path+"articles-txt-format_batch"+str(i)
#     if not os.path.exists(path):
#         os.makedirs(path)

In [36]:
nio_path

'/Users/yidesdo21/Projects/inputs/articles/06_time_slicing/'

In [130]:
# # save the articles to the nio path 
# # use the output from articles directly
# # for NIO
# batch_cnt = 0
# batch_name = 0

# for article in articles:
#     file_name = article["sourcedb"]+"_"+article["sourceid"]
    
#     with open(nio_path+"articles-txt-format_batch"+str(batch_name)+"/"+file_name+".txt", 'w') as f:
#         f.write(article["text"])
    
#     batch_cnt += 1
#     if batch_cnt >= volume:
#         batch_cnt = 0
#         batch_name += 1

In [27]:
articles[0]

{'sourcedb': 'NLM',
 'sourceid': '25862638',
 'title': "Intraneuronal APP and extracellular A beta independently cause dendritic spine pathology in transgenic mouse models of Alzheimer's disease",
 'abstract': "Alzheimer's disease (AD) is thought to be caused by accumulation of amyloid-beta protein (A beta), which is a cleavage product of amyloid precursor protein (APP). Transgenic mice overexpressing APP have been used to recapitulate amyloid-beta pathology. Among them, APP23 and APPswe/PS1deltaE9 (deltaE9) mice are extensively studied. APP23 mice express APP with Swedish mutation and develop amyloid plaques late in their life, while cognitive deficits are observed in young age. In contrast, deltaE9 mice with mutant APP and mutant presenilin-1 develop amyloid plaques early but show typical cognitive deficits in old age. To unveil the reasons for different progressions of cognitive decline in these commonly used mouse models, we analyzed the number and turnover of dendritic spines as i

In [78]:
nio_path

'/Users/yidesdo21/Projects/inputs/articles/06_time_slicing/'

In [79]:
# I force the IntelliJ to output a file that contains all the results
# no need to use batches

## need to fix the bug: need to erase the files in the folder before creating new ones
## otherwise the old files will still be in the folder 
year = raw_file.split(".")[0]
nlm_cnt = 0
ccc_cnt = 0
file_names = list()

for article in articles:
    art_db, art_id = article["sourcedb"], article["sourceid"]
    if art_db == "NLM":
        nlm_cnt += 1
        file_name = art_db+"_"+art_id
        file_names.append(file_name)
        
        
        path = nio_path+year
        os.makedirs(path, exist_ok=True)

        with open(path+"/"+file_name+".txt", 'w') as f:
            f.write(article["text"])
    
    else:
        ccc_cnt += 1

# the number of articles into two annotators should be the same
assert nlm_cnt == pmid_all_cnt
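The bug noted in the cell above (old files staying in the folder) could be handled by clearing the stale `.txt` files before writing new ones. This is a sketch using `pathlib`. The helper name `reset_txt_folder` is hypothetical, and the demo runs in a temporary directory rather than in `nio_path`.

```python
import tempfile
from pathlib import Path


def reset_txt_folder(folder):
    """Create the folder if needed and delete any .txt files already in it."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    removed = 0
    for old_file in folder.glob("*.txt"):
        old_file.unlink()
        removed += 1
    return removed


# demo in a throwaway directory with one stale article file
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "2020"
    target.mkdir()
    (target / "NLM_1.txt").write_text("stale")
    (target / ".DS_Store").write_text("")
    removed = reset_txt_folder(target)
    remaining = sorted(p.name for p in target.iterdir())
print(removed, remaining)
```

Only `.txt` files are deleted, so system files such as `.DS_Store` stay in place and still need to be excluded when counting the folder contents.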

In [34]:
articles[0]

{'sourcedb': 'NLM',
 'sourceid': '25862638',
 'title': "Intraneuronal APP and extracellular A beta independently cause dendritic spine pathology in transgenic mouse models of Alzheimer's disease",
 'abstract': "Alzheimer's disease (AD) is thought to be caused by accumulation of amyloid-beta protein (A beta), which is a cleavage product of amyloid precursor protein (APP). Transgenic mice overexpressing APP have been used to recapitulate amyloid-beta pathology. Among them, APP23 and APPswe/PS1deltaE9 (deltaE9) mice are extensively studied. APP23 mice express APP with Swedish mutation and develop amyloid plaques late in their life, while cognitive deficits are observed in young age. In contrast, deltaE9 mice with mutant APP and mutant presenilin-1 develop amyloid plaques early but show typical cognitive deficits in old age. To unveil the reasons for different progressions of cognitive decline in these commonly used mouse models, we analyzed the number and turnover of dendritic spines as i

In [41]:
nlm_cnt

609

In [42]:
ccc_cnt

203

In [43]:
len(file_names)

609

In [44]:
len(set(file_names))

609

In [88]:
x = Counter(file_names)

In [89]:
x.most_common()   # a count above 1 would indicate a duplicated pmid

[('NLM_25862638', 1),
 ('NLM_25980993', 1),
 ('NLM_25147114', 1),
 ('NLM_26043045', 1),
 ('NLM_24614496', 1),
 ('NLM_25611954', 1),
 ('NLM_25751170', 1),
 ('NLM_26244608', 1),
 ('NLM_25891083', 1),
 ('NLM_25708205', 1),
 ('NLM_26005850', 1),
 ('NLM_26104676', 1),
 ('NLM_26528958', 1),
 ('NLM_26454022', 1),
 ('NLM_26549211', 1),
 ('NLM_26256421', 1),
 ('NLM_26199414', 1),
 ('NLM_26165445', 1),
 ('NLM_26505914', 1),
 ('NLM_25024455', 1),
 ('NLM_25667991', 1),
 ('NLM_26254241', 1),
 ('NLM_24866905', 1),
 ('NLM_26129772', 1),
 ('NLM_26424877', 1),
 ('NLM_25293506', 1),
 ('NLM_26092426', 1),
 ('NLM_25498712', 1),
 ('NLM_25193102', 1),
 ('NLM_25459675', 1),
 ('NLM_26592823', 1),
 ('NLM_25811848', 1),
 ('NLM_26214837', 1),
 ('NLM_26190634', 1),
 ('NLM_25060965', 1),
 ('NLM_25576151', 1),
 ('NLM_25775543', 1),
 ('NLM_25741717', 1),
 ('NLM_25436414', 1),
 ('NLM_26489896', 1),
 ('NLM_26551432', 1),
 ('NLM_25925958', 1),
 ('NLM_25281129', 1),
 ('NLM_25352456', 1),
 ('NLM_26104799', 1),
 ('NLM_259

In [16]:
nio_path

'/Users/yidesdo21/Projects/inputs/articles/06_time_slicing/'

In [47]:
nio_files = os.listdir(nio_path+"2020/") # dir is your directory path
number_files = len(nio_files)

In [48]:
number_files

610

In [49]:
for file in nio_files:
    if not file.endswith(".txt"):
        print(file)

.DS_Store


#### Don't use the following code for the whole set of txts.
- The articles are split into those with pmids and those without pmids

In [109]:
output_path = "/Users/yidesdo21/Projects/ptc/ExampleCode.Python/input/"

In [111]:
# for PTC
for count, value in enumerate(txts_list):
    file_name = "random_batch_"+str(count)
    with open(output_path+file_name+".PubTator", "w") as f:
        f.write(value)
    

In [112]:
3255/60

54.25