
The reason we should not build bioRxiv and medRxiv URLs by doing something like

```
url = "https://www.biorxiv.org/content/" + doi + "v1"
```

is that the version number increases with each revision, and sometimes version 1 remains active after version 2 is uploaded.

For bioRxiv and medRxiv papers, if there is only a PDF, the URL with the suffix ".full" still works but just shows the original page with only the abstract. Oddly, both ".full-text" and ".full" seem to work with medRxiv papers. The ".full-text" URL is what you get when you click on the Full Text hyperlink, but ".full-text" always redirects to ".full".
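A minimal sketch of the alternative: let doi.org resolve the DOI so that the redirect decides the version, then append ".full" for preprint servers. The helper names `full_text_url` and `resolve_doi` and the `timeout` value are assumptions for illustration; the sketch uses the standard library rather than `requests`.

```python
import urllib.request


def full_text_url(resolved_url: str, journal: str) -> str:
    """Append ".full" to an already-resolved bioRxiv/medRxiv URL; leave other URLs unchanged."""
    if "bioRxiv" in journal or "medRxiv" in journal:
        return resolved_url + ".full"
    return resolved_url


def resolve_doi(doi: str, timeout: float = 30.0) -> str:
    """Follow the doi.org redirects and return the final landing URL (needs network access)."""
    with urllib.request.urlopen("https://doi.org/" + doi, timeout=timeout) as response:
        return response.geturl()
```

Keeping the suffix logic separate from the network call makes it easy to check without hitting the server.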

Note that the `doi_pubmed` method can return an invalid article.

We can also try accessing articles this way: "https://pubmed.ncbi.nlm.nih.gov/".


### Redirecting (not used)
The following approach:

```
url_pmc = "https://www.ncbi.nlm.nih.gov/pmc/articles/doi/" + self.doi
```

redirects to the PMC link to the paper. I've found cases where the "doi.org" approach does not get the full text but this does, and cases where the "doi.org" approach gets the full text but this does not. Clearly, we need to run both. This approach is replicated by the `metapub` package, with the advantage that we avoid unnecessary HTML queries when the paper is not on PubMed and this approach would fail anyway.
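As a rough sketch of "running both", the two DOI-based entry points can be generated side by side and scraped in turn. The function name `candidate_urls` and the dictionary keys are hypothetical.

```python
def candidate_urls(doi: str) -> dict:
    """Return both DOI-based entry points so each can be scraped in turn."""
    return {
        "doi_org": "https://doi.org/" + doi,
        "pmc_redirect": "https://www.ncbi.nlm.nih.gov/pmc/articles/doi/" + doi,
    }


urls = candidate_urls("10.14744/nci.2021.99075")
for name, url in urls.items():
    print(name, url)
```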


### PMC full text for open access papers
We may want to check the PMC full-text retrieval system for open access papers. Information on this is [here](https://ftp.ncbi.nlm.nih.gov/pub/pmc/) and [here](https://www.ncbi.nlm.nih.gov/pmc/tools/get-full-text/). One advantage of this approach is that the XML text should be much cleaner and avoid the HTML gobble that happens to match PDB IDs. Another advantage is that source files like supplementary data and images are in the open access subset.

There may be a better approach than simply inserting the PMCID in the link as in the code below, such as using a library for direct queries of the database, like [pymed/article.py](https://github.com/gijswobben/pymed/blob/master/pymed/article.py). But that approach is more complicated and requires writing much more code. Also, the XML text doesn't seem to contain much noise/gobble, so direct queries may not provide much advantage.

Some papers, like [10.1515/cclm-2021-1287](https://doi.org/10.1515/cclm-2021-1287), are technically on PubMed (they have a PMID) but either aren't on PubMed Central, and thus don't have a PMCID, or aren't open access. This paper doesn't even appear in the `metapub` fetch when searched by DOI, but it is found when searched by its PMID.

```
pmc_oa_xml = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=" + self.pmc
```

We should write a checker to see if this approach fails. I think the XML page has a standard error message when a paper does not exist.
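One possible shape for such a checker, as a sketch only. Since the exact error format has not been verified here, it tests for the positive case instead: it assumes a successful response is well-formed XML containing at least one `<article>` element. The function name `has_article` is hypothetical.

```python
import xml.etree.ElementTree as ET


def has_article(xml_text: str) -> bool:
    """Return True if the response parses as XML and contains at least one <article> element."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Not XML at all (for example an HTML error page).
        return False
    # Compare local names so that a namespace prefix on the tag does not matter.
    return any(
        isinstance(elem.tag, str) and elem.tag.rsplit("}", 1)[-1] == "article"
        for elem in root.iter()
    )
```

A response that fails this check would then be treated as "full text not available via this route".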


### Scraping PMIDs and PMCIDs

* [Variable length of PMID](https://libguides.library.arizona.edu/c.php?g=406096&p=2779570#:~:text=PMID,to%20all%20records%20in%20PubMed.) (From 1 to 8 digits.)
* [Fixed length of PMC](https://en.wikipedia.org/wiki/PubMed_Central#:~:text=The%20two%20identifiers%20are%20distinct%20however.%20It%20consists%20of%20%22PMC%22%20followed%20by%20a%20string%20of%20seven%20numbers) (7 digits.)
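Given those lengths, both identifiers can be pulled out of scraped text with regular expressions. This is a sketch under the assumptions that the PMID appears after a `pmid:` label (as in the PubMed search page text used later in this notebook) and that PMCIDs have exactly seven digits; `find_ids` is a hypothetical helper.

```python
import re

# Seven digits after "PMC", not followed by a further digit.
PMCID_PATTERN = re.compile(r"PMC(\d{7})(?!\d)")
# One to eight digits after a "pmid:" label, not followed by a further digit.
PMID_PATTERN = re.compile(r"pmid:\s*(\d{1,8})(?!\d)", re.IGNORECASE)


def find_ids(text: str) -> dict:
    """Return the first PMID and PMCID digits found in `text`, or None for each one missing."""
    pmid_match = PMID_PATTERN.search(text)
    pmc_match = PMCID_PATTERN.search(text)
    return {
        "pmid": pmid_match.group(1) if pmid_match else None,
        "pmc": pmc_match.group(1) if pmc_match else None,
    }
```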

In [1]:
import get_poss_pdbs as gp
import get_doi_from_pubmed as gd
import pandas as pd
from metapub import PubMedFetcher





In [63]:
fields = ["title", "authors", "date", "abstract", "journal", "doi"]

In [120]:
data = gd.pubmed_papers_and_pt(txt=False, jsonl=False, csv=True)

In [128]:
paper_data = data[0]

In [127]:
fields

['title', 'authors', 'date', 'abstract', 'journal', 'doi']

In [286]:
doi = data[5]['doi']

In [237]:
"https://www.ncbi.nlm.nih.gov/pmc/articles/doi/10.14744/nci.2021.99075"

'https://www.ncbi.nlm.nih.gov/pmc/articles/doi/10.14744/nci.2021.99075'

In [285]:
# Another way to get PMID
"https://pubmed.ncbi.nlm.nih.gov/?term=" + doi
# Then query the html. 

'https://pubmed.ncbi.nlm.nih.gov/?term=10.1101/2021.12.06.21267328'

In [264]:
doi = "10.1515/cclm-2021-1287"
fetch = PubMedFetcher()


In [265]:
doi2pmc = "https://www.ncbi.nlm.nih.gov/pmc/articles/doi/" + doi

In [266]:
doi2pmc

'https://www.ncbi.nlm.nih.gov/pmc/articles/doi/10.1515/cclm-2021-1287'

In [269]:
article.content

In [290]:
article = fetch.article_by_doi(doi)
print(article.doi)
print(article.pmid)
print(article.pmc)


10.1101/2021.12.01.470697
34909774
8669841


In [17]:
class PDBChecker:
    def __init__(self):
        """
        Store all the existing PDB IDs as a dictionary for O(1) lookup.
        There are 184,929 IDs as of 2021-12-8, and the retrieval using Biopython takes about 7 seconds.
        For some reason, calling PDBList() creates an empty folder called "obsolete" in the working
        directory, but this goes away when the `obsolete_pdb` parameter is set to some arbitrary string,
        which I made "None".
        """
        from Bio.PDB.PDBList import PDBList

        self.pdbl = PDBList(verbose=False, obsolete_pdb="None")
        self.existing_pdbs = {pdb_id: True for pdb_id in self.pdbl.get_all_entries()}  # takes 7 secs

    def get_actual(self, possible_pdbs: list, verbose=True) -> list:
        """
        Takes a list of possible PDB IDs as input. 
        Returns a list of the actual PDB IDs, i.e. the ones from the input list that exist on the PDB database.
        
        
        Warning: Please remember that html gobble can include actual PDB IDs by chance. So just because a possible
        PDB ID from the paper url html turns out to be an actual PDB ID (is actually on the database), does not 
        mean it was meant to be written in the text of the paper. 
        """
        actual_pdbs = [pdb_id for pdb_id in possible_pdbs if self.existing_pdbs.get(pdb_id, False)]
        if verbose: 
            print("Out of the", len(possible_pdbs), 'possible PDB IDs scraped', len(actual_pdbs), 'are actual PDB IDs.')
        return actual_pdbs


    def get_top_authors(self, pdb_id: str, top_num=3, verbose=True) -> list:
        """
        Takes an actual PDB ID as input.
        Returns (hopefully) the last names of the top authors for that paper, as retrieved from the PDB database. 
        
        Misc notes:
        If an author has essentially two last names, like "von Kuegelgen", the function will treat those
        as two separate last names. Though this shouldn't matter for practical purposes. 

        The logic of returning only the top authors is that sometimes institutions are named as authors, 
        for example "Seattle Structural Genomics Center for Infectious Disease (SSGCID), McGuire, A.T., Veesler, D."
        or "Midwest Center for Structural Genomics". 
        If there are less than `top_num` authors on the paper, it will return all the authors
        of the paper. 
        """
        
        import tempfile
        import re

        # Download the PDB file into a temporary directory that is removed on exit.
        with tempfile.TemporaryDirectory() as temp_dir:
            pdb_file = self.pdbl.retrieve_pdb_file(pdb_id, file_format="pdb", pdir=temp_dir)
            with open(pdb_file) as handle:
                # Keep only the AUTHOR records; startswith() also tolerates blank lines.
                author_txt = ' '.join(line for line in handle.read().splitlines() if line.startswith("AUTHOR"))
        # Drop the record name and single-character tokens such as initials.
        top_authors = [word for word in re.findall(r"[\w']+", author_txt) if len(word) > 1 and word != "AUTHOR"][:top_num]
        if verbose:
            print("Top Authors scraped from PDB Database:", top_authors)
        return top_authors
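
The author parsing in `get_top_authors` can be tried without downloading anything by running the same token filter on a string. A minimal sketch: the function name `top_author_names` and the sample record text are made up for illustration, and the extra `isdigit()` check (to skip continuation numbers on later AUTHOR lines) is an addition that is not in the class above.

```python
import re


def top_author_names(pdb_text: str, top_num: int = 3) -> list:
    """Return up to `top_num` name tokens from the AUTHOR records of PDB-format text."""
    author_lines = [line for line in pdb_text.splitlines() if line.startswith("AUTHOR")]
    words = re.findall(r"[\w']+", " ".join(author_lines))
    # Drop the record name, single-character initials, and continuation numbers.
    names = [w for w in words if len(w) > 1 and w != "AUTHOR" and not w.isdigit()]
    return names[:top_num]


# Made-up sample text in the layout of PDB AUTHOR records.
sample = "HEADER    EXAMPLE\nAUTHOR    A.B.SMITH,C.D.JONES\nAUTHOR   2 E.F.BROWN\n"
print(top_author_names(sample))
```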
    
        

In [11]:
import requests
from metapub import PubMedFetcher
import get_poss_pdbs as gp
import re

class Paper:
    def __init__(self, paper_data: dict, verbose=True):

        self.fields = ("title", "authors", "date", "abstract", "journal", "doi")
        self.title, self.authors, self.date, self.abstract, self.journal, self.doi = (paper_data[field] for field in self.fields)
        self.url_doi, self.url_pmc, self.url_pmid = None, None, None
        
        if self.doi:
            if verbose:
                print("DOI exists. Attempting to retrieve...", end='\n\n')
            self.retrieve_from_doi()
        else:
            if verbose:
                print("DOI does not exist. Catastrophic failure.", end='\n\n')
            # Set the identifiers regardless of `verbose` so the prints and get_pdbs() below do not fail.
            self.pmid, self.pmc = None, None

        if verbose:
            print("DOI:", self.doi)
            print("PMID:", self.pmid)
            print("PMC:", self.pmc)

        self.get_pdbs()


    def get_pdbs(self, verbose=True):
        if self.url_doi:
            if verbose:
                print("Retrieving pdbs from DOI...")
            doi_poss_pdbs = gp.get_poss_pdbs(self.url_doi)
            self.doi_actual_pdbs = checker.get_actual(doi_poss_pdbs)
            if verbose:
                print(self.doi_actual_pdbs, end="\n\n")


        if self.url_pmid:
            if verbose:
                print("Retrieving pdbs from PMID...")
            pmid_poss_pdbs = gp.get_poss_pdbs(self.url_pmid)
            self.pmid_actual_pdbs = checker.get_actual(pmid_poss_pdbs)
            if verbose:
                print(self.pmid_actual_pdbs, end="\n\n")

        
        if self.url_pmc:
            if verbose:
                print("Retrieving pdbs from PMC...")
            pmc_poss_pdbs = gp.get_poss_pdbs(self.url_pmc)
            self.pmc_actual_pdbs = checker.get_actual(pmc_poss_pdbs)
            if verbose:
                print(self.pmc_actual_pdbs, end="\n\n")



    def retrieve_from_doi(self, verbose=True):
        if verbose:
            print("Scraping from doi...", end="\n\n")
        self.get_url_doi()

        if verbose:
            print("Trying metapub fetch...")
        self.try_metapub()

        if not self.found_metapub:
            # We only try the pubmed search `try_pubmed` if metapub fails. 
            # Therefore we are assuming that if metapub succeeds, it will always
            # find the PMID and PMC if they exist. 
            if verbose:
                print("Trying pubmed webscrape...")
            self.try_pubmed()

        if self.pmid:
            self.get_url_pmid() 
        if self.pmc:
            self.get_url_pmc()

            

    def try_metapub(self, verbose=True):
        try:
            self.article_fetch = PubMedFetcher().article_by_doi(self.doi)
            self.found_metapub = True
            self.pmid = self.article_fetch.pmid
            self.pmc = self.article_fetch.pmc

            if verbose:
                print("Metapub fetch succeeded. PMID and PMC found.", end='\n\n')
        except Exception:  # any failure in the lookup is treated as "not found"
            self.found_metapub = False
            self.pmid, self.pmc = None, None
            if verbose:
                print("Metapub fetch failed.", end='\n\n')

            

    def try_pubmed(self, verbose=True):
        import get_poss_pdbs as gp

        url_search_pubmed = "https://pubmed.ncbi.nlm.nih.gov/?term=" + self.doi
            
        pubmed_txt = gp.get_txt(url_search_pubmed)

        try:
            start_idx = pubmed_txt.index("pmid:")
            # Length of "pmid:" (5) plus the maximum length of a PMID (8)
            end_idx = start_idx + 5 + 8
            pmid_match = pubmed_txt[start_idx:end_idx]
            if verbose:
                print("PMID Match:", pmid_match)  # if verbose
            self.pmid = pmid_match.split(':')[1].split(',')[0]
        except ValueError:  # substring not found
            if verbose:
                print("PMID not found via pubmed webscrape.")
            self.pmid = None

        try:
            self.pmc = re.findall(r"PMC\d{7}", pubmed_txt)[0][3:]
        except IndexError:  # list index out of range
            if verbose:
                print("PMC not found via pubmed webscrape.")
            self.pmc = None

        if (not self.pmid) and not (self.pmc):
            self.found_pubmed = False
        else:
            self.found_pubmed = True
         
        if not self.found_pubmed:
            if verbose:
                print("Pubmed webscrape failed. No PMID or PMC available.", end='\n\n')
        else:
            if verbose:
                print("Pubmed webscrape succeeded.", end='\n\n')



    def get_url_doi(self):  
        # Get the url that results from the doi.org approach
        self.url_doi = 'https://doi.org/' + self.doi
        self.url_doi = requests.get(self.url_doi, allow_redirects=True).url
        
        if ("medRxiv" in self.journal) or ("bioRxiv" in self.journal):
            suffix = ".full"
        else:  # room to add more alterations
            suffix = ""
        self.url_doi += suffix


    def get_url_pmid(self):
        # Add "/" to check for redirects later
        self.url_pmid = "https://pubmed.ncbi.nlm.nih.gov/" + self.pmid + "/"  


    def get_url_pmc(self):
        # Add "/" to check for redirects later
        self.url_pmc = "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC" + self.pmc + "/"



In [2]:
data = gd.pubmed_papers_and_pt(txt=False, jsonl=False, csv=True)

In [4]:
checker = PDBChecker()

In [5]:
data[0]

{'title': 'Immunofluorescence studies on the expression of the SARS-CoV-2 receptors in human term placenta.',
 'authors': 'JürgenBecker-DannyQiu-WalterBaron-JörgWilting',
 'date': '2021-12-17',
 'abstract': 'Until September 2021, the Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2; COVID-19) pandemic caused over 217 million infections and over 4.5 million deaths. In pregnant women the risk factors for the need of intensive care treatment are generally the same as in the overall population. Of note, COVID-19+ women deliver earlier than COVID-19- women, and the risk for severe neonatal and perinatal morbidity and mortality is significantly higher. The probability and pathways of vertical transmission of the virus from the pregnant woman to the fetus are highly controversial. Recent data have shown that 54 (13%) of 416 neonates born to COVID-19-positive women were infected. Here, we investigated term placentas collected before the SARS-CoV-2 pandemic and studied the main COVID

In [15]:
paper = Paper(data[0])

DOI exists. Attempting to retrieve...

Scraping from doi...

Trying metapub fetch...
Metapub fetch failed.

Trying pubmed webscrape...
PMID Match: pmid:34915475
PMC not found via pubmed webscrape.
Pubmed webscrape succceeded.

DOI: 10.1159/000521436
PMID: 34915475
PMC: None
Retrieving pdbs from DOI...
Out of the 0 possible PDB IDs scraped 11 are actual PDB IDs.
[]

Retrieving pdbs from PMC...
Out of the 14 possible PDB IDs scraped 69 are actual PDB IDs.
['1A14', '1A23', '1C10', '1ZM9', '2H16', '3A19', '3A79', '3D38', '4F7D', '5A39', '5C12', '5C51', '5H16', '5H21']



In [13]:
paper = Paper(data[2])

DOI exists. Attempting to retrieve...

Scraping from doi...

Trying metapub fetch...
Metapub fetch succeeded. PMID and PMC found.

DOI: 10.1016/j.intimp.2021.108424
PMID: 34915409
PMC: None
Retrieving pdbs from DOI...
Out of the 0 possible PDB IDs scraped 1 are actual PDB IDs.
[]

Retrieving pdbs from PMC...
Out of the 14 possible PDB IDs scraped 66 are actual PDB IDs.
['1A14', '1A23', '1C10', '1ZM9', '2H16', '3A19', '3A79', '3D38', '4F7D', '5A39', '5C12', '5C51', '5H16', '5H21']



In [16]:
paper = Paper(data[498])

DOI does not exist. Catastrophic failure.

DOI: None
PMID: None
PMC: None


In [387]:
a = True
b = None

In [388]:
if (not a) and (not b):
    print('hi')

In [13]:
df = pd.read_csv("pubmed_results.csv")

In [126]:
print("Number of entries with this field empty")
missing = {field: sum(paper.get(field, "") is None for paper in data) for field in fields}
print({field: count for field, count in missing.items() if count > 0})

Number of entries with this field empty
{'abstract': 124, 'journal': 5, 'doi': 85}


In [383]:
print(len(data))

4414


In [None]:
checker = PDBChecker()
checker.get_top_authors("7KUU")

In [443]:
len(data)

4414

In [None]:
class PaperData:
    def __init__(self, data: list):
        pass  # not implemented yet


Interesting case study:

* No doi
* [PMID method](https://pubmed.ncbi.nlm.nih.gov/34873578/)
* [PMCID method](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8647651/)
* [Weird PMCID requests method](https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=8647651&tool=my_tool&email=my_email@example.com)
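
The last link above can be assembled from its query parameters instead of by string concatenation. A small sketch, assuming the `db`, `id`, `tool` and `email` parameters shown in that link; the helper name `efetch_pmc_url` is hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode

EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_pmc_url(pmc: str, tool: Optional[str] = None, email: Optional[str] = None) -> str:
    """Build the efetch URL for a PMC record; `pmc` is the numeric part of the PMCID."""
    params = {"db": "pmc", "id": pmc}
    if tool:
        params["tool"] = tool
    if email:
        params["email"] = email
    return EFETCH_BASE + "?" + urlencode(params)


print(efetch_pmc_url("8647651", tool="my_tool", email="my_email@example.com"))
```

Note that `urlencode` percent-encodes the `@` in the email address as `%40`, so the printed URL differs from the hand-written link in that one character sequence.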
