## Checking versions
Please do not run this code on your computer unless you understand what it does.

In [1]:
%load_ext version_information
import time
now = time.strftime("%Y-%m-%d %H:%M:%S (%Z = GMT%z)")
print(f"This notebook was generated at {now} ")

vv = %version_information requests, tqdm, pandas, astroquery, version_information
for i, pkg in enumerate(vv.packages):
    print(f"{i} {pkg[0]:10s} {pkg[1]:s}")

This notebook was generated at 2019-07-02 16:13:53 (KST = GMT+0900) 
0 Python     3.7.3 64bit [Clang 4.0.1 (tags/RELEASE_401/final)]
1 IPython    6.5.0
2 OS         Darwin 18.6.0 x86_64 i386 64bit
3 requests   2.22.0
4 tqdm       4.32.1
5 pandas     0.24.2
6 astroquery 0.3.10.dev5533
7 version_information 1.0.3


## Importing and Setting Up

In [2]:
import math
import requests
import time
from itertools import product
from pathlib import Path
from tqdm import tqdm
import pandas as pd
from astroquery import nasa_ads as na


# Adapted from https://stackoverflow.com/questions/37573483/progress-bar-while-download-file-over-http-with-requests
def download_pdf(response, fpath):
    """Stream ``response`` to ``fpath`` in 1024-byte blocks, showing a progress bar."""
    total_size = int(response.headers.get('content-length', 0))
    block_size = 1024
    wrote = 0
    # true division, so that a partial last block is counted as well
    n_blocks = math.ceil(total_size / block_size)
    with open(fpath, 'wb') as f:
        for data in tqdm(response.iter_content(block_size), total=n_blocks, unit='kB', unit_scale=True):
            wrote = wrote + len(data)
            f.write(data)
#     if total_size != 0 and wrote != total_size:
#         print("ERROR, something went wrong")  

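def altnames(fullname):
    """Return the variants of a "Last, First Middle" name in which each
    first/middle name is written in full, as an initial, or dropped."""
    names = [fullname]
    lastname = fullname.split(', ')[0]
    firstmiddle_names = fullname.split(', ')[-1].split(' ')
    N = len(firstmiddle_names)
    pieces = {'0':firstmiddle_names, '1':[]}  # 0/1 = full/initial ('2' below = dropped)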
    
    for n in firstmiddle_names:
        pieces['1'].append('{}.'.format(n[0].upper()))
    
    for ind in product('012', repeat=N):
        altname = ''
        for i, case in enumerate(ind):
            if case != '2':
                altname += "{} ".format(pieces[case][i])
        if altname == '':
            continue
        names.append("{}, {}".format(lastname, altname[:-1]))    
    
    return list(set(names))
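
The enumeration done by ``altnames`` can be tried on its own. The following is a minimal standalone sketch of the same idea, not the notebook's function: ``name_variants`` is a hypothetical name, and it assumes the input has exactly the form ``"Last, First Middle"``.

```python
from itertools import product


def name_variants(fullname):
    """Hypothetical helper: list the variants of a "Last, First Middle" name."""
    lastname, given = fullname.split(", ", 1)
    parts = given.split(" ")
    # Each given name may be kept in full, reduced to an initial, or dropped (None).
    options = [(p, p[0].upper() + ".", None) for p in parts]
    variants = {fullname}
    for combo in product(*options):
        kept = [c for c in combo if c is not None]
        if kept:  # skip the combination that drops every given name
            variants.add("{}, {}".format(lastname, " ".join(kept)))
    return sorted(variants)


for variant in name_variants("Bach, Yoonsoo P."):
    print(variant)
```

With one given name there are only two variants (full name and initial); each additional given name multiplies the number of combinations by up to three.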

## Team Member Setting

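Define the team members. If a person publishes under more than one romanized name, add each name as a separate entry with the same Korean name and researcher number. ``altnames`` holds the alternative forms of each name, built from the combinations of full names and initials of the first/middle names.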

In [3]:
team = dict(
    names=["Kim, Yoonyoung", "Geem, Jooyeon", "Kim, Jooyeon", "Jin, Sunho", "Bach, Yoonsoo P.", "Kwon, Yuna Grace"],
    kornames=["김윤영", "김주연", "김주연", "진선호", "박윤수", "권유나"],
    researcher_number=[1111,1212,1212,2121,3333,4444],
    altnames=[]
)

for name in team["names"]:
    team["altnames"].append(altnames(name))


team_df = pd.DataFrame.from_dict(team)
team_df

Unnamed: 0,names,kornames,researcher_number,altnames
0,"Kim, Yoonyoung",김윤영,1111,"[Kim, Yoonyoung, Kim, Y.]"
1,"Geem, Jooyeon",김주연,1212,"[Geem, J., Geem, Jooyeon]"
2,"Kim, Jooyeon",김주연,1212,"[Kim, J., Kim, Jooyeon]"
3,"Jin, Sunho",진선호,2121,"[Jin, S., Jin, Sunho]"
4,"Bach, Yoonsoo P.",박윤수,3333,"[Bach, Yoonsoo, Bach, Y. P., Bach, P., Bach, Y..."
5,"Kwon, Yuna Grace",권유나,4444,"[Kwon, G., Kwon, Y. G., Kwon, Yuna G., Kwon, Y..."


* **NOTE**: You can also keep this information in separate Excel/CSV/text files and load them with ``pd.read_csv`` or a similar function.
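
As an illustration of keeping the team list in a file, here is a small sketch that rebuilds the ``team`` dictionary from CSV text using only the standard library. The file contents, the column layout, and the helper name ``load_team`` are assumptions made for this example; with pandas, ``pd.read_csv`` on the same file should give an equivalent table directly.

```python
import csv
import io

# Hypothetical file contents. Names contain commas, so they are quoted.
csv_text = '''names,kornames,researcher_number
"Kim, Yoonyoung",김윤영,1111
"Jin, Sunho",진선호,2121
'''


def load_team(fileobj):
    """Read the team table into the dict-of-lists layout used in this notebook."""
    team = dict(names=[], kornames=[], researcher_number=[], altnames=[])
    for row in csv.DictReader(fileobj):
        team["names"].append(row["names"])
        team["kornames"].append(row["kornames"])
        team["researcher_number"].append(int(row["researcher_number"]))
    return team


team_from_csv = load_team(io.StringIO(csv_text))
print(team_from_csv["names"])
```

For a real file, ``io.StringIO(csv_text)`` would be replaced by something like ``open("team.csv", newline="", encoding="utf-8")``, where ``team.csv`` is again a placeholder name. The ``altnames`` list is left empty here and would be filled as in the cell above.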

## Query to ADS

1. Go to [ADS](https://ui.adsabs.harvard.edu/), log in. 
2. Then go to [Account - Settings - API Token](https://ui.adsabs.harvard.edu/user/settings/token). 
3. Generate your token.
4. Copy and paste it to ``na.ADS.TOKEN`` below:

In [4]:
na.ADS.TOKEN = 'RXPglegHZqHD6dav0ur6sac6ZXFYPdMMdJbaes1F'

# by default, the top 10 records are returned, sorted in
# reverse chronological order. This can be changed

# change the number of rows returned
na.ADS.NROWS = 9999

# change the fields that are returned (enter as strings in a list)
na.ADS.ADS_FIELDS = ["title", "bibcode", "author", "pubdate", "property", "esources",
                     "pub", "issn", "volume", "issue", "page", "doi", "arxiv", "bibstem", "database"]

author = "Ishiguro, Masateru"
year = "2000-2019"
query_str = f'author:"={author}" year:{year}'
print(f"Query with: \n\t {query_str}")
results = na.ADS.query_simple(query_str)

results.sort(['pubdate', "title"])

# Flatten multi-dimensional columns so that the table can be converted to pandas:
# only the first element of each such column is kept.
# This worked in my own tests, but I have not checked what information may be lost.
for c in results.colnames:
    if len(results[c].shape) > 1:
        results[c] = results[c][:, 0]

results = results.to_pandas()

results["N_author"] = results["author"].str.len()
results["YYYYMM"] = results["pubdate"].str[:-3].str.replace("-", "").astype(int)
results["refereed"] = ["REFEREED" in prop for prop in results["property"]]
results["astronomy"] = ["astronomy" in db for db in results["database"]]
results["volume"] = [-1 if row["volume"]==[None] else row["volume"] for i, row in results.iterrows()]

results_ref = results[(results["refereed"]
                      & results["astronomy"]
                      & (results["volume"] != -1))]

print(f"ADS contains {len(results)} matches with <{author}> (refereed: {len(results_ref)}) in {year}.")
if len(results_ref) > 100:
    print(f"\nHey {author}, you are awesome.")

Query with: 
	 author:"=Ishiguro, Masateru" year:2000-2019
ADS contains 305 matches with <Ishiguro, Masateru> (refereed: 108) in 2000-2019.

Hey Ishiguro, Masateru, you are awesome.


* **NOTE**: To search for your own papers, change ``author`` and ``year`` (or ``query_str`` directly).
* **NOTE**: See http://adsabs.github.io/help/search/comprehensive-solr-term-list for the complete list of columns.
* **NOTE**: As of 2019-07-02, ADS does not yet return the ``issn`` field.
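
Instead of pasting the token into the notebook, it can be read from the environment, which keeps it out of shared copies of the notebook. A minimal sketch, assuming the token was exported beforehand in an environment variable; the variable name ``ADS_DEV_KEY`` and the helper ``load_ads_token`` are choices made for this example and are not defined anywhere in this notebook.

```python
import os


def load_ads_token(env_var="ADS_DEV_KEY"):
    """Return the ADS API token stored in the given environment variable."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            "No ADS token found: set the {} environment variable first.".format(env_var)
        )
    return token


# Intended use in the cell above (left commented out so this sketch runs on its own):
# na.ADS.TOKEN = load_ads_token()
```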

## Select Rows for This BK Survey
I will select the rows with ``201803 <= YYYYMM <= 201908``. Following the columns of the ``2019보고서요청자료(연구실) - 논문`` Excel file (roughly "2019 report request data (lab) - papers"), I will keep only the following columns,

1. title
2. journal (full name)
3. issn
4. volume
5. issue
6. page
7. YYYYMM
8. number of authors 

in this order, and then add the students' names and their corresponding KRI researcher numbers.

The result is saved as ``BK2019_ishiguro.csv``, which you can open in Excel and copy and paste into the original Excel file.
* **WARNING**: The formatting of the original Excel file is irregular, so you will have to adjust the pasted cells by hand.
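
The date window can be checked in isolation. This sketch mirrors the ``YYYYMM`` conversion used earlier in the notebook and assumes that ``pubdate`` strings have the form ``YYYY-MM-DD`` (for example ``2018-03-00``); the function names are hypothetical.

```python
def pubdate_to_yyyymm(pubdate):
    """Convert a 'YYYY-MM-DD' string such as '2018-03-00' to the integer 201803."""
    # Drop the trailing '-DD', then remove the remaining dash.
    return int(pubdate[:-3].replace("-", ""))


def in_window(pubdate, start=201803, end=201908):
    """True if the publication month lies within [start, end], both inclusive."""
    return start <= pubdate_to_yyyymm(pubdate) <= end


for date in ["2018-02-00", "2018-03-00", "2019-08-00", "2019-09-00"]:
    print(date, in_window(date))
```

Comparing integers of the form ``YYYYMM`` works because, for a fixed number of digits, numeric order and chronological order coincide.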

In [5]:
results_ref_2019 = results_ref[(results_ref["YYYYMM"] >= 201803) & (results_ref["YYYYMM"] <= 201908)]
# .copy() makes an independent DataFrame, so adding columns below does not trigger SettingWithCopyWarning
results_ref_BK2019 = results_ref_2019[["author", "title", "pub", "issn", "volume", "issue", "page", "YYYYMM", "N_author"]].copy()
results_ref_BK2019["students"] = ""
results_ref_BK2019["researcher_number"] = ""


for i, row in results_ref_BK2019.iterrows():
    students = []
    researcher_numbers = []
    for _, student in team_df.iterrows():
        # add each team entry at most once, even if several name variants match
        if any(name in row["author"] for name in student["altnames"]):
            students.append(student["kornames"])
            researcher_numbers.append(str(student["researcher_number"]))
    results_ref_BK2019.at[i, "students"] = ",".join(students)
    results_ref_BK2019.at[i, "researcher_number"] = ",".join(researcher_numbers)
    

del results_ref_BK2019["author"]
results_ref_BK2019.to_csv("BK2019_ishiguro.csv", index=False)


In [6]:
results_ref_BK2019

Unnamed: 0,title,pub,issn,volume,issue,page,YYYYMM,N_author,students,researcher_number
285,Significantly high polarization degree of the ...,Astronomy and Astrophysics,,611,[None],A31,201803,10,,
286,Extremely strong polarization of an active ast...,Nature Communications,,9,[None],2486,201806,12,"박윤수,권유나","3333,4444"
287,The Reactivation and Nucleus Characterization ...,The Astronomical Journal,,156,1,39,201807,7,,
288,Opposition effect on S-type asteroid (25143) I...,Astronomy and Astrophysics,,616,[None],A178,201809,2,,
292,Optical observations of NEA 3200 Phaethon (198...,Astronomy and Astrophysics,,619,[None],A123,201811,27,,
293,The 2016 Reactivations of the Main-belt Comets...,The Astronomical Journal,,156,5,223,201811,10,김윤영,1111
294,High polarization degree of the continuum of c...,Astronomy and Astrophysics,,620,[None],A161,201812,19,권유나,4444
295,Physical properties of near-Earth asteroids wi...,Publications of the Astronomical Society of Japan,,70,6,114,201812,43,,
300,Hayabusa2 arrives at the carbonaceous asteroid...,Science,,364,6437,268,201904,88,,
301,Shape and Rotational Motion Models for Tumblin...,The Astronomical Journal,,157,4,155,201904,16,,
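
The matching that fills the ``students`` and ``researcher_number`` columns can be sketched without pandas. This is an illustration under assumptions: ``match_students`` is a hypothetical helper, the author list is invented for the example, and deduplicating by researcher number is a design choice of this sketch so that a person listed under two romanizations is reported once.

```python
def match_students(authors, team):
    """Return (students, researcher_numbers) as comma-joined strings.

    authors: list of author strings for one paper.
    team: list of (korname, researcher_number, altnames) tuples.
    """
    kornames, numbers = [], []
    for korname, number, variants in team:
        matched = any(variant in authors for variant in variants)
        if matched and number not in numbers:  # report each person once
            kornames.append(korname)
            numbers.append(number)
    return ",".join(kornames), ",".join(str(n) for n in numbers)


team = [
    ("김주연", 1212, ["Geem, J.", "Geem, Jooyeon"]),
    ("김주연", 1212, ["Kim, J.", "Kim, Jooyeon"]),
    ("박윤수", 3333, ["Bach, Yoonsoo P.", "Bach, Y. P.", "Bach, Y."]),
    ("권유나", 4444, ["Kwon, Yuna Grace", "Kwon, Y. G."]),
]

# Invented author list, for illustration only.
authors = ["Ishiguro, M.", "Bach, Y. P.", "Kwon, Y. G."]
print(match_students(authors, team))
```

Matching on short forms such as ``Kim, J.`` can also catch unrelated authors with the same surname and initial, so the resulting table is worth a manual check.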


## Download the PDF Files of the Papers
For each paper, I follow the ADS link gateway and try the following:
1. Download the publisher's PDF if ADS lists one.
  - For Science, the publisher's PDF link does not point directly to the full PDF, so I added a special case.
2. If ADS lists no publisher PDF, try to build the PDF link from the publisher's HTML page.
  - For Nature, for example, appending ``.pdf`` to the article URL seems to lead to the PDF.
  
I will add more exceptions over time so that this works for as many journals as possible.
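
The two special cases above amount to simple string rewrites of the URL that the ADS gateway resolves to. The sketch below puts them in a single function so they can be tested without network access; ``publisher_pdf_url`` is a hypothetical helper, the Science URL in the example is made up, and merging both rules into one function is a simplification of this sketch (the cell below applies them in separate branches).

```python
def publisher_pdf_url(resolved_url, journal):
    """Guess the direct PDF URL from the URL the ADS gateway redirected to."""
    if "Science" in journal:
        # Science: the PDF tab is not the file itself; the file ends in '.full.pdf'.
        if resolved_url.endswith("/tab-pdf"):
            return resolved_url.replace("/tab-pdf", ".full.pdf")
        return resolved_url + ".full.pdf"
    if "nature.com" in resolved_url:
        # Nature: appending '.pdf' to the article page seems to give the PDF.
        return resolved_url + ".pdf"
    # No known rule: use the URL unchanged.
    return resolved_url


print(publisher_pdf_url("https://www.nature.com/articles/s41467-018-04727-2",
                        "Nature Communications"))
# Made-up Science URL, for illustration only.
print(publisher_pdf_url("https://science.example.org/content/364/6437/268/tab-pdf",
                        "Science"))
```

A substring test on the journal name, as used here, also matches other journals whose titles contain "Science", so a stricter comparison may be preferable when more journals are added.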

In [7]:
BASE = "https://ui.adsabs.harvard.edu/link_gateway/"
# papers that could not be downloaded automatically are collected here
manual = dict(bib=[], pub_html=[])
# send a browser-like User-Agent; adapted from https://stackoverflow.com/questions/43165341/python3-requests-connectionerror-connection-aborted-oserror104-econnr/43167631
headers = requests.utils.default_headers()
headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'

for i, row in results_ref_2019.iterrows():
    bib = row["bibcode"]
    fpath = Path('{}.pdf'.format(bib))
    print(fpath, end=' ')
    
    if fpath.exists():
        print('already exists!')
        continue
        
    if "PUB_PDF" in row["esources"]:
        url = BASE + row["bibcode"] + "/PUB_PDF"
        print('Downloading...', end=' ')

        response = requests.get(url, headers=headers, stream=True)
        
        if "Science" in row["pub"]:
            if response.url.endswith("/tab-pdf"):
                url = response.url.replace("/tab-pdf", ".full.pdf")
            else:
                url = response.url + ".full.pdf"
            response = requests.get(url, headers=headers, stream=True)

        print("\n\t" + response.url)
        time.sleep(1)
        
        download_pdf(response, fpath)

    else:
        try:
            print("trying to find pdf...", end=' ')
            url = BASE + row["bibcode"] + "/PUB_HTML"
            response = requests.get(url, headers=headers, stream=True)
            if "nature.com" in response.url:
                url = response.url + ".pdf"
            else:
                raise ConnectionError()
            response = requests.get(url, headers=headers, stream=True)
            if response.status_code == 404:
                raise ConnectionError()
            print('I found it! Downloading...', end=' ')
            print("\n\t" + response.url)
            time.sleep(1)
        
            download_pdf(response, fpath)
            
        except ConnectionError:            
            print("\n!!! I couldn't find a valid link. Download from below:")
            print("\t" + BASE + bib + "/PUB_HTML")
            manual["bib"].append(bib)
            manual["pub_html"].append(BASE + bib + "/PUB_HTML")

2018A&A...611A..31K.pdf Downloading... 
	https://www.aanda.org/articles/aa/pdf/2018/03/aa32086-17.pdf


1.70kkB [00:04, 362kB/s]                           


2018NatCo...9.2486I.pdf trying to find pdf... I found it! Downloading... 
	https://www.nature.com/articles/s41467-018-04727-2.pdf


749kB [00:00, 3.41kkB/s]                         


2018AJ....156...39H.pdf Downloading... 
	https://iopscience.iop.org/article/10.3847/1538-3881/aac81c/pdf


1.84kkB [00:05, 347kB/s]


2018A&A...616A.178L.pdf Downloading... 
	https://www.aanda.org/articles/aa/pdf/2018/08/aa32721-18.pdf


6.52kkB [00:16, 405kB/s]                           


2018A&A...619A.123K.pdf Downloading... 
	https://www.aanda.org/articles/aa/pdf/2018/11/aa33593-18.pdf


1.77kkB [00:05, 342kB/s]                           


2018AJ....156..223H.pdf Downloading... 
	https://iopscience.iop.org/article/10.3847/1538-3881/aae528/pdf


2.29kkB [00:03, 629kB/s]


2018A&A...620A.161K.pdf Downloading... 
	https://www.aanda.org/articles/aa/pdf/2018/12/aa33968-18.pdf


1.87kkB [00:04, 418kB/s]                           


2018PASJ...70..114H.pdf Downloading... 
	https://watermark.silverchair.com/psy119.pdf?token=AQECAHi208BE49Ooan9kkhW_Ercy7Dm3ZL_9Cf3qfKAc485ysgAAAlwwggJYBgkqhkiG9w0BBwagggJJMIICRQIBADCCAj4GCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQM71Z07gdVeEuiN548AgEQgIICDyK4X0ENkY7xGz2TQZs1pw1BAa_Tkl8ZGM7Yk2ZFYrtP0mYfYvyJKGryPriu7Ku3imvOkAOZuevEslWGg_xlCq8vnLr18668Rf21N7fUaeh6-cH3ixpp5TkvI91cpH_n-hJHaJcMhTTANsrdHqcKf_GoYVx4p1cZKFuFMNVnZO9tEw0dRF9YWO1VNMtdmtNZEAdLIPhn8a2IusE0cjRnpSjzsb9T53jo4691_9EcxAQLXIb0Q4TefyRQfmyopEtay-8xTNG5CXLeCZgucytfP6-kfEH4M8sm7OQbgeR5VqFoV-V4hd-Unmi5sqHVn5KvTjEECxQyWFxwe6bEtj_Cn4Dv9tKxJveXmSe47aJwac3vM7-V4psbetjE08hAQiF-kG6Av8s4X5GMEg0UYgidC7E_pOEVC9Bpd4EoA0VvVTIIUJy_30parb3liYQi_HHrq57pukI_ZQ7qz9nh1dAzocPqbAXToBNNpVv1ZYSB6x-Ve68xscLZycLq3KP7uBCNDXkk5bBAxUG0WVzto-t4uqSYBZW2_y8vIiIr__mhkZrkW5cbUQlp6qGgZW98w8xXy1pBmrryOpdR3Ar4p1k23cu_-jVPZ8bp5U0mvGuQiOi0lHqXcgpTFWR7yJiepyBeaGKtR7pt1nf116RjtUm_uYVl8OCf4Y521uaCqY2HCEMfgPp-FUMKEWM9qNUCMUx3


3.42kkB [00:26, 131kB/s]                           


2019Sci...364..268W.pdf Downloading... 
	https://science.sciencemag.org/content/sci/364/6437/268.full.pdf


839kB [00:00, 1.49kkB/s]                       


2019AJ....157..155U.pdf Downloading... 
	https://iopscience.iop.org/article/10.3847/1538-3881/ab09f0/pdf


1.94kkB [00:07, 267kB/s]


2019Sci...364..252S.pdf Downloading... 
	https://science.sciencemag.org/content/sci/364/6437/eaaw0422.full.pdf


12.5kkB [00:24, 515kB/s]                           


* **WARNING**: Some of your papers may be accepted but not yet listed on ADS. You **MUST** find those yourself!
* **NOTE**: I did not put much effort into automating the search for PDF download links. Even so, it gives the download link for most papers, which can save a lot of time.

In [8]:
import pandas as pd
pd.DataFrame(manual)

Unnamed: 0,bib,pub_html
