## ADS Searcher Notebook

This notebook contains the functions needed to take a list of author names or institutions and infer the expertise of each author (or of the authors at each institution) from their publications in ADS.

Before running the code, you will need:
1. The supporting file **TextAnalysis.py**.
2. A file of ignorable "stop words", **stopwords.txt**, and its directory path.
3. Your own **NASA ADS API token**. This is a long string of characters generated by ADS that gives you access to their API.
4. Either a single input author/institution or an input CSV file listing all the authors/institutions you want to run.

How this notebook is set up:
- Each step of the process is defined in its own function (make sure to run these cells).
- At the end is the `final` function, whose for loop runs over the CSV file you give it. It follows different pathways depending on the input.


### Step 1: Import statements and getting all necessary files

In [None]:
import requests
from urllib.parse import urlencode, quote_plus
import numpy as np

In [None]:
import pandas as pd
print(pd.__version__)

In [None]:
token = 'Z6VGw27uH9j2oYKa3t8zqdSmOMK8G5zEGbVnAdsD' #edit this for yourself
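
Hard-coding the token in the notebook makes it easy to share by accident. As one possible alternative, the sketch below reads it from an environment variable; the variable name `ADS_API_TOKEN` and the helper `load_ads_token` are assumptions made for this example, not part of the ADS API.

```python
import os

def load_ads_token(env_var="ADS_API_TOKEN"):
    """Return the ADS token stored in an environment variable (the name is an assumption)."""
    raw = os.environ.get(env_var, "")
    token = raw.strip()  # stray spaces would otherwise end up in the Authorization header
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your ADS API token")
    return token
```

With the variable set, `token = load_ads_token()` could replace the assignment in the cell above.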

In [None]:
directory = 'C:\\Users\\mhelfenb\\NASA_internship\\stopwords.txt' #edit this directory path for yourself
import sys
sys.path.append('C:\\Users\\mhelfenb\\NASA_internship') #edit this
import TextAnalysis as TA

### Step 2: ADS Search function (retrieving the ADS info)
#### Inputs: name (optional), institution (optional), year (optional), and refereed property (optional) - choose the input
#### Outputs: Title, First author, bibcode, abstract, affiliation, publication date and keywords of each publication by this specific author

In [None]:
def ads_search(name=None, institution=None, year= None, refereed= 'property:notrefereed OR property:refereed'):

#editing query input here

    if name:
        if institution:
            query = 'pos(institution:"{}",1), author:"^{}"'.format(institution, name)
            print(query)
        else:
            query = 'author:"^{}"'.format(name)
            print(query)

    else:
        query = 'pos(institution:"{}",1)'.format(institution)
        print(query)

    if year:
        if year=='general':
            startd= str(2000)
            endd= str(2023)
            years= '['+startd+' TO '+endd+']'
            query += ', pubdate:{}'.format(years)
            print(query)

        else:
            startd=str(int(year)-1)
            endd=str(int(year)+4)
            years='['+startd+' TO '+endd+']'
            query += ', pubdate:{}'.format(years) #input year in function
            print(query)


#making and sending query to ADS

    encoded_query = urlencode({
        "q": query,
        "fl": "title, first_author, bibcode, abstract, aff, pubdate, keyword",
        "fq": "database:astronomy,"+str(refereed),
        "rows": 300,
        "sort": "date desc"
    })
    results = requests.get(
        "https://api.adsabs.harvard.edu/v1/search/query?{}".format(encoded_query),
        headers={'Authorization': 'Bearer ' + token}
    )

    data = results.json()["response"]["docs"]

#extract results into each separate detail

    pdates = [d['pubdate'] for d in data]
    affiliations = [d['aff'][0] for d in data]
    bibcodes = [d['bibcode'] for d in data]
    f_auth = [d['first_author'] for d in data]
    keysw = [d.get('keyword', []) for d in data]
    titles = [d.get('title', '') for d in data]
    abstracts = [d.get('abstract', '') for d in data]

#define data frame

    df = pd.DataFrame({
        'Input Author': [name] * len(data),
        'Input Institution': [institution] * len(data),
        'First Author': f_auth,
        'Bibcode': bibcodes,
        'Title': titles,
        'Publication Date': pdates,
        'Keywords': keysw,
        'Affiliations': affiliations,
        'Abstract': abstracts,
        'Data Type': [[]]*len(data)
    })

    if name is None:
        df['Input Author']= f_auth


    return df
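
To check what will be sent to ADS before spending API calls, the query-building logic of `ads_search` can be reproduced on its own. This is a minimal sketch: `build_query` is a hypothetical helper written for illustration, and it only mirrors the string formatting in the function above.

```python
def build_query(name=None, institution=None, year=None):
    """Rebuild the query string that ads_search assembles, without calling the API."""
    if name and institution:
        query = 'pos(institution:"{}",1), author:"^{}"'.format(institution, name)
    elif name:
        query = 'author:"^{}"'.format(name)
    else:
        query = 'pos(institution:"{}",1)'.format(institution)

    if year == 'general':
        # wide fallback window used when a narrower search finds nothing
        query += ', pubdate:[2000 TO 2023]'
    elif year:
        # one year before the given year through four years after it
        query += ', pubdate:[{} TO {}]'.format(int(year) - 1, int(year) + 4)
    return query

print(build_query(name='Last, First', year=2015))
# prints: author:"^Last, First", pubdate:[2014 TO 2019]
```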

### Step 2.5: ADS Search (with aff instead of institution)
This intermediate step exists for cases where the institution search returns no results. Due to an ADS issue, searching on `aff` (affiliation) sometimes works better than `institution` in the query.
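
The fallback idea can be sketched independently of ADS: run one search, and only if it comes back empty run the other with the same arguments. The helper name `search_with_fallback` and the two stand-in search functions are hypothetical; in the notebook the two callables would be `ads_search` and `ads_search_aff`, whose DataFrame results also support `len()`.

```python
def search_with_fallback(primary, fallback, **kwargs):
    """Run `primary`; if it returns no rows, run `fallback` with the same arguments."""
    result = primary(**kwargs)
    if len(result) == 0:
        result = fallback(**kwargs)
    return result

# Stand-ins for ads_search and ads_search_aff so the sketch runs offline.
def fake_institution_search(institution=None):
    return []  # pretend the institution field matched nothing

def fake_aff_search(institution=None):
    return [{'aff': institution}]

rows = search_with_fallback(fake_institution_search, fake_aff_search,
                            institution='Example University')
print(rows)
```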

In [None]:
def ads_search_aff(name=None, institution=None, year= None, refereed= 'property:notrefereed OR property:refereed'):

#editing query input here
    if name:

        if institution:
            query = 'pos(aff:"{}",1), author:"^{}"'.format(institution, name)
            print(query)
        else:
            query = 'author:"^{}"'.format(name)
            print(query)

    else:
        query = 'pos(aff:"{}",1)'.format(institution)
        print(query)

    if year:
        if year=='general':
            startd= str(2000)
            endd= str(2023)
            years= '['+startd+' TO '+endd+']'
            query += ', pubdate:{}'.format(years)
            print(query)

        else:
            startd=str(int(year)-1)
            endd=str(int(year)+4)
            years='['+startd+' TO '+endd+']'
            query += ', pubdate:{}'.format(years) #input year in function
            print(query)


#making and sending query to ADS

    encoded_query = urlencode({
        "q": query,
        "fl": "title, first_author, bibcode, abstract, aff, pubdate, keyword",
        "fq": "database:astronomy,"+str(refereed),
        "rows": 300,
        "sort": "date desc"
    })
    results = requests.get(
        "https://api.adsabs.harvard.edu/v1/search/query?{}".format(encoded_query),
        headers={'Authorization': 'Bearer ' + token}
    )

    data = results.json()["response"]["docs"]

#extract results into each separate detail

    pdates = [d['pubdate'] for d in data]
    affiliations = [d['aff'][0] for d in data]
    bibcodes = [d['bibcode'] for d in data]
    f_auth = [d['first_author'] for d in data]
    keysw = [d.get('keyword', []) for d in data]
    titles = [d.get('title', '') for d in data]
    abstracts = [d.get('abstract', '') for d in data]

#define data frame

    df = pd.DataFrame({
        'Input Author': [name] * len(data),
        'Input Institution': [institution] * len(data),
        'First Author': f_auth,
        'Bibcode': bibcodes,
        'Title': titles,
        'Publication Date': pdates,
        'Keywords': keysw,
        'Affiliations': affiliations,
        'Abstract': abstracts,
        'Data Type': [[]]*len(data)
    })

    if name is None:
        df['Input Author']= f_auth


    return df

### Step 3: Defining data type (dirty vs clean)
This step labels each row as "Clean" or "Dirty" based on three checks: whether the paper was published in one of the chosen journals, whether the listed first author matches the input author, and whether the listed affiliation includes the input institution.

The input is the dataframe from Step 2/2.5.

In [None]:
def data_type(df):

    journals = ['ApJ', 'MNRAS', 'AJ', 'Nature', 'Science', 'PASP', 'AAS', 'arXiv', 'SPIE', 'A&A']

    flag = 0 #defined up front so the print below also works for an empty dataframe

    for index, row in df.iterrows():

        flag = 0
        institution = row['Input Institution'] #institution supplied for this row

        # Journal check
        if not any(journal in row['Bibcode'] for journal in journals):
            flag += 1

        # Author check
        if row['First Author'].lower() != row['Input Author'].lower():
            flag += 2

        # Institution check (skipped when no institution was given)
        if institution and institution not in row['Affiliations']:
            flag += 4

        # Update the 'Data Type' column
        if flag == 0:
            data_type_label = 'Clean'
        else:
            data_type_label = 'Dirty'

        df.at[index, 'Data Type'] = data_type_label

    print(flag) #shows which checks made the last row 'dirty' (0 means none)

    #flag=1 just the journal did, flag=2 just the author, flag=3 the author and journal,
    #flag=4 just the inst, etc.

    return df
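
Because the three checks add 1, 2 and 4 to the flag, every flag value maps to a unique combination of failed checks. The sketch below decodes a flag back into those reasons; `FLAG_REASONS` and `explain_flag` are hypothetical names introduced only for this illustration.

```python
# Each failed check contributes a distinct power of two to the flag.
FLAG_REASONS = {1: 'journal', 2: 'author', 4: 'institution'}

def explain_flag(flag):
    """List which checks failed for a given flag value (0 means none failed)."""
    return [reason for bit, reason in FLAG_REASONS.items() if flag & bit]

print(explain_flag(3))  # ['journal', 'author']
```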




### Step 4: Merge the publications for individual authors into one row
Before this step, each publication sits in its own row. Here the publications are combined by author, i.e. if an author has 5 publications, all 5 end up in one row for that author, with each column's values joined by commas.

The input is the dataframe made in Step 3.
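
To make the shape of the merged output concrete, here is a plain-Python illustration of the same grouping idea. It is not the pandas code used below: `merge_rows` is a hypothetical helper and the bibcodes are made-up placeholders.

```python
from collections import defaultdict

# Made-up rows standing in for the per-publication dataframe.
rows = [
    {'Input Author': 'Last, First', 'Bibcode': 'bibcode-1', 'Data Type': 'Clean'},
    {'Input Author': 'Last, First', 'Bibcode': 'bibcode-2', 'Data Type': 'Dirty'},
    {'Input Author': 'Other, Name', 'Bibcode': 'bibcode-3', 'Data Type': 'Clean'},
]

def merge_rows(rows, key='Input Author'):
    """Collect every column's values per author, then join them with ', '."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for column, value in row.items():
            if column != key:
                grouped[row[key]][column].append(value)
    return {
        author: {column: ', '.join(values) for column, values in columns.items()}
        for author, columns in grouped.items()
    }

merged = merge_rows(rows)
print(merged['Last, First'])
```

Note that joining with `', '` means individual titles or abstracts that themselves contain commas cannot be split apart again afterwards.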

In [None]:
def merge(df):

    df['Publication Date']= df['Publication Date'].astype(str)
    df['Abstract']= df['Abstract'].astype(str)
    df['Keywords'] = df['Keywords'].apply(lambda keywords: ', '.join(keywords) if keywords else '')
    df['Title'] = df['Title'].apply(lambda titles: ', '.join(titles) if titles else '')

# if the dataframe is missing any information it is labeled as "None"

    df.fillna('None', inplace=True)

    merged= df.groupby('Input Author').aggregate({'Input Institution':', '.join, 'First Author':', '.join, 'Bibcode':', '.join,
                                                 'Title':', '.join,'Publication Date':', '.join, 'Keywords':', '.join,
                                                 'Affiliations':', '.join,'Abstract':', '.join, 'Data Type':', '.join}).reset_index()

    return merged



### Step 5: Defining the n_grams for each publication  
This step computes the n-grams for each row of the merged dataframe, meaning the top words, bigrams and trigrams found in the abstracts (using the TextAnalysis file).
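
For readers without TextAnalysis.py at hand, the general idea of counting top n-grams can be sketched with the standard library. This is an assumption-laden illustration, not the TextAnalysis implementation: `top_ngrams` is a hypothetical function, the tokenization is deliberately crude, and n-grams are allowed to span sentence boundaries.

```python
from collections import Counter
import re

def top_ngrams(text, n, stop_words, k=10):
    """Return the k most common n-grams in `text`, skipping stop words."""
    # lowercase, keep alphabetic tokens only, and drop stop words
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stop_words]
    # zip n shifted copies of the word list to form consecutive n-word tuples
    ngrams = zip(*(words[i:] for i in range(n)))
    return Counter(' '.join(gram) for gram in ngrams).most_common(k)

sample = "Dark matter halos host galaxies. Dark matter halos grow by mergers."
print(top_ngrams(sample, 2, stop_words={'by'}, k=2))
```

Calling it with `n=1`, `n=2` and `n=3` gives the word, bigram and trigram counts respectively.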

In [None]:
def n_grams(df, directorypath): #directorypath should lead to stopwords.txt
    top10Dict = {'Top 10 Words':[],
                 'Top 10 Bigrams':[],
                 'Top 10 Trigrams':[]}

    #select the abstracts by column name rather than by position
    for abstracts in df['Abstract']:

        top10words = TA.topwords(abstracts, directorypath)
        top10bigrams = TA.topbigrams(abstracts, directorypath)
        top10trigrams = TA.toptrigrams(abstracts, directorypath)

        top10Dict['Top 10 Words'].append(top10words)
        top10Dict['Top 10 Bigrams'].append(top10bigrams)
        top10Dict['Top 10 Trigrams'].append(top10trigrams)

    top10Df = df
    top10Df['Top 10 Words'] = top10Dict['Top 10 Words']
    top10Df['Top 10 Bigrams'] = top10Dict['Top 10 Bigrams']
    top10Df['Top 10 Trigrams'] = top10Dict['Top 10 Trigrams']

    top10Df = top10Df[['Input Author', 'Input Institution', 'First Author', 'Bibcode', 'Title', 'Publication Date',
             'Keywords', 'Affiliations', 'Abstract', 'Top 10 Words', 'Top 10 Bigrams', 'Top 10 Trigrams', 'Data Type']]

    return top10Df

### Step 6: Putting it all together
The `final` function takes one file listing all of the authors or institutions you want to test and runs every step on each author or institution.

**Important to note:** you need to edit the `final` function below to match what your own file contains. The comments inside the function describe what you may need to replace.

- If you want to test a single author or a single institution, skip ahead to Step 7.
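
As written, `final` reads three columns from the CSV: `Institution Name`, `Author` and `Fellowship Year`. The sketch below builds a one-row example of such a file in memory; the row values are made-up placeholders, and the column names should change if you edit the function.

```python
import csv
import io

# Hypothetical example row; the column names match the ones read inside final().
sample_csv = io.StringIO()
writer = csv.DictWriter(sample_csv,
                        fieldnames=['Institution Name', 'Author', 'Fellowship Year'])
writer.writeheader()
writer.writerow({'Institution Name': 'Example University',
                 'Author': 'Last, First',   # format must be Last, First
                 'Fellowship Year': 2015})

# Read it back to confirm the comma inside the author name survives the round trip.
sample_csv.seek(0)
rows = list(csv.DictReader(sample_csv))
print(rows[0]['Author'])
```

Because author names contain a comma, they need to be quoted in the file, which `csv` and most spreadsheet exports handle automatically.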

In [None]:
def final(file):
    dataframe= pd.read_csv(file)

    #replace inside the brackets below with the arguments you want to use
    institutions= dataframe['Institution Name']
    names= dataframe['Author'] #format must be Last, First
    start_years= dataframe['Fellowship Year']
    referee= 'property:notrefereed OR property:refereed'

    final_df= pd.DataFrame()
    count= 0

    #starting for loop to go through the input csv file
    for i in np.arange(len(dataframe)):

        #edit for what your argument will be in step 2- comment out the aspects of the dataframe that will not be included in the function
        inst= institutions[i]
        name= names[i]
        year= start_years[i]

        #inputting into step 2
        data1= ads_search(name= name, institution= inst, year=year, refereed=referee)

        #if the dataframe is empty and there is an author inputted into the function
        if name and data1.empty:

            #if the year is an inputted argument then drop the institution from the search
            if year:
                data1= ads_search(name=name, year=year, refereed=referee)

                #if the dataframe is still empty for just the name and year then search for a larger year range (2000 to 2023)
                if data1.empty:
                    data1= ads_search(name=name, year='general', refereed=referee)

            #no year input then just search  name without institution
            else:
                data1= ads_search(name, refereed=referee)

        #if there is no name input
        if name is None and data1.empty:
            if year:
                data1= ads_search_aff(institution= inst, year=year, refereed=referee)
            else:
                data1= ads_search_aff(institution= inst, refereed=referee)

        data1['Input Institution']=inst

        data2= data_type(data1)
        data3= merge(data2)
        data4= n_grams(data3, directory)

        #DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
        final_df= pd.concat([final_df, data4], ignore_index= True)
        count+=1
        print(str(count)+' iterations done')

    return final_df

### Step 7: Entering in your own data
- The first cell below runs the `final` function on a CSV file that contains multiple authors or institutions.
- The second cell is for running just one author name or one institution name.

##### 7.1 USE FINAL if you are using a csv file that has specific data in it
- Whether you want to find the expertise of specific authors, of authors from a specific institution, or something else, edit the cell below to match your input.

In [None]:
file= 'filename.csv'
complete_df= final(file)
complete_df.to_csv('dataframe.csv', index=False)

##### 7.2 Use this cell if the input is only one author or institution
Edit the assignments below to match your desired input!

In [None]:
author= 'Last, First'
inst= 'Input Inst Here'
year= 2000
referee= 'property:notrefereed OR property:refereed'
d1= ads_search(name=author, institution=inst, year=year, refereed=referee)
d2= data_type(d1)
d3=merge(d2)
d4=n_grams(d3, directory)
d4