In [88]:
import transformers
import requests
import pandas as pd

In [2]:
# Locate the model cache to remove unwanted models
# from transformers import file_utils
# print(file_utils.default_cache_path)

Our first goal is to help students find an interesting field of research. We do this by taking some fields of study to sort by (from those Semantic Scholar offers) and some keywords, then letting search do the rest. This way, students can quickly look around diverse research scenes and discover where their interests lie.

Ideally, students already have ideas in mind and use this tool to explore them, not to generate them (College Board doesn't like that).
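
As a rough sketch of how keywords and fields could be combined into one search request: the helper name `build_search_params` is hypothetical, and the code assumes the search endpoint accepts a comma-separated `fieldsOfStudy` parameter that acts as a filter. It only builds the parameter dict and does not call the API.

```python
def build_search_params(keywords, fields_of_study=None, limit=10):
    """Build query parameters for a paper search (hypothetical helper)."""
    params = {'query': ' '.join(keywords), 'limit': limit}
    if fields_of_study:
        # Assumption: the endpoint accepts a comma-separated fieldsOfStudy filter
        params['fieldsOfStudy'] = ','.join(fields_of_study)
    return params

params = build_search_params(
    ['wind', 'speed', 'prediction'],
    fields_of_study=['Engineering', 'Environmental Science'],
)
print(params)
```

The resulting dict could then be passed as `params=` to `requests.get`, in the same way the cells in this notebook pass their parameters.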

In [127]:
import time

# We'll use this dataframe to organize results
paper_df = pd.DataFrame()

# Rate limiting shenanigans: without an API key, a request can come back as an error dict
# with a 'message' key instead of results, so retry until we get data.
# This could be avoided if I were willing to sign up for an API key.
# Reach out to Semantic Scholar to ask about API keys for student research projects.
query = 'wind+speed&prediction|predict'
base_url = 'https://api.semanticscholar.org/graph/v1/paper/search'
r = {'message': ''}
while 'message' in r:
    r = requests.get(base_url, params={'query': query}).json()
    if 'message' in r:
        time.sleep(1)  # brief pause so retries don't hammer the API
titles = [paper['title'] for paper in r['data']]
ids = [paper['paperId'] for paper in r['data']]
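
The retry loop above keeps requesting until the response stops looking like an error. A gentler variant, sketched below, waits longer after each failure and gives up after a fixed number of tries. The name `get_with_backoff` and its parameters are hypothetical, and the fake responses stand in for `requests.get(...).json()` so the sketch runs without touching the network.

```python
import time

def get_with_backoff(fetch, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until the response has no 'message' key, doubling the wait each time."""
    delay = base_delay
    for attempt in range(max_tries):
        response = fetch()
        if 'message' not in response:
            return response
        if attempt < max_tries - 1:
            sleep(delay)
            delay *= 2
    raise RuntimeError(f'still rate limited after {max_tries} tries')

# Stand-in for requests.get(...).json(): fails twice, then succeeds
fake_responses = iter([
    {'message': 'Too Many Requests'},
    {'message': 'Too Many Requests'},
    {'data': [{'paperId': 'abc', 'title': 'Example'}]},
])
waits = []  # record the requested delays instead of actually sleeping
result = get_with_backoff(lambda: next(fake_responses), sleep=waits.append)
print(result['data'][0]['title'], waits)
```

In the notebook, `fetch` would be something like `lambda: requests.get(base_url, params={'query': query}).json()`, with the default `sleep` left in place.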

In [128]:
# Grab additional data for each paper returned from search
r = requests.post(
    'https://api.semanticscholar.org/graph/v1/paper/batch',
    params={'fields': 'abstract,tldr,url,s2FieldsOfStudy,isOpenAccess,openAccessPdf'},
    json={"ids": ids},
).json()
abstracts = [paper['abstract'] for paper in r]

# tldrs are special since they return a dict (with a 'text' key) when not None
tldrs = []
for paper in r:
    if paper['tldr'] is not None:
        tldrs.append(paper['tldr']['text'])
    else:
        tldrs.append(None)

# For fields, we need to pull the category out of each dict to make a nice comma-separated list
fields = [', '.join(f['category'] for f in paper['s2FieldsOfStudy']) for paper in r]

In [129]:
# Assemble the dataframe
paper_df['title'], paper_df['id'], paper_df['abstract'], paper_df['tldr'], paper_df['field'] = titles, ids, abstracts, tldrs, fields
display(paper_df)

Unnamed: 0,title,id,abstract,tldr,field
0,A Review on ANN Based Model for Solar Radiatio...,985edf9d7a0895c58e921921db0be38775050d0d,,The study indicates that good quality predicti...,"Environmental Science, Engineering"
1,An ultra‐short‐term wind speed prediction mode...,e99ee8e39c055a77be0170788f4f2ff6597caf2f,Accurate ultra‐short‐term wind speed predictio...,A hybrid prediction model based on the long sh...,"Engineering, Environmental Science, Computer S..."
2,Short-Term Canyon Wind Speed Prediction Based ...,75336eab4a5b16bd69c93edf33fce3ebf44a2d70,Due to the particularity of the site selection...,A hybrid transfer learning model based on a co...,"Engineering, Environmental Science, Computer S..."
3,Wavelet kernel least square twin support vecto...,8a1e1b2bc19765e0c77f81ade7136475813039f3,,Wavelet kernel–based LSTSVR models are propose...,"Medicine, Environmental Science, Engineering"
4,Wind speed prediction over Malaysia using vari...,5b2f9872f132de046e723ffcf3bf7e68893003b5,Modeling wind speed has a signiﬁcant impact on...,Three machine learning approaches – Gaussian p...,"Environmental Science, Engineering"
5,Learning Temporal and Spatial Correlations Joi...,335df24c2fa0831e67f010c53094f0bcd157e32e,Leveraging both temporal and spatial correlati...,This paper proposed a deep architecture termed...,"Computer Science, Environmental Science, Compu..."
6,Wind speed prediction using a hybrid model of ...,2028e2c2827996044fdaff1ca6de6957e5f71637,"Wind power as a renewable source of energy, ha...",It was concluded that WOA optimization algorit...,"Computer Science, Mathematics, Environmental S..."
7,Determining the number of hidden layer and hid...,df1a8eb08dda6c327967dfa4f064991d3ede9169,Artificial neural network (ANN) is one of the ...,The results of the experiment show that the pe...,"Computer Science, Medicine, Engineering, Compu..."
8,Short-term Wind Speed Prediction with a Two-la...,0bc895863b912f4a7fce075e97fbc30a89035981,Wind speed prediction is of great importance b...,This paper investigates the problem of predict...,"Computer Science, Engineering, Environmental S..."
9,Short-Term Wind Speed Prediction Based on Prin...,596ed779bc6c747a91278c2592493c7f01ee571c,An accurate prediction of wind speed is crucia...,"Several prevailing prediction methods, such as...","Computer Science, Environmental Science, Engin..."


Looks like we have a small data quality issue (at least for some queries): some abstracts come back empty. This is nothing that re-running searches with a higher limit and excluding null abstracts/tldrs can't fix.
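
A minimal sketch of the "excluding null abstracts/tldrs" step, using pandas' `dropna`. The toy dataframe below is made up for illustration and only mimics the relevant columns of `paper_df`.

```python
import pandas as pd

# Toy stand-in for paper_df; the rows are made up for illustration
toy_df = pd.DataFrame({
    'title': ['Paper A', 'Paper B', 'Paper C'],
    'abstract': ['An abstract.', None, 'Another abstract.'],
    'tldr': ['A summary.', 'A summary.', None],
})

# Keep only rows where both the abstract and the tldr are present
clean_df = toy_df.dropna(subset=['abstract', 'tldr']).reset_index(drop=True)
print(clean_df['title'].tolist())
```

Because filtering shrinks the result set, the search would need a higher limit than the number of papers actually wanted.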

In [130]:
import time

def semantic_scholar_query(query, limit):
    ''' Takes a string query (spaces written as +, AND as &, OR as |) and an integer limit.
    Returns a Pandas DataFrame with paper titles, ids, abstracts, tldrs, and fields. '''
    base_url = 'https://api.semanticscholar.org/graph/v1/paper/search'

    # Retry while the API answers with an error dict (e.g. when rate limited)
    r = {'message': ''}
    while 'message' in r:
        r = requests.get(base_url, params={'query': query, 'limit': limit}).json()
        if 'message' in r:
            time.sleep(1)  # brief pause so retries don't hammer the API
    titles = [paper['title'] for paper in r['data']]
    ids = [paper['paperId'] for paper in r['data']]

    # Grab additional data for each paper returned from search
    r = requests.post(
        'https://api.semanticscholar.org/graph/v1/paper/batch',
        params={'fields': 'abstract,tldr,url,s2FieldsOfStudy'},
        json={"ids": ids},
    ).json()
    abstracts = [paper['abstract'] for paper in r]

    # tldr is a dict with a 'text' key when present, otherwise None
    tldrs = [paper['tldr']['text'] if paper['tldr'] is not None else None for paper in r]

    # Join each paper's field categories into a comma-separated string
    fields = [', '.join(f['category'] for f in paper['s2FieldsOfStudy']) for paper in r]

    paper_df = pd.DataFrame({'title': titles, 'id': ids, 'abstract': abstracts,
                             'tldr': tldrs, 'field': fields})
    return paper_df

In [136]:
# You can also get links to papers, but this part is not necessary
r = requests.post(
    'https://api.semanticscholar.org/graph/v1/paper/batch',
    params={'fields': 'abstract,tldr,url,s2FieldsOfStudy,isOpenAccess,openAccessPdf'},
    json={"ids": ids},
).json()
print(r[1]['openAccessPdf'])

{'url': 'https://onlinelibrary.wiley.com/doi/pdfdirect/10.1002/ese3.1183', 'status': 'GOLD'}
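
To collect a link for every paper rather than printing one, something like the sketch below could work. It assumes `openAccessPdf` is either a dict with a `url` key (as in the output above) or `None` when no open-access PDF is available. The second sample record is made up to show that case.

```python
# Sample records shaped like the batch response; the second entry is made up
# to show the case where no open-access PDF is available
sample = [
    {'openAccessPdf': {'url': 'https://onlinelibrary.wiley.com/doi/pdfdirect/10.1002/ese3.1183',
                       'status': 'GOLD'}},
    {'openAccessPdf': None},
]

# Take the URL when the dict is present, otherwise keep None
pdf_urls = [(paper.get('openAccessPdf') or {}).get('url') for paper in sample]
print(pdf_urls)
```

The resulting list lines up with the other per-paper lists, so it could be added to the dataframe as another column.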
