In [7]:
import pandas as pd
import requests
from pathlib import Path
from pypdf import PdfReader
import os
from dotenv import load_dotenv

I keep a separate exploration notebook for each part of my methods, but all of the shared functions developed along the way are collected in the functions.py file.

In [8]:
load_dotenv()
core_api_key = os.getenv("core_api_key", '')
headers={"Authorization":"Bearer " + core_api_key}
query = 'wind+speed+(prediction|predict)'
url = 'https://api.core.ac.uk/v3/search/works/'
r = requests.get(url + '?q=' + query + '&limit=25', headers=headers).json()
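
The request URL above is assembled by plain string concatenation. As a minimal sketch, the same assembly can be pulled into a small helper so the limit and offset are easy to vary. `build_search_url` and `CORE_SEARCH_URL` are hypothetical names introduced here for illustration (they are not part of the CORE API or of functions.py), and the helper assumes the query already uses `+` in place of spaces.

```python
CORE_SEARCH_URL = 'https://api.core.ac.uk/v3/search/works/'

def build_search_url(query, limit=25, offset=0):
    """Hypothetical helper: build the search URL by string concatenation, as the cell above does."""
    # Assumption: the query is already formatted, e.g. 'wind+speed+(prediction|predict)'
    return f'{CORE_SEARCH_URL}?q={query}&limit={limit}&offset={offset}'

print(build_search_url('wind+speed+(prediction|predict)'))
```

The returned string can be passed to `requests.get` together with the same `headers` dictionary. Keeping the URL construction separate from the network call makes it possible to check the URL without spending an API request.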

We can get a lot of metadata from each paper, in addition to the full text of the paper:

In [9]:
r['results'][0]

{'acceptedDate': '2014-03-20T00:00:00',
 'arxivId': '1305.3696',
 'authors': [{'name': "D'Amico, Guglielmo"},
  {'name': 'Petroni, Filippo'},
  {'name': 'Prattico, Flavio'}],
 'citationCount': 0,
 'contributors': [],
 'outputs': ['https://api.core.ac.uk/v3/outputs/297903089',
  'https://api.core.ac.uk/v3/outputs/54533522',
  'https://api.core.ac.uk/v3/outputs/209880239'],
 'createdDate': '2014-10-24T19:18:15',
 'dataProviders': [{'id': 144,
   'name': '',
   'url': 'https://api.core.ac.uk/v3/data-providers/144',
   'logo': 'https://api.core.ac.uk/data-providers/144/logo'},
  {'id': 4786,
   'name': '',
   'url': 'https://api.core.ac.uk/v3/data-providers/4786',
   'logo': 'https://api.core.ac.uk/data-providers/4786/logo'},
  {'id': 8857,
   'name': '',
   'url': 'https://api.core.ac.uk/v3/data-providers/8857',
   'logo': 'https://api.core.ac.uk/data-providers/8857/logo'},
  {'id': 1084,
   'name': '',
   'url': 'https://api.core.ac.uk/v3/data-providers/1084',
   'logo': 'https://api.cor

We can take some of this information to create a nice summary of the paper:

In [10]:
i = 0
print(r['results'][i]['title'])
print(r['results'][i]['abstract'])
print(r['results'][i]['downloadUrl'])

Wind speed forecasting at different time scales: a non parametric
  approach
The prediction of wind speed is one of the most important aspects when
dealing with renewable energy. In this paper we show a new nonparametric model,
based on semi-Markov chains, to predict wind speed. Particularly we use an
indexed semi-Markov model, that reproduces accurately the statistical behavior
of wind speed, to forecast wind speed one step ahead for different time scales
and for very long time horizon maintaining the goodness of prediction. In order
to check the main features of the model we show, as indicator of goodness, the
root mean square error between real data and predicted ones and we compare our
forecasting results with those of a persistence model
http://arxiv.org/abs/1305.3696
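
The three print statements above can be wrapped into one reusable function. The following is a rough sketch under stated assumptions: `summarize_paper` is a hypothetical name, and the `example` dictionary is a shortened stand-in with the same keys as `r['results'][0]`, not a real API response.

```python
def summarize_paper(result):
    """Hypothetical helper: format one search result as a short text summary."""
    # .get() avoids a KeyError if a field is absent; missing or None values become 'n/a'
    title = result.get('title') or 'n/a'
    abstract = result.get('abstract') or 'n/a'
    link = result.get('downloadUrl') or 'n/a'
    # Collapse the line breaks and repeated spaces that appear in the raw metadata
    title = ' '.join(title.split())
    abstract = ' '.join(abstract.split())
    return f'{title}\n{abstract}\n{link}'

# Stand-in record for illustration only; values are shortened from the output above
example = {
    'title': 'Wind speed forecasting at different time scales: a non parametric\n  approach',
    'abstract': 'The prediction of wind speed is one of the most important aspects when\ndealing with renewable energy.',
    'downloadUrl': 'http://arxiv.org/abs/1305.3696',
}
print(summarize_paper(example))
```

Joining on whitespace puts the title back on a single line, which the raw output above splits across two.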


Let's define a function that returns the metadata we want, along with the full text of each paper:

In [18]:
def core_query(query, limit=10, offset=0):
    ''' Takes a query string in which spaces are replaced with +, 'and' is written as AND,
    and 'or' is written as OR, along with an integer limit and an integer offset.
    Returns a Pandas DataFrame with paper titles, abstracts, links, and text. '''
    paper_df = pd.DataFrame()
    
    # Read the API key from the environment (loaded by load_dotenv() above) instead of hard-coding it
    core_api_key = os.getenv("core_api_key", '')
    headers = {"Authorization": "Bearer " + core_api_key}
    url = 'https://api.core.ac.uk/v3/search/works/'
    r = requests.get(url + '?q=' + query + f'&limit={limit}' + f'&offset={offset}', headers=headers).json()['results']
    
    paper_df['title'] = [paper['title'] for paper in r]
    paper_df['abstract'] = [paper['abstract'] for paper in r]
    # Fall back to an empty string when a value is None, so the string operations below do not fail
    paper_df['link'] = [paper['downloadUrl'] or '' for paper in r]
    paper_df['text'] = [paper['fullText'] or '' for paper in r]
    
    for i in range(len(paper_df)):
        # Modify arxiv links to point directly to the article
        # (match the full '/abs/' path so other URLs that merely contain 'abs' are left alone)
        url = paper_df.loc[i, 'link']
        if 'arxiv.org/abs/' in url:
            url = url.replace('/abs/', '/pdf/', 1)
        paper_df.loc[i, 'link'] = url
        
        # Remove links and newlines from the text (to reduce noise)
        text = paper_df.loc[i, 'text']
        text = text.replace("\n", "")
        # Replace every link with the placeholder token 'http'
        new_text = ['http' if word.startswith('http') else word for word in text.split(" ")]
        paper_df.loc[i, 'text'] = " ".join(new_text)
        
    return paper_df
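
The two clean-up steps inside `core_query` can also be written as standalone helpers, which makes them easier to check in isolation. This is a minimal sketch, not the code in functions.py: `arxiv_abs_to_pdf` and `clean_text` are hypothetical names. Note one deliberate difference: `clean_text` turns each newline into a single space, whereas `core_query` deletes newlines outright, which can fuse the last word of one line with the first word of the next.

```python
def arxiv_abs_to_pdf(url):
    """Hypothetical helper: point an arXiv abstract link at the PDF instead."""
    # Only rewrite genuine arXiv abstract links; other URLs are returned unchanged
    if 'arxiv.org/abs/' in url:
        return url.replace('/abs/', '/pdf/', 1)
    return url

def clean_text(text):
    """Hypothetical helper: replace links with 'http' and collapse all whitespace."""
    # str.split() with no argument splits on any run of whitespace, including newlines
    words = ['http' if word.startswith('http') else word for word in text.split()]
    return ' '.join(words)

print(arxiv_abs_to_pdf('http://arxiv.org/abs/1305.3696'))
print(clean_text('PREDICTION OF POWER GENERATION\nFROM A WIND FARM see https://example.org'))
```

Whether to delete newlines or replace them with spaces depends on the source text: replacing keeps words apart, while deleting can be preferable if the extracted text breaks lines in the middle of words.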

In [19]:
df = core_query('wind+speed+(prediction|predict)', 10)
display(df)

,title,abstract,link,text
0,Wind speed forecasting at different time scale...,The prediction of wind speed is one of the mos...,http://arxiv.org/pdf/1305.3696,Wind speed forecasting at different time scale...
1,Wind power prediction models,Investigations were performed to predict the p...,https://core.ac.uk/download/pdf/42880399.pdf,General Disclaimer One or more of the Followin...
2,Prediction of power generation from a wind farm,Wind farms produce a variable power output dep...,https://core.ac.uk/download/pdf/6182363.pdf,PREDICTION OF POWER GENERATIONFROM A WIND FARM...
3,Hybrid Wind Speed Prediction Model Using Intri...,"Before sitting a wind turbine, reliable wind s...",https://core.ac.uk/download/480425308.pdf,INTERNATIONAL JOURNAL OF INTEGRATED ENGINEERI...
4,Wind Turbine Noise and Wind Speed Prediction,In order to meet the US Department of Energy p...,https://core.ac.uk/download/229163990.pdf,Georgia Southern University Digital Commons@Ge...
5,Short-Term Wind Speed Forecasting Model Using ...,"Wind speed is one of the most vital, imperativ...",https://core.ac.uk/download/577152945.pdf,Chapter 7 Short-Term Wind Speed Forecasting Mo...
