# Overview
The data for this repository is organized under a primary directory named `data`. This directory contains the following subdirectories:

- `/raw`: Contains the raw text data of papers. This can be sourced from OpenAlex, a custom dataset, or other sources.
- `/processed`: Contains processed versions of the papers. Papers are sorted by publication date (according to MAG). The following files are expected here:
    - `papers_words.csv`
    - `papers_phrases.csv`
    
    Each of the processed files should have the following three columns:
    - *`PaperID`*
    - *`Title_Words`* (or `Title_Phrases` for noun phrases)
    - *`Abstract_Words`* (or `Abstract_Phrases` for noun phrases)
    
> **_NOTE 1:_**  All papers MUST be sorted by publication date.

> **_NOTE 2:_** NaN (`np.nan`) is used wherever the abstract is unavailable.
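
As a quick sanity check of the layout described above, the header of a processed file can be compared against the expected column names. The sketch below is only illustrative: the helper name `check_processed_header` is hypothetical, and it assumes a comma-separated file.

```python
import csv
import io

# Expected column names, taken from the list above
EXPECTED_WORDS = ["PaperID", "Title_Words", "Abstract_Words"]
EXPECTED_PHRASES = ["PaperID", "Title_Phrases", "Abstract_Phrases"]


def check_processed_header(handle, expected, delimiter=","):
    """Return True if the first row of an open CSV file equals `expected`.

    Hypothetical helper; the comma delimiter is an assumption.
    """
    header = next(csv.reader(handle, delimiter=delimiter), [])
    return header == expected


# Usage with an in-memory stand-in for papers_words.csv
sample = io.StringIO("PaperID,Title_Words,Abstract_Words\n1,deep learn,\n")
print(check_processed_header(sample, EXPECTED_WORDS))
```

With a real file, the same function can be called on `open('../data/processed/papers_words.csv', encoding='utf-8')`.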


# Data Selection Guide: *Choosing the Dataset*
This repository is based on OpenAlex data. The data output is available at https://zenodo.org/record/13869486 and covers all MAG papers (both journal and conference papers) published between 1800 and 2020.

While the underlying code remains consistent, users might have different data requirements. Some may prefer a custom dataset. This notebook outlines four distinct approaches to data acquisition.

- **Approach 1**: Use custom data
- **Approach 2**: Use OpenAlex data
- **Approach 3**: Use a subset of Zenodo (OpenAlex papers) data
- **Approach 4**: Use the entire Zenodo (OpenAlex papers) data

## Approach 1: Custom Data
Data should be placed in a file named `papers_raw.csv` within the `data/raw/` directory.
- The file must have the following columns:
    - *`PaperID`*: Unique identifier of the paper
    - *`Date`*: Date of publication of the paper in the format *yyyy-mm-dd*
    - *`Title`*: Title of the paper (non-processed)
    - *`Abstract`*: Abstract of the paper (non-processed)
    
If an abstract is unavailable, use a placeholder value of `None` or `np.nan`.

> **_NOTE:_**  All papers MUST be sorted by publication date.
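
The requirements above (four columns, *yyyy-mm-dd* dates, ascending date order) can be checked before running the rest of the pipeline. The following is a minimal sketch rather than part of the repository: `validate_raw` is a hypothetical helper, and the tab delimiter is an assumption.

```python
import csv
import io
from datetime import datetime

REQUIRED_COLUMNS = ["PaperID", "Date", "Title", "Abstract"]


def validate_raw(handle, delimiter="\t"):
    """Check columns, date format, and date ordering of an open raw file.

    Hypothetical helper: returns a list of problem descriptions (empty if none).
    The tab delimiter is an assumption.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    previous = None
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        try:
            current = datetime.strptime(row["Date"], "%Y-%m-%d").date()
        except (TypeError, ValueError):
            problems.append(f"row {row_number}: bad date {row['Date']!r}")
            continue
        if previous is not None and current < previous:
            problems.append(f"row {row_number}: not sorted by date")
        previous = current
    return problems


# Usage with an in-memory stand-in for papers_raw.csv
sample = io.StringIO(
    "PaperID\tDate\tTitle\tAbstract\n"
    "1\t2001-05-01\tFirst paper\tSome abstract\n"
    "2\t1999-01-01\tOlder paper\t\n"
)
print(validate_raw(sample))
```

An empty list means the file passed all three checks.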

## Approach 2: OpenAlex data

Data from OpenAlex can be obtained using the OpenAlex API (https://docs.openalex.org/).

There are several strategies for downloading data from OpenAlex.

The standard strategy shown here retrieves papers by searching for keywords in the title, abstract, or full text. The following query is used:

> **`(natural language processing) & novelty`**

In [None]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.insert(1, '../science_novelty/')

import preprocessing
import pandas as pd
import requests
import time
import os
from tqdm.notebook import tqdm


### Query: Title, Abstract, Full text

> **_NOTE:_**  All papers MUST be sorted by publication date.


In [None]:
# OpenAlex API URL
url = "https://api.openalex.org/works"

# This is an example query
query = '(natural language processing) & novelty'

# Define the initial page and per page variables
page = 1
per_page = 100 
papers = []

params = {'search': query, 'filter': 'type:article'}

# Send a GET request to the API
response = requests.get(url, params=params)
count = response.json()['meta']['count']
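# Ceiling division: requesting extra pages after the last cursor would restart from the first page
total_pages = (count + per_page - 1) // per_page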

print('Total papers: %d'%(count))

time.sleep(1)

print('Start querying...')
# Loop through all pages (+1 to get the last page)
for page in tqdm(range(1,total_pages + 1)):
    
    # Get the cursor for the first page
    if page == 1:
        cursor = '*'

    params = {
        'search': query,
        'sort': 'publication_date',
        'per-page': per_page,
        'filter': 'type:article',
        'cursor' : cursor
    }

    # Send a GET request to the API
    response = requests.get(url, params=params)

    # If the request is successful
    if response.status_code == 200:
        data = response.json()

        # Get the data
        results = data.get('results',[])
        
        # Select the information needed from these publications
        papers.extend([((res['id'].split('/')[-1].replace('W','')),
                        res['publication_date'],
                        res['title'],
                        preprocessing.plain_text_from_inverted(res['abstract_inverted_index'])) 
                               for res in results])
        
        # Get the next cursor for the pagination
        cursor = data['meta']['next_cursor']

        # A missing cursor means there are no more pages
        if cursor is None:
            break

        # Respect the API rate limit
        time.sleep(1)
        
    else:
        print(f"Request failed with status code {response.status_code}.")
        break
    
print('Creating the dataframe...')
papers = pd.DataFrame(papers, columns = ['PaperID','Date','Title','Abstract'])

print('Dropping papers with both title and abstract missing...')
papers = papers.dropna(subset = ['Title','Abstract'], how = 'all')

#### Export the data
Make sure to export the data as a tab-separated file.

In [None]:
papers.to_csv('../data/raw/papers_raw.csv', index = False, sep = '\t')
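
The download cell above relies on `preprocessing.plain_text_from_inverted` to turn the `abstract_inverted_index` field into plain text. OpenAlex stores abstracts as a mapping from each word to the positions where it occurs. The sketch below illustrates the general idea only; `inverted_index_to_text` is a hypothetical stand-in and may differ from the repository's own function.

```python
def inverted_index_to_text(inverted_index):
    """Rebuild plain text from a {word: [positions]} mapping.

    Hypothetical stand-in, not the repository's `plain_text_from_inverted`.
    Returns None when no index is available.
    """
    if not inverted_index:
        return None
    # Flatten to (position, word) pairs, then order them by position
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    return " ".join(word for _, word in sorted(positioned))


example = {"novelty": [0, 3], "in": [1], "science": [2]}
print(inverted_index_to_text(example))
```

Returning `None` for a missing index keeps such papers compatible with the `dropna` step used above.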

## Approach 3: Subset of Zenodo (MAG) Data
The Zenodo repository, available at https://zenodo.org/record/8283353, provides data on all MAG papers (both journal and conference papers) published between 1800 and 2020. However, the raw text of titles and abstracts is not available. Instead, the repository offers processed versions of these papers.

To extract a specific subset from the Zenodo repository, there are two recommended methods:

1. **Textual Search**: Navigate through the processed titles and abstracts to identify desired papers.
2. **ID-Based Selection**: Utilize a custom list of identifiers to filter desired papers.


### Textual Search in the title and abstract 

To avoid exhausting RAM, the files are read and written line by line.

- Data from Zenodo should be downloaded and placed in a separate folder outside this repository.
- The file `papers_words.csv` is read line by line.
- For each line, the query terms are searched for, and the papers whose title and abstract contain them are filtered out into two files:
    - `papers_ids.csv`, placed in `data/raw/`, with a single column holding one matching paper identifier per row
    - `papers_words.csv`, placed in `data/processed/`, with the matching lines of the processed file

In this case, the following query strategy will be followed: search for the papers containing both `(natural language processing)` and `novelty`.

Users can apply their own search strategies.

> **_NOTE:_**  All papers are sorted by publication date (using MAG date).

In [None]:
# select the words
# all lower cased
terms = ['natural language','novelty']

path_external = 'D:/PublicDatabases/ScientificPublications/Papers/TitleAbstract/MAG_SN/Science_Novelty/'

print('Preparing for reading...')
words = open(path_external + 'papers_words.csv','r', encoding = 'utf-8')
next(words) # skip the header line


print('Preparing for writing...')
words_write = open('../data/processed/papers_words.csv','w', encoding = 'utf-8')
words_write.write('PaperID,Words_Title,Words_Abstract\n') # write the first line for the headers


# Iterate over the file directly so that each item is one line of text
for line_words in tqdm(words, total = 83490800):
    
    # Keep the line only if it contains every search term
    if all(term in line_words for term in terms):
        
        words_write.write(line_words)
        
print('Done.') 

words.close()
words_write.close()
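
The cell above writes only the filtered `papers_words.csv`. A minimal sketch of also collecting the matching identifiers for `papers_ids.csv` in the same pass is given below. The helper names `filter_lines` and `write_matches` are hypothetical, and the code assumes that the identifier is the first comma-separated field of each line.

```python
import io


def filter_lines(lines, terms):
    """Yield (paper_id, line) for every line containing all `terms`.

    Hypothetical helper; assumes the identifier is the first comma-separated field.
    """
    for line in lines:
        if all(term in line for term in terms):
            yield line.split(",", 1)[0], line


def write_matches(source, words_out, ids_out, terms):
    """Stream `source` once, writing matching lines and their identifiers."""
    header = next(source, None)  # copy the header row unchanged
    if header is None:
        return 0
    words_out.write(header)
    ids_out.write("PaperID\n")
    matched = 0
    for paper_id, line in filter_lines(source, terms):
        words_out.write(line)
        ids_out.write(paper_id + "\n")
        matched += 1
    return matched


# Usage with in-memory stand-ins for the real files
source = io.StringIO(
    "PaperID,Title_Words,Abstract_Words\n"
    "1,natural language parsing,novelty detection\n"
    "2,graph theory,\n"
)
words_out, ids_out = io.StringIO(), io.StringIO()
print(write_matches(source, words_out, ids_out, ["natural language", "novelty"]))
print(ids_out.getvalue())
```

Because the input is streamed in its original order, the output keeps the publication-date ordering of the source file.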

### Use a custom set of papers
To use a subset of the Zenodo data:

1. Prepare a file named `papers_ids.csv` in the `data/raw/` directory.

    This file should contain a single column with a header indicating the type of identifier used. Acceptable headers are:

    - `PaperID` (for MAG or OpenAlex identifiers)
    - `DOI` (for Digital Object Identifier)
    - `PubMedID` (for PubMed's unique identifier)
    
    > Note: MAG and OpenAlex identifiers are similar. To convert a MAG identifier to an OpenAlex identifier, prepend 'W' to the MAG identifier.
    
2. Place the file with paper identifiers (`papers.csv` from the Zenodo repository) and the original processed files (such as `papers_words.csv`) in an external folder outside this repository.

3. Filter the files using `papers_ids.csv`, creating filtered versions of the processed files (such as `papers_words.csv`) in the `data/processed/` directory. Ensure that the files contain only the data intended for the analysis.

> **_NOTE:_**  All papers MUST be sorted by publication date.
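
The filtering in step 3 can be sketched as a single streaming pass over a processed file. This is an illustration under stated assumptions, not the repository's implementation: `load_ids` and `filter_by_ids` are hypothetical helpers, and the identifiers are assumed to be `PaperID` values matching the first comma-separated field of the processed file. For `DOI` or `PubMedID`, the identifiers would first need to be mapped to `PaperID` through `papers.csv`, whose layout is not described here.

```python
import csv
import io


def load_ids(handle):
    """Read the single-column identifier file into a set, skipping the header."""
    reader = csv.reader(handle)
    next(reader, None)
    return {row[0].strip() for row in reader if row}


def filter_by_ids(source, out, wanted_ids):
    """Copy the header and every line whose first field is in `wanted_ids`."""
    header = next(source, None)
    if header is None:
        return 0
    out.write(header)
    kept = 0
    for line in source:
        if line.split(",", 1)[0] in wanted_ids:
            out.write(line)
            kept += 1
    return kept


# Usage with in-memory stand-ins for papers_ids.csv and papers_words.csv
ids = load_ids(io.StringIO("PaperID\n2\n3\n"))
source = io.StringIO("PaperID,Title_Words,Abstract_Words\n1,a,b\n2,c,d\n3,e,\n")
out = io.StringIO()
print(filter_by_ids(source, out, ids))
```

Holding only the identifier set in memory keeps RAM usage low, and streaming the source in order preserves its publication-date sorting.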

## Approach 4: Entire Zenodo Repository
For those interested in using the complete dataset (comprising 83,490,800 papers) from the Zenodo repository, place the processed files in the `data/processed/` directory:

- `papers_words.csv`

> **_NOTE:_**  All papers MUST be sorted by publication date.