# OpenAlex

The SYNERGY datasets do not include the article texts themselves.

Instead, they store IDs that refer to records in the OpenAlex repository.

The titles and abstracts must be retrieved by these IDs before the texts can be used.

## Load Datasets
Load the datasets into memory:

In [1]:
import os
import pandas as pd

data_directory_uniform = '../../../../data/datasets/01_uniform'

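# keep only CSV files; sorting makes the load order reproducible
files = sorted(file for file in os.listdir(data_directory_uniform) if file.endswith('.csv'))

# the subject is the part of the file name before '_uniform'
subjects = [file.split('_uniform')[0] for file in files]

# map each subject to its dataframe
uniform_datasets = {
    subject: pd.read_csv(os.path.join(data_directory_uniform, file))
    for subject, file in zip(subjects, files)
}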
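
The dictionary keys come from the file names: everything before `_uniform` is treated as the subject. A minimal sketch of that step in isolation, assuming the files follow a `<subject>_uniform.csv` pattern. The helper `subject_from_filename` and the file names are hypothetical and used for illustration only.

```python
def subject_from_filename(filename: str) -> str:
    """Return the part of the file name before '_uniform'."""
    return filename.split('_uniform')[0]


# hypothetical file names, for illustration only
example_files = ['Example_2020_uniform.csv', 'Other_2019_uniform.csv']
subjects = [subject_from_filename(name) for name in example_files]
print(subjects)
```

A name that does not contain `_uniform` is returned unchanged, so any stray file in the directory would also end up as a key unless it is filtered out first.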

## Query API
For each article, retrieve the title and abstract by its OpenAlex ID:

In [None]:
import pyalex
from tqdm.notebook import tqdm

for dataset, data in tqdm(uniform_datasets.items(), desc='Downloading datasets'):

    # collect titles and abstracts, then add them as whole columns below
    titles = []
    abstracts = []

    for index, row in tqdm(data.iterrows(), total=len(data), desc=dataset):

        # rows without an OpenAlex ID cannot be queried
        if pd.isna(row['openalex_id']):
            titles.append(pd.NA)
            abstracts.append(pd.NA)
        else:
            # retrieve the work through the API; pyalex rebuilds the plain-text
            # abstract from OpenAlex's inverted index
            work = pyalex.Works()[row['openalex_id']]

            titles.append(work['title'])
            abstracts.append(work['abstract'])

    data['title'] = titles
    data['abstract'] = abstracts
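
The OpenAlex API does not serve abstracts as plain text. It returns an `abstract_inverted_index` that maps each word to the positions where it occurs, and `pyalex` converts this to text when `['abstract']` is accessed. The sketch below shows roughly what that conversion involves. The function name `abstract_from_inverted_index` and the sample index are hypothetical, and the notebook itself does not need this helper.

```python
def abstract_from_inverted_index(inverted_index):
    """Rebuild plain text from a {word: [positions]} mapping."""
    if not inverted_index:
        # works without an abstract have no index
        return None
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    # sort by position and join the words with single spaces
    return ' '.join(word for _, word in sorted(positioned))


# made-up index, for illustration only
example_index = {
    'Screening': [0],
    'is': [1, 4],
    'slow': [2],
    'and': [3],
    'tedious': [5],
}
example_abstract = abstract_from_inverted_index(example_index)
print(example_abstract)
```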

## Save locally
Save the downloaded data locally:

In [7]:
directory_to_save = '../../../data/02_openalex'

# write one CSV file per subject
for subject, dataframe in uniform_datasets.items():
    dataframe.to_csv(f'{directory_to_save}/{subject}_openalex.csv', index=False)
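
`to_csv` fails if the target directory does not exist yet, so it can help to create the directory before saving. A minimal sketch, assuming the same `<subject>_openalex.csv` naming as above. The helper `openalex_csv_path` is hypothetical, and the example writes into a temporary directory rather than the project's data folder.

```python
from pathlib import Path
import tempfile


def openalex_csv_path(directory, subject):
    """Create `directory` if needed and return the CSV path for `subject`."""
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    return target / f'{subject}_openalex.csv'


# demonstrate in a scratch directory so nothing is left behind
with tempfile.TemporaryDirectory() as scratch:
    path = openalex_csv_path(Path(scratch) / '02_openalex', 'Example_2020')
    directory_created = path.parent.is_dir()
    file_name = path.name

print(file_name)
```

In the notebook, the returned path could be passed straight to `dataframe.to_csv(path, index=False)`.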