# ***Libraries***

In [1]:
import pandas as pd
import json
import requests

# ***Loading the Dataset***

In [2]:
ppg_data = pd.read_csv('/media/work/icarovasconcelos/mono/data/authors-ppg7-6.csv')
ppg_data.head()

Unnamed: 0,ano_calendario,ppg_codigo,ppg_nome,ppg_nota,institution_id,ies_sigla,nome_docente,doutorado_ano,regime_trabalho,carga_horaria,link_do_lattes,author_id,bolsista_produtividade,extrato_bolsa_produtividade,doutorado_institution_id,doutorado_institution_name,doutorado_ppg_codigo,doutorado_supervisor_id,doutorado_supervisor_name
0,2022,42005019016P8,CC,7,I45643870,PUC/RS,TIAGO COELHO FERRETO,2010,Integral,40,http://lattes.cnpq.br/8685431534934812,A5009859711,VERDADEIRO,DT2,I45643870,Pontifícia Universidade Católica do Rio Grande...,31005012004P9,A5071130875,César Augusto Fonticielha De Rose
1,2022,42005019016P8,CC,7,I45643870,PUC/RS,SORAIA RAUPP MUSSE,2000,Integral,40,http://lattes.cnpq.br/2302314954133011,A5059434669,VERDADEIRO,PQ1C,I5124864,École polytechnique fédérale de Lausanne,,A5005709068,Dr Daniel Thalmann
2,2022,42005019016P8,CC,7,I45643870,PUC/RS,SABRINA DOS SANTOS MARCZAK,2011,Integral,40,http://lattes.cnpq.br/9458496222461501,A5014651524,VERDADEIRO,PQ2,I212119943,University of Victoria,,A5007049054,Daniela Damian
3,2022,42005019016P8,CC,7,I45643870,PUC/RS,RODRIGO COELHO BARROS,2013,Integral,20,http://lattes.cnpq.br/8172124241767828,A5039629929,VERDADEIRO,PQ2,I17974374,Universidade de São Paulo,33002045004P1,A5079499583,André Carlos Ponce de Leon Ferreira de Carvalho
4,2022,42005019016P8,CC,7,I45643870,PUC/RS,RAFAEL PRIKLADNICKI,2009,Integral,40,http://lattes.cnpq.br/2007065934836962,A5024645888,VERDADEIRO,PQ1D,I45643870,Pontifícia Universidade Católica do Rio Grande...,31005012004P9,A5022404709,Jorge Luis Nicolas Audy


In [3]:
ppg_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 504 entries, 0 to 503
Data columns (total 19 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   ano_calendario               504 non-null    int64 
 1   ppg_codigo                   504 non-null    object
 2   ppg_nome                     504 non-null    object
 3   ppg_nota                     504 non-null    int64 
 4   institution_id               504 non-null    object
 5   ies_sigla                    504 non-null    object
 6   nome_docente                 504 non-null    object
 7   doutorado_ano                504 non-null    int64 
 8   regime_trabalho              504 non-null    object
 9   carga_horaria                504 non-null    int64 
 10  link_do_lattes               504 non-null    object
 11  author_id                    498 non-null    object
 12  bolsista_produtividade       504 non-null    object
 13  extrato_bolsa_produtividade  277 no

In [4]:
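# Drop faculty without an OpenAlex author ID, keep one row per author,
# and replace the remaining missing values with the string "null".
ppg_data = ppg_data.dropna(subset=['author_id'])
ppg_data = ppg_data.drop_duplicates(subset=['author_id'])
ppg_data = ppg_data.fillna("null")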

In [5]:
ppg_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 491 entries, 0 to 503
Data columns (total 19 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   ano_calendario               491 non-null    int64 
 1   ppg_codigo                   491 non-null    object
 2   ppg_nome                     491 non-null    object
 3   ppg_nota                     491 non-null    int64 
 4   institution_id               491 non-null    object
 5   ies_sigla                    491 non-null    object
 6   nome_docente                 491 non-null    object
 7   doutorado_ano                491 non-null    int64 
 8   regime_trabalho              491 non-null    object
 9   carga_horaria                491 non-null    int64 
 10  link_do_lattes               491 non-null    object
 11  author_id                    491 non-null    object
 12  bolsista_produtividade       491 non-null    object
 13  extrato_bolsa_produtividade  491 non-nul

In [6]:
ppg_data = ppg_data[ppg_data['author_id'] != 'A5012278873']

In [7]:
ppg_data.to_csv('/media/work/icarovasconcelos/mono/data/authors-ppg7-6-processed.csv', index=False)

# ***Accessing OpenAlex***

## *Fetching works from OpenAlex*


- This cell fetches works from the OpenAlex API for every author in `ppg_data` and stores them in a JSON file.

- It first builds a `select` string listing the fields to request from the API. It then initializes `n_works` and `calls` to 0, `works` to an empty list, and `work_ids` to an empty set that is used to skip works already collected. The `cursor` is reset to '*' for each author.

- The script then enters a loop that iterates over each `author_id` in `ppg_data['author_id']`. For each `author_id`, it constructs a URL to fetch works data from the OpenAlex API, filtering by the `author_id` and works published since 2004-01-01.

- Inside this loop, there is another loop that fetches and processes each page of results from the API. It constructs the full URL by appending the `select` string and the `cursor` to the base URL. It then sends a GET request to the API and parses the JSON response.

- Each work in `results` is appended to `works` only if its `id` is not already in `work_ids`, so a paper co-authored by several of the listed authors is stored once; `n_works` counts these unique works. The `cursor` is then updated to the `next_cursor` from the response metadata, which is empty after the last page and ends the inner loop, and the number of API calls `calls` is incremented by 1.

- A progress message is printed after 5, 10, 20, 50, and 100 API calls, and then after every 500 calls.

- The whole fetch is wrapped in a single `try`/`except`. If any request or parsing step raises an exception, its message is printed and both loops stop, so only the works collected up to that point are written to disk.

- After the loop finishes, the script writes the `works` list to a JSON file named 'works_set_since_2004.json'. The `json.dump` function serializes the Python list and writes it to the file.

In [8]:
cursor = '*'

select = ",".join((
    'id',
    'ids',
    'title',
    'display_name',
    'publication_year',
    'publication_date',
    'primary_location',
    'open_access',
    'authorships',
    'cited_by_count',
    'is_retracted',
    'is_paratext',
    'updated_date',
    'created_date',
    'topics',
))

n_works = 0
calls = 0
work_ids = set()  # Use a set to store unique work IDs
works = []  # Use a list to store the work details
try:
    for author_id in ppg_data['author_id']:
        url = f'https://api.openalex.org/works?filter=author.id:{author_id},from_publication_date:2004-01-01'
        # loop through pages
        cursor = '*'
        while cursor:
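            # set cursor value and request page from OpenAlex
            url_1 = f'{url}&select={select}&cursor={cursor}'
            response = requests.get(url_1, timeout=30)
            response.raise_for_status()  # surface HTTP errors instead of a KeyError on 'results'
            page_with_results = response.json()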

            results = page_with_results['results']
            for result in results:
                work_id = result['id']
                if work_id not in work_ids:  # Check if work ID is already in the set
                    work_ids.add(work_id)  # Add work ID to the set
                    works.append(result)  # Append the work to the list
                    n_works += 1
            # update cursor to meta.next_cursor
            cursor = page_with_results['meta']['next_cursor']
            calls += 1
            if calls in [5, 10, 20, 50, 100] or calls % 500 == 0:
                print(f'{calls} api requests made so far')

    print(f'done. made {calls} api requests. collected {n_works} works')

except Exception as e:
    print(f'An exception occurred: {str(e)}')

with open('/media/work/icarovasconcelos/mono/data/works_set_since_2004.json', 'w') as f:
    json.dump(works, f)

5 api requests made so far
10 api requests made so far
20 api requests made so far
50 api requests made so far
100 api requests made so far
500 api requests made so far
1000 api requests made so far
1500 api requests made so far
2000 api requests made so far
2500 api requests made so far
3000 api requests made so far
done. made 3180 api requests. collected 50010 works
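
The cursor loop above can also be factored into small functions, which makes it easier to test and lets one failing author be skipped instead of aborting the whole run. The following is only a sketch under assumptions: `iter_author_works`, `collect_unique`, and the injected `get_json` callable are hypothetical names that do not appear in the original notebook, and the query is passed as a parameter dictionary rather than as a hand-built URL.

```python
from typing import Callable, Iterator

BASE_URL = "https://api.openalex.org/works"


def iter_author_works(get_json: Callable[[str, dict], dict],
                      author_id: str,
                      select: str,
                      since: str = "2004-01-01") -> Iterator[dict]:
    """Yield every work of one author by following cursor pagination.

    `get_json` is a hypothetical injected callable: it receives a URL and a
    parameter dict and returns the decoded JSON page.
    """
    cursor = "*"
    while cursor:
        params = {
            "filter": f"author.id:{author_id},from_publication_date:{since}",
            "select": select,
            "cursor": cursor,
        }
        page = get_json(BASE_URL, params)
        yield from page["results"]
        # An empty next_cursor marks the last page and ends the loop.
        cursor = page["meta"]["next_cursor"]


def collect_unique(get_json, author_ids, select):
    """Collect works for many authors, skipping duplicates and failed authors."""
    works, seen, failed = [], set(), []
    for author_id in author_ids:
        try:
            for work in iter_author_works(get_json, author_id, select):
                if work["id"] not in seen:
                    seen.add(work["id"])
                    works.append(work)
        except Exception as exc:  # record the failure and move to the next author
            failed.append((author_id, str(exc)))
    return works, failed
```

With the `requests` library, `get_json` could be something like `lambda url, params: requests.get(url, params=params, timeout=30).json()`. The returned `failed` list records which authors would need a retry.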


## *Organizing the works data*

- This Python script is used to process a list of works, extract specific information from each work, and store the extracted data into a CSV file.

- The script begins by initializing an empty list `data`. It then iterates over each `work` in the `works` list. For each `work`, it further iterates over each `authorship` in the `work`.

- If an `authorship` exists, it extracts the `author` from the `authorship`. It then obtains the `author_id` and `author_name` from the `author` object, if it exists. The `author_id` is extracted from the last part of the `id` URL. The `author_position` is also extracted from the `authorship`.

- Next, the script iterates over each `institution` in the `authorship`. If an `institution` exists, it extracts the `institution_id`, `institution_name`, and `institution_country_code` from the `institution` object. The `institution_id` is extracted from the last part of the `id` URL.

- The `topic_name` is extracted from the first `topic` in the `work`, if the `work` contains any `topics`.

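- For each institution, the script builds a dictionary with the work, author, institution, and topic fields and appends it to the `data` list. The result is one row per (work, author, institution) combination; an authorship whose `institutions` list is empty produces no row.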

- Once all the works have been processed, the script converts the `data` list into a pandas DataFrame `df_works`. Finally, it writes the DataFrame to a CSV file named '7&6ppg_works_since_2004.csv', with `index=False` to prevent pandas from writing row indices into the CSV file.

In [9]:
data = []
for work in works:
    for authorship in work['authorships']:
        if authorship:
            author = authorship['author']
            author_id = author['id'].split('/')[-1] if author else None
            author_name = author['display_name'] if author else None
            author_position = authorship['author_position']
            for institution in authorship['institutions']:
                if institution:
                    institution_id = institution['id'].split('/')[-1]
                    institution_name = institution['display_name']
                    institution_country_code = institution['country_code']
                    topic_name = work['topics'][0]['display_name'] if 'topics' in work and work['topics'] else None
                    data.append({
                        'work_id': work['id'].split('/')[-1],
                        'work_title': work['title'],
                        'work_display_name': work['display_name'],
                        'work_publication_year': work['publication_year'],
                        'work_publication_date': work['publication_date'],
                        'author_id': author_id,
                        'author_name': author_name,
                        'author_position': author_position,
                        'institution_id': institution_id,
                        'institution_name': institution_name,
                        'institution_country_code': institution_country_code,
                        'topic_name': topic_name,
                    })
                    
df_works = pd.DataFrame(data)
df_works.to_csv('/media/work/icarovasconcelos/mono/data/7&6ppg_works_since_2004.csv', index=False)
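
Because rows are only appended inside the institutions loop, authors without a listed institution do not appear in the CSV. If those authors should be kept, one possible variant is sketched below. It is not the notebook's method: `short_id` and `flatten_work` are hypothetical helper names, only a subset of the columns is shown, and the fallback to a single empty institution is an assumption about the desired behavior.

```python
def short_id(url):
    """Return the last path segment of an OpenAlex URL, or None if missing."""
    return url.rsplit("/", 1)[-1] if url else None


def flatten_work(work):
    """Return one row per (author, institution) for a single work dict."""
    rows = []
    topics = work.get("topics") or []
    topic_name = topics[0]["display_name"] if topics else None
    for authorship in work.get("authorships") or []:
        author = authorship.get("author") or {}
        # Assumption: fall back to one empty institution so the author is kept.
        institutions = authorship.get("institutions") or [{}]
        for inst in institutions:
            rows.append({
                "work_id": short_id(work["id"]),
                "author_id": short_id(author.get("id")),
                "author_name": author.get("display_name"),
                "author_position": authorship.get("author_position"),
                "institution_id": short_id(inst.get("id")),
                "institution_name": inst.get("display_name"),
                "institution_country_code": inst.get("country_code"),
                "topic_name": topic_name,
            })
    return rows
```

The full table would then be built with something like `data = [row for work in works for row in flatten_work(work)]` before passing it to `pd.DataFrame`.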

## *Dictionary of works and their authors*

- This Python script is used to process a list of works, extract specific information about each work and its authors, and store the extracted data into a JSON file.

- The script begins by initializing an empty dictionary `works_and_authors`. It then iterates over each `work` in the `works` list. For each `work`, it initializes an empty list `authors_list` and then iterates over each `authorship` in the `work`.

- If an `authorship` exists, it extracts the `author` from the `authorship`. It then obtains the `author_id` and `author_name` from the `author` object, if it exists. The `author_id` is extracted from the last part of the `id` URL.

- After extracting the author information, the script creates a dictionary with the `author_id` and `author_name` and appends it to the `authors_list`.

- Once all the authorships in a work have been processed, the script adds an entry to the `works_and_authors` dictionary. The key is the `work` id (extracted from the last part of the `id` URL) and the value is the `authors_list`.

- After all the works have been processed, the script writes the `works_and_authors` dictionary to a JSON file named '7&6ppg_works_and_authors_since_2004.json'. The `json.dump` function is used to convert the Python dictionary into a JSON string and write it to the file.

In [10]:
works_and_authors = {}            

for work in works:
    authors_list = []
    for authorship in work['authorships']:
        if authorship:
            author = authorship['author']
            author_id = author['id'].split('/')[-1] if author else None
            author_name = author['display_name'] if author else None
            authors_list.append({
                'author_id': author_id,
                'author_name': author_name,
            })
    works_and_authors[work['id'].split('/')[-1]] = authors_list
    
with open('/media/work/icarovasconcelos/mono/data/7&6ppg_works_and_authors_since_2004.json', 'w') as f:
    json.dump(works_and_authors, f)
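
The dictionary keeps every author of a work, whereas the `data` list from the previous cell only has rows for authorships with at least one institution. A quick way to see how much the two outputs differ is sketched below; `authors_missing_from_rows` is a hypothetical helper that is not part of the original notebook.

```python
def authors_missing_from_rows(works_and_authors, rows):
    """Return {work_id: [author_id, ...]} for authors absent from the flat rows.

    `works_and_authors` has the shape built in the cell above; `rows` is a list
    of dicts with at least 'work_id' and 'author_id' keys, like `data`.
    """
    seen = {(row["work_id"], row["author_id"]) for row in rows}
    missing = {}
    for work_id, authors in works_and_authors.items():
        gone = [a["author_id"] for a in authors
                if (work_id, a["author_id"]) not in seen]
        if gone:
            missing[work_id] = gone
    return missing
```

Calling `authors_missing_from_rows(works_and_authors, data)` would list, per work, the authors that appear in the JSON file but not in the CSV.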

In [11]:
df = pd.read_csv('/media/work/icarovasconcelos/mono/data/7&6ppg_works_since_2004.csv')
df.head()

Unnamed: 0,work_id,work_title,work_display_name,work_publication_year,work_publication_date,author_id,author_name,author_position,institution_id,institution_name,institution_country_code,topic_name
0,W1984712701,Performance Evaluation of Container-Based Virt...,Performance Evaluation of Container-Based Virt...,2013,2013-02-01,A5065379079,Miguel G. Xavier,first,I45643870,Pontifícia Universidade Católica do Rio Grande...,BR,Cloud Computing and Big Data Technologies
1,W1984712701,Performance Evaluation of Container-Based Virt...,Performance Evaluation of Container-Based Virt...,2013,2013-02-01,A5062060864,Marcelo Veiga Neves,middle,I45643870,Pontifícia Universidade Católica do Rio Grande...,BR,Cloud Computing and Big Data Technologies
2,W1984712701,Performance Evaluation of Container-Based Virt...,Performance Evaluation of Container-Based Virt...,2013,2013-02-01,A5075787478,Fábio Diniz Rossi,middle,I45643870,Pontifícia Universidade Católica do Rio Grande...,BR,Cloud Computing and Big Data Technologies
3,W1984712701,Performance Evaluation of Container-Based Virt...,Performance Evaluation of Container-Based Virt...,2013,2013-02-01,A5009859711,Tiago Ferreto,middle,I45643870,Pontifícia Universidade Católica do Rio Grande...,BR,Cloud Computing and Big Data Technologies
4,W1984712701,Performance Evaluation of Container-Based Virt...,Performance Evaluation of Container-Based Virt...,2013,2013-02-01,A5018576433,Timoteo Alberto Peters Lange,middle,I45643870,Pontifícia Universidade Católica do Rio Grande...,BR,Cloud Computing and Big Data Technologies
