# OpenAlex Extraction and Matching with .search()

The goal of this notebook is to look up the PhD students (Authors) contained in the [cleaned](clean_data.ipynb) NARCIS dataset, and
1. Confirm they can be found in OpenAlex
2. Confirm their affiliation in NARCIS matches the one in OpenAlex
3. Confirm they wrote the associated PhD thesis
4. Per author, look up all the contributors (i.e. potential first supervisors) that are listed in the NARCIS dataset and
    a. Find all authors that have worked for the same organization at the time the PhD thesis was published (within a 1 year window)
    b. xxx
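
The same-organization check in the last step can be sketched as follows. This is a minimal illustration, not the code in `src.open_alex_helpers`: it assumes each author is a plain dict shaped like an OpenAlex author record, with an `affiliations` list whose entries carry an `institution` (with a `display_name`) and a list of `years`. The function name `affiliated_within_window` and the sample record are hypothetical.

```python
def affiliated_within_window(author, institution_name, thesis_year, years_tolerance=1):
    """Return True if `author` lists `institution_name` in a year that lies
    within `years_tolerance` years of `thesis_year`."""
    target = institution_name.casefold()
    for affiliation in author.get("affiliations", []):
        # Guard against a missing or null institution entry
        institution = affiliation.get("institution") or {}
        if institution.get("display_name", "").casefold() != target:
            continue
        years = affiliation.get("years", [])
        if any(abs(year - thesis_year) <= years_tolerance for year in years):
            return True
    return False


# Hypothetical record, for illustration only
supervisor = {
    "display_name": "Example Supervisor",
    "affiliations": [
        {"institution": {"display_name": "Utrecht University"}, "years": [2012, 2013, 2014]},
        {"institution": {"display_name": "Leiden University"}, "years": [2018, 2019]},
    ],
}

print(affiliated_within_window(supervisor, "Utrecht University", 2015))
```

Matching on the display name keeps the sketch short; comparing OpenAlex institution IDs would likely be more robust against spelling variants.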


The previous version of this notebook, written by a Bachelor student, used the `.search_filter()` method of `pyalex`, which does not search alternate spellings of the specified name. This notebook uses `.search()` instead, which does not have that problem. See the example code [here](search_parameter_vs_search_filter.ipynb).
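
The difference between the two lookups shows up in the request sent to the OpenAlex API. As far as we understand `pyalex`, `.search("name")` maps to the `search=` query parameter, while `.search_filter(display_name="name")` maps to the `filter=display_name.search:` filter. The sketch below builds both URLs by hand with the standard library, so it runs without network access; the helper names `search_parameter_url` and `search_filter_url` are hypothetical and not part of `pyalex`.

```python
from urllib.parse import urlencode

OPENALEX_AUTHORS_URL = "https://api.openalex.org/authors"


def search_parameter_url(name):
    """URL for the `search` query parameter (what `.search()` is assumed to send)."""
    return f"{OPENALEX_AUTHORS_URL}?{urlencode({'search': name})}"


def search_filter_url(name):
    """URL for the `display_name.search` filter (what `.search_filter()` is assumed to send)."""
    query = {"filter": f"display_name.search:{name}"}
    # Keep the colon unescaped so the filter stays readable
    return f"{OPENALEX_AUTHORS_URL}?{urlencode(query, safe=':')}"


print(search_parameter_url("Jan de Vries"))
print(search_filter_url("Jan de Vries"))
```

Opening both URLs in a browser for a name with known spelling variants is a quick way to compare the result counts.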

## 1. Setup

In [None]:
#from pyalex import Works, Authors, Sources, Institutions, Topics, Publishers, Funders, Concepts
import pyalex # importing full package seems to be the only way to call `pyalex.config.email = email_address`
import pandas as pd
from os import path

from src.unabbreviate_institutions import unabbreviate_institutions
from src.open_alex_helpers import AuthorRelations, find_phd_and_supervisors_in_row, get_supervisors_openalex_ids
from src.io_helpers import fetch_supervisors_from_pilot_dataset
from src.clean_names_helpers import format_name_to_lastname_firstname

In [None]:
# Number of rows to read of the full dataset.
NROWS = 25 # None for all

Notebook settings

In [None]:
# Automatically reload any modules that are imported,
# so that any changes made to the module files are reflected
# without needing to restart the Jupyter kernel.
# load autoreload module
%load_ext autoreload
# mode 1 reloads only modules imported with %aimport. For production
# mode 2 reloads before execution of every cell
%autoreload 2

# limit the number of rows that are shown when printing DataFrames
pd.set_option('display.max_rows', 5)

Set a contact email address to get into the [polite pool](https://docs.openalex.org/how-to-use-the-api/rate-limits-and-authentication#the-polite-pool). If you have a premium plan, you can access it via your email address as well.

In [None]:
# Get contact email address from file
email_file_path = 'contact_email.txt'

if path.isfile(email_file_path):
    with open(email_file_path, 'r') as file:
        email_address = file.read().strip()

    # Assign the email address to the pyalex configuration
    pyalex.config.email = email_address

pyalex.config.email

## 2. Load datasets

### 2.1 Cleaned processed NARCIS dataset

In [None]:
pubs_df = pd.read_csv('data/cleaned/pubs.csv')

# Take a sample
if NROWS is None:
    n_sample = len(pubs_df)
else:
    n_sample = NROWS

#pubs_df = pubs_df.sample(n=n_sample, random_state=42).reset_index(drop=True)

pubs_df = pubs_df.head(n_sample) # use head for now because we actually do find some of these PhDs
    
pubs_df

In [None]:
# Replace institution abbreviations with names that can be found in OpenAlex
pubs_unabbrev_df = unabbreviate_institutions(pubs_df, 'institution')
pubs_unabbrev_df

### 2.2 Priority supervisor list from ResponsibleSupervision pilot

This dataset was created during the Responsible Supervision pilot project, see [here](https://github.com/tamarinde/ResponsibleSupervision/tree/main/Pilot-responsible-supervision).

In [None]:
repo_url = "https://github.com/tamarinde/ResponsibleSupervision/tree/main/Pilot-responsible-supervision/data/spreadsheets"
csv_path = "data/output/sups_pilot.csv"

try:
    # Attempt to read the supervisors in the pilot dataset from csv_path
    # If it fails, we get them again from GitHub
    supervisors_ids = get_supervisors_openalex_ids(repo_url, csv_path)
    print("Unique Supervisors with OpenAlex IDs:")
    print(supervisors_ids)
except Exception as e:
    print(f"An error occurred: {e}")

# NOTE: The problem with this is that I am getting a lot of matches in OA with all the supervisors. But how do I match them with the spelling and ID I will find in my final dataset?
# SOLUTION: I will just list all the author IDs here that I come across. We compare those with the IDs in the final dataset, and whenever we have a match we give the bonus points to that supervisor! 

# Still looking up all the supervisors in OA takes a while and some queries. So I should buffer the result to a csv and only 
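
The buffering idea from the note above could look roughly like this. It is a sketch of the general "read the cache, otherwise fetch and save" pattern, not the implementation of `get_supervisors_openalex_ids`. The function name `load_or_fetch_ids`, the `openalex_id` column name, and the `fetch` callable are assumptions made for illustration. The standard-library `csv` module is used to keep the example dependency-free; `pandas` would work equally well.

```python
import csv
from pathlib import Path


def load_or_fetch_ids(cache_path, fetch):
    """Return the cached IDs if `cache_path` exists; otherwise call `fetch()`,
    write the result to `cache_path`, and return it."""
    cache = Path(cache_path)
    if cache.is_file():
        with cache.open(newline="") as handle:
            return [row["openalex_id"] for row in csv.DictReader(handle)]

    ids = list(fetch())  # the expensive OpenAlex lookups would happen here
    cache.parent.mkdir(parents=True, exist_ok=True)
    with cache.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["openalex_id"])
        writer.writerows([openalex_id] for openalex_id in ids)
    return ids
```

With this pattern, the slow lookups run only on the first call; later runs read the CSV, and deleting the file forces a refresh.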

## 3. Extraction

In [None]:
# Test for one row

if False:
    # Assume we are working with the first row of the DataFrame
    row = pubs_df.iloc[0]

    # Extract necessary fields
    phd_name = row['phd_name']
    title = row['title']
    year = int(row['year'])
    institution = row['institution']
    contributors = [row[f'contributor_{i}'] for i in range(1, 11) if pd.notna(row[f'contributor_{i}'])]

    # Create an instance of AuthorRelations with desired verbosity ('NONE', 'MEDIUM', 'DETAILED')
    years_tolerance = -1  # years tolerance
    author_relations = AuthorRelations(
        phd_name=phd_name,
        title=title,
        year=year,
        institution=institution,
        contributors=contributors,
        years_tolerance=years_tolerance,
        verbosity='DEBUG'
    )

    # Search for the PhD candidate using the 'title' criterion
    author_relations.search_phd_candidate(criteria='title')

    # Find potential supervisors among the contributors
    author_relations.find_potential_supervisors()

    # Get the OpenAlex ID pairs
    results = author_relations.get_results()
    print(results)

In [None]:
# Apply the function to each row using DataFrame.apply
results_list = pubs_df.apply(find_phd_and_supervisors_in_row, axis=1).tolist()

# Convert the results into a DataFrame
results_df = pd.DataFrame(results_list)

results_df.to_csv('data/output/matched_pairs.csv', index=False)