# OpenAlex Extraction and Matching with `.search()`

The goal of this notebook is to look up the PhD students (authors) contained in the [cleaned](clean_data.ipynb) NARCIS dataset, and
1. Confirm they can be found in OpenAlex
2. Confirm their affiliation in NARCIS matches the one in OpenAlex
3. Confirm they wrote the associated PhD thesis
4. Per author, look up all the contributors (i.e. potential first supervisors) that are listed in the NARCIS dataset and
    a. Find all authors that have worked for the same organization at the time the PhD thesis was published (within a 1-year window)
    b. xxx
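
The affiliation check in sub-step (a) above can be illustrated with a minimal sketch. It assumes that each author's affiliations are available as a list of records holding an institution identifier and the years in which that affiliation was observed, similar in shape to the `affiliations` field of an OpenAlex author. The function names, the record layout, and the sample IDs are hypothetical; the project's actual logic lives in `src/open_alex_helpers.py` and may differ.

```python
def institutions_in_window(affiliations, year, window=1):
    """Institution IDs with at least one affiliation year in [year - window, year + window]."""
    return {
        record["institution_id"]
        for record in affiliations
        if any(abs(y - year) <= window for y in record["years"])
    }


def shared_institutions(phd_affiliations, contributor_affiliations, thesis_year, window=1):
    """Institutions both authors were affiliated with around the thesis year."""
    phd_set = institutions_in_window(phd_affiliations, thesis_year, window)
    contributor_set = institutions_in_window(contributor_affiliations, thesis_year, window)
    return phd_set & contributor_set


# Hypothetical sample data (IDs are made up for illustration)
phd = [{"institution_id": "I-A", "years": [2016, 2017, 2018]}]
contributor = [
    {"institution_id": "I-A", "years": [2010, 2019]},
    {"institution_id": "I-B", "years": [2018]},
]

print(shared_institutions(phd, contributor, thesis_year=2018))
```

Working with sets keeps the comparison independent of how many years or records each author has; only the overlap of institutions within the window matters.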


The previous version of this notebook, written by a Bachelor student, used the `.search_filter()` method of `pyalex`, which does not search alternate spellings of the specified name. This notebook uses `.search()` instead, which does not have that problem. See the example code [here](search_parameter_vs_search_filter.ipynb).
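
At the API level, the difference between the two `pyalex` methods comes down to which query-string parameter is sent to OpenAlex. The sketch below builds both request URLs with the standard library only. The helper names and the example name are hypothetical, and `pyalex` constructs these URLs internally, possibly with different encoding.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.openalex.org/authors"


def search_parameter_url(name):
    """URL using the `search` query parameter (the kind of request `.search()` issues)."""
    return f"{BASE_URL}?{urlencode({'search': name})}"


def search_filter_url(name):
    """URL using the `display_name.search` filter (as with `.search_filter(display_name=...)`)."""
    return f"{BASE_URL}?{urlencode({'filter': f'display_name.search:{name}'})}"


# Hypothetical author name, used for illustration only
print(search_parameter_url("Jan de Vries"))
print(search_filter_url("Jan de Vries"))
```

The first URL asks OpenAlex for a general author search, while the second restricts the match to the `display_name` field, which is consistent with the missing alternate spellings described above.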

## 1. Setup

In [None]:
#from pyalex import Works, Authors, Sources, Institutions, Topics, Publishers, Funders, Concepts
from pyalex import config # to set email_address
import pandas as pd
from sentence_transformers import SentenceTransformer
from os import path
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm

from src.unabbreviate_institutions import unabbreviate_institutions
from src.open_alex_helpers import AuthorRelations, find_phd_and_supervisors_in_row, get_supervisors_openalex_ids
from src.dataset_config_helpers import read_config, load_dataset
from src.api_cache_helpers import initialize_request_cache
from src.plotters import PhDMatchPlotter

# Initialize tqdm for progress bars
tqdm.pandas()

# Install the cache before any API calls are made.
# Every API call to OpenAlex is cached; if a cached response is available,
# it is used instead of making a new API call.
initialize_request_cache()
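
The cache-first behavior described in the comments above can be pictured with a small stand-in. This is not the implementation of `initialize_request_cache()`, whose internals are not shown in this notebook; the in-memory dictionary, the `cached_fetch` helper, and the fake fetch function are all assumptions made for illustration.

```python
import json

_cache = {}  # in-memory stand-in for a persistent request cache


def cached_fetch(url, params, fetch):
    """Return the cached response for (url, params) if present; otherwise call `fetch` and store it."""
    key = json.dumps({"url": url, "params": params}, sort_keys=True)
    if key not in _cache:
        _cache[key] = fetch(url, params)
    return _cache[key]


calls = []  # records every "real" request made by the fake fetcher


def fake_fetch(url, params):
    """Pretend to call the API and return an empty result set."""
    calls.append((url, params))
    return {"results": []}


# The same request twice: only the first one reaches the fake API
cached_fetch("https://api.openalex.org/authors", {"search": "example"}, fake_fetch)
cached_fetch("https://api.openalex.org/authors", {"search": "example"}, fake_fetch)
print(len(calls))
```

Because repeated requests are answered from the cache, re-running the notebook after an interruption should not repeat API calls that already succeeded.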

Notebook settings

In [None]:
# Automatically reload imported modules so that changes made to the module files
# are reflected without needing to restart the Jupyter kernel.
%load_ext autoreload
# Mode 1 reloads only modules registered with %aimport (for production).
# Mode 2 reloads all modules before the execution of every cell.
%autoreload 2

# Limit the number of rows shown when printing data frames
pd.set_option('display.max_rows', 5)

Set a contact email address to use the [polite pool](https://docs.openalex.org/how-to-use-the-api/rate-limits-and-authentication#the-polite-pool). If you are on a premium plan, using the email address associated with that plan also gives you access to the higher usage limit.

In [None]:
# Get contact email address from file
email_file_path = 'contact_email.txt'

if path.isfile(email_file_path):
    with open(email_file_path, 'r') as file:
        email_address = file.read().strip()

    # Assign the email address to the pyalex configuration
    config.email = email_address

config.email

Configure the number of retries and the backoff factor.
`pyalex` uses [urllib3.util.Retry](https://urllib3.readthedocs.io/en/stable/reference/urllib3.util.html) for retrying.

In [None]:
config.max_retries = 7
config.retry_backoff_factor = 2 # conservative backoff
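
To get a feel for what these settings mean in practice, the sketch below computes the waiting times implied by the exponential backoff formula in the urllib3 documentation, `backoff_factor * 2 ** (retry_number - 1)`, with no wait before the first retry and a cap on each individual wait. The 120-second cap is urllib3's documented default and is an assumption here, as is the premise that `pyalex` passes both values through unchanged. The exact behavior can differ between urllib3 versions, and a `Retry-After` header sent by the server may override these waits.

```python
def backoff_schedule(max_retries, backoff_factor, backoff_max=120.0):
    """Waiting time in seconds before each retry, under the assumptions stated above."""
    waits = []
    for retry_number in range(1, max_retries + 1):
        if retry_number == 1:
            wait = 0.0  # no wait before the first retry
        else:
            wait = backoff_factor * 2 ** (retry_number - 1)
        waits.append(min(backoff_max, wait))  # each wait is capped at backoff_max
    return waits


schedule = backoff_schedule(max_retries=7, backoff_factor=2)
print(schedule)
print(f"Worst-case total wait: {sum(schedule):.0f} s")
```

Under these assumptions, a request that keeps failing waits roughly four minutes in total before `pyalex` gives up, which is why a backoff factor of 2 counts as conservative.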

## 2. Load datasets

### 2.1 Cleaned processed NARCIS dataset

In [None]:
dataset_path = 'data/cleaned/pubs_with_domain.csv'
output_filename = 'data/output/author_relations.csv'

# Use a separate name so that the pyalex `config` imported above is not shadowed
dataset_config = read_config('dataset_config.yaml')

pubs_df = load_dataset(dataset_config, dataset_path)

pubs_df


In [None]:
# Replace institution abbreviations with the names that can be found in OpenAlex
pubs_unabbrev_df = unabbreviate_institutions(pubs_df, 'institution')
pubs_unabbrev_df


In [None]:
# PhD candidates with 4 or more supervisors (for information)
contributor_cols = [f'contributor_{i}' for i in range(1, 11)]

# Count non-missing contributor entries per row
pubs_unabbrev_df['contributor_count'] = pubs_unabbrev_df[contributor_cols].notna().sum(axis=1)

# Reorder columns to place contributor_count after institution
cols = list(pubs_unabbrev_df.columns)
if 'institution' in cols and 'contributor_count' in cols:
    cols.remove('contributor_count')
    institution_index = cols.index('institution')
    cols.insert(institution_index + 1, 'contributor_count')
    pubs_unabbrev_df = pubs_unabbrev_df[cols]

# Filter and sort
pubs_more_than_n_df = (
    pubs_unabbrev_df[pubs_unabbrev_df['contributor_count'] >= 4]
    .sort_values(by=['institution', 'contributor_count'], ascending=[True, True])
    .copy()
)

print(f"There are {pubs_more_than_n_df.shape[0]} PhD candidates with 4 or more supervisors")

pubs_more_than_n_df




### 2.2 Priority supervisor list from ResponsibleSupervision pilot

This dataset was created during the Responsible Supervision pilot project, see [here](https://github.com/tamarinde/ResponsibleSupervision/tree/main/Pilot-responsible-supervision).

In [None]:
repo_url = "https://github.com/tamarinde/ResponsibleSupervision/tree/main/Pilot-responsible-supervision/data/spreadsheets"
csv_path = "data/output/sups_pilot.csv"

try:
    # Attempt to read the supervisors in the pilot dataset from csv_path
    # If it fails, we get them again from GitHub
    supervisors_in_pilot_dataset = get_supervisors_openalex_ids(repo_url, csv_path)
    print("Unique Supervisors with OpenAlex IDs:")
    print(supervisors_in_pilot_dataset)
except Exception as e:
    print(f"An error occurred: {e}")

## 3. Extraction

Load the pre-trained SPECTER model by allenai (designed for scientific documents). We pre-load the model here, so that we don't need to do that per class instance.

Citation information can be found here: https://github.com/allenai/specter

In [None]:
model = SentenceTransformer("allenai-specter")

In [None]:
# set the dict to overwrite the default class attribute specified in src/open_alex_helpers.py
AuthorRelations.supervisors_in_pilot_dataset = supervisors_in_pilot_dataset

# Apply the function to each row with a constant, preloaded model
extraction_series = pubs_unabbrev_df.progress_apply(
    lambda row: find_phd_and_supervisors_in_row(row, model),
    axis=1
    )

# Concatenate all DataFrames into one
extraction_df = pd.concat(list(extraction_series), ignore_index=True)

extraction_df.to_csv(output_filename, index=False)

extraction_df

## 4. Analysis and Visualization

Load the extraction dataset from file in case we didn't run the extraction

In [None]:
# At notebook top level, locals() and globals() refer to the same namespace
if 'extraction_df' not in globals():
    # Fall back to the CSV file written by the extraction step
    if path.exists(output_filename):
        extraction_df = pd.read_csv(output_filename)
        print(f"Read `extraction_df` from {output_filename}")
    else:
        raise FileNotFoundError(f"File not found: {output_filename}")

extraction_df

In [None]:
print(f"We managed to find contributors with {extraction_df['n_shared_inst_grad'].sum()} shared institutions and {extraction_df['n_shared_pubs'].sum()} shared publications!")

Get PhDs that we could not find in OpenAlex.

In [None]:
# Step 1: Filter extraction_df for rows where phd_id is NaN (NaN is the only value not equal to itself)
extraction_none_df = extraction_df.query("phd_id != phd_id")

# Step 2: Filter pubs_unabbrev_df for matching phd_names; then sort and export
pubs_phd_not_confirmed_df = (
    pubs_unabbrev_df
    .query("phd_name in @extraction_none_df.phd_name")
    .sort_values(by=["year", "institution"])   # sort by multiple columns
)

# Export to CSV without the DataFrame index
pubs_phd_not_confirmed_df.to_csv("data/output/phds_not_confirmed.csv", index=False)

pubs_phd_not_confirmed_df
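
The filter `phd_id != phd_id` in the cell above relies on the IEEE 754 rule that NaN is not equal to itself, so the comparison is true only for missing values. A quick standard-library illustration, with made-up sample values:

```python
import math

# Two present IDs and two missing ones (values are made up)
values = ["A123", math.nan, "A456", math.nan]

# A value differs from itself only if it is NaN
missing = [v != v for v in values]
print(missing)
```

In pandas, `extraction_df[extraction_df['phd_id'].isna()]` expresses the same selection more explicitly.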


In [None]:
plotter = PhDMatchPlotter(extraction_df)
ax = plotter.plot()
plt.show()


In [None]:
count_phds_df

In [None]:
extraction_df


In [None]:
# Function to assign a match category based on the flags
def determine_category_contrib(row):
    # No contributor ID means the contributor could not be found in OpenAlex
    if pd.isnull(row['contributor_id']) or not row['contributor_id']:
        return 'Not found'
    else:
        if not row['n_shared_pubs'] and not row['same_grad_inst']:
            return 'No shared publications or affiliation at graduation'
        elif not row['n_shared_pubs'] and row['same_grad_inst']:
            return 'Shared affiliation at graduation only'
        elif row['n_shared_pubs'] and not row['same_grad_inst']:
            return 'Shared publications only'
        elif row['n_shared_pubs'] and row['same_grad_inst']:
            return 'Shared publications and affiliation at graduation'
        else:
            return 'Other'

bar_categories = [
    'Not found',
    'No shared publications or affiliation at graduation',
    'Shared affiliation at graduation only',
    'Shared publications only',
    'Shared publications and affiliation at graduation',
    'Other'
]

# Apply the categorization function to create a new column
extraction_df['match_category'] = extraction_df.apply(determine_category_contrib, axis=1)

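# Keep only rows where 'phd_id' is present, so that we only look at PhDs we could confirm
confirmed_mask = extraction_df['phd_id'].notna()
count_contrib_df = extraction_df.loc[confirmed_mask, ['contributor_name', 'match_category']].copy()

# Number of confirmed PhDs, used in the plot subtitle below
# (assumption: each confirmed PhD has one distinct `phd_id`)
n_confirmed_phds = extraction_df.loc[confirmed_mask, 'phd_id'].nunique()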

# Count how many contributors per match type
match_contrib_counts = count_contrib_df['match_category'].value_counts().reindex(bar_categories)

# Create a bar plot
ax = match_contrib_counts.plot(kind='bar', color='#FFD54F')

# Add count labels on top of each bar
ax.bar_label(ax.containers[0], label_type='edge')

# Add labels and title for clarity
plt.xlabel("Match Type")
plt.ylabel("Number of contributors")
plt.suptitle("Contributor Matching Confirmation by Type", fontsize=12) # title
plt.title(f"Only considering the {n_confirmed_phds} PhDs we could confirm", fontsize=10) # subtitle

# Display the plot
plt.show()