This project is based on metadata that NARCIS (National Academic Research and Collaborations Information System) harvested from PhD thesis repositories in the Netherlands (NARCIS, n.d.). NARCIS aggregates information on scholarly work across Dutch academic institutions. It followed standardized protocols during extraction, ensuring consistency and interoperability across data sources.

The NARCIS dataset has since been incorporated into OpenAIRE.

To obtain the original dataset, please contact @tamarinde [on GitHub](https://github.com/tamarinde).

This notebook cleans the original dataset that was obtained from the XXX study. Place the raw dataset at

`data/raw/pairs_sups_phds.csv`

Make sure that the columns have the following names: `thesis_identifier`,`contributor`,`contributor_order`,`institution`,`author_name`,`title`,`year`,`language`

This notebook cleans the raw data in the following ways:

1. Remove wrongly detected contributor entries that are not people. For this, we use spaCy and the `xx_ent_wiki_sm` model
2. Standardize the contributor names to "Last name, Initials" (TODO: check whether this discards information and whether we should keep the first names)
3. Aggregate all rows per publication/PhD student into one row with all the contributors
4. Pivot the dataset to get one row per publication/PhD student, with one column per contributor
5. Export the dataset to `data/cleaned/pubs.csv`

## Load Dependencies

In [None]:
# Import dependencies
import pandas as pd

# custom functions
from src.clean_names_helpers import remove_non_person_contributors_and_export
from src.clean_names_helpers import format_name_to_lastname_initials
from src.clean_names_helpers import ensure_and_load_spacy_model

## Settings

In [None]:
# Limit the number of rows shown when printing DataFrames
pd.set_option('display.max_rows', 5)

# Number of rows to read of the full dataset.
NROWS = None # None for all

## Clean and restructure dataset

In [None]:
# Load the spaCy NLP model, downloading it first if it is not available
model_name = "xx_ent_wiki_sm" # multilingual NER model
nlp = ensure_and_load_spacy_model(model_name)

When checking whether contributor names are actual names of people, the functions use a whitelist and a blacklist. A string on the whitelist is always treated as a person's name. A string on the blacklist is always discarded.

The whitelist can be populated from the file `data/removed_contributors.csv`, which is created when running the notebook and lists the contributors that were removed.

The blacklist can be populated from non-person strings that end up in the filtered dataset.
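
The precedence described above can be sketched as follows. This is a minimal illustration, not the actual logic in `src.clean_names_helpers`: the function `keep_contributor` and the `looks_like_person` callback are hypothetical, the name "Jansen, P." is made up, and checking the whitelist before the blacklist is an assumption, since it is not stated which list wins when a string appears on both.

```python
from typing import Callable, Iterable


def keep_contributor(
    name: str,
    whitelist: Iterable[str],
    blacklist: Iterable[str],
    looks_like_person: Callable[[str], bool],
) -> bool:
    """Return True if `name` should be kept as a person (hypothetical helper)."""
    if name in set(whitelist):
        return True  # whitelisted strings are always kept
    if name in set(blacklist):
        return False  # blacklisted strings are always discarded
    # Otherwise defer to the detector (for example, a spaCy-based check)
    return looks_like_person(name)


def toy_detector(text: str) -> bool:
    """Stand-in detector for illustration only: a comma suggests 'Last, Initials'."""
    return "," in text


decisions = {
    name: keep_contributor(name, ["Oosterlaan, J."], ["Cardiology"], toy_detector)
    for name in ["Oosterlaan, J.", "Cardiology", "Jansen, P.", "Amsterdam UMC"]
}
```

With this ordering, the detector is only consulted for strings that appear on neither list, so manual corrections always override the model.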

In [None]:
# People's names that spaCy does not recognize as names
# NOTE: Add the verbatim names here, not the standardized target notation

WHITELIST = [
    "Oosterlaan, J.",
    "Nollet, F.",
]

# Non-person strings that spaCy does not filter out
BLACKLIST = [
    "Cardiology",
]

removed_contributors = []

### Read dataset

In [None]:
# Read data
pairs_raw = pd.read_csv("data/raw/pairs_sups_phds.csv", nrows=NROWS)
pairs_raw = pairs_raw.convert_dtypes()  # use nullable dtypes so integer columns with missing values stay integer
pairs_raw

### Remove duplicates, rows where contributor is NA or not a person

The raw data contains many junk rows for contributors that the original scraper identified but that are actually the institution, the research field, or other noise.
To remove contributors that are not people, we use spaCy with the model chosen above. The detection still produces many false positives and false negatives, which we can correct via the custom `WHITELIST` and `BLACKLIST`.
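
One way such a check could work is to ask the NER pipeline whether it tags any entity in the string as a person. The sketch below is an assumption about the general approach, not the implementation of `remove_non_person_contributors_and_export`. The function name `contains_person_entity` is hypothetical, and the label set is an assumption: the multilingual spaCy models use `PER`, while other spaCy models use `PERSON`.

```python
def contains_person_entity(text, nlp, person_labels=("PER", "PERSON")):
    """Return True if the NER pipeline tags any entity in `text` as a person.

    `nlp` is any callable that returns an object with an `ents` attribute,
    such as a loaded spaCy pipeline. Hypothetical helper for illustration.
    """
    doc = nlp(text)
    return any(ent.label_ in person_labels for ent in doc.ents)
```

With the `nlp` object loaded earlier in this notebook, a call would look like `contains_person_entity("Nollet, F.", nlp)`. Short strings such as "Last name, Initials" give the model little context, which is one plausible reason for the misclassifications mentioned above.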

In [None]:
# remove duplicates
pairs_filtered = pairs_raw.drop_duplicates() 

# Remove rows where 'contributor' is NA
pairs_filtered = pairs_filtered.dropna(subset=['contributor'])

# remove contributors that aren't people
csv_path = "data/removed_contributors.csv"
pairs_filtered = remove_non_person_contributors_and_export(pairs_filtered, csv_path, nlp, WHITELIST, BLACKLIST)

print(f"{len(pairs_filtered)} rows are left.")
pairs_filtered


### Standardize names to Last name, Initials
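
As a rough illustration of the target notation, the sketch below converts "First Middle Last" or "Last, First Middle" into "Last, F.M.". The function `to_lastname_initials` is hypothetical and is not the project's `format_name_to_lastname_initials`, whose exact rules may differ. The example names are made up. As a known limitation, the sketch does not handle surname prefixes such as "van der" and would turn them into initials.

```python
import re


def to_lastname_initials(name: str) -> str:
    """Convert a name to 'Last, F.M.' notation (simplified, hypothetical sketch)."""
    name = " ".join(name.split())  # collapse repeated whitespace
    if "," in name:
        # Assume 'Last, Given names'
        last, given = [part.strip() for part in name.split(",", 1)]
    else:
        # Assume 'Given names Last': the final token is the last name
        parts = name.split(" ")
        last, given = parts[-1], " ".join(parts[:-1])
    # One initial per given-name token; split on spaces, dots, and hyphens
    tokens = [t for t in re.split(r"[\s.\-]+", given) if t]
    initials = "".join(f"{t[0].upper()}." for t in tokens)
    return f"{last}, {initials}" if initials else last


examples = {
    raw: to_lastname_initials(raw)
    for raw in ["Oosterlaan, J.", "Jansen, Jan Pieter", "Jan Jansen", "Jansen"]
}
```

Reducing first names to initials loses information, so it may be worth keeping the verbatim name in a separate column if first names are needed later.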

In [None]:
# Standardize names on a copy, so that pairs_filtered stays unchanged
pairs_std = pairs_filtered.copy()
# Apply name standardization to the contributor column
pairs_std['contributor'] = pairs_std['contributor'].apply(format_name_to_lastname_initials)

pairs_std

### Pivot

Aggregate and unpack so that we get one row per publication and one column per contributor.
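
To make the target shape concrete, here is a small sketch of the same long-to-wide reshaping on toy data, using `groupby(...).cumcount()` and `DataFrame.pivot`. It is an alternative illustration under simplifying assumptions, not the approach used in the cells below: the data and names are made up, and only `thesis_identifier` is used as the key.

```python
import pandas as pd

# Toy long-format data: one row per (thesis, contributor) pair
long_df = pd.DataFrame({
    "thesis_identifier": ["t1", "t1", "t2"],
    "contributor": ["Jansen, J.", "Vries, A.", "Bakker, P."],
})

# Number the contributors 1..n within each thesis
long_df["contributor_order"] = long_df.groupby("thesis_identifier").cumcount() + 1

# Spread the contributors over columns contributor_1, contributor_2, ...
wide_df = (
    long_df.pivot(index="thesis_identifier", columns="contributor_order", values="contributor")
    .add_prefix("contributor_")
    .reset_index()
)
wide_df.columns.name = None  # drop the leftover 'contributor_order' axis label
```

Theses with fewer contributors than the maximum get missing values in the trailing `contributor_n` columns.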

#### Aggregate

In [None]:
# Group by publication
aggregated = pairs_std.groupby([
        'integer_id', 
        'thesis_identifier', 
        'institution', 
        'author_name', 
        'title', 
        'year', 
        'language'
    ])
        
# Aggregate contributors into a list
aggregated = aggregated.agg(list)

aggregated = aggregated.reset_index()
    
# Renumber contributor_order as a gap-free sequence from 1 to n_contributors
aggregated['contributor_order'] = aggregated['contributor_order'].apply(lambda lst: list(range(1, len(lst) + 1)))

aggregated


#### Unpack contributor rows

In [None]:
# Pivot the dataset, to get to one row per dissertation, with the contributors in columns

# Initialize a list to hold publication data dictionaries
pubs_list = []

# Iterate over each aggregated group
for _, row in aggregated.iterrows():
    # Initialize a dictionary with publication information
    pub_dict = {col: row[col] for col in ['integer_id', 'thesis_identifier', 'institution', 'author_name', 'title', 'year', 'language']}
    
    # Get the list of contributors and their orders for this publication
    contributors = row['contributor']
    contributor_orders = row['contributor_order']
    
    # Add contributors to the dictionary using dynamic keys
    for order in sorted(set(contributor_orders)):  # Ensure unique and sorted order numbers
        if order - 1 < len(contributors):  # Check to prevent index error
            pub_dict[f'contributor_{order}'] = contributors[order - 1]
    
    # Append the publication dictionary to the list
    pubs_list.append(pub_dict)

# Convert the list of dictionaries to a DataFrame
pubs = pd.DataFrame(pubs_list)

# Convert columns to the best-fitting nullable dtypes; missing contributors remain <NA>
pubs = pubs.convert_dtypes()

pubs

## Export cleaned dataset

In [None]:
pubs.to_csv('data/cleaned/pubs.csv', index=False)