***************************************************************************************
Jupyter Notebooks from the Metadata for Everyone project

Code:
* Dennis Donathan II (https://orcid.org/0000-0001-8042-0539)

Project team: 
* Juan Pablo Alperin (https://orcid.org/0000-0002-9344-7439)
* Dennis Donathan II (https://orcid.org/0000-0001-8042-0539)
* Mike Nason (https://orcid.org/0000-0001-5527-8489)
* Julie Shi (https://orcid.org/0000-0003-1242-1112)
* Marco Tullney (https://orcid.org/0000-0002-5111-2788)

Last updated: 2024-08-02
***************************************************************************************

# Labeling Problems in the Data

We identified many issues in Phase 1 of our project (see Shi, J., Nason, M., Tullney, M., & Alperin, J. P. (2023). Identifying Metadata Quality Issues Across Cultures. SocArXiv. https://doi.org/10.31235/osf.io/6fykh). Now we will go through and programmatically label the records in this sample that contain some of those issues.


Start by importing the packages we'll need, setting up our directories, and loading in the data.

In [1]:
import pandas as pd #Creating dataframe and manipulating data
from bs4 import BeautifulSoup as bs # for cleaning xml tags
import re #regular expressions used for detection of initials
from py3langid.langid import LanguageIdentifier, MODEL_FILE #For language detection
from nltk.tokenize import sent_tokenize #Tokenizing abstracts during language detection
from pathlib import Path

In [2]:
# Data Directory
data_dir = Path('../data')
input_dir = data_dir / 'input'
output_dir = data_dir / 'output'
# Loading in dataset
df = pd.read_parquet(input_dir / '02_cleaned_data.parquet')
df.head()

| | abstract_langs | abstracts | article_lang | article_title | authors | doi | journal_lang | journal_title | publisher_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | | | | [Revolving Boot Crimping Machine] | | 10.1038/scientificamerican04281849-249k | | Scientific American | Springer Science and Business Media LLC |
| 1 | | | | [Development of Desktop CNC Lathe with Pipe Fr... | [{'affiliation': None, 'given_name': 'Naohiko'... | 10.2493/jjspe.85.189 | en | Journal of the Japan Society for Precision Eng... | Japan Society for Precision Engineering |
| 2 | | | | [0547 Predicting Response to Oral Appliance Th... | [{'affiliation': 'Zephyr Sleep Technologies, C... | 10.1093/sleep/zsy061.546 | en | Sleep | Oxford University Press (OUP) |
| 3 | | | | [Anomalous absorption of hydrogen peroxide (H2... | [{'affiliation': None, 'given_name': 'Mohit K.... | 10.1016/j.jqsrt.2020.107085 | en | Journal of Quantitative Spectroscopy and Radia... | Elsevier BV |
| 4 | | | | [Cultura emprendedora de los estudiantes de Ma... | [{'affiliation': None, 'given_name': None, 'na... | 10.24265/iggp.2020.v7n1.09 | | Revista en Gobierno y Gestión Pública | Universidad de San Martin de Porres |


## Missing Values in Common Fields
This is a relatively easy problem to label, so we'll tackle these first.

We'll set up a column *'Author Missing'* and assign `0` to all of the records. Then we will locate the records missing an author and change their value to `1`.

Then we'll do the same for the *language, abstract,* and *title* fields.

In [3]:
#Authors
df['Author Missing'] = float(0)
df.loc[df.authors.isna(), 'Author Missing'] = float(1)
#Languages
df['Article Language Missing'] = float(0)
df.loc[df.article_lang.isna(), 'Article Language Missing'] = float(1)
df['Journal Language Missing'] = float(0)
df.loc[df.journal_lang.isna(), 'Journal Language Missing'] = float(1)
#Abstracts
df['Abstract Missing'] = float(0)
df.loc[df.abstracts.isna(), 'Abstract Missing'] = float(1)
#Titles
df['Article Title Missing'] = float(0)
df.loc[df.article_title.isna(), 'Article Title Missing'] = float(1)
df['Journal Title Missing'] = float(0)
df.loc[df.journal_title.isna(), 'Journal Title Missing'] = float(1)
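
The same labelling can also be written as a loop over a field-to-label mapping. The sketch below is only an illustration: it runs on a small made-up DataFrame (the `toy` records are hypothetical, not taken from the dataset) and assumes the label names used in the cell above.

```python
import pandas as pd

# Hypothetical toy records; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "authors": [None, "placeholder author entry"],
    "abstracts": [None, None],
    "article_title": ["A title", None],
})

# Map each source field to the name of its label column.
label_columns = {
    "authors": "Author Missing",
    "abstracts": "Abstract Missing",
    "article_title": "Article Title Missing",
}

for field, label in label_columns.items():
    # isna() is True where the field is empty; the cast yields 0.0 / 1.0.
    toy[label] = toy[field].isna().astype(float)

print(toy[list(label_columns.values())])
```

Keeping the mapping in one dictionary means a new field can be labelled by adding a single entry, at the cost of hiding the individual assignments that the cell above spells out.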

## Prevalence of missing values

### Missing Authors

In [4]:
prevalence_AuMis = (len(df.loc[df['Author Missing'] == 1])/len(df)) * 100
#prevalence_AuMis # percentage of records with this specific issue
print("{:0.2f} percent of the article records do not contain an author".format(prevalence_AuMis))
print("Number of records: ",len(df.loc[df['Author Missing'] ==1]))

9.76 percent of the article records do not contain an author
Number of records:  51707


In [5]:
#missing_authors_df = df.loc[df.no_author == 1]
#missing_authors_df.head(50)

### Missing Language

In [6]:
prevalence_Article_Lang_Miss = (len(df.loc[df['Article Language Missing'] == 1])/len(df)) * 100
#prevalence_LangMis # percentage of records with this specific issue
print("{:0.2f} percent of the article records do not have an article language specified".format(prevalence_Article_Lang_Miss))
print("Number of records: ",len(df.loc[df['Article Language Missing'] == 1]))

95.49 percent of the article records do not have an article language specified
Number of records:  506145


In [7]:
prevalence_Journal_Lang_Miss = (len(df.loc[df['Journal Language Missing'] == 1])/len(df)) * 100
#prevalence_LangMis # percentage of records with this specific issue
print("{:0.2f} percent of the article records do not have a journal language specified".format(prevalence_Journal_Lang_Miss))
print("Number of records: ",len(df.loc[df['Journal Language Missing'] == 1]))

21.33 percent of the article records do not have a journal language specified
Number of records:  113070


In [8]:
#missing_language_df = df.loc[df.no_language == 1]
#missing_language_df.head(50)

### Missing Abstract

In [9]:
prevalence_AbsMis = (len(df.loc[df['Abstract Missing'] == 1])/len(df)) * 100
#prevalence_AbsMis
print("{:0.2f} percent of the article records do not contain an abstract".format(prevalence_AbsMis))
print("Number of records: ",len(df.loc[df['Abstract Missing'] == 1]))

75.45 percent of the article records do not contain an abstract
Number of records:  399935


In [10]:
#missing_abstract_df = df.loc[df.no_abstract == 1]
#missing_abstract_df.head(50)

### Missing Title

In [11]:
prevalence_Article_Title_Mis = (len(df.loc[df['Article Title Missing'] == 1])/len(df)) * 100
#prevalence_TitleMis
print("{:0.2f} percent of the article records do not contain an article title".format(prevalence_Article_Title_Mis))
print("Number of records: ",len(df.loc[df['Article Title Missing'] == 1]))

0.28 percent of the article records do not contain an article title
Number of records:  1473


In [12]:
prevalence_Journal_Title_Mis = (len(df.loc[df['Journal Title Missing'] == 1])/len(df)) * 100
#prevalence_TitleMis
print("{:0.2f} percent of the article records do not contain a Journal title".format(prevalence_Journal_Title_Mis))
print("Number of records: ",len(df.loc[df['Journal Title Missing'] == 1]))

0.00 percent of the article records do not contain a Journal title
Number of records:  2


In [13]:
#missing_titles_df = df.loc[df.no_title == 1]
#missing_titles_df.head(50)

## Investigating the Author Entries

We'll start off by investigating the *author* field. This area was found to have a number of potentially high-priority issues, as it pertains to social and political matters, and it is also a field that has seen some of the most pervasive issues in standardization.

## Author Sequence
Our first function checks the *sequence* sub-field within the *author* field, in which each author is listed as either 'first' or 'additional'. The function sets up a counter, then iterates through the author list of a record to check the noted sequence for each author.

The `try` block filters out records that have no authors listed. After that we begin to iterate through each author within a given record.

`if 'name' in author.keys():` filters out institutions listed as authors, since the 'name' key is often how an institution is presented as an author within the metadata record. The code within the `if` block says that if an institution is the only author listed, the counter is increased to 1; the code then continues down to the `return` statements, where **0** is returned because technically there is no issue with sequence in that record.

The `else:` branch containing `if author['sequence'] == 'first'` is where the bulk of the counting happens. Up to this point we are mostly filtering out instances that don't apply to the problem at hand. Simply put, the function counts how many authors are labeled as 'first'. Once all authors of a record have been parsed, we go to the `return` statements. The second cell, `sequence_checker2`, drops the institution filter and returns a more detailed encoding from 0 to 4; it is the version applied to the data.

In [14]:
def sequence_checker(authorList):
    counter = 0 
    try: 
        for author in authorList:
                if 'name' in author.keys():
                    if len(authorList) == 1:
                        counter +=1
                    else:
                        continue
                else:
                    if author['sequence'] == 'first':
                        counter +=1
                    else:
                        continue
        if counter == 0:
            return 1 #no first author
        elif len(authorList) > 1:
            if counter > 1:
                return 1 #multiple first authors
            else:
                return 0
        else:
            return 0 #no issue
    except:
        return None

In [15]:
def sequence_checker2(authorList: list[dict[str]]| None) -> int | None:
    """Function to check the value of the 'sequence' key for each
    author within a record.

    Args:
        authorList (list[dict[str]] | None): The nested list of authors for each record.

    Returns:
        int | None: Returns an encoding from 0 to 4 describing how many authors are tagged as first author (see the comments on the return statements below).
        Returns None if the record is incomplete such that a determination cannot be made.
    """
    counter = 0 
    try: 
        for author in authorList:
                #if 'name' in author.keys():
                    #if len(authorList) == 1:
                        #counter +=1
                    #else:
                        #continue
                #else:
                if author['sequence'] == 'first':
                    counter +=1
                else:
                    continue
        if counter == 0:
            return 0 # There is no first author.
        elif counter == 1:
            if len(authorList) == 1:
                return 1 # There's only one author and this author is also tagged as first author
            else:
                return 2 # There are multiple authors and only one of them is labeled as first author
        else:
            if counter == len(authorList):
                return 3 # All authors are labeled as first authors
            else: 
                return 4 # More than one and less than all authors are labeled as first authors
    except:
        return None
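
To make the five encodings concrete, the sketch below restates the counting logic in condensed form and applies it to a few made-up author lists. The function name `first_author_encoding` and the sample records are hypothetical. Unlike `sequence_checker2`, this version uses `.get`, so an author without a 'sequence' key is assumed to count as not first instead of yielding `None`.

```python
def first_author_encoding(author_list):
    """Condensed restatement of the 0-4 encoding (illustrative only)."""
    if author_list is None:
        return None  # no author list: no determination possible
    firsts = sum(1 for author in author_list if author.get("sequence") == "first")
    if firsts == 0:
        return 0  # no first author
    if firsts == 1:
        # 1: sole author tagged first, 2: one first author among several
        return 1 if len(author_list) == 1 else 2
    # 3: every author tagged first, 4: several but not all tagged first
    return 3 if firsts == len(author_list) else 4

first = {"sequence": "first"}
additional = {"sequence": "additional"}

examples = {
    "no first author": [additional, additional],
    "single author": [first],
    "one among several": [first, additional],
    "all first": [first, first],
    "some first": [first, first, additional],
}
for label, authors in examples.items():
    print(label, "->", first_author_encoding(authors))
```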

In [16]:
df['Author Sequence Encoding'] = df.authors.map(lambda x: sequence_checker2(x))

In [17]:
first_author_none = df.loc[df['Author Sequence Encoding'] == 0]
first_author_only = df.loc[df['Author Sequence Encoding'] == 1]
first_author_oneamong = df.loc[df['Author Sequence Encoding'] == 2]
first_author_all = df.loc[df['Author Sequence Encoding'] == 3]
first_author_some = df.loc[df['Author Sequence Encoding'] == 4]
# Note that this deviates from our usual approach of encoding quality issues as binary 0/1 values,
# but these cases are so different that I wanted to have more detail here.

In [18]:
first_author_none_rate = len(first_author_none)/len(df)*100
first_author_only_rate = len(first_author_only)/len(df)*100
first_author_oneamong_rate = len(first_author_oneamong)/len(df)*100
first_author_all_rate = len(first_author_all)/len(df)*100
first_author_some_rate = len(first_author_some)/len(df)*100

print("{:0.2f}% of all articles have no first author.".format(first_author_none_rate))
print("{:0.2f}% of all articles have one author and that author is the first author.".format(first_author_only_rate))
print("{:0.2f}% of all articles have multiple authors, but only one first author.".format(first_author_oneamong_rate))
print("{:0.2f}% of all articles have all authors tagged as first authors.".format(first_author_all_rate))
print("{:0.2f}% of all articles have more than one first author, but not all authors are first authors.".format(first_author_some_rate))


0.28% of all articles have no first author.
26.86% of all articles have one author and that author is the first author.
60.60% of all articles have multiple authors, but only one first author.
1.26% of all articles have all authors tagged as first authors.
1.25% of all articles have more than one first author, but not all authors are first authors.


In [19]:
first_author_none_rate_only_authored_articles = len(first_author_none)/df.authors.notnull().sum()*100
first_author_only_rate_only_authored_articles = len(first_author_only)/df.authors.notnull().sum()*100
first_author_oneamong_rate_only_authored_articles = len(first_author_oneamong)/df.authors.notnull().sum()*100
first_author_all_rate_only_authored_articles = len(first_author_all)/df.authors.notnull().sum()*100
first_author_some_rate_only_authored_articles = len(first_author_some)/df.authors.notnull().sum()*100

print("{:0.2f}% of all articles that actually have authors have no first author.".format(first_author_none_rate_only_authored_articles))
print("{:0.2f}% of all articles that actually have authors have one author and that author is the first author.".format(first_author_only_rate_only_authored_articles))
print("{:0.2f}% of all articles that actually have authors have multiple authors, but only one first author.".format(first_author_oneamong_rate_only_authored_articles))
print("{:0.2f}% of all articles that actually have authors have all authors tagged as first authors.".format(first_author_all_rate_only_authored_articles))
print("{:0.2f}% of all articles that actually have authors have more than one first author, but not all authors are first authors.".format(first_author_some_rate_only_authored_articles))


0.31% of all articles that actually have authors have no first author.
29.76% of all articles that actually have authors have one author and that author is the first author.
67.15% of all articles that actually have authors have multiple authors, but only one first author.
1.39% of all articles that actually have authors have all authors tagged as first authors.
1.39% of all articles that actually have authors have more than one first author, but not all authors are first authors.
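
The two sets of percentages differ only in their denominator: all records versus records for which a determination could be made. A minimal sketch of that arithmetic on a plain list of labels follows. The helper name `prevalence` and the sample labels are hypothetical, and it counts non-`None` labels as the second denominator, which is an assumption that approximates but does not exactly reproduce `df.authors.notnull().sum()`.

```python
def prevalence(labels, flagged=1):
    """Return (percent of all records, percent of determinable records, count)."""
    total = len(labels)
    determinable = sum(1 for value in labels if value is not None)
    hits = sum(1 for value in labels if value == flagged)
    pct_all = hits / total * 100
    # Guard against a column in which no record could be evaluated.
    pct_determinable = hits / determinable * 100 if determinable else float("nan")
    return pct_all, pct_determinable, hits

# Made-up labels: one flagged record, two clean ones, one without authors.
sample_labels = [1, 0, None, 0]
pct_all, pct_determinable, hits = prevalence(sample_labels)
print("{:0.2f}% of all records".format(pct_all))
print("{:0.2f}% of records with a determination".format(pct_determinable))
print("Number of records: ", hits)
```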


## Author Initials
This function will utilize regular expressions for detecting the use of initials. Specifically, we are looking for when initials are used in totality, that is to say a record with "Marianne E." will not be flagged, whereas a record with "D." will.

We look in both the given-name and family-name sub-fields (`given_name` and `surname` in this dataset), as this use of initials has been found in both previously.

The flow of the function is similar to `sequence_checker`: the outer `try` statement filters out records with `null` authors, we then iterate through the author list, and an inner `try` statement skips authors without personal name fields, such as institutions.

The regular expressions can be broken into two conditions, `^(?:[A-Z]\W{,3}\s?){,3}` and `(?:[^\W\d_.]\W){1,2}\B`, which are separated by `|`. Both look for initials: the former in ASCII characters, the latter in non-Latin characters. Note that the code cell below applies only an anchored form of the first condition, `^(?:[A-Z]\W{,3}\s?){1,3}$`, so only ASCII initials are detected there.

`if detector or len(author['given_name']) == 1` ensures that single-character names are also caught; any flagged record is returned with the appropriate label.

In [20]:
def author_initials_checker(authorList: list[dict[str]] | None) -> int | None:
    """Function that evaluates author names to determine the use of initials.

    Args:
        authorList (list[dict[str]] | None): The nested list of authors for each record.

    Returns:
        int | None: Returns a binary encoding (0, 1) depending on the (non-)existence of an issue.
        Returns None if the record is incomplete such that a determination cannot be made.
    """
    try: 
        for author in authorList: 
            try: 
                detector = re.search(r"^(?:[A-Z]\W{,3}\s?){1,3}$", author['given_name']) 
                if detector or len(author['given_name']) == 1:
                    return 1 
                else:
                    family_detector = re.search(r"^(?:[A-Z]\W{,3}\s?){1,3}$", author['surname']) 
                    if family_detector or len(author['surname']) == 1:
                        return 1 
                    else:
                        pass
            except:
                pass
                        
    except:
        return None
    return 0 
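
To see what the anchored expression accepts, the sketch below runs it against a handful of made-up names. The sample names are hypothetical; the pattern is the one used in `author_initials_checker`.

```python
import re

# Pattern copied from author_initials_checker: one to three uppercase ASCII
# letters, each optionally followed by up to three non-word characters and
# an optional whitespace character, spanning the whole string.
INITIALS = re.compile(r"^(?:[A-Z]\W{,3}\s?){1,3}$")

samples = ["D.", "J. P.", "JP", "Marianne E.", "Li", "ABCD"]
results = {name: bool(INITIALS.search(name)) for name in samples}

for name, is_initials in results.items():
    print(repr(name), is_initials)
```

Because the pattern accepts up to three bare uppercase letters, a short name written in capitals (for example a two-letter surname) would also be flagged, which is worth keeping in mind when reading the prevalence figures.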

In [21]:
df['Author Initials'] = df.authors.map(lambda x: author_initials_checker(x))

In [22]:
records_with_initials = df.loc[df['Author Initials'] == 1]
prevalence_author_initials_only_authored_articles = ((len(records_with_initials))/(df.authors.notnull().sum())) * 100
prevalence_author_initials_all_articles = len(records_with_initials)/len(df) * 100
#prevalence_author_initials # percentage of records with this specific issue
print("{:0.2f} percent of the article records contain at least one author with only initials given".format(prevalence_author_initials_all_articles))
print("{:0.2f} percent of the article records that actually have authors contain at least one author with only initials given".format(prevalence_author_initials_only_authored_articles))
print("Number of records: ",len(records_with_initials))

23.12 percent of the article records contain at least one author with only initials given
25.62 percent of the article records that actually have authors contain at least one author with only initials given
Number of records:  122531


## Institutions as Authors RE-CHECK OUTCOMES HERE

**Need to adjust all author labelling functions to account for the different author info schema. The inclusion of institutional authors has altered the schema and thus affected the numbers.**

This function will address instances in which institutions are recorded as authors.

`try:` will filter out records with `null` authors. Then we have the `institutions_present` list that looks for the telltale sign of an institution, the 'name' sub-field. 

If the list is populated with any authors, then the appropriate label signalling an institution will be returned.

In [23]:
def institution_as_author(authorList: list[dict[str]] | None) -> int | None:
    """Function to determine if an institution is listed as an author
    within a record.

    Args:
        authorList (list[dict[str]] | None): The nested list of authors for each record.

    Returns:
        int | None: Returns a binary encoding (0, 1) depending on the (non-)existence of an issue.
        Returns None if the record is incomplete such that a determination cannot be made.
    """
    try:
        institutions_present = [author for author in authorList if author['name'] != None]
        if len(institutions_present) > 0:
            return 1 #institution as author
        else:
            return 0 #no issue
    except:
        return None
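
If some author entries lack the 'name' key altogether, `author['name']` raises a `KeyError` and the whole record is labelled `None`. The sketch below shows one possible variant that tolerates such entries by using `.get`. The function name `has_institutional_author` and the sample records are hypothetical, and whether missing keys actually occur in this dataset is an assumption, not something the cells above establish.

```python
def has_institutional_author(author_list):
    """Return 1 if any author entry carries a non-empty 'name' value, else 0."""
    if author_list is None:
        return None  # no author list: no determination possible
    # .get returns None when the key is absent, so mixed schemas do not raise.
    return int(any(author.get("name") is not None for author in author_list))

person = {"given_name": "Ada", "surname": "Lovelace"}        # no 'name' key
person_full = {"given_name": "Ada", "surname": "Lovelace", "name": None}
institution = {"given_name": None, "surname": None, "name": "Example Consortium"}

print(has_institutional_author([person]))               # person without the key
print(has_institutional_author([person_full]))          # person with name=None
print(has_institutional_author([person, institution]))  # mixed record
print(has_institutional_author(None))                   # missing author list
```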

In [24]:
df['Institutions as Authors'] = df.authors.map(lambda x: institution_as_author(x))

In [25]:
records_with_AuIns = df.loc[df['Institutions as Authors'] == 1]
prevalence_AuIns_only_authored_articles = ((len(records_with_AuIns))/(df.authors.notnull().sum())) * 100
prevalence_AuIns_all_articles = ((len(records_with_AuIns))/len(df)) * 100
#prevalence_AuIns #percentage of records with this specific issue
print("{:0.2f} percent of the article records list at least one institution as an author".format(prevalence_AuIns_all_articles))
print("{:0.2f} percent of the article records that actually have authors list at least one institution as an author".format(prevalence_AuIns_only_authored_articles))
print("Number of records: ",len(records_with_AuIns))

1.83 percent of the article records list at least one institution as an author
2.02 percent of the article records that actually have authors list at least one institution as an author
Number of records:  9684


## Affiliation Missing

This function will check if there is any data present within the Author `"Affiliation"` subfield.

We start by creating a variable that indicates whether affiliations are absent (`no_affil`, initially `True`). We then iterate through each author within a given record.

If an affiliation is present, we change the indicator to `False`. After checking the authors, we assign `1` to records in which every author is missing an affiliation, or `0` if at least one affiliation is present.

In [26]:
def affiliations_missing(authorList: list[dict[str]] | None) -> int | None:
    """Function to determine the presence of affiliations for authors within a record.

    Args:
        authorList (list[dict[str]] | None): The nested list of authors for each record.

    Returns:
        int | None: Returns a binary encoding (0, 1) depending on the (non-)existence of an issue.
        Returns None if the record is incomplete such that a determination cannot be made.
    """
    no_affil = True
    try:
        if len(authorList) == 0:
            return None
        else:
            for author in authorList:
                if author['affiliation'] is None:
                    continue
                else:
                    no_affil = False
        if no_affil:
            return 1
        else:
            return 0
    except:
        return None
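
The flag-and-loop logic is equivalent to asking whether all authors lack an affiliation. A compact sketch of that formulation follows. The name `all_affiliations_missing` and the sample records are hypothetical, and the use of `.get` means an entry without an 'affiliation' key is assumed to count as missing instead of returning `None`.

```python
def all_affiliations_missing(author_list):
    """Return 1 if no author has an affiliation, 0 if at least one does."""
    if not author_list:
        return None  # missing or empty author list: nothing to evaluate
    return int(all(author.get("affiliation") is None for author in author_list))

with_affil = {"given_name": "Ada", "affiliation": "Example University"}
without_affil = {"given_name": "Grace", "affiliation": None}

print(all_affiliations_missing([without_affil, without_affil]))  # every author lacks one
print(all_affiliations_missing([with_affil, without_affil]))     # one author has one
print(all_affiliations_missing([]))                              # empty list
```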

In [27]:
df['Affiliation Missing'] = df.authors.map(lambda x: affiliations_missing(x))

In [28]:
records_missing_affil = df.loc[df['Affiliation Missing'] == 1]
prevalence_miss_affil = ((len(records_missing_affil))/(df.authors.notnull().sum())) * 100
#prevalence_miss_affil
print("{:0.2f} percent of the article records miss the affiliation for every author.".format(prevalence_miss_affil))

82.37 percent of the article records miss the affiliation for every author.


## Checking for Honorifics in Author Names

This function uses a fixed list of honorifics that Phase 1 found in the Author `"Given"` and `"Family"` subfields.

After establishing our list, we then iterate through each Author of every record, putting their names into a lowercase format.

We then check the given and family names for the use of the listed titles. Return `1` if an honorific is present. Return `0` if none are found.

In [29]:
def honorific_checker(authorList: list[dict[str]] | None) -> int | None:
    """Function for detecting the use of honorifics within author names of a given record.

    Args:
        authorList (list[dict[str]] | None): The nested list of authors for each record

    Returns:
        int | None: Returns a binary encoding (0, 1) depending on the (non-)existence of an issue.
        Returns None if the record is incomplete such that a determination cannot be made.
    """
    titles_list = set(['dr.', 'prof', 'prof.', 'professor', 'doctor', 'dr', 'ingeniero'])
    try:
        for author in authorList:
            lowercase_given = author['given_name'].lower()
            lowercase_family = author['surname'].lower()
            if any(word in titles_list for word in lowercase_given.split()):
                return 1
            elif any(word in titles_list for word in lowercase_family.split()):
                return 1
            else: 
                continue
        return 0
    except:
        return None
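
The check compares whole whitespace-separated tokens against the list, which explains why the reported figures are lower bounds. The sketch below applies the same token test to a few made-up names. The helper `contains_honorific` and the samples are hypothetical.

```python
# Same honorifics as titles_list in honorific_checker.
TITLES = {"dr.", "prof", "prof.", "professor", "doctor", "dr", "ingeniero"}

def contains_honorific(name: str) -> bool:
    """True if any whitespace-separated, lowercased token is a listed title."""
    return any(token in TITLES for token in name.lower().split())

samples = ["Dr. Jane", "Jane, Dr.", "Drake", "Prof.Dr. Hans"]
flags = {name: contains_honorific(name) for name in samples}

for name, flagged in flags.items():
    print(repr(name), flagged)
```

"Drake" is correctly left alone because only whole tokens are compared, while "Prof.Dr. Hans" slips through because the two titles are fused into a single token.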

In [30]:
df['Author Use of Honorific'] = df.authors.map(lambda x: honorific_checker(x))

In [31]:
records_with_honorific = df.loc[df['Author Use of Honorific'] == 1]
prevalence_honorific_only_authored_articles = ((len(records_with_honorific))/df.authors.notnull().sum()) * 100
prevalence_honorific_all_articles = ((len(records_with_honorific))/len(df)) * 100
#prevalence_honorific
print("At least {:0.2f} percent of the article records contain honorifics within the author names.".format(prevalence_honorific_all_articles))
print("At least {:0.2f} percent of the article records that actually have authors contain honorifics within the author names.".format(prevalence_honorific_only_authored_articles))
print("Number of records: ",len(records_with_honorific))

At least 0.08 percent of the article records contain honorifics within the author names.
At least 0.09 percent of the article records that actually have authors contain honorifics within the author names.
Number of records:  442


## Uppercase Author Names

With this function we check each author's `"Given"` and `"Family"` name subfields to see if the input is in all uppercase letters.

We iterate through the author list and skip authors whose `"Given"` name is only `1` character long. These are likely to be initials and as such are covered by another dimension of issue detection. We then use the regular expression `(?:^[A-Z]+)$` to return a match when an author's name is in all uppercase letters.

If a match is found we return `1` to signify the existence of an issue. If no match is returned, we repeat the process using the author's `"Family"` name. If no match is found for the `"Family"` name, we proceed with the next author within the record until all authors have been checked.

In [32]:
def uppercase_name(authorList: list[dict[str]] | None) -> int | None:
    """Function for determining the presence of a name using exclusively capital letters.

    Args:
        authorList (list[dict[str]] | None): The nested list of authors for each record.

    Returns:
        int | None: Returns a binary encoding (0, 1) depending on the (non-)existence of an issue.
        Returns None if the record is incomplete such that a determination cannot be made.
    """
    try:
        for author in authorList:
            if len(author['given_name']) == 1:
                continue
            else:
                if re.match(r'(?:^[A-Z]+)$', author['given_name']):
                    return 1
                else:
                    if len(author['surname']) == 1:
                        continue
                    else:
                        if re.match(r'(?:^[A-Z]+)$', author['surname']):
                            return 1
                        else:
                            continue
        return 0
    except:
        return None
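
The expression only accepts an unbroken run of ASCII capitals, so names containing spaces, apostrophes, or accented capitals are not flagged. The sketch below illustrates this on made-up names and, purely for comparison, shows what Python's built-in `str.isupper()` would report. Using `isupper()` is a hypothetical alternative, not what the notebook does.

```python
import re

# Pattern copied from uppercase_name.
ALL_CAPS = re.compile(r"(?:^[A-Z]+)$")

samples = ["SMITH", "Smith", "DE LA CRUZ", "M\u00dcLLER", "O'NEIL"]  # \u00dc is U-umlaut

regex_flags = {name: bool(ALL_CAPS.match(name)) for name in samples}
# str.isupper() is True when all cased characters are uppercase.
isupper_flags = {name: name.isupper() for name in samples}

for name in samples:
    print(repr(name), regex_flags[name], isupper_flags[name])
```

The comparison suggests the regular expression undercounts multi-word and non-ASCII uppercase names, whereas `isupper()` would also flag initials such as "J.", so either choice involves a trade-off.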

In [33]:
df['Author Name in All Caps'] = df.authors.map(lambda x: uppercase_name(x))

In [34]:
records_uppercase = df.loc[df['Author Name in All Caps'] == 1]
prevalence_uppercase_only_authored_articles = ((len(records_uppercase))/df.authors.notnull().sum()) * 100
prevalence_uppercase_all_articles = ((len(records_uppercase))/len(df)) * 100
#prevalence_uppercase
print("{:0.2f} percent of the article records contain author names in all uppercase letters.".format(prevalence_uppercase_all_articles))
print("{:0.2f} percent of the article records that actually have authors contain author names in all uppercase letters.".format(prevalence_uppercase_only_authored_articles))
print("Number of records: ",len(records_uppercase))

5.18 percent of the article records contain author names in all uppercase letters.
5.74 percent of the article records that actually have authors contain author names in all uppercase letters.
Number of records:  27457


## Non-Latin Characters

This function detects the use of non-Latin character sets. In particular, we are interested in practices of romanization and when they occur: in which journals, whether the *language* fields are present and accurate, and so on.

First, we have to identify which records use non-Latin characters.

This is split into two functions. The first uses the regular expression `(?:[^ı́\x00-\xff])` to detect any characters not in ISO-8859-1 (Latin-1) (see note).

The second uses the first function to check each author within a given record.

Note: This expression is providing a few too many false positives for my liking. I'm currently working on a better expression or a different solution entirely.

In [35]:
def isLatinChar(authorName: str) -> bool:
    """Helper function for evaluating the presence of Latin characters.

    Args:
        authorName (str): An Author's name.

    Returns:
        bool: True if non-Latin characters are used. False if exclusively Latin characters are used.
    """
    regexp = re.compile(r'(?:[^ı́\x00-\xff])')
    if regexp.search(authorName):
        return True
    else:
        return False
def latin_script_checker(authorList: list[dict[str]] | None) -> int | None:
    """Function for evaluating author names within each record for the presence
    of non-Latin characters.

    Args:
        authorList (list[dict[str]] | None): The nested list of authors for each record.

    Returns:
        int | None: Returns a binary encoding (0, 1) depending on the (non-)existence of an issue.
        Returns None if the record is incomplete such that a determination cannot be made.
    """
    try:
        latin_scripts = [author for author in authorList if isLatinChar(author['given_name'])]
        if len(latin_scripts) > 0:
            return 1 # non-latin script found
        else:
            return 0 # no issue
    except:
        return None
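
One source of false positives is that Latin-1 does not cover every Latin-script letter. The sketch below contrasts the Latin-1 test with a hypothetical script-based check built on the standard `unicodedata` module. The function names and sample names are made up, the character class is written with escapes on the assumption that the original contains U+0131 and U+0301, and the alternative is only an approximation (it inspects Unicode character names, not a formal script property).

```python
import re
import unicodedata

# Assumed equivalent of the class in isLatinChar: anything outside Latin-1,
# except U+0131 (dotless i) and U+0301 (combining acute accent).
NOT_LATIN1 = re.compile(r"[^\u0131\u0301\x00-\xff]")

def outside_latin1(name: str) -> bool:
    """True if the name contains a character the Latin-1 test would flag."""
    return bool(NOT_LATIN1.search(name))

def has_non_latin_letter(name: str) -> bool:
    """Hypothetical alternative: flag letters whose Unicode name is not LATIN."""
    for ch in name:
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN"):
            return True
    return False

# Jose with e-acute, Lukasz with L-stroke, and Ivan in Cyrillic.
samples = ["Jos\u00e9", "\u0141ukasz", "\u0418\u0432\u0430\u043d"]
for name in samples:
    print(name, outside_latin1(name), has_non_latin_letter(name))
```

The second sample is written in Latin script but falls outside Latin-1, so the Latin-1 test flags it while the script-based check does not.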

In [36]:
df['Non-ASCII Characters'] = df.authors.map(lambda x: latin_script_checker(x))

In [37]:
records_with_non_latin = df.loc[df['Non-ASCII Characters'] == 1]
prevalence_NonLatin = ((len(records_with_non_latin))/(df.authors.notnull().sum())) * 100
#prevalence_NonLatin #percentage of records with this specific issue
print("{:0.2f} percent of the article records contain author (given) names in non-Latin letters.".format(prevalence_NonLatin))

2.73 percent of the article records contain author (given) names in non-Latin letters.


## Abstract Multi-lingualism Detection
This function detects the use of more than one language within the *abstract* field. As mentioned before, we're interested in how people practice recording metadata as it pertains to language.

Here we have a list of ISO 639-1 language codes. While it is not exhaustive (there are 183 officially assigned codes, and only 93 are present in this list), it does include many of the macrolanguages under which many other languages fall.

We pass this list to `langid` to obtain higher confidence in its identification. For example, an abstract might be in Malay (ms), but the identifier might return both 'ms' and 'id' (Indonesian) with a lower confidence score for each. As Malay is the macrolanguage that covers Indonesian, we keep 'ms' but not 'id'.

The first `try:` block filters out records without abstracts; then we tokenize the abstracts by sentence.

Next we pick out the first sentence and a sentence near the end of each abstract. Most multilingual abstracts are written first in one language and then a second time in another, so the two ends usually differ in language. The intent is to use the second-to-last sentence rather than the last, because footnotes or citations are not uncommon at the end of abstracts in these metadata records, and their odd syntactic structure can lead to incorrect detection. Note that the code cell below currently indexes the last sentence (`tokenized[-1]`).

We then classify both sentences and evaluate the confidence scores. If a score is not above 0.95, that result is omitted.

We then check whether more than one language is present in the dictionary with `len(set(lang_dict.keys()))`. If so, the record is returned with a **1**, indicating an issue. Otherwise it is returned with a **0**.

If this is the first time running this notebook, you may need to uncomment the top two lines of the cell:

`import nltk`

`nltk.download('punkt')`

This is necessary for `sent_tokenize` to work as intended.


In [38]:
#import nltk
#nltk.download('punkt')
identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs = True)
lang_list = ['af', 'am', 'ar', 'as', 'az', 'be', 'bg', 'bn', 'br', 
             'bs', 'ca', 'cs', 'cy', 'da', 'de', 'dz', 'el', 'en', 'eo', 
             'es', 'et', 'eu', 'fa', 'fi', 'fo', 'fr', 'ga', 'gl', 'gu', 
             'he', 'hi', 'hr', 'ht', 'hu', 'hy', 'is', 'it', 'ja', 'jv', 
             'ka', 'kk', 'km', 'kn', 'ko', 'ku', 'ky', 'la', 'lb', 'lo', 
             'lt', 'lv', 'mg', 'mk', 'ml', 'mn', 'mr', 'ms', 'mt', 'ne', 
             'nl', 'no', 'oc', 'or', 'pa', 'pl', 'ps', 'pt', 'qu', 'ro', 
             'ru', 'rw', 'se', 'si', 'sk', 'sl', 'sq', 'sr', 'sv', 'sw', 
             'ta', 'te', 'th', 'tl', 'tr', 'ug', 'uk', 'ur', 'vi', 'vo', 
             'wa', 'xh', 'zh', 'zu']
identifier.set_languages(langs=lang_list)
def lang_checker(abstracts: list[str] | None) -> int | None:
    """Function for checking for multilingual abstracts by analysing the language
    used at the beginning and end of each abstract within a record.

    Args:
        abstracts (list[str] | None): List of abstracts.

    Returns:
        int | None: Returns a binary encoding (0, 1) depending on the (non-)existence of an issue.
        Returns None if the record is incomplete such that a determination cannot be made.
    """
    try:
        for abstract in abstracts:
            # Tokenizing abstracts
            tokenized = sent_tokenize(abstract)
            # Compare the first sentence with the last one (index -1)
            startAndFinish = [tokenized[0], tokenized[-1]]
            # Detecting languages present: each result is a (code, confidence) pair
            detected = [identifier.classify(sentence) for sentence in startAndFinish]
            # Filter low confidence results
            lang_dict = {key: value for (key, value) in detected if value > .95}
            # Labeling specific issues found in record
            if len(set(lang_dict.keys())) > 1:
                return 1 #Multiple languages detected
    except Exception:
        return None #No usable abstract (missing, empty, or not tokenizable)
    return 0 #No issues

In [39]:
df['Multilingual Abstract'] = df.abstracts.map(lang_checker)

In [40]:
records_with_MultiLang = df.loc[(df['Multilingual Abstract'] == 1)]
prevalence_MultiLang_only_articles_with_abstract = ((len(records_with_MultiLang))/(df.abstracts.notnull().sum())) * 100
prevalence_MultiLang_all_articles = ((len(records_with_MultiLang))/len(df)) * 100
#prevalence_MultiLang #returning a percent of the total number of records with this particular issue
print("{:0.2f} percent of the article records contain abstracts that contain more than one language.".format(prevalence_MultiLang_all_articles))
print("{:0.2f} percent of the article records with abstract contain abstracts that contain more than one language.".format(prevalence_MultiLang_only_articles_with_abstract))
print("Number of records: ",len(records_with_MultiLang))

0.40 percent of the article records contain abstracts that contain more than one language.
1.61 percent of the article records with abstract contain abstracts that contain more than one language.
Number of records:  2099


In [41]:
# Add in count of abstracts. Count of different abstracts in different languages

### Explore the multilingual abstracts

In [42]:
#records_with_MultiLang['DOI'].head(50)

## Abstract Language Checking
This function will check the detected language of the abstract against the stated language of the record.

It is a relatively straightforward function: `try:` filters out records without an *abstract*, then the function classifies the language, and finally it checks whether the returned code matches what is recorded in the language field.

We use `df.apply` instead of `df.column.map` because the check needs several fields of a record rather than the contents of a single column.

Here, it should be mentioned, there is some ambiguity. The *language* field is not clearly defined (is it the language of the Item, the Container, or the record?). The prevalence of this issue (seen below) reflects the lack of clarity in what this field is meant to represent.
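The three possible outcomes of such a check (0 for a match, 1 for a mismatch, `None` when no determination can be made) can be sketched as follows. `language_mismatch` and `stub_classify` are hypothetical names; the stub only imitates the `(code, confidence)` return shape of a language identifier and performs no real detection.

```python
from typing import Callable, Optional

def language_mismatch(
    text: Optional[str],
    stated: Optional[str],
    classify: Callable[[str], tuple[str, float]],
    threshold: float = 0.99,
) -> Optional[int]:
    """Return 0 (match), 1 (mismatch) or None (cannot be determined)."""
    if not text or stated is None:
        return None  # nothing to classify, or nothing to compare against
    code, confidence = classify(text)
    if confidence <= threshold:
        return None  # detection too uncertain to label the record
    return 0 if code == stated else 1

def stub_classify(text: str) -> tuple[str, float]:
    """Hypothetical stand-in for a language identifier."""
    return ("es", 0.999) if "estudio" in text else ("en", 0.999)

match = language_mismatch("This study examines metadata.", "en", stub_classify)
mismatch = language_mismatch("Este estudio examina metadatos.", "en", stub_classify)
unknown = language_mismatch("This study examines metadata.", None, stub_classify)
```

Keeping `None` separate from 0 means that incomplete records are not counted as error-free.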

In [43]:
def article_journal_language(record: pd.Series) -> int | None:
    """Function to determine if the stated language of an article
    matches the stated language of the journal within a given record.

    Args:
        record (pd.Series): A metadata record from the dataframe.

    Returns:
        int | None: Returns a binary encoding (0, 1) depending on the (non-)existence of an issue.
        Returns None if the record is incomplete such that a determination cannot be made.
    """
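    try:
        # Missing values are not special-cased here, so a record that lacks
        # one of the two languages is labeled as a mismatch
        if record['article_lang'] == record['journal_lang']:
            return 0
        else:
            return 1
    except Exception:
        return None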

def abstract_lang_checker(record: pd.Series) -> int | None:
    """Function to determine if the detected language of an abstract
    matches the stated language of the journal within a given record.

    Args:
        record (pd.Series): A metadata record from the dataframe.

    Returns:
        int | None: Returns a binary encoding (0, 1) depending on the (non-)existence of an issue.
        Returns None if the record is incomplete such that a determination cannot be made.
    """
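    try:
        for abstract in record['abstracts']:
            # Only the first abstract is evaluated: every branch below returns
            if not abstract:
                return None
            code, confidence = identifier.classify(abstract)
            if confidence <= .99:
                return None  # detection not confident enough
            if code == record['journal_lang']:
                return 0
            if record['journal_lang'] is None:
                return None  # no stated language to compare against
            return 1
    except Exception:
        return None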

In [44]:
df['Abstract Language Match'] = df.apply(abstract_lang_checker, axis=1)
df['Article-Journal Language Match'] = df.apply(article_journal_language, axis=1)

In [45]:
records_with_AbstractLang = df.loc[(df['Abstract Language Match'] == 1)] #creating a df with only the records with these errors
prevalence_AbstractLang_only_articles_with_abstract = ((len(records_with_AbstractLang))/len(df.loc[(df.abstracts.notnull()) & (df.article_lang.notnull())])) * 100
prevalence_AbstractLang_all_articles = ((len(records_with_AbstractLang))/len(df)) * 100
#prevalence_AbstractLang #returning a percent of the total number of records with this particular issue
print("{:0.2f} percent of the article records with abstracts and language information have a mismatch between the given language and the detected language of the article abstract.".format(prevalence_AbstractLang_only_articles_with_abstract))
print("{:0.2f} percent of all article records have a mismatch between the given language and the detected language of the article abstract.".format(prevalence_AbstractLang_all_articles))

54.24 percent of the article records with abstracts and language information have a mismatch between the given language and the detected language of the article abstract.
0.58 percent of all article records have a mismatch between the given language and the detected language of the article abstract.


In [46]:
#records_with_AbstractLang['DOI'].head(50)

## Title Language Checking
This function will check the detected language of the title against the stated language of the record.

It is a relatively straightforward function: `try:` filters out records without a *title*, then the function classifies the language, and finally it checks whether the returned code matches what is recorded in the language field.

We use `df.apply` instead of `df.column.map` because the check needs several fields of a record rather than the contents of a single column.

Here, it should be mentioned, there is some ambiguity. The *language* field is not clearly defined (is it the language of the Item, the Container, or the record?). The prevalence of this issue (seen below) reflects the lack of clarity in what this field is meant to represent.
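Because a missing language can appear as `None` or as `NaN` depending on how the data was loaded, a comparison like this one may benefit from treating both the same way. The helper below is a hypothetical sketch of such a check; `is_missing` is not part of the notebook.

```python
import math

def is_missing(value) -> bool:
    """True for None, a float NaN, or an empty or whitespace-only string."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and not value.strip()

checks = {
    "none": is_missing(None),
    "nan": is_missing(float("nan")),
    "blank": is_missing("  "),
    "code": is_missing("en"),
}
print(checks)
```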

In [47]:
def title_lang_checker(record: pd.Series) -> int | None:
    """Function to determine if the detected language of the title matches
    the stated language of the journal within a given record.

    Args:
        record (pd.Series): A metadata record from the dataframe.

    Returns:
        int | None: Returns a binary encoding (0, 1) depending on the (non-)existence of an issue.
        Returns None if the record is incomplete such that a determination cannot be made.
    """
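    try:
        if not record['article_title']:
            return None  # no title
        for title in record['article_title']:
            # Only the first title is evaluated: every branch below returns
            code, confidence = identifier.classify(title)
            if confidence <= .99:
                return None  # detection not confident enough
            if code == record['journal_lang']:
                return 0
            # Only a NaN journal language is treated as missing here;
            # a None value falls through and is labeled as a mismatch
            if str(record['journal_lang']).lower() == 'nan':
                return None
            return 1
    except Exception:
        return None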

In [48]:
df['Title Language Match'] = df.apply(title_lang_checker, axis=1)

In [49]:
records_with_TitleLang = df.loc[(df['Title Language Match'] == 1)] #creating a df with only the records with these errors
prevalence_TitleLang_only_articles_with_title = ((len(records_with_TitleLang))/len(df.loc[(df.article_title.notnull()) & (df.journal_lang.notnull())])) * 100
prevalence_TitleLang_all_articles = ((len(records_with_TitleLang))/len(df)) * 100
#prevalence_TitleLang #returning a percent of the total number of records with this particular issue
print("{:0.2f} percent of the article records with titles and language information have a mismatch between the given language and the detected language of the article title.".format(prevalence_TitleLang_only_articles_with_title))
print("{:0.2f} percent of all article records have a mismatch between the given language and the detected language of the article title.".format(prevalence_TitleLang_all_articles))

23.09 percent of the article records with titles and language information have a mismatch between the given language and the detected language of the article title.
18.17 percent of all article records have a mismatch between the given language and the detected language of the article title.


## Total Errors
Lastly, we'll add up all of the errors for each record and store the number in the *'Total Errors'* column.
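Since the labels are 0, 1, or missing, the total for a record is simply the number of labels equal to 1. The pure-Python sketch below mirrors a row-wise sum that skips missing values (the default behaviour of pandas' `sum`). The helper `total_errors` and the example record are hypothetical.

```python
from typing import Optional

def total_errors(labels: dict[str, Optional[int]]) -> int:
    """Sum the 0/1 labels of one record, ignoring labels that are None."""
    return sum(value for value in labels.values() if value is not None)

# Hypothetical record: two issues found, one label undetermined.
example_record = {
    "Author Missing": 0,
    "Affiliation Missing": 1,
    "Title Language Match": None,
    "Article-Journal Language Match": 1,
}
example_total = total_errors(example_record)
print(example_total)
```

An undetermined label therefore contributes nothing to the total, just like a label of 0.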

In [50]:
# Labeled Columns
column_list = ['Author Missing', 'Article Language Missing', 'Journal Language Missing',
                'Abstract Missing', 'Article Title Missing', 'Journal Title Missing',
                'Institutions as Authors', 'Affiliation Missing', 'Author Use of Honorific',
                'Author Name in All Caps', 'Non-ASCII Characters', 'Multilingual Abstract',
                'Abstract Language Match', 'Article-Journal Language Match', 'Title Language Match'
                ]
df['Total Errors'] = df[column_list].sum(axis=1)

In [51]:
#Taking a look at the df
df.head()

Unnamed: 0,abstract_langs,abstracts,article_lang,article_title,authors,doi,journal_lang,journal_title,publisher_name,Author Missing,...,Institutions as Authors,Affiliation Missing,Author Use of Honorific,Author Name in All Caps,Non-ASCII Characters,Multilingual Abstract,Abstract Language Match,Article-Journal Language Match,Title Language Match,Total Errors
0,,,,[Revolving Boot Crimping Machine],,10.1038/scientificamerican04281849-249k,,Scientific American,Springer Science and Business Media LLC,1.0,...,,,,,,,,0,,4.0
1,,,,[Development of Desktop CNC Lathe with Pipe Fr...,"[{'affiliation': None, 'given_name': 'Naohiko'...",10.2493/jjspe.85.189,en,Journal of the Japan Society for Precision Eng...,Japan Society for Precision Engineering,0.0,...,0.0,1.0,0.0,1.0,0.0,,,1,,5.0
2,,,,[0547 Predicting Response to Oral Appliance Th...,"[{'affiliation': 'Zephyr Sleep Technologies, C...",10.1093/sleep/zsy061.546,en,Sleep,Oxford University Press (OUP),0.0,...,0.0,0.0,0.0,0.0,0.0,,,1,0.0,3.0
3,,,,[Anomalous absorption of hydrogen peroxide (H2...,"[{'affiliation': None, 'given_name': 'Mohit K....",10.1016/j.jqsrt.2020.107085,en,Journal of Quantitative Spectroscopy and Radia...,Elsevier BV,0.0,...,0.0,1.0,0.0,0.0,0.0,,,1,0.0,4.0
4,,,,[Cultura emprendedora de los estudiantes de Ma...,"[{'affiliation': None, 'given_name': None, 'na...",10.24265/iggp.2020.v7n1.09,,Revista en Gobierno y Gestión Pública,Universidad de San Martin de Porres,0.0,...,1.0,1.0,,,,,,0,1.0,6.0


In [52]:
df.to_parquet(output_dir / '03_labeled_data.parquet', index=False)