# Labeling the Data
We identified many issues in Phase 1. Now we will go through the records in this sample and programmatically label those that contain some of those issues.
## Author Problems
We'll start off by investigating the *author* field. This area was found to have a number of potentially high-priority issues pertaining to social and political matters, and the field has also seen some of the most pervasive standardization issues.

Start by importing the packages we'll need, setting up our directories, and loading in the data.

In [16]:
import pandas as pd #Creating dataframe and manipulating data
from bs4 import BeautifulSoup as bs # for cleaning xml tags
import re #regular expressions used for detection of initials
import py3langid as langid #For language detection
from nltk.tokenize import sent_tokenize #Tokenizing abstracts during language detection
from pathlib import Path

In [17]:
# Data Directory
data_dir = Path('../data')
input_dir = data_dir / 'input'
output_dir = data_dir / 'output'
# Loading in dataset
df = pd.read_csv(input_dir / '02_cleaned_data.csv', 
                 usecols=['Unnamed: 0', 'publisher', 'container-title', 'language', 'DOI', 'published', 
                          'created', 'deposited', 'title', 'author', 'abstract', 'original-title'],
                 parse_dates=['created', 'deposited'],
                 infer_datetime_format=True, 
                 index_col='Unnamed: 0')
df.index.names = ['Index']
df.head()

Index,publisher,DOI,created,title,author,container-title,language,deposited,published,abstract,original-title
0,Wiley,10.1002/(sici)1099-1727(200021)16:1<27::aid-sd...,2002-09-10,The validation of commercial system dynamics m...,"[{'given': 'Geoff', 'family': 'Coyle', 'sequen...",System Dynamics Review,en,2021-07-01,2000.0,,
1,Springer Science and Business Media LLC,10.1007/bf02653972,2007-07-17,Effect of system geometry on the leaching beha...,"[{'given': 'C.', 'family': 'Vu', 'sequence': '...",Metallurgical Transactions B,en,2019-05-20,1979.0,,
2,Wiley,10.1111/reel.12221,2017-12-01,The international law on transboundary haze po...,"[{'given': 'Shawkat', 'family': 'Alam', 'seque...","Review of European, Comparative &amp; Internat...",en,2017-12-01,2017.0,,
3,Crop Science Society of Japan,10.1626/jcs.20.219,2011-09-20,Studies on the influence of pruning on the veg...,"[{'given': 'C.', 'family': 'TSUDA', 'sequence'...",Japanese Journal of Crop Science,en,2021-04-30,1951.0,,
4,Elsevier BV,10.1016/j.pneumo.2018.09.002,2018-10-10,Le tabagisme et l’aide à l’arrêt du tabac des ...,"[{'given': 'J.', 'family': 'Perriot', 'sequenc...",Revue de Pneumologie Clinique,fr,2019-10-26,2018.0,,


## Missing Values in Common Fields
This is a relatively easy problem to label, so we'll tackle it first.

We'll set up a column *'no_author'* and assign `0` to all of the records. Then we will locate the records missing an author and change their value to `1`.

Then we'll do the same for the *language, abstract,* and *title* fields.

In [18]:
#Authors
df['no_author'] = float(0)
df.loc[df.author.isna(), 'no_author'] = float(1)
#Language
df['no_language'] = float(0)
df.loc[df.language.isna(), 'no_language'] = float(1)
#Abstracts
df['no_abstract'] = float(0)
df.loc[df.abstract.isna(), 'no_abstract'] = float(1)
#Titles
df['no_title'] = float(0)
df.loc[df.title.isna(), 'no_title'] = float(1)

In [19]:
# Prevalence of missing values
#Missing Authors
prevalence_AuMis = (len(df.loc[df.no_author == 1])/len(df)) * 100
prevalence_AuMis # percentage of records with this specific issue

9.368

In [20]:
#Missing Language
prevalence_LangMis = (len(df.loc[df.no_language == 1])/len(df)) * 100
prevalence_LangMis

20.244999999999997

In [21]:
#Missing Abstract
prevalence_AbsMis = (len(df.loc[df.no_abstract == 1])/len(df)) * 100
prevalence_AbsMis

83.892

In [22]:
#Missing Title
prevalence_TitleMis = (len(df.loc[df.no_title == 1])/len(df)) * 100
prevalence_TitleMis

1.0070000000000001
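
Because each label column holds only `0.0` and `1.0`, the same percentages can also be read off as the column mean. The snippet below is a minimal sketch on a made-up four-row frame; the `toy` data and the `prevalence` helper are illustrative assumptions and are not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for the real sample (hypothetical data).
toy = pd.DataFrame({"author": ["[{'given': 'Ada'}]", None, "[{'given': 'Alan'}]", None]})

# Label missing values in one step: True/False from isna() becomes 1.0/0.0.
toy["no_author"] = toy["author"].isna().astype(float)

def prevalence(labels):
    """Percentage of rows labeled 1.0 (hypothetical helper)."""
    return labels.mean() * 100

print(prevalence(toy["no_author"]))
```

The mean works here only because the labels are strictly 0 or 1; a column that can also hold `None` would need its denominator chosen explicitly, as the later cells do.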

## Problem Detection Functions and Data Labeling
### Author Sequence
Our first function checks the *sequence* sub-field within the *author* field, where each author is listed as either 'first' or 'additional'. The function sets up a counter, then iterates through the author list of a record to check the noted sequence for each author.

The `try` block filters out records that have no authors listed. After that we begin to iterate through each author within a given record.

`if 'name' in author.keys():` filters out institutions listed as authors, since the 'name' key is often how an institution is presented as an author within the metadata record. The code within the `if` block says that if an institution is the only author listed, the counter is increased to 1. The code then continues down to the `return` statements, where **0** is returned, as technically there is no issue with sequence in that record.

The `else: if author['sequence'] == 'first'` block is where the bulk of the counting happens. Up to this point we are mostly filtering out instances that don't apply to the problem at hand. Put simply, the function counts how many authors are labeled as 'first'. Once all authors of a record have been parsed, we go to the `return` statements.

In [23]:
def sequence_checker(authorList):
    counter = 0
    try:
        for author in authorList:
            if 'name' in author.keys():
                #institution: only counted when it is the sole author
                if len(authorList) == 1:
                    counter += 1
            else:
                if author['sequence'] == 'first':
                    counter += 1
        if counter == 0:
            return 1 #no first author
        elif counter > 1:
            return 1 #multiple first authors
        else:
            return 0 #no issue
    except Exception:
        return None #no author list to check
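
To make the return values concrete, here is a small sketch that restates the same counting logic in condensed form and runs it on a few hand-written author lists. The function name `sequence_label` and all of the sample records are hypothetical; the notebook itself uses `sequence_checker`.

```python
def sequence_label(author_list):
    """Condensed restatement of the sequence check (illustrative only)."""
    try:
        firsts = 0
        for author in author_list:
            if "name" in author:
                # A lone institutional author is counted once.
                if len(author_list) == 1:
                    firsts += 1
            elif author["sequence"] == "first":
                firsts += 1
    except (TypeError, KeyError):
        return None  # no usable author list
    return 0 if firsts == 1 else 1

one_first = [{"given": "Ada", "family": "Lovelace", "sequence": "first"},
             {"given": "Alan", "family": "Turing", "sequence": "additional"}]
two_first = [{"given": "Ada", "family": "Lovelace", "sequence": "first"},
             {"given": "Alan", "family": "Turing", "sequence": "first"}]
no_first = [{"given": "Ada", "family": "Lovelace", "sequence": "additional"}]
lone_institution = [{"name": "Example Consortium", "sequence": "first"}]

for sample in (one_first, two_first, no_first, lone_institution, None):
    print(sequence_label(sample))
```

Only the list with exactly one 'first' author and the lone institution come back as `0`; the missing list comes back as `None` rather than as an error label.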

In [24]:
# The 'author' column needs to be evaluated and formatted before parsing,
# otherwise its values are treated as strings instead of lists of dicts.
import ast
def reformat_col(record):
    try:
        formed = ast.literal_eval(record)
        return formed
    except:
        return None

cols_to_reformat = ['author']
for col in cols_to_reformat:
    df[col] = df[col].apply(lambda x: reformat_col(x))
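
The reason this step is needed is that a CSV stores the author list as its printed text. The following sketch, using a made-up record, shows the string being turned back into a list of dicts with `ast.literal_eval`, and a missing value being mapped to `None`. The helper name `safe_parse` is an assumption for illustration.

```python
import ast

# A hypothetical author cell as it comes back from read_csv: plain text.
raw = "[{'given': 'Ada', 'family': 'Lovelace', 'sequence': 'first'}]"

def safe_parse(cell):
    """Parse a Python-literal string; return None when that is not possible."""
    try:
        return ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        return None

parsed = safe_parse(raw)
print(type(raw).__name__, "->", type(parsed).__name__)
print(parsed[0]["family"])
print(safe_parse(float("nan")))  # a missing author cell
```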

In [25]:
df['author_sequence'] = df.author.map(lambda x: sequence_checker(x))

In [26]:
records_with_AuSeq = df.loc[(df.author_sequence == 1)] #creating a df with only the records with these errors
prevalence_AuSeq = ((len(records_with_AuSeq))/(df.author.notnull().sum())) * 100
prevalence_AuSeq #returning a percent of the total number of records with this particular issue

1.1839085532703681

### Author Initials
This function will utilize regular expressions for detecting the use of initials. Specifically, we are looking for when initials are used in totality, that is to say a record with "Marianne E." will not be flagged, whereas a record with "D." will.

We look in both the 'given' and the 'family' sub-fields as this use of initials has been found in both sub-fields previously. 

The flow of the function is similar to that of `sequence_checker`: we filter out records with `null` authors in the first `try` statement, iterate through the author list, and then use another `try` statement to filter out institutions as authors.

The regular expression combines two alternatives separated by `|`: `^(?:[A-Z]\W{,3}\s?){,3}$` and `(?:[^\W\d_.]\W){1,2}\B$`. Both look for initials. The former looks for them in ASCII capital letters, whereas the latter accepts any Unicode letter and so covers the same pattern in non-Latin characters.

The check on `len(author['given']) == 1`, alongside the regular expression match, ensures that single-character names are also caught. A record with any such name is returned with the appropriate label.

In [27]:
def author_initials_checker(authorList):
    initials = r"^(?:[A-Z]\W{,3}\s?){,3}$|(?:[^\W\d_.]\W){1,2}\B$"
    try: #Filter for no authors
        for author in authorList: #iterating through author array
            try: #filter for institutions as authors
                detector = re.match(initials, author['given'].capitalize()) #checking for initials in given
                if detector != None or len(author['given']) == 1:
                    return 1 #initials used
                family_detector = re.match(initials, author['family'].capitalize()) #initials in family
                if family_detector != None or len(author['family']) == 1:
                    return 1 #initials used
            except Exception:
                pass #institution, or a name part is missing
    except Exception:
        return None
    return 0 #no issue
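
As a quick check of what the pattern does and does not catch, the sketch below applies the same regular expression, with the same `capitalize()` step, to a handful of made-up names. The helper `looks_like_initials` is a hypothetical wrapper written for this illustration.

```python
import re

# Same pattern as in author_initials_checker.
INITIALS = re.compile(r"^(?:[A-Z]\W{,3}\s?){,3}$|(?:[^\W\d_.]\W){1,2}\B$")

def looks_like_initials(name):
    """True when a name part consists only of initials (illustrative helper)."""
    return INITIALS.match(name.capitalize()) is not None or len(name) == 1

for name in ["D.", "Marianne E.", "J.R.", "Geoff", "Д."]:
    print(repr(name), looks_like_initials(name))
```

`D.`, `J.R.` and the Cyrillic `Д.` are flagged, while `Marianne E.` and `Geoff` are not.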

In [28]:
df['author_initials'] = df.author.map(lambda x: author_initials_checker(x))

In [29]:
records_with_initials = df.loc[df.author_initials == 1]
prevalence_author_initials = ((len(records_with_initials))/(df.author.notnull().sum())) * 100
prevalence_author_initials # percentage of records with this specific issue

19.010945361461733

### Institutions as Authors
This function will address instances in which institutions are recorded as authors.

`try:` will filter out records with `null` authors. Then we have the `institutions_present` list that looks for the telltale sign of an institution, the 'name' sub-field. 

If the list is populated with any authors, then the appropriate label signalling an institution will be returned.

In [30]:
def institution_as_author(authorList):
    try:
        institutions_present = [author for author in authorList if 'name' in author.keys()]
        if len(institutions_present) > 0:
            return 1 #institution as author
        else:
            return 0 #no issue
    except:
        return None
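
The same test can be written more compactly with `any`. This is only a sketch of an equivalent formulation on made-up records; the name `has_institution` is hypothetical.

```python
def has_institution(author_list):
    """1 if any author entry carries a 'name' key, 0 if none, None if no list."""
    try:
        return int(any("name" in author for author in author_list))
    except TypeError:
        return None  # author field was missing

print(has_institution([{"name": "Example Consortium"}]))
print(has_institution([{"given": "Ada", "family": "Lovelace"}]))
print(has_institution(None))
```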

In [31]:
df['author_institutions'] = df.author.map(lambda x: institution_as_author(x))

In [32]:
records_with_AuIns = df.loc[df.author_institutions == 1]
prevalence_AuIns = ((len(records_with_AuIns))/(df.author.notnull().sum())) * 100
prevalence_AuIns #percentage of records with this specific issue

1.9231617971577366

### Non-Latin Characters

This function detects the use of non-Latin character sets. We are particularly interested in romanization practices and where they occur: in which journals, whether the *language* field is present and accurate, and so on.

First, we have to identify which records are using non-latin characters.

This is split into two functions. The first uses the regular expression `(?:[^ı́\x00-\xff])` to detect any character outside ISO-8859-1 (Latin-1); see the note below.

The second applies the first function to the given name of each author within a record.

Note: This expression is providing a few too many false positives for my liking. I'm currently working on a better expression or a different solution entirely.

In [33]:
def isLatinChar(text):
    #Despite its name, returns True when text contains a character outside Latin-1
    regexp = re.compile(r'(?:[^ı́\x00-\xff])')
    return bool(regexp.search(text))
def latin_script_checker(authorList):
    try:
        #authors whose given name contains a non-Latin character
        latin_scripts = [author for author in authorList if isLatinChar(author['given'])]
        if len(latin_scripts) > 0:
            return 1 # non-latin script found
        else:
            return 0 # no issue
    except Exception:
        return None
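
One source of false positives is that Latin letters outside the Latin-1 range, such as `Š` (U+0160), are treated as non-Latin. The sketch below shows one possible alternative, which is not the approach used in this notebook: it asks `unicodedata` for each letter's Unicode name and checks whether that name mentions LATIN. The function name `has_non_latin_letter` and the sample names are assumptions, and `latin1_only` is a simplified version of the expression above without the dotless-i exception.

```python
import re
import unicodedata

# Simplified Latin-1 range test, for comparison only.
latin1_only = re.compile(r"[^\x00-\xff]")

def has_non_latin_letter(text):
    """True if any alphabetic character is not named as a LATIN letter."""
    for ch in text:
        if ch.isalpha() and "LATIN" not in unicodedata.name(ch, ""):
            return True
    return False

for name in ["Šárka", "José", "Иван", "田中"]:
    print(name, bool(latin1_only.search(name)), has_non_latin_letter(name))
```

With this approach `Šárka` is no longer flagged, while the Cyrillic and CJK names still are. Punctuation and combining marks are skipped because they are not alphabetic.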

In [34]:
df['author_characters'] = df.author.map(lambda x: latin_script_checker(x))

In [35]:
records_with_non_latin = df.loc[df.author_characters == 1]
prevalence_NonLatin = ((len(records_with_non_latin))/(df.author.notnull().sum())) * 100
prevalence_NonLatin #percentage of records with this specific issue

2.21555300556095

### Abstract Multi-lingualism Detection
This function will detect the use of more than one language within the *abstract* field. As mentioned before, we're interested in how people practice recording metadata as it pertains to language.

Here we have a list of ISO 639-1 language codes. While it is not exhaustive (there are 183 officially assigned codes, and only 93 are present in this list), it does include many of the macrolanguages under which many other languages fall.

We pass this list to `langid` to restrict the candidate languages, which gives more confident identifications. For example, an abstract might be in Malay (ms), but an unrestricted identifier might return both 'ms' and 'id' (Indonesian), each with a lower confidence score. As Malay is the macrolanguage that covers Indonesian, we will keep 'ms' but not 'id'.

The first `try:` block filters for records without abstracts present, then we tokenize the abstracts by sentence.

Next we pick out two sentences from each abstract: the second sentence (`tokenized[1]`) and the second-to-last sentence (`tokenized[-2]`). We take the second-to-last sentence because most multilingual abstracts are written first in one language and then again in another. We avoid the last sentence because footnotes or citations are often present at the end of the abstracts in these metadata records, and their odd syntactic structure can lead to an incorrect detection.

We then classify both sentences and evaluate the confidence score of each result. Results with a score between -100 and 100 are treated as low confidence and omitted.

We then check whether more than one language remains in `lang_dict` with `len(set(lang_dict.keys()))`. If so, the record is returned with a **1**, indicating an error. Otherwise it is returned with a **0**.

In [36]:
lang_list = ['af', 'am', 'ar', 'as', 'az', 'be', 'bg', 'bn', 'br', 
             'bs', 'ca', 'cs', 'cy', 'da', 'de', 'dz', 'el', 'en', 'eo', 
             'es', 'et', 'eu', 'fa', 'fi', 'fo', 'fr', 'ga', 'gl', 'gu', 
             'he', 'hi', 'hr', 'ht', 'hu', 'hy', 'is', 'it', 'ja', 'jv', 
             'ka', 'kk', 'km', 'kn', 'ko', 'ku', 'ky', 'la', 'lb', 'lo', 
             'lt', 'lv', 'mg', 'mk', 'ml', 'mn', 'mr', 'ms', 'mt', 'ne', 
             'nl', 'no', 'oc', 'or', 'pa', 'pl', 'ps', 'pt', 'qu', 'ro', 
             'ru', 'rw', 'se', 'si', 'sk', 'sl', 'sq', 'sr', 'sv', 'sw', 
             'ta', 'te', 'th', 'tl', 'tr', 'ug', 'uk', 'ur', 'vi', 'vo', 
             'wa', 'xh', 'zh', 'zu']
langid.set_languages(langs=lang_list)
def lang_checker(abstract):
    try:
        # Tokenizing abstracts
        tokenized = sent_tokenize(abstract)
        startAndFinish = [tokenized[1], tokenized[-2]]
        # Detecting languages present
        lang = [langid.classify(sentence) for sentence in startAndFinish]
        # Filter low confidence results
        lang_dict = {key:value for (key,value) in lang if value > 100 or value < -100}
        # Labeling specific issues found in record
        if len(set(lang_dict.keys())) > 1:
            return 1 #Multiple languages detected
    except:
        return None #No abstract, or fewer than two sentences
    return 0 #No issues
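
The selection and filtering steps can be tried out without the language libraries. In the sketch below, a naive regex split stands in for `sent_tokenize`, and the `(language, score)` pairs are made-up values shaped like the output of `langid.classify`; both are assumptions for illustration, as are the helper names.

```python
import re

def pick_sentences(abstract):
    """Naive sentence split (stand-in for nltk's sent_tokenize)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", abstract.strip()) if s]
    return [sentences[1], sentences[-2]]  # same indices as lang_checker

def languages_kept(results, threshold=100):
    """Language codes whose score magnitude exceeds the threshold."""
    return {code for code, score in results
            if score > threshold or score < -threshold}

abstract = "Uno. Dos frases aqui. Tres. One. Two sentences here. Three."
print(pick_sentences(abstract))

# Hypothetical classifier output for the two chosen sentences.
confident = [("es", -250.0), ("en", -310.0)]
one_weak = [("en", -40.0), ("fr", -300.0)]
print(len(languages_kept(confident)) > 1)  # would be labeled 1
print(len(languages_kept(one_weak)) > 1)   # would be labeled 0
```

In the second case the weak result is dropped, so only one language remains and the record would not be flagged.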

In [37]:
df['abstract_multi_lang']  = df.abstract.map(lambda x: lang_checker(x))

In [38]:
records_with_MultiLang = df.loc[(df.abstract_multi_lang == 1)]
prevalence_MultiLang = ((len(records_with_MultiLang))/(df.abstract.notnull().sum())) * 100
prevalence_MultiLang #returning a percent of the total number of records with this particular issue

1.6016886019369256

### Title Language Checking
This function will check the language of the title against the stated language of the record.

It is a relatively straightforward function: `try:` filters out records without a *title*, then the function classifies the language of the title, and finally checks whether the returned code matches what is recorded in the *language* field.

We use `df.apply` instead of `df.column.map` because the check needs multiple fields of a record rather than a single field.

Here, it should be mentioned, there is some ambiguity. The *language* field is not clearly defined (is it the language of the item, the container, or the record?). The prevalence of this issue (seen below) reflects the lack of clarity in what this field is meant to represent.

In [39]:
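def title_lang_checker(record):
    try:
        lang = langid.classify(record['title']) #returns (language code, score)
        if pd.isna(record['language']):
            return None #no recorded language to compare against
        if lang[0] == record['language']:
            return 0 #title language matches the language field
        return 1 #mismatch
    except Exception:
        return None #no title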
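
The comparison step on its own has three outcomes, which the following sketch isolates from the detection step. Here `detected` stands for the language code a classifier would return, and `compare_language` is a hypothetical helper, not a function from this notebook.

```python
import math

def compare_language(detected, recorded):
    """0 on match, 1 on mismatch, None when no language is recorded."""
    if recorded is None or (isinstance(recorded, float) and math.isnan(recorded)):
        return None
    return 0 if detected == recorded else 1

print(compare_language("fr", "fr"))
print(compare_language("en", "fr"))
print(compare_language("en", float("nan")))
```

Returning `None` rather than `1` when the *language* field is empty keeps this label separate from the *'no_language'* label.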

In [40]:
df['title_language'] = df.apply(title_lang_checker, axis=1)

In [41]:
records_with_TitleLang = df.loc[(df.title_language == 1)] #creating a df with only the records with these errors
prevalence_TitleLang = ((len(records_with_TitleLang))/(df.title.notnull().sum())) * 100
prevalence_TitleLang #returning a percent of the total number of records with this particular issue

5.5670602971927305

### Total Errors
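Lastly, we'll add up the errors for each record and store the total in the *'total_errors'* column. Note that *'no_abstract'* is left out of `column_list`, so missing abstracts do not count toward the total.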

In [42]:
# Labeled Columns
column_list = ['no_author', 'no_language', 'no_title', 'author_sequence', 'author_initials', 'author_institutions',
              'author_characters', 'abstract_multi_lang', 'title_language']
df['total_errors'] = df[column_list].sum(axis=1)
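
Several label columns hold `None` where a check did not apply, for example *'abstract_multi_lang'* when there is no abstract. A row-wise `sum` skips missing values by default, so these count as zero. The sketch below shows this on a made-up two-row frame.

```python
import pandas as pd

# Hypothetical labels: the first record has no abstract to check.
labels = pd.DataFrame({
    "author_initials": [0.0, 1.0],
    "abstract_multi_lang": [float("nan"), 1.0],
})

totals = labels.sum(axis=1)  # skipna=True is the default
print(totals.tolist())
```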

In [43]:
#Taking a look at the df
df.head()

Index,publisher,DOI,created,title,author,container-title,language,deposited,published,abstract,original-title,no_author,no_language,no_abstract,no_title,author_sequence,author_initials,author_institutions,author_characters,abstract_multi_lang,title_language,total_errors
0,Wiley,10.1002/(sici)1099-1727(200021)16:1<27::aid-sd...,2002-09-10,The validation of commercial system dynamics m...,"[{'given': 'Geoff', 'family': 'Coyle', 'sequen...",System Dynamics Review,en,2021-07-01,2000.0,,,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0
1,Springer Science and Business Media LLC,10.1007/bf02653972,2007-07-17,Effect of system geometry on the leaching beha...,"[{'given': 'C.', 'family': 'Vu', 'sequence': '...",Metallurgical Transactions B,en,2019-05-20,1979.0,,,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,,0.0,1.0
2,Wiley,10.1111/reel.12221,2017-12-01,The international law on transboundary haze po...,"[{'given': 'Shawkat', 'family': 'Alam', 'seque...","Review of European, Comparative &amp; Internat...",en,2017-12-01,2017.0,,,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0
3,Crop Science Society of Japan,10.1626/jcs.20.219,2011-09-20,Studies on the influence of pruning on the veg...,"[{'given': 'C.', 'family': 'TSUDA', 'sequence'...",Japanese Journal of Crop Science,en,2021-04-30,1951.0,,,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,,0.0,1.0
4,Elsevier BV,10.1016/j.pneumo.2018.09.002,2018-10-10,Le tabagisme et l’aide à l’arrêt du tabac des ...,"[{'given': 'J.', 'family': 'Perriot', 'sequenc...",Revue de Pneumologie Clinique,fr,2019-10-26,2018.0,,,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,,0.0,1.0


In [44]:
df.to_csv(output_dir / '03_labeled_data.csv')