# Detecting Multiple Languages within the Abstract field of a Metadata record

Many of the high priority issue determined in phase 1 are in regard to the abstract field. 

This script will determine the language(s) within the abstract field and compare them against the stated language attribute of the record. To do this, the script also must ensure that those fields are present within the record.

To begin, we'll import he necessary libraries and set up our generastor for iterating through the data.

In [8]:
from bs4 import BeautifulSoup as bs
import langid
import pandas as pd
import json
from nltk.tokenize import sent_tokenize

# Genertaor function for handling data iteration
def gen(file_name):
    with open(file_name, 'r') as fh:
        for record in json.load(fh):
            yield record

# Set up dataframe for holding detected errors
df = pd.DataFrame(columns=['DOI', 'issue'])
data = gen('bigger_sample.json')

Now, we have to check the records for each of these attributes, what language(s) is present in the abstract, check against the container language, and label the appropriate issue. The data here are a random sample of 1300 records from Crossref

In [9]:
article_counter = 0
for record in data:
    message = record['message']
    doi = message['DOI']
    # Only focusing on journal articles, as to avoid false positives on
    # types that do not normally have abstracts (e.g. books)
    if message['type'] == 'journal-article':
        article_counter +=1
        try:
            # Cleaning text by removing lxml tags
            soup = bs(message['abstract'], features='lxml')
            stripped_strings = soup.get_text()
            # Tokenizing abstracts
            tokenized = sent_tokenize(stripped_strings)
            startAndFinish = [tokenized[1], tokenized[-2]]
            # Detecting languages present
            lang = [langid.classify(lang) for lang in startAndFinish]
            # Filter low confidence results
            lang_dict = {key:value for (key,value) in lang if value > 100 or value < -100}
            # Labeling specific issues found in record
            if len(set(lang_dict.keys())) > 1:
                try:
                    containerLanguage = message['language']
                    if containerLanguage in lang_dict.keys():
                        df.loc[len(df)] = [doi, 'multiple languages']
                except:
                    df.loc[len(df)] = [doi, 'no language attribute; multiple languages']
            else:
                try:
                    containerLanguage = message['language']
                    if containerLanguage == lang_dict.keys()[0]:
                        df.loc[len(df)] = [doi, 'no issues found']
                        continue
                    else:
                        df.loc[len(df)] = [doi, 'does not match language field']
                        continue
                except:
                    df.loc[len(df)] = [doi, 'no language attribute']
        except:
            try:
                languageAttribute = message['language']
                df.loc[len(df)] = [doi, 'no abstract attribute']
            except:
                df.loc[len(df)] = [doi, 'no abstract attribute; no language attribute']

In [10]:
print(article_counter)

889


In [11]:
df.shape

(889, 2)

We see that all of the articles in this random sample have some sort of issue.

In [12]:
df.value_counts(subset=df['issue'])

issue
no abstract attribute                           629
no language attribute                           137
no abstract attribute; no language attribute    120
no language attribute; multiple languages         3
dtype: int64

Overwhelmingly, it seems the issue is that there are fields missing from the records. Most often being the abstract field, but the absence of the language field is also very common.