***************************************************************************************
Jupyter Notebooks from the Metadata for Everyone project

Code:
* Dennis Donathan II (https://orcid.org/0000-0001-8042-0539)

Project team: 
* Juan Pablo Alperin (https://orcid.org/0000-0002-9344-7439)
* Dennis Donathan II (https://orcid.org/0000-0001-8042-0539)
* Mike Nason (https://orcid.org/0000-0001-5527-8489)
* Julie Shi (https://orcid.org/0000-0003-1242-1112)
* Marco Tullney (https://orcid.org/0000-0002-5111-2788)

Last updated: xxx
***************************************************************************************

# Language Detection

In the previous notebook, we found that a significant number of records are missing a stated *language*. Additionally, in phase 1 of our project (see Shi, J., Nason, M., Tullney, M., & Alperin, J. P. (2023). Identifying Metadata Quality Issues Across Cultures. SocArXiv. https://doi.org/10.31235/osf.io/6fykh), we found that it is not uncommon for the stated language of a record to be inaccurate. This could be for a variety of reasons, such as a lack of clarity about what *language* actually refers to (i.e., the language of the item, the container, or the metadata record itself), or perhaps an expectation of increased discoverability if the work is labeled with **en** (English).

To examine these issues more closely, we will detect the languages used within the records, compare that against their stated languages, and see any patterns that emerge.



## Preparation 
First we will import several of the necessary packages, set up our directory, and import our data.

In [1]:
import seaborn as sns # data visualizations
from pathlib import Path
import numpy as np
import pandas as pd #Creating dataframe and manipulating data
from nltk.tokenize import sent_tokenize
from py3langid.langid import LanguageIdentifier, MODEL_FILE
from matplotlib import pyplot as plt
from typing import List, Dict, Optional

In [2]:
# Data Directory
data_dir = Path('../data')
input_dir = data_dir / 'input'
output_dir = data_dir / 'output'
# Loading in dataset
df = pd.read_parquet(output_dir / '03_labeled_data.parquet')
#df = df_all.loc[df_all.abstracts.notnull()]

## Detecting Languages
We'll use `py3langid` as we did previously, with the same language list as before. `py3langid` is optimized for Python 3 and is several times faster than the original, but because of the nature of the language detection we will be doing, this step may still take a minute or two (no longer than that). We will check each record across three fields: *abstract*, *title*, and *container-title*. These are the fields that have some of the most text, and thus can give us the most confident results. We'll set a probability threshold of `.95` to help ensure that we only say a language is present when the model is very confident.
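
As a rough illustration of how a probability threshold filters detections, the sketch below applies the `.95` cut-off to a handful of sentences. The `toy_classify` function and its hard-coded probabilities are hypothetical stand-ins for a real classifier that returns a `(language, probability)` pair; they are not real model output.

```python
from typing import List, Optional, Tuple

THRESHOLD = 0.95  # same cut-off as described above


def toy_classify(sentence: str) -> Tuple[str, float]:
    """Hypothetical stand-in for a language classifier (canned values only)."""
    canned = {
        "this is an english sentence.": ("en", 0.99),
        "dies ist ein deutscher satz.": ("de", 0.98),
        "ok": ("en", 0.41),  # very short text: low confidence
    }
    return canned.get(sentence.lower(), ("en", 0.0))


def confident_languages(sentences: List[str]) -> Optional[List[str]]:
    """Return the languages whose probability clears the threshold, or None."""
    found = set()
    for sentence in sentences:
        lang, prob = toy_classify(sentence)
        if prob > THRESHOLD:
            found.add(lang)
    return sorted(found) if found else None


print(confident_languages(["This is an English sentence.",
                           "Dies ist ein deutscher Satz.",
                           "ok"]))
print(confident_languages(["ok"]))
```

The low-confidence sentence contributes nothing to the result, which is why very short fields can end up with no detected language at all.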

### Matching
After detecting the language used in the records, we will then see if the detected language (`detected_lang_abstract`) matches the record's stated *language*. In doing this we will label the record with a `0` if the stated language matches the detected language, `1` if the stated language **does not** match the detected language, `2` if multiple languages are detected and one of them matches the stated language, or `3` if the record has no stated language.
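
The coding scheme can be sketched on a few toy inputs. This is a simplified illustration, not the function used later in this notebook: `match_code` is a hypothetical name, and the code `3` (no stated language, or nothing detected to compare against) follows the behaviour of the `detection_match` function defined further down.

```python
from typing import List, Optional


def match_code(stated: Optional[str], detected: Optional[List[str]]) -> int:
    """Toy version of the stated-vs-detected language coding."""
    if stated is None or not detected:
        return 3  # no stated language, or no detection to compare with
    if stated not in detected:
        return 1  # stated language was not among the detected ones
    # Stated language was detected: 2 if other languages are present too
    return 2 if len(detected) > 1 else 0


examples = [
    ("en", ["en"]),        # match
    ("en", ["es"]),        # mismatch
    ("es", ["es", "en"]),  # match, but multiple languages detected
    (None, ["en"]),        # no stated language
]
for stated, detected in examples:
    print(stated, detected, "->", match_code(stated, detected))
```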

### Language Type
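Finally, we will apply an additional label to each record: `Monolingual English` if English is the only language found, `Monolingual Non-English` if a single non-English language is found, and `Multilingual` if more than one language is found. Both the stated and the detected languages of a record are taken into account.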
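
A condensed sketch of this labeling, assuming the languages of a record have already been pooled into one list in which missing values are `None`. The helper name `language_type` is hypothetical, and the string labels follow those returned by the `lang_type` function below.

```python
from typing import Iterable, Optional


def language_type(langs: Iterable[Optional[str]]) -> Optional[str]:
    """Label a record by the distinct, non-missing languages found in it."""
    unique = {lang for lang in langs if lang is not None}
    if not unique:
        return None  # nothing stated and nothing detected
    if len(unique) > 1:
        return "Multilingual"
    return "Monolingual English" if unique == {"en"} else "Monolingual Non-English"


print(language_type(["en", None, "en"]))
print(language_type(["pt", None]))
print(language_type(["en", "es", None]))
print(language_type([None, None]))
```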

In [3]:
identifier = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs = True)
lang_list = ['af', 'am', 'ar', 'as', 'az', 'be', 'bg', 'bn', 'br', 
             'bs', 'ca', 'cs', 'cy', 'da', 'de', 'dz', 'el', 'en', 'eo', 
             'es', 'et', 'eu', 'fa', 'fi', 'fo', 'fr', 'ga', 'gl', 'gu', 
             'he', 'hi', 'hr', 'ht', 'hu', 'hy', 'is', 'it', 'ja', 'jv', 
             'ka', 'kk', 'km', 'kn', 'ko', 'ku', 'ky', 'la', 'lb', 'lo', 
             'lt', 'lv', 'mg', 'mk', 'ml', 'mn', 'mr', 'ms', 'mt', 'ne', 
             'nl', 'no', 'oc', 'or', 'pa', 'pl', 'ps', 'pt', 'qu', 'ro', 
             'ru', 'rw', 'se', 'si', 'sk', 'sl', 'sq', 'sr', 'sv', 'sw', 
             'ta', 'te', 'th', 'tl', 'tr', 'ug', 'uk', 'ur', 'vi', 'vo', 
             'wa', 'xh', 'zh', 'zu']
identifier.set_languages(langs=lang_list)
# Detect the languages present in a record's abstracts, sentence by sentence.
def record_lang_checker(abstracts: Optional[List[str]]) -> Optional[List[str]]:
    # Abstracts carry the most text, which gives the most reliable detection
    detected = set()
    try:
        for abstract in abstracts:
            for sentence in sent_tokenize(abstract):
                lang, prob = identifier.classify(sentence.lower())
                # Setting a .95 probability threshold for asserting the language is indeed in the record
                if prob > .95:
                    detected.add(lang)
    except Exception:
        # Missing or malformed abstracts (e.g. None) cannot be checked
        return None
    # If no language is detected, return None;
    # otherwise return all of the detected languages for the record
    return list(detected) if detected else None



In [4]:
df['detected_lang_abstract'] = df.abstracts.map(lambda x: record_lang_checker(x))
detected_languages = df.explode('detected_lang_abstract')
#detected_languages.detected_lang.nunique()

In [None]:
print("{:0.0f} different languages have been detected in the sample.".format(detected_languages.detected_lang_abstract.nunique()))

In [None]:
detected_languages['detected_lang_abstract'].value_counts().head(20)

In [None]:
df['detected_lang_abstract'].value_counts().head(20)

In [None]:
df.journal_lang.value_counts(dropna=False)

In [22]:
def detection_match(record):
    try:
        #Filtering out records with no stated language
        if record['journal_lang'] is None:
            return 3
        else:
            #checking if stated language matches detected language
            if record['journal_lang'] in record['detected_lang_abstract']:
                #Stated language is within the detected languages, but there are multiple languages
                #present in the record
                if len(record['detected_lang_abstract']) > 1:
                    return 2
                #Stated language matches detected language
                else:
                    return 0
            #Stated and detected languages do not match
            else:
                return 1
    except:
        return 3

In [None]:
df['lang_match_abstract'] = df.apply(detection_match, axis=1)
df.lang_match_abstract.value_counts()
# 0 -> only one language is detected, and it matches the indicated language
# 1 -> the indicated language is not among the detected languages
# 2 -> there are at least two languages detected, and the language indicated is one of the detected languages
# 3 -> there is no indicated language (or no language could be detected in the abstract)

In [12]:
def title_lang_detector(article_title: list[str]) -> str | None:
    """Detect the language of a record's title.

    Only the first title in the list is classified, and a language is
    returned only if the model's probability exceeds .99.

    Args:
        article_title (list[str]): The title(s) of a metadata record.

    Returns:
        str | None: The detected language code. Returns None if the title
        is missing or the detection falls below the threshold.
    """
    try:
        if article_title:
            for i in article_title:
                lang = identifier.classify(i)
                if lang[1] > .99:
                    return lang[0]
                else:
                    return None
        else:
            return None
    except:
        return None

df['detected_lang_title'] = df.article_title.map(lambda x: title_lang_detector(x))

In [None]:
df.detected_lang_title.value_counts()

In [None]:
df['abstract_langs'].value_counts()

In [None]:
df.columns

In [None]:
def lang_type(record):
    stated_abstract_langs = record['abstract_langs'] #list
    stated_journal_lang = record['journal_lang'] # str
    stated_article_lang = record['article_lang'] #str
    detected_language_abstract = record['detected_lang_abstract'] #list
    detected_language_title = record['detected_lang_title'] #str
    lang_list = [
                stated_journal_lang,
                stated_article_lang,
                detected_language_title,
                ]
    if stated_abstract_langs is not None:
        lang_list.extend(stated_abstract_langs)
    if detected_language_abstract is not None:
        lang_list.extend(detected_language_abstract)
    set_langs = set(lang_list)
    set_langs = list(set_langs)
    if len(set_langs) == 1:
        if set_langs[0] is None:
            return None
        else:
            if set_langs[0] == 'en':
                return 'Monolingual English'
            else:
                return 'Monolingual Non-English'
    if len(set_langs) == 2:
        filtered_lang = [lang for lang in set_langs if lang is not None]
        if len(filtered_lang) == 1:
            if filtered_lang[0] == 'en':
                return 'Monolingual English'
            else:
                return 'Monolingual Non-English'
        else:
            return 'Multilingual'
    if len(set_langs) > 2:
        return 'Multilingual'
    print(set_langs)
    
df['lang_type'] = df.apply(lang_type, axis=1)
df.lang_type.value_counts(dropna=False)

## Analyses by language type

⚠️ So this is where we should try to differentiate RQ1 results by language type. If you have the time, could you try to do "language absent" and "author initials"? I could then try to do a few other important ones. ⚠️

## Differences in Errors Between Language Types
After detecting the languages and coding the records, we can see that there are a large number of records in which the `detected_lang` does not match the stated language. One possible explanation is the high number of records that simply do not have a stated language. We will explore this below.

Additionally, we can see that English is the predominant language of the dataset. Next, we'll take a look at errors per record for the different language types: English-monolingual, Non-English-monolingual, and Multilingual.

First, we'll take a look at the number of errors per language.

In [None]:
grouped_langs = detected_languages.groupby('detected_lang_abstract')
group_total_errors = grouped_langs.agg({'Total Errors': 'sum', 'doi': 'count'}).sort_values(by='doi', ascending=False)
group_total_errors[:20]

In [None]:
sns.set_context('paper')
top_20 = group_total_errors.sort_values(by='Total Errors', ascending=False)[:20]
t20plt = sns.barplot(data=top_20, x=top_20.index, y='Total Errors')
t20plt.set_xticklabels(t20plt.get_xticklabels(), rotation=40, ha='right', fontsize=10)
t20plt.set_title('Total Errors by Language')

In [None]:
#Now we'll remove English, to better visualize other languages
no_en = group_total_errors.drop('en')
top_20 = no_en.sort_values(by='Total Errors', ascending=False)[:20]
t20plt = sns.barplot(data=top_20, x=top_20.index, y='Total Errors')
t20plt.set_xticklabels(t20plt.get_xticklabels(), rotation=40, ha='right', fontsize=10)
t20plt.set_title('Total Errors by Language (excl. English)')

In [None]:
#Now we'll change from total errors, to the mean number of errors by language
group_avg_errors = grouped_langs.agg({'Total Errors': 'mean', 'doi': 'count'}).sort_values(by='Total Errors', ascending=False)
group_avg_errors

In [None]:
#We'll remove the languages with only a couple of records
filtered = group_avg_errors.loc[group_avg_errors.doi > 5].sort_values(by='Total Errors', ascending=False)
top_20 = filtered[:20]
t20plt = sns.barplot(data=top_20, x=top_20.index, y='Total Errors')
t20plt.set_xticklabels(t20plt.get_xticklabels(), rotation=40, ha='right', fontsize=10)
t20plt.set_title('Errors per Record by Language')

### Individual Languages
We see that, as mentioned, English is by far the most represented language in the dataset and, consequently, has the most errors. Once English is removed, we see that German (de), French (fr), Spanish (es), Portuguese (pt), and Malay (ms) make up the next top 5 in total errors, but that tends to be a reflection of the quantity of records in the dataset.

However, when we look at the average (arithmetic mean) number of errors per language, we do see a number of languages there that also appear in the top 20 by total errors:

Chinese (zh), Russian (ru), Ukrainian (uk), Bulgarian (bg), Japanese (ja), Arabic (ar).
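
To see why totals and means can rank languages differently, consider the small sketch below. The error counts are invented purely for illustration and are not taken from our dataset; `summarize` is a hypothetical helper.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Invented per-record error counts (NOT from the dataset)
toy_errors: Dict[str, List[int]] = {
    "en": [1, 0] * 10,  # many records, few errors each
    "zh": [3, 4, 2],    # few records, more errors each
}


def summarize(errors_by_lang: Dict[str, List[int]]) -> Dict[str, Tuple[int, float]]:
    """Return (total errors, mean errors per record) for each language."""
    return {lang: (sum(errs), mean(errs)) for lang, errs in errors_by_lang.items()}


summary = summarize(toy_errors)
for lang, (total, avg) in summary.items():
    print(f"{lang}: total={total}, mean={avg:.2f}")
```

In this toy data the language with more records has the larger total but the smaller mean, which is the pattern to keep in mind when comparing the two rankings.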

Now, we'll take a look at the differences between language types.

In [30]:
multi = df.loc[df.lang_type == 'Multilingual']
non_english = df.loc[df.lang_type == 'Monolingual Non-English']
english = df.loc[df.lang_type == 'Monolingual English']

In [None]:
#multi["DOI"].head(50)

In [None]:
multi_error_rate = multi['Total Errors'].sum()/len(multi)
eng_error_rate = english['Total Errors'].sum()/len(english)
non_eng_error_rate = non_english['Total Errors'].sum()/len(non_english)

print("{:0.2f} errors per English, monolingual record".format(eng_error_rate))
print("{:0.2f} errors per non-English, monolingual record".format(non_eng_error_rate))
print("{:0.2f} errors per multilingual record".format(multi_error_rate))

To visualize this data, we'll look at the error rates for each of these language types and break them up by their publisher bin.

Then we'll take a look at any differences between the languages.

In [None]:
#Multilingual Records
#mlt = sns.catplot(data=multi,x='publisher_bin', y='total_errors', kind='bar', height=4, aspect=.8, order=['XS', 'S', 'M', 'L', 'XL'], errorbar=None)
#mlt.set_axis_labels('', 'Error per Record')
#mlt.fig.subplots_adjust(top=0.9)
#mlt.fig.suptitle('Errors per Record by Publisher Size, Multilingual Records')

In [None]:
#English Monolingual Records
#en_only = sns.catplot(data=english, x='publisher_bin', y='total_errors', kind='bar', height=4, aspect=.8, order=['XS', 'S', 'M', 'L', 'XL'], errorbar=None)
#en_only.set_axis_labels('', 'Error per Record')
#en_only.fig.subplots_adjust(top=0.9)
#en_only.fig.suptitle('Errors per Record by Publisher Size, English Only')

In [None]:
#Non-English Monolingual Records
#non_en_plt = sns.catplot(data=non_english, x='publisher_bin', y='total_errors', kind='bar', height=4, aspect=.8, order=['XS', 'S', 'M', 'L', 'XL'], errorbar=None)
#non_en_plt.set_axis_labels('', 'Error per Record')
#non_en_plt.fig.subplots_adjust(top=0.9)
#non_en_plt.fig.suptitle('Errors per Record by Publisher Size, Monolingual Records (non-English)')

### Differences in Language Types
We can see that the error rates across publisher bins generally change in a similar fashion between groups. Overall, the error rate seems to be highest in non-English monolingual records and lowest in English monolingual records.

Across all groups, the XS publisher bin has the highest error rate.

Finally, we'll take a look to see any differences in the presence of a stated language between the language types.

In [None]:
multi_no_lang = (multi.journal_lang.isna().sum()/len(multi)) *100
eng_no_lang = (english.journal_lang.isna().sum()/len(english)) *100
non_eng_no_lang = (non_english.journal_lang.isna().sum()/len(non_english)) * 100

print("{:0.2f}% English, monolingual records with no stated language".format(eng_no_lang))
print("{:0.2f}% non-English, monolingual records with no stated language".format(non_eng_no_lang))
print("{:0.2f}% multilingual records with no stated language".format(multi_no_lang))

In [None]:
def en_stated(lang):
    try:
        if lang == 'en':
            return 1
        elif lang in lang_list:
            return 0
        else:
            return None
    except:
        return None
    
find_stated = non_english.journal_lang.map(lambda x: en_stated(x))
enStated = find_stated.sum()
nonenStated = len(find_stated.loc[find_stated == 0])
non_eng_no_lang = non_english.journal_lang.isna().sum()
labels = ['English', 'Non-English', 'No Language']
data = [enStated, nonenStated, non_eng_no_lang]
colors = ['palegreen','skyblue', 'pink']
fig = plt.figure(figsize = (10,7))
plt.pie(data, labels = labels, autopct='%.1f%%', colors=colors)
plt.title('Breakdown of Stated Languages for Non-English, Monolingual Records', fontsize=16)
plt.show()

In [None]:
find_stated = multi.journal_lang.map(lambda x: en_stated(x))
enStated = find_stated.sum()
nonenStated = len(find_stated.loc[find_stated == 0])
multi_no_lang = multi.journal_lang.isna().sum()
labels = ['English', 'Non-English', 'No Language']
data = [enStated, nonenStated, multi_no_lang]
colors = ['palegreen','skyblue', 'pink']
fig = plt.figure(figsize = (10,7))
plt.pie(data, labels = labels, autopct='%.1f%%', colors=colors)
plt.title('Breakdown of Stated Languages for Multilingual Records', fontsize=16)
plt.show()

In [None]:
#Looking to see how many multilingual records use english within their records
def has_english(record):
    try:
        stated_abstract_langs = record['abstract_langs'] #list
        stated_journal_lang = record['journal_lang'] # str
        stated_article_lang = record['article_lang'] #str
        detected_language_abstract = record['detected_lang_abstract'] #list
        detected_language_title = record['detected_lang_title'] #str
        lang_list = [
                    stated_journal_lang,
                    stated_article_lang,
                    detected_language_title,
                    ]
        if stated_abstract_langs is not None:
            lang_list.extend(stated_abstract_langs)
        if detected_language_abstract is not None:
            lang_list.extend(detected_language_abstract)
        set_langs = set(lang_list)
        set_langs = list(set_langs)
        if 'en' in set_langs:
            return 1
        else:
            return 0
    except:
        return None
        
eng_multi = multi.apply(has_english, axis=1)
eng_having_rate = (eng_multi.sum()/len(multi)) *100
print("{:0.2f}% of multilingual records have English as one of their languages".format(eng_having_rate))

In [39]:
df.to_parquet(output_dir / '04_language_detection.parquet')