***************************************************************************************
Jupyter Notebooks from the Metadata for Everyone project

Code:
* Dennis Donathan II (https://orcid.org/0000-0001-8042-0539)

Project team: 
* Juan Pablo Alperin (https://orcid.org/0000-0002-9344-7439)
* Dennis Donathan II (https://orcid.org/0000-0001-8042-0539)
* Mike Nason (https://orcid.org/0000-0001-5527-8489)
* Julie Shi (https://orcid.org/0000-0003-1242-1112)
* Marco Tullney (https://orcid.org/0000-0002-5111-2788)

Last updated: xxx
***************************************************************************************

# Data Cleaning
The raw CSV consists of three columns:

* Index

* DOI

* XML

Each XML record contains more information than this study needs. We will extract the relevant metadata from each record and store it in a nested dictionary. The resulting dictionary's schema looks like this:

```
{
    'doi': str | None,
    'authors': list[dict[given_name, surname, sequence, affiliation]] | None,
    'abstracts': list[str] | None,
    'journal_lang': str | None,
    'article_lang': str | None,
    'abstract_langs': list[str] | None,
    'publisher_name': str | None,
    'journal_title': str | None,
    'article_title': list[str] | None
}
```
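
To make the schema concrete, the sketch below expresses it as a `TypedDict` and builds one example record. The class names `Author` and `CleanedRecord` and every field value are hypothetical illustrations, not part of the notebook's code. The `__titles` function defined later collects article titles into a list, so `article_title` is typed as a list here.

```python
from typing import Optional, TypedDict


class Author(TypedDict):
    given_name: Optional[str]
    surname: Optional[str]
    sequence: Optional[str]  # 'first' or 'additional'
    affiliation: Optional[str]


class CleanedRecord(TypedDict):
    doi: Optional[str]
    authors: Optional[list[Author]]
    abstracts: Optional[list[str]]
    journal_lang: Optional[str]
    article_lang: Optional[str]
    abstract_langs: Optional[list[str]]
    publisher_name: Optional[str]
    journal_title: Optional[str]
    article_title: Optional[list[str]]


# A made-up record; none of these values come from the dataset.
example_record: CleanedRecord = {
    'doi': '10.1234/example',
    'authors': [
        {'given_name': 'Ada', 'surname': 'Lovelace',
         'sequence': 'first', 'affiliation': None},
    ],
    'abstracts': None,  # the record had no abstract
    'journal_lang': 'en',
    'article_lang': None,  # the language attribute was missing
    'abstract_langs': None,
    'publisher_name': 'Example Press',
    'journal_title': 'Journal of Examples',
    'article_title': ['An Example Title'],
}
```

Missing fields are stored as `None` rather than omitted, so every record has the same keys.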
Due to the size of the dataset, we'll use [Dask](https://docs.dask.org/en/stable/index.html) to load the CSV and perform the metadata extraction functions defined below. The time needed to load the data and perform the extraction varies with the available hardware. The parameter `blocksize='250MB'`, set where the `df` variable is created in the `clean_csv` function, can be adjusted accordingly. 250MB is a fairly neutral value: most systems can run the code comfortably with it, but the run will take multiple hours.
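
As a rough guide to choosing a block size, the sketch below estimates how many partitions a file would be split into. It assumes Dask cuts the file into chunks of about `blocksize` bytes; the helper `estimate_partitions` and the 20 GB file size are hypothetical and do not describe the real dataset.

```python
import math


def estimate_partitions(file_size_bytes: int, blocksize_bytes: int) -> int:
    """Rough number of partitions for a file split into fixed-size blocks."""
    return math.ceil(file_size_bytes / blocksize_bytes)


MB = 10**6  # '250MB' is interpreted as decimal megabytes
hypothetical_file_size = 20 * 10**9  # assumed 20 GB CSV, not the real file size

for blocksize in (64 * MB, 250 * MB, 1000 * MB):
    n = estimate_partitions(hypothetical_file_size, blocksize)
    print(f"blocksize={blocksize // MB}MB -> about {n} partitions")
```

Smaller blocks lower the memory needed per task but create more tasks to schedule; larger blocks do the opposite.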

Now we will load in our packages and set up our paths.

In [1]:
import dask.dataframe as dd
import pandas as pd
from pathlib import Path
from bs4 import BeautifulSoup

# Set up directories
data_dir = Path('../data')
input_dir = data_dir / 'input'
output_dir = data_dir / 'output'

csv_path = input_dir / 'allv3.csv'
parquet_path = input_dir / '02_cleaned_data.parquet'

## Extraction Functions

Each helper function is named after the piece of metadata it extracts. The `__metadata` function parses a record once, calls every helper, and assembles the final dictionary.
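
To show the kind of input the helpers work on, here is a minimal sketch of author extraction. It uses the standard library's `xml.etree.ElementTree` as a stand-in for BeautifulSoup so that it runs without extra dependencies; it is not the notebook's implementation. The sample record is a made-up, simplified fragment shaped like the records the functions below expect, and `SAMPLE_RECORD`, `FIELDS`, and `extract_authors` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# A made-up, simplified record; real records contain many more elements.
SAMPLE_RECORD = """
<journal_article language="en">
  <titles><title>An Example Title</title></titles>
  <contributors>
    <person_name sequence="first" contributor_role="author">
      <given_name>Ada</given_name>
      <surname>Lovelace</surname>
    </person_name>
    <person_name sequence="additional" contributor_role="author">
      <surname>Turing</surname>
    </person_name>
  </contributors>
</journal_article>
"""

FIELDS = ("given_name", "surname", "affiliation")


def extract_authors(xml_text):
    """Return one dictionary per <person_name>, or None if there are none."""
    root = ET.fromstring(xml_text)
    authors = []
    # Authors are visited in document order in this sketch.
    for person in root.iter("person_name"):
        # findtext returns None when a direct child element is absent.
        author = {field: person.findtext(field) for field in FIELDS}
        author["sequence"] = person.get("sequence")
        authors.append(author)
    return authors or None


authors = extract_authors(SAMPLE_RECORD)
print(authors)
```

The second author has no `given_name` or `affiliation` element, so those fields come back as `None`, which mirrors how the helpers below treat missing metadata.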


In [None]:
def __authors(soup: 'bs4.BeautifulSoup') -> list[dict] | None:
    """Helper function to extract relevant author metadata from
    XML records.

    Args:
        soup (bs4.BeautifulSoup): A parsed XML metadata record.

    Returns:
        list[dict] | None : A list of nested dictionaries containing the relevant author metadata.
                            If no authors are present, None is returned.
    """
    author_list = []
    contributors = soup.find('contributors')
    if contributors:
        # First authors are collected before additional authors.
        for sequence in ('first', 'additional'):
            for person in contributors.find_all('person_name', sequence=sequence):
                name_dict = {
                    'given_name': None,
                    'surname': None,
                    'sequence': sequence,
                    'affiliation': None
                }
                # Fill in each field that is present; missing fields stay None.
                for k in ('given_name', 'surname', 'affiliation'):
                    tag = person.find(k)
                    if tag:
                        name_dict[k] = tag.get_text()
                author_list.append(name_dict)
    return author_list or None
    
def __abstracts(soup: 'bs4.BeautifulSoup') -> list[str] | None:
    """Helper function that extracts all abstracts from XML records.

    Args:
        soup (bs4.BeautifulSoup): A parsed XML metadata record.

    Returns:
        list[str] | None: Returns a list of all abstracts within a record.
                        If there is no abstract within a record,
                        then None is returned.
    """
    abstracts = soup.find_all('jats:abstract')
    if not abstracts:
        return None
    return [i.get_text() for i in abstracts]

def __languages(soup: 'bs4.BeautifulSoup') -> dict:
    """Helper function that extracts language codes from multiple fields
    within an XML record.

    Args:
        soup (bs4.BeautifulSoup): A parsed XML metadata record.

    Returns:
        dict: A dictionary containing the language codes for three
            different metadata fields. Missing codes are set to None.
    """
    # Start with every key present so callers can always index the result.
    ret = {'journal_lang': None, 'article_lang': None, 'abstract_langs': None}
    try:
        journal = soup.find('journal_metadata')
        if journal:
            ret['journal_lang'] = journal.get('language')

        article = soup.find('journal_article')
        if article:
            ret['article_lang'] = article.get('language')

        # Keep only the abstracts that declare a language.
        langs = [abstract.get('xml:lang') for abstract in soup.find_all('jats:abstract')]
        langs = [l for l in langs if l is not None]
        if langs:
            ret['abstract_langs'] = langs

    except Exception as e:
        ret['err'] = type(e).__name__

    return ret

def __titles(soup: 'bs4.BeautifulSoup') -> dict:
    """Helper function to extract various titles from XML records.

    Args:
        soup (bs4.BeautifulSoup): A parsed XML metadata record.

    Returns:
        dict: A dictionary with the publisher name, journal title, and
            article titles. Fields that are absent are set to None.
    """
    titles = {'publisher_name': None, 'journal_title': None, 'article_title': None}
    try:
        publisher = soup.find('crm-item', attrs={'name': 'publisher-name'})
        if publisher:
            titles['publisher_name'] = publisher.get_text()
        journal = soup.find('journal_metadata')
        # Guard against journals that lack a <full_title> element.
        full_title = journal.find('full_title') if journal else None
        if full_title:
            titles['journal_title'] = full_title.get_text()
        article = soup.find('titles')
        if article:
            titles['article_title'] = [i.get_text() for i in article.find_all('title')]
    except Exception as err:
        # Record the failure instead of returning the exception object,
        # so callers can always index the result by key.
        titles['err'] = type(err).__name__
    return titles

def __metadata(record: str) -> dict | None:
    """Parse one XML record and collect all extracted metadata.

    Args:
        record (str): An individual metadata record in XML format.

    Returns:
        dict | None: The cleaned record, or None if the input is missing
                    or cannot be processed.
    """
    # Missing records arrive as NaN/NA rather than strings.
    if not isinstance(record, str):
        return None
    try:
        soup = BeautifulSoup(record, 'xml')
        doi = soup.find('doi')
        authors = __authors(soup)
        abstracts = __abstracts(soup)
        languages = __languages(soup)
        titles = __titles(soup)
        final_record = {'doi': doi.get_text() if doi else None,
                        'authors': authors,
                        'abstracts': abstracts,
                        'journal_lang': languages['journal_lang'],
                        'article_lang': languages['article_lang'],
                        'abstract_langs': languages['abstract_langs'],
                        'publisher_name': titles['publisher_name'],
                        'journal_title': titles['journal_title'],
                        'article_title': titles['article_title']}
        return final_record
    except TypeError as err:
        print(err)
        return None


def clean_csv(csv_path: str, metadata_parquet_path: str) -> pd.DataFrame:
    # Here is the df variable containing the blocksize parameter.
    df = dd.read_csv(csv_path, names=['index', 'DOI', 'xml'], blocksize='250MB')
    metadata = df['xml'].map(__metadata, meta=('metadata', 'object')).compute()
    # Drop records that could not be parsed, then expand each dictionary
    # into one column per key.
    metadata_df = pd.DataFrame(metadata.dropna().tolist())
    metadata_df.to_parquet(metadata_parquet_path, index=False)
    return metadata_df


In [None]:
"""
This cell runs the functions and saves the new data to a parquet.
Depending on the value set in the blocksize parameter,
this may take some time.
"""

df = clean_csv(csv_path, parquet_path)

In [9]:
df = pd.read_parquet(parquet_path)

## Conferences
A few values in the *journal_title* column are conferences and proceedings rather than journals. Let's check how many records in our dataset come from these containers.

We also saw a few records from *ChemInform*, a journal that publishes chemistry abstracts, so we'll check whether any of those records remain as well.

We'll use a keyword search in the *journal_title* column to find these records. The same filter also flags titles containing 'news', the 'CrossRef Listing of Deleted DOIs', and every record whose publisher is EDP Sciences.
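
The sketch below shows the same kind of filter on a toy DataFrame. The rows and the variable names `toy`, `mask`, `flagged`, and `kept` are invented for illustration; passing `na=False` is an assumption about how missing titles should be handled (treated as non-matching).

```python
import pandas as pd

# Made-up rows; none of these come from the real dataset.
toy = pd.DataFrame({
    'journal_title': ['Journal of Testing',
                      'Proceedings of the Example Conference',
                      'ChemInform',
                      None,
                      'Daily News Review'],
    'publisher_name': ['A', 'B', 'C', 'EDP Sciences', 'D'],
})

# Case-insensitive keyword match on the title, OR an exact publisher match.
mask = (toy['journal_title'].str.contains(r'conference|ChemInform|news',
                                          regex=True, case=False, na=False)
        | (toy['publisher_name'] == 'EDP Sciences'))

flagged = toy.loc[mask]
kept = toy.drop(flagged.index)
print(kept)
```

Only the first row survives: three rows match a keyword and the row with the missing title is caught by the publisher condition.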

In [None]:
# na=False treats records without a journal title as non-matching.
conferences = df.loc[(df.journal_title.str.contains(r'conference|ChemInform|news|CrossRef Listing of Deleted DOIs',
                                                    regex=True, case=False, na=False)) |
                     (df.publisher_name == 'EDP Sciences')]
conferences

In [None]:
df.drop(conferences.index, inplace=True)
df.shape

Looks great! Now we'll save our cleaned dataset.

In [12]:
df.to_parquet(parquet_path)