# Basic Description

Current implementation only searches JSON files recursively in the folder, convert these to a dataframe, and processes the texts in paralell. Processing details below.

* Text is first labeled by language. If english:

* Text acronyms are expanded. i.e. ADD --> Attention Deficit Disorder. This is done using the acronym expansion module in scispaCy (see their homepage for documentation).

* Concepts (general NER) in the text are linked to the Unified Medical Language System (UMLS) and canonicalized. The first alias for the entity is appended to the UMLS column.

* Text is non-destructively lemmatized. No stop words, no deletions of punctuation. For TF-IDF or other algorithms that depend on tokenization, you'll need to run a filter over this column for dimensionality reducation and cleaner text. This mean Sars-Covid-19 stays Sars-Covid-19 as a single token. If you need to match drug names, you can do full-text search on the "sentence" column, or attempt to match to tokens in UMLS, or match NER results in DRUG column.

* A second pass on NER is run using four NER-specific models from scispaCy. "en_ner_craft_md", "en_ner_jnlpba_md","en_ner_bc5cdr_md","en_ner_bionlp13cg_md". For more information, please see their homepage.


## A note on the Extraction class, and section labels

* The extraction class needs to be edited to read the metadata file and choose files accordingly. Right now, this is at the top of our priority list for tasks in #datasets, and if you can help with this please PM Brandon Eychaner. 

* Section labels are _messy_. There are more than 250,000 unique section labels in the JSONs alone. I listed the top 1000 section labels by count and took the obvious ones, and mapped them in the "filter_dict" variable to account for the majority of important sections. This is an area of ongoing work. 


In [None]:
# Create a virtualenv and install requirements via pip
# !python -m virtualenv coronawhy-local
# !source coronawhy-local/bin/activate
# !pip install -r science-requirements.txt
# !pip install googletrans
# !pip install -U scikit-learn
# !pip install googletrans
# !pip install scispacy
# !pip install spacy-langdetect
# !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_lg-0.2.4.tar.gz
# !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_jnlpba_md-0.2.4.tar.gz
# !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_craft_md-0.2.4.tar.gz
# !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bc5cdr_md-0.2.4.tar.gz
# !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

In [None]:
#!pip install googletrans

from googletrans import Translator
import pandas as pd 
import os
import os.path
import numpy as np
import scispacy
import json
import spacy
from tqdm.notebook import tqdm
from scipy.spatial import distance
import ipywidgets as widgets
from scispacy.abbreviation import AbbreviationDetector
from spacy_langdetect import LanguageDetector
# UMLS linking will find concepts in the text, and link them to UMLS. 
from scispacy.umls_linking import UmlsEntityLinker
import time
from spacy.vocab import Vocab
from multiprocessing import Process, Queue, Manager
from multiprocessing.pool import Pool
from functools import partial
import re
import ast
import random

from pandas.io.json import json_normalize

import uuid

In [None]:
def translate(text):
    translator=Translator(dest='en')
    translation=translator.translate(str(text)).text
    return translation

# Returns a dictionary object that's easy to parse in pandas. For tables! :D
def extract_tables_from_json(js):
    json_list = []
    # Figures contain useful information. Since NLP doesn't handle images and tables,
    # we can leverage this text data in lieu of visual data.
    for figure in list(js["ref_entries"].keys()):
        json_dict = ["figref", figure, js["ref_entries"][figure]["text"]]
        json_dict.append(json_dict)
    return json_list

def init_filter_dict():
    inverse = dict() 
    d = {
        "discussion": ["conclusions","conclusion",'| discussion', "discussion",  'concluding remarks',
                       'discussion and conclusions','conclusion:', 'discussion and conclusion',
                       'conclusions:', 'outcomes', 'conclusions and perspectives', 
                       'conclusions and future perspectives', 'conclusions and future directions'],
        "results": ['executive summary', 'result', 'summary','results','results and discussion','results:',
                    'comment',"findings"],
        "introduction": ['introduction', 'background', 'i. introduction','supporting information','| introduction'],
        "methods": ['methods','method','statistical methods','materials','materials and methods', 'methods and materials',
                    'Methods', 'Materials and Methods',
                    'data collection','the study','study design','experimental design','objective',
                    'objectives','procedures','data collection and analysis', 'methodology',
                    'material and methods','the model','experimental procedures','main text',],
        "statistics": ['data analysis','statistical analysis', 'analysis','statistical analyses', 
                       'statistics','data','measures'],
        "clinical": ['clinical', 'diagnosis', 'diagnostic features', "differential diagnoses", 'classical signs','prognosis', 'clinical signs', 'pathogenesis',
                     'etiology','differential diagnosis','clinical features', 'case report', 'clinical findings',
                     'clinical presentation'],
        'treatment': ['treatment', 'interventions'],
        "prevention": ['epidemiology','risk factors'],
        "subjects": ['demographics','samples','subjects', 'study population','control','patients', 
                   'partisucipants','patient characteristics'],
        "animals": ['animals','animal models'],
        "abstract": ["abstract", 'a b s t r a c t','author summary'], 
        "review": ['review','literature review','keywords']}
    
    for key in d: 
        # Go through the list that is saved in the dict:
        for item in d[key]:
            # Check if in the inverted dict the key exists
            if item not in inverse: 
                # If not create a new list
                inverse[item] = [key] 
            else: 
                inverse[item].append(key) 
    return inverse

# Create instance of dictionary with alternative section names
inverted_dict = init_filter_dict()
    
# Get the section names using Brandon's original function
def get_section_name(text):
    text = text.lower()
    if len(text) == 0:
        return(text)
    if text in inverted_dict.keys():
        return(inverted_dict[text][0])
    else:
        if "case" in text or "study" in text: 
            return("methods")
        elif "clinic" in text:
            return("clinical")
        elif "stat" in text:
            return("statistics")
        elif "intro" in text or "backg" in text:
            return("introduction")
        elif "data" in text:
            return("statistics")
        elif "discuss" in text:
            return("discussion")
        elif "patient" in text:
            return("subjects")
        else: 
            return(text)

# Further clean up section names with an ad-hoc function
def further_clean_section(text):
    text = text.lower()
    if "methods" in text:
        text = "methods"
    elif "discussion" in text:
        text = "discussion"
    elif "introduction" in text:
        text = "introduction"
    elif "background" in text:
        text = "introduction"
    elif "conclusions" in text:
        text = "discussion"
    elif "results" in text:
        text = "results"
    elif "concluding remarks" in text:
        text = "discussion" 
    elif "conclusion" in text:
        text = "discussion"
    elif "a b s t r a c t" in text:
        text = "abstract"
    elif "diagnosis" in text:
        text = "clinical"
    elif "clinical signs" in text:
        text = "clinical"
    elif "statistical analysis" in text:
        text = "statistics" 
    return text

def init_nlp():
    nlp = spacy.load("/Users/mnewhauser/python-work/coronawhy-local/coronawhy-local/lib/python3.7/site-packages/en_core_sci_lg/en_core_sci_lg-0.2.4", disable=["tagger"])
    nlp.max_length=2000000

    # We also need to detect language, or else we'll be parsing non-english text 
    # as if it were English. 
    nlp.add_pipe(LanguageDetector(), name='language_detector', last=True)

    # Add the abbreviation pipe to the spacy pipeline. Only need to run this once.
    abbreviation_pipe = AbbreviationDetector(nlp)
    nlp.add_pipe(abbreviation_pipe)

    # Our linker will look up named entities/concepts in the UMLS graph and normalize
    # the data for us. 
    linker = UmlsEntityLinker(resolve_abbreviations=True)
    nlp.add_pipe(linker)
    
    new_vector = nlp(
               """Positive-sense single‐stranded ribonucleic acid virus, subgenus 
                   sarbecovirus of the genus Betacoronavirus. 
                   Also known as severe acute respiratory syndrome coronavirus 2, 
                   also known by 2019 novel coronavirus. It is 
                   contagious in humans and is the cause of the ongoing pandemic of 
                   coronavirus disease. Coronavirus disease 2019 is a zoonotic infectious 
                   disease.""").vector

    vector_data = {"COVID-19": new_vector,
               "2019-nCoV": new_vector,
               "SARS-CoV-2": new_vector}

    vocab = Vocab()
    for word, vector in vector_data.items():
        nlp.vocab.set_vector(word, vector)
    
    return(nlp, linker)

def init_ner():
#     models = ["en_ner_craft_md", "en_ner_jnlpba_md","en_ner_bc5cdr_md","en_ner_bionlp13cg_md"]
    models = [
        '/Users/mnewhauser/python-work/coronawhy-local/coronawhy-local/lib/python3.7/site-packages/en_ner_craft_md/en_ner_craft_md-0.2.4',
        '/Users/mnewhauser/python-work/coronawhy-local/coronawhy-local/lib/python3.7/site-packages/en_ner_jnlpba_md/en_ner_jnlpba_md-0.2.4',
        '/Users/mnewhauser/python-work/coronawhy-local/coronawhy-local/lib/python3.7/site-packages/en_ner_bc5cdr_md/en_ner_bc5cdr_md-0.2.4',
        '/Users/mnewhauser/python-work/coronawhy-local/coronawhy-local/lib/python3.7/site-packages/en_ner_bionlp13cg_md/en_ner_bionlp13cg_md-0.2.4'        
    ]
    nlps = [spacy.load(model) for model in models]
    return(nlps)

# Parse and process the metadata
def preprocess_metadata(directory):
    
    rows = []
    problem_rows = []

    if directory[-1] != "/": 
        directory = directory + "/"

    df = pd.read_csv(directory + "metadata.csv") 
    df.reset_index(drop=True, inplace=True)

    # Read in PMC JSON files
    for i in df[df['pmc_json_files'].notnull()].index:
        section = 'BLANK'
        pmcid = df.iloc[i].pmcid
        filename = directory + df.iloc[i].pmc_json_files
        try: 
            with open(filename) as paperjs:
                jsfile = json.load(paperjs)
        except:
            problem_rows.append(df.iloc[i].cord_uid)
    #         print("Problem with", df.iloc[i].cord_uid)
            continue

        _id = df.iloc[i]["cord_uid"]

        if "title" in jsfile.keys():
            rows.append(dict(cord_uid=_id, section="title", subsection=0, text=jsfile["title"]))
        else:
            rows.append(dict(cord_uid=_id, section="title", subsection=0, text=df.iloc[i].title))
        if "abstract" in jsfile.keys():
            if len(jsfile["abstract"]) > 1:
                for j in range(len(jsfile["abstract"])):
                    rows.append(dict(cord_uid=_id, section="abstract", subsection=0, text=jsfile["abstract"][j]["text"]))
            else:
                rows.append(dict(cord_uid=_id, section="abstract", subsection=0, text=jsfile["abstract"]))
        elif "abstract" in jsfile["metadata"].keys():
            rows.append(dict(cord_uid=_id, section="abstract", subsection=0, text=jsfile["metadata"]["abstract"]))
        else: 
            rows.append(dict(cord_uid=_id, section="abstract", subsection=0, text=df.iloc[i].abstract))

        sections = list(set([k["section"] for k in jsfile["body_text"]]))

        for section in sections: 
            for l in range(len(jsfile["body_text"])):
                if jsfile["body_text"][l]["section"] == section:
                    if section == '':
                        section = "body_text"
                    rows.append(dict(cord_uid=_id, section=section, 
                                     subsection=l, text=jsfile["body_text"][l]["text"]))

        tables = extract_tables_from_json(jsfile)
        for table in tables:
            rows.append(dict(cord_uid=_id, section=table[0], subsection=table[1], text=table[2]))

    # Read in PMC PDF files
    for i in df[(df['pmc_json_files'].isna()) & (df['pdf_json_files'].notnull())].index:
        section = 'BLANK'
        sha = df.iloc[i].sha
        if len(sha.split("; ")) > 1:
            sha = sha.split("; ")[0]
        filename = directory + df.iloc[i].pdf_json_files
        try:
            with open(filename) as paperjs:
                jsfile = json.load(paperjs)
        except:
            problem_rows.append(df.iloc[i].cord_uid)
    #         print("Problem with", df.iloc[i].cord_uid)

        _id = df.iloc[i]["cord_uid"]

        if "title" in jsfile.keys():
            rows.append(dict(cord_uid=_id, section="title", subsection=0, text=jsfile["title"]))
        else:
            rows.append(dict(cord_uid=_id, section="title", subsection=0, text=df.iloc[i].title))
            if "abstract" in jsfile.keys():
                if len(jsfile["abstract"]) > 1:
                    for j in range(len(jsfile["abstract"])):
                        rows.append(dict(cord_uid=_id, section="abstract", subsection=0, text=jsfile["abstract"][j]["text"]))
                else:
                    # Added to fix problematic abstract processing in v12
                    abstract = [item['text'] for item in jsfile['abstract'] if 'text' in item]
                    rows.append(dict(cord_uid=_id, section="abstract", subsection=0, text=''.join(abstract))) 

            elif "abstract" in jsfile["metadata"].keys():
                rows.append(dict(cord_uid=_id, section="abstract", subsection=0, text=jsfile["metadata"]["abstract"]))
            else: 
                rows.append(dict(cord_uid=_id, section="abstract", subsection=0, text=df.iloc[i].abstract))

            sections = list(set([k["section"] for k in jsfile["body_text"]]))

            for section in sections: 
                for l in range(len(jsfile["body_text"])):
                    if jsfile["body_text"][l]["section"] == section:
                        if section == '':
                            section = "body_text"
                        rows.append(dict(cord_uid=_id, section=section, 
                                         subsection=l, text=jsfile["body_text"][l]["text"]))

            tables = extract_tables_from_json(jsfile)
            for table in tables:
                rows.append(dict(cord_uid=_id, section=table[0], subsection=table[1], text=table[2]))

    processed_ids = [d["cord_uid"] for d in rows]

    for i in df[~df["cord_uid"].isin(processed_ids)].index:
        rows.append(dict(cord_uid=df.iloc[i]["cord_uid"], section="title", subsection=0, text=df.iloc[i]["title"]))
        rows.append(dict(cord_uid=df.iloc[i]["cord_uid"], section="abstract", subsection=0, text=df.iloc[i]["abstract"]))
        
    output = pd.DataFrame(rows)

    print('Successfully processed {} rows'.format(len(output.index)))
    print('{} rows could not be processed'.format(len(problem_rows)))
    print('----')
    print('{} unique articles in this dataset'.format(len(output.cord_uid.unique().tolist())))
    
    return(output)

def parallelize_dataframe(df, func, n_cores=6, n_parts=400):
    df_split = np.array_split(df, n_parts)
    pool = Pool(n_cores)
    list(tqdm(pool.imap_unordered(func, df_split), total=len(df_split)))
    pool.close()
    pool.join()
                    
def init_list_cols():
    return ['GGP', 'SO', 'TAXON', 'CHEBI', 'GO', 'CL', 'DNA', 'CELL_TYPE', 'CELL_LINE', 'RNA', 'PROTEIN', 
                          'DISEASE', 'CHEMICAL', 'CANCER', 'ORGAN', 'TISSUE', 'ORGANISM', 'CELL', 'AMINO_ACID',
                          'GENE_OR_GENE_PRODUCT', 'SIMPLE_CHEMICAL', 'ANATOMICAL_SYSTEM', 'IMMATERIAL_ANATOMICAL_ENTITY',
                          'MULTI-TISSUE_STRUCTURE', 'DEVELOPING_ANATOMICAL_STRUCTURE', 'ORGANISM_SUBDIVISION',
                          'CELLULAR_COMPONENT', 'PATHOLOGICAL_FORMATION', "lemma", "UMLS","UMLS_ID"]

def pipeline(df):
    
    name = df.iloc[0]["cord_uid"] + str(df.iloc[0]["section"])+ "0" + ".pickle"
    
    if not os.path.exists("df_parts/"):
        os.mkdir("df_parts/")
        
    if name in os.listdir("df_parts/"):
        return True

    languages = []
    start_chars = []
    end_chars = []
    entities = []
    texts = []
    lemmas = []
    vectors = []
    sections = []
    section_ids = []
    _ids = []
    columns = []
    nlp, linker = init_nlp()
    nlps = init_ner()
    translated = []
    umls_ids = []

    scispacy_ent_types = ['GGP', 'SO', 'TAXON', 'CHEBI', 'GO', 'CL', 'DNA', 'CELL_TYPE', 'CELL_LINE', 'RNA', 'PROTEIN', 
                          'DISEASE', 'CHEMICAL', 'CANCER', 'ORGAN', 'TISSUE', 'ORGANISM', 'CELL', 'AMINO_ACID',
                          'GENE_OR_GENE_PRODUCT', 'SIMPLE_CHEMICAL', 'ANATOMICAL_SYSTEM', 'IMMATERIAL_ANATOMICAL_ENTITY',
                          'MULTI-TISSUE_STRUCTURE', 'DEVELOPING_ANATOMICAL_STRUCTURE', 'ORGANISM_SUBDIVISION',
                          'CELLULAR_COMPONENT', 'PATHOLOGICAL_FORMATION']

    for i in tqdm(range(len(df))):
        doc = nlp(str(df.iloc[i]["text"]))

        if len(doc._.abbreviations) > 0 and doc._.language["language"] == "en":
            doc._.abbreviations.sort()
            join_list = []
            start = 0
            for abbrev in doc._.abbreviations:
                join_list.append(str(doc.text[start:abbrev.start_char]))
                if len(abbrev._.long_form) > 5: #Increase length so "a" and "an" don't get un-abbreviated
                    join_list.append(str(abbrev._.long_form))
                else:
                    join_list.append(str(doc.text[abbrev.start_char:abbrev.end_char]))
                start = abbrev.end_char
            # Reassign fixed body text to article in df.
            new_text = "".join(join_list)
            # We have new text. Re-nlp the doc for futher processing!
            doc = nlp(new_text)

        if doc._.language["language"] == "en" and len(doc.text) > 5:
            languages.append(doc._.language["language"])
            vectors.append(doc.vector)
            translated.append(False)
            lemmas.append([token.lemma_.lower() for token in doc if not token.is_stop and re.search('[a-zA-Z]', str(token))])
            doc_ents = []
            for ent in doc.ents: 
                if len(ent._.umls_ents) > 0:
                    poss = linker.umls.cui_to_entity[ent._.umls_ents[0][0]].canonical_name
                    doc_ents.append(poss)
            entities.append(doc_ents)
            umls_ids.append([entity._.umls_ents[0][0] for entity in doc.ents if len(entity._.umls_ents) > 0])
            _ids.append(df.iloc[i]["cord_uid"])
            sections.append(df.iloc[i]["section"])
            section_ids.append(df.iloc[i]["section_uid"])

        else:   
            try: 
                text = translate(df.iloc[i]["text"])
                doc = nlp(str(df.iloc[i]["text"]))

                if len(doc._.abbreviations) > 0:
                    doc._.abbreviations.sort()
                    join_list = []
                    start = 0
                    for abbrev in doc._.abbreviations:
                        join_list.append(str(doc.text[start:abbrev.start_char]))
                        if len(abbrev._.long_form) > 5: #Increase length so "a" and "an" don't get un-abbreviated
                            join_list.append(str(abbrev._.long_form))
                        else:
                            join_list.append(str(doc.text[abbrev.start_char:abbrev.end_char]))
                        start = abbrev.end_char
                    # Reassign fixed body text to article in df.
                    new_text = "".join(join_list)
                    # We have new text. Re-nlp the doc for futher processing!
                    doc = nlp(new_text)

                if len(doc.text) > 5:
                    languages.append(doc._.language["language"])
                    vectors.append(doc.vector)
                    translated.append(True)
                    sections.append(df.iloc[i]["section"])
                    section_ids.append(df.iloc[i]["section_uid"])

                    lemmas.append([token.lemma_ for token in doc if not token.is_stop and re.search('[a-zA-Z]', str(token))])
                    for ent in doc.ents: 
                        if len(ent._.umls_ents) > 0:
                            poss = linker.umls.cui_to_entity[ent._.umls_ents[0][0]].canonical_name
                            entities.append(poss)
                    umls_ids.append([entity._.umls_ents[0][0] for entity in doc.ents if len(entity._.umls_ents) > 0])
                    entities.append(doc_ents)
                    _ids.append(df.iloc[i]["cord_uid"])
                    sections.append(df.iloc[i]["section"]) ######
                    section_ids.append(df.iloc[i]["section_uid"])

            except:
                entities.append("[]")
                translated.append(False)
                vectors.append(np.zeros(200))
                lemmas.append("[]")
                _ids.append(df.iloc[i,0])
                umls_ids.append("[]")
                languages.append(doc._.language["language"])
                section_ids.append(df.iloc[i]["section_uid"])
                sections.append(df.iloc[i]["section"])

    li1 = _ids
    li2 = sections
    li3 = [i for i in range(len(entities))]

    #     sentence_id = [str(x) + str(y) + str(z)  for x,y,z in zip(li1,li2,li3)]

    new_df = pd.DataFrame(data={"cord_uid": _ids,   
                                "section_uid": section_ids, 
                                "section": sections, 
                                "lemma": lemmas, 
                                "UMLS": entities, 
                                "UMLS_IDS": umls_ids, 
                                "w2vVector": vectors, 
                                "translated":translated})

    for col in scispacy_ent_types:
        new_df[col] = "[]"
    for j in tqdm(new_df.index):
        for nlp in nlps:
            doc = nlp(str(new_df.iloc[j]["section"]))
            keys = list(set([ent.label_ for ent in doc.ents]))
            for key in keys:

                # Some entity types are present in the model, but not in the documentation! 
                # In that case, we'll just automatically add it to the df. 
                if key not in scispacy_ent_types:
                    new_df = pd.concat([new_df,pd.DataFrame(columns=[key])])
                    new_df[key] = "[]"

                values = [ent.text for ent in doc.ents if ent.label_ == key]
                new_df.at[j,key] = values


    new_df["w2vVector"] = [np.asarray(a=i, dtype="float64") for i in new_df["w2vVector"].to_list()]


    new_df.to_pickle("df_parts/" + new_df.iloc[0]["section_uid"] + ".pickle", compression="gzip")            


## Universal data import and preprocessing

In [3]:
%%time

# Change this to whatever version of dataset we're on at this point
version = "v19"

# Enter the directory where the Kaggle dataset is saved
directory = "/Users/mnewhauser/python-work/coronawhy-local/v19/CORD-19-research-challenge_v19/"

# Preprocess the metadata using function updated to hangle v19 data
df = preprocess_metadata(directory)

# Extra preprocessing (added by Mary)
df['section'] = df['section'].apply(further_clean_section)

# Remove the rows where text was unavailable
df = df[df["text"] != "~"]
df["text"] = [str(i).replace("((","").replace("))","").replace("(.","").replace(".)","").replace("q q","").replace("\n","") for i in df["text"]]
mask = (df['text'].str.len() > 10)
df = df.loc[mask]

# Sort data by cord_uid, then by subsection
df = df.sort_values(by=['cord_uid', 'subsection']).reset_index(drop=True)

  call = lambda f, *a, **k: f(*a, **k)


Successfully processed 1823717 rows
311 rows could not be processed
----
63527 unique articles in this dataset
CPU times: user 2min 12s, sys: 16.5 s, total: 2min 28s
Wall time: 2min 53s


## Sectionize the data

In [4]:
%%time

# Concatenate all text within the same sections (sentence order is preserved by perviously sorting by subsection)
df = df.groupby(['cord_uid', 'section'], sort=False).agg({'text': lambda x: ' '.join(map(str, x))}).reset_index()

# Create unique section ID
df['section_uid'] = [str(uuid.uuid4()).split('-')[4] for _ in range(len(df.index))]

# Reorder columns
df = df[['cord_uid', 'section_uid', 'section', 'text']]

CPU times: user 10.9 s, sys: 1.43 s, total: 12.4 s
Wall time: 12.4 s


## Run the pipeline

In [6]:
# Create a random sample of 10 papers to test the pipeline
random_uids = random.sample(df.cord_uid.unique().tolist(), 10)

sample_df = df[df['cord_uid'].isin(random_uids)].reset_index(drop=True)

In [7]:
%%time
parallelize_dataframe(sample_df, pipeline, n_cores=5, n_parts=2)

HBox(children=(IntProgress(value=0, max=2), HTML(value='')))



HBox(children=(IntProgress(value=0, max=113), HTML(value='')))

HBox(children=(IntProgress(value=0, max=114), HTML(value='')))




HBox(children=(IntProgress(value=0, max=113), HTML(value='')))





HBox(children=(IntProgress(value=0, max=114), HTML(value='')))



CPU times: user 703 ms, sys: 687 ms, total: 1.39 s
Wall time: 5min 49s


## Concatenate

In [8]:
# Read in the vectorized data
vec_df = pd.concat([pd.read_pickle("df_parts/" + f, compression="gzip") for f in os.listdir("df_parts/") if f.endswith(".pickle")]).reset_index(drop=True)

In [9]:
# Concatenate and save processed text. 
vec_df.to_pickle(version + "processedLocalText.pkl", compression="gzip")

## Cleanup

The following code is simply cleanup after the extraction process. First, we'll save the text data in a json file. Next we can save the vector data because it's large, and most people won't be using it. 

In [10]:
vec_df = vec_df.replace({'[]': np.nan})

string_cols = ['cord_uid', 'section_uid', 'section', 'language']

list_cols = [col for col in vec_df.columns if col not in string_cols]

for col in vec_df.columns:
    if col in string_cols:
        vec_df[col] = vec_df[col].astype(str)

  op = lambda x: operator.eq(x, b)


In [11]:
vec_df.to_csv(version + '_test_output.csv')