# Extracting Named Entities with Labels

## Models
Several models are available through scispaCy, and four of them are trained specifically for NER ([Named-entity recognition](https://en.wikipedia.org/wiki/Named-entity_recognition)) on biomedical tasks. The outputs of these models allow fine-grained categorical entity extraction, from cellular components to genes, organs, tissue types, and more. To get the most out of our data, we can combine the outputs of all four models and let subject matter experts decide later which types of named entities data scientists should use for analysis.

In [1]:
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_craft_md-0.2.4.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_jnlpba_md-0.2.4.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bc5cdr_md-0.2.4.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Collecting https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_craft_md-0.2.4.tar.gz
  Downloading https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_craft_md-0.2.4.tar.gz (70.1 MB)
Building wheels for collected packages: en-ner-craft-md
  Building wheel for en-ner-craft-md (setup.py) ... done
  Created wheel for en-ner-craft-md: filename=en_ner_craft_md-0.2.4-py3-none-any.whl size=70540

In [8]:
import spacy 
import scispacy
import pandas as pd 
import os
import numpy as np
import json
from tqdm import tqdm
import ipywidgets as widgets
import time

Here, we list our models and load each one as a separate spaCy processing pipeline.

In [5]:
models = ["en_ner_craft_md", "en_ner_jnlpba_md","en_ner_bc5cdr_md","en_ner_bionlp13cg_md"]
nlps = [spacy.load(model) for model in models]
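
To make the idea of combining model outputs concrete, here is a minimal sketch of grouping entity texts by label. The helper name `group_entities_by_label` and the sample pairs are hypothetical and not real model output; with the pipelines loaded above, the pairs would come from `[(ent.text, ent.label_) for ent in nlp(sentence).ents]` for each `nlp` in `nlps`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_entities_by_label(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (text, label) pairs into {label: [text, ...]}, keeping input order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


# Illustrative pairs only (an assumption, not output from any scispaCy model).
sample_pairs = [
    ("aspirin", "CHEMICAL"),
    ("headache", "DISEASE"),
    ("ibuprofen", "CHEMICAL"),
]
grouped = group_entities_by_label(sample_pairs)
print(grouped)
```

Pairs from all four pipelines can be chained into a single call, which gives one dictionary per sentence regardless of which model produced each entity.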

This step is system-specific: I split the original dataset into 1000 parts to make memory consumption easier to manage on a remote server.

In [7]:
files_to_process = [f for f in os.listdir("df_parts/") if f.endswith("processed.csv")]

## Generating the output

This loop reads each file, runs every sentence through each NER model, and writes the extracted entities to the matching label columns. Each file is then saved, so the parts can later be concatenated into a single dataframe. Note that each entity column must be given the "object" dtype; otherwise df.at[i, j] cannot assign a list to a cell.
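
The dtype requirement can be seen in isolation with a small sketch. The two-row dataframe and its contents are made up for illustration; only the `object` dtype and the `df.at` assignment mirror the loop in this notebook.

```python
import pandas as pd

# Toy data (an assumption); the real input comes from the CSV parts.
df = pd.DataFrame(
    {"sentence": ["Aspirin and ibuprofen relieve headache.", "No entities here."]}
)

# Create the entity column with object dtype up front.
df["CHEMICAL"] = pd.Series([None] * len(df), index=df.index, dtype="object")

# .at addresses a single cell; in an object column the list is stored as one value.
df.at[0, "CHEMICAL"] = ["Aspirin", "ibuprofen"]

print(df["CHEMICAL"].dtype)
print(df.at[0, "CHEMICAL"])
```

Rows that never receive a list keep their empty placeholder, which is what happens to sentences where a model finds no entities of that type.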

In [None]:
# Entity labels documented for the four scispaCy NER models.
scispacy_ent_types = ['GGP', 'SO', 'TAXON', 'CHEBI', 'GO', 'CL', 'DNA', 'CELL_TYPE', 'CELL_LINE', 'RNA', 'PROTEIN',
                      'DISEASE', 'CHEMICAL', 'CANCER', 'ORGAN', 'TISSUE', 'ORGANISM', 'CELL', 'AMINO_ACID',
                      'GENE_OR_GENE_PRODUCT', 'SIMPLE_CHEMICAL', 'ANATOMICAL_SYSTEM', 'IMMATERIAL_ANATOMICAL_ENTITY',
                      'MULTI-TISSUE_STRUCTURE', 'DEVELOPING_ANATOMICAL_STRUCTURE', 'ORGANISM_SUBDIVISION',
                      'CELLULAR_COMPONENT', 'PATHOLOGICAL_FORMATION']

for f in tqdm(files_to_process):
    df = pd.read_csv(os.path.join("df_parts", f))
    # Add one empty object-dtype column per label so that cells can hold lists.
    for col in scispacy_ent_types:
        if col not in df.columns:
            df[col] = pd.Series(index=df.index, dtype="object")

    for i in df.index:
        # Only English sentences are run through the models.
        if df.at[i, "language"] != "en":
            continue
        sentence = str(df.at[i, "sentence"])
        for nlp in nlps:
            doc = nlp(sentence)
            for key in {ent.label_ for ent in doc.ents}:
                # Some entity types are present in the model, but not in the documentation!
                # In that case, we'll just automatically add it to the df.
                if key not in df.columns:
                    df[key] = pd.Series(index=df.index, dtype="object")
                df.at[i, key] = [ent.text for ent in doc.ents if ent.label_ == key]

    # Save next to the input, swapping the suffix for "_complete.csv".
    filename = os.path.join("df_parts", f.split("_")[0] + "_complete.csv")
    df.to_csv(filename, index=False)


  0%|          | 0/269 [00:00<?, ?it/s]
  0%|          | 1/269 [02:15<10:05:12, 135.49s/it]
  1%|          | 2/269 [04:48<10:25:52, 140.65s/it]
  1%|          | 3/269 [07:22<10:41:31, 144.70s/it]
  1%|▏         | 4/269 [10:05<11:03:54, 150.32s/it]
  2%|▏         | 5/269 [12:36<11:01:43, 150.39s/it]
  2%|▏         | 6/269 [14:49<10:36:56, 145.31s/it]
  3%|▎         | 7/269 [17:16<10:36:57, 145.87s/it]
  3%|▎         | 8/269 [19:41<10:33:23, 145.61s/it]
  3%|▎         | 9/269 [22:13<10:38:55, 147.44s/it]
  4%|▎         | 10/269 [24:53<10:52:27, 151.15s/it]
  4%|▍         | 11/269 [27:29<10:55:52, 152.53s/it]
  4%|▍         | 12/269 [29:45<10:32:53, 147.76s/it]
  5%|▍         | 13/269 [32:31<10:53:26, 153.15s/it]
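
With the progress bar estimating more than ten hours for the full run, it may be useful to resume after an interruption instead of starting over. The sketch below is one possible approach, not part of the original workflow: `pending_parts` is a hypothetical helper, and the file names in `listing` are invented. It reuses the naming rule from the loop above, where the text before the first underscore plus `_complete.csv` names the output.

```python
from typing import Iterable, List


def pending_parts(filenames: Iterable[str]) -> List[str]:
    """Return '*processed.csv' inputs that have no matching '<prefix>_complete.csv'."""
    names = set(filenames)
    pending = []
    for name in sorted(names):
        if not name.endswith("processed.csv"):
            continue
        prefix = name.split("_")[0]  # same prefix rule as the processing loop
        if prefix + "_complete.csv" not in names:
            pending.append(name)
    return pending


# Hypothetical directory listing; in the notebook it would be os.listdir("df_parts/").
listing = ["0_processed.csv", "0_complete.csv", "1_processed.csv", "notes.txt"]
print(pending_parts(listing))
```

Passing the result to the loop in place of `files_to_process` would skip parts that already have an output file.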

## Putting it all together

Now let's concatenate everything together, save it to a master file, and use it for data analysis!

In [None]:
files_to_join = [f for f in os.listdir("df_parts") if f.endswith("complete.csv")]
df_list = []
for f in tqdm(files_to_join):
    # os.listdir returns bare file names, so prepend the directory before reading.
    df_list.append(pd.read_csv(os.path.join("df_parts", f)))
df = pd.concat(df_list, ignore_index=True)
df.to_csv("fulltext_processed_03302020.csv")
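
One detail matters for later analysis: a list stored in a cell is written to CSV as its string form, so after `pd.read_csv` the entity columns hold text such as `"['aspirin', 'ibuprofen']"` and empty cells come back as NaN. A minimal sketch of converting a cell back to a list follows; `parse_entity_cell` is a hypothetical helper name, and it assumes the cell was originally written from a Python list of strings.

```python
import ast
from typing import List


def parse_entity_cell(cell: object) -> List[str]:
    """Turn a CSV cell such as "['aspirin', 'ibuprofen']" back into a list."""
    if not isinstance(cell, str) or not cell.strip():
        # Empty cells are read back as NaN (a float), not as a string.
        return []
    value = ast.literal_eval(cell)  # safely evaluates Python literals only
    if isinstance(value, (list, tuple)):
        return list(value)
    return [str(value)]


print(parse_entity_cell("['aspirin', 'ibuprofen']"))
print(parse_entity_cell(float("nan")))
```

Applied to a column, for example `df["CHEMICAL"].apply(parse_entity_cell)`, this restores one list per row.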