# Named Entity Recognition (NER)

NER extracts pieces of information from text, called entities, that represent categories such as people, places, and things.

Applications: Info Extraction, Info Retrieval, Text understanding

Approaches for Applying NER:

Dictionary-based: Create a dict of words to match entities against. Issue: Maintaining the dict

Rule-based: Use rules (pattern- and context-based) to extract entities. Issue: hand-written rules tend to generalize poorly to real-world text

Machine Learning: Train a model to extract entities and run inference. Issue: Compute Intensive
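The dictionary-based and rule-based approaches above can be sketched in a few lines of plain Python. This is only an illustration: the `GAZETTEER` entries, the `PERSON_RULE` pattern, and the label names are hypothetical and not taken from any library. It also assumes that lowercasing does not change the string length, which holds for ASCII text.

```python
import re

# Hypothetical gazetteer: lowercase surface form -> entity label.
GAZETTEER = {
    "new york": "LOC",
    "paris": "LOC",
    "acme corp": "ORG",
}

# Hypothetical rule: a title followed by a capitalized word is tagged as a person.
PERSON_RULE = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+[A-Z][a-z]+")


def dictionary_entities(text):
    """Return (start, end, surface, label) for every gazetteer entry found in text."""
    found = []
    lowered = text.lower()  # case-insensitive lookup; assumes length is unchanged
    for phrase, label in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(phrase) + r"\b", lowered):
            start, end = match.span()
            found.append((start, end, text[start:end], label))
    return found


def rule_entities(text):
    """Return (start, end, surface, label) for every match of the person rule."""
    return [(m.start(), m.end(), m.group(), "PER") for m in PERSON_RULE.finditer(text)]


def extract_entities(text):
    """Combine both extractors and order the entities by position in the text."""
    return sorted(dictionary_entities(text) + rule_entities(text))


sample = "Dr. Smith flew from Paris to New York for Acme Corp."
for start, end, surface, label in extract_entities(sample):
    print(f"{label}\t{surface}\t[{start}:{end}]")
```

The sketch also shows the issues noted above: every new entity needs a new dictionary entry, and the person rule misses any name that does not follow one of the listed titles.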

Advanced Techniques:

    1. Conditional Random Fields (CRF): A type of probabilistic graphical model that models the conditional distribution of the labels given the input.
        - This makes it a discriminative model.
        - Can use complex, overlapping input features, and models dependencies between neighboring labels/categories in a linear chain.
        - Performs well because it can exploit additional word features such as capitalization, prefixes, etc.
    2. Bidirectional Long Short-Term Memory (BiLSTM) models: A form of Recurrent Neural Network (RNN) that predicts label sequences from input sequences.
        - LSTMs capture long-term temporal dependencies better than plain RNNs.
        - A single LSTM only captures the preceding context in a sequence, so a BiLSTM uses two LSTMs that process the input in forward and reversed directions to capture both previous and future context.
        - This means the model knows the words before and after a specific word in a sentence.
    3. Transformer-based models: Capture contextual info very effectively.
        - Use self-attention mechanisms to consider the entire input sequence simultaneously.
        - Word embeddings: vector representations of words and their relationships.
        - Transformer-based embeddings are much more dynamic and context-aware than earlier static approaches such as Word2Vec and GloVe.
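The self-attention idea from the transformer item above can be sketched in plain Python. This is a toy version under strong assumptions: queries, keys, and values are all the raw input vectors (real transformers apply learned projections and use multiple heads), and the `toy_vectors` values are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (shifted by the max for stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = vectors (no learned weights)."""
    dim = len(vectors[0])
    scale = math.sqrt(dim)
    outputs, all_weights = [], []
    for query in vectors:
        # Score the current token against every token in the sequence, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
        weights = softmax(scores)
        # The contextual vector is the attention-weighted average of all token vectors.
        context = [sum(w * value[i] for w, value in zip(weights, vectors)) for i in range(dim)]
        outputs.append(context)
        all_weights.append(weights)
    return outputs, all_weights


tokens = ["Marcus", "shared", "product"]
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical 2-d embeddings
outputs, weights = self_attention(toy_vectors)
for token, row in zip(tokens, weights):
    print(token, [round(w, 3) for w in row])
```

Each output vector depends on every token in the sequence at once, which is why the same word can receive a different embedding in a different sentence.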

NER Evaluation: Precision, Recall, F1 Score
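As a minimal sketch of these metrics, the function below scores predictions at the entity level using exact matching, which is one common convention: a predicted entity counts as correct only if its span and its label both match a gold entity. The `(start, end, label)` tuples and the sample data are hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Entity-level scores; each entity is a (start, end, label) tuple."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    # Precision: fraction of predicted entities that are correct.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    # Recall: fraction of gold entities that were found.
    recall = true_positives / len(gold_set) if gold_set else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical annotations: one exact match, one wrong label, one spurious entity.
gold = [(0, 9, "PERSON"), (13, 20, "ORG")]
predicted = [(0, 9, "PERSON"), (13, 20, "PRODUCT"), (35, 40, "ORG")]

p, r, f1 = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Token-level scoring is an alternative that gives partial credit for partly correct spans, so reported numbers are only comparable when the same convention is used.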

## Implement 2 NER Models using Spark NLP

In [2]:
!pip install -q pyspark==3.3.0 spark-nlp==4.2.8

!pip install --upgrade -q spark-nlp-display

In [8]:
import json
import pandas as pd
import numpy as np

import sparknlp
import pyspark.sql.functions as F

from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.pretrained import PretrainedPipeline
from pyspark.sql.types import StringType, IntegerType

In [4]:
spark = sparknlp.start()

print("Spark NLP version ", sparknlp.version())
print("Apache spark version ", spark.version)

spark

Spark NLP version  4.2.8
Apache spark version  3.3.0


In [5]:
text = ["""Marcus A. on The Egg: Interested in new features, and shared product on social media."""]

In [9]:
# Define a transformer-based Spark NLP pipeline

# Step 1: Wrap the raw 'text' column in a 'document' annotation
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

# Step 2: Tokenization
tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

# Step 3: Contextual DistilBERT embeddings for each token
embeddings = DistilBertEmbeddings \
    .pretrained('distilbert_base_cased', 'en') \
    .setInputCols(["token", "document"]) \
    .setOutputCol('embeddings')

# Step 4: Entity extraction with a pretrained model that consumes the DistilBERT embeddings
ner_model = NerDLModel.pretrained('ner_ontonotes_distilbert_base_cased', 'en') \
    .setInputCols(["token", "document", "embeddings"]) \
    .setOutputCol('ner')

# Step 5: Convert the IOB representation of NER into a user-friendly one
ner_converter = NerConverter() \
    .setInputCols(["token", "document", "ner"]) \
    .setOutputCol('entities')

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter
])


In [11]:
df = spark.createDataFrame(text, StringType()).toDF('text')
result = pipeline.fit(df).transform(df)

In [13]:
from sparknlp_display import NerVisualizer

NerVisualizer().display(
    result = result.collect()[0],
    label_col = 'entities',
    document_col = 'document'
)

                                                                                

In [14]:
# Define a CRF-based pipeline

# Step 1: Transform the raw 'text' column into a 'document' annotation
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

# Step 2: Tokenization
tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

# Step 3: Perceptron model to tag each word's part of speech
posTagger = PerceptronModel \
    .pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol('pos')

# Step 4: GloVe 100d embeddings (the default pretrained word embeddings)
embeddings = WordEmbeddingsModel \
    .pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol('embeddings')

# Step 5: Entity extraction
ner_model = NerCrfModel.pretrained() \
    .setInputCols(["token", "document", "pos", "embeddings"]) \
    .setOutputCol('ner')

# Step 6: Convert the IOB representation of NER into a user-friendly one
ner_converter = NerConverter() \
    .setInputCols(["token", "document", "ner"]) \
    .setOutputCol('entities')

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    posTagger,
    embeddings,
    ner_model,
    ner_converter
])


In [18]:
df = spark.createDataFrame(text, StringType()).toDF('text')
result = pipeline.fit(df).transform(df)

In [19]:
from sparknlp_display import NerVisualizer

NerVisualizer().display(
    result = result.collect()[0],
    label_col = 'entities',
    document_col = 'document'
)
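Both pipelines end with a converter step that turns IOB tags into entity chunks. The idea behind that conversion can be sketched in plain Python. This is not Spark NLP's implementation: the function name `iob_to_chunks`, the sample tokens and tags, and the choice to join tokens with spaces are assumptions made for illustration.

```python
def iob_to_chunks(tokens, tags):
    """Group IOB-tagged tokens into (text, label) entity chunks."""
    chunks = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        # A B- tag always opens a chunk; an I- tag with a different label is
        # treated leniently as the start of a new chunk.
        starts_new = prefix == "B" or (prefix == "I" and label != current_label)
        if tag == "O" or starts_new:
            # Close the chunk that was being built, if any.
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
        if tag != "O":
            current_tokens.append(token)
            current_label = label
    # Flush a chunk that runs to the end of the sentence.
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks


# Hypothetical tokens and tags, not output from the models above.
tokens = ["John", "Smith", "visited", "New", "York"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
print(iob_to_chunks(tokens, tags))
```

The chunk form is easier to read than one tag per token, and it is the form stored in the `entities` column that the visualizer displays.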

                                                                                