![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Open_Source/25.01.NerCrf.ipynb)

## **Ner CRF**

- This Named Entity Recognizer is based on the Conditional Random Field (CRF) algorithm. In Spark NLP, NER CRF is a machine learning annotator that identifies and extracts named entities from unstructured text. Spark NLP provides pretrained NER CRF models, and users can also train their own custom NER models with the same algorithm.


**📖 Learning Objectives:**

1. Understand how to train your own model with the `NerCrfApproach` annotator and how to run a pretrained `NerCrfModel`.
2. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [NerCRF](https://nlp.johnsnowlabs.com/docs/en/annotators#nercrf)

- Python Docs : [NerCRF](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/ner/ner_crf/index.html)

- Scala Docs : [NerCRF](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/ner/crf/NerCrfModel)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/).


## **Colab Setup**

In [None]:
!pip install -q pyspark==3.3.0  spark-nlp==4.2.5



In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

spark = sparknlp.start()

## **📜 Background**

- Conditional Random Fields (CRFs) are a class of probabilistic graphical models commonly used in machine learning and natural language processing (NLP). In NLP, CRFs are used for sequence labeling tasks, which assign a label to each element in a sequence of observations, such as part-of-speech tags to the words of a sentence or named-entity labels (people, organizations, locations) to the tokens of a text.
- `NerCrfModel` is the Spark NLP annotator that extracts named entities with a pretrained CRF model. Its training counterpart, `NerCrfApproach`, allows a generic model to be trained with the CRF algorithm. The training data should be a labeled Spark Dataset, e.g. in CoNLL 2003 IOB format, with annotation-type columns. The data should have columns of type `DOCUMENT`, `TOKEN`, `POS`, `WORD_EMBEDDINGS` and an additional label column of annotator type `NAMED_ENTITY`.

- These columns can be produced with, for example:

 - a `SentenceDetector`,

 - a `Tokenizer`,

 - a `PerceptronModel` and

 - a `WordEmbeddingsModel`.

## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS`

- Output: `NAMED_ENTITY`


## **🔎 Parameters**


- `includeConfidence`: Whether to include confidence scores in the annotation metadata (default: `False`).


🔎 A NER CRF model is trained on a labeled dataset that pairs text with the corresponding named entity labels. During training, the model learns which patterns and features of the input text are associated with named entities, such as the presence of specific words or phrases, syntactic structures, or contextual information.

🔎 A CoNLL dataset already includes sentence, token, POS tag and label columns with their respective annotator types. If a custom dataset is used, these columns need to be produced first, for example with the annotators listed in the Background section.

🔎 Once the model is trained, it can be used to predict named entities by assigning labels to each token based on the learned patterns and features. The predictions are made using a probabilistic framework that takes into account the dependencies between adjacent tokens in the sequence.
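
To make the role of adjacent-token dependencies concrete, here is a minimal sketch of Viterbi decoding over a linear-chain model. It is a textbook illustration, not Spark NLP's internal implementation; the function name `viterbi` and all scores below are invented for the example.

```python
from typing import Dict, List, Sequence


def viterbi(emissions: Sequence[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]],
            labels: Sequence[str]) -> List[str]:
    """Return the highest-scoring label sequence.

    emissions[t][y]      : score of label y at position t
    transitions[prev][y] : score of moving from label prev to label y
    """
    if not emissions:
        return []
    # best[t][y] = best total score of any path that ends in label y at position t
    best = [{y: emissions[0][y] for y in labels}]
    back = []  # back[t-1][y] = best previous label when position t has label y
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[p][y])
            scores[y] = best[-1][prev] + transitions[prev][y] + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy scores (invented) for the tokens "Turner", "Newall", "say"
labels = ["O", "I-ORG"]
emissions = [
    {"O": 0.0, "I-ORG": -0.2},   # "Turner": slightly prefers O on its own
    {"O": -1.0, "I-ORG": 1.0},   # "Newall"
    {"O": 2.0, "I-ORG": -2.0},   # "say"
]
transitions = {
    "O": {"O": 0.5, "I-ORG": -0.5},
    "I-ORG": {"O": 0.0, "I-ORG": 1.0},
}

print(viterbi(emissions, transitions, labels))  # ['I-ORG', 'I-ORG', 'O']
```

Scoring each token independently would tag "Turner" as `O`. The transition score between neighbouring `I-ORG` labels is what makes the full-sequence decoding label both tokens as one organization.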

To understand the concept better, we will use the pretrained Conditional Random Field based Named Entity Recognizer (`ner_crf`), which relies on `glove_100d` embeddings and automatically extracts the following entities:

ORG, PER, LOC, MISC

- Next, we prepare the data and the pipeline.

- We use the sentences, tokens, POS tags and labels from the CoNLL-formatted dataset, as mentioned above.

- We will use the train and test datasets from the John Snow Labs GitHub repository, so let us download them first:

In [None]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/NER_NCBIconlltrain.txt
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/NER_NCBIconlltest.txt

- Now, read the training dataset as a CoNLL file:

In [None]:
from sparknlp.training import CoNLL

trainingData = CoNLL().readDataset(spark, 'NER_NCBIconlltrain.txt')

trainingData.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Identification of...|[{document, 0, 89...|[{document, 0, 89...|[{token, 0, 13, I...|[{pos, 0, 13, NN,...|[{named_entity, 0...|
|The adenomatous p...|[{document, 0, 21...|[{document, 0, 21...|[{token, 0, 2, Th...|[{pos, 0, 2, NN, ...|[{named_entity, 0...|
|Complex formation...|[{document, 0, 63...|[{document, 0, 63...|[{token, 0, 6, Co...|[{pos, 0, 6, NN, ...|[{named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows
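
For readers unfamiliar with the file layout behind these columns, the following is a rough pure-Python sketch of how a CoNLL 2003 style file is organised: one token per line with POS, chunk and NER columns, and blank lines between sentences. The `parse_conll` helper and the `SAMPLE` text are hypothetical and only illustrate the format; the `CoNLL().readDataset` reader used above remains the way to load data for Spark NLP.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str, str]]  # (token, pos_tag, ner_label)

# Hypothetical sample in CoNLL 2003 layout: token, POS tag, chunk tag, NER label
SAMPLE = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""


def parse_conll(text: str) -> List[Sentence]:
    """Split CoNLL-style text into sentences of (token, pos, label) triples."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and document markers end the current sentence
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # skip malformed lines in this sketch
        # First column is the token, second the POS tag, last the NER label
        current.append((fields[0], fields[1], fields[-1]))
    if current:
        sentences.append(current)
    return sentences


sentences = parse_conll(SAMPLE)
print(len(sentences))   # 2
print(sentences[1])     # [('Peter', 'NNP', 'B-PER'), ('Blackburn', 'NNP', 'I-PER')]
```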



In [None]:
testData = CoNLL().readDataset(spark, 'NER_NCBIconlltest.txt')
testData.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Clustering of mis...|[{document, 0, 10...|[{document, 0, 10...|[{token, 0, 9, Cl...|[{pos, 0, 9, NN, ...|[{named_entity, 0...|
|Ataxia - telangie...|[{document, 0, 13...|[{document, 0, 13...|[{token, 0, 5, At...|[{pos, 0, 5, NN, ...|[{named_entity, 0...|
|The risk of cance...|[{document, 0, 15...|[{document, 0, 15...|[{token, 0, 2, Th...|[{pos, 0, 2, NN, ...|[{named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows
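
With data in this shape, training a custom CRF tagger could look roughly like the sketch below. It assumes the default pretrained word embeddings and a small, arbitrary epoch range; the helper name `build_crf_training_pipeline` is hypothetical, and the imports sit inside the function so that defining it does not require a running Spark session.

```python
def build_crf_training_pipeline(min_epochs: int = 1, max_epochs: int = 3):
    """Build an unfitted pipeline that trains a CRF NER tagger.

    Expects a DataFrame with `sentence`, `token`, `pos` and `label` columns,
    such as the one returned by CoNLL().readDataset(...).
    """
    # Lazy imports: only needed when the function is actually called
    from pyspark.ml import Pipeline
    from sparknlp.annotator import NerCrfApproach, WordEmbeddingsModel

    # Word embeddings are the one input column the CoNLL reader does not provide
    embeddings = (
        WordEmbeddingsModel.pretrained()
        .setInputCols(["sentence", "token"])
        .setOutputCol("embeddings")
    )

    # The trainable CRF annotator; epoch values here are illustrative assumptions
    ner_tagger = (
        NerCrfApproach()
        .setInputCols(["sentence", "token", "pos", "embeddings"])
        .setLabelColumn("label")
        .setMinEpochs(min_epochs)
        .setMaxEpochs(max_epochs)
        .setOutputCol("ner")
    )

    return Pipeline(stages=[embeddings, ner_tagger])
```

A possible usage, assuming `sparknlp.start()` has been called, would be `custom_model = build_crf_training_pipeline().fit(trainingData)` followed by `custom_model.transform(testData)`. The labels predicted by such a model are whatever labels the training file contains.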



- Spark NLP follows the Spark ML pipeline approach; the pipeline below includes the stages needed to extract the entities from the text:

In [None]:
# Import the required modules and classes
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import (
    Tokenizer,
    PerceptronModel,    
    WordEmbeddingsModel,
    NerCrfModel,
    NerConverter
)

import pyspark.sql.functions as F

# Step 1: Transforms raw texts to `document` annotation
document_assembler = DocumentAssembler() \
.setInputCol('text') \
.setOutputCol('document')

# Step 2: Tokenization
tokenizer = Tokenizer() \
.setInputCols(['document']) \
.setOutputCol('token')

# Step 3: Perceptron model to tag words' part-of-speech
posTagger = PerceptronModel\
.pretrained()\
.setInputCols(["token", "document"])\
.setOutputCol("pos")

# Step 4: Glove100d Embeddings
embeddings = WordEmbeddingsModel.pretrained()\
.setInputCols(["token", "document"])\
.setOutputCol("embeddings")

# Step 5: Entity Extraction
ner_model = NerCrfModel.pretrained()\
.setInputCols(['document', 'token', 'pos', 'embeddings']) \
.setOutputCol('ner')

# Step 6: Converts the IOB representation of NER tags to user-friendly entity chunks
ner_converter = NerConverter() \
.setInputCols(['document', 'token', 'ner']) \
.setOutputCol('entities')

# Define the pipeline
pipeline = Pipeline(stages=[
document_assembler, 
tokenizer,
posTagger,
embeddings,
ner_model,
ner_converter
])

# Fit the pipeline on the training DataFrame, then transform the same DataFrame.
# The NER model is pretrained, so fit() does not retrain it.
result = pipeline.fit(trainingData).transform(trainingData)

pos_anc download started this may take some time.
Approximate size to download 3.9 MB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
ner_crf download started this may take some time.
Approximate size to download 10.2 MB
[OK!]


In [None]:
# Fit the pipeline on the training dataset
pipelineModel = pipeline.fit(trainingData)

In [None]:
# Get the predictions by transforming the test dataset
predictions = pipelineModel.transform(testData)

The pretrained model was trained with `glove_100d` embeddings, so the same embeddings must be used when running it.

Now, we will explode the results to get a tidy DataFrame of the entities. Tokens with no associated entity (tagged “O”) do not appear here, because `NerConverter` only emits entity chunks.

In [None]:
result.select(F.explode(F.arrays_zip(result.entities.result, 
                                     result.entities.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(15, truncate=False)

+--------------+---------+
|chunk         |ner_label|
+--------------+---------+
|APC2          |ORG      |
|APC           |ORG      |
|Wnt           |MISC     |
|GSK           |ORG      |
|APC           |ORG      |
|Tcf           |ORG      |
|APC           |ORG      |
|Mammalian APC2|ORG      |
|APC           |ORG      |
|SAMP          |ORG      |
|APC           |ORG      |
|APC2          |LOC      |
|Tcf           |ORG      |
|APC           |ORG      |
|APC2          |LOC      |
+--------------+---------+
only showing top 15 rows
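
Multi-token chunks such as "Mammalian APC2" in the table above result from merging consecutive IOB tags of the same entity type. The sketch below shows that idea in plain Python. It is a simplification for illustration only: the `iob_to_chunks` helper is hypothetical, it joins tokens with a single space, and it does not reproduce every option of `NerConverter`.

```python
from typing import List, Optional, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge token-level IOB tags into (chunk_text, entity_type) pairs."""
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        # "B-ORG" / "I-ORG" -> "ORG"; anything else (e.g. "O") -> no entity
        entity = tag[2:] if tag[:2] in ("B-", "I-") else None
        # Only an I- tag of the same type extends the chunk being built
        continues = (
            entity is not None
            and tag.startswith("I-")
            and entity == current_type
        )
        if not continues:
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], entity
        if entity is not None:
            current_tokens.append(token)

    # Flush a chunk that runs to the end of the sentence
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_type))
    return chunks


tokens = ["Unions", "representing", "workers", "at", "Turner", "Newall", "say"]
tags = ["I-ORG", "O", "O", "O", "I-ORG", "I-ORG", "O"]
print(iob_to_chunks(tokens, tags))  # [('Unions', 'ORG'), ('Turner Newall', 'ORG')]
```

The sketch accepts chunks that begin with either `B-` or `I-`, since some tagging schemes only use `B-` to separate two adjacent entities of the same type.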



#### `setIncludeConfidence`

The `setIncludeConfidence` parameter adds information about the model's predictions to the output.

When this parameter is set to `True`, the `NerCrfModel` includes confidence scores in the metadata of each predicted label. A confidence score represents how confident the model is that a given token belongs to a certain entity type.
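
As a small illustration of how these scores can be used, the sketch below assumes that each token's metadata is a map from label name to a score stored as a string, which is the shape the lookup `x[str(y)]` in the cells below relies on. The helper names are hypothetical. The scores of the predicted labels mirror values shown later in this notebook, while the scores of the other labels are invented.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str, Dict[str, str]]  # (token, predicted_label, metadata)


def predicted_confidence(label: str, metadata: Dict[str, str]) -> float:
    """Return the score that the metadata map stores for the predicted label."""
    return float(metadata[label])


def low_confidence_tokens(rows: List[Row],
                          threshold: float = 0.5) -> List[Tuple[str, str, float]]:
    """Keep tokens whose predicted label scored below the threshold."""
    flagged = []
    for token, label, metadata in rows:
        score = predicted_confidence(label, metadata)
        if score < threshold:
            flagged.append((token, label, score))
    return flagged


# Assumed metadata shape; non-predicted scores are made up for the example
rows: List[Row] = [
    ("at", "O", {"O": "0.99999714", "I-ORG": "0.000002"}),
    ("Turner", "I-ORG", {"O": "0.25", "I-ORG": "0.28356913", "I-PER": "0.2"}),
    ("Newall", "I-ORG", {"O": "0.1", "I-ORG": "0.8567402"}),
]

print(low_confidence_tokens(rows))  # [('Turner', 'I-ORG', 0.28356913)]
```

Flagging low-confidence predictions in this way can help decide which tokens deserve manual review; the threshold of 0.5 is an arbitrary choice.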

In [None]:
import pyspark.sql.functions as F

In [None]:
ner_model = NerCrfModel.pretrained()\
.setInputCols(['document', 'token', 'pos', 'embeddings']) \
.setOutputCol('ner')\
.setIncludeConfidence(True)

ner_converter = NerConverter() \
.setInputCols(['document', 'token', 'ner']) \
.setOutputCol('entities')

nlpPipeline = Pipeline(stages=[
document_assembler, 
tokenizer,
posTagger,
embeddings,
ner_model,
ner_converter
])

#result = pipeline.fit(trainingData).transform(trainingData)

pipelineModel = nlpPipeline.fit(trainingData)
 
predictions = pipelineModel.transform(testData)


result_df = predictions.select(F.explode(F.arrays_zip(predictions.token.result,
                                                 predictions.ner.result, 
                                                 predictions.ner.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("ner_label"),
                          F.expr("cols['2']").alias("confidence"))

if result_df.count() > 1:
    # The metadata map holds one score per label; this UDF keeps the score of the predicted label
    udf_func = F.udf(lambda x, y: x[str(y)])
    print("\n**************Confidence of each ner label**********************\n")
    result_df.withColumn('confidence', udf_func(result_df.confidence, result_df.ner_label)).show(truncate=False)
    print("\n**************All confidences of related token********************\n")

    # Show the full metadata map (scores for every label) of each token
    result_df.show(25, truncate=100)

ner_crf download started this may take some time.
Approximate size to download 10.2 MB
[OK!]

**************Confidence of each ner label**********************

+--------------+---------+----------+
|token         |ner_label|confidence|
+--------------+---------+----------+
|Clustering    |O        |0.75534743|
|of            |O        |0.98782486|
|missense      |O        |0.9796193 |
|mutations     |O        |0.9999527 |
|in            |O        |0.9999911 |
|the           |O        |0.9999225 |
|ataxia        |O        |0.99971086|
|-             |O        |0.9982199 |
|telangiectasia|O        |0.9915096 |
|gene          |O        |0.999975  |
|in            |O        |0.9999874 |
|a             |O        |0.9999958 |
|sporadic      |O        |0.99970394|
|T             |O        |0.9534909 |
|-             |O        |0.9984583 |
|cell          |O        |0.9999864 |
|leukaemia     |O        |0.99920917|
|.             |O        |0.999914  |
|Ataxia        |O        |0.47806594|
|-  

Next, we download the News Category dataset to demonstrate the `includeConfidence` parameter on a different example.

In [None]:
!wget -q -O news_category_test.csv https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_test.csv

In [None]:
import pyspark.sql.functions as F

news_df = spark.read\
                .option("header", "true")\
                .csv("news_category_test.csv")\
                .withColumnRenamed("description", "text")

news_df.show(10, truncate=120)

+--------+------------------------------------------------------------------------------------------------------------------------+
|category|                                                                                                                    text|
+--------+------------------------------------------------------------------------------------------------------------------------+
|Business|Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Fede...|
|Sci/Tech| TORONTO, Canada    A second team of rocketeers competing for the  #36;10 million Ansari X Prize, a contest for priva...|
|Sci/Tech| A company founded by a chemistry researcher at the University of Louisville won a grant to develop a method of produ...|
|Sci/Tech| It's barely dawn when Mike Fitzpatrick starts his shift with a blur of colorful maps, figures and endless charts, bu...|
|Sci/Tech| Southern California's smog fighting agency went after emissions o

In [None]:
ner_model = NerCrfModel.pretrained()\
.setInputCols(['document', 'token', 'pos', 'embeddings']) \
.setOutputCol('ner')\
.setIncludeConfidence(True)

nlpPipeline = Pipeline(stages=[
document_assembler, 
tokenizer,
posTagger,
embeddings,
ner_model,
ner_converter
])

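# None of the stages is trained from the data here, so fitting on an
# empty DataFrame is enough to build the PipelineModel
empty_df = spark.createDataFrame([['']]).toDF("text")

pipelineModel = nlpPipeline.fit(empty_df)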

result = pipelineModel.transform(news_df.limit(10))

result_df = result.select(F.explode(F.arrays_zip(result.token.result,
                                                 result.ner.result, 
                                                 result.ner.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("ner_label"),
                          F.expr("cols['2']").alias("confidence"))

if result_df.count() > 1:
    # The metadata map holds one score per label; this UDF keeps the score of the predicted label
    udf_func = F.udf(lambda x, y: x[str(y)])
    print("\n**************Confidence of each ner label**********************\n")
    result_df.withColumn('confidence', udf_func(result_df.confidence, result_df.ner_label)).show(truncate=False)
    print("\n**************All confidences of related token********************\n")

    # Show the full metadata map (scores for every label) of each token
    result_df.show(25, truncate=100)

ner_crf download started this may take some time.
Approximate size to download 10.2 MB
[OK!]

**************Confidence of each ner label**********************

+------------+---------+----------+
|token       |ner_label|confidence|
+------------+---------+----------+
|Unions      |I-ORG    |0.8294739 |
|representing|O        |0.9926739 |
|workers     |O        |0.99973226|
|at          |O        |0.99999714|
|Turner      |I-ORG    |0.28356913|
|Newall      |I-ORG    |0.8567402 |
|say         |O        |0.9993721 |
|they        |O        |0.9999939 |
|are         |O        |0.99974704|
|'           |O        |0.9902924 |
|disappointed|O        |0.99677885|
|'           |O        |0.9986422 |
|after       |O        |0.9999958 |
|talks       |O        |0.9998918 |
|with        |O        |0.999976  |
|stricken    |O        |0.9958817 |
|parent      |O        |0.99985147|
|firm        |O        |0.9998687 |
|Federal     |I-ORG    |0.6107093 |
|Mogul       |I-ORG    |0.7592184 |
+-----------