

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_FewNERD.ipynb)




# **Detect entities using FewNERD NER Model**

## 1. Colab Setup

In [None]:
# Install PySpark and Spark NLP
! pip install -q pyspark==3.3.0 spark-nlp==4.2.8

# Install Spark NLP Display lib
! pip install --upgrade -q spark-nlp-display
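
Before starting Spark, it can help to confirm that the pinned versions were actually installed. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `check_pins` are hypothetical and not part of Spark NLP.

```python
from importlib import metadata

# Versions pinned by the install cell above.
EXPECTED = {"pyspark": "3.3.0", "spark-nlp": "4.2.8"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(expected):
    """Map each mismatching package to (expected version, installed version)."""
    mismatches = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


# Prints a dict of mismatches, or a short confirmation when there are none.
print(check_pins(EXPECTED) or "all pinned versions match")
```

A non-empty result usually means the runtime needs to be restarted or the install cell re-run.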

## 2. Start the Spark session

Import the dependencies and start the Spark session.

In [2]:
import json
import pandas as pd
import numpy as np

import sparknlp
import pyspark.sql.functions as F

from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.pretrained import PretrainedPipeline
from pyspark.sql.types import StringType, IntegerType

In [3]:
spark = sparknlp.start()

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 4.2.8
Apache Spark version: 3.3.0


## 3. A sample text

In [4]:
text_list = ["""12 Corazones ('12 Hearts') is a Spanish-language dating game show produced in the United States for the television network Telemundo since January 2005, based on its namesake Argentine TV show format. The show is filmed in Los Angeles and revolves around the twelve Zodiac signs that identify each contestant. In 2008, Ho filmed a cameo in the Steven Spielberg feature film The Cloverfield Paradox, as a news pundit."""]

## 4. Select the FewNERD NER Model, construct the pipeline and visualize the results.
### Select the FewNERD NER model: **"nerdl_fewnerd_100d"**, **"nerdl_fewnerd_subentity_100d"**

In [5]:
model_list = ["nerdl_fewnerd_100d",
              "nerdl_fewnerd_subentity_100d"]
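
The pipeline in the next cell chains six stages, and each stage reads columns that an earlier stage must already have produced. As a rough sketch of that wiring, the plain-Python check below copies the column names from the pipeline cell and reports any input that is not available yet. It does not use Spark NLP, and `unresolved_inputs` is a hypothetical helper written only for this illustration.

```python
# Column wiring copied from the pipeline cell: (stage, input columns, output column).
STAGES = [
    ("DocumentAssembler", ["text"], "document"),
    ("SentenceDetector", ["document"], "sentence"),
    ("Tokenizer", ["sentence"], "token"),
    ("WordEmbeddingsModel", ["sentence", "token"], "embeddings"),
    ("NerDLModel", ["sentence", "token", "embeddings"], "ner"),
    ("NerConverter", ["document", "token", "ner"], "ner_chunk"),
]


def unresolved_inputs(stages, initial_columns):
    """List (stage, column) pairs whose input is not produced by an earlier stage."""
    available = set(initial_columns)
    missing = []
    for name, inputs, output in stages:
        missing.extend((name, col) for col in inputs if col not in available)
        available.add(output)
    return missing


# An empty list means every input column is produced upstream.
print(unresolved_inputs(STAGES, ["text"]))
```

Reordering or dropping a stage in `STAGES` shows which downstream annotators would lose an input.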

In [6]:
for MODEL_NAME in model_list:
  documentAssembler = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

  sentenceDetector = SentenceDetector()\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

  tokenizer = Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

  embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en")\
            .setInputCols(["sentence", "token"])\
            .setOutputCol("embeddings")

  ner = NerDLModel.pretrained(MODEL_NAME)\
          .setInputCols(["sentence", "token", "embeddings"])\
          .setOutputCol("ner")

  ner_converter = NerConverter()\
      .setInputCols(['document', 'token', 'ner'])\
      .setOutputCol('ner_chunk')

  nlpPipeline = Pipeline(
      stages=[
          documentAssembler, 
          sentenceDetector,
          tokenizer,
          embeddings,
          ner,
          ner_converter])

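  # NerVisualizer renders the detected chunks as highlighted text in the notebook.
  from sparknlp_display import NerVisualizer

  # Wrap the sample text in a one-column DataFrame and run the pipeline on it.
  df = spark.createDataFrame(text_list, StringType()).toDF("text")
  result = nlpPipeline.fit(df).transform(df)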

  print("<----------------- MODEL NAME:","\033[1m" + MODEL_NAME + "\033[0m"," ----------------- >")
  NerVisualizer().display(
      result = result.collect()[0],
      label_col = 'ner_chunk',
      document_col = 'document'
  )


glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
nerdl_fewnerd_100d download started this may take some time.
Approximate size to download 14.2 MB
[OK!]
<----------------- MODEL NAME: nerdl_fewnerd_100d  ----------------- >


glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
nerdl_fewnerd_subentity_100d download started this may take some time.
Approximate size to download 14.7 MB
[OK!]
<----------------- MODEL NAME: nerdl_fewnerd_subentity_100d  ----------------- >
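
The visualizer output is graphical, so a plain list of chunks and labels can be easier to compare across the two models. The sketch below assumes each `ner_chunk` annotation exposes a `result` string and a `metadata` mapping with an `"entity"` key. `FakeAnnotation`, the helper functions, and the `LABEL_A`/`LABEL_B` values are hypothetical stand-ins so the example runs without Spark; they are not real model output.

```python
from collections import Counter, namedtuple

# Stand-in for a Spark NLP annotation, carrying only the two fields used here.
FakeAnnotation = namedtuple("FakeAnnotation", ["result", "metadata"])


def chunks_with_labels(annotations):
    """Return (chunk_text, entity_label) pairs from ner_chunk annotations."""
    return [(a.result, a.metadata.get("entity", "UNKNOWN")) for a in annotations]


def label_counts(annotations):
    """Count how many chunks were assigned to each entity label."""
    return Counter(label for _, label in chunks_with_labels(annotations))


# Hypothetical annotations with placeholder labels, NOT real model output.
sample = [
    FakeAnnotation("Telemundo", {"entity": "LABEL_A"}),
    FakeAnnotation("United States", {"entity": "LABEL_B"}),
    FakeAnnotation("Los Angeles", {"entity": "LABEL_B"}),
]

for chunk, label in chunks_with_labels(sample):
    print(f"{label:10s} {chunk}")
print(label_counts(sample))
```

Inside the loop above, the same helpers could be applied to the real annotations with `chunks_with_labels(result.collect()[0]["ner_chunk"])`, provided the collected rows expose those two fields as assumed.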
