

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_BTC.ipynb)






# **Detect Entities in Twitter Texts**

## 1. Colab Setup

In [1]:
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

--2021-11-17 05:37:04--  http://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://setup.johnsnowlabs.com/colab.sh [following]
--2021-11-17 05:37:04--  https://setup.johnsnowlabs.com/colab.sh
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2021-11-17 05:37:05--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:44

In [2]:
!pip install --ignore-installed spark-nlp-display

Collecting spark-nlp-display
  Downloading spark_nlp_display-1.8-py3-none-any.whl (95 kB)
Collecting spark-nlp
  Using cached spark_nlp-3.3.2-py2.py3-none-any.whl (130 kB)
Collecting ipython
  Downloading ipython-7.29.0-py3-none-any.whl (790 kB)
Collecting pandas
  Downloading pandas-1.3.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.3 MB)
Collecting svgwrite==1.4
  Downloading svgwrite-1.4-py3-none-any.whl (66 kB)
Collecting numpy
  Downloading numpy-1.21.4-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
Collecting pexpect>4.3
  Downloading pexpect-4.8.0-py2.py3-none-any.whl (59 kB)

In [3]:
import pandas as pd
import numpy as np
import json
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

## 2. Start Spark Session

In [4]:
spark = sparknlp.start()

## 3. Sample texts

In [5]:
text_list = ["""Wengers big mistakes is not being ruthless enough with bad players.""",
                  """Aguero goal . From being someone previously so reliable , he 's been terrible this year .""",
                  """Paul Scholes approached Alex Ferguson about making a comeback . Ferguson clearly only too happy to accommodate him .""",
                  """Wikipedia today , as soon as you load the website , hit ESC to prevent the 'blackout ' from loading.""",
                  """David Attenborough shows us a duck billed platypus.""",
                  """London GET UPDATES FROM Peter Hotez""",
                  """Pentagram's Dominic Lippa is working on a new identity for University of Arts London """]

## 4. Define Spark NLP pipeline

In [6]:
# Wrap the raw "text" column into Spark NLP document annotations
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split each document into tokens
tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

# Pretrained BERT token classifier that assigns an NER tag to every token
tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_btc", "en")\
    .setInputCols(["token", "document"])\
    .setOutputCol("ner")\
    .setCaseSensitive(True)

# Merge token-level NER tags into entity chunks
ner_converter = NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document, tokenizer, tokenClassifier, ner_converter])



bert_token_classifier_ner_btc download started this may take some time.
Approximate size to download 385.3 MB
[OK!]
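
To make the role of the `NerConverter` stage concrete, here is a minimal pure-Python sketch of how token-level tags can be grouped into entity chunks. It is a simplified illustration, not Spark NLP's implementation. It assumes the classifier emits IOB-style tags such as `B-PER`, `I-PER`, and `O`; the helper name `merge_iob` and the hand-written tags in the example are hypothetical.

```python
from typing import List, Optional, Tuple


def merge_iob(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (chunk_text, label) pairs.

    Hypothetical helper for illustration only; assumes B-/I-/O tags.
    """
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Close the chunk that is currently open, if any.
        nonlocal current_tokens, current_label
        if current_tokens and current_label is not None:
            chunks.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new chunk.
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag with the same label extends the open chunk.
            current_tokens.append(token)
        else:
            # "O" or a stray I- tag closes the open chunk; the token
            # itself is dropped in this simplified sketch.
            flush()
    flush()
    return chunks


# Hand-written tags for part of one sample sentence (assumed, not model output).
example_tokens = ["Paul", "Scholes", "approached", "Alex", "Ferguson"]
example_tags = ["B-PER", "I-PER", "O", "B-PER", "I-PER"]
example_chunks = merge_iob(example_tokens, example_tags)
print(example_chunks)
```

Each returned pair corresponds to one entry in the `ner_chunk` column: the chunk text plus its entity label.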


## 5. Run the pipeline

In [7]:
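# No stage in this pipeline needs training, so fitting on an empty DataFrame only builds the PipelineModel
model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']})))

# Annotate the sample texts
result = model.transform(spark.createDataFrame(pd.DataFrame({'text': text_list})))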


## 6. Visualize results

In [8]:

# Zip each row's chunk texts with their metadata, then emit one output row per entity
result.select(F.explode(F.arrays_zip('document.result', 'ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \
    .select(
        F.expr("cols['1']").alias("chunk"),
        F.expr("cols['2'].entity").alias("result")) \
    .show(truncate=False)

+-------------------------+------+
|chunk                    |result|
+-------------------------+------+
|Wengers                  |PER   |
|Aguero                   |PER   |
|Paul Scholes             |PER   |
|Alex Ferguson            |PER   |
|Ferguson                 |PER   |
|Wikipedia                |ORG   |
|David Attenborough       |PER   |
|London                   |LOC   |
|Peter Hotez              |PER   |
|Pentagram's              |ORG   |
|Dominic Lippa            |PER   |
|University of Arts London|ORG   |
+-------------------------+------+
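
As a quick follow-up, the table above can be summarised by counting entities per label. The sketch below copies the (chunk, label) pairs from that output by hand into a plain Python list and uses only the standard library; the variable names are hypothetical.

```python
from collections import Counter

# (chunk, label) pairs copied from the table printed above.
entities = [
    ("Wengers", "PER"),
    ("Aguero", "PER"),
    ("Paul Scholes", "PER"),
    ("Alex Ferguson", "PER"),
    ("Ferguson", "PER"),
    ("Wikipedia", "ORG"),
    ("David Attenborough", "PER"),
    ("London", "LOC"),
    ("Peter Hotez", "PER"),
    ("Pentagram's", "ORG"),
    ("Dominic Lippa", "PER"),
    ("University of Arts London", "ORG"),
]

# Count how many chunks were assigned to each entity label.
label_counts = Counter(label for _, label in entities)

for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```

On a larger dataset the same summary would more naturally be computed in Spark with a `groupBy` on the label column instead of collecting rows to the driver.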



In [9]:
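from sparknlp_display import NerVisualizer

# Collect once; calling collect() inside the loop would re-run the Spark job on every iteration
rows = result.collect()
visualizer = NerVisualizer()

for row in rows:
  visualizer.display(
      result = row,
      label_col = 'ner_chunk',
      document_col = 'document'
  )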

