# Download Library

In [25]:
# This only sets up PySpark and Spark NLP on Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

# This installs the Spark NLP visualization library
!pip install -q spark-nlp-display

--2023-08-23 20:43:56--  http://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2023-08-23 20:43:56--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1191 (1.2K) [text/plain]
Saving to: ‘STDOUT’


2023-08-23 20:43:56 (87.2 MB/s) - written to stdout [1191/1191]

Installing PySpark 3.2.3 and Spark NLP 5.0.2
setup Colab for PySpark 3.2.3 and Spark NLP 5

# Import Library

In [26]:
import sparknlp
from pyspark.ml import Pipeline  # explicit import; Pipeline is used below to chain the stages
from sparknlp.annotator import *
from sparknlp.base import *


In [27]:
spark = sparknlp.start()



# Spark NLP Model Set Up

This section sets up a Spark NLP pipeline for named entity recognition (NER). Each stage of the pipeline is described below:

1. **DocumentAssembler**:
   - This is the first stage of the pipeline.
   - It reads the raw text from the input column 'text' and outputs a column named 'document'.
   - The 'document' column wraps the text in Spark NLP's annotation format, which the later stages consume.

2. **Tokenizer**:
   - The Tokenizer splits the text into individual words or tokens.
   - It takes the 'document' column as input and outputs a column named 'token' containing tokenized words.

3. **BertEmbeddings**:
   - This step utilizes pre-trained BERT word embeddings to convert tokens into dense vector representations.
   - It uses the pre-trained BERT model 'small_bert_L8_512' for English ('en').
   - It takes the 'document' and 'token' columns as input and outputs a column named 'embeddings' containing BERT embeddings.
   - The `.setMaxSentenceLength(512)` sets the maximum sequence length for BERT embeddings.

4. **NerDLModel**:
   - This step is a Named Entity Recognition (NER) model utilizing deep learning.
   - It uses the pre-trained NER model 'onto_small_bert_L8_512' for English ('en').
   - The model takes BERT embeddings, token, and document columns as input and outputs a column named 'ner' containing NER predictions.
   - It identifies entities like persons, locations, organizations, etc. in the text.

5. **NerConverter**:
   - The NER model emits one tag per token, typically in IOB style (for example `B-PERSON`, `I-PERSON`, `O`), where the `B-` and `I-` prefixes mark the beginning and the continuation of an entity.
   - The NerConverter merges these per-token tags into entity chunks for easier interpretation.
   - It takes the 'document', 'token', and 'ner' columns as input and outputs a column named 'ner_chunk'.

6. **Pipeline**:
   - The final step is to define the entire NLP pipeline using a list of stages.
   - The stages include the DocumentAssembler, Tokenizer, BertEmbeddings, NerDLModel, and NerConverter.
   - The pipeline will process text data through these stages in sequence to perform NER.

This pipeline processes the text data through tokenization, embedding using BERT, named entity recognition using a deep learning model, and then converting the NER predictions into more interpretable chunks. The resulting pipeline can be used to extract information about named entities from the input text data.
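To make the last conversion step more concrete, here is a minimal pure-Python sketch of how per-token IOB tags can be grouped into entity chunks. It only illustrates the idea and is not Spark NLP's implementation; the function name `iob_to_chunks` and the example tags are assumptions made up for this illustration, not actual model output.

```python
from typing import List, Optional, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (entity_text, label) chunks."""
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Close the entity that is currently open, if there is one.
        nonlocal current_tokens, current_label
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag with the same label extends the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity:
            # close the open entity and skip this token.
            flush()
    flush()
    return chunks


# Hypothetical tags for illustration only (not real model output).
tokens = ["The", "Mona", "Lisa", "is", "held", "at", "the", "Louvre", "in", "Paris"]
tags = ["B-WORK_OF_ART", "I-WORK_OF_ART", "I-WORK_OF_ART", "O", "O", "O",
        "B-FAC", "I-FAC", "O", "B-GPE"]

chunks = iob_to_chunks(tokens, tags)
print(chunks)
# [('The Mona Lisa', 'WORK_OF_ART'), ('the Louvre', 'FAC'), ('Paris', 'GPE')]
```

In the real pipeline the chunks also carry character offsets into the original text, which this sketch leaves out for brevity.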

The pre-trained models ('small_bert_L8_512' and 'onto_small_bert_L8_512') are downloaded automatically the first time `.pretrained()` is called, so the runtime needs internet access. Before executing this pipeline, make sure the setup script above has installed the required dependencies and that `sparknlp.start()` has created the Spark session.

In [28]:
documentAssembler = DocumentAssembler()\
    .setInputCol('text')\
    .setOutputCol('document')

tokenizer = Tokenizer()\
    .setInputCols(['document'])\
    .setOutputCol('token')

embeddings = BertEmbeddings.pretrained(name='small_bert_L8_512', lang='en')\
    .setInputCols(['document', 'token'])\
    .setOutputCol('embeddings')\
    .setMaxSentenceLength(512)

ner_model = NerDLModel.pretrained('onto_small_bert_L8_512', 'en')\
    .setInputCols(['document', 'token', 'embeddings'])\
    .setOutputCol('ner')

ner_converter = NerConverter()\
    .setInputCols(['document', 'token', 'ner'])\
    .setOutputCol('ner_chunk')

nlp_pipeline = Pipeline(stages=[
    documentAssembler,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter
])

small_bert_L8_512 download started this may take some time.
Approximate size to download 149.1 MB
[OK!]
onto_small_bert_L8_512 download started this may take some time.
Approximate size to download 14.8 MB
[OK!]
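Because each stage reads columns written by earlier stages, the order of the stages matters. The following pure-Python sketch checks that wiring without needing Spark. The column names are copied from the cell above, while the helper `missing_inputs` and the `STAGES` table are hypothetical names introduced only for this illustration.

```python
from typing import Dict, Iterable, List, Tuple

Stage = Tuple[str, List[str], str]  # (stage name, input columns, output column)

# Input and output columns as configured in the pipeline cell above.
STAGES: List[Stage] = [
    ("DocumentAssembler", ["text"], "document"),
    ("Tokenizer", ["document"], "token"),
    ("BertEmbeddings", ["document", "token"], "embeddings"),
    ("NerDLModel", ["document", "token", "embeddings"], "ner"),
    ("NerConverter", ["document", "token", "ner"], "ner_chunk"),
]


def missing_inputs(stages: Iterable[Stage],
                   initial_columns: Iterable[str]) -> Dict[str, List[str]]:
    """Return, per stage, the input columns that no earlier stage has produced."""
    available = set(initial_columns)
    problems: Dict[str, List[str]] = {}
    for name, inputs, output in stages:
        missing = [col for col in inputs if col not in available]
        if missing:
            problems[name] = missing
        available.add(output)
    return problems


print(missing_inputs(STAGES, ["text"]))  # {}

# Swapping the last two stages breaks the chain: 'ner' is not available yet.
swapped = STAGES[:3] + [STAGES[4], STAGES[3]]
print(missing_inputs(swapped, ["text"]))  # {'NerConverter': ['ner']}
```

An empty result means every stage finds its inputs, which is the case for the order used in this notebook.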


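# Run the Pipeline
We fit the pipeline on our sample text and then transform it. Every stage here is either pre-trained or rule-based, so `fit()` does not train any model weights; it only produces the fitted pipeline that `transform()` applies to the data.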

In [29]:
text_list = [
    ["""William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder
    of Microsoft Corporation. During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO),
    president and chief software architect, while also being the largest individual shareholder until May 2014. He is one of the best-known entrepreneurs and pioneers of the
    microcomputer revolution of the 1970s and 1980s. Born and raised in Seattle, Washington, Gates co-founded Microsoft with childhood friend Paul Allen in 1975, in Albuquerque,
    New Mexico; it went on to become the world's largest personal computer software company. Gates led the company as chairman and CEO until stepping down as CEO in January
    2000, but he remained chairman and became chief software architect. During the late 1990s, Gates had been criticized for his business tactics, which have been considered
    anti-competitive. This opinion has been upheld by numerous court rulings. In June 2006, Gates announced that he would be transitioning to a part-time role at Microsoft
    and full-time work at the Bill & Melinda Gates Foundation, the private charitable foundation that he and his wife, Melinda Gates, established in 2000.[9] He gradually
    transferred his duties to Ray Ozzie and Craig Mundie. He stepped down as chairman of Microsoft in February 2014 and assumed a new post as technology adviser to support
    the newly appointed CEO Satya Nadella."""],

    ["""The Mona Lisa is a 16th century oil painting created by Leonardo. It's held at the Louvre in Paris."""]
]

df_example = spark.createDataFrame(text_list).toDF("text")
result = nlp_pipeline.fit(df_example).transform(df_example)

## Result Information

1. **text**:
   - This is the original text input you provided in each row of the DataFrame.

2. **document**:
   - This column is produced by the `DocumentAssembler` stage of the pipeline.
   - Each entry holds the document text together with its begin and end offsets and any metadata.

3. **token**:
   - This column contains the tokenized words from the text.
   - The `Tokenizer` stage split the document into individual words or tokens.

4. **embeddings**:
   - This column contains the BERT embeddings for each token in the document.
   - The `BertEmbeddings` stage converted the tokens into dense vector representations using a pre-trained BERT model.

5. **ner**:
   - This column contains the named entity recognition (NER) predictions for each token in the document.
   - The `NerDLModel` stage predicted the named entities in the text, such as person names, dates, organizations, etc.

6. **ner_chunk**:
   - This column represents the NER predictions in a more interpretable chunk format.
   - The `NerConverter` stage converted the sequential NER predictions into chunks representing continuous spans of named entities.

Overall, each row carries the input text through tokenization, BERT embedding, and named entity recognition, and `ner_chunk` holds the final, human-readable entities. The intermediate columns make it possible to inspect what each stage produced.

In [30]:
result.show(2)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|               token|          embeddings|                 ner|           ner_chunk|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|William Henry Gat...|[{document, 0, 15...|[{token, 0, 6, Wi...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 0, 22, W...|
|The Mona Lisa is ...|[{document, 0, 98...|[{token, 0, 2, Th...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 0, 12, T...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
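The truncated cells above, such as `{document, 0, 98...` and `{token, 0, 2, Th...`, show the annotator type followed by begin and end character offsets, where the end offset is inclusive. The sketch below reproduces those numbers for the second sample text in plain Python. The `Annotation` tuple and both helper functions are simplified stand-ins assumed for this illustration, and the whitespace split is much cruder than Spark NLP's `Tokenizer`, which also separates punctuation.

```python
import re
from typing import List, NamedTuple


class Annotation(NamedTuple):
    annotator_type: str
    begin: int   # offset of the first character
    end: int     # offset of the last character (inclusive)
    result: str


def document_annotation(text: str) -> Annotation:
    # One annotation that spans the whole text.
    return Annotation("document", 0, len(text) - 1, text)


def whitespace_tokens(text: str) -> List[Annotation]:
    # Simplification: split on whitespace only, keeping character offsets.
    return [
        Annotation("token", m.start(), m.end() - 1, m.group())
        for m in re.finditer(r"\S+", text)
    ]


text = ("The Mona Lisa is a 16th century oil painting created by Leonardo. "
        "It's held at the Louvre in Paris.")

doc = document_annotation(text)
tokens = whitespace_tokens(text)

print(doc.begin, doc.end)                                    # 0 98
print(tokens[0].begin, tokens[0].end, tokens[0].result)      # 0 2 The
# Slicing with end + 1 recovers the token text from the offsets.
print(text[tokens[2].begin:tokens[2].end + 1])               # Lisa
```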



# Result Visualization


In [31]:
from sparknlp_display import NerVisualizer

# Collect once and reuse the rows; calling collect() twice would run the pipeline twice
rows = result.collect()
visualizer = NerVisualizer()

print("\n my 1st document: \n")

visualizer.display(
    result = rows[0],
    label_col = 'ner_chunk',
    document_col = 'document'
)

print("\n my 2nd document: \n")

visualizer.display(
    result = rows[1],
    label_col = 'ner_chunk',
    document_col = 'document'
)


 my 1st document: 




 my 2nd document: 

