![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/english/dl-ner/ner_albert.ipynb)

## 0. Colab Setup

In [1]:
# This is only to set up PySpark and Spark NLP on Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

--2022-12-23 11:33:23--  http://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://setup.johnsnowlabs.com/colab.sh [following]
--2022-12-23 11:33:23--  https://setup.johnsnowlabs.com/colab.sh
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2022-12-23 11:33:24--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:44

# How to train a NER classifier with Albert embeddings based on Char CNNs - BiLSTM - CRF

## Download the file to the local file system
### It is a standard CoNLL-2003 format training file

In [2]:
# Download CoNLL 2003 Dataset
import os
from pathlib import Path
import urllib.request


download_path = "./eng.train"


if not Path(download_path).is_file():
    print("File not found, downloading it!")
    url = "https://github.com/patverga/torch-ner-nlp-from-scratch/raw/master/data/conll2003/eng.train"
    urllib.request.urlretrieve(url, download_path)
else:
    print("File already present.")


File not found, downloading it!
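
For orientation, a CoNLL-2003 file stores one token per line with four whitespace-separated columns (token, part-of-speech tag, chunk tag, named-entity tag), a blank line between sentences, and a `-DOCSTART-` line between documents. The sketch below parses a tiny hand-written sample in that layout. The sample rows, their tag values, and the `parse_conll` helper are illustrative assumptions and are not taken from `eng.train` or from Spark NLP.

```python
# A tiny hand-written sample in CoNLL-2003 layout (illustrative values only).
SAMPLE = """-DOCSTART- -X- O O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
"""


def parse_conll(text):
    """Split CoNLL-style text into sentences of (token, pos, chunk, ner) tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and document markers both end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        token, pos, chunk, ner = line.split()
        current.append((token, pos, chunk, ner))
    if current:
        sentences.append(current)
    return sentences


for sentence in parse_conll(SAMPLE):
    print([(token, ner) for token, _, _, ner in sentence])
```

Each parsed sentence corresponds to one row of the DataFrame that `CoNLL().readDataset` builds in the next step.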


# Read CoNLL Dataset into Spark dataframe and automatically generate features for future tasks
The readDataset method of the CoNLL class adds all the columns required in the next steps: document, sentence, token, pos and label.

In [3]:
import sparknlp
from sparknlp.training import CoNLL

spark = sparknlp.start()
training_data = CoNLL().readDataset(spark, './eng.train')
training_data.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|EU rejects German...|[{document, 0, 47...|[{document, 0, 47...|[{token, 0, 1, EU...|[{pos, 0, 1, NNP,...|[{named_entity, 0...|
|     Peter Blackburn|[{document, 0, 14...|[{document, 0, 14...|[{token, 0, 4, Pe...|[{pos, 0, 4, NNP,...|[{named_entity, 0...|
| BRUSSELS 1996-08-22|[{document, 0, 18...|[{document, 0, 18...|[{token, 0, 7, BR...|[{pos, 0, 7, NNP,...|[{named_entity, 0...|
|The European Comm...|[{document, 0, 18...|[{document, 0, 18...|[{token, 0, 2, Th...|[{pos, 0, 2, DT, ...|[{named_entity, 0...|
|Germany 's repres...|[{document, 0, 21...|[{document, 0, 21...|[{token, 0, 6, Ge...|[{pos, 0, 6, NNP,..

# Define the NER Pipeline 

### This pipeline defines a pretrained Albert component and a trainable NerDLApproach, which is based on the Char CNN - BiLSTM - CRF architecture

Usually you have to add pipeline components before Albert to produce the document, sentence and token columns. Spark NLP's CoNLL class has already taken care of this for us.

In [4]:

from pyspark.ml import Pipeline

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

# Define the pretrained Albert model.
albert = AlbertEmbeddings.pretrained().setInputCols("sentence", "token")\
 .setOutputCol("albert")

# Define the Char CNN - BiLSTM - CRF model. We will feed it the Albert embeddings
nerTagger = NerDLApproach()\
  .setInputCols(["sentence", "token", "albert"])\
  .setLabelColumn("label")\
  .setOutputCol("ner")\
  .setMaxEpochs(1)\
  .setRandomSeed(0)\
  .setVerbose(0)

# Put everything into the pipeline
pipeline = Pipeline(
    stages=[
        albert,
        nerTagger
    ])

albert_base_uncased download started this may take some time.
Approximate size to download 42.7 MB
[OK!]
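
The pipeline only works if every stage finds its input columns in the DataFrame. As a minimal sketch of that wiring, the plain-Python check below walks the stages in order and reports any column that no earlier stage (or the dataset itself) provides. The `STAGES` table and the `missing_inputs` helper are hypothetical names introduced for illustration; they mirror the setters used above but are not part of Spark NLP.

```python
# Columns that CoNLL().readDataset already provides (see the DataFrame shown earlier).
DATASET_COLUMNS = {"text", "document", "sentence", "token", "pos", "label"}

# (stage name, required columns, output column), mirroring the setters in the pipeline.
# For nerTagger, "label" comes from setLabelColumn rather than setInputCols.
STAGES = [
    ("albert", ["sentence", "token"], "albert"),
    ("nerTagger", ["sentence", "token", "albert", "label"], "ner"),
]


def missing_inputs(stages, available):
    """Return {stage name: [missing columns]} for stages whose inputs are not yet available."""
    available = set(available)
    problems = {}
    for name, inputs, output in stages:
        missing = [column for column in inputs if column not in available]
        if missing:
            problems[name] = missing
        # Later stages may consume this stage's output column.
        available.add(output)
    return problems


print(missing_inputs(STAGES, DATASET_COLUMNS))
print(missing_inputs(STAGES, {"text"}))
```

The second call shows what happens with a raw text DataFrame: the sentence, token and label columns are missing, which is why a plain-text workflow needs extra stages in front of Albert.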


# Fit the Pipeline and get results

In [5]:
ner_df = pipeline.fit(training_data.limit(10)).transform(training_data.limit(50))
ner_df.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|              albert|                 ner|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|EU rejects German...|[{document, 0, 47...|[{document, 0, 47...|[{token, 0, 1, EU...|[{pos, 0, 1, NNP,...|[{named_entity, 0...|[{word_embeddings...|[{named_entity, 0...|
|     Peter Blackburn|[{document, 0, 14...|[{document, 0, 14...|[{token, 0, 4, Pe...|[{pos, 0, 4, NNP,...|[{named_entity, 0...|[{word_embeddings...|[{named_entity, 0...|
| BRUSSELS 1996-08-22|[{document, 0, 18...|[{document, 0, 18...|[{token, 0, 7, BR...|[{pos, 0, 7, NNP,...|[{named_entity, 0...|[{word_embeddings...|[{

### Check out only the result columns

In [6]:
ner_df.select('text', 'ner').limit(1).show(truncate=False)

+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|text                                            |ner                                                                                                                                                                                                                                                                                                                                                 
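
The ner column holds one tag per token. A common follow-up is to merge consecutive tagged tokens into entity spans. The sketch below does this in plain Python for tags that carry a `B-` or `I-` prefix; the `group_entities` helper and the example tags are assumptions for illustration, since the output above is truncated. Inside a Spark NLP pipeline, the NerConverter annotator serves a similar purpose.

```python
def group_entities(tokens, tags):
    """Merge consecutive tokens that share an entity type into (text, type) spans."""
    entities = []
    span_tokens, span_type = [], None

    def flush():
        # Close the span collected so far, if any.
        nonlocal span_tokens, span_type
        if span_tokens:
            entities.append((" ".join(span_tokens), span_type))
        span_tokens, span_type = [], None

    for token, tag in zip(tokens, tags):
        if tag == "O":
            flush()
            continue
        prefix, _, entity_type = tag.partition("-")
        # A "B-" prefix or a change of entity type starts a new span.
        if prefix == "B" or entity_type != span_type:
            flush()
        span_tokens.append(token)
        span_type = entity_type
    flush()
    return entities


# Hypothetical tags for the first two sentences of the dataset.
print(group_entities(["EU", "rejects", "German", "call"], ["B-ORG", "O", "B-MISC", "O"]))
print(group_entities(["Peter", "Blackburn"], ["B-PER", "I-PER"]))
```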

## Alternative Albert models 

See https://github.com/JohnSnowLabs/spark-nlp-models for alternative models. The following are available:
  * albert_base_uncased     = https://tfhub.dev/google/albert_base/3    |  768-embed-dim,   12-layer,  12-heads, 12M parameters
  * albert_large_uncased    = https://tfhub.dev/google/albert_large/3   |  1024-embed-dim,  24-layer,  16-heads, 18M parameters
  * albert_xlarge_uncased   = https://tfhub.dev/google/albert_xlarge/3  |  2048-embed-dim,  24-layer,  32-heads, 60M parameters
  * albert_xxlarge_uncased  = https://tfhub.dev/google/albert_xxlarge/3 |  4096-embed-dim,  12-layer,  64-heads, 235M parameters
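
To compare the variants programmatically, the list above can be captured as a small lookup table. The sketch below copies the figures from the list and estimates the size of one token's embedding vector; the helper names are hypothetical, and the 4 bytes per value assumes 32-bit floats, which the list does not state.

```python
# Figures copied from the list above (parameters in millions).
ALBERT_VARIANTS = {
    "albert_base_uncased": {"embed_dim": 768, "layers": 12, "heads": 12, "params_m": 12},
    "albert_large_uncased": {"embed_dim": 1024, "layers": 24, "heads": 16, "params_m": 18},
    "albert_xlarge_uncased": {"embed_dim": 2048, "layers": 24, "heads": 32, "params_m": 60},
    "albert_xxlarge_uncased": {"embed_dim": 4096, "layers": 12, "heads": 64, "params_m": 235},
}


def embedding_bytes_per_token(variant, bytes_per_value=4):
    """Size of one token vector, assuming bytes_per_value bytes per dimension."""
    return ALBERT_VARIANTS[variant]["embed_dim"] * bytes_per_value


def smallest_variant_with_dim(min_dim):
    """Name of the smallest variant whose embedding dimension is at least min_dim, else None."""
    by_dim = sorted(ALBERT_VARIANTS.items(), key=lambda item: item[1]["embed_dim"])
    for name, spec in by_dim:
        if spec["embed_dim"] >= min_dim:
            return name
    return None


for name in ALBERT_VARIANTS:
    print(name, embedding_bytes_per_token(name), "bytes per token")
```

Larger variants produce wider vectors for every token, so the albert column grows accordingly when the DataFrame is cached or collected.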

In [7]:

from pyspark.ml import Pipeline

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

# Define the pretrained Albert model.
albert_variant = 'albert_xxlarge_uncased'
albert = AlbertEmbeddings.pretrained(albert_variant).setInputCols("sentence", "token")\
 .setOutputCol("albert")

# Define the Char CNN - BiLSTM - CRF model. We will feed it the Albert embeddings
nerTagger = NerDLApproach()\
  .setInputCols(["sentence", "token", "albert"])\
  .setLabelColumn("label")\
  .setOutputCol("ner")\
  .setMaxEpochs(1)\
  .setRandomSeed(0)\
  .setVerbose(0)

# Put everything into the pipeline
pipeline = Pipeline(
    stages=[
        albert,
        nerTagger
    ])

ner_df = pipeline.fit(training_data.limit(10)).transform(training_data.limit(50))
ner_df.show()

albert_xxlarge_uncased download started this may take some time.
Approximate size to download 795 MB
[OK!]
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|              albert|                 ner|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|EU rejects German...|[{document, 0, 47...|[{document, 0, 47...|[{token, 0, 1, EU...|[{pos, 0, 1, NNP,...|[{named_entity, 0...|[{word_embeddings...|[{named_entity, 0...|
|     Peter Blackburn|[{document, 0, 14...|[{document, 0, 14...|[{token, 0, 4, Pe...|[{pos, 0, 4, NNP,...|[{named_entity, 0...|[{word_embeddings...|[{named_entity, 0...|
| BRUSSELS 1996-08-22|[{document, 0, 18...|

In [8]:

ner_df.select('text', 'ner').limit(1).show(truncate=False)

+------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|text                                            |ner                                                                                                                                                                                                                                                                                                                                                 