![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/DistilBertForTokenClassification.ipynb)


# `DistilBertForTokenClassification` Models

## 📌Colab Setup

In [None]:
# Installing pyspark and spark-nlp
! pip install -q pyspark==3.2.0 spark-nlp

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display


In [None]:
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp
import pandas as pd
import numpy as np
import json
import os

from pyspark.sql.types import StringType, IntegerType

## 📌Start Spark Session

In [None]:
spark = sparknlp.start(spark32 = True) 

print("Spark NLP version", sparknlp.version())

spark

Spark NLP version 3.4.1


## 📌Select the model

In [None]:
model = 'distilbert_token_classifier_typo_detector'

## 📌Sample texts

In [None]:
typo_sample_texts_en = [
    "On Wednesday , the more than a milion riders who visit the page each month will be met with a cleane , less clutered production , the site 's first signifcant redesign since 2003 .",
    "Much atention has been given to makinq batteries smaller and more powerfful , but scientists say problems can arese when such batteries become heavtier than the devices they power .",
    "The organization recenly hosted exhibits at archittectural conventions in San Francisco and Washington , DC , educeting hundreds of architecs and engineers , signing up nearly 100 onto the petition calling for a new investgation .",
    "He had also stgruggled with addiction during his tine in Congress." ,
    "Park managers atributed the decreasa to a lack of early season snow , the weak economy and conffusion over the legial status of snowmobiling in the park.",
    "Her isolaion has been inttensified by severe artritis and failing eyesight that has meant she cannot walk long distiances and has had to relinguish her driving licencee ."
  ]

In [None]:
typo_sample_texts_is = [
    "Til eru kensluaðferðir sem nýta mismunadi tækni til kennslunnar allt eftir avvikum.",
    "Við höfum skðað og farið vel yfir helstu vefmiðlt landsins og hvað peir innihalda. ",
    "Umhverfisstofnun hetur lagt til að nokkur svöði verði friðlýst. ",
    "Fleiri svüði verða sett á lüstann á kämandi vikum. ",
    "Framtíðaráform umn bygg­ing­ar á mannvirk­um kar í borg eröu oft afar há­leit"  
  ]

## 📌 Define Spark NLP pipeline

In [None]:
language_dict = {'en': typo_sample_texts_en,
                 'is' :typo_sample_texts_is}

In [None]:
def run_pipeline(model, language, text, results):
  """Run the typo-detection pipeline on `text` and store the output DataFrame in `results[language]`."""

  # Wrap the raw text column into Spark NLP documents
  documentAssembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

  # Split each document into sentences
  sentenceDetector = SentenceDetector()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")

  # Split each sentence into tokens
  tokenizer = Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

  # Label every token with the pretrained model for the given language
  tokenClassifier = DistilBertForTokenClassification.pretrained(model, language)\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("ner")

  # Merge labelled tokens into chunks
  ner_converter = NerConverter()\
      .setInputCols(["sentence", "token", "ner"])\
      .setOutputCol("ner_chunk")

  pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter])

  df = spark.createDataFrame(text, StringType()).toDF("text")
  results[language] = pipeline.fit(df).transform(df)
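The stages in `run_pipeline` are chained through column names: each stage reads columns that an earlier stage (or the input DataFrame) produced. The sketch below is plain Python, not Spark NLP code. It copies the input and output column names from the function above into a list and checks that the order is consistent. The helper `check_wiring` is a hypothetical name introduced only for this illustration.

```python
# Each stage as (name, input columns, output column), copied from run_pipeline.
stages = [
    ("DocumentAssembler", ["text"], "document"),
    ("SentenceDetector", ["document"], "sentence"),
    ("Tokenizer", ["sentence"], "token"),
    ("DistilBertForTokenClassification", ["sentence", "token"], "ner"),
    ("NerConverter", ["sentence", "token", "ner"], "ner_chunk"),
]


def check_wiring(stages, initial_columns):
    """Return (stage, missing_column) pairs; an empty list means the order is valid."""
    available = set(initial_columns)
    problems = []
    for name, inputs, output in stages:
        for col in inputs:
            if col not in available:
                problems.append((name, col))
        # The stage's output becomes available to every later stage.
        available.add(output)
    return problems


print(check_wiring(stages, ["text"]))
```

An empty list means every stage finds its inputs. Moving the classifier ahead of the tokenizer, for example, would report that the `token` column is missing.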

In [None]:
results = {}
for language, text in language_dict.items():
  run_pipeline(model, language, text, results)

distilbert_token_classifier_typo_detector download started this may take some time.
Approximate size to download 232.8 MB
[OK!]
distilbert_token_classifier_typo_detector download started this may take some time.
Approximate size to download 482.3 MB
[OK!]


In [None]:
for model_lang, result in results.items():
  print(f"\n {model} ({model_lang}) \n")  
  result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols")) \
        .select(F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)
                  
  print("\n**********************************\n") 


 distilbert_token_classifier_typo_detector (en) 

+--------------+---------+
|chunk         |ner_label|
+--------------+---------+
|milion        |PO       |
|cleane        |PO       |
|clutered      |PO       |
|signifcant    |PO       |
|atention      |PO       |
|makinq        |PO       |
|powerfful     |PO       |
|arese         |PO       |
|heavtier      |PO       |
|recenly       |PO       |
|archittectural|PO       |
|educeting     |PO       |
|architecs     |PO       |
|investgation  |PO       |
|stgruggled    |PO       |
|tine          |PO       |
|atributed     |PO       |
|decreasa      |PO       |
|conffusion    |PO       |
|legial        |PO       |
+--------------+---------+
only showing top 20 rows


**********************************


 distilbert_token_classifier_typo_detector (is) 

+--------------+---------+
|chunk         |ner_label|
+--------------+---------+
|kensluaðferðir|PO       |
|mismunadi     |PO       |
|avvikum       |PO       |
|skðað         |PO       |
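The display cell above pairs each chunk with its metadata using `F.arrays_zip` and then turns every pair into its own row with `F.explode`. As a rough illustration of that reshaping, here is a plain-Python sketch that needs no Spark session. The two input rows are hand-written from chunks in the English output table, and the metadata is simplified to a single `entity` key, which is an assumption (the real metadata map may carry more keys). `explode_zipped` is a hypothetical helper name.

```python
# Hand-written stand-in for two rows of the ner_chunk column:
# parallel arrays of chunk text and per-chunk metadata.
rows = [
    {"result": ["stgruggled", "tine"],
     "metadata": [{"entity": "PO"}, {"entity": "PO"}]},
    {"result": ["atributed", "decreasa", "conffusion", "legial"],
     "metadata": [{"entity": "PO"}] * 4},
]


def explode_zipped(rows):
    """Pair each chunk with its metadata, then flatten to one (chunk, label) tuple per chunk."""
    flat = []
    for row in rows:
        # zip() plays the role of arrays_zip; the outer loop plays the role of explode.
        for chunk, meta in zip(row["result"], row["metadata"]):
            flat.append((chunk, meta["entity"]))
    return flat


for chunk, label in explode_zipped(rows):
    print(f"{chunk:<14} {label}")
```

One difference worth keeping in mind: Python's `zip` stops at the shorter list, while Spark's `arrays_zip` pads the shorter array with nulls. Here both arrays have one entry per chunk, so the two behave the same.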

## 📌Visualization of results

In [None]:
from sparknlp_display import NerVisualizer


visualizer = NerVisualizer()

for model_lang, result in results.items():
  print(f"\n {model} ({model_lang}) \n\n\n")
  # Collect once per language rather than once per row
  for row in result.collect():
    visualizer.display(
        result = row,
        label_col = 'ner_chunk',
        document_col = 'document')
  print("\n**********************************\n") 
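`NerVisualizer` renders HTML, which only shows up in a notebook front end. Where that is not available, a plain-text fallback can give a similar view. The sketch below is an illustration under stated assumptions, not part of `sparknlp_display`: `mark_chunks` is a hypothetical helper that wraps each detected chunk in `[[...]]` by searching for it in the text, using one of the English sample sentences and the two chunks the output table reports for it.

```python
def mark_chunks(text, chunks):
    """Wrap each chunk in [[...]], searching left to right from the previous match."""
    out = []
    pos = 0
    for chunk in chunks:
        start = text.find(chunk, pos)
        if start == -1:
            # Chunk not found after the current position; skip it.
            continue
        out.append(text[pos:start])
        out.append(f"[[{chunk}]]")
        pos = start + len(chunk)
    out.append(text[pos:])
    return "".join(out)


text = "He had also stgruggled with addiction during his tine in Congress."
print(mark_chunks(text, ["stgruggled", "tine"]))
# He had also [[stgruggled]] with addiction during his [[tine]] in Congress.
```

Searching by string is the simplest option for a sketch. Spark NLP annotations also carry `begin` and `end` character offsets, which would be more robust when the same word appears several times in a text.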



