![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/util/Spark_NLP_Structured_Streaming.ipynb)

### Structured Streaming with Spark NLP

This notebook demonstrates how to use Spark NLP with Spark Structured Streaming. The example runs a named entity recognition (NER) pipeline over a stream of text files and keeps a running count of how often each entity occurs.

First, we create a directory to hold the files that will be streamed.

In [1]:
!mkdir -p ner-resources

In [2]:
sentences = [
    "Apple Inc. is planning to open a new store in Paris next year.",
    "Dr. Smith will attend the conference on artificial intelligence in San Francisco on January 15, 2023.",
    "The Eiffel Tower, located in the heart of Paris, is a popular tourist attraction.",
    "Google, headquartered in Mountain View, California, announced a breakthrough in machine learning.",
    "Mary Johnson, the CEO of XYZ Corporation, will deliver the keynote speech at the event.",
    "The Great Barrier Reef, the world's largest coral reef system, is located in Australia.",
    "On July 4th, 1776, the United States declared its independence from British rule.",
    "NASA's Perseverance rover successfully landed on Mars in February 2021.",
    "The Louvre Museum in France houses thousands of works of art, including the Mona Lisa.",
    "Amazon, founded by Jeff Bezos, is one of the largest e-commerce and cloud computing companies.",
    "Tokyo, the capital of Japan, will host the Summer Olympics in 2024.",
    "Albert Einstein, the famous physicist, developed the theory of relativity.",
    "The Nile River is the longest river in Africa, flowing through multiple countries.",
    "The World Health Organization (WHO) plays a crucial role in global health initiatives.",
    "Queen Elizabeth II has been the reigning monarch of the United Kingdom since 1952."
]

file_path = "ner-resources/ner-example.txt"

# Write the sentences to the file
with open(file_path, "w") as file:
    for sentence in sentences:
        file.write(sentence + "\n")

print(f"Sentences written to {file_path}")

Sentences written to ner-resources/ner-example.txt


In [3]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

Installing PySpark 3.2.3 and Spark NLP 5.1.4
setup Colab for PySpark 3.2.3 and Spark NLP 5.1.4


In [4]:
import sparknlp
# let's start Spark with Spark NLP
spark = sparknlp.start()

In [5]:
# Create DataFrame representing the stream of input lines
lines = spark \
    .readStream \
    .format("text") \
    .option("maxFilesPerTrigger", 1) \
    .load("ner-resources/")

In [6]:
# Keep only the text column; each row holds one line (one sentence) from the input files
text_df = lines.select(lines.value)

In [7]:
from sparknlp.annotator import *
from sparknlp.base import *
from pyspark.sql.functions import col

In [8]:
# Create Spark NLP pipeline
document_assembler = DocumentAssembler().setInputCol("value").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained().setInputCols(["document", "token"]).setOutputCol("embeddings")
ner_tagger = NerDLModel.pretrained().setInputCols(["document", "token", "embeddings"]).setOutputCol("ner")
ner_converter = NerConverter().setInputCols(["document", "token", "ner"]).setOutputCol("entities")

# Assemble the pipeline
pipeline = Pipeline(stages=[document_assembler, tokenizer, word_embeddings, ner_tagger, ner_converter])

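# Fit the pipeline (the embeddings and NER models are pretrained, so nothing is trained here)
model = pipeline.fit(text_df)

# Transform the stream and explode the entities array into one row per entity
ner_df = model.transform(text_df).selectExpr("explode(entities)").withColumnRenamed("col", "entities")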

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
ner_dl download started this may take some time.
Approximate size to download 13.6 MB
[OK!]
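To make the next cell easier to follow, here is a minimal sketch of what one exploded row of the `entities` column looks like and which two fields the following cell reads. The dictionary is a hand-written stand-in, not output captured from the pipeline: the `result` field and the `entity` metadata key come from the code in this notebook, while the other field names and the `sentence` and `chunk` keys are assumptions about Spark NLP's annotation layout.

```python
# Hand-written stand-in for one exploded "entities" row.
# Only "result" and metadata["entity"] are used by this notebook;
# the remaining fields and values are illustrative assumptions.
sentence = "Apple Inc. is planning to open a new store in Paris next year."
chunk_text = "Paris"
begin = sentence.index(chunk_text)

entity_row = {
    "annotatorType": "chunk",
    "begin": begin,                       # character offset of the first character
    "end": begin + len(chunk_text) - 1,   # offset of the last character (inclusive)
    "result": chunk_text,
    "metadata": {"entity": "LOC", "sentence": "0", "chunk": "1"},
}


def extract_entity_and_tag(row):
    """Return the (entity text, NER tag) pair, mirroring the select() in the next cell."""
    return row["result"], row["metadata"]["entity"]


print(extract_entity_and_tag(entity_row))
```

In the real pipeline the same two lookups are expressed as `col("entities.result")` and `col("entities.metadata").getItem("entity")`.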


In [9]:
# Extract the relevant information (entities and NER tags)
entities_df = ner_df.select(
    col("entities.result").alias("entity"),
    col("entities.metadata").getItem("entity").alias("tag")
)

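# Group by 'entity' and 'tag', count the occurrences, and write the running totals
# to an in-memory table; note that start() returns a StreamingQuery handle, not a DataFrame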
entity_counts_df = entities_df.groupBy("entity", "tag").count() \
        .writeStream \
        .queryName("entity_counts_table") \
        .outputMode("complete") \
        .format("memory") \
        .start()
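The query above uses the `complete` output mode, which means the whole aggregated table is rewritten on every trigger rather than only the new rows. The following plain-Python sketch mimics that behaviour with a `Counter`; it does not use Spark, and the two batches are hand-written examples rather than real pipeline output.

```python
from collections import Counter

# Illustrative (entity, tag) pairs standing in for two micro-batches.
batch_1 = [("Paris", "LOC"), ("Tokyo", "LOC"), ("Paris", "LOC")]
batch_2 = [("Tokyo", "LOC"), ("Google", "ORG")]

running_counts = Counter()


def process_batch(batch):
    """Fold one micro-batch into the running totals and return the full table."""
    running_counts.update(batch)
    # "complete" mode: each trigger emits the entire aggregated result,
    # including groups that did not change in this batch.
    return dict(running_counts)


table_after_1 = process_batch(batch_1)
table_after_2 = process_batch(batch_2)
print(table_after_2)
```

This is why querying `entity_counts_table` later in the notebook shows totals accumulated over all files processed so far.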

In [10]:
import threading
threading.Event().wait(45)  # Pause for 45 seconds so the stream has time to process the first file

False
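A fixed pause works for a demo but can be too short on a slow machine or needlessly long on a fast one. One alternative is to poll for a condition with a timeout, as in the sketch below. The helper `wait_until` and its parameters are hypothetical names introduced here for illustration. PySpark's `StreamingQuery` also provides `processAllAvailable()`, which blocks until the input available at call time has been processed and is documented as intended for testing.

```python
import time


def wait_until(predicate, timeout_s=60.0, poll_interval_s=1.0):
    """Call predicate() repeatedly until it is truthy or the timeout expires.

    Returns True if the predicate became truthy, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)


# Hypothetical usage with the query above (not executed in this sketch):
# wait_until(
#     lambda: spark.sql("select count(*) from entity_counts_table").first()[0] > 0,
#     timeout_s=120,
# )
```

The helper only checks that some rows have arrived; it does not guarantee that every file in the directory has been processed.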

In [11]:
spark.sql("select * from entity_counts_table").show()   # interactively query in-memory table

+-------------------+----+-----+
|             entity| tag|count|
+-------------------+----+-----+
|            British|MISC|    1|
|       Mary Johnson| PER|    1|
|       Elizabeth II| PER|    1|
|              Paris| LOC|    2|
|     United Kingdom| LOC|    1|
|      San Francisco| LOC|    1|
|NASA's Perseverance| ORG|    1|
|          Australia| LOC|    1|
|              Tokyo| LOC|    1|
|      United States| LOC|    1|
|       Eiffel Tower| LOC|    1|
|             Google| ORG|    1|
|    XYZ Corporation| ORG|    1|
|              Japan| LOC|    1|
|          Mona Lisa| PER|    1|
|         California| LOC|    1|
|          Apple Inc| ORG|    1|
|         Nile River| LOC|    1|
|    Summer Olympics|MISC|    1|
|      Mountain View| LOC|    1|
+-------------------+----+-----+
only showing top 20 rows



Next, we add a second file to the streaming directory.

In [12]:
sentences = [
    "Apple Inc. recently unveiled its latest innovation, a revolutionary product that will change the way we interact with technology.",
    "Dr. Smith, a leading expert in the field of robotics, will conduct a workshop on advanced machine learning techniques next month.",
    "The Eiffel Tower, standing tall against the Parisian skyline, offers breathtaking views of the city and is a must-visit landmark.",
    "Google's research team, based in Silicon Valley, is making strides in developing sustainable technologies for the future.",
    "Mary Johnson, a renowned artist, will showcase her latest collection at the art gallery downtown this weekend.",
    "The Great Barrier Reef, teeming with vibrant marine life, attracts snorkelers and divers from around the world.",
    "On July 4th, 1776, the Founding Fathers signed the Declaration of Independence, marking a pivotal moment in American history.",
    "NASA's Perseverance rover, equipped with state-of-the-art instruments, is exploring the Martian surface for signs of past life.",
    "The Louvre Museum, home to priceless masterpieces, continues to be a cultural treasure trove for art enthusiasts.",
    "Amazon's cloud computing division, led by Jeff Bezos, is at the forefront of shaping the digital landscape.",
    "Tokyo, a bustling metropolis that seamlessly blends tradition and modernity, is gearing up to host the Olympics in 2024.",
    "Albert Einstein's groundbreaking theories, including the theory of relativity, revolutionized our understanding of the universe.",
    "The Nile River, winding through ancient landscapes, has been a source of life and inspiration for countless civilizations.",
    "The World Health Organization (WHO) collaborates with global partners to address public health challenges and promote well-being.",
    "Queen Elizabeth II, the longest-reigning monarch, has witnessed significant historical events during her reign."
]

# Write the sentences to a file
file_path = "ner-resources/ner_example2.txt"
with open(file_path, "w") as file:
    for sentence in sentences:
        file.write(sentence + "\n")

print(f"Sentences written to {file_path}")

Sentences written to ner-resources/ner_example2.txt
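Writing directly into the watched directory is fine for a small demo, but the Structured Streaming guide asks that files for a file source be placed in the directory atomically, which is usually done with a move. A minimal sketch of that pattern follows, assuming the staging directory sits on the same filesystem as the target. The helper `write_atomically` and the `ner-staging` directory name are hypothetical, and the demo writes to a temporary location rather than to `ner-resources/`.

```python
import os
import tempfile


def write_atomically(target_dir, file_name, lines, staging_dir=None):
    """Write lines to a staging file, then move it into target_dir in one step."""
    target_dir = os.path.abspath(target_dir)
    if staging_dir is None:
        # Hypothetical sibling directory; being next to target_dir, it is
        # normally on the same filesystem, which os.replace needs to be atomic.
        staging_dir = os.path.join(os.path.dirname(target_dir), "ner-staging")
    os.makedirs(staging_dir, exist_ok=True)
    os.makedirs(target_dir, exist_ok=True)

    staged_path = os.path.join(staging_dir, file_name)
    with open(staged_path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")

    final_path = os.path.join(target_dir, file_name)
    os.replace(staged_path, final_path)  # rename; the stream never sees a partial file
    return final_path


# Demo in a temporary directory so the notebook's own stream is not affected.
demo_root = tempfile.mkdtemp()
demo_path = write_atomically(
    os.path.join(demo_root, "ner-resources"),
    "ner_example3.txt",
    ["Tokyo is the capital of Japan."],
)
print(demo_path)
```

With this approach the streaming reader only ever sees complete files, whatever their size.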


In [13]:
spark.sql("SELECT * FROM entity_counts_table WHERE count > 1").show()

+------+---+-----+
|entity|tag|count|
+------+---+-----+
| Paris|LOC|    2|
+------+---+-----+



In [14]:
import threading
threading.Event().wait(30)  # Pause for 30 seconds so the stream has time to process the new file

False

In [16]:
spark.sql("SELECT * FROM entity_counts_table WHERE count > 1").show()

+--------------------+---+-----+
|              entity|tag|count|
+--------------------+---+-----+
|        Mary Johnson|PER|    2|
|        Elizabeth II|PER|    2|
|               Paris|LOC|    2|
| NASA's Perseverance|ORG|    2|
|               Tokyo|LOC|    2|
|           Apple Inc|ORG|    2|
|          Nile River|LOC|    2|
|World Health Orga...|ORG|    2|
|               Smith|PER|    2|
|          Jeff Bezos|PER|    2|
|  Great Barrier Reef|LOC|    2|
+--------------------+---+-----+



**Note**: You may need to re-run the query cells to see the updated results.