![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/annotation/text/english/sentence-detection/SentenceDetector_advanced_examples.ipynb)


# [Sentence Detector](https://sparknlp.org/docs/en/annotators#sentencedetector)

Sentence Detector is an annotator that detects sentence boundaries using regular
expressions.

The following rule categories are checked when looking for sentence boundaries:

1. Lists ("(i), (ii)", "(a), (b)", "1., 2.")
2. Numbers
3. Abbreviations
4. Punctuations
5. Multiple Periods
6. Geo-Locations/Coordinates ("N°. 1026.253.553.")
7. Ellipsis ("...")
8. In-between punctuation
9. Quotation marks
10. Exclamation Points
11. Basic Breakers (".", ";")

Let's see how we can customize the annotator to suit specific needs.

## Installation

Run this block only if you are inside Google Colab, where it sets up Spark NLP;
otherwise skip it.

In [None]:
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

### Starting Spark NLP

In [None]:
import sparknlp
spark = sparknlp.start()


print("Spark NLP version: {}".format(sparknlp.version()))
print("Apache Spark version: {}".format(spark.version))

Spark NLP version: 4.3.1
Apache Spark version: 3.0.2


## Customization

### Simple Example
Now we will create the parts for the pipeline. Because the `SentenceDetector` only
requires `DOCUMENT` type annotations, the pipeline needs just one additional
stage, a `DocumentAssembler`.

In this example we assume the data has a fixed separator between sentences, and
we want to use that separator to detect them.

In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

pipeline = Pipeline().setStages([
    documentAssembler,
    sentence
])

data = spark.createDataFrame([
    ["This is a sentence\tThis is another one\tHow about a third one?"]
]).toDF("text")

result = pipeline.fit(data).transform(data)
result.selectExpr("explode(sentence.result)").show(5, False)

+-------------------------------------------------------------+
|col                                                          |
+-------------------------------------------------------------+
|This is a sentence	This is another one	How about a third one?|
+-------------------------------------------------------------+



As we can see, the default settings do not split the text at the tab characters.
We will add the tab character as a custom bound.

In [None]:
sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setCustomBounds(["\t"])

pipeline = Pipeline().setStages([
    documentAssembler,
    sentence
])

result = pipeline.fit(data).transform(data)
result.selectExpr("explode(sentence.result)").show(5, False)

+----------------------+
|col                   |
+----------------------+
|This is a sentence    |
|This is another one   |
|How about a third one?|
+----------------------+
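
To build intuition for what a custom bound does here, the same split can be approximated outside Spark with Python's `re` module. This is only a sketch: the helper `split_on_bounds` is hypothetical and is not how `SentenceDetector` is implemented; it simply treats every custom bound as a separator.

```python
import re

def split_on_bounds(text, custom_bounds):
    """Hypothetical helper: split text wherever any custom bound matches.

    Illustration only; Spark NLP's SentenceDetector runs on the JVM and
    applies its own logic.
    """
    # Join the bounds into a single alternation of non-capturing groups,
    # so re.split does not return the separators themselves.
    combined = "|".join("(?:{})".format(bound) for bound in custom_bounds)
    # Strip surrounding whitespace and drop empty pieces.
    return [piece.strip() for piece in re.split(combined, text) if piece.strip()]

text = "This is a sentence\tThis is another one\tHow about a third one?"
sentences = split_on_bounds(text, ["\t"])
print(sentences)
```

The three pieces correspond to the three rows in the table above.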



### Advanced Example

In the next example we will see how to exclude some characters that would
otherwise be detected as sentence boundaries. To do so, we reconstruct the
default rules and leave out the ones we do not want.

These rules are taken from the [`PragmaticContentFormatter`](https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/main/scala/com/johnsnowlabs/nlp/annotators/sbd/pragmatic/PragmaticContentFormatter.scala).

In [None]:
lists = [
    "(\\()[a-z]+\\)|^[a-z]+\\)",
    '\\s\\d{1,2}\\.\\s|^\\d{1,2}\\.\\s|\\s\\d{1,2}\\.\\)|^\\d{1,2}\\.\\)|\\s\\-\\d{1,2}\\.\\s|^\\-\\d{1,2}\\.\\s|s\\-\\d{1,2}\\.\\)|^\\-\\d{1,2}(.\\))'
    ]
numbers = [
    "(?<=\\d)\\.(?=\\d)",
    "\\.(?=\\d)",
    "(?<=\\d)\\.(?=\\S)",
]
special_abbreviations = [
    "\\b[a-zA-Z](?:\\.[a-zA-Z])+(?:\\.(?!\\s[A-Z]))*",
    "(?i)p\\.m\\.*",
    "(?i)a\\.m\\.*",
]
abbreviations = [
    "\\.(?='s\\s)|\\.(?='s\\$)|\\.(?='s\\z)",
    "(?<=Co)\\.(?=\\sKG)",
    "(?<=^[A-Z])\\.(?=\\s)",
    "(?<=\\s[A-Z])\\.(?=\\s)",
]
punctuations = ["(?<=\\S)[!\\?]+(?=\\s|\\z|\\$)"]
multiple_periods = ["(?<=\\w)\\.(?=\\w)"]
geo_locations = ["(?<=[a-zA-z]°)\\.(?=\\s*\\d+)"]
ellipsis = ["\\.\\.\\.(?=\\s+[A-Z])", "(?<=\\S)\\.{3}(?=\\.\\s[A-Z])"]
in_between_punctuation = [
    "(?<=\\s|^)'[\\w\\s?!\\.,|'\\w]+'(?:\\W)",
    "\"[\\w\\s?!\\.,]+\"",
    "\\[[\\w\\s?!\\.,]+\\]",
    "\\([\\w\\s?!\\.,]+\\)",
]
quotation_marks = ["\\?(?=(\\'|\\\"))"]
exclamation_points = [
    "\\!(?=(\\'|\\\"))",
    "\\!(?=\\,\\s[a-z])",
    "\\!(?=\\s[a-z])",
]
basic_breakers = ["\\.", ";"]
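
Before handing patterns to the annotator, it can help to check which characters each group actually matches. The sketch below does this with Python's `re` module, which is an assumption made for convenience: Spark NLP evaluates the patterns on the JVM, and some constructs used above (for example `\\z`) are not accepted by every Python version, so only groups that are valid in both engines are copied here. The helper `matched_positions` is hypothetical.

```python
import re

# Copies of three rule groups from the cell above.
numbers = [
    "(?<=\\d)\\.(?=\\d)",
    "\\.(?=\\d)",
    "(?<=\\d)\\.(?=\\S)",
]
multiple_periods = ["(?<=\\w)\\.(?=\\w)"]
basic_breakers = ["\\.", ";"]

def matched_positions(patterns, text):
    """Hypothetical helper: sorted start offsets where any pattern matches."""
    positions = set()
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            positions.add(match.start())
    return sorted(positions)

text = "Pi is 3.14; e is 2.71."

print(matched_positions(numbers, text))           # → [7, 18]
print(matched_positions(multiple_periods, text))  # → [7, 18]
print(matched_positions(basic_breakers, text))    # → [7, 10, 18, 21]
```

Here the `numbers` and `multiple_periods` groups match only the periods inside `3.14` and `2.71`, while `basic_breakers` also matches the semicolon and the final period.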

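Let's assume we do not want to use the basic breakers (the period and the
semicolon), so we leave those regexes out. The cell below also leaves out
`multiple_periods`, so periods between word characters are not treated as
bounds either.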

In [None]:
bounds = [
    *lists,
    *numbers,
    *abbreviations,
    *special_abbreviations,
    *punctuations,
    # *multiple_periods,
    *geo_locations,
    *ellipsis,
    *in_between_punctuation,
    *quotation_marks,
    *exclamation_points,
    # *basic_breakers, # Let's skip the basic breakers.
]
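
Commenting lines in and out works, but it is easy to lose track of which groups were dropped. As an alternative sketch, the groups can be kept in a dictionary and filtered by name. The helper `build_bounds`, the name `bounds_sketch`, and the shortened `rule_groups` dictionary are all hypothetical and are not part of Spark NLP.

```python
def build_bounds(rule_groups, skip=()):
    """Hypothetical helper: flatten named rule groups, omitting those in `skip`."""
    unknown = set(skip) - set(rule_groups)
    if unknown:
        # Fail loudly on a misspelled group name instead of silently keeping it.
        raise KeyError("Unknown rule groups: {}".format(sorted(unknown)))
    return [
        pattern
        for name, patterns in rule_groups.items()
        if name not in skip
        for pattern in patterns
    ]

# A shortened stand-in for the full set of groups defined above.
rule_groups = {
    "numbers": ["(?<=\\d)\\.(?=\\d)", "\\.(?=\\d)", "(?<=\\d)\\.(?=\\S)"],
    "multiple_periods": ["(?<=\\w)\\.(?=\\w)"],
    "ellipsis": ["\\.\\.\\.(?=\\s+[A-Z])", "(?<=\\S)\\.{3}(?=\\.\\s[A-Z])"],
    "basic_breakers": ["\\.", ";"],
}

bounds_sketch = build_bounds(rule_groups, skip={"multiple_periods", "basic_breakers"})
print(len(bounds_sketch))  # → 5
```

The resulting list can be passed to `setCustomBounds` in the same way as the hand-written `bounds` list.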


In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setCustomBounds(bounds) \
    .setUseCustomBoundsOnly(True)

pipeline = Pipeline().setStages([
    documentAssembler,
    sentence
])

data = spark.createDataFrame([
    ["this.is.one.sentence\nThis is the second one; not broken"]
]).toDF("text")

result = pipeline.fit(data).transform(data)
result.selectExpr("explode(sentence.result)").show(5, False)

+-------------------------------------------------------+
|col                                                    |
+-------------------------------------------------------+
|this.is.one.sentence
This is the second one; not broken|
+-------------------------------------------------------+

