

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/TEXT_PREPROCESSING.ipynb)




# **Pre-process text**
## **Convert text to tokens, remove punctuation and stop words, and perform stemming and lemmatization using Spark NLP's annotators**

**Demo of the following annotators:**


* SentenceDetector
* Tokenizer
* Normalizer
* Stemmer
* Lemmatizer
* StopWordsCleaner

## 1. Colab Setup

In [1]:
# Install Java
!apt-get update -qq
!apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
!java -version

# Install PySpark
!pip install --ignore-installed -q pyspark==2.4.4

# Install Spark NLP
!pip install --ignore-installed spark-nlp

openjdk version "11.0.8" 2020-07-14
OpenJDK Runtime Environment (build 11.0.8+10-post-Ubuntu-0ubuntu118.04.1)
OpenJDK 64-Bit Server VM (build 11.0.8+10-post-Ubuntu-0ubuntu118.04.1, mixed mode, sharing)
  Building wheel for pyspark (setup.py) ... done
Collecting spark-nlp
  Downloading https://files.pythonhosted.org/packages/b5/a2/5c2e18a65784442ded6f6c58af175ca4d99649337de569fac55b04d7ed8e/spark_nlp-2.5.5-py2.py3-none-any.whl (124kB)
Installing collected packages: spark-nlp
Successfully installed spark-nlp-2.5.5


In [2]:
import pandas as pd
import numpy as np
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
import json
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

## 2. Start Spark Session

In [3]:
spark = sparknlp.start()

## 3. Setting sample text

In [4]:
# Sample text to run through the pipeline

text_list = ["""The Geneva Motor Show, the first major car show of the year, opens tomorrow with U.S. Car makers hoping to make new inroads into European markets due to the cheap dollar, automobile executives said. Ford Motor Co and General Motors Corp sell cars in Europe, where about 10.5 mln new cars a year are bought. GM also makes a few thousand in North American plants for European export.""",
             ]

## 4. Download the lemma reference file (you may also use a pretrained lemmatization model)

In [5]:
# Download the lemma dictionary file
!wget https://raw.githubusercontent.com/mahavivo/vocabulary/master/lemmas/AntBNC_lemmas_ver_001.txt

--2020-08-10 16:44:38--  https://raw.githubusercontent.com/mahavivo/vocabulary/master/lemmas/AntBNC_lemmas_ver_001.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1348552 (1.3M) [text/plain]
Saving to: ‘AntBNC_lemmas_ver_001.txt’


2020-08-10 16:44:38 (11.4 MB/s) - ‘AntBNC_lemmas_ver_001.txt’ saved [1348552/1348552]



## 5. Define Spark NLP pipeline

In [6]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentences')

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized")\
    .setLowercase(True)\
    .setCleanupPatterns([r"[^\w\d\s]"])  # raw string: remove everything except word characters, digits and whitespace

stopwords_cleaner = StopWordsCleaner()\
    .setInputCols(["token"])\
    .setOutputCol("removed_stopwords")\
    .setCaseSensitive(False)

stemmer = Stemmer() \
    .setInputCols(["token"]) \
    .setOutputCol("stem")


lemmatizer = Lemmatizer() \
    .setInputCols(["token"]) \
    .setOutputCol("lemma") \
    .setDictionary("./AntBNC_lemmas_ver_001.txt", value_delimiter="\t", key_delimiter="->")

nlpPipeline = Pipeline(stages=[documentAssembler,
                               sentenceDetector,
                               tokenizer,
                               normalizer,
                               stopwords_cleaner,
                               stemmer,
                               lemmatizer,
                               ])


## 6. Run pipeline

In [7]:
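# Fit on an empty DataFrame: no stage in this pipeline learns from the input text
empty_df = spark.createDataFrame([['']]).toDF("text")

pipelineModel = nlpPipeline.fit(empty_df)

# Wrap the sample text in a Spark DataFrame and run it through the fitted pipeline
df = spark.createDataFrame(pd.DataFrame({'text':text_list}))
result = pipelineModel.transform(df)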

## 7. Visualize Results

In [8]:
# sentences in the text
result.select(F.explode(F.arrays_zip('sentences.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("sentences")).show(truncate=False)


+----------------------------------------------------------------------------------------------------------------+
|sentences                                                                                                       |
+----------------------------------------------------------------------------------------------------------------+
|The Geneva Motor Show, the first major car show of the year, opens tomorrow with U.S.                           |
|Car makers hoping to make new inroads into European markets due to the cheap dollar, automobile executives said.|
|Ford Motor Co and General Motors Corp sell cars in Europe, where about 10.5 mln new cars a year are bought.     |
|GM also makes a few thousand in North American plants for European export.                                      |
+----------------------------------------------------------------------------------------------------------------+
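
The first row ends at "U.S." because an abbreviation followed by a capitalized word looks like a sentence boundary. The sketch below is a deliberately naive, pure-Python splitter and not Spark NLP's `SentenceDetector` algorithm; the regex and the name `SENTENCE_BOUNDARY` are assumptions made for illustration. It only shows why this kind of boundary is ambiguous for rule-based splitting.

```python
import re

# Sample text from section 3.
text = (
    "The Geneva Motor Show, the first major car show of the year, opens tomorrow "
    "with U.S. Car makers hoping to make new inroads into European markets due to "
    "the cheap dollar, automobile executives said. Ford Motor Co and General Motors "
    "Corp sell cars in Europe, where about 10.5 mln new cars a year are bought. "
    "GM also makes a few thousand in North American plants for European export."
)

# Naive rule (an assumption for illustration): split on whitespace that follows
# '.', '!' or '?' and precedes an uppercase letter.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

sentences = SENTENCE_BOUNDARY.split(text)
for sentence in sentences:
    print(sentence)
```

With this rule, "10.5" stays intact because no whitespace follows the period, while "U.S. Car" is split because it matches the boundary pattern.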



In [9]:
# tokens in the text
result.select(F.explode(F.arrays_zip('token.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("tokens")).show(truncate=False)

+--------+
|tokens  |
+--------+
|The     |
|Geneva  |
|Motor   |
|Show    |
|,       |
|the     |
|first   |
|major   |
|car     |
|show    |
|of      |
|the     |
|year    |
|,       |
|opens   |
|tomorrow|
|with    |
|U.S     |
|.       |
|Car     |
+--------+
only showing top 20 rows



In [10]:
# normalized tokens (lowercased, punctuation removed)
result.select(F.explode(F.arrays_zip('normalized.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("normalized_tokens")).show(truncate=False)

+-----------------+
|normalized_tokens|
+-----------------+
|the              |
|geneva           |
|motor            |
|show             |
|the              |
|first            |
|major            |
|car              |
|show             |
|of               |
|the              |
|year             |
|opens            |
|tomorrow         |
|with             |
|us               |
|car              |
|makers           |
|hoping           |
|to               |
+-----------------+
only showing top 20 rows
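
To make the effect of the cleanup pattern concrete, here is a minimal pure-Python sketch that applies the same regular expression to a handful of tokens taken from the output above. It approximates what the `normalized` column shows (characters removed, text lowercased, emptied tokens dropped) and is not Spark NLP's implementation; the helper name `normalize` is hypothetical.

```python
import re

# Same pattern as passed to setCleanupPatterns: anything that is not a
# word character, digit or whitespace.
CLEANUP_PATTERN = re.compile(r"[^\w\d\s]")

def normalize(tokens, lowercase=True):
    """Strip matching characters, optionally lowercase, and drop empty tokens."""
    cleaned = []
    for tok in tokens:
        tok = CLEANUP_PATTERN.sub("", tok)
        if lowercase:
            tok = tok.lower()
        if tok:  # pure punctuation such as "," or "." becomes empty and is dropped
            cleaned.append(tok)
    return cleaned

tokens = ["The", "Geneva", "Motor", "Show", ",", "opens", "with", "U.S", ".", "Car"]
normalized = normalize(tokens)
print(normalized)
# ['the', 'geneva', 'motor', 'show', 'opens', 'with', 'us', 'car']
```

Note that "U.S" becomes "us", which matches the table above and shows how punctuation removal can merge an abbreviation into an ordinary word.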



In [11]:
# stemmed tokens
result.select(F.explode(F.arrays_zip('stem.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("token_stems")).show(truncate=False)

+-----------+
|token_stems|
+-----------+
|the        |
|geneva     |
|motor      |
|show       |
|,          |
|the        |
|first      |
|major      |
|car        |
|show       |
|of         |
|the        |
|year       |
|,          |
|open       |
|tomorrow   |
|with       |
|u.         |
|.          |
|car        |
+-----------+
only showing top 20 rows



In [12]:
# tokens remaining after stop-word removal
result.select(F.explode(F.arrays_zip('removed_stopwords.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("removed_stopwords")).show(truncate=False)

+-----------------+
|removed_stopwords|
+-----------------+
|Geneva           |
|Motor            |
|Show             |
|,                |
|first            |
|major            |
|car              |
|show             |
|year             |
|,                |
|opens            |
|tomorrow         |
|U.S              |
|.                |
|Car              |
|makers           |
|hoping           |
|make             |
|new              |
|inroads          |
+-----------------+
only showing top 20 rows
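
The effect of `setCaseSensitive(False)` can be sketched in plain Python. The stop-word set below is a small hypothetical subset chosen for illustration, not the list that `StopWordsCleaner` ships with, and `remove_stopwords` is a hypothetical helper rather than a Spark NLP function.

```python
# Hypothetical stop-word subset, for illustration only.
STOP_WORDS = {"the", "of", "with", "to", "a", "and", "in"}

def remove_stopwords(tokens, stop_words, case_sensitive=False):
    """Keep tokens that are not stop words; compare in lowercase unless case_sensitive."""
    if case_sensitive:
        return [t for t in tokens if t not in stop_words]
    lowered = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in lowered]

tokens = ["The", "Geneva", "Motor", "Show", ",", "the", "first",
          "major", "car", "show", "of", "the", "year"]

kept = remove_stopwords(tokens, STOP_WORDS)
kept_case_sensitive = remove_stopwords(tokens, STOP_WORDS, case_sensitive=True)

print(kept)                 # "The" and "the" are both removed
print(kept_case_sensitive)  # "The" survives because only "the" is in the set
```

Because the cleaner reads the `token` column rather than `normalized`, punctuation tokens such as "," remain in the result, as the table above also shows.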



In [13]:
# lemmatization
result.select(F.explode(F.arrays_zip('lemma.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("lemma")).show(truncate=False)

+--------+
|lemma   |
+--------+
|The     |
|Geneva  |
|Motor   |
|Show    |
|,       |
|the     |
|first   |
|major   |
|car     |
|show    |
|of      |
|the     |
|year    |
|,       |
|open    |
|tomorrow|
|with    |
|U.S     |
|.       |
|Car     |
+--------+
only showing top 20 rows
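
The `key_delimiter="->"` and `value_delimiter="\t"` arguments passed to `setDictionary` describe a file where each line holds a lemma, then `->`, then its inflected forms separated by tabs. The pure-Python sketch below parses lines of that shape into a lookup table. The sample lines are hypothetical and not copied from `AntBNC_lemmas_ver_001.txt`, the helper names are made up for illustration, and the pass-through of unknown tokens is assumed from the output above, where "opens" becomes "open" while "Geneva" is left unchanged.

```python
def parse_lemma_lines(lines, key_delimiter="->", value_delimiter="\t"):
    """Build an {inflected_form: lemma} lookup from 'lemma -> form<TAB>form' lines."""
    lookup = {}
    for line in lines:
        line = line.strip()
        if not line or key_delimiter not in line:
            continue  # skip blank or malformed lines
        lemma, forms = line.split(key_delimiter, 1)
        lemma = lemma.strip()
        for form in forms.strip().split(value_delimiter):
            form = form.strip()
            if form:
                lookup[form] = lemma
    return lookup

def lemmatize(token, lookup):
    """Return the lemma for a known form; leave unknown tokens unchanged."""
    return lookup.get(token, token)

# Hypothetical sample lines in the assumed layout.
sample_lines = [
    "open -> opens\topened\topening",
    "make -> makes\tmade\tmaking",
]
lemma_lookup = parse_lemma_lines(sample_lines)

lemmas = [lemmatize(t, lemma_lookup) for t in ["opens", "tomorrow", "made"]]
print(lemmas)
# ['open', 'tomorrow', 'make']
```

A dictionary lookup of this kind maps each surface form to a real word, whereas the stemmer output earlier produces truncated forms such as "u." for "U.S".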

