# NLP tools:

# Tokenization with regex

Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words).

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('MLlibnlp').getOrCreate()

In [2]:
from pyspark.ml.feature import RegexTokenizer

### A couple of regex patterns in Python:

\w  "word character": Unicode letter, ideogram, digit(0-9), or underscore          (e.g., \w-\w\w\w  matches A-3_b) 

\s  "whitespace character": any Unicode separator (space, tab, newline,...)   (e.g., a\sb\sc  matches a b c) 

\d   one Unicode "digit" in any script                                          (e.g., file_\d\d  matches file_9੩) 

\  Escapes a special character     

\*  zero or more times

\+  one or more times

\?  zero or one time


**NB.** Regex in PySpark internally uses **Java regex**. Since the backslash is also the escape character inside string literals, we need to escape it with **another backslash** (e.g., "\\\s" instead of "\s"), or write the pattern as a Python raw string.
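The patterns listed above can be tried out with Python's built-in `re` module. This is only a quick sketch for building intuition: it runs in Python's regex engine, and Java's engine (the one Spark uses) may differ in details such as Unicode handling. The variable names are made up for this illustration.

```python
import re

# Character classes, using the examples from the list above.
word_match = re.fullmatch(r"\w-\w\w\w", "A-3_b")
space_match = re.fullmatch(r"a\sb\sc", "a b c")
digit_match = re.fullmatch(r"file_\d\d", "file_9੩")

# Quantifiers: * (zero or more), + (one or more), ? (zero or one).
zero_or_more = re.fullmatch(r"ab*", "a")    # matches: 'b' may be absent
one_or_more = re.fullmatch(r"ab+", "a")     # None: at least one 'b' is required
zero_or_one = re.fullmatch(r"ab?", "abb")   # None: at most one 'b' is allowed

# A backslash escapes a special character: \* is a literal asterisk.
literal_stars = re.findall(r"\*", "a*b*c")

# A doubled backslash in a normal string equals a single one in a raw string.
same_pattern = "\\s+" == r"\s+"

print(bool(word_match), bool(space_match), bool(digit_match))
print(bool(zero_or_more), bool(one_or_more), bool(zero_or_one))
print(literal_stars, same_pattern)
```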

### RegexTokenizer:
RegexTokenizer allows more advanced tokenization based on regular expression (regex) matching. By default, the parameter “pattern” (regex) is "\\\s+", and it is used as a delimiter to split the input text.

Alternatively, users can set parameter “gaps” to false indicating the regex “pattern” denotes “tokens” rather than splitting gaps, and find all matching occurrences as the tokenization result.

Check the regex in use here: https://regex101.com/r/2lk6eV/3
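The difference between the two modes of “gaps” can be sketched in plain Python. The helper `regex_tokenize` below is hypothetical and is not Spark's implementation; it only assumes that splitting on the pattern corresponds to gaps=True and collecting the matches corresponds to gaps=False, with lowercasing applied as in the outputs shown in this notebook.

```python
import re

def regex_tokenize(text, pattern=r"\s+", gaps=True, to_lowercase=True):
    """Hypothetical stand-in for the idea behind RegexTokenizer's `gaps` switch.

    Assumes the pattern contains no capturing groups.
    """
    if to_lowercase:
        text = text.lower()
    if gaps:
        # The pattern describes the separators: split on it.
        tokens = re.split(pattern, text)
    else:
        # The pattern describes the tokens: collect every match.
        tokens = [m.group(0) for m in re.finditer(pattern, text)]
    # Drop empty strings that splitting can produce at the edges.
    return [t for t in tokens if t]

sentence = "Logistic,regression,models,are,neat"
split_on_whitespace = regex_tokenize(sentence)                        # one single token
match_words = regex_tokenize(sentence, pattern=r"\w+", gaps=False)    # five tokens
print(split_on_whitespace)
print(match_words)
```

Splitting on whitespace leaves the comma-separated sentence as one token, which is why the example in this notebook matches word characters instead.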

In [3]:
sentenceDataFrame = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (1, "I wish not Java could use case classes"),
    (2, "Logistic,regression,models,are,neat")
], ["id", "sentence"])

sentenceDataFrame.show()

+---+--------------------+
| id|            sentence|
+---+--------------------+
|  0|Hi I heard about ...|
|  1|I wish not Java c...|
|  2|Logistic,regressi...|
+---+--------------------+



Example: **bag of words** extraction using RegexTokenizer:

In [4]:
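# gaps=False: the pattern describes the tokens themselves (runs of word characters), not the separators
regexTokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="\\w+", gaps=False)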

In [5]:
df_tokenized = regexTokenizer.transform(sentenceDataFrame)
df_tokenized.show(truncate=False)

+---+--------------------------------------+-----------------------------------------------+
|id |sentence                              |words                                          |
+---+--------------------------------------+-----------------------------------------------+
|0  |Hi I heard about Spark                |[hi, i, heard, about, spark]                   |
|1  |I wish not Java could use case classes|[i, wish, not, java, could, use, case, classes]|
|2  |Logistic,regression,models,are,neat   |[logistic, regression, models, are, neat]      |
+---+--------------------------------------+-----------------------------------------------+



# StopWordsRemover
Stop words are words that should be excluded from the input, typically because they appear frequently and **don’t carry as much meaning**. Examples are 'the', 'a', 'I', 'had', 'is', ...

In [6]:
from pyspark.ml.feature import StopWordsRemover
remover = StopWordsRemover(inputCol='words', outputCol='removed')

In [7]:
remover.transform(df_tokenized).show()

+---+--------------------+--------------------+--------------------+
| id|            sentence|               words|             removed|
+---+--------------------+--------------------+--------------------+
|  0|Hi I heard about ...|[hi, i, heard, ab...|  [hi, heard, spark]|
|  1|I wish not Java c...|[i, wish, not, ja...|[wish, java, use,...|
|  2|Logistic,regressi...|[logistic, regres...|[logistic, regres...|
+---+--------------------+--------------------+--------------------+



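**NB.** 'not' has been removed too! If negations matter for your task, you can pass your own list through the stopWords parameter of StopWordsRemover.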
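The filtering step itself is simple, and can be sketched in plain Python. The stop-word set `DEMO_STOP_WORDS` and the helper `remove_stop_words` below are hypothetical: the set is a tiny hand-picked list for this illustration, not Spark's default English list.

```python
# Hypothetical, hand-picked stop words (NOT Spark's default list).
DEMO_STOP_WORDS = {"i", "about", "not", "could", "are"}

def remove_stop_words(tokens, stop_words=DEMO_STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["i", "wish", "not", "java", "could", "use", "case", "classes"]

# With the full list, the negation disappears.
default_result = remove_stop_words(tokens)

# Removing 'not' from the stop-word list keeps the negation in the output.
keep_negation = remove_stop_words(tokens, DEMO_STOP_WORDS - {"not"})

print(default_result)
print(keep_negation)
```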

# n-gram

An n-gram is a sequence of n tokens (typically words) for some integer n. The NGram class can be used to transform input features into n-grams.

In [8]:
from pyspark.ml.feature import NGram

In [9]:
ngram = NGram(n=2, inputCol='words', outputCol='n-gram')
df_ngram = ngram.transform(df_tokenized)
df_ngram.select('n-gram').show(truncate=False)

+---------------------------------------------------------------------------+
|n-gram                                                                     |
+---------------------------------------------------------------------------+
|[hi i, i heard, heard about, about spark]                                  |
|[i wish, wish not, not java, java could, could use, use case, case classes]|
|[logistic regression, regression models, models are, are neat]             |
+---------------------------------------------------------------------------+
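The bigrams in the output above come from sliding a window of n tokens over each word list and joining the tokens with a space. A minimal plain-Python sketch of that idea follows; the helper name `ngrams` is made up for this illustration and is not part of PySpark.

```python
def ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens, each joined by a single space.

    If there are fewer than n tokens, the result is an empty list.
    """
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = ["hi", "i", "heard", "about", "spark"]
bigrams = ngrams(words, n=2)
trigrams = ngrams(words, n=3)
too_short = ngrams(["hi"], n=2)

print(bigrams)
print(trigrams)
print(too_short)
```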



# TF-IDF

TF-IDF (term frequency-inverse document frequency) is a feature vectorization method that reflects how important a term is to a document in a corpus.

TF-IDF in Spark has **two phases**. First, we compute term frequency (TF) vectors by calling **HashingTF** or **CountVectorizer**. Then, we fit an **IDF** model and use it to rescale those vectors.

In our example, we started with a set of sentences (sentenceDataFrame). We split each sentence into words using RegexTokenizer (df_tokenized). For each sentence (**bag of words**), we use HashingTF to hash the sentence into a feature vector. We use IDF to rescale the feature vectors; this generally improves performance when using text as features. Our feature vectors could then be passed to a learning algorithm.
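To see where weights such as 0.6931 and 0.2877 in the rescaled output come from, here is a small plain-Python sketch. It assumes the smoothed formula given in the Spark MLlib documentation, IDF(t) = log((m + 1) / (DF(t) + 1)), where m is the number of documents and DF(t) is the number of documents containing term t. The helpers `idf_weights` and `tf_idf` are hypothetical and work on words directly rather than on hashed indices.

```python
import math
from collections import Counter

# The three tokenized sentences from this notebook.
docs = [
    ["hi", "i", "heard", "about", "spark"],
    ["i", "wish", "not", "java", "could", "use", "case", "classes"],
    ["logistic", "regression", "models", "are", "neat"],
]

def idf_weights(documents):
    """Smoothed IDF per term: log((m + 1) / (df + 1)) (assumed formula)."""
    m = len(documents)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in documents for term in set(doc))
    return {term: math.log((m + 1) / (count + 1)) for term, count in df.items()}

def tf_idf(doc, idf):
    """Multiply each term's raw count in the document by its IDF weight."""
    tf = Counter(doc)
    return {term: tf[term] * idf[term] for term in tf}

idf = idf_weights(docs)
first_doc = tf_idf(docs[0], idf)

# 'i' occurs in 2 of 3 documents: log(4/3), about 0.2877.
# 'spark' occurs in 1 of 3 documents: log(4/2), about 0.6931.
print(round(first_doc["i"], 4), round(first_doc["spark"], 4))
```

A term that appears in more documents gets a smaller weight, which is why the shared word 'i' ends up with the lowest value in each vector.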


In [10]:
from pyspark.ml.feature import HashingTF, IDF

In [11]:
# get term frequency 
hashing_tf = HashingTF(inputCol='words', outputCol='rawFeatures')  # numFeatures=262144 (default)

In [12]:
featurized_data = hashing_tf.transform(df_tokenized)

In [13]:
featurized_data.select('id','rawFeatures').show(truncate=False)

+---+-----------------------------------------------------------------------------------------------+
|id |rawFeatures                                                                                    |
+---+-----------------------------------------------------------------------------------------------+
|0  |(262144,[18700,19036,33808,66273,173558],[1.0,1.0,1.0,1.0,1.0])                                |
|1  |(262144,[19036,20719,55551,58672,98717,109547,192310,221693],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])|
|2  |(262144,[46243,58267,91006,160975,190884],[1.0,1.0,1.0,1.0,1.0])                               |
+---+-----------------------------------------------------------------------------------------------+



In [14]:
idf = IDF(inputCol='rawFeatures', outputCol='features')
idf_model = idf.fit(featurized_data)

In [15]:
rescaled_data = idf_model.transform(featurized_data)

In [16]:
rescaled_data.select('id','features').show(truncate=False)

+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |features                                                                                                                                                                                                                |
+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0  |(262144,[18700,19036,33808,66273,173558],[0.6931471805599453,0.28768207245178085,0.6931471805599453,0.6931471805599453,0.6931471805599453])                                                                             |
|1  |(262144,[19036,20719,55551,58672,98717,109547,192310,221693],[0.28768207245178085,0.6931471805599453,0.

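**NB.** In HashingTF, different terms can be hashed to the same index (a **collision**). To reduce the chance of a collision, we can increase the target feature dimension, i.e. the **number of buckets** of the hash table (the numFeatures parameter).

numFeatures=262144 (default), i.e. 2^18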
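The link between the bucket count and collisions can be sketched with a toy version of the hashing trick. The helper `hashing_tf` below is hypothetical and uses CRC32 purely because it is deterministic and available in the standard library; Spark uses a different hash function, so the indices will not match the ones printed in this notebook.

```python
import zlib
from collections import Counter

def hashing_tf(tokens, num_features):
    """Toy hashing trick: map each token to a bucket index and count occurrences.

    Uses CRC32 as a stand-in hash (an assumption for this sketch, not Spark's hash).
    """
    counts = Counter()
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)

tokens = ["i", "wish", "not", "java", "could", "use", "case", "classes"]

# 8 distinct words into 4 buckets: some words are forced to share a bucket.
small = hashing_tf(tokens, num_features=4)

# Same words with the default-sized table: far more room for each word.
large = hashing_tf(tokens, num_features=262144)

print(len(small), "occupied buckets out of 4")
print(len(large), "occupied buckets out of 262144")
```

With only 4 buckets, the 8 distinct words cannot all get their own index, so counts of unrelated words are merged. A larger table makes that much less likely, at the cost of a longer (but still sparse) vector.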