# Tools for NLP

Text data needs several feature transformations before machine learning algorithms can work with it. Luckily, Spark provides the most important ones as convenient feature transformer classes in `pyspark.ml.feature`.

Let's go over them before jumping into the project.

In [1]:
import findspark
findspark.init('/home/gkouskosv/spark-2.4.5-bin-hadoop2.6/')

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('nlp_tools').getOrCreate()

## Tokenizer
<p><a href="http://en.wikipedia.org/wiki/Lexical_analysis#Tokenization">Tokenization</a> is the process of taking text (such as a sentence) and breaking it into individual terms (usually words).  A simple <a href="api/scala/index.html#org.apache.spark.ml.feature.Tokenizer">Tokenizer</a> class provides this functionality.  The example below shows how to split sentences into sequences of words.</p>

<p><a href="api/scala/index.html#org.apache.spark.ml.feature.RegexTokenizer">RegexTokenizer</a> allows more
 advanced tokenization based on regular expression (regex) matching.
 By default, the parameter &#8220;pattern&#8221; (regex, default: <code>"\\s+"</code>) is used as delimiters to split the input text.
 Alternatively, users can set parameter &#8220;gaps&#8221; to false indicating the regex &#8220;pattern&#8221; denotes
 &#8220;tokens&#8221; rather than splitting gaps, and find all matching occurrences as the tokenization result.</p>
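To make the difference between the two modes concrete, here is a minimal plain-Python sketch built on the standard `re` module. It is not Spark's implementation: the function `regex_tokenize` and its argument names are hypothetical, chosen to mirror the `pattern`, `gaps`, `minTokenLength`, and `toLowercase` parameters, and it assumes patterns without capture groups.

```python
import re

def regex_tokenize(text, pattern=r"\s+", gaps=True, min_token_length=1, lowercase=True):
    """Plain-Python sketch of the two tokenization modes (hypothetical helper, not Spark code)."""
    if lowercase:
        text = text.lower()
    if gaps:
        # The pattern describes the separators between tokens.
        tokens = re.split(pattern, text)
    else:
        # The pattern describes the tokens themselves.
        tokens = re.findall(pattern, text)
    # Drop empty strings and any token shorter than the minimum length.
    return [t for t in tokens if len(t) >= min_token_length]

sentence = "Logistic,regression,models,are,neat"
print(regex_tokenize(sentence))                              # whitespace as gaps
print(regex_tokenize(sentence, pattern=r"\W"))               # non-word characters as gaps
print(regex_tokenize(sentence, pattern=r"\w+", gaps=False))  # pattern matches the tokens
```

With the default whitespace pattern the comma-separated sentence stays a single token, while splitting on `\W` or matching `\w+` yields the five individual words.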

In [3]:
from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

In [12]:
sentenceDF = spark.createDataFrame([
    (0, 'Hi I heard about Spark'),
    (1, 'I wish Java could use case classes'),
    (2, 'Logistic,regression,models,are,neat')], ['id', 'sentence'])

In [13]:
sentenceDF.show()

+---+--------------------+
| id|            sentence|
+---+--------------------+
|  0|Hi I heard about ...|
|  1|I wish Java could...|
|  2|Logistic,regressi...|
+---+--------------------+



In [14]:
tokenizer = Tokenizer(inputCol='sentence', outputCol='words')

regexTokenizer = RegexTokenizer(inputCol='sentence', outputCol='words',
                                pattern='\\W')
# alternatively, match the tokens themselves: pattern='\\w+', gaps=False

countTokens = udf(lambda words: len(words), returnType=IntegerType())

tokenized = tokenizer.transform(sentenceDF)

regexTokenized = regexTokenizer.transform(sentenceDF)

In [23]:
tokenized.select(['sentence','words'])\
        .withColumn('tokens', countTokens(col('words'))).show(truncate=False)

+-----------------------------------+------------------------------------------+------+
|sentence                           |words                                     |tokens|
+-----------------------------------+------------------------------------------+------+
|Hi I heard about Spark             |[hi, i, heard, about, spark]              |5     |
|I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7     |
|Logistic,regression,models,are,neat|[logistic,regression,models,are,neat]     |1     |
+-----------------------------------+------------------------------------------+------+



In [26]:
regexTokenized.select(['sentence','words']).withColumn('tokens', countTokens(col('words')))\
            .show(truncate=False)

+-----------------------------------+------------------------------------------+------+
|sentence                           |words                                     |tokens|
+-----------------------------------+------------------------------------------+------+
|Hi I heard about Spark             |[hi, i, heard, about, spark]              |5     |
|I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7     |
|Logistic,regression,models,are,neat|[logistic, regression, models, are, neat] |5     |
+-----------------------------------+------------------------------------------+------+




## Stop Words Removal

<p><a href="https://en.wikipedia.org/wiki/Stop_words">Stop words</a> are words which
should be excluded from the input, typically because the words appear
frequently and don&#8217;t carry as much meaning.</p>

<p><code>StopWordsRemover</code> takes as input a sequence of strings (e.g. the output
of a <a href="ml-features.html#tokenizer">Tokenizer</a>) and drops all the stop
words from the input sequences. The list of stopwords is specified by
the <code>stopWords</code> parameter. Default stop words for some languages are accessible 
by calling <code>StopWordsRemover.loadDefaultStopWords(language)</code>, for which available 
options are &#8220;danish&#8221;, &#8220;dutch&#8221;, &#8220;english&#8221;, &#8220;finnish&#8221;, &#8220;french&#8221;, &#8220;german&#8221;, &#8220;hungarian&#8221;, 
&#8220;italian&#8221;, &#8220;norwegian&#8221;, &#8220;portuguese&#8221;, &#8220;russian&#8221;, &#8220;spanish&#8221;, &#8220;swedish&#8221; and &#8220;turkish&#8221;. 
A boolean parameter <code>caseSensitive</code> indicates if the matches should be case sensitive 
(false by default).</p>
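The effect of the case sensitivity flag can be sketched in plain Python. This is an illustration under stated assumptions, not Spark's code: `remove_stop_words` is a hypothetical helper, and `toy_stop_words` is a made-up list far shorter than any of Spark's default lists.

```python
def remove_stop_words(words, stop_words, case_sensitive=False):
    """Plain-Python sketch of stop word filtering (hypothetical helper, not Spark code)."""
    if case_sensitive:
        # Exact match only: 'I' is kept unless 'I' itself is in the list.
        stop_set = set(stop_words)
        return [w for w in words if w not in stop_set]
    # Case-insensitive match: compare lowercased forms on both sides.
    stop_set = {s.lower() for s in stop_words}
    return [w for w in words if w.lower() not in stop_set]

# Hypothetical toy list for illustration only.
toy_stop_words = ["i", "the", "a", "had"]
words = "I saw the red balloon".split()

print(remove_stop_words(words, toy_stop_words))
print(remove_stop_words(words, toy_stop_words, case_sensitive=True))
```

In the case-insensitive call the capitalized `I` is dropped along with `the`; in the case-sensitive call `I` survives because only the lowercase `i` is in the toy list.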

In [27]:
from pyspark.ml.feature import StopWordsRemover

In [34]:
sentenceData = spark.createDataFrame([
    (0, 'I saw the red ballon'.split()),
    (1, 'Mary had a little lamb'.split())
], ['id', 'raw'])

In [35]:
sentenceData.show()

+---+--------------------+
| id|                 raw|
+---+--------------------+
|  0|[I, saw, the, red...|
|  1|[Mary, had, a, li...|
+---+--------------------+



In [36]:
remover = StopWordsRemover(inputCol='raw', outputCol='filtered')
remover.transform(sentenceData).show(truncate=False)

+---+----------------------------+--------------------+
|id |raw                         |filtered            |
+---+----------------------------+--------------------+
|0  |[I, saw, the, red, ballon]  |[saw, red, ballon]  |
|1  |[Mary, had, a, little, lamb]|[Mary, little, lamb]|
+---+----------------------------+--------------------+



## n-grams

An $n$-gram is a sequence of $n$ tokens (typically words) for some integer $n$. The `NGram` class can be used to transform input features into $n$-grams.

<p><code>NGram</code> takes as input a sequence of strings (e.g. the output of a <a href="ml-features.html#tokenizer">Tokenizer</a>).  The parameter <code>n</code> is used to determine the number of terms in each $n$-gram. The output will consist of a sequence of $n$-grams where each $n$-gram is represented by a space-delimited string of $n$ consecutive words.  If the input sequence contains fewer than <code>n</code> strings, no output is produced.</p>
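The sliding-window behavior described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Spark's implementation; the function name `ngrams` is hypothetical and `n` is assumed to be at least 1.

```python
def ngrams(words, n):
    """Plain-Python sketch: join every run of n consecutive words with a space."""
    # When len(words) < n the range is empty, so the result is an empty list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

words = ["hi", "i", "heard", "about", "spark"]
print(ngrams(words, 2))   # bigrams
print(ngrams(words, 3))   # trigrams
print(ngrams(["hi"], 2))  # fewer than n words
```

A sequence of five words gives four bigrams and three trigrams, and a sequence shorter than `n` gives an empty list in this sketch.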
