## Examples of NLP:
* Clustering News Articles
* Suggesting Similar Books
* Grouping Legal Documents
* Analyzing Consumer Feedback
* Spam Email Detection

## Process of NLP:
1. Compile all documents (corpus)
2. Featurize the words into numeric vectors
3. Compare features of documents

### TF-IDF Method: Term Frequency-Inverse Document Frequency
Bag of Words: Each document is turned into a vector of word counts, with one dimension per distinct word in the corpus. Once documents are vectors, you can compare them with cosine similarity.
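
As a rough illustration of that comparison, here is a minimal plain-Python sketch (not the Spark API). The three toy documents and the helper names `bag_of_words` and `cosine_similarity` are made up for this example.

```python
import math
from collections import Counter

def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary term appears in the token list."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical toy corpus, already tokenized on whitespace.
docs = [
    "the red house".split(),
    "the blue house".split(),
    "green green grass".split(),
]

# One dimension per distinct word in the corpus.
vocabulary = sorted({token for doc in docs for token in doc})
vectors = [bag_of_words(doc, vocabulary) for doc in docs]

sim_01 = cosine_similarity(vectors[0], vectors[1])  # share "the" and "house"
sim_02 = cosine_similarity(vectors[0], vectors[2])  # share no words
print(sim_01, sim_02)
```

Documents that share words score closer to 1, and documents with no words in common score 0.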

You can then improve on Bag of Words by adjusting word counts based on their frequency in the corpus (the group of all the documents).

### Term Frequency: Importance of the term within that document
TF(x, y) = Number of occurrences of term x in document y.

### Inverse Document Frequency: Importance of the term in the corpus
IDF(x) = log(N / df_x) where...

* N = total number of documents
* df_x = number of documents containing term x
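
To make the two definitions concrete, here is a minimal plain-Python sketch that applies them literally. The toy corpus and the function names are assumptions for illustration. Library implementations often smooth the ratio inside the logarithm, so their numbers may differ slightly from this unsmoothed version.

```python
import math

def term_frequency(term, document):
    """TF(x, y): number of occurrences of term x in document y."""
    return document.count(term)

def inverse_document_frequency(term, corpus):
    """IDF(x) = log(N / df_x), with N documents and df_x containing the term."""
    n_docs = len(corpus)
    doc_freq = sum(1 for doc in corpus if term in doc)
    if doc_freq == 0:
        raise ValueError("term %r does not occur in the corpus" % term)
    return math.log(n_docs / doc_freq)

def tf_idf(term, document, corpus):
    """Weight a term's count by how rare the term is across the corpus."""
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)

# Hypothetical toy corpus of tokenized documents.
corpus = [
    ["spark", "is", "fast"],
    ["spark", "is", "neat"],
    ["java", "is", "verbose"],
]

score_is = tf_idf("is", corpus[0], corpus)        # in every document: log(3/3)
score_spark = tf_idf("spark", corpus[0], corpus)  # in two documents: log(3/2)
score_java = tf_idf("java", corpus[2], corpus)    # in one document: log(3/1)
print(score_is, score_spark, score_java)
```

A word that appears in every document gets a weight of 0, while a word confined to a single document gets the largest weight.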

In [1]:
# Boilerplate
import findspark
import numpy as np
findspark.init('/home/ubuntu/spark-2.1.1-bin-hadoop2.7')
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('nlp').getOrCreate()

Tokenization: Taking a phrase and breaking it up into specific tokens.

In [3]:
from pyspark.ml.feature import Tokenizer, RegexTokenizer

In [4]:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

In [6]:
sen_df = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (1, "I wish java could use case classes"),
    (2, "Logistic,regression,models,are,neat")],
    ['id','sentence'])

In [7]:
sen_df.show()

+---+--------------------+
| id|            sentence|
+---+--------------------+
|  0|Hi I heard about ...|
|  1|I wish java could...|
|  2|Logistic,regressi...|
+---+--------------------+



In [8]:
tokenizer = Tokenizer(inputCol='sentence',outputCol='words')

In [10]:
regex_tokenizer = RegexTokenizer(inputCol='sentence',outputCol='words',
                                pattern='\\W')

In [11]:
count_tokens = udf(lambda words:len(words),IntegerType())

In [12]:
tokenized = tokenizer.transform(sen_df)

In [13]:
# Boom, we applied tokenizer.
tokenized.show()

+---+--------------------+--------------------+
| id|            sentence|               words|
+---+--------------------+--------------------+
|  0|Hi I heard about ...|[hi, i, heard, ab...|
|  1|I wish java could...|[i, wish, java, c...|
|  2|Logistic,regressi...|[logistic,regress...|
+---+--------------------+--------------------+



In [14]:
tokenized.withColumn('tokens',count_tokens(col('words'))).show()

+---+--------------------+--------------------+------+
| id|            sentence|               words|tokens|
+---+--------------------+--------------------+------+
|  0|Hi I heard about ...|[hi, i, heard, ab...|     5|
|  1|I wish java could...|[i, wish, java, c...|     7|
|  2|Logistic,regressi...|[logistic,regress...|     1|
+---+--------------------+--------------------+------+



Note that id 2 comes out as a single token. The default Tokenizer splits only on whitespace, and that sentence is separated by commas, not spaces.

In [15]:
rg_tokenized = regex_tokenizer.transform(sen_df)

In [17]:
rg_tokenized.withColumn('tokens',count_tokens(col('words'))).show()

+---+--------------------+--------------------+------+
| id|            sentence|               words|tokens|
+---+--------------------+--------------------+------+
|  0|Hi I heard about ...|[hi, i, heard, ab...|     5|
|  1|I wish java could...|[i, wish, java, c...|     7|
|  2|Logistic,regressi...|[logistic, regres...|     5|
+---+--------------------+--------------------+------+



Now everything is being split properly, because the regular expression `\\W` splits on any non-word character, including commas.
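
The difference between the two tokenizers can be roughly imitated in plain Python. This is only a sketch of the splitting behavior, not Spark's implementation, and `whitespace_tokenize` and `regex_tokenize` are hypothetical helper names.

```python
import re

def whitespace_tokenize(sentence):
    """Lowercase, then split on runs of whitespace only."""
    return sentence.lower().split()

def regex_tokenize(sentence, pattern=r"\W"):
    """Lowercase, split wherever the pattern matches, and drop empty tokens."""
    return [tok for tok in re.split(pattern, sentence.lower()) if tok]

sentences = [
    "Hi I heard about Spark",
    "I wish java could use case classes",
    "Logistic,regression,models,are,neat",
]

whitespace_counts = [len(whitespace_tokenize(s)) for s in sentences]
regex_counts = [len(regex_tokenize(s)) for s in sentences]
print(whitespace_counts)
print(regex_counts)
```

The comma-separated sentence is one token under whitespace splitting and five tokens under the non-word-character pattern, matching the `tokens` columns in the tables above.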

### Stop Words Removal

In [19]:
from pyspark.ml.feature import StopWordsRemover

In [20]:
sentenceDataFrame = spark.createDataFrame([
    (0,['I', 'saw', 'the', 'green', 'horse']),
    (1,['Mary', 'had', 'a', 'little', 'lamb'])],
    ['id','tokens'])

In [21]:
sentenceDataFrame.show()

+---+--------------------+
| id|              tokens|
+---+--------------------+
|  0|[I, saw, the, gre...|
|  1|[Mary, had, a, li...|
+---+--------------------+



In [22]:
remover = StopWordsRemover(inputCol='tokens',outputCol='filtered')

In [23]:
# Now we've filtered out the stop words
remover.transform(sentenceDataFrame).show()

+---+--------------------+--------------------+
| id|              tokens|            filtered|
+---+--------------------+--------------------+
|  0|[I, saw, the, gre...| [saw, green, horse]|
|  1|[Mary, had, a, li...|[Mary, little, lamb]|
+---+--------------------+--------------------+
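
Conceptually, stop word removal is a filter against a list of common words. The plain-Python sketch below shows the idea. The tiny `STOP_WORDS` set is a made-up stand-in for a real stop word list, and the case-insensitive match is an assumption chosen to reproduce the output above, where the capitalized `I` is dropped.

```python
# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"i", "the", "had", "a"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep tokens whose lowercase form is not in the stop word set."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

filtered_0 = remove_stop_words(["I", "saw", "the", "green", "horse"])
filtered_1 = remove_stop_words(["Mary", "had", "a", "little", "lamb"])
print(filtered_0)
print(filtered_1)
```

The surviving tokens keep their original capitalization. Only the comparison against the stop word list ignores case.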



### N-grams

In [24]:
from pyspark.ml.feature import NGram

In [25]:
wordDataFrame = spark.createDataFrame([
    (0, ["Hi", "I", "heard", "about", "Spark"]),
    (1, ["I", "wish", "Java", "could", "use", "case", "classes"]),
    (2, ["Logistic", "regression", "models", "are", "neat"])],
    ['id','words'])

In [26]:
ngram = NGram(n=2, inputCol='words',outputCol='grams')

In [27]:
ngram.transform(wordDataFrame).show()

+---+--------------------+--------------------+
| id|               words|               grams|
+---+--------------------+--------------------+
|  0|[Hi, I, heard, ab...|[Hi I, I heard, h...|
|  1|[I, wish, Java, c...|[I wish, wish Jav...|
|  2|[Logistic, regres...|[Logistic regress...|
+---+--------------------+--------------------+



In [29]:
ngram.transform(wordDataFrame).select('grams').show(truncate=False)
# With n=2, NGram produces pairs of consecutive words (bigrams).
# This can be useful when the relationship between adjacent words matters.

+------------------------------------------------------------------+
|grams                                                             |
+------------------------------------------------------------------+
|[Hi I, I heard, heard about, about Spark]                         |
|[I wish, wish Java, Java could, could use, use case, case classes]|
|[Logistic regression, regression models, models are, are neat]    |
+------------------------------------------------------------------+
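
The sliding window behind these results can be sketched in a few lines of plain Python. The `ngrams` helper is a hypothetical name for this illustration, not part of Spark.

```python
def ngrams(tokens, n=2):
    """Join every run of n consecutive tokens into one space-separated string."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # An input shorter than n yields an empty list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(["Hi", "I", "heard", "about", "Spark"], n=2)
trigrams = ngrams(["Logistic", "regression", "models", "are", "neat"], n=3)
print(bigrams)
print(trigrams)
```

A sentence of k tokens produces k - n + 1 n-grams, which is why the five-word sentences above each yield four bigrams.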

