<a href="https://colab.research.google.com/github/Ricardo-Jaramillo/PySpark/blob/main/13_NLP_Basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing

Examples of NLP
* Clustering News Articles
* Suggesting similar books
* Grouping Legal Documents
* Analysing Consumer Feedback
* Spam Email Detection

Basic process of NLP:
* Compile all documents
* Featurize the words to numbers
* Compare features of documents
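
The three steps above can be sketched in plain Python. This is only an illustration: raw word counts stand in for the featurization step and cosine similarity for the comparison step. Both choices, the sample documents, and the helper names `featurize` and `cosine_similarity` are assumptions and are not taken from this notebook.

```python
import math
from collections import Counter

# Step 1: compile the documents (made-up sample sentences)
documents = [
    'spark makes big data simple',
    'big data needs spark',
    'mary had a little lamb',
]

# Step 2: featurize each document as a word -> count mapping
def featurize(text):
    return Counter(text.lower().split())

# Step 3: compare two feature mappings with cosine similarity
def cosine_similarity(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

features = [featurize(doc) for doc in documents]
sim_related = cosine_similarity(features[0], features[1])    # share 'spark', 'big', 'data'
sim_unrelated = cosine_similarity(features[0], features[2])  # no words in common
print(sim_related, sim_unrelated)
```

Documents that share vocabulary score closer to 1, and documents with no words in common score 0.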

A standard way to featurize words as numbers is through **TF-IDF** methods.

TF-IDF stands for *Term Frequency - Inverse Document Frequency*

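$$
W_{x,y} = tf_{x,y} \cdot \log\left(\frac{N}{df_x}\right)
$$

where:
* $tf_{x,y}$ = frequency of term $x$ in document $y$
* $df_x$ = number of documents containing term $x$
* $N$ = total number of documents

TF is the number of occurrences of a term in a document. IDF measures the importance of the term in the corpus: the fewer documents contain the term, the higher its weight.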
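
As a worked example of the formula above, here is a minimal sketch in plain Python. The corpus is made up, the helper names `tf`, `df`, and `tf_idf` are hypothetical, and the natural logarithm is assumed.

```python
import math

# Made-up corpus: each document is a list of tokens
corpus = [
    ['spark', 'is', 'fast'],
    ['spark', 'runs', 'on', 'java'],
    ['java', 'is', 'verbose', 'java'],
]

def tf(term, doc):
    # Number of times the term occurs in one document
    return doc.count(term)

def df(term, docs):
    # Number of documents that contain the term at least once
    return sum(1 for doc in docs if term in doc)

def tf_idf(term, doc, docs):
    # W = tf * log(N / df); only valid for terms that occur somewhere in docs
    return tf(term, doc) * math.log(len(docs) / df(term, docs))

w_java = tf_idf('java', corpus[2], corpus)  # tf = 2, df = 2 -> 2 * log(3 / 2)
w_fast = tf_idf('fast', corpus[0], corpus)  # tf = 1, df = 1 -> 1 * log(3)
print(w_java, w_fast)
```

A term that appears in every document gets a weight of 0, because log(N / N) = 0.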


Tokenizing splits each sentence into individual words (tokens). Featurizing then attaches a number to each word.


## Install pyspark

In [17]:
!pip install pyspark



In [68]:
# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import col, udf
from pyspark.ml.feature import NGram
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.ml.feature import HashingTF, IDF
from pyspark.ml.feature import CountVectorizer

In [19]:
# Create a session
spark = SparkSession.builder.appName('nlp').getOrCreate()

In [20]:
# Create our own dataframe
sentenceDataFrame_1 = spark.createDataFrame([
    (0, 'Hi I heard about Spark'),
    (1, 'I wish java could use case classes'),
    (2, 'Logistic,regression,models,are,neat'),
], ['id', 'sentence'])

In [21]:
# Show the dataframe
sentenceDataFrame_1.show()

+---+--------------------+
| id|            sentence|
+---+--------------------+
|  0|Hi I heard about ...|
|  1|I wish java could...|
|  2|Logistic,regressi...|
+---+--------------------+



## Create tokenizer objects
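We'll use the standard `Tokenizer`, which lowercases the text and splits it on whitespace, and the `RegexTokenizer`, which splits on a regular expression pattern.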

In [22]:
# Standard tokenizer: lowercases each sentence and splits it on whitespace
tokenizer = Tokenizer(inputCol='sentence', outputCol='words')

In [23]:
# Regular expression Tokenizer
regex_tokenizer = RegexTokenizer(inputCol='sentence', outputCol='words', pattern='\\W')

## Create a function to count the total words in the row

In [24]:
# Create a function to count words
count_tokens = udf(lambda words: len(words), IntegerType())

## Show tokenized words
We'll add a column with the number of tokens in each row, using the function we just created.

In [25]:
# Tokenize with our tokenizer objects
tokenized = tokenizer.transform(sentenceDataFrame_1)
regex_tokenized = regex_tokenizer.transform(sentenceDataFrame_1)

In [26]:
# Show tokenized words
tokenized.withColumn('tokens', count_tokens(col('words'))).show()

regex_tokenized.withColumn('tokens', count_tokens(col('words'))).show()

+---+--------------------+--------------------+------+
| id|            sentence|               words|tokens|
+---+--------------------+--------------------+------+
|  0|Hi I heard about ...|[hi, i, heard, ab...|     5|
|  1|I wish java could...|[i, wish, java, c...|     7|
|  2|Logistic,regressi...|[logistic,regress...|     1|
+---+--------------------+--------------------+------+

+---+--------------------+--------------------+------+
| id|            sentence|               words|tokens|
+---+--------------------+--------------------+------+
|  0|Hi I heard about ...|[hi, i, heard, ab...|     5|
|  1|I wish java could...|[i, wish, java, c...|     7|
|  2|Logistic,regressi...|[logistic, regres...|     5|
+---+--------------------+--------------------+------+
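
The different token counts for row 2 above can be reproduced in plain Python. This is a sketch of the idea only, not Spark's implementation: `str.split` and `re.split` stand in for `Tokenizer` and `RegexTokenizer`.

```python
import re

sentence = 'Logistic,regression,models,are,neat'

# Whitespace split after lowercasing: the sentence has no spaces, so it stays one token
whitespace_tokens = sentence.lower().split()

# Split on non-word characters (pattern \W) and drop any empty strings
regex_tokens = [t for t in re.split(r'\W', sentence.lower()) if t]

print(len(whitespace_tokens), whitespace_tokens)
print(len(regex_tokens), regex_tokens)
```

Because the commas are non-word characters, the regex split yields five tokens while the whitespace split yields one.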



## Remove words that don't add value (stop words)

In [27]:
# Create a sentence DataFrame
sentenceDataFrame_2 = spark.createDataFrame([
    (0, ['I', 'saw', 'the', 'green', 'horse']),
    (1, ['Mary', 'had', 'a', 'little', 'lamb']),
], ['id', 'tokens'])

In [28]:
# Create remover object
remover = StopWordsRemover(inputCol='tokens', outputCol='filtered')

In [29]:
# Remove words and show
remover.transform(sentenceDataFrame_2).show()

+---+--------------------+--------------------+
| id|              tokens|            filtered|
+---+--------------------+--------------------+
|  0|[I, saw, the, gre...| [saw, green, horse]|
|  1|[Mary, had, a, li...|[Mary, little, lamb]|
+---+--------------------+--------------------+
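
The filtering shown above can be sketched in plain Python. The stop word set here is a tiny hand-picked assumption (Spark's default English list is much longer), and `remove_stop_words` is a hypothetical helper. The lowercase comparison mirrors the fact that `StopWordsRemover` is case-insensitive by default, which is why `I` is removed.

```python
# Hypothetical, hand-picked stop word set for illustration only
STOP_WORDS = {'i', 'the', 'had', 'a'}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Compare in lowercase so 'I' matches 'i', but keep the original casing of kept tokens
    return [t for t in tokens if t.lower() not in stop_words]

filtered_0 = remove_stop_words(['I', 'saw', 'the', 'green', 'horse'])
filtered_1 = remove_stop_words(['Mary', 'had', 'a', 'little', 'lamb'])
print(filtered_0)
print(filtered_1)
```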



## N-Gram
An n-gram is a sequence of n consecutive tokens, where n is an integer chosen by the user.
With n=2, each output string is a pair of adjacent words.

In [30]:
# Create a dataframe
wordDataFrame_1 = spark.createDataFrame([
    (0, ['Hi', 'I', 'heard', 'about', 'Spark']),
    (1, ['I', 'wish', 'java', 'could', 'use', 'case', 'classes']),
    (2, ['Logistic', 'regression', 'models', 'are', 'neat']),
], ['id', 'words'])

In [31]:
# Create ngram object
ngram = NGram(n=2, inputCol='words', outputCol='grams')

In [32]:
ngram.transform(wordDataFrame_1).select('grams').show(truncate=False)

+------------------------------------------------------------------+
|grams                                                             |
+------------------------------------------------------------------+
|[Hi I, I heard, heard about, about Spark]                         |
|[I wish, wish java, java could, could use, use case, case classes]|
|[Logistic regression, regression models, models are, are neat]    |
+------------------------------------------------------------------+



## Manipulate words with Term Frequency - Inverse Document Frequency (TF-IDF)

### Tokenize words

In [37]:
# Recall our first sentence DataFrame
sentenceDataFrame_1.show()

+---+--------------------+
| id|            sentence|
+---+--------------------+
|  0|Hi I heard about ...|
|  1|I wish java could...|
|  2|Logistic,regressi...|
+---+--------------------+



In [46]:
# Let's create a tokenizer object
tokenizer = Tokenizer(inputCol='sentence', outputCol='words')

In [47]:
# Tokenize sentences
words_data = tokenizer.transform(sentenceDataFrame_1)

In [59]:
# Show our tokenized sentences
words_data.show(truncate=False)

+---+-----------------------------------+------------------------------------------+
|id |sentence                           |words                                     |
+---+-----------------------------------+------------------------------------------+
|0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |
|1  |I wish java could use case classes |[i, wish, java, could, use, case, classes]|
|2  |Logistic,regression,models,are,neat|[logistic,regression,models,are,neat]     |
+---+-----------------------------------+------------------------------------------+



### Grab the Term Frequency (TF)

In [60]:
# Apply the first part of the TF-IDF equation
hashing_tf = HashingTF(inputCol='words', outputCol='rawFeatures')
featurized_data = hashing_tf.transform(words_data)

In [61]:
# Show the featurized data
featurized_data.show()

+---+--------------------+--------------------+--------------------+
| id|            sentence|               words|         rawFeatures|
+---+--------------------+--------------------+--------------------+
|  0|Hi I heard about ...|[hi, i, heard, ab...|(262144,[18700,19...|
|  1|I wish java could...|[i, wish, java, c...|(262144,[19036,20...|
|  2|Logistic,regressi...|[logistic,regress...|(262144,[11534],[...|
+---+--------------------+--------------------+--------------------+
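
In the `rawFeatures` column above, 262144 (2^18) is the vector size and the bracketed numbers are the indices each word was hashed to. The hashing trick behind this can be sketched in plain Python. Note the assumptions: `hashlib.md5` is used here as a stand-in hash, so the indices will not match Spark's, and `bucket` and `hashing_tf` are hypothetical helper names.

```python
import hashlib
from collections import Counter

NUM_FEATURES = 2 ** 18  # 262144, the vector size seen in the output above

def bucket(term, num_features=NUM_FEATURES):
    # Map a term to a fixed index; md5 is a stand-in, not the hash Spark uses
    digest = hashlib.md5(term.encode('utf-8')).digest()
    return int.from_bytes(digest[:8], 'big') % num_features

def hashing_tf(words, num_features=NUM_FEATURES):
    # Sparse term-frequency vector as {index: count}, sorted by index
    counts = Counter(bucket(w, num_features) for w in words)
    return dict(sorted(counts.items()))

sparse_vector = hashing_tf(['hi', 'i', 'heard', 'about', 'spark'])
print(sparse_vector)
```

No vocabulary has to be stored, at the cost that two different words can collide in the same index.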



### Apply the Inverse Document Frequency (IDF)

Now we'll create the model

In [62]:
# Apply the second part of the TF-IDF equation
idf = IDF(inputCol='rawFeatures', outputCol='features')

### Create the TF-IDF model

In [63]:
# Fit the model
idf_model = idf.fit(featurized_data)

In [64]:
# Rescale the raw term frequencies with the fitted IDF weights
rescaled_data = idf_model.transform(featurized_data)

In [65]:
# Show the data
rescaled_data.show()

+---+--------------------+--------------------+--------------------+--------------------+
| id|            sentence|               words|         rawFeatures|            features|
+---+--------------------+--------------------+--------------------+--------------------+
|  0|Hi I heard about ...|[hi, i, heard, ab...|(262144,[18700,19...|(262144,[18700,19...|
|  1|I wish java could...|[i, wish, java, c...|(262144,[19036,20...|(262144,[19036,20...|
|  2|Logistic,regressi...|[logistic,regress...|(262144,[11534],[...|(262144,[11534],[...|
+---+--------------------+--------------------+--------------------+--------------------+



In [67]:
# Select the important columns
rescaled_data.select('id', 'features').show(truncate=False)

+---+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |features                                                                                                                                                                                      |
+---+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0  |(262144,[18700,19036,33808,66273,173558],[0.6931471805599453,0.28768207245178085,0.6931471805599453,0.6931471805599453,0.6931471805599453])                                                   |
|1  |(262144,[19036,20719,55551,58672,98717,109547,192310],[0.28768207245178085,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453])|
|2  |(262144,[1
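
The weights in the output above do not come from log(N / df) directly. Spark's `IDF` uses a smoothed variant, log((N + 1) / (df + 1)), so that a term appearing in every document does not cause problems. The sketch below checks this against the values shown, with N = 3 documents and a term frequency of 1 for every word; `spark_style_idf` is a hypothetical helper name.

```python
import math

def spark_style_idf(num_docs, doc_freq):
    # Smoothed inverse document frequency: log((N + 1) / (df + 1))
    return math.log((num_docs + 1) / (doc_freq + 1))

N = 3
idf_one_doc = spark_style_idf(N, 1)   # a word that appears in one sentence
idf_two_docs = spark_style_idf(N, 2)  # a word that appears in two sentences, like 'i'
print(idf_one_doc, idf_two_docs)
```

The results are roughly 0.6931 and 0.2877, which agree with the feature values for index 19036 (shared by rows 0 and 1) and for the remaining indices.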

## Work with count vectorizer

In [74]:
# Create new DataFrame
df = spark.createDataFrame([
    (0, 'a b c'.split(' ')),
    (1, 'a b b c a'.split(' ')),
], ['id', 'words'])

In [75]:
# Show data
df.show()

+---+---------------+
| id|          words|
+---+---------------+
|  0|      [a, b, c]|
|  1|[a, b, b, c, a]|
+---+---------------+



In [76]:
# Create a count vectorizer object
cv = CountVectorizer(inputCol='words', outputCol='features',
                    vocabSize=3, minDF=2.0)

In [78]:
# Fit with our data
model = cv.fit(df)

In [79]:
# Transform with our data and get results
result = model.transform(df)

In [80]:
# Show data
result.show(truncate=False)

+---+---------------+-------------------------+
|id |words          |features                 |
+---+---------------+-------------------------+
|0  |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
|1  |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+---+---------------+-------------------------+
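
What the count vectorizer does can be sketched in plain Python: build a vocabulary from terms that appear in at least `min_df` documents, keep the `vocab_size` most frequent ones, then count each term per document. The helper names and the alphabetical tie-break are assumptions of this sketch, not Spark's behavior.

```python
from collections import Counter

def fit_vocabulary(docs, vocab_size, min_df):
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Corpus frequency: total occurrences of each term
    term_count = Counter(term for doc in docs for term in doc)
    candidates = [t for t in term_count if doc_freq[t] >= min_df]
    # Most frequent terms first; the alphabetical tie-break is an assumption
    candidates.sort(key=lambda t: (-term_count[t], t))
    return candidates[:vocab_size]

def count_vector(doc, vocabulary):
    # Dense vector of term counts, ordered by vocabulary index
    counts = Counter(doc)
    return [counts[t] for t in vocabulary]

docs = ['a b c'.split(' '), 'a b b c a'.split(' ')]
vocabulary = fit_vocabulary(docs, vocab_size=3, min_df=2)
vectors = [count_vector(doc, vocabulary) for doc in docs]
print(vocabulary)
print(vectors)
```

Unlike the hashing approach, this keeps an explicit vocabulary, so each vector index can be mapped back to a word.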

